Biostat M280 Homework 3

Due Friday, May 25 @ 11:59PM

Q1 - Big $n$ regression

Those who took my 203B: Introduction to Data Science last quarter had a (painful) experience of wrangling an Apache Spark cluster to do linear regression on a dataset with more than 100 million observations. Now we learnt various methods for solving linear regression and should realize that, with right choice of algorithm, it is a problem that can be handled by any moderate computer.

Q1(1)

Download the flight data from http://stat-computing.org/dataexpo/2009/the-data.html. For this exercise, we only need data from years 2003-2008. If you are using Mac or Linux, you can run the following Bash script, which downloads and unzips files for all years.

# Download flight data by year
for i in {1987..2008}
  do
    echo "$(date) $i Download"
    fnam=$i.csv.bz2
    wget -O ./$fnam http://stat-computing.org/dataexpo/2009/$fnam
    echo "$(date) $i unzip"
    bzip2 -d ./$fnam
  done

# Download airline carrier data
wget -O ./airlines.csv http://www.transtats.bts.gov/Download_Lookup.asp?Lookup=L_UNIQUE_CARRIERS

# Download airports data
wget -O ./airports.csv https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat

Find out how many data points in each year.

Q1(2)

We are interested in how the time gain of a flight, defined as DepDelay - ArrDelay, depends on the distance traveled (Distance), departure delay (DepDelay), and carrier (UniqueCarrier).

We want to fit a linear regression Gain ~ 1 + Distance + DepDelay + UniqueCarrier using data from 2003-2008. Note UniqueCarrier is a factor with 23 levels: "9E", "AA", "AQ", "AS", "B6", "CO", "DH", "DL", "EV", "F9", "FL", "HA", "HP", "MQ", "NW", "OH", "OO", "TZ", "UA", "US", "WN", "XE", and "YV". We use the dummy coding with "9E" as base level.

Will the design matrix (in double precision) fit into the memory of you computer?

Q1(3)

Review the Summary of Linear Regression and devise a strategy to solve the linear regression.

Report the estimated regression coefficients $\widehat \beta$, estimated variance $\widehat \sigma^2 = \sum_i (y_i - \widehat y_i)^2 / (n - 1)$, and coefficient standard errors.

Hint: It took my laptop less than 3 minutes to import data and fit linear regression.

Q1(4)

Go to your resume/cv and claim you have experience performing analytics on data with hundred millions of observations.

Sample code

Following code explores the data in 2003 and generates the design matrix and responses for that year. Feel free to use the code in your solution.

In [1]:
;head 2003.csv
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2003,1,29,3,1651,1655,1912,1913,UA,1017,N202UA,141,138,119,-1,-4,ORD,MSY,837,5,17,0,NA,0,NA,NA,NA,NA,NA
2003,1,30,4,1654,1655,1910,1913,UA,1017,N311UA,136,138,108,-3,-1,ORD,MSY,837,2,26,0,NA,0,NA,NA,NA,NA,NA
2003,1,31,5,1724,1655,1936,1913,UA,1017,N317UA,132,138,110,23,29,ORD,MSY,837,5,17,0,NA,0,NA,NA,NA,NA,NA
2003,1,1,3,1033,1035,1625,1634,UA,1018,N409UA,232,239,215,-9,-2,OAK,ORD,1835,6,11,0,NA,0,NA,NA,NA,NA,NA
2003,1,2,4,1053,1035,1726,1634,UA,1018,N496UA,273,239,214,52,18,OAK,ORD,1835,13,46,0,NA,0,NA,NA,NA,NA,NA
2003,1,3,5,1031,1035,1640,1634,UA,1018,N412UA,249,239,223,6,-4,OAK,ORD,1835,13,13,0,NA,0,NA,NA,NA,NA,NA
2003,1,4,6,1031,1035,1626,1634,UA,1018,N455UA,235,239,219,-8,-4,OAK,ORD,1835,5,11,0,NA,0,NA,NA,NA,NA,NA
2003,1,5,7,1035,1035,1636,1634,UA,1018,N828UA,241,239,227,2,0,OAK,ORD,1835,5,9,0,NA,0,NA,NA,NA,NA,NA
2003,1,6,1,1031,1035,1653,1634,UA,1018,N453UA,262,239,241,19,-4,OAK,ORD,1835,7,14,0,NA,0,NA,NA,NA,NA,NA
In [2]:
# how many data points
countlines("2003.csv")
Out[2]:
6488541
In [3]:
# import data from csv
using JuliaDB
# only need columns: DepDelay, ArrDelay, UniqueCarrier, Distance
@time yrtable = loadtable(
    "2003.csv", 
    datacols = ["DepDelay", "ArrDelay", "UniqueCarrier", "Distance"])
 47.277701 seconds (153.71 M allocations: 8.170 GiB, 27.08% gc time)
Out[3]:
Table with 6488540 rows, 4 columns:
DepDelay  ArrDelay  UniqueCarrier  Distance
───────────────────────────────────────────
-4        -1        "UA"           837
-1        -3        "UA"           837
29        23        "UA"           837
-2        -9        "UA"           1835
18        52        "UA"           1835
-4        6         "UA"           1835
-4        -8        "UA"           1835
0         2         "UA"           1835
-4        19        "UA"           1835
3         4         "UA"           413
-4        -23       "UA"           413
-3        -19       "UA"           413
⋮
#NA       #NA       "DL"           1891
29        62        "DL"           581
39        66        "DL"           1891
26        27        "DL"           1678
114       134       "DL"           946
44        53        "DL"           813
16        47        "DL"           432
50        54        "DL"           432
-3        -5        "DL"           453
3         3         "DL"           689
-1        -1        "DL"           1929
In [4]:
# drop rows with missing values
yrtable = dropna(yrtable)
Out[4]:
Table with 6375689 rows, 4 columns:
DepDelay  ArrDelay  UniqueCarrier  Distance
───────────────────────────────────────────
-4        -1        "UA"           837
-1        -3        "UA"           837
29        23        "UA"           837
-2        -9        "UA"           1835
18        52        "UA"           1835
-4        6         "UA"           1835
-4        -8        "UA"           1835
0         2         "UA"           1835
-4        19        "UA"           1835
3         4         "UA"           413
-4        -23       "UA"           413
-3        -19       "UA"           413
⋮
70        66        "DL"           1891
29        62        "DL"           581
39        66        "DL"           1891
26        27        "DL"           1678
114       134       "DL"           946
44        53        "DL"           813
16        47        "DL"           432
50        54        "DL"           432
-3        -5        "DL"           453
3         3         "DL"           689
-1        -1        "DL"           1929
In [5]:
# mapping from variable names to X columns
# carrier "9E" is used as base level
const var2col = Dict(
        "Intercept" => 1,
        "Distance" => 2,
        "DepDelay" => 3,
        "AA" => 4,
        "AQ" => 5,
        "AS" => 6,
        "B6" => 7,
        "CO" => 8,
        "DH" => 9,
        "DL" => 10,
        "EV" => 11,
        "F9" => 12,
        "FL" => 13,
        "HA" => 14,
        "HP" => 15,
        "MQ" => 16,
        "NW" => 17,
        "OH" => 18,
        "OO" => 19,
        "TZ" => 20,
        "UA" => 21,
        "US" => 22,
        "WN" => 23,
        "XE" => 24,
        "YV" => 25,
        "Gain" => 26)
# mapping from column to variable names
const col2var = map(reverse, var2col)

# a custom function to generate [X y] from data table
function generate_xy(tbl::NextTable)
    # X matrix
    XY = zeros(length(tbl), 26)
    # intercept term
    @views fill!(XY[:, 1], 1)
    # Distance term
    @views copy!(XY[:, 2], columns(tbl, :Distance))
    # DepDelay term
    @views copy!(XY[:, 3], columns(tbl, :DepDelay))
    # Dummy coding for airline
    @inbounds for i in 1:length(tbl)
        yrtable[i][:UniqueCarrier] == "9E" && continue # base level
        XY[i, var2col[tbl[i][:UniqueCarrier]]] = 1
    end
    # last column is response: gain = depdelay - arrdelay
    XY[:, 26] = select(tbl, 
        (:DepDelay, :ArrDelay) => p -> Float64(p.DepDelay - p.ArrDelay))
    # return
    XY
end
Out[5]:
generate_xy (generic function with 1 method)
In [6]:
@time xy = generate_xy(yrtable)
  2.518327 seconds (19.38 M allocations: 1.961 GiB, 17.96% gc time)
Out[6]:
6375689×26 Array{Float64,2}:
 1.0   837.0   -4.0  0.0  0.0  0.0  …  0.0  1.0  0.0  0.0  0.0  0.0   -3.0
 1.0   837.0   -1.0  0.0  0.0  0.0     0.0  1.0  0.0  0.0  0.0  0.0    2.0
 1.0   837.0   29.0  0.0  0.0  0.0     0.0  1.0  0.0  0.0  0.0  0.0    6.0
 1.0  1835.0   -2.0  0.0  0.0  0.0     0.0  1.0  0.0  0.0  0.0  0.0    7.0
 1.0  1835.0   18.0  0.0  0.0  0.0     0.0  1.0  0.0  0.0  0.0  0.0  -34.0
 1.0  1835.0   -4.0  0.0  0.0  0.0  …  0.0  1.0  0.0  0.0  0.0  0.0  -10.0
 1.0  1835.0   -4.0  0.0  0.0  0.0     0.0  1.0  0.0  0.0  0.0  0.0    4.0
 1.0  1835.0    0.0  0.0  0.0  0.0     0.0  1.0  0.0  0.0  0.0  0.0   -2.0
 1.0  1835.0   -4.0  0.0  0.0  0.0     0.0  1.0  0.0  0.0  0.0  0.0  -23.0
 1.0   413.0    3.0  0.0  0.0  0.0     0.0  1.0  0.0  0.0  0.0  0.0   -1.0
 1.0   413.0   -4.0  0.0  0.0  0.0  …  0.0  1.0  0.0  0.0  0.0  0.0   19.0
 1.0   413.0   -3.0  0.0  0.0  0.0     0.0  1.0  0.0  0.0  0.0  0.0   16.0
 1.0   413.0    0.0  0.0  0.0  0.0     0.0  1.0  0.0  0.0  0.0  0.0   12.0
 ⋮                             ⋮    ⋱       ⋮                          ⋮  
 1.0   406.0  104.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0   -4.0
 1.0  1891.0   70.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0    4.0
 1.0   581.0   29.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  -33.0
 1.0  1891.0   39.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  -27.0
 1.0  1678.0   26.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0   -1.0
 1.0   946.0  114.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  -20.0
 1.0   813.0   44.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0   -9.0
 1.0   432.0   16.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  -31.0
 1.0   432.0   50.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0   -4.0
 1.0   453.0   -3.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0    2.0
 1.0   689.0    3.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0    0.0
 1.0  1929.0   -1.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0    0.0

Q2 - Google PageRank

We are going to try different numerical methods learnt in class on the Google PageRank problem.

Q2(1)

Let $\mathbf{A} \in \{0,1\}^{n \times n}$ be the connectivity matrix of $n$ web pages with entries $$ \begin{eqnarray*} a_{ij}= \begin{cases} 1 & \text{if page $i$ links to page $j$} \\ 0 & \text{otherwise} \end{cases}. \end{eqnarray*} $$ $r_i = \sum_j a_{ij}$ is the out-degree of page $i$. That is $r_i$ is the number of links on page $i$. Imagine a random surfer exploring the space of $n$ pages according to the following rules.

  • From a page $i$ with $r_i>0$
    • with probability $p$, (s)he randomly chooses a link on page $i$ (uniformly) and follows that link to the next page
    • with probability $1-p$, (s)he randomly chooses one page from the set of all $n$ pages (uniformly) and proceeds to that page
  • From a page $i$ with $r_i=0$ (a dangling page), (s)he randomly chooses one page from the set of all $n$ pages (uniformly) and proceeds to that page

The process defines a Markov chain on the space of $n$ pages. Write the transition matrix $\mathbf{P}$ of the Markov chain as a sparse matrix plus rank 1 matrix.

Q2(2)

According to standard Markov chain theory, the (random) position of the surfer converges to the stationary distribution $\mathbf{x} = (x_1,\ldots,x_n)^T$ of the Markov chain. $x_i$ has the natural interpretation of the proportion of times the surfer visits page $i$ in the long run. Therefore $\mathbf{x}$ serves as page ranks: a higher $x_i$ means page $i$ is more visited. It is well-known that $\mathbf{x}$ is the left eigenvector corresponding to the top eigenvalue 1 of the transition matrix $\mathbf{P}$. That is $\mathbf{P}^T \mathbf{x} = \mathbf{x}$. Therefore $\mathbf{x}$ can be solved as an eigen-problem. Show that it can also be cast as solving a linear system. Since the row sums of $\mathbf{P}$ are 1, $\mathbf{P}$ is rank deficient. We can replace the first equation by the $\sum_{i=1}^n x_i = 1$.

Q2(3)

Download the ucla.zip package from course webpage. Unzip the package, which contains two files U.txt and A.txt. U.txt lists the 500 URL names. A.txt is the $500 \times 500$ connectivity matrix. Read data into Julia. Compute summary statistics:

  • number of pages
  • number of edges
  • number of dangling nodes (pages with no out links)
  • which page has max in-degree?
  • which page has max out-degree?
  • visualize the sparsity pattern of $\mathbf{A}$

Q2(4)

Set the teleportation parameter at $p = 0.85$. Try the following methods to solve for $\mathbf{x}$ using the ucla.zip data.

  1. A dense linear system solver such as LU decomposition.
  2. A simple iterative linear system solver such as Jacobi or Gauss-Seidel.
  3. A dense eigen-solver.
  4. A simple iterative eigen-solver such as the power method.

For iterative methods, you can use the IterativeSolvers.jl package. Make sure to utilize the special structure of $\mathbf{P}$ (sparse + rank 1) to speed up the matrix-vector multiplication.

Q2(5)

List the top 20 ranked URLs you found.

Q2(6)

As of Monday May 11 2018, there are at least 1.83 billion indexed webpages on internet according to http://www.worldwidewebsize.com/. Explain whether each of these methods works for the PageRank problem at this scale.