Biostat M280 Homework 3

Due May 19 @ 11:59PM

In this assignment, we are going to try different numerical methods learnt in class on the Google PageRank problem.

Q1

Let $\mathbf{A} \in \{0,1\}^{n \times n}$ be the connectivity matrix of $n$ web pages with entries $$ \begin{eqnarray*} a_{ij}= \begin{cases} 1 & \text{if page $i$ links to page $j$} \\ 0 & \text{otherwise} \end{cases}. \end{eqnarray*} $$ $r_i = \sum_j a_{ij}$ is the out-degree of page $i$. That is $r_i$ is the number of links on page $i$. Imagine a random surfer exploring the space of $n$ pages according to the following rules.

  • From a page $i$ with $r_i>0$
    • with probability $p$, (s)he randomly chooses a link on page $i$ (uniformly) and follows that link to the next page
    • with probability $1-p$, (s)he randomly chooses one page from the set of all $n$ pages (uniformly) and proceeds to that page
  • From a page $i$ with $r_i=0$ (a dangling page), (s)he randomly chooses one page from the set of all $n$ pages (uniformly) and proceeds to that page

The process defines a Markov chain on the space of $n$ pages. Write down the transition matrix $\mathbf{P}$ of the Markov chain in matrix/vector notation.

Q2

According to standard Markov chain theory, the (random) position of the surfer converges to the stationary distribution $\mathbf{x} = (x_1,\ldots,x_n)^T$ of the Markov chain. $x_i$ has the natural interpretation of the proportion of times the surfer visits page $i$ in the long run. Therefore $\mathbf{x}$ serves as page ranks: a higher $x_i$ means page $i$ is more visited. It is well-known that $\mathbf{x}$ is the left eigenvector corresponding to the top eigenvalue 1 of the transition matrix $\mathbf{P}$. That is $\mathbf{P}^T \mathbf{x} = \mathbf{x}$. Therefore $\mathbf{x}$ can be solved as an eigen-problem. Show that it can also be cast as solving a linear system. Remember $\mathbf{x}$ is a distribution so we normalize it to have $\sum_{i=1}^n x_i = 1$.

Q3

Download the ucla.zip package from course webpage. Unzip the package, which contains two files U.txt and A.txt. U.txt lists the 500 URL names. A.txt is the $500 \times 500$ connectivity matrix. Read data into Julia. Compute summary statistics:

  • number of pages
  • number of edges
  • number of dangling nodes (pages with no out links)
  • which page has max in-degree?
  • which page has max out-degree?
  • visualize the sparsity pattern of $\mathbf{A}$

Q4

Set the teleportation parameter at $p = 0.85$. Try the following methods to solve for $\mathbf{x}$ using the ucla.zip data.

  1. A dense linear system solver such as LU decomposition.
  2. A simple iterative linear system solver such as Jacobi or Gauss-Seidel.
  3. A dense eigen-solver.
  4. A simple iterative eigen-solver such as the power method.

Q5

List the top 20 ranked URLs you found.

Q6

As of Monday May 2 2017, there are at least 4.49 billion indexed webpages on internet according to http://www.worldwidewebsize.com/. Explain whether each of these methods works for the PageRank problem at this scale.