Due May 19 @ 11:59PM
In this assignment, we are going to try different numerical methods learnt in class on the Google PageRank problem.
Let $\mathbf{A} \in \{0,1\}^{n \times n}$ be the connectivity matrix of $n$ web pages with entries $$ \begin{eqnarray*} a_{ij}= \begin{cases} 1 & \text{if page $i$ links to page $j$} \\ 0 & \text{otherwise} \end{cases}. \end{eqnarray*} $$ $r_i = \sum_j a_{ij}$ is the out-degree of page $i$. That is $r_i$ is the number of links on page $i$. Imagine a random surfer exploring the space of $n$ pages according to the following rules.
The process defines a Markov chain on the space of $n$ pages. Write down the transition matrix $\mathbf{P}$ of the Markov chain in matrix/vector notation.
According to standard Markov chain theory, the (random) position of the surfer converges to the stationary distribution $\mathbf{x} = (x_1,\ldots,x_n)^T$ of the Markov chain. $x_i$ has the natural interpretation of the proportion of times the surfer visits page $i$ in the long run. Therefore $\mathbf{x}$ serves as page ranks: a higher $x_i$ means page $i$ is more visited. It is well-known that $\mathbf{x}$ is the left eigenvector corresponding to the top eigenvalue 1 of the transition matrix $\mathbf{P}$. That is $\mathbf{P}^T \mathbf{x} = \mathbf{x}$. Therefore $\mathbf{x}$ can be solved as an eigen-problem. Show that it can also be cast as solving a linear system. Remember $\mathbf{x}$ is a distribution so we normalize it to have $\sum_{i=1}^n x_i = 1$.
Download the ucla.zip
package from course webpage. Unzip the package, which contains two files U.txt
and A.txt
. U.txt
lists the 500 URL names. A.txt
is the $500 \times 500$ connectivity matrix. Read data into Julia. Compute summary statistics:
Set the teleportation parameter at $p = 0.85$. Try the following methods to solve for $\mathbf{x}$ using the ucla.zip
data.
List the top 20 ranked URLs you found.
As of Monday May 2 2017, there are at least 4.49 billion indexed webpages on internet according to http://www.worldwidewebsize.com/. Explain whether each of these methods works for the PageRank problem at this scale.