• I didn’t find any dangling pages in this data set.
    Some of you found that there are no dangling pages (pages with no outgoing links) in this data set. This is caused by an error I made when preparing the input files: the link data in the A.txt file is actually the transpose of the A matrix I described in HW4. After transposing it, I find 82 dangling pages.
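
    To see why a transposed file can hide dangling pages, here is a minimal Python sketch on a made-up five-link example (not the HW data). It assumes each pair (source, target) means "source links to target"; the names file_links, dangling_pages, and transpose are hypothetical, chosen only for illustration.

```python
# Hypothetical toy data, NOT the HW data set: (source, target) pairs as they
# might appear in a file that was accidentally stored transposed.
file_links = [(1, 0), (2, 0), (2, 1), (3, 3), (0, 3)]
n_pages = 4


def dangling_pages(edges, n):
    """Return the pages with no outgoing links (a self-link counts as a link)."""
    has_outlink = [False] * n
    for src, _dst in edges:
        has_outlink[src] = True
    return [page for page in range(n) if not has_outlink[page]]


def transpose(edges):
    """Swap source and target in every pair, i.e. transpose the link matrix."""
    return [(dst, src) for src, dst in edges]


print(dangling_pages(file_links, n_pages))             # as stored: []
print(dangling_pages(transpose(file_links), n_pages))  # after transposing: [2]
```

    In the toy example every page appears as a source in the stored file, so nothing looks dangling; after the transpose, page 2 turns out to have no outgoing links.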

  • Do self-links count?
    Yes they do.

  • What do you mean by visualize the sparsity pattern?
    I mean something like the plot below, which is produced by the spy() function in Matlab. There should be an analog in R. Such visualization tools are extremely helpful in exploratory data analysis.
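
    If no plotting tool is at hand, the same idea can be sketched in a few lines of plain Python: mark each nonzero entry with a symbol and each zero with a dot. The function spy_text and the toy entries below are hypothetical illustrations, not part of the assignment; for a real plot, matplotlib also provides a spy function.

```python
def spy_text(entries, n_rows, n_cols):
    """Render a sparsity pattern as text: '*' for a nonzero, '.' for a zero."""
    nonzero = set(entries)
    lines = []
    for i in range(n_rows):
        lines.append(
            "".join("*" if (i, j) in nonzero else "." for j in range(n_cols))
        )
    return "\n".join(lines)


# Hypothetical (row, column) positions of the nonzeros in a 4-by-4 matrix.
toy_entries = [(0, 1), (0, 3), (1, 2), (2, 0), (3, 3)]
print(spy_text(toy_entries, 4, 4))
```

    A text rendering like this only works for small matrices; for the full link matrix a graphical spy plot is the practical choice.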

  • What are those URLs: http:/, http://www, http://www., http://drupal.org), …?
    I think they appear in the template of the URLs being parsed, and the web crawler I used wrongly recognized them as web addresses. As a good data analyst, you may consider cleaning the data by removing those template pages for web authoring.
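
    One possible way to flag such entries is sketched below in Python. It is only a heuristic under a stated assumption: a usable address has a host name made of at least two dot-separated labels of letters, digits, or hyphens. The names HOST_PATTERN and looks_like_real_url, and the example.org address, are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Assumed rule (a heuristic, not a full URL validator): the host must consist
# of at least two dot-separated labels made of letters, digits, or hyphens.
HOST_PATTERN = re.compile(r"[a-z0-9-]+(\.[a-z0-9-]+)+")


def looks_like_real_url(url):
    """Return True if the URL has a plausible host name under the rule above."""
    try:
        host = urlsplit(url).hostname  # lower-cased host, or None if missing
    except ValueError:
        return False
    return host is not None and HOST_PATTERN.fullmatch(host) is not None


# The first four strings come from the question; the last is a made-up example.
crawled = [
    "http:/",
    "http://www",
    "http://www.",
    "http://drupal.org)",
    "http://example.org/page",
]
kept = [url for url in crawled if looks_like_real_url(url)]
print(kept)
```

    Whether to drop the flagged pages or repair them (for example, stripping the stray parenthesis) is a judgment call that depends on how they affect the link matrix.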

  • Where can I read about the PageRank algorithm?

    • Wikipedia has a good introduction.
    • We also have an expert on Google PageRank right in our SAS building! Dr. Carl Meyer of the math department just wrote a book about it.
    • The $25,000,000,000 eigenvector by Kurt Bryan and Tanya Leise.