Web scraping

There is a wealth of data on the internet. How can we scrape and analyze it?

rvest

rvest is an R package by Hadley Wickham that makes web scraping easy.
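
As a minimal sketch of the rvest workflow used throughout this section, the block below parses a small inline HTML fragment rather than a live page, so it runs without network access. The fragment and its class names (`item`, `rank`) are made up for illustration. In the steps that follow, `webpage` is assumed to be the result of calling `read_html()` on the listing page being scraped.

```r
library(rvest)

# Hypothetical HTML fragment standing in for a downloaded page
html <- '<html><body>
  <div class="item"><span class="rank">1.</span> <a href="/a">First</a></div>
  <div class="item"><span class="rank">2.</span> <a href="/b">Second</a></div>
</body></html>'

# Parse the HTML (read_html() also accepts a URL)
page <- read_html(html)

# Select nodes with a CSS selector, then pull out their text
rank_nodes <- html_nodes(page, ".rank")
ranks <- as.integer(html_text(rank_nodes))

# Attributes such as href are read with html_attr()
links <- html_attr(html_nodes(page, ".item a"), "href")
```

The same three steps (parse, select, extract) apply to each field scraped below; only the CSS selector and the clean-up differ.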

Scraping from webpage

Rank

  • Use SelectorGadget to highlight the element we want to scrape

  • Use the CSS selector to get the rankings

    # Use CSS selectors to scrape the rankings section
    (rank_data_html <- html_nodes(webpage, '.text-primary'))
    ## {xml_nodeset (100)}
    ##  [1] <span class="lister-item-index unbold text-primary">1.</span>
    ##  [2] <span class="lister-item-index unbold text-primary">2.</span>
    ##  [3] <span class="lister-item-index unbold text-primary">3.</span>
    ##  [4] <span class="lister-item-index unbold text-primary">4.</span>
    ##  [5] <span class="lister-item-index unbold text-primary">5.</span>
    ##  [6] <span class="lister-item-index unbold text-primary">6.</span>
    ##  [7] <span class="lister-item-index unbold text-primary">7.</span>
    ##  [8] <span class="lister-item-index unbold text-primary">8.</span>
    ##  [9] <span class="lister-item-index unbold text-primary">9.</span>
    ## [10] <span class="lister-item-index unbold text-primary">10.</span>
    ## [11] <span class="lister-item-index unbold text-primary">11.</span>
    ## [12] <span class="lister-item-index unbold text-primary">12.</span>
    ## [13] <span class="lister-item-index unbold text-primary">13.</span>
    ## [14] <span class="lister-item-index unbold text-primary">14.</span>
    ## [15] <span class="lister-item-index unbold text-primary">15.</span>
    ## [16] <span class="lister-item-index unbold text-primary">16.</span>
    ## [17] <span class="lister-item-index unbold text-primary">17.</span>
    ## [18] <span class="lister-item-index unbold text-primary">18.</span>
    ## [19] <span class="lister-item-index unbold text-primary">19.</span>
    ## [20] <span class="lister-item-index unbold text-primary">20.</span>
    ## ...
    # Convert the ranking data to text
    (rank_data <- html_text(rank_data_html))
    ##   [1] "1."   "2."   "3."   "4."   "5."   "6."   "7."   "8."   "9."   "10." 
    ##  [11] "11."  "12."  "13."  "14."  "15."  "16."  "17."  "18."  "19."  "20." 
    ##  [21] "21."  "22."  "23."  "24."  "25."  "26."  "27."  "28."  "29."  "30." 
    ##  [31] "31."  "32."  "33."  "34."  "35."  "36."  "37."  "38."  "39."  "40." 
    ##  [41] "41."  "42."  "43."  "44."  "45."  "46."  "47."  "48."  "49."  "50." 
    ##  [51] "51."  "52."  "53."  "54."  "55."  "56."  "57."  "58."  "59."  "60." 
    ##  [61] "61."  "62."  "63."  "64."  "65."  "66."  "67."  "68."  "69."  "70." 
    ##  [71] "71."  "72."  "73."  "74."  "75."  "76."  "77."  "78."  "79."  "80." 
    ##  [81] "81."  "82."  "83."  "84."  "85."  "86."  "87."  "88."  "89."  "90." 
    ##  [91] "91."  "92."  "93."  "94."  "95."  "96."  "97."  "98."  "99."  "100."
    # Turn into numerical values
    (rank_data <- as.integer(rank_data))
    ##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
    ##  [18]  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34
    ##  [35]  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51
    ##  [52]  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
    ##  [69]  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85
    ##  [86]  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100

Title

  • Use SelectorGadget to find the CSS selector .lister-item-header a.

    # Using CSS selectors to scrape the title section
    (title_data_html <- html_nodes(webpage, '.lister-item-header a'))
    ## {xml_nodeset (100)}
    ##  [1] <a href="/title/tt1727824/?ref_=adv_li_tt">Bohemian Rhapsody</a>
    ##  [2] <a href="/title/tt4530422/?ref_=adv_li_tt">Overlord</a>
    ##  [3] <a href="/title/tt6966692/?ref_=adv_li_tt">Green Book</a>
    ##  [4] <a href="/title/tt1477834/?ref_=adv_li_tt">Aquaman</a>
    ##  [5] <a href="/title/tt5083738/?ref_=adv_li_tt">The Favourite</a>
    ##  [6] <a href="/title/tt1517451/?ref_=adv_li_tt">A Star Is Born</a>
    ##  [7] <a href="/title/tt3606756/?ref_=adv_li_tt">Incredibles 2</a>
    ##  [8] <a href="/title/tt1571234/?ref_=adv_li_tt">Mortal Engines</a>
    ##  [9] <a href="/title/tt4218572/?ref_=adv_li_tt">Widows</a>
    ## [10] <a href="/title/tt4532826/?ref_=adv_li_tt">Robin Hood</a>
    ## [11] <a href="/title/tt6155172/?ref_=adv_li_tt">Roma</a>
    ## [12] <a href="/title/tt6266538/?ref_=adv_li_tt">Vice</a>
    ## [13] <a href="/title/tt4633694/?ref_=adv_li_tt">Spider-Man: Into the Spi ...
    ## [14] <a href="/title/tt4154756/?ref_=adv_li_tt">Avengers: Infinity War</a>
    ## [15] <a href="/title/tt7349662/?ref_=adv_li_tt">BlacKkKlansman</a>
    ## [16] <a href="/title/tt2737304/?ref_=adv_li_tt">Bird Box</a>
    ## [17] <a href="/title/tt4595882/?ref_=adv_li_tt">Can You Ever Forgive Me? ...
    ## [18] <a href="/title/tt1034415/?ref_=adv_li_tt">Suspiria</a>
    ## [19] <a href="/title/tt5095030/?ref_=adv_li_tt">Ant-Man and the Wasp</a>
    ## [20] <a href="/title/tt8359848/?ref_=adv_li_tt">Climax</a>
    ## ...
    # Converting the title data to text
    (title_data <- html_text(title_data_html))
    ##   [1] "Bohemian Rhapsody"                             
    ##   [2] "Overlord"                                      
    ##   [3] "Green Book"                                    
    ##   [4] "Aquaman"                                       
    ##   [5] "The Favourite"                                 
    ##   [6] "A Star Is Born"                                
    ##   [7] "Incredibles 2"                                 
    ##   [8] "Mortal Engines"                                
    ##   [9] "Widows"                                        
    ##  [10] "Robin Hood"                                    
    ##  [11] "Roma"                                          
    ##  [12] "Vice"                                          
    ##  [13] "Spider-Man: Into the Spider-Verse"             
    ##  [14] "Avengers: Infinity War"                        
    ##  [15] "BlacKkKlansman"                                
    ##  [16] "Bird Box"                                      
    ##  [17] "Can You Ever Forgive Me?"                      
    ##  [18] "Suspiria"                                      
    ##  [19] "Ant-Man and the Wasp"                          
    ##  [20] "Climax"                                        
    ##  [21] "The Mule"                                      
    ##  [22] "The Man Who Killed Hitler and Then The Bigfoot"
    ##  [23] "First Man"                                     
    ##  [24] "Black Panther"                                 
    ##  [25] "Hunter Killer"                                 
    ##  [26] "The Girl in the Spider's Web"                  
    ##  [27] "Venom"                                         
    ##  [28] "Bumblebee"                                     
    ##  [29] "Bad Times at the El Royale"                    
    ##  [30] "The Ballad of Buster Scruggs"                  
    ##  [31] "Mary Queen of Scots"                           
    ##  [32] "Uncle Drew"                                    
    ##  [33] "Solo: A Star Wars Story"                       
    ##  [34] "Dragon Ball Super: Broly"                      
    ##  [35] "A Quiet Place"                                 
    ##  [36] "Fantastic Beasts: The Crimes of Grindelwald"   
    ##  [37] "Ready Player One"                              
    ##  [38] "Arctic"                                        
    ##  [39] "A Simple Favor"                                
    ##  [40] "A Private War"                                 
    ##  [41] "Deadpool 2"                                    
    ##  [42] "The Front Runner"                              
    ##  [43] "The Grinch"                                    
    ##  [44] "Boy Erased"                                    
    ##  [45] "At Eternity's Gate"                            
    ##  [46] "Todos lo saben"                                
    ##  [47] "Tag"                                           
    ##  [48] "Prospect"                                      
    ##  [49] "Mary Poppins Returns"                          
    ##  [50] "Beautiful Boy"                                 
    ##  [51] "Annihilation"                                  
    ##  [52] "Crazy Rich Asians"                             
    ##  [53] "Cold War"                                      
    ##  [54] "Mission: Impossible - Fallout"                 
    ##  [55] "If Beale Street Could Talk"                    
    ##  [56] "Hereditary"                                    
    ##  [57] "The Nutcracker and the Four Realms"            
    ##  [58] "Instant Family"                                
    ##  [59] "Halloween"                                     
    ##  [60] "Burning"                                       
    ##  [61] "Ralph Breaks the Internet"                     
    ##  [62] "Ocean's 8"                                     
    ##  [63] "The Sisters Brothers"                          
    ##  [64] "Creed II"                                      
    ##  [65] "The Predator"                                  
    ##  [66] "Hotel Transylvania 3: Summer Vacation"         
    ##  [67] "Replicas"                                      
    ##  [68] "Stan & Ollie"                                  
    ##  [69] "The Hate U Give"                               
    ##  [70] "On the Basis of Sex"                           
    ##  [71] "Destroyer"                                     
    ##  [72] "Upgrade"                                       
    ##  [73] "Red Sparrow"                                   
    ##  [74] "The Guilty"                                    
    ##  [75] "The House That Jack Built"                     
    ##  [76] "Isle of Dogs"                                  
    ##  [77] "Searching"                                     
    ##  [78] "The Old Man & the Gun"                         
    ##  [79] "Nobody's Fool"                                 
    ##  [80] "Game Night"                                    
    ##  [81] "Capharnaüm"                                    
    ##  [82] "Sicario: Day of the Soldado"                   
    ##  [83] "Dumplin'"                                      
    ##  [84] "The Meg"                                       
    ##  [85] "Black Mirror: Bandersnatch"                    
    ##  [86] "Rampage"                                       
    ##  [87] "The Nun"                                       
    ##  [88] "Untogether"                                    
    ##  [89] "To All the Boys I've Loved Before"             
    ##  [90] "Then Came You"                                 
    ##  [91] "Jurassic World: Fallen Kingdom"                
    ##  [92] "Little Italy"                                  
    ##  [93] "K.G.F: Chapter 1"                              
    ##  [94] "Eighth Grade"                                  
    ##  [95] "Love, Simon"                                   
    ##  [96] "Peranbu"                                       
    ##  [97] "High Life"                                     
    ##  [98] "Mile 22"                                       
    ##  [99] "Mowgli: Legend of the Jungle"                  
    ## [100] "Mandy"

Description

  • # Using CSS selectors to scrape the description section
    (description_data_html <- html_nodes(webpage, '.ratings-bar+ .text-muted'))
    ## {xml_nodeset (100)}
    ##  [1] <p class="text-muted">\n    The story of the legendary rock band <a ...
    ##  [2] <p class="text-muted">\n    A small group of American soldiers find ...
    ##  [3] <p class="text-muted">\n    A working-class Italian-American bounce ...
    ##  [4] <p class="text-muted">\n    Arthur Curry, the human-born heir to th ...
    ##  [5] <p class="text-muted">\n    In early 18th century England, a frail  ...
    ##  [6] <p class="text-muted">\n    A musician helps a young singer find fa ...
    ##  [7] <p class="text-muted">\n    The Incredibles hero family takes on a  ...
    ##  [8] <p class="text-muted">\n    In a post-apocalyptic world where citie ...
    ##  [9] <p class="text-muted">\n    Set in contemporary Chicago, amid a tim ...
    ## [10] <p class="text-muted">\n    A war-hardened Crusader and his Moorish ...
    ## [11] <p class="text-muted">\n    A year in the life of a middle-class fa ...
    ## [12] <p class="text-muted">\n    The story of <a href="/name/nm0155515"> ...
    ## [13] <p class="text-muted">\n    Teen Miles Morales becomes Spider-Man o ...
    ## [14] <p class="text-muted">\n    The Avengers and their allies must be w ...
    ## [15] <p class="text-muted">\n    Ron Stallworth, an African American pol ...
    ## [16] <p class="text-muted">\n    Five years after an ominous unseen pres ...
    ## [17] <p class="text-muted">\n    When Lee Israel falls out of step with  ...
    ## [18] <p class="text-muted">\n    A darkness swirls at the center of a wo ...
    ## [19] <p class="text-muted">\n    As Scott Lang balances being both a Sup ...
    ## [20] <p class="text-muted">\n    French dancers gather in a remote, empt ...
    ## ...
    # Converting the description data to text
    description_data <- html_text(description_data_html)
    # Take a look at the first few
    head(description_data)
    ## [1] "\n    The story of the legendary rock band Queen and lead singer Freddie Mercury, leading up to their famous performance at Live Aid (1985)."                                                                         
    ## [2] "\n    A small group of American soldiers find horror behind enemy lines on the eve of D-Day."                                                                                                                         
    ## [3] "\n    A working-class Italian-American bouncer becomes the driver of an African-American classical pianist on a tour of venues through the 1960s American South."                                                     
    ## [4] "\n    Arthur Curry, the human-born heir to the underwater kingdom of Atlantis, goes on a quest to prevent a war between the worlds of ocean and land."                                                                
    ## [5] "\n    In early 18th century England, a frail Queen Anne occupies the throne and her close friend, Lady Sarah, governs the country in her stead. When a new servant, Abigail, arrives, her charm endears her to Sarah."
    ## [6] "\n    A musician helps a young singer find fame, even as age and alcoholism send his own career into a downward spiral."
    # Strip the leading newline and the whitespace after it
    description_data <- str_replace(description_data, "^\\n\\s+", "")
    head(description_data)
    ## [1] "The story of the legendary rock band Queen and lead singer Freddie Mercury, leading up to their famous performance at Live Aid (1985)."                                                                         
    ## [2] "A small group of American soldiers find horror behind enemy lines on the eve of D-Day."                                                                                                                         
    ## [3] "A working-class Italian-American bouncer becomes the driver of an African-American classical pianist on a tour of venues through the 1960s American South."                                                     
    ## [4] "Arthur Curry, the human-born heir to the underwater kingdom of Atlantis, goes on a quest to prevent a war between the worlds of ocean and land."                                                                
    ## [5] "In early 18th century England, a frail Queen Anne occupies the throne and her close friend, Lady Sarah, governs the country in her stead. When a new servant, Abigail, arrives, her charm endears her to Sarah."
    ## [6] "A musician helps a young singer find fame, even as age and alcoholism send his own career into a downward spiral."
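
As an alternative to the regular expression above, `html_text()` has a `trim` argument that removes leading and trailing whitespace during extraction. The sketch below uses a hypothetical one-paragraph fragment that imitates the padded markup. It is an illustration, not the page itself.

```r
library(rvest)

# Hypothetical fragment mimicking the padded description markup
page <- read_html('<p class="text-muted">
    A musician helps a young singer find fame.   </p>')

node <- html_nodes(page, ".text-muted")

# Raw text keeps the leading newline and the padding
raw_text <- html_text(node)

# trim = TRUE strips leading and trailing whitespace in one step
clean_text <- html_text(node, trim = TRUE)

# trimws() from base R does the same on an existing character vector
same_text <- trimws(raw_text)
```

Unlike the `^\\n\\s+` pattern, both variants also drop trailing whitespace.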

Runtime

  • # Using CSS selectors to scrape the movie runtime section
    runtime_data_html <- html_nodes(webpage, '.text-muted .runtime')
    # Converting the runtime data to text
    runtime_data <- html_text(runtime_data_html)
    # Let's have a look at the runtime
    head(runtime_data)
    ## [1] "134 min" "110 min" "130 min" "143 min" "119 min" "136 min"
    # Data-Preprocessing: removing " min" and converting to numeric
    runtime_data <- str_replace(runtime_data, " min", "")
    runtime_data <- as.numeric(runtime_data)
    #Let's have another look at the runtime data
    head(runtime_data)
    ## [1] 134 110 130 143 119 136

Genre

  • # Using CSS selectors to scrape the movie genre section
    genre_data_html <- html_nodes(webpage, '.genre')
    # Converting the genre data to text
    genre_data <- html_text(genre_data_html)
    # Let's have a look at the genre data
    head(genre_data)    
    ## [1] "\nBiography, Drama, Music            "   
    ## [2] "\nAction, Adventure, Horror            " 
    ## [3] "\nBiography, Comedy, Drama            "  
    ## [4] "\nAction, Adventure, Fantasy            "
    ## [5] "\nBiography, Comedy, Drama            "  
    ## [6] "\nDrama, Music, Romance            "
    # Data-Preprocessing: retrieve the first word
    genre_data <- str_extract(genre_data, "[:alpha:]+")
    # Converting each genre from text to factor
    #genre_data <- as.factor(genre_data)
    # Let's have another look at the genre data
    head(genre_data)
    ## [1] "Biography" "Action"    "Biography" "Action"    "Biography" "Drama"
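
Keeping only the first word discards the remaining genres, and a pattern of letters only would cut a hyphenated genre such as "Sci-Fi" at the hyphen. The sketch below shows one possible way to keep every genre. It trims the padding and splits on commas, using the first three raw strings from the output above.

```r
library(stringr)

# First three raw genre strings as shown in the output above
genre_raw <- c("\nBiography, Drama, Music            ",
               "\nAction, Adventure, Horror            ",
               "\nBiography, Comedy, Drama            ")

# Trim the padding, then split on commas to keep every genre
genre_list <- str_split(str_trim(genre_raw), ",\\s*")

# The first element of each entry matches the first-word approach
first_genre <- vapply(genre_list, function(g) g[1], character(1))
```

The result is a list with one character vector per movie, which suits later steps such as counting how often each genre occurs.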

Rating

  • # Using CSS selectors to scrape the IMDB rating section
    rating_data_html <- html_nodes(webpage, '.ratings-imdb-rating strong')
    # Converting the ratings data to text
    rating_data <- html_text(rating_data_html)
    # Let's have a look at the ratings
    head(rating_data)
    ## [1] "8.2" "6.9" "8.3" "7.4" "7.8" "7.9"
    # Data-Preprocessing: converting ratings to numerical
    rating_data <- as.numeric(rating_data)
    # Let's have another look at the ratings data
    rating_data
    ##   [1] 8.2 6.9 8.3 7.4 7.8 7.9 7.8 6.2 7.1 5.3 7.9 7.2 8.7 8.5 7.5 6.7 7.3
    ##  [18] 7.0 7.1 7.4 7.2 5.6 7.4 7.4 6.6 6.1 6.8 7.1 7.2 7.3 6.5 5.7 7.0 8.3
    ##  [35] 7.6 6.8 7.5 7.3 6.9 6.7 7.8 6.3 6.3 7.0 7.0 7.0 6.6 6.3 7.2 7.3 6.9
    ##  [52] 7.0 7.7 7.8 7.6 7.3 5.5 7.6 6.7 7.7 7.2 6.2 7.0 7.6 5.4 6.3 5.4 7.6
    ##  [69] 7.1 6.5 6.7 7.6 6.6 7.6 7.0 7.9 7.7 6.8 4.4 7.0 8.3 7.1 6.8 5.7 7.4
    ##  [86] 6.1 5.4 6.0 7.3 7.0 6.2 5.7 8.6 7.5 7.7 9.8 6.7 6.1 6.5 6.6

Votes

  • # Using CSS selectors to scrape the votes section
    votes_data_html <- html_nodes(webpage, '.sort-num_votes-visible span:nth-child(2)')
    # Converting the votes data to text
    votes_data <- html_text(votes_data_html)
    # Let's have a look at the votes data
    head(votes_data)
    ## [1] "246,470" "30,977"  "62,443"  "164,708" "53,107"  "177,646"
    # Data-Preprocessing: removing all commas (str_replace would drop only the first)
    votes_data <- str_replace_all(votes_data, ",", "")
    # Data-Preprocessing: converting votes to numerical
    votes_data <- as.numeric(votes_data)
    #Let's have another look at the votes data
    votes_data
    ##   [1] 246470  30977  62443 164708  53107 177646 175775  38561  42185  24190
    ##  [11]  73780  31162  94876 584672 105254 181502  11941  23891 194191  12128
    ##  [21]  18793   1478  94459 460433  23700  18439 232108  50412  65543  70123
    ##  [31]   9858   8176 203858  11331 273750 118011 277710   1198  67893   4853
    ##  [41] 347170   3104  23306  11519   6503  11282  71377   2288  32968  21556
    ##  [51] 210985  78108  22384 214004   9922 130347  13666  10324  75263  13687
    ##  [61]  47185 126827  19964  36849  82948  36372   9681   6170  12404   4050
    ##  [71]   3681  89913 127667  17575  24854  96145  79219  18667   2278 149228
    ##  [81]   4846  83167  14791  97311  79985 105826  78279    447  53964    744
    ##  [91] 197518   4520  17953  33566  74886   8274   1246  40984  39406  37113
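
Comma removal depends on the choice between `str_replace()` and `str_replace_all()`: the former replaces only the first match in each string. The sketch below illustrates the difference. The value "1,234,567" is a hypothetical vote count added to show the two-comma case; the other two come from the output above.

```r
library(stringr)

# "1,234,567" is a hypothetical count; the others are from the output above
votes_raw <- c("246,470", "30,977", "1,234,567")

# str_replace() removes only the first comma in each string
first_only <- str_replace(votes_raw, ",", "")

# str_replace_all() removes every comma
all_commas <- str_replace_all(votes_raw, ",", "")

votes <- as.numeric(all_commas)
```

With `first_only`, the third value still contains a comma and would become `NA` under `as.numeric()`.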

Director

  • # Using CSS selectors to scrape the directors section
    (directors_data_html <- html_nodes(webpage,'.text-muted+ p a:nth-child(1)'))
    ## {xml_nodeset (100)}
    ##  [1] <a href="/name/nm0001741/?ref_=adv_li_dr_0">Bryan Singer</a>
    ##  [2] <a href="/name/nm1170339/?ref_=adv_li_dr_0">Julius Avery</a>
    ##  [3] <a href="/name/nm0268380/?ref_=adv_li_dr_0">Peter Farrelly</a>
    ##  [4] <a href="/name/nm1490123/?ref_=adv_li_dr_0">James Wan</a>
    ##  [5] <a href="/name/nm0487166/?ref_=adv_li_dr_0">Yorgos Lanthimos</a>
    ##  [6] <a href="/name/nm0177896/?ref_=adv_li_dr_0">Bradley Cooper</a>
    ##  [7] <a href="/name/nm0083348/?ref_=adv_li_dr_0">Brad Bird</a>
    ##  [8] <a href="/name/nm0729514/?ref_=adv_li_dr_0">Christian Rivers</a>
    ##  [9] <a href="/name/nm2588606/?ref_=adv_li_dr_0">Steve McQueen</a>
    ## [10] <a href="/name/nm1163264/?ref_=adv_li_dr_0">Otto Bathurst</a>
    ## [11] <a href="/name/nm0190859/?ref_=adv_li_dr_0">Alfonso Cuarón</a>
    ## [12] <a href="/name/nm0570912/?ref_=adv_li_dr_0">Adam McKay</a>
    ## [13] <a href="/name/nm2130108/?ref_=adv_li_dr_0">Bob Persichetti</a>
    ## [14] <a href="/name/nm0751577/?ref_=adv_li_dr_0">Anthony Russo</a>
    ## [15] <a href="/name/nm0000490/?ref_=adv_li_dr_0">Spike Lee</a>
    ## [16] <a href="/name/nm0081540/?ref_=adv_li_dr_0">Susanne Bier</a>
    ## [17] <a href="/name/nm1716636/?ref_=adv_li_dr_0">Marielle Heller</a>
    ## [18] <a href="/name/nm0345174/?ref_=adv_li_dr_0">Luca Guadagnino</a>
    ## [19] <a href="/name/nm0715636/?ref_=adv_li_dr_0">Peyton Reed</a>
    ## [20] <a href="/name/nm0637615/?ref_=adv_li_dr_0">Gaspar Noé</a>
    ## ...
    # Converting the directors data to text
    directors_data <- html_text(directors_data_html)
    # Let's have a look at the directors data
    head(directors_data)
    ## [1] "Bryan Singer"     "Julius Avery"     "Peter Farrelly"  
    ## [4] "James Wan"        "Yorgos Lanthimos" "Bradley Cooper"
    # Data-Preprocessing: converting directors data into factors
    (directors_data <- as.factor(directors_data))
    ##   [1] Bryan Singer          Julius Avery          Peter Farrelly       
    ##   [4] James Wan             Yorgos Lanthimos      Bradley Cooper       
    ##   [7] Brad Bird             Christian Rivers      Steve McQueen        
    ##  [10] Otto Bathurst         Alfonso Cuarón        Adam McKay           
    ##  [13] Bob Persichetti       Anthony Russo         Spike Lee            
    ##  [16] Susanne Bier          Marielle Heller       Luca Guadagnino      
    ##  [19] Peyton Reed           Gaspar Noé            Clint Eastwood       
    ##  [22] Robert D. Krzykowski  Damien Chazelle       Ryan Coogler         
    ##  [25] Donovan Marsh         Fede Alvarez          Ruben Fleischer      
    ##  [28] Travis Knight         Drew Goddard          Ethan Coen           
    ##  [31] Josie Rourke          Charles Stone III     Ron Howard           
    ##  [34] Tatsuya Nagamine      John Krasinski        David Yates          
    ##  [37] Steven Spielberg      Joe Penna             Paul Feig            
    ##  [40] Matthew Heineman      David Leitch          Jason Reitman        
    ##  [43] Yarrow Cheney         Joel Edgerton         Julian Schnabel      
    ##  [46] Asghar Farhadi        Jeff Tomsic           Christopher Caldwell 
    ##  [49] Rob Marshall          Felix van Groeningen  Alex Garland         
    ##  [52] Jon M. Chu            Pawel Pawlikowski     Christopher McQuarrie
    ##  [55] Barry Jenkins         Ari Aster             Lasse Hallström      
    ##  [58] Sean Anders           David Gordon Green    Chang-dong Lee       
    ##  [61] Phil Johnston         Gary Ross             Jacques Audiard      
    ##  [64] Steven Caple Jr.      Shane Black           Genndy Tartakovsky   
    ##  [67] Jeffrey Nachmanoff    Jon S. Baird          George Tillman Jr.   
    ##  [70] Mimi Leder            Karyn Kusama          Leigh Whannell       
    ##  [73] Francis Lawrence      Gustav Möller         Lars von Trier       
    ##  [76] Wes Anderson          Aneesh Chaganty       David Lowery         
    ##  [79] Tyler Perry           John Francis Daley    Nadine Labaki        
    ##  [82] Stefano Sollima       Anne Fletcher         Jon Turteltaub       
    ##  [85] David Slade           Brad Peyton           Corin Hardy          
    ##  [88] Emma Forrest          Susan Johnson         Peter Hutchings      
    ##  [91] J.A. Bayona           Donald Petrie         Prashanth Neel       
    ##  [94] Bo Burnham            Greg Berlanti         Ram                  
    ##  [97] Claire Denis          Peter Berg            Andy Serkis          
    ## [100] Panos Cosmatos       
    ## 100 Levels: Adam McKay Alex Garland Alfonso Cuarón ... Yorgos Lanthimos

Actor

  • # Using CSS selectors to scrape the actors section
    (actors_data_html <- html_nodes(webpage, '.lister-item-content .ghost+ a'))
    ## {xml_nodeset (100)}
    ##  [1] <a href="/name/nm1785339/?ref_=adv_li_st_0">Rami Malek</a>
    ##  [2] <a href="/name/nm5381254/?ref_=adv_li_st_0">Jovan Adepo</a>
    ##  [3] <a href="/name/nm0001557/?ref_=adv_li_st_0">Viggo Mortensen</a>
    ##  [4] <a href="/name/nm0597388/?ref_=adv_li_st_0">Jason Momoa</a>
    ##  [5] <a href="/name/nm1469236/?ref_=adv_li_st_0">Olivia Colman</a>
    ##  [6] <a href="/name/nm3078932/?ref_=adv_li_st_0">Lady Gaga</a>
    ##  [7] <a href="/name/nm0005266/?ref_=adv_li_st_0">Craig T. Nelson</a>
    ##  [8] <a href="/name/nm2623492/?ref_=adv_li_st_0">Hera Hilmar</a>
    ##  [9] <a href="/name/nm0205626/?ref_=adv_li_st_0">Viola Davis</a>
    ## [10] <a href="/name/nm5473782/?ref_=adv_li_st_0">Taron Egerton</a>
    ## [11] <a href="/name/nm8611957/?ref_=adv_li_st_0">Yalitza Aparicio</a>
    ## [12] <a href="/name/nm0000288/?ref_=adv_li_st_0">Christian Bale</a>
    ## [13] <a href="/name/nm4271336/?ref_=adv_li_st_0">Shameik Moore</a>
    ## [14] <a href="/name/nm0000375/?ref_=adv_li_st_0">Robert Downey Jr.</a>
    ## [15] <a href="/name/nm0913475/?ref_=adv_li_st_0">John David Washington</a>
    ## [16] <a href="/name/nm0000113/?ref_=adv_li_st_0">Sandra Bullock</a>
    ## [17] <a href="/name/nm0565250/?ref_=adv_li_st_0">Melissa McCarthy</a>
    ## [18] <a href="/name/nm1631269/?ref_=adv_li_st_0">Chloë Grace Moretz</a>
    ## [19] <a href="/name/nm0748620/?ref_=adv_li_st_0">Paul Rudd</a>
    ## [20] <a href="/name/nm1154749/?ref_=adv_li_st_0">Sofia Boutella</a>
    ## ...
    # Converting the actors data to text
    actors_data <- html_text(actors_data_html)
    # Let's have a look at the actors data
    head(actors_data)
    ## [1] "Rami Malek"      "Jovan Adepo"     "Viggo Mortensen" "Jason Momoa"    
    ## [5] "Olivia Colman"   "Lady Gaga"
    # Data-Preprocessing: converting actors data into factors
    (actors_data <- as.factor(actors_data))
    ##   [1] Rami Malek            Jovan Adepo           Viggo Mortensen      
    ##   [4] Jason Momoa           Olivia Colman         Lady Gaga            
    ##   [7] Craig T. Nelson       Hera Hilmar           Viola Davis          
    ##  [10] Taron Egerton         Yalitza Aparicio      Christian Bale       
    ##  [13] Shameik Moore         Robert Downey Jr.     John David Washington
    ##  [16] Sandra Bullock        Melissa McCarthy      Chloë Grace Moretz   
    ##  [19] Paul Rudd             Sofia Boutella        Bradley Cooper       
    ##  [22] Sam Elliott           Ryan Gosling          Chadwick Boseman     
    ##  [25] Gerard Butler         Claire Foy            Tom Hardy            
    ##  [28] Hailee Steinfeld      Jeff Bridges          Tim Blake Nelson     
    ##  [31] Saoirse Ronan         Kyrie Irving          Alden Ehrenreich     
    ##  [34] Masako Nozawa         Emily Blunt           Eddie Redmayne       
    ##  [37] Tye Sheridan          Mads Mikkelsen        Anna Kendrick        
    ##  [40] Rosamund Pike         Ryan Reynolds         Hugh Jackman         
    ##  [43] Benedict Cumberbatch  Lucas Hedges          Willem Dafoe         
    ##  [46] Penélope Cruz         Jeremy Renner         Sophie Thatcher      
    ##  [49] Emily Blunt           Steve Carell          Natalie Portman      
    ##  [52] Constance Wu          Joanna Kulig          Tom Cruise           
    ##  [55] KiKi Layne            Toni Collette         Mackenzie Foy        
    ##  [58] Mark Wahlberg         Jamie Lee Curtis      Ah-in Yoo            
    ##  [61] John C. Reilly        Sandra Bullock        John C. Reilly       
    ##  [64] Michael B. Jordan     Boyd Holbrook         Adam Sandler         
    ##  [67] Alice Eve             Shirley Henderson     Amandla Stenberg     
    ##  [70] Felicity Jones        Nicole Kidman         Logan Marshall-Green 
    ##  [73] Jennifer Lawrence     Jakob Cedergren       Matt Dillon          
    ##  [76] Bryan Cranston        John Cho              Robert Redford       
    ##  [79] Tiffany Haddish       Jason Bateman         Zain Al Rafeea       
    ##  [82] Benicio Del Toro      Danielle Macdonald    Jason Statham        
    ##  [85] Fionn Whitehead       Dwayne Johnson        Demián Bichir        
    ##  [88] Alice Eve             Lana Condor           Asa Butterfield      
    ##  [91] Chris Pratt           Hayden Christensen    Yash                 
    ##  [94] Elsie Fisher          Nick Robinson         Mammootty            
    ##  [97] Robert Pattinson      Mark Wahlberg         Christian Bale       
    ## [100] Nicolas Cage         
    ## 94 Levels: Adam Sandler Ah-in Yoo Alden Ehrenreich ... Zain Al Rafeea
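
The factor has fewer levels than entries because some actors lead more than one movie. A small sketch of how the repeats could be listed, using a handful of names from the output above:

```r
# A few names from the output above; Emily Blunt appears twice
actors <- as.factor(c("Emily Blunt", "Tye Sheridan", "Emily Blunt",
                      "Steve Carell"))

# nlevels() counts distinct names, length() counts entries
n_distinct <- nlevels(actors)
n_entries  <- length(actors)

# table() tallies appearances; keep names that occur more than once
counts   <- table(actors)
repeated <- names(counts[counts > 1])
```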

Metascore

  • Be careful with missing data.

    # Using CSS selectors to scrape the metascore section
    metascore_data_html <- html_nodes(webpage, '.metascore')
    # Converting the metascore data to text
    metascore_data <- html_text(metascore_data_html)
    # Let's have a look at the metascore 
    head(metascore_data)
    ## [1] "49        " "60        " "69        " "55        " "90        "
    ## [6] "88        "
    # Data-Preprocessing: removing extra space in metascore
    metascore_data <- str_replace(metascore_data, "\\s*$", "")
    metascore_data <- as.numeric(metascore_data)
    metascore_data
    ##  [1] 49 60 69 55 90 88 80 44 84 32 96 61 87 68 83 51 87 64 70 83 58 50 84
    ## [24] 88 43 43 35 66 60 79 60 57 62 59 82 52 64 71 67 75 66 61 51 69 78 67
    ## [47] 56 68 66 62 79 74 90 86 87 87 39 57 67 90 71 61 78 66 48 54 19 75 81
    ## [70] 60 62 67 53 83 42 82 71 80 39 66 75 61 53 46 45 46 45 64 43 51 28 90
    ## [93] 72 81 38 51 81
    # Let's check the length of metascore data
    length(metascore_data)
    ## [1] 97
    # Visual inspection finds that movies 85, 93, and 96 don't have a metascore
    ms <- rep(NA, 100)
    ms[-c(85, 93, 96)] <- metascore_data
    (metascore_data <- ms)
    ##   [1] 49 60 69 55 90 88 80 44 84 32 96 61 87 68 83 51 87 64 70 83 58 50 84
    ##  [24] 88 43 43 35 66 60 79 60 57 62 59 82 52 64 71 67 75 66 61 51 69 78 67
    ##  [47] 56 68 66 62 79 74 90 86 87 87 39 57 67 90 71 61 78 66 48 54 19 75 81
    ##  [70] 60 62 67 53 83 42 82 71 80 39 66 75 61 53 46 NA 45 46 45 64 43 51 28
    ##  [93] NA 90 72 NA 81 38 51 81
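
Filling in missing positions by hand works for a single snapshot, but it breaks as soon as the listing changes. One possible alternative is to select one container per movie first and then look for the field inside each container. `html_node()` (singular) returns exactly one result per input node and a missing node where nothing matches, which `html_text()` turns into `NA`. The sketch below assumes each movie sits in its own `.lister-item-content` block and uses a hypothetical three-item fragment.

```r
library(rvest)

# Hypothetical listing: the second item has no metascore
page <- read_html('<html><body>
  <div class="lister-item-content"><span class="metascore">49        </span></div>
  <div class="lister-item-content"><p>No score here</p></div>
  <div class="lister-item-content"><span class="metascore">69        </span></div>
</body></html>')

# One node per movie
items <- html_nodes(page, ".lister-item-content")

# html_node() returns exactly one result per item,
# with a missing node where the selector finds nothing
metascore_nodes <- html_node(items, ".metascore")

# Missing nodes become NA, so positions stay aligned with the movies
metascore <- as.numeric(html_text(metascore_nodes, trim = TRUE))
```

The resulting vector has one entry per movie, so no manual index bookkeeping is needed.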

Gross

  • Be careful with missing data.

    # Using CSS selectors to scrape the gross revenue section
    gross_data_html <- html_nodes(webpage,'.ghost~ .text-muted+ span')
    # Converting the gross revenue data to text
    gross_data <- html_text(gross_data_html)
    # Let's have a look at the gross data
    head(gross_data)
    ## [1] "$210.79M" "$21.70M"  "$62.10M"  "$329.00M" "$30.38M"  "$208.85M"
    # Data-Preprocessing: removing '$' and 'M' signs
    gross_data <- str_replace(gross_data, "M", "")
    gross_data <- str_sub(gross_data, 2, 10)
    #(gross_data <- str_extract(gross_data, "[:digit:]+.[:digit:]+"))
    gross_data <- as.numeric(gross_data)
    # Let's check the length of gross data
    length(gross_data)
    ## [1] 85
    # Visual inspection finds that the movies below have no gross figure
    gs_data <- rep(NA, 100)
    gs_data[-c(6, 12, 29, 40, 61, 69, 71, 74, 78, 82, 84:87, 90)] <- gross_data
    (gross_data <- gs_data)
    ##   [1] 210.79  21.70  62.10 329.00  30.38     NA 208.85 608.58  15.95  42.39
    ##  [11]  30.82     NA  45.36 180.30 678.82  48.69   8.56   2.47 216.65 102.70
    ##  [21]  44.94 700.06  15.77  14.84 213.52 125.93  17.84  16.47     NA  42.47
    ##  [31] 213.77  30.38 188.02 159.45 137.69   0.17  53.54   1.63 324.59     NA
    ##  [41]   2.00 270.60   6.79   2.29   0.08  54.55 169.91   7.65  32.73 174.53
    ##  [51]   2.90 220.16  13.83  44.07  54.86  67.36 159.34   0.70 197.62 139.38
    ##  [61]     NA   3.14 115.62  51.02 167.51   4.04   4.30  29.72     NA  23.90
    ##  [71]     NA   1.48  11.87     NA  46.87   0.21   0.09     NA  32.02  26.02
    ##  [81]  11.28     NA  31.71     NA     NA     NA     NA  69.00   0.74     NA
    ##  [91]  50.07 145.44  99.35 117.44 417.72   0.99  13.54  40.83  36.11   1.21
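
The two-step cleanup above (drop "M", then cut off the first character) can also be written as a single substitution. The following is a small sketch in base R; `parse_gross` is a hypothetical helper name, and it assumes every value has the form "$<number>M" as in the output shown earlier.

```r
# Hypothetical helper: convert strings such as "$210.79M" to millions of dollars.
# Inside a character class, "$" is a literal dollar sign, so "[$M]" matches
# either symbol wherever it appears in the string.
parse_gross <- function(x) {
  as.numeric(gsub("[$M]", "", x))
}

# Missing entries pass through as NA
parse_gross(c("$210.79M", "$21.70M", NA))
```

Because the pattern does not depend on character positions, it keeps working if a value has more digits than str_sub(gross_data, 2, 10) allows for.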

Missing entries

Visualizing movie data

Scraping image data from Google

searchTerm <- "ucla"
# tbm=isch (images), app (apps), bks (books), nws (news), pts (patents), vid (videos)
# <https://stenevang.wordpress.com/2013/02/22/google-advanced-power-search-url-request-parameters/>
(url <- paste0("https://www.google.com/search?q=", searchTerm,
              "&source=lnms&tbm=isch&sa=X&tbs=isz:m"))
## [1] "https://www.google.com/search?q=ucla&source=lnms&tbm=isch&sa=X&tbs=isz:m"
webpage <- read_html(url)
(imageurl <- webpage %>% html_nodes("img") %>% html_attr("src"))
##  [1] "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSPAHGngfC90w8TVGwxDtF7EOfHRruNfeHqVicOCZHfcFZ28aN9AGMPOrE7"  
##  [2] "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRi4lynSHbXZ4Iw8g2dqSWIHUbwYlVAnCG8JmoJk0m5TDqv7u1A4DZXIXo"   
##  [3] "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQgDqKCP2Uj7dmIqDKLCzwnO2Nxe4NKkxqi7yzULvq5yIvx9AM5Let8VSfF"  
##  [4] "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcS7PJWOnMu2-F2ZOcKYLiHF6EJM4ddYQas6W7_nEOqv4kmyo5ANyJK-wy1g"  
##  [5] "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSIwWQ2gn7Ae8lSfeTuK5QGLPv4v6O1uxD_RrVuDElsHEXOZQiPvIyEcWY"   
##  [6] "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSuyXbGus89EayJHMdhKKrUcbPlPgwGSajIZzlWqrM8MVaHZYg6_2QLSe8"   
##  [7] "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQE3k_c0rFd-J7mOLS19TZ4crq6Bx9vXI3gNGPklgoVArW3skuai1ljggub"  
##  [8] "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTI-oSi_FSFDeqOxkX1sse5D2Q-yfDKE28MMq3lOh1B9LckMTTYi-JxJIC8"  
##  [9] "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR_dm3YNqqvT8HV_c6-sYISjWDHpiEG3ivlntQAGkbub2gItlOcgFgSnrU5"  
## [10] "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQsP-A6JtaJjfe33HsF5Ng7yRa4HRalvfCBNjERoFxWhvFL_ALJ3UcYtCno"  
## [11] "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSi7wafHAE8gTQM7LWdMDi49sfwV9bp4n2-l6MdJ3pODMLa-z83zQZ4j2FY"  
## [12] "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRcuXKyr-IO6FRGq4mKUqfaZMmVUKdqOFGkC74VFxLv3atIlywjhMvLay5Z"  
## [13] "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQtayay8IIEhDm5NyjKj_1Jir0j1YKdHHcUoUJaI0MYV5t7WcK7iK0e_bfN"  
## [14] "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTrkSgBdNkT44_cRvZvyqfAk-M-Y30qxLiiGD_KkqSIBEpuoSLPkI66YQQ2"  
## [15] "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSazvWH_KTumPPOEFTfJbYNt5KeCPCCoIAqi38Rxxh25D1vBeviJKgFTi8"   
## [16] "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSEIYZP7caAsQ8_KHJBDCeMCcNK62f4a3F6tbQMYU4e8PXliQnK6mhPhNY"   
## [17] "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRwAHVp8uKCO6RN_QPvUZd87r_qGz1gB6ywhSG6OmpEWihlV8OZloNQW-EzBw"
## [18] "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQtr10_NL7Ko621GJPtfMkvky-r7PQN85p-7Y5YZ55oQsAX9GBYUEty5Q"    
## [19] "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcS9BgYpoLO-hIH9QC1GUIhhPwy0T6XQe-WiTPJ18UbWhny6SikDHcCFHtz7"  
## [20] "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcT6Vr40BZfh0xMbWCgMRLlHZgPIuZtrEsGMM4FyGu3kEcGLrauySTKCjZ2_"
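downloadImages <- function(files, brand, outPath = "images"){
  # Create the output directory if needed; download.file() fails otherwise
  dir.create(outPath, showWarnings = FALSE)
  # seq_along() is safe when files is empty (1:length(files) would give 1, 0)
  for(i in seq_along(files)){
    download.file(files[i], destfile = paste0(outPath, "/", brand, "_", i, ".jpg"), mode = 'wb')
  }
}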
downloadImages(imageurl, "ucla")
ls images/
## ucla_1.jpg
## ucla_10.jpg
## ucla_11.jpg
## ucla_12.jpg
## ucla_13.jpg
## ucla_14.jpg
## ucla_15.jpg
## ucla_16.jpg
## ucla_17.jpg
## ucla_18.jpg
## ucla_19.jpg
## ucla_2.jpg
## ucla_20.jpg
## ucla_3.jpg
## ucla_4.jpg
## ucla_5.jpg
## ucla_6.jpg
## ucla_7.jpg
## ucla_8.jpg
## ucla_9.jpg
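
Two details of the code above are worth a sketch. First, a search term with spaces or other reserved characters should be URL-encoded before it is pasted into the query string. Second, the listing sorts `ucla_10.jpg` before `ucla_2.jpg` because the index is not zero-padded. The helper names `build_image_search_url` and `image_dest_paths` below are hypothetical and introduced only for illustration; the URL parameters are copied from the code above.

```r
# Hypothetical helper: build the image-search URL for an arbitrary search term.
# URLencode(reserved = TRUE) percent-encodes spaces and reserved characters.
build_image_search_url <- function(searchTerm) {
  paste0("https://www.google.com/search?q=",
         utils::URLencode(searchTerm, reserved = TRUE),
         "&source=lnms&tbm=isch&sa=X&tbs=isz:m")
}

# Hypothetical helper: zero-padded destination paths, so that files
# sort in numeric order (ucla_01.jpg, ucla_02.jpg, ..., ucla_20.jpg).
image_dest_paths <- function(n, brand, outPath = "images") {
  file.path(outPath, sprintf("%s_%02d.jpg", brand, seq_len(n)))
}

build_image_search_url("ucla campus")
image_dest_paths(3, "ucla")
```

The paths returned by `image_dest_paths(length(imageurl), "ucla")` could then be passed as `destfile` values in the download loop.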

Scraping finance data