Working with Databases

Introduction

In this lecture we follow a tutorial by Dr. Colin Rundel to handle a bigish data set. We will learn:

Import data from a bigish csv file.
Tidy data.
Deposit data into an SQLite database.
Query SQLite database.

Machine information

sessionInfo()

## R version 3.4.3 (2017-11-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.3
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.4.3  backports_1.1.2 magrittr_1.5    rprojroot_1.3-2
##  [5] tools_3.4.3     htmltools_0.3.6 yaml_2.1.16     Rcpp_0.12.15   
##  [9] stringi_1.1.6   rmarkdown_1.8   knitr_1.19      stringr_1.2.0  
## [13] digest_0.6.15   evaluate_0.10.1

CSV file

The /home/m280-data/nyc_parking/NYParkingViolations.csv file on teaching server contains information about parking tickets in NYC. You can create a symbolic link to the data file by Bash command

ln -sf /home/m280-data/nyc_parking/NYParkingViolations.csv NYParkingViolations.csv

The CSV file is about 1.7GB

ls -l NYParkingViolations.csv

## -rw-r--r--@ 1 huazhou  staff  1793818408 Feb 11 20:26 NYParkingViolations.csv

How many rows?

wc -l < NYParkingViolations.csv

##  9100279

First few lines:

head -5 NYParkingViolations.csv

## Summons Number,Plate ID,Registration State,Plate Type,Issue Date,Violation Code,Vehicle Body Type,Vehicle Make,Issuing Agency,Street Code1,Street Code2,Street Code3,Vehicle Expiration Date,Violation Location,Violation Precinct,Issuer Precinct,Issuer Code,Issuer Command,Issuer Squad,Violation Time,Time First Observed,Violation County,Violation In Front Of Or Opposite,Number,Street,Intersecting Street,Date First Observed,Law Section,Sub Division,Violation Legal Code,Days Parking In Effect    ,From Hours In Effect,To Hours In Effect,Vehicle Color,Unregistered Vehicle?,Vehicle Year,Meter Number,Feet From Curb,Violation Post Code,Violation Description,No Standing or Stopping Violation,Hydrant Violation,Double Parking Violation
## 1361929741,FCJ5493,NY,PAS,12/18/1970,20,SUBN,GMC,S,35030,89180,32600,20140728,0070,70,0,829379,KS14,0000,0358A,,K,F,959,E 5 ST,,0,408,E2,,BBBBBBB,ALL,ALL,BLACK,0,2013,-,0,,,,,
## 1366962000,63540MC,NY,COM,02/02/1971,46,DELV,FRUEH,P,58830,11430,11450,0,0068,68,68,928157,0068,0000,1116A,,K,F,185,MARINE AVENUE,,0,408,F1,,BBBBBBB,ALL,ALL,BRN,0,2013,-,0,,,,,
## 1342296187,GCY4187,NY,SRF,09/18/1971,21,VAN,FORD,S,11790,35780,57890,20150430,104,104,0,515350,QW05,0000,0939A,,,O,60-25,56 ST,,0,408,D1,,BBYBBBB,0930A,1100A,BLUE,0,2002,-,0,,,,,
## 1342296199,95V6675,TX,PAS,09/18/1971,21,,GMC,S,11790,35780,57890,0,104,104,0,515350,QW05,0000,0945A,,,F,60-12,56 ST,,0,408,D1,,BBYBBBB,0930A,1100A,SILVR,0,2008,-,0,,,,,

Read csv file

read.csv() function in base R takes nearly 6 minutes to read in the 1.7GB csv file:

system.time(read.csv("NYParkingViolations.csv"))

##    user  system elapsed 
## 344.781  16.486 367.218

data.table is an R package for reading large data sets:

library(data.table)
system.time({nyc = fread("NYParkingViolations.csv")})

## 
Read 0.0% of 9100278 rows

## Warning in fread("NYParkingViolations.csv"): Bumped column 30 to type
## character on data row 2565, field contains 'T'. Coercing previously read
## values in this column from logical, integer or numeric back to character
## which may not be lossless; e.g., if '00' and '000' occurred before they
## will now be just '0', and there may be inconsistencies with treatment of
## ',,' and ',NA,' too (if they occurred in this column before the bump).
## If this matters please rerun and set 'colClasses' to 'character' for this
## column. Please note that column type detection uses a sample of 1,000 rows
## (100 rows at 10 points) so hopefully this message should be very rare.
## If reporting to datatable-help, please rerun and include the output from
## verbose=TRUE.

## 
Read 4.9% of 9100278 rows
Read 10.7% of 9100278 rows
Read 16.5% of 9100278 rows
Read 21.5% of 9100278 rows
Read 25.9% of 9100278 rows
Read 30.4% of 9100278 rows
Read 35.3% of 9100278 rows
Read 39.2% of 9100278 rows
Read 42.9% of 9100278 rows
Read 47.6% of 9100278 rows
Read 52.4% of 9100278 rows
Read 56.7% of 9100278 rows
Read 61.0% of 9100278 rows
Read 65.9% of 9100278 rows
Read 70.8% of 9100278 rows
Read 75.9% of 9100278 rows
Read 81.0% of 9100278 rows
Read 85.8% of 9100278 rows
Read 90.5% of 9100278 rows
Read 95.2% of 9100278 rows
Read 99.6% of 9100278 rows
Read 9100278 rows and 43 (of 43) columns from 1.671 GB file in 00:00:26

##    user  system elapsed 
##  23.933   1.593  26.058

class(nyc)

## [1] "data.table" "data.frame"

nyc = as.data.frame(nyc)
class(nyc)

## [1] "data.frame"

read_csv() function in tidyverse is 3-4 times faster than base R:

library("tidyverse")

## ── Attaching packages ─────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.4
## ✔ tidyr   0.8.0     ✔ stringr 1.2.0
## ✔ readr   1.1.1     ✔ forcats 0.2.0

## ── Conflicts ────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::between()   masks data.table::between()
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::first()     masks data.table::first()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ dplyr::last()      masks data.table::last()
## ✖ purrr::transpose() masks data.table::transpose()

system.time({nyc = read_csv("NYParkingViolations.csv")})

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   `Summons Number` = col_double(),
##   `Violation Code` = col_integer(),
##   `Street Code1` = col_integer(),
##   `Street Code2` = col_integer(),
##   `Street Code3` = col_integer(),
##   `Vehicle Expiration Date` = col_integer(),
##   `Violation Precinct` = col_integer(),
##   `Issuer Precinct` = col_integer(),
##   `Issuer Code` = col_integer(),
##   `Date First Observed` = col_integer(),
##   `Law Section` = col_integer(),
##   `Violation Legal Code` = col_integer(),
##   `Unregistered Vehicle?` = col_integer(),
##   `Vehicle Year` = col_integer(),
##   `Feet From Curb` = col_integer()
## )

## See spec(...) for full column specifications.

## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)

## Warning: 654437 parsing failures.
## row # A tibble: 5 x 5 col     row col                  expected   actual file                      expected   <int> <chr>                <chr>      <chr>  <chr>                     actual 1  2565 Violation Legal Code an integer T      'NYParkingViolations.csv' file 2  3786 Violation Legal Code an integer T      'NYParkingViolations.csv' row 3  3991 Violation Legal Code an integer T      'NYParkingViolations.csv' col 4  3992 Violation Legal Code an integer T      'NYParkingViolations.csv' expected 5  3993 Violation Legal Code an integer T      'NYParkingViolations.csv'
## ... ................. ... ........................................................................ ........ ........................................................................ ...... ........................................................................ .... ........................................................................ ... ........................................................................ ... ........................................................................ ........ ........................................................................
## See problems(...) for more details.

##    user  system elapsed 
##  43.824   4.737  49.761

nyc

## # A tibble: 9,100,278 x 43
##    `Summons Number` `Plate ID` `Registration St… `Plate Type` `Issue Date`
##               <dbl> <chr>      <chr>             <chr>        <chr>       
##  1       1361929741 FCJ5493    NY                PAS          12/18/1970  
##  2       1366962000 63540MC    NY                COM          02/02/1971  
##  3       1342296187 GCY4187    NY                SRF          09/18/1971  
##  4       1342296199 95V6675    TX                PAS          09/18/1971  
##  5       1342296217 FYM5117    NY                SRF          09/18/1971  
##  6       1356906515 GFM1421    NY                PAS          09/18/1971  
##  7       1337077380 18972BB    NY                999          10/10/1971  
##  8       1364523796 WNJ4730    VA                PAS          04/05/1973  
##  9       1359914924 68091JZ    NY                COM          07/22/1973  
## 10       1355498326 EWV4127    NY                PAS          08/12/1973  
## # ... with 9,100,268 more rows, and 38 more variables: `Violation
## #   Code` <int>, `Vehicle Body Type` <chr>, `Vehicle Make` <chr>, `Issuing
## #   Agency` <chr>, `Street Code1` <int>, `Street Code2` <int>, `Street
## #   Code3` <int>, `Vehicle Expiration Date` <int>, `Violation
## #   Location` <chr>, `Violation Precinct` <int>, `Issuer Precinct` <int>,
## #   `Issuer Code` <int>, `Issuer Command` <chr>, `Issuer Squad` <chr>,
## #   `Violation Time` <chr>, `Time First Observed` <chr>, `Violation
## #   County` <chr>, `Violation In Front Of Or Opposite` <chr>,
## #   Number <chr>, Street <chr>, `Intersecting Street` <chr>, `Date First
## #   Observed` <int>, `Law Section` <int>, `Sub Division` <chr>, `Violation
## #   Legal Code` <int>, `Days Parking In Effect` <chr>, `From Hours In
## #   Effect` <chr>, `To Hours In Effect` <chr>, `Vehicle Color` <chr>,
## #   `Unregistered Vehicle?` <int>, `Vehicle Year` <int>, `Meter
## #   Number` <chr>, `Feet From Curb` <int>, `Violation Post Code` <chr>,
## #   `Violation Description` <chr>, `No Standing or Stopping
## #   Violation` <chr>, `Hydrant Violation` <chr>, `Double Parking
## #   Violation` <chr>

Fix column names

names(nyc) <- str_replace_all(names(nyc), " ", "_")
nyc

## # A tibble: 9,100,278 x 43
##    Summons_Number Plate_ID Registration_State Plate_Type Issue_Date
##             <dbl> <chr>    <chr>              <chr>      <chr>     
##  1     1361929741 FCJ5493  NY                 PAS        12/18/1970
##  2     1366962000 63540MC  NY                 COM        02/02/1971
##  3     1342296187 GCY4187  NY                 SRF        09/18/1971
##  4     1342296199 95V6675  TX                 PAS        09/18/1971
##  5     1342296217 FYM5117  NY                 SRF        09/18/1971
##  6     1356906515 GFM1421  NY                 PAS        09/18/1971
##  7     1337077380 18972BB  NY                 999        10/10/1971
##  8     1364523796 WNJ4730  VA                 PAS        04/05/1973
##  9     1359914924 68091JZ  NY                 COM        07/22/1973
## 10     1355498326 EWV4127  NY                 PAS        08/12/1973
## # ... with 9,100,268 more rows, and 38 more variables:
## #   Violation_Code <int>, Vehicle_Body_Type <chr>, Vehicle_Make <chr>,
## #   Issuing_Agency <chr>, Street_Code1 <int>, Street_Code2 <int>,
## #   Street_Code3 <int>, Vehicle_Expiration_Date <int>,
## #   Violation_Location <chr>, Violation_Precinct <int>,
## #   Issuer_Precinct <int>, Issuer_Code <int>, Issuer_Command <chr>,
## #   Issuer_Squad <chr>, Violation_Time <chr>, Time_First_Observed <chr>,
## #   Violation_County <chr>, Violation_In_Front_Of_Or_Opposite <chr>,
## #   Number <chr>, Street <chr>, Intersecting_Street <chr>,
## #   Date_First_Observed <int>, Law_Section <int>, Sub_Division <chr>,
## #   Violation_Legal_Code <int>, Days_Parking_In_Effect <chr>,
## #   From_Hours_In_Effect <chr>, To_Hours_In_Effect <chr>,
## #   Vehicle_Color <chr>, `Unregistered_Vehicle?` <int>,
## #   Vehicle_Year <int>, Meter_Number <chr>, Feet_From_Curb <int>,
## #   Violation_Post_Code <chr>, Violation_Description <chr>,
## #   No_Standing_or_Stopping_Violation <chr>, Hydrant_Violation <chr>,
## #   Double_Parking_Violation <chr>

Simplifying

We make a smaller data set with selected variables:

# magrittr for %<>%
library("magrittr")

## 
## Attaching package: 'magrittr'

## The following object is masked from 'package:purrr':
## 
##     set_names

## The following object is masked from 'package:tidyr':
## 
##     extract

(nyc %<>% 
  select(Registration_State:Issuing_Agency, 
         Violation_Location, Violation_Precinct, Violation_Time,
         Number:Intersecting_Street, Vehicle_Color))

## # A tibble: 9,100,278 x 14
##    Registration_Sta… Plate_Type Issue_Date Violation_Code Vehicle_Body_Ty…
##    <chr>             <chr>      <chr>               <int> <chr>           
##  1 NY                PAS        12/18/1970             20 SUBN            
##  2 NY                COM        02/02/1971             46 DELV            
##  3 NY                SRF        09/18/1971             21 VAN             
##  4 TX                PAS        09/18/1971             21 <NA>            
##  5 NY                SRF        09/18/1971             21 SUBN            
##  6 NY                PAS        09/18/1971             40 SDN             
##  7 NY                999        10/10/1971             14 BUS             
##  8 VA                PAS        04/05/1973             14 SDN             
##  9 NY                COM        07/22/1973             46 DELV            
## 10 NY                PAS        08/12/1973             21 SUBN            
## # ... with 9,100,268 more rows, and 9 more variables: Vehicle_Make <chr>,
## #   Issuing_Agency <chr>, Violation_Location <chr>,
## #   Violation_Precinct <int>, Violation_Time <chr>, Number <chr>,
## #   Street <chr>, Intersecting_Street <chr>, Vehicle_Color <chr>

Fix dates

Hadley Wickham’s lubridate package facilitates handling date and time. Read Chapter 16 of R for Data Science for more information.

Make variable Issue_Date of date-time type:

library("lubridate")

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:data.table':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday,
##     week, yday, year

## The following object is masked from 'package:base':
## 
##     date

class(nyc$Issue_Date)

## [1] "character"

nyc %<>% mutate(Issue_Date = mdy(Issue_Date))
class(nyc$Issue_Date)

## [1] "Date"

nyc

## # A tibble: 9,100,278 x 14
##    Registration_Sta… Plate_Type Issue_Date Violation_Code Vehicle_Body_Ty…
##    <chr>             <chr>      <date>              <int> <chr>           
##  1 NY                PAS        1970-12-18             20 SUBN            
##  2 NY                COM        1971-02-02             46 DELV            
##  3 NY                SRF        1971-09-18             21 VAN             
##  4 TX                PAS        1971-09-18             21 <NA>            
##  5 NY                SRF        1971-09-18             21 SUBN            
##  6 NY                PAS        1971-09-18             40 SDN             
##  7 NY                999        1971-10-10             14 BUS             
##  8 VA                PAS        1973-04-05             14 SDN             
##  9 NY                COM        1973-07-22             46 DELV            
## 10 NY                PAS        1973-08-12             21 SUBN            
## # ... with 9,100,268 more rows, and 9 more variables: Vehicle_Make <chr>,
## #   Issuing_Agency <chr>, Violation_Location <chr>,
## #   Violation_Precinct <int>, Violation_Time <chr>, Number <chr>,
## #   Street <chr>, Intersecting_Street <chr>, Vehicle_Color <chr>

Only keep data in years 2013-4.

range(nyc$Issue_Date)

## [1] "1970-12-18" "2069-12-23"

nyc$Issue_Date %>% year() %>% table()

## .
##    1970    1971    1973    1974    1976    1977    1979    1981    1983 
##       1       6      10       1       2       1       2       4       1 
##    1984    1987    1990    1991    1996    2000    2001    2002    2003 
##       2       3       2       1       1     319      91       7      39 
##    2004    2005    2006    2007    2008    2009    2010    2011    2012 
##      77       9      11      13       8       9     129     251     618 
##    2013    2014    2015    2016    2017    2018    2019    2020    2021 
## 4379109 4716512    1522     296     309     181     329      18      26 
##    2022    2023    2024    2025    2026    2027    2028    2029    2030 
##       1      31      23      10       4       4       7       3      45 
##    2031    2032    2033    2040    2041    2043    2044    2045    2046 
##      93       3       8       1      39       9       9       2       7 
##    2047    2048    2049    2050    2051    2052    2053    2060    2061 
##       6       1       3       1      12       2       1       3      10 
##    2063    2064    2066    2067    2069 
##       9       5       3       2       1

filter(nyc, Issue_Date >= mdy("1/1/2013"), Issue_Date <= mdy("12/31/2014"))

## # A tibble: 9,095,621 x 14
##    Registration_Sta… Plate_Type Issue_Date Violation_Code Vehicle_Body_Ty…
##    <chr>             <chr>      <date>              <int> <chr>           
##  1 NY                PAS        2013-07-29             94 <NA>            
##  2 NY                COM        2013-01-07             19 DELV            
##  3 NY                COM        2013-01-01             46 DELV            
##  4 IL                PAS        2013-01-01             78 VAN             
##  5 NJ                PAS        2013-01-01             46 SUBN            
##  6 NY                PAS        2013-01-01             46 SDN             
##  7 NY                COM        2013-01-01             78 VAN             
##  8 NY                COM        2013-01-01             46 SDN             
##  9 OK                PAS        2013-01-01             47 TRLR            
## 10 NY                COM        2013-01-01             45 DELV            
## # ... with 9,095,611 more rows, and 9 more variables: Vehicle_Make <chr>,
## #   Issuing_Agency <chr>, Violation_Location <chr>,
## #   Violation_Precinct <int>, Violation_Time <chr>, Number <chr>,
## #   Street <chr>, Intersecting_Street <chr>, Vehicle_Color <chr>

Putting it all together

system.time({
  nyc <- read_csv("NYParkingViolations.csv") %>%
    setNames(str_replace_all(names(.)," ", "_")) %>%
    select(Registration_State:Issuing_Agency, 
         Violation_Location, Violation_Precinct, Violation_Time,
         Number:Intersecting_Street, Vehicle_Color) %>%
    mutate(Issue_Date = mdy(Issue_Date)) %>% 
    mutate(Issue_Day = day(Issue_Date),
           Issue_Month = month(Issue_Date),
           Issue_Year = year(Issue_Date),
           Issue_WDay = wday(Issue_Date, label=TRUE)) %>%
    filter(Issue_Year %in% 2013:2014)  
})

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   `Summons Number` = col_double(),
##   `Violation Code` = col_integer(),
##   `Street Code1` = col_integer(),
##   `Street Code2` = col_integer(),
##   `Street Code3` = col_integer(),
##   `Vehicle Expiration Date` = col_integer(),
##   `Violation Precinct` = col_integer(),
##   `Issuer Precinct` = col_integer(),
##   `Issuer Code` = col_integer(),
##   `Date First Observed` = col_integer(),
##   `Law Section` = col_integer(),
##   `Violation Legal Code` = col_integer(),
##   `Unregistered Vehicle?` = col_integer(),
##   `Vehicle Year` = col_integer(),
##   `Feet From Curb` = col_integer()
## )

## See spec(...) for full column specifications.

## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)

## Warning: 654437 parsing failures.
## row # A tibble: 5 x 5 col     row col                  expected   actual file                      expected   <int> <chr>                <chr>      <chr>  <chr>                     actual 1  2565 Violation Legal Code an integer T      'NYParkingViolations.csv' file 2  3786 Violation Legal Code an integer T      'NYParkingViolations.csv' row 3  3991 Violation Legal Code an integer T      'NYParkingViolations.csv' col 4  3992 Violation Legal Code an integer T      'NYParkingViolations.csv' expected 5  3993 Violation Legal Code an integer T      'NYParkingViolations.csv'
## ... ................. ... ........................................................................ ........ ........................................................................ ...... ........................................................................ .... ........................................................................ ... ........................................................................ ... ........................................................................ ........ ........................................................................
## See problems(...) for more details.

##    user  system elapsed 
##  59.290   5.114  65.049

nyc

## # A tibble: 9,095,621 x 18
##    Registration_Sta… Plate_Type Issue_Date Violation_Code Vehicle_Body_Ty…
##    <chr>             <chr>      <date>              <int> <chr>           
##  1 NY                PAS        2013-07-29             94 <NA>            
##  2 NY                COM        2013-01-07             19 DELV            
##  3 NY                COM        2013-01-01             46 DELV            
##  4 IL                PAS        2013-01-01             78 VAN             
##  5 NJ                PAS        2013-01-01             46 SUBN            
##  6 NY                PAS        2013-01-01             46 SDN             
##  7 NY                COM        2013-01-01             78 VAN             
##  8 NY                COM        2013-01-01             46 SDN             
##  9 OK                PAS        2013-01-01             47 TRLR            
## 10 NY                COM        2013-01-01             45 DELV            
## # ... with 9,095,611 more rows, and 13 more variables: Vehicle_Make <chr>,
## #   Issuing_Agency <chr>, Violation_Location <chr>,
## #   Violation_Precinct <int>, Violation_Time <chr>, Number <chr>,
## #   Street <chr>, Intersecting_Street <chr>, Vehicle_Color <chr>,
## #   Issue_Day <int>, Issue_Month <dbl>, Issue_Year <dbl>, Issue_WDay <ord>

Ticket frequency

nyc %>% 
  group_by(Issue_Date) %>% 
  summarize(n = n()) %>%
  ggplot(aes(x = Issue_Date, y = n)) + 
    geom_line() + 
    xlim(mdy("7/1/2013"), mdy("6/30/2014"))

## Warning: Removed 365 rows containing missing values (geom_path).

Create an SQLite database

Import the nyc data set into the SQLite database:

library("DBI")
library("RSQLite")
library("tidyverse")
if (Sys.info()[["sysname"]] == "Linux") {
  db <- dbConnect(RSQLite::SQLite(), 
                  dbname = "/home/m280-data/nyc_parking/NYParkingViolations.sqlite")
} else if (Sys.info()[["sysname"]] == "Darwin") {
  db <- dbConnect(RSQLite::SQLite(), 
                  dbname = "./NYParkingViolations.sqlite")
}
dbWriteTable(db, "nyc", nyc, overwrite = TRUE)
dbListTables(db)

## [1] "nyc"

dbDisconnect(db)

You can create a symbolic link to the database file by Bash command

ln -sf /home/m280-data/nyc_parking/NYParkingViolations.sqlite NYParkingViolations.sqlite

Size of the database file

ls -l NYParkingViolations.sqlite

## -rw-r--r--  1 huazhou  staff  729309184 Feb 19 20:43 NYParkingViolations.sqlite

Read data from database

Connect to table in database:

db <- dbConnect(RSQLite::SQLite(), dbname = "./NYParkingViolations.sqlite")
dbListTables(db)

## [1] "nyc"

nyc_sql <- dplyr::tbl(db, "nyc")
str(nyc_sql)

## List of 2
##  $ src:List of 2
##   ..$ con  :Formal class 'SQLiteConnection' [package "RSQLite"] with 6 slots
##   .. .. ..@ ptr                :<externalptr> 
##   .. .. ..@ dbname             : chr "/Users/huazhou/Documents/github.com/Hua-Zhou.github.io/teaching/biostatm280-2018winter/slides/11-dbplyr/NYParki"| __truncated__
##   .. .. ..@ loadable.extensions: logi TRUE
##   .. .. ..@ flags              : int 70
##   .. .. ..@ vfs                : chr ""
##   .. .. ..@ ref                :<environment: 0x7fc41647f6b0> 
##   ..$ disco: NULL
##   ..- attr(*, "class")= chr [1:3] "src_dbi" "src_sql" "src"
##  $ ops:List of 2
##   ..$ x   :Classes 'ident', 'character'  chr "nyc"
##   ..$ vars: chr [1:18] "Registration_State" "Plate_Type" "Issue_Date" "Violation_Code" ...
##   ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
##  - attr(*, "class")= chr [1:4] "tbl_dbi" "tbl_sql" "tbl_lazy" "tbl"

nyc_sql %>% print(width = Inf)

## # Source:   table<nyc> [?? x 18]
## # Database: sqlite 3.19.3
## #   [/Users/huazhou/Documents/github.com/Hua-Zhou.github.io/teaching/biostatm280-2018winter/slides/11-dbplyr/NYParkingViolations.sqlite]
##    Registration_State Plate_Type Issue_Date Violation_Code
##    <chr>              <chr>           <dbl>          <int>
##  1 NY                 PAS             15915             94
##  2 NY                 COM             15712             19
##  3 NY                 COM             15706             46
##  4 IL                 PAS             15706             78
##  5 NJ                 PAS             15706             46
##  6 NY                 PAS             15706             46
##  7 NY                 COM             15706             78
##  8 NY                 COM             15706             46
##  9 OK                 PAS             15706             47
## 10 NY                 COM             15706             45
##    Vehicle_Body_Type Vehicle_Make Issuing_Agency Violation_Location
##    <chr>             <chr>        <chr>          <chr>             
##  1 <NA>              TRIUM        X              <NA>              
##  2 DELV              <NA>         P              0076              
##  3 DELV              FRUEH        P              110               
##  4 VAN               FORD         P              0025              
##  5 SUBN              CHEVR        P              0081              
##  6 SDN               HONDA        P              0078              
##  7 VAN               FORD         P              110               
##  8 SDN               ISUZU        P              112               
##  9 TRLR              INTER        P              0013              
## 10 DELV              FRUEH        P              0026              
##    Violation_Precinct Violation_Time Number Street          
##                 <int> <chr>          <chr>  <chr>           
##  1                  0 <NA>           <NA>   <NA>            
##  2                 76 0407A          548    COURT STREET    
##  3                110 1045A          88-10  WHITNEY AVE     
##  4                 25 0332A          2      W 120TH ST      
##  5                 81 0139P          427    HART ST         
##  6                 78 1127A          565    UNION ST        
##  7                110 0412A          40-59  95 ST           
##  8                112 0554P          67-70  YELLOWSTONE BLVD
##  9                 13 0745A          456    3RD AVE         
## 10                 26 0330P          44     MORNINGSIDE DR  
##    Intersecting_Street Vehicle_Color Issue_Day Issue_Month Issue_Year
##    <chr>               <chr>             <int>       <dbl>      <dbl>
##  1 <NA>                GREEN                29        7.00       2013
##  2 <NA>                <NA>                  7        1.00       2013
##  3 <NA>                WHIT                  1        1.00       2013
##  4 <NA>                WHT                   1        1.00       2013
##  5 <NA>                RD                    1        1.00       2013
##  6 <NA>                GREY                  1        1.00       2013
##  7 <NA>                WH                    1        1.00       2013
##  8 <NA>                WH                    1        1.00       2013
##  9 <NA>                WHITE                 1        1.00       2013
## 10 <NA>                WHITE                 1        1.00       2013
##    Issue_WDay
##    <chr>     
##  1 Mon       
##  2 Mon       
##  3 Tue       
##  4 Tue       
##  5 Tue       
##  6 Tue       
##  7 Tue       
##  8 Tue       
##  9 Tue       
## 10 Tue       
## # ... with more rows

Use dplyr with SQLite

addr <- nyc_sql %>%
  select(Issue_Date, Issuing_Agency, Violation_Precinct, Number, Street) %>%
  filter(Violation_Precinct >= 1, Violation_Precinct <= 34)
addr

## # Source:   lazy query [?? x 5]
## # Database: sqlite 3.19.3
## #   [/Users/huazhou/Documents/github.com/Hua-Zhou.github.io/teaching/biostatm280-2018winter/slides/11-dbplyr/NYParkingViolations.sqlite]
##    Issue_Date Issuing_Agency Violation_Precinct Number  Street         
##         <dbl> <chr>                       <int> <chr>   <chr>          
##  1      15706 P                              25 2       W 120TH ST     
##  2      15706 P                              13 456     3RD AVE        
##  3      15706 P                              26 44      MORNINGSIDE DR 
##  4      15706 X                              14 8       PENN PLAZA     
##  5      15706 P                              25 2123    MADISON AVE    
##  6      15706 P                              18 413     48 TH ST       
##  7      15706 P                              12 630     LEXINGTON AVE  
##  8      15706 P                              25 219     E 121 ST       
##  9      15706 P                              26 1420    AMSTERDAM AVE  
## 10      15706 P                              26 488-490 ST NICHOLAS AVE
## # ... with more rows

SQL query

class(addr)

## [1] "tbl_dbi"  "tbl_sql"  "tbl_lazy" "tbl"

show_query(addr)

## <SQL>
## SELECT *
## FROM (SELECT `Issue_Date`, `Issuing_Agency`, `Violation_Precinct`, `Number`, `Street`
## FROM `nyc`)
## WHERE ((`Violation_Precinct` >= 1.0) AND (`Violation_Precinct` <= 34.0))

Mutate and summarise

addr %>% mutate(address = paste(Number, Street))

## # Source:   lazy query [?? x 6]
## # Database: sqlite 3.19.3
## #   [/Users/huazhou/Documents/github.com/Hua-Zhou.github.io/teaching/biostatm280-2018winter/slides/11-dbplyr/NYParkingViolations.sqlite]
##    Issue_Date Issuing_Agency Violation_Precinct Number  Street   address  
##         <dbl> <chr>                       <int> <chr>   <chr>    <chr>    
##  1      15706 P                              25 2       W 120TH… 2 W 120T…
##  2      15706 P                              13 456     3RD AVE  456 3RD …
##  3      15706 P                              26 44      MORNING… 44 MORNI…
##  4      15706 X                              14 8       PENN PL… 8 PENN P…
##  5      15706 P                              25 2123    MADISON… 2123 MAD…
##  6      15706 P                              18 413     48 TH ST 413 48 T…
##  7      15706 P                              12 630     LEXINGT… 630 LEXI…
##  8      15706 P                              25 219     E 121 ST 219 E 12…
##  9      15706 P                              26 1420    AMSTERD… 1420 AMS…
## 10      15706 P                              26 488-490 ST NICH… 488-490 …
## # ... with more rows

addr %>% summarize(mean = mean(Violation_Precinct, na.rm = TRUE))

## # Source:   lazy query [?? x 1]
## # Database: sqlite 3.19.3
## #   [/Users/huazhou/Documents/github.com/Hua-Zhou.github.io/teaching/biostatm280-2018winter/slides/11-dbplyr/NYParkingViolations.sqlite]
##    mean
##   <dbl>
## 1  16.1

SQL grouping

addr %>% 
  group_by(Issuing_Agency, Violation_Precinct) %>% 
  summarize(n = n())

## # Source:   lazy query [?? x 3]
## # Database: sqlite 3.19.3
## #   [/Users/huazhou/Documents/github.com/Hua-Zhou.github.io/teaching/biostatm280-2018winter/slides/11-dbplyr/NYParkingViolations.sqlite]
## # Groups:   Issuing_Agency
##    Issuing_Agency Violation_Precinct     n
##    <chr>                       <int> <int>
##  1 A                               1    13
##  2 A                               7     1
##  3 A                              10    24
##  4 A                              11     1
##  5 A                              14    47
##  6 A                              33    11
##  7 B                              25     2
##  8 C                               5    73
##  9 C                              13     7
## 10 D                               1     1
## # ... with more rows

SQL query

addr %>% 
  group_by(Issuing_Agency, Violation_Precinct) %>% 
  summarize(n = n()) %>% 
  show_query()

## <SQL>
## SELECT `Issuing_Agency`, `Violation_Precinct`, COUNT() AS `n`
## FROM (SELECT `Issue_Date`, `Issuing_Agency`, `Violation_Precinct`, `Number`, `Street`
## FROM `nyc`)
## WHERE ((`Violation_Precinct` >= 1.0) AND (`Violation_Precinct` <= 34.0))
## GROUP BY `Issuing_Agency`, `Violation_Precinct`

SQL translation

dbplyr package (a dplyr backend for databases) has a function, translate_sql, that lets you experiment with how R functions are translated to SQL:

library("dbplyr")

## 
## Attaching package: 'dbplyr'

## The following objects are masked from 'package:dplyr':
## 
##     ident, sql

translate_sql(x == 1 & (y < 2 | z > 3))

## <SQL> "x" = 1.0 AND ("y" < 2.0 OR "z" > 3.0)

translate_sql(x ^ 2 < 10)

## <SQL> POWER("x", 2.0) < 10.0

translate_sql(x %% 2 == 10)

## <SQL> "x" % 2.0 = 10.0

translate_sql(paste(x, y))

## <SQL> PASTE("x", "y")

translate_sql(mean(x))

## Warning: Missing values are always removed in SQL.
## Use `avg(x, na.rm = TRUE)` to silence this warning

## <SQL> avg("x") OVER ()

translate_sql(mean(x, na.rm = TRUE))

## <SQL> avg("x") OVER ()

Timings

dplyr using tibble:

system.time(
  nyc %>%
    select(Issue_Date, Issuing_Agency, Violation_Precinct, Number, Street) %>%
    filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>%
    group_by(Issuing_Agency, Violation_Precinct) %>%
    summarize(n = n())
)

##    user  system elapsed 
##   0.383   0.091   0.479

dplyr using SQLite:

system.time(
  nyc_sql %>%
    select(Issue_Date, Issuing_Agency, Violation_Precinct, Number, Street) %>%
    filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>%
    group_by(Issuing_Agency, Violation_Precinct) %>%
    summarize(n = n())
)

##    user  system elapsed 
##   0.016   0.000   0.017

nyc_sql was 30x times faster than nyc, but the former is disk based while the latter is in memory. Why is the discrepancy?

Laziness

dplyr/dbplyr uses lazy evaluation as much as possible, particularly when working with non-local backends.

When building a query, we don’t want the entire table, often we want just enough to check if our query is working.
Since we would prefer to run one complex query over many simple queries, laziness allows for verbs to be strung together.
Therefore, by default dplyr
- won’t connect and query the database until absolutely necessary (e.g. show output),
- and unless explicitly told to, will only query a handful of rows to give a sense of what the result will look like

nyc_sql %>%
  select(Issue_Date, Issuing_Agency, Violation_Precinct, Number, Street) %>%
  filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>%
  group_by(Issuing_Agency, Violation_Precinct) %>%
  summarize(n = n())

## # Source:   lazy query [?? x 3]
## # Database: sqlite 3.19.3
## #   [/Users/huazhou/Documents/github.com/Hua-Zhou.github.io/teaching/biostatm280-2018winter/slides/11-dbplyr/NYParkingViolations.sqlite]
## # Groups:   Issuing_Agency
##    Issuing_Agency Violation_Precinct     n
##    <chr>                       <int> <int>
##  1 A                               1    13
##  2 A                               7     1
##  3 A                              10    24
##  4 A                              11     1
##  5 A                              14    47
##  6 A                              33    11
##  7 B                              25     2
##  8 C                               5    73
##  9 C                              13     7
## 10 D                               1     1
## # ... with more rows

Full query

To force a full query and return a complete table it is necessary to use the collect function.

system.time(
  nyc_sql %>%
    select(Issue_Date, Issuing_Agency, Violation_Precinct, Number, Street) %>%
    filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>%
    group_by(Issuing_Agency, Violation_Precinct) %>%
    summarize(n = n()) %>%
    collect()
)

##    user  system elapsed 
##   3.166   0.341   3.544

Plotting

Bar plot

Suppose we want the bar plot of number of tickets on each weekday. We encounter error with the usual ggplot2 command:

nyc_sql %>%
  ggplot() + 
    geom_bar(aes(x = Issue_WDay))

## Error: ggplot2 doesn't know how to deal with data of class tbl_dbi/tbl_sql/tbl_lazy/tbl

This is becasue ggplot2 needs to compute the count per category by going through all the rows. But here nyc_sql is just a pointer to the SQLite table. We have to use the transform in database, plot in R strategy.

nyc_sql %>%
  group_by(Issue_WDay) %>%
  count() %>%
  collect() %>%
  ggplot() + 
    geom_col(aes(x = Issue_WDay, y = n))

Other plots

For example of making histogram and raster plot, read tutorial https://db.rstudio.com/best-practices/visualization/.

Close connection to database

dbDisconnect(db)

Working with Databases

Biostat M280

Dr. Hua Zhou

Feb 13, 2018

Introduction

Machine information

CSV file

Read csv file

Fix column names

Simplifying

Fix dates

Putting it all together

Ticket frequency

Create an SQLite database

Read data from database

Use dplyr with SQLite

SQL query

Mutate and summarise

SQL grouping

SQL query

SQL translation

Timings

Laziness

Full query

Plotting

Bar plot

Other plots

Close connection to database