Working with Databases

Introduction

In this lecture we follow a tutorial by Dr. Colin Rundel to handle a bigish data set. We will learn:

Import data from a bigish csv file.
Tidy data.
Deposit data into an SQLite database.
Query SQLite database.
Transform in database and plot in R.

Machine information

sessionInfo()

## R version 3.5.2 (2018-12-20)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.3
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.5.2  magrittr_1.5    tools_3.5.2     htmltools_0.3.6
##  [5] yaml_2.2.0      Rcpp_1.0.0      stringi_1.2.4   rmarkdown_1.11 
##  [9] knitr_1.21      stringr_1.4.0   xfun_0.4        digest_0.6.18  
## [13] evaluate_0.13

CSV file

The /home/m280data/nyc_parking/NYParkingViolations.csv file on teaching server contains information about parking tickets in NYC. You can create a symbolic link to the data file by Bash command

ln -sf /home/m280data/nyc_parking/NYParkingViolations.csv NYParkingViolations.csv

The CSV file is about 1.7GB

ls -l NYParkingViolations.csv

## -rw-r--r--@ 1 huazhou  staff  1793818408 Feb 11  2018 NYParkingViolations.csv

How many rows?

wc -l < NYParkingViolations.csv

##  9100279

First few lines:

head -5 NYParkingViolations.csv

## Summons Number,Plate ID,Registration State,Plate Type,Issue Date,Violation Code,Vehicle Body Type,Vehicle Make,Issuing Agency,Street Code1,Street Code2,Street Code3,Vehicle Expiration Date,Violation Location,Violation Precinct,Issuer Precinct,Issuer Code,Issuer Command,Issuer Squad,Violation Time,Time First Observed,Violation County,Violation In Front Of Or Opposite,Number,Street,Intersecting Street,Date First Observed,Law Section,Sub Division,Violation Legal Code,Days Parking In Effect    ,From Hours In Effect,To Hours In Effect,Vehicle Color,Unregistered Vehicle?,Vehicle Year,Meter Number,Feet From Curb,Violation Post Code,Violation Description,No Standing or Stopping Violation,Hydrant Violation,Double Parking Violation
## 1361929741,FCJ5493,NY,PAS,12/18/1970,20,SUBN,GMC,S,35030,89180,32600,20140728,0070,70,0,829379,KS14,0000,0358A,,K,F,959,E 5 ST,,0,408,E2,,BBBBBBB,ALL,ALL,BLACK,0,2013,-,0,,,,,
## 1366962000,63540MC,NY,COM,02/02/1971,46,DELV,FRUEH,P,58830,11430,11450,0,0068,68,68,928157,0068,0000,1116A,,K,F,185,MARINE AVENUE,,0,408,F1,,BBBBBBB,ALL,ALL,BRN,0,2013,-,0,,,,,
## 1342296187,GCY4187,NY,SRF,09/18/1971,21,VAN,FORD,S,11790,35780,57890,20150430,104,104,0,515350,QW05,0000,0939A,,,O,60-25,56 ST,,0,408,D1,,BBYBBBB,0930A,1100A,BLUE,0,2002,-,0,,,,,
## 1342296199,95V6675,TX,PAS,09/18/1971,21,,GMC,S,11790,35780,57890,0,104,104,0,515350,QW05,0000,0945A,,,F,60-12,56 ST,,0,408,D1,,BBYBBBB,0930A,1100A,SILVR,0,2008,-,0,,,,,

Read csv file

read.csv() function in base R takes nearly 3-4 minutes to read in the 1.7GB csv file:

system.time(read.csv("NYParkingViolations.csv"))

##    user  system elapsed 
## 196.861  10.987 210.255

data.table is an R package for reading large data sets:

library(data.table)
system.time({nyc = fread("NYParkingViolations.csv")})

##    user  system elapsed 
##  81.599   2.019  11.529

class(nyc)

## [1] "data.table" "data.frame"

nyc = as.data.frame(nyc)
class(nyc)

## [1] "data.frame"

read_csv() function in tidyverse is 4-5 times faster than base R:

library("tidyverse")

## ── Attaching packages ────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.1.0     ✔ purrr   0.3.0
## ✔ tibble  2.0.1     ✔ dplyr   0.7.8
## ✔ tidyr   0.8.2     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.3.0

## ── Conflicts ───────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::between()   masks data.table::between()
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::first()     masks data.table::first()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ dplyr::last()      masks data.table::last()
## ✖ purrr::transpose() masks data.table::transpose()

system.time({nyc = read_csv("NYParkingViolations.csv")})

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   `Summons Number` = col_double(),
##   `Violation Code` = col_double(),
##   `Street Code1` = col_double(),
##   `Street Code2` = col_double(),
##   `Street Code3` = col_double(),
##   `Vehicle Expiration Date` = col_double(),
##   `Violation Precinct` = col_double(),
##   `Issuer Precinct` = col_double(),
##   `Issuer Code` = col_double(),
##   `Date First Observed` = col_double(),
##   `Law Section` = col_double(),
##   `Violation Legal Code` = col_double(),
##   `Unregistered Vehicle?` = col_double(),
##   `Vehicle Year` = col_double(),
##   `Feet From Curb` = col_double(),
##   `Violation Description` = col_logical(),
##   `No Standing or Stopping Violation` = col_logical(),
##   `Hydrant Violation` = col_logical(),
##   `Double Parking Violation` = col_logical()
## )

## See spec(...) for full column specifications.

## Warning: 8523062 parsing failures.
##  row                   col           expected             actual                      file
## 2565 Violation Legal Code  a double           T                  'NYParkingViolations.csv'
## 2565 Violation Description 1/0/T/F/TRUE/FALSE BUS LANE VIOLATION 'NYParkingViolations.csv'
## 3786 Violation Legal Code  a double           T                  'NYParkingViolations.csv'
## 3786 Violation Description 1/0/T/F/TRUE/FALSE BUS LANE VIOLATION 'NYParkingViolations.csv'
## 3991 Violation Legal Code  a double           T                  'NYParkingViolations.csv'
## .... ..................... .................. .................. .........................
## See problems(...) for more details.

##    user  system elapsed 
##  38.759   2.271  41.161

nyc

## # A tibble: 9,100,278 x 43
##    `Summons Number` `Plate ID` `Registration S… `Plate Type` `Issue Date`
##               <dbl> <chr>      <chr>            <chr>        <chr>       
##  1       1361929741 FCJ5493    NY               PAS          12/18/1970  
##  2       1366962000 63540MC    NY               COM          02/02/1971  
##  3       1342296187 GCY4187    NY               SRF          09/18/1971  
##  4       1342296199 95V6675    TX               PAS          09/18/1971  
##  5       1342296217 FYM5117    NY               SRF          09/18/1971  
##  6       1356906515 GFM1421    NY               PAS          09/18/1971  
##  7       1337077380 18972BB    NY               999          10/10/1971  
##  8       1364523796 WNJ4730    VA               PAS          04/05/1973  
##  9       1359914924 68091JZ    NY               COM          07/22/1973  
## 10       1355498326 EWV4127    NY               PAS          08/12/1973  
## # … with 9,100,268 more rows, and 38 more variables: `Violation
## #   Code` <dbl>, `Vehicle Body Type` <chr>, `Vehicle Make` <chr>, `Issuing
## #   Agency` <chr>, `Street Code1` <dbl>, `Street Code2` <dbl>, `Street
## #   Code3` <dbl>, `Vehicle Expiration Date` <dbl>, `Violation
## #   Location` <chr>, `Violation Precinct` <dbl>, `Issuer Precinct` <dbl>,
## #   `Issuer Code` <dbl>, `Issuer Command` <chr>, `Issuer Squad` <chr>,
## #   `Violation Time` <chr>, `Time First Observed` <chr>, `Violation
## #   County` <chr>, `Violation In Front Of Or Opposite` <chr>,
## #   Number <chr>, Street <chr>, `Intersecting Street` <chr>, `Date First
## #   Observed` <dbl>, `Law Section` <dbl>, `Sub Division` <chr>, `Violation
## #   Legal Code` <dbl>, `Days Parking In Effect` <chr>, `From Hours In
## #   Effect` <chr>, `To Hours In Effect` <chr>, `Vehicle Color` <chr>,
## #   `Unregistered Vehicle?` <dbl>, `Vehicle Year` <dbl>, `Meter
## #   Number` <chr>, `Feet From Curb` <dbl>, `Violation Post Code` <chr>,
## #   `Violation Description` <lgl>, `No Standing or Stopping
## #   Violation` <lgl>, `Hydrant Violation` <lgl>, `Double Parking
## #   Violation` <lgl>

Fix column names

names(nyc) <- str_replace_all(names(nyc), " ", "_")
nyc

## # A tibble: 9,100,278 x 43
##    Summons_Number Plate_ID Registration_St… Plate_Type Issue_Date
##             <dbl> <chr>    <chr>            <chr>      <chr>     
##  1     1361929741 FCJ5493  NY               PAS        12/18/1970
##  2     1366962000 63540MC  NY               COM        02/02/1971
##  3     1342296187 GCY4187  NY               SRF        09/18/1971
##  4     1342296199 95V6675  TX               PAS        09/18/1971
##  5     1342296217 FYM5117  NY               SRF        09/18/1971
##  6     1356906515 GFM1421  NY               PAS        09/18/1971
##  7     1337077380 18972BB  NY               999        10/10/1971
##  8     1364523796 WNJ4730  VA               PAS        04/05/1973
##  9     1359914924 68091JZ  NY               COM        07/22/1973
## 10     1355498326 EWV4127  NY               PAS        08/12/1973
## # … with 9,100,268 more rows, and 38 more variables: Violation_Code <dbl>,
## #   Vehicle_Body_Type <chr>, Vehicle_Make <chr>, Issuing_Agency <chr>,
## #   Street_Code1 <dbl>, Street_Code2 <dbl>, Street_Code3 <dbl>,
## #   Vehicle_Expiration_Date <dbl>, Violation_Location <chr>,
## #   Violation_Precinct <dbl>, Issuer_Precinct <dbl>, Issuer_Code <dbl>,
## #   Issuer_Command <chr>, Issuer_Squad <chr>, Violation_Time <chr>,
## #   Time_First_Observed <chr>, Violation_County <chr>,
## #   Violation_In_Front_Of_Or_Opposite <chr>, Number <chr>, Street <chr>,
## #   Intersecting_Street <chr>, Date_First_Observed <dbl>,
## #   Law_Section <dbl>, Sub_Division <chr>, Violation_Legal_Code <dbl>,
## #   Days_Parking_In_Effect <chr>, From_Hours_In_Effect <chr>,
## #   To_Hours_In_Effect <chr>, Vehicle_Color <chr>,
## #   `Unregistered_Vehicle?` <dbl>, Vehicle_Year <dbl>, Meter_Number <chr>,
## #   Feet_From_Curb <dbl>, Violation_Post_Code <chr>,
## #   Violation_Description <lgl>, No_Standing_or_Stopping_Violation <lgl>,
## #   Hydrant_Violation <lgl>, Double_Parking_Violation <lgl>

Simplifying

We make a smaller data set with selected variables:

# magrittr for %<>%
library("magrittr")

## 
## Attaching package: 'magrittr'

## The following object is masked from 'package:purrr':
## 
##     set_names

## The following object is masked from 'package:tidyr':
## 
##     extract

(nyc %<>% 
  select(Registration_State:Issuing_Agency, 
         Violation_Location, Violation_Precinct, Violation_Time,
         Number:Intersecting_Street, Vehicle_Color))

## # A tibble: 9,100,278 x 14
##    Registration_St… Plate_Type Issue_Date Violation_Code Vehicle_Body_Ty…
##    <chr>            <chr>      <chr>               <dbl> <chr>           
##  1 NY               PAS        12/18/1970             20 SUBN            
##  2 NY               COM        02/02/1971             46 DELV            
##  3 NY               SRF        09/18/1971             21 VAN             
##  4 TX               PAS        09/18/1971             21 <NA>            
##  5 NY               SRF        09/18/1971             21 SUBN            
##  6 NY               PAS        09/18/1971             40 SDN             
##  7 NY               999        10/10/1971             14 BUS             
##  8 VA               PAS        04/05/1973             14 SDN             
##  9 NY               COM        07/22/1973             46 DELV            
## 10 NY               PAS        08/12/1973             21 SUBN            
## # … with 9,100,268 more rows, and 9 more variables: Vehicle_Make <chr>,
## #   Issuing_Agency <chr>, Violation_Location <chr>,
## #   Violation_Precinct <dbl>, Violation_Time <chr>, Number <chr>,
## #   Street <chr>, Intersecting_Street <chr>, Vehicle_Color <chr>

Fix dates

Hadley Wickham’s lubridate package facilitates handling date and time. Read Chapter 16 of R for Data Science for more information.

Make variable Issue_Date of date-time type:

library("lubridate")

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:data.table':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday,
##     week, yday, year

## The following object is masked from 'package:base':
## 
##     date

class(nyc$Issue_Date)

## [1] "character"

nyc %<>% mutate(Issue_Date = mdy(Issue_Date))
class(nyc$Issue_Date)

## [1] "Date"

nyc

## # A tibble: 9,100,278 x 14
##    Registration_St… Plate_Type Issue_Date Violation_Code Vehicle_Body_Ty…
##    <chr>            <chr>      <date>              <dbl> <chr>           
##  1 NY               PAS        1970-12-18             20 SUBN            
##  2 NY               COM        1971-02-02             46 DELV            
##  3 NY               SRF        1971-09-18             21 VAN             
##  4 TX               PAS        1971-09-18             21 <NA>            
##  5 NY               SRF        1971-09-18             21 SUBN            
##  6 NY               PAS        1971-09-18             40 SDN             
##  7 NY               999        1971-10-10             14 BUS             
##  8 VA               PAS        1973-04-05             14 SDN             
##  9 NY               COM        1973-07-22             46 DELV            
## 10 NY               PAS        1973-08-12             21 SUBN            
## # … with 9,100,268 more rows, and 9 more variables: Vehicle_Make <chr>,
## #   Issuing_Agency <chr>, Violation_Location <chr>,
## #   Violation_Precinct <dbl>, Violation_Time <chr>, Number <chr>,
## #   Street <chr>, Intersecting_Street <chr>, Vehicle_Color <chr>

Only keep data in years 2013-4.

range(nyc$Issue_Date)

## [1] "1970-12-18" "2069-12-23"

nyc$Issue_Date %>% year() %>% table()

## .
##    1970    1971    1973    1974    1976    1977    1979    1981    1983 
##       1       6      10       1       2       1       2       4       1 
##    1984    1987    1990    1991    1996    2000    2001    2002    2003 
##       2       3       2       1       1     319      91       7      39 
##    2004    2005    2006    2007    2008    2009    2010    2011    2012 
##      77       9      11      13       8       9     129     251     618 
##    2013    2014    2015    2016    2017    2018    2019    2020    2021 
## 4379109 4716512    1522     296     309     181     329      18      26 
##    2022    2023    2024    2025    2026    2027    2028    2029    2030 
##       1      31      23      10       4       4       7       3      45 
##    2031    2032    2033    2040    2041    2043    2044    2045    2046 
##      93       3       8       1      39       9       9       2       7 
##    2047    2048    2049    2050    2051    2052    2053    2060    2061 
##       6       1       3       1      12       2       1       3      10 
##    2063    2064    2066    2067    2069 
##       9       5       3       2       1

filter(nyc, Issue_Date >= mdy("1/1/2013"), Issue_Date <= mdy("12/31/2014"))

## # A tibble: 9,095,621 x 14
##    Registration_St… Plate_Type Issue_Date Violation_Code Vehicle_Body_Ty…
##    <chr>            <chr>      <date>              <dbl> <chr>           
##  1 NY               PAS        2013-07-29             94 <NA>            
##  2 NY               COM        2013-01-07             19 DELV            
##  3 NY               COM        2013-01-01             46 DELV            
##  4 IL               PAS        2013-01-01             78 VAN             
##  5 NJ               PAS        2013-01-01             46 SUBN            
##  6 NY               PAS        2013-01-01             46 SDN             
##  7 NY               COM        2013-01-01             78 VAN             
##  8 NY               COM        2013-01-01             46 SDN             
##  9 OK               PAS        2013-01-01             47 TRLR            
## 10 NY               COM        2013-01-01             45 DELV            
## # … with 9,095,611 more rows, and 9 more variables: Vehicle_Make <chr>,
## #   Issuing_Agency <chr>, Violation_Location <chr>,
## #   Violation_Precinct <dbl>, Violation_Time <chr>, Number <chr>,
## #   Street <chr>, Intersecting_Street <chr>, Vehicle_Color <chr>

Putting it all together

system.time({
  nyc <- read_csv("NYParkingViolations.csv") %>%
    setNames(str_replace_all(names(.)," ", "_")) %>%
    select(Registration_State:Issuing_Agency, 
         Violation_Location, Violation_Precinct, Violation_Time,
         Number:Intersecting_Street, Vehicle_Color) %>%
    mutate(Issue_Date = mdy(Issue_Date)) %>% 
    mutate(Issue_Day = day(Issue_Date),
           Issue_Month = month(Issue_Date),
           Issue_Year = year(Issue_Date),
           Issue_WDay = wday(Issue_Date, label=TRUE)) %>%
    filter(Issue_Year %in% 2013:2014)  
})

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   `Summons Number` = col_double(),
##   `Violation Code` = col_double(),
##   `Street Code1` = col_double(),
##   `Street Code2` = col_double(),
##   `Street Code3` = col_double(),
##   `Vehicle Expiration Date` = col_double(),
##   `Violation Precinct` = col_double(),
##   `Issuer Precinct` = col_double(),
##   `Issuer Code` = col_double(),
##   `Date First Observed` = col_double(),
##   `Law Section` = col_double(),
##   `Violation Legal Code` = col_double(),
##   `Unregistered Vehicle?` = col_double(),
##   `Vehicle Year` = col_double(),
##   `Feet From Curb` = col_double(),
##   `Violation Description` = col_logical(),
##   `No Standing or Stopping Violation` = col_logical(),
##   `Hydrant Violation` = col_logical(),
##   `Double Parking Violation` = col_logical()
## )

## See spec(...) for full column specifications.

## Warning: 8523062 parsing failures.
##  row                   col           expected             actual                      file
## 2565 Violation Legal Code  a double           T                  'NYParkingViolations.csv'
## 2565 Violation Description 1/0/T/F/TRUE/FALSE BUS LANE VIOLATION 'NYParkingViolations.csv'
## 3786 Violation Legal Code  a double           T                  'NYParkingViolations.csv'
## 3786 Violation Description 1/0/T/F/TRUE/FALSE BUS LANE VIOLATION 'NYParkingViolations.csv'
## 3991 Violation Legal Code  a double           T                  'NYParkingViolations.csv'
## .... ..................... .................. .................. .........................
## See problems(...) for more details.

##    user  system elapsed 
##  50.642   3.034  53.791

nyc

## # A tibble: 9,095,621 x 18
##    Registration_St… Plate_Type Issue_Date Violation_Code Vehicle_Body_Ty…
##    <chr>            <chr>      <date>              <dbl> <chr>           
##  1 NY               PAS        2013-07-29             94 <NA>            
##  2 NY               COM        2013-01-07             19 DELV            
##  3 NY               COM        2013-01-01             46 DELV            
##  4 IL               PAS        2013-01-01             78 VAN             
##  5 NJ               PAS        2013-01-01             46 SUBN            
##  6 NY               PAS        2013-01-01             46 SDN             
##  7 NY               COM        2013-01-01             78 VAN             
##  8 NY               COM        2013-01-01             46 SDN             
##  9 OK               PAS        2013-01-01             47 TRLR            
## 10 NY               COM        2013-01-01             45 DELV            
## # … with 9,095,611 more rows, and 13 more variables: Vehicle_Make <chr>,
## #   Issuing_Agency <chr>, Violation_Location <chr>,
## #   Violation_Precinct <dbl>, Violation_Time <chr>, Number <chr>,
## #   Street <chr>, Intersecting_Street <chr>, Vehicle_Color <chr>,
## #   Issue_Day <int>, Issue_Month <dbl>, Issue_Year <dbl>, Issue_WDay <ord>

Ticket frequency

nyc %>% 
  group_by(Issue_Date) %>% 
  summarize(n = n()) %>%
  ggplot(aes(x = Issue_Date, y = n)) + 
    geom_line() + 
    xlim(mdy("7/1/2013"), mdy("6/30/2014"))

## Warning: Removed 365 rows containing missing values (geom_path).

Create an SQLite database

Import the nyc data set into the SQLite database:

library("DBI")
library("RSQLite")
library("tidyverse")

if (Sys.info()[["sysname"]] == "Linux") {
  db <- dbConnect(RSQLite::SQLite(), 
                  dbname = "/home/m280data/nyc_parking/NYParkingViolations.sqlite")
} else if (Sys.info()[["sysname"]] == "Darwin") {
  db <- dbConnect(RSQLite::SQLite(), 
                  dbname = "./NYParkingViolations.sqlite")
}
dbWriteTable(db, "nyc", nyc, overwrite = TRUE)
dbListTables(db)
dbDisconnect(db)

You can create a symbolic link to the database file by Bash command

ln -sf /home/m280data/nyc_parking/NYParkingViolations.sqlite NYParkingViolations.sqlite

Size of the database file

ls -l NYParkingViolations.sqlite

## -rw-r--r--  1 huazhou  staff  729309184 Feb 18 21:29 NYParkingViolations.sqlite

Read data from database

Connect to table in database:

db <- dbConnect(RSQLite::SQLite(), dbname = "./NYParkingViolations.sqlite")
dbListTables(db)

## [1] "nyc"

nyc_sql <- dplyr::tbl(db, "nyc")
str(nyc_sql)

## List of 2
##  $ src:List of 2
##   ..$ con  :Formal class 'SQLiteConnection' [package "RSQLite"] with 7 slots
##   .. .. ..@ ptr                :<externalptr> 
##   .. .. ..@ dbname             : chr "/Users/huazhou/Documents/github.com/Hua-Zhou.github.io/teaching/biostatm280-2019winter/slides/12-dbplyr/NYParki"| __truncated__
##   .. .. ..@ loadable.extensions: logi TRUE
##   .. .. ..@ flags              : int 70
##   .. .. ..@ vfs                : chr ""
##   .. .. ..@ ref                :<environment: 0x7feced8777d8> 
##   .. .. ..@ bigint             : chr "integer64"
##   ..$ disco: NULL
##   ..- attr(*, "class")= chr [1:4] "src_SQLiteConnection" "src_dbi" "src_sql" "src"
##  $ ops:List of 2
##   ..$ x   : 'ident' chr "nyc"
##   ..$ vars: chr [1:18] "Registration_State" "Plate_Type" "Issue_Date" "Violation_Code" ...
##   ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
##  - attr(*, "class")= chr [1:5] "tbl_SQLiteConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" ...

nyc_sql %>% print(width = Inf)

## # Source:   table<nyc> [?? x 18]
## # Database: sqlite 3.22.0
## #   [/Users/huazhou/Documents/github.com/Hua-Zhou.github.io/teaching/biostatm280-2019winter/slides/12-dbplyr/NYParkingViolations.sqlite]
##    Registration_State Plate_Type Issue_Date Violation_Code
##    <chr>              <chr>           <dbl>          <dbl>
##  1 NY                 PAS             15915             94
##  2 NY                 COM             15712             19
##  3 NY                 COM             15706             46
##  4 IL                 PAS             15706             78
##  5 NJ                 PAS             15706             46
##  6 NY                 PAS             15706             46
##  7 NY                 COM             15706             78
##  8 NY                 COM             15706             46
##  9 OK                 PAS             15706             47
## 10 NY                 COM             15706             45
##    Vehicle_Body_Type Vehicle_Make Issuing_Agency Violation_Location
##    <chr>             <chr>        <chr>          <chr>             
##  1 <NA>              TRIUM        X              <NA>              
##  2 DELV              <NA>         P              0076              
##  3 DELV              FRUEH        P              110               
##  4 VAN               FORD         P              0025              
##  5 SUBN              CHEVR        P              0081              
##  6 SDN               HONDA        P              0078              
##  7 VAN               FORD         P              110               
##  8 SDN               ISUZU        P              112               
##  9 TRLR              INTER        P              0013              
## 10 DELV              FRUEH        P              0026              
##    Violation_Precinct Violation_Time Number Street          
##                 <dbl> <chr>          <chr>  <chr>           
##  1                  0 <NA>           <NA>   <NA>            
##  2                 76 0407A          548    COURT STREET    
##  3                110 1045A          88-10  WHITNEY AVE     
##  4                 25 0332A          2      W 120TH ST      
##  5                 81 0139P          427    HART ST         
##  6                 78 1127A          565    UNION ST        
##  7                110 0412A          40-59  95 ST           
##  8                112 0554P          67-70  YELLOWSTONE BLVD
##  9                 13 0745A          456    3RD AVE         
## 10                 26 0330P          44     MORNINGSIDE DR  
##    Intersecting_Street Vehicle_Color Issue_Day Issue_Month Issue_Year
##    <chr>               <chr>             <int>       <dbl>      <dbl>
##  1 <NA>                GREEN                29           7       2013
##  2 <NA>                <NA>                  7           1       2013
##  3 <NA>                WHIT                  1           1       2013
##  4 <NA>                WHT                   1           1       2013
##  5 <NA>                RD                    1           1       2013
##  6 <NA>                GREY                  1           1       2013
##  7 <NA>                WH                    1           1       2013
##  8 <NA>                WH                    1           1       2013
##  9 <NA>                WHITE                 1           1       2013
## 10 <NA>                WHITE                 1           1       2013
##    Issue_WDay
##    <chr>     
##  1 Mon       
##  2 Mon       
##  3 Tue       
##  4 Tue       
##  5 Tue       
##  6 Tue       
##  7 Tue       
##  8 Tue       
##  9 Tue       
## 10 Tue       
## # … with more rows

Use dplyr with SQLite

addr <- nyc_sql %>%
  select(Issue_Date, Issuing_Agency, Violation_Precinct, Number, Street) %>%
  filter(Violation_Precinct >= 1, Violation_Precinct <= 34)
addr

## # Source:   lazy query [?? x 5]
## # Database: sqlite 3.22.0
## #   [/Users/huazhou/Documents/github.com/Hua-Zhou.github.io/teaching/biostatm280-2019winter/slides/12-dbplyr/NYParkingViolations.sqlite]
##    Issue_Date Issuing_Agency Violation_Precinct Number  Street         
##         <dbl> <chr>                       <dbl> <chr>   <chr>          
##  1      15706 P                              25 2       W 120TH ST     
##  2      15706 P                              13 456     3RD AVE        
##  3      15706 P                              26 44      MORNINGSIDE DR 
##  4      15706 X                              14 8       PENN PLAZA     
##  5      15706 P                              25 2123    MADISON AVE    
##  6      15706 P                              18 413     48 TH ST       
##  7      15706 P                              12 630     LEXINGTON AVE  
##  8      15706 P                              25 219     E 121 ST       
##  9      15706 P                              26 1420    AMSTERDAM AVE  
## 10      15706 P                              26 488-490 ST NICHOLAS AVE
## # … with more rows

SQL query

class(addr)

## [1] "tbl_SQLiteConnection" "tbl_dbi"              "tbl_sql"             
## [4] "tbl_lazy"             "tbl"

show_query(addr)

## <SQL>
## SELECT *
## FROM (SELECT `Issue_Date`, `Issuing_Agency`, `Violation_Precinct`, `Number`, `Street`
## FROM `nyc`)
## WHERE ((`Violation_Precinct` >= 1.0) AND (`Violation_Precinct` <= 34.0))

Mutate and summarise

addr %>% mutate(address = paste(Number, Street))

## # Source:   lazy query [?? x 6]
## # Database: sqlite 3.22.0
## #   [/Users/huazhou/Documents/github.com/Hua-Zhou.github.io/teaching/biostatm280-2019winter/slides/12-dbplyr/NYParkingViolations.sqlite]
##    Issue_Date Issuing_Agency Violation_Precin… Number  Street   address    
##         <dbl> <chr>                      <dbl> <chr>   <chr>    <chr>      
##  1      15706 P                             25 2       W 120TH… 2 W 120TH …
##  2      15706 P                             13 456     3RD AVE  456 3RD AVE
##  3      15706 P                             26 44      MORNING… 44 MORNING…
##  4      15706 X                             14 8       PENN PL… 8 PENN PLA…
##  5      15706 P                             25 2123    MADISON… 2123 MADIS…
##  6      15706 P                             18 413     48 TH ST 413 48 TH …
##  7      15706 P                             12 630     LEXINGT… 630 LEXING…
##  8      15706 P                             25 219     E 121 ST 219 E 121 …
##  9      15706 P                             26 1420    AMSTERD… 1420 AMSTE…
## 10      15706 P                             26 488-490 ST NICH… 488-490 ST…
## # … with more rows

addr %>% summarize(mean = mean(Violation_Precinct, na.rm = TRUE))

## # Source:   lazy query [?? x 1]
## # Database: sqlite 3.22.0
## #   [/Users/huazhou/Documents/github.com/Hua-Zhou.github.io/teaching/biostatm280-2019winter/slides/12-dbplyr/NYParkingViolations.sqlite]
##    mean
##   <dbl>
## 1  16.1

SQL grouping

addr %>% 
  group_by(Issuing_Agency, Violation_Precinct) %>% 
  summarize(n = n())

## # Source:   lazy query [?? x 3]
## # Database: sqlite 3.22.0
## #   [/Users/huazhou/Documents/github.com/Hua-Zhou.github.io/teaching/biostatm280-2019winter/slides/12-dbplyr/NYParkingViolations.sqlite]
## # Groups:   Issuing_Agency
##    Issuing_Agency Violation_Precinct     n
##    <chr>                       <dbl> <int>
##  1 A                               1    13
##  2 A                               7     1
##  3 A                              10    24
##  4 A                              11     1
##  5 A                              14    47
##  6 A                              33    11
##  7 B                              25     2
##  8 C                               5    73
##  9 C                              13     7
## 10 D                               1     1
## # … with more rows

SQL query

addr %>% 
  group_by(Issuing_Agency, Violation_Precinct) %>% 
  summarize(n = n()) %>% 
  show_query()

## <SQL>
## SELECT `Issuing_Agency`, `Violation_Precinct`, COUNT() AS `n`
## FROM (SELECT `Issue_Date`, `Issuing_Agency`, `Violation_Precinct`, `Number`, `Street`
## FROM `nyc`)
## WHERE ((`Violation_Precinct` >= 1.0) AND (`Violation_Precinct` <= 34.0))
## GROUP BY `Issuing_Agency`, `Violation_Precinct`

SQL translation

dbplyr package (a dplyr backend for databases) has a function, translate_sql, that lets you experiment with how R functions are translated to SQL:

library("dbplyr")

## 
## Attaching package: 'dbplyr'

## The following objects are masked from 'package:dplyr':
## 
##     ident, sql

translate_sql(x == 1 & (y < 2 | z > 3))

## <SQL> "x" = 1.0 AND ("y" < 2.0 OR "z" > 3.0)

translate_sql(x ^ 2 < 10)

## <SQL> POWER("x", 2.0) < 10.0

translate_sql(x %% 2 == 10)

## <SQL> "x" % 2.0 = 10.0

translate_sql(paste(x, y))

## <SQL> PASTE("x", "y")

translate_sql(mean(x))

## Warning: Missing values are always removed in SQL.
## Use `avg(x, na.rm = TRUE)` to silence this warning

## <SQL> avg("x") OVER ()

translate_sql(mean(x, na.rm = TRUE))

## <SQL> avg("x") OVER ()

Timings

dplyr using tibble:

system.time(
  nyc %>%
    select(Issue_Date, Issuing_Agency, Violation_Precinct, Number, Street) %>%
    filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>%
    group_by(Issuing_Agency, Violation_Precinct) %>%
    summarize(n = n())
)

##    user  system elapsed 
##   0.325   0.098   0.424

dplyr using SQLite:

system.time(
  nyc_sql %>%
    select(Issue_Date, Issuing_Agency, Violation_Precinct, Number, Street) %>%
    filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>%
    group_by(Issuing_Agency, Violation_Precinct) %>%
    summarize(n = n())
)

##    user  system elapsed 
##   0.005   0.000   0.005

nyc_sql was 30x times faster than nyc, but the former is disk based while the latter is in memory. Why is the discrepancy?

Laziness

dplyr/dbplyr uses lazy evaluation as much as possible, particularly when working with non-local backends.

When building a query, we don’t want the entire table, often we want just enough to check if our query is working.
Since we would prefer to run one complex query over many simple queries, laziness allows for verbs to be strung together.
Therefore, by default dplyr
- won’t connect and query the database until absolutely necessary (e.g. show output),
- and unless explicitly told to, will only query a handful of rows to give a sense of what the result will look like

nyc_sql %>%
  select(Issue_Date, Issuing_Agency, Violation_Precinct, Number, Street) %>%
  filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>%
  group_by(Issuing_Agency, Violation_Precinct) %>%
  summarize(n = n())

## # Source:   lazy query [?? x 3]
## # Database: sqlite 3.22.0
## #   [/Users/huazhou/Documents/github.com/Hua-Zhou.github.io/teaching/biostatm280-2019winter/slides/12-dbplyr/NYParkingViolations.sqlite]
## # Groups:   Issuing_Agency
##    Issuing_Agency Violation_Precinct     n
##    <chr>                       <dbl> <int>
##  1 A                               1    13
##  2 A                               7     1
##  3 A                              10    24
##  4 A                              11     1
##  5 A                              14    47
##  6 A                              33    11
##  7 B                              25     2
##  8 C                               5    73
##  9 C                              13     7
## 10 D                               1     1
## # … with more rows

Full query

To force a full query and return a complete table it is necessary to use the collect function.

system.time(
  nyc_sql %>%
    select(Issue_Date, Issuing_Agency, Violation_Precinct, Number, Street) %>%
    filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>%
    group_by(Issuing_Agency, Violation_Precinct) %>%
    summarize(n = n()) %>%
    collect()
)

##    user  system elapsed 
##   3.314   0.405   3.763

Plotting

Bar plot

Suppose we want the bar plot of number of tickets on each weekday. We encounter error (???) with the usual ggplot2 command:

nyc_sql %>%
  ggplot() + 
    geom_bar(aes(x = Issue_WDay))

Apparently current version of ggplot2 (v3.1.0) works for tbl_dbi/tbl_sql/tbl_lazy now.

Older version of ggplot2, e.g., v2.2.1, will output error. This is becasue ggplot2 needed to compute the count per category by going through all the rows. But here nyc_sql is just a pointer to the SQLite table. We have to use the transform in database, plot in R strategy.

nyc_sql %>%
  group_by(Issue_WDay) %>%
  count() %>%
  collect() %>%
  ggplot() + 
    geom_col(aes(x = Issue_WDay, y = n))

Other plots

For example of making histogram and raster plot, read tutorial https://db.rstudio.com/best-practices/visualization/.

Close connection to database

dbDisconnect(db)

Working with Databases - Part II

Biostat M280

Dr. Hua Zhou @ UCLA

Feb 19, 2019

Introduction

Machine information

CSV file

Read csv file

Fix column names

Simplifying

Fix dates

Putting it all together

Ticket frequency

Create an SQLite database

Read data from database

Use dplyr with SQLite

SQL query

Mutate and summarise

SQL grouping

SQL query

SQL translation

Timings

Laziness

Full query

Plotting

Bar plot

Other plots

Close connection to database