In this lecture we follow a tutorial by Dr. Colin Rundel to handle a bigish data set. We will learn:
Import data from a bigish csv file.
Tidy data.
Deposit data into an SQLite database.
Query SQLite database.
Transform in database and plot in R.
sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.3
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] compiler_3.5.2 magrittr_1.5 tools_3.5.2 htmltools_0.3.6
## [5] yaml_2.2.0 Rcpp_1.0.0 stringi_1.2.4 rmarkdown_1.11
## [9] knitr_1.21 stringr_1.4.0 xfun_0.4 digest_0.6.18
## [13] evaluate_0.13
The /home/m280data/nyc_parking/NYParkingViolations.csv
file on teaching server contains information about parking tickets in NYC. You can create a symbolic link to the data file by Bash command
ln -sf /home/m280data/nyc_parking/NYParkingViolations.csv NYParkingViolations.csv
The CSV file is about 1.7GB
ls -l NYParkingViolations.csv
## -rw-r--r--@ 1 huazhou staff 1793818408 Feb 11 2018 NYParkingViolations.csv
How many rows?
wc -l < NYParkingViolations.csv
## 9100279
First few lines:
head -5 NYParkingViolations.csv
## Summons Number,Plate ID,Registration State,Plate Type,Issue Date,Violation Code,Vehicle Body Type,Vehicle Make,Issuing Agency,Street Code1,Street Code2,Street Code3,Vehicle Expiration Date,Violation Location,Violation Precinct,Issuer Precinct,Issuer Code,Issuer Command,Issuer Squad,Violation Time,Time First Observed,Violation County,Violation In Front Of Or Opposite,Number,Street,Intersecting Street,Date First Observed,Law Section,Sub Division,Violation Legal Code,Days Parking In Effect ,From Hours In Effect,To Hours In Effect,Vehicle Color,Unregistered Vehicle?,Vehicle Year,Meter Number,Feet From Curb,Violation Post Code,Violation Description,No Standing or Stopping Violation,Hydrant Violation,Double Parking Violation
## 1361929741,FCJ5493,NY,PAS,12/18/1970,20,SUBN,GMC,S,35030,89180,32600,20140728,0070,70,0,829379,KS14,0000,0358A,,K,F,959,E 5 ST,,0,408,E2,,BBBBBBB,ALL,ALL,BLACK,0,2013,-,0,,,,,
## 1366962000,63540MC,NY,COM,02/02/1971,46,DELV,FRUEH,P,58830,11430,11450,0,0068,68,68,928157,0068,0000,1116A,,K,F,185,MARINE AVENUE,,0,408,F1,,BBBBBBB,ALL,ALL,BRN,0,2013,-,0,,,,,
## 1342296187,GCY4187,NY,SRF,09/18/1971,21,VAN,FORD,S,11790,35780,57890,20150430,104,104,0,515350,QW05,0000,0939A,,,O,60-25,56 ST,,0,408,D1,,BBYBBBB,0930A,1100A,BLUE,0,2002,-,0,,,,,
## 1342296199,95V6675,TX,PAS,09/18/1971,21,,GMC,S,11790,35780,57890,0,104,104,0,515350,QW05,0000,0945A,,,F,60-12,56 ST,,0,408,D1,,BBYBBBB,0930A,1100A,SILVR,0,2008,-,0,,,,,
read.csv()
function in base R takes nearly 3-4 minutes to read in the 1.7GB csv file:
system.time(read.csv("NYParkingViolations.csv"))
## user system elapsed
## 196.861 10.987 210.255
data.table is an R package for reading large data sets:
library(data.table)
system.time({nyc = fread("NYParkingViolations.csv")})
## user system elapsed
## 81.599 2.019 11.529
class(nyc)
## [1] "data.table" "data.frame"
nyc = as.data.frame(nyc)
class(nyc)
## [1] "data.frame"
read_csv()
function in tidyverse is 4-5 times faster than base R:
library("tidyverse")
## ── Attaching packages ────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0 ✔ purrr 0.3.0
## ✔ tibble 2.0.1 ✔ dplyr 0.7.8
## ✔ tidyr 0.8.2 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::between() masks data.table::between()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::first() masks data.table::first()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::last() masks data.table::last()
## ✖ purrr::transpose() masks data.table::transpose()
system.time({nyc = read_csv("NYParkingViolations.csv")})
## Parsed with column specification:
## cols(
## .default = col_character(),
## `Summons Number` = col_double(),
## `Violation Code` = col_double(),
## `Street Code1` = col_double(),
## `Street Code2` = col_double(),
## `Street Code3` = col_double(),
## `Vehicle Expiration Date` = col_double(),
## `Violation Precinct` = col_double(),
## `Issuer Precinct` = col_double(),
## `Issuer Code` = col_double(),
## `Date First Observed` = col_double(),
## `Law Section` = col_double(),
## `Violation Legal Code` = col_double(),
## `Unregistered Vehicle?` = col_double(),
## `Vehicle Year` = col_double(),
## `Feet From Curb` = col_double(),
## `Violation Description` = col_logical(),
## `No Standing or Stopping Violation` = col_logical(),
## `Hydrant Violation` = col_logical(),
## `Double Parking Violation` = col_logical()
## )
## See spec(...) for full column specifications.
## Warning: 8523062 parsing failures.
## row col expected actual file
## 2565 Violation Legal Code a double T 'NYParkingViolations.csv'
## 2565 Violation Description 1/0/T/F/TRUE/FALSE BUS LANE VIOLATION 'NYParkingViolations.csv'
## 3786 Violation Legal Code a double T 'NYParkingViolations.csv'
## 3786 Violation Description 1/0/T/F/TRUE/FALSE BUS LANE VIOLATION 'NYParkingViolations.csv'
## 3991 Violation Legal Code a double T 'NYParkingViolations.csv'
## .... ..................... .................. .................. .........................
## See problems(...) for more details.
## user system elapsed
## 38.759 2.271 41.161
nyc
## # A tibble: 9,100,278 x 43
## `Summons Number` `Plate ID` `Registration S… `Plate Type` `Issue Date`
## <dbl> <chr> <chr> <chr> <chr>
## 1 1361929741 FCJ5493 NY PAS 12/18/1970
## 2 1366962000 63540MC NY COM 02/02/1971
## 3 1342296187 GCY4187 NY SRF 09/18/1971
## 4 1342296199 95V6675 TX PAS 09/18/1971
## 5 1342296217 FYM5117 NY SRF 09/18/1971
## 6 1356906515 GFM1421 NY PAS 09/18/1971
## 7 1337077380 18972BB NY 999 10/10/1971
## 8 1364523796 WNJ4730 VA PAS 04/05/1973
## 9 1359914924 68091JZ NY COM 07/22/1973
## 10 1355498326 EWV4127 NY PAS 08/12/1973
## # … with 9,100,268 more rows, and 38 more variables: `Violation
## # Code` <dbl>, `Vehicle Body Type` <chr>, `Vehicle Make` <chr>, `Issuing
## # Agency` <chr>, `Street Code1` <dbl>, `Street Code2` <dbl>, `Street
## # Code3` <dbl>, `Vehicle Expiration Date` <dbl>, `Violation
## # Location` <chr>, `Violation Precinct` <dbl>, `Issuer Precinct` <dbl>,
## # `Issuer Code` <dbl>, `Issuer Command` <chr>, `Issuer Squad` <chr>,
## # `Violation Time` <chr>, `Time First Observed` <chr>, `Violation
## # County` <chr>, `Violation In Front Of Or Opposite` <chr>,
## # Number <chr>, Street <chr>, `Intersecting Street` <chr>, `Date First
## # Observed` <dbl>, `Law Section` <dbl>, `Sub Division` <chr>, `Violation
## # Legal Code` <dbl>, `Days Parking In Effect` <chr>, `From Hours In
## # Effect` <chr>, `To Hours In Effect` <chr>, `Vehicle Color` <chr>,
## # `Unregistered Vehicle?` <dbl>, `Vehicle Year` <dbl>, `Meter
## # Number` <chr>, `Feet From Curb` <dbl>, `Violation Post Code` <chr>,
## # `Violation Description` <lgl>, `No Standing or Stopping
## # Violation` <lgl>, `Hydrant Violation` <lgl>, `Double Parking
## # Violation` <lgl>
names(nyc) <- str_replace_all(names(nyc), " ", "_")
nyc
## # A tibble: 9,100,278 x 43
## Summons_Number Plate_ID Registration_St… Plate_Type Issue_Date
## <dbl> <chr> <chr> <chr> <chr>
## 1 1361929741 FCJ5493 NY PAS 12/18/1970
## 2 1366962000 63540MC NY COM 02/02/1971
## 3 1342296187 GCY4187 NY SRF 09/18/1971
## 4 1342296199 95V6675 TX PAS 09/18/1971
## 5 1342296217 FYM5117 NY SRF 09/18/1971
## 6 1356906515 GFM1421 NY PAS 09/18/1971
## 7 1337077380 18972BB NY 999 10/10/1971
## 8 1364523796 WNJ4730 VA PAS 04/05/1973
## 9 1359914924 68091JZ NY COM 07/22/1973
## 10 1355498326 EWV4127 NY PAS 08/12/1973
## # … with 9,100,268 more rows, and 38 more variables: Violation_Code <dbl>,
## # Vehicle_Body_Type <chr>, Vehicle_Make <chr>, Issuing_Agency <chr>,
## # Street_Code1 <dbl>, Street_Code2 <dbl>, Street_Code3 <dbl>,
## # Vehicle_Expiration_Date <dbl>, Violation_Location <chr>,
## # Violation_Precinct <dbl>, Issuer_Precinct <dbl>, Issuer_Code <dbl>,
## # Issuer_Command <chr>, Issuer_Squad <chr>, Violation_Time <chr>,
## # Time_First_Observed <chr>, Violation_County <chr>,
## # Violation_In_Front_Of_Or_Opposite <chr>, Number <chr>, Street <chr>,
## # Intersecting_Street <chr>, Date_First_Observed <dbl>,
## # Law_Section <dbl>, Sub_Division <chr>, Violation_Legal_Code <dbl>,
## # Days_Parking_In_Effect <chr>, From_Hours_In_Effect <chr>,
## # To_Hours_In_Effect <chr>, Vehicle_Color <chr>,
## # `Unregistered_Vehicle?` <dbl>, Vehicle_Year <dbl>, Meter_Number <chr>,
## # Feet_From_Curb <dbl>, Violation_Post_Code <chr>,
## # Violation_Description <lgl>, No_Standing_or_Stopping_Violation <lgl>,
## # Hydrant_Violation <lgl>, Double_Parking_Violation <lgl>
We make a smaller data set with selected variables:
# magrittr for %<>%
library("magrittr")
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
(nyc %<>%
select(Registration_State:Issuing_Agency,
Violation_Location, Violation_Precinct, Violation_Time,
Number:Intersecting_Street, Vehicle_Color))
## # A tibble: 9,100,278 x 14
## Registration_St… Plate_Type Issue_Date Violation_Code Vehicle_Body_Ty…
## <chr> <chr> <chr> <dbl> <chr>
## 1 NY PAS 12/18/1970 20 SUBN
## 2 NY COM 02/02/1971 46 DELV
## 3 NY SRF 09/18/1971 21 VAN
## 4 TX PAS 09/18/1971 21 <NA>
## 5 NY SRF 09/18/1971 21 SUBN
## 6 NY PAS 09/18/1971 40 SDN
## 7 NY 999 10/10/1971 14 BUS
## 8 VA PAS 04/05/1973 14 SDN
## 9 NY COM 07/22/1973 46 DELV
## 10 NY PAS 08/12/1973 21 SUBN
## # … with 9,100,268 more rows, and 9 more variables: Vehicle_Make <chr>,
## # Issuing_Agency <chr>, Violation_Location <chr>,
## # Violation_Precinct <dbl>, Violation_Time <chr>, Number <chr>,
## # Street <chr>, Intersecting_Street <chr>, Vehicle_Color <chr>
Hadley Wickham’s lubridate package facilitates handling date and time. Read Chapter 16 of R for Data Science for more information.
Make variable Issue_Date
of date-time type:
library("lubridate")
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
##
## hour, isoweek, mday, minute, month, quarter, second, wday,
## week, yday, year
## The following object is masked from 'package:base':
##
## date
class(nyc$Issue_Date)
## [1] "character"
nyc %<>% mutate(Issue_Date = mdy(Issue_Date))
class(nyc$Issue_Date)
## [1] "Date"
nyc
## # A tibble: 9,100,278 x 14
## Registration_St… Plate_Type Issue_Date Violation_Code Vehicle_Body_Ty…
## <chr> <chr> <date> <dbl> <chr>
## 1 NY PAS 1970-12-18 20 SUBN
## 2 NY COM 1971-02-02 46 DELV
## 3 NY SRF 1971-09-18 21 VAN
## 4 TX PAS 1971-09-18 21 <NA>
## 5 NY SRF 1971-09-18 21 SUBN
## 6 NY PAS 1971-09-18 40 SDN
## 7 NY 999 1971-10-10 14 BUS
## 8 VA PAS 1973-04-05 14 SDN
## 9 NY COM 1973-07-22 46 DELV
## 10 NY PAS 1973-08-12 21 SUBN
## # … with 9,100,268 more rows, and 9 more variables: Vehicle_Make <chr>,
## # Issuing_Agency <chr>, Violation_Location <chr>,
## # Violation_Precinct <dbl>, Violation_Time <chr>, Number <chr>,
## # Street <chr>, Intersecting_Street <chr>, Vehicle_Color <chr>
Only keep data in years 2013-4.
range(nyc$Issue_Date)
## [1] "1970-12-18" "2069-12-23"
nyc$Issue_Date %>% year() %>% table()
## .
## 1970 1971 1973 1974 1976 1977 1979 1981 1983
## 1 6 10 1 2 1 2 4 1
## 1984 1987 1990 1991 1996 2000 2001 2002 2003
## 2 3 2 1 1 319 91 7 39
## 2004 2005 2006 2007 2008 2009 2010 2011 2012
## 77 9 11 13 8 9 129 251 618
## 2013 2014 2015 2016 2017 2018 2019 2020 2021
## 4379109 4716512 1522 296 309 181 329 18 26
## 2022 2023 2024 2025 2026 2027 2028 2029 2030
## 1 31 23 10 4 4 7 3 45
## 2031 2032 2033 2040 2041 2043 2044 2045 2046
## 93 3 8 1 39 9 9 2 7
## 2047 2048 2049 2050 2051 2052 2053 2060 2061
## 6 1 3 1 12 2 1 3 10
## 2063 2064 2066 2067 2069
## 9 5 3 2 1
filter(nyc, Issue_Date >= mdy("1/1/2013"), Issue_Date <= mdy("12/31/2014"))
## # A tibble: 9,095,621 x 14
## Registration_St… Plate_Type Issue_Date Violation_Code Vehicle_Body_Ty…
## <chr> <chr> <date> <dbl> <chr>
## 1 NY PAS 2013-07-29 94 <NA>
## 2 NY COM 2013-01-07 19 DELV
## 3 NY COM 2013-01-01 46 DELV
## 4 IL PAS 2013-01-01 78 VAN
## 5 NJ PAS 2013-01-01 46 SUBN
## 6 NY PAS 2013-01-01 46 SDN
## 7 NY COM 2013-01-01 78 VAN
## 8 NY COM 2013-01-01 46 SDN
## 9 OK PAS 2013-01-01 47 TRLR
## 10 NY COM 2013-01-01 45 DELV
## # … with 9,095,611 more rows, and 9 more variables: Vehicle_Make <chr>,
## # Issuing_Agency <chr>, Violation_Location <chr>,
## # Violation_Precinct <dbl>, Violation_Time <chr>, Number <chr>,
## # Street <chr>, Intersecting_Street <chr>, Vehicle_Color <chr>
system.time({
nyc <- read_csv("NYParkingViolations.csv") %>%
setNames(str_replace_all(names(.)," ", "_")) %>%
select(Registration_State:Issuing_Agency,
Violation_Location, Violation_Precinct, Violation_Time,
Number:Intersecting_Street, Vehicle_Color) %>%
mutate(Issue_Date = mdy(Issue_Date)) %>%
mutate(Issue_Day = day(Issue_Date),
Issue_Month = month(Issue_Date),
Issue_Year = year(Issue_Date),
Issue_WDay = wday(Issue_Date, label=TRUE)) %>%
filter(Issue_Year %in% 2013:2014)
})
## Parsed with column specification:
## cols(
## .default = col_character(),
## `Summons Number` = col_double(),
## `Violation Code` = col_double(),
## `Street Code1` = col_double(),
## `Street Code2` = col_double(),
## `Street Code3` = col_double(),
## `Vehicle Expiration Date` = col_double(),
## `Violation Precinct` = col_double(),
## `Issuer Precinct` = col_double(),
## `Issuer Code` = col_double(),
## `Date First Observed` = col_double(),
## `Law Section` = col_double(),
## `Violation Legal Code` = col_double(),
## `Unregistered Vehicle?` = col_double(),
## `Vehicle Year` = col_double(),
## `Feet From Curb` = col_double(),
## `Violation Description` = col_logical(),
## `No Standing or Stopping Violation` = col_logical(),
## `Hydrant Violation` = col_logical(),
## `Double Parking Violation` = col_logical()
## )
## See spec(...) for full column specifications.
## Warning: 8523062 parsing failures.
## row col expected actual file
## 2565 Violation Legal Code a double T 'NYParkingViolations.csv'
## 2565 Violation Description 1/0/T/F/TRUE/FALSE BUS LANE VIOLATION 'NYParkingViolations.csv'
## 3786 Violation Legal Code a double T 'NYParkingViolations.csv'
## 3786 Violation Description 1/0/T/F/TRUE/FALSE BUS LANE VIOLATION 'NYParkingViolations.csv'
## 3991 Violation Legal Code a double T 'NYParkingViolations.csv'
## .... ..................... .................. .................. .........................
## See problems(...) for more details.
## user system elapsed
## 50.642 3.034 53.791
nyc
## # A tibble: 9,095,621 x 18
## Registration_St… Plate_Type Issue_Date Violation_Code Vehicle_Body_Ty…
## <chr> <chr> <date> <dbl> <chr>
## 1 NY PAS 2013-07-29 94 <NA>
## 2 NY COM 2013-01-07 19 DELV
## 3 NY COM 2013-01-01 46 DELV
## 4 IL PAS 2013-01-01 78 VAN
## 5 NJ PAS 2013-01-01 46 SUBN
## 6 NY PAS 2013-01-01 46 SDN
## 7 NY COM 2013-01-01 78 VAN
## 8 NY COM 2013-01-01 46 SDN
## 9 OK PAS 2013-01-01 47 TRLR
## 10 NY COM 2013-01-01 45 DELV
## # … with 9,095,611 more rows, and 13 more variables: Vehicle_Make <chr>,
## # Issuing_Agency <chr>, Violation_Location <chr>,
## # Violation_Precinct <dbl>, Violation_Time <chr>, Number <chr>,
## # Street <chr>, Intersecting_Street <chr>, Vehicle_Color <chr>,
## # Issue_Day <int>, Issue_Month <dbl>, Issue_Year <dbl>, Issue_WDay <ord>
nyc %>%
group_by(Issue_Date) %>%
summarize(n = n()) %>%
ggplot(aes(x = Issue_Date, y = n)) +
geom_line() +
xlim(mdy("7/1/2013"), mdy("6/30/2014"))
## Warning: Removed 365 rows containing missing values (geom_path).
Import the nyc
data set into the SQLite database:
library("DBI")
library("RSQLite")
library("tidyverse")
if (Sys.info()[["sysname"]] == "Linux") {
db <- dbConnect(RSQLite::SQLite(),
dbname = "/home/m280data/nyc_parking/NYParkingViolations.sqlite")
} else if (Sys.info()[["sysname"]] == "Darwin") {
db <- dbConnect(RSQLite::SQLite(),
dbname = "./NYParkingViolations.sqlite")
}
dbWriteTable(db, "nyc", nyc, overwrite = TRUE)
dbListTables(db)
dbDisconnect(db)
You can create a symbolic link to the database file by Bash command
ln -sf /home/m280data/nyc_parking/NYParkingViolations.sqlite NYParkingViolations.sqlite
Size of the database file
ls -l NYParkingViolations.sqlite
## -rw-r--r-- 1 huazhou staff 729309184 Feb 18 21:29 NYParkingViolations.sqlite
Connect to table in database:
db <- dbConnect(RSQLite::SQLite(), dbname = "./NYParkingViolations.sqlite")
dbListTables(db)
## [1] "nyc"
nyc_sql <- dplyr::tbl(db, "nyc")
str(nyc_sql)
## List of 2
## $ src:List of 2
## ..$ con :Formal class 'SQLiteConnection' [package "RSQLite"] with 7 slots
## .. .. ..@ ptr :<externalptr>
## .. .. ..@ dbname : chr "/Users/huazhou/Documents/github.com/Hua-Zhou.github.io/teaching/biostatm280-2019winter/slides/12-dbplyr/NYParki"| __truncated__
## .. .. ..@ loadable.extensions: logi TRUE
## .. .. ..@ flags : int 70
## .. .. ..@ vfs : chr ""
## .. .. ..@ ref :<environment: 0x7feced8777d8>
## .. .. ..@ bigint : chr "integer64"
## ..$ disco: NULL
## ..- attr(*, "class")= chr [1:4] "src_SQLiteConnection" "src_dbi" "src_sql" "src"
## $ ops:List of 2
## ..$ x : 'ident' chr "nyc"
## ..$ vars: chr [1:18] "Registration_State" "Plate_Type" "Issue_Date" "Violation_Code" ...
## ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
## - attr(*, "class")= chr [1:5] "tbl_SQLiteConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" ...
nyc_sql %>% print(width = Inf)
## # Source: table<nyc> [?? x 18]
## # Database: sqlite 3.22.0
## # [/Users/huazhou/Documents/github.com/Hua-Zhou.github.io/teaching/biostatm280-2019winter/slides/12-dbplyr/NYParkingViolations.sqlite]
## Registration_State Plate_Type Issue_Date Violation_Code
## <chr> <chr> <dbl> <dbl>
## 1 NY PAS 15915 94
## 2 NY COM 15712 19
## 3 NY COM 15706 46
## 4 IL PAS 15706 78
## 5 NJ PAS 15706 46
## 6 NY PAS 15706 46
## 7 NY COM 15706 78
## 8 NY COM 15706 46
## 9 OK PAS 15706 47
## 10 NY COM 15706 45
## Vehicle_Body_Type Vehicle_Make Issuing_Agency Violation_Location
## <chr> <chr> <chr> <chr>
## 1 <NA> TRIUM X <NA>
## 2 DELV <NA> P 0076
## 3 DELV FRUEH P 110
## 4 VAN FORD P 0025
## 5 SUBN CHEVR P 0081
## 6 SDN HONDA P 0078
## 7 VAN FORD P 110
## 8 SDN ISUZU P 112
## 9 TRLR INTER P 0013
## 10 DELV FRUEH P 0026
## Violation_Precinct Violation_Time Number Street
## <dbl> <chr> <chr> <chr>
## 1 0 <NA> <NA> <NA>
## 2 76 0407A 548 COURT STREET
## 3 110 1045A 88-10 WHITNEY AVE
## 4 25 0332A 2 W 120TH ST
## 5 81 0139P 427 HART ST
## 6 78 1127A 565 UNION ST
## 7 110 0412A 40-59 95 ST
## 8 112 0554P 67-70 YELLOWSTONE BLVD
## 9 13 0745A 456 3RD AVE
## 10 26 0330P 44 MORNINGSIDE DR
## Intersecting_Street Vehicle_Color Issue_Day Issue_Month Issue_Year
## <chr> <chr> <int> <dbl> <dbl>
## 1 <NA> GREEN 29 7 2013
## 2 <NA> <NA> 7 1 2013
## 3 <NA> WHIT 1 1 2013
## 4 <NA> WHT 1 1 2013
## 5 <NA> RD 1 1 2013
## 6 <NA> GREY 1 1 2013
## 7 <NA> WH 1 1 2013
## 8 <NA> WH 1 1 2013
## 9 <NA> WHITE 1 1 2013
## 10 <NA> WHITE 1 1 2013
## Issue_WDay
## <chr>
## 1 Mon
## 2 Mon
## 3 Tue
## 4 Tue
## 5 Tue
## 6 Tue
## 7 Tue
## 8 Tue
## 9 Tue
## 10 Tue
## # … with more rows
addr <- nyc_sql %>%
select(Issue_Date, Issuing_Agency, Violation_Precinct, Number, Street) %>%
filter(Violation_Precinct >= 1, Violation_Precinct <= 34)
addr
## # Source: lazy query [?? x 5]
## # Database: sqlite 3.22.0
## # [/Users/huazhou/Documents/github.com/Hua-Zhou.github.io/teaching/biostatm280-2019winter/slides/12-dbplyr/NYParkingViolations.sqlite]
## Issue_Date Issuing_Agency Violation_Precinct Number Street
## <dbl> <chr> <dbl> <chr> <chr>
## 1 15706 P 25 2 W 120TH ST
## 2 15706 P 13 456 3RD AVE
## 3 15706 P 26 44 MORNINGSIDE DR
## 4 15706 X 14 8 PENN PLAZA
## 5 15706 P 25 2123 MADISON AVE
## 6 15706 P 18 413 48 TH ST
## 7 15706 P 12 630 LEXINGTON AVE
## 8 15706 P 25 219 E 121 ST
## 9 15706 P 26 1420 AMSTERDAM AVE
## 10 15706 P 26 488-490 ST NICHOLAS AVE
## # … with more rows
class(addr)
## [1] "tbl_SQLiteConnection" "tbl_dbi" "tbl_sql"
## [4] "tbl_lazy" "tbl"
show_query(addr)
## <SQL>
## SELECT *
## FROM (SELECT `Issue_Date`, `Issuing_Agency`, `Violation_Precinct`, `Number`, `Street`
## FROM `nyc`)
## WHERE ((`Violation_Precinct` >= 1.0) AND (`Violation_Precinct` <= 34.0))
addr %>% mutate(address = paste(Number, Street))
## # Source: lazy query [?? x 6]
## # Database: sqlite 3.22.0
## # [/Users/huazhou/Documents/github.com/Hua-Zhou.github.io/teaching/biostatm280-2019winter/slides/12-dbplyr/NYParkingViolations.sqlite]
## Issue_Date Issuing_Agency Violation_Precin… Number Street address
## <dbl> <chr> <dbl> <chr> <chr> <chr>
## 1 15706 P 25 2 W 120TH… 2 W 120TH …
## 2 15706 P 13 456 3RD AVE 456 3RD AVE
## 3 15706 P 26 44 MORNING… 44 MORNING…
## 4 15706 X 14 8 PENN PL… 8 PENN PLA…
## 5 15706 P 25 2123 MADISON… 2123 MADIS…
## 6 15706 P 18 413 48 TH ST 413 48 TH …
## 7 15706 P 12 630 LEXINGT… 630 LEXING…
## 8 15706 P 25 219 E 121 ST 219 E 121 …
## 9 15706 P 26 1420 AMSTERD… 1420 AMSTE…
## 10 15706 P 26 488-490 ST NICH… 488-490 ST…
## # … with more rows
addr %>% summarize(mean = mean(Violation_Precinct, na.rm = TRUE))
## # Source: lazy query [?? x 1]
## # Database: sqlite 3.22.0
## # [/Users/huazhou/Documents/github.com/Hua-Zhou.github.io/teaching/biostatm280-2019winter/slides/12-dbplyr/NYParkingViolations.sqlite]
## mean
## <dbl>
## 1 16.1
addr %>%
group_by(Issuing_Agency, Violation_Precinct) %>%
summarize(n = n())
## # Source: lazy query [?? x 3]
## # Database: sqlite 3.22.0
## # [/Users/huazhou/Documents/github.com/Hua-Zhou.github.io/teaching/biostatm280-2019winter/slides/12-dbplyr/NYParkingViolations.sqlite]
## # Groups: Issuing_Agency
## Issuing_Agency Violation_Precinct n
## <chr> <dbl> <int>
## 1 A 1 13
## 2 A 7 1
## 3 A 10 24
## 4 A 11 1
## 5 A 14 47
## 6 A 33 11
## 7 B 25 2
## 8 C 5 73
## 9 C 13 7
## 10 D 1 1
## # … with more rows
addr %>%
group_by(Issuing_Agency, Violation_Precinct) %>%
summarize(n = n()) %>%
show_query()
## <SQL>
## SELECT `Issuing_Agency`, `Violation_Precinct`, COUNT() AS `n`
## FROM (SELECT `Issue_Date`, `Issuing_Agency`, `Violation_Precinct`, `Number`, `Street`
## FROM `nyc`)
## WHERE ((`Violation_Precinct` >= 1.0) AND (`Violation_Precinct` <= 34.0))
## GROUP BY `Issuing_Agency`, `Violation_Precinct`
dbplyr package (a dplyr backend for databases) has a function, translate_sql
, that lets you experiment with how R functions are translated to SQL:
library("dbplyr")
##
## Attaching package: 'dbplyr'
## The following objects are masked from 'package:dplyr':
##
## ident, sql
translate_sql(x == 1 & (y < 2 | z > 3))
## <SQL> "x" = 1.0 AND ("y" < 2.0 OR "z" > 3.0)
translate_sql(x ^ 2 < 10)
## <SQL> POWER("x", 2.0) < 10.0
translate_sql(x %% 2 == 10)
## <SQL> "x" % 2.0 = 10.0
translate_sql(paste(x, y))
## <SQL> PASTE("x", "y")
translate_sql(mean(x))
## Warning: Missing values are always removed in SQL.
## Use `avg(x, na.rm = TRUE)` to silence this warning
## <SQL> avg("x") OVER ()
translate_sql(mean(x, na.rm = TRUE))
## <SQL> avg("x") OVER ()
dplyr using tibble:
system.time(
nyc %>%
select(Issue_Date, Issuing_Agency, Violation_Precinct, Number, Street) %>%
filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>%
group_by(Issuing_Agency, Violation_Precinct) %>%
summarize(n = n())
)
## user system elapsed
## 0.325 0.098 0.424
dplyr using SQLite:
system.time(
nyc_sql %>%
select(Issue_Date, Issuing_Agency, Violation_Precinct, Number, Street) %>%
filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>%
group_by(Issuing_Agency, Violation_Precinct) %>%
summarize(n = n())
)
## user system elapsed
## 0.005 0.000 0.005
nyc_sql
was 30x times faster than nyc
, but the former is disk based while the latter is in memory. Why is the discrepancy?
dplyr/dbplyr uses lazy evaluation as much as possible, particularly when working with non-local backends.
When building a query, we don’t want the entire table, often we want just enough to check if our query is working.
Since we would prefer to run one complex query over many simple queries, laziness allows for verbs to be strung together.
Therefore, by default dplyr
won’t connect and query the database until absolutely necessary (e.g. show output),
and unless explicitly told to, will only query a handful of rows to give a sense of what the result will look like
nyc_sql %>%
select(Issue_Date, Issuing_Agency, Violation_Precinct, Number, Street) %>%
filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>%
group_by(Issuing_Agency, Violation_Precinct) %>%
summarize(n = n())
## # Source: lazy query [?? x 3]
## # Database: sqlite 3.22.0
## # [/Users/huazhou/Documents/github.com/Hua-Zhou.github.io/teaching/biostatm280-2019winter/slides/12-dbplyr/NYParkingViolations.sqlite]
## # Groups: Issuing_Agency
## Issuing_Agency Violation_Precinct n
## <chr> <dbl> <int>
## 1 A 1 13
## 2 A 7 1
## 3 A 10 24
## 4 A 11 1
## 5 A 14 47
## 6 A 33 11
## 7 B 25 2
## 8 C 5 73
## 9 C 13 7
## 10 D 1 1
## # … with more rows
To force a full query and return a complete table it is necessary to use the collect
function.
system.time(
nyc_sql %>%
select(Issue_Date, Issuing_Agency, Violation_Precinct, Number, Street) %>%
filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>%
group_by(Issuing_Agency, Violation_Precinct) %>%
summarize(n = n()) %>%
collect()
)
## user system elapsed
## 3.314 0.405 3.763
Suppose we want the bar plot of number of tickets on each weekday. We encounter error (???) with the usual ggplot2 command:
nyc_sql %>%
ggplot() +
geom_bar(aes(x = Issue_WDay))
Apparently current version of ggplot2 (v3.1.0) works for tbl_dbi/tbl_sql/tbl_lazy
now.
Older version of ggplot2, e.g., v2.2.1, will output error. This is becasue ggplot2 needed to compute the count per category by going through all the rows. But here nyc_sql
is just a pointer to the SQLite table. We have to use the transform in database, plot in R strategy.
nyc_sql %>%
group_by(Issue_WDay) %>%
count() %>%
collect() %>%
ggplot() +
geom_col(aes(x = Issue_WDay, y = n))
For example of making histogram and raster plot, read tutorial https://db.rstudio.com/best-practices/visualization/.
dbDisconnect(db)