In this lecture we follow a tutorial by Dr. Colin Rundel to handle a bigish data set. We will learn:
Import data from a bigish csv file.
Tidy data.
Deposit data into an SQLite database.
Query SQLite database.
sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.3
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] compiler_3.4.3 backports_1.1.2 magrittr_1.5 rprojroot_1.3-2
## [5] tools_3.4.3 htmltools_0.3.6 yaml_2.1.16 Rcpp_0.12.15
## [9] stringi_1.1.6 rmarkdown_1.8 knitr_1.19 stringr_1.2.0
## [13] digest_0.6.15 evaluate_0.10.1
The /home/m280-data/nyc_parking/NYParkingViolations.csv
file on teaching server contains information about parking tickets in NYC. You can create a symbolic link to the data file by Bash command
ln -sf /home/m280-data/nyc_parking/NYParkingViolations.csv NYParkingViolations.csv
The CSV file is about 1.7GB
ls -l NYParkingViolations.csv
## -rw-r--r--@ 1 huazhou staff 1793818408 Feb 11 20:26 NYParkingViolations.csv
How many rows?
wc -l < NYParkingViolations.csv
## 9100279
First few lines:
head -5 NYParkingViolations.csv
## Summons Number,Plate ID,Registration State,Plate Type,Issue Date,Violation Code,Vehicle Body Type,Vehicle Make,Issuing Agency,Street Code1,Street Code2,Street Code3,Vehicle Expiration Date,Violation Location,Violation Precinct,Issuer Precinct,Issuer Code,Issuer Command,Issuer Squad,Violation Time,Time First Observed,Violation County,Violation In Front Of Or Opposite,Number,Street,Intersecting Street,Date First Observed,Law Section,Sub Division,Violation Legal Code,Days Parking In Effect ,From Hours In Effect,To Hours In Effect,Vehicle Color,Unregistered Vehicle?,Vehicle Year,Meter Number,Feet From Curb,Violation Post Code,Violation Description,No Standing or Stopping Violation,Hydrant Violation,Double Parking Violation
## 1361929741,FCJ5493,NY,PAS,12/18/1970,20,SUBN,GMC,S,35030,89180,32600,20140728,0070,70,0,829379,KS14,0000,0358A,,K,F,959,E 5 ST,,0,408,E2,,BBBBBBB,ALL,ALL,BLACK,0,2013,-,0,,,,,
## 1366962000,63540MC,NY,COM,02/02/1971,46,DELV,FRUEH,P,58830,11430,11450,0,0068,68,68,928157,0068,0000,1116A,,K,F,185,MARINE AVENUE,,0,408,F1,,BBBBBBB,ALL,ALL,BRN,0,2013,-,0,,,,,
## 1342296187,GCY4187,NY,SRF,09/18/1971,21,VAN,FORD,S,11790,35780,57890,20150430,104,104,0,515350,QW05,0000,0939A,,,O,60-25,56 ST,,0,408,D1,,BBYBBBB,0930A,1100A,BLUE,0,2002,-,0,,,,,
## 1342296199,95V6675,TX,PAS,09/18/1971,21,,GMC,S,11790,35780,57890,0,104,104,0,515350,QW05,0000,0945A,,,F,60-12,56 ST,,0,408,D1,,BBYBBBB,0930A,1100A,SILVR,0,2008,-,0,,,,,
read.csv()
function in base R takes nearly 6 minutes to read in the 1.7GB csv file:
system.time(read.csv("NYParkingViolations.csv"))
## user system elapsed
## 344.781 16.486 367.218
data.table is an R package for reading large data sets:
library(data.table)
system.time({nyc = fread("NYParkingViolations.csv")})
##
Read 0.0% of 9100278 rows
## Warning in fread("NYParkingViolations.csv"): Bumped column 30 to type
## character on data row 2565, field contains 'T'. Coercing previously read
## values in this column from logical, integer or numeric back to character
## which may not be lossless; e.g., if '00' and '000' occurred before they
## will now be just '0', and there may be inconsistencies with treatment of
## ',,' and ',NA,' too (if they occurred in this column before the bump).
## If this matters please rerun and set 'colClasses' to 'character' for this
## column. Please note that column type detection uses a sample of 1,000 rows
## (100 rows at 10 points) so hopefully this message should be very rare.
## If reporting to datatable-help, please rerun and include the output from
## verbose=TRUE.
##
Read 4.9% of 9100278 rows
Read 10.7% of 9100278 rows
Read 16.5% of 9100278 rows
Read 21.5% of 9100278 rows
Read 25.9% of 9100278 rows
Read 30.4% of 9100278 rows
Read 35.3% of 9100278 rows
Read 39.2% of 9100278 rows
Read 42.9% of 9100278 rows
Read 47.6% of 9100278 rows
Read 52.4% of 9100278 rows
Read 56.7% of 9100278 rows
Read 61.0% of 9100278 rows
Read 65.9% of 9100278 rows
Read 70.8% of 9100278 rows
Read 75.9% of 9100278 rows
Read 81.0% of 9100278 rows
Read 85.8% of 9100278 rows
Read 90.5% of 9100278 rows
Read 95.2% of 9100278 rows
Read 99.6% of 9100278 rows
Read 9100278 rows and 43 (of 43) columns from 1.671 GB file in 00:00:26
## user system elapsed
## 23.933 1.593 26.058
class(nyc)
## [1] "data.table" "data.frame"
nyc = as.data.frame(nyc)
class(nyc)
## [1] "data.frame"
read_csv()
function in tidyverse is 3-4 times faster than base R:
library("tidyverse")
## ── Attaching packages ─────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
## ✔ tibble 1.4.2 ✔ dplyr 0.7.4
## ✔ tidyr 0.8.0 ✔ stringr 1.2.0
## ✔ readr 1.1.1 ✔ forcats 0.2.0
## ── Conflicts ────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::between() masks data.table::between()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::first() masks data.table::first()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::last() masks data.table::last()
## ✖ purrr::transpose() masks data.table::transpose()
system.time({nyc = read_csv("NYParkingViolations.csv")})
## Parsed with column specification:
## cols(
## .default = col_character(),
## `Summons Number` = col_double(),
## `Violation Code` = col_integer(),
## `Street Code1` = col_integer(),
## `Street Code2` = col_integer(),
## `Street Code3` = col_integer(),
## `Vehicle Expiration Date` = col_integer(),
## `Violation Precinct` = col_integer(),
## `Issuer Precinct` = col_integer(),
## `Issuer Code` = col_integer(),
## `Date First Observed` = col_integer(),
## `Law Section` = col_integer(),
## `Violation Legal Code` = col_integer(),
## `Unregistered Vehicle?` = col_integer(),
## `Vehicle Year` = col_integer(),
## `Feet From Curb` = col_integer()
## )
## See spec(...) for full column specifications.
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 654437 parsing failures.
## row # A tibble: 5 x 5 col row col expected actual file expected <int> <chr> <chr> <chr> <chr> actual 1 2565 Violation Legal Code an integer T 'NYParkingViolations.csv' file 2 3786 Violation Legal Code an integer T 'NYParkingViolations.csv' row 3 3991 Violation Legal Code an integer T 'NYParkingViolations.csv' col 4 3992 Violation Legal Code an integer T 'NYParkingViolations.csv' expected 5 3993 Violation Legal Code an integer T 'NYParkingViolations.csv'
## ... ................. ... ........................................................................ ........ ........................................................................ ...... ........................................................................ .... ........................................................................ ... ........................................................................ ... ........................................................................ ........ ........................................................................
## See problems(...) for more details.
## user system elapsed
## 43.824 4.737 49.761
nyc
## # A tibble: 9,100,278 x 43
## `Summons Number` `Plate ID` `Registration St… `Plate Type` `Issue Date`
## <dbl> <chr> <chr> <chr> <chr>
## 1 1361929741 FCJ5493 NY PAS 12/18/1970
## 2 1366962000 63540MC NY COM 02/02/1971
## 3 1342296187 GCY4187 NY SRF 09/18/1971
## 4 1342296199 95V6675 TX PAS 09/18/1971
## 5 1342296217 FYM5117 NY SRF 09/18/1971
## 6 1356906515 GFM1421 NY PAS 09/18/1971
## 7 1337077380 18972BB NY 999 10/10/1971
## 8 1364523796 WNJ4730 VA PAS 04/05/1973
## 9 1359914924 68091JZ NY COM 07/22/1973
## 10 1355498326 EWV4127 NY PAS 08/12/1973
## # ... with 9,100,268 more rows, and 38 more variables: `Violation
## # Code` <int>, `Vehicle Body Type` <chr>, `Vehicle Make` <chr>, `Issuing
## # Agency` <chr>, `Street Code1` <int>, `Street Code2` <int>, `Street
## # Code3` <int>, `Vehicle Expiration Date` <int>, `Violation
## # Location` <chr>, `Violation Precinct` <int>, `Issuer Precinct` <int>,
## # `Issuer Code` <int>, `Issuer Command` <chr>, `Issuer Squad` <chr>,
## # `Violation Time` <chr>, `Time First Observed` <chr>, `Violation
## # County` <chr>, `Violation In Front Of Or Opposite` <chr>,
## # Number <chr>, Street <chr>, `Intersecting Street` <chr>, `Date First
## # Observed` <int>, `Law Section` <int>, `Sub Division` <chr>, `Violation
## # Legal Code` <int>, `Days Parking In Effect` <chr>, `From Hours In
## # Effect` <chr>, `To Hours In Effect` <chr>, `Vehicle Color` <chr>,
## # `Unregistered Vehicle?` <int>, `Vehicle Year` <int>, `Meter
## # Number` <chr>, `Feet From Curb` <int>, `Violation Post Code` <chr>,
## # `Violation Description` <chr>, `No Standing or Stopping
## # Violation` <chr>, `Hydrant Violation` <chr>, `Double Parking
## # Violation` <chr>
names(nyc) <- str_replace_all(names(nyc), " ", "_")
nyc
## # A tibble: 9,100,278 x 43
## Summons_Number Plate_ID Registration_State Plate_Type Issue_Date
## <dbl> <chr> <chr> <chr> <chr>
## 1 1361929741 FCJ5493 NY PAS 12/18/1970
## 2 1366962000 63540MC NY COM 02/02/1971
## 3 1342296187 GCY4187 NY SRF 09/18/1971
## 4 1342296199 95V6675 TX PAS 09/18/1971
## 5 1342296217 FYM5117 NY SRF 09/18/1971
## 6 1356906515 GFM1421 NY PAS 09/18/1971
## 7 1337077380 18972BB NY 999 10/10/1971
## 8 1364523796 WNJ4730 VA PAS 04/05/1973
## 9 1359914924 68091JZ NY COM 07/22/1973
## 10 1355498326 EWV4127 NY PAS 08/12/1973
## # ... with 9,100,268 more rows, and 38 more variables:
## # Violation_Code <int>, Vehicle_Body_Type <chr>, Vehicle_Make <chr>,
## # Issuing_Agency <chr>, Street_Code1 <int>, Street_Code2 <int>,
## # Street_Code3 <int>, Vehicle_Expiration_Date <int>,
## # Violation_Location <chr>, Violation_Precinct <int>,
## # Issuer_Precinct <int>, Issuer_Code <int>, Issuer_Command <chr>,
## # Issuer_Squad <chr>, Violation_Time <chr>, Time_First_Observed <chr>,
## # Violation_County <chr>, Violation_In_Front_Of_Or_Opposite <chr>,
## # Number <chr>, Street <chr>, Intersecting_Street <chr>,
## # Date_First_Observed <int>, Law_Section <int>, Sub_Division <chr>,
## # Violation_Legal_Code <int>, Days_Parking_In_Effect <chr>,
## # From_Hours_In_Effect <chr>, To_Hours_In_Effect <chr>,
## # Vehicle_Color <chr>, `Unregistered_Vehicle?` <int>,
## # Vehicle_Year <int>, Meter_Number <chr>, Feet_From_Curb <int>,
## # Violation_Post_Code <chr>, Violation_Description <chr>,
## # No_Standing_or_Stopping_Violation <chr>, Hydrant_Violation <chr>,
## # Double_Parking_Violation <chr>
We make a smaller data set with selected variables:
# magrittr for %<>%
library("magrittr")
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
(nyc %<>%
select(Registration_State:Issuing_Agency,
Violation_Location, Violation_Precinct, Violation_Time,
Number:Intersecting_Street, Vehicle_Color))
## # A tibble: 9,100,278 x 14
## Registration_Sta… Plate_Type Issue_Date Violation_Code Vehicle_Body_Ty…
## <chr> <chr> <chr> <int> <chr>
## 1 NY PAS 12/18/1970 20 SUBN
## 2 NY COM 02/02/1971 46 DELV
## 3 NY SRF 09/18/1971 21 VAN
## 4 TX PAS 09/18/1971 21 <NA>
## 5 NY SRF 09/18/1971 21 SUBN
## 6 NY PAS 09/18/1971 40 SDN
## 7 NY 999 10/10/1971 14 BUS
## 8 VA PAS 04/05/1973 14 SDN
## 9 NY COM 07/22/1973 46 DELV
## 10 NY PAS 08/12/1973 21 SUBN
## # ... with 9,100,268 more rows, and 9 more variables: Vehicle_Make <chr>,
## # Issuing_Agency <chr>, Violation_Location <chr>,
## # Violation_Precinct <int>, Violation_Time <chr>, Number <chr>,
## # Street <chr>, Intersecting_Street <chr>, Vehicle_Color <chr>
Hadley Wickham’s lubridate package facilitates handling date and time. Read Chapter 16 of R for Data Science for more information.
Make variable Issue_Date
of date-time type:
library("lubridate")
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
##
## hour, isoweek, mday, minute, month, quarter, second, wday,
## week, yday, year
## The following object is masked from 'package:base':
##
## date
class(nyc$Issue_Date)
## [1] "character"
nyc %<>% mutate(Issue_Date = mdy(Issue_Date))
class(nyc$Issue_Date)
## [1] "Date"
nyc
## # A tibble: 9,100,278 x 14
## Registration_Sta… Plate_Type Issue_Date Violation_Code Vehicle_Body_Ty…
## <chr> <chr> <date> <int> <chr>
## 1 NY PAS 1970-12-18 20 SUBN
## 2 NY COM 1971-02-02 46 DELV
## 3 NY SRF 1971-09-18 21 VAN
## 4 TX PAS 1971-09-18 21 <NA>
## 5 NY SRF 1971-09-18 21 SUBN
## 6 NY PAS 1971-09-18 40 SDN
## 7 NY 999 1971-10-10 14 BUS
## 8 VA PAS 1973-04-05 14 SDN
## 9 NY COM 1973-07-22 46 DELV
## 10 NY PAS 1973-08-12 21 SUBN
## # ... with 9,100,268 more rows, and 9 more variables: Vehicle_Make <chr>,
## # Issuing_Agency <chr>, Violation_Location <chr>,
## # Violation_Precinct <int>, Violation_Time <chr>, Number <chr>,
## # Street <chr>, Intersecting_Street <chr>, Vehicle_Color <chr>
Only keep data in years 2013-4.
range(nyc$Issue_Date)
## [1] "1970-12-18" "2069-12-23"
nyc$Issue_Date %>% year() %>% table()
## .
## 1970 1971 1973 1974 1976 1977 1979 1981 1983
## 1 6 10 1 2 1 2 4 1
## 1984 1987 1990 1991 1996 2000 2001 2002 2003
## 2 3 2 1 1 319 91 7 39
## 2004 2005 2006 2007 2008 2009 2010 2011 2012
## 77 9 11 13 8 9 129 251 618
## 2013 2014 2015 2016 2017 2018 2019 2020 2021
## 4379109 4716512 1522 296 309 181 329 18 26
## 2022 2023 2024 2025 2026 2027 2028 2029 2030
## 1 31 23 10 4 4 7 3 45
## 2031 2032 2033 2040 2041 2043 2044 2045 2046
## 93 3 8 1 39 9 9 2 7
## 2047 2048 2049 2050 2051 2052 2053 2060 2061
## 6 1 3 1 12 2 1 3 10
## 2063 2064 2066 2067 2069
## 9 5 3 2 1
filter(nyc, Issue_Date >= mdy("1/1/2013"), Issue_Date <= mdy("12/31/2014"))
## # A tibble: 9,095,621 x 14
## Registration_Sta… Plate_Type Issue_Date Violation_Code Vehicle_Body_Ty…
## <chr> <chr> <date> <int> <chr>
## 1 NY PAS 2013-07-29 94 <NA>
## 2 NY COM 2013-01-07 19 DELV
## 3 NY COM 2013-01-01 46 DELV
## 4 IL PAS 2013-01-01 78 VAN
## 5 NJ PAS 2013-01-01 46 SUBN
## 6 NY PAS 2013-01-01 46 SDN
## 7 NY COM 2013-01-01 78 VAN
## 8 NY COM 2013-01-01 46 SDN
## 9 OK PAS 2013-01-01 47 TRLR
## 10 NY COM 2013-01-01 45 DELV
## # ... with 9,095,611 more rows, and 9 more variables: Vehicle_Make <chr>,
## # Issuing_Agency <chr>, Violation_Location <chr>,
## # Violation_Precinct <int>, Violation_Time <chr>, Number <chr>,
## # Street <chr>, Intersecting_Street <chr>, Vehicle_Color <chr>
system.time({
nyc <- read_csv("NYParkingViolations.csv") %>%
setNames(str_replace_all(names(.)," ", "_")) %>%
select(Registration_State:Issuing_Agency,
Violation_Location, Violation_Precinct, Violation_Time,
Number:Intersecting_Street, Vehicle_Color) %>%
mutate(Issue_Date = mdy(Issue_Date)) %>%
mutate(Issue_Day = day(Issue_Date),
Issue_Month = month(Issue_Date),
Issue_Year = year(Issue_Date),
Issue_WDay = wday(Issue_Date, label=TRUE)) %>%
filter(Issue_Year %in% 2013:2014)
})
## Parsed with column specification:
## cols(
## .default = col_character(),
## `Summons Number` = col_double(),
## `Violation Code` = col_integer(),
## `Street Code1` = col_integer(),
## `Street Code2` = col_integer(),
## `Street Code3` = col_integer(),
## `Vehicle Expiration Date` = col_integer(),
## `Violation Precinct` = col_integer(),
## `Issuer Precinct` = col_integer(),
## `Issuer Code` = col_integer(),
## `Date First Observed` = col_integer(),
## `Law Section` = col_integer(),
## `Violation Legal Code` = col_integer(),
## `Unregistered Vehicle?` = col_integer(),
## `Vehicle Year` = col_integer(),
## `Feet From Curb` = col_integer()
## )
## See spec(...) for full column specifications.
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 654437 parsing failures.
## row # A tibble: 5 x 5 col row col expected actual file expected <int> <chr> <chr> <chr> <chr> actual 1 2565 Violation Legal Code an integer T 'NYParkingViolations.csv' file 2 3786 Violation Legal Code an integer T 'NYParkingViolations.csv' row 3 3991 Violation Legal Code an integer T 'NYParkingViolations.csv' col 4 3992 Violation Legal Code an integer T 'NYParkingViolations.csv' expected 5 3993 Violation Legal Code an integer T 'NYParkingViolations.csv'
## ... ................. ... ........................................................................ ........ ........................................................................ ...... ........................................................................ .... ........................................................................ ... ........................................................................ ... ........................................................................ ........ ........................................................................
## See problems(...) for more details.
## user system elapsed
## 59.290 5.114 65.049
nyc
## # A tibble: 9,095,621 x 18
## Registration_Sta… Plate_Type Issue_Date Violation_Code Vehicle_Body_Ty…
## <chr> <chr> <date> <int> <chr>
## 1 NY PAS 2013-07-29 94 <NA>
## 2 NY COM 2013-01-07 19 DELV
## 3 NY COM 2013-01-01 46 DELV
## 4 IL PAS 2013-01-01 78 VAN
## 5 NJ PAS 2013-01-01 46 SUBN
## 6 NY PAS 2013-01-01 46 SDN
## 7 NY COM 2013-01-01 78 VAN
## 8 NY COM 2013-01-01 46 SDN
## 9 OK PAS 2013-01-01 47 TRLR
## 10 NY COM 2013-01-01 45 DELV
## # ... with 9,095,611 more rows, and 13 more variables: Vehicle_Make <chr>,
## # Issuing_Agency <chr>, Violation_Location <chr>,
## # Violation_Precinct <int>, Violation_Time <chr>, Number <chr>,
## # Street <chr>, Intersecting_Street <chr>, Vehicle_Color <chr>,
## # Issue_Day <int>, Issue_Month <dbl>, Issue_Year <dbl>, Issue_WDay <ord>
nyc %>%
group_by(Issue_Date) %>%
summarize(n = n()) %>%
ggplot(aes(x = Issue_Date, y = n)) +
geom_line() +
xlim(mdy("7/1/2013"), mdy("6/30/2014"))
## Warning: Removed 365 rows containing missing values (geom_path).
Import the nyc
data set into the SQLite database:
library("DBI")
library("RSQLite")
library("tidyverse")
if (Sys.info()[["sysname"]] == "Linux") {
db <- dbConnect(RSQLite::SQLite(),
dbname = "/home/m280-data/nyc_parking/NYParkingViolations.sqlite")
} else if (Sys.info()[["sysname"]] == "Darwin") {
db <- dbConnect(RSQLite::SQLite(),
dbname = "./NYParkingViolations.sqlite")
}
dbWriteTable(db, "nyc", nyc, overwrite = TRUE)
dbListTables(db)
## [1] "nyc"
dbDisconnect(db)
You can create a symbolic link to the database file by Bash command
ln -sf /home/m280-data/nyc_parking/NYParkingViolations.sqlite NYParkingViolations.sqlite
Size of the database file
ls -l NYParkingViolations.sqlite
## -rw-r--r-- 1 huazhou staff 729309184 Feb 19 20:43 NYParkingViolations.sqlite
Connect to table in database:
db <- dbConnect(RSQLite::SQLite(), dbname = "./NYParkingViolations.sqlite")
dbListTables(db)
## [1] "nyc"
nyc_sql <- dplyr::tbl(db, "nyc")
str(nyc_sql)
## List of 2
## $ src:List of 2
## ..$ con :Formal class 'SQLiteConnection' [package "RSQLite"] with 6 slots
## .. .. ..@ ptr :<externalptr>
## .. .. ..@ dbname : chr "/Users/huazhou/Documents/github.com/Hua-Zhou.github.io/teaching/biostatm280-2018winter/slides/11-dbplyr/NYParki"| __truncated__
## .. .. ..@ loadable.extensions: logi TRUE
## .. .. ..@ flags : int 70
## .. .. ..@ vfs : chr ""
## .. .. ..@ ref :<environment: 0x7fc41647f6b0>
## ..$ disco: NULL
## ..- attr(*, "class")= chr [1:3] "src_dbi" "src_sql" "src"
## $ ops:List of 2
## ..$ x :Classes 'ident', 'character' chr "nyc"
## ..$ vars: chr [1:18] "Registration_State" "Plate_Type" "Issue_Date" "Violation_Code" ...
## ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
## - attr(*, "class")= chr [1:4] "tbl_dbi" "tbl_sql" "tbl_lazy" "tbl"
nyc_sql %>% print(width = Inf)
## # Source: table<nyc> [?? x 18]
## # Database: sqlite 3.19.3
## # [/Users/huazhou/Documents/github.com/Hua-Zhou.github.io/teaching/biostatm280-2018winter/slides/11-dbplyr/NYParkingViolations.sqlite]
## Registration_State Plate_Type Issue_Date Violation_Code
## <chr> <chr> <dbl> <int>
## 1 NY PAS 15915 94
## 2 NY COM 15712 19
## 3 NY COM 15706 46
## 4 IL PAS 15706 78
## 5 NJ PAS 15706 46
## 6 NY PAS 15706 46
## 7 NY COM 15706 78
## 8 NY COM 15706 46
## 9 OK PAS 15706 47
## 10 NY COM 15706 45
## Vehicle_Body_Type Vehicle_Make Issuing_Agency Violation_Location
## <chr> <chr> <chr> <chr>
## 1 <NA> TRIUM X <NA>
## 2 DELV <NA> P 0076
## 3 DELV FRUEH P 110
## 4 VAN FORD P 0025
## 5 SUBN CHEVR P 0081
## 6 SDN HONDA P 0078
## 7 VAN FORD P 110
## 8 SDN ISUZU P 112
## 9 TRLR INTER P 0013
## 10 DELV FRUEH P 0026
## Violation_Precinct Violation_Time Number Street
## <int> <chr> <chr> <chr>
## 1 0 <NA> <NA> <NA>
## 2 76 0407A 548 COURT STREET
## 3 110 1045A 88-10 WHITNEY AVE
## 4 25 0332A 2 W 120TH ST
## 5 81 0139P 427 HART ST
## 6 78 1127A 565 UNION ST
## 7 110 0412A 40-59 95 ST
## 8 112 0554P 67-70 YELLOWSTONE BLVD
## 9 13 0745A 456 3RD AVE
## 10 26 0330P 44 MORNINGSIDE DR
## Intersecting_Street Vehicle_Color Issue_Day Issue_Month Issue_Year
## <chr> <chr> <int> <dbl> <dbl>
## 1 <NA> GREEN 29 7.00 2013
## 2 <NA> <NA> 7 1.00 2013
## 3 <NA> WHIT 1 1.00 2013
## 4 <NA> WHT 1 1.00 2013
## 5 <NA> RD 1 1.00 2013
## 6 <NA> GREY 1 1.00 2013
## 7 <NA> WH 1 1.00 2013
## 8 <NA> WH 1 1.00 2013
## 9 <NA> WHITE 1 1.00 2013
## 10 <NA> WHITE 1 1.00 2013
## Issue_WDay
## <chr>
## 1 Mon
## 2 Mon
## 3 Tue
## 4 Tue
## 5 Tue
## 6 Tue
## 7 Tue
## 8 Tue
## 9 Tue
## 10 Tue
## # ... with more rows
addr <- nyc_sql %>%
select(Issue_Date, Issuing_Agency, Violation_Precinct, Number, Street) %>%
filter(Violation_Precinct >= 1, Violation_Precinct <= 34)
addr
## # Source: lazy query [?? x 5]
## # Database: sqlite 3.19.3
## # [/Users/huazhou/Documents/github.com/Hua-Zhou.github.io/teaching/biostatm280-2018winter/slides/11-dbplyr/NYParkingViolations.sqlite]
## Issue_Date Issuing_Agency Violation_Precinct Number Street
## <dbl> <chr> <int> <chr> <chr>
## 1 15706 P 25 2 W 120TH ST
## 2 15706 P 13 456 3RD AVE
## 3 15706 P 26 44 MORNINGSIDE DR
## 4 15706 X 14 8 PENN PLAZA
## 5 15706 P 25 2123 MADISON AVE
## 6 15706 P 18 413 48 TH ST
## 7 15706 P 12 630 LEXINGTON AVE
## 8 15706 P 25 219 E 121 ST
## 9 15706 P 26 1420 AMSTERDAM AVE
## 10 15706 P 26 488-490 ST NICHOLAS AVE
## # ... with more rows
class(addr)
## [1] "tbl_dbi" "tbl_sql" "tbl_lazy" "tbl"
show_query(addr)
## <SQL>
## SELECT *
## FROM (SELECT `Issue_Date`, `Issuing_Agency`, `Violation_Precinct`, `Number`, `Street`
## FROM `nyc`)
## WHERE ((`Violation_Precinct` >= 1.0) AND (`Violation_Precinct` <= 34.0))
addr %>% mutate(address = paste(Number, Street))
## # Source: lazy query [?? x 6]
## # Database: sqlite 3.19.3
## # [/Users/huazhou/Documents/github.com/Hua-Zhou.github.io/teaching/biostatm280-2018winter/slides/11-dbplyr/NYParkingViolations.sqlite]
## Issue_Date Issuing_Agency Violation_Precinct Number Street address
## <dbl> <chr> <int> <chr> <chr> <chr>
## 1 15706 P 25 2 W 120TH… 2 W 120T…
## 2 15706 P 13 456 3RD AVE 456 3RD …
## 3 15706 P 26 44 MORNING… 44 MORNI…
## 4 15706 X 14 8 PENN PL… 8 PENN P…
## 5 15706 P 25 2123 MADISON… 2123 MAD…
## 6 15706 P 18 413 48 TH ST 413 48 T…
## 7 15706 P 12 630 LEXINGT… 630 LEXI…
## 8 15706 P 25 219 E 121 ST 219 E 12…
## 9 15706 P 26 1420 AMSTERD… 1420 AMS…
## 10 15706 P 26 488-490 ST NICH… 488-490 …
## # ... with more rows
addr %>% summarize(mean = mean(Violation_Precinct, na.rm = TRUE))
## # Source: lazy query [?? x 1]
## # Database: sqlite 3.19.3
## # [/Users/huazhou/Documents/github.com/Hua-Zhou.github.io/teaching/biostatm280-2018winter/slides/11-dbplyr/NYParkingViolations.sqlite]
## mean
## <dbl>
## 1 16.1
addr %>%
group_by(Issuing_Agency, Violation_Precinct) %>%
summarize(n = n())
## # Source: lazy query [?? x 3]
## # Database: sqlite 3.19.3
## # [/Users/huazhou/Documents/github.com/Hua-Zhou.github.io/teaching/biostatm280-2018winter/slides/11-dbplyr/NYParkingViolations.sqlite]
## # Groups: Issuing_Agency
## Issuing_Agency Violation_Precinct n
## <chr> <int> <int>
## 1 A 1 13
## 2 A 7 1
## 3 A 10 24
## 4 A 11 1
## 5 A 14 47
## 6 A 33 11
## 7 B 25 2
## 8 C 5 73
## 9 C 13 7
## 10 D 1 1
## # ... with more rows
addr %>%
group_by(Issuing_Agency, Violation_Precinct) %>%
summarize(n = n()) %>%
show_query()
## <SQL>
## SELECT `Issuing_Agency`, `Violation_Precinct`, COUNT() AS `n`
## FROM (SELECT `Issue_Date`, `Issuing_Agency`, `Violation_Precinct`, `Number`, `Street`
## FROM `nyc`)
## WHERE ((`Violation_Precinct` >= 1.0) AND (`Violation_Precinct` <= 34.0))
## GROUP BY `Issuing_Agency`, `Violation_Precinct`
dbplyr package (a dplyr backend for databases) has a function, translate_sql
, that lets you experiment with how R functions are translated to SQL:
library("dbplyr")
##
## Attaching package: 'dbplyr'
## The following objects are masked from 'package:dplyr':
##
## ident, sql
translate_sql(x == 1 & (y < 2 | z > 3))
## <SQL> "x" = 1.0 AND ("y" < 2.0 OR "z" > 3.0)
translate_sql(x ^ 2 < 10)
## <SQL> POWER("x", 2.0) < 10.0
translate_sql(x %% 2 == 10)
## <SQL> "x" % 2.0 = 10.0
translate_sql(paste(x, y))
## <SQL> PASTE("x", "y")
translate_sql(mean(x))
## Warning: Missing values are always removed in SQL.
## Use `avg(x, na.rm = TRUE)` to silence this warning
## <SQL> avg("x") OVER ()
translate_sql(mean(x, na.rm = TRUE))
## <SQL> avg("x") OVER ()
dplyr using tibble:
system.time(
nyc %>%
select(Issue_Date, Issuing_Agency, Violation_Precinct, Number, Street) %>%
filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>%
group_by(Issuing_Agency, Violation_Precinct) %>%
summarize(n = n())
)
## user system elapsed
## 0.383 0.091 0.479
dplyr using SQLite:
system.time(
nyc_sql %>%
select(Issue_Date, Issuing_Agency, Violation_Precinct, Number, Street) %>%
filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>%
group_by(Issuing_Agency, Violation_Precinct) %>%
summarize(n = n())
)
## user system elapsed
## 0.016 0.000 0.017
nyc_sql
was 30x times faster than nyc
, but the former is disk based while the latter is in memory. Why is the discrepancy?
dplyr/dbplyr uses lazy evaluation as much as possible, particularly when working with non-local backends.
When building a query, we don’t want the entire table, often we want just enough to check if our query is working.
Since we would prefer to run one complex query over many simple queries, laziness allows for verbs to be strung together.
Therefore, by default dplyr
won’t connect and query the database until absolutely necessary (e.g. show output),
and unless explicitly told to, will only query a handful of rows to give a sense of what the result will look like
nyc_sql %>%
select(Issue_Date, Issuing_Agency, Violation_Precinct, Number, Street) %>%
filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>%
group_by(Issuing_Agency, Violation_Precinct) %>%
summarize(n = n())
## # Source: lazy query [?? x 3]
## # Database: sqlite 3.19.3
## # [/Users/huazhou/Documents/github.com/Hua-Zhou.github.io/teaching/biostatm280-2018winter/slides/11-dbplyr/NYParkingViolations.sqlite]
## # Groups: Issuing_Agency
## Issuing_Agency Violation_Precinct n
## <chr> <int> <int>
## 1 A 1 13
## 2 A 7 1
## 3 A 10 24
## 4 A 11 1
## 5 A 14 47
## 6 A 33 11
## 7 B 25 2
## 8 C 5 73
## 9 C 13 7
## 10 D 1 1
## # ... with more rows
To force a full query and return a complete table it is necessary to use the collect
function.
system.time(
nyc_sql %>%
select(Issue_Date, Issuing_Agency, Violation_Precinct, Number, Street) %>%
filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>%
group_by(Issuing_Agency, Violation_Precinct) %>%
summarize(n = n()) %>%
collect()
)
## user system elapsed
## 3.166 0.341 3.544
Suppose we want the bar plot of number of tickets on each weekday. We encounter error with the usual ggplot2 command:
nyc_sql %>%
ggplot() +
geom_bar(aes(x = Issue_WDay))
## Error: ggplot2 doesn't know how to deal with data of class tbl_dbi/tbl_sql/tbl_lazy/tbl
This is becasue ggplot2 needs to compute the count per category by going through all the rows. But here nyc_sql
is just a pointer to the SQLite table. We have to use the transform in database, plot in R strategy.
nyc_sql %>%
group_by(Issue_WDay) %>%
count() %>%
collect() %>%
ggplot() +
geom_col(aes(x = Issue_WDay, y = n))
For example of making histogram and raster plot, read tutorial https://db.rstudio.com/best-practices/visualization/.
dbDisconnect(db)