Acknowledgement

Much material in this lecture is adpated from http://www2.stat.duke.edu/~cr173/Sta523_Fa17/sql.html and http://www2.stat.duke.edu/~cr173/Sta523_Fa17/bigish_data.html.

Why databases?

Size of data

  • Small data: those can fit into computer memory.

  • Bigish data: those can fit into disk(s) of a single machine.

  • Big data: those cannot fit into disk(s) of a single machine.

Computer architecture

Key to high performance is effective use of memory hierarchy. True on all architectures.

Numbers everyone should know

Operation Time
L1 cache reference 0.5 ns
L2 cache reference 7 ns
Main memory reference 100 ns
Read 1 MB sequentially from memory 250,000 ns
Read 1 MB sequentially from SSD 1,000,000 ns
Read 1 MB sequentially from disk 20,000,000 ns

Source: https://gist.github.com/jboner/2841832

Implications for bigish data

Suppose we have a 10 GB flat data file and that we want to select certain rows based on a given criteria. This requires a sequential read across the entire data set.

If we can store the file in memory:
10 GB × (250 μs/1 MB) = 2.5 seconds

If we have to access the file from SSD (~1GB/sec):
10 GB × (1 ms/1 MB) = 10 seconds

If we have to access the file from disk:
10 GB × (20 ms/1 MB) = 200 seconds

This is just for reading data, if we make any modifications (writing) things are much worse.

Blocks

Cost: Disk << Memory

Speed: Disk <<< Memory

So usually possible to grow our disk storage to accommodate our data. However, memory is usually the limiting resource, and what if we can’t fit everything into memory?

Create blocks - group rows based on similar attributes and read in multiple rows at a time. Optimal size will depend on the task and the properties of the disk.

Databases

SQL

Structured Query Language is a special purpose language for interacting with (querying and modifying) these indexed tabular data structures.

  • ANSI Standard but with some dialect divergence.

  • SQL functionalities map very closely (but not exactly) with the data manipulation verbs present in dplyr.

  • We will see this mapping in more detail in a bit.

Access databases from R

  • dplyr package supports a variety of databases.
    • Open source databases: SQLite, MySQL, PostgreSQL, BigQuery.
    • Commercial databases: Oracle, Microsoft SQL Server.
    • See link for a complete list.
  • DBI package provides a common interface for connecting to databases.

  • dbplyr package is the backend that translates dplyr verbs to database SQL queries.

  • To install database drivers, follow instructions at https://db.rstudio.com/best-practices/drivers/.

A sample session using SQLite

Create a SQLite database

Create a SQLite database employee.sqlite (or in memory) for learning purpose.

library("DBI")
library("RSQLite")
con = dbConnect(RSQLite::SQLite(), "employee.sqlite")
#con = dbConnect(RSQLite::SQLite(), ":memory:")
str(con)
## Formal class 'SQLiteConnection' [package "RSQLite"] with 7 slots
##   ..@ ptr                :<externalptr> 
##   ..@ dbname             : chr "/Users/huazhou/Documents/github.com/Hua-Zhou.github.io/teaching/biostatm280-2019winter/slides/12-dbplyr/employee.sqlite"
##   ..@ loadable.extensions: logi TRUE
##   ..@ flags              : int 70
##   ..@ vfs                : chr ""
##   ..@ ref                :<environment: 0x7fc0a3528bf0> 
##   ..@ bigint             : chr "integer64"

Add a table into database

First table:

library("tidyverse")
## ── Attaching packages ──────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0     ✔ purrr   0.3.0
## ✔ tibble  2.0.1     ✔ dplyr   0.7.8
## ✔ tidyr   0.8.2     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.3.0
## ── Conflicts ─────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
(employees <- tibble(name   = c("Alice", "Bob", "Carol", "Dave", "Eve", "Frank"),
                     email  = c("alice@company.com", "bob@company.com",
                                "carol@company.com", "dave@company.com",
                                "eve@company.com",   "frank@comany.com"),
                     salary = c(52000, 40000, 30000, 33000, 44000, 37000),
                     dept   = c("Accounting", "Accounting", "Sales",
                                "Accounting", "Sales", "Sales")))
## # A tibble: 6 x 4
##   name  email             salary dept      
##   <chr> <chr>              <dbl> <chr>     
## 1 Alice alice@company.com  52000 Accounting
## 2 Bob   bob@company.com    40000 Accounting
## 3 Carol carol@company.com  30000 Sales     
## 4 Dave  dave@company.com   33000 Accounting
## 5 Eve   eve@company.com    44000 Sales     
## 6 Frank frank@comany.com   37000 Sales

Second table:

(phone <- tibble(name  = c("Bob", "Carol", "Eve", "Frank"),
                 phone = c("919 555-1111", "919 555-2222", "919 555-3333", "919 555-4444")))
## # A tibble: 4 x 2
##   name  phone       
##   <chr> <chr>       
## 1 Bob   919 555-1111
## 2 Carol 919 555-2222
## 3 Eve   919 555-3333
## 4 Frank 919 555-4444

Write tables to the database:

dbWriteTable(con, "employees", employees, overwrite = TRUE)
dbWriteTable(con, "phone", phone, overwrite = TRUE)
dbListTables(con)
## [1] "employees" "phone"

Add another table

dbWriteTable(con, "employs", employees)
dbListTables(con)
## [1] "employees" "employs"   "phone"

Remove a table from database

dbRemoveTable(con, "employs")
dbListTables(con)
## [1] "employees" "phone"

Querying tables

# select all columns from table employees
res <- dbSendQuery(con, "SELECT * FROM employees")
# execute the query
dbFetch(res)
##    name             email salary       dept
## 1 Alice alice@company.com  52000 Accounting
## 2   Bob   bob@company.com  40000 Accounting
## 3 Carol carol@company.com  30000      Sales
## 4  Dave  dave@company.com  33000 Accounting
## 5   Eve   eve@company.com  44000      Sales
## 6 Frank  frank@comany.com  37000      Sales
dbClearResult(res)

Closing the connection

dbDisconnect(con)

SQL Queries

Following we demonstrate some common SQL commands, although all task can be achieved by using dplyr as well.

Connecting

con <- dbConnect(RSQLite::SQLite(), dbname = "employee.sqlite")
dbListTables(con)
## [1] "employees" "phone"
knitr::opts_chunk$set(connection = "con")

SELECT statements

SELECT * FROM employees;
6 records
name email salary dept
Alice alice@company.com 52000 Accounting
Bob bob@company.com 40000 Accounting
Carol carol@company.com 30000 Sales
Dave dave@company.com 33000 Accounting
Eve eve@company.com 44000 Sales
Frank frank@comany.com 37000 Sales
SELECT * FROM phone;
4 records
name phone
Bob 919 555-1111
Carol 919 555-2222
Eve 919 555-3333
Frank 919 555-4444

Select using SELECT

SELECT name AS first_name, salary FROM employees;
6 records
first_name salary
Alice 52000
Bob 40000
Carol 30000
Dave 33000
Eve 44000
Frank 37000

Arrange using ORDER BY

SELECT name AS first_name, salary FROM employees ORDER BY salary;
6 records
first_name salary
Carol 30000
Dave 33000
Frank 37000
Bob 40000
Eve 44000
Alice 52000

Descending order:

SELECT name AS first_name, salary FROM employees ORDER BY salary DESC;
6 records
first_name salary
Alice 52000
Eve 44000
Bob 40000
Frank 37000
Dave 33000
Carol 30000

Filter via WHERE

SELECT * FROM employees WHERE salary < 40000
3 records
name email salary dept
Carol carol@company.com 30000 Sales
Dave dave@company.com 33000 Accounting
Frank frank@comany.com 37000 Sales

Group_by via GROUP BY

SELECT * FROM employees GROUP BY dept;
2 records
name email salary dept
Dave dave@company.com 33000 Accounting
Frank frank@comany.com 37000 Sales

Head via LIMIT

SELECT * FROM employees LIMIT 3;
3 records
name email salary dept
Alice alice@company.com 52000 Accounting
Bob bob@company.com 40000 Accounting
Carol carol@company.com 30000 Sales
SELECT * FROM employees ORDER BY name DESC LIMIT 3;
3 records
name email salary dept
Frank frank@comany.com 37000 Sales
Eve eve@company.com 44000 Sales
Dave dave@company.com 33000 Accounting

Join two tables (default)

By default SQLite uses a CROSS JOIN which is not terribly useful

SELECT * FROM employees JOIN phone;
Displaying records 1 - 10
name email salary dept name..5 phone
Alice alice@company.com 52000 Accounting Bob 919 555-1111
Alice alice@company.com 52000 Accounting Carol 919 555-2222
Alice alice@company.com 52000 Accounting Eve 919 555-3333
Alice alice@company.com 52000 Accounting Frank 919 555-4444
Bob bob@company.com 40000 Accounting Bob 919 555-1111
Bob bob@company.com 40000 Accounting Carol 919 555-2222
Bob bob@company.com 40000 Accounting Eve 919 555-3333
Bob bob@company.com 40000 Accounting Frank 919 555-4444
Carol carol@company.com 30000 Sales Bob 919 555-1111
Carol carol@company.com 30000 Sales Carol 919 555-2222

Inner join by NATURAL

SELECT * FROM employees NATURAL JOIN phone;
4 records
name email salary dept phone
Bob bob@company.com 40000 Accounting 919 555-1111
Carol carol@company.com 30000 Sales 919 555-2222
Eve eve@company.com 44000 Sales 919 555-3333
Frank frank@comany.com 37000 Sales 919 555-4444

Inner join - explicit

SELECT * FROM employees JOIN phone ON employees.name = phone.name;
4 records
name email salary dept name..5 phone
Bob bob@company.com 40000 Accounting Bob 919 555-1111
Carol carol@company.com 30000 Sales Carol 919 555-2222
Eve eve@company.com 44000 Sales Eve 919 555-3333
Frank frank@comany.com 37000 Sales Frank 919 555-4444

Left join - natural

SELECT * FROM employees NATURAL LEFT JOIN phone;
6 records
name email salary dept phone
Alice alice@company.com 52000 Accounting NA
Bob bob@company.com 40000 Accounting 919 555-1111
Carol carol@company.com 30000 Sales 919 555-2222
Dave dave@company.com 33000 Accounting NA
Eve eve@company.com 44000 Sales 919 555-3333
Frank frank@comany.com 37000 Sales 919 555-4444

Left join - explicit

SELECT * FROM employees LEFT JOIN phone ON employees.name = phone.name;
6 records
name email salary dept name..5 phone
Alice alice@company.com 52000 Accounting NA NA
Bob bob@company.com 40000 Accounting Bob 919 555-1111
Carol carol@company.com 30000 Sales Carol 919 555-2222
Dave dave@company.com 33000 Accounting NA NA
Eve eve@company.com 44000 Sales Eve 919 555-3333
Frank frank@comany.com 37000 Sales Frank 919 555-4444

Other joins

SQLite does not support directly an OUTER JOIN or a RIGHT JOIN.

Creating indices

CREATE INDEX index_name ON employees (name);
CREATE INDEX index_name_email ON employees (name, email);
sqlite3 employee.sqlite .indices
## index_name        index_name_email

Close connection

dbDisconnect(con)