We will spend the next couple of lectures studying some R packages for typical data science projects.
Data wrangling (import, visualization, transformation, tidy).
R for Data Science by Garrett Grolemund and Hadley Wickham.
Web applications by Shiny.
Interface with databases, e.g., SQL and Apache Spark.
A typical data science project:
tidyverse is a collection of R packages that make data wrangling easy.
Install tidyverse from the RStudio menu Tools -> Install Packages... or
install.packages("tidyverse")
After installation, load tidyverse by
library("tidyverse")
## ── Attaching packages ──────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0     ✔ purrr   0.2.5
## ✔ tibble  2.0.1     ✔ dplyr   0.7.8
## ✔ tidyr   0.8.2     ✔ stringr 1.3.1
## ✔ readr   1.3.1     ✔ forcats 0.3.0
## ── Conflicts ─────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
“The simple graph has brought more information to the data analyst’s mind than any other device.”
John Tukey
mpg data
The mpg data is available from the ggplot2 package:
mpg
## # A tibble: 234 x 11
##    manufacturer model displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4      1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4      1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4      2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4      2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4      2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4      2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4      3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 q…   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 q…   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 q…   2    2008     4 manu… 4        20    28 p     comp…
## # … with 224 more rows
Tibbles are a generalized form of data frames and are used extensively in tidyverse.
displ: engine size, in litres.
hwy: highway fuel efficiency, in miles per gallon (mpg).
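To make the relationship between tibbles and data frames concrete, here is a minimal sketch that compares mpg with a plain data frame copy. The names mpg_df, hwy_tbl, and hwy_vec are introduced only for illustration.

```r
library(ggplot2)  # provides the mpg data
library(tibble)

# A tibble is still a data frame: its class vector ends in "data.frame".
class(mpg)

# mpg_df is a hypothetical name for a base data.frame copy of mpg.
mpg_df <- as.data.frame(mpg)

# Selecting one column with [ , ] keeps a tibble a tibble,
# while a base data frame drops to a bare vector.
hwy_tbl <- mpg[, "hwy"]
hwy_vec <- mpg_df[, "hwy"]
class(hwy_tbl)
class(hwy_vec)

# Numeric summary of the two variables described above.
summary(mpg[, c("displ", "hwy")])
```

The stricter subsetting behaviour is one reason tibbles tend to be more predictable inside functions and pipelines.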
hwy vs displ
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))
An aesthetic maps data to a specific feature of the plot.
Check available aesthetics for a geometric object by ?geom_point.
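Besides the help page, the aesthetics of a geom can also be listed programmatically. The sketch below assumes the exported ggproto object GeomPoint, which backs geom_point(); the variable name m is only illustrative.

```r
library(ggplot2)

# Aesthetics that geom_point() requires.
GeomPoint$required_aes

# All aesthetics that geom_point() understands.
GeomPoint$aesthetics()

# aes() only records which variable goes to which aesthetic;
# note that color is stored under the name colour.
m <- aes(x = displ, y = hwy, color = class)
names(m)
```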
Color points according to class:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))
Assign different sizes to points according to class:
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, size = class))
Assign different transparency levels to points according to class:
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
## Warning: Using alpha for a discrete variable is not advised.
Assign different shapes to points according to class:
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class))
The default shape palette provides at most 6 shapes at a time. By default, additional groups go unplotted.
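One way around the 6-shape limit is to supply the shapes manually. This is a sketch, not the only option: it assumes the 7 levels of class in mpg and picks the shape codes 0 to 6 arbitrarily; p_all and shape_data are illustrative names.

```r
library(ggplot2)

# mpg has 7 classes, one more than the default shape palette offers.
length(unique(mpg$class))

# Provide 7 shape codes explicitly so that no class is dropped.
p_all <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = 0:6)

# layer_data() returns the data as drawn; every point now has a shape.
shape_data <- layer_data(p_all)
anyNA(shape_data$shape)
```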
Set the color of all points to be blue:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
Facets divide a plot into subplots based on the values of one or more discrete variables.
A subplot for each car type:
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)
A subplot for each car type and drive:
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ class)
geom_smooth(): smooth line
hwy vs displ line:
ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy))
Different line types according to drv:
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
Different line colors according to drv:
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, color = drv))
Lines overlaid over scatter plot:
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy))
Same as
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() + geom_smooth()
Different aesthetics in different layers:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)
diamonds data
The diamonds data:
diamonds
## # A tibble: 53,940 x 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # … with 53,930 more rows
geom_bar() creates a bar chart:
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
Bar charts, like histograms, frequency polygons, smoothers, and boxplots, plot computed variables instead of the raw data.
Check available computed variables for a geometric object via help:
?geom_bar
Use stat_count() directly:
ggplot(data = diamonds) +
  stat_count(mapping = aes(x = cut))
stat_count() uses geom_bar() as its default geom.
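To see what "computed variables" means, the counts can also be computed by hand and drawn as given. The sketch below uses dplyr::count() and geom_col(); cut_counts, p_manual, p_bar, and computed are illustrative names.

```r
library(ggplot2)
library(dplyr)

# Count the diamonds in each cut by hand; the result has columns cut and n.
cut_counts <- count(diamonds, cut)

# geom_col() draws bar heights exactly as given, with no statistic applied.
p_manual <- ggplot(data = cut_counts) +
  geom_col(mapping = aes(x = cut, y = n))

# geom_bar() computes the same counts internally via stat_count().
p_bar <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# layer_data() exposes the computed variables, including count and prop.
computed <- layer_data(p_bar)
computed[, c("x", "count", "prop")]
```

Both plots should look the same apart from the y-axis label, since the bar heights come from identical counts.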
Display relative frequencies instead of counts:
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))
Note that the aesthetic mapping group = 1 overrides the default grouping (by cut) by treating all observations as a single group. Without it, each proportion is computed within its own cut group, so every bar has height 1:
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = ..prop..))
Color the bar outlines:
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, colour = cut))
Fill color:
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut))
Fill color according to another variable:
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity))
position_jitter() adds random noise to the x and y position of each element to avoid overplotting:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
geom_jitter() is similar:
ggplot(data = mpg) +
  geom_jitter(mapping = aes(x = displ, y = hwy))
position_fill() stacks elements on top of one another and normalizes the height:
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
position_dodge() arranges elements side by side:
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")
position_stack() stacks elements on top of each other:
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "stack")
A boxplot:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot()
coord_cartesian() is the default Cartesian coordinate system:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot() +
  coord_cartesian(xlim = c(0, 5))
coord_fixed() specifies the aspect ratio (y / x):
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot() +
  coord_fixed(ratio = 1/2)
coord_flip() flips the x- and y-axes:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot() +
  coord_flip()
A map:
library("maps")
nz <- map_data("nz")
head(nz, 20)
##        long       lat group order        region subregion
## 1  172.7433 -34.44215     1     1 North.Island       <NA>
## 2  172.7983 -34.45562     1     2 North.Island       <NA>
## 3  172.8528 -34.44846     1     3 North.Island       <NA>
## 4  172.8986 -34.41786     1     4 North.Island       <NA>
## 5  172.9593 -34.42503     1     5 North.Island       <NA>
## 6  173.0184 -34.39895     1     6 North.Island       <NA>
## 7  173.0229 -34.44662     1     7 North.Island       <NA>
## 8  173.0184 -34.49343     1     8 North.Island       <NA>
## 9  172.9616 -34.50426     1     9 North.Island       <NA>
## 10 172.9181 -34.47367     1    10 North.Island       <NA>
## 11 172.9353 -34.52225     1    11 North.Island       <NA>
## 12 172.8808 -34.51504     1    12 North.Island       <NA>
## 13 172.9049 -34.55646     1    13 North.Island       <NA>
## 14 172.9553 -34.53303     1    14 North.Island       <NA>
## 15 172.9376 -34.57806     1    15 North.Island       <NA>
## 16 172.9760 -34.61227     1    16 North.Island       <NA>
## 17 172.9926 -34.56723     1    17 North.Island       <NA>
## 18 173.0218 -34.61404     1    18 North.Island       <NA>
## 19 173.0396 -34.65902     1    19 North.Island       <NA>
## 20 173.0676 -34.70044     1    20 North.Island       <NA>
ggplot(nz, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "white", colour = "black")
coord_quickmap() sets the aspect ratio so that the map is drawn to scale:
ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black") +
  coord_quickmap()
Figure title should be descriptive:
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE) +
  labs(title = "Fuel efficiency generally decreases with engine size")
Add a subtitle and a caption:
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE) +
  labs(
    title = "Fuel efficiency generally decreases with engine size",
    subtitle = "Two seaters (sports cars) are an exception because of their light weight",
    caption = "Data from fueleconomy.gov"
  )
Axis labels:
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  geom_smooth(se = FALSE) +
  labs(
    x = "Engine displacement (L)",
    y = "Highway fuel economy (mpg)"
  )
Math expressions in labels:
df <- tibble(x = runif(10), y = runif(10))
ggplot(df, aes(x, y)) + geom_point() +
  labs(
    x = quote(sum(x[i] ^ 2, i == 1, n)),
    y = quote(alpha + beta + frac(delta, theta))
  )
?plotmath
Create labels
best_in_class <- mpg %>%
  group_by(class) %>%
  filter(row_number(desc(hwy)) == 1)
best_in_class
## # A tibble: 7 x 11
## # Groups:   class [7]
##   manufacturer model  displ  year   cyl trans drv     cty   hwy fl    class
##   <chr>        <chr>  <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 chevrolet    corve…   5.7  1999     8 manu… r        16    26 p     2sea…
## 2 dodge        carav…   2.4  1999     4 auto… f        18    24 r     mini…
## 3 nissan       altima   2.5  2008     4 manu… f        23    32 r     mids…
## 4 subaru       fores…   2.5  2008     4 manu… 4        20    27 r     suv  
## 5 toyota       toyot…   2.7  2008     4 manu… 4        17    22 r     pick…
## 6 volkswagen   jetta    1.9  1999     4 manu… f        33    44 d     comp…
## 7 volkswagen   new b…   1.9  1999     4 manu… f        35    44 d     subc…
Annotate points
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(colour = class)) +
  geom_text(aes(label = model), data = best_in_class)
The ggrepel package automatically adjusts labels so that they don’t overlap:
library("ggrepel")
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  geom_point(size = 3, shape = 1, data = best_in_class) +
  ggrepel::geom_label_repel(aes(label = model), data = best_in_class)
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class))
automatically adds default scales, so it is the same as
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  scale_x_continuous() +
  scale_y_continuous() +
  scale_colour_discrete()
breaks
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_y_continuous(breaks = seq(15, 40, by = 5))
labels
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_x_continuous(labels = NULL) +
  scale_y_continuous(labels = NULL)
Plot y-axis at log scale:
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  scale_y_log10()
Plot x-axis in reverse order:
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  scale_x_reverse()
Set legend position: "left", "right", "top", "bottom", "none":
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  theme(legend.position = "left")
See the following link for more details on how to change the title, labels, … of a legend.
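As a rough sketch of common legend adjustments (not necessarily those covered by the linked page), the legend title can be set with labs() and the key layout with guides(). The title text "Car type" and the name p_legend are illustrative choices.

```r
library(ggplot2)

p_legend <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  # Legend title for the colour aesthetic.
  labs(colour = "Car type") +
  # Move the legend below the plot.
  theme(legend.position = "bottom") +
  # Lay the keys out in one row and enlarge the key symbols only.
  guides(colour = guide_legend(nrow = 1, override.aes = list(size = 4)))

p_legend
```

override.aes changes how the legend keys are drawn without affecting the points in the plot itself.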
Without clipping (zooms in; data points outside the limits are kept and still used by geom_smooth())
ggplot(mpg, mapping = aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth() +
  coord_cartesian(xlim = c(5, 7), ylim = c(10, 30))
With clipping (removes unseen data points)
ggplot(mpg, mapping = aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth() +
  xlim(5, 7) + ylim(10, 30)
Same as
ggplot(mpg, mapping = aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth() +
  scale_x_continuous(limits = c(5, 7)) +
  scale_y_continuous(limits = c(10, 30))
Similar to filtering the data before plotting:
mpg %>%
  filter(displ >= 5, displ <= 7, hwy >= 10, hwy <= 30) %>%
  ggplot(aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth()
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE) +
  theme_bw()
ggplot(mpg, aes(displ, hwy)) + geom_point()
ggsave("my-plot.pdf")
## Saving 7 x 5 in image
The RStudio cheat sheet is extremely helpful.
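Returning to ggsave(): by default it saves the most recently displayed plot at the size of the current graphics device, which is where the "Saving 7 x 5 in image" message comes from. A slightly more explicit sketch passes the plot and the size directly. The names p and out_file are illustrative, and a temporary file is used so nothing is written to the working directory.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# out_file is a hypothetical path inside the session's temporary directory.
out_file <- tempfile(fileext = ".pdf")

# Name the plot and fix the size so the result does not depend on
# which plot was drawn last or on the current device size.
ggsave(out_file, plot = p, width = 7, height = 5, units = "in")

file.exists(out_file)
```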