Jan 23, 2018

Outline

We will spend the next couple of lectures studying R. I'll closely follow a few great books by Hadley Wickham.

  • Data wrangling (import, visualization, transformation, tidy).
    R for Data Science by Garrett Grolemund and Hadley Wickham.

  • R programming, Rcpp.
    Advanced R by Hadley Wickham.

  • R package development.
    R Packages by Hadley Wickham.

  • Web applications.

  • Interface with SQL and Apache Spark.

A typical data science project:

Tidyverse

  • tidyverse is a collection of R packages that make data wrangling easy.

  • Install tidyverse from RStudio menu Tools -> Install Packages... or

    install.packages("tidyverse")

  • After installation, load tidyverse by

    library("tidyverse")

mpg data

  • mpg data is available from the ggplot2 package:

    mpg
    ## # A tibble: 234 x 11
    ##    manufac… model   displ  year   cyl trans  drv     cty   hwy fl    class
    ##    <chr>    <chr>   <dbl> <int> <int> <chr>  <chr> <int> <int> <chr> <chr>
    ##  1 audi     a4       1.80  1999     4 auto(… f        18    29 p     comp…
    ##  2 audi     a4       1.80  1999     4 manua… f        21    29 p     comp…
    ##  3 audi     a4       2.00  2008     4 manua… f        20    31 p     comp…
    ##  4 audi     a4       2.00  2008     4 auto(… f        21    30 p     comp…
    ##  5 audi     a4       2.80  1999     6 auto(… f        16    26 p     comp…
    ##  6 audi     a4       2.80  1999     6 manua… f        18    26 p     comp…
    ##  7 audi     a4       3.10  2008     6 auto(… f        18    27 p     comp…
    ##  8 audi     a4 qua…  1.80  1999     4 manua… 4        18    26 p     comp…
    ##  9 audi     a4 qua…  1.80  1999     4 auto(… 4        16    25 p     comp…
    ## 10 audi     a4 qua…  2.00  2008     4 manua… 4        20    28 p     comp…
    ## # ... with 224 more rows
  • displ: engine size, in litres.
    hwy: highway fuel efficiency, in miles per gallon (mpg).
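
  • Before plotting, it can help to check the size of the data and the levels of the grouping variables used on later slides. A minimal sketch, assuming only that ggplot2 is installed:

```r
library(ggplot2)

# mpg is a tibble shipped with ggplot2
dim(mpg)                 # number of rows and columns
sort(unique(mpg$class))  # distinct car types
sort(unique(mpg$drv))    # distinct drive types: "4", "f", "r"
```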

Aesthetic mappings

r4ds chapter 3.3

Scatter plot

  • hwy vs displ

    ggplot(data = mpg) + 
      geom_point(mapping = aes(x = displ, y = hwy))

  • Check available aesthetics for a geometric object by ?geom_point.

Color of points

  • Color points according to class:

    ggplot(data = mpg) + 
      geom_point(mapping = aes(x = displ, y = hwy, color = class))

Size of points

  • Assign different sizes to points according to class (ggplot2 warns that using size for a discrete variable is not advised):

    ggplot(data = mpg) + 
      geom_point(mapping = aes(x = displ, y = hwy, size = class))

Transparency of points

  • Assign different transparency levels to points according to class (ggplot2 warns that using alpha for a discrete variable is not advised):

    ggplot(data = mpg) + 
      geom_point(mapping = aes(x = displ, y = hwy, alpha = class))

Shape of points

  • Assign different shapes to points according to class:

    ggplot(data = mpg) + 
      geom_point(mapping = aes(x = displ, y = hwy, shape = class))

  • ggplot2 uses at most 6 shapes at a time by default. mpg has 7 classes, so the points of the seventh class (suv) go unplotted.
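
  • One possible way to keep all seven classes visible is to supply the shapes yourself with scale_shape_manual(). This is only a sketch: the shape codes 0 to 6 are an arbitrary choice made for illustration.

```r
library(ggplot2)

# Assumption: shape codes 0-6 are an arbitrary illustrative choice;
# any 7 valid shape codes would do.
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = 0:6)
p
```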

Manual setting of an aesthetic

  • Set the color of all points to be blue:

    ggplot(data = mpg) + 
      geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
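
  • A common slip is to put the fixed color inside aes(). The sketch below contrasts setting with mapping; layer_data() is used only to inspect the colors ggplot2 actually computed.

```r
library(ggplot2)

# Set: color is a fixed argument of the geom, outside aes()
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# Mapped: "blue" inside aes() is treated as a one-level variable,
# so ggplot2 picks a color from its default palette and adds a legend
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

unique(layer_data(p_set)$colour)     # "blue"
unique(layer_data(p_mapped)$colour)  # a palette color, not "blue"
```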

Facets

r4ds chapter 3.5

Facets

  • Facets divide a plot into subplots based on the values of one or more discrete variables.

  • A subplot for each car type:

    ggplot(data = mpg) + 
      geom_point(mapping = aes(x = displ, y = hwy)) + 
      facet_wrap(~ class, nrow = 2)

  • A subplot for each combination of drive type (rows) and car type (columns):

    ggplot(data = mpg) + 
      geom_point(mapping = aes(x = displ, y = hwy)) + 
      facet_grid(drv ~ class)
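
  • facet_grid() can also facet in one direction only by putting . on the other side of the formula. A small sketch; the choice of drv and cyl as faceting variables is just for illustration.

```r
library(ggplot2)

# Rows only: one panel per drive type, no column facets
p_rows <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

# Columns only: one panel per number of cylinders
p_cols <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)
```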

Geometric objects

r4ds chapter 3.6

geom_smooth(): smooth line

  • hwy vs displ line:

    ggplot(data = mpg) + 
      geom_smooth(mapping = aes(x = displ, y = hwy))

Different line types

  • Different line types according to drv:

    ggplot(data = mpg) + 
      geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))

Different line colors

  • Different line colors according to drv:

    ggplot(data = mpg) + 
      geom_smooth(mapping = aes(x = displ, y = hwy, color = drv))

Points and lines

  • A smooth line overlaid on a scatter plot:

    ggplot(data = mpg) + 
      geom_point(mapping = aes(x = displ, y = hwy)) + 
      geom_smooth(mapping = aes(x = displ, y = hwy))

  • Same as the following, where the mapping is set once in ggplot() and inherited by both layers:

    ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
      geom_point() + geom_smooth()

Aesthetics for each geometric object

  • Different aesthetics and data in different layers (filter() comes from dplyr):

    ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
      geom_point(mapping = aes(color = class)) + 
      geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)
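
  • A layer-specific mapping can also split a single layer into groups without touching the other layers. The sketch below fits one line per drive type; method = "lm" and formula = y ~ x are illustrative choices, not what the slides above use.

```r
library(ggplot2)

p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  # group = drv applies to this layer only; method and formula are
  # spelled out here as an illustrative choice (a straight-line fit)
  geom_smooth(mapping = aes(group = drv), method = "lm",
              formula = y ~ x, se = FALSE)
p
```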

Bar charts

r4ds chapter 3.7

diamonds data

  • diamonds data:

    diamonds
    ## # A tibble: 53,940 x 10
    ##    carat cut       color clarity depth table price     x     y     z
    ##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
    ##  1 0.230 Ideal     E     SI2      61.5  55.0   326  3.95  3.98  2.43
    ##  2 0.210 Premium   E     SI1      59.8  61.0   326  3.89  3.84  2.31
    ##  3 0.230 Good      E     VS1      56.9  65.0   327  4.05  4.07  2.31
    ##  4 0.290 Premium   I     VS2      62.4  58.0   334  4.20  4.23  2.63
    ##  5 0.310 Good      J     SI2      63.3  58.0   335  4.34  4.35  2.75
    ##  6 0.240 Very Good J     VVS2     62.8  57.0   336  3.94  3.96  2.48
    ##  7 0.240 Very Good I     VVS1     62.3  57.0   336  3.95  3.98  2.47
    ##  8 0.260 Very Good H     SI1      61.9  55.0   337  4.07  4.11  2.53
    ##  9 0.220 Fair      E     VS2      65.1  61.0   337  3.87  3.78  2.49
    ## 10 0.230 Very Good H     VS1      59.4  61.0   338  4.00  4.05  2.39
    ## # ... with 53,930 more rows

Bar chart

  • geom_bar() creates a bar chart:

    ggplot(data = diamonds) + 
      geom_bar(mapping = aes(x = cut))

  • Bar charts, like histograms, frequency polygons, smoothers, and boxplots, plot some computed variables instead of raw data.

  • Check available computed variables for a geometric object via help:

    ?geom_bar

  • Use stat_count() directly:

    ggplot(data = diamonds) + 
      stat_count(mapping = aes(x = cut))

  • geom_bar() uses stat_count() by default, and stat_count() uses geom_bar() as its default geom, so both calls draw the same chart.

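  • Display proportions (relative frequencies) instead of counts; group = 1 computes the proportions over all diamonds rather than within each cut:

    # ..prop.. is spelled after_stat(prop) in ggplot2 >= 3.3.0
    ggplot(data = diamonds) + 
      geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))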

  • Color the bar outlines (colour affects the border, not the interior):

    ggplot(data = diamonds) + 
      geom_bar(mapping = aes(x = cut, colour = cut))

  • Fill color:

    ggplot(data = diamonds) + 
      geom_bar(mapping = aes(x = cut, fill = cut))

  • Fill color according to another variable:

    ggplot(data = diamonds) + 
      geom_bar(mapping = aes(x = cut, fill = clarity))
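
  • To see why the proportion bar chart above needs group = 1, it may help to look at the variables stat_count() computes. A sketch, assuming ggplot2 >= 3.3.0 for after_stat():

```r
library(ggplot2)

# Assumption: after_stat() needs ggplot2 >= 3.3.0; older versions
# use the ..prop.. spelling shown above.
p_grouped <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

# Without group = 1, each cut is its own group, so every bar has prop = 1
p_ungrouped <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))

layer_data(p_grouped)[, c("x", "count", "prop")]
layer_data(p_ungrouped)[, c("x", "count", "prop")]
```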

Position adjustments

r4ds chapter 3.8

  • position_jitter() adds random noise to the x and y position of each element to avoid overplotting:

    ggplot(data = mpg) + 
      geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")

  • geom_jitter() is a shortcut for geom_point(position = "jitter"):

    ggplot(data = mpg) + 
      geom_jitter(mapping = aes(x = displ, y = hwy))

  • position_fill() stacks elements on top of one another and normalizes the height:

    ggplot(data = diamonds) + 
      geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")

  • position_dodge() arranges elements side by side:

    ggplot(data = diamonds) + 
      geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")

  • position_stack() stacks elements on top of each other (the default for geom_bar()):

    ggplot(data = diamonds) + 
      geom_bar(mapping = aes(x = cut, fill = clarity), position = "stack")
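
  • Passing a position object instead of a string gives control over the adjustment. The sketch below jitters horizontally only; the width, height, and seed values are assumptions chosen for illustration.

```r
library(ggplot2)

# width/height are in data units; the values and seed below are
# illustrative assumptions, not recommendations
p <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    position = position_jitter(width = 0.1, height = 0, seed = 1)
  )
p
```

    With height = 0 the hwy values are left untouched, and a fixed seed makes the jitter reproducible.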

Coordinate systems

r4ds chapter 3.9

  • A boxplot:

    ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
      geom_boxplot()

  • coord_cartesian() is the default Cartesian coordinate system; its xlim and ylim arguments zoom in without dropping data:

    ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
      geom_boxplot() + 
      coord_cartesian(xlim = c(0, 5))

  • coord_fixed() specifies aspect ratio:

    ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
      geom_boxplot() + 
      coord_fixed(ratio = 1/2)

  • coord_flip() flips the x- and y-axes:

    ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
      geom_boxplot() + 
      coord_flip()

  • A map:

    library("maps")
    nz <- map_data("nz")
    
    ggplot(nz, aes(long, lat, group = group)) +
      geom_polygon(fill = "white", colour = "black")

  • coord_quickmap() sets the aspect ratio correctly for maps:

    ggplot(nz, aes(long, lat, group = group)) +
      geom_polygon(fill = "white", colour = "black") +
      coord_quickmap()
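
  • As a possible alternative to coord_flip() for the boxplot above, recent ggplot2 versions can draw horizontal boxplots when the aesthetics are swapped. A sketch, assuming ggplot2 >= 3.3.0:

```r
library(ggplot2)

# Assumption: ggplot2 >= 3.3.0, where geom_boxplot() infers a
# horizontal orientation from a discrete y and a continuous x
p <- ggplot(data = mpg, mapping = aes(x = hwy, y = class)) +
  geom_boxplot()
p
```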

Graphics for communication

r4ds chapter 28

Title

  • A figure title should summarize the main finding:

    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point(aes(color = class)) +
      geom_smooth(se = FALSE) +
      labs(title = "Fuel efficiency generally decreases with engine size")

Subtitle and caption

  • ggplot(mpg, aes(displ, hwy)) +
      geom_point(aes(color = class)) +
      geom_smooth(se = FALSE) +
      labs(
        title = "Fuel efficiency generally decreases with engine size",
        subtitle = "Two seaters (sports cars) are an exception because of their light weight",
        caption = "Data from fueleconomy.gov"
      )

Axis labels

  • ggplot(mpg, aes(displ, hwy)) +
      geom_point(aes(colour = class)) +
      geom_smooth(se = FALSE) +
      labs(
        x = "Engine displacement (L)",
        y = "Highway fuel economy (mpg)"
      )

Math equations

  • df <- tibble(x = runif(10), y = runif(10))
    ggplot(df, aes(x, y)) + geom_point() +
      labs(
        x = quote(sum(x[i] ^ 2, i == 1, n)),
        y = quote(alpha + beta + frac(delta, theta))
      )

  • See ?plotmath for the supported expression syntax.
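
  • quote() keeps an expression exactly as typed, so it cannot include computed values. One option is base R's bquote(), which substitutes anything wrapped in .( ). The regression and the names fit, r2, and r2_label below are hypothetical, introduced only for this sketch.

```r
library(ggplot2)

# Hypothetical example: put the R-squared of a simple linear fit in the title
fit <- lm(hwy ~ displ, data = mpg)
r2 <- round(summary(fit)$r.squared, 2)

# bquote() works like quote(), but .(r2) is replaced by the value of r2
r2_label <- bquote(R^2 == .(r2))

p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  labs(title = r2_label)
p
```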

Annotations

  • Create labels by picking the most fuel-efficient car in each class:

    best_in_class <- mpg %>%
      group_by(class) %>%
      filter(row_number(desc(hwy)) == 1)
    best_in_class
    ## # A tibble: 7 x 11
    ## # Groups:   class [7]
    ##   manufacturer model displ  year   cyl trans drv     cty   hwy fl    class
    ##   <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
    ## 1 chevrolet    corv…  5.70  1999     8 manu… r        16    26 p     2sea…
    ## 2 dodge        cara…  2.40  1999     4 auto… f        18    24 r     mini…
    ## 3 nissan       alti…  2.50  2008     4 manu… f        23    32 r     mids…
    ## 4 subaru       fore…  2.50  2008     4 manu… 4        20    27 r     suv  
    ## 5 toyota       toyo…  2.70  2008     4 manu… 4        17    22 r     pick…
    ## 6 volkswagen   jetta  1.90  1999     4 manu… f        33    44 d     comp…
    ## 7 volkswagen   new …  1.90  1999     4 manu… f        35    44 d     subc…

  • Annotate points

    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point(aes(colour = class)) +
      geom_text(aes(label = model), data = best_in_class)

  • The ggrepel package automatically adjusts labels so that they don’t overlap:

    library("ggrepel")
    ggplot(mpg, aes(displ, hwy)) +
      geom_point(aes(colour = class)) +
      geom_point(size = 3, shape = 1, data = best_in_class) +
      ggrepel::geom_label_repel(aes(label = model), data = best_in_class)
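
  • The label data above can probably also be built with slice_max(). A sketch, assuming dplyr >= 1.0.0; when several cars tie for the best hwy in a class, the row that is kept is not guaranteed to match the row_number() version.

```r
library(ggplot2)
library(dplyr)

# Assumption: dplyr >= 1.0.0, which provides slice_max()
best_in_class <- mpg %>%
  group_by(class) %>%
  slice_max(hwy, n = 1, with_ties = FALSE) %>%  # one row per class
  ungroup()
best_in_class
```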

Scales

  • ggplot(mpg, aes(displ, hwy)) +
      geom_point(aes(colour = class))

    automatically adds default scales, so it is equivalent to

    ggplot(mpg, aes(displ, hwy)) +
      geom_point(aes(colour = class)) +
      scale_x_continuous() +
      scale_y_continuous() +
      scale_colour_discrete()

  • breaks control the positions of the ticks:

    ggplot(mpg, aes(displ, hwy)) +
      geom_point() +
      scale_y_continuous(breaks = seq(15, 40, by = 5))

  • labels control the text at each tick; NULL suppresses the labels:

    ggplot(mpg, aes(displ, hwy)) +
      geom_point() +
      scale_x_continuous(labels = NULL) +
      scale_y_continuous(labels = NULL)

  • Plot the y-axis on a log scale:

    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point() +
      scale_y_log10()

  • Plot the x-axis in reverse order:

    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point() +
      scale_x_reverse()
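
  • Besides a vector or NULL, labels also accepts a function that turns break values into text. A minimal sketch; mpg_label is a hypothetical helper written for this example.

```r
library(ggplot2)

# Hypothetical helper: appends a unit to each break value
mpg_label <- function(x) paste0(x, " mpg")

p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_y_continuous(breaks = seq(15, 40, by = 5), labels = mpg_label)
p
```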

Legends

  • Set legend position: "left", "right", "top", "bottom", or "none":

    ggplot(mpg, aes(displ, hwy)) +
      geom_point(aes(colour = class)) + 
      theme(legend.position = "left")

Zooming

  • Without clipping (zooms the view; all data points are kept and the smoother is fit to the full data):

    ggplot(mpg, mapping = aes(displ, hwy)) +
      geom_point(aes(color = class)) +
      geom_smooth() +
      coord_cartesian(xlim = c(5, 7), ylim = c(10, 30))

  • With clipping (data points outside the limits are removed before the smoother is fit):

    ggplot(mpg, mapping = aes(displ, hwy)) +
      geom_point(aes(color = class)) +
      geom_smooth() +
      xlim(5, 7) + ylim(10, 30)

  • Setting limits on the scales clips the data in the same way as xlim() and ylim():

    ggplot(mpg, mapping = aes(displ, hwy)) +
      geom_point(aes(color = class)) +
      geom_smooth() +
      scale_x_continuous(limits = c(5, 7)) +
      scale_y_continuous(limits = c(10, 30))

  • Filtering the data before plotting has a similar effect:

    mpg %>%
      filter(displ >= 5, displ <= 7, hwy >= 10, hwy <= 30) %>%
      ggplot(aes(displ, hwy)) +
      geom_point(aes(color = class)) +
      geom_smooth()
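
  • The difference between zooming and clipping can be checked by counting how many points still have both coordinates after ggplot2 builds the plot. A sketch; n_kept is a hypothetical helper, and the smoother is left out to keep the example small.

```r
library(ggplot2)

base <- ggplot(mpg, aes(displ, hwy)) + geom_point()

p_zoom <- base + coord_cartesian(xlim = c(5, 7), ylim = c(10, 30))
p_clip <- base + xlim(5, 7) + ylim(10, 30)

# Hypothetical helper: count points that keep both coordinates
# after the build step (out-of-limit values become NA or are dropped)
n_kept <- function(p) {
  d <- layer_data(p)
  sum(!is.na(d$x) & !is.na(d$y))
}

n_kept(p_zoom)  # every row of mpg survives; only the view is zoomed
n_kept(p_clip)  # only rows inside both limits survive
```

    Any statistic computed in a clipped plot, such as geom_smooth(), only sees the surviving rows.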

Themes

  • ggplot(mpg, aes(displ, hwy)) +
      geom_point(aes(color = class)) +
      geom_smooth(se = FALSE) +
      theme_bw()

Saving plots

  • ggsave() saves the most recently displayed plot by default:

    ggplot(mpg, aes(displ, hwy)) + geom_point()
    ggsave("my-plot.pdf")
    ## Saving 5 x 3.5 in image
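
  • For reproducible output it may be safer to pass the plot and its size explicitly rather than rely on the last displayed plot and the current device size. A sketch; the output path and the 6 x 4 inch size are assumptions made for illustration.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# Hypothetical output path; a temporary directory keeps the sketch tidy
out_file <- file.path(tempdir(), "my-plot.pdf")

# Passing plot, width, and height explicitly avoids depending on the
# last displayed plot and the current device size (sizes are in inches)
ggsave(out_file, plot = p, width = 6, height = 4)
```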