We will spend the next couple of lectures studying some R packages for typical data science projects.
Data wrangling (import, visualization, transformation, tidy).
R for Data Science by Garrett Grolemund and Hadley Wickham.
Web applications by Shiny.
Interface with databases, e.g., SQL and Apache Spark.
A typical data science project:
tidyverse is a collection of R packages that make data wrangling easy.
Install tidyverse from the RStudio menu Tools -> Install Packages... or
install.packages("tidyverse")
After installation, load tidyverse by
library("tidyverse")
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0 ✔ purrr 0.2.5
## ✔ tibble 2.0.1 ✔ dplyr 0.7.8
## ✔ tidyr 0.8.2 ✔ stringr 1.3.1
## ✔ readr 1.3.1 ✔ forcats 0.3.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
“The simple graph has brought more information to the data analyst’s mind than any other device.”
John Tukey
mpg data
The mpg data is available from the ggplot2 package:
mpg
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 q… 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 q… 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 q… 2 2008 4 manu… 4 20 28 p comp…
## # … with 224 more rows
Tibbles, a generalized form of data frames, are used extensively in the tidyverse.
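To make the relationship to data frames concrete, here is a minimal sketch that converts an ordinary data frame into a tibble. The toy data frame `df` is made up for illustration and is not part of the lecture data.

```r
library(tibble)

# A small base R data frame (hypothetical toy data)
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# Convert it to a tibble; the data are unchanged, only the class is extended
tb <- as_tibble(df)

class(tb)          # a tibble carries the "data.frame" class alongside "tbl_df"
is.data.frame(tb)  # so functions that expect a data frame still accept it
```

Because a tibble is still a data frame, `mpg` can be passed to functions written for plain data frames.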
displ: engine size, in litres.
hwy: highway fuel efficiency, in miles per gallon (mpg).
hwy vs displ
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
An aesthetic maps data to a specific feature of the plot.
Check available aesthetics for a geometric object by ?geom_point.
Color points according to class:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
Assign different sizes to points according to class:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = class))
Assign different transparency levels to points according to class:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
## Warning: Using alpha for a discrete variable is not advised.
Assign different shapes to points according to class:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
Maximum of 6 shapes at a time. By default, additional groups will go unplotted.
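Since mpg has 7 classes, one class loses its points under the default shape palette. One possible workaround, sketched below, is to supply the shape codes by hand with scale_shape_manual(); the particular codes 0 to 6 are an arbitrary choice for illustration.

```r
library(ggplot2)

# mpg has 7 classes, one more than the default shape palette provides.
# Supplying 7 shape codes by hand gives every class its own symbol.
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = 0:6)

# Number of points left without a shape after the manual scale is applied
sum(is.na(layer_data(p)$shape))
```

With more than a handful of groups, color or facets are usually easier to read than shapes.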
Set the color of all points to be blue:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
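A common slip is to put color = "blue" inside aes(). The sketch below contrasts the two placements by inspecting the colours ggplot2 actually assigns to the points; the object names `p_outside` and `p_inside` are just illustrative.

```r
library(ggplot2)

# Outside aes(): "blue" is a fixed setting applied to every point
p_outside <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# Inside aes(): "blue" is treated as a one-level data variable and is
# mapped through the default colour scale, so the points are not blue
p_inside <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

unique(layer_data(p_outside)$colour)
unique(layer_data(p_inside)$colour)
```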
Facets divide a plot into subplots based on the values of one or more discrete variables.
A subplot for each car type:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
A subplot for each car type and drive:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ class)
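In the facet_grid() formula, the left-hand side gives the rows and the right-hand side gives the columns. A dot can stand in for "no variable" on either side, as in this small sketch (the choice of drv and cyl here is only for illustration):

```r
library(ggplot2)

# One row of panels per drive type, a single column
p_rows <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

# One column of panels per number of cylinders, a single row
p_cols <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)
```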
geom_smooth(): smooth line
hwy vs displ line:
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))
Different line types according to drv:
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
Different line colors according to drv:
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, color = drv))
Lines overlaid over scatter plot:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(x = displ, y = hwy))
Same as
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() + geom_smooth()
Different aesthetics in different layers:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)
diamonds data
diamonds data:
diamonds
## # A tibble: 53,940 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## # … with 53,930 more rows
geom_bar() creates a bar chart:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
Bar charts, like histograms, frequency polygons, smoothers, and boxplots, plot some computed variables instead of raw data.
Check available computed variables for a geometric object via help:
?geom_bar
Use stat_count() directly:
ggplot(data = diamonds) +
stat_count(mapping = aes(x = cut))
stat_count() has a default geom geom_bar().
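To see the computed variables themselves, one option is ggplot2's layer_data(), which returns the data a layer draws after the stat has run. The sketch below checks that the bar heights agree with a plain frequency table; `bar_data` is just an illustrative name.

```r
library(ggplot2)

p <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# Data after stat_count() has run: one row per bar,
# with the computed variables count and prop
bar_data <- layer_data(p)
bar_data[, c("x", "count", "prop")]

# The same counts from a base R frequency table
table(diamonds$cut)
```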
Display relative frequencies (proportions) instead of counts:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))
Note that the aesthetic mapping group = 1 overrides the default grouping (by cut) by treating all observations as a single group. Without this we get
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = ..prop..))
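The effect of the grouping can be checked numerically. In this sketch the computed prop variable is read off with layer_data() rather than plotted; the y mapping is left out because prop is computed either way.

```r
library(ggplot2)

# All observations in one group: prop is each cut's share of the total
p_one_group <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, group = 1))

# Default grouping by cut: each bar is 100% of its own group
p_default <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

prop_one_group <- layer_data(p_one_group)$prop
prop_default <- layer_data(p_default)$prop

sum(prop_one_group)  # proportions across the five cuts add up to one
prop_default         # every proportion equals one
```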
Color bar:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, colour = cut))
Fill color:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut))
Fill color according to another variable:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity))
position_jitter() adds random noise to the x and y position of each element to avoid overplotting:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
geom_jitter() is similar:
ggplot(data = mpg) +
geom_jitter(mapping = aes(x = displ, y = hwy))
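Because the noise is random, the amount of jitter and its reproducibility are often worth controlling. A minimal sketch using position_jitter() with explicit arguments follows; the width, height, and seed values are arbitrary choices for illustration.

```r
library(ggplot2)

# width and height bound the noise in data units; seed fixes the random draw
p <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    position = position_jitter(width = 0.1, height = 0.5, seed = 123)
  )

# Building the layer twice gives identical jittered positions
first <- layer_data(p)
second <- layer_data(p)
identical(first$x, second$x)
```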
position_fill() stacks elements on top of one another and normalizes the height:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
position_dodge() arranges elements side by side:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")
position_stack() stacks elements on top of each other:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "stack")
A boxplot:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()
coord_cartesian() is the default Cartesian coordinate system:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot() +
coord_cartesian(xlim = c(0, 5))
coord_fixed() specifies the aspect ratio (y / x):
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot() +
coord_fixed(ratio = 1/2)
coord_flip() flips the x- and y-axes:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot() +
coord_flip()
A map:
library("maps")
nz <- map_data("nz")
head(nz, 20)
## long lat group order region subregion
## 1 172.7433 -34.44215 1 1 North.Island <NA>
## 2 172.7983 -34.45562 1 2 North.Island <NA>
## 3 172.8528 -34.44846 1 3 North.Island <NA>
## 4 172.8986 -34.41786 1 4 North.Island <NA>
## 5 172.9593 -34.42503 1 5 North.Island <NA>
## 6 173.0184 -34.39895 1 6 North.Island <NA>
## 7 173.0229 -34.44662 1 7 North.Island <NA>
## 8 173.0184 -34.49343 1 8 North.Island <NA>
## 9 172.9616 -34.50426 1 9 North.Island <NA>
## 10 172.9181 -34.47367 1 10 North.Island <NA>
## 11 172.9353 -34.52225 1 11 North.Island <NA>
## 12 172.8808 -34.51504 1 12 North.Island <NA>
## 13 172.9049 -34.55646 1 13 North.Island <NA>
## 14 172.9553 -34.53303 1 14 North.Island <NA>
## 15 172.9376 -34.57806 1 15 North.Island <NA>
## 16 172.9760 -34.61227 1 16 North.Island <NA>
## 17 172.9926 -34.56723 1 17 North.Island <NA>
## 18 173.0218 -34.61404 1 18 North.Island <NA>
## 19 173.0396 -34.65902 1 19 North.Island <NA>
## 20 173.0676 -34.70044 1 20 North.Island <NA>
ggplot(nz, aes(x = long, y = lat, group = group)) +
geom_polygon(fill = "white", colour = "black")
coord_quickmap() sets the aspect ratio so that maps are drawn to scale:
ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black") +
coord_quickmap()
Figure title should be descriptive:
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = class)) +
geom_smooth(se = FALSE) +
labs(title = "Fuel efficiency generally decreases with engine size")
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
geom_smooth(se = FALSE) +
labs(
x = "Engine displacement (L)",
y = "Highway fuel economy (mpg)"
)
df <- tibble(x = runif(10), y = runif(10))
ggplot(df, aes(x, y)) + geom_point() +
labs(
x = quote(sum(x[i] ^ 2, i == 1, n)),
y = quote(alpha + beta + frac(delta, theta))
)
?plotmath
Create labels
best_in_class <- mpg %>%
group_by(class) %>%
filter(row_number(desc(hwy)) == 1)
best_in_class
## # A tibble: 7 x 11
## # Groups: class [7]
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 chevrolet corve… 5.7 1999 8 manu… r 16 26 p 2sea…
## 2 dodge carav… 2.4 1999 4 auto… f 18 24 r mini…
## 3 nissan altima 2.5 2008 4 manu… f 23 32 r mids…
## 4 subaru fores… 2.5 2008 4 manu… 4 20 27 r suv
## 5 toyota toyot… 2.7 2008 4 manu… 4 17 22 r pick…
## 6 volkswagen jetta 1.9 1999 4 manu… f 33 44 d comp…
## 7 volkswagen new b… 1.9 1999 4 manu… f 35 44 d subc…
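Note that row_number() breaks ties arbitrarily, so exactly one car per class is kept even when several share the top hwy value. If ties should be kept, one alternative (an assumption about what might be wanted, not part of the original notes) is min_rank():

```r
library(dplyr)
library(ggplot2)  # provides the mpg data

# One row per class: ties are broken by row order
best_one <- mpg %>%
  group_by(class) %>%
  filter(row_number(desc(hwy)) == 1)

# All rows that share the maximum hwy within their class
best_with_ties <- mpg %>%
  group_by(class) %>%
  filter(min_rank(desc(hwy)) == 1)

nrow(best_one)        # one row for each of the 7 classes
nrow(best_with_ties)  # at least as many rows, more if there are ties
```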
Annotate points
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(colour = class)) +
geom_text(aes(label = model), data = best_in_class)
The ggrepel package automatically adjusts labels so that they don’t overlap:
library("ggrepel")
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
geom_point(size = 3, shape = 1, data = best_in_class) +
ggrepel::geom_label_repel(aes(label = model), data = best_in_class)
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class))
automatically adds default scales; it is equivalent to
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
scale_x_continuous() +
scale_y_continuous() +
scale_colour_discrete()
breaks
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_y_continuous(breaks = seq(15, 40, by = 5))
labels
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_x_continuous(labels = NULL) +
scale_y_continuous(labels = NULL)
Plot the y-axis on a log scale:
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
scale_y_log10()
Plot x-axis in reverse order:
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
scale_x_reverse()
Set legend position: "left", "right", "top", "bottom", "none":
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
theme(legend.position = "left")
See the following link for more details on how to change the title, labels, … of a legend.
Without clipping (keeps unseen data points)
ggplot(mpg, mapping = aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth() +
coord_cartesian(xlim = c(5, 7), ylim = c(10, 30))
With clipping (removes unseen data points)
ggplot(mpg, mapping = aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth() +
xlim(5, 7) + ylim(10, 30)
ggplot(mpg, mapping = aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth() +
scale_x_continuous(limits = c(5, 7)) +
scale_y_continuous(limits = c(10, 30))
mpg %>%
filter(displ >= 5, displ <= 7, hwy >= 10, hwy <= 30) %>%
ggplot(aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth()
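The difference between zooming and clipping shows up in what the smoother is fitted to. The sketch below compares the x range of the fitted curve under both approaches. It uses method = "lm" instead of the default smoother purely to keep the example fast and deterministic, which is an assumption made for this illustration.

```r
library(ggplot2)

base <- ggplot(mpg, mapping = aes(displ, hwy)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Zoom only: the fit still uses every observation
p_zoom <- base + coord_cartesian(xlim = c(5, 7), ylim = c(10, 30))

# Clip: observations outside the limits are dropped before fitting
p_clip <- base + xlim(5, 7) + ylim(10, 30)

# Layer 2 is the smoother; ggplot2 warns about the dropped rows when clipping
smooth_zoom <- layer_data(p_zoom, 2)
smooth_clip <- suppressWarnings(layer_data(p_clip, 2))

range(smooth_zoom$x)  # spans the full range of displ
range(smooth_clip$x)  # restricted to the clipped window
```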
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth(se = FALSE) +
theme_bw()
ggplot(mpg, aes(displ, hwy)) + geom_point()
ggsave("my-plot.pdf")
## Saving 7 x 5 in image
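By default ggsave() saves the last plot displayed at the size of the current graphics device. For reproducible output it can help to pass the plot and its dimensions explicitly, as in this sketch; the temporary output path and the 6 x 4 inch size are arbitrary choices for illustration.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# Write to a temporary directory so the example leaves no files behind
out_file <- file.path(tempdir(), "my-plot.pdf")

# The file type is inferred from the extension; width and height are in inches
ggsave(out_file, plot = p, width = 6, height = 4)

file.exists(out_file)
```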
The RStudio cheat sheet is extremely helpful.