Previous lesson: Controlling the appearance of plots
In exploratory data visualization, there are often too many variables to explore in a single visualization. In this lesson we will see how to reduce the complexity of the data by reducing the number of variables and by reducing the number of levels of categorical variables.
Learning objectives At the end of this lesson, the learner will be able to:
The patchwork package can be used to control the position of several subplots within a larger combined plot. See this chapter for details.
The easiest way to arrange the subplots is to assign each one to a named object, then arrange them using the format allowed by patchwork.
Create the plots and assign them to named objects:
library(ggplot2)
library(patchwork)

plot1 <- ggplot(hemoglobin_frame, aes(hemoglobin)) +
  geom_density(alpha = 0.2, aes(fill = population, color = population)) + # use alpha to control transparency
  theme(legend.position = c(0.8, 0.8))

plot2 <- ggplot(hemoglobin_frame, aes(sample = hemoglobin)) +
  geom_qq(aes(color = population)) +
  stat_qq_line(aes(color = population)) +
  theme(legend.position = c(0.8, 0.2))
To place plots side-by-side, use the + operator:
plot1 + plot2
To place plots on top of each other, use the / operator:
plot1 / plot2
If you want to apply letter or number tags to the subplots, use plot_annotation(). The style “A” will use A, B, C, etc.:
plot1 + plot2 + plot_annotation(tag_levels = c("A"))
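The two operators can also be combined in a single expression. The following is a minimal sketch, assuming the ggplot2 and patchwork packages are installed. It uses the built-in mtcars data rather than hemoglobin_frame, and the object names p_scatter, p_hist, p_box, and combined are made up for illustration.

```r
library(ggplot2)
library(patchwork)

# Three small plots built from the built-in mtcars dataset (illustrative only)
p_scatter <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
p_hist <- ggplot(mtcars, aes(mpg)) + geom_histogram(bins = 10)
p_box <- ggplot(mtcars, aes(factor(cyl), mpg)) + geom_boxplot()

# Parentheses group the top row; "/" stacks the third plot underneath it
combined <- (p_scatter + p_hist) / p_box +
  plot_annotation(tag_levels = "A")

combined
```

The parentheses matter here because / is evaluated before + in R, so without them p_hist would be stacked on p_box first.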
The inset_element() function can be used to place a subplot within the limits of another. The position is controlled by the left, right, top, and bottom arguments. Trial and error is usually needed to get it in the right spot.
ggplot(hemoglobin_frame, aes(hemoglobin)) +
  geom_density(alpha = 0.2, aes(fill = population, color = population)) +
  inset_element(qq_no_legend, left = 0.5, bottom = 0.6, right = 0.95, top = 0.95)
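For a version that runs on its own, here is a hedged sketch using the built-in mtcars data in place of hemoglobin_frame. The names main_plot, inset_plot, and with_inset are hypothetical, and suppressing the legend with theme(legend.position = "none") is only one plausible way an inset without a legend could be prepared.

```r
library(ggplot2)
library(patchwork)

# Main plot: density of mpg by number of cylinders (built-in mtcars data)
main_plot <- ggplot(mtcars, aes(mpg)) +
  geom_density(alpha = 0.2, aes(fill = factor(cyl), color = factor(cyl)))

# Inset plot with its legend suppressed so it fits in a small area
inset_plot <- ggplot(mtcars, aes(sample = mpg)) +
  geom_qq(aes(color = factor(cyl))) +
  stat_qq_line(aes(color = factor(cyl))) +
  theme(legend.position = "none")

# left/bottom/right/top are fractions (0 to 1) of the panel area by default
with_inset <- main_plot +
  inset_element(inset_plot, left = 0.5, bottom = 0.6, right = 0.95, top = 0.95)
```

Changing the four numbers in small steps and re-drawing is the quickest way to find a position that does not cover the data.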
Often a dataset will have many variables (continuous and categorical) to visualize at once. There may also be so many levels in a categorical variable that there is too much data to display. We can use functions from the dplyr library as well as built-in features of ggplot to reduce this complexity.
The dataset can be simplified by filtering rows to a single value of several of the variables. Those variable columns can then be ignored.
The following code limits the dataset to only delays caused by late aircraft at O’Hare airport:
ord <- airline %>%
  filter(`Airport Name` == "Chicago O'Hare International") %>%
  filter(`Ontime Category` == "Delayed by Late Aircraft")
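The same kind of filtering can be tried on a small stand-in table. This is a minimal sketch, assuming dplyr is installed; airline_toy and its values are made up, and only the column names are taken from the lesson.

```r
library(dplyr)

# Hypothetical miniature version of the airline data (values are made up)
airline_toy <- data.frame(
  `Airport Name` = c("Chicago O'Hare International", "Chicago O'Hare International",
                     "Denver International", "Denver International"),
  `Ontime Category` = c("Delayed by Late Aircraft", "Delayed by Weather",
                        "Delayed by Late Aircraft", "Delayed by Weather"),
  `Carrier Name` = c("United", "United", "Southwest", "Southwest"),
  check.names = FALSE  # keep the column names with spaces as they are
)

# Conditions separated by commas inside one filter() are combined with AND,
# which is equivalent to chaining two filter() calls
ord_toy <- airline_toy %>%
  filter(`Airport Name` == "Chicago O'Hare International",
         `Ontime Category` == "Delayed by Late Aircraft")

# The filtered columns now hold a single value each and can be ignored
unique(ord_toy$`Airport Name`)
```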
The group argument can be used within an aesthetic to indicate that a particular categorical variable should be used to generate multiple objects within a particular geom. In this example, a separate time series of delays, drawn with lines and points, is generated for each airline:
ggplot(ord, aes(x = Date, y = `Minutes of Delay per Flight`, group = `Carrier Name`)) +
  geom_line(na.rm = TRUE) +
  geom_point()
Because there are 14 airlines, there are so many lines that it’s impossible to tell what’s going on. Typically, instead of simply grouping using group, we group and also assign a different value of an aesthetic to each group. For example, we can group and give each airline a different color. There are still many similar colors, so we can also use the airline to control the shape of the points using an aesthetic specific to the geom_point geom.
Since there is only a small number of default shapes, we have to specify the shapes to be used manually by creating a vector of shape numbers.
shapes <- 0:19

ggplot(ord, aes(x = Date, y = `Minutes of Delay per Flight`, color = `Carrier Name`)) +
  geom_line(na.rm = TRUE) +
  geom_point(aes(shape = `Carrier Name`), size = 3) +
  scale_shape_manual(values = shapes)
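To see the manual shape scale at work without the airline data, here is a small sketch with made-up data. The names toy, toy_shapes, p, and n_shapes are hypothetical. It uses eight groups because the default ggplot2 shape palette only supplies six shapes for discrete values.

```r
library(ggplot2)

# Hypothetical data: 8 groups, more than the 6 shapes in the default palette
toy <- data.frame(
  x = rep(1:3, times = 8),
  y = rep(1:8, each = 3),
  grp = rep(paste("group", 1:8), each = 3)
)

# One shape code per group (codes 0 to 7)
toy_shapes <- 0:7

p <- ggplot(toy, aes(x, y, color = grp)) +
  geom_line() +
  geom_point(aes(shape = grp), size = 3) +
  scale_shape_manual(values = toy_shapes)

# Count the distinct shapes actually assigned in the point layer (layer 2)
n_shapes <- length(unique(ggplot_build(p)$data[[2]]$shape))
```

The vector passed to values must contain at least as many shape codes as there are levels, which is why the lesson supplies 20 codes for 14 airlines.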
Instead of just eliminating the airport category, we can combine the information we are interested in using the summarize() function from the dplyr library.
In this case, it’s a bit more complicated because we need to summarize two different variables (minutes of delay and number of flights) in order to calculate the mean delay over all airports instead of just a single airport (which we were using in the previous plot).
all_airports_total_flights <- airline %>%
  filter(`Ontime Category` == "Delayed by Late Aircraft") %>%
  group_by(`Carrier Name`, `Date`) %>% # group_by() is from the dplyr package
  summarize(total_flights = sum(`Number of Flights`)) # summarize() also from dplyr

all_airports_total_delays <- airline %>%
  filter(`Ontime Category` == "Delayed by Late Aircraft") %>%
  group_by(`Carrier Name`, `Date`) %>%
  summarize(total_delay = sum(`Minutes of Delay`))
In this example, we group by the two variables we want to keep in the data for future use: the carrier name and the date (used for the time series).
Since the resulting two data frames have the same number of rows in the same order, we can calculate the mean delay by dividing the summarized delay totals by the summarized total number of flights. We can do this as we create a single data frame:
mean_delay <- data.frame(
  carrier = all_airports_total_delays$`Carrier Name`,
  date = all_airports_total_delays$`Date`,
  mean_delay = all_airports_total_delays$total_delay / all_airports_total_flights$total_flights
)
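The calculation above depends on both summary tables having their rows in the same order. As an alternative sketch (not the approach used in this lesson), both totals can be computed in a single summarize() call so that the division happens row by row within one table. The data frame airline_toy and its values are made up, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical miniature airline data: two carriers, two airports, one date
airline_toy <- data.frame(
  `Carrier Name` = c("A", "A", "B", "B"),
  Date = c("2016-01", "2016-01", "2016-01", "2016-01"),
  `Airport Name` = c("X", "Y", "X", "Y"),
  `Minutes of Delay` = c(100, 50, 30, 30),
  `Number of Flights` = c(10, 5, 10, 20),
  check.names = FALSE
)

mean_delay_toy <- airline_toy %>%
  group_by(`Carrier Name`, Date) %>%
  summarize(
    total_delay = sum(`Minutes of Delay`),     # summed over airports
    total_flights = sum(`Number of Flights`),  # summed over airports
    .groups = "drop"                           # return an ungrouped table
  ) %>%
  mutate(mean_delay = total_delay / total_flights)

mean_delay_toy
```

For carrier A this gives (100 + 50) / (10 + 5) = 10 minutes per flight, and for carrier B (30 + 30) / (10 + 20) = 2.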
The resulting data frame can now be used to create the same kind of plot as before, but based on all airports and not just one:
ggplot(mean_delay, aes(x = date, y = mean_delay, color = carrier)) +
  geom_line() +
  geom_point(aes(shape = carrier), size = 3) +
  scale_shape_manual(values = shapes)
I didn’t have to put na.rm = TRUE as an argument for geom_line since there are no missing data.
If I want to see the overall trend for all airlines superimposed on top of the plot, I can add a geom_smooth. However, I only want the grouping by color to apply to the line and point plots, and NOT to the smooth geom because I want it to apply to all airlines. So I need to remove the color = carrier grouping argument as a global mapping and put it only in the specific geoms to which it should apply (line and point):
ggplot(mean_delay, aes(x = date, y = mean_delay)) +
  geom_line(aes(color = carrier)) +
  geom_point(aes(shape = carrier, color = carrier), size = 3) +
  scale_shape_manual(values = shapes) +
  geom_smooth(linewidth = 2, se = FALSE) # linewidth replaces size for lines in ggplot2 3.4.0 and later
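The effect of moving the color mapping into individual geoms can be checked by counting the groups in each layer. The sketch below uses made-up data (toy_delay) and a linear fit, chosen here only to keep the example quiet on a tiny dataset; the lesson itself uses the default smoother.

```r
library(ggplot2)

# Hypothetical mean delays for three carriers over six months (made-up values)
toy_delay <- data.frame(
  month = rep(1:6, times = 3),
  carrier = rep(c("A", "B", "C"), each = 6),
  mean_delay = c(5, 6, 7, 6, 8, 9,
                 2, 3, 2, 4, 3, 5,
                 9, 8, 10, 11, 10, 12)
)

p <- ggplot(toy_delay, aes(x = month, y = mean_delay)) +
  geom_line(aes(color = carrier)) +                        # grouped: one line per carrier
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)  # ungrouped: one trend line

built <- ggplot_build(p)
line_groups <- length(unique(built$data[[1]]$group))    # groups in the line layer
smooth_groups <- length(unique(built$data[[2]]$group))  # groups in the smooth layer
```

If color = carrier were left in the global aes(), the smooth layer would inherit it and draw one trend line per carrier instead of one overall.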
This exploration has allowed me to identify four types of airlines for further exploration based on the length of delay and consistency: United (consistently bad), Southwest (consistently good), Hawaiian (inconsistently good), Virgin (inconsistently bad).
I can filter using the logical OR operator | to screen out the other airlines:
small_mean_delay <- mean_delay %>%
  filter(carrier == "United" | carrier == "Virgin" | carrier == "Hawaiian" | carrier == "Southwest")
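When the list of values grows, a chain of OR conditions becomes hard to read. A common alternative, sketched here with a made-up stand-in table (mean_delay_toy), is the %in% operator, which tests each value against a vector of allowed names.

```r
library(dplyr)

# Hypothetical stand-in for mean_delay (values are made up)
mean_delay_toy <- data.frame(
  carrier = c("United", "Virgin", "Delta", "Hawaiian", "Southwest", "Alaska"),
  mean_delay = c(12, 15, 8, 4, 5, 7)
)

keep <- c("United", "Virgin", "Hawaiian", "Southwest")

# %in% tests each carrier against the whole vector of names
small_in <- mean_delay_toy %>% filter(carrier %in% keep)

# Same result written with chained OR conditions
small_or <- mean_delay_toy %>%
  filter(carrier == "United" | carrier == "Virgin" |
           carrier == "Hawaiian" | carrier == "Southwest")
```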
We can also reorder the factor levels so that they appear in the order that makes the most sense to us. Here four_airline is assumed to be the airline data limited to the same four carriers:
four_airline$`Carrier Name` <- factor(
  four_airline$`Carrier Name`,
  levels = c("Southwest", "United", "Hawaiian", "Virgin")
)
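The difference between the default level order and an explicit one can be seen with a short base R sketch. The vector carriers is made up for illustration.

```r
# Hypothetical carrier column stored as plain character strings
carriers <- c("United", "Virgin", "Southwest", "Hawaiian", "United")

# Without explicit levels, factor() orders the levels alphabetically
default_order <- levels(factor(carriers))

# Supplying levels fixes the order used for legends, axes, and facets
custom_order <- levels(factor(
  carriers,
  levels = c("Southwest", "United", "Hawaiian", "Virgin")
))

default_order
custom_order
```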
The point and line geoms apply to only a single value of the Y variable per object plotted. If we switch to a plot that produces an object that summarizes a variable, we can visualize the effect of that variable without eliminating it.
For example, in the last plot we eliminated the airport as a variable by averaging over it. If instead we use the geom_boxplot geom, it will visualize the distribution of delay times over all airports instead of collapsing them to a single point.
Generating multiple boxplot geoms requires the X variable to be discrete rather than continuous. We can accomplish this by turning the continuous Date variable into a factor before generating the plot:
ggplot(four_airline, aes(x = as.factor(Date), y = `Minutes of Delay per Flight`, color = `Carrier Name`)) +
  geom_boxplot() +
  guides(x = guide_axis(angle = 90))
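The effect of converting a date to a factor can be checked on a small made-up dataset. In this sketch, toy, p_box, and n_boxes are hypothetical names, and the dates are arbitrary.

```r
library(ggplot2)

# Hypothetical delays: three dates with five observations each
set.seed(1)
toy <- data.frame(
  Date = rep(as.Date(c("2016-01-01", "2016-02-01", "2016-03-01")), each = 5),
  delay = runif(15, min = 0, max = 30)
)

# as.factor() makes the x variable discrete, so each date gets its own box
p_box <- ggplot(toy, aes(x = as.factor(Date), y = delay)) +
  geom_boxplot() +
  guides(x = guide_axis(angle = 90))

# Each row of the built layer data corresponds to one box
n_boxes <- nrow(ggplot_build(p_box)$data[[1]])
```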
Because the date labels are so long, I rotated them by 90 degrees.
The previous plot has too much information packed into a single plot. Faceting systematically generates subplots based on the values of a discrete variable.
The plot to be used in the facets must have the grouping argument removed from the aesthetic. In this case the color argument was removed from the previous example. The subplot axis labels are also set to NULL because they will be applied to the overall plot.
base <- ggplot(four_airline, aes(x = as.factor(Date), y = `Minutes of Delay per Flight`)) +
  geom_boxplot() +
  xlab(NULL) +
  ylab(NULL) +
  guides(x = guide_axis(angle = 90))
Faceting on a single variable uses the facet_wrap function. The variable (or a function) is specified after a tilde:
base + facet_wrap(~`Carrier Name`, nrow = 2)
If the nrow argument is given, the facets will be divided somewhat equally among the specified number of rows.
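The panel arrangement produced by nrow can be inspected directly. The following sketch uses made-up data with four carriers; toy, base_toy, wrapped, and panel_layout are hypothetical names.

```r
library(ggplot2)

# Hypothetical data: four carriers, three dates, five delay values each
set.seed(1)
toy <- expand.grid(
  carrier = c("Southwest", "United", "Hawaiian", "Virgin"),
  date = c("2016-01", "2016-02", "2016-03"),
  rep = 1:5
)
toy$delay <- runif(nrow(toy), min = 0, max = 30)

base_toy <- ggplot(toy, aes(x = date, y = delay)) +
  geom_boxplot() +
  xlab(NULL) +
  ylab(NULL)

# Four facets spread over two rows gives a 2 x 2 arrangement
wrapped <- base_toy + facet_wrap(~carrier, nrow = 2)

# One row per panel, with its ROW and COL position in the grid
panel_layout <- ggplot_build(wrapped)$layout$layout
```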
If faceting is to be done on two variables, facet_grid is used instead.
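As a minimal sketch of faceting on two variables, the example below uses made-up data with a carrier column and an airport column. The airport codes and all values are hypothetical.

```r
library(ggplot2)

# Hypothetical data: two carriers at three airports over six months
set.seed(1)
toy <- expand.grid(
  carrier = c("Southwest", "United"),
  airport = c("ORD", "DEN", "ATL"),
  month = 1:6
)
toy$delay <- runif(nrow(toy), min = 0, max = 30)

# Rows are defined by the variable before the tilde, columns by the one after
gridded <- ggplot(toy, aes(x = month, y = delay)) +
  geom_line() +
  geom_point() +
  facet_grid(carrier ~ airport)

# One row per panel, with its ROW and COL position in the grid
panel_layout <- ggplot_build(gridded)$layout$layout
```

With two carriers and three airports this produces a grid of two rows and three columns, one panel for each combination.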
There are a number of built-in datasets included with the R installation that can be referenced without loading them from an external file. We will use some of them in the practice assignment.
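As a brief sketch of how built-in datasets are referenced, the example below uses mtcars, chosen here only for illustration; it is not necessarily one of the datasets used in the practice assignment.

```r
# data() with no arguments lists the datasets available in loaded packages
# data()

# Built-in datasets can be used by name without reading a file
head(mtcars, 3)

# Size of the dataset: rows, then columns
mtcars_dims <- dim(mtcars)

# Documentation for a built-in dataset is available through the help system
# ?mtcars
```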
Next lesson: controlling plot dimensions