Digital Education Resources - Vanderbilt Libraries Digital Lab
Previous lesson: Controlling the appearance of plots
In exploratory data visualization, there are often too many variables to explore using a single visualization. In this lesson we will see how to reduce the complexity of the data by reducing the number of variables and the number of levels of categorical variables.
Learning objectives
At the end of this lesson, the learner will be able to:
arrange subplots using the patchwork library.
place one plot inside another using inset_element() from the patchwork library.
Total video time: n/a
The patchwork package can be used to control the position of several subplots within a larger combined plot. See this chapter for details.
The easiest way to arrange the subplots is to assign each one to a named object, then arrange them using the format allowed by patchwork.
Create the plots and assign each one to a named object:
library(ggplot2)
library(patchwork)
# Numeric legend positions are fractions of the panel width and height.
# ggplot2 >= 3.5.0 prefers legend.position = "inside" with legend.position.inside = c(0.8, 0.8).
plot1 <- ggplot(hemoglobin_frame, aes(hemoglobin)) +
  geom_density(alpha = 0.2, aes(fill = population, color = population)) + # use alpha to control transparency
  theme(legend.position = c(0.8, 0.8))
plot2 <- ggplot(hemoglobin_frame, aes(sample = hemoglobin)) +
  geom_qq(aes(color = population)) +
  stat_qq_line(aes(color = population)) +
  theme(legend.position = c(0.8, 0.2))
To place plots side-by-side, use the + operator:
plot1 + plot2
To place plots on top of each other, use the / operator:
plot1 / plot2
If you want to apply letter or number tags to the subplots, use plot_annotation(). The style “A” will use A, B, C, etc.:
plot1 + plot2 +
  plot_annotation(tag_levels = c("A"))
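The + and / operators can also be combined with parentheses to build nested layouts. The following is a minimal sketch that uses the built-in mtcars dataset instead of the hemoglobin data; the object names p_scatter, p_hist, p_box, and combined are made up for illustration.

```r
library(ggplot2)
library(patchwork)

# Three small plots built from the built-in mtcars dataset (illustrative only)
p_scatter <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
p_hist <- ggplot(mtcars, aes(mpg)) + geom_histogram(bins = 10)
p_box <- ggplot(mtcars, aes(factor(cyl), mpg)) + geom_boxplot()

# Two plots side by side in the top row, one wide plot underneath,
# with letter tags applied to the subplots
combined <- (p_scatter + p_hist) / p_box +
  plot_annotation(tag_levels = "A")
```

Typing combined at the console draws the assembled figure. The parentheses are needed because / binds more tightly than + in R.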
The inset_element() function can be used to place a subplot within the limits of another. The position is controlled by the left, right, top, and bottom arguments. Trial and error is usually needed to get it in the right spot.
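# qq_no_legend is not defined earlier in this lesson; it is assumed here to be plot2 without its legend
qq_no_legend <- plot2 + theme(legend.position = "none")
ggplot(hemoglobin_frame, aes(hemoglobin)) +
  geom_density(alpha = 0.2, aes(fill = population, color = population)) +
  inset_element(qq_no_legend, left = 0.5, bottom = 0.6, right = 0.95, top = 0.95)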
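To make the trial and error a little easier, it helps to know that numeric positions are read as fractions of the main plot's panel area by default, with 0 at the left or bottom edge and 1 at the right or top edge. Here is a small self-contained sketch using the built-in mtcars dataset; the names main_plot, inset_plot, and with_inset are invented for this example.

```r
library(ggplot2)
library(patchwork)

main_plot <- ggplot(mtcars, aes(wt, mpg)) + geom_point()

# Small plot to embed; a minimal theme keeps it readable at reduced size
inset_plot <- ggplot(mtcars, aes(mpg)) +
  geom_density() +
  theme_minimal(base_size = 8)

# Place the inset in the upper-right corner: it spans 60% to 98% of the
# panel width and 60% to 98% of the panel height
with_inset <- main_plot +
  inset_element(inset_plot, left = 0.6, bottom = 0.6, right = 0.98, top = 0.98)
```

Typing with_inset at the console draws the result. Shrinking the gap between left and right (or bottom and top) makes the inset smaller.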
Often a dataset will have too many variables (continuous and categorical) to visualize at once. There may also be so many levels in a categorical variable that there is too much data to display. We can use functions from the dplyr library as well as built-in features of ggplot to reduce this complexity.
Data in this section are from the Bureau of Transportation Statistics (https://www.transtats.bts.gov/). The full dataset can be downloaded here.
The dataset can be simplified by filtering rows to a single value of several of the variables. Those variable columns can then be ignored.
The following code limits the dataset to only delays caused by late aircraft at O’Hare airport:
ord <- airline %>%
  filter(`Airport Name` == "Chicago O'Hare International") %>%
  filter(`Ontime Category` == "Delayed by Late Aircraft")
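The same idea can be tried on a tiny table. This sketch assumes a made-up miniature of the airline data (toy_airline, with invented values) and shows two optional variations: putting both conditions in one filter() call, and dropping the columns that are constant after filtering.

```r
library(dplyr)

# Hypothetical miniature of the airline table (values are made up)
toy_airline <- data.frame(
  `Airport Name` = c("Chicago O'Hare International", "Chicago O'Hare International",
                     "Nashville International", "Chicago O'Hare International"),
  `Ontime Category` = c("Delayed by Late Aircraft", "On Time",
                        "Delayed by Late Aircraft", "Delayed by Late Aircraft"),
  `Carrier Name` = c("United", "United", "Southwest", "Southwest"),
  check.names = FALSE  # keep the column names with spaces as they are
)

# Both conditions in one filter() call; conditions separated by commas are combined with AND
toy_ord <- toy_airline %>%
  filter(
    `Airport Name` == "Chicago O'Hare International",
    `Ontime Category` == "Delayed by Late Aircraft"
  ) %>%
  # the two filtered columns are now constant, so they can be dropped
  select(-`Airport Name`, -`Ontime Category`)
```

Only the two O'Hare rows delayed by late aircraft remain, and the only column left is the carrier name.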
The group aesthetic can be used within aes() to indicate that a particular categorical variable should be used to generate multiple objects within a particular geom. In this example, separate line and point plots of the delay time series are generated for each airline:
ggplot(ord, aes(x = Date, y = `Minutes of Delay per Flight`, group = `Carrier Name`)) +
  geom_line(na.rm = TRUE) +
  geom_point()
Because there are 14 airlines, there are so many lines that it’s impossible to tell what’s going on. Typically, instead of simply grouping using group, we group and also assign a different value of an aesthetic to each group. For example, we can group and give each airline a different color. There are still many similar colors, so we can also use the airline to control the shape of the points using an aesthetic specific to the geom_point geom.
Since the default shape scale provides only six shapes, fewer than the number of airlines, we have to specify the shapes to be used manually by creating a vector of shape numbers.
shapes <- 0:19
ggplot(ord, aes(x = Date, y = `Minutes of Delay per Flight`, color = `Carrier Name`)) +
  geom_line(na.rm = TRUE) +
  geom_point(aes(shape = `Carrier Name`), size = 3) +
  scale_shape_manual(values = shapes)
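To see the manual shape scale at work without the airline data, here is a small sketch with invented data for eight carriers (more than the default shape scale can handle). The data frame toy and the object names are assumptions made for this example only.

```r
library(ggplot2)

# Made-up data: 8 carriers observed at 3 time points
toy <- data.frame(
  carrier = rep(paste("Carrier", 1:8), each = 3),
  month = rep(1:3, times = 8),
  delay = rep(1:8, each = 3) + rep(c(0, 0.5, 0.2), times = 8)
)

# One shape number per carrier
shapes <- 0:7

p <- ggplot(toy, aes(x = month, y = delay, color = carrier)) +
  geom_line() +
  geom_point(aes(shape = carrier), size = 3) +
  scale_shape_manual(values = shapes)

# Count the distinct shapes actually used by the point layer (layer 2)
n_shapes <- length(unique(layer_data(p, 2)$shape))
```

Each carrier ends up with its own shape, so n_shapes equals the number of carriers. The vector of shape numbers only needs to be at least as long as the number of groups.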
summarize
Instead of just eliminating the airport category, we can combine the information we are interested in using the summarize() function from the dplyr library.
In this case, it’s a bit more complicated because we need to summarize two different variables (minutes of delay and number of flights) in order to calculate the mean delay over all airports instead of just a single airport (which we were using in the previous plot).
all_airports_total_flights <- airline %>%
  filter(`Ontime Category` == "Delayed by Late Aircraft") %>%
  group_by(`Carrier Name`, `Date`) %>% # group_by() is from the dplyr package
  summarize(total_flights = sum(`Number of Flights`)) # summarize() also from dplyr
all_airports_total_delays <- airline %>%
  filter(`Ontime Category` == "Delayed by Late Aircraft") %>%
  group_by(`Carrier Name`, `Date`) %>%
  summarize(total_delay = sum(`Minutes of Delay`))
In this example, we group by the two variables we want to keep in the data for future use: the carrier name and the date (used for the time series).
Since the resulting two data frames have the same number of rows in the same order, we can calculate the mean of the summarized delays by dividing the summarized delay totals by the summarized total number of flights. We can do this as we create a single data frame:
mean_delay <- data.frame(
  carrier = all_airports_total_delays$`Carrier Name`,
  date = all_airports_total_delays$`Date`,
  mean_delay = all_airports_total_delays$total_delay / all_airports_total_flights$total_flights
)
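The approach above depends on the two summary tables having their rows in the same order. As an alternative sketch, both totals can be computed in a single summarize() call so that the division happens row by row within one table. The data frame airline_toy and its values are made up for illustration; only the column names follow the lesson.

```r
library(dplyr)

# Made-up miniature data: each row is one airport's record for a carrier and date
airline_toy <- data.frame(
  `Carrier Name` = c("A", "A", "A", "B", "B", "B"),
  Date = as.Date(c("2020-01-01", "2020-01-01", "2020-02-01",
                   "2020-01-01", "2020-01-01", "2020-02-01")),
  `Number of Flights` = c(10, 30, 20, 5, 15, 10),
  `Minutes of Delay` = c(100, 100, 60, 20, 60, 50),
  check.names = FALSE  # keep the column names with spaces as they are
)

mean_delay_toy <- airline_toy %>%
  group_by(`Carrier Name`, Date) %>%
  summarize(
    total_flights = sum(`Number of Flights`),
    total_delay = sum(`Minutes of Delay`),
    .groups = "drop"  # return an ungrouped table
  ) %>%
  # divide the two totals within the same row
  mutate(mean_delay = total_delay / total_flights)
```

For carrier A on 2020-01-01, for example, the totals are 200 minutes over 40 flights, giving a mean delay of 5 minutes per flight.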
The resulting data frame can now be used to create the same kind of plot as before, but based on all airports and not just one:
ggplot(mean_delay, aes(x = date, y = mean_delay, color = carrier)) +
  geom_line() +
  geom_point(aes(shape = carrier), size = 3) +
  scale_shape_manual(values = shapes)
I didn’t have to put na.rm = TRUE as an argument for geom_line since there are no missing data.
If I want to see the overall trend for all airlines superimposed on top of the plot, I can add a geom_smooth. However, I only want the grouping by color to apply to the line and point plots, and NOT to the smooth geom because I want it to apply to all airlines. So I need to remove the color = carrier grouping argument as a global mapping and put it only in the specific geoms to which it should apply (line and point):
ggplot(mean_delay, aes(x = date, y = mean_delay)) +
  geom_line(aes(color = carrier)) +
  geom_point(aes(shape = carrier, color = carrier), size = 3) +
  scale_shape_manual(values = shapes) +
  geom_smooth(linewidth = 2, se = FALSE) # ggplot2 >= 3.4.0 uses linewidth for line thickness; older versions use size
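The difference between a global and a local mapping can be checked directly by counting how many smooth lines each version produces. This is a sketch on invented data (toy); it uses method = "lm" rather than the default smoother simply to keep the small example quiet and fast.

```r
library(ggplot2)

# Made-up data: 3 carriers, 6 months each
toy <- data.frame(
  carrier = rep(c("A", "B", "C"), each = 6),
  month = rep(1:6, times = 3),
  delay = c(5, 6, 7, 6, 8, 9,  2, 3, 2, 4, 3, 4,  7, 5, 8, 6, 9, 7)
)

# Color mapped globally: geom_smooth() inherits it and fits one line per carrier
p_global <- ggplot(toy, aes(x = month, y = delay, color = carrier)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Color mapped only in geom_point(): geom_smooth() fits a single overall line
p_local <- ggplot(toy, aes(x = month, y = delay)) +
  geom_point(aes(color = carrier)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Count the separate smooth lines in layer 2 of each plot
n_global <- length(unique(layer_data(p_global, 2)$group))
n_local <- length(unique(layer_data(p_local, 2)$group))
```

With the global mapping there is one smooth line per carrier; with the local mapping there is a single line for all of the data.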
This exploration has allowed me to identify four types of airlines for further exploration based on the length of delay and consistency: United (consistently bad), Southwest (consistently good), Hawaiian (inconsistently good), Virgin (inconsistently bad).
I can filter using the logical OR operator | to screen out the other airlines:
small_mean_delay <- mean_delay %>%
  filter(carrier == "United" | carrier == "Virgin" | carrier == "Hawaiian" | carrier == "Southwest")
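We can also reorder the factor levels so that they appear in the order that makes the most sense to us. The four_airline data frame used from here on is not created in the code shown above; it is assumed to be the original airline data filtered to the same four carriers:
four_airline$`Carrier Name` <- factor(four_airline$`Carrier Name`, levels = c("Southwest", "United", "Hawaiian", "Virgin"))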
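When the list of values to keep gets long, the %in% operator is a more compact alternative to chaining == comparisons with |, and the same vector can then set the order of the factor levels. A minimal sketch on invented data (toy_delay and keep are names made up for this example):

```r
library(dplyr)

# Made-up summary table with one row per carrier
toy_delay <- data.frame(
  carrier = c("United", "Delta", "Virgin", "Hawaiian", "Southwest", "Alaska"),
  mean_delay = c(9, 5, 8, 3, 2, 4)
)

# Carriers to keep, listed in the order we want them to appear in plots
keep <- c("Southwest", "United", "Hawaiian", "Virgin")

toy_four <- toy_delay %>%
  # %in% tests membership in a vector
  filter(carrier %in% keep) %>%
  # use the same vector to fix the order of the factor levels
  mutate(carrier = factor(carrier, levels = keep))

levels(toy_four$carrier)
```

The rows keep their original order in the table, but legends and discrete axes will follow the order of the factor levels.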
The point and line geoms apply to only a single value of the Y variable per object plotted. If we switch to a plot that produces an object that summarizes a variable, we can visualize the effect of that variable without eliminating it.
For example, in the last plot we eliminated the airport as a variable by averaging over it. If instead we use the geom_boxplot geom, it will visualize the distribution of delay times over all airports instead of collapsing them to a single point.
Generating multiple boxplot geoms requires the X variable to be discrete rather than continuous. We can accomplish this by turning the continuous Date variable into a factor before generating the plot:
ggplot(four_airline, aes(x = as.factor(Date), y = `Minutes of Delay per Flight`, color = `Carrier Name`)) +
  geom_boxplot() +
  guides(x = guide_axis(angle = 90))
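The effect of converting a continuous variable to a factor can be tried on a built-in dataset. In this sketch, ToothGrowth (which ships with R) stands in for the airline data: its dose column is numeric with three distinct values, so wrapping it in as.factor() should give one box per dose. The names p_factor and n_boxes are made up for the example.

```r
library(ggplot2)

# dose is numeric; as.factor() turns it into a discrete X variable
p_factor <- ggplot(ToothGrowth, aes(x = as.factor(dose), y = len)) +
  geom_boxplot()

# The boxplot layer holds one row of summary statistics per box
n_boxes <- nrow(layer_data(p_factor))
```

Typing p_factor at the console draws the plot, with one box for each distinct dose.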
Because the date labels are so long, I rotated them by 90 degrees.
The previous plot has too much information packed into a single plot. Faceting systematically generates subplots based on the values of a discrete variable.
The plot to be used in the facets must have the grouping variable removed from the aesthetic. In this case the color argument was removed from the previous example. The subplot axis labels are also set to NULL because they will be applied to the overall plot.
base <- ggplot(four_airline, aes(x = as.factor(Date), y = `Minutes of Delay per Flight`)) +
  geom_boxplot() +
  xlab(NULL) +
  ylab(NULL) +
  guides(x = guide_axis(angle = 90))
Faceting on a single variable uses the facet_wrap function. The variable (or a function) is specified after a tilde ~:
base + facet_wrap(~`Carrier Name`, nrow = 2)
When the nrow argument is given, the facets will be divided somewhat equally among the specified number of rows.
If faceting is to be done on two variables, facet_grid is used instead.
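As a brief sketch of faceting on two variables, the following uses the built-in mtcars dataset rather than the airline data. The choice of cyl for the rows and am for the columns is arbitrary and only for illustration.

```r
library(ggplot2)

# cyl and am are stored as numbers in mtcars but have only a few distinct values
p_grid <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # one row of panels per value of cyl, one column per value of am
  facet_grid(rows = vars(cyl), cols = vars(am))
```

Typing p_grid at the console draws a grid of panels, one for each combination of the two variables. Unlike facet_wrap, the grid keeps a panel for every combination, even one that contains no data.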
There are a number of built-in datasets included with the R installation that can be referenced without loading them from an external file. We will use some of them in the practice assignment.
Next lesson: controlling plot dimensions
Revised 2021-09-30
Questions? Contact us
License: CC BY 4.0.
Credit: "Vanderbilt Libraries Digital Lab - www.library.vanderbilt.edu"