Digital Education Resources - Vanderbilt Libraries Digital Lab

Previous lesson: Controlling the appearance of plots

R viz using ggplot: Displaying complex data

In exploratory data visualization, there are often too many variables to explore using a single visualization. In this lesson we will see how to reduce the complexity of the data by reducing the number of variables, and reducing the number of levels with categorical variables.

Learning objectives At the end of this lesson, the learner will be able to:

Total video time: n/a

Links

Lesson R script at GitHub

Lesson slides

ggplot function reference


Organizing multiple plots

The patchwork package can be used to control the position of several subplots within a larger combined plot. See this chapter for details.

Arranging subplots next to each other

The easiest way to arrange the subplots is to assign each one to a named object, then arrange them using the format allowed by patchwork.

Create functions and assign:

library(patchwork)

plot1 <- ggplot(hemoglobin_frame, aes(hemoglobin)) +
  geom_density(alpha = 0.2, aes(fill = population, color = population)) + # use alpha to control transparency
  theme(legend.position = c(0.8, 0.8))

plot2 <- ggplot(hemoglobin_frame, aes(sample = hemoglobin)) + 
  geom_qq(aes(color = population)) +
  stat_qq_line(aes(color = population)) +
  theme(legend.position = c(0.8, 0.2))

To place plots side-by-side, use the + operator:

plot1 + plot2

To place plots on top of each other, use the / operator:

plot1 / plot2

If you want to apply letter or number tags to the subplots, use plot_annotation(). The style “A” will use A, B, C, etc. :

plot1 + plot2 + 
  plot_annotation(tag_levels = c("A"))

Overlaying a subplot on top of the main plot

The inset_element() function can be used to place a subplot within the limits of another. The position is controlled by left, right, top, and bottom arguments. Trial and error is usually needed to get it in the right spot.

ggplot(hemoglobin_frame, aes(hemoglobin)) +
  geom_density(alpha = 0.2, aes(fill = population, color = population)) + 
  inset_element(qq_no_legend, left = 0.5, bottom = 0.6, right = 0.95, top = 0.95)

Approaches for collapsing variables and reducing levels when there’s too much data

Often a dataset will have many variables (continuous and categorical) to visualize at once. There also may be so many levels in a categorical variable that there is too much data to display. We can use functions from the dplyr library as well as built-in feature of ggplot to do this.

Data in this section are from Bureau of Transportation Statistics (https://www.transtats.bts.gov/). The full dataset can be downloaded here.

Reducing the number of variables

The dataset can be simplified by filtering rows to a single value of several of the variables. Those variable columns can then be ignored.

The following code limits the dataset to only delays caused by late aircraft at O’Hare airport:

ord <- airline %>%
  filter(`Airport Name` == "Chicago O'Hare International") %>%
  filter(`Ontime Category` == "Delayed by Late Aircraft")

Grouping to use one variable for multiple objects within a geom

The group() argument can be used within an aesthetic to indicate that a particular categorical variable be used to generate multiple objects within a particular geom. In this example, separate line and point plots for delay by time series are generated for each airline:

ggplot(ord, aes(x = Date, y = `Minutes of Delay per Flight`, group = `Carrier Name`)) + 
  geom_line(na.rm = TRUE) +
  geom_point()

Because there are 14 airlines, there are so many lines it’s impossible to tell what’s going on. Typically, instead of simply grouping using group(), we group and also assign a different value of an aesthetic to each group. For example, we can group and give each airline a different color. There still are many similar colors, so we can also the airline to control the shape of the points using an aesthetic specific to the geom_point geom.

Since there is only a small number of default shapes, we have to specify the shapes to be used manually by creating a vector of shape numbers.

shapes <- 0:19

ggplot(ord, aes(x = Date, y = `Minutes of Delay per Flight`, color = `Carrier Name`)) + 
  geom_line(na.rm = TRUE) +
  geom_point(aes(shape = `Carrier Name`), size = 3) +
  scale_shape_manual(values=shapes)

Collapsing one of the variables using summarize

Instead of just eliminating the airport category, we can combine the information we are interesed in using the summarize() function from the dplyr library.

In this case, it’s a bit more complicated because we need to summarize two different variables (minuts of delay and number of flights) in order to calculate the mean delay over all airports instead of just a single airport (which we were using in the previous plot).

all_airports_total_flights <- airline %>% 
  filter(`Ontime Category` == "Delayed by Late Aircraft") %>%
  group_by(`Carrier Name`, `Date`) %>% # group_by() is from the dplyr package
  summarize(total_flights = sum(`Number of Flights`))  # summarize() also from dplyr

all_airports_total_delays <- airline %>% 
  filter(`Ontime Category` == "Delayed by Late Aircraft") %>%
  group_by(`Carrier Name`, `Date`) %>%
  summarize(total_delay = sum(`Minutes of Delay`))

In this example, we group by the two variables we want to keep in the data for future use: the carrier name and the date (used for the time series).

Since the resulting two data frames have the same number of rows in the same order, we can calculate the mean of the summarized delays by dividing the summarized delay totals by the sumarized total number of flights. We can do this as we create a single data frame:

mean_delay <- data.frame(carrier = all_airports_total_delays$`Carrier Name`, date = all_airports_total_delays$`Date`, mean_delay = all_airports_total_delays$total_delay/all_airports_total_flights$total_flights)

The resulting data frame can now be used to create the same kind of plot as before, but based on all airports and not just one:

ggplot(mean_delay, aes(x = date, y = mean_delay, color = carrier)) + 
  geom_line() +
  geom_point(aes(shape = carrier), size = 3) +
  scale_shape_manual(values=shapes)

I didn’t have to put na.rm = TRUE as an argument for geom_line since there are no missing data.

If I want to see the overall trend for all airlines superimposed on top of the plot, I can add a geom_smooth. However, I only want the grouping by color to apply to the line and point plots, and NOT to the smooth geom because I want it to apply to all airlines. So I need to remove the color = carrier grouping argument as a global mapping and put it only in the specific geoms to which it should apply (line and point):

ggplot(mean_delay, aes(x = date, y = mean_delay)) + 
  geom_line(aes(color = carrier)) +
  geom_point(aes(shape = carrier, color = carrier), size = 3) +
  scale_shape_manual(values=shapes)+ 
  geom_smooth(size = 2, se = FALSE)

Reducing the number of levels of a factor

This exploration has allowed me to identify four types of airlines for further exploration based on the length of delay and consistency: United (consistently bad), Southwest (consistently good), Hawaiian (inconsistently good), Virgin (inconsistently bad).

I can filter using the logical OR operator | to screen out the other airlines:

small_mean_delay <- mean_delay %>%
  filter(carrier == "United" | carrier == "Virgin" | carrier == "Hawaiian" | carrier == "Southwest")

We can also reorder the factors so that they appear in the order that makes the most sense to us:

four_airline$`Carrier Name` <- factor(four_airline$`Carrier Name`, c("Southwest", "United", "Hawaiian", "Virgin"))

Reducing a variable by applying a collective geom

The point and line geoms apply to only a singe value of the Y variable per object plotted. If we switch to a plot that produces an object that summarizes a variable, we can visualize the effect of that variable without eliminating it.

For example, in the last plot we eliminated the airport as a variable by averaging over it. If instead we use the geom_boxplot geom, it will visualize the distribution of delay times over all airports instead of showing them as a single point that is a dot.

Generating multiple boxplot geoms requires the X variable to be discrete rather than continuous. We can accomplish this by turing the continuous Date variable into a factor before generating the plot

ggplot(four_airline, aes(x = as.factor(Date), y = `Minutes of Delay per Flight`, color = `Carrier Name`)) + 
  geom_boxplot() +
  guides(x = guide_axis(angle = 90))

Because the date labels are so long, I rotated them by 90 degrees.

Faceting

The previous plot has too much information packed into a single plot. Faceting systematically generates subplots based on values of a discontinuous variable.

The function to be used in the facets must have the grouping argument removed from the aesthetic. In this case the color argument was removed from the previous example. The subplot axes labels are also set to NULL because they will be applied to the overall plot.

base <- ggplot(four_airline, aes(x = as.factor(Date), y = `Minutes of Delay per Flight`)) + 
  geom_boxplot() + 
  xlab(NULL) + 
  ylab(NULL) +
  guides(x = guide_axis(angle = 90))

Faceting on a single variable uses the facet_wrap function. The variable (or a function) is specified after a tilde ~:

base + facet_wrap(~`Carrier Name`, nrow = 2)

When the nrow argument is given, the facets will be divided somewhat equally among the specified number of rows.

If faceting is to be done on two variables, facet_grid is used instead.


Practice assignment

There are a number of built-in datasets included with the R installation that can be referenced without loading them from an external file. We will use some of them in the practice assignment.

  1. Load th

Next lession: controlling plot dimensions


Revised 2021-09-30

Questions? Contact us

License: CC BY 4.0.
Credit: "Vanderbilt Libraries Digital Lab - www.library.vanderbilt.edu"