Digital Education Resources - Vanderbilt Libraries Digital Lab

Previous lesson: last lesson in Intro to R module

R viz using ggplot: Introduction

This lesson will introduce the conceptual background that influences the construction of ggplot plots. You will practice building several simple plot types that were previously introduced in the beginner module using the ggplot function template.

Learning objectives At the end of this lesson, the learner will be able to:

Total video time: 25 m 23 s

Links

Lesson R script at GitHub

Lesson slides

Access to O’Reilly for Higher Education via Vanderbilt Libraries. See “R for Data Science” and “R Graphics Cookbook”

ggplot cheatsheet

ggplot2 book (draft of 3rd edition)

Duke University Center for Data and Visualization Sciences

NC State University Libraries workshops

Penn State Institute for Computational and Data Sciences


The layered Grammar of Graphics

The “gg” in ggplot stands for “grammar of graphics”. In this section we will explore what that means in the context of the ggplot2 R package.

Introduction (2m 56s)

There are several authoritative resources provided by the creators of ggplot. In this section we will introduce several of them.

ggplot2: Elegant Graphics for Data Analysis (draft of 3rd edition)

R for Data Science: free online version, access through the Vanderbilt Libraries’ O’Reilly subscription

ggplot website https://ggplot2.tidyverse.org/ with online cheatsheet

Downloadable ggplot and other RStudio cheatsheets

ggplot function reference

Wickham, H. 2010. A layered grammar of graphics. J. Comp. and Graph. Stats. http://dx.doi.org/10.1198/jcgs.2009.07098. Freely available preprint

Wikipedia article on kākāpō

Radio New Zealand podcast Kākāpō Files

What is a Grammar of Graphics? (4m 41s)

In ggplot, a plot is built by adding together a series of function calls with the + operator, following a generalized “grammar of graphics” paradigm.

The base ggplot function supplies the data, and each function added to it contributes one layer or setting. If the plot includes multiple geometric features, a separate geometric object function is added for each one.
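As a minimal sketch of this layering idea, the following code adds two geoms to one base plot. It assumes the ggplot2 package is installed and uses the built-in mtcars dataset as a stand-in; the names base and layered are illustrative only and are not part of the lesson script.

```r
library(ggplot2)

# The base plot holds the data but draws no geometric objects yet.
base <- ggplot(data = mtcars)

# Each added geom function contributes one layer to the plot.
layered <- base +
  geom_point(mapping = aes(x = wt, y = mpg)) +  # layer 1: points
  geom_rug(mapping = aes(x = wt, y = mpg))      # layer 2: tick marks along the axes

# Each layer computes its own data from the same 32 rows of mtcars.
nrow(layer_data(layered, 1))
nrow(layer_data(layered, 2))

print(layered)
```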

Description of layered grammar of graphics in R for Data Science

Common geometric object (geom) types

A particular type of geometric object or geom can be plotted by adding its function to the base ggplot function. We will see three simple types of geoms here.

Creating a histogram (3m 30s)

A histogram displays the distribution of a single continuous variable.

Here is an example of code for generating a plot using the histogram geom:

ggplot(data = schools_data) + geom_histogram(mapping = aes(x = Female), binwidth = 100)

Note that the result of the ggplot function is not assigned to anything. So as soon as its “value” is computed, R prints it, and printing a plot displays it (in the plots pane).
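The same pattern can be tried without the lesson's data file. The sketch below assumes the built-in faithful dataset as a stand-in and an arbitrary binwidth of 5; the names waiting_hist and bins are illustrative. It also shows that a plot stored in a variable is not drawn until it is printed.

```r
library(ggplot2)

# Assigning the result to a name stores the plot without drawing it.
waiting_hist <- ggplot(data = faithful) +
  geom_histogram(mapping = aes(x = waiting), binwidth = 5)

# Printing the object (or typing its name at the console) draws it.
print(waiting_hist)

# layer_data() returns the bins computed for the bars, one row per bin.
bins <- layer_data(waiting_hist)
head(bins[, c("xmin", "xmax", "count")])
```

Changing binwidth changes how many rows bins has, which is a quick way to see what the argument does before looking at the picture.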

Breaking up ggplot functions onto several lines (2m 40s)

It is possible to assign all or part of the function(s) to a variable.

base_plot <- ggplot(data = schools_data)
base_plot + geom_histogram(mapping = aes(x = Female), binwidth = 100)

(We see this kind of shortcut in the ggplot cheatsheet.)

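The added functions can be put on separate lines, but only if each line ends with a trailing plus sign to indicate that another function is coming. A plus sign at the start of the next line does not work, because R treats the preceding line as a complete expression.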

ggplot(data = schools_data) +
  geom_histogram(mapping = aes(x = Female), binwidth = 100)

RStudio will auto-indent to show that the functions continue.
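The placement of the plus sign can be illustrated with a short sketch. It assumes the built-in faithful dataset as a stand-in, and the name good is illustrative. The problematic form is left commented out so that the block runs cleanly.

```r
library(ggplot2)

# Works: the first line ends with +, so R keeps reading the next line.
good <- ggplot(data = faithful) +
  geom_histogram(mapping = aes(x = waiting), binwidth = 5)
print(good)

# Does not work as intended: the first line is already a complete
# expression, so R evaluates it on its own (an empty plot) and then
# treats the second line as a separate statement, typically with an error.
# ggplot(data = faithful)
#   + geom_histogram(mapping = aes(x = waiting), binwidth = 5)
```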

Box and whisker plot (3m 34s)

A box and whisker plot compares the distributions of several subsets of the data. The value of x is a discrete (categorical) grouping variable. The value of y is a continuous numeric variable.

Unlike the classic R box and whisker plot, ggplot does not care whether the grouping variable is a factor or a character vector. Which one you get depends on how the spreadsheet was read in: a tibble keeps text columns as character vectors, while read.csv converted them to factors by default before R 4.0.0 and has left them as character vectors since then.

ggplot(data = human_data) +
  geom_boxplot(mapping = aes(x = grouping, y = height))
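To check that the type of the grouping variable does not matter, here is a minimal sketch that assumes the mpg dataset that ships with ggplot2 as a stand-in for the lesson data. Its class column is a character vector. The names by_chr, by_fct, and mpg_fct are illustrative.

```r
library(ggplot2)

# The grouping column in mpg is a character vector, not a factor.
is.character(mpg$class)

by_chr <- ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = class, y = hwy))

# Converting the grouping column to a factor gives the same set of boxes.
mpg_fct <- mpg
mpg_fct$class <- factor(mpg_fct$class)
by_fct <- ggplot(data = mpg_fct) +
  geom_boxplot(mapping = aes(x = class, y = hwy))

print(by_chr)
```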

Scatterplot with best-fit trendline (4m 00s)

A scatterplot plots continuous X and Y variables as a cloud of points.

If only the point geom is added to the plot, only the data points are plotted in the scatterplot.

ggplot(data = schools_data) +
  geom_point(mapping = aes(x = Limited.English.Proficiency, y = Economically.Disadvantaged))

To add a best-fit trendline, add geom_smooth as a separate geom, in which the points are “smoothed” into a curve. To indicate that the smoothed curve should be a straight best-fit line, the method argument is set to "lm" (linear model). By default, a shaded confidence band is drawn around the line. This differs from the classic R trendline, which must be added after building a separate linear model for the data.

ggplot(data = schools_data) +
  geom_point(mapping = aes(x = Limited.English.Proficiency, y = Economically.Disadvantaged)) +
  geom_smooth(mapping = aes(x = Limited.English.Proficiency, y = Economically.Disadvantaged), method = "lm")
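The contrast with classic R can be sketched side by side. This example assumes the built-in mtcars dataset as a stand-in; the names fit and trend are illustrative. The arguments formula = y ~ x and se = FALSE are optional additions here: the first states the default formula explicitly, and the second hides the confidence band.

```r
library(ggplot2)

# Classic R: fit the model first, then draw the points and add the line.
fit <- lm(mpg ~ wt, data = mtcars)
plot(mpg ~ wt, data = mtcars)
abline(fit)

# ggplot: geom_smooth() fits the straight line internally.
trend <- ggplot(data = mtcars) +
  geom_point(mapping = aes(x = wt, y = mpg)) +
  geom_smooth(
    mapping = aes(x = wt, y = mpg),
    method = "lm",
    formula = y ~ x,
    se = FALSE
  )
print(trend)
```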

Global mapping and cosmetic formatting (4m 02s)

If a mapping aesthetic applies to several geoms, it is redundant and error-prone to repeat the aesthetic in every geom.

If the aesthetic is mapped within the base ggplot function, the aesthetic automatically is mapped to all geoms that are plotted onto that base function. This approach is called global mapping. The following example creates the same plot as in the last example, but with more succinct code:

ggplot(data = schools_data, mapping = aes(x = Limited.English.Proficiency, y = Economically.Disadvantaged)) +
  geom_point() +
  geom_smooth(method = "lm")
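Global and geom-specific mappings can also be combined. The following sketch goes slightly beyond the example above and assumes the built-in mtcars dataset as a stand-in; the name p is illustrative. The x and y aesthetics are mapped globally, while color is mapped only inside geom_point, so the trendline is still fitted to all points as one group.

```r
library(ggplot2)

p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point(mapping = aes(color = factor(cyl))) +  # color applies to the points only
  geom_smooth(method = "lm", formula = y ~ x)       # inherits x and y from the base plot
print(p)

# The points carry three colors, but the smoother sees a single group.
length(unique(layer_data(p, 1)$colour))
length(unique(layer_data(p, 2)$group))
```

If color were mapped globally instead, each cylinder group would get its own trendline.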

To make the ggplot code easier to read, functions can be broken down across multiple lines.

This example contains exactly the same code as the previous example, but whitespace (newlines and indentation) makes the structure of the functions more apparent by putting argument key/value pairs on separate lines. In this case, the open parenthesis signals to R that the function is not finished.

ggplot(
  data = schools_data, 
  mapping = aes(
    x = Limited.English.Proficiency, 
    y = Economically.Disadvantaged
    )
  ) +
  geom_point() +
  geom_smooth(
    method = "lm"
    )

Although highly structured, this representation is also quite verbose. You can balance structural clarity against compactness to create the code you think is most readable.


Practice assignment

There are a number of built-in datasets included with the R installation that can be referenced without loading them from an external file. We will use some of them in the practice assignment.
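As a quick sketch of how built-in datasets can be found and inspected, the following code lists the contents of the datasets package that comes with R and looks at two of them. The choice of faithful and mtcars is only an illustration and is not taken from the assignment.

```r
# List the names of the datasets that ship with the "datasets" package.
available <- data(package = "datasets")$results[, "Item"]
head(available)

# Built-in datasets can be used by name without reading a file.
str(faithful)
head(mtcars)
```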

Under construction

Next lesson: controlling appearance


Revised 2023-10-25

Questions? Contact us

License: CC BY 4.0.
Credit: "Vanderbilt Libraries Digital Lab - www.library.vanderbilt.edu"