Digital Education Resources - Vanderbilt Libraries Digital Lab
Previous lesson: last lesson in Intro to R module
This lesson will introduce the conceptual background that influences the construction of ggplot plots. You will practice building several simple plot types that were previously introduced in the beginner module using the ggplot function template.
Learning objectives
At the end of this lesson, the learner will be able to build simple plots using geom_histogram, geom_boxplot, and geom_point.
Total video time: 25 m 23 s
Access to O’Reilly for Higher Education via Vanderbilt Libraries. See “R for Data Science” and “R Graphics Cookbook”
ggplot2 book (draft of 3rd edition)
Duke University Center for Data and Visualization Sciences
NC State University Libraries workshops
Penn State Institute for Computational and Data Sciences
The “gg” in ggplot stands for “grammar of graphics”. In this section we will explore what that means in the context of the ggplot2 R package.
There are several authoritative resources provided by the creators of ggplot. In this section we will introduce several of them.
ggplot2: Elegant Graphics for Data Analysis (draft of 3rd edition)
R for Data Science: free online version, access through the Vanderbilt Libraries’ O’Reilly subscription
ggplot website https://ggplot2.tidyverse.org/ with online cheatsheet
Downloadable ggplot and other RStudio cheatsheets
Wickham, H. 2010. A layered grammar of graphics. J. Comp. and Graph. Stats. http://dx.doi.org/10.1198/jcgs.2009.07098. Freely available preprint
Radio New Zealand podcast Kākāpō Files
In ggplot, plots are built by adding together a series of function calls according to a generalized “grammar of graphics” paradigm.
A plot can combine several different plotting functions. If the plot includes multiple geometric features, a geometric object function is added for each one.
Description of layered grammar of graphics in R for Data Science
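As a rough sketch of this layered structure, the example below adds two geometric object functions to one base plot. It uses the built-in mtcars dataset rather than the lesson's schools_data, so the column names wt and mpg are assumptions made only for this illustration.

```r
library(ggplot2)

# General pattern:
#   ggplot(data = <DATA>) + <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
layered_plot <- ggplot(data = mtcars) +
  geom_point(mapping = aes(x = wt, y = mpg)) +  # first geom layer: points
  geom_line(mapping = aes(x = wt, y = mpg))     # second geom layer: connecting line

# Each added geom becomes one layer of the plot object
length(layered_plot$layers)

# Display the plot
layered_plot
```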
A particular type of geometric object or geom can be plotted by adding its function to the base ggplot function. We will see three simple types of geoms here.
A histogram displays the distribution of a single continuous variable.
Here is an example of code for generating a plot using the histogram geom:
ggplot(data = schools_data) + geom_histogram(mapping = aes(x = Female), binwidth = 100)
Note that the result of the ggplot function is not assigned to anything. So after its “value” is computed, it is displayed (in the Plots pane).
It is possible to assign all or part of the function(s) to a variable.
base_plot <- ggplot(data = schools_data)
base_plot + geom_histogram(mapping = aes(x = Female), binwidth = 100)
(We see this kind of shortcut in the ggplot cheatsheet.)
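A minimal sketch of why assignment can be useful: the same stored base can be reused with different geoms or settings. The example assumes the built-in mtcars dataset and its mpg column instead of schools_data, and the variable names narrow_bins and wide_bins are made up for illustration.

```r
library(ggplot2)

# Store the base plot; nothing is drawn at this point
base_plot <- ggplot(data = mtcars)

# Reuse the same base with two different bin widths
narrow_bins <- base_plot + geom_histogram(mapping = aes(x = mpg), binwidth = 1)
wide_bins <- base_plot + geom_histogram(mapping = aes(x = mpg), binwidth = 5)

# An assigned plot is displayed only when it is printed
print(wide_bins)
```

Adding a geom to base_plot creates a new plot object and leaves base_plot itself unchanged, which is what makes the reuse possible.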
The subfunctions can be put on separate lines, but only if there is a trailing plus sign to indicate that another function is coming.
ggplot(data = schools_data) +
  geom_histogram(mapping = aes(x = Female), binwidth = 100)
RStudio will auto-indent to show that the functions continue.
A box and whisker plot compares the distributions of several subsets of the data. The value of x is a discrete (categorical) grouping variable. The value of y is a continuous numeric variable.
Unlike the classic R box and whisker plot, ggplot does not care whether the grouping variable is a factor or a character vector. Which one you get depends on how the spreadsheet was read in: a tibble keeps text columns as character, while a vanilla data frame converted them to factors by default in versions of R before 4.0.
ggplot(data = human_data) +
  geom_boxplot(mapping = aes(x = grouping, y = height))
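To illustrate the point about factors, here is a small sketch using the built-in iris dataset (an assumption for this example; human_data is not available here). The grouping column Species is a factor in iris, and the copy iris_chr, a name invented for this sketch, stores it as plain character strings.

```r
library(ggplot2)

# Copy of iris with the grouping column stored as character strings
iris_chr <- iris
iris_chr$Species <- as.character(iris_chr$Species)

# Grouping variable is a factor
box_factor <- ggplot(data = iris) +
  geom_boxplot(mapping = aes(x = Species, y = Sepal.Length))

# Grouping variable is a character vector
box_chr <- ggplot(data = iris_chr) +
  geom_boxplot(mapping = aes(x = Species, y = Sepal.Length))

# Both versions draw one box per species
box_factor
box_chr
```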
A scatterplot plots continuous X and Y variables as a cloud of points.
If only the point geom is added to the plot, only the data points are plotted in the scatterplot.
ggplot(data = schools_data) +
  geom_point(mapping = aes(x = Limited.English.Proficiency, y = Economically.Disadvantaged))
To add a best-fit trendline, add a separate geom in which the points are “smoothed” into a curve. To make the smoothing function a best-fit straight line, set the method argument to "lm" (linear model). This differs from the classic R trendline, which must be added after building a separate linear model for the data.
ggplot(data = schools_data) +
  geom_point(mapping = aes(x = Limited.English.Proficiency, y = Economically.Disadvantaged)) +
  geom_smooth(mapping = aes(x = Limited.English.Proficiency, y = Economically.Disadvantaged), method = "lm")
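For comparison, the sketch below shows one common way to draw the classic R trendline next to the ggplot version. It assumes the built-in mtcars dataset (columns wt and mpg) in place of schools_data, and the variable name fit is made up for this example.

```r
library(ggplot2)

# Classic R approach: plot the points, fit a model, then draw its line
plot(mpg ~ wt, data = mtcars)
fit <- lm(mpg ~ wt, data = mtcars)  # separate linear model
abline(fit)                         # add the fitted line to the existing plot

# ggplot approach: the linear fit is computed inside the smoothing geom
ggplot(data = mtcars) +
  geom_point(mapping = aes(x = wt, y = mpg)) +
  geom_smooth(mapping = aes(x = wt, y = mpg), method = "lm")
```

By default geom_smooth also draws a shaded confidence band around the line. Adding the argument se = FALSE leaves the band out.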
If a mapping aesthetic applies to several geoms, repeating it in every geom is redundant and makes the code harder to maintain.
If the aesthetic is mapped within the base ggplot function, it is automatically mapped to all geoms that are added to that base function. This approach is called global mapping. The following example creates the same plot as in the last example, but with more succinct code:
ggplot(data = schools_data, mapping = aes(x = Limited.English.Proficiency, y = Economically.Disadvantaged)) +
  geom_point() +
  geom_smooth(method = "lm")
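One way to convince yourself that global mapping gives the same result is to compare the data that each version passes to its point layer. This is only a sketch: it assumes the built-in mtcars dataset, and the names local_plot, global_plot, and xy are invented for the example.

```r
library(ggplot2)

# Aesthetics repeated inside each geom
local_plot <- ggplot(data = mtcars) +
  geom_point(mapping = aes(x = wt, y = mpg)) +
  geom_smooth(mapping = aes(x = wt, y = mpg), method = "lm")

# Aesthetics mapped once in the base function (global mapping)
global_plot <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm")

# layer_data() returns the data used to draw one layer (here layer 1, the points)
xy <- c("x", "y")
all.equal(layer_data(local_plot, 1)[xy], layer_data(global_plot, 1)[xy])
```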
To make the ggplot code easier to read, functions can be broken down across multiple lines.
This example contains exactly the same code as the previous example, but whitespace (newlines and indentation) makes the structure of the functions more apparent by putting argument key/value pairs on separate lines. In this case, the open parenthesis signals to R that the function is not finished.
ggplot(
  data = schools_data,
  mapping = aes(
    x = Limited.English.Proficiency,
    y = Economically.Disadvantaged
  )
) +
  geom_point() +
  geom_smooth(
    method = "lm"
  )
Although highly structured, this representation is also pretty verbose. You can balance structural clarity against compactness to create the code you think is most readable.
There are a number of built-in datasets included with the R installation that can be referenced without loading them from an external file. We will use some of them in the practice assignment.
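As a short sketch of how a built-in dataset can be used, the example below inspects a few of them and then plots one. The choice of mtcars, iris, and faithful is an assumption for illustration; the practice assignment may use different datasets.

```r
library(ggplot2)

# data() with no arguments lists the datasets in the currently loaded packages
# data()

# Built-in datasets can be inspected like any other data frame
head(mtcars)
str(iris)

# They can also be passed directly to ggplot without reading a file
ggplot(data = faithful) +
  geom_histogram(mapping = aes(x = waiting), binwidth = 5)
```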
Next lesson: controlling appearance
Revised 2023-10-25
Questions? Contact us
License: CC BY 4.0.
Credit: "Vanderbilt Libraries Digital Lab - www.library.vanderbilt.edu"