Digital Education Resources - Vanderbilt Libraries Digital Lab
ggplot allows you to use built-in statistical capabilities to explore your data without running separate statistical transformations. In this lesson, we will also see how some ggplot functions are actually built from less specific functions by assuming default values for some arguments.
Learning objectives
At the end of this lesson, the learner will be able to:
create histograms with geom_histogram and use bins and binwidth to control granularity.
display distributions with geom_density and geom_qq.
compare distributions with geom_boxplot, geom_violin, and geom_dotplot.
use the stat argument to override the default stat for a geom.
use geom_smooth to show the trend of scatterplot data.
use the method and formula arguments of geom_smooth.
use the se argument of geom_smooth.
real datasets to explore from “The Analysis of Biological Data” by Whitlock and Schluter
ggplot2 book - displaying distributions
ggplot2 book - explanation of components of a layer
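A histogram groups the data into bins and plots the count in each bin. The bins argument of geom_histogram sets the number of bins and the binwidth argument sets their width, so either one controls the granularity.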
ggplot(usa_hemoglobin, aes(hemoglobin)) +
geom_histogram(binwidth = 0.1)
ggplot(usa_hemoglobin, aes(hemoglobin)) +
geom_histogram(bins = 50) # 30 bins is the default
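To see what the two arguments actually do, the binned values can be inspected with ggplot2's layer_data() function. The following is a minimal sketch that uses the built-in faithful dataset as a stand-in, since the lesson's hemoglobin data frames are not defined here; the variable names are illustrative only.

```r
library(ggplot2)

# faithful is a built-in dataset used here as a stand-in for the lesson's data
by_width <- ggplot(faithful, aes(eruptions)) +
  geom_histogram(binwidth = 0.25)   # every bin spans 0.25 minutes

by_count <- ggplot(faithful, aes(eruptions)) +
  geom_histogram(bins = 20)         # ggplot chooses the width from the bin count

# layer_data() returns the values ggplot computed for a layer
width_bins <- layer_data(by_width)
count_bins <- layer_data(by_count)

unique(round(width_bins$xmax - width_bins$xmin, 6))  # width of each bin
sum(count_bins$count)                                # total number of observations binned
```

Whichever argument is used, every observation is assigned to exactly one bin; only the grouping changes.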
If there are multiple levels of a factor in the data, each level can be plotted as a different histogram on the same axes using the color aesthetic:
ggplot(hemoglobin_frame, aes(hemoglobin)) +
geom_histogram(aes(color = population), fill = NA, binwidth = 0.5) # fill = NA makes the bars transparent
For larger datasets where the data are smoothly distributed, a density plot may make more sense:
ggplot(usa_hemoglobin, aes(hemoglobin)) +
geom_density()
When the data include several overlapping distributions, a density plot makes them much easier to compare than a histogram does. The alpha argument can be used to make the fill partially transparent.
ggplot(hemoglobin_frame, aes(hemoglobin)) +
geom_density(alpha = 0.2, aes(fill = population, color = population))
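The same pattern can be tried without the hemoglobin data. This sketch assumes the built-in iris dataset as a stand-in, with Species playing the role of population; it also checks that ggplot computed one curve per level of the factor.

```r
library(ggplot2)

# iris is a built-in dataset; Species stands in for the lesson's population factor
density_plot <- ggplot(iris, aes(Sepal.Length)) +
  geom_density(aes(fill = Species, color = Species), alpha = 0.2)

# Each level of the factor becomes its own group, and so its own density curve
curves <- layer_data(density_plot)
length(unique(curves$group))
```

Mapping both fill and color to the same factor keeps each outline and its shading in matching colors.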
A normal quantile (or “Q-Q”) plot makes the similarities and differences among overlapping distributions more apparent, although it is harder to interpret than a histogram or density plot.
ggplot(hemoglobin_frame, aes(sample = hemoglobin)) +
geom_qq(aes(color = population)) +
stat_qq_line(aes(color = population))
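As an aid to reading such a plot: each point pairs an observed value (vertical axis) with the value expected at the same quantile of a normal distribution (horizontal axis), so points that follow the reference line suggest approximately normal data. A minimal sketch, assuming the built-in iris dataset in place of hemoglobin_frame:

```r
library(ggplot2)

# iris stands in for the lesson's data; geom_qq_line() draws the reference line
qq_plot <- ggplot(iris, aes(sample = Sepal.Length)) +
  geom_qq(aes(color = Species)) +
  geom_qq_line(aes(color = Species))

# In the computed layer data, x holds theoretical quantiles and y the observed values
qq_points <- layer_data(qq_plot, 1)
head(qq_points[, c("x", "y")])
```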
Box plot
ggplot(hemoglobin_frame, aes(population, hemoglobin)) +
geom_boxplot()
Violin plot
ggplot(hemoglobin_frame, aes(population, hemoglobin)) +
geom_violin()
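A box plot summarizes each group with five computed values: the median, the first and third quartiles, and the whisker ends. The sketch below, which assumes the built-in iris dataset as a stand-in, pulls those values out of the plot and also overlays a narrow box plot on a violin plot so that the summary and the shape can be read together.

```r
library(ggplot2)

# iris stands in for hemoglobin_frame; Species plays the role of population
box_plot <- ggplot(iris, aes(Species, Sepal.Length)) +
  geom_boxplot()

# The statistics ggplot computed for each box
box_stats <- layer_data(box_plot)
box_stats[, c("ymin", "lower", "middle", "upper", "ymax")]

# A narrow box plot drawn inside each violin
combined <- ggplot(iris, aes(Species, Sepal.Length)) +
  geom_violin() +
  geom_boxplot(width = 0.1)
```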
A dot plot shows the magnitude of the data throughout the distribution, but doesn’t work well for large datasets. It’s better for small datasets, including those that aren’t smoothly distributed.
ggplot(hemoglobin_frame, aes(x = population, y = hemoglobin)) +
geom_dotplot(binaxis = "y", stackdir = "center", dotsize = .5, binwidth = .25, aes(color = population, fill = population))
In this example, the distributions are displayed vertically. Reversing the axes will display them horizontally.
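One way to get the horizontal layout without rewriting the aesthetics is to add coord_flip() to an existing plot. This is a sketch of that alternative, not necessarily the approach the lesson intends; it uses the built-in mtcars dataset because it is small enough for a dot plot.

```r
library(ggplot2)

# mtcars is a small built-in dataset; cyl is converted to a factor for grouping
vertical <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_dotplot(binaxis = "y", stackdir = "center", binwidth = 1)

# coord_flip() swaps the drawn axes, so the distributions run left to right
horizontal <- vertical + coord_flip()
```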
In the overall scheme of plot construction, there is a stage where statistics can be applied to data prior to plotting.

Figure from Wickham and Grolemund https://r4ds.had.co.nz/ CC BY-NC-ND
We’ll explore that stage in this section.
ggplot simplifies the specification of geoms by providing more specific “shortcut” geoms that don’t require you to specify all of the arguments.
The most general geom is layer, which can be used to build many different kinds of geoms by specifying all of the details.
ggplot(erg_mean_frame, aes(x=color, y=mean_response)) +
layer(
mapping = NULL,
data = NULL,
geom = "bar",
stat = "identity",
position = "identity"
)
The geom_bar geom is more specific by hard coding the geom to bar. This makes it easier to use because fewer arguments need to be provided. The following plot is exactly the same as the one above.
ggplot(erg_mean_frame, aes(x=color, y=mean_response)) +
geom_bar(
stat="identity"
)
The stat argument specifies what kind of statistical transformation is done to the data prior to plotting. A value of “identity” tells the geom to use the data directly without any transformation.
The geom_col geom is identical to the geom_bar geom, except that it always uses the “identity” stat, so the stat doesn’t need to be specified. The following plot will be exactly the same as the two above.
ggplot(erg_mean_frame, aes(x=color, y=mean_response)) +
geom_col(
)
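The equivalence of the three forms can be checked directly by comparing the bar heights each one computes. The data frame below is a hypothetical stand-in for erg_mean_frame with made-up values, used only to make the sketch self-contained.

```r
library(ggplot2)

# Hypothetical summary table standing in for erg_mean_frame (values are made up)
means <- data.frame(color = c("blue", "green", "red"),
                    mean_response = c(1.2, 3.4, 2.1))

base <- ggplot(means, aes(x = color, y = mean_response))

with_layer <- base + layer(geom = "bar", stat = "identity", position = "identity")
with_bar   <- base + geom_bar(stat = "identity")
with_col   <- base + geom_col()

# One column of bar heights per plot; the three columns should match
heights <- sapply(list(with_layer, with_bar, with_col),
                  function(p) layer_data(p)$y)
heights
```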
A stat argument can be used to override the default stat for a geom. The default stat for geom_bar is “count”, but in this example, the y values are derived from a statistical summary of the data rather than the count of items.
ggplot(erg_frame, aes(x = color, y = response)) +
geom_bar(stat = "summary", fun = "mean") # stat = "summary" overrides the default counting method
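To confirm that the bars really show group means, the computed heights can be compared with means calculated directly. This sketch assumes the built-in iris dataset in place of erg_frame; note that the fun argument requires ggplot2 3.3.0 or later.

```r
library(ggplot2)

# iris stands in for erg_frame; each bar is the mean Sepal.Length of one species
mean_bars <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_bar(stat = "summary", fun = "mean")

bar_heights <- layer_data(mean_bars)$y                       # computed by ggplot
group_means <- tapply(iris$Sepal.Length, iris$Species, mean) # computed directly

rbind(bar_heights, group_means)
```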
Error bars are created using geom_errorbar. They can be applied as a layer on top of bar or point plots.
ggplot(graphing_data, aes(x=color, y=mean)) +
geom_bar(stat="identity", aes(fill = color)) + # same as geom_col(aes(fill = color))
geom_errorbar(aes(ymin = lower_cl, ymax = upper_cl),
width=.2)
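The lesson does not show how the mean, lower_cl, and upper_cl columns of graphing_data were produced. One possible way, sketched here under the assumption that the limits are t-based 95% confidence intervals, is to compute them per group with base R. The helper summarize_ci and the ci_frame data frame are illustrative names, and iris stands in for the original data.

```r
library(ggplot2)

# Hypothetical helper: mean and t-based confidence limits for one group
summarize_ci <- function(values, level = 0.95) {
  n <- length(values)
  m <- mean(values)
  half_width <- qt(1 - (1 - level) / 2, df = n - 1) * sd(values) / sqrt(n)
  c(mean = m, lower_cl = m - half_width, upper_cl = m + half_width)
}

# One row per species with columns mean, lower_cl, upper_cl
ci_matrix <- t(sapply(split(iris$Sepal.Length, iris$Species), summarize_ci))
ci_frame <- data.frame(species = rownames(ci_matrix), ci_matrix)

ci_plot <- ggplot(ci_frame, aes(x = species, y = mean)) +
  geom_col(aes(fill = species)) +
  geom_errorbar(aes(ymin = lower_cl, ymax = upper_cl), width = 0.2)
```

The width argument of geom_errorbar controls only the horizontal size of the caps, not the vertical extent of the bars.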
The geom_smooth geom fits a smoothing model to X-Y data to show the trend in the data. By default, the “loess” method is used for datasets with fewer than 1,000 observations.
ggplot(lion_noses, mapping = aes(x = proportionBlack, y = ageInYears)) +
geom_point() +
geom_smooth()
By default, a confidence band (95% unless the level argument is changed) is shown around the trend curve. It can be suppressed by setting the se argument to FALSE.
ggplot(lion_noses, mapping = aes(x = proportionBlack, y = ageInYears)) +
geom_point() +
geom_smooth(se = FALSE)
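The defaults can also be written out explicitly, which avoids the message ggplot prints when it picks a method itself. The sketch below assumes the built-in mtcars dataset in place of lion_noses and shows that the band is stored as ymin and ymax values around the fitted curve.

```r
library(ggplot2)

# mtcars stands in for lion_noses
base <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Method, formula, and confidence level stated explicitly
with_band <- base + geom_smooth(method = "loess", formula = y ~ x, level = 0.95)
no_band   <- base + geom_smooth(method = "loess", formula = y ~ x, se = FALSE)

# Layer 2 is the smooth; ymin and ymax are the edges of the band
band <- layer_data(with_band, 2)
head(band[, c("x", "y", "ymin", "ymax")])
```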
Statistical methods other than loess can be specified. A linear model (method = "lm") fits a least-squares straight line by default:
ggplot(lion_noses, mapping = aes(x = proportionBlack, y = ageInYears)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
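The line drawn by method = "lm" is the same one R's lm function would fit, which makes it easy to recover the slope and intercept. A minimal sketch, assuming the built-in mtcars dataset in place of lion_noses:

```r
library(ggplot2)

# mtcars stands in for lion_noses
lm_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Fit the same model directly to get the intercept and slope
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)

# Compare the plotted line with predictions from the direct fit
line <- layer_data(lm_plot, 2)
from_lm <- predict(fit, newdata = data.frame(wt = line$x))
max(abs(line$y - from_lm))   # should be essentially zero
```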
Other functions can be used for the linear model if specified using the formula argument:
ggplot(lion_noses, mapping = aes(x = proportionBlack, y = ageInYears)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ exp(x)) +
labs(y="age (years)", x = "proportion black")
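Note that the formula is written in terms of the aesthetics x and y, not the column names of the data frame. The sketch below illustrates this with a quadratic fit, chosen only as an example of a different formula, on the built-in mtcars dataset, and checks it against the equivalent lm call written with column names.

```r
library(ggplot2)

# In geom_smooth the formula refers to the aesthetics x and y
curve_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE)

# The equivalent direct fit refers to the actual column names
quad_fit <- lm(mpg ~ poly(wt, 2), data = mtcars)

curve <- layer_data(curve_plot, 2)
from_quad <- predict(quad_fit, newdata = data.frame(wt = curve$x))
max(abs(curve$y - from_quad))   # should be essentially zero
```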
There are a number of built-in datasets included with the R installation that can be referenced without loading them from an external file. We will use some of them in the practice assignment.
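As a quick illustration, the datasets that ship with R's datasets package can be listed with data(), and any of them can be passed straight to ggplot. The choice of faithful here is arbitrary.

```r
library(ggplot2)

# Names of the datasets in R's built-in datasets package
available <- data(package = "datasets")$results[, "Item"]
head(available)

# A built-in data frame can be used without reading any file
waiting_plot <- ggplot(faithful, aes(x = eruptions, y = waiting)) +
  geom_point()
```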
Revised 2021-09-24
License: CC BY 4.0.
Credit: "Vanderbilt Libraries Digital Lab - www.library.vanderbilt.edu"