Previous lesson: Controlling the appearance of plots
ggplot allows you to use built-in statistical capabilities to explore your data without running separate statistical transformations. In this lesson, we will also see how some ggplot functions are actually built from less specific functions by assuming default values for some arguments.
Learning objectives
At the end of this lesson, the learner will be able to:
use binwidth to control the granularity of a histogram.
use the stat argument to override the default stat for a geom.
use geom_smooth to show the trend of scatterplot data.
In a histogram, the binwidth argument sets the width of the bins used to group the data. Alternatively, the bins argument sets the number of bins.
ggplot(usa_hemoglobin, aes(hemoglobin)) + geom_histogram(binwidth = 0.1)
ggplot(usa_hemoglobin, aes(hemoglobin)) + geom_histogram(bins = 50) # 30 bins is the default
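To see how binwidth changes the granularity, the sketch below counts the bins that ggplot computes for two different widths. It is only an illustration: the built-in faithful dataset stands in for usa_hemoglobin, and the variable names are made up for this example. layer_data() returns the data frame produced by the stat, with one row per bin.

```r
library(ggplot2)

# Assumption: the built-in faithful dataset stands in for usa_hemoglobin
p_narrow <- ggplot(faithful, aes(waiting)) + geom_histogram(binwidth = 1)
p_wide   <- ggplot(faithful, aes(waiting)) + geom_histogram(binwidth = 10)

# One row per bin: a smaller binwidth gives more, narrower bins
narrow_bins <- layer_data(p_narrow)
wide_bins   <- layer_data(p_wide)

nrow(narrow_bins)  # many narrow bins
nrow(wide_bins)    # a few wide bins

# Every observation falls into exactly one bin, whatever the width
sum(narrow_bins$count)
sum(wide_bins$count)
```

A very small binwidth shows noise in the data, while a very large one hides real structure, so it is usually worth trying several values.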
If there are multiple levels of a factor in the data, each level can be plotted as a different histogram on the same axes using the color aesthetic.
ggplot(hemoglobin_frame, aes(hemoglobin)) + geom_histogram(aes(color = population), fill = NA, binwidth = 0.5) # NA makes the bars transparent
For larger datasets where the data are smoothly distributed, a density plot may make more sense.
ggplot(usa_hemoglobin, aes(hemoglobin)) + geom_density()
When several distributions overlap, a density plot makes them much easier to visualize than a histogram does. The alpha argument can be used to make the fill partially transparent.
ggplot(hemoglobin_frame, aes(hemoglobin)) + geom_density(alpha = 0.2, aes(fill = population, color = population))
A normal quantile (or “Q-Q”) plot makes the similarities and differences among overlapping distributions more apparent, although it is harder to interpret.
ggplot(hemoglobin_frame, aes(sample = hemoglobin)) + geom_qq(aes(color = population)) + stat_qq_line(aes(color = population))
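The following sketch may help with interpreting a Q-Q plot by reproducing the points by hand. It assumes simulated data (the demo data frame and its value column are invented for this example) and the default normal distribution. Each point pairs a sorted sample value (y) with the normal quantile expected at that rank (x), so data that are close to normal fall near a straight line.

```r
library(ggplot2)

# Assumption: simulated, roughly normal data stands in for the hemoglobin values
set.seed(1)
demo <- data.frame(value = rnorm(50, mean = 14, sd = 1.5))

p_qq <- ggplot(demo, aes(sample = value)) + geom_qq() + stat_qq_line()

# Points computed by ggplot for the first layer (geom_qq)
qq_points <- layer_data(p_qq, 1)

# The same points computed by hand
manual_theoretical <- qnorm(ppoints(nrow(demo)))  # expected normal quantiles
manual_sample <- sort(demo$value)                 # observed values in rank order
```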
ggplot(hemoglobin_frame, aes(population, hemoglobin)) + geom_boxplot()
ggplot(hemoglobin_frame, aes(population, hemoglobin)) + geom_violin()
A dot plot shows the magnitude of the data throughout the distribution, but doesn’t work well for large datasets. It’s better for small datasets, including those that aren’t smoothly distributed.
ggplot(hemoglobin_frame, aes(x = population, y = hemoglobin)) + geom_dotplot(binaxis = "y", stackdir = "center", dotsize = .5, binwidth = .25, aes(color = population, fill = population))
In this example, the distributions are displayed vertically. Reversing the axes will display them horizontally.
In the overall scheme of plot construction, there is a stage where statistics can be applied to data prior to plotting.
Figure from Wickham and Grolemund https://r4ds.had.co.nz/ CC BY-NC-ND
We’ll explore that stage in this section.
ggplot simplifies the specification of geoms by providing more specific “shortcut” geoms that don’t require you to specify all of the arguments.
The most general geom is layer, which can be used to build many different kinds of geoms by specifying all of the details.
ggplot(erg_mean_frame, aes(x=color, y=mean_response)) + layer( mapping = NULL, data = NULL, geom = "bar", stat = "identity", position = "identity" )
The geom_bar geom is more specific because it hard-codes the geom to bar. This makes it easier to use because fewer arguments need to be provided. The following plot is exactly the same as the one above.
ggplot(erg_mean_frame, aes(x=color, y=mean_response)) + geom_bar( stat="identity" )
The stat argument specifies what kind of statistical transformation is done to the data prior to plotting. A value of “identity” tells the geom to use the data directly without any transformation.
The geom_col geom is identical to the geom_bar geom, except that its stat is always “identity” and doesn’t need to be specified. The following plot will be exactly the same as the two above.
ggplot(erg_mean_frame, aes(x=color, y=mean_response)) + geom_col( )
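As a quick check that the three forms above are equivalent, this sketch builds all three plots from the same data and compares the bar heights that ggplot computes. The demo_means data frame is a hypothetical stand-in for erg_mean_frame, with invented values.

```r
library(ggplot2)

# Hypothetical stand-in for erg_mean_frame: one pre-computed mean per color
demo_means <- data.frame(
  color = c("blue", "green", "red"),
  mean_response = c(1.2, 3.4, 2.1)
)

base_plot <- ggplot(demo_means, aes(x = color, y = mean_response))

# Most general form: everything is spelled out
p_layer <- base_plot + layer(geom = "bar", stat = "identity", position = "identity")
# geom fixed to "bar"; stat must still be overridden
p_bar <- base_plot + geom_bar(stat = "identity")
# geom fixed to "bar" and stat fixed to "identity"
p_col <- base_plot + geom_col()

# Bar heights computed for each plot
heights <- lapply(list(p_layer, p_bar, p_col), function(p) layer_data(p)$y)
```

All three elements of heights should hold the same values, which are simply the mean_response column passed through unchanged.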
The stat argument can be used to override the default stat for a geom. The default stat for geom_bar is “count”, but in this example, the y values are derived from a statistical summary of the data rather than the count of items.
ggplot(erg_frame, aes(x = color, y = response)) + geom_bar(stat = "summary", fun = "mean") # stat = "summary" overrides the default counting method
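To confirm what the summary stat does, the sketch below compares the bar heights with group means computed directly. It assumes the built-in iris dataset as a stand-in for erg_frame, and a version of ggplot2 recent enough to accept the fun argument (3.3.0 or later).

```r
library(ggplot2)

# Assumption: the built-in iris dataset stands in for erg_frame
p_summary <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_bar(stat = "summary", fun = "mean")

# Heights of the bars as computed by the summary stat
bar_heights <- layer_data(p_summary)$y

# Group means computed directly, in the order of the factor levels
manual_means <- as.numeric(tapply(iris$Sepal.Length, iris$Species, mean))
```

The two vectors should match, because the stat computes the mean of y within each level of x before the bars are drawn.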
Error bars are created using geom_errorbar. They can be applied as a layer on top of bar or point plots.
ggplot(graphing_data, aes(x=color, y=mean)) +
  geom_bar(stat="identity", aes(fill = color)) + # same as geom_col(aes(fill = color))
  geom_errorbar(aes(ymin = lower_cl, ymax = upper_cl), width=.2)
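The lesson does not show how graphing_data was built, so the following is only one possible sketch. It assumes that lower_cl and upper_cl are 95% t-based confidence limits of the mean, uses the built-in iris dataset in place of the original data, and introduces a hypothetical helper named summarise_group.

```r
library(ggplot2)

# Hypothetical helper: mean and 95% t-based confidence limits for one group
summarise_group <- function(values) {
  n <- length(values)
  m <- mean(values)
  half_width <- qt(0.975, df = n - 1) * sd(values) / sqrt(n)
  data.frame(mean = m, lower_cl = m - half_width, upper_cl = m + half_width)
}

# Assumption: iris stands in for the raw data behind graphing_data
groups <- split(iris$Sepal.Length, iris$Species)
demo_summary <- do.call(rbind, lapply(groups, summarise_group))
demo_summary$species <- rownames(demo_summary)

# Bars for the means, with error bars layered on top
p_error <- ggplot(demo_summary, aes(x = species, y = mean)) +
  geom_col(aes(fill = species)) +
  geom_errorbar(aes(ymin = lower_cl, ymax = upper_cl), width = 0.2)
```

Printing p_error draws the plot. Other definitions of the limits, such as the mean plus or minus one standard error, would be plotted in exactly the same way.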
The geom_smooth geom will perform various types of statistical analyses on X-Y data to show the trend in the data. By default, the “loess” method is used for datasets with fewer than 1,000 observations.
ggplot(lion_noses, mapping = aes(x = proportionBlack, y = ageInYears)) + geom_point() + geom_smooth()
By default, a confidence band based on the standard error is shown around the trend curve. It can be suppressed by providing a FALSE value for the se argument.
ggplot(lion_noses, mapping = aes(x = proportionBlack, y = ageInYears)) + geom_point() + geom_smooth(se = FALSE)
Statistical methods other than loess can be specified. A linear model (method = "lm") will default to the best fit line:
ggplot(lion_noses, mapping = aes(x = proportionBlack, y = ageInYears)) + geom_point() + geom_smooth(method = "lm", se = FALSE)
Other functions can be used for the linear model if specified using the formula argument.
ggplot(lion_noses, mapping = aes(x = proportionBlack, y = ageInYears)) + geom_point() + geom_smooth(method = "lm", formula = y ~ exp(x)) + labs(y="age (years)", x = "proportion black")
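To make clear what the formula argument does, this sketch compares the curve drawn by geom_smooth with predictions from the same model fitted directly with lm. It assumes the built-in mtcars dataset in place of lion_noses, and the exponential formula is used only to mirror the example above, not because it suits these data.

```r
library(ggplot2)

# Assumption: the built-in mtcars dataset stands in for lion_noses
p_smooth <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ exp(x), se = FALSE)

# Curve computed by ggplot; the smooth is the second layer
curve_data <- layer_data(p_smooth, 2)

# The same model fitted directly, predicted at the same x positions
fit <- lm(mpg ~ exp(wt), data = mtcars)
manual_curve <- unname(predict(fit, newdata = data.frame(wt = curve_data$x)))
```

In the formula, x and y refer to the aesthetics rather than to the original column names, which is why the plot uses y ~ exp(x) while the direct fit uses mpg ~ exp(wt).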
There are a number of built-in datasets included with the R installation that can be referenced without loading them from an external file. We will use some of them in the practice assignment.
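As a brief illustration, built-in datasets such as mtcars and iris can be inspected by name without reading any file. The two datasets chosen here are only examples and are not necessarily the ones used in the practice assignment.

```r
# Built-in datasets are available by name in a fresh R session
head(mtcars)   # first rows of a data frame about car models
str(iris)      # structure of a data frame of flower measurements

# Names of the datasets in the datasets package
dataset_names <- data(package = "datasets")$results[, "Item"]
iris_rows <- nrow(iris)
```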
Next lesson: Displaying complex data