Previous lesson: Lists and data frames
In this lesson, we will start using R data structures to do simple forms of data exploration through statistical calculations and plotting. In the process, we will learn the role that missing data plays in R data sets. We will also examine how the vectorized programming approach that is fundamental to R influences the way we code complex operations.
Learning objectives At the end of this lesson, the learner will be able to:
NAserves in R data structures.
NAwhen loaded into an R data frame.
is.na()function to modify items in a vector or column of a data frame.
lm()function to generate a trendline for a scatterplot and to calculate statistical quantities related to linear regressions.
Total video time: 46 m 44 s
The graphics package is included in the standard distribution, so many kinds of plots can be generated without installing any additional packages. A standard reference book for R graphics is R Graphics by Paul Murrel (now in its third edition). The book is available in print from online sellers, but some parts of older editions are available online, such as chapters 1, 4, and 5 of the first edition. There are also graph galleries for the figures in each chapter of the book that provide the code used to generate the plots. The second edition of R Graphics is available online to Vanderbilt users through our O’Reilly subscription. See the following paragraph for login information.
Another more brief reference is Chapter 10 of R Cookbook, 2nd Edition by Paul Teetor (print book), which shows how to generate many of the typical kinds of simple plots useful to R users. Vanderbilt users can find it by logging directly into the O’Reilly website, using VUNet ID and password at this link. You can also try this direct link to the book. In order for the direct links to places in the book to work, you must be logged into the O’Reilly website.
R Cookbook chapter 9 provides a brief introduction to statistical calculations. Because R was created to do stats, there are many, many references available. Some free online resources to get you started are here. CodeGraf intermediate lessons on Intro to Stats with R
The “not available” missing data indicator
NA is written without quotes:
vector_with_missing <- c(1, 2, NA, 3)
Functions that will not return a value when data are missing can be forced to remove
NA prior to calculating with
na.rm = TRUE:
mean(vector_with_missing, na.rm = TRUE) # remove NAs, then calculate
"" (empty) and
"NA" strings to
NA missing data values when reading CSV data into tibbles.
The behavior of
read.csv() is complicated with respect to the cells it treats as missing data when it reads values into traditional data frames. Specific strings can be specified to be considered missing data using an
data_frame <- read.csv(url, na.strings = c("-9999", "NaN", "NA" ,"")) # all listed strings read as NA
colClasses = "character" argument causes every value (including numbers) to be read as characters except for the string
"NA", which is still read as an
data_frame <- read.csv(url, colClasses = "character")
length(vector) # returns the number of items in a vector mean(vector) # returns the average (mean) of items in a vector sd(vector) # returns the standard deviation of items in a vector summary(vector) # summaries limits, mean, median, etc. of items in a vector quantile(vector) # defaults to returning quartiles of a vector
To generate quantiles other than quartiles, use the
probs argument. For example:
quantile(x, probs = seq(0, 1, 1/10))
will generate deciles.
is.na() function returns a boolean:
TRUE if the argument is
is.na(NA) # returns TRUE is.na(3) # returns FALSE
Example with result in a separate vector:
schools_data <- read_csv("https://github.com/HeardLibrary/digital-scholarship/raw/master/data/gis/wg/Metro_Nashville_Schools.csv") asian <- schools_data$Asian # Asian column of data frame assigned to a named vector object booleans_vector <- is.na(asian) asian[booleans_vector] <- 0 mean(asian)
Example with replacement in the original column (“in place”):
schools_data <- read_csv("https://github.com/HeardLibrary/digital-scholarship/raw/master/data/gis/wg/Metro_Nashville_Schools.csv") schools_data$Asian[is.na(schools_data$Asian)] <- 0 mean(schools_data$Asian)
A box-and-whisker plot is generated when the independent variable (x) is a discontinuous factor and the dependent variable (y) is continous (a numeric vector):
plot(y ~ x)
NOTE: In version 3 of R and lower, character columns are automatically converted to factors when CSV files are read into traditional data frames using
read.csv(). They are not converted when CSV files are read into tibbles using
read_csv(), or when using
read.csv() in R version 4.0 or greater. To convert a character column to factors, place the column name inside the function
as.factor(). For example, in a data frame named
df with numeric column
y and character column
plot(df$y ~ as.factor(df$x))
An X-Y scatterplot is generated when the independent variable (x) and the dependent variable (y) are both continuous (numeric vectors):
plot(y ~ x)
To include a trend line, a linear model must be generated using the same variables:
model <- lm(y ~ x)
To overlay the trendline:
The model will provide information about the linear regression test:
model # gives slope and intercept of best-fit line summary(model) # gives statistical quantities related to the regression
The practice assignment is here. You will need to load it into the editor pane of RStudio.
It is best to try to complete each problem on your own before resorting to watching the solution videos below.
Problem 1 solution
Problem 2 solution
Problem 3 solution
Next lesson: Tidy Data and basic data wrangling
Questions? Contact us