Digital Education Resources - Vanderbilt Libraries Digital Lab

Previous lesson: Lists and data frames

# R programming basics: Basic statistics, plots, and missing data

In this lesson, we will start using R data structures to do simple forms of data exploration through statistical calculations and plotting. In the process, we will learn the role that missing data plays in R data sets. We will also examine how the vectorized programming approach that is fundamental to R influences the way we code complex operations.

Learning objectives At the end of this lesson, the learner will be able to:

• describe the function that the missing data indicator `NA` serves in R data structures.
• list two types of cells in CSV files that might be interpreted as `NA` when loaded into an R data frame.
• calculate basic statistics on a vector: `length()`, `mean()`, `sd()`, `summary()`, and `quantile()`.
• describe how the procedural and vectorized programming paradigms differ.
• use the `is.na()` function to modify items in a vector or column of a data frame.
• create a histogram with `hist()`.
• create a box-and-whisker plot using `plot()`.
• create an x-y scatterplot using `plot()`.
• use the `lm()` function to generate a trendline for a scatterplot and to calculate statistical quantities related to linear regressions.

Total video time: 46 m 44 s

Lesson R script at GitHub

Lesson slides

CodeGraf intermediate lessons on Intro to Stats with R

CodeGraf intermediate lessons on Data visualization with R using ggplot

## Standard graphics

The graphics package is included in the standard distribution, so many kinds of plots can be generated without installing any additional packages. A standard reference book for R graphics is R Graphics by Paul Murrel (now in its third edition). The book is available in print from online sellers, but some parts of older editions are available online, such as chapters 1, 4, and 5 of the first edition. There are also graph galleries for the figures in each chapter of the book that provide the code used to generate the plots. The second edition of R Graphics is available online to Vanderbilt users through our O’Reilly subscription. See the following paragraph for login information.

Another more brief reference is Chapter 10 of R Cookbook, 2nd Edition by Paul Teetor (print book), which shows how to generate many of the typical kinds of simple plots useful to R users. Vanderbilt users can find it by logging directly into the O’Reilly website, using VUNet ID and password at this link. You can also try this direct link to the book. In order for the direct links to places in the book to work, you must be logged into the O’Reilly website.

## Basic statistics

R Cookbook chapter 9 provides a brief introduction to statistical calculations. Because R was created to do stats, there are many, many references available. Some free online resources to get you started are here. CodeGraf intermediate lessons on Intro to Stats with R

# Missing data

## The missing data indicator NA (3m54s)

The “not available” missing data indicator `NA` is written without quotes:

``````vector_with_missing <- c(1, 2, NA, 3)
``````

Functions that will not return a value when data are missing can be forced to remove `NA` prior to calculating with `na.rm = TRUE`:

``````mean(vector_with_missing, na.rm = TRUE) # remove NAs, then calculate
``````

## Missing data when reading from CSV files (9m16s)

`read_csv()` converts `""` (empty) and `"NA"` strings to `NA` missing data values when reading CSV data into tibbles.

The behavior of `read.csv()` is complicated with respect to the cells it treats as missing data when it reads values into traditional data frames. Specific strings can be specified to be considered missing data using an `na.strings` argument:

``````data_frame <- read.csv(url, na.strings = c("-9999", "NaN", "NA" ,"")) # all listed strings read as NA
``````

A `colClasses = "character"` argument causes every value (including numbers) to be read as characters except for the string `"NA"`, which is still read as an `NA` value.

``````data_frame <- read.csv(url, colClasses = "character")
``````

# Basic statistical quantites

## Calculating some statical quantities (7m27s)

``````length(vector) # returns the number of items in a vector
mean(vector) # returns the average (mean) of items in a vector
sd(vector) # returns the standard deviation of items in a vector
summary(vector) # summaries limits, mean, median, etc. of items in a vector
quantile(vector) # defaults to returning quartiles of a vector
``````

To generate quantiles other than quartiles, use the `probs` argument. For example:

``````quantile(x, probs = seq(0, 1, 1/10))
``````

will generate deciles.

# Procedural vs. vectorized programming paradigm

## The is.na() function (0m47s)

The `is.na()` function returns a boolean: `TRUE` if the argument is `NA` and `FALSE` otherwise.

``````is.na(NA) # returns TRUE
is.na(3) # returns FALSE
``````

## Replacing NA with zeros (9m53s)

Code for Python example

Example with result in a separate vector:

``````schools_data <- read_csv("https://github.com/HeardLibrary/digital-scholarship/raw/master/data/gis/wg/Metro_Nashville_Schools.csv")
asian <- schools_data\$Asian # Asian column of data frame assigned to a named vector object
booleans_vector <- is.na(asian)
asian[booleans_vector] <- 0
mean(asian)
``````

Example with replacement in the original column (“in place”):

``````schools_data <- read_csv("https://github.com/HeardLibrary/digital-scholarship/raw/master/data/gis/wg/Metro_Nashville_Schools.csv")
schools_data\$Asian[is.na(schools_data\$Asian)] <- 0
mean(schools_data\$Asian)
``````

# Basic plots

CodeGraf intermediate lessons on Data visualization with R using ggplot

## Histograms (3m23s)

``````hist(vector)
``````

## Box-and-whisker plot (2m34s)

A box-and-whisker plot is generated when the independent variable (x) is a discontinuous factor and the dependent variable (y) is continous (a numeric vector):

``````plot(y ~ x)
``````

NOTE: In version 3 of R and lower, character columns are automatically converted to factors when CSV files are read into traditional data frames using `read.csv()`. They are not converted when CSV files are read into tibbles using `read_csv()`, or when using `read.csv()` in R version 4.0 or greater. To convert a character column to factors, place the column name inside the function `as.factor()`. For example, in a data frame named `df` with numeric column `y` and character column `x`, use:

``````plot(df\$y ~ as.factor(df\$x))
``````

## Scatterplot with trendline (3m46s)

An X-Y scatterplot is generated when the independent variable (x) and the dependent variable (y) are both continuous (numeric vectors):

``````plot(y ~ x)
``````

To include a trend line, a linear model must be generated using the same variables:

``````model <- lm(y ~ x)
``````

To overlay the trendline:

``````abline(model)
``````

The model will provide information about the linear regression test:

``````model # gives slope and intercept of best-fit line
summary(model) # gives statistical quantities related to the regression
``````

# Practice assignment

The practice assignment is here. You will need to load it into the editor pane of RStudio.

It is best to try to complete each problem on your own before resorting to watching the solution videos below.

Problem 1 solution

Problem 2 solution

Problem 3 solution

Next lesson: Tidy Data and basic data wrangling

Revised 2023-08-22