Last lesson in beginner series: R programming basics

R Intro to stats: Factors and the t-test of means

This is the first lesson in the module Introduction to statistics with R. This module assumes that you have some basic familiarity with statistics, although some background is reviewed with each topic. It also assumes that you have installed R and RStudio and completed the R for beginners module or have equivalent experience gained on your own.

In this lesson, we will gain a more in-depth understanding of factors in R and will learn the details necessary to conduct a valid t-test of means.

Learning objectives At the end of this lesson, the learner will be able to:

explain the role of factors in structuring data.
describe three ways that a column in a data frame can be made a factor.
use the as.character() function to turn a factor into a character vector.
carry out a t-test of means with equal variance.
explain the relationship between P and the likelihood that differences are due to chance rather than real differences between groups.
visualize the normality of data using a histogram and a normal quantile plot.
test the normality of data using the Shapiro-Wilkes test.
use Bartlett’s test to determine whether the variance of two groups differ significantly.

Total video time: 33 m 16 s

Factors

What are factors? (5m10s)

A factor is a data structure used for grouping data. The categories of a factor are called levels.

The factor() function can be used to convert a vector into a factor:

factor_object <- factor(vector)

Factors are viewed by their string labels, but are actually stored as numbers.

Data frames and factors (4m11s)

When data frames are constructed using the data.frame() function, character vectors are automatically converted into factors.

When data frames read in using the read.csv() function, character vectors are also automatically converted to factors. (This does not happen with tibble data frames.)

To convert a factor to a vector, use the as.character() function.

The t-test of means

What is a t-test of means? (4m31s)

The command to carry out a t-test of means with equal variances is:

t.test(dep_vec ~ ind_vec, var.equal=TRUE)

when the data input is by vectors, or

t.test(dep_col ~ ind_col, data=data_frame, var.equal=TRUE)

when the data are input from data frame columns.

Review of p-value (P) (8m54s)

P is the probability that we would get results like these samples if there were really no difference between the means of the two groups (the variation we see is cause by random unrepresentative sampling).

If the value of P is less than 0.05, it is unlikely that the means of the two groups are actually the same, and we assume that the null hypothesis (that the groups are the same) is wrong. We conclude that the two groups are significantly different.

If P > 0.05, it may be that the two groups are not different, but it could also mean that the groups are different but our experiment was too bad to detect it.

Statistical power is the ability of a test to detect differences that are real. We can increase the power of a test by reducing variability in the experimental conditions, or by increasing the sample size.

When statistical power is very great, small effects can become statistically significant even if they are not actually very important.

Testing for normality (7m06s)

There are three main assumptions that must be met in order for a t-test of means to be valid:

Independence of the two samples.
Each group normally distributed.
Variances of the two groups are the same.

To test the assumption that each group is normally distributed, we can visualize the distribution with a histogram or normal quantile (Q-Q) plot, and we can run the Shapiro-Wilkes test to do a numerical assessment.

To create a histogram, use:

hist(data_frame$column)

where the column has been filtered to contain only one of the levels of the factor. To create a normal quantile plot, use:

qqnorm(data_frame$column, datax = TRUE)

The Shapiro-Wilks test is:

shapiro.test(data_frame$column)

If the result has P < 0.05, the data deviate significantly from normal.

The t-test of means is relatively robust to deviations from normality. So the test can be valid even when the data are not as normal as we would like.

Testing the assumption of equal variances (2m02s)

Bartlett’s test can be used to test the assumption of equal variances between the two groups:

bartlett.test(dep_col ~ ind_col, data=data_frame)

If the result has P < 0.05, the variances are significantly different. Bartlett’s test requires that the data be normally distributed, but if they aren’t the t-test is invalid anyway.

Practice assignment

For the first 5 questions, use the Nashville schools data from

read.csv("https://raw.githubusercontent.com/HeardLibrary/digital-scholarship/master/data/gis/wg/Metro_Nashville_Schools.csv")

For the rest of the questions, use the cockroach electroretinogram data from

read.csv("https://raw.githubusercontent.com/HeardLibrary/digital-scholarship/master/data/r/red-blue-erg-data.csv")

Load the Nashville schools data into a data frame using the read.csv() function. Create a new column in the table for total number of students by adding the number of male and female students. Create another new column for the fraction of students that are white in each school.
Create a plot of fraction white versus zip code using plot(schools_data$frac_white ~ schools_data$Zip.Code). What is wrong with this plot? Is zip code a factor or a numeric vector? Which should it be?
Convert the zip code column to a factor, then fraction white versus zip code again.
The plot from the previous problem is crazily busy. Create a more reasonable plot by filtering to plot only zip codes between 37115 and 37208.
Are the values of school name character vectors or factors? Which should they be? Why? Convert the school name column to the correct type of data structure.
Load the red/blue color data for a cockroach electoretinogram experiment into a data frame. Don’t forget to use the URL for the Raw data, not the URL for the web page about the table. “Tidy” the data so that the data for the color is in a single factor column. See this lesson if you don’t remember how to do that. Note: after tidying, the color column will be a character vector.
Filter the red and blue data into two separate dataframes (or put them into two separate vectors). Create a histogram and normal quantile plot for each color, then run the Shapiro-Wilkes test on each color vector. Are the data normal enough to carry out a t-test of means?
Use Bennett’s test to test for equal variance of the response values of the two levels of the color factor. Make sure the color column is a factor before you run this test.
Regardless of whether the assumptions were met, carry out at t-test of means to determine whether the mean red and blue roach eye response differ significantly. Interpret the results. Make sure the color column is a factor before you run this test.

Next lesson: Transformations and non-parametric tests

Revised 2020-10-21

Questions? Contact us

License: CC BY 4.0.
Credit: "Vanderbilt Libraries Digital Lab - www.library.vanderbilt.edu"