Digital Education Resources - Vanderbilt Libraries Digital Lab
Previous lesson: Factors and the t-test of means
In this lesson, we will look at several strategies for dealing with data when a t-test of means fails to meet the assumptions of the test.
Learning objectives At the end of this lesson, the learner will be able to:
Total video time: 24 m 47 s
Code for filtering one experimental group and testing for normality:
# filter each color and test if normally distributed
red_tibble <- filter(erg_tidy, erg_tidy$color=="red")
hist(red_tibble$voltage)
qqnorm(red_tibble$voltage, datax = TRUE)
shapiro.test(red_tibble$voltage)
Code to add a column with log-transformed data:
transformed_red <- mutate(red_tibble, log_col=log(voltage)) # note: natural log
The log()
function returns the natural log or ln() of the argument. Base ten log could be used and would produce the same effect.
Other transformations are:
sqrt(dependent_variable) # use for counts
asin(sqrt(dependent_variable)) # use for proportions
Typically, for proportions a logistic regression is a better option, since it has an dependent variable that is binary (has one of two possible states). A binary outcome can be summarized as a proportion.
Code for testing for equal variance using Bartlett’s test:
# Bartlett's test without transformation
erg_factor <- mutate(erg_tidy, color=factor(color))
bartlett.test(voltage ~ color, data=erg_tidy)
Note that Bartlett’s test requires the independent variable to be a factor, so if manipulating the data resulted in a tibble with the independent variable converted to a character string, the conversion to a factor may need to be done prior to running the test.
Often the same transformation that fixes problems with normality will also fix problems with unequal variance. If not, in the case of the t-test of means, there is a variant of the test for unequal variances. In that case, set var.equal=FALSE
:
t.test(voltage ~ color, data=transformed_factor, var.equal=FALSE)
When reporting statistical quantities following a test using transformed variables, you should back-transform the quantities so that they have the same scale as the original quantities. To back transform, use the inverse property to the property you used for the transformation.
In the case of the log(variable)
(natural log) transformation, use exp(transformed_variable)
, which raises e
to the power of the argument.
In the case of the sqrt(variable)
transformation, use transformed_variable^2
(square the transformed quantity).
Most parametric tests have an alternative non-parametric test that can be used with the same type of data. In the case of the t-test of means, the Wilcoxon rank sum/Mann-Whitney U test can be used instead. We refer to it as the Wilcoxon-Mann-Whitney (WMW) test. The format of the test is:
# perform the test on untransformed data since non-parametric
wilcox.test(voltage ~ color, data=erg_factor)
As with similar functions, the data=
argument can be omitted if the dependent and independent variables are input as vectors or as columns where the data frame is explicitly specified using the “$” notation.
An appropriate visualization for t-test of means is a box-and-whisker plot:
plot(dependent_variable ~ independent_variable, data=data_frame)
A better visualization would use 95% confidence intervals for the error bars, but generating such a plot is beyond the scope of this lesson.
The ggplot2
package makes it possible to generate presentation-quality plots where the appearance can be precisely controlled. Here is an example of a box plot using the ggplot()
function from that package:
ggplot(data = erg_factor, aes(x=color, y=voltage, color=color)) + geom_boxplot()
An alternative is a “violin” plot, which provides an indication of the “shape” of the distribution:
ggplot(data = erg_factor, aes(x=color, y=voltage, color=color)) + geom_violin()
Both plots can be overlaid using this code:
ggplot(data = erg_factor, aes(x=color, y=voltage, color=color)) +
geom_boxplot() +
geom_violin(alpha = 0.3)
The alpha parameter controls the transparency, making it possible to see the box plot without the violing plot obscuring it.
The URL below links to a dataset of 1000 simulated body mass index (BMI) values where the smoking status (smoker vs. non-smoker) of the person is recorded:
read.csv("https://raw.githubusercontent.com/HeardLibrary/digital-scholarship/master/code/codegraf/027/fake_bmi_smoking_data.csv")
In this situation, we need to be careful about the conclusions we draw from the analysis because this is NOT a controlled experiment (i.e. NOT a randomized controlled trial). That’s because people were not randomly assigned to a “smoker” or “non-smoker” experimental group - their status was simply recorded. It would be unethical to conduct such an experiment because smoking is known to be extremely dangerous to health. So this experiment is likely to violate one of the assumptions of the t-test of means: independence between the two variables. We could imagine that stress could cause one to smoke and stress could also cause one to over-eat. Thus smoking and BMI could be correlated through a third factor: stress.
Also, because we did not assign the levels of the “independent” variable (smoking status), we don’t really know the direction of cause and effect. It is possible that BMI and smoking might be related because having a greater BMI affects one’s propensity to smoke rather smoking affecting BMI.
Nevertheless, the t-test provides us with a means to assess objectively whether the two groups (smoker and non-smoker) differ significantly. So we will go ahead and run the test, but need to be extremely caution about interpretation of the results.
Read the data into a data frame, then separate the smoker and non-smoker rows into separate data frames using filter()
. The power of a t-test is greatest if the design is “balanced”, i.e. there are equal numbers of samples for each level of a factor. Is the smoking factor balanced? Review this section if you don’t remember how to find out how many items are in a vector.
Test the BMI values for each of the two levels of the smoking status for normality, both graphically and using the Shapiro-Wilkes test.
If you think about BMI, it is limited on the lower end by the possible smallest size of a human. Our mass has a lower limit, so that places a lower limit of zero on the BMI. On the other end of the distribution, there are a few people who have very large masses, making the upper end of the BMI distribution have a very long tail. This makes it likely that the distribution of BMI will be skewed to the right. Do you see this in your results to the previous question? Apply an appropriate transformation to the data, then repeat the tests for normality. Does the situation improve enough that you think it’s valid to perform the t-test of means? Try running the tests again, but with only the first 50 rows of each data frame. You can do that like this: data_frame[1:50,]
. Why do the results change?
Carry out Bartlett’s test on both the transformed and untransformed data. How does the transformation affect the homogeneity of the variances?
Because this is an exercise, we’ll carry out the t-test of means regardless of whether the assumption of normality is met. If Bartlett’s test on the transformed data shows a non-significant difference in variances, perform a t-test of means for equal variances. If there is a significant difference in the variances, perform a t-test of means for unequal variances. Back-transform the estimated sample means for the two levels of the smoking factor.
Carry out the Wilcoxon-Mann-Whitney (WMW) test on the untransformed data. Compare the P value of the WMW test to that of the t-test of means.
Create a box plot and violin plot for the data you used in the test.
Next lesson: Continuous bivariate data
Revised 2020-11-03
Questions? Contact us
License: CC BY 4.0.
Credit: "Vanderbilt Libraries Digital Lab - www.library.vanderbilt.edu"