Digital Education Resources - Vanderbilt Libraries Digital Lab
Previous lesson: Transformations and non-parametric tests
The statistical methods used to analyze data with two continuous variables vary depending on the nature of the data and what you want to do with it. In this lesson, we’ll learn how to test for significant relationships between the variables and how to create a model from the data that can be used for prediction.
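Learning objectives
At the end of this lesson, the learner will be able to:
create a linear model using the lm() function.
display the statistics associated with a model using the summary() function.
pass a model into the plot() function, then use the output to assess the normality of the residuals.
describe what R tells us about two variables.
Total video time: 47 m 09 s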
What test should I use?
1. Am I assuming cause and effect?
2. What do I want to know?
Linear regression determines the straight line that best fits a set of points by minimizing the sum of the squared vertical distances between the line and each point (least squares method). The equation of that line and statistics associated with it are referred to as the linear model.
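The least squares idea can be checked by hand. The sketch below uses simulated data (the variable values, the seed, and the true slope and intercept are made up for illustration) and compares the closed-form least squares estimates for a single predictor with the coefficients reported by lm().

```r
# Simulated data (hypothetical): Y depends linearly on X plus random noise.
set.seed(1)
X <- runif(50, min = 0, max = 10)
Y <- 2 + 0.5 * X + rnorm(50, sd = 1)

# Closed-form least squares estimates for one predictor:
# slope = sum of cross-products / sum of squares of X
slope <- sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))^2)
# The best fit line always passes through the point (mean(X), mean(Y)).
intercept <- mean(Y) - slope * mean(X)

# The same estimates from lm(); the two sets of numbers should agree.
model <- lm(Y ~ X)
print(c(intercept = intercept, slope = slope))
print(coef(model))
```

The two printed results should match apart from rounding, since lm() carries out the same minimization.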
Data on Women and Development downloaded from the World Bank Data Catalog at http://wdi.worldbank.org/table/WV.5
To get the first 217 rows of a data frame, use this syntax:
data_frame[1:217,]
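As a small illustration of row selection, the sketch below builds a made-up data frame (the column names and values are hypothetical, not the World Bank data) and shows both selecting rows by position and dropping rows that contain missing values.

```r
# Hypothetical data frame standing in for a downloaded table.
data_frame <- data.frame(
  country  = c("A", "B", "C", "D", "E"),
  employed = c(45.2, NA, 60.1, 52.3, 38.7),
  accounts = c(30.5, 41.0, NA, 55.8, 22.4)
)

# Rows 1 to 3, all columns (note the comma after the row range).
first_three <- data_frame[1:3, ]

# Drop every row that has a missing value (NA) in any column.
complete_rows <- na.omit(data_frame)

print(first_three)
print(complete_rows)
```

Leaving the space after the comma empty means "all columns"; the same pattern works for any row range.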
Create a linear model
model <- lm(Y ~ X)
where Y and X are vectors or columns from a data frame. To display the slope and intercept, just print
model
To print all statistics associated with the model:
summary(model)
To plot the data and the best fit line:
plot(Y ~ X)
abline(model)
To predict the value of Y, insert the coefficient of the X variable (i.e. the slope) and the intercept:
Y = coef_X * X + intercept
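A minimal sketch of prediction, using the cars data set that ships with R rather than the lesson's data (the value of new_speed is an arbitrary choice for illustration). It computes a prediction by inserting the coefficients into the equation of the line, then repeats it with predict().

```r
# cars is built into R: speed (mph) and stopping distance (ft).
model <- lm(dist ~ speed, data = cars)

# Pull the intercept and slope out of the model by name.
intercept <- coef(model)[["(Intercept)"]]
slope <- coef(model)[["speed"]]

# Predict Y by hand for a chosen X value (hypothetical value).
new_speed <- 15
by_hand <- slope * new_speed + intercept

# predict() does the same calculation; newdata must use the same
# column name as the X variable in the model formula.
by_predict <- predict(model, newdata = data.frame(speed = new_speed))

print(by_hand)
print(by_predict)
```

Using predict() is convenient when there are many new X values, since newdata can hold a whole column of them.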
R^{2} is a measure of how tightly the points fit around the best fit line.
R^{2} tells us the fraction of the variance explained by the model.
The predictive ability of a line depends on the R^{2} value.
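To make "fraction of the variance explained" concrete, the sketch below computes R^{2} by hand as one minus the ratio of the residual sum of squares to the total sum of squares, and compares it with the value stored in the model summary. It uses the built-in cars data set as a stand-in for the lesson's data.

```r
model <- lm(dist ~ speed, data = cars)

# Variation left over after fitting the line.
ss_residual <- sum(residuals(model)^2)

# Total variation of Y around its mean.
ss_total <- sum((cars$dist - mean(cars$dist))^2)

# Fraction of the total variation that the line accounts for.
r_squared <- 1 - ss_residual / ss_total

print(r_squared)
print(summary(model)$r.squared)  # same value, taken from the summary
```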
Linear regression tests for a significant effect of X on Y by determining whether the best fit line has a slope that differs significantly from zero.
P is the probability of obtaining a slope at least as far from zero as the one observed if the true slope were zero, that is, if random variability alone were responsible.
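The P value for the slope can be read from the coefficient table of the model summary. The sketch below, again using the built-in cars data set as an assumed example, extracts it and reproduces it from the t statistic and the residual degrees of freedom.

```r
model <- lm(dist ~ speed, data = cars)

# Coefficient table: estimate, standard error, t value and P value
# for the intercept and for the slope.
coef_table <- coef(summary(model))
print(coef_table)

# The row is named after the X variable.
slope_t <- coef_table["speed", "t value"]
slope_p <- coef_table["speed", "Pr(>|t|)"]

# Two-sided P value recomputed from the t distribution.
p_by_hand <- 2 * pt(abs(slope_t), df = df.residual(model), lower.tail = FALSE)

print(slope_p)
print(p_by_hand)
```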
Example data from Whitlock and Schluter (2nd ed.) chapter 17. https://whitlockschluter.zoology.ubc.ca/r-code/rcode17
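The assumptions of a linear regression are:
1. The relationship between X and Y is linear.
2. The residuals are normally distributed.
3. The variance of the residuals is the same for all values of X.
4. The data are a random sample.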
To retrieve the residuals from a model, use
residuals(model)
The residuals can be plotted against X. We can also check their distribution using a histogram or normal quantile (Q-Q) plot.
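A short sketch of these three checks, using the built-in cars data set as an assumed example. The axis labels are arbitrary choices.

```r
model <- lm(dist ~ speed, data = cars)
res <- residuals(model)

# Residuals against X: look for an even band around zero with no
# funnel shape or curve.
plot(res ~ cars$speed, xlab = "speed", ylab = "residual")
abline(h = 0, lty = 2)  # dashed reference line at zero

# Histogram of the residuals: should look roughly bell-shaped.
hist(res)

# Normal quantile (Q-Q) plot: points should fall close to the line.
qqnorm(res)
qqline(res)
```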
Example of non-homogeneous residuals:
For an in-depth analysis of residuals, pass the model into the plot() function, then click on the console and press Enter/Return after each plot is generated. The first plot shows the residuals against the fitted values. The second plot is a normal quantile plot of the residuals.
A Shapiro-Wilk test can be used to test whether the residuals are normally distributed:
shapiro.test(residuals(model))
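A runnable sketch of the normality test, assuming the built-in cars data set in place of the lesson's data. The returned object stores the test statistic and the P value, so they can be used in later code.

```r
model <- lm(dist ~ speed, data = cars)

# Test the residuals (not the raw Y values) for normality.
sw <- shapiro.test(residuals(model))
print(sw)

# A small P value suggests the residuals depart from a normal
# distribution; a large one gives no evidence against normality.
print(sw$statistic)
print(sw$p.value)
```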
For data with counts for Y values, the square root transformation may be appropriate.
iris$transformed_grains <- sqrt(iris$grainsDeposited)
This example also illustrates a simple way to add a column to a data frame.
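The same pattern can be tried on simulated count data. In the sketch below the data frame name flowers, its column names, and the Poisson counts are all made up for illustration; the point is only to show adding a transformed column and fitting the model to it.

```r
# Simulated counts (hypothetical): the count increases with X.
set.seed(42)
flowers <- data.frame(tube_length = runif(40, min = 5, max = 15))
flowers$grains <- rpois(40, lambda = 2 * flowers$tube_length)

# Add a new column holding the square root of the counts.
flowers$sqrt_grains <- sqrt(flowers$grains)

# Fit the regression to the transformed Y values.
model_sqrt <- lm(sqrt_grains ~ tube_length, data = flowers)
print(coef(model_sqrt))

# Predictions from this model are on the square root scale;
# square them to get back to counts.
```

After transforming, the residuals of the new model should be checked again in the same way as before.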
The Siegel nonparametric linear regression is done using the mblm package.
library(mblm)
model <- mblm(Y ~ X, data = data_frame)
summary(model)
R is the correlation coefficient. A positive value of R indicates a positive correlation (Y goes up when X goes up) and a negative value of R indicates a negative correlation (Y goes up when X goes down).
The example comes from the Washington Post https://www.washingtonpost.com/business/2020/10/23/pandemic-data-chart-masks/.
To run a correlation test:
cor.test(variable_1, variable_2)
where variable_1 and variable_2 are vectors or columns of a data frame.
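A minimal sketch of a correlation test, assuming the built-in trees data set (girth and height of black cherry trees) as an example where neither variable is treated as the cause of the other.

```r
# Pass the two columns in either order; correlation is symmetric.
result <- cor.test(trees$Girth, trees$Height)
print(result)

# The correlation coefficient R and the P value are stored in the result.
r <- unname(result$estimate)
p <- result$p.value
print(r)
print(p)

# cor() gives the coefficient alone, without the test.
print(cor(trees$Girth, trees$Height))
```

The default method is the Pearson correlation, which is the one that relies on the assumptions of correlation.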
Assumptions of correlation:
1. Random sample
2. Bivariate normal distribution
To test for bivariate normality use the MVN package:
library(MVN)
result <- mvn(data = data_frame, mvnTest = "royston", univariatePlot = "qqplot")
The last argument can have a value of qqplot or histogram depending on which kind of plot you want to see. The returned value from the function can be printed to see whether the data are overall multivariate normal and also whether each of the variables is separately normal.
result
The Kendall rank correlation test is probably the best non-parametric alternative to correlation. It is run by modifying the regular correlation test by adding a method argument with a value of kendall.
cor.test(variable_1, variable_2, method="kendall")
Another commonly used test (Spearman rank correlation) can be run if the method argument has a value of spearman.
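Both rank-based tests can be run side by side. The sketch below assumes the built-in trees data set as an example; exact = FALSE is set here because the data contain tied values, in which case an exact P value cannot be computed and R would otherwise issue a warning.

```r
# Kendall rank correlation: the estimate is called tau.
kendall <- cor.test(trees$Girth, trees$Height,
                    method = "kendall", exact = FALSE)

# Spearman rank correlation: the estimate is called rho.
spearman <- cor.test(trees$Girth, trees$Height,
                     method = "spearman", exact = FALSE)

print(kendall$estimate)
print(kendall$p.value)
print(spearman$estimate)
print(spearman$p.value)
```

The two coefficients are on different scales, so their values should not be compared directly with each other or with the Pearson R.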
Use the World Bank Data on Women and Development to create a linear model that would allow you to predict the percentage of women who have bank accounts from the percentage of women who are employed. Don’t forget to remove the missing value rows before doing the regression. Express the model as the equation of a line. Assess the usefulness of this prediction by commenting on the R^{2} value.
Whitlock and Schluter provide data from an experiment where the stability of biomass in a prairie was measured under circumstances where there were different numbers of species present. The data are available here: http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter17/chap17e3PlantDiversityAndStability.csv. Test whether there is a significant effect of number of species on biomass stability using a linear regression. Test the assumptions of the regression and transform the data or use a non-parametric test as necessary.
In question 3 of this practice assignment, we compared the fraction of students in schools that were economically disadvantaged with the fraction that had limited English proficiency using a regression. Since we do not know cause and effect, it would be better to compare them using correlation. Prepare the data by excluding missing values and calculating the fractions for each category (use the totals of male and female as the total number of students). Create a scatterplot, then carry out the correlation analysis.
Check the data for multivariate normality. If the data are not normal, try a transformation to improve the multivariate normality. If that doesn’t work, use the Kendall rank correlation test.
Next lesson: analysis of variance (ANOVA)
Revised 2020-11-05
License: CC BY 4.0.
Credit: "Vanderbilt Libraries Digital Lab - www.library.vanderbilt.edu"