Digital Education Resources - Vanderbilt Libraries Digital Lab
Previous lesson: basic statistics and plots
In this lesson, we will examine the features of tidy data and consider how it differs from the way people often record data in spreadsheets. We will learn how to “tidy” a data frame and how to extract a subset of data from a tibble data frame.
Learning objectives At the end of this lesson, the learner will be able to:
pivot_longer()
function to transform a “wide” tibble to Tidy Data.pivot_wider()
function to transform Tidy Data into the “wide” form.filter()
function to subset rows from a tibble.select()
function to subset columns from a tibble.mutate()
function to create a new column in a tibble.transmute()
function to create a new tibble from columns of another tibble.Total video time: 31 m 11 s
Tidy Data chapter from the R for Data Science book.
Data transformation chapter from the R for Data Science book (covers the dplyr()
function)
Data Carpentries unit Data Analysis and Visualization in R for Ecologists: “Manipulating data” (dplyr) lesson
Cockroach electroretinogram videos
For more examples and practice involving tidying data, see this Software Carpentries lesson: http://swcarpentry.github.io/r-novice-gapminder/14-tidyr/index.html
Experimental factors are independent variables that the experimenter controls. Measurements are dependent variables that are responses to the experimental factors.
Tidy Data has three main features:
To read more about the features of Tidy Data, visit the R for Data Science website.
Use the pivot_longer()
function to transform “wide” data into Tidy Data (“long” data).
The format of the pivot_longer()
function is:
library("tidyr") # The tidyr library must be loaded if not already done
pivot_longer(wide_tibble_name,
cols = c("collapse_column1", "collapse_column2", "collapse_column3, etc.),
names_to = "new_category_column",
values_to = "new_data_values_column")
Notice that you can make a function more readable by putting its arguments on separate lines and indenting.
Note: pivot_longer()
replaces the older function gather()
.
Use the pivot_wider()
function to transform Tidy Data (“long” data) into “wide” data.
The format of the pivot_wider()
function is:
pivot_wider(long_tibble_name,
names_from = "category_column_to_become_column_headers",
values_from = "data_column_to_become_table_cells")
Note: pivot_wider()
replaces the older function spread()
.
Notice that unlike “regular” data frames, tibbles allow spaces in column names. When tibble column names include spaces, in code the column names must be enclosed in backtics (`
).
The general form of the filter()
function is:
filter(tibble_name, boolean_condition)
Code examples:
library("dplyr") # The dplyr library must be loaded if not already done
filter(schools_tibble, `Zip Code` == 37212) # filter rows by a particular value
filter(schools_tibble, !is.na(`Grade 12`)) # filter rows by a more complex boolean condition
high_schools_data <- filter(schools_tibble, `School Level` == "High School") # assigning the result of a filter operation to a new tibble
The select()
function has several general forms:
select(tibble_name, col1, col2, col3, ...) # select by list of columns
select(tibble_name, start_column:end_column) # select by a range of columns
select(tibble_name, boolean_condition) # select by boolean condition
Code examples:
library("dplyr") # The dplyr library must be loaded if not already done
select(schools_tibble, Male, Female)
select(schools_tibble, `School Year`:`Zip Code`)
select(schools_tibble, starts_with("Grade"))
The mutate()
function creates a new column at the end of an existing tibble. The column can be the result of a calcuation. The general form is:
mutate(existing_tibble, new_column_name = calculation)
Example:
library("dplyr") # The dplyr library must be loaded if not already done
mutate(schools_tibble, total_students = Male + Female)
The transmutate()
column creates a new tibble from existing columns or calculations based on existing columns. The general form is:
transmute(existing_tibble, list_of_existing_or_calculated_columns)
Example:
transmute(schools_tibble, `School Name`, total_students = Male + Female, `Economically Disadvantaged`)
Note: In the case of both mutate()
and transmute()
, simply issuing the command displays the output, but doesn’t actually put it anywhere unless you assign it to an object using the <-
assignment operator. That object can be the original data frame if you want to replace the old version with the new one. For example
mod_sch_tibble <- transmute(schools_tibble, `School Name`, total_students = Male + Female, `Economically Disadvantaged`)
would create a new tibble called mod_sch_tibble
that holds the result of the modification.
schools_tibble <- mutate(schools_tibble, total_students = Male + Female)
would replace the original schools_tibble
tibble with the new one having the added column on the end.
The practice assignment is here. You will need to load it into the editor pane of RStudio.
Problem 1a solution
Problem 1b solution
Problem 2a and b solution
Problem 2c solution
Problem 3 solution
Problem 4a solution
Problem 4b solution
Problem 4c and d solution
Next lesson: piping and more data wrangling
Revised 2023-09-11
Questions? Contact us
License: CC BY 4.0.
Credit: "Vanderbilt Libraries Digital Lab - www.library.vanderbilt.edu"