Previous lesson: Vectors and packages
This lesson is focused on another important data structure in R: data frames. It shows how the structure of a data frame is related to the simpler R data structures, vectors and lists. It also introduces a variation on the traditional data frame: tibbles.
Learning objectives
At the end of this lesson, the learner will be able to:
Total video time: 40 m 26 s
In this lesson, I’ll reference some sections of the book, R Cookbook, 2nd Edition, which you can find by searching at the O’Reilly site, or try this direct link to the book. For the direct links to places in the book to work, you must be logged into the O’Reilly website. Sometimes the links don’t work and you need to log directly into the website using your VUNet ID and password at this link.
R Cookbook section 5.6
A list is a one-dimensional data structure, like a vector. However, unlike a vector, the items in a list can be heterogeneous (different types of items). In the diagram above, we see that the values in the list consist of two strings (characters listed in quotes), one number (no quotes), and the vector animal that we created in the previous lesson (with its name given without quotes).
I can create the list in the diagram using this command (assuming that I’ve already created the animal vector):
thing <- list(fruitKind="apple", euler=2.71828, vectorData=animal, curse="!@#$%")
Notice that as I add items to the list, I can also assign names to each item. The items can be referenced by those names. This allows an R list to behave like a dictionary in Python or a JSON object consisting of name:value pairs. (Note: names can also be assigned to items in a vector, although that is often not particularly useful.)
The screenshot above shows what happens when I create the list in RStudio using the console. After the list is created, it shows up in the workspace summary (Environment tab in upper right pane). Because lists can be complicated, the items in the list aren’t shown there. If I want to see the details of the list, I can click on the list in the summary and a new tab will open in the upper left pane showing me what is contained in the list. Alternatively, I can get a more detailed view of the list in the summary if I click on the triangle button to the left of the list name. That drops down the details. Clicking again collapses the detailed listing.
As with vectors, you can refer to a particular item by its position in the list, like this:
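A minimal sketch of positional access, assuming the thing list created above. The animal vector is re-created here (with the values used later in this lesson) only so that the snippet runs on its own:

```r
# Re-create the vector from the previous lesson so the example is self-contained
animal <- c("lizard", "spider", "worm", "bee")
thing <- list(fruitKind="apple", euler=2.71828, vectorData=animal, curse="!@#$%")

# Double square brackets extract the item itself (here, the second item)
secondItem <- thing[[2]]
print(secondItem)
```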
Notice that when referring to an item in a list, you use two square brackets (vs. one square bracket for vectors). You can also reference a list item by its name, for example:
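As a short sketch of access by name, again assuming the thing list from above (re-created here so the snippet is self-contained). The dollar-sign form is the usual one; the other two lines are included only for comparison:

```r
animal <- c("lizard", "spider", "worm", "bee")
thing <- list(fruitKind="apple", euler=2.71828, vectorData=animal, curse="!@#$%")

# Dollar sign followed by the item name
fruit <- thing$fruitKind

# Double brackets also accept the name as a string
sameFruit <- thing[["fruitKind"]]

# Single brackets return a shorter list, not the item itself
subList <- thing[1]

print(fruit)
```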
R Cookbook section 5.18
Data frames are two-dimensional data objects and are one of the most widely used data types in R. One can think of a data frame as a table with rows and columns, with the top row containing column headers that are names describing what’s in the columns.
It is helpful to think of a data frame as a sort of combination of lists and vectors. The values in a particular column are like a vector, with the column header for that column containing the vector’s name. The set of columns is like a list whose items are vectors.
We can actually create a data frame by first constructing a vector for each column:
group <- c("reptile", "arachnid", "annelid", "insect") # vector of strings
animal <- c("lizard", "spider", "worm", "bee")
numberLegs <- c(4,8,0,6) # vector of numbers
then loading the vectors into the data frame:
organismInfo <- data.frame(group, animal, numberLegs)
By default, R will use the name of each vector as the name for the column in the data frame.
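The following sketch repeats the steps above and then checks the "list of vectors" picture directly. The helper variable names (columnNames, isAList, legsColumn) are made up for illustration:

```r
group <- c("reptile", "arachnid", "annelid", "insect")
animal <- c("lizard", "spider", "worm", "bee")
numberLegs <- c(4, 8, 0, 6)
organismInfo <- data.frame(group, animal, numberLegs)

# Column names are taken from the names of the vectors
columnNames <- names(organismInfo)

# A data frame is stored as a list whose items are the columns
isAList <- is.list(organismInfo)

# List-style access by name returns the column as a plain vector
legsColumn <- organismInfo[["numberLegs"]]

print(organismInfo)
```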
We can refer to a particular cell in the table by listing its row followed by its column in brackets, like this:
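A minimal sketch of cell access, assuming the organismInfo data frame built above (rebuilt here so the snippet is self-contained). The row number comes first, then the column number:

```r
group <- c("reptile", "arachnid", "annelid", "insect")
animal <- c("lizard", "spider", "worm", "bee")
numberLegs <- c(4, 8, 0, 6)
organismInfo <- data.frame(group, animal, numberLegs)

# Row 2 (the spider), column 3 (numberLegs)
spiderLegs <- organismInfo[2, 3]
print(spiderLegs)
```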
Because the columns of a data frame behave somewhat like list items, the notation for referring to list items by name (dollar sign followed by name) can also be used to refer to columns in a data frame:
and items in that column can be referred to by their position from the top of the column:
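Both forms can be sketched with the organismInfo data frame from above (rebuilt here so the snippet runs on its own):

```r
group <- c("reptile", "arachnid", "annelid", "insect")
animal <- c("lizard", "spider", "worm", "bee")
numberLegs <- c(4, 8, 0, 6)
organismInfo <- data.frame(group, animal, numberLegs)

# Dollar sign followed by the column name returns the whole column
animalColumn <- organismInfo$animal

# A single bracket then picks an item by its position in that column
thirdAnimal <- organismInfo$animal[3]

print(animalColumn)
print(thirdAnimal)
```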
Referring to ranges within data frames
To subset a range of rows or columns from a data frame, you can insert the range in the appropriate position using the square bracket notation. For example, to get only rows 2 through 5, use:
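A minimal sketch of a row range. Since organismInfo has only four rows, this uses mtcars, a data set that ships with R (32 rows, 11 columns), as a stand-in; it is not part of this lesson's data. Leaving the position after the comma empty means "all columns":

```r
# mtcars is built into R and is used here only as a conveniently large example
rowSubset <- mtcars[2:5, ]   # rows 2 through 5, every column
print(rowSubset)
```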
To get only columns 4 through 8, use:
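A sketch of a column range, again using R's built-in mtcars data set as a stand-in rather than data from this lesson. Leaving the position before the comma empty means "all rows":

```r
# Every row, columns 4 through 8 of the built-in mtcars data set
colSubset <- mtcars[, 4:8]
print(names(colSubset))
```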
You can combine ranges in both rows and columns to get a rectangular subset of the data frame:
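A sketch of a combined range, with R's built-in mtcars data set standing in for a larger table (it is an assumption for illustration, not data from this lesson):

```r
# Rows 2 through 5 and columns 4 through 8 of the built-in mtcars data set
block <- mtcars[2:5, 4:8]
print(block)
```

The result is itself a data frame, so the same bracket and dollar-sign notation can be applied to it.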
R Cookbook section 5.4
R for Data Science Chapter 10 (no login required)
When data are read into a data frame, what happens to them depends on the type of data. Numeric data remain as numeric data, but in versions of R before 4.0, string data (e.g. non-numeric data enclosed in quotes) are converted into a special data type called factor when they are loaded into the data frame. This format is useful when the data are intended to be used in statistical tests, and given that R was originally statistics-heavy, this automatic conversion makes some sense.
IMPORTANT NOTE: Starting with R 4.0, character data are no longer automatically loaded as factors by the read.csv() function! When you start RStudio, look in the console window to see what version you are running. If the version starts with 3 or less, the conversion behavior will happen as described above. If the version starts with 4 or above, you will need to convert columns to factors manually using the as.factor() function. For example, in a data frame named df with a character column named words, you can use as.factor(df$words) in functions that require the input to be factor rather than character.
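A minimal sketch of the manual conversion. The contents of df are made up for illustration, and stringsAsFactors = FALSE is set explicitly so the snippet behaves the same way on older versions of R:

```r
# Hypothetical data frame; stringsAsFactors = FALSE matches the R 4.0+ default
df <- data.frame(words = c("red", "blue", "red"), stringsAsFactors = FALSE)
wasCharacter <- is.character(df$words)

# Convert on the fly, leaving the data frame unchanged
wordsFactor <- as.factor(df$words)
print(levels(wordsFactor))   # the distinct values in the column

# Or store the converted column back into the data frame
df$words <- as.factor(df$words)
str(df)
```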
More recently, the use of R has expanded far beyond statistics, so automatically transforming data into a form that is optimal for statistics is no longer necessarily desirable in every case. Another two-dimensional data structure, called a tibble, was developed to broaden the use of data frames. When data are read into a tibble, there is never a conversion of data types (strings remain strings). The rules for column names are also relaxed compared to traditional data frames.
R Cookbook section 4.8
Loading data into a data frame by manually entering the items as part of the script is not efficient for large data sets. Large sets of tabular data are commonly saved as files in comma-separated values (CSV) format. All common spreadsheet applications (such as Microsoft Excel, OpenOffice Calc, and LibreOffice Calc) provide a way to export spreadsheet data in CSV format, so that’s the best way to get a dataset from a spreadsheet into R. If a spreadsheet contains multiple sheets, each one must be saved as a separate CSV file. To save an Excel sheet in CSV format, go to Save As… and select “CSV (Comma delimited) (*.csv)” from the “Save as type:” dropdown.
For more about CSV files, see this video
Important note: R can be much more picky about file names than other applications. Here is some general advice about naming CSV files:
Use underscores (_) or dashes (-) rather than spaces.
There are two convenient ways to load CSV data into a data frame: loading it from a file on your local computer, and loading it through the Internet using a URL.
Note about downloading text files from GitHub
When you first get a Mac or PC, the file explorer hides the file extensions by default. Often this isn’t a problem, but if the wrong application opens when you click on a file, or if a file fails to open when it should, it is difficult to troubleshoot without being able to see what the file extension is. To stop the hiding of file extensions, see these instructions for Windows or these instructions for Mac.
Because of the importance of file extensions in determining what files are for and how they are used, it is a problem if a file has the wrong file extension. This problem commonly occurs when files are downloaded from GitHub. When communicating with a web browser, GitHub always tells the browser that a raw text file is plain text. Often, this means that the browser will assign the downloaded file the extension .txt even if the file should have another extension like .csv. This problem is made worse if your computer is hiding the extensions from you and you don’t realize that the problem has occurred. If you have downloaded a file from GitHub and you are having trouble getting it to work, first check that it has the correct file extension and correct the extension if it has been changed to .txt.
From your hard drive
To load a CSV file from your local hard drive, we’ll make use of an R function that initiates a “file open” dialog: file.choose(). When the file open dialog is executed, a popup window lets you navigate to and select the file that you want to open. Important note: occasionally, the popup window is hidden behind the RStudio window. So if you run the script and it seems to have gotten stuck, minimize the RStudio window and see if the file open dialog window was hiding underneath.
The file.choose() function does not read the file itself; it only returns the path to the file you selected, as a character string. The function read.csv() reads the CSV-formatted data at that path and converts it into the data frame data structure. So the file open function and the CSV conversion function can be put together into a combined command that both opens the file and converts the CSV:
myDataFrame <- read.csv(file.choose())
To practice using this function, download and save the file t-test.csv somewhere on your computer. To do that, right click on the link in the previous sentence, and select Save Link As…. After you have downloaded the file, paste the line in the example above into the Console pane of RStudio and press Enter. The data frame myDataFrame should appear in the summary pane in the upper right, and if you click on its name, you can see the table in a tab in the upper left pane.
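Because file.choose() needs an interactive session, the sketch below swaps in a temporary file created with tempfile() so that it can run unattended. The file contents and column names are made up for illustration and are not the contents of t-test.csv:

```r
# Write a small made-up CSV file to a temporary location
csvPath <- tempfile(fileext = ".csv")
writeLines(c("grouping,height", "a,1.5", "b,2.25", "a,1.75"), csvPath)

# file.choose() would return a path string much like csvPath;
# read.csv() does the actual reading and builds the data frame
sampleDataFrame <- read.csv(csvPath)
str(sampleDataFrame)
```

Passing a path directly like this is also handy in scripts that should run the same way every time without a dialog.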
From a URL
R Cookbook section 4.10
Sometimes a teacher, colleague, or a website might make a file available directly via a URL. So an alternative to getting a CSV file from your computer’s drive is to specify a URL that points to the file location at some place on the Internet. One important consideration is that the URL must deliver the raw data file and not a web page. You can see the distinction between the two by comparing:
In the first case, the URL leads to a web page that displays the content of the CSV file formatted as an HTML table. In the second case, the browser displays the actual characters that comprise the CSV file. The second URL could be used to load the file as part of an R script, but the first URL would display an error.
Here is the command that would read data from the file URL into a data frame:
myOtherDataFrame <- read.csv("https://gist.githubusercontent.com/baskaufs/1a7a995c1b25d6e88b45/raw/4bb17ccc5c1e62c27627833a4f25380f27d30b35/t-test.csv")
You can test this command by copying it and entering it into the Console pane of RStudio. You should see the newly created data frame in the workspace summary pane (upper right) as in the previous example.
If you have a GitHub account, creating a Gist is an easy way to make raw data available publicly through a URL. Create the gist in the editing environment, then after creating a public Gist, click on the Raw button at the upper right of the screen. Copy the URL from the browser’s address box and paste it into the script between the quotes as shown in the example above.
R Cookbook section 4.11
Data from an Excel spreadsheet
The openxlsx package can read an Excel file into a data frame.
To load the package then read in the file:
library(openxlsx)
data_frame <- read.xlsx(xlsxFile = "my_file.xlsx", sheet = 'name_of_sheet')
The sheet argument is optional.
The readxl package can read from Excel files (both .xls and .xlsx) into a tibble. It is also included in the tidyverse.
To load the package then read in the file:
library(readxl)
tibble_from_xl <- read_excel(file.choose())
As with read_csv, the grouping column contains characters, not factors.
You can also specify the sheet to use:
another_tibble_from_xl <- read_excel(file.choose(), sheet = "t-test")
For more details, see http://www.sthda.com/english/wiki/reading-data-from-excel-files-xls-xlsx-into-r.
Commands for examining a data frame
head() shows the first 6 rows
tail() shows the last 6 rows
names() returns the column names
str() describes the structure of the data frame with information about each column
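A short sketch of these commands, applied to the organismInfo data frame from earlier in the lesson (rebuilt here so the snippet runs on its own). The n argument shown for tail() is an optional extra that changes how many rows are returned:

```r
group <- c("reptile", "arachnid", "annelid", "insect")
animal <- c("lizard", "spider", "worm", "bee")
numberLegs <- c(4, 8, 0, 6)
organismInfo <- data.frame(group, animal, numberLegs)

# Only 4 rows exist, so head() shows all of them
firstRows <- head(organismInfo)
print(firstRows)

# n limits the number of rows; here, the last 2
lastRows <- tail(organismInfo, n = 2)
print(lastRows)

print(names(organismInfo))   # column names
str(organismInfo)            # type and first values of each column
```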
Vectorized operations can be done on data frame columns in the same manner as they can be done on vectors (since columns are essentially vectors). For example, to multiply all values in the column col1 of the data frame df by 7, we can use the expression
df$col1 * 7
This will display the result on the screen, where we can see that it is a vector. If we want to assign the result to a named vector object we can use the assignment operator:
answer_vector <- df$col1 * 7
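A runnable sketch of the same idea. The values in df, and the names col2, col1_times_7, and columnSum, are made up for illustration:

```r
# Hypothetical data frame with two numeric columns
df <- data.frame(col1 = c(1, 2, 3), col2 = c(10, 20, 30))

# Multiply every value in col1 by 7 and keep the result as a vector
answer_vector <- df$col1 * 7
print(answer_vector)

# The result can also be stored as a new column of the data frame
df$col1_times_7 <- df$col1 * 7

# Two columns of equal length are combined item by item
columnSum <- df$col1 + df$col2
print(columnSum)
```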
If you perform a two-vector operation on two vectors (or columns) that are not the same length, the items in the smaller vector will be repeated until the operation has been completed on all items in the longer vector. This behavior is known as vector recycling. Here’s an example. If we create vectors
a <- c(1, 2)
b <- c(10, 15, 17, 5, 1)
the sum of the two vectors a + b will be the vector
11, 17, 18, 7, 2
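The example can be checked with the sketch below. The second addition, with a four-item vector, is an extra case added for comparison and is not part of the original example:

```r
a <- c(1, 2)
b <- c(10, 15, 17, 5, 1)

# a is recycled to 1, 2, 1, 2, 1 before adding. R would normally also warn
# that the longer length (5) is not a multiple of the shorter (2);
# suppressWarnings() hides that warning here.
recycledSum <- suppressWarnings(a + b)
print(recycledSum)

# When the longer length is a multiple of the shorter one,
# recycling happens silently
evenSum <- c(1, 2) + c(10, 15, 17, 5)
print(evenSum)
```

Because recycling can happen without any message, it is worth checking vector lengths when a result looks unexpected.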
The practice assignment is here. You will need to load it into the editor pane of RStudio.
It is best to try to complete each problem on your own before resorting to watching the solution videos below.
Problem 1 solution
Problem 2 solution
Problem 3 solution
Problem 4 solution
Problem 5 solution
Problem 6 solution
Next lesson: basic statistics and plots