Digital Education Resources - Vanderbilt Libraries Digital Lab
Citation: Harris, Kathleen Mullan, and Udry, J. Richard. National Longitudinal Study of Adolescent to Adult Health (Add Health), 1994-2008 [Public Use]. Carolina Population Center, University of North Carolina-Chapel Hill [distributor], Inter-university Consortium for Political and Social Research [distributor], 2018-08-06. https://doi.org/10.3886/ICPSR21600.v21
Project website: https://www.cpc.unc.edu/projects/addhealth
The data from this study are published in the Inter-university Consortium for Political and Social Research (ICPSR) data archive.
Go to the login page and click the Create Account button. Follow the instructions to establish your username (email) and password.
Go to the data page for the study.
We will be looking at the DS1 Wave I: In-Home Questionnaire, Public Use Sample and DS22 Wave IV: In-Home Questionnaire, Public Use Sample datasets. To download each one, click the Download button to the right of its heading and select Delimited.
After the download has completed, go to your download directory and unzip the archive. The files we want are in the ICPSR_21600 folder, in the DS0001 and DS0022 subfolders, and are called 21600-0001-Data.tsv and 21600-0022-Data.tsv respectively. If necessary, move the files somewhere you can easily navigate to, ideally a location whose path you know how to write in a script.
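As a rough sketch of writing such a path, the following assumes the unzipped folder is named for the study number (ICPSR_21600) and was moved into a hypothetical `data` directory under your home directory. The `data` location is an illustration only; substitute wherever you actually put the folder.

```r
# Hypothetical location: change "data" to wherever you moved the unzipped folder
base_dir <- file.path("~", "data", "ICPSR_21600")

# Build the paths to the two data files from their folder and file names
ds1_path  <- file.path(base_dir, "DS0001", "21600-0001-Data.tsv")
ds22_path <- file.path(base_dir, "DS0022", "21600-0022-Data.tsv")

# Check that R can find the files before trying to read them
file.exists(c(ds1_path, ds22_path))
```

A path built this way can be passed to a reading function in place of `file.choose()`, which avoids picking the file by hand each time the script runs.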
The form of the data that we’ve downloaded is tab separated values (TSV). TSV and variants with other separators, such as comma separated values (CSV), are very common tabular data storage and transfer formats. (See this information for more about CSV files and how you can look at them.)
The R read.csv function can be used to read delimited text files with delimiters other than a comma if the separator is specified. The tab character used as the delimiter is written as "\t".
nls_ds1 <- read.csv(file.choose(), header = TRUE, sep = "\t")
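To see the separator argument at work without the full download, here is a minimal sketch on a few lines of made-up tab-separated text. The `ID` column and all values are invented for illustration; only `BIO_SEX` is a column name from the real data.

```r
# A tiny tab-separated sample standing in for the real file (made-up values)
tsv_text <- "ID\tBIO_SEX\nA001\t1\nA002\t2\nA003\t1"

# read.csv() with sep = "\t" parses tab-separated text
toy_csv <- read.csv(text = tsv_text, header = TRUE, sep = "\t")

# read.delim() is the base R variant whose default separator is already a tab
toy_delim <- read.delim(text = tsv_text, header = TRUE)

# Compare the two results
identical(toy_csv, toy_delim)
```

Either call should work for the downloaded files; `read.delim()` simply saves typing the separator.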
The read.csv() function loads the data into a regular data frame. The readr package (part of the tidyverse) contains two functions that read the data into tibbles rather than generic data frames. They are read_csv() (with an underscore rather than a dot) for comma delimited files and read_tsv() for tab delimited files.
Here’s a script that reads in the first file and does a bit of manipulation. You can try running it to make sure you’ve successfully loaded the data.
library(readr) # for reading tibbles
# read in tab separated value file 21600-0001-Data.tsv
nls_ds1 <- read_tsv(file.choose())
# display the data in the column labeled "BIO_SEX"
nls_ds1$BIO_SEX
# convert the coded data into a factor with the values male and female
sex <- factor(nls_ds1$BIO_SEX,
              levels = c(1, 2),
              labels = c("male", "female"))
# summarize the data
sex
table(sex)
barplot(table(sex))
The datasets are large enough that your computer might run out of memory if you have too many applications running. So one task we want to accomplish is to pull a subset of the data out of the original datasets.
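One way to pull out such a subset is to keep only the columns you need and then discard the full table. The sketch below uses a toy data frame with made-up values; apart from `BIO_SEX`, the column names are hypothetical placeholders for whichever variables your analysis requires.

```r
# Toy stand-in for the full dataset (made-up values; only BIO_SEX is a column
# named in this lesson, the other column names are hypothetical)
toy <- data.frame(
  ID      = c("A001", "A002", "A003"),
  BIO_SEX = c(1, 2, 1),
  EXTRA_1 = c(10, 20, 30),
  EXTRA_2 = c(5, 6, 7)
)

# Keep only the columns needed for the analysis
keep_cols <- c("ID", "BIO_SEX")
toy_small <- toy[, keep_cols]

# Remove the large original object and ask R to release the memory it used
rm(toy)
invisible(gc())

toy_small
```

With the real data, the same pattern applies to `nls_ds1` after it has been read in.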
A second issue is that the data include various forms of missing data (“don’t know”, “won’t say”, “not applicable”, etc.), each recorded as a numeric code. We may not want such data included in the analysis and therefore need to replace those codes with missing data (NA) values.
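A minimal sketch of that replacement on a toy vector follows. The codes 6, 8, and 9 are assumptions chosen for illustration; the actual missing-data codes differ from question to question, so check the codebook for each variable before recoding.

```r
# Toy responses; assume (hypothetically) that codes 6, 8, and 9 stand for
# "refused", "don't know", and "not applicable" for this question
answers <- c(1, 2, 6, 3, 8, 1, 9)
missing_codes <- c(6, 8, 9)

# Replace every missing-data code with NA
answers[answers %in% missing_codes] <- NA

answers

# NA values can then be skipped on request, for example when taking a mean
mean(answers, na.rm = TRUE)
```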
The third item is that we’d like to have two new variables to use in future analyses: a calculation of the body mass index (BMI) and a new index called “maternal closeness” that has a value of 1 if the five maternal closeness indicators all have values of 1, and a value of 0 if any of them has a value other than 1. Missing values should continue to be missing values.
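Both derived variables can be sketched on toy data as follows. This is not the completed script: the column names are hypothetical, the values are made up, and the units (height in feet plus inches, weight in pounds) are an assumption to verify against the codebook. The sketch also takes one reading of the missing-value rule, namely that a row with any missing closeness item gets NA.

```r
# Toy data with hypothetical column names and made-up values.
# Assumed units: height in feet plus inches, weight in pounds.
toy <- data.frame(
  height_ft = c(5, 5, 6, NA),
  height_in = c(4, 10, 0, 2),
  weight_lb = c(120, 160, 200, 150),
  mom1 = c(1, 1, 2, 1),
  mom2 = c(1, 1, 1, 1),
  mom3 = c(1, 1, 1, NA),
  mom4 = c(1, 3, 1, 1),
  mom5 = c(1, 1, 1, 1)
)

# BMI from imperial units: 703 * pounds / (total inches)^2
total_in <- toy$height_ft * 12 + toy$height_in
toy$bmi <- 703 * toy$weight_lb / total_in^2

# Maternal closeness: 1 if all five items equal 1, otherwise 0.
# A row with any missing item gets NA, because rowSums() returns NA
# for that row when na.rm is left at its default of FALSE.
mom_items <- toy[, c("mom1", "mom2", "mom3", "mom4", "mom5")]
toy$maternal_closeness <- as.integer(rowSums(mom_items == 1) == 5)

toy[, c("bmi", "maternal_closeness")]
```

If the measurements were in metric units instead, the BMI line would become kilograms divided by metres squared, with no 703 factor.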
The template without code is here if you want to try to code it yourself. The completed script is here.
Revised 2020-02-18
License: CC BY 4.0.
Credit: "Vanderbilt Libraries Digital Lab - www.library.vanderbilt.edu"