Citation: Harris, Kathleen Mullan, and Udry, J. Richard. National Longitudinal Study of Adolescent to Adult Health (Add Health), 1994-2008 [Public Use]. Carolina Population Center, University of North Carolina-Chapel Hill [distributor], Inter-university Consortium for Political and Social Research [distributor], 2018-08-06. https://doi.org/10.3886/ICPSR21600.v21
Project website: https://www.cpc.unc.edu/projects/addhealth
The data from this study are published in the Inter-university Consortium for Political and Social Research (ICPSR) data archive.
Go to the login page and click on the Create Account button. Follow the instructions to establish your username (email) and password.
Go to the study data page for the study.
We will be looking at the DS1 Wave I: In-Home Questionnaire, Public Use Sample and DS22 Wave IV: In-Home Questionnaire, Public Use Sample datasets. To download each one, click the Download button to the right of its heading and select
After the download is completed, go to your download directory and unzip the archive. The files we want are in the ICPSR_21600 folder, then in the DS0001 and DS0022 folders, called 21600-0001-Data.tsv and 21600-0022-Data.tsv respectively. If necessary, move the files to a location you can easily navigate to. Ideally, it will be a location whose path you know how to write in the script.
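One way to write such a path in a script is with base R's file.path(), which joins path components using the correct separator for your operating system. This is only a minimal sketch: the data_dir location is an assumption made for illustration, so substitute the folder where you actually placed the files.

```r
# Hypothetical location: a "data" folder inside your home directory.
data_dir <- file.path(path.expand("~"), "data")

# Build full paths to the two data files.
ds1_path  <- file.path(data_dir, "21600-0001-Data.tsv")
ds22_path <- file.path(data_dir, "21600-0022-Data.tsv")

# Check that the files are where the script expects before reading them.
file.exists(c(ds1_path, ds22_path))
```

If file.exists() returns FALSE for either file, the path in the script does not match where the file actually lives.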
The data that we’ve downloaded are in tab separated values (TSV) format. TSV and variants with other separators, such as comma separated values (CSV), are very common formats for storing and transferring tabular data.
The read.csv function can be used to read delimited text files with delimiters other than a comma if the separator is specified. The tab character that’s used as the delimiter is written as "\t".
nls_ds1 <- read.csv(file.choose(), header = TRUE, sep = "\t")
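To see the effect of the sep argument without the full download, here is a small sketch that writes a made-up two-row TSV file to a temporary location and reads it back. The ID column and all values are invented for illustration. It also shows read.delim(), a base R relative of read.csv() whose default separator is already a tab.

```r
# Write a tiny tab-separated file to a temporary location (made-up values).
tmp <- tempfile(fileext = ".tsv")
writeLines(c("ID\tBIO_SEX", "1001\t1", "1002\t2"), tmp)

# sep = "\t" tells read.csv() to split fields on tabs instead of commas.
toy <- read.csv(tmp, header = TRUE, sep = "\t")

# read.delim() uses a tab as its default separator.
toy2 <- read.delim(tmp)

# Inspect the result and confirm both calls produced the same data frame.
str(toy)
identical(toy, toy2)
```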
The read.csv() function loads the data into a regular data frame. The readr package (part of the tidyverse) contains two functions that read the data into tibbles rather than generic data frames. They are read_csv() (with an underscore rather than a dot) for comma delimited files and read_tsv() for tab delimited files.
Here’s a script that reads in one of the files and does a bit of manipulation. You can try running it to make sure you’ve successfully loaded the data.
library(readr) # for reading tibbles

# read in tab separated value file 21600-0001-Data.tsv
nls_ds1 <- read_tsv(file.choose())

# display the data in the column labeled "BIO_SEX"
nls_ds1$BIO_SEX

# convert the coded data into factor values of male and female
sex <- factor(nls_ds1$BIO_SEX, levels = c(1, 2), labels = c("male", "female"))

# summarize the data
sex
table(sex)
barplot(table(sex))
The datasets are large and might cause your computer to run out of memory if you have too many applications running. So one task we want to accomplish is to pull a subset of data out of the original datasets.
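As a rough sketch of what pulling out a subset could look like in base R, the example below uses a small made-up data frame in place of the real dataset. Apart from BIO_SEX, every column name here is hypothetical; the real variable names come from the study codebook.

```r
# Toy stand-in for the full dataset; all columns except BIO_SEX are hypothetical.
toy <- data.frame(
  ID        = c(1001, 1002, 1003),
  BIO_SEX   = c(1, 2, 2),
  weight_lb = c(150, 998, 120),
  unused_1  = c(7, 8, 9)
)

# Keep only the columns needed for the analysis.
keep <- c("ID", "BIO_SEX", "weight_lb")
toy_small <- toy[, keep]

# Remove the large original object and ask R to release the memory.
rm(toy)
invisible(gc())

names(toy_small)
```

Selecting columns by name rather than by position keeps the script readable and guards against the column order changing between dataset versions.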
A second issue is that the data include various forms of missing data (“don’t know”, “won’t say”, “not applicable”, etc.). We may not want such data included in the analysis and therefore need to replace those values with missing data (NA) values.
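One possible way to do the replacement is to list the codes that stand for missing data and overwrite any matching value with NA. In this sketch the variable and the codes 996, 998 and 999 are assumptions chosen for illustration; the actual codes differ by variable and should be taken from the codebook.

```r
# Hypothetical weight column in which 996, 998 and 999 are assumed to be
# missing-data codes (check the codebook for the real codes).
weight_lb <- c(150, 998, 120, 996, 999, 180)

missing_codes <- c(996, 998, 999)

# Replace any value that matches a missing-data code with NA.
weight_lb[weight_lb %in% missing_codes] <- NA

weight_lb

# Summaries then need na.rm = TRUE to skip the NA values.
mean(weight_lb, na.rm = TRUE)
```

Without this step, a code such as 998 would be treated as a real weight and distort any summary statistics.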
The third item is that we’d like to have two new variables to use in future analyses: a calculation of the body mass index (BMI) and a new index called “maternal closeness” that has a value of 1 if the five maternal closeness indicators all have values of 1, and a value of 0 if any of them is a number other than 1. Missing values should continue to be missing values.
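The two derived variables could be sketched as follows on made-up data. All column names (height_in, weight_lb, mc1 to mc5) are hypothetical stand-ins for the real variables, and the sketch assumes height in inches and weight in pounds, for which BMI is 703 × weight / height². It also assumes that a respondent with any missing closeness item gets a missing index, which is one reasonable reading of the rule above, not the only one.

```r
# Toy data; every column name here is a hypothetical stand-in.
toy <- data.frame(
  height_in = c(65, 70, NA),
  weight_lb = c(150, 180, 120),
  mc1 = c(1, 1, 1),
  mc2 = c(1, 2, 1),
  mc3 = c(1, 1, NA),
  mc4 = c(1, 1, 1),
  mc5 = c(1, 1, 1)
)

# BMI from imperial units: 703 * weight (lb) / height (in)^2.
# A missing height or weight yields a missing BMI.
toy$bmi <- 703 * toy$weight_lb / toy$height_in^2

# Maternal closeness: 1 if all five items equal 1, otherwise 0.
# rowSums() returns NA when any item in the row is NA, so the index
# stays missing for respondents with missing items.
items <- toy[, c("mc1", "mc2", "mc3", "mc4", "mc5")]
toy$maternal_closeness <- as.integer(rowSums(items == 1) == 5)

toy[, c("bmi", "maternal_closeness")]
```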