Selecting the best repository to house a dataset may be straightforward if there is already a well-established, subject-based repository in your discipline, or it may take some research to determine the best place for your data. Look for a research data repository that supports open licenses, which make your datasets more accessible (CC0 is the least restrictive license). The repository should provide clear, persistent citations for datasets. Repositories offer a range of services to depositors (from data validation to peer review) and to users (from in-browser data exploration to visualization and analysis tools), which may also influence your choice. The Digital Scholarship and Communications Office is happy to assist you as you select an appropriate data repository.
There are several useful tools for finding data repositories that serve your field.
The National Institutes of Health (NIH) maintains a list of generalist repositories that may be used if no domain-specific repository is suitable for a particular dataset. Those repositories are described below the table.
Comparison of generalist data repositories

| Repository | Cost | Size limits | Retention |
|---|---|---|---|
| Harvard Dataverse | free | up to 1 TB per researcher, 2.5 GB per file | “permanent” (by Harvard) |
| Dryad | $120 DPC up to 50 GB, $50 per additional 10 GB | 300 GB per data publication or more | indefinite, “reasonable effort to move” if closed |
| Figshare | free up to 20 GB, [sliding DPC for higher limits](https://knowledge.figshare.com/plus#pricing) | up to 5 TB per file | legal minimum of 10 years, aims for indefinite |
| Open Science Framework (OSF) | free | up to 50 GB for open data, linked external storage for more | preservation fund for 50+ years after closing at current costs |
| Synapse | free | “modest” internal storage (10s of GB) | |
| Zenodo | free | no upper limit, 50 GB per record | lifetime of CERN (at least 20 years) |
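As a rough way to apply the size limits quoted in this guide, the following Python sketch checks which free tiers could hold a dataset of a given total size and largest file size. The names `FreeTier`, `FREE_TIERS`, and `fits_free_tier` are hypothetical, the figures are copied from this page and may be out of date, and treating 1 TB as 1000 GB is an assumption. Synapse is left out because its free storage is only described as "modest", and Dryad is left out because it charges a DPC.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FreeTier:
    """Free-of-charge limits for one repository (None = no limit stated here)."""
    name: str
    max_total_gb: Optional[float]
    max_file_gb: Optional[float]


# Figures taken from this guide; 1 TB is treated as 1000 GB (an assumption).
FREE_TIERS = [
    FreeTier("Harvard Dataverse", 1000, 2.5),
    FreeTier("Figshare", 20, 20),
    FreeTier("OSF (open project)", 50, None),
    FreeTier("Zenodo (single record)", 50, None),
]


def fits_free_tier(total_gb: float, largest_file_gb: float) -> List[str]:
    """Return the names of the tiers whose stated limits the dataset satisfies."""
    matches = []
    for tier in FREE_TIERS:
        total_ok = tier.max_total_gb is None or total_gb <= tier.max_total_gb
        file_ok = tier.max_file_gb is None or largest_file_gb <= tier.max_file_gb
        if total_ok and file_ok:
            matches.append(tier.name)
    return matches


if __name__ == "__main__":
    # A 30 GB dataset whose largest file is 5 GB.
    print(fits_free_tier(total_gb=30, largest_file_gb=5))
```

A check like this only narrows the field. Licensing, curation, and discipline fit still need to be weighed separately, and limits should be confirmed with the repository before depositing.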
Harvard Dataverse is a repository for research data and code. “The Harvard Dataverse is open to all scientific data from all disciplines worldwide. It includes the world’s largest collection of social science research data. It is hosting data for projects, archives, researchers, journals, organizations, and institutions.”
Datasets can contain any number of files, with DOIs assigned at the dataset level.
Depositing is free, with limits of 2.5 GB per file and 1 TB per researcher (which may be increased on request). Paid curation services are available upon request.
Dryad was originally founded by a group of journals and scientific societies as a place to archive data from their publications. It is flexible about data format and assigns citable Digital Object Identifiers (DOIs) to submissions. It is also committed to long-term preservation and access. Because of its integration with partner journal workflows, it may be a good choice when a journal requires data to be archived prior to publication.
Individual data files are assembled into Frictionless Data packages upon upload. There is a curation step during the upload process. DOIs resolve to the most recent version of a dataset, but all previous versions are accessible.
Cost is a basic Data Publishing Charge (DPC) of $120 per submission, which covers up to 50 GB. The DPC increases by $50 for each additional 10 GB. There are no DPCs for affiliates of institutional members (Vanderbilt is not a member). Dryad provides infrastructure services and expert consultants as part of the DPC.
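The fee schedule above can be estimated with a short calculation. This is a minimal Python sketch, not Dryad's own pricing tool. The function name `dryad_dpc_usd` is hypothetical, and it assumes that a partial 10 GB increment is billed as a full increment, which this guide does not state.

```python
import math


def dryad_dpc_usd(size_gb: float) -> int:
    """Estimate the Data Publishing Charge in US dollars for one submission.

    Uses the figures quoted in this guide: $120 covers up to 50 GB,
    plus $50 for each additional 10 GB.
    Assumption: any partial 10 GB increment is charged as a full one.
    """
    if size_gb <= 0:
        raise ValueError("size_gb must be positive")

    base_fee, included_gb = 120, 50
    increment_fee, increment_gb = 50, 10

    if size_gb <= included_gb:
        return base_fee

    # Round the overage up to whole 10 GB increments.
    extra_increments = math.ceil((size_gb - included_gb) / increment_gb)
    return base_fee + increment_fee * extra_increments


if __name__ == "__main__":
    for size in (10, 50, 75, 300):
        print(f"{size} GB -> ${dryad_dpc_usd(size)}")
```

Treat the result as a planning estimate only and confirm the current charges with Dryad before budgeting.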
Figshare is “a repository where users can make all of their research outputs available in a citable, shareable and discoverable manner.” The research outputs you can upload to Figshare include datasets, figures, papers, posters, and video. When you publish research materials on Figshare, they receive a Digital Object Identifier (DOI), providing a persistent citation. Figshare also supports version control, so that you can update or add to a dataset without confusing other researchers who may wish to cite it.
Figshare is a commercial product with a “freemium” pricing model. Individuals can upload up to 20 GB for free (the limit on individual file size is also 20 GB). A “Figshare+” submission has a one-time Data Publishing Charge (DPC) with variable pricing ranging from $395 for 100 GB to $11,860 for 5 TB, with higher limits available. Figshare+ allows 5000 files per dataset.
DOIs can be issued for a dataset, with up to 10 DOIs for individual files within the dataset. The DOIs can be versioned when changes are made to the metadata or files.
In some cases, publishers may be working with Figshare to streamline the process of publishing data along with manuscripts.
The Inter-University Consortium for Political and Social Research (ICPSR) archives data from any source. It has the world’s largest collection of social science data.
Data can be deposited for free, although there is a fee for curated deposits. Using the openICPSR system, researchers can self-deposit raw data without going through the full ICPSR data review process.
For more information about ICPSR, visit this research guide.
OSF is more than a data archive. It is an entire ecosystem for managing data and related artifacts throughout the data life cycle. In particular, it facilitates registrations (time-stamped versions of projects) and pre-registrations (documentation of protocols, variables to be investigated, and analysis prior to data collection). (More information on registrations.)
OSF is free, but its included storage is relatively small. Private projects are limited to 5 GB and open projects are allowed 50 GB. Add-on storage from outside of OSF (Amazon S3, Bitbucket, Box, Dataverse, Dropbox, Figshare, GitHub, GitLab, Google Drive, OneDrive, and ownCloud) can be linked to the project, with the user responsible for the cost of maintaining those resources. (More information on storage limits.)
Digital Object Identifiers (DOIs) are assigned at the project level and resolve to the project landing page. DOIs can also be assigned to particular registrations or pre-registrations.
Not exactly a data repository, REDCap is a system for building and managing online surveys and databases in a manner that is compliant with HIPAA, GDPR, and other data security requirements.
Synapse is an innovative ecosystem supported by National Institutes of Health (NIH) institutes, the National Cancer Institute, the Sloan Foundation, and others. It supports not only data archiving, but also wikis, dataset linking, “views”, user teams, and a Docker hub. It also supports project management through Python, R, and command line access, as well as direct programmatic access to the underlying Amazon Web Services (AWS) S3 buckets. This makes it a useful platform for building an automated pipeline for data uploading, organization, and analysis.
Users must take a quiz and be certified to upload data. Varying degrees of access to the data can also be configured. Digital Object Identifiers (DOIs) are minted for stable versions of datasets. File versioning is automatic and dataset versioning is controlled manually. Versioned DOIs resolve to particular versions and non-versioned DOIs resolve to the most recent version.
A major downside of the platform is that only “modest internal Synapse cloud storage” (tens of GB) is free to users. Users with more data than that must provide their own cloud storage.
Zenodo is a large data-hosting initiative associated with the European Organization for Nuclear Research (CERN). It is supported by a number of European government agencies and institutional members. In addition to uploading datasets directly, if you are using GitHub to manage a project, you can easily archive dataset releases to Zenodo by setting up a webhook.
Uploaded data in Zenodo is organized in “records”. A record can include a single file, a directory tree stored in a single zip compressed file, or multiple files uploaded to the same record.
Zenodo is free and has no upper data limits. There is a 50 GB limit per record.
Zenodo assigns a digital object identifier (DOI) to the record in general as well as each specific version of the record.
IEEE Dataport is supported by the Institute of Electrical and Electronics Engineers (IEEE). One critical distinction is that data shared through the cost-free tier is NOT openly available: it is viewable only by paid subscribers. There is a one-time Data Processing Charge of $1950 to publish an open access dataset, which allows 2 TB of storage for individual users and 10 TB of storage for institutional subscribers.
Mendeley Data is a repository associated with the for-profit publisher Elsevier’s citation management tool, Mendeley. Mendeley Data is actually hosted on Digital Commons Data, an Amazon Web Services (AWS) S3-based system that is also run by Elsevier.
Vivli is supported by a non-profit organization and is specifically geared towards archiving anonymized, patient-level data from clinical trials. It isn’t freely accessible, since potential data users must request access to particular research datasets. It has built-in features to index clinical trial data and report on data usage.
The basic service is free and allows files up to 1 TB. Larger sizes up to 100 TB can be accommodated by arrangement. Varying degrees of support are available for the paid service.