Instead of collecting your own data, you can use existing data from digital or physical archives or repositories to answer your research question. Reusing existing data makes efficient use of what has already been collected, saves work and time, and can give you access to larger datasets and/or participant pools than you would normally have.
Finding existing data
If you intend to reuse existing data, there are good sources for potentially relevant datasets. To judge the quality of a data archive, you may check:
- Any certification it holds, such as the CoreTrustSeal, the Nestor Seal (DIN standard) or an ISO standard. If a repository is certified as trustworthy, there is little or no need to check the following points.
- The repository’s mission, policy and guidelines. Try to find the answer to questions such as: Does the repository have a long-term preservation goal? Does it commit to the FAIR principles? Does it offer guidelines for sustainable data formats and metadata standards? What is its policy regarding access management and privacy concerns?
- The use of persistent identifiers such as DOIs, which help keep the data findable over time.
- The use of licenses. For example: Does the repository offer broadly acknowledged (open access) licenses for research data and code? Does it offer the option to add data use agreements for sharing research data under restricted access conditions?
Sources
There are many good sources available. If you don’t know where to start, try one or more directories that list data repositories, such as Re3data.org, FAIRsharing, and EOSC Portal. Use the filters that these sites provide to find databases within your research field or those that have specific certifications, licenses, access options, and so on.
You can also start by browsing Radboud University’s own repository, the Radboud Data Repository (RDR). This repository contains data from several research fields, such as the humanities, social sciences, life sciences, law and technical sciences. Starting your search in the RDR can result in collaborations with fellow Radboud University researchers.
Using existing data
Once you have found a suitable dataset or collection, make sure you know all the conditions and restrictions on data access and data reuse. These can often be found in a license or a data use agreement. If you cannot find such a document, contact the owner of the data and ask for clarification. When reading a license, check whether:
- You are allowed to edit, combine and/or expand on the data.
- You are allowed to publish the original and/or derived data. You can still use data that you are not allowed to re-publish, but being aware of it at the start of your project means that you can clearly communicate this to relevant parties, such as funders or publishers.
- You are allowed to change the license (if you are allowed to publish the data again). Be aware that if you are not allowed to change the original license, this might cause difficulties if you combine multiple datasets with different licenses.
- You need to cite the original author or source in your own work. This is of course common practice in the scientific community, but some licenses specifically state this. See below for more detail.
- There are any other restrictions. Some data use agreements specify other, less general restrictions, for example that you are not allowed to establish the identity of participants in the study or to contact them.
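The checklist above can be recorded per dataset so that conflicts show up before you start combining data. The following is a minimal sketch in Python, not a legal check: the `ReuseTerms` class, its field names, the `reuse_warnings` function and the license labels "License A" and "License B" are all hypothetical and only illustrate the idea.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ReuseTerms:
    """Answers to the license checklist for one dataset (hypothetical structure)."""
    name: str
    license: str
    may_modify: bool            # edit, combine and/or expand on the data
    may_republish: bool         # publish the original and/or derived data
    may_relicense: bool         # change the license when re-publishing
    attribution_required: bool  # the license explicitly demands a citation
    other_restrictions: tuple[str, ...] = ()


def reuse_warnings(datasets: list[ReuseTerms]) -> list[str]:
    """Collect the points to resolve or communicate before reusing the datasets."""
    warnings: list[str] = []
    for d in datasets:
        if not d.may_modify:
            warnings.append(f"{d.name}: editing or combining is not permitted")
        if not d.may_republish:
            warnings.append(f"{d.name}: re-publishing is not permitted")
        if d.attribution_required:
            warnings.append(f"{d.name}: license requires citing the original source")
        for restriction in d.other_restrictions:
            warnings.append(f"{d.name}: {restriction}")
    # Several licenses that may not be changed can clash once the data are combined.
    fixed = sorted({d.license for d in datasets if not d.may_relicense})
    if len(fixed) > 1:
        warnings.append("combined data carries fixed licenses: " + ", ".join(fixed))
    return warnings


# Made-up example datasets, for illustration only.
survey = ReuseTerms("survey", "License A", True, True, False, True)
interviews = ReuseTerms(
    "interviews", "License B", True, False, False, False,
    ("do not contact participants",),
)
for line in reuse_warnings([survey, interviews]):
    print(line)
```

A list like this does not replace reading the license itself, but it gives you something concrete to share with funders or publishers at the start of the project.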
Citing existing data
Regardless of the license requirements, it is always important to properly cite the data you will use. Just like with articles, citing a dataset provides the creator with appropriate credit and makes their research more findable. Also, by specifically citing a dataset, you recognise that datasets are legitimate scholarly contributions. A reference to a dataset needs to consist of:
- The names and/or organisations of the producers of the data;
- The year in which the data was produced;
- The title;
- The name of the publisher;
- The persistent identifier as a full URL.
Furthermore, it may contain the version number and the resource type (e.g. dataset, software, workflow).
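The elements listed above can be assembled into a reference string as in the following sketch. It is written in Python under stated assumptions: the `DatasetReference` class, the `doi_to_url` and `format_reference` functions, the ordering and punctuation of the output, and the example dataset (including its DOI) are all hypothetical. Follow the citation style that your journal or repository prescribes.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetReference:
    """The elements of a dataset reference (hypothetical structure)."""
    creators: list[str]
    year: int
    title: str
    publisher: str
    identifier: str                      # DOI, bare or already a full URL
    version: Optional[str] = None        # optional element
    resource_type: Optional[str] = None  # optional, e.g. "Dataset", "Software"


def doi_to_url(identifier: str) -> str:
    """Return the persistent identifier as a full URL."""
    identifier = identifier.strip()
    if identifier.lower().startswith(("http://", "https://")):
        return identifier
    if identifier.lower().startswith("doi:"):
        identifier = identifier[4:].strip()
    return "https://doi.org/" + identifier


def format_reference(ref: DatasetReference) -> str:
    """Build one reference string; the punctuation is an assumed style."""
    text = f"{'; '.join(ref.creators)} ({ref.year}). {ref.title}"
    if ref.version:
        text += f" (Version {ref.version})"
    if ref.resource_type:
        text += f" [{ref.resource_type}]"
    text += f". {ref.publisher}. {doi_to_url(ref.identifier)}"
    return text


# Made-up dataset, for illustration only.
example = DatasetReference(
    creators=["Jansen, A.", "De Vries, B."],
    year=2021,
    title="Example survey data",
    publisher="Example Data Repository",
    identifier="10.1234/example.5678",
    version="1.0",
    resource_type="Dataset",
)
print(format_reference(example))
```

Storing the identifier as a bare DOI and expanding it only when the reference is formatted keeps the same record usable for styles that print the DOI differently.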