Documenting data
Why documentation?
As you do your research, it is useful to document your data so the process of data handling is transparent and so that data will be understandable in the future by yourself and by others. Producing high-quality documentation in the course of your research ensures that your data can be:
- Understood now and in the future
- Properly interpreted, because the relevant context is available
What types of documentation?
For other people to understand and interpret your data correctly during research, it is important that your data are accompanied by the right documentation:
- Files that explain the context behind the dataset and that contain information on how the research was done. Generally, these are version logs, notebooks such as lab notebooks, or documents setting out the methodologies used. They may also come in the form of standardized protocols, equipment or software manuals, field notes on paper and so on. They answer the who, what, why, where and how of the data. Examples of context documentation are the background to data collection (project history, objectives and hypotheses), data-collection methods (sampling, the data-collection process, measuring instruments, etc.) and information on access, conditions of use and data confidentiality.
- Files that describe the structure of the dataset. These are often readme.txt files or other documents that contain an overview of the various folders and files that make up the dataset. The more elaborate your dataset is, the more important such documents are. Which folder contains what? Which files must be opened first? And so on.
- Files that describe the content of the dataset. Such documentation describes the dataset at the data level. These are often codebooks that explain the concepts and/or variables in question as well as their meaning and the numerical or other values they represent. A codebook can also include an explanation or definition of the codes and classification schemes used, the coding of data, the reasons for missing values, etc.
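As an illustration of the structure overview mentioned in the list above, the following is a minimal sketch of how a folder listing for a readme.txt file could be generated. The function name build_overview and the example folder names are assumptions made for this illustration. The listing would still need a short description next to each entry, written by hand.

```python
from pathlib import Path


def build_overview(root: Path) -> str:
    """Return an indented listing of the folders and files under root.

    Hypothetical helper: the output is only a starting point for a
    readme.txt and says nothing about what each file contains.
    """
    lines = [f"{root.name}/"]

    def walk(folder: Path, depth: int) -> None:
        # Sort by name so the overview is the same on every run.
        for entry in sorted(folder.iterdir(), key=lambda p: p.name):
            indent = "  " * depth
            if entry.is_dir():
                lines.append(f"{indent}{entry.name}/")
                walk(entry, depth + 1)
            else:
                lines.append(f"{indent}{entry.name}")

    walk(root, 1)
    return "\n".join(lines)


if __name__ == "__main__":
    # Lists the current working directory as an example.
    print(build_overview(Path.cwd()))
```

The output can be pasted into the readme and annotated, for example with a note on which files must be opened first.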
Examples of the description of the content and structure and of the context can be found under these links.
Descriptions of and documentation on data can be embedded in the data file itself. Many software packages have facilities for this purpose. They can also be incorporated into data guides, lab books and so on.
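A codebook can also be kept in a machine-readable form alongside the data. The sketch below shows one possible layout, assuming a small survey dataset. The variable names, the codes and the decode_row helper are all hypothetical and serve only to show how variables, their values and their missing-value codes can be documented together.

```python
# Hypothetical codebook: one entry per variable, giving its meaning,
# the codes it may take, and the codes used for missing values.
CODEBOOK = {
    "age_group": {
        "label": "Age group of respondent",
        "values": {"1": "18-34", "2": "35-64", "3": "65 and over"},
        "missing": {"-9": "Not answered"},
    },
    "employed": {
        "label": "Currently in paid employment",
        "values": {"0": "No", "1": "Yes"},
        "missing": {"-9": "Not answered"},
    },
}


def decode_row(row: dict[str, str]) -> dict[str, str]:
    """Translate the coded values in one record into their documented meanings."""
    decoded = {}
    for variable, code in row.items():
        entry = CODEBOOK.get(variable)
        if entry is None:
            # Variables without a codebook entry are passed through unchanged.
            decoded[variable] = code
            continue
        meanings = {**entry["values"], **entry["missing"]}
        # Flag codes that the codebook does not explain.
        decoded[variable] = meanings.get(code, f"undocumented code {code}")
    return decoded


if __name__ == "__main__":
    print(decode_row({"age_group": "2", "employed": "-9"}))
```

Keeping the missing-value codes separate from the ordinary values makes the reason for a gap explicit instead of leaving it to the reader to guess.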
Creating, processing and analysing steps[1]
To enable others to verify the quality of your data and ideally to replicate the results of your research, you should properly document the steps you have followed to create, process and analyse the data.
- Steps to create your data: you should make sure that your research project includes data of the highest quality and that these can be reinterpreted from the project documentation. It is therefore important to properly document how you collected the data, including the methodology you followed.
- Steps to process your data: perhaps your raw data contain sensitive information such as identifiable demographics or private information. Or perhaps you need to modify or delete incomplete or incorrect parts of your dataset. Perhaps the data are in a proprietary format and need to be exported into another format. Any of these cases involves processing or changing your data. It is essential to document any such steps.
- Steps to analyse your data: data analysis is the core of your research project. The analysis you carry out may range from simple to highly complex, depending on the data and the research methodology you follow. It is important to document what methods you have used to analyse your data and to describe that documentation process.
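One way to document processing steps such as those listed above is to keep a simple log that records each change and the reason for it. The following is a minimal sketch under that assumption. The ProcessingLog class, its fields and the example file name are hypothetical and not taken from any particular tool.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class ProcessingLog:
    """Hypothetical append-only record of the changes made to one dataset."""

    dataset: str
    steps: list = field(default_factory=list)

    def record(self, action: str, reason: str) -> None:
        # Each step is numbered and given a UTC timestamp so the order
        # of changes can be reconstructed later.
        self.steps.append({
            "step": len(self.steps) + 1,
            "action": action,
            "reason": reason,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

    def to_json(self) -> str:
        # JSON is plain text, so the log can be stored next to the data.
        return json.dumps(asdict(self), indent=2)


if __name__ == "__main__":
    log = ProcessingLog("survey_2020.csv")
    log.record("Removed column 'name'", "Contains identifiable information")
    log.record("Exported to CSV", "Original was in a proprietary format")
    print(log.to_json())
```

Recording the reason together with the action helps others judge whether a change affects how the data should be interpreted.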
Other data[2]
The following research records may also need to be managed during and after the life of a project but are not generally considered to be research data:
- Correspondence, whether exchanged by e-mail or in hard copy
- Grant applications
- Submissions to ethics committees
- Research progress reports
- Research publications
- Master lists/to-do lists
- Social media communications such as blogs and wikis
1 Based on Lifecycle data management planning by the University of Michigan.
2 Based on An introduction to managing research data by the University of Leicester.