
Metadata and Documentation

Metadata

Definition: Metadata is "data about data" or "data in context": pieces of information that provide context for data. Metadata helps when researchers re-analyze their own data, use other people's data, reuse existing data for a different project, or collaborate with others. It is becoming increasingly important as the culture of data sharing spreads, but remember that metadata makes it easier for you to use your own data too. When documenting your research data, ask yourself whether it passes the "Ward test" -- that is, if you disappeared, would someone else be able to access, interpret, and analyze your data? If the answer is "No," you should improve your documentation and metadata.

Creating metadata for your research projects and data leads to increased accessibility, helps data retain its context, accommodates version control (through distinguishing multiple versions), and can satisfy the legal requirements of repositories and funders. Quality metadata also makes data easier to preserve and more persistent over time.

When you create metadata for a piece of data -- whether that data be code, a paper, images, spreadsheets, et cetera -- it can help to answer the following questions:

  • Who created the data?
  • What does the data file contain?
  • When were the data generated?
  • Where were the data generated?
  • Why were the data generated?
  • How were the data generated?
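As a minimal sketch, the six questions above could be captured in a small machine-readable record stored next to the data file. The `MetadataRecord` class, its field names, and the sample values are hypothetical illustrations, not part of any metadata standard.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class MetadataRecord:
    """Answers to the six questions for one data file (hypothetical fields)."""
    who: str
    what: str
    when: str
    where: str
    why: str
    how: str


def to_json(record: MetadataRecord) -> str:
    """Serialize the record as human-readable JSON."""
    return json.dumps(asdict(record), indent=2)


def missing_fields(record: MetadataRecord) -> list[str]:
    """Return the names of any questions left blank."""
    return [name for name, value in asdict(record).items() if not value.strip()]


# Illustrative values only; not taken from a real project.
record = MetadataRecord(
    who="J. Doe, Example Lab",
    what="Hourly air temperature readings, one row per reading",
    when="2023-06-01 to 2023-08-31",
    where="Rooftop sensor, Example Campus",
    why="Baseline for a summer heat study",
    how="",  # left blank on purpose to show the check below
)

print(to_json(record))
print("Unanswered:", missing_fields(record))
```

Saving the JSON output alongside the data file keeps the answers with the data, and the blank-field check is one simple way to notice a question that was skipped.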

Metadata Standards

Metadata standards exist to create consistency across research documentation. There are several "discipline agnostic" metadata standards that researchers can choose. They include DataCite, Project Open Data, and the Data Documentation Initiative (DDI). Standards like these can generally be applied to data in almost any discipline.

Specific disciplines often have their own popular sets of metadata standards. The Digital Curation Centre (DCC) maintains a directory of standards that can be browsed by discipline. The Research Data Alliance (RDA) offers a community-maintained directory that is updated more frequently.

Finally, the choice of repository sometimes determines the metadata schema. For instance, Dash uses the discipline-agnostic DataCite schema by default. Discipline-specific repositories may use others.

This infographic provides a quick visual overview of different metadata standards by discipline:

Documentation

If your data is intended for local use -- meaning that it will only be used by you and your co-authors, labmates, or collaborators -- it doesn't matter what standard(s) you use, as long as they are consistent. The following are some examples of local data documentation that you can immediately implement with your own projects -- that, in fact, you may already be using without realizing! You can (and should) also include this documentation when publishing your data in a repository or archive.

Laboratory or Field Notebooks

These are, as the name implies, physical (analog) or digital notebooks in which researchers document information that is relevant to their research process. Maintaining a notebook means that each researcher is able to localize all their relevant information in one place; it encourages thoughtful work; and it enables other researchers to pick up and continue a line of research if necessary.

Best practices for maintaining an effective lab notebook include dating each entry in a consistent format, listing names and contact information of collaborators, keeping notes from important meetings or discussions, and justifying methods and data source(s). Researchers should also note any corrections, calculations (with units), file names/locations, and the locations of any physical materials.

Codebook

A codebook is a set of codes, definitions, and examples often used as a guide to provide context for and help analyze survey data. Codebooks:

  • Are essential for analyzing qualitative research
  • Contain the text of survey questions
  • Often also contain lists of possible responses for survey questions
  • Can be as detailed as the user chooses, though generally, the more specific, the better

If your research involves administering surveys, you may use a codebook to ease interpretation and increase accessibility of the survey results. If you download an archived data file, that file often comes with some version of a codebook which explains each variable and its possible values.
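To make the idea concrete, here is a rough sketch of a codebook kept as a Python dictionary and used to translate coded survey responses into readable labels. The variable names, question text, and response codes are invented for illustration.

```python
# Hypothetical codebook: variable name -> question text and response codes.
codebook = {
    "q1_satisfaction": {
        "question": "How satisfied are you with the library's data services?",
        "responses": {
            1: "Very dissatisfied",
            2: "Dissatisfied",
            3: "Neutral",
            4: "Satisfied",
            5: "Very satisfied",
        },
    },
    "q2_shared_data": {
        "question": "Have you shared research data in a repository?",
        "responses": {0: "No", 1: "Yes", 9: "Prefer not to answer"},
    },
}


def decode(row: dict) -> dict:
    """Replace each numeric code in a survey row with its codebook label."""
    decoded = {}
    for variable, code in row.items():
        labels = codebook[variable]["responses"]
        # Keep unknown codes visible rather than silently dropping them.
        decoded[variable] = labels.get(code, f"UNKNOWN CODE {code}")
    return decoded


raw_row = {"q1_satisfaction": 4, "q2_shared_data": 9}
print(decode(raw_row))
```

Storing the question text next to the codes means someone reading the results later does not need the original survey instrument to interpret a value such as `9`.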

Data Dictionary

A Data Dictionary is a collection of names, definitions, and attributes about data elements that are being used or captured in a database, information system, or part of a research project. It describes the meanings and purposes of data elements within the context of a project, and provides guidance on interpretation, accepted meanings, and representation. A Data Dictionary also provides metadata about data elements. This metadata can assist in defining the scope and characteristics of data elements, as well as the rules for their usage and application.

Data Dictionaries are useful for a number of reasons. In short, they:

  • Assist in avoiding data inconsistencies across a project
  • Help define conventions that are to be used across a project
  • Provide consistency in the collection and use of data across multiple members of a research team
  • Make data easier to analyze
  • Enforce the use of Data Standards
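One way a data dictionary can help avoid inconsistencies is by serving as the reference that incoming rows are checked against. The sketch below assumes a hypothetical two-element dictionary (`site_id` and `temp_c`) and a made-up `check_row` helper; a real project would define its own elements and rules.

```python
# Hypothetical data dictionary: one entry per data element.
data_dictionary = {
    "site_id": {
        "definition": "Field site identifier",
        "type": str,
        "allowed": {"A", "B", "C"},
    },
    "temp_c": {
        "definition": "Air temperature",
        "type": float,
        "unit": "degrees Celsius",
        "min": -50.0,
        "max": 60.0,
    },
}


def check_row(row: dict) -> list[str]:
    """Return a list of problems found when comparing a row to the dictionary."""
    problems = []
    for name, spec in data_dictionary.items():
        if name not in row:
            problems.append(f"{name}: missing")
            continue
        value = row[name]
        if not isinstance(value, spec["type"]):
            problems.append(f"{name}: expected {spec['type'].__name__}")
            continue
        if "allowed" in spec and value not in spec["allowed"]:
            problems.append(f"{name}: {value!r} is not an allowed value")
        if "min" in spec and value < spec["min"]:
            problems.append(f"{name}: below minimum")
        if "max" in spec and value > spec["max"]:
            problems.append(f"{name}: above maximum")
    # Flag columns that nobody has documented yet.
    for name in row:
        if name not in data_dictionary:
            problems.append(f"{name}: not defined in the data dictionary")
    return problems


print(check_row({"site_id": "A", "temp_c": 21.5}))
print(check_row({"site_id": "D", "temp_c": 99.0}))
```

An empty list means the row matches the dictionary. Because every team member runs the same check, the dictionary becomes the single place where conventions are written down.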

This is a short (6:30) video explaining the concept and construction of data dictionaries, produced by the University of Wisconsin Data Services.

README File

As the name implies, a README is a file that users of your data are intended to read first. It explains everything users need to know to understand your data. When creating one:

  • Make sure that whatever file naming convention you use associates each README file with the file or files that it references
  • Use a plain text file or other non-proprietary format to create it and format it clearly
  • If you use multiple READMEs, keep the format consistent across them
  • Use standard date, time, and name formats within READMEs

A brief overview of recommended README file content:

  • Names and contact info for all personnel involved
  • Date
  • Short description of the data contained in each file
  • List of all files (including relationships between them)
  • For tabular data, full names and definitions of column headings
  • Units of measurement
  • Any specialized abbreviations, codes, or symbols used
  • Copyrights/licensing information
  • Funding sources
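The content listed above could be assembled into a plain-text README with a short script, which also helps keep the format consistent across multiple READMEs. This is only a sketch: the `build_readme` function, its parameters, and every sample value are hypothetical, and the ISO 8601 date format (YYYY-MM-DD) is one common choice of standard date format rather than a requirement.

```python
from datetime import date


def build_readme(title, contacts, files, columns, license_text, funding, created=None):
    """Assemble plain-text README content from the recommended items."""
    created = created or date.today()
    lines = [
        title,
        "=" * len(title),
        "",
        f"Date: {created.isoformat()}",  # ISO 8601 (YYYY-MM-DD)
        "",
        "Contacts:",
    ]
    lines += [f"  - {name} <{email}>" for name, email in contacts]
    lines += ["", "Files:"]
    lines += [f"  - {name}: {description}" for name, description in files.items()]
    lines += ["", "Column definitions:"]
    lines += [f"  - {col}: {meaning} ({unit})" for col, (meaning, unit) in columns.items()]
    lines += ["", f"License: {license_text}", f"Funding: {funding}", ""]
    return "\n".join(lines)


# Illustrative values only; replace with your project's details.
readme_text = build_readme(
    title="Rooftop temperature data",
    contacts=[("J. Doe", "jdoe@example.org")],
    files={"temps_2023.csv": "Hourly readings; described by temps_2023_README.txt"},
    columns={"temp_c": ("Air temperature", "degrees Celsius")},
    license_text="CC BY 4.0",
    funding="Example Foundation grant (placeholder)",
    created=date(2023, 9, 1),
)
print(readme_text)
```

The returned string can be written to a `.txt` file whose name ties it to the data file it describes, such as the `temps_2023_README.txt` name used in the sample values.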

An example README template that you can download and customize to meet your needs is available from Cornell University (https://cornell.app.box.com/v/ReadmeTemplate).

Some data repositories may either require or recommend that you upload README files along with your data. UC Dash, for example, does not currently preserve the hierarchical structure of files and strongly recommends a README.

These three types of local data documentation -- codebooks, data dictionaries, and README files -- are crucial. They provide context not only for you, the researcher, in the future but also for anyone else who may ever need or want to use your data for any reason. Using other people's data can be either a breeze or a huge headache depending on the quality of the documentation.

