
File and Folder Organization

Following best practices for managing your research data helps ensure it will remain available to other researchers in the long term. Not every guideline below will apply to every discipline or project. Overall, however, these guidelines will streamline your data management activities and help prevent data loss.

File Structures

Choose a consistent organizational structure for all of your project folders. Although it may seem obvious, thinking about the structure of your folders and planning effectively makes navigation much easier. Minimize the number of clicks necessary to reach files. In conjunction with a consistent file naming convention, an efficient structure saves a lot of time.

Remember:

  • Be consistent
  • Structure your hierarchy logically and follow the logic that makes the most sense for your project
  • Keep folders and subfolders separate to reduce overlap, yet don't make an excessive number of subfolders
  • Keep subfolder categories narrow to restrict the number of files in each
  • Treat your Desktop as temporary storage, and never keep files there for longer than absolutely necessary
  • When naming folders, include information you might want when looking up files
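
One way to keep the structure consistent across projects is to create it from a single definition. The following is a minimal Python sketch under stated assumptions: the folder names in PROJECT_LAYOUT and the function name create_project_skeleton are hypothetical examples, not a recommended standard, and should be adapted to whatever logic fits your project.

```python
from pathlib import Path

# Hypothetical layout for illustration only; adapt the names to your project.
PROJECT_LAYOUT = [
    "data/raw",
    "data/processed",
    "docs",
    "analysis",
    "results",
]


def create_project_skeleton(root, layout=PROJECT_LAYOUT):
    """Create the same folder hierarchy under `root` and return the folders."""
    root = Path(root)
    created = []
    for relative in layout:
        folder = root / relative
        # parents=True builds intermediate folders; exist_ok=True makes reruns safe.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Calling create_project_skeleton("my_project") for each new project gives every project the same shallow hierarchy, so files are always found in the same place.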
     

File Naming

Establishing a consistent file naming convention early in your project and maintaining that convention throughout is an underrated but incredibly useful practice. Without a convention, it is easy to end up with a lot of files whose names tell you nothing about their contents. Locating a single file can then take a lot of time and effort, and finding things can become nearly impossible.

Remember: 

  • Be consistent
  • Determine a file naming convention before you gather data
  • Limit file names to 32 characters or fewer (preferably fewer)
  • If you use abbreviations, define them in a README file (and keep the README file linked to the files it describes)
  • With sequential numbering (e.g., 1, 2, 3, etc.), use leading zeros to accommodate multi-digit versions (e.g. use 01-10 for 1-10, 001-100 for 1-100, and so on)
  • Avoid special characters like & , * % # ; ( ) ! @ $ ^ ~ ' { } [ ] ? < >
  • Use underscores _ rather than spaces in your file names
  • Use descriptive names that document the important aspects of your project
  • Use a consistent date and time convention such as YYYYMMDD (year, month, day), which will result in your files being sorted chronologically.
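
Several of the rules above can be combined in a small helper that builds file names for you. This is a minimal Python sketch, not a definitive implementation: the function name build_file_name, the order of the name parts, and the default padding width are assumptions made for illustration. It applies the YYYYMMDD date format, leading zeros, underscores instead of spaces, the 32-character limit, and the ban on special characters.

```python
import re
from datetime import date

MAX_LENGTH = 32  # character limit suggested in the list above

# Assumed safe set: letters, digits, underscores, and hyphens,
# followed by one dot and an alphanumeric extension.
SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+\.[A-Za-z0-9]+")


def build_file_name(project, description, when, sequence, extension, width=3):
    """Assemble a name such as soil_ph_20240305_007.csv (hypothetical scheme)."""
    stem = "_".join([
        project,
        description.replace(" ", "_"),   # underscores rather than spaces
        when.strftime("%Y%m%d"),         # YYYYMMDD sorts chronologically
        str(sequence).zfill(width),      # leading zeros keep numeric order
    ])
    name = f"{stem}.{extension}"
    if len(name) > MAX_LENGTH:
        raise ValueError(f"{name!r} is longer than {MAX_LENGTH} characters")
    if not SAFE_NAME.fullmatch(name):
        raise ValueError(f"{name!r} contains characters outside the safe set")
    return name
```

For example, build_file_name("soil", "ph", date(2024, 3, 5), 7, "csv") returns "soil_ph_20240305_007.csv". Any abbreviations used in the parts (here "ph") would still need to be defined in the README file.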

File Renaming

In a perfect world, you would be able to maintain a file naming convention from the very start of a project and never need to make a change. However, you may find that you need to add or remove information from your file names or make other changes. You have two options:

  • Rename each file manually, or
  • Use a program capable of renaming files in batches
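
For simple changes, the second option can also be a short script. The following is a minimal Python sketch, assuming the change is a plain text substitution within file names; the function name batch_rename and its parameters are hypothetical. It defaults to a dry run so that you can review the planned changes before any file is touched.

```python
from pathlib import Path


def batch_rename(folder, old, new, dry_run=True):
    """Replace `old` with `new` in the name of every file in `folder`.

    Returns a list of (source, target) pairs. Files are renamed only
    when dry_run is False.
    """
    folder = Path(folder)
    planned = []
    targets = set()
    for path in sorted(folder.iterdir()):
        if not path.is_file() or old not in path.name:
            continue
        target = path.with_name(path.name.replace(old, new))
        # Refuse to overwrite an existing file or to map two files to one name.
        if target.exists() or target in targets:
            raise FileExistsError(f"{target} would be overwritten")
        targets.add(target)
        planned.append((path, target))
    if not dry_run:
        for source, target in planned:
            source.rename(target)
    return planned
```

Calling batch_rename("data/raw", " ", "_") lists the files whose spaces would become underscores; repeating the call with dry_run=False performs the renaming. Work on a backed-up copy of the folder the first time you try it.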

There are several batch renaming programs available.

File Formats

The best file formats for research data are non-proprietary, "lossless," and unencrypted/uncompiled.

Researchers may sometimes encounter situations where they absolutely must use a problematic file format. In this case, they should make every possible effort to provide a backup version of the file in a different format. They should also provide documentation explaining how to use the problematic format.

Non-Proprietary

If the program that created a file is the only option for reading or accessing the file, the file format is proprietary or not open. To help ensure that your data and files are accessible by a wide range of users for a long time, choose open, non-proprietary formats whenever possible. With proprietary formats, if the original software becomes unavailable or ceases to function, the files may become unreadable.

Non-proprietary, or open, file formats are ones where the description and/or development of the format are open to the public; they often can be opened by multiple software programs. Open formats are often community-maintained.

"Lossless"

Some file formats compress the information in files. This can be useful because the files take up less disk space. However, for many such formats, the compression causes data from the file to be lost. These formats are "lossy." Formats that can compress files without losing any information are "lossless" and retain the original details of the data.

A "lossless" file that has been compressed can be completely restored to its original state, unchanged. A "lossy" file will be compromised in quality due to the deletion of some information.
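
The lossless property can be checked directly: compress the data, decompress it, and compare the result with the original byte for byte. The following minimal Python sketch uses the standard library's zlib module (the algorithm behind GZIP and most ZIP files); the sample data is made up for illustration.

```python
import zlib

# Made-up sample data: a small, repetitive CSV table as bytes.
original = b"site,temperature_c\n" + b"A01,21.5\n" * 1000

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

# Lossless: the restored bytes are identical to the original.
assert restored == original

# The compressed copy still takes up less space.
assert len(compressed) < len(original)
```

A lossy format would fail the first check, because the information discarded during compression cannot be recovered afterwards.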

Unencrypted/Uncompiled

Encrypting or password-locking a file may improve security, but if the encryption key or password is ever lost, the data in the file may also be lost.

Uncompiled source code is easier to re-use and is more likely to last a long time since it can be compiled on a range of architectures/platforms.

Here is a list of some non-proprietary file formats that are generally preferred for different types of files:

  • Containers: TAR, GZIP, ZIP
  • Databases: XML, CSV
  • Geospatial: SHP, DBF, GeoTIFF, NetCDF
  • Moving images: MOV, MPEG, AVI, MXF
  • Sounds: WAVE, AIFF, MP3, MXF
  • Statistics: ASCII, DTA, POR, SAS, SAV, R
  • Still images: TIFF, JPEG 2000, PDF, PNG, GIF, BMP
  • Tabular data: CSV
  • Text: XML, PDF/A, HTML, ASCII, UTF-8
  • Web archive: WARC

The Library of Congress' Sustainability of Digital Formats and Recommended Format Specifications provide more extensive information on formats, including guidance for preserving data sets, geospatial data, and web archives.


If you'd like more information on research data curation and management, please schedule a consultation: