Best practices for creating reusable Dryad data packages

So, you have decided to share your research data in Dryad? Congratulations! This is a huge first step. If you are unsure where to start or what you ‘should’ share, don’t worry, you are not alone. It’s not always obvious how to craft a data package with reusability in mind.

We want to help you communicate your research effectively to the scientific community, enhancing both its visibility and the potential for collaboration. To this end, we provide the following recommendations to help make your Dryad data packages as interpretable and reusable as possible:

Make sure your data are shareable

  • All files submitted to Dryad must abide by the terms of Creative Commons Zero (CC0 1.0). Under these terms, the author releases the data to the public domain. Please review all files and ensure they conform to these terms (i.e., are not covered by any copyright claims). We cannot archive software or code files that carry licenses incompatible with CC0 (e.g., GNU GPL, MIT, CC BY), but we can provide a link to the files if they are in a dedicated software repository such as GitHub, Bitbucket, or CRAN. For more information on why Dryad uses Creative Commons Zero, please see our FAQ.
  • Human subjects data must be properly anonymized and prepared under applicable legal and ethical guidelines (see recommendations for human subjects data).
  • For information on rare and endangered species, it may be necessary to mask certain details (e.g., locations) to prevent any further threat to these natural resources. Please review our recommendations for sharing rare and endangered species data responsibly (see recommendations for endangered species data).

Make sure your data are accessible

  • Data should be in non-proprietary open formats if possible (see preferred formats). Open formats help ensure that your data remain accessible to the largest number of people well into the future.
  • Review all of your files carefully for errors. Common errors include missing data, misnamed files, mislabeled variables, incorrectly formatted values, and corrupted file archives. It may be helpful to run data validation tools before sharing. For example, if you are working with tabular datasets, a service like goodTables can identify missing data and data type formatting problems.
  • Multiple files, or directories of files, can be bundled together into compressed file archives (e.g., .zip, .tar.gz). We recommend that large files do not exceed 10GB each. If you have a large directory of files, and there is a logical way to split it into subdirectories and compress those, we encourage you to do so. See our recommendations for large file transfers.
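As an illustration of the kind of check described above, here is a minimal Python sketch that scans a tabular file for missing values and for non-numeric entries in columns that should be numeric. It is not a replacement for a dedicated validation service; the function name, the list of missing-value markers, and the sample column names are all assumptions made for this example.

```python
import csv
import io

# Strings treated as "missing"; this set is an assumption, adjust it to your data.
MISSING_MARKERS = {"", "NA", "N/A", "NaN", "null"}


def find_csv_problems(text, numeric_columns=()):
    """Return a list of (line_number, column, message) for suspicious cells."""
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        line_number = reader.line_num  # line 1 is the header
        for column, value in row.items():
            if column is None:
                # DictReader stores surplus cells under the key None.
                problems.append((line_number, None, "more cells than header columns"))
                continue
            if value is None or value.strip() in MISSING_MARKERS:
                problems.append((line_number, column, "missing value"))
                continue
            if column in numeric_columns:
                try:
                    float(value)
                except ValueError:
                    problems.append((line_number, column, f"not numeric: {value!r}"))
    return problems


# Hypothetical sample data with one empty cell, one text value in a
# numeric column, and one "NA" placeholder.
sample = "site,count,mass_g\nA,12,3.4\nB,,2.9\nC,seven,NA\n"
for problem in find_csv_problems(sample, numeric_columns={"count", "mass_g"}):
    print(problem)
```

Whatever markers you use for missing data, document them in your README so that others can interpret them.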

Name files and directories in a consistent and descriptive manner

All filenames should be organized in a consistent manner and clearly describe the content they contain. Ideally, the names will be concise, informative, and unique. Stanford Libraries provides some excellent advice in their best practices for file naming.

Avoid vague and ambiguous filenames. Examples of poor filenames:

  • data.txt
  • Latest tables for upload.xls
  • Dryad Data (#1).csv

It is also important to avoid including spaces and special characters (e.g., ! @ # $ % ^ & and ") in filenames because they can be problematic for computers to interpret. It is best to remove all blank spaces and use one of the common special letter case patterns that is easily readable by both machines and people:

  • Kebab-case: "The-quick-brown-fox-jumps-over-the-lazy-dog"
  • CamelCase: "TheQuickBrownFoxJumpsOverTheLazyDog"
  • Snake_case: "The_quick_brown_fox_jumps_over_the_lazy_dog"
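If you have many files to rename, the three patterns above can be produced with a few lines of Python. This is a minimal sketch, assuming names that use only ASCII letters and digits; the function names are hypothetical, and only the final file extension is preserved.

```python
import re
from pathlib import PurePath


def split_words(name):
    """Split a free-form name into words, dropping spaces and special characters."""
    return re.findall(r"[A-Za-z0-9]+", name)


def kebab_case(name):
    return "-".join(split_words(name))


def camel_case(name):
    # Capitalize the first letter of each word and leave the rest unchanged.
    return "".join(word[:1].upper() + word[1:] for word in split_words(name))


def snake_case(name):
    return "_".join(split_words(name))


def clean_filename(filename, style=snake_case):
    """Apply a case pattern to the name while keeping the last extension."""
    path = PurePath(filename)
    return style(path.stem) + path.suffix


print(clean_filename("Latest tables for upload.xls"))        # Latest_tables_for_upload.xls
print(clean_filename("Dryad Data (#1).csv", style=kebab_case))  # Dryad-Data-1.csv
```

Note that removing spaces and special characters makes a name machine-friendly, but not more descriptive: the cleaned names above are still vague.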

We encourage authors to include the following types of information when naming files:

  • Author name
  • Project name
  • Type of data
  • Type of analysis
  • Date
  • File extension for application-specific files (e.g., .csv, .txt, .R, .xls, .fasta, .nii)

Here are some examples of descriptive file names that do not include special characters or spaces:

  • 1900-2000_sasquatch_migration_coordinates.csv
  • Smith-fMRI-neural-response-to-cupcakes-vs-vegetables.nii.gz
  • 2015-SimulationOfTropicalFrogEvolution.R
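Names like these can also be assembled from their components in a script, which helps keep a large set of files consistent. The sketch below is one possible convention, not a Dryad requirement: it joins components with underscores, joins words within a component with hyphens, and puts an optional ISO 8601 date first. The function name and the example values are hypothetical.

```python
import re
from datetime import date


def build_filename(author, project, data_type, extension, when=None):
    """Join naming components into one filename without spaces or special characters."""
    pieces = [author, project, data_type]
    if when is not None:
        # ISO 8601 dates (YYYY-MM-DD) sort chronologically when listed by name.
        pieces.insert(0, when.isoformat())
    cleaned = []
    for piece in pieces:
        words = re.findall(r"[A-Za-z0-9]+", piece)
        if not words:
            raise ValueError(f"component has no usable characters: {piece!r}")
        cleaned.append("-".join(words))
    return "_".join(cleaned) + "." + extension.lstrip(".")


name = build_filename(
    "Smith", "frog evolution", "simulation output", ".csv", when=date(2015, 3, 1)
)
print(name)  # 2015-03-01_Smith_frog-evolution_simulation-output.csv
```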

Organize files in a logical schema

Just as with file naming, it is important to generate a clear file structure that will be interpretable by others. For instance, it may be sensible to segregate the data, code, and results of the data package as in the examples below (by file type or analysis as presented in your publication). However you choose to structure the data files, it will be helpful to your fellow researchers to explain all relevant details in a README file.

Examples organized by filetype and analysis
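One way to start from a consistent structure is to create the directories before adding any files. The following Python sketch sets up a layout that separates data, code, and results and adds a placeholder README; the directory names are assumptions chosen for illustration, so adapt them to match how your publication presents its analyses.

```python
from pathlib import Path

# Hypothetical layout: raw and processed data, code, and results kept apart.
DEFAULT_SUBDIRS = ("data/raw", "data/processed", "code", "results")


def create_skeleton(root, subdirs=DEFAULT_SUBDIRS):
    """Create the package directories and a placeholder README.

    Returns the sorted relative paths of everything under root.
    """
    root = Path(root)
    for sub in subdirs:
        (root / sub).mkdir(parents=True, exist_ok=True)
    readme = root / "README.txt"
    if not readme.exists():
        readme.write_text("Describe each directory and file here.\n", encoding="utf-8")
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))


if __name__ == "__main__":
    for entry in create_skeleton("my_data_package"):
        print(entry)
```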

Provide all files necessary to recreate the analyses in your publication

  • Unprocessed and processed data. Providing both unprocessed and processed data can be valuable for re-analysis, assuming the data are of a reasonable size. Unprocessed raw digital data straight from the recording instrument or database ensure that no details are lost and that any issues in the processing pipeline can be discovered and rectified. Processed data are cleaned, formatted, organized, and ready for reuse by others. If possible, data should be provided in non-proprietary formats (see preferred formats).
  • Code. Programming scripts communicate to others all of the steps in processing and analysis. Including them ensures that the results are reproducible by others. Informative comments throughout your code will help future users understand your code's logic.
  • External data. Links to associated data stored in other repositories (e.g., NCBI, TreeBASE) can be included using the “External file identifier” option.

Provide a description of all details relevant to the data package

Provide a clear and concise description of all relevant details about data collection, processing, and analysis in a README document. A README helps ensure that your data are correctly interpreted and reanalyzed by others. We recommend that a README be a plain text file; however, if text formatting is important, PDF is also acceptable. If you have included a README in a compressed archive of files, please also consider uploading it separately in the README section so that users are aware of the contents of the archive before downloading potentially huge files.

There are two ways to include a README with your Dryad data submission:

Details you may want to include:

  • Project name
  • Citation(s) of your published research derived from these data.
  • Contact information for author(s) regarding data analyses
  • Description of all data processing steps and analyses. Make sure to include any details that are not included in the publication that may affect the interpretation of results.
  • Description of any de-identification procedures for sensitive human subjects or endangered species data, and a list of variables that were removed from the original dataset.
  • The operating system platform and software (include version) used for analyses and file compression. Please provide a URL for each software package. If proprietary software was used, please provide information for open source software alternatives if known.
  • List of files, or directories of files, included in data package.
  • Description for each file or group of files:
    • type(s) of data included (e.g., categorical, time-series, geospatial, human subjects)
    • relationship to the tables, figures, or sections within the accompanying publication
    • key of definitions of variable names, column headings and row labels, data codes (including missing data), and measurement units
  • Description of associated datasets stored elsewhere and the associated URLs, if applicable.
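To make sure no file goes undescribed, you could generate a plain text README skeleton from the contents of your package directory and then fill in the details by hand. The Python sketch below is one possible starting point; the function name, the headings, and the "TODO" placeholders are assumptions for this example rather than a required Dryad template.

```python
from pathlib import Path


def readme_skeleton(package_dir, project_name):
    """Return README text with one entry per file found under package_dir."""
    root = Path(package_dir)
    files = sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )
    lines = [
        f"Project name: {project_name}",
        "Citation(s): TODO",
        "Contact: TODO",
        "Data processing steps and analyses: TODO",
        "De-identification procedures: TODO",
        "Operating system and software (with versions and URLs): TODO",
        "",
        "Files included:",
    ]
    for name in files:
        lines.append(f"  {name}")
        lines.append("    Type(s) of data: TODO")
        lines.append("    Related tables, figures, or sections: TODO")
        lines.append("    Variables, data codes, and units: TODO")
    lines += ["", "Associated datasets stored elsewhere: TODO"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "my_data_package" and the project name are placeholders.
    print(readme_skeleton("my_data_package", "Example project"))
```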

Examples of good reusability practices

  • Gallo T, Fidino M, Lehrer E, Magle S (2017) Data from: Mammal diversity and metacommunity dynamics in urban green spaces: implications for urban wildlife conservation. Dryad Digital Repository.
  • Rajon E, Desouhant E, Chevalier M, Débias F, Menu F (2014) Data from: The evolution of bet hedging in response to local ecological conditions. Dryad Digital Repository.
  • Drake JM, Kaul RB, Alexander LW, O'Regan SM, Kramer AM, Pulliam JT, Ferrari MJ, Park AW (2015) Data from: Ebola cases and health system demand in Liberia. Dryad Digital Repository.
  • Wall CB, Mason RAB, Ellis WR, Cunning R, Gates RD (2017) Data from: Elevated pCO2 affects tissue biomass composition, but not calcification, in a reef coral under two light regimes. Dryad Digital Repository.
  • Kriebel R, Khabbazian M, Sytsma KJ (2017) Data from: A continuous morphological approach to study the evolution of pollen in a phylogenetic context: an example with the order Myrtales. Dryad Digital Repository.

Further resources

Go to Frequently Asked Questions

All text on this page is available under a CC-BY 3.0 license.

Last revised: 2018-01-24
