
Best Practices for Data Management

A Deeper Dive into Data

Data consist of observations or instrument measurements, along with values calculated from them after collection, and can be text or numerical values. Below are some key data terms:

| Term | Description |
|------|-------------|
| Raw data | Measurements or survey values from environmental or laboratory observation that have not been modified. |
| Metadata | The data that describes your data. It assists with understanding the contents or context of a dataset. |
| Variables | The dependent or explanatory factors that are being observed. They are commonly referred to as fields or column headers in tables/spreadsheets. |
| Dataset | A collection of data. It can consist of one or more data tables (files) containing data observations, with multiple variables and values for each observation. |

File Formats

Data files can take on various formats to support the data stored in each dataset. Formats you will commonly find on CanWIN’s site are:

| Format | Description |
|--------|-------------|
| .txt | Text file that contains letters, numbers, and symbols organized in lines of text or sentence structure. |
| .csv | Comma-separated values (CSV): a simple text file containing rows of data, with values separated by commas, that can be viewed using various programs (e.g. Microsoft Excel). |
| .xls or .xlsx | An Excel file or spreadsheet with data organized in tables, primarily used with Microsoft Excel. |
| .gsheet | Google spreadsheet file that is similar in structure and organization to a Microsoft Excel file. |
| .nc (NetCDF) | Network Common Data Form: stores data in multidimensional arrays, holding not only variables and their values but also information (metadata) about the data it contains. |

Good Research Data Management (RDM) Practices

Good RDM practices are essential to keeping data organized and minimizing errors, data loss, and ambiguous data. These practices support your data being FAIR (Findable, Accessible, Interoperable, and Reusable) once it is published.

1. Choose your format carefully

Many researchers work with Excel files during their data analysis and processing, as Excel has powerful analytical features and handles complex data well. However, publishing XLSX or XLS files to online repositories may not be the best option, as they are proprietary file formats that are not universally compatible or easily read by other programs and tools. CSV files, on the other hand, are supported by nearly all interfaces where you upload data, because of their simple, machine-readable format. This makes your data interoperable.

Best practice

  • Export your XLSX files to CSV at the end of your analysis, saving each sheet as a CSV file.
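The export step can also be scripted. The following is a minimal sketch, assuming the third-party packages pandas and openpyxl are installed; the function names `csv_name_for_sheet` and `export_sheets_to_csv` are hypothetical and not part of any CanWIN tooling.

```python
import re
from pathlib import Path


def csv_name_for_sheet(workbook_path, sheet_name):
    """Build a CSV file name from a workbook name and one of its sheet names."""
    stem = Path(workbook_path).stem
    # Replace runs of anything other than letters, digits, or hyphens with "_".
    safe_sheet = re.sub(r"[^A-Za-z0-9-]+", "_", sheet_name).strip("_")
    return f"{stem}_{safe_sheet}.csv"


def export_sheets_to_csv(workbook_path, out_dir):
    """Write every sheet of an XLSX workbook to its own CSV file."""
    # Assumption: pandas and openpyxl are installed. The import sits here so
    # the naming helper above can be used without them.
    import pandas as pd

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # sheet_name=None asks pandas for a dict of {sheet name: DataFrame}.
    sheets = pd.read_excel(workbook_path, sheet_name=None)
    written = []
    for sheet_name, frame in sheets.items():
        target = out_dir / csv_name_for_sheet(workbook_path, sheet_name)
        frame.to_csv(target, index=False)  # index=False omits the row-number column
        written.append(target)
    return written
```

Formulas, cell formatting, and charts do not survive the export, so it is worth checking each CSV against its sheet before publishing.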

2. Versioning your data

Versioning helps you keep track of the changes you have made along the life cycle of your data. This helps with backtracking if need be, and prevents any loss of information from one version to the next. In the simplest sense, you can perform data versioning by saving your raw files in a separate folder from your processed files (which itself could contain several folders based on each processing step). More advanced versioning within software programs and platforms (think Google Sheets, GitLab) allows you to see the exact changes that were made and when.

Best practices

  • Organize your data on your own computer; make separate copies for each version and store in appropriately labelled folders.

  • Talk to the curators at CanWIN to help you set up a GitLab repository with an appropriate folder structure for your files.
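For folder-based versioning on your own computer, one possible approach is sketched below. It assumes a hypothetical naming scheme of numbered folders (v001_raw, v002_qc_flags, and so on); the function name `save_version` and the folder labels are illustrative only.

```python
import shutil
from pathlib import Path


def save_version(source_file, versions_dir, label):
    """Copy source_file into a new numbered folder such as v002_qc_flags."""
    source_file = Path(source_file)
    versions_dir = Path(versions_dir)
    versions_dir.mkdir(parents=True, exist_ok=True)
    # Count existing version folders to choose the next number.
    existing = [p for p in versions_dir.iterdir()
                if p.is_dir() and p.name.startswith("v")]
    target_dir = versions_dir / f"v{len(existing) + 1:03d}_{label}"
    target_dir.mkdir()
    # copy2 keeps the file's modification time along with its contents.
    return Path(shutil.copy2(source_file, target_dir / source_file.name))
```

Each call leaves earlier versions untouched, so the raw file stays available for backtracking. A platform such as GitLab additionally records what changed between versions, which plain copies do not.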

3. Creating proper spreadsheets

Creating proper spreadsheets or data tables is important from the moment you begin data collection. Having well-structured and formatted data tables will save you significant time and energy in the future, as they will need to meet certain criteria before they can be published to an online repository.

Best practices

  • Arrange your variables as the columns and your samples as the rows. This will make it easier to transfer your data into software or an online platform for manipulation or visualization.

  • A column should contain a single variable, a row should contain a single observation, and a cell should contain a single value.

  • Ensure all variable names are consistent in all your data files (check your spelling!).

  • Maintain the same column order in each table.

  • Remove unnecessary columns. If there are columns that are not valuable for data analysis or for sharing, they should be removed.

  • Do not use spaces or special characters in column headers (use underscores instead).

  • Use consistent NULL values. The most software-compatible option is to leave the cell blank.

  • Ensure that each data variable has a consistent format, e.g., all dates should be in the same format.

  • Do not mix data types within the same column, e.g., do not mix numeric and non-numeric values.

  • If the data in a column are ranges, try to separate that column into two columns. For example, instead of one column labelled Temperature with values like 10-15, opt for two columns - Temperature_low containing 10 and Temperature_high containing 15.

  • Use consistent codes across data variables and files.

  • Use consistent units of measurement.

  • Do not leave rows or columns empty.

  • Do not merge cells.
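Several of the rules above can be checked automatically before a table is shared. The sketch below uses only the Python standard library and covers three of them: header characters, empty rows, and mixed data types. It also shows one way to split a range value. The function names are hypothetical, and treating blank cells as NULL is an assumption based on the guidance above.

```python
import csv
import io
import re

HEADER_PATTERN = re.compile(r"^[A-Za-z0-9_]+$")  # letters, digits, underscores only


def is_number(value):
    """Return True if the text can be read as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def split_range(value):
    """Split a range such as '10-15' into ('10', '15').

    Limitation: a leading minus sign (e.g. '-5-10') is not handled.
    """
    low, high = value.split("-", 1)
    return low.strip(), high.strip()


def check_table(csv_text):
    """Return a list of human-readable problems found in a CSV table."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    problems = []
    for name in header:
        if not HEADER_PATTERN.match(name):
            problems.append(f"header {name!r} has spaces or special characters")
    for line_no, row in enumerate(body, start=2):
        if not any(cell.strip() for cell in row):
            problems.append(f"line {line_no} is empty")
    for index, name in enumerate(header):
        # Blank cells are treated as NULL and skipped.
        values = [row[index] for row in body
                  if index < len(row) and row[index].strip()]
        kinds = {is_number(v) for v in values}
        if len(kinds) > 1:
            problems.append(f"column {name!r} mixes numeric and non-numeric values")
    return problems
```

For example, `check_table("site id,temp_c\nA1,10.5\nA2,n/a\n")` reports the header containing a space and the mixed `temp_c` column, while `split_range("10-15")` gives the two values to place in Temperature_low and Temperature_high.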

4. File and variable names

It is important to adopt certain conventions when naming files, as well as the column names in spreadsheets. This makes it easy to identify which files contain what information and their relationship to other files. Proper naming conventions also keep your files digitally compatible, eliminating the risk of files not being read by certain software or tools.

Variable standardization

Controlled vocabularies for variables are established terms that standardize the definitions of variables used in research data. This benefits researchers by reducing ambiguity in their data, increasing its reusability by other experts in their field, and allowing data to be shared and combined cohesively. Below are the core vocabularies that CanWIN uses to standardize variables.

| Vocabulary | Description |
|------------|-------------|
| BODC | The British Oceanographic Data Centre (BODC) provides a variety of vocabularies focused on oceanographic data, instruments, and other parameters measured at sea. |
| CF | The Climate and Forecast (CF) conventions provide a vocabulary focused on meteorological data. |
| WQX | Created by the U.S. Environmental Protection Agency (EPA), the Water Quality Exchange (WQX) schema is used to format data for submission to the EPA. This vocabulary focuses on water quality variables and provides templates for formatting data. |
| CanWIN | Where variables aren't captured in the vocabularies within this table, CanWIN creates its own internal terms for describing instruments, platforms, or variables. |
| Ocean Acidification Terminology | Words to describe datasets related to ocean acidification. |

Best practices

  • Use a standardized variable name for variables where applicable. Talk to a data curator for help with finding an appropriate standardized name.

  • Avoid the use of capitals in naming variables, unless it is standard practice (e.g. standardized vocabulary or units such as pH).

  • Keep file names to 32 characters or fewer (recommended).

  • Avoid the use of special characters when naming both files and variables, such as: < > : " / \ | ? * { } # & $ ! = ` '

  • Utilize underscores (_) and hyphens (-) when needed to append additional information to your file name, or in place of spaces between words.
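The naming rules above lend themselves to a quick automated check. The following is a minimal sketch using only the Python standard library; `clean_name` and `file_name_problems` are hypothetical helpers. The allowed character set (letters, digits, underscores, hyphens, and dots) is an assumption drawn from the list above.

```python
import re

MAX_FILE_NAME_LENGTH = 32  # recommended limit from the best practices above


def clean_name(text):
    """Lowercase text, turn spaces into underscores, and drop special characters.

    Note: this lowercases everything, so standardized names that require
    capitals (such as pH) would need to be restored by hand.
    """
    text = text.strip().lower().replace(" ", "_")
    # Keep only letters, digits, underscores, hyphens, and dots.
    return re.sub(r"[^a-z0-9_.-]", "", text)


def file_name_problems(file_name):
    """Return a list of naming problems for a single file name."""
    problems = []
    if len(file_name) > MAX_FILE_NAME_LENGTH:
        problems.append("longer than 32 characters")
    if re.search(r"[^A-Za-z0-9_.-]", file_name):
        problems.append("contains spaces or special characters")
    return problems
```

For instance, `clean_name("Lake Winnipeg: Temp #2.csv")` returns `lake_winnipeg_temp_2.csv`, which passes `file_name_problems` with no problems reported.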

5. Date and Time

Unambiguous interpretation of date and time information requires specification of the time zone or offset from universal time (UTC).

Best practices

  • Convert your date and time values to the ISO 8601 date and time format, which is an international standard: yyyy-mm-ddThh:mm:ss, followed by the UTC offset (e.g. 2021-06-15T14:30:00-05:00).

  • To avoid problems in recording times for a sensor, CanWIN recommends ODM best practices.
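As an illustration of the conversion, the sketch below parses a locally formatted timestamp and writes it in the extended ISO 8601 form (with colons in the time) together with its UTC offset. It uses only the Python standard library. The function names, the input format, and the example offset of -5 hours are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def to_iso8601(text, input_format, utc_offset_hours):
    """Parse a local timestamp and return it as ISO 8601 with its UTC offset."""
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    # strptime yields a naive datetime; replace() attaches the known offset.
    stamp = datetime.strptime(text, input_format).replace(tzinfo=local_zone)
    return stamp.isoformat()


def to_utc_iso8601(text, input_format, utc_offset_hours):
    """Parse a local timestamp and return the same instant expressed in UTC."""
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    stamp = datetime.strptime(text, input_format).replace(tzinfo=local_zone)
    return stamp.astimezone(timezone.utc).isoformat()


print(to_iso8601("15/06/2021 14:30:00", "%d/%m/%Y %H:%M:%S", -5))
# 2021-06-15T14:30:00-05:00
print(to_utc_iso8601("15/06/2021 14:30:00", "%d/%m/%Y %H:%M:%S", -5))
# 2021-06-15T19:30:00+00:00
```

A fixed offset is used here for simplicity. Where daylight saving time applies, the offset changes during the year, so recording in UTC or using a named time zone is less error-prone.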