Best Practices for Data Management¶

A Deeper Dive into Data¶

Data consist of observations or instrument measurements, or values calculated from them after collection, and can be text or numerical values. Below are some key data terms:

Term	Description
Raw data	Measurements or survey values from environmental or laboratory observation that have not been modified.
Metadata	The data that describes your data. It assists with understanding the contents or context of a dataset.
Variables	The dependent or explanatory factors that are being observed. They are commonly referred to as fields or column headers in tables/spreadsheets.
Dataset	A collection of data. It can consist of one or more data tables (files) containing data observations, with multiple variables and values for each observation.

File Formats¶

Data files can take on various formats to support the data stored in each dataset. Formats you will commonly find on CanWIN’s site are:

✔️ .csv - Comma-separated values file (preferred format)

✔️ .txt - Text file

✔️ .xls or .xlsx - Excel files

✔️ .nc - Network Common Data Form (NetCDF) files

✔️ .gsheet - Google Sheets files

Good Research Data Management (RDM) Practices¶

Good RDM practices are essential to keeping data organized, and they minimize the occurrence of errors, data loss and ambiguous data. These practices support your data being FAIR (findable, accessible, interoperable, and reusable) once it is published.

1. Choose your format carefully¶

Many researchers work with Excel files during data analysis and processing, as Excel has powerful analytical features and handles complex data well. However, publishing XLSX or XLS files to online repositories may not be the best option: these formats are tied to proprietary software and are not universally compatible with, or easily read by, other programs and tools. CSV files, on the other hand, are supported by nearly all interfaces where you upload data because of their simple, machine-readable format. This makes your data interoperable.

Best practice¶

Export your XLSX files to CSV at the end of your analysis, saving each sheet as a CSV file.
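For workbooks with many sheets, the export can be scripted. The following is a minimal sketch, assuming the pandas library (with openpyxl for .xlsx files) is installed; the function names and the "workbook_sheet.csv" naming pattern are illustrative choices, not a CanWIN requirement.

```python
from pathlib import Path


def csv_name_for_sheet(workbook: Path, sheet_name: str) -> str:
    """Build a CSV file name from the workbook name and the sheet name."""
    stem = f"{workbook.stem}_{sheet_name}".lower()
    # Replace runs of whitespace with underscores so the name stays tool-friendly.
    return "_".join(stem.split()) + ".csv"


def export_sheets_to_csv(workbook: Path, out_dir: Path) -> list[Path]:
    """Write every sheet of an Excel workbook to its own CSV file."""
    # pandas is imported here so the naming helper above can be used
    # even when pandas is not installed.
    import pandas as pd

    out_dir.mkdir(parents=True, exist_ok=True)
    # sheet_name=None loads all sheets as a dict of {sheet name: DataFrame}.
    sheets = pd.read_excel(workbook, sheet_name=None)
    written = []
    for sheet_name, frame in sheets.items():
        target = out_dir / csv_name_for_sheet(workbook, sheet_name)
        frame.to_csv(target, index=False)  # index=False omits the pandas row index
        written.append(target)
    return written
```

Calling `export_sheets_to_csv(Path("Survey 2021.xlsx"), Path("csv_export"))` would write one CSV per sheet into `csv_export`. Formulas, cell formatting and charts are not carried over, so check each exported file before publishing it.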

2. Versioning your data¶

Versioning helps you keep track of the changes you have made along the life cycle of your data. It lets you backtrack if need be, and prevents loss of information from one version to the next. In the simplest sense, you can version your data by saving your raw files in a separate folder from your processed files (which could itself contain several folders, one for each processing step). More advanced versioning within software programs and platforms (think Google Sheets, GitLab) allows you to see the exact changes that were made and when.

Best practices¶

Organize your data on your own computer; make separate copies for each version and store in appropriately labelled folders.
Talk to the curators at CanWIN to help you set up a GitLab repository with an appropriate folder structure for your files.
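The folder-based approach can be automated with a short script. This is a sketch only, assuming a hypothetical layout in which each version lives in a numbered folder (v1, v2, ...); the function names and folder scheme are illustrative and not a CanWIN convention.

```python
import hashlib
import shutil
from pathlib import Path


def file_checksum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def save_version(source: Path, versions_dir: Path) -> Path:
    """Copy `source` into the next numbered folder, unless it is unchanged.

    Returns the path of the stored copy (the existing one if nothing changed).
    """
    versions_dir.mkdir(parents=True, exist_ok=True)
    # Collect the numbers of existing version folders such as v1, v2, v10.
    numbers = sorted(
        int(p.name[1:])
        for p in versions_dir.iterdir()
        if p.is_dir() and p.name.startswith("v") and p.name[1:].isdecimal()
    )
    if numbers:
        latest = versions_dir / f"v{numbers[-1]}" / source.name
        # Skip the copy when the newest stored version is byte-identical.
        if latest.exists() and file_checksum(latest) == file_checksum(source):
            return latest
    next_number = numbers[-1] + 1 if numbers else 1
    target_dir = versions_dir / f"v{next_number}"
    target_dir.mkdir()
    # copy2 also preserves file timestamps where the platform allows it.
    return Path(shutil.copy2(source, target_dir / source.name))
```

Earlier copies are never overwritten, so the raw file stored in v1 stays intact however many processing steps follow. For a record of exactly what changed between versions, a platform such as GitLab is the better fit.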

3. Creating proper spreadsheets¶

Creating proper spreadsheets or data tables is important from the moment you begin data collection. Having well-structured and formatted data tables will save you significant time and energy in the future, as they will need to meet certain criteria before they can be published to an online repository.

Best practices¶

4. File and variable names¶

It is important to adopt certain conventions when naming files, as well as the column names in spreadsheets. This makes it easy to identify what information each file contains and how it relates to other files. Proper naming conventions also keep your files digitally compatible, reducing the risk of files not being read by certain software or tools.

Variable standardization¶

Controlled vocabularies for variables are established terms that standardize the definitions of variables used in research data. This benefits researchers by reducing ambiguity in their data, increasing its reusability by other experts in their field, and allowing data to be shared and combined cohesively. Below are the core vocabularies that CanWIN uses to standardize variables.

Vocabulary	Description
BODC	The British Oceanographic Data Centre (BODC) provides a variety of vocabularies which are focused on oceanographic data, instruments, and other parameters measured on the seas.
CF	The Climate and Forecast (CF) conventions provide a vocabulary focused on meteorological data.
WQX	The Water Quality Exchange (WQX), created by the U.S. Environmental Protection Agency (EPA), is a schema for formatting data to be submitted to the EPA. This vocabulary focuses on water quality variables and provides templates for formatting data.
CanWIN	Where variables aren't captured in the vocabularies within this table, CanWIN creates its own internal terms for describing instruments, platforms, or variables.
Ocean Acidification Terminology	Words to describe datasets related to Ocean Acidification.
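In practice, standardizing variables often means renaming your own column headers to terms from one of these vocabularies. The sketch below assumes a hypothetical set of local headers ("Temp", "Sal", "AirT") mapped to CF standard names; the mapping is for illustration only, and the right term for a given measurement should be confirmed with a data curator.

```python
# Hypothetical mapping from a project's own column headers to CF standard names.
# The local headers on the left are invented for this example.
COLUMN_MAP = {
    "Temp": "sea_water_temperature",
    "Sal": "sea_water_practical_salinity",
    "AirT": "air_temperature",
}


def standardize_header(header: list[str], column_map: dict[str, str]) -> list[str]:
    """Swap known local column names for standardized ones; keep the rest as-is."""
    return [column_map.get(name.strip(), name.strip()) for name in header]


def unmapped_columns(header: list[str], column_map: dict[str, str]) -> list[str]:
    """List columns that have no standardized term yet."""
    standardized = set(column_map.values())
    return [
        name.strip()
        for name in header
        if name.strip() not in column_map and name.strip() not in standardized
    ]
```

Running `unmapped_columns` on a header row gives a short list of names to raise with a curator, which may end up as CanWIN internal terms if no existing vocabulary covers them.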

Best practices¶

Use a standardized variable name for variables where applicable. Talk to a data curator for help with finding an appropriate standardized name.
Avoid the use of capitals in naming variables, unless it is standard practice (e.g. standardized vocabulary or units such as pH).
Keep file names to 32 characters or fewer (recommended).
Avoid the use of special characters when naming both files and variables, such as: < > : " / \ | ? * { } # & $ ! = ` '
Utilize underscores (_) and hyphens (-) when needed to append additional information to your file name, or in place of spaces between words.
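The naming rules above can be checked automatically before files are shared. This is a minimal sketch under two assumptions that the guidance does not state: the 32-character limit is taken to include the file extension, and only ASCII letters, digits, underscores, hyphens and dots are treated as safe, which is stricter than the character list above. The function names are illustrative.

```python
import re

MAX_FILE_NAME_LENGTH = 32  # assumed to include the extension
# Assumed safe set: letters, digits, underscores, hyphens, and dots only.
SAFE_FILE_NAME = re.compile(r"[A-Za-z0-9_.-]+")


def file_name_problems(name: str) -> list[str]:
    """Return a list of naming problems; an empty list means the name passes."""
    problems = []
    if len(name) > MAX_FILE_NAME_LENGTH:
        problems.append(f"longer than {MAX_FILE_NAME_LENGTH} characters")
    if " " in name:
        problems.append("contains spaces")
    # Spaces are reported separately, so ignore them in the character check.
    if not SAFE_FILE_NAME.fullmatch(name.replace(" ", "")):
        problems.append("contains special characters")
    return problems


def suggest_variable_name(label: str) -> str:
    """Lowercase a label and join its words with underscores.

    Everything is lowercased, so names that are capitalized by convention
    (such as pH) need to be restored by hand.
    """
    words = re.findall(r"[A-Za-z0-9]+", label)
    return "_".join(word.lower() for word in words)
```

For example, `file_name_problems("my data #1.csv")` reports both the space and the `#`, while a name like `lake_winnipeg_ctd_2019.csv` returns an empty list.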

5. Date and Time¶

Unambiguous interpretation of date and time information requires specification of the time zone or the offset from Coordinated Universal Time (UTC).

Best practices¶

Convert your dates and times to the ISO 8601 format, an international standard: yyyy-mm-ddThh:mm:ss, followed by the UTC offset (for example, 2021-07-15T14:30:00-05:00).
To avoid problems in recording times for a sensor, CanWIN recommends ODM best practices.
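A conversion to ISO 8601 can be done with Python's standard library. The sketch below assumes the original timestamps are text in month/day/year order and that the UTC offset of the recording is known and fixed; both the input format and the function names are illustrative, and a fixed offset does not account for daylight saving changes within a dataset.

```python
from datetime import datetime, timedelta, timezone

ASSUMED_INPUT_FORMAT = "%m/%d/%Y %H:%M:%S"  # e.g. 07/15/2021 14:30:00


def _parse_with_offset(local_text: str, utc_offset_hours: float, fmt: str) -> datetime:
    """Parse a local timestamp and attach its fixed UTC offset."""
    naive = datetime.strptime(local_text, fmt)
    return naive.replace(tzinfo=timezone(timedelta(hours=utc_offset_hours)))


def to_iso8601(local_text: str, utc_offset_hours: float,
               fmt: str = ASSUMED_INPUT_FORMAT) -> str:
    """Return the timestamp in ISO 8601 form, keeping the local time and offset."""
    aware = _parse_with_offset(local_text, utc_offset_hours, fmt)
    return aware.isoformat(timespec="seconds")


def to_utc_iso8601(local_text: str, utc_offset_hours: float,
                   fmt: str = ASSUMED_INPUT_FORMAT) -> str:
    """Return the same instant expressed in UTC, in ISO 8601 form."""
    aware = _parse_with_offset(local_text, utc_offset_hours, fmt)
    return aware.astimezone(timezone.utc).isoformat(timespec="seconds")


print(to_iso8601("07/15/2021 14:30:00", -5))      # 2021-07-15T14:30:00-05:00
print(to_utc_iso8601("07/15/2021 14:30:00", -5))  # 2021-07-15T19:30:00+00:00
```

Both outputs describe the same moment; the first keeps the local clock time with its offset, and the second expresses it in UTC.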