Research Data Management

“Version control for data science and machine learning with DVC”

Webinar (2023-Dec-12) with Marie-Hélène Burle

Data version control (DVC) is an open-source tool that brings the versioning and collaboration capabilities you use on your code with Git to your data and machine learning workflow. If you use datasets in your work, it makes it easy to track their evolution. If you are in the field of machine learning, it additionally allows you to track your models, manage your pipelines from parameters to metrics, collaborate on your experiments, and integrate with CML, the continuous integration tool for machine learning projects. This webinar shows how to get started with DVC, first in the simple case where you just want to put your data under version control, then in the more complex situation where you want to manage your machine learning workflow in a more organized and reproducible fashion.


“Managing large hierarchical datasets with PyTables”

Webinar (2023-May-23) with Alex Razoumov

PyTables is a free and open-source Python library for managing large hierarchical datasets. It is built on top of NumPy and the HDF5 scientific dataset library, and it focuses on both performance and interactive analysis of very large datasets. For large data streams (think multi-dimensional arrays or billions of records), it outperforms databases in terms of speed, memory usage, and I/O bandwidth. That said, PyTables is not a replacement for traditional relational databases because it does not support broad relationships between dataset variables. PyTables can even be used to organize a workflow with many (thousands to millions of) small files, as you can create a PyTables database of nodes that can be used like regular open files in Python. This lets you store a large number of arbitrary files in a PyTables database with on-the-fly compression, making it very efficient for handling huge amounts of data.
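
As a rough illustration of the hierarchical storage and on-the-fly compression described above, here is a minimal sketch that is not taken from the webinar. It assumes NumPy and PyTables are installed; the file name, the group name `simulation`, and the array name `density` are hypothetical and chosen only for this example.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical file location and sample data, for illustration only.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
data = np.arange(1_000_000, dtype=np.float64).reshape(1000, 1000)

# zlib compression is applied chunk by chunk as the data are written.
filters = tables.Filters(complevel=5, complib="zlib")

with tables.open_file(path, mode="w", title="Demo file") as h5:
    # Groups play the role of directories in the hierarchy.
    group = h5.create_group("/", "simulation", "Hypothetical run output")
    # A compressed, chunked array stored under /simulation/density.
    h5.create_carray(group, "density", obj=data, filters=filters)

with tables.open_file(path, mode="r") as h5:
    node = h5.root.simulation.density
    stored_shape = node.shape
    # Slicing a node reads only the requested region from disk.
    block = node[10:12, :5]
```

Reading a slice rather than the whole node is what keeps memory usage low when the stored array is much larger than the part you need.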


“Distributed datasets with DataLad”

Webinar (2023-Mar-28) with Alex Razoumov

This webinar provides a more beginner-oriented tutorial on version control of large data files with DataLad. Please note that not everything fit into the 50-minute presentation, but the notes below contain everything. We start with a textbook introduction to DataLad, showing its main features on top of Git and git-annex. Next we demonstrate several simple but useful workflows:

  1. two users on a shared cluster filesystem working with the same dataset stored in /project,
  2. one user, one dataset spread over multiple drives, with data redundancy,
  3. publishing a dataset on GitHub with annexed files in a special private remote,
  4. publishing a dataset on GitHub with publicly-accessible annexed files on the Alliance’s Nextcloud, and
  5. managing multiple Git repositories under one dataset.

“How to create and access MySQL and PostgreSQL databases on DRI systems”

Webinar (2023-Feb-28) with Gemma Hoad


“Data management with DataLad”

Webinar (2023-Feb-14) with Ian Percel

This talk is a brief introduction to version controlling data and data processing workflows. Three illustrative use cases, taken from neuroimaging, geophysics, and workflows for analyzing housing data respectively, are used to introduce the main concepts of Git-based file management, collaboration, and analysis.


“Hiding large numbers of files in container overlays”

Webinar (2023-Jan-17) by Alex Razoumov

Many unoptimized HPC cluster workflows write large numbers of files to distributed filesystems, which can create significant performance problems for these shared filesystems. One way to alleviate this is to organize write operations inside a persistent overlay directory attached to an immutable read-only container with your scientific software. The output files are stored separately from the base container image, and to the host filesystem an overlay appears as a single large file. In this presentation, we demo running parallel OpenFOAM simulations where all output goes into overlay images, and the total number of files on the host filesystem is reduced from several million to several dozen or fewer. The same approach can be used in post-processing and visualization, where you can read simulation data from multiple overlays both in serial and in parallel. In this webinar we walk you through all stages of creating and using overlays. We assume no prior knowledge of container technology.


“Linking databases to code repositories with Throughput”

Webinar (2021-Mar-03) by Simon Goring


“Automating your backups in Linux and MacOS”

Webinar (2021-Feb-17) by Alex Razoumov


“Working with multidimensional datasets in xarray”

Webinar (2020-Sep-30) by Alex Razoumov


“File access control approaches and best practices”

Webinar (2019-Oct-30) by Sergiy Stepanenko


“Managing many files with Disk ARchiver (DAR)”

Webinar (2019-May-01) by Alex Razoumov


“Research Data Management Tools, Platforms, and Best Practices for Canadian Researchers”

Webinar (2019-Mar-20) by Alex Garnett and Adam McKenzie