Version Control in Machine Learning Using Data Version Control

Srikant Panda
Dec 7, 2020

Version control, also known as source control, is the practice of tracking and managing changes to software code. Version control tools such as Git, together with hosting services such as Bitbucket, help us manage changes to source code over time. These systems let us create different versions of the code: each version captures a snapshot of the files at a certain point in time, and we can switch between versions. Machine learning differs from most other software engineering, however, because the end product depends not only on the source code but also on the input data, which is often large in size.

Data Version Control, or DVC, is an open-source version control system that helps machine learning projects track and version input data and models, which are often large in size. Apart from versioning, DVC also helps to chain the various steps into pipelines so that experiments are fast and reproducible. DVC takes advantage of the existing engineering toolset, such as Git. Specifically, DVC relies on Git to store the meta-information for each version, such as the location of the corresponding files, whereas DVC itself caches and stores the large files.

This walkthrough covers version control with DVC. An upcoming article will show how we can use DVC to build reproducible machine learning pipelines.

Data & Model Versioning:

The following section covers how DVC can be used to version large files, such as models and data, which are not recommended to be saved in Git or Bitbucket repositories.

We can install DVC with pip:

pip install dvc

Next, we create our project folder and change into it:

mkdir dvc_versioning_demo
cd dvc_versioning_demo

By default, DVC expects Git to be installed and the project folder to be a Git repository. So we create an empty Git repo and then initialise DVC:

git init
dvc init

With the init command, DVC has created a .dvc folder containing a config file and a few other files that support the cache. Alongside the .dvc folder it has created .dvcignore, which is similar to .gitignore and lists the files and/or directories that should be excluded when traversing a DVC workspace. If the folder is not version controlled with Git, initialising DVC fails with an error (unless DVC is initialised with the --no-scm option).
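The Git requirement can be checked before initialising DVC. Here is a minimal sketch of such a guard; the helper name in_git_repo is hypothetical and not part of Git or DVC, and the snippet only prints a hint rather than running dvc init itself.

```shell
# Hypothetical helper: succeeds only when the current folder is inside a Git work tree.
in_git_repo() {
  git rev-parse --is-inside-work-tree >/dev/null 2>&1
}

if in_git_repo; then
  echo "Git repository found; it is safe to run: dvc init"
else
  echo "Not a Git repository; run 'git init' first"
fi
```

Running it once before dvc init avoids the initialisation error described above.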

We need to configure where DVC saves the actual snapshots of the tracked files. This can be a local folder or a remote system such as S3, GCS or Azure storage. Let's use a local folder on our system that will act like a remote machine:

dvc remote add -d myremote /temp/remote_dvc_versioning_demo

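Alternatively, we can use Azure storage as our default remote. The remote's url must then point to an Azure container (an azure:// location), and the connection string can be set as below. With this, DVC will start using the Azure storage space for storing the different versions of the files: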

dvc remote modify myremote connection_string "my-connection-string"

Replace "my-connection-string" with the actual connection string, found in the Azure portal under Storage account -> Access keys -> Connection string.

Figure: Accessing the connection string from the Azure portal

Now create the necessary files as below:

mkdir data src model
echo "data_version-1" > data/data.csv
echo "src_version-1" > src/src.py
echo "model_version-1" > model/model.pkl

At present our file structure looks like this:

dvc_versioning_demo
-- data
---- data.csv
-- src
---- src.py
-- model
---- model.pkl

For DVC to be able to track a file or directory, we have to add it, as below for both the model and data folders:

dvc add model/ data/

DVC adds a new .dvc file for each of the folders data and model. Each file is named after the path it tracks (e.g. data.dvc and model.dvc). These files contain the information needed to track the data with DVC. The folders themselves have been added to .gitignore so that Git doesn't track them. The .dvc files use a simple YAML format and contain information such as the path, size and number of files. To track the new .dvc files with Git, we follow the standard Git procedure for a commit:

git add .
git commit -m "version-1"
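To make the idea of a small pointer file concrete, here is a rough shell sketch that hashes a data file and writes a YAML-style record next to it. It only imitates the concept and is not how DVC is implemented: the temporary directory, the hand-written data.csv.dvc file and the use of md5sum (GNU coreutils) are assumptions for illustration, and the .dvc files DVC writes for directories carry additional fields.

```shell
# Illustration only: imitate a pointer file with plain shell tools.
workdir=$(mktemp -d)
printf 'data_version-1\n' > "$workdir/data.csv"

# Hash the content; only this small record would be committed to Git, not the data itself.
hash=$(md5sum "$workdir/data.csv" | cut -d' ' -f1)
size=$(wc -c < "$workdir/data.csv" | tr -d ' ')

# Write a tiny YAML-style pointer beside the data file (hypothetical file name).
printf 'outs:\n- md5: %s\n  size: %s\n  path: data.csv\n' "$hash" "$size" > "$workdir/data.csv.dvc"

cat "$workdir/data.csv.dvc"
```

Because the record holds only a hash, a size and a path, it stays a few lines long no matter how large the data file is, which is what makes it suitable for Git.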

To store the data on the remote, we use dvc push:

dvc push

Now we can update all three files:

echo "data_version-2" > data/data.csv
echo "src_version-2" > src/src.py
echo "model_version-2" > model/model.pkl

After the modification, we again need to add the modified files to DVC:

dvc add model/ data/

Now the updated .dvc files can be tracked with Git, following the standard Git procedure for a commit:

git add .
git commit -m "version-2"

Again, we update the data on the remote with:

dvc push

Now, when we need to switch back to the version-1 commit, we can use git checkout, which retrieves the corresponding old source code and .dvc files.

git checkout commit_ID
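The commit_ID placeholder can be looked up from the commit message. The sketch below assumes commit messages are unique and builds a throwaway repository so that it runs without DVC; the temporary path, the demo Git identity and the single src.py file are hypothetical stand-ins for the real project.

```shell
# Throwaway repository so the snippet can run on its own.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

echo "src_version-1" > src.py
git add src.py
git commit -q -m "version-1"

echo "src_version-2" > src.py
git commit -q -am "version-2"

# Find the hash of the commit whose message line is exactly "version-1".
commit_id=$(git log --format=%H --grep='^version-1$' -n 1)

# Check out that commit; in the DVC project this would be followed by: dvc checkout
git checkout -q "$commit_id"
cat src.py
```

In the real project, git log --oneline lists the same hashes alongside their messages, and a Git tag per version would be an alternative to searching by message.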

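To retrieve the corresponding data and model files, we use dvc checkout, which reads the .dvc files and restores the matching data and model files from the local cache. If those files are no longer in the cache, dvc pull downloads them from the remote first.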

dvc checkout

In this case the checkout has retrieved version-1. We can now work with the old data version just as usual. This procedure also works for data versions containing a lot more data. The checkout is of course also possible in the other direction: we could check out the version-2 Git commit and run dvc checkout to set the data and model directories to their version-2 state. DVC stores each version of a file separately in the remote location when the file content has changed. If the file content has not changed, DVC does not create a separate copy of the file and instead reuses the previously stored version.
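The "one copy per distinct content" behaviour can be pictured with a toy content-addressed store. This is only a sketch of the idea, not DVC's actual cache layout: the save_version helper and the temporary directories are hypothetical, and md5sum from GNU coreutils is assumed to be available.

```shell
# Illustration only: a toy store that keeps one object per distinct file content.
store=$(mktemp -d)
work=$(mktemp -d)

# Hypothetical helper: copy a file into the store under its content hash.
save_version() {
  h=$(md5sum "$1" | cut -d' ' -f1)
  [ -e "$store/$h" ] || cp "$1" "$store/$h"   # skip the copy if this content is already stored
  echo "$h"
}

printf 'data_version-1\n' > "$work/data.csv"
v1=$(save_version "$work/data.csv")

printf 'data_version-2\n' > "$work/data.csv"
v2=$(save_version "$work/data.csv")

printf 'data_version-1\n' > "$work/data.csv"   # same content as the first version
v3=$(save_version "$work/data.csv")

# Count the stored objects: three saves, but only two distinct contents.
ls "$store" | wc -l
```

The first and third saves return the same hash, so the store ends up holding two objects rather than three, which mirrors how unchanged content is not stored again.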
