Version Control in Machine Learning Using Data Version Control
Version control, also known as source control, is the practice of tracking and managing changes to software code. Tools such as Git, together with hosting services such as Bitbucket, help us manage changes to source code over time. These systems let you create different versions of the code: each version captures a snapshot of the files at a certain point in time, and you can switch between these versions. Machine Learning differs from other software engineering in that its end product depends not only on source code but also on the input data, which is often large in size.
Data Version Control, or DVC, is an open-source version control system that helps Machine Learning projects track and version input data and models, which are often large in size. Apart from versioning, DVC also helps to pipeline the various steps and to run experiments quickly and reproducibly. DVC takes advantage of the existing engineering toolset, such as Git. DVC relies on Git to store the meta information of each version, such as the location of the corresponding files, whereas DVC itself caches and stores the large files.
This walkthrough covers how to do version control with DVC. An upcoming article will show how we can use DVC to build reproducible Machine Learning pipelines.
Data & Model Versioning:
The following section covers how DVC can be used to version large files, such as models and data, which are not recommended to be saved in a Git repository (for example one hosted on Bitbucket).
We can install DVC with pip
pip install dvc
Next we create our project folder and change directory into it
mkdir dvc_versioning_demo
cd dvc_versioning_demo
A prerequisite for DVC is to have Git or a similar SCM system installed and configured for the folder. So we create an empty Git repo and then initialise DVC
git init
dvc init
With the init command, DVC has created a .dvc folder containing an empty config file and a few other cache-supporting files. Along with the .dvc folder it has created .dvcignore, which is similar to .gitignore and lists the files and/or directories that should be excluded when traversing a DVC workspace. If the folder is not version controlled with Git or similar, initialising DVC fails with a "failed to initiate" error (unless DVC is explicitly told to run without an SCM via dvc init --no-scm).
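As a small illustration, .dvcignore accepts the same pattern syntax as .gitignore. The patterns below are hypothetical examples, not entries from this project, and the sketch assumes it is run from the project root.

```shell
# Append two hypothetical patterns to .dvcignore (run from the project root).
printf '%s\n' \
    '# files DVC should skip while traversing the workspace' \
    '*.tmp' \
    'scratch/' >> .dvcignore

# Show the resulting ignore list.
cat .dvcignore
```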
We need to configure where DVC saves the actual snapshots of the tracked files, which can be a local folder or a remote system such as S3, GCS, Azure storage etc. Let's use a local folder on our system that will act like a remote machine
dvc remote add -d myremote /temp/remote_dvc_versioning_demo
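To check what the command above recorded, one option is to list the configured remotes and look at .dvc/config. The sketch below does this in a throwaway project; the temporary directories are assumptions standing in for the project folder and for /temp/remote_dvc_versioning_demo.

```shell
# Scratch project and scratch "remote" folder (both hypothetical stand-ins).
proj="$(mktemp -d)" && cd "$proj"
remote_dir="$(mktemp -d)"

if command -v dvc >/dev/null 2>&1; then
    git init -q && dvc init -q
    dvc remote add -d myremote "$remote_dir"
    dvc remote list     # name and URL of each configured remote
    cat .dvc/config     # the same settings as stored on disk
else
    echo "dvc not found; install it with 'pip install dvc'"
fi
```

Because of the -d flag, myremote should also appear as the default remote in the core section of .dvc/config.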
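Alternatively, we can point our default remote at Azure storage as below (this needs DVC's Azure support, installed with pip install "dvc[azure]"). The remote URL must refer to an Azure container; mycontainer/path is a placeholder, not a value from this project. The --local flag keeps the connection string in .dvc/config.local, which Git does not track. With this, DVC will start using the remote Azure storage space for storing the different versions of files
dvc remote modify myremote url azure://mycontainer/path
dvc remote modify --local myremote connection_string "my-connection-string"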
Replace "my-connection-string" with the actual connection string, found in the Azure portal under Storage account -> Access keys -> Connection string.

Now create the necessary files as below
mkdir data src model
echo "data_version-1" > data/data.csv
echo "src_version-1" > src/src.py
echo "model_version-1" > model/model.pkl
At present our file structure will look like
dvc_versioning_demo
-- data
   -- data.csv
-- src
   -- src.py
-- model
   -- model.pkl
For DVC to be able to track a file or directory we have to add it, as below for both the model and data folders
dvc add model/ data/
DVC creates a new .dvc file for each of the folders data and model, named after the tracked file or folder (e.g. data.dvc and model.dvc). These files contain the information needed to track the data with DVC. They use a simple YAML format and contain information such as the path, size, number of files etc. The folders themselves have been added to .gitignore so that Git doesn't track them. To track the new .dvc files with Git, we follow the standard Git procedure for a commit with
git add .
git commit -m "version-1"
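To see what Git is actually tracking at this point, it can help to print a .dvc file and the generated .gitignore. The sketch below repeats the steps for the data folder only, in a throwaway directory. The exact fields in the .dvc file may vary between DVC versions.

```shell
# Scratch project (hypothetical), so the real project is left untouched.
proj="$(mktemp -d)" && cd "$proj"

if command -v dvc >/dev/null 2>&1; then
    git init -q && dvc init -q
    mkdir data
    echo "data_version-1" > data/data.csv
    dvc add -q data/

    cat data.dvc      # YAML metadata: content hash, size, number of files, path
    cat .gitignore    # the data folder is listed here so Git skips it
else
    echo "dvc not found; install it with 'pip install dvc'"
fi
```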
To store the data on the remote machine we use dvc push
dvc push
Now we can update all 3 files
echo "data_version-2" > data/data.csv
echo "src_version-2" > src/src.py
echo "model_version-2" > model/model.pkl
After the modification we again need to add the modified files to DVC with
dvc add model/ data/
Now the updated .dvc files can be tracked with Git, and we follow the standard Git procedure for a commit with
git add .
git commit -m "version-2"
Again we update the data on the remote machine with
dvc push
Now, when we need to switch back to the version-1 commit, we can use git checkout, which retrieves the corresponding old source code and .dvc files. Replace commit_ID with the ID of the version-1 commit.
git checkout commit_ID
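If the commit ID is not at hand, Git can look it up from the commit message. The sketch below uses a throwaway repository with a single hypothetical file f and the same commit messages as in this walkthrough; it involves Git only, not DVC.

```shell
# Throwaway repository with two commits (the file name f is hypothetical).
repo="$(mktemp -d)" && cd "$repo"
git init -q
git config user.name "demo" && git config user.email "demo@example.com"
echo "a" > f && git add f && git commit -q -m "version-1"
echo "b" > f && git commit -q -am "version-2"

# List abbreviated commit IDs next to their messages.
git log --oneline

# Pick the full ID of the commit whose message is exactly "version-1".
v1_commit="$(git log --format=%H --grep='^version-1$' -n 1)"
git checkout -q "$v1_commit"
cat f
```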
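To retrieve the corresponding data and model files we use dvc checkout, which reads the .dvc files and restores the matching data and model files from the local cache. If those files are no longer in the cache (for example in a fresh clone), dvc pull downloads them from the remote first.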
dvc checkout
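The whole round trip can be tried end to end in a scratch directory. The following is a minimal sketch under a few assumptions: it tracks only the data folder, it marks the first commit with a hypothetical Git tag v1 instead of copying a commit ID, and it skips the remote because the local cache already holds both versions.

```shell
# Scratch project (hypothetical), separate from the real one.
demo="$(mktemp -d)" && cd "$demo"

if command -v dvc >/dev/null 2>&1; then
    git init -q
    git config user.name "demo" && git config user.email "demo@example.com"
    dvc init -q

    # Version 1 of the data.
    mkdir data
    echo "data_version-1" > data/data.csv
    dvc add -q data/
    git add . && git commit -q -m "version-1"
    git tag v1                      # hypothetical tag, stands in for commit_ID

    # Version 2 of the data.
    echo "data_version-2" > data/data.csv
    dvc add -q data/
    git add . && git commit -q -m "version-2"

    # Go back: Git restores data.dvc, then DVC restores the data itself.
    git checkout -q v1
    dvc checkout -q
    cat data/data.csv               # should show the version-1 content again
else
    echo "dvc not found; install it with 'pip install dvc'"
fi
```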
In this case checkout has retrieved version-1. We can now work with the old data version just as usual. This procedure also works for data versions containing a lot more data. The checkout is of course also possible in the other direction: we could check out the Git commit for version-2 and run dvc checkout to set the data and model directories to their version-2 state. DVC stores each version of a file separately in the remote location only when the file content has changed. If the content has not changed, DVC does not create a separate copy of the file and instead reuses the previous version.
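The reuse of unchanged files follows from storing files under a hash of their content. The sketch below imitates that idea with plain shell tools. It is an illustration of content-addressed storage, not DVC's actual implementation; the store folder and file names are hypothetical.

```shell
# Three files: a.csv and b.csv have identical content, c.csv differs.
work="$(mktemp -d)"
store="$work/store" && mkdir "$store"
printf 'data_version-1\n' > "$work/a.csv"
printf 'data_version-1\n' > "$work/b.csv"
printf 'data_version-2\n' > "$work/c.csv"

# MD5 of a file's content (md5sum on Linux, md5 on macOS).
hash_of() {
    if command -v md5sum >/dev/null 2>&1; then
        md5sum < "$1" | cut -d' ' -f1
    else
        md5 -q "$1"
    fi
}

# Store each file under its content hash; identical content is stored once.
for f in "$work/a.csv" "$work/b.csv" "$work/c.csv"; do
    h="$(hash_of "$f")"
    [ -e "$store/$h" ] || cp "$f" "$store/$h"
done

# Number of stored entries: 2, because a.csv and b.csv share one entry.
ls "$store" | wc -l
```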