How to track large files with git

On a perfectly ordinary Sunday afternoon, I was sitting on my couch sipping tea and wondering what to do during the pandemic lockdown. Of course, that did not mean going out and running amok around the city; it meant prioritising my pet projects and working on them, since I am fortunate enough to work remotely at a company that cares about its employees. For that, I love my job and respect my employer.

Anyhow, instead of dozing off on my couch, I pulled out my old laptop and started pushing my pet projects’ code. All well and good: everything pushed seamlessly until git decided to complain. Well, damn you git, I thought, like every other developer in the universe. But what was the error git threw?

That makes sense. I forgot to ignore CSV files, or at least to keep the data folder out of the commit. Stupid me forgot this repo is a machine learning project that uses data intensively. I also had other large files in my repo, such as tar.gz, jpeg and pt files, that I wanted to keep track of. Normally, I put them in the data directory and write a manifest which records what is inside the data folder. Later, from a different machine, I use this manifest to download the files again. Or I could use a file server for such huge data storage. Whether that is efficient is subjective, but the point is that git could do better at tracking large files.

So the smart folks at GitHub saw this issue and decided to support it. Their solution is Git Large File Storage (git lfs): https://git-lfs.github.com/. It stores the large files on GitHub or GitHub Enterprise Server and keeps small pointers to them in our git repo, so that we can track our files like our code. To use it, you need to install it first:

https://packagecloud.io/github/git-lfs/install

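That page sets up a package repository from which you can install the git-lfs package. If you would rather not add a package repository, you can instead download the command-line extension as a release archive from here:

https://github.com/git-lfs/git-lfs/releases/download/v2.10.0/git-lfs-linux-amd64-v2.10.0.tar.gz

Once either install is done, run git lfs install once to set up its git hooks, and you are ready to use git lfs. Okay, how do we use it then?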

First, it needs to know which files, or which types of files, you want to track. It stores this information in .gitattributes in your repo. To tell it which type of file to track, you can run

git lfs track "*.csv"

or you can edit .gitattributes yourself. After running the above, you will see this inside .gitattributes:

*.csv filter=lfs diff=lfs merge=lfs -text
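If you want to double-check which files that pattern actually catches, plain git can tell you. Here is a minimal sketch, assuming a throwaway repo and two made-up paths (data/train.csv and train.py are only for illustration). It writes the same line by hand and asks git which filter applies:

```shell
# Work in a throwaway repo so nothing real is touched.
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q .

# The same line that git lfs track "*.csv" writes.
printf '%s\n' '*.csv filter=lfs diff=lfs merge=lfs -text' > .gitattributes

# Ask git which filter applies to two hypothetical paths.
# The files do not need to exist for check-attr to answer.
csv_filter=$(git check-attr filter -- data/train.csv)
py_filter=$(git check-attr filter -- train.py)

echo "$csv_filter"   # data/train.csv: filter: lfs
echo "$py_filter"    # train.py: filter: unspecified
```

Since the pattern has no slash in it, it matches CSV files in any subdirectory, not just the repo root.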

Now you just need to commit and push .gitattributes so git knows that it should use git lfs on those files. After that, adding and pushing files is the same as the usual git workflow. Finally, while pushing your commits, you will see git lfs uploading the files and tracking them. That is one less pain for developers, folks!
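Put together, the whole flow might look roughly like the sketch below. It runs in a throwaway repo with a made-up sample.csv and a placeholder identity, and it stops short of the push because no remote exists there, so treat it as an illustration rather than my exact commands:

```shell
# Throwaway repo so nothing real is touched; sample.csv is a made-up file.
repo_dir=$(mktemp -d)
cd "$repo_dir"
git init -q .
git config user.email "dev@example.com"   # placeholder identity for the demo
git config user.name "Example Dev"

if git lfs version >/dev/null 2>&1; then
    lfs_available=yes
    git lfs install --local          # set up the lfs filters and hooks for this repo only
    git lfs track "*.csv"            # writes the pattern into .gitattributes
    git add .gitattributes
    git commit -q -m "Track CSV files with git lfs"

    printf 'id,label\n1,cat\n' > sample.csv
    git add sample.csv
    git commit -q -m "Add sample data"

    git lfs ls-files                 # lists the files stored through git lfs
    # With a remote configured, a plain push uploads the lfs objects too:
    # git push origin <your-branch>
else
    lfs_available=no
    echo "git lfs is not installed" >&2
fi
```

Committing .gitattributes before the data matters here: a file added before the pattern is tracked goes into git as a normal blob.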

Well, that is a bit of a surprise, isn’t it? But there is a 1GB limit imposed by GitHub. Even so, it is nice to dump files in a repo and track them from different machines.
