Like many development shops, we migrated our version control at Stackify to a git-based system a few years ago. The benefits were many: it was faster than TFS, it supported our branching and merging strategies better, and it played nicely with our ALM tools for reporting commits related to a work item, etc.
We also decided to use one of the major cloud services to host the repo – after all, we are a cloud-first company and prefer not to host infrastructure or maintain software updates if we don’t need to.
This journey, however, had an unexpected outcome as I learned to deal with the realities of maintaining the repo size. In doing so, I uncovered an experience much the same as dealing with grief.
Shock (or Disbelief)
A few months ago, I was logged into that repo host and noticed a big banner across the top of the page: our repo was over the 1 GB size limit and needed attention. I wasn’t surprised that the repo was so big, but… a 1 GB limit? You’re kidding, right? Clicking on the link, I read “soft limit” with a “hard limit” at 2 GB and realized I had some time to address the problem.
My, does time fly. I spent the next few months just pretending the problem would go away. We’ll clean up some code, change some habits. “Surely it won’t grow that fast, will it?”
Other events in our day to day development sent me down the path of splitting out our repo – and hey, maybe now is a good time to take a look at the size problem we have. Not that I’m worried about it… no, no, not at all. Just going to take advantage of the opportunity. That’s all.
I click the ominous link in the message and start reading up on how to proceed. It hits me rather quickly that this situation could have been avoided. We made mistakes. We added large files and binaries (damn you, NuGet!!); we should have split into multiple repos. We should have used Git LFS for larger items.
What’s this? Someone checked in a memory dump? Who would… Why would… Oh, it was me.
So much guilt, and now we are running out of time. What is going to happen come Monday when the entire team stops dead in their tracks? What if we can’t push a build out?? What have we done?!
Anger. So much Anger.
Anyone who has been down this road knows where I am headed. Those of you who haven’t, hold on. There is no easy way to slim down your repo. None.
You have to rewrite your history.
Yes, that’s correct. Read it again. I’ll wait.
You see, the problem is that you’ve checked all of this stuff in, and it’s woven into the history of all those commits and branches. You can remove the file, or folder, but it’s still there in all the history, keeping the repo size bloated. In order to remove the bloat you need to:
- Identify it
- Remove it using ‘git filter-branch’ which (as you will discover) takes FOR-EV-ER.
- Then delete the original refs that were backed up by the above process
- Dereference and expire your reflog
- Run git gc to reclaim all that space, which also takes FOR-EV-ER and eats all your memory (and might throw a malloc exception after a really long wait. Yay! You get to start again)
- Then push back to your repo
- And then… REBASE?? Yes, because all of your hashes change. You rewrote history.
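The steps above, strung together, look roughly like this. A sketch only, demonstrated on a throwaway repo so nothing real gets hurt: `dump.bin` is a hypothetical stand-in for whatever large file got checked in, and the final force-push is commented out because this demo has no remote.

```shell
#!/bin/sh
# Demo on a throwaway repo: commit a large file, then scrub it from history.
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip filter-branch's warning pause

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"
git config user.name  "You"

echo "real code" > Program.cs
git add Program.cs && git commit -qm "initial commit"

# The mistake: a 100 KB "memory dump" lands in history.
dd if=/dev/zero of=dump.bin bs=1024 count=100 2>/dev/null
git add dump.bin && git commit -qm "oops: add memory dump"
git rm -q dump.bin && git commit -qm "remove dump (still in history!)"

# 1. Identify the bloat: list the biggest objects still in the database.
git cat-file --batch-check --batch-all-objects | sort -k3 -n | tail -3

# 2. Rewrite every commit, dropping the file from each tree (FOR-EV-ER).
git filter-branch --force --prune-empty \
    --index-filter 'git rm --cached --ignore-unmatch dump.bin' \
    --tag-name-filter cat -- --all

# 3. Delete the refs/original/* backups filter-branch left behind.
git for-each-ref --format='%(refname)' refs/original |
    xargs -n 1 git update-ref -d

# 4. Expire the reflog so nothing still points at the old commits.
git reflog expire --expire=now --all

# 5. Garbage-collect to actually reclaim the space.
git gc --prune=now --quiet

# 6. Force-push the rewritten history (commented out: no remote here).
# git push origin --force --all --tags
```

Skipping step 3 or 4 is the classic trap: the backup refs and reflog keep the old commits reachable, `git gc` prunes nothing, and the repo stays exactly as big as before.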
Yep, that’s right. This is essentially an “offline” operation. No one can be working on the repo while this is going on (and remember, it takes FOR-EV-ER). And if something didn’t get pushed before, you are out of luck. You will not be merging that commit back in. Let’s face it, we all have one co-worker who will be “that guy” who didn’t get the memo.
Are you furious yet? I am. I’ve been waiting for three hours just for git filter-branch to finish removing one file from my repo.
Have you ever stared into the dark abyss of a bash shell ticking by sha1 hashes for hours on end? Knowing that as soon as it’s done, you have to do it again? Over and over and over again?
It takes you to a dark place. Once you get there, drop me a note. I’ll help talk you through it. It’s going to be ok. Ok?
But then again, maybe not. Take a detour to Guilt for a while. Then back to Anger. You have plenty of time to cycle through this all while you wait. Endlessly.
Acceptance and Hope
No matter what, you’ve got to get the job done. You’re a developer, right? You’ve done more difficult things than this. You’ve used Windbg to trace memory leaks in unmanaged code! You’ve dealt with escalating SQL deadlocks! This version control will not get the best of you!
If you can specifically target certain files that need to be cleaned up and removed, there is a great utility that can help: BFG Repo Cleaner, which can do some of the cleanup a lot faster than the traditional git commands.
The only downside I have seen with BFG Repo Cleaner is that it can’t target a path. It’s great for cleaning up file types and doing pattern matching. But say you have a “FolderA” at the root of your repo and another “FolderA” nested deeper in the tree: you can’t remove just the root instance of “FolderA.” BFG wants to remove every folder called “FolderA.” If you only want the one path, you need to go back to “git filter-branch,” which will take a lot longer.
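For that path-specific case, here is a filter-branch sketch (the folder names are hypothetical, matching the scenario above). Because the pathspec in the index filter is anchored at the repo root, a bare `FolderA` matches only the top-level folder and leaves a nested `Sub/FolderA` untouched:

```shell
#!/bin/sh
# Throwaway repo with the same folder name at two depths.
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"
git config user.name  "You"

mkdir -p FolderA Sub/FolderA
echo junk > FolderA/huge.bin
echo keep > Sub/FolderA/code.cs
git add . && git commit -qm "two folders named FolderA"

# Rewrite history, removing ONLY the root-level FolderA.
# The pathspec is resolved from the repo root, so Sub/FolderA survives.
git filter-branch --force --prune-empty \
    --index-filter 'git rm -r --cached --ignore-unmatch FolderA' \
    --tag-name-filter cat -- --all

# Drop the backup refs so the old history is no longer reachable.
git for-each-ref --format='%(refname)' refs/original |
    xargs -n 1 git update-ref -d

git log --all --name-only --format=
```

By contrast, BFG’s pattern-based invocations look like `java -jar bfg.jar --delete-files '*.dmp' my-repo.git` or `--strip-blobs-bigger-than 50M` – fast, but applied everywhere the pattern matches.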
Preventative Git Repo Size Medicine
One thing becomes very clear through this process: never, ever again will we put ourselves in a position to have to do this work.
There are a few simple things you can do when setting up your projects to avoid this scenario:
- Think about splitting up a repository where you can. If you have a project that is only ever consumed by other projects in the form of a binary, it’s a great target for splitting out. PLUS, you can use this opportunity to publish into a NuGet feed, and use that in the dependent projects, which is another great practice.
- Use package restore. Don’t check NuGet packages into Git!
- Git LFS. Have some large files and binaries that don’t change often? This is a good place to put them.
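A minimal setup sketch for the package-restore point, on a throwaway repo: ignore the restore output before the first commit so it can never enter history. The folder names assume the classic NuGet `packages/` layout, and the sample package name is purely illustrative; the Git LFS steps are shown as comments since they require the git-lfs extension to be installed.

```shell
#!/bin/sh
# Throwaway repo: keep restorable NuGet artifacts out of git entirely.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"
git config user.name  "You"

# Ignore the restore output up front, before the first commit.
cat > .gitignore <<'EOF'
packages/
*.nupkg
bin/
obj/
EOF
git add .gitignore && git commit -qm "ignore restorable artifacts"

# Simulate a package restore; git should not even see these files.
mkdir -p packages/Newtonsoft.Json.13.0.3
touch packages/Newtonsoft.Json.13.0.3/any.nupkg
git status --porcelain   # prints nothing: the restore output is invisible

# For big binaries you DO need to version, Git LFS keeps a small pointer
# in the repo and stores the content elsewhere (needs git-lfs installed):
# git lfs install
# git lfs track "*.psd" "*.zip"
# git add .gitattributes
```

The key habit is committing the `.gitignore` first; once a binary has been committed even once, removing it means the history rewrite described above.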
My hope is that others can learn from my git-repo-maintenance grief and take measures to deal with these issues before they become critical.
If not, I’m there for you. I know your pain.