XD blog

git, python


2017-08-24 Remove big files from git history

Git repositories always get bigger. I noticed that one of my GitHub repositories was above 500 MB, and I wondered how I could make it smaller. First, let's look at the size.

git count-objects -v 
count: 0
size: 0
in-pack: 19644
packs: 1
size-pack: 222397
prune-packable: 0
garbage: 0
size-garbage: 0
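
Since this blog is mostly about Python, here is a minimal sketch that reads the same output from a script. It assumes git is on the PATH; the function names (parse_count_objects, pack_size_mib, repo_stats) are made up for this example. git reports size-pack in KiB, so the sketch converts it to MiB.

```python
import subprocess


def parse_count_objects(text):
    """Parse the output of `git count-objects -v` into a dict of integers."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        # Keep only "name: number" lines and ignore anything else.
        if sep and value.strip().isdigit():
            stats[key.strip()] = int(value)
    return stats


def pack_size_mib(stats):
    """git reports size-pack in KiB; convert it to MiB."""
    return stats["size-pack"] / 1024


def repo_stats(repo="."):
    """Run git in `repo` (hypothetical helper, requires git on the PATH)."""
    out = subprocess.run(
        ["git", "count-objects", "-v"],
        cwd=repo, capture_output=True, text=True, check=True,
    ).stdout
    return parse_count_objects(out)
```

On the output above, pack_size_mib returns roughly 217, which means the packed history takes about 217 MiB on disk.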

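The repository size is size-pack, reported in KiB (about 217 MiB here). To shrink it, the first option is to rewrite the history so that the repository no longer carries all of its past. One solution is to cut commits out of the history (see Reduce repository size). Note that git reset --hard HEAD~N moves the branch back by N commits, so it discards the N most recent commits, and git push --force then overwrites the remote history.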

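git log -n N                    # Show the N most recent commits
git reset --hard HEAD~N         # Move the branch back N commits (drops the N most recent ones)
git push --force                # Overwrite the remote branch with the rewritten history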

If this does not work, another strategy is to create an empty (orphan) branch, commit everything as its first commit, delete the master branch, rename the new branch to master, and clean up the objects that are no longer used (see Make the current commit the only (initial) commit in a Git repository?).

git checkout --orphan newBranch
git add -A                      # Stage all files
git commit                      # Commit them as the single initial commit
git branch -D master            # Deletes the master branch
git branch -m master            # Rename the current branch to master
git push -f origin master       # Force push master branch to github
git gc --aggressive --prune=all # Remove the old files

A third option is to remove files that were added to the repository and later deleted. To do that, follow the steps described in Removing sensitive data from a repository. That leaves the problem of finding the files you can remove; the notebook git_dataframes.ipynb can help with that. I tried it on my own repo, where I had added and then removed a file log.txt.
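
To look for candidates without the notebook, one possibility is to list every blob in the history and sort them by size. The sketch below is not the notebook's method; it relies on the usual git rev-list --objects --all piped into git cat-file --batch-check, assumes git is on the PATH, and uses made-up function names (largest_blobs, list_objects). It does not tell whether a file still exists in the current commit, so that has to be checked separately.

```python
import subprocess


def largest_blobs(lines, top=10):
    """Return the `top` biggest blobs as (size_in_bytes, path) pairs.

    Each line is expected to look like "<type> <sha> <size> <path>".
    """
    blobs = []
    for line in lines:
        # Split at most 3 times so that paths containing spaces stay whole.
        parts = line.rstrip("\n").split(" ", 3)
        if len(parts) == 4 and parts[0] == "blob":
            blobs.append((int(parts[2]), parts[3]))
    return sorted(blobs, reverse=True)[:top]


def list_objects(repo="."):
    """List every object reachable from any ref (requires git on the PATH)."""
    revs = subprocess.run(
        ["git", "rev-list", "--objects", "--all"],
        cwd=repo, capture_output=True, text=True, check=True,
    ).stdout
    fmt = "--batch-check=%(objecttype) %(objectname) %(objectsize) %(rest)"
    out = subprocess.run(
        ["git", "cat-file", fmt],
        cwd=repo, input=revs, capture_output=True, text=True, check=True,
    ).stdout
    return out.splitlines()
```

Calling largest_blobs(list_objects("."), top=20) inside a repository gives the 20 biggest files ever committed, including the ones deleted since.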

I then ran:

git filter-branch --force --index-filter "git rm --cached --ignore-unmatch log.txt" --prune-empty --tag-name-filter cat -- --all

It displayed the following message:

I finally typed:

git push origin --force --all

After that, log.txt and its content were gone from the history.


Xavier Dupré