I’m not sure this is a good idea. Most of the history is shared anyway. All large files (mostly the nodejs tarballs) are in rpm/develop and that’s part of the history. Git reuses object files, so you’d gain very little.
Even when you clone only a single branch with no depth limit, you still end up with a large repository. For rpm/develop:
Good call-out on how we might optimize our CI fetching, if we aren’t doing so already. Let me ask two additional questions:
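For the CI side, the usual optimization is a shallow, single-branch clone. The sketch below builds a tiny local repository as a stand-in (the repo name, branch, and commit contents are illustrative); against the real repository this would be something like `git clone --depth 1 --single-branch --branch rpm/develop <url>`:

```shell
set -e
# Build a toy repository with a few commits, standing in for foreman-packaging.
git init -q origin-repo
git -C origin-repo config user.email ci@example.com
git -C origin-repo config user.name "CI"
for i in 1 2 3; do
    echo "revision $i" > origin-repo/file.txt
    git -C origin-repo add file.txt
    git -C origin-repo commit -qm "commit $i"
done
# Shallow clone: only the tip commit is fetched, not the full history.
git clone -q --depth 1 file://"$PWD"/origin-repo shallow-clone
```

Note that `--depth` requires a real transport (hence `file://`); a plain local path would silently ignore it.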
If we step back from the CI question: do we need all these old branches? Do they provide value by remaining in the main repository? I find there’s some overhead in looking through all the branches, either locally or on GitHub, to find the most important and relevant ones.
Would there be value in moving to a model where those nodejs tarballs are stored on one of our webservers or on Koji, and fetched like any other source?
Where we package it directly (like nodejs-theforeman-vendor) it properly uses git-annex and we have no additional overhead.
We could get rid of the large tarballs if we can find a way to bundle NPM packages without actually calling NPM, or make NPM behave properly without needing to reach out to the registry. Perhaps `npm ci` during the packaging phase could help?
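As a rough sketch of what that could look like (the cache path here is an assumption, not our actual layout): npm can be pointed at a pre-populated local cache and told never to hit the registry via an `.npmrc` in the package build root, so `npm ci` resolves everything offline:

```
# .npmrc sketch: force npm to install purely from a local cache.
# "./cache" is a placeholder for wherever the build stages the cache.
cache=./cache
offline=true
```

With this in place, `npm ci` fails loudly if anything is missing from the cache instead of silently reaching out to the registry, which is exactly the behavior we’d want in a build environment.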
Another alternative is to remove the bundled NPM packages altogether, but that may be more controversial. With better automation it could perhaps be done.
If we can’t, finding some large-file offloading mechanism would certainly help reduce the repository’s growth.
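Since the repository already uses git-annex for nodejs-theforeman-vendor, the annex route may be the natural offloading mechanism. A minimal, illustrative sketch (the `*.tgz` pattern is an assumption about which files we’d want annexed): git-annex can be told via `.gitattributes` which files it should manage as large files:

```
# .gitattributes sketch (illustrative): have git-annex manage tarballs
# while everything else stays as regular git-tracked files.
*.tgz annex.largefiles=anything
* annex.largefiles=nothing
```

That keeps the large blobs out of the regular object store, so clone size stops growing with every tarball bump.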
Or, as a short summary: there are content-v2 and index-v2 directories. There are 770 files for this particular install, taking 25M uncompressed and 6.7M compressed:
```shell
# du -sh cache/
# find cache/ -type f | wc -l
# tar --create --gzip --file cache.tar.gz cache
# ls -sh cache.tar.gz
```
The tricky thing is that for every package the cache also contains the gzipped tarball that is the actual package, which we already have in the sources section. We are essentially duplicating all sources.
One possible strategy is to create a more complex specfile that reuses the git-annexed sources and extracts them into the cache at the right place. The nice thing is that the filename is derived from the content of the file, i.e. the cache is content-addressed:
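As a hypothetical sketch of that extraction step (the tarball name and `cache/` location are placeholders; the `content-v2/sha512/<2 hex>/<2 hex>/<rest>` path scheme is the layout npm’s cacache library uses for content-addressed storage):

```shell
set -e
# Stand-in for a real git-annexed source tarball.
printf 'example tarball data' > example-1.0.0.tgz

# The cache filename is derived from the sha512 of the file's content.
hash=$(sha512sum example-1.0.0.tgz | awk '{print $1}')
first=$(printf '%s' "$hash" | cut -c1-2)
second=$(printf '%s' "$hash" | cut -c3-4)
rest=$(printf '%s' "$hash" | cut -c5-)

# Place the tarball where npm expects to find it in its content store.
dest="cache/content-v2/sha512/$first/$second/$rest"
mkdir -p "$(dirname "$dest")"
cp example-1.0.0.tgz "$dest"
```

In a specfile this would run during %prep or %build, so the cache is rebuilt from the annexed sources instead of being committed alongside them.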
- Cache size before (compressed)
- Cache size before (uncompressed)
- Cache size after (uncompressed)
- Cache size after (compressed)
So we see a reduction of roughly 2 MB, or about 31%, for this package.
The remaining size comes from the API responses, which are very large, and I don’t see a way to reduce those with my limited NPM knowledge.
Still, this is a nice reduction, and I’ll see about coming up with a patch to npm2rpm. After that, I don’t think it would be wise to replace all existing packages in our tree; rather, we should see if we can pick up the reduced file size through regular version bumps. While we’re at it, we should also check whether our repository still contains packages that are no longer used.