Trimming foreman-packaging branches

ehelms · June 29, 2022, 12:56pm

Proposal

Completely remove or move previous packaging branches to an archive repository in order to reduce the overall size and thus time it takes to clone foreman-packaging.

More specifically create a foreman-packaging-archive repository and then delete all of the branches older than rpm/3.0 and deb/3.0 from the primary foreman-packaging repository.

Why and What

Our CI system clones the foreman-packaging repository for every RPM and Debian PR and then again whenever we merge a PR to build a package.

The current size of the repository: 597M foreman-packaging/
This, for example, takes about 25 seconds on my local machine.

This implies that if we reduce the size of the repository and thereby decrease the time it takes to clone we will gain faster jobs within our CI that can add up given how much we do in this area.

aruzicka · June 29, 2022, 1:23pm

As a slightly less drastic counter proposal, couldn’t CI just do shallow clones?

ekohl · June 29, 2022, 1:26pm

I’m not sure this is a good idea. Most of the history is shared anyway. All large files (mostly the nodejs tarballs) are in rpm/develop and that’s part of the history. Git reuses object files so you’d win very little.

When you only clone a single branch and no depth you still have a large repository. For rpm/develop:

$ time git clone --branch rpm/develop https://github.com/theforeman/foreman-packaging
Cloning into 'foreman-packaging'...
remote: Enumerating objects: 129222, done.
remote: Counting objects: 100% (677/677), done.
remote: Compressing objects: 100% (345/345), done.
remote: Total 129222 (delta 412), reused 537 (delta 323), pack-reused 128545
Receiving objects: 100% (129222/129222), 593.01 MiB | 27.74 MiB/s, done.
Resolving deltas: 100% (74289/74289), done.

real	0m24.156s
user	0m20.818s
sys	0m8.866s
$ time git clone --depth 1 --single-branch --branch rpm/develop https://github.com/theforeman/foreman-packaging
Cloning into 'foreman-packaging'...
remote: Enumerating objects: 2968, done.
remote: Counting objects: 100% (2968/2968), done.
remote: Compressing objects: 100% (2815/2815), done.
remote: Total 2968 (delta 405), reused 1828 (delta 147), pack-reused 0
Receiving objects: 100% (2968/2968), 112.02 MiB | 6.92 MiB/s, done.
Resolving deltas: 100% (405/405), done.

real	0m17.803s
user	0m7.228s
sys	0m3.576s

And on deb/develop you win A LOT in doing so:

$ time git clone --branch deb/develop https://github.com/theforeman/foreman-packaging
Cloning into 'foreman-packaging'...
remote: Enumerating objects: 129222, done.
remote: Counting objects: 100% (677/677), done.
remote: Compressing objects: 100% (345/345), done.
remote: Total 129222 (delta 412), reused 537 (delta 323), pack-reused 128545
Receiving objects: 100% (129222/129222), 593.01 MiB | 27.55 MiB/s, done.
Resolving deltas: 100% (74289/74289), done.

real	0m23.871s
user	0m19.800s
sys	0m8.366s
$ time git clone --depth 1 --single-branch --branch deb/develop https://github.com/theforeman/foreman-packaging
Cloning into 'foreman-packaging'...
remote: Enumerating objects: 822, done.
remote: Counting objects: 100% (822/822), done.
remote: Compressing objects: 100% (572/572), done.
remote: Total 822 (delta 298), reused 512 (delta 165), pack-reused 0
Receiving objects: 100% (822/822), 174.22 KiB | 5.81 MiB/s, done.
Resolving deltas: 100% (298/298), done.

real	0m0.750s
user	0m0.158s
sys	0m0.116s

So from me. We should rather investigate fixing the checkouts in our CI.

ehelms · June 29, 2022, 1:50pm

Good call out on how we might optimize our CI fetching if we are not already. Let me ask two additional questions:

If we step back from the CI question: do we need all these old branches? Are they providing value existing in the main repository? I find that I encounter some overhead looking through all the branches either locally or on Github to find the most important and relevant ones.
Would there be value in moving to a model to store those nodejs tarballs on one of our webservers or koji? And fetch them like any other source.

ekohl · June 29, 2022, 2:18pm

I think history is something very important and does help. I’ve not experienced problems with too many branches (the GH UI allows filtering).

To be precise: this is a problem with the bundled NPM packages strategy. We need to cache the NPM registry responses because otherwise we can’t build offline. This results in large tarballs like https://github.com/theforeman/foreman-packaging/blob/rpm/develop/packages/foreman/nodejs-theforeman-builder/nodejs-theforeman-builder-10.1.0-registry.npmjs.org.tgz (5.81 MB). Those specifically blow up our repository size.

Where we package it directly (like nodejs-theforeman-vendor) it properly uses git-annex and we have no additional overhead.

We could get rid of the large tarballs if we can find a way to have bundled NPM packages without actually calling NPM or make NPM behave properly without needing to reach out to the registry. Perhaps npm ci during the packaging phase could help?

Another alternative is to remove the bundled NPM packages, but that may be more controversial. With better automation it could perhaps get done.

If we can’t, finding some large file offloading mechanism would surely help reduce the repository size growth.

These are improvements we should probably make in https://github.com/theforeman/npm2rpm.

ekohl · June 29, 2022, 2:31pm

So I did some more searching.

This points to AddyOsmani.com - Offline installation of npm packages but more importantly an option NPM 6 introduced: –prefer-offline. That did not exist when we last looked at things. I’m going to see if that’s something we can leverage.

ekohl · June 29, 2022, 4:07pm

So that won’t help. Diving into how we have the bundled NPM packages work today we already do reuse the cache. Now what is the cache?

First of all, we have a script that creates a tarball. The important line here is:

github.com

theforeman/npm2rpm/blob/4c8dc6719595d3a38118ea026dd65248687ccf2d/bin/generate_npm_tarball.sh#L17

    
      
          package=$1
          output=$(pwd)/$2
          usenodejs6=${3:-false}
          wd=$(mktemp -d)
          trap "rm -rf '$wd'" EXIT INT TERM
          
          
create_cache() {
            if [[ $usenodejs6 == true ]];then
              node_modules/npm/bin/npm-cli.js install --cache $wd/cache $package --no-shrinkwrap --no-optional --production --verbose
            else
              npm install --cache $wd/cache $package --no-shrinkwrap --no-optional --production --verbose
            fi
          }
          
          
mkdir $wd/cache $wd/install
          cd $wd/install
          if [[ $usenodejs6 == true ]];then
            npm install npm@3.x
          fi
          create_cache
          cd $wd/cache

This creates a cache directory with all the responses from the NPM registry. What’s in there?

# npm install --cache ./cache @theforeman/builder --no-shrinkwrap --no-optional --production --verbose
# tree cache/
cache/
├── _cacache
│   ├── content-v2
│   │   └── sha512
│   │       ├── 01
│   │       │   └── 11
│   │       │       └── 39c192ca9d380f5bbff57ed4264a2d429a4aaa1e8d83366b73b4586f82d3964837ead02abdffd2fe2a764d61236bc9d05e9215b7af04a176e7f9f51a176a
│   │       ├── 02
│   │       │   └── 3c
│   │       │       └── 7fc4a38a468902c3859862e37b67d174ec9aa7ded890d8afa289c85c96db021013a04259e0198417f77183336a740c92bc14a97a813a62c1415c3b86998e
│   │       ├── 03
│   │       │   ├── 9e
│   │       │   │   └── ed87a6111bba11ddcabfcadfd8f3832f1a01347891b617524b84abb21be8c5c4482b77f1c03096ad885b663a266eaf3fb40e4aef6e17822da5797f625f8c
<stripped a lot of duplication>
│   │       └── ff
│   │           └── 22
│   │               └── 872795d79d20ca8a2eb6c8f2ee180b0f51c097dba7d134d60cf1375c982c8859bc624fd5516692b52cc154ae5b846b4f8c9b19616eda32de7fc8f6dfd9b0
│   ├── index-v5
│   │   ├── 00
│   │   │   └── 03
│   │   │       └── 9a227757cb5437ea7adb37debe8f461d0ace3b7e7de9750e99ec6cffd22c
│   │   ├── 01
│   │   │   ├── 22
│   │   │   │   └── 84e83a48fe5dba743d38acda1de9b25d988297e88bbc5f1973940c3bbb09
<stripped more duplication>
│   │   ├── fe
│   │   │   └── a5
│   │   │       └── ed6bf26667c2e90b25a8e577e69ffbb359a5573ec65502d75947a0b02132
│   │   └── ff
│   │       └── 74
│   │           └── 4dbcce677ce2508faf06897463359a67d7dc76af665df4d9cd0d38c75874
│   └── tmp
├── _locks
└── anonymous-cli-metrics.json

1161 directories, 770 files

Or a short summary: there are a content-v2 and an index-v2 directories. There are 770 files for this particular install and it takes 25M uncompressed, 6.7M compressed

# du -sh cache/
25M	cache/
# find cache/ -type f | wc -l
770
# tar --create --gzip --file cache.tar.gz cache
# ls -sh cache.tar.gz 
6.7M cache.tar.gz

The tricky thing is that every single tarball also contains the gzipped tarball that is the actual package, which we also already have in the sources section. We are essentially duplicating all sources.

One possible strategy is to create a more complex specfile where we reuse the git annexed sources and extract them to the cache at the right place. The good thing is that the filename is the content of the file:

# sha512sum ./01/11/39c192ca9d380f5bbff57ed4264a2d429a4aaa1e8d83366b73b4586f82d3964837ead02abdffd2fe2a764d61236bc9d05e9215b7af04a176e7f9f51a176a
011139c192ca9d380f5bbff57ed4264a2d429a4aaa1e8d83366b73b4586f82d3964837ead02abdffd2fe2a764d61236bc9d05e9215b7af04a176e7f9f51a176a  ./01/11/39c192ca9d380f5bbff57ed4264a2d429a4aaa1e8d83366b73b4586f82d3964837ead02abdffd2fe2a764d61236bc9d05e9215b7af04a176e7f9f51a176a

It could be a challenge to replace only the right files. On the other hand, we have the spec file:

github.com

theforeman/npm2rpm/blob/4c8dc6719595d3a38118ea026dd65248687ccf2d/bin/npm2rpm.js#L93-L98

    
      
          writeSpecFile(npm_module, files, dependencies, npm2rpm.release, npm2rpm.template, npm2rpm.output, npm2rpm.scl);
          
          
if (dependencies.length > 0) {
            console.log(' - Generating npm cache tgz... '.bold)
            createNpmCacheTar(npm_module, npm2rpm.output, npm2rpm.useNodejs6);
          }

So what we can do is fetch all dependencies (spectool --get-files $spec), calculate all sha512 sums and purge entries we find.

Let’s see about a proof of concept:

gist.github.com

https://gist.github.com/ekohl/ce876b9fcc4ec29a27983291658ec64e

purge_cache.sh

#!/bin/bash -e

# Intended as a proof of concept to see if we can purge the NPM cache in npm2rpm

cd "$1"

package="$(basename $(realpath .))"

spectool --get-files "${package}.spec"

This file has been truncated. show original

$ ./bin/npm2rpm.js -n @theforeman/builder -v 10.1.7 -s bundle -o nodejs-theforeman-builder
LOTS of output
$ ./purge_cache.sh nodejs-theforeman-builder
Downloading: https://registry.npmjs.org/@ampproject/remapping/-/remapping-2.2.0.tgz
100% of  14.8 KiB |###############################################################| Elapsed Time: 0:00:00 Time:  0:00:00
Downloaded: remapping-2.2.0.tgz
Downloading: https://registry.npmjs.org/@babel/code-frame/-/code-frame-7.18.6.tgz
100% of   2.8 KiB |###############################################################| Elapsed Time: 0:00:00 Time:  0:00:00
Downloaded: code-frame-7.18.6.tgz
<stripped duplication>
Downloading: https://registry.npmjs.org/update-browserslist-db/-/update-browserslist-db-1.0.4.tgz
100% of   4.2 KiB |###############################################################| Elapsed Time: 0:00:00 Time:  0:00:00
Downloaded: update-browserslist-db-1.0.4.tgz
Cache size before (compressed)
6.7M nodejs-theforeman-builder-10.1.7-registry.npmjs.org.tgz
Cache size before (uncompressed)
16M	/tmp/tmp.J9UjgTdVJF
Handling ansi-styles-3.2.1.tgz (553d1923a91945d4e1f18c89c3748c6d89bfbbe36a7ec03112958ed0f7fdb2af3f7bde16c713a93cac7d151d459720ad3950cd390fbc9ed96a17189173eaf9a8)
removed '/tmp/tmp.J9UjgTdVJF/_cacache/content-v2/sha512/55/3d/1923a91945d4e1f18c89c3748c6d89bfbbe36a7ec03112958ed0f7fdb2af3f7bde16c713a93cac7d151d459720ad3950cd390fbc9ed96a17189173eaf9a8'
Handling babel-plugin-dynamic-import-node-2.3.0.tgz (a3aa85929790101c5caadd176255b3015c4d09209975481c47c2119610fff03ca5c638cee1fa0f72f4d6d0618a6bf715b77aefc59ee8e62a6927eff49ef2e195)
removed '/tmp/tmp.J9UjgTdVJF/_cacache/content-v2/sha512/a3/aa/85929790101c5caadd176255b3015c4d09209975481c47c2119610fff03ca5c638cee1fa0f72f4d6d0618a6bf715b77aefc59ee8e62a6927eff49ef2e195'
<stripped duplication>
Handling update-browserslist-db-1.0.4.tgz (8e798ed81106523b0c39efc583dbb4a1ccce7bfa6920364f79bcdc725d720d618b14fd7a3dad7f44ce701282983c6db3b2d35c0ee012b5e8f2c6c3ae28835d18)
removed '/tmp/tmp.J9UjgTdVJF/_cacache/content-v2/sha512/8e/79/8ed81106523b0c39efc583dbb4a1ccce7bfa6920364f79bcdc725d720d618b14fd7a3dad7f44ce701282983c6db3b2d35c0ee012b5e8f2c6c3ae28835d18'
Cache size after (uncompressed)
13M	/tmp/tmp.J9UjgTdVJF
Cache size after (compressed)
4.6M nodejs-theforeman-builder-new-registry.npmjs.org.tgz

Quoting the relevant bits:

Cache size before (compressed)
6.7M nodejs-theforeman-builder-10.1.7-registry.npmjs.org.tgz
Cache size before (uncompressed)
16M	/tmp/tmp.J9UjgTdVJF
Cache size after (uncompressed)
13M	/tmp/tmp.J9UjgTdVJF
Cache size after (compressed)
4.6M nodejs-theforeman-builder-new-registry.npmjs.org.tgz

So we see a ~2 MB or about 31% reduction for this package.

The rest is simply because the API responses are very large and I don’t see a way to reduce those with my limited NPM knowledge.

Still, this is a nice reduction and I’ll see about coming up with a patch to npm2rpm. After that I think it wouldn’t be wise to replace all existing packages in our tree, but rather see if we can do version bumps to get the reduced filesize. While we’re at it, we should check if there are no unused packages that we still have in our repository.

ekohl · July 4, 2022, 4:57pm

Today I continued with my work and came up with a patch:
https://github.com/theforeman/npm2rpm/pull/68

It takes a slightly different approach but with the same result.

I submitted a PR to test this out, but it’s failing. Tomorrow I’ll try to figure out what went wrong because the failure looks unrelated.

https://github.com/theforeman/foreman-packaging/pull/8090