Improving our Debian repository setup

Hey,

as some of you saw in "Do you still use rsync to mirror our repositories?", we’re currently trying to lower our traffic bills by not transferring data we don’t need :slight_smile:

One of the takeaways from the above thread was that we transfer the whole Debian dists/ folder on every rsync run, as the folder has a timestamp in its name and is thus considered new by rsync. This timestamp is an artifact of how freight (our Debian repo management software) works and cannot be disabled by configuration. While I am not a huge freight fan, I also don’t really want to replace it just now, as it works fine otherwise and people are used to it.

My intent is for the repository served via http(s) and rsync to no longer contain the timestamps and symlinks generated by freight. After a bit of playing around, I came up with the following possible solution:

  • make freight write its repository data not to /var/www/vhosts/deb/htdocs/ directly, but to a different folder on the same partition (probably something like /var/www/vhosts/deb/private/freight/)
  • after freight has regenerated the repository (via cron), use rsync to copy (well, actually hardlink, as we don’t want to waste space and time copying) the repository to /var/www/vhosts/deb/htdocs/:
    • sync new packages, but don’t delete old (yet):
      rsync --archive --copy-links --hard-links --link-dest=/var/www/vhosts/deb/private/freight/pool/ /var/www/vhosts/deb/private/freight/pool/ /var/www/vhosts/deb/htdocs/pool/
    • sync new metadata, resolving symlinks (--copy-links) and excluding the timestamped folders (--exclude '/*-*/'):
      rsync --archive --delete --copy-links --hard-links --exclude '/*-*/' --exclude '/*/.refs/' --link-dest=/var/www/vhosts/deb/private/freight/dists/ /var/www/vhosts/deb/private/freight/dists/ /var/www/vhosts/deb/htdocs/dists/
    • sync packages once more (there should be no new ones at this point) and delete old:
      rsync --archive --delete --copy-links --hard-links --link-dest=/var/www/vhosts/deb/private/freight/pool/ /var/www/vhosts/deb/private/freight/pool/ /var/www/vhosts/deb/htdocs/pool/

As the first round of copying does not delete old packages, clients that use the repository while we sync still have the chance to get the packages based on the old metadata. Then we replace the metadata and can safely remove the files that aren’t referenced anymore.
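
For illustration, the three steps could be wrapped in a small script like the one below. This is only a sketch, not something that is deployed: the function names (run, sync_tree, publish_repo) and the SRC, DST and DRY_RUN variables are made up for this example; only the paths and the rsync options come from the commands above. By default it just prints the rsync commands instead of running them.

```shell
#!/bin/sh
# Sketch of the three-step publish described above.
# Paths and rsync options are taken from the post; all names are hypothetical.
set -e

SRC="${SRC:-/var/www/vhosts/deb/private/freight}"  # where freight would write
DST="${DST:-/var/www/vhosts/deb/htdocs}"           # what http(s) and rsync serve
DRY_RUN="${DRY_RUN:-1}"                            # 1 = only print the commands

# run: print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# sync_tree SUBDIR [EXTRA_RSYNC_OPTIONS...]
# Hardlinks SRC/SUBDIR into DST/SUBDIR (same partition), resolving symlinks.
sync_tree() {
    subdir="$1"
    shift
    run rsync --archive --copy-links --hard-links \
        --link-dest="$SRC/$subdir/" "$@" \
        "$SRC/$subdir/" "$DST/$subdir/"
}

publish_repo() {
    # 1. add new packages, keep old ones for clients still on old metadata
    sync_tree pool
    # 2. replace metadata, skipping the timestamped folders and .refs
    sync_tree dists --delete --exclude '/*-*/' --exclude '/*/.refs/'
    # 3. remove packages that are no longer referenced
    sync_tree pool --delete
}

publish_repo
```

Running it with DRY_RUN=0 would execute the commands for real; the idea would be to call it from the same cron job right after freight has regenerated the repository.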

What do y’all think (esp. @mmoll, @Gwmngilfen and @ekohl, as I guess you’re the most knowledgeable about the current setup)?

To me freight is a black box and I only know how to run the documented commands. Do not let me hold you back on changing it, especially if there are better solutions out there that are more CDN friendly.

It looks like it makes sense.

See, you know 100% more than I do. :wink:

Replacing freight with something else needs more thinking, especially as other software has, uhm, different shortcomings.

Same here, tbh… At the end it always worked well enough to not get replaced. :wink:

Regarding the rsync timestamp problem, maybe it would be sufficient to use --checksum (skip based on checksum, not mod-time & size)?

In general, what you outlined sounds good to me, so we should try it :+1:.

This will slow down the server. Also, users won’t notice and will stick with the rsync defaults, I guess.

This is a client option, which we can’t influence from our side. And I think neither check (checksum or mod-time & size) is used if the file path changes.

After looking at the recent traffic bills, I’ve decided not to pursue this further.

Yes, we could improve here, but there is also a risk of breaking the current setup, and we’re pretty good on the money side now.
