Do you still use rsync to mirror our repositories?

Hi,

While looking over our current infra, we’ve noticed that we have about 200G of rsync traffic daily, sometimes mirroring releases as far back as Foreman 1.9 (yes, dot nine). This traffic isn’t cheap for us, as we can’t serve it from our CDN and it’s billed to our main server.

As we could cut a lot of our infra costs by simply disabling rsync, we’re asking those of you who still use it to tell us why rsync is your tool of choice (instead of the available tools for mirroring RPM and DEB repositories) and whether you could switch to those instead.

Thanks
Evgeni for the Infra Team

3 Likes

The main reason we still use a daily rsync is incremental downloads, which actually help conserve network bandwidth on your side. The only other thing is the no-delete option: when you guys remove old versions, they are preserved on our side, as we tend to run them longer.
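For reference, our daily job is essentially the following (the rsync module and paths are illustrative, not our exact setup):

rsync -avz --partial rsync://rsync.theforeman.org/yum/ /srv/mirrors/yum.theforeman.org/
# -a preserves the tree, -z compresses, --partial resumes interrupted files;
# there is deliberately no --delete, so anything removed upstream stays in our copy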

If there are similar alternatives to rsync, I’d gladly switch over to them, as the method does not really matter to me, only the end result :slight_smile: - basic HTTP/HTTPS recursive downloads often assume re-downloading everything and definitely would not work here.

Would Pulp work in that sense? I’ve been eyeing it since last year when the DEB plugin was added, and if it can help me accomplish the same as I do with rsync, I’d switch over to it in a flash; I just don’t have much experience with Pulp yet.

If you have any sample configs for Pulp to set up an incremental mirror of the Foreman repos that you could post on your website as a wiki/mirroring how-to, I’m sure it would be much appreciated and used by many, including myself.
Just a suggestion, but we’re definitely open to other options if you have something in mind.

Thanks!

Pulp sounds ideal. I think @Bernhard_Suttner or @x9c4 could help with that?

1 Like

As Pulp 3 is not yet ready, I would suggest using Katello. It uses Pulp under the hood, and you get the content view versioning mechanism, so you can keep using old versions that disappear upstream. Even better, a content view can present them in isolation.
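Roughly, assuming a working Katello install with the Foreman repository already synced into a product (names below are made up):

hammer content-view create --organization "MyOrg" --name "foreman-mirror"
hammer content-view add-repository --organization "MyOrg" --name "foreman-mirror" --product "Foreman" --repository "Foreman el7"
hammer content-view publish --organization "MyOrg" --name "foreman-mirror"
# every publish creates a new immutable version; older versions keep their
# packages even after they disappear from the upstream repository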

> The main reason we still use a daily rsync is incremental downloads, which actually help conserve network bandwidth on your side. The only other thing is the no-delete option: when you guys remove old versions, they are preserved on our side, as we tend to run them longer.

@Gwmngilfen do we remove stuff these days? I can see as far back as 1.0 for YUM and 1.2 for Debian repos.

But yes, of course, if we remove things you want to preserve, having a “fresh” 1:1 mirror doesn’t help you.

> If there are similar alternatives to rsync, I’d gladly switch over to them, as the method does not really matter to me, only the end result :slight_smile: - basic HTTP/HTTPS recursive downloads often assume re-downloading everything and definitely would not work here.

I wasn’t referring to basic HTTP/S mirroring, but to something along the lines of Pulp, reposync, debmirror, etc.: tools that operate over HTTP/S but understand the repository metadata and can use it to fetch only the content you care about. That way we could leverage the available CDN and you’d still have a fresh mirror :slight_smile:
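For example, something along these lines (the dist/section values are from memory, please check the docs before relying on this):

# RPM: reposync (yum-utils) fetches only what the repo metadata lists as new,
# assuming a "foreman" repo is configured in /etc/yum.repos.d
reposync --repoid=foreman --download_path=/srv/mirrors/yum
# DEB: debmirror walks the Release/Packages metadata over HTTPS
debmirror --host=deb.theforeman.org --root=/ --method=https \
  --dist=stretch --section=1.20 --arch=amd64 /srv/mirrors/deb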

> Would Pulp work in that sense? I’ve been eyeing it since last year when the DEB plugin was added, and if it can help me accomplish the same as I do with rsync, I’d switch over to it in a flash; I just don’t have much experience with Pulp yet.

I have no experience with Pulp for Debian, but for RPM you can definitely configure it to “pull in new stuff, but don’t delete anything the remote dropped”.
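From memory, with Pulp 2’s pulp-admin it’s something along these lines (untested, double-check against the Pulp docs):

pulp-admin rpm repo create --repo-id=foreman \
  --feed=https://yum.theforeman.org/releases/1.20/el7/x86_64/ \
  --remove-missing=false
pulp-admin rpm repo sync run --repo-id=foreman
# --remove-missing=false (the default) keeps packages locally even after
# the upstream feed has dropped them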

> If you have any sample configs for Pulp to set up an incremental mirror of the Foreman repos that you could post on your website as a wiki/mirroring how-to, I’m sure it would be much appreciated and used by many, including myself.
> Just a suggestion, but we’re definitely open to other options if you have something in mind.

That’s definitely something we can arrange for!

One thing makes me wonder, though: our Debian repository is 54GB in size and the Yum one only 14GB, yet we have users generating 1-2TB of traffic a month, so basically downloading the full repo set 20-ish times? I wonder if there is also something wrong with our rsyncd and it serves content as new when it isn’t.

Yeah, I don’t know. Here’s the summary of my two jobs that ran last night, which may help you calculate things:

RPM:
sent 81,514 bytes received 39,884,172 bytes 1,011,789.52 bytes/sec
total size is 20,194,700,768 speedup is 505.30

DEB:
sent 324,156 bytes received 64,239,124,725 bytes 12,331,211.99 bytes/sec
total size is 128,984,151,867 speedup is 2.01

I’ve disabled the DEB rsync job for now, BTW.

Yeah, you pulled 64GB(!) of DEB and only 39MB of YUM. I wonder why that is so different.
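(For reference, rsync’s “speedup” is the total size divided by the bytes actually transferred: for DEB that’s 128,984,151,867 / 64,239,448,881 ≈ 2.01, i.e. roughly half the tree was re-sent, while the RPM job’s speedup of ~505 means almost nothing was.)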

Thanks for disabling DEB, that’s probably saving us a ton of traffic already.

Not as far as I know, although plugins could be in that bracket; I’m not sure what we do with unmaintained ones. @ekohl might know…

You may want to check these DEB repos, as they seem rather large to me (compared to the RPM repos):

mirrors/deb.theforeman.org/dists$ sudo du -hs *
0 bionic
5.8G bionic-20190212081638664334532
0 jessie
7.9G jessie-20190212075220797403224
0 plugins
354M plugins-20190212082047447587045
0 precise
5.7G precise-20190212083045917399998
0 squeeze
2.1G squeeze-20190212075751691405430
0 stretch
9.8G stretch-20190212082400807654610
0 trusty
7.2G trusty-20190212075952991836373
0 wheezy
6.0G wheezy-20190212081004952037429
0 xenial
15G xenial-20190212083550380001693

Now, since each of the dists above is a symlink to a timestamped repo (regenerated every day, it seems), rsync considers the whole repo new and pulls it all down again, which is where the extra bandwidth is being wasted.
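In other words, every day rsync sees something like this (timestamps illustrative):

# yesterday: xenial -> xenial-20190211…
# today:     xenial -> xenial-20190212083550380001693
# same packages, but every path under the symlink target has changed,
# so rsync transfers the whole tree again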

Thanks for looking into this with me!

This made me wonder why dists is so big at all: Debian repositories keep their artifacts in pool, and dists should only contain metadata.
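For comparison, a healthy Debian repository is laid out roughly like this (paths illustrative):

deb.theforeman.org/
├── dists/stretch/Release                       # signed metadata
├── dists/stretch/1.20/binary-amd64/Packages    # package index
└── pool/f/foreman/foreman_1.20.0-1_amd64.deb   # the actual packages

so a daily dists snapshot should only be a few MB of metadata, not gigabytes of .debs.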

Well, it turns out we were holding it wrong :wink:
foreman-infra PR #941 should fix it once and for all.

You can keep your rsyncing, and we get sane traffic bills.

2 Likes

Thanks, guys, for looking into it. I rsync about once a month to get the latest updates, but only the yum repos. We also use rsync for foreman-debug tarball uploads, although the long-term goal is to switch over to sosreport.

Yeah, uploads to our machine are fine, we don’t pay for those. Downloads from us are what I was concerned about :slight_smile:

Confirmed things are looking much better now. Here’s my rsync summary for DEB repos:

sent 29,876 bytes received 64,955,933 bytes 1,780,433.12 bytes/sec
total size is 66,077,888,671 speedup is 1,016.80

Glad that this got resolved and you guys don’t have to waste money.

3 Likes

Awesome, thanks for confirming @Konstantin_Orekhov!

I am pretty sure there is still room for improvement (I mean, hey, you had to pull 64MB for zero changes), but it’s already so much better.

We’ll keep an eye on the rsync traffic, but I don’t expect it to be that much anymore, and I’ll close this thread in a few days, once I’m comfortable that our next bill will be sane again :slight_smile:

4 Likes

I’m going to be that guy… but why isn’t Red Hat helping out here? They benefit greatly from this project, and you’d expect them to assist with things like this.

I think there are really two answers here. Firstly, this specific issue turned out to be due to a misconfiguration of our rsync server, which was causing ridiculous levels of bandwidth usage. That’s on us, and credit to @evgeni for finding and fixing it.

More generally, on Red Hat contribution: Red Hat does contribute greatly to the project, in a lot of ways. It so happens that the website & mirrors aren’t one of those; that sponsorship comes from Rackspace (and we’re very grateful for it). We also have numerous other sponsors that provide hardware, CDN coverage and more, and that’s a good thing. I don’t think it’s healthy to be solely dependent on one company (we actually came close to that when Rackspace considered pulling the plug on its Open Source discount program while all our hosting was there).

However, the corollary is that it’s on us as a community to make the best use of the resources these organisations (and individuals) provide. While the end result here was a misconfiguration and no one had to lose out, that doesn’t negate the future possibility that we may have to drop something else (for an example, consider @tbrisker’s thread on ARM packages). Everything has a cost, no budget is infinite, and sometimes you have to make hard choices.

4 Likes

I am actually considering using rsync, because when a sync operation fails, Foreman is currently not able to resume where it left off and starts the repo sync from the beginning again. We need to go through a proxy we do not control, and even a couple of “failed with code 408: Request Timeout” or “failed with code 502: cannotconnect” errors mean re-downloading gigs of data.

Is it possible to tune retries within foreman/pulp?
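For example, I was hoping for something along the lines of bumping the downloader timeouts on the Pulp 3 remote via its API (field names taken from the pulpcore docs, untested on my side):

curl -u admin -X PATCH https://pulp.example.com/pulp/api/v3/remotes/rpm/rpm/<href>/ \
  -H 'Content-Type: application/json' \
  -d '{"total_timeout": 600, "sock_read_timeout": 120}'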

This thread is old and was about Foreman’s own downloads/repos, which we kept serving via rsync, BTW :slight_smile:

If you have issues with Foreman/Katello/Pulp behind a proxy, I think it would be best to open a new thread in the Support section and ping the @katello team on it.