How to mirror Foreman repos - you can help!

Hello,

our download site is hosted on Rackspace (US) and some HTTP traffic is served by Fastly CDN (worldwide). I cannot stress enough how important our sponsors are, thank you very much!

But if you install Foreman regularly (e.g. for development or testing RCs), or you simply want to support our community by providing a public mirror, rsyncing our content is as easy as:

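#!/bin/bash
# Mirror each rsync module into its own local directory.
for what in yum deb downloads; do
  SRC="rsync://rsync.theforeman.org/$what"
  DST="/mnt/data/www/foreman-$what"
  mkdir -p "$DST" 2>/dev/null
  # -r recursive, -l symlinks, -v verbose, -S sparse files, -H hard links, -P progress + partial
  rsync -rlvSHP --delete --exclude nightly --exclude scratch "$SRC" "$DST"
done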

This will get you:

Remove “--exclude nightly” if you also want the nightlies, which is not a bad idea at all. Please avoid mirroring around 04:00 UTC, when we generate some of our nightly repos. The “scratch” directory is not needed unless you are a discovery image developer.
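
If you run the mirror from cron, a small guard can keep it away from that window. This is only a minimal sketch: the 03:00-04:59 UTC range is my own safety margin around the 04:00 build time mentioned above, and the function names are hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: succeeds when the given UTC hour (00-23) falls inside
# the window to avoid. The 03:00-04:59 range is an assumed margin around the
# 04:00 UTC nightly build, not an official value.
in_nightly_window() {
  local hour=$((10#$1))   # force base 10 so "08" or "09" is not parsed as octal
  (( hour >= 3 && hour < 5 ))
}

# Prints "skip" inside the window and "sync" outside of it.
mirror_decision() {
  if in_nightly_window "$1"; then
    echo skip
  else
    echo sync
  fi
}

# Example use at the top of the mirror script:
#   [ "$(mirror_decision "$(date -u +%H)")" = skip ] && exit 0
```

Using `date -u` keeps the check independent of the local timezone of the mirror host.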

If you are about to set up a public mirror, reach out to us or @Gwmngilfen and we can work together on putting it into the HTTP listing header text so users can decide.

Are you sure rsync:// traffic goes via the CDN? I thought that only captures http(s) traffic. The last thing we need is a spike in our traffic costs caused by good intentions :wink:

That’s not what I said; if it reads that way, let me rephrase.

So pricing-wise, would it be better to mirror using @lzap’s rsync script or just run a recursive wget?

Just a wild idea, how about using one of the “distributed web/data” projects, such as ipfs[1] or dat[2] to provide a way for users to mirror the repos on a p2p basis?

[1] - https://ipfs.io/
[2] - https://datproject.org/

We can always use Katello (Foreman :: Plugin documentation index) which works wonderfully for this use case :smile:

Ooh, the burn… :wink:

My wild guess is that rsync is more efficient than HTTP calls. On the other hand, mirroring via wget does go through the CDN in the case of the downloads vhost (more to come):

https://projects.theforeman.org/projects/foreman/wiki/Infrastructure_CDN

This depends on our goal. If our goal is “more availability of packages” then mirrors are great. If our goal is “make sure we have a backup of files” then a single rsync mirror, over the internal Rackspace network, is probably better :slight_smile:

I was thinking the former, although I know the long-term goal is to bring everything to Fastly and have it all served via the CDN. But even a CDN won’t help if you have a poor connection like me; an in-house mirror is a great thing to have.

We should definitely do the latter as well, just keep in mind that rsync ... --delete is anything but a proper backup :slight_smile: Snapshots are also quite bad backups; we need normal incremental backups which can help us with “hey, I think I deleted a file three months ago” problems. I personally like “duplicity” as it is trivial to set up.

Edit: Public mirrors and a CDN are complementary tools; if anyone is interested in maintaining a mirror, let’s have both.

We use Dirvish which is rsync based, and keeps a backup every day for 30 days, and the backup on the first of each month is kept for 6 months.

I have a server out there on Linode that I might be able to use for that.

But for that I have two questions:
A. What size is required for the whole mirror (not including nightly)?
B. Doesn’t rsync require SSH keys (or a user account) to copy things around?

If you scroll up, the size is there: up to 60 GB including nightlies. Rsync has its own protocol, which is often used for public content (our case), but since it is not secured, people tend to prefer the SSH wrapper. It’s not mandatory, though; just run my shell script and you will see.
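
If you want to check the size yourself before committing disk space, an rsync dry run against the anonymous module can report it without downloading anything. A rough sketch, assuming the same module names and excludes as the script above; `total_bytes` and `estimate_module` are names I made up for illustration.

```bash
#!/bin/bash
# Pull the byte count out of the "Total file size" line of `rsync --stats`.
# Non-digits are stripped because newer rsync versions print thousands separators.
total_bytes() {
  awk -F': ' '/^Total file size/ { gsub(/[^0-9]/, "", $2); print $2 }'
}

# Dry run (-n) against an empty temporary directory: nothing is written,
# but --stats still reports the total size of every file in the module.
# No SSH keys or account are involved, only the public rsync:// protocol.
estimate_module() {
  local empty
  empty=$(mktemp -d)
  rsync -rln --stats --exclude nightly --exclude scratch \
    "rsync://rsync.theforeman.org/$1" "$empty" | total_bytes
  rmdir "$empty"
}
```

Calling `estimate_module yum` (and likewise for `deb` and `downloads`) prints a size in bytes. Drop `--exclude nightly` to see the figure with nightlies included. The on-disk size can differ a little because of hard links and sparse files.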

:scream: My server is around 20 GB including the OS itself. I’ll play with stuff over the weekend and see what I can do.