Foreman At Large/Huge Scale?

Hey All,

Was hoping to connect with others who are using Foreman to manage thousands to tens of thousands of nodes.

I keep hearing Foreman can “scale” to 100k nodes or more, but I just keep running into scaling issues. I’m hoping to find people who can collaborate with me and share settings, sizing, etc.

I love Foreman, but lately I seem to be bouncing from one scaling issue to another, and I’m just looking for like-minded people to bounce ideas and solutions off of (I can’t be the only one with these issues!)

Current stats:
18k managed servers, 30 Puppet environments, ~900 classes per environment, 30-minute check-in. MySQL (MariaDB 10.2)

Current Issues:
DB Load:

  • Generating 10-50 million “log” table rows per day. Cleaning these up can take most of the day, deleting in batches of 75,000 to keep locks from lasting more than 10-15 seconds (see the sketch after this list).
  • The fact cleanup rake task takes about 6 hours to run (fact_names and fact_values are roughly 10 GB combined).
  • Keeping more than 7 days of reports is impossible. The DB grows to 100+ GB; in particular sources, logs, reports, and messages grow huge (logs alone once reached 30 GB) and cause tons of I/O and slowdown, because the queries against them (which run constantly for report uploads) can no longer fit in RAM. Cleanup then takes forever as well, since it now has to delete 200-300 million rows!
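
As a rough sketch of the batched cleanup described in the first bullet (the database/table names, the 7-day retention window, and the pause are assumptions; the reports:expire rake task is the supported way to do roughly the same thing):

# Delete log rows belonging to reports older than 7 days, 75k at a time,
# so no single statement holds locks for long.
while :; do
  deleted=$(mysql -N foreman -e "
    DELETE FROM logs
    WHERE report_id IN (SELECT id FROM reports WHERE created_at < NOW() - INTERVAL 7 DAY)
    LIMIT 75000;
    SELECT ROW_COUNT();")
  [ "$deleted" -eq 0 ] && break   # nothing left to delete
  sleep 5                         # brief pause between batches
done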

Puppet:
Class import is super slow. It can take 30 minutes to finish loading due to the number of classes and environments, and it appears to process environments serially rather than batching/threading them in parallel.

Upgrades:
Takes hours/days with many horizontal servers. The db:migrate/db:seed tasks run once per plugin on every single server; in our environment that means roughly 400 seed/migration runs per upgrade, and with our DB size each one can take 15-30 minutes: Feature #19228: Upgrade process with multiple servers is extremely time consuming (many many db:seed, db:migrate apipie cache runs). Foreman and I end up just… waiting, and waiting. One thought is to use an Ansible play to point each node’s database hostname at 127.0.0.1 in the hosts file so the RPM upgrade “fails” quickly on all but the first node (and then undo this afterwards), but we are concerned about other possible repercussions.
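
To illustrate that idea (purely a sketch; db.example.com stands in for whatever database host /etc/foreman/database.yml actually points at, and we have not validated the approach):

# On every server except the one that should really run the migrations:
echo "127.0.0.1 db.example.com" >> /etc/hosts
yum update                                  # post-install db:migrate/db:seed fail fast against the unreachable DB
sed -i '/db\.example\.com/d' /etc/hosts     # remove the override afterwards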

2 Likes

I think @TimoGoebel can perhaps provide some insight into their environment, as it is probably the biggest one I am aware of, with about 4000 systems and also many Puppet environments. They have split Katello into parts, which created its own problems but perhaps helped with performance.

Just patch the foreman-rake wrapper script to be a noop by adding exit 0 somewhere at the top of the file.
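
For example (a rough sketch, assuming the usual /usr/sbin/foreman-rake path from the RPM):

cp /usr/sbin/foreman-rake /usr/sbin/foreman-rake.orig
sed -i '2i exit 0' /usr/sbin/foreman-rake     # turn the wrapper into a no-op right after the shebang
# ...run the updates...
mv /usr/sbin/foreman-rake.orig /usr/sbin/foreman-rake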

Have you tried postgres?

Me, too. But scaling can be hard. Maybe @lzap has some ideas.

@TimoGoebel

We are absolutely looking to move to PostgreSQL within the next 6-12 months. I saw that the project is dropping MySQL support by 1.24, so that is now more pressing on our end. I’m concerned because our DBAs don’t have much PostgreSQL knowledge yet (versus MySQL), but now that it’s “forced” we can make it happen!
Would you be able to share your PostgreSQL conf settings and sizing? I know PostgreSQL requires much less tweaking overall than MySQL, but I know a few things (like memory/cache settings) will need to be set to ensure the DB is performant.

So, as for “patching” the foreman-rake wrapper script, I guess I’m wondering how that would work?
I could see:

  1. Update foreman-rake to have an “exit 0” at the top of the script
  2. sudo yum update command from the upgrade docs
  3. RPM installation overwrites the foreman-rake command
  4. RPM post-install tasks kick off db:migrate and db:seed, which still take forever

UNLESS editing foreman-rake would cause the RPM installer to treat it like a config file and lay down the new foreman-rake file as foreman-rake.rpmnew?

In general, will PostgreSQL bring “performance increases” over MySQL? Any gotchas with regard to PostgreSQL at scale with Foreman that you can think of, or that come to mind?

Thanks in advance for your reply, I really appreciate your time and insight.

Thanks so much for raising this topic. I am fighting hard in the community for changes in these areas, but it’s not seen as a priority. When I first learned that we store reports, audits, and facts in the RDBMS in normal form, I was not sure it was the best thing to do.

First, I don’t think that audits and reports belong in an RDBMS at all. In my opinion, audits should go into a log file or the system journal, where they could easily be archived for long-term record keeping.

Second, although reports do not necessarily belong there either, I can imagine storing them in the RDBMS, just not in the way we currently do. We break reports into individual lines and store them separately; we extract the “resource” part of each line, and everything is stored alongside a SHA hash. This creates unnecessary stress on the RDBMS.

Third, we also break fact values and names into normal form, which is pretty heavy on resources too. Frequent fact updates put another layer of stress on the RDBMS, particularly if you have misbehaving hosts which change facts very often (container hosts with many mountpoints or network interfaces).

These three domains have something in common: we store data in a read/search-friendly way, which comes at the cost of being update-unfriendly. Of course, everything has its price:

  • Storing audits outside the RDBMS would drop taxonomy information and limit presentation capabilities. Therefore I think it could be reasonable to keep only the last few days/weeks of audits in the RDBMS and move the rest to log files.
  • Changing report storage to a single simple text/blob column would dramatically reduce resource usage (dropping gigabytes of indices); however, reports would no longer be searchable per resource. Today you can easily click on a Puppet resource and see all other messages related to it. The question is: do we really need this feature? Do you use it? Because if not, this could be a pretty straightforward change, although with a painful upgrade process, I have to admit (a long-running migration script).
  • Fact importers have been given decent attention over the last couple of releases. I am currently working on a patch that should give some instances decent relief when many facts are reported (https://github.com/theforeman/foreman/pull/6850), and I am also planning to leverage the Rails cache a bit more in this area to boost the performance of fact importing.

I hereby call for action! Let’s stop adding new features and focus on performance! We need to do it and we need to do it fast because we do not want anybody to stop using Foreman because of its scaling limitations.

3 Likes

Hey @lzap

I obviously agree with your call to focus on performance, because it selfishly affects ME! However, I also realistically understand that the majority of Foreman consumers probably have 1000 servers or fewer.

Do you think that simply “switching” to PostgreSQL might implicitly help with some of the RDBMS “stress” that the sources/logs/messages/audits/reports tables create? Does Postgres “handle” that type of operation better, or is it perhaps optimized more in some way (Rails prefers it?)?

"Today you can easily click on Puppet Resource and see all other messages related to it. " - I have never done this, and can’t even figure out where it is or how to do it even now. I searched “classes” and can see a host(s) “reports” but was unaware, and still cannot figure out how to search via puppet resource specifically?

Audits in an RDBMS: I agree they don’t need to be in the database. I’d think most orgs already have Splunk, ELK, or similar, and could process/parse/display millions to billions of audits far more efficiently that way. Perhaps make it an option: default to the RDBMS, but allow it to be configured to dump to something like /var/log/foreman/audits.log. It might get interesting with many Foreman instances in a cluster (how would that work?), and there might need to be a page saying “check the local filesystem” for someone accessing foreman.com/audits or its API equivalent (or it could just parse the local log file and display it), but otherwise it might work?

Reports: being able to store a report as a single BLOB entry/row, versus one row per item/entry, would be extremely valuable at scale. I can see the flip side, though; logistically I’m not sure it makes sense to offer that as a “switch”, given how many hooks and how much complexity it might add, but I can definitely say the current setup does NOT scale (at least for me, and how I’ve been doing it). I am not sure I can make it past 25k nodes, and I have 18 Puppetmasters with 24 cores and 96 GB RAM each, and my database host has 16 cores and 64 GB RAM, so this is definitely not a tiny environment.

I am working on getting us upgraded (we are currently on 1.19.1), but it is understandably difficult to get 24-48h “outage windows” due to the length of time spent waiting for all the rake and migrate tasks to run their course. That being said, I’m hopeful that a hosts-file entry, or a foreman-rake short-circuit as mentioned above, “might” be a decent, if screwy, workaround in the short term (once I understand in greater detail how to attempt it)…

Regardless of what changes can or could be made to Foreman core, I’m still hoping to connect with more people who DO operate Foreman at scale, to know that I’m not alone, AND to collaborate with them on issues/fixes that may be unique to larger environments!

We have also struggled with similar issues. Our environment is around 8500 nodes with 30 or so environments and a similar class count per environment.
My initial thought when reading your post, @Lang_Jason, was that a 30-minute check-in is very rapid for an environment of that size, especially if you don’t filter out facts that change on every check-in (for example, mount stats). We have our workstation and general server-class machines on an hourly check-in, and a large section (thousands of nodes) of our data center nodes (identical, fairly static configuration) on a two-hour check-in, due to load issues on the server and also to reduce the load footprint of the puppet agent on the client nodes.
I should note that we moved all but a few of our parameters into Hiera, primarily so that the data is under version control, but the decision was also influenced by performance issues we were having at the time with smart class parameter lookups causing huge catalog compile times. That issue turned out to be a bug that was fixed, but we had to roll back our whole environment at the time. It’s something that only large-scale deployments would have picked up on, though, I think.

I don’t have the database stats to hand, but I will update later; we are on Postgres though. We definitely don’t have the issues you note with upgrades and rake tasks. Some of our pain points are: class/environment import, slow fact searches, audit page timeouts, and Puppet CA proxy page timeouts.

I would also put my hand up in support of a performance focus period and we can pitch in with feedback and testing where our resources allow.

Hey Matt,

Generally I agree that lowering my check-in frequency to every 2-3 hours would greatly alleviate the issues. I’ve stubbornly “refused” because I like the tighter compliance of 30 minutes versus 3 hours, and my issues at this point are only annoyances. Foreman still “works”, but that is my last resort!

I haven’t seen performance issues with smart class params other than the 1.18 bug, which thankfully was fixed :slight_smile:

Regarding the upgrade slowness: how many Foreman servers do you have? With just one it’s less noticeable, but with many, the upgrade time is multiplied per server, and you can’t really run many in parallel because most of the queries are blocking/locking, so they just queue for hours if many nodes are running the migrate/seed tasks as part of the RPM post-install…

I can also say class import is slow (15-30 minutes for us currently), and it often takes several tries because hitting the (too low) timeout on 1 environment out of 30 causes the whole thing to bomb :frowning:

We reduced our audit page timeouts by clearing the audit logs via the audits:clean rake task, but we can now only keep about 1 month (which is still something like 250k rows in the table). This lets the audit pages load, but still fairly slowly…

Fact searches are super slow, and recently we have also seen parameter searches become extremely slow. My smaller test environment can still return a detailed param search in 2-3 seconds, but my prod environment simply never finishes complex param searches anymore. I’m hopeful that this “may” be an index/optimization issue specific to MySQL, but I’m not really sure at this point. The thread “Host Params API Lookups Timeout and will not return” has more details/specifics on what I’ve observed on my end.

We haven’t ever gotten the Puppet CA Proxy Page to load - we just use the puppet cert CLI command(s) when necessary - which is barely ever :slight_smile:

I’m very interested in a push for performance and optimisation for a period - long term I’m hoping to run Foreman against 500+ nodes. Obviously that’s not the same kind of large scale as you guys but it would be nice if that worked “out of the box”.

Honestly, I don’t think so; actually, I feel it could be a little bit worse, as PostgreSQL is a very robust system. But over the last couple of releases they have been pushing hard on performance.

Red Hat Satellite consultants have the most experience with scaling Foreman to its limits, to be honest. There are huge customers with tens of thousands of managed hosts, but it does not come for free: the VMs are huge, there are lots of Capsules, options are finely tuned, and special ops care is needed.

Therefore, after you migrate to PostgreSQL, I suggest reading the Performance Tuning Guide, which I think is free to download from the Red Hat documentation site. It contains some important tricks and measurements for boosting performance a bit.

Can other folks share their experience with searching reports by resource? Because if this turns out not to be very useful, nobody will stop me from refactoring how we store reports. I strongly believe this creates probably 50% of the load on large-scale deployments with Puppet.

What is actually very new is documentation on how to set things up with ELK: https://github.com/lzap/foreman-elasticsearch. We actually send all audit records to the journal/syslog/ELK today. I believe we should at least provide an option to turn off auditing into the RDBMS, because if logging is configured properly (either sending logs to Elasticsearch or storing audit records in a separate file named foreman-audit.log), then it can do the job pretty nicely.

I believe that most of the time is actually consumed by the huge amount of reports and audits in the database, which makes everything slow. Can you do a little test? Would you mind dropping all reports via the TRUNCATE DDL command (including all entries from the logs, resources, and join tables) and then doing the migration? I would be interested in how much of a difference this makes. Maybe we can document this for users who don’t mind losing their reports until we fix this.

Can you elaborate? Would you mind reconfiguring Postgres to log slow SQL queries and sharing which ones were slow?
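
One way to capture them, as a sketch (assuming superuser access to the database server; the 1-second threshold is arbitrary):

psql -U postgres -c "ALTER SYSTEM SET log_min_duration_statement = 1000;"   # milliseconds
psql -U postgres -c "SELECT pg_reload_conf();"
# any statement running longer than 1 second now gets written to the PostgreSQL log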

Ditto.

I see that, for example, there is a thread about slow fact-search performance; the problem is that Red Hat employees usually only have access to customer database dumps, which are PostgreSQL. If you have such a problem, the only way forward is trial-and-error questioning. I can install Foreman with MySQL; the problem is I don’t have the data. Migration is perhaps one way out. Anyway, dropping MySQL will be pretty useful for the future.

Looks like it’s paywalled: https://access.redhat.com/solutions/4224211

You can create a free / gratis account at https://developers.redhat.com/ to access that Red Hat document.

1 Like

Thanks for this link! Luckily we are a pretty big Red Hat customer overall, so I’m able to view this without issue.

I’ve been through older versions of this previously. I’m working on implementing the prefork options (we currently just have the defaults set there), and the PostgreSQL sizing guidance will be useful as a starting point for sure! The rest of the items we “already have”.

For our next upgrade, I can attempt to run
truncate reports;
truncate messages;
truncate sources;
truncate logs;
truncate audits;
prior to executing the upgrade, in the hope that it will speed up the migrate and seed tasks!

Practically speaking I’m a few weeks/month away from being able to schedule/do so, with several nonprod environments to try against first - but will definitely report back!

~Jason Lang

1 Like

Sorry for my brief reply, I just want to get this info out there fast.

Sure, I’ve copied the relevant settings to a gist. The VM has 24 GB RAM and 8 cores.

It speeds up the process for plugin updates, not for core.

Thanks. An important thing for PostgreSQL is to perform VACUUM after a bunch of records are deleted, to reclaim space. We’ve actually added this as an upgrade step recently.
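
For example, against the tables that shrink the most after a report/audit cleanup (table and database names assumed):

for t in logs messages sources reports audits; do
  psql -d foreman -c "VACUUM (VERBOSE, ANALYZE) $t;"
done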

Correction: VACUUM is unnecessary if the data was dropped via a TRUNCATE statement.

Hi, sorry, I’m still meaning to reply with the answers you’ve requested here, with actual empirical data; we should be able to give you examples of long queries etc. from our live database. It’s been a bit of an epic week here, so I’ve had to prioritise.

@Lang_Jason We have 6 Foreman-only servers in a cluster, with 8 Puppet compile nodes and an additional Foreman/Puppet server running as the CA, all behind an F5. The back end is 2 memcached servers and an active/passive Postgres pair behind HAProxy, with Patroni and Barman for PITR.
We are running Ubuntu here, and the Debian packages don’t seem to do the same thing you are reporting, although I would expect they run the same rake tasks as post-install tasks.
I normally shut down the F5 virtual servers for Puppet and then run the update manually, one server at a time, following the instructions in the Foreman docs. I don’t run the installer at the end; I just run Puppet against the servers, but I do run the rake tasks manually.

Class import is very slow for us too; we now only import classes one environment at a time, which works pretty reliably, whereas trying to do the whole thing at once always fails. It’s not normally a massive issue for us if the classes in Foreman aren’t up to date, though, because only one of our class includes comes from the Foreman ENC and everything else is inherited from that via Hiera. We use a git hook to import new environments on branch creation, so imports only really need to be done again if our Foreman-included class is refactored in some way.

There is a well-hidden feature to ignore classes during import to trim it down. The Foreman :: Manual has a section on ignoring classes on import. This can speed up the process, especially if you only care about a few classes. The side benefit is that the UI becomes much more usable, since the list is much smaller.
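
Roughly what that looks like, as a sketch (the path assumes an RPM install; the exact filter syntax is described in the manual section mentioned above):

cat > /usr/share/foreman/config/ignored_environments.yml <<'EOF'
:filters:
  - !ruby/regexp '/^apache::params$/'   # ignore one specific class
  - !ruby/regexp '/::params$/'          # or anything matching a pattern
EOF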

https://tickets.puppetlabs.com/browse/PUP-523 would be interesting as well. It would allow module authors to mark classes as private and the Puppet REST API would expose this. Then we could ignore those classes during import automatically. Given the activity I’m not very hopeful on a quick implementation.

I’m late to the party.

We run Satellite and we’re in the above category. We have around 40k hosts under management with 29 Capsules. We run Puppet, with probably around 30-40 classes.

Our main Satellite master has 24 CPUs and 256 GB of RAM. We’ve heavily trimmed some of the fact imports.

The Satellite server is simply not capable of dealing with all of our Puppet clients checking in every 30 minutes. Instead, we have a Puppet class which configures the agent to run via cron and then chooses a schedule per environment.
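
Purely to illustrate the pattern (this is not their actual class): derive a stable per-host minute to splay the runs and have cron call the agent once, e.g. every 2 hours:

minute=$(( $(hostname -f | cksum | cut -d' ' -f1) % 60 ))   # stable per-host splay
cat > /etc/cron.d/puppet-agent <<EOF
$minute */2 * * * root /opt/puppetlabs/bin/puppet agent --onetime --no-daemonize
EOF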

One thing that is worth looking at is queries with no indexes; adding an index can be a cheap fix offering decent performance gains.
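
One cheap way to find candidates (assuming the default foreman database name) is to look for tables with lots of sequential scans relative to index scans:

psql -d foreman -c "SELECT relname, seq_scan, seq_tup_read, idx_scan
                      FROM pg_stat_user_tables
                     ORDER BY seq_tup_read DESC LIMIT 20;"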

Postgres needs to be tuned with regard to VACUUM settings, work_mem, and shared_buffers.
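
As an illustration only (real values depend on RAM and workload; these assume a dedicated database host with plenty of memory):

psql -U postgres <<'SQL'
ALTER SYSTEM SET shared_buffers = '16GB';                  -- requires a restart
ALTER SYSTEM SET work_mem = '32MB';
ALTER SYSTEM SET maintenance_work_mem = '1GB';
ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.05;    -- vacuum large tables more aggressively
SQL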

Thanks, Andrew, for the info. I believe that the main reason for that load is how we store reports. I haven’t tested it yet, but if we change reports to simple text/blob storage, Foreman and Postgres resource usage will go down significantly.

I have some things on my plate but I want to find some time over this summer. Stay tuned, patches incoming. The only thing that worries me is upgrade/migration - getting all those reports converted from logs/sources/messages tables to a single table column report.contents can take hours and hours. I was thinking maybe truncating them could be an alternative approach for those who don’t care loosing reports for a short period of time until clients checks in again.