I was hoping to connect with others who are using Foreman to manage thousands to tens of thousands of nodes.
I keep hearing Foreman can “scale” to 100k nodes or more, but I just keep running into “scaling issues”. I’m hoping to find people who can collaborate with me and share settings, sizing, etc.
I love Foreman, but lately I seem to be bouncing from one scaling issue to another, and I’m just looking for like-minded people to bounce ideas and solutions off of (I can’t be the only one with these issues!).
Our environment: 18k managed servers, 30 Puppet environments, ~900 classes per environment, 30-minute check-in interval, MySQL (MariaDB 10.2).
- We generate 10-50 million “logs” table rows per day. Cleaning this up can take literally most of the day, deleting in batches of 75,000 rows to keep any single lock from lasting more than 10-15 seconds (rough sketch of the batched delete after this list).
- The fact cleanup rake task takes around 6 hours to run (fact_names and fact_values are roughly 10 GB combined).
- Keeping more than 7 days of reports is impossible: the DB grows to 100+ GB, and more specifically the sources, logs, reports, and messages tables grow huge (logs alone got up to 30 GB once). That causes a ton of I/O and slowdown because queries against those tables (which happen constantly for report uploads) can no longer fit in RAM, and cleanup takes forever as well since it now has to clear out 200-300 million rows! (The size query after this list is how we spot the worst offenders.)
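For context, this is roughly the shape of the batched cleanup we run against the logs table. It’s a minimal sketch, assuming the stock Foreman schema (logs.report_id pointing at reports.id, reports.created_at for age); the 7-day cutoff is just an example:

```sql
-- One cleanup batch: delete log rows belonging to reports older than 7 days.
-- We run this repeatedly (with a short sleep between batches) until it
-- affects 0 rows, so no single statement holds locks for more than ~10-15s.
DELETE FROM logs
WHERE report_id IN (
    SELECT id
    FROM reports
    WHERE created_at < NOW() - INTERVAL 7 DAY
)
LIMIT 75000;
```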
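And this is the quick check we use to see which tables are eating the space (it assumes the database is named foreman; InnoDB row counts and sizes in information_schema are approximate):

```sql
-- Top 10 tables by on-disk size (data + indexes), in GB.
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 1) AS size_gb,
       table_rows
FROM information_schema.tables
WHERE table_schema = 'foreman'
ORDER BY data_length + index_length DESC
LIMIT 10;
```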
Class import is super slow: it can take 30 minutes to finish loading due to the number of classes and environments, and it appears to process environments in series rather than batching/threading them in parallel.
Upgrades take hours/days with many horizontal servers. The DB migrate/seed tasks run once per plugin on every single server, which in our environment means roughly 400 seed/migration runs per upgrade, and with our DB size each one can take 15-30 minutes (Feature #19228: Upgrade process with multiple servers is extremely time consuming (many many db:seed, db:migrate apipie cache runs) - Foreman). I end up just… waiting, and waiting. We have thought about using an Ansible play to point each node’s database URL at 127.0.0.1 via the hosts file so the RPM upgrade “fails” quickly on all but the first node (and then undoing this afterwards), but we are concerned about other possible repercussions from that.