Host restarting apparatus using Foreman with Ansible (and Katello, but this can be removed from project)

birkirf · January 14, 2020, 1:14pm

Just wanted to share a project I’ve worked on for my work:
https://gitlab.com/birkir.freyr.hjartarson/host_restarter

As the name suggests, this program creates a schedule and updates/reboots hosts, and it can run pre- and post-tasks for both the update and the reboot. The code is probably a bit of a mess for actual developers, since I would only classify myself as a hobby coder and not an actual developer.

At the moment it only works with Red Hat based systems and the Check_MK monitoring system - the restart process checks whether the host (and its group) is in a non-critical state, and only if all hosts are non-critical does it actually perform the update/reboot tasks, since I wouldn’t want it to reboot a system that is already broken.

All suggestions are welcome, though I don’t have a lot of time to spend on this project anymore. Merge/pull requests are also welcome.

Regards,
Birkir Freyr

ekohl · February 20, 2020, 8:03pm

You should also read up on https://github.com/dm-drogeriemarkt/foreman_dlm which can be useful when you don’t want all your hosts to restart at once. I don’t know if your script does the same thing since I need to log in to see the project and don’t have an account.

birkirf · February 20, 2020, 8:36pm

Apologies, apparently I jumped the gun a bit and had to lock down the repository while getting one last go-ahead from my boss to allow me to share the program. Will update the post as soon as it’s available again.

That being said, it is written so that you run a schedule-creator that iterates through Foreman hosts and creates a cron file with an entry for each host for a given month, each host being given its own time to update/reboot.
By default the time params can be set to ‘random’, and the scheduler will then create an entry in its config file with a random weekday, week, hour and minute (stored for consistency in reboot times each month).
If multiple hosts in a group happen to hit the same date/time, it will offset the later one by 15 minutes until there isn’t a collision within the group.
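To illustrate the random slot plus 15-minute offset idea, here is a minimal Python sketch. It is not taken from the project: the `Slot` class, the function names, the four-week cycle and the wrap-around at the end of the cycle are all assumptions made for the example.

```python
import random
from dataclasses import dataclass

OFFSET_MINUTES = 15                     # shift applied on a collision, per the post
WEEKS = 4                               # assumption: only weeks 1-4 of a month are used
MINUTES_PER_CYCLE = WEEKS * 7 * 24 * 60


@dataclass(frozen=True)
class Slot:
    """Hypothetical schedule entry: week of month, weekday, hour, minute."""
    week: int      # 1..WEEKS
    weekday: int   # 0..6, Monday = 0
    hour: int      # 0..23
    minute: int    # 0..59

    def shifted(self, minutes: int) -> "Slot":
        """Return this slot moved forward, wrapping around after the last week."""
        total = (((self.week - 1) * 7 + self.weekday) * 24 + self.hour) * 60
        total = (total + self.minute + minutes) % MINUTES_PER_CYCLE
        days, rest = divmod(total, 24 * 60)
        week, weekday = divmod(days, 7)
        hour, minute = divmod(rest, 60)
        return Slot(week + 1, weekday, hour, minute)


def random_slot(rng: random.Random) -> Slot:
    """Pick a random week, weekday, hour and minute."""
    return Slot(rng.randint(1, WEEKS), rng.randint(0, 6),
                rng.randint(0, 23), rng.randint(0, 59))


def resolve_collision(slot: Slot, taken: set) -> Slot:
    """Push the slot forward in 15-minute steps until it is free in the group."""
    while slot in taken:
        slot = slot.shifted(OFFSET_MINUTES)
    return slot


def assign_slots(groups: dict, rng: random.Random = None) -> dict:
    """Map every host to a slot; collisions are only checked within a group."""
    rng = rng or random.Random()
    schedule = {}
    for hosts in groups.values():
        taken = set()
        for host in hosts:
            slot = resolve_collision(random_slot(rng), taken)
            taken.add(slot)
            schedule[host] = slot
    return schedule
```

Calling `assign_slots({"web": ["web1", "web2"], "db": ["db1"]})` returns one slot per host. Saving that mapping in a config file is what would keep the reboot times the same from month to month.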

When the cron job triggers for a given host, the program starts by gathering some info and asks a monitoring system (currently only Check_MK is implemented) whether the host in question or any host in its group has checks in Critical state. If one is found, the process spits out an error and calls it quits for that job (this was decided internally so the update/reboot can’t completely cripple an already compromised service).
If everything checks out, a downtime is created in the monitoring system (30 minutes by default), then the host is updated and rebooted once the updates finish. Pre- and post-tasks can be run for both update and reboot if required, and update and reboot can each be enabled or disabled separately on a per-host basis. When the reboot job finishes the downtime gets removed, and if everything went smoothly, no one notices anything (the best kind of automation).
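The flow described above could look roughly like the Python sketch below. It is only an illustration under assumptions: the `Monitoring` interface, `run_job` and `GroupNotHealthy` are hypothetical names, nothing here is the real Check_MK API, and the update/reboot steps are passed in as plain callables instead of Ansible runs.

```python
from typing import Callable, Iterable, Protocol, Sequence

Task = Callable[[str], None]   # hypothetical: one step that acts on a host name


class Monitoring(Protocol):
    """Hypothetical monitoring interface; not the Check_MK API."""
    def has_critical(self, host: str) -> bool: ...
    def set_downtime(self, host: str, minutes: int) -> None: ...
    def remove_downtime(self, host: str) -> None: ...


class GroupNotHealthy(RuntimeError):
    """Raised when the host or a group member has a Critical check."""


def run_job(host: str,
            group: Iterable[str],
            monitoring: Monitoring,
            update_steps: Sequence[Task] = (),
            reboot_steps: Sequence[Task] = (),
            downtime_minutes: int = 30) -> None:
    # 1. Refuse to touch anything if the host or its group is already broken.
    for member in dict.fromkeys([host, *group]):   # de-duplicate, keep order
        if monitoring.has_critical(member):
            raise GroupNotHealthy(
                f"{member} has critical checks, skipping {host}")

    # 2. Silence alerts for the maintenance window.
    monitoring.set_downtime(host, downtime_minutes)

    # 3. Run pre/main/post steps for update, then for reboot. An empty
    #    sequence stands in for "disabled for this host".
    for step in (*update_steps, *reboot_steps):
        step(host)

    # 4. Lift the downtime once everything finished. What happens when a step
    #    fails is not described in the post; here the downtime is simply left
    #    to expire on its own.
    monitoring.remove_downtime(host)
```

A host with updates disabled would be run as `run_job(host, group, monitoring, reboot_steps=[pre_reboot, reboot, post_reboot])`, where the three step functions are whatever callables you supply.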

There are still some things that I’m fixing and improving, but all in all it seems to be working fairly well. It has been running on all non-production hosts since September and has been mostly stable since then. After a few more fixes we intend to let this thing loose on all machines in Foreman (currently around 190 and constantly growing).