Memory recycler in the age of Sidekiq

A cross post from this gist for visibility.

Foreman-tasks shipped with an optional memory recycler which would restart dynflow executors (processes inside the dynflowd service) when they consumed too much memory. This was a double-edged sword. It reclaimed the memory, but it could easily leave tasks sitting around in paused state with abnormal termination errors.

Starting with the move to sidekiq, the memory recycler is gone. Luckily, by splitting the executor into several systemd services, we can leverage the resource control features[1] provided by systemd and cgroups to fill the feature gap created by removal of the memory recycler.

Prerequisites

Before we get to memory limiting, let’s take a look at the default state. The orchestrator, worker and worker-hosts-queue processes run as instances of the dynflow-sidekiq@.service template service. As we can see from the output, systemd neither tracks how much memory the instances are using nor places any limits on their memory usage.

# systemctl status 'dynflow-sidekiq@*' | grep -e Memory -e '^. dynflow-sidekiq@.*.service'
● dynflow-sidekiq@worker-hosts-queue.service - Foreman jobs daemon - worker-hosts-queue on sidekiq
● dynflow-sidekiq@worker.service - Foreman jobs daemon - worker on sidekiq
● dynflow-sidekiq@orchestrator.service - Foreman jobs daemon - orchestrator on sidekiq

Enabling memory accounting

In order for the resource control to work, we first have to enable resource accounting for the unit. Please note this needs to be done only once.

# mkdir -p /etc/systemd/system/dynflow-sidekiq@.service.d
# cat <<EOF > /etc/systemd/system/dynflow-sidekiq@.service.d/memory-accounting.conf
[Service]
MemoryAccounting=yes
EOF

Make systemd reload the service definitions and restart all the dynflow-sidekiq services.

# systemctl daemon-reload
# systemctl restart 'dynflow-sidekiq@*'

If we take a look at the status of the services, we should see that systemd started tracking how much memory each of the services uses.

# systemctl status 'dynflow-sidekiq@*' | grep -e Memory -e '^. dynflow-sidekiq@.*.service'
● dynflow-sidekiq@worker-hosts-queue.service - Foreman jobs daemon - worker-hosts-queue on sidekiq
   Memory: 263.1M
● dynflow-sidekiq@worker.service - Foreman jobs daemon - worker on sidekiq
   Memory: 264.1M
● dynflow-sidekiq@orchestrator.service - Foreman jobs daemon - orchestrator on sidekiq
   Memory: 264.5M
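
For scripting or monitoring, the same figure can also be read in a machine-friendly form. The following is a minimal sketch, assuming the installed systemd exposes the MemoryCurrent property (in bytes) through systemctl show. The helper name memory_current_mib is made up for this example.

```shell
# Hypothetical helper: reads "Key=Value" lines as printed by `systemctl show`
# on stdin and prints the MemoryCurrent value converted from bytes to MiB.
memory_current_mib() {
    awk -F= '$1 == "MemoryCurrent" { printf "%.1f\n", $2 / 1048576 }'
}

# Intended usage on a real host (assumes memory accounting is already enabled):
#   systemctl show -p MemoryCurrent dynflow-sidekiq@worker | memory_current_mib
```

The helper only parses text, so it can be tried out by piping a sample line such as MemoryCurrent=268435456 into it.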

Now that the memory accounting is set up, we can move to the actual limiting.

Limiting memory usage

Here we have two options. The limit can either be set individually for each instance, or on the template level, in which case it applies to all instances. If set on both levels, the per-instance setting takes precedence over the template one.

Setting memory limit globally

For example, to set a global limit of 2 gigabytes, the following snippet could be used.

# mkdir -p /etc/systemd/system/dynflow-sidekiq@.service.d
# cat <<EOF > /etc/systemd/system/dynflow-sidekiq@.service.d/memory-limit.conf
[Service]
MemoryLimit=2G
EOF
# systemctl daemon-reload
# systemctl restart 'dynflow-sidekiq@*'

Now we can check the output of systemctl status to see the limit is applied.

# systemctl status 'dynflow-sidekiq@*' | grep -e Memory -e '^. dynflow-sidekiq@.*.service'
● dynflow-sidekiq@worker-hosts-queue.service - Foreman jobs daemon - worker-hosts-queue on sidekiq
   Memory: 490.1M (limit: 2.0G)
● dynflow-sidekiq@worker.service - Foreman jobs daemon - worker on sidekiq
   Memory: 490.1M (limit: 2.0G)
● dynflow-sidekiq@orchestrator.service - Foreman jobs daemon - orchestrator on sidekiq
   Memory: 492.5M (limit: 2.0G)

Setting memory limit per instance

To apply per-instance overrides, the approach is the same, only the path is a bit different. For example, to increase the limit for the worker to 4 gigabytes, the following snippet could be used.

# mkdir -p /etc/systemd/system/dynflow-sidekiq@worker.service.d
# cat <<EOF > /etc/systemd/system/dynflow-sidekiq@worker.service.d/memory-limit.conf
[Service]
MemoryLimit=4G
EOF
# systemctl daemon-reload
# systemctl restart dynflow-sidekiq@worker

We use the same command to check that the per-instance setting overrides the template one.

# systemctl status 'dynflow-sidekiq@*' | grep -e Memory -e '^. dynflow-sidekiq@.*.service'
● dynflow-sidekiq@worker-hosts-queue.service - Foreman jobs daemon - worker-hosts-queue on sidekiq
   Memory: 490.9M (limit: 2.0G)
● dynflow-sidekiq@worker.service - Foreman jobs daemon - worker on sidekiq
   Memory: 175.6M (limit: 4.0G)
● dynflow-sidekiq@orchestrator.service - Foreman jobs daemon - orchestrator on sidekiq
   Memory: 494.2M (limit: 2.0G)

Memory limiting in action

We have enabled memory accounting and set up the limits, but what happens when a limit is actually reached? Sadly, nothing too sophisticated: once the limit is reached, the service in question is killed with SIGKILL.

# journalctl -u dynflow-sidekiq@worker
Jul 27 08:57:34 foreman.example.com systemd[1]: Started Foreman jobs daemon - worker on sidekiq.
Jul 27 08:57:36 foreman.example.com dynflow-sidekiq@worker[10878]: 2020-07-27T12:57:36.683Z 10878 TID-43k2m INFO: GitLab reliable fetch activated!
Jul 27 08:57:36 foreman.example.com dynflow-sidekiq@worker[10878]: 2020-07-27T12:57:36.731Z 10878 TID-d3w7u INFO: Booting Sidekiq 5.2.7 with redis options {:id=>"Sidekiq-server-PID-10878", :url=>"redis://localhost:6379/0"}
----- B< ----- SNIP ----- B< -----
Jul 27 08:58:39 foreman.example.com systemd[1]: dynflow-sidekiq@worker.service: main process exited, code=killed, status=9/KILL
Jul 27 08:58:39 foreman.example.com systemd[1]: Unit dynflow-sidekiq@worker.service entered failed state.
Jul 27 08:58:39 foreman.example.com systemd[1]: dynflow-sidekiq@worker.service failed.
Jul 27 08:58:40 foreman.example.com systemd[1]: dynflow-sidekiq@worker.service holdoff time over, scheduling restart.
Jul 27 08:58:40 foreman.example.com systemd[1]: Stopped Foreman jobs daemon - worker on sidekiq.
Jul 27 08:58:40 foreman.example.com systemd[1]: Started Foreman jobs daemon - worker on sidekiq.

Since the service is set to restart on non-graceful shutdowns, systemd restarts the freshly killed service after the holdoff time is over.
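
To get a rough idea of how often this happens, the journal can be searched for the kill message shown above. This is only a sketch: the helper name count_sigkills is made up, and a SIGKILL can also come from other sources (for example a manual kill -9), so the count is an upper bound on limit-triggered kills.

```shell
# Hypothetical helper: counts SIGKILL terminations in journal output on stdin.
# The match is case-insensitive because the exact capitalisation of the systemd
# message may differ between versions (an assumption, adjust as needed).
count_sigkills() {
    grep -ci 'main process exited, code=killed, status=9/KILL'
}

# Intended usage on a real host:
#   journalctl -u dynflow-sidekiq@worker --since today | count_sigkills
```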

At the time of writing this, the version of systemd on EL7 was 219. Newer versions of systemd promise better handling of memory management with MemoryMax, MemoryHigh and MemoryLow.
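
As a rough sketch of what that could look like on a newer systemd with the unified cgroup hierarchy, a drop-in might combine a soft and a hard limit. The values are purely illustrative and this has not been tested with Foreman. The DROPIN_DIR variable is an assumption of this example, so that the snippet writes to a scratch directory unless told otherwise.

```shell
# Sketch only: written to a scratch directory by default so it is safe to try.
# On a real host, DROPIN_DIR would be /etc/systemd/system/dynflow-sidekiq@.service.d,
# followed by `systemctl daemon-reload` and a restart of the services.
dropin_dir=${DROPIN_DIR:-./dynflow-sidekiq@.service.d}
mkdir -p "$dropin_dir"
cat <<EOF > "$dropin_dir/memory-limit.conf"
[Service]
# Above this value the service is throttled and its memory reclaimed aggressively
MemoryHigh=1536M
# Hard cap; if usage cannot be kept below it, the OOM killer is invoked in the unit
MemoryMax=2G
EOF
```

The idea would be that MemoryHigh slows a growing worker down before MemoryMax kills it, which leaves some room to react.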

Recovery after worker is killed

When the worker is killed, it may or may not have been processing one or more jobs. If we used Sidekiq as-is, those jobs would be lost. For this reason, we are using gitlab-sidekiq-fetcher which implements the reliable fetch pattern. When a worker starts executing a job, it takes the job from a queue and puts it onto a working queue. When the job is finished, the worker removes the job from the working queue. However if the worker is killed while executing a job, the job stays in the working queue. Once per hour, the working queue is checked and any orphaned jobs there are removed from the working queue, requeued to the original queue and executed again.
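
The queue transitions described above can be illustrated with a toy model. This is not how gitlab-sidekiq-fetcher is implemented (it works on Redis lists); it is an in-memory sketch using space-separated shell variables, with made-up function names, and it treats everything left in the working queue as orphaned.

```shell
# Toy model of the reliable fetch pattern; job names must not contain spaces.
QUEUE="job1 job2 job3"   # jobs waiting to be executed
WORKING=""               # jobs currently being executed

# Move the first job from QUEUE to WORKING and store its name in JOB.
fetch_job() {
    [ -n "$QUEUE" ] || return 1
    JOB=${QUEUE%% *}
    case "$QUEUE" in
        *" "*) QUEUE=${QUEUE#* } ;;
        *)     QUEUE="" ;;
    esac
    WORKING="${WORKING:+$WORKING }$JOB"
}

# Remove a finished job ($1) from WORKING.
finish_job() {
    new=""
    for j in $WORKING; do
        [ "$j" = "$1" ] || new="${new:+$new }$j"
    done
    WORKING=$new
}

# Periodic cleanup: push whatever is left in WORKING back to the end of QUEUE.
requeue_orphans() {
    for j in $WORKING; do
        QUEUE="${QUEUE:+$QUEUE }$j"
    done
    WORKING=""
}

fetch_job            # a worker picks up job1
finish_job "$JOB"    # job1 completes normally
fetch_job            # a worker picks up job2 and is then killed
requeue_orphans      # the cleanup finds job2 orphaned and requeues it
echo "queue: $QUEUE | working: $WORKING"   # prints "queue: job3 job2 | working: "
```

Because job2 never reached finish_job, it survives the kill in the working queue and ends up back in the original queue.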

When the job gets executed by a worker for the second time, Dynflow notices it already tried to execute the job, turns the step into error state and doesn’t really execute it. From here on, rescue strategies can get applied to handle this situation further.

There was a bug in Dynflow <= 1.4.6 which made the job get stuck when attempting to run the step for the second time. It was fixed as Dynflow/dynflow#360 and released in dynflow-1.4.7.

Closing notes

I knew systemd had support for resource control and it was my plan to use it from the start. Sadly, I didn’t know the specifics and assumed it would behave more gracefully rather than just playing whack-a-mole with processes. Even with this rough behavior, we try to do our best to make it work. Recovering from kill -9 is hard, so any form of recovery is on a best-effort basis.

There are ways this could be implemented differently and possibly better, but none that come to my mind can currently be set up on a fresh Foreman with just a simple configuration change.

[1] https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html#MemoryMax=bytes

Thanks for the write up!

This also requires the unified cgroup (aka cgroup v2) support and I don’t think EL7 has kernel support for that. Given the current development status of EL7, I doubt we’ll ever see it. I don’t think EL8 even supports it, given that Fedora only turned it on by default in version 31. Even then people disabled it because Docker didn’t support it yet (it does since 20.10).

I was wondering, and couldn’t find it, but do you get a signal when you reach MemoryHigh or is it something you need to poll for?

https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#usage-guidelines writes:

Because breach of the high limit doesn’t trigger the OOM killer but throttles the offending cgroup, a management agent has ample opportunities to monitor and take appropriate actions such as granting more memory or terminating the workload.

It doesn’t give any suggestion about how to achieve this.

This was exactly my thought. But great research, we can probably create a ticket for 2022 or so, when Foreman will be on EL 8.X, perhaps with a fully featured systemd.

Does this imply that if a worker hits the memory limit and dies, any jobs that were being processed by it will appear as if they are still running for up to an hour before ultimately ending up in an error state?

If that is so, is there any way to tune this so that jobs don’t appear stuck to users?

I was hoping for a signal, but probably not. One has to poll or set up triggers and wait to be notified.
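
The polling variant could look roughly like this. It is a sketch assuming cgroup v2, where each cgroup has a memory.events file with a "high" counter. The helper name high_events and the path construction in the usage comment are assumptions of this example.

```shell
# Hypothetical helper: prints the "high" counter from a cgroup v2 memory.events
# file ($1), i.e. how many times the cgroup was throttled for exceeding MemoryHigh.
high_events() {
    awk '$1 == "high" { print $2 }' "$1"
}

# Intended usage on a real host (assumes cgroup v2 mounted at /sys/fs/cgroup
# and a systemd new enough to support `systemctl show --value`):
#   cg=$(systemctl show -p ControlGroup --value dynflow-sidekiq@worker)
#   high_events "/sys/fs/cgroup$cg/memory.events"
```

A management agent would call this periodically and react whenever the counter has grown since the previous reading.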

Well, there’s this https://www.kernel.org/doc/html/latest/accounting/psi.html#userspace-monitor-usage-example

Yes

The interval is tunable, but shortening it increases the load on Redis, although I don’t have any numbers on how much.