RFC: The Future of Periodic Tasks

Background

Certain routine actions must be performed periodically for Foreman to work as expected. These are currently implemented with a mix of external cron jobs triggering rake tasks and two flavors of in-application periodic tasks. While this model worked reasonably well in the past, the move to container-based deployments offers us an opportunity to rethink things.

Note: When this text mentions “task”, it means a general, abstract unit of work. It refers neither to a rake task nor to an instance of ForemanTasks::Task.

Current State of Periodic Task Scheduling

Currently, periodic tasks can be divided into two main categories: Externally Scheduled and Internally Scheduled.

Externally Scheduled Tasks

The externally scheduled tasks are implemented in two different ways. The older, more widely used approach relies on cron, while the newer approach leverages systemd timers.

Flavors of External Scheduling
  • Plain Old Cron (and foreman-rake):
    • Many traditional tasks are executed via cron using the foreman-rake wrapper. This includes core tasks like db:sessions:clear, reporting tasks (reports:{daily,weekly,monthly}), various plugin tasks (foreman_tasks:cleanup) as well as some coming from smart proxy plugins (rubygem-smart_proxy_openscap).
  • Systemd Timers:
    • A more modern alternative, currently only used in Satellite deployments.
Properties of External Scheduling
  • Individual tasks are completely independent - a failure in one does not affect others in any way.
  • Users can freely modify, disable, or reschedule individual tasks.
  • Tasks can be easily executed on demand by the user.
  • Can execute anything, even processes not included within the Rails application.
  • Rake tasks are slow to execute (adding at least 20 seconds) because they require loading the entire application environment before running. However, all memory is released upon exit.

Internally Scheduled Tasks

These tasks are managed directly within the application, offering better visibility and less resource usage.

Flavors of Internal Scheduling

Like the externally scheduled tasks, internally scheduled ones come in two flavors. Some are implemented using Recurring Logics from foreman-tasks, while others chain ActiveJob jobs in a more ad-hoc fashion. Both end up using Dynflow to schedule work for future execution.

Foreman tasks Recurring Logics (RLs)

Dynflow has support for scheduling one-off things to be executed in the future. foreman-tasks builds on top of that by providing several constructs that allow Dynflow actions to be executed periodically. Recurring logics store the configuration (i.e. how often to run) as well as state.

A RecurringAction module needs to be included in the root action class. The RL acts as a persistent store for scheduling metadata (cronline, iteration count, limits). The scheduling of the next iteration relies on execution plan hooks in the underlying Dynflow engine, and each iteration is linked to a task group that belongs to the relevant RL.

Several system periodic tasks (for example Red Hat Lightspeed client status aging, Check for long running tasks) are implemented this way. At the same time, recurring logics can be used by users directly to manage Sync Plans and recurring remote execution jobs.

Ad-hoc using Chained Active Jobs (System only):

The ActiveJob (AJ) is registered in an initializer and, upon completion, schedules its own next iteration. There is no explicit link between iterations. This is the primary tool in environments where foreman-tasks is not available (i.e. in vanilla Foreman).

This is used for various notification and cleanup jobs such as Host lifecycle support expiration notification, Clean up StoredValues, and manifest expiration warnings.

The downside is that users cannot configure this kind of periodic task in any meaningful way.

Properties of Internal Scheduling
  • Relatively cheap to create and execute compared to the overhead of external foreman-rake tasks.
  • The status and individual invocations leave a paper trail in the form of tasks (RL only) and logs (both cases).
  • No clear, system-wide distinction between system-level and user-configured tasks.
  • Fragility
    • RL only: If the scheduled “next iteration” is cancelled or fails to schedule, the recurrence can be broken for good.
    • AJ only: If the scheduled “next iteration” is cancelled or fails to schedule, the recurrence can be broken until the application is restarted.
  • AJ only: Active Job-based recurrences do not account for the task’s run time, which can lead to drift if the task takes a long time to execute.
  • Because the scheduled tasks are executed by the same background processing engine as any other tasks, relying on internal scheduling doesn’t bring in any extra dependencies, but can make periodic tasks compete for worker slots with other, possibly non-periodic tasks.
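
The drift noted above for the Active Job flavor can be illustrated with a small simulation. This is a minimal sketch in plain Ruby, not Foreman code: both helper methods and the numbers are hypothetical, and times are plain integers (seconds) instead of real clock values. It assumes the chained job schedules its successor one full interval after it finishes, as described above.

```ruby
# Chained style: each run schedules its successor when it finishes, so the
# next start is `interval` seconds after the previous run ended.
def chained_start_times(count, interval:, run_time:)
  starts = []
  now = 0
  count.times do
    starts << now
    finished_at = now + run_time
    now = finished_at + interval
  end
  starts
end

# Anchored style: every start is derived from the original schedule and does
# not depend on how long earlier runs took.
def anchored_start_times(count, interval:)
  Array.new(count) { |i| i * interval }
end

# Hypothetical numbers: an hourly task whose run takes two minutes.
chained = chained_start_times(4, interval: 3600, run_time: 120)
anchored = anchored_start_times(4, interval: 3600)

# Drift of each chained start relative to the intended schedule, in seconds.
drift = chained.zip(anchored).map { |c, a| c - a }
puts drift.inspect # prints [0, 120, 240, 360]
```

With these numbers the chained variant slips by one run time per iteration, while a schedule computed from a fixed anchor (which is what a cronline effectively provides) does not.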

Post redmine 38956 Era (The Near Future)

To avoid the need for a cron container in container-based deployments, the short-term decision is to leverage systemd timers on the container host. To avoid having to define all the periodic tasks as individual timers, there will be four anchor rake tasks based on cadence (hourly, daily, weekly, monthly), and individual rake tasks will attach to them. Systemd timers will be provided for those four anchor rake tasks.

This will reduce the number of timers from ~13 to 4. The runtime should also drop, because the application environment no longer has to be loaded separately for each task, while the ability to run individual tasks on demand is kept.

This comes at the cost of reduced isolation, as a misbehaving task has the potential to block or even completely prevent others within the same group from running.
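
The anchor task idea can be sketched with Rake prerequisites. This is only an illustration under assumptions: the task names `periodic:daily`, `demo:clear_sessions` and `demo:daily_report` are hypothetical, the task bodies are stubs, and the actual implementation may attach tasks differently. `Rake::Task.define_task` is used so the snippet runs as a plain script; in a Rakefile the usual `task` DSL would do the same.

```ruby
require 'rake'

ran = [] # records which stub tasks ran, in order

# Stand-ins for real periodic tasks; each stays runnable on its own.
Rake::Task.define_task('demo:clear_sessions') { ran << 'demo:clear_sessions' }
Rake::Task.define_task('demo:daily_report')   { ran << 'demo:daily_report' }

# Anchor task for one cadence (hypothetical name). It has no action of its own.
daily = Rake::Task.define_task('periodic:daily')

# Individual tasks attach to the anchor as prerequisites.
daily.enhance(['demo:clear_sessions', 'demo:daily_report'])

# A single timer would invoke the anchor, so the environment is loaded once.
daily.invoke
puts ran.inspect
```

Because the prerequisites run one after another inside a single process, an exception raised by the first stub would prevent the second from running, which is the reduced isolation described above.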

Open Questions

  • How will completely external processes, such as foreman-reports and smart-proxy scripts, integrate into this consolidated rake task structure?

Proposal

The ultimate goal is to bring all periodic task management directly into the application and settle on a single way of achieving it. To accomplish this we would need to:

  • Implement all periodic tasks within the application as action classes.
  • Convert all Active Job-chaining based tasks to rely on Foreman-Tasks Recurring Logics.
    • Note: This might imply merging foreman-tasks or its selected subset into Foreman core or having Foreman core depend on foreman-tasks.
  • Migrate periodic tasks from the older approaches to the new one, as it is not desirable to maintain two ways of achieving the same thing.

Required Changes in Foreman-Tasks

To support this unification, the foreman-tasks framework needs several enhancements to achieve feature parity with current approaches and to generally improve the user experience:

  1. Add the ability to edit Recurring Logics (change the end date and limit, change the interval).
    • Note: This is currently available for Sync Plans in a rather workaround-y way.
  2. Implement a mechanism to trigger a task defined as an RL on demand without affecting its overall scheduling cycle.
  3. Add the ability to mark RLs and their associated tasks as “system” tasks in the code, allowing users to distinguish them from user RLs or filter them out altogether.
    • Note: In the past we’ve had a similar request for individual tasks (SAT-21985).
  4. Ensure that when configuring an RL, the exact configuration details are preserved. Currently we convert the user-friendly configuration (i.e. “Run daily”) to a cronline and only store that.
  5. Reworked Recurrence Mechanism:
    • The current mechanism, where the “next iteration” is the source of truth, is too fragile, as cancelling the next iteration breaks the entire recurrence.
    • One possible solution would be to introduce a single delayed plan (in Dynflow’s terminology) to act as a template. Individual iterations would be cloned from this template. This would also allow for a periodic check to ensure all RLs always have their next iteration correctly planned, and it would make it possible to run a task on demand without affecting the schedule.
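
The template idea from item 5 can be modelled in a few lines of plain Ruby. This is a rough sketch under assumptions, not a Dynflow or foreman-tasks design: `RecurrenceTemplate`, `Iteration` and all method names are hypothetical, times are integer seconds, and a fixed interval stands in for a cronline.

```ruby
# One planned or finished run cloned from a template (hypothetical model).
Iteration = Struct.new(:template_id, :run_at, :state)

# The template, not the "next iteration", is the source of truth.
class RecurrenceTemplate
  attr_reader :id, :interval, :first_run_at, :iterations

  def initialize(id:, interval:, first_run_at:)
    @id = id
    @interval = interval
    @first_run_at = first_run_at
    @iterations = []
  end

  # Iterations that are planned but have not run or been cancelled.
  def pending
    iterations.select { |i| i.state == :scheduled }
  end

  # First schedule slot strictly after `now`, anchored to first_run_at.
  def next_slot_after(now)
    return first_run_at if now < first_run_at

    elapsed_intervals = (now - first_run_at) / interval
    first_run_at + (elapsed_intervals + 1) * interval
  end

  # Periodic check: clone a new iteration only if none is pending.
  def ensure_next_iteration(now)
    return pending.first unless pending.empty?

    iteration = Iteration.new(id, next_slot_after(now), :scheduled)
    iterations << iteration
    iteration
  end

  # On-demand run: an extra clone that leaves the schedule untouched.
  def run_now(now)
    iteration = Iteration.new(id, now, :done)
    iterations << iteration
    iteration
  end
end

template = RecurrenceTemplate.new(id: 1, interval: 3600, first_run_at: 3600)
first = template.ensure_next_iteration(0)
first.state = :cancelled                  # a user cancels the next iteration
repaired = template.ensure_next_iteration(4000)
puts repaired.run_at                      # prints 7200
```

In this model a cancelled iteration no longer ends the recurrence, because the next periodic check plans a fresh clone from the template.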

Optional Enhancements

  • Recurring logic grouping: The short-term solution groups the periodic tasks into categories. If users get used to this, it wouldn’t be ideal to remove it a couple of releases later.
  • Splay Time: Add the ability to configure a random “splay” (or offset) time for Recurring Logics, which is useful for tasks that should run across a time window rather than all at the exact same minute (e.g., syncing with cloud services).
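
Splay can be sketched as a random offset added to the planned start time. This is a minimal illustration, not an existing foreman-tasks feature: the `with_splay` helper and its parameters are hypothetical, and times are integer seconds.

```ruby
# Returns the planned start shifted by a random offset in 0..max_splay seconds.
# `rng` is injectable so the behaviour can be reproduced in tests.
def with_splay(scheduled_at, max_splay, rng: Random.new)
  scheduled_at + rng.rand(0..max_splay)
end

# Five tasks planned for the same second get spread over a five-minute window.
rng = Random.new(42)
starts = Array.new(5) { with_splay(3600, 300, rng: rng) }
puts starts.inspect
```

Whether the offset is drawn once per Recurring Logic or again for every iteration is a design choice this sketch leaves open.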

By moving towards a unified, in-application model, we distance the application more from the underlying operating system, allowing the administrators to manage a larger part of the application from within the application itself as well as reducing environmental overhead.

Open questions

  • What is the documentation impact? How are current periodic tasks documented, if at all?

Thanks for the write up.

It’s not mentioned here, but IMHO a big motivation for avoiding external timers is multi-host deployments. We know various people like HA solutions, and with cron/systemd you would need to build something to prevent the same task from running on multiple hosts in parallel. So that is another reason to go with internal scheduling.

I think Foreman core depending on a plugin could easily lead to problems and if possible I’d like to avoid that. It would give us the worst of both worlds. Merging more functionality into core is IMHO a better way.

We agreed on this already years ago, it just never happened, so +1 from me on merging.

I like the proposal, and the key element for me is:

we distance the application more from the underlying operating system

This will help with our future eye towards Kubernetes based deployments.

You called out in the background section the fact that we do not have a way to differentiate “admin” or system-wide tasks from tasks related to user actions. I think adding this distinction has multiple benefits, as the administrator would get an overview of these important tasks all in one place. I am not sure how they are monitored today, but I could imagine that seeing clearly when these tasks fail, or having webhooks notify about it, would be incredibly useful.

There are benefits to relying on core Rails functionality. Have you thought about how ActiveJob or a move to Solid Queue could help here?

Less implying, more doing :grin:

As far as I know, ActiveJob has no native support for running jobs periodically. SolidQueue adds that as an extension, but as far as I know it is rather static (based on a configuration file). Changing the schedules on the fly seems to be unsupported out of the box, but could probably be built on top of what’s already there. Still worth taking a look when (and if?) we move to it.