RFC: Redis as the default Cache

This is a follow-up to RFC: Redis in Foreman.

Background

Up till now, Redis has been present in the Foreman project as:

  • The database for Sidekiq
  • Cache for Pulp content locations

Previously, Redis was also used by Pulp for its tasking system until that was rewritten into a simpler, PostgreSQL-based solution.

When we first brought Redis into the ecosystem, it was being used primarily by tasking systems that needed Redis to be configured with persistent storage. Pulp content caching piggybacked on the existing Redis instance, using its own database within the same instance.

Redis has been available as an option for the Rails cache for a number of years, with Foreman defaulting to the file-based cache. The file-based cache has been shown to fall over at scale; see https://bugzilla.redhat.com/show_bug.cgi?id=2063717.

It is also worth noting that when Performance Co-Pilot (PCP) is used with a local Grafana, PCP uses Redis as a cache.

Proposal

The proposed change is to switch to using Redis as the default cache for Foreman.

In order to do this, we will need to change our deployment architecture to support two instances of Redis. The reason is that we need one instance deployed with persistence, which is required by Sidekiq, and a separate instance deployed for caching. To quote the Sidekiq documentation:

it’s important that Sidekiq be run against a Redis instance that is not configured as a cache but as a persistent store.

To that end the proposal is to have two Redis instances.

  1. Keep default Redis instance as a pure cache

The current, default Redis instance will be kept but configured for caching and used by Foreman and Pulp. On foreman-proxy / Capsule this same default Redis will be used for caching there as well.

  2. Introduce a new Redis instance configured for persistence

A second instance of Redis will be deployed, configured for persistence, to be used by Sidekiq. This will require defining and managing our own systemd unit file for the second Redis instance, and the instance will need a name that differentiates it from the default Redis.

My main concern (not “we shouldn’t do it because of that” but “we should thoroughly test and think about that case”) is the fact that we need to move the dynflow DB to a different Redis instance.

  • Is it sufficient to ensure all services are stopped while we reconfigure this?
  • There is nothing that ensures there are no existing (running, planned) tasks – I am ignoring foreman-maintain which only covers Satellite, not upstream, scenarios
  • Is it OK for “far in the future” planned tasks, like inventory sync and friends (which we exclude in the f-maintain check!), to get the Redis part “wiped”, or will we have to move the data over somehow?

cc @aruzicka :smiley:

Well, it shouldn’t be that bad. We really use just a small subset of sidekiq, as we only use it as a means of communication between the orchestrator and the workers. Dynflow should have enough information in postgres to continue whatever it was doing even if the contents of redis are dropped, as long as all the dynflow-sidekiq services were shut down cleanly before the drop.

Is it sufficient to ensure all services are stopped while we reconfigure this?

Should be ™

There is nothing that ensures there are no existing (running, planned) tasks

That shouldn’t matter as long as the services are shut down cleanly. If you just go in and kill -9 everything, you’re quite possibly going to have a bad time.

Is it OK for “far in the future” planned tasks, like inventory sync and friends (which we exclude in the f-maintain check!), to get the Redis part “wiped”, or will we have to move the data over somehow?

It should be fine. As long as they’re not actively running, nothing about them is stored in redis.

@aruzicka Are you also implying that the need for persistence may be unnecessary due to our use of Sidekiq?

From taking a look at our setup for Redis, we are running with the default persistence settings:

################################ SNAPSHOTTING  #################################
#
# Save the DB on disk:
#
#   save <seconds> <changes>
#
#   Will save the DB if both the given number of seconds and the given
#   number of write operations against the DB occurred.
#
#   In the example below the behaviour will be to save:
#   after 900 sec (15 min) if at least 1 key changed
#   after 300 sec (5 min) if at least 10 keys changed
#   after 60 sec if at least 10000 keys changed
#
#   Note: you can disable saving at all commenting all the "save" lines.
#
#   It is also possible to remove all the previously configured save
#   points by adding a save directive with a single empty string argument
#   like in the following example:
#
#   save ""

save 900 1
save 300 10
save 60 10000
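
To make the three default save points above concrete, here is a minimal Python sketch of how they combine. It is an illustration only, not Redis's implementation: the function name `snapshot_due` and the `SAVE_RULES` list are assumptions made for this example. It models a snapshot as being due once any single save point has both its time window elapsed and its change count reached.

```python
# Default RDB save points from the config above, as (seconds, changes).
SAVE_RULES = [(900, 1), (300, 10), (60, 10000)]


def snapshot_due(seconds_since_last_save, changes_since_last_save, rules=SAVE_RULES):
    """Return True if at least one save point is satisfied.

    A save point is satisfied when at least `seconds` have passed since
    the last save AND at least `changes` writes happened in that time.
    """
    return any(
        seconds_since_last_save >= seconds and changes_since_last_save >= changes
        for seconds, changes in rules
    )


# A few hypothetical situations:
print(snapshot_due(120, 5))      # too few changes for any rule
print(snapshot_due(301, 10))     # satisfies "save 300 10"
print(snapshot_due(61, 20000))   # satisfies "save 60 10000"
print(snapshot_due(1000, 0))     # nothing changed, so nothing to save
```

With these defaults, a lightly used instance can go up to 15 minutes between snapshots, so whatever was written in that window is not guaranteed to survive a crash.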

Implying might be a strong word, but it is something that crossed my mind a couple of times. There are certain situations where persistence might be necessary, but under normal circumstances it shouldn’t be needed.

Two sentence TL;DR

As far as I understand, persistence is recommended for Sidekiq so that jobs don’t “get lost”. Dynflow builds a layer on top of sidekiq, keeps all the necessary information in postgres and uses redis only for in-flight data which we can afford to lose as it can be reconstructed from the data in postgres.

The TL part

An execution plan in Dynflow is, among other things, a collection of steps that need to be done and relationships between them (step 1 must be executed before step 2, step 2 can be executed at the same time as step 3 and so on). This and steps’ states are stored in postgres. The orchestrator could be thought of as “the brain”. When an execution plan is being executed, the orchestrator is telling the workers what should be done. This “telling” in sidekiq-based deployments is implemented as sidekiq jobs. The orchestrator puts a job onto a worker’s queue saying “execute this step” and when the worker does that, it puts a job onto the orchestrator’s queue as a response. When the orchestrator processes this response, it updates its internal state and figures out what needs to be done next. This repeats until all the steps are done.

If all the services are up, we don’t really need persistence. The jobs are put onto queues and then they are consumed. All the things that go to the queues are things that should happen as soon as possible. Yes, if the system cannot keep up with the load, the things might sit in the queues for some time, but they are still just in-flight items and if lost, the state can be reconstructed, assuming we notice they were lost.

Upon startup, the orchestrator tries to drain its own queue and discards whatever it finds there. The assumption here is that postgres is the persistent store and it should contain everything that is necessary to reconstruct the state. All the responses in the queue at this point are addressed to a different instance of the orchestrator which is no longer active so they are just thrown away. In this case, again we don’t really need persistence. If the orchestrator is restarted, it discards whatever was in redis (no matter if redis stayed up or was restarted but its contents were persisted).
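
The start-up behaviour described above can be pictured with a toy model. This is a deliberately simplified Python sketch and not Dynflow code: the step names, the states, and the function `restart_orchestrator` are all hypothetical, a plain dict stands in for postgres, and a deque stands in for the orchestrator's queue in redis. It also ignores ordering between steps and how Dynflow really resumes a step that was mid-flight.

```python
from collections import deque


def restart_orchestrator(durable_steps, stale_inbox):
    """Toy model of an orchestrator start-up.

    durable_steps: dict of step name -> state (stand-in for postgres).
    stale_inbox:   deque of responses left in the queue (stand-in for redis).
    Returns (number of discarded responses, new messages to dispatch).
    """
    # 1. Drain the queue: anything in it was addressed to a previous
    #    orchestrator instance, so it is thrown away.
    discarded = len(stale_inbox)
    stale_inbox.clear()

    # 2. Rebuild the work list from the durable store alone.
    to_dispatch = deque(
        f"execute {name}"
        for name, state in durable_steps.items()
        if state != "done"
    )
    return discarded, to_dispatch


# Hypothetical situation at restart time.
steps = {"step-1": "done", "step-2": "running", "step-3": "pending"}
inbox = deque(["step-2 finished (sent to the old orchestrator)"])

discarded, outbox = restart_orchestrator(steps, inbox)
print(discarded, list(outbox))
```

In this model the result is the same whether the queue survived the restart or came back empty, which is the point being made: the durable store, not the queue, decides what happens next.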

The only case where persistence is sort of required is if someone needs to restart redis without restarting the other services, which I can’t really recommend even with persistence. Yes, this can happen unintentionally (for example, the OOM killer could target redis, and accidents happen), but when that happens I still wouldn’t trust that everything would just carry on once redis gets back up.

Do we absolutely know that we need to use a second Redis instance? As in, have we tested performance and found it to be insufficient for the cache? Or: is the additional disk write traffic anticipated for serving the cache from a persistent redis store unacceptable?

From an API perspective, Redis works the same either way, so there is nothing constraining Pulp from using a Redis instance that is configured for persistence outside. That may or may not be a good idea for other reasons but it’s technically fine. So I suppose my question is, do we have any reason to believe it’s a bad idea to do that, and that we need 2 instances?

@dralley I think that is a fair question and to my knowledge we do not have much data in this regard. We just have “best” practice recommendations.

Sidekiq – Using Redis · sidekiq/sidekiq Wiki · GitHub
Rails Cache – Caching with Rails: An Overview — Ruby on Rails Guides

What’s hard to know is what will happen at larger scales. @evgeni Have we seen any data for Redis memory / disk storage from our monitoring?

@aruzicka - Thanks for the explanation of how we use Sidekiq. This is side-tracking a bit, but it all seems to get at the heart of how we use Redis and why. @jturel pulling you into this, as this description of using Sidekiq as the brain orchestrating sounds rather close to some of the proposed changes to Katello’s event queue. Is there some potential there?

This IS an interesting discussion. And I’m definitely +1 to using Redis as the default cache fwiw.

To recap here: I’ve been thinking about Katello’s Event Queue (which used to run in Dynflow as some strange perpetually suspended task) and how to move it out of running in a thread within the Rails app. One option is to put the events into Redis and send them back into Katello using a small external service via Katello’s API. It’s not my favorite option because it means rewriting the whole thing (dedup, scheduling, retry logic). Best case would be to remove the need for it entirely and that means careful analysis, creative solutions, and probably several compromises.

I always saw Sidekiq as a black box behind Dynflow (which I think is correct based on Adam’s comments) and not something that other components in the ecosystem could use directly. But now I’ll ask: if I wanted to use something like sidekiq-scheduler to call application code every 3-5 seconds, is it doable? That would be the trigger for draining the Event Queue which cleans up a whole bunch of stuff in Katello. Big win, although I’m not sure what roadblocks would get in the way.

Not sure, but perhaps @laugmanuel could provide some data for redis usage, as we had it configured on a separate system as a cache in this environment for HA.

We’re using a separate Redis instance in our environment for Foreman + Katello and it seems to work just fine. We haven’t had any problems since deploying it.

However, our environment is not that big with about 1000 connected hosts, 4 Foreman application servers and between ~30 and ~60 requests to redis per second:

# redis-cli --stat -i 1
------- data ------ --------------------- load -------------------- - child -
keys       mem      clients blocked requests            connections
2129       2.86M    61      21      3868904 (+0)        1678
2129       2.67M    61      21      3868934 (+30)       1678
2129       2.82M    61      21      3868949 (+15)       1678
2129       2.70M    61      21      3868982 (+33)       1678
2129       2.76M    62      21      3869175 (+193)      1679
2129       2.80M    62      21      3869219 (+44)       1679
2129       2.76M    62      21      3869261 (+42)       1679
2129       2.72M    62      21      3869275 (+14)       1679
2129       2.69M    62      21      3869313 (+38)       1679
2129       2.84M    62      21      3869362 (+49)       1679
2129       2.69M    62      21      3869390 (+28)       1679
2129       2.80M    62      21      3869422 (+32)       1679
2129       2.69M    62      21      3869433 (+11)       1679
2129       2.84M    62      21      3869467 (+34)       1679
2129       2.80M    62      21      3869517 (+50)       1679
2129       2.96M    62      21      3869563 (+46)       1679
2129       2.74M    61      21      3869594 (+31)       1679
2129       2.86M    61      21      3869612 (+18)       1679
2129       2.67M    61      21      3869650 (+38)       1679
2129       2.90M    61      21      3869709 (+59)       1679

The redis process itself consumes less than 50MB of RAM and less than 1% of CPU on a VM with 2 CPUs.
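
For anyone who wants to turn `redis-cli --stat` output like the above into a single number, here is a small Python sketch. It is only an illustration: the helper name `request_rate` is made up for this example, and it assumes the cumulative request counter is the number that is immediately followed by a `(+N)` delta, as in the rows above.

```python
import re

# The first five sample rows from the output above.
SAMPLE = """\
2129       2.86M    61      21      3868904 (+0)        1678
2129       2.67M    61      21      3868934 (+30)       1678
2129       2.82M    61      21      3868949 (+15)       1678
2129       2.70M    61      21      3868982 (+33)       1678
2129       2.76M    62      21      3869175 (+193)      1679
"""


def request_rate(stat_output, interval_seconds=1.0):
    """Average requests per second between the first and last sample row."""
    totals = []
    for line in stat_output.splitlines():
        # Cumulative request counter, followed by the per-interval delta.
        match = re.search(r"(\d+) \(\+\d+\)", line)
        if match:
            totals.append(int(match.group(1)))
    if len(totals) < 2:
        raise ValueError("need at least two sample rows")
    elapsed = (len(totals) - 1) * interval_seconds
    return (totals[-1] - totals[0]) / elapsed


print(request_rate(SAMPLE))  # 67.75
```

The five-row sample is skewed by the one burst of 193 requests. Over all twenty rows above, the counter grows by 805 in 19 seconds, which is roughly 42 requests per second.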

@laugmanuel Is your Redis instance handling Rails caching? Sidekiq? Pulp caching?

Well, yes, the use of sidekiq is currently treated as an implementation detail, but there’s nothing that would prevent you from using it directly.

It is Rails caching because of HA, Sidekiq/Dynflow is also pointed to the server. Not sure about Pulp as I have no access to the environment or the current version of the puppet code.

Only Rails caching and Sidekiq. Pulp is using its own Redis. These are the stats for our Pulp redis:

# redis-cli --stat -i 1
------- data ------ --------------------- load -------------------- - child -
keys       mem      clients blocked requests            connections
774        4.49M    116     0       24802400 (+0)       590
777        4.56M    116     0       24802423 (+23)      590
779        4.46M    116     0       24802436 (+13)      590
779        4.56M    116     0       24802455 (+19)      590
779        4.46M    116     0       24802456 (+1)       590
779        4.46M    116     0       24802457 (+1)       590
779        4.49M    116     0       24802461 (+4)       590
779        4.46M    116     0       24802465 (+4)       590
779        4.46M    116     0       24802472 (+7)       590
780        4.56M    116     0       24802491 (+19)      590
783        4.46M    116     0       24802510 (+19)      590
783        4.50M    116     0       24802520 (+10)      590
783        4.46M    116     0       24802533 (+13)      590
784        4.50M    116     0       24802546 (+13)      590
784        4.53M    116     0       24802583 (+37)      590
784        4.53M    116     0       24802596 (+13)      590
784        4.46M    116     0       24802600 (+4)       590
784        4.60M    116     0       24802628 (+28)      590
784        4.53M    116     0       24802662 (+34)      590
784        4.57M    116     0       24802678 (+16)      590

So even fewer requests/s compared to Rails + Sidekiq.

Note that when used as a cache it can automatically evict old data, depending on configuration. Key eviction | Redis

While not the default configuration, something to be aware of.
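
To illustrate what eviction means in practice, here is a tiny least-recently-used cache in Python. It is a simplified stand-in and not how Redis works internally: Redis limits by memory (`maxmemory`) rather than by key count, offers several eviction policies, and approximates LRU rather than tracking it exactly. The class name `TinyLRUCache` and the `max_keys` limit are assumptions made for this sketch.

```python
from collections import OrderedDict


class TinyLRUCache:
    """Exact LRU cache bounded by number of keys (illustration only)."""

    def __init__(self, max_keys):
        self.max_keys = max_keys
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        # Reading a key marks it as most recently used.
        self._data.move_to_end(key)
        return self._data[key]

    def set(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        # Over the limit: silently drop the least recently used entries.
        while len(self._data) > self.max_keys:
            self._data.popitem(last=False)


cache = TinyLRUCache(max_keys=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")       # "a" is now the most recently used key
cache.set("c", 3)    # over the limit, so "b" is evicted
print(cache.get("b"))
```

Silent loss like this is fine for cached values that can be recomputed, but it is the kind of behaviour the Sidekiq documentation quoted in the proposal warns about for job data.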

Overall I would still say that Foreman has a very low cache profile. Especially now that the settings are only kept in memory instead of the cache.

@laugmanuel / @Dirk – For each Redis instance can you share details about the configuration as it relates to persistence?

  • Do you have any persistence configured?
  • If yes, which kinds?
  • In any case with no persistence, do you have any memory limits or eviction policies configured?

Quick thoughts on level of effort to formalize a queue dedicated to Katello to isolate its own events? Can you give me some pointers in terms of where things would need to change in the installer, etc? Any drawbacks that would make this a bad idea? I probably wouldn’t go with the scheduled job approach but instead migrate each of Katello’s individual events into separate ApplicationJobs.

I guess these settings should answer your questions:

save 900 1
save 300 10
save 60 10000
rdbcompression yes
rdbchecksum yes
dbfilename dump.rdb
appendonly no
appendfilename appendonly.aof
appendfsync everysec

So we use RDB persistence but have AOF disabled.
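
The same conclusion can be read off a config mechanically. Below is a minimal Python sketch that classifies a redis.conf fragment by persistence mode. The function name `persistence_modes` is made up for this example, and the parsing is intentionally naive: it only looks at `save` and `appendonly` and ignores includes, quoting rules, and runtime changes made with `CONFIG SET`.

```python
# A fragment with the directives that matter here.
CONFIG = """\
save 900 1
save 300 10
save 60 10000
appendonly no
"""


def persistence_modes(config_text):
    """Return which persistence mechanisms a config fragment enables."""
    save_points = []
    aof_enabled = False
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        directive, args = parts[0].lower(), parts[1:]
        if directive == "save":
            if args == ['""']:
                # `save ""` removes all previously configured save points.
                save_points = []
            elif len(args) == 2:
                save_points.append((int(args[0]), int(args[1])))
        elif directive == "appendonly":
            aof_enabled = args[:1] == ["yes"]
    return {"rdb": bool(save_points), "aof": aof_enabled}


print(persistence_modes(CONFIG))  # {'rdb': True, 'aof': False}
```

A cache-only instance would be expected to report both as disabled, for example with `save ""` and `appendonly no`.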

I think the discussion of how to refactor the Candlepin event handling in Katello is an interesting one and certainly worthwhile, but it feels too much off topic here. @Jonathon_Turel can you open up a separate topic for it?

I agree it could use its own discussion, but it needs some tie-back and inclusion here if it involves the use of Redis / Sidekiq and affects our design decisions.

To recap what I have heard so far: in a standard deployment, and in the deployment outlined by @laugmanuel with a remote Redis, Redis has always been deployed with persistence. This has worked fine for all our current use cases, including Pulp caching and Foreman caching.

The question is whether that is good enough, or whether we need a separate Redis with no persistence to allow more advanced caching algorithms to be used.

@aruzicka is suggesting that we could flip things around completely: we may not need any persistence at all, given how we use Redis with Sidekiq.