Smart Proxy Container Gateway Race Condition

iballou · February 29, 2024, 1:59pm

Hello community,

I’m currently working on fixing SQLite busy database errors during concurrent actions with the smart proxy container registry (Container Gateway). Bug #36771: [smart_proxy_container_gateway] introduce sqlite timeout tuning - Katello - Foreman was determined to help in some cases, but not all.

My first attempt after this was to switch SQLite to use “Write-Ahead Logging” (WAL) mode which is supposed to greatly improve performance. One big benefit to WAL mode is that “WAL provides more concurrency as readers do not block writers and a writer does not block readers. Reading and writing can proceed concurrently.”

Since we were getting busy database errors, it seemed natural to switch to a mode that allowed for concurrency. Making this switch did see some improvements in testing concurrent smart proxy syncs (the testing was quite severe – I was calling smart_proxy.sync_container_gateway as fast as possible in the Foreman Console in two sessions).

However, errors like the following started cropping up:

2024-02-28T23:14:45 925aa4b0 [W] Error processing request '925aa4b0-984f-42e9-8b33-1fda380c9939: <Sequel::ForeignKeyConstraintViolation>: SQLite3::ConstraintException: FOREIGN KEY constraint failed
/usr/share/gems/gems/sqlite3-1.4.4/lib/sqlite3/database.rb:281:in `step'
/usr/share/gems/gems/sqlite3-1.4.4/lib/sqlite3/database.rb:281:in `block in execute_batch'
/usr/share/gems/gems/sqlite3-1.4.4/lib/sqlite3/database.rb:156:in `prepare'
/usr/share/gems/gems/sqlite3-1.4.4/lib/sqlite3/database.rb:274:in `execute_batch'
/usr/share/gems/gems/sequel-5.77.0/lib/sequel/adapters/sqlite.rb:260:in `block (2 levels) in _execute'
/usr/share/gems/gems/sequel-5.77.0/lib/sequel/database/logging.rb:38:in `log_connection_yield'
/usr/share/gems/gems/sequel-5.77.0/lib/sequel/adapters/sqlite.rb:260:in `block in _execute'
/usr/share/gems/gems/sequel-5.77.0/lib/sequel/connection_pool/threaded.rb:92:in `hold'
/usr/share/gems/gems/sequel-5.77.0/lib/sequel/database/connecting.rb:293:in `synchronize'
/usr/share/gems/gems/sequel-5.77.0/lib/sequel/adapters/sqlite.rb:248:in `_execute'
/usr/share/gems/gems/sequel-5.77.0/lib/sequel/adapters/sqlite.rb:171:in `execute_dui'
/usr/share/gems/gems/sequel-5.77.0/lib/sequel/dataset/actions.rb:1200:in `execute_dui'
/usr/share/gems/gems/sequel-5.77.0/lib/sequel/dataset/actions.rb:147:in `delete'
/usr/share/gems/gems/sequel-5.77.0/lib/sequel/adapters/shared/sqlite.rb:669:in `delete'
/usr/share/gems/gems/sequel-5.77.0/lib/sequel/adapters/shared/sqlite.rb:669:in `delete'
/usr/share/gems/gems/smart_proxy_container_gateway-1.2.0/lib/smart_proxy_container_gateway/container_gateway_main.rb:96:in `update_repository_list'
/usr/share/gems/gems/smart_proxy_container_gateway-1.2.0/lib/smart_proxy_container_gateway/container_gateway_api.rb:171:in `block in <class:Api>'

After some digging, things started to make sense. During a smart proxy sync, the entire list of repositories in the container gateway is refreshed since the repositories available on a capsule can potentially vary wildly between syncs. This means that Repository records in SQLite are getting deleted and recreated.

Also, when a user runs podman login against the container gateway, that user has their list of available repositories refreshed because, if user permissions changed, the user may no longer have access to certain repositories.

Now, to add to this, in Katello, multiple smart proxy syncs are allowed to run at the same time.

With these three facts in mind, that means race conditions can occur if:

smart_proxy.sync_container_gateway is running in multiple processes due to multiple smart proxy syncs happening.
A user runs podman login while smart_proxy.sync_container_gateway is running.

Currently these race conditions have been mitigated by SQLite blocking concurrent database writes/reads. However, users with bigger environments have been hitting timeouts during scenarios (1) and (2) above.

Thus, I am trying to solve how we can improve performance by switching to WAL mode while avoiding the database race conditions.

To attempt to solve the issue, I first started implementing blocking in the container gateway. However, I soon reached the same issue as non-WAL SQLite: lengthy timeouts.

Another solution that is coming to mind is to completely avoid scenario (1) via Katello. Perhaps smart proxy syncs could queue up updates needed to the container gateway and then fire off the update once no other syncs are running. Or, simply the last smart proxy sync to finish could fire off a single update to the container gateway.

Scenario (2) would still be an issue – I’m thinking that we could at least return a more friendly error saying to wait until the smart proxy sync is complete.

I’m still early in the process of fixing this bug, so I’m looking for any more ideas that you all might have.

Thanks!

Ian

ehelms · February 29, 2024, 3:33pm

What is the database used for in these workflows?

iballou · February 29, 2024, 4:08pm

The database is used for a few things:

To store what container repositories are on the system
To store what users exist and who has access to what repositories
To store active authentication tokens so users can log in without always reaching out to Katello

This is all necessary since Katello manages container registry access rather than Pulp.

iballou · February 29, 2024, 4:51pm

Some things discussed on Matrix:

Improve the queries generally to remove n+1 queries
- Like Reduce the number of queries by ekohl · Pull Request #32 · Katello/smart_proxy_container_gateway · GitHub
Look into introducing transactions
Avoid creating a new DB connection for each query

Some more involved improvements were suggested like using Postgres or Redis instead of SQLite.

iballou · February 29, 2024, 5:10pm

Sequel has provisions for a database connection pool, so having multiple connections should be okay if the race conditions can be dealt with. If we use only a single pool it would certainly deal with the race conditions, but it could become a bottleneck.

iballou · February 29, 2024, 6:54pm

Reduce the number of queries by ekohl · Pull Request #32 · Katello/smart_proxy_container_gateway · GitHub now has more commits for using a single DB connection via a singleton dependency.

It also has the beginnings of introducing support for direct database imports, which should improve time and memory usage for queries (thanks @ekohl !)

I’m going to work from there and do some testing.

Up next it was suggested to take a look into having a cascading deletion for the foreign keys on repository_users. It may help out with the race conditions. I need to think on that one a bit more.

A blog about improving queries with Sequel: Inserting from SELECT with Sequel | Janko Marohnić

iballou · March 29, 2024, 8:51pm

I would like to revisit this now that some work has been done to address the performance problem.

Fixes #37218 - podman login against the container registry returns 500 intermittently by ianballou · Pull Request #33 · Katello/smart_proxy_container_gateway · GitHub has the current best solution that involves SQLite.

Even with the code above, SQLite locks up when trying to do 50 podman logins at the same time.

The script I tested with is as follows:

#!/bin/bash
for i in {1...50}
do
	podman login `hostname` -u admin -p changeme &
done
wait

I think that performance is not acceptable, especially for users who have many hosts that may be consuming container content at once.

Since PostgreSQL is already running on smart proxies due to Pulp existing, I decided to try having the container gateway DB exist in postgres instead to test the performance. The results were much, much better.

Not only could I do 50 logins concurrently, but I could even do 100 concurrently while also unreasonably spamming the “sync all repos from Katello to container gateway” code (smart_proxy.sync_container_gateway).

This was also with 309 repositories being consistently added and removed from the container gateway.

I don’t see a way around the SQLite busy issues. Given that the container gateway has a dependency on Pulp, which means postgres will always be available, I think moving the DB to postgres permanently is the best next step. Heavy container users will be much happier without having to worry about those SQLite busy errors.

Now, even postgres will lock up at some point, but I think by that point we can start limiting how many concurrent logins we allow.

Moving to Postgres would need a couple of steps that should be relatively easy:

The installer would need to set up the Postgres database user and database.
SQLite would need to be migrated to Postgres via pgloader
- SQLite to Postgres — pgloader 3.6.9 documentation
- This step could possibly be done by the container gateway itself since it knows the database backend preference and can easily tell if the Postgres DB is not populated.

I’d like to hear what folks think. I don’t see any downsides to this since postgres is guaranteed to be in any environment with the container gateway installed thanks to Pulp. I think the performance improvement to users would be huge.