Hello community,
I’m currently working on fixing SQLite busy database errors that occur during concurrent actions with the smart proxy container registry (Container Gateway). Bug #36771 ([smart_proxy_container_gateway] introduce sqlite timeout tuning) was determined to help in some cases, but not all.
My first attempt after this was to switch SQLite to “Write-Ahead Logging” (WAL) mode, which is supposed to greatly improve performance. One big benefit of WAL mode is that “WAL provides more concurrency as readers do not block writers and a writer does not block readers. Reading and writing can proceed concurrently.”
Since we were getting busy database errors, it seemed natural to switch to a mode that allows for more concurrency. Making this switch did show some improvement when testing concurrent smart proxy syncs (the testing was quite severe: I was calling `smart_proxy.sync_container_gateway` as fast as possible in two Foreman console sessions).
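For anyone curious, enabling WAL mode through Sequel's SQLite adapter looks roughly like this (the database path and timeout value here are placeholders, not the plugin's actual settings):

```ruby
require 'sequel'

# Placeholder path and busy timeout (milliseconds); adjust for the real setup.
db = Sequel.sqlite('/var/lib/foreman-proxy/container_gateway.db', timeout: 30_000)

# journal_mode=WAL is a persistent database setting, so it only needs to be run
# once per database file.
db.run 'PRAGMA journal_mode=WAL;'
db.run 'PRAGMA synchronous=NORMAL;' # a common companion setting for WAL
```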
However, errors like the following started cropping up:
2024-02-28T23:14:45 925aa4b0 [W] Error processing request '925aa4b0-984f-42e9-8b33-1fda380c9939: <Sequel::ForeignKeyConstraintViolation>: SQLite3::ConstraintException: FOREIGN KEY constraint failed
/usr/share/gems/gems/sqlite3-1.4.4/lib/sqlite3/database.rb:281:in `step'
/usr/share/gems/gems/sqlite3-1.4.4/lib/sqlite3/database.rb:281:in `block in execute_batch'
/usr/share/gems/gems/sqlite3-1.4.4/lib/sqlite3/database.rb:156:in `prepare'
/usr/share/gems/gems/sqlite3-1.4.4/lib/sqlite3/database.rb:274:in `execute_batch'
/usr/share/gems/gems/sequel-5.77.0/lib/sequel/adapters/sqlite.rb:260:in `block (2 levels) in _execute'
/usr/share/gems/gems/sequel-5.77.0/lib/sequel/database/logging.rb:38:in `log_connection_yield'
/usr/share/gems/gems/sequel-5.77.0/lib/sequel/adapters/sqlite.rb:260:in `block in _execute'
/usr/share/gems/gems/sequel-5.77.0/lib/sequel/connection_pool/threaded.rb:92:in `hold'
/usr/share/gems/gems/sequel-5.77.0/lib/sequel/database/connecting.rb:293:in `synchronize'
/usr/share/gems/gems/sequel-5.77.0/lib/sequel/adapters/sqlite.rb:248:in `_execute'
/usr/share/gems/gems/sequel-5.77.0/lib/sequel/adapters/sqlite.rb:171:in `execute_dui'
/usr/share/gems/gems/sequel-5.77.0/lib/sequel/dataset/actions.rb:1200:in `execute_dui'
/usr/share/gems/gems/sequel-5.77.0/lib/sequel/dataset/actions.rb:147:in `delete'
/usr/share/gems/gems/sequel-5.77.0/lib/sequel/adapters/shared/sqlite.rb:669:in `delete'
/usr/share/gems/gems/sequel-5.77.0/lib/sequel/adapters/shared/sqlite.rb:669:in `delete'
/usr/share/gems/gems/smart_proxy_container_gateway-1.2.0/lib/smart_proxy_container_gateway/container_gateway_main.rb:96:in `update_repository_list'
/usr/share/gems/gems/smart_proxy_container_gateway-1.2.0/lib/smart_proxy_container_gateway/container_gateway_api.rb:171:in `block in <class:Api>'
After some digging, things started to make sense. During a smart proxy sync, the entire list of repositories in the container gateway is refreshed, since the repositories available on a capsule can vary wildly between syncs. This means that Repository records in SQLite are deleted and recreated.
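To illustrate the pattern (the table names below are assumptions for illustration, not necessarily the plugin's real schema), a delete-and-recreate refresh looks roughly like this, and the gap between statements is where a concurrent writer can sneak in:

```ruby
# Hypothetical sketch of the delete-and-recreate refresh with Sequel.
def update_repository_list(db, repo_names)
  # Without a wrapping transaction, each statement auto-commits, leaving a
  # window between statements where another process can write.
  db[:repositories_users].delete
  # If a concurrent `podman login` permission refresh has inserted join rows in
  # the meantime, this delete targets rows that are still referenced and SQLite
  # raises a FOREIGN KEY constraint violation.
  db[:repositories].delete
  # Recreate the list pushed up by Katello.
  repo_names.each { |name| db[:repositories].insert(name: name) }
end
```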
Also, when a user runs `podman login` against the container gateway, that user's list of available repositories is refreshed because, if their permissions changed, they may no longer have access to certain repositories.
Now, to add to this, in Katello, multiple smart proxy syncs are allowed to run at the same time.
With these three facts in mind, race conditions can occur if:

1. `smart_proxy.sync_container_gateway` is running in multiple processes because multiple smart proxy syncs are happening at once.
2. A user runs `podman login` while `smart_proxy.sync_container_gateway` is running.
Currently, these race conditions are mitigated by SQLite's default journal mode blocking concurrent database reads/writes. However, users with larger environments have been hitting timeouts during scenarios (1) and (2) above.
Thus, I am trying to work out how we can improve performance by switching to WAL mode while still avoiding these database race conditions.
To attempt to solve the issue, I first started implementing blocking in the container gateway. However, I soon ran into the same issue as non-WAL SQLite: lengthy timeouts.
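To make "blocking" concrete, here is the general shape of what I mean (an illustration rather than my exact implementation; a process-local lock like this also does nothing if multiple processes are writing):

```ruby
# Illustration only: serialize container gateway repository updates behind a
# process-local lock. This just moves the waiting from SQLite into Ruby, so
# under heavy concurrent syncs callers still sit in long waits and time out.
REPO_UPDATE_LOCK = Mutex.new

def update_repository_list_serialized(db, repo_names)
  REPO_UPDATE_LOCK.synchronize do
    db.transaction do
      db[:repositories_users].delete
      db[:repositories].delete
      repo_names.each { |name| db[:repositories].insert(name: name) }
    end
  end
end
```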
Another solution that comes to mind is to completely avoid scenario (1) on the Katello side. Perhaps smart proxy syncs could queue up the container gateway updates they need and fire them off once no other syncs are running. Or, more simply, the last smart proxy sync to finish could fire off a single update to the container gateway.
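As a very rough sketch of the "last sync fires the update" idea (class, method, and counter names are invented here, and real coordination would more likely live in foreman-tasks than in an in-memory counter):

```ruby
require 'concurrent'

# Hypothetical Katello-side coordinator: only the last smart proxy sync still
# running pushes the repository list to the container gateway.
class ContainerGatewayUpdateCoordinator
  RUNNING_SYNCS = Concurrent::AtomicFixnum.new(0)

  def self.sync_started
    RUNNING_SYNCS.increment
  end

  def self.sync_finished(smart_proxy)
    # Fire the single container gateway update only once no other syncs remain.
    smart_proxy.sync_container_gateway if RUNNING_SYNCS.decrement.zero?
  end
end
```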
Scenario (2) would still be an issue; I'm thinking that we could at least return a friendlier error telling the user to wait until the smart proxy sync is complete.
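Something along these lines, for example (the route path and the repository_refresh_in_progress? helper are placeholders, not the plugin's real API):

```ruby
# Illustration only: answer registry requests with a clear, retryable error
# while a sync holds the repository data, instead of letting them run into a
# database timeout.
get '/v2/token' do
  if repository_refresh_in_progress?
    halt 503, { errors: [{ code: 'UNAVAILABLE',
                           message: 'Container gateway sync in progress; please retry shortly.' }] }.to_json
  end
  # ... normal token handling ...
end
```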
I’m still early in the process of fixing this bug, so I’m looking for any more ideas that you all might have.
Thanks!
Ian