Refresh RollingCV Repo task hangs, causing repos to not up

Problem:

Sometimes the task “Refresh RollingCV Repo repository…” hangs and stays in state planned, with progress at 100% and result pending. Example here:

This causes the repositories to not synchronize when running scheduled jobs at night:

I’ve noticed this behavior on Foreman 3.17 + Katello, and the issue is still present on Foreman 3.18.1 + Katello.

In the Foreman GUI it says there are no tasks running or paused, but I’ve noticed that “Refresh RollingCV Repo repository…” runs multiple times and only one instance gets stuck; the others finish in a couple of seconds without errors. It happens on RollingCVs with custom repositories, Red Hat repositories, or both.

Restarting the services with the command “foreman-maintain service restart” causes all those stuck tasks to finish with an error, so technically it “cleans” the queue.

If I remember correctly, this issue was not present in Foreman 3.16, or at least I was not aware of it (on that version we had completely different problems where the repos wouldn’t sync at all :D).

Expected outcome:

Finish the task “Refresh RollingCV Repo repository…” without getting stuck

Foreman and Proxy versions:

Foreman 3.18.1 + Katello 4.20

Foreman and Proxy plugin versions:

Distribution and version:

RHEL 9.7

Other relevant data:

Is it always the same repo refresh that is stuck? Can you click the Dynflow Console button on the paused task and add a screenshot so we can see which step in the task gets stuck?

Hello

according to Dynflow Console, the task is in state stopped/success.

I’ve double-checked to make sure I clicked the right task, and I did, so that means Dynflow thinks it is done, but Foreman says it’s not.

I currently have 2 out of 5 tasks in a stuck state (5 tasks which have the same name, but different ID), here is the screenshot of the task from which I’ve made a dynflow screenshot.

Try switching over from the “Run” to the “Finalize” tab, it is possible the failed/unfinished action is hiding there.

I am afraid this won’t be helpful :confused: There is only one entry, and it says success.

And just in case, here it is unwrapped

And also a response to the previous question I forgot to answer: I am not sure whether this happens on a specific set of repositories or whether it is “random”, because sometimes it happened on one repo more than once, and some repos got stuck only once so far. For example, the Postgres repo got stuck only once, but the RHEL/EPEL repos got stuck multiple times; it’s just always a different version.

I am speculating, but based on the symptoms it sounds like everything that is supposed to happen happens, except that the task does not jump to finished success. I have some distant memory of having observed something similar with tasks that essentially performed a noop. I believe we fixed that by pulling the check if the task needs to do something outside of the task and only planning it if something needs doing. That way the noop tasks that were getting stuck never got planned in the first place. I wonder if this might be a similar case.
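To make the pattern from the previous post concrete, here is a minimal Python sketch of pulling the "does anything need doing" check out of the task and into the planner, so no-op tasks never get planned. This is not Katello's actual code; the `Repo` class and its version fields are hypothetical stand-ins for whatever state the real check would compare.

```python
from dataclasses import dataclass


@dataclass
class Repo:
    # All fields are hypothetical, for illustration only.
    name: str
    source_version: int  # version of the library source content
    clone_version: int   # version the rolling clone currently points at


def needs_refresh(repo: Repo) -> bool:
    """The check that used to live inside the task body."""
    return repo.clone_version != repo.source_version


def plan_refresh_tasks(repos: list[Repo]) -> list[str]:
    """Plan a refresh only for repos that actually need one.

    Repos that are already up to date never get a task at all,
    so there is no no-op task that could get stuck.
    """
    return [repo.name for repo in repos if needs_refresh(repo)]
```

With this shape, a repo whose clone already matches its source simply never appears in the task list.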

So you think this might be a bug in Katello or something? Or is there anything else we could provide that would help show what is wrong? I did not install this Foreman myself, but it should be a standard installation without anything special. The only non-standard thing is that we access the internet through a corporate proxy; we’ve had other problems with it before, but so far we handled them all on the proxy side.

I think (low confidence) that it might be a race condition in the tasking system where successfully completed tasks don’t update their state to “completed success” for some reason.

I can think of two things you could investigate that could narrow things down:

  1. Do you know what action you take triggers the affected tasks? Is it a sync, or something else? This might help us narrow down the codepath that was taken.
  2. Are your rolling CV repositories updated even though the relevant task hangs or not? As in you sync new content, does that new content appear in the rolling CV repo even though the task hangs? This would tell us if the action itself is broken, or if it is (in my view more likely) the tasking system. As in all Task actions complete successfully and correctly, but for some reason the task is not updated to “completed success”.

We have a sync plan that happens every night at 01:00 to sync all repositories.

As for the second question, I wasn’t sure how to check this. I’ve checked the repositories inside the RollingCV, and the sync state says the repo got synced:

This is the repo on which the task hung last night.

Then I checked the errata. There was an advisory about Java from yesterday, so I checked which Java package is available to a system connected to the affected CV, and the newest one was the version mentioned in the security advisory from Red Hat, so that should mean the RollingCV itself is synced and updated. The only thing is that there is always more than one “Refresh RollingCV Repo repository” task (usually 4 or 5), and I don’t know whether that could affect something: whether every task updates a part of the repository, or what the meaning behind multiple tasks is.

Unfortunately, there is no easy way to check if a rolling CV was actually updated. Also, if a sync does not add any new content, then there is nothing to update, and hence nothing to see. One way to check is by looking through the Dynflow task output, directly in the DB, or by querying the API of backend system components. But all of this requires detailed knowledge of the plumbing that cannot be easily communicated via a forum post.

The thing one would notice from a user perspective that would prove that things were not performed as designed is the following:

  • You have repository A in rolling content view B.
  • You have a host C, consuming rolling content view B.
  • You synchronize A, and you see that this adds some new package(s) to the repo. (You can see the new packages in the UI for example).
  • You can see the task that updated the rolling content view B, but your host C cannot install the new package(s). They are not in the rolling repo consumed by the host, because it was not properly updated.

My expectation is that the above is not the case. I think most likely everything was properly updated in the rolling CV, and your host is getting all the new content it should. I suspect the bug is that the task was not updated to “complete, successful” even though it factually is complete and successful. However, I can’t be sure of this without a reproducing system.
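The user-level check in the list above boils down to a set difference: packages that the sync added to repository A but that host C cannot see. A small Python sketch, assuming you have collected both package lists by hand (for example from the UI and from the host); the function name and the package names in the usage note are made up for illustration.

```python
def missing_on_host(added_by_sync: set[str], available_to_host: set[str]) -> list[str]:
    """Return packages the sync added that the host cannot see, sorted by name.

    An empty result suggests the rolling CV repo was updated correctly,
    even if the refresh task itself is still shown as planned/pending.
    """
    return sorted(added_by_sync - available_to_host)
```

For example, `missing_on_host({"pkg-a-1.1", "pkg-b-2.0"}, {"pkg-a-1.1"})` returns `["pkg-b-2.0"]`, which would point to a rolling repo that was not properly updated.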


I’ve had a big debug session with Claude Opus, and this is what Claude came up with, hope that can help narrow down the issue/help on where to look:

Update: third batch of diagnostics, 13 wedges across 4 nights.

Strongest finding: every single wedge is a rolling clone, never a library instance. All 13 had library_instance_id set. Library/source repos never wedge — only their clones.

When grouped by root_id:

  • root 8 (Debian 12 backports): 3 clones wedged (across 2 different Rolling CVs), within 1 second of each other

  • root 16 (RHEL 9 BaseOS): 2 clones wedged (Test + Prod env of same CV), within 375 ms

  • root 202 (PostgreSQL 15): 2 clones wedged (Dev + Test env of same CV), within 545 ms

  • roots 56, 57, 59, 173, 201, 205: 1 wedged clone each

The sub-second timestamp clusters I mentioned earlier line up exactly with the same-root multi-clone groups. So the wedge correlates with multiple RefreshRollingRepo tasks running concurrently for clones of the same root repository, not with concurrent activity in general.
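For anyone who wants to repeat this grouping on their own task list, here is a rough Python sketch of the analysis: group wedged tasks by root and keep only the roots where several clones wedged within a short window. The input format (pairs of root id and timestamp) and the sample rows are invented for illustration and are not the data from this thread.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def same_root_clusters(wedges, window=timedelta(seconds=1)):
    """Map root_id -> number of wedged clones, for roots whose wedges
    (two or more) all fall within `window` of each other."""
    by_root = defaultdict(list)
    for root_id, wedged_at in wedges:
        by_root[root_id].append(wedged_at)

    clusters = {}
    for root_id, stamps in by_root.items():
        if len(stamps) > 1 and max(stamps) - min(stamps) <= window:
            clusters[root_id] = len(stamps)
    return clusters


# Illustrative rows only: (hypothetical root id, time the task wedged).
t0 = datetime(2025, 1, 1, 1, 0, 0)
sample_wedges = [
    (1, t0),
    (1, t0 + timedelta(milliseconds=400)),
    (2, t0 + timedelta(minutes=5)),
    (1, t0 + timedelta(milliseconds=900)),
]
clusters = same_root_clusters(sample_wedges)
```

Here `clusters` contains only root 1 with three clones; root 2 has a single wedge and is left out.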

Updated hypothesis: when a library source syncs and has multiple rolling clones across environments / Rolling CVs, RefreshRollingRepo is spawned once per clone in parallel. The clone-update path likely ends with a callback or event keyed on the shared root repo (Pulp publication regen? applicability? content count?). When N siblings try to deliver that event in parallel, only one’s wrapper-state-update completes cleanly; the others’ completion events get lost. Sources never wedge because they aren’t competing with siblings on shared state.

Workaround in place via the cleanup script. Pattern is now reproducible enough that it should be possible to write a reproducer: create a Rolling CV with a single Red Hat or large custom repo, promote it to Library + Dev + Test + Prod, trigger a sync, observe at least one of the four RefreshRollingRepo tasks wedging.
(Personal note on this: it does not happen every time on every repo, so it might take a couple of attempts.)

Update with two more diagnostic findings.

Sibling-clone correlation generalizes. Followed up on the “loner” wedges from this batch (single wedge per root). All 6 loner roots have many rolling clones — 20 each for the Red Hat / Zabbix ones (across 3 CVs), 4 each for the Postgres ones (in 1 CV). So even the loner wedges came from sources whose sync triggered many concurrent RefreshRollingRepo sibling tasks. We just happened to catch only one wedged sibling per root in that batch.

Reframing the hypothesis: the wedge rate scales with the number of rolling-clone siblings of a root, because all of them fire RefreshRollingRepo in parallel when their source syncs. Bigger sibling group = bigger chance one of them loses the post-completion race. This explains why RHEL/EPEL/Zabbix repos wedge more often than small custom ones — those library sources tend to have many more rolling clones promoted across environments.
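As a rough way to see why a bigger sibling group would wedge more often: if each sibling task had the same small chance p of wedging, and the tasks were independent, the chance of at least one wedge among N siblings would be 1 - (1 - p)^N. Both assumptions are simplifications (a race between siblings is by nature not independent), and no per-task wedge rate was measured in this thread, so any value of p is purely hypothetical.

```python
def p_at_least_one_wedge(p_single: float, siblings: int) -> float:
    """Chance that at least one of `siblings` tasks wedges.

    Assumes every task wedges independently with probability `p_single`,
    which is a simplification, not a measured property of the system.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if siblings < 0:
        raise ValueError("siblings must not be negative")
    return 1.0 - (1.0 - p_single) ** siblings
```

Under this model, for any fixed p between 0 and 1 a root with 20 clones is always more likely to show a wedge than a root with 4 clones, which is at least consistent with the pattern described above.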

A second failure mode. Out of 14 wedged task rows from the past 4 nights, 13 hold a stale lock; 1 has the same planned/pending wrapper state and the same stopped/success underlying execution plan, but locks=0. The wrapper update failed in both cases, but in this one case the lock was released cleanly. This suggests the post-completion path has at least two sub-steps that aren’t atomic: sometimes both fail, sometimes only the wrapper update fails. The second mode is not blocking (no stale lock) but still creates a stuck task row.
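The two failure modes can be told apart with a simple rule over the values mentioned above. A Python sketch, assuming you have already looked up the wrapper task state/result, the execution plan state/result, and the lock count for each row; the parameter names and the returned labels are my own, not Foreman's.

```python
def classify_task(task_state: str, task_result: str,
                  plan_state: str, plan_result: str,
                  lock_count: int) -> str:
    """Classify a task row using the symptoms described in this thread."""
    wrapper_stuck = (task_state, task_result) == ("planned", "pending")
    plan_done = (plan_state, plan_result) == ("stopped", "success")

    if not (wrapper_stuck and plan_done):
        return "not_wedged"
    # The plan finished but the wrapper was never updated.
    if lock_count > 0:
        return "wedged_with_stale_lock"  # blocks later work on the repo
    return "wedged_without_lock"         # stuck row, but nothing is blocked
```

Only the first wedged variant would be expected to block later syncs; the second just leaves a stale row behind.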

Reproducer hypothesis: create a custom repo, add it to N (e.g. 5–10) Rolling CVs across environments. Trigger a sync of the source repo. Expect at least one of the N spawned RefreshRollingRepo tasks to wedge with the symptoms above. Probability scales with N.


So it looks like I’ve found the root cause, and a fix for the issue.

The problem is related to the new feature that allows assigning a rolling CV to environments other than Library. We had an environment path composed of Library/Dev/Test/Prod, and I had added the rolling CVs to all of them. As a result, whenever a sync brought an update to a repository, a “Refresh RollingCV Repo” task was launched for every environment of the content view. Because all of those clones are linked to one source and the refresh tasks start at the same time, one of the tasks can sometimes hang. The hung task never releases its lock, so the repository stays locked.

The fix is easy: keep all the rolling CVs only in Library.

Steps to replicate: assign a rolling CV to multiple lifecycle environments, and sync one of its repositories.


This should be amended as follows:

You can work around/mitigate this issue by assigning your rolling CVs to exactly one environment instead of multiple. We recommend choosing any environment other than Library for this. If in doubt, creating a new Lifecycle environment path, with a single custom environment named “rolling” and using that for all your rolling CVs is a good choice.

We do not recommend using Library for the following reasons:

  • Syncing Library to a smart proxy is painful, since Library contains all content and is updated every time anything is synced.
  • A rolling CV repo assigned to Library implies independently syncing that repo to any smart proxies twice over. Once because that repo is in Library, and once because the identical repo state is also in the rolling CV.

I am glad @matoboost found a workaround. I am also glad this workaround gives us a much better idea of what the root cause is. That being said: assigning rolling CVs to multiple environments is a valid use case that should not result in hanging tasks. This is a bug that we should fix!


There’s a bug for this now: Bug #39327: Refresh RollingCV Repo task hangs when assigning multiple environments to RollingCV - Katello - Foreman
