Sync plan gets stuck in running state if at least one repo sync fails

Problem:

Noticed this morning that, since the update, sync plans with a failing repo sync get stuck:


In this case it’s the Zabbix repos, which sometimes just hang.

If I try to cancel them, the action tells me it cancelled, but actually nothing happens. (Cancelling in the DynFlow console also doesn’t do much.)

Manually syncing the repos works without any issues, but all other recurring tasks of these sync plans of course fail now.
I force cancelled one of them for now, but I’m not sure if this was a good idea.

Thanks, lumarel

Foreman and Proxy versions:

Foreman 3.8
Katello 4.10

Foreman and Proxy plugin versions:

VMware
Tasks
Ansible
Bootdisk
Puppet
Snapshot Management
Statistics
Templates

Distribution and version:

Rocky Linux 8.8

foreman-tail output:

From the cancel attempt:

==> /var/log/httpd/foreman-ssl_access_ssl.log <==
10.27.10.200 - - [30/Oct/2023:19:13:11 +0100] "POST /foreman_tasks/tasks/74c3fdb9-fefa-4a8b-b529-44a799d2f248/cancel HTTP/2.0" 200 19 "https://foreman.fritz.box/foreman_tasks/tasks?state=running&page=1" "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
10.27.10.200 - - [30/Oct/2023:19:13:12 +0100] "GET /foreman_tasks/api/tasks?search=%28state%3Drunning%29&page=1&include_permissions=true HTTP/2.0" 200 1876 "https://foreman.fritz.box/foreman_tasks/tasks?state=running&page=1" "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
10.27.10.200 - - [30/Oct/2023:19:13:12 +0100] "GET /foreman_tasks/tasks/summary/24 HTTP/2.0" 304 - "https://foreman.fritz.box/foreman_tasks/tasks?state=running&page=1" "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"

==> /var/log/foreman/production.log <==
2023-10-30T19:13:11 [I|app|32b4e21b] Started POST "/foreman_tasks/tasks/74c3fdb9-fefa-4a8b-b529-44a799d2f248/cancel" for 10.27.10.200 at 2023-10-30 19:13:11 +0100
2023-10-30T19:13:11 [I|app|32b4e21b] Processing by ForemanTasks::TasksController#cancel as JSON
2023-10-30T19:13:11 [I|app|32b4e21b]   Parameters: {"id"=>"74c3fdb9-fefa-4a8b-b529-44a799d2f248", "task"=>{}}
2023-10-30T19:13:11 [I|app|32b4e21b] Completed 200 OK in 10ms (Views: 0.2ms | ActiveRecord: 1.3ms | Allocations: 4803)
2023-10-30T19:13:12 [I|app|4b4271f0] Started GET "/foreman_tasks/api/tasks?search=%28state%3Drunning%29&page=1&include_permissions=true" for 10.27.10.200 at 2023-10-30 19:13:12 +0100
2023-10-30T19:13:12 [I|app|77cb9f7a] Started GET "/foreman_tasks/tasks/summary/24" for 10.27.10.200 at 2023-10-30 19:13:12 +0100
2023-10-30T19:13:12 [I|app|4b4271f0] Processing by ForemanTasks::Api::TasksController#index as JSON
2023-10-30T19:13:12 [I|app|4b4271f0]   Parameters: {"search"=>"(state=running)", "page"=>"1", "include_permissions"=>"true"}
2023-10-30T19:13:12 [I|app|77cb9f7a] Processing by ForemanTasks::TasksController#summary as JSON
2023-10-30T19:13:12 [I|app|77cb9f7a]   Parameters: {"recent_timeframe"=>"24"}
2023-10-30T19:13:12 [I|app|4b4271f0]   Rendered /usr/share/gems/gems/foreman-tasks-8.2.0/app/views/foreman_tasks/api/tasks/index.json.rabl within api/v2/layouts/index_layout (Duration: 24.0ms | Allocations: 17924)
2023-10-30T19:13:12 [I|app|4b4271f0]   Rendered layout api/v2/layouts/index_layout.json.erb (Duration: 41.0ms | Allocations: 24056)
2023-10-30T19:13:12 [I|app|4b4271f0] Completed 200 OK in 48ms (Views: 27.2ms | ActiveRecord: 15.1ms | Allocations: 26923)
2023-10-30T19:13:12 [I|app|77cb9f7a] Completed 200 OK in 434ms (Views: 0.3ms | ActiveRecord: 426.0ms | Allocations: 3541)

==> /var/log/httpd/foreman-ssl_access_ssl.log <==
10.27.10.200 - - [30/Oct/2023:19:13:12 +0100] "GET /notification_recipients HTTP/2.0" 304 - "https://foreman.fritz.box/foreman_tasks/tasks?state=running&page=1" "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"

==> /var/log/foreman/production.log <==
2023-10-30T19:13:12 [I|app|1817a609] Started GET "/notification_recipients" for 10.27.10.200 at 2023-10-30 19:13:12 +0100
2023-10-30T19:13:12 [I|app|1817a609] Processing by NotificationRecipientsController#index as JSON
2023-10-30T19:13:13 [I|app|1817a609] Completed 200 OK in 7ms (Views: 0.1ms | ActiveRecord: 1.0ms | Allocations: 1966)

Had a similar problem yesterday: the sync underneath the bulk action ended with a warning and the bulk action got stuck, so I could not cancel it successfully even after a forced unlock. I ended up taking the UUID from the URL and running foreman-rake console with ForemanTasks::Task.find("uuid").destroy to get rid of the task. Not an ideal solution, but with the bulk action it should be safe.
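
In rough terms, the console steps would look something like this ("uuid" is just a placeholder for the real task id taken from the URL):

# on the Foreman host, open a Rails console
foreman-rake console

# inside the console: look up the stuck task by its id and remove it
task = ForemanTasks::Task.find("uuid")
task.state     # optional sanity check that this really is the stuck task
task.destroy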

The SyncPlan::Run doesn’t do much on its own. It spawns sub tasks for each of the repositories and waits for them to finish. Looking at the screenshots you posted, it had 2 sub tasks; one of them already finished, but the other did not. Chances are the subtask that is still not done got paused or something, and that could very well be why the SyncPlan::Run refuses to finish.

In this particular case it should be safe, but on a more general note, if you’re sure you absolutely have to destroy tasks, the task cleanup rake task should be the way to go, to avoid orphaned records being left all over the place.

Okay, I’ll try the destroy later when I have access to that machine. (I might find the cleanup rake task myself, but just to make sure: is it ForemanTasks::Task.find("uuid").cleanup?)

And it’s the same situation as Dirk described: the sub-tasks went into the stopped-warning state :+1:

Hrmm, that’s a bit concerning. What versions of rubygem-foreman-tasks and rubygem-dynflow do you have?

foreman-rake foreman_tasks:cleanup TASK_SEARCH="id = uuid" STATES=\, VERBOSE=y NOOP=y

When you run it like this, it will just print what it would do. When you’re satisfied with what you see, drop the NOOP=y part to perform the cleanup.
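
So once the dry run output looks right, the actual cleanup for that one task would just be the same command with NOOP=y dropped:

foreman-rake foreman_tasks:cleanup TASK_SEARCH="id = uuid" STATES=\, VERBOSE=y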

rubygem-foreman-tasks  8.2.0-1.fm3_8.el8
rubygem-dynflow        1.8.0-1.el8

And thanks! :slight_smile:

And one more thing: is the Execution time (excluding suspended state) going up?

No, only the Real time is going up :+1:
Actually, yes it does, but really slowly, something like 0.01s every half a minute or so

Btw, it looks like force cancelling worked in my case (I cancelled one of the 2 sync tasks); the new occurrences of the recurring logic don’t fail anymore (but this force cancelled one is still in the stopped-pending state).

Ah, I see it[1], thank you for providing all the details.

[1] - V2::WithSubPlans ignores tasks in stopped-warning · Issue #434 · Dynflow/dynflow · GitHub

Ah wow, good catch! :slightly_smiling_face:
And you’re welcome :+1:

Getting slammed by this one after going to 3.8/4.10. Every day there are sync tasks that are stuck with sub tasks that are in warning (we have 10 proxies all over the world). It’s impossible to do anything with them. If I force cancel them, they come back after a Foreman restart. I’m forced to use “foreman-rake foreman_tasks:cleanup” to get rid of them.

I am also seeing the same issue as Paul (tedevil), with the same Foreman/Katello versions.

I went ahead and created a Redmine issue so Phoenix can investigate this further for 4.10. This seems like quite the hassle at the moment.

Logged in today and see 278 sync tasks running. Going back to the very oldest task, it is the sync task from last night that updates all repos every night. Oddly though, when I look at the task it says 50% complete. If I open up the “Sub tasks” tab, it tells me every single subtask succeeded.
The other running tasks, sitting between 5% and 60% complete, say over and over:
“Could not transition step from pending to running, step already in running.”

Looking at my proxies, all of them say “Smart proxy currently syncing to your locations…” and are at ~90%, but they never seem to complete.
Not really sure what to do here. Is it possible to clean up all these running tasks?

Since this is impacting us pretty hard, I am starting to look at a way to downgrade. Is there a supported way to do that?

Does aruzicka’s method above work for cleaning up the incomplete tasks after a force cancel?

I don’t believe there is currently a supported pathway for downgrading the Foreman/Katello version. We’ll be triaging this Redmine bug next week so hopefully we can get a decent turnaround on investigating what’s going wrong. Sorry about the hassle.