I have encountered an issue with Foreman and oVirt, though it is the oVirt side that causes it.
When a task such as delete is sent to oVirt and something goes wrong, it takes over an hour for the task to report a timeout (from oVirt itself).
While this happens, Foreman is unresponsive (at least in the development environment; I didn’t check whether it stays responsive in production mode), which forces a restart of Puma.
I think there should at least be a way to set a timeout for the request in fog-ovirt and/or Foreman, so that it fails after a configurable timeout rather than after an hour.
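To illustrate what I mean, here is a minimal sketch using only Ruby's standard `Timeout` module. The names `with_request_timeout`, `ComputeRequestTimeout` and `slow_delete` are made up for the example and are not part of Foreman or fog-ovirt; the real fix would more likely be a timeout on the underlying HTTP connection, if the client exposes one.

```ruby
require "timeout"

# Hypothetical error class, not part of Foreman or fog-ovirt.
class ComputeRequestTimeout < StandardError; end

# Runs the given block and gives up after `seconds`,
# raising an error that names the action that timed out.
def with_request_timeout(seconds, action)
  Timeout.timeout(seconds) { yield }
rescue Timeout::Error
  raise ComputeRequestTimeout, "#{action} did not finish within #{seconds}s"
end

# Stand-in for a compute resource call that hangs.
def slow_delete
  sleep 5
  :deleted
end

result =
  begin
    with_request_timeout(0.2, "delete VM") { slow_delete }
  rescue ComputeRequestTimeout => e
    e.message
  end

puts result
```

The timeout value (0.2 seconds here) would come from a setting. Note that `Timeout.timeout` interrupts the block from another thread, which is generally considered less safe than a timeout enforced by the HTTP client itself, so treat this as an illustration of the behaviour rather than a proposed patch.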
Following a short discussion with @Ori_Rabin, I understand that it’s not just a “hey, timeout arrived” message; it should carry more information, though I’m not sure yet what that should be.
As I wrote, when I tried to delete a server, it just hung for over an hour before I got the following error message.
And it’s not 10 minutes; it took over an hour to get that error message.
One day we will rewrite our ActiveRecord callback-based orchestration engine into something normal, preferably ActiveJob (Dynflow). A unique feature of Dynflow/Tasks is that if a task fails and the orchestration is coded properly, the user/operator can go to the Dynflow/Tasks console and resubmit the job. We should write most of our orchestration as fire-and-forget, because transaction compensation is challenging and mostly cannot be done properly.
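As a rough illustration of the fail-then-resubmit idea, here is a plain Ruby sketch. `SketchTask` is a hypothetical in-memory class invented for this example; it does not use the Dynflow API, which persists much more state than this.

```ruby
# Hypothetical in-memory task; not the Dynflow API.
class SketchTask
  attr_reader :state, :error, :attempts

  def initialize(&work)
    @work = work
    @state = :pending
    @error = nil
    @attempts = 0
  end

  # Runs the work once. A failure is recorded on the task
  # instead of being rolled back or re-raised.
  def run
    @attempts += 1
    begin
      @work.call
      @state = :success
      @error = nil
    rescue StandardError => e
      @state = :failed
      @error = e.message
    end
    self
  end

  # Resubmitting simply runs the same work again.
  alias resubmit run
end

calls = 0
task = SketchTask.new do
  calls += 1
  raise "oVirt unreachable" if calls == 1
end

task.run
puts "#{task.state}: #{task.error}"

task.resubmit
puts task.state
```

The point is that nothing tries to undo the failed attempt. The failure is kept on the task, and the operator decides when to run it again.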