Handling request timeouts on oVirt

ik5 · October 3, 2018, 7:21am

I have encountered an issue with Foreman and oVirt, though it is the oVirt side that causes it.

When a task such as delete is sent to oVirt and something goes wrong, it takes over an hour for the task to report a timeout (from oVirt itself).

While this happens, Foreman is unresponsive (at least in the development environment; I didn’t check whether it stays responsive in production mode), which forces a restart of Puma.

I think there should at least be a way to set a timeout for the request in fog-ovirt and/or Foreman, so that it fails after a configurable timeout rather than after an hour.
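To make the idea concrete, here is a minimal sketch using only Ruby's standard `Timeout` module. The names `with_request_timeout` and `ComputeRequestTimeout` are made up for illustration; they are not part of fog-ovirt or Foreman, and the block stands in for whatever call actually talks to oVirt.

```ruby
require "timeout"

# Hypothetical error class, not part of fog-ovirt or Foreman.
class ComputeRequestTimeout < StandardError; end

# Hypothetical helper: runs the given block and gives up after `seconds`.
# The limit would come from a setting instead of being hard-coded.
def with_request_timeout(seconds)
  Timeout.timeout(seconds) { yield }
rescue Timeout::Error
  # Translate the generic timeout into an error the caller can report.
  raise ComputeRequestTimeout, "compute resource did not answer within #{seconds}s"
end
```

A caller would wrap the slow request, e.g. `with_request_timeout(60) { ... }`, and rescue `ComputeRequestTimeout`. Note that `Timeout.timeout` interrupts the running block from outside, so it is a blunt tool; a timeout set on the HTTP client itself would probably be cleaner if the client offers one.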

Following a short discussion with @Ori_Rabin, I understand that it’s not just a “hey, the timeout arrived” message; it should carry more information, though I’m not yet sure what that should be.

Your thoughts on the matter are more than welcome.

ohadlevy · October 3, 2018, 7:37am

Your question is too vague, can you please be more specific? e.g. which
user stories are slow, what is the user trying to do etc?

In many cases, we enable blocking in the UI (see wait_for in
https://github.com/fog/fog-ovirt/blob/master/lib/fog/ovirt/models/compute/server.rb),
but it really depends what you are trying to solve…

ik5 · October 3, 2018, 8:29am

I’m not aware of any user story at all

As I wrote, when I tried to delete a server, it just hung for over an hour before I got the following error message:
[Screenshot at 2018-10-03 11-26-21]
And it’s not 10 minutes, it took over an hour to get that error message.

TimoGoebel · October 3, 2018, 1:46pm

I experienced a similar issue with vSphere, see Bug #22912: Stuck vSphere API breaks Foreman. Unfortunately, timeouts seem to be a hard thing to do in Ruby.

ohadlevy · October 3, 2018, 2:04pm

I guess each item should be handled individually. For example, for
deletion, we could consider marking the object for deletion and deleting
it in an Active Job?
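Roughly what I mean, as a plain-Ruby sketch: the request only flags the record and returns, and a background worker does the slow remote call with its own time limit. Everything here is a stand-in I made up for illustration (`FakeHost`, `DeletionQueue`, the state names); in Foreman the record would be an ActiveRecord model and the worker an Active Job or Dynflow task.

```ruby
require "timeout"

# Hypothetical stand-in for a host record.
FakeHost = Struct.new(:name, :state)

# Hypothetical in-process queue standing in for a real job backend.
class DeletionQueue
  # The block performs the actual remote deletion for one host.
  def initialize(timeout:, &remote_delete)
    @timeout = timeout
    @remote_delete = remote_delete
    @jobs = Queue.new
    @worker = Thread.new { work }
  end

  # Called from the web request: flag the record and return immediately.
  def schedule(host)
    host.state = :pending_deletion
    @jobs << host
  end

  # Process whatever is still queued, then stop the worker.
  def shutdown
    @jobs << :stop
    @worker.join
  end

  private

  def work
    while (host = @jobs.pop) != :stop
      begin
        Timeout.timeout(@timeout) { @remote_delete.call(host) }
        host.state = :deleted
      rescue StandardError
        # Covers Timeout::Error as well; the record stays around so the
        # deletion can be retried or inspected later.
        host.state = :deletion_failed
      end
    end
  end
end
```

With this shape a stuck compute resource only ties up the worker, not the web process, and a host left in `:deletion_failed` is something the user can see and retry.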

lzap · October 4, 2018, 8:24am

One day we will rewrite our ActiveRecord callback-based orchestration engine into something normal, preferably ActiveJobs (Dynflow). A unique feature of Dynflow/Tasks is that if a task fails and the orchestration is coded properly, the user/operator can go to the Dynflow/Tasks console and resubmit the job. We should write most of our orchestration as fire-and-forget, because transaction compensation is challenging and mostly cannot be done properly.