Remote Execution Cancelled randomly

areyus · July 29, 2019, 5:51pm

Problem:
Since we updated to 1.20.3 recently, remote execution jobs get randomly cancelled.
I cannot find any traces of the problem in the logs (neither production.log nor proxy.log).
The only output I get is in the task details of one of the affected hosts:

 Output:

Job cancelled by user

Errors:

The task could not be started within the maintenance window.

I tried several times and could always reproduce this. It looks like about 75% of the target hosts always get processed correctly, then the rest are suddenly cancelled all at once. I am 99.9% sure that the job did not get cancelled by a user.

Expected outcome:
Jobs should just finish with all hosts processed correctly.

Foreman and Proxy versions:
1.20.3

Foreman and Proxy plugin versions:
tfm-rubygem-smart_proxy_dynflow_core-0.2.1-1.fm1_20.el7.noarch
rubygem-smart_proxy_dynflow-0.2.1-1.el7.noarch
rubygem-smart_proxy_remote_execution_ssh-0.2.0-2.el7.noarch
libsmartcols-2.23.2-59.el7_6.1.x86_64
rubygem-smart_proxy_pulp-1.3.0-1.el7.noarch
rubygem-smart_proxy_dhcp_infoblox-0.0.13-1.fm1_18.el7.noarch
tfm-rubygem-foreman_remote_execution-1.6.7-1.fm1_20.el7.noarch
tfm-rubygem-foreman_remote_execution_core-1.1.4-1.el7.noarch

Other relevant data:

upadhyeammit · July 30, 2019, 8:17am

Hello,

Can you see any traceback in production.log?
Anything valuable in the client-side ssh logs? If it's a Fedora-based client, you should see it in /var/log/secure.
Is it happening with specific hosts only? If yes, can you cross-check what differs in the ssh configuration between working and non-working hosts?
Amit Upadhye.

areyus · July 30, 2019, 8:21am

Hi,

In production.log, I can only see tracebacks from hosts that are failing legitimately (like ssh timeouts), but none related to the hosts that get cancelled.
It is also not restricted to certain hosts. If I do a “Rerun failed”, some of the hosts get handled correctly before the job again gets cancelled at some point. I will do a little more testing today and get back with anything I can find.

areyus · July 30, 2019, 8:56am

Some more testing has shown that it is probably related to the “Time Span” option of the job.
Running jobs without that option seems to work fine. I think the job gets cancelled if Foreman is not able to execute it within that time frame?

aruzicka · July 30, 2019, 9:02am

Well, that is exactly what the time span option is supposed to do. Cancel all jobs which don’t finish within the given time frame.

areyus · July 30, 2019, 10:07am

Good to know, but in that case, the information given is misleading. If you click the (i) next to “Time Span” it says “distribute execution over N seconds”. In my opinion, this should mean that Foreman distributes the execution equally over that time frame, not that everything that did not make it in time gets cancelled.

aruzicka · July 30, 2019, 10:19am

Oh, right. The semantics are a bit different, but what I said still applies. It tries to distribute the execution equally, but the jobs which don’t make it in time are cancelled.
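
To make those semantics concrete, here is a rough Python simulation. It is only a sketch and not Foreman's actual implementation: the function name `simulate_time_span`, the fixed per-host `task_duration`, and the `concurrency` limit are all assumptions made for illustration. Each host gets an evenly spaced planned start time inside the time span, and any host whose task cannot start before the span ends is cancelled.

```python
import heapq


def simulate_time_span(num_hosts, time_span, task_duration, concurrency):
    """Toy model of a 'time span' option (hypothetical, not Foreman code).

    Hosts are given evenly spaced planned start times within time_span.
    A host starts when its planned time has arrived AND a worker slot is
    free. Hosts that cannot start before time_span elapses are cancelled.
    """
    interval = time_span / num_hosts
    # Min-heap of the times at which each worker slot becomes free.
    free_at = [0.0] * concurrency
    heapq.heapify(free_at)

    started, cancelled = [], []
    for host in range(num_hosts):
        planned = host * interval
        slot_free = heapq.heappop(free_at)
        start = max(planned, slot_free)
        if start >= time_span:
            # The window is over before this host could start: cancel it
            # and hand the unused slot back.
            cancelled.append(host)
            heapq.heappush(free_at, slot_free)
        else:
            started.append(host)
            heapq.heappush(free_at, start + task_duration)
    return started, cancelled


if __name__ == "__main__":
    started, cancelled = simulate_time_span(
        num_hosts=8, time_span=40, task_duration=15, concurrency=2
    )
    print("started:", started)
    print("cancelled:", cancelled)
```

With these made-up numbers, the first six hosts start and the last two are cancelled together once the window closes, which resembles the "about 75%, then the rest all at once" pattern described above. With enough free slots (for example `concurrency=8`) every host starts in time.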

areyus · July 30, 2019, 10:20am

Thanks for the clarification. I think it would be nice if that info could make it into the tooltip somehow in the future.

Tony_Coffman · July 30, 2019, 1:55pm

Wow - thanks for the heads up. I mistakenly believed that tooltip!