Problem:
Since we recently updated to 1.20.3, remote execution jobs randomly get cancelled.
I cannot find any trace of the problem in the logs (neither production.log nor proxy.log).
The only output I get is in the task details of one of the affected hosts:
Output:
Job cancelled by user
Errors:
The task could not be started within the maintenance window.
I tried several times and could always reproduce this. It looks like about 75% of the target hosts always get processed correctly, then the rest are suddenly cancelled all at once. I am 99.9% sure that the job did not get cancelled by a user.
Expected outcome:
Jobs should just finish with all hosts processed correctly.
Can you see any traceback in production.log?
Is there anything valuable in the client-side ssh logs? If it is a Fedora-based client, you should see it in /var/log/secure.
Is it happening with specific hosts only? If yes, can you cross-check what differs in the ssh configuration between working and non-working hosts?
_
Amit Upadhye.
In production.log, I can only see tracebacks from hosts that are failing legitimately (like ssh timeouts), but none related to the hosts that get cancelled.
It is also not restricted to certain hosts. If I do a “Rerun failed”, some of the hosts get handled correctly before the job again gets cancelled at some point. I will do a little more testing today and get back with anything I can find.
Some more testing has shown that it is probably related to the “Time Span” option of the job.
Running jobs without that option seems to work fine. I think the job gets cancelled if Foreman is not able to execute it within that timeframe?
Good to know, but in that case, the information given is misleading. If you click the (i) next to “Time Span” it says “distribute execution over N seconds”. In my opinion, this should mean that Foreman distributes the execution equally over that timeframe, not that everything that did not make it in time gets cancelled.
Oh, right. The semantics are a bit different, but what I said still applies. It tries to distribute the execution equally, but the hosts that don’t make it in time are cancelled.
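To make the behaviour described above concrete, here is a small toy model in Python. It is not Foreman's code: the function name `simulate_time_span`, the serial dispatch, and the `dispatch_cost` parameter are assumptions made purely for illustration. It only shows how "distribute evenly, cancel whatever has not started when the window closes" can leave the tail of the host list cancelled all at once.

```python
def simulate_time_span(num_hosts, time_span, dispatch_cost):
    """Toy model of a time-span-limited job (hypothetical, not Foreman's logic).

    Hosts are planned at evenly spaced offsets across `time_span` seconds.
    Dispatches are assumed to be serial and to take `dispatch_cost` seconds
    each. A host whose dispatch cannot begin before the window closes is
    cancelled instead of started.
    """
    interval = time_span / num_hosts   # even spacing between planned starts
    clock = 0.0                        # time at which the dispatcher is free again
    started, cancelled = [], []
    for host in range(num_hosts):
        planned = host * interval
        actual = max(clock, planned)   # cannot start before the dispatcher is free
        if actual >= time_span:
            cancelled.append(host)     # window is over: host is never started
            continue
        started.append(host)
        clock = actual + dispatch_cost
    return started, cancelled


# Dispatching is slower than the planned spacing (1.0 s vs 0.75 s per host),
# so the schedule slips and the last hosts miss the window.
started, cancelled = simulate_time_span(num_hosts=100, time_span=75.0, dispatch_cost=1.0)
print(len(started), len(cancelled))  # 75 25

# Dispatching keeps up with the planned spacing, so every host is started.
started, cancelled = simulate_time_span(num_hosts=100, time_span=75.0, dispatch_cost=0.5)
print(len(started), len(cancelled))  # 100 0
```

In this model, the cancelled hosts are always the ones at the end of the list, which fits the observation that a fixed share of hosts completes and the rest is cancelled in one go. Whether the real delay comes from dispatch time, concurrency limits, or something else is not something this sketch can tell.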