Looks like Jenkins is backed up again. Did a recent change to the
infrastructure cause this, or are we suddenly pushing a lot more volume?
Eric
···
On Wed, Feb 25, 2015 at 8:34 PM, Justin Sherrill wrote:
Jenkins appeared to be completely clogged up with tasks that had been
running for ~5-10 hours and were not completing.
I cancelled a slew of katello and foreman jobs in order to unblock
everything and keep it going. You may need to manually initiate tests on
open PRs.
···
Discovery tests on the develop branch have been stuck for almost a month
waiting for a stable build, and the PR to fix them was merged today.
This caused all of the open discovery PRs (12) to start running tests.
That was 6 hours ago, so it couldn't have caused the clog Justin mentioned,
but it likely caused (or at least helped) the recent backup.
···
On Mon, Mar 2, 2015 at 5:27 AM, Dominic Cleal wrote:
As Ori alluded to, it often happens when a bunch of PR tests are
submitted at once. One person can easily cause it if they rebase a
dozen PRs and force-push, or even just repeatedly update one or two PRs
within a few minutes. Also when a master branch build is broken, we
have it configured to pause PR tests, and so these can flood in once
fixed if it's been broken for some time. Every PR test causes another
dozen jobs on Jenkins.
We have a problem with how our jobs are laid out that means Jenkins can
get deadlocked rather than crunch through a large job queue as you'd
expect (eventually). It sounds like this happened again.
Taking test_develop_pull_request as an example, it's a matrix job with
four subcategories:
foreman
rubocop
katello
upgrade17
Each one then uses the trigger job task to start another job on Jenkins,
so we end up with:
foreman
  test_develop_pr_core
    ruby/database matrix (9 jobs)
rubocop
  test_develop_pr_rubocop
katello
  test_katello_core
    ruby/database matrix (1 job)
upgrade17
  test_branch_upgrade
    database matrix (3 jobs)
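To put rough numbers on that layout, here is a quick sketch using only the matrix sizes quoted above (9, 1, 1 and 3 leaf jobs); the live Jenkins configuration may differ:

```python
# Rough fan-out of one test_develop_pull_request run, using the matrix
# sizes quoted in this thread; the live Jenkins configuration may differ.
layout = {
    "foreman":   ("test_develop_pr_core",    9),  # ruby/database matrix
    "rubocop":   ("test_develop_pr_rubocop", 1),
    "katello":   ("test_katello_core",       1),  # ruby/database matrix
    "upgrade17": ("test_branch_upgrade",     3),  # database matrix
}

blocking_cells = len(layout)                     # executors pinned while they wait
downstream = sum(n for _, n in layout.values())  # jobs that still need executors

print(f"blocking trigger cells per PR test: {blocking_cells}")
print(f"downstream jobs per PR test:        {downstream}")
```

That downstream count is roughly where the "another dozen jobs" per PR test comes from.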
It's the top-level job (test_develop_pull_request) using trigger tasks to call other
jobs that causes the problem, since each of those trigger tasks takes up
an executor that's doing nothing apart from blocking on another job to
run. Since these jobs tend to run first, they can easily take up all of
the executors if lots of PR tests are submitted at once and don't leave
any capacity for the underlying jobs to run.
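As a toy illustration of that failure mode (a minimal model, not Foreman's CI code; the executor and job counts below are made up), blocking trigger cells can starve their own children of executors:

```python
# Toy model of the deadlock: trigger cells grab an executor and hold it while
# waiting for downstream jobs, which also need executors to run.
# The executor pool size here is illustrative, not the real Jenkins pool.

def can_finish(executors: int, pr_tests: int, cells_per_pr: int = 4) -> bool:
    """Return True if the queue can drain, False if it deadlocks."""
    # The trigger cells tend to run first, so they fill slots before any
    # downstream job gets a chance.
    blocking = min(pr_tests * cells_per_pr, executors)
    free = executors - blocking
    # Downstream jobs only progress if at least one executor is left over;
    # if every slot is a cell blocked on a child, nothing ever completes.
    return free > 0 or blocking == 0

print(can_finish(executors=20, pr_tests=2))   # True: children still fit
print(can_finish(executors=20, pr_tests=12))  # False: every slot is blocked
```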
I've got a couple of ideas on how to solve it, but haven't spent much
time on it lately.
One would need an update of the script we use to submit Jenkins jobs
(https://github.com/theforeman/test-pull-requests), or we can migrate to
the Jenkins plugin for GitHub PR testing, which looks like it may
support this configuration now, but it would need testing (and some
changes to how we configure git repos in jobs).
The other is to write a patch for the trigger parameterised job Jenkins
plugin to somehow have the triggered job run on the same executor that's
currently waiting for the blocked job. Maybe have the triggering job
switch to being a flyweight job so the slot is freed up. I've no idea
how feasible that is.
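A rough sketch of what the first idea could look like from the submission side (hypothetical: the Jenkins URL, credentials and job parameters are placeholders, and the real test-pull-requests script and job parameters will differ; only the job names come from this thread) is to hit each downstream job's buildWithParameters endpoint directly, so nothing sits on an executor just to trigger other jobs:

```python
# Hypothetical sketch of the first idea: submit the underlying jobs straight
# from the PR-test script instead of through a blocking top-level trigger job.
# The URL, credentials and parameter names are placeholders, not the real
# test-pull-requests configuration; only the job names come from this thread.
import requests

JENKINS_URL = "https://jenkins.example.org"   # placeholder
AUTH = ("ci-user", "api-token")               # placeholder credentials

DOWNSTREAM_JOBS = [
    "test_develop_pr_core",
    "test_develop_pr_rubocop",
    "test_katello_core",
    "test_branch_upgrade",
]

def submit_pr_test(pr_number: int, branch: str) -> None:
    """Queue every downstream job for one PR without blocking any executor."""
    for job in DOWNSTREAM_JOBS:
        resp = requests.post(
            f"{JENKINS_URL}/job/{job}/buildWithParameters",
            auth=AUTH,
            params={"pr": pr_number, "branch": branch},  # assumed parameter names
            timeout=30,
        )
        resp.raise_for_status()

submit_pr_test(1234, "develop")
```

Reporting a single combined status back to the PR would still need handling separately.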
···
On 26/02/15 19:44, Eric D Helms wrote:
> Looks like Jenkins is backed up again. Did a recent change to the
> infrastructure cause this, or are we suddenly pushing a lot more volume?
This is purely anecdotal, but over the past week whenever I have looked at
why the CI is jammed up, it always appears to be a job that is running a
test against mysql that ends up jamming the works up.
Also, are the timeouts for each job set low enough? I'd think if they were
set to 1 to 1.5 hours they would fail and free up space such that things
would clear up over time.
Eric
···
> Discovery tests on the develop branch have been stuck for almost a month
> waiting for a stable build, and the PR to fix them was merged today.
> This caused all of the open discovery PRs (12) to start running tests.
> That was 6 hours ago, so it couldn't have caused the clog Justin mentioned,
> but it likely caused (or at least helped) the recent backup.
We have merged the blocking patch finally today, so the next nightly
should be in good shape again, and the 3.0.0 discovery release is
tomorrow (for Foreman 1.8).
Sorry for the delay; dependent blockers. A whole chain of blockers, I
should say.
···
> This is purely anecdotal, but over the past week whenever I have looked
> at why the CI is jammed up, it always appears to be a job that is
> running a test against mysql that ends up jamming the works up.
Test times for MySQL look identical to the other DBs; I expect that's just
chance.
> Also, are the timeouts for each job set low enough? I'd think if they
> were set to 1 to 1.5 hours they would fail and free up space such that
> things would clear up over time.
Good idea, I've put a two hour timeout on the test_develop_pull_request
top level job.
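In the same spirit, stuck builds can also be spotted and aborted through Jenkins' JSON API. This is only an illustrative sketch (the URL, credentials and threshold below are placeholders), not what is actually configured on the project's Jenkins:

```python
# Hypothetical watchdog along the lines of the timeout above: abort any build
# of the top-level job that has run longer than a threshold, freeing its
# executor. URL, credentials and the two-hour threshold are placeholders.
import time
import requests

JENKINS_URL = "https://jenkins.example.org"   # placeholder
AUTH = ("ci-user", "api-token")               # placeholder credentials
JOB = "test_develop_pull_request"
MAX_SECONDS = 2 * 60 * 60                     # mirrors the two-hour timeout

def abort_stuck_builds() -> None:
    info = requests.get(
        f"{JENKINS_URL}/job/{JOB}/api/json",
        params={"tree": "builds[number,timestamp,building]"},
        auth=AUTH,
        timeout=30,
    ).json()
    now_ms = time.time() * 1000
    for build in info.get("builds", []):
        age_seconds = (now_ms - build["timestamp"]) / 1000  # timestamp is in ms
        if build.get("building") and age_seconds > MAX_SECONDS:
            # POSTing to .../<number>/stop aborts a running build.
            requests.post(f"{JENKINS_URL}/job/{JOB}/{build['number']}/stop",
                          auth=AUTH, timeout=30)

abort_stuck_builds()
```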