Jenkins congestion

Jenkins appeared to be completely clogged up with tasks that had been
running for ~5-10 hours and were not completing.

I cancelled a slew of katello and foreman jobs in order to unblock
everything and keep it going. You may need to manually initiate tests
on open PRs.

-Justin

Looks like Jenkins is backed up again. Did a recent infrastructure change
suddenly cause this, or are we just pushing a lot more volume?

Eric

··· On Wed, Feb 25, 2015 at 8:34 PM, Justin Sherrill wrote:


Discovery tests on the develop branch have been stuck for almost a month
waiting for a stable build, and the PR to fix them was merged today.
This caused all of the open discovery PRs (12) to start running tests.
That was 6 hours ago, so it couldn't have caused the clog Justin mentioned,
but it likely caused (or at least contributed to) the recent backup.

··· On Thu, Feb 26, 2015 at 9:44 PM, Eric D Helms wrote:


As Ori alluded to, it often happens when a bunch of PR tests are
submitted at once. One person can easily cause it if they rebase a
dozen PRs and force-push, or even just repeatedly update one or two PRs
within a few minutes. Also, when a master branch build is broken we
have Jenkins configured to pause PR tests, so these can flood in once it's
fixed if it has been broken for some time. Every PR test causes another
dozen jobs on Jenkins.

We have a problem with how our jobs are laid out that means Jenkins can
get deadlocked rather than (eventually) crunching through a large job
queue as you'd expect. It sounds like this happened again.

Taking test_develop_pull_request as an example, it's a matrix job with
four subcategories:

  • foreman
  • rubocop
  • katello
  • upgrade17

Each one then uses the trigger job task to start another job on Jenkins,
so we end up with:

  • foreman
    • test_develop_pr_core
      • ruby/database matrix (9 jobs)
  • rubocop
    • test_develop_pr_rubocop
  • katello
    • test_katello_core
      • ruby/database matrix (1 job)
  • upgrade17
    • test_branch_upgrade
      • database matrix (3 jobs)

It's the top-level job (test_develop_pull_request) triggering other
jobs that causes the problem, since each of those trigger tasks takes up
an executor that does nothing except block on another job running.
Since these jobs tend to run first, they can easily take up all of the
executors if lots of PR tests are submitted at once, leaving no capacity
for the underlying jobs to run.
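
To make that concrete, here's a toy model of the deadlock. The executor
count and PR volume are made-up numbers, not our actual configuration:

```
# Toy model of the trigger-task deadlock; all numbers are hypothetical.
EXECUTORS = 16            # pretend total of heavyweight executor slots
TRIGGERS_PER_PR = 4       # foreman, rubocop, katello, upgrade17 matrix cells
prs_submitted = 5         # PR tests arriving at roughly the same time

# Each matrix cell grabs an executor and then just blocks, waiting for the
# job it triggered (test_develop_pr_core etc.) to finish on another executor.
blocked_parents = min(prs_submitted * TRIGGERS_PER_PR, EXECUTORS)
free_for_children = EXECUTORS - blocked_parents

print("executors held by blocked trigger tasks:", blocked_parents)
print("executors left for the jobs they wait on:", free_for_children)
if free_for_children == 0:
    print("deadlock: every executor is waiting on a job that can never start")
```

Once every executor is held by a blocked trigger task, nothing queued
behind them can ever start.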

I've got a couple of ideas on how to solve it, but haven't spent much
time on it lately.

  1. GitHub's status API now supports multiple services
    (https://github.com/blog/1935-see-results-from-all-pull-request-status-checks),
    so we could remove our top level job and instead have four individual
    statuses showing up. This would be great for developer usability too.

This would need an update of the script we use to submit Jenkins jobs
(https://github.com/theforeman/test-pull-requests), or we could migrate to
the Jenkins plugin for GitHub PR testing, which looks like it may support
this configuration now but would need testing (and some changes to how we
configure git repos in jobs). There's a rough sketch of the per-suite
statuses after this list.

  2. Write a patch for the parameterised trigger Jenkins plugin to
    somehow have the triggered job run on the same executor that's currently
    waiting for the blocked job. Maybe have the triggering job switch to
    being a flyweight job so the slot is freed up. I've no idea how
    feasible that is.
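
As a rough sketch of idea 1 (the repository, token and suite names below
are placeholders, and the real reporting would live in test-pull-requests
or the Jenkins jobs themselves), each downstream job could post its own
commit status with a distinct context:

```
# Sketch of per-suite commit statuses via GitHub's statuses API.
# Repository, token and suite names are placeholders.
import requests

GITHUB_API = "https://api.github.com"
REPO = "example-org/example-repo"
TOKEN = "<token with repo:status scope>"

def report_status(sha, suite, state, build_url):
    """Post one status per suite; each 'context' shows as its own line on the PR."""
    r = requests.post(
        "{0}/repos/{1}/statuses/{2}".format(GITHUB_API, REPO, sha),
        headers={"Authorization": "token " + TOKEN},
        json={
            "state": state,                 # pending, success, failure or error
            "context": "jenkins/" + suite,  # e.g. jenkins/foreman, jenkins/rubocop
            "description": "{0} tests: {1}".format(suite, state),
            "target_url": build_url,
        },
    )
    r.raise_for_status()

# e.g. report_status(head_sha, "rubocop", "success", "<link to the Jenkins build>")
```

That would remove the need for the blocking top-level job entirely, since
nothing has to sit on an executor waiting to aggregate the results.
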
··· On 26/02/15 19:44, Eric D Helms wrote:


Dominic Cleal
Red Hat Engineering

This is purely anecdotal, but over the past week, whenever I have looked at
why the CI is jammed up, the culprit always appears to be a job running a
test against MySQL.

Also, are the timeouts for each job set low enough? I'd think if they were
set to 1-1.5 hours, stuck jobs would fail and free up space so that things
would clear up over time.

Eric

··· On Mon, Mar 2, 2015 at 5:27 AM, Dominic Cleal wrote:


> Discovery tests on the develop branch have been stuck for almost a month
> waiting for a stable build, and the PR to fix them was merged today.
> This caused all of the open discovery PRs (12) to start running tests.
> That was 6 hours ago, so it couldn't have caused the clog Justin mentioned,
> but it likely caused (or at least contributed to) the recent backup.

We have merged the blocking patch finally today; the next nightly should be
in good shape again, and the discovery 3.0.0 release is tomorrow (for
Foreman 1.8).

Sorry for the delay; dependent blockers. A whole chain of blockers, I
should say :)

--
Later,
  Lukas #lzap Zapletal

> This is purely anecdotal, but over the past week whenever I have looked
> at why the CI is jammed up, it always appears to be a job that is
> running a test against mysql that ends up jamming the works up.

Test times for MySQL look identical to other DBs; I expect that's just
chance.

> Also, are the timeouts for each job set low enough? I'd think if they
> were set to 1 to 1.5 hours they would fail and free up space such that
> things would clear up over time.

Good idea; I've put a two-hour timeout on the test_develop_pull_request
top-level job.
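
The timeout itself is just a setting on the job, but to illustrate the same
policy, a watchdog against the Jenkins JSON API would look something like
the sketch below (the server URL is an example and authentication is
omitted):

```
# Hypothetical watchdog that aborts builds running past a limit. This is not
# how the timeout is actually applied (that's configured on the job itself);
# it only illustrates the policy. Server URL is an example, auth omitted.
import time
import requests

JENKINS = "http://jenkins.example.org"
JOB = "test_develop_pull_request"
LIMIT = 2 * 60 * 60                      # two hours, in seconds

job = requests.get(JENKINS + "/job/" + JOB + "/api/json?tree=builds[number]").json()
for build in job["builds"]:
    url = "{0}/job/{1}/{2}".format(JENKINS, JOB, build["number"])
    info = requests.get(url + "/api/json").json()
    if not info["building"]:
        continue
    running_for = time.time() - info["timestamp"] / 1000.0   # timestamp is ms since epoch
    if running_for > LIMIT:
        requests.post(url + "/stop")     # abort the stuck build
```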

··· On 02/03/15 18:30, Eric D Helms wrote:


Dominic Cleal
Red Hat Engineering

>> Discovery tests on develop branch have been stuck for almost a month
>> waiting for a stable build and the PR to fix them was merged today.
>> This caused all of the open discovery PRs (12) to start running tests.
>> This was 6 hours ago so it couldn't have caused the clog Justin mentioned
>> but it is likely it caused (or at least helped) the recent back up.
>
> We have merged the blocking patch finally today

Er, it was merged last week:
https://github.com/theforeman/foreman_discovery/commit/6a89b7a4d79369015882a7969a1c9449fac02a4d

(2f9a830b didn't change anything.)

> the next nightly should be in good shape again

There are no nightly builds of plugins.

> and the discovery 3.0.0 release is tomorrow (for
> Foreman 1.8).

Sounds good! :)

··· On 04/03/15 15:21, Lukas Zapletal wrote:


Dominic Cleal
Red Hat Engineering