Jenkins slaves not responding

Dominic_Cleal2 · August 16, 2016, 10:56am

There are a few Jenkins slaves I've seen this morning that have stopped
responding during jobs, causing IOExceptions and remote channel failure
messages.

I've taken a few of them offline so these should stop occurring. Jenkins
capacity will be much reduced until I can figure out why the hosts are
freezing.


-- Dominic Cleal dominic@cleal.org

Justin_Sherrill · August 29, 2016, 4:13pm

Out of curiosity, were these ever put back into use? Recently I have
felt like jobs are taking a lot longer and Jenkins is getting
overloaded more frequently. That seemed to start after this event.

-Justin


Dominic_Cleal2 · August 30, 2016, 6:49am

They were put back online with reduced slot counts (2 instead of 4)
which so far seems to have stopped them locking up. I intend to try 3
slots at some point this week.


ehelms · August 30, 2016, 3:24pm

I'd like to propose adding the Jenkins Pipeline plugin as a way to help
reduce load by cutting the number of jobs [1]. Some jobs spawn multiple
jobs: test_katello, for example, runs 4 different test sets, each
spawning a separate job. With the Pipeline plugin this could be reduced
to a single job with multiple stages. The stages could either be
parallelized, which keeps the current view of every piece that breaks, or
run as a true serialized pipeline with the fail-fast stages first, which
keeps the test jobs as short as possible and occupies as few slots as
possible.

[1] https://wiki.jenkins-ci.org/display/JENKINS/Pipeline+Plugin
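
The trade-off between the two layouts can be sketched with a toy model. This is only an illustration, not a Jenkins configuration: the stage names, durations and pass/fail outcomes are made up, each stage is assumed to occupy exactly one executor slot while it runs, and "fail fast" is assumed to mean "run the shortest stages first and stop at the first failure".

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Stage:
    name: str      # hypothetical stage name
    minutes: int   # hypothetical duration in minutes
    passes: bool   # whether the stage succeeds in this imaginary run


def parallel_cost(stages: List[Stage]) -> Tuple[int, int]:
    """Return (peak_slots, slot_minutes) when every stage runs at once."""
    return len(stages), sum(s.minutes for s in stages)


def serial_fail_fast_cost(stages: List[Stage]) -> Tuple[int, int]:
    """Return (peak_slots, slot_minutes) for shortest-first stages
    that stop at the first failure."""
    used = 0
    for stage in sorted(stages, key=lambda s: s.minutes):
        used += stage.minutes
        if not stage.passes:
            break  # later stages never start, so they use no slot time
    return 1, used


# Made-up stages standing in for the 4 test sets of a job like test_katello.
STAGES = [
    Stage("lint", 5, passes=False),
    Stage("unit", 30, passes=True),
    Stage("integration", 45, passes=True),
    Stage("packaging", 20, passes=True),
]

if __name__ == "__main__":
    print("parallel:", parallel_cost(STAGES))
    print("serial fail-fast:", serial_fail_fast_cost(STAGES))
```

In this model the parallel layout always pays for every stage and holds several slots at once, while the serialized layout holds one slot and only pays for the stages up to the first failure. When everything passes, both use the same total slot time and the serialized run simply takes longer on the clock.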

–
You received this message because you are subscribed to the Google Groups
“foreman-dev” group.

–
Eric D. Helms
Red Hat Engineering
Ph.D. Student - North Carolina State University

Dominic_Cleal2 · August 31, 2016, 7:43am

The slave that I attempted this on crashed within a day, so they'll
remain on two slots each.


Dominic_Cleal2 · August 31, 2016, 7:33am

I'll install the plugin at the next maintenance opportunity.


On 30/08/16 16:24, Eric D Helms wrote:
> I'd like to propose adding the Jenkins Pipeline plugin as a way to help
> reduce load through number of jobs [1]. [..]
>
> [1] https://wiki.jenkins-ci.org/display/JENKINS/Pipeline+Plugin


TimoGoebel · September 1, 2016, 8:00am

Dominic,

> On 31.08.2016, at 09:43, Dominic Cleal wrote:
>
> The slave that I attempted this on crashed within a day, so they'll
> remain on two slots each.

Do you know why these crashes happen? Are there any monitoring graphs that show CPU/memory usage over time? Judging from what I've read here, the issue sounds like an OOM problem. Do you see anything related in the system’s logs?

Timo

Dominic_Cleal2 · September 1, 2016, 8:40am

No, I don't know precisely, and the slave that crashed yesterday doesn't
have the Rackspace monitoring agent installed to collect memory data.

slave09, one of those that crashed on the morning of 16/08, was showing
high memory and swap usage on 15/08
(http://paste.fedoraproject.org/418780/47271889/), so memory exhaustion
is quite likely. I don't have any data about what processes were using
the memory in this instance.

slave10, which crashed yesterday, only logged hanging tasks
(https://paste.fedoraproject.org/418772/72718379/), which I've also seen
on consoles and in logs from other crashed systems. There's no ability to
send a sysrq through the provided console, and nothing like kdump or a
watchdog is configured on those systems.


Dominic_Cleal2 · September 3, 2016, 11:05am

The plugin is installed now.
