Katello tests hang due to OOM on Jenkins nodes

At times, Katello PR tests hang until they are killed by the timeout. The most common cause appears to be the node hitting an OOM condition, which gets the Ruby process killed. Jenkins does not recognize this and waits for the timeout. There are two issues at play here:

  1. Nodes going OOM
  2. Jenkins not recognizing when the running process dies and stopping the job

For #1, I am lowering the number of executors to 2 to try to prevent resource exhaustion from concurrent jobs. We’ll need to watch whether this causes a spike in queued jobs; if it does, we can either move this up to 3 or bring more nodes online. I’m also open to additional ideas on how to prevent or tackle this.
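As a rough way to reason about the executor count, here is a small sketch of the arithmetic: how many concurrent jobs fit in a node's memory once some is set aside for the OS and the Jenkins agent. The function name and every number in it are hypothetical placeholders, not measurements from our nodes; real values would have to come from watching a test run.

```python
def max_executors(node_mem_mb, reserved_mb, peak_job_mb):
    """Return how many jobs fit in memory at once, keeping a reserve free.

    All three inputs are assumptions the caller has to measure:
    total node memory, memory held back for the OS and agent,
    and the peak memory of a single test job.
    """
    if peak_job_mb <= 0:
        raise ValueError("peak_job_mb must be positive")
    usable_mb = node_mem_mb - reserved_mb
    # Integer division: a job that only partly fits does not count.
    return max(usable_mb // peak_job_mb, 0)


# Hypothetical example: 8 GB node, 1 GB reserved, 3 GB peak per job.
print(max_executors(node_mem_mb=8192, reserved_mb=1024, peak_job_mb=3072))  # prints 2
```

If the measured peak per job turns out to be lower than guessed, the same calculation would show whether 3 executors is safe.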

For #2, googling didn’t turn up much on how to deal with this other than: don’t go OOM.
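One idea I have not tried on our jobs: wrap the test command so that a death by signal is turned into an explicit failure message and a non-zero exit code. The sketch below assumes a POSIX node and relies on Python's `subprocess` reporting a negative `returncode` when the child is terminated by a signal (the kernel OOM killer sends SIGKILL). The name `run_and_report` is made up for illustration. It only sees the direct child, so it would not help if the killed process is a grandchild that the parent keeps waiting on, and it cannot tell an OOM kill apart from any other SIGKILL.

```python
import signal
import subprocess


def run_and_report(cmd):
    """Run cmd to completion and return (exit_code, message).

    A negative returncode from subprocess means the child was
    terminated by that signal number (POSIX only).
    """
    rc = subprocess.run(cmd).returncode
    if rc < 0:
        signum = -rc
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number without a named constant on this platform.
            name = f"signal {signum}"
        # 128 + signal number follows the usual shell convention.
        return 128 + signum, f"command killed by {name}"
    return rc, f"command exited with status {rc}"


# Demo: a shell that sends SIGKILL to itself, standing in for an OOM kill.
code, message = run_and_report(["sh", "-c", "kill -9 $$"])
print(code, message)
```

A job step could call this with the real test command and exit with the returned code, so a killed process fails the build right away instead of waiting on the timeout.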
