Proposal - nightly pipeline monitoring rotation

Hey all,

In the upstream testing working group, we have been discussing a way to handle breakages in the nightly pipeline. The current plan is to add e2e testing to the nightly pipeline. Given that the nightly pipeline currently breaks and stays broken, the concern is that adding more testing will cause more breakages and more neglect of that pipeline.

There are some ideas on how we can improve this and deal with breakages in a timely manner. The overarching thought behind these proposals is to have someone directly responsible for the pipeline who can triage breakages, alert the relevant parties, and monitor the fixes.

Here are the ideas we came up with so far:

  • Option 1: Weekly rotation
    • This would be a weekly rotation of 2 people, one from Foreman and one from Katello, to monitor the pipeline and triage fixes.
    • Pros: Good coverage of the pipeline, and only a one-week commitment from each developer.
    • Cons: We already have a community support triage rotation in Katello, so developers would be on two separate rotations, which could mean being on rotation for two weeks every few months.
  • Option 2: Community + nightly rotation
    • This would be a weekly rotation of 2 people, one from Foreman and one from Katello, to monitor the pipeline, triage fixes, and monitor the community support channel.
    • There is a community support rotation in Katello which has been very effective, and I think there has been some interest in starting one in Foreman too. This could be combined with monitoring nightly tests.
    • Pros: Good coverage, a one-week commitment, no “double commitment” to rotations.
    • Cons: More responsibility during the week a developer is assigned to the rotation.
  • Option 3: Release owner monitors nightly pipeline
    • The release owner from Foreman/Katello/plugins would monitor the nightly pipeline for breakages and be responsible for triaging them.
    • Pros: The release owner would eventually be responsible for triaging breakages before the release anyway, so there is direct responsibility and ownership.
    • Cons: The release owner would be responsible for a long period, likely 3 months.
  • Option 4: Hybrid approach
    • Have 1 person assigned to a weekly rotation for the nightly pipeline (selected from Foreman/Katello/plugin/infra devs).
    • The release owners for Foreman core and Katello would be points of contact for this single triage person, who would consult them for help triaging breakages outside their own knowledge domain. For example, if I am a Foreman core dev and notice a Content View test is failing, I would ping the Katello release owner to help me assess.
    • Pros: Each person is on rotation less frequently, and release owners play a part in ownership.
    • Cons: Could still be too much of a burden on release owners, and communication could be hard going from weekly triage assignee -> release owner -> developer responsible for fix.

If there are more options that I am missing, please do comment. I tried to be objective in my pros/cons assessment, but feel free to share other concerns or benefits.

I’ll add a poll that will help gauge interest in each option.

It should be added that whatever approach we take, we should stay flexible and see how the approach works in reality. We will probably have to tweak things and have more discussions as we see how an approach plays out. So whatever we decide here will not be set in stone.

Another thing to note is that we can use technology to assist us in these processes. For instance, a bot could directly ping the current owner of the nightly pipeline in the Discourse thread where it reports a failure.

Let me know if there are any questions or concerns!

Here is a poll to gauge interest in each option:

  • 2-person weekly rotation
  • 2-person weekly rotation that includes community support
  • Release owner monitors nightly pipeline
  • Hybrid approach - 1 person weekly rotation and release owners are points of contact
  • Do nothing

There weren’t a ton of votes, but it looks like “2-person weekly rotation that includes community support” has the most. Please weigh in now if you didn’t get the chance to and have an opinion. Unless there are strong objections, I think we are ready to take the next steps toward setting up a rotation.

To review: This would be a weekly rotation of one person from Foreman and one from Katello that includes both community support and nightly pipeline triage. The 2 people on duty wouldn’t necessarily be responsible for fixing issues themselves, but rather for making sure the right people are aware and for facilitating fixes.

The next steps I can see are:

  • Write documentation on what is expected during this rotation and any relevant resources
  • Make a schedule including Foreman/Katello devs at weekly slots
  • Start the process!
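
As a rough illustration of the schedule step, here is a minimal sketch of how weekly slots could be generated by cycling through two lists of developers. Everything in it is an assumption made for illustration: the function name `build_schedule`, the placeholder developer names, and the start date are hypothetical, and the real schedule may well be assembled by hand.

```python
from datetime import date, timedelta
from itertools import cycle, islice


def build_schedule(foreman_devs, katello_devs, start, weeks):
    """Pair one Foreman dev and one Katello dev for each weekly slot.

    Both lists are cycled independently, so lists of different lengths
    still give everyone a turn before anyone repeats.
    """
    pairs = zip(cycle(foreman_devs), cycle(katello_devs))
    return [
        (start + timedelta(weeks=i), foreman, katello)
        for i, (foreman, katello) in enumerate(islice(pairs, weeks))
    ]


# Placeholder names and start date, purely for illustration.
foreman_devs = ["foreman_dev_a", "foreman_dev_b", "foreman_dev_c"]
katello_devs = ["katello_dev_a", "katello_dev_b"]
schedule = build_schedule(foreman_devs, katello_devs, date(2024, 1, 1), 4)

for week_start, foreman, katello in schedule:
    print(week_start.isoformat(), foreman, katello)
```

Cycling the two lists separately keeps the pairing simple, but it does not account for vacations or swaps, so those would still need manual adjustment.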

As for future steps, I would love to have this process as automated as possible. For instance:

  • Have a bot that emails devs reminders before their rotation starts.
  • Automatically ping the on-duty people in a pipeline failure Discourse thread.
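
To make the two automation ideas a bit more concrete, here is a small sketch of the lookup logic such a bot might need: who is on duty today, who should get a reminder, and what a ping could say. It assumes a schedule stored as (week start, people) pairs; the function names, the placeholder usernames, the job name, and the message wording are all hypothetical. It deliberately does not send email or post to Discourse, since how the bot would integrate with either is still open.

```python
from datetime import date, timedelta


def on_duty(schedule, today):
    """Return the people whose weekly slot contains `today`, or () if none."""
    for week_start, people in schedule:
        if week_start <= today < week_start + timedelta(weeks=1):
            return people
    return ()


def starting_soon(schedule, today, days_ahead=3):
    """Return the people whose slot starts exactly `days_ahead` days from today."""
    target = today + timedelta(days=days_ahead)
    for week_start, people in schedule:
        if week_start == target:
            return people
    return ()


def failure_ping(schedule, today, job_name):
    """Compose the text a bot could post in the failure thread."""
    people = on_duty(schedule, today)
    if not people:
        return f"{job_name} failed, and nobody is scheduled for this week."
    mentions = " ".join(f"@{name}" for name in people)
    return f"{mentions} {job_name} failed. Could you triage it?"


# Placeholder schedule, purely for illustration.
example_schedule = [
    (date(2024, 1, 1), ("foreman_dev_a", "katello_dev_a")),
    (date(2024, 1, 8), ("foreman_dev_b", "katello_dev_b")),
]

print(failure_ping(example_schedule, date(2024, 1, 3), "nightly-pipeline"))
print(starting_soon(example_schedule, date(2024, 1, 5)))
```

Keeping the lookup separate from the delivery means the same schedule data could drive both the reminder emails and the thread pings.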

Let me know if there are any questions or concerns!