Hey all,
In the upstream testing working group, we have been discussing a way to handle breakages in the nightly pipeline. The current plan is to add e2e testing to the nightly pipeline. Given that the nightly pipeline currently breaks and stays broken, the concern is that adding more testing will cause more breakages and more neglect of that pipeline.
There are some ideas on how we can improve this and deal with breakages in a timely manner. The overarching thought behind these proposals is to have someone directly responsible for the pipeline who can triage breakages, alert the relevant parties, and monitor the fixes.
Here are the ideas we came up with so far:
Option 1: Weekly rotation
- This would be a weekly rotation of 2 people, one from Foreman and one from Katello, to monitor the pipeline and triage fixes.
- Pros: Good pipeline coverage, and only a one-week commitment from each developer.
- Cons: Katello already has a community support triage rotation, so being on rotation for two weeks every few months could be a burden on developers.
Option 2: Community + nightly rotation
- This would be a weekly rotation of 2 people, one from Foreman and one from Katello, to monitor the pipeline, triage fixes, and monitor the community support channel.
- There is a community support rotation in Katello which has been very effective, and I think there has been some interest in starting one in Foreman too. This could be combined with monitoring nightly tests.
- Pros: Good coverage, a one-week commitment, no “double commitment” to rotations.
- Cons: More responsibility during the week a developer is assigned to the rotation.
Option 3: Release owner monitors nightly pipeline
- The release owner from Foreman/Katello/plugins would monitor the nightly pipeline for breakages and be responsible for triaging them.
- Pros: The release owner is eventually responsible for triaging breakages before the release anyway, so there is direct responsibility and ownership.
- Cons: The release owner would be responsible for a long period, likely 3 months.
Option 4: Hybrid approach
- Have 1 person assigned to a weekly rotation for the nightly pipeline (would be selected from Foreman/Katello/plugin/infra devs)
- The release owners for Foreman core and Katello would be points of contact for this single triage person, who would consult them to help triage breakages outside of their own knowledge domain. For example, if I am a Foreman core dev and notice a Content View test is failing, I would ping the Katello release owner to help me assess.
- Pros: Less frequent rotation duty for each developer, and release owners share in the ownership.
- Cons: Could still be too much of a burden on release owners. Communication could be hard going from weekly triage assignee -> release owner -> developer responsible for fix.
If there are more options I am missing, please do comment. I tried to be objective in my pros/cons assessment, but feel free to share other concerns or benefits.
I’ll add a poll to help gauge interest in each option.
It should be added that whatever approach we take, we should stay flexible and see how the approach works in reality. We will probably have to tweak things and have more discussions as we see how an approach plays out. So whatever we decide here will not be set in stone.
Another thing to note is that we can use technology to assist us in these processes. For instance, a bot could directly ping the current owner of the nightly pipeline in the Discourse thread where it reports the failure.
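To make the bot idea a bit more concrete, here is a rough sketch of how it could work out who is on duty in a given week and build the ping text. Everything in it is an assumption for illustration: the handles, the rotation start date, the job name, and the function names are all made up, and posting to Discourse is left out entirely.

```python
from datetime import date

# Hypothetical rotation list; these handles are placeholders, not real users.
ROTATION = ["@foreman_dev_a", "@katello_dev_b", "@plugin_dev_c", "@infra_dev_d"]

# Assumed start of the first rotation week (a Monday).
ROTATION_START = date(2024, 1, 1)


def current_owner(today, rotation=ROTATION, start=ROTATION_START):
    """Return the handle on duty for the week containing `today`."""
    weeks_elapsed = (today - start).days // 7
    # Wrap around so the rotation repeats once everyone has had a turn.
    return rotation[weeks_elapsed % len(rotation)]


def failure_ping(today, job_name):
    """Build the text a bot could post in the failure thread."""
    owner = current_owner(today)
    return f"{owner} nightly job '{job_name}' failed, please triage."


if __name__ == "__main__":
    # 'katello-nightly' is a made-up job name for the example.
    print(failure_ping(date.today(), "katello-nightly"))
```

The same lookup would work for any of the options above: for Option 3 the list would just hold the release owners and the period would be a release cycle instead of a week.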
Let me know if there are any questions or concerns!