Sending Jenkins build failures to Discourse

Gwmngilfen · June 11, 2018, 1:38pm

(This has been discussed a bit privately, so I thought it was time to make an RFC)

We’ve got ongoing issues with many of the jobs on Jenkins (notably the nightly builds, but others such as plugins too). A proposal has been made to send the build failures to Discourse, both to increase visibility of the failures, and also raise awareness of the CI processes so that others can help out.

Proposal

Create a new category (suggestions for name? I’m going with “CI failures” so we can use it for more than just Jenkins if we wish)
Have Jenkins send build failures to the category via the Discourse API
The created topics should be tagged with something relating to the job name
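
To make the tagging idea concrete, here is a rough sketch of how a topic title and tag could be derived from a job name. Nothing here has been agreed in this thread: the function names, the title wording, and the tag format are all assumptions for illustration.

```python
import re


def job_tag(job_name: str) -> str:
    """Derive a lowercase, dash-separated tag from a Jenkins job name.

    The tag format is an assumption; any stable mapping would do.
    """
    return re.sub(r"[^a-z0-9]+", "-", job_name.lower()).strip("-")


def failure_topic(job_name: str, build_number: int, build_url: str) -> dict:
    """Build the title, body, and tags for a failure topic (hypothetical layout)."""
    return {
        "title": f"{job_name} #{build_number} failed",
        "raw": f"Build {build_number} of {job_name} failed. Details: {build_url}",
        "tags": [job_tag(job_name)],
    }
```

Tagging by job name is what would let people watch only the jobs they can help with, rather than the whole category.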

Possible Workflow

As many people as possible add the whole category (or at least the specific job tags they can help with) to their notifications, so they know about failures
When a failure occurs, the first person to pick it up replies to the thread stating they’re working on it
The topic is closed (or perhaps “marked as solved”) once the failure is fixed

Notes / optional extras:

We need lots of buy-in here. A new category won’t have any subscribers initially so people will need to opt-in to it.
Mailing-list-mode members will get every single failure - sorry. But more seriously, if that’s an issue, consider more fine-grained configuration in your settings.
Volume of traffic from Jenkins needs to be monitored (we already send 35k emails per month, I don’t want that to rise too much)
We could look at using the Assigned-To plugin for tracking who’s working on topics.

Evaluation

This is worth trying, IMO. I will post a poll shortly, but I hope we can agree that trying things out and iterating from there is better than not changing anything. We should evaluate after a couple of months and see what (if anything) needs tweaking.

Proposal Updates

per @tbrisker - category should be positive, such as “Build Status”

tbrisker · June 12, 2018, 7:17am

Thanks for bringing this up Greg!

As the person who suggested this path in the first place, let me just elaborate on a couple of points:

First, the main motivation here is to increase visibility of failures. This came up since nightlies have been broken for very long periods recently, and most developers were not even aware of the situation. The only way to tell now that something is broken is to dig into Jenkins and know which tasks to look at etc., something that only a small group of people know and do on a regular basis. Another suggestion that was brought up was to send such notifications to a small group of people who want them, which I think doesn’t solve this problem either - this group would be the same people who regularly log into jenkins to check the failures. Having the failures posted to discourse, which I believe all devs read on a daily basis, will greatly increase the visibility of the state of the builds.

The second point is that this will make collaboration much easier. Right now, if I see a failure on Jenkins, I don’t know if it’s just a random failure, if someone is already working on it, etc. Having each failure as a post in discourse will allow an easy way of collaborating and reducing duplicate efforts.

One last point that hasn’t been mentioned is that having past failures recorded will allow us to possibly notice patterns or repeated issues, something which now can only be done if the same person happens to handle the same issue multiple times. The resolutions will also be recorded, so if needed in the future other people will be able to reuse them when possible, instead of having to figure out the solution on their own again.

Regarding the buy-in - I, for one, read discourse using the “Latest” view, which shows me posts in all categories. I recommend other developers do so as well, as that way you can see all posts all the time and not miss things in different categories.

Regarding the name - maybe we should be positive and call it “Build Status” or something similar?

dLobatog · June 12, 2018, 2:14pm

I’m neutral about it, to be honest I think it’s easier to post whenever you notice something’s broken like on:

How does that change with a job automatically posting failures to Discourse?

That’s a good point, but IMO for the sake of having a better noise/signal ratio, a developer can just post here when something is truly broken. If everything gets posted, because of the nature of our jobs (not stable enough, nor do we care about it enough) it will be kind of similar to looking at http://ci.theforeman.org/

That’s a good point - but again I think it is similar if whoever finds something truly broken posts about it.

tbrisker · June 12, 2018, 4:49pm

Hopefully, since most devs are on discourse regularly, and assuming they won’t just ignore this category, this will increase the visibility of failures.

dLobatog:

tbrisker:

Right now, if I see a failure on Jenkins, I don’t know if it’s just a random failure, if someone is already working on it, etc. Having each failure as a post in discourse will allow an easy way of collaborating and reducing duplicate efforts.

That’s a good point, but IMO for the sake of having a better noise/signal ratio, a developer can just post here when something is truly broken. If everything gets posted, because of the nature of our jobs (not stable enough, nor do we care about it enough) it will be kind of similar to looking at http://ci.theforeman.org/

tbrisker:

One last point that hasn’t been mentioned is that having past failures recorded will allow us to possibly notice patterns or repeated issues, something which now can only be done if the same person happens to handle the same issue multiple times. The resolutions will also be recorded, so if needed in the future other people will be able to reuse them when possible, instead of having to figure out the solution on their own again.

That’s a good point - but again I think it is similar if whoever finds something truly broken posts about it.

The main difference is that the post is created automatically the minute the build breaks, without someone having to manually look at jenkins and post about it. I’m not suggesting every failed build is posted there, only some we identify as more critical - such as nightlies.

dLobatog · June 13, 2018, 1:15pm

It will increase visibility of failures for sure: if you post the failure twice, on IRC and Discourse, more people are likely to see it. I purposefully left that out of the quote; what I was trying to say is that most people will see them and ignore them. It takes intent to open the task and figure out what went wrong, and that’s not changing by posting that to this forum.

As I said in person before, I think the critical issue here is that keeping our upstream releases steady and going well is no one’s top priority.

lzap · June 14, 2018, 6:08am

Honestly, I don’t like the idea. I’d prefer direct email to all people from the git log who made a change. I believe Jenkins can do that; even Hudson was able to do that. The message can ask users to create a thread on discourse if they plan to work on this. We never had these; the last time I tried this it could not be enabled for plugins because we have one job template for all plugins. We could do it for core, however.

lzap · June 14, 2018, 6:10am

Let me elaborate on this - if I break core for everyone, this is a priority. Therefore I believe I need to have a message in my INBOX. That’s the first folder I read every morning. Things are getting done when I am sipping my first coffee.

In my workflow, I go to the important Red Hat mailing lists, then move on to discourse and then to github. A notification there can wait for me until evening; it will likely end up with someone pinging me before lunch. End of story.

Gwmngilfen · June 14, 2018, 8:36am

I’m hearing a lot of “I don’t personally like this” (which is totally fine, and good to know), but I think we need to try more ideas and evolve how we do things. Unless you believe that this trial would actively decrease the amount of work on the Jenkins builds, then I think we should try it. On a personal level, you can always mute the category/tag (or write a filter, if you use mailing list mode).

With that in mind, I did promise a poll, so here it is

Yes - let’s try it
Indifferent - I don’t personally like it
No - I think this will harm the builds


ekohl · June 14, 2018, 8:51am

I voted No because for now I think we should first work on this with a smaller group to get stable nightly builds. There are still some things that I think are false positives or other things we should fix before exposing it to the entire community. We should make sure people can actually have an impact on fixing the problem because otherwise it’s spam to them and they’ll start to ignore it. By the time we make it usable, it’s already on the ignore list and we miss our goal.

Longer term I do think it can be a good idea.

lzap · June 14, 2018, 12:35pm

I am neutral, slightly against this. Whatever we do, let’s make sure it won’t harm the browsing/reading experience, search (!) and also Google (I say it but I mean web search engine crawlers). We don’t want to be losing rank there. Perhaps disallow search engines on these topics, don’t know.

Gwmngilfen · June 19, 2018, 11:34am

I think @ekohl makes an excellent point - yes, but not right now. Given the mixed result of the poll, I’ll come back to this in a few months, after we’ve seen what impact the recent changes can have on the build failures.

ehelms · August 23, 2018, 9:08pm

Based on some discussions amongst the various developers working on releases, I am reviving this idea and proposing it as an experiment with a re-assessment duration and job limitations. The same proposed workflow and rules would apply for where failures are posted to and how developers interact with the posts to communicate breakages and pending fixes. Builds are more stable than they were and we believe this workflow will help keep them that way while exposing to a broader audience when there is a failure, types of failures (in case we need to take systematic tactics) and the resolutions.

Try it out for 2 months
Include only the following jobs:
- foreman-nightly-release
- foreman-plugins-release
- katello-nightly-release
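
As a rough sketch of the job limitation, a posting step could check an allowlist before creating a topic. The three job names are the ones listed above; the function name is hypothetical, and the assumption is that the result string passed in is Jenkins' build result (for example `FAILURE` or `SUCCESS`).

```python
# Jobs included in the pilot, as listed in this post.
PILOT_JOBS = frozenset({
    "foreman-nightly-release",
    "foreman-plugins-release",
    "katello-nightly-release",
})


def should_post(job_name: str, result: str) -> bool:
    """Return True only for failed builds of jobs that are in the pilot."""
    return job_name in PILOT_JOBS and result == "FAILURE"
```

Bringing another job on board would then be a one-line change to the set.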

Other jobs may be proposed or brought on board if they are in a similar critical path as the ones listed above. I’m adding another poll to see what folks think since the last one.

Yes
No
Indifferent


lzap · August 24, 2018, 6:32am

I voted no, let me explain. I am against only testing katello and foreman; I want all (or top 10) plugins to be in the pilot as well. Because today every time I ask for jenkins configuration change the answer is “we would need to do this for ALL plugins, it is not possible”. There are some technical limitations and plugin jobs are not on par with core and katello and that is not fair.

Here is my concern: If this turns out to be a good experience, I can imagine I will not be allowed to enable discovery because “that would create too much noise”. Let’s test this properly with everything, let’s solve all challenges during the pilot (like deleting old posts or something like that).

ehelms · August 24, 2018, 10:58am

I’m not against the inclusion of plugins, assuming in part that plugin maintainers are up for watching the topic and responding. I would like in part to point out that the current focus is on release jobs with no bias towards any particular core or otherwise. At present, Foreman, plugins and Katello are the only projects that have release jobs (and are built nightly).

There are some plans in the works to add more projects being built on a nightly basis. And further, if we can come up with a nightly release design for individual plugins then we can begin to incorporate those into this.

ekohl · August 24, 2018, 11:32am

I voted yes because we have a clear place to do root cause analysis.

This one is still very unstable. In about half of the runs something fails because repoclosure is run in parallel, which it doesn’t properly support (1593331 – Concurrent use of repoclosure breaks).

lzap · August 24, 2018, 11:35am

Oh I missed these are “-release” jobs, voted yes then.

ehelms · September 27, 2018, 1:49pm

Given the approval to try this out I’d like to move forward with it. @Gwmngilfen looking for some help here to set up the discourse category properly and to figure out the best way to send this data to it. What tactic do we need to take:

Send email to discourse from Jenkins
Build CI functionality to hit discourse API to make a post

I am assuming we will also need a user account for Jenkins to send as.

Gwmngilfen · September 27, 2018, 2:33pm

Not sure why Discourse would send email to Jenkins?

Other than that, yes, that’s about it. The list is:

Create the category (I guess as a subcategory of Development?)
Assign it an incoming address (ci@community.theforeman.org?)
Create an account for Jenkins
Figure out how to post via the API

The last one could be done in parallel with an existing account, I guess.

I don’t think Discourse has OAuth2 tokens (could be wrong) so we’ll want to be careful with the Jenkins account password, of course.
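
For the "post via the API" step, a minimal sketch follows. As far as I know, Discourse creates topics through `POST /posts.json` and authenticates with an admin-generated API key sent as `Api-Key` and `Api-Username` headers, but the exact mechanism depends on the Discourse version, so treat that as an assumption to check. The helper name and the category ID are hypothetical. The sketch only builds the request and does not send it.

```python
import json
import urllib.request


def build_create_topic_request(base_url, api_key, api_username,
                               title, raw, category_id, tags):
    """Build (but do not send) a request that would create a Discourse topic.

    Assumes the POST /posts.json endpoint with header-based API key auth.
    """
    payload = {
        "title": title,
        "raw": raw,
        "category": category_id,  # numeric category ID (hypothetical value)
        "tags": list(tags),
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/posts.json",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Api-Key": api_key,
            "Api-Username": api_username,
        },
        method="POST",
    )
```

Sending it would be a matter of passing the request to `urllib.request.urlopen`. Whatever credential is used, it would need to be kept out of the job logs and stored as a Jenkins secret.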

ehelms · September 27, 2018, 2:52pm

Can you send an email that targets a category vs. hitting the API?

Gwmngilfen · September 27, 2018, 2:56pm

Yeah, stupid me - I was thinking that at step 2, and then got diverted to thinking about the API at step 4

Both are viable. The email approach is easier, but of course anyone could technically mail the inbound address for the category. That’s not been the case for the dev and support categories though.

I suggest we go that way and look at the API if email gets abused for some reason
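
A rough sketch of the email approach, assuming the category ends up with an incoming address such as the `ci@community.theforeman.org` floated earlier (not confirmed). The sender address, function name, and message wording are hypothetical. In practice Jenkins' own email notification support would probably do this; the sketch only shows the shape of the message.

```python
from email.message import EmailMessage


def build_failure_email(job_name, build_number, build_url,
                        sender, category_address):
    """Compose the message that would become a new topic in the category."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = category_address
    # Discourse would use the subject as the topic title.
    msg["Subject"] = f"{job_name} #{build_number} failed"
    msg.set_content(
        f"Build {build_number} of {job_name} failed.\n"
        f"Details: {build_url}\n"
    )
    return msg
```

The message could then be handed to `smtplib.SMTP(...).send_message(msg)`. Since anyone can mail the inbound address, the sender would need to match the email address of the Jenkins account for the post to be attributed to it.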