(This has been discussed a bit privately, so I thought it was time to make an RFC)
We’ve got ongoing issues with many of the jobs on Jenkins (notably the nightly builds, but others such as plugins too). A proposal has been made to send the build failures to Discourse, both to increase visibility of the failures, and also raise awareness of the CI processes so that others can help out.
Proposal
Create a new category (suggestions for name? I’m going with “CI failures” so we can use it for more than just Jenkins if we wish)
Have Jenkins send build failures to the category via the Discourse API
The created topics should be tagged with something relating to the job name
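For illustration, here is a minimal sketch of what the Jenkins side of this could look like. It only builds the HTTP request and does not send it. Everything in it is an assumption to check against the Discourse API docs for our instance: the `/posts.json` endpoint, the `Api-Key` / `Api-Username` headers, and the `category` and `tags` fields. The function names, category id, and URLs are hypothetical.

```python
import json
import re
import urllib.request


def job_tag(job_name):
    """Turn a Jenkins job name into a lowercase, dash-separated tag."""
    return re.sub(r"[^a-z0-9]+", "-", job_name.lower()).strip("-")


def build_topic_request(base_url, api_key, api_username, category_id,
                        job_name, build_number, build_url):
    """Build (but do not send) the request that would create a failure topic."""
    payload = {
        "title": f"{job_name} #{build_number} failed",
        "raw": f"Build {build_number} of {job_name} failed.\n\nConsole: {build_url}",
        "category": category_id,      # numeric id of the new category (assumed)
        "tags": [job_tag(job_name)],  # tag derived from the job name
    }
    return urllib.request.Request(
        url=f"{base_url}/posts.json",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Api-Key": api_key,            # assumed header-based auth
            "Api-Username": api_username,  # hypothetical Jenkins account
        },
        method="POST",
    )
```

Sending would then be a single `urllib.request.urlopen(request)` call from a post-build step, with the API key kept in the Jenkins credentials store rather than in the job definition.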
Possible Workflow
As many people as possible add the whole category (or at least the specific job tags they can help with) to their notifications, so they know about failures
When a failure occurs, the first person to pick it up replies to the thread stating they’re working on it
The topic is closed (or perhaps “marked as solved”) once the failure is fixed
Notes / optional extras:
We need lots of buy-in here. A new category won’t have any subscribers initially so people will need to opt-in to it.
Mailing-list-mode members will get every single failure - sorry. But more seriously, if that’s an issue, consider more fine-grained configuration in your settings
Volume of traffic from Jenkins needs to be monitored (we already send 35k emails per month, I don’t want that to rise too much)
This is worth trying, IMO. I will post a poll shortly, but I hope we can agree that trying things out and iterating from there is better than not changing anything. We should evaluate after a couple of months and see what (if anything) needs tweaking.
Proposal Updates
per @tbrisker - category should be positive, such as “Build Status”
As the person who suggested this path in the first place, let me just elaborate on a couple of points:
First, the main motivation here is to increase visibility of failures. This came up since nightlies have been broken for very long periods recently, and most developers were not even aware of the situation. The only way to tell now that something is broken is to dig into Jenkins and know which tasks to look at etc., something that only a small group of people know and do on a regular basis. Another suggestion that was brought up was to send such notifications to a small group of people who want them, which I think doesn’t solve this problem either - this group would be the same people who regularly log into jenkins to check the failures. Having the failures posted to discourse, which I believe all devs read on a daily basis, will greatly increase the visibility of the state of the builds.
The second point is that this will make collaboration much easier. Right now, if I see a failure on Jenkins, I don’t know if it’s just a random failure, if someone is already working on it, etc. Having each failure as a post in discourse will allow an easy way of collaborating and reducing duplicate efforts.
One last point that hasn’t been mentioned is that having past failures recorded will allow us to possibly notice patterns or repeated issues, something which now can only be done if the same person happens to handle the same issue multiple times. The resolutions will also be recorded, so if needed in the future other people will be able to reuse them when possible, instead of having to figure out the solution on their own again.
Regarding the buy-in - I, for one, read discourse using the “Latest” view, which shows me posts in all categories. I recommend other developers do so as well, as that way you can see all posts all the time and not miss things in different categories.
Regarding the name - maybe we should be positive and call it “Build Status” or something similar?
I’m neutral about it, to be honest I think it’s easier to post whenever you notice something’s broken like on:
How does that change with a job automatically posting failures to Discourse?
That’s a good point, but IMO for the sake of having a better signal-to-noise ratio, a developer can just post here when something is truly broken. If everything gets posted, then because of the nature of our jobs (not stable enough, nor do we care about them enough) it will be kind of similar to looking at http://ci.theforeman.org/
That’s a good point - but again I think it is similar if whoever finds something truly broken posts about it.
Hopefully, since most devs are on discourse regularly, and assuming they won’t just ignore this category, this will increase the visibility of failures.
The main difference is that the post is created automatically the minute the build breaks, without someone having to manually look at jenkins and post about it. I’m not suggesting every failed build is posted there, only some we identify as more critical - such as nightlies.
It will increase visibility of failures for sure, if you post the failure twice on IRC and Discourse more people are likely to see it. I purposefully left that out of the quote, what I was trying to say is that most people will see them and ignore them. It takes intent to open the task and figure out what went wrong, and that’s not changing by posting that to this forum.
As I said in person before, I think the critical issue here is that keeping our upstream releases steady and going well is no one’s top priority.
Honestly, I don’t like the idea. I’d prefer a direct email to everyone in the git log who made a change. I believe Jenkins can do that; even Hudson was able to do that. The message can ask users to create a thread on Discourse if they plan to work on the failure. We never had these; the last time I tried, it could not be enabled for plugins because we have one job template for all plugins. We could do it for core, however.
Let me elaborate on this - if I break core for everyone, this is a priority. Therefore I believe I need to have a message in my INBOX. That’s the first folder I read every morning. Things get done while I am sipping my first coffee.
In my workflow, I go to the important Red Hat mailing lists, then move on to Discourse, and then to GitHub. A notification there could sit waiting for me until evening, and it would likely end up with someone pinging me before lunch. End of story
I’m hearing a lot of “I don’t personally like this” (which is totally fine, and good to know), but I think we need to try more ideas and evolve how we do things. Unless you believe that this trial would actively decrease the amount of work on the Jenkins builds, then I think we should try it. On a personal level, you can always mute the category/tag (or write a filter, if you use mailing list mode).
With that in mind, I did promise a poll, so here it is
I voted No because for now I think we should first work on this with a smaller group to get stable nightly builds. There are still some things that I think are false positives or other things we should fix before exposing it to the entire community. We should make sure people can actually have an impact on fixing the problem because otherwise it’s spam to them and they’ll start to ignore it. By the time we make it usable, it’s already on the ignore list and we miss our goal.
I am neutral, slightly against this. Whatever we do, let’s make sure it won’t harm the browsing/reading experience, search (!) and also Google (I say Google but I mean web search engine crawlers in general). We don’t want to be losing rank there. Perhaps disallow search engines on these topics, I don’t know.
I think @ekohl makes an excellent point - yes, but not right now. Given the mixed result of the poll, I’ll come back to this in a few months, after we’ve seen what impact the recent changes can have on the build failures
Based on some discussions amongst the various developers working on releases, I am reviving this idea and proposing it as an experiment with a fixed re-assessment period and a limited set of jobs. The same proposed workflow and rules would apply for where failures are posted and how developers interact with the posts to communicate breakages and pending fixes. Builds are more stable than they were, and we believe this workflow will help keep them that way while showing a broader audience when there is a failure, what types of failures occur (in case we need to take systematic tactics), and how they were resolved.
Try it out for 2 months
Include only the following jobs:
foreman-nightly-release
foreman-plugins-release
katello-nightly-release
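As a rough sketch, the job limitation above could be expressed as a simple allowlist check that runs before anything is posted. The job names come from the list above; the function name and the use of the Jenkins result string `FAILURE` as the trigger are assumptions for illustration only.

```python
# Jobs included in the pilot, taken from the list above.
PILOT_JOBS = frozenset({
    "foreman-nightly-release",
    "foreman-plugins-release",
    "katello-nightly-release",
})


def should_post(job_name, result):
    """Return True only for failed builds of jobs that are part of the pilot."""
    return job_name in PILOT_JOBS and result == "FAILURE"
```

Bringing another job on board later would then be a one-line change to `PILOT_JOBS`.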
Other jobs may be proposed or brought on board if they are in a similar critical path as the ones listed above. I’m adding another poll to see what folks think since the last one.
I voted no, let me explain. I am against testing only Katello and Foreman; I want all (or the top 10) plugins to be in the pilot as well. Because today, every time I ask for a Jenkins configuration change, the answer is “we would need to do this for ALL plugins, it is not possible”. There are some technical limitations, and plugin jobs are not on par with core and Katello, and that is not fair.
Here is my concern: if this turns out to be a good experience, I can imagine I will not be allowed to enable discovery because “that would create too much noise”. Let’s test this properly with everything, and let’s solve all the challenges during the pilot (like deleting old posts or something like that).
I’m not against the inclusion of plugins assuming in part that plugin maintainers are up for watching the topic and responding. I would like in part to point out that the current focus is on release jobs with no bias towards any particular core or otherwise. At present, Foreman, plugins and Katello are the only projects that have release jobs (and built nightly).
There are some plans in the works to add more projects being built on a nightly basis. And further, if we can design an individual plugin nightly release design then we can begin to incorporate those into this.
I voted yes because we have a clear place to do root cause analysis.
This one is still very unstable. In about half of the runs something fails because repoclosure is run in parallel, which it doesn’t properly support (1593331 – Concurrent use of repoclosure breaks).
Given the approval to try this out I’d like to move forward with it. @Gwmngilfen looking for some help here to set up the Discourse category properly and to figure out the best way to send this data to it. Which tactic do we need to take:
Send email to discourse from Jenkins
Build CI functionality to hit discourse API to make a post
I am assuming we will also need a user account for Jenkins to send as.
Not sure why Discourse would send email to Jenkins?
Other than that, yes, that’s about it. The list is:
Create the category (I guess as a subcategory of Development?)
Assign it an incoming address (ci@community.theforeman.org?)
Create an account for Jenkins
Figure out how to post via the API
The last one could be done in parallel with an existing account, I guess.
I don’t think Discourse has OAuth2 tokens (could be wrong) so we’ll want to be careful with the Jenkins account password, of course.
Yeah, stupid me - I was thinking that at step 2, and then got diverted to thinking about the API at step 4
Both are viable. The email approach is easier, but of course anyone could technically mail the inbound address for the category. That’s not been the case for the dev and support categories though.
I suggest we go that way and look at the API if email gets abused for some reason
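To make the email route concrete, here is a minimal sketch of the message Jenkins could send to the category's incoming address. It assumes Discourse turns the subject into the topic title and the body into the first post, and that the incoming address is the one floated above. The sender address and function name are hypothetical.

```python
from email.message import EmailMessage


def build_failure_email(sender, category_address, job_name, build_number, build_url):
    """Build the email that would become a new topic in the category."""
    msg = EmailMessage()
    msg["From"] = sender            # hypothetical Jenkins account address
    msg["To"] = category_address    # incoming address assigned to the category
    msg["Subject"] = f"{job_name} #{build_number} failed"
    msg.set_content(
        f"Build {build_number} of {job_name} failed.\n\n"
        f"Console output: {build_url}\n"
    )
    return msg
```

Actually sending it would be something like `smtplib.SMTP("localhost").send_message(msg)`, or simply the existing Jenkins email notification pointed at the category address, so no new code may be needed on the Jenkins side at all.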