Jenkins master has been compromised

Hello,

Earlier today, at around 9:00 UTC, we noticed that Jenkins was not responding.
@mmoll logged into the server to restart it, noticed some suspicious processes running, and notified me.
After some investigation, we discovered that around 8:00 UTC an attacker gained access to the jenkins user account on the server using a remote code execution exploit in one of our outdated Jenkins plugins. The attacker used the exploit to install crypto-mining malware on the server. The malware exhausted the server's resources, causing the Jenkins outage.
By 9:29 UTC the malware had been terminated, and continued monitoring indicated no further attack attempts.

At 11:45 UTC we had a sync-up meeting of the @infra team and decided to take precautionary action in case the attacker had managed to cause any further damage or compromise secrets held on the server. We decided to decommission the existing server and stand up a new server for the Jenkins master with all plugins updated to the latest versions, and to upgrade the underlying OS to CentOS 7 as well.
As a further precaution, we have revoked and updated various secrets held on the server.

We are currently in the process of bringing the worker nodes back up and restoring various previous configurations on the new Jenkins master. We hope that sometime tomorrow most of the previous functionality will be restored.
Due to the various plugin upgrades, some pipelines may now fail to work correctly, and we will continue to fix these issues as we discover them. Expect some instability in the coming days as we work through them.

If you have any further questions on this matter, feel free to reach out to me.

I would like to personally thank @mmoll, @ekohl and @evgeni for all of the hard work they’ve put into this effort today, and apologize to all developers whose workflow has been disrupted by this unplanned outage.

Did we manage to salvage some of the mined crypto? XD

Anyway, great work guys!

Thanks for the report. Great work, this is how security incidents should be handled.

I think that CI systems will always be a good target for such people and I wonder if it makes sense to limit access via originating IP addresses on the HTTP server level for the admin area (or everywhere except “view report or view result” action if that’s feasible).
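
To make the idea concrete, here is a minimal sketch of the allowlist logic in Python. In practice this check would live in the web server configuration in front of Jenkins rather than in application code; the sketch only illustrates the rule. The network ranges are documentation placeholders, and the path prefixes are my assumption about what counts as the "admin area", not a verified list for our setup.

```python
import ipaddress

# Hypothetical trusted ranges (documentation addresses); replace with real ones.
TRUSTED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]

# Assumed admin-area path prefixes; adjust to the paths that actually need protection.
RESTRICTED_PREFIXES = ("/manage", "/script", "/configure")


def is_request_allowed(path: str, remote_addr: str) -> bool:
    """Allow read-only pages for everyone, restricted paths only from trusted networks."""
    if not path.startswith(RESTRICTED_PREFIXES):
        return True  # e.g. viewing a report or a build result
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        return False  # unparseable client address: deny by default
    return any(addr in net for net in TRUSTED_NETWORKS)
```

With these placeholder values, a request for a build console page is allowed from anywhere, while a request to a restricted path is allowed only from one of the trusted ranges.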

We’re slowly adding functionality again. PR testing should be enabled again. I’m not going to re-deliver failed webhooks manually. Using [test x] should work to retrigger a run, as should pushing.

Things may still be broken. Let us know if you see anything weird, even if you’re unsure. I’d rather have some false positives than false negatives.

I also haven’t re-added all nodes yet, but the current ones should have sufficient compute for basic operation.