RFC: Moving off of Rackspace Infrastructure

As Rackspace has ended its support for open source projects, we would like to move everything off of Rackspace and onto new providers. See below for a detailed layout of the current Rackspace infrastructure.

If you have questions or concerns please raise them here.

Proposal

There are three main components to the proposal:

  • Moving DNS to a centrally managed provider
  • Keeping secure infrastructure on a single provider
  • Maintaining the current level of Jenkins nodes

DNS

Gandi provides open source support and centralized group management. Accounts will be set up there and DNS transferred from Ohad to Gandi. Anyone within the infrastructure team who wants access may create an account and be added to the group.

Jenkins Nodes

The non-security-focused Jenkins nodes (slave01 and slave05 on Rackspace) will be shut down and replaced by AWS Jenkins nodes that are copies of the two AWS nodes we have today.
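
For illustration, a minimal sketch of how one of those replacement nodes could be launched from an image of an existing AWS node, using the aws-sdk-ec2 gem; the region, AMI ID, instance type and node name below are placeholders, not our actual values:

  require 'aws-sdk-ec2'

  # Launch a copy of an existing Jenkins node from a snapshot AMI.
  ec2 = Aws::EC2::Client.new(region: 'us-east-1')        # placeholder region
  resp = ec2.run_instances(
    image_id: 'ami-0123456789abcdef0',                   # placeholder AMI of an existing node
    instance_type: 't3.large',                           # placeholder size
    min_count: 1,
    max_count: 1,
    tag_specifications: [{
      resource_type: 'instance',
      tags: [{ key: 'Name', value: 'node03.jenkins.aws.theforeman.org' }]
    }]
  )
  puts resp.instances.map(&:instance_id)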

Main Infrastructure

The main and security-focused infrastructure is:

  • Foreman/Puppet master
  • Jenkins master
  • Webserver
  • Debian Jenkins node
  • slave02, which handles specific jobs that push into our infrastructure

These servers will be moved alongside the existing infrastructure running at the Oregon State University Open Source Lab (OSUOSL) on their OpenStack infrastructure. This will require a capacity increase in our footprint there, which they are happy to provide.

Action Items

  • Disable Jenkins nodes
    • Blocked On:
      • Marking new OSUOSL nodes to handle secret stuff
    • To move:
      • slave02
      • debian01 [DONE]
  • Action Items [ewoud]
    • Add any missing secrets
    • Label OSUOSL nodes in Jenkins to do secret stuff
  • Migrate Jenkins master
  • Migrate Webserver
  • Migrate Foreman/puppet

DONE

  • Action Items [ehelms]
  • Email Ohad about DNS access [Evgeni]
    • Account on Gandi created by Evgeni
    • Create an account if you want access
    • Blocked on:
      • Waiting on transfer which has been initiated
    • Next steps
      • Flip DNS over from GoDaddy to Gandi
  • OSUOSL [Ewoud]
    • Email them and ask for more resources [DONE]
    • Waiting on ticket resolution [DONE]
  • Archive stats box [Evgeni]
  • Create new Jenkins nodes in AWS [ehelms]
    • Patrick to spin up 2 nodes
    • Eric to configure the nodes
    • New AWS slaves online; slave01.rackspace and slave05.rackspace shut down and deleted

Current Rackspace Infrastructure

A big “thank you” for all the work done so far!

Agree, thank you!

Thanks to the whole infra team and beyond to anyone who is involved.

Updated action items today and built out action items for transferring Jenkins master and final Jenkins nodes.

As part of these updates, we are moving to use the term node for our Jenkins workers. I have updated the OSUOSL and AWS nodes to the following format:

node0X.jenkins.<provider>.theforeman.org

node0X.jenkins.aws.theforeman.org
node0X.jenkins.osuosl.theforeman.org

I am less sure about renaming the Scaleway and NETWAYS ones properly. The debian01 and slave02 left on Rackspace will get updated as part of their move to OSUOSL.

I’ve spun up a node05 at OSUOSL that will replace slave02.rackspace once we figure out secrets handling and label the nodes properly.

Next up is creating a deb-node01.jenkins.osuosl.theforeman.org to replace the debian01 in Rackspace. I looked into this within OSUOSL but could not find an Ubuntu 18.04 image to match the current debian01. @mmoll what approach should we take here?

@ehelms Debian 10.3 should also be fine.

Last night theforeman.org was transferred from GoDaddy to Gandi. It looks like various DNS recursors have cached the old NS records (the TTL is 1 day in the .org zone) and GoDaddy has stopped responding to DNS requests, which is causing some instability. I expect this to stabilize during the day as TTLs expire.

I also took the opportunity to create DNS records for all hosts we manage.
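
For anyone who wants to check whether a given recursor has picked up the new delegation yet, here is a minimal sketch using Ruby’s stdlib Resolv; the recursor address is just an example:

  require 'resolv'

  # Ask a specific recursor which NS records it currently returns for the zone.
  resolver = Resolv::DNS.new(nameserver: ['9.9.9.9'])   # example public recursor
  resolver.getresources('theforeman.org', Resolv::DNS::Resource::IN::NS).each do |ns|
    puts ns.name   # should list the Gandi nameservers once the cached records expire
  end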

Looks like Katello VCR tests are now failing with:

An HTTP request has been made that VCR does not know how to handle:
  POST https://node03.jenkins.osuosl.theforeman.org/pulp/api/v3/distributions/container/container/

That sounds like a broken test. You should never use real domains or the actual FQDN in tests, especially with VCR.

There is now a new Debian node in OSUOSL: deb-node01.jenkins.osuosl.theforeman.org

I have run the Puppet agent on it and hooked it in. I also updated the labels. @mmoll could you take a look and see if it’s set up correctly and will handle what debian01 does today?

None of our tests are hardcoded to expect a hostname. VCR generates requests with the current hostname in them, but does not consider the hostname when looking for a matching cassette, so it’s likely unrelated to the hostname change. Will look into it.
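
For context, host-agnostic matching can be configured in the VCR gem roughly like this (a sketch, not our exact test setup; the cassette directory is a placeholder):

  require 'vcr'

  VCR.configure do |c|
    c.cassette_library_dir = 'test/cassettes'   # placeholder path
    c.hook_into :webmock
    # Match recorded interactions on HTTP method and path only, so the
    # node's FQDN in the request URL is ignored when replaying cassettes.
    c.default_cassette_options = { match_requests_on: [:method, :path] }
  end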

@tbrisker I couldn’t find the failing test run you mentioned, do you mind linking to it?

I should have linked it when I saw it; I can’t find it either now. Guess it was just a fluke.

I removed the ipv6 label, as we only have legacy IP at OSUOSL, and disabled the old Debian slave in Jenkins, so we’ll see if anything trips up.

I’ve updated the action items to reflect where I think we are. The next big steps are configuring all of the OSUOSL jobs to handle the types of work slave02 did, testing the new Debian node, and then migrating the 3 big servers: Jenkins, the webserver and Foreman/Puppet.

I have asked about IPv6 and they are looking into it.

Deleted debian01.rackspace everywhere.

Can we wait with the Jenkins and webserver migration until after 2.0.0 is released, please (hopefully later today)? The release has already been delayed quite a bit by various issues, and I don’t want to delay it further because we aren’t able to release while servers are being migrated.

Additionally, I think we should set up some monitoring of how many slots are actually being used by Jenkins. We might be assigning too many or too few resources to it, and it would be good to plan according to the capacity that we actually need. If we could get some insight into “special” node (e.g. debian, arm, ssh…) usage, it might also help to see where our bottlenecks are.
Some operations (e.g. a major release) might need additional resources. If it were easy to scale up or down, we might be able to use short-lived instances for these cases without having idle machines wasting resources just so we have capacity for these peaks.
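
As a starting point, something like this could poll the Jenkins JSON API for per-node executor usage (a rough sketch; the Jenkins URL is assumed and a real setup would need authentication):

  require 'net/http'
  require 'json'
  require 'uri'

  # Assumed Jenkins URL; the /computer JSON API exposes per-node executor state.
  uri = URI('https://ci.theforeman.org/computer/api/json' \
            '?tree=computer[displayName,numExecutors,idle,offline]')
  data = JSON.parse(Net::HTTP.get(uri))

  nodes = data['computer']
  busy  = nodes.reject { |c| c['idle'] || c['offline'] }
  puts "#{busy.size} of #{nodes.size} nodes currently busy"

  nodes.each do |c|
    state = c['offline'] ? 'offline' : (c['idle'] ? 'idle' : 'busy')
    puts "#{c['displayName']}: #{c['numExecutors']} executors, #{state}"
  end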