Seeking guidance on multi-step "workflows"

This email isn't so much about specific technical details; it's more to
explain what we're doing currently and to ask how best to direct our
research toward doing "the equivalent or better" in Foreman.

I wanted to see how to approach doing workflows in Foreman (beyond
discovery and OS install). Our process is that we get bare metal racked in
multiple datacenters in one buildout. Generally, we assign a server to a
rack position by serial number, and that also tells us which switch ports
it will be connected to. For a good portion of our hardware, we get serial
numbers and not MAC addresses. So we have to identify the hardware, apply
BIOS updates, BIOS configuration, and RAID hardware configuration. We also
test the hardware. Installing the OS is pretty much the last thing we do.

THIS IS WHERE I NEED THE MOST HELP:
I've been looking at Foreman hooks, Dynflow integration, and the idea of
standing up a separate server such as StackStorm to essentially drive
Foreman. I can't decide what would be the best path forward. Essentially,
after each step in our workflow, it seems I would have to change what the
node will PXE boot to next. This might be doable with hooks, but they might
not be up to it. Dynflow looks advanced, but it seems like you would be
writing a full program for each step. We have some experience with
StackStorm. I could see each image I boot into sending an event to
StackStorm, which then makes API calls (or hammer commands) to change
Foreman's behavior in relation to the node (rough sketch below).
END
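
To make the StackStorm idea concrete, here's a rough sketch (Python) of the
kind of action I'm imagining. The URL, credentials, and the idea of mapping
workflow steps to hostgroups are all my invention; the only Foreman-specific
part is the documented PUT /api/hosts/:id call:

    # Sketch of a StackStorm action that advances a node to its next
    # workflow step by updating the host in Foreman over the REST API.
    # The URL, credentials, and step->hostgroup mapping are made up.
    import requests

    FOREMAN = "https://foreman.example.com"
    AUTH = ("api-user", "api-password")

    # Hypothetical mapping from workflow step to the hostgroup whose PXE
    # template boots the right image for that step.
    STEP_HOSTGROUPS = {"cable_validation": 12, "disk_stress": 13, "os_install": 14}

    def advance(fqdn, next_step):
        # Move the host into the next step's hostgroup and re-enable build
        # mode so Foreman regenerates its PXE configuration.
        r = requests.put(
            f"{FOREMAN}/api/hosts/{fqdn}",
            json={"host": {"hostgroup_id": STEP_HOSTGROUPS[next_step],
                           "build": True}},
            auth=AUTH,
        )
        r.raise_for_status()

The hammer equivalent would presumably be something like
hammer host update --name <fqdn> --hostgroup-id <id> --build true.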

Below is the list of steps we take when configuring a newly racked server
from Dell. We do similar processes for other gear and other OSes, but I
wanted to pull one out as an example. We used to do this with a
provisioning system we built in house, and last year we started doing this
with the open source RackHD system.

What we call a "microkernel" is what Foreman calls a discovery image:
basically a small Debian PXE boot image. We make a number of custom
"overlays" for it with additional packages, so I think we would be fine
making customized Foreman discovery images with any additional packages we
need.

A second thing we do is pull down templates from the provisioning server.
These could be wrapper scripts or configuration files. It looks like
Foreman has support for this as well, so I'm not too worried about that
(sketch of what I mean below).
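
For comparison, our images fetch those templates with plain HTTP today. My
understanding (please correct me) is that a host in build mode can pull its
rendered templates from Foreman's unattended endpoints in much the same way.
A minimal sketch; the token handling is my assumption:

    # Sketch of a booted image pulling a rendered template from Foreman's
    # unattended endpoint. "provision" and "script" are standard template
    # kinds; the token ties the request to a specific host record.
    import requests

    FOREMAN = "http://foreman.example.com"

    def fetch_template(kind, token):
        r = requests.get(f"{FOREMAN}/unattended/{kind}",
                         params={"token": token})
        r.raise_for_status()
        return r.text

    script = fetch_template("script", "0f5ca55b-example")  # hypothetical token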

Discovery Level

  1. bootstrap-ubuntu: works just like Foreman discovery (interrogates
    hardware and network and submits the results back to the system for
    storage)
  2. "tacoma node identify": makes an HTTP JSON API call with the node serial
    number to a microservice and gets static networking information back
  3. configure IPMI with the correct IP address and credentials (ipmitool
    commands run on the discovery image; steps 2 and 3 are sketched after
    this list)
  4. re-read the IPMI configuration "facts" and update the provisioner
  5. validate IPMI connectivity
  6. "gertie enrich": (takes place on the server) makes OMAPI calls to
    isc-dhcp-server to assign the node-specific IP to the newly discovered
    MAC address. Foreman seems to have this built in.
  7. shell-reboot: the node is rebooted so it will get the correct IP address
    from the DHCP server.
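
To make steps 2 and 3 concrete, this is roughly what runs on the discovery
image today. The microservice URL and response fields are from our in-house
service, and the ipmitool channel and user IDs are Dell-typical but
hardware-dependent:

    # Rough sketch of "tacoma node identify" plus the IPMI configuration
    # step as run on the discovery image. The microservice URL and its
    # response fields are our in-house conventions, nothing standard.
    import subprocess
    import requests

    def identify(serial):
        # Look up static networking for this chassis by serial number.
        r = requests.get(f"http://tacoma.example.com/api/v1/nodes/{serial}")
        r.raise_for_status()
        return r.json()  # e.g. {"ipmi_ip": ..., "netmask": ..., "gateway": ...}

    def configure_ipmi(net, password):
        # LAN channel 1 is typical for Dell iDRAC but varies by platform.
        for args in (
            ["lan", "set", "1", "ipsrc", "static"],
            ["lan", "set", "1", "ipaddr", net["ipmi_ip"]],
            ["lan", "set", "1", "netmask", net["netmask"]],
            ["lan", "set", "1", "defgw", "ipaddr", net["gateway"]],
            # User ID 2 is usually the default admin account on Dell BMCs.
            ["user", "set", "password", "2", password],
        ):
            subprocess.run(["ipmitool"] + args, check=True)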

Testing and Configuration:

  1. update firmware (run racadm commands)
  2. reboot to activate the updated firmware
  3. bootstrap-ubuntu (in Foreman terms, PXE boot to discovery and await
    more commands)
  4. run cable_validation.py
  5. run hwtest.py --disks (ensure the expected disk sizes are there and
    read the first few GB of raw data on each)
  6. run hwtest.py --cpu (stress test the CPU)
  7. configure BIOS settings, things like performance profiles and boot
    order (on Dell, we run racadm) -> usually reboot after this step
  8. perform a drive stress test - typically a 72 hour run of writing data
    to random locations and reading it back (sketched below) -> reboot at
    the end
  9. configure drives into RAID, if applicable (i.e., run perccli commands),
    then reboot
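
For reference, the drive stress step (8 above) boils down to something like
this; the real script adds durations, per-drive concurrency, and logging:

    # Stripped-down core of the drive stress test: write blocks to random
    # offsets on a raw device and read them back. Destructive by design;
    # only run against drives that haven't been provisioned yet.
    import os
    import random

    BLOCK = 1024 * 1024  # 1 MiB

    def stress(device, iterations):
        fd = os.open(device, os.O_RDWR | os.O_SYNC)
        try:
            size = os.lseek(fd, 0, os.SEEK_END)
            for _ in range(iterations):
                offset = random.randrange(0, size - BLOCK)
                data = os.urandom(BLOCK)
                os.pwrite(fd, data, offset)
                if os.pread(fd, BLOCK, offset) != data:
                    raise RuntimeError(f"miscompare on {device} at {offset}")
        finally:
            os.close(fd)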

OS install

  1. Install desired OS
  2. post-install: we use ansible to log in and handle any final steps (set
    up bonding interface files, configure the puppet agent or salt minion,
    add or disable users, reconfigure sshd, etc.)
  3. final reboot

Notes

  • cable_validation.py runs lldpcli show neighbors to ensure the expected
    network cables are connected to the right switch ports (sketched below)
  • hwtest.py is a wrapper script that can run memtester or stress to test
    memory/CPU and send logs back to a syslog server (sketched below)
  • disks_present.py reads /proc to validate that drive count and sizes
    match our database (sketched below)