Known_hosts and Foreman Ansible

:wave: all,

Currently in Foreman Ansible, we are facing an issue when running Ansible against a host for the first time.
The host could have been created by Foreman, or it could be an already existing host we have imported.

Since it’s the first time the foreman or foreman-proxy user tries to SSH to the host, the first run fails with the familiar “Host key verification failed” error. To work around it, you currently have to open an SSH connection from the foreman-proxy machine to the host on a console and accept the key so that it gets added to ~/.ssh/known_hosts.

This is most likely a problem shared with Foreman Remote Execution too. Here are some possible solutions. Any more ideas? Preferences?

  • If the host has been created by Foreman, add its key to ~/.ssh/known_hosts. This is error-prone and problematic for proxies (what if the proxy is down?).
  • Disable host key checking entirely, as http://docs.ansible.com/ansible/latest/user_guide/intro_getting_started.html#host-key-checking describes. This opens an attack vector for man-in-the-middle attacks.
  • Disable key verification only on the first Ansible run by passing ANSIBLE_HOST_KEY_CHECKING=False (see the sketch below); from the second run onwards we will get a warning if the host key has changed.
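A minimal sketch of the third option, assuming a hypothetical helper on the proxy side that shells out to ansible-playbook (the method name and arguments are made up; ANSIBLE_HOST_KEY_CHECKING itself is a standard Ansible setting):

require 'open3'

# Run ansible-playbook, skipping host key verification only for a host's
# very first run against this proxy.
def run_playbook(playbook, inventory, first_run: false)
  env = {}
  env['ANSIBLE_HOST_KEY_CHECKING'] = 'False' if first_run

  stdout, stderr, status = Open3.capture3(env, 'ansible-playbook', '-i', inventory, playbook)
  raise "ansible-playbook failed: #{stderr}" unless status.success?
  stdout
end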

cc @Marek_Hulan @Bernhard_Suttner @bastilian

This is a hard problem to solve for everyone. In some organizations you can already rely on the host key being present in /etc/ssh/ssh_known_hosts (puppet-ssh could easily do this with exported resources). Disabling host key checking there lowers security for no usability benefit.

As you say, managing ~foreman-proxy/.ssh/known_hosts is also tricky and might not be workable.

Given it’s a warning, there’s a huge chance admins won’t read it. I know plenty of people who are as lazy as I am and just see that their “install package x” playbook worked, so they’re happy. When using password-based authentication it’s also already too late if you really were MITM’ed.

There’s no silver bullet. I think we should default to a secure installation even if that’s less convenient. What you could do is provide an easy way to add known hosts and document it.

I think this is the best option right now: ignore the error on the initial connection and store the key, so later connections are verified. I wouldn’t warn but fail hard in case the host key changes later. REX works this way today AFAIK and has a setting to disable host key checking entirely (off by default IIRC).

Another thought: we could use host facts, which usually contain the host’s public key, and store that in the known_hosts file. That only addresses use cases where you already had e.g. a Puppet agent configured.
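A rough sketch of what that could look like; the fact name in the usage example and the known_hosts path are assumptions, since fact names differ between Puppet and Ansible:

require 'fileutils'

# Append a host key taken from facts to the proxy user's known_hosts,
# skipping entries that are already present. Paths are illustrative only.
def add_known_host(hostname, key_type, public_key,
                   known_hosts: File.expand_path('~/.ssh/known_hosts'))
  FileUtils.mkdir_p(File.dirname(known_hosts))
  entry = "#{hostname} #{key_type} #{public_key}"
  existing = File.exist?(known_hosts) ? File.readlines(known_hosts, chomp: true) : []
  File.open(known_hosts, 'a') { |f| f.puts(entry) } unless existing.include?(entry)
end

# e.g. add_known_host('client.example.com', 'ssh-ed25519', facts['sshed25519key'])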

A second (and much more complicated) option would be using SSH certificates. The Smart Proxy or Foreman would become an SSH CA and we’d generate and deploy the host key as part of provisioning, similarly to how we deploy authorized_keys. The main difference is that we’d need to handle the private key transfer, but that could be achieved over HTTPS with the host token used for host authentication. The downside is that it only solves the problem for hosts provisioned from Foreman. Also, the only benefit compared to using regular SSH keys is that you don’t have to install public keys on all smart proxies, as they would trust signed host keys.
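For illustration, the signing itself can be done with plain ssh-keygen; a rough sketch of the proxy-side calls (paths and method names are made up):

require 'open3'

# Sign a host public key with the CA key. For foo.pub this writes
# foo-cert.pub next to the public key.
def sign_host_key(ca_key_path, host_pubkey_path, hostname)
  _out, err, status = Open3.capture3(
    'ssh-keygen', '-s', ca_key_path, '-I', hostname, '-h', '-n', hostname, host_pubkey_path
  )
  raise "signing host key for #{hostname} failed: #{err}" unless status.success?
end

# A single line like this in known_hosts makes clients trust every host key
# signed by the CA, instead of needing one entry per host.
def cert_authority_line(ca_pubkey_path)
  "@cert-authority * #{File.read(ca_pubkey_path).strip}"
end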

So for now I lean towards what you suggested as the third option: just change the warning to a hard failure.

There is StrictHostKeyChecking=accept-new, which is the behavior you’re describing (and IMHO a sane default for us). The default is ask, and setting it to no (or off) always allows connections to hosts with changed keys (but does print a warning). The problem is that accept-new was only introduced recently (OpenSSH 7.6) and doesn’t ship in Debian Stretch or EL7, which makes it unusable for our users.
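Where the OpenSSH client is new enough, this could be passed through to Ansible via its ssh_args setting; a minimal sketch, with the version check left to the caller as an assumption:

# Build extra environment for an ansible-playbook run so ssh silently accepts
# keys of previously unknown hosts but still fails hard if a known key changed.
def ansible_env(openssh_supports_accept_new)
  return {} unless openssh_supports_accept_new
  # Note: ANSIBLE_SSH_ARGS replaces Ansible's default ssh arguments, so a real
  # setup would also need to keep the defaults (ControlMaster etc.).
  { 'ANSIBLE_SSH_ARGS' => '-o StrictHostKeyChecking=accept-new' }
end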

In the long run we will need to handle and manage SSH better and allow a better flow for setting up (or just adding) a host with Ansible as config management from within Foreman. The SSH certificates Marek mentioned could be used as a base for that, but they don’t seem to solve the known hosts issue.

Another solution that might work would be to actually verify the host. ssh-keyscan offers a way to do that. It could be run with the executing user before a (first) job on a host; it would add the host’s public key to known_hosts, and Ansible should then (hopefully) not ask about adding it. This is an untested idea. :slight_smile:
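Equally untested, but roughly what such a pre-job scan could look like on the proxy, shelling out to ssh-keyscan and appending to the executing user’s known_hosts (names and paths are illustrative):

require 'open3'

# Scan a host's public keys and append any that are not yet known.
def keyscan_into_known_hosts(hostname, known_hosts = File.expand_path('~/.ssh/known_hosts'))
  out, err, status = Open3.capture3('ssh-keyscan', '-T', '5', hostname)
  raise "ssh-keyscan #{hostname} failed: #{err}" unless status.success?

  known = File.exist?(known_hosts) ? File.readlines(known_hosts, chomp: true) : []
  new_entries = out.lines(chomp: true).reject { |line| line.empty? || known.include?(line) }
  File.open(known_hosts, 'a') { |f| new_entries.each { |e| f.puts(e) } } unless new_entries.empty?
end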

We could also add an option to enable or disable the verification step, but that might make for weird UX.

As I recall, SSH certificates can be used to replace (or extend) the authorized_keys mechanism as well as known_hosts.
So I am tempted to say it would actually solve the problem.

I still think that Foreman should not have an opinion on how to handle known hosts. Either we choose a way that is too strict and doesn’t work for small deployments, or one that is too insecure for large (security-conscious) deployments. This is different for Puppet, which has its own infrastructure, but requiring a certain SSH setup is too invasive.

Would you like to verify the key for a job invocation running on 1000 hosts? :slight_smile: I think we should not add interactivity.

Maybe foreman_bootdisk could be a way to insert a crypto token into newly provisioned hosts that can be used to authenticate the generated SSH host key later. The Foreman SSH key (or the SSH CA) could also be distributed that way.

That sounds like a great idea, although for newly provisioned hosts I don’t think it’s really an issue; we could blindly trust them at first.

For now, I want to do the following:

We skip host key verification on the first job invocation. After that, the host key is checked against ~/.ssh/known_hosts and jobs will rightfully fail if the host cannot be verified against the saved key.

To do so, I think the best strategy would be to add a new field to the host model called ‘ssh_verified’, which defaults to false.

When we run a job invocation, we check whether ‘ssh_verified’ is true or false. If it’s false, we send IGNORE_SSH_VERIFICATION; if the job then returns with a 0 exit code, we set ‘host.ssh_verified’ to true. If it’s true, we don’t send anything and the job invocation runs as usual.
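Roughly what I have in mind for the orchestration side; a sketch only, where the field, the predicate method and the IGNORE_SSH_VERIFICATION flag are the proposal above, not existing code:

# Decide per host whether to ask the proxy to skip host key verification.
def ssh_options_for(host)
  # First contact: skip host key verification exactly once.
  host.ssh_verified? ? {} : { 'IGNORE_SSH_VERIFICATION' => true }
end

# Called when the job finishes with the host's exit code.
def after_job_finished(host, exit_code)
  # Only mark the host as verified when the unverified first run succeeded,
  # so the key recorded during that run is trusted from then on.
  host.update!(ssh_verified: true) if exit_code.zero? && !host.ssh_verified?
end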

Upon host rebuild, we could set ‘ssh_verified’ back to false. However, this might not be a good idea, as it gives an attacker the opportunity to 1. click Build, 2. click Cancel build, after which ‘ssh_verified’ is still false and the SSH verification is skipped. I have thought about offering this as an option (or an option to disable SSH verification entirely).

We could also show a check button on the Ansible tab of the Host form.


Note that I suggest using a database field because I’ve checked, using code similar to the following,

result = []
# Collect template invocations whose (live) output mentions host key verification.
# Reading live_output appears to reach out to the smart proxy that ran the job,
# which is what makes this approach slow (see below).
TemplateInvocation.all.each do |ti|
  next unless ti.run_host_job_task.present?
  next unless ti.run_host_job_task.main_action.live_output.to_s.match?(/verification/)
  result << ti
end

and it results in a ton of calls to the proxies that were used in the job invocations.

@iNecas @aruzicka do you agree with just using a field for ‘ssh_verified’? I would rather not have it, but if the alternative is making a ton of API calls to the proxies (not sure what is doing it, but run_host_job_task.main_action.live_output is a prime suspect; .output is just ‘’), I would rather have it for simplicity and speed.

I think there is a name for this: TOFU (trust on first use).
So as far as I see it, we need a (possibly opt-out-able) TOFU policy on the smart proxy that can be reset during the provisioning process (not when you hit Build, but when the host actually reboots into its rebuild).
And to make matters worse, this mechanism should be shared with the REX infrastructure.

For the record, here’s a possible implementation of TOFU: https://github.com/theforeman/smart_proxy_ansible/pull/8

It’s completely transparent to the user and requires no changes.

:+1: Nice, I like where this is going! Maybe by storing the results elsewhere in Foreman, so the scan is only done once per host (and maybe even before a job invocation), we can avoid any bottlenecks.