Issues changing a host's NIC to a subnet with a DHCP smart proxy

I'm running a Smart Proxy, which is managing DHCP/TFTP for 125 subnets.
I'm trying to figure out what would cause the DHCP record creation to fail,
when switching a host from one subnet to another. When you switch a NIC's
subnet to one with a DHCP smart proxy, the DHCP record gets automatically
created. I was using this to migrate hosts from a generic 10/8 network
(with no DHCP or TFTP proxies configured) to more specific 10.x.x.x network
with a DHCP+TFTP smart proxy. We previously did not use Foreman for
provisioning, and no DHCP/TFTP proxies were defined for the generic 10/8
network.

This was working fine for roughly the first thousand hosts. I'd move the
host to the new subnet, and the corresponding DHCP record was created on
the DHCP server. However, suddenly I'm starting to see errors when I try
to switch hosts to the subnets with the smart proxy. This happens both
from the API, and from the Web UI.

Using the API (from a custom script), I see errors like this (I changed the
IP's and hostnames):

{u'error': {u'message': u'ERF12-4395 [ProxyAPI::ProxyException]: Unable to
retrieve DHCP entry for 10.77.136.211 ([RestClient::RequestTimeout]:
Request Timeout) for proxy https://foreman.example.org:8443/dhcp'}}

and

{u'error': {u'errors': {u'base': [u'Create DHCP Settings for myhost.example.org task failed with the following error: ERF12-6899 [ProxyAPI::ProxyException]: Unable to set DHCP entry
([RestClient::RequestTimeout]: Request Timeout) for proxy
https://foreman.example.org:8443/dhcp']}, u'id': 35660, u'full_messages': [u'Create DHCP Settings for myhost.example.org task failed with the following error: ERF12-6899 [ProxyAPI::ProxyException]: Unable to set DHCP entry
([RestClient::RequestTimeout]: Request Timeout) for proxy
https://foreman.example.org:8443/dhcp']}}

The first looks like an error retrieving information on the record, and the
second looks like an error updating the record.

From the Web UI, I'll see something like this:

Create DHCP Settings for myhostname.example.org task failed with the
following error: ERF12-6899 [ProxyAPI::ProxyException]: Unable to set DHCP
entry ([RestClient::RequestTimeout]: Request Timeout) for proxy https:
//smartproxy.example.org:8443/dhcp

More often from the Web UI though, what happens is I go in to edit the
host's primary interface, change the subnet, click "OK", then "submit", and
it goes back to the "Host" tab on the "edit" page (ie:
https://foreman.example.org/hosts/myhostname.example.com/edit). No error
message is displayed, but the changed subnet is not saved, and the DHCP
record is not created. In looking at the smart proxy log file, I see lots
of messages like this:

E, [2016-02-11T10:00:05.005071 #15360] ERROR – : Record
10.77.112.0/10.77.126.212 not found
[11/Feb/200:00:05] "GET /10.77.112.0/10.77.126.212 HTTP/1.1" 404 44 62.1812
[11/Feb/2016 10:01:27] "GET /serverName HTTP/1.1" 200 47 0.0006

But I'm not sure if this is a true error, because a successful change looks
like this (with an extra POST at the end):

E, [2016-02-11T10:40:56.269453 #8116] ERROR – : Record
10.12.5.0/10.12.5.102 not found
[11/Feb/2016 10:40:56] "GET /10.12.5.0/10.0.5.102 HTTP/1.1" 404 36 58.2472
[11/Feb/2016 10:41:56] "POST /10.12.5.0 HTTP/1.1" 200 - 59.7205

I'm wondering if it's failing on retrieval, like from the API call, and the
Web UI just doesn't give me an error in the second scenario.

I'm looking for ways to troubleshoot this. Is there anything specific that
could cause it? I wouldn't think I'd have hit any sort of limits, but I'm
not sure. Most confusing is why it would be working fine all week, and
suddenly it's having trouble creating records. The Foreman server itself
is a production server, and is heavily used, but the smart proxy doesn't
have much going on right now.

Some things I've tried:

  • Changing the subnet through the API
  • Changing the subnet through the WebIU
  • Verified DHCP server and foreman-proxy services were running on Smart
    Proxy server, and restarted both
  • Verified that some hosts still work (I've had a change fail, gone back
    and tried again, and then it works, but some hosts don't seem to work no
    matter how many times I try)
  • Tried to tie it back to a specific subnet. There are 12 subnets (out of
  1. where I've seen failures. It doesn't appear to be related to the
    subnet though.

I'm running Foreman 1.9.3.

Also, I'm wondering if someone can point me to the Foreman code that
handles the DHCP record creation via OMAPI? I'm thinking maybe I'll try to
create the DHCP record outside of Foreman, for testing purposes.

Kyle Flavin <kyle.flavin@gmail.com> writes:

> {u'error': {u'message': u'ERF12-4395 [ProxyAPI::ProxyException]:
> Unable to retrieve DHCP entry for 10.77.136.211
> ([RestClient::RequestTimeout]: Request Timeout) for proxy
> https://foreman.example.org:8443/dhcp'}}

we are getting the same errors for provisioning new hosts. our
dhcpd.leases file is about 140kb with 670 host entries.

this is on a 8 core vm and foreman-proxy constantly uses 1 cpu.
i think this also related to

http://projects.theforeman.org/issues/12425
Bug #12392: 100% cpu usage on foreman-proxy DHCP calls - Smart Proxy - Foreman and
Bug #12359: 100% cpu usage foreman-proxy after upgrade to 1.9 - Smart Proxy - Foreman

we are using the foreman as an enc and puppet runs keep failing because
every fact upload triggers a call to foreman which calls the
foreman-proxy.

we currently mitigated this issue by removing all entries from
dhcpd.leases…

this happend after an upgrade from 1.7 to 1.9.

jm2c
toni

You're right in thinking that 404 isn't the error - what you're seeing is
Foreman trying to check for conflicts. In other words, it's making the GET
to see if any other hosts have that IP reserved. The fact that it returns
404 is good, it means the IP is free, so then you see the POST claiming
it for this host.

That means you're not really seeing the real error. I'd suggest upping the
logs to debug and trying again - there should be a stacktrace then.

As for the code, the proxy handles all interactions with OMAPI - the whole
point of the proxy is to abstract away details like from Foreman. Foreman
could be talking to a MS DHCP server for all it knows/cares. You'll find
the api to the porxy at [1] but you'll have already guessed most of it from
the logs.

Hope it helps,
Greg

[1]

>
> Kyle Flavin <kyle....@gmail.com <javascript:>> writes:
>
> > {u'error': {u'message': u'ERF12-4395 [ProxyAPI::ProxyException]:
> > Unable to retrieve DHCP entry for 10.77.136.211
> > ([RestClient::RequestTimeout]: Request Timeout) for proxy
> > https://foreman.example.org:8443/dhcp'}}
>
> we are getting the same errors for provisioning new hosts. our
> dhcpd.leases file is about 140kb with 670 host entries.
>
> this is on a 8 core vm and foreman-proxy constantly uses 1 cpu.
> i think this also related to
>
> Bug #12425: Fact import triggers ip conflicts checks, which drives cpu utilization to 100% on smart-proxy - Foreman
> Bug #12392: 100% cpu usage on foreman-proxy DHCP calls - Smart Proxy - Foreman and
> Bug #12359: 100% cpu usage foreman-proxy after upgrade to 1.9 - Smart Proxy - Foreman
>
> we are using the foreman as an enc and puppet runs keep failing because
> every fact upload triggers a call to foreman which calls the
> foreman-proxy.
>

Have you tried setting ignore_puppet_facts_for_provisioning=True? It's
under the settings->provisioning tab.

I have this set to True following a 1.7.1->1.9.3 upgrade to prevent Foreman
from pulling in interface information from the host, which causes IP/Mac
conflicts with existing interfaces. We also use Foreman as an ENC to
manage puppet, and I haven't noticed puppet being slow.

··· On Friday, February 12, 2016 at 11:28:13 AM UTC-8, Toni Schmidbauer wrote:

we currently mitigated this issue by removing all entries from
dhcpd.leases…

this happend after an upgrade from 1.7 to 1.9.

jm2c
toni

Kyle Flavin <kyle.flavin@gmail.com> writes:
> Have you tried setting ignore_puppet_facts_for_provisioning=True? It's
> under the settings->provisioning tab.

thanks for the hint! i didn't know about this setting, it seems to
reduce the cpu load of foreman-proxy. but i need to do more testing to
confirm it fixes the problem.

> I have this set to True following a 1.7.1->1.9.3 upgrade to prevent
> Foreman from pulling in interface information from the host, which
> causes IP/Mac conflicts with existing interfaces. We also use Foreman
> as an ENC to manage puppet, and I haven't noticed puppet being slow.

we got timeout exceptions in the puppetserver log, because the enc
script (node.rb) would hang.

the real problem was provisioning new hosts, because foreman-proxy did
not respond in time, so provising failed. took us some time to find out,
because we are using cloudforms and trigger provisioning of new hosts
via the foreman api.

if it really fixes the problem for us i'm also going to update the bug
reports.

once again thanks for your help
toni

Thanks Greg, that's helpful. I'll see what else I can find.

Kyle

I enabled debug mode on my smart proxy, and attempted to add two records;
including the one that was failing from earlier. They both worked this
time - the DHCP record was created, and I didn't get the timeout error.
However, in looking at the debug output, I'm starting to wonder if the
number of subnets and records I have are causing an issue. They appear to
be loaded, and reloaded, from the configuration files, every time a record
is modified. Here's what I'm seeing:

Start of output. (Unrelated, but it looks like Foreman doesn't find my
settings file. The installation put mine under
/etc/foreman-proxy/settings.yml, so I'm not sure if something went wrong
there…)

W, [2016-02-11T14:14:04.054814 #19860] WARN – : Couldn't find settings
file /etc/foreman-proxy/settings.d/foreman_proxy.yml. Using default
settings.
I, [2016-02-11T14:14:04.055127 #19860] INFO – : 'foreman_proxy' settings
were initialized with default values: :enabled: true
I, [2016-02-11T14:14:04.062034 #19860] INFO – : 'facts' module is
disabled.
I, [2016-02-11T14:14:04.062433 #19860] INFO – : 'dns' module is disabled.
I, [2016-02-11T14:14:04.062698 #19860] INFO – : 'templates' module is
disabled.
I, [2016-02-11T14:14:04.070155 #19860] INFO – : 'dhcp' settings were
initialized with default values: :dhcp_provider: isc
I, [2016-02-11T14:14:04.077175 #19860] INFO – : 'puppetca' module is
disabled.
I, [2016-02-11T14:14:04.077934 #19860] INFO – : 'puppet' module is
disabled.
I, [2016-02-11T14:14:04.078212 #19860] INFO – : 'bmc' module is disabled.
I, [2016-02-11T14:14:04.078587 #19860] INFO – : 'realm' module is
disabled.

It looks like the proxy then parses the configuration files and adds all of
my subnets. There are 125 in total. I truncated the output - there are
hundreds of lines. It appears to go through the subnets each time a record
is updated (in my case every time I change the NIC's subnet):

D, [2016-02-11T14:14:36.459693 #19862] DEBUG – : Reading config file
/etc/dhcp/dhcpd.conf
D, [2016-02-11T14:14:36.460852 #19862] DEBUG – : Reading config file
/etc/dhcp/dhcpd.hosts
D, [2016-02-11T14:14:36.464270 #19862] DEBUG – : Reading config file
/var/lib/dhcpd/dhcpd.leases
D, [2016-02-11T14:14:36.506482 #19862] DEBUG – : Loading subnets for
127.0.0.1
D, [2016-02-11T14:14:36.506808 #19862] DEBUG – : Added
10.2.71.0/255.255.255.0 to 127.0.0.1
D, [2016-02-11T14:14:36.507136 #19862] DEBUG – : Added
10.2.1.0/255.255.255.252 to 127.0.0.1
D, [2016-02-11T14:14:36.507355 #19862] DEBUG – : Added
10.2.1.4/255.255.255.252 to 127.0.0.1

After adding all the subnets, it enumerates the hosts on each subnet. For
some of the subnets, I don't see any hosts added. It just lists the
subnetwork itself:

D, [2016-02-11T14:14:36.564906 #19862] DEBUG – : Loading subnet data for
10.2.7.0/255.255.255.0
I, [2016-02-11T14:14:37.060426 #19862] INFO – : Enumerated hosts on
10.2.7.0
D, [2016-02-11T14:14:37.060682 #19862] DEBUG – : Lazy loaded
10.2.7.0/255.255.255.0 records
D, [2016-02-11T14:14:37.060828 #19862] DEBUG – : Loading subnet data for
10.2.0.0/255.255.255.252
I, [2016-02-11T14:14:37.530459 #19862] INFO – : Enumerated hosts on
10.2.0.0
D, [2016-02-11T14:14:37.530722 #19862] DEBUG – : Lazy loaded
10.2.0.0/255.255.255.252 records
D, [2016-02-11T14:14:37.530882 #19862] DEBUG – : Loading subnet data for
10.2.0.4/255.255.255.252
I, [2016-02-11T14:14:38.002359 #19862] INFO – : Enumerated hosts on
10.2.0.4
D, [2016-02-11T14:14:38.002602 #19862] DEBUG – : Lazy loaded
10.2.0.4/255.255.255.252 records

…while some subnets are following by an "add" of all host records from
the subnet (not sure why some subnets show this output while others do not):

D, [2016-02-11T14:15:06.824623 #19862] DEBUG – : Loading subnet data for
10.2.1.0/255.255.255.0
D, [2016-02-11T14:15:06.832443 #19862] DEBUG – : Added
mail1.qai.expertcity.com (10.2.1.21 / …) to 10.2.1.0/255.255.255.0
D, [2016-02-11T14:15:06.833653 #19862] DEBUG – : Added
mail2.qai.expertcity.com (10.2.1.61 / …) to 10.2.1.0/255.255.255.0
D, [2016-02-11T14:15:06.835702 #19862] DEBUG – : Added
auth1.qai.expertcity.com (10.2.1.30 / …) to 10.2.1.0/255.255.255.0

I, [2016-02-11T14:15:07.362160 #19862] INFO – : Enumerated hosts on
10.2.1.0
D, [2016-02-11T14:15:07.362319 #19862] DEBUG – : Lazy loaded
10.2.1.0/255.255.255.0 records
D, [2016-02-11T14:15:07.362753 #19862] DEBUG – : Loading subnet data for
10.2.7.0/255.255.255.0
D, [2016-02-11T14:15:07.365738 #19862] DEBUG – : Added host1.mydomain
(10.2.7.166 / …) to 10.2.7.0/255.255.255.0
D, [2016-02-11T14:15:07.366501 #19862] DEBUG – : Added host2.mydomain
(10.2.7.167 / …) to 10.2.7.0/255.255.255.0
D, [2016-02-11T14:15:07.367713 #19862] DEBUG – : Added host3.mydomain
(10.2.7.169 / …) to 10.2.7.0/255.255.255.0

At the very end, I see the record get created (in this case). I'm still
trying to replicate another timeout failure - this morning I couldn't get
it to succeed at all.

D, [2016-02-11T14:34:37.529541 #19862] DEBUG – : Added myhost4 (10.2.8.180
/ …) to 10.2.8.128/255.255.255.128
D, [2016-02-11T14:34:37.550553 #19862] DEBUG – : omshell: executed - set
name = "myhost4.mydomain"
D, [2016-02-11T14:34:37.550807 #19862] DEBUG – : true
D, [2016-02-11T14:34:37.550990 #19862] DEBUG – : omshell: executed - set
ip-address = 10.2.8.180
D, [2016-02-11T14:34:37.551126 #19862] DEBUG – : true
D, [2016-02-11T14:34:37.551274 #19862] DEBUG – : omshell: executed - set
hardware-address = …
D, [2016-02-11T14:34:37.551365 #19862] DEBUG – : true
D, [2016-02-11T14:34:37.551500 #19862] DEBUG – : omshell: executed - set
hardware-type = 1
D, [2016-02-11T14:34:37.551579 #19862] DEBUG – : true
D, [2016-02-11T14:34:37.552077 #19862] DEBUG – : omshell: executed - set
statements = "filename = &quot;pxelinux.0&quot;; next-server = …; option host-name
= &quot;myhost4.mydomain&quot;;"
D, [2016-02-11T14:34:37.552171 #19862] DEBUG – : true
D, [2016-02-11T14:34:37.552315 #19862] DEBUG – : omshell: executed - create
D, [2016-02-11T14:34:37.552412 #19862] DEBUG – : true
I, [2016-02-11T14:34:37.598551 #19862] INFO – : Added DHCP reservation
for myhost4.mydomain (10.2.8.180 / …)
[11/Feb/2016 14:34:37] "POST /10.2.8.128 HTTP/1.1" 200 - 59.9062

It takes a significant amount of time for the proxy to enumerate all the
subnets and hosts. After clicking "submit" from the UI, it's a good minute
or so before I get a message saying the DHCP record is being added, and
another minute or two after that before it completes. It happens each time
I try to modify the subnet on a NIC.

Is there anything I can do here, other than not add 100+ subnets? Do the
records have to be reloaded every time I make a change, or could they be
cached in some way?

> It takes a significant amount of time for the proxy to enumerate all the
> subnets and hosts. After clicking "submit" from the UI, it's a good minute
> or so before I get a message saying the DHCP record is being added, and
> another minute or two after that before it completes. It happens each time
> I try to modify the subnet on a NIC.
>

Indeed, it's not fast.

> Is there anything I can do here, other than not add 100+ subnets? Do the
> records have to be reloaded every time I make a change, or could they be
> cached in some way?
>
> Other than the usual cop-out answer to caching of "how much do you want to
risk invalid data?", the real answer is no. ISC DHCP makes changes to it's
lease file all the time, so the only way we have to know whether an IP is
free right now is to go parse the the lease file right now. To my
knowledge (correct me if I'm wrong on this) ISC stores all leases for all
subnets in a single file - there's no way to stores leases for a given
subnet in a separate file. Coupled with the last-entry-wins style of
parsing ISC uses, means we can't even restrict ourselves to a single subnet
when resolving the state a of a lease - we have to parse the whole file.

This is normally not an issue, since the majority of cases are only a few
subnets. However, I know Dmitri (the main proxy maintainer) is interested
in the larger-scale handling of ISC so perhaps he has some thoughts to add
:wink:

··· On 11 February 2016 at 23:04, Kyle Flavin wrote:

I have changes to isc dhcp provider that modify how and when leases file is
parsed and should result in a speedup of create/delete record operations.
I'd like to make a few more changes before creating a PR, should be doable
by mid next week. There is an issue in redmine for this:
Bug #12392: 100% cpu usage on foreman-proxy DHCP calls - Smart Proxy - Foreman, probably worth monitoring it
– any updates will be recorded there.

-d

··· On Friday, February 12, 2016 at 11:11:12 AM UTC, Greg Sutcliffe wrote: > > On 11 February 2016 at 23:04, Kyle Flavin > wrote: > >> It takes a significant amount of time for the proxy to enumerate all the >> subnets and hosts. After clicking "submit" from the UI, it's a good minute >> or so before I get a message saying the DHCP record is being added, and >> another minute or two after that before it completes. It happens each time >> I try to modify the subnet on a NIC. >> > > Indeed, it's not fast. > > >> Is there anything I can do here, other than not add 100+ subnets? Do the >> records have to be reloaded every time I make a change, or could they be >> cached in some way? >> >> Other than the usual cop-out answer to caching of "how much do you want > to risk invalid data?", the real answer is no. ISC DHCP makes changes to > it's lease file all the time, so the only way we have to know whether an IP > is free *right now* is to go parse the the lease file *right now*. To my > knowledge (correct me if I'm wrong on this) ISC stores all leases for all > subnets in a single file - there's no way to stores leases for a given > subnet in a separate file. Coupled with the last-entry-wins style of > parsing ISC uses, means we can't even restrict ourselves to a single subnet > when resolving the state a of a lease - we have to parse the whole file. > > This is normally not an issue, since the majority of cases are only a few > subnets. However, I know Dmitri (the main proxy maintainer) is interested > in the larger-scale handling of ISC so perhaps he has some thoughts to add > ;) >

Thanks guys, I'll follow the issue on Redmine. We're trying to implement
Foreman for provisioning across our 30+ datacenters, and unfortunately I've
run into this on the first datacenter. I'm worried it's going to be worse
on some of the larger datacenters. I suppose if I really get desperate, I
could just disable the DHCP function on the smart proxy and manage the host
reservations in dhcpd.conf, outside of Foreman, using puppet.

Dmitri, we're running 1.9.3. Should there be any issues if I try to
backport your changes there once they become available?