Even an almost empty Foreman feels slow and sluggish

cleka · June 5, 2020, 8:05am

Ah, ok, didn’t realize that. Just read ElasticSearch and went full steam ahead

(and perhaps this explains why there are no newer changes to lzap’s stuff, because this here is the right/modern way to do it.)

I will check this out.

cleka · June 5, 2020, 8:53am

Yes. The VM has 20 GB, 6 cores. It’s running on a hypervisor that has 16 vcores and 96 GB RAM, and there’s almost nothing else running on the same HV: an idle local mail server, an idle repo/web server, and the Elasticsearch VM.

Inside the VM, while I did this “how long does it take to open the All Hosts page” test, top showed 6+ GB still free (not even used as I/O cache).

On the other hand, the hardware is old. Fujitsu Primergy RX200 S5; they have a sticker “warranty ends xx/2011”. 2x Xeon X5550 @ 2.67GHz, so 16 vcores. But the RAM might not be the “right” RAM, just what was lying around, and the disks are old too; the “big” logical disk holding the VM images (image files in an ext4 FS inside LVM) is 4 x striped mirrored (kind of RAID10, IIUC) old Toshiba disks, probably max 7000 rpm. Slow disks should not matter for switching pages in katello once everything is cached, as RAM is plenty.

I have just rebooted the hypervisor, it was set to “minimal power usage”, now I changed it to “max performance”.

Switching between different pages in katello (All hosts, provisioning templates, partitions, … - which had been displayed before, so everything it needs from disk should be in cache) now takes between 1 and 2 seconds. Never below 1 sec. (I start counting “0” the moment I release the mouse button… a super objective measurement method.)

Foreman is not a monolithic app; it’s a composition of several open source projects written in multiple languages, stacks and libraries, therefore lots of things are duplicated. It is the price we need to pay in order to use the best components out there and not reinvent wheels.

Yes. So, considering that, 1-2 secs response time is quite reasonable.

(Also, compared to the initial situation, I currently have some Firefox plugins disabled (AdBlock, YoutubeBlock and Download helper), which might have slowed down rendering in the browser somewhat, as they probably look out for elements “they feel are relevant for them”. And my network connection was 100 MBit (it went through the 4G gateway, which has only that); now it goes via a simple 1 GBit Zyxel hub, so I should have full 1 GBit between upstairs and my garage.)

Starting, stopping and running foreman-installer to add some module still takes long, but there are 23 modules to deal with sequentially…
Four consecutive restarts took 1m22sec, 1m14sec, 1m25sec, 1m14sec.

Just for fun I tried how it would behave with systemctl stop in a loop over all services that “sound related”, but that took even longer (2min41sec) and failed, probably because they need to be stopped in the right order, which this does not do.

cleka · June 5, 2020, 9:13am

Timo,

as said above, page display times are now between 1 and 2 seconds. I can live with that. Do you still think this is worth looking deeper into? I am happy to do so, this is exactly the stuff I enjoy doing, but I feel I have already wasted too much of your time (and that of everyone I dragged into this) with this.

I will look into the Elastic plugin anyway. Always useful to learn new stuff

cleka · June 5, 2020, 9:18am

… Just thinking aloud about my motivations, feel free to skip this …

Always useful to learn new stuff

I have applied for a job as a system specialist (servers); the first interview (remote) is next Monday. That’s one of the reasons I play around with katello. At my previous employer too much was still done “the old way, by foot”. If I start somewhere new, and they don’t have any proper tools in place to manage their infra (I somewhat doubt they do, a small company with 17 persons), I really would like to do things a bit better this time. In the old company, we “managed” one project (10+ envs, each with 2-4 VMs) with Ansible, but that was horrible. Perhaps mostly because we did Ansible the wrong way: as newbies, trying to retroactively re-fit Ansible onto an infra that had been created since 2014 with shell scripts made by me, and we created and edited all Ansible source files with vi or emacs. I’m sure there are better ways…

For example all our 7 hypervisors and all 100+ VMs on them were just managed with virt-manager, new VMs created with “virt-install …” and dynamically (php) created kickstart.

The company I applied to seems to still have mostly servers, but one goal is definitely “more cloud”, and katello seems to target both, right?

(The only other infra tool that comes to my mind would be terraform, just tried some 1-2 tutorials, but no real experience yet).

Katello is in that sense ideal for my “home garage datacenter”, which is just VMs… - but plenty of them. So far I have created all of them mostly manually… so I intend to re-do some of them with katello. Then I would not need to back them up as systems, only the data inside that changes; if something goes wrong, rebuild and re-import the data.

At least that’s my idea how one should manage a semi-complex infra.

At the moment, every now and then I take a backup copy of the VM disk image with cp (when the VM is down). For example, before I start installing additional stuff into a VM, like the rlogin stuff into katello, I do that, in case the “trying to add something” totally screws up the existing system. They are all smaller than 20GB, some with only 2-4 GB used inside (mail server, DNS+DHCP server, …).

Somehow I am not toooooo eager to turn them into containers yet… with containers I always feel restricted in troubleshooting stuff. In a VM I can install whatever tool I need right now (netstat, strace, …); in a container, one can “exec” into it, but working inside it is clumsy. So I haven’t made the mental switch to “containers are the best thing since sliced bread” yet

But managing all my VMs by hand is a mess. It’s always the “getting started” with doing it a better way that’s hard. “This one more time” I still do it the old way

lzap · June 5, 2020, 11:57am

It’s usually almost instant for me on much worse hardware: 16GB RAM, 2 vCPUs, libvirt. Is your DNS setup correct? Incorrect name resolution can cause similar issues. Check both the server and the client (the browser).

Nah, although we have some very slow endpoints, a typical page like subnets should return in 200ms. The host page is special as it has these BMC icons, and some katello pages can be slower too.

Note that Foreman/Katello is often I/O heavy and I’d suggest avoiding the QCOW2 format. A RAW image is better, and using LVM is the best option.

Okay thanks for the story and good luck with your interview!

TimoGoebel · June 5, 2020, 12:00pm

If you want to dig deeper and figure out what’s below the surface: Knock yourself out!

cleka · June 5, 2020, 1:44pm

Even though I have given it 20 GB and 6 cores, it’s not really using them (so in this case the amount of RAM and cores does not have much effect). In my experience a VM on e.g. a PC often reacts somewhat faster (snappier response) than one on server hardware. Server hardware is optimized for load, not for making a single VM as fast as possible (for example the scheduler for the disks, …). In the VM I had set that scheduler to none, but as I understand it, with the virtio disk type that’s kind of irrelevant.

My DNS setup… it’s working, but I don’t know whether it is “perfect”. (I remember when I had trouble with ipv4 and ipv6: it waited for both requests, but that would cause a delay of 5 seconds every time.) I have v6 disabled on almost all servers.

Is the result below fast or slow? That’s the address of my DNS server. It’s consistently between 0.600 and 0.800 ms, both from my laptop (browser) and from the katello server:

[root@katello ~]# ping -c1 192.168.1.19
PING 192.168.1.19 (192.168.1.19) 56(84) bytes of data.
64 bytes from 192.168.1.19: icmp_seq=1 ttl=64 time=0.710 ms

--- 192.168.1.19 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.710/0.710/0.710/0.000 ms
[root@katello ~]#

That’s just pinging it. How can I get a reasonable measure of how fast the lookup is?

[root@katello ~]# cat /etc/resolv.conf 
# Created by Clemens
search kt21c.net
nameserver 192.168.1.19

[root@katello ~]# dig test-vm-8

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-16.P2.el7_8.6 <<>> test-vm-8
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 1756
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;test-vm-8.                     IN      A

;; AUTHORITY SECTION:
.                       10739   IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2020060500 1800 900 604800 86400

;; Query time: 1 msec
;; SERVER: 192.168.1.19#53(192.168.1.19)
;; WHEN: Fri Jun 05 16:38:10 EEST 2020
;; MSG SIZE  rcvd: 113

((Strange, dig can’t resolve the short name? Does only nslookup use the “search” line in resolv.conf?))

[root@katello ~]# dig test-vm-8.kt21c.net

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-16.P2.el7_8.6 <<>> test-vm-8.kt21c.net
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 21065
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 2

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;test-vm-8.kt21c.net.           IN      A

;; ANSWER SECTION:
test-vm-8.kt21c.net.    86400   IN      A       192.168.1.39

;; AUTHORITY SECTION:
kt21c.net.              86400   IN      NS      infra.kt21c.net.

;; ADDITIONAL SECTION:
infra.kt21c.net.        86400   IN      A       192.168.1.19

;; Query time: 1 msec
;; SERVER: 192.168.1.19#53(192.168.1.19)
;; WHEN: Fri Jun 05 16:38:45 EEST 2020
;; MSG SIZE  rcvd: 100

[root@katello ~]#

So the name lookup takes (less than) 1 ms, I guess that is acceptable.
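To time the lookup the way an application sees it, one option is to go through the system resolver instead of dig. The Python sketch below does that with `socket.getaddrinfo`, which follows /etc/resolv.conf including the “search” line (plain dig ignores the search list unless it is called with +search). The function name `time_lookup` is made up for this example, and the numbers include any caching done on the machine, so treat them as a rough indication only.

```python
import socket
import time


def time_lookup(hostname, repeats=5):
    """Resolve hostname through the system resolver and return each call's duration in ms."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        # getaddrinfo uses the same resolver path as most applications
        socket.getaddrinfo(hostname, None)
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples
```

Calling `time_lookup("test-vm-8")` on both the katello server and the laptop, and comparing the smallest and largest values, should show whether name resolution adds a noticeable delay on either side.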

lzap · June 5, 2020, 1:47pm

Word. Well I guess we are just slow then. To get you started without setting up monitoring (which you can do, we have Prometheus and Statsd exporters), you can use our analysis script to find which endpoint is the slowest one for you.

cleka · June 5, 2020, 1:52pm

Yes, my VMs are raw images (in an ext4 filesystem which is a logical volume). I have had some bad experiences with qcow so I avoid those normally.

I could consider making them LVs. (I did that for a setup where I used drbd to be able to live-migrate between two hypervisors which both had only local disks, i.e. no shared storage from a SAN.)

Being able to simply copy the whole image file with cp is very convenient; doing that if they are logical volumes might not be as easy or straightforward (or I don’t know yet how to do it…). One way would be lv snapshots, but performance might suffer, especially since these are slow disks anyway.

lzap · June 5, 2020, 1:55pm

Hmm okay, RAW is good. I guess if you can send us a report from the tool I’ve sent you, that would be interesting.

tbrisker · June 7, 2020, 12:50pm

The log excerpt you shared seems to indicate the issue here isn’t the Foreman server - the host index page was rendered by the server in 168ms (see the lines with the 6b823ead request id).
Could this be some other network issue? The ping and DNS lookup seem to be quite fast, but otherwise I can’t explain why the browser shows a 3000ms wait for the response when the server replies in <200ms. Perhaps the https handshake is taking very long for some reason? Or high packet loss on the network?
You might want to tail -f /var/log/foreman/production.log while refreshing or browsing to a page in the browser with the network tab open, and correlate the timings of the requests. The log has a unique id for each request (the third field in the brackets after the timestamp), so you can follow it from the Started GET... line to the Completed 200 OK in ###ms line.
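The correlation by request id described above can also be done with a small script instead of by eye. This is only a sketch: it assumes the space-separated layout `<timestamp> <request id> <level> <logger> <message>` that the log excerpts in this thread use, and the function name `summarize_requests` is made up here. Other logging configurations may need a different pattern.

```python
import re

# Assumed line layout: <timestamp> <request id> <level> <logger> <message>
LINE_RE = re.compile(r"^(\S+) ([0-9a-f]{8}) \S+ \S+ (.*)$")
STARTED_RE = re.compile(r'^Started (\w+) "([^"]+)"')
COMPLETED_RE = re.compile(r"^Completed (\d{3}) .*? in (\d+)ms")


def summarize_requests(lines):
    """Pair each request's Started and Completed lines by request id."""
    requests = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not follow the assumed layout
        timestamp, request_id, message = match.groups()
        entry = requests.setdefault(request_id, {})
        started = STARTED_RE.match(message)
        completed = COMPLETED_RE.match(message)
        if started:
            entry.update(started_at=timestamp, method=started.group(1), path=started.group(2))
        elif completed:
            entry.update(status=int(completed.group(1)), server_ms=int(completed.group(2)))
    return requests


sample = [
    '2020-06-07T17:57:52 53d0119a info app Started GET "/notification_recipients" for 192.168.1.148 at 2020-06-07 17:57:52 +0300',
    "2020-06-07T17:57:52 53d0119a info app Processing by NotificationRecipientsController#index as JSON",
    "2020-06-07T17:57:52 53d0119a info app Completed 200 OK in 15ms (Views: 0.3ms | ActiveRecord: 2.3ms)",
]
for request_id, info in summarize_requests(sample).items():
    print(request_id, info.get("method"), info.get("path"), info.get("status"), info.get("server_ms"), "ms")
# prints: 53d0119a GET /notification_recipients 200 15 ms
```

Feeding it the lines of /var/log/foreman/production.log gives one server-side duration per request, which can then be compared with the wait time the browser’s network tab shows for the same path.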

cleka · June 7, 2020, 3:05pm

Ok, thanks.

Well, yes, it might be that there are issues with my local home networking setup. Traffic goes through a Zyxel unmanaged switch up here and through a Procurve managed switch in the garage, inside a vlan to the hypervisor and through a bridge to the guest VM.

And e.g. during some of my tests some browser plugins were still active, which might have slowed down page rendering inside firefox.

What are all those “notification_recipients” requests (auto screen refresh or something)? Can one disable them or make the interval longer?

2020-06-07T17:57:52 53d0119a info app Started GET "/notification_recipients" for 192.168.1.148 at 2020-06-07 17:57:52 +0300
2020-06-07T17:57:52 53d0119a info app Processing by NotificationRecipientsController#index as JSON
2020-06-07T17:57:52 53d0119a info app Completed 200 OK in 15ms (Views: 0.3ms | ActiveRecord: 2.3ms)

Debugging the network to find out whether packets are getting lost is a bit beyond my skills. When I let a ping run, all packets come back nicely within the same range around 0.700 ms, with no loss at all.
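One way to narrow this down without packet captures might be to time the phases of a single request separately from the client machine: TCP connect, TLS handshake, and the wait for the first response byte. The Python sketch below does that with the standard library only. The function name `time_phases` and its parameters are made up for this example, and certificate verification is switched off on the assumption that a home lab server may use a self-signed certificate, so it is suitable for timing only.

```python
import socket
import ssl
import time


def time_phases(host, port=443, path="/", use_tls=True, timeout=10.0):
    """Time TCP connect, TLS handshake and first response byte, in milliseconds."""
    timings = {}
    t0 = time.perf_counter()
    sock = socket.create_connection((host, port), timeout=timeout)
    t1 = time.perf_counter()
    timings["connect_ms"] = (t1 - t0) * 1000.0
    try:
        if use_tls:
            # Assumption: verification is disabled for timing purposes only.
            context = ssl.create_default_context()
            context.check_hostname = False
            context.verify_mode = ssl.CERT_NONE
            sock = context.wrap_socket(sock, server_hostname=host)
            t2 = time.perf_counter()
            timings["tls_ms"] = (t2 - t1) * 1000.0
        else:
            t2 = t1
        request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
        sock.sendall(request.encode("ascii"))
        sock.recv(1)  # blocks until the first byte of the response arrives
        timings["first_byte_ms"] = (time.perf_counter() - t2) * 1000.0
    finally:
        sock.close()
    return timings
```

Run from the laptop against the katello host name, a large `tls_ms` would point at the handshake, while a large `first_byte_ms` combined with a short “Completed ... in ###ms” in production.log would point at something between the web server and the browser.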

But I feel I am not able to do detailed debugging of the timing delays here, and since it looks like it’s not “Foreman’s fault” anyway, I will mark this as resolved.

tbrisker · June 7, 2020, 3:11pm

That is the little notification drawer in the top menu checking if there are new notifications.
You could increase the interval by setting the NOTIFICATIONS_POLLING environment variable to a longer time - the default value is 10000 (milliseconds).

cleka · June 7, 2020, 3:23pm

How and where? I created such an environment variable in .profile of the foreman user (~foreman), but that had no effect.

[root@katello foreman]# cd ~foreman
[root@katello foreman]# cat .profile


NOTIFICATIONS_POLLING=600000
export NOTIFICATIONS_POLLING

[root@katello foreman]#

lzap · June 8, 2020, 11:53am

Override the systemd environment variable for httpd.service (for Foreman 2.1, use foreman.service instead, since we have changed the deployment strategy to Puma).

cleka · June 8, 2020, 12:27pm

Still does not work for me:

[root@katello foreman]# tail -n 5 /etc/sysconfig/httpd
# logged in user (default 10000 = 10 secs)

NOTIFICATIONS_POLLING=30000


[root@katello foreman]# 
[root@katello foreman]# systemctl daemon-reload
[root@katello foreman]# systemctl restart httpd

2020-06-08T15:24:25 e31da66a info app Started GET "/notification_recipients" for 192.168.1.148 at 2020-06-08 15:24:25 +0300
2020-06-08T15:24:25 e31da66a info app Processing by NotificationRecipientsController#index as JSON
2020-06-08T15:24:25 e31da66a info app Completed 200 OK in 17ms (Views: 0.2ms | ActiveRecord: 4.7ms)
2020-06-08T15:24:35 51d558aa info app Started GET "/notification_recipients" for 192.168.1.148 at 2020-06-08 15:24:35 +0300
2020-06-08T15:24:35 51d558aa info app Processing by NotificationRecipientsController#index as JSON
2020-06-08T15:24:35 51d558aa info app Completed 200 OK in 16ms (Views: 0.2ms | ActiveRecord: 2.2ms)
^C
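Whether a change to the polling interval took effect can be read off the log by measuring the gap between consecutive polls. A minimal Python sketch, assuming each log line starts with an ISO timestamp as in the excerpt above and that only one browser session is polling (several sessions would interleave and shorten the gaps); the function name `polling_intervals` is made up here:

```python
from datetime import datetime


def polling_intervals(lines, path="/notification_recipients"):
    """Return the gaps in seconds between consecutive 'Started GET <path>' lines."""
    marker = f'Started GET "{path}"'
    stamps = [
        datetime.fromisoformat(line.split(None, 1)[0])
        for line in lines
        if marker in line
    ]
    return [(later - earlier).total_seconds() for earlier, later in zip(stamps, stamps[1:])]


sample = [
    '2020-06-08T15:24:25 e31da66a info app Started GET "/notification_recipients" for 192.168.1.148 at 2020-06-08 15:24:25 +0300',
    "2020-06-08T15:24:25 e31da66a info app Completed 200 OK in 17ms (Views: 0.2ms | ActiveRecord: 4.7ms)",
    '2020-06-08T15:24:35 51d558aa info app Started GET "/notification_recipients" for 192.168.1.148 at 2020-06-08 15:24:35 +0300',
    "2020-06-08T15:24:35 51d558aa info app Completed 200 OK in 16ms (Views: 0.2ms | ActiveRecord: 2.2ms)",
]
print(polling_intervals(sample))  # prints: [10.0]
```

The 10 second gap matches the default of 10000 ms, which is consistent with the new value of 30000 not having been picked up.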

ohadlevy · June 8, 2020, 3:04pm

I don’t think that would work, as the env variable is used when compiling webpack assets, not at runtime, AFAIU?