Upgrade from 3.10.0/4.12.1 to 3.11.1/4.13.1 makes foreman unusable

Problem:
We have an up-to-date 3.10.0/4.12.1 server running on AlmaLinux 8. Everything’s working fine. I have tried the upgrade to 3.11.1/4.13.1 twice today, and each time the system becomes unusable. The last steps of foreman-installer take ages, even though eventually foreman-installer finishes “successfully”. The GUI is unusable; it just shows spinning circles. API calls are extremely slow. For instance, a GET https://foreman8.example.com/api/hosts/1, which usually takes 0.5 s on 3.10.0, now takes approx. 1.5 minutes.

I also see a lot of timeouts in the production log, as some processes give up after 60 s.

Expected outcome:
Fast and reliable system.

I have already posted some logs in the thread Yet another Foreman upgrade fail

Looking at Feedback for Foreman 3.11 & Katello 4.13 I guess it could be connected, but I haven’t evaluated it yet. So far I quickly reverted the VM back after each attempt, as the system became completely unusable.

I have the katello.log and production.log from my last attempt, if it helps.

@Partha_Aji did you see this issue during upgrade testing for 4.13.1?

@gvde are you able to tell via the installer logs which part of the installer is taking so long? I’m wondering if it’s a database migration that’s taking up all the time.

Also, did the same thing happen with 4.13.0 (if you tested it)?

We can do a reproducer test, but we’ll first need an idea about the scale of your system – general # of hosts, content views, and content view versions.

I am not sure. I think the installer runs as quickly as usual until almost the end. It only slows down when it starts accessing the Foreman API, e.g.

2024-07-26 11:54:22 [DEBUG ] [configure] /Stage[main]/Foreman::Register/Foreman_host[foreman-foreman8.example.com]: Starting to evaluate the resource (2259 of 2379)
2024-07-26 11:54:22 [DEBUG ] [configure] Foreman_host[foreman-foreman8.example.com](provider=rest_v3): Making get request to https://foreman8.example.com/api/v2/hosts?search=name%3D%22foreman8.example.com%22
2024-07-26 11:55:25 [DEBUG ] [configure] Foreman_host[foreman-foreman8.example.com](provider=rest_v3): Received response 200 from request to https://foreman8.example.com/api/v2/hosts?search=name%3D%22foreman8.example.com%22
2024-07-26 11:55:25 [DEBUG ] [configure] /Stage[main]/Foreman::Register/Foreman_host[foreman-foreman8.example.com]: Evaluated in 62.86 seconds
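In case it helps others triage, slow Puppet resources can be pulled out of the installer debug log with a one-liner like this (a sketch; the here-doc simply replays the lines above — on a real system feed it your installer log, e.g. /var/log/foreman-installer/katello.log, if that is where your logs land):

```shell
# Flag any Puppet resource that the installer took more than 10 s to
# evaluate. The here-doc is sample input copied from the post above.
awk -F'Evaluated in ' '/Evaluated in/ {
    split($2, t, " ")
    if (t[1] + 0 > 10)      # threshold in seconds, adjust to taste
        print t[1] "s  " $1
}' <<'EOF'
2024-07-26 11:54:22 [DEBUG ] [configure] /Stage[main]/Foreman::Register/Foreman_host[foreman-foreman8.example.com]: Starting to evaluate the resource (2259 of 2379)
2024-07-26 11:55:25 [DEBUG ] [configure] /Stage[main]/Foreman::Register/Foreman_host[foreman-foreman8.example.com]: Evaluated in 62.86 seconds
EOF
```

Against the sample above it prints the Foreman_host resource with its 62.86 s evaluation time.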

It’s not the installer itself that is slow. If I am not mistaken, the DB migration also ran normally. Doesn’t the installer wait for the migration to finish? According to the logs, foreman-rake db:migrate took 30 seconds.

I have also checked with top: no process is really doing anything while it waits for those API calls.

4.13.1 is the first version I have tried. I am avoiding updates to .0 versions…

It’s not that big. 251 hosts, 9 content views, 6 of them with 3 versions each.

Thanks, we can try an upgrade test with a similar system and report back. If we can’t reproduce it, we’ll need to probe deeper into the API logs (or maybe add some) and also check the postgres queries.

I upgraded from 3.11.0/4.13.0 to 3.11.1/4.13.1 and have the same issue.
Installer stuck on Executing: foreman-rake upgrade:run

It eventually finished, though it took more than two hours.

How is the performance of API calls, e.g. https://foreman8.example.com/api/hosts/1 (replace the 1 with an existing id if you don’t have a host with id 1 anymore)?
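For repeatable timing, something like this works (the hostname and the admin credentials are placeholders from this thread — substitute your own):

```shell
# Time a single API call end to end; curl's -w write-out variables
# report the connect time and total time without printing the body.
curl -sk -o /dev/null -u admin:changeme \
     -w 'http=%{http_code} connect=%{time_connect}s total=%{time_total}s\n' \
     https://foreman8.example.com/api/hosts/1
```

Running it a few times in a row also shows whether the slowness is constant or only hits the first request.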

You don’t need to measure the API calls, clearly they are hosed. Even login takes minutes.

There is a message “Some services are not properly started. See the About page for more information”, but the About page won’t open, while foreman-maintain service status thinks everything is running.

# time hammer host list
---|---------------------|------------------|------------|--------------|-------------------|---------------|--------------|----------------------
ID | NAME                | OPERATING SYSTEM | HOST GROUP | IP           | MAC               | GLOBAL STATUS | CONTENT VIEW | LIFECYCLE ENVIRONMENT
---|---------------------|------------------|------------|--------------|-------------------|---------------|--------------|----------------------
1  | foreman.chepkov.lan | RHEL 9.4         |            | 192.168.1.77 | 52:54:00:38:4f:7d | OK            |              |                      
---|---------------------|------------------|------------|--------------|-------------------|---------------|--------------|----------------------

real	4m4.045s
user	0m1.094s
sys	0m0.177s

# time curl -I -k https://foreman.chepkov.lan/api/hosts/1
HTTP/2 401 
content-type: application/json; charset=utf-8
apipie-checksum: 77bd25b3dcff3ed026876881eb147cc1cdbea0b0
cache-control: no-cache
x-request-id: 0d48c1f9-2463-4bbb-ac14-d42ea2590178
x-runtime: 120.154458
strict-transport-security: max-age=631139040; includeSubdomains
x-frame-options: sameorigin
x-content-type-options: nosniff
x-xss-protection: 1; mode=block
x-download-options: noopen
x-permitted-cross-domain-policies: none
content-security-policy: default-src 'self'; child-src 'self'; connect-src 'self' ws: wss:; img-src 'self' data:; script-src 'unsafe-eval' 'unsafe-inline' 'self'; style-src 'unsafe-inline' 'self'
via: 2.0 foreman.chepkov.lan
date: Sat, 27 Jul 2024 15:54:33 GMT
server: Apache


real	2m0.249s
user	0m0.054s
sys	0m0.014s

Can you get us the output (and rough timing – “slow”/“quick”) for the following:

  • hammer ping
  • foreman-puma-status
  • pulp status
  • curl -k https://localhost:23443/candlepin/status
  • netstat -tulpen |grep java
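When endpoints hang like this, it can help to wrap each check in a hard timeout so one stuck service cannot stall the whole collection — a rough sketch using the commands above (adjust the list and timeout as needed):

```shell
# Run each health check with a 30 s cap; a hung endpoint is reported
# instead of blocking the remaining checks.
for cmd in 'hammer ping' \
           'foreman-puma-status' \
           'curl -sk https://localhost:23443/candlepin/status'; do
    echo "== $cmd =="
    timeout 30 $cmd || echo "(failed or timed out, exit $?)"
done
```

Note `$cmd` is left unquoted on purpose so the shell splits it into a command and its arguments; that is fine for simple commands like these.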

FYI, rebooting the server fixed the problem for me, but I repeated the same procedure on another server and it is in the same state too. I can keep it like that for diagnostics.

# hammer ping
Error: Timed out reading data from server

# foreman-puma-status
317574 (/usr/share/foreman/tmp/puma.state) Version: 6.4.2/ruby2.7.8p225 | Uptime: 23h44m | Phase: 0 | Load: 1[█░░░░░░░░░░░░░░░░░░░░░░░░░░░░░]30 | Req: 222
 └ 317681 CPU:   0.0% Mem:  420 MB Uptime: 23h44m | Load: 0[░░░░░]5 | Req: 78
 └ 317684 CPU:   0.0% Mem:  413 MB Uptime: 23h44m | Load: 1[█░░░░]5 | Req: 36
 └ 317688 CPU:   0.0% Mem:  412 MB Uptime: 23h44m | Load: 0[░░░░░]5 | Req: 14
 └ 317691 CPU:   0.0% Mem:  418 MB Uptime: 23h44m | Load: 0[░░░░░]5 | Req: 37
 └ 317695 CPU:   0.0% Mem:  415 MB Uptime: 23h44m | Load: 0[░░░░░]5 | Req: 42
 └ 317698 CPU:   0.0% Mem:  412 MB Uptime: 23h44m | Load: 0[░░░░░]5 | Req: 15

# curl -k https://localhost:23443/candlepin/status
curl: (7) Failed to connect to localhost port 23443: Connection timed out

# netstat -tulpen |grep java
tcp6     101      0 127.0.0.1:23443         :::*                    LISTEN      91         5842385    317187/java         

The pulp status output is quite verbose; I will upload it as a file.

Didn’t find an attach-file option, so I posted it here.

I am currently testing again and collecting logs.

I have just checked the database activity and noticed that candlepin is stuck “idle in transaction”.

postgres=# select *  from pg_stat_activity where state <> 'idle';
  datname  |  pid  |  usename  |    application_name    | wait_event |        state        |                              query
-----------+-------+-----------+------------------------+------------+---------------------+-----------------------------------------------------------------
 candlepin | 24517 | candlepin | PostgreSQL JDBC Driver | ClientRead | idle in transaction | select setting from pg_settings where name = 'edb_redwood_date'
 postgres  | 26870 | postgres  | psql                   |            | active              | select *  from pg_stat_activity where state <> 'idle';
(2 rows)

(output trimmed to the relevant columns)
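To watch how long that transaction has actually been stuck, the age can be queried directly — a sketch, assuming local psql access as the postgres user as in the output above:

```shell
# List sessions stuck "idle in transaction", oldest first, with the
# age of the open transaction and the start of the last query.
sudo -u postgres psql -c "
  SELECT pid, datname, usename,
         now() - xact_start AS xact_age,
         left(query, 60)    AS query
  FROM   pg_stat_activity
  WHERE  state = 'idle in transaction'
  ORDER  BY xact_age DESC;"
```

A transaction idle for minutes while holding locks would explain other sessions piling up behind it.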

O.K. I think it’s the same issue as in Feedback for Foreman 3.11 & Katello 4.13

I have updated, ran foreman-installer, stopped everything again, ran foreman-installer again, and rebooted. No change.

Then I downgraded candlepin to 4.4.10 and restarted tomcat.service, and all of a sudden everything is back to normal, fast and smooth. No more timeouts. Everything is recovering.
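For anyone wanting to try the same workaround, it amounts to roughly this (package version per this thread — verify what is actually available in your repos first):

```shell
# Roll candlepin back to the last known-good version and restart Tomcat.
# Requires root; 4.4.10 is the version that worked in this thread.
dnf downgrade -y candlepin-4.4.10
systemctl restart tomcat
```
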

Thus, it seems to me as if candlepin got stuck during startup, slowing everything else down too.

I have some logs collected if needed, but I reverted the VM for now.

Here’s a bugzilla on the Candlepin side to watch: 2300323 – candlepin-4.4.12-1 DB upgrade gets stuck

candlepin-4.4.13-1 is now available, which has the fix for the candlepin startup issue.

I don’t think that’s the case. I updated from 4.4.10 to 4.4.13 and have been watching

Executing: foreman-rake upgrade:run

for a while now.

The story is not fully over yet; let’s give them a bit more time :slight_smile: