Upgrade from 3.10.0/4.12.1 to 3.11.1/4.13.1 makes foreman unusable

I upgraded from 3.11.0/4.13.0 to 3.11.1/4.13.1 and have the same issue.
Installer stuck on Executing: foreman-rake upgrade:run

2 Likes

It eventually finished, took more then two hours though

How is the performance of API calls, e.g. https://foreman8.example.com/api/hosts/1 (replace the 1 with an existing id if you don’t have an host with id anymore)?

You don’t need to measure api, clearly they are hosed. Even login takes minutes,

There is message " Some services are not properly started. See the About page for more information", but About page won’t open, but foreman-maintain service status thinks everything is running

# time hammer host list
---|---------------------|------------------|------------|--------------|-------------------|---------------|--------------|----------------------
ID | NAME                | OPERATING SYSTEM | HOST GROUP | IP           | MAC               | GLOBAL STATUS | CONTENT VIEW | LIFECYCLE ENVIRONMENT
---|---------------------|------------------|------------|--------------|-------------------|---------------|--------------|----------------------
1  | foreman.chepkov.lan | RHEL 9.4         |            | 192.168.1.77 | 52:54:00:38:4f:7d | OK            |              |                      
---|---------------------|------------------|------------|--------------|-------------------|---------------|--------------|----------------------

real	4m4.045s
user	0m1.094s
sys	0m0.177s

# time curl -I -k https://foreman.chepkov.lan/api/hosts/1
HTTP/2 401 
content-type: application/json; charset=utf-8
apipie-checksum: 77bd25b3dcff3ed026876881eb147cc1cdbea0b0
cache-control: no-cache
x-request-id: 0d48c1f9-2463-4bbb-ac14-d42ea2590178
x-runtime: 120.154458
strict-transport-security: max-age=631139040; includeSubdomains
x-frame-options: sameorigin
x-content-type-options: nosniff
x-xss-protection: 1; mode=block
x-download-options: noopen
x-permitted-cross-domain-policies: none
content-security-policy: default-src 'self'; child-src 'self'; connect-src 'self' ws: wss:; img-src 'self' data:; script-src 'unsafe-eval' 'unsafe-inline' 'self'; style-src 'unsafe-inline' 'self'
via: 2.0 foreman.chepkov.lan
date: Sat, 27 Jul 2024 15:54:33 GMT
server: Apache


real	2m0.249s
user	0m0.054s
sys	0m0.014s

Can you get us the output (and rough timing – “slow”/“quick”) for the following:

  • hammer ping
  • foreman-puma-status
  • pulp status
  • curl -k https://localhost:23443/candlepin/status
  • netstat -tulpen |grep java

FYI, rebooting the server fixed the problem for me, but I have repeated the same procedure on another server and it is in the same state too. I can keep it like that for the diagnostic

# hammer ping
Error: Timed out reading data from server

# foreman-puma-status
317574 (/usr/share/foreman/tmp/puma.state) Version: 6.4.2/ruby2.7.8p225 | Uptime: 23h44m | Phase: 0 | Load: 1[█░░░░░░░░░░░░░░░░░░░░░░░░░░░░░]30 | Req: 222
 └ 317681 CPU:   0.0% Mem:  420 MB Uptime: 23h44m | Load: 0[░░░░░]5 | Req: 78
 └ 317684 CPU:   0.0% Mem:  413 MB Uptime: 23h44m | Load: 1[█░░░░]5 | Req: 36
 └ 317688 CPU:   0.0% Mem:  412 MB Uptime: 23h44m | Load: 0[░░░░░]5 | Req: 14
 └ 317691 CPU:   0.0% Mem:  418 MB Uptime: 23h44m | Load: 0[░░░░░]5 | Req: 37
 └ 317695 CPU:   0.0% Mem:  415 MB Uptime: 23h44m | Load: 0[░░░░░]5 | Req: 42
 └ 317698 CPU:   0.0% Mem:  412 MB Uptime: 23h44m | Load: 0[░░░░░]5 | Req: 15

# curl -k https://localhost:23443/candlepin/status
curl: (7) Failed to connect to localhost port 23443: Connection timed out

# netstat -tulpen |grep java
tcp6     101      0 127.0.0.1:23443         :::*                    LISTEN      91         5842385    317187/java         

pulp status is quite verbose, I will upload it as a file

1 Like

Didn’t find attach file option, posted it here

I am currently testing again and collecting logs.

I have just checked the database activity and noticed candlepin is stuck idle in transaction.

postgres=# select *  from pg_stat_activity where state <> 'idle';
 datid |  datname  |  pid  | leader_pid | usesysid |  usename  |    application_name    | client_addr | client_hostname | client_port |         backend_start         |          xact_start           |          query_start          |         state_change   
       | wait_event_type | wait_event |        state        | backend_xid | backend_xmin |                              query                              |  backend_type  
-------+-----------+-------+------------+----------+-----------+------------------------+-------------+-----------------+-------------+-------------------------------+-------------------------------+-------------------------------+------------------------
-------+-----------------+------------+---------------------+-------------+--------------+-----------------------------------------------------------------+----------------
 16401 | candlepin | 24517 |            |    16384 | candlepin | PostgreSQL JDBC Driver | 127.0.0.1   |                 |       35824 | 2024-07-28 16:35:25.806063+02 | 2024-07-28 16:35:25.978385+02 | 2024-07-28 16:35:25.985594+02 | 2024-07-28 16:35:25.986
474+02 | Client          | ClientRead | idle in transaction |             |    550265702 | select setting from pg_settings where name = 'edb_redwood_date' | client backend
 19028 | postgres  | 26870 |            |       10 | postgres  | psql                   |             |                 |          -1 | 2024-07-28 16:44:10.56482+02  | 2024-07-28 16:46:57.800293+02 | 2024-07-28 16:46:57.800293+02 | 2024-07-28 16:46:57.800
295+02 |                 |            | active              |             |    550283473 | select *  from pg_stat_activity where state <> 'idle';          | client backend
(2 rows)

O.K. I think it’s the same issue as in Feedback for Foreman 3.11 & Katello 4.13

I have updated, ran foreman-installer, stopped everything again, ran foreman-installer again, rebooted. No change.

Then I downgraded candlepin to 4.4.10 and restart tomcat.service and all of a sudden everything is back to normal, fast and smooth. No more timeouts. Everything is recovering.

Thus, it seems to me as if candlepin got stuck during startup, slowing everything else down, too.

I have got some logs collected, if needed, but revert the vm for now.

1 Like

Here’s a bugzilla on the Candlepin side to watch: 2300323 – candlepin-4.4.12-1 DB upgrade gets stuck

candlepin-4.4.13-1 is now available, which has the fix for the candlepin startup issue

1 Like

I don’t think it’s the case. I updated from 4.4.10 to 4.4.13 and watching

Executing: foreman-rake upgrade:run

for awhile now

The story is not yet fully over, lets give them a bit more time :slight_smile:

Also having the same issue. I’ve just completed a series of upgrades (incrementally, from 3.6.1 → 3.6.2, and then through the rest of the major releases).

Foreman 3.9/Katello 4.11 was the last upgrade where the dashboard worked. In 3.10+4.12 I can log in, and use the navigation, but none of the screens ever come up. There didn’t seem to be any errors while foreman-install was running.

There does seem to be some kind of timeout issue. pulpcore-content, pulpcore-api and foreman-proxy time out when starting, but it looks like systemd tries (successfully) to restart pulpcore-content and pulpcore-api. I can start foreman-proxy once the system is up and Foreman has finished starting up everything else.

What information can I provide that’s useful? I’m probably going to revert to the snapshot I took of Foreman 3.9+Katello 4.11 and sit there until this is worked out.

Aug 01 14:59:17 themis.cecnet.gmu.edu systemd[1]: foreman-proxy.service: start operation timed out. Terminating.
Aug 01 14:59:17 themis.cecnet.gmu.edu systemd[1]: pulpcore-content.service: start operation timed out. Terminating.
Aug 01 14:59:17 themis.cecnet.gmu.edu systemd[1]: pulpcore-content.service: Failed with result 'timeout'.
Aug 01 14:59:17 themis.cecnet.gmu.edu systemd[1]: Failed to start Pulp Content App.
Aug 01 14:59:17 themis.cecnet.gmu.edu systemd[1]: pulpcore-api.service: start operation timed out. Terminating.
Aug 01 14:59:17 themis.cecnet.gmu.edu systemd[1]: foreman-proxy.service: Failed with result 'timeout'.
Aug 01 14:59:17 themis.cecnet.gmu.edu systemd[1]: Failed to start Foreman Proxy.
Aug 01 14:59:17 themis.cecnet.gmu.edu systemd[1]: pulpcore-api.service: Failed with result 'timeout'.
Aug 01 14:59:17 themis.cecnet.gmu.edu systemd[1]: Failed to start Pulp API Server.
Aug 01 14:59:18 themis.cecnet.gmu.edu systemd[1]: Starting SSSD Kerberos Cache Manager...
Aug 01 14:59:20 themis.cecnet.gmu.edu systemd[1]: pulpcore-content.service: Service RestartSec=3s expired, scheduling restart.
Aug 01 14:59:20 themis.cecnet.gmu.edu systemd[1]: pulpcore-content.service: Scheduled restart job, restart counter is at 1.
Aug 01 14:59:20 themis.cecnet.gmu.edu systemd[1]: pulpcore-api.service: Service RestartSec=3s expired, scheduling restart.
Aug 01 14:59:20 themis.cecnet.gmu.edu systemd[1]: pulpcore-api.service: Scheduled restart job, restart counter is at 1.
Aug 01 14:59:20 themis.cecnet.gmu.edu systemd[1]: Stopped Pulp API Server.
Aug 01 14:59:20 themis.cecnet.gmu.edu systemd[1]: Starting Pulp API Server...
Aug 01 14:59:20 themis.cecnet.gmu.edu systemd[1]: Stopped Pulp Content App.
Aug 01 14:59:20 themis.cecnet.gmu.edu systemd[1]: Starting Pulp Content App...
Aug 01 14:59:21 themis.cecnet.gmu.edu systemd[1]: Started SSSD Kerberos Cache Manager.
Aug 01 14:59:21 themis.cecnet.gmu.edu sssd_kcm[1964]: Starting up
Aug 01 14:59:42 themis.cecnet.gmu.edu dynflow-sidekiq@worker-1[1315]: 2024-08-01T18:59:42.068Z pid=1315 tid=cf INFO: Booting Sidekiq 6.5.12 with Sidekiq::RedisConnection::RedisAdapter options {:url=>"redis://localhost:6379/6"}
Aug 01 14:59:42 themis.cecnet.gmu.edu dynflow-sidekiq@worker-hosts-queue-1[1313]: 2024-08-01T18:59:42.070Z pid=1313 tid=cd INFO: Booting Sidekiq 6.5.12 with Sidekiq::RedisConnection::RedisAdapter options {:url=>"redis://localhost:6379/6"}
Aug 01 14:59:42 themis.cecnet.gmu.edu dynflow-sidekiq@orchestrator[1311]: 2024-08-01T18:59:42.068Z pid=1311 tid=ar INFO: Booting Sidekiq 6.5.12 with Sidekiq::RedisConnection::RedisAdapter options {:url=>"redis://localhost:6379/6"}
Aug 01 14:59:42 themis.cecnet.gmu.edu dynflow-sidekiq@worker-hosts-queue-1[1313]: 2024-08-01T18:59:42.073Z pid=1313 tid=cd INFO: GitLab reliable fetch activated!
Aug 01 14:59:42 themis.cecnet.gmu.edu dynflow-sidekiq@orchestrator[1311]: 2024-08-01T18:59:42.073Z pid=1311 tid=ar INFO: GitLab reliable fetch activated!
Aug 01 14:59:42 themis.cecnet.gmu.edu dynflow-sidekiq@worker-1[1315]: 2024-08-01T18:59:42.073Z pid=1315 tid=cf INFO: GitLab reliable fetch activated!
Aug 01 15:00:00 themis.cecnet.gmu.edu pulpcore-worker-2[1218]: pulp [None]: pulpcore.tasking.entrypoint:INFO: Starting distributed type worker
Aug 01 15:00:00 themis.cecnet.gmu.edu pulpcore-worker-3[1214]: pulp [None]: pulpcore.tasking.entrypoint:INFO: Starting distributed type worker
Aug 01 15:00:00 themis.cecnet.gmu.edu pulpcore-worker-1[1213]: pulp [None]: pulpcore.tasking.entrypoint:INFO: Starting distributed type worker
Aug 01 15:00:00 themis.cecnet.gmu.edu pulpcore-worker-4[1216]: pulp [None]: pulpcore.tasking.entrypoint:INFO: Starting distributed type worker

ACTUALLY…

It looks like pulpcore-api is not working correctly:

Aug 01 15:27:59 themis.cecnet.gmu.edu pulpcore-api[4124]: [2024-08-01 19:27:59 +0000] [4124] [INFO] Starting gunicorn 20.1.0
Aug 01 15:27:59 themis.cecnet.gmu.edu pulpcore-api[4124]: [2024-08-01 19:27:59 +0000] [4124] [INFO] Listening at: unix:/run/pulpcore-api.sock (4124)
Aug 01 15:27:59 themis.cecnet.gmu.edu pulpcore-api[4124]: [2024-08-01 19:27:59 +0000] [4124] [INFO] Using worker: pulpcore.app.entrypoint.PulpApiWorker
Aug 01 15:27:59 themis.cecnet.gmu.edu systemd[1]: Started Pulp API Server.
Aug 01 15:27:59 themis.cecnet.gmu.edu pulpcore-api[4136]: [2024-08-01 19:27:59 +0000] [4136] [INFO] Booting worker with pid: 4136
Aug 01 15:27:59 themis.cecnet.gmu.edu pulpcore-api[4137]: [2024-08-01 19:27:59 +0000] [4137] [INFO] Booting worker with pid: 4137
Aug 01 15:27:59 themis.cecnet.gmu.edu pulpcore-api[4138]: [2024-08-01 19:27:59 +0000] [4138] [INFO] Booting worker with pid: 4138
Aug 01 15:27:59 themis.cecnet.gmu.edu pulpcore-api[4139]: [2024-08-01 19:27:59 +0000] [4139] [INFO] Booting worker with pid: 4139
Aug 01 15:27:59 themis.cecnet.gmu.edu pulpcore-api[4140]: [2024-08-01 19:27:59 +0000] [4140] [INFO] Booting worker with pid: 4140
Aug 01 15:27:59 themis.cecnet.gmu.edu pulpcore-api[4140]: pulp [None]: pulpcore.app.entrypoint:INFO: Api App '4140@themis.cecnet.gmu.edu' failed to write a heartbeat to the database, sleeping for '45.0' seconds.
Aug 01 15:28:44 themis.cecnet.gmu.edu pulpcore-api[4136]: pulp [None]: pulpcore.app.entrypoint:INFO: Api App '4136@themis.cecnet.gmu.edu' failed to write a heartbeat to the database, sleeping for '45.0' seconds.
Aug 01 15:28:44 themis.cecnet.gmu.edu pulpcore-api[4137]: pulp [None]: pulpcore.app.entrypoint:INFO: Api App '4137@themis.cecnet.gmu.edu' failed to write a heartbeat to the database, sleeping for '45.0' seconds.
Aug 01 15:28:44 themis.cecnet.gmu.edu pulpcore-api[4138]: pulp [None]: pulpcore.app.entrypoint:INFO: Api App '4138@themis.cecnet.gmu.edu' failed to write a heartbeat to the database, sleeping for '45.0' seconds.
Aug 01 15:28:44 themis.cecnet.gmu.edu pulpcore-api[4139]: pulp [None]: pulpcore.app.entrypoint:INFO: Api App '4139@themis.cecnet.gmu.edu' failed to write a heartbeat to the database, sleeping for '45.0' seconds.

That is a different issue. Please open a separate thread or search if another thread covers it already. This thread is about an issue upgrading from 3.10 to 3.11.

Or, I could feel unwelcome and just go back to lurking.

You are not un-welcome, it just really looks like a different issue, to not clutter the thread it would be the best to move this to it’s own thread :slightly_smiling_face:
(and if it’s in the end the same issue, finding out also doesn’t clutter the main thread)