Upgrade issue from 2.3 to 2.4

Problem: I’m upgrading from 2.3 to 2.4, running on CentOS 7.
The yum upgrade completed without issues. I kicked off db:migrate and it ran for most of a day before the machine I had SSH’d in from rebooted itself. I then tried to run db:seed and was told that the migration hadn’t completed, so I kicked the process off again from a system that wouldn’t reboot on me. It has now been running close to 18 hours and still isn’t complete.

[root@**************** ~]# foreman-rake db:migrate
API controllers newer than Apipie cache! Run apipie:cache rake task to regenerate cache.
== 20181206131436 DropOldLocks: migrating =====================================

I ran top and I can see that something is happening:

top - 10:02:56 up 3 days, 19:25, 2 users, load average: 2.16, 1.53, 1.29
Tasks: 199 total, 2 running, 197 sleeping, 0 stopped, 0 zombie
%Cpu(s): 21.7 us, 9.3 sy, 0.0 ni, 68.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 16230504 total, 546996 free, 3594576 used, 12088932 buff/cache
KiB Swap: 8257532 total, 7859708 free, 397824 used. 11648280 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9705 puppet 20 0 8376500 2.9g 7988 S 138.2 18.7 1552:17 java
199904 postgres 20 0 836756 558516 556060 R 99.7 3.4 1388:08 postmaster
7501 root 20 0 261260 1380 752 S 4.7 0.0 255:29.93 python
7506 root 20 0 482456 1484 888 S 4.3 0.0 236:59.40 python
7277 postgres 20 0 834420 549272 549056 S 0.7 3.4 32:21.48 postmaster
9 root 20 0 0 0 0 S 0.3 0.0 7:42.65 rcu_sched

The java and postmaster processes have shown significant CPU utilization since this process started.

Is there an easy way to tell how far along it is? Is there a way I can tell if it is hung?

Hi @aehobbes80

If you run strace -p <pid> -fF -s 25555 on the postgres and foreman processes, do you see them actually doing anything, or are they just sitting there?
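Another option (a sketch, assuming you can open a psql session on the foreman database as the postgres user) is to ask Postgres directly what the migration’s backend is doing and how long its transaction has been open:

```sql
-- Show active backends on the foreman database.
-- wait_event tells you whether a backend is working or blocked on a lock.
select pid, state, wait_event_type, wait_event,
       xact_start, query
from pg_stat_activity
where datname = 'foreman' and state <> 'idle';
```

If wait_event_type is Lock, the migration is stuck waiting on something; if it is NULL and state is active, it is still churning through rows.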

It looks like things are still happening:

ps -ef | grep postgres
postgres 7009 1 0 Aug13 ? 00:00:17 postmaster -D /var/opt/rh/rh-postgresql12/lib/pgsql/data
postgres 7236 7009 0 Aug13 ? 00:00:00 postgres: logger
postgres 7276 7009 0 Aug13 ? 00:00:27 postgres: checkpointer
postgres 7277 7009 0 Aug13 ? 00:33:25 postgres: background writer
postgres 7278 7009 0 Aug13 ? 00:01:07 postgres: walwriter
postgres 7279 7009 0 Aug13 ? 00:00:12 postgres: autovacuum launcher
postgres 7280 7009 0 Aug13 ? 00:01:46 postgres: stats collector
postgres 7281 7009 0 Aug13 ? 00:00:00 postgres: logical replication launcher
root 116559 212360 0 12:29 pts/0 00:00:00 grep --color=auto postgres
postgres 199904 7009 99 Aug16 ? 1-01:34:28 postgres: foreman foreman [local] DELETE
postgres 199905 7009 0 Aug16 ? 00:00:00 postgres: foreman foreman [local] idle

Here is output:

 strace -p 7009 -fF -s 25555
strace: deprecated option -F ignored
strace: Process 7009 attached
select(7, [3 4 5 6], NULL, NULL, {tv_sec=52, tv_usec=376632}) = ? ERESTARTNOHAND (To be restarted if no handler)
--- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=7279, si_uid=26} ---
stat("logrotate", 0x7ffcffaf1b70)       = -1 ENOENT (No such file or directory)
clone(strace: Process 116726 attached
child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f80d6845b90) = 116726
[pid 116726] set_robust_list(0x7f80d6845ba0, 24 <unfinished ...>
[pid  7009] rt_sigreturn({mask=[]} <unfinished ...>
[pid 116726] <... set_robust_list resumed>) = 0
[pid  7009] <... rt_sigreturn resumed>) = -1 EINTR (Interrupted system call)
[pid 116726] open("/proc/self/oom_score_adj", O_WRONLY <unfinished ...>
[pid  7009] rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1],  <unfinished ...>
[pid 116726] <... open resumed>)        = 11
[pid  7009] <... rt_sigprocmask resumed>NULL, 8) = 0
[pid 116726] write(11, "0", 1 <unfinished ...>
[pid  7009] rt_sigprocmask(SIG_SETMASK, [],  <unfinished ...>
[pid 116726] <... write resumed>)       = 1
[pid  7009] <... rt_sigprocmask resumed>NULL, 8) = 0
[pid 116726] close(11 <unfinished ...>
[pid  7009] select(7, [3 4 5 6], NULL, NULL, {tv_sec=60, tv_usec=0} <unfinished ...>
[pid 116726] <... close resumed>)       = 0
[pid 116726] open("/dev/urandom", O_RDONLY|O_NOCTTY|O_NONBLOCK) = 11
[pid 116726] fstat(11, {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 9), ...}) = 0
[pid 116726] poll([{fd=11, events=POLLIN}], 1, 10) = 1 ([{fd=11, revents=POLLIN}])
[pid 116726] read(11, "k}\323\337\331\231D\252\334\325\256\261\v\ry\30\222\331]?\271T\305\177\326\340\32\3225\361\337)\310|\301\210\201\v\21\317\16b\241g\322\4Ut", 48) = 48
[pid 116726] close(11)                  = 0
[pid 116726] getuid()                   = 26

There was quite a bit more that came out. Lots of sendtos that look generally like:

[pid 116730] sendto(10, "\2\0\0\0\350\1\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\356\4\0\0\0\0\0\0\1\0\0\0\0\0\0\0\5\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\5\0\0\0\0\0\0\0\5\0\0\0\0\0\0\0p\n\0\0\0\0\0\0\3\0\0\0\0\0\0\0\3\0\0\0\0\0\0\0\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0o\n\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\225\v\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\0\0\0\0\0\0\0\7\0\0\0\0\0\0\0", 488, 0, NULL, 0) = 488
[pid 116730] kill(7279, SIGUSR2)        = 0
[pid 116730] exit_group(0)              = ?

This migration does two things. It removes all non-exclusive locks, and then ensures there is at most one lock binding a task to a resource.

Now, since the migration runs in a single transaction, we can’t really take a look at how far along it is, as far as I know.

Just to give us a general idea of how much stuff it has to deal with:

To get the number of non-exclusive locks:

select count(*) from foreman_tasks_locks as ftl left outer join foreman_tasks_tasks as ftt on ftt.id = ftl.task_id where ftl.exclusive = false or ftt.state = 'stopped' or ftt.state is null;

To get the number of locks which need to be removed for the second part:

with grouped_locks as (
  select ftl.task_id, ftl.resource_type, ftl.resource_id, count(ftl.task_id) from foreman_tasks_locks as ftl where exclusive = true group by task_id, resource_type, resource_id
)
select sum(count) - count(*) from grouped_locks;
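To illustrate what that last query computes, here is a toy version with hypothetical data: four exclusive locks falling into two (task, resource) groups mean two duplicates would have to be removed:

```sql
with grouped_locks as (
  -- pretend task 1 holds three locks on one resource and task 2 holds one
  select task_id, count(*)
  from (values (1), (1), (1), (2)) as locks(task_id)
  group by task_id
)
-- total locks minus number of groups = locks the migration must delete
select sum(count) - count(*) as duplicates_to_remove from grouped_locks;
-- duplicates_to_remove = 2
```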

Since this is a data migration, I could probably guide you through doing what the migration does with raw SQL queries. But first, let’s get an idea of how many records we’re dealing with.
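For the record, the raw SQL might look roughly like this. This is a hypothetical sketch based on the two steps described above, not the actual migration code, so don’t run it without a backup:

```sql
begin;

-- step 1 (sketch): drop non-exclusive locks and locks whose task
-- is stopped or no longer exists
delete from foreman_tasks_locks ftl
where ftl.exclusive = false
   or not exists (
     select 1 from foreman_tasks_tasks ftt
     where ftt.id = ftl.task_id and ftt.state <> 'stopped'
   );

-- step 2 (sketch): of the remaining exclusive locks, keep one per
-- (task_id, resource_type, resource_id) and delete the duplicates
delete from foreman_tasks_locks
where id in (
  select id from (
    select id,
           row_number() over (
             partition by task_id, resource_type, resource_id
             order by id
           ) as rn
    from foreman_tasks_locks
  ) numbered
  where rn > 1
);

commit;
```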

The process finally finished; it took about 3 days to complete. I’ve been travelling so I haven’t been able to run the above commands, but I’ll definitely get the results back to you, as I’d like to make sure future upgrades don’t take this long.

The migration is a one-off kind of thing. If it passed, it won’t run again, so there’s no real point in gathering those numbers, unless you have another instance that needs to be upgraded the same way.
