Pulpcore core dumps when stopping services

mdedek · January 16, 2024, 8:52am

Problem:
Pulpcore dumps core every time foreman-maintain service stop is executed, and critical messages are written to /var/log/messages.
This happens every day in our environment when the offline backup starts.

Jan 16 04:00:19 hostname systemd-coredump[297768]: Process 210742 (pulpcore-api) of user 991 dumped core.
Jan 16 04:00:19 hostname systemd-coredump[297770]: Process 210746 (pulpcore-api) of user 991 dumped core.
Jan 16 04:00:19 hostname systemd-coredump[297769]: Process 210722 (pulpcore-api) of user 991 dumped core.
Jan 16 04:00:19 hostname systemd-coredump[297767]: Process 210770 (pulpcore-api) of user 991 dumped core.
Jan 16 04:00:19 hostname systemd-coredump[297794]: Process 210720 (pulpcore-conten) of user 991 dumped core.
Jan 16 04:00:19 hostname systemd-coredump[297795]: Process 210787 (pulpcore-conten) of user 991 dumped core.
Jan 16 04:00:19 hostname systemd-coredump[297797]: Process 210762 (pulpcore-conten) of user 991 dumped core.
Jan 16 04:00:19 hostname systemd-coredump[297815]: Process 210908 (pulpcore-conten) of user 991 dumped core.
Jan 16 04:00:19 hostname systemd-coredump[297817]: Process 210896 (pulpcore-conten) of user 991 dumped core.
Jan 16 04:00:19 hostname systemd-coredump[297819]: Process 210923 (pulpcore-conten) of user 991 dumped core.
Jan 16 04:00:19 hostname systemd-coredump[297826]: Process 210731 (pulpcore-conten) of user 991 dumped core.
Jan 16 04:00:19 hostname systemd-coredump[297833]: Process 210743 (pulpcore-conten) of user 991 dumped core.
Jan 16 04:00:20 hostname systemd-coredump[297851]: Process 210696 (pulpcore-conten) of user 991 dumped core.
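To see at a glance which services are affected, the systemd-coredump lines can be tallied per process name. This is only a minimal sketch: the function name count_coredumps and the regular expression are illustrative and assume the message format shown in the excerpt above.

```python
import re
from collections import Counter

# Matches lines such as:
#   ... systemd-coredump[297768]: Process 210742 (pulpcore-api) of user 991 dumped core.
# The name in parentheses is the kernel's process name, which is capped at
# 15 characters; that is why pulpcore-content shows up as "pulpcore-conten".
COREDUMP_RE = re.compile(
    r"systemd-coredump\[\d+\]: Process (\d+) \(([^)]+)\) of user (\d+) dumped core"
)


def count_coredumps(lines):
    """Return a Counter mapping process name -> number of core dumps seen."""
    counts = Counter()
    for line in lines:
        match = COREDUMP_RE.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


# Two lines copied from the excerpt above, used as sample input.
sample = [
    "Jan 16 04:00:19 hostname systemd-coredump[297768]: Process 210742 (pulpcore-api) of user 991 dumped core.",
    "Jan 16 04:00:19 hostname systemd-coredump[297794]: Process 210720 (pulpcore-conten) of user 991 dumped core.",
]

for name, count in sorted(count_coredumps(sample).items()):
    print(f"{name}: {count}")
```

To run it against a real log, pass an open file instead of the sample list, for example count_coredumps(open("/var/log/messages")). For the full excerpt above the tally is 4 for pulpcore-api and 9 for pulpcore-conten.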

Expected outcome:
Foreman services will be gracefully stopped.

Foreman and Proxy versions:
foreman-3.9.1-1.el8.noarch
katello-4.11.0-1.el8.noarch

Foreman and Proxy plugin versions:

Distribution and version:

Other relevant data:

production.log (65.0 KB)
messages.log (197.1 KB)

qcjames53 · January 19, 2024, 9:42pm

Hi! Apologies for the delay with help here.

Before the core dump in your messages log, pulp is indicating that it’s unable to connect to the database. The offline backup you mentioned is likely the cause; it’s closing the db endpoints before pulp closes its connections. A workaround would be to spin down the foreman service then run the offline backup.

This is a small error on the foreman-maintain side. I’ll go ahead and submit a report for it on Monday. Thanks for letting us know.

qcjames53 · February 19, 2024, 10:17pm

Foreman maintain issue link

mdedek · September 12, 2024, 10:24am

So far there has been no activity on this issue, which must bother many users.

iballou · September 19, 2024, 2:15pm

There’s been some recent movement on Add Wants=postgresql.service to Pulpcore service files by ekohl · Pull Request #359 · theforeman/puppet-pulpcore · GitHub, which is related.

If your environment supports it, please consider helping to test it. Test validations from users on pull requests get code merged more quickly.

mdedek · September 24, 2024, 2:54pm

I have now tested with Wants=postgresql.service present in the service definitions. No success.

pulpcore-api.service
pulpcore-content.service
pulpcore-worker@.service

“Wants” defines a weak dependency that only affects the startup sequence, whereas the problem described here happens while services are being stopped. Requires would be more reasonable:

Requires=postgresql.service
After=postgresql.service

Unfortunately, “Requires” did not help either; the core dumps still happen. Moreover, a core dump also happens when I stop only the pulpcore-api service while postgresql remains running. The exact command:

# systemctl stop pulpcore-api.service

iballou · September 30, 2024, 6:52pm

Thanks for the report, I’ll get it copied over to the pull request.

evgeni · October 9, 2024, 7:53am

So I was sufficiently annoyed by this problem in another context to go dig deeper. Oh my!

When you look at the stack trace of one of those core dumps, you only see Python core:

                #0  0x00007fc06668ba6c __pthread_kill_implementation (libc.so.6 + 0x8ba6c)
                #1  0x00007fc06663e686 raise (libc.so.6 + 0x3e686)
                #2  0x00007fc066628833 abort (libc.so.6 + 0x28833)
                #3  0x00007fc0661545b8 dlfree.cold (libffi.so.8 + 0x25b8)
                #4  0x00007fc063be9fdf CThunkObject_dealloc (_ctypes.cpython-311-x86_64-linux-gnu.so + 0xbfdf)
                #5  0x00007fc066be2a65 dict_dealloc (libpython3.11.so.1.0 + 0x1e2a65)
                #6  0x00007fc063bf2f8c PyCData_clear (_ctypes.cpython-311-x86_64-linux-gnu.so + 0x14f8c)
                #7  0x00007fc063be977c PyCFuncPtr_dealloc (_ctypes.cpython-311-x86_64-linux-gnu.so + 0xb77c)
                #8  0x00007fc066c435c6 subtype_dealloc (libpython3.11.so.1.0 + 0x2435c6)
                #9  0x00007fc066cb3557 free_keys_object (libpython3.11.so.1.0 + 0x2b3557)
                #10 0x00007fc066cd126a dict_tp_clear (libpython3.11.so.1.0 + 0x2d126a)
                #11 0x00007fc066bd80b5 gc_collect_main (libpython3.11.so.1.0 + 0x1d80b5)
                #12 0x00007fc066ce9a7c _PyGC_CollectNoFail.isra.0 (libpython3.11.so.1.0 + 0x2e9a7c)
                #13 0x00007fc066cd69d8 Py_FinalizeEx (libpython3.11.so.1.0 + 0x2d69d8)
                #14 0x00007fc066cfaf0c Py_Exit (libpython3.11.so.1.0 + 0x2faf0c)
                #15 0x00007fc066ce978b handle_system_exit (libpython3.11.so.1.0 + 0x2e978b)
                #16 0x00007fc066ce951d PyErr_PrintEx (libpython3.11.so.1.0 + 0x2e951d)
                #17 0x00007fc066b3be88 _PyRun_SimpleFileObject.cold (libpython3.11.so.1.0 + 0x13be88)
                #18 0x00007fc066ce8927 _PyRun_AnyFileObject (libpython3.11.so.1.0 + 0x2e8927)
                #19 0x00007fc066ce2e46 Py_RunMain (libpython3.11.so.1.0 + 0x2e2e46)
                #20 0x00007fc066ca53dd Py_BytesMain (libpython3.11.so.1.0 + 0x2a53dd)
                #21 0x00007fc0666295d0 __libc_start_call_main (libc.so.6 + 0x295d0)
                #22 0x00007fc066629680 __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x29680)
                #23 0x00005620192c0095 _start (python3.11 + 0x1095)

But @hao-yu said in a private discussion it’s related to our psycopg package – and usually, when Hao says something, he’s right (spoiler: also in this case!)

According to him, the issue is fixed when using psycopg[binary] – but we can’t use that, we need to compile things from source.

Let’s dig deeper into how psycopg uses libpq (the PostgreSQL client library). Everybody who wants to read some Python: head over to psycopg/psycopg/psycopg/pq/__init__.py at master · psycopg/psycopg · GitHub , everybody else: trust me that it has 3 implementations - “C”, “binary” and “Python”. “C” and “binary” are actually the same (C/Cython) code, but built and distributed differently. They both contain a Python extension that is linked against libpq. “Python” is different – it’s also using libpq, but via ctypes (Python’s FFI interface) and not by linking to it!
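For readers who have not used ctypes: the difference from a linked extension is that the shared library is located and loaded at runtime from plain Python, and every C signature is declared by hand. The sketch below illustrates this with the C library's strlen rather than libpq, so it runs without PostgreSQL installed. It is not psycopg's code, and it assumes a Linux system.

```python
import ctypes
import ctypes.util


def load_libc():
    """Locate and load the C library at runtime, as a ctypes wrapper would."""
    # find_library() may return None on minimal systems; CDLL(None) then
    # falls back to the symbols already loaded into the current process.
    return ctypes.CDLL(ctypes.util.find_library("c"))


libc = load_libc()

# Declare the C signature by hand: size_t strlen(const char *s)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

length = libc.strlen(b"pulpcore")
print(length)
```

A compiled extension such as the “C” or “binary” implementation has none of this in Python: the calls into the library are resolved when the extension is built and loaded, so the _ctypes module is not involved.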

Now, if we look at the stack trace again, we see it’s actually hitting _ctypes.cpython-311-x86_64-linux-gnu.so while crashing! That’s because we’re using the “Python” (ctypes) interface, while installing psycopg[binary] moves us over to the “Binary” interface.
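To check which implementation a given installation actually loads, psycopg exposes it as psycopg.pq.__impl__. A minimal sketch, assuming psycopg 3 is importable by the interpreter you run it with; the helper name psycopg_pq_impl is just for illustration.

```python
import importlib


def psycopg_pq_impl():
    """Return the libpq wrapper psycopg loaded ("python", "c" or "binary").

    Returns None if psycopg (or a usable libpq wrapper) cannot be imported.
    """
    try:
        pq = importlib.import_module("psycopg.pq")
    except ImportError:
        return None
    return pq.__impl__


impl = psycopg_pq_impl()
print(f"psycopg pq implementation: {impl}")
```

Run it with the same Python interpreter the Pulp services use (python3.11 in the trace above). Going by the explanation above, an affected installation should report python, and one with a psycopg-c build installed should report c. psycopg also honours the PSYCOPG_IMPL environment variable to force a specific implementation, which may help when comparing behaviour.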

As noted above, “C” and “Binary” are (should be?) technically identical, and there is psycopg-c · PyPI, which we can build (it contains sources!) and install. I’ve built it locally and it fixes the segfaults, without pulling in the forbidden “binary” package.

evgeni · October 9, 2024, 11:23am

If you’re feeling adventurous, you can try the builds for pulpcore 3.49 and pulpcore nightly

evgeni · October 18, 2024, 8:49am

FWIW, we had those fixes released to 3.49 and nightly repos, so upgrade and enjoy

mdedek · October 21, 2024, 11:09am

Successfully tested on katello-4.13 with python3.11-psycopg-3.2.3-1.el8.noarch
Thank you very much.