Content view repos and isos are being deleted every week

Problem:
Recently added content view repos and isos are being deleted every Sunday night around 22:00 from proxy servers. Older ones do not get deleted.
Expected outcome:
Content views on proxies should not be deleted without a manual request to delete them.
Foreman and Proxy versions:
katello-3.16.0-1.el7.noarch
Foreman and Proxy plugin versions:
foreman_tasks 2.0.2
foreman_chef 0.9.0
foreman_discovery 16.1.0
foreman_docker 5.0.0
foreman_remote_execution 3.3.5
katello 3.16.0
Distribution and version:
CentOS Linux release 7.8.2003
Other relevant data:

Below is a sample from /var/log/messages where it looks like the content views are being removed.

Sep 27 22:00:27 katrepoa02p mongod.27017[93918]: [conn54] remove pulp_database.repo_sync_results query: { repo_id: "1-carfax_ol_8_cv-Library-2074cb4e-dd4d-4ab1-b1c0-af2f0e7b7a29" } planSummary: COLLSCAN keysExamined:0 docsExamined:2376 ndeleted:2 keysDeleted:4 numYields:22 locks:{ Global: { acquireCount: { r: 23, w: 23 } }, Database: { acquireCount: { w: 23 } }, Collection: { acquireCount: { w: 23 } } } 113ms
Sep 27 22:00:27 katrepoa02p mongod.27017[93918]: [conn54] command pulp_database.$cmd command: delete { delete: "repo_sync_results", ordered: true, deletes: [ { q: { repo_id: "1-carfax_ol_8_cv-Library-2074cb4e-dd4d-4ab1-b1c0-af2f0e7b7a29" }, limit: 0 } ] } numYields:0 reslen:44 locks:{ Global: { acquireCount: { r: 23, w: 23 } }, Database: { acquireCount: { w: 23 } }, Collection: { acquireCount: { w: 23 } } } protocol:op_query 113ms
Sep 27 22:00:28 katrepoa02p mongod.27017[93918]: [conn133] remove pulp_database.repo_publish_results query: { repo_id: "1-carfax_ol_8_cv-Library-2074cb4e-dd4d-4ab1-b1c0-af2f0e7b7a29" } planSummary: COLLSCAN keysExamined:0 docsExamined:3566 ndeleted:3 keysDeleted:6 numYields:37 locks:{ Global: { acquireCount: { r: 38, w: 38 } }, Database: { acquireCount: { w: 38 } }, Collection: { acquireCount: { w: 38 } } } 328ms
Sep 27 22:00:28 katrepoa02p mongod.27017[93918]: [conn131] remove pulp_database.repo_publish_results query: { repo_id: "1-carfax_ol_8_cv-Library-941227b2-6d62-4268-858b-2295dcb83b72" } planSummary: COLLSCAN keysExamined:0 docsExamined:3563 ndeleted:3 keysDeleted:6 numYields:30 locks:{ Global: { acquireCount: { r: 31, w: 31 } }, Database: { acquireCount: { w: 31 } }, Collection: { acquireCount: { w: 31 } } } 141ms
Sep 27 22:00:28 katrepoa02p mongod.27017[93918]: [conn133] command pulp_database.$cmd command: delete { delete: "repo_publish_results", ordered: true, deletes: [ { q: { repo_id: "1-carfax_ol_8_cv-Library-2074cb4e-dd4d-4ab1-b1c0-af2f0e7b7a29" }, limit: 0 } ] } numYields:0 reslen:44 locks:{ Global: { acquireCount: { r: 38, w: 38 } }, Database: { acquireCount: { w: 38 } }, Collection: { acquireCount: { w: 38 } } } protocol:op_query 328ms
Sep 27 22:00:28 katrepoa02p mongod.27017[93918]: [conn131] command pulp_database.$cmd command: delete { delete: "repo_publish_results", ordered: true, deletes: [ { q: { repo_id: "1-carfax_ol_8_cv-Library-941227b2-6d62-4268-858b-2295dcb83b72" }, limit: 0 } ] } numYields:0 reslen:44 locks:{ Global: { acquireCount: { r: 31, w: 31 } }, Database: { acquireCount: { w: 31 } }, Collection: { acquireCount: { w: 31 } } } protocol:op_query 142ms
Sep 27 22:00:28 katrepoa02p mongod.27017[93918]: [conn64] remove pulp_database.repo_publish_results query: { repo_id: "1-carfax_ol_8_cv-Library-a990b838-88fb-421a-b3c5-22656a7bb791" } planSummary: COLLSCAN keysExamined:0 docsExamined:3560 ndeleted:3 keysDeleted:6 numYields:32 locks:{ Global: { acquireCount: { r: 33, w: 33 } }, Database: { acquireCount: { w: 33 } }, Collection: { acquireCount: { w: 33 } } } 195ms
Sep 27 22:00:28 katrepoa02p mongod.27017[93918]: [conn64] command pulp_database.$cmd command: delete { delete: "repo_publish_results", ordered: true, deletes: [ { q: { repo_id: "1-carfax_ol_8_cv-Library-a990b838-88fb-421a-b3c5-22656a7bb791" }, limit: 0 } ] } numYields:0 reslen:44 locks:{ Global: { acquireCount: { r: 33, w: 33 } }, Database: { acquireCount: { w: 33 } }, Collection: { acquireCount: { w: 33 } } } protocol:op_query 195ms
Sep 27 22:00:28 katrepoa02p mongod.27017[93918]: [conn131] remove pulp_database.repo_content_units query: { repo_id: "1-carfax_ol_8_cv-Library-941227b2-6d62-4268-858b-2295dcb83b72" } planSummary: IXSCAN { repo_id: 1, unit_type_id: 1, unit_id: 1 } keysExamined:192 docsExamined:192 ndeleted:192 keysDeleted:1344 numYields:8 locks:{ Global: { acquireCount: { r: 9, w: 9 } }, Database: { acquireCount: { w: 9 }, acquireWaitCount: { w: 3 }, timeAcquiringMicros: { w: 39492 } }, Collection: { acquireCount: { w: 9 } } } 174ms
Sep 27 22:00:28 katrepoa02p mongod.27017[93918]: [conn131] command pulp_database.$cmd command: delete { delete: "repo_content_units", ordered: true, deletes: [ { q: { repo_id: "1-carfax_ol_8_cv-Library-941227b2-6d62-4268-858b-2295dcb83b72" }, limit: 0 } ] } numYields:0 reslen:44 locks:{ Global: { acquireCount: { r: 9, w: 9 } }, Database: { acquireCount: { w: 9 }, acquireWaitCount: { w: 3 }, timeAcquiringMicros: { w: 39492 } }, Collection: { acquireCount: { w: 9 } } } protocol:op_query 174ms

Sep 27 22:00:39 katrepoa02p mongod.27017[93918]: [conn54] remove pulp_database.repo_content_units query: { repo_id: "1-carfax_atomic_7_cv-Library-cbd24260-1296-4b89-9ed1-a0eb056326ba" } planSummary: IXSCAN { repo_id: 1, unit_type_id: 1, unit_id: 1 } keysExamined:928 docsExamined:928 ndeleted:928 keysDeleted:6496 numYields:13 locks:{ Global: { acquireCount: { r: 14, w: 14 } }, Database: { acquireCount: { w: 14 }, acquireWaitCount: { w: 4 }, timeAcquiringMicros: { w: 13463 } }, Collection: { acquireCount: { w: 14 } } } 230ms
Sep 27 22:00:39 katrepoa02p mongod.27017[93918]: [conn54] command pulp_database.$cmd command: delete { delete: "repo_content_units", ordered: true, deletes: [ { q: { repo_id: "1-carfax_atomic_7_cv-Library-cbd24260-1296-4b89-9ed1-a0eb056326ba" }, limit: 0 } ] } numYields:0 reslen:44 locks:{ Global: { acquireCount: { r: 14, w: 14 } }, Database: { acquireCount: { w: 14 }, acquireWaitCount: { w: 4 }, timeAcquiringMicros: { w: 13463 } }, Collection: { acquireCount: { w: 14 } } } protocol:op_query 230ms

Sep 27 22:00:47 katrepoa02p mongod.27017[93918]: [conn131] remove pulp_database.repo_content_units query: { repo_id: "1-carfax_centos_8_cv-Library-dbc1a1be-bd26-4c8f-ad9a-6290a7e49406" } planSummary: IXSCAN { repo_id: 1, unit_type_id: 1, unit_id: 1 } keysExamined:1718 docsExamined:1718 ndeleted:1718 keysDeleted:12026 numYields:24 locks:{ Global: { acquireCount: { r: 25, w: 25 } }, Database: { acquireCount: { w: 25 }, acquireWaitCount: { w: 2 }, timeAcquiringMicros: { w: 10268 } }, Collection: { acquireCount: { w: 25 } } } 349ms
Sep 27 22:00:47 katrepoa02p mongod.27017[93918]: [conn131] command pulp_database.$cmd command: delete { delete: "repo_content_units", ordered: true, deletes: [ { q: { repo_id: "1-carfax_centos_8_cv-Library-dbc1a1be-bd26-4c8f-ad9a-6290a7e49406" }, limit: 0 } ] } numYields:0 reslen:44 locks:{ Global: { acquireCount: { r: 25, w: 25 } }, Database: { acquireCount: { w: 25 }, acquireWaitCount: { w: 2 }, timeAcquiringMicros: { w: 10268 } }, Collection: { acquireCount: { w: 25 } } } protocol:op_query 350ms

I'm reaching out to another developer on my team who may have an idea of what's occurring here. One thing that jumps out at me is that on the Katello server we have a cron job which runs at 22:00 on Sundays to remove orphaned content:

[root@centos7-katello-3-16 vagrant]# cat /etc/cron.d/katello
SHELL=/bin/sh
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin

# Script for initiating removal of orphaned content
00 22 * * 0     root    foreman-rake katello:delete_orphaned_content RAILS_ENV=production >/dev/null 2>&1
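If it is useful as a cross-check, something along these lines (paths may vary by setup) should show whether any other scheduled job on the Katello server or the proxies references the orphan cleanup:

# look for other cron entries mentioning the rake task
grep -r delete_orphaned_content /etc/cron.d /etc/crontab /var/spool/cron 2>/dev/null

# and check systemd timers, in case something was scheduled outside of cron
systemctl list-timers --all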

@tbrooks can you say if the content you are missing is being served by pulp2 or pulp3?

Based on the queries you provided, it looks like pulp2, but I wanted to confirm.
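If it helps to confirm, and assuming the default API paths, the two backends answer on different status endpoints, so a quick check against one of the proxies would look roughly like this:

# pulp 2 status API (answers if pulp 2 is installed and serving content)
curl -sk https://katrepoa02p.a.example.com/pulp/api/v2/status/

# pulp 3 status API (answers only if pulp 3 is installed)
curl -sk https://katrepoa02p.a.example.com/pulp/api/v3/status/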

@tbrooks ignoring what’s logged, can you confirm that the content you expect is actually missing? Sounds like ISOs in your specific case.

@Jonathon_Turel We had already found this cron and decided to comment it out, but it didn’t change the behavior.

We received errors when we tried to migrate from pulp2 to pulp3, so I'd have to say pulp2. I'm not sure how to verify this, though, other than to tell you the migration was unsuccessful.

Everything remained on the katello server itself, and by forcing a complete sync to each proxy we have been able to restore everything each week when this has happened. I believe it started happening after we upgraded from 3.14 to 3.15, followed immediately by 3.16. However, we were having issues with the metadata on 3.14 too.

@tbrooks thanks. Am I correct in saying that even with the cron job disabled, this activity on your smart proxy still happens at 22:00 on Sunday? In Monitor -> Tasks, which tasks are occurring at this time?
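If the hammer CLI with the tasks plugin is installed, something roughly like the following may also list what ran in that window from the command line (the search fields are an assumption on my part, adjust as needed):

hammer task list --search 'started_at > "2020-09-27 21:50" and started_at < "2020-09-27 22:30"'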

Yes, when the cron is disabled the activity still continues.

We see "Combined Profile Update for servername pretty much constantly. Other than that at this time, I see things like:

[Subscription expiration notification] September 27, 2020, 10:25:25 PM
[Clean up StoredValues] September 27, 2020, 10:25:40 PM
[Create RSS notifications]
[Pulp disk space notification] September 27, 2020, 10:26:10 PM

I’m not seeing anything to go on here considering:

  • the cron job that deletes orphaned content (which runs on Sundays at 22:00) has been disabled
  • despite that, this activity continues to happen at the same time
  • there are no suspect tasks running around Sunday at 22:00 according to Monitor -> Tasks

I’m not aware of any other mechanism that would be triggering deletion of content that we wouldn’t be able to trace easily. If there is any other information you can provide, no matter how minor the detail, please mention it. Maybe you can provide the entire log from your smart proxy so that we can look at it as a whole?
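Something along these lines, run on the proxy, should capture just the window around the deletions and keep the attachment manageable (adjust the date to the most recent occurrence):

# grab hours 21:00-22:59 for the night in question from the syslog and compress it
grep -h 'Sep 27 2[12]:' /var/log/messages* | gzip > proxy-messages-sep27.gz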

Despite what is logged, what is the actual symptom you are seeing with respect to the deleted content? I'm assuming your client cannot access it, or something along those lines.

Here is the current state of the cron job on the katello/foreman server:

# Script for initiating removal of orphaned content
#00 22 * * 0    root    foreman-rake katello:delete_orphaned_content RAILS_ENV=production >/dev/null 2>&1

I’m not sure why these would be considered orphaned anyway.

What is the best way to send the log?

Several of our content views are referenced by Chef cookbooks, so Chef fails when it cannot reach the content at whatever promotion level is missing. In some cases, this means all of our Chef runs will fail.

Today I again came in to missing repos. I tried to run complete syncs on all proxies, but they finished very quickly and nothing was copied. Instead I had to publish new content views and promote them to our different promotion levels to trigger a copy to the proxies. Thankfully the products are not disappearing from the master katello/foreman server itself. Unfortunately, our Chef runs have been failing all morning while I publish the content views.

You can attach files (compressed) to this thread. I would recommend only sending the portions of the log files relevant to the time period around when the katello:delete_orphaned_content task runs.

Also, in your Chef cookbooks, how is the content reached – via http/https? Is it possible to provide output from the Chef runs when these failures occur?

Here is an example of a failed Chef run from this morning. As mentioned earlier, the deletions only affect ISOs and repos that have been added recently. We have tried deleting them and starting over, but it didn't help.

================================================================================
  Error executing action `install` on resource 'dnf_package[autoconf, bison, flex, gcc, gcc-c++, gettext, kernel-devel, make, m4, ncurses-devel, patch]'
  ================================================================================
  RuntimeError
  ------------
  dnf-helper.py had stderr/stdout output:
  Errors during downloading metadata for repository 'carfax_chef_8_client':
    - Status code: 404 for https://katrepoa02p.a.example.com/pulp/repos/example/alpha/example-ol-8-composite-view/custom/chef_8/client/repodata/repomd.xml (IP: 172.18.138.193)
  Traceback (most recent call last):
    File "/usr/lib/python3.6/site-packages/dnf/repo.py", line 573, in load
      ret = self._repo.load()
    File "/usr/lib64/python3.6/site-packages/libdnf/repo.py", line 394, in load
      return _repo.Repo_load(self)
  RuntimeError: Failed to download metadata for repo 'example_chef_8_client': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried
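The same metadata URL from the error can also be checked directly with curl, outside of Chef; while the repo is missing from the proxy it should return the same 404:

# -I fetches headers only, -k skips certificate verification if the Katello CA is not trusted on this host
curl -Ik https://katrepoa02p.a.example.com/pulp/repos/example/alpha/example-ol-8-composite-view/custom/chef_8/client/repodata/repomd.xml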

Here is a sanitized version of the messages file from one of our proxies where ISOs/repos are being deleted. I have removed certain unrelated processes for clarity. The file begins at Oct 18 20:00:03, but the deletions begin at Oct 18 22:00:27.

messages.gz (67.0 KB)

Here you can see from the cron log that "foreman-rake katello:delete_orphaned_content" stops running after Sep 6, when I commented it out, but the deletions have continued.

May 24 22:00:08 katellod02p CROND[57430]: (root) CMD (foreman-rake katello:delete_orphaned_content RAILS_ENV=production >/dev/null 2>&1)
May 31 22:00:09 katellod02p CROND[40799]: (root) CMD (foreman-rake katello:delete_orphaned_content RAILS_ENV=production >/dev/null 2>&1)
Jun  7 22:00:10 katellod02p CROND[36012]: (root) CMD (foreman-rake katello:delete_orphaned_content RAILS_ENV=production >/dev/null 2>&1)
Jun 21 22:00:08 katellod02p CROND[34601]: (root) CMD (foreman-rake katello:delete_orphaned_content RAILS_ENV=production >/dev/null 2>&1)
Jun 28 22:00:05 katellod02p CROND[43297]: (root) CMD (foreman-rake katello:delete_orphaned_content RAILS_ENV=production >/dev/null 2>&1)
Jul  5 22:00:08 katellod02p CROND[61716]: (root) CMD (foreman-rake katello:delete_orphaned_content RAILS_ENV=production >/dev/null 2>&1)
Jul 12 22:00:06 katellod02p CROND[39078]: (root) CMD (foreman-rake katello:delete_orphaned_content RAILS_ENV=production >/dev/null 2>&1)
Jul 19 22:00:05 katellod02p CROND[37695]: (root) CMD (foreman-rake katello:delete_orphaned_content RAILS_ENV=production >/dev/null 2>&1)
Jul 26 22:00:09 katellod02p CROND[27089]: (root) CMD (foreman-rake katello:delete_orphaned_content RAILS_ENV=production >/dev/null 2>&1)
Aug  2 22:00:09 katellod02p CROND[31155]: (root) CMD (foreman-rake katello:delete_orphaned_content RAILS_ENV=production >/dev/null 2>&1)
Aug  9 22:00:11 katellod02p CROND[64986]: (root) CMD (foreman-rake katello:delete_orphaned_content RAILS_ENV=production >/dev/null 2>&1)
Aug 16 22:00:08 katellod02p CROND[28795]: (root) CMD (foreman-rake katello:delete_orphaned_content RAILS_ENV=production >/dev/null 2>&1)
Aug 23 22:00:08 katellod02p CROND[46151]: (root) CMD (foreman-rake katello:delete_orphaned_content RAILS_ENV=production >/dev/null 2>&1)
Aug 30 22:00:08 katellod02p CROND[15882]: (root) CMD (foreman-rake katello:delete_orphaned_content RAILS_ENV=production >/dev/null 2>&1)
Sep  6 22:00:08 katellod02p CROND[40719]: (root) CMD (foreman-rake katello:delete_orphaned_content RAILS_ENV=production >/dev/null 2>&1)
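(For reference, a grep over the rotated cron logs similar to the one below should reproduce this list; the path may differ on other systems.)

grep -h delete_orphaned_content /var/log/cron*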

Oh, that is interesting - just to confirm - you are experiencing content deletions even after removing the orphan cleanup from the schedule of jobs?

Yes, the cron is commented out, and every Monday, for weeks, we have come in to deleted repos/ISOs. I then do a complete sync on each proxy to get them back, but this takes several hours. This is a high priority for us; managers are wanting to get off Foreman.
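(In case it matters how the sync is triggered: besides the web UI, a complete sync can presumably be started per proxy with the hammer CLI using something like the line below, though the exact option names may differ by version.)

hammer capsule content synchronize --name katrepoa02p.a.example.com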

@tbrooks Thinking more about what might be happening (and it's clear that pulp is being directed to remove repositories), can you also forward the foreman-proxy logs from one of the affected proxies (usually at /var/log/foreman-proxy/) from the same time period as the deletions we see in the messages log?

Additionally, the same extraction from the foreman log on the pulp primary host (wherever foreman/katello is running), from the same time period.

I am curious what else might be occurring that is triggering these deletions.

Unfortunately, there is almost nothing in /var/log/foreman-proxy/proxy.log. It looks like the logs have rolled off for the 18th, but I was able to see this on one of the proxies for October 25th, during the same time period in which things are deleted:

2020-10-25T21:31:37 a3f7fc58 [I] Started GET /pulpnode/status/disk_usage
2020-10-25T21:31:37 a3f7fc58 [W] File at /var/lib/mongodb defined in mongodb_dir parameter doesn't exist or is unreadable
2020-10-25T21:31:37 a3f7fc58 [I] Finished GET /pulpnode/status/disk_usage with 200 (4.45 ms)
2020-10-25T22:35:21 49ee8080 [I] Started GET /pulpnode/status/disk_usage
2020-10-25T22:35:21 49ee8080 [W] File at /var/lib/mongodb defined in mongodb_dir parameter doesn't exist or is unreadable
2020-10-25T22:35:21 49ee8080 [I] Finished GET /pulpnode/status/disk_usage with 200 (4.69 ms)

Only one proxy had anything in proxy.log at all. The smart_proxy_dynflow_core.log files are empty on all proxies. There are no other log files available in /var/log/foreman-proxy/.
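If more detail would help before next Sunday, I believe the proxy log level can be raised by setting it to DEBUG in /etc/foreman-proxy/settings.yml and restarting the foreman-proxy service (please correct me if that is the wrong knob):

# /etc/foreman-proxy/settings.yml
:log_level: DEBUG

# then restart the proxy so the setting takes effect
systemctl restart foreman-proxy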

I'm attaching the production.log from the primary for October 18th, starting at 22:00. I had to split it up to post it here.

production.2020-10-18T22.log00.gz (3 MB)

Here is the second file:

production.2020-10-18T22.log01.gz (623.2 KB)