Orphaned pulp content

Because of the EL7 deprecation, I am playing around with a new Katello 4.4 installation on AlmaLinux 8, using Ansible to configure products, repos, etc.

For testing, I have just removed all products, disabled all Red Hat repositories, removed the subscriptions and deleted the manifest. So technically, the server should have no content at all.

I also ran

foreman-rake katello:delete_orphaned_content RAILS_ENV=production

to clean up.

After that, 40G of the 120G of content I had synced before was deleted. But there are still approx. 80G of content in /var/lib/pulp. I have looked into the database and, for instance, pulpcore.core_artifact still has 33894 rows. So it looks to me as if there is a lot of orphaned content which is not detected and cleaned up…
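A quick way to double-check those numbers from the shell (assuming the default pulpcore database name and the /var/lib/pulp/media artifact location of a standard Katello install) would be something like:

# count artifact rows in the Pulp database (database name assumed: pulpcore)
sudo -u postgres psql pulpcore -c 'SELECT count(*) FROM core_artifact;'
# show how much space the stored artifacts actually occupy on disk
du -sh /var/lib/pulp/media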

Hi @gvde ,

That command should create 1 or more tasks labeled Actions::Katello::OrphanCleanup::RemoveOrphans

Could you please check on the WebUI: navigate to Monitor → Tasks, enter label = Actions::Katello::OrphanCleanup::RemoveOrphans into the search bar and then press enter.

In the results, please locate the task(s) corresponding to the time when the rake script was run. Can you confirm that these have the state stopped with the result success, or is there some other combination of state and result?
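If the command line is easier, the same search should also work with hammer (assuming the foreman-tasks hammer plugin, which Katello normally installs):

# list the orphan cleanup tasks with their state and result
hammer task list --search 'label = Actions::Katello::OrphanCleanup::RemoveOrphans'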

Kind regards,

Yes. One, as it’s only the main server at the moment; no external content proxies yet.

Yes, they are all stopped - success…

Hi @gvde. I looked into what this task does. The intent behind it is to remove content which has been synced to external content proxies but is not published in any CV version that is currently promoted to any Lifecycle Environment assigned to that content proxy.

So, for example, if you have an environment path like Library → Devel → QA → Prod and one content proxy providing content for a datacenter which only has Devel and QA, then packages in a CV version that is only promoted to Prod would be removed from that content proxy by this task.

Since the Katello primary server must have all content for all LCEs so that it can be synced out to any content proxies, this should explain why it didn’t seem to do too much in your case.

I believe what you are looking for instead would be the ‘reclaim space’ job you can run for any content proxy, including the internal content proxy to Katello primary. From the WebUI:

Infrastructure → Smart Proxies → Click on the Smart Proxy you wish to clean → Click on the ‘Reclaim Space’ button. It will create a task which you can follow at Monitor → Tasks.

This will clean up downloaded packages for repositories which have the download policy set to ‘on_demand’, so that they will not be cached and stored on disk again unless requested by some content host.
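If you prefer the CLI, a rough equivalent for finding the proxy and then following the resulting task would be something like this (the label pattern used in the search is an assumption and may differ between versions):

# find the ID of the smart proxy you want to clean
hammer capsule list
# after starting 'Reclaim Space' from the WebUI, follow the task
hammer task list --search 'label ~ ReclaimSpace'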

I don't think you understood the issue I had: it was a single server, no external proxies; I added some repos, synced and downloaded content, then deleted all products, repos etc. again. It has nothing to do with on-demand downloads.

@gvde

Chris here, taking over the issue. I will create a Redmine issue to make sure our cleanup script is working correctly.


Has there been any progress on this?

I have removed repositories that were around 500 GB in size, but that doesn't seem to have cleared any space; I'm still using 1.8 TB of space when it should have been around 1.3 TB.
I am running Katello 4.8.4 on AlmaLinux 8.

Check that there are no content views holding on to the content that you want to get rid of. Try publishing a new version of the content view and then delete all older versions. Log in to the Foreman server and, in a shell as root, run:

foreman-rake katello:delete_orphaned_content RAILS_ENV=production

See if that frees up some disk space for you.
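As a sketch with hammer (organization and content view names are placeholders; a version can only be deleted once it is no longer in any lifecycle environment):

# publish a new version of the content view
hammer content-view publish --organization "MyOrg" --name "my-cv"
# list the versions and note the IDs of the old ones
hammer content-view version list --organization "MyOrg" --content-view "my-cv"
# delete an old version (must not be promoted to any lifecycle environment)
hammer content-view version delete --id <old-version-id>
# finally remove the now-orphaned content
foreman-rake katello:delete_orphaned_content RAILS_ENV=production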

Thank you @jruk, what you suggested is what I followed. If there were any products published to a content view, it would not let me delete the products; in this case I made sure that all the CVs that had those products published were gone, then I proceeded to delete all the unnecessary products. However, the disk space usage did not decrease, not even after running the foreman-rake command. I am not really sure why it isn't clearing the space.

Hi, does anyone know if there is a fix for this or a workaround to clear space?

I have run pulp repository list --limit 532 and I see repositories that are no longer visible in Katello; running the commands here doesn't really help to remove all those repositories.

Done that.

It's now been pegging my CPU at 100% for 7 hours; top shows it is PostgreSQL being hammered.

Task is stuck at 50%

So, the issues with cleaning up orphaned content and reclaiming space were raised almost 2 years ago now, in this report, and it's been ignored for 2 years, while the functionality has remained hopelessly broken.

What an absolute, complete & utter disgrace.

Can you check the Foreman task in the UI and look at the Dynflow console to get the exact action that seems stuck?

Yep, no problem. I stopped the Foreman services over the weekend and have now restarted them, and the reclaim space job has been restarted, so in the Dynflow console I can see it is in:

2: Actions::Pulp3::CapsuleContent::ReclaimSpace (waiting for Pulp to finish the task reclaim_space)

Did the pulp task finish after service restart?
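In case it helps, the Pulp side can also be checked directly with pulp-cli, assuming it is installed and configured against the local Pulp API:

# show Pulp tasks that are still running (drop --state to see recently finished ones too)
pulp task list --state running --limit 10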

Another note, not fully related to the last posts: I just manually started the remove orphans task. I have noticed that it takes a couple of hours lately on the main Foreman server. Three PostgreSQL processes are running more or less at 100% for hours. The query:

 18140 | foreman   | 587103 |    18139 | foreman   | /usr/bin/sidekiq       |             |                 |          -1 | 2024-02-12 16:12:04.092907+01 | 2024-02-12 16:12:04.174938+01 | 2024-02-12 16:12:04.175741+01 | 2024-02-12 16:12:04.175741+01 |               
  |            | active              |             |    397156863 | SELECT "katello_rpms".* FROM "katello_rpms" WHERE "katello_rpms"."id" NOT IN (SELECT "katello_repository_rpms"."rpm_id" FROM "katello_repository_rpms") | client backend
 18140 | foreman   | 587104 |    18139 | foreman   | /usr/bin/sidekiq       |             |                 |             | 2024-02-12 16:12:04.177155+01 | 2024-02-12 16:12:04.174938+01 | 2024-02-12 16:12:04.175741+01 | 2024-02-12 16:12:04.179351+01 |               
  |            | active              |             |    397156863 | SELECT "katello_rpms".* FROM "katello_rpms" WHERE "katello_rpms"."id" NOT IN (SELECT "katello_repository_rpms"."rpm_id" FROM "katello_repository_rpms") | parallel worker
 18140 | foreman   | 587105 |    18139 | foreman   | /usr/bin/sidekiq       |             |                 |             | 2024-02-12 16:12:04.177605+01 | 2024-02-12 16:12:04.174938+01 | 2024-02-12 16:12:04.175741+01 | 2024-02-12 16:12:04.180237+01 |               
  |            | active              |             |    397156863 | SELECT "katello_rpms".* FROM "katello_rpms" WHERE "katello_rpms"."id" NOT IN (SELECT "katello_repository_rpms"."rpm_id" FROM "katello_repository_rpms") | parallel worker

Current time is 20:53… Explain:

foreman=# explain SELECT "katello_rpms".* FROM "katello_rpms" WHERE "katello_rpms"."id" NOT IN (SELECT "katello_repository_rpms"."rpm_id" FROM "katello_repository_rpms") ;
                                             QUERY PLAN                                              
-----------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..12916336526.48 rows=212588 width=709)
   Workers Planned: 2
   ->  Parallel Seq Scan on katello_rpms  (cost=0.00..12916314267.68 rows=88578 width=709)
         Filter: (NOT (SubPlan 1))
         SubPlan 1
           ->  Materialize  (cost=0.00..134234.43 rows=4633095 width=4)
                 ->  Seq Scan on katello_repository_rpms  (cost=0.00..92969.95 rows=4633095 width=4)
(7 rows)

If I understand correctly, it actually scans all 88578 rows of katello_rpms and, for each row, it scans all 4633095 rows of katello_repository_rpms…

Wow… That is not how we expected NOT IN to work in PostgreSQL.

WHERE "katello_rpms"."id" NOT IN (SELECT "katello_repository_rpms"."rpm_id" FROM "katello_repository_rpms") ;

One would expect the inner query to run once and then act as a filter for the outer query. Strange that PostgreSQL would run it differently. :confused:
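For comparison, a NOT EXISTS formulation of the same query would normally let the planner use an anti-join instead of re-running the materialized subplan for every outer row. This is just a sketch, not the query Katello actually issues (and it is only equivalent as long as rpm_id cannot be NULL):

SELECT "katello_rpms".*
FROM "katello_rpms"
WHERE NOT EXISTS (
  SELECT 1 FROM "katello_repository_rpms"
  WHERE "katello_repository_rpms"."rpm_id" = "katello_rpms"."id"
);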

What katello version is this btw?

Hmm, yeah it did.
It took 4 days, but it did complete “successfully”.
It appears to have freed up only 4 GB of space, which is less than expected…

I am on 4.10

foreman-3.8.0-1.el8.noarch
katello-4.10.0-1.el8.noarch

The task I started yesterday finished after 5 hours and 6 minutes…

Now I am on

foreman-3.9.1-1.el8.noarch
katello-4.11.1-1.el8.noarch

still the same issue.

I was looking at the orphan cleanup workflow and remembered there's a setting:

        setting 'orphan_protection_time',
        type: :integer,
        default: 1440,
        full_name: N_('Orphaned Content Protection Time'),
        description: N_('Time in minutes before content that is not contained within a repository and has not been accessed is considered orphaned.')

Pulp only considers content orphaned once the above 1440 minutes have passed since all repositories containing the content were deleted. I wonder whether a second cleanup task will pick those up for deletion and clean up more space.
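A sketch of how that could be tested (the setting name comes from the snippet above; the value is in minutes, and 0 effectively disables the protection window):

# lower the orphan protection window, then re-run the cleanup
hammer settings set --name orphan_protection_time --value 0
foreman-rake katello:delete_orphaned_content RAILS_ENV=production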

Trying to reproduce the issue around some repositories not getting marked for deletion during orphan cleanup.