Removing old content from /var/lib/pulp/media - need help with orphan cleanup

Problem:

Summary: I’m trying to remove old content from /var/lib/pulp, but after plenty of googling, reading, and log review I can’t figure out why it doesn’t seem to be working.

I have a /var/lib/pulp directory on a dedicated 1.4 TB filesystem that is currently using 1.2 TB (90%) and triggering alerts. I’d like to remove old rpm content from /var/lib/pulp/media, which is where most of the space is being consumed.
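
To confirm where the space is going, I checked the filesystem and the Pulp artifact store directly (the du path assumes the default Pulp MEDIA_ROOT of /var/lib/pulp/media):

# df -h /var/lib/pulp
# du -sh /var/lib/pulp/media/artifact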

This Foreman serves three Linux platforms, and we only need rpm content for the three most recent Content Views published for each platform.

I’m under the impression that if we delete old Content Views, the rpms associated with them become ‘orphaned’ and thus candidates for deletion.
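
To find deletion candidates, I’ve been listing the versions of each Content View and deleting old ones by id, along these lines (the CV name, organization, and id here are placeholders for our actual values):

# hammer content-view version list --content-view "RHEL8" --organization "MyOrg"
# hammer content-view version delete --id 123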

A cron job is deployed at /etc/cron.d/katello, and the logs show it runs periodically with:

foreman-rake katello:delete_orphaned_content RAILS_ENV=production

However, when I delete Content Views and run this manually, we’ve never seen more than about 25 GB of space freed.

Expected outcome:

Given that 1.2 TB of space is in use, I’d expect that deleting many to most of the Content Views, leaving only a few CVs, would free a large amount of space.

Foreman and Proxy versions:

Foreman 3.12-1
Katello 4.14.3-1
python3.11-pulp-rpm-3.26-1
python3.11-pulpcore-3.49.22-1

Foreman and Proxy plugin versions:

Distribution and version:

RHEL 8.10

Other relevant data:

The task appears to run without error:

# hammer task list --search orphan
-------------------------------------|-----------------|---------|---------|---------------------|---------------------|--------------------|---------------|------------
ID                                   | ACTION          | STATE   | RESULT  | STARTED AT          | ENDED AT            | DURATION           | OWNER         | TASK ERRORS
-------------------------------------|-----------------|---------|---------|---------------------|---------------------|--------------------|---------------|------------
27cda6be-4540-4d49-b09d-7540b64da390 | Remove orphans  | stopped | success | 2025/03/15 00:03:51 | 2025/03/15 00:03:58 | 7.161574           | foreman_admin |
6f3dbebf-8587-40c5-b7b5-e6bb051f07f2 | Remove orphans  | stopped | success | 2025/03/14 23:43:02 | 2025/03/14 23:43:08 | 5.990856           | foreman_admin |
99daa4cf-0780-4748-b0f0-bf548631a4e3 | Remove orphans  | stopped | success | 2025/03/14 23:35:56 | 2025/03/14 23:36:23 | 26.522241          | foreman_admin |
37497ad4-0399-4226-8633-9e5dbe4505b5 | Remove orphans  | stopped | success | 2025/03/14 21:20:00 | 2025/03/14 21:21:43 | 103.307088         | foreman_admin |

In /var/log/foreman/production.log I see task activity like this:

2025-03-14T15:20:00 [I|bac|] Task {label: Actions::Katello::OrphanCleanup::RemoveOrphans, id: 37497ad4-0399-4226-8633-9e5dbe4505b5, execution_plan_id: 0bfff44b-40a0-436d-bf66-2c0f609ec175} state changed: planning
2025-03-14T15:20:00 [I|bac|] Task {label: Actions::Katello::OrphanCleanup::RemoveOrphans, id: 37497ad4-0399-4226-8633-9e5dbe4505b5, execution_plan_id: 0bfff44b-40a0-436d-bf66-2c0f609ec175} state changed: planned
2025-03-14T15:20:00 [I|bac|] Task {label: Actions::Katello::OrphanCleanup::RemoveOrphans, id: 37497ad4-0399-4226-8633-9e5dbe4505b5, execution_plan_id: 0bfff44b-40a0-436d-bf66-2c0f609ec175} state changed: running
2025-03-14T15:21:43 [I|bac|] Task {label: Actions::Katello::OrphanCleanup::RemoveOrphans, id: 37497ad4-0399-4226-8633-9e5dbe4505b5, execution_plan_id: 0bfff44b-40a0-436d-bf66-2c0f609ec175} state changed: stopped  result: success

Based on my googling and exploring for an answer, it seems like this should be working, and yet I don’t see any obvious errors or warnings indicating why it’s not.

Reviewing a graph of storage usage on /var/lib/pulp for the last year, I can see several points where many hundreds of gigabytes were freed from this filesystem. But I wonder whether those were periods when we removed entire repositories, such as Oracle 7, RHEL 7, CentOS 7, etc., and not just individual Content Views.

I am 99% sure that these drops in storage consumption were deletions of entire content views, not just an orphan-cleanup.
In general, a package is considered an orphan if it does not belong to any content-view version and is no longer in any product. Whether that happens as part of the CV version lifecycle is highly dependent on the upstream repositories you are syncing.
Most upstreams (including all EL-based distros I’ve worked with so far) usually keep older versions of packages in the repository, so you can effectively do dnf downgrade and similar actions. In this case, all the packages will stay part of the current CV version forever and thus will never become orphans.
Other upstreams (like Ubuntu afaik) only keep the latest version of each package in the repo and remove old versions as soon as they release an update to a package. In this case, your expectation would be correct that packages get marked as orphaned after a CV version removal.
To my knowledge, there is no feature in Katello/Pulp to only mirror the latest version/latest X versions of each package.
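
You can see this behaviour from any EL client: the upstream repo advertises many versions of the same package, all of which stay referenced by the current CV version. A quick way to check (any package name will do):

# dnf --showduplicates list kernel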

This only cleans up the Katello content in the Foreman/Katello database. It does not clean up the orphans in the pulpcore backend. There is a weekly task which removes the content from Pulp; on my Foreman server it runs every Sunday evening. That task removes the content which is no longer referenced anywhere.

You can check with curl to see when it happens and what has been done, e.g. on the main server:

# curl --capath : --key /etc/foreman/client_key.pem --cert /etc/foreman/client_cert.pem --cacert /etc/foreman/proxy_ca.pem  'https://foreman8.example.com/pulp/api/v3/tasks/?name=pulpcore.app.tasks.orphan.orphan_cleanup&ordering=-started_at&limit=3' | jq .results

This prints out the 3 latest runs of the task.

[
  {
    "pulp_href": "/pulp/api/v3/tasks/0195a0c3-470d-7297-b431-ce77589f46e6/",
    "pulp_created": "2025-03-16T21:01:47.150259Z",
    "pulp_last_updated": "2025-03-16T21:01:47.150275Z",
    "state": "completed",
    "name": "pulpcore.app.tasks.orphan.orphan_cleanup",
    "logging_cid": "edc9ffdfde604a84814e0640ebe71f45",
    "created_by": "/pulp/api/v3/users/1/",
    "unblocked_at": "2025-03-16T21:01:47.223922Z",
    "started_at": "2025-03-16T21:01:51.536228Z",
    "finished_at": "2025-03-16T21:04:36.497417Z",
    "error": null,
    "worker": null,
    "parent_task": null,
    "child_tasks": [],
    "task_group": null,
    "progress_reports": [
      {
        "message": "Clean up orphan Content",
        "code": "clean-up.content",
        "state": "completed",
        "total": 26176,
        "done": 28712,
        "suffix": null
      },
      {
        "message": "Clean up orphan Artifacts",
        "code": "clean-up.artifacts",
        "state": "completed",
        "total": 848,
        "done": 848,
        "suffix": null
      }
    ],
    "created_resources": [],
    "reserved_resources_record": [
      "/api/v3/orphans/cleanup/",
      "shared:/pulp/api/v3/domains/018d10eb-22f7-72a8-ab19-6f036c81f631/"
    ]
  },
  {
    "pulp_href": "/pulp/api/v3/tasks/01957cb6-bf39-71c8-b964-bece33c5bf80/",
    "pulp_created": "2025-03-09T21:01:46.170423Z",
    "pulp_last_updated": "2025-03-09T21:01:46.170439Z",
    "state": "completed",
    "name": "pulpcore.app.tasks.orphan.orphan_cleanup",
    "logging_cid": "29da39f04abd4087a6a40ee442198340",
    "created_by": "/pulp/api/v3/users/1/",
    "unblocked_at": "2025-03-09T21:01:46.277690Z",
    "started_at": "2025-03-09T21:01:48.167506Z",
    "finished_at": "2025-03-09T21:04:20.237912Z",
    "error": null,
    "worker": null,
    "parent_task": null,
    "child_tasks": [],
    "task_group": null,
    "progress_reports": [
      {
        "message": "Clean up orphan Content",
        "code": "clean-up.content",
        "state": "completed",
        "total": 24973,
        "done": 24973,
        "suffix": null
      },
      {
        "message": "Clean up orphan Artifacts",
        "code": "clean-up.artifacts",
        "state": "completed",
        "total": 772,
        "done": 772,
        "suffix": null
      }
    ],
    "created_resources": [],
    "reserved_resources_record": [
      "/api/v3/orphans/cleanup/",
      "shared:/pulp/api/v3/domains/018d10eb-22f7-72a8-ab19-6f036c81f631/"
    ]
  },
  {
    "pulp_href": "/pulp/api/v3/tasks/019558aa-aa89-74f8-bf89-4420105d9cc5/",
    "pulp_created": "2025-03-02T21:02:14.666412Z",
    "pulp_last_updated": "2025-03-02T21:02:14.666428Z",
    "state": "completed",
    "name": "pulpcore.app.tasks.orphan.orphan_cleanup",
    "logging_cid": "e542658e18904759865181b842d71391",
    "created_by": "/pulp/api/v3/users/1/",
    "unblocked_at": "2025-03-02T21:02:14.770418Z",
    "started_at": "2025-03-02T21:02:17.232947Z",
    "finished_at": "2025-03-02T21:05:33.517930Z",
    "error": null,
    "worker": null,
    "parent_task": null,
    "child_tasks": [],
    "task_group": null,
    "progress_reports": [
      {
        "message": "Clean up orphan Content",
        "code": "clean-up.content",
        "state": "completed",
        "total": 29069,
        "done": 29069,
        "suffix": null
      },
      {
        "message": "Clean up orphan Artifacts",
        "code": "clean-up.artifacts",
        "state": "completed",
        "total": 1431,
        "done": 1431,
        "suffix": null
      }
    ],
    "created_resources": [],
    "reserved_resources_record": [
      "/api/v3/orphans/cleanup/",
      "shared:/pulp/api/v3/domains/018d10eb-22f7-72a8-ab19-6f036c81f631/"
    ]
  }
]

If a run failed, the output would also show the error.
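
If you only want to see failed runs, the tasks API can also filter on state, e.g.:

# curl --capath : --key /etc/foreman/client_key.pem --cert /etc/foreman/client_cert.pem --cacert /etc/foreman/proxy_ca.pem 'https://foreman8.example.com/pulp/api/v3/tasks/?name=pulpcore.app.tasks.orphan.orphan_cleanup&state=failed' | jq '.results[] | {started_at, error}'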

I can see the task running in our monitoring and in the disk utilization. Please note that the timestamps in the output end with a “Z”, i.e. they are UTC times.

I’d like to clarify with some real world examples:

RHEL 8 was released in 2019 with an initial upstream source repository and packages. Packages are added to the updates repo but never removed.

A user configures Foreman about that time to consume RHEL 8 and defines content view 1.0. Clients start downloading initial package updates from it.

Years go by, and new packages are downloaded from upstream, stored locally, and periodically associated with new content view versions. We’re now on content view version 150 with the most recent packages.

The older packages stay in the repo for two reasons: first, because Red Hat never deletes them from the source, and second, because they are associated with a content view, and any client that shows up wanting that content view will need them.

If you delete the content view, clients won’t be able to see that unique view of updates, but the packages will remain in local storage because they are still associated with the RHEL 8 distribution.

If you stop using RHEL 8 and delete the RHEL 8 distribution from Foreman, those packages would finally become ‘orphans’. The periodic weekly task could then identify them and remove them from local Foreman/Pulp storage in /var/lib/pulp/media.

Correct?

In the case of another upstream repo like Ubuntu where older update packages may be removed after some time, things might be different:

Any package associated with a Foreman content view should be retained, because a client may show up wanting to consume that content view’s older packages.

If the upstream repo deletes the file, Foreman should still retain it as long as a local content view references it, to protect the user’s interests.

But if the local content view is deleted and the upstream repo removes the package, then it is associated with neither and becomes an ‘orphan’, ready to be pruned from local storage.

Correct?

So, overall, the amount of storage space for something like RHEL will start out at 1 unit, then grow over time, never decreasing, because even if you delete the content views associated with the packages, upstream Red Hat never removes them from the repo.

But an Ubuntu (or other) repo that deletes older content upstream might start out at 1 unit and grow over time, yet if a user is pruning old content views and upstream is deleting old packages, local storage may rise as new packages arrive and fall as older packages become ‘orphaned’ and are removed.

Correct?

In my specific case, this means that just removing content views for RHEL would not be expected to free much space, because Foreman/Pulp is still maintaining an accurate mirror of the ever-growing upstream RHEL repository.

Thanks for your time, I’m hoping I’ve got this right and that it helps others.

Not quite: content views have only a limited effect on what is downloaded, synced, or kept for a repository. The repository contains the packages. The repository settings define what is synchronized, when it is downloaded, and how long it is kept.

This all depends on the download and mirroring policy.

The docs explain it in detail: Managing content

The download policy defines whether packages are downloaded immediately during each sync or only on demand, i.e. when someone tries to download the package from the Foreman server. With “on demand” you can also clean up the downloaded content if you want.
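
You can change the policy per repository, e.g. with hammer (the repository id is a placeholder):

# hammer repository update --id 42 --download-policy on_demand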

The mirroring policy defines how the repository is mirrored: if you set it to “Additive”, old content never gets removed. If you set it to “Content Only” or “Complete Mirroring”, old content gets removed from the repository when it has been removed from the upstream repository.
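
The mirroring policy can be changed the same way; as far as I know hammer accepts the values additive, mirror_content_only and mirror_complete (the id is again a placeholder):

# hammer repository update --id 42 --mirroring-policy mirror_content_only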

Content view versions are a snapshot of the repositories you have added to the content view. As long as you keep that snapshot, it will keep the referenced repository content. If your download policy is immediate, it’ll keep everything. If your download policy is on demand, it only downloads and keeps what has been used. Which, of course, can be a problem if your content view references an old package that has been removed from the upstream repository. That’s why you can really only use “on demand” on upstream repositories which don’t remove old content, in particular if you want to keep and use older content views for a longer time.

So if you never purge old content view versions with immediate download policy on your repositories, you’ll see an ever growing storage usage.
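
To keep that growth in check, you can regularly purge unused old content view versions, e.g. with something like this (keeps the three latest unused versions; the name and organization are placeholders):

# hammer content-view purge --name "RHEL8" --organization "MyOrg" --count 3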

As RHEL repositories keep old package versions in the repository, you can, of course, use the on demand download policy and reclaim space. That should reduce the number of packages stored locally to the minimum of what you actually use.
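
If your Katello version supports it, I believe you can also trigger the space reclaim for an on_demand repository from the CLI (the id is a placeholder):

# hammer repository reclaim-space --id 42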