RFC: Katello Alternate Content Sources

iballou · July 26, 2021, 7:38pm

Hello Foreman community,

I would like to get some feedback around a planned Katello feature called “Alternate Content Sources”. This is a feature that was available through behind-the-scenes Pulp 2 configuration changes, but never directly in Katello itself. While it will once again be configurable outside of Katello through a later Pulp 3 release, we would like to make it a fully-fledged feature in Katello for a friendlier user experience.

What are alternate content sources (ACSs)?

An alternate content source is a filesystem or network path where Pulp will look first when downloading content during a sync or on-demand content fetch. A repository with a slower connection might need to remain the authority that determines the repository structure, but a faster ACS could be defined so that the actual content (which is potentially much larger than the repodata) downloads over a faster connection. With Katello in mind, this could be helpful for speeding up smart proxy synchronizations. Perhaps your smart proxy on the other side of the world has a slow connection to your Katello server. Instead of syncing CentOS 8 AppStream over this slow connection, an ACS could be set up to download it from your smart proxy’s local CentOS mirror.

Pulp 3 will be handling all of the backend ACS logic. Katello will simply create and orchestrate the ACSs. In Katello, an ACS needs the associated smart proxies, optional content credentials, a base URL (e.g. http://192.168.1.11/pub/content/), and subpaths (e.g. [beta/rhel/server/7/x86_64/satellite/6/os/, eus/rhel/server/7/7.3/x86_64/os/, dist/rhel/client/6/6.8/i386/kickstart/Client/, …]).
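To make the base URL and subpath relationship concrete, here is a minimal Python sketch of how the two might combine into the locations an ACS covers. The function name `acs_urls` is hypothetical, and the slash handling is an assumption for illustration, not a statement of how Pulp or Katello joins them.

```python
from typing import List


def acs_urls(base_url: str, subpaths: List[str]) -> List[str]:
    """Join each non-empty subpath onto the base URL (hypothetical helper)."""
    base = base_url.rstrip("/") + "/"
    urls = []
    for sub in subpaths:
        cleaned = sub.strip().strip("/")
        if cleaned:  # skip blank entries
            urls.append(base + cleaned + "/")
    return urls


urls = acs_urls(
    "http://192.168.1.11/pub/content/",
    [
        "beta/rhel/server/7/x86_64/satellite/6/os/",
        "eus/rhel/server/7/7.3/x86_64/os/",
    ],
)
for url in urls:
    print(url)
```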

In Pulp, to create an ACS, Katello will first create a remote with the base path, auth credentials, and any necessary headers. Then, Katello will create one or more ACSs using that remote as an argument. The backend ACS itself will hold the subpaths. Then, to actually use the ACS, Katello will “refresh” it so that Pulp can index all of the content metadata.
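The remote, then ACS, then refresh ordering can be sketched in Python as below. This is only an illustration of the sequence: `RecordingClient` is a stand-in that records calls instead of talking to Pulp, and the endpoint paths, field names, and hrefs are assumptions that should be checked against the Pulp 3 API once ACS support is released.

```python
from typing import Any, Dict, List, Optional, Tuple

# Assumed endpoint paths, for illustration only.
REMOTE_ENDPOINT = "/pulp/api/v3/remotes/rpm/rpm/"
ACS_ENDPOINT = "/pulp/api/v3/acs/rpm/rpm/"


class RecordingClient:
    """Stand-in for an HTTP client: records calls and returns made-up hrefs."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Dict[str, Any]]] = []

    def post(self, path: str, body: Dict[str, Any]) -> Dict[str, Any]:
        self.calls.append((path, body))
        return {"pulp_href": f"{path}{len(self.calls)}/"}


def create_and_refresh_acs(
    client: RecordingClient,
    name: str,
    base_url: str,
    subpaths: List[str],
    credentials: Optional[Dict[str, str]] = None,
) -> Tuple[str, str]:
    # 1. The remote carries the base path and any credentials.
    remote_body: Dict[str, Any] = {"name": name, "url": base_url}
    remote_body.update(credentials or {})
    remote = client.post(REMOTE_ENDPOINT, remote_body)

    # 2. The ACS references the remote and holds the subpaths.
    acs = client.post(
        ACS_ENDPOINT,
        {"name": name, "remote": remote["pulp_href"], "paths": subpaths},
    )

    # 3. Refreshing lets the backend index what the ACS offers.
    client.post(acs["pulp_href"] + "refresh/", {})
    return remote["pulp_href"], acs["pulp_href"]


client = RecordingClient()
remote_href, acs_href = create_and_refresh_acs(
    client, "local-mirror", "http://192.168.1.11/pub/content/", ["eus/rhel/server/7/7.3/x86_64/os/"]
)
for path, _body in client.calls:
    print(path)
```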

For now, every ACS will be global per smart proxy. That means, in the background, Pulp will consider every ACS for every repo sync. There is no need to associate repositories to ACSs.

How will ACSs be used in Katello?

Firstly, assume ACSs will be yum-only for now.

Here is the proposed workflow from the UI’s standpoint:

The user creates or enables some repositories.
The user optionally creates content credentials for protected alternate content sources.
The user goes to the “Alternate Content Sources” page and clicks a button to create a new one.
The user chooses the type of ACS: Custom, Red Hat CDN, or Red Hat Update Infrastructure (RHUI).
→ For custom ACSs, the user enters the ACS’s smart proxies, content credentials or basic auth info, base URL, and a list of subpaths. Katello may also suggest base paths based on repositories that are already synced, but this needs to be investigated more.
→ For Red Hat CDN ACSs, the user will likely only need to select the product and repositories that should be included in the ACS. Everything else can come from the Katello database as long as there is a valid subscription manifest. One product per ACS seems like the best balance between usability and fit with Pulp’s backend objects.
→ RHUI ACSs are still being planned out, but we hope to include a shortcut for creating them, much like the CDN ACSs.
The user submits the ACS form. Katello tells Pulp to refresh the ACS in the background.
→ Refreshing for all ACSs will also happen on a cron schedule to ensure the metadata is up to date. For now, Katello will likely create a cron entry and run an ACS refresh rake task.
→ Existing cron entry: katello.cron in theforeman/foreman-packaging (rpm/develop branch) on GitHub
The user syncs a repository and enjoys the benefits of the ACS in the background.

The user will be able to update ACSs by selecting one from the list of ACSs and going to its information page, much like other Foreman or Katello constructs.

We are also going to have Hammer, API, and Foreman Ansible Module support for ACSs.

Development specifics

Planned DB tables:

katello_alternate_content_sources
- belongs_to :gpg_key, :ssl_ca_cert, :ssl_client_cert, :ssl_client_key (optional)
- belongs_to :product (optional, for CDN ACS)
- has_many :contents (optional, for CDN ACS)
  - Must belong to the same product
- base_url
- headers
- subpaths (array of strings)
- type (custom, cdn, rhui)
katello_smart_proxy_alternate_content_sources
- smart_proxy_id
- alternate_content_source_id
- remote_href
- alternate_content_source_href

Backend DB steps after the create form is submitted for a custom ACS:

Create ACS in Katello DB
Create ACSs in Pulp for each Smart Proxy associated to the Katello ACS:
- Create Pulp remote using Katello ACS base url & auth credentials
- Create Pulp ACS using the created remote and the Katello ACS’s subpaths
Create Smart Proxy ACSs with a remote href, ACS href, Smart Proxy ID, and ACS ID from the ACS creation tasks
- Necessary since smart proxies can share Katello ACSs but will have separate ACSs and ACS remotes in Pulp
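As a rough Python sketch of the steps above: one Katello ACS fans out into one join row per smart proxy, each holding that proxy's own Pulp hrefs. The dataclasses mirror the planned tables only loosely, and `create_on_proxies` and `fake_create_in_pulp` are hypothetical names. The fake stands in for the real Pulp calls and returns made-up hrefs.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class AlternateContentSource:
    id: int
    base_url: str
    subpaths: List[str]


@dataclass
class SmartProxyAlternateContentSource:
    smart_proxy_id: int
    alternate_content_source_id: int
    remote_href: str
    alternate_content_source_href: str


def create_on_proxies(
    acs: AlternateContentSource,
    smart_proxy_ids: List[int],
    create_in_pulp: Callable[[int, AlternateContentSource], Tuple[str, str]],
) -> List[SmartProxyAlternateContentSource]:
    """Create the Pulp objects on every proxy and keep one join row per proxy."""
    rows = []
    for proxy_id in smart_proxy_ids:
        remote_href, acs_href = create_in_pulp(proxy_id, acs)
        rows.append(
            SmartProxyAlternateContentSource(proxy_id, acs.id, remote_href, acs_href)
        )
    return rows


def fake_create_in_pulp(proxy_id: int, acs: AlternateContentSource) -> Tuple[str, str]:
    # Made-up hrefs; a real implementation would call that proxy's Pulp API.
    return (f"/proxy-{proxy_id}/remotes/{acs.id}/", f"/proxy-{proxy_id}/acs/{acs.id}/")


katello_acs = AlternateContentSource(1, "http://192.168.1.11/pub/content/", ["eus/rhel/server/7/7.3/x86_64/os/"])
rows = create_on_proxies(katello_acs, [10, 20], fake_create_in_pulp)
for row in rows:
    print(row)
```

Both rows share the same Katello ACS ID but carry different hrefs, which is the reason the join table stores them.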

This feature is still in the very early planning phases, so please let us know your concerns, ideas, and questions. Thanks for reading!

iballou · July 26, 2021, 8:03pm

Some questions for discussion:

I’m no UX designer, but I’d like to hear thoughts on my UI ACS creation form layout:

Name: text bar
Type: radio button that displays the appropriate fields:
- Custom:
  - Base URL: string
  - Subpaths: string separated by commas
    → Is there a UI element that would allow the user to add multiple strings in a more friendly manner?
  - Username & password for basic auth: strings
  - Content credentials for cert-based auth: drop-down list
- Red Hat CDN:
  - This is a bit fuzzy for me. The user should first select a product, and then any number of repositories. Unless it could just be per-repository for simplicity?
    → I’m imagining a drop-down list for products that the user selects. Then, a list of repositories from that product with checkboxes appears for the user to select. This new page would be in React, so perhaps that makes more things possible?
- RHUI I’ll leave out for now since it’s a bit up-in-the-air.
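If the Subpaths field does end up as a single comma-separated string, the parsing could look roughly like this Python sketch. The helper name `parse_subpaths` is hypothetical. It trims whitespace, drops empty entries, and removes duplicates while keeping the order the user typed.

```python
from typing import List


def parse_subpaths(raw: str) -> List[str]:
    """Split a comma-separated field into a clean, ordered list of subpaths."""
    seen = set()
    result = []
    for piece in raw.split(","):
        path = piece.strip()
        if path and path not in seen:  # skip blanks and repeats
            seen.add(path)
            result.append(path)
    return result


parsed = parse_subpaths(
    "beta/rhel/server/7/x86_64/satellite/6/os/, eus/rhel/server/7/7.3/x86_64/os/,, "
    "beta/rhel/server/7/x86_64/satellite/6/os/"
)
print(parsed)
```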

Does anyone see anything wrong with reusing Content Credentials for ACS certs?
Should ACSs be refreshed as part of repository synchronization if they haven’t been refreshed yet?

ekohl · July 27, 2021, 11:17am

Can you elaborate on this? Let’s say as a user I have 3 repos (RHEL, CentOS, EPEL). Does that mean that for every repo it’ll try all 3 ACSs? Meaning it will try to retrieve RHEL content from EPEL? That sounds very inefficient and can possibly be slower than just retrieving it from the main Pulp instance.

Also, is there a way to choose an ACS per Smart Proxy? For example, the Smart Proxy in Amsterdam might prefer a CentOS mirror from Amsterdam while the one in Sydney prefers one in Sydney. This may not be needed if mirrorlist support is implemented. However, for Debian or Ubuntu it may still be needed since there is no mirrorlist equivalent.

iballou · July 27, 2021, 1:55pm

I can’t speak to how Pulp is checking all of the ACSs during a sync, perhaps @ppicka or @daviddavis could help out here. You’re right though, for each repo sync, all ACSs will be “considered”, so to speak. I believe that’s how it worked in Pulp 2 as well.

Yep, this is exactly the case. Every ACS in Katello will have any number of smart proxies associated to it. Katello will create a matching ACS on each of the selected smart proxies’ Pulp servers. As an aside, this is why I needed to add in the SmartProxyAlternateContentSources table.

Justin_Sherrill · July 27, 2021, 2:20pm

Pulp caches knowledge about all the rpms in all of the ACSs configured on a Pulp server. So it does a light ‘sync’ of those ACSs in order to know which rpms are there (and their checksums). Then if an rpm is needed from EPEL for example, it will check the local knowledge to see if any of the ACSs have that rpm and will pull from one if so. So I wouldn’t say that it ‘tries’ every ACS, because that implies it’s reaching out and talking to it any time a package is requested.
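A toy Python model of that idea, purely for illustration: the class `AcsIndex` and its methods are hypothetical and say nothing about Pulp's actual data model. A refresh records which checksums each ACS offers, and a later lookup consults only that local record, falling back to the primary repository URL when no ACS has the package.

```python
from typing import Dict


class AcsIndex:
    """Local record of which package checksums each ACS can serve."""

    def __init__(self) -> None:
        # ACS name -> {checksum: download URL at that ACS}
        self._listings: Dict[str, Dict[str, str]] = {}

    def refresh(self, acs_name: str, listing: Dict[str, str]) -> None:
        """Replace the cached listing for one ACS (the light 'sync')."""
        self._listings[acs_name] = dict(listing)

    def resolve(self, checksum: str, primary_url: str) -> str:
        """Pick a download URL without contacting any ACS."""
        for listing in self._listings.values():
            if checksum in listing:
                return listing[checksum]
        return primary_url


index = AcsIndex()
index.refresh("local-mirror", {"sha256:aaa": "http://mirror.example/a.rpm"})

from_acs = index.resolve("sha256:aaa", "http://katello.example/a.rpm")
from_primary = index.resolve("sha256:bbb", "http://katello.example/b.rpm")
print(from_acs)
print(from_primary)
```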

ehelms · July 29, 2021, 2:13pm

A few questions to help me learn and a few to think about:

a) Can a Pulp be an ACS for another Pulp? Say I had 3 Pulp mirrors deployed in a load balanced setup, could I sync to one of the Pulps and let the other two treat the one I synced to as an ACS to pay the cost once and update the others more quickly?
b) Do the content credentials get created and stored in Katello’s database and then created and stored in the Pulp database on the smart-proxy?
c) How should users think about organizing their ACSs in order to classify them appropriately? I assume they will have a name, but how might a user track location? Can they use Locations? Is this another case where a tagging system would really benefit?
d) If a user needs an HTTP proxy between their Pulp and an ACS, will that be possible? How would they configure that?
e) Do you know what shows up in the logs when ACSs are involved for clients requesting an RPM in an on-demand scenario or when doing a complete sync? Will a user be able to piece together which ACSs were tried and which failed?
f) Does ACS only apply to packages? Or metadata as well?

iballou · August 2, 2021, 7:40pm

I don’t see why not. An ACS could be any filesystem or internet path. You only need to provide the certs, headers, and/or basic auth info required.

That’s what it’s looking like right now, yeah. We could avoid saving them in Katello’s DB if it feels like too much duplicated data, but I figured reusing content credentials would save some effort.

Each ACS will be associated to one or more smart proxies, so I think that would be a good basis for organization. I suppose an ACS could inherit the locations from its associated smart proxies. ACSs will indeed have names.

When a user defines an HTTP proxy and creates repositories with it, Katello sticks the proxy info on the related Pulp remote. For ACSs, it’s the same thing. The remote needs the HTTP proxy info. I hadn’t thought of supporting it though, so thanks for the reminder. I’m thinking the ACS creation page could have an HTTP proxy selection just like the repo creation page.

I need to ask the Pulp team about this one.

Just “binary content” as the Pulp epic puts it. The metadata should always be coming from the repository, not the ACS. That way, the ACS could just be a networked folder of random RPMs.

iballou · August 3, 2021, 1:59pm

I checked with the Pulp team about the logs question, and it’s still too early to say how they will look.

ehelms · August 3, 2021, 2:11pm

Can we add this use case to the RFC and/or via an issue to track this on the roadmap for ACS?

Cool, this would make for a good story attached to the set of use cases so it’s tracked.

This could be a good callout for when the docs get written for this, to help users understand that ACS does not mean all aspects of the content come from the ACS; some parts still come from Pulp.

iballou · March 18, 2022, 5:28pm

We’ve done a bit of rethinking about how users would interact with the ACS feature. When going through UI design reviews with @MariSvirik, it seemed that the ACS feature needed to be simplified. While some users may enjoy the freedom of creating custom ACSs and specifying every detail, that could soon become tedious to maintain.

Most users will likely want to take advantage of one of the biggest advantages of ACSs: having smart proxies download content not from the Satellite, but instead from their respective upstream repositories. We considered adding a button to import all of Library as ACSs to make this easier, but that seemed to only add to the complexity.

Therefore, our new ACS idea is to simplify the workflow so that ACSs only need a list of smart proxies and products associated with them. The ACSs will get remote information from the existing repositories. That way it won’t matter if the content is coming from the Red Hat CDN or from custom repositories. RHUI is a bit of a different beast, so it’ll have to be handled with a custom workflow. It’s likely that users will copy/paste or upload a file from RHUI to set up ACSs, which is nearly as easy.
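One way to picture the simplified model in Python: given only a list of products, the remote URLs fall out of the repositories that already exist. The `Repository` fields and the function `remote_urls_for_products` are hypothetical names for this sketch, and the example URLs are placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Repository:
    name: str
    product: str
    upstream_url: Optional[str]  # None for repositories with no upstream


def remote_urls_for_products(
    products: List[str], repositories: List[Repository]
) -> List[str]:
    """Collect upstream URLs of existing repositories in the chosen products."""
    wanted = set(products)
    return [
        repo.upstream_url
        for repo in repositories
        if repo.product in wanted and repo.upstream_url
    ]


repos = [
    Repository("AppStream", "CentOS 8", "http://mirror.example/centos/8/AppStream/"),
    Repository("BaseOS", "CentOS 8", "http://mirror.example/centos/8/BaseOS/"),
    Repository("Uploads", "Internal", None),
]
selected = remote_urls_for_products(["CentOS 8"], repos)
print(selected)
```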

By having something that is easy to set up without too much thinking, the ACS feature would be more likely to get early adopters. These users could in turn help us figure out other ways to incorporate ACSs into Katello.

Here is what the simplified ACS creation could look like from @MariSvirik:

I’d be curious to hear what people think about this idea. The pro is that it’s easier to use. The con is that it’s less customizable.