Content view import/export

@Chris_Roberts, @ehelms and I are working on improving the content import/export process. We’d like to describe our plans in order to get feedback.

How does it work today?

In Katello 3.7, you can use hammer to export a specific yum repository, or all of the yum repositories in a content view. This is documented in the Katello manual.

This gets the content from point A to point B, but we received user feedback that the content views and CV versions were important to keep. Right now, you have to redefine the CV by hand and hope that, when you publish, the resulting CV version has the content you expect.

Proposed improvement

We are planning to do two things: enhance the APIs to make it easier to redefine content views and versions, and add automation to make the export/import process less painful.

Redefining a content view version today is difficult. You have to perform the existing steps to get the content into Library, set up a content view on the importing Katello by hand with include filters that match the content you expect to exist, and then publish and confirm that the CV version was created with the correct units in each repository.

As part of this effort, we will allow publishing a content view with a specific set of units from Library, which will override any filter definitions. The unit list would be defined via whatever unique identifier exists for the unit type (docker image manifest hash, erratum ID, RPM NEVRA, etc.). The newly created CV version can then be promoted to the correct lifecycle environment.

For example, a CV version publish might receive JSON that looks like this:

{
 "repos": [
   { "name": "zoo repo", "type": "yum", "rpm_filenames": ["bear-4.1-1.noarch.rpm", "zebra-0.1-2.noarch.rpm"], "errata_ids": ["RHEA-2012:0002", "RHEA-2012:0004"] }
 ]
}

This would mean "create a new CV version, and use the listed errata_ids and rpm_filenames from the zoo repo in Library to populate the new version’s zoo repo."

We will also allow setting the CV version number during publish. The version number will need to stay in today’s X.Y format so that auto-increment and other areas that expect a specific format continue to work.
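
As a sketch (the exact field names are an assumption on my part and may change during implementation), the publish payload might then carry the version alongside the unit lists:

{
  "major": "45",
  "minor": "0",
  "repos": [
    { "name": "zoo repo", "type": "yum", "rpm_filenames": ["bear-4.1-1.noarch.rpm"], "errata_ids": ["RHEA-2012:0002"] }
  ]
}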

We will also be contributing to the Foreman Ansible modules repository to add automation around content view import/export. We decided to do it this way in order to keep the process as a set of discrete steps that can be recomposed if users need to alter the import/export process in a way that’s specific to their needs.

The overall workflow will look something like this:

EXPORT (done via Ansible)

  • Katello health check to ensure everything is operational

  • Obtain CV version metadata (name + version, list of repos, list of units in each repo, etc)

  • Perform CV export

  • Put data together into tarball

IMPORT (also done via Ansible)

  • Katello health check to ensure everything is operational

  • Untar the tarball

  • Create/enable products if they don’t already exist

  • Create/enable repos on the products if they don’t already exist

  • Sync the enabled repos in Library, using a sync override URL to point at the exported repos from the tarball

  • Create content views and attach repos if they don’t already exist

  • Publish the content view using the version info and list of units from the export

At this point, the content view version can be promoted to the correct lifecycle environment.

Next steps

  • Put up pull requests that allow setting the version string and the unit list for each repo during CV publish. The first cut will only support RPM NEVRAs and erratum IDs, but future updates will add additional content types.

  • Add support to the content view Ansible module for pulling and setting the needed data on content views, and for repo export.

  • Add support to the repo Ansible module to override the sync URL and allow syncing.

  • Create Ansible scripts that use the new modules, one for export and another for import. The first version will target a totally disconnected setup, but we may add support for scenarios where the importing Katello is able to talk to the exporting Katello.

At the end of the next steps, we should have something demoable and can then iterate on improvements.

Detailed Content View Import/Export Workflow

This guide is meant to be a sketch of how the import/export process will work. Details may change during implementation.

On the Exporting Katello

Definitions + Daily operations

The user defines their products, repos and CVs with Ansible. We can provide a template but the user would be responsible for maintaining a cv-definitions.yml that would be invoked to add or update their definitions:

ansible-playbook playbooks/cv-definitions.yml

After the definitions are created, the user syncs and publishes periodically. This can be done via Ansible, hammer, or the web UI. We recommend not performing syncing or content view publishing as part of the cv-definitions.yml since we’ll be re-using that file later.
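
For example, a periodic sync could be its own small play. This is only a sketch, assuming the existing katello_sync module from foreman-ansible-modules:

- hosts: localhost
  become: true
  tasks:
    - name: "Sync all repos in Zoo Product"
      katello_sync:
        username: "admin"
        password: "changeme"
        server_url: "https://localhost/"
        organization: "Default Organization"
        product: "Zoo Product"
        verify_ssl: false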

Exporting

This would be done via export.yml. It would look something like:

ansible-playbook playbooks/export.yml --extra-vars "content_view_version=5 include_repo_contents=true"

NOTE: On a production installation you would likely use a vars file managed in source control instead of passing --extra-vars on the command line. The extra vars are only used here to keep the example clear, and are not a best practice.
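
For instance, the same invocation with a vars file might look like this (the file name is just an example):

# export-vars.yml, kept in source control
content_view_version: 5
include_repo_contents: true

ansible-playbook playbooks/export.yml --extra-vars "@export-vars.yml"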

We would fetch the list of repositories for the content view version, each repo’s relative_path, and list of rpm filenames and erratum IDs. If include_repo_contents is set, we would then make a big tarball of all the relative_paths from /var/lib/pulp/published, making sure to dereference symlinks.
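
A minimal sketch of the tarball step, assuming GNU tar and a relative_paths variable holding the list of repo paths fetched above (the output path is a placeholder):

- name: "Tar up the published repo directories, dereferencing symlinks"
  command: >
    tar --create --dereference
        --file /tmp/export.tar
        --directory /var/lib/pulp/published
        {{ relative_paths | join(' ') }}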

Example format of tarball

  • export.tar
    • repos/
      • <relative_path>/repo_one
      • <relative_path>/repo_two
    • export.json (format of JSON defined below)

At the end of the process, we have a few pieces of info:

  • the content view name for the CV version
  • the content view version’s version number (example: 45.0, or we can split it into major and minor)
  • the list of repos in the content view, along with the list of units in each repo
  • a relative_path and full_path. These will be used for disconnected and connected sync (connected sync is not supported in the initial version, but can be added later)
  • a debug certificate and CA certificate to be used for the connected sync (again, not needed for first iteration)
  • a big tarball of the repo directories with dereferenced symlinks, assuming include_repo_contents was set

The list should be in the following format in order to be compatible with the import step. This example is for the zoo repo with a filter that only pulls in the bear and zebra packages (NB: we don’t capture the filters right now; I mention them only to explain why there are only two units in the repo). The errata_ids are extraneous for this particular example, but I added them for clarity.

This JSON would be generated by export.yml, and is used as the input for later steps. It is not generated directly from Katello, and cannot be fed directly into Katello.

{
    "ca_cert": "<large PEM>",
    "content_view_name": "Animals CV",
    "content_view_version_major": "45",
    "content_view_version_minor": "0",
    "debug_cert": "<large PEM>",
    "repos": [
        {
            "errata_ids": [
                "RHSA1",
                "RHBA2"
            ],
            "full_path": "http://exporting-katello.com/pulp/repos/Default_Organization/content_views/Animals_Content_View/2.0/custom/Zoo_Product/Zoo_Repository/",
            "name": "zoo repo",
            "relative_path": "Default_Organization/content_views/Animals_Content_View/2.0/custom/Zoo_Product/Zoo_Repository",
            "rpm_filenames": [
                "bear-4.1-1.noarch.rpm",
                "zebra-0.1-2.noarch.rpm"
            ],
            "type": "yum"
        }
    ]
}

On the Importing Katello

Definitions

The user can re-run their cv-definitions.yml to populate the importing Katello. The importing Katello is typically not able to sync from the Red Hat CDN or other sources directly; this is handled in the next steps.

Importing

Getting the content onto the importing Katello

This step would be done via a newly defined playbook called cv-import-load.yml. It would be invoked like this:

ansible-playbook playbooks/cv-import-load.yml --extra-vars "import_definitions=export-from-other-katello.json import_repo_contents_from_tgz=true"

When invoked, this would find the Library version of each repo in the content view and sync. There are two ways the sync can work, depending on the value of import_repo_contents_from_tgz. If it’s set to true, then the override source_url is set to file:///path/to/tgz/contents/relative_path. If it’s set to false, then source_url is set to full_path. Note that we may need to add additional overrides to the sync API to add the debug certificate and CA certificate. Connected import is an optimization that will not be supported in the initial version, but can be added later.
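
As a sketch of the disconnected case, the per-repo sync might look like the following once the overrides land. Both source_url and sync are proposed parameters from this plan, not options the katello_repository module supports today, and the file:// path is a placeholder:

- name: "Sync zoo repo from the unpacked tarball"
  katello_repository:
    username: "admin"
    password: "changeme"
    server_url: "https://localhost/"
    organization: "Default Organization"
    product: "Zoo Product"
    name: "Zoo Repository"
    content_type: "yum"
    state: present
    verify_ssl: false
    # proposed additions, not in the module today:
    source_url: "file:///path/to/tgz/contents/Default_Organization/content_views/Animals_Content_View/2.0/custom/Zoo_Product/Zoo_Repository"
    sync: true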

Redefining the Content View version

After cv-import-load.yml is complete, all of the data is on the importing Katello and we just need to redefine the content view version. To do that, we will use cv-recreate-version.yml:

ansible-playbook playbooks/cv-recreate-version.yml --extra-vars "import_definitions=export.json"

This will publish a content view version with the parameters given in export.json. The recreated version will have the same content view version number, and each repo will have the same contents that were defined via the export JSON.

That’s it! We will be adding support for other content types and unit types in the future, but the workflow will be the same. Any content type that can be uniquely identified as the same unit on two servers and be copied in Pulp can be supported.

Breakdown of each proposed Ansible script

This section gives a summary of the steps done by each proposed Ansible script. Again, this is just a sketch and not a hard and fast requirement.

cv-definitions.yml

This file is only responsible for creating or enabling products, repos, and content views. It will be run as the first step on both the exporting and importing Katello servers.

The user is responsible for syncing and publishing. They can use another Ansible script, a scheduled sync, hammer, or the web UI. Note that the download_policy is immediate. We need the units downloaded onto the Katello server in order to export them later.

I think all of the features for this are already available in foreman-ansible-modules.

- hosts: localhost
  become: true
  tasks:
    - block:
      - name: "Create zoo product"
        katello_product:
          username: "admin"
          password: "changeme"
          server_url: "https://localhost/"
          organization: "Default Organization"
          name: "Zoo Product"
          verify_ssl: false

      - name: "Create zoo repository"
        katello_repository:
          username: "admin"
          password: "changeme"
          server_url: "https://localhost/"
          name: "Zoo Repository"
          state: present
          content_type: "yum"
          product: "Zoo Product"
          organization: "Default Organization"
          url: "https://repos.fedorapeople.org/repos/pulp/pulp/demo_repos/zoo/"
          download_policy: immediate
          verify_ssl: false

      - name: "Create Animals CV"
        katello_content_view:
          username: "admin"
          password: "changeme"
          server_url: "https://localhost/"
          name: "Animals Content View"
          organization: "Default Organization"
          repositories:
            - name: 'Zoo Repository'
              product: 'Zoo Product'
          verify_ssl: false

export.yml

This script is responsible for assembling the export tarball. This includes creating the export.json file, and either using the Katello CV export feature or simply copying all of the repos by their relative_path into the tarball. The latter may be more expedient, but requires verifying that all units are actually available on-disk, i.e., that every symlink under /var/lib/pulp/published/<relative_path> points to a real file. Checking the symlinks is preferred over checking for an immediate download_policy, since it’s possible to set a repo to on_demand, flip it to immediate, never sync it, and still have broken symlinks.
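
A sketch of that check, again assuming a relative_paths variable (with find -L, only broken symlinks still test as type l, so any output means a broken link):

- name: "Fail if any published symlink is broken"
  command: "find -L /var/lib/pulp/published/{{ item }} -type l"
  register: broken
  failed_when: broken.stdout != ""
  loop: "{{ relative_paths }}"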

This will require a few new features in foreman-ansible-modules. We will probably need to create a new katello_content_view_export module that uses the nailgun library to create export.json. The tarball should not be too difficult to make since we’ll already have relative_path in hand. It is OK to ignore the CA and debug certs for the first revision.

cv-import-load.yml

This script is meant to be run after cv-definitions.yml runs on the importing Katello server. The job of this script is to untar the tarball, find each repository that exists in the CV (as read from export.json), and then run a sync on each repo. The sync will need to have source_url defined so it can point at a file:/// URL. This is not possible today with the katello_repository module and will need to be added to the existing module.

Additionally, we’ll likely need a katello_content_view_import Ansible module that can crack open the tarball and read the contents.
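
Until that module exists, a rough equivalent with stock Ansible modules might look like this (the paths are placeholders):

- name: "Create a workspace for the tarball contents"
  file:
    path: "/tmp/import"
    state: directory

- name: "Unpack the export tarball"
  unarchive:
    src: "/tmp/export.tar"
    dest: "/tmp/import"
    remote_src: true

- name: "Load export.json for the later steps"
  set_fact:
    export_data: "{{ lookup('file', '/tmp/import/export.json') | from_json }}"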

cv-recreate-version.yml

This script redefines the content view version. It makes use of two patches: one to set the major/minor version on a CV version when publishing, and another to set the list of units in each repo when publishing. These patches are both still being worked on. After they are merged, nailgun will need to be made aware of the new APIs, and then the katello_content_view_publish module will need to take advantage of them. cv-recreate-version.yml will also read export.json, similar to cv-import-load.yml.
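
Once those pieces land, the publish step might look roughly like this. The major, minor, and repos arguments are the proposed parameters described above (not yet merged), and export_data is assumed to hold the parsed export.json:

- name: "Recreate the exported CV version"
  katello_content_view_publish:
    username: "admin"
    password: "changeme"
    server_url: "https://localhost/"
    organization: "Default Organization"
    content_view: "{{ export_data.content_view_name }}"
    # proposed parameters, pending the two patches above:
    major: "{{ export_data.content_view_version_major }}"
    minor: "{{ export_data.content_view_version_minor }}"
    repos: "{{ export_data.repos }}"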

Hey,

I totally agree that we need a more precise way of doing content exports and imports than what we have now. Thanks a ton for working on this!

Two things that popped up in my mind right now:

  1. RPM filenames are not unique; I can totally have a bear-4.1-1.noarch.rpm in ZhenechZoo that is completely different from the bear-4.1-1.noarch.rpm in beavZoo. Pulp references the units by their checksum, so should we do that too?
  2. Why not make Katello read/write the JSON export itself? I totally love Ansible for executing long lists of tasks, but I loathe it when it comes to data processing of more complex structures.

That’s it for an 8am read. :wink:

Will it be possible to perform import/export with pure hammer, without ansible?

The current design does not have us going down the hammer route based on other input we had. Likely, elements of the workflow could be done with hammer and pieced together, but as far as an orchestrated workflow the goal was not to go that route. Do you have concerns with this approach?

My concern is that Ansible is IMO not the best tool for entry-level user interaction. I agree that Ansible is the way to go for the orchestration, but I would expect to be able to perform the steps themselves (export, import) via hammer, as we do for all of the other CLI interactions.

My main concerns are:

  • user experience: things like passing arguments via the command line, --help output, etc.
  • tooling fragmentation

My expectation would be that I could run hammer cv-export and hammer cv-import without any Ansible, and use Ansible to drive the whole workflow with hammer as the backend.


I agree with Ivan. If hammer is used, it would also be possible to export/import a CV from a different host. Maybe Ansible could just be called internally to do the actual work, but the frontend would still be hammer!

(BTW, I really like this way to discuss such a feature first. Very appreciated!)

Can you expand on why you think this?

This is fair but that ship may have sailed. We can re-visit the general discussion.

Generally speaking, I don’t want to see us calling hammer from Ansible. I’ve tried it, and it’s heavy and ugly. Our CLI today is heavyweight and has a lot of dependencies that are needed to get an environment that can run it.

I will say that one motivation towards Ansible is to push users towards more infra-as-code thinking, which I find Ansible workflows encourage better than our CLI does.

I don’t see how that is specific to hammer given you can configure Hammer or Ansible to point at whatever hosts you want.

I think this could be done in hammer only. As I understand it, all that’s really needed is to get a list of content units (RPMs, erratum IDs, etc.) in a CVV, and then to be able to create a CVV by listing those same content units. One export and one import command; maybe I have misunderstood something? (Obviously, both Katellos would need to have the same content synced, similar config, etc.)

Honestly, I’m not that keen on providing an Ansible role or play to do this, but if you would like to, go ahead! I think this comes close to a similar discussion @ehelms started about providing supported roles.

What is full_path? I know of a couple of people who would like to set up a hub-and-spoke type model where the Katellos are connected: spokes would sync Library from the hub, but they’d like to create the exact same CVVs on all spokes after testing them on the hub (reading your proposal, I’m fairly sure it covers this use case).

Correct, the list of units is all that’s needed after both Katellos have the same content synced, repos enabled, and CVs created.

Today’s implementation works with just hammer, but has some drawbacks. It currently requires a shared workspace directory of /var/lib/pulp/katello-export, and it can expose repo directories with UUID names to the user. We also rely on Pulp’s repo and repo group exporters to copy data that’s already been published elsewhere, which can be slow and uses a lot of disk space. Copying the repos with Ansible instead of via Pulp export was not part of the first task breakdown, but would be simple to add later on.

Users occasionally request enhancements to the import/export process to support things like exporting to /var/www/html/. We figured that Ansible would give more flexibility and allow users to contribute changes more easily than if hammer was handling these use cases. Additionally, performing the import/export with hammer may not be possible if Pulp and Katello are living on two machines in the future. There are usually SELinux issues with this type of intra-app file copy as well that are avoidable with Ansible.

We can perform all of the steps (export metadata + export repos + import metadata + import repos + publish CVV) as hammer commands; my main concern is that we don’t have large “export” and “import” Foreman tasks that take hours to run and that have to handle all edge cases. It is not fun to kick off a long-running Foreman task at 9am and have it fail at 3pm.

full_path is the https URL that the repo is available at. In order for a downstream Katello to sync from an upstream Katello, you’d need the URL from full_path, the upstream Katello’s CA cert, and the upstream Katello’s debug cert. It’s a bit of a hassle to get the CA cert honored by the downstream Katello since it has to be put in a few places, but it is possible.

Things like a missing --help, auto-completion, validation of input variables…

In general, although Ansible is quite popular, I don’t think it’s common knowledge, and building the tooling purely around Ansible adds additional knowledge requirements on the user.

While I support the infra-as-code approach in general, is import/export a typical use case for this? What would be a use case where we could sell this as an infra-as-code example?


I fully agree with this. Discoverability of Ansible means reading the roles/playbooks in my experience. Modules do have documentation but that’s often developer documentation, not user documentation.


I would like the ability to export content based on a date range, like Spacewalk 5 (--start-date, --end-date) or Red Hat Satellite 6 (hammer repository export --since). I incrementally export and test individual repos on my connected Foreman server and do not keep the incremental exports. When I go to update the disconnected Foreman servers, I have to run a full export on the connected server or I will receive the error "Please import the metadata for ‘Import-x-x-x 2.0’ before importing ‘Import-x-x-x 4.0’."