No rpms are synced because of checksum error on iso

gvde · August 10, 2022, 9:47am

Problem:
I have just added CentOS Stream 9 to my new (testing) katello server running 4.5. However, the sync of BaseOS fails with:

A file located at the url http://mirror.stream.centos.org/9-stream/BaseOS/x86_64/os/images/boot.iso failed validation due to checksum.

I have checked the checksums in the .treeinfo file and compared them with the available isos and the error is correct. The sha256 of the iso which I get from http://mirror.stream.centos.org/9-stream/BaseOS/x86_64/os/images/boot.iso is

cf4156d110e6a822c720bdc9e0addd18e90d0c94da302986912bed69aec4ef05  boot.iso
f7b68bd16e6f727eedb99d424425bd345be29616687442095282594ea447f203  efiboot.img

http://mirror.stream.centos.org/9-stream/BaseOS/x86_64/os/.treeinfo says

[checksums]
images/boot.iso = sha256:c4980cd238a373313ddb2e7cc51ca7763f9f8a26e1e1195efbfb8452a4805ce3
images/efiboot.img = sha256:f7b68bd16e6f727eedb99d424425bd345be29616687442095282594ea447f203
images/install.img = sha256:38e4d73d5c53c04f7a23544c8d0d9abadb93355994c8d0dbf9519ec27bb2ce61
images/pxeboot/initrd.img = sha256:b7b313430ff53793c4f6b82131709dbc2fdce73a4ca3f98df51e15585a3f0788
images/pxeboot/vmlinuz = sha256:440e9d1a3028bc838377b804a179753d857a9a364946354eeaab868e13962419

efiboot.img is correct.
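
For anyone who wants to repeat this comparison, here is a minimal Python sketch. It assumes the .treeinfo text and the downloaded files are already available locally. The function names (parse_treeinfo_checksums, file_digest, find_mismatches) are made up for illustration and are not part of Katello or pulp.

```python
import configparser
import hashlib


def parse_treeinfo_checksums(treeinfo_text):
    """Return {relative_path: (algorithm, hex_digest)} from the [checksums] section."""
    # Only "=" is a delimiter, so the ":" inside "sha256:<digest>" stays in the value.
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.optionxform = str  # keep paths exactly as written
    parser.read_string(treeinfo_text)
    result = {}
    if not parser.has_section("checksums"):
        return result
    for path, value in parser.items("checksums"):
        algorithm, _, digest = value.partition(":")
        result[path] = (algorithm, digest)
    return result


def file_digest(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Hash a local file in chunks so a large boot.iso is not loaded into memory."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def find_mismatches(expected, actual_digests):
    """Return the sorted paths whose computed digest differs from the .treeinfo entry."""
    return sorted(
        path
        for path, (_, digest) in expected.items()
        if path in actual_digests and actual_digests[path] != digest
    )
```

With the values quoted above, find_mismatches reports images/boot.iso only, which matches the observation that efiboot.img is fine.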

So far so good. I don’t mind the incorrect iso. It doesn’t need to be synced right now.

However, nothing else has been added to the repository. Package count is still 0.

Expected outcome:
Sync everything you can with correct checksums. I don’t see why it should skip all rpms just because a single checksum is incorrect…

Foreman and Proxy versions:
Katello 4.5.0, Foreman 3.3

Distribution and version:
AlmaLinux 8

gvde · August 12, 2022, 8:30am

The os/images/boot.iso is a different one from the one in the iso directory http://mirror.stream.centos.org/9-stream/BaseOS/x86_64/iso/CentOS-Stream-9-20220808.1-x86_64-boot.iso

Anyway, is this a @katello or pulp issue? The sync should only skip the files with incorrect checksums, not everything…

gvde · August 16, 2022, 5:53am

After looking into where the error was caused I have opened an issue with pulp_rpm https://github.com/pulp/pulp_rpm/issues/2713

gvde · September 17, 2022, 9:17am

@dralley Since yesterday the boot.iso checksum in CentOS Stream 9 BaseOS is incorrect again. Of course, the incorrect checksum shouldn’t happen, but with pulp not syncing anything whenever this happens, to me it basically means I cannot really use CentOS Stream 9 with Katello/pulp. It’s not really acceptable to have a hanging repository for days which doesn’t get any updates synced.

dralley · October 23, 2022, 1:47am

Different idea, which would address some other stuff at the same time: would it be acceptable to make syncing treeinfo/kickstart data optional, and possibly off by default?

It’s a reasonable feature in isolation, and as these issues seem to pop up most frequently with that metadata, would address this problem by proxy.

That way, you can explicitly tell Pulp “I don’t care about this metadata” rather than having us implicitly make assumptions about it.

gvde · October 23, 2022, 9:53am

I still don’t think it’s a good idea to abort a sync once you come across an error. IIRC, I have never seen more than one error when there was a problem during the sync. I have checked the last couple of errors in the task logs and I have seen some checksum errors and some rpm missing (404 on the mirror), but each time it was exactly one error for a repository.

I suspect the sync process generally aborts immediately after it finds an issue. IMHO, that’s the problem. We sync against a mirrorlist, i.e. mirrors which are rsynced to upstream repos. So there is always a chance that you hit some issues during the sync, either because we sync while the upstream mirror sync is in progress or because the upstream mirror sync got interrupted for some reason. It’s never perfect.

Aborting a sync after the first issue is just a bad idea. If .treeinfo contains incorrect checksums, you can still sync all other isos with correct checksums as well as all rpms. If an rpm is missing on the mirror you can still sync everything else. Most of the time, we will never notice the missing file. However, we will notice pretty quickly if the sync has been aborted and if it happens early in the process.

I really think the correct way would be to continue the sync and try to get everything which is available. The only reason to abort the rpm sync would be if the checksums of the repo metadata itself are incorrect, i.e. the checksums in repomd.xml don’t match the primary or similar.

But an incorrect checksum in .treeinfo or a missing rpm on the mirror shouldn’t be a reason to abort the whole sync. If the client was using the mirror directly it wouldn’t notice the incorrect .treeinfo checksum or a missing rpm unless it needs it and then it would simply get an error that it’s missing and that’s what I would expect and need from pulp as well…
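
To make the proposal concrete, here is a rough Python sketch of a sync loop that aborts only when the repo metadata itself is inconsistent and otherwise records per-file failures and carries on. This is not how pulp is implemented; fetch and verify_metadata are hypothetical callables supplied by the caller, and SyncReport is an illustrative name.

```python
from dataclasses import dataclass, field


class MetadataError(Exception):
    """Raised when the repository metadata itself cannot be trusted."""


@dataclass
class SyncReport:
    synced: list = field(default_factory=list)
    failed: dict = field(default_factory=dict)  # path -> error message


def sync_repository(paths, fetch, verify_metadata):
    """Sync every path that can be fetched and record failures instead of aborting.

    `fetch(path)` and `verify_metadata()` are hypothetical callables provided
    by the caller; they are not Pulp APIs.
    """
    # The only fatal condition: repomd.xml does not match primary etc.
    if not verify_metadata():
        raise MetadataError("repo metadata checksums do not match")

    report = SyncReport()
    for path in paths:
        try:
            fetch(path)
        except Exception as exc:  # e.g. checksum mismatch or a 404 on the mirror
            report.failed[path] = str(exc)
            continue
        report.synced.append(path)
    return report
```

The report keeps the failed paths visible, so a bad boot.iso would still be reported to the admin without blocking the rpms.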

tedevil · October 23, 2022, 12:15pm

I guess in a sense a repo with packages is a complete package since syncing only part of one would end up in dependency problems. So I can sorta understand that it aborts unless it can sync all of it properly. However, iso files should not have those dependencies, so it feels pretty safe to continue the sync even if a sync of an iso file fails, as long as it is reported in a way that lets the Foreman admin know what has happened.

gvde · October 23, 2022, 2:50pm

Not quite. Actually, I don’t think pulp checks the dependencies. It checks if the repository metadata and the files in the repository match. This doesn’t solve any dependency if a requirement is actually not in the upstream repository and not mentioned in the metadata.

And simply syncing nothing then doesn’t make it better. If I used the external mirror directly, there might be some rpms missing, which I would only notice if I needed them. That’s just the way it is. Pulp requiring the complete repository to be in a consistent state and not syncing anything or aborting the sync if it is not is a much stronger requirement for which I don’t see the reason.

I don’t know the internals, but is the full pulp repository sync transactional? If the sync is aborted, is anything else already synced removed or is it still there? Or is the state just back to what it was before the sync?

gvde · November 16, 2022, 8:36am

This issue is really killing me.

O.K. Providing incorrect checksums in .treeinfo for the boot.iso isn’t really increasing my confidence in CentOS Stream 9 and it shouldn’t happen on their end.

Still, syncing nothing because of that is really giving me a hard time. If I enable the upstream baseos and appstream repos I can see a number of available updates. It has been like this for four days now. I don’t need .treeinfo or boot.iso for updating a couple of rpms…

On a sidenote: does anyone have a direct line to the centos developers and can ask them to get their isos/hashes straight?

dralley · November 17, 2022, 5:26am

On a sidenote: does anyone have a direct line to the centos developers and can ask them to get their isos/hashes straight?

Probably #centos-devel on Libera.chat (IRC)

Still, syncing nothing because of that is really giving me a hard time. If I enable the upstream baseos and appstream repos I can see a number of available updates. It has been like this for four days now. I don’t need .treeinfo or boot.iso for updating a couple of rpms…

Ok. I’ll see if we can at least implement the workaround I mentioned previously ASAP. We can have a discussion about going further… hiding issues like that is always an absolute last resort.

gvde · November 17, 2022, 5:53am

Well, in that case it’s hiding (missing) lots of updates for days or hiding the boot.iso.

At least, break it up into two sections: the repository with rpms and the .treeinfo with the distro/iso information. If there is an issue with the .treeinfo there is no reason not to do the rpm sync.

And just wondering: during the rpm/content sync, if the sync process comes across a checksum error with one rpm, does it continue or does it stop there, too?

To me, the sync doesn’t have to be better (doesn’t have to try to provide a more consistent state) than the upstream repository itself.

wibbit · November 18, 2022, 10:09am

Just a note on this topic, as I think it is related.

I raised a feature request/bug a little while back (As a user, I have proper support for mirrorlists · Issue #2286 · pulp/pulp_rpm · GitHub).

From my perspective, mirror list support is currently broken. As has been mentioned previously by @gvde, mirrors are often in a state of flux, and failing on the first error results in repositories often not being synchronised.

I believe the right choice here is to reattempt the download against a different mirror, instead of simply failing. I believe this is how yum/dnf works, and though not daily, it is not infrequent to see an upgrade fail a download and try against a different mirror.
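
As a rough illustration of that retry idea, here is a Python sketch. It is not taken from pulp, yum or dnf; fetch is a hypothetical callable that returns the file content as bytes, and the mirror URLs are whatever the mirrorlist returned. It holds the whole file in memory, so a real implementation would stream to disk instead.

```python
import hashlib


class AllMirrorsFailed(Exception):
    """Raised when no mirror delivered a file with the expected checksum."""


def download_with_fallback(relative_path, mirrors, fetch, expected_sha256):
    """Try each mirror in order and return (url, data) for the first good copy.

    `fetch(url)` is a hypothetical callable returning bytes and raising
    OSError on network or HTTP errors.
    """
    errors = []
    for base_url in mirrors:
        url = base_url.rstrip("/") + "/" + relative_path
        try:
            data = fetch(url)
        except OSError as exc:
            errors.append(f"{url}: {exc}")
            continue
        if hashlib.sha256(data).hexdigest() != expected_sha256:
            # A stale or half-synced mirror: note it and move on to the next one.
            errors.append(f"{url}: checksum mismatch")
            continue
        return url, data
    raise AllMirrorsFailed("; ".join(errors))
```

The expected checksum has to come from metadata that is itself trusted, otherwise every mirror will appear broken.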

That being said, I do believe that a repository should be a purposefully curated object, and I expect upstream to do just that, I also expect my local copy of that to be just as curated as upstream.

The dependency concern mentioned by @tedevil is, I believe, entirely accurate. It is not that pulp is validating RPM dependencies at sync time; it is that there is a chance that new packages that depend on each other have been added, and if one of those fails due to a hash error, then one could find that upgrades fail. Yes, this is a slim chance, but it is a chance.

When it comes to ISOs vs. repositories, I don’t think they can be considered entirely independent, in so far as discrepancies between the ISO and associated RPMs can cause problems, specifically if the ISO ends up ahead of the RPM tree (think box gets installed with newer kernel/glibc).

As such for me, getting full mirror list support (i.e. retrying against different mirrors in the event of a failure) should be the primary solution, and then as a secondary, non-default option, allow for failures in one area to not cause the entire repository to fail.

However, I would very much want that to be non-default and for the risks to be understood.

Just my tuppence worth.

gvde · November 18, 2022, 10:49am

Well, trying other mirrors as well would be a nice to have, but I don’t know if that’s always such a good idea. If there is a checksum error, you would have to find out whether it’s an issue of the metadata or the file. So you would have to start downloading the metadata from all mirrors and figure out whether there is newer metadata somewhere. If it’s the same everywhere you would have to download the rpm/iso from all mirrors to see if you find it somewhere matching the checksum of the metadata. If it’s a larger file, e.g. the boot.iso, you might start downloading several gigabytes worth of files all in vain because possibly it’s not an issue of metadata not matching the files but it’s actually “broken” upstream…

I guess checking other mirrors might be possible if a file is missing on one mirror. But even then, again, you have to validate the metadata of the other mirror as well, because possibly the missing file is supposed to disappear but the local metadata is not yet updated…

My reference is simply what I would see if I use the mirror/upstream repos directly. I won’t see any issue, no incorrect checksums, missing files or missing dependencies as long as I don’t use them.

I don’t know the exact statistics but let’s say the average server is using only 1% of all rpms/files in a repository. If there is a checksum error somewhere between the repo metadata and a file on a mirror, chances are very slim that I am affected. If there is a dependency issue, e.g. a dependency is missing on the mirror because it hasn’t fully mirrored yet, chances are slim I would even notice. So my guess would be that I would never notice most issues if I used the mirrors directly. But still, I get timely updates. And if there is a problem, I can try other mirrors or simply try a day later.

What I don’t find very useful is aborting the sync the moment you find a single issue between the metadata and the packages/files. It affects one or maybe a few files in the repository but still almost everything in the repository is correct and could be synced… I don’t see a good reason why the pulp sync should try to create a repository view which is more stringent than the upstream repository.

gvde · November 24, 2022, 10:43am

O.K. It seems it’s a problem with aws or cloudfront. I have just downloaded the boot.iso with my browser and the sha256 checksum matches the .treeinfo. I then used curl to download from the same URL, and I get a boot.iso with a different checksum.

This is what I currently get for http://mirror.stream.centos.org/9-stream/BaseOS/x86_64/os/.treeinfo

[checksums]
images/boot.iso = sha256:55a560c8fc43e3f2bfae3827a6e7c31fd45c1a72b61a2e9fc68f917cfddee9d1
images/efiboot.img = sha256:152556e026a93ae685572291d429877e60ebe9eda5ae3085e5a69c0679064ef9
images/install.img = sha256:f94b9fa90cdbd7c61e188524230cca77a0ee14312f792f6abdc1a194ba6c080a
images/pxeboot/initrd.img = sha256:8ed92ca2f28b79b581c137f8efe719ea1982643d30882cde8cdd510fc0c0c637
images/pxeboot/vmlinuz = sha256:7be325ccd6bc2c4af2789a49e4408d2a0ca1685b9147a70021bbf1645a7368c6

I download from http://mirror.stream.centos.org/9-stream/BaseOS/x86_64/os/images/boot.iso which redirects me to https://download.cf.centos.org/9-stream/BaseOS/x86_64/os/images/boot.iso

Using my browser I get the correct file with the response headers

HTTP/2 200 OK
content-type: application/octet-stream
content-length: 896532480
date: Thu, 24 Nov 2022 01:00:30 GMT
last-modified: Wed, 23 Nov 2022 13:27:09 GMT
etag: "55c9647dc6665a4486d0b2d9569a48e0-107"
x-amz-version-id: S26qMpy9o2y1pZBxGs7Q7PB0esllhPYu
accept-ranges: bytes
server: AmazonS3
x-cache: Hit from cloudfront
via: 1.1 89efe3a7854e47cf7f1fe47e28e39348.cloudfront.net (CloudFront)
x-amz-cf-pop: MUC50-P1
x-amz-cf-id: NnhbqbROx-OgIfLp9bHPKbvVQ0nRbayhWWNNpveIvjlMenvcyOvlWQ==
age: 34167
X-Firefox-Spdy: h2

With curl on the same computer or on my foreman server I get a different, older file:

HTTP/2 200 
content-type: application/octet-stream
content-length: 893386752
date: Tue, 22 Nov 2022 06:15:13 GMT
last-modified: Fri, 18 Nov 2022 13:16:06 GMT
etag: "5e9e8621cfaba576a3be45305bbce733-107"
x-amz-version-id: q8o94togmJky1WLWL3.Zt2CVQxOHepr4
accept-ranges: bytes
server: AmazonS3
x-cache: Hit from cloudfront
via: 1.1 3f48626dd8757a1af3c75efd40b72542.cloudfront.net (CloudFront)
x-amz-cf-pop: MUC50-P1
x-amz-cf-id: serZhfOxrz6zP4Hp55pVz9bMDdawvuuTb4UQpqOO2ajgxzj6l97dSw==
age: 188555

This looks like a caching issue with AWS, offering outdated files in some cases…

gvde · November 24, 2022, 4:37pm

This is a major disaster: if I add the HTTP header Accept-Encoding: gzip I get the newer iso with the correct checksum. If I omit that header, I get the older one. Standard curl does not send it; if I add the header, curl gets the newer file too. I have made sure it’s using the same server with the same IP for download. Both files are valid boot.iso images, and it’s not the same file with one copy merely truncated.

So cloudfront delivers different files depending on whether gzip encoding is accepted??
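
A small Python sketch for repeating this check without downloading the whole iso: send a HEAD request with and without Accept-Encoding: gzip and compare the headers that identify the object. The function names are illustrative. As far as I know, urllib sends Accept-Encoding: identity when no such header is set, which is not exactly the same as curl sending none at all, so treat the result as a hint only.

```python
import urllib.request


def head_headers(url, accept_gzip, timeout=30):
    """Send a HEAD request and return the response headers with lowercase names."""
    request = urllib.request.Request(url, method="HEAD")
    if accept_gzip:
        request.add_header("Accept-Encoding", "gzip")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return {name.lower(): value for name, value in response.headers.items()}


def differing_headers(first, second, names=("etag", "content-length", "last-modified")):
    """Return {name: (first_value, second_value)} for the headers that differ."""
    return {
        name: (first.get(name), second.get(name))
        for name in names
        if first.get(name) != second.get(name)
    }


def compare_encodings(url):
    """Fetch headers both ways (needs network access) and return the differences."""
    with_gzip = head_headers(url, accept_gzip=True)
    without_gzip = head_headers(url, accept_gzip=False)
    return differing_headers(with_gzip, without_gzip)
```

Calling compare_encodings on the download.cf.centos.org URL quoted earlier should return an empty dict when the cache is consistent; differing etag or content-length values would point to two object versions being served.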

mhjacks · November 25, 2022, 3:36pm

If that’s an option, I think it should default to enabled, since one of the main reasons to mirror this content is to set up kickstart trees (and because turning it off would change the current default, which I think would be surprising to users).
I’ve heard in other threads that there’s a “no new options” mandate in effect in any case (for repositories in particular) - how does that apply here?

Though Gerald’s recent findings vis-a-vis content headers may moot this entire discussion.

dralley · November 29, 2022, 10:34pm

@gvde I’ve forwarded your post to the CentOS infrastructure team. A few hours after I filed the issue it seemed like the checksums were matching again but if CloudFront glitchiness is the issue it may be geo-dependent, load balancer dependent, or anything else…

If I recall correctly, we (or rather aiohttp) use Accept-Encoding: gzip when downloading, but Pulp doesn’t serve files with any compression.

gvde · November 30, 2022, 5:07am

I contacted them via IRC and eventually opened a bug in redhat bugzilla, which hasn’t been touched yet. At the moment, it seems the checksums are o.k. So I am waiting for it to happen again…

ToniF · November 19, 2024, 1:21pm

It seems that this issue has reappeared with another file (pxeboot/initrd.img) and another distro (Alma Linux 9).

It’s sad that the Foreman won’t sync the rest of the repo, only because of one file.

gvde · November 19, 2024, 1:57pm

Disable the treeinfo for the repository or change the URL to /9.5/

It seems to be a cache issue that the /9/ url is serving some 9.4 images.