My system has not received an updated image since December 21st. It’s baffling, though, because when I go to run ujust update (or sudo rpm-ostree update, or sudo bootc upgrade…), it seems to fetch the image and create a new deployment just fine – it even shows up as a staged image in sudo bootc status – but when I reboot the system, the deployment just disappears. The only options that I have in GRUB are the December 21st image and its rollback image. When I boot back into the system, running sudo bootc status or sudo ostree admin status produces output as though the update were never staged.
Things I’ve tried:
Removing an old pinned Fedora 42 deployment (in case it was eating up too much space)
Running sudo ostree admin cleanup
Running every command I’m aware of that can stage a new deployment
Everything ends in the same result: a deployment of the new image seemingly going through without errors, only to disappear without a trace on the next boot. Any thoughts on what the issue might be?
Things to note about my system:
I am running my own custom Aurora image, on an active image tag that has had successful builds as recently as this past Sunday (January 18); this is the image rpm-ostree is trying to update to.
The signature on the image does not appear to have any issues, nor does my current container policy prevent rpm-ostree from successfully validating the image signature (which has usually been the source of these problems for me in the past).
I am not layering any packages.
I do use a technically-not-officially-supported-but-very-carefully-chosen disk layout. That said, I have been using my custom image with this disk setup for just about a year now without any issues, and I don’t really see why it would start causing a problem now.
This is certainly strange. I’ve never dug deep into the mechanism of how new images are staged. I’m also using my own custom image (though not as custom as yours) without any issue. I know there are several bootc services you should probably look at with journalctl.
You might get some more expert opinions over on the Fedora Discourse. That’s where the bootc experts reside.
Thanks for the journalctl suggestion @Danathar! That seems to have uncovered the issue. A systemd service that runs at shutdown, ostree-finalize-staged.service, is responsible for adding the GRUB entries after a successful update. It appears to be failing due to an invalid SELinux policy in the newer image:
Jan 21 18:51:55 shaftoe ostree[16670]: Copying /etc changes: 491 modified, 0 removed, 154 added
Jan 21 18:51:55 shaftoe ostree[16670]: Copying /etc changes: 491 modified, 0 removed, 154 added
Jan 21 18:51:55 shaftoe ostree[16670]: Refreshing SELinux policy
Jan 21 18:52:00 shaftoe ostree[16707]: /sbin/setfiles: /etc/selinux/final/targeted/contexts/files/file_contexts: Multiple same specifications for wildcard /usr/s?bin/incus.
Jan 21 18:52:00 shaftoe ostree[16707]: /sbin/setfiles: /etc/selinux/final/targeted/contexts/files/file_contexts: Multiple same specifications for wildcard /usr/s?bin/incus-.*.
Jan 21 18:52:00 shaftoe ostree[16707]: /sbin/setfiles: /etc/selinux/final/targeted/contexts/files/file_contexts: Multiple same specifications for wildcard /usr/lib/systemd/system/incus.*.
Jan 21 18:52:00 shaftoe ostree[16707]: /etc/selinux/final/targeted/contexts/files/file_contexts: Invalid argument
Jan 21 18:52:00 shaftoe ostree[16706]: libsemanage.semanage_validate_and_compile_fcontexts: setfiles returned error code 1.
Jan 21 18:52:00 shaftoe ostree[16706]: semodule: Failed!
Jan 21 18:52:00 shaftoe ostree[16670]: Refreshed SELinux policy in 5188 ms
Jan 21 18:52:00 shaftoe ostree[16670]: error: Finalizing deployment: Finalizing SELinux policy: Child process exited with code 1
Jan 21 18:52:00 shaftoe systemd[1]: ostree-finalize-staged.service: Control process exited, code=exited, status=1/FAILURE
Jan 21 18:52:00 shaftoe systemd[1]: ostree-finalize-staged.service: Failed with result 'exit-code'.
Jan 21 18:52:00 shaftoe systemd[1]: Stopped ostree-finalize-staged.service - OSTree Finalize Staged Deployment.
Jan 21 18:52:00 shaftoe systemd[1]: ostree-finalize-staged.service: Consumed 5.944s CPU time, 358.1M memory peak.
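For anyone else chasing this: since the service runs during shutdown, its logs aren’t in the current boot’s journal. The output above came from querying the previous boot with something like:

```shell
# ostree-finalize-staged.service runs during shutdown, so its failure is
# recorded in the journal of the previous boot (-b -1), not the current one.
journalctl -b -1 -u ostree-finalize-staged.service --no-pager
```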
How that policy became invalid and how to fix it, I currently do not know, but I will investigate further. I’ll post back here when I know more.
(As an aside, maybe it might be worth surfacing failures in this service when they occur? I know it might be hard to detect, but the failure mode here is very non-obvious and it might help reduce confusion if this were to happen to anyone else.)
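Even something as simple as a check on the next boot could help. A rough sketch of what I mean (purely hypothetical, not an existing feature):

```shell
# Hypothetical boot-time check: warn if the previous shutdown's deployment
# finalization failed. Harmless no-op when the previous boot's journal is
# clean or unavailable.
if journalctl -b -1 -u ostree-finalize-staged.service --no-pager 2>/dev/null \
        | grep -q "Failed with result"; then
    echo "warning: the staged update failed to finalize on last shutdown;" \
         "see the previous boot's journal" >&2
fi
```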
I don’t recall for certain but I’m pretty sure I haven’t touched the SELinux policies at all in my image - they should be identical to upstream Aurora.
Maybe it’s a bad configuration merge, or maybe there’s an upstream change that interacts poorly with my build scripts. I definitely have more to look into; I’m just not in a good place to dig in further at the moment.
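If anyone wants to poke at this in the meantime, the “Multiple same specifications” errors can be reproduced outside of finalization by scanning the compiled file_contexts for repeated path specifications (I’m assuming the usual targeted-policy path here; the final/ path in the log only exists during finalization, so it may differ on a running system):

```shell
# Print any path specification that appears more than once in the compiled
# file_contexts; duplicates like these are what setfiles rejects.
fc=/etc/selinux/targeted/contexts/files/file_contexts
if [ -r "$fc" ]; then
    awk '{print $1}' "$fc" | sort | uniq -d
fi
```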
So, this was quite the rabbit hole. Near as I can tell, there is a bug in ostree that is causing it to merge the SELinux policies incorrectly during deployment finalization. I’ve filed a report here.
I have found a workaround, though. Running the following:
sudo semodule -d incus && sudo bootc upgrade
and then rebooting will successfully update the system (the incus SELinux module is re-enabled in the configuration merge, so the semodule command is effectively temporary). However, this does nothing to prevent the problem from happening again in the future. Hopefully some traction on a fix in ostree will resolve that.