@DavidFair
Collaborator

Fixes the Nvidia PCI Alloc error users were seeing on Ubuntu after an unattended upgrade.

A shortlog of the changes included:

  • Add a dedicated folder for fixes, to separate them from the vm_baseline (which is intended as a "minimum set of changes to be compliant with our policy"). This will also make it easier to turn off fixes if we need to troubleshoot or audit.
  • Revert /etc/default/grub and drop the regex (ab)use to make this work
  • Use a new cloud override at a higher precedence than the existing 50-cloudimg-settings.cfg to make sure this flag is respected by grub
  • Add notes on why this fix is required

To test this I also had to update the README and steps to switch to the OpenStack image builder, as many parts of this repo still incorrectly assume QEMU. Without these changes we can't build the images and test that this works:

  • Update build roles to use OpenStack image builder (which already exists) and remove QEMU install steps from prep
  • Delete old autoinstall files for Packer + Ubuntu (thankfully)
  • Switch network ID to Internal on dev instead of prod to steer people somewhere safer to build by default
  • Clean up the README to reflect this
  • Add new targets for existing packer targets

Adds a new role for fixes to our images, e.g. options or files we need
to modify, add, or remove, from the original upstream distro.
This is typically because the generic defaults will cause problems, or
won't be optimal for OpenStack.

This is kept as its own role, as it's not required (like our VM
baseline) but is recommended, so people can choose if they'd like to
use these fixes. E.g. for troubleshooting or to eliminate them as a
potential cause of problems.
@DavidFair added the bug, documentation, and enhancement labels Nov 14, 2025
@DavidFair force-pushed the Fix_nvidia_pci_alloc branch 4 times, most recently from 347a96c to 3f1ec91, November 14, 2025 20:55
Our existing images have a (manual) fix in /etc/default/grub. However,
Ubuntu also ships a 50-cloudimg-settings.cfg file which completely
removes the pci=nocrs,realloc flags. Add a line to bring them back so
unattended upgrades (which run update-grub) don't remove them and cause
GPU driver problems after the next reboot
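
A minimal sketch of the kind of override described above, assuming it is dropped in place with Ansible. The file name `99-pci-realloc.cfg`, the handler name, and the choice of `GRUB_CMDLINE_LINUX_DEFAULT` are illustrative assumptions rather than what this PR necessarily ships. The idea it relies on is that files in /etc/default/grub.d/ are sourced in order, so a name that sorts after 50-cloudimg-settings.cfg gets the last word:

```yaml
# tasks (hypothetical layout)
- name: Add a grub override that sorts after the cloud image defaults
  ansible.builtin.copy:
    # Hypothetical file name; it only needs to sort after 50-cloudimg-settings.cfg
    dest: /etc/default/grub.d/99-pci-realloc.cfg
    owner: root
    group: root
    mode: "0644"
    content: |
      # Re-add the PCI flags that the cloud image defaults drop
      GRUB_CMDLINE_LINUX_DEFAULT="${GRUB_CMDLINE_LINUX_DEFAULT} pci=nocrs,realloc"
  notify: Regenerate grub config

# handlers (hypothetical layout)
- name: Regenerate grub config
  ansible.builtin.command:
    cmd: update-grub
  changed_when: true
```

Appending to the existing variable, rather than overwriting it, keeps whatever console settings the cloud image file already set.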
Add the Ansible galaxy folder to .gitignore so this does not get
committed by accident
Fixes the prep steps by avoiding usage of the deprecated apt-key,
instead using the deb822 format with a named GPG key source. This can
now be done as a single step using deb822_repository, but requires
Ansible 2.15+

Fix various linting things, such as using FQCNs or not looping on apt
when we can simply pass the entire list of packages in a single step
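
As a rough illustration of both points, the sketch below adds a repository with a named key source and then installs a list of packages in one step. It assumes the repository in question is the HashiCorp apt repository (for packer); the repository name and the package list are placeholders, not taken from this PR. The module also needs the python3-debian package on the target host.

```yaml
- name: Add the HashiCorp apt repository with a named key source
  ansible.builtin.deb822_repository:
    name: hashicorp                      # placeholder repository name
    types: deb
    uris: https://apt.releases.hashicorp.com
    suites: "{{ ansible_facts['distribution_release'] }}"
    components: main
    signed_by: https://apt.releases.hashicorp.com/gpg
    state: present

- name: Install build dependencies in a single apt transaction
  ansible.builtin.apt:
    # Placeholder package list; pass the whole list instead of looping
    name:
      - packer
      - git
    state: present
    update_cache: true
```

Passing the list directly lets apt resolve everything in one run, and `update_cache: true` on the same task ensures the newly added repository is indexed before the install.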
The packer builds were moved from KVM to OpenStack (now that various
upstream fixes have landed). However, our docs, prep steps and roles
still assume KVM.

Clean up a lot of the complexity now that OpenStack handles this, and
update the README to reflect the new steps
Adds targets and tags for the current packer builds so multiple builds
can be tested by simply using a tag such as -t all or -t ubuntu
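
A sketch of how such tags might be attached, assuming each packer build is its own task. The template file names, the second target, and the playbook name in the usage note are hypothetical:

```yaml
- name: Build the Ubuntu image
  ansible.builtin.command:
    cmd: packer build ubuntu.pkr.hcl     # hypothetical template name
  changed_when: true
  tags:
    - ubuntu

- name: Build another distro image
  ansible.builtin.command:
    cmd: packer build other.pkr.hcl      # hypothetical second target
  changed_when: true
  tags:
    - other
```

With a hypothetical `build.yml`, `ansible-playbook build.yml -t ubuntu` would run only the first task, while `-t all` uses Ansible's built-in special tag to run every task.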
Ansible will wait for packer indefinitely, so cap the time at 10m and
add some notes to the README on how to troubleshoot this when we do run
into it
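
One way to express a 10 minute cap is the task-level `timeout` keyword, which takes seconds. This is a sketch only; the task name and template file are hypothetical, and the PR may implement the cap differently:

```yaml
- name: Run the packer build with an upper bound on runtime
  ansible.builtin.command:
    cmd: packer build ubuntu.pkr.hcl     # hypothetical template name
  changed_when: true
  # 600 seconds = 10 minutes; Ansible fails the task if it runs longer
  timeout: 600
```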
Rebooting using Ansible will cause a hang in packer, as the builder is
unaware we're going to reboot.

Instead split into two playbooks (this also makes it easier to test
changes, since tidy_images removes SSH keys and logs too) and update the
build script to account for this
This makes it clear when the output image was built, and hopefully
prevents confusion of having multiple "baseline" images.

The scripts to automatically rename and warehouse can also use these
dates to find the latest image, rename the existing one ...etc. simply
by name
Sometimes the packer package isn't found on the CI, depending on
whether the apt cache update fires or not
@DavidFair force-pushed the Fix_nvidia_pci_alloc branch from 3f1ec91 to c53676f, November 14, 2025 21:02
state: present
update_cache: yes
- name: Restore default grub file
# As we incrementally build images theres a mixture of grub files with some subtle bugs
Member

Not sure I understand this comment. We shouldn't be incrementally building images anymore

Member

The process documented in this page is not accurate, unless a decision has been made to change the workflow?

Member

Relates to previous comment

pre_tasks:
  - name: User warning
    ansible.builtin.debug:
      msg: "[Warning] Do not run on non-cloud machine"
Contributor

Is there any way we can make this a check and hard stop?
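
One possible shape for such a hard stop, as a sketch only: check for something that should exist only on a cloud instance and fail the play if it is missing. Using the presence of /run/cloud-init as the signal is an assumption (a heuristic, not a guarantee), and the registered variable name is illustrative:

```yaml
pre_tasks:
  - name: Check for cloud-init runtime state
    ansible.builtin.stat:
      path: /run/cloud-init              # assumed marker of a cloud instance
    register: cloud_init_run

  - name: Stop unless this looks like a cloud machine
    ansible.builtin.assert:
      that:
        - cloud_init_run.stat.exists
      fail_msg: "Refusing to run: no cloud-init runtime state found on this host"
```

Unlike a debug message, a failed `assert` stops the play for that host before any later task runs.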


4 participants