Fix Nvidia PCI Alloc Error #74

DavidFair · 2025-11-14T18:39:23Z

Fixes the Nvidia PCI Alloc error users were seeing on Ubuntu after an unattended upgrade.

A shortlog of the changes included:

Add a dedicated folder for fixes, to separate them from the vm_baseline (which is intended for a "minimum set of changes to be compliant with our policy"). This will also make it easier to turn off fixes if we need to troubleshoot or audit.
Revert /etc/default/grub and drop the regex (ab)use to make this work
Use a new cloud override at a higher precedence than the existing 50-cloudimg-settings.cfg to make sure this flag is respected by grub
Add notes on why this fix is required

To test this I also had to update the README and steps to switch to the OpenStack image builder, as many parts of this repo still assume QEMU incorrectly. Without these we can't build the images and test this works:

Update build roles to use OpenStack image builder (which already exists) and remove QEMU install steps from prep
Delete old autoinstall files for Packer + Ubuntu (thankfully)
Switch network ID to Internal on dev instead of prod to steer people somewhere safer to build by default
Cleanup of readme to reflect this
Add new targets for existing packer targets

Adds a new role for fixes to our images, e.g. options or files we need to modify, add, or remove, from the original upstream distro. This is typically because the generic defaults will cause problems, or won't be optimal for OpenStack. This is kept as its own role, as it's not required (like our VM baseline) but is recommended, so people can choose if they'd like to use these fixes. E.g. for troubleshooting or to eliminate them as a potential cause of problems.

Our existing images have a (manual) fix in /etc/default/grub, however Ubuntu also ships a 50-cloudimg-settings.cfg file which completely removes the pci=nocrs,realloc flags. Add a line to bring them back so unattended upgrades (which run update-grub) don't remove them and cause GPU driver problems after the next reboot.
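A minimal sketch of how such an override could look as a play, not the exact task from this PR. It assumes grub-mkconfig sources /etc/default/grub.d/*.cfg in lexical order, so a higher-numbered file can append to whatever the cloud image file set. The file name 99-nvidia-pci.cfg and the handler name are hypothetical.

```yaml
- name: Keep pci=nocrs,realloc on the kernel command line
  hosts: all
  become: true
  tasks:
    - name: Add a grub.d override that sorts after 50-cloudimg-settings.cfg
      ansible.builtin.copy:
        # Hypothetical file name; any name sorting after 50-* would do.
        dest: /etc/default/grub.d/99-nvidia-pci.cfg
        content: |
          # Append rather than replace, so earlier settings are preserved.
          GRUB_CMDLINE_LINUX_DEFAULT="$GRUB_CMDLINE_LINUX_DEFAULT pci=nocrs,realloc"
        owner: root
        group: root
        mode: "0644"
      notify: Regenerate grub config

  handlers:
    - name: Regenerate grub config
      ansible.builtin.command: update-grub
      changed_when: true
```

Because the flags live in their own file, later runs of update-grub (including those triggered by unattended upgrades) should pick them up again instead of dropping them.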

Add the Ansible galaxy folder to .gitignore so this does not get committed by accident

Fixes the prep steps by avoiding the deprecated apt-key, instead using the new deb822 format with a named GPG key source. This can now be done in a single step using deb822_repository, but requires Ansible 2.15+. Fix various linting issues, such as using FQCNs and not looping on apt when we can simply pass the entire list of packages in a single step.
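As a rough illustration of the pattern described, the following sketch adds a repository with deb822_repository and installs packages in one apt call. The HashiCorp repository and the package list are assumptions chosen for the example; the commit does not state which repository or packages the prep steps use. The module also expects python3-debian to be available on the target.

```yaml
- name: Install build tooling from a deb822 source (sketch)
  hosts: all
  become: true
  tasks:
    - name: Add the repository with a named key source
      ansible.builtin.deb822_repository:
        name: hashicorp  # assumed repository, for illustration only
        types: deb
        uris: https://apt.releases.hashicorp.com
        suites: "{{ ansible_facts['distribution_release'] }}"
        components: main
        signed_by: https://apt.releases.hashicorp.com/gpg
        state: present

    - name: Install all packages in one apt call instead of looping
      ansible.builtin.apt:
        name:  # hypothetical package list
          - packer
          - git
        update_cache: true
        state: present
```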

The packer builds were changed away from KVM to OpenStack (now that various upstream fixes have landed). However, our docs, prep steps and roles still assume KVM. Clean up a lot of the complexity now we've got OpenStack handling this, and update the README to reflect the new steps.

Adds targets and tags for the current packer builds so multiple builds can be tested by simply using a tag such as -t all or -t ubuntu

Ansible will wait for packer indefinitely, so cap the time to 10m and add some notes to the README on how to troubleshoot this when we do run into it
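One way to express a 10 minute cap is the task-level timeout keyword, shown here as a sketch; the commit may implement the limit differently (for example with async and poll). The working directory and template name are hypothetical.

```yaml
- name: Run the packer build with a hard time limit
  ansible.builtin.command:
    cmd: packer build build.pkr.hcl  # hypothetical template name
    chdir: /opt/os_builders          # hypothetical working directory
  timeout: 600  # seconds; the task fails if packer is still running after 10m
  changed_when: true
```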

Rebooting using Ansible will cause a hang in packer, as the builder is unaware we're going to reboot. Instead split into two playbooks (this also makes it easier to test changes, since tidy_images removes SSH keys and logs too) and update the build script to account for this

This makes it clear when the output image was built, and hopefully prevents the confusion of having multiple "baseline" images. The scripts to automatically rename and warehouse can also use these dates to find the latest image, rename the existing one, etc., simply by name

Sometimes the packer package isn't found on the CI depending on if apt cache fires or not

khalford · 2025-11-17T09:26:36Z

os_builders/roles/image_fixes/tasks/nvidia-pci.yml

+      state: present
+      update_cache: yes
+  - name: Restore default grub file
+    # As we incrementally build images theres a mixture of grub files with some subtle bugs


Not sure I understand this comment. We shouldn't be incrementally building images anymore

khalford · 2025-11-17T09:30:26Z

os_builders/README.md

The process documented in this page is not accurate unless a decision has been made to change the workflow?

khalford · 2025-11-17T09:30:53Z

os_builders/playbooks/builder.yml

Relates to previous comment

jacob-ward · 2025-11-17T10:03:42Z

os_builders/playbooks/tidy_image.yml

+  pre_tasks:
+    - name: User warning
+      ansible.builtin.debug:
+        msg: "[Warning] Do not run on non-cloud machine"


Is there any way we can make this a check and hard stop?
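A possible shape for such a check, as a sketch only: it treats the presence of cloud-init instance data as a proxy for "this is a cloud machine", which is an assumption and may not match how the team wants to identify build instances.

```yaml
pre_tasks:
  - name: Look for cloud-init instance data
    ansible.builtin.stat:
      path: /var/lib/cloud/instance  # assumed marker of a cloud-init managed host
    register: cloud_instance

  - name: Stop unless this looks like a cloud machine
    ansible.builtin.assert:
      that:
        - cloud_instance.stat.exists
      fail_msg: "Refusing to run: no cloud-init instance data found on this host"
```

Unlike the debug message, a failed assert stops the play for that host before any tidy-up tasks run.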

DavidFair added the bug, documentation, and enhancement labels Nov 14, 2025

DavidFair force-pushed the Fix_nvidia_pci_alloc branch 4 times, most recently from 347a96c to 3f1ec91 Compare November 14, 2025 20:55

DavidFair added 9 commits November 14, 2025 21:02

MAINT: Add .ansible to gitignore

4ef0707

Add the Ansible galaxy folder to .gitignore so this does not get committed by accident

MAINT: Add tags and targets for existing packer builds

9541c8e

Adds targets and tags for the current packer builds so multiple builds can be tested by simply using a tag such as -t all or -t ubuntu

BUG: Timeout packer build after 10m

5e94fd9

Ansible will wait for packer indefinitely, so cap the time to 10m and add some notes to the README on how to troubleshoot this when we do run into it

BUG: Fix apt cache sometimes being stale on CI

c53676f

Sometimes the packer package isn't found on the CI depending on if apt cache fires or not

DavidFair force-pushed the Fix_nvidia_pci_alloc branch from 3f1ec91 to c53676f Compare November 14, 2025 21:02

