Fix Nvidia PCI Alloc Error #74
base: main
Adds a new role for fixes to our images, e.g. options or files we need to modify, add, or remove relative to the original upstream distro. This is typically because the generic defaults will cause problems or won't be optimal for OpenStack. This is kept as its own role because, unlike our VM baseline, it's recommended rather than required, so people can choose whether to apply these fixes, e.g. when troubleshooting or to eliminate them as a potential cause of problems.
347a96c to 3f1ec91
Our existing images have a (manual) fix in /etc/default/grub, however Ubuntu also ships a 50-cloud-init.cfg file which completely removes the pci=nocrs,realloc options. Add a line to bring them back so that unattended upgrades (which run update-grub) don't remove them and cause GPU driver problems after the next reboot.
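For illustration, one way to keep the options in place is a later drop-in under /etc/default/grub.d/, which is sourced after the distro-shipped file. This is a minimal sketch and not necessarily how the role in this PR does it: the drop-in name 99-pci-realloc.cfg, the play name and the handler name are hypothetical, and it assumes the options belong in GRUB_CMDLINE_LINUX_DEFAULT.

```yaml
# Sketch only: file, play and handler names are assumptions, not taken from this PR.
- name: Keep PCI options across update-grub runs
  hosts: all
  become: true
  tasks:
    - name: Append pci=nocrs,realloc in a drop-in that sorts after 50-*.cfg
      ansible.builtin.copy:
        dest: /etc/default/grub.d/99-pci-realloc.cfg  # hypothetical file name
        content: |
          GRUB_CMDLINE_LINUX_DEFAULT="$GRUB_CMDLINE_LINUX_DEFAULT pci=nocrs,realloc"
        owner: root
        group: root
        mode: "0644"
      notify: Regenerate grub config

  handlers:
    - name: Regenerate grub config
      ansible.builtin.command: update-grub
      changed_when: true
```

Because the drop-in appends to whatever earlier files set, a later package upgrade that rewrites the shipped file should not drop the options.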
Add the Ansible galaxy folder to .gitignore so this does not get committed by accident
Fixes the prep steps by avoiding the deprecated apt-key add, and instead uses the new deb822 format with a named GPG key source. This can now be done in a single step using deb822_repository, but requires Ansible 2.15+. Also fixes various linting issues, such as using FQCNs, and no longer loops on apt when we can simply pass the entire list of packages in a single step.
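As a rough sketch of what this pattern looks like (not the tasks from this PR): the repository shown is the HashiCorp apt repository, chosen here as an assumption because packer is installed from apt, and the package list is a placeholder. ansible.builtin.deb822_repository needs ansible-core 2.15+ and python3-debian on the target.

```yaml
# Sketch only: repository choice and package list are assumptions.
- name: Prepare the build host without apt-key
  hosts: localhost
  become: true
  tasks:
    - name: Add the repository with a named signing key
      ansible.builtin.deb822_repository:
        name: hashicorp
        types: deb
        uris: https://apt.releases.hashicorp.com
        suites: "{{ ansible_distribution_release }}"
        components: main
        signed_by: https://apt.releases.hashicorp.com/gpg
        state: present

    - name: Install all packages in one apt call instead of looping
      ansible.builtin.apt:
        name:
          - packer
          - git  # placeholder for whatever else the prep step needs
        state: present
        update_cache: true
```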
The packer builds were moved from KVM to OpenStack (now that various upstream fixes have landed). However, our docs, prep steps and roles still assume KVM. Clean up a lot of the complexity now that OpenStack handles this, and update the README to reflect the new steps.
Adds targets and tags for the current packer builds so multiple builds can be tested by simply using a tag such as -t all or -t ubuntu
Ansible will wait for packer indefinitely, so cap the time to 10m and add some notes to the README on how to troubleshoot this when we do run into it
Rebooting using Ansible will cause a hang in packer, as the builder is unaware we're going to reboot. Instead, split into two playbooks (this also makes it easier to test changes, since tidy_images removes SSH keys and logs too) and update the build script to account for this
This makes it clear when the output image was built, and hopefully prevents the confusion of having multiple "baseline" images. The scripts that automatically rename and warehouse images can also use these dates to find the latest image, rename the existing one, etc. simply by name
Sometimes the packer package isn't found on the CI, depending on whether the apt cache update fires or not
3f1ec91 to c53676f
        state: present
        update_cache: yes
    - name: Restore default grub file
      # As we incrementally build images theres a mixture of grub files with some subtle bugs
Not sure I understand this comment. We shouldn't be incrementally building images anymore
The process documented in this page is not accurate, unless a decision has been made to change the workflow?
Relates to previous comment
    pre_tasks:
      - name: User warning
        ansible.builtin.debug:
          msg: "[Warning] Do not run on non-cloud machine"
Is there any way we can make this a check and hard stop?
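One possible shape for such a hard stop, as a sketch only: it assumes that the presence of /etc/cloud/cloud.cfg (i.e. cloud-init being installed) is an acceptable heuristic for "this is a cloud machine", which the PR does not state. The play name, task names and register variable are hypothetical.

```yaml
# Sketch only: the cloud-init check is an assumed heuristic, not from this PR.
- name: Apply image fixes
  hosts: all
  gather_facts: false
  pre_tasks:
    - name: Look for the cloud-init configuration
      ansible.builtin.stat:
        path: /etc/cloud/cloud.cfg
      register: cloud_cfg

    - name: Stop if this does not look like a cloud machine
      ansible.builtin.assert:
        that:
          - cloud_cfg.stat.exists
        fail_msg: "Refusing to run: /etc/cloud/cloud.cfg not found, this does not look like a cloud machine"
```

A failed assert aborts the play for that host before any role runs, so nothing is modified on machines that fail the check.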
Fixes the Nvidia PCI Alloc error users were seeing on Ubuntu after an unattended upgrade.
A shortlog of the changes included:
- ... vm_baseline (which is intended for a "minimum set of changes to be compliant with our policy"). This will also make it easier to turn off fixes if we need to troubleshoot or audit.
- ... 50-cloudimg-settings.cfg to make sure this flag is respected by grub

To test this I also had to update the README and steps to switch to the OpenStack image builder, as many parts of this repo still assume QEMU incorrectly. Without these we can't build the images and test this works: