
[26.04_linux-nvidia-bos] NVIDIA: SAUCE: ovl: keep err zero after successful ovl_cache_get() #423

Closed
nirmoy wants to merge 1 commit into NVIDIA:26.04_linux-nvidia-bos from nirmoy:codex/nvbug-6144764-ovl-ptrerr-bos

Conversation

@nirmoy
Collaborator

@nirmoy nirmoy commented May 15, 2026

Summary

Fix NVBug 6144764 by keeping err zero after a successful ovl_cache_get() in ovl_iterate_merged().

The installer crash is an overlayfs readdir failure that occurs while rsync reads through overlayfs during BaseOS/DGX OS installation. The failing path is the same one reported by syzbot a16fb0cce329a320661c: a valid (successful) cache pointer is passed to PTR_ERR(), which truncates the pointer bits into a bogus int that can later be returned as a non-errno value.
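To make the truncation concrete, here is a minimal userspace sketch. ERR_PTR(), PTR_ERR() and IS_ERR() below are simplified stand-ins modeled on the kernel's <linux/err.h> helpers (the kernel's PTR_ERR() returns long), and err_from_ptr() is a hypothetical helper. It assumes an LP64 target (64-bit long, 32-bit int) built with GCC or Clang, where a long-to-int conversion keeps the low 32 bits; the example address is made up.

```c
#include <stdint.h>

/* Userspace stand-ins modeled on the kernel's <linux/err.h> helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	/* Error pointers live in the top MAX_ERRNO values of the address space. */
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

/*
 * Hypothetical helper: what an int err ends up holding when PTR_ERR()
 * is applied to a pointer before IS_ERR() has been checked.
 */
static int err_from_ptr(const void *ptr)
{
	/* long -> int: on LP64 with GCC/Clang only the low 32 bits survive. */
	int err = PTR_ERR(ptr);

	return err;
}
```

For example, on such a target err_from_ptr((void *)(uintptr_t)0xffff888012345600ULL) yields 0x12345600, a positive value that is neither zero nor an errno, while err_from_ptr(ERR_PTR(-12)) still yields -12.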

Bug Links

NVBug Evidence

  • Symptom: the installer can hang or crash while rsyncing data over overlayfs.
  • Reported stack includes ovl_iterate_merged() from getdents64() on 7.0.0-2005-nvidia-bos.
  • Latest bug comment points to the upstream fix thread and asks for PRs.

Validation

  • Cherry-picked cleanly onto upstream/26.04_linux-nvidia-bos.
  • git show --check --format=short HEAD: clean.
  • scripts/checkpatch.pl --strict --ignore COMMIT_LOG_USE_LINK,COMMIT_LOG_LONG_LINE --git HEAD: 0 errors, 0 warnings.
  • Plain scripts/checkpatch.pl --strict --git HEAD warns about the Ubuntu BugLink: line and the long downstream (backported from ...) lore URL line.
  • Earlier validation on arm64 virtme/KVM KASAN:
    • Unpatched and Amir-only controls reproduced the overlayfs crash.
    • patched v2 completed 5/5 runs clean with OVL_SYZ_DONE rc=0 and no Oops/KASAN/panic markers.

Notes

The patch is also posted upstream:

@nirmoy nirmoy marked this pull request as draft May 15, 2026 12:37
@nirmoy nirmoy force-pushed the codex/nvbug-6144764-ovl-ptrerr-bos branch from a5df430 to 302f717
@nirmoy
Collaborator Author

nirmoy commented May 15, 2026

Boro review

Latest watcher review: open review

Head: 124a0b515cae

This comment is maintained by nv-pr-bot. It is updated when the GitHub watcher publishes a newer review.

@nirmoy nirmoy force-pushed the codex/nvbug-6144764-ovl-ptrerr-bos branch from 302f717 to 3fd3bd3
@nirmoy nirmoy changed the title from "ovl: keep err zero after successful ovl_cache_get()" to "NVIDIA: SAUCE: ovl: keep err zero after successful ovl_cache_get()" May 15, 2026
@github-actions
Contributor

github-actions Bot commented May 15, 2026

PR Validation Report

Patchscan ✅ No Missing Fixes

All cherry-picked commits checked — no missing upstream fixes found.

PR Lint ❌ Errors found

Details
Checking 1 commits...

Cherry-pick digest:
E: 124a0b515cae ("NVIDIA: SAUCE: ovl: keep err zero after "): backport trailer order: MISSING: backporter SOB after (backported from)
┌──────────────┬──────────────────────────────────────────────────────────────────┬────────────┬─────────┬───────────────────────────┐
│ Local        │ Referenced upstream / Patch subject                              │ Patch-ID   │ Subject │ SoB chain                 │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 124a0b515cae │ ovl: keep err zero after successful ovl_cache_get()              │ match      │ found   │ MISSING: backporter SOB a │
└──────────────┴──────────────────────────────────────────────────────────────────┴────────────┴─────────┴───────────────────────────┘

Lint: all checks passed.
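The lint error says the backporter's Signed-off-by: must come after the (backported from ...) provenance line. A minimal sketch of that ordering rule, as the message phrases it, is below; backporter_sob_present() is hypothetical and is not the PR Lint tool's actual implementation, which may check more than this.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical check: once a "(backported from" line appears, some later
 * line must be a Signed-off-by: trailer (the backporter's SOB).
 */
static bool backporter_sob_present(const char *const *lines, size_t n)
{
	bool seen_backport = false;
	size_t i;

	for (i = 0; i < n; i++) {
		if (strncmp(lines[i], "(backported from", 16) == 0)
			seen_backport = true;
		else if (seen_backport &&
			 strncmp(lines[i], "Signed-off-by:", 14) == 0)
			return true;
	}

	/* A commit with no provenance line is not this rule's concern. */
	return !seen_backport;
}
```

Given a trailer block where Signed-off-by: precedes (backported from ...), as in the commit message quoted later in this thread, the sketch returns false, which corresponds to the reported "MISSING: backporter SOB" error.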

@nirmoy nirmoy force-pushed the codex/nvbug-6144764-ovl-ptrerr-bos branch from 3fd3bd3 to 7a40987
@nirmoy nirmoy marked this pull request as ready for review May 15, 2026 12:49
@jamieNguyenNVIDIA
Collaborator

Wow, nice find. And Sashiko's happy with the patch: https://sashiko.dev/#/message/20260514144258.3068715-1-nirmoyd%40nvidia.com

Acked-by: Jamie Nguyen <jamien@nvidia.com>

@nirmoy nirmoy force-pushed the codex/nvbug-6144764-ovl-ptrerr-bos branch from 7a40987 to 6e81fbf
@nirmoy nirmoy changed the title from "NVIDIA: SAUCE: ovl: keep err zero after successful ovl_cache_get()" to "[26.04_linux-nvidia-bos] NVIDIA: SAUCE: ovl: keep err zero after successful ovl_cache_get()" May 15, 2026
@nirmoy nirmoy force-pushed the codex/nvbug-6144764-ovl-ptrerr-bos branch from 6e81fbf to 0b2ead9
@nvmochs
Collaborator

nvmochs commented May 15, 2026

Reviewed the patch manually and with Codex, no issues.

Acked-by: Matthew R. Ochs <mochs@nvidia.com>


@nirmoy Couple of follow-up questions...

  • Should we add a comment to that LP that this is likely the fix?
  • Do you think this will be submitted by Amir as a 7.1-rc fix?
  • We'll also need this fix in 7.0-LTS, but maybe we wait a few days to see if it's pulled in quickly so we can pick from upstream rather than carry as SAUCE?
  • FYI, I added this to the tracking SS

@nirmoy
Collaborator Author

nirmoy commented May 15, 2026


  • Should we add a comment to that LP that this is likely the fix?

I did, but it looks like I commented on the original LP (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2150636), while here I linked the newer one.

  • Do you think this will be submitted by Amir as a 7.1-rc fix?
    Amir's branch is based on 7.1-rc2 (https://github.com/amir73il/linux/commits/ovl-fixes/), so I expect it to go in as a 7.1-rc overlayfs fix.
  • We'll also need this fix in 7.0-LTS, but maybe we wait a few days to see if it's pulled in quickly so we can pick from upstream rather than carry as SAUCE?

Ack

  • FYI, I added this to the tracking SS

Thanks

@nirmoy
Collaborator Author

nirmoy commented May 15, 2026

Wow, nice find.

Like Amir said, it was demonic :D

And Sashiko's happy with the patch: https://sashiko.dev/#/message/20260514144258.3068715-1-nirmoyd%40nvidia.com

Acked-by: Jamie Nguyen <jamien@nvidia.com>

Thanks

Collaborator

@clsotog clsotog left a comment


Acked-by: Carol L Soto <csoto@nvidia.com>

We will need this for 26.04_linux-nvidia too.

BugLink: https://bugs.launchpad.net/bugs/2150640

ovl_iterate_merged() stores PTR_ERR(cache) in err before checking
IS_ERR(cache). On success err holds the truncated cache pointer and
can be returned as a bogus non-zero error.

The syzbot reproducer reaches this through overlay-on-overlay readdir:

  getdents64
    iterate_dir(outer overlay file)
      ovl_iterate_merged()
        ovl_cache_get()
          ovl_dir_read_merged()
            ovl_dir_read()
              iterate_dir(inner overlay file)
                ovl_iterate_merged()

Only compute PTR_ERR(cache) on the error path.

Fixes: d25e4b7 ("ovl: refactor ovl_iterate() and port to cred guard")
Reported-by: syzbot+a16fb0cce329a320661c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=a16fb0cce329a320661c
Cc: stable@vger.kernel.org
Signed-off-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
(backported from https://lore.kernel.org/r/20260514144258.3068715-1-nirmoyd@nvidia.com)
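The before/after control flow described in the commit message can be modeled in userspace roughly as follows. This is a simplified sketch, not the actual ovl_iterate_merged() code: iterate_buggy(), iterate_fixed() and struct fake_cache are hypothetical, the err.h-style helpers are userspace stand-ins, and it assumes an LP64 target with GCC/Clang truncation semantics.

```c
#include <stdint.h>

/* Userspace stand-ins modeled on the kernel's <linux/err.h> helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

/* Hypothetical stand-in for the object ovl_cache_get() returns. */
struct fake_cache {
	int entries;
};

/* Buggy shape: PTR_ERR() is computed before the IS_ERR() check. */
static int iterate_buggy(void *cache)
{
	int err;

	err = PTR_ERR(cache);
	if (IS_ERR(cache))
		return err;

	/* ... iterate over the cache; in this model nothing resets err ... */
	return err;	/* success path leaks the truncated pointer */
}

/* Fixed shape: err stays zero unless the cache lookup failed. */
static int iterate_fixed(void *cache)
{
	int err = 0;

	if (IS_ERR(cache))
		return PTR_ERR(cache);	/* only compute PTR_ERR() on the error path */

	/* ... iterate over the cache ... */
	return err;
}
```

Both variants return the same errno on failure; they differ only on success, where the buggy model returns the low 32 bits of the cache pointer instead of 0.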
@nirmoy
Collaborator Author

nirmoy commented May 15, 2026

Acked-by: Carol L Soto <csoto@nvidia.com>

We will need this for 26.04_linux-nvidia too.

I added a draft PR, #425, so we can get that in if the fix is not merged by Monday.

@nvmochs
Collaborator

nvmochs commented May 15, 2026

@nirmoy Two comments...

  • It looks like you added the BugLink and Acks before this was sent to Brad. You may want to sync with him, because he typically adds those pieces as part of his workflow.
  • When the patch was updated, the pick tag was moved after the Acks and your SOB, which is not where it normally belongs. Since Brad already merged this, I'll overlook it this time, but we should follow the established provenance process and workflows to avoid a tooling "oops".

Merged, closing PR.

b49669107f9d (nresolute/nvidia-bos-next) NVIDIA: SAUCE: ovl: keep err zero after successful ovl_cache_get()

@nvmochs nvmochs closed this May 15, 2026