docs: troubleshooting guide for CUDA / Jetson / ROS2 / drone errors (closes #63) #137

Merged: rylinjames merged 1 commit into main from docs/troubleshooting-guide on May 16, 2026

@rylinjames (Collaborator)

Summary

Adds docs/troubleshooting.md — error signatures and fixes for the most common Reflex deployment failures. Closes #63.

Adapted from #129

@DsThakurRawat opened #129 with the same goal. The structure (CUDA/GPU → Jetson → ROS2 → Drone/MAVROS → Export → Registry → Quick diagnostics) and error signatures are theirs and accurate. Preserved via Co-Authored-By on the commit.

What's the same as #129

All the error-message captures and fix patterns — libcudnn_ops_infer.so.8 errors, cudaErrorNoKernelImageForDevice, JetPack version checks, thermal throttling guidance, MAVROS FCU connection, opset rollback, HF_HUB_DOWNLOAD_TIMEOUT, etc. These are real errors users hit and the fix recipes are right.

What changed — version + cross-ref refresh only

| Refresh | Why |
| --- | --- |
| onnxruntime-gpu==1.18.0 → onnxruntime-gpu>=1.25.1 | Current floor per v0.9.2 CHANGELOG (Blackwell sm_120 support) |
| cuDNN floor: 9.0 → 9.5 | Matches nvidia-cudnn-cu12>=9.5 from v0.9.2 |
| Driver floor: R525 → R555 | The v0.9.4 doctor guard pins this; cuDNN 9.5+ requires R555+ |
| Added Blackwell sm_120 section | RTX 5090 / B200 / GB200 trap that bit a real customer for 2 weeks (v0.9.3 doctor guard) |
| Added "First step: run reflex doctor" with the four v0.9.4 guards | Multi-GPU mixed arch, Jetson R35 silent fallback, cuDNN/driver skew, TRT EP empirical session test. Together these surface most of the errors in the guide before they manifest |
| Updated drone state-vector section | #129 said "PR #121 adds auto-detection." #121 was superseded by #133, which introduces an explicit --state-msg-type {joint_state\|imu\|odom} flag. The section now teaches the real flag. |
| Cross-ref understanding_verification.md → verification.md | Renamed in #136 |
| Cross-ref adding_a_robot.md link | Now exists (from #135) |
| Filename TROUBLESHOOTING.md → troubleshooting.md | Sibling convention (eval.md, embodiment_schema.md, verification.md, cli_reference.md, adding_a_robot.md are all lowercase) |
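
For reviewers who want to sanity-check a machine against the refreshed floors by hand, here is a minimal Python sketch. It is not how reflex doctor works (its internals are not shown in this PR); it only compares installed versions with the three floors listed above. The helper names (version_tuple, meets_floor, driver_version, report) are made up for illustration, and reading "R555" as driver major version 555 is an assumption.

```python
import re
import subprocess
from importlib import metadata

# Floors quoted in this PR description; keys are the PyPI package names it mentions.
PACKAGE_FLOORS = {
    "onnxruntime-gpu": "1.25.1",
    "nvidia-cudnn-cu12": "9.5",
}
# Assumption: "R555" is read as NVIDIA driver major version 555.
DRIVER_FLOOR = "555"


def version_tuple(text):
    """Turn '9.5.1.17' or '1.25.1+cu12' into a tuple of leading integers."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_floor(installed, floor):
    """Compare numerically, so that 9.10 counts as newer than 9.5."""
    a, b = version_tuple(installed), version_tuple(floor)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def driver_version():
    """Ask nvidia-smi for the driver version; None if it cannot be queried."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=True, timeout=10,
        ).stdout
    except (OSError, subprocess.SubprocessError):
        return None
    lines = out.strip().splitlines()
    return lines[0].strip() if lines else None


def report():
    """Return (name, installed, floor, ok) for each floor in the table."""
    rows = []
    for name, floor in PACKAGE_FLOORS.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        rows.append((name, installed, floor))
    rows.append(("NVIDIA driver", driver_version(), DRIVER_FLOOR))
    return [
        (name, installed, floor,
         installed is not None and meets_floor(installed, floor))
        for name, installed, floor in rows
    ]


if __name__ == "__main__":
    for name, installed, floor, ok in report():
        status = "OK  " if ok else "FAIL"
        print(f"{status} {name}: installed={installed}, floor={floor}")
```

The comparison is done on integer tuples rather than strings because a plain string compare would rank cuDNN 9.10 below 9.5. A missing package or an unreachable nvidia-smi is reported as a failure rather than raising.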

Closes / supersedes

Test plan

 #63)

Adds docs/troubleshooting.md with the most common error signatures and
fixes Reflex users hit on edge devices, cloud GPUs, ROS2 robots, and
drones. Structure: CUDA/GPU → Jetson → ROS2 bridge → Drone/MAVROS →
Export/validation → Registry → Quick diagnostics.

Adapted from #129 — kept all the original error signatures and fix
patterns (they're real and useful). Only refreshed:
- onnxruntime-gpu pin: 1.18.0 → >=1.25.1 (matches v0.9.2 floor)
- cuDNN floor: 9.0+ → 9.5+ (matches v0.9.2 floor)
- Driver floor: R525+ → R555+ for cuDNN 9.5+ (per v0.9.4 doctor guard)
- Added Blackwell sm_120 section (RTX 5090 / B200) per v0.9.3 guard
- Added the four v0.9.4 reflex doctor guards as the prescribed first
  step on any failure (multi-GPU arch, Jetson R35, cuDNN/driver skew,
  TRT EP empirical session test)
- Replaced #129's reference to PR #121 with the actual shipped
  --state-msg-type flag from #133 (joint_state/imu/odom dispatch)
- Updated cross-ref from understanding_verification.md → verification.md
  (renamed in #136)
- Updated cross-ref to adding_a_robot.md (shipped in #135)
- File renamed TROUBLESHOOTING.md → troubleshooting.md for consistency
  with sibling docs (lowercase, no ALL_CAPS)

Supersedes #129.

Co-Authored-By: Divyansh Rawat <186957976+DsThakurRawat@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rylinjames rylinjames merged commit c05e490 into main May 16, 2026
6 checks passed
@rylinjames rylinjames deleted the docs/troubleshooting-guide branch May 16, 2026 01:30

Development

Successfully merging this pull request may close these issues.

[Docs] Compile TROUBLESHOOTING.md with Common CUDA Errors
