[Docs] Rework Bring Your Own Codegen tutorial and add TensorRT example #19839
Code Review
This pull request significantly updates the Bring Your Own Codegen (BYOC) tutorial to cover both the mock 'example NPU' backend and a real production backend (NVIDIA TensorRT), including an end-to-end example of deploying a PyTorch model. Related documentation and paths are also updated. The feedback suggests using tempfile.TemporaryDirectory() as a context manager instead of tempfile.mkdtemp() to ensure proper cleanup of temporary files and prevent disk pollution.
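The cleanup suggestion above can be illustrated with a minimal standard-library sketch. The helper `write_artifact` and the file name `compiled_module.so` are hypothetical stand-ins for whatever the tutorial actually writes; only the `tempfile` usage is the point.

```python
import os
import tempfile


def write_artifact(directory, name="compiled_module.so"):
    """Hypothetical stand-in for exporting a compiled module to disk."""
    path = os.path.join(directory, name)
    with open(path, "wb") as f:
        f.write(b"placeholder bytes")
    return path


# TemporaryDirectory removes the directory and its contents on exit,
# whereas a directory from tempfile.mkdtemp() persists until it is
# deleted explicitly (for example with shutil.rmtree).
with tempfile.TemporaryDirectory() as tmp_dir:
    artifact = write_artifact(tmp_dir)
    existed_inside = os.path.exists(artifact)

removed_after = not os.path.exists(tmp_dir)
```

Anything that must outlive the block (a loaded module, a result array) has to be produced or loaded before the `with` block exits.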
…deployment

The tutorial taught BYOC with the example NPU and tacked TensorRT on as a separate appendix, with a redundant second NPU example and the NPU-vs-TensorRT differences spread across far-apart sections. Rework it into two parts driven by one shared model:

- "How BYOC works": run a single conv2d+relu through the same FuseOpsByPattern -> MergeCompositeFunctions -> RunCodegen flow on both the example NPU (a stub, so check shape) and TensorRT (real, cross-checked against a CPU build), so the only thing that varies is the backend. partition_for_tensorrt is shown as the one-line wrapper for those two passes, with the bind_constants / stub-vs-real / shape-vs-value contrasts side by side. Add an FP16 example via the relax.ext.tensorrt.options pass config and a summary table; drop the redundant second NPU section.
- "Deploying a PyTorch model with TensorRT": take a real torch.nn.Module through torch.export -> from_exported_program -> partition_for_tensorrt -> build for CUDA -> run, cross-checking the GPU output against PyTorch, then export the compiled module and reload it to show the build-once / run-later deployment path.

This adds the end-to-end nn.Module example requested in apache#19682, plus short notes on operator fallback, dynamic shapes, and engine caching.

Also fix two stale references in the example NPU backend (the README and the runtime \file docstring pointed at src/runtime/contrib/example_npu/ rather than .../extra/...) and reword the README's "Memory constraint checking" bullet (those checks are placeholders that return True); and repoint the dangling docs/deploy/tensorrt.rst reference in cmake/config.cmake at the new tutorial.

Validated end-to-end on a CUDA GPU with TensorRT 10: the example NPU, TensorRT, FP16, PyTorch-deployment, and export/reload cells all run and match their references. Each section degrades gracefully when its backend (or PyTorch) is unavailable.
ddcba0b to 646c3bb
To resolve #19682, this PR reworks the BYOC tutorial into two parts driven by one shared model:
- "How BYOC works": run a single conv2d+relu through the same FuseOpsByPattern -> MergeCompositeFunctions -> RunCodegen flow on both the example NPU (a stub, so check shape) and TensorRT (real, cross-checked against a CPU build), so the only thing that varies is the backend. partition_for_tensorrt is shown as the one-line wrapper for those two passes, with the bind_constants / stub-vs-real / shape-vs-value contrasts side by side. Add an FP16 example via the relax.ext.tensorrt.options pass config and a summary table; drop the redundant second NPU section.
- "Deploying a PyTorch model with TensorRT": take a real torch.nn.Module through torch.export -> from_exported_program -> partition_for_tensorrt -> build for CUDA -> run, cross-checking the GPU output against PyTorch.
This PR also fixes two stale references in the example NPU backend: the README and the runtime's \file docstring pointed at src/runtime/contrib/example_npu/, but the file lives under src/runtime/extra/contrib/example_npu/. It also rewords the README's "Memory constraint checking: Validates tensor sizes" bullet, since _check_npu_memory_constraints and _check_npu_quantization are explicit placeholders that return True.
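For readers unfamiliar with the FuseOpsByPattern -> MergeCompositeFunctions -> RunCodegen flow named in the two parts above, it can be sketched roughly as follows. This is an illustrative sketch, not the tutorial's code: the helper names `tvm_available` and `run_byoc_flow` are hypothetical, TVM is imported lazily so the snippet loads without it, and `patterns` is assumed to be a backend's list of fusion patterns (for instance, what `tvm.relax.backend.pattern_registry.get_patterns_with_prefix` returns for a registered prefix).

```python
import importlib.util


def tvm_available():
    """Return True if the tvm package can be imported (hypothetical helper)."""
    return importlib.util.find_spec("tvm") is not None


def run_byoc_flow(mod, patterns, bind_constants=True):
    """Partition `mod` for an external backend and run that backend's codegen.

    `mod` is a Relax IRModule; `patterns` is the backend's fusion patterns.
    Only the patterns (and hence the backend) change between targets.
    """
    # Imported lazily so this file can be loaded on machines without TVM.
    import tvm
    from tvm import relax

    seq = tvm.transform.Sequential(
        [
            # 1. Wrap each matched subgraph in a composite function.
            relax.transform.FuseOpsByPattern(patterns, bind_constants=bind_constants),
            # 2. Group adjacent composites destined for the same backend.
            relax.transform.MergeCompositeFunctions(),
            # 3. Hand the grouped functions to the external codegen.
            relax.transform.RunCodegen(),
        ]
    )
    return seq(mod)
```

Guarding the call with `if tvm_available():` mirrors the tutorial's stated goal of degrading gracefully when a backend is missing.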
Validated end-to-end on a CUDA GPU with TensorRT 10: the example NPU, TensorRT, FP16, and PyTorch-deployment cells all run and match their references.
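The shape-vs-value distinction behind "match their references" can be made concrete with a small standard-library sketch. The function names and tolerance values are assumptions chosen for illustration, not the tutorial's actual checks: a stub backend can only be checked by output shape, while a real backend is compared element-wise, with a looser tolerance when running in FP16.

```python
def shapes_match(actual_shape, expected_shape):
    """Shape-only check, suitable for a stub backend with no real numerics."""
    return tuple(actual_shape) == tuple(expected_shape)


def values_match(actual, expected, rtol=1e-5, atol=1e-5):
    """Element-wise check over flat sequences of floats.

    Uses the criterion |a - e| <= atol + rtol * |e|. The default
    tolerances are illustrative assumptions.
    """
    if len(actual) != len(expected):
        return False
    return all(abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected))


# A stub backend: only the output shape is meaningful.
stub_ok = shapes_match((1, 16, 32, 32), [1, 16, 32, 32])

# A real backend in FP32: values should agree tightly with the reference.
fp32_ok = values_match([1.0, 2.0], [1.0, 2.000001])

# A reduced-precision run: the same data fails the tight check
# but passes with a looser relative tolerance.
tight_fails = not values_match([1.01], [1.0])
loose_ok = values_match([1.01], [1.0], rtol=5e-2)
```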