Describe the bug
tensorrt-edge-llm v0.7.1 (commit 5136119) loads Qwen3.5-4B-AWQ with 12.9–13.5 GB RAM on Jetson Orin NX 16 GB at maxInputLen=4096, which prevents the typical edge stack (ASR + TTS + LLM co-residency) from fitting on the device. The same hardware comfortably hosts Qwen3-4B-AWQ (pure attention) at ~3 GB, so Qwen3.5-4B-AWQ needs roughly 4× the RAM for the same parameter count. This suggests the cost is specific to the Gated DeltaNet hybrid mamba/attention layout as currently exposed by the runtime, not an inherent property of mamba models.
Impact: blocker for any on-device deployment that needs the LLM and other models to share Orin NX 16 GB unified memory. Note that Qwen/Qwen3.5-4B is listed as officially supported (docs/source/user_guide/getting_started/supported-models.md).
Architecture of the model:
- 32 hidden layers: 24 mamba/SSM (GDN) + 8 attention
- hidden_size=2560, num_attn_heads=16, head_dim=256, kv_heads=8
- vocab_size=248320 (Qwen3.5 expanded vocab)
After exhausting common configuration knobs (full table below), only ~1.5 GB of further savings looks achievable in software (mmap embedding sidecar + reduced vocab). Weight streaming — the canonical TRT lever — is a no-op here because AWQ weights live in plugin/constant buffers and aren't in the streamable pool.
Steps/Code to reproduce bug
Engine build (cross from x86 / native on Orin NX):
python -m llm_loader.export_all_cli \
--src harvestsu/Qwen3.5-4B-AWQ \
--dst ./onnx-mtp \
--mtp
./build/examples/llm/llm_build \
--onnxDir ./onnx-mtp/llm \
--engineDir ./engines/qwen35-4b-awq/base \
--maxBatchSize 1 --maxInputLen 4096 --maxKVCacheCapacity 4096 \
--specBase --maxVerifyTreeSize 8
Resulting artifacts:
eagle_base.engine 2.1 GB
eagle_draft.engine 374 MB
embedding.safetensors 1.27 GB
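As a sanity check on the embedding sidecar, here is a minimal sketch that assumes the file holds a single vocab_size × hidden_size table in FP16 (2 bytes per element). The dtype and layout are assumptions, not something read from the file. Under that assumption the arithmetic lands on the 1.27 GB listed above:

```python
# Back-of-the-envelope size of the embedding table.
# Assumption: one vocab_size x hidden_size matrix stored in FP16 (2 bytes/element).
VOCAB_SIZE = 248320   # from the model architecture listed in this issue
HIDDEN_SIZE = 2560    # from the model architecture listed in this issue
BYTES_PER_ELEM = 2    # assumed FP16


def embedding_bytes(vocab_size: int, hidden_size: int,
                    bytes_per_elem: int = BYTES_PER_ELEM) -> int:
    """Size in bytes of a dense embedding table."""
    return vocab_size * hidden_size * bytes_per_elem


size = embedding_bytes(VOCAB_SIZE, HIDDEN_SIZE)
print(f"{size} bytes = {size / 1e9:.2f} GB")  # 1271398400 bytes = 1.27 GB
```

If the assumption holds, the sidecar is almost entirely the expanded vocabulary, which is why the reduced-vocab and mmap options are the only software levers we see for it.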
Runtime command used:
EDGELLM_PLUGIN_PATH=/path/to/libNvInfer_edgellm_plugin.so \
python -m experimental.server \
--engine-dir /path/to/engines/qwen35-4b-awq/base \
--served-model-name qwen3.5-4b-awq \
--port 8100 --host 0.0.0.0
Observed RAM after engine load (no requests yet):
$ free -m
total used free ...
Mem: 15656 12947 385 ...
TRT log line at engine load:
[INFO] [TensorRT] [MemUsageChange] Init cuBLAS/cuBLASLt: ... now: CPU 79, GPU 11891 (MiB)
→ ~12 GB is pre-allocated in the GPU pool right after cuBLAS init, before any inference. The shared execution context manager later reports its own 2 GB allocation on top.
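To track this figure across builds, the aggregate numbers can be pulled out of the log with a small parser. This is a sketch that assumes every relevant line ends in the same "now: CPU N, GPU N (MiB)" form as the one quoted above; the helper name is ours, not part of the runtime:

```python
import re

# Matches the trailing "now: CPU <n>, GPU <n> (MiB)" part of a MemUsageChange line.
MEM_RE = re.compile(r"\[MemUsageChange\].*?now: CPU (\d+), GPU (\d+) \(MiB\)")


def parse_mem_usage(log_text: str) -> list[tuple[int, int]]:
    """Return (cpu_mib, gpu_mib) for every MemUsageChange line in the log text."""
    return [(int(cpu), int(gpu)) for cpu, gpu in MEM_RE.findall(log_text)]


sample = ("[INFO] [TensorRT] [MemUsageChange] Init cuBLAS/cuBLASLt: ... "
          "now: CPU 79, GPU 11891 (MiB)")
for cpu, gpu in parse_mem_usage(sample):
    print(f"CPU {cpu} MiB, GPU {gpu} MiB ({gpu / 1024:.1f} GiB)")
```

On the line above this prints 11891 MiB, i.e. about 11.6 GiB, which is the "~12 GB" we refer to throughout.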
For comparison, the same hardware and toolchain with Qwen3-4B-AWQ (pure attention, 32 layers, vocab 152064, similar AWQ INT4 + FP16 setup) at maxInputLen=4096 uses ~3 GB, as noted above.
Expected behavior
A 4 B parameter AWQ INT4 model should fit comfortably alongside an ASR engine and a TTS engine on Orin NX 16 GB unified memory. Community guidance for Qwen3.5-4B states that the Q4-quantized model "needs only ~2.5 GB" on disk and fits "4–6 GB GPUs", far from the 12+ GB we measure at runtime. NVIDIA's own Nemotron-H (92% Mamba-2 + 8% attention) is documented as 3× faster than Llama-3.1 at matched accuracy thanks to mamba's constant-memory recurrent state. That expected memory advantage of the hybrid layout does not show up in our v0.7.1 deployment of Qwen3.5-4B.
What we tried (and why it didn't help)
| Lever | Expected saving | Measured saving | Notes |
|---|---|---|---|
| Lower maxInputLen 8192 → 4096 | Linear with seq | −1.0 GB | Workspace pool clearly not pure-seq-linear |
| Disable MTP draft engine | −400–500 MB | −500 MB | Draft is minor |
| TRT Weight Streaming (setWeightStreamingBudgetV2) | −2.5–3.7 GB | ~0 | AWQ weights are in plugin/constant buffers, not in getStreamableWeightsSize() (reports only 2.1 MB streamable for base, 92 KB for draft). All four budget values (unset, min, 1g, off) produced identical RAM / tok-s, variance ≤ 1%. |
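The maxInputLen result can be turned into a rough split between sequence-dependent and sequence-independent memory. The sketch below assumes RAM is affine in maxInputLen (fixed + slope × length), which is our simplification and may not match how the runtime actually sizes its pools. It uses the 12.9 GB lower bound at 4096 and the measured 1.0 GB saving:

```python
def affine_split(ram_short_gb: float, saving_gb: float,
                 short_len: int, long_len: int) -> tuple[float, float]:
    """Two-point affine fit: RAM(len) = fixed + slope * len.

    Returns (sequence-dependent GB at short_len, sequence-independent GB).
    Assumption: memory grows linearly with maxInputLen between the two points.
    """
    slope = saving_gb / (long_len - short_len)   # GB per token of maxInputLen
    seq_part = slope * short_len
    fixed = ram_short_gb - seq_part
    return seq_part, fixed


seq_part, fixed = affine_split(ram_short_gb=12.9, saving_gb=1.0,
                               short_len=4096, long_len=8192)
print(f"sequence-dependent at 4096: {seq_part:.1f} GB, "
      f"sequence-independent: {fixed:.1f} GB")
```

Under that assumption only about 1 GB of the footprint at 4096 scales with sequence length, and roughly 11.9 GB does not, which is why shrinking maxInputLen further would not recover much.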
Hypothesis
The runtime / engine builder appears to size the mamba/SSM scan and state buffers pessimistically for every hidden layer at engine load time, regardless of actual prefill batch shape. With 24 mamba layers at hidden_size=2560 + parallel-scan workspace + Conv state caches, the worst-case pool dominates the 12 GB.
By contrast, pure-attention models like Qwen3-4B only allocate KV cache (a few hundred MB at 4 k context for 8 KV heads) plus regular attention scratch, both of which scale much more gracefully.
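The "few hundred MB" figure can be checked with the usual KV cache formula. This is a sketch assuming an FP16 cache and batch size 1. The head_dim=128 used for the 32-layer pure-attention case is an assumption on our side, since this issue only lists head_dim for Qwen3.5-4B:

```python
MIB = 1024 ** 2


def kv_cache_bytes(attn_layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, bytes_per_elem: int = 2) -> int:
    """KV cache size: K and V tensors for every attention layer (assumed FP16)."""
    return 2 * attn_layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Qwen3.5-4B: only the 8 attention layers hold a KV cache (values from this issue).
hybrid = kv_cache_bytes(attn_layers=8, kv_heads=8, head_dim=256, seq_len=4096)

# Pure-attention 32-layer model; head_dim=128 is an assumption, not from this issue.
dense = kv_cache_bytes(attn_layers=32, kv_heads=8, head_dim=128, seq_len=4096)

print(f"hybrid attention layers: {hybrid // MIB} MiB")  # 256 MiB
print(f"dense 32-layer model:    {dense // MIB} MiB")   # 512 MiB
```

Either way the KV cache at 4 k context stays well under 1 GB, so it cannot account for the gap between ~3 GB and ~12 GB.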
If this is correct, the fix is in how the mamba/GDN layers' workspace pool is sized — potentially making it elastic with the batched prefill shape rather than fixed at maxInputLen worst-case.
System information (Edge Device)
- Platform: NVIDIA Jetson Orin NX 16GB
- Software release: JetPack 6.2
- CPU architecture: aarch64
- GPU compute capability: SM87
- Total device memory: 16 GB (unified)
- Build type: Release
- Library versions:
  - TensorRT Edge-LLM version or commit hash: v0.7.1 + customvoice product layer migration (commit 5136119)
  - CUDA: 12.6.68
  - TensorRT: 10.3.0.30
- Model: harvestsu/Qwen3.5-4B-AWQ-TensorRT-EdgeLLM-engine (private, happy to share access)
- Engine build flags: --maxBatchSize 1 --maxInputLen 4096 --maxKVCacheCapacity 4096 --specBase --maxVerifyTreeSize 8
Asks
- Are mamba/GDN layer workspace allocations meant to be elastic at runtime in v0.7.1, or are they intentionally pre-sized at maxInputLen? If the latter, can a future release expose a knob to clamp this pool independently of maxInputLen (e.g., a mambaPrefillMaxSeq)?
- Can AWQ weights for hybrid models be migrated into the regular TRT streamable weight pool, so setWeightStreamingBudgetV2 becomes effective for AWQ engines? Today getStreamableWeightsSize() returns only ~2 MB on this engine, while ~2 GB of AWQ weights sit in plugin constant buffers.
- Any pointers on profiling the 12 GB pool composition would also be very welcome (e.g., per-layer or per-buffer breakdown); --verbose only shows aggregate MemUsageChange lines.
Reproducer artifacts
- Engine: harvestsu/Qwen3.5-4B-AWQ-TensorRT-EdgeLLM-engine (private, willing to share access)
- Server log + free -m snapshots: available on request
- Full investigation writeup mirrored at docs/known-issues/qwen35-orin-nx-oom.md on our fork suharvest/TensorRT-Edge-LLM, branch v071/customvoice-product
Thanks for the great work on the framework!