[Bug] Qwen3.5-4B-AWQ uses ~13GB RAM on Orin NX 16GB — GDN mamba workspace pool prevents ASR+TTS+LLM co-residency #99

@suharvest

Describe the bug

tensorrt-edge-llm v0.7.1 (commit 5136119) loads Qwen3.5-4B-AWQ with 12.9–13.5 GB RAM on Jetson Orin NX 16 GB at maxInputLen=4096, which prevents the typical edge stack (ASR + TTS + LLM co-residency) from fitting on the device. The same hardware comfortably hosts Qwen3-4B-AWQ (pure attention) at ~3 GB. Qwen3.5-4B-AWQ therefore needs roughly 4× the RAM for the same parameter count, which indicates the cost is specific to the Gated DeltaNet hybrid mamba/attention layout as currently exposed by the runtime, not an inherent property of mamba models.

Impact: blocker for any on-device deployment that needs the LLM and other models to share Orin NX 16 GB unified memory. Note that Qwen/Qwen3.5-4B is listed as officially supported (docs/source/user_guide/getting_started/supported-models.md).

Architecture of the model:

  • 32 hidden layers: 24 mamba/SSM (GDN) + 8 attention
  • hidden_size=2560, num_attn_heads=16, head_dim=256, kv_heads=8
  • vocab_size=248320 (Qwen3.5 expanded vocab)
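For scale, here is a rough estimate of the KV cache that the 8 attention layers need at the 4096-token capacity we build with. It is a back-of-the-envelope sketch: it assumes an FP16 KV cache and batch size 1, and `kv_cache_bytes` is our own helper, not part of the runtime.

```python
def kv_cache_bytes(attn_layers, kv_heads, head_dim, tokens,
                   bytes_per_elem=2, batch=1):
    """Bytes needed for the K and V caches of the attention layers only.

    bytes_per_elem=2 assumes an FP16 cache (assumption, not confirmed
    from the engine).
    """
    # The leading factor 2 covers the separate K and V tensors.
    return 2 * attn_layers * kv_heads * head_dim * tokens * bytes_per_elem * batch


# Values from the architecture list above; 4096 is our maxKVCacheCapacity.
kv = kv_cache_bytes(attn_layers=8, kv_heads=8, head_dim=256, tokens=4096)
print(f"KV cache: {kv / 2**20:.0f} MiB")  # KV cache: 256 MiB
```

Under these assumptions the attention side accounts for only a few hundred MiB, so it cannot explain a multi-GB pool by itself.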

After exhausting common configuration knobs (full table below), only ~1.5 GB of further savings looks achievable in software (mmap embedding sidecar + reduced vocab). Weight streaming — the canonical TRT lever — is a no-op here because AWQ weights live in plugin/constant buffers and aren't in the streamable pool.

Steps/Code to reproduce bug

Engine build (cross from x86 / native on Orin NX):

python -m llm_loader.export_all_cli \
    --src harvestsu/Qwen3.5-4B-AWQ \
    --dst ./onnx-mtp \
    --mtp

./build/examples/llm/llm_build \
    --onnxDir ./onnx-mtp/llm \
    --engineDir ./engines/qwen35-4b-awq/base \
    --maxBatchSize 1 --maxInputLen 4096 --maxKVCacheCapacity 4096 \
    --specBase --maxVerifyTreeSize 8

Resulting artifacts:

  • eagle_base.engine 2.1 GB
  • eagle_draft.engine 374 MB
  • embedding.safetensors 1.27 GB
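As a sanity check on these sizes, the embedding sidecar matches a plain FP16 table of vocab_size × hidden_size, and the three artifacts together come to well under 4 GB on disk. The sketch below only redoes that arithmetic; the FP16 storage format of the sidecar is our assumption.

```python
VOCAB_SIZE = 248_320
HIDDEN_SIZE = 2_560
FP16_BYTES = 2  # assumption: embedding sidecar is stored as FP16

# Expected size of an uncompressed [vocab, hidden] FP16 table.
embedding_bytes = VOCAB_SIZE * HIDDEN_SIZE * FP16_BYTES
print(f"embedding table: {embedding_bytes / 1e9:.2f} GB")  # embedding table: 1.27 GB

# On-disk sizes as listed above, in decimal GB.
artifacts_gb = {
    "eagle_base.engine": 2.1,
    "eagle_draft.engine": 0.374,
    "embedding.safetensors": 1.27,
}
total_gb = sum(artifacts_gb.values())
print(f"artifacts on disk: {total_gb:.2f} GB")
```

Whatever occupies the rest of the observed RAM is therefore not explained by the files themselves.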

Runtime command used:

EDGELLM_PLUGIN_PATH=/path/to/libNvInfer_edgellm_plugin.so \
python -m experimental.server \
    --engine-dir /path/to/engines/qwen35-4b-awq/base \
    --served-model-name qwen3.5-4b-awq \
    --port 8100 --host 0.0.0.0

Observed RAM after engine load (no requests yet):

$ free -m
               total        used        free   ...
Mem:           15656       12947         385   ...

TRT log line at engine load:

[INFO] [TensorRT] [MemUsageChange] Init cuBLAS/cuBLASLt: ... now: CPU 79, GPU 11891 (MiB)

~12 GB is pre-allocated in the GPU pool right after cuBLAS init, before any inference. The shared execution context manager later reports its own 2 GB allocation on top.
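To track these aggregate lines across load stages, a small parser along the following lines can help. It is a sketch that relies only on the `now: CPU …, GPU … (MiB)` tail visible in the line above; the elided middle of the line is deliberately not interpreted, and `parse_mem_usage` is a name of our own.

```python
import re

# Matches the stage label and the trailing "now:" totals of a MemUsageChange line.
_MEM_LINE = re.compile(
    r"\[MemUsageChange\]\s*(?P<stage>[^:]+):.*?"
    r"now:\s*CPU\s+(?P<cpu>\d+),\s*GPU\s+(?P<gpu>\d+)\s*\(MiB\)"
)


def parse_mem_usage(line):
    """Return (stage, cpu_mib, gpu_mib), or None if the line does not match."""
    m = _MEM_LINE.search(line)
    if m is None:
        return None
    return m["stage"].strip(), int(m["cpu"]), int(m["gpu"])


sample = ("[INFO] [TensorRT] [MemUsageChange] Init cuBLAS/cuBLASLt: ... "
          "now: CPU 79, GPU 11891 (MiB)")
print(parse_mem_usage(sample))  # ('Init cuBLAS/cuBLASLt', 79, 11891)
```

Feeding the full server log through this gives one row per stage, which makes it easier to see which step the GPU total jumps at.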

Same hardware, same toolchain, with Qwen3-4B-AWQ (pure attention 32 layers, vocab 152064, similar AWQ INT4 + FP16 setup) at maxInputLen=4096:

Mem:            ~3000   used   ...

Expected behavior

A 4 B parameter AWQ INT4 model should fit comfortably alongside an ASR engine and a TTS engine on Orin NX 16 GB unified memory. The community guidance for Qwen3.5-4B states the Q4-quantized model "needs only ~2.5 GB" disk size and fits "4–6 GB GPUs" — far from the 12+ GB we measure at runtime. NVIDIA's own Nemotron-H (92% Mamba-2 + 8% attention) is documented as 3× faster than Llama-3.1 at matched accuracy thanks to mamba's constant-memory recurrent state. The expected memory advantage of the hybrid layout is not present in our v0.7.1 deployment of Qwen3.5-4B.

What we tried (and why it didn't help)

| Lever | Expected | Measured | Notes |
| --- | --- | --- | --- |
| Lower maxInputLen 8192 → 4096 | Linear with seq | −1.0 GB | Workspace pool clearly not pure-seq-linear |
| Disable MTP draft engine | −400–500 MB | −500 MB | Draft is minor |
| TRT Weight Streaming (setWeightStreamingBudgetV2) | −2.5–3.7 GB | ~0 | AWQ weights are in plugin/constant buffers, not in getStreamableWeightsSize() (reports only 2.1 MB streamable for base, 92 KB for draft). All four budget values (unset, min, 1g, off) produced identical RAM / tok-s, variance ≤ 1%. |
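The streamable-weights figure can also be read from Python. The sketch below assumes the `.engine` file is a plain serialized TensorRT engine and that loading the plugin library with `ctypes` is enough to register the Edge-LLM plugins; we have not confirmed either for this runtime. `memory_report` and `load_engine` are our own helper names; `streamable_weights_size` and `device_memory_size` are `ICudaEngine` attributes in the TensorRT 10 Python API.

```python
def memory_report(engine):
    """Summarise two memory figures exposed by a deserialized engine.

    `engine` is expected to be a tensorrt.ICudaEngine; only attribute
    reads are performed, so any object with the same attributes works.
    """
    mib = 2**20
    return {
        # Weights TensorRT is able to stream from host memory.
        "streamable_weights_mib": engine.streamable_weights_size / mib,
        # Scratch/activation memory an execution context will request.
        "context_scratch_mib": engine.device_memory_size / mib,
    }


def load_engine(engine_path, plugin_path):
    """Deserialize an engine after loading the plugin library (assumed flow)."""
    import ctypes
    import tensorrt as trt  # lazy import: memory_report works without TensorRT

    ctypes.CDLL(plugin_path, mode=ctypes.RTLD_GLOBAL)  # assumption: registers plugins
    runtime = trt.Runtime(trt.Logger(trt.Logger.INFO))
    with open(engine_path, "rb") as f:
        engine = runtime.deserialize_cuda_engine(f.read())
    return runtime, engine  # return both so the runtime outlives the engine


# Usage on the device (paths are placeholders):
#   runtime, engine = load_engine("/path/to/eagle_base.engine",
#                                 "/path/to/libNvInfer_edgellm_plugin.so")
#   print(memory_report(engine))
```

A small streamable figure next to a large context scratch figure would point at activation/workspace sizing rather than weights.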

Hypothesis

The runtime / engine builder appears to size the mamba/SSM scan and state buffers pessimistically for every hidden layer at engine load time, regardless of actual prefill batch shape. With 24 mamba layers at hidden_size=2560 + parallel-scan workspace + Conv state caches, the worst-case pool dominates the 12 GB.

By contrast, pure-attention models like Qwen3-4B only allocate KV cache (a few hundred MB at 4 k context for 8 KV heads) plus regular attention scratch, both of which scale much more gracefully.

If this is correct, the fix is in how the mamba/GDN layers' workspace pool is sized — potentially making it elastic with the batched prefill shape rather than fixed at maxInputLen worst-case.
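To give a feel for the scale involved, the sketch below asks how many full-length `[maxInputLen, hidden_size]` activation buffers the reported GPU pool would correspond to. It assumes FP16 activations and, as a deliberate over-estimate, attributes the whole 11891 MiB to the 24 mamba layers even though the pool also holds weights and other allocations.

```python
HIDDEN_SIZE = 2560
MAX_INPUT_LEN = 4096
MAMBA_LAYERS = 24
FP16_BYTES = 2        # assumption: activations are FP16
GPU_POOL_MIB = 11891  # "now: ... GPU 11891 (MiB)" from the TRT log


def activation_mib(seq_len, hidden=HIDDEN_SIZE, bytes_per_elem=FP16_BYTES):
    """Size of one [seq_len, hidden] activation buffer in MiB."""
    return seq_len * hidden * bytes_per_elem / 2**20


per_buffer = activation_mib(MAX_INPUT_LEN)
buffers_total = GPU_POOL_MIB / per_buffer
buffers_per_layer = buffers_total / MAMBA_LAYERS

print(f"one full-length buffer: {per_buffer:.1f} MiB")        # 20.0 MiB
print(f"equivalent buffers in pool: {buffers_total:.0f}")     # 595
print(f"equivalent per mamba layer: {buffers_per_layer:.1f}") # 24.8

# The same buffer sized for a 512-token prefill instead of the worst case:
print(f"512-token buffer: {activation_mib(512):.1f} MiB")     # 2.5 MiB
```

Roughly 25 worst-case buffers per mamba layer is an upper bound, but it shows why sizing to the actual prefill shape could recover a large share of the pool if the hypothesis holds.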

System information (Edge Device)

  • Platform: NVIDIA Jetson Orin NX 16GB
  • Software release: JetPack 6.2
  • CPU architecture: aarch64
  • GPU compute capability: SM87
  • Total device memory: 16 GB (unified)
  • Build type: Release
  • Library versions:
    • TensorRT Edge-LLM version or commit hash: v0.7.1 + customvoice product layer migration (commit 5136119)
    • CUDA: 12.6.68
    • TensorRT: 10.3.0.30
  • Model: harvestsu/Qwen3.5-4B-AWQ-TensorRT-EdgeLLM-engine (private — happy to share access)
  • Engine build flags: --maxBatchSize 1 --maxInputLen 4096 --maxKVCacheCapacity 4096 --specBase --maxVerifyTreeSize 8

Asks

  1. Are mamba/GDN layer workspace allocations meant to be elastic at runtime in v0.7.1, or are they intentionally pre-sized at maxInputLen? If the latter, can a future release expose a knob to clamp this pool independently of maxInputLen (e.g., a mambaPrefillMaxSeq)?
  2. Can AWQ weights for hybrid models be migrated into the regular TRT streamable weight pool, so setWeightStreamingBudgetV2 becomes effective for AWQ engines? Today getStreamableWeightsSize() returns only ~2 MB on this engine, while ~2 GB of AWQ weights sit in plugin constant buffers.
  3. Any pointers on profiling the 12 GB pool composition would also be very welcome (e.g., per-layer or per-buffer breakdown) — --verbose only shows aggregate MemUsageChange lines.
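In the meantime, the only breakdown we can produce ourselves is system-level: snapshots of /proc/meminfo taken before and after engine load. The helper below is a minimal sketch of that; it works on Jetson because GPU allocations come out of the same unified memory, but it cannot attribute memory to individual layers or buffers. Note that `MemTotal - MemAvailable` is not computed the same way as the "used" column of `free -m`, so the two numbers can differ.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into {field: MiB}, skipping unitless counters."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        # Only "<number> kB" fields are sizes; e.g. HugePages_Total is a count.
        if len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            info[key.strip()] = int(parts[0]) / 1024
    return info


def used_mib(info):
    """Memory not available to new allocations, in MiB."""
    return info["MemTotal"] - info["MemAvailable"]


def snapshot(path="/proc/meminfo"):
    """Read and parse the current memory state (Linux only)."""
    with open(path) as f:
        return parse_meminfo(f.read())


# Usage: take one snapshot before starting the server and one after the
# engine has loaded, then compare:
#   before = snapshot(); ...start server...; after = snapshot()
#   print(used_mib(after) - used_mib(before))
```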

Reproducer artifacts

  • Engine: harvestsu/Qwen3.5-4B-AWQ-TensorRT-EdgeLLM-engine (private — willing to share access)
  • Server log + free -m snapshots: available on request
  • Full investigation writeup mirrored at docs/known-issues/qwen35-orin-nx-oom.md on our fork suharvest/TensorRT-Edge-LLM branch v071/customvoice-product

Thanks for the great work on the framework!
