Skip to content

feat(pi05): pixel-MAE adaptive refresh for temporal KV cache#48

Open
strayberry wants to merge 1 commit into
LiangSu8899:mainfrom
strayberry:feat/adaptive-cache-refresh
Open

feat(pi05): pixel-MAE adaptive refresh for temporal KV cache#48
strayberry wants to merge 1 commit into
LiangSu8899:mainfrom
strayberry:feat/adaptive-cache-refresh

Conversation

@strayberry
Copy link
Copy Markdown
Contributor

@strayberry strayberry commented May 23, 2026

Summary

Add adaptive refresh to Pi0.5 RTX temporal KV cache — the cached encoder K/V can now be flushed early based on an external signal (e.g., pixel-level scene change detection), not just on a fixed schedule. This preserves the latency benefit of cache=N while improving action quality when the visual input changes between scheduled full frames.

Changes

  • flash_rt/frontends/torch/pi05_rtx.pyinfer() gains force_full parameter and returns used_full_pipeline / cache_forced_full flags. Frame-count reset logic ensures a forced refresh starts a new reuse window. No change to the fixed-window scheduling math when force_full is not used.
  • examples/orin/eval_libero.py — new offline benchmark script that auto-downloads a pre-exported LIBERO NPZ (42 MB, 3 episodes × 100 frames, exported via export_libero_npz.py) and runs 4 configs (baseline, cache2, cache3, adaptive_cache3) reporting latency and cosine similarity vs baseline. The NPZ is a pre-exported snapshot of LeRobotDataset frames, avoiding the LeRobot/SIMEP runtime dependency and enabling reproducible offline eval without reloading the dataset.

Prior experiment: full sweep

All 7 configurations tested offline on 300 pre-exported LIBERO observation frames. BF16, threshold=6.5 for adaptive configs.

Config p50 forced_full cos_mean cos_min Retained?
bf16_baseline (cache=1) 218.3 ms baseline baseline yes
bf16_cache2 135.5 ms 0.982403 0.478313 yes — highest fidelity, 7.4 Hz
bf16_cache3 56.6 ms 0.960540 0.122606 yes
bf16_cache4 55.9 ms 0.945754 −0.058609 no — diminishing returns, tail quality drops
bf16_adaptive_cache2 216.4 ms ❌ 34/300 0.983152 0.478313 no — p50 jumps to full-frame latency
bf16_adaptive_cache3 55.6 ms 89/300 0.974735 0.434654 yes — best speed/quality trade-off
bf16_adaptive_cache4 56.6 ms 106/300 0.964295 −0.058609 no — mean improves but tail doesn't

Key finding: adaptive_cache3 is the only clear win — maintains cache3 p50 (~56 ms) while significantly improving quality (cos_mean 0.961→0.975, cos_min 0.123→0.435). adaptive_cache2 is unusable (p50 jumps to full-frame); adaptive_cache4 improves mean but not tail. Fixed cache4 and all adaptive_cache2/4 are not retained as main tracks.

Benchmark (Jetson AGX Orin, 300 frames, BF16)

Final numbers from the PR-branch build:

Config p50 p95 forced_full cos_mean cos_min
bf16_baseline (cache=1) 218.4 ms 218.7 ms baseline baseline
bf16_cache2 135.7 ms 216.8 ms 0.981 0.556
bf16_cache3 55.7 ms 217.2 ms 0.959 −0.371
bf16_adaptive_cache3 55.4 ms 216.6 ms 91/300 0.974 0.534

Results consistent with prior experiment. The adaptive refresh triggers on ~30% of frames (91/300) at the default pixel-MAE threshold of 6.5, catching scene transitions that fixed-window cache3 misses.

Threshold sensitivity (adaptive_cache3, 30 frames)

Threshold p50 forced_full cos_mean cos_min
5.0 216.8 ms ❌ 16/30 0.996 0.896
5.5 215.1 ms ❌ 11/30 0.995 0.896
6.0 57.2 ms ✅ 9/30 0.995 0.889
6.5 55.6 ms 7/30 0.998 0.992
7.0 55.6 ms 7/30 0.998 0.992
7.5 55.4 ms 5/30 0.994 0.896 ❌

The default threshold of 6.5 sits in a stable operating window (6.0–7.0). Below 6.0 the cache refreshes too aggressively, collapsing p50 to full-frame latency. At 7.5 the sweep begins to miss scene transitions (cos_min drops to 0.896), making it unreliable. Thresholds 6.5 and 7.0 produced identical results on this 30-frame sample, confirming the metric is not critically sensitive to the exact value within the operating window.

Note: The optimal threshold is data-dependent and may vary with camera placement, scene complexity, and task characteristics. The pixel-MAE distribution changes with resolution, lighting, and background complexity — deployment on different environments or robot setups may require adjusting this value.

Retained cache routes

Route Use case
bf16_cache2 highest fidelity (cos_mean=0.981), p50=136ms, 7.4 Hz
bf16_adaptive_cache3 best speed/quality trade-off (cos_mean=0.974), p50=55ms, ~18 Hz

Reproduce

# Quick smoke test (30 frames, ~2 min)
python examples/orin/eval_libero.py \
    --checkpoint /path/to/pi05_libero_finetuned_v044

# Full benchmark (300 frames, ~12 min)
python examples/orin/eval_libero.py \
    --checkpoint /path/to/pi05_libero_finetuned_v044 \
    --frames 300

# Threshold sensitivity
for t in 6.0 6.5 7.0; do
    python examples/orin/eval_libero.py \
        --checkpoint /path/to/ckpt --frames 30 \
        --threshold $t --configs bf16_adaptive_cache3
done

The NPZ evaluation dataset downloads automatically on first run (~42 MB).

Test plan

  • Quick test: python examples/orin/eval_libero.py --checkpoint /path/to/ckpt (30 frames, ~2 min)
  • Full eval: python examples/orin/eval_libero.py --checkpoint /path/to/ckpt --frames 300 (~12 min)
  • Verified fixed-window cache scheduling is bit-identical to original when force_full=False
  • _infer_cfg_batched() path unaffected (no temporal cache logic)

infer() gains force_full for early encoder K/V refresh via
pixel-MAE scene signal. Frame-count resets on forced refresh
so a new reuse window starts immediately.

Benchmark (Orin, 300 frames, BF16): adaptive_cache3 matches
cache3 latency (55ms) while raising cos_min from -0.371 to 0.534.
@strayberry strayberry requested a review from LiangSu8899 as a code owner May 23, 2026 07:56
@strayberry strayberry changed the title feat(pi05): adaptive temporal KV cache refresh for RTX feat(pi05): pixel-MAE adaptive refresh for temporal KV cache May 23, 2026
@LiangSu8899
Copy link
Copy Markdown
Owner

Thanks for the contribution, this is a very interesting direction.

One thing I want to be careful about: in my previous Pi0.5 experiments, almost every form of temporal / KV cache reuse caused task-level degradation, even when some offline metrics looked acceptable. I also observed a pretty unstable cosine cliff phenomenon: depending on the input frame, the reference action similarity could suddenly diverge a lot.

As a comparison, OpenVLA-style architectures seem much more tolerant to cache manipulation. My current guess is that this may come from architectural differences in the attention path: Pi0.5 has a tighter vision-language/action coupling through the diffusion/action decoder, so stale visual KV states can directly perturb the action distribution, while OpenVLA’s autoregressive token path may absorb or localize this kind of cache error better. But I’m not fully sure yet.

So I’m very curious here: do you have any closed-loop sim benchmark or real-robot comparison for this adaptive refresh strategy? For example, LIBERO success rate with cache=1 vs cache=3 vs adaptive cache3 would be much more convincing than offline action cosine alone.

--To be clear, a small accuracy drop is not necessarily unacceptable for this feature. What I’m mainly trying to understand is the distribution across a broader test range: does the degradation stay smooth and bounded, or are there unstable cliff cases? In my own tests, this kind of cache reuse showed very input-dependent instability.

Thx!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants