feat(pi05): pixel-MAE adaptive refresh for temporal KV cache by strayberry · Pull Request #48 · LiangSu8899/FlashRT

strayberry · 2026-05-23T07:56:50Z

Summary

Add adaptive refresh to Pi0.5 RTX temporal KV cache — the cached encoder K/V can now be flushed early based on an external signal (e.g., pixel-level scene change detection), not just on a fixed schedule. This preserves the latency benefit of cache=N while improving action quality when the visual input changes between scheduled full frames.

Changes

flash_rt/frontends/torch/pi05_rtx.py — infer() gains force_full parameter and returns used_full_pipeline / cache_forced_full flags. Frame-count reset logic ensures a forced refresh starts a new reuse window. No change to the fixed-window scheduling math when force_full is not used.
examples/orin/eval_libero.py — new offline benchmark script that auto-downloads a pre-exported LIBERO NPZ (42 MB, 3 episodes × 100 frames, exported via export_libero_npz.py) and runs 4 configs (baseline, cache2, cache3, adaptive_cache3) reporting latency and cosine similarity vs baseline. The NPZ is a pre-exported snapshot of LeRobotDataset frames, avoiding the LeRobot/SIMEP runtime dependency and enabling reproducible offline eval without reloading the dataset.
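
The frame-count reset described above can be sketched as a small scheduler. This is an illustration only, not the code in pi05_rtx.py: the class name `ReuseWindowSchedule`, its `step()` method, and the rule that `cache_forced_full` is set only when the refresh was not already scheduled are assumptions. Only `force_full`, `used_full_pipeline`, and `cache_forced_full` are names taken from this PR.

```python
class ReuseWindowSchedule:
    """Hypothetical fixed-window schedule with an optional forced refresh."""

    def __init__(self, cache_n: int) -> None:
        if cache_n < 1:
            raise ValueError("cache_n must be >= 1")
        self.cache_n = cache_n
        self._pos = 0  # position inside the reuse window; 0 means a full frame is due

    def step(self, force_full: bool = False) -> tuple[bool, bool]:
        """Return (used_full_pipeline, cache_forced_full) for the next frame."""
        scheduled_full = self._pos == 0
        used_full_pipeline = scheduled_full or force_full
        # Assumption: the flag marks refreshes that the schedule alone would have skipped.
        cache_forced_full = force_full and not scheduled_full
        if used_full_pipeline:
            # Any full frame, scheduled or forced, becomes slot 0 of a fresh window.
            self._pos = 1 % self.cache_n
        else:
            self._pos = (self._pos + 1) % self.cache_n
        return used_full_pipeline, cache_forced_full


schedule = ReuseWindowSchedule(cache_n=3)
force_flags = [False, True, False, False, False]
print([schedule.step(f) for f in force_flags])
```

With cache_n=3 and a forced refresh on the second frame, the sequence is full, forced full, reuse, reuse, full: the forced frame restarts the window instead of leaving the old schedule in place. When `force_full` is never set, the sketch reduces to the plain every-N-frames pattern.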

Prior experiment: full sweep

All 7 configurations tested offline on 300 pre-exported LIBERO observation frames. BF16, threshold=6.5 for adaptive configs.

Config	p50	forced_full	cos_mean	cos_min	Retained?
bf16_baseline (cache=1)	218.3 ms	—	baseline	baseline	yes
bf16_cache2	135.5 ms	—	0.982403	0.478313	yes — highest fidelity, 7.4 Hz
bf16_cache3	56.6 ms	—	0.960540	0.122606	yes
bf16_cache4	55.9 ms	—	0.945754	−0.058609	no — diminishing returns, tail quality drops
bf16_adaptive_cache2	216.4 ms ❌	34/300	0.983152	0.478313	no — p50 jumps to full-frame latency
bf16_adaptive_cache3	55.6 ms ✅	89/300	0.974735	0.434654	yes — best speed/quality trade-off
bf16_adaptive_cache4	56.6 ms	106/300	0.964295	−0.058609	no — mean improves but tail doesn't

Key finding: adaptive_cache3 is the only clear win — maintains cache3 p50 (~56 ms) while significantly improving quality (cos_mean 0.961→0.975, cos_min 0.123→0.435). adaptive_cache2 is unusable (p50 jumps to full-frame); adaptive_cache4 improves mean but not tail. Fixed cache4 and all adaptive_cache2/4 are not retained as main tracks.
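
For readers unfamiliar with the two quality columns, the sketch below shows one way cos_mean and cos_min could be computed from per-frame actions. It assumes each frame's action output is flattened to a 1-D vector and compared against the baseline (cache=1) output for the same frame. The function names are hypothetical and the actual computation in eval_libero.py may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length action vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_a * norm_b)


def summarize(baseline_actions, cached_actions):
    """Per-frame cosine vs. baseline, reduced to the mean and the worst frame."""
    if len(baseline_actions) != len(cached_actions):
        raise ValueError("both runs must cover the same frames")
    sims = [cosine_similarity(a, b) for a, b in zip(baseline_actions, cached_actions)]
    return {"cos_mean": sum(sims) / len(sims), "cos_min": min(sims)}


# Toy data: the second cached action points in the opposite direction.
baseline = [[1.0, 0.0], [0.0, 1.0]]
cached = [[1.0, 0.0], [0.0, -1.0]]
print(summarize(baseline, cached))  # {'cos_mean': 0.0, 'cos_min': -1.0}
```

A negative cos_min, as in the cache4 rows, means that on at least one frame the cached action pointed more than 90° away from the baseline action.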

Benchmark (Jetson AGX Orin, 300 frames, BF16)

Final numbers from the PR-branch build:

Config	p50	p95	forced_full	cos_mean	cos_min
bf16_baseline (cache=1)	218.4 ms	218.7 ms	—	baseline	baseline
bf16_cache2	135.7 ms	216.8 ms	—	0.981	0.556
bf16_cache3	55.7 ms	217.2 ms	—	0.959	−0.371
bf16_adaptive_cache3	55.4 ms	216.6 ms	91/300	0.974	0.534

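Latency and cos_mean are consistent with the prior experiment. cos_min, which reflects the single worst frame, differs more between runs (fixed cache3: 0.123 in the prior sweep, −0.371 here). The adaptive refresh triggers on ~30% of frames (91/300) at the default pixel-MAE threshold of 6.5, catching scene transitions that fixed-window cache3 misses.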
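
A minimal sketch of a pixel-MAE trigger follows. It assumes frames are uint8 arrays on the 0–255 scale, that the reference is the frame seen at the last full refresh, and that the comparison is a strict greater-than; none of these details is stated in this PR. `pixel_mae` and `should_force_full` are hypothetical helper names.

```python
import numpy as np


def pixel_mae(frame, reference) -> float:
    """Mean absolute per-pixel difference between two same-shaped frames."""
    # Cast to float first so uint8 subtraction cannot wrap around.
    a = np.asarray(frame, dtype=np.float32)
    b = np.asarray(reference, dtype=np.float32)
    if a.shape != b.shape:
        raise ValueError("frame and reference must have the same shape")
    return float(np.mean(np.abs(a - b)))


def should_force_full(frame, reference, threshold: float = 6.5) -> bool:
    """True when the scene changed enough to justify an early K/V refresh."""
    return pixel_mae(frame, reference) > threshold


reference = np.zeros((224, 224, 3), dtype=np.uint8)
small_change = reference + 5    # uniform brightness shift of 5
large_change = reference + 10   # uniform brightness shift of 10

print(should_force_full(small_change, reference))  # False (MAE 5.0)
print(should_force_full(large_change, reference))  # True  (MAE 10.0)
```

The returned flag would then be passed as `force_full` to infer(). How the reference frame is updated after a refresh is left out of the sketch.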

Threshold sensitivity (adaptive_cache3, 30 frames)

Threshold	p50	forced_full	cos_mean	cos_min
5.0	216.8 ms ❌	16/30	0.996	0.896
5.5	215.1 ms ❌	11/30	0.995	0.896
6.0	57.2 ms ✅	9/30	0.995	0.889
6.5	55.6 ms ✅	7/30	0.998	0.992
7.0	55.6 ms ✅	7/30	0.998	0.992
7.5	55.4 ms	5/30	0.994	0.896 ❌

The default threshold of 6.5 sits inside the 6.0–7.0 window in which p50 stays at the cached latency (~56 ms). Below 6.0 the cache refreshes so often that p50 rises to full-frame latency (~216 ms). At 7.5 cos_min falls from 0.992 to 0.896, which suggests the detector is starting to miss scene transitions. Thresholds 6.5 and 7.0 produced identical results on this 30-frame sample, which suggests the outcome is not critically sensitive to the exact value within the operating window.
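
The switch in p50 between ~56 ms and ~216 ms can be illustrated with a two-level latency model: every frame costs either a full-pipeline latency or a cache-reuse latency, and the median depends only on which group holds more than half the frames. The values 218 ms and 56 ms are rounded from the tables above. The function name, the reset rule, and the forced frame index are assumptions made for illustration.

```python
import statistics

FULL_MS = 218.0   # rounded full-pipeline latency from the benchmark tables
REUSE_MS = 56.0   # rounded cache-reuse latency from the benchmark tables


def simulate_latencies(cache_n, n_frames, forced=frozenset()):
    """Per-frame latency for a fixed window of cache_n with forced refreshes."""
    latencies = []
    pos = 0  # position in the reuse window; 0 means a full frame is due
    for i in range(n_frames):
        full = pos == 0 or i in forced
        latencies.append(FULL_MS if full else REUSE_MS)
        # A full frame (scheduled or forced) restarts the window.
        pos = 1 % cache_n if full else (pos + 1) % cache_n
    return latencies


print(statistics.median(simulate_latencies(3, 300)))               # 56.0
print(statistics.median(simulate_latencies(2, 300)))               # 137.0
print(statistics.median(simulate_latencies(2, 300, forced={1})))   # 218.0
```

In this model cache2 sits exactly on the 50/50 boundary, so its median is the midpoint of the two latencies (close to the measured ~136 ms), and a single forced refresh is enough to tip it to the full-frame value. cache3 starts with only a third of its frames on the full pipeline, so it can absorb forced refreshes until full frames exceed half of the total. The same reasoning explains why p95 stays near 217 ms for every cached config: more than 5% of frames always run the full pipeline.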

Note: The optimal threshold is data-dependent and may vary with camera placement, scene complexity, and task characteristics. The pixel-MAE distribution changes with resolution, lighting, and background complexity, so deployment in different environments or on different robot setups may require adjusting this value.

Retained cache routes

Route	Use case
bf16_cache2	highest fidelity (cos_mean=0.981), p50=136ms, 7.4 Hz
bf16_adaptive_cache3	best speed/quality trade-off (cos_mean=0.974), p50=55ms, ~18 Hz

Reproduce

# Quick smoke test (30 frames, ~2 min)
python examples/orin/eval_libero.py \
    --checkpoint /path/to/pi05_libero_finetuned_v044

# Full benchmark (300 frames, ~12 min)
python examples/orin/eval_libero.py \
    --checkpoint /path/to/pi05_libero_finetuned_v044 \
    --frames 300

# Threshold sensitivity
for t in 6.0 6.5 7.0; do
    python examples/orin/eval_libero.py \
        --checkpoint /path/to/ckpt --frames 30 \
        --threshold "$t" --configs bf16_adaptive_cache3
done

The NPZ evaluation dataset downloads automatically on first run (~42 MB).

Test plan

Quick test: python examples/orin/eval_libero.py --checkpoint /path/to/ckpt (30 frames, ~2 min)
Full eval: python examples/orin/eval_libero.py --checkpoint /path/to/ckpt --frames 300 (~12 min)
Verified fixed-window cache scheduling is bit-identical to original when force_full=False
_infer_cfg_batched() path unaffected (no temporal cache logic)

infer() gains force_full for early encoder K/V refresh via pixel-MAE scene signal. Frame-count resets on forced refresh so a new reuse window starts immediately. Benchmark (Orin, 300 frames, BF16): adaptive_cache3 matches cache3 latency (55ms) while raising cos_min from -0.371 to 0.534.

LiangSu8899 · 2026-05-23T09:06:50Z

Thanks for the contribution, this is a very interesting direction.

One thing I want to be careful about: in my previous Pi0.5 experiments, almost every form of temporal / KV cache reuse caused task-level degradation, even when some offline metrics looked acceptable. I also observed a pretty unstable cosine cliff phenomenon: depending on the input frame, the reference action similarity could suddenly diverge a lot.

As a comparison, OpenVLA-style architectures seem much more tolerant to cache manipulation. My current guess is that this may come from architectural differences in the attention path: Pi0.5 has a tighter vision-language/action coupling through the diffusion/action decoder, so stale visual KV states can directly perturb the action distribution, while OpenVLA’s autoregressive token path may absorb or localize this kind of cache error better. But I’m not fully sure yet.

So I’m very curious here: do you have any closed-loop sim benchmark or real-robot comparison for this adaptive refresh strategy? For example, LIBERO success rate with cache=1 vs cache=3 vs adaptive cache3 would be much more convincing than offline action cosine alone.

To be clear, a small accuracy drop is not necessarily unacceptable for this feature. What I’m mainly trying to understand is the distribution across a broader test range: does the degradation stay smooth and bounded, or are there unstable cliff cases? In my own tests, this kind of cache reuse showed very input-dependent instability.

Thx!

strayberry requested a review from LiangSu8899 as a code owner May 23, 2026 07:56

strayberry changed the title ~~feat(pi05): adaptive temporal KV cache refresh for RTX~~ feat(pi05): pixel-MAE adaptive refresh for temporal KV cache May 23, 2026

strayberry wants to merge 1 commit into LiangSu8899:main from strayberry:feat/adaptive-cache-refresh
