feat(pi05): pixel-MAE adaptive refresh for temporal KV cache #48
strayberry wants to merge 1 commit
infer() gains force_full for early encoder K/V refresh via pixel-MAE scene signal. Frame-count resets on forced refresh so a new reuse window starts immediately. Benchmark (Orin, 300 frames, BF16): adaptive_cache3 matches cache3 latency (55ms) while raising cos_min from -0.371 to 0.534.
Thanks for the contribution, this is a very interesting direction.
One thing I want to be careful about: in my previous Pi0.5 experiments, almost every form of temporal / KV cache reuse caused task-level degradation, even when some offline metrics looked acceptable. I also observed a pretty unstable "cosine cliff" phenomenon: depending on the input frame, the similarity to the reference action could suddenly diverge a lot.
As a comparison, OpenVLA-style architectures seem much more tolerant of cache manipulation. My current guess is that this comes from architectural differences in the attention path: Pi0.5 has a tighter vision-language/action coupling through the diffusion/action decoder, so stale visual KV states can directly perturb the action distribution, while OpenVLA’s autoregressive token path may absorb or localize this kind of cache error better. But I’m not fully sure yet.
So I’m very curious here: do you have any closed-loop sim benchmark or real-robot comparison for this adaptive refresh strategy? For example, LIBERO success rate with cache=1 vs cache=3 vs adaptive cache3 would be much more convincing than offline action cosine alone.
To be clear, a small accuracy drop is not necessarily unacceptable for this feature. What I’m mainly trying to understand is the distribution across a broader test range: does the degradation stay smooth and bounded, or are there unstable cliff cases? In my own tests, this kind of cache reuse showed very input-dependent instability.
Thanks!
Summary
Add adaptive refresh to the Pi0.5 RTX temporal KV cache: the cached encoder K/V can now be refreshed early based on an external signal (e.g., pixel-level scene change detection), not just on a fixed schedule. This preserves the latency benefit of cache=N while improving action quality when the visual input changes between scheduled full frames.
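The reuse-window behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in pi05_rtx.py: the ReuseWindow class and its fields are hypothetical, and it assumes cache=N means one full pass followed by up to N-1 frames that reuse the cached encoder K/V. Only the force_full, used_full_pipeline, and cache_forced_full names come from the PR.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ReuseWindow:
    """Hypothetical scheduler: one full pass, then up to `window - 1` cached frames."""

    window: int                  # cache=N; window=1 means every frame is a full pass
    frames_since_full: int = 0   # cached frames served since the last full pass
    has_cache: bool = False      # False until the first full pass has run

    def step(self, force_full: bool = False) -> Tuple[bool, bool]:
        """Return (used_full_pipeline, cache_forced_full) for the next frame."""
        # A full pass is scheduled when there is no cache yet or the window is used up.
        scheduled = (not self.has_cache) or self.frames_since_full >= self.window - 1
        used_full = scheduled or force_full
        # Only count the refresh as "forced" if the schedule would have reused the cache.
        forced = force_full and not scheduled
        if used_full:
            # Resetting the counter here is what starts a new reuse window immediately.
            self.frames_since_full = 0
            self.has_cache = True
        else:
            self.frames_since_full += 1
        return used_full, forced
```

With window=3 and no forcing, this yields the fixed pattern full, reuse, reuse, full. A forced refresh on a reuse frame restarts the count, so the next two frames reuse the fresh K/V instead of hitting the old scheduled boundary.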
Changes
flash_rt/frontends/torch/pi05_rtx.py: infer() gains a force_full parameter and returns used_full_pipeline / cache_forced_full flags. Frame-count reset logic ensures a forced refresh starts a new reuse window. No change to the fixed-window scheduling math when force_full is not used.
examples/orin/eval_libero.py: new offline benchmark script that auto-downloads a pre-exported LIBERO NPZ (42 MB, 3 episodes × 100 frames, exported via export_libero_npz.py) and runs 4 configs (baseline, cache2, cache3, adaptive_cache3), reporting latency and cosine similarity vs baseline. The NPZ is a pre-exported snapshot of LeRobotDataset frames, avoiding the LeRobot/SIMEP runtime dependency and enabling reproducible offline eval without reloading the dataset.
Prior experiment: full sweep
All 7 configurations were tested offline on 300 pre-exported LIBERO observation frames, in BF16, with threshold=6.5 for the adaptive configs.
Key finding:
adaptive_cache3 is the only clear win: it maintains the cache3 p50 (~56 ms) while significantly improving quality (cos_mean 0.961→0.975, cos_min 0.123→0.435). adaptive_cache2 is unusable (p50 jumps to full-frame latency); adaptive_cache4 improves the mean but not the tail. Fixed cache4 and the adaptive_cache2 and adaptive_cache4 configs are not retained as main tracks.
Benchmark (Jetson AGX Orin, 300 frames, BF16)
Final numbers from the PR-branch build:
Results are consistent with the prior experiment. The adaptive refresh triggers on ~30% of frames (91/300) at the default pixel-MAE threshold of 6.5, catching scene transitions that fixed-window cache3 misses.
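For readers unfamiliar with the signal, a pixel-MAE trigger could look roughly like the sketch below. The function names are hypothetical, and three details are assumptions rather than facts from this PR: that frames are uint8 images on a 0–255 scale, that each frame is compared against the immediately preceding frame (rather than the frame at the last full refresh), and that the comparison is a strict greater-than.

```python
import numpy as np


def pixel_mae(prev_frame: np.ndarray, cur_frame: np.ndarray) -> float:
    """Mean absolute per-pixel difference, on the same scale as the inputs."""
    # Cast before subtracting: uint8 arithmetic would wrap around instead of going negative.
    diff = cur_frame.astype(np.float32) - prev_frame.astype(np.float32)
    return float(np.abs(diff).mean())


def should_force_full(prev_frame: np.ndarray, cur_frame: np.ndarray,
                      threshold: float = 6.5) -> bool:
    """Assumed trigger rule: request an early refresh when the MAE exceeds the threshold."""
    return pixel_mae(prev_frame, cur_frame) > threshold
```

The boolean result would then be passed as force_full to infer(), which keeps the scene-change detector outside the model frontend.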
Threshold sensitivity (adaptive_cache3, 30 frames)
The default threshold of 6.5 sits in a stable operating window (6.0–7.0). Below 6.0 the cache refreshes too aggressively, collapsing p50 to full-frame latency. At 7.5 the adaptive refresh begins to miss scene transitions (cos_min drops to 0.896), making it unreliable. Thresholds 6.5 and 7.0 produced identical results on this 30-frame sample, which suggests the metric is not critically sensitive to the exact value within the operating window.
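The cos_mean and cos_min figures quoted here are per-frame cosine similarities between a config's actions and the baseline's. A minimal sketch of that summary follows, assuming each frame's action output is flattened to a 1-D sequence of floats. The helper names are hypothetical, and treating a zero-norm vector as similarity 0.0 is a choice made for this sketch, not something taken from eval_libero.py.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # sketch-only convention for degenerate vectors
    return dot / (norm_a * norm_b)


def cosine_stats(baseline: List[Sequence[float]],
                 candidate: List[Sequence[float]]) -> Tuple[float, float]:
    """Return (cos_mean, cos_min) over paired per-frame action vectors."""
    sims = [cosine(b, c) for b, c in zip(baseline, candidate)]
    return sum(sims) / len(sims), min(sims)
```

Reporting the minimum alongside the mean matters because a single badly diverging frame barely moves the mean but shows up directly in cos_min.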
Retained cache routes
Reproduce
Test plan
python examples/orin/eval_libero.py --checkpoint /path/to/ckpt (30 frames, ~2 min)
python examples/orin/eval_libero.py --checkpoint /path/to/ckpt --frames 300 (~12 min)
force_full=False
_infer_cfg_batched() path unaffected (no temporal cache logic)