[VAS-56] Implement SnapKVEvictionPolicy + pooled-select kernel design #1
Open
vasudev13 wants to merge 3 commits into
Define the web-llm (TypeScript) side of a pluggable, attention-aware KV-cache eviction framework (StreamingLLM/H2O/SnapKV/PyramidKV):

- src/eviction_policy.ts: EvictionPolicyKind, EvictionConfig, EvictionPolicy contract, NoOpEvictionPolicy (default), DEFAULT_EVICTION_CONFIG, isNoOpEviction.
- config.ts: optional ChatConfig.eviction_config, defaulting to no-op so the engine runs unchanged when unset (VAS-52 safety guarantee).
- index.ts: export the new public API surface.
- docs/eviction/EVICTION_BOUNDARY.md: draws the TVM vs MLC-LLM vs web-llm boundary and documents the paged_kv_cache.cc hook points each policy attaches to.

Declarative scaffold only; the TVM/MLC-LLM runtime hook binding depends on the local eviction toolchain (VAS-87) and the attention-score spike (VAS-47).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
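For orientation, the contract listed in this commit might look roughly like the sketch below. Only the identifier names (`EvictionPolicyKind`, `EvictionConfig`, `EvictionPolicy`, `NoOpEvictionPolicy`, `DEFAULT_EVICTION_CONFIG`, `isNoOpEviction`, `eviction_config`) come from the commit message; the field shapes, the string values of the kinds, the `resolve()` method, and the `ChatConfigSketch` name are assumptions made for illustration, not the actual definitions in `src/eviction_policy.ts`.

```typescript
// Hypothetical shapes: only the identifier names are taken from the commit message.
type EvictionPolicyKind = "none" | "streaming_llm" | "h2o" | "snapkv" | "pyramidkv";

interface EvictionConfig {
  kind: EvictionPolicyKind;
  budget?: number; // ratio or absolute token count (assumed field)
  sinkTokens?: number; // attention-sink tokens to retain (assumed field)
}

interface EvictionPolicy {
  readonly kind: EvictionPolicyKind;
  // Assumed method: turn user options into a normalized config.
  resolve(contextLength: number): EvictionConfig;
}

const DEFAULT_EVICTION_CONFIG: EvictionConfig = { kind: "none" };

class NoOpEvictionPolicy implements EvictionPolicy {
  readonly kind: EvictionPolicyKind = "none";

  resolve(_contextLength: number): EvictionConfig {
    // Never evicts: the engine keeps the full cache.
    return { ...DEFAULT_EVICTION_CONFIG };
  }
}

// An unset config is treated the same as an explicit no-op config.
function isNoOpEviction(config?: EvictionConfig): boolean {
  return config === undefined || config.kind === "none";
}

// Stand-in for ChatConfig with the optional field described above.
interface ChatConfigSketch {
  eviction_config?: EvictionConfig;
}

const unsetChatConfig: ChatConfigSketch = {};
const engineRunsUnchanged = isNoOpEviction(unsetChatConfig.eviction_config);
```

Treating `undefined` as no-op in one helper is what would let the engine skip every eviction code path when the field is unset.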
Implement the web-llm (TypeScript) side of the StreamingLLM eviction policy on top of the VAS-52 abstraction.

StreamingLLM retains the first `sinkTokens` attention-sink tokens plus a sliding window of the most-recent tokens and evicts the middle. It adds no new kernels: it maps directly onto TVM's existing `EnableSlidingWindowForSeq(seq_id, window, sink)` (apache/tvm#16729), so it is the differentiation baseline (VAS-86) that the attention-aware policies must beat at equal budget, and it validates the VAS-52 policy-hook plumbing end-to-end.

- src/eviction_policy.ts:
  - StreamingLLMEvictionPolicy: resolves budget (ratio or absolute) to an absolute token count, defaults sinkTokens to 4, derives windowSize = budget - sink (or honors an explicit window), range-checks, and emits a normalized {kind, budget, sinkTokens, windowSize} config.
  - resolveBudgetTokens(): shared ratio<->absolute budget resolver.
  - createEvictionPolicy(): central factory; not-yet-implemented policies (SnapKV/PyramidKV/H2O) throw with a tracking pointer.
  - DEFAULT_SINK_TOKENS constant.
- src/index.ts: export the new public API surface.
- tests/eviction_policy.test.ts: 15 unit tests (budget resolution, window derivation, validation, factory dispatch).
- docs/eviction/EVICTION_BOUNDARY.md: record the StreamingLLM TS policy status and the remaining runtime-binding work (VAS-87).

Runtime binding to EnableSlidingWindowForSeq + sparse position IDs (plan §3.5 Option 1) depends on the local eviction toolchain (VAS-87); until then the policy resolves config but eviction does not physically fire (engine runs full-cache).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
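The budget resolution and window derivation described in this commit can be sketched as follows. This is an illustration under stated assumptions, not the code in the PR: the rule that a budget in (0, 1] is a ratio of the context length while anything larger is an absolute token count, the `resolveStreamingLLM` helper name, the `"streaming_llm"` kind string, and the exact error types are all hypothetical. `resolveBudgetTokens`, `DEFAULT_SINK_TOKENS`, and the output fields are named in the commit message.

```typescript
const DEFAULT_SINK_TOKENS = 4;

// Assumption: budget in (0, 1] is a ratio of contextLength; budget > 1 is an absolute token count.
function resolveBudgetTokens(budget: number, contextLength: number): number {
  if (!Number.isFinite(budget) || budget <= 0) {
    throw new RangeError(`budget must be a positive number, got ${budget}`);
  }
  const tokens = budget <= 1 ? Math.floor(budget * contextLength) : Math.floor(budget);
  return Math.min(tokens, contextLength); // never retain more than the context holds
}

interface StreamingLLMOptions {
  budget: number;
  sinkTokens?: number;
  windowSize?: number;
}

interface ResolvedStreamingLLM {
  kind: "streaming_llm";
  budget: number;
  sinkTokens: number;
  windowSize: number;
}

// Hypothetical helper mirroring what the commit says StreamingLLMEvictionPolicy does.
function resolveStreamingLLM(opts: StreamingLLMOptions, contextLength: number): ResolvedStreamingLLM {
  const budget = resolveBudgetTokens(opts.budget, contextLength);
  const sinkTokens = opts.sinkTokens ?? DEFAULT_SINK_TOKENS;
  // Derive the window from the budget unless the caller pinned one explicitly.
  const windowSize = opts.windowSize ?? budget - sinkTokens;

  if (!Number.isInteger(sinkTokens) || sinkTokens < 0) {
    throw new RangeError(`sinkTokens must be a non-negative integer, got ${sinkTokens}`);
  }
  if (!Number.isInteger(windowSize) || windowSize < 1) {
    throw new RangeError(`windowSize must be a positive integer, got ${windowSize}`);
  }
  if (sinkTokens + windowSize > budget) {
    throw new RangeError(`sinkTokens + windowSize (${sinkTokens + windowSize}) exceeds budget (${budget})`);
  }
  return { kind: "streaming_llm", budget, sinkTokens, windowSize };
}

// A 25% budget on a 4096-token context: 1024 tokens, 4 sinks, 1020-token window.
const streamingExample = resolveStreamingLLM({ budget: 0.25 }, 4096);
```

Under these assumptions, the resolved `windowSize` and `sinkTokens` are the two values that would eventually be handed to `EnableSlidingWindowForSeq(seq_id, window, sink)` once the runtime binding exists.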
★ Primary attention-aware policy for the EMNLP demo. Continues the eviction policy chain after StreamingLLM (VAS-53), following the same declarative-TS-layer + runtime-boundary-doc pattern.

TS layer (src/eviction_policy.ts):

- SnapKVEvictionPolicy: one-shot prefill-time selection, fixed budget after prefill (zero per-decode-step dispatch, the H1/H3 reason SnapKV beats H2O in-browser at batch=1). Resolves budget (ratio/absolute), defaults sink=4 / obsWindow=32 / poolingKernel=7 (SnapKV paper), validates odd pooling kernel and sink+obsWindow < budget.
- New EvictionConfig.poolingKernelSize field (ablation axis, VAS-65) + DEFAULT_OBSERVATION_WINDOW / DEFAULT_POOLING_KERNEL_SIZE.
- createEvictionPolicy() dispatches snapkv; protected ctor so PyramidKV (VAS-57) can subclass. Exported from index.ts.
- 23 unit tests pass (tests/eviction_policy.test.ts).

Runtime design:

- docs/eviction/snapkv_pooled_select.wgsl: pooled-select kernel design reference (obs-window scoring without materializing the full attention matrix → avg-pool → top-K). EVICTION_BOUNDARY.md SnapKV section.
- Kernel binding into a compiled model lib remains gated on VAS-47 (attention-score read path) and VAS-87 (eviction toolchain).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
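The defaults and the two validation rules listed for `SnapKVEvictionPolicy` (odd pooling kernel, sink + obsWindow < budget) can be illustrated with the sketch below. The three constant names and `poolingKernelSize` appear in the commit message; the `resolveSnapKV` helper, the `budgetTokens` and `observationWindow` field names, and the error messages are assumptions, and the budget is taken as an already-resolved absolute token count to keep the example short.

```typescript
const DEFAULT_SINK_TOKENS = 4;
const DEFAULT_OBSERVATION_WINDOW = 32;
const DEFAULT_POOLING_KERNEL_SIZE = 7;

interface SnapKVOptions {
  budgetTokens: number; // absolute budget, assumed already resolved from a ratio
  sinkTokens?: number;
  observationWindow?: number; // hypothetical field name for obsWindow
  poolingKernelSize?: number;
}

interface ResolvedSnapKV {
  kind: "snapkv";
  budget: number;
  sinkTokens: number;
  observationWindow: number;
  poolingKernelSize: number;
}

// Hypothetical helper: apply the defaults, then enforce the two rules from the commit message.
function resolveSnapKV(opts: SnapKVOptions): ResolvedSnapKV {
  const sinkTokens = opts.sinkTokens ?? DEFAULT_SINK_TOKENS;
  const observationWindow = opts.observationWindow ?? DEFAULT_OBSERVATION_WINDOW;
  const poolingKernelSize = opts.poolingKernelSize ?? DEFAULT_POOLING_KERNEL_SIZE;

  // An odd kernel has a well-defined centre, so pooling stays aligned with token positions.
  if (!Number.isInteger(poolingKernelSize) || poolingKernelSize < 1 || poolingKernelSize % 2 === 0) {
    throw new RangeError(`poolingKernelSize must be a positive odd integer, got ${poolingKernelSize}`);
  }
  // Sinks and the observation window are always kept, so the budget must leave
  // at least one slot for tokens chosen by attention score.
  if (sinkTokens + observationWindow >= opts.budgetTokens) {
    throw new RangeError(
      `sinkTokens + observationWindow (${sinkTokens + observationWindow}) must be < budget (${opts.budgetTokens})`,
    );
  }
  return { kind: "snapkv", budget: opts.budgetTokens, sinkTokens, observationWindow, poolingKernelSize };
}

const snapkvExample = resolveSnapKV({ budgetTokens: 1024 });
```

With the defaults, 4 + 32 = 36 slots are reserved, so under this sketch the smallest accepted budget is 37 tokens.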
Implements the ★ primary attention-aware eviction policy (SnapKV) for the EMNLP demo, continuing the policy chain after StreamingLLM (VAS-53). Same pattern: declarative TS layer + runtime boundary/kernel design docs.
What's in

- `SnapKVEvictionPolicy` (`src/eviction_policy.ts`): one-shot prefill-time selection, fixed budget after prefill ⇒ zero per-decode-step dispatch (the H1/H3 reason SnapKV is predicted to beat rolling H2O in-browser at batch=1). Resolves budget (ratio/absolute), defaults sink=4 / obsWindow=32 / poolingKernel=7 (SnapKV paper), validates odd pooling kernel and `sink + obsWindow < budget`.
- `EvictionConfig.poolingKernelSize` field (ablation axis, VAS-65) + `DEFAULT_OBSERVATION_WINDOW` / `DEFAULT_POOLING_KERNEL_SIZE`.
- `createEvictionPolicy()` dispatches `snapkv`; `protected` ctor so PyramidKV (VAS-57) can subclass. Exported from `index.ts`.
- 23 unit tests pass (`tests/eviction_policy.test.ts`).
- `docs/eviction/snapkv_pooled_select.wgsl`: pooled-select kernel design reference (obs-window scoring without materializing the full attention matrix → avg-pool → top-K). `EVICTION_BOUNDARY.md` SnapKV section.

Out of scope / follow-ups

- `paged_kv_cache.cc` prefill-end compaction hook: gated on VAS-47 (attention-score read path) and VAS-87 (eviction toolchain). Until then the policy resolves config but no eviction physically fires (engine runs full-cache), consistent with VAS-53.

Linear: VAS-56
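The pooled-select pipeline named above (obs-window scoring → avg-pool → top-K) can be sketched as a single-head CPU reference in TypeScript. This is not the WGSL design in `docs/eviction/snapkv_pooled_select.wgsl`; the `pooledSelect` function, its signature, the edge handling of the pooling window, and the tie-break rule are assumptions. The input is only the attention rows of the last `obsWindow` queries, which is why the full attention matrix is never needed.

```typescript
// obsAttn[q][j]: softmaxed attention weight from observation-window query q to key position j.
// Shape: [obsWindow][seqLen]. Returns the sorted key positions to keep.
function pooledSelect(
  obsAttn: number[][],
  budget: number,
  sinkTokens = 4,
  poolingKernelSize = 7,
): number[] {
  const obsWindow = obsAttn.length;
  const seqLen = obsAttn[0]?.length ?? 0;
  const prefixLen = seqLen - obsWindow;
  if (budget >= seqLen) return Array.from({ length: seqLen }, (_, i) => i);

  // 1. Observation-window scoring: sum each prefix key's weight over the obs queries.
  const scores = new Array<number>(prefixLen).fill(0);
  for (const row of obsAttn) {
    for (let j = 0; j < prefixLen; j++) scores[j] = (scores[j] ?? 0) + (row[j] ?? 0);
  }

  // 2. Average pooling along the key axis; the window is clipped at both edges.
  const half = Math.floor(poolingKernelSize / 2);
  const pooled = scores.map((_, j) => {
    const lo = Math.max(0, j - half);
    const hi = Math.min(prefixLen - 1, j + half);
    let sum = 0;
    for (let k = lo; k <= hi; k++) sum += scores[k] ?? 0;
    return sum / (hi - lo + 1);
  });

  // 3. Sinks and the observation window are always kept.
  const keep = new Set<number>();
  for (let i = 0; i < Math.min(sinkTokens, prefixLen); i++) keep.add(i);
  for (let i = prefixLen; i < seqLen; i++) keep.add(i);

  // 4. Top-K over the remaining prefix positions (ties go to the earlier position).
  const remaining = Math.max(0, budget - keep.size);
  const candidates: number[] = [];
  for (let j = sinkTokens; j < prefixLen; j++) candidates.push(j);
  candidates.sort((a, b) => (pooled[b] ?? 0) - (pooled[a] ?? 0) || a - b);
  for (const j of candidates.slice(0, remaining)) keep.add(j);

  return [...keep].sort((a, b) => a - b);
}

// Toy run: 10 tokens, 2 observation queries, attention peaked around position 5.
const obsRow = [0.05, 0, 0, 0, 0.05, 0.6, 0.2, 0, 0.05, 0.05];
const keptPositions = pooledSelect([obsRow, obsRow], 5, 1, 3);
```

In the toy run the kept set is the sink (0), the two highest pooled positions (5 and 6), and the observation window (8, 9). Because the selection runs once at the end of prefill, no further scoring is needed during decode, which is the zero per-decode-step dispatch property the description relies on.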
🤖 Generated with Claude Code