[VAS-56] Implement SnapKVEvictionPolicy + pooled-select kernel design #1

Open
vasudev13 wants to merge 3 commits into main from claude/vas-56-snapkv-policy

@vasudev13


Implements the ★ primary attention-aware eviction policy (SnapKV) for the EMNLP demo, continuing the policy chain after StreamingLLM (VAS-53). Same pattern: declarative TS layer + runtime boundary/kernel design docs.

What's in

  • SnapKVEvictionPolicy (src/eviction_policy.ts) — one-shot prefill-time selection, fixed budget after prefill ⇒ zero per-decode-step dispatch (the H1/H3 reason SnapKV is predicted to beat rolling H2O in-browser at batch=1). Resolves budget (ratio/absolute), defaults sink=4 / obsWindow=32 / poolingKernel=7 (SnapKV paper), validates odd pooling kernel and sink + obsWindow < budget.
  • New EvictionConfig.poolingKernelSize field (ablation axis, VAS-65) + DEFAULT_OBSERVATION_WINDOW / DEFAULT_POOLING_KERNEL_SIZE. createEvictionPolicy() dispatches snapkv; protected ctor so PyramidKV (VAS-57) can subclass. Exported from index.ts.
  • 23 unit tests pass (tests/eviction_policy.test.ts).
  • docs/eviction/snapkv_pooled_select.wgsl — pooled-select kernel design reference (obs-window scoring without materializing the full attention matrix → avg-pool → top-K). EVICTION_BOUNDARY.md SnapKV section.
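The defaults and validation rules listed above can be sketched roughly as follows. This is a minimal illustration, not the code in src/eviction_policy.ts: the `SnapKVInputSketch` and `ResolvedSnapKVSketch` shapes, the `resolveSnapKVSketch` name, and the choice to take an already-resolved absolute budget are all assumptions made for the example. Only the default values (4 / 32 / 7) and the two checks come from the description.

```typescript
// Hypothetical input shape; the real EvictionConfig fields may be named differently.
interface SnapKVInputSketch {
  budgetTokens: number; // assumed to be already resolved to an absolute token count
  sinkTokens?: number;
  observationWindow?: number;
  poolingKernelSize?: number;
}

interface ResolvedSnapKVSketch {
  kind: "snapkv";
  budget: number;
  sinkTokens: number;
  observationWindow: number;
  poolingKernelSize: number;
}

// Default values as stated in the PR description (SnapKV paper defaults).
const DEFAULT_SINK_TOKENS = 4;
const DEFAULT_OBSERVATION_WINDOW = 32;
const DEFAULT_POOLING_KERNEL_SIZE = 7;

function resolveSnapKVSketch(input: SnapKVInputSketch): ResolvedSnapKVSketch {
  const sinkTokens = input.sinkTokens ?? DEFAULT_SINK_TOKENS;
  const observationWindow = input.observationWindow ?? DEFAULT_OBSERVATION_WINDOW;
  const poolingKernelSize = input.poolingKernelSize ?? DEFAULT_POOLING_KERNEL_SIZE;

  // An odd kernel keeps the pooling window centered on each token.
  if (
    !Number.isInteger(poolingKernelSize) ||
    poolingKernelSize < 1 ||
    poolingKernelSize % 2 === 0
  ) {
    throw new RangeError("poolingKernelSize must be a positive odd integer");
  }
  // The sinks and the observation window must leave room for selected tokens.
  if (sinkTokens + observationWindow >= input.budgetTokens) {
    throw new RangeError("sinkTokens + observationWindow must be < budget");
  }
  return {
    kind: "snapkv",
    budget: input.budgetTokens,
    sinkTokens,
    observationWindow,
    poolingKernelSize,
  };
}
```

Under these assumptions, `resolveSnapKVSketch({ budgetTokens: 1024 })` fills in all three defaults, while an even `poolingKernelSize` or a budget of 36 tokens (equal to 4 + 32) is rejected.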

Out of scope / follow-ups

  • Runtime binding of the kernel into a compiled model lib + the paged_kv_cache.cc prefill-end compaction hook — gated on VAS-47 (attention-score read path) and VAS-87 (eviction toolchain). Until then the policy resolves config but no eviction physically fires (engine runs full-cache), consistent with VAS-53.

Linear: VAS-56

🤖 Generated with Claude Code

vasudev13 and others added 3 commits June 7, 2026 09:39
Define the web-llm (TypeScript) side of a pluggable, attention-aware KV-cache
eviction framework (StreamingLLM/H2O/SnapKV/PyramidKV):

- src/eviction_policy.ts: EvictionPolicyKind, EvictionConfig, EvictionPolicy
  contract, NoOpEvictionPolicy (default), DEFAULT_EVICTION_CONFIG, isNoOpEviction.
- config.ts: optional ChatConfig.eviction_config, defaulting to no-op so the
  engine runs unchanged when unset (VAS-52 safety guarantee).
- index.ts: export the new public API surface.
- docs/eviction/EVICTION_BOUNDARY.md: draws the TVM vs MLC-LLM vs web-llm
  boundary and documents the paged_kv_cache.cc hook points each policy attaches to.
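
The no-op default described above might look roughly like this. It is a sketch under stated assumptions: the `Sketch`-suffixed names are hypothetical, and the policy kind strings other than `snapkv` are guesses for illustration rather than the values used in src/eviction_policy.ts.

```typescript
// Hypothetical kind strings; only "snapkv" appears in the PR text.
type EvictionPolicyKindSketch = "none" | "streaming_llm" | "h2o" | "snapkv" | "pyramidkv";

interface EvictionConfigSketch {
  kind: EvictionPolicyKindSketch;
}

// Stand-in for DEFAULT_EVICTION_CONFIG: no eviction unless the user opts in.
const DEFAULT_EVICTION_CONFIG_SKETCH: EvictionConfigSketch = { kind: "none" };

// Stand-in for ChatConfig: eviction_config is optional.
interface ChatConfigSketch {
  eviction_config?: EvictionConfigSketch;
}

// Treats an unset config and an explicit "none" the same way.
function isNoOpEvictionSketch(cfg?: EvictionConfigSketch): boolean {
  return cfg === undefined || cfg.kind === "none";
}

// Falls back to the no-op default so the engine path is unchanged when unset.
function effectiveEvictionConfigSketch(chat: ChatConfigSketch): EvictionConfigSketch {
  return chat.eviction_config ?? DEFAULT_EVICTION_CONFIG_SKETCH;
}
```

With this shape, a chat config that omits `eviction_config` resolves to the no-op policy, which is the behavior the VAS-52 safety guarantee asks for.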

Declarative scaffold only; the TVM/MLC-LLM runtime hook binding depends on the
local eviction toolchain (VAS-87) and the attention-score spike (VAS-47).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Implement the web-llm (TypeScript) side of the StreamingLLM eviction policy on
top of the VAS-52 abstraction. StreamingLLM retains the first `sinkTokens`
attention-sink tokens plus a sliding window of the most-recent tokens and evicts
the middle. It adds no new kernels — it maps directly onto TVM's existing
`EnableSlidingWindowForSeq(seq_id, window, sink)` (apache/tvm#16729), so it is
the differentiation baseline (VAS-86) that the attention-aware policies must beat
at equal budget, and it validates the VAS-52 policy-hook plumbing end-to-end.

- src/eviction_policy.ts:
  - StreamingLLMEvictionPolicy: resolves budget (ratio or absolute) to an
    absolute token count, defaults sinkTokens to 4, derives
    windowSize = budget - sink (or honors an explicit window), range-checks, and
    emits a normalized {kind, budget, sinkTokens, windowSize} config.
  - resolveBudgetTokens(): shared ratio<->absolute budget resolver.
  - createEvictionPolicy(): central factory; not-yet-implemented policies
    (SnapKV/PyramidKV/H2O) throw with a tracking pointer.
  - DEFAULT_SINK_TOKENS constant.
- src/index.ts: export the new public API surface.
- tests/eviction_policy.test.ts: 15 unit tests (budget resolution, window
  derivation, validation, factory dispatch).
- docs/eviction/EVICTION_BOUNDARY.md: record the StreamingLLM TS policy status
  and the remaining runtime-binding work (VAS-87).
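
As a rough illustration of the budget resolution and window derivation listed above, consider the sketch below. The `BudgetSketch` union, the `Sketch`-suffixed function names, the `"streaming_llm"` kind string, and the exact range checks are assumptions for the example; the commit only states that a ratio or absolute budget resolves to a token count, that sinkTokens defaults to 4, and that windowSize = budget - sink unless an explicit window is given.

```typescript
// Hypothetical way to express "ratio or absolute"; the real config may differ.
type BudgetSketch = { ratio: number } | { tokens: number };

function resolveBudgetTokensSketch(budget: BudgetSketch, contextLength: number): number {
  if ("ratio" in budget) {
    if (!(budget.ratio > 0 && budget.ratio <= 1)) {
      throw new RangeError("budget ratio must be in (0, 1]");
    }
    // Assumption: round down so the budget never exceeds the requested share.
    return Math.floor(budget.ratio * contextLength);
  }
  if (!Number.isInteger(budget.tokens) || budget.tokens <= 0) {
    throw new RangeError("budget tokens must be a positive integer");
  }
  return budget.tokens;
}

interface ResolvedStreamingLLMSketch {
  kind: "streaming_llm";
  budget: number;
  sinkTokens: number;
  windowSize: number;
}

function resolveStreamingLLMSketch(
  budget: BudgetSketch,
  contextLength: number,
  sinkTokens: number = 4,
  windowSize?: number,
): ResolvedStreamingLLMSketch {
  const budgetTokens = resolveBudgetTokensSketch(budget, contextLength);
  // Derive the sliding window from the budget unless one was given explicitly.
  const window = windowSize ?? budgetTokens - sinkTokens;
  // Assumed range checks: non-negative sinks, non-empty window, fits the budget.
  if (sinkTokens < 0 || window <= 0 || sinkTokens + window > budgetTokens) {
    throw new RangeError("sinkTokens and windowSize must fit within the budget");
  }
  return { kind: "streaming_llm", budget: budgetTokens, sinkTokens, windowSize: window };
}
```

For a 2048-token context and a ratio of 0.5, this sketch yields a 1024-token budget with 4 sink tokens and a 1020-token window.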

Runtime binding to EnableSlidingWindowForSeq + sparse position IDs (plan §3.5
Option 1) depends on the local eviction toolchain (VAS-87); until then the policy
resolves config but eviction does not physically fire (engine runs full-cache).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

★ Primary attention-aware policy for the EMNLP demo. Continues the
eviction policy chain after StreamingLLM (VAS-53), following the same
declarative-TS-layer + runtime-boundary-doc pattern.

TS layer (src/eviction_policy.ts):
- SnapKVEvictionPolicy: one-shot prefill-time selection, fixed budget
  after prefill (zero per-decode-step dispatch: the H1/H3 reason SnapKV is
  predicted to beat rolling H2O in-browser at batch=1). Resolves budget (ratio/absolute),
  defaults sink=4 / obsWindow=32 / poolingKernel=7 (SnapKV paper),
  validates odd pooling kernel and sink+obsWindow < budget.
- New EvictionConfig.poolingKernelSize field (ablation axis, VAS-65) +
  DEFAULT_OBSERVATION_WINDOW / DEFAULT_POOLING_KERNEL_SIZE.
- createEvictionPolicy() dispatches snapkv; protected ctor so PyramidKV
  (VAS-57) can subclass. Exported from index.ts.
- 23 unit tests pass (tests/eviction_policy.test.ts).

Runtime design:
- docs/eviction/snapkv_pooled_select.wgsl: pooled-select kernel design
  reference (obs-window scoring without materializing the full attention
  matrix → avg-pool → top-K). EVICTION_BOUNDARY.md SnapKV section.
- Kernel binding into a compiled model lib remains gated on VAS-47
  (attention-score read path) and VAS-87 (eviction toolchain).
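
The scoring → avg-pool → top-K pipeline can be illustrated with a CPU reference for a single attention head. This is a sketch under stated assumptions, not the WGSL design in docs/eviction/snapkv_pooled_select.wgsl: the function name, the input layout (only the observation-window rows of attention weights, never the full matrix), the edge handling of the pooling window, and the rule that sink and observation-window tokens are always retained are all choices made for the example.

```typescript
// obsAttn[q][j]: attention weight from the q-th observation-window query to key j.
// Shape is [obsWindow][seqLen]; the last obsWindow keys are the window itself.
function pooledSelectSketch(
  obsAttn: number[][],
  budget: number,
  sinkTokens: number,
  kernelSize: number,
): number[] {
  const obsWindow = obsAttn.length;
  const seqLen = obsAttn[0].length;
  const prefixEnd = seqLen - obsWindow;
  const k = budget - sinkTokens - obsWindow;
  if (k <= 0) throw new RangeError("sinkTokens + obsWindow must be < budget");

  // 1. Score: sum the observation-window queries' attention per prefix key.
  const votes: number[] = [];
  for (let j = 0; j < prefixEnd; j++) {
    let sum = 0;
    for (let q = 0; q < obsWindow; q++) sum += obsAttn[q][j];
    votes.push(sum);
  }

  // 2. Avg-pool with an odd kernel; the window is clipped at the prefix edges.
  const half = (kernelSize - 1) / 2;
  const pooled: number[] = [];
  for (let j = 0; j < prefixEnd; j++) {
    const lo = Math.max(0, j - half);
    const hi = Math.min(prefixEnd - 1, j + half);
    let sum = 0;
    for (let t = lo; t <= hi; t++) sum += votes[t];
    pooled.push(sum / (hi - lo + 1));
  }

  // 3. Top-K over the non-sink prefix; ties go to the earlier position.
  const candidates: number[] = [];
  for (let j = sinkTokens; j < prefixEnd; j++) candidates.push(j);
  candidates.sort((a, b) => pooled[b] - pooled[a] || a - b);
  const selected = candidates.slice(0, k);

  // Retained positions: sinks, selected prefix tokens, observation window.
  const kept: number[] = [];
  for (let j = 0; j < Math.min(sinkTokens, prefixEnd); j++) kept.push(j);
  for (const j of selected) kept.push(j);
  for (let j = prefixEnd; j < seqLen; j++) kept.push(j);
  return kept.sort((a, b) => a - b);
}
```

Because pooling spreads a token's score onto its neighbors, a single strongly attended position tends to pull its surrounding span into the retained set, which is the clustering behavior the pooling kernel size ablation (VAS-65) would vary.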

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>