perf(inference): broader Candle inference deletion (deferred from #1273) — audit plasticity LoRA reachability first

**Follow-on to #1273** (Candle Qwen3.5 deletion, PR #1279) — broader Candle inference cleanup deferred from #1273's original scope.

PR #1279 deleted the Qwen3.5-specific Candle path (`Qwen35GgufBackend` + vendored `quantized_qwen35`), but the broader Candle inference chain remains:

- `inference/candle_adapter.rs` (~1500 LOC) — `CandleAdapter` is imported but never instantiated by `AIProviderModule::register_adapters`. Dead in production.
- `inference/model.rs` (~900 LOC) — `ContinuumModel` has no callers outside the inference module.
- `inference/quantized.rs` (~300 LOC) — `load_quantized_model`, `load_default_quantized` only called by `candle_adapter.rs`.
- `inference/backends/{generate, load_gguf_backend}` — only called by the dead chain above + `bin/*` debug binaries.
- `inference/vendored/{quantized_llama, qwen2, compact_llama}.rs` — Candle backends not used by any registered adapter.
- `inference/backends/{llama_gguf, llama_safetensors, qwen2_safetensors}.rs` — used only by the dead chain or by plasticity tests.
- `bin/inference_test.rs`, `bin/test_qwen_gguf.rs`, `bin/diagnose_prefill.rs`, `bin/mixed_quant.rs` — not exposed as Cargo binaries; orphaned source files.

## Why deferred from #1273

The initial #1273 plan was a single atomic deletion. Running `cargo check` after deleting `model.rs` revealed that `backends/llama_safetensors.rs:20` uses `model::rebuild_with_stacked_lora`, and that `modules/plasticity/validation.rs:766` and `:802` use `compact_llama_safetensors::CompactLlamaSafetensorsBackend`. These call sites are test code (`#[test]` blocks), but they are **plasticity tests for LoRA training infrastructure**. Whether plasticity LoRA training is itself a live production path or a vestigial training surface determines the deletion blast radius.

## Acceptance

1. **Audit plasticity LoRA training reachability.** Is `crate::modules::plasticity` exercised by any live IPC handler / CLI command / chat hot path? Or is it test-only / experimental infrastructure that's never invoked in production?
2. **If plasticity is dead in production:** delete the entire Candle chain: `candle_adapter.rs`, `ContinuumModel` (and `model.rs` itself, if everything in that file depends on `ContinuumModel`), `quantized.rs`, the `qwen2/llama` vendored backends, the `llama_gguf/llama_safetensors/qwen2_safetensors` backends, and the orphaned `bin/*` files. Roughly 4000+ LOC in total.
3. **If plasticity is live in production:** scope deletion to only the chain that's truly unreachable — likely `candle_adapter.rs` + `quantized.rs` (since plasticity uses safetensors backends directly, not via CandleAdapter). Document which Candle infrastructure stays and why.
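
A first pass at the reachability audit in step 1 could be a plain text scan for `modules::plasticity` outside the plasticity module itself. The sketch below is only that: `Hit`, `find_mentions`, and `scan_tree` are hypothetical names, not anything in this repo, and a textual match will miss re-exports, relative `use super::` paths, and macro-generated references. It also cannot tell `#[test]` code from production code, so each hit still needs a manual look.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// One textual match for the searched path fragment (hypothetical helper type).
#[derive(Debug, PartialEq)]
pub struct Hit {
    pub line: usize,  // 1-based line number
    pub text: String, // trimmed source line
}

/// Return every line of `source` that mentions `needle`,
/// ignoring lines that start with `//` (line and doc comments).
pub fn find_mentions(source: &str, needle: &str) -> Vec<Hit> {
    source
        .lines()
        .enumerate()
        .filter(|(_, l)| {
            let t = l.trim_start();
            !t.starts_with("//") && t.contains(needle)
        })
        .map(|(i, l)| Hit { line: i + 1, text: l.trim().to_string() })
        .collect()
}

/// Walk `root`, scan every `.rs` file, and skip any path that has a
/// component named `skip` (for example the module being audited).
pub fn scan_tree(root: &Path, needle: &str, skip: &str) -> io::Result<Vec<(PathBuf, Hit)>> {
    let mut out = Vec::new();
    let mut stack = vec![root.to_path_buf()];
    while let Some(dir) = stack.pop() {
        for entry in fs::read_dir(&dir)? {
            let path = entry?.path();
            if path.components().any(|c| c.as_os_str() == skip) {
                continue; // inside the audited module: not an external caller
            }
            if path.is_dir() {
                stack.push(path);
            } else if path.extension().map_or(false, |e| e == "rs") {
                let src = fs::read_to_string(&path)?;
                for hit in find_mentions(&src, needle) {
                    out.push((path.clone(), hit));
                }
            }
        }
    }
    Ok(out)
}
```

An empty result from `scan_tree(src_root, "modules::plasticity", "plasticity")` would be weak evidence for the "dead in production" branch. Any hit in an IPC handler, CLI command, or chat path points toward the scoped deletion in step 3.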

## Why this matters

Per Joel's "if not UI/UX it is rust" rule and the no-CPU-fallback alpha contract, dead Candle infrastructure that contradicts these rules on paper but never executes is the same anti-pattern the #1262 audit was filed to remove. The deferred scope is the bulk of the dead-code mass.

Lane: alpha flywheel #1272 lane 6 (continues #1262 audit work).


perf(inference): broader Candle inference deletion (deferred from #1273) — audit plasticity LoRA reachability first #1280
