Self-improving agentic eval-readiness: code fixes shipped v0.9.1+; remaining gap is model behavior + test infra #15

## Status (verified 2026-05-17)

Original 3 LLM-behavior issues from `docs/plans/2026-04-13-agent-loop-robustness-design.md`:

1. **Sequential enforcement** — SHIPPED v0.9.1 (rejects multiple tool calls, teaches model, 5-retry cap)
2. **Response pagination + filter** — SHIPPED v0.9.1 (page=0 guard, adaptive page size, MCP list-tool filter/summary)
3. **Empty/verbalized response continuation** — SHIPPED v0.9.1 (intent-detection + max-2 retries)
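For reference, item 1 can be pictured roughly as follows. This is a minimal sketch, not the shipped code: `SequentialGuard`, the action strings, the correction wording, and the reset-on-compliance behavior are all assumptions made for illustration. Only the reject-and-teach behavior and the 5-retry cap come from the status line above.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

MAX_SEQUENTIAL_RETRIES = 5  # the 5-retry cap named in item 1

# Hypothetical wording; the shipped correction message is not quoted in this issue.
CORRECTION = (
    "You requested {n} tool calls in one turn. "
    "Call exactly one tool, wait for its result, then continue."
)


@dataclass
class SequentialGuard:
    """Rejects turns that contain more than one tool call, up to a retry cap."""

    retries: int = 0

    def check(self, tool_calls: list[dict]) -> tuple[str, Optional[str]]:
        """Return (action, message); action is 'accept', 'retry', or 'abort'."""
        if len(tool_calls) <= 1:
            self.retries = 0  # assumption: a compliant turn resets the counter
            return "accept", None
        self.retries += 1
        if self.retries > MAX_SEQUENTIAL_RETRIES:
            return "abort", None  # cap exhausted; stop teaching and fail the turn
        # Reject the turn and feed the correction back to the model.
        return "retry", CORRECTION.format(n=len(tool_calls))
```

In this sketch the caller appends the returned message to the conversation on `retry` and ends the run on `abort`.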

Plugin is now at v0.9.3 (workflow v0.51.7 → v0.53.1 bump; authz v0.5.4).

## Remaining gap

Per memory `project_self_improving_agentic_status.md` (2026-04-13):

- **gemma4** loops on file_read (3x identical results) — loop-break heuristic could help.
- **phi4-mini** verbalizes tool calls in text without using the tool-calling protocol — intent detection fires but model doesn't comply.
- **16GB hardware** is too constrained for the 2-model test matrix (load avg 22+).
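The phi4-mini failure mode (a tool call described in text instead of issued through the protocol) is the case intent detection has to catch. A rough sketch of such a check is below. The regex heuristic, the `verbalized_tool` and `next_action` helpers, and the action strings are hypothetical and are not taken from the plugin; only the tool names and the max-2 retry cap come from this issue.

```python
import re
from typing import Optional

KNOWN_TOOLS = ("file_read", "file_write")  # tool names mentioned in this issue
MAX_CONTINUATION_RETRIES = 2  # the max-2 retry cap from the status list

# Illustrative heuristic: a tool name written as a call, e.g. file_read("a.py"),
# or as a JSON-style field, e.g. "name": "file_read".
_VERBALIZED = re.compile(
    r'\b(?P<call>{names})\s*\(|"name"\s*:\s*"(?P<json>{names})"'.format(
        names="|".join(map(re.escape, KNOWN_TOOLS))
    )
)


def verbalized_tool(text: str, tool_calls: list) -> Optional[str]:
    """Return the tool name if the reply describes a call in text without making one."""
    if tool_calls:
        return None  # the model used the protocol; nothing to correct
    match = _VERBALIZED.search(text)
    if match is None:
        return None
    return match.group("call") or match.group("json")


def next_action(text: str, tool_calls: list, retries_used: int) -> str:
    """Decide whether to accept the reply, ask again, or stop retrying."""
    if verbalized_tool(text, tool_calls) is None:
        return "accept"
    if retries_used >= MAX_CONTINUATION_RETRIES:
        return "give_up"  # detection fired but the model did not comply
    return "retry"
```

The `give_up` branch corresponds to what the phi4-mini runs show: detection fires, the retries are spent, and the model still answers in prose.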

These are **model-quality + test-infra** issues, not code issues. The plan's "Next Steps" called for:

- Re-execute on 32GB+ hardware with no competing load.
- Try qwen2.5:7b (already downloaded; may have better tool compliance).
- Try Anthropic Claude or OpenAI via API (cloud models have reliable tool calling).
- Add loop-breaking strategy: when file_read loops 3x, inject "You already read the file. Now modify it and call file_write".
- Consider simpler first-pass: fixed pipeline that reads → LLM modifies → validates → writes (skip agent tool-choice).
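The loop-breaking step in the list above could look something like this. It is a sketch under stated assumptions: `LoopBreaker` and `observe` are hypothetical names, hashing results to compare them is one possible choice, and how the nudge is injected into the conversation is left to the caller. The 3x threshold and the nudge text are taken from the plan.

```python
import hashlib
from collections import deque
from typing import Optional

LOOP_THRESHOLD = 3  # "when file_read loops 3x"
NUDGE = "You already read the file. Now modify it and call file_write."


class LoopBreaker:
    """Detects consecutive file_read calls that return identical results."""

    def __init__(self, threshold: int = LOOP_THRESHOLD) -> None:
        self.threshold = threshold
        self._recent: deque = deque(maxlen=threshold)

    def observe(self, tool_name: str, result: str) -> Optional[str]:
        """Record one tool result; return the nudge once the loop threshold is hit."""
        if tool_name != "file_read":
            self._recent.clear()  # any other tool call breaks the streak
            return None
        digest = hashlib.sha256(result.encode("utf-8")).hexdigest()
        self._recent.append(digest)
        if len(self._recent) == self.threshold and len(set(self._recent)) == 1:
            self._recent.clear()  # start counting again after nudging
            return NUDGE
        return None
```

Comparing digests rather than raw results keeps memory use flat for large files; comparing the call arguments as well would be a stricter variant.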

## Recommendation

Close this item as code-complete. Move eval execution + model-behavior investigation to a separate test-infra epic. Loop-break heuristic is a small followup if/when the eval matrix surfaces a clear repro.

## Why deferred from 2026-05-17 session

Filed during autonomous "continue cycle" mandate. Investigation confirmed the queued "3 remaining LLM behavior issues" are SHIPPED; what remains is hardware + model selection outside the autonomous-pipeline scope.
