Self-improving agentic eval-readiness: code fixes shipped v0.9.1+; remaining gap is model behavior + test infra #15

@intel352

Status (verified 2026-05-17)

Original 3 LLM-behavior issues from `docs/plans/2026-04-13-agent-loop-robustness-design.md`:

  1. Sequential enforcement — SHIPPED v0.9.1 (rejects multiple tool calls, teaches model, 5-retry cap)
  2. Response pagination + filter — SHIPPED v0.9.1 (page=0 guard, adaptive page size, MCP list-tool filter/summary)
  3. Empty/verbalized response continuation — SHIPPED v0.9.1 (intent-detection + max-2 retries)
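For readers unfamiliar with item 1, here is a minimal sketch of what sequential enforcement with a retry cap could look like. It is not the plugin's actual code: `ToolCall`, `enforce_sequential`, and the wording of the corrective message are hypothetical, and only the "reject multiple tool calls, teach the model, 5-retry cap" behavior comes from this issue.

```python
from dataclasses import dataclass, field

MAX_RETRIES = 5  # retry cap stated in this issue


@dataclass
class ToolCall:
    """Hypothetical stand-in for one tool call parsed from a model turn."""
    name: str
    args: dict = field(default_factory=dict)


def enforce_sequential(tool_calls, retries_used):
    """Return (accepted_call, corrective_message, give_up).

    Zero or one call passes through untouched. Several calls in one turn
    are rejected with a message that teaches the one-call-per-turn rule,
    until the retry cap is reached.
    """
    if len(tool_calls) <= 1:
        return (tool_calls[0] if tool_calls else None), None, False
    if retries_used >= MAX_RETRIES:
        return None, None, True
    names = ", ".join(call.name for call in tool_calls)
    message = (
        f"You requested {len(tool_calls)} tool calls ({names}). "
        "Call exactly one tool per turn, then wait for its result."
    )
    return None, message, False
```

The caller would append `corrective_message` to the conversation and increment its own retry counter. Keeping the counter outside the function keeps the check stateless and easy to test.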

Plugin is now at v0.9.3 (workflow v0.51.7 → v0.53.1 bump; authz v0.5.4).

Remaining gap

Per memory `project_self_improving_agentic_status.md` (2026-04-13):

  • gemma4 loops on file_read (3x identical results) — loop-break heuristic could help.
  • phi4-mini verbalizes tool calls in text without using the tool-calling protocol — intent detection fires but model doesn't comply.
  • 16GB hardware is too constrained for the 2-model test matrix (load avg 22+).
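A loop-break heuristic for the gemma4 case could be as small as the sketch below. It is an assumption-laden illustration, not shipped code: `LoopBreaker` and `observe` are hypothetical names, "identical" is taken to mean same tool, same arguments, and same result on consecutive calls, and the nudge wording is borrowed from the plan's "Next Steps".

```python
import hashlib
import json

LOOP_THRESHOLD = 3  # "3x identical results" from this issue

# Nudge wording taken from the plan's "Next Steps".
NUDGE = "You already read the file. Now modify it and call file_write"


class LoopBreaker:
    """Hypothetical detector for consecutive identical file_read calls."""

    def __init__(self, threshold=LOOP_THRESHOLD):
        self.threshold = threshold
        self._last_key = None
        self._streak = 0

    def observe(self, tool_name, args, result):
        """Record one tool call; return a nudge string if a loop is detected."""
        payload = json.dumps([tool_name, args, result], sort_keys=True, default=str)
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if key == self._last_key:
            self._streak += 1
        else:
            self._last_key = key
            self._streak = 1
        if tool_name == "file_read" and self._streak >= self.threshold:
            # Reset so the nudge is injected once per loop, not on every call.
            self._last_key = None
            self._streak = 0
            return NUDGE
        return None
```

Hashing the result as well as the arguments means a re-read after the file has changed does not count toward the streak.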

These are model-quality + test-infra issues, not code issues. The plan's "Next Steps" called for:

  • Re-execute on 32GB+ hardware with no competing load.
  • Try qwen2.5:7b (already downloaded; may have better tool compliance).
  • Try Anthropic Claude or OpenAI via API (cloud models have reliable tool calling).
  • Add loop-breaking strategy: when file_read loops 3x, inject "You already read the file. Now modify it and call file_write".
  • Consider simpler first-pass: fixed pipeline that reads → LLM modifies → validates → writes (skip agent tool-choice).
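The "simpler first-pass" in the last bullet can be sketched as a fixed pipeline in which the model never chooses a tool. Everything named here is hypothetical: `run_fixed_pipeline`, the `modify` and `validate` callables, and the `max_attempts` default are illustration only. `modify` stands in for whatever LLM call produces the edited text.

```python
from pathlib import Path
from typing import Callable, Tuple


def run_fixed_pipeline(
    path: str,
    modify: Callable[[str, str], str],
    validate: Callable[[str], Tuple[bool, str]],
    max_attempts: int = 2,
) -> bool:
    """Read, let the LLM modify, validate, write. Returns True if a write happened.

    modify(original_text, feedback) returns the candidate text.
    validate(candidate) returns (ok, feedback_for_next_attempt).
    """
    original = Path(path).read_text(encoding="utf-8")
    feedback = ""
    for _ in range(max_attempts):
        candidate = modify(original, feedback)
        ok, feedback = validate(candidate)
        if ok:
            # Only validated output ever reaches disk.
            Path(path).write_text(candidate, encoding="utf-8")
            return True
    return False
```

Because the read and write steps are plain code, this shape sidesteps both failure modes listed under "Remaining gap": the model cannot loop on `file_read` and does not need to emit a tool call at all.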

Recommendation

Close this item as code-complete. Move eval execution and model-behavior investigation to a separate test-infra epic. The loop-break heuristic is a small follow-up if and when the eval matrix surfaces a clear repro.

Why deferred from 2026-05-17 session

Filed during autonomous "continue cycle" mandate. Investigation confirmed the queued "3 remaining LLM behavior issues" are SHIPPED; what remains is hardware + model selection outside the autonomous-pipeline scope.
