Progress: core runtime refactor checkpoints #192

Open
shuxueshuxue wants to merge 137 commits into main from pr188-agent-optimize

Conversation

@shuxueshuxue
Collaborator

Summary

  • add a minimal QueryLoop.aget_state/aupdate_state bridge for backend/web callers after the reopened ql-06 regression
  • cover both live caller shapes: resumed-thread __start__ appends and RemoveMessage-based repair updates
  • add backend-facing regression tests for _repair_incomplete_tool_calls() and get_thread_history() so the caller contract stays locked

Test Plan

  • uv run pytest tests/unit/test_loop.py tests/test_query_loop_backend_bridge.py -q
  • a hostile re-review against the backend on :8010 confirmed the original caller-surface blocker no longer reproduces

nmhjklnm and others added 30 commits April 1, 2026 22:48
- Add docs/architecture/ with 11 deep-dive docs covering CC patterns:
  query loop, tool execution, state/agents, security/permissions,
  API/prompt infra, PowerShell, plugins, settings/platform,
  compaction pipeline (4-layer, SM-Compact, Legacy Compact details)
- Add cc-patterns.md master blueprint with LangChain mapping,
  implementation priority roadmap (Phase 1-5), and PARTIAL gap registry
- Refactor core agent modules: chat_tool_service, delivery, service,
  agent runtime, registry, filesystem/search/wechat tool services
- Add core/runtime/prompts.py
- Phase 1: slim system prompt — move tool usage guidance to descriptions,
  keep only sub-agent type routing in system prompt
- Phase 2: rewrite all tool descriptions to convey non-intuitive boundary
  conditions (Read/Write/Edit/Glob/Grep/Bash/Agent/WebSearch/WebFetch/
  TaskOutput/TaskStop/TaskCreate/tool_search/load_skill)
- Phase 3: add pages param to Read schema; add line_numbers param to Grep
  schema and handler; add subagent_type enum to Agent schema
- Phase 4: mark WebSearch/WebFetch/tool_search/load_skill/TaskGet/TaskList/
  wechat_contacts as is_concurrency_safe + is_read_only
- Phase 5: sub-agent tool filtering — AGENT_DISALLOWED/EXPLORE_ALLOWED/
  PLAN_ALLOWED/BASH_ALLOWED constants; LeonAgent accepts extra_blocked_tools
  and allowed_tools; _run_agent applies per-type filters
- Phase 6: add LSP placeholder to tool_catalog (deferred, default=False)
- Extras: search_hint for Agent/TaskOutput/TaskStop/chat tools/wechat_send;
  TaskOutput marked is_read_only; Edit description adds .ipynb workaround;
  fix prompt caching to place cache_control on system_message content block;
  add forkContext parent message inheritance with _filter_fork_messages;
  expose set_current_messages ContextVar for sub-agent context passing
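The Phase 5 per-type filtering described above can be sketched roughly as follows. The constant values and the `filter_tools` name here are illustrative stand-ins for the repo's actual `AGENT_DISALLOWED`/`EXPLORE_ALLOWED`/`PLAN_ALLOWED` definitions, not the real code:

```python
# Illustrative sets only; the real constants live in the agent runtime module.
AGENT_DISALLOWED = {"TaskCreate", "TaskStop"}
EXPLORE_ALLOWED = {"Read", "Glob", "Grep", "Bash", "WebSearch", "WebFetch"}
PLAN_ALLOWED = EXPLORE_ALLOWED | {"Agent"}

def filter_tools(all_tools: set[str], subagent_type: str) -> set[str]:
    """Per-type filter: intersect with the type's allowlist (if one exists),
    then subtract the globally disallowed tools."""
    allowed = {"explore": EXPLORE_ALLOWED, "plan": PLAN_ALLOWED}.get(
        subagent_type, all_tools
    )
    return (all_tools & allowed) - AGENT_DISALLOWED
```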
- Add --max-columns 500 to suppress minified/base64 output
- Add missing VCS excludes: .svn, .hg, .bzr, .jj, .sl
- Default head_limit 250 (matches CC's undocumented cap)
Registers a DEFERRED LSP tool providing code intelligence:
goToDefinition, findReferences, hover, documentSymbol, workspaceSymbol.

- _LSPSession: holds multilspy LanguageServer alive in a background asyncio
  task using start_server() context manager + Event-based lifecycle control
- LSPService: lazy per-language session pool, auto-detects language from
  file extension, converts absolute paths to workspace-relative
- Integrated into LeonAgent._init_services() with CleanupRegistry at priority 1
- Optional dep: pip install multilspy (or leonai[lsp])
- Supported: python, typescript, javascript, go, rust, java, ruby, kotlin, csharp
- Language servers auto-downloaded on first use per multilspy design
- multilspy moved from optional to core dependencies (avoid restart cost)
- Add 10 MB file size limit (matches CC LSP spec)
- Add gitignore filtering on returned locations via git check-ignore,
  batched in groups of 50 (matches CC batch size)
- Remove multilspy availability check from handler (always available now)
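The batched gitignore filtering above can be sketched like this; `chunked` and `filter_gitignored` are hypothetical names, and only the batch size of 50 comes from the commit:

```python
import subprocess
from itertools import islice

def chunked(paths, size=50):
    """Yield successive batches of at most `size` paths (CC-style batch size)."""
    it = iter(paths)
    while batch := list(islice(it, size)):
        yield batch

def filter_gitignored(paths, workspace_root, size=50):
    """Drop paths that `git check-ignore` reports as ignored, batch by batch."""
    kept = []
    for batch in chunked(paths, size):
        # `git check-ignore` prints the ignored subset on stdout; exit code 1
        # just means "none of these are ignored", so we don't raise on it.
        proc = subprocess.run(
            ["git", "-C", workspace_root, "check-ignore", "--", *batch],
            capture_output=True, text=True,
        )
        ignored = set(proc.stdout.splitlines())
        kept.extend(p for p in batch if p not in ignored)
    return kept
```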
Adds 4 missing LSP operations via multilspy internal API:
- goToImplementation (textDocument/implementation)
- prepareCallHierarchy (textDocument/prepareCallHierarchy)
- incomingCalls (callHierarchy/incomingCalls)
- outgoingCalls (callHierarchy/outgoingCalls)

Total supported operations: 9 (matches CC LSP tool surface).
incomingCalls/outgoingCalls take the 'item' output from prepareCallHierarchy.
Language auto-detected from item.uri for call hierarchy ops.
- _fmt_symbol: handle both SymbolInformation (workspaceSymbol, has location.uri)
  and DocumentSymbol (documentSymbol, has top-level range/selectionRange)
- request_definition/references/hover/document_symbols: catch AssertionError
  from multilspy when server returns None (maps to empty result / no hover)
…langserver

Python's Jedi server doesn't support goToImplementation or call hierarchy.
Add _PyrightSession — a minimal asyncio LSP client over stdio — that talks to
pyright-langserver (bundled with `pip install pyright`, already a core dep).

Changes:
- _PyrightSession: JSON-RPC/Content-Length stdio client, initialize handshake,
  textDocument/didOpen, callHierarchy/{incomingCalls,outgoingCalls},
  textDocument/{implementation,prepareCallHierarchy}
- Acks server-to-client requests (window/workDoneProgress/create etc.)
- Keeps files open for session lifetime (required for call hierarchy)
- LSPService routes Python advanced ops to pyright, other languages to multilspy
- Fix _fmt_symbol: handle both SymbolInformation (workspaceSymbol) and
  DocumentSymbol (documentSymbol) response formats
- Fix AssertionError from multilspy null responses → empty result
- pyproject.toml: add core.tools.lsp to packages list (was missing — would
  cause lsp tool to be absent after pip install leonai)
- pyproject.toml: add pyright>=1.1.0 as core dep (required by _PyrightSession)
- lsp/service.py: remove unused _wait_for_idle, _active_progress, _idle_event,
  _progress_started from _PyrightSession (pyright doesn't send $/progress)
- plan-tool-alignment.md: replace Phase 6 placeholder with actual implementation
  summary (9 operations, dual-backend architecture, deps)
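The wire framing that a JSON-RPC/Content-Length stdio client like `_PyrightSession` must implement is specified by the LSP base protocol; a minimal encode/decode sketch (helper names are ours, not the session's actual API):

```python
import json

def encode_message(payload: dict) -> bytes:
    """Frame a JSON-RPC message with the LSP Content-Length header."""
    body = json.dumps(payload).encode("utf-8")
    return b"Content-Length: %d\r\n\r\n" % len(body) + body

def decode_message(data: bytes) -> tuple[dict, bytes]:
    """Parse one framed message; return (payload, remaining unparsed bytes)."""
    header, _, rest = data.partition(b"\r\n\r\n")
    length = int(header.split(b"Content-Length:")[1].strip())
    return json.loads(rest[:length]), rest[length:]
```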
Language servers (multilspy + pyright) now live in a module-level
_LSPSessionPool instead of per-LSPService instances. Sessions are keyed
by (language, workspace_root), start lazily on first use, and survive
agent restarts. Cleanup moved from CleanupRegistry to the backend
lifespan finally block via `await lsp_pool.close_all()`.

- Add _LSPSessionPool with asyncio.Task-based dedup for concurrent starts
- Simplify LSPService to delegate all session management to lsp_pool
- Remove _cleanup_lsp_service from LeonAgent and CleanupRegistry
- Add lsp_pool.close_all() to backend/web/core/lifespan.py shutdown
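The Task-based dedup for concurrent starts can be sketched as below. `SessionPool` and its method signatures are illustrative, not the actual `_LSPSessionPool` API; the point is that the pool stores the start Task itself, so a second concurrent caller awaits the in-flight start instead of spawning a duplicate server:

```python
import asyncio

class SessionPool:
    """Sessions keyed by (language, workspace_root); starts are deduplicated
    by caching the asyncio.Task rather than the finished session."""

    def __init__(self):
        self._tasks: dict[tuple[str, str], asyncio.Task] = {}

    async def get(self, language: str, root: str, start):
        key = (language, root)
        if key not in self._tasks:
            # First caller creates the Task; later callers await the same one.
            self._tasks[key] = asyncio.ensure_future(start(language, root))
        return await self._tasks[key]

    async def close_all(self, stop):
        for task in self._tasks.values():
            await stop(await task)
        self._tasks.clear()
```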

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@shuxueshuxue
Collaborator Author

Processed the concrete review items on this PR and merged latest main into the branch.

Addressed slices:

  • c3e865fe mutable AgentMiddleware.tools default
  • b194c757 filesystem default edit-cap mismatch
  • 0c8810b5 async LSP gitignore filtering
  • 7fbc0c63 terminal notification predicate dedupe
  • 7c85cdd8 child-fork encapsulation on LeonAgent
  • 81d5aa4d child-agent factory injection into AgentService
  • ec5c5a6c typed core callable fields in ToolUseContext
  • fb057dfc canonical existing-lease binding helper reuse
  • 9d2d7bb2 typed model-error recovery result + split string tool-call readiness fix
  • d9212065 merge seam in backend/web/core/dependencies.py (keeps main auth hardening + Supabase human-entity fallback)
  • 8cc28042 dedicated core/runtime/agent.py unit coverage

Verification run on the relevant slices:

  • uv run pytest -q tests/Unit/filesystem/test_filesystem_service.py
  • uv run pytest -q tests/Unit/platform/test_lsp_service.py
  • uv run pytest -q tests/Unit/core/test_terminal_notifications.py tests/Unit/core/test_queue_formatters.py tests/Integration/test_query_loop_backend_bridge.py -k "terminal_followthrough or command_notification or agent_notification or terminal_notifications"
  • uv run pytest -q tests/Unit/core/test_agent_service.py
  • uv run pytest -q tests/Unit/core/test_runtime_support.py tests/Unit/core/test_loop.py -k "core_callable_fields_are_not_typed_as_any or tool_use_context or permission"
  • uv run pytest -q tests/Integration/test_threads_router.py
  • uv run pytest -q tests/Unit/core/test_loop.py
  • uv run pytest -q tests/Fix/test_auth_entity_resolution.py tests/Fix/test_panel_auth_shell_coherence.py -k "entity_resolution or builtin_member_surface_exposes_chat_tools"
  • uv run pytest -q tests/Unit/core/test_runtime_agent.py tests/Integration/test_leon_agent.py -k "forked_child_context or close or destructor"

I also had a second hostile pass on each narrow slice outside the main loop; those all came back green.

The remaining higher-level concerns from the review (e.g. QueryLoop size, _run_agent_to_buffer size, broader frontend/LSP coverage, PR scope) are acknowledged, but I did not widen this PR further to chase them performatively. The concrete correctness / contract issues called out in the review are the ones addressed here.

@shuxueshuxue
Collaborator Author

Follow-up on the review items from #4058409950:

Handled slices now on this branch:

  • c3e865fe mutable AgentMiddleware.tools default
  • b194c757 filesystem default edit-cap mismatch
  • 0c8810b5 async LSP gitignore filtering
  • 7fbc0c63 terminal-notification dedupe predicate
  • 7c85cdd8 child-fork encapsulation via LeonAgent.apply_forked_child_context(...)
  • 81d5aa4d child_agent_factory injection seam
  • ec5c5a6c narrowed ToolUseContext core callable typing
  • fb057dfc canonical existing-lease binding helper
  • 9d2d7bb2 typed model-error-recovery result + split string tool-call readiness fix
  • d9212065 merge seam in backend/web/core/dependencies.py
  • 8cc28042 dedicated core/runtime/agent.py unit coverage
  • 2b012521 auth-router / monitor test drift repair
  • 01f452f6 Python CI closure (ruff check, ruff format --check, workflow-equivalent unit pack)

Local verification on the current head:

  • uv run ruff check .
  • uv run ruff format --check .
  • uv run pytest tests/ --ignore=tests/test_e2e_providers.py --ignore=tests/test_sandbox_e2e.py --ignore=tests/test_daytona_e2e.py --ignore=tests/test_e2e_backend_api.py --ignore=tests/test_e2e_summary_persistence.py --ignore=tests/test_p3_e2e.py --maxfail=5 --timeout=60 -q
    • 906 passed, 44 skipped

Independent hostile re-checks on the narrowed slices are green as well.

The only residual caveat I would call out is scope honesty: the router-test drift fixes prove the router contract, not deeper AuthService/Supabase behavior by themselves.

@nmhjklnm
Collaborator

nmhjklnm commented Apr 4, 2026

Simplification Suggestions — Same Power, Less Complexity

Following up on the architecture review. These are patterns in the PR that could be simpler AND more extensible with relatively small changes. Ordered by impact.


1. Eliminate sync/async twins in runner.py (~400 lines saved)

10 pairs of near-identical methods exist because some tool handlers are synchronous. The sync path uses _run_awaitable_sync (daemon thread + nested event loop) — fragile in production.

Fix: Wrap sync handlers at registration time, keep only the async path:

# registry.py — at register time
import asyncio
from dataclasses import replace

def register(self, entry: ToolEntry):
    if not asyncio.iscoroutinefunction(entry.handler):
        original = entry.handler
        async def async_wrapper(**kwargs):
            return await asyncio.to_thread(original, **kwargs)
        entry = replace(entry, handler=async_wrapper)
    self._tools[entry.name] = entry

Then delete all 10 _xxx_sync methods in runner.py. Net reduction: ~400 lines. Eliminates the daemon-thread-nested-event-loop footgun entirely.


2. Extract CheckpointStore interface (~60 lines saved, prerequisite for removing langgraph dep)

_save_messages and _hydrate_thread_state_from_checkpoint directly construct LangGraph's internal format (empty_checkpoint(), channel_values, CheckpointMetadata). This knowledge is scattered across 4 methods in loop.py.

Fix: Define a thin interface + adapter:

from dataclasses import dataclass, asdict
from typing import Protocol

class CheckpointStore(Protocol):
    async def save(self, thread_id: str, state: ThreadState) -> None: ...
    async def load(self, thread_id: str) -> ThreadState | None: ...

@dataclass
class ThreadState:
    messages: list
    permissions: dict
    pending_requests: dict
    resolved_requests: dict
    memory_state: dict

class LangGraphCheckpointAdapter:
    """Only place that knows channel_values format."""
    def __init__(self, saver: AsyncSqliteSaver): ...
    async def save(self, thread_id, state):
        checkpoint = empty_checkpoint()
        checkpoint["channel_values"] = asdict(state)
        await self._saver.aput(...)

QueryLoop speaks save(thread_id, ThreadState) / load(thread_id) — no LangGraph imports. This is also the prerequisite for eventually dropping the langgraph dependency entirely.


3. Error recovery: if-else cascade → strategy chain (0 lines saved, cognitive load halved)

_handle_model_error_recovery is ~120 lines with 5 interleaved if-else branches sharing mutable context variables. Adding a 6th strategy means reading the entire method.

Fix: Each strategy becomes an independent function:

from collections.abc import Awaitable, Callable

RecoveryStrategy = Callable[[ErrorContext], Awaitable[_ModelErrorRecoveryResult | None]]

class QueryLoop:
    def __init__(self):
        self._recovery_chain: list[RecoveryStrategy] = [
            self._try_context_overflow_escalate,
            self._try_transient_retry,
            self._try_max_output_tokens,
            self._try_collapse_drain,
            self._try_reactive_compact,
        ]

    async def _handle_model_error_recovery(self, ctx: ErrorContext):
        for strategy in self._recovery_chain:
            result = await strategy(ctx)
            if result is not None:
                return result
        return None

Same code, just reorganized. Each strategy is independently testable. New strategies = append to list.


4. Middleware introspection: __dict__ reflection → explicit declaration

_mw_overrides_model_call probes type(mw).__dict__ to check if a subclass overrides a method. This is CPython-specific and breaks with multiple inheritance or mixins.

Fix: One ClassVar per hook point:

from typing import ClassVar

class AgentMiddleware:
    handles_model_call: ClassVar[bool] = False
    handles_tool_call: ClassVar[bool] = False

class MemoryMiddleware(AgentMiddleware):
    handles_model_call = True  # explicit opt-in

Delete both _mw_overrides_* functions. ~20 lines.


5. _StreamingToolExecutor: back-reference → dependency injection

The embedded class calls self._loop._execute_single_tool() and self._loop._tool_is_concurrency_safe() — bidirectional coupling that makes both untestable in isolation.

Fix: Inject two callables instead of the entire loop:

class StreamingToolExecutor:  # promote to top-level
    def __init__(
        self,
        execute_tool: Callable[[dict, ToolUseContext], Awaitable[ToolMessage]],
        is_concurrency_safe: Callable[[dict], bool],
    ):
        self._execute = execute_tool
        self._is_safe = is_concurrency_safe

Now testable with simple mocks. ~30 lines changed.


6. MemoryMiddleware: temporal coupling → constructor injection

set_model() / set_runtime() must be called post-construction in the right order, otherwise compaction silently skips. No error, no log.

Fix: Either pass at construction time, or make the skip explicit:

async def awrap_model_call(self, request, handler):
    if self._model is None:
        logger.warning("MemoryMiddleware: model not set, skipping compaction")
        return await handler(request)

~20 lines.


Summary

| # | Change | Lines saved | Extensibility gain | Effort |
|---|--------|-------------|--------------------|--------|
| 1 | Kill sync/async twins | ~400 | High — removes footgun | Medium |
| 2 | CheckpointStore interface | ~60 | Very high — unblocks langgraph removal | Medium |
| 3 | Recovery strategy chain | 0 (reorg) | High — new strategies = 1 function | Low |
| 4 | Explicit middleware declaration | ~20 | Medium | Very low |
| 5 | Executor DI | ~10 | Medium — enables unit testing | Low |
| 6 | Constructor injection | ~10 | Low — prevents silent failures | Very low |

Total: ~500 lines of refactoring to go from "works but only the author can extend" to "anyone can add a recovery strategy / middleware / checkpoint backend".

@shuxueshuxue
Collaborator Author

Follow-up closeout for review thread 4058409950.

Everything I could address in this PR has now been handled on the branch, including the later CI fallout from those changes. The last follow-up slices were:

  • Windows-only remote path semantics + spill-path normalization + WAL cleanup on 43b45c2 / 950f3e5
  • auth-router / monitor test drift cleanup on 2b01252
  • Python CI lint + test drift closure on 01f452f

Current status on this PR head (950f3e5):

  • Lint & Type Check (Python): pass
  • Unit Tests (ubuntu-latest): pass
  • Unit Tests (macos-latest): pass
  • Unit Tests (windows-latest): pass
  • Lint & Type Check & Build (Frontend): pass

I also ran the relevant local verification before pushing, and the partner hostile/review loop came back green on the narrow review slices, including the final Windows-only boundary.

If there is still a remaining reviewer concern, it should now be a new narrow item rather than an unaddressed piece of the original thread.

@github-actions

github-actions bot commented Apr 4, 2026

🚀 Staging deployment triggered

Branch: pr188-agent-optimize

@shuxueshuxue
Collaborator Author

Deployment note for reviewers (sanitized):

Current status

  • Local deployment on this branch is validated end-to-end. I restarted backend/frontend from the current branch, verified login, members/profile, thread page, message live updates, and the child-agent right pane live stream.
  • The recent user-facing regressions we fixed are code-path fixes, not deployment hacks. The main runtime/UI fixes were around stream wiring, thread/event contracts, chat tool contracts, and stale sandbox/session refresh.

Important caveat: staging CD is currently misleading

  • The PR label-based Deploy Staging workflow can report success even when deployment did not actually happen.
  • In the latest run, the workflow log showed API responses like Application not found and No resources found, but the job still completed green because the workflow uses plain curl -s and does not fail on bad response bodies.
  • Coolify still showed the shared staging app pinned to main, which means the PR branch was not actually deployed there.
  • As a result, any errors seen on app.staging... right now should not be over-attributed to this PR until the staging workflow itself is fixed.

Local deployment: what may trip you up

  • Local backend is not a one-flag startup on this machine. It needs the full Supabase runtime contract, not just a generic config.env.
  • In practice that means the backend must start with:
    • Supabase storage strategy enabled
    • explicit Supabase client factory wiring
    • staging schema selected for local/staging paths
    • a valid Supabase URL/key/JWT runtime contract
  • If that contract is incomplete, backend login can fail even though the code is fine.
  • Also avoid relying on the legacy supabase.f2j.space host for local auth checks; the working path is the local tunnel/new domain, not the old dead hostname.

Minimal local verification steps

  1. Ensure the Supabase/Daytona tunnel is up from ops/tunnel.sh.
  2. Start backend on the worktree backend port with the full Supabase runtime contract.
  3. Start frontend on the worktree frontend port.
  4. Log in through the real frontend.
  5. Verify these exact user flows:
    • members/profile load
    • open an existing thread
    • send a message and watch live updates without refresh
    • trigger a child agent and confirm the right Agent pane receives intermediate child events live

What is normal vs. not normal

  • Normal on local current branch: login works, panel loads, thread loads, SSE works, child-agent pane live updates work.
  • Not trustworthy right now: shared staging behavior after deploy-staging, because the CD workflow can false-green while leaving staging on main.

Recommended next step

  • Treat local validation on this branch as the primary correctness signal for now.
  • Fix deploy-staging.yml to fail loudly on Coolify API errors before using staging as review evidence.

@nmhjklnm
Collaborator

nmhjklnm commented Apr 4, 2026

Nice work on the review items — the fork encapsulation (apply_forked_child_context), factory injection, and typed recovery result are exactly the right fixes.

Quick follow-up on the simplification suggestions thread (#4187099517) — any thoughts on those? Specifically interested in your take on:

  1. sync/async twin elimination — the _run_awaitable_sync daemon-thread bridge in runner.py is the one I'm most concerned about long-term. Wrapping sync handlers at registration time would delete ~400 lines and the nested-event-loop footgun in one shot.

  2. CheckpointStore interface — given that _save_messages already constructs empty_checkpoint() + channel_values manually, extracting a CheckpointStore protocol would be a small diff now but is the prerequisite for eventually dropping the langgraph dependency.

Not suggesting these go into this PR if you want to keep scope tight — just want to know if you see them as valid follow-ups or if there's a reason the current shape is intentional.

@shuxueshuxue
Collaborator Author

Deployment follow-up (sanitized):

  • The deploy-staging.yml workflow contract is now fixed and does deploy the shared staging stack off this branch.
  • After that, shared staging still had a real runtime issue on GET /api/threads/{id} and GET /api/threads/{id}/runtime.
  • I traced that into two separate seams:
    1. Code seam: shared Supabase local threads were carrying host-local cwd values (for example a developer machine path) into a Linux staging container. Commit 6f3e9910 fixes that by treating persisted local cwd as advisory and only promoting it to workspace_root when it exists on the current host.
    2. Deploy/runtime-config seam: the shared staging backend container was missing model runtime config, so LeonAgent bootstrap still failed even after the dead-cwd fix. The concrete symptom in remote backend logs was ValueError: API key must be set via ... during agent init.
  • After restoring the backend's model runtime config on shared staging, blackbox retest turned green for this boundary:
    • POST /api/auth/login -> 200
    • GET /api/threads/m_SNFbl9JUgpap-1 -> 200
    • GET /api/threads/m_SNFbl9JUgpap-1/runtime -> 200

One remaining shared-staging note that is separate from the thread-detail/runtime 500: /api/panel/members is still returning an empty list because the shared staging filesystem side does not currently have member bundles seeded under its Leon home directory. That is a separate environment/filesystem-seeding issue, not the thread-detail/runtime route bug.

