
Agent Instructions

General

  • You are Claude Code. Actions that would be time consuming for a human — writing tests, building features, refactoring code — are fast and comparatively cheap for you.
  • Conversation history gets compacted once the context window reaches its limit. Important details from earlier in the conversation — including plans, discoveries, and decisions — may be lost. Proactively write important information to files so it persists beyond context compression.

Planning

  • Confirm before implementing: After writing a plan but before starting implementation, always present the plan to the user and ask if they have any questions or concerns. Do not begin coding until the user confirms.

Long-Running Tests

  • A "long-running test" is any test that takes >= 60 seconds. Many tests in this project run for 10–30 minutes (especially OSS pipeline benchmarks and integration tests). Never wait blindly for completion.
  • Run long tests in the background using run_in_background: true. Then poll the output at regular intervals (every 30–60 seconds) using TaskOutput with block: false to inspect incremental stderr/stdout.
  • Act on partial output: errors, warnings, and progress lines appear long before the test finishes. If you see repeated failures (e.g., wrong paths, IPC errors, assertion failures), stop the test early, diagnose, and fix — don't wait for the full run to complete.
  • After applying a fix, delete any cached DB files under target/test-repos/*.db before re-running the test so stale results don't mask the fix.

Core Principles

  • Verify before deleting: Before deleting any files or folders, always verify they are not referenced elsewhere in the codebase using grep or other search tools. Never assume a file is unused.
  • Verify assumptions: Before acting on any assumption about the codebase (API signatures, available methods, file locations, type constraints, etc.), read the relevant source. Use grep, glob, or file reads to confirm. Do not assume — check.
  • Verify with builds and tests: After making changes, build the affected project and run existing tests to confirm nothing is broken. When the correct behaviour of a piece of logic is non-obvious, write a test to verify it — including temporary/throwaway tests if that is the fastest way to confirm an assumption. Remove temporary tests once they have served their purpose.

Code Documentation

  • Do not add comments that merely describe the changes made (e.g., "Modified this to fix bug X").
  • Comments should be reserved for explaining the code and functionality themselves (the "how" and "why" of the logic), adhering to standard clean code practices.

Variable Naming

  • Use clear, descriptive names for all variables.
  • Avoid obscure abbreviations (e.g., use isCollection instead of isColl).

Workflow

  • Write plans to a file before implementing: For non-trivial tasks, write the plan to a markdown file in the repo before starting implementation. Delete when done.
  • Stop and reassess after repeated failures: If consecutive fix attempts fail to resolve an issue, stop and reconsider the approach rather than continuing to apply further fixes.
  • Commits should be focused and well-delimited: Each commit should represent one coherent, self-contained piece of work (e.g. a bug fix, a single new feature, a refactor, a docs update). Do not bundle unrelated changes into a single commit; review the diff before committing to verify the commit's scope. When a file contains changes that belong in separate commits, use git add -p to stage specific hunks rather than editing the file, committing, and re-applying changes. Do not add any Claude attribution or co-author lines to commit messages.
  • Keep documentation up to date: After implementing features, fixing bugs, or adding tests, update the following files to reflect the changes:
    • CLAUDE.md — architecture decisions, key conventions, layout tree, test counts.
    • MEMORY.md — current implementation state, test counts, new conventions, gotchas, and non-obvious decisions. Update at the end of every session.
    • README.md — user-facing project description and usage instructions.
    • TESTS_IMPLEMENTATION_PLAN.md — add new test entries with ✅ status, promote ⬜→✅ for newly implemented tests, and update phase-coverage summary counts. Every test row must have a unique T-NNN id.
    • TEST_COVERAGE.md — do not edit by hand. After any change to TESTS_IMPLEMENTATION_PLAN.md, run pwsh .\scripts\Sync-TestCoverage.ps1 from the repo root to regenerate it.

Project-Specific Knowledge

  • IMPLEMENTATION_PLAN.md is the full implementation plan for the Code Agent Platform — a local-first indexing and retrieval engine for C# and TypeScript/React codebases, exposed as an MCP server. It covers the complete architecture (SQLite graph schema, concurrency model, ingest pipeline, retrieval pipeline, and MCP tool surface) and defines a six-phase build roadmap spanning syntactic indexing (Phase 1), semantic enrichment via Roslyn/TS Language Service (Phase 2a), rename detection (Phase 2b), embeddings (Phase 3), hybrid retrieval and eval (Phase 4), MCP server (Phase 5), and hardening & observability (Phase 6), along with testing strategy and key risks.

  • INVARIANTS_CHECKLIST.md is a rigorous specification of correctness invariants for the Code Agent Platform — it defines the rules the system must never violate around symbol identity stability, invalidation, storage schema, and MCP tool behavior, along with a phased test assertion checklist to verify compliance.

  • TESTS_IMPLEMENTATION_PLAN.md is the single source of truth for all test entries. Every test row has a unique T-NNN id in the first column. It lists implemented (✅), planned (⬜), and deferred (🔁) tests for every phase (Phases 1–6), organised by module. Phase 5 covers MCP server tests. Update status from ⬜ to ✅ when a planned test is implemented, and update the phase-coverage summary table counts. It is also the place to add newly planned tests. Do not use | characters in description text — the pipe is the Markdown table delimiter and breaks the sync script.

  • TEST_COVERAGE.md is auto-generated from TESTS_IMPLEMENTATION_PLAN.md by scripts/Sync-TestCoverage.ps1. Run that script after any change to TESTS_IMPLEMENTATION_PLAN.md (adding, removing, or promoting a test entry) rather than editing TEST_COVERAGE.md by hand. The "Coverage Gaps" section at the bottom is preserved verbatim across regenerations.

  • scripts/Sync-TestCoverage.ps1 — PowerShell script that (1) assigns stable T-NNN ids to every test row in TESTS_IMPLEMENTATION_PLAN.md and (2) regenerates TEST_COVERAGE.md from the ✅ rows. Run with pwsh .\scripts\Sync-TestCoverage.ps1 from the repo root; use -DryRun to preview without writing. Idempotent: existing ids are never renumbered.

  • MEMORY.md is a living scratchpad that records the current implementation state, key conventions, known bugs (and their fixes), schema version history, and per-phase architecture notes. Keep it up to date at the end of every session — update the test count, add any new conventions discovered, and record any gotchas or non-obvious decisions so future sessions don't repeat the same mistakes.

  • MCP_SERVER_SPEC.md defines the MCP server tool surface — the universal API through which all clients (LLM agents, apps, IDE extensions) interact with the index engine. It specifies file system tools (list_directory, read_file, get_directory_tree), search and discovery tools, inspection and navigation tools, and engine management tools. The MCP server replaces the previously planned RLM orchestration layer; orchestration is now the client's responsibility.

  • LMGenie is provided as a demo app for testing.

  • LMGenie.Desktop contains the Tauri app. The CodeAgent React component will later make use of the 'indexing engine' we're implementing.

Phase 1 — Offline Index Foundation (COMPLETE)

The Rust workspace lives at codeagent-engine/ with two crates:

  • crates/codeagent-core — library crate with all indexing logic
  • crates/codeagent-cli — codeagent binary for debug inspection

Key architectural decisions already made and encoded in code:

  • Single-writer SQLite: one dedicated OS thread owns an exclusive rusqlite::Connection; all writes go through a bounded tokio::sync::mpsc channel (depth 10). Readers use an r2d2 connection pool. WAL mode. Never add a second writer connection.
  • BLOB(16) for UUIDs, BLOB(32) for hashes: NodeId, FileId, ProjectId are Uuid wrappers; ContentHash is [u8; 32]. These types are enforced in every SQL query — never store them as TEXT.
  • No is_deleted flag: deletion is always hard-delete, journaled in deletion_log first. The 4-step file deletion order (delete node_spans → process nodes → delete file node LAST) is an invariant — see graph/deletion.rs.
  • symbol_disambiguator NOT NULL DEFAULT '': never NULL, prevents SQLite UNIQUE index bypass. The identity key is (language, project_id, symbol_key, symbol_disambiguator).
  • Partial unique index for primary spans: CREATE UNIQUE INDEX idx_node_spans_one_primary ON node_spans(node_id) WHERE is_primary = 1. Never try to UPDATE … SET is_primary = 1 … LIMIT 1; bundled SQLite lacks SQLITE_ENABLE_UPDATE_DELETE_LIMIT. Use span_id to target a specific row instead.
  • Self-referential root project: PRAGMA defer_foreign_keys = ON inside the transaction — see graph/nodes.rs::ensure_root_project(). Required because the synthetic root node has project_id = node_id.
  • MAX_NODES_PER_TX = 500: callers must chunk before calling upsert_nodes_in_tx().
  • Cancellation before COMMIT: the writer thread checks CancellationToken before every COMMIT and rolls back on cancel — never bypass this.
  • ChangeBatch ordering: creates/modifications are always processed before deletes. This is how file moves preserve node identity.
  • Phase 1 TypeScript symbol_key includes file_id: all TS exported symbols are ExportScope::Module (conservative). The file_id is SHA-256(path)[0..16] — deterministic, not random. Phase 2a may promote to ExportScope::Package.
  • Phase 1 C# symbol_key: qualified_name:kind:param_count(param_types)<generic_arity> — overload-safe without Roslyn.
  • LIMIT 1 in SELECT is fine; it's only banned in UPDATE/DELETE (SQLite compile flag issue).
  • FTS5 tokenizer: unicode61 tokenchars '_.$:@' with prefix = '2 3 4'. Columns: node_id UNINDEXED, name, qualified_name, parameter_signature, return_type. Synced manually via sync_fts_insert() / sync_fts_delete() in graph/nodes.rs — not a content table.
  • upsert_node() does NOT sync FTS automatically: insert_node() / update_node() write only to the nodes table. Callers (adapters) must call sync_fts_insert() separately after every upsert_node(). Tests that bypass adapters and call upsert_node() directly must also call sync_fts_insert() explicitly if they intend to query FTS.
  • FTS5 MATCH with JOIN must use table name, not alias: in a query that aliases fts_nodes as fts, the MATCH clause must be WHERE fts_nodes MATCH ?1 — NOT WHERE fts MATCH ?1. Using the alias produces a runtime error "no such column: <alias>". Column access via the alias (fts.rank, fts.node_id) is fine.
  • ORT embedding: ort 2.0.0-rc.11 with features = ["load-dynamic", "ndarray"]. Session::run() requires &mut self, so EmbeddingModel wraps it in Arc<Mutex<Session>>. Imports are ort::session::Session and ort::session::builder::GraphOptimizationLevel — these are NOT re-exported at the ort crate root.
  • InvalidationPlanner: all invalidation decisions go through ingest/invalidation.rs::plan() — matches the §6.6 decision matrix exactly. Never scatter invalidation logic elsewhere.
  • row_to_node() column order: the 29-column SELECT order in query/mod.rs is documented in the function's doc comment. Any new query that returns node rows must use the same column order.
  • Multi-span chunk_hash: SHA-256 of all span_hash values sorted by bytes ASC — not insertion order. See compute_multi_span_chunk_hash().

Phase 2 — Semantic Enrichment & Rename Detection (COMPLETE)

Phase 2a: Semantic enrichment via language service child processes.

Phase 2b: Rename/move detection via fingerprinting.

494 tests pass across all workspace crates (417 core unit + 41 fixture + 31 MCP + 5 CLI; 2 ignored for Windows symlinks). 27 additional tests in the Rust extractor binary (outside the workspace). OSS integration tests are feature-gated. CURRENT_SCHEMA_VERSION is now 5.

Key architectural decisions added in Phase 2:

  • IPC: newline-delimited JSON-RPC 2.0 over stdin/stdout: Language service processes (C# Roslyn, TS LS) communicate with the Rust core via tokio::process::Command with piped stdio. Each line is one complete JSON message. Never use HTTP or named pipes.
  • writer.submit() only returns Result<()>: the writer channel is typed to (). To return data from a write operation, use an Arc<std::sync::Mutex<T>> passed into the closure (see detect_renames() in pipeline.rs). Alternatively, write first and read back via reader_pool.read() (see run_project_detection()).
  • reader_pool.read() is async: DbReaderPool::read() takes a FnOnce(&Connection) -> Result<T> closure and runs it in a spawn_blocking task. Never call reader_pool.get() — that method does not exist on DbReaderPool. Always use reader_pool.read(|conn| { ... }).await.
  • IpcManager field on IngestPipeline: created in new() via IpcManager::new(config, repo_root). Routes analyze_file() calls to the C# or TS child process based on language; falls back to syntactic_only on any error. Auto-respawns on IpcChildExited.
  • Safe mode: indexing.safe_mode = true blocks MSBuild evaluation entirely (C# falls back to syntactic_only). TS Language Service strips plugins from tsconfig.json compiler options unconditionally before creating a LanguageService instance.
  • Project detection runs first (Step 2 in pipeline): detect_projects() walks the repo for .csproj and package.json files; ensure_projects_in_db() upserts project nodes. This must complete before symbol indexing so project_id is correct. Only re-runs when project files appear in the batch or no real projects exist yet.
  • Identity reconciliation: when Roslyn/TS LS provides a better symbol_key, try in-place UPDATE. On UNIQUE constraint violation (another node already owns the key), fall back to delete-old + insert-new + journal in node_identity_map. See graph/identity.rs.
  • Atomic semantic edge replace: replace_semantic_edges_in_tx() deletes all confidence='exact' edges for a set of source node IDs, then inserts new ones — both in a single transaction. Never mix stale and fresh edges.
  • InvalidationAction::RecomputeSemanticEdges: returned by InvalidationPlanner for SemanticContextChanged. The pipeline acts on it in Step 7 by calling SemanticContextRecomputer::recompute_project(). The function takes repo_root: &Path to resolve repo-relative file paths to absolute paths for disk reads.
  • Rename detection is 3-tier (Phase 2b):
    1. Git (git diff --find-renames --name-status HEAD) — parses R<score>\t<old>\t<new> lines.
    2. Fingerprint: reads chunk_fingerprint from deletion_log, computes fingerprint of new file, Jaccard similarity ≥ rename_similarity_threshold (default 0.80).
    3. Symbol-level: same container + kind + arity match in deletion_log → inserts node_identity_map entry. Rename detection runs as Step 3 (before deletes). Paths confirmed as renames are skipped in the delete pass.
  • Token winnowing fingerprint: FNV-1a 64-bit hash of identifier-normalised k-grams (k=4, window=4). Identifiers replaced with the token IDENT; structural tokens and keywords preserved. Results stored as LE-encoded u64 bytes. jaccard_similarity() runs in O(n+m) via sorted-set merge.
  • node_identity_map table (Migration 002): tracks old symbol_key→new node_id mappings for identity reconciliation and symbol-level renames. Indexed on both old and new key for fast lookup.
  • normalize_path_lossy() in project detection: project detection uses normalize_path_lossy() (returns String, falls back to raw path on error) not normalize_path() (returns Result<String>), because project scanning should never hard-fail on an individual path.
  • Solution-level prebuild (Step 2.5 in pipeline): run_solution_prebuild() detects the best .sln (or generates a synthetic one), runs dotnet restore (NOT dotnet build — MSBuildWorkspace only needs NuGet packages, not compiled assemblies), then loads the workspace in the C# extractor via load_csharp_solution IPC call. Config: solution_restore_timeout_ms (default 600s, set to 0 to disable). Sub-phase timings tracked in SolutionSubTimings (scan, generate, restore, load).
  • Solution-based batch analysis skips WithDocumentText(): AnalyzeSingleFromSolution() uses documents directly from the loaded solution (solution.GetDocument(docId)) instead of forking a new Solution snapshot per file. This allows all parallel workers to share Roslyn's internal semantic model cache. For initial indexing the files haven't changed since OpenSolutionAsync() loaded them.
  • Batch-prefetched semantic enrichment: prefetch_enrichment_lookups() bulk-loads node locations from DB before the write pass. apply_enrichment_batch_prefetched() uses the pre-fetched HashMap for identity reconciliation and edge replacement, avoiding per-file DB lookups.
  • Minimal solution reload for incremental batches: after the initial index, IPC pools are shut down to free memory. On the next large incremental batch, load_minimal_csharp_solution() generates a synthetic solution containing only the touched projects (not the full repo), restores and loads it. This preserves cross-project type resolution without re-loading the entire workspace. The C# group merge in build_semantic_groups() checks is_csharp_solution_loaded() (actual IPC process state) before merging — not just the pipeline-level solution_load cache.
  • PRAGMA safety in bulk writes: process_parallel() relaxes PRAGMAs (synchronous=OFF, foreign_keys=OFF) for bulk writes. The write result is captured without early return, PRAGMAs are unconditionally restored, then the error is propagated. This prevents the writer connection from being left in an unsafe state on failure/cancellation.
  • Deterministic project file selection: find_project_file_for() sorts read_dir results before picking, ensuring consistent behavior across platforms when multiple .csproj files share a directory.

Rust Language Support (Phases A/B/C — COMPLETE)

Phase A: Tree-sitter syntactic indexing for .rs files.

Phase B: Cargo workspace detection (Cargo.toml scanning, workspace member glob expansion).

Phase C: Core-side wiring (IPC pools, pipeline semantic groups, config) + LSP adapter extractor binary.

494 workspace tests + 27 extractor tests. All existing C#/TS tests unaffected.

Key architectural decisions for Rust support:

  • Language::Rust variant: added to the Language enum in types.rs. detect_language() maps .rs extension. Semantic context files: Cargo.toml, Cargo.lock, rust-toolchain.toml.
  • Rust symbol_key format: crate::module::container::name:kind. Module path derived from file path relative to src/ (e.g., src/foo/bar.rs → crate::foo::bar). Inline mod name { ... } blocks push onto the module path. Kind suffixes: :struct, :enum, :trait, :fn (free function), :method (impl/trait method), :mod, :const, :static, :field, :variant.
  • Impl block symbol keys: inherent impl Config → methods keyed as crate::mod::Config::method:method. Trait impl Display for Config → methods keyed as crate::mod::Config.Display::fmt:method.
  • RustAdapter (tree-sitter, Phase A): follows LanguageAdapter trait. Supports index_file(), index_file_fresh(), and extract_file() (parallel path). parse_status = syntactic_only for all nodes until semantic enrichment via the extractor.
  • Cargo workspace detection (Phase B): detect_rust_projects() in project_detection.rs scans for Cargo.toml files. Parses [workspace] sections, expands members globs (e.g., "crates/*"), emits one DetectedProject per [package]. Virtual workspaces (no [package]) emit only member projects. Deterministic via sorted read_dir.
  • IPC pool for Rust (Phase C): IpcManager owns a rust_pool with memory-aware concurrency (LanguageMemoryProfile::RUST: 150 MB/process, 200 MB spawn gate). Routes via analyze_rust() method. Batch dispatch supported.
  • rust_extractor_path config: IndexingConfig.rust_extractor_path: Option<PathBuf>. When absent, Rust semantic enrichment is unavailable (tree-sitter only).
  • Pipeline integration (Phase C): build_semantic_groups() collects Rust groups alongside C#/TS. Groups processed in order: TS → C# → Rust. Rust IPC pool shut down before write phase to free memory.
  • Rust rename detection: RUST_KEYWORDS in fingerprint.rs — Rust keywords and structural tokens preserved during token normalization for fingerprint-based rename detection.
  • reconcile_rust_symbol_key() in identity.rs: guards against stale data by checking old key matches. Attempts in-place UPDATE, falls back to delete+create + node_identity_map entry on UNIQUE conflict.
  • Rust extractor binary: standalone codeagent-rust-extractor crate (outside the Cargo workspace). Architecture: codeagent-core ←JSON-RPC→ codeagent-rust-extractor ←LSP (Content-Length)→ rust-analyzer. Uses lsp-server for message framing and lsp-types for protocol definitions. Spawns rust-analyzer --stdio on first analysis request. Falls back to syntactic_only if rust-analyzer is not found on PATH. RUST_ANALYZER_PATH env var overrides PATH lookup.

Extractor binaries (inside codeagent-engine/, outside the Rust workspace):

  • extractors/csharp/ — .NET 8 console app (CodeAgentExtractor.csproj). Uses Microsoft.CodeAnalysis.Workspaces.MSBuild + Microsoft.Build.Locator. Entry point: src/Program.cs (JSON-RPC loop). Extractor logic: src/RoslynExtractor.cs. Protocol types: src/Protocol.cs. Launched by Rust as dotnet <dll> or a native AOT binary.
  • extractors/typescript/ — Node.js app (package.json, TypeScript 5.4). Entry point: src/index.ts (JSON-RPC loop). Extractor logic: src/extractor.ts (ProjectContext with lazy tsconfig load, plugin stripping). Protocol types: src/protocol.ts. Launched by Rust as node --max-old-space-size=2048 dist/index.js.
  • extractors/rust/ — Standalone Rust binary (codeagent-rust-extractor). LSP adapter: spawns rust-analyzer --stdio, translates documentSymbol responses into SemanticNode/SemanticEdge. Entry point: src/main.rs (JSON-RPC loop). LSP client: src/lsp_client.rs (Content-Length framing via lsp-server). Analyzer: src/analyzer.rs. Launched by Rust core via the path in config.indexing.rust_extractor_path.

codeagent-engine/ layout (cumulative):

codeagent-engine/
  Cargo.toml               — workspace root
  crates/
    codeagent-core/        — library crate (all indexing logic)
      src/
        lib.rs             — crate root
        error.rs           — CoreError enum (+ Ipc, IpcVersionMismatch, IpcTimeout, IpcChildExited, IdentityConflict)
        types.rs           — NodeId, FileId, ProjectId, Language, NodeType, EdgeType, …
        config.rs          — Config (+ IndexingConfig, EmbeddingConfig, RetrievalConfig, McpConfig, LoggingConfig)
        path.rs            — normalize_path(), normalize_path_lossy(), detect_language(), is_generated()
        db/
          schema.rs        — MIGRATION_001..005 DDL (CURRENT_SCHEMA_VERSION = 5)
          migrations.rs    — run_migrations(), quick_check(), integrity_check()
          connection.rs    — DbWriterHandle (mpsc), DbReaderPool (r2d2 + async read()), start_writer_thread()
        graph/
          nodes.rs         — upsert_node(), ensure_root_project(), compute_span_hash(), sync_fts_*()
          edges.rs         — upsert_edge(), replace_semantic_edges_in_tx(), get_outgoing_edges(), get_incoming_edges()
          spans.rs         — insert_span(), replace_spans(), reassign_primary_span()
          deletion.rs      — delete_file_transactional(), hard_delete_node(), journal_node()
          identity.rs      — reconcile_csharp_symbol_key(), upgrade_ts_export_scope(), reconcile_rust_symbol_key()
          api_endpoints.rs — upsert_api_endpoint(), search_api_endpoints(), ApiEndpoint, ApiStyle
          ranking.rs       — compute_pagerank(), get_pagerank_summary(), PageRankOptions
          vectors.rs       — colbert_search(), centroid_prefilter(), maxsim_score(), insert_or_replace_embedding()
        ipc/
          mod.rs           — re-exports IpcManager
          protocol.rs      — RpcRequest/Response, HandshakeParams/Result, AnalyzeFileParams/Result, SemanticNode/Edge
          process.rs       — LanguageServiceProcess (spawn, watchdog, codec, bounded semaphore)
          manager.rs       — IpcManager (routes to C#/TS/Rust process, respawns on crash, safe_mode enforcement, is_csharp_solution_loaded())
          pool_sizing.rs   — IPC concurrency pool sizing heuristics
        ingest/
          batch.rs         — ChangeBatch, FileChange (creates-before-deletes ordering)
          invalidation.rs  — InvalidationPlanner, ChangeType, InvalidationAction (+ RecomputeSemanticEdges)
          project_detection.rs — detect_projects(), ensure_projects_in_db(), resolve_project_id()
          semantic.rs      — enrich_file(), prefetch_enrichment_lookups(), apply_enrichment_batch_prefetched(), SemanticContextRecomputer::recompute_project()
          extraction.rs    — FileExtraction, IdentityKey, IdentityMap, write_file_extraction(), write_file_extraction_initial()
          fingerprint.rs   — compute_fingerprint(), jaccard_similarity(), fingerprint_to/from_bytes()
          rename.rs        — RenameDetector::detect() (3-tier: git → fingerprint → symbol)
          watcher.rs       — start_watcher(), run_coalescer() (debounce + burst recovery)
          pipeline.rs      — IngestPipeline::process_batch() (7-step async pipeline), process_parallel(), load_minimal_csharp_solution(), SolutionSubTimings, PipelineTimings
        adapters/
          mod.rs           — LanguageAdapter trait, TREESITTER_EXTRACTOR_VERSION
          csharp.rs        — CSharpAdapter (tree-sitter; handles namespace/class/interface/method/…)
          typescript.rs    — TypeScriptAdapter (.ts and .tsx; React component detection)
          rust.rs          — RustAdapter (tree-sitter; module path from file position, impl block handling)
        embedding/
          onnx.rs          — EmbeddingModel, DocumentEmbedding, QueryEmbedding, build_embedding_text()
          provider.rs      — EmbeddingProvider trait, HashEmbeddingProvider (deterministic test provider)
        query/
          mod.rs           — get_node(), get_neighbors(), get_source(), get_outline(), filter_nodes()
          dead_code.rs     — find_dead_code(), count_dead_code(), DeadCodeOptions, DeadCodeEntry
        retrieval/
          mod.rs           — retrieve() (hybrid search entry point)
          types.rs         — QueryIntent, RetrievalQuery, ScoredNode, RetrievalResult, AssembledContext
          channels.rs      — vector_search(), bm25_search(), qualified_name_search()
          reranker.rs      — merge_and_rerank() (RRF fusion)
          context.rs       — assemble_context() (token-budget packing)
          intent.rs        — classify_intent() (query → intent classification)
        eval/
          mod.rs           — run_eval(), EvalResult, ArchetypeMetrics
          metrics.rs       — ndcg_at_k(), mrr(), precision_at_k(), recall_at_k()
          dataset.rs       — load_eval_dataset(), QueryArchetype, EvalQuery
        test_support.rs    — shared test helpers
    codeagent-cli/         — codeagent binary (init, hooks, debug inspection)
      src/
        main.rs            — CLI entry point (clap)
        init.rs            — `codeagent init` (config, .gitignore, Claude Code hooks)
        hooks.rs           — hook handlers (pre-compact, post-tool-use, subagent-start, task-completed)
    codeagent-mcp/         — MCP server binary (codeagent-mcp)
      src/
        main.rs            — MCP server entry point (stdio transport)
        state.rs           — ServerState (DB handles, config, repo root)
        sandbox.rs         — path sandboxing to repo root
        serialization.rs   — MCP JSON serialization helpers
        error.rs           — MCP error types
        test_helpers.rs    — MCP test helpers
        tools/
          mod.rs           — tool registration and dispatch
          filesystem.rs    — list_directory, read_file, get_directory_tree
          search.rs        — search_symbols, lookup_symbol, find_similar
          navigation.rs    — get_symbol, get_source_spans, get_file_outline, get_callers, etc.
          management.rs    — index_files, get_status
  extractors/
    csharp/                — .NET 8 Roslyn extractor (CodeAgentExtractor.csproj)
      src/Protocol.cs      — JSON-RPC 2.0 types
      src/RoslynExtractor.cs — RoslynProjectContext (lazy MSBuildWorkspace), RoslynExtractor (solution batch analysis, shared semantic cache)
      src/Program.cs       — stdin/stdout JSON-RPC main loop
    typescript/            — Node.js TS Language Service extractor
      src/protocol.ts      — JSON-RPC 2.0 types
      src/extractor.ts     — ProjectContext (lazy tsconfig load, plugin stripping), TsExtractor
      src/index.ts         — stdin readline JSON-RPC main loop
    rust/                  — Rust semantic extractor (LSP adapter wrapping rust-analyzer)
      Cargo.toml           — standalone binary crate (lsp-types 0.97, lsp-server 0.7)
      src/main.rs          — stdin/stdout JSON-RPC main loop (handshake, analyze_file, analyze_files_batch)
      src/protocol.rs      — JSON-RPC 2.0 types (mirrors codeagent-core/src/ipc/protocol.rs)
      src/lsp_client.rs    — LspClient (spawns rust-analyzer --stdio, Content-Length framing, request/response matching)
      src/analyzer.rs      — RustAnalyzer (LSP adapter: documentSymbol → SemanticNode/SemanticEdge, impl block parsing)