diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md new file mode 100644 index 000000000..e6be1c4da --- /dev/null +++ b/.github/copilot-instructions.md @@ -0,0 +1,81 @@ +# Copilot Instructions + +Tier 1 PR reviewer for the kagent monorepo. Maintainer time is limited; your vigilance helps ensure project quality. + +## Review style + +- Explain the "why" behind recommendations. +- Skip details caught by CI linters (golangci-lint, ruff, eslint). +- Verify: clear docstrings, type hints, descriptive names, proper decomposition, sufficient comments. +- Verify: new logic has tests proportional to change size; PR description explains the "why". +- Identify design flaws, redundancy, and security issues. +- Estimate LLM-generation likelihood with confidence level as an HTML comment in "PR Overview". +- Comment "All Copilot criteria are met." when all criteria are met. + +## Review strategy + +For each batch: load review guide -> read diffs -> run code-tree queries -> post comments -> compress or drop context -> unload guide. + +### Small-batch file review + +Review files in batches of 3-5, in this order: + +1. **CRD Types** -- `go/api/v1alpha2/` changes (type safety, backward compat, JSON tags) +2. **Controllers & Translators** -- `go/core/internal/controller/` changes (reconciliation, error handling) +3. **Go ADK & HTTP** -- `go/adk/`, `go/core/internal/httpserver/` changes +4. **Python ADK** -- `python/packages/kagent-adk/` changes (type alignment, async patterns) +5. **UI Components** -- `ui/src/` changes (TypeScript strict mode, React patterns) +6. **Helm & CI** -- `helm/`, `.github/workflows/` changes (security, RBAC) +7. **Tests** -- test files reviewed against the code they test + +### Context pruning between batches + +After each batch, summarize key observations (new symbols, behavior changes, test gaps). Drop file contents; keep only the summary for cross-referencing in later batches. 
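The batch-and-prune loop can be sketched as plain code. This is a minimal sketch with hypothetical `review_file` and `summarize` callbacks; the real reviewer is an LLM workflow, not a script:

```python
from itertools import islice


def batched(paths, size):
    """Yield successive batches of at most `size` files."""
    it = iter(paths)
    while batch := list(islice(it, size)):
        yield batch


def review_in_batches(paths, review_file, summarize, batch_size=4):
    """Review files batch by batch, keeping only compact per-batch
    summaries so earlier file contents can be dropped from context."""
    summaries = []
    for batch in batched(paths, batch_size):
        comments = {p: review_file(p) for p in batch}
        # Context pruning: retain the summary, discard file contents.
        summaries.append(summarize(batch, comments))
    return summaries
```

Only the returned summaries survive between batches, which is what keeps context bounded across a large PR.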
+ +## Code-tree impact analysis + +Before posting a final review, run these queries on the changed files: + +```bash +# 1. Regenerate graph (incremental, quiet) +python3 tools/code-tree/code_tree.py --repo-root . --incremental -q + +# 2. For each changed source file, check blast radius +python3 tools/code-tree/query_graph.py --impact <file> --depth 3 + +# 3. Check which tests are affected +python3 tools/code-tree/query_graph.py --test-impact <file> + +# 4. For hub file changes, increase depth +python3 tools/code-tree/query_graph.py --impact <file> --depth 5 +``` + +Flag untested impact paths in the review. + +## Security scan + +For PRs touching Helm templates, RBAC, Dockerfiles, or credentials, load [security-review.md](review-guides/security-review.md) and apply its checklist. + +## Review guide files + +Load the relevant guide per batch: + +| Guide | When to load | +|-------|-------------| +| [architecture-context.md](review-guides/architecture-context.md) | Multi-subsystem PRs, controller changes, CRD hierarchy changes | +| [impact-analysis.md](review-guides/impact-analysis.md) | Large PRs, CRD changes, hub file changes | +| [language-checklists.md](review-guides/language-checklists.md) | Any code change (pick relevant language section) | +| [security-review.md](review-guides/security-review.md) | Helm, RBAC, security contexts, credentials, Dockerfiles | +| [test-quality.md](review-guides/test-quality.md) | PRs adding/modifying tests, or PRs missing tests | + +## Quick checklist + +- [ ] Code reuse: no duplicated logic; shared helpers extracted +- [ ] Tests proportional to change size +- [ ] CRD changes: types + manifests + translator + ADK types (Go + Python) + tests +- [ ] Hub file changes: impact analysis run, extra test coverage +- [ ] Security: no hardcoded credentials, proper RBAC, non-root containers +- [ ] Generated files: not hand-edited, regeneration commands run +- [ ] Cross-language alignment: Go ↔ Python types match +- [ ] Conventional commit message format +- [ ] DCO
sign-off present on all commits diff --git a/.github/review-guides/architecture-context.md b/.github/review-guides/architecture-context.md new file mode 100644 index 000000000..bc56be33c --- /dev/null +++ b/.github/review-guides/architecture-context.md @@ -0,0 +1,86 @@ +# Architecture Context for Reviews + +Load when reviewing PRs that touch multiple subsystems, modify controllers, or change CRD types. + +**See also:** [impact-analysis.md](impact-analysis.md), [language-checklists.md](language-checklists.md) + +--- + +## High-level architecture + +``` +UI (Next.js) -> Controller HTTP Server (Go) -> A2A proxy -> Agent Pod (Python/Go ADK) + -> MCP Tool Servers -> LLM Provider -> back to UI via SSE streaming +``` + +## Key subsystem boundaries + +| Subsystem | Language | Root path | +|-----------|----------|-----------| +| CRD Types & API | Go | `go/api/` | +| Controllers | Go | `go/core/internal/controller/` | +| HTTP Server | Go | `go/core/internal/httpserver/` | +| Database Layer | Go | `go/core/internal/database/` | +| A2A Protocol | Go | `go/core/internal/a2a/` | +| MCP Integration | Go | `go/core/internal/mcp/` | +| CLI | Go | `go/core/cli/` | +| Go ADK | Go | `go/adk/` | +| Python ADK | Python | `python/packages/kagent-adk/` | +| Web UI | TypeScript | `ui/src/` | +| Helm Charts | YAML | `helm/` | + +## Critical dependency directions + +Flag violations of these dependency rules: + +``` +# Allowed direction (arrow = "may depend on"). Reverse is forbidden. +go/core/ -> go/api/ (core may use api types, NOT the reverse) +go/adk/ -> go/api/ (adk may use api types, NOT the reverse) +go/core/internal/controller/ -> go/core/internal/database/ + +# Forbidden imports +go/api/ must NOT import go/core/ or go/adk/ +ui/ must NOT import go/ or python/ directly +``` + +Full dependency map: see [code-tree.md](../../docs/agents/code-tree.md#key-module-dependencies). 
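These direction rules can be screened mechanically in a review. A minimal sketch, assuming illustrative `kagent/...` import prefixes rather than the repository's actual Go module path:

```python
import re

# The reverse of "may depend on" is forbidden: go/api/ must not import
# go/core/ or go/adk/. The module prefixes below are assumptions for
# illustration, not the repo's real import paths.
FORBIDDEN_BY_PACKAGE = {
    "go/api/": ("kagent/go/core", "kagent/go/adk"),
}

_IMPORT_RE = re.compile(r'"([^"]+)"')


def import_violations(file_path: str, source: str) -> list[str]:
    """Return imports in `source` that the file's package may not depend on."""
    found = []
    for pkg_prefix, banned in FORBIDDEN_BY_PACKAGE.items():
        if file_path.startswith(pkg_prefix):
            for imp in _IMPORT_RE.findall(source):
                if imp.startswith(banned):
                    found.append(imp)
    return found
```

A non-empty result is a review comment: the import arrow points the wrong way.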
+ +## Controller patterns + +- **Shared reconciler**: All controllers share `kagentReconciler` — changes affect all CRD types +- **Translator pattern**: CRD spec → K8s resources via translators. PRs must update translator when adding CRD fields +- **Database-level concurrency**: Atomic upserts, no application-level locks. Do NOT introduce mutexes +- **Idempotent reconciliation**: Each loop iteration must be safe to retry +- **Network I/O outside transactions**: Long-running operations must not hold database locks + +## CRD type alignment + +When Go CRD types mirror Python ADK types (e.g., `AgentSpec` → `types.py`): + +- Add cross-reference comments in both languages +- Go types are the source of truth +- Flag changes to one side without corresponding changes to the other +- Both serialize to JSON via `config.json` — field names must match + +## Backward compatibility + +For CRD/API changes: + +- New fields must have safe defaults (zero-value must not break existing agents) +- What happens when an old controller reads a new CRD? Migration path must be explicit +- `v1alpha2` allows breaking changes, but prefer backward-compatible additions +- Database schema changes require migration logic in `go/core/internal/database/` + +## Cardinality changes + +When a value changes from single to list (or vice versa): + +1. Use `--callers` and `--rdeps` to find all consumers +2. Check for single-value assumptions in translators, HTTP handlers, and UI +3. Verify database model handles the change +4. Verify Helm templates handle the change + +## Type hierarchies + +Changes to base types affect all consumers. Key hierarchies in [code-tree.md](../../docs/agents/code-tree.md#key-type-hierarchies). 
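The type-alignment and backward-compatibility rules above can be illustrated with a hypothetical Python mirror of a Go config type. The field names here are examples only, not the actual `AgentSpec` schema:

```python
import json
from dataclasses import dataclass


@dataclass
class AgentConfig:
    # Must stay aligned with the Go type in go/api/adk/types.go
    # (hypothetical example fields, not the real schema).
    model: str
    system_prompt: str = ""  # zero value keeps old configs valid

    def to_config_json(self) -> str:
        # Field names must match the Go struct's JSON tags (camelCase).
        return json.dumps({"model": self.model, "systemPrompt": self.system_prompt})

    @classmethod
    def from_config_json(cls, raw: str) -> "AgentConfig":
        data = json.loads(raw)
        # Missing fields fall back to safe defaults for backward compatibility.
        return cls(model=data["model"], system_prompt=data.get("systemPrompt", ""))
```

The `data.get(..., default)` fallback is what makes a new field with a safe zero value tolerate old `config.json` payloads.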
diff --git a/.github/review-guides/impact-analysis.md b/.github/review-guides/impact-analysis.md new file mode 100644 index 000000000..cd1795927 --- /dev/null +++ b/.github/review-guides/impact-analysis.md @@ -0,0 +1,62 @@ +# Impact Analysis for Reviews + +Load when assessing blast radius, CRD changes, or reviewing large PRs. + +**See also:** [architecture-context.md](architecture-context.md) (backward compat, cardinality), [test-quality.md](test-quality.md) (coverage adequacy) + +--- + +## PR size and scope + +- Flag PRs > 500 additions crossing 3+ subsystems as "needs decomposition discussion" +- XXL PRs (>1000 additions): each component needs dedicated tests, not just E2E +- Suggest splitting when a single PR includes: CRD types + controller + ADK types + E2E + Helm + UI + +## Hub files + +Changes to these files need extra scrutiny. Full table in [code-tree.md](../../docs/agents/code-tree.md#hub-files-most-connected). + +| File | Role | +|------|------| +| `go/api/v1alpha2/agent_types.go` | Agent CRD — changes cascade to controllers, translators, ADK, UI | +| `go/api/adk/types.go` | ADK config — changes require Python mirror update | +| `python/packages/kagent-adk/src/kagent/adk/_agent_executor.py` | Core executor — all agent requests flow through here | +| `ui/src/lib/messageHandlers.ts` | A2A parsing — all UI message rendering depends on this | + +## Cross-boundary changes + +Flag without clear justification: + +- CRD type changes that don't update the translator +- Go ADK type changes without Python mirror updates +- Controller changes that modify HTTP server behavior +- Helm template changes without corresponding Go/Python changes +- UI changes that assume new API fields without backend support + +## Cross-file consistency + +When the same pattern is applied to many files (new CRD field across types, new config option): + +- Spot-check at least 3 files for consistency +- Verify the change propagates through: CRD types → translator → ADK types (Go + Python) → 
UI + +## CRD change checklist + +1. Field added to type struct in `go/api/v1alpha2/` +2. `make -C go manifests generate` run (CRD YAML + DeepCopy updated) +3. Translator updated in `go/core/internal/controller/translator/` +4. ADK types updated in `go/api/adk/types.go` +5. Python types mirrored in `kagent-adk/src/kagent/adk/types.py` +6. Helm CRD templates updated if schema changed +7. Unit tests for translator handling +8. E2E test for end-to-end flow +9. Generated CRD manifests committed (not hand-edited) + +## Generated files + +Never hand-edit. Key generated files: + +- `go/api/config/crd/` — CRD manifests (generated by controller-gen) +- `go/api/v1alpha2/zz_generated.deepcopy.go` — DeepCopy methods + +Reject PRs modifying generated files without running generators. diff --git a/.github/review-guides/language-checklists.md b/.github/review-guides/language-checklists.md new file mode 100644 index 000000000..5c600137e --- /dev/null +++ b/.github/review-guides/language-checklists.md @@ -0,0 +1,64 @@ +# Language-Specific Review Checklists + +Load when reviewing code changes. Pick the relevant language section. + +--- + +## Go (controllers, API, ADK) + +### Error handling and control flow +- Every `err :=` must be checked or returned. Swallowed errors cause silent failures. +- Wrap errors: `fmt.Errorf("context: %w", err)` +- `return` inside loops: verify it should be `continue`, not premature exit +- Loops assigning to a variable: verify intermediate values aren't discarded +- No variable shadowing of function arguments +- No unnecessary `else` after `return` + +### Concurrency +- No nested goroutines (`go func() { go func() { ... 
} }()`) +- Reuse K8s clients from struct fields, not `kubernetes.NewForConfig()` per call +- Fire-and-forget goroutines: require `context.WithTimeout` + error logging +- Database operations: atomic upserts, no application-level mutexes + +### Code quality +- `golangci-lint run` passes +- Context propagation (`context.Context` as first parameter) +- Resource cleanup (`defer` close for readers/connections) +- Table-driven tests for new functions +- Descriptive variable names (`fingerPrint` not `fp`, `cacheKey` not `ck`) +- camelCase for locals, PascalCase for exports +- No deprecated fields in new code + +### CRD-specific +- New fields have JSON tags matching the Go field name (camelCase) +- `+optional` marker for optional fields +- DeepCopy generated (`make -C go generate`) +- CRD manifests regenerated (`make -C go manifests`) + +## Python (ADK, skills, integrations) + +- Type hints on all function signatures +- No bare `Any` types without justification +- Ruff formatting compliance +- `async/await` used consistently (no mixing sync and async) +- Cross-language types: add `# Must stay aligned with Go type in ...` comments +- No bare `Exception` catches — use specific exception types +- No mutable default arguments +- Tests use `pytest` with `@pytest.mark.asyncio` for async tests + +## TypeScript (UI) + +- Strict mode compliance (no `any` type) +- No inline styles — use TailwindCSS classes +- No direct DOM manipulation — use React patterns +- Radix UI primitives for accessibility +- React Hook Form + Zod for form validation +- Jest tests for logic-heavy components +- ESLint + Next.js lint passing + +## YAML (Helm charts, CI) + +- Helm templates use proper `.Values` references +- No hardcoded image tags — use chart values +- CI workflows: no secrets in logs, pinned action versions +- CRD templates match generated manifests in `go/api/config/crd/` diff --git a/.github/review-guides/security-review.md b/.github/review-guides/security-review.md new file mode 100644 index 
000000000..4e6c4858f --- /dev/null +++ b/.github/review-guides/security-review.md @@ -0,0 +1,55 @@ +# Security Review Guide + +Load when reviewing PRs touching Helm templates, RBAC, security contexts, Dockerfiles, or credentials. + +--- + +## Principles + +- **Enforcement > defaults**: Security contexts must be enforced, not just defaulted +- **Never configurable**: `allowPrivilegeEscalation`, `privileged` must be hardcoded `false` — no env var overrides +- **System containers**: Controller, ADK runtime, UI must ALWAYS run non-root +- **Defense in depth**: Backend must enforce policy independently of UI. UI is convenience; controller is enforcement +- **Threat model required**: PRs adding security-configurable fields should answer: "What could a malicious agent definition do?" + +## Container SecurityContext checklist + +**NEVER user-configurable:** `privileged`, `add_capabilities`, `allowPrivilegeEscalation` +**ALWAYS hardcoded:** `drop: ["ALL"]`, `seccompProfile: RuntimeDefault` +**MAY be user-configurable:** `runAsUser` (warn on UID 0), `runAsGroup`, `readOnlyRootFilesystem` + +- Container-level `securityContext` must not conflict with pod-level `PodSecurityContext` +- Admin-set contexts take precedence over user-specified values +- Tests must not normalize violations + +## RBAC manifest review + +- `resourceNames` restrictions on `update`/`patch`/`delete` where possible +- `create` can't be scoped by `resourceNames` — document why needed +- Helm chart RBAC templates use `.Values` for namespace +- New ClusterRoles follow `kagent-*` naming convention +- ServiceAccount per agent pod (not shared) + +## Credentials and secrets + +- No hardcoded credentials in any file +- API keys stored in Kubernetes Secrets, referenced via `apiKeySecret` in ModelConfig CRD +- `ValueRef` type used for secret references (namespace + name + key) +- MCP server auth headers referenced via `headersFrom` (SecretKeySelector) +- No credentials in container environment variables — mount as 
files or use Secret references + +## Network security + +- Agent pods communicate with controller via ClusterIP services +- MCP tool server connections should use TLS when available +- A2A protocol endpoints should validate request origin +- `allowedNamespaces` on RemoteMCPServer controls cross-namespace access + +## General + +- No secrets in CI workflow logs +- Pinned action versions in GitHub workflows (SHA, not tag) +- Docker images use non-root USER +- Helm values: sensitive defaults are empty, not example values +- Input validation on agent names and configurations +- ConfigMap/Secret cleanup on resource deletion (no orphans) diff --git a/.github/review-guides/test-quality.md b/.github/review-guides/test-quality.md new file mode 100644 index 000000000..11c292ae2 --- /dev/null +++ b/.github/review-guides/test-quality.md @@ -0,0 +1,74 @@ +# Test Quality Review Guide + +Load when reviewing PRs that add/modify tests, or should include tests. + +--- + +## Coverage requirements + +- Every new public function: happy path + edge cases + error paths +- New controller reconciliation paths: unit tests with fake K8s client +- New HTTP API endpoints: integration tests +- New CRD fields: translator unit tests + E2E test +- New algorithmic code (BFS, tree walks, parsing): dedicated unit tests — E2E alone is insufficient + +## Proportional coverage + +| Change size | Expected | +|-------------|----------| +| Bug fix (< 50 lines) | Unit test reproducing bug + fix | +| Small feature (50-200) | Unit tests + integration if API-facing | +| Medium feature (200-500) | Unit + integration + E2E | +| Large feature (500+) | Unit + integration + E2E + negative tests | + +## Type alignment coverage + +When changing types across Go and Python: + +- Verify both sides are updated +- Add serialization round-trip tests where possible +- Verify backward compatibility (old config.json still parseable) + +## Test organization + +- Go: table-driven tests in `_test.go` files alongside source +- 
Python: `pytest` tests in package test directories +- UI: Jest tests in `__tests__/` directories or `.test.tsx` files +- E2E: `go/core/test/e2e/` for full-stack tests +- Utilities go in shared test helpers, not individual test files + +## Assertions and matchers + +- Assertion failure messages must match what the assertion actually checks +- Use specific assertions over generic (check exact values, not just nil) +- Go: use `t.Errorf` with descriptive messages including got/want +- Python: use pytest assertions with `-v` for detailed output +- TypeScript: use Jest matchers (`toEqual`, `toHaveBeenCalledWith`) + +## Test naming + +- Go: `TestFunctionName_Scenario` (e.g., `TestReconcile_AgentNotFound`) +- Python: `test_function_name_scenario` (e.g., `test_executor_handles_timeout`) +- Descriptive names that explain the scenario being tested + +## Fixture integrity + +- Never modify shared fixtures for new features — create new ones +- Tests must not depend on execution order +- Clean up resources in teardown, not via namespace deletion +- Sleep durations: 5s max (not 20s+); prefer polling with timeout +- No hardcoded paths or env-specific values + +## Resource management + +- Go: `defer` cleanup for test resources (clients, connections, temp files) +- Python: `pytest` fixtures with `yield` for setup/teardown +- E2E: verify pod cleanup after test completion +- Tests parallel-safe (no name collisions, no shared mutable state) + +## CI quality + +- Top-level comment explaining test purpose +- CI matrix job names include all parameters +- New dependencies: justified, maintained, license-compatible +- Test data in `testdata/` directories, not inline diff --git a/.gitignore b/.gitignore index 75b695443..c225560fd 100644 --- a/.gitignore +++ b/.gitignore @@ -213,4 +213,10 @@ file::memory:* ## Test certificates python/packages/kagent-adk/tests/fixtures/certs/*.pem -python/packages/kagent-adk/tests/fixtures/certs/*.srl \ No newline at end of file 
+python/packages/kagent-adk/tests/fixtures/certs/*.srl + +## Code Knowledge Graph (local SQLite DB, auto-generated) +.code-review-graph/ + +## Generated code-tree knowledge graph artifacts (rebuilt on session start) +docs/code-tree/ diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 000000000..538c6abf1 --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,193 @@ +# Agent Guide: Kagent Monorepo + +Entry point for AI agents and developers. Load only the reference file relevant to your current task. + +- Scope: kagent main branch, controller (Go), ADK (Python/Go), UI (Next.js), Helm charts +- Current version: v0.x.x (Alpha stage) + +## Development Workflow Skill + +**For detailed development workflows, use the `kagent-dev` skill.** The skill provides comprehensive guidance on: + +- Adding CRD fields (step-by-step with examples) +- Running and debugging E2E tests +- PR review workflows +- Local development setup +- CI failure troubleshooting +- Common development patterns + +The skill includes detailed reference materials on CRD workflows, translator patterns, E2E debugging, and CI failures. + +--- + +## Language guidelines + +| Language | Use For | Don't Use For | +|----------|---------|---------------| +| **Go** | K8s controllers, CLI tools, core APIs, HTTP server, database layer | Agent runtime, LLM integrations, UI | +| **Python** | Agent runtime, ADK, LLM integrations, AI/ML logic | Kubernetes controllers, CLI, infrastructure | +| **TypeScript** | Web UI components and API clients only | Backend logic, controllers, agents | + +**Rule of thumb:** Infrastructure in Go, AI/Agent logic in Python, User interface in TypeScript. + +## API versioning + +- **v1alpha2** (current) — All new features go here +- **v1alpha1** (legacy/deprecated) — Minimal maintenance only + +Breaking changes are acceptable in alpha versions. 
+ +--- + +### Code reuse policy (agents and contributors) + +- **Never duplicate code.** Extract shared helpers, use common abstractions, and avoid copy-paste. If you find yourself writing similar logic in more than one place, refactor to a shared location. +- Before creating a new helper or utility, search the codebase for existing implementations. Use the code-tree query tool if available: `python3 tools/code-tree/query_graph.py --search "<term>"`. + +### Testing policy (agents and contributors) + +- Every new non-trivial function, method, or exported API must have accompanying unit tests before merging. +- All existing tests must pass locally before pushing changes. Run the relevant test suites listed in the essential commands section. +- When modifying existing functions, verify that existing tests still pass and add new test cases if the behavior changes. +- Do not submit changes that break existing tests. If a test failure is pre-existing and unrelated to your changes, note it explicitly in the PR description. + +### Commit policy (agents and contributors) + +- Use **Conventional Commits** format: `<type>: <description>` (types: `feat`, `fix`, `docs`, `refactor`, `test`, `chore`, `perf`, `ci`) +- Always sign off on commits with `git commit -s` (adds a `Signed-off-by:` trailer). +- Never include AI agents (e.g. Claude Code, Copilot, or similar tools) as co-authors on commits. The human author is responsible for the work.
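The commit policy lends itself to a mechanical pre-push check. A minimal sketch, not the project's actual CI tooling:

```python
import re

TYPES = ("feat", "fix", "docs", "refactor", "test", "chore", "perf", "ci")

# Subject must look like "<type>: <description>" (optional scope allowed,
# as in Conventional Commits); illustrative check only.
SUBJECT_RE = re.compile(rf"^({'|'.join(TYPES)})(\([\w./-]+\))?: \S.*$")


def check_commit(message: str) -> list[str]:
    """Return policy problems found in a full commit message."""
    problems = []
    subject = message.splitlines()[0] if message else ""
    if not SUBJECT_RE.match(subject):
        problems.append("subject is not in '<type>: <description>' form")
    if "Signed-off-by:" not in message:
        problems.append("missing DCO sign-off (use 'git commit -s')")
    return problems
```

An empty list means the message satisfies both the format and the sign-off requirements.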
+ +--- + +## Best practices + +### Do's + +- Read existing code before making changes +- Follow the language guidelines (Go for infra, Python for agents, TS for UI) +- Write table-driven tests in Go +- Wrap errors with context using `%w` +- Use conventional commit messages +- Mock external services in unit tests +- Update documentation for user-facing changes +- Run `make lint` before submitting + +### Don'ts + +- Don't add features beyond what's requested (avoid over-engineering) +- Don't modify v1alpha1 unless fixing critical bugs (focus on v1alpha2) +- Don't vendor dependencies (use go.mod) +- Don't commit without testing locally first +- Don't use `any` type in TypeScript +- Don't skip E2E tests for API/CRD changes +- Don't create new MCP servers in the main kagent repo + +--- + +## Essential commands + +| Task | Command | +|------|---------| +| Create Kind cluster | `make create-kind-cluster` | +| Deploy kagent | `make helm-install` | +| Build all | `make build` | +| Run all tests | `make test` | +| Go unit tests | `make -C go test` | +| Go E2E tests | `make -C go e2e` | +| Go lint | `make -C go lint` | +| Go CRD generation | `make -C go manifests generate` | +| Python tests | `make -C python test` | +| Python format | `make -C python format` | +| Python lint | `make -C python lint` | +| UI tests | `cd ui && npx jest` | +| UI lint | `cd ui && npx next lint` | +| Access UI | `kubectl port-forward -n kagent svc/kagent-ui 3000:8080` | + +--- + +## Reference file index + +Load only the guide you need for the current task: + +| Guide | Contents | When to load | +|-------|----------|-------------| +| [docs/agents/architecture.md](docs/agents/architecture.md) | System architecture, subsystem boundaries, dependency rules, CRDs, protocols | Understanding how components interact | +| [docs/agents/go-guide.md](docs/agents/go-guide.md) | Go development: controllers, API, ADK, CLI, linting, testing patterns | Working on Go code | +| 
[docs/agents/python-guide.md](docs/agents/python-guide.md) | Python ADK, agent runtime, LLM integrations, framework support | Working on Python code | +| [docs/agents/ui-guide.md](docs/agents/ui-guide.md) | Next.js UI, components, routing, state management | Working on frontend code | +| [docs/agents/testing-ci.md](docs/agents/testing-ci.md) | Test suites, CI workflows, coverage requirements, local CI reproduction | Running tests or debugging CI | +| [docs/agents/code-tree.md](docs/agents/code-tree.md) | Knowledge graph queries, hub files, module dependencies, entry points | Scoping changes and impact analysis | + +--- + +## Task routing + +| Task | Start with | +|------|-----------| +| Add CRD field | [go-guide.md](docs/agents/go-guide.md#adding-crd-fields) → [architecture.md](docs/agents/architecture.md#crd-types-v1alpha2) | +| Add controller | [go-guide.md](docs/agents/go-guide.md#controller-development) → [architecture.md](docs/agents/architecture.md#controller-patterns) | +| Add LLM provider | [python-guide.md](docs/agents/python-guide.md#adding-llm-provider-support) | +| Add UI page | [ui-guide.md](docs/agents/ui-guide.md#adding-a-new-page) | +| Debug CI failure | [testing-ci.md](docs/agents/testing-ci.md#common-ci-failure-patterns) | +| Scope blast radius | [code-tree.md](docs/agents/code-tree.md#querying-the-graph) | +| Review a PR | [.github/copilot-instructions.md](.github/copilot-instructions.md) | + +--- + +## Validation checklist + +Before pushing any change: + +1. **Lint**: `make -C go lint` / `make -C python lint` / `cd ui && npx next lint` +2. **Test**: `make -C go test` / `make -C python test` / `cd ui && npx jest` +3. **Generated files**: If CRD types changed, `make -C go manifests generate` and commit +4. **Type alignment**: If Go ADK types changed, verify Python mirror is updated +5. **Commit message**: Conventional format (`feat:`, `fix:`, `docs:`, etc.) +6. 
**Sign-off**: `git commit -s` + +--- + +## Code Knowledge Graph + +Two options for code-aware PR reviews and blast-radius analysis: + +### Option 1: code-tree tools (zero-dependency, in-repo) + +```bash +# Build the knowledge graph +python3 tools/code-tree/code_tree.py --repo-root . --incremental -q + +# Query symbols, dependencies, impact +python3 tools/code-tree/query_graph.py --symbol <name> +python3 tools/code-tree/query_graph.py --impact <file> --depth 5 +python3 tools/code-tree/query_graph.py --test-impact <file> +``` + +See [docs/agents/code-tree.md](docs/agents/code-tree.md) for full usage. + +### Option 2: code-review-graph MCP server (optional) + +```bash +pip install code-review-graph +code-review-graph install # registers MCP server + git hooks +code-review-graph build # initial index (~15s) +# restart Claude Code to pick up the new MCP server +``` + +| Command | Description | +|---------|-------------| +| `/code-review-graph:review-pr` | Review the current PR with graph-aware context | +| `/code-review-graph:review-delta` | Review only staged/unstaged changes | +| `code-review-graph update` | Manually refresh the index (auto-updates via hooks) | +| `code-review-graph status` | Show indexed node counts by language | + +The `.code-review-graphignore` file controls which files are excluded from indexing. The local `.code-review-graph/` directory is git-ignored. + +--- + +## Additional resources + +- **Development setup:** See [DEVELOPMENT.md](DEVELOPMENT.md) +- **Contributing:** See [CONTRIBUTING.md](CONTRIBUTING.md) +- **Architecture (detailed):** See [docs/architecture/](docs/architecture/) +- **Examples:** Check `examples/` and `python/samples/` diff --git a/CLAUDE.md b/CLAUDE.md index e85c54d21..93302cdda 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -1,214 +1 @@ -# CLAUDE.md - Kagent Development Guide - -This document provides essential guidance for AI agents working in the kagent repository.
- ---- - -## Development Workflow Skill - -**For detailed development workflows, use the `kagent-dev` skill.** The skill provides comprehensive guidance on: - -- Adding CRD fields (step-by-step with examples) -- Running and debugging E2E tests -- PR review workflows -- Local development setup -- CI failure troubleshooting -- Common development patterns - -The skill includes detailed reference materials on CRD workflows, translator patterns, E2E debugging, and CI failures. - ---- - -## Project Overview - -**Kagent** is a Kubernetes-native framework for building, deploying, and managing AI agents. - -**Architecture:** -``` -┌─────────────┐ ┌──────────────┐ ┌─────────────┐ -│ Controller │ │ HTTP Server │ │ UI │ -│ (Go) │──▶│ (Go) │──▶│ (Next.js) │ -└─────────────┘ └──────────────┘ └─────────────┘ - │ │ - ▼ ▼ -┌─────────────┐ ┌──────────────┐ -│ Database │ │ Agent Runtime│ -│ (SQLite/PG) │ │ (Python) │ -└─────────────┘ └──────────────┘ -``` - -**Current Version:** v0.x.x (Alpha stage) - ---- - -## Repository Structure - -``` -kagent/ -├── go/ # Go workspace (go.work) -│ ├── api/ # Shared types: CRDs, ADK types, DB models, HTTP client -│ ├── core/ # Infrastructure: controllers, HTTP server, CLI -│ └── adk/ # Go Agent Development Kit -├── python/ # Agent runtime and ADK -│ ├── packages/ # UV workspace packages (kagent-adk, etc.) 
-│ └── samples/ # Example agents -├── ui/ # Next.js web interface -├── helm/ # Kubernetes deployment charts -│ ├── kagent-crds/ # CRD chart (install first) -│ └── kagent/ # Main application chart -└── .claude/skills/kagent-dev/ # Development skill -``` - ---- - -## Language Guidelines - -### When to Use Each Language - -| Language | Use For | Don't Use For | -|----------|---------|---------------| -| **Go** | K8s controllers, CLI tools, core APIs, HTTP server, database layer | Agent runtime, LLM integrations, UI | -| **Python** | Agent runtime, ADK, LLM integrations, AI/ML logic | Kubernetes controllers, CLI, infrastructure | -| **TypeScript** | Web UI components and API clients only | Backend logic, controllers, agents | - -**Rule of thumb:** Infrastructure in Go, AI/Agent logic in Python, User interface in TypeScript. - ---- - -## Core Conventions - -### Error Handling - -**Go:** -```go -// Always wrap errors with context using %w -if err != nil { - return fmt.Errorf("failed to create agent %s: %w", name, err) -} -``` - -**Controllers:** -```go -// Return error to requeue with backoff -if err != nil { - return ctrl.Result{}, fmt.Errorf("reconciliation failed: %w", err) -} -``` - -### Testing - -**Required for all PRs:** -- ✅ Unit tests for new functions/methods -- ✅ E2E tests for new CRD fields or API endpoints -- ✅ Mock external services (LLMs, K8s API) in unit tests -- ✅ All tests passing in CI pipeline - -**Go testing pattern (table-driven):** -```go -func TestSomething(t *testing.T) { - tests := []struct { - name string - input string - want string - wantErr bool - }{ - {name: "valid input", input: "foo", want: "bar", wantErr: false}, - {name: "invalid input", input: "", want: "", wantErr: true}, - } - - for _, tt := range tests { - t.Run(tt.name, func(t *testing.T) { - got, err := Something(tt.input) - if (err != nil) != tt.wantErr { - t.Errorf("Something() error = %v, wantErr %v", err, tt.wantErr) - } - if got != tt.want { - t.Errorf("Something() = %v, want 
%v", got, tt.want) - } - }) - } -} -``` - -### Commit Messages - -Use **Conventional Commits** format: - -``` -: - -[optional body] -``` - -**Types:** `feat`, `fix`, `docs`, `refactor`, `test`, `chore`, `perf`, `ci` - -**Examples:** -``` -feat: add support for custom service account in agent CRD -fix: enable usage metadata in streaming OpenAI responses -docs: update CLAUDE.md with testing requirements -``` - ---- - -## API Versioning - -- **v1alpha2** (current) - All new features go here -- **v1alpha1** (legacy/deprecated) - Minimal maintenance only - -Breaking changes are acceptable in alpha versions. - ---- - -## Best Practices - -### Do's ✅ - -- Read existing code before making changes -- Follow the language guidelines (Go for infra, Python for agents, TS for UI) -- Write table-driven tests in Go -- Wrap errors with context using `%w` -- Use conventional commit messages -- Mock external services in unit tests -- Update documentation for user-facing changes -- Run `make lint` before submitting - -### Don'ts ❌ - -- Don't add features beyond what's requested (avoid over-engineering) -- Don't modify v1alpha1 unless fixing critical bugs (focus on v1alpha2) -- Don't vendor dependencies (use go.mod) -- Don't commit without testing locally first -- Don't use `any` type in TypeScript -- Don't skip E2E tests for API/CRD changes -- Don't create new MCP servers in the main kagent repo - ---- - -## Quick Reference - -| Task | Command | -|------|---------| -| Create Kind cluster | `make create-kind-cluster` | -| Deploy kagent | `make helm-install` | -| Build all | `make build` | -| Run all tests | `make test` | -| Run E2E tests | `make -C go e2e` | -| Lint code | `make -C go lint` | -| Generate CRD code | `make -C go generate` | -| Access UI | `kubectl port-forward -n kagent svc/kagent-ui 3000:8080` | - ---- - -## Additional Resources - -- **Development setup:** See [DEVELOPMENT.md](DEVELOPMENT.md) -- **Contributing:** See [CONTRIBUTING.md](CONTRIBUTING.md) -- 
**Architecture:** See [docs/architecture/](docs/architecture/) -- **Examples:** Check `examples/` and `python/samples/` - ---- - -**Project Version:** v0.x.x (Alpha) - -For questions or suggestions about this guide, please open an issue or PR. +./AGENTS.md diff --git a/docs/agents/architecture.md b/docs/agents/architecture.md new file mode 100644 index 000000000..a4cbe5c16 --- /dev/null +++ b/docs/agents/architecture.md @@ -0,0 +1,153 @@ +# Kagent Architecture + +How kagent components interact end-to-end, from agent creation through message processing and tool execution. + +**See also:** [go-guide.md](go-guide.md) (Go development), [python-guide.md](python-guide.md) (Python ADK), [ui-guide.md](ui-guide.md) (frontend), [testing-ci.md](testing-ci.md) (CI/CD) + +--- + +## High-level architecture + +``` +┌─────────────┐ ┌──────────────┐ ┌─────────────┐ +│ Controller │ │ HTTP Server │ │ UI │ +│ (Go) │──▶│ (Go) │──▶│ (Next.js) │ +└─────────────┘ └──────────────┘ └─────────────┘ + │ │ + ▼ ▼ +┌─────────────┐ ┌──────────────┐ +│ Database │ │ Agent Runtime│ +│ (SQLite/PG) │ │ (Python) │ +└─────────────┘ └──────────────┘ +``` + +## End-to-end flow (UI -> Controller -> Agent -> LLM -> Tools -> UI) + +### UI (Next.js) + +- Entry: `ui/src/app/` Next.js app router +- Sends A2A JSON-RPC messages to the controller HTTP server +- Renders streaming responses via SSE +- Manages agents, models, tool servers, and conversations + +### Controller (Go) + +- Entry: `go/core/cmd/controller/main.go` +- Kubernetes controllers reconcile Agent, ModelConfig, RemoteMCPServer, MCPServer CRDs +- Creates Deployments, ConfigMaps, Secrets, ServiceAccounts for agent pods +- Upserts metadata to database (SQLite or PostgreSQL) +- HTTP API server on port 8083 proxies A2A requests to agent pods + +### Agent Runtime (Python ADK) + +- Entry: `python/packages/kagent-adk/src/kagent/adk/cli.py` +- Each agent pod runs the Python ADK +- Reads `config.json` from mounted Secret to configure agent behavior +- 
Connects to MCP tool servers for tool execution +- Manages LLM conversation loop: system prompt + history + tool calls + +### Agent Runtime (Go ADK) + +- Entry: `go/adk/cmd/main.go` +- Alternative runtime for agents written in Go +- Supports BYO (Bring Your Own) agent pattern +- Uses Google ADK for agent creation and session management + +## Key subsystem boundaries + +| Subsystem | Language | Root path | +|-----------|----------|-----------| +| CRD Types & API | Go | `go/api/` | +| Controllers | Go | `go/core/internal/controller/` | +| HTTP Server | Go | `go/core/internal/httpserver/` | +| Database Layer | Go | `go/core/internal/database/` | +| A2A Protocol | Go | `go/core/internal/a2a/` | +| MCP Integration | Go | `go/core/internal/mcp/` | +| CLI | Go | `go/core/cli/` | +| Go ADK | Go | `go/adk/` | +| Python ADK | Python | `python/packages/kagent-adk/` | +| Python Core | Python | `python/packages/kagent-core/` | +| Python Skills | Python | `python/packages/kagent-skills/` | +| Web UI | TypeScript | `ui/src/` | +| Helm Charts | YAML | `helm/` | +| Pre-built Agents | YAML | `helm/agents/` | + +## Critical dependency directions + +Flag violations of these dependency rules: + +``` +# Allowed direction (arrow = "may depend on"). Reverse is forbidden. 
+go/core/ -> go/api/ (core may use api types, NOT the reverse) +go/adk/ -> go/api/ (adk may use api types, NOT the reverse) +go/core/internal/controller/ -> go/core/internal/database/ (controller may use db, NOT reverse) +go/core/internal/httpserver/ -> go/core/internal/database/ (http may use db, NOT reverse) + +# Forbidden imports +go/api/ must NOT import go/core/ or go/adk/ +ui/ must NOT import go/ or python/ directly +``` + +## Controller patterns + +- **Shared reconciler**: All controllers share a single `kagentReconciler` instance (`go/core/internal/controller/reconciler/`) +- **Translator pattern**: CRD specs are translated to Kubernetes resources via translators (`go/core/internal/controller/translator/`) +- **Database-level concurrency**: Atomic upserts (`INSERT ... ON CONFLICT DO UPDATE`), no application-level locks +- **Network I/O outside transactions**: Prevents long-running operations from holding database locks +- **Event filtering**: Custom predicates filter Create/Delete/Update events to reduce unnecessary reconciliation + +## CRD types (v1alpha2) + +| CRD | Purpose | Definition | +|-----|---------|------------| +| Agent | AI agent configuration (declarative or BYO) | `go/api/v1alpha2/agent_types.go` | +| ModelConfig | LLM provider configuration | `go/api/v1alpha2/modelconfig_types.go` | +| RemoteMCPServer | Remote MCP tool server | `go/api/v1alpha2/remotemcpserver_types.go` | +| ModelProviderConfig | Provider-level configuration | `go/api/v1alpha2/modelproviderconfig_types.go` | + +## Protocols + +- **A2A (Agent-to-Agent)**: JSON-RPC 2.0 over HTTP with SSE streaming for inter-agent communication +- **MCP (Model Context Protocol)**: Standard protocol for tool server integration (SSE or Streamable HTTP) + +## Database support + +- **SQLite** (default): Local development, single-node deployments +- **PostgreSQL**: Production deployments, supports pgvector for memory + +## Go module structure + +Single workspace module at `go/go.mod` (module 
`github.com/kagent-dev/kagent/go`): + +``` +go/ +├── api/ # Shared types (CRDs, DB models, HTTP API, client SDK) +├── core/ # Infrastructure (controllers, HTTP server, CLI, database) +└── adk/ # Go Agent Development Kit +``` + +## Python package structure + +UV workspace at `python/`: + +``` +python/packages/ +├── kagent-adk/ # Main Python ADK (agent executor, A2A server, MCP toolset) +├── kagent-core/ # Core utilities +├── kagent-skills/ # Skills framework +├── kagent-openai/ # OpenAI native integration +├── kagent-langgraph/ # LangGraph framework support +├── kagent-crewai/ # CrewAI framework support +└── agentsts-*/ # AgentSTS variants +``` + +## Helm deployment + +Two charts (install CRDs first): + +```bash +helm install kagent-crds helm/kagent-crds/ # CRD definitions +helm install kagent helm/kagent/ # Main application +``` + +Pre-built agents available in `helm/agents/` (k8s, istio, helm, prometheus, grafana, etc.). diff --git a/docs/agents/code-tree.md b/docs/agents/code-tree.md new file mode 100644 index 000000000..f4b82607d --- /dev/null +++ b/docs/agents/code-tree.md @@ -0,0 +1,179 @@ +# Kagent Code-Tree Knowledge Graph + +Pre-built knowledge graph artifacts, query commands, codebase statistics, entry points, hub files, module dependencies, and workflows for development and PR reviews. + +**See also:** [architecture.md](architecture.md) (system architecture), [go-guide.md](go-guide.md) (Go files), [python-guide.md](python-guide.md) (Python files), [ui-guide.md](ui-guide.md) (UI files) + +--- + +## Development workflow + +Follow this workflow for any code change. Regenerate the graph first, use queries to scope your change, then verify impact after. + +### Before writing code + +1. **Regenerate graph**: `python3 tools/code-tree/code_tree.py --repo-root . --incremental -q` +2. **Locate**: `python3 tools/code-tree/query_graph.py --symbol <name>` or `--search <keyword>` +3. **Scope**: `--rdeps <file>` to understand blast radius before changing anything +4. 
**Read**: Load the relevant guide from [AGENTS.md](../../AGENTS.md), then read the target files + +### While writing code + +5. **Guard hub files**: Extra care when touching high-fanout files (see hub files below). Run `--impact <file> --depth 5` first. +6. **Match patterns**: Follow existing code style. Go: error wrapping with `%w`, descriptive names. Python: type hints, ruff formatting. TypeScript: strict mode, TailwindCSS. +7. **Cross-boundary changes**: Verify dependency direction rules (see [architecture.md](architecture.md#critical-dependency-directions)). + +### After writing code + +8. **Run relevant tests**: See essential commands in [AGENTS.md](../../AGENTS.md) +9. **Run impact check**: `python3 tools/code-tree/query_graph.py --test-impact <file>` +10. **Lint**: Run the appropriate formatter/linter for the language you changed + +--- + +## Available artifacts + +| File | Contents | Use case | +|------|----------|----------| +| `docs/code-tree/summary.md` | Architecture overview, hub files, key types, inheritance, module deps | **Start here** for any exploration | +| `docs/code-tree/graph.json` | Full knowledge graph (nodes + edges) | Programmatic queries with `jq` | +| `docs/code-tree/tags.json` | Flat symbol index with file:line locations | Quick symbol lookup | +| `docs/code-tree/modules.json` | Directory-level dependency map | Module impact analysis | + +## Querying the graph + +Use the query tool at `tools/code-tree/query_graph.py`: + +```bash +# Find where a symbol is defined +python3 tools/code-tree/query_graph.py --symbol AgentExecutor + +# Trace what a file depends on +python3 tools/code-tree/query_graph.py --deps go/core/internal/controller/agent_controller.go + +# Find what imports a file (reverse deps / blast radius) +python3 tools/code-tree/query_graph.py --rdeps go/api/v1alpha2/agent_types.go + +# Show inheritance hierarchy for a class +python3 tools/code-tree/query_graph.py --hierarchy AgentExecutor + +# Search symbols by keyword +python3 
tools/code-tree/query_graph.py --search "reconcil" + +# Get module overview (files, deps, symbols) +python3 tools/code-tree/query_graph.py --module go/core/internal/controller + +# Find entry points (main functions, unimported files) +python3 tools/code-tree/query_graph.py --entry-points + +# Extract code chunks from a file for context +python3 tools/code-tree/query_graph.py --chunks go/api/v1alpha2/agent_types.go + +# Who calls this function/method? +python3 tools/code-tree/query_graph.py --callers Reconcile + +# What does this function call? +python3 tools/code-tree/query_graph.py --callees CreateAgent + +# Find call paths between two symbols +python3 tools/code-tree/query_graph.py --call-chain Reconcile CreateAgent + +# Change impact analysis (transitive) +python3 tools/code-tree/query_graph.py --impact go/api/v1alpha2/agent_types.go +python3 tools/code-tree/query_graph.py --impact AgentSpec --depth 8 + +# Which tests are affected by a change? +python3 tools/code-tree/query_graph.py --test-impact go/core/internal/controller/agent_controller.go + +# What tests cover a file? +python3 tools/code-tree/query_graph.py --test-coverage go/api/v1alpha2/agent_types.go + +# Graph statistics +python3 tools/code-tree/query_graph.py --stats +``` + +Add `--json` to any query for machine-readable output. +Use `--depth N` to control impact/call-chain traversal depth (default: 5). + +## Entry points + +| Entry point | Purpose | +|-------------|---------| +| `go/core/cmd/controller/main.go` | Controller + HTTP server | +| `go/adk/cmd/main.go` | Go ADK server | +| `go/core/cli/cmd/kagent/main.go` | kagent CLI | +| `python/packages/kagent-adk/src/kagent/adk/cli.py` | Python ADK CLI | +| `python/packages/kagent-adk/src/kagent/adk/_a2a.py` | Python A2A server | +| `ui/src/app/layout.tsx` | UI entry (Next.js root layout) | + +## Hub files (most connected) + +These files are imported by the most other files. 
Changes to them have the widest blast radius: + +| File | Role | +|------|------| +| `go/api/v1alpha2/agent_types.go` | Agent CRD type definitions (20KB) | +| `go/api/v1alpha2/modelconfig_types.go` | ModelConfig CRD types (14KB) | +| `go/api/v1alpha2/common_types.go` | Shared types (ValueRef, etc.) | +| `go/api/database/models.go` | Database GORM models | +| `go/api/adk/types.go` | Go ADK configuration types | +| `python/packages/kagent-adk/src/kagent/adk/types.py` | Python ADK types (23KB, mirrors Go) | +| `python/packages/kagent-adk/src/kagent/adk/_agent_executor.py` | Core executor (28KB) | +| `ui/src/lib/messageHandlers.ts` | A2A message parsing (26KB) | +| `ui/src/components/chat/ChatInterface.tsx` | Main chat UI (28KB) | + +## Key module dependencies + +``` +go/core/internal/controller/ -> go/api/v1alpha2/ (CRD types) +go/core/internal/controller/ -> go/core/internal/database/ (DB operations) +go/core/internal/controller/ -> go/core/internal/mcp/ (tool discovery) +go/core/internal/httpserver/ -> go/core/internal/database/ (DB queries) +go/core/internal/httpserver/ -> go/core/internal/a2a/ (A2A proxy) +go/core/internal/a2a/ -> go/api/client/ (REST client) +go/adk/pkg/ -> go/api/adk/ (ADK types) +go/adk/pkg/models/ -> go/api/v1alpha2/ (ModelConfig types) +kagent-adk/ -> kagent-core/ (core utilities) +kagent-adk/ -> kagent-skills/ (skills framework) +ui/src/lib/ -> ui/src/types/ (type definitions) +ui/src/components/ -> ui/src/lib/ (utilities) +``` + +## Key type hierarchies + +``` +Agent CRD (go/api/v1alpha2/agent_types.go) + |- AgentSpec + | |- DeclarativeAgentConfig (type: "Declarative") + | |- BYOAgentConfig (type: "BYO") + |- AgentStatus + +ModelConfig CRD (go/api/v1alpha2/modelconfig_types.go) + |- ModelConfigSpec + | |- OpenAIConfig + | |- AnthropicConfig + | |- AzureOpenAIConfig + | |- OllamaConfig + | |- GeminiConfig + | |- CustomProviderConfig + +LLM Providers (python/packages/kagent-adk/src/kagent/adk/models/) + |- OpenAI native + |- LiteLLM 
(multi-provider) +``` + +## Regenerating the graph + +**Requires Python 3.9+** (uses `list[str]` type syntax). If your default `python3` is older, use `python3.12` or similar explicitly. + +Always regenerate before first use in a session: + +```bash +# Incremental (fast — only reparses changed files) +python3 tools/code-tree/code_tree.py --repo-root . --incremental -q + +# Full rebuild (after major changes) +python3 tools/code-tree/code_tree.py --repo-root . +``` + +For a structured PR review workflow using these queries, see [copilot-instructions.md](../../.github/copilot-instructions.md#code-tree-impact-analysis). diff --git a/docs/agents/go-guide.md b/docs/agents/go-guide.md new file mode 100644 index 000000000..567dc1db1 --- /dev/null +++ b/docs/agents/go-guide.md @@ -0,0 +1,184 @@ +# Kagent Go Development Guide + +Go backend development: controllers, HTTP server, database, CLI, and Go ADK. + +**See also:** [architecture.md](architecture.md) (system overview), [testing-ci.md](testing-ci.md) (test commands), [code-tree.md](code-tree.md) (code navigation) + +--- + +## Local development + +### Prerequisites + +- Go 1.26.1+ +- Docker with Buildx v0.23.0+ +- Kind v0.27.0+ +- kubectl v1.33.4+ +- Helm 3 + +### Essential commands + +| Task | Command | +|------|---------| +| Build all Go | `make -C go build` | +| Run unit tests | `make -C go test` | +| Run E2E tests | `make -C go e2e` | +| Lint | `make -C go lint` | +| Fix lint | `make -C go lint-fix` | +| Format | `make -C go fmt` | +| Vet | `make -C go vet` | +| Generate CRDs | `make -C go manifests` | +| Generate DeepCopy | `make -C go generate` | +| Vulnerability check | `make -C go govulncheck` | +| Build CLI (local) | `make build-cli-local` | + +### Module structure + +Single Go module: `github.com/kagent-dev/kagent/go` (Go 1.26.1) + +``` +go/ +├── api/ # Shared types (no internal business logic) +│ ├── v1alpha2/ # Current CRD definitions +│ ├── v1alpha1/ # Legacy CRDs (deprecated) +│ ├── adk/ # ADK configuration 
types +│ ├── client/ # REST HTTP client SDK +│ ├── database/ # GORM database models +│ ├── httpapi/ # HTTP request/response types +│ ├── utils/ # Shared utility functions +│ └── config/ # Generated CRD & RBAC manifests +├── core/ # Infrastructure +│ ├── cmd/ # Binary entry points (controller, CLI) +│ ├── cli/ # kagent CLI application +│ ├── internal/ # Core services +│ │ ├── controller/ # K8s controllers & reconciliation +│ │ ├── database/ # Database implementation (SQLite/Postgres) +│ │ ├── httpserver/ # HTTP API server +│ │ ├── a2a/ # Agent-to-Agent protocol +│ │ ├── mcp/ # MCP tool server integration +│ │ ├── metrics/ # Prometheus metrics +│ │ └── telemetry/ # OpenTelemetry tracing +│ └── pkg/ # Public packages (auth, translators) +└── adk/ # Go Agent Development Kit + ├── cmd/ # ADK server entry point + ├── pkg/ # Agent runtime, sessions, skills, MCP client + └── examples/ # Example tools (oneshot, BYO agents) +``` + +## Go coding standards + +### Error handling + +```go +// Always wrap errors with context using %w +if err != nil { + return fmt.Errorf("failed to create agent %s: %w", name, err) +} + +// Controllers: return error to requeue with backoff +if err != nil { + return ctrl.Result{}, fmt.Errorf("reconciliation failed: %w", err) +} +``` + +### Naming conventions + +- Exported identifiers: `PascalCase` (e.g., `AgentSpec`, `CreateAgent`) +- Unexported identifiers: `camelCase` (e.g., `agentName`, `parseConfig`) +- Descriptive variable names: `fingerPrint` not `fp`, `cacheKey` not `ck` +- Context as first parameter: `func DoSomething(ctx context.Context, ...) 
error` + +### Testing pattern (table-driven) + +```go +func TestSomething(t *testing.T) { + tests := []struct { + name string + input string + want string + wantErr bool + }{ + {name: "valid input", input: "foo", want: "bar", wantErr: false}, + {name: "invalid input", input: "", want: "", wantErr: true}, + } + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + got, err := Something(tt.input) + if (err != nil) != tt.wantErr { + t.Errorf("Something() error = %v, wantErr %v", err, tt.wantErr) + } + if got != tt.want { + t.Errorf("Something() = %v, want %v", got, tt.want) + } + }) + } +} +``` + +### Concurrency rules + +- No nested goroutines (`go func() { go func() { ... } }()`) +- Reuse K8s clients from struct fields, not `kubernetes.NewForConfig()` per call +- Fire-and-forget goroutines: require `context.WithTimeout` + error logging +- Database-level concurrency via atomic upserts, no application-level locks + +## Controller development + +### Adding a new controller + +1. Define CRD types in `go/api/v1alpha2/` +2. Run `make -C go manifests generate` to generate CRD YAML and DeepCopy methods +3. Create controller in `go/core/internal/controller/` +4. Register with the controller manager in `go/core/cmd/controller/main.go` +5. Create translator in `go/core/internal/controller/translator/` +6. Add E2E tests in `go/core/test/e2e/` + +### Reconciliation pattern + +```go +func (r *MyController) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) { + // 1. Fetch the resource + // 2. Handle deletion (finalizers) + // 3. Translate CRD spec to K8s resources + // 4. Apply resources (create/update) + // 5. Update status + // 6. Upsert to database + return ctrl.Result{}, nil +} +``` + +## Adding CRD fields + +1. Add field to type struct in `go/api/v1alpha2/*_types.go` +2. Run `make -C go manifests generate` +3. Update translator in `go/core/internal/controller/translator/` +4. Update ADK types in `go/api/adk/types.go` +5. 
Mirror in Python types: `python/packages/kagent-adk/src/kagent/adk/types.py` +6. Update Helm CRD templates if needed +7. Add unit tests for new field handling +8. Add E2E test if user-facing + +## HTTP API development + +- Server: `go/core/internal/httpserver/` +- Routes registered with gorilla/mux +- Request/response types in `go/api/httpapi/` +- Client SDK in `go/api/client/` + +## CLI development + +- CLI: `go/core/cli/` using Cobra + Viper +- Build: `make build-cli-local` +- Install: `make kagent-cli-install` + +## Linting + +golangci-lint v2.11.3+ with configuration in `go/.golangci.yaml`: + +```bash +make -C go lint # Check +make -C go lint-fix # Auto-fix +make -C go lint-config # Validate config +``` + +Always run lint before pushing. CI will fail on lint errors including `ineffassign`, `staticcheck`, and `gofmt` violations. diff --git a/docs/agents/python-guide.md b/docs/agents/python-guide.md new file mode 100644 index 000000000..414fdc08d --- /dev/null +++ b/docs/agents/python-guide.md @@ -0,0 +1,152 @@ +# Kagent Python Development Guide + +Python ADK, agent runtime, LLM integrations, and framework support. 
+ +**See also:** [architecture.md](architecture.md) (system overview), [testing-ci.md](testing-ci.md) (test commands), [go-guide.md](go-guide.md) (Go types that mirror Python) + +--- + +## Local development + +### Prerequisites + +- Python 3.13 (3.10+ supported in CI) +- UV package manager (v0.10.4+) + +### Essential commands + +| Task | Command | +|------|---------| +| Sync dependencies | `make -C python update` | +| Format code | `make -C python format` | +| Lint code | `make -C python lint` | +| Run all tests | `make -C python test` | +| Build packages | `make -C python build` | +| Security audit | `make -C python audit` | +| Generate test certs | `make -C python generate-test-certs` | + +### Package structure (UV workspace) + +``` +python/packages/ +├── kagent-adk/ # Main ADK - agent executor, A2A server, MCP toolset +│ └── src/kagent/adk/ +│ ├── _a2a.py # FastAPI A2A server +│ ├── _agent_executor.py # Core request handler (hub file) +│ ├── _approval.py # Human-in-the-loop approval +│ ├── types.py # Config types (mirrors Go ADK types) +│ ├── _mcp_toolset.py # MCP tool integration +│ ├── _mcp_capability_tools.py # MCP capability tools +│ ├── _memory_service.py # Vector memory service +│ ├── _session_service.py # Session persistence +│ ├── _token.py # K8s token refresh +│ ├── cli.py # CLI interface +│ ├── converters/ # Event/part converters +│ ├── models/ # LLM provider implementations +│ └── tools/ # Built-in tools (AskUser, Skills, memory) +│ +├── kagent-core/ # Core utilities +├── kagent-skills/ # Skills runtime/execution +├── kagent-openai/ # OpenAI native integration +├── kagent-langgraph/ # LangGraph framework support +├── kagent-crewai/ # CrewAI framework support +└── agentsts-*/ # AgentSTS variants +``` + +## Python coding standards + +### Type hints + +- Type hints on all function signatures +- Use `Optional[T]` for nullable parameters +- Use `list[T]`, `dict[K, V]` (lowercase) for Python 3.10+ +- No bare `Any` types without justification + +### 
Formatting and linting + +- **Ruff** for formatting and linting +- Run `make -C python format` before committing +- Run `make -C python lint` to check formatting + +### Error handling + +```python +# Wrap errors with context +try: + result = await mcp_client.call_tool(tool_name, args) +except Exception as e: + raise RuntimeError(f"Failed to call tool {tool_name}: {e}") from e +``` + +### Testing + +```python +# Use pytest with async support +import pytest + +@pytest.mark.asyncio +async def test_agent_executor(): + executor = AgentExecutor(config) + result = await executor.handle_request(request) + assert result.status == "completed" +``` + +Test certs required for TLS tests: `make -C python generate-test-certs` + +## ADK development + +### Agent executor flow + +1. A2A request received via FastAPI server (`_a2a.py`) +2. Request parsed and routed to `AgentExecutor._handle_request()` (`_agent_executor.py`) +3. ADK Runner manages LLM loop: system prompt + history + tool calls +4. Tool execution via MCP toolset (`_mcp_toolset.py`) +5. Events converted from ADK format to A2A format (`converters/`) +6. Response streamed back via SSE + +### Adding LLM provider support + +1. Create provider implementation in `kagent-adk/src/kagent/adk/models/` +2. Add provider config types to `types.py` +3. Mirror config types in Go: `go/api/v1alpha2/modelconfig_types.go` +4. Update translator: `go/core/internal/controller/translator/` +5. Add tests + +### Type alignment (Python ↔ Go) + +Python types in `kagent-adk/src/kagent/adk/types.py` must stay aligned with Go types in `go/api/adk/types.go`. Both are serialized as JSON in `config.json`. 
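As a minimal sketch of what this alignment means in practice (the `ToolConfig` type and its `name`/`timeoutSeconds` fields are hypothetical, invented only to illustrate JSON-tag mirroring, and are not part of the actual kagent schema):

```python
import json
from dataclasses import dataclass

# Hypothetical Go side, for illustration only:
#
#   type ToolConfig struct {
#       Name    string `json:"name"`
#       Timeout int    `json:"timeoutSeconds,omitempty"`
#   }
#
# The Python mirror must use the exact JSON field names from the Go struct tags.

@dataclass
class ToolConfig:
    name: str
    timeout: int = 0

    @classmethod
    def from_json(cls, raw: str) -> "ToolConfig":
        data = json.loads(raw)
        # "timeoutSeconds" matches the Go struct tag, not the Python attribute name
        return cls(name=data["name"], timeout=data.get("timeoutSeconds", 0))

    def to_json(self) -> str:
        payload = {"name": self.name}
        if self.timeout:  # mirror Go's omitempty: drop the field when zero
            payload["timeoutSeconds"] = self.timeout
        return json.dumps(payload)

# Round-trip a config.json fragment as the Go controller might emit it
cfg = ToolConfig.from_json('{"name": "kubectl", "timeoutSeconds": 30}')
print(cfg.to_json())  # {"name": "kubectl", "timeoutSeconds": 30}
```

A round-trip check like this is a cheap way to catch drift: if the Go side renames a JSON tag without the Python mirror following, deserialization silently falls back to defaults and the test fails.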
+ +When adding fields: +- Add to Go type first, then mirror in Python +- Use the same JSON field names +- Add cross-reference comments in both languages +- Flag changes to one side without corresponding changes to the other + +## Sample agents + +Located in `python/samples/`: + +``` +python/samples/ +├── adk/ # ADK-based examples +├── langgraph/ # LangGraph examples +├── crewai/ # CrewAI examples +└── openai/ # OpenAI examples +``` + +## Common development patterns + +### Adding a new built-in tool + +1. Create tool class in `kagent-adk/src/kagent/adk/tools/` +2. Register in the agent executor's tool loading +3. Add tests +4. Update documentation + +### Adding a new framework integration + +1. Create new package in `python/packages/kagent-<framework>/` +2. Add to UV workspace in `python/pyproject.toml` +3. Implement the agent executor interface +4. Add sample in `python/samples/<framework>/` +5. Add CI test job in `.github/workflows/ci.yaml` diff --git a/docs/agents/testing-ci.md b/docs/agents/testing-ci.md new file mode 100644 index 000000000..ccde1ad90 --- /dev/null +++ b/docs/agents/testing-ci.md @@ -0,0 +1,178 @@ +# Kagent Testing & CI Guide + +Test suites, CI workflows, test data, and local testing commands. + +**See also:** [go-guide.md](go-guide.md) (Go testing patterns), [python-guide.md](python-guide.md) (Python testing), [ui-guide.md](ui-guide.md) (UI testing) + +--- + +## Test suites + +### Go tests + +| Suite | Command | Scope | +|-------|---------|-------| +| Unit tests | `make -C go test` | All Go packages (skips E2E) | +| E2E tests | `make -C go e2e` | Full stack with Kind cluster | +| Lint | `make -C go lint` | golangci-lint v2.11.3+ | +| Vulnerability | `make -C go govulncheck` | Known CVE check | + +**Go unit tests** use table-driven patterns with race detection (`-race` flag in CI). 
+ +**Go E2E tests** require a running Kind cluster: + +```bash +make create-kind-cluster # Create cluster +make helm-install # Deploy kagent +make -C go e2e # Run E2E (failfast) +``` + +E2E tests are in `go/core/test/e2e/`. + +### Python tests + +| Suite | Command | Scope | +|-------|---------|-------| +| All packages | `make -C python test` | pytest across all packages | +| Format check | `make -C python lint` | Ruff formatter diff | +| Security audit | `make -C python audit` | pip-audit CVE check | + +**Python tests** require test certificates: `make -C python generate-test-certs` + +CI tests against multiple Python versions: 3.10, 3.11, 3.12, 3.13. + +### Helm tests + +Helm chart unit tests run in CI: + +```bash +# Tested in CI via helm-unit-tests job +``` + +### UI tests + +| Suite | Command | Scope | +|-------|---------|-------| +| Lint | `cd ui && npx next lint` | ESLint | +| Unit tests | `cd ui && npx jest` | Jest component tests | +| E2E tests | Cypress | Browser automation | + +## CI workflow matrix + +Main CI pipeline: `.github/workflows/ci.yaml` + +| Job | Trigger | What it does | +|-----|---------|--------------| +| `setup` | All | Cache key generation | +| `test-e2e` | Push/PR | Kind cluster E2E tests (SQLite + Postgres matrix) | +| `go-unit-tests` | Push/PR | Go unit tests with race detection | +| `helm-unit-tests` | Push/PR | Helm chart testing | +| `ui-tests` | Push/PR | Node.js lint + Jest | +| `build` | Push/PR | Multi-platform Docker builds (amd64/arm64) | +| `go-lint` | Push/PR | golangci-lint v2.11.3 | +| `python-test` | Push/PR | Pytest across Python 3.10-3.13 | +| `python-lint` | Push/PR | Ruff linter + formatter | +| `manifests-check` | Push/PR | Verify CRD manifests are up-to-date | + +### E2E test database matrix + +E2E tests run against both database backends: + +| Database | CI Job | +|----------|--------| +| SQLite | `test-e2e` (sqlite matrix) | +| PostgreSQL (pgvector) | `test-e2e` (postgres matrix) | + +### Docker image builds + +CI 
builds images for: +- `controller` (Go controller + HTTP server) +- `ui` (Next.js frontend) +- `app` (Python ADK runtime) +- `cli` (Go CLI binary) +- `golang-adk` (Go ADK runtime) +- `skills-init` (Skills initialization container) + +Platforms: linux/amd64, linux/arm64 + +## Other CI workflows + +| Workflow | Purpose | +|----------|---------| +| `tag.yaml` | Release tagging & versioning | +| `image-scan.yaml` | Container image vulnerability scanning | +| `run-agent-framework-test.yaml` | Framework integration tests | +| `stalebot.yaml` | Issue/PR stale detection | + +## Testing requirements for PRs + +### Must have + +- Unit tests for new functions/methods +- E2E tests for new CRD fields or API endpoints +- Mock external services (LLMs, K8s API) in unit tests +- All existing tests passing + +### Coverage rules + +| Change size | Expected tests | +|-------------|----------------| +| Bug fix (< 50 lines) | Unit test reproducing bug + fix | +| Small feature (50-200) | Unit tests + integration if API-facing | +| Medium feature (200-500) | Unit + integration + E2E | +| Large feature (500+) | Unit + integration + E2E + negative tests | + +### CRD change test coverage + +When adding CRD fields: + +1. Unit test for translator handling the new field +2. Unit test for database model handling +3. E2E test verifying end-to-end flow (CRD → agent config → runtime behavior) + +### Type alignment test coverage + +When changing types in `go/api/adk/types.go`: + +1. Verify Python mirror in `kagent-adk/src/kagent/adk/types.py` is updated +2. Add test that serializes Go type to JSON and deserializes in Python +3. 
Verify `config.json` schema is backward compatible + +## Local CI reproduction + +Reproduce CI failures locally: + +```bash +# Go tests +make -C go test # Unit tests +make -C go lint # Linting +make -C go manifests generate # CRD generation check + +# Python tests +make -C python generate-test-certs # Required for TLS tests +make -C python test # All pytest +make -C python lint # Ruff formatting + +# UI tests +cd ui && npm ci && npx next lint && npx jest + +# E2E (requires Kind cluster) +make create-kind-cluster +make helm-install +make -C go e2e + +# Manifests check +make -C go manifests generate +git diff --exit-code go/api/config/ +``` + +## Common CI failure patterns + +| Failure | Fix | +|---------|-----| +| `golangci-lint` error | Run `make -C go lint-fix` locally | +| `manifests-check` failed | Run `make -C go manifests generate` and commit | +| Python format error | Run `make -C python format` and commit | +| E2E timeout | Check pod logs: `kubectl logs -n kagent deploy/kagent-controller` | +| TLS test failure | Run `make -C python generate-test-certs` | +| Race condition in Go tests | Look for shared state in test setup/teardown | diff --git a/docs/agents/ui-guide.md b/docs/agents/ui-guide.md new file mode 100644 index 000000000..91e0125f9 --- /dev/null +++ b/docs/agents/ui-guide.md @@ -0,0 +1,137 @@ +# Kagent UI Development Guide + +Next.js web interface: components, routing, state management, and testing. 
+ +**See also:** [architecture.md](architecture.md) (system overview), [testing-ci.md](testing-ci.md) (test commands) + +--- + +## Local development + +### Prerequisites + +- Node.js 24.13.0+ (see `ui/.nvmrc`) +- npm + +### Essential commands + +| Task | Command | +|------|---------| +| Install dependencies | `npm ci` (in `ui/`) | +| Build | `make -C ui build` | +| Clean | `make -C ui clean` | +| Security audit | `make -C ui audit` | +| Update deps | `make -C ui update` | +| Access UI (dev) | `kubectl port-forward -n kagent svc/kagent-ui 3000:8080` | + +### CI commands + +CI runs linting and Jest tests: + +```bash +cd ui && npm ci && npx next lint && npx jest +``` + +## Project structure + +``` +ui/ +├── src/ +│ ├── app/ # Next.js app router +│ │ ├── agents/ # Agent management pages +│ │ │ ├── new/ # Create agent wizard +│ │ │ └── [namespace]/[name]/ # Agent detail + chat +│ │ ├── models/ # Model configuration pages +│ │ ├── servers/ # MCP server management +│ │ ├── tools/ # Tool management +│ │ ├── a2a/ # Agent-to-Agent pages +│ │ └── actions/ # Server actions +│ ├── components/ # Reusable React components +│ │ ├── chat/ # Chat interface (hub) +│ │ ├── create/ # Creation wizards +│ │ ├── models/ # Model components +│ │ ├── tools/ # Tool components +│ │ ├── sidebars/ # Navigation +│ │ ├── ui/ # Shadcn/ui primitives (30+) +│ │ └── icons/ # Icon components +│ ├── hooks/ # Custom React hooks +│ ├── types/ # TypeScript type definitions +│ └── lib/ # Utility functions +│ ├── a2aClient.ts # A2A JSON-RPC client +│ ├── messageHandlers.ts # A2A message parsing (hub file) +│ ├── toolUtils.ts # Tool utilities +│ ├── k8sUtils.ts # Kubernetes utilities +│ └── providers.ts # LLM provider info +├── cypress/ # E2E test framework +├── package.json +├── tsconfig.json +└── next.config.js +``` + +## Technologies + +| Technology | Purpose | +|------------|---------| +| Next.js 16 | App router, server actions | +| React 19 | UI framework | +| Radix UI | Accessible component primitives | 
+| Shadcn/ui | Styled component library |
+| TailwindCSS | Utility-first styling |
+| Zustand | State management |
+| React Hook Form + Zod | Form handling + validation |
+| Lucide React | Icons |
+| react-markdown | Markdown rendering |
+| @a2a-js/sdk | A2A protocol client |
+
+## TypeScript standards
+
+- **Strict mode** enabled in `tsconfig.json`
+- No `any` type — use proper typing or `unknown` with type guards
+- No inline styles — use TailwindCSS classes
+- No direct DOM manipulation — use React patterns
+- Components should be functional with hooks
+- Use descriptive prop interfaces
+
+## Key hub components
+
+These components have the most connections. Changes require extra care:
+
+| Component | Size | Purpose |
+|-----------|------|---------|
+| `ChatInterface.tsx` | 27KB | Main chat UI — renders messages, handles input, manages streaming |
+| `messageHandlers.ts` | 26KB | Parses A2A events, manages message state |
+| `ToolCallDisplay.tsx` | 12KB | Renders tool call cards with approval/rejection |
+| `ToolDisplay.tsx` | 9.5KB | Tool rendering and management |
+| `AgentsProvider.tsx` | 9.4KB | Agent state provider (context) |
+
+## Development patterns
+
+### Adding a new page
+
+1. Create route in `ui/src/app/<route>/page.tsx`
+2. Add navigation link in sidebar component
+3. Create any needed server actions in `ui/src/app/actions/`
+4. Add types in `ui/src/types/`
+5. Use existing Shadcn/ui components from `ui/src/components/ui/`
+
+### Adding a new component
+
+1. Create component in the appropriate `ui/src/components/<subdir>/`
+2. Use Radix UI primitives for accessibility
+3. Style with TailwindCSS
+4. Add TypeScript prop interfaces
+5. 
Add Jest tests for logic-heavy components + +### A2A client usage + +```typescript +import { A2AClient } from '@/lib/a2aClient'; + +// Send message and handle streaming response +const client = new A2AClient(baseUrl); +await client.sendMessage(agentId, message, { + onEvent: (event) => { + // Handle SSE events + }, +}); +``` diff --git a/tools/code-tree/code_tree.py b/tools/code-tree/code_tree.py new file mode 100644 index 000000000..1bca1882e --- /dev/null +++ b/tools/code-tree/code_tree.py @@ -0,0 +1,2331 @@ +#!/usr/bin/env python3 +"""code_tree.py - Build a knowledge graph of a codebase with call graph and test mapping. + +Hybrid Parser + Knowledge Graph + Chunked Context approach: +1. Parse: Extract symbols, calls, and references from every file +2. Index: Link symbols into a knowledge graph with dependency edges +3. Map: Connect tests to source code, detect module boundaries +4. Output: Generate structured artifacts for AI agent consumption + +Edge types: contains, imports, inherits, calls, tests, type_of + +Usage: + python code_tree.py [--repo-root .] [--output-dir docs/code-tree] + python code_tree.py --help + +Zero external dependencies required. Optional: tree-sitter for enhanced parsing. 
+""" + +import argparse +import ast as python_ast +import fnmatch +import hashlib +import json +import logging +import os +import re +import sys +from collections import Counter, defaultdict +from dataclasses import asdict, dataclass, field +from datetime import datetime, timezone +from pathlib import Path +from typing import Optional + + +# ── Constants ───────────────────────────────────────────────────── + +LANG_EXTENSIONS = { + ".py": "python", ".pyw": "python", + ".go": "go", + ".js": "javascript", ".jsx": "javascript", ".mjs": "javascript", ".cjs": "javascript", + ".ts": "typescript", ".tsx": "typescript", ".mts": "typescript", + ".java": "java", + ".rs": "rust", + ".rb": "ruby", + ".cpp": "cpp", ".cc": "cpp", ".cxx": "cpp", ".hpp": "cpp", + ".c": "c", ".h": "c", + ".cs": "csharp", + ".php": "php", + ".swift": "swift", + ".kt": "kotlin", ".kts": "kotlin", + ".scala": "scala", + ".proto": "protobuf", + ".sql": "sql", + ".sh": "bash", ".bash": "bash", + ".lua": "lua", + ".dart": "dart", + ".ex": "elixir", ".exs": "elixir", + ".hs": "haskell", + ".zig": "zig", + ".r": "r", ".R": "r", + ".vue": "vue", ".svelte": "svelte", +} + +DEFAULT_EXCLUDES = [ + ".git", "node_modules", "__pycache__", ".venv", "venv", "vendor", + "dist", "build", ".tox", ".mypy_cache", ".pytest_cache", "eggs", + ".egg-info", ".idea", ".vscode", ".code-tree", "generated", + ".next", ".nuxt", "coverage", ".cache", ".terraform", + ".claude", +] + +BINARY_EXTENSIONS = { + ".pyc", ".pyo", ".so", ".o", ".a", ".dylib", ".class", ".jar", + ".exe", ".dll", ".bin", ".dat", ".db", ".sqlite", ".sqlite3", + ".png", ".jpg", ".jpeg", ".gif", ".bmp", ".ico", ".svg", + ".mp3", ".mp4", ".wav", ".avi", ".mov", ".mkv", + ".zip", ".tar", ".gz", ".bz2", ".xz", ".rar", ".7z", + ".woff", ".woff2", ".ttf", ".eot", ".otf", + ".pdf", ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx", +} + +TEST_MARKERS = {"test_", "_test.", ".test.", ".spec.", "tests/", "__tests__/", "test/"} + +BUILD_FILES = { + "go.mod", "go.sum", + 
"package.json", + "pyproject.toml", "setup.py", "setup.cfg", + "pom.xml", "build.gradle", "build.gradle.kts", + "Cargo.toml", + "Gemfile", + "composer.json", + "pubspec.yaml", + "mix.exs", +} + +MAX_FILE_SIZE = 1_048_576 # 1 MB + +# Python builtins to exclude from call graph (module-level frozenset for O(1) lookup) +_PYTHON_BUILTINS = frozenset({ + "print", "len", "range", "enumerate", "zip", "map", "filter", + "sorted", "reversed", "list", "dict", "set", "tuple", "str", + "int", "float", "bool", "bytes", "type", "isinstance", "issubclass", + "hasattr", "getattr", "setattr", "delattr", "property", + "staticmethod", "classmethod", "super", "open", "repr", + "iter", "next", "any", "all", "min", "max", "sum", "abs", + "round", "id", "hash", "input", "format", "vars", "dir", + "callable", "chr", "ord", "hex", "oct", "bin", + "ValueError", "TypeError", "KeyError", "IndexError", + "AttributeError", "RuntimeError", "NotImplementedError", + "StopIteration", "Exception", "OSError", "IOError", + "FileNotFoundError", "PermissionError", "ImportError", +}) + +# Keywords to skip in regex call extraction (module-level frozenset) +_REGEX_CALL_KEYWORDS = frozenset({ + "if", "for", "while", "switch", "return", "new", "delete", "throw", "catch", +}) + + +# ── Data Classes ────────────────────────────────────────────────── + +@dataclass +class Symbol: + id: str + type: str # file, class, function, method, interface, struct, enum, constant, variable + name: str + file: str + language: str + line_start: int + line_end: int + scope: Optional[str] = None + params: list[str] = field(default_factory=list) + bases: list[str] = field(default_factory=list) + decorators: list[str] = field(default_factory=list) + docstring: Optional[str] = None + exported: bool = True + signature: Optional[str] = None + is_test: bool = False + + +@dataclass +class Edge: + source: str + target: str + type: str # contains, imports, inherits, calls, tests, type_of + metadata: dict = field(default_factory=dict) + + 
+@dataclass +class CallRef: + """Raw call reference extracted from a function body, before resolution.""" + caller_id: str # Symbol ID of the calling function/method + raw_name: str # Unresolved callee name (e.g. "foo", "self.bar", "pkg.func") + line: int # Line number of the call + file: str # File containing the call + + +@dataclass +class ModuleBoundary: + """A detected build module / package boundary.""" + root_dir: str # Directory containing the build file + build_file: str # Name of the build file (e.g. "package.json") + name: Optional[str] = None # Module name if detectable + language: Optional[str] = None + + +# ── File Discovery ──────────────────────────────────────────────── + +def should_exclude(path: str, excludes: list[str]) -> bool: + parts = Path(path).parts + for exc in excludes: + if any(fnmatch.fnmatch(p, exc) for p in parts): + return True + if fnmatch.fnmatch(path, exc): + return True + return False + + +def is_binary(file_path: str) -> bool: + ext = Path(file_path).suffix.lower() + if ext in BINARY_EXTENSIONS: + return True + try: + with open(file_path, "rb") as f: + chunk = f.read(8192) + return b"\x00" in chunk + except (OSError, IOError): + return True + + +def is_test_file(rel_path: str) -> bool: + """Determine if a file is a test file based on naming conventions.""" + lower_rel = rel_path.lower() + return any(m in lower_rel for m in TEST_MARKERS) + + +def file_hash(file_path: str) -> str: + """Compute SHA-256 hash of a file for change detection.""" + h = hashlib.sha256() + try: + with open(file_path, "rb") as f: + for chunk in iter(lambda: f.read(65536), b""): + h.update(chunk) + except OSError: + return "" + return h.hexdigest() + + +def discover_all( + root: str, + excludes: list[str], + include_tests: bool = True, + max_file_size: int = MAX_FILE_SIZE, +) -> tuple[list[tuple[str, str, str, bool]], list[ModuleBoundary]]: + """Discover source files and build files in a single os.walk() pass. 
+ + Returns (files, module_boundaries) where files is a list of + (abs_path, rel_path, language, is_test) tuples. + """ + files = [] + boundaries = [] + root = os.path.abspath(root) + exclude_set = set(excludes) + + for dirpath, dirnames, filenames in os.walk(root): + dirnames[:] = [ + d for d in dirnames + if d not in exclude_set + ] + + for fname in filenames: + abs_path = os.path.join(dirpath, fname) + + # Check for build files + if fname in BUILD_FILES: + rel_dir = os.path.relpath(dirpath, root) + if rel_dir == ".": + rel_dir = "" + boundary = ModuleBoundary(root_dir=rel_dir, build_file=fname) + try: + _enrich_module_boundary(abs_path, boundary) + except (OSError, json.JSONDecodeError, UnicodeDecodeError) as e: + logging.debug("Failed to enrich module boundary from %s: %s", abs_path, e) + boundaries.append(boundary) + + # Check for source files + ext = Path(fname).suffix.lower() + lang = LANG_EXTENSIONS.get(ext) + if not lang: + continue + + rel_path = os.path.relpath(abs_path, root) + if should_exclude(rel_path, excludes): + continue + + test = is_test_file(rel_path) + if not include_tests and test: + continue + + try: + size = os.path.getsize(abs_path) + if size > max_file_size or size == 0: + continue + except OSError: + continue + + if is_binary(abs_path): + continue + + files.append((abs_path, rel_path, lang, test)) + + return sorted(files, key=lambda x: x[1]), boundaries + + +def _enrich_module_boundary(abs_path: str, boundary: ModuleBoundary): + """Extract module name and language from a build file.""" + fname = boundary.build_file + + if fname == "go.mod": + boundary.language = "go" + with open(abs_path, "r", encoding="utf-8") as f: + for line in f: + if line.startswith("module "): + boundary.name = line.split()[1].strip() + break + + elif fname == "package.json": + boundary.language = "javascript" + with open(abs_path, "r", encoding="utf-8") as f: + try: + data = json.load(f) + boundary.name = data.get("name", "") + except json.JSONDecodeError: + pass 
+
+    elif fname in ("pyproject.toml", "setup.cfg"):
+        boundary.language = "python"
+        with open(abs_path, "r", encoding="utf-8") as f:
+            m = _TOML_NAME_RE.search(f.read())
+            if m:
+                boundary.name = m.group(1)
+
+    elif fname == "Cargo.toml":
+        boundary.language = "rust"
+        with open(abs_path, "r", encoding="utf-8") as f:
+            m = _TOML_NAME_RE.search(f.read())
+            if m:
+                boundary.name = m.group(1)
+
+    elif fname in ("pom.xml",):
+        boundary.language = "java"
+        with open(abs_path, "r", encoding="utf-8") as f:
+            content = f.read()
+            # Maven module identifier lives in the <artifactId> element
+            m = re.search(r'<artifactId>([^<]+)</artifactId>', content)
+            if m:
+                boundary.name = m.group(1)
+
+    elif fname in ("build.gradle", "build.gradle.kts"):
+        boundary.language = "java"
+
+
+def read_file(path: str) -> Optional[str]:
+    for encoding in ("utf-8", "latin-1"):
+        try:
+            with open(path, "r", encoding=encoding) as f:
+                return f.read()
+        except (UnicodeDecodeError, OSError):
+            continue
+    return None
+
+
+# ── Python Parser (ast module) ────────────────────────────────────
+
+def parse_python(
+    content: str, rel_path: str, is_test: bool,
+) -> tuple[list[Symbol], list[Edge], list[CallRef]]:
+    """Parse Python using the ast module with call graph extraction."""
+    symbols = []
+    edges = []
+    call_refs = []
+    file_id = f"file:{rel_path}"
+
+    try:
+        tree = python_ast.parse(content, filename=rel_path)
+    except SyntaxError:
+        return symbols, edges, call_refs
+
+    # Extract imports
+    for node in python_ast.walk(tree):
+        if isinstance(node, python_ast.Import):
+            for alias in node.names:
+                edges.append(Edge(file_id, f"module:{alias.name}", "imports"))
+        elif isinstance(node, python_ast.ImportFrom):
+            if node.module:
+                level = node.level or 0
+                module_ref = node.module
+                if level > 0:
+                    # Relative import: compute target module
+                    parts = Path(rel_path).parts
+                    if len(parts) > level:
+                        base = ".".join(parts[:len(parts) - level])
+                        module_ref = f"{base}.{node.module}" if node.module else base
+                edges.append(Edge(file_id, f"module:{module_ref}", "imports"))
+                for alias in
(node.names or []): + if alias.name != "*": + edges.append(Edge( + file_id, + f"symbol:{module_ref}.{alias.name}", + "imports", + {"from_import": True}, + )) + + # Walk top-level nodes for symbols + _walk_python_body( + tree, rel_path, file_id, None, symbols, edges, call_refs, is_test, + ) + + return symbols, edges, call_refs + + +def _walk_python_body( + node, rel_path: str, parent_id: str, scope: Optional[str], + symbols: list, edges: list, call_refs: list, is_test: bool, +): + """Recursively walk Python AST nodes to extract symbols and calls.""" + for child in python_ast.iter_child_nodes(node): + if isinstance(child, python_ast.ClassDef): + _extract_python_class( + child, rel_path, parent_id, scope, symbols, edges, call_refs, is_test, + ) + elif isinstance(child, (python_ast.FunctionDef, python_ast.AsyncFunctionDef)): + _extract_python_function( + child, rel_path, parent_id, scope, symbols, edges, call_refs, is_test, + ) + elif isinstance(child, python_ast.Assign): + # Module-level constants (UPPER_CASE) + if scope is None: # Only at module level + for target in child.targets: + if isinstance(target, python_ast.Name) and target.id.isupper(): + sym_id = f"constant:{rel_path}:{target.id}" + symbols.append(Symbol( + id=sym_id, type="constant", name=target.id, + file=rel_path, language="python", + line_start=child.lineno, + line_end=child.end_lineno or child.lineno, + exported=not target.id.startswith("_"), + is_test=is_test, + )) + edges.append(Edge(parent_id, sym_id, "contains")) + + +def _extract_python_class( + node: python_ast.ClassDef, rel_path: str, parent_id: str, + outer_scope: Optional[str], + symbols: list, edges: list, call_refs: list, is_test: bool, +): + """Extract a Python class, including nested classes and methods.""" + qualified = f"{outer_scope}.{node.name}" if outer_scope else node.name + sym_id = f"class:{rel_path}:{qualified}" + + bases = [] + for base in node.bases: + try: + bases.append(python_ast.unparse(base)) + except (ValueError, 
TypeError): + if isinstance(base, python_ast.Name): + bases.append(base.id) + + decorators = [] + for dec in node.decorator_list: + try: + decorators.append(python_ast.unparse(dec)) + except (ValueError, TypeError): + pass + + docstring = python_ast.get_docstring(node) + symbols.append(Symbol( + id=sym_id, type="class", name=node.name, + file=rel_path, language="python", + line_start=node.lineno, + line_end=node.end_lineno or node.lineno, + scope=outer_scope, + bases=bases, decorators=decorators, + docstring=docstring[:200] if docstring else None, + exported=not node.name.startswith("_"), + is_test=is_test, + )) + edges.append(Edge(parent_id, sym_id, "contains")) + + for base_name in bases: + edges.append(Edge(sym_id, f"class:?:{base_name}", "inherits", {"unresolved": True})) + + # Recurse into class body — handles nested classes, methods, inner functions + _walk_python_body( + node, rel_path, sym_id, qualified, symbols, edges, call_refs, is_test, + ) + + +def _extract_python_function( + node, rel_path: str, parent_id: str, scope: Optional[str], + symbols: list, edges: list, call_refs: list, is_test: bool, +): + """Extract a Python function/method with call graph analysis.""" + qualified = f"{scope}.{node.name}" if scope else node.name + + # Determine if this is a method (parent is a class) + is_method = scope is not None and parent_id.startswith("class:") + sym_type = "method" if is_method else "function" + + sym_id = f"{sym_type}:{rel_path}:{qualified}" + + params = [a.arg for a in node.args.args if a.arg != "self"] + docstring = python_ast.get_docstring(node) + decorators = [] + for dec in node.decorator_list: + try: + decorators.append(python_ast.unparse(dec)) + except (ValueError, TypeError): + pass + + # Build signature + sig_parts = [] + for a in node.args.args: + part = a.arg + if a.annotation: + try: + part += f": {python_ast.unparse(a.annotation)}" + except (ValueError, TypeError): + pass + sig_parts.append(part) + signature = f"({', '.join(sig_parts)})" 
+ if hasattr(node, 'returns') and node.returns: + try: + signature += f" -> {python_ast.unparse(node.returns)}" + except (ValueError, TypeError): + pass + + symbols.append(Symbol( + id=sym_id, type=sym_type, name=node.name, + file=rel_path, language="python", + line_start=node.lineno, + line_end=node.end_lineno or node.lineno, + scope=scope, params=params, decorators=decorators, + docstring=docstring[:200] if docstring else None, + exported=not node.name.startswith("_"), + signature=signature, + is_test=is_test or node.name.startswith("test_"), + )) + edges.append(Edge(parent_id, sym_id, "contains")) + + # Extract calls from function body + _extract_python_calls(node, sym_id, rel_path, call_refs) + + # Recurse into nested functions/classes inside this function + for child in python_ast.iter_child_nodes(node): + if isinstance(child, python_ast.ClassDef): + _extract_python_class( + child, rel_path, sym_id, qualified, symbols, edges, call_refs, is_test, + ) + elif isinstance(child, (python_ast.FunctionDef, python_ast.AsyncFunctionDef)): + # Nested function (closure) + _extract_python_function( + child, rel_path, sym_id, qualified, symbols, edges, call_refs, is_test, + ) + + +def _extract_python_calls( + func_node, caller_id: str, rel_path: str, + call_refs: list, +): + """Walk a Python function body to extract call references. + + Uses a manual stack instead of ast.walk() to skip nested + FunctionDef/AsyncFunctionDef/ClassDef nodes (those are handled + separately and would cause double-counting). 
+ """ + stack = list(python_ast.iter_child_nodes(func_node)) + while stack: + node = stack.pop() + + # Skip nested function/class definitions — they have their own caller_id + if isinstance(node, (python_ast.FunctionDef, python_ast.AsyncFunctionDef, python_ast.ClassDef)): + continue + + if isinstance(node, python_ast.Call): + raw_name = None + line = getattr(node, 'lineno', 0) + + if isinstance(node.func, python_ast.Name): + raw_name = node.func.id + elif isinstance(node.func, python_ast.Attribute): + try: + raw_name = python_ast.unparse(node.func) + except (ValueError, TypeError): + raw_name = node.func.attr + + if raw_name and raw_name not in _PYTHON_BUILTINS: + call_refs.append(CallRef( + caller_id=caller_id, + raw_name=raw_name, + line=line, + file=rel_path, + )) + + # Continue walking children + stack.extend(python_ast.iter_child_nodes(node)) + + +# ── Regex Parsers ───────────────────────────────────────────────── + +def _compile_patterns(raw_patterns): + """Pre-compile a list of (regex_str, sym_type, groups) into (compiled_re, sym_type, groups).""" + return [(re.compile(pat), sym_type, groups) for pat, sym_type, groups in raw_patterns] + +GO_PATTERNS = _compile_patterns([ + (r"^package\s+(\w+)", "package", {"name": 1}), + (r'^func\s+(\w+)\s*\(([^)]*)\)', "function", {"name": 1, "params": 2}), + (r'^func\s+\(\w+\s+\*?(\w+)\)\s+(\w+)\s*\(([^)]*)\)', "method", {"scope": 1, "name": 2, "params": 3}), + (r'^type\s+(\w+)\s+struct\s*\{', "struct", {"name": 1}), + (r'^type\s+(\w+)\s+interface\s*\{', "interface", {"name": 1}), + (r'^\s*"([^"]+)"', "_import", {"name": 1}), +]) + +GO_IMPORT_BLOCK = re.compile(r'^import\s*\(\s*\n(.*?)\n\s*\)', re.MULTILINE | re.DOTALL) +GO_IMPORT_SINGLE = re.compile(r'^import\s+"([^"]+)"', re.MULTILINE) + +JS_PATTERNS = _compile_patterns([ + (r"^(?:export\s+)?class\s+(\w+)(?:\s+extends\s+(\w+))?", "class", {"name": 1, "bases": 2}), + (r"^(?:export\s+)?(?:async\s+)?function\s+(\w+)\s*\(([^)]*)\)", "function", {"name": 1, "params": 2}), + 
(r"^(?:export\s+)?(?:const|let|var)\s+(\w+)\s*=\s*(?:async\s+)?(?:\([^)]*\)|[^=])\s*=>", "function", {"name": 1}), + (r'^import\s+.*?\s+from\s+[\'"]([^\'"]+)[\'"]', "_import", {"name": 1}), + (r'^import\s+[\'"]([^\'"]+)[\'"]', "_import", {"name": 1}), + (r"require\(\s*['\"]([^'\"]+)['\"]\s*\)", "_import", {"name": 1}), + (r"^(?:export\s+)?interface\s+(\w+)", "interface", {"name": 1}), + (r"^(?:export\s+)?type\s+(\w+)\s*=", "interface", {"name": 1}), + (r"^(?:export\s+)?enum\s+(\w+)", "enum", {"name": 1}), +]) + +JAVA_PATTERNS = _compile_patterns([ + (r"^package\s+([\w.]+);", "package", {"name": 1}), + (r"^import\s+([\w.]+);", "_import", {"name": 1}), + (r"(?:public|private|protected)?\s*(?:abstract\s+)?(?:static\s+)?class\s+(\w+)(?:\s+extends\s+(\w+))?", "class", {"name": 1, "bases": 2}), + (r"(?:public|private|protected)?\s*interface\s+(\w+)", "interface", {"name": 1}), + (r"(?:public|private|protected)?\s*enum\s+(\w+)", "enum", {"name": 1}), + (r"(?:public|private|protected)\s+(?:static\s+)?(?:abstract\s+)?(?:synchronized\s+)?(?:final\s+)?\w+(?:<[^>]+>)?\s+(\w+)\s*\(([^)]*)\)", "method", {"name": 1, "params": 2}), +]) + +RUST_PATTERNS = _compile_patterns([ + (r"^pub\s+fn\s+(\w+)\s*(?:<[^>]*>)?\s*\(([^)]*)\)", "function", {"name": 1, "params": 2}), + (r"^fn\s+(\w+)\s*(?:<[^>]*>)?\s*\(([^)]*)\)", "function", {"name": 1, "params": 2}), + (r"^pub\s+struct\s+(\w+)", "struct", {"name": 1}), + (r"^struct\s+(\w+)", "struct", {"name": 1}), + (r"^pub\s+enum\s+(\w+)", "enum", {"name": 1}), + (r"^enum\s+(\w+)", "enum", {"name": 1}), + (r"^pub\s+trait\s+(\w+)", "interface", {"name": 1}), + (r"^trait\s+(\w+)", "interface", {"name": 1}), + (r"^impl(?:<[^>]*>)?\s+(\w+)", "struct", {"name": 1}), + (r'^use\s+([\w:]+)', "_import", {"name": 1}), + (r'^mod\s+(\w+);', "_import", {"name": 1}), +]) + +PROTO_PATTERNS = _compile_patterns([ + (r'^message\s+(\w+)\s*\{', "class", {"name": 1}), + (r'^service\s+(\w+)\s*\{', "interface", {"name": 1}), + 
(r'^\s*rpc\s+(\w+)\s*\((\w+)\)\s*returns\s*\((\w+)\)', "method", {"name": 1, "params": 2}), + (r'^enum\s+(\w+)\s*\{', "enum", {"name": 1}), + (r'^import\s+"([^"]+)"', "_import", {"name": 1}), +]) + +CSHARP_PATTERNS = _compile_patterns([ + (r"(?:public|private|protected|internal)?\s*(?:abstract\s+)?(?:static\s+)?class\s+(\w+)(?:\s*:\s*(\w+))?", "class", {"name": 1, "bases": 2}), + (r"(?:public|private|protected|internal)?\s*interface\s+(\w+)", "interface", {"name": 1}), + (r"(?:public|private|protected|internal)?\s*enum\s+(\w+)", "enum", {"name": 1}), + (r"(?:public|private|protected|internal)\s+(?:static\s+)?(?:async\s+)?(?:virtual\s+)?(?:override\s+)?\w+(?:<[^>]+>)?\s+(\w+)\s*\(([^)]*)\)", "method", {"name": 1, "params": 2}), + (r'^using\s+([\w.]+);', "_import", {"name": 1}), +]) + +LANG_PATTERNS = { + "go": GO_PATTERNS, + "javascript": JS_PATTERNS, + "typescript": JS_PATTERNS, + "java": JAVA_PATTERNS, + "rust": RUST_PATTERNS, + "protobuf": PROTO_PATTERNS, + "csharp": CSHARP_PATTERNS, +} + +# Regex pattern for extracting function calls from source code +CALL_PATTERN = re.compile(r'\b([a-zA-Z_]\w*(?:\.[a-zA-Z_]\w*)*)\s*\(') + +# Shared TOML name extraction pattern (used for pyproject.toml and Cargo.toml) +_TOML_NAME_RE = re.compile(r'name\s*=\s*"([^"]+)"') + + +def _extract_go_import_edges(content: str, file_id: str) -> list[Edge]: + """Extract Go import edges from content (shared by regex and tree-sitter paths).""" + edges = [] + for m in GO_IMPORT_BLOCK.finditer(content): + block = m.group(1) + for line in block.strip().split("\n"): + line = line.strip().strip('"') + if line and not line.startswith("//"): + parts = line.split() + imp = parts[-1].strip('"') if parts else line + edges.append(Edge(file_id, f"module:{imp}", "imports")) + for m in GO_IMPORT_SINGLE.finditer(content): + edges.append(Edge(file_id, f"module:{m.group(1)}", "imports")) + return edges + + +def parse_with_regex( + content: str, rel_path: str, language: str, is_test: bool, +) -> 
tuple[list[Symbol], list[Edge], list[CallRef]]: + """Parse a source file using regex patterns with call extraction.""" + symbols = [] + edges = [] + call_refs = [] + file_id = f"file:{rel_path}" + patterns = LANG_PATTERNS.get(language, []) + if not patterns: + return symbols, edges, call_refs + + lines = content.split("\n") + + # Handle Go import blocks + if language == "go": + edges.extend(_extract_go_import_edges(content, file_id)) + + current_class = None + brace_depth = 0 + symbol_ranges = [] # Track (line_start, sym_id, sym_type) for call extraction + + for lineno, line in enumerate(lines, 1): + stripped = line.rstrip() + + brace_depth += stripped.count("{") - stripped.count("}") + if brace_depth <= 0: + current_class = None + + for compiled_re, sym_type, groups in patterns: + m = compiled_re.search(stripped) + if not m: + continue + + name = m.group(groups["name"]) if "name" in groups else None + if not name: + continue + + if sym_type == "_import": + if language != "go": + edges.append(Edge(file_id, f"module:{name}", "imports")) + continue + + if sym_type == "package": + continue + + params_str = m.group(groups["params"]) if "params" in groups and groups["params"] <= len(m.groups()) else None + params = [p.strip().split()[-1].split(":")[0] for p in params_str.split(",") if p.strip()] if params_str else [] + + bases_match = m.group(groups["bases"]) if "bases" in groups and groups["bases"] <= len(m.groups()) and m.group(groups["bases"]) else None + bases = [bases_match] if bases_match else [] + + scope_match = m.group(groups["scope"]) if "scope" in groups and groups["scope"] <= len(m.groups()) and m.group(groups["scope"]) else None + scope = scope_match or (current_class if sym_type == "method" else None) + + # Determine end line (use len(lines) + 1 so the last line is reachable) + end_line = lineno + for j in range(lineno, min(lineno + 500, len(lines) + 1)): + if j > lineno and j <= len(lines) and lines[j - 1].strip() and not lines[j - 1].startswith((" ", 
"\t")) and brace_depth <= 1: + end_line = j - 1 + break + else: + end_line = min(lineno + 50, len(lines)) + + if scope: + sym_id = f"{sym_type}:{rel_path}:{scope}.{name}" + else: + sym_id = f"{sym_type}:{rel_path}:{name}" + + exported = not name.startswith("_") + if language == "go": + exported = name[0].isupper() if name else True + + symbols.append(Symbol( + id=sym_id, type=sym_type, name=name, + file=rel_path, language=language, + line_start=lineno, line_end=end_line, + scope=scope, params=params, bases=bases, + exported=exported, + is_test=is_test, + )) + + if sym_type in ("class", "struct", "interface", "enum"): + edges.append(Edge(file_id, sym_id, "contains")) + current_class = name + for b in bases: + edges.append(Edge(sym_id, f"class:?:{b}", "inherits", {"unresolved": True})) + elif sym_type == "method" and scope: + parent_id = f"class:{rel_path}:{scope}" + edges.append(Edge(parent_id, sym_id, "contains")) + elif sym_type == "function": + edges.append(Edge(file_id, sym_id, "contains")) + + symbol_ranges.append((lineno, end_line, sym_id)) + break + + # Extract calls from function/method bodies using regex + _extract_regex_calls(lines, symbol_ranges, rel_path, call_refs) + + return symbols, edges, call_refs + + +def _extract_regex_calls( + lines: list[str], symbol_ranges: list, rel_path: str, call_refs: list, +): + """Extract function calls from source code using regex (fallback).""" + for start, end, sym_id in symbol_ranges: + if not (sym_id.startswith("function:") or sym_id.startswith("method:")): + continue + + for lineno in range(start, min(end + 1, len(lines) + 1)): + line = lines[lineno - 1] if lineno <= len(lines) else "" + # Skip comments and strings (rough heuristic) + stripped = line.strip() + if stripped.startswith("//") or stripped.startswith("#") or stripped.startswith("*"): + continue + + for m in CALL_PATTERN.finditer(line): + raw_name = m.group(1) + if raw_name in _REGEX_CALL_KEYWORDS: + continue + call_refs.append(CallRef( + caller_id=sym_id, 
+ raw_name=raw_name, + line=lineno, + file=rel_path, + )) + + +def _extract_regex_imports( + content: str, rel_path: str, language: str, +) -> list[Edge]: + """Extract only import edges using regex — lightweight fallback for tree-sitter.""" + edges = [] + file_id = f"file:{rel_path}" + patterns = LANG_PATTERNS.get(language, []) + + # Handle Go import blocks + if language == "go": + return _extract_go_import_edges(content, file_id) + + # For other languages, scan only import patterns + for line in content.split("\n"): + stripped = line.rstrip() + for compiled_re, sym_type, groups in patterns: + if sym_type != "_import": + continue + m = compiled_re.search(stripped) + if m: + name = m.group(groups["name"]) if "name" in groups else None + if name: + edges.append(Edge(file_id, f"module:{name}", "imports")) + + return edges + + +# ── Tree-sitter Integration ────────────────────────────────────── + +def try_load_tree_sitter() -> dict: + """Try to load tree-sitter with available language grammars.""" + available = {} + + try: + from tree_sitter import Language, Parser + except ImportError: + return available + + grammar_packages = { + "python": "tree_sitter_python", + "go": "tree_sitter_go", + "javascript": "tree_sitter_javascript", + "typescript": "tree_sitter_typescript", + "java": "tree_sitter_java", + "rust": "tree_sitter_rust", + "ruby": "tree_sitter_ruby", + "c": "tree_sitter_c", + "cpp": "tree_sitter_cpp", + } + + for lang, pkg in grammar_packages.items(): + try: + mod = __import__(pkg) + language_func = getattr(mod, "language", None) + if language_func: + lang_obj = Language(language_func()) + parser = Parser(lang_obj) + available[lang] = (parser, lang_obj) + except (ImportError, AttributeError, OSError) as e: + logging.debug("tree-sitter grammar %s unavailable: %s", pkg, e) + + if not available: + try: + from tree_sitter_languages import get_parser, get_language + for lang in ["python", "go", "javascript", "typescript", "java", "rust", "ruby", "c", "cpp"]: + try: 
+ available[lang] = (get_parser(lang), get_language(lang)) + except (ImportError, AttributeError, OSError) as e: + logging.debug("tree-sitter-languages %s unavailable: %s", lang, e) + except ImportError: + pass + + return available + + +# Node types that represent symbol declarations per language +TS_SYMBOL_TYPES = { + "python": { + "class_definition": "class", + "function_definition": "function", + }, + "go": { + "function_declaration": "function", + "method_declaration": "method", + "type_declaration": None, # need sub-check for struct vs interface + }, + "javascript": { + "class_declaration": "class", + "function_declaration": "function", + "arrow_function": "function", + "method_definition": "method", + "interface_declaration": "interface", + }, + "typescript": { + "class_declaration": "class", + "function_declaration": "function", + "arrow_function": "function", + "method_definition": "method", + "interface_declaration": "interface", + "type_alias_declaration": "interface", + "enum_declaration": "enum", + }, + "java": { + "class_declaration": "class", + "interface_declaration": "interface", + "enum_declaration": "enum", + "method_declaration": "method", + "constructor_declaration": "method", + }, + "rust": { + "function_item": "function", + "struct_item": "struct", + "enum_item": "enum", + "trait_item": "interface", + "impl_item": None, # scope provider, not a symbol itself + }, +} + +# Node types that represent function calls per language +TS_CALL_TYPES = { + "python": ["call"], + "go": ["call_expression"], + "javascript": ["call_expression", "new_expression"], + "typescript": ["call_expression", "new_expression"], + "java": ["method_invocation", "object_creation_expression"], + "rust": ["call_expression", "macro_invocation"], +} + +def parse_with_tree_sitter( + content: str, rel_path: str, language: str, parser, lang_obj, is_test: bool, +) -> tuple[list[Symbol], list[Edge], list[CallRef]]: + """Parse a file using tree-sitter with full call graph 
extraction.""" + symbols = [] + edges = [] + call_refs = [] + file_id = f"file:{rel_path}" + + content_bytes = content.encode("utf-8") + try: + tree = parser.parse(content_bytes) + except (ValueError, TypeError, RuntimeError) as e: + logging.debug("tree-sitter parse failed for %s: %s", rel_path, e) + return symbols, edges, call_refs + symbol_types = TS_SYMBOL_TYPES.get(language, {}) + call_types = set(TS_CALL_TYPES.get(language, [])) + + def get_node_text(node): + return content_bytes[node.start_byte:node.end_byte].decode("utf-8", errors="replace") + + def find_name(node): + """Extract the name identifier from a declaration node.""" + # Try field name first + name_node = node.child_by_field_name("name") + if name_node: + return get_node_text(name_node) + + # Try first identifier child + for child in node.children: + if child.type in ("identifier", "name", "type_identifier", "property_identifier"): + return get_node_text(child) + return None + + def find_params(node): + """Extract parameter names from a function/method declaration.""" + params = [] + param_node = node.child_by_field_name("parameters") or node.child_by_field_name("formal_parameters") + if not param_node: + for child in node.children: + if child.type in ("parameters", "formal_parameters", "parameter_list"): + param_node = child + break + + if param_node: + for child in param_node.children: + if child.type in ("identifier", "name"): + text = get_node_text(child) + if text != "self" and text != "cls": + params.append(text) + elif child.type in ("parameter", "formal_parameter", "typed_parameter", + "default_parameter", "typed_default_parameter"): + name_n = child.child_by_field_name("name") + if not name_n: + for sub in child.children: + if sub.type == "identifier": + name_n = sub + break + if name_n: + text = get_node_text(name_n) + if text != "self" and text != "cls": + params.append(text) + return params + + def find_bases(node, language): + """Extract base classes/interfaces from a class 
declaration.""" + bases = [] + # Python: argument_list in class_definition + superclass_node = node.child_by_field_name("superclasses") + if superclass_node: + for child in superclass_node.children: + if child.type in ("identifier", "attribute"): + bases.append(get_node_text(child)) + return bases + + # Java/JS/TS: superclass field + for field_name in ("superclass", "super_class"): + sc = node.child_by_field_name(field_name) + if sc: + bases.append(get_node_text(sc)) + + # Look for extends/implements clauses + for child in node.children: + if child.type in ("superclass", "class_heritage"): + for sub in child.children: + if sub.type in ("identifier", "type_identifier"): + bases.append(get_node_text(sub)) + return bases + + def extract_calls_from_body(body_node, caller_id): + """Walk a function body to extract call references.""" + if body_node is None: + return + + stack = [body_node] + while stack: + node = stack.pop() + + if node.type in call_types: + func_node = node.child_by_field_name("function") + if not func_node: + # Try first child for some languages + if node.children: + func_node = node.children[0] + + if func_node: + raw_name = get_node_text(func_node) + # Clean up the name + raw_name = raw_name.strip() + if raw_name and len(raw_name) < 200: # sanity limit + call_refs.append(CallRef( + caller_id=caller_id, + raw_name=raw_name, + line=node.start_point[0] + 1, + file=rel_path, + )) + + # Don't recurse into nested function/class declarations for calls + if node.type not in symbol_types and node != body_node: + for child in node.children: + stack.append(child) + elif node == body_node: + for child in node.children: + stack.append(child) + + def visit(node, scope=None, parent_id=None): + """Recursively visit tree-sitter AST nodes.""" + if parent_id is None: + parent_id = file_id + + node_type = node.type + + # Handle scope-only nodes (e.g., impl_item in Rust) that provide + # scope context for their children but are not symbols themselves + if node_type in 
symbol_types and symbol_types[node_type] is None: + impl_name = find_name(node) + if impl_name: + for child in node.children: + visit(child, scope=impl_name, parent_id=parent_id) + else: + for child in node.children: + visit(child, scope=scope, parent_id=parent_id) + return + + sym_type = symbol_types.get(node_type) + + if sym_type is not None: + name = find_name(node) + if name: + qualified = f"{scope}.{name}" if scope else name + + # Determine actual sym_type for special cases + actual_type = sym_type + if language == "go" and node_type == "type_declaration": + # Check if struct or interface + for child in node.children: + if child.type == "type_spec": + for sub in child.children: + if sub.type == "struct_type": + actual_type = "struct" + elif sub.type == "interface_type": + actual_type = "interface" + name_node = child.child_by_field_name("name") + if name_node: + name = get_node_text(name_node) + qualified = f"{scope}.{name}" if scope else name + + # Determine if method + if actual_type == "function" and scope and parent_id.startswith("class:"): + actual_type = "method" + + sym_id = f"{actual_type}:{rel_path}:{qualified}" + + exported = not name.startswith("_") + if language == "go": + exported = name[0].isupper() if name else True + + params = find_params(node) + bases = find_bases(node, language) if actual_type in ("class", "struct") else [] + + symbols.append(Symbol( + id=sym_id, type=actual_type, name=name, + file=rel_path, language=language, + line_start=node.start_point[0] + 1, + line_end=node.end_point[0] + 1, + scope=scope, params=params, bases=bases, + exported=exported, + is_test=is_test, + )) + + # Create containment edges + if actual_type in ("class", "struct", "interface", "enum"): + edges.append(Edge(parent_id, sym_id, "contains")) + for b in bases: + edges.append(Edge(sym_id, f"class:?:{b}", "inherits", {"unresolved": True})) + # Recurse into class body with scope + for child in node.children: + visit(child, scope=name, parent_id=sym_id) + return 
+ elif actual_type == "method" and scope: + scope_id = f"class:{rel_path}:{scope}" + edges.append(Edge(scope_id, sym_id, "contains")) + else: + edges.append(Edge(parent_id, sym_id, "contains")) + + # Extract calls from function/method body + body_node = node.child_by_field_name("body") or node.child_by_field_name("block") + if body_node is None: + # Some languages use the last block child + for child in reversed(node.children): + if child.type in ("block", "statement_block", "compound_statement"): + body_node = child + break + extract_calls_from_body(body_node, sym_id) + return # Don't recurse into function body for more symbols (nested handled above) + + # Not a symbol node; recurse + for child in node.children: + visit(child, scope, parent_id) + + visit(tree.root_node) + + # Extract imports using regex (more reliable for multi-language consistency) + edges.extend(_extract_regex_imports(content, rel_path, language)) + + return symbols, edges, call_refs + + +# ── Parser Dispatcher ───────────────────────────────────────────── + +def parse_file( + abs_path: str, rel_path: str, language: str, + ts_parsers: dict, is_test: bool, +) -> tuple[list[Symbol], list[Edge], list[CallRef], int]: + """Parse a single file, choosing the best available parser. + + Returns (symbols, edges, call_refs, line_count). 
+ """ + content = read_file(abs_path) + if content is None: + return [], [], [], 0 + + line_count = content.count("\n") + (1 if content and not content.endswith("\n") else 0) + + # Python: always use ast module (most accurate) + if language == "python": + symbols, edges, calls = parse_python(content, rel_path, is_test) + return symbols, edges, calls, line_count + + # Tree-sitter if available + if language in ts_parsers: + parser, lang_obj = ts_parsers[language] + symbols, edges, calls = parse_with_tree_sitter( + content, rel_path, language, parser, lang_obj, is_test, + ) + if symbols or edges: + return symbols, edges, calls, line_count + + # Fallback: regex + symbols, edges, calls = parse_with_regex(content, rel_path, language, is_test) + return symbols, edges, calls, line_count + + +# ── Call Resolution ─────────────────────────────────────────────── + +def resolve_calls( + call_refs: list[CallRef], + symbols: list[Symbol], + edges: list[Edge], + file_map: dict, +) -> list[Edge]: + """Resolve raw call references to known symbols, creating calls edges.""" + call_edges = [] + + # Build lookup indexes + # name -> list of symbol IDs + name_to_symbols = defaultdict(list) + for sym in symbols: + name_to_symbols[sym.name].append(sym) + if sym.scope: + name_to_symbols[f"{sym.scope}.{sym.name}"].append(sym) + + # file -> {imported_name -> module_path} + file_imports = defaultdict(dict) + for edge in edges: + if edge.type == "imports" and edge.source.startswith("file:"): + src_file = edge.source[5:] + if edge.target.startswith("symbol:"): + # from X import Y -> local name Y maps to symbol + symbol_path = edge.target[7:] + parts = symbol_path.rsplit(".", 1) + if len(parts) == 2: + local_name = parts[1] + file_imports[src_file][local_name] = symbol_path + elif edge.target.startswith("module:"): + mod_name = edge.target[7:] + # import X -> local name X + local_name = mod_name.rsplit(".", 1)[-1] + file_imports[src_file][local_name] = mod_name + + # file -> list of symbols 
defined in that file + file_symbols = defaultdict(list) + for sym in symbols: + file_symbols[sym.file].append(sym) + + # Build a set of all symbol IDs for O(1) existence checks + symbol_ids = {s.id for s in symbols} + + # Deduplicate calls: (caller_id, resolved_target_id) pairs + seen_calls = set() + + for ref in call_refs: + raw = ref.raw_name + caller_id = ref.caller_id + + resolved = None + + # 1. self.method() -> look up method in the same class + if raw.startswith("self.") or raw.startswith("this."): + method_name = raw.split(".", 1)[1] + # Find the class scope from caller_id + # caller_id format: method:file:Class.method_name + parts = caller_id.split(":") + if len(parts) >= 3: + scope_parts = parts[2].rsplit(".", 1) + if len(scope_parts) == 2: + class_name = scope_parts[0] + target_id = f"method:{ref.file}:{class_name}.{method_name}" + if target_id in symbol_ids: + resolved = target_id + + # 2. Direct name -> look up in same file, then imports + if not resolved and "." not in raw: + # Same file first + for sym in file_symbols.get(ref.file, []): + if sym.name == raw and sym.id != caller_id: + resolved = sym.id + break + + # Check imports + if not resolved: + imported_path = file_imports.get(ref.file, {}).get(raw) + if imported_path: + # Find the actual symbol + parts = imported_path.rsplit(".", 1) + if len(parts) == 2: + sym_name = parts[1] + for sym in name_to_symbols.get(sym_name, []): + resolved = sym.id + break + + # 3. Qualified name: module.func or Class.method + if not resolved and "." 
in raw: + parts = raw.split(".") + # Try as imported_module.symbol + if len(parts) == 2: + obj, attr = parts + # Check if obj is an imported module + imported_path = file_imports.get(ref.file, {}).get(obj) + if imported_path: + for sym in name_to_symbols.get(attr, []): + resolved = sym.id + break + + # Check if obj is a class in the same file -> static method or class ref + if not resolved: + for sym in file_symbols.get(ref.file, []): + if sym.name == obj and sym.type in ("class", "struct"): + target_id = f"method:{ref.file}:{obj}.{attr}" + if target_id in symbol_ids: + resolved = target_id + break + + # 4. Last resort: match by name across all symbols + if not resolved: + simple_name = raw.split(".")[-1] + candidates = name_to_symbols.get(simple_name, []) + if len(candidates) == 1 and candidates[0].id != caller_id: + resolved = candidates[0].id + + if resolved: + key = (caller_id, resolved) + if key not in seen_calls and caller_id != resolved: + seen_calls.add(key) + call_edges.append(Edge( + caller_id, resolved, "calls", + {"line": ref.line}, + )) + + return call_edges + + +# ── Test-to-Code Mapping ───────────────────────────────────────── + +def detect_test_relationships( + symbols: list[Symbol], + edges: list[Edge], + files: list[tuple[str, str, str, bool]], + file_map: dict, +) -> list[Edge]: + """Map test files/functions to the source code they test.""" + test_edges = [] + seen = set() + + test_files = [(abs_p, rel_p, lang) for abs_p, rel_p, lang, is_test in files if is_test] + source_files = {rel_p for _, rel_p, _, is_test in files if not is_test} + + # Build indexes — index by both unqualified name and scope.name so + # test_Class_method patterns can match qualified "Class.method" keys. 
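+    # For example (hypothetical names): a source method with name "save" and
+    # scope "UserStore" is indexed under both "save" and "UserStore.save",
+    # so a test function named test_UserStore_save can match through the
+    # qualified "UserStore.save" key.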
+ symbol_by_name = defaultdict(list) + for sym in symbols: + if not sym.is_test: + symbol_by_name[sym.name].append(sym) + if sym.scope: + symbol_by_name[f"{sym.scope}.{sym.name}"].append(sym) + + # Strategy 1: Path mirroring + for _, test_rel, lang in test_files: + source_match = _match_test_to_source_path(test_rel, source_files) + if source_match: + key = (f"file:{test_rel}", f"file:{source_match}") + if key not in seen: + seen.add(key) + test_edges.append(Edge( + f"file:{test_rel}", f"file:{source_match}", "tests", + {"strategy": "path_mirror"}, + )) + + # Strategy 2: Import-based (test imports source module) + file_import_targets = defaultdict(set) + for edge in edges: + if edge.type == "imports" and edge.source.startswith("file:") and edge.target.startswith("file:"): + src = edge.source[5:] + tgt = edge.target[5:] + if is_test_file(src) and tgt in source_files: + file_import_targets[src].add(tgt) + + for test_file, source_targets in file_import_targets.items(): + for source_target in source_targets: + key = (f"file:{test_file}", f"file:{source_target}") + if key not in seen: + seen.add(key) + test_edges.append(Edge( + f"file:{test_file}", f"file:{source_target}", "tests", + {"strategy": "import"}, + )) + + # Strategy 3: Name matching (test_foo -> foo, TestUserModel -> UserModel) + for sym in symbols: + if not sym.is_test: + continue + if sym.type not in ("function", "method", "class"): + continue + + targets = _match_test_symbol_to_source(sym, symbol_by_name) + for target_sym in targets: + key = (sym.id, target_sym.id) + if key not in seen: + seen.add(key) + test_edges.append(Edge( + sym.id, target_sym.id, "tests", + {"strategy": "name_match"}, + )) + + return test_edges + + +def _match_test_to_source_path(test_path: str, source_files: set) -> Optional[str]: + """Match a test file to its source file by path conventions.""" + test_p = Path(test_path) + stem = test_p.stem + parent = str(test_p.parent) + suffix = test_p.suffix + + # Strip test prefixes/suffixes 
from filename + source_stem = stem + for prefix in ("test_", "test"): + if source_stem.startswith(prefix): + source_stem = source_stem[len(prefix):] + break + for suffix_str in ("_test", "_spec", ".test", ".spec"): + if source_stem.endswith(suffix_str): + source_stem = source_stem[:-len(suffix_str)] + break + + if source_stem == stem: + return None # No transformation happened + + # Try same directory + candidate = str(test_p.with_name(source_stem + suffix)) + if candidate in source_files: + return candidate + + # Try stripping test directory prefixes + parent_parts = Path(parent).parts + for test_dir in ("tests", "test", "__tests__", "spec"): + if test_dir in parent_parts: + idx = parent_parts.index(test_dir) + # Replace test dir with src/lib or just remove it + for src_dir in ("src", "lib", "pkg", "internal", ""): + if src_dir: + new_parts = parent_parts[:idx] + (src_dir,) + parent_parts[idx + 1:] + else: + new_parts = parent_parts[:idx] + parent_parts[idx + 1:] + candidate = str(Path(*new_parts) / (source_stem + suffix)) if new_parts else source_stem + suffix + if candidate in source_files: + return candidate + + # Try searching source_files for matching stem + for sf in source_files: + if Path(sf).stem == source_stem: + return sf + + return None + + +def _match_test_symbol_to_source( + test_sym: Symbol, symbol_by_name: dict, +) -> list[Symbol]: + """Match a test function/class to the source symbol it tests.""" + name = test_sym.name + matches = [] + + # test_foo -> foo + if name.startswith("test_"): + target_name = name[5:] + if target_name in symbol_by_name: + matches.extend(symbol_by_name[target_name]) + + # TestFoo -> Foo (test classes) + if name.startswith("Test") and len(name) > 4 and name[4].isupper(): + target_name = name[4:] + if target_name in symbol_by_name: + matches.extend(symbol_by_name[target_name]) + + # test_ClassName_method -> ClassName.method + if name.startswith("test_") and "_" in name[5:]: + parts = name[5:].split("_", 1) + if len(parts) == 
2: + class_name, method_hint = parts + qualified = f"{class_name}.{method_hint}" + if qualified in symbol_by_name: + matches.extend(symbol_by_name[qualified]) + + return matches + + +# ── Graph Builder ───────────────────────────────────────────────── + +def _resolve_relative_import( + src_file: str, module_name: str, known_files: set[str], + js_project_roots: list[str], +) -> Optional[str]: + """Resolve a relative JS/TS import to an actual file path. + + Handles ./foo, ../bar (filesystem-relative) and @/lib/utils + (Next.js src/ alias). + """ + if module_name.startswith("@/"): + # Next.js / TS path alias: @/ maps to src/ within the project root. + # Find the deepest JS project root that is a parent of src_file. + project_root = "" + for root in js_project_roots: + prefix = root + "/" if root else "" + if src_file.startswith(prefix) or not root: + if len(root) > len(project_root): + project_root = root + if project_root: + rel_import = module_name[2:] # strip @/ + base = str(Path(project_root) / "src" / rel_import) + else: + return None + else: + # Standard relative: resolve against the source file's directory + src_dir = str(Path(src_file).parent) + base = os.path.normpath(os.path.join(src_dir, module_name)) + + # Try common extensions + for ext in ("", ".ts", ".tsx", ".js", ".jsx", ".mjs", "/index.ts", "/index.tsx", "/index.js"): + candidate = base + ext + if candidate in known_files: + return candidate + + return None + + +def resolve_imports( + symbols: list[Symbol], edges: list[Edge], file_map: dict[str, str], + module_boundaries: list[ModuleBoundary], +) -> list[Edge]: + """Resolve import edges to actual files where possible.""" + resolved = [] + + # Build lookup maps + module_to_files = defaultdict(list) + for rel_path, lang in file_map.items(): + parts = Path(rel_path).with_suffix("").parts + module_name = ".".join(parts) + module_to_files[module_name].append(rel_path) + + for i in range(len(parts)): + partial = ".".join(parts[i:]) + 
module_to_files[partial].append(rel_path) + + # Also index by __init__.py aware paths + for rel_path, lang in file_map.items(): + if Path(rel_path).stem == "__init__": + # Package directory maps to its __init__.py + pkg_parts = Path(rel_path).parent.parts + if pkg_parts: + pkg_name = ".".join(pkg_parts) + module_to_files[pkg_name].append(rel_path) + + # Map Go import paths + go_pkg_map = defaultdict(list) + for rel_path, lang in file_map.items(): + if lang == "go": + pkg_dir = str(Path(rel_path).parent) + go_pkg_map[pkg_dir].append(rel_path) + + # Use module boundaries for better Go module resolution + go_module_prefix = "" + go_module_root = "" + for boundary in module_boundaries: + if boundary.language == "go" and boundary.name: + go_module_prefix = boundary.name + go_module_root = boundary.root_dir # e.g. "go" + break + + # Build (file, name) -> symbol index for O(1) symbol lookup in import resolution + file_name_to_symbol = {} + for sym in symbols: + file_name_to_symbol[(sym.file, sym.name)] = sym + + # Build a set of known file paths for O(1) lookup during relative resolution + known_files = set(file_map.keys()) + + # Collect JS/TS project roots from module boundaries (package.json locations) + js_project_roots = [ + mb.root_dir for mb in module_boundaries + if mb.build_file == "package.json" + ] + + for edge in edges: + if edge.type != "imports": + resolved.append(edge) + continue + + target = edge.target + if target.startswith("module:"): + module_name = target[7:] + + # Resolve relative JS/TS imports (./foo, ../bar, @/lib/utils) + if module_name.startswith("./") or module_name.startswith("../") or module_name.startswith("@/"): + src_file = edge.source[5:] if edge.source.startswith("file:") else "" + if src_file: + rel_resolved = _resolve_relative_import(src_file, module_name, known_files, js_project_roots) + if rel_resolved: + resolved.append(Edge(edge.source, f"file:{rel_resolved}", "imports", edge.metadata)) + continue + + # Try direct file mapping + 
candidates = module_to_files.get(module_name, []) + if candidates: + for c in candidates: + resolved.append(Edge(edge.source, f"file:{c}", "imports", edge.metadata)) + continue + + # Try Go package mapping with module prefix stripping + # e.g. "github.com/kagent-dev/kagent/go/adk/pkg/a2a" + # -> strip prefix "github.com/kagent-dev/kagent/go" + # -> remainder "/adk/pkg/a2a" -> "adk/pkg/a2a" + # -> prepend module root "go" -> "go/adk/pkg/a2a" + if go_module_prefix and module_name.startswith(go_module_prefix): + remainder = module_name[len(go_module_prefix):].lstrip("/") + if go_module_root: + local_path = f"{go_module_root}/{remainder}" if remainder else go_module_root + else: + local_path = remainder + if local_path in go_pkg_map: + for f in go_pkg_map[local_path]: + resolved.append(Edge(edge.source, f"file:{f}", "imports", edge.metadata)) + continue + + # Try Go package mapping by suffix match on path segments + for pkg_path, pkg_files in go_pkg_map.items(): + # Compare by path suffix (slash-separated), not dotted + if module_name.endswith("/" + pkg_path) or module_name.endswith(pkg_path): + for f in pkg_files: + resolved.append(Edge(edge.source, f"file:{f}", "imports", edge.metadata)) + break + else: + resolved.append(edge) # Keep unresolved external + + elif target.startswith("symbol:"): + symbol_path = target[7:] + parts = symbol_path.rsplit(".", 1) + if len(parts) == 2: + module_part, sym_name = parts + candidates = module_to_files.get(module_part, []) + if candidates: + for c in candidates: + sym = file_name_to_symbol.get((c, sym_name)) + if sym: + resolved.append(Edge(edge.source, sym.id, "imports", edge.metadata)) + else: + resolved.append(Edge(edge.source, f"file:{c}", "imports", edge.metadata)) + continue + resolved.append(edge) + else: + resolved.append(edge) + + # Build class/struct/interface name -> symbols for O(1) inheritance resolution + # Use defaultdict(list) to handle multiple classes with the same name across files + class_by_name: dict[str, 
list[Symbol]] = defaultdict(list) + for sym in symbols: + if sym.type in ("class", "struct", "interface"): + class_by_name[sym.name].append(sym) + + # Resolve inheritance edges + final = [] + for edge in resolved: + if edge.type == "inherits" and edge.metadata.get("unresolved"): + base_name = edge.target.split(":")[-1] + candidates = class_by_name.get(base_name, []) + if len(candidates) == 1: + final.append(Edge(edge.source, candidates[0].id, "inherits")) + elif candidates: + # Disambiguate: prefer class in the same file as the source symbol + source_file = edge.source.split(":")[1] if ":" in edge.source else "" + match = next((c for c in candidates if c.file == source_file), candidates[0]) + final.append(Edge(edge.source, match.id, "inherits")) + else: + final.append(edge) + else: + final.append(edge) + + return final + + +def compute_directory_stats( + files: list[tuple[str, str, str, bool]], symbols: list[Symbol], +) -> dict: + """Compute statistics per directory.""" + dir_stats = defaultdict(lambda: { + "files": 0, "test_files": 0, "languages": set(), + "symbols": 0, "classes": 0, "functions": 0, + }) + + for _, rel_path, lang, is_test in files: + d = str(Path(rel_path).parent) or "." + dir_stats[d]["files"] += 1 + if is_test: + dir_stats[d]["test_files"] += 1 + dir_stats[d]["languages"].add(lang) + + for sym in symbols: + d = str(Path(sym.file).parent) or "." 
+ dir_stats[d]["symbols"] += 1 + if sym.type in ("class", "struct", "interface"): + dir_stats[d]["classes"] += 1 + elif sym.type in ("function", "method"): + dir_stats[d]["functions"] += 1 + + for d in dir_stats: + dir_stats[d]["languages"] = sorted(dir_stats[d]["languages"]) + + return dict(dir_stats) + + +# ── Output Generators ───────────────────────────────────────────── + +def _serialize_boundaries(boundaries: list[ModuleBoundary]) -> list[dict]: + """Serialize module boundaries to dicts (shared by graph and modules output).""" + result = [] + for mb in boundaries: + entry = {"root_dir": mb.root_dir, "build_file": mb.build_file} + if mb.name: + entry["name"] = mb.name + if mb.language: + entry["language"] = mb.language + result.append(entry) + return result + +def compute_graph_stats( + symbols: list[Symbol], edges: list[Edge], + files: list[tuple[str, str, str, bool]], +) -> dict: + """Compute stats shared by graph.json and summary.md (single pass).""" + lang_counts = Counter(lang for _, _, lang, _ in files) + type_counts = Counter(sym.type for sym in symbols) + edge_type_counts = Counter(edge.type for edge in edges) + test_count = sum(1 for _, _, _, t in files if t) + return { + "lang_counts": lang_counts, + "type_counts": type_counts, + "edge_type_counts": edge_type_counts, + "test_count": test_count, + } + + +def generate_graph_json( + symbols: list[Symbol], edges: list[Edge], + files: list[tuple[str, str, str, bool]], repo_root: str, + file_hashes: dict, module_boundaries: list[ModuleBoundary], + line_counts: Optional[dict] = None, + stats: Optional[dict] = None, +) -> dict: + """Generate the full knowledge graph as a JSON-serializable dict.""" + if stats is None: + stats = compute_graph_stats(symbols, edges, files) + + lang_counts = stats["lang_counts"] + type_counts = stats["type_counts"] + edge_type_counts = stats["edge_type_counts"] + + if line_counts is None: + line_counts = {} + + # Build file nodes + nodes = [] + for abs_path, rel_path, lang, 
is_test in files: + line_count = line_counts.get(rel_path, 0) + + node = { + "id": f"file:{rel_path}", + "type": "file", + "name": Path(rel_path).name, + "path": rel_path, + "language": lang, + "line_count": line_count, + } + if is_test: + node["is_test"] = True + nodes.append(node) + + # Add symbol nodes + for sym in symbols: + node = asdict(sym) + node = {k: v for k, v in node.items() if v is not None and v != [] and v is not False} + if "exported" not in node: + node["exported"] = False + if sym.is_test: + node["is_test"] = True + nodes.append(node) + + # Serialize edges + edge_list = [asdict(e) for e in edges] + for e in edge_list: + if not e.get("metadata"): + del e["metadata"] + + return { + "metadata": { + "repo_root": repo_root, + "generated_at": datetime.now(timezone.utc).isoformat(), + "languages": dict(sorted(lang_counts.items(), key=lambda x: -x[1])), + "total_files": len(files), + "total_symbols": len(symbols), + "symbol_types": dict(sorted(type_counts.items(), key=lambda x: -x[1])), + "edge_types": dict(sorted(edge_type_counts.items(), key=lambda x: -x[1])), + "file_hashes": file_hashes, + "module_boundaries": _serialize_boundaries(module_boundaries), + }, + "nodes": nodes, + "edges": edge_list, + } + + +def generate_tags_json(symbols: list[Symbol]) -> list[dict]: + """Generate a flat symbol index for quick lookup.""" + tags = [] + for sym in symbols: + tag = { + "name": sym.name, + "type": sym.type, + "file": sym.file, + "line": sym.line_start, + "end_line": sym.line_end, + "language": sym.language, + } + if sym.scope: + tag["scope"] = sym.scope + if sym.params: + tag["params"] = sym.params + if sym.bases: + tag["bases"] = sym.bases + if sym.docstring: + tag["doc"] = sym.docstring + if sym.signature: + tag["signature"] = sym.signature + if not sym.exported: + tag["private"] = True + if sym.is_test: + tag["is_test"] = True + tags.append(tag) + return sorted(tags, key=lambda t: (t["file"], t["line"])) + + +def generate_modules_json( + edges: 
list[Edge], files: list[tuple[str, str, str, bool]], + module_boundaries: list[ModuleBoundary], +) -> dict: + """Generate module-level dependency map with build module awareness.""" + # Group files by directory + dir_files = defaultdict(list) + for _, rel_path, lang, _ in files: + d = str(Path(rel_path).parent) or "." + dir_files[d].append(rel_path) + + file_to_dir = {} + for _, rel_path, _, _ in files: + file_to_dir[rel_path] = str(Path(rel_path).parent) or "." + + # Module dependency edges + module_deps = defaultdict(set) + for edge in edges: + if edge.type != "imports": + continue + src_file = edge.source.replace("file:", "") if edge.source.startswith("file:") else None + tgt_file = edge.target.replace("file:", "") if edge.target.startswith("file:") else None + + if src_file and tgt_file and src_file in file_to_dir and tgt_file in file_to_dir: + src_dir = file_to_dir[src_file] + tgt_dir = file_to_dir[tgt_file] + if src_dir != tgt_dir: + module_deps[src_dir].add(tgt_dir) + + # Test relationships per module + module_tests = defaultdict(set) + for edge in edges: + if edge.type == "tests": + src = edge.source.replace("file:", "") if edge.source.startswith("file:") else None + tgt = edge.target.replace("file:", "") if edge.target.startswith("file:") else None + if src and tgt and src in file_to_dir and tgt in file_to_dir: + test_dir = file_to_dir[src] + source_dir = file_to_dir[tgt] + module_tests[source_dir].add(test_dir) + + # Build module map + module_map = {} + for d, f in sorted(dir_files.items()): + entry = { + "files": sorted(f), + "depends_on": sorted(module_deps.get(d, set())), + } + tests = module_tests.get(d, set()) + if tests: + entry["tested_by"] = sorted(tests) + module_map[d] = entry + + return { + "modules": module_map, + "build_modules": _serialize_boundaries(module_boundaries), + } + + +def generate_summary_md( + symbols: list[Symbol], edges: list[Edge], + files: list[tuple[str, str, str, bool]], dir_stats: dict, + repo_root: str, ts_available: 
bool,
+    stats: Optional[dict] = None,
+    module_deps: Optional[dict] = None,
+) -> str:
+    """Generate a human/AI-readable codebase summary."""
+    if stats is None:
+        stats = compute_graph_stats(symbols, edges, files)
+
+    lang_counts = stats["lang_counts"]
+    type_counts = stats["type_counts"]
+    edge_type_counts = stats["edge_type_counts"]
+    test_count = stats["test_count"]
+
+    # Single pass over edges for hub files, hub symbols, inheritance, and test coverage
+    import_counts = Counter()
+    call_counts = Counter()
+    inheritance = defaultdict(list)
+    tested_files = set()
+    test_map = defaultdict(set)
+
+    for edge in edges:
+        if edge.type == "imports" and edge.target.startswith("file:"):
+            import_counts[edge.target[5:]] += 1
+        elif edge.type == "calls":
+            call_counts[edge.target] += 1
+        elif edge.type == "inherits" and not edge.metadata.get("unresolved"):
+            child_name = edge.source.split(":")[-1]
+            parent_name = edge.target.split(":")[-1]
+            inheritance[parent_name].append(child_name)
+        elif edge.type == "tests" and edge.target.startswith("file:"):
+            tested_files.add(edge.target[5:])
+            test_name = edge.source[5:] if edge.source.startswith("file:") else edge.source
+            test_map[edge.target[5:]].add(test_name)
+
+    hub_files = import_counts.most_common(15)
+    hub_symbols = call_counts.most_common(15)
+
+    lines = []
+    lines.append("# Codebase Knowledge Graph\n")
+    lines.append(f"Generated: {datetime.now(timezone.utc).strftime('%Y-%m-%d %H:%M UTC')}")
+    lines.append(f"Parser: {'Tree-sitter + AST' if ts_available else 'AST + Regex'}\n")
+
+    # Statistics
+    lines.append("## Statistics\n")
+    lines.append(f"- **{len(files)}** files ({test_count} test files) across **{len(lang_counts)}** languages")
+    lines.append(f"- **{len(symbols)}** symbols extracted")
+    for st, count in sorted(type_counts.items(), key=lambda x: -x[1]):
+        # Pluralize: "class" -> "classes", "method" -> "methods"
+        plural = f"{st}es" if st.endswith("s") else f"{st}s"
+        lines.append(f" - {count} {plural if count != 1 else st}")
+    lines.append(f"- 
**{len(edges)}** edges:") + for et, count in sorted(edge_type_counts.items(), key=lambda x: -x[1]): + lines.append(f" - {count} {et}") + if tested_files: + source_count = sum(1 for _, _, _, t in files if not t) + lines.append(f"- **{len(tested_files)}/{source_count}** source files have test coverage") + lines.append("") + + # Languages + lines.append("## Languages\n") + lines.append("| Language | Files |") + lines.append("|----------|-------|") + for lang, count in sorted(lang_counts.items(), key=lambda x: -x[1]): + lines.append(f"| {lang} | {count} |") + lines.append("") + + # Directory structure + lines.append("## Directory Structure\n") + lines.append("```") + for d in sorted(dir_stats.keys()): + dir_info = dir_stats[d] + indent = " " * d.count("/") + dir_name = Path(d).name or "." + lang_str = ", ".join(dir_info["languages"]) + test_str = f", {dir_info['test_files']} tests" if dir_info.get("test_files") else "" + lines.append(f"{indent}{dir_name}/ ({dir_info['files']} files{test_str}, {lang_str})") + lines.append("```\n") + + # Hub files + if hub_files: + lines.append("## Hub Files (most imported)\n") + lines.append("| File | Imported by |") + lines.append("|------|-------------|") + for f, count in hub_files: + lines.append(f"| `{f}` | {count} files |") + lines.append("") + + # Most-called symbols + if hub_symbols: + lines.append("## Most-Called Symbols\n") + lines.append("| Symbol | Called by |") + lines.append("|--------|----------|") + for sym_id, count in hub_symbols: + parts = sym_id.split(":", 2) + display = parts[-1] if len(parts) >= 3 else sym_id + lines.append(f"| `{display}` | {count} callers |") + lines.append("") + + # Key classes + classes = [s for s in symbols if s.type in ("class", "struct", "interface") and s.exported and not s.is_test] + if classes: + lines.append("## Key Types\n") + lines.append("| Type | Kind | File | Line | Bases |") + lines.append("|------|------|------|------|-------|") + for cls in sorted(classes, key=lambda c: 
c.name)[:50]: + bases_str = ", ".join(cls.bases) if cls.bases else "-" + lines.append(f"| `{cls.name}` | {cls.type} | `{cls.file}` | {cls.line_start} | {bases_str} |") + if len(classes) > 50: + lines.append(f"\n*... and {len(classes) - 50} more types (see tags.json)*\n") + lines.append("") + + # Inheritance trees + roots = [name for name in inheritance if not any(name in children for children in inheritance.values())] + if roots: + lines.append("## Inheritance Hierarchy\n") + lines.append("```") + + def print_tree(name, indent=0, visited=None): + if visited is None: + visited = set() + if name in visited: + return + visited.add(name) + prefix = " " * indent + ("|- " if indent > 0 else "") + lines.append(f"{prefix}{name}") + for child in sorted(inheritance.get(name, [])): + print_tree(child, indent + 1, visited) + + for root in sorted(roots)[:10]: + print_tree(root) + lines.append("```\n") + + # Key functions + exported_funcs = [s for s in symbols if s.type == "function" and s.exported and not s.is_test] + if exported_funcs: + lines.append("## Key Functions\n") + lines.append("| Function | File | Line | Params |") + lines.append("|----------|------|------|--------|") + for func in sorted(exported_funcs, key=lambda f: f.name)[:50]: + params_str = ", ".join(func.params[:5]) if func.params else "-" + if len(func.params) > 5: + params_str += ", ..." + lines.append(f"| `{func.name}` | `{func.file}` | {func.line_start} | {params_str} |") + if len(exported_funcs) > 50: + lines.append(f"\n*... 
and {len(exported_funcs) - 50} more functions (see tags.json)*\n") + lines.append("") + + # Module dependencies (use pre-computed or compute from modules_json) + if module_deps is not None: + dir_deps = module_deps + else: + file_set = {rel for _, rel, _, _ in files} + dir_deps = defaultdict(set) + for edge in edges: + if edge.type == "imports": + src = edge.source.replace("file:", "") + tgt = edge.target.replace("file:", "") + if src in file_set and tgt in file_set: + sd = str(Path(src).parent) or "." + td = str(Path(tgt).parent) or "." + if sd != td: + dir_deps[sd].add(td) + + if dir_deps: + lines.append("## Module Dependencies\n") + lines.append("```") + for src_dir in sorted(dir_deps.keys()): + deps = sorted(dir_deps[src_dir]) + lines.append(f"{src_dir} -> {', '.join(deps)}") + lines.append("```\n") + + # Test coverage (test_map already built in single-pass above) + if tested_files: + lines.append("## Test Coverage Map\n") + lines.append("| Source File | Tested By |") + lines.append("|------------|-----------|") + for src_file in sorted(test_map.keys())[:30]: + testers = sorted(test_map[src_file]) + testers_str = ", ".join(f"`{t}`" for t in testers[:3]) + if len(testers) > 3: + testers_str += f" +{len(testers) - 3} more" + lines.append(f"| `{src_file}` | {testers_str} |") + if len(test_map) > 30: + lines.append(f"\n*... 
and {len(test_map) - 30} more tested files*\n") + lines.append("") + + return "\n".join(lines) + + +def _edge_file(node_id: str) -> str: + """Extract the file path from a node ID for incremental merge filtering.""" + if node_id.startswith("file:"): + return node_id[5:] + # Symbol IDs have format "type:file:name" — extract file part + parts = node_id.split(":", 2) + if len(parts) >= 3: + return parts[1] + return "" + + +# ── Main ────────────────────────────────────────────────────────── + +def main(): + parser = argparse.ArgumentParser( + description="Build a knowledge graph of a codebase with call graph and test mapping", + formatter_class=argparse.RawDescriptionHelpFormatter, + ) + parser.add_argument( + "--repo-root", default=".", + help="Repository root directory (default: current directory)", + ) + parser.add_argument( + "--output-dir", default=None, + help="Output directory (default: /docs/code-tree)", + ) + parser.add_argument( + "--exclude", action="append", default=[], + help="Additional glob patterns to exclude (repeatable)", + ) + parser.add_argument( + "--max-file-size", type=int, default=MAX_FILE_SIZE, + help=f"Skip files larger than this in bytes (default: {MAX_FILE_SIZE})", + ) + parser.add_argument( + "--exclude-tests", action="store_true", + help="Exclude test files from analysis (tests are included by default)", + ) + parser.add_argument( + "--no-tree-sitter", action="store_true", + help="Disable tree-sitter even if available", + ) + parser.add_argument( + "--no-call-graph", action="store_true", + help="Skip call graph extraction (faster but less detailed)", + ) + parser.add_argument( + "--incremental", action="store_true", + help="Only reparse files that changed since last run", + ) + parser.add_argument( + "--quiet", "-q", action="store_true", + help="Suppress progress output", + ) + + args = parser.parse_args() + repo_root = os.path.abspath(args.repo_root) + output_dir = args.output_dir or os.path.join(repo_root, "docs", "code-tree") + excludes 
= DEFAULT_EXCLUDES + args.exclude + include_tests = not args.exclude_tests + + def log(msg): + if not args.quiet: + print(msg, file=sys.stderr) + + log(f"Scanning: {repo_root}") + + # Load previous graph for incremental mode + prev_hashes = {} + prev_graph = None + if args.incremental: + prev_graph_path = os.path.join(output_dir, "graph.json") + if os.path.exists(prev_graph_path): + try: + with open(prev_graph_path, "r", encoding="utf-8") as f: + prev_graph = json.load(f) + prev_hashes = prev_graph.get("metadata", {}).get("file_hashes", {}) + log(f"Loaded previous graph with {len(prev_hashes)} file hashes") + except (json.JSONDecodeError, OSError): + prev_graph = None + + # Load tree-sitter + ts_parsers = {} + if not args.no_tree_sitter: + ts_parsers = try_load_tree_sitter() + if ts_parsers: + log(f"Tree-sitter available for: {', '.join(sorted(ts_parsers.keys()))}") + else: + log("Tree-sitter not available, using AST/regex parsers") + + # Discover files and build boundaries in a single pass + files, module_boundaries = discover_all(repo_root, excludes, include_tests, args.max_file_size) + test_file_count = sum(1 for _, _, _, t in files if t) + source_file_count = len(files) - test_file_count + log(f"Found {len(files)} files ({source_file_count} source, {test_file_count} test)") + + if not files: + log("No files found. 
Check --repo-root and --exclude options.")
+        sys.exit(1)
+
+    if module_boundaries:
+        log(f"Found {len(module_boundaries)} build module(s): {', '.join(mb.name or mb.build_file for mb in module_boundaries)}")
+
+    # Compute file hashes and determine what needs parsing
+    current_hashes = {}
+    files_to_parse = []
+    for abs_path, rel_path, lang, is_test in files:
+        h = file_hash(abs_path)
+        current_hashes[rel_path] = h
+        if args.incremental and prev_hashes.get(rel_path) == h:
+            continue  # File unchanged
+        files_to_parse.append((abs_path, rel_path, lang, is_test))
+
+    if args.incremental:
+        log(f"Incremental: {len(files_to_parse)}/{len(files)} files need parsing")
+
+    # Build file_map from ALL files (needed by resolve_imports even in incremental mode)
+    file_map = {rel_path: lang for _, rel_path, lang, _ in files}
+
+    # Parse files
+    all_symbols = []
+    all_edges = []
+    all_call_refs = []
+    line_counts = {}
+
+    # In incremental mode, load symbols/edges for unchanged files from previous
+    # graph. Do this even when changed_files is empty: a no-change incremental
+    # run should reuse the previous graph, not fall back to a full reparse.
+    changed_files = {rel for _, rel, _, _ in files_to_parse}
+    if args.incremental and prev_graph is not None:
+        prev_nodes = prev_graph.get("nodes", [])
+        prev_edges = prev_graph.get("edges", [])
+
+        for node in prev_nodes:
+            node_file = node.get("file") or node.get("path", "")
+            # Skip nodes from changed files (they will be reparsed) and from
+            # files that no longer exist in the current scan.
+            if node_file and node_file not in changed_files and node_file in file_map:
+                if node.get("type") == "file":
+                    line_counts[node.get("path", "")] = node.get("line_count", 0)
+                else:
+                    all_symbols.append(Symbol(
+                        id=node["id"], type=node["type"], name=node["name"],
+                        file=node.get("file", ""), language=node.get("language", ""),
+                        line_start=node.get("line_start", 0),
+                        line_end=node.get("line_end", 0),
+                        scope=node.get("scope"),
+                        params=node.get("params", []),
+                        bases=node.get("bases", []),
+                        decorators=node.get("decorators", []),
+                        docstring=node.get("docstring"),
+                        exported=node.get("exported", True),
+                        signature=node.get("signature"),
+                        is_test=node.get("is_test", False),
+                    ))
+
+        for edge_data in prev_edges:
+            src_file = _edge_file(edge_data.get("source", ""))
+            # Drop edges whose SOURCE is a changed file (reparsing will
+            # regenerate them) or a deleted file (no longer in file_map).
+            # Keep edges from unchanged files even if they point TO a
+            # changed file: those importers/callers are not reparsed, so
+            # their outgoing edges must be preserved.
+            if src_file not in changed_files and (not src_file or src_file in file_map):
+                all_edges.append(Edge(
+                    source=edge_data["source"], target=edge_data["target"],
+                    type=edge_data["type"],
+                    metadata=edge_data.get("metadata", {}),
+                ))
+
+        log(f"Loaded {len(all_symbols)} symbols, {len(all_edges)} edges from previous graph (unchanged files)")
+
+    # Non-incremental runs put every file in files_to_parse, and incremental
+    # runs already loaded unchanged files above, so files_to_parse is always
+    # the full parse set (it may legitimately be empty when nothing changed).
+    parse_target = files_to_parse
+    parse_count = len(parse_target)
+
+    for i, (abs_path, rel_path, lang, is_test) in enumerate(parse_target):
+        if not args.quiet and (i + 1) % 100 == 0:
+            log(f"  Parsed {i + 1}/{parse_count} files...")
+
+        symbols, file_edges, call_refs, lc = parse_file(abs_path, rel_path, lang, ts_parsers, is_test)
+        line_counts[rel_path] = lc
+        all_symbols.extend(symbols)
+        all_edges.extend(file_edges)
+        if not args.no_call_graph:
+            all_call_refs.extend(call_refs)
+
+    log(f"Extracted {len(all_symbols)} symbols, {len(all_edges)} edges, {len(all_call_refs)} call references")
+
+    # Resolve calls BEFORE import resolution so the alias map (built from
+    # module:/symbol: import edges) is still intact.
+ if not args.no_call_graph and all_call_refs: + call_edges = resolve_calls(all_call_refs, all_symbols, all_edges, file_map) + all_edges.extend(call_edges) + log(f"Resolved {len(call_edges)} call edges from {len(all_call_refs)} references") + + # Resolve imports (rewrites module:/symbol: edges to file: edges) + all_edges = resolve_imports(all_symbols, all_edges, file_map, module_boundaries) + log(f"Resolved imports: {len(all_edges)} edges after resolution") + + # Detect test-to-code relationships + if include_tests: + test_edges = detect_test_relationships(all_symbols, all_edges, files, file_map) + all_edges.extend(test_edges) + log(f"Mapped {len(test_edges)} test relationships") + + # Compute directory stats + dir_stats = compute_directory_stats(files, all_symbols) + + # Generate outputs + os.makedirs(output_dir, exist_ok=True) + + # Compute stats once and share across generators + graph_stats = compute_graph_stats(all_symbols, all_edges, files) + + # graph.json + graph = generate_graph_json( + all_symbols, all_edges, files, repo_root, current_hashes, module_boundaries, + line_counts=line_counts, stats=graph_stats, + ) + graph_path = os.path.join(output_dir, "graph.json") + with open(graph_path, "w", encoding="utf-8") as f: + json.dump(graph, f, indent=2, default=str) + log(f"Wrote {graph_path}") + + # tags.json + tags = generate_tags_json(all_symbols) + tags_path = os.path.join(output_dir, "tags.json") + with open(tags_path, "w", encoding="utf-8") as f: + json.dump(tags, f, indent=2) + log(f"Wrote {tags_path}") + + # modules.json + modules = generate_modules_json(all_edges, files, module_boundaries) + modules_path = os.path.join(output_dir, "modules.json") + with open(modules_path, "w", encoding="utf-8") as f: + json.dump(modules, f, indent=2) + log(f"Wrote {modules_path}") + + # summary.md (reuse stats) + summary = generate_summary_md( + all_symbols, all_edges, files, dir_stats, + repo_root, bool(ts_parsers), stats=graph_stats, + ) + summary_path = 
os.path.join(output_dir, "summary.md") + with open(summary_path, "w", encoding="utf-8") as f: + f.write(summary) + log(f"Wrote {summary_path}") + + # Print summary (reuse stats instead of re-counting) + log(f"\nDone. Output in: {output_dir}") + log(f" graph.json - Full knowledge graph ({len(graph['nodes'])} nodes, {len(graph['edges'])} edges)") + log(f" tags.json - Symbol index ({len(tags)} tags)") + log(f" modules.json - Module dependencies ({len(modules['modules'])} modules)") + log(f" summary.md - Codebase overview") + log(f"\n Edge breakdown:") + for et, count in sorted(graph_stats["edge_type_counts"].items(), key=lambda x: -x[1]): + log(f" {et}: {count}") + + +if __name__ == "__main__": + main() diff --git a/tools/code-tree/query_graph.py b/tools/code-tree/query_graph.py new file mode 100644 index 000000000..7ca9b6e31 --- /dev/null +++ b/tools/code-tree/query_graph.py @@ -0,0 +1,1043 @@ +#!/usr/bin/env python3 +"""query_graph.py - Query the code-tree knowledge graph with impact analysis. + +Traverses the knowledge graph to find relevant code, trace call chains, +analyze change impact, and map test coverage. + +Usage: + python query_graph.py --symbol ClassName # Find symbol definition + python query_graph.py --deps path/to/file.py # File dependencies + python query_graph.py --rdeps path/to/file.py # Reverse dependencies + python query_graph.py --hierarchy ClassName # Inheritance tree + python query_graph.py --search "keyword" # Search symbols + python query_graph.py --module src/api # Module overview + python query_graph.py --entry-points # Find entry points + python query_graph.py --chunks path/to/file.py # Extract code chunks + python query_graph.py --callers func_name # Who calls this? + python query_graph.py --callees func_name # What does this call? + python query_graph.py --impact path/to/file.py # Change impact analysis + python query_graph.py --test-impact path/to/file.py # Which tests are affected? 
+ +All queries output file:line references suitable for AI agent context. +""" + +import argparse +import json +import os +import sys +from collections import defaultdict, deque +from pathlib import Path +from typing import Optional + + +def load_graph(graph_dir: str) -> dict: + graph_path = os.path.join(graph_dir, "graph.json") + if not os.path.exists(graph_path): + print(f"Error: graph.json not found in {graph_dir}", file=sys.stderr) + print("Run code_tree.py first to generate the knowledge graph.", file=sys.stderr) + sys.exit(1) + with open(graph_path, "r", encoding="utf-8") as f: + return json.load(f) + + +def load_tags(graph_dir: str) -> list[dict]: + tags_path = os.path.join(graph_dir, "tags.json") + if not os.path.exists(tags_path): + return [] + with open(tags_path, "r", encoding="utf-8") as f: + return json.load(f) + + +def load_modules(graph_dir: str) -> dict: + modules_path = os.path.join(graph_dir, "modules.json") + if not os.path.exists(modules_path): + return {"modules": {}} + with open(modules_path, "r", encoding="utf-8") as f: + return json.load(f) + + +class GraphQuery: + """Query interface for the code-tree knowledge graph with impact analysis.""" + + def __init__(self, graph_dir: str, repo_root: str): + self.graph = load_graph(graph_dir) + self.tags = load_tags(graph_dir) + self.modules = load_modules(graph_dir) + self.repo_root = repo_root + + # Build indexes + self.nodes_by_id = {} + self.nodes_by_name = defaultdict(list) + self.nodes_by_file = defaultdict(list) + self.nodes_by_type = defaultdict(list) + + for node in self.graph.get("nodes", []): + nid = node.get("id", "") + self.nodes_by_id[nid] = node + name = node.get("name", "") + self.nodes_by_name[name].append(node) + file_path = node.get("file") or node.get("path", "") + if file_path: + self.nodes_by_file[file_path].append(node) + self.nodes_by_type[node.get("type", "")].append(node) + + # Build adjacency lists + self.outgoing = defaultdict(list) # source -> [(target, edge)] + 
self.incoming = defaultdict(list) # target -> [(source, edge)] + + # Also index by edge type for fast filtering + self.outgoing_by_type = defaultdict(lambda: defaultdict(list)) + self.incoming_by_type = defaultdict(lambda: defaultdict(list)) + + for edge in self.graph.get("edges", []): + src = edge.get("source", "") + tgt = edge.get("target", "") + etype = edge.get("type", "") + self.outgoing[src].append((tgt, edge)) + self.incoming[tgt].append((src, edge)) + self.outgoing_by_type[etype][src].append((tgt, edge)) + self.incoming_by_type[etype][tgt].append((src, edge)) + + # ── Symbol Lookup ───────────────────────────────────────────── + + def find_symbol(self, name: str) -> list[dict]: + """Find all symbols matching a name (exact or partial).""" + results = [] + + if name in self.nodes_by_name: + results.extend(self.nodes_by_name[name]) + + if not results: + name_lower = name.lower() + for n, nodes in self.nodes_by_name.items(): + if name_lower in n.lower(): + results.extend(nodes) + + return results + + # ── Dependency Queries ──────────────────────────────────────── + + def get_dependencies(self, file_path: str) -> dict: + """Get all files/symbols that a given file depends on.""" + file_id = f"file:{file_path}" + deps = {"imports": [], "inherits": [], "calls": []} + + for target, edge in self.outgoing.get(file_id, []): + if edge["type"] == "imports" and target.startswith("file:"): + deps["imports"].append(target[5:]) + + # Check symbols in this file + for node in self.nodes_by_file.get(file_path, []): + nid = node.get("id", "") + for target, edge in self.outgoing.get(nid, []): + if edge["type"] == "inherits": + target_node = self.nodes_by_id.get(target, {}) + if target_node: + deps["inherits"].append({ + "from": node.get("name"), + "to": target_node.get("name"), + "file": target_node.get("file"), + }) + elif edge["type"] == "calls": + target_node = self.nodes_by_id.get(target, {}) + if target_node and target_node.get("file") != file_path: + deps["calls"].append({ 
+ "caller": node.get("name"), + "callee": target_node.get("name"), + "file": target_node.get("file"), + }) + + return deps + + def get_reverse_dependencies(self, file_path: str) -> list[str]: + """Get all files that import a given file.""" + file_id = f"file:{file_path}" + rdeps = set() + + for source, edge in self.incoming.get(file_id, []): + if edge["type"] == "imports" and source.startswith("file:"): + rdeps.add(source[5:]) + + for node in self.nodes_by_file.get(file_path, []): + nid = node.get("id", "") + for source, edge in self.incoming.get(nid, []): + if edge["type"] == "imports" and source.startswith("file:"): + rdeps.add(source[5:]) + + return sorted(rdeps) + + # ── Call Graph Queries ──────────────────────────────────────── + + def get_callers(self, symbol_name: str) -> list[dict]: + """Find all functions/methods that call a given symbol.""" + # Find the target symbol(s) + targets = [ + n for n in self.nodes_by_name.get(symbol_name, []) + if n.get("type") in ("function", "method", "class") + ] + + results = [] + seen = set() + + for target in targets: + tid = target.get("id", "") + for source_id, edge in self.incoming_by_type["calls"].get(tid, []): + if source_id in seen: + continue + seen.add(source_id) + source_node = self.nodes_by_id.get(source_id, {}) + results.append({ + "caller": source_node.get("name", "?"), + "type": source_node.get("type", "?"), + "file": source_node.get("file", "?"), + "line": source_node.get("line_start", "?"), + "call_line": edge.get("metadata", {}).get("line", "?"), + }) + + return sorted(results, key=lambda x: (str(x.get("file", "")), x.get("line", 0))) + + def get_callees(self, symbol_name: str) -> list[dict]: + """Find all functions/methods called by a given symbol.""" + sources = [ + n for n in self.nodes_by_name.get(symbol_name, []) + if n.get("type") in ("function", "method") + ] + + results = [] + seen = set() + + for source in sources: + sid = source.get("id", "") + for target_id, edge in 
self.outgoing_by_type["calls"].get(sid, []): + if target_id in seen: + continue + seen.add(target_id) + target_node = self.nodes_by_id.get(target_id, {}) + results.append({ + "callee": target_node.get("name", "?"), + "type": target_node.get("type", "?"), + "file": target_node.get("file", "?"), + "line": target_node.get("line_start", "?"), + "call_line": edge.get("metadata", {}).get("line", "?"), + }) + + return sorted(results, key=lambda x: (str(x.get("file", "")), x.get("line", 0))) + + def get_call_chain(self, from_name: str, to_name: str, max_depth: int = 8) -> list[list[str]]: + """Find call paths from one symbol to another (BFS shortest paths).""" + # Find source and target IDs + from_ids = { + n["id"] for n in self.nodes_by_name.get(from_name, []) + if n.get("type") in ("function", "method") + } + to_ids = { + n["id"] for n in self.nodes_by_name.get(to_name, []) + if n.get("type") in ("function", "method", "class") + } + + if not from_ids or not to_ids: + return [] + + # BFS from each source + paths = [] + for start in from_ids: + queue = deque([(start, [start])]) + visited = {start} + + while queue and len(paths) < 5: + current, path = queue.popleft() + if len(path) > max_depth: + continue + + for target_id, edge in self.outgoing_by_type["calls"].get(current, []): + if target_id in to_ids: + paths.append(path + [target_id]) + continue + if target_id not in visited: + visited.add(target_id) + queue.append((target_id, path + [target_id])) + + return paths + + # ── Hierarchy ───────────────────────────────────────────────── + + def get_hierarchy(self, class_name: str) -> dict: + """Get the inheritance hierarchy for a class.""" + candidates = [ + n for n in self.nodes_by_name.get(class_name, []) + if n.get("type") in ("class", "struct", "interface") + ] + + if not candidates: + return {"error": f"Class '{class_name}' not found"} + + result = {"ancestors": [], "descendants": []} + + for cls in candidates: + cls_id = cls.get("id", "") + visited = set() + + def 
find_ancestors(node_id, depth=0): + if node_id in visited or depth > 10: + return + visited.add(node_id) + for target, edge in self.outgoing.get(node_id, []): + if edge["type"] == "inherits": + target_node = self.nodes_by_id.get(target, {"name": target.split(":")[-1]}) + result["ancestors"].append({ + "name": target_node.get("name", "?"), + "file": target_node.get("file", "?"), + "depth": depth + 1, + }) + find_ancestors(target, depth + 1) + + find_ancestors(cls_id) + visited.clear() + + def find_descendants(node_id, depth=0): + if node_id in visited or depth > 10: + return + visited.add(node_id) + for source, edge in self.incoming.get(node_id, []): + if edge["type"] == "inherits": + source_node = self.nodes_by_id.get(source, {"name": source.split(":")[-1]}) + result["descendants"].append({ + "name": source_node.get("name", "?"), + "file": source_node.get("file", "?"), + "depth": depth + 1, + }) + find_descendants(source, depth + 1) + + find_descendants(cls_id) + + return result + + # ── Search ──────────────────────────────────────────────────── + + def search(self, keyword: str) -> list[dict]: + """Search symbols by keyword (name, docstring, file path).""" + keyword_lower = keyword.lower() + results = [] + + for tag in self.tags: + score = 0 + name = tag.get("name", "") + doc = tag.get("doc", "") + file_path = tag.get("file", "") + + if keyword_lower == name.lower(): + score = 100 + elif keyword_lower in name.lower(): + score = 50 + elif keyword_lower in file_path.lower(): + score = 20 + elif doc and keyword_lower in doc.lower(): + score = 10 + + if score > 0: + results.append({**tag, "_score": score}) + + results.sort(key=lambda x: (-x["_score"], x.get("file", ""), x.get("line", 0))) + return results[:30] + + # ── Module Queries ──────────────────────────────────────────── + + def get_module(self, module_path: str) -> dict: + """Get overview of a module (directory).""" + module_data = self.modules.get("modules", {}).get(module_path, {}) + + module_symbols = [] + 
for tag in self.tags:
+            f = tag.get("file", "")
+            # Match on a directory boundary so "src/api" does not also pick
+            # up files under a sibling directory like "src/apiv2/".
+            if f.startswith(module_path.rstrip("/") + "/"):
+                module_symbols.append(tag)
+
+        return {
+            "path": module_path,
+            "files": module_data.get("files", []),
+            "depends_on": module_data.get("depends_on", []),
+            "tested_by": module_data.get("tested_by", []),
+            "symbols": module_symbols,
+        }
+
+    def find_entry_points(self) -> list[dict]:
+        """Find likely entry points (files with no internal importers)."""
+        all_files = {
+            n.get("path") for n in self.nodes_by_type.get("file", []) if n.get("path")
+        }
+
+        # Use the imports index instead of scanning all edges
+        imported_files = set()
+        for file_id, edges_list in self.incoming_by_type["imports"].items():
+            if file_id.startswith("file:"):
+                imported_files.add(file_id[5:])
+
+        entry_candidates = all_files - imported_files
+
+        main_patterns = [
+            "main.py", "main.go", "index.js", "index.ts", "app.py",
+            "server.go", "__main__.py", "cli.py", "cmd/",
+        ]
+
+        # Build a file-to-tags index for efficient lookup
+        tags_by_file = defaultdict(list)
+        for t in self.tags:
+            f = t.get("file")
+            if f:
+                tags_by_file[f].append(t)
+
+        results = []
+
+        for f in sorted(entry_candidates):
+            priority = 0
+            for pat in main_patterns:
+                if pat in f:
+                    priority = 10
+                    break
+
+            file_symbols = tags_by_file.get(f, [])
+            has_main = any(t.get("name") == "main" for t in file_symbols)
+            if has_main:
+                priority = 20
+
+            results.append({
+                "file": f,
+                "priority": priority,
+                "symbols": len(file_symbols),
+            })
+
+        results.sort(key=lambda x: (-x["priority"], x["file"]))
+        return results[:20]
+
+    # ── Chunk Extraction ──────────────────────────────────────────
+
+    def extract_chunks(self, file_path: str) -> list[dict]:
+        """Extract clean code chunks from a file using symbol boundaries."""
+        abs_path = os.path.join(self.repo_root, file_path)
+        if not os.path.exists(abs_path):
+            return [{"error": f"File not found: {file_path}"}]
+
+        try:
+            with open(abs_path, "r", encoding="utf-8", errors="ignore") as f:
+                lines = f.readlines()
+        except 
OSError: + return [{"error": f"Cannot read: {file_path}"}] + + file_symbols = sorted( + [t for t in self.tags if t.get("file") == file_path], + key=lambda t: t.get("line", 0), + ) + + if not file_symbols: + return [{ + "file": file_path, + "line_start": 1, + "line_end": len(lines), + "type": "file", + "name": Path(file_path).name, + "content": "".join(lines[:200]), + }] + + chunks = [] + for sym in file_symbols: + line_start = sym.get("line", 1) + line_end = sym.get("end_line", line_start + 50) + + # Use actual end_line from tags if available + line_end = min(line_end, len(lines)) + + chunk_lines = lines[line_start - 1 : line_end] + chunks.append({ + "file": file_path, + "line_start": line_start, + "line_end": line_end, + "type": sym.get("type", "unknown"), + "name": sym.get("name", "?"), + "scope": sym.get("scope"), + "content": "".join(chunk_lines), + }) + + return chunks + + # ── Impact Analysis ─────────────────────────────────────────── + + def _compute_impact_bfs(self, target: str, max_depth: int = 5) -> tuple[dict, set]: + """Core BFS for impact analysis. Returns (affected_map, start_ids). 
+ + affected_map: node_id -> {"depth": N, "via": edge_type} + start_ids: set of origin node IDs + """ + start_ids = set() + + # Try as file path + file_id = f"file:{target}" + if file_id in self.nodes_by_id: + start_ids.add(file_id) + for node in self.nodes_by_file.get(target, []): + start_ids.add(node.get("id", "")) + else: + # Try as symbol name + for node in self.nodes_by_name.get(target, []): + start_ids.add(node.get("id", "")) + if node.get("file"): + start_ids.add(f"file:{node['file']}") + + if not start_ids: + return {}, start_ids + + # BFS through reverse dependency edges (imports, calls, inherits) + affected = {} + queue = deque() + + for sid in start_ids: + affected[sid] = {"depth": 0, "via": "origin"} + queue.append((sid, 0)) + + while queue: + current_id, depth = queue.popleft() + if depth >= max_depth: + continue + + for source_id, edge in self.incoming.get(current_id, []): + etype = edge.get("type", "") + if etype not in ("imports", "calls", "inherits"): + continue + + if source_id not in affected or affected[source_id]["depth"] > depth + 1: + affected[source_id] = {"depth": depth + 1, "via": etype} + queue.append((source_id, depth + 1)) + + return affected, start_ids + + def _build_impact_report(self, target: str, affected: dict, max_depth: int) -> dict: + """Build a structured impact report from pre-computed BFS results. + + Organizes the affected map into files and symbols ranked by distance. 
+ """ + affected_files = set() + affected_symbols = [] + + for node_id, info in affected.items(): + if info["depth"] == 0: + continue # Skip the origin nodes + + node = self.nodes_by_id.get(node_id, {}) + if node.get("type") == "file": + affected_files.add(node.get("path", "")) + elif node.get("type") in ("function", "method", "class", "struct", "interface"): + affected_symbols.append({ + "name": node.get("name", "?"), + "type": node.get("type", "?"), + "file": node.get("file", "?"), + "line": node.get("line_start", "?"), + "depth": info["depth"], + "via": info["via"], + }) + + # Track the file too + if node.get("file"): + affected_files.add(node["file"]) + + # Sort by depth then name + affected_symbols.sort(key=lambda x: (x["depth"], str(x.get("file", "")), x.get("name", ""))) + + return { + "target": target, + "affected_files": sorted(affected_files), + "affected_symbols": affected_symbols, + "total_affected_files": len(affected_files), + "total_affected_symbols": len(affected_symbols), + "max_depth_reached": max_depth, + } + + def get_impact(self, target: str, max_depth: int = 5) -> dict: + """Compute transitive impact of changing a file or symbol. + + Given a file path or symbol name, finds everything that depends on it + transitively through imports, calls, and inheritance edges. + + Returns a structured impact report with affected items ranked by distance. + """ + affected, start_ids = self._compute_impact_bfs(target, max_depth) + + if not start_ids: + return {"error": f"No file or symbol found matching '{target}'"} + + return self._build_impact_report(target, affected, max_depth) + + def get_test_impact(self, target: str, max_depth: int = 5) -> dict: + """Find all tests affected by a change to a file or symbol. + + Combines impact analysis with test mapping to determine which tests + need to be re-run or reviewed when code changes. 
+ """ + affected, start_ids = self._compute_impact_bfs(target, max_depth) + + if not start_ids: + return {"error": f"No file or symbol found matching '{target}'"} + + # Reuse BFS result to build impact summary (avoids duplicate BFS) + impact = self._build_impact_report(target, affected, max_depth) + + # All affected node IDs are already in the affected map + affected_ids = set(affected.keys()) + + # Use the tests index for O(1) lookup per affected node + affected_tests = [] + seen_tests = set() + + for node_id in affected_ids: + for test_source, edge in self.incoming_by_type["tests"].get(node_id, []): + if test_source in seen_tests: + continue + seen_tests.add(test_source) + test_node = self.nodes_by_id.get(test_source, {}) + + test_file = test_node.get("file") or test_node.get("path", "") + if test_source.startswith("file:"): + test_file = test_source[5:] + + affected_tests.append({ + "test": test_file or test_source, + "test_name": test_node.get("name", "?"), + "test_type": test_node.get("type", "file"), + "tests_target": node_id.split(":")[-1] if ":" in node_id else node_id, + "strategy": edge.get("metadata", {}).get("strategy", "?"), + }) + + # Deduplicate by test file + test_files = sorted(set(t["test"] for t in affected_tests)) + + return { + "target": target, + "affected_test_files": test_files, + "affected_tests": affected_tests, + "total_test_files": len(test_files), + "total_tests": len(affected_tests), + "impact_summary": { + "affected_files": len(impact.get("affected_files", [])), + "affected_symbols": len(impact.get("affected_symbols", [])), + }, + } + + def get_test_coverage(self, file_path: str) -> dict: + """Show which tests cover a given file.""" + file_id = f"file:{file_path}" + tests = [] + seen = set() + + # Direct test edges to this file + for source_id, edge in self.incoming_by_type["tests"].get(file_id, []): + if source_id not in seen: + seen.add(source_id) + node = self.nodes_by_id.get(source_id, {}) + tests.append({ + "test": 
node.get("path") or node.get("file", source_id), + "name": node.get("name", "?"), + "type": node.get("type", "?"), + "strategy": edge.get("metadata", {}).get("strategy", "?"), + }) + + # Tests targeting symbols in this file + for node in self.nodes_by_file.get(file_path, []): + nid = node.get("id", "") + for source_id, edge in self.incoming_by_type["tests"].get(nid, []): + if source_id not in seen: + seen.add(source_id) + source_node = self.nodes_by_id.get(source_id, {}) + tests.append({ + "test": source_node.get("file", source_id), + "name": source_node.get("name", "?"), + "type": source_node.get("type", "?"), + "targets": node.get("name", "?"), + "strategy": edge.get("metadata", {}).get("strategy", "?"), + }) + + return { + "file": file_path, + "tests": tests, + "total": len(tests), + } + + +# ── Output Formatters ───────────────────────────────────────────── + +def format_symbol_result(node: dict) -> str: + parts = [] + sym_type = node.get("type", "?") + name = node.get("name", "?") + file_path = node.get("file") or node.get("path", "?") + line = node.get("line_start") or node.get("line", "?") + + header = f"{sym_type}: {name}" + if node.get("scope"): + header = f"{sym_type}: {node['scope']}.{name}" + parts.append(header) + parts.append(f" Location: {file_path}:{line}") + + if node.get("signature"): + parts.append(f" Signature: {name}{node['signature']}") + elif node.get("params"): + parts.append(f" Params: {', '.join(node['params'])}") + if node.get("bases"): + parts.append(f" Bases: {', '.join(node['bases'])}") + if node.get("decorators"): + parts.append(f" Decorators: {', '.join(node['decorators'])}") + if node.get("docstring") or node.get("doc"): + doc = node.get("docstring") or node.get("doc", "") + parts.append(f" Doc: {doc[:120]}") + if node.get("is_test"): + parts.append(f" [TEST]") + + return "\n".join(parts) + + +def format_deps(deps: dict) -> str: + lines = [] + if deps.get("imports"): + lines.append("Imports:") + for imp in sorted(deps["imports"]): + 
lines.append(f" -> {imp}") + if deps.get("inherits"): + lines.append("Inherits:") + for inh in deps["inherits"]: + lines.append(f" {inh['from']} extends {inh['to']} ({inh.get('file', '?')})") + if deps.get("calls"): + lines.append("Calls (cross-file):") + for call in deps["calls"][:20]: + lines.append(f" {call['caller']} -> {call['callee']} ({call.get('file', '?')})") + if len(deps.get("calls", [])) > 20: + lines.append(f" ... and {len(deps['calls']) - 20} more") + if not lines: + lines.append("No internal dependencies found.") + return "\n".join(lines) + + +def format_hierarchy(hierarchy: dict) -> str: + if "error" in hierarchy: + return hierarchy["error"] + + lines = [] + if hierarchy.get("ancestors"): + lines.append("Ancestors (inherits from):") + for a in hierarchy["ancestors"]: + indent = " " * a["depth"] + lines.append(f"{indent}<- {a['name']} ({a['file']})") + if hierarchy.get("descendants"): + lines.append("Descendants (inherited by):") + for d in hierarchy["descendants"]: + indent = " " * d["depth"] + lines.append(f"{indent}-> {d['name']} ({d['file']})") + if not lines: + lines.append("No inheritance relationships found.") + return "\n".join(lines) + + +def format_chunks(chunks: list[dict]) -> str: + lines = [] + for chunk in chunks: + if "error" in chunk: + lines.append(chunk["error"]) + continue + header = f"--- {chunk['type']}: {chunk['name']} ({chunk['file']}:{chunk['line_start']}-{chunk['line_end']}) ---" + lines.append(header) + lines.append(chunk["content"]) + lines.append("") + return "\n".join(lines) + + +def format_impact(impact: dict) -> str: + if "error" in impact: + return impact["error"] + + lines = [] + lines.append(f"Impact analysis for: {impact['target']}") + lines.append(f" Affected files: {impact['total_affected_files']}") + lines.append(f" Affected symbols: {impact['total_affected_symbols']}") + lines.append("") + + if impact.get("affected_files"): + lines.append("Affected files:") + for f in impact["affected_files"]: + lines.append(f" 
* {f}") + lines.append("") + + if impact.get("affected_symbols"): + # Group by depth + by_depth = defaultdict(list) + for sym in impact["affected_symbols"]: + by_depth[sym["depth"]].append(sym) + + for depth in sorted(by_depth.keys()): + label = "Direct" if depth == 1 else f"Depth {depth}" + lines.append(f"{label} dependencies:") + for sym in by_depth[depth]: + lines.append(f" {sym['type']:10s} {sym['name']:30s} {sym['file']}:{sym['line']} (via {sym['via']})") + lines.append("") + + return "\n".join(lines) + + +def format_test_impact(test_impact: dict) -> str: + if "error" in test_impact: + return test_impact["error"] + + lines = [] + lines.append(f"Test impact for: {test_impact['target']}") + lines.append(f" Test files to re-run: {test_impact['total_test_files']}") + lines.append(f" Total test mappings: {test_impact['total_tests']}") + lines.append(f" Impact scope: {test_impact['impact_summary']['directly_affected_files']} files, " + f"{test_impact['impact_summary']['directly_affected_symbols']} symbols") + lines.append("") + + if test_impact.get("affected_test_files"): + lines.append("Test files to re-run:") + for f in test_impact["affected_test_files"]: + lines.append(f" * {f}") + lines.append("") + + if test_impact.get("affected_tests"): + lines.append("Test details:") + for t in test_impact["affected_tests"]: + lines.append(f" {t['test_name']:30s} tests {t['tests_target']:20s} ({t['strategy']})") + lines.append("") + + return "\n".join(lines) + + +# ── Main ────────────────────────────────────────────────────────── + +def main(): + parser = argparse.ArgumentParser( + description="Query the code-tree knowledge graph with impact analysis", + formatter_class=argparse.RawDescriptionHelpFormatter, + ) + parser.add_argument( + "--graph-dir", default=None, + help="Directory containing graph.json (default: /docs/code-tree)", + ) + parser.add_argument( + "--repo-root", default=".", + help="Repository root directory (default: current directory)", + ) + 
parser.add_argument("--json", action="store_true", help="Output as JSON") + + group = parser.add_mutually_exclusive_group(required=True) + group.add_argument("--symbol", "-s", help="Find a symbol by name") + group.add_argument("--deps", "-d", help="Show dependencies of a file") + group.add_argument("--rdeps", "-r", help="Show reverse dependencies of a file") + group.add_argument("--hierarchy", "-H", help="Show inheritance hierarchy for a class") + group.add_argument("--search", "-S", help="Search symbols by keyword") + group.add_argument("--module", "-m", help="Show module (directory) overview") + group.add_argument("--entry-points", "-e", action="store_true", help="Find entry points") + group.add_argument("--chunks", "-c", help="Extract code chunks from a file") + group.add_argument("--callers", help="Find all callers of a symbol") + group.add_argument("--callees", help="Find all functions called by a symbol") + group.add_argument("--call-chain", nargs=2, metavar=("FROM", "TO"), + help="Find call paths between two symbols") + group.add_argument("--impact", "-I", help="Compute change impact for a file or symbol") + group.add_argument("--test-impact", "-T", help="Find tests affected by changes to a file or symbol") + group.add_argument("--test-coverage", help="Show test coverage for a file") + group.add_argument("--stats", action="store_true", help="Show graph statistics") + + parser.add_argument("--depth", type=int, default=5, + help="Max depth for impact/call-chain analysis (default: 5)") + + args = parser.parse_args() + repo_root = os.path.abspath(args.repo_root) + graph_dir = args.graph_dir or os.path.join(repo_root, "docs", "code-tree") + + gq = GraphQuery(graph_dir, repo_root) + + if args.symbol: + results = gq.find_symbol(args.symbol) + if args.json: + print(json.dumps(results, indent=2, default=str)) + elif results: + for r in results: + print(format_symbol_result(r)) + print() + else: + print(f"No symbols matching '{args.symbol}' found.") + + elif 
args.deps: + deps = gq.get_dependencies(args.deps) + if args.json: + print(json.dumps(deps, indent=2)) + else: + print(f"Dependencies of {args.deps}:\n") + print(format_deps(deps)) + + elif args.rdeps: + rdeps = gq.get_reverse_dependencies(args.rdeps) + if args.json: + print(json.dumps(rdeps, indent=2)) + else: + print(f"Files that import {args.rdeps}:\n") + for r in rdeps: + print(f" <- {r}") + if not rdeps: + print(" No internal importers found.") + + elif args.hierarchy: + hierarchy = gq.get_hierarchy(args.hierarchy) + if args.json: + print(json.dumps(hierarchy, indent=2)) + else: + print(f"Inheritance hierarchy for {args.hierarchy}:\n") + print(format_hierarchy(hierarchy)) + + elif args.search: + results = gq.search(args.search) + if args.json: + print(json.dumps(results, indent=2)) + else: + print(f"Search results for '{args.search}':\n") + for r in results: + test_marker = " [T]" if r.get("is_test") else "" + print(f" {r['type']:10s} {r['name']:30s} {r['file']}:{r['line']}{test_marker}") + if not results: + print(" No matches found.") + + elif args.module: + module_data = gq.get_module(args.module) + if args.json: + print(json.dumps(module_data, indent=2)) + else: + print(f"Module: {args.module}\n") + print(f"Files ({len(module_data['files'])}):") + for f in module_data["files"]: + print(f" {f}") + print(f"\nDepends on ({len(module_data['depends_on'])}):") + for d in module_data["depends_on"]: + print(f" -> {d}") + if module_data.get("tested_by"): + print(f"\nTested by ({len(module_data['tested_by'])}):") + for t in module_data["tested_by"]: + print(f" <- {t}") + print(f"\nSymbols ({len(module_data['symbols'])}):") + for s in module_data["symbols"][:30]: + test_marker = " [T]" if s.get("is_test") else "" + print(f" {s['type']:10s} {s['name']:30s} :{s['line']}{test_marker}") + if len(module_data["symbols"]) > 30: + print(f" ... 
and {len(module_data['symbols']) - 30} more") + + elif args.entry_points: + entries = gq.find_entry_points() + if args.json: + print(json.dumps(entries, indent=2)) + else: + print("Entry points (files not imported by others):\n") + for e in entries: + marker = " *" if e["priority"] >= 10 else "" + print(f" {e['file']}{marker} ({e['symbols']} symbols)") + + elif args.chunks: + chunks = gq.extract_chunks(args.chunks) + if args.json: + print(json.dumps(chunks, indent=2)) + else: + print(format_chunks(chunks)) + + elif args.callers: + callers = gq.get_callers(args.callers) + if args.json: + print(json.dumps(callers, indent=2)) + else: + print(f"Callers of '{args.callers}':\n") + if callers: + for c in callers: + print(f" {c['type']:10s} {c['caller']:30s} {c['file']}:{c['line']} (calls at :{c['call_line']})") + else: + print(" No callers found.") + + elif args.callees: + callees = gq.get_callees(args.callees) + if args.json: + print(json.dumps(callees, indent=2)) + else: + print(f"Functions called by '{args.callees}':\n") + if callees: + for c in callees: + print(f" -> {c['type']:10s} {c['callee']:30s} {c['file']}:{c['line']}") + else: + print(" No callees found.") + + elif args.call_chain: + from_name, to_name = args.call_chain + chains = gq.get_call_chain(from_name, to_name, args.depth) + if args.json: + # Convert node IDs to readable names + readable_chains = [] + for chain in chains: + readable = [] + for node_id in chain: + node = gq.nodes_by_id.get(node_id, {}) + readable.append({ + "name": node.get("name", node_id), + "type": node.get("type", "?"), + "file": node.get("file", "?"), + }) + readable_chains.append(readable) + print(json.dumps(readable_chains, indent=2)) + else: + print(f"Call chains from '{from_name}' to '{to_name}':\n") + if chains: + for i, chain in enumerate(chains): + names = [] + for node_id in chain: + node = gq.nodes_by_id.get(node_id, {}) + names.append(node.get("name", node_id.split(":")[-1])) + print(f" Chain {i + 1}: {' -> '.join(names)}") 
+ else: + print(" No call chains found.") + + elif args.impact: + impact = gq.get_impact(args.impact, args.depth) + if args.json: + print(json.dumps(impact, indent=2)) + else: + print(format_impact(impact)) + + elif args.test_impact: + test_impact = gq.get_test_impact(args.test_impact, args.depth) + if args.json: + print(json.dumps(test_impact, indent=2)) + else: + print(format_test_impact(test_impact)) + + elif args.test_coverage: + coverage = gq.get_test_coverage(args.test_coverage) + if args.json: + print(json.dumps(coverage, indent=2)) + else: + print(f"Test coverage for {args.test_coverage}:\n") + if coverage["tests"]: + for t in coverage["tests"]: + target_str = f" -> {t['targets']}" if t.get("targets") else "" + print(f" {t['name']:30s} ({t['type']}) in {t['test']}{target_str}") + else: + print(" No tests found for this file.") + + elif args.stats: + meta = gq.graph.get("metadata", {}) + if args.json: + # Don't include file_hashes in stats output (too verbose) + stats_meta = {k: v for k, v in meta.items() if k != "file_hashes"} + print(json.dumps(stats_meta, indent=2)) + else: + print("Knowledge Graph Statistics:\n") + print(f" Files: {meta.get('total_files', '?')}") + print(f" Symbols: {meta.get('total_symbols', '?')}") + print(f" Languages: {meta.get('languages', {})}") + print(f" Symbol types: {meta.get('symbol_types', {})}") + print(f" Edge types: {meta.get('edge_types', {})}") + print(f" Generated: {meta.get('generated_at', '?')}") + if meta.get("module_boundaries"): + print(f"\n Build modules:") + for mb in meta["module_boundaries"]: + name = mb.get("name", mb.get("build_file", "?")) + print(f" {name} ({mb.get('build_file', '?')}) in {mb.get('root_dir', '.')}") + + +if __name__ == "__main__": + main() diff --git a/tools/code-tree/test_query_graph.py b/tools/code-tree/test_query_graph.py new file mode 100644 index 000000000..ffb367edf --- /dev/null +++ b/tools/code-tree/test_query_graph.py @@ -0,0 +1,321 @@ +#!/usr/bin/env python3 +"""Unit tests for 
query_graph.py — covers start-ID resolution, BFS depth +limiting, edge-type filtering, and output determinism.""" + +import json +import os +import tempfile +import unittest + +from query_graph import GraphQuery + + +def _build_test_graph(): + """Build a minimal but representative graph for testing.""" + return { + "metadata": { + "repo_root": "/fake", + "total_files": 5, + "total_symbols": 8, + "languages": {"python": 3, "go": 2}, + "symbol_types": {"function": 4, "class": 2, "method": 2}, + "edge_types": {"imports": 4, "calls": 3, "tests": 2, "inherits": 1, "contains": 5}, + "generated_at": "2026-01-01T00:00:00", + }, + "nodes": [ + {"id": "file:src/a.py", "type": "file", "name": "a.py", "path": "src/a.py", "language": "python"}, + {"id": "file:src/b.py", "type": "file", "name": "b.py", "path": "src/b.py", "language": "python"}, + {"id": "file:src/c.py", "type": "file", "name": "c.py", "path": "src/c.py", "language": "python"}, + {"id": "file:tests/test_a.py", "type": "file", "name": "test_a.py", "path": "tests/test_a.py", "language": "python", "is_test": True}, + {"id": "file:pkg/d.go", "type": "file", "name": "d.go", "path": "pkg/d.go", "language": "go"}, + {"id": "function:src/a.py:foo", "type": "function", "name": "foo", "file": "src/a.py", "language": "python", "line_start": 10, "line_end": 20}, + {"id": "function:src/b.py:bar", "type": "function", "name": "bar", "file": "src/b.py", "language": "python", "line_start": 5, "line_end": 15}, + {"id": "function:src/c.py:baz", "type": "function", "name": "baz", "file": "src/c.py", "language": "python", "line_start": 1, "line_end": 8}, + {"id": "class:src/a.py:Base", "type": "class", "name": "Base", "file": "src/a.py", "language": "python", "line_start": 25, "line_end": 50}, + {"id": "class:src/b.py:Child", "type": "class", "name": "Child", "file": "src/b.py", "language": "python", "line_start": 20, "line_end": 40, "bases": ["Base"]}, + {"id": "method:src/b.py:Child.process", "type": "method", "name": "process", 
"file": "src/b.py", "language": "python", "line_start": 25, "line_end": 35, "scope": "Child"}, + {"id": "function:pkg/d.go:Handle", "type": "function", "name": "Handle", "file": "pkg/d.go", "language": "go", "line_start": 10, "line_end": 30}, + ], + "edges": [ + # imports: b imports a, c imports b, d imports a + {"source": "file:src/b.py", "target": "file:src/a.py", "type": "imports"}, + {"source": "file:src/c.py", "target": "file:src/b.py", "type": "imports"}, + {"source": "file:pkg/d.go", "target": "file:src/a.py", "type": "imports"}, + # calls: bar -> foo, baz -> bar, Handle -> foo + {"source": "function:src/b.py:bar", "target": "function:src/a.py:foo", "type": "calls", "metadata": {"line": 8}}, + {"source": "function:src/c.py:baz", "target": "function:src/b.py:bar", "type": "calls", "metadata": {"line": 3}}, + {"source": "function:pkg/d.go:Handle", "target": "function:src/a.py:foo", "type": "calls", "metadata": {"line": 15}}, + # inherits: Child -> Base + {"source": "class:src/b.py:Child", "target": "class:src/a.py:Base", "type": "inherits"}, + # tests: test_a tests a.py and foo + {"source": "file:tests/test_a.py", "target": "file:src/a.py", "type": "tests", "metadata": {"strategy": "path_mirror"}}, + {"source": "file:tests/test_a.py", "target": "function:src/a.py:foo", "type": "tests", "metadata": {"strategy": "name_match"}}, + # contains + {"source": "file:src/a.py", "target": "function:src/a.py:foo", "type": "contains"}, + {"source": "file:src/a.py", "target": "class:src/a.py:Base", "type": "contains"}, + {"source": "file:src/b.py", "target": "function:src/b.py:bar", "type": "contains"}, + {"source": "file:src/b.py", "target": "class:src/b.py:Child", "type": "contains"}, + {"source": "class:src/b.py:Child", "target": "method:src/b.py:Child.process", "type": "contains"}, + ], + } + + +def _build_test_tags(): + """Build tags matching the test graph.""" + return [ + {"name": "foo", "type": "function", "file": "src/a.py", "line": 10, "end_line": 20, "language": 
"python"}, + {"name": "bar", "type": "function", "file": "src/b.py", "line": 5, "end_line": 15, "language": "python"}, + {"name": "baz", "type": "function", "file": "src/c.py", "line": 1, "end_line": 8, "language": "python"}, + {"name": "Base", "type": "class", "file": "src/a.py", "line": 25, "end_line": 50, "language": "python"}, + {"name": "Child", "type": "class", "file": "src/b.py", "line": 20, "end_line": 40, "language": "python", "bases": ["Base"]}, + {"name": "process", "type": "method", "file": "src/b.py", "line": 25, "end_line": 35, "language": "python", "scope": "Child"}, + {"name": "Handle", "type": "function", "file": "pkg/d.go", "line": 10, "end_line": 30, "language": "go"}, + ] + + +class TestGraphQuery(unittest.TestCase): + """Tests for GraphQuery using an in-memory test graph.""" + + @classmethod + def setUpClass(cls): + """Write test graph artifacts to a temp directory.""" + cls._tmpdir_obj = tempfile.TemporaryDirectory() + cls.tmpdir = cls._tmpdir_obj.name + with open(os.path.join(cls.tmpdir, "graph.json"), "w", encoding="utf-8") as f: + json.dump(_build_test_graph(), f) + with open(os.path.join(cls.tmpdir, "tags.json"), "w", encoding="utf-8") as f: + json.dump(_build_test_tags(), f) + with open(os.path.join(cls.tmpdir, "modules.json"), "w", encoding="utf-8") as f: + json.dump({"modules": { + "src": {"files": ["src/a.py", "src/b.py", "src/c.py"], "depends_on": []}, + "pkg": {"files": ["pkg/d.go"], "depends_on": ["src"]}, + }}, f) + cls.gq = GraphQuery(cls.tmpdir, "/fake") + + @classmethod + def tearDownClass(cls): + """Clean up the temporary directory.""" + cls._tmpdir_obj.cleanup() + + # ── Start-ID resolution ────────────────────────────────────── + + def test_impact_resolves_file_path(self): + """Impact analysis should resolve a file path to its file node and contained symbols.""" + impact = self.gq.get_impact("src/a.py", max_depth=1) + self.assertNotIn("error", impact) + # b.py imports a.py, d.go imports a.py + self.assertIn("src/b.py", 
impact["affected_files"]) + self.assertIn("pkg/d.go", impact["affected_files"]) + + def test_impact_resolves_symbol_name(self): + """Impact analysis should resolve a symbol name and find its dependents.""" + impact = self.gq.get_impact("foo", max_depth=1) + self.assertNotIn("error", impact) + # bar calls foo, Handle calls foo + affected_names = [s["name"] for s in impact["affected_symbols"]] + self.assertIn("bar", affected_names) + self.assertIn("Handle", affected_names) + + def test_impact_unknown_target_returns_error(self): + """Impact analysis on a nonexistent target should return an error.""" + impact = self.gq.get_impact("nonexistent_symbol_xyz") + self.assertIn("error", impact) + + # ── BFS depth limiting ─────────────────────────────────────── + + def test_impact_depth_1_excludes_transitive(self): + """Depth 1 should find direct dependents only, not transitive ones.""" + impact = self.gq.get_impact("src/a.py", max_depth=1) + # c.py depends on b.py which depends on a.py — transitive, should NOT appear at depth 1 + self.assertNotIn("src/c.py", impact["affected_files"]) + + def test_impact_depth_2_includes_transitive(self): + """Depth 2 should find transitive dependents through one hop.""" + impact = self.gq.get_impact("src/a.py", max_depth=2) + # c.py imports b.py which imports a.py — should appear at depth 2 + self.assertIn("src/c.py", impact["affected_files"]) + + def test_impact_depth_0_returns_no_affected(self): + """Depth 0 should return only origin nodes, no affected items.""" + impact = self.gq.get_impact("src/a.py", max_depth=0) + self.assertEqual(impact["total_affected_files"], 0) + self.assertEqual(impact["total_affected_symbols"], 0) + + # ── Edge-type filtering ────────────────────────────────────── + + def test_impact_follows_imports(self): + """Impact BFS should follow import edges.""" + impact = self.gq.get_impact("src/a.py", max_depth=1) + self.assertIn("src/b.py", impact["affected_files"]) + + def test_impact_follows_calls(self): + """Impact 
BFS should follow call edges."""
+        impact = self.gq.get_impact("foo", max_depth=1)
+        affected_names = [s["name"] for s in impact["affected_symbols"]]
+        self.assertIn("bar", affected_names)
+
+    def test_impact_follows_inherits(self):
+        """Impact BFS should follow inheritance edges."""
+        impact = self.gq.get_impact("Base", max_depth=1)
+        affected_names = [s["name"] for s in impact["affected_symbols"]]
+        self.assertIn("Child", affected_names)
+
+    def test_impact_does_not_follow_contains(self):
+        """Impact BFS should not follow 'contains' edges (not a dependency)."""
+        impact = self.gq.get_impact("foo", max_depth=5)
+        # The 'contains' edge file:src/a.py -> function:foo must not pull in
+        # a.py's other members: Base lives in a.py but does not depend on foo.
+        affected_names = [s["name"] for s in impact["affected_symbols"]]
+        self.assertNotIn("Base", affected_names)
+
+    def test_impact_does_not_follow_tests(self):
+        """Impact BFS should not follow 'tests' edges as dependency edges."""
+        impact = self.gq.get_impact("src/a.py", max_depth=1)
+        # tests/test_a.py reaches a.py only through a "tests" edge, never an
+        # "imports" edge, so it must not be reported as an affected file.
+        self.assertNotIn("tests/test_a.py", impact["affected_files"])
+
+    # ── Test impact ──────────────────────────────────────────────
+
+    def test_test_impact_finds_test_files(self):
+        """Test impact should find tests for affected code."""
+        ti = self.gq.get_test_impact("src/a.py")
+        self.assertNotIn("error", ti)
+        self.assertIn("tests/test_a.py", ti["affected_test_files"])
+
+    def test_test_impact_by_symbol(self):
+        """Test impact should work with symbol names."""
+        ti = 
self.gq.get_test_impact("foo") + self.assertNotIn("error", ti) + self.assertGreaterEqual(ti["total_test_files"], 1) + + # ── Symbol lookup ──────────────────────────────────────────── + + def test_find_symbol_exact(self): + """Exact symbol lookup should return matching nodes.""" + results = self.gq.find_symbol("foo") + self.assertTrue(len(results) >= 1) + self.assertEqual(results[0]["name"], "foo") + + def test_find_symbol_partial(self): + """Partial symbol lookup should work as fallback.""" + results = self.gq.find_symbol("Handl") + self.assertTrue(len(results) >= 1) + names = [r["name"] for r in results] + self.assertIn("Handle", names) + + def test_find_symbol_not_found(self): + """Missing symbol should return empty list.""" + results = self.gq.find_symbol("totally_nonexistent_xyz_123") + self.assertEqual(len(results), 0) + + # ── Dependency queries ─────────────────────────────────────── + + def test_get_dependencies(self): + """get_dependencies should return files imported by a given file.""" + deps = self.gq.get_dependencies("src/b.py") + self.assertIn("src/a.py", deps["imports"]) + + def test_get_reverse_dependencies(self): + """get_reverse_dependencies should return files that import a given file.""" + rdeps = self.gq.get_reverse_dependencies("src/a.py") + self.assertIn("src/b.py", rdeps) + self.assertIn("pkg/d.go", rdeps) + + # ── Call graph ─────────────────────────────────────────────── + + def test_get_callers(self): + """get_callers should find functions that call a given symbol.""" + callers = self.gq.get_callers("foo") + caller_names = [c["caller"] for c in callers] + self.assertIn("bar", caller_names) + self.assertIn("Handle", caller_names) + + def test_get_callees(self): + """get_callees should find functions called by a given symbol.""" + callees = self.gq.get_callees("bar") + callee_names = [c["callee"] for c in callees] + self.assertIn("foo", callee_names) + + def test_call_chain(self): + """get_call_chain should find paths between symbols.""" + 
chains = self.gq.get_call_chain("baz", "foo") + self.assertTrue(len(chains) >= 1) + # baz -> bar -> foo + self.assertEqual(len(chains[0]), 3) + + # ── Hierarchy ──────────────────────────────────────────────── + + def test_hierarchy_descendants(self): + """Hierarchy should show classes that inherit from a base.""" + h = self.gq.get_hierarchy("Base") + descendant_names = [d["name"] for d in h["descendants"]] + self.assertIn("Child", descendant_names) + + def test_hierarchy_ancestors(self): + """Hierarchy should show parent classes.""" + h = self.gq.get_hierarchy("Child") + ancestor_names = [a["name"] for a in h["ancestors"]] + self.assertIn("Base", ancestor_names) + + # ── Output determinism ─────────────────────────────────────── + + def test_reverse_deps_sorted(self): + """Reverse dependencies should be returned in sorted order.""" + rdeps = self.gq.get_reverse_dependencies("src/a.py") + self.assertEqual(rdeps, sorted(rdeps)) + + def test_impact_files_sorted(self): + """Impact affected_files should be returned in sorted order.""" + impact = self.gq.get_impact("src/a.py", max_depth=3) + self.assertEqual(impact["affected_files"], sorted(impact["affected_files"])) + + def test_impact_symbols_sorted_by_depth(self): + """Impact affected_symbols should be sorted by depth.""" + impact = self.gq.get_impact("src/a.py", max_depth=3) + depths = [s["depth"] for s in impact["affected_symbols"]] + self.assertEqual(depths, sorted(depths)) + + def test_callers_sorted_by_file_and_line(self): + """Callers should be sorted by file then line for deterministic output.""" + callers = self.gq.get_callers("foo") + keys = [(str(c.get("file", "")), c.get("line", 0)) for c in callers] + self.assertEqual(keys, sorted(keys)) + + # ── Test coverage ──────────────────────────────────────────── + + def test_get_test_coverage(self): + """get_test_coverage should find tests for a file.""" + cov = self.gq.get_test_coverage("src/a.py") + self.assertGreaterEqual(cov["total"], 1) + test_files = 
[t["test"] for t in cov["tests"]] + self.assertTrue(any("test_a" in t for t in test_files)) + + # ── Search ─────────────────────────────────────────────────── + + def test_search_by_name(self): + """Search should find symbols by name.""" + results = self.gq.search("bar") + names = [r["name"] for r in results] + self.assertIn("bar", names) + + def test_search_by_file_path(self): + """Search should match on file paths.""" + results = self.gq.search("pkg/d") + self.assertTrue(len(results) >= 1) + + # ── Stats ──────────────────────────────────────────────────── + + def test_graph_metadata_present(self): + """Graph metadata should be accessible.""" + meta = self.gq.graph.get("metadata", {}) + self.assertEqual(meta["total_files"], 5) + self.assertEqual(meta["total_symbols"], 8) + + +if __name__ == "__main__": + unittest.main()
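The impact semantics these tests pin down (a reverse BFS over `imports`/`calls`/`inherits` edges, deliberately skipping `contains` and `tests` edges, capped at `max_depth`) can be illustrated standalone. The sketch below is a simplified, hypothetical reimplementation for illustration only; `impact_bfs`, its edge-tuple format, and the sample edges are assumptions of this sketch, not part of `query_graph.py`.

```python
from collections import deque

# Edge types the impact BFS follows; "contains" and "tests" are
# deliberately excluded, mirroring the tests in this PR.
DEP_EDGES = {"imports", "calls", "inherits"}

def impact_bfs(edges, origin, max_depth):
    """Reverse BFS: which nodes are affected if `origin` changes?

    `edges` is a list of (source, target, edge_type) tuples where
    source depends on target. Returns {affected_node: depth}.
    """
    # Reverse adjacency: target -> [sources that depend on it].
    rev = {}
    for src, dst, etype in edges:
        if etype in DEP_EDGES:
            rev.setdefault(dst, []).append(src)

    seen, out = {origin}, {}
    queue = deque([(origin, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # depth cap: do not expand further
        for dependent in rev.get(node, []):
            if dependent not in seen:
                seen.add(dependent)
                out[dependent] = depth + 1  # shortest depth wins (BFS order)
                queue.append((dependent, depth + 1))
    return out

edges = [
    ("b.py", "a.py", "imports"),
    ("c.py", "b.py", "imports"),
    ("test_a.py", "a.py", "tests"),  # not a dependency edge: ignored
]
print(impact_bfs(edges, "a.py", 1))  # {'b.py': 1} -- depth 1 excludes c.py
print(impact_bfs(edges, "a.py", 2))  # {'b.py': 1, 'c.py': 2}
```

This mirrors `test_impact_depth_1_excludes_transitive`, `test_impact_depth_2_includes_transitive`, and `test_impact_does_not_follow_tests` above: the depth cap bounds the blast radius, and the edge-type whitelist keeps test files out of the dependency direction.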