diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 0000000..88efb03 --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,141 @@ +# Owlex - Claude Code Configuration + +## Overview + +Owlex is an MCP server that provides access to multiple AI agents (Codex, Gemini, OpenCode) from within Claude Code. It enables: + +1. **Council Deliberation** - Get multiple AI perspectives with optional revision rounds +2. **Liza Peer Review** - Claude implements, external agents review with binding verdicts +3. **Individual Sessions** - Direct agent access with session persistence + +## Quick Reference + +### Council (Consultation) + +``` +/council "Your question here" +``` + +Or with tools: +```python +council_ask( + prompt="Your question", + team="security_audit", # or roles=["security", "perf", "skeptic"] + deliberate=True, + critique=True, +) +``` + +### Liza (Peer-Reviewed Implementation) + +``` +/liza "Implement feature X" +``` + +**Flow:** +1. `liza_start` - Create task +2. Claude implements (Write/Edit/Bash) +3. `liza_submit` - Send for review +4. If REJECT → fix and resubmit +5. If ALL APPROVE → done + +**Tools:** +| Tool | Purpose | +|------|---------| +| `liza_start` | Create implementation task | +| `liza_submit` | Submit for review | +| `liza_status` | Check task status | +| `liza_feedback` | Get reviewer feedback | + +**Blackboard:** `.owlex/liza-state.yaml` + +### Individual Agents + +```python +# Codex +start_codex_session(prompt="...") +resume_codex_session(prompt="...", session_id="...") + +# Gemini +start_gemini_session(prompt="...") +resume_gemini_session(prompt="...", session_ref="latest") + +# OpenCode +start_opencode_session(prompt="...") +resume_opencode_session(prompt="...") +``` + +## When to Use What + +| Scenario | Tool | +|----------|------| +| Need multiple opinions | `/council` | +| Production code with review | `/liza` | +| Deep code analysis | `start_codex_session` | +| Large codebase (1M tokens) | `start_gemini_session` | +| Alternative perspective | `start_opencode_session` | +| Architecture decisions | `council_ask team="architecture_review"` | +| Security review | `council_ask team="security_audit"` | + +## Liza Architecture + +``` +Claude Code = Orchestrator + Coder (trusted) +Codex/Gemini/OpenCode = Reviewers (external validation) +``` + +**Key Principles:** +- Claude implements, cannot self-approve +- Reviewers provide binding APPROVE/REJECT verdicts +- Multiple reviewers catch different issues +- Iterate until all approve or max iterations + +**Contract Rules (from Liza):** +- No fabrication - don't claim to have done something you haven't +- No test corruption - never modify tests to make failing code pass +- No scope creep - only implement what was requested +- Be honest about assumptions and limitations + +## Configuration + +Environment variables: +```bash +COUNCIL_EXCLUDE_AGENTS="" # Skip agents: "opencode,gemini" +COUNCIL_DEFAULT_TEAM="" # Default team preset +OWLEX_DEFAULT_TIMEOUT=300 # Timeout in seconds +CODEX_BYPASS_APPROVALS=false # Bypass sandbox +GEMINI_YOLO_MODE=false # Auto-approve actions +``` + +## Development + +```bash +# Run tests +pytest tests/test_liza.py -v + +# Install locally +pip install -e . 
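+
+# Optional: smoke-test the Grok runner added in this PR
+# (assumes the package installed cleanly)
+python -c "from owlex.agents import GrokRunner; print('OK')"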
+ +# Check module +python -c "from owlex.liza import LizaOrchestrator; print('OK')" +``` + +## File Structure + +``` +owlex/ +├── server.py # MCP server with all tools +├── council.py # Council deliberation logic +├── engine.py # Agent execution engine +├── liza/ # Liza peer-review system +│ ├── blackboard.py # State management +│ ├── contracts.py # Behavioral rules +│ ├── orchestrator.py # Review loop coordination +│ └── protocol.py # Verdict parsing +├── skills/ # Slash command skills +│ ├── liza/ # /liza skill +│ └── council/ # /council skill +└── commands/ # Command documentation + ├── liza.md + └── council.md +``` diff --git a/README.md b/README.md index d0696e1..82f2297 100644 --- a/README.md +++ b/README.md @@ -1,15 +1,15 @@ # Owlex -[![Version](https://img.shields.io/github/v/release/agentic-mcp-tools/owlex)](https://github.com/agentic-mcp-tools/owlex/releases) +[![Version](https://img.shields.io/github/v/release/jonastbrg/owlex)](https://github.com/jonastbrg/owlex/releases) [![License](https://img.shields.io/badge/license-MIT-green)](LICENSE) [![Python](https://img.shields.io/badge/python-3.10+-blue)](https://python.org) [![MCP](https://img.shields.io/badge/MCP-compatible-purple)](https://modelcontextprotocol.io) -**Get a second opinion without leaving Claude Code.** +**Multi-agent AI orchestration for Claude Code.** -Different AI models have different strengths and blind spots. Owlex lets you query Codex, Gemini, and OpenCode directly from Claude Code - and optionally run a structured deliberation where they review each other's answers before Claude synthesizes a final response. +Owlex provides access to 5 AI agents (Codex, Gemini, OpenCode, Grok, ClaudeOR) directly from Claude Code. Use them for council deliberation, peer-reviewed coding, or individual sessions with session persistence. -![Council demo](media/owlex_demo.gif) +![Async council demo](media/owlex_async_demo.gif) ## How the Council Works @@ -19,14 +19,21 @@ Different AI models have different strengths and blind spots. Owlex lets you que Use it for architecture decisions, debugging tricky issues, or when you want more confidence than a single model provides. Not for every question - for the ones that matter. 
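+
+A minimal call looks like this (a sketch using the `council_ask` tool
+documented below; every parameter except `prompt` is optional):
+
+```python
+council_ask(
+    prompt="Should we use JWT or server-side sessions for auth?",
+    deliberate=True,  # Round 2: agents revise after seeing each other's answers
+)
+```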
-### Data Flow +## Capabilities at a Glance -![Council data flow](media/council_flow.gif) +| Feature | Description | +|---------|-------------| +| **Council** | Multi-agent deliberation with 2-round revision | +| **Liza** | Peer-supervised coding with binding review verdicts | +| **5 Agents** | Codex, Gemini, OpenCode, Grok, ClaudeOR | +| **Roles/Teams** | Assign specialist perspectives (security, perf, skeptic) | +| **Session Persistence** | Resume conversations across sessions | +| **Async Tasks** | Background execution with status polling | ## Installation ```bash -uv tool install git+https://github.com/agentic-mcp-tools/owlex.git +uv tool install git+https://github.com/jonastbrg/owlex.git ``` Add to `.mcp.json`: @@ -53,50 +60,29 @@ Options: - `claude_opinion` - Share your initial thinking with agents - `deliberate` - Enable Round 2 revision (default: true) - `critique` - Agents critique each other instead of revise -- `roles` - Assign specialist roles (dict or list) -- `team` - Use a predefined team preset +- `roles` - Assign specialist roles: `{"codex": "security", "gemini": "perf"}` +- `team` - Use preset: `security_audit`, `code_review`, `architecture_review`, `devil_advocate`, `balanced` - `timeout` - Timeout per agent in seconds (default: 300) -### Specialist Roles - -Agents can operate with specialist perspectives that shape their analysis: - -| Role | Description | -|------|-------------| -| `security` | Security analyst - vulnerabilities, auth, data protection | -| `perf` | Performance optimizer - efficiency, caching, scalability | -| `skeptic` | Devil's advocate - challenge assumptions, find edge cases | -| `architect` | System architect - design patterns, modularity, APIs | -| `maintainer` | Code maintainer - readability, testing, tech debt | -| `dx` | Developer experience - ergonomics, documentation, errors | -| `testing` | Testing specialist - coverage, strategies, edge cases | -| `neutral` | No role injection (default) | - -**Assign roles explicitly:** -``` -council_ask prompt="Review this auth flow" roles={"codex": "security", "gemini": "perf"} -``` - -**Auto-assign from list (in agent order: codex, gemini, opencode):** -``` -council_ask prompt="Review this code" roles=["security", "skeptic", "maintainer"] -``` - -### Team Presets - -Predefined role combinations for common scenarios: - -| Team | Codex | Gemini | OpenCode | ClaudeOR | -|------|-------|--------|----------|----------| -| `security_audit` | security | skeptic | architect | dx | -| `code_review` | maintainer | perf | testing | dx | -| `architecture_review` | architect | perf | maintainer | dx | -| `devil_advocate` | skeptic | skeptic | skeptic | skeptic | -| `balanced` | security | perf | maintainer | dx | -| `optimal` | maintainer | architect | dx | skeptic | - -``` -council_ask prompt="Is this design secure?" 
team="security_audit" +### Roles & Teams + +Assign specialist perspectives for focused deliberation: + +**Built-in Roles:** +| Role | Focus | +|------|-------| +| `security` | Vulnerabilities, attack vectors, hardening | +| `perf` | Performance, scalability, optimization | +| `skeptic` | Find flaws, challenge assumptions | +| `architect` | Design patterns, maintainability | +| `maintainer` | Technical debt, long-term costs | +| `dx` | Developer experience, usability | +| `testing` | Test coverage, edge cases | + +**Team Presets:** +```python +council_ask(prompt="Review this auth flow", team="security_audit") +council_ask(prompt="Architecture options?", team="architecture_review") ``` ### Individual Agent Sessions @@ -108,20 +94,11 @@ council_ask prompt="Is this design secure?" team="security_audit" | `start_gemini_session` | New Gemini session | | `resume_gemini_session` | Resume with index or `latest` | | `start_opencode_session` | New OpenCode session | -| `resume_opencode_session` | Resume with session ID or `--continue` | -| `start_claudeor_session` | New Claude via OpenRouter session | -| `resume_claudeor_session` | Resume with session ID or `--continue` | - -### Claude Code Skills - -Non-blocking slash commands for quick agent invocation: - -| Skill | Description | -|-------|-------------| -| `/codex` | Ask Codex a question | -| `/gemini` | Ask Gemini a question | -| `/council` | Run council deliberation | -| `/critique` | Run council in critique mode | +| `resume_opencode_session` | Resume previous session | +| `start_grok_session` | New Grok session (reasoning or coding model) | +| `resume_grok_session` | Resume previous Grok session | +| `start_claudeor_session` | New ClaudeOR session (OpenRouter backend) | +| `resume_claudeor_session` | Resume previous ClaudeOR session | ### Async Task Management @@ -138,15 +115,16 @@ Council runs in the background. Start a query, keep working, check results later | Variable | Default | Description | |----------|---------|-------------| -| `COUNCIL_EXCLUDE_AGENTS` | `` | Skip agents (e.g., `opencode,gemini,claudeor`) | -| `COUNCIL_DEFAULT_TEAM` | `` | Default team when none specified (empty = neutral) | -| `COUNCIL_CLAUDE_OPINION` | `false` | Claude shares its opinion with agents by default | +| `COUNCIL_EXCLUDE_AGENTS` | `` | Skip agents (e.g., `opencode,gemini,grok`) | +| `COUNCIL_DEFAULT_TEAM` | `` | Default team preset for council | | `OWLEX_DEFAULT_TIMEOUT` | `300` | Timeout in seconds | | `CODEX_BYPASS_APPROVALS` | `false` | Bypass sandbox (use with caution) | | `GEMINI_YOLO_MODE` | `false` | Auto-approve Gemini actions | | `OPENCODE_AGENT` | `plan` | `plan` (read-only) or `build` | -| `OPENROUTER_API_KEY` | `` | OpenRouter API key (enables ClaudeOR agent) | -| `CLAUDEOR_MODEL` | `` | OpenRouter model for ClaudeOR (e.g., `deepseek/deepseek-v3.2`) | +| `GROK_MODEL` | `xai/grok-4-1-fast-reasoning` | Grok reasoning model | +| `GROK_CODE_MODEL` | `xai/grok-code-fast-1` | Grok coding model | +| `XAI_API_KEY` | `` | xAI API key for Grok | +| `CLAUDEOR_MODEL` | `` | Model for ClaudeOR (OpenRouter) | ## Cost Notes @@ -155,12 +133,89 @@ Council runs in the background. 
Start a query, keep working, check results later - Exclude agents with `COUNCIL_EXCLUDE_AGENTS` to control costs - Use council for important decisions, not every question +## Liza: Peer-Supervised Coding + +**External validation for production-quality code.** + +Liza implements a peer-review loop where Claude (the coder) implements tasks and external agents (Codex, Gemini, Grok) review with binding verdicts. Based on [liza-mas/liza](https://github.com/liza-mas/liza). + +### Architecture + +``` +┌─────────────────────────────────────────────────┐ +│ Claude Code (Coder + Orchestrator) │ +└─────────────────────────────────────────────────┘ + │ + [implementation] + │ + ┌─────────────┼─────────────┐ + ▼ ▼ ▼ + ┌─────────┐ ┌─────────┐ ┌─────────┐ + │ Codex │ │ Gemini │ │ Grok │ + │(Reviewer│ │(Reviewer│ │(Reviewer│ + └─────────┘ └─────────┘ └─────────┘ +``` + +### Workflow + +``` +1. liza_start("Add rate limiting") → task_id +2. Claude implements (Write/Edit/Bash) +3. liza_submit(task_id, summary) → reviewers examine + - Codex: REJECT - "Missing IP-based limiting" + - Gemini: APPROVE +4. Claude fixes based on feedback +5. liza_submit again → all APPROVE +6. Done! ✅ +``` + +### MCP Tools + +| Tool | Description | +|------|-------------| +| `liza_start` | Create a task for Claude to implement | +| `liza_submit` | Submit implementation for review | +| `liza_status` | Check task status | +| `liza_feedback` | Get feedback to address | + +### Key Principles + +- **External validation**: Claude cannot self-approve +- **Critique mode**: Reviewers actively find bugs, security issues, edge cases +- **Multi-reviewer**: Different agents catch different issues +- **Iteration**: Loop until all reviewers approve or max iterations + +--- + ## When to Use Each Agent | Agent | Strengths | |-------|-----------| -| **Codex (gpt5.2-codex)** | Deep reasoning, code review, bug finding | +| **Codex** | Deep reasoning, code review, bug finding, PRD writing | | **Gemini** | 1M context window, multimodal, large codebases | -| **OpenCode** | Alternative perspective, configurable models | -| **ClaudeOR** | Claude Code + OpenRouter (DeepSeek, GPT-4o, etc.) | -| **Claude** | Complex multi-step implementation, synthesis | +| **OpenCode** | Alternative perspective, configurable backend models | +| **Grok** | Contrarian perspective, reasoning model + coding model | +| **ClaudeOR** | Claude via OpenRouter, alternative model access | +| **Claude** | Complex multi-step implementation, synthesis, orchestration | + +## Architecture + +``` +┌─────────────────────────────────────────────────────────────┐ +│ Claude Code (MCP Client) │ +└─────────────────────────────────────────────────────────────┘ + │ + │ MCP Protocol + ▼ +┌─────────────────────────────────────────────────────────────┐ +│ Owlex MCP Server │ +├─────────────────────────────────────────────────────────────┤ +│ Council │ Liza │ Individual │ +│ Deliberation │ Peer Review │ Sessions │ +├───────────────────┴──────────────────────────────────────────┤ +│ Agent Engine │ +├──────┬──────┬──────────┬──────────┬────────────────────────┤ +│Codex │Gemini│ OpenCode │ Grok │ ClaudeOR │ +│ CLI │ CLI │ CLI │ (xAI) │ (OpenRouter) │ +└──────┴──────┴──────────┴──────────┴────────────────────────┘ +``` diff --git a/commands/liza.md b/commands/liza.md new file mode 100644 index 0000000..7025f42 --- /dev/null +++ b/commands/liza.md @@ -0,0 +1,50 @@ +# /liza - Peer-Reviewed Coding with External Validation + +Start a Liza session where Claude implements and Codex/Gemini review. 
+ +## Usage + +``` +/liza +``` + +## Examples + +``` +/liza Add a login endpoint with rate limiting +/liza Fix the bug in the authentication flow +/liza Implement user profile page with avatar upload +``` + +## How It Works + +**Architecture:** +- **Claude** = Coder (trusted, actually writes code) +- **Codex/Gemini** = Reviewers (examine and provide binding verdicts) + +**Flow:** +1. `/liza` creates a task → Claude implements +2. Claude submits for review → Codex and Gemini examine +3. If REJECT: Claude fixes based on feedback, resubmits +4. If ALL APPROVE: Done! + +## Key Principles (from Liza) + +- **External validation**: Claude cannot self-approve; reviewers provide binding verdicts +- **Critique mode**: Reviewers actively look for bugs, security issues, edge cases +- **Iteration until approved**: Loop continues until all reviewers approve or max iterations +- **Merged feedback**: Different reviewers catch different issues + +## MCP Tools + +| Tool | Purpose | +|------|---------| +| `liza_start` | Create task (called by this command) | +| `liza_submit` | Submit implementation for review | +| `liza_status` | Check task status | +| `liza_feedback` | Get feedback to address | + +## Blackboard + +State is persisted in `.owlex/liza-state.yaml` for resumability. +View with resource: `owlex://liza/blackboard` diff --git a/owlex/agents/__init__.py b/owlex/agents/__init__.py index 339fdf4..e32c657 100644 --- a/owlex/agents/__init__.py +++ b/owlex/agents/__init__.py @@ -8,5 +8,6 @@ from .gemini import GeminiRunner from .opencode import OpenCodeRunner from .claudeor import ClaudeORRunner +from .grok import GrokRunner -__all__ = ["AgentRunner", "CodexRunner", "GeminiRunner", "OpenCodeRunner", "ClaudeORRunner"] +__all__ = ["AgentRunner", "CodexRunner", "GeminiRunner", "OpenCodeRunner", "ClaudeORRunner", "GrokRunner"] diff --git a/owlex/agents/grok.py b/owlex/agents/grok.py new file mode 100644 index 0000000..c99aeef --- /dev/null +++ b/owlex/agents/grok.py @@ -0,0 +1,219 @@ +""" +Grok agent runner via OpenCode CLI with xAI backend. +Uses OpenCode with xAI/Grok models for council deliberation and coding tasks. +""" + +import asyncio +import json +import os +import re +from pathlib import Path +from typing import Callable + +from ..config import config +from .base import AgentRunner, AgentCommand + + +def _get_grok_project_id(working_directory: str) -> str | None: + """ + Look up the projectID that OpenCode uses for session storage. + Reuses OpenCode's project registry since Grok runs through OpenCode. + """ + project_dir = Path.home() / ".local" / "share" / "opencode" / "storage" / "project" + if not project_dir.exists(): + return None + + abs_path = os.path.abspath(working_directory) + + try: + for project_file in project_dir.glob("*.json"): + if project_file.name == "global.json": + continue + try: + with open(project_file) as f: + project_data = json.load(f) + if project_data.get("worktree") == abs_path: + return project_data.get("id") + except (json.JSONDecodeError, OSError): + continue + except OSError: + return None + + return None + + +async def get_latest_grok_session( + working_directory: str | None = None, + since_mtime: float | None = None, + max_retries: int = 3, + retry_delay: float = 0.3, +) -> str | None: + """ + Find the most recent Grok/OpenCode session ID from filesystem. + Grok uses OpenCode's session storage since it runs through the OpenCode CLI. 
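+
+    Retries the lookup a few times (max_retries/retry_delay), since OpenCode
+    may finish writing the session file shortly after the CLI exits. Returns
+    the session file stem (the "ses_..." ID), or None if nothing is found.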
+ """ + opencode_dir = Path.home() / ".local" / "share" / "opencode" / "storage" / "session" + if not opencode_dir.exists(): + return None + + if not working_directory: + return None + + project_id = _get_grok_project_id(working_directory) + if not project_id: + return None + + for attempt in range(max_retries): + latest_file: Path | None = None + latest_mtime: float = 0 + + project_dirs = [opencode_dir / project_id] + + for project_dir in project_dirs: + if not project_dir.exists(): + continue + try: + for session_file in project_dir.glob("ses_*.json"): + try: + mtime = session_file.stat().st_mtime + if since_mtime is not None and mtime < since_mtime: + continue + if mtime > latest_mtime: + latest_mtime = mtime + latest_file = session_file + except OSError: + continue + except OSError: + continue + + if latest_file is not None: + return latest_file.stem + + if attempt < max_retries - 1: + await asyncio.sleep(retry_delay) + + return None + + +def clean_grok_output(raw_output: str, original_prompt: str = "") -> str: + """Clean Grok/OpenCode CLI output by removing noise.""" + if not config.grok.clean_output: + return raw_output + cleaned = raw_output + # Remove ANSI escape codes + cleaned = re.sub(r'\x1b\[[0-9;]*m', '', cleaned) + # Collapse multiple newlines + cleaned = re.sub(r'\n{3,}', '\n\n', cleaned) + return cleaned.strip() + + +class GrokRunner(AgentRunner): + """ + Runner for Grok models via OpenCode CLI. + + Uses OpenCode with xAI/Grok models. Requires XAI_API_KEY environment variable. + + Two model configurations: + - GROK_MODEL: Model for council deliberation (default: xai/grok-4-1-fast-reasoning) + - GROK_CODE_MODEL: Model for coding tasks (default: xai/grok-code-fast-1) + """ + + @property + def name(self) -> str: + return "grok" + + def _get_model(self, for_coding: bool = False) -> str: + """Get the appropriate Grok model based on task type.""" + if for_coding: + return config.grok.code_model + return config.grok.model + + def build_exec_command( + self, + prompt: str, + working_directory: str | None = None, + enable_search: bool = False, + for_coding: bool = False, + **kwargs, + ) -> AgentCommand: + """Build command for running Grok via OpenCode with a prompt.""" + full_command = ["opencode", "run"] + + # Model configuration - use Grok model + model = self._get_model(for_coding=for_coding) + full_command.extend(["--model", model]) + + # Agent selection (reuse OpenCode's agent config) + if config.grok.agent: + full_command.extend(["--agent", config.grok.agent]) + + # Use -- to signal end of options + full_command.append("--") + full_command.append(prompt) + + return AgentCommand( + command=full_command, + prompt="", # Prompt is in command as positional arg + cwd=working_directory, + output_prefix="Grok Output", + not_found_hint="Please ensure OpenCode is installed (curl -fsSL https://opencode.ai/install | bash) and XAI_API_KEY is set.", + stream=True, + ) + + def build_resume_command( + self, + session_ref: str, + prompt: str, + working_directory: str | None = None, + enable_search: bool = False, + for_coding: bool = False, + **kwargs, + ) -> AgentCommand: + """Build command for resuming an existing Grok/OpenCode session.""" + full_command = ["opencode", "run"] + + # Model configuration + model = self._get_model(for_coding=for_coding) + full_command.extend(["--model", model]) + + # Agent selection + if config.grok.agent: + full_command.extend(["--agent", config.grok.agent]) + + # Session resume + if session_ref == "--continue" or session_ref == "latest": + 
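+            # "latest" and "--continue" both map to OpenCode's --continue
+            # flag, which resumes the most recently used session for this project.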
full_command.append("--continue") + else: + if session_ref.startswith("-"): + raise ValueError(f"Invalid session_ref: '{session_ref}' - cannot start with '-'") + full_command.extend(["--session", session_ref]) + + full_command.append("--") + full_command.append(prompt) + + return AgentCommand( + command=full_command, + prompt="", + cwd=working_directory, + output_prefix="Grok Resume Output", + not_found_hint="Please ensure OpenCode is installed (curl -fsSL https://opencode.ai/install | bash) and XAI_API_KEY is set.", + stream=False, + ) + + def get_output_cleaner(self) -> Callable[[str, str], str]: + return clean_grok_output + + async def parse_session_id( + self, + output: str, + since_mtime: float | None = None, + working_directory: str | None = None, + ) -> str | None: + """ + Get session ID for Grok. + Uses OpenCode's session storage since Grok runs through OpenCode. + """ + return await get_latest_grok_session( + working_directory=working_directory, + since_mtime=since_mtime, + ) diff --git a/owlex/config.py b/owlex/config.py index 2352573..1220b38 100644 --- a/owlex/config.py +++ b/owlex/config.py @@ -40,6 +40,15 @@ class ClaudeORConfig: clean_output: bool = True +@dataclass(frozen=True) +class GrokConfig: + """Configuration for Grok via xAI/OpenCode integration.""" + model: str = "xai/grok-4-1-fast-reasoning" # Default reasoning model + code_model: str = "xai/grok-code-fast-1" # Default coding model + agent: str = "plan" # Agent to use - "plan" (read-only) or "build" (full access) + clean_output: bool = True + + @dataclass(frozen=True) class CouncilConfig: """Configuration for council orchestration.""" @@ -55,6 +64,7 @@ class OwlexConfig: gemini: GeminiConfig opencode: OpenCodeConfig claudeor: ClaudeORConfig + grok: GrokConfig council: CouncilConfig default_timeout: int = 300 @@ -97,6 +107,13 @@ def load_config() -> OwlexConfig: clean_output=os.environ.get("CLAUDEOR_CLEAN_OUTPUT", "true").lower() == "true", ) + grok = GrokConfig( + model=os.environ.get("GROK_MODEL", "xai/grok-4-1-fast-reasoning"), + code_model=os.environ.get("GROK_CODE_MODEL", "xai/grok-code-fast-1"), + agent=os.environ.get("GROK_AGENT", "plan"), + clean_output=os.environ.get("GROK_CLEAN_OUTPUT", "true").lower() == "true", + ) + # Parse council exclude agents (comma-separated list) exclude_raw = os.environ.get("COUNCIL_EXCLUDE_AGENTS", "") exclude_agents = frozenset( @@ -128,6 +145,7 @@ def load_config() -> OwlexConfig: gemini=gemini, opencode=opencode, claudeor=claudeor, + grok=grok, council=council, default_timeout=timeout, ) diff --git a/owlex/council.py b/owlex/council.py index 64ed0ad..8ce6a8f 100644 --- a/owlex/council.py +++ b/owlex/council.py @@ -10,7 +10,7 @@ from typing import Any from .config import config -from .engine import engine, build_agent_response, codex_runner, gemini_runner, opencode_runner, claudeor_runner +from .engine import engine, build_agent_response, codex_runner, gemini_runner, opencode_runner, claudeor_runner, grok_runner from .prompts import inject_role_prefix, build_deliberation_prompt_with_role from .roles import RoleSpec, RoleDefinition, RoleResolver, RoleId, get_resolver from .models import ( @@ -141,9 +141,12 @@ async def deliberate( # Determine which agents to run excluded = config.council.exclude_agents # Include claudeor only if API key is configured + # Include grok (uses XAI_API_KEY via OpenCode, checked at runtime) all_agents = ["codex", "gemini", "opencode"] if config.claudeor.api_key: all_agents.append("claudeor") + # Grok is always available if not excluded (XAI_API_KEY 
checked by OpenCode at runtime) + all_agents.append("grok") active_agents = [a for a in all_agents if a not in excluded] # Resolve roles (team parameter is just a string that resolves to a team preset) @@ -356,6 +359,35 @@ async def run_claudeor(): claudeor_task.async_task = asyncio.create_task(run_claudeor()) async_tasks.append(claudeor_task.async_task) + if "grok" not in excluded: + grok_role = roles.get("grok") + grok_prompt = inject_role_prefix(prompt, grok_role) + + grok_task = self._engine.create_task( + command=f"council_{Agent.GROK.value}", + args={"prompt": grok_prompt, "working_directory": working_directory}, + context=self.context, + ) + tasks["grok"] = grok_task + + # Capture prompt in closure + _grok_prompt = grok_prompt + + async def run_grok(): + # Clean start for R1 - session ID captured after completion + await self._engine.run_agent( + grok_task, grok_runner, mode="exec", + prompt=_grok_prompt, working_directory=working_directory + ) + elapsed = (datetime.now() - round1_start).total_seconds() + status = "completed" if grok_task.status == "completed" else "failed" + model_name = config.grok.model or "Grok" + self.log(f"{model_name} {status} ({elapsed:.1f}s)") + await self.notify(f"{model_name} {status} ({elapsed:.1f}s)") + + grok_task.async_task = asyncio.create_task(run_grok()) + async_tasks.append(grok_task.async_task) + # Wait for all tasks with timeout if async_tasks: done, pending = await asyncio.wait( @@ -435,11 +467,25 @@ async def parse_claudeor_session(): self.log("ClaudeOR session ID not found, R2 will use exec mode") return session - codex_session, gemini_session, opencode_session, claudeor_session = await asyncio.gather( + async def parse_grok_session(): + if "grok" not in tasks or tasks["grok"].status != "completed": + return None + session = await grok_runner.parse_session_id( + "", since_mtime=r1_start_mtime, working_directory=working_directory + ) + if session and not grok_runner.validate_session_id(session): + self.log(f"Grok session ID validation failed: {session}") + return None + if not session: + self.log("Grok session ID not found, R2 will use exec mode") + return session + + codex_session, gemini_session, opencode_session, claudeor_session, grok_session = await asyncio.gather( parse_codex_session(), parse_gemini_session(), parse_opencode_session(), parse_claudeor_session(), + parse_grok_session(), ) return CouncilRound( @@ -447,6 +493,7 @@ async def parse_claudeor_session(): gemini=build_agent_response(tasks["gemini"], Agent.GEMINI, session_id=gemini_session) if "gemini" in tasks else None, opencode=build_agent_response(tasks["opencode"], Agent.OPENCODE, session_id=opencode_session) if "opencode" in tasks else None, claudeor=build_agent_response(tasks["claudeor"], Agent.CLAUDEOR, session_id=claudeor_session) if "claudeor" in tasks else None, + grok=build_agent_response(tasks["grok"], Agent.GROK, session_id=grok_session) if "grok" in tasks else None, ) async def _run_round_2( @@ -475,6 +522,7 @@ async def _run_round_2( ("gemini", round_1.gemini), ("opencode", round_1.opencode), ("claudeor", round_1.claudeor), + ("grok", round_1.grok), ]: if r1_result and r1_result.status == "failed": r1_failed.add(agent_name) @@ -486,11 +534,13 @@ async def _run_round_2( gemini_session = round_1.gemini.session_id if round_1.gemini else None opencode_session = round_1.opencode.session_id if round_1.opencode else None claudeor_session = round_1.claudeor.session_id if round_1.claudeor else None + grok_session = round_1.grok.session_id if round_1.grok else None 
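+
+        # R2 resumes each agent's R1 session when a session ID was recovered;
+        # otherwise that agent falls back to exec mode with the full R1
+        # context embedded in its prompt.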
codex_content = (round_1.codex.content or round_1.codex.error or "(no response)") if round_1.codex else None gemini_content = (round_1.gemini.content or round_1.gemini.error or "(no response)") if round_1.gemini else None opencode_content = (round_1.opencode.content or round_1.opencode.error or "(no response)") if round_1.opencode else None claudeor_content = (round_1.claudeor.content or round_1.claudeor.error or "(no response)") if round_1.claudeor else None + grok_content = (round_1.grok.content or round_1.grok.error or "(no response)") if round_1.grok else None claude_content = claude_opinion.strip() if claude_opinion else None round2_start = datetime.now() @@ -515,6 +565,7 @@ async def _run_round_2( gemini_answer=gemini_content, opencode_answer=opencode_content, claudeor_answer=claudeor_content, + grok_answer=grok_content, claude_answer=claude_content, critique=critique, include_original=False, @@ -527,6 +578,7 @@ async def _run_round_2( gemini_answer=gemini_content, opencode_answer=opencode_content, claudeor_answer=claudeor_content, + grok_answer=grok_content, claude_answer=claude_content, critique=critique, include_original=True, @@ -576,6 +628,7 @@ async def run_codex_delib(): gemini_answer=gemini_content, opencode_answer=opencode_content, claudeor_answer=claudeor_content, + grok_answer=grok_content, claude_answer=claude_content, critique=critique, include_original=False, @@ -588,6 +641,7 @@ async def run_codex_delib(): gemini_answer=gemini_content, opencode_answer=opencode_content, claudeor_answer=claudeor_content, + grok_answer=grok_content, claude_answer=claude_content, critique=critique, include_original=True, @@ -635,6 +689,7 @@ async def run_gemini_delib(): gemini_answer=gemini_content, opencode_answer=opencode_content, claudeor_answer=claudeor_content, + grok_answer=grok_content, claude_answer=claude_content, critique=critique, include_original=False, @@ -647,6 +702,7 @@ async def run_gemini_delib(): gemini_answer=gemini_content, opencode_answer=opencode_content, claudeor_answer=claudeor_content, + grok_answer=grok_content, claude_answer=claude_content, critique=critique, include_original=True, @@ -694,6 +750,7 @@ async def run_opencode_delib(): gemini_answer=gemini_content, opencode_answer=opencode_content, claudeor_answer=claudeor_content, + grok_answer=grok_content, claude_answer=claude_content, critique=critique, include_original=False, @@ -706,6 +763,7 @@ async def run_opencode_delib(): gemini_answer=gemini_content, opencode_answer=opencode_content, claudeor_answer=claudeor_content, + grok_answer=grok_content, claude_answer=claude_content, critique=critique, include_original=True, @@ -744,6 +802,68 @@ async def run_claudeor_delib(): claudeor_delib_task.async_task = asyncio.create_task(run_claudeor_delib()) async_tasks.append(claudeor_delib_task.async_task) + if "grok" not in excluded and "grok" not in r1_failed: + grok_role = roles.get("grok") + # Build prompt for resume mode (no original needed - agent has R1 context) + grok_delib_prompt_resume = build_deliberation_prompt_with_role( + original_prompt=prompt, + role=grok_role, + codex_answer=codex_content, + gemini_answer=gemini_content, + opencode_answer=opencode_content, + claudeor_answer=claudeor_content, + grok_answer=grok_content, + claude_answer=claude_content, + critique=critique, + include_original=False, + ) + # Build prompt for exec fallback (include original - agent starts fresh) + grok_delib_prompt_exec = build_deliberation_prompt_with_role( + original_prompt=prompt, + role=grok_role, + 
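+                # exec fallback: every R1 answer is embedded because the agent starts without session context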
codex_answer=codex_content, + gemini_answer=gemini_content, + opencode_answer=opencode_content, + claudeor_answer=claudeor_content, + grok_answer=grok_content, + claude_answer=claude_content, + critique=critique, + include_original=True, + ) + + grok_delib_task = self._engine.create_task( + command=f"council_{Agent.GROK.value}_delib", + args={"prompt": grok_delib_prompt_resume, "working_directory": working_directory}, + context=self.context, + ) + tasks["grok"] = grok_delib_task + + # Capture session and prompts in closure + _grok_session = grok_session + _grok_delib_prompt_resume = grok_delib_prompt_resume + _grok_delib_prompt_exec = grok_delib_prompt_exec + + async def run_grok_delib(): + # Resume with explicit session ID if available, otherwise exec with full context + if _grok_session: + await self._engine.run_agent( + grok_delib_task, grok_runner, mode="resume", + session_ref=_grok_session, + prompt=_grok_delib_prompt_resume, working_directory=working_directory + ) + else: + await self._engine.run_agent( + grok_delib_task, grok_runner, mode="exec", + prompt=_grok_delib_prompt_exec, working_directory=working_directory + ) + elapsed = (datetime.now() - round2_start).total_seconds() + model_name = config.grok.model or "Grok" + self.log(f"{model_name} revised ({elapsed:.1f}s)") + await self.notify(f"{model_name} revised ({elapsed:.1f}s)") + + grok_delib_task.async_task = asyncio.create_task(run_grok_delib()) + async_tasks.append(grok_delib_task.async_task) + # Wait for all tasks with timeout if async_tasks: done, pending = await asyncio.wait( @@ -770,4 +890,5 @@ async def run_claudeor_delib(): gemini=build_agent_response(tasks["gemini"], Agent.GEMINI) if "gemini" in tasks else None, opencode=build_agent_response(tasks["opencode"], Agent.OPENCODE) if "opencode" in tasks else None, claudeor=build_agent_response(tasks["claudeor"], Agent.CLAUDEOR) if "claudeor" in tasks else None, + grok=build_agent_response(tasks["grok"], Agent.GROK) if "grok" in tasks else None, ) diff --git a/owlex/engine.py b/owlex/engine.py index 9288b0d..c7c8a1e 100644 --- a/owlex/engine.py +++ b/owlex/engine.py @@ -12,7 +12,7 @@ from .config import config from .models import Task, TaskStatus, AgentResponse, Agent -from .agents import CodexRunner, GeminiRunner, OpenCodeRunner, ClaudeORRunner +from .agents import CodexRunner, GeminiRunner, OpenCodeRunner, ClaudeORRunner, GrokRunner from .agents.base import AgentRunner, AgentCommand @@ -39,6 +39,7 @@ def build_agent_response( Agent.GEMINI.value: "Gemini Output:\n\n", Agent.OPENCODE.value: "OpenCode Output:\n\n", Agent.CLAUDEOR.value: "Claude (OpenRouter) Output:\n\n", + Agent.GROK.value: "Grok Output:\n\n", } prefix = prefix_map.get(agent_name, "") @@ -64,6 +65,7 @@ def build_agent_response( gemini_runner = GeminiRunner() opencode_runner = OpenCodeRunner() claudeor_runner = ClaudeORRunner() +grok_runner = GrokRunner() # Map Agent enum to runner instances AGENT_RUNNERS: dict[Agent, AgentRunner] = { @@ -71,6 +73,7 @@ def build_agent_response( Agent.GEMINI: gemini_runner, Agent.OPENCODE: opencode_runner, Agent.CLAUDEOR: claudeor_runner, + Agent.GROK: grok_runner, } diff --git a/owlex/liza/__init__.py b/owlex/liza/__init__.py new file mode 100644 index 0000000..ac3866e --- /dev/null +++ b/owlex/liza/__init__.py @@ -0,0 +1,33 @@ +""" +Liza - Peer-supervised multi-agent coding system for owlex. + +Implements the Liza protocol: disciplined peer-supervised coding where +agents take coder/reviewer roles with binding verdicts. 
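+
+A typical entry point (a sketch, assuming an initialized working directory):
+
+    from owlex.liza import Blackboard, LizaOrchestrator
+
+    bb = Blackboard(working_directory=".")
+    orchestrator = LizaOrchestrator(blackboard=bb)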
+ +Based on: https://github.com/liza-mas/liza +""" + +from .blackboard import Blackboard, Task, TaskStatus, AgentRole +from .protocol import ReviewVerdict, VerdictStatus, parse_verdict +from .contracts import CoderContract, ReviewerContract, get_contract_prompt +from .orchestrator import LizaOrchestrator, LizaConfig, LizaResult + +__all__ = [ + # Blackboard + "Blackboard", + "Task", + "TaskStatus", + "AgentRole", + # Protocol + "ReviewVerdict", + "VerdictStatus", + "parse_verdict", + # Contracts + "CoderContract", + "ReviewerContract", + "get_contract_prompt", + # Orchestrator + "LizaOrchestrator", + "LizaConfig", + "LizaResult", +] diff --git a/owlex/liza/blackboard.py b/owlex/liza/blackboard.py new file mode 100644 index 0000000..7c05854 --- /dev/null +++ b/owlex/liza/blackboard.py @@ -0,0 +1,606 @@ +""" +Blackboard - Shared state coordination for Liza multi-agent system. + +The blackboard is the single source of truth for task state, assignments, +and history. All agents read from and write to this shared state. + +State file location: .owlex/liza-state.yaml (in working directory) +""" + +import os +import fcntl +import uuid +from dataclasses import dataclass, field +from datetime import datetime, timezone +from enum import Enum +from pathlib import Path +from typing import Any + +import yaml + + +class TaskStatus(str, Enum): + """Task lifecycle states.""" + DRAFT = "DRAFT" # Planner creating task + UNCLAIMED = "UNCLAIMED" # Ready to be claimed by coder + CLAIMED = "CLAIMED" # Coder has claimed, not yet started + WORKING = "WORKING" # Coder implementing + READY_FOR_REVIEW = "READY_FOR_REVIEW" # Submitted for review + IN_REVIEW = "IN_REVIEW" # Reviewer examining + APPROVED = "APPROVED" # Reviewer approved + REJECTED = "REJECTED" # Reviewer rejected (needs iteration) + BLOCKED = "BLOCKED" # Cannot proceed without intervention + SUPERSEDED = "SUPERSEDED" # Replaced by another task + MERGED = "MERGED" # Approved and merged + + +class AgentRole(str, Enum): + """Agent roles in Liza system.""" + CODER = "coder" + REVIEWER = "reviewer" + PLANNER = "planner" + + +@dataclass +class HistoryEntry: + """A single entry in task history.""" + time: str + event: str + agent: str | None = None + details: dict[str, Any] | None = None + + def to_dict(self) -> dict: + d = {"time": self.time, "event": self.event} + if self.agent: + d["agent"] = self.agent + if self.details: + d["details"] = self.details + return d + + @classmethod + def from_dict(cls, data: dict) -> "HistoryEntry": + return cls( + time=data["time"], + event=data["event"], + agent=data.get("agent"), + details=data.get("details"), + ) + + +@dataclass +class ReviewRecord: + """Record of a single review.""" + reviewer: str + verdict: str # APPROVE or REJECT + feedback: str | None + timestamp: str + iteration: int + + def to_dict(self) -> dict: + return { + "reviewer": self.reviewer, + "verdict": self.verdict, + "feedback": self.feedback, + "timestamp": self.timestamp, + "iteration": self.iteration, + } + + @classmethod + def from_dict(cls, data: dict) -> "ReviewRecord": + return cls( + reviewer=data["reviewer"], + verdict=data["verdict"], + feedback=data.get("feedback"), + timestamp=data["timestamp"], + iteration=data["iteration"], + ) + + +@dataclass +class Task: + """A task in the Liza system.""" + id: str + description: str + status: TaskStatus = TaskStatus.UNCLAIMED + coder: str | None = None + reviewers: list[str] = field(default_factory=list) + iteration: int = 0 + max_iterations: int = 5 + done_when: str | None = None + created_at: str = 
field(default_factory=lambda: datetime.now(timezone.utc).isoformat()) + updated_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat()) + history: list[HistoryEntry] = field(default_factory=list) + reviews: list[ReviewRecord] = field(default_factory=list) + last_implementation: str | None = None # Summary of last implementation + merged_feedback: str | None = None # Combined feedback from all reviewers + + def to_dict(self) -> dict: + return { + "id": self.id, + "description": self.description, + "status": self.status.value, + "coder": self.coder, + "reviewers": self.reviewers, + "iteration": self.iteration, + "max_iterations": self.max_iterations, + "done_when": self.done_when, + "created_at": self.created_at, + "updated_at": self.updated_at, + "history": [h.to_dict() for h in self.history], + "reviews": [r.to_dict() for r in self.reviews], + "last_implementation": self.last_implementation, + "merged_feedback": self.merged_feedback, + } + + @classmethod + def from_dict(cls, data: dict) -> "Task": + return cls( + id=data["id"], + description=data["description"], + status=TaskStatus(data["status"]), + coder=data.get("coder"), + reviewers=data.get("reviewers", []), + iteration=data.get("iteration", 0), + max_iterations=data.get("max_iterations", 5), + done_when=data.get("done_when"), + created_at=data.get("created_at", datetime.now(timezone.utc).isoformat()), + updated_at=data.get("updated_at", datetime.now(timezone.utc).isoformat()), + history=[HistoryEntry.from_dict(h) for h in data.get("history", [])], + reviews=[ReviewRecord.from_dict(r) for r in data.get("reviews", [])], + last_implementation=data.get("last_implementation"), + merged_feedback=data.get("merged_feedback"), + ) + + def add_history(self, event: str, agent: str | None = None, details: dict | None = None): + """Add a history entry.""" + self.history.append(HistoryEntry( + time=datetime.now(timezone.utc).isoformat(), + event=event, + agent=agent, + details=details, + )) + self.updated_at = datetime.now(timezone.utc).isoformat() + + def add_review(self, reviewer: str, verdict: str, feedback: str | None): + """Add a review record.""" + self.reviews.append(ReviewRecord( + reviewer=reviewer, + verdict=verdict, + feedback=feedback, + timestamp=datetime.now(timezone.utc).isoformat(), + iteration=self.iteration, + )) + self.updated_at = datetime.now(timezone.utc).isoformat() + + +@dataclass +class BlackboardState: + """Full blackboard state.""" + version: int = 1 + goal: str = "" + spec_ref: str | None = None + created_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat()) + updated_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat()) + tasks: list[Task] = field(default_factory=list) + config: dict[str, Any] = field(default_factory=dict) + log: list[str] = field(default_factory=list) + + def to_dict(self) -> dict: + return { + "version": self.version, + "goal": self.goal, + "spec_ref": self.spec_ref, + "created_at": self.created_at, + "updated_at": self.updated_at, + "tasks": [t.to_dict() for t in self.tasks], + "config": self.config, + "log": self.log, + } + + @classmethod + def from_dict(cls, data: dict) -> "BlackboardState": + return cls( + version=data.get("version", 1), + goal=data.get("goal", ""), + spec_ref=data.get("spec_ref"), + created_at=data.get("created_at", datetime.now(timezone.utc).isoformat()), + updated_at=data.get("updated_at", datetime.now(timezone.utc).isoformat()), + tasks=[Task.from_dict(t) for t in data.get("tasks", [])], + 
config=data.get("config", {}), + log=data.get("log", []), + ) + + +class Blackboard: + """ + Blackboard state manager with file-based persistence. + + Provides atomic read/write operations with file locking. + """ + + DEFAULT_DIR = ".owlex" + DEFAULT_FILE = "liza-state.yaml" + + def __init__(self, working_directory: str | None = None): + """ + Initialize blackboard. + + Args: + working_directory: Directory containing .owlex/. Defaults to CWD. + """ + if working_directory is None: + working_directory = os.getcwd() + self.working_directory = Path(working_directory) + self.state_dir = self.working_directory / self.DEFAULT_DIR + self.state_file = self.state_dir / self.DEFAULT_FILE + self.lock_file = self.state_dir / f"{self.DEFAULT_FILE}.lock" + + def exists(self) -> bool: + """Check if blackboard state file exists.""" + return self.state_file.exists() + + def initialize( + self, + goal: str, + spec_ref: str | None = None, + config: dict[str, Any] | None = None, + ) -> BlackboardState: + """ + Initialize a new blackboard. + + Args: + goal: The goal/objective for this Liza session + spec_ref: Optional reference to specification file + config: Optional configuration overrides + + Returns: + The initialized BlackboardState + """ + # Create directory if needed + self.state_dir.mkdir(parents=True, exist_ok=True) + + state = BlackboardState( + goal=goal, + spec_ref=spec_ref, + config=config or {}, + ) + state.log.append(f"Blackboard initialized: {goal}") + + self._write_state(state) + return state + + def read(self) -> BlackboardState: + """ + Read current blackboard state. + + Returns: + Current BlackboardState + + Raises: + FileNotFoundError: If blackboard not initialized + """ + if not self.exists(): + raise FileNotFoundError(f"Blackboard not found at {self.state_file}. Run liza_start to create a task.") + + with open(self.state_file, "r") as f: + # Acquire shared lock for reading + fcntl.flock(f.fileno(), fcntl.LOCK_SH) + try: + data = yaml.safe_load(f) + return BlackboardState.from_dict(data) + finally: + fcntl.flock(f.fileno(), fcntl.LOCK_UN) + + def write(self, state: BlackboardState): + """ + Write blackboard state atomically. + + Args: + state: The state to write + """ + self._write_state(state) + + def _write_state(self, state: BlackboardState): + """Write state with exclusive lock.""" + self.state_dir.mkdir(parents=True, exist_ok=True) + state.updated_at = datetime.now(timezone.utc).isoformat() + + # Write to temp file then rename for atomicity + temp_file = self.state_file.with_suffix(".yaml.tmp") + + with open(temp_file, "w") as f: + fcntl.flock(f.fileno(), fcntl.LOCK_EX) + try: + yaml.dump(state.to_dict(), f, default_flow_style=False, sort_keys=False) + finally: + fcntl.flock(f.fileno(), fcntl.LOCK_UN) + + # Atomic rename + temp_file.rename(self.state_file) + + def add_task( + self, + description: str, + coder: str | None = None, + reviewers: list[str] | None = None, + max_iterations: int = 5, + done_when: str | None = None, + status: TaskStatus = TaskStatus.UNCLAIMED, + ) -> Task: + """ + Add a new task to the blackboard. 
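+
+        Task IDs are assigned sequentially ("task-1", "task-2", ...) from the
+        number of tasks already on the blackboard.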
+
+        Args:
+            description: Task description
+            coder: Agent to implement (e.g., "codex")
+            reviewers: Agents to review (e.g., ["gemini", "opencode"])
+            max_iterations: Maximum coder-reviewer cycles
+            done_when: Completion criteria
+            status: Initial status (default UNCLAIMED)
+
+        Returns:
+            The created Task
+        """
+        state = self.read()
+
+        task_id = f"task-{len(state.tasks) + 1}"
+        task = Task(
+            id=task_id,
+            description=description,
+            status=status,
+            coder=coder,
+            reviewers=reviewers or [],
+            max_iterations=max_iterations,
+            done_when=done_when,
+        )
+        task.add_history("task_created", details={"description": description[:100]})
+
+        state.tasks.append(task)
+        state.log.append(f"Task {task_id} created: {description[:50]}...")
+
+        self.write(state)
+        return task
+
+    def get_task(self, task_id: str) -> Task | None:
+        """Get a task by ID."""
+        state = self.read()
+        for task in state.tasks:
+            if task.id == task_id:
+                return task
+        return None
+
+    def update_task(self, task: Task):
+        """
+        Update a task in the blackboard.
+
+        Args:
+            task: The task with updated fields
+        """
+        state = self.read()
+        for i, t in enumerate(state.tasks):
+            if t.id == task.id:
+                state.tasks[i] = task
+                break
+        self.write(state)
+
+    def transition_task(
+        self,
+        task_id: str,
+        new_status: TaskStatus,
+        agent: str | None = None,
+        details: dict | None = None,
+    ) -> Task:
+        """
+        Transition a task to a new status with history.
+
+        Args:
+            task_id: Task to transition
+            new_status: Target status
+            agent: Agent performing the transition
+            details: Additional details for history
+
+        Returns:
+            The updated Task
+
+        Raises:
+            ValueError: If task not found
+        """
+        state = self.read()
+
+        task = None
+        for t in state.tasks:
+            if t.id == task_id:
+                task = t
+                break
+
+        if task is None:
+            raise ValueError(f"Task {task_id} not found")
+
+        old_status = task.status
+        task.status = new_status
+        task.add_history(
+            "status_changed",
+            agent=agent,
+            details={"from": old_status.value, "to": new_status.value, **(details or {})},
+        )
+
+        state.log.append(f"{task_id}: {old_status.value} -> {new_status.value}")
+        self.write(state)
+        return task
+
+    def claim_task(self, task_id: str, coder: str) -> Task:
+        """
+        Claim a task for implementation.
+
+        Args:
+            task_id: Task to claim
+            coder: Agent claiming the task
+
+        Returns:
+            The claimed Task
+        """
+        state = self.read()
+
+        task = None
+        for t in state.tasks:
+            if t.id == task_id:
+                task = t
+                break
+
+        if task is None:
+            raise ValueError(f"Task {task_id} not found")
+
+        if task.status not in (TaskStatus.UNCLAIMED, TaskStatus.REJECTED):
+            raise ValueError(f"Task {task_id} cannot be claimed (status: {task.status.value})")
+
+        # Check the pre-claim status before it is overwritten below:
+        # re-claiming a REJECTED task starts the next coder-reviewer iteration.
+        if task.status == TaskStatus.REJECTED:
+            task.iteration += 1
+
+        task.coder = coder
+        task.status = TaskStatus.CLAIMED
+        task.add_history("task_claimed", agent=coder)
+
+        state.log.append(f"{task_id} claimed by {coder}")
+        self.write(state)
+        return task
+
+    def submit_for_review(
+        self,
+        task_id: str,
+        implementation_summary: str | None = None,
+    ) -> Task:
+        """
+        Submit a task for review.
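+
+        Moves the task to READY_FOR_REVIEW and stores the summary that is
+        shown to reviewers on the next review round.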
+
+        Args:
+            task_id: Task to submit
+            implementation_summary: Optional summary of implementation
+
+        Returns:
+            The submitted Task
+        """
+        state = self.read()
+
+        task = None
+        for t in state.tasks:
+            if t.id == task_id:
+                task = t
+                break
+
+        if task is None:
+            raise ValueError(f"Task {task_id} not found")
+
+        task.status = TaskStatus.READY_FOR_REVIEW
+        task.last_implementation = implementation_summary
+        task.add_history("submitted_for_review", agent=task.coder, details={
+            "iteration": task.iteration,
+            "summary": implementation_summary[:200] if implementation_summary else None,
+        })
+
+        state.log.append(f"{task_id} submitted for review (iteration {task.iteration})")
+        self.write(state)
+        return task
+
+    def record_review(
+        self,
+        task_id: str,
+        reviewer: str,
+        verdict: str,
+        feedback: str | None = None,
+    ) -> Task:
+        """
+        Record a review verdict.
+
+        Args:
+            task_id: Task being reviewed
+            reviewer: Reviewing agent
+            verdict: APPROVE or REJECT
+            feedback: Review feedback
+
+        Returns:
+            The reviewed Task
+        """
+        state = self.read()
+
+        task = None
+        for t in state.tasks:
+            if t.id == task_id:
+                task = t
+                break
+
+        if task is None:
+            raise ValueError(f"Task {task_id} not found")
+
+        task.add_review(reviewer, verdict, feedback)
+        task.add_history("review_received", agent=reviewer, details={
+            "verdict": verdict,
+            "has_feedback": feedback is not None,
+        })
+
+        state.log.append(f"{task_id} reviewed by {reviewer}: {verdict}")
+        self.write(state)
+        return task
+
+    def finalize_reviews(self, task_id: str) -> tuple[Task, bool]:
+        """
+        Finalize all reviews for a task iteration.
+
+        Merges feedback from all reviewers and determines final verdict.
+        All reviewers must APPROVE for the task to be approved; an iteration
+        with no recorded reviews is treated as rejected.
+
+        Args:
+            task_id: Task to finalize
+
+        Returns:
+            Tuple of (updated Task, approved: bool)
+        """
+        state = self.read()
+
+        task = None
+        for t in state.tasks:
+            if t.id == task_id:
+                task = t
+                break
+
+        if task is None:
+            raise ValueError(f"Task {task_id} not found")
+
+        # Get reviews from current iteration
+        current_reviews = [r for r in task.reviews if r.iteration == task.iteration]
+
+        # All reviewers must approve; guard against all() returning True for
+        # an empty review list, which would auto-approve an unreviewed task.
+        all_approved = bool(current_reviews) and all(
+            r.verdict == "APPROVE" for r in current_reviews
+        )
+
+        if all_approved:
+            task.status = TaskStatus.APPROVED
+            task.add_history("all_reviews_approved", details={
+                "reviewers": [r.reviewer for r in current_reviews],
+            })
+            state.log.append(f"{task_id} APPROVED by all reviewers")
+        else:
+            task.status = TaskStatus.REJECTED
+            # Merge all feedback
+            task.merged_feedback = "\n\n---\n\n".join(
+                f"**{r.reviewer}** ({r.verdict}):\n{r.feedback or '(no feedback)'}"
+                for r in current_reviews
+            )
+            task.add_history("reviews_require_changes", details={
+                "rejecting_reviewers": [r.reviewer for r in current_reviews if r.verdict != "APPROVE"],
+            })
+            state.log.append(f"{task_id} REJECTED - needs iteration")
+
+        self.write(state)
+        return task, all_approved
+
+    def get_active_tasks(self) -> list[Task]:
+        """Get all non-terminal tasks."""
+        state = self.read()
+        terminal = {TaskStatus.APPROVED, TaskStatus.MERGED, TaskStatus.SUPERSEDED}
+        return [t for t in state.tasks if t.status not in terminal]
+
+    def log_message(self, message: str):
+        """Add a message to the blackboard log."""
+        state = self.read()
+        state.log.append(f"[{datetime.now(timezone.utc).isoformat()}] {message}")
+        self.write(state)
diff --git a/owlex/liza/contracts.py b/owlex/liza/contracts.py new file mode 100644 index
0000000..55ffdf3 --- /dev/null +++ b/owlex/liza/contracts.py @@ -0,0 +1,302 @@ +""" +Behavioral Contracts - Rules that discipline agents in Liza system. + +Based on the Liza contract philosophy: +"Negative space design: The contract defines what's forbidden; the shape that +remains is where judgment lives. Strict on failure modes, silent on excellence." + +Contracts prevent: +- Sycophancy +- Phantom fixes (claiming to fix without actually fixing) +- Test corruption +- Scope creep +- Hallucinated completions +- Self-certification +""" + +from dataclasses import dataclass +from enum import Enum + + +class ContractTier(str, Enum): + """Contract rule tiers - Tier 0 is never violated.""" + TIER_0 = "tier_0" # Invariants - never violated + TIER_1 = "tier_1" # Important but can be overridden by human + TIER_2 = "tier_2" # Best practices + + +@dataclass +class CoderContract: + """Behavioral contract for coder agents.""" + + @staticmethod + def get_prompt() -> str: + return """ +# Coder Contract + +You are operating as a CODER in the Liza peer-supervised coding system. +Your work will be reviewed by independent reviewers who will APPROVE or REJECT. + +## Tier 0 Invariants (NEVER violate) + +1. **No fabrication**: Never claim to have done something you haven't +2. **No test corruption**: Never modify tests to make failing code pass +3. **No phantom fixes**: Don't claim "fixed" without actual implementation +4. **No self-certification**: You cannot approve your own work +5. **No scope creep**: Only implement what was requested + +## Behavioral Rules + +### Implementation +- Read the task description carefully before starting +- If requirements are unclear, note your assumptions explicitly +- Implement the minimal solution that satisfies requirements +- Do not add features that weren't requested + +### Code Quality +- Follow existing patterns in the codebase +- Handle errors appropriately +- Write tests if the codebase has tests +- Do not leave TODOs without explicit approval + +### Communication +- Be honest about what you implemented +- Clearly state any assumptions made +- If you couldn't complete something, say so +- Do not use vague language to hide incompleteness + +### When Blocked +- If you cannot proceed, state why clearly +- Do not guess at requirements +- Request clarification rather than assuming + +## After Implementation + +Provide a clear summary of what you implemented: +- What was done +- What assumptions were made +- What was NOT done (if anything) +- Any concerns or risks + +Remember: A reviewer will examine your work. Be honest - they will find issues +if you try to hide them. Quality is the fastest path to approval. +""" + + +@dataclass +class ReviewerContract: + """Behavioral contract for reviewer agents.""" + + @staticmethod + def get_prompt(critique_mode: bool = True) -> str: + base = """ +# Reviewer Contract + +You are operating as a REVIEWER in the Liza peer-supervised coding system. +Your verdict is BINDING - the coder must address your feedback before approval. + +## Tier 0 Invariants (NEVER violate) + +1. **No rubber-stamping**: Never approve without thorough examination +2. **No fabricated issues**: Only raise issues you actually found +3. **Honest verdicts**: APPROVE means you'd stake your reputation on this code +4. **Specific feedback**: Vague criticism is not actionable +5. 
**No implementation**: You review, you do not implement + +## Review Process + +### Before Reviewing +- Read the task description to understand requirements +- Understand what was supposed to be implemented +- If this is a re-review, check previous feedback was addressed + +### During Review +- Check that implementation matches requirements +- Verify error handling and edge cases +- Look for security issues +- Check for obvious bugs or logic errors +- Verify tests exist and pass (if applicable) + +### Verdict Decision +- **APPROVE**: Implementation is correct, meets requirements, no blocking issues +- **REJECT**: Issues found that must be fixed + +### Feedback Format +When rejecting, provide: +- Specific issues with file/line references if possible +- Clear explanation of what's wrong +- Suggestion for how to fix (optional but helpful) +""" + + if critique_mode: + base += """ +## Critique Mode (ACTIVE) + +You are in CRITIQUE MODE. This means: +- Your job is to FIND problems, not to be polite +- Assume the code has bugs until proven otherwise +- Check for: + * Security vulnerabilities + * Race conditions + * Resource leaks + * Unhandled edge cases + * Missing validation + * Incorrect error handling +- Do NOT approve just because it "looks okay" +- If you're uncertain, REJECT and explain why + +The coder expects rigorous review. Being thorough helps them. +""" + else: + base += """ +## Standard Review Mode + +Focus on: +- Correctness (does it do what it should?) +- Meeting requirements (does it solve the task?) +- Obvious bugs or security issues +- Reasonable code quality + +Minor style issues should not block approval. +""" + + base += """ +## Your Verdict + +After your review, provide your verdict in this format: + +APPROVE or REJECT + + +Your detailed feedback here. + +""" + return base + + +def get_contract_prompt(role: str, critique_mode: bool = True) -> str: + """ + Get the appropriate contract prompt for a role. + + Args: + role: "coder" or "reviewer" + critique_mode: For reviewers, whether to use critique mode + + Returns: + Contract prompt text + """ + if role == "coder": + return CoderContract.get_prompt() + elif role == "reviewer": + return ReviewerContract.get_prompt(critique_mode) + else: + raise ValueError(f"Unknown role: {role}") + + +def build_coder_prompt( + task_description: str, + previous_feedback: str | None = None, + iteration: int = 0, + done_when: str | None = None, +) -> str: + """ + Build the complete prompt for a coder agent. + + Args: + task_description: What to implement + previous_feedback: Feedback from reviewers (if iteration > 0) + iteration: Current iteration number + done_when: Optional completion criteria + + Returns: + Complete prompt for the coder + """ + prompt = CoderContract.get_prompt() + + prompt += f""" + +# Your Task + +## Description +{task_description} +""" + + if done_when: + prompt += f""" +## Completion Criteria +{done_when} +""" + + if iteration > 0 and previous_feedback: + prompt += f""" +## Reviewer Feedback (Iteration {iteration}) + +Your previous implementation was REJECTED. Address these issues: + +{previous_feedback} + +Focus on fixing the specific issues raised. Do not introduce unrelated changes. +""" + else: + prompt += """ +## Instructions + +Implement the task as described. After implementation, provide a summary of: +1. What you implemented +2. Any assumptions made +3. 
Anything you couldn't complete or are uncertain about +""" + + return prompt + + +def build_reviewer_prompt( + task_description: str, + implementation_summary: str | None = None, + previous_feedback: str | None = None, + iteration: int = 0, + critique_mode: bool = True, +) -> str: + """ + Build the complete prompt for a reviewer agent. + + Args: + task_description: What was supposed to be implemented + implementation_summary: Coder's summary of their implementation + previous_feedback: Feedback from previous iteration + iteration: Current iteration number + critique_mode: Whether to use critique mode + + Returns: + Complete prompt for the reviewer + """ + prompt = ReviewerContract.get_prompt(critique_mode) + + prompt += f""" + +# Review Request + +## Original Task +{task_description} + +## Implementation Summary from Coder +{implementation_summary or "(No summary provided - examine the code directly)"} +""" + + if iteration > 0 and previous_feedback: + prompt += f""" +## Previous Review Feedback (Iteration {iteration - 1}) +The coder was asked to address: + +{previous_feedback} + +Verify whether these issues have been properly resolved. +""" + + prompt += """ +## Your Review + +Examine the implementation and provide your verdict. +""" + + return prompt diff --git a/owlex/liza/orchestrator.py b/owlex/liza/orchestrator.py new file mode 100644 index 0000000..a9a2c7f --- /dev/null +++ b/owlex/liza/orchestrator.py @@ -0,0 +1,447 @@ +""" +Liza Orchestrator - Coordination for Claude (coder) + reviewer agents loop. + +Architecture: +- Claude Code = Orchestrator + Coder (trusted, actually writes code) +- Codex/Gemini/OpenCode = Reviewers (via owlex MCP) + +Flow: +1. Claude implements the task (using Write/Edit/Bash tools) +2. Claude sends implementation summary to reviewers via owlex +3. Reviewers examine and return APPROVE/REJECT + feedback +4. If REJECT: Claude fixes based on merged feedback, loop back +5. 
If APPROVE: done +""" + +import asyncio +import sys +from dataclasses import dataclass, field +from datetime import datetime +from typing import Any, Callable, Awaitable + +from .blackboard import Blackboard, Task, TaskStatus +from .protocol import parse_verdict, VerdictStatus, ReviewVerdict +from .contracts import build_reviewer_prompt + + +def _log(msg: str): + """Log to stderr.""" + print(f"[Liza] {msg}", file=sys.stderr, flush=True) + + +@dataclass +class LizaConfig: + """Configuration for Liza orchestrator.""" + max_iterations: int = 5 + critique_mode: bool = True + parallel_reviews: bool = True # Run reviewers in parallel + require_all_approve: bool = True # All reviewers must approve + timeout_per_agent: int = 300 # Seconds per reviewer call + working_directory: str | None = None + reviewers: list[str] = field(default_factory=lambda: ["codex", "gemini"]) + + +@dataclass +class ReviewRoundResult: + """Result of a single review round.""" + iteration: int + reviews: dict[str, ReviewVerdict] # reviewer -> verdict + all_approved: bool + merged_feedback: str | None + issues_found: list[str] + + +@dataclass +class LizaResult: + """Final result of a Liza session.""" + task_id: str + task_description: str + final_status: TaskStatus + review_rounds: list[ReviewRoundResult] + total_iterations: int + approved: bool + all_feedback: list[str] + duration_seconds: float + error: str | None = None + + def to_dict(self) -> dict: + """Convert to dictionary for JSON serialization.""" + return { + "task_id": self.task_id, + "task_description": self.task_description, + "final_status": self.final_status.value, + "total_iterations": self.total_iterations, + "approved": self.approved, + "all_feedback": self.all_feedback, + "duration_seconds": self.duration_seconds, + "error": self.error, + "review_rounds": [ + { + "iteration": rr.iteration, + "all_approved": rr.all_approved, + "merged_feedback": rr.merged_feedback, + "issues_found": rr.issues_found, + "reviews": { + reviewer: { + "status": v.status.value, + "feedback": v.feedback, + "issues": v.issues, + "confidence": v.confidence, + } + for reviewer, v in rr.reviews.items() + }, + } + for rr in self.review_rounds + ], + } + + +# Type for reviewer runner function (calls owlex agents) +ReviewerRunner = Callable[[str, str, str | None, int], Awaitable[str | None]] + + +class LizaOrchestrator: + """ + Orchestrates the Liza review loop. + + Claude Code is the coder (trusted implementer). + Codex/Gemini/OpenCode are reviewers (via owlex MCP). + + This orchestrator: + 1. Manages the blackboard state + 2. Sends implementation to reviewers + 3. Parses verdicts and merges feedback + 4. Returns feedback to Claude for iteration + + Claude does the actual implementation outside this orchestrator. + """ + + def __init__( + self, + config: LizaConfig | None = None, + blackboard: Blackboard | None = None, + ): + """ + Initialize orchestrator. + + Args: + config: Liza configuration + blackboard: Blackboard instance (creates one if not provided) + """ + self.config = config or LizaConfig() + self.blackboard = blackboard or Blackboard(self.config.working_directory) + self._reviewer_runner: ReviewerRunner | None = None + self.log_entries: list[str] = [] + + def log(self, msg: str): + """Add to log and print.""" + timestamp = datetime.now().strftime("%H:%M:%S") + entry = f"[{timestamp}] {msg}" + self.log_entries.append(entry) + _log(msg) + + def set_reviewer_runner(self, runner: ReviewerRunner): + """ + Set the function that runs reviewer agents. 
+ + The runner should have signature: + async def run(agent: str, prompt: str, working_dir: str | None, timeout: int) -> str | None + + Where agent is "codex", "gemini", or "opencode". + This will be connected to owlex's agent execution. + """ + self._reviewer_runner = runner + + async def run_reviewer( + self, + agent: str, + prompt: str, + ) -> str | None: + """Run a reviewer agent with the given prompt.""" + if self._reviewer_runner is None: + raise RuntimeError("Reviewer runner not set. Call set_reviewer_runner first.") + + return await self._reviewer_runner( + agent, + prompt, + self.config.working_directory, + self.config.timeout_per_agent, + ) + + async def submit_for_review( + self, + task_id: str, + implementation_summary: str, + ) -> ReviewRoundResult: + """ + Submit Claude's implementation for review by owlex agents. + + Args: + task_id: Task being reviewed + implementation_summary: Summary/description of what Claude implemented + + Returns: + ReviewRoundResult with all verdicts and merged feedback + """ + task = self.blackboard.get_task(task_id) + if task is None: + raise ValueError(f"Task {task_id} not found") + + # Update blackboard + self.blackboard.submit_for_review(task_id, implementation_summary) + self.blackboard.transition_task(task_id, TaskStatus.IN_REVIEW) + + reviewers = task.reviewers or self.config.reviewers + self.log(f"Submitting to reviewers: {', '.join(reviewers)}") + + reviews: dict[str, ReviewVerdict] = {} + + if self.config.parallel_reviews: + # Run reviewers in parallel + review_tasks = [] + for reviewer in reviewers: + review_tasks.append( + self._run_single_reviewer(task, reviewer, implementation_summary) + ) + + review_results = await asyncio.gather(*review_tasks, return_exceptions=True) + + for reviewer, result in zip(reviewers, review_results): + if isinstance(result, Exception): + self.log(f"Reviewer {reviewer} failed: {result}") + reviews[reviewer] = ReviewVerdict( + status=VerdictStatus.BLOCKED, + feedback=f"Reviewer error: {result}", + issues=[], + raw_response="", + confidence=0, + ) + else: + reviews[reviewer] = result + else: + # Run reviewers sequentially + for reviewer in reviewers: + try: + verdict = await self._run_single_reviewer(task, reviewer, implementation_summary) + reviews[reviewer] = verdict + except Exception as e: + self.log(f"Reviewer {reviewer} failed: {e}") + reviews[reviewer] = ReviewVerdict( + status=VerdictStatus.BLOCKED, + feedback=f"Reviewer error: {e}", + issues=[], + raw_response="", + confidence=0, + ) + + # Record reviews in blackboard + for reviewer, verdict in reviews.items(): + self.blackboard.record_review( + task_id, + reviewer=reviewer, + verdict=verdict.status.value, + feedback=verdict.feedback, + ) + + # Determine overall verdict + if self.config.require_all_approve: + all_approved = all(v.approved for v in reviews.values()) + else: + # Majority vote + approve_count = sum(1 for v in reviews.values() if v.approved) + all_approved = approve_count > len(reviews) // 2 + + # Collect all issues + all_issues = [] + for verdict in reviews.values(): + all_issues.extend(verdict.issues) + + # Merge feedback from all reviewers (not just rejecting ones) + feedback_parts = [] + for reviewer, verdict in reviews.items(): + status_emoji = "✅" if verdict.approved else "❌" + feedback_parts.append( + f"### {reviewer.title()} {status_emoji} ({verdict.status.value})\n" + f"**Confidence:** {verdict.confidence:.0%}\n\n" + f"{verdict.feedback or '(no detailed feedback)'}" + ) + merged_feedback = "\n\n---\n\n".join(feedback_parts) + + # 
Log verdicts + for reviewer, verdict in reviews.items(): + self.log(f"{reviewer}: {verdict.status.value} (confidence: {verdict.confidence:.0%})") + + # Update blackboard with final verdict + task, _ = self.blackboard.finalize_reviews(task_id) + + result = ReviewRoundResult( + iteration=task.iteration, + reviews=reviews, + all_approved=all_approved, + merged_feedback=merged_feedback, + issues_found=all_issues, + ) + + if all_approved: + self.log(f"✅ All reviewers APPROVED") + else: + self.log(f"❌ Review REJECTED - {len(all_issues)} issues found") + + return result + + async def _run_single_reviewer( + self, + task: Task, + reviewer: str, + implementation_summary: str, + ) -> ReviewVerdict: + """Run a single reviewer and parse verdict.""" + self.log(f"Running reviewer: {reviewer}") + + reviewer_prompt = build_reviewer_prompt( + task_description=task.description, + implementation_summary=implementation_summary, + previous_feedback=task.merged_feedback, + iteration=task.iteration, + critique_mode=self.config.critique_mode, + ) + + response = await self.run_reviewer(reviewer, reviewer_prompt) + + if response is None: + return ReviewVerdict( + status=VerdictStatus.BLOCKED, + feedback="Reviewer failed to respond", + issues=[], + raw_response="", + confidence=0, + ) + + verdict = parse_verdict(response) + return verdict + + def prepare_for_iteration(self, task_id: str, merged_feedback: str) -> Task: + """ + Prepare task for next iteration after rejection. + + Updates blackboard state for Claude to iterate. + + Args: + task_id: Task to update + merged_feedback: Combined feedback from reviewers + + Returns: + Updated task + """ + task = self.blackboard.get_task(task_id) + if task is None: + raise ValueError(f"Task {task_id} not found") + + task.iteration += 1 + task.merged_feedback = merged_feedback + task.status = TaskStatus.WORKING + task.add_history("iteration_started", agent="claude", details={ + "iteration": task.iteration, + }) + + self.blackboard.update_task(task) + self.log(f"Prepared for iteration {task.iteration}") + + return task + + def mark_approved(self, task_id: str) -> Task: + """Mark task as approved and complete.""" + return self.blackboard.transition_task( + task_id, TaskStatus.APPROVED, + agent="liza", + details={"approved_by": "reviewers"}, + ) + + def mark_blocked(self, task_id: str, reason: str) -> Task: + """Mark task as blocked.""" + return self.blackboard.transition_task( + task_id, TaskStatus.BLOCKED, + details={"reason": reason}, + ) + + # === Convenience methods for MCP tool === + + def create_task( + self, + description: str, + reviewers: list[str] | None = None, + max_iterations: int | None = None, + done_when: str | None = None, + ) -> Task: + """ + Create a new task for Claude to implement. 
+ + Args: + description: What to implement + reviewers: Which agents review (default from config) + max_iterations: Max review cycles (default from config) + done_when: Completion criteria + + Returns: + Created task + """ + if not self.blackboard.exists(): + self.blackboard.initialize(goal=description[:100]) + + task = self.blackboard.add_task( + description=description, + coder="claude", # Claude is always the coder + reviewers=reviewers or self.config.reviewers, + max_iterations=max_iterations or self.config.max_iterations, + done_when=done_when, + status=TaskStatus.WORKING, # Claude starts working immediately + ) + + # Claude claims the task + task.add_history("claimed_by_claude") + self.blackboard.update_task(task) + + self.log(f"Created task {task.id}: {description[:50]}...") + return task + + def get_task_status(self, task_id: str) -> dict | None: + """Get current task status for MCP response.""" + task = self.blackboard.get_task(task_id) + if task is None: + return None + + return { + "task_id": task.id, + "description": task.description, + "status": task.status.value, + "iteration": task.iteration, + "max_iterations": task.max_iterations, + "reviewers": task.reviewers, + "merged_feedback": task.merged_feedback, + "review_count": len(task.reviews), + "created_at": task.created_at, + "updated_at": task.updated_at, + } + + def get_feedback_for_claude(self, task_id: str) -> str | None: + """ + Get formatted feedback for Claude to address. + + Returns None if no feedback (approved or not yet reviewed). + """ + task = self.blackboard.get_task(task_id) + if task is None or task.merged_feedback is None: + return None + + return f"""# Review Feedback (Iteration {task.iteration}) + +Your implementation was reviewed. Please address the following feedback: + +{task.merged_feedback} + +## Instructions +1. Read the feedback carefully +2. Fix the identified issues +3. Do NOT introduce unrelated changes +4. When done, call `liza_submit` with your implementation summary +""" diff --git a/owlex/liza/protocol.py b/owlex/liza/protocol.py new file mode 100644 index 0000000..017db53 --- /dev/null +++ b/owlex/liza/protocol.py @@ -0,0 +1,257 @@ +""" +Review Protocol - Structured verdict format for Liza reviews. + +Defines the protocol for reviewers to communicate verdicts and feedback. +Provides parsing utilities to extract structured data from reviewer responses. 
+""" + +import re +from dataclasses import dataclass +from enum import Enum + + +class VerdictStatus(str, Enum): + """Possible review verdicts.""" + APPROVE = "APPROVE" + REJECT = "REJECT" + BLOCKED = "BLOCKED" # Reviewer cannot proceed (needs clarification) + + +@dataclass +class ReviewVerdict: + """Parsed review verdict.""" + status: VerdictStatus + feedback: str | None + issues: list[str] # Individual issues extracted + raw_response: str + confidence: float # 0-1, how confident we are in the parse + + @property + def approved(self) -> bool: + return self.status == VerdictStatus.APPROVE + + +# Regex patterns for verdict extraction +VERDICT_PATTERNS = [ + # XML-style tags (preferred) + r"\s*(APPROVE|REJECT|BLOCKED)\s*", + # Markdown headers + r"##?\s*Verdict:?\s*(APPROVE|REJECT|BLOCKED)", + # Bold text + r"\*\*Verdict:?\*\*\s*(APPROVE|REJECT|BLOCKED)", + # Plain text patterns + r"(?:^|\n)Verdict:?\s*(APPROVE|REJECT|BLOCKED)", + r"(?:^|\n)(?:My |Final )?(?:verdict|decision):?\s*(APPROVE|REJECT|BLOCKED)", + # Implicit patterns (lower confidence) + r"I\s+(?:would\s+)?(?:recommend\s+)?(approve|reject)", + r"This\s+(?:implementation\s+)?(?:is\s+)?(approved|rejected)", +] + +FEEDBACK_PATTERNS = [ + # XML-style tags (preferred) + r"(.*?)", + r"(.*?)", + # Markdown sections + r"##?\s*Feedback:?\s*\n(.*?)(?=\n##|\n\*\*|$)", + r"##?\s*Issues:?\s*\n(.*?)(?=\n##|\n\*\*|$)", + # Bold headers + r"\*\*Feedback:?\*\*\s*\n?(.*?)(?=\n\*\*|$)", + r"\*\*Issues:?\*\*\s*\n?(.*?)(?=\n\*\*|$)", +] + +ISSUE_PATTERNS = [ + # Bulleted lists + r"[-*]\s+(.+?)(?=\n[-*]|\n\n|$)", + # Numbered lists + r"\d+[.)]\s+(.+?)(?=\n\d+[.)]|\n\n|$)", +] + + +def parse_verdict(response: str) -> ReviewVerdict: + """ + Parse a reviewer's response into a structured verdict. + + Attempts to extract: + - Verdict status (APPROVE/REJECT/BLOCKED) + - Feedback text + - Individual issues + + Args: + response: Raw response text from reviewer + + Returns: + ReviewVerdict with parsed data + """ + response_lower = response.lower() + status = None + confidence = 0.0 + + # Try verdict patterns in order of specificity + for i, pattern in enumerate(VERDICT_PATTERNS): + match = re.search(pattern, response, re.IGNORECASE | re.MULTILINE) + if match: + verdict_text = match.group(1).upper() + if verdict_text in ("APPROVE", "APPROVED"): + status = VerdictStatus.APPROVE + elif verdict_text in ("REJECT", "REJECTED"): + status = VerdictStatus.REJECT + elif verdict_text == "BLOCKED": + status = VerdictStatus.BLOCKED + + # Earlier patterns are more explicit = higher confidence + confidence = 1.0 - (i * 0.1) + confidence = max(confidence, 0.5) + break + + # Fallback: infer from content + if status is None: + # Check for approval signals + approve_signals = [ + "looks good", "lgtm", "ship it", "approved", + "no issues", "well done", "excellent", "ready to merge", + ] + reject_signals = [ + "needs work", "please fix", "issues found", "rejected", + "problems", "bugs", "vulnerabilities", "missing", + ] + + approve_count = sum(1 for s in approve_signals if s in response_lower) + reject_count = sum(1 for s in reject_signals if s in response_lower) + + if approve_count > reject_count and approve_count > 0: + status = VerdictStatus.APPROVE + confidence = 0.3 + (0.1 * approve_count) + elif reject_count > 0: + status = VerdictStatus.REJECT + confidence = 0.3 + (0.1 * reject_count) + else: + # Default to REJECT if unclear (safe default) + status = VerdictStatus.REJECT + confidence = 0.1 + + # Extract feedback + feedback = None + for pattern in FEEDBACK_PATTERNS: + match 
= re.search(pattern, response, re.IGNORECASE | re.DOTALL) + if match: + feedback = match.group(1).strip() + break + + # If no explicit feedback section, use the whole response minus verdict + if feedback is None and status == VerdictStatus.REJECT: + # Remove the verdict line and use remaining as feedback + feedback = response + for pattern in VERDICT_PATTERNS[:4]: # Only explicit patterns + feedback = re.sub(pattern, "", feedback, flags=re.IGNORECASE | re.MULTILINE) + feedback = feedback.strip() + if len(feedback) < 10: + feedback = None + + # Extract individual issues + issues = [] + if feedback: + for pattern in ISSUE_PATTERNS: + matches = re.findall(pattern, feedback, re.MULTILINE) + issues.extend(m.strip() for m in matches if m.strip()) + + # Deduplicate issues + issues = list(dict.fromkeys(issues)) + + return ReviewVerdict( + status=status, + feedback=feedback, + issues=issues, + raw_response=response, + confidence=confidence, + ) + + +def format_verdict_prompt(critique_mode: bool = True) -> str: + """ + Generate prompt instructions for reviewers on verdict format. + + Args: + critique_mode: If True, emphasize finding issues over being polite + + Returns: + Prompt text to include in reviewer instructions + """ + base = """ +## Review Verdict Format + +After examining the implementation, provide your verdict using this format: + +APPROVE or REJECT + + +Your detailed feedback here. If rejecting, list specific issues: +- Issue 1: Description +- Issue 2: Description + + +**Verdict Guidelines:** +- APPROVE: Implementation meets requirements, no blocking issues +- REJECT: Implementation has issues that must be fixed before approval +""" + + if critique_mode: + base += """ +**Critique Mode Active:** +- Your job is to find bugs, security issues, and architectural flaws +- Do NOT be polite at the expense of thoroughness +- Missing edge cases, error handling, or tests are grounds for rejection +- If something could break in production, REJECT +- Only APPROVE if you would stake your reputation on this code +""" + else: + base += """ +**Review Mode:** +- Focus on correctness and meeting requirements +- Minor style issues should not block approval +- Suggest improvements but only REJECT for functional issues +""" + + return base + + +def build_review_prompt( + task_description: str, + implementation_summary: str | None, + previous_feedback: str | None = None, + iteration: int = 0, + critique_mode: bool = True, +) -> str: + """ + Build the prompt for a reviewer agent. + + Args: + task_description: What was supposed to be implemented + implementation_summary: Summary of the implementation + previous_feedback: Feedback from previous iteration (if any) + iteration: Current iteration number + critique_mode: Whether to use critique mode + + Returns: + Complete prompt for the reviewer + """ + prompt = f"""# Code Review Request + +## Task +{task_description} + +## Implementation Summary +{implementation_summary or "(No summary provided - examine the code directly)"} +""" + + if iteration > 0 and previous_feedback: + prompt += f""" +## Previous Feedback (Iteration {iteration - 1}) +The coder was asked to address these issues: +{previous_feedback} + +Please verify whether these issues have been properly addressed. 
+""" + + prompt += format_verdict_prompt(critique_mode) + + return prompt diff --git a/owlex/models.py b/owlex/models.py index eab3f3c..6838983 100644 --- a/owlex/models.py +++ b/owlex/models.py @@ -36,6 +36,7 @@ class Agent(str, Enum): GEMINI = "gemini" OPENCODE = "opencode" CLAUDEOR = "claudeor" # Claude Code via OpenRouter + GROK = "grok" # Grok via xAI @dataclass @@ -103,6 +104,7 @@ class CouncilRound(BaseModel): gemini: AgentResponse | None = None opencode: AgentResponse | None = None claudeor: AgentResponse | None = None # Claude Code via OpenRouter + grok: AgentResponse | None = None # Grok via xAI class CouncilMetadata(BaseModel): diff --git a/owlex/prompts.py b/owlex/prompts.py index a6c0f80..173b0aa 100644 --- a/owlex/prompts.py +++ b/owlex/prompts.py @@ -42,6 +42,7 @@ def build_deliberation_prompt( gemini_answer: str | None = None, opencode_answer: str | None = None, claudeor_answer: str | None = None, + grok_answer: str | None = None, claude_answer: str | None = None, critique: bool = False, include_original: bool = False, @@ -61,6 +62,7 @@ def build_deliberation_prompt( gemini_answer: Gemini's round 1 answer (optional if excluded) opencode_answer: OpenCode's round 1 answer (optional if excluded) claudeor_answer: ClaudeOR's round 1 answer (optional if excluded) + grok_answer: Grok's round 1 answer (optional if excluded) claude_answer: Optional Claude opinion to include critique: If True, use critique mode prompts include_original: If True, include original_prompt in the output (for exec fallback) @@ -96,6 +98,9 @@ def build_deliberation_prompt( if claudeor_answer: parts.extend(["", "CLAUDE (OPENROUTER)'S ANSWER:", claudeor_answer]) + if grok_answer: + parts.extend(["", "GROK'S ANSWER:", grok_answer]) + parts.extend(["", instruction]) return "\n".join(parts) @@ -126,6 +131,7 @@ def build_deliberation_prompt_with_role( gemini_answer: str | None = None, opencode_answer: str | None = None, claudeor_answer: str | None = None, + grok_answer: str | None = None, claude_answer: str | None = None, critique: bool = False, include_original: bool = False, @@ -143,6 +149,7 @@ def build_deliberation_prompt_with_role( gemini_answer: Gemini's round 1 answer opencode_answer: OpenCode's round 1 answer claudeor_answer: ClaudeOR's round 1 answer + grok_answer: Grok's round 1 answer claude_answer: Optional Claude opinion critique: If True, use critique mode prompts include_original: If True, include original_prompt (for exec fallback) @@ -157,6 +164,7 @@ def build_deliberation_prompt_with_role( gemini_answer=gemini_answer, opencode_answer=opencode_answer, claudeor_answer=claudeor_answer, + grok_answer=grok_answer, claude_answer=claude_answer, critique=critique, include_original=include_original, diff --git a/owlex/roles.py b/owlex/roles.py index 4b43ef7..d9b814b 100644 --- a/owlex/roles.py +++ b/owlex/roles.py @@ -271,6 +271,7 @@ def from_dict(cls, data: dict) -> TeamPreset: "gemini": RoleId.SKEPTIC.value, "opencode": RoleId.ARCHITECT.value, "claudeor": RoleId.DX.value, + "grok": RoleId.TESTING.value, }, ), @@ -283,6 +284,7 @@ def from_dict(cls, data: dict) -> TeamPreset: "gemini": RoleId.PERFORMANCE.value, "opencode": RoleId.TESTING.value, "claudeor": RoleId.DX.value, + "grok": RoleId.SKEPTIC.value, }, ), @@ -295,6 +297,7 @@ def from_dict(cls, data: dict) -> TeamPreset: "gemini": RoleId.PERFORMANCE.value, "opencode": RoleId.MAINTAINER.value, "claudeor": RoleId.DX.value, + "grok": RoleId.SKEPTIC.value, }, ), @@ -307,6 +310,7 @@ def from_dict(cls, data: dict) -> TeamPreset: "gemini": 
RoleId.SKEPTIC.value, "opencode": RoleId.SKEPTIC.value, "claudeor": RoleId.SKEPTIC.value, + "grok": RoleId.SKEPTIC.value, }, ), @@ -319,6 +323,7 @@ def from_dict(cls, data: dict) -> TeamPreset: "gemini": RoleId.PERFORMANCE.value, "opencode": RoleId.MAINTAINER.value, "claudeor": RoleId.DX.value, + "grok": RoleId.SKEPTIC.value, }, ), @@ -330,7 +335,8 @@ def from_dict(cls, data: dict) -> TeamPreset: "codex": RoleId.MAINTAINER.value, # Deep reasoning for surgical changes "gemini": RoleId.ARCHITECT.value, # Large context for system-wide view "opencode": RoleId.DX.value, # Best tone/steerability for docs - "claudeor": RoleId.SKEPTIC.value, # Fast, unconstrained critic + "claudeor": RoleId.PERFORMANCE.value, # Fast general-purpose + "grok": RoleId.SKEPTIC.value, # Deliberate contrarian, less aligned }, ), } @@ -525,7 +531,7 @@ def _resolve_explicit_mapping( ) -> dict[str, RoleDefinition]: """Resolve explicit agent->role mapping.""" # Validate for unknown agent keys (typos like "codexx") - known_agents = {"codex", "gemini", "opencode", "claudeor"} + known_agents = {"codex", "gemini", "opencode", "claudeor", "grok"} unknown_keys = set(mapping.keys()) - known_agents if unknown_keys: raise ValueError(f"Unknown agent(s) in role mapping: {', '.join(sorted(unknown_keys))}") diff --git a/owlex/server.py b/owlex/server.py index 78b238b..0f593ef 100644 --- a/owlex/server.py +++ b/owlex/server.py @@ -15,7 +15,7 @@ from mcp.server.session import ServerSession from .models import TaskResponse, ErrorCode, Agent -from .engine import engine, DEFAULT_TIMEOUT, codex_runner, gemini_runner, opencode_runner, claudeor_runner +from .engine import engine, DEFAULT_TIMEOUT, codex_runner, gemini_runner, opencode_runner, claudeor_runner, grok_runner from .council import Council from .config import config from .roles import get_resolver @@ -68,6 +68,11 @@ def _get_opencode_model() -> str: return model if model else "openrouter/anthropic/claude-sonnet-4" +def _get_grok_model() -> str: + """Get Grok model from config.""" + return config.grok.model + + @mcp.resource("owlex://agents") async def get_agents() -> str: """List available agents and their configuration.""" @@ -109,6 +114,16 @@ async def get_agents() -> str: "agent_mode": config.opencode.agent, } }, + "grok": { + "available": "grok" not in excluded, + "cli_version": opencode_ver, # Uses OpenCode CLI + "model": _get_grok_model(), + "code_model": config.grok.code_model, + "description": "Deliberate contrarian via xAI/Grok, less aligned perspective", + "config": { + "agent_mode": config.grok.agent, + } + }, } return json.dumps({ @@ -508,6 +523,98 @@ async def resume_claudeor_session( ).model_dump() +# === Grok Session Tools === + +@mcp.tool() +async def start_grok_session( + ctx: Context[ServerSession, None], + prompt: str = Field(description="The question or request to send"), + working_directory: str | None = Field(default=None, description="Working directory for Grok context"), + for_coding: bool = Field(default=False, description="Use coding model (grok-code-fast-1) instead of reasoning model"), +) -> dict: + """ + Start a new Grok session via OpenCode with xAI backend. + + Uses OpenCode CLI with xAI/Grok models. Requires XAI_API_KEY environment variable. 
+ + Two model options: + - for_coding=False (default): Uses GROK_MODEL (default: xai/grok-4-1-fast-reasoning) for reasoning/deliberation + - for_coding=True: Uses GROK_CODE_MODEL (default: xai/grok-code-fast-1) for coding tasks + """ + if not prompt or not prompt.strip(): + return TaskResponse(success=False, error="'prompt' parameter is required.", error_code=ErrorCode.INVALID_ARGS).model_dump() + + working_directory, error = _validate_working_directory(working_directory) + if error: + return TaskResponse(success=False, error=error, error_code=ErrorCode.INVALID_ARGS).model_dump() + + task = engine.create_task( + command=f"{Agent.GROK.value}_exec", + args={"prompt": prompt.strip(), "working_directory": working_directory, "for_coding": for_coding}, + context=ctx, + ) + + task.async_task = asyncio.create_task(engine.run_agent( + task, grok_runner, mode="exec", + prompt=prompt.strip(), working_directory=working_directory, for_coding=for_coding + )) + + model = config.grok.code_model if for_coding else config.grok.model + return TaskResponse( + success=True, + task_id=task.task_id, + status=task.status, + message=f"Grok session started ({model}). Use wait_for_task to get result.", + ).model_dump() + + +@mcp.tool() +async def resume_grok_session( + ctx: Context[ServerSession, None], + prompt: str = Field(description="The question or request to send to the resumed session"), + session_id: str | None = Field(default=None, description="Session ID to resume (uses --continue if not provided)"), + working_directory: str | None = Field(default=None, description="Working directory for Grok context"), + for_coding: bool = Field(default=False, description="Use coding model (grok-code-fast-1) instead of reasoning model"), +) -> dict: + """Resume an existing Grok session with full conversation history.""" + if not prompt or not prompt.strip(): + return TaskResponse(success=False, error="'prompt' parameter is required.", error_code=ErrorCode.INVALID_ARGS).model_dump() + + working_directory, error = _validate_working_directory(working_directory) + if error: + return TaskResponse(success=False, error=error, error_code=ErrorCode.INVALID_ARGS).model_dump() + + use_continue = not session_id or not session_id.strip() + session_ref = "--continue" if use_continue else session_id.strip() + + # Validate session ID if provided + if not use_continue and not grok_runner.validate_session_id(session_ref): + return TaskResponse( + success=False, + error=f"Invalid session_id: '{session_id}' - contains disallowed characters", + error_code=ErrorCode.INVALID_ARGS + ).model_dump() + + task = engine.create_task( + command=f"{Agent.GROK.value}_resume", + args={"session_id": session_ref, "prompt": prompt.strip(), "working_directory": working_directory, "for_coding": for_coding}, + context=ctx, + ) + + task.async_task = asyncio.create_task(engine.run_agent( + task, grok_runner, mode="resume", + prompt=prompt.strip(), session_ref=session_ref, working_directory=working_directory, for_coding=for_coding + )) + + model = config.grok.code_model if for_coding else config.grok.model + return TaskResponse( + success=True, + task_id=task.task_id, + status=task.status, + message=f"Grok resume started ({model}){' (continuing last session)' if use_continue else f' for session {session_id}'}. 
Use wait_for_task to get result.", + ).model_dump() + + # === Task Management Tools === @mcp.tool() @@ -935,6 +1042,340 @@ async def council_ask( return response +# === Liza Tools (Peer-Supervised Coding) === + +# Module-level Liza orchestrators (per working directory) +_liza_orchestrators: dict[str, "LizaOrchestrator"] = {} + + +def _get_liza_orchestrator(working_directory: str | None = None) -> "LizaOrchestrator": + """Get or create a Liza orchestrator for the working directory.""" + from .liza import LizaOrchestrator, LizaConfig + + wd = working_directory or os.getcwd() + if wd not in _liza_orchestrators: + config_obj = LizaConfig( + working_directory=wd, + reviewers=["codex", "gemini"], # Default reviewers + ) + orchestrator = LizaOrchestrator(config=config_obj) + + # Set up the reviewer runner to use owlex agents + async def run_reviewer(agent: str, prompt: str, working_dir: str | None, timeout: int) -> str | None: + """Run a reviewer agent via owlex.""" + runner_map = { + "codex": codex_runner, + "gemini": gemini_runner, + "opencode": opencode_runner, + "grok": grok_runner, + } + runner = runner_map.get(agent) + if runner is None: + return None + + task = engine.create_task( + command=f"liza_review_{agent}", + args={"prompt": prompt, "working_directory": working_dir}, + context=None, + ) + + await engine.run_agent( + task, runner, mode="exec", + prompt=prompt, working_directory=working_dir, + **({"enable_search": False} if agent == "codex" else {}) + ) + + if task.status == "completed": + return task.result + return None + + orchestrator.set_reviewer_runner(run_reviewer) + _liza_orchestrators[wd] = orchestrator + + return _liza_orchestrators[wd] + + +@mcp.tool() +async def liza_start( + ctx: Context[ServerSession, None], + task_description: str = Field(description="Description of what Claude should implement"), + reviewers: list[str] | None = Field(default=None, description="Reviewer agents (default: ['codex', 'gemini'])"), + max_iterations: int = Field(default=5, description="Maximum coder-reviewer cycles"), + done_when: str | None = Field(default=None, description="Optional completion criteria"), + working_directory: str | None = Field(default=None, description="Working directory for the task"), +) -> dict: + """ + Start a Liza peer-reviewed coding task. + + Creates a task for Claude (the coder) to implement. After Claude implements, + use liza_submit to send the implementation for review by Codex/Gemini. + + Architecture: + - Claude Code = Coder (trusted, actually writes code) + - Codex/Gemini/OpenCode/Grok = Reviewers (examine and provide feedback) + + Flow: + 1. liza_start → Creates task, returns task_id + 2. Claude implements the task (using Write/Edit/Bash tools) + 3. liza_submit → Sends to reviewers, returns verdicts + 4. If REJECT: Claude fixes based on feedback, goto step 3 + 5. If APPROVE: Done! + """ + if not task_description or not task_description.strip(): + return {"success": False, "error": "'task_description' is required."} + + working_directory, error = _validate_working_directory(working_directory) + if error: + return {"success": False, "error": error} + + try: + orchestrator = _get_liza_orchestrator(working_directory) + + # Update reviewers if specified + if reviewers: + orchestrator.config.reviewers = reviewers + + task = orchestrator.create_task( + description=task_description.strip(), + reviewers=reviewers, + max_iterations=max_iterations, + done_when=done_when, + ) + + return { + "success": True, + "task_id": task.id, + "message": f"Task created. 
Claude should now implement: {task_description[:100]}...", + "instructions": ( + "1. Implement the task using Write/Edit/Bash tools\n" + "2. When done, call liza_submit with your implementation summary\n" + "3. Reviewers will examine and provide feedback\n" + "4. If rejected, fix issues and submit again" + ), + "task": { + "id": task.id, + "description": task.description, + "reviewers": task.reviewers, + "max_iterations": task.max_iterations, + "done_when": task.done_when, + }, + } + except Exception as e: + return {"success": False, "error": str(e)} + + +@mcp.tool() +async def liza_submit( + ctx: Context[ServerSession, None], + task_id: str = Field(description="Task ID from liza_start"), + implementation_summary: str = Field(description="Summary of what was implemented (for reviewers)"), + working_directory: str | None = Field(default=None, description="Working directory"), +) -> dict: + """ + Submit Claude's implementation for review by Codex/Gemini. + + After Claude implements the task, call this to send it for review. + Reviewers will examine the code and return APPROVE or REJECT with feedback. + + If REJECT: Claude should fix the issues and call liza_submit again. + If APPROVE: The task is complete! + """ + if not task_id or not task_id.strip(): + return {"success": False, "error": "'task_id' is required."} + if not implementation_summary or not implementation_summary.strip(): + return {"success": False, "error": "'implementation_summary' is required."} + + working_directory, error = _validate_working_directory(working_directory) + if error: + return {"success": False, "error": error} + + try: + orchestrator = _get_liza_orchestrator(working_directory) + + # Check task exists and is in right state + task = orchestrator.blackboard.get_task(task_id.strip()) + if task is None: + return {"success": False, "error": f"Task '{task_id}' not found."} + + # Check iteration limit + if task.iteration >= task.max_iterations: + return { + "success": False, + "error": f"Max iterations ({task.max_iterations}) reached. Task blocked.", + "task_status": orchestrator.get_task_status(task_id), + } + + # Submit for review + result = await orchestrator.submit_for_review( + task_id=task_id.strip(), + implementation_summary=implementation_summary.strip(), + ) + + if result.all_approved: + orchestrator.mark_approved(task_id) + return { + "success": True, + "approved": True, + "message": "All reviewers APPROVED! Task complete.", + "review_summary": { + "iteration": result.iteration, + "reviews": { + reviewer: { + "status": v.status.value, + "confidence": v.confidence, + } + for reviewer, v in result.reviews.items() + }, + }, + } + else: + # Prepare for next iteration + orchestrator.prepare_for_iteration(task_id, result.merged_feedback) + task = orchestrator.blackboard.get_task(task_id) + + return { + "success": True, + "approved": False, + "message": "Review REJECTED. Please address the feedback and submit again.", + "iteration": task.iteration, + "remaining_iterations": task.max_iterations - task.iteration, + "feedback": result.merged_feedback, + "issues_found": result.issues_found, + "review_summary": { + "reviews": { + reviewer: { + "status": v.status.value, + "confidence": v.confidence, + "feedback_preview": (v.feedback[:200] + "...") if v.feedback and len(v.feedback) > 200 else v.feedback, + } + for reviewer, v in result.reviews.items() + }, + }, + "instructions": ( + "1. Read the feedback above carefully\n" + "2. Fix the identified issues\n" + "3. Do NOT introduce unrelated changes\n" + "4. 
Call liza_submit again with updated summary" + ), + } + except Exception as e: + return {"success": False, "error": str(e)} + + +@mcp.tool() +async def liza_status( + task_id: str = Field(description="Task ID to check"), + working_directory: str | None = Field(default=None, description="Working directory"), +) -> dict: + """ + Get the current status of a Liza task. + + Returns task details including status, iteration count, reviewers, and any feedback. + """ + if not task_id or not task_id.strip(): + return {"success": False, "error": "'task_id' is required."} + + working_directory, error = _validate_working_directory(working_directory) + if error: + return {"success": False, "error": error} + + try: + orchestrator = _get_liza_orchestrator(working_directory) + status = orchestrator.get_task_status(task_id.strip()) + + if status is None: + return {"success": False, "error": f"Task '{task_id}' not found."} + + return { + "success": True, + "task": status, + } + except Exception as e: + return {"success": False, "error": str(e)} + + +@mcp.tool() +async def liza_feedback( + task_id: str = Field(description="Task ID to get feedback for"), + working_directory: str | None = Field(default=None, description="Working directory"), +) -> dict: + """ + Get the latest reviewer feedback for a Liza task. + + Returns formatted feedback that Claude should address before resubmitting. + """ + if not task_id or not task_id.strip(): + return {"success": False, "error": "'task_id' is required."} + + working_directory, error = _validate_working_directory(working_directory) + if error: + return {"success": False, "error": error} + + try: + orchestrator = _get_liza_orchestrator(working_directory) + feedback = orchestrator.get_feedback_for_claude(task_id.strip()) + + if feedback is None: + return { + "success": True, + "has_feedback": False, + "message": "No feedback available (task may be approved or not yet reviewed).", + } + + return { + "success": True, + "has_feedback": True, + "feedback": feedback, + } + except Exception as e: + return {"success": False, "error": str(e)} + + +@mcp.resource("owlex://liza/blackboard") +def liza_blackboard_resource() -> str: + """View the current Liza blackboard state.""" + try: + orchestrator = _get_liza_orchestrator() + if not orchestrator.blackboard.exists(): + return "No Liza blackboard found. Use liza_start to create a task." 
+ + state = orchestrator.blackboard.read() + lines = [ + "# Liza Blackboard", + f"**Goal:** {state.goal}", + f"**Created:** {state.created_at}", + f"**Updated:** {state.updated_at}", + "", + "## Tasks", + ] + + for task in state.tasks: + status_emoji = { + "APPROVED": "✅", + "REJECTED": "❌", + "WORKING": "🔨", + "IN_REVIEW": "👀", + "BLOCKED": "🚫", + }.get(task.status.value, "📋") + + lines.append(f"### {task.id} {status_emoji} {task.status.value}") + lines.append(f"**Description:** {task.description[:100]}...") + lines.append(f"**Coder:** {task.coder} | **Reviewers:** {', '.join(task.reviewers)}") + lines.append(f"**Iteration:** {task.iteration}/{task.max_iterations}") + if task.merged_feedback: + lines.append(f"**Has Feedback:** Yes") + lines.append("") + + if state.log: + lines.append("## Recent Log") + for entry in state.log[-10:]: + lines.append(f"- {entry}") + + return "\n".join(lines) + except Exception as e: + return f"Error reading blackboard: {e}" + + def main(): """Entry point for owlex-server command.""" import argparse diff --git a/pyproject.toml b/pyproject.toml index 635465b..9040139 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta" [project] name = "owlex" -version = "0.1.9" +version = "0.2.0" description = "MCP server for Codex CLI and Gemini CLI integration with Claude Code" license = "MIT" readme = "README.md" @@ -21,6 +21,7 @@ classifiers = [ dependencies = [ "mcp>=1.0.0", "pydantic>=2.0.0", + "pyyaml>=6.0.0", ] [project.scripts] diff --git a/skills/liza/liza.md b/skills/liza/liza.md new file mode 100644 index 0000000..0d01a31 --- /dev/null +++ b/skills/liza/liza.md @@ -0,0 +1,163 @@ +# Liza Skill - Peer-Supervised Coding + + +Liza is a peer-supervised coding system where Claude (coder) implements tasks and external agents (Codex, Gemini) review with binding verdicts. Based on https://github.com/liza-mas/liza. + +Use this skill when: +- User wants rigorous code review before completion +- User says "liza", "peer review", "external review", "adversarial review" +- User wants multiple AI perspectives on implementation quality +- User wants to ensure code is production-ready + + +## Architecture + +``` +┌─────────────────────────────────────────────────────────────┐ +│ Claude Code │ +│ Orchestrator + CODER (trusted implementer) │ +└─────────────────────────────────────────────────────────────┘ + │ + [implementation] + │ + ┌───────────────────┼───────────────────┐ + ▼ ▼ ▼ + ┌───────────┐ ┌──────────┐ ┌──────────┐ + │ Codex │ │ Gemini │ │ OpenCode │ + │ (Reviewer)│ │(Reviewer)│ │(Reviewer)│ + └───────────┘ └──────────┘ └──────────┘ +``` + +## Workflow + +### Step 1: Start Task + +``` +Use mcp__owlex__liza_start to create the task: + +Parameters: +- task_description: What to implement +- reviewers: ["codex", "gemini"] (default) +- max_iterations: 5 (default) +- done_when: Optional completion criteria +``` + +### Step 2: Implement + +Claude implements the task using standard tools (Write, Edit, Bash). 
+Follow the Coder Contract: +- No fabrication - don't claim to have done something you haven't +- No test corruption - never modify tests to make failing code pass +- No scope creep - only implement what was requested +- Be honest about assumptions and limitations + +### Step 3: Submit for Review + +``` +Use mcp__owlex__liza_submit: + +Parameters: +- task_id: From liza_start +- implementation_summary: Description of what you implemented +``` + +Reviewers will examine and return: +- **APPROVE**: Code meets requirements, no blocking issues +- **REJECT**: Issues found that must be fixed + specific feedback + +### Step 4: Handle Verdict + +**If ALL APPROVE:** +- Task complete! Celebrate. + +**If ANY REJECT:** +1. Read the merged feedback carefully +2. Fix the identified issues +3. Do NOT introduce unrelated changes +4. Call liza_submit again with updated summary +5. Repeat until approved or max iterations + +## Key Principles + +### External Validation +Claude cannot self-approve. Reviewers provide binding verdicts. +This prevents: +- Sycophancy +- Phantom fixes +- Hallucinated completions +- Self-certification bias + +### Critique Mode +Reviewers actively look for: +- Security vulnerabilities +- Race conditions +- Resource leaks +- Unhandled edge cases +- Missing validation +- Incorrect error handling + +### Multi-Reviewer Benefit +Different reviewers catch different issues: +- Codex: Deep code analysis, bug finding +- Gemini: Architecture, patterns, large-scale issues +- OpenCode: Alternative perspective, edge cases + +## Example Session + +``` +User: /liza Add rate limiting to the login endpoint + +Claude: I'll create a Liza task for this. + +[Calls liza_start with task_description] +→ Returns task_id: "task-1" + +Claude: Now implementing rate limiting... + +[Uses Write/Edit to add rate limiting code] + +Claude: Implementation complete. Submitting for review. + +[Calls liza_submit with implementation_summary] +→ Codex: REJECT - "Missing IP-based limiting, only user-based" +→ Gemini: REJECT - "No Redis backend for distributed systems" + +Claude: I see the issues. Let me fix them. + +[Fixes the implementation] + +[Calls liza_submit again] +→ Codex: APPROVE +→ Gemini: APPROVE + +Claude: All reviewers approved! Rate limiting is now implemented with: +- User-based rate limiting +- IP-based rate limiting +- Redis backend for distributed systems +``` + +## MCP Tools Reference + +| Tool | Description | +|------|-------------| +| `liza_start` | Create a task for Claude to implement | +| `liza_submit` | Submit implementation for review | +| `liza_status` | Get current task status | +| `liza_feedback` | Get formatted feedback to address | + +## Blackboard + +State persisted in `.owlex/liza-state.yaml`: +- Task lifecycle tracking +- Review history +- Iteration count +- Merged feedback + +View with: `owlex://liza/blackboard` resource + +## Configuration + +Default reviewers: `["codex", "gemini"]` +Default max iterations: 5 +Critique mode: enabled (reviewers actively find issues) +Require all approve: true (all reviewers must approve) diff --git a/tests/test_liza.py b/tests/test_liza.py new file mode 100644 index 0000000..eb5dd17 --- /dev/null +++ b/tests/test_liza.py @@ -0,0 +1,374 @@ +""" +Tests for the Liza peer-supervised coding module. 
+""" + +import asyncio +import os +import tempfile +from pathlib import Path + +import pytest + +from owlex.liza import ( + Blackboard, + Task, + TaskStatus, + LizaOrchestrator, + LizaConfig, + parse_verdict, + VerdictStatus, + ReviewVerdict, +) + + +class TestVerdictParsing: + """Tests for verdict parsing from reviewer responses.""" + + def test_parse_approve_xml(self): + """Parse APPROVE verdict with XML tags.""" + response = """ + I've reviewed the implementation. + + APPROVE + + + Code looks good. Well structured and follows best practices. + + """ + verdict = parse_verdict(response) + assert verdict.status == VerdictStatus.APPROVE + assert verdict.approved + assert verdict.confidence >= 0.9 + + def test_parse_reject_xml(self): + """Parse REJECT verdict with XML tags.""" + response = """ + REJECT + + + Found several issues: + - Missing input validation + - No error handling for edge cases + - Tests don't cover failure scenarios + + """ + verdict = parse_verdict(response) + assert verdict.status == VerdictStatus.REJECT + assert not verdict.approved + assert "Missing input validation" in verdict.feedback + assert len(verdict.issues) >= 2 + + def test_parse_approve_markdown(self): + """Parse APPROVE verdict with markdown format.""" + response = """ + ## Verdict: APPROVE + + The implementation meets requirements. + """ + verdict = parse_verdict(response) + assert verdict.status == VerdictStatus.APPROVE + + def test_parse_reject_implicit(self): + """Parse REJECT verdict from implicit signals.""" + response = """ + There are several problems with this implementation: + - Missing validation + - Bugs in error handling + Please fix these issues before approval. + """ + verdict = parse_verdict(response) + assert verdict.status == VerdictStatus.REJECT + assert verdict.confidence < 1.0 # Lower confidence than explicit XML + + def test_parse_approve_implicit(self): + """Parse APPROVE verdict from implicit signals.""" + response = """ + Looks good to me! LGTM. + No issues found, ready to merge. 
+ """ + verdict = parse_verdict(response) + assert verdict.status == VerdictStatus.APPROVE + + def test_extract_issues(self): + """Extract individual issues from feedback.""" + response = """ + REJECT + + Issues found: + - SQL injection vulnerability in login handler + - Missing rate limiting + - No input sanitization + + """ + verdict = parse_verdict(response) + assert len(verdict.issues) == 3 + assert any("SQL injection" in issue for issue in verdict.issues) + + +class TestBlackboard: + """Tests for blackboard state management.""" + + @pytest.fixture + def temp_dir(self): + """Create a temporary directory for blackboard.""" + with tempfile.TemporaryDirectory() as tmpdir: + yield tmpdir + + def test_initialize_blackboard(self, temp_dir): + """Initialize a new blackboard.""" + bb = Blackboard(working_directory=temp_dir) + state = bb.initialize(goal="Test goal") + + assert state.goal == "Test goal" + assert bb.exists() + assert (Path(temp_dir) / ".owlex" / "liza-state.yaml").exists() + + def test_add_task(self, temp_dir): + """Add a task to the blackboard.""" + bb = Blackboard(working_directory=temp_dir) + bb.initialize(goal="Test goal") + + task = bb.add_task( + description="Implement login", + coder="claude", + reviewers=["codex", "gemini"], + ) + + assert task.id == "task-1" + assert task.description == "Implement login" + assert task.coder == "claude" + assert task.reviewers == ["codex", "gemini"] + assert task.status == TaskStatus.UNCLAIMED + + def test_claim_task(self, temp_dir): + """Claim a task for implementation.""" + bb = Blackboard(working_directory=temp_dir) + bb.initialize(goal="Test goal") + bb.add_task(description="Test task", status=TaskStatus.UNCLAIMED) + + task = bb.claim_task("task-1", "claude") + + assert task.status == TaskStatus.CLAIMED + assert task.coder == "claude" + assert len(task.history) >= 2 # created + claimed + + def test_submit_for_review(self, temp_dir): + """Submit task for review.""" + bb = Blackboard(working_directory=temp_dir) + bb.initialize(goal="Test goal") + bb.add_task(description="Test task", coder="claude", status=TaskStatus.WORKING) + + task = bb.submit_for_review("task-1", "Implementation summary") + + assert task.status == TaskStatus.READY_FOR_REVIEW + assert task.last_implementation == "Implementation summary" + + def test_record_review(self, temp_dir): + """Record a review verdict.""" + bb = Blackboard(working_directory=temp_dir) + bb.initialize(goal="Test goal") + bb.add_task(description="Test task", reviewers=["codex"], status=TaskStatus.IN_REVIEW) + + task = bb.record_review("task-1", "codex", "APPROVE", "Looks good") + + assert len(task.reviews) == 1 + assert task.reviews[0].reviewer == "codex" + assert task.reviews[0].verdict == "APPROVE" + + def test_finalize_reviews_approved(self, temp_dir): + """Finalize reviews when all approve.""" + bb = Blackboard(working_directory=temp_dir) + bb.initialize(goal="Test goal") + bb.add_task(description="Test task", reviewers=["codex", "gemini"], status=TaskStatus.IN_REVIEW) + + bb.record_review("task-1", "codex", "APPROVE", "Good") + bb.record_review("task-1", "gemini", "APPROVE", "LGTM") + + task, approved = bb.finalize_reviews("task-1") + + assert approved + assert task.status == TaskStatus.APPROVED + + def test_finalize_reviews_rejected(self, temp_dir): + """Finalize reviews when any reject.""" + bb = Blackboard(working_directory=temp_dir) + bb.initialize(goal="Test goal") + bb.add_task(description="Test task", reviewers=["codex", "gemini"], status=TaskStatus.IN_REVIEW) + + 
bb.record_review("task-1", "codex", "APPROVE", "Good") + bb.record_review("task-1", "gemini", "REJECT", "Missing tests") + + task, approved = bb.finalize_reviews("task-1") + + assert not approved + assert task.status == TaskStatus.REJECTED + assert "gemini" in task.merged_feedback + assert "Missing tests" in task.merged_feedback + + +class TestLizaOrchestrator: + """Tests for the Liza orchestrator.""" + + @pytest.fixture + def temp_dir(self): + """Create a temporary directory.""" + with tempfile.TemporaryDirectory() as tmpdir: + yield tmpdir + + def test_create_task(self, temp_dir): + """Create a task via orchestrator.""" + config = LizaConfig(working_directory=temp_dir) + orchestrator = LizaOrchestrator(config=config) + + task = orchestrator.create_task( + description="Implement feature X", + reviewers=["codex", "gemini"], + ) + + assert task.id == "task-1" + assert task.coder == "claude" + assert task.status == TaskStatus.WORKING + + def test_get_task_status(self, temp_dir): + """Get task status.""" + config = LizaConfig(working_directory=temp_dir) + orchestrator = LizaOrchestrator(config=config) + orchestrator.create_task(description="Test task") + + status = orchestrator.get_task_status("task-1") + + assert status is not None + assert status["task_id"] == "task-1" + assert status["status"] == "WORKING" + + @pytest.mark.asyncio + async def test_submit_for_review_mock(self, temp_dir): + """Test submitting for review with mock reviewers.""" + config = LizaConfig( + working_directory=temp_dir, + reviewers=["mock_reviewer"], + ) + orchestrator = LizaOrchestrator(config=config) + + # Mock reviewer that always approves + async def mock_runner(agent: str, prompt: str, wd: str | None, timeout: int) -> str: + return "APPROVELooks good!" + + orchestrator.set_reviewer_runner(mock_runner) + + task = orchestrator.create_task(description="Test task", reviewers=["mock_reviewer"]) + result = await orchestrator.submit_for_review(task.id, "Implementation done") + + assert result.all_approved + assert "mock_reviewer" in result.reviews + assert result.reviews["mock_reviewer"].approved + + @pytest.mark.asyncio + async def test_submit_for_review_rejected(self, temp_dir): + """Test submitting for review with rejection.""" + config = LizaConfig( + working_directory=temp_dir, + reviewers=["reviewer1", "reviewer2"], + ) + orchestrator = LizaOrchestrator(config=config) + + # Mock: one approves, one rejects + async def mock_runner(agent: str, prompt: str, wd: str | None, timeout: int) -> str: + if agent == "reviewer1": + return "APPROVE" + else: + return "REJECTMissing validation" + + orchestrator.set_reviewer_runner(mock_runner) + + task = orchestrator.create_task(description="Test task", reviewers=["reviewer1", "reviewer2"]) + result = await orchestrator.submit_for_review(task.id, "Implementation done") + + assert not result.all_approved + assert result.reviews["reviewer1"].approved + assert not result.reviews["reviewer2"].approved + assert "Missing validation" in result.merged_feedback + + +class TestIntegration: + """Integration tests for the full Liza workflow.""" + + @pytest.fixture + def temp_dir(self): + """Create a temporary directory.""" + with tempfile.TemporaryDirectory() as tmpdir: + yield tmpdir + + @pytest.mark.asyncio + async def test_full_workflow_approved_first_try(self, temp_dir): + """Test full workflow: create → implement → review → approved.""" + config = LizaConfig( + working_directory=temp_dir, + reviewers=["codex", "gemini"], + ) + orchestrator = LizaOrchestrator(config=config) + + # All 
reviewers approve + async def mock_runner(agent: str, prompt: str, wd: str | None, timeout: int) -> str: + return f"APPROVE{agent} approves this implementation." + + orchestrator.set_reviewer_runner(mock_runner) + + # Step 1: Create task + task = orchestrator.create_task( + description="Add login endpoint", + reviewers=["codex", "gemini"], + ) + assert task.status == TaskStatus.WORKING + + # Step 2: Submit for review + result = await orchestrator.submit_for_review( + task.id, + "Implemented login endpoint with validation" + ) + + # Step 3: Check result + assert result.all_approved + + # Step 4: Verify final state + final_task = orchestrator.blackboard.get_task(task.id) + assert final_task.status == TaskStatus.APPROVED + + @pytest.mark.asyncio + async def test_full_workflow_with_iteration(self, temp_dir): + """Test full workflow with one rejection then approval.""" + config = LizaConfig( + working_directory=temp_dir, + reviewers=["codex"], + ) + orchestrator = LizaOrchestrator(config=config) + + call_count = 0 + + # First call rejects, second call approves + async def mock_runner(agent: str, prompt: str, wd: str | None, timeout: int) -> str: + nonlocal call_count + call_count += 1 + if call_count == 1: + return "REJECTMissing error handling" + else: + return "APPROVEError handling looks good now." + + orchestrator.set_reviewer_runner(mock_runner) + + # Create and submit + task = orchestrator.create_task(description="Add feature", reviewers=["codex"]) + + # First submission - rejected + result1 = await orchestrator.submit_for_review(task.id, "First implementation") + assert not result1.all_approved + assert "Missing error handling" in result1.merged_feedback + + # Prepare for iteration + orchestrator.prepare_for_iteration(task.id, result1.merged_feedback) + task = orchestrator.blackboard.get_task(task.id) + assert task.iteration == 1 + + # Second submission - approved + result2 = await orchestrator.submit_for_review(task.id, "Fixed error handling") + assert result2.all_approved
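For reference, a minimal round-trip of the verdict protocol defined in `owlex/liza/protocol.py`. This is a sketch that assumes the package from this branch is installed; the reviewer reply is a hypothetical example, not output from a real agent:

```python
from owlex.liza import parse_verdict, VerdictStatus

# A well-formed reviewer reply uses the XML-style tags that
# VERDICT_PATTERNS and FEEDBACK_PATTERNS match at highest confidence.
response = """
<verdict>REJECT</verdict>

<feedback>
- Missing input validation on the username field
- No timeout on the outbound HTTP call
</feedback>
"""

verdict = parse_verdict(response)
assert verdict.status == VerdictStatus.REJECT
assert not verdict.approved
assert len(verdict.issues) == 2          # bullets extracted via ISSUE_PATTERNS
print(f"{verdict.confidence:.0%}")       # 100%: explicit tags parse at top confidence
```

Replies without tags still parse via the markdown and implicit-signal fallbacks, but at lower confidence, and ambiguous replies default to REJECT as the safe verdict.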
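A sketch of what a reviewer actually receives on a re-review, using `build_reviewer_prompt` from `owlex/liza/contracts.py`. The task text and feedback strings here are hypothetical placeholders:

```python
from owlex.liza.contracts import build_reviewer_prompt

# One prompt combining the Reviewer Contract, the task, the coder's
# summary, and the prior feedback the reviewer must verify was addressed.
prompt = build_reviewer_prompt(
    task_description="Add rate limiting to the login endpoint",
    implementation_summary="Per-user limits added, backed by Redis",
    previous_feedback="- Missing IP-based limiting",
    iteration=1,          # > 0, so the previous-feedback section is included
    critique_mode=True,   # appends the adversarial critique-mode rules
)
print(prompt)
```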
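And an end-to-end sketch of the orchestrator loop with a stubbed reviewer runner, mirroring the integration tests above. It assumes `owlex` is installed and a writable working directory for the `.owlex/liza-state.yaml` blackboard; `stub_runner` and the task text are stand-ins for the real agent execution wired up in `server.py`:

```python
import asyncio
from owlex.liza import LizaOrchestrator, LizaConfig

async def main() -> None:
    orch = LizaOrchestrator(config=LizaConfig(reviewers=["codex"]))
    replies = iter([
        "<verdict>REJECT</verdict><feedback>- No tests for the error path</feedback>",
        "<verdict>APPROVE</verdict><feedback>Error path is covered now.</feedback>",
    ])

    # Same signature the orchestrator expects: (agent, prompt, working_dir, timeout).
    async def stub_runner(agent: str, prompt: str, wd: str | None, timeout: int) -> str:
        return next(replies)

    orch.set_reviewer_runner(stub_runner)
    task = orch.create_task(description="Harden the upload handler")

    result = await orch.submit_for_review(task.id, "First pass implemented")
    assert not result.all_approved                      # round 1: rejected
    orch.prepare_for_iteration(task.id, result.merged_feedback)

    result = await orch.submit_for_review(task.id, "Added error-path tests")
    assert result.all_approved                          # round 2: approved
    orch.mark_approved(task.id)

asyncio.run(main())
```

In production the `liza_submit` MCP tool performs this same reject-iterate-approve loop, with Claude doing the fixes between rounds instead of a canned second reply.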