Merged
22 commits
5 changes: 4 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -35,4 +35,7 @@ verl.egg-info/

test_memory.md

trajectories/traj_*.json
trajectories/

AGENTS.md
CLAUDE.md
1,907 changes: 1,907 additions & 0 deletions data/gaia/val.json

Large diffs are not rendered by default.

Binary file added data/gaia/val.parquet
Binary file not shown.
141 changes: 141 additions & 0 deletions docs/DOCKER_SETUP.md
@@ -0,0 +1,141 @@
# OpenManus-RL Docker Setup for AMD GPUs

This setup lets you run OpenManus-RL AlfWorld rollouts in a Docker container without affecting your existing verl-agent environment.

## Prerequisites

- Docker installed and running
- AMD GPU with ROCm support
- The `verl-agent:rocm-snap1` Docker image (from your previous verl-agent setup)
- Models stored in `/root/models/`

## Setup Instructions

### 1. Initial Setup

First, run the setup script to create and configure the Docker container:

```bash
cd /root/OpenManus-RL
./scripts/docker_setup.sh
```

This will:
- Create a new Docker container named `openmanus-rl`
- Install all required dependencies
- Set up a virtual environment at `/opt/openmanus-venv`
- Map host port 8001 to container port 8000 (avoids a conflict with verl-agent)

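
For reference, the container creation performed by `docker_setup.sh` might look roughly like this. This is a hypothetical sketch: the exact flags, volume mounts, and ROCm device options are assumptions, not the script's actual contents.

```shell
# Hypothetical reconstruction of the container-creation step.
# ROCm containers typically need /dev/kfd and /dev/dri plus the video group;
# the port mapping matches the documented 8001 -> 8000 layout.
DOCKER_CMD="docker run -d --name openmanus-rl \
  --device=/dev/kfd --device=/dev/dri --group-add video \
  -p 8001:8000 \
  -v /root/OpenManus-RL:/workspace -v /root/models:/root/models \
  verl-agent:rocm-snap1 sleep infinity"
echo "$DOCKER_CMD"
```

If your mounts or image tag differ, adjust the `-v` and image arguments accordingly; the real script is the source of truth.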
### 2. Start/Access the Container

If you need to enter the container manually:

```bash
docker exec -it openmanus-rl bash
source /opt/openmanus-venv/bin/activate
cd /workspace
```

Then you can run commands directly.

### 3. Run Rollouts (Unified Script)

See ROLLOUT_GUIDE.md for detailed examples. A few quick starters:

- GAIA dry‑run:
- `python scripts/rollout/unified_rollout.py --env gaia --batch_size 2 --total_envs 4 --dry_run`

- AlfWorld small run (OpenAI):
- `python scripts/rollout/unified_rollout.py --env alfworld --model gpt-4o-mini --batch_size 1 --total_envs 2 --max_steps 20 --dump_path logs/alfworld/trajectory_$(date +%Y%m%d_%H%M%S).jsonl --chat_root .`

- GAIA small run (local vLLM):
- `./scripts/serve_model.sh` (in another shell)
- `python scripts/rollout/unified_rollout.py --env gaia --model qwen2.5-7b-alfworld --base_url http://127.0.0.1:8000/v1 --gaia_tools python_code_generator --batch_size 1 --total_envs 2 --max_steps 30 --dump_path logs/gaia/trajectory_$(date +%Y%m%d_%H%M%S).jsonl --chat_root .`

### 4. Running GAIA (Tool-Use) Rollouts

GAIA uses the tool-use environment and the dataset in `data/gaia/val.json`. Some tools need extra API keys.

Required packages for common tools are already listed in `requirements_docker.txt` (requests, python-dotenv, wikipedia). For Google search, set:

```bash
export GOOGLE_API_KEY=your-google-api-key
export GOOGLE_CX=your-custom-search-engine-id
```

GAIA can be run in two ways, both through the unified script: against the OpenAI API or against a local vLLM server.

1) OpenAI API
```bash
export OPENAI_API_KEY="your-openai-api-key"
python scripts/rollout/unified_rollout.py \
--env gaia --model gpt-4o-mini \
--gaia_tools python_code_generator \
--total_envs 50 --batch_size 10 --max_steps 30 --concurrency 8 \
--dump_path logs/gaia/trajectory_$(date +%Y%m%d_%H%M%S).jsonl \
--chat_root /workspace
```

2) Local model via vLLM (OpenAI-compatible)

First start the vLLM server (see above), then:
```bash
python scripts/rollout/unified_rollout.py \
--env gaia --model qwen2.5-7b-alfworld --base_url http://127.0.0.1:8000/v1 \
--gaia_tools python_code_generator \
--total_envs 50 --batch_size 10 --max_steps 30 --concurrency 8 \
--dump_path logs/gaia/trajectory_$(date +%Y%m%d_%H%M%S).jsonl \
--chat_root /workspace
```

Notes:
- Default GAIA tool in these examples: `python_code_generator` (chosen to avoid external API dependencies).
- If a tool needs external access (web APIs), ensure the container has outbound network connectivity and env vars are set.
- Chat histories and logs are saved under `logs/gaia` and `trajectories/<timestamp>/gaia/<model>/` when `--chat_root` is provided.
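
The output layout described above can be sketched as a small path helper. This is a hypothetical reconstruction of the naming scheme (`make_output_paths` is not a real function in the repo; the actual script may format names differently):

```python
from datetime import datetime
from pathlib import Path

def make_output_paths(env: str, model: str, chat_root: str = ".") -> dict:
    """Build output paths mirroring the documented layout:
    logs/<env>/trajectory_<ts>.jsonl for trajectories and
    trajectories/<ts>/<env>/<model>/ for chat histories."""
    ts = datetime.now().strftime("%Y%m%d_%H%M%S")
    return {
        "dump_path": Path("logs") / env / f"trajectory_{ts}.jsonl",
        "chat_dir": Path(chat_root) / "trajectories" / ts / env / model,
    }

paths = make_output_paths("gaia", "gpt-4o-mini")
```

The same timestamp is reused for both paths, which is what lets you correlate a JSONL trajectory with its chat histories after a run.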

## Container Management

### Stop the container
```bash
docker stop openmanus-rl
```

### Start the container again
```bash
docker start openmanus-rl
```

### Remove the container
```bash
docker stop openmanus-rl
docker rm openmanus-rl
```

### Check container logs
```bash
docker logs openmanus-rl
```

## Troubleshooting

### If vLLM fails to start
1. Check GPU memory usage: `rocm-smi`
2. Adjust `--gpu-memory-utilization` in `serve_model.sh`
3. Make sure no other process is using port 8000 in the container

### If rollout fails
1. Check that all dependencies are installed: `pip list`
2. Verify AlfWorld data is downloaded: `ls ~/.cache/alfworld` or re‑run `alfworld-download -f`
3. Check logs under `/workspace/logs/<env>/`

### Port conflicts
- Default: container 8000 → host 8001 (configured by `docker_setup.sh`)
- Adjust mapping via `-p` flag if needed.

## Output Files

- Trajectory files: `/root/OpenManus-RL/logs/alfworld/trajectory_*.jsonl`
- Chat histories: `/root/OpenManus-RL/trajectories/<timestamp>/`
- Log files: `/root/OpenManus-RL/logs/alfworld/run_log_*.log`
90 changes: 90 additions & 0 deletions docs/ROLLOUT_GUIDE.md
@@ -0,0 +1,90 @@
# Rollout Guide (AlfWorld, GAIA, WebShop)

This guide shows how to run rollouts for the three environments using a single unified script. The script supports both OpenAI API and local OpenAI‑compatible endpoints (e.g., vLLM).

## Prerequisites

- Python venv prepared via Docker setup (see DOCKER_SETUP.md)
- .env at repo root (auto‑loaded) for API keys:
- `OPENAI_API_KEY` for OpenAI
- Optional tool keys (e.g., GAIA Google tools): `GOOGLE_API_KEY`, `GOOGLE_CX`
- For local inference (vLLM), start the server first (see DOCKER_SETUP.md or `serve_model.sh`).

## Unified Script

- Entry: `python scripts/rollout/unified_rollout.py`
- Core flags:
- `--env {alfworld,gaia,webshop}` choose environment
- `--model <name>` model name (OpenAI or local)
- `--base_url <url>` set when using local server (e.g., `http://127.0.0.1:8000/v1`)
- `--batch_size`, `--total_envs`, `--max_steps`, `--concurrency`
- `--dump_path <jsonl>` save trajectories
- `--chat_root <dir>` save chat histories under `trajectories/<ts>/<env>/<model>/`
- `--dry_run` plan batches without creating envs/calling models
- `--unique_envs` ensure unique task/game sampling where supported
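
The flag surface above can be sketched with `argparse`. This is a hypothetical reconstruction for illustration; the real `unified_rollout.py` may use different defaults or extra options:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Mirrors the documented core flags; defaults are assumptions.
    p = argparse.ArgumentParser(description="Unified rollout (sketch)")
    p.add_argument("--env", choices=["alfworld", "gaia", "webshop"], required=True)
    p.add_argument("--model", default="gpt-4o-mini")
    p.add_argument("--base_url", default=None)  # set for local vLLM
    p.add_argument("--batch_size", type=int, default=1)
    p.add_argument("--total_envs", type=int, default=2)
    p.add_argument("--max_steps", type=int, default=30)
    p.add_argument("--concurrency", type=int, default=1)
    p.add_argument("--dump_path", default=None)   # JSONL trajectory output
    p.add_argument("--chat_root", default=None)   # chat-history root dir
    p.add_argument("--dry_run", action="store_true")
    p.add_argument("--unique_envs", action="store_true")
    return p

args = build_parser().parse_args(["--env", "gaia", "--batch_size", "2", "--dry_run"])
```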

## GAIA

Data path default: `data/gaia/val.json`

- Dry‑run (no model calls):
- `python scripts/rollout/unified_rollout.py --env gaia --batch_size 2 --total_envs 4 --dry_run`

- OpenAI small run (minimal tools):
- `python scripts/rollout/unified_rollout.py \
--env gaia --model gpt-4o \
--gaia_tools python_code_generator \
--batch_size 1 --total_envs 2 --max_steps 30 --concurrency 2 \
--dump_path logs/gaia/trajectory_$(date +%Y%m%d_%H%M%S).jsonl --chat_root .`

- Local vLLM small run:
- `python scripts/rollout/unified_rollout.py \
--env gaia --model qwen2.5-7b-alfworld --base_url http://127.0.0.1:8000/v1 \
--gaia_tools python_code_generator \
--batch_size 1 --total_envs 2 --max_steps 30 --concurrency 2 \
--dump_path logs/gaia/trajectory_$(date +%Y%m%d_%H%M%S).jsonl --chat_root .`

## AlfWorld

Make sure AlfWorld is installed and game data downloaded (`alfworld-download -f`).

- Dry‑run (unique game files sampling):
- `python scripts/rollout/unified_rollout.py --env alfworld --unique_envs --batch_size 2 --total_envs 4 --dry_run`

- OpenAI small run:
- `python scripts/rollout/unified_rollout.py \
--env alfworld --model gpt-4o \
--batch_size 1 --total_envs 2 --max_steps 30 --concurrency 2 \
--dump_path logs/alfworld/trajectory_$(date +%Y%m%d_%H%M%S).jsonl --chat_root .`

- Local vLLM small run:
- `python scripts/rollout/unified_rollout.py \
--env alfworld --model qwen2.5-7b-alfworld --base_url http://127.0.0.1:8000/v1 \
--batch_size 1 --total_envs 2 --max_steps 20 --concurrency 2 \
--dump_path logs/alfworld/trajectory_$(date +%Y%m%d_%H%M%S).jsonl --chat_root .`

## WebShop (optional)

To run WebShop, follow data/index setup in DOCKER_SETUP.md, then use:

- Dry‑run:
- `python scripts/rollout/unified_rollout.py --env webshop --batch_size 2 --total_envs 4 --dry_run`

- OpenAI:
- `python scripts/rollout/unified_rollout.py \
--env webshop --model gpt-4o \
--batch_size 2 --total_envs 4 --max_steps 30 --concurrency 2 \
--dump_path logs/webshop/trajectory_$(date +%Y%m%d_%H%M%S).jsonl --chat_root .`

- Local vLLM:
- `python scripts/rollout/unified_rollout.py \
--env webshop --model qwen2.5-7b-alfworld --base_url http://127.0.0.1:8000/v1 \
--batch_size 2 --total_envs 4 --max_steps 30 --concurrency 2 \
--dump_path logs/webshop/trajectory_$(date +%Y%m%d_%H%M%S).jsonl --chat_root .`

## Outputs

- Logs: `logs/<env>/unified_run_*.log`
- Trajectory: `--dump_path` JSONL
- Chats: `trajectories/<timestamp>/<env>/<model>/` when `--chat_root` is set

6 changes: 6 additions & 0 deletions openmanus_rl/engines/__init__.py
@@ -0,0 +1,6 @@
"""LLM engine interfaces and factories.

This package provides lightweight wrappers around OpenAI-compatible
chat completion APIs and a simple factory used by tool modules.
"""

21 changes: 21 additions & 0 deletions openmanus_rl/engines/factory.py
@@ -0,0 +1,21 @@
"""Engine factory helpers.

Exposes `create_llm_engine` returning a callable that maps prompt -> text using
the minimal `ChatOpenAI` wrapper. Keep the surface small and stable so tools
can depend on it without heavy coupling.
"""

from typing import Callable, Optional
from .openai import ChatOpenAI


def create_llm_engine(model_string: str = "gpt-4o-mini", is_multimodal: bool = False, base_url: Optional[str] = None) -> Callable[[str], str]:
chat = ChatOpenAI(model=model_string, base_url=base_url)

def _engine(prompt: str) -> str:
# Tools currently call engine(prompt) for text-only flows.
# If multimodal is needed later, extend by adding optional image args.
return chat(prompt)

return _engine
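
Callers treat the factory's return value as a plain prompt-to-text callable. The sketch below swaps in a stub for `ChatOpenAI` so it runs offline; the stub and its canned reply are purely illustrative, not part of the module:

```python
from typing import Callable

def create_stub_engine(reply: str) -> Callable[[str], str]:
    # Same Callable[[str], str] surface as create_llm_engine,
    # but returns a canned reply instead of hitting an API.
    def _engine(prompt: str) -> str:
        return f"{reply} (len={len(prompt)})"
    return _engine

engine = create_stub_engine("ok")
result = engine("summarize this")
```

Keeping tools coupled only to this narrow callable is what makes such stubbing trivial in tests.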

124 changes: 124 additions & 0 deletions openmanus_rl/engines/openai.py
@@ -0,0 +1,124 @@
"""Minimal OpenAI chat wrapper.

Provides a small surface compatible with internal code paths that expect
`ChatOpenAI` with a callable interface. Supports OpenAI-compatible backends
such as vLLM by honoring `OPENAI_BASE_URL`.
"""

from typing import Optional, List, Dict, Any, Type
import json
import re
try:
from pydantic import BaseModel # type: ignore
except Exception: # pragma: no cover
BaseModel = object # type: ignore
import os

try:
from openai import OpenAI # type: ignore
except Exception:  # pragma: no cover
OpenAI = None # type: ignore


class ChatOpenAI:
"""Thin wrapper around OpenAI's Chat Completions API.

The instance is callable and returns plain text. Images are not sent as
binary by design to remain compatible with OpenAI-compatible servers that
do not support multimodal content; image paths are appended as text hints.
"""

def __init__(
self,
model: str = "gpt-4o-mini",
base_url: Optional[str] = None,
api_key: Optional[str] = None,
temperature: float = 0.0,
) -> None:
if OpenAI is None:
raise RuntimeError("openai package is not installed")

self.model = model
self.temperature = temperature
self.base_url = base_url or os.getenv("OPENAI_BASE_URL")
self.api_key = api_key or os.getenv("OPENAI_API_KEY", "EMPTY")
self.client = OpenAI(api_key=self.api_key, base_url=self.base_url)

def __call__(
self,
prompt: str,
images: Optional[List[str]] = None,
system: Optional[str] = None,
response_format: Optional[Type] = None,
**_: Any,
) -> Any:
messages: List[Dict[str, Any]] = []
if system:
messages.append({"role": "system", "content": system})

if not images:
messages.append({"role": "user", "content": prompt})
else:
# Safe multimodal fallback: append image paths as text hints.
content = prompt
for p in images:
content += f"\n[Image: {p}]"
messages.append({"role": "user", "content": content})

resp = self.client.chat.completions.create(
model=self.model,
messages=messages,
temperature=self.temperature,
n=1,
)
text = (resp.choices[0].message.content or "").strip()

# Best-effort structured parsing when a pydantic model is requested
try:
if response_format and isinstance(response_format, type) and issubclass(response_format, BaseModel):
# Try JSON first
try:
data = json.loads(text)
if isinstance(data, dict):
return response_format(**data)
if isinstance(data, list):
# Common pattern: patch list
payload: Dict[str, Any] = {}
if hasattr(response_format, "model_fields") and "patch" in response_format.model_fields: # pydantic v2
payload["patch"] = data
elif hasattr(response_format, "__fields__") and "patch" in getattr(response_format, "__fields__"):
payload["patch"] = data
if payload:
return response_format(**payload)
except Exception:
pass

# Special-case: AnswerVerification(analysis: str, true_false: bool)
if getattr(response_format, "__name__", "") == "AnswerVerification":
analysis = ""
tf = False
m = re.search(r"<analysis>\s*(.*?)\s*</analysis>", text, re.DOTALL)
if m:
analysis = m.group(1).strip()
m2 = re.search(r"<true_false>\s*(.*?)\s*</true_false>", text, re.DOTALL)
if m2:
val = m2.group(1).strip().lower()
tf = val in ("true", "1", "yes")
if not analysis:
analysis = text
return response_format(analysis=analysis, true_false=tf)

# Fallback: try to populate known common fields
payload: Dict[str, Any] = {}
for field in ("analysis", "text"):
if (hasattr(response_format, "model_fields") and field in response_format.model_fields) or (
hasattr(response_format, "__fields__") and field in getattr(response_format, "__fields__")
):
payload[field] = text
if payload:
return response_format(**payload)
except Exception:
# Swallow parsing errors and return raw text
pass

return text
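
The `AnswerVerification` special case above reduces to tag extraction plus a raw-text fallback. A stdlib-only sketch, using a plain class in place of the pydantic model:

```python
import re

class AnswerVerification:
    # Plain stand-in for the pydantic model the wrapper targets.
    def __init__(self, analysis: str, true_false: bool):
        self.analysis = analysis
        self.true_false = true_false

def parse_verification(text: str) -> AnswerVerification:
    """Extract <analysis>/<true_false> tags; fall back to raw text."""
    m = re.search(r"<analysis>\s*(.*?)\s*</analysis>", text, re.DOTALL)
    analysis = m.group(1).strip() if m else text
    m2 = re.search(r"<true_false>\s*(.*?)\s*</true_false>", text, re.DOTALL)
    tf = bool(m2) and m2.group(1).strip().lower() in ("true", "1", "yes")
    return AnswerVerification(analysis, tf)

v = parse_verification("<analysis>Looks right.</analysis><true_false>true</true_false>")
```

Because the fallback returns the raw text as `analysis`, a malformed model reply degrades gracefully instead of raising.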