BrowserOperator
diff --git a/CLAUDE.md b/CLAUDE.md
Lines changed: 49 additions & 33 deletions
diff --git a/Dockerfile.devtools b/Dockerfile.devtools
Lines changed: 1 addition & 1 deletion
diff --git a/deployments/cloudrun/Dockerfile b/deployments/cloudrun/Dockerfile
Lines changed: 5 additions & 5 deletions
diff --git a/deployments/local/Dockerfile b/deployments/local/Dockerfile
Lines changed: 2 additions & 2 deletions
diff --git a/deployments/local/Makefile b/deployments/local/Makefile
Lines changed: 9 additions & 5 deletions
diff --git a/evals/.env.example b/evals/.env.example
Lines changed: 18 additions & 1 deletion
diff --git a/evals/CLAUDE.md b/evals/CLAUDE.md
Lines changed: 8 additions & 4 deletions
diff --git a/evals/config.example.anthropic.yml b/evals/config.example.anthropic.yml
Lines changed: 59 additions & 0 deletions
@@ -91,20 +91,25 @@ web-agent/
 │           │   └── nginx-devtools.conf
 │           └── services-cloudrun/  # Service configs (cloud run)
 │               └── browser-agent-server.conf
-├── browser-agent-server/
-│   └── nodejs/                 # Browser agent server source
-│       ├── src/
-│       │   ├── api-server.js   # HTTP REST API
-│       │   ├── evaluation-server.js  # WebSocket + CDP
-│       │   └── lib/            # BrowserAgentServer, judges
-│       ├── start.js            # Server entrypoint
-│       └── package.json
+├── submodules/                 # Git submodules
+│   ├── browser-operator-core/  # Browser Operator DevTools + Agent Server
+│   │   ├── agent-server/       # Agent server (HTTP/WebSocket API)
+│   │   │   └── nodejs/         # Node.js implementation
+│   │   │       ├── src/
+│   │   │       │   ├── api-server.js   # HTTP REST API
+│   │   │       │   ├── client-manager.js  # Client management
+│   │   │       │   └── lib/    # Core libraries
+│   │   │       ├── start.js    # Server entrypoint
+│   │   │       └── package.json
+│   │   └── front_end/          # DevTools frontend source
+│   ├── kernel-images/          # Base browser environment
+│   └── webarena/               # WebArena benchmark (for webarena evals)
 ├── evals/                      # Evaluation framework
 │   ├── .env                    # API keys (gitignored, copy from .env.example)
 │   ├── config.yml              # Global eval configuration
 │   ├── lib/                    # Shared evaluation library
 │   │   ├── eval_loader.py      # YAML evaluation loader
-│   │   ├── api_client.py       # HTTP client for browser-agent-server
+│   │   ├── api_client.py       # HTTP client for agent server
 │   │   ├── judge.py            # LLMJudge, VisionJudge, SimpleJudge
 │   │   ├── webarena_adapter.py # WebArena task adapter
 │   │   └── webarena_evaluators.py # WebArena evaluators
@@ -130,7 +135,7 @@ web-agent/
 ### deployments/local/Dockerfile
 Multi-stage build that:
 1. Copies pre-built DevTools from `browser-operator-devtools:latest`
-2. Builds browser-agent-server with `npm install`
+2. Builds agent server from `submodules/browser-operator-core/agent-server/nodejs` with `npm install`
 3. Builds kernel-images Go API
 4. Builds WebRTC client
 5. Compiles custom Xorg drivers
@@ -145,9 +150,10 @@ Multi-stage build that:
 ### deployments/local/docker-compose.yml
 Configures container with:
 - Port mappings for all services (8000-8082, 9222, 444)
-- Volume mounts: recordings, chromium-data, browser-agent-server code
+- Volume mounts: recordings, chromium-data
 - tmpfs: `/dev/shm` and `/tmp` (prevents lock file persistence)
 - Environment: `CHROMIUM_FLAGS` with custom DevTools frontend
+- Agent server code is baked into the image (not volume-mounted)
 
 **Recent fixes:**
 - Added missing ports 8000, 8001, 8081, 8082
@@ -163,11 +169,12 @@ This prevents "profile in use" and "display already active" errors.
 
 Available in all deployment types: `local/`, `local-webarena/`, `cloudrun/`
 
-### browser-agent-server/nodejs/src/api-server.js
+### submodules/browser-operator-core/agent-server/nodejs/src/api-server.js
 HTTP REST API with endpoints:
 - `POST /v1/responses` - Execute browser automation tasks
 - `POST /page/content` - Get page HTML/text content
 - `POST /page/screenshot` - Capture screenshots
+- `POST /page/execute` - Execute JavaScript in page context
 - `GET /status` - Health check
 
 ### deployments/commons/supervisor/services/browser-agent-server.conf
@@ -252,12 +259,12 @@ Supervisor configuration files:
 - Check logs: `docker logs kernel-browser-extended | grep CDP`
 
 ### 4. Module Not Found Errors
-**Symptom:** "Cannot find module 'js-yaml'" or "Cannot find module 'BrowserAgentServer.js'"
+**Symptom:** "Cannot find module 'js-yaml'" or missing dependencies
 
 **Solution:**
-- Ensure `browser-agent-server/nodejs/` has all dependencies
-- Run `cd browser-agent-server/nodejs && npm install`
-- Browser-agent-server code is in `browser-agent-server/nodejs/`
+- Agent server code comes from `submodules/browser-operator-core/agent-server/nodejs/`
+- Dependencies are installed during Docker build via `npm install`
+- Rebuild the image if dependencies are missing: `make rebuild`
 
 ### 5. Docker Volume Caching on macOS
 **Symptom:** File changes not visible in running container with docker-compose
@@ -293,10 +300,11 @@ make compose-up  # OR make run
 **Advantages:**
 - Background operation
 - Easy restart without rebuilding
-- Volume-mounted eval-server code (live reload)
 - Managed by docker-compose
 - Better for long-running development
 
+**Note:** Agent server code is baked into the image, so rebuilds are needed for code changes
+
 **Usage:**
 ```bash
 # First time setup
@@ -312,9 +320,11 @@ make test                    # Run simple eval test
 # View logs
 make logs                    # Follow all logs
 
-# Iterate on eval-server code (NO REBUILD NEEDED)
-vim eval-server/nodejs/src/api-server.js
-docker-compose restart       # Picks up changes immediately
+# Iterate on agent server code (REQUIRES REBUILD)
+vim submodules/browser-operator-core/agent-server/nodejs/src/api-server.js
+make rebuild
+docker-compose down
+docker-compose up -d
 
 # Stop
 make stop                    # OR docker-compose down
@@ -362,7 +372,7 @@ make run                     # Restart after rebuild
 |--------|-----------|-------------------|
 | **Logs** | Live in terminal | Background, use `make logs` |
 | **Stopping** | Ctrl+C or docker stop | `make stop` |
-| **Eval server code** | Baked into image, rebuild needed | Volume-mounted, restart only |
+| **Agent server code** | Baked into image, rebuild needed | Baked into image, rebuild needed |
 | **DevTools code** | Baked into image, rebuild needed | Baked into image, rebuild needed |
 | **Best for** | Debugging, seeing startup issues | Development iteration |
 | **Script** | `run-local.sh` | `docker-compose.yml` |
@@ -377,9 +387,11 @@ make run                     # Restart after rebuild
 ```bash
 cd deployments/local
 
-# Browser-agent-server changes (NO REBUILD)
-vim ../../browser-agent-server/nodejs/src/api-server.js
-docker-compose restart       # Volume-mounted, picks up changes
+# Agent server changes (REQUIRES REBUILD)
+vim ../../submodules/browser-operator-core/agent-server/nodejs/src/api-server.js
+make rebuild
+docker-compose down
+docker-compose up -d
 
 # DevTools changes
 vim ../../browser-operator-core/front_end/panels/ai_chat/...
@@ -430,26 +442,29 @@ CHROMIUM_DATA_HOST=/tmp/browser URLS="https://example.com" make run
 
 ## Important Notes
 
-### Browser Agent Server Location
-The browser agent server code is in: `browser-agent-server/nodejs/`
+### Agent Server Location
+The agent server code is in: `submodules/browser-operator-core/agent-server/nodejs/`
 
 This is the main server that handles browser automation requests via HTTP/WebSocket APIs.
 
+**Note:** The submodule must be on the `feat/js-eval-endpoint` branch to have the `/page/execute` endpoint.
+
 ### CDP Port is 9223, Not 9222
 The default Chrome DevTools port is 9222, but this project uses 9223.
 
 Check these files:
 - `deployments/commons/supervisor/services/browser-agent-server.conf` - Must have `CDP_PORT="9223"`
 - Chromium startup config uses port 9223
 
-### Dependencies in browser-agent-server/nodejs/
+### Dependencies in submodules/browser-operator-core/agent-server/nodejs/
 Required packages:
-- js-yaml (for parsing YAML eval files)
-- express (HTTP server)
 - ws (WebSocket server)
-- chrome-remote-interface (CDP client)
+- uuid (ID generation)
+- winston (logging)
+- js-yaml (YAML parsing)
+- dotenv (environment variables)
 
-All managed by `package.json` and `npm install`.
+All managed by `package.json` and `npm install` during Docker build.
 
 ### Lock File Cleanup is Automatic
 After implementing `deployments/*/scripts/init-container.sh`, you should never need to manually clean lock files again. The script runs on every container start.
@@ -706,7 +721,7 @@ curl -X POST http://localhost:8080/page/screenshot \
    - `webarena/` - WebArena benchmark runner
    - `lib/` - Shared evaluation library (judges, adapters, loaders)
 
-3. **Renamed eval-server** - Now called `browser-agent-server/` to better reflect its purpose
+3. **Consolidated agent server** - Now using `submodules/browser-operator-core/agent-server/` directly (removed duplicate `browser-agent-server/` directory)
 
 4. **Moved WebArena config files** - Task configurations moved to in-repo location:
    - New location: `evals/webarena/config_files/` (preferred)
@@ -718,6 +733,7 @@ curl -X POST http://localhost:8080/page/screenshot \
 2. **Fixed tmpfs mounts** - Added `/tmp` to prevent X11 lock persistence
 3. **Added automatic lock cleanup** - `deployments/*/scripts/init-container.sh` runs on every start
 4. **Updated Chromium flags** - Added `--custom-devtools-frontend=http://localhost:8001/`
-5. **Fixed CDP port** - Set `CDP_PORT="9223"` in browser-agent-server supervisor config
+5. **Fixed CDP port** - Set `CDP_PORT="9223"` in agent server supervisor config
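+6. **Added /page/execute endpoint** - JavaScript execution endpoint available in `feat/js-eval-endpoint` branch
-6. **Created make test** - Quick verification of API functionality
-7. **Fixed path resolution** - `eval_loader.py` now supports new `evals/native/data/` structure
+7. **Created make test** - Quick verification of API functionality
+8. **Fixed path resolution** - `eval_loader.py` now supports new `evals/native/data/` structure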
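Review note: the CLAUDE.md changes document a new `POST /page/execute` endpoint but do not show a request. The sketch below only builds such a request, without sending it, so it runs without a server. The JSON field names (`clientId`, `tabId`, `expression`) and the helper name are assumptions for illustration, not confirmed by this diff; the base URL follows the `http://localhost:8080` examples already in the file.

```python
import json
import urllib.request


def build_execute_request(base_url, expression, client_id, tab_id):
    """Build (but do not send) a POST request for /page/execute.

    The payload keys are hypothetical; check api-server.js on the
    feat/js-eval-endpoint branch for the real request schema.
    """
    payload = {
        "clientId": client_id,     # assumed field name
        "tabId": tab_id,           # assumed field name
        "expression": expression,  # assumed field name
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/page/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_execute_request(
        "http://localhost:8080", "document.title", "client-1", "tab-1"
    )
    print(req.get_method(), req.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send it once the container is running and the submodule is on the branch that provides the endpoint.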
@@ -69,7 +69,7 @@ FROM devtools-base AS devtools-local
 # Copy local changes from browser-operator-core submodule FIRST
 # This happens before checking out upstream, so we copy over the upstream code
 COPY submodules/browser-operator-core/front_end /workspace/devtools/devtools-frontend/front_end
-COPY browser-agent-server /workspace/devtools/devtools-frontend/browser-agent-server
+COPY submodules/browser-operator-core/agent-server /workspace/devtools/devtools-frontend/browser-agent-server
 
 WORKDIR /workspace/devtools/devtools-frontend
 
 
@@ -55,12 +55,12 @@ RUN sed -i 's/AUTOMATED_MODE: false/AUTOMATED_MODE: true/' front_end/panels/ai_c
 # Build Browser Operator version with AUTOMATED_MODE enabled
 RUN npm run build
 
-# Eval-Server build stage
+# Agent Server build stage
 FROM node:22-bullseye-slim AS browser-agent-server-builder
 WORKDIR /browser-agent-server
-COPY browser-agent-server/nodejs/package*.json ./
+COPY submodules/browser-operator-core/agent-server/nodejs/package*.json ./
 RUN npm install --production
-COPY browser-agent-server/nodejs/ ./
+COPY submodules/browser-operator-core/agent-server/nodejs/ ./
 
 # Multi-stage build using kernel-images as base
 FROM docker.io/golang:1.25.0 AS server-builder
@@ -285,8 +285,8 @@ RUN chown -R kernel:kernel /usr/share/nginx/devtools
 # Copy browser-agent-server from builder
 COPY --from=browser-agent-server-builder /browser-agent-server /opt/browser-agent-server
 
-# Copy custom browser-agent-server startup script INTO browser-agent-server directory
-COPY browser-agent-server/start.js /opt/browser-agent-server/start-cloudrun.js
+# Copy custom agent server startup script from submodule
+COPY submodules/browser-operator-core/agent-server/start.js /opt/browser-agent-server/start-cloudrun.js
 RUN chmod +x /opt/browser-agent-server/start-cloudrun.js
 
 # Set permissions for browser-agent-server
 
@@ -16,8 +16,8 @@ FROM --platform=linux/arm64 node:18-alpine AS browser-agent-server-builder
 
 WORKDIR /workspace
 
-# Copy eval server from browser-operator-core submodule
-COPY browser-agent-server/nodejs /workspace/browser-agent-server
+# Copy agent server from browser-operator-core submodule
+COPY submodules/browser-operator-core/agent-server/nodejs /workspace/browser-agent-server
 
 WORKDIR /workspace/browser-agent-server
 
 
@@ -22,10 +22,14 @@ init: ## Initialize submodules (run this first)
 	cd ../../ && git submodule update --init --depth 1 submodules/browser-operator-core
 	@echo "✅ Submodules initialized"
 
-init-devtools: ## Initialize browser-operator-core submodule only
-	@echo "📦 Initializing browser-operator-core submodule..."
-	cd ../../ && git submodule update --init --depth 1 submodules/browser-operator-core
-	@echo "✅ browser-operator-core submodule initialized"
+init-devtools: ## Initialize browser-operator-core submodule only (if not already initialized)
+	@if [ ! -f ../../submodules/browser-operator-core/package.json ]; then \
+		echo "📦 Initializing browser-operator-core submodule..."; \
+		cd ../../ && git submodule update --init --depth 1 submodules/browser-operator-core; \
+		echo "✅ browser-operator-core submodule initialized"; \
+	else \
+		echo "✅ browser-operator-core already initialized (skipping to preserve branch)"; \
+	fi
 
 build-devtools-base: init-devtools ## Build DevTools base image (slow, rarely needed)
 	@echo "🔨 Building DevTools base layer (this takes ~30 minutes)..."
@@ -69,7 +73,7 @@ build: init ## Build extended image with DevTools frontend (smart: only builds D
 	cd ../../ && docker build -f deployments/local/Dockerfile -t kernel-browser:extended .
 	@echo "✅ Extended build complete"
 
-rebuild: init ## Force complete rebuild (including DevTools)
+rebuild: ## Force complete rebuild (including DevTools)
 	@echo "🔄 Force rebuilding everything from scratch..."
 	$(MAKE) --no-print-directory build-devtools
 	cd ../../ && docker build -f deployments/local/Dockerfile -t kernel-browser:extended .
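Review note: the new `init-devtools` guard skips `git submodule update` when `submodules/browser-operator-core/package.json` already exists, so a manually checked-out branch is not reset. The same check can be sketched in Python for use in a pre-build script; the function name is hypothetical and the marker file mirrors the one the Makefile tests.

```python
from pathlib import Path


def submodule_initialized(repo_root):
    """Return True if the browser-operator-core submodule looks checked out.

    Mirrors the Makefile guard: the presence of package.json in the
    submodule directory is treated as "already initialized".
    """
    marker = Path(repo_root) / "submodules" / "browser-operator-core" / "package.json"
    return marker.is_file()


if __name__ == "__main__":
    # Run from deployments/local, where the repo root is two levels up.
    if submodule_initialized("../.."):
        print("browser-operator-core already initialized (skipping to preserve branch)")
    else:
        print("browser-operator-core needs: git submodule update --init --depth 1")
```

A file-existence check is cheap, but it cannot tell which branch is checked out, so it will not catch a submodule that is initialized on the wrong branch.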
 
@@ -10,9 +10,26 @@ GROQ_API_KEY=gsk-your-groq-api-key-here
 # Optional: OpenRouter API key (if using OpenRouter)
 OPENROUTER_API_KEY=your-openrouter-api-key-here
 
+# Optional: Cerebras API key (if using Cerebras models)
+CEREBRAS_API_KEY=your-cerebras-api-key-here
+
+# Optional: Anthropic API key (if using Claude models directly)
+ANTHROPIC_API_KEY=your-anthropic-api-key-here
+
+# Optional: Google API key (if using Gemini models directly)
+GOOGLE_API_KEY=your-google-api-key-here
+
 # Optional: LiteLLM configuration (if using LiteLLM)
+# LiteLLM allows you to use custom model endpoints (Ollama, vLLM, etc.)
+# Set these variables, then update config.yml to use provider: "litellm"
 LITELLM_API_KEY=your-litellm-api-key-here
-LITELLM_ENDPOINT=http://localhost:8000
+LITELLM_ENDPOINT=http://localhost:4000
+
+# Example LiteLLM endpoint configurations:
+# - Ollama local: http://localhost:11434
+# - LiteLLM proxy: http://localhost:4000
+# - vLLM server: http://localhost:8000
+# - Custom endpoint: http://172.16.55.34:4000
 
 # WebArena Infrastructure Configuration (Optional)
 # Only required when running WebArena evaluations against self-hosted sites
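Review note: with six optional keys in `.env.example`, it is easy to leave a placeholder in place. A minimal sketch of a pre-flight check follows, assuming placeholder values always contain the substring `your-` as they do in this file; the helper name is hypothetical and the framework may validate keys differently.

```python
import os

# Variable names taken from the .env.example lines shown in this diff.
KEY_VARS = [
    "GROQ_API_KEY",
    "OPENROUTER_API_KEY",
    "CEREBRAS_API_KEY",
    "ANTHROPIC_API_KEY",
    "GOOGLE_API_KEY",
    "LITELLM_API_KEY",
]


def usable_keys(env):
    """Return the key variables that are set to a non-placeholder value.

    Assumption: example placeholders contain "your-", so any value with
    that substring is treated as not configured.
    """
    return [
        name
        for name in KEY_VARS
        if env.get(name) and "your-" not in env[name]
    ]


if __name__ == "__main__":
    print("Configured keys:", usable_keys(os.environ))
```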
 
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 
 This is the **Evaluation Framework** for testing browser automation agents. It uses **LLM-as-a-judge** to evaluate agent responses against defined criteria, with support for **visual verification** through screenshots.
 
-The framework is completely independent of the main browser-agent server and operates as a standalone Python application that communicates with the browser-agent-server API at http://localhost:8080.
+The framework is completely independent of the main agent server and operates as a standalone Python application that communicates with the agent server API at http://localhost:8080.
 
 ## Framework Structure
 
@@ -58,9 +58,13 @@ cp .env.example .env
 # Navigate to native runner directory
 cd native
 
-# Run a specific evaluation by path (relative to data/)
+# Run a specific evaluation by file path (relative to data/)
 python3 run.py --path test-simple/math-001.yaml
 
+# Run a specific evaluation by directory path (NEW: auto-detects task.yaml)
+python3 run.py --path js-verifier/action/dropdown
+python3 run.py --path js-verifier/action/daterange --verbose
+
 # Run with verbose output (shows input, response, reasoning, screenshots)
 python3 run.py --path action-agent/accordion-001.yaml --verbose
 
@@ -172,7 +176,7 @@ evals/
 │   ├── __init__.py                 # Library exports
 │   ├── config_loader.py            # Configuration management
 │   ├── eval_loader.py              # YAML evaluation loader
-│   ├── api_client.py               # HTTP client for browser-agent-server
+│   ├── api_client.py               # HTTP client for agent server
 │   ├── judge.py                    # LLMJudge, VisionJudge, SimpleJudge
 │   ├── webarena_adapter.py         # WebArena task adapter
 │   └── webarena_evaluators.py      # WebArena evaluators
@@ -639,7 +643,7 @@ task_id,site,intent,eval_types,status,score,response,execution_time_ms
 - **WebArenaTask** (lib/webarena_adapter.py:19-79) - Represents WebArena task
 - **EvalLoader** (lib/eval_loader.py:176-315) - Loads evals from YAML files
 - **WebArenaTaskLoader** (lib/webarena_adapter.py:172-330) - Loads WebArena tasks
-- **APIClient** (lib/api_client.py:10-382) - Communicates with browser-agent-server
+- **APIClient** (lib/api_client.py:10-382) - Communicates with agent server
 - **LLMJudge** (lib/judge.py:44-191) - Text-based evaluation
 - **VisionJudge** (lib/judge.py:193-386) - Visual verification
 - **StringEvaluator** (lib/webarena_evaluators.py:38-210) - String matching evaluation
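Review note: the new directory form of `--path` is described as auto-detecting `task.yaml`. A minimal sketch of that resolution rule follows, assuming the path is interpreted relative to `data/` as the comments state; the function name is hypothetical and the actual logic in `run.py` or `eval_loader.py` is not shown in this diff.

```python
from pathlib import Path


def resolve_eval_path(data_dir, path):
    """Resolve a --path argument to a YAML file.

    If the argument names a directory under data_dir, point at the
    task.yaml inside it; otherwise treat it as a file path.
    """
    candidate = Path(data_dir) / path
    if candidate.is_dir():
        return candidate / "task.yaml"
    return candidate


if __name__ == "__main__":
    print(resolve_eval_path("data", "js-verifier/action/dropdown"))
    print(resolve_eval_path("data", "test-simple/math-001.yaml"))
```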
 
@@ -0,0 +1,59 @@
+# Evaluation Framework Configuration
+# This configuration is shared across all evaluation runner scripts
+# Example configuration for Anthropic Claude models
+
+# API endpoint for the agent server
+api_endpoint: "http://localhost:8080"
+
+# Model configurations for running evaluations
+# These models are sent to the agent for processing requests
+
+main_model:
+  provider: "anthropic"
+  model_name: "claude-3-5-sonnet-20241022"
+  api_key: "${ANTHROPIC_API_KEY}"
+
+mini_model:
+  provider: "anthropic"
+  model_name: "claude-3-5-haiku-20241022"
+  api_key: "${ANTHROPIC_API_KEY}"
+
+nano_model:
+  provider: "anthropic"
+  model_name: "claude-3-5-haiku-20241022"
+  api_key: "${ANTHROPIC_API_KEY}"
+
+# Model configuration for judging evaluation responses
+# This model is used locally to assess the quality of agent responses
+
+judge_model:
+  provider: "anthropic"
+  model_name: "claude-3-5-sonnet-20241022"
+  api_key: "${ANTHROPIC_API_KEY}"
+
+# Execution settings
+
+execution:
+  # Default number of evaluations to run per script execution
+  default_limit: 20
+
+  # Timeout for API requests (seconds) - set high to allow for slow model endpoints
+  timeout: 3600
+
+  # Number of concurrent evaluation requests
+  concurrent_requests: 1
+
+  # Delay between requests (seconds)
+  request_delay: 1
+
+# Reporting settings
+
+reporting:
+  # Directory for storing evaluation reports
+  reports_dir: "reports"
+
+  # Report format
+  format: "csv"
+
+  # Include detailed judge reasoning in reports
+  include_reasoning: true
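Review note: the example config stores keys as `${ANTHROPIC_API_KEY}` placeholders. One way such placeholders could be expanded after the YAML is parsed is sketched below, operating on an already-loaded dict so it needs no YAML library. This is an illustration under that assumption, not the behaviour of `config_loader.py`, which this diff does not show.

```python
import os
import re

# Matches ${NAME} where NAME is a conventional environment variable name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_placeholders(value, env):
    """Recursively replace ${NAME} in strings using the env mapping.

    Unknown names are left untouched so a missing key is easy to spot.
    """
    if isinstance(value, dict):
        return {k: expand_placeholders(v, env) for k, v in value.items()}
    if isinstance(value, list):
        return [expand_placeholders(v, env) for v in value]
    if isinstance(value, str):
        return _PLACEHOLDER.sub(
            lambda m: env.get(m.group(1), m.group(0)), value
        )
    return value


if __name__ == "__main__":
    config = {
        "judge_model": {
            "provider": "anthropic",
            "api_key": "${ANTHROPIC_API_KEY}",
        },
        "execution": {"timeout": 3600},
    }
    print(expand_placeholders(config, dict(os.environ)))
```

Leaving unresolved placeholders as literal text means a missing variable shows up as `${ANTHROPIC_API_KEY}` in the request rather than as an empty string.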