Commit 06b613b

Merge pull request #11 from BrowserOperator/feat/eval-builder
Feat/eval builder
2 parents 98b0d2c + b57fa1a commit 06b613b

103 files changed: +179388 -201 lines changed

CLAUDE.md

Lines changed: 49 additions & 33 deletions
@@ -91,20 +91,25 @@ web-agent/
 │ │ └── nginx-devtools.conf
 │ └── services-cloudrun/ # Service configs (cloud run)
 │ └── browser-agent-server.conf
-├── browser-agent-server/
-│ └── nodejs/ # Browser agent server source
-│ ├── src/
-│ │ ├── api-server.js # HTTP REST API
-│ │ ├── evaluation-server.js # WebSocket + CDP
-│ │ └── lib/ # BrowserAgentServer, judges
-│ ├── start.js # Server entrypoint
-│ └── package.json
+├── submodules/ # Git submodules
+│ ├── browser-operator-core/ # Browser Operator DevTools + Agent Server
+│ │ ├── agent-server/ # Agent server (HTTP/WebSocket API)
+│ │ │ └── nodejs/ # Node.js implementation
+│ │ │ ├── src/
+│ │ │ │ ├── api-server.js # HTTP REST API
+│ │ │ │ ├── client-manager.js # Client management
+│ │ │ │ └── lib/ # Core libraries
+│ │ │ ├── start.js # Server entrypoint
+│ │ │ └── package.json
+│ │ └── front_end/ # DevTools frontend source
+│ ├── kernel-images/ # Base browser environment
+│ └── webarena/ # WebArena benchmark (for webarena evals)
 ├── evals/ # Evaluation framework
 │ ├── .env # API keys (gitignored, copy from .env.example)
 │ ├── config.yml # Global eval configuration
 │ ├── lib/ # Shared evaluation library
 │ │ ├── eval_loader.py # YAML evaluation loader
-│ │ ├── api_client.py # HTTP client for browser-agent-server
+│ │ ├── api_client.py # HTTP client for agent server
 │ │ ├── judge.py # LLMJudge, VisionJudge, SimpleJudge
 │ │ ├── webarena_adapter.py # WebArena task adapter
 │ │ └── webarena_evaluators.py # WebArena evaluators
@@ -130,7 +135,7 @@ web-agent/
 ### deployments/local/Dockerfile
 Multi-stage build that:
 1. Copies pre-built DevTools from `browser-operator-devtools:latest`
-2. Builds browser-agent-server with `npm install`
+2. Builds agent server from `submodules/browser-operator-core/agent-server/nodejs` with `npm install`
 3. Builds kernel-images Go API
 4. Builds WebRTC client
 5. Compiles custom Xorg drivers
@@ -145,9 +150,10 @@ Multi-stage build that:
 ### deployments/local/docker-compose.yml
 Configures container with:
 - Port mappings for all services (8000-8082, 9222, 444)
-- Volume mounts: recordings, chromium-data, browser-agent-server code
+- Volume mounts: recordings, chromium-data
 - tmpfs: `/dev/shm` and `/tmp` (prevents lock file persistence)
 - Environment: `CHROMIUM_FLAGS` with custom DevTools frontend
+- Agent server code is baked into the image (not volume-mounted)
 
 **Recent fixes:**
 - Added missing ports 8000, 8001, 8081, 8082
@@ -163,11 +169,12 @@ This prevents "profile in use" and "display already active" errors.
 
 Available in all deployment types: `local/`, `local-webarena/`, `cloudrun/`
 
-### browser-agent-server/nodejs/src/api-server.js
+### submodules/browser-operator-core/agent-server/nodejs/src/api-server.js
 HTTP REST API with endpoints:
 - `POST /v1/responses` - Execute browser automation tasks
 - `POST /page/content` - Get page HTML/text content
 - `POST /page/screenshot` - Capture screenshots
+- `POST /page/execute` - Execute JavaScript in page context
 - `GET /status` - Health check
 
 ### deployments/commons/supervisor/services/browser-agent-server.conf
@@ -252,12 +259,12 @@ Supervisor configuration files:
 - Check logs: `docker logs kernel-browser-extended | grep CDP`
 
 ### 4. Module Not Found Errors
-**Symptom:** "Cannot find module 'js-yaml'" or "Cannot find module 'BrowserAgentServer.js'"
+**Symptom:** "Cannot find module 'js-yaml'" or missing dependencies
 
 **Solution:**
-- Ensure `browser-agent-server/nodejs/` has all dependencies
-- Run `cd browser-agent-server/nodejs && npm install`
-- Browser-agent-server code is in `browser-agent-server/nodejs/`
+- Agent server code comes from `submodules/browser-operator-core/agent-server/nodejs/`
+- Dependencies are installed during Docker build via `npm install`
+- Rebuild the image if dependencies are missing: `make rebuild`
 
 ### 5. Docker Volume Caching on macOS
 **Symptom:** File changes not visible in running container with docker-compose
@@ -293,10 +300,11 @@ make compose-up # OR make run
 **Advantages:**
 - Background operation
 - Easy restart without rebuilding
-- Volume-mounted eval-server code (live reload)
 - Managed by docker-compose
 - Better for long-running development
 
+**Note:** Agent server code is baked into the image, so rebuilds are needed for code changes
+
 **Usage:**
 ```bash
 # First time setup
@@ -312,9 +320,11 @@ make test # Run simple eval test
 # View logs
 make logs # Follow all logs
 
-# Iterate on eval-server code (NO REBUILD NEEDED)
-vim eval-server/nodejs/src/api-server.js
-docker-compose restart # Picks up changes immediately
+# Iterate on agent server code (REQUIRES REBUILD)
+vim submodules/browser-operator-core/agent-server/nodejs/src/api-server.js
+make rebuild
+docker-compose down
+docker-compose up -d
 
 # Stop
 make stop # OR docker-compose down
@@ -362,7 +372,7 @@ make run # Restart after rebuild
 |--------|-----------|-------------------|
 | **Logs** | Live in terminal | Background, use `make logs` |
 | **Stopping** | Ctrl+C or docker stop | `make stop` |
-| **Eval server code** | Baked into image, rebuild needed | Volume-mounted, restart only |
+| **Agent server code** | Baked into image, rebuild needed | Baked into image, rebuild needed |
 | **DevTools code** | Baked into image, rebuild needed | Baked into image, rebuild needed |
 | **Best for** | Debugging, seeing startup issues | Development iteration |
 | **Script** | `run-local.sh` | `docker-compose.yml` |
@@ -377,9 +387,11 @@ make run # Restart after rebuild
 ```bash
 cd deployments/local
 
-# Browser-agent-server changes (NO REBUILD)
-vim ../../browser-agent-server/nodejs/src/api-server.js
-docker-compose restart # Volume-mounted, picks up changes
+# Agent server changes (REQUIRES REBUILD)
+vim ../../submodules/browser-operator-core/agent-server/nodejs/src/api-server.js
+make rebuild
+docker-compose down
+docker-compose up -d
 
 # DevTools changes
 vim ../../browser-operator-core/front_end/panels/ai_chat/...
@@ -430,26 +442,29 @@ CHROMIUM_DATA_HOST=/tmp/browser URLS="https://example.com" make run
 
 ## Important Notes
 
-### Browser Agent Server Location
-The browser agent server code is in: `browser-agent-server/nodejs/`
+### Agent Server Location
+The agent server code is in: `submodules/browser-operator-core/agent-server/nodejs/`
 
 This is the main server that handles browser automation requests via HTTP/WebSocket APIs.
 
+**Note:** The submodule must be on the `feat/js-eval-endpoint` branch to have the `/page/execute` endpoint.
+
 ### CDP Port is 9223, Not 9222
 The default Chrome DevTools port is 9222, but this project uses 9223.
 
 Check these files:
 - `deployments/commons/supervisor/services/browser-agent-server.conf` - Must have `CDP_PORT="9223"`
 - Chromium startup config uses port 9223
 
-### Dependencies in browser-agent-server/nodejs/
+### Dependencies in submodules/browser-operator-core/agent-server/nodejs/
 Required packages:
-- js-yaml (for parsing YAML eval files)
-- express (HTTP server)
 - ws (WebSocket server)
-- chrome-remote-interface (CDP client)
+- uuid (ID generation)
+- winston (logging)
+- js-yaml (YAML parsing)
+- dotenv (environment variables)
 
-All managed by `package.json` and `npm install`.
+All managed by `package.json` and `npm install` during Docker build.
 
 ### Lock File Cleanup is Automatic
 After implementing `deployments/*/scripts/init-container.sh`, you should never need to manually clean lock files again. The script runs on every container start.
@@ -706,7 +721,7 @@ curl -X POST http://localhost:8080/page/screenshot \
 - `webarena/` - WebArena benchmark runner
 - `lib/` - Shared evaluation library (judges, adapters, loaders)
 
-3. **Renamed eval-server** - Now called `browser-agent-server/` to better reflect its purpose
+3. **Consolidated agent server** - Now using `submodules/browser-operator-core/agent-server/` directly (removed duplicate `browser-agent-server/` directory)
 
 4. **Moved WebArena config files** - Task configurations moved to in-repo location:
 - New location: `evals/webarena/config_files/` (preferred)
@@ -718,6 +733,7 @@ curl -X POST http://localhost:8080/page/screenshot \
 2. **Fixed tmpfs mounts** - Added `/tmp` to prevent X11 lock persistence
 3. **Added automatic lock cleanup** - `deployments/*/scripts/init-container.sh` runs on every start
 4. **Updated Chromium flags** - Added `--custom-devtools-frontend=http://localhost:8001/`
-5. **Fixed CDP port** - Set `CDP_PORT="9223"` in browser-agent-server supervisor config
+5. **Fixed CDP port** - Set `CDP_PORT="9223"` in agent server supervisor config
+6. **Added /page/execute endpoint** - JavaScript execution endpoint available in `feat/js-eval-endpoint` branch
 6. **Created make test** - Quick verification of API functionality
 7. **Fixed path resolution** - `eval_loader.py` now supports new `evals/native/data/` structure
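
Because the agent server is now baked into the image, a quick reachability probe after `make rebuild` can confirm that the new container is serving requests. The following is a minimal sketch, assuming the server listens on `http://localhost:8080` (the address used elsewhere in this commit) and that `GET /status` returns HTTP 200 when healthy. The response body format is not shown in the diff, so only the status code is checked. The function name `is_agent_server_up` and the injectable `opener` parameter are hypothetical, added for illustration.

```python
import urllib.error
import urllib.request
from typing import Callable

# Assumption: default agent server address, taken from the docs in this commit.
BASE_URL = "http://localhost:8080"


def is_agent_server_up(base_url: str = BASE_URL,
                       opener: Callable = urllib.request.urlopen,
                       timeout: float = 5.0) -> bool:
    """Return True if GET /status answers with HTTP 200, False otherwise."""
    url = base_url.rstrip("/") + "/status"
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False
```

Calling `is_agent_server_up()` with no arguments probes the default address. Passing a different `opener` makes the function testable without a running container.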

Dockerfile.devtools

Lines changed: 1 addition & 1 deletion
@@ -69,7 +69,7 @@ FROM devtools-base AS devtools-local
 # Copy local changes from browser-operator-core submodule FIRST
 # This happens before checking out upstream, so we copy over the upstream code
 COPY submodules/browser-operator-core/front_end /workspace/devtools/devtools-frontend/front_end
-COPY browser-agent-server /workspace/devtools/devtools-frontend/browser-agent-server
+COPY submodules/browser-operator-core/agent-server /workspace/devtools/devtools-frontend/browser-agent-server
 
 WORKDIR /workspace/devtools/devtools-frontend

deployments/cloudrun/Dockerfile

Lines changed: 5 additions & 5 deletions
@@ -55,12 +55,12 @@ RUN sed -i 's/AUTOMATED_MODE: false/AUTOMATED_MODE: true/' front_end/panels/ai_c
 # Build Browser Operator version with AUTOMATED_MODE enabled
 RUN npm run build
 
-# Eval-Server build stage
+# Agent Server build stage
 FROM node:22-bullseye-slim AS browser-agent-server-builder
 WORKDIR /browser-agent-server
-COPY browser-agent-server/nodejs/package*.json ./
+COPY submodules/browser-operator-core/agent-server/nodejs/package*.json ./
 RUN npm install --production
-COPY browser-agent-server/nodejs/ ./
+COPY submodules/browser-operator-core/agent-server/nodejs/ ./
 
 # Multi-stage build using kernel-images as base
 FROM docker.io/golang:1.25.0 AS server-builder
@@ -285,8 +285,8 @@ RUN chown -R kernel:kernel /usr/share/nginx/devtools
 # Copy browser-agent-server from builder
 COPY --from=browser-agent-server-builder /browser-agent-server /opt/browser-agent-server
 
-# Copy custom browser-agent-server startup script INTO browser-agent-server directory
-COPY browser-agent-server/start.js /opt/browser-agent-server/start-cloudrun.js
+# Copy custom agent server startup script from submodule
+COPY submodules/browser-operator-core/agent-server/start.js /opt/browser-agent-server/start-cloudrun.js
 RUN chmod +x /opt/browser-agent-server/start-cloudrun.js
 
 # Set permissions for browser-agent-server

deployments/local/Dockerfile

Lines changed: 2 additions & 2 deletions
@@ -16,8 +16,8 @@ FROM --platform=linux/arm64 node:18-alpine AS browser-agent-server-builder
 
 WORKDIR /workspace
 
-# Copy eval server from browser-operator-core submodule
-COPY browser-agent-server/nodejs /workspace/browser-agent-server
+# Copy agent server from browser-operator-core submodule
+COPY submodules/browser-operator-core/agent-server/nodejs /workspace/browser-agent-server
 
 WORKDIR /workspace/browser-agent-server

deployments/local/Makefile

Lines changed: 9 additions & 5 deletions
@@ -22,10 +22,14 @@ init: ## Initialize submodules (run this first)
 cd ../../ && git submodule update --init --depth 1 submodules/browser-operator-core
 @echo "✅ Submodules initialized"
 
-init-devtools: ## Initialize browser-operator-core submodule only
-@echo "📦 Initializing browser-operator-core submodule..."
-cd ../../ && git submodule update --init --depth 1 submodules/browser-operator-core
-@echo "✅ browser-operator-core submodule initialized"
+init-devtools: ## Initialize browser-operator-core submodule only (if not already initialized)
+@if [ ! -f ../../submodules/browser-operator-core/package.json ]; then \
+echo "📦 Initializing browser-operator-core submodule..."; \
+cd ../../ && git submodule update --init --depth 1 submodules/browser-operator-core; \
+echo "✅ browser-operator-core submodule initialized"; \
+else \
+echo "✅ browser-operator-core already initialized (skipping to preserve branch)"; \
+fi
 
 build-devtools-base: init-devtools ## Build DevTools base image (slow, rarely needed)
 @echo "🔨 Building DevTools base layer (this takes ~30 minutes)..."
@@ -69,7 +73,7 @@ build: init ## Build extended image with DevTools frontend (smart: only builds D
 cd ../../ && docker build -f deployments/local/Dockerfile -t kernel-browser:extended .
 @echo "✅ Extended build complete"
 
-rebuild: init ## Force complete rebuild (including DevTools)
+rebuild: ## Force complete rebuild (including DevTools)
 @echo "🔄 Force rebuilding everything from scratch..."
 $(MAKE) --no-print-directory build-devtools
 cd ../../ && docker build -f deployments/local/Dockerfile -t kernel-browser:extended .
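
The new `init-devtools` guard skips `git submodule update` when the submodule already contains a `package.json`, so a manually checked-out branch such as `feat/js-eval-endpoint` is not reset. The same check can be sketched in Python, for example as a pre-flight step in a helper script. This mirrors the shell test in the Makefile rather than replacing it, and the function name `needs_submodule_init` is hypothetical.

```python
from pathlib import Path

# Path of the submodule relative to the repository root, as used in the Makefile.
SUBMODULE = Path("submodules/browser-operator-core")


def needs_submodule_init(repo_root: Path) -> bool:
    """Return True when the submodule checkout looks empty (no package.json).

    Mirrors the Makefile test `[ ! -f .../package.json ]`: an existing
    package.json means the submodule is populated and must be left alone
    so that its currently checked-out branch is preserved.
    """
    return not (repo_root / SUBMODULE / "package.json").is_file()
```

A caller would run `git submodule update --init --depth 1 submodules/browser-operator-core` only when this returns `True`.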

evals/.env.example

Lines changed: 18 additions & 1 deletion
@@ -10,9 +10,26 @@ GROQ_API_KEY=gsk-your-groq-api-key-here
 # Optional: OpenRouter API key (if using OpenRouter)
 OPENROUTER_API_KEY=your-openrouter-api-key-here
 
+# Optional: Cerebras API key (if using Cerebras models)
+CEREBRAS_API_KEY=your-cerebras-api-key-here
+
+# Optional: Anthropic API key (if using Claude models directly)
+ANTHROPIC_API_KEY=your-anthropic-api-key-here
+
+# Optional: Google API key (if using Gemini models directly)
+GOOGLE_API_KEY=your-google-api-key-here
+
 # Optional: LiteLLM configuration (if using LiteLLM)
+# LiteLLM allows you to use custom model endpoints (Ollama, vLLM, etc.)
+# Set these variables, then update config.yml to use provider: "litellm"
 LITELLM_API_KEY=your-litellm-api-key-here
-LITELLM_ENDPOINT=http://localhost:8000
+LITELLM_ENDPOINT=http://localhost:4000
+
+# Example LiteLLM endpoint configurations:
+# - Ollama local: http://localhost:11434
+# - LiteLLM proxy: http://localhost:4000
+# - vLLM server: http://localhost:8000
+# - Custom endpoint: http://172.16.55.34:4000
 
 # WebArena Infrastructure Configuration (Optional)
 # Only required when running WebArena evaluations against self-hosted sites
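
To illustrate how a file in this `KEY=VALUE` format can be read, here is a minimal standard-library sketch. It is not the framework's own loader, which this diff does not show, and it skips the quoting and `export` handling that a dedicated library would provide. The function name `parse_env_text` is hypothetical.

```python
from typing import Dict


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values
```

Applied to the example file above, the result would map `LITELLM_ENDPOINT` to `http://localhost:4000`, and the commented endpoint examples would be ignored.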

evals/CLAUDE.md

Lines changed: 8 additions & 4 deletions
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 
 This is the **Evaluation Framework** for testing browser automation agents. It uses **LLM-as-a-judge** to evaluate agent responses against defined criteria, with support for **visual verification** through screenshots.
 
-The framework is completely independent of the main browser-agent server and operates as a standalone Python application that communicates with the browser-agent-server API at http://localhost:8080.
+The framework is completely independent of the main agent server and operates as a standalone Python application that communicates with the agent server API at http://localhost:8080.
 
 ## Framework Structure

@@ -58,9 +58,13 @@ cp .env.example .env
 # Navigate to native runner directory
 cd native
 
-# Run a specific evaluation by path (relative to data/)
+# Run a specific evaluation by file path (relative to data/)
 python3 run.py --path test-simple/math-001.yaml
 
+# Run a specific evaluation by directory path (NEW: auto-detects task.yaml)
+python3 run.py --path js-verifier/action/dropdown
+python3 run.py --path js-verifier/action/daterange --verbose
+
 # Run with verbose output (shows input, response, reasoning, screenshots)
 python3 run.py --path action-agent/accordion-001.yaml --verbose

@@ -172,7 +176,7 @@ evals/
 │ ├── __init__.py # Library exports
 │ ├── config_loader.py # Configuration management
 │ ├── eval_loader.py # YAML evaluation loader
-│ ├── api_client.py # HTTP client for browser-agent-server
+│ ├── api_client.py # HTTP client for agent server
 │ ├── judge.py # LLMJudge, VisionJudge, SimpleJudge
 │ ├── webarena_adapter.py # WebArena task adapter
 │ └── webarena_evaluators.py # WebArena evaluators
@@ -639,7 +643,7 @@ task_id,site,intent,eval_types,status,score,response,execution_time_ms
 - **WebArenaTask** (lib/webarena_adapter.py:19-79) - Represents WebArena task
 - **EvalLoader** (lib/eval_loader.py:176-315) - Loads evals from YAML files
 - **WebArenaTaskLoader** (lib/webarena_adapter.py:172-330) - Loads WebArena tasks
-- **APIClient** (lib/api_client.py:10-382) - Communicates with browser-agent-server
+- **APIClient** (lib/api_client.py:10-382) - Communicates with agent server
 - **LLMJudge** (lib/judge.py:44-191) - Text-based evaluation
 - **VisionJudge** (lib/judge.py:193-386) - Visual verification
 - **StringEvaluator** (lib/webarena_evaluators.py:38-210) - String matching evaluation
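
The updated usage notes say that `--path` accepts either a YAML file or a directory, in which case `task.yaml` is auto-detected. The resolution logic in `run.py` is not part of this excerpt, so the following is only a sketch of how such a lookup might behave. The function name `resolve_eval_path` and the error handling are assumptions.

```python
from pathlib import Path


def resolve_eval_path(data_dir: Path, path_arg: str) -> Path:
    """Map a --path value (relative to data_dir) to a concrete YAML file."""
    candidate = data_dir / path_arg
    if candidate.is_dir():
        # Directory form: fall back to the conventional task.yaml inside it.
        candidate = candidate / "task.yaml"
    if not candidate.is_file():
        raise FileNotFoundError(f"No evaluation file at {candidate}")
    return candidate
```

Under these assumptions, `js-verifier/action/dropdown` resolves to `js-verifier/action/dropdown/task.yaml`, while `test-simple/math-001.yaml` is returned unchanged.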

evals/config.example.anthropic.yml

Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
+# Evaluation Framework Configuration
+# This configuration is shared across all evaluation runner scripts
+# Example configuration for Anthropic Claude models
+
+# API endpoint for the evaluation server
+api_endpoint: "http://localhost:8080"
+
+# Model configurations for running evaluations
+# These models are sent to the agent for processing requests
+
+main_model:
+  provider: "anthropic"
+  model_name: "claude-3-5-sonnet-20241022"
+  api_key: "${ANTHROPIC_API_KEY}"
+
+mini_model:
+  provider: "anthropic"
+  model_name: "claude-3-5-haiku-20241022"
+  api_key: "${ANTHROPIC_API_KEY}"
+
+nano_model:
+  provider: "anthropic"
+  model_name: "claude-3-5-haiku-20241022"
+  api_key: "${ANTHROPIC_API_KEY}"
+
+# Model configuration for judging evaluation responses
+# This model is used locally to assess the quality of agent responses
+
+judge_model:
+  provider: "anthropic"
+  model_name: "claude-3-5-sonnet-20241022"
+  api_key: "${ANTHROPIC_API_KEY}"
+
+# Execution settings
+
+execution:
+  # Default number of evaluations to run per script execution
+  default_limit: 20
+
+  # Timeout for API requests (seconds) - set to max for slow custom API
+  timeout: 3600
+
+  # Number of concurrent evaluation requests
+  concurrent_requests: 1
+
+  # Delay between requests (seconds)
+  request_delay: 1
+
+# Reporting settings
+
+reporting:
+  # Directory for storing evaluation reports
+  reports_dir: "reports"
+
+  # Report format
+  format: "csv"
+
+  # Include detailed judge reasoning in reports
+  include_reasoning: true
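
The `api_key` values in this config are `${ANTHROPIC_API_KEY}` placeholders, which implies that some loader substitutes environment variables after parsing. How `config_loader.py` does this is not shown in the excerpt, so here is a minimal sketch of one possible approach, applied to an already-parsed config dictionary. The names `expand_env` and `_PLACEHOLDER`, and the choice to raise on an unset variable, are assumptions.

```python
import os
import re
from typing import Any, Mapping

# Matches ${NAME} where NAME is a conventional environment variable name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value: Any, env: Mapping[str, str] = os.environ) -> Any:
    """Recursively replace ${VAR} in strings; raise KeyError if VAR is unset."""
    if isinstance(value, str):
        def _sub(match: "re.Match[str]") -> str:
            name = match.group(1)
            if name not in env:
                raise KeyError(f"environment variable {name} is not set")
            return env[name]
        return _PLACEHOLDER.sub(_sub, value)
    if isinstance(value, dict):
        return {key: expand_env(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_env(item, env) for item in value]
    # Numbers, booleans and None pass through unchanged.
    return value
```

Failing loudly on a missing variable is a deliberate choice in this sketch: a silently empty `api_key` would otherwise only surface later as an authentication error from the provider.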
