Commit 3f0f163

committed
data documentation update
1 parent 7c1d5ae commit 3f0f163

1 file changed

Lines changed: 144 additions & 12 deletions

docs/eval/agents/web_agents_eval.md

@@ -23,8 +23,7 @@ tasks/eval/agents/web_agents/
2323
├── graph_config.yaml # Workflow and evaluation config
2424
├── chat_history_seed.json # Sample input data
2525
├── logs/ # Request/response logs
26-
├── metadata/ # Execution metadata
27-
└── README.md # This file
26+
└── metadata/ # Execution metadata
2827
```
2928

3029
---
@@ -412,23 +411,156 @@ python -m sygra.cli.run_graph \
412411

413412
### Input Data Format
414413

414+
The input data is stored in `tasks/eval/agents/web_agents/chat_history_seed.json`. The file currently contains **sample data for a single mission**, intended for testing and development.
415+
416+
#### Sample Data Structure
417+
418+
Each record in the input file represents one step (turn) of a mission:
419+
415420
```json
416421
{
417-
"id": "mission_1_step_1",
418-
"mission_id": "mission_1",
419-
"turn": 1,
420-
"mission": "Book a flight from NYC to LAX",
421-
"navigational_directions": "Click on the search button",
422+
"id": "mission_01_2",
423+
"mission_id": "mission_01",
424+
"mission": "search for one way flight from hyd to chennai on nov 1 2025",
425+
"date": "2025-11-11 15:12:56",
426+
"navigational_directions": "",
427+
"turn": 2,
428+
"chat_history": [
429+
{
430+
"role": "system",
431+
"content": [
432+
{
433+
"text": "You are a web automation agent...",
434+
"type": "text"
435+
}
436+
]
437+
},
438+
{
439+
"role": "user",
440+
"content": [
441+
{
442+
"text": "Help me now to complete the assigned mission...",
443+
"type": "text"
444+
}
445+
]
446+
},
447+
{
448+
"content": "I'll help you search for a one-way flight...",
449+
"role": "assistant",
450+
"tool_calls": [
451+
{
452+
"id": "tooluse_O5Dr64r9RC-lW8BNsdHTng",
453+
"type": "function",
454+
"function": {
455+
"name": "screenshot_tool",
456+
"arguments": "{\"take_screenshot\": true}"
457+
}
458+
}
459+
]
460+
},
461+
{
462+
"role": "tool",
463+
"tool_call_id": "tooluse_O5Dr64r9RC-lW8BNsdHTng",
464+
"name": "screenshot_tool",
465+
"content": "success"
466+
}
467+
],
468+
"current_user_text": "You are now midway through the assigned mission...",
469+
"current_tool_result": {
470+
"role": "tool",
471+
"tool_call_id": "tooluse_O5Dr64r9RC-lW8BNsdHTng",
472+
"name": "screenshot_tool",
473+
"content": [
474+
{
475+
"image": {
476+
"format": "png",
477+
"source": {
478+
"bytes": "iVBORw0KGgoAAAANSUhEUgAAA+gAAAPoCAIAAADCwUOz..."
479+
}
480+
}
481+
}
482+
]
483+
},
422484
"golden_response": {
423485
"tool": "click",
424-
"x": 500,
425-
"y": 300,
426-
"bbox": {"x": 480, "y": 280, "width": 40, "height": 40}
427-
},
428-
"chat_history": [...]
486+
"properties": {
487+
"x": 146.44,
488+
"y": 94.44,
489+
"width": 82.04,
490+
"height": 61.11,
491+
"offset_x": 0.0,
492+
"offset_y": 0.0
493+
}
494+
}
495+
}
496+
```
497+
498+
#### Field Descriptions
499+
500+
| Field | Type | Description |
501+
|-------|------|-------------|
502+
| `id` | string | Unique identifier for this step (format: `<mission_id>_<turn>`, e.g. `mission_01_2`) |
503+
| `mission_id` | string | Identifier for the mission this step belongs to |
504+
| `mission` | string | Description of the overall mission/task |
505+
| `date` | string | Timestamp of the mission (e.g. `2025-11-11 15:12:56`) |
506+
| `navigational_directions` | string | Optional hints or directions for this step (may be an empty string) |
507+
| `turn` | integer | Step number within the mission (1-indexed) |
508+
| `chat_history` | array | Complete conversation history up to this point |
509+
| `current_user_text` | string | The prompt text for the current step |
510+
| `current_tool_result` | object | Result of the previous tool execution (includes the screenshot as base64-encoded image data) |
511+
| `golden_response` | object | Expected correct response for evaluation |
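
The field table can be turned into a quick sanity check for a single record. The following is a minimal sketch, not part of the evaluation pipeline: the helper name `check_record` is hypothetical, and it assumes every field in the table is present on every record and that `id` follows the `<mission_id>_<turn>` pattern seen in the sample.

```python
# Field names and types come from the field description table.
# Assumption: all ten fields are required on every record.
EXPECTED_FIELDS = {
    "id": str,
    "mission_id": str,
    "mission": str,
    "date": str,
    "navigational_directions": str,
    "turn": int,
    "chat_history": list,
    "current_user_text": str,
    "current_tool_result": dict,
    "golden_response": dict,
}


def check_record(record: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the record looks fine."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            actual = type(record[name]).__name__
            problems.append(f"{name}: expected {expected_type.__name__}, got {actual}")
    # Only check the id convention once the fields it depends on are known to be valid.
    if not problems and record["id"] != f"{record['mission_id']}_{record['turn']}":
        problems.append("id does not match <mission_id>_<turn>")
    return problems
```

Running `check_record` over each record before an evaluation run is one way to catch malformed seed data early; whether stricter checks are needed depends on how the graph consumes these fields.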
512+
513+
#### Golden Response Structure
514+
515+
The `golden_response` contains the ground truth for evaluation:
516+
517+
**For Click Actions:**
518+
```json
519+
{
520+
"tool": "click",
521+
"properties": {
522+
"x": 146.44,
523+
"y": 94.44,
524+
"width": 82.04,
525+
"height": 61.11,
526+
"offset_x": 0.0,
527+
"offset_y": 0.0
528+
}
529+
}
530+
```
531+
532+
**For Typing Actions:**
533+
```json
534+
{
535+
"tool": "typing",
536+
"properties": {
537+
"text": "Hyderabad"
538+
}
429539
}
430540
```
431541

542+
**For Scroll Actions:**
543+
```json
544+
{
545+
"tool": "scroll",
546+
"properties": {
547+
"direction": "down",
548+
"amount": 200
549+
}
550+
}
551+
```
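
The three shapes above differ only in which keys appear under `properties`, so a small lookup table is enough to check a golden response for completeness. This is a sketch under assumptions: the required keys are inferred from the examples shown here rather than from a published schema, and `missing_properties` is a hypothetical helper name.

```python
# Required property keys per tool, inferred from the three examples above.
# Assumption: these are the only tools and every listed key is mandatory.
REQUIRED_PROPERTIES = {
    "click": {"x", "y", "width", "height", "offset_x", "offset_y"},
    "typing": {"text"},
    "scroll": {"direction", "amount"},
}


def missing_properties(golden_response: dict) -> set[str]:
    """Return the property keys that the golden response lacks for its tool."""
    tool = golden_response.get("tool")
    if tool not in REQUIRED_PROPERTIES:
        raise ValueError(f"unknown tool: {tool!r}")
    present = set(golden_response.get("properties", {}))
    return REQUIRED_PROPERTIES[tool] - present
```

For example, `missing_properties({"tool": "typing", "properties": {"text": "Hyderabad"}})` returns an empty set, while a scroll action without `amount` returns `{"amount"}`.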
552+
553+
#### Current Sample Data
554+
555+
The `chat_history_seed.json` file currently contains:
556+
- **1 mission** (`mission_01`)
557+
- **Multiple steps/turns** for that mission
558+
- Complete chat history for each step
559+
- Screenshots embedded as base64 in `current_tool_result`
560+
- Golden responses for evaluation
561+
562+
> **Note:** This is sample data for testing purposes. A production dataset would contain multiple missions with various web automation scenarios.
563+
432564
### Output Format
433565

434566
**Flattened Output:**
