Open

Dev #370

31 commits
0f8fa83
some fixes
Chenglong-MS May 29, 2026
18cf360
small fixes
Chenglong-MS May 29, 2026
4cb0a2f
cleanup
Chenglong-MS May 29, 2026
c616338
some updates
Chenglong-MS May 29, 2026
a59ec9e
some cleanup
Chenglong-MS May 29, 2026
3a23dc3
workflow design
Chenglong-MS May 29, 2026
1571113
refactor(loading): Refactor AnvilLoader and add custom parameter support
zhb-y-agent May 30, 2026
b9abafb
test: keep zh locale keys aligned
cat0825 May 31, 2026
748a30c
bug fix and clean up
Chenglong-MS May 31, 2026
1c73e62
minor
Chenglong-MS Jun 2, 2026
9321bbf
Merge pull request #357 from cat0825/improve-zh-localization-copy
Chenglong-MS Jun 2, 2026
e14ce66
minor fix
Chenglong-MS Jun 2, 2026
b1eafdf
fix issues learned from build
Chenglong-MS Jun 5, 2026
7deaed7
some redesign
Chenglong-MS Jun 6, 2026
c27217b
cleanup
Chenglong-MS Jun 6, 2026
3f0312a
cleanup
Chenglong-MS Jun 6, 2026
21863b4
semantic ui test
Chenglong-MS Jun 9, 2026
5b45326
minor cleanup
Chenglong-MS Jun 9, 2026
b879c61
temp unified agent approach
Chenglong-MS Jun 11, 2026
e846fcb
minor fix
Chenglong-MS Jun 11, 2026
d210a3f
cleaning up
Chenglong-MS Jun 12, 2026
1fa3f12
cleaning up
Chenglong-MS Jun 12, 2026
01ba1fe
analyst: differentiate open-ended vs concrete exploration budget
Chenglong-MS Jun 17, 2026
6a839da
add loops
Chenglong-MS Jun 17, 2026
8bd4973
Add mini analyst agent with frontend mini-mode toggle and open-model …
Chenglong-MS Jun 18, 2026
fe36b74
Remove mini_notool ablation from MiniAnalystAgent
Chenglong-MS Jun 18, 2026
38e5667
Migrate model-evaluation README to high-level plan.md
Chenglong-MS Jun 18, 2026
d5a0acb
Merge pull request #367 from microsoft/mini-analyst-agent
Chenglong-MS Jun 18, 2026
70d4730
fixes
Chenglong-MS Jun 24, 2026
19c511a
some accessibility fixes
Chenglong-MS Jun 24, 2026
d40da0e
accessibility fix
Chenglong-MS Jun 24, 2026

66 changes: 66 additions & 0 deletions loops/model-evaluation/plan.md
@@ -0,0 +1,66 @@
# Loop — Open-Source (Ollama) Model Evaluation

**High-level plan.** Execute end-to-end, making reasonable decisions when details are
ambiguous, and record them in the final report (`report.md`; all working artifacts go
under `work/`).

## Goal

Benchmark open-source (Ollama) models that drive Data Formulator's analyst agents, which
inspect tabular data, write transformation code, and commit a visualization, and report
**two independent axes**:

1. **Success rate** — does the agent actually produce a rendered chart? (reliability)
2. **Quality when produced** — how good is the chart when it finishes, scored 0-100 by a
code + vision grader? (competence)

Keep them separate: a model can write good code yet fail to deliver it through the
protocol. The dominant open-model failure mode is **driving the tool/transport, not
analyzing the data**, so each model runs through more than one agent transport:

- `analyst` — native function/tool calls (with a content-JSON salvage fallback).
- `mini` — single-decision, pure-prompt JSON contract; the production low-cost agent.
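
The content-JSON salvage fallback mentioned for `analyst` can be pictured roughly as follows. This is a minimal sketch, not Data Formulator's actual implementation: the function name `salvage_json_action` and the `action` / `code` field names are hypothetical. It assumes the model emitted no native tool call and instead wrote its decision as JSON somewhere inside a free-text reply.

```python
import json
from typing import Any, Dict, Optional


def salvage_json_action(content: str) -> Optional[Dict[str, Any]]:
    """Return the first parseable JSON object found in free-form text, or None.

    Hypothetical helper: scans from each "{" and lets the standard-library
    decoder decide where the object ends, so surrounding prose is ignored.
    """
    decoder = json.JSONDecoder()
    start = content.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(content, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # Not valid JSON at this brace; try the next one.
        start = content.find("{", start + 1)
    return None


if __name__ == "__main__":
    # Field names below are illustrative only.
    reply = 'Here is my decision: {"action": "visualize", "code": "df.head()"} Done.'
    print(salvage_json_action(reply))
```

A salvage step like this only recovers well-formed JSON; a reply with truncated or malformed JSON still counts as a transport failure.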

Always include the Azure references `gpt-5.5`, `gpt-5-mini` as the baseline.

## Data

A frozen **45-question** set across **15 datasets** from the `../visbench` benchmark, fed
as the **raw / grouped source tables** (not VisBench's derived single-table `data.csv`) so
the agent must do its own joins:

- **vega_datasets** single tables — 9 single-table questions.
- **TidyTuesday** multi-CSV weeks — 18 multi-table questions.
- **Spider** databases grouped by DB — 18 multi-table questions.

Reuse VisBench's quality-filtered question and reference chart for each item. The single-
vs multi-table split (9 / 36) is the axis along which models diverge most.

## Steps

1. **Select & pull models** — the open roster across size tiers (1B → 120B) plus the three
Azure references.
2. **Prepare the benchmark** — materialize the 45 questions as raw/grouped tables and
freeze the VisBench questions + reference charts, reused identically across every model
and agent.
3. **Run agents** — every `(agent, model, question)` cell with `--agent` in `analyst`
and `mini`; capture the event stream and render each chart to PNG. Frozen controls:
`max_iterations = 5`, 240 s timeout, resumable.
4. **Score (two phases, GPT-5.5 grader):**
   - **Phase 1 — reliability:** five sequential gates (responded → emitted action → code
     ran → output → **produced chart**). The chart gate is decisive and defines the
     success rate; only runs that pass it proceed to Phase 2.
   - **Phase 2 — quality (0-100, produced charts only):** code review vs the question
     (0-50) + vision review of the rendered PNG vs the reference chart (0-50).
5. **Aggregate & report** — report the two axes separately (never collapse them); for
ranking only, derive success-weighted quality (Phase 2 over all 45, no-chart = 0) and
combined = `0.3 × (success_rate × 100) + 0.7 × success-weighted quality`. Always show
the single- vs multi-table split, the per-gate drop-off, comparison to the references,
and recommendations per size tier (with which `--agent`).
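
The aggregation described in steps 4 and 5 can be sketched as below. This is an illustrative sketch under stated assumptions, not the loop's actual scoring code: `Run`, `GATES`, and `summarize` are hypothetical names, each run is assumed to record how many consecutive Phase 1 gates it passed, and a Phase 2 score is assumed to exist for every run that produced a chart.

```python
from dataclasses import dataclass
from typing import List, Optional

# Illustrative labels for the five sequential Phase 1 gates.
GATES = ("responded", "emitted_action", "code_ran", "output", "produced_chart")


@dataclass
class Run:
    gates_passed: int                # consecutive gates passed, 0..5
    quality: Optional[float] = None  # Phase 2 score (0-100), produced charts only

    @property
    def produced_chart(self) -> bool:
        return self.gates_passed == len(GATES)


def summarize(runs: List[Run]) -> dict:
    """Aggregate one (agent, model) cell group; assumes `runs` is non-empty."""
    n = len(runs)
    produced = [r for r in runs if r.produced_chart]
    total_quality = sum(r.quality for r in produced)
    success_rate = len(produced) / n
    # Ranking only: runs without a chart contribute 0, averaged over all runs.
    success_weighted = total_quality / n
    return {
        "success_rate": success_rate,
        # Axis 2 is averaged over produced charts only.
        "quality_when_produced": total_quality / len(produced) if produced else None,
        "success_weighted_quality": success_weighted,
        "combined": 0.3 * (success_rate * 100) + 0.7 * success_weighted,
        # Per-gate drop-off: share of all runs that passed each gate.
        "gate_pass_rate": {
            gate: sum(r.gates_passed > i for r in runs) / n
            for i, gate in enumerate(GATES)
        },
    }


if __name__ == "__main__":
    demo = [Run(5, 80.0), Run(5, 60.0), Run(3), Run(0)]
    print(summarize(demo))
```

Keeping `quality_when_produced` and `success_rate` as separate fields mirrors the rule that the two axes are reported separately, with `combined` used for ranking only.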

## Principles

- **Two axes stay separate** — `combined` is for ranking only.
- **Freeze controls** — same questions, grader, `max_iterations`, and timeout across every cell.
- **`mini` is the production low-cost agent** — `simple` was removed; don't run `--agent simple`.
- **`uv` only**, no secrets (Azure auth via Entra ID), resumable, all artifacts under `work/`.
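
One way to keep the controls frozen and the run resumable is sketched below. The names `Controls`, `run_grid`, and the `done` set are hypothetical and not taken from the repository; only the values (`max_iterations = 5`, a 240 s timeout, and the `analyst` / `mini` agents) come from the plan above.

```python
from dataclasses import dataclass
from itertools import product
from typing import Iterable, Iterator, Set, Tuple

Cell = Tuple[str, str, int]  # (agent, model, question id)


@dataclass(frozen=True)  # frozen: controls cannot be mutated mid-run
class Controls:
    max_iterations: int = 5
    timeout_s: int = 240
    agents: Tuple[str, ...] = ("analyst", "mini")


def run_grid(
    models: Iterable[str],
    question_ids: Iterable[int],
    controls: Controls = Controls(),
    done: Set[Cell] = frozenset(),
) -> Iterator[Cell]:
    """Yield every (agent, model, question) cell that has not been run yet."""
    if "simple" in controls.agents:
        raise ValueError("the `simple` agent was removed; use `mini` instead")
    for cell in product(controls.agents, list(models), list(question_ids)):
        if cell not in done:  # resumable: skip cells with saved results
            yield cell


if __name__ == "__main__":
    cells = list(run_grid(["model-a", "model-b"], range(45)))
    print(len(cells))  # 2 agents x 2 models x 45 questions
```

In this sketch, a resumed run would rebuild `done` from whatever per-cell results already exist under `work/` before iterating.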
5 changes: 2 additions & 3 deletions py-src/data_formulator/agent_config.py
@@ -48,15 +48,14 @@
     # ── Heavy: code-gen, multi-step, tool-using ─────────────────────────────
     "data_transform": "low", # generates Python transform scripts
     "data_rec": "low", # chart / transformation recommendation
-    "data_agent": "low", # multi-step exploration agent
-    "report_gen": "low", # narrative + inspect/embed tools
+    "analyst": "low", # unified multi-step exploration + report agent
     "interactive_explore": "low", # exploration idea agent
     "data_loading_chat": "low", # conversational data loading w/ tools

     # ── Light: single-turn extractors / classifiers / formatters ────────────
     "data_load": "minimal", # one-shot type inference
     "data_clean": "minimal", # extract tables from text
-    "experience_distill": "minimal", # summarise an analysis context
+    "workflow_distill": "minimal", # summarise an analysis context
     "chart_insight": "minimal", # title + 1–3 takeaways from a chart
     "chart_restyle": "minimal", # apply style edits to a Vega-Lite spec
     "code_explanation": "minimal", # describe derived fields
9 changes: 0 additions & 9 deletions py-src/data_formulator/agents/__init__.py
@@ -1,22 +1,13 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.

from data_formulator.agents.agent_data_transform import DataTransformationAgent
from data_formulator.agents.agent_data_rec import DataRecAgent

from data_formulator.agents.agent_data_load import DataLoadAgent
from data_formulator.agents.agent_sort_data import SortDataAgent
from data_formulator.agents.agent_simple import SimpleAgents
from data_formulator.agents.agent_interactive_explore import InteractiveExploreAgent
from data_formulator.agents.agent_chart_insight import ChartInsightAgent
from data_formulator.agents.agent_chart_restyle import ChartRestyleAgent

__all__ = [
    "DataTransformationAgent",
    "DataRecAgent",
    "DataLoadAgent",
    "SortDataAgent",
    "InteractiveExploreAgent",
    "ChartInsightAgent",
    "ChartRestyleAgent",
]
150 changes: 0 additions & 150 deletions py-src/data_formulator/agents/agent_chart_insight.py

This file was deleted.
