Add Benchmarking section; rewrite semantic-scholar description as example #60

Merged
jbragg merged 1 commit into main from fix/prefer-asta-for-paper-search
May 19, 2026
Collaborator

@jbragg jbragg commented May 15, 2026

Summary

Adds a Benchmarking section to README documenting how to run Asta skills via agent-baselines' inspect-swe solver against Inspect-compatible eval suites, including paired comparisons for measuring skill changes. The semantic-scholar rewrite below serves as the worked example referenced from that section.

Rewrites skills/semantic-scholar/SKILL.md's description: field. The current description is a long list of trigger-phrase strings ("get paper details", "look up a paper", "find citations", ...) that the agent must match against the user's prompt. It misses single-fact queries about named research artifacts (benchmarks, datasets, models, methods) that don't use one of the literal trigger phrases, and it doesn't mention snippet-search (paper-body text search) at all.

Replace with a two-sentence form — capability first, then use-when:

Look up or search papers, authors, citations, and full-text snippets on Semantic Scholar.
Use for fast, targeted queries about a paper, author, or specific named research artifact
(benchmark, dataset, model, method, etc.) — not comprehensive reports.

What changes:

  • Surfaces snippet-search (full-text body search) as a first-class capability so the agent routes single-paper fact lookups here instead of dropping to PDF parsing.
  • Adds research artifact (benchmark / dataset / model / method / etc.) as a trigger object so the skill activates on operationally-named-item queries (e.g. "what does SVAMP stand for", "does OLMES recommend N in-context examples") that the literal-phrase trigger list misses.
  • Keeps the original "Use this for fast, targeted queries (not comprehensive reports)" qualifier verbatim — preserves the differentiation from the find-literature skill's comprehensive-search lane.
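To make the gap in the old description concrete, here is a deliberately simplified sketch. It assumes routing behaves like case-insensitive substring matching against the three trigger phrases quoted above; in practice the agent's model decides which skill to activate, so this is an illustration of the failure mode rather than the actual mechanism. The function name `matches_literal_trigger` is hypothetical.

```python
# Toy illustration only: real skill routing is done by the agent's model,
# not by substring matching. The phrases are the three quoted in the summary.
TRIGGER_PHRASES = ["get paper details", "look up a paper", "find citations"]


def matches_literal_trigger(prompt: str) -> bool:
    """Return True if any literal trigger phrase appears in the prompt."""
    lowered = prompt.lower()
    return any(phrase in lowered for phrase in TRIGGER_PHRASES)


prompts = [
    "Can you find citations for this paper?",
    "What does SVAMP stand for?",
    "Does OLMES recommend N in-context examples?",
]

# Map each prompt to whether a literal phrase matched it.
results = {p: matches_literal_trigger(p) for p in prompts}
for prompt, hit in results.items():
    print(f"{hit!s:5}  {prompt}")
```

Only the first prompt contains a literal phrase; the two artifact queries have nothing to match on, which is the case the "research artifact" trigger object is meant to cover.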

Validation

Paired comparison on the asta_multitool eval, four cases that operationally reference a named research artifact. claude_code 2.1.128 · sonnet-4-6 · ghcr.io/allenai/asta:v0.17.0 · authed (ASTA_TOKEN set) · no working-limit. n=3 per case per arm.

| case | prompt | baseline | this PR | ablation: drop "about X" |
|---|---|---|---|---|
| cbbd1475 | SVAMP acronym | 0.00 | 0.33 | 0.00 |
| fee5ee2a | AI2 reading-comprehension dataset recall | 0.00 | 0.67 | 0.50 |
| 1c13faea | OLMES in-context-examples recommendation | 1.00 | 1.00 | 1.00 |
| 9ad75f7d | LLaMA acronym | 1.00 | 1.00 | 1.00 |
| mean | | 0.50 | 0.75 (+0.25) | 0.625 |
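The mean row can be rechecked from the per-case scores. The sketch below copies the four rows as reported and averages each arm; the variable names are illustrative and not taken from the eval tooling.

```python
from statistics import mean

# Per-case scores copied from the table: (baseline, this PR, ablation).
scores = {
    "cbbd1475": (0.00, 0.33, 0.00),
    "fee5ee2a": (0.00, 0.67, 0.50),
    "1c13faea": (1.00, 1.00, 1.00),
    "9ad75f7d": (1.00, 1.00, 1.00),
}
arms = ("baseline", "this PR", "ablation")

# Unweighted mean over the four cases for each arm.
arm_means = {
    arm: mean(row[i] for row in scores.values()) for i, arm in enumerate(arms)
}

for arm, value in arm_means.items():
    delta = value - arm_means["baseline"]
    print(f"{arm:9} mean={value:.3f} delta={delta:+.3f}")
```

Rounded to three decimals this gives 0.500, 0.750, and 0.625, matching the reported means and the +0.25 lift.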

Mechanism: baseline samples on SVAMP (cbbd1475) and AI2-dataset (fee5ee2a) routinely activate the wrong skill (find-literature, or none at all) and drop to web-search/PDF fallback. This PR routes them to semantic-scholar → asta papers search / snippet-search, as the per-sample skill activations show.

The ablation isolates the trigger-object clause ("about a paper, author, or specific named research artifact (...)"). Dropping it (third column) loses part of the AI2-dataset routing gain (fee5ee2a: 0.67 → 0.50) and all of the SVAMP score lift (cbbd1475: 0.33 → 0.00), halving the mean gain (+0.25 → +0.125). That clause therefore accounts for the whole SVAMP fix and about half of the overall lift; the broader rephrasing and the new snippet-search mention account for the remainder.

(OLMES + LLaMA are at ceiling on baseline — no headroom for this change to demonstrate.)
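For the two cases with headroom, the ablation column allows a rough split of the lift into the part that disappears when the clause is dropped and the part that remains. This is a minimal sketch assuming the two effects simply add; with n=3 per case it should be read as indicative only. `split_lift` is a hypothetical helper, not part of the eval tooling.

```python
# (baseline, this PR, ablation) for the two cases that are not at ceiling.
headroom_cases = {
    "cbbd1475": (0.00, 0.33, 0.00),
    "fee5ee2a": (0.00, 0.67, 0.50),
}


def split_lift(baseline: float, full: float, ablation: float) -> dict:
    """Split (full - baseline) into the clause's share and the remainder.

    "clause" is the score lost when the trigger-object clause is dropped;
    "rest" is the lift that survives without it.
    """
    return {"clause": full - ablation, "rest": ablation - baseline}


attribution = {case: split_lift(*row) for case, row in headroom_cases.items()}

for case, parts in attribution.items():
    print(f"{case}: clause={parts['clause']:+.2f} rest={parts['rest']:+.2f}")
```

Under that additive assumption, the SVAMP lift is attributed entirely to the clause, while most of the AI2-dataset lift survives without it.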

Companion PRs

  • allenai/agent-baselines#26 — the resolver + reproducibility metadata that made this paired comparison measurable.
  • allenai/asta-bench-private#225 — adds an inspect_ai entry-points registration for astabench/ai2/evals/ tasks so the multitool can be loaded by registered name (astabench/asta_multitool_challenge) from agent-baselines' solvers/inspect-swe subproject via uv run --no-group astabench --with ../asta-bench-private ....

…mple

README: introduces a Benchmarking section linking out to agent-baselines'
inspect-swe solver for running Asta skills against Inspect eval suites
(astabench for science tasks), including paired comparisons for
measuring skill changes.

skills/semantic-scholar/SKILL.md: rewrites the description as the
worked example referenced from that section. Replaces the literal-
trigger-phrase list with a two-sentence capability + use-when form:

- Surfaces snippet-search (full-text body search) as a first-class
  capability so the agent routes single-paper fact lookups here
  instead of falling through to PDF parsing.
- Adds 'research artifact' (benchmark / dataset / model / method / etc.)
  as a trigger object so the skill activates on operationally-
  named-item queries that the literal-phrase list misses.
- Keeps the 'fast, targeted queries (not comprehensive reports)'
  qualifier to preserve the differentiation from the find-literature
  skill's comprehensive-search lane.

Paired n=3 comparison on asta_multitool (claude_code 2.1.128 /
sonnet-4-6 / asta:v0.17.0): baseline 0.50 -> this PR 0.75 (+0.25);
ablation that drops the 'about a paper, author, or specific named
research artifact (...)' clause: 0.625 (+0.125), confirming that
clause is what's doing the work.
@jbragg jbragg force-pushed the fix/prefer-asta-for-paper-search branch from 18a4a8b to 9411292 May 15, 2026 18:04
@jbragg jbragg requested review from mdarcy220 and rodneykinney May 15, 2026 18:05
Member

@rodneykinney rodneykinney left a comment

Neat!

@jbragg jbragg merged commit b0ca63a into main May 19, 2026
6 checks passed
@jbragg jbragg deleted the fix/prefer-asta-for-paper-search branch May 19, 2026 22:24