Add Benchmarking section; rewrite semantic-scholar description as example by jbragg · Pull Request #60 · allenai/asta-plugins

jbragg · 2026-05-15T18:03:14Z

Summary

Adds a Benchmarking section to README documenting how to run Asta skills via agent-baselines' inspect-swe solver against Inspect-compatible eval suites, including paired comparisons for measuring skill changes. The semantic-scholar rewrite below serves as the worked example referenced from that section.

Rewrites skills/semantic-scholar/SKILL.md's description: field. The current description is a long list of trigger-phrase strings ("get paper details", "look up a paper", "find citations", ...) that the agent must match against the user's prompt. It misses single-fact queries about named research artifacts (benchmarks, datasets, models, methods) that don't use one of the literal trigger phrases, and it doesn't mention snippet-search (paper-body text search) at all.

Replace with a two-sentence form — capability first, then use-when:

Look up or search papers, authors, citations, and full-text snippets on Semantic Scholar.
Use for fast, targeted queries about a paper, author, or specific named research artifact
(benchmark, dataset, model, method, etc.) — not comprehensive reports.

What changes:

- Surfaces snippet-search (full-text body search) as a first-class capability so the agent routes single-paper fact lookups here instead of dropping to PDF parsing.
- Adds research artifact (benchmark / dataset / model / method / etc.) as a trigger object so the skill activates on operationally-named-item queries (e.g. "what does SVAMP stand for", "does OLMES recommend N in-context examples") that the literal-phrase trigger list misses.
- Keeps the original "Use this for fast, targeted queries (not comprehensive reports)" qualifier verbatim, preserving the differentiation from the find-literature skill's comprehensive-search lane.
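
The gap in the literal trigger-phrase list can be illustrated with a toy matcher. This is only a sketch: the agent's model does the real routing, not a substring check, and `matches_literal_phrase` is a hypothetical helper. The three phrases and the two artifact prompts are the ones quoted in this PR; the first prompt is made up for contrast.

```python
# Toy illustration only: real skill routing is done by the agent's model,
# not by substring matching. TRIGGER_PHRASES holds the three phrases quoted
# in this PR; the full list in the old description is longer.
TRIGGER_PHRASES = ["get paper details", "look up a paper", "find citations"]

PROMPTS = [
    "look up a paper on in-context learning",      # hypothetical prompt
    "what does SVAMP stand for",                   # quoted in this PR
    "does OLMES recommend N in-context examples",  # quoted in this PR
]


def matches_literal_phrase(prompt: str, phrases: list[str]) -> bool:
    """Return True if any trigger phrase occurs verbatim in the prompt."""
    lowered = prompt.lower()
    return any(phrase in lowered for phrase in phrases)


results = {p: matches_literal_phrase(p, TRIGGER_PHRASES) for p in PROMPTS}
for prompt, hit in results.items():
    print(f"{hit!s:5}  {prompt}")
```

Under this simplification, only the first prompt hits a phrase; the two artifact queries do not, which is the kind of miss the new trigger-object clause is meant to cover.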

Validation

Paired comparison on the asta_multitool eval, four cases that operationally reference a named research artifact. claude_code 2.1.128 · sonnet-4-6 · ghcr.io/allenai/asta:v0.17.0 · authed (ASTA_TOKEN set) · no working-limit. n=3 per case per arm.

| case | prompt | baseline | this PR | ablation: drop "about X" |
| --- | --- | --- | --- | --- |
| `cbbd1475` | SVAMP acronym | 0.00 | 0.33 | 0.00 |
| `fee5ee2a` | AI2 reading-comprehension dataset recall | 0.00 | 0.67 | 0.50 |
| `1c13faea` | OLMES in-context-examples recommendation | 1.00 | 1.00 | 1.00 |
| `9ad75f7d` | LLaMA acronym | 1.00 | 1.00 | 1.00 |
| mean | | 0.50 | 0.75 (+0.25) | 0.625 |

Mechanism: baseline samples on SVAMP (cbbd1475) and AI2-dataset (fee5ee2a) routinely activate the wrong skill (find-literature, or none at all) and drop to web-search/PDF fallback. This PR routes them to semantic-scholar → asta papers search / snippet-search — visible in the per-sample skill activations.

The ablation isolates the trigger-object clause ("about a paper, author, or specific named research artifact (...)"). Dropping it (last column) loses part of the AI2-dataset gain (fee5ee2a: 0.67 → 0.50) and all of the SVAMP lift (cbbd1475: 0.33 → 0.00), halving the mean lift from +0.25 to +0.125. That points to the clause as what drives the SVAMP fix, rather than the broader rephrasing or the new snippet-search mention; the AI2-dataset case keeps most of its gain without it.

(OLMES + LLaMA are at ceiling on baseline, so there is no headroom for this change to show an effect.)
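
The mean row and the lift figures can be re-derived from the per-case scores with a short sketch. The scores are copied from the table above; the arm keys (`baseline`, `this_pr`, `ablation`) and the helper `arm_mean` are labels chosen here for illustration, not names from the eval harness.

```python
from statistics import mean

# Per-case scores copied from the Validation table (n=3 per case per arm).
SCORES = {
    "cbbd1475": {"baseline": 0.00, "this_pr": 0.33, "ablation": 0.00},
    "fee5ee2a": {"baseline": 0.00, "this_pr": 0.67, "ablation": 0.50},
    "1c13faea": {"baseline": 1.00, "this_pr": 1.00, "ablation": 1.00},
    "9ad75f7d": {"baseline": 1.00, "this_pr": 1.00, "ablation": 1.00},
}


def arm_mean(arm: str) -> float:
    """Unweighted mean over the four cases for one arm."""
    return mean(case[arm] for case in SCORES.values())


baseline = arm_mean("baseline")
this_pr = arm_mean("this_pr")
ablation = arm_mean("ablation")

# Lift of each non-baseline arm over the baseline mean.
lift = this_pr - baseline
ablation_lift = ablation - baseline

print(f"baseline {baseline:.3f}")
print(f"this PR  {this_pr:.3f} ({lift:+.3f})")
print(f"ablation {ablation:.3f} ({ablation_lift:+.3f})")
```

The printed means match the table's mean row, and the ablation retains half of the mean lift. With four cases and n=3 per arm, these are small-sample figures and should be read as indicative.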

Companion PRs

allenai/agent-baselines#26 — the resolver + reproducibility metadata that made this paired comparison measurable.
allenai/asta-bench-private#225 — adds an inspect_ai entry-points registration for astabench/ai2/evals/ tasks so the multitool can be loaded by registered name (astabench/asta_multitool_challenge) from agent-baselines' solvers/inspect-swe subproject via uv run --no-group astabench --with ../asta-bench-private ....

…mple

README: introduces a Benchmarking section linking out to agent-baselines' inspect-swe solver for running Asta skills against Inspect eval suites (astabench for science tasks), including paired comparisons for measuring skill changes.

skills/semantic-scholar/SKILL.md: rewrites the description as the worked example referenced from that section. Replaces the literal-trigger-phrase list with a two-sentence capability + use-when form:

- Surfaces snippet-search (full-text body search) as a first-class capability so the agent routes single-paper fact lookups here instead of falling through to PDF parsing.
- Adds 'research artifact' (benchmark / dataset / model / method / etc.) as a trigger object so the skill activates on operationally-named-item queries that the literal-phrase list misses.
- Keeps the 'fast, targeted queries (not comprehensive reports)' qualifier to preserve the differentiation from the find-literature skill's comprehensive-search lane.

Paired n=3 comparison on asta_multitool (claude_code 2.1.128 / sonnet-4-6 / asta:v0.17.0): baseline 0.50 -> this PR 0.75 (+0.25); ablation that drops the 'about a paper, author, or specific named research artifact (...)' clause: 0.625 (+0.125), confirming that clause is what's doing the work.

rodneykinney

Neat!

jbragg force-pushed the fix/prefer-asta-for-paper-search branch from 18a4a8b to 9411292 Compare May 15, 2026 18:04

jbragg requested review from mdarcy220 and rodneykinney May 15, 2026 18:05

rodneykinney approved these changes May 15, 2026

jbragg merged commit b0ca63a into main May 19, 2026
6 checks passed

jbragg deleted the fix/prefer-asta-for-paper-search branch May 19, 2026 22:24

jbragg mentioned this pull request May 20, 2026

One workspace skill (scaffold/see/save); add asta_skills suite as new-evals example #63

Merged
