Add Benchmarking section; rewrite semantic-scholar description as example #60
Merged
…mple

README: introduces a Benchmarking section linking out to agent-baselines' inspect-swe solver for running Asta skills against Inspect eval suites (astabench for science tasks), including paired comparisons for measuring skill changes.

skills/semantic-scholar/SKILL.md: rewrites the description as the worked example referenced from that section. Replaces the literal-trigger-phrase list with a two-sentence capability + use-when form:

- Surfaces snippet-search (full-text body search) as a first-class capability so the agent routes single-paper fact lookups here instead of falling through to PDF parsing.
- Adds 'research artifact' (benchmark / dataset / model / method / etc.) as a trigger object so the skill activates on operationally-named-item queries that the literal-phrase list misses.
- Keeps the 'fast, targeted queries (not comprehensive reports)' qualifier to preserve the differentiation from the find-literature skill's comprehensive-search lane.

Paired n=3 comparison on asta_multitool (claude_code 2.1.128 / sonnet-4-6 / asta:v0.17.0): baseline 0.50 -> this PR 0.75 (+0.25); ablation that drops the 'about a paper, author, or specific named research artifact (...)' clause: 0.625 (+0.125), confirming that clause is what's doing the work.
Summary
Adds a Benchmarking section to README documenting how to run Asta skills via agent-baselines' inspect-swe solver against Inspect-compatible eval suites, including paired comparisons for measuring skill changes. The semantic-scholar rewrite below serves as the worked example referenced from that section.

Rewrites skills/semantic-scholar/SKILL.md's description: field. The current description is a long list of trigger-phrase strings ("get paper details", "look up a paper", "find citations", ...) that the agent must match against the user's prompt. It misses single-fact queries about named research artifacts (benchmarks, datasets, models, methods) that don't use one of the literal trigger phrases, and it doesn't mention snippet-search (paper-body text search) at all.

Replaces it with a two-sentence form: capability first, then use-when.
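What changes:

- Surfaces snippet-search (full-text body search) as a first-class capability so the agent routes single-paper fact lookups here instead of dropping to PDF parsing.
- Adds 'research artifact' (benchmark / dataset / model / method / etc.) as a trigger object so the skill activates on operationally-named-item queries that the literal-phrase list misses.
- Keeps the 'fast, targeted queries (not comprehensive reports)' qualifier to preserve the differentiation from the find-literature skill's comprehensive-search lane.

Validation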
Paired comparison on the asta_multitool eval, four cases that operationally reference a named research artifact. claude_code 2.1.128 · sonnet-4-6 · ghcr.io/allenai/asta:v0.17.0 · authed (ASTA_TOKEN set) · no working-limit. n=3 per case per arm.

Cases: cbbd1475 (SVAMP), fee5ee2a (AI2-dataset), 1c13faea, 9ad75f7d.

Mechanism: baseline samples on SVAMP (cbbd1475) and AI2-dataset (fee5ee2a) routinely activate the wrong skill (find-literature, or none at all) and drop to web-search/PDF fallback. This PR routes them to semantic-scholar → asta papers search / snippet-search, which is visible in the per-sample skill activations.

Ablation isolates the trigger-object clause ("about a paper, author, or specific named research artifact (...)"). Dropping it (third column) loses most of the AI2-dataset routing fix (fee5ee2a: 0.67 → 0.50) and all of the SVAMP score lift (cbbd1475: 0.33 → 0.00), confirming that phrase is what's doing the work, not the broader rephrasing or the new snippet-search mention.

(OLMES + LLaMA are at ceiling on baseline, so there is no headroom for this change to demonstrate.)
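The headline numbers (baseline 0.50, this PR 0.75, ablation 0.625) are means over the four cases, and the arithmetic can be sanity-checked with a short sketch. Only the 0.33 / 0.67 / 0.50 / 0.00 values below are quoted in this description. The baseline scores for SVAMP and AI2-dataset (0.00) and the ceiling value for OLMES and LLaMA (1.00) are assumptions chosen to be consistent with the reported means, not values read from the results table. The names `CASE_SCORES`, `arm_means`, and `deltas_vs_baseline` are hypothetical.

```python
from statistics import mean

ARMS = ("baseline", "pr", "ablation")

# Per-case mean score (n=3 samples per case per arm).
# "quoted"  = value stated in the PR description.
# "ASSUMED" = illustrative value, inferred so the arm means match the
#             reported 0.50 / 0.75 / 0.625; not taken from the source table.
CASE_SCORES = {
    # baseline ASSUMED; pr and ablation quoted
    "cbbd1475 (SVAMP)": {"baseline": 0.00, "pr": 0.33, "ablation": 0.00},
    # baseline ASSUMED; pr and ablation quoted
    "fee5ee2a (AI2-dataset)": {"baseline": 0.00, "pr": 0.67, "ablation": 0.50},
    # "at ceiling" is quoted; the ceiling being 1.00 in every arm is ASSUMED
    "OLMES": {"baseline": 1.00, "pr": 1.00, "ablation": 1.00},
    "LLaMA": {"baseline": 1.00, "pr": 1.00, "ablation": 1.00},
}


def arm_means(case_scores):
    """Average each arm's per-case score across all cases."""
    return {
        arm: mean(scores[arm] for scores in case_scores.values())
        for arm in ARMS
    }


def deltas_vs_baseline(means):
    """Difference between each non-baseline arm and the baseline arm."""
    return {
        arm: means[arm] - means["baseline"]
        for arm in ARMS
        if arm != "baseline"
    }


if __name__ == "__main__":
    means = arm_means(CASE_SCORES)
    for arm, delta in deltas_vs_baseline(means).items():
        print(f"{arm}: mean={means[arm]:.3f} delta={delta:+.3f}")
```

Under these assumptions the two cases with headroom account for the entire difference between arms, which is the same reading the ablation paragraph gives.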
Companion PRs
- inspect_ai entry-points registration for astabench/ai2/evals/ tasks so the multitool can be loaded by registered name (astabench/asta_multitool_challenge) from agent-baselines' solvers/inspect-swe subproject via uv run --no-group astabench --with ../asta-bench-private ....