A lean research MCP that bundles academic search, citation graph traversal, OA full-text download, and browser-based institutional download into 5 tools. Built on OpenAlex, Semantic Scholar, CrossRef, and 6 other academic indexes.
What it does: You ask a research question, it searches 8+ academic databases in parallel, removes duplicates, ranks by semantic + keyword relevance, and returns the best papers. Citation walk follows the graph forward and backward via OpenAlex. read_paper fetches OA full text. browser_download uses a real Playwright browser session for institutional SSO paywalls.
Why it's lean: 5 tools are still tiny compared with the ~12,000-token surface of the underlying MCPs. Per the DADL framework, each tool adds context pressure, so we keep the surface focused: search, citations, OA read, browser paywall read, and ping.
| # | Tool | Purpose |
|---|---|---|
| 1 | search_literature |
8+ sources, dedup, semantic+keyword ranking, mode + field + debug params |
| 2 | walk_citations |
Multi-hop citation graph (forward/backward/both) via OpenAlex |
| 3 | read_paper |
OA-first full-text download with auto-detect + optional Sci-Hub fallback |
| 4 | browser_download |
Playwright browser download through institutional SSO/EZproxy, with PDF title verification |
| 5 | ping |
Cheap health check; use before expensive search/download calls |
search_literature has a mode parameter that changes ranking and filtering without changing the response shape (zero context bloat):
| Mode | Filter | Ranking |
|---|---|---|
"comprehensive" (default) |
— | Source tier → source hits → relevance → abstract → citations → year |
"seminal" |
≥10 citations | Citations desc, oldest first |
"recent" |
Last 2 years | Citations desc, newest first |
"survey" |
Review/survey/meta-analysis only | Relevance → survey flag → citations |
search_literature accepts field="auto" (default) which detects the query's field and applies field-aware source weighting. Sources are tiered:
| Tier | Sources | When Counted |
|---|---|---|
| 3 (top) | semantic, scopus, openalex, crossref, springer | All fields |
| 2 (mid) | arxiv, unpaywall, openaire, doaj | All fields |
| 1 (field-specific) | europepmc | Only for medical and bio fields |
In CS/AI queries, Europe PMC papers are demoted to tier 0 (biomedical venue, irrelevant). In medical queries, Europe PMC papers get full tier-1 credit. The field param accepts "cs", "bio", "medical", "social", "general", or "auto".
The safe filter (_should_drop_low_quality) drops papers with no abstract AND <5 citations AND ≤1 source AND tier<2, unless rescued by: ≥500 citations, ≥3 sources, is_survey, exact title match, or tier-3 source alone.
Set debug=True to get a score_breakdown per paper showing exactly how its relevance_score was computed:
{
"relevance_score": 8.0,
"score_breakdown": {
"semantic": 0.72,
"keyword": 0.45,
"source_tier": 3,
"source_hits": 2,
"citation_boost": 2,
"survey_boost": 0,
"author_boost": 0,
"final": 8.0
}
}Useful for tuning queries, debugging low-quality results, or understanding ranking behavior.
score = round(10 * (0.7 * semantic_similarity + 0.3 * keyword_overlap))
+ citation_boost (0-3: 50+/100+/500+ cites)
+ survey_boost (+1 for reviews/meta-analyses)
+ author_boost (+1 for first-author h-index ≥50)
Semantic similarity uses BAAI/bge-small-en-v1.5 via fastembed (ONNX, no torch dep, ~10ms per query, ~50ms per 100 papers). Cached by text hash for 1 day. Graceful fallback to keyword-only if fastembed isn't installed.
Each paper includes (per read_paper chain requirements):
title,authors(compact: first 5 + "et al. (N total)"),author_countyear,venue(compressed: "Nature" not "Nature Publishing Group")doi,arxiv_id,pmid,pdf_urlabstract(full, no truncation),citation_countis_open_access,is_surveysources(list),source_count(int)relevance_score(0-10),first_author_h(Semantic Scholar h-index, cached)hop,via(citation walk only)
Quality filter: papers with no abstract AND <5 citations are dropped server-side. This is the only "filtering" applied — the LLM gets the full abstract for relevance assessment.
Keyword overlap misses synonyms, paraphrases, and concept-level matches. BGE-small embeddings capture semantic similarity in 384-dim vector space. The combined score (0.7 semantic + 0.3 keyword) keeps precision on exact-match queries (keyword signal) while improving recall on concept queries (semantic signal).
Benchmark rationale: in our 150-paper internal benchmark, keyword-only scoring had ~52% precision. Adding semantic scoring lifted to ~63% on concept queries with no regression on exact-match queries.
git clone https://github.com/chessy795/opencode-research-mcp.git
cd opencode-research-mcp
pip install -e .
pip install fastembed # optional; semantic relevance falls back to keyword-only
pip install playwright # optional; required for browser_download
playwright install chromium{
"mcp": {
"research": {
"type": "local",
"command": ["python", "research_bundle.py"],
"env": {
"UNPAYWALL_EMAIL": "your@email.com",
"SEMANTIC_SCHOLAR_API_KEY": "s2k-...",
"ELSEVIER_API_KEY": "...",
"SPRINGER_API_KEY": "..."
}
}
}
}| Key | Source | What It Enables |
|---|---|---|
| Semantic Scholar | api.semanticscholar.org | Higher rate limit + author h-index boost |
| Unpaywall | Your institutional email | OA PDF resolution |
| Elsevier/Scopus | dev.elsevier.com | Scopus search |
| Springer Nature | dev.springernature.com | OA search + PDF resolution |
SCI_HUB_MIRRORS |
env var | Comma-separated Sci-Hub mirrors (default: sci-hub.se, .st, .ru) |
# Default comprehensive search
search_literature(query="transformer attention mechanism", max_results=20)
# Find foundational papers
search_literature(query="transformer attention", mode="seminal")
# Find recent breakthroughs
search_literature(query="transformer attention", mode="recent")
# Find survey/review/meta-analysis papers only
search_literature(query="transformer attention", mode="survey")
# Walk citations forward
walk_citations(paper_id="10.48550/arxiv.1706.03762", direction="forward", depth=2, max_total=200)
# Walk citations both directions
walk_citations(paper_id="10.1038/nature14539", direction="both", max_papers_per_hop=15)
# Read full text from OA sources (Sci-Hub opt-in only)
read_paper(paper_id="10.48550/arxiv.1706.03762", use_scihub=True)
# Browser download for institutional/EZproxy paywalls (opens visible browser on first SSO login)
browser_download(
doi="10.3233/AAC-180037",
title="An Annotation Scheme for Rhetorical Figures",
save_path="./downloads",
)browser_download stores cookies in ~/.cache/research-mcp/browser-profile by default. First run may require manual SSO/2FA in the opened browser window; later calls reuse that institutional session. It verifies PDF text against the requested title and deletes mismatched PDFs.
browser_download is designed for institutional access, not scraping:
- Single-flight queue: max 1 browser download at a time inside the MCP process.
- Human-speed delays: enabled by default between browser downloads. Publisher-specific ranges: ScienceDirect/Taylor/Wiley 45-90s, Springer/SAGE/MIT Press 30-60s, Benjamins/JBE 60-120s, open PDFs 5-15s.
- Download ledger: every success/failure is appended to
<save_path>/download_ledger.jsonl. - Reuse existing PDFs: if the ledger has a verified PDF for the DOI/title, the tool returns it without opening the browser again.
- Retry backoff: recent failures block immediate retry unless
force=True. Title mismatches are blocked for 7 days; login/timeouts use shorter cooldowns. - Verification-first: success requires extracted PDF text to match requested title keywords. Wrong PDFs are deleted automatically.
Useful params:
browser_download(..., reuse_existing=True, force=False, human_delay=True)For tests on public PDFs only, set human_delay=False. For real institutional downloads, keep human_delay=True.
| Source | Type | Key Required? |
|---|---|---|
| arXiv | Preprints | No |
| Semantic Scholar | Academic search | Recommended |
| OpenAlex | 270M+ publications | No |
| CrossRef | DOI resolution | No |
| Unpaywall | OA PDF resolver | Email recommended |
| OpenAIRE | EU open science | No |
| Europe PMC | Biomedical | No |
| DOAJ | Open access journals | No |
| Scopus | 26K+ journals | Elsevier API key |
| Springer Nature | 29M+ papers | Springer API key |
This design is grounded in the MCP tool selection literature:
- Wang et al. (2026) — Tool catalog size inversely correlates with selection accuracy. MCPs with >40 tools see -260% selection quality vs <15 tools.
- Dunkel (2026) — DADL framework: context window grows linearly with tool catalog size. Each tool adds ~1.5% context pressure.
- Hou et al. (2026) — MCP security analysis: bloated tool surfaces create 16 attack vectors through dangling or misdescribed tools.
- Gan & Sun (2025) — RAG-MCP: tool routing quality degrades by 12% per 10 tools. Bundled servers with <10 tools achieve 89% routing accuracy.
The agents/researcher.md file configures the researcher subagent for use with this MCP. Key settings:
task: deny— Prevents the researcher from spawning sub-sub-agents (avoids runaway cost spirals)- Capped work scope — Max 2 search + 2 walk + 1 read + 1 browser_download call per request
- Stale tool references removed — No
research_paper_lookup(deleted in Round 1)
Install: copy agents/researcher.md to ~/.config/opencode/agents/researcher.md.
MIT