research-mcp

A lean research MCP that bundles academic search, citation graph traversal, OA full-text download, and browser-based institutional download into 5 tools. Built on OpenAlex, Semantic Scholar, CrossRef, and 6 other academic indexes.

What it does: You ask a research question, it searches 8+ academic databases in parallel, removes duplicates, ranks by semantic + keyword relevance, and returns the best papers. Citation walk follows the graph forward and backward via OpenAlex. read_paper fetches OA full text. browser_download uses a real Playwright browser session for institutional SSO paywalls.

Why it's lean: 5 tools are still tiny compared with the ~12,000-token surface of the underlying MCPs. Per the DADL framework, each tool adds context pressure, so we keep the surface focused: search, citations, OA read, browser paywall read, and ping.

5 Tools

#	Tool	Purpose
1	`search_literature`	8+ sources, dedup, semantic+keyword ranking, `mode` + `field` + `debug` params
2	`walk_citations`	Multi-hop citation graph (forward/backward/both) via OpenAlex
3	`read_paper`	OA-first full-text download with auto-detect + optional Sci-Hub fallback
4	`browser_download`	Playwright browser download through institutional SSO/EZproxy, with PDF title verification
5	`ping`	Cheap health check; use before expensive search/download calls

Search Modes

search_literature has a mode parameter that changes ranking and filtering without changing the response shape (zero context bloat):

Mode	Filter	Ranking
`"comprehensive"` (default)	—	Source tier → source hits → relevance → abstract → citations → year
`"seminal"`	≥10 citations	Citations desc, oldest first
`"recent"`	Last 2 years	Citations desc, newest first
`"survey"`	Review/survey/meta-analysis only	Relevance → survey flag → citations

Field-Aware Source Tiering

search_literature accepts field="auto" (default) which detects the query's field and applies field-aware source weighting. Sources are tiered:

Tier	Sources	When Counted
3 (top)	semantic, scopus, openalex, crossref, springer	All fields
2 (mid)	arxiv, unpaywall, openaire, doaj	All fields
1 (field-specific)	europepmc	Only for `medical` and `bio` fields

In CS/AI queries, Europe PMC papers are demoted to tier 0 (biomedical venue, irrelevant). In medical queries, Europe PMC papers get full tier-1 credit. The field param accepts "cs", "bio", "medical", "social", "general", or "auto".

The safe filter (_should_drop_low_quality) drops papers with no abstract AND <5 citations AND ≤1 source AND tier<2, unless rescued by: ≥500 citations, ≥3 sources, is_survey, exact title match, or tier-3 source alone.

Debug Mode

Set debug=True to get a score_breakdown per paper showing exactly how its relevance_score was computed:

{
  "relevance_score": 8.0,
  "score_breakdown": {
    "semantic": 0.72,
    "keyword": 0.45,
    "source_tier": 3,
    "source_hits": 2,
    "citation_boost": 2,
    "survey_boost": 0,
    "author_boost": 0,
    "final": 8.0
  }
}

Useful for tuning queries, debugging low-quality results, or understanding ranking behavior.

Relevance Scoring (0-10 per paper)

score = round(10 * (0.7 * semantic_similarity + 0.3 * keyword_overlap))
      + citation_boost   (0-3: 50+/100+/500+ cites)
      + survey_boost      (+1 for reviews/meta-analyses)
      + author_boost      (+1 for first-author h-index ≥50)

Semantic similarity uses BAAI/bge-small-en-v1.5 via fastembed (ONNX, no torch dep, ~10ms per query, ~50ms per 100 papers). Cached by text hash for 1 day. Graceful fallback to keyword-only if fastembed isn't installed.

Token-Efficient Response Shape

Each paper includes (per read_paper chain requirements):

title, authors (compact: first 5 + "et al. (N total)"), author_count
year, venue (compressed: "Nature" not "Nature Publishing Group")
doi, arxiv_id, pmid, pdf_url
abstract (full, no truncation), citation_count
is_open_access, is_survey
sources (list), source_count (int)
relevance_score (0-10), first_author_h (Semantic Scholar h-index, cached)
hop, via (citation walk only)

Quality filter: papers with no abstract AND <5 citations are dropped server-side. This is the only "filtering" applied — the LLM gets the full abstract for relevance assessment.

Why Semantic Relevance

Keyword overlap misses synonyms, paraphrases, and concept-level matches. BGE-small embeddings capture semantic similarity in 384-dim vector space. The combined score (0.7 semantic + 0.3 keyword) keeps precision on exact-match queries (keyword signal) while improving recall on concept queries (semantic signal).

Benchmark rationale: in our 150-paper internal benchmark, keyword-only scoring had ~52% precision. Adding semantic scoring lifted to ~63% on concept queries with no regression on exact-match queries.

Setup

git clone https://github.com/chessy795/opencode-research-mcp.git
cd opencode-research-mcp
pip install -e .
pip install fastembed  # optional; semantic relevance falls back to keyword-only
pip install playwright # optional; required for browser_download
playwright install chromium

opencode Config

{
  "mcp": {
    "research": {
      "type": "local",
      "command": ["python", "research_bundle.py"],
      "env": {
        "UNPAYWALL_EMAIL": "your@email.com",
        "SEMANTIC_SCHOLAR_API_KEY": "s2k-...",
        "ELSEVIER_API_KEY": "...",
        "SPRINGER_API_KEY": "..."
      }
    }
  }
}

API Keys (all optional)

Key	Source	What It Enables
Semantic Scholar	api.semanticscholar.org	Higher rate limit + author h-index boost
Unpaywall	Your institutional email	OA PDF resolution
Elsevier/Scopus	dev.elsevier.com	Scopus search
Springer Nature	dev.springernature.com	OA search + PDF resolution
`SCI_HUB_MIRRORS`	env var	Comma-separated Sci-Hub mirrors (default: sci-hub.se, .st, .ru)

Usage

# Default comprehensive search
search_literature(query="transformer attention mechanism", max_results=20)

# Find foundational papers
search_literature(query="transformer attention", mode="seminal")

# Find recent breakthroughs
search_literature(query="transformer attention", mode="recent")

# Find survey/review/meta-analysis papers only
search_literature(query="transformer attention", mode="survey")

# Walk citations forward
walk_citations(paper_id="10.48550/arxiv.1706.03762", direction="forward", depth=2, max_total=200)

# Walk citations both directions
walk_citations(paper_id="10.1038/nature14539", direction="both", max_papers_per_hop=15)

# Read full text from OA sources (Sci-Hub opt-in only)
read_paper(paper_id="10.48550/arxiv.1706.03762", use_scihub=True)

# Browser download for institutional/EZproxy paywalls (opens visible browser on first SSO login)
browser_download(
    doi="10.3233/AAC-180037",
    title="An Annotation Scheme for Rhetorical Figures",
    save_path="./downloads",
)

browser_download stores cookies in ~/.cache/research-mcp/browser-profile by default. First run may require manual SSO/2FA in the opened browser window; later calls reuse that institutional session. It verifies PDF text against the requested title and deletes mismatched PDFs.

Browser Download Guardrails

browser_download is designed for institutional access, not scraping:

Single-flight queue: max 1 browser download at a time inside the MCP process.
Human-speed delays: enabled by default between browser downloads. Publisher-specific ranges: ScienceDirect/Taylor/Wiley 45-90s, Springer/SAGE/MIT Press 30-60s, Benjamins/JBE 60-120s, open PDFs 5-15s.
Download ledger: every success/failure is appended to <save_path>/download_ledger.jsonl.
Reuse existing PDFs: if the ledger has a verified PDF for the DOI/title, the tool returns it without opening the browser again.
Retry backoff: recent failures block immediate retry unless force=True. Title mismatches are blocked for 7 days; login/timeouts use shorter cooldowns.
Verification-first: success requires extracted PDF text to match requested title keywords. Wrong PDFs are deleted automatically.

Useful params:

browser_download(..., reuse_existing=True, force=False, human_delay=True)

For tests on public PDFs only, set human_delay=False. For real institutional downloads, keep human_delay=True.

Sources

Source	Type	Key Required?
arXiv	Preprints	No
Semantic Scholar	Academic search	Recommended
OpenAlex	270M+ publications	No
CrossRef	DOI resolution	No
Unpaywall	OA PDF resolver	Email recommended
OpenAIRE	EU open science	No
Europe PMC	Biomedical	No
DOAJ	Open access journals	No
Scopus	26K+ journals	Elsevier API key
Springer Nature	29M+ papers	Springer API key

Design Rationale

This design is grounded in the MCP tool selection literature:

Wang et al. (2026) — Tool catalog size inversely correlates with selection accuracy. MCPs with >40 tools see -260% selection quality vs <15 tools.
Dunkel (2026) — DADL framework: context window grows linearly with tool catalog size. Each tool adds ~1.5% context pressure.
Hou et al. (2026) — MCP security analysis: bloated tool surfaces create 16 attack vectors through dangling or misdescribed tools.
Gan & Sun (2025) — RAG-MCP: tool routing quality degrades by 12% per 10 tools. Bundled servers with <10 tools achieve 89% routing accuracy.

Agent Config

The agents/researcher.md file configures the researcher subagent for use with this MCP. Key settings:

task: deny — Prevents the researcher from spawning sub-sub-agents (avoids runaway cost spirals)
Capped work scope — Max 2 search + 2 walk + 1 read + 1 browser_download call per request
Stale tool references removed — No research_paper_lookup (deleted in Round 1)

Install: copy agents/researcher.md to ~/.config/opencode/agents/researcher.md.

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 81 Commits
agents		agents
examples		examples
tests		tests
.gitignore		.gitignore
BENCHMARK_FINAL.md		BENCHMARK_FINAL.md
BENCHMARK_FINAL_v2.md		BENCHMARK_FINAL_v2.md
BENCHMARK_V4.md		BENCHMARK_V4.md
LICENSE		LICENSE
RATIONALE.md		RATIONALE.md
README.md		README.md
publisher_apis.py		publisher_apis.py
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
research_bundle.py		research_bundle.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

research-mcp

5 Tools

Search Modes

Field-Aware Source Tiering

Debug Mode

Relevance Scoring (0-10 per paper)

Token-Efficient Response Shape

Why Semantic Relevance

Setup

opencode Config

API Keys (all optional)

Usage

Browser Download Guardrails

Sources

Design Rationale

Agent Config

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

research-mcp

5 Tools

Search Modes

Field-Aware Source Tiering

Debug Mode

Relevance Scoring (0-10 per paper)

Token-Efficient Response Shape

Why Semantic Relevance

Setup

opencode Config

API Keys (all optional)

Usage

Browser Download Guardrails

Sources

Design Rationale

Agent Config

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages