- Python 3.11+
- uv
- bun
Optional but recommended:
- `SEMANTIC_SCHOLAR_API_KEY` for higher API limits
- OpenAI-compatible endpoint via `LLM_BASE_URL`
```bash
uv sync --extra dev
cd frontend && bun install && cd ..
```

Create `.env` in the project root:

```env
LLM_API_KEY=your-key
LLM_BASE_URL=https://api.openai.com/v1
LLM_MODEL=gpt-4o
SEMANTIC_SCHOLAR_API_KEY=optional
NEXT_PUBLIC_API_URL=http://localhost:8000

# Optional - LLM concurrency for parallel operations
# Default: 2 (safe for free/low-tier API keys)
# Recommended: 2-4 for free tier, 4-8 for paid tier
# Higher values improve performance but may trigger rate limits
LLM_CONCURRENCY=2

# Optional - claim verification concurrency
# Default: 2 (safe for free/low-tier API keys)
# Recommended: 2-4 for free tier, 4-8 for paid tier
CLAIM_VERIFICATION_CONCURRENCY=2
```
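The backend reads these values at startup; a minimal sketch of how such defaults are typically applied with `os.getenv` (the `Settings` class below is illustrative, not the project's actual config code):

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Illustrative settings holder; the real loading logic lives in the backend."""

    llm_api_key: str
    llm_base_url: str
    llm_model: str
    llm_concurrency: int
    claim_verification_concurrency: int


def load_settings() -> Settings:
    return Settings(
        llm_api_key=os.environ["LLM_API_KEY"],  # required; fail fast if missing
        llm_base_url=os.getenv("LLM_BASE_URL", "https://api.openai.com/v1"),
        llm_model=os.getenv("LLM_MODEL", "gpt-4o"),
        # Defaults mirror the .env comments above: 2 is safe for low-tier keys
        llm_concurrency=int(os.getenv("LLM_CONCURRENCY", "2")),
        claim_verification_concurrency=int(
            os.getenv("CLAIM_VERIFICATION_CONCURRENCY", "2")
        ),
    )
```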
LLM Concurrency (`LLM_CONCURRENCY`)

- Free/low-tier OpenAI: use 2-4 (default is 2)
  - Higher values may trigger 429 rate-limit errors
  - Recommended: start with 2 and increase gradually to 4 if no rate limits occur
- Paid/team-tier OpenAI: use 4-8
  - Higher tiers allow more concurrent requests
  - Recommended: 4-6 for balanced performance, up to 8 for maximum throughput
- DeepSeek/Zhipu APIs: check provider-specific rate limits
  - These may be lower than OpenAI's
  - Start conservative and monitor logs for 429 errors
Claim Verification Concurrency (`CLAIM_VERIFICATION_CONCURRENCY`)

- Follows the same guidance as `LLM_CONCURRENCY`
- Lower values (2-4) reduce rate-limit risk
- Higher values (4-8) speed up the `critic_agent`; see the sketch below
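Both settings cap the number of in-flight LLM requests. A minimal sketch of the semaphore pattern such a cap usually boils down to (illustrative only; the project's `llm_client.py` may implement it differently):

```python
import asyncio
import os

LLM_CONCURRENCY = int(os.getenv("LLM_CONCURRENCY", "2"))

# At most LLM_CONCURRENCY coroutines hold the semaphore at once, so a burst
# of parallel node work never exceeds the provider's rate-limit headroom.
_llm_semaphore = asyncio.Semaphore(LLM_CONCURRENCY)


async def call_llm(prompt: str) -> str:
    async with _llm_semaphore:
        await asyncio.sleep(0.1)  # stand-in for the real API request
        return f"response to: {prompt}"


async def main() -> None:
    # Ten tasks are created eagerly, but only LLM_CONCURRENCY run at a time.
    results = await asyncio.gather(*(call_llm(f"claim {i}") for i in range(10)))
    print(len(results))


if __name__ == "__main__":
    asyncio.run(main())
```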
Backend:

```bash
uv run uvicorn backend.main:app --reload --port 8000
```

Frontend:

```bash
cd frontend
bun run dev
```

Run all checks and ensure they pass:

```bash
ruff check backend/
ruff format backend/ --check
find backend -name '*.py' -exec python -m py_compile {} +
cd frontend && bun x tsc --noEmit
cd frontend && bun run lint
```

```bash
uv run pytest tests/ -v
uv run pytest tests/test_integration.py -v
uv run pytest tests/test_exporter.py::test_export_markdown -v
cd frontend && bun test
cd frontend && bun run test:e2e
```

- Update models in `backend/schemas.py` if the contract changes
- Implement route behavior in `backend/main.py`
- Add/adjust tests in `tests/`
- Update `docs/API.md` (a hypothetical example follows this list)
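For orientation, a contract change of this kind might look as follows, assuming the backend is FastAPI-based (as `uvicorn backend.main:app` suggests); the model and field names here are hypothetical, not the project's actual schema:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class SessionStatus(BaseModel):
    """Hypothetical response model; the real ones live in backend/schemas.py."""

    session_id: str
    state: str
    progress: float = 0.0  # newly added field -> update docs/API.md and tests too


@app.get("/api/research/status")
async def research_status() -> SessionStatus:
    # Route behavior belongs in backend/main.py; this handler is a stub.
    return SessionStatus(session_id="demo", state="running", progress=0.4)
```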
- Adjust node logic in `backend/nodes.py`
- Update routing/interrupt/retry in `backend/workflow.py`
- Verify session resume behavior (`/api/research/status`, `/sessions`)
- Add regression tests for the changed flow

- Extend store state in `frontend/src/store/research.ts`
- Wire API calls in `frontend/src/lib/api/`
- Reflect state in console/workspace components
- Add vitest coverage in `frontend/src/__tests__/`
Workflow Benchmark Script (`tests/benchmark_workflow.py`)
Measures end-to-end workflow performance, including:
- Per-node timing breakdown (planner, retriever, extractor, writer, critic); a sketch of the instrumentation pattern follows this list
- LLM call estimation
- Total workflow time
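A per-node breakdown like this can be collected with plain wall-clock instrumentation; a sketch of the general pattern (not the script's actual internals):

```python
import time
from collections.abc import Callable


def run_timed(name: str, node: Callable[[], object], timings: dict[str, float]) -> object:
    """Run one workflow node and record its wall-clock duration."""
    start = time.perf_counter()
    result = node()
    timings[name] = time.perf_counter() - start
    return result


timings: dict[str, float] = {}
for name in ("planner", "retriever", "extractor", "writer", "critic"):
    run_timed(name, lambda: time.sleep(0.01), timings)  # stand-in for the real node
print({name: f"{seconds:.2f}s" for name, seconds in timings.items()})
```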
Usage:

```bash
# Single benchmark run
python tests/benchmark_workflow.py --query "transformer architecture in NLP" --papers 3

# Multiple iterations for consistency
python tests/benchmark_workflow.py --query "deep learning for medical imaging" --iterations 3

# Compare concurrency configurations
python tests/benchmark_workflow.py --query "reinforcement learning" --compare --papers 3
```

Requirements:

- Backend must be running (`uvicorn backend.main:app --reload --port 8000`)
- Valid `LLM_API_KEY` configured in `.env`
Expected Results:
- Baseline (`LLM_CONCURRENCY=2`): ~45s for 3 papers, ~60-80s for 10 papers
- Optimized (`LLM_CONCURRENCY=4`): ~30s for 3 papers, ~40-50s for 10 papers
Citation Validation Script (`tests/validate_citations.py`)
Validates citation accuracy across multiple research topics to ensure optimization work doesn't degrade quality.
Usage:

```bash
# Manual validation on 3 topics (original baseline)
python tests/validate_citations.py

# Regression testing against a previous session
python tests/validate_citations.py --compare <session_id>
```

Success Criteria:
- Citation accuracy ≥ 97.0% (maintained from the 97.3% baseline); a sketch of the metric follows this list
- No increase in hallucinated citations
- Citation index errors remain minimal
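Citation accuracy is reported here as the share of emitted citations that resolve to an actually retrieved paper; a sketch of the metric under that assumption (the validation script's exact definition may differ):

```python
def citation_accuracy(cited_indices: list[int], num_papers: int) -> float:
    """Fraction of citation indices pointing at a real retrieved paper.

    Indices outside [1, num_papers] count as hallucinated or index errors.
    """
    if not cited_indices:
        return 1.0
    valid = sum(1 for i in cited_indices if 1 <= i <= num_papers)
    return valid / len(cited_indices)


# Index 7 does not exist among 5 retrieved papers -> 3/4 = 75% accuracy
assert citation_accuracy([1, 2, 3, 7], num_papers=5) == 0.75
```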
| Metric | Baseline | Target | Status |
|---|---|---|---|
| 10-paper workflow time | 50-95s | 35-65s | ✅ Implemented |
| LLM call count (10 papers) | ~26-36 | ~20-28 | ✅ Achieved |
| Citation accuracy | 97.3% | ≥97.0% | ✅ Maintained |
- No regression in `tests/test_claim_verification.py`: all existing tests must pass
- No regression in `tests/test_integration.py`: end-to-end workflows remain functional
- Citation accuracy ≥ 97%: verified by manual validation on 3 topics
- 429 error handling: `RateLimitError` properly retried with exponential backoff (implemented in `llm_client.py`; see the sketch after this list)
- Batch extraction fallback: per-section fallback activated on batch failure (implemented in `claim_verifier.py`)
- Fulltext enrichment merge: tested with edge cases (papers with and without PDFs) in `test_extractor_parallel.py`
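A minimal sketch of the retry-with-exponential-backoff pattern referenced above, assuming the OpenAI v1 Python client and its `RateLimitError` (the real logic lives in `llm_client.py` and may differ):

```python
import random
import time
from typing import Any

import openai


def call_with_backoff(client: openai.OpenAI, **kwargs: Any) -> Any:
    """Retry chat completions on 429s with exponential backoff plus jitter."""
    max_retries = 5
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except openai.RateLimitError:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            # 1s, 2s, 4s, 8s ... plus jitter so parallel workers desynchronize
            time.sleep(2**attempt + random.random())
```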
- Backend imports: stdlib -> third-party -> local (illustrated in the snippet after this list)
- Absolute imports for backend modules (`from backend...`)
- Python typing with built-in generics (`list[str]`, `dict[str, Any]`)
- Frontend import aliases via `@/`
- Keep docs and API contracts synchronized in the same PR
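Taken together, the backend conventions look like this (the imported schema name and the function are illustrative, not real project code):

```python
# stdlib -> third-party -> local, using absolute imports for backend modules
import json
from typing import Any

import httpx  # third-party example; not necessarily a project dependency

from backend.schemas import SessionStatus  # hypothetical import, for illustration


def fetch_status(url: str) -> SessionStatus:
    """Built-in generics (dict[str, Any]) and absolute backend imports in one place."""
    raw: dict[str, Any] = json.loads(httpx.get(url).text)
    return SessionStatus(**raw)
```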