Skip to content

Commit b8a0ae7

Browse files
SonAIengineclaude
andcommitted
feat: End-to-End QA 벤치마크 — HotPotQA 24문항 Cognee 비교 (Correctness 0.784)
## E2E QA 파이프라인 - synaptic-memory.search() → context 검색 - Ollama native API (qwen3.5:4b, think:false) → 답변 생성 - 3단계 correctness 평가: exact match → 포함 검사 → F1 토큰 매칭 ## HotPotQA 데이터 개선 - 원본 HuggingFace에서 짧은 answer 필드 추출 (300,000, 1989 등) - 기존 corpus text 대신 정답 문자열로 ground truth 평가 ## 결과 (qwen3.5:4b, 로컬) - Correctness: 0.784 (24문항 중 10개 정답) - Cognee (GPT-4o): 0.925 — Gap 15.3% - 차이 원인: LLM 크기 (4B vs GPT-4o), multi-hop 추론 능력 ## 비교 조건 - Cognee: GPT-4o 답변 + GPT-4o judge (DeepEval Correctness) - Synaptic: qwen3.5:4b 답변 + F1 기반 단순 평가 - 같은 LLM 사용 시 추가 개선 예상 ## 의존성 - deepeval, openai를 [eval] extra로 추가 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent fe330ea commit b8a0ae7

File tree

5 files changed

+969
-2
lines changed

5 files changed

+969
-2
lines changed

pyproject.toml

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -75,6 +75,10 @@ dev = [
7575
"ruff>=0.8",
7676
"pyright>=1.1",
7777
]
78+
eval = [
79+
"deepeval>=2.0",
80+
"openai>=1.0",
81+
]
7882

7983
[tool.hatch.build.targets.wheel]
8084
packages = ["src/synaptic"]

tests/benchmark/data/hotpotqa.json

Lines changed: 1 addition & 0 deletions
Large diffs are not rendered by default.

tests/benchmark/data/hotpotqa_24.json

Lines changed: 1 addition & 0 deletions
Large diffs are not rendered by default.

0 commit comments

Comments
 (0)