Commit 683d83e

SonAIengine and claude committed
feat: FTS+embedding hybrid scoring: S7 Auto+Embed reaches MRR 0.83
## Hybrid score computation (search.py)
- Vector search results now get a rank-based score combined with cosine similarity
- vec_score = sim * 0.7 + rank_score * 0.3 (similarity-first)
- When a node matches both FTS and vector: alpha * fts + (1 - alpha) * vec + 0.1 bonus
- Vector only: vec_score * 0.9 (damped for the missing FTS match)

## Ablation results (Ollama qwen3-embedding:0.6b)
- Allganize-ko S7: MRR 0.670 → 0.830 (+24%), R@10 0.870 → 1.000
- PublicHealthQA S7: MRR 0.310 → 0.499 (+61%), R@10 0.623 → 0.870
- S7 exceeds S0 Flat for the first time; the embeddings are now contributing to retrieval quality

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 412a778 commit 683d83e
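For reference on how the ablation metrics are defined: MRR averages the reciprocal rank of the first relevant hit, and R@10 is the fraction of queries whose gold document appears in the top 10. A minimal sketch, assuming one gold document per query (function names are hypothetical, not from this repo):

```python
def mrr(rankings: list[list[str]], gold: list[str]) -> float:
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit,
    contributing 0 when the gold id is absent from the ranking."""
    total = 0.0
    for ids, g in zip(rankings, gold):
        total += next((1.0 / (i + 1) for i, nid in enumerate(ids) if nid == g), 0.0)
    return total / len(gold)


def recall_at_k(rankings: list[list[str]], gold: list[str], k: int = 10) -> float:
    """R@k: fraction of queries whose gold document appears in the top k."""
    return sum(g in ids[:k] for ids, g in zip(rankings, gold)) / len(gold)
```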

File tree

1 file changed

+40
-10
lines changed


src/synaptic/search.py

Lines changed: 40 additions & 10 deletions
```diff
@@ -2,6 +2,7 @@

 from __future__ import annotations

+import math
 from time import time

 from synaptic.models import ActivatedNode, Node, NodeKind, SearchResult
@@ -35,6 +36,16 @@
 _KIND_BOOST = 0.05  # conservative search_score boost on kind match


+def _cosine_sim(a: list[float], b: list[float]) -> float:
+    """Cosine similarity of two vectors."""
+    dot = sum(x * y for x, y in zip(a, b))
+    na = math.sqrt(sum(x * x for x in a))
+    nb = math.sqrt(sum(x * x for x in b))
+    if na == 0 or nb == 0:
+        return 0.0
+    return dot / (na * nb)
+
+
 class HybridSearch:
     """3-stage fallback search: FTS+vector → synonym expansion → query rewrite."""

@@ -66,24 +77,43 @@ async def search(
         stages_used: list[str] = []
         all_nodes: dict[str, tuple[Node, float]] = {}

-        # Stage 1: FTS + vector
+        # Stage 1: FTS + vector hybrid scoring
+        fts_scores: dict[str, float] = {}
         fts_nodes = await backend.search_fts(query, limit=limit * 2)
         stages_used.append("fts")
         for rank, node in enumerate(fts_nodes):
-            # FTS rank-based score: 1st = 0.95, 2nd = 0.90, ...
-            score = max(0.5, 0.95 - rank * 0.05)
-            if node.id not in all_nodes:
-                all_nodes[node.id] = (node, score)
+            # FTS rank-based score: 1st = 0.95, decaying by 0.05 per rank
+            score = max(0.3, 0.95 - rank * 0.05)
+            fts_scores[node.id] = score
+            all_nodes[node.id] = (node, score)

+        vec_scores: dict[str, float] = {}
         if embedding:
             vec_nodes = await backend.search_vector(embedding, limit=limit * 2)
             stages_used.append("vector")
-            for node in vec_nodes:
-                if node.id not in all_nodes:
-                    all_nodes[node.id] = (node, 0.7)
+            for rank, node in enumerate(vec_nodes):
+                # vector rank-based score plus the actual cosine similarity
+                rank_score = max(0.3, 0.95 - rank * 0.05)
+                # compute cosine similarity directly (when available)
+                if node.embedding and embedding:
+                    sim = _cosine_sim(embedding, node.embedding)
+                    vec_score = sim * 0.7 + rank_score * 0.3  # similarity-first
+                else:
+                    vec_score = rank_score
+                vec_scores[node.id] = vec_score
+
+            # combine FTS + vector into the hybrid score
+            alpha = 0.5  # FTS vs. vector weight (0.5 = equal)
+            for nid, node in {n.id: n for n in vec_nodes}.items():
+                fts_s = fts_scores.get(nid, 0.0)
+                vec_s = vec_scores.get(nid, 0.0)
+                if nid in all_nodes:
+                    # matched by both: hybrid score
+                    hybrid = alpha * fts_s + (1 - alpha) * vec_s + 0.1  # both-match bonus
+                    all_nodes[nid] = (all_nodes[nid][0], min(1.0, hybrid))
                 else:
-                    existing = all_nodes[node.id]
-                    all_nodes[node.id] = (existing[0], min(1.0, existing[1] + 0.2))
+                    # vector only: damp slightly without an FTS match
+                    all_nodes[nid] = (node, vec_s * 0.9)

         # Stage 2: Synonym expansion (if insufficient results)
         if len(all_nodes) < limit:
```
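The rank-based decay shared by the FTS and vector paths can be checked in isolation; a small sketch (with the 0.05 step, the 0.3 floor kicks in from rank index 13 onward, so very deep results are no longer distinguished by rank):

```python
def rank_score(rank: int) -> float:
    """Rank-based score used for both FTS and vector hits:
    0.95 at rank index 0, minus 0.05 per rank, floored at 0.3."""
    return max(0.3, 0.95 - rank * 0.05)


scores = [rank_score(r) for r in range(15)]
```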
