Commit 988391b

SonAIengine and claude committed
feat: ontology construction optimization: HybridClassifier, Batch LLM, EmbeddingRelation, PhraseExtractor
## 1. RuleBasedClassifier confidence score (#1)
- classify_with_confidence() returns (NodeKind, float)
- confidence = min(1.0, total_score / 6.0); a score of 6 or more counts as certain
- The existing classify() stays backward compatible

## 2. LLMClassifier batch mode (#2)
- classify_batch_async(items, content_limit=500)
- Classifies 8 to 16 documents in a single LLM call (3x to 5x cost reduction)
- Cache hits are excluded from the batch; on JSON parse failure, falls back to per-item classification

## 3. HybridClassifier: two-stage classification (combines #1 and #2)
- Rule confidence >= 0.6 → result accepted (free, immediate)
- confidence < 0.6 → delegated to the LLM (incurs cost)
- Conforms to the KindClassifier protocol

## 4. EmbeddingRelationDetector (#3)
- Generates relations automatically from cosine similarity (no LLM required)
- similarity_threshold of 0.7 or higher → RELATED edge
- EdgeKind is adjusted automatically according to the NodeKind pair
- Can be combined with RuleBasedRelationDetector as a fallback

## 5. PhraseExtractor: HippoRAG2 dual-node KG (#4)
- Extracts proper nouns and key phrases from documents automatically (zero-dep)
- Creates each phrase as an ENTITY node, with passage→phrase CONTAINS edges
- The same phrase acts as a bridge across multiple passages
- Adds EdgeKind.CONTAINS
- ⚠️ Currently causes search noise; phrase filtering still needs optimization

## SOTA-based design
- Batch classification: follows ICLR 2024 BatchPrompt (batch size 8 to 16 is the sweet spot)
- Two-stage classification: Hy-LIFT pattern (rules→LLM, Recall 92%)
- Embedding relation: follows ProLEA 2025 (Hits@1 0.931 to 0.969)
- Dual-node KG: follows HippoRAG2 (separate passage and phrase nodes)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
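
The confidence formula from section 1 and the 0.6 cutoff from section 3 can be sketched roughly as follows. This is an illustration only, not the code in this commit: `score_to_confidence`, `hybrid_classify`, the toy rule and LLM callables, and the string labels `"DECISION"` and `"CONCEPT"` are all hypothetical stand-ins for the real `NodeKind`, rule scoring, and LLM client, none of which appear in this view.

```python
from typing import Callable, Tuple

RULE_CONFIDENCE_THRESHOLD = 0.6  # cutoff stated in the commit message
FULL_CONFIDENCE_SCORE = 6.0      # a rule score of 6 or more counts as certain


def score_to_confidence(total_score: float) -> float:
    """confidence = min(1.0, total_score / 6.0), as given in the commit message."""
    return min(1.0, total_score / FULL_CONFIDENCE_SCORE)


def hybrid_classify(
    text: str,
    rule_classify: Callable[[str], Tuple[str, float]],
    llm_classify: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (kind, source); source records which stage produced the answer."""
    kind, confidence = rule_classify(text)
    if confidence >= RULE_CONFIDENCE_THRESHOLD:
        return kind, "rule"           # stage 1: accept the rule result, no LLM call
    return llm_classify(text), "llm"  # stage 2: delegate low-confidence items


# Hypothetical toy classifiers, only to make the sketch runnable.
def toy_rules(text: str) -> Tuple[str, float]:
    score = 2.0 * text.lower().count("decided")  # 2 points per keyword hit
    return "DECISION", score_to_confidence(score)


def toy_llm(text: str) -> str:
    return "CONCEPT"


if __name__ == "__main__":
    print(hybrid_classify("We decided, then decided again", toy_rules, toy_llm))
    print(hybrid_classify("We decided once", toy_rules, toy_llm))
```

With these toy rules, two keyword hits give a score of 4.0 (confidence about 0.67), which clears the cutoff, while a single hit (about 0.33) is routed to the LLM stand-in.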
1 parent f808f2c commit 988391b
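
For section 4 of the commit message, the thresholded cosine-similarity step might look like the sketch below. It assumes embeddings are plain lists of floats keyed by node ID; `cosine_similarity` and `detect_related` are hypothetical helper names, and the NodeKind-dependent EdgeKind adjustment and rule-based fallback described in the message are not modeled here.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

SIMILARITY_THRESHOLD = 0.7  # similarity_threshold stated in the commit message


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_related(
    embeddings: Dict[str, Sequence[float]],
    threshold: float = SIMILARITY_THRESHOLD,
) -> List[Tuple[str, str, float]]:
    """Return (node_a, node_b, similarity) for every pair at or above the threshold."""
    edges = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        similarity = cosine_similarity(vec_a, vec_b)
        if similarity >= threshold:
            edges.append((id_a, id_b, similarity))
    return edges


if __name__ == "__main__":
    vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
    for edge in detect_related(vectors):
        print(edge)
```

The pairwise loop is quadratic in the number of nodes, so a real detector over a large graph would likely need an approximate nearest-neighbor index; that is outside what this commit view shows.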

124 files changed

Lines changed: 2004 additions & 4 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+.env

pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -78,6 +78,7 @@ dev = [
 eval = [
     "deepeval>=2.0",
     "openai>=1.0",
+    "python-dotenv>=1.2.2",
 ]
 
 [tool.hatch.build.targets.wheel]

src/synaptic/__init__.py

Lines changed: 12 additions & 1 deletion
@@ -7,7 +7,11 @@
 from synaptic.ppr import personalized_pagerank
 from synaptic.extensions.classifier_rules import RuleBasedClassifier
 from synaptic.extensions.embedder import EmbeddingProvider, MockEmbeddingProvider
-from synaptic.extensions.relation_detector import RuleBasedRelationDetector
+from synaptic.extensions.phrase_extractor import PhraseExtractor
+from synaptic.extensions.relation_detector import (
+    EmbeddingRelationDetector,
+    RuleBasedRelationDetector,
+)
 from synaptic.graph import SynapticGraph
 from synaptic.evidence import EvidenceAssembler
 from synaptic.models import (
@@ -55,12 +59,14 @@
     "EvidenceChain",
     "EvidenceStep",
     "EmbeddingProvider",
+    "EmbeddingRelationDetector",
     "GraphTraversal",
     "KindClassifier",
     "MockEmbeddingProvider",
     "Node",
     "NodeKind",
     "OntologyRegistry",
+    "PhraseExtractor",
     "personalized_pagerank",
     "PropertyDef",
     "QueryRewriter",
@@ -72,6 +78,7 @@
     "LLMRelationDetector",
     "OllamaLLMProvider",
     "OpenAILLMProvider",
+    "HybridClassifier",
     "RuleBasedClassifier",
     "RuleBasedRelationDetector",
     "SearchIntent",
@@ -95,6 +102,10 @@ def __getattr__(name: str) -> object:
         from synaptic.extensions.embedder import OllamaEmbeddingProvider  # noqa: PLC0415
 
         return OllamaEmbeddingProvider
+    if name == "HybridClassifier":
+        from synaptic.extensions.classifier_hybrid import HybridClassifier  # noqa: PLC0415
+
+        return HybridClassifier
     if name == "LLMClassifier":
         from synaptic.extensions.classifier_llm import LLMClassifier  # noqa: PLC0415
 
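
The last hunk adds `HybridClassifier` to a module-level `__getattr__`, the lazy-import hook from PEP 562, so the import presumably runs only on first attribute access. A minimal self-contained sketch of that mechanism follows; `demo_pkg`, `_LAZY_ATTRS`, and `_lazy_getattr` are hypothetical names, and standard-library modules stand in for the `synaptic.extensions` submodules.

```python
import importlib
import types

# Hypothetical mapping: exported name -> module that defines it.
_LAZY_ATTRS = {"Fraction": "fractions", "JSONDecoder": "json"}

# A throwaway module object stands in for a package's __init__.py.
demo_pkg = types.ModuleType("demo_pkg")


def _lazy_getattr(name: str) -> object:
    """Called only when normal lookup on the module fails (PEP 562)."""
    if name in _LAZY_ATTRS:
        module = importlib.import_module(_LAZY_ATTRS[name])  # import deferred until here
        return getattr(module, name)
    raise AttributeError(f"module 'demo_pkg' has no attribute {name!r}")


# In a real package this would be written as `def __getattr__(name): ...`
# at the top level of __init__.py; assigning it here has the same effect.
demo_pkg.__getattr__ = _lazy_getattr

if __name__ == "__main__":
    print(demo_pkg.Fraction(1, 2))  # resolved through the lazy hook
```

Keeping optional or heavy dependencies behind this hook means importing the package does not fail, or slow down, when those dependencies are unused.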
Seven binary files not shown: 2.64 KB, 3.82 KB, 11.2 KB, 12.7 KB, 19.1 KB, 21.2 KB, 3.51 KB
