A unified test harness covering all SochDB Python SDK features, with synthetic data generation, 10 real-world scenarios, metrics collection, and scorecard reporting.
- ✅ Namespaces - Multi-tenant isolation with use_namespace()
- ✅ Hybrid Search - Vector + keyword search with alpha blending
- ✅ Semantic Cache - Cache hit/miss tracking and paraphrase groups
- ✅ Context Builder - Token budgeting with STRICT/LENIENT modes
- ✅ SSI Transactions - Atomicity, rollback, conflict handling
- ✅ Temporal Graph - POINT_IN_TIME and RANGE queries
- ✅ Atomic Writes - WAL-based recovery and consistency
- ✅ Sessions/Audit - Complete audit trail logging
- ✅ MCP/Server Mode - Tool provider integration
Deterministically creates test data:
- Topic centroids - 200 unit-normalized vectors serving as retrieval ground truth
- Keyword mapping - Topic-specific keywords for BM25 signal
- Paraphrase groups - Controlled query variants for cache testing
- Graph structures - Incident clusters with temporal edges
- Ground-truth labels - Expected doc IDs, relevance scores
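As a sketch, the centroid/document generation above might look like the following with numpy (the 200 topics and 384-dim embeddings match the examples later in this README; `docs_per_topic` and the 0.05 noise scale are illustrative assumptions):

```python
import numpy as np

def make_synthetic_data(seed=1337, n_topics=200, dim=384, docs_per_topic=5):
    """Deterministic ground-truth corpus: one unit-normalized centroid per
    topic, and documents generated as noisy copies of their topic centroid."""
    rng = np.random.default_rng(seed)

    # Unit-normalized topic centroids
    centroids = rng.standard_normal((n_topics, dim))
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

    docs, labels = [], []
    for topic_id in range(n_topics):
        for _ in range(docs_per_topic):
            # Document = centroid + small noise, re-normalized
            v = centroids[topic_id] + 0.05 * rng.standard_normal(dim)
            docs.append(v / np.linalg.norm(v))
            labels.append(topic_id)  # ground-truth relevance label
    return centroids, np.array(docs), np.array(labels)
```

Because the generator is seeded, reruns with the same seed reproduce the exact same corpus, which is what makes the retrieval metrics comparable across runs.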
Runs 10 comprehensive scenarios:
- Multi-tenant Support Agent - RAG + memory + cost control
- Sales/CRM Agent - Lead enrichment with atomic updates
- SecOps Triage Agent - Entity graph + incident timelines
- On-call Runbook Agent - Hybrid search + context budgets
- Memory-building Agent - Crash-safe atomic writes
- Finance Close Agent - Ledger integrity + conflict resolution
- Compliance Agent - Policy evaluation + explainability
- Procurement Agent - Contract search + clause linking
- Edge Field-Tech Agent - Offline time-travel diagnostics
- Tool-using Agent - MCP integration testing
Computes metrics across four weighted categories:
- Correctness (70%) - Leakage, atomicity, consistency
- Retrieval (15%) - NDCG@10, Recall@10, MRR
- Performance (10%) - P95 latencies per operation
- Cost Proxies (5%) - Cache hit rates, token budgets
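The category weights above combine into the overall 0-100 score roughly as follows (a minimal sketch; the harness's exact aggregation may differ):

```python
def overall_score(correctness, retrieval, performance, cost):
    """Combine per-category scores (each in [0, 1]) into a 0-100 scorecard
    value using the documented category weights."""
    weights = {"correctness": 0.70, "retrieval": 0.15,
               "performance": 0.10, "cost": 0.05}
    scores = {"correctness": correctness, "retrieval": retrieval,
              "performance": performance, "cost": cost}
    return 100.0 * sum(weights[k] * scores[k] for k in weights)
```

The heavy correctness weighting means a single leakage or atomicity failure drags the overall score far more than a slow query does.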
Produces final reports:
- JSON scorecard - Structured results with all metrics
- Summary table - Human-readable pass/fail report
- CSV export - Optional tabular format
```bash
# Install dependencies
pip install -r harness_requirements.txt

# Install SochDB SDK
cd ../sochdb-python-sdk
pip install -e .
cd ../sochdb_py_temp_test
```

```bash
# Run with defaults
python comprehensive_harness.py

# Small scale, custom seed
python comprehensive_harness.py --scale small --seed 42

# Large scale, server mode
python comprehensive_harness.py --scale large --mode server

# Custom output file
python comprehensive_harness.py --output results/scorecard_v1.json
```

Command-line options:

- `--seed` - Random seed for reproducibility (default: 1337)
- `--scale` - Test scale: small/medium/large (default: medium)
- `--mode` - DB mode: embedded/server (default: embedded)
- `--output` - Output JSON file (default: scorecard.json)
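The documented flags map to an argparse parser along these lines (a sketch; the harness's actual parser may differ in details):

```python
import argparse

def parse_args(argv=None):
    """Parse the harness CLI flags as documented above."""
    p = argparse.ArgumentParser(description="SochDB SDK comprehensive test harness")
    p.add_argument("--seed", type=int, default=1337,
                   help="Random seed for reproducibility")
    p.add_argument("--scale", choices=["small", "medium", "large"],
                   default="medium", help="Test scale")
    p.add_argument("--mode", choices=["embedded", "server"],
                   default="embedded", help="DB mode")
    p.add_argument("--output", default="scorecard.json",
                   help="Output JSON file")
    return p.parse_args(argv)
```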
| Scale | Tenants | Docs/Collection | Queries | Duration |
|---|---|---|---|---|
| Small | 3 | 50 | 20 | ~30s |
| Medium | 5 | 200 | 50 | ~2min |
| Large | 10 | 1000 | 100 | ~10min |
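The scale presets in the table map naturally to a config mapping (`SCALES` is an illustrative name, not necessarily what the harness uses internally):

```python
# Scale presets from the table above; durations are approximate and omitted
SCALES = {
    "small":  {"tenants": 3,  "docs_per_collection": 50,   "queries": 20},
    "medium": {"tenants": 5,  "docs_per_collection": 200,  "queries": 50},
    "large":  {"tenants": 10, "docs_per_collection": 1000, "queries": 100},
}
```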
Example `scorecard.json`:

```json
{
  "run_meta": {
    "seed": 1337,
    "scale": "medium",
    "mode": "embedded",
    "sdk_version": "0.3.3",
    "started_at": "2026-01-09T...",
    "duration_s": 123.4
  },
  "scenario_scores": {
    "01_multi_tenant_support": {
      "pass": true,
      "metrics": {
        "correctness": {
          "leakage_rate": 0.0,
          "atomicity_failures": 0,
          "consistency_failures": 0
        },
        "retrieval": {
          "ndcg_at_10": 0.875,
          "recall_at_10": 0.923,
          "mrr": 0.891
        },
        "cache": {
          "hit_rate": 0.67
        },
        "performance": {
          "p95_latencies_ms": {
            "vector_search": 12.3,
            "hybrid_search": 18.7
          }
        }
      }
    }
  },
  "global_metrics": {
    "p95_latency_ms": {
      "vector_search": 12.3,
      "hybrid_search": 18.7,
      "txn_commit": 4.1
    },
    "error_rate": 0.0
  },
  "overall": {
    "pass": true,
    "score_0_100": 95.8,
    "passed_scenarios": 10,
    "total_scenarios": 10,
    "failed_checks": []
  }
}
```

- Leakage rate - Cross-tenant data access (must be 0)
- Atomicity failures - Partial updates after rollback (must be 0)
- Consistency failures - Post-crash invariant violations (must be 0)
- Time-travel correctness - Temporal query accuracy (must be 100%)
- NDCG@10 - Normalized Discounted Cumulative Gain
- Recall@10 - Fraction of relevant docs in top-10
- MRR - Mean Reciprocal Rank
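These are the textbook definitions of the three retrieval metrics; a minimal sketch (the harness's exact implementation may differ, e.g. in tie handling):

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k: `relevances` are graded relevance scores in ranked order."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

def recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of relevant docs that appear in the top-k results."""
    return len(set(retrieved_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def mrr(rankings, relevant_sets):
    """Mean reciprocal rank of the first relevant hit per query."""
    total = 0.0
    for ranking, relevant in zip(rankings, relevant_sets):
        for i, doc_id in enumerate(ranking):
            if doc_id in relevant:
                total += 1.0 / (i + 1)
                break
    return total / len(rankings)
```

The synthetic data makes these metrics exact: every document with the query's `topic_id` is relevant by construction, so no human judgments are needed.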
- P95 latencies - 95th percentile per operation type
- Thresholds - hybrid_search <50ms, txn_commit <10ms
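A P95 can be computed with the nearest-rank method, sketched below (the harness may use interpolation instead; treat this as illustrative):

```python
import math

def p95(latencies_ms):
    """95th-percentile latency by the nearest-rank method."""
    xs = sorted(latencies_ms)
    idx = max(0, math.ceil(0.95 * len(xs)) - 1)
    return xs[idx]
```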
- Cache hit rate - After paraphrase warmup (target: ≥60%)
- Token budgets - STRICT mode compliance (must be 100%)
- LLM calls avoided - Via cache hits
| Scenario | Metric | Threshold |
|---|---|---|
| 01 Multi-tenant | Leakage rate | 0% |
| 01 Multi-tenant | NDCG@10 | ≥0.70 |
| 01 Multi-tenant | Cache hit rate | ≥60% |
| 02 CRM | Atomicity failures | 0 |
| 02 CRM | Audit coverage | 100% |
| 03 SecOps | Cluster F1 | ≥90% |
| 03 SecOps | Temporal correctness | 100% |
| 04 Runbook | Top-1 accuracy | ≥70% |
| 04 Runbook | Top-3 accuracy | ≥90% |
| 05 Memory | Consistency after crash | 100% |
| 06 Finance | Double-post rate | 0% |
| 06 Finance | Retry success | ≥99% |
| 07 Compliance | Policy accuracy | 100% |
| 07 Compliance | Explainability | 100% |
| 08 Procurement | Recall@10 | ≥85% |
| 09 Field-Tech | Time-travel accuracy | 100% |
| 10 MCP | Tool call success | ≥99.9% |
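A sketch of how such thresholds can be checked against the JSON scorecard (only the scenario-01 subset is shown; key names follow the example scorecard above, and the function name is illustrative):

```python
def check_thresholds(scorecard):
    """Evaluate a few of the documented pass criteria; returns a list of
    human-readable failure strings (empty list means pass)."""
    failed = []
    m = scorecard["scenario_scores"]["01_multi_tenant_support"]["metrics"]
    if m["correctness"]["leakage_rate"] != 0.0:
        failed.append("01 leakage_rate must be 0")
    if m["retrieval"]["ndcg_at_10"] < 0.70:
        failed.append("01 ndcg_at_10 below 0.70")
    if m["cache"]["hit_rate"] < 0.60:
        failed.append("01 cache hit_rate below 60%")
    return failed
```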
```python
# 200 topic centroids (unit-normalized)
centroids = normalize(random.randn(200, 384))

# Document embedding = centroid + small noise
doc_vector = normalize(centroid[topic_id] + noise)

# Query embedding uses same centroid
# → Known relevant docs = docs with same topic_id
```

```python
# Each topic has 3-5 keywords
topic_keywords = {
    0: ["authentication", "security", "login"],
    1: ["performance", "latency", "throughput"],
    ...
}

# 70% of docs for a topic include its keywords
# 5% of other docs include them as noise
# → BM25 signal with controlled collisions
```

```python
# Same topic embedding, different text
paraphrases = [
    "How do I fix authentication issues?",
    "What's the solution for auth problems?",
    "Help with authentication errors",
]
# → Cache hit test with known equivalence
```

Example summary output:

```
================================================================================
SCORECARD SUMMARY
================================================================================
Run Meta:
  Seed: 1337
  Scale: medium
  Mode: embedded
  Duration: 127.34s

Overall Score: 95.8/100
Passed: 10/10
Status: ✓ PASS

Scenario                     Status    NDCG@10    Recall@10
----------------------------------------------------------------------
01_multi_tenant_support      ✓ PASS    0.875      0.923
02_sales_crm                 ✓ PASS    N/A        N/A
03_secops_triage             ✓ PASS    N/A        N/A
04_oncall_runbook            ✓ PASS    N/A        0.912
05_memory_crash_safe         ✓ PASS    N/A        N/A
06_finance_close             ✓ PASS    N/A        N/A
07_compliance                ✓ PASS    N/A        N/A
08_procurement               ✓ PASS    N/A        0.887
09_edge_field_tech           ✓ PASS    N/A        N/A
10_mcp_tool                  ✓ PASS    N/A        N/A

Global P95 Latencies (ms):
  vector_search: 11.23ms
  hybrid_search: 17.89ms
  txn_commit: 3.87ms
  ledger_commit: 4.21ms
================================================================================
```
```bash
#!/bin/bash
# Example CI step (e.g. invoked from .github/workflows/sdk-integration-test.yml)
set -e

# Run harness; the if-guard keeps set -e from aborting before we can report
if ! python comprehensive_harness.py \
    --scale medium \
    --seed 1337 \
    --output ci_scorecard.json; then
  echo "❌ SDK integration tests FAILED"
  exit 1
fi

# Extract score
score=$(jq '.overall.score_0_100' ci_scorecard.json)
echo "📊 Integration test score: $score/100"

# Fail if score < 90
if (( $(echo "$score < 90" | bc -l) )); then
  echo "❌ Score below threshold (90)"
  exit 1
fi

echo "✅ SDK integration tests PASSED"
```

Troubleshooting:

- Import errors

  ```bash
  # Ensure SDK is installed
  pip install -e ../sochdb-python-sdk
  ```

- Permission errors

  ```bash
  # Clean up old test data
  rm -rf test_harness_db/
  ```

- Low scores

  ```bash
  # Check specific scenario failures
  jq '.scenario_scores[] | select(.pass == false)' scorecard.json
  ```

- Slow execution

  ```bash
  # Use smaller scale for quick validation
  python comprehensive_harness.py --scale small
  ```
To add a custom scenario, implement a method following this template:

```python
def scenario_11_custom(self, metrics: ScenarioMetrics):
    """Custom scenario description."""
    # 1. Setup
    with self.db.use_namespace("custom") as ns:
        collection = ns.create_collection("test", dimension=384)

        # 2. Execute operations
        start = time.time()
        # ... your test logic ...
        duration_ms = (time.time() - start) * 1000
        metrics.add_latency("custom_op", duration_ms)

        # 3. Validate correctness
        if failures > 0:
            metrics.passed = False
            metrics.errors.append(f"Validation failed: {failures}")

        # 4. Record metrics
        metrics.ndcg_at_10 = compute_ndcg(...)
        print(f"  → Custom metric: {value}")
```

Then register any new fields on `ScenarioMetrics`:

```python
@dataclass
class ScenarioMetrics:
    # ... existing fields ...

    # Add your custom metric
    custom_metric: Optional[float] = None

    def to_dict(self) -> Dict[str, Any]:
        result = super().to_dict()
        result["custom"] = {"metric": self.custom_metric}
        return result
```

Apache 2.0 - See LICENSE file
- Sushanth (@sushanthpy)