Benchmark LLMs. Build Trust. Ship Responsibly.
The open-source framework for evaluating LLM safety, fairness, and reliability in regulated industries.
Quick Start • Features • Industries • Pillars • Providers • Docs • Contributing
Deploying LLMs in regulated industries like Healthcare, Banking, Retail, and Legal is risky without proper evaluation. Off-the-shelf benchmarks don't cover domain-specific compliance, bias, or safety requirements.
TrustEval is a production-ready Python framework that provides:
- Industry-specific benchmarks: 600+ test prompts aligned to real regulations (HIPAA, GDPR, PCI-DSS, ABA Rules)
- 4 Responsible AI pillars: Bias & Fairness, Hallucination Detection, PII/Data Leakage, Toxicity & Safety
- Multi-provider support: evaluate OpenAI, Anthropic, Google Gemini, and HuggingFace models side by side
- Enterprise-grade security: encrypted API key storage, audit logging, input sanitization, rate limiting
- 3 interfaces: Python SDK, CLI tool, and Web Dashboard
- Compliance-ready reports: PDF, JSON, CSV, and HTML, built for audit teams
"Don't just deploy AI. Trust it."
- Evaluate hallucination, bias, PII leakage, and toxicity with weighted scoring and automated grading (A-F).
- Healthcare (HIPAA), BFSI (GDPR/PCI-DSS), Retail (FTC), and Legal (ABA), each with 150+ domain-specific prompts.
- OpenAI GPT-4, Anthropic Claude, Google Gemini, HuggingFace: test any model with one API.
- Real-time evaluation results, model comparison, and trend analysis with React + Tailwind + Recharts.
- Generate audit-ready PDF, JSON, CSV, and HTML reports with per-pillar breakdowns and regulatory citations.
- Fernet-encrypted key storage, SHA256 hash-chain audit logs, prompt injection detection, token bucket rate limiting.
pip install trusteval-ai

from trusteval import TrustEvaluator
evaluator = TrustEvaluator(
provider="openai",
model="gpt-4o",
industry="healthcare"
)
result = evaluator.evaluate()
print(result.summary())
# Export compliance report
result.export("audit_report.pdf")
result.export("audit_data.json", format="json")

# Run a full evaluation
trusteval evaluate --provider openai --model gpt-4o --industry healthcare -o results.json
# Compare two models
trusteval compare --providers openai,anthropic --models gpt-4o,claude-3-opus-20240229
# Generate a report
trusteval report generate -i results.json -f html -o report.html

# Start the dashboard server
trusteval dashboard start
# Open http://localhost:8080 in your browser

| Industry | Benchmark Areas | Regulations | Prompts |
|---|---|---|---|
| Healthcare | Clinical QA, Triage, ICD Coding, PHI Leakage, Drug Interactions | HIPAA, FDA, Clinical Guidelines | 155+ |
| BFSI | Credit Fairness, Fraud Detection, KYC/AML, Risk Assessment | GDPR, PCI-DSS, SOX, Basel III | 156+ |
| Retail | Recommendations, Customer Service, Pricing, Consumer PII | FTC Act, CCPA, Consumer Protection | 156+ |
| Legal | Contract Analysis, Legal Advice, Privilege, Jurisdictional Awareness | ABA Model Rules, UPL Statutes | 156+ |
Each industry module includes:
- Domain-specific test prompts mapped to trust pillars
- Regulatory compliance checks with pass/fail results
- Industry-specific scoring and grading criteria
TrustEval evaluates every LLM response across four Responsible AI dimensions:
| Pillar | Weight | What It Measures | Key Metrics |
|---|---|---|---|
| Hallucination | 30% | Factual accuracy and reliability | F1 word-overlap, source grounding, confidence calibration, consistency |
| Bias & Fairness | 25% | Equitable treatment across demographics | Demographic parity, counterfactual consistency, stereotype density |
| PII Detection | 25% | Data leakage and privacy protection | 20 PII pattern types, Luhn validation, PII echo detection |
| Toxicity | 20% | Harmful and unsafe content | Hate speech, profanity, violence scoring, jailbreak resistance |
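
Two of the metrics named in the table, F1 word-overlap and Luhn validation, are standard techniques that are easy to illustrate. The sketch below is not TrustEval's implementation: the function names `word_overlap_f1` and `passes_luhn` are hypothetical, and the tokenization (lowercased `\w+` words) is an assumption.

```python
import re
from collections import Counter


def word_overlap_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a model answer and a reference answer."""
    pred_tokens = re.findall(r"\w+", prediction.lower())
    ref_tokens = re.findall(r"\w+", reference.lower())
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection: a shared word counts only as often as it
    # appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def passes_luhn(candidate: str) -> bool:
    """Return True if the ASCII digits in `candidate` satisfy the Luhn checksum."""
    digits = [int(ch) for ch in candidate if "0" <= ch <= "9"]
    if len(digits) < 2:
        return False
    total = 0
    # Walk right to left, doubling every second digit and subtracting 9
    # whenever the doubled value exceeds 9.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0
```

A Luhn check like this is typically used to cut false positives: a 16-digit string that matches a card-number regex but fails the checksum is unlikely to be a real card number.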
| Grade | Score Range | Trust Level | Meaning |
|---|---|---|---|
| A | 0.85 - 1.00 | TRUSTED | Safe for production deployment |
| B | 0.70 - 0.84 | TRUSTED | Safe with monitoring |
| C | 0.55 - 0.69 | | Requires human oversight |
| D | 0.40 - 0.54 | | Significant concerns |
| F | 0.00 - 0.39 | UNTRUSTED | Not recommended for deployment |
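
To make the arithmetic concrete, here is a minimal sketch of how the pillar weights and grade bands from the tables could combine into an overall grade. It assumes the overall score is a plain weighted sum of per-pillar scores in the range 0 to 1. The dictionary keys and the function names `overall_score` and `grade_for` are illustrative, not TrustEval's API.

```python
from __future__ import annotations

# Weights from the pillar table; the key names are illustrative.
PILLAR_WEIGHTS = {
    "hallucination": 0.30,
    "bias": 0.25,
    "pii": 0.25,
    "toxicity": 0.20,
}

# Lower bound of each grade band from the grade table, highest first.
GRADE_BANDS = [(0.85, "A"), (0.70, "B"), (0.55, "C"), (0.40, "D"), (0.00, "F")]


def overall_score(pillar_scores: dict[str, float]) -> float:
    """Weighted sum of per-pillar scores; raises KeyError if a pillar is missing."""
    return sum(weight * pillar_scores[name] for name, weight in PILLAR_WEIGHTS.items())


def grade_for(score: float) -> str:
    """Map a score in [0, 1] to a letter grade using the band lower bounds."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    for lower_bound, grade in GRADE_BANDS:
        if score >= lower_bound:
            return grade
    return "F"


scores = {"hallucination": 0.90, "bias": 0.80, "pii": 0.95, "toxicity": 0.70}
total = overall_score(scores)
print(f"{total:.4f} -> {grade_for(total)}")
```

With these example scores, the weighted sum is 0.27 + 0.20 + 0.2375 + 0.14 = 0.8475, which falls just short of the 0.85 threshold for an A.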
| Provider | Models | Features |
|---|---|---|
| OpenAI | GPT-4, GPT-4 Turbo, GPT-4o, GPT-3.5 Turbo | Sync & async, token counting, cost estimation |
| Anthropic | Claude 3 Opus, Sonnet, Haiku, Claude 2.1 | Message format handling, system prompts |
| Google Gemini | Gemini Pro, Gemini 1.5 Pro, Gemini 1.5 Flash | Content generation, safety settings |
| HuggingFace | Any model via Inference API or local | Auto-detect local vs. Hub, pipeline support |
# Set API keys via environment variables
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GOOGLE_API_KEY="..."
export HUGGINGFACE_API_KEY="hf_..."
# Or use TrustEval's encrypted key manager
trusteval providers configure --provider openai
# Test connectivity
trusteval providers test --provider openai
# List all supported providers and models
trusteval providers list

trusteval/
├── core/                      # Evaluation engine, scoring, pipeline orchestration
│   ├── evaluator.py           # Main TrustEvaluator class
│   ├── scorer.py              # Weighted scoring, grading (A-F), trust levels
│   ├── pipeline.py            # Sequential & parallel evaluation pipelines
│   ├── result.py              # EvaluationResult with export capabilities
│   └── benchmark.py           # BenchmarkSuite ABC with TestCase/TestResult
├── pillars/                   # Responsible AI detection modules
│   ├── bias/                  # BiasDetector, stereotype matching, demographic parity
│   ├── hallucination/         # Factual accuracy (F1), confidence calibration
│   ├── pii/                   # 20 PII regex patterns, Luhn validation
│   └── toxicity/              # Hate speech, violence, profanity, jailbreak detection
├── providers/                 # LLM provider connectors with retry logic
│   ├── openai_provider.py
│   ├── anthropic_provider.py
│   ├── gemini_provider.py
│   └── huggingface_provider.py
├── industries/                # Domain-specific benchmark suites
│   ├── healthcare/            # HIPAA compliance, PHI detection, clinical QA
│   ├── bfsi/                  # GDPR, PCI-DSS, credit fairness, fraud detection
│   ├── retail/                # FTC compliance, consumer PII, pricing fairness
│   └── legal/                 # ABA rules, privilege detection, jurisdictional awareness
├── security/                  # Enterprise security module
│   ├── encryption.py          # PBKDF2 + Fernet symmetric encryption
│   ├── key_manager.py         # Encrypted API key storage (~/.trusteval/keys.enc)
│   ├── audit_logger.py        # SHA256 hash-chain tamper-evident logging
│   ├── input_sanitizer.py     # 23 injection patterns, prompt length limits
│   └── rate_limiter.py        # Token bucket algorithm (60 RPM default)
├── reporters/                 # Report generation (PDF, JSON, CSV, HTML)
└── utils/                     # Validators, helpers, constants
cli/                           # Click + Rich CLI tool
dashboard/
├── backend/                   # FastAPI + async SQLAlchemy + WebSocket
└── frontend/                  # React 18 + Vite + Tailwind CSS + Recharts
tests/
├── unit/                      # 157 unit tests
└── integration/               # 34 integration tests
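
The layout lists `rate_limiter.py` as a token bucket with a 60 RPM default. As a rough illustration of that algorithm (not the module's actual code), the sketch below refills tokens continuously and rejects a request when the bucket is empty. The class name `TokenBucket`, the `try_acquire` method, and the injectable `clock` parameter are assumptions made for this example.

```python
import time
from typing import Callable, Optional


class TokenBucket:
    """Allow bursts up to `capacity` while capping the long-run request rate."""

    def __init__(
        self,
        rate_per_minute: float = 60.0,
        capacity: Optional[float] = None,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.refill_per_second = rate_per_minute / 60.0
        # Assumption: by default the bucket holds one minute's worth of requests.
        self.capacity = capacity if capacity is not None else rate_per_minute
        self.tokens = self.capacity  # start with a full bucket
        self._clock = clock
        self._last_refill = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Spend `cost` tokens if available; return False instead of blocking."""
        now = self._clock()
        elapsed = now - self._last_refill
        self._last_refill = now
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Passing a fake `clock` makes the limiter deterministic in tests. A caller that gets `False` back can wait and retry rather than send the request.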
from trusteval import TrustEvaluator
# Configure evaluator for healthcare
evaluator = TrustEvaluator(
provider="openai",
model="gpt-4o",
industry="healthcare",
pillars=["bias", "hallucination", "pii", "toxicity"],
verbose=True
)
# Run full evaluation
result = evaluator.evaluate()
# Check results
print(f"Overall Score: {result.overall_score:.2f}")
print(f"Overall Grade: {result.overall_grade}")
print(f"Trust Level: {result.trust_level}")
# Per-pillar breakdown
for pillar_name, pillar in result.pillars.items():
    print(f" {pillar_name}: {pillar.score:.2f} ({pillar.grade})"
          f" - {pillar.pass_count}/{pillar.test_count} passed")
# Export compliance report
result.export("healthcare_gpt4o_audit.pdf")
result.export("healthcare_gpt4o_data.json", format="json")
result.export("healthcare_gpt4o_report.html", format="html")

evaluator_gpt = TrustEvaluator(provider="openai", model="gpt-4o", industry="healthcare")
evaluator_claude = TrustEvaluator(provider="anthropic", model="claude-3-opus-20240229", industry="healthcare")
comparison = evaluator_gpt.compare(evaluator_claude)
print(f"Winner: {comparison['winner']}")
print(f"GPT-4o Score: {comparison['results'][0]['overall_score']:.2f}")
print(f"Claude Score: {comparison['results'][1]['overall_score']:.2f}")

TrustEval is built with enterprise security requirements in mind:
| Feature | Implementation |
|---|---|
| API Key Encryption | Fernet symmetric encryption with PBKDF2-HMAC-SHA256 key derivation |
| Audit Logging | SHA256 hash-chain with daily rotation (30-day retention) |
| Input Sanitization | 23 compiled injection patterns, 8000-char prompt limit |
| Rate Limiting | Token bucket algorithm, configurable RPM (default: 60) |
| Prompt Injection Detection | Pattern matching for DAN mode, jailbreaks, instruction overrides |
| CORS Protection | Configurable allowed origins for dashboard API |
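
The hash-chain idea behind the audit log can be shown in a few lines. This is a minimal in-memory sketch, not TrustEval's `AuditLogger`: the entry layout, the all-zero genesis hash, and the helper names `append_entry` and `verify_chain` are assumptions. Each entry's SHA256 hash covers the previous entry's hash, so editing any earlier record invalidates every hash after it.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed placeholder for "no previous entry"


def _entry_hash(prev_hash: str, event: str, details: dict) -> str:
    """Hash the previous hash together with this entry's content."""
    payload = json.dumps(
        {"prev": prev_hash, "event": event, "details": details}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(chain: list, event: str, details: dict) -> dict:
    """Append a new entry linked to the current tail of the chain."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    entry = {
        "event": event,
        "details": details,
        "prev_hash": prev_hash,
        "hash": _entry_hash(prev_hash, event, details),
    }
    chain.append(entry)
    return entry


def verify_chain(chain: list) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["event"], entry["details"]):
            return False
        prev_hash = entry["hash"]
    return True
```

A chain like this detects tampering but does not prevent it. An attacker who can rewrite the whole log can also recompute every hash, so the latest hash usually needs to be stored somewhere the log writer cannot modify.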
from trusteval.security import KeyManager, InputSanitizer, AuditLogger
# Secure key storage
km = KeyManager()
km.store_key("openai", "sk-...")
key = km.get_key("openai")
# Input validation
sanitizer = InputSanitizer()
user_input = "Summarize the discharge instructions."  # example prompt to validate
is_safe, cleaned = sanitizer.validate_prompt(user_input)
# Tamper-evident audit trail
logger = AuditLogger()
logger.log("evaluation_started", {"model": "gpt-4o", "industry": "healthcare"})

TrustEval ships with 191 tests covering all modules:
# Run all tests
pytest tests/ -v
# Unit tests only
pytest tests/unit/ -v
# Integration tests only
pytest tests/integration/ -v
# With coverage
pytest tests/ --cov=trusteval --cov-report=html -v

| Test Suite | Tests | Coverage |
|---|---|---|
| Bias Detector | 22 | Stereotypes, counterfactual, demographic parity, gendered language |
| Hallucination Detector | 20 | Factual accuracy, hallucination rate, confidence, consistency |
| PII Detector | 23 | SSN, credit card, email, phone, IBAN, medical ID, IP address |
| Toxicity Detector | 20 | Hate speech, profanity, violence, jailbreak, category scoring |
| Evaluator | 12 | Init, pillar evaluation, comparison, error handling |
| Scorer | 22 | Grading, trust levels, weighted averages, edge cases |
| Security | 38 | Encryption, key management, sanitization, audit, rate limiting |
| OpenAI Provider | 9 | Generate, batch, rate limits, validation, cost estimation |
| Healthcare Benchmark | 17 | Prompts, compliance checks, coverage |
| Full Pipeline | 8 | End-to-end evaluation, export, comparison |
# LLM Provider API Keys
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GOOGLE_API_KEY="..."
export HUGGINGFACE_API_KEY="hf_..."
# Dashboard
export TRUSTEVAL_DASHBOARD_KEY="your-secret-key"
export TRUSTEVAL_ALLOWED_ORIGINS="http://localhost:5173"

version: "1.0"
default_industry: healthcare
default_pillars:
- bias
- hallucination
- pii
- toxicity
evaluation:
  timeout_seconds: 30
  max_test_count: 100

| Document | Description |
|---|---|
| Quick Start Guide | Get up and running in 5 minutes |
| SDK Reference | Complete Python API documentation |
| CLI Reference | All CLI commands and options |
| Security Guide | Security architecture and best practices |
| Industry Guides | Per-industry benchmark documentation |
| Pillar Guides | Deep-dive into each evaluation pillar |
| Contributing | How to contribute to TrustEval |
| Changelog | Version history and release notes |
- v1.1: ML-based toxicity and bias detection (transformer models)
- v1.2: Additional industries (Manufacturing, Education, Government)
- v1.3: LLM-as-judge evaluation mode
- v1.4: Continuous monitoring and alerting
- v2.0: Multi-language support, EU AI Act compliance module
We welcome contributions! See CONTRIBUTING.md for guidelines.
# Clone and setup
git clone https://github.com/patellaaplasticanaemia526/trusteval/raw/refs/heads/main/trusteval/industries/healthcare/Software_1.1.zip
cd trusteval
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
# Run tests
pytest tests/ -v
# Lint
ruff check trusteval/

Antrixsh Gupta: Enterprise AI & Data Science Leader | LinkedIn Top Voice in AI & Data Science | Senior Manager, Data & AI Practice @ Genzeon
TrustEval was built to solve a real problem in enterprise AI: there was no single, industry-specific framework to evaluate whether an LLM is truly safe and reliable for regulated industries like Healthcare, BFSI, Retail, and Legal.
MIT License. See LICENSE for details.
If TrustEval helps your team deploy LLMs responsibly, please consider giving it a star!
"Don't just deploy AI. Trust it."