
Trust Bench


Trust Bench - Multi-Agent Security Evaluation Framework


🏗️ Architecture Overview

Trust Bench SecureEval + Ops v3.0 Architecture
Trust Bench v3.0 Architecture: SecureEval envelope with Ops & Observability layer

🎥 Demo Video

Watch the full Trust Bench SecureEval + Ops demo here:
👉 Trust Bench v3.0 Demo (1 min 45 s)

A short walkthrough showing Trust Bench SecureEval + Ops v3.0 in action — identical UI, enhanced with security guardrails, resilience, structured logs, and health probes.

Trust Bench (Project2v2) is a LangGraph-based multi-agent workflow that inspects software repositories for security leakage, code quality gaps, and documentation health. The system features intelligent agent routing with specialized personas, cross-agent collaboration, transparent reasoning, and reproducible outputs, so graders can run the full evaluation entirely offline.

🚀 Latest Features

Phase 1: Intelligent Agent Routing ✅ Complete

  • 🛡️ Security Agent: Specialized vulnerability assessment and risk analysis
  • ⚡ Quality Agent: Code quality improvements and best practices guidance
  • 📚 Documentation Agent: Documentation generation and improvement suggestions
  • 🎯 Orchestrator Agent: General queries, project overview, and multi-agent coordination
  • Smart Routing: LLM-powered question classification with confidence scoring
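How routing works in miniature: each question is classified against the specialist personas above and the chosen agent is surfaced together with a confidence score. The sketch below is illustrative only — the shipped router uses an LLM classifier, and the keyword sets, names, and thresholds here are assumptions made for this example.

from dataclasses import dataclass

# Illustrative keyword sets; the real router classifies with an LLM.
AGENT_KEYWORDS = {
    "security": {"secret", "vulnerability", "cve", "leak", "credential"},
    "quality": {"refactor", "test", "coverage", "lint", "complexity"},
    "documentation": {"readme", "docs", "docstring", "guide", "tutorial"},
}

@dataclass
class RoutingDecision:
    agent: str          # which specialist should answer
    confidence: float   # 0.0-1.0, shown as a badge in the UI

def route_question(question: str) -> RoutingDecision:
    """Pick the agent whose keywords best match the question (fallback: orchestrator)."""
    words = set(question.lower().split())
    scores = {name: len(words & kws) for name, kws in AGENT_KEYWORDS.items()}
    best, hits = max(scores.items(), key=lambda kv: kv[1])
    if hits == 0:
        return RoutingDecision(agent="orchestrator", confidence=0.3)
    return RoutingDecision(agent=best, confidence=min(1.0, hits / 3))

print(route_question("Did the scan leak any credential tokens?"))  # -> security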

Phase 2: Ops Layer ✅ Complete

  • 🧾 Structured Logging: JSON-formatted logs with run IDs via Project2v2/app/logging.py
  • 🩺 Health Probes: /healthz and /readyz FastAPI routes in Project2v2/app/health.py
  • 🔍 Observability Tests: pytest Project2v2/tests/test_ops_layer.py for log formatting & health endpoints
  • 📊 Documentation Updates: OPERATIONS.md logging/health guidance, SECURITY.md redaction policy
  • 📈 CI Evidence: Coverage + validator hooks ready for pipeline integration
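
A minimal sketch of the two ideas above — JSON log lines carrying a run ID, and FastAPI liveness/readiness routes. The field names and module layout are assumptions for illustration; the shipped implementations live in Project2v2/app/logging.py and Project2v2/app/health.py.

import json
import logging
import uuid

from fastapi import FastAPI

RUN_ID = uuid.uuid4().hex  # illustrative: one run ID per process

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, tagged with the run ID."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "run_id": RUN_ID,
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

app = FastAPI()

@app.get("/healthz")
def healthz() -> dict:
    # Liveness: the process is up and able to answer requests.
    return {"status": "ok"}

@app.get("/readyz")
def readyz() -> dict:
    # Readiness: dependencies loaded; simplified to a constant here.
    return {"status": "ready", "run_id": RUN_ID}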

Phase 0: Parity Lock-In ✅ Complete

  • 🧪 Golden Fixtures: Canonical report.json, report.md, and bundle.zip locked in tests/fixtures/
  • 🛡️ Parity Tests: pytest Project2v2/tests/test_parity.py verifies JSON structure, markdown digest, and bundle contents
  • 📦 Artifact Freeze: Project2v2/output/ mirrors fixtures to guarantee identical observable behavior for future phases
  • 📓 Documentation Hooks: OPERATIONS.md and SECURITY.md populated with Phase 0 baselines to evolve alongside new features

Phase 1: SecureEval Layer ✅ Complete

  • 🛡️ Input Guardrails: Pydantic validation via Project2v2/app/security/guardrails.py
  • 🔒 Sandbox Execution: Allowlisted subprocess wrapper (safe_run) preventing arbitrary shell usage
  • ♻️ Resilience Decorators: Retry with exponential backoff + cross-platform timeout wrappers
  • 🧪 Safety Tests: pytest Project2v2/tests/test_secure_eval.py covering validation, sandboxing, and resilience paths
  • 📘 Documentation: SECURITY.md / OPERATIONS.md updated with Phase 1 safeguards and resilience defaults
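
Two of these safeguards are easy to illustrate: the allowlisted subprocess wrapper and the retry-with-backoff decorator. The sketch below is an approximation under assumptions (the allowlist contents, timeout, and backoff defaults are made up); the shipped versions live under the security module referenced above.

import shlex
import subprocess
import time
from functools import wraps

ALLOWED_BINARIES = {"git", "python"}  # illustrative allowlist

def safe_run(command: str, timeout: int = 120) -> subprocess.CompletedProcess:
    """Run a command without a shell, and only if its binary is allowlisted."""
    argv = shlex.split(command)
    if not argv or argv[0] not in ALLOWED_BINARIES:
        raise PermissionError(f"binary not allowlisted: {argv[:1]}")
    return subprocess.run(argv, capture_output=True, text=True, timeout=timeout)

def retry(attempts: int = 3, base_delay: float = 1.0):
    """Retry a flaky call with exponential backoff (1s, 2s, 4s, ...)."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == attempts - 1:
                        raise
                    time.sleep(base_delay * 2 ** attempt)
        return wrapper
    return decorator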

Phase 2: Multi-Agent Consultation ✅ Complete

  • 🔄 Collaborative Analysis: Complex queries automatically trigger multiple agents
  • 🎯 Multi-Agent Detection: System identifies when specialist consultation is needed
  • 📋 Executive Synthesis: Comprehensive responses combining insights from all relevant agents
  • 🤝 Cross-Domain Queries: Handle requests spanning security, quality, and documentation
  • Intelligent Orchestration: Seamless coordination between specialist agents

Phase 3: Advanced Orchestration ✅ Complete

  • 🤝 Consensus Building: Agents collaborate to reach agreements on complex assessments
  • ⚔️ Conflict Resolution: Systematic resolution of conflicting agent recommendations
  • 🔄 Iterative Refinement: Multiple rounds of analysis for nuanced scenarios
  • ⚖️ Priority Negotiation: Balance competing concerns (e.g., security vs maintainability)
  • 🧠 Advanced Synthesis: Unified recommendations from complex multi-agent negotiations
  • 📊 Comprehensive Analysis: Deep, multi-perspective evaluations with consensus metrics

Phase 4: Custom Agent Weights ✅ Complete

  • 🎛️ Interactive Weight Adjustment: Real-time sliders for Security, Quality, and Documentation agent importance
  • ⚖️ Weighted Scoring System: Final evaluation scores calculated using custom agent weightings
  • 📋 Preset Configurations: Quick-select buttons for Security Focus, Quality Focus, Documentation Focus, and Balanced approaches
  • 📈 Live Score Preview: Real-time preview of how weight changes affect final evaluation scores
  • 🔧 Flexible Integration: Works seamlessly through web interface, CLI, and API with backward compatibility
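
Concretely, the weighted final score is a normalized combination of the per-agent scores. A sketch, assuming 0-100 agent scores and the preset weights described later in the UI section:

def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-agent scores (0-100) using normalized custom weights."""
    total = sum(weights.values()) or 1.0
    return sum(score * weights.get(name, 0.0) for name, score in scores.items()) / total

# Example: the "Security Focus" preset (50% security, 25% quality, 25% docs)
scores = {"security": 0, "quality": 60, "documentation": 80}
weights = {"security": 0.50, "quality": 0.25, "documentation": 0.25}
print(round(weighted_score(scores, weights)))  # -> 35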

Phase 5: Agent Confidence Scoring ✅ Complete

  • 📊 Confidence Calculations: Advanced algorithms assess agent confidence based on response completeness, specificity, and score consistency
  • 🎯 Visual Confidence Meters: Color-coded progress bars (green/yellow/red) display confidence levels for each agent analysis
  • 📋 Confidence Reporting: Confidence scores included in JSON/Markdown reports with detailed breakdowns and visual indicators
  • 🔍 Smart Recommendations: System provides insights based on confidence levels to guide users toward more reliable agent outputs
  • ⚡ Real-time Display: Live confidence updates in web interface alongside analysis results with expandable details
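
The exact confidence formula is internal to the project; the sketch below only shows the shape of such a calculation — blend a few 0-1 signals and map the result onto the green/yellow/red bands used in the UI. The weights are illustrative.

def confidence_band(completeness: float, specificity: float, consistency: float) -> str:
    """Blend three 0-1 signals into a confidence band (weights are illustrative)."""
    score = 0.4 * completeness + 0.3 * specificity + 0.3 * consistency
    if score >= 0.80:
        return f"high ({score:.0%})"    # rendered green in the UI
    if score >= 0.60:
        return f"medium ({score:.0%})"  # yellow
    return f"low ({score:.0%})"         # red

print(confidence_band(completeness=0.9, specificity=0.7, consistency=0.8))  # high (81%)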

Phase 6: Enhanced UI Indicators ✅ Complete

  • 🎯 Consensus Journey Visualization: Complete timeline of agent negotiations with progress markers and round-by-round analysis
  • 💬 Live Negotiation Highlights: Speech bubbles showing actual agent conversations with mood indicators (green/yellow/red)
  • ⚔️ Visual Conflict Resolution: Before/after comparison panels displaying initial disagreements vs final negotiated results
  • 🔄 Interactive Process Steps: Expandable accordion cards for each negotiation round with detailed collaboration insights
  • 📊 Agent Mood Mapping: Real-time mood badges showing agreement, negotiation, and conflict states during consensus building
  • 🎭 Authentic Agent Data: All visualizations use genuine agent conversations and collaboration data, not simulated content

Phase 7: Enhanced Export Features ✅ Complete

  • 📦 Complete Analysis Bundles: ZIP downloads containing JSON reports, Markdown summaries, and chat transcripts in one package
  • 💬 Chat Export/Import: Save and restore conversation histories with agent routing decisions and confidence scores
  • 🔄 Session Continuity: Import previous conversations to continue analysis or share findings with team members
  • 📊 Multiple Download Options: Individual JSON/Markdown reports (legacy) plus new enhanced bundles with chat data
  • 🕐 Timestamped Archives: UTC timestamps and metadata preservation for audit trails and team collaboration
  • 🛡️ Secure File Handling: Safe path validation and proper encoding for cross-platform compatibility
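
A minimal sketch of how such a bundle could be assembled with the standard library (file names and layout are assumptions; the shipped exporter may differ):

import json
import zipfile
from datetime import datetime, timezone
from pathlib import Path

def export_bundle(output_dir: str, chat_history: list[dict]) -> Path:
    """Zip report.json, report.md, and the chat transcript with a UTC timestamp."""
    out = Path(output_dir)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    bundle = out / f"analysis_bundle_{stamp}.zip"
    with zipfile.ZipFile(bundle, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(out / "report.json", arcname="report.json")
        zf.write(out / "report.md", arcname="report.md")
        zf.writestr("chat_transcript.json",
                    json.dumps(chat_history, indent=2, ensure_ascii=False))
    return bundle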

Contents

  1. Overview
  2. Tool Integrations
  3. Installation & Setup
  4. Running the System
  5. Evaluation Metrics Instrumentation
  6. Reporting Outputs
  7. Demo Video
  8. Example Results (Project2v2 self-audit)
  9. MCP Server (Scope Decision)
  10. File Structure (trimmed)
  11. Security & Hardening Notes
  12. Documentation Roadmap & Evidence
  13. Credits & References

Overview

  • Agents: Manager (plan/finalize), SecurityAgent, QualityAgent, DocumentationAgent
  • Core Tools: regex secret scanner, repository structure analyzer, documentation reviewer
  • Collaboration: agents exchange messages and adjust scores based on peer findings (security alerts penalize quality/documentation; quality metrics influence documentation, etc.)
  • Deliverables: JSON and Markdown reports containing composite scores, agent summaries, conversation logs, and instrumentation metrics
[Manager Plan]
     |
[SecurityAgent] --> alerts --> [QualityAgent] --> metrics --> [DocumentationAgent]
     \____________________________ shared context _____________________________/
                           |
                   [Manager Finalize] --> report.json / report.md
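
A plain-Python sketch of that flow — stub agents standing in for the real LangGraph nodes, with a shared context dictionary carrying findings downstream. All numbers and penalty rules here are illustrative:

# Stub agents standing in for the real LangGraph nodes (illustrative only).
def manager_plan(ctx):
    return ["security", "quality", "documentation"]

def security_agent(ctx):
    return {"alerts": ["seeded AWS key in config.py"], "score": 0}

def quality_agent(ctx):
    # Peer findings penalize the score, as described under Collaboration above.
    return {"score": max(0, 60 - 20 * len(ctx["security"]["alerts"]))}

def documentation_agent(ctx):
    return {"score": 70}

def manager_finalize(ctx):
    scores = [ctx[name]["score"] for name in ("security", "quality", "documentation")]
    return {"overall": round(sum(scores) / len(scores)), "plan": ctx["plan"]}

def run_workflow(repo_path: str) -> dict:
    ctx = {"repo": repo_path}
    ctx["plan"] = manager_plan(ctx)                 # [Manager Plan]
    ctx["security"] = security_agent(ctx)           # alerts feed forward
    ctx["quality"] = quality_agent(ctx)             # metrics feed forward
    ctx["documentation"] = documentation_agent(ctx)
    return manager_finalize(ctx)                    # -> report.json / report.md

print(run_workflow("."))  # e.g. {'overall': 37, 'plan': [...]}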

Tool Integrations

Tool | Consumed By | Capability / Extension
run_secret_scan | SecurityAgent | Detects high-signal credentials (AWS, GitHub, RSA keys)
analyze_repository_structure | QualityAgent | Counts files, languages, estimated test coverage
evaluate_documentation | DocumentationAgent | Scores README variants by coverage and cross-agent context
serialize_tool_result | All agents | Normalizes tool dataclasses for message passing

MCP endpoints are intentionally not shipped in Project2v2. See MCP Server (Scope Decision).
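
For flavor, a regex secret scanner in the spirit of run_secret_scan might look like the sketch below. The patterns and the function signature are assumptions for illustration, not the exact ones shipped in multi_agent_system/tools.py.

import re
from pathlib import Path

# Illustrative high-signal patterns (the shipped scanner may use different ones).
SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "github_token": re.compile(r"ghp_[0-9A-Za-z]{36}"),
    "rsa_private_key": re.compile(r"-----BEGIN RSA PRIVATE KEY-----"),
}

def run_secret_scan(repo_path: str) -> list[dict]:
    """Return one finding per (file, pattern) match under repo_path."""
    findings = []
    for path in Path(repo_path).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(text):
                findings.append({"file": str(path), "type": name})
    return findings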


Prerequisites

Before installing Trust Bench SecureEval + Ops, ensure you have the following:

System Requirements

  • Python: 3.10 or later (3.11+ recommended for best performance)
  • Operating System:
    • Windows 10/11 (tested on Windows 11)
    • Linux (Ubuntu 20.04+, Debian 11+, RHEL 8+)
    • macOS 11+ (Big Sur or later)
  • Hardware:
    • RAM: Minimum 4 GB, 8 GB recommended for large repositories
    • Storage: 2 GB free disk space for dependencies and analysis artifacts
    • CPU: No GPU required; runs on standard CPU architectures

Required Software

  • Git: Version 2.30+ for repository cloning
  • Python Package Manager: pip 21+ (included with Python 3.10+)

Network & Access Requirements

  • Outbound Internet Access:
    • GitHub.com (for cloning public repositories)
    • LLM Provider APIs: OpenAI, Groq, or Gemini (for interactive chat features)
  • Permissions: Read access to target repositories you want to analyze

API Keys (Optional but Recommended)

For the interactive web chat and agent consultation features, you need at least one LLM provider API key:

  • OpenAI API Key (Get one here)
    • Recommended for best performance
    • Requires active billing/credits
  • Groq API Key (Alternative, free tier available)
  • Gemini API Key (Google's alternative)

Note: The core evaluation features (security scan, quality analysis, documentation review) work without any API keys. API keys are only required for the chat/consultation interface.

Knowledge Prerequisites

  • Basic Command Line: Familiarity with terminal/PowerShell commands
  • Git Basics: Understanding of repository cloning and navigation
  • Python Basics: Ability to activate virtual environments and run Python scripts

Installation & Setup

Quick Start (5 minutes)

Follow these step-by-step instructions to get Trust Bench running on your system.

Step 1: Clone the Repository

git clone https://github.com/mwill20/Trust-Bench-SecureEval-Ops.git
cd Trust-Bench-SecureEval-Ops

Expected output: Repository cloned successfully, directory contains Project2v2/ folder.

Step 2: Create a Virtual Environment (Recommended)

This isolates Trust Bench's dependencies from your system Python.

Windows (PowerShell):

python -m venv .venv
.\.venv\Scripts\activate

Linux / macOS:

python3 -m venv .venv
source .venv/bin/activate

Expected output: Your terminal prompt should now show (.venv) prefix.

Step 3: Install Dependencies

# Upgrade pip first (recommended)
pip install --upgrade pip

# Install core dependencies
pip install -r Project2v2/requirements-phase1.txt

# Optional: Install advanced features (evaluation metrics, static analysis)
pip install -r Project2v2/requirements-optional.txt

Expected output:

  • Core install: ~50-100 packages installed (LangGraph, Flask, Pydantic, etc.)
  • Optional install: Additional packages for enhanced analysis capabilities

Verification:

python -c "import langgraph, flask, pydantic; print('Dependencies OK')"

Should print: Dependencies OK

Step 4: Configure Environment Variables (Optional)

Copy the example environment file and add your API keys:

Windows:

copy Project2v2\.env.example Project2v2\.env

Linux / macOS:

cp Project2v2/.env.example Project2v2/.env

Edit Project2v2/.env with your preferred text editor:

# Minimum configuration for chat features
OPENAI_API_KEY=sk-your-actual-key-here
LLM_PROVIDER=openai
ENABLE_SECURITY_FILTERS=true

Important: If you skip this step, the core analysis features still work! API keys are only required for the interactive chat interface.

Step 5: Verify Installation

Run a quick self-test:

cd Project2v2
python main.py --repo ../Project2v2 --output test_output

Expected output:

=== Multi-Agent Evaluation Complete ===
Repository: /path/to/Trust-Bench-SecureEval-Ops/Project2v2
Overall Score: 32/100
Grade: needs_attention
System Latency: 0.08 seconds
Faithfulness: 0.62
Refusal Accuracy: 1.0
Per-Agent Timings:
  - SecurityAgent: 0.07 seconds
  - QualityAgent: 0.003 seconds
  - DocumentationAgent: 0.002 seconds
Report (JSON): test_output/report.json
Report (Markdown): test_output/report.md

You're ready to go! See Running the System below for usage options.

Troubleshooting Installation

Issue: python: command not found or python3: command not found

  • Solution: Install Python 3.10+ from python.org or your package manager
  • Verify: python --version or python3 --version

Issue: pip install fails with permission errors

  • Solution: Ensure you've activated the virtual environment (Step 2)
  • Or use: pip install --user -r requirements-phase1.txt

Issue: ModuleNotFoundError after installation

  • Solution: Ensure you're in the activated virtual environment
  • Reinstall dependencies: pip install --force-reinstall -r Project2v2/requirements-phase1.txt

Issue: Windows "execution policy" blocks scripts

  • Solution: Run PowerShell as Administrator and execute:
    Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

Environment Variables Reference

These variables can be set in Project2v2/.env or as system environment variables:

Variable | Required | Default | Description
OPENAI_API_KEY | No* | None | OpenAI API key for chat features
GROQ_API_KEY | No* | None | Alternative: Groq API key
GEMINI_API_KEY | No* | None | Alternative: Google Gemini API key
LLM_PROVIDER | No | openai | Which LLM to use: openai, groq, or gemini
ENABLE_SECURITY_FILTERS | No | true | Enable input validation and prompt sanitization
TB_MAX_FILES | No | 2000 | Maximum files to scan per repository
TB_MAX_FILE_SIZE_MB | No | 2 | Skip files larger than this (MB)
TB_CLONE_TIMEOUT | No | 120 | Repository clone timeout (seconds)
LOG_LEVEL | No | INFO | Logging verbosity: DEBUG, INFO, WARNING, ERROR

* At least one LLM provider key is required for chat features; core analysis works without any keys.
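
These settings are plain environment variables, so they can be read with the standard library. A sketch of how a consumer might load them with the defaults above (the project's actual config loader may differ):

import os

def env_int(name: str, default: int) -> int:
    """Read an integer setting from the environment, falling back to a default."""
    try:
        return int(os.getenv(name, default))
    except ValueError:
        return default

MAX_FILES = env_int("TB_MAX_FILES", 2000)
MAX_FILE_SIZE_MB = env_int("TB_MAX_FILE_SIZE_MB", 2)
CLONE_TIMEOUT = env_int("TB_CLONE_TIMEOUT", 120)
FILTERS_ENABLED = os.getenv("ENABLE_SECURITY_FILTERS", "true").lower() == "true"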


Running the System

Web Interface (recommended)

cd Project2v2
python web_interface.py
# browse to http://localhost:5001

The web interface now features intelligent agent routing that automatically directs your questions to the most appropriate specialist agent. Ask security questions, request code quality improvements, or seek documentation help - the system will route to the right expert and provide contextual responses with visual agent indicators.

Direct CLI

cd Project2v2
python main.py --repo .. --output output

Legacy CLI (kept for rubric compatibility)

python -m trustbench_core.eval.evaluate_agent --repo <path> --output Project2v2/output

This forwards to Project2v2/main.py; the new entrypoint remains the single source of truth.

Convenience Scripts

cd Project2v2
.\run_audit.ps1 .. my_output   # PowerShell
run_audit.bat ..               # Windows CMD
launch.bat                     # Interactive menu (web UI, CLI, presets)

Interactive Web Application

Trust Bench SecureEval + Ops features a production-grade Flask web application that provides an intuitive interface for repository security auditing and AI-powered analysis consultation.

Framework & Architecture

Technology Stack:

  • Backend: Flask 3.0+ (Python web framework)
  • Frontend: Custom HTML5, CSS3, JavaScript (ES6+)
  • Agent Orchestration: LangGraph for multi-agent workflows
  • Real-time Updates: AJAX for asynchronous status polling
  • Security: Pydantic input validation, CSRF protection, API key sanitization

User Interface Layout

Left Sidebar (Control Panel)

The sidebar contains all user inputs and configuration options:

  1. How to Use Section

    • Quick summary of the evaluation process
    • Links to documentation and help resources
  2. Repository Input

    • GitHub URL input field with validation
    • Example: https://github.com/owner/repository
    • Supports public repositories (private repos require authentication)
  3. LLM Provider Selection (Optional)

    • Dropdown menu: OpenAI, Groq, or Gemini
    • API key input with masked display (not stored or logged)
    • "Test Connection" button for immediate validation
    • Privacy notice: Keys are session-only, never persisted
  4. Evaluation Metrics Panel

    • Interactive sliders for custom agent weighting:
      • 🛡️ Security Agent (vulnerability scanning)
      • 🏗️ Quality Agent (code structure & testing)
      • 📚 Documentation Agent (README & guides)
    • Preset buttons for quick configuration:
      • Security Focus (50% security, 25% quality, 25% docs)
      • Quality Focus (25% security, 50% quality, 25% docs)
      • Documentation Focus (25% security, 25% quality, 50% docs)
      • Balanced (equal weights)
    • Live score preview showing weighted vs. equal scoring

Main Panel (Orchestration View)

The main panel displays real-time analysis progress and results:

  1. Progress Workflow Visualization

    • Step 1: Input - Repository URL validated
    • Step 2: Orchestrator - Manager node coordinates agents
    • Step 3: Agent Tiles (expandable):
      • Security Agent 🛡️ - Scanning for secrets & vulnerabilities
      • Quality Agent 🏗️ - Analyzing code & test coverage
      • Documentation Agent 📚 - Reviewing docs & READMEs
    • Step 4: Results - Scores compiled, report generated
    • Color-coded states:
      • Gray: Idle/waiting
      • Blue: Active/running
      • Green: Completed successfully
      • Red: Error state
  2. Agent Detail Expandos

    • Each agent tile has a "Show details" button
    • Reveals capabilities, collaboration context, and real-time status
    • Example: Security Agent shows number of findings, risk level
  3. Phase 3: Consensus Journey Visualization

    • Progress timeline with negotiation round markers
    • Agent mood indicators:
      • 🟢 Green: Agreement
      • 🟡 Yellow: Negotiating
      • 🔴 Red: Conflict
    • Live negotiation highlights: Speech bubbles showing agent conversations
    • Conflict resolution panels: Before/after comparison of disagreements
  4. Results Section

    • Overall Score: Numeric score (0-100) with grade (excellent/good/fair/needs_attention)
    • Per-dimension breakdown:
      • Security score with confidence meter
      • Quality score with confidence meter
      • Documentation score with confidence meter
    • Confidence visualization: Color-coded progress bars
      • High (≥80%): Green
      • Medium (60-79%): Yellow
      • Low (<60%): Red
    • Download options:
      • Individual JSON/Markdown reports
      • Complete analysis bundle (ZIP with chat transcripts)
    • Export/Import chat: Save conversation history for later review
  5. Agent Chat Interface (when LLM key provided)

    • Chat history with message threading
    • Agent-specific responses with avatars:
      • 🛡️ Security Agent (red accent)
      • 🏗️ Quality Agent (blue accent)
      • 📚 Documentation Agent (green accent)
      • 🎯 Orchestrator (purple accent)
    • Routing transparency: Each response shows why that agent was selected
    • Confidence badges: Visual indicator of agent certainty
    • Context awareness: Agents reference latest report data

UX Design Considerations

1. Visibility of System Status

  • Real-time progress indicators prevent "black box" feeling
  • Color-coded states (idle/running/completed) provide at-a-glance status
  • Agent tiles update dynamically as orchestrator dispatches tasks

2. Progressive Disclosure

  • Complex details hidden by default (expandable tiles)
  • High-level summary visible without scrolling
  • Advanced features (custom weights, chat) optional but accessible

3. Safety & Privacy

  • API key handling: Keys accepted through UI, used in-memory only
  • Never logged: Security filters redact sensitive data from logs
  • Session-only storage: Keys cleared when browser closes
  • Privacy notice: Explicit user communication about data handling

4. Error Recovery

  • Graceful degradation: Core features work without API keys
  • Inline error messages with actionable solutions
  • Timeout handling: Long-running tasks show progress, can be cancelled

5. Accessibility

  • Semantic HTML with ARIA labels
  • Keyboard navigation support
  • High-contrast color schemes for readability
  • Responsive design (works on tablets and desktops)

6. Performance Optimization

  • Lazy loading of agent details (only when expanded)
  • Debounced API calls prevent server overload
  • Cached report data reduces redundant analysis

Interactive Features Showcase

Try these workflows:

  1. Quick Audit: Enter a GitHub URL, click "Analyze Repository", watch agents collaborate
  2. Custom Weighting: Adjust sliders to emphasize security, see score preview update in real-time
  3. Agent Consultation: After analysis, ask "What are the top security risks?" - Security Agent responds with context
  4. Consensus Exploration: Expand orchestration timeline to see how agents negotiated conflicting findings
  5. Export & Share: Download complete bundle, share with team, import conversation on another machine

Evaluation Metrics Instrumentation

Every run records deterministic metrics alongside agent results:

  • System latency - overall wall-clock time plus per-agent/per-tool timings (metrics.system_latency_seconds, metrics.per_agent_latency)
  • Faithfulness - heuristic alignment of summaries with tool evidence (metrics.faithfulness)
  • Refusal accuracy - simulated unsafe prompt harness (returns 1.0 while LLM calls are disabled) (metrics.refusal_accuracy)

Metrics appear in both report.json (under metrics) and report.md (rendered table). Example CLI output:

System Latency: 0.08 seconds
Faithfulness: 0.62
Refusal Accuracy: 1.0
Per-Agent Timings:
  - SecurityAgent: 0.07 seconds
  - QualityAgent: 0.003 seconds
  - DocumentationAgent: 0.002 seconds
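
Because the same metrics land in report.json under the metrics key, they can be checked programmatically. A minimal sketch, assuming per_agent_latency is a mapping of agent name to seconds (the other key names follow the dotted paths listed above):

import json

with open("test_output/report.json", encoding="utf-8") as fh:
    report = json.load(fh)

metrics = report["metrics"]
print("latency:", metrics["system_latency_seconds"])
print("faithfulness:", metrics["faithfulness"])
print("refusal accuracy:", metrics["refusal_accuracy"])
for agent, seconds in metrics["per_agent_latency"].items():
    print(f"  {agent}: {seconds:.3f}s")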

Reporting Outputs

Each audit (web or CLI) produces:

  • report.json - timestamp, repo path, composite summary, per-agent results, metrics, full conversation log
  • report.md - human-readable summary with agent cards, instrumentation metrics, conversation log
  • Optional timestamped archives (github_analysis_*) when launched through the web interface

Deployment

Trust Bench SecureEval + Ops supports multiple deployment scenarios, from local development to production cloud environments.

Deployment Mode 1: Local Development (Recommended for First Use)

This is the default setup described in Installation & Setup.

Use when:

  • Exploring the tool on your laptop
  • Development and testing
  • Quick ad-hoc repository audits

Start the web interface:

cd Project2v2
python web_interface.py

Access: http://localhost:5001

Deployment Mode 2: Docker Container (Production-Ready)

Quick start with Docker Compose:

# 1. Clone the repository
git clone https://github.com/mwill20/Trust-Bench-SecureEval-Ops.git
cd Trust-Bench-SecureEval-Ops

# 2. Create .env file with your API key
echo "OPENAI_API_KEY=sk-your-key-here" > .env

# 3. Start the container
docker-compose up -d

# 4. Access the web interface
# Open http://localhost:5001 in your browser

# View logs
docker-compose logs -f

# Stop the container
docker-compose down

Manual Docker build and run:

# Build the image
docker build -t trustbench:latest .

# Run with environment variables
docker run -d \
  --name trustbench \
  -p 5001:5001 \
  -e OPENAI_API_KEY=sk-your-key-here \
  -e ENABLE_SECURITY_FILTERS=true \
  -e TB_RUN_MODE=strict \
  -v trustbench-data:/data \
  -v trustbench-logs:/logs \
  trustbench:latest

# View logs
docker logs -f trustbench

# Stop and remove
docker stop trustbench
docker rm trustbench

Docker features:

  • Multi-stage build: Smaller final image (~300 MB vs ~1 GB)
  • Non-root user: Runs as trustbench user (UID 1000) for security
  • Health checks: Automatic restart on failures
  • Persistent storage: Volumes for analysis results and logs
  • Resource limits: Configurable CPU/memory constraints
  • Read-only filesystem: Enhanced security with tmpfs for temp files

Environment variables (see .env.example for full list):

# Required: At least one API key
OPENAI_API_KEY=sk-...
# or GROQ_API_KEY=gsk-...
# or GEMINI_API_KEY=...

# Optional: Configuration overrides
LLM_PROVIDER=openai
TB_MAX_FILES=2000
TB_MAX_FILE_SIZE_MB=2
AGENT_TIMEOUT_SECONDS=120
LOG_LEVEL=INFO

Volumes explained:

  • /data: Persistent storage for analysis results and reports
  • /logs: Application logs (if file logging enabled)

Production deployment with Docker:

# Use docker-compose for persistent setup
docker-compose up -d

# Update to latest version
git pull
docker-compose build
docker-compose up -d

# View resource usage
docker stats trustbench-secureeval-ops

Deployment Mode 3: Server / Cloud Deployment

For long-running or team deployments, run Trust Bench on a dedicated server or cloud VM.

Prerequisites

  • VM with Python 3.10+ (Ubuntu 22.04 LTS, Amazon Linux 2023, etc.)
  • 2-4 vCPU, 8 GB RAM recommended
  • Inbound port 5001 open (or your chosen port)

Option A: Systemd Service (Linux)

Create /etc/systemd/system/trust-bench.service:

[Unit]
Description=Trust Bench SecureEval + Ops Web Interface
After=network.target

[Service]
Type=simple
User=trustbench
WorkingDirectory=/opt/trust-bench/Project2v2
Environment="PATH=/opt/trust-bench/.venv/bin"
Environment="OPENAI_API_KEY=sk-your-key-here"
Environment="ENABLE_SECURITY_FILTERS=true"
ExecStart=/opt/trust-bench/.venv/bin/python web_interface.py
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable trust-bench
sudo systemctl start trust-bench
sudo systemctl status trust-bench

Option B: Supervisor (Cross-Platform)

Install supervisor: pip install supervisor

Add to supervisord.conf:

[program:trust-bench]
command=/opt/trust-bench/.venv/bin/python web_interface.py
directory=/opt/trust-bench/Project2v2
user=trustbench
autostart=true
autorestart=true
environment=OPENAI_API_KEY="sk-your-key",ENABLE_SECURITY_FILTERS="true"
stdout_logfile=/var/log/trust-bench/stdout.log
stderr_logfile=/var/log/trust-bench/stderr.log

Start:

supervisorctl reread
supervisorctl update
supervisorctl start trust-bench

Option C: Reverse Proxy (nginx)

For HTTPS and domain mapping, place nginx in front of Trust Bench:

server {
    listen 443 ssl http2;
    server_name trustbench.yourdomain.com;

    ssl_certificate /path/to/cert.pem;
    ssl_certificate_key /path/to/key.pem;

    location / {
        proxy_pass http://127.0.0.1:5001;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        
        # WebSocket support (for future real-time features)
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}

Cloud Platform Examples

AWS EC2:

  1. Launch t3.medium instance (Ubuntu 22.04)
  2. Configure security group: Allow inbound TCP 5001 (or 443 if using nginx)
  3. Follow systemd service setup above
  4. Optional: Use Elastic IP for consistent access

Azure VM:

  1. Create B2s VM (Ubuntu 22.04)
  2. Open NSG port 5001 or 443
  3. Follow systemd service setup above

Google Cloud:

  1. Create e2-medium instance (Ubuntu 22.04)
  2. Configure firewall rule for port 5001/443
  3. Follow systemd service setup above

DigitalOcean Droplet:

  1. Basic Droplet ($12/month, 2 GB RAM)
  2. Follow systemd service setup above
  3. Optional: Enable DigitalOcean firewall

Security Considerations for Production

  • Use HTTPS: Always deploy behind a reverse proxy with TLS certificates
  • Firewall: Restrict access to trusted IP ranges if possible
  • API Keys: Use environment variables, never commit to git
  • Rate Limiting: Consider adding nginx rate limiting for public deployments
  • Monitoring: Set up log aggregation (see OPERATIONS.md)
  • Updates: Regularly git pull and restart the service for security patches

Scaling Considerations

Current architecture supports:

  • 5-10 concurrent users (single Flask worker)
  • 10-20 repository analyses per hour
  • Single-server deployment

For higher loads, consider:

  • Deploy behind Gunicorn: gunicorn -w 4 -b 0.0.0.0:5001 web_interface:app
  • Use Redis for session management
  • Add load balancer for multiple instances
  • See OPERATIONS.md for production hardening guidance

Troubleshooting

Common issues and their solutions:

Installation Issues

Error: python: command not found

Symptoms: Terminal cannot find python or python3 command

Solutions:

  1. Install Python 3.10+ from python.org
  2. On Linux: sudo apt install python3.10 python3.10-venv (Ubuntu/Debian)
  3. On macOS: brew install python@3.10
  4. Verify: python --version or python3 --version should show 3.10 or higher

Error: pip install fails with permission errors

Symptoms:

ERROR: Could not install packages due to an EnvironmentError: [Errno 13] Permission denied

Solutions:

  1. Activate virtual environment first: .venv\Scripts\activate (Windows) or source .venv/bin/activate (Linux/macOS)
  2. Or use user install: pip install --user -r requirements-phase1.txt
  3. Never use sudo pip (breaks system Python)

Error: ModuleNotFoundError after successful install

Symptoms:

ModuleNotFoundError: No module named 'langgraph'

Solutions:

  1. Ensure virtual environment is activated (check for (.venv) prefix in terminal)
  2. Reinstall in correct environment:
    pip install --force-reinstall -r Project2v2/requirements-phase1.txt
  3. Verify environment: which python (Linux/macOS) or where python (Windows) should point to .venv

Runtime Issues

App doesn't start: Address already in use

Symptoms:

OSError: [Errno 48] Address already in use

Solutions:

  1. Another process is using port 5001
  2. Find and kill it:
    • Linux/macOS: lsof -ti:5001 | xargs kill -9
    • Windows: netstat -ano | findstr :5001, then taskkill /PID <PID> /F
  3. Or change port: Set WEB_PORT=5002 in .env

LLM Chat Not Working: Invalid API key

Symptoms: Chat returns "Sorry, I'm having trouble accessing the AI service"

Solutions:

  1. Verify API key is set: echo $OPENAI_API_KEY (Linux/macOS) or echo %OPENAI_API_KEY% (Windows)
  2. Check key has active billing/credits at OpenAI Platform
  3. Try alternative provider: Set LLM_PROVIDER=groq and GROQ_API_KEY=... in .env
  4. Workaround: Core analysis features work without API keys (only chat is affected)

Agent Tiles Stay "Running" Indefinitely

Symptoms: Progress bars stuck, no error message

Solutions:

  1. Check timeout settings in .env: AGENT_TIMEOUT_SECONDS=120 (increase if analyzing large repos)
  2. View detailed logs: tail -f Project2v2/logs/app.log (if logging configured)
  3. For large repos, increase limits:
    TB_MAX_FILES=5000
    TB_MAX_FILE_SIZE_MB=5
    TB_CLONE_TIMEOUT=300
  4. Check network connectivity to GitHub (required for repository cloning)

Repository Analysis Fails Immediately

Symptoms: "Repository not found" or "Cloning failed" error

Solutions:

  1. Verify URL format: Must be https://github.com/owner/repo (no trailing slash)
  2. Public repositories only: Private repos require authentication (not yet supported in UI)
  3. Large repositories: Repos >1 GB may timeout; increase TB_CLONE_TIMEOUT=300
  4. Network issues: Check firewall/proxy settings allow access to github.com

Web Interface Issues

Blank page or "Cannot GET /"

Symptoms: Browser shows empty page or 404 error

Solutions:

  1. Verify app is running: Check terminal for "Running on http://127.0.0.1:5001"
  2. Use correct URL: http://localhost:5001 (not 127.0.0.1:5001 on some systems)
  3. Clear browser cache: Ctrl+F5 (Windows) or Cmd+Shift+R (macOS)
  4. Check browser console for JavaScript errors (F12 Developer Tools)

Windows: "execution policy" blocks scripts

Symptoms:

.venv\Scripts\activate : File cannot be loaded because running scripts is disabled on this system.

Solutions:

# Run PowerShell as Administrator
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

# Then retry activation
.\.venv\Scripts\activate

Performance Issues

Analysis is very slow (>5 minutes)

Causes & Solutions:

  1. Large repository: Normal for repos with 10,000+ files
    • Solution: Increase timeouts or analyze smaller repos first
  2. Slow internet: Clone operation is network-bound
    • Solution: Pre-clone repo locally, use CLI: python main.py --repo /path/to/local/repo
  3. Low RAM: System swapping to disk
    • Solution: Close other applications, add swap space, or upgrade RAM

Out of memory errors

Symptoms:

MemoryError or killed by OS

Solutions:

  1. Reduce file scan limits:
    TB_MAX_FILES=1000
    TB_MAX_FILE_SIZE_MB=1
  2. Analyze smaller repositories first
  3. Increase system RAM or swap space
  4. Use CLI for more efficient processing: python main.py --repo <path>

Getting Help

If you've tried the above solutions and still have issues:

  1. Check logs: Project2v2/logs/app.log (if logging is configured)
  2. GitHub Issues: Open an issue with:
    • Python version: python --version
    • OS details: uname -a (Linux/macOS) or systeminfo (Windows)
    • Error message (full traceback)
    • Steps to reproduce
  3. Documentation: See OPERATIONS.md and SECURITY.md
  4. Community: Check existing issues for similar problems

Demo Video

If playback doesn't work on GitHub, download the file locally from the same link above.

The full-resolution video is hosted via OneDrive to keep the repository history lean. If you want an offline copy, download it from the link above and place it under Project2v2/assets/images/.


Example Results (Project2v2 self-audit)

  • Overall Score: ~32/100 (needs_attention)
  • Security: seeded secrets detected (score 0) drive collaboration penalties
  • Quality: medium score, automatically penalized by SecurityAgent findings
  • Documentation: strong base score but reduced for missing security/testing guidance
  • Collaboration: more than five cross-agent messages; Manager summarizes adjustments in the final log

MCP Server (Scope Decision)

Project2v2 prioritizes deterministic, offline-capable tooling. To keep grading reproducible and avoid external runtime dependencies, the earlier MCP server has been intentionally deprecated for this version. Required tool integrations (three or more) are provided as direct Python callables. MCP can be revisited later if cross-client interoperability (Claude Desktop, Cursor, etc.) becomes necessary, but it is not required for Module 2 compliance.


File Structure (trimmed)

Trust_Bench/
|-- Project2v2/
|   |-- main.py
|   |-- web_interface.py
|   |-- multi_agent_system/
|   |   |-- agents.py
|   |   |-- orchestrator.py
|   |   |-- tools.py
|   |   `-- reporting.py
|   |-- requirements-phase1.txt
|   |-- requirements-optional.txt
|   |-- run_audit.(bat|ps1)
|   |-- launch.bat
|   `-- output/
|-- trustbench_core/      (legacy CLI wrapper forwarding to Project2v2/main.py)
`-- Project2v2/checklist.yaml (optional-features register)

Security & Hardening Notes

  • All detected secrets are synthetic and included solely for demonstration purposes. No real credentials are exposed.
  • security_utils.py and the web UI sanitize repository URLs, prompts, and API keys.
  • Optional extras (ragas, semgrep, streamlit) enable deeper analytics and dashboarding when desired.

Future Work & Enhancements

The following features are planned for future releases:

Future Considerations (Lower Priority)

  • Batch Analysis: Automated analysis of multiple repositories simultaneously with queue management and comparative dashboards
    • Note: This feature requires significant infrastructure (job queuing, background workers, database) and is planned for a later major release

Long-term Vision

  • Additional specialized agents (Performance, Accessibility, Compliance)
  • Integration with CI/CD pipelines
  • Real-time repository monitoring
  • Advanced analytics and trending

Documentation Roadmap & Evidence

To satisfy Project 3 publication requirements, the repository ships with dedicated documentation artifacts. Keep these synchronized with feature development:

Document | Purpose | Status / Next Action
README.md | Executive summary, architecture overview, quickstart, CI/coverage references | ✅ Actively maintained
OPERATIONS.md | Operational excellence rubric (start/stop/health, logging, resilience, monitoring) | 📝 Skeleton ready – fill details as services evolve
SECURITY.md | Security & safety rubric (input validation, guardrails, sandbox, SOC 2-lite map) | 📝 Skeleton ready – update per guardrail implementation
USER_GUIDE.md | End-user walkthrough + context | ✅ Append “Production Enhancements” notes when Phase 0 work lands
docs/evidence/ (optional) | Test/coverage screenshots, CI logs for publication attachments | 🔄 Create as artifacts are generated

Working agreement:

  • Phase 0 (“Parity Lock-In”) adds test coverage notes under README → Testing & CI Summary, and populates the open TODOs in OPERATIONS.md / SECURITY.md.
  • Each subsequent phase should update the relevant section(s) before closing the task to avoid drift.

Production Enhancements (Module 3)

Phases 3 and 4 harden the system for publication while preserving identical UX:

  • CI gate (python-ci workflow) runs tests, coverage, security audit, and rubric validator on every PR.
  • Structured JSON logs, health probes, and resilience settings documented in OPERATIONS.md.
  • Security controls (input validation, sandbox, redaction, SOC 2-lite mapping) captured in SECURITY.md.
  • Evidence bundle placeholders live under Project2v2/docs/evidence/ for publishing artifacts (CI proof, screenshots, coverage report).

License & Support

License

Trust Bench SecureEval + Ops is released under the MIT License.

What this means for you:

  • Commercial use: Use in commercial products and services
  • Modification: Fork, modify, and adapt the code
  • Distribution: Share modified or unmodified versions
  • Private use: Use privately without publishing your changes
  • ⚠️ Liability: No warranty provided; use at your own risk

Full license text: See LICENSE in the repository root.

Attribution: When using Trust Bench in publications or products, please include:

Trust Bench SecureEval + Ops by Michael Williams
https://github.com/mwill20/Trust-Bench-SecureEval-Ops
Licensed under MIT

Support Channels

Bug Reports & Feature Requests:

  • GitHub Issues: Open an issue
  • Labels:
    • bug: Something isn't working correctly
    • enhancement: New feature or improvement request
    • documentation: Documentation improvements
    • question: General questions about usage

Security Concerns:

  • Private reporting: For security vulnerabilities, please open a private security advisory
  • Contact: See SECURITY.md for responsible disclosure process
  • Do NOT open public issues for security vulnerabilities

Community & Discussion:

  • GitHub Discussions: Q&A, ideas, and general discussion
  • Pull Requests: Contributions welcome! See Contributing below

Contributing

We welcome contributions to Trust Bench SecureEval + Ops!

How to contribute:

  1. Fork the repository on GitHub
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Make your changes:
    • Follow existing code style (PEP 8 for Python)
    • Add tests for new features
    • Update documentation as needed
  4. Run tests: pytest Project2v2/tests/
  5. Commit your changes: git commit -m 'Add amazing feature'
  6. Push to your fork: git push origin feature/amazing-feature
  7. Open a Pull Request with a clear description of changes

Contribution guidelines:

  • All PRs must pass CI checks (tests, coverage, linting)
  • Maintain or improve test coverage (currently 79%)
  • Update CHANGELOG.md with notable changes
  • Follow the existing code structure and patterns

Development setup:

# Clone your fork
git clone https://github.com/YOUR-USERNAME/Trust-Bench-SecureEval-Ops.git
cd Trust-Bench-SecureEval-Ops

# Create development environment
python -m venv .venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows

# Install development dependencies
pip install -r Project2v2/requirements-phase1.txt
pip install -r Project2v2/requirements-optional.txt
pip install pytest pytest-cov black flake8

# Run tests
pytest Project2v2/tests/ -v

# Format code
black Project2v2/

# Lint
flake8 Project2v2/

Maintenance Status

🟢 Actively Maintained (as of November 2025)

This project is part of an ongoing AI security engineering learning track. We aim to:

  • Address critical bugs within 48 hours
  • Respond to issues and PRs within 1 week
  • Keep dependencies current (quarterly security updates)
  • Add new features based on community feedback

Roadmap: See Future Work & Enhancements above for planned features.

Contact Information

Maintainer: Michael Williams (@mwill20)

Email: For security issues only (see SECURITY.md)

Preferred contact: GitHub Issues for all non-security topics


Credits & References

  • Ready Tensor AI Agent Course - Module 2 (Multi-Agent Evaluation)
  • LangGraph, CrewAI, AutoGen (collaboration inspiration)
  • Semgrep, OpenAI/Groq/Gemini APIs (referenced integrations)
  • Project2v2 implementation by @mwill20 and collaborators

Version Project2v2 - October 2025 - Refactored for offline deterministic evaluation.
