Watch the full Trust Bench SecureEval + Ops demo here:
👉 Trust Bench v3.0 Demo (1 min 45 s)
A short walkthrough showing Trust Bench SecureEval + Ops v3.0 in action — identical UI, enhanced with security guardrails, resilience, structured logs, and health probes.
Trust Bench (Project2v2) is a LangGraph-based multi-agent workflow that inspects software repositories for security leakage, code quality gaps, and documentation health. The system features intelligent agent routing with specialized personas, cross-agent collaboration, transparent reasoning, and reproducible outputs that graders can run entirely offline.
- 🧾 Structured Logging: JSON-formatted logs with run IDs via `Project2v2/app/logging.py` (see the sketch after this list)
- 🩺 Health Probes: `/healthz` and `/readyz` FastAPI routes in `Project2v2/app/health.py`
- 🔍 Observability Tests: `pytest Project2v2/tests/test_ops_layer.py` for log formatting & health endpoints
- 📊 Documentation Updates: OPERATIONS.md logging/health guidance, SECURITY.md redaction policy
- 📈 CI Evidence: Coverage + validator hooks ready for pipeline integration
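As a rough illustration of the structured-logging idea, the sketch below emits one JSON line per log record with a run ID attached. Field names are assumptions; the actual implementation in `Project2v2/app/logging.py` may differ.

```python
# Minimal sketch of run-scoped JSON logging (assumed field names; the real
# implementation lives in Project2v2/app/logging.py and may differ).
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON line with a run ID attached."""

    def __init__(self, run_id: str) -> None:
        super().__init__()
        self.run_id = run_id

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "run_id": self.run_id,
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


logger = logging.getLogger("trustbench")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(run_id=uuid.uuid4().hex))
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("evaluation started")  # -> {"run_id": "...", "level": "INFO", ...}
```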
- 🧪 Golden Fixtures: Canonical `report.json`, `report.md`, and `bundle.zip` locked in `tests/fixtures/`
- 🛡️ Parity Tests: `pytest Project2v2/tests/test_parity.py` verifies JSON structure, markdown digest, and bundle contents
- 📦 Artifact Freeze: `Project2v2/output/` mirrors fixtures to guarantee identical observable behavior for future phases
- 📓 Documentation Hooks: `OPERATIONS.md` and `SECURITY.md` populated with Phase 0 baselines to evolve alongside new features
- 🛡️ Security Agent: Specialized vulnerability assessment and risk analysis
- ⚡ Quality Agent: Code quality improvements and best practices guidance
- 📚 Documentation Agent: Documentation generation and improvement suggestions
- 🎯 Orchestrator Agent: General queries, project overview, and multi-agent coordination
- Smart Routing: LLM-powered question classification with confidence scoring
- 🛡️ Input Guardrails: Pydantic validation via `Project2v2/app/security/guardrails.py`
- 🔒 Sandbox Execution: Allowlisted subprocess wrapper (`safe_run`) preventing arbitrary shell usage (see the sketch after this list)
- ♻️ Resilience Decorators: Retry with exponential backoff + cross-platform timeout wrappers
- 🧪 Safety Tests: `pytest Project2v2/tests/test_secure_eval.py` covering validation, sandboxing, and resilience paths
- 📘 Documentation: SECURITY.md / OPERATIONS.md updated with Phase 1 safeguards and resilience defaults
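A minimal sketch of the allowlist-plus-retry pattern described above, using hypothetical names (`ALLOWED_BINARIES`, `with_retry`); the real `safe_run` and resilience decorators in Project2v2 may be implemented differently.

```python
# Illustrative sketch of an allowlisted subprocess wrapper with retry/backoff.
# Names (ALLOWED_BINARIES, with_retry) are assumptions, not the actual
# Project2v2 API.
import subprocess
import time

ALLOWED_BINARIES = {"git", "python"}


def with_retry(attempts: int = 3, base_delay: float = 1.0):
    """Retry a callable with exponential backoff on subprocess failures."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return func(*args, **kwargs)
                except subprocess.SubprocessError:
                    if attempt == attempts - 1:
                        raise
                    time.sleep(base_delay * (2 ** attempt))
        return wrapper
    return decorator


@with_retry(attempts=3)
def safe_run(cmd: list[str], timeout: int = 120) -> subprocess.CompletedProcess:
    """Run only allowlisted binaries, never through a shell."""
    if not cmd or cmd[0] not in ALLOWED_BINARIES:
        raise ValueError(f"Binary not allowlisted: {cmd[0] if cmd else '<empty>'}")
    return subprocess.run(cmd, capture_output=True, text=True,
                          timeout=timeout, check=True)
```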
- 🔄 Collaborative Analysis: Complex queries automatically trigger multiple agents
- 🎯 Multi-Agent Detection: System identifies when specialist consultation is needed
- 📋 Executive Synthesis: Comprehensive responses combining insights from all relevant agents
- 🤝 Cross-Domain Queries: Handle requests spanning security, quality, and documentation
- Intelligent Orchestration: Seamless coordination between specialist agents
- 🤝 Consensus Building: Agents collaborate to reach agreements on complex assessments
- ⚔️ Conflict Resolution: Systematic resolution of conflicting agent recommendations
- 🔄 Iterative Refinement: Multiple rounds of analysis for nuanced scenarios
- ⚖️ Priority Negotiation: Balance competing concerns (e.g., security vs maintainability)
- 🧠 Advanced Synthesis: Unified recommendations from complex multi-agent negotiations
- 📊 Comprehensive Analysis: Deep, multi-perspective evaluations with consensus metrics
- 🎛️ Interactive Weight Adjustment: Real-time sliders for Security, Quality, and Documentation agent importance
- ⚖️ Weighted Scoring System: Final evaluation scores calculated using custom agent weightings
- 📋 Preset Configurations: Quick-select buttons for Security Focus, Quality Focus, Documentation Focus, and Balanced approaches
- 📈 Live Score Preview: Real-time preview of how weight changes affect final evaluation scores
- 🔧 Flexible Integration: Works seamlessly through web interface, CLI, and API with backward compatibility
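To make the weighted scoring concrete, here is a hedged sketch of how a final score could be combined from per-agent scores and slider weights; the exact formula Trust Bench uses may differ, and the example scores are illustrative only.

```python
# Hedged sketch of weighted final scoring from per-agent scores (0-100) and
# slider weights; the exact formula used by Trust Bench may differ.
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Normalize weights so they sum to 1, then take the weighted average."""
    total = sum(weights.values()) or 1.0
    return sum(scores[agent] * (weights[agent] / total) for agent in scores)


# Example: the "Security Focus" preset (50/25/25) with illustrative scores
scores = {"security": 0, "quality": 55, "documentation": 70}
weights = {"security": 0.5, "quality": 0.25, "documentation": 0.25}
print(round(weighted_score(scores, weights)))  # -> 31
```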
- 📊 Confidence Calculations: Advanced algorithms assess agent confidence based on response completeness, specificity, and score consistency
- 🎯 Visual Confidence Meters: Color-coded progress bars (green/yellow/red) display confidence levels for each agent analysis
- 📋 Confidence Reporting: Confidence scores included in JSON/Markdown reports with detailed breakdowns and visual indicators
- 🔍 Smart Recommendations: System provides insights based on confidence levels to guide users toward more reliable agent outputs
- ⚡ Real-time Display: Live confidence updates in web interface alongside analysis results with expandable details
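The confidence calculation could look roughly like the sketch below, which averages completeness, specificity, and consistency signals; the inputs, thresholds, and weights are assumptions, not the actual algorithm.

```python
# Illustrative confidence heuristic combining completeness, specificity, and
# score consistency; the real algorithm may weight these signals differently.
def agent_confidence(response_length: int, findings: int, score_spread: float) -> float:
    """Return a 0-1 confidence value from simple response signals."""
    completeness = min(response_length / 500, 1.0)   # longer answers look more complete
    specificity = min(findings / 5, 1.0)             # concrete findings raise confidence
    consistency = max(0.0, 1.0 - score_spread / 50)  # large score spread lowers confidence
    return round((completeness + specificity + consistency) / 3, 2)


print(agent_confidence(response_length=420, findings=3, score_spread=10))  # -> 0.75
```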
- 🎯 Consensus Journey Visualization: Complete timeline of agent negotiations with progress markers and round-by-round analysis
- 💬 Live Negotiation Highlights: Speech bubbles showing actual agent conversations with mood indicators (green/yellow/red)
- ⚔️ Visual Conflict Resolution: Before/after comparison panels displaying initial disagreements vs final negotiated results
- 🔄 Interactive Process Steps: Expandable accordion cards for each negotiation round with detailed collaboration insights
- 📊 Agent Mood Mapping: Real-time mood badges showing agreement, negotiation, and conflict states during consensus building
- 🎭 Authentic Agent Data: All visualizations use genuine agent conversations and collaboration data, not simulated content
- 📦 Complete Analysis Bundles: ZIP downloads containing JSON reports, Markdown summaries, and chat transcripts in one package
- 💬 Chat Export/Import: Save and restore conversation histories with agent routing decisions and confidence scores
- 🔄 Session Continuity: Import previous conversations to continue analysis or share findings with team members
- 📊 Multiple Download Options: Individual JSON/Markdown reports (legacy) plus new enhanced bundles with chat data
- 🕐 Timestamped Archives: UTC timestamps and metadata preservation for audit trails and team collaboration
- 🛡️ Secure File Handling: Safe path validation and proper encoding for cross-platform compatibility
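A minimal sketch of how such a bundle could be assembled with the standard library; file names and layout are assumptions based on the bundle contents described above.

```python
# Sketch of bundling report artifacts into a timestamped ZIP; file names and
# layout are assumptions based on the bundle contents described in the README.
import json
import zipfile
from datetime import datetime, timezone
from pathlib import Path


def export_bundle(output_dir: Path, chat_history: list[dict]) -> Path:
    """Zip report.json, report.md, and the chat transcript with a UTC timestamp."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    bundle_path = output_dir / f"analysis_bundle_{stamp}.zip"
    with zipfile.ZipFile(bundle_path, "w", zipfile.ZIP_DEFLATED) as bundle:
        bundle.write(output_dir / "report.json", arcname="report.json")
        bundle.write(output_dir / "report.md", arcname="report.md")
        bundle.writestr("chat_transcript.json", json.dumps(chat_history, indent=2))
    return bundle_path
```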
- Overview
- Tool Integrations
- Installation & Setup
- Running the System
- Evaluation Metrics Instrumentation
- Reporting Outputs
- Demo Video
- Example Results (Project2v2 self-audit)
- MCP Server (Scope Decision)
- File Structure (trimmed)
- Security & Hardening Notes
- Documentation Roadmap & Evidence
- Credits & References
- Agents: Manager (plan/finalize), SecurityAgent, QualityAgent, DocumentationAgent
- Core Tools: regex secret scanner, repository structure analyzer, documentation reviewer
- Collaboration: agents exchange messages and adjust scores based on peer findings (security alerts penalize quality/documentation; quality metrics influence documentation, etc.)
- Deliverables: JSON and Markdown reports containing composite scores, agent summaries, conversation logs, and instrumentation metrics
```
[Manager Plan]
      |
[SecurityAgent] --> alerts --> [QualityAgent] --> metrics --> [DocumentationAgent]
 \____________________________ shared context _____________________________/
      |
[Manager Finalize] --> report.json / report.md
```
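The cross-agent adjustment described above (security alerts penalizing quality and documentation scores) could be expressed roughly as follows; penalty sizes are illustrative, not the values Project2v2 uses.

```python
# Illustrative cross-agent score adjustment: security alerts penalize the
# quality and documentation scores. Penalty sizes are assumptions, not the
# values used by Project2v2.
def apply_collaboration_penalties(scores: dict[str, float],
                                  security_alerts: int) -> dict[str, float]:
    """Subtract a capped penalty from peer scores for each security alert."""
    penalty = min(security_alerts * 5, 25)  # cap the total penalty
    adjusted = dict(scores)
    for agent in ("quality", "documentation"):
        adjusted[agent] = max(0.0, adjusted[agent] - penalty)
    return adjusted


print(apply_collaboration_penalties(
    {"security": 0, "quality": 60, "documentation": 75}, security_alerts=3))
# -> {'security': 0, 'quality': 45, 'documentation': 60}
```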
| Tool | Consumed By | Capability Extension |
|---|---|---|
| `run_secret_scan` | SecurityAgent | Detects high-signal credentials (AWS, GitHub, RSA keys) |
| `analyze_repository_structure` | QualityAgent | Counts files, languages, estimated test coverage |
| `evaluate_documentation` | DocumentationAgent | Scores README variants by coverage and cross-agent context |
| `serialize_tool_result` | All agents | Normalizes tool dataclasses for message passing |
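For intuition, `serialize_tool_result` might normalize a tool's dataclass result along the lines of the sketch below; the result type shown is hypothetical.

```python
# Hedged sketch of what serialize_tool_result might do: flatten a tool's
# dataclass result into a plain dict for message passing between agents.
from dataclasses import asdict, dataclass, is_dataclass
from typing import Any


@dataclass
class SecretScanResult:  # hypothetical tool result shape
    findings: int
    risk_level: str


def serialize_tool_result(result: Any) -> dict:
    """Normalize dataclass tool results to JSON-friendly dicts."""
    if is_dataclass(result) and not isinstance(result, type):
        return asdict(result)
    raise TypeError(f"Unsupported tool result type: {type(result).__name__}")


print(serialize_tool_result(SecretScanResult(findings=2, risk_level="high")))
# -> {'findings': 2, 'risk_level': 'high'}
```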
MCP endpoints are intentionally not shipped in Project2v2. See MCP Server (Scope Decision).
Before installing Trust Bench SecureEval + Ops, ensure you have the following:
- Python: 3.10 or later (3.11+ recommended for best performance)
- Operating System:
- Windows 10/11 (tested on Windows 11)
- Linux (Ubuntu 20.04+, Debian 11+, RHEL 8+)
- macOS 11+ (Big Sur or later)
- Hardware:
- RAM: Minimum 4 GB, 8 GB recommended for large repositories
- Storage: 2 GB free disk space for dependencies and analysis artifacts
- CPU: No GPU required; runs on standard CPU architectures
- Git: Version 2.30+ for repository cloning
- Python Package Manager: pip 21+ (included with Python 3.10+)
- Outbound Internet Access:
- GitHub.com (for cloning public repositories)
- LLM Provider APIs: OpenAI, Groq, or Gemini (for interactive chat features)
- Permissions: Read access to target repositories you want to analyze
For the interactive web chat and agent consultation features, you need at least one LLM provider API key:
- OpenAI API Key (Get one here)
- Recommended for best performance
- Requires active billing/credits
- Groq API Key (Alternative, free tier available)
- Gemini API Key (Google's alternative)
Note: The core evaluation features (security scan, quality analysis, documentation review) work without any API keys. API keys are only required for the chat/consultation interface.
- Basic Command Line: Familiarity with terminal/PowerShell commands
- Git Basics: Understanding of repository cloning and navigation
- Python Basics: Ability to activate virtual environments and run Python scripts
Follow these step-by-step instructions to get Trust Bench running on your system.
```bash
git clone https://github.com/mwill20/Trust-Bench-SecureEval-Ops.git
cd Trust-Bench-SecureEval-Ops
```

Expected output: Repository cloned successfully, directory contains `Project2v2/` folder.
This isolates Trust Bench's dependencies from your system Python.
Windows (PowerShell):
```powershell
python -m venv .venv
.\.venv\Scripts\activate
```

Linux / macOS:

```bash
python3 -m venv .venv
source .venv/bin/activate
```

Expected output: Your terminal prompt should now show a `(.venv)` prefix.
```bash
# Upgrade pip first (recommended)
pip install --upgrade pip

# Install core dependencies
pip install -r Project2v2/requirements-phase1.txt

# Optional: Install advanced features (evaluation metrics, static analysis)
pip install -r Project2v2/requirements-optional.txt
```

Expected output:

- Core install: ~50-100 packages installed (LangGraph, Flask, Pydantic, etc.)
- Optional install: Additional packages for enhanced analysis capabilities

Verification:

```bash
python -c "import langgraph, flask, pydantic; print('Dependencies OK')"
```

Should print: `Dependencies OK`
Copy the example environment file and add your API keys:
Windows:
```powershell
copy Project2v2\.env.example Project2v2\.env
```

Linux / macOS:

```bash
cp Project2v2/.env.example Project2v2/.env
```

Edit `Project2v2/.env` with your preferred text editor:

```bash
# Minimum configuration for chat features
OPENAI_API_KEY=sk-your-actual-key-here
LLM_PROVIDER=openai
ENABLE_SECURITY_FILTERS=true
```

Important: If you skip this step, the core analysis features still work! API keys are only required for the interactive chat interface.
Run a quick self-test:
```bash
cd Project2v2
python main.py --repo ../Project2v2 --output test_output
```

Expected output:

```
=== Multi-Agent Evaluation Complete ===
Repository: /path/to/Trust-Bench-SecureEval-Ops/Project2v2
Overall Score: 32/100
Grade: needs_attention
System Latency: 0.08 seconds
Faithfulness: 0.62
Refusal Accuracy: 1.0
Per-Agent Timings:
- SecurityAgent: 0.07 seconds
- QualityAgent: 0.003 seconds
- DocumentationAgent: 0.002 seconds
Report (JSON): test_output/report.json
Report (Markdown): test_output/report.md
```
✅ You're ready to go! See Running the System below for usage options.
Issue: `python: command not found` or `python3: command not found`
- Solution: Install Python 3.10+ from python.org or your package manager
- Verify: `python --version` or `python3 --version`

Issue: `pip install` fails with permission errors
- Solution: Ensure you've activated the virtual environment (Step 2)
- Or use: `pip install --user -r requirements-phase1.txt`

Issue: `ModuleNotFoundError` after installation
- Solution: Ensure you're in the activated virtual environment
- Reinstall dependencies: `pip install --force-reinstall -r Project2v2/requirements-phase1.txt`

Issue: Windows "execution policy" blocks scripts
- Solution: Run PowerShell as Administrator and execute: `Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser`
These variables can be set in `Project2v2/.env` or as system environment variables:

| Variable | Required | Default | Description |
|---|---|---|---|
| `OPENAI_API_KEY` | No* | None | OpenAI API key for chat features |
| `GROQ_API_KEY` | No* | None | Alternative: Groq API key |
| `GEMINI_API_KEY` | No* | None | Alternative: Google Gemini API key |
| `LLM_PROVIDER` | No | `openai` | Which LLM to use: `openai`, `groq`, or `gemini` |
| `ENABLE_SECURITY_FILTERS` | No | `true` | Enable input validation and prompt sanitization |
| `TB_MAX_FILES` | No | `2000` | Maximum files to scan per repository |
| `TB_MAX_FILE_SIZE_MB` | No | `2` | Skip files larger than this (MB) |
| `TB_CLONE_TIMEOUT` | No | `120` | Repository clone timeout (seconds) |
| `LOG_LEVEL` | No | `INFO` | Logging verbosity: DEBUG, INFO, WARNING, ERROR |

\* At least one LLM provider key is required for chat features; core analysis works without any keys.
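A small sketch of reading these variables with their documented defaults; the actual configuration loader in Project2v2 may differ.

```python
# Sketch of reading the documented variables with their defaults; the actual
# configuration loader in Project2v2 may differ.
import os

TB_MAX_FILES = int(os.getenv("TB_MAX_FILES", "2000"))
TB_MAX_FILE_SIZE_MB = int(os.getenv("TB_MAX_FILE_SIZE_MB", "2"))
TB_CLONE_TIMEOUT = int(os.getenv("TB_CLONE_TIMEOUT", "120"))
ENABLE_SECURITY_FILTERS = os.getenv("ENABLE_SECURITY_FILTERS", "true").lower() == "true"
LLM_PROVIDER = os.getenv("LLM_PROVIDER", "openai")

print(f"Scanning up to {TB_MAX_FILES} files, "
      f"skipping files over {TB_MAX_FILE_SIZE_MB} MB")
```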
```bash
cd Project2v2
python web_interface.py
# browse to http://localhost:5001
```

The web interface now features intelligent agent routing that automatically directs your questions to the most appropriate specialist agent. Ask security questions, request code quality improvements, or seek documentation help - the system will route to the right expert and provide contextual responses with visual agent indicators.

```bash
cd Project2v2
python main.py --repo .. --output output
```

```bash
python -m trustbench_core.eval.evaluate_agent --repo <path> --output Project2v2/output
```

This forwards to `Project2v2/main.py`; the new entrypoint remains the single source of truth.

```powershell
cd Project2v2
.\run_audit.ps1 .. my_output   # PowerShell
run_audit.bat ..               # Windows CMD
launch.bat                     # Interactive menu (web UI, CLI, presets)
```

Trust Bench SecureEval + Ops features a production-grade Flask web application that provides an intuitive interface for repository security auditing and AI-powered analysis consultation.
Technology Stack:
- Backend: Flask 3.0+ (Python web framework)
- Frontend: Custom HTML5, CSS3, JavaScript (ES6+)
- Agent Orchestration: LangGraph for multi-agent workflows
- Real-time Updates: AJAX for asynchronous status polling
- Security: Pydantic input validation, CSRF protection, API key sanitization
The sidebar contains all user inputs and configuration options:
- **How to Use Section**
  - Quick summary of the evaluation process
  - Links to documentation and help resources
- **Repository Input**
  - GitHub URL input field with validation
  - Example: `https://github.com/owner/repository`
  - Supports public repositories (private repos require authentication)
- **LLM Provider Selection (Optional)**
  - Dropdown menu: OpenAI, Groq, or Gemini
  - API key input with masked display (not stored or logged)
  - "Test Connection" button for immediate validation
  - Privacy notice: Keys are session-only, never persisted
- **Evaluation Metrics Panel**
  - Interactive sliders for custom agent weighting:
    - 🛡️ Security Agent (vulnerability scanning)
    - 🏗️ Quality Agent (code structure & testing)
    - 📚 Documentation Agent (README & guides)
  - Preset buttons for quick configuration:
    - Security Focus (50% security, 25% quality, 25% docs)
    - Quality Focus (25% security, 50% quality, 25% docs)
    - Documentation Focus (25% security, 25% quality, 50% docs)
    - Balanced (equal weights)
  - Live score preview showing weighted vs. equal scoring
The main panel displays real-time analysis progress and results:
- **Progress Workflow Visualization**
  - Step 1: Input - Repository URL validated
  - Step 2: Orchestrator - Manager node coordinates agents
  - Step 3: Agent Tiles (expandable):
    - Security Agent 🛡️ - Scanning for secrets & vulnerabilities
    - Quality Agent 🏗️ - Analyzing code & test coverage
    - Documentation Agent 📚 - Reviewing docs & READMEs
  - Step 4: Results - Scores compiled, report generated
  - Color-coded states:
    - Gray: Idle/waiting
    - Blue: Active/running
    - Green: Completed successfully
    - Red: Error state
- **Agent Detail Expandos**
  - Each agent tile has a "Show details" button
  - Reveals capabilities, collaboration context, and real-time status
  - Example: Security Agent shows number of findings, risk level
- **Phase 3: Consensus Journey Visualization**
  - Progress timeline with negotiation round markers
  - Agent mood indicators:
    - 🟢 Green: Agreement
    - 🟡 Yellow: Negotiating
    - 🔴 Red: Conflict
  - Live negotiation highlights: Speech bubbles showing agent conversations
  - Conflict resolution panels: Before/after comparison of disagreements
- **Results Section**
  - Overall Score: Numeric score (0-100) with grade (excellent/good/fair/needs_attention)
  - Per-dimension breakdown:
    - Security score with confidence meter
    - Quality score with confidence meter
    - Documentation score with confidence meter
  - Confidence visualization: Color-coded progress bars
    - High (≥80%): Green
    - Medium (60-79%): Yellow
    - Low (<60%): Red
  - Download options:
    - Individual JSON/Markdown reports
    - Complete analysis bundle (ZIP with chat transcripts)
  - Export/Import chat: Save conversation history for later review
- **Agent Chat Interface** (when LLM key provided)
  - Chat history with message threading
  - Agent-specific responses with avatars:
    - 🛡️ Security Agent (red accent)
    - 🏗️ Quality Agent (blue accent)
    - 📚 Documentation Agent (green accent)
    - 🎯 Orchestrator (purple accent)
  - Routing transparency: Each response shows why that agent was selected
  - Confidence badges: Visual indicator of agent certainty
  - Context awareness: Agents reference latest report data
- Real-time progress indicators prevent "black box" feeling
- Color-coded states (idle/running/completed) provide at-a-glance status
- Agent tiles update dynamically as orchestrator dispatches tasks
- Complex details hidden by default (expandable tiles)
- High-level summary visible without scrolling
- Advanced features (custom weights, chat) optional but accessible
- API key handling: Keys accepted through UI, used in-memory only
- Never logged: Security filters redact sensitive data from logs
- Session-only storage: Keys cleared when browser closes
- Privacy notice: Explicit user communication about data handling
- Graceful degradation: Core features work without API keys
- Inline error messages with actionable solutions
- Timeout handling: Long-running tasks show progress, can be cancelled
- Semantic HTML with ARIA labels
- Keyboard navigation support
- High-contrast color schemes for readability
- Responsive design (works on tablets and desktops)
- Lazy loading of agent details (only when expanded)
- Debounced API calls prevent server overload
- Cached report data reduces redundant analysis
Try these workflows:
- Quick Audit: Enter a GitHub URL, click "Analyze Repository", watch agents collaborate
- Custom Weighting: Adjust sliders to emphasize security, see score preview update in real-time
- Agent Consultation: After analysis, ask "What are the top security risks?" - Security Agent responds with context
- Consensus Exploration: Expand orchestration timeline to see how agents negotiated conflicting findings
- Export & Share: Download complete bundle, share with team, import conversation on another machine
Every run records deterministic metrics alongside agent results:
- System latency - overall wall-clock time plus per-agent/per-tool timings (`metrics.system_latency_seconds`, `metrics.per_agent_latency`)
- Faithfulness - heuristic alignment of summaries with tool evidence (`metrics.faithfulness`)
- Refusal accuracy - simulated unsafe prompt harness (returns 1.0 while LLM calls are disabled) (`metrics.refusal_accuracy`)

Metrics appear in both `report.json` (under `metrics`) and `report.md` (rendered table). Example CLI output:

```
System Latency: 0.08 seconds
Faithfulness: 0.62
Refusal Accuracy: 1.0
Per-Agent Timings:
- SecurityAgent: 0.07 seconds
- QualityAgent: 0.003 seconds
- DocumentationAgent: 0.002 seconds
```
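Per-agent latency could be captured with `time.perf_counter()` roughly as sketched below; the field names mirror the metrics listed above, but the real instrumentation may differ.

```python
# Illustrative per-agent latency instrumentation; field names mirror the
# documented metrics, but the real code may differ.
import time

metrics = {"per_agent_latency": {}}

run_start = time.perf_counter()
for agent_name, run_agent in [("SecurityAgent", lambda: None),
                              ("QualityAgent", lambda: None),
                              ("DocumentationAgent", lambda: None)]:
    start = time.perf_counter()
    run_agent()  # placeholder for the real agent invocation
    metrics["per_agent_latency"][agent_name] = round(time.perf_counter() - start, 3)

metrics["system_latency_seconds"] = round(time.perf_counter() - run_start, 3)
print(metrics)
```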
Each audit (web or CLI) produces:
- `report.json` - timestamp, repo path, composite summary, per-agent results, metrics, full conversation log
- `report.md` - human-readable summary with agent cards, instrumentation metrics, conversation log
- Optional timestamped archives (`github_analysis_*`) when launched through the web interface
Trust Bench SecureEval + Ops supports multiple deployment scenarios, from local development to production cloud environments.
This is the default setup described in Installation & Setup.
Use when:
- Exploring the tool on your laptop
- Development and testing
- Quick ad-hoc repository audits
Start the web interface:
```bash
cd Project2v2
python web_interface.py
```

Access: http://localhost:5001
Quick start with Docker Compose:
```bash
# 1. Clone the repository
git clone https://github.com/yourusername/Trust-Bench-SecureEval-Ops.git
cd Trust-Bench-SecureEval-Ops

# 2. Create .env file with your API key
echo "OPENAI_API_KEY=sk-your-key-here" > .env

# 3. Start the container
docker-compose up -d

# 4. Access the web interface
# Open http://localhost:5001 in your browser

# View logs
docker-compose logs -f

# Stop the container
docker-compose down
```

Manual Docker build and run:
```bash
# Build the image
docker build -t trustbench:latest .

# Run with environment variables
docker run -d \
  --name trustbench \
  -p 5001:5001 \
  -e OPENAI_API_KEY=sk-your-key-here \
  -e ENABLE_SECURITY_FILTERS=true \
  -e TB_RUN_MODE=strict \
  -v trustbench-data:/data \
  -v trustbench-logs:/logs \
  trustbench:latest

# View logs
docker logs -f trustbench

# Stop and remove
docker stop trustbench
docker rm trustbench
```

Docker features:

- Multi-stage build: Smaller final image (~300 MB vs ~1 GB)
- Non-root user: Runs as `trustbench` user (UID 1000) for security
- Health checks: Automatic restart on failures
- Persistent storage: Volumes for analysis results and logs
- Resource limits: Configurable CPU/memory constraints
- Read-only filesystem: Enhanced security with tmpfs for temp files
Environment variables (see .env.example for full list):
```bash
# Required: At least one API key
OPENAI_API_KEY=sk-...
# or GROQ_API_KEY=gsk-...
# or GEMINI_API_KEY=...

# Optional: Configuration overrides
LLM_PROVIDER=openai
TB_MAX_FILES=2000
TB_MAX_FILE_SIZE_MB=2
AGENT_TIMEOUT_SECONDS=120
LOG_LEVEL=INFO
```

Volumes explained:

- `/data`: Persistent storage for analysis results and reports
- `/logs`: Application logs (if file logging enabled)
Production deployment with Docker:
```bash
# Use docker-compose for persistent setup
docker-compose up -d

# Update to latest version
git pull
docker-compose build
docker-compose up -d

# View resource usage
docker stats trustbench-secureeval-ops
```

For long-running or team deployments, run Trust Bench on a dedicated server or cloud VM.
- VM with Python 3.10+ (Ubuntu 22.04 LTS, Amazon Linux 2023, etc.)
- 2-4 vCPU, 8 GB RAM recommended
- Inbound port 5001 open (or your chosen port)
Create /etc/systemd/system/trust-bench.service:
```ini
[Unit]
Description=Trust Bench SecureEval + Ops Web Interface
After=network.target

[Service]
Type=simple
User=trustbench
WorkingDirectory=/opt/trust-bench/Project2v2
Environment="PATH=/opt/trust-bench/.venv/bin"
Environment="OPENAI_API_KEY=sk-your-key-here"
Environment="ENABLE_SECURITY_FILTERS=true"
ExecStart=/opt/trust-bench/.venv/bin/python web_interface.py
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
```

Enable and start:

```bash
sudo systemctl daemon-reload
sudo systemctl enable trust-bench
sudo systemctl start trust-bench
sudo systemctl status trust-bench
```

Install supervisor: `pip install supervisor`
Add to supervisord.conf:
```ini
[program:trust-bench]
command=/opt/trust-bench/.venv/bin/python web_interface.py
directory=/opt/trust-bench/Project2v2
user=trustbench
autostart=true
autorestart=true
environment=OPENAI_API_KEY="sk-your-key",ENABLE_SECURITY_FILTERS="true"
stdout_logfile=/var/log/trust-bench/stdout.log
stderr_logfile=/var/log/trust-bench/stderr.log
```

Start:

```bash
supervisorctl reread
supervisorctl update
supervisorctl start trust-bench
```

For HTTPS and domain mapping, place nginx in front of Trust Bench:
```nginx
server {
    listen 443 ssl http2;
    server_name trustbench.yourdomain.com;

    ssl_certificate /path/to/cert.pem;
    ssl_certificate_key /path/to/key.pem;

    location / {
        proxy_pass http://127.0.0.1:5001;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # WebSocket support (for future real-time features)
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```

AWS EC2:
- Launch t3.medium instance (Ubuntu 22.04)
- Configure security group: Allow inbound TCP 5001 (or 443 if using nginx)
- Follow systemd service setup above
- Optional: Use Elastic IP for consistent access
Azure VM:
- Create B2s VM (Ubuntu 22.04)
- Open NSG port 5001 or 443
- Follow systemd service setup above
Google Cloud:
- Create e2-medium instance (Ubuntu 22.04)
- Configure firewall rule for port 5001/443
- Follow systemd service setup above
DigitalOcean Droplet:
- Basic Droplet ($12/month, 2 GB RAM)
- Follow systemd service setup above
- Optional: Enable DigitalOcean firewall
- Use HTTPS: Always deploy behind a reverse proxy with TLS certificates
- Firewall: Restrict access to trusted IP ranges if possible
- API Keys: Use environment variables, never commit to git
- Rate Limiting: Consider adding nginx rate limiting for public deployments
- Monitoring: Set up log aggregation (see OPERATIONS.md)
- Updates: Regularly `git pull` and restart the service for security patches
Current architecture supports:
- 5-10 concurrent users (single Flask worker)
- 10-20 repository analyses per hour
- Single-server deployment
For higher loads, consider:
- Deploy behind Gunicorn: `gunicorn -w 4 -b 0.0.0.0:5001 web_interface:app`
- Use Redis for session management
- Add load balancer for multiple instances
- See OPERATIONS.md for production hardening guidance
Common issues and their solutions:
Symptoms: Terminal cannot find python or python3 command
Solutions:
- Install Python 3.10+ from python.org
- On Linux: `sudo apt install python3.10 python3.10-venv` (Ubuntu/Debian)
- On macOS: `brew install python@3.10`
- Verify: `python --version` or `python3 --version` should show 3.10 or higher
Symptoms: `ERROR: Could not install packages due to an EnvironmentError: [Errno 13] Permission denied`

Solutions:
- Activate virtual environment first: `.venv\Scripts\activate` (Windows) or `source .venv/bin/activate` (Linux/macOS)
- Or use user install: `pip install --user -r requirements-phase1.txt`
- Never use `sudo pip` (breaks system Python)
Symptoms: `ModuleNotFoundError: No module named 'langgraph'`

Solutions:
- Ensure virtual environment is activated (check for `(.venv)` prefix in terminal)
- Reinstall in correct environment: `pip install --force-reinstall -r Project2v2/requirements-phase1.txt`
- Verify environment: `which python` (Linux/macOS) or `where python` (Windows) should point to `.venv`
Symptoms: `OSError: [Errno 48] Address already in use`

Solutions:
- Another process is using port 5001
- Find and kill it:
  - Linux/macOS: `lsof -ti:5001 | xargs kill -9`
  - Windows: `netstat -ano | findstr :5001`, then `taskkill /PID <PID> /F`
- Or change port: Set `WEB_PORT=5002` in `.env`
Symptoms: Chat returns "Sorry, I'm having trouble accessing the AI service"
Solutions:
- Verify API key is set: `echo $OPENAI_API_KEY` (Linux/macOS) or `echo %OPENAI_API_KEY%` (Windows)
- Check key has active billing/credits at OpenAI Platform
- Try alternative provider: Set `LLM_PROVIDER=groq` and `GROQ_API_KEY=...` in `.env`
- Workaround: Core analysis features work without API keys (only chat is affected)
Symptoms: Progress bars stuck, no error message
Solutions:
- Check timeout settings in `.env`: `AGENT_TIMEOUT_SECONDS=120` (increase if analyzing large repos)
- View detailed logs: `tail -f Project2v2/logs/app.log` (if logging configured)
- For large repos, increase limits: `TB_MAX_FILES=5000 TB_MAX_FILE_SIZE_MB=5 TB_CLONE_TIMEOUT=300`
- Check network connectivity to GitHub (required for repository cloning)
Symptoms: "Repository not found" or "Cloning failed" error
Solutions:
- Verify URL format: Must be `https://github.com/owner/repo` (no trailing slash)
- Public repositories only: Private repos require authentication (not yet supported in UI)
- Large repositories: Repos >1 GB may time out; increase `TB_CLONE_TIMEOUT=300`
- Network issues: Check firewall/proxy settings allow access to github.com
Symptoms: Browser shows empty page or 404 error
Solutions:
- Verify app is running: Check terminal for "Running on http://127.0.0.1:5001"
- Use correct URL: `http://localhost:5001` (not `127.0.0.1:5001` on some systems)
- Check browser console for JavaScript errors (F12 Developer Tools)
Symptoms: `.venv\Scripts\activate : File cannot be loaded because running scripts is disabled on this system.`

Solutions:

```powershell
# Run PowerShell as Administrator
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

# Then retry activation
.\.venv\Scripts\activate
```

Causes & Solutions:
- Large repository: Normal for repos with 10,000+ files
  - Solution: Increase timeouts or analyze smaller repos first
- Slow internet: Clone operation is network-bound
  - Solution: Pre-clone repo locally, use CLI: `python main.py --repo /path/to/local/repo`
- Low RAM: System swapping to disk
  - Solution: Close other applications, add swap space, or upgrade RAM
Symptoms: `MemoryError` or killed by OS

Solutions:
- Reduce file scan limits: `TB_MAX_FILES=1000 TB_MAX_FILE_SIZE_MB=1`
- Analyze smaller repositories first
- Increase system RAM or swap space
- Use CLI for more efficient processing: `python main.py --repo <path>`
If you've tried the above solutions and still have issues:
- Check logs: `Project2v2/logs/app.log` (if logging is configured)
- GitHub Issues: Open an issue with:
  - Python version: `python --version`
  - OS details: `uname -a` (Linux/macOS) or `systeminfo` (Windows)
  - Error message (full traceback)
  - Steps to reproduce
- Documentation: See OPERATIONS.md and SECURITY.md
- Community: Check existing issues for similar problems
If playback doesn't work on GitHub, download the file locally from the same link above.
The full-resolution video is hosted via OneDrive to keep the repository history lean. If you want an offline copy, download it from the link above and place it under `Project2v2/assets/images/`.
- Overall Score: ~32/100 (`needs_attention`)
- Security: seeded secrets detected (score 0) drive collaboration penalties
- Quality: medium score, automatically penalized by SecurityAgent findings
- Documentation: strong base score but reduced for missing security/testing guidance
- Collaboration: more than five cross-agent messages; Manager summarizes adjustments in the final log
Project2v2 prioritizes deterministic, offline-capable tooling. To keep grading reproducible and avoid external runtime dependencies, the earlier MCP server has been intentionally deprecated for this version. Required tool integrations (three or more) are provided as direct Python callables. MCP can be revisited later if cross-client interoperability (Claude Desktop, Cursor, etc.) becomes necessary, but it is not required for Module 2 compliance.
```
Trust_Bench/
|-- Project2v2/
|   |-- main.py
|   |-- web_interface.py
|   |-- multi_agent_system/
|   |   |-- agents.py
|   |   |-- orchestrator.py
|   |   |-- tools.py
|   |   `-- reporting.py
|   |-- requirements-phase1.txt
|   |-- requirements-optional.txt
|   |-- run_audit.(bat|ps1)
|   |-- launch.bat
|   `-- output/
|-- trustbench_core/ (legacy CLI wrapper forwarding to Project2v2/main.py)
`-- Project2v2/checklist.yaml (optional-features register)
```
- All detected secrets are synthetic and included solely for demonstration purposes. No real credentials are exposed.
- `security_utils.py` and the web UI sanitize repository URLs, prompts, and API keys.
- Optional extras (`ragas`, `semgrep`, `streamlit`) enable deeper analytics and dashboarding when desired.
The following features are planned for future releases:
- Batch Analysis: Automated analysis of multiple repositories simultaneously with queue management and comparative dashboards
- Note: This feature requires significant infrastructure (job queuing, background workers, database) and is planned for a later major release
- Additional specialized agents (Performance, Accessibility, Compliance)
- Integration with CI/CD pipelines
- Real-time repository monitoring
- Advanced analytics and trending
To satisfy Project 3 publication requirements, the repository ships with dedicated documentation artifacts. Keep these synchronized with feature development:
| Document | Purpose | Status / Next Action |
|---|---|---|
| `README.md` | Executive summary, architecture overview, quickstart, CI/coverage references | ✅ Actively maintained |
| `OPERATIONS.md` | Operational excellence rubric (start/stop/health, logging, resilience, monitoring) | 📝 Skeleton ready – fill details as services evolve |
| `SECURITY.md` | Security & safety rubric (input validation, guardrails, sandbox, SOC 2-lite map) | 📝 Skeleton ready – update per guardrail implementation |
| `USER_GUIDE.md` | End-user walkthrough + context | ✅ Append "Production Enhancements" notes when Phase 0 work lands |
| `docs/evidence/` (optional) | Test/coverage screenshots, CI logs for publication attachments | 🔄 Create as artifacts are generated |
Working agreement:
- Phase 0 ("Parity Lock-In") adds test coverage notes under README → Testing & CI Summary, and populates the open TODOs in `OPERATIONS.md` / `SECURITY.md`.
- Each subsequent phase should update the relevant section(s) before closing the task to avoid drift.
Phase 3 and 4 harden the system for publication while preserving identical UX:
- CI gate (`python-ci` workflow) runs tests, coverage, security audit, and rubric validator on every PR.
- Structured JSON logs, health probes, and resilience settings documented in OPERATIONS.md.
- Security controls (input validation, sandbox, redaction, SOC 2-lite mapping) captured in SECURITY.md.
- Evidence bundle placeholders live under Project2v2/docs/evidence/ for publishing artifacts (CI proof, screenshots, coverage report).
Trust Bench SecureEval + Ops is released under the MIT License.
What this means for you:
- ✅ Commercial use: Use in commercial products and services
- ✅ Modification: Fork, modify, and adapt the code
- ✅ Distribution: Share modified or unmodified versions
- ✅ Private use: Use privately without publishing your changes
- ⚠️ Liability: No warranty provided; use at your own risk
Full license text: See LICENSE in the repository root.
Attribution: When using Trust Bench in publications or products, please include:
Trust Bench SecureEval + Ops by Michael Williams
https://github.com/mwill20/Trust-Bench-SecureEval-Ops
Licensed under MIT
Bug Reports & Feature Requests:
- GitHub Issues: Open an issue
- Labels:
  - `bug`: Something isn't working correctly
  - `enhancement`: New feature or improvement request
  - `documentation`: Documentation improvements
  - `question`: General questions about usage
Security Concerns:
- Private reporting: For security vulnerabilities, please open a private security advisory
- Contact: See SECURITY.md for responsible disclosure process
- Do NOT open public issues for security vulnerabilities
Community & Discussion:
- GitHub Discussions: Q&A, ideas, and general discussion
- Pull Requests: Contributions welcome! See Contributing below
We welcome contributions to Trust Bench SecureEval + Ops!
How to contribute:
- Fork the repository on GitHub
- Create a feature branch: `git checkout -b feature/amazing-feature`
- Make your changes:
  - Follow existing code style (PEP 8 for Python)
  - Add tests for new features
  - Update documentation as needed
- Run tests: `pytest Project2v2/tests/`
- Commit your changes: `git commit -m 'Add amazing feature'`
- Push to your fork: `git push origin feature/amazing-feature`
- Open a Pull Request with a clear description of changes
Contribution guidelines:
- All PRs must pass CI checks (tests, coverage, linting)
- Maintain or improve test coverage (currently 79%)
- Update `CHANGELOG.md` with notable changes
- Follow the existing code structure and patterns
Development setup:
```bash
# Clone your fork
git clone https://github.com/YOUR-USERNAME/Trust-Bench-SecureEval-Ops.git
cd Trust-Bench-SecureEval-Ops

# Create development environment
python -m venv .venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows

# Install development dependencies
pip install -r Project2v2/requirements-phase1.txt
pip install -r Project2v2/requirements-optional.txt
pip install pytest pytest-cov black flake8

# Run tests
pytest Project2v2/tests/ -v

# Format code
black Project2v2/

# Lint
flake8 Project2v2/
```

🟢 Actively Maintained (as of November 2025)
This project is part of an ongoing AI security engineering learning track. We aim to:
- Address critical bugs within 48 hours
- Respond to issues and PRs within 1 week
- Keep dependencies current (quarterly security updates)
- Add new features based on community feedback
Roadmap: See Future Work & Enhancements above for planned features.
Maintainer: Michael Williams (@mwill20)
Email: For security issues only (see SECURITY.md)
Preferred contact: GitHub Issues for all non-security topics
- Ready Tensor AI Agent Course - Module 2 (Multi-Agent Evaluation)
- LangGraph, CrewAI, AutoGen (collaboration inspiration)
- Semgrep, OpenAI/Groq/Gemini APIs (referenced integrations)
- Project2v2 implementation by @mwill20 and collaborators
Version Project2v2 - October 2025 - Refactored for offline deterministic evaluation.

