A production-ready RAG-based system for automatically answering TPRM (Third-Party Risk Management) assessment questions using AWS Bedrock and vector retrieval.
- Multi-Model Comparison: Test multiple LLMs (Claude 3 Opus, Sonnet, Haiku) and compare results
- Consistency Analysis: Run multiple iterations to measure answer reliability
- RAG Pipeline: Semantic chunking, Titan Embeddings, FAISS vector store
- Confidence Scoring: Structured scoring with rubric-based assessment
- Citation Tracking: Every answer includes document citations (see the sample record after this list)
- Production Ready: Type hints, comprehensive logging, error handling
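For orientation, each assessed question produces one record per run. The sketch below is a hypothetical record illustrating how confidence scores and citations appear together; the field names are illustrative only, not the exact schema written to results/.

```python
# Hypothetical answer record (illustrative field names, not the exact output schema)
sample_answer = {
    "question_id": 22,
    "question": "Is customer data encrypted at rest?",
    "answer": "Yes. Data at rest is encrypted using AES-256 ...",
    "status": "complete",          # complete | partial | not_found
    "confidence": 0.87,            # rubric-based score in [0, 1]
    "citations": [
        {"document": "SOC2_Report_2024.pdf", "chunk_id": 41},
        {"document": "InfoSec_Policy.pdf", "chunk_id": 7},
    ],
}
```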
This codebase follows security best practices:
- ✅ No credentials in code - Uses AWS credential chain
- ✅ Safe serialization - JSON instead of pickle for data persistence
- ✅ Input validation - Path traversal prevention, input sanitization
- ✅ YAML safe loading - Prevents arbitrary code execution
- ✅ Whitelisted file types - Only processes allowed extensions
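The YAML and serialization points boil down to two habits, sketched below with hypothetical helper names (the actual implementations live in src/ and may differ):

```python
import json
import yaml  # PyYAML


def load_config(path: str) -> dict:
    """Parse config.yaml with safe_load, so the YAML cannot execute arbitrary code."""
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)


def save_results(results: dict, path: str) -> None:
    """Persist results as JSON: human-readable and, unlike pickle, cannot run code on load."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2)
```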
cd invela_assessment
pip3 install -r requirements.txt

# Option 1: AWS CLI
aws configure
# Option 2: Environment variables
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-east-1

Place company documents in data/company_docs/:
data/company_docs/
├── SOC2_Report_2024.pdf
├── Privacy_Policy.docx
├── InfoSec_Policy.pdf
└── ...
Supported formats: PDF, DOCX, TXT, MD, JSON
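A simplified sketch of the whitelist and path-traversal checks applied when loading documents (the real logic lives in src/document_processor.py and may differ in detail):

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".pdf", ".docx", ".txt", ".md", ".json"}
DOCS_ROOT = Path("data/company_docs").resolve()


def is_allowed_document(path: str) -> bool:
    resolved = Path(path).resolve()
    # Path traversal prevention: reject anything outside data/company_docs/ (Python 3.9+)
    if not resolved.is_relative_to(DOCS_ROOT):
        return False
    # Only process whitelisted file types
    return resolved.suffix.lower() in ALLOWED_EXTENSIONS
```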
Edit config.yaml to specify which models to test:
bedrock:
  models:
    - name: "Claude 3 Sonnet"
      model_id: "anthropic.claude-3-sonnet-20240229-v1:0"
      temperature: 0.0  # Keep at 0 for consistency
    - name: "Claude 3 Haiku"
      model_id: "anthropic.claude-3-haiku-20240307-v1:0"
      temperature: 0.0
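Each entry under bedrock.models maps to a small model-config object when the file is loaded. The snippet below is a hypothetical reading of that mapping, not the actual src/config.py API:

```python
from dataclasses import dataclass

import yaml


@dataclass
class ModelConfig:
    name: str
    model_id: str
    temperature: float = 0.0


def load_model_configs(path: str = "config.yaml") -> list[ModelConfig]:
    with open(path, "r", encoding="utf-8") as f:
        cfg = yaml.safe_load(f)
    return [ModelConfig(**entry) for entry in cfg["bedrock"]["models"]]
```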
# Test with specific questions first
python3 run_assessment.py --questions 1 22 26 --runs 2
# Full run with all models
python3 run_assessment.py --runs 3

Results are saved to results/:
results/
├── claude_3_sonnet_run1_20241201_143022.json
├── claude_3_sonnet_run2_20241201_143156.json
├── claude_3_haiku_run1_20241201_143312.json
├── model_comparison_20241201_143500.json
└── assessment.log
======================================================================
MULTI-MODEL ASSESSMENT COMPARISON SUMMARY
======================================================================
Models tested: 2
Questions: 142
--- Model Rankings (by Avg Confidence) ---
#1: Claude 3 Sonnet - 0.742
#2: Claude 3 Haiku - 0.698
--- Per-Model Details ---
Claude 3 Sonnet:
Runs: 2
Avg Confidence: 0.7420
Variance across runs: 0.0012
Run 1: conf=0.741, complete=98, partial=31, not_found=13
Run 2: conf=0.743, complete=98, partial=31, not_found=13
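The "Variance across runs" line summarizes answer stability. One simple way to compute such a figure is shown below: take the variance of each question's confidence across runs and average over questions. The actual metric in src/analysis.py may be defined differently.

```python
from statistics import mean, pvariance

# confidence_by_question[qid] -> one confidence score per run (illustrative values)
confidence_by_question = {
    1: [0.91, 0.89, 0.90],
    22: [0.62, 0.70, 0.66],
    26: [0.00, 0.00, 0.00],  # consistently "not found"
}

per_question_variance = {q: pvariance(scores) for q, scores in confidence_by_question.items()}
overall_variance = mean(per_question_variance.values())
print(f"Variance across runs: {overall_variance:.4f}")
```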
invela_assessment/
├── config.yaml # Main configuration
├── run_assessment.py # CLI entry point
├── requirements.txt # Python dependencies
├── .gitignore # Git ignore rules
├── src/
│ ├── __init__.py # Package exports
│ ├── config.py # Configuration management
│ ├── bedrock_client.py # AWS Bedrock LLM + Embeddings
│ ├── document_processor.py # Document loading & chunking
│ ├── vector_store.py # FAISS vector storage
│ ├── rag_engine.py # RAG orchestration
│ ├── analysis.py # Multi-run analysis
│ └── runner.py # Main runner
├── data/
│ ├── company_docs/ # Your documents (gitignored)
│ ├── vector_index/ # Generated index (gitignored)
│ └── invela_accreditation_questions.json
└── results/ # Output (gitignored)
| Parameter | Default | Description |
|---|---|---|
| `temperature` | 0.0 | Keep at 0 for deterministic outputs |
| `chunk_size` | 512 | Tokens per chunk |
| `chunk_overlap` | 50 | Overlap between chunks |
| `top_k` | 10 | Documents to retrieve |
| `rerank_top_n` | 5 | Documents after reranking |
| `num_runs` | 3 | Runs per model for consistency |
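`chunk_size` and `chunk_overlap` control how documents are split before embedding. Below is a simplified sliding-window illustration; the actual semantic chunker in src/document_processor.py is more sophisticated.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 512, chunk_overlap: int = 50) -> list[list[str]]:
    """Split a token sequence into overlapping windows (simplified illustration)."""
    chunks = []
    step = chunk_size - chunk_overlap
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks


# 1200 tokens -> 3 chunks; consecutive chunks share their last/first 50 tokens
chunks = chunk_tokens([f"tok{i}" for i in range(1200)])
print(len(chunks), len(chunks[0]))  # 3 512
```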
These models work without inference profiles:
# Claude 3 Family
- anthropic.claude-3-opus-20240229-v1:0
- anthropic.claude-3-sonnet-20240229-v1:0
- anthropic.claude-3-haiku-20240307-v1:0
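Under the hood, these IDs are passed to the Bedrock Runtime invoke_model API using the Anthropic Messages format. A minimal standalone call looks roughly like this (src/bedrock_client.py wraps this with logging and error handling):

```python
import json

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 1024,
    "temperature": 0.0,
    "messages": [{"role": "user", "content": "Summarize the encryption-at-rest controls."}],
}

response = bedrock.invoke_model(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    body=json.dumps(body),
)
answer = json.loads(response["body"].read())["content"][0]["text"]
print(answer)
```

The snippet below uses the project's higher-level Python API instead, which handles retrieval, scoring, and multi-run comparison.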
from src import AssessmentRunner

# Create runner
runner = AssessmentRunner(config_path="config.yaml")
# Initialize (loads questions, builds vector store)
runner.initialize()
# Run single model
results = runner.run_model(runner.config.bedrock.models[0], num_runs=2)
# Run all models and compare
report = runner.run_all_models(num_runs=3)
runner.print_summary(report)

aws configure  # Set up AWS CLI
# Or set environment variables

Newer models (Claude 3.5, 4.x) require inference profiles. Use Claude 3 models or set up inference profiles in the AWS Console.
- Add more relevant documentation
- Check `comparison_report.json` for specific issues
- Questions about missing docs will naturally have low consistency
The codebase follows:
- PEP 8 style guide
- Type hints throughout
- Comprehensive docstrings
- Defensive error handling
# Test with subset of questions
python3 run_assessment.py --questions 1 2 3 --runs 1
# Force rebuild index after adding docs
python3 run_assessment.py --reindex

Proprietary - Invela Inc.
- v2.0.0 - Multi-model support, improved security, JSON serialization
- v1.0.0 - Initial release