A comprehensive data warehouse solution that combines traditional SQL analytics with AI-powered natural language querying, built with FastAPI, React, and advanced RAG (Retrieval-Augmented Generation) capabilities.
# Clone and start the application
git clone <repository>
cd ai-schema
docker-compose up -d
# Access the application
open http://localhost:9999/banking_demo.html

- System Architecture
- What are RAG and LangChain?
- Data Flow
- Features
- Demo Data
- API Endpoints
- Technologies
- Setup Guide
───────────────────────────────────────────────────────────
 Frontend Layer: React UI (banking_demo.html)
  • Interactive dashboards
  • Natural language query interface
  • Real-time SQL execution
  • Data visualization
───────────────────────────────────────────────────────────
                             │
                             ▼
───────────────────────────────────────────────────────────
 API Gateway: FastAPI Backend (app/main.py)
  • Request routing and validation
  • Authentication & authorization
  • Query processing orchestration
  • Response formatting
───────────────────────────────────────────────────────────
                             │
                ┌────────────┴────────────┐
                ▼                         ▼
───────────────────────────────────────────────────────────
 AI Processing Layer               Traditional Query Layer
 (LangChain + RAG Pipeline)        (SQL Query Engine)
  • Natural language parsing        • Direct SQL execution
  • Context retrieval               • Query validation
  • Schema-aware generation         • Performance optimization
  • Google Gemini integration       • Result formatting
───────────────────────────────────────────────────────────
                │                         │
                └────────────┬────────────┘
                             ▼
───────────────────────────────────────────────────────────
 Database Connector Layer: Multi-Database Support
  • PostgreSQL Connector
  • Snowflake Connector
  • BigQuery Connector
  • Redshift Connector
───────────────────────────────────────────────────────────
                             │
                             ▼
───────────────────────────────────────────────────────────
 Data Storage Layer
 PostgreSQL (Primary Data)         pgvector (Vector Store)
  • 250+ customers                  • Schema embeddings
  • 387 bank accounts               • Query context vectors
  • 3,585+ transactions             • Semantic search index
  • 141 credit cards                • RAG knowledge base
  • 100+ loans with payments        • Historical query patterns
───────────────────────────────────────────────────────────
Imagine you want to cook dinner. You don't just randomly throw ingredients together - you:
- Check your kitchen (what ingredients do you have?)
- Look at recipes (what can you make with these ingredients?)
- Follow the cooking steps (combine ingredients in the right way)
- Serve the meal (present the final result)
This is exactly how our AI system works when you ask a question!
Simple explanation: Before the AI tries to answer your question, it first "looks in the cookbook" (our database) to see what ingredients (data tables) are available and what recipes (past successful queries) have worked before.
In technical terms: RAG retrieves relevant information from a knowledge base before generating answers, ensuring responses are grounded in actual data rather than made-up information.
Simple explanation: LangChain is like a master chef who coordinates the entire cooking process - from understanding what you want to eat, checking ingredients, following recipes, and presenting the final dish.
In technical terms: LangChain orchestrates the entire AI workflow, chaining together different components like understanding user input, retrieving context, generating SQL, and formatting responses.
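That retrieve-then-generate loop can be sketched in a few lines. This is a toy illustration, not the project's actual components: `retrieve` uses naive word overlap where the real system uses vector similarity, and the prompt would feed an LLM rather than be printed.

```python
# Minimal RAG loop: retrieve relevant context, then ground the
# generation step with it. All names here are illustrative stand-ins.

def retrieve(question: str, knowledge_base: dict, top_k: int = 2) -> list:
    """Naive retrieval: rank snippets by words shared with the question."""
    words = set(question.lower().split())
    scored = sorted(
        knowledge_base.items(),
        key=lambda kv: len(words & set(kv[1].lower().split())),
        reverse=True,
    )
    return [text for _, text in scored[:top_k]]

def build_prompt(question: str, context: list) -> str:
    """Ground the LLM by prepending the retrieved schema/query context."""
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {question}\nSQL:"

kb = {
    "customers": "table customers: customer_id first_name last_name",
    "transactions": "table transactions: account_id amount transaction_date",
    "branches": "table branches: branch_id city",
}
question = "total transactions amount per customer"
prompt = build_prompt(question, retrieve(question, kb))
```

A production retriever swaps the word-overlap scoring for embedding similarity, but the chain of steps, retrieve then prompt then generate, is the same.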
FOOD ANALOGY                       BANKING SYSTEM
────────────────────────────────────────────────────────────
User: "I want pasta"            →  User: "Show me top customers"
Chef checks pantry              →  RAG checks database schema
Chef finds: tomatoes, pasta     →  RAG finds: customers, accounts, transactions
Chef recalls pasta recipes      →  RAG recalls similar past queries
Chef picks: tomato pasta        →  RAG picks: customer spending analysis
Chef cooks step by step         →  LangChain generates SQL step by step
Chef serves the dish            →  System returns formatted results
Simple explanation: Think of embeddings as "smart fingerprints" for information.
- Your database schema (table names, columns) gets converted into mathematical "fingerprints"
- Past successful queries also get their own "fingerprints"
- When you ask a new question, we create a "fingerprint" for your question
- We find the most similar "fingerprints" in our collection to understand what you really want
Real example:
Your question: "Show customers who spend a lot"
System thinks: "This fingerprint is similar to past queries about
customer spending analysis, high-value customers, and transaction totals"
Result: Retrieves the right tables and query patterns to build your answer
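The "fingerprint" comparison above is just cosine similarity between embedding vectors. A tiny illustration with made-up 3-dimensional vectors (real embedding models produce hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Similarity of two embedding 'fingerprints': 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d embeddings; the values are invented for illustration.
question   = [0.9, 0.1, 0.2]   # "Show customers who spend a lot"
spending_q = [0.8, 0.2, 0.1]   # past query: customer spending analysis
loan_q     = [0.1, 0.9, 0.7]   # past query: loan default rates

# The spending query's fingerprint is the closer match.
assert cosine_similarity(question, spending_q) > cosine_similarity(question, loan_q)
```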
This combination allows non-technical users to ask complex questions like "Show me customers with high credit utilization who might be at risk" and get accurate SQL queries that consider the actual database schema, relationships, and business context.
curl -X GET "http://localhost:8000/health" \
  -H "Content-Type: application/json" | jq

Expected Output:
{
"status": "healthy",
"database": "connected",
"ai_service": "available",
"timestamp": "2024-08-11T10:30:45Z"
}

curl -X POST "http://localhost:8000/chat/send_message" \
-H "Content-Type: application/json" \
-d '{
"message": "Show me top 10 customers by total spending",
"session_id": "demo_session_1"
}' | jq

Expected Output:
{
"response": "Here are your top 10 customers by spending:",
"sql_query": "SELECT c.first_name, c.last_name, SUM(t.amount) as total_spending FROM customers c JOIN accounts a ON c.customer_id = a.customer_id JOIN transactions t ON a.account_id = t.account_id GROUP BY c.customer_id ORDER BY total_spending DESC LIMIT 10",
"results": [
{"first_name": "John", "last_name": "Smith", "total_spending": 15420.50},
{"first_name": "Sarah", "last_name": "Johnson", "total_spending": 12380.75}
],
"execution_time": "0.34s"
}

curl -X POST "http://localhost:8000/sql/execute" \
-H "Content-Type: application/json" \
-d '{
"query": "SELECT COUNT(*) as total_cards, AVG(credit_limit) as avg_limit, AVG(balance) as avg_balance FROM credit_cards WHERE status = '\''active'\'';",
"database": "banking_db"
}' | jq

Expected Output:
{
"results": [
{
"total_cards": 141,
"avg_limit": 8500.00,
"avg_balance": 2347.83
}
],
"execution_time": "0.12s",
"row_count": 1
}

curl -X POST "http://localhost:8000/chat/send_message" \
-H "Content-Type: application/json" \
-d '{
"message": "Find customers with credit utilization above 80%",
"session_id": "demo_session_2"
}' | jq

Expected Output:
{
"response": "Found 12 customers with high credit utilization (>80%):",
"sql_query": "SELECT c.first_name, c.last_name, cc.balance, cc.credit_limit, ROUND((cc.balance/cc.credit_limit)*100, 2) as utilization_pct FROM customers c JOIN credit_cards cc ON c.customer_id = cc.customer_id WHERE (cc.balance/cc.credit_limit) > 0.8 ORDER BY utilization_pct DESC",
"results": [
{"first_name": "Mike", "last_name": "Wilson", "utilization_pct": 94.5},
{"first_name": "Lisa", "last_name": "Davis", "utilization_pct": 87.2}
],
"insights": "These customers may be at financial risk and could benefit from credit counseling."
}

curl -X POST "http://localhost:8000/sql/execute" \
-H "Content-Type: application/json" \
-d '{
"query": "SELECT category, COUNT(*) as transaction_count, SUM(ABS(amount)) as total_amount FROM transactions WHERE transaction_date >= CURRENT_DATE - INTERVAL '\''7 days'\'' GROUP BY category ORDER BY total_amount DESC LIMIT 5;",
"database": "banking_db"
}' | jq

Expected Output:
{
"results": [
{"category": "Groceries", "transaction_count": 89, "total_amount": 12450.30},
{"category": "Gas Stations", "transaction_count": 45, "total_amount": 3420.75},
{"category": "Restaurants", "transaction_count": 67, "total_amount": 5680.20},
{"category": "Entertainment", "transaction_count": 23, "total_amount": 2340.50},
{"category": "Online Shopping", "transaction_count": 34, "total_amount": 4120.80}
],
"execution_time": "0.18s",
"row_count": 5
}

# Test invalid SQL
curl -X POST "http://localhost:8000/sql/execute" \
-H "Content-Type: application/json" \
-d '{
"query": "SELECT * FROM nonexistent_table;",
"database": "banking_db"
}' | jq
# Test malformed request
curl -X POST "http://localhost:8000/chat/send_message" \
-H "Content-Type: application/json" \
-d '{"invalid": "request"}' | jq

curl -X GET "http://localhost:8000/schema/banking_db" \
  -H "Content-Type: application/json" | jq

curl -X GET "http://localhost:8000/metrics" \
  -H "Content-Type: application/json" | jq

#!/bin/bash
echo "Banking Analytics API Demo"
echo "============================="
echo "1. Health Check..."
curl -s "http://localhost:8000/health" | jq
echo -e "\n2. Top Customers Query..."
curl -s -X POST "http://localhost:8000/chat/send_message" \
-H "Content-Type: application/json" \
-d '{"message": "Show me top 5 customers by spending", "session_id": "demo"}' | jq
echo -e "\n3. Credit Card Stats..."
curl -s -X POST "http://localhost:8000/sql/execute" \
-H "Content-Type: application/json" \
-d '{"query": "SELECT COUNT(*) as cards, AVG(credit_limit) as avg_limit FROM credit_cards;"}' | jq
echo -e "\n4. High Risk Customers..."
curl -s -X POST "http://localhost:8000/chat/send_message" \
-H "Content-Type: application/json" \
-d '{"message": "Find high credit utilization customers", "session_id": "demo"}' | jq
echo -e "\n5. Recent Transactions..."
curl -s -X POST "http://localhost:8000/sql/execute" \
-H "Content-Type: application/json" \
-d '{"query": "SELECT category, COUNT(*) FROM transactions WHERE transaction_date >= CURRENT_DATE - INTERVAL '\''7 days'\'' GROUP BY category LIMIT 3;"}' | jq
echo -e "\n✅ Demo Complete!"

# 1. Edit your .env file
nano .env
# 2. Update the key
GOOGLE_API_KEY=your_new_gemini_api_key_here
# 3. Restart only the backend service
docker-compose restart backend
# 4. Verify the update
curl -X GET "http://localhost:8000/health" | jq

# 1. Stop the backend service
docker-compose stop backend
# 2. Update environment variable
export GOOGLE_API_KEY="your_new_gemini_api_key_here"
# 3. Start the backend service
docker-compose up -d backend
# 4. Check logs for successful startup
docker-compose logs -f backend

# 1. Stop all services
docker-compose down
# 2. Update .env file
echo "GOOGLE_API_KEY=your_new_gemini_api_key_here" >> .env
# 3. Start all services
docker-compose up -d
# 4. Verify all services are healthy
docker-compose ps
curl -X GET "http://localhost:8000/health" | jq

# Test AI functionality
curl -X POST "http://localhost:8000/chat/send_message" \
-H "Content-Type: application/json" \
-d '{
"message": "Test AI connection - show customer count",
"session_id": "test_key"
}' | jq
# Check backend logs for API errors
docker-compose logs backend | grep -i "gemini\|api\|error"
# Verify health endpoint shows AI service available
curl -X GET "http://localhost:8000/health" | jq '.ai_service'

# Restart just the backend (fastest)
docker-compose restart backend
# Restart backend and vector database
docker-compose restart backend db
# Restart everything (if major changes)
docker-compose down && docker-compose up -d

# Check if key is loaded
docker exec ai-schema-backend-1 env | grep GOOGLE_API_KEY
# Test key validity
curl -X POST "https://generativelanguage.googleapis.com/v1/models/gemini-pro:generateContent?key=YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{"contents":[{"parts":[{"text":"test"}]}]}'
# View backend service logs
docker-compose logs -f backend | grep -i "gemini\|auth\|api"

Business User                        AI System                          Database
─────────────────────────────────────────────────────────────────────────────────
"Show me our top customers       →   "Let me understand what you need..."
 by spending this year"
                                     1. UNDERSTAND (LangChain)
                                        • Parse: "top customers" + "spending" + "this year"
                                        • Intent: Customer analysis query
                                     2. RESEARCH (RAG)                 →   Check Vector Store
                                        • "What tables have customer data?"
                                        • "How did we solve similar questions before?"
                                        • "What are the relationships between tables?"
                                                                       ←   Schema Info Retrieved
                                     3. PLAN (LangChain + RAG)
                                        • Found: customers, accounts, transactions tables
                                        • Pattern: JOIN tables → SUM amounts → GROUP BY customer
                                        • Recipe: Similar to "customer spending analysis" queries
                                     4. BUILD SQL (Google Gemini)
                                        • Creates: SELECT customers, SUM(transactions)...
                                        • Adds: JOIN statements for relationships
                                        • Includes: WHERE date filters for "this year"
                                     5. EXECUTE                        →   Run Query on Bank Data
                                        • Safety check: No destructive operations
                                        • Run the generated SQL
                                                                       ←   Results Retrieved
                                     6. PRESENT
                                        • Format results in readable table
                                        • Add insights: "Top customer spent $50,000"
"Perfect! John Smith is our      ←   "Here are your top 10 customers by
 biggest customer with $50K           spending in 2024..."
 in spending this year."
Business Benefits:
- No SQL Knowledge Needed: Ask questions in plain English
- Always Accurate: AI uses the actual database structure, not guesswork
- Fast Results: Sub-second query generation and execution
- Learning System: Gets smarter with each question asked

Technical Advantages:
- Schema-Aware: AI knows your exact table structure
- Context-Rich: Past successful queries guide new ones
- Multi-Database: Works across PostgreSQL, Snowflake, BigQuery
- Security-First: Query validation prevents data damage
Scenario 1: Risk Management
Manager: "Which customers have maxed out their credit cards?"
System: → Checks credit_cards table → Calculates utilization ratios → Returns risk list
Result: "15 customers at 90%+ utilization, flagged for review"
Scenario 2: Performance Analysis
Executive: "How did our loan portfolio perform last quarter?"
System: → Joins loans + payments tables → Calculates metrics → Trend analysis
Result: "Q3 loan performance: 94% on-time payments, $2.3M new originations"
Scenario 3: Customer Insights
Marketing: "Show me customers who might need a savings account"
System: → Analyzes transaction patterns → Identifies high cash flow → No savings products
Result: "127 customers with high deposits but no savings accounts"
User Input: "Show me top customers by total spending"
        │
        ▼
───────────────────────────────────────────────────────────
 LangChain Query Processing Pipeline
───────────────────────────────────────────────────────────
 1. Input Validation & Preprocessing
    • Clean and normalize user input
    • Extract query intent and entities

 2. Context Retrieval (RAG)
    • Query vector store for relevant schema info
    • Retrieve similar historical queries
    • Get table relationships and constraints

 3. Prompt Engineering
    • Combine user query with retrieved context
    • Apply banking domain-specific templates
    • Include schema definitions and examples

 4. LLM Generation (Google Gemini)
    • Generate SQL query based on context
    • Apply banking business rules
    • Ensure query safety and optimization

 5. Post-Processing
    • Validate generated SQL syntax
    • Apply security checks
    • Format for execution
───────────────────────────────────────────────────────────
        │
        ▼
Generated SQL: "SELECT c.first_name, c.last_name,
SUM(t.amount) as total_spending
FROM customers c
JOIN accounts a ON c.customer_id = a.customer_id
JOIN transactions t ON a.account_id = t.account_id
GROUP BY c.customer_id, c.first_name, c.last_name
ORDER BY total_spending DESC LIMIT 10"
───────────────────────────────────────────────────────────
 Vector Store (pgvector)
───────────────────────────────────────────────────────────
 Schema Embeddings
  • Table definitions with descriptions
  • Column metadata and relationships
  • Business rules and constraints

 Query Pattern Embeddings
  • Common banking queries and their SQL
  • User query history and results
  • Domain-specific query templates

 Semantic Search Process
  1. Convert user query to embedding
  2. Find similar vectors in knowledge base
  3. Retrieve top-k relevant contexts
  4. Rank by relevance and recency
───────────────────────────────────────────────────────────
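Steps 1 through 3 of the semantic search process map onto a single pgvector query: embed the question, then order stored vectors by distance. The sketch below is a hedged illustration: the `embeddings` table layout, the `embed()` stand-in, and the psycopg-style connection are assumptions, not this repository's actual code. The `<=>` operator is pgvector's cosine-distance operator.

```python
# Hypothetical pgvector lookup; table and column names are assumptions.
TOP_K_SQL = """
SELECT content
FROM embeddings
ORDER BY embedding <=> %(query_vec)s  -- pgvector cosine-distance operator
LIMIT %(k)s;
"""

def embed(text: str) -> list:
    """Stand-in for the real embedding model call (e.g. Gemini embeddings)."""
    return [0.0, 0.0, 0.0]

def retrieve_context(conn, question: str, k: int = 5) -> list:
    """Embed the question, then pull the k nearest stored context snippets."""
    with conn.cursor() as cur:
        cur.execute(TOP_K_SQL, {"query_vec": embed(question), "k": k})
        return [row[0] for row in cur.fetchall()]
```

Ranking by recency (step 4) would add a second sort key or a re-ranking pass over the returned rows.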
- Natural Language Queries: Ask questions in plain English
- Context-Aware Responses: RAG ensures accurate, data-grounded answers
- Banking Domain Intelligence: Specialized for financial data analysis
- Query Learning: System improves from user interactions
- Direct SQL Interface: For power users and developers
- Real-time Execution: Sub-second query performance
- Multi-database Support: PostgreSQL, Snowflake, BigQuery, Redshift
- Query Optimization: Automatic performance enhancements
- Query Validation: Prevents destructive operations
- Access Control: Role-based permissions
- Audit Logging: Complete query history tracking
- Data Privacy: Sensitive data protection
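The query-validation guard above can start as a simple allow-list check run before execution. A minimal sketch (a real deployment would also use parameterized queries, a proper SQL parser, and role checks; a keyword regex alone can false-positive on string literals):

```python
import re

# Statements the read-only analytics path refuses to run.
FORBIDDEN = re.compile(
    r"\b(DROP|DELETE|TRUNCATE|UPDATE|INSERT|ALTER|GRANT)\b", re.IGNORECASE
)

def is_safe_query(sql: str) -> bool:
    """Allow a single SELECT/WITH statement; reject destructive keywords."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:              # multiple statements smuggled in
        return False
    if FORBIDDEN.search(stripped):   # destructive verb anywhere in the text
        return False
    return stripped.upper().startswith(("SELECT", "WITH"))

assert is_safe_query("SELECT * FROM customers LIMIT 5")
assert not is_safe_query("DROP TABLE customers")
assert not is_safe_query("SELECT 1; DELETE FROM accounts")
```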
The platform includes comprehensive banking data with 250+ customers, 387 accounts, 3,585+ transactions, 141 credit cards, and 100+ loans with realistic financial patterns for demonstration purposes.
- Backend: FastAPI + Python + SQLAlchemy
- AI: LangChain + Google Gemini + pgvector RAG
- Frontend: HTML5/CSS3/JavaScript + Tailwind CSS
- Database: PostgreSQL + Snowflake + BigQuery + Redshift
- Infrastructure: Docker + Redis + Nginx
POST /chat/send_message
├── Natural language processing
├── RAG context retrieval
├── SQL generation via LangChain
└── Result formatting and return

GET /chat/history
└── Retrieve conversation history

POST /sql/execute
├── Direct SQL query execution
├── Query validation and security checks
└── Formatted result return

GET /databases
└── List available database connections

POST /databases/connect
├── Establish new database connection
├── Validate credentials
└── Test connectivity

GET /schema/{database}
├── Retrieve schema information
├── Table and column metadata
└── Relationship mappings

GET /health
├── System health check
├── Database connectivity status
└── AI service availability

GET /metrics
├── Query performance statistics
├── Usage analytics
└── System resource utilization
- Docker and Docker Compose
- 8GB+ RAM recommended
- Google Gemini API key
- Database credentials (PostgreSQL/Snowflake/BigQuery/Redshift)
Create .env file with your configuration:
# AI Configuration
GOOGLE_API_KEY=your_gemini_api_key_here
OPENAI_API_KEY=optional_openai_key
# Database Configuration
DATABASE_URL=postgresql://user:password@localhost:5432/banking_db
SNOWFLAKE_ACCOUNT=your_account
SNOWFLAKE_USER=your_username
SNOWFLAKE_PASSWORD=your_password
SNOWFLAKE_WAREHOUSE=your_warehouse
# Vector Store
VECTOR_DB_URL=postgresql://user:password@localhost:5432/vector_db
# Application Settings
API_HOST=0.0.0.0
API_PORT=8000
FRONTEND_PORT=3000
REDIS_URL=redis://localhost:6379

# Start all services
docker-compose up -d
# View logs
docker-compose logs -f
# Check service status
docker-compose ps

# Execute database initialization scripts
docker exec -it ai-schema-db-1 psql -U postgres -d banking_db -f /docker-entrypoint-initdb.d/banking_schema.sql
docker exec -it ai-schema-db-1 psql -U postgres -d banking_db -f /docker-entrypoint-initdb.d/banking_data.sql

- Demo Interface: http://localhost:9999/banking_demo.html
- API Documentation: http://localhost:8000/docs
- Health Check: http://localhost:8000/health
Customer Analytics
User: "Show me top 10 customers by total spending in 2024"
AI: Generates optimized SQL with proper joins and aggregations
Result: Customer rankings with spending amounts
Risk Analysis
User: "Find customers with credit utilization above 80%"
AI: Calculates utilization ratios across credit products
Result: High-risk customer identification
Fraud Detection
User: "Detect unusual transaction patterns in the last 30 days"
AI: Applies statistical analysis and pattern recognition
Result: Flagged transactions with risk scores
-- Complex analytical query
WITH customer_metrics AS (
SELECT
c.customer_id,
c.first_name || ' ' || c.last_name as full_name,
COUNT(DISTINCT a.account_id) as account_count,
SUM(CASE WHEN t.amount > 0 THEN t.amount ELSE 0 END) as total_credits,
SUM(CASE WHEN t.amount < 0 THEN ABS(t.amount) ELSE 0 END) as total_debits,
AVG(a.balance) as avg_balance
FROM customers c
JOIN accounts a ON c.customer_id = a.customer_id
JOIN transactions t ON a.account_id = t.account_id
WHERE t.transaction_date >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY c.customer_id, c.first_name, c.last_name
)
SELECT
full_name,
account_count,
total_credits,
total_debits,
avg_balance,
(total_credits - total_debits) as net_flow
FROM customer_metrics
ORDER BY net_flow DESC
LIMIT 25;

The system creates embeddings for:
- Database schemas with semantic descriptions
- Historical queries for pattern recognition
- Business rules for contextual understanding
- Domain knowledge for banking-specific insights
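Indexing those items amounts to turning each one into text, embedding it, and storing the vector. The sketch below is illustrative: `describe_table`, `build_index`, and the `embed()` stand-in are assumed helpers, not functions from this repository.

```python
# Illustrative indexing pass: one embedding row per schema object.

def embed(text: str) -> list:
    """Stand-in for a real embedding model call."""
    return [float(len(text)), 0.0]

def describe_table(name: str, columns: list, purpose: str) -> str:
    """Turn schema metadata into a sentence the embedding model can encode."""
    return f"Table {name} ({', '.join(columns)}): {purpose}"

def build_index(tables: dict) -> list:
    """Embed every table description so RAG can retrieve it later."""
    rows = []
    for name, (cols, purpose) in tables.items():
        text = describe_table(name, cols, purpose)
        rows.append({"content": text, "embedding": embed(text)})
    return rows

index = build_index({
    "customers": (["customer_id", "first_name", "last_name"], "bank customers"),
    "loans": (["loan_id", "amount", "status"], "loan originations and status"),
})
```

In the running system each row would be written to the pgvector store; historical queries and business rules get the same text-then-embed treatment.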
# Simplified LangChain pipeline
from langchain_experimental.sql import SQLDatabaseChain
from langchain_google_genai import GoogleGenerativeAI

# Initialize components
llm = GoogleGenerativeAI(model="gemini-pro")
sql_chain = SQLDatabaseChain.from_llm(
    llm=llm,
    db=database,
    verbose=True,
    return_intermediate_steps=True
)

# Process natural language query
response = sql_chain.run("Show high-value customers")

# Database connector factory
class DatabaseConnectorFactory:
    @staticmethod
    def create_connector(db_type: str, config: dict):
        connectors = {
            "postgresql": PostgreSQLConnector,
            "snowflake": SnowflakeConnector,
            "bigquery": BigQueryConnector,
            "redshift": RedshiftConnector
        }
        return connectors[db_type](config)

1. Database Connection Failed
# Check database service
docker-compose logs db
# Verify connection parameters
docker exec -it ai-schema-backend-1 python -c "
from app.database import get_db_connection
print(get_db_connection().execute('SELECT 1'))
"

2. AI Queries Not Working
# Verify API key
echo $GOOGLE_API_KEY
# Check vector store
docker exec -it ai-schema-db-1 psql -U postgres -d vector_db -c "
SELECT count(*) FROM embeddings;
"

3. Frontend Not Loading
# Check frontend service
docker-compose logs frontend
# Verify API connectivity
curl http://localhost:8000/health

- Connection Pooling: Reuse database connections
- Query Caching: Redis-based result caching
- Index Optimization: Automated index recommendations
- Parallel Processing: Concurrent query execution
- Embedding Caching: Store frequently used embeddings
- Model Optimization: Use appropriate model sizes
- Context Limiting: Optimize RAG context windows
- Response Streaming: Real-time response delivery
- Query Sanitization: Prevent SQL injection
- Access Control: Role-based permissions
- Audit Logging: Complete operation tracking
- Data Masking: Sensitive field protection
- Authentication: JWT token validation
- Rate Limiting: Prevent abuse
- HTTPS Only: Encrypted communication
- CORS Configuration: Cross-origin security
- Query response times
- AI model performance
- Database connection health
- Resource utilization
- User engagement patterns
- Query success rates
- Feature adoption
- Performance improvements
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
Your banking dataset (250+ customers, 387 accounts, 3,585+ transactions, 141 credit cards, 100+ loans) is perfect for implementing these ML models:
# Features from your data
- Account balance trends
- Transaction frequency patterns
- Product usage (credit cards, loans)
- Customer demographics
- Days since last transaction
# ML Model
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Example implementation
def predict_churn():
    features = [
        'avg_balance_last_3_months',
        'transaction_count_decline',
        'days_since_last_login',
        'credit_utilization_ratio',
        'number_of_products'
    ]
    # Train model to predict: Will customer leave in next 90 days?
    model = RandomForestClassifier()
    # Result: "85% chance Customer John Smith will churn"

# Features from your banking data
- Payment history from loans table
- Credit utilization from credit_cards
- Account balance stability
- Transaction patterns
- Income estimation from deposits
# ML Implementation
def credit_risk_model():
    sql_query = """
        SELECT
            c.customer_id,
            AVG(a.balance) as avg_balance,
            COUNT(t.transaction_id) as transaction_count,
            SUM(CASE WHEN cc.balance > cc.credit_limit * 0.8 THEN 1 ELSE 0 END) as high_utilization,
            COUNT(CASE WHEN l.status = 'defaulted' THEN 1 END) as past_defaults
        FROM customers c
        JOIN accounts a ON c.customer_id = a.customer_id
        LEFT JOIN transactions t ON a.account_id = t.account_id
        LEFT JOIN credit_cards cc ON c.customer_id = cc.customer_id
        LEFT JOIN loans l ON c.customer_id = l.customer_id
        GROUP BY c.customer_id
    """
    # Output: Risk score 1-1000 for each customer
    # "Customer ID 123: Risk Score 750 (High Risk)"

# Real-time transaction scoring
def fraud_detection_model():
    features = [
        'transaction_amount',
        'time_of_day',
        'merchant_category',
        'location_deviation',
        'amount_vs_historical_avg',
        'velocity_checks'  # multiple transactions in quick succession
    ]
    # Anomaly detection using Isolation Forest
    from sklearn.ensemble import IsolationForest
    # Real-time scoring: "Transaction flagged - 95% fraud probability"
    # Auto-block high-risk transactions

# Predict future revenue per customer
def clv_prediction():
    features = [
        'monthly_avg_balance',
        'fee_generation',
        'product_adoption_rate',
        'transaction_volume',
        'tenure_months'
    ]
    # Regression model to predict:
    # "Customer will generate $2,847 in revenue over next 2 years"

# Using your loans and payments data
def loan_default_model():
    sql_features = """
        SELECT
            l.loan_id,
            l.amount,
            l.interest_rate,
            l.term_months,
            COUNT(p.payment_id) as payments_made,
            AVG(CASE WHEN p.payment_date > p.due_date THEN 1 ELSE 0 END) as late_payment_rate,
            c.age,
            AVG(a.balance) as avg_account_balance
        FROM loans l
        JOIN loan_payments p ON l.loan_id = p.loan_id
        JOIN customers c ON l.customer_id = c.customer_id
        JOIN accounts a ON c.customer_id = a.customer_id
        GROUP BY l.loan_id, l.amount, l.interest_rate, l.term_months, c.age
    """
    # Predict: "15% probability of default in next 6 months"

# Auto-categorize transactions using NLP
def smart_categorization():
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    # Train on your merchant names and descriptions
    features = ['merchant_name', 'transaction_amount', 'time_pattern']
    # Auto-categorize: "STARBUCKS STORE #123" → "Coffee & Dining"
    # Improve your current merchant categories with ML

# Customer behavior clustering
def spending_patterns():
    from sklearn.cluster import KMeans
    features = [
        'grocery_spending_ratio',
        'entertainment_ratio',
        'gas_stations_ratio',
        'online_vs_offline_ratio',
        'weekend_vs_weekday_spending'
    ]
    # Cluster customers into groups:
    # - "Budget Conscious" (low spending, high savings)
    # - "Premium Lifestyle" (high entertainment, dining)
    # - "Family Focused" (groceries, gas, utilities)

# Add ML Pipeline to your existing system
class MLPipeline:
    def __init__(self):
        self.models = {
            'churn': ChurnModel(),
            'fraud': FraudModel(),
            'credit_risk': CreditRiskModel(),
            'clv': CLVModel()
        }

    def get_customer_insights(self, customer_id):
        # Get all predictions for a customer
        insights = {}
        # Fetch customer data from your existing database
        customer_data = self.get_customer_features(customer_id)
        # Run all models
        insights['churn_probability'] = self.models['churn'].predict(customer_data)
        insights['fraud_risk'] = self.models['fraud'].predict(customer_data)
        insights['credit_score'] = self.models['credit_risk'].predict(customer_data)
        insights['lifetime_value'] = self.models['clv'].predict(customer_data)
        return insights

    def real_time_transaction_scoring(self, transaction):
        # Score transactions as they happen
        fraud_score = self.models['fraud'].predict_single(transaction)
        if fraud_score > 0.8:
            return "BLOCK_TRANSACTION"
        elif fraud_score > 0.5:
            return "REQUIRE_VERIFICATION"
        else:
            return "APPROVE"

# Start with these using your existing data
1. Customer Segmentation (K-Means clustering)
2. Basic Fraud Detection (Anomaly detection)
3. Credit Utilization Risk (Simple scoring)
4. Transaction Pattern Analysis

# More sophisticated ML
1. Churn Prediction (Random Forest/XGBoost)
2. Loan Default Prediction (Gradient Boosting)
3. Customer Lifetime Value (Regression)
4. Real-time Fraud Scoring

# Integrate with your existing RAG system
1. Natural Language Model Queries
User: "Show me customers likely to churn"
AI: Runs churn model + generates insights
2. Automated Insights
"15 customers flagged for high churn risk this week"
3. Predictive Recommendations
"Customer John Smith: Recommend personal loan based on spending patterns"

- Fraud Prevention: Save $50K+ annually by catching fraud early
- Credit Risk: Reduce defaults by 20-30% with better scoring
- Compliance: Automated suspicious activity monitoring
- Cross-selling: "Customer likely to need mortgage" → increase sales 15%
- Retention: Identify churn early → save high-value customers
- Pricing: Dynamic pricing based on risk profiles
- Automated Decisions: Instant loan approvals for low-risk customers
- Resource Allocation: Focus on high-value, low-risk customers
- Proactive Support: Reach out before customers have problems
# Add to your existing requirements.txt
scikit-learn==1.3.0
pandas==2.0.3
numpy==1.24.3
xgboost==1.7.6
lightgbm==4.0.0
tensorflow==2.13.0 # For deep learning models
plotly==5.15.0 # For ML visualizations
mlflow==2.5.0       # Model versioning and tracking

Add ML insights to your existing banking_demo.html:
// New ML-powered buttons
<button onclick="getChurnPrediction()">Predict Customer Churn</button>
<button onclick="getFraudAlerts()">Real-time Fraud Detection</button>
<button onclick="getCreditRiskAnalysis()">Credit Risk Analysis</button>
<button onclick="getCustomerSegments()">Customer Segmentation</button>
<button onclick="getLoanDefaultRisk()">Loan Default Predictions</button>
- Advanced visualization components
- Real-time data streaming
- Machine learning model integration ✅
- Multi-tenant architecture
- Advanced security features
- Mobile application
- Enterprise SSO integration
Built with ❤️ for the future of data analytics