A comprehensive data warehouse solution that combines traditional SQL analytics with AI-powered natural language querying, built with FastAPI, React, and advanced RAG (Retrieval-Augmented Generation) capabilities.
# Clone and start the application
git clone <repository>
cd ai-schema
docker-compose up -d
# Access the application
open http://localhost:9999/banking_demo.html

- System Architecture
- What are RAG and LangChain?
- Data Flow
- Features
- Demo Data
- API Endpoints
- Technologies
- Setup Guide
───────────────────────────────────────────────────────────
 Frontend Layer: React UI (banking_demo.html)
  • Interactive dashboards
  • Natural language query interface
  • Real-time SQL execution
  • Data visualization
───────────────────────────────────────────────────────────
                             │
                             ▼
───────────────────────────────────────────────────────────
 API Gateway: FastAPI Backend (app/main.py)
  • Request routing and validation
  • Authentication & authorization
  • Query processing orchestration
  • Response formatting
───────────────────────────────────────────────────────────
                             │
                ┌────────────┴────────────┐
                ▼                         ▼
───────────────────────────────────────────────────────────
 AI Processing Layer               Traditional Query Layer
 (LangChain + RAG Pipeline)        (SQL Query Engine)
  • Natural language parsing        • Direct SQL execution
  • Context retrieval               • Query validation
  • Schema-aware generation         • Performance optimization
  • Google Gemini integration       • Result formatting
───────────────────────────────────────────────────────────
                │                         │
                └────────────┬────────────┘
                             ▼
───────────────────────────────────────────────────────────
 Database Connector Layer: Multi-Database Support
  • PostgreSQL Connector
  • Snowflake Connector
  • BigQuery Connector
  • Redshift Connector
───────────────────────────────────────────────────────────
                             │
                             ▼
───────────────────────────────────────────────────────────
 Data Storage Layer
 PostgreSQL (Primary Data)         pgvector (Vector Store)
  • 250+ customers                  • Schema embeddings
  • 387 bank accounts               • Query context vectors
  • 3,585+ transactions             • Semantic search index
  • 141 credit cards                • RAG knowledge base
  • 100+ loans with payments        • Historical query patterns
───────────────────────────────────────────────────────────
Imagine you want to cook dinner. You don't just randomly throw ingredients together - you:
- Check your kitchen (what ingredients do you have?)
- Look at recipes (what can you make with these ingredients?)
- Follow the cooking steps (combine ingredients in the right way)
- Serve the meal (present the final result)
This is exactly how our AI system works when you ask a question!
Simple explanation: Before the AI tries to answer your question, it first "looks in the cookbook" (our database) to see what ingredients (data tables) are available and what recipes (past successful queries) have worked before.
In technical terms: RAG retrieves relevant information from a knowledge base before generating answers, ensuring responses are grounded in actual data rather than made-up information.
Simple explanation: LangChain is like a master chef who coordinates the entire cooking process - from understanding what you want to eat, checking ingredients, following recipes, and presenting the final dish.
In technical terms: LangChain orchestrates the entire AI workflow, chaining together different components like understanding user input, retrieving context, generating SQL, and formatting responses.
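That retrieve-then-generate loop can be sketched in a few lines. This is a toy illustration, not the project's actual components: `retrieve` uses naive word overlap where the real system uses vector similarity, and the prompt would feed an LLM rather than be printed.

```python
# Minimal RAG loop: retrieve relevant context, then ground the
# generation step with it. All names here are illustrative stand-ins.

def retrieve(question: str, knowledge_base: dict, top_k: int = 2) -> list:
    """Naive retrieval: rank snippets by words shared with the question."""
    words = set(question.lower().split())
    scored = sorted(
        knowledge_base.items(),
        key=lambda kv: len(words & set(kv[1].lower().split())),
        reverse=True,
    )
    return [text for _, text in scored[:top_k]]

def build_prompt(question: str, context: list) -> str:
    """Ground the LLM by prepending the retrieved schema/query context."""
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {question}\nSQL:"

kb = {
    "customers": "table customers: customer_id first_name last_name",
    "transactions": "table transactions: account_id amount transaction_date",
    "branches": "table branches: branch_id city",
}
question = "total transactions amount per customer"
prompt = build_prompt(question, retrieve(question, kb))
```

A production retriever swaps the word-overlap scoring for embedding similarity, but the chain of steps, retrieve then prompt then generate, is the same.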
FOOD ANALOGY                       BANKING SYSTEM
────────────────────────────────────────────────────────────
User: "I want pasta"            →  User: "Show me top customers"
Chef checks pantry              →  RAG checks database schema
Chef finds: tomatoes, pasta     →  RAG finds: customers, accounts, transactions
Chef recalls pasta recipes      →  RAG recalls similar past queries
Chef picks: tomato pasta        →  RAG picks: customer spending analysis
Chef cooks step by step         →  LangChain generates SQL step by step
Chef serves the dish            →  System returns formatted results
Simple explanation: Think of embeddings as "smart fingerprints" for information.
- Your database schema (table names, columns) gets converted into mathematical "fingerprints"
- Past successful queries also get their own "fingerprints"
- When you ask a new question, we create a "fingerprint" for your question
- We find the most similar "fingerprints" in our collection to understand what you really want
Real example:
Your question: "Show customers who spend a lot"
System thinks: "This fingerprint is similar to past queries about
customer spending analysis, high-value customers, and transaction totals"
Result: Retrieves the right tables and query patterns to build your answer
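The "fingerprint" comparison above is just cosine similarity between embedding vectors. A tiny illustration with made-up 3-dimensional vectors (real embedding models produce hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Similarity of two embedding 'fingerprints': 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d embeddings; the values are invented for illustration.
question   = [0.9, 0.1, 0.2]   # "Show customers who spend a lot"
spending_q = [0.8, 0.2, 0.1]   # past query: customer spending analysis
loan_q     = [0.1, 0.9, 0.7]   # past query: loan default rates

# The spending query's fingerprint is the closer match.
assert cosine_similarity(question, spending_q) > cosine_similarity(question, loan_q)
```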
This combination allows non-technical users to ask complex questions like "Show me customers with high credit utilization who might be at risk" and get accurate SQL queries that consider the actual database schema, relationships, and business context.
curl -X GET "http://localhost:8000/health" \
  -H "Content-Type: application/json" | jq

Expected Output:
{
"status": "healthy",
"database": "connected",
"ai_service": "available",
"timestamp": "2024-08-11T10:30:45Z"
}

curl -X POST "http://localhost:8000/chat/send_message" \
-H "Content-Type: application/json" \
-d '{
"message": "Show me top 10 customers by total spending",
"session_id": "demo_session_1"
}' | jq

Expected Output:
{
"response": "Here are your top 10 customers by spending:",
"sql_query": "SELECT c.first_name, c.last_name, SUM(t.amount) as total_spending FROM customers c JOIN accounts a ON c.customer_id = a.customer_id JOIN transactions t ON a.account_id = t.account_id GROUP BY c.customer_id ORDER BY total_spending DESC LIMIT 10",
"results": [
{"first_name": "John", "last_name": "Smith", "total_spending": 15420.50},
{"first_name": "Sarah", "last_name": "Johnson", "total_spending": 12380.75}
],
"execution_time": "0.34s"
}

curl -X POST "http://localhost:8000/sql/execute" \
-H "Content-Type: application/json" \
-d '{
"query": "SELECT COUNT(*) as total_cards, AVG(credit_limit) as avg_limit, AVG(balance) as avg_balance FROM credit_cards WHERE status = '\''active'\'';",
"database": "banking_db"
}' | jq

Expected Output:
{
"results": [
{
"total_cards": 141,
"avg_limit": 8500.00,
"avg_balance": 2347.83
}
],
"execution_time": "0.12s",
"row_count": 1
}

curl -X POST "http://localhost:8000/chat/send_message" \
-H "Content-Type: application/json" \
-d '{
"message": "Find customers with credit utilization above 80%",
"session_id": "demo_session_2"
}' | jq

Expected Output:
{
"response": "Found 12 customers with high credit utilization (>80%):",
"sql_query": "SELECT c.first_name, c.last_name, cc.balance, cc.credit_limit, ROUND((cc.balance/cc.credit_limit)*100, 2) as utilization_pct FROM customers c JOIN credit_cards cc ON c.customer_id = cc.customer_id WHERE (cc.balance/cc.credit_limit) > 0.8 ORDER BY utilization_pct DESC",
"results": [
{"first_name": "Mike", "last_name": "Wilson", "utilization_pct": 94.5},
{"first_name": "Lisa", "last_name": "Davis", "utilization_pct": 87.2}
],
"insights": "These customers may be at financial risk and could benefit from credit counseling."
}

curl -X POST "http://localhost:8000/sql/execute" \
-H "Content-Type: application/json" \
-d '{
"query": "SELECT category, COUNT(*) as transaction_count, SUM(ABS(amount)) as total_amount FROM transactions WHERE transaction_date >= CURRENT_DATE - INTERVAL '\''7 days'\'' GROUP BY category ORDER BY total_amount DESC LIMIT 5;",
"database": "banking_db"
}' | jq

Expected Output:
{
"results": [
{"category": "Groceries", "transaction_count": 89, "total_amount": 12450.30},
{"category": "Gas Stations", "transaction_count": 45, "total_amount": 3420.75},
{"category": "Restaurants", "transaction_count": 67, "total_amount": 5680.20},
{"category": "Entertainment", "transaction_count": 23, "total_amount": 2340.50},
{"category": "Online Shopping", "transaction_count": 34, "total_amount": 4120.80}
],
"execution_time": "0.18s",
"row_count": 5
}

# Test invalid SQL
curl -X POST "http://localhost:8000/sql/execute" \
-H "Content-Type: application/json" \
-d '{
"query": "SELECT * FROM nonexistent_table;",
"database": "banking_db"
}' | jq
# Test malformed request
curl -X POST "http://localhost:8000/chat/send_message" \
-H "Content-Type: application/json" \
-d '{"invalid": "request"}' | jq

curl -X GET "http://localhost:8000/schema/banking_db" \
  -H "Content-Type: application/json" | jq

curl -X GET "http://localhost:8000/metrics" \
  -H "Content-Type: application/json" | jq

#!/bin/bash
echo "Banking Analytics API Demo"
echo "============================="
echo "1. Health Check..."
curl -s "http://localhost:8000/health" | jq
echo -e "\n2. Top Customers Query..."
curl -s -X POST "http://localhost:8000/chat/send_message" \
-H "Content-Type: application/json" \
-d '{"message": "Show me top 5 customers by spending", "session_id": "demo"}' | jq
echo -e "\n3. Credit Card Stats..."
curl -s -X POST "http://localhost:8000/sql/execute" \
-H "Content-Type: application/json" \
-d '{"query": "SELECT COUNT(*) as cards, AVG(credit_limit) as avg_limit FROM credit_cards;"}' | jq
echo -e "\n4. High Risk Customers..."
curl -s -X POST "http://localhost:8000/chat/send_message" \
-H "Content-Type: application/json" \
-d '{"message": "Find high credit utilization customers", "session_id": "demo"}' | jq
echo -e "\n5. Recent Transactions..."
curl -s -X POST "http://localhost:8000/sql/execute" \
-H "Content-Type: application/json" \
-d '{"query": "SELECT category, COUNT(*) FROM transactions WHERE transaction_date >= CURRENT_DATE - INTERVAL '\''7 days'\'' GROUP BY category LIMIT 3;"}' | jq
echo -e "\n✅ Demo Complete!"

# 1. Edit your .env file
nano .env
# 2. Update the key
GOOGLE_API_KEY=your_new_gemini_api_key_here
# 3. Restart only the backend service
docker-compose restart backend
# 4. Verify the update
curl -X GET "http://localhost:8000/health" | jq

# 1. Stop the backend service
docker-compose stop backend
# 2. Update environment variable
export GOOGLE_API_KEY="your_new_gemini_api_key_here"
# 3. Start the backend service
docker-compose up -d backend
# 4. Check logs for successful startup
docker-compose logs -f backend

# 1. Stop all services
docker-compose down
# 2. Update .env file
echo "GOOGLE_API_KEY=your_new_gemini_api_key_here" >> .env
# 3. Start all services
docker-compose up -d
# 4. Verify all services are healthy
docker-compose ps
curl -X GET "http://localhost:8000/health" | jq

# Test AI functionality
curl -X POST "http://localhost:8000/chat/send_message" \
-H "Content-Type: application/json" \
-d '{
"message": "Test AI connection - show customer count",
"session_id": "test_key"
}' | jq
# Check backend logs for API errors
docker-compose logs backend | grep -i "gemini\|api\|error"
# Verify health endpoint shows AI service available
curl -X GET "http://localhost:8000/health" | jq '.ai_service'

# Restart just the backend (fastest)
docker-compose restart backend
# Restart backend and vector database
docker-compose restart backend db
# Restart everything (if major changes)
docker-compose down && docker-compose up -d

# Check if key is loaded
docker exec ai-schema-backend-1 env | grep GOOGLE_API_KEY
# Test key validity
curl -X POST "https://generativelanguage.googleapis.com/v1/models/gemini-pro:generateContent?key=YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{"contents":[{"parts":[{"text":"test"}]}]}'
# View backend service logs
docker-compose logs -f backend | grep -i "gemini\|auth\|api"

Business User                        AI System                          Database
─────────────────────────────────────────────────────────────────────────────────
"Show me our top customers       →   "Let me understand what you need..."
 by spending this year"
                                     1. UNDERSTAND (LangChain)
                                        • Parse: "top customers" + "spending" + "this year"
                                        • Intent: Customer analysis query
                                     2. RESEARCH (RAG)                 →   Check Vector Store
                                        • "What tables have customer data?"
                                        • "How did we solve similar questions before?"
                                        • "What are the relationships between tables?"
                                                                       ←   Schema Info Retrieved
                                     3. PLAN (LangChain + RAG)
                                        • Found: customers, accounts, transactions tables
                                        • Pattern: JOIN tables → SUM amounts → GROUP BY customer
                                        • Recipe: Similar to "customer spending analysis" queries
                                     4. BUILD SQL (Google Gemini)
                                        • Creates: SELECT customers, SUM(transactions)...
                                        • Adds: JOIN statements for relationships
                                        • Includes: WHERE date filters for "this year"
                                     5. EXECUTE                        →   Run Query on Bank Data
                                        • Safety check: No destructive operations
                                        • Run the generated SQL
                                                                       ←   Results Retrieved
                                     6. PRESENT
                                        • Format results in readable table
                                        • Add insights: "Top customer spent $50,000"
"Perfect! John Smith is our      ←   "Here are your top 10 customers by
 biggest customer with $50K           spending in 2024..."
 in spending this year."
Business Benefits:
- No SQL Knowledge Needed: Ask questions in plain English
- Always Accurate: AI uses the actual database structure, not guesswork
- Fast Results: Sub-second query generation and execution
- Learning System: Gets smarter with each question asked

Technical Advantages:
- Schema-Aware: AI knows your exact table structure
- Context-Rich: Past successful queries guide new ones
- Multi-Database: Works across PostgreSQL, Snowflake, BigQuery
- Security-First: Query validation prevents data damage
Scenario 1: Risk Management
Manager: "Which customers have maxed out their credit cards?"
System: → Checks credit_cards table → Calculates utilization ratios → Returns risk list
Result: "15 customers at 90%+ utilization, flagged for review"
Scenario 2: Performance Analysis
Executive: "How did our loan portfolio perform last quarter?"
System: → Joins loans + payments tables → Calculates metrics → Trend analysis
Result: "Q3 loan performance: 94% on-time payments, $2.3M new originations"
Scenario 3: Customer Insights
Marketing: "Show me customers who might need a savings account"
System: → Analyzes transaction patterns → Identifies high cash flow → No savings products
Result: "127 customers with high deposits but no savings accounts"
User Input: "Show me top customers by total spending"
        │
        ▼
───────────────────────────────────────────────────────────
 LangChain Query Processing Pipeline
───────────────────────────────────────────────────────────
 1. Input Validation & Preprocessing
    • Clean and normalize user input
    • Extract query intent and entities

 2. Context Retrieval (RAG)
    • Query vector store for relevant schema info
    • Retrieve similar historical queries
    • Get table relationships and constraints

 3. Prompt Engineering
    • Combine user query with retrieved context
    • Apply banking domain-specific templates
    • Include schema definitions and examples

 4. LLM Generation (Google Gemini)
    • Generate SQL query based on context
    • Apply banking business rules
    • Ensure query safety and optimization

 5. Post-Processing
    • Validate generated SQL syntax
    • Apply security checks
    • Format for execution
───────────────────────────────────────────────────────────
        │
        ▼
Generated SQL: "SELECT c.first_name, c.last_name,
SUM(t.amount) as total_spending
FROM customers c
JOIN accounts a ON c.customer_id = a.customer_id
JOIN transactions t ON a.account_id = t.account_id
GROUP BY c.customer_id, c.first_name, c.last_name
ORDER BY total_spending DESC LIMIT 10"
───────────────────────────────────────────────────────────
 Vector Store (pgvector)
───────────────────────────────────────────────────────────
 Schema Embeddings
  • Table definitions with descriptions
  • Column metadata and relationships
  • Business rules and constraints

 Query Pattern Embeddings
  • Common banking queries and their SQL
  • User query history and results
  • Domain-specific query templates

 Semantic Search Process
  1. Convert user query to embedding
  2. Find similar vectors in knowledge base
  3. Retrieve top-k relevant contexts
  4. Rank by relevance and recency
───────────────────────────────────────────────────────────
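Steps 1 through 3 of the semantic search process map onto a single pgvector query: embed the question, then order stored vectors by distance. The sketch below is a hedged illustration: the `embeddings` table layout, the `embed()` stand-in, and the psycopg-style connection are assumptions, not this repository's actual code. The `<=>` operator is pgvector's cosine-distance operator.

```python
# Hypothetical pgvector lookup; table and column names are assumptions.
TOP_K_SQL = """
SELECT content
FROM embeddings
ORDER BY embedding <=> %(query_vec)s  -- pgvector cosine-distance operator
LIMIT %(k)s;
"""

def embed(text: str) -> list:
    """Stand-in for the real embedding model call (e.g. Gemini embeddings)."""
    return [0.0, 0.0, 0.0]

def retrieve_context(conn, question: str, k: int = 5) -> list:
    """Embed the question, then pull the k nearest stored context snippets."""
    with conn.cursor() as cur:
        cur.execute(TOP_K_SQL, {"query_vec": embed(question), "k": k})
        return [row[0] for row in cur.fetchall()]
```

Ranking by recency (step 4) would add a second sort key or a re-ranking pass over the returned rows.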
- Natural Language Queries: Ask questions in plain English
- Context-Aware Responses: RAG ensures accurate, data-grounded answers
- Banking Domain Intelligence: Specialized for financial data analysis
- Query Learning: System improves from user interactions
- Direct SQL Interface: For power users and developers
- Real-time Execution: Sub-second query performance
- Multi-database Support: PostgreSQL, Snowflake, BigQuery, Redshift
- Query Optimization: Automatic performance enhancements
- Query Validation: Prevents destructive operations
- Access Control: Role-based permissions
- Audit Logging: Complete query history tracking
- Data Privacy: Sensitive data protection
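The query-validation guard above can start as a simple allow-list check run before execution. A minimal sketch (a real deployment would also use parameterized queries, a proper SQL parser, and role checks; a keyword regex alone can false-positive on string literals):

```python
import re

# Statements the read-only analytics path refuses to run.
FORBIDDEN = re.compile(
    r"\b(DROP|DELETE|TRUNCATE|UPDATE|INSERT|ALTER|GRANT)\b", re.IGNORECASE
)

def is_safe_query(sql: str) -> bool:
    """Allow a single SELECT/WITH statement; reject destructive keywords."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:              # multiple statements smuggled in
        return False
    if FORBIDDEN.search(stripped):   # destructive verb anywhere in the text
        return False
    return stripped.upper().startswith(("SELECT", "WITH"))

assert is_safe_query("SELECT * FROM customers LIMIT 5")
assert not is_safe_query("DROP TABLE customers")
assert not is_safe_query("SELECT 1; DELETE FROM accounts")
```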
The platform includes comprehensive banking data with 250+ customers, 387 accounts, 3,585+ transactions, 141 credit cards, and 100+ loans with realistic financial patterns for demonstration purposes.
- Backend: FastAPI + Python + SQLAlchemy
- AI: LangChain + Google Gemini + pgvector RAG
- Frontend: HTML5/CSS3/JavaScript + Tailwind CSS
- Database: PostgreSQL + Snowflake + BigQuery + Redshift
- Infrastructure: Docker + Redis + Nginx
POST /chat/send_message
├── Natural language processing
├── RAG context retrieval
├── SQL generation via LangChain
└── Result formatting and return

GET /chat/history
└── Retrieve conversation history

POST /sql/execute
├── Direct SQL query execution
├── Query validation and security checks
└── Formatted result return

GET /databases
└── List available database connections

POST /databases/connect
├── Establish new database connection
├── Validate credentials
└── Test connectivity

GET /schema/{database}
├── Retrieve schema information
├── Table and column metadata
└── Relationship mappings

GET /health
├── System health check
├── Database connectivity status
└── AI service availability

GET /metrics
├── Query performance statistics
├── Usage analytics
└── System resource utilization
- Docker and Docker Compose
- 8GB+ RAM recommended
- Google Gemini API key
- Database credentials (PostgreSQL/Snowflake/BigQuery/Redshift)
Create .env file with your configuration:
# AI Configuration
GOOGLE_API_KEY=your_gemini_api_key_here
OPENAI_API_KEY=optional_openai_key
# Database Configuration
DATABASE_URL=postgresql://user:password@localhost:5432/banking_db
SNOWFLAKE_ACCOUNT=your_account
SNOWFLAKE_USER=your_username
SNOWFLAKE_PASSWORD=your_password
SNOWFLAKE_WAREHOUSE=your_warehouse
# Vector Store
VECTOR_DB_URL=postgresql://user:password@localhost:5432/vector_db
# Application Settings
API_HOST=0.0.0.0
API_PORT=8000
FRONTEND_PORT=3000
REDIS_URL=redis://localhost:6379

# Start all services
docker-compose up -d
# View logs
docker-compose logs -f
# Check service status
docker-compose ps

# Execute database initialization scripts
docker exec -it ai-schema-db-1 psql -U postgres -d banking_db -f /docker-entrypoint-initdb.d/banking_schema.sql
docker exec -it ai-schema-db-1 psql -U postgres -d banking_db -f /docker-entrypoint-initdb.d/banking_data.sql

- Demo Interface: http://localhost:9999/banking_demo.html
- API Documentation: http://localhost:8000/docs
- Health Check: http://localhost:8000/health
Customer Analytics
User: "Show me top 10 customers by total spending in 2024"
AI: Generates optimized SQL with proper joins and aggregations
Result: Customer rankings with spending amounts
Risk Analysis
User: "Find customers with credit utilization above 80%"
AI: Calculates utilization ratios across credit products
Result: High-risk customer identification
Fraud Detection
User: "Detect unusual transaction patterns in the last 30 days"
AI: Applies statistical analysis and pattern recognition
Result: Flagged transactions with risk scores
-- Complex analytical query
WITH customer_metrics AS (
SELECT
c.customer_id,
c.first_name || ' ' || c.last_name as full_name,
COUNT(DISTINCT a.account_id) as account_count,
SUM(CASE WHEN t.amount > 0 THEN t.amount ELSE 0 END) as total_credits,
SUM(CASE WHEN t.amount < 0 THEN ABS(t.amount) ELSE 0 END) as total_debits,
AVG(a.balance) as avg_balance
FROM customers c
JOIN accounts a ON c.customer_id = a.customer_id
JOIN transactions t ON a.account_id = t.account_id
WHERE t.transaction_date >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY c.customer_id, c.first_name, c.last_name
)
SELECT
full_name,
account_count,
total_credits,
total_debits,
avg_balance,
(total_credits - total_debits) as net_flow
FROM customer_metrics
ORDER BY net_flow DESC
LIMIT 25;

The system creates embeddings for:
- Database schemas with semantic descriptions
- Historical queries for pattern recognition
- Business rules for contextual understanding
- Domain knowledge for banking-specific insights
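Indexing those items amounts to turning each one into text, embedding it, and storing the vector. The sketch below is illustrative: `describe_table`, `build_index`, and the `embed()` stand-in are assumed helpers, not functions from this repository.

```python
# Illustrative indexing pass: one embedding row per schema object.

def embed(text: str) -> list:
    """Stand-in for a real embedding model call."""
    return [float(len(text)), 0.0]

def describe_table(name: str, columns: list, purpose: str) -> str:
    """Turn schema metadata into a sentence the embedding model can encode."""
    return f"Table {name} ({', '.join(columns)}): {purpose}"

def build_index(tables: dict) -> list:
    """Embed every table description so RAG can retrieve it later."""
    rows = []
    for name, (cols, purpose) in tables.items():
        text = describe_table(name, cols, purpose)
        rows.append({"content": text, "embedding": embed(text)})
    return rows

index = build_index({
    "customers": (["customer_id", "first_name", "last_name"], "bank customers"),
    "loans": (["loan_id", "amount", "status"], "loan originations and status"),
})
```

In the running system each row would be written to the pgvector store; historical queries and business rules get the same text-then-embed treatment.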
# Simplified LangChain pipeline
from langchain_experimental.sql import SQLDatabaseChain
from langchain_google_genai import GoogleGenerativeAI

# Initialize components
llm = GoogleGenerativeAI(model="gemini-pro")
sql_chain = SQLDatabaseChain.from_llm(
    llm=llm,
    db=database,
    verbose=True,
    return_intermediate_steps=True
)

# Process natural language query
response = sql_chain.run("Show high-value customers")

# Database connector factory
class DatabaseConnectorFactory:
    @staticmethod
    def create_connector(db_type: str, config: dict):
        connectors = {
            "postgresql": PostgreSQLConnector,
            "snowflake": SnowflakeConnector,
            "bigquery": BigQueryConnector,
            "redshift": RedshiftConnector
        }
        return connectors[db_type](config)

1. Database Connection Failed
# Check database service
docker-compose logs db
# Verify connection parameters
docker exec -it ai-schema-backend-1 python -c "
from app.database import get_db_connection
print(get_db_connection().execute('SELECT 1'))
"

2. AI Queries Not Working
# Verify API key
echo $GOOGLE_API_KEY
# Check vector store
docker exec -it ai-schema-db-1 psql -U postgres -d vector_db -c "
SELECT count(*) FROM embeddings;
"

3. Frontend Not Loading
# Check frontend service
docker-compose logs frontend
# Verify API connectivity
curl http://localhost:8000/health

- Connection Pooling: Reuse database connections
- Query Caching: Redis-based result caching
- Index Optimization: Automated index recommendations
- Parallel Processing: Concurrent query execution
- Embedding Caching: Store frequently used embeddings
- Model Optimization: Use appropriate model sizes
- Context Limiting: Optimize RAG context windows
- Response Streaming: Real-time response delivery
- Query Sanitization: Prevent SQL injection
- Access Control: Role-based permissions
- Audit Logging: Complete operation tracking
- Data Masking: Sensitive field protection
- Authentication: JWT token validation
- Rate Limiting: Prevent abuse
- HTTPS Only: Encrypted communication
- CORS Configuration: Cross-origin security
- Query response times
- AI model performance
- Database connection health
- Resource utilization
- User engagement patterns
- Query success rates
- Feature adoption
- Performance improvements
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
Your banking dataset (250+ customers, 387 accounts, 3,585+ transactions, 141 credit cards, 100+ loans) is perfect for implementing these ML models:
# Features from your data
- Account balance trends
- Transaction frequency patterns
- Product usage (credit cards, loans)
- Customer demographics
- Days since last transaction
# ML Model
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Example implementation
def predict_churn():
    features = [
        'avg_balance_last_3_months',
        'transaction_count_decline',
        'days_since_last_login',
        'credit_utilization_ratio',
        'number_of_products'
    ]
    # Train model to predict: Will customer leave in next 90 days?
    model = RandomForestClassifier()
    # Result: "85% chance Customer John Smith will churn"

# Features from your banking data
- Payment history from loans table
- Credit utilization from credit_cards
- Account balance stability
- Transaction patterns
- Income estimation from deposits
# ML Implementation
def credit_risk_model():
    sql_query = """
        SELECT
            c.customer_id,
            AVG(a.balance) as avg_balance,
            COUNT(t.transaction_id) as transaction_count,
            SUM(CASE WHEN cc.balance > cc.credit_limit * 0.8 THEN 1 ELSE 0 END) as high_utilization,
            COUNT(CASE WHEN l.status = 'defaulted' THEN 1 END) as past_defaults
        FROM customers c
        JOIN accounts a ON c.customer_id = a.customer_id
        LEFT JOIN transactions t ON a.account_id = t.account_id
        LEFT JOIN credit_cards cc ON c.customer_id = cc.customer_id
        LEFT JOIN loans l ON c.customer_id = l.customer_id
        GROUP BY c.customer_id
    """
    # Output: Risk score 1-1000 for each customer
    # "Customer ID 123: Risk Score 750 (High Risk)"

# Real-time transaction scoring
def fraud_detection_model():
    features = [
        'transaction_amount',
        'time_of_day',
        'merchant_category',
        'location_deviation',
        'amount_vs_historical_avg',
        'velocity_checks'  # multiple transactions in quick succession
    ]
    # Anomaly detection using Isolation Forest
    from sklearn.ensemble import IsolationForest
    # Real-time scoring: "Transaction flagged - 95% fraud probability"
    # Auto-block high-risk transactions

# Predict future revenue per customer
def clv_prediction():
    features = [
        'monthly_avg_balance',
        'fee_generation',
        'product_adoption_rate',
        'transaction_volume',
        'tenure_months'
    ]
    # Regression model to predict:
    # "Customer will generate $2,847 in revenue over next 2 years"

# Using your loans and payments data
def loan_default_model():
    sql_features = """
        SELECT
            l.loan_id,
            l.amount,
            l.interest_rate,
            l.term_months,
            COUNT(p.payment_id) as payments_made,
            AVG(CASE WHEN p.payment_date > p.due_date THEN 1 ELSE 0 END) as late_payment_rate,
            c.age,
            AVG(a.balance) as avg_account_balance
        FROM loans l
        JOIN loan_payments p ON l.loan_id = p.loan_id
        JOIN customers c ON l.customer_id = c.customer_id
        JOIN accounts a ON c.customer_id = a.customer_id
        GROUP BY l.loan_id, l.amount, l.interest_rate, l.term_months, c.age
    """
    # Predict: "15% probability of default in next 6 months"

# Auto-categorize transactions using NLP
def smart_categorization():
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    # Train on your merchant names and descriptions
    features = ['merchant_name', 'transaction_amount', 'time_pattern']
    # Auto-categorize: "STARBUCKS STORE #123" → "Coffee & Dining"
    # Improve your current merchant categories with ML

# Customer behavior clustering
def spending_patterns():
    from sklearn.cluster import KMeans
    features = [
        'grocery_spending_ratio',
        'entertainment_ratio',
        'gas_stations_ratio',
        'online_vs_offline_ratio',
        'weekend_vs_weekday_spending'
    ]
    # Cluster customers into groups:
    # - "Budget Conscious" (low spending, high savings)
    # - "Premium Lifestyle" (high entertainment, dining)
    # - "Family Focused" (groceries, gas, utilities)

# Add ML Pipeline to your existing system
class MLPipeline:
    def __init__(self):
        self.models = {
            'churn': ChurnModel(),
            'fraud': FraudModel(),
            'credit_risk': CreditRiskModel(),
            'clv': CLVModel()
        }

    def get_customer_insights(self, customer_id):
        # Get all predictions for a customer
        insights = {}
        # Fetch customer data from your existing database
        customer_data = self.get_customer_features(customer_id)
        # Run all models
        insights['churn_probability'] = self.models['churn'].predict(customer_data)
        insights['fraud_risk'] = self.models['fraud'].predict(customer_data)
        insights['credit_score'] = self.models['credit_risk'].predict(customer_data)
        insights['lifetime_value'] = self.models['clv'].predict(customer_data)
        return insights

    def real_time_transaction_scoring(self, transaction):
        # Score transactions as they happen
        fraud_score = self.models['fraud'].predict_single(transaction)
        if fraud_score > 0.8:
            return "BLOCK_TRANSACTION"
        elif fraud_score > 0.5:
            return "REQUIRE_VERIFICATION"
        else:
            return "APPROVE"

# Start with these using your existing data
1. Customer Segmentation (K-Means clustering)
2. Basic Fraud Detection (Anomaly detection)
3. Credit Utilization Risk (Simple scoring)
4. Transaction Pattern Analysis

# More sophisticated ML
1. Churn Prediction (Random Forest/XGBoost)
2. Loan Default Prediction (Gradient Boosting)
3. Customer Lifetime Value (Regression)
4. Real-time Fraud Scoring

# Integrate with your existing RAG system
1. Natural Language Model Queries
User: "Show me customers likely to churn"
AI: Runs churn model + generates insights
2. Automated Insights
"15 customers flagged for high churn risk this week"
3. Predictive Recommendations
"Customer John Smith: Recommend personal loan based on spending patterns"

- Fraud Prevention: Save $50K+ annually by catching fraud early
- Credit Risk: Reduce defaults by 20-30% with better scoring
- Compliance: Automated suspicious activity monitoring
- Cross-selling: "Customer likely to need mortgage" → increase sales 15%
- Retention: Identify churn early → save high-value customers
- Pricing: Dynamic pricing based on risk profiles
- Automated Decisions: Instant loan approvals for low-risk customers
- Resource Allocation: Focus on high-value, low-risk customers
- Proactive Support: Reach out before customers have problems
# Add to your existing requirements.txt
scikit-learn==1.3.0
pandas==2.0.3
numpy==1.24.3
xgboost==1.7.6
lightgbm==4.0.0
tensorflow==2.13.0 # For deep learning models
plotly==5.15.0 # For ML visualizations
mlflow==2.5.0       # Model versioning and tracking

Add ML insights to your existing banking_demo.html:
// New ML-powered buttons
<button onclick="getChurnPrediction()">Predict Customer Churn</button>
<button onclick="getFraudAlerts()">Real-time Fraud Detection</button>
<button onclick="getCreditRiskAnalysis()">Credit Risk Analysis</button>
<button onclick="getCustomerSegments()">Customer Segmentation</button>
<button onclick="getLoanDefaultRisk()">Loan Default Predictions</button>
- Advanced visualization components
- Real-time data streaming
- Machine learning model integration ✅
- Multi-tenant architecture
- Advanced security features
- Mobile application
- Enterprise SSO integration
Built with ❤️ for the future of data analytics