Lomesh2000 · Apr 11, 2026 · Apr 9, 2026
diff --git a/.dockerignore b/.dockerignore
@@ -0,0 +1,29 @@
+venv
+__pycache__
+*.pyc
+*.pyo
+*.pyd
+.Python
+*.so
+*.egg
+*.egg-info
+dist
+build
+.git
+.gitignore
+.env
+.envrc
+.vscode
+.idea
+*.swp
+*.swo
+*~
+.DS_Store
+Potential\ Datasets
+Reference\ papers
+test
+.pytest_cache
+*.log
+*.pkl
+*.pickle
+node_modules
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -0,0 +1,62 @@
+name: Build and Test
+
+on:
+  push:
+    branches:
+      - main
+      - develop
+      - master
+  pull_request:
+    branches:
+      - main
+      - develop
+      - master
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: '3.10'
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -r requirements.txt
+
+      - name: Lint with flake8 (optional)
+        continue-on-error: true
+        run: |
+          pip install flake8
+          flake8 src --count --select=E9,F63,F7,F82 --show-source --statistics
+          flake8 src --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
+
+      - name: Test API import
+        run: |
+          python -c "from src.api.app import app; print('✓ API imports successfully')"
+
+  build:
+    needs: test
+    runs-on: ubuntu-latest
+    if: github.event_name == 'push'
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Docker Buildx
+        uses: docker/setup-buildx-action@v3
+
+      - name: Build Docker image
+        uses: docker/build-push-action@v5
+        with:
+          context: .
+          push: false
+          tags: lexguard:latest
+          cache-from: type=gha
+          cache-to: type=gha,mode=max
+
+      - name: Log build success
+        run: echo "✓ Docker image built successfully"
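Review note on the `Test API import` step in `.github/workflows/ci.yml`: the one-liner only proves that `src.api.app` imports and exposes `app`. A minimal sketch of the same smoke check as a reusable function is shown here. `check_import` is a hypothetical helper that does not exist in the repository, and the stdlib module used in the demo is only a stand-in so the sketch runs without the project installed.

```python
import importlib


def check_import(module_name: str, attribute: str) -> bool:
    """Return True if `module_name` imports cleanly and exposes `attribute`."""
    try:
        module = importlib.import_module(module_name)
    except Exception as exc:  # any import-time error counts as a failure
        print(f"import of {module_name} failed: {exc!r}")
        return False
    return hasattr(module, attribute)


if __name__ == "__main__":
    # In CI the call would be check_import("src.api.app", "app").
    # "json"/"dumps" is a placeholder that works on any Python install.
    print(check_import("json", "dumps"))
```

Returning a boolean instead of raising lets a test runner report a missing attribute and a failed import in the same way, which a bare `python -c` one-liner cannot distinguish.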
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,81 @@
+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+env/
+venv/
+ENV/
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+
+# Virtual environments
+.venv
+pip-log.txt
+pip-delete-this-directory.txt
+
+# IDE
+.vscode/
+.idea/
+*.swp
+*.swo
+*~
+.DS_Store
+
+# Environment
+.env
+.env.local
+.env.*.local
+
+# Logs
+*.log
+logs/
+
+# Data/artifacts
+*.pkl
+*.pickle
+*.joblib
+*.h5
+*.pb
+.cache/
+
+# IDE Pycharm
+.idea/
+
+# Jupyter
+.ipynb_checkpoints/
+*.ipynb
+
+# Test coverage
+.coverage
+.pytest_cache/
+htmlcov/
+
+# Build artifacts
+dist/
+build/
+*.egg-info/
+
+# OS
+.DS_Store
+Thumbs.db
+
+# Project specific
+/data/preds.json
+/data/preds.json.partial
+/tmp/
+node_modules/
diff --git a/Dockerfile b/Dockerfile
@@ -0,0 +1,37 @@
+# Use Python 3.10 (more stable than 3.12 for complex packages)
+FROM python:3.10-slim
+
+# Set working directory
+WORKDIR /app
+
+# Install system dependencies
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    build-essential \
+    curl \
+    git \
+    && rm -rf /var/lib/apt/lists/*
+
+# Upgrade pip first
+RUN pip install --no-cache-dir --upgrade pip setuptools wheel
+
+# Copy requirements
+COPY requirements.txt .
+
+# Install CPU-only PyTorch wheels first (pre-built, no compilation, smaller than the default builds)
+RUN pip install --no-cache-dir torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
+
+# Install other Python dependencies
+RUN pip install --no-cache-dir -r requirements.txt
+
+# Copy entire project
+COPY . .
+
+# Expose port
+EXPOSE 8000
+
+# Health check
+HEALTHCHECK --interval=30s --timeout=10s --start-period=15s --retries=3 \
+    CMD curl -f http://localhost:8000/health || exit 1
+
+# Run the app
+CMD ["python", "-m", "src.api.app"]
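Review note on the Dockerfile `HEALTHCHECK`: `curl -f` exits non-zero on any HTTP status of 400 or above, so the container is only reported healthy if the app answers `GET /health` on port 8000 with a success status. The diff does not show `src/api/app.py`, so the sketch below is an assumption about what such an endpoint needs to do, written with the standard library only. `HealthHandler`, `probe`, and `run_probe_once` are hypothetical names, and the JSON body is illustrative.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET /health with 200 and a small JSON body; 404 otherwise."""

    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # silence per-request logging in this sketch


def probe(port: int) -> int:
    """Do what the HEALTHCHECK does: request /health and return the status."""
    url = f"http://127.0.0.1:{port}/health"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.status


def run_probe_once() -> int:
    """Start the handler on a free port, probe it once, then shut down."""
    server = HTTPServer(("127.0.0.1", 0), HealthHandler)  # port 0 = any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        return probe(server.server_address[1])
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_probe_once())
```

Whatever framework the real app uses, the route should stay cheap and avoid loading models, since the check runs every 30 seconds with a 10 second timeout.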
diff --git a/README.md b/README.md
@@ -1,94 +1,21 @@
-# LexGuard
-# Policy Compliance Verification System (Offline RAG-LLM Web App)
+# PolicyLens: Explainable Graph-based RAG Compliance Engine
 
-## Problem Statement
+## Understand policies. Detect risks. Explain decisions.
 
-We aim to develop an **offline, LLM-powered web application** for automated policy compliance verification across heterogeneous contexts, such as **GDPR-related regulations between countries, contractual clauses in internship agreements versus institutional policy documents, or HR guidelines (e.g., leave policies) against organizational rulebooks**.  
-
-Since the documents involved are often **large, sensitive, and security-critical**, they cannot be shared with external online LLM services. Moreover, their size may **exceed the native context window** of modern language models, necessitating the integration of a **retrieval-augmented generation (RAG) pipeline**.  
-
-In this setup, policy documents would be ingested into a **vector database** (or equivalent retrieval layer), enabling efficient semantic search to dynamically retrieve only the most relevant segments for context construction during queries. A crucial challenge lies in **domain-aware vectorization**, where embeddings must be generated with respect to the compliance-checking objectives rather than generic semantic similarity.  
-
-The system should be designed as a **modular, API-driven architecture**, where components (e.g., embedding service, retrieval engine, reasoning agent, compliance evaluator) remain **loosely coupled** to allow easy substitution of LLMs or AI agents without disrupting the overall workflow.
-
----
-
-## Team Structure & Responsibilities
-
-### **Student A — Data Collection, Curation & Governance**
-- Acquire GDPR texts, institutional policies, HR manuals, contracts, etc.
-- Redaction, de-duplication, versioning, and schema design.
-- Build a labeled dataset for evaluation.
-- Deliverables: `datasets/`, schema/ontology, annotation guidelines, data card.
-
-### **Student B — Ingestion, Chunking, Embedding & Retrieval**
-- Implement document parsers (PDF/DOCX/HTML).
-- Domain-aware chunking + embeddings.
-- Setup vector database + retrieval pipeline (semantic + keyword hybrid).
-- Deliverables: Ingestion service, vector DB, retrieval evaluation report.
-
-### **Student C — Reasoning, Compliance Engine & Evaluation**
-- Design decision schema (status, evidence, rationale, confidence).
-- Develop compliance assessment engine (prompting + rule library).
-- Build evaluation harness with precision/recall, evidence alignment metrics.
-- Deliverables: Compliance engine API, evaluation reports, error analysis.
-
-### **Student D — Offline Web App, APIs & Deployment**
-- Build offline web UI (upload, search, compare, assess).
-- Develop API gateway for modular services.
-- Package everything in Docker Compose for offline deployment.
-- Deliverables: Web UI, REST APIs, deployment scripts, observability dashboards.
-
----
-## Notion Page
-https://rust-mandolin-74e.notion.site/LexGuard-25ca338f8c5b80aca495c55c3bdc8ea2?pvs=74
-
----
-## Potential datasets for testing the pipeline
-
-https://stanfordnlp.github.io/contract-nli/
-
-The ContractNLI dataset contains contracts (NDAs), fixed hypotheses (requirements), and human annotations that say whether each requirement is entailed (compliant), contradicted (noncompliant), or not mentioned (uncertain), along with the evidence spans in the contract text.
-
-I converted the raw JSON format into a structured CSV/Excel file where each row contains:
---- reference_clause → the hypothesis text (requirement)
---- target_clause → the evidence span from the contract
----compliance → one of compliant, noncompliant, or uncertain
----source_file → original contract filename
-
-### Why this is useful
-
-Our final system will take two documents (a reference requirements sheet and a target contract) and produce a compliance report.
-
-The ContractNLI-based CSV acts as a testbed for this pipeline because:
-
-Each row is a mini version of our task (requirement vs contract clause → compliance label).
-
-The compliance labels are ground truth, so we can check if our pipeline makes the right predictions.
+## 🚀 Overview
+PolicyLens is an AI-powered compliance analysis system that uses a hybrid Graph + RAG pipeline to evaluate policy documents and detect violations, with an explanation for each decision.
 
 ---
-## Project Milestones
-
-### **Week 1–3 — Foundations**
-Identified references: 
-[28/08/25, 11:19:15 AM] Vinu: https://www.meity.gov.in/static/uploads/2024/06/2bf1f0e9f04e6fb4f8fef35e82c42aa5.pdf
-[28/08/25, 11:19:30 AM] Vinu: https://aclanthology.org/2025.coling-main.178.pdf
-
-Discuss the above paper and expand on the ideas—similar to the negation documents we discussed, what other strategies could be explored to improve the system’s accuracy? Identify and discuss relevant datasets in this context, and provide a concrete example based on the Digital Personal Data Protection Act, 2023.
-
-### **Week 4–6 — RAG & Engine v1**
-
-
-### **Week 7–8 — **
-
 
-### **Week 9–11 — **
+## 🔥 Key Features
 
+- 📄 PDF Upload & Chunk-based Analysis
+- 🧠 Graph-based Retrieval (Eventic + Static Graphs)
+- 🔍 Explainable Outputs (Hits, Positive & Negative Nodes)
+- ⚡ LLM-based Reasoning (OpenAI / Local)
+- 🎛 Tunable RAG Parameters (lambda, hop_k, etc.)
+- 📊 Interactive UI Dashboard
 
 ---
-### Participation
 
-- Subham  ||
-- Trupti  |
-- Divyam  | 
-- Owais   |
+## 🧠 Architecture