Merged
29 changes: 29 additions & 0 deletions .dockerignore
@@ -0,0 +1,29 @@
venv
__pycache__
*.pyc
*.pyo
*.pyd
.Python
*.so
*.egg
*.egg-info
dist
build
.git
.gitignore
.env
.envrc
.vscode
.idea
*.swp
*.swo
*~
.DS_Store
Potential\ Datasets
Reference\ papers
test
.pytest_cache
*.log
*.pkl
*.pickle
node_modules
62 changes: 62 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,62 @@
name: Build and Test

on:
  push:
    branches:
      - main
      - develop
      - master
  pull_request:
    branches:
      - main
      - develop
      - master

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.12'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt

      - name: Lint with flake8 (optional)
        continue-on-error: true
        run: |
          pip install flake8
          flake8 src --count --select=E9,F63,F7,F82 --show-source --statistics
          flake8 src --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics

      - name: Test API import
        run: |
          python -c "from src.api.app import app; print('✓ API imports successfully')"

  build:
    needs: test
    runs-on: ubuntu-latest
    if: github.event_name == 'push'
    steps:
      - uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Build Docker image
        uses: docker/build-push-action@v5
        with:
          context: .
          push: false
          tags: lexguard:latest
          cache-from: type=gha
          cache-to: type=gha,mode=max

      - name: Log build success
        run: echo "✓ Docker image built successfully"
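
Review note: the `Test API import` step in this workflow only verifies that `src.api.app` imports and exposes a name called `app`. A minimal sketch of the same check as a reusable Python function is shown below. `check_import` is a hypothetical helper, not part of this PR, and the sketch only assumes that the module and attribute names are passed in by the caller.

```python
import importlib


def check_import(module_name: str, attr: str) -> bool:
    """Return True if `module_name` imports cleanly and exposes `attr`."""
    try:
        module = importlib.import_module(module_name)
    except Exception as exc:  # import-time errors of any kind count as failure
        print(f"import of {module_name} failed: {exc!r}")
        return False
    return hasattr(module, attr)


# In CI this would presumably be called as check_import("src.api.app", "app").
# A standard-library module is used here so the sketch runs anywhere.
result = check_import("json", "dumps")
```

Wrapping the check in a function makes it easy to move into a test file later without changing the workflow step.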
81 changes: 81 additions & 0 deletions .gitignore
@@ -0,0 +1,81 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
env/
venv/
ENV/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# Virtual environments
.venv
pip-log.txt
pip-delete-this-directory.txt

# IDE
.vscode/
.idea/
*.swp
*.swo
*~
.DS_Store

# Environment
.env
.env.local
.env.*.local

# Logs
*.log
logs/

# Data/artifacts
*.pkl
*.pickle
*.joblib
*.h5
*.pb
.cache/

# IDE Pycharm
.idea/

# Jupyter
.ipynb_checkpoints/
*.ipynb

# Test coverage
.coverage
.pytest_cache/
htmlcov/

# Build artifacts
dist/
build/
*.egg-info/

# OS
.DS_Store
Thumbs.db

# Project specific
/data/preds.json
/data/preds.json.partial
/tmp/
node_modules/
37 changes: 37 additions & 0 deletions Dockerfile
@@ -0,0 +1,37 @@
# Use Python 3.10 (more stable than 3.12 for complex packages)
FROM python:3.10-slim

# Set working directory
WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    curl \
    git \
    && rm -rf /var/lib/apt/lists/*

# Upgrade pip first
RUN pip install --upgrade pip setuptools wheel

# Copy requirements
COPY requirements.txt .

# Install PyTorch first (pre-built wheel, avoid compilation)
RUN pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

# Install other Python dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy entire project
COPY . .

# Expose port
EXPOSE 8000

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=15s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

# Run the app
CMD ["python", "-m", "src.api.app"]
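
Review note: the `HEALTHCHECK` above passes only if `GET /health` on port 8000 returns a status below 400, because `curl -f` treats HTTP errors as failures. The PR does not show `src/api/app.py`, so the framework it uses is unknown. The sketch below uses only the Python standard library to illustrate what such an endpoint needs to do; `health_response`, `HealthHandler`, and `serve` are hypothetical names, and the `{"status": "ok"}` body is an assumption.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def health_response(path: str) -> tuple[int, bytes]:
    """Map a request path to a (status code, JSON body) pair."""
    if path == "/health":
        return 200, json.dumps({"status": "ok"}).encode("utf-8")
    return 404, json.dumps({"error": "not found"}).encode("utf-8")


class HealthHandler(BaseHTTPRequestHandler):
    """Answer GET requests using health_response()."""

    def do_GET(self) -> None:
        status, body = health_response(self.path)
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block and serve requests on all interfaces, so the published
    port is reachable from outside the container."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling `serve(8000)` would start the server. Keeping the routing logic in a pure function means it can be checked without opening a socket.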
97 changes: 12 additions & 85 deletions README.md
@@ -1,94 +1,21 @@
# LexGuard
# Policy Compliance Verification System (Offline RAG-LLM Web App)
# PolicyLens: Explainable Graph-based RAG Compliance Engine

## Problem Statement
## Understand policies. Detect risks. Explain decisions.

We aim to develop an **offline, LLM-powered web application** for automated policy compliance verification across heterogeneous contexts, such as **GDPR-related regulations between countries, contractual clauses in internship agreements versus institutional policy documents, or HR guidelines (e.g., leave policies) against organizational rulebooks**.

Since the documents involved are often **large, sensitive, and security-critical**, they cannot be shared with external online LLM services. Moreover, their size may **exceed the native context window** of modern language models, necessitating the integration of a **retrieval-augmented generation (RAG) pipeline**.

In this setup, policy documents would be ingested into a **vector database** (or equivalent retrieval layer), enabling efficient semantic search to dynamically retrieve only the most relevant segments for context construction during queries. A crucial challenge lies in **domain-aware vectorization**, where embeddings must be generated with respect to the compliance-checking objectives rather than generic semantic similarity.

The system should be designed as a **modular, API-driven architecture**, where components (e.g., embedding service, retrieval engine, reasoning agent, compliance evaluator) remain **loosely coupled** to allow easy substitution of LLMs or AI agents without disrupting the overall workflow.
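
The retrieval step described in this problem statement (embed the chunks, then fetch only the most relevant ones for a query) can be sketched in a few lines. This is a toy illustration under stated assumptions: word counts stand in for a real embedding model, an in-memory list stands in for the vector database, and `embed`, `cosine`, and `retrieve` are hypothetical names that do not come from the project.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


chunks = [
    "Employees may take 20 days of paid leave per year.",
    "Personal data must be erased on request.",
    "Interns receive a monthly stipend.",
]
best = retrieve("How many days of paid leave?", chunks, top_k=1)
```

Because `embed` is the only function that knows how text becomes a vector, swapping in a domain-aware embedding model would leave `retrieve` unchanged, which matches the loose coupling the statement asks for.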

---

## Team Structure & Responsibilities

### **Student A — Data Collection, Curation & Governance**
- Acquire GDPR texts, institutional policies, HR manuals, contracts, etc.
- Redaction, de-duplication, versioning, and schema design.
- Build a labeled dataset for evaluation.
- Deliverables: `datasets/`, schema/ontology, annotation guidelines, data card.

### **Student B — Ingestion, Chunking, Embedding & Retrieval**
- Implement document parsers (PDF/DOCX/HTML).
- Domain-aware chunking + embeddings.
- Setup vector database + retrieval pipeline (semantic + keyword hybrid).
- Deliverables: Ingestion service, vector DB, retrieval evaluation report.

### **Student C — Reasoning, Compliance Engine & Evaluation**
- Design decision schema (status, evidence, rationale, confidence).
- Develop compliance assessment engine (prompting + rule library).
- Build evaluation harness with precision/recall, evidence alignment metrics.
- Deliverables: Compliance engine API, evaluation reports, error analysis.
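
As a rough illustration of the decision schema and metrics listed for Student C, the sketch below defines a record with the four named fields and a precision/recall helper. The class name `ComplianceDecision`, the function `precision_recall`, and the choice of `noncompliant` as the positive class are assumptions made for this example; the status values are the three labels used elsewhere in this README.

```python
from dataclasses import dataclass, field


@dataclass
class ComplianceDecision:
    status: str                                        # "compliant", "noncompliant", or "uncertain"
    evidence: list[str] = field(default_factory=list)  # supporting text spans
    rationale: str = ""                                # short explanation of the decision
    confidence: float = 0.0                            # assumed to lie in [0, 1]


def precision_recall(predicted: list[str], gold: list[str],
                     positive: str = "noncompliant") -> tuple[float, float]:
    """Precision and recall for one label, treated as the positive class."""
    pairs = list(zip(predicted, gold))
    tp = sum(1 for p, g in pairs if p == positive and g == positive)
    fp = sum(1 for p, g in pairs if p == positive and g != positive)
    fn = sum(1 for p, g in pairs if p != positive and g == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


decisions = [
    ComplianceDecision("noncompliant", ["clause 4.2"], "retention period missing", 0.8),
    ComplianceDecision("compliant"),
    ComplianceDecision("noncompliant"),
    ComplianceDecision("uncertain"),
]
gold = ["noncompliant", "noncompliant", "compliant", "uncertain"]
scores = precision_recall([d.status for d in decisions], gold)
```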

### **Student D — Offline Web App, APIs & Deployment**
- Build offline web UI (upload, search, compare, assess).
- Develop API gateway for modular services.
- Package everything in Docker Compose for offline deployment.
- Deliverables: Web UI, REST APIs, deployment scripts, observability dashboards.

---
## Notion Page
https://rust-mandolin-74e.notion.site/LexGuard-25ca338f8c5b80aca495c55c3bdc8ea2?pvs=74

---
## Potential datasets for testing the pipeline

https://stanfordnlp.github.io/contract-nli/

The ContractNLI dataset contains contracts (NDAs), fixed hypotheses (requirements), and human annotations that say whether each requirement is entailed (compliant), contradicted (noncompliant), or not mentioned (uncertain), along with the evidence spans in the contract text.

I converted the raw JSON format into a structured CSV/Excel file where each row contains:
--- reference_clause → the hypothesis text (requirement)
--- target_clause → the evidence span from the contract
---compliance → one of compliant, noncompliant, or uncertain
---source_file → original contract filename
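
A minimal sketch of the conversion described above follows. It assumes each input record has already been flattened to one hypothesis/evidence pair; the input keys `hypothesis`, `evidence`, `label`, and `file_name` and the label strings in `LABEL_MAP` are assumptions for this example and are not claimed to match the raw ContractNLI JSON layout. Only the four output columns and the three compliance values come from the description.

```python
import csv
import io

# Assumed source labels mapped to the compliance values described above.
LABEL_MAP = {
    "Entailment": "compliant",
    "Contradiction": "noncompliant",
    "NotMentioned": "uncertain",
}
FIELDS = ["reference_clause", "target_clause", "compliance", "source_file"]


def to_rows(records: list[dict]) -> list[dict]:
    """Convert flattened annotation records into CSV-ready rows."""
    return [
        {
            "reference_clause": rec["hypothesis"],
            "target_clause": rec.get("evidence", ""),  # empty when no span is given
            "compliance": LABEL_MAP[rec["label"]],
            "source_file": rec["file_name"],
        }
        for rec in records
    ]


def to_csv(rows: list[dict]) -> str:
    """Serialize rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


records = [
    {"hypothesis": "Receiving Party shall not disclose Confidential Information.",
     "evidence": "The Recipient agrees to keep all information confidential.",
     "label": "Entailment", "file_name": "nda_001.pdf"},
    {"hypothesis": "Receiving Party may retain copies.",
     "label": "NotMentioned", "file_name": "nda_001.pdf"},
]
rows = to_rows(records)
csv_text = to_csv(rows)
```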

### Why this is useful

Our final system will take two documents (a reference requirements sheet and a target contract) and produce a compliance report.

The ContractNLI-based CSV acts as a testbed for this pipeline because:

Each row is a mini version of our task (requirement vs contract clause → compliance label).

The compliance labels are ground truth, so we can check if our pipeline makes the right predictions.
## 🚀 Overview
LexGuard is an AI-powered compliance analysis system that uses a hybrid Graph + RAG pipeline to evaluate policy documents and detect violations with explainability.

---
## Project Milestones

### **Week 1–3 — Foundations**
Identified references:
[28/08/25, 11:19:15 AM] Vinu: https://www.meity.gov.in/static/uploads/2024/06/2bf1f0e9f04e6fb4f8fef35e82c42aa5.pdf
[28/08/25, 11:19:30 AM] Vinu: https://aclanthology.org/2025.coling-main.178.pdf

Discuss the above paper and expand on the ideas—similar to the negation documents we discussed, what other strategies could be explored to improve the system’s accuracy? Identify and discuss relevant datasets in this context, and provide a concrete example based on the Digital Personal Data Protection Act, 2023.

### **Week 4–6 — RAG & Engine v1**


### **Week 7–8 — **


### **Week 9–11 — **
## 🔥 Key Features

- 📄 PDF Upload & Chunk-based Analysis
- 🧠 Graph-based Retrieval (Eventic + Static Graphs)
- 🔍 Explainable Outputs (Hits, Positive & Negative Nodes)
- ⚡ LLM-based Reasoning (OpenAI / Local)
- 🎛 Tunable RAG Parameters (lambda, hop_k, etc.)
- 📊 Interactive UI Dashboard

---
### Participation

- Subham ||
- Trupti |
- Divyam |
- Owais |
## 🧠 Architecture