MedAudit Diff Watcher

Local-first CSV audit workstation for versioned folder drops: watch, compare, persist, review, and summarize medical data changes with full traceability.


Overview

MedAudit Diff Watcher is a local-first workstation for auditing CSV changes in medical and clinical data workflows. It monitors configured folders, detects new data drops, diffs each drop against the previous version by aligning rows on configurable match keys, persists the audit trail to SQLite, and provides both GUI and CLI interfaces for review and summarization.

Perfect For:

🏥 Medical data quality monitoring
📋 Clinical trial CSV change audits
🔍 Data reconciliation workflows
📊 Regulatory compliance documentation
🎯 Change impact analysis

Why MedAudit?

🔐 Local-First: All data stays on your machine – no cloud syncing
👁️ Intelligent Watching: Auto-detects new/modified CSV files
🔀 Smart Diffing: Match-key based row alignment (not just line diffs)
💾 Persistent Audit Trail: SQLite database of all changes and reviews
🎨 Dual Interface: GUI for interactive review, CLI for automation
🔧 Integration Ready: Beyond Compare/WinMerge launching, optional AI summaries
📈 Visual Reports: HTML/CSV change summaries and statistics

Key Features

Feature	Description
👁️ Folder Watching	Monitor directories for new/modified CSV files in real-time
🔀 Intelligent CSV Diffing	Match rows by key columns (ID, study, subject) – not line-by-line
💾 SQLite Persistence	Audit trail database with full change history
📊 Multiple Reporting Formats	HTML dashboards, CSV exports, summary reports
🎨 Interactive GUI	PySide6-based Qt interface for review and decisions
🖥️ CLI Mode	Batch processing, automation, unattended operation
🔧 Merge Tool Integration	Launch Beyond Compare or WinMerge for manual review
🤖 AI Summaries (Optional)	LLM-powered change summaries using OpenAI/Claude
⚙️ YAML Configuration	Flexible configuration for multiple audit workflows
📈 Change Statistics	Track additions, modifications, deletions by domain

Architecture

graph TB
    A["Monitored Folders<br/>(Folder A, B, C)"] -->|watchdog| B["Folder Watcher"]
    B -->|Detect new/mod| C["CSV Files<br/>(*.csv)"]
    C -->|Load| D["CSV Parser<br/>(pandas)"]
    D --> E["Match-Key<br/>Alignment"]

    E -->|Previous version| F["Diff Engine<br/>(rapidfuzz)"]
    E -->|Current version| F

    F --> G["SQLite DB<br/>(audit_trail.db)"]
    G -->|Store| H["Audit Records<br/>(changes, metadata)"]

    H -->|Query| I{"GUI or<br/>CLI Reports?"}

    I -->|GUI| J["Interactive<br/>GUI Window"]
    I -->|CLI| K["HTML/CSV<br/>Reports"]

    J -->|User Review| L["Approved<br/>or Manual Diff"]
    L -->|Export| M["Summary<br/>Report"]
    K --> M

    M -->|Optional| N["AI Summary<br/>(OpenAI/Claude)"]
    N --> O["Final Report"]

    style A fill:#e3f2fd
    style H fill:#c8e6c9
    style M fill:#fff9c4
    style O fill:#ffccbc

Components

Watcher: Monitors folders using watchdog for file changes
Parser: Reads CSV files with pandas, handles encoding/delimiters
Matcher: Aligns rows using configurable match keys (primary key columns)
Differ: Computes changes (new, modified, deleted rows) using rapidfuzz
Persister: Stores audit trail in SQLite with full metadata
GUI: PySide6 Qt application for interactive review
Reporter: Generates HTML/CSV summaries
AI: Optional integration with LLM for change summaries
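
The Matcher and Differ steps above can be illustrated with a minimal sketch. It uses only the standard library `csv` module rather than pandas or rapidfuzz, and the names `Changes`, `index_rows`, and `diff_by_key` are hypothetical, not part of the MedAudit package. It shows why key-based alignment differs from a line diff: a reordered but unchanged row is not reported.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class Changes:
    added: list = field(default_factory=list)     # rows only in the new file
    modified: list = field(default_factory=list)  # (key, old_row, new_row) tuples
    deleted: list = field(default_factory=list)   # rows only in the old file


def index_rows(text, match_keys):
    """Parse CSV text and index each row by its match-key tuple."""
    reader = csv.DictReader(io.StringIO(text))
    return {tuple(row[k] for k in match_keys): row for row in reader}


def diff_by_key(old_text, new_text, match_keys):
    """Align rows on the match keys, then classify each row."""
    old = index_rows(old_text, match_keys)
    new = index_rows(new_text, match_keys)
    changes = Changes()
    for key, row in new.items():
        if key not in old:
            changes.added.append(row)
        elif row != old[key]:
            changes.modified.append((key, old[key], row))
    changes.deleted = [row for key, row in old.items() if key not in new]
    return changes


OLD = "USUBJID,TEST_CODE,RESULT\nS01,HGB,13.1\nS02,HGB,12.4\nS03,HGB,14.0\n"
# S03 moved to the top (unchanged), S01 corrected, S02 removed, S04 added.
NEW = "USUBJID,TEST_CODE,RESULT\nS03,HGB,14.0\nS01,HGB,13.5\nS04,HGB,11.9\n"

changes = diff_by_key(OLD, NEW, ["USUBJID", "TEST_CODE"])
print(len(changes.added), len(changes.modified), len(changes.deleted))  # 1 1 1
```

A line-by-line diff of the same two files would flag every line as changed because of the reordering.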

Quick Start

Prerequisites

Python 3.11+
Windows 10+ or WSL2 (Windows-first design; Python 3.11 and PySide6 do not run on Windows 7)
git, pip

Installation

# Clone repository
git clone https://github.com/hakupao/MedAudit-Diff-Watcher.git
cd MedAudit-Diff-Watcher

# Create virtual environment
python -m venv venv
venv\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt

# For GUI support
pip install ".[gui]"

# Verify installation
python -m medaudit --version

5-Minute Setup

# 1. Create config directory
mkdir .medaudit
cp examples/config.default.yaml .medaudit/config.yaml

# 2. Edit config with your folder paths
notepad .medaudit/config.yaml
# Set watching.folders, reporting.output_dir, matching.match_keys

# 3. Initialize audit database
python -m medaudit init --config .medaudit/config.yaml

# 4. Launch GUI
python -m medaudit gui

# OR run CLI watcher
python -m medaudit watch --config .medaudit/config.yaml

GUI & CLI Modes

GUI Screenshots

Screenshots: Control tab, Config Form tab, Config YAML tab.

GUI Mode (Interactive Review)

python -m medaudit gui

Features:

📋 List of detected CSV changes
🔀 Side-by-side diff viewer
✅ Approve/Reject UI
🔧 Launch Beyond Compare/WinMerge
📊 Live statistics dashboard
🤖 AI summary integration
💾 Export reports

CLI Mode (Automation)

# Watch folders continuously
python -m medaudit watch --config config.yaml --output reports/

# Batch process existing CSVs
python -m medaudit batch --input data/ --config config.yaml

# Generate report from audit database
python -m medaudit report --db audit_trail.db --format html

# Export audit trail as CSV
python -m medaudit export --db audit_trail.db --output audit_export.csv
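
For readers wrapping these commands in scripts, here is a rough sketch of how the four subcommands and their flags could be declared with `argparse`. The subcommand and flag names are taken from the examples above; which flags are required, the defaults, and the `build_parser` helper are assumptions, not the project's actual entry point.

```python
import argparse


def build_parser():
    """Declare the subcommands shown above (required flags and defaults are assumed)."""
    parser = argparse.ArgumentParser(prog="medaudit")
    sub = parser.add_subparsers(dest="command", required=True)

    watch = sub.add_parser("watch", help="watch folders continuously")
    watch.add_argument("--config", required=True)
    watch.add_argument("--output", default="reports/")

    batch = sub.add_parser("batch", help="process existing CSVs once")
    batch.add_argument("--input", required=True)
    batch.add_argument("--config", required=True)

    report = sub.add_parser("report", help="render a report from the audit database")
    report.add_argument("--db", required=True)
    report.add_argument("--format", choices=["html", "csv", "json"], default="html")

    export = sub.add_parser("export", help="dump the audit trail as CSV")
    export.add_argument("--db", required=True)
    export.add_argument("--output", required=True)
    return parser


args = build_parser().parse_args(["report", "--db", "audit_trail.db", "--format", "html"])
print(args.command, args.db, args.format)
```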

Configuration

config.yaml

# Folder Watching
watching:
  enabled: true
  poll_interval: 5  # seconds
  folders:
    - path: "C:/Data/Medical/Drop1/"
      description: "Lab data drops"
      pattern: "lab_*.csv"
    - path: "C:/Data/Medical/Drop2/"
      description: "AE data drops"
      pattern: "ae_*.csv"

# CSV Parsing
parsing:
  delimiter: ","
  encoding: "utf-8-sig"
  skip_rows: 0
  quote_char: '"'

# Row Matching (diffing strategy)
matching:
  match_keys:
    lab: ["USUBJID", "TEST_DATE", "TEST_CODE"]
    ae: ["USUBJID", "AE_START_DATE"]
  similarity_threshold: 0.85  # 85% match = same row

# SQLite Audit Trail
audit:
  db_path: "./audit_trail.db"
  keep_history: true
  retention_days: 365

# Reporting
reporting:
  output_dir: "./reports/"
  formats:
    - html
    - csv
    - json
  include_statistics: true

# External Tools
external:
  enable_beyond_compare: true
  beyond_compare_path: "C:/Program Files/Beyond Compare 4/BCompare.exe"
  enable_winmerge: true
  winmerge_path: "C:/Program Files/WinMerge/WinMergeU.exe"

# AI Summarization (Optional)
ai:
  enabled: false
  provider: "openai"  # openai, claude
  api_key: "${OPENAI_API_KEY}"
  model: "gpt-4"
  temperature: 0.3
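
Two details of this config benefit from a short illustration: the `${OPENAI_API_KEY}` placeholder and the way a file name selects its match keys. The sketch below works on a plain dict mirroring part of the YAML above, so it needs no YAML library. The helpers `resolve_api_key` and `match_keys_for`, and the rule that the domain is the pattern prefix before the first underscore, are assumptions made for illustration; the tool may resolve these differently.

```python
import fnmatch
import os

# A dict mirroring part of config.yaml above (loading the YAML itself is omitted).
config = {
    "watching": {"folders": [
        {"path": "C:/Data/Medical/Drop1/", "pattern": "lab_*.csv"},
        {"path": "C:/Data/Medical/Drop2/", "pattern": "ae_*.csv"},
    ]},
    "matching": {"match_keys": {
        "lab": ["USUBJID", "TEST_DATE", "TEST_CODE"],
        "ae": ["USUBJID", "AE_START_DATE"],
    }},
    "ai": {"api_key": "${OPENAI_API_KEY}"},
}


def resolve_api_key(raw):
    """Expand ${VAR} from the environment; return None if it stays unresolved."""
    expanded = os.path.expandvars(raw)
    return None if expanded == raw and "$" in raw else expanded


def match_keys_for(filename, cfg):
    """Return the match keys for the first folder pattern the file name matches."""
    for folder in cfg["watching"]["folders"]:
        if fnmatch.fnmatch(filename, folder["pattern"]):
            # Assumption: the domain is the pattern prefix, "lab_*.csv" -> "lab".
            domain = folder["pattern"].split("_", 1)[0]
            return cfg["matching"]["match_keys"].get(domain)
    return None


print(match_keys_for("ae_2024_04_01.csv", config))  # ['USUBJID', 'AE_START_DATE']
print(resolve_api_key(config["ai"]["api_key"]) is not None)  # True only if the variable is set
```

Keeping the key in an environment variable means the config file can be shared or committed without exposing the secret.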

Folder Structure

MedAudit-Diff-Watcher/
├── .medaudit/
│   ├── config.yaml              # Your configuration
│   └── audit_trail.db           # SQLite audit database
├── reports/
│   ├── 2024-04-01_summary.html
│   ├── 2024-04-01_detail.csv
│   └── ...
└── data/
    └── drops/
        ├── lab_2024_04_01.csv
        └── ae_2024_04_01.csv

Usage Examples

Python API

Basic Folder Watching

import threading

from medaudit.watcher import FolderWatcher
from medaudit.config import Config

# Load config
config = Config.from_yaml("config.yaml")

# Initialize watcher
watcher = FolderWatcher(config)

# Start watching (blocks the calling thread)
watcher.watch()

# Or, instead of the blocking call above, watch in a background thread.
# A daemon thread stops when the main program exits, so keep the main thread alive.
thread = threading.Thread(target=watcher.watch, daemon=True)
thread.start()
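
To make the idea behind folder watching and `poll_interval` concrete, here is a minimal polling sketch. It deliberately uses plain directory snapshots from the standard library instead of watchdog, and `snapshot`, `detect_changes`, and `poll` are hypothetical names, not the `FolderWatcher` internals.

```python
import fnmatch
import os
import tempfile
import time


def snapshot(folder, pattern="*.csv"):
    """Map each matching file name to its (mtime_ns, size) signature."""
    result = {}
    with os.scandir(folder) as entries:
        for entry in entries:
            if entry.is_file() and fnmatch.fnmatch(entry.name, pattern):
                info = entry.stat()
                result[entry.name] = (info.st_mtime_ns, info.st_size)
    return result


def detect_changes(before, after):
    """Return (new_files, modified_files) between two snapshots."""
    new = sorted(name for name in after if name not in before)
    modified = sorted(n for n in after if n in before and after[n] != before[n])
    return new, modified


def poll(folder, pattern, interval, on_change, max_polls):
    """Poll `folder` every `interval` seconds and report new or modified files."""
    seen = snapshot(folder, pattern)
    for _ in range(max_polls):
        time.sleep(interval)
        current = snapshot(folder, pattern)
        new, modified = detect_changes(seen, current)
        if new or modified:
            on_change(new, modified)
        seen = current


# Demonstration in a temporary drop folder (no sleeping needed).
with tempfile.TemporaryDirectory() as drop:
    first = os.path.join(drop, "lab_2024_04_01.csv")
    with open(first, "w") as f:
        f.write("USUBJID\nS01\n")
    before = snapshot(drop, "lab_*.csv")

    with open(first, "a") as f:  # modify the existing file
        f.write("S02\n")
    with open(os.path.join(drop, "lab_2024_04_02.csv"), "w") as f:  # new drop
        f.write("USUBJID\nS03\n")

    new_files, modified_files = detect_changes(before, snapshot(drop, "lab_*.csv"))

print(new_files, modified_files)
```

Comparing size as well as modification time helps on filesystems with coarse timestamps.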

CSV Diffing

from medaudit.differ import CSVDiffer
from medaudit.config import Config

config = Config.from_yaml("config.yaml")
differ = CSVDiffer(config)

# Compare two CSV files
changes = differ.diff_files(
    "old_data.csv",
    "new_data.csv",
    match_keys=["USUBJID", "TEST_DATE"]
)

print(f"Added rows: {len(changes.added)}")
print(f"Modified rows: {len(changes.modified)}")
print(f"Deleted rows: {len(changes.deleted)}")

# Access individual changes
for change in changes.modified:
    print(f"Row {change.key}: {change.old_values} → {change.new_values}")

Audit Trail Queries

from medaudit.audit import AuditDatabase

db = AuditDatabase("audit_trail.db")

# Get all changes for a file
changes = db.get_file_changes("lab_2024_04_01.csv")

# Get changes in a date range
recent = db.get_changes_between(
    start_date="2024-04-01",
    end_date="2024-04-05"
)

# Get statistics
stats = db.get_statistics()
print(f"Total files tracked: {stats.file_count}")
print(f"Total changes: {stats.change_count}")
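
For a sense of what such queries could look like underneath, here is a sketch using the standard `sqlite3` module. The `changes` table, its columns, and the helper functions are assumptions for illustration; the real `audit_trail.db` schema is not documented here and may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path such as audit_trail.db to persist
conn.execute("""
    CREATE TABLE changes (
        id INTEGER PRIMARY KEY,
        file_name TEXT NOT NULL,
        change_type TEXT NOT NULL
            CHECK (change_type IN ('added', 'modified', 'deleted')),
        row_key TEXT NOT NULL,
        detected_at TEXT NOT NULL  -- ISO 8601 timestamp
    )
""")
conn.executemany(
    "INSERT INTO changes (file_name, change_type, row_key, detected_at) "
    "VALUES (?, ?, ?, ?)",
    [
        ("lab_2024_04_01.csv", "added", "S04|HGB", "2024-04-01T09:00:00"),
        ("lab_2024_04_01.csv", "modified", "S01|HGB", "2024-04-01T09:00:00"),
        ("ae_2024_04_03.csv", "deleted", "S02|2024-03-30", "2024-04-03T14:30:00"),
        ("ae_2024_04_09.csv", "added", "S05|2024-04-08", "2024-04-09T08:15:00"),
    ],
)


def changes_between(conn, start_date, end_date):
    """Return changes detected from start_date to end_date, both days inclusive."""
    return conn.execute(
        "SELECT file_name, change_type, row_key FROM changes "
        "WHERE date(detected_at) BETWEEN ? AND ? ORDER BY detected_at, id",
        (start_date, end_date),
    ).fetchall()


def statistics(conn):
    """Count distinct files and total recorded changes."""
    file_count, change_count = conn.execute(
        "SELECT COUNT(DISTINCT file_name), COUNT(*) FROM changes"
    ).fetchone()
    return {"file_count": file_count, "change_count": change_count}


recent = changes_between(conn, "2024-04-01", "2024-04-05")
stats = statistics(conn)
print(len(recent), stats)  # 3 {'file_count': 3, 'change_count': 4}
```

Always pass values as `?` parameters rather than formatting them into the SQL string.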

Report Generation

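from medaudit.config import Config
from medaudit.reporter import ReportGenerator

config = Config.from_yaml("config.yaml")
generator = ReportGenerator(config)

# `changes` is the result of CSVDiffer.diff_files() from the CSV Diffing example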

# Generate HTML report
html_report = generator.generate_html(
    changes=changes,
    title="Lab Data Audit Report - April 2024"
)
html_report.save("lab_audit_2024_04.html")

# Generate CSV export
csv_report = generator.generate_csv(changes)
csv_report.save("lab_audit_2024_04.csv")

# With AI summary
if config.ai.enabled:
    summary = generator.generate_ai_summary(
        changes,
        provider=config.ai.provider
    )
    print(f"Summary: {summary}")
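
As a rough picture of what an HTML summary involves, the sketch below renders a list of changes into a small table with the standard library. `render_html_report` and the row layout are hypothetical and unrelated to the actual `ReportGenerator` output. The one point worth copying is escaping every cell, since CSV values can contain characters such as `<`.

```python
import html


def render_html_report(title, rows):
    """Render (change_type, row_key, column, old, new) tuples as an HTML table."""
    header = "<tr><th>Change</th><th>Key</th><th>Column</th><th>Old</th><th>New</th></tr>"
    body = "\n".join(
        "<tr>" + "".join(f"<td>{html.escape(str(cell))}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return (
        '<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
        f"<title>{html.escape(title)}</title></head>\n"
        f"<body><h1>{html.escape(title)}</h1>\n"
        f"<table>\n{header}\n{body}\n</table></body></html>\n"
    )


report = render_html_report(
    "Lab Data Audit Report - April 2024",
    [
        ("modified", "S01|HGB", "RESULT", "13.1", "13.5"),
        ("added", "S04|HGB", "RESULT", "", "<5"),  # "<5" must be escaped
    ],
)
print(report)
```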

Project Structure

📁 Complete Directory Layout

MedAudit-Diff-Watcher/
│
├── 📄 README.md & README_CN.md
├── 📄 pyproject.toml           # Python project configuration
├── 📄 requirements.txt
├── 📄 CHANGELOG.md
│
├── 🔷 medaudit/ (Main package)
│   ├── __main__.py             # CLI entry point
│   ├── __init__.py
│   ├── config.py               # Configuration loading
│   ├── watcher.py              # Folder watching
│   ├── parser.py               # CSV parsing
│   ├── differ.py               # Diffing engine
│   ├── audit.py                # SQLite audit trail
│   ├── reporter.py             # Report generation
│   ├── ai_summarizer.py        # AI integration
│   └── merge_tools.py          # Beyond Compare/WinMerge
│
├── 🎨 gui/
│   ├── __init__.py
│   ├── main_window.py          # Main GUI window
│   ├── diff_viewer.py          # Diff visualization
│   ├── audit_browser.py        # Audit trail browser
│   ├── statistics_panel.py     # Statistics dashboard
│   └── styles.css              # Qt stylesheets
│
├── 📚 docs/
│   ├── QUICK_START.md
│   ├── CLI_REFERENCE.md
│   ├── GUI_GUIDE.md
│   ├── CONFIG_REFERENCE.md
│   ├── TROUBLESHOOTING.md
│   └── assets/
│       ├── gui_screenshot.png
│       └── cli_example.png
│
├── 💾 examples/
│   ├── config.default.yaml
│   ├── sample_data/
│   │   ├── lab_old.csv
│   │   └── lab_new.csv
│   └── README.md
│
└── ✅ tests/
    ├── test_watcher.py
    ├── test_differ.py
    ├── test_audit.py
    └── test_reporter.py

Advanced Features

🔧 Match Key Configuration

Different CSV files may need different match strategies:

matching:
  profiles:
    lab:
      match_keys: ["USUBJID", "LAB_DATE", "LAB_CODE"]
      similarity: 0.95  # Strict: near-exact key match
    ae:
      match_keys: ["USUBJID", "AE_START_DATE"]
      similarity: 0.85  # More lenient
    dm:
      match_keys: ["USUBJID"]
      similarity: 0.99  # Very strict
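
The effect of the similarity setting can be sketched as follows. This example substitutes the standard library's `difflib.SequenceMatcher` for rapidfuzz, so the scores will not be identical to what the tool computes, and `key_similarity` and `pair_keys` are hypothetical helpers. Exact key matches are paired first; leftover keys are paired only if their similarity reaches the threshold.

```python
from difflib import SequenceMatcher


def key_similarity(a, b):
    """Similarity in [0, 1] between two key tuples, compared as joined strings."""
    return SequenceMatcher(None, "|".join(a), "|".join(b)).ratio()


def pair_keys(old_keys, new_keys, threshold):
    """Map each new key to an old key: exact matches first, then fuzzy ones."""
    remaining = list(old_keys)
    pairs = {}
    for new in new_keys:  # pass 1: exact matches
        if new in remaining:
            pairs[new] = new
            remaining.remove(new)
    for new in new_keys:  # pass 2: best fuzzy match at or above the threshold
        if new in pairs or not remaining:
            continue
        best = max(remaining, key=lambda old: key_similarity(old, new))
        if key_similarity(best, new) >= threshold:
            pairs[new] = best
            remaining.remove(best)
    return pairs


old_keys = [("S01", "2024-03-01"), ("S02", "2024-03-05")]
new_keys = [("S01", "2024-03-01"), ("S02", "2024-03-06"), ("S09", "2023-11-20")]

lenient = pair_keys(old_keys, new_keys, threshold=0.85)
strict = pair_keys(old_keys, new_keys, threshold=0.95)
print(len(lenient), len(strict))  # 2 1
```

With the lenient threshold, the row whose date changed by one day is treated as a modification of the old row; with the strict threshold it is reported as one deletion plus one addition. Fuzzy pairings on clinical identifiers are worth flagging for manual review.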

🤖 AI Summarization

Automatic change summaries using LLMs:

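# AISummarizer is assumed to live in medaudit/ai_summarizer.py (see Project Structure)
from medaudit.ai_summarizer import AISummarizer
from medaudit.config import Config

config = Config.from_yaml("config.yaml")
config.ai.enabled = True
config.ai.provider = "openai"
config.ai.model = "gpt-4"

# `changes` is the result of CSVDiffer.diff_files() from the CSV Diffing example
summarizer = AISummarizer(config)
summary = summarizer.summarize_changes(changes)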

print(f"AI Summary: {summary}")
# Output: "3 lab tests added for subject X, 1 result corrected for test Y..."

📊 Statistics & Analytics

from medaudit.audit import AuditDatabase

db = AuditDatabase("audit_trail.db")
stats = db.get_statistics()

print(f"Files tracked: {stats.file_count}")
print(f"Total changes: {stats.change_count}")
print(f"Additions: {stats.additions}")
print(f"Modifications: {stats.modifications}")
print(f"Deletions: {stats.deletions}")
print(f"Last update: {stats.last_updated}")
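
Breaking these counts down by domain, as mentioned under Key Features, could look like the sketch below. The record layout and the rule that the domain is the file name prefix before the first underscore are assumptions for illustration.

```python
from collections import Counter

CHANGE_TYPES = ("added", "modified", "deleted")

# Hypothetical (file_name, change_type) records pulled from the audit trail.
records = [
    ("lab_2024_04_01.csv", "added"),
    ("lab_2024_04_01.csv", "modified"),
    ("lab_2024_04_02.csv", "modified"),
    ("ae_2024_04_01.csv", "deleted"),
]


def domain_of(file_name):
    """Assumption: 'lab_2024_04_01.csv' belongs to the 'lab' domain."""
    return file_name.split("_", 1)[0]


def stats_by_domain(records):
    """Count additions, modifications, and deletions per domain."""
    counts = Counter((domain_of(name), kind) for name, kind in records)
    domains = sorted({domain for domain, _ in counts})
    return {d: {kind: counts[(d, kind)] for kind in CHANGE_TYPES} for d in domains}


by_domain = stats_by_domain(records)
for domain, totals in by_domain.items():
    print(domain, totals)
```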

🔐 Data Privacy & Security

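Local-First: All data remains on your machine by default
No Cloud Syncing: No data is transmitted unless the optional AI summaries are enabled, which call the configured provider's API (OpenAI/Claude)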
Encrypted Fields: Optional field-level encryption
Audit Immutability: Once recorded, audit entries cannot be modified
Access Logging: Who accessed what, when
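
One common way to get append-only behavior in SQLite is a pair of triggers that reject `UPDATE` and `DELETE`. The sketch below shows that technique on a hypothetical `audit_entries` table; whether MedAudit enforces immutability this way is an assumption. Note that triggers only guard against changes made through SQL: anyone with write access to the database file can still drop the triggers or replace the file.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE audit_entries (
        id INTEGER PRIMARY KEY,
        file_name TEXT NOT NULL,
        change_type TEXT NOT NULL,
        recorded_at TEXT NOT NULL DEFAULT (datetime('now'))
    );

    -- Reject any attempt to rewrite or remove recorded entries.
    CREATE TRIGGER audit_no_update BEFORE UPDATE ON audit_entries
    BEGIN
        SELECT RAISE(ABORT, 'audit entries are append-only');
    END;

    CREATE TRIGGER audit_no_delete BEFORE DELETE ON audit_entries
    BEGIN
        SELECT RAISE(ABORT, 'audit entries are append-only');
    END;
""")

conn.execute(
    "INSERT INTO audit_entries (file_name, change_type) VALUES (?, ?)",
    ("lab_2024_04_01.csv", "modified"),
)


def try_tamper(sql):
    """Run a statement and return the error message if SQLite rejects it."""
    try:
        conn.execute(sql)
    except sqlite3.DatabaseError as exc:
        return str(exc)
    return None


update_error = try_tamper("UPDATE audit_entries SET change_type = 'added'")
delete_error = try_tamper("DELETE FROM audit_entries")
row_count = conn.execute("SELECT COUNT(*) FROM audit_entries").fetchone()[0]
print(update_error, delete_error, row_count)
```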

Dependencies

pandas>=2.0           # CSV parsing & diffing
PyYAML>=6.0          # Configuration
watchdog>=3.0        # Folder monitoring
rapidfuzz>=2.0       # Fuzzy string matching
PySide6>=6.5         # GUI (optional)
openai>=1.0          # AI summaries (optional)
anthropic>=0.7       # Claude API (optional)

Install:

pip install -r requirements.txt
pip install ".[gui]"        # For GUI
pip install ".[ai]"         # For AI summaries

Troubleshooting

Common Issues

Issue	Solution
Folder not being watched	Check path in config.yaml, ensure folder exists
CSV encoding errors	Change `encoding` in config (try utf-8-sig, latin1)
GUI won't start	Install PySide6: `pip install ".[gui]"`
No matches found in diff	Adjust `similarity_threshold` in config
AI summaries not working	Set API key, check provider config
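
The encoding advice above can be tried outside the tool with a small fallback reader. This is a standard-library sketch, not the project's parser, and `read_csv_with_fallback` is a hypothetical helper. It tries `utf-8-sig` first (which also strips a byte order mark) and falls back to `latin1`.

```python
import csv
import io


def read_csv_with_fallback(raw_bytes, encodings=("utf-8-sig", "latin1")):
    """Decode with the first encoding that succeeds; return (encoding, rows)."""
    for encoding in encodings:
        try:
            text = raw_bytes.decode(encoding)
        except UnicodeDecodeError:
            continue
        rows = list(csv.DictReader(io.StringIO(text, newline="")))
        return encoding, rows
    raise ValueError(f"could not decode CSV with any of {encodings}")


utf8_with_bom = "\ufeffUSUBJID,SITE\nS01,Zürich\n".encode("utf-8")
legacy_latin1 = "USUBJID,SITE\nS01,Zürich\n".encode("latin1")

enc_a, rows_a = read_csv_with_fallback(utf8_with_bom)
enc_b, rows_b = read_csv_with_fallback(legacy_latin1)
print(enc_a, rows_a[0]["SITE"])
print(enc_b, rows_b[0]["SITE"])
```

Because `latin1` accepts any byte sequence, it should come last in the list; a file that decodes only through the fallback may still show wrong characters if its real encoding is something else.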

Contributing

Contributions welcome! See CONTRIBUTING.md.

# Development setup
git clone https://github.com/hakupao/MedAudit-Diff-Watcher.git
cd MedAudit-Diff-Watcher
pip install -e ".[dev]"
pytest tests/

Roadmap

Web dashboard for remote access
Direct database source support (SQL audit)
Automated email notifications
Machine learning anomaly detection
Integration with EHR systems
Blockchain-based immutable audit trail

License

Citation

@software{medaudit2024,
  author = {hakupao},
  title = {MedAudit Diff Watcher: Local-First CSV Audit Workstation},
  url = {https://github.com/hakupao/MedAudit-Diff-Watcher},
  year = {2024}
}

Support

📧 Issues: GitHub Issues
📖 Docs: docs/
💬 Discussions: GitHub Discussions


Audit with confidence, locally
