MedAudit Diff Watcher


Local-first CSV audit workstation for versioned folder drops: watch, compare, persist, review, and summarize medical data changes with full traceability.

Features • Architecture • Quick Start • GUI & CLI • Configuration


Overview

MedAudit Diff Watcher is a local-first workstation for auditing CSV changes in medical and clinical data workflows. It monitors configured folders, detects new data drops, diffs each drop against the previous version by aligning rows on configurable match keys, persists the audit trail to SQLite, and provides both GUI and CLI interfaces for review and summarization.

Perfect For:

  • πŸ₯ Medical data quality monitoring
  • πŸ“‹ Clinical trial CSV change audits
  • πŸ” Data reconciliation workflows
  • πŸ“Š Regulatory compliance documentation
  • 🎯 Change impact analysis

Why MedAudit?

  • πŸ” Local-First: All data stays on your machine – no cloud syncing
  • πŸ‘οΈ Intelligent Watching: Auto-detects new/modified CSV files
  • πŸ”€ Smart Diffing: Match-key based row alignment (not just line diffs)
  • πŸ’Ύ Persistent Audit Trail: SQLite database of all changes and reviews
  • 🎨 Dual Interface: GUI for interactive review, CLI for automation
  • πŸ”§ Integration Ready: Beyond Compare/WinMerge launching, optional AI summaries
  • πŸ“ˆ Visual Reports: HTML/CSV change summaries and statistics

Key Features

| Feature | Description |
|---|---|
| 👁️ Folder Watching | Monitor directories for new/modified CSV files in real time |
| 🔀 Intelligent CSV Diffing | Match rows by key columns (ID, study, subject), not line by line |
| 💾 SQLite Persistence | Audit trail database with full change history |
| 📊 Multiple Reporting Formats | HTML dashboards, CSV exports, summary reports |
| 🎨 Interactive GUI | PySide6-based Qt interface for review and decisions |
| 🖥️ CLI Mode | Batch processing, automation, unattended operation |
| 🔧 Merge Tool Integration | Launch Beyond Compare or WinMerge for manual review |
| 🤖 AI Summaries (Optional) | LLM-powered change summaries using OpenAI/Claude |
| ⚙️ YAML Configuration | Flexible configuration for multiple audit workflows |
| 📈 Change Statistics | Track additions, modifications, deletions by domain |

Architecture

graph TB
 A[" Monitored Folders<br/>(Folder A, B, C)"] -->|watchdog| B[" Folder Watcher"]
 B -->|Detect new/mod| C[" CSV Files<br/>(*.csv)"]
 C -->|Load| D[" CSV Parser<br/>(pandas)"]
 D --> E[" Match-Key<br/>Alignment"]

 E -->|Previous version| F[" Diff Engine<br/>(rapidfuzz)"]
 E -->|Current version| F

 F --> G[" SQLite DB<br/>(audit_trail.db)"]
 G -->|Store| H[" Audit Records<br/>(changes, metadata)"]

 H -->|Query| I{" GUI or<br/> CLI Reports?"}

 I -->|GUI| J[" Interactive<br/>GUI Window"]
 I -->|CLI| K[" HTML/CSV<br/>Reports"]

 J -->|User Review| L[" Approved<br/>or Manual Diff"]
 L -->|Export| M[" Summary<br/>Report"]
 K --> M

 M -->|Optional| N[" AI Summary<br/>(OpenAI/Claude)"]
 N --> O[" Final Report"]

 style A fill:#e3f2fd
 style H fill:#c8e6c9
 style M fill:#fff9c4
 style O fill:#ffccbc

Components

  • Watcher: Monitors folders using watchdog for file changes
  • Parser: Reads CSV files with pandas, handles encoding/delimiters
  • Matcher: Aligns rows using configurable match keys (primary key columns)
  • Differ: Computes changes (new, modified, deleted rows) using rapidfuzz
  • Persister: Stores audit trail in SQLite with full metadata
  • GUI: PySide6 Qt application for interactive review
  • Reporter: Generates HTML/CSV summaries
  • AI: Optional integration with LLM for change summaries
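
The Matcher and Differ steps above can be illustrated with a minimal, dependency-free sketch. It uses the standard library `csv` module and plain dictionaries in place of pandas and rapidfuzz, and the function names `index_rows` and `diff_rows` are hypothetical, not part of the medaudit package. Rows are aligned by the tuple of their match-key values, so a reordered file does not produce spurious changes.

```python
import csv
import io


def index_rows(csv_text, match_keys):
    """Index CSV rows by the tuple of their match-key values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {tuple(row[k] for k in match_keys): row for row in reader}


def diff_rows(old_text, new_text, match_keys):
    """Classify rows as added, deleted, or modified by comparing keyed rows."""
    old = index_rows(old_text, match_keys)
    new = index_rows(new_text, match_keys)
    added = [new[k] for k in new if k not in old]
    deleted = [old[k] for k in old if k not in new]
    modified = [(old[k], new[k]) for k in old if k in new and old[k] != new[k]]
    return added, deleted, modified


# Illustrative data only; column names follow the config example.
OLD = "USUBJID,TEST_CODE,RESULT\nS01,HGB,13.2\nS02,HGB,11.8\n"
NEW = "USUBJID,TEST_CODE,RESULT\nS02,HGB,12.1\nS03,HGB,14.0\n"

added, deleted, modified = diff_rows(OLD, NEW, ["USUBJID", "TEST_CODE"])
print(len(added), len(deleted), len(modified))
```

A plain line diff of these two inputs would flag every data line, whereas the keyed comparison reports one added, one deleted, and one modified row.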

Quick Start

Prerequisites

  • Python 3.11+
  • Windows 8.1+ or WSL2 (Windows-first design; Python 3.11 does not support Windows 7)
  • git, pip

Installation

# Clone repository
git clone https://github.com/hakupao/MedAudit-Diff-Watcher.git
cd MedAudit-Diff-Watcher

# Create virtual environment
python -m venv venv
venv\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt

# For GUI support
pip install ".[gui]"

# Verify installation
python -m medaudit --version

5-Minute Setup

# 1. Create config directory
mkdir .medaudit
cp examples/config.default.yaml .medaudit/config.yaml

# 2. Edit config with your folder paths
notepad .medaudit/config.yaml
# Set watching.folders, reporting.output_dir, and matching.match_keys

# 3. Initialize audit database
python -m medaudit init --config .medaudit/config.yaml

# 4. Launch GUI
python -m medaudit gui

# OR run CLI watcher
python -m medaudit watch --config .medaudit/config.yaml

GUI & CLI Modes

GUI Screenshots

[Screenshots: Control tab, Config Form tab, Config YAML tab]

GUI Mode (Interactive Review)

python -m medaudit gui

Features:

  • 📋 List of detected CSV changes
  • 🔀 Side-by-side diff viewer
  • ✅ Approve/Reject UI
  • 🔧 Launch Beyond Compare/WinMerge
  • 📊 Live statistics dashboard
  • 🤖 AI summary integration
  • 💾 Export reports

CLI Mode (Automation)

# Watch folders continuously
python -m medaudit watch --config config.yaml --output reports/

# Batch process existing CSVs
python -m medaudit batch --input data/ --config config.yaml

# Generate report from audit database
python -m medaudit report --db audit_trail.db --format html

# Export audit trail as CSV
python -m medaudit export --db audit_trail.db --output audit_export.csv

Configuration

config.yaml

# Folder Watching
watching:
  enabled: true
  poll_interval: 5  # seconds
  folders:
    - path: "C:/Data/Medical/Drop1/"
      description: "Lab data drops"
      pattern: "lab_*.csv"
    - path: "C:/Data/Medical/Drop2/"
      description: "AE data drops"
      pattern: "ae_*.csv"

# CSV Parsing
parsing:
  delimiter: ","
  encoding: "utf-8-sig"
  skip_rows: 0
  quote_char: '"'

# Row Matching (diffing strategy)
matching:
  match_keys:
    lab: ["USUBJID", "TEST_DATE", "TEST_CODE"]
    ae: ["USUBJID", "AE_START_DATE"]
  similarity_threshold: 0.85  # 85% match = same row

# SQLite Audit Trail
audit:
  db_path: "./audit_trail.db"
  keep_history: true
  retention_days: 365

# Reporting
reporting:
  output_dir: "./reports/"
  formats:
    - html
    - csv
    - json
  include_statistics: true

# External Tools
external:
  enable_beyond_compare: true
  beyond_compare_path: "C:/Program Files/Beyond Compare 4/BCompare.exe"
  enable_winmerge: true
  winmerge_path: "C:/Program Files/WinMerge/WinMergeU.exe"

# AI Summarization (Optional)
ai:
  enabled: false
  provider: "openai"  # openai, claude
  api_key: "${OPENAI_API_KEY}"
  model: "gpt-4"
  temperature: 0.3
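
The `${OPENAI_API_KEY}` placeholder above suggests that the key is read from the environment and not stored in the file. How medaudit resolves the placeholder is not shown in this README. One plausible approach, sketched below with the hypothetical helper `expand_env`, is to walk the parsed configuration and expand placeholders with `os.path.expandvars`.

```python
import os


def expand_env(value):
    """Recursively expand $VAR / ${VAR} placeholders in a parsed config."""
    if isinstance(value, str):
        return os.path.expandvars(value)
    if isinstance(value, dict):
        return {k: expand_env(v) for k, v in value.items()}
    if isinstance(value, list):
        return [expand_env(v) for v in value]
    return value


# Stand-in for the dict that a YAML loader would return for the `ai` section.
raw = {"ai": {"enabled": False, "provider": "openai", "api_key": "${OPENAI_API_KEY}"}}

os.environ["OPENAI_API_KEY"] = "sk-example"  # demo value only, not a real key
config = expand_env(raw)
print(config["ai"]["api_key"])
```

`os.path.expandvars` leaves references to unset variables unchanged, so a loader built this way should check for a leftover `${` before using the value.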

Folder Structure

MedAudit-Diff-Watcher/
├── .medaudit/
│   ├── config.yaml              # Your configuration
│   └── audit_trail.db           # SQLite audit database
├── reports/
│   ├── 2024-04-01_summary.html
│   ├── 2024-04-01_detail.csv
│   └── ...
└── data/
    └── drops/
        ├── lab_2024_04_01.csv
        └── ae_2024_04_01.csv

Usage Examples

Python API

Basic Folder Watching

from medaudit.watcher import FolderWatcher
from medaudit.config import Config

# Load config
config = Config.from_yaml("config.yaml")

# Initialize watcher
watcher = FolderWatcher(config)

# Start watching (blocks the current thread)
watcher.watch()

# Alternatively, run the watcher in a background daemon thread
import threading
thread = threading.Thread(target=watcher.watch, daemon=True)
thread.start()
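
To make the watching step more concrete, here is a rough polling sketch built only on the standard library. It is not how `FolderWatcher` is implemented (the project lists watchdog for that). `snapshot` and `detect_changes` are hypothetical names, and the glob pattern mirrors the `pattern` setting in config.yaml.

```python
import fnmatch
import os
import tempfile


def snapshot(folder, pattern):
    """Map each matching file name to its (mtime_ns, size) signature."""
    result = {}
    with os.scandir(folder) as entries:
        for entry in entries:
            if entry.is_file() and fnmatch.fnmatch(entry.name, pattern):
                stat = entry.stat()
                result[entry.name] = (stat.st_mtime_ns, stat.st_size)
    return result


def detect_changes(before, after):
    """Return sorted names of new files and of files whose signature changed."""
    new = sorted(name for name in after if name not in before)
    modified = sorted(n for n in after if n in before and after[n] != before[n])
    return new, modified


with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "lab_old.csv"), "w") as f:
        f.write("USUBJID\nS01\n")
    before = snapshot(folder, "lab_*.csv")

    # Simulate a drop: one new lab file, one grown file, one non-matching file.
    with open(os.path.join(folder, "lab_new.csv"), "w") as f:
        f.write("USUBJID\nS02\n")
    with open(os.path.join(folder, "lab_old.csv"), "a") as f:
        f.write("S03\n")
    with open(os.path.join(folder, "ae_new.csv"), "w") as f:
        f.write("USUBJID\n")

    new_files, modified_files = detect_changes(before, snapshot(folder, "lab_*.csv"))

print(new_files, modified_files)
```

A long-running watcher would take a snapshot, sleep for `poll_interval` seconds, take another, and pass any detected files to the diff step.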

CSV Diffing

from medaudit.differ import CSVDiffer
from medaudit.config import Config

config = Config.from_yaml("config.yaml")
differ = CSVDiffer(config)

# Compare two CSV files
changes = differ.diff_files(
    "old_data.csv",
    "new_data.csv",
    match_keys=["USUBJID", "TEST_DATE"]
)

print(f"Added rows: {len(changes.added)}")
print(f"Modified rows: {len(changes.modified)}")
print(f"Deleted rows: {len(changes.deleted)}")

# Access individual changes
for change in changes.modified:
    print(f"Row {change.key}: {change.old_values} → {change.new_values}")

Audit Trail Queries

from medaudit.audit import AuditDatabase

db = AuditDatabase("audit_trail.db")

# Get all changes for a file
changes = db.get_file_changes("lab_2024_04_01.csv")

# Get changes in a date range
recent = db.get_changes_between(
    start_date="2024-04-01",
    end_date="2024-04-05"
)

# Get statistics
stats = db.get_statistics()
print(f"Total files tracked: {stats.file_count}")
print(f"Total changes: {stats.change_count}")
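
The schema behind `AuditDatabase` is not documented in this README. As an illustration of what the equivalent queries could look like, the sketch below uses the standard library `sqlite3` module with a hypothetical `changes` table. The table and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for a persistent audit trail
conn.execute(
    """
    CREATE TABLE changes (
        id          INTEGER PRIMARY KEY,
        file_name   TEXT NOT NULL,
        row_key     TEXT NOT NULL,
        change_type TEXT NOT NULL
                    CHECK (change_type IN ('added', 'modified', 'deleted')),
        detected_at TEXT NOT NULL  -- ISO 8601 date, so string comparison sorts correctly
    )
    """
)
conn.executemany(
    "INSERT INTO changes (file_name, row_key, change_type, detected_at) "
    "VALUES (?, ?, ?, ?)",
    [
        ("lab_2024_04_01.csv", "S01|HGB", "added", "2024-04-01"),
        ("lab_2024_04_01.csv", "S02|HGB", "modified", "2024-04-03"),
        ("ae_2024_04_01.csv", "S01|2024-03-30", "deleted", "2024-04-07"),
    ],
)

# Changes in an inclusive date range, analogous to get_changes_between().
recent = conn.execute(
    "SELECT row_key FROM changes WHERE detected_at BETWEEN ? AND ? ORDER BY id",
    ("2024-04-01", "2024-04-05"),
).fetchall()

# Per-type counts, analogous to get_statistics().
counts = dict(
    conn.execute("SELECT change_type, COUNT(*) FROM changes GROUP BY change_type")
)
print(recent, counts)
```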

Report Generation

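from medaudit.config import Config
from medaudit.reporter import ReportGenerator

config = Config.from_yaml("config.yaml")
generator = ReportGenerator(config)

# `changes` is the result of CSVDiffer.diff_files() from the CSV Diffing example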

# Generate HTML report
html_report = generator.generate_html(
    changes=changes,
    title="Lab Data Audit Report - April 2024"
)
html_report.save("lab_audit_2024_04.html")

# Generate CSV export
csv_report = generator.generate_csv(changes)
csv_report.save("lab_audit_2024_04.csv")

# With AI summary
if config.ai.enabled:
    summary = generator.generate_ai_summary(
        changes,
        provider=config.ai.provider
    )
    print(f"Summary: {summary}")

Project Structure

πŸ“ Complete Directory Layout
MedAudit-Diff-Watcher/
β”‚
β”œβ”€β”€ πŸ“„ README.md & README_CN.md
β”œβ”€β”€ πŸ“„ pyproject.toml           # Python project configuration
β”œβ”€β”€ πŸ“„ requirements.txt
β”œβ”€β”€ πŸ“„ CHANGELOG.md
β”‚
β”œβ”€β”€ πŸ”· medaudit/ (Main package)
β”‚   β”œβ”€β”€ __main__.py             # CLI entry point
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ config.py               # Configuration loading
β”‚   β”œβ”€β”€ watcher.py              # Folder watching
β”‚   β”œβ”€β”€ parser.py               # CSV parsing
β”‚   β”œβ”€β”€ differ.py               # Diffing engine
β”‚   β”œβ”€β”€ audit.py                # SQLite audit trail
β”‚   β”œβ”€β”€ reporter.py             # Report generation
β”‚   β”œβ”€β”€ ai_summarizer.py        # AI integration
β”‚   └── merge_tools.py          # Beyond Compare/WinMerge
β”‚
β”œβ”€β”€ 🎨 gui/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ main_window.py          # Main GUI window
β”‚   β”œβ”€β”€ diff_viewer.py          # Diff visualization
β”‚   β”œβ”€β”€ audit_browser.py        # Audit trail browser
β”‚   β”œβ”€β”€ statistics_panel.py     # Statistics dashboard
β”‚   └── styles.css              # Qt stylesheets
β”‚
β”œβ”€β”€ πŸ“š docs/
β”‚   β”œβ”€β”€ QUICK_START.md
β”‚   β”œβ”€β”€ CLI_REFERENCE.md
β”‚   β”œβ”€β”€ GUI_GUIDE.md
β”‚   β”œβ”€β”€ CONFIG_REFERENCE.md
β”‚   β”œβ”€β”€ TROUBLESHOOTING.md
β”‚   └── assets/
β”‚       β”œβ”€β”€ gui_screenshot.png
β”‚       └── cli_example.png
β”‚
β”œβ”€β”€ πŸ’Ύ examples/
β”‚   β”œβ”€β”€ config.default.yaml
β”‚   β”œβ”€β”€ sample_data/
β”‚   β”‚   β”œβ”€β”€ lab_old.csv
β”‚   β”‚   └── lab_new.csv
β”‚   └── README.md
β”‚
└── βœ… tests/
    β”œβ”€β”€ test_watcher.py
    β”œβ”€β”€ test_differ.py
    β”œβ”€β”€ test_audit.py
    └── test_reporter.py

Advanced Features

🔧 Match Key Configuration

Different CSV files may need different match strategies:

matching:
  profiles:
    lab:
      match_keys: ["USUBJID", "LAB_DATE", "LAB_CODE"]
      similarity: 0.95  # Strict: near-exact key match
    ae:
      match_keys: ["USUBJID", "AE_START_DATE"]
      similarity: 0.85  # More lenient
    dm:
      match_keys: ["USUBJID"]
      similarity: 0.99  # Very strict
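
To show how a `similarity` setting might act on rows whose keys do not match exactly, here is a sketch that uses `difflib.SequenceMatcher` from the standard library as a stand-in for rapidfuzz. The helper `best_match` and the way key columns are joined into one string are assumptions, not medaudit's actual algorithm.

```python
from difflib import SequenceMatcher


def best_match(key, candidates, threshold):
    """Return the candidate most similar to `key`, or None if below the threshold."""
    scored = [(SequenceMatcher(None, key, c).ratio(), c) for c in candidates]
    score, candidate = max(scored, default=(0.0, None))
    return candidate if score >= threshold else None


# Illustrative keys: subject ID and date joined with "|".
old_keys = ["SUBJ-001|2024-04-01", "SUBJ-002|2024-04-01"]

# A key with a one-character typo still aligns under a lenient threshold...
lenient = best_match("SUBJ-0O1|2024-04-01", old_keys, threshold=0.85)
# ...but is treated as an unmatched (new) row under a very strict one.
strict = best_match("SUBJ-0O1|2024-04-01", old_keys, threshold=0.99)
print(lenient, strict)
```

A lower threshold tolerates typos in key columns at the cost of occasionally pairing rows that are really distinct, which is why identifier-only domains such as `dm` are given the strictest value above.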

🤖 AI Summarization

Automatic change summaries using LLMs:

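from medaudit.ai_summarizer import AISummarizer  # module listed under Project Structure
from medaudit.config import Config

config = Config.from_yaml("config.yaml")
config.ai.enabled = True
config.ai.provider = "openai"
config.ai.model = "gpt-4"

# `changes` is the result of CSVDiffer.diff_files() (see CSV Diffing)
summarizer = AISummarizer(config)
summary = summarizer.summarize_changes(changes)

print(f"AI Summary: {summary}")
# Example output: "3 lab tests added for subject X, 1 result corrected for test Y..."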

📊 Statistics & Analytics

from medaudit.audit import AuditDatabase

db = AuditDatabase("audit_trail.db")
stats = db.get_statistics()

print(f"Files tracked: {stats.file_count}")
print(f"Total changes: {stats.change_count}")
print(f"Additions: {stats.additions}")
print(f"Modifications: {stats.modifications}")
print(f"Deletions: {stats.deletions}")
print(f"Last update: {stats.last_updated}")
πŸ” Data Privacy & Security
  • Local-First: All data remains on your machine
  • No Cloud Syncing: No data transmission
  • Encrypted Fields: Optional field-level encryption
  • Audit Immutability: Once recorded, audit entries cannot be modified
  • Access Logging: Who accessed what, when
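
The README does not say how audit immutability is enforced. One common way to get append-only behaviour in SQLite is a pair of triggers that abort any UPDATE or DELETE. The sketch below assumes a hypothetical `audit_entries` table and is an illustration, not the project's implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE audit_entries (
        id      INTEGER PRIMARY KEY,
        message TEXT NOT NULL
    );

    -- Reject any attempt to change or remove an existing entry.
    CREATE TRIGGER audit_no_update BEFORE UPDATE ON audit_entries
    BEGIN
        SELECT RAISE(ABORT, 'audit entries are immutable');
    END;

    CREATE TRIGGER audit_no_delete BEFORE DELETE ON audit_entries
    BEGIN
        SELECT RAISE(ABORT, 'audit entries are immutable');
    END;
    """
)
conn.execute("INSERT INTO audit_entries (message) VALUES ('row S02 modified')")

blocked = []
for statement in (
    "UPDATE audit_entries SET message = 'tampered'",
    "DELETE FROM audit_entries",
):
    try:
        conn.execute(statement)
    except sqlite3.DatabaseError as exc:
        blocked.append(str(exc))  # the trigger aborted the statement

remaining = conn.execute("SELECT message FROM audit_entries").fetchall()
print(len(blocked), remaining)
```

Triggers only guard against changes made through SQL on this schema. Anyone who can write to the database file can still drop them, so file permissions and backups remain part of the picture.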

Dependencies

pandas>=2.0           # CSV parsing & diffing
PyYAML>=6.0          # Configuration
watchdog>=3.0        # Folder monitoring
rapidfuzz>=2.0       # Fuzzy string matching
PySide6>=6.5         # GUI (optional)
openai>=1.0          # AI summaries (optional)
anthropic>=0.7       # Claude API (optional)

Install:

pip install -r requirements.txt
pip install ".[gui]"        # For GUI
pip install ".[ai]"         # For AI summaries

Troubleshooting

Common Issues

| Issue | Solution |
|---|---|
| Folder not being watched | Check path in config.yaml, ensure folder exists |
| CSV encoding errors | Change encoding in config (try utf-8-sig, latin1) |
| GUI won't start | Install PySide6: pip install ".[gui]" |
| No matches found in diff | Adjust similarity_threshold in config |
| AI summaries not working | Set API key, check provider config |

Contributing

Contributions welcome! See CONTRIBUTING.md.

# Development setup
git clone https://github.com/hakupao/MedAudit-Diff-Watcher.git
cd MedAudit-Diff-Watcher
pip install -e ".[dev]"
pytest tests/

Roadmap

  • Web dashboard for remote access
  • Direct database source support (SQL audit)
  • Automated email notifications
  • Machine learning anomaly detection
  • Integration with EHR systems
  • Blockchain-based immutable audit trail

License

MIT License © 2024 hakupao


Citation

@software{medaudit2024,
  author = {hakupao},
  title = {MedAudit Diff Watcher: Local-First CSV Audit Workstation},
  url = {https://github.com/hakupao/MedAudit-Diff-Watcher},
  year = {2024}
}

Audit with confidence, locally
