Local-first CSV audit workstation for versioned folder drops: watch, compare, persist, review, and summarize medical data changes with full traceability.
Features • Architecture • Quick Start • GUI & CLI • Configuration
MedAudit Diff Watcher is a local-first workstation for auditing CSV changes in medical and clinical data workflows. It monitors configured folders, detects data drops, computes row-level diffs using configurable match keys, persists audit trails to SQLite, and provides both GUI and CLI interfaces for review and summarization.
Perfect For:
- Medical data quality monitoring
- Clinical trial CSV change audits
- Data reconciliation workflows
- Regulatory compliance documentation
- Change impact analysis
Why MedAudit?
- Local-First: All data stays on your machine, with no cloud syncing
- Intelligent Watching: Auto-detects new/modified CSV files
- Smart Diffing: Match-key based row alignment (not just line diffs)
- Persistent Audit Trail: SQLite database of all changes and reviews
- Dual Interface: GUI for interactive review, CLI for automation
- Integration Ready: Beyond Compare/WinMerge launching, optional AI summaries
- Visual Reports: HTML/CSV change summaries and statistics
| Feature | Description |
|---|---|
| Folder Watching | Monitor directories for new/modified CSV files in real-time |
| Intelligent CSV Diffing | Match rows by key columns (ID, study, subject), not line-by-line |
| SQLite Persistence | Audit trail database with full change history |
| Multiple Reporting Formats | HTML dashboards, CSV exports, summary reports |
| Interactive GUI | PySide6-based Qt interface for review and decisions |
| CLI Mode | Batch processing, automation, unattended operation |
| Merge Tool Integration | Launch Beyond Compare or WinMerge for manual review |
| AI Summaries (Optional) | LLM-powered change summaries using OpenAI/Claude |
| YAML Configuration | Flexible configuration for multiple audit workflows |
| Change Statistics | Track additions, modifications, deletions by domain |
graph TB
A[" Monitored Folders<br/>(Folder A, B, C)"] -->|watchdog| B[" Folder Watcher"]
B -->|Detect new/mod| C[" CSV Files<br/>(*.csv)"]
C -->|Load| D[" CSV Parser<br/>(pandas)"]
D --> E[" Match-Key<br/>Alignment"]
E -->|Previous version| F[" Diff Engine<br/>(rapidfuzz)"]
E -->|Current version| F
F --> G[" SQLite DB<br/>(audit_trail.db)"]
G -->|Store| H[" Audit Records<br/>(changes, metadata)"]
H -->|Query| I{" GUI or<br/> CLI Reports?"}
I -->|GUI| J[" Interactive<br/>GUI Window"]
I -->|CLI| K[" HTML/CSV<br/>Reports"]
J -->|User Review| L[" Approved<br/>or Manual Diff"]
L -->|Export| M[" Summary<br/>Report"]
K --> M
M -->|Optional| N[" AI Summary<br/>(OpenAI/Claude)"]
N --> O[" Final Report"]
style A fill:#e3f2fd
style H fill:#c8e6c9
style M fill:#fff9c4
style O fill:#ffccbc
- Watcher: Monitors folders using `watchdog` for file changes
- Parser: Reads CSV files with pandas, handles encoding/delimiters
- Matcher: Aligns rows using configurable match keys (primary key columns)
- Differ: Computes changes (new, modified, deleted rows) using rapidfuzz
- Persister: Stores audit trail in SQLite with full metadata
- GUI: PySide6 Qt application for interactive review
- Reporter: Generates HTML/CSV summaries
- AI: Optional integration with LLM for change summaries
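The Matcher and Differ steps above can be illustrated with a small standard-library sketch. This is not MedAudit's implementation (which the list says uses pandas and rapidfuzz); `load_keyed_rows` and `diff_by_keys` are hypothetical helper names, and the sample data is made up. It only shows the idea of indexing rows by their match-key values and classifying them as added, deleted, or modified.

```python
import csv
import io


def load_keyed_rows(csv_text, match_keys):
    """Index CSV rows by a tuple of their match-key column values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {tuple(row[k] for k in match_keys): row for row in reader}


def diff_by_keys(old_text, new_text, match_keys):
    """Classify rows as added, deleted, or modified by comparing keyed rows."""
    old_rows = load_keyed_rows(old_text, match_keys)
    new_rows = load_keyed_rows(new_text, match_keys)
    added = [new_rows[k] for k in new_rows if k not in old_rows]
    deleted = [old_rows[k] for k in old_rows if k not in new_rows]
    modified = [
        (k, old_rows[k], new_rows[k])
        for k in old_rows
        if k in new_rows and old_rows[k] != new_rows[k]
    ]
    return added, deleted, modified


# Illustrative data only: S01 disappears, S02 changes, S03 is new.
OLD = "USUBJID,TEST_CODE,RESULT\nS01,GLU,5.1\nS02,GLU,6.0\n"
NEW = "USUBJID,TEST_CODE,RESULT\nS02,GLU,6.4\nS03,GLU,4.9\n"

added, deleted, modified = diff_by_keys(OLD, NEW, ["USUBJID", "TEST_CODE"])
print(len(added), len(deleted), len(modified))
```

Because rows are aligned by key rather than by position, reordering the rows of a file would not register as a change in this sketch.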
- Python 3.11+
- Windows 8.1+ (Python 3.11 does not run on Windows 7) or WSL2 (Windows-first design)
- git, pip
# Clone repository
git clone https://github.com/hakupao/MedAudit-Diff-Watcher.git
cd MedAudit-Diff-Watcher
# Create virtual environment
python -m venv venv
venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
# For GUI support
pip install ".[gui]"
# Verify installation
python -m medaudit --version

# 1. Create config directory
mkdir .medaudit
cp examples/config.default.yaml .medaudit/config.yaml
# 2. Edit config with your folder paths
notepad .medaudit/config.yaml
# Set watched_folders, output_dir, match_keys
# 3. Initialize audit database
python -m medaudit init --config .medaudit/config.yaml
# 4. Launch GUI
python -m medaudit gui
# OR run CLI watcher
python -m medaudit watch --config .medaudit/config.yaml

[Screenshots: Control Tab, Config Form Tab, Config YAML Tab]
python -m medaudit gui

Features:
- List of detected CSV changes
- Side-by-side diff viewer
- Approve/Reject UI
- Launch Beyond Compare/WinMerge
- Live statistics dashboard
- AI summary integration
- Export reports
# Watch folders continuously
python -m medaudit watch --config config.yaml --output reports/
# Batch process existing CSVs
python -m medaudit batch --input data/ --config config.yaml
# Generate report from audit database
python -m medaudit report --db audit_trail.db --format html
# Export audit trail as CSV
python -m medaudit export --db audit_trail.db --output audit_export.csv

# Folder Watching
watching:
  enabled: true
  poll_interval: 5  # seconds
  folders:
    - path: "C:/Data/Medical/Drop1/"
      description: "Lab data drops"
      pattern: "lab_*.csv"
    - path: "C:/Data/Medical/Drop2/"
      description: "AE data drops"
      pattern: "ae_*.csv"

# CSV Parsing
parsing:
  delimiter: ","
  encoding: "utf-8-sig"
  skip_rows: 0
  quote_char: '"'

# Row Matching (diffing strategy)
matching:
  match_keys:
    lab: ["USUBJID", "TEST_DATE", "TEST_CODE"]
    ae: ["USUBJID", "AE_START_DATE"]
  similarity_threshold: 0.85  # 85% match = same row

# SQLite Audit Trail
audit:
  db_path: "./audit_trail.db"
  keep_history: true
  retention_days: 365

# Reporting
reporting:
  output_dir: "./reports/"
  formats:
    - html
    - csv
    - json
  include_statistics: true

# External Tools
external:
  enable_beyond_compare: true
  beyond_compare_path: "C:/Program Files/Beyond Compare 4/BCompare.exe"
  enable_winmerge: true
  winmerge_path: "C:/Program Files/WinMerge/WinMergeU.exe"

# AI Summarization (Optional)
ai:
  enabled: false
  provider: "openai"  # openai, claude
  api_key: "${OPENAI_API_KEY}"
  model: "gpt-4"
  temperature: 0.3

MedAudit-Diff-Watcher/
├── .medaudit/
│   ├── config.yaml        # Your configuration
│   └── audit_trail.db     # SQLite audit database
├── reports/
│   ├── 2024-04-01_summary.html
│   ├── 2024-04-01_detail.csv
│   └── ...
└── data/
    └── drops/
        ├── lab_2024_04_01.csv
        └── ae_2024_04_01.csv
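The configuration pairs each watched folder with a glob `pattern` (`lab_*.csv`, `ae_*.csv`) and lists match keys per domain (`lab`, `ae`). One plausible way to connect a dropped file such as `lab_2024_04_01.csv` to its match keys is sketched below with the standard-library `fnmatch` module. The `DOMAIN_PATTERNS` and `MATCH_KEYS` dictionaries and the `match_keys_for` function are hypothetical names for illustration; how MedAudit actually resolves a file to a domain is not stated here.

```python
from fnmatch import fnmatch

# Hypothetical in-memory mirror of the `pattern` and `match_keys` config entries.
DOMAIN_PATTERNS = {"lab": "lab_*.csv", "ae": "ae_*.csv"}
MATCH_KEYS = {
    "lab": ["USUBJID", "TEST_DATE", "TEST_CODE"],
    "ae": ["USUBJID", "AE_START_DATE"],
}


def match_keys_for(filename):
    """Return (domain, match_keys) for the first matching pattern, else None."""
    for domain, pattern in DOMAIN_PATTERNS.items():
        # Lower-case the name so matching behaves the same on every platform.
        if fnmatch(filename.lower(), pattern):
            return domain, MATCH_KEYS[domain]
    return None


print(match_keys_for("lab_2024_04_01.csv"))
print(match_keys_for("notes.txt"))
```

Files that match no pattern return `None` here, so a caller can skip them instead of diffing with the wrong keys.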
from medaudit.watcher import FolderWatcher
from medaudit.config import Config
# Load config
config = Config.from_yaml("config.yaml")
# Initialize watcher
watcher = FolderWatcher(config)
# Start watching (blocking)
watcher.watch()
# Or watch in background
import threading
thread = threading.Thread(target=watcher.watch)
thread.daemon = True
thread.start()

from medaudit.differ import CSVDiffer
from medaudit.config import Config
config = Config.from_yaml("config.yaml")
differ = CSVDiffer(config)
# Compare two CSV files
changes = differ.diff_files(
    "old_data.csv",
    "new_data.csv",
    match_keys=["USUBJID", "TEST_DATE"]
)
print(f"Added rows: {len(changes.added)}")
print(f"Modified rows: {len(changes.modified)}")
print(f"Deleted rows: {len(changes.deleted)}")
# Access individual changes
for change in changes.modified:
    print(f"Row {change.key}: {change.old_values} -> {change.new_values}")

from medaudit.audit import AuditDatabase
db = AuditDatabase("audit_trail.db")
# Get all changes for a file
changes = db.get_file_changes("lab_2024_04_01.csv")
# Get changes in a date range
recent = db.get_changes_between(
    start_date="2024-04-01",
    end_date="2024-04-05"
)
# Get statistics
stats = db.get_statistics()
print(f"Total files tracked: {stats.file_count}")
print(f"Total changes: {stats.change_count}")

from medaudit.reporter import ReportGenerator
generator = ReportGenerator(config)
# Generate HTML report
html_report = generator.generate_html(
    changes=changes,
    title="Lab Data Audit Report - April 2024"
)
html_report.save("lab_audit_2024_04.html")
# Generate CSV export
csv_report = generator.generate_csv(changes)
csv_report.save("lab_audit_2024_04.csv")
# With AI summary
if config.ai.enabled:
    summary = generator.generate_ai_summary(
        changes,
        provider=config.ai.provider
    )
    print(f"Summary: {summary}")

Complete Directory Layout
MedAudit-Diff-Watcher/
│
├── README.md & README_CN.md
├── pyproject.toml              # Python project configuration
├── requirements.txt
├── CHANGELOG.md
│
├── medaudit/                   # Main package
│   ├── __main__.py             # CLI entry point
│   ├── __init__.py
│   ├── config.py               # Configuration loading
│   ├── watcher.py              # Folder watching
│   ├── parser.py               # CSV parsing
│   ├── differ.py               # Diffing engine
│   ├── audit.py                # SQLite audit trail
│   ├── reporter.py             # Report generation
│   ├── ai_summarizer.py        # AI integration
│   └── merge_tools.py          # Beyond Compare/WinMerge
│
├── gui/
│   ├── __init__.py
│   ├── main_window.py          # Main GUI window
│   ├── diff_viewer.py          # Diff visualization
│   ├── audit_browser.py        # Audit trail browser
│   ├── statistics_panel.py     # Statistics dashboard
│   └── styles.css              # Qt stylesheets
│
├── docs/
│   ├── QUICK_START.md
│   ├── CLI_REFERENCE.md
│   ├── GUI_GUIDE.md
│   ├── CONFIG_REFERENCE.md
│   ├── TROUBLESHOOTING.md
│   └── assets/
│       ├── gui_screenshot.png
│       └── cli_example.png
│
├── examples/
│   ├── config.default.yaml
│   ├── sample_data/
│   │   ├── lab_old.csv
│   │   └── lab_new.csv
│   └── README.md
│
└── tests/
    ├── test_watcher.py
    ├── test_differ.py
    ├── test_audit.py
    └── test_reporter.py
Match Key Configuration
Different CSV files may need different match strategies:
matching:
  profiles:
    lab:
      match_keys: ["USUBJID", "LAB_DATE", "LAB_CODE"]
      similarity: 0.95  # Strict: near-exact key match
    ae:
      match_keys: ["USUBJID", "AE_START_DATE"]
      similarity: 0.85  # More lenient
    dm:
      match_keys: ["USUBJID"]
      similarity: 0.99  # Very strict

AI Summarization
Automatic change summaries using LLMs:
from medaudit.ai_summarizer import AISummarizer
from medaudit.config import Config

config = Config.from_yaml("config.yaml")
config.ai.enabled = True
config.ai.provider = "openai"
config.ai.model = "gpt-4"

# `changes` is the result of an earlier CSVDiffer.diff_files(...) call
summarizer = AISummarizer(config)
summary = summarizer.summarize_changes(changes)
print(f"AI Summary: {summary}")
# Output: "3 lab tests added for subject X, 1 result corrected for test Y..."

Statistics & Analytics
from medaudit.audit import AuditDatabase

db = AuditDatabase("audit_trail.db")
stats = db.get_statistics()
print(f"Files tracked: {stats.file_count}")
print(f"Total changes: {stats.change_count}")
print(f"Additions: {stats.additions}")
print(f"Modifications: {stats.modifications}")
print(f"Deletions: {stats.deletions}")
print(f"Last update: {stats.last_updated}")

Data Privacy & Security
- Local-First: All data remains on your machine
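- No Cloud Syncing: No data is transmitted off the machine unless the optional AI summaries are enabled, which call the configured LLM provider's API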
- Encrypted Fields: Optional field-level encryption
- Audit Immutability: Once recorded, audit entries cannot be modified
- Access Logging: Who accessed what, when
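One common way to approximate the "Audit Immutability" property in SQLite is a pair of triggers that abort any `UPDATE` or `DELETE` on the audit table. The sketch below illustrates that technique only; whether MedAudit enforces immutability this way is an assumption, and the `audit_entries` table, its columns, and the `try_statement` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical audit table; the real MedAudit schema is not shown here.
    CREATE TABLE audit_entries (
        id INTEGER PRIMARY KEY,
        file_name TEXT NOT NULL,
        change_type TEXT NOT NULL,
        recorded_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    );
    CREATE TRIGGER audit_no_update BEFORE UPDATE ON audit_entries
    BEGIN
        SELECT RAISE(ABORT, 'audit entries are append-only');
    END;
    CREATE TRIGGER audit_no_delete BEFORE DELETE ON audit_entries
    BEGIN
        SELECT RAISE(ABORT, 'audit entries are append-only');
    END;
    """
)
conn.execute(
    "INSERT INTO audit_entries (file_name, change_type) VALUES (?, ?)",
    ("lab_2024_04_01.csv", "modified"),
)


def try_statement(sql):
    """Run a statement and report whether the triggers blocked it."""
    try:
        conn.execute(sql)
        return "allowed"
    except sqlite3.DatabaseError as exc:
        return f"blocked: {exc}"


print(try_statement("UPDATE audit_entries SET change_type = 'added' WHERE id = 1"))
print(try_statement("DELETE FROM audit_entries WHERE id = 1"))
```

Triggers like these stop accidental edits through normal SQL, but anyone with write access to the database file can still drop them, so they complement file-level protections and do not replace them.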
pandas>=2.0 # CSV parsing & diffing
PyYAML>=6.0 # Configuration
watchdog>=3.0 # Folder monitoring
rapidfuzz>=2.0 # Fuzzy string matching
PySide6>=6.5 # GUI (optional)
openai>=1.0 # AI summaries (optional)
anthropic>=0.7 # Claude API (optional)
Install:
pip install -r requirements.txt
pip install ".[gui]" # For GUI
pip install ".[ai]" # For AI summaries

Common Issues
| Issue | Solution |
|---|---|
| Folder not being watched | Check path in config.yaml, ensure folder exists |
| CSV encoding errors | Change encoding in config (try utf-8-sig, latin1) |
| GUI won't start | Install PySide6: pip install ".[gui]" |
| No matches found in diff | Adjust similarity_threshold in config |
| AI summaries not working | Set API key, check provider config |
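For the encoding row in the table above, the fallback order it suggests (`utf-8-sig`, then `latin1`) can be tried programmatically before editing the config. The following is a minimal standard-library sketch, not part of MedAudit; `read_csv_bytes` is a hypothetical helper and the sample row is made up.

```python
import csv
import io


def read_csv_bytes(raw, encodings=("utf-8-sig", "latin1")):
    """Decode raw CSV bytes with the first encoding that works, then parse rows."""
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # Try the next candidate encoding.
        return encoding, list(csv.DictReader(io.StringIO(text)))
    raise ValueError(f"none of {encodings} could decode the data")


# "é" encoded as Latin-1 is not valid UTF-8, so the first attempt fails.
raw = "USUBJID,SITE\nS01,Montréal\n".encode("latin1")
encoding, rows = read_csv_bytes(raw)
print(encoding, rows[0]["SITE"])
```

Note that `latin1` accepts any byte sequence, so it should come last in the list: it never raises, but it can silently mis-decode files that are really in another encoding.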
Contributions welcome! See CONTRIBUTING.md.
# Development setup
git clone https://github.com/hakupao/MedAudit-Diff-Watcher.git
cd MedAudit-Diff-Watcher
pip install -e ".[dev]"
pytest tests/

- Web dashboard for remote access
- Direct database source support (SQL audit)
- Automated email notifications
- Machine learning anomaly detection
- Integration with EHR systems
- Blockchain-based immutable audit trail
MIT License © 2024 hakupao
@software{medaudit2024,
author = {hakupao},
title = {MedAudit Diff Watcher: Local-First CSV Audit Workstation},
url = {https://github.com/hakupao/MedAudit-Diff-Watcher},
year = {2024}
}

- Issues: GitHub Issues
- Docs: docs/
- Discussions: GitHub Discussions
Audit with confidence, locally


