# Two Shakes Data Cleaning

AI-Powered Data Preparation: From Messy to Analysis-Ready in Two Shakes of a Lamb's Tail


πŸ—οΈ Built for the Microsoft Purpose-Built AI Platform Hackathon Category: Best Use of Microsoft Foundry Β· Also targeting: Best Multi-Agent System Β· Best Enterprise Solution


## Azure Deployment Link

https://dataprepagent-499e361a.azurewebsites.net/


## Demo Video

https://youtu.be/wbrhYIQtNJQ



## How It Works in 60 Seconds

1. Upload any messy data file — CSV, Excel, JSON, XML, or even a PDF with tables
2. AI profiles every column: detects types, missing values, outliers, duplicates, and scores quality 0–100
3. Review a cleaning plan — approve, reject, or tweak each AI-proposed action before anything runs
4. Optionally prepare for ML — the AI recommends encoding, scaling, and feature transforms tailored to your data
5. Download your analysis-ready dataset as CSV, Excel, or Parquet

The LLM decides what to fix. Python executes it deterministically. Your data, your call: nothing changes without your approval.


## Microsoft Hero Technologies — Where & How They're Used

This section maps every required hackathon technology to the exact source files that implement it.

### ☁️ Microsoft Foundry (Azure AI Foundry)

All LLM calls in DataPrepAgent go through models hosted on Microsoft Foundry. The single LLM client in src/agents/__init__.py connects to the Foundry endpoint using the AZURE_AI_PROJECT_ENDPOINT and AZURE_AI_MODEL_DEPLOYMENT_NAME environment variables. Three agents make LLM calls — the Profiler (semantic analysis), the Strategy Agent (cleaning plan generation), and the Validator (quality certificate) — plus the Feature Engineering Agent for ML recommendations. Azure Foundry's built-in content filters are active on every call.

Key code:

```python
# src/agents/__init__.py — AgentClient constructor
import os

from azure.ai.projects import AIProjectClient
from azure.core.credentials import AzureKeyCredential

client = AIProjectClient(
    endpoint=os.getenv("AZURE_AI_PROJECT_ENDPOINT"),
    credential=AzureKeyCredential(os.getenv("AZURE_AI_PROJECT_KEY")),
)

# Creates an Azure AI Agent with per-call threads
agent = client.agents.create_agent(
    model=os.getenv("AZURE_AI_MODEL_DEPLOYMENT_NAME"),
    name=self.name,
    instructions=self.instructions,
)
```

### 🤖 Microsoft Agent Framework (azure-ai-projects)

The AgentClient class uses azure.ai.projects.AIProjectClient as its tier-1 backend — the actual Microsoft Agent Framework SDK. It creates real Azure AI Agents with per-call threads and message-based conversations. If the SDK is unavailable (e.g., in environments without the preview package), it falls back gracefully to openai.AzureOpenAI pointing at the same Foundry-hosted model. Every agent in the system (Profiler, Strategy, Cleaner, Validator, Feature Engineering, Feature Transformer) uses this single client.

Files: src/agents/__init__.py · src/agents/orchestrator_agent.py · src/agents/profiler_agent.py · src/agents/strategy_agent.py · src/agents/validator_agent.py · src/agents/feature_engineering_agent.py · src/agents/feature_transformer_agent.py

### 🔌 MCP Server (7 tools)

src/mcp_server.py exposes the full pipeline as 7 MCP tools via stdio transport. Any MCP-compatible client — including GitHub Copilot Agent Mode — can call these tools programmatically. The tools are: profile_data, suggest_cleaning_plan, clean_data, validate_cleaning, list_supported_formats, recommend_feature_engineering, apply_feature_engineering.

πŸ§‘β€πŸ’» GitHub Copilot Agent Mode

The repo includes a .vscode/mcp.json configuration file that registers DataPrepAgent's MCP server as a tool source for GitHub Copilot Agent Mode in VS Code. With this config, a developer can ask Copilot: "Profile the file test_data/messy_sales.csv and suggest a cleaning plan" and Copilot will call the MCP tools automatically.

```jsonc
// .vscode/mcp.json — already in the repo
{
  "servers": {
    "dataprepagent": {
      "command": "python",
      "args": ["src/mcp_server.py"],
      "env": { "AZURE_AI_PROJECT_ENDPOINT": "...", "AZURE_AI_PROJECT_KEY": "...", "AZURE_AI_MODEL_DEPLOYMENT_NAME": "gpt-4o-mini" }
    }
  }
}
```

### 📄 Azure AI Document Intelligence

src/parsers/pdf_parser.py uses Azure AI Document Intelligence's prebuilt-layout model to extract tables from PDF files and scanned images. This is a second Azure AI service beyond the LLM, demonstrating multi-service integration on the Azure platform.
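Table cells come back from the prebuilt-layout model with row/column indices rather than as a grid. A minimal sketch of flattening them into rows follows; the simplified cell dicts and the helper `cells_to_rows` are illustrative, not the parser's actual code, though the `row_index`/`column_index`/`content` fields mirror the Document Intelligence table schema.

```python
# Sketch: flatten layout-model table cells into row lists.
# The cell dicts below are a simplified, illustrative shape.
def cells_to_rows(cells, row_count, column_count):
    grid = [["" for _ in range(column_count)] for _ in range(row_count)]
    for cell in cells:
        grid[cell["row_index"]][cell["column_index"]] = cell["content"]
    return grid

cells = [
    {"row_index": 0, "column_index": 0, "content": "region"},
    {"row_index": 0, "column_index": 1, "content": "sales"},
    {"row_index": 1, "column_index": 0, "content": "West"},
    {"row_index": 1, "column_index": 1, "content": "1200"},
]
rows = cells_to_rows(cells, row_count=2, column_count=2)
```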

### ☁️ Azure App Service (Deployment)

infra/deploy.sh provides one-command deployment to Azure App Service. The script creates a resource group, App Service plan (B1 Linux), web app with Python 3.11 runtime, configures all environment variables, and deploys the code. startup.sh runs Streamlit on port 8000 for the Azure container. Full step-by-step instructions in infra/azure-deployment.md.


## Architecture — 8-Agent Orchestrated Pipeline

*(Architecture diagram — see docs/)*

Agentic design patterns used:

- **Multi-agent collaboration:** 8 specialized agents, each with a single responsibility
- **Agent-to-agent messaging:** Orchestrator sends structured AgentMessage objects to sub-agents
- **Orchestrator supervisor:** Central coordinator drives the pipeline, manages state, handles errors
- **Self-healing retry loop:** If quality score < target after cleaning, Orchestrator re-runs Strategy + Cleaner (up to 2 retries)
- **Human-in-the-loop checkpoints:** Pipeline pauses twice for user approval (cleaning plan + FE plan)
- **Tool-using agents:** MCP server exposes all agent capabilities as callable tools
- **Deterministic execution:** LLM reasons about what to do; Python code executes it. No AI-generated data values.
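The agent-to-agent messages above might look roughly like this sketch; the field names are illustrative assumptions, since the real `AgentMessage` contract lives in src/models/schemas.py as a Pydantic model.

```python
# Illustrative shape of a structured agent-to-agent message.
# Field names are assumptions; the real schema is a Pydantic model.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class AgentMessage:
    sender: str                    # e.g. "orchestrator"
    recipient: str                 # e.g. "strategy"
    action: str                    # e.g. "generate_cleaning_plan"
    payload: dict[str, Any] = field(default_factory=dict)

msg = AgentMessage("orchestrator", "strategy", "generate_cleaning_plan",
                   {"retry": 1, "target_score": 90})
```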

## The Problem

Data scientists spend 60–80% of their time on data cleaning and preparation. Messy CSVs with mixed date formats, Excel exports with merged cells, nested JSON APIs with missing fields — every dataset needs hours of manual wrangling before any real analysis can begin.

## The Solution

DataPrepAgent automates the entire data preparation pipeline using 8 AI agents orchestrated by a supervisor. Upload a messy file, get a detailed quality report, review the AI-generated cleaning plan action by action, then optionally apply ML feature engineering — all in minutes.


## What Makes This Different

🧠 LLM reasons, Python executes. The model analyzes your data and proposes a plan. But actual transformations are deterministic pandas and scikit-learn functions. The AI never generates or modifies data values directly — no hallucinated data, no surprises.

👀 Human-in-the-loop at every decision point. Both the cleaning plan and the feature engineering plan are presented as reviewable lists. Toggle each action on or off. Edit fill strategies. Change scaling methods. Nothing runs until you approve it.

🔄 Self-healing pipeline. If the cleaned data still scores below the quality target, the Orchestrator Agent automatically re-runs the Strategy and Cleaner agents with refined parameters — up to 2 retry attempts.

📋 Enterprise audit trail. Every pipeline run produces an append-only JSONL audit log with SHA-256 file hashes, quality scores, approved/rejected actions, LLM call counts, and backend details. No raw data is stored — only metadata.

🔌 MCP-native. The full pipeline is exposed as 7 MCP tools. Connect from GitHub Copilot Agent Mode, Claude Desktop, or any MCP client and run data preparation programmatically.
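The audit-trail entry described above can be sketched as follows. The entry field names and values are illustrative assumptions; the real schema lives in src/governance/audit_log.py.

```python
# Sketch of an append-only JSONL audit entry keyed by a SHA-256 file
# hash. Field names and the llm_calls value are illustrative only.
import hashlib
import json
import pathlib
import tempfile

def append_audit_entry(log_path, source_file: bytes, quality_score: int):
    entry = {
        "file_sha256": hashlib.sha256(source_file).hexdigest(),
        "quality_score": quality_score,
        "llm_calls": 3,                       # illustrative count
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")     # append-only: never rewrite
    return entry

log = pathlib.Path(tempfile.mkdtemp()) / "audit.jsonl"
entry = append_audit_entry(log, b"id,name\n1,Ada\n", quality_score=92)
```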


## Features

- **Multi-format ingestion** — CSV, TSV, Excel (merged cells, multi-row headers), JSON (nested API flattening), XML, PDF tables (Azure Document Intelligence)
- **AI-powered profiling** — Statistical analysis enriched with LLM semantic understanding (column meaning, cross-column inconsistencies)
- **8-agent orchestrated pipeline** — Orchestrator drives sub-agents via A2A messaging with automatic retry
- **Human-in-the-loop cleaning** — Review every action before execution; edit parameters, toggle approvals
- **Deterministic transformations** — 11 cleaning action types + 18 feature engineering transforms, all backed by pandas/sklearn
- **Before/after validation** — 6 automated checks + LLM-generated quality certificate with 0–100 score
- **Feature engineering** — AI-recommended encoding (one-hot, label, ordinal, target, frequency), scaling (standard, min-max, robust, max-abs), distribution transforms (log, power, quantile), feature creation (interaction, polynomial), feature selection (low variance, high cardinality, high correlation)
- **Export** — CSV, Excel, Parquet in one click (clean data and ML-ready data)
- **Premium UI** — Custom design system with DM Serif Display, branded palette, polished components
- **Enterprise governance** — Append-only audit log, content filtering via Azure Foundry, no PII in LLM calls

## Screenshots

Screenshots of the Upload, Profile, Cleaning Plan, Results, and Feature Engineering steps are in docs/.

---

## Quick Start

```bash
git clone https://github.com/Steve-Git9/TwoShakes.git
cd TwoShakes
python -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env             # fill in your Azure credentials
streamlit run frontend/app.py
```

Open http://localhost:8501 and upload any of the demo files in test_data/.

### Environment Variables

```ini
AZURE_AI_PROJECT_ENDPOINT=https://your-project.services.ai.azure.com
AZURE_AI_PROJECT_KEY=your-api-key
AZURE_AI_MODEL_DEPLOYMENT_NAME=gpt-4o-mini
AZURE_DOCUMENT_INTELLIGENCE_KEY=your-key
AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT=https://your-resource.cognitiveservices.azure.com
```

## MCP Server — Use with GitHub Copilot Agent Mode

DataPrepAgent exposes its full pipeline as MCP tools, designed to work with GitHub Copilot Agent Mode and any MCP-compatible client.

GitHub Copilot Agent Mode (VS Code): Add to your .vscode/mcp.json:

```json
{
  "servers": {
    "dataprepagent": {
      "command": "python",
      "args": ["src/mcp_server.py"],
      "env": {
        "AZURE_AI_PROJECT_ENDPOINT": "...",
        "AZURE_AI_PROJECT_KEY": "...",
        "AZURE_AI_MODEL_DEPLOYMENT_NAME": "gpt-4o-mini"
      }
    }
  }
}
```

Then in Copilot Agent Mode, ask: "Profile the file test_data/messy_sales.csv and suggest a cleaning plan" — Copilot will call the MCP tools automatically.

Claude Desktop (~/claude_desktop_config.json):

```json
{
  "mcpServers": {
    "dataprepagent": {
      "command": "python",
      "args": ["<absolute-path>/src/mcp_server.py"],
      "env": {
        "AZURE_AI_PROJECT_ENDPOINT": "...",
        "AZURE_AI_PROJECT_KEY": "...",
        "AZURE_AI_MODEL_DEPLOYMENT_NAME": "gpt-4o-mini"
      }
    }
  }
}
```
| Tool | Description |
| --- | --- |
| `profile_data` | Ingest a file and return a full ProfileReport JSON |
| `suggest_cleaning_plan` | Generate a CleaningPlan from a profile |
| `clean_data` | Execute the plan, return cleaned file path + log |
| `validate_cleaning` | Run 6 checks, return ValidationReport JSON |
| `list_supported_formats` | List supported file extensions |
| `recommend_feature_engineering` | AI analysis → FeatureEngineeringPlan with 18 transform types |
| `apply_feature_engineering` | Execute approved transforms → ML-ready file + log |

## Agent Details

### Orchestrator Agent

Supervisor that drives all sub-agents via structured A2A messages. Includes a self-healing loop: if quality score remains below target after cleaning, it sends a retry message back to the Strategy Agent and re-cleans (up to 2 attempts). Optionally triggers the feature engineering phase.
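The self-healing loop reduces to a bounded retry: re-run strategy and cleaning while the score stays below target, at most twice. A minimal sketch, where the target value, score list, and function names are illustrative stand-ins for the real agents:

```python
# Sketch of the bounded self-healing retry loop. The scores list
# simulates the validator's score after each cleaning pass.
TARGET, MAX_RETRIES = 90, 2   # illustrative quality target

def run_pipeline(scores):
    attempts = 0
    score = scores[attempts]
    while score < TARGET and attempts < MAX_RETRIES:
        attempts += 1              # Orchestrator -> Strategy retry message
        score = scores[attempts]   # re-clean, then re-validate
    return score, attempts

score, attempts = run_pipeline([72, 85, 93])
```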

### Ingestion Agent

Routes uploaded files to the correct parser (CSV, Excel, JSON, XML, PDF), strips empty rows/columns, and returns a FileMetadata object.

### Profiler Agent

Two-stage analysis: (1) statistical — column types, missing rates, IQR outliers, fuzzy duplicate detection, all in Python; (2) LLM semantic enrichment via Azure Foundry — interprets column semantics, detects cross-column inconsistencies, generates a quality summary.
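The IQR outlier check in stage (1) is standard: values outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR] are flagged. A standard-library sketch (the project uses pandas equivalents; `iqr_outliers` is an illustrative name):

```python
# Standard-library sketch of the IQR outlier rule used in profiling.
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    q1, _, q3 = quantiles(values, n=4)        # quartile cut points
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr       # Tukey fences
    return [v for v in values if v < lo or v > hi]

outliers = iqr_outliers([10, 12, 11, 13, 12, 11, 10, 95])
```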

### Strategy Agent

Sends the ProfileReport + samples to the LLM with a detailed prompt. Returns an ordered CleaningPlan — each action has a type, parameters, priority, and human-readable reason.

### Cleaner Agent

Dispatches each approved action to deterministic pandas functions. Captures before/after samples, logs every action. Individual failures are caught and reported — the pipeline continues.
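The dispatch pattern can be sketched as a registry mapping action types to pure pandas functions; the action dict shape and registry names here are illustrative, not the project's actual schema:

```python
# Sketch: LLM-proposed action dicts dispatched to deterministic pandas
# functions. Action types and field names are illustrative.
import pandas as pd

ACTIONS = {
    "fill_missing": lambda df, p: df.fillna({p["column"]: p["value"]}),
    "drop_duplicates": lambda df, p: df.drop_duplicates(),
}

def apply_action(df, action):
    return ACTIONS[action["type"]](df, action.get("params", {}))

df = pd.DataFrame({"qty": [1, None, 1], "sku": ["a", "b", "a"]})
df = apply_action(df, {"type": "fill_missing",
                       "params": {"column": "qty", "value": 0}})
df = apply_action(df, {"type": "drop_duplicates"})
```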

### Validator Agent

Runs 6 automated checks (row count, duplicates, null reduction, empty columns, type consistency, transform success rate) then calls the LLM for a narrative quality certificate.

### Feature Engineering Agent

Two-stage analysis: (1) statistical — skewness, kurtosis, correlation matrix, cardinality, variance; (2) LLM recommendation — selects from 18 transform types, grouped and ordered (encoding → scaling → distribution → creation → selection).

| Category | Techniques |
| --- | --- |
| Encoding | one-hot, label, ordinal, target (smoothed mean), frequency |
| Scaling | standard (z-score), min-max, robust (IQR), max-abs |
| Distribution | log (auto-offset), Yeo-Johnson power, quantile normalization |
| Feature Creation | interaction products, polynomial features, binning |
| Feature Selection | drop low-variance, drop high-cardinality, drop highly-correlated |
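Two of the techniques above, one-hot encoding and z-score scaling, can be sketched in plain pandas (the project backs them with sklearn transformers; this toy DataFrame is illustrative):

```python
# Sketch: one-hot encoding and z-score scaling with plain pandas.
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"],
                   "price": [10.0, 20.0, 30.0]})

df = pd.get_dummies(df, columns=["color"])    # one-hot encoding
# z-score: subtract mean, divide by population std
df["price"] = (df["price"] - df["price"].mean()) / df["price"].std(ddof=0)
```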

### Feature Transformer Agent

Fault-tolerant executor: dispatches each approved FeatureEngineeringAction to the correct sklearn-backed function. Per-action exceptions are caught and logged; the pipeline always continues.
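The per-action fault isolation described above boils down to a try/except around each dispatch; a minimal sketch with illustrative action shapes:

```python
# Sketch of the fault-tolerant executor: each failure is caught and
# logged, and the remaining actions still run. Shapes are illustrative.
def execute_all(actions, data):
    log = []
    for act in actions:
        try:
            data = act["fn"](data)
            log.append((act["name"], "ok"))
        except Exception as exc:          # isolate failures per action
            log.append((act["name"], f"failed: {exc}"))
    return data, log

actions = [
    {"name": "double", "fn": lambda xs: [x * 2 for x in xs]},
    {"name": "boom",   "fn": lambda xs: 1 / 0},   # deliberately fails
    {"name": "incr",   "fn": lambda xs: [x + 1 for x in xs]},
]
data, log = execute_all(actions, [1, 2, 3])
```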


## Project Structure

```text
TwoShakes/
├── .vscode/mcp.json                # GitHub Copilot Agent Mode config
├── frontend/                       # Streamlit UI + custom design system
│   ├── app.py                      # Entry point, sidebar, step routing
│   ├── static/style.css            # 500-line master CSS
│   ├── pages/                      # Upload, Profile, Plan, Results, FE
│   └── components/                 # Reusable UI components
├── src/
│   ├── agents/                     # 8 agents (incl. orchestrator, FE)
│   │   └── __init__.py             # AgentClient — azure-ai-projects + fallback
│   ├── parsers/                    # CSV, Excel, JSON, XML, PDF parsers
│   │   └── pdf_parser.py           # Azure Document Intelligence
│   ├── transformations/            # 11 cleaning + 18 FE transform functions
│   ├── governance/audit_log.py     # Enterprise audit trail
│   ├── mcp_server.py               # 7 MCP tools (stdio transport)
│   └── models/schemas.py           # All Pydantic v2 data contracts
├── test_data/                      # Demo messy datasets
├── tests/                          # Unit + integration tests
├── infra/
│   ├── deploy.sh                   # Azure App Service deployment
│   └── azure-deployment.md         # Step-by-step deployment guide
├── docs/                           # Architecture docs + screenshots
├── .streamlit/config.toml          # Streamlit theme config
├── requirements.txt
├── startup.sh                      # Azure App Service startup
└── .env.example
```

## Responsible AI

- **Human-in-the-loop** — No data is modified without explicit user approval of each action
- **No PII in LLM calls** — Only column names, statistics, and 3–5 sample values are sent to the model
- **Transparency** — Every transformation is logged with rows affected and before/after samples
- **Content filtering** — Azure AI Foundry applies built-in content filters to all model calls
- **Deterministic execution** — The LLM proposes; Python executes. No AI-generated data values
- **Audit trail** — Append-only JSONL log with SHA-256 file hashes for governance and reproducibility
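The "metadata only" rule for LLM calls can be sketched as a payload builder that sends column names, statistics, and a capped sample slice, never full rows. The function and field names are illustrative assumptions:

```python
# Sketch of a metadata-only LLM payload: per-column name, dtype,
# missing rate, and at most a few sample values. Names are illustrative.
def build_llm_payload(columns, max_samples=5):
    return [
        {
            "name": col["name"],
            "dtype": col["dtype"],
            "missing_rate": col["missing_rate"],
            "samples": col["values"][:max_samples],   # capped sample slice
        }
        for col in columns
    ]

payload = build_llm_payload([
    {"name": "region", "dtype": "string", "missing_rate": 0.02,
     "values": ["West", "East", "North", "South", "West", "East"]},
])
```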

## Azure Deployment

See infra/azure-deployment.md for step-by-step instructions.

```bash
set -a && source .env && set +a
bash infra/deploy.sh
```

## License

MIT — see LICENSE
