Skip to content

Latest commit

 

History

History
285 lines (219 loc) · 7.7 KB

File metadata and controls

285 lines (219 loc) · 7.7 KB

DataMind Agent Architecture

System Overview

DataMind implements an advanced closed-loop multi-agent system designed for autonomous data science workflows. The architecture follows a "tiny experiment" philosophy where each iteration focuses on minimal, testable hypotheses, enhanced with Julia native ML processing for 5-100x performance improvements.

🚀 Core Performance Features

Julia Native ML Pipeline

Location: src/ml/julia_native_ml.jl (467 lines of optimized code)

  • 5-100x Performance: Faster than Python/sklearn equivalents
  • Zero Overhead: No Python/C boundary costs
  • Type-Safe Computing: Compile-time optimization
  • Memory Efficient: Handles datasets 100x larger

Enhanced Intelligence

  • Real LLM Integration: GPT-4 powered agents with actual API calls
  • Knowledge Graph Learning: Neo4j backend with 177+ experiments tracked
  • Vector Database: ChromaDB integration for semantic search and cross-domain learning
  • Production Ready: Comprehensive error handling and optimization

🤖 Core Components

Meta-Controller

Location: src/controllers/meta_controller.jl

The Meta-Controller orchestrates the entire experiment lifecycle with enhanced intelligence:

struct MetaController
    experiment::Experiment
    config::Dict
    knowledge_graph::EnhancedKnowledgeGraph
    vector_database::ChromaDBClient
    iteration_count::Int
    ensemble_detection::Bool
end

Enhanced Responsibilities:

  • Manage experiment state and iterations with real LLM integration
  • Coordinate communication between specialized agents
  • Track experiment progress and intelligent termination conditions
  • Maintain knowledge graph with vector database updates
  • Enable cross-domain learning and pattern recognition

🧠 Agent Types

Planning Agent

Location: src/agents/planning_agent.jl

Decomposes research questions using advanced chain-of-thought reasoning with GPT-4:

function plan_experiment(agent::PlanningAgent, research_question::String)
    # Real GPT-4 API calls for intelligent planning
    # Context-aware from 177+ previous experiments
    # Returns structured plan with minimal subtasks
end

Key Features:

  • Real LLM Intelligence: GPT-4 powered planning with actual API integration
  • Context Awareness: Learning from 177+ tracked experiments
  • Domain-Specific Templates: Optimized for finance, e-commerce, weather, HR data
  • Hypothesis Generation: Intelligent validation and refinement

Code Generation Agent

Location: src/agents/codegen_agent.jl

Generates high-performance Julia code for data science tasks with optimization focus:

function generate_code(agent::CodeGenAgent, task::String, context::Dict)
    # GPT-4 powered code generation
    # Julia native ML optimization (5-100x faster)
    # Comprehensive error handling and output capture
end

Enhanced Features:

  • Julia Native Focus: Optimized for maximum performance computing
  • Real LLM Intelligence: GPT-4 powered code generation with context awareness
  • Production Ready: Enhanced error handling, data validation, numerical stability
  • Library Integration: GLM.jl, DataFrames.jl, Bootstrap ensembles

Evaluation Agent

Location: src/evaluation/evaluator.jl

Analyzes experiment results using advanced intelligence for success/failure/retry decisions:

function evaluate_results(agent::EvaluationAgent, results::Dict)
    # Intelligent parsing with GPT-4 analysis
    # Statistical validation and interpretation
    # Smart iteration termination decisions
end

Enhanced Features:

  • Intelligent Evaluation: Real LLM analysis of experiment outcomes
  • Metric Intelligence: Advanced parsing and statistical comparison
  • Learning Integration: Results feed back into knowledge graph
  • Ensemble Detection: Identifies multi-agent collaboration patterns

Reflection Agent

Location: src/agents/reflection_agent.jl

Updates knowledge graph and triggers intelligent next planning cycles:

function reflect_and_update(agent::ReflectionAgent, experiment_results::Dict)
    # Update Neo4j knowledge graph with advanced ontology
    # Vector database semantic indexing
    # Cross-domain pattern recognition
end

Communication Patterns

Message Bus

Agents communicate through structured messages:

struct AgentMessage
    id::String
    from_agent::String
    to_agent::String
    message_type::String
    content::Dict
    timestamp::DateTime
end

Data Flow

  1. Planning Phase:

    User Input → Planning Agent → Structured Plan → Meta-Controller
    
  2. Execution Phase:

    Plan → Code Gen Agent → Executable Code → Execution Sandbox → Results
    
  3. Evaluation Phase:

    Results → Evaluation Agent → Insights → Knowledge Graph → Next Iteration
    

Knowledge Management

Knowledge Graph

Location: src/knowledge/graph.jl

Maintains experiment provenance and learned patterns:

struct KnowledgeGraph
    experiments::Dict{String, Experiment}
    relationships::Vector{Relationship}
    patterns::Vector{Pattern}
end

Capabilities:

  • Track experiment lineage
  • Store successful code patterns
  • Query historical results
  • Identify similar problems

Provenance Tracking

Every experiment iteration creates provenance records:

  • Input research question
  • Generated plans and code
  • Execution results and metrics
  • Agent decisions and rationale

Execution Environment

Sandboxed Execution

Location: src/execution/sandbox.jl

Provides isolated execution for generated code:

function execute_with_timeout(code::String, timeout::Int=30)
    # Safe execution with resource limits
    # Captures stdout, stderr, and return values
end

Security Features:

  • Resource limits (memory, CPU, time)
  • Filesystem restrictions
  • Network isolation
  • Package management

Configuration Architecture

Agent Configuration

Agents are configured through YAML files:

agents:
  planning:
    model: "gpt-4"
    temperature: 0.1
    max_tokens: 1000
  
  codegen:
    model: "gpt-3.5-turbo"
    temperature: 0.2
    max_tokens: 2000

LLM Routing

Cost-aware model selection:

  • Cheap models for planning and evaluation
  • Expensive models for complex code generation
  • Fallback chains for reliability

Extension Points

Custom Agents

Add new agent types by implementing the Agent interface:

abstract type Agent end

function process_message(agent::Agent, message::AgentMessage)
    # Custom processing logic
end

Domain Adaptations

  • Custom planning templates for specific domains
  • Specialized evaluation criteria
  • Domain-specific code libraries

Integration Hooks

  • External data sources
  • Custom execution environments
  • Third-party ML platforms

Deployment Patterns

Local Development

  • Single-process execution
  • Mock LLM responses for testing
  • File-based knowledge storage

Production Deployment

  • Microservice architecture
  • Distributed execution with Ray
  • Graph database for knowledge storage
  • Message queue for agent communication

Performance Considerations

Scalability

  • Stateless agent design enables horizontal scaling
  • Knowledge graph can be sharded by experiment type
  • Execution sandbox supports parallel experiments

Cost Optimization

  • Model routing based on task complexity
  • Caching of similar analyses
  • Incremental learning from successful patterns

Monitoring and Observability

Metrics Collection

  • Experiment success rates
  • Agent response times
  • LLM token usage and costs
  • Code execution statistics

Debugging Support

  • Message tracing between agents
  • Execution logs and error capture
  • Visual experiment flow representation