
Cloud Cost Optimization Agent

An autonomous multi-agent system for AWS cost optimization built with a Google ADK-inspired architecture and tree-of-thought reasoning. The system moves from one-off cost reports to continuous monitoring, prioritized recommendations, and safe automated remediation.

What Makes This Different

Traditional tools: static analysis → manual review → manual actions
This agent system: continuous monitoring → multi-path reasoning → autonomous actions → learning

Key Capabilities

  • Autonomous operation: runs scheduled and continuous workflows
  • Tree-of-thought reasoning: explores multiple decision paths before acting
  • Adaptive learning: stores patterns and prior decisions in local memory
  • Safety-first automation: risk tiers, approvals, and simulation mode
  • Real-time response: reacts to anomalies and high-confidence savings opportunities
  • Multi-agent architecture: monitor, analyzer, executor, and orchestrator agents
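
To make the "adaptive learning" item more concrete, here is a minimal sketch of what a local decision memory could look like. The `DecisionMemory` class, the `decisions.jsonl` file name, and the record fields are all assumptions made for illustration; the repository's actual memory format may differ.

```python
import json
import tempfile
from pathlib import Path


class DecisionMemory:
    """Append-only JSON Lines store for past decisions (hypothetical format)."""

    def __init__(self, directory):
        self.path = Path(directory) / "decisions.jsonl"
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def record(self, resource_type, action, outcome):
        """Append one decision and its outcome to the local file."""
        entry = {"resource_type": resource_type, "action": action, "outcome": outcome}
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(entry) + "\n")

    def success_rate(self, resource_type, action):
        """Fraction of matching past decisions that succeeded, or None if unseen."""
        if not self.path.exists():
            return None
        with self.path.open(encoding="utf-8") as fh:
            entries = [json.loads(line) for line in fh if line.strip()]
        matches = [
            e for e in entries
            if e["resource_type"] == resource_type and e["action"] == action
        ]
        if not matches:
            return None
        return sum(e["outcome"] == "success" for e in matches) / len(matches)


# Demo in a throwaway directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    memory = DecisionMemory(tmp)
    memory.record("ebs_volume", "snapshot_and_delete", "success")
    memory.record("ebs_volume", "snapshot_and_delete", "rolled_back")
    rate = memory.success_rate("ebs_volume", "snapshot_and_delete")
    unseen = memory.success_rate("rds_instance", "rightsize")
    print(rate, unseen)
```

A past success rate like this could feed into the confidence of future recommendations for the same resource type.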

Agent Architecture

Cloud Cost Optimization Agent System
├── Orchestrator Agent    - workflow coordination and strategic insights
├── Monitor Agent         - continuous AWS resource monitoring
├── Analyzer Agent        - recommendations with tree-of-thought reasoning
├── Executor Agent        - safe autonomous action execution
├── Memory System         - persistent learning and pattern storage
├── Safety Framework      - risk assessment and approval workflows
└── Dashboard System      - runtime status and pending approvals
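
As a rough illustration of how the agents listed above might hand work to one another, the sketch below wires a monitor, analyzer, and executor behind an orchestrator. Every class name, method name, and field here is hypothetical and is not taken from the repository's code; the findings are hard-coded instead of coming from AWS.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    resource_id: str
    issue: str
    monthly_cost: float


class MonitorAgent:
    """Hypothetical monitor: returns raw findings (hard-coded here)."""

    def scan(self) -> list[Finding]:
        return [
            Finding("vol-0001", "unattached_volume", 8.0),
            Finding("i-0002", "idle_instance", 62.0),
        ]


class AnalyzerAgent:
    """Hypothetical analyzer: turns findings into proposed actions."""

    ACTIONS = {
        "unattached_volume": "snapshot_and_delete",
        "idle_instance": "stop_instance",
    }

    def recommend(self, findings: list[Finding]) -> list[dict]:
        return [
            {
                "resource_id": f.resource_id,
                "action": self.ACTIONS[f.issue],
                "savings": f.monthly_cost,
            }
            for f in findings
            if f.issue in self.ACTIONS
        ]


class ExecutorAgent:
    """Hypothetical executor: only logs what it would do or did."""

    def __init__(self, simulation: bool = True) -> None:
        self.simulation = simulation
        self.log: list[str] = []

    def execute(self, action: dict) -> None:
        prefix = "SIMULATED" if self.simulation else "EXECUTED"
        self.log.append(f"{prefix} {action['action']} on {action['resource_id']}")


@dataclass
class Orchestrator:
    monitor: MonitorAgent = field(default_factory=MonitorAgent)
    analyzer: AnalyzerAgent = field(default_factory=AnalyzerAgent)
    executor: ExecutorAgent = field(default_factory=ExecutorAgent)

    def run_cycle(self) -> float:
        """Run monitor -> analyzer -> executor once; return potential savings."""
        recommendations = self.analyzer.recommend(self.monitor.scan())
        for rec in recommendations:
            self.executor.execute(rec)
        return sum(rec["savings"] for rec in recommendations)


orchestrator = Orchestrator()
total = orchestrator.run_cycle()
print(f"Potential savings: ${total:.2f}/month")
for line in orchestrator.executor.log:
    print(line)
```

Keeping each agent behind a narrow method makes it possible to swap the hard-coded monitor for a real AWS scanner without touching the orchestrator.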

Enhanced Detection Capabilities

| Resource Type | Detection Method | Autonomous Actions | Learning Features |
| --- | --- | --- | --- |
| Unattached EBS Volumes | Age + attachment analysis | Auto-snapshot + delete | Pattern-based retention |
| Idle EC2 Instances | Multi-metric analysis | Auto-stop with schedules | Usage pattern learning |
| Stale Snapshots | Age + dependency tracking | Smart retention policies | Policy optimization |
| Idle Load Balancers | Traffic analysis + trends | Auto-consolidation | Load pattern recognition |
| Overprovisioned RDS | Performance + cost modeling | Automated rightsizing | Workload characterization |
| S3 Lifecycle Gaps | Access pattern analysis | Smart lifecycle policies | Data aging patterns |
| Unused Resources | Dependency mapping | Safe automated cleanup | Resource correlation |
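
The "age + attachment analysis" row can be sketched as a simple filter. The example below assumes volume records shaped like the entries in an EC2 `DescribeVolumes` response (`VolumeId`, `State`, `Attachments`, `CreateTime`) and measures age from creation time; the function name and the 30-day default are illustrative, not the project's actual detector.

```python
from datetime import datetime, timedelta, timezone


def find_unattached_volumes(volumes, min_age_days=30, now=None):
    """Return IDs of volumes that are unattached and older than min_age_days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=min_age_days)
    return [
        v["VolumeId"]
        for v in volumes
        # "available" is the EC2 state for a volume not attached to any instance.
        if v["State"] == "available"
        and not v["Attachments"]
        and v["CreateTime"] < cutoff
    ]


# Sample records with a fixed "now" so the result is deterministic.
now = datetime(2025, 10, 3, tzinfo=timezone.utc)
sample = [
    {"VolumeId": "vol-old", "State": "available", "Attachments": [],
     "CreateTime": now - timedelta(days=45)},
    {"VolumeId": "vol-new", "State": "available", "Attachments": [],
     "CreateTime": now - timedelta(days=5)},
    {"VolumeId": "vol-used", "State": "in-use",
     "Attachments": [{"InstanceId": "i-1"}],
     "CreateTime": now - timedelta(days=90)},
]
stale = find_unattached_volumes(sample, now=now)
print(stale)
```

Passing `now` explicitly keeps the filter testable without mocking the clock.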

Quick Start

Prerequisites

  • Python 3.11+
  • AWS account with read permissions (write permissions for live remediation)
  • Google Cloud project (optional, for Vertex AI reasoning)
  • 4GB+ RAM recommended for full agent operation

Installation

  1. Clone the repository:
git clone https://github.com/maheshchebrolu-git/cloud-cost-optimization-agent.git
cd cloud-cost-optimization-agent
  2. Install dependencies:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
  3. Configure AWS credentials:
# Create .env
AWS_ACCESS_KEY_ID=your_access_key_here
AWS_SECRET_ACCESS_KEY=your_secret_key_here
AWS_DEFAULT_REGION=us-east-1

# Optional: Google Cloud configuration
GOOGLE_CLOUD_PROJECT=your-project-id
GOOGLE_CLOUD_REGION=us-central1

Usage Modes

Interactive mode (recommended for first use)

python cloud_cost_agent.py --mode interactive --simulation

Single optimization run

python cloud_cost_agent.py --mode single --simulation

Continuous operation

python cloud_cost_agent.py --mode continuous --simulation

Monitoring only

python cloud_cost_agent.py --mode monitor

Batch sweeper (read-only analysis)

python cloud_cost_sweeper.py

Interactive Demo

🤖 Cloud Cost Optimization Agent - Interactive Mode
agent> run
Running optimization cycle...
✅ Optimization complete!
   Potential savings: $247.50/month
   Actions executed: 5
   Pending approvals: 2

agent> approvals
📋 Pending Approvals (2)
1. action_20251003_171857_1 - $62.00/month
   Resource: i-0abc123def456789
   Action: terminate_stopped_instance
   Risk: caution

agent> approve
Enter approval number: 1
Approve action? (y/n): y
✅ Action approved and executed

Tree-of-Thought Reasoning in Action

Problem: should we delete a 30-day-old unattached EBS volume?

Traditional approach:

  • Volume is older than 30 days → delete

Tree-of-thought agent approach:

Reasoning Path 1 (Cost Focus):
├── $8/month waste for 30 days
├── Quick win with minimal risk
└── Recommendation: delete immediately

Reasoning Path 2 (Risk Analysis):
├── Check for recent snapshots
├── Verify no pending reattachments
├── Assess business criticality
└── Recommendation: snapshot first, then delete

Reasoning Path 3 (Pattern Learning):
├── Historical volume usage patterns
├── Team behavior analysis
├── Seasonal considerations
└── Recommendation: set smart retention policy

Synthesized Decision:
├── Confidence: 0.87
├── Action: create final snapshot, delete volume, update policy
└── Learning: update retention rules for this volume type
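
One possible way to combine several reasoning paths into a single decision is sketched below. It is not the project's actual synthesis logic: the `ReasoningPath` type, the step names, the per-path confidence values, and the simple averaging rule are all assumptions chosen for illustration, so the resulting confidence will not match the 0.87 shown above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReasoningPath:
    name: str
    steps: tuple[str, ...]
    confidence: float  # 0.0-1.0, illustrative values only


# Order in which steps must run when they are selected.
STEP_ORDER = ("create_snapshot", "delete_volume", "update_retention_policy")


def synthesize(paths, min_confidence=0.5):
    """Merge steps from sufficiently confident paths and average their confidence."""
    kept = [p for p in paths if p.confidence >= min_confidence]
    if not kept:
        return [], 0.0
    selected = {step for p in kept for step in p.steps}
    # Re-order the merged steps so the snapshot always precedes the delete.
    plan = [step for step in STEP_ORDER if step in selected]
    confidence = sum(p.confidence for p in kept) / len(kept)
    return plan, round(confidence, 2)


paths = [
    ReasoningPath("cost_focus", ("delete_volume",), 0.9),
    ReasoningPath("risk_analysis", ("create_snapshot", "delete_volume"), 0.8),
    ReasoningPath("pattern_learning", ("update_retention_policy",), 0.7),
]
plan, confidence = synthesize(paths)
print(plan, confidence)
```

The fixed step order is what lets the cautious path (snapshot first) constrain the aggressive one (delete immediately) instead of the two simply competing.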

Safety and Risk Management

Risk classification system

Risk Levels:
├── SAFE (auto-execute)
│   ├── Release unused Elastic IPs
│   ├── Add S3 lifecycle policies
│   └── Delete unattached volumes >30 days
│
├── CAUTION (request approval)
│   ├── Stop/resize instances
│   ├── Modify database configurations
│   └── Change security settings
│
└── REVIEW_REQUIRED (manual review)
    ├── Delete production databases
    ├── Modify network configurations
    └── Cross-service dependencies
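
The three tiers above map naturally onto an enum plus a lookup table. The sketch below is a hypothetical encoding: the action names and the `classify` and `route` helpers are invented for illustration, and treating unknown actions as `REVIEW_REQUIRED` is a design assumption rather than documented behavior.

```python
from enum import Enum


class RiskLevel(Enum):
    SAFE = "safe"                        # auto-execute
    CAUTION = "caution"                  # request approval
    REVIEW_REQUIRED = "review_required"  # manual review


# Hypothetical action names mapped to the tiers listed above.
ACTION_RISK = {
    "release_elastic_ip": RiskLevel.SAFE,
    "add_s3_lifecycle_policy": RiskLevel.SAFE,
    "delete_unattached_volume": RiskLevel.SAFE,
    "stop_instance": RiskLevel.CAUTION,
    "resize_instance": RiskLevel.CAUTION,
    "modify_database_config": RiskLevel.CAUTION,
    "delete_production_database": RiskLevel.REVIEW_REQUIRED,
    "modify_network_config": RiskLevel.REVIEW_REQUIRED,
}

HANDLING = {
    RiskLevel.SAFE: "auto_execute",
    RiskLevel.CAUTION: "request_approval",
    RiskLevel.REVIEW_REQUIRED: "manual_review",
}


def classify(action: str) -> RiskLevel:
    """Unknown actions fall back to the most restrictive tier."""
    return ACTION_RISK.get(action, RiskLevel.REVIEW_REQUIRED)


def route(action: str) -> str:
    """Return how an action should be handled based on its risk tier."""
    return HANDLING[classify(action)]


for name in ("release_elastic_ip", "stop_instance", "drop_everything"):
    print(name, "->", route(name))
```

Defaulting to the strictest tier means a newly added action cannot auto-execute until someone classifies it explicitly.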

Safety mechanisms

  • Simulation mode: test actions without mutating AWS resources
  • Human approval: required for medium- and high-risk actions (the CAUTION and REVIEW_REQUIRED tiers)
  • Rollback plans: reversal steps captured per action
  • Circuit breakers: stop execution on unexpected results
  • Audit trail: decision and action history stored locally
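
The mechanisms above can be combined in one small wrapper. The following is a minimal sketch, assuming a hypothetical `SafeExecutor` class that skips the real call in simulation mode, stores a rollback plan with every audit entry, and refuses further work after a configurable number of failures. None of these names come from the repository.

```python
class CircuitOpenError(RuntimeError):
    """Raised when too many actions have failed and execution is halted."""


class SafeExecutor:
    def __init__(self, simulation=True, max_failures=2):
        self.simulation = simulation
        self.max_failures = max_failures
        self.failures = 0
        self.audit_trail = []

    def _audit(self, action, status, rollback_plan, error=None):
        self.audit_trail.append(
            {"action": action, "status": status,
             "rollback": rollback_plan, "error": error}
        )
        return status

    def run(self, action, apply, rollback_plan):
        """Run apply() unless simulating or the circuit breaker is open."""
        if self.failures >= self.max_failures:
            raise CircuitOpenError(f"circuit open, refusing {action}")
        if self.simulation:
            # Simulation mode: record the intent, never call apply().
            return self._audit(action, "simulated", rollback_plan)
        try:
            apply()
        except Exception as exc:
            self.failures += 1
            return self._audit(action, "failed", rollback_plan, str(exc))
        return self._audit(action, "executed", rollback_plan)


def failing_call():
    raise RuntimeError("unexpected API response")


# In simulation mode the failing call is never invoked.
sim = SafeExecutor(simulation=True)
print(sim.run("delete_volume", failing_call, "restore from final snapshot"))

# In live mode two failures open the circuit and the third call is refused.
live = SafeExecutor(simulation=False, max_failures=2)
print(live.run("stop_instance", failing_call, "start instance"))
print(live.run("stop_instance", failing_call, "start instance"))
try:
    live.run("stop_instance", failing_call, "start instance")
except CircuitOpenError as exc:
    print(exc)
```

Because every outcome goes through `_audit`, the audit trail and the rollback plans stay in one place regardless of whether the action was simulated, executed, or failed.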

Production Deployment

Option 1: Google Cloud Run

./deploy/deploy.sh your-project-id us-central1
gcloud run services describe cloud-cost-agent --region=us-central1

Option 2: Docker

docker build -t cloud-cost-agent .
docker run -d \
  -e AWS_ACCESS_KEY_ID=... \
  -e AWS_SECRET_ACCESS_KEY=... \
  -v $(pwd)/memory:/app/memory \
  cloud-cost-agent

Option 3: Local development

python cloud_cost_agent.py --mode continuous --simulation --log-level DEBUG

Configuration

| Variable | Required | Description | Default |
| --- | --- | --- | --- |
| AWS_ACCESS_KEY_ID | ✅ | AWS access key | - |
| AWS_SECRET_ACCESS_KEY | ✅ | AWS secret key | - |
| AWS_DEFAULT_REGION | ❌ | Primary AWS region | us-east-1 |
| GOOGLE_CLOUD_PROJECT | ❌ | GCP project for Vertex AI | - |
| SIMULATION_MODE | ❌ | Enable simulation mode | true |
| LOG_LEVEL | ❌ | Logging verbosity | INFO |
| MEMORY_PATH | ❌ | Agent memory storage | ./memory |

Required AWS Permissions

Minimum permissions (read-only)

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:Describe*",
        "rds:Describe*",
        "s3:List*",
        "s3:GetLifecycleConfiguration",
        "elasticloadbalancing:Describe*",
        "cloudwatch:GetMetricStatistics"
      ],
      "Resource": "*"
    }
  ]
}

Extended permissions (for actions)

{
  "Effect": "Allow",
  "Action": [
    "ec2:CreateSnapshot",
    "ec2:DeleteVolume",
    "ec2:ReleaseAddress",
    "ec2:StopInstances",
    "s3:PutLifecycleConfiguration"
  ],
  "Resource": "*",
  "Condition": {
    "StringEquals": {
      "aws:RequestedRegion": ["us-east-1", "us-west-2"]
    }
  }
}

Troubleshooting

  1. Agent startup fails

    pip install -r requirements.txt
    aws sts get-caller-identity
  2. No metrics found

    • New resources may not have CloudWatch history yet
    • The agent improves as metrics accumulate
  3. High memory usage

    • Increase container memory or reduce scan scope for large accounts
  4. Action approval timeouts

    • Use interactive approval commands or integrate your own approval channel

Debug mode

python cloud_cost_agent.py --mode single --log-level DEBUG --simulation

Health checks

agent> dashboard
python -c "import boto3; print(boto3.client('ec2').describe_regions())"
agent> run

Contributing

Adding new resource types

  1. Add a detector in agents/analyzer.py
  2. Add pricing constants for cost estimation
  3. Implement execution logic in agents/executor.py
  4. Add safety checks and risk classification
  5. Test thoroughly in simulation mode
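
The steps above might look like this for a new resource type, here unassociated Elastic IPs. The sketch assumes address records shaped like an EC2 `DescribeAddresses` response, where an associated address carries an `AssociationId`. The `Recommendation` type, the function name, and the hourly rate are placeholders for illustration; the real interface in agents/analyzer.py may differ, and the rate should be checked against current AWS pricing.

```python
from dataclasses import dataclass

# Placeholder pricing constant for illustration only; verify against AWS pricing.
IDLE_EIP_HOURLY_USD = 0.005
HOURS_PER_MONTH = 730


@dataclass
class Recommendation:
    resource_id: str
    action: str
    monthly_savings: float
    risk: str


def detect_idle_elastic_ips(addresses):
    """Flag Elastic IPs that have no AssociationId, meaning nothing uses them."""
    monthly = round(IDLE_EIP_HOURLY_USD * HOURS_PER_MONTH, 2)
    return [
        # Releasing unused Elastic IPs is listed under the SAFE tier.
        Recommendation(addr["AllocationId"], "release_elastic_ip", monthly, "safe")
        for addr in addresses
        if "AssociationId" not in addr
    ]


sample_addresses = [
    {"AllocationId": "eipalloc-idle", "PublicIp": "203.0.113.10"},
    {"AllocationId": "eipalloc-used", "PublicIp": "203.0.113.11",
     "AssociationId": "eipassoc-1"},
]
recommendations = detect_idle_elastic_ips(sample_addresses)
for rec in recommendations:
    print(rec)
```

Because the detector takes plain dictionaries, it can be exercised in simulation mode with canned data before it is ever pointed at a live account.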

License

This project is licensed under the MIT License. See the LICENSE file for details.

Support

Important Disclaimers

  • Always run in simulation mode before enabling live actions
  • Validate recommendations against business requirements
  • Monitor agent decisions and outcomes continuously
  • Keep backups for critical resources
  • Confirm actions comply with your organization's governance policies

Google ADK Integration

The project includes adk_integration.CloudCostADKIntegration, which exposes the orchestrator through a Google Agent Development Kit agent surface.

from adk_integration import CloudCostADKIntegration

cloud_adk = CloudCostADKIntegration(project_id="my-gcp-project", simulation_mode=True)
adk_agent = cloud_adk.adk_agent

Available ADK tools include run_optimization_cycle, run_monitoring_cycle, get_dashboard_snapshot, list_pending_approvals, approve_pending_action, and toggle_simulation_mode.

Install the preview google.adk package following Google ADK guidance until the SDK is generally available.
