An autonomous multi-agent system for AWS cost optimization built with a Google ADK-inspired architecture and tree-of-thought reasoning. The system moves from one-off cost reports to continuous monitoring, prioritized recommendations, and safe automated remediation.
Traditional tools: static analysis → manual review → manual actions
This agent system: continuous monitoring → multi-path reasoning → autonomous actions → learning
- Autonomous operation: runs scheduled and continuous workflows
- Tree-of-thought reasoning: explores multiple decision paths before acting
- Adaptive learning: stores patterns and prior decisions in local memory
- Safety-first automation: risk tiers, approvals, and simulation mode
- Real-time response: reacts to anomalies and high-confidence savings opportunities
- Multi-agent architecture: monitor, analyzer, executor, and orchestrator agents
Cloud Cost Optimization Agent System
├── Orchestrator Agent - workflow coordination and strategic insights
├── Monitor Agent - continuous AWS resource monitoring
├── Analyzer Agent - recommendations with tree-of-thought reasoning
├── Executor Agent - safe autonomous action execution
├── Memory System - persistent learning and pattern storage
├── Safety Framework - risk assessment and approval workflows
└── Dashboard System - runtime status and pending approvals
| Resource Type | Detection Method | Autonomous Actions | Learning Features |
|---|---|---|---|
| Unattached EBS Volumes | Age + attachment analysis | Auto-snapshot + delete | Pattern-based retention |
| Idle EC2 Instances | Multi-metric analysis | Auto-stop with schedules | Usage pattern learning |
| Stale Snapshots | Age + dependency tracking | Smart retention policies | Policy optimization |
| Idle Load Balancers | Traffic analysis + trends | Auto-consolidation | Load pattern recognition |
| Overprovisioned RDS | Performance + cost modeling | Automated rightsizing | Workload characterization |
| S3 Lifecycle Gaps | Access pattern analysis | Smart lifecycle policies | Data aging patterns |
| Unused Resources | Dependency mapping | Safe automated cleanup | Resource correlation |
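As a rough illustration of the "Age + attachment analysis" row, the sketch below filters volume records for unattached volumes older than a threshold. It assumes records shaped like the `Volumes` entries returned by boto3's `describe_volumes` (`VolumeId`, `State`, `CreateTime`). The helper name `find_unattached_volumes` and the 30-day default are illustrative assumptions, not taken from the project's code.

```python
from datetime import datetime, timedelta, timezone


def find_unattached_volumes(volumes, min_age_days=30, now=None):
    """Return IDs of volumes that are unattached and at least min_age_days old.

    Hypothetical helper: `volumes` is any iterable of dicts shaped like
    boto3 describe_volumes entries.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=min_age_days)
    return [
        v["VolumeId"]
        for v in volumes
        # "available" is the EBS state of a volume with no attachment.
        if v["State"] == "available" and v["CreateTime"] <= cutoff
    ]


# Made-up sample data with a fixed clock so the result is deterministic.
now = datetime(2025, 10, 3, tzinfo=timezone.utc)
sample_volumes = [
    {"VolumeId": "vol-old-unattached", "State": "available",
     "CreateTime": now - timedelta(days=45)},
    {"VolumeId": "vol-new-unattached", "State": "available",
     "CreateTime": now - timedelta(days=5)},
    {"VolumeId": "vol-old-attached", "State": "in-use",
     "CreateTime": now - timedelta(days=90)},
]
print(find_unattached_volumes(sample_volumes, now=now))
```

With live data, the records could come from `boto3.client("ec2").describe_volumes(Filters=[{"Name": "status", "Values": ["available"]}])["Volumes"]`, ideally through a paginator for large accounts.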
- Python 3.11+
- AWS account with read permissions (write permissions for live remediation)
- Google Cloud project (optional, for Vertex AI reasoning)
- 4GB+ RAM recommended for full agent operation
- Clone the repository:
git clone https://github.com/maheshchebrolu-git/cloud-cost-optimization-agent.git
cd cloud-cost-optimization-agent
- Install dependencies:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
- Configure AWS credentials:
# Create .env
AWS_ACCESS_KEY_ID=your_access_key_here
AWS_SECRET_ACCESS_KEY=your_secret_key_here
AWS_DEFAULT_REGION=us-east-1
# Optional: Google Cloud configuration
GOOGLE_CLOUD_PROJECT=your-project-id
GOOGLE_CLOUD_REGION=us-central1
python cloud_cost_agent.py --mode interactive --simulation
python cloud_cost_agent.py --mode single --simulation
python cloud_cost_agent.py --mode continuous --simulation
python cloud_cost_agent.py --mode monitor
python cloud_cost_sweeper.py
🤖 Cloud Cost Optimization Agent - Interactive Mode
agent> run
Running optimization cycle...
✅ Optimization complete!
Potential savings: $247.50/month
Actions executed: 5
Pending approvals: 2
agent> approvals
Pending Approvals (2)
1. action_20251003_171857_1 - $62.00/month
Resource: i-0abc123def456789
Action: terminate_stopped_instance
Risk: caution
agent> approve
Enter approval number: 1
Approve action? (y/n): y
✅ Action approved and executed
Traditional approach:
- Volume is older than 30 days → delete
Tree-of-thought agent approach:
Reasoning Path 1 (Cost Focus):
├── $8/month waste for 30 days
├── Quick win with minimal risk
└── Recommendation: delete immediately
Reasoning Path 2 (Risk Analysis):
├── Check for recent snapshots
├── Verify no pending reattachments
├── Assess business criticality
└── Recommendation: snapshot first, then delete
Reasoning Path 3 (Pattern Learning):
├── Historical volume usage patterns
├── Team behavior analysis
├── Seasonal considerations
└── Recommendation: set smart retention policy
Synthesized Decision:
├── Confidence: 0.87
├── Action: create final snapshot, delete volume, update policy
└── Learning: update retention rules for this volume type
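One way to picture the synthesis step is as a weighted vote: each reasoning path scores every candidate action, and the action with the best combined score wins. The sketch below is a minimal illustration under that assumption. The function name `synthesize`, the action labels, and all scores are hypothetical, and the project's analyzer may combine paths differently.

```python
def synthesize(path_scores, weights=None):
    """Combine per-path action scores into one decision.

    path_scores maps a path name to {action: score in [0, 1]}.
    Assumes every path scores the same set of actions.
    Returns (best_action, weighted_average_score).
    """
    weights = weights or {name: 1.0 for name in path_scores}
    total_weight = sum(weights[name] for name in path_scores)
    combined = {}
    for name, scores in path_scores.items():
        for action, score in scores.items():
            combined[action] = combined.get(action, 0.0) + weights[name] * score
    best = max(combined, key=combined.get)
    return best, round(combined[best] / total_weight, 2)


# Made-up scores for the unattached-volume example.
paths = {
    "cost_focus": {"delete_now": 0.9, "snapshot_then_delete": 0.7, "keep": 0.1},
    "risk_analysis": {"delete_now": 0.3, "snapshot_then_delete": 0.9, "keep": 0.6},
    "pattern_learning": {"delete_now": 0.5, "snapshot_then_delete": 0.8, "keep": 0.4},
}
action, confidence = synthesize(paths)
print(action, confidence)
```

Passing a `weights` dict lets one path (for example risk analysis) count more than the others without changing the scoring logic.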
Risk Levels:
├── SAFE (auto-execute)
│   ├── Release unused Elastic IPs
│   ├── Add S3 lifecycle policies
│   └── Delete unattached volumes >30 days
│
├── CAUTION (request approval)
│   ├── Stop/resize instances
│   ├── Modify database configurations
│   └── Change security settings
│
└── REVIEW_REQUIRED (manual review)
    ├── Delete production databases
    ├── Modify network configurations
    └── Cross-service dependencies
- Simulation mode: test actions without mutating AWS resources
- Human approval: required for medium/high risk actions
- Rollback plans: reversal steps captured per action
- Circuit breakers: stop execution on unexpected results
- Audit trail: decision and action history stored locally
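The interaction between risk tiers, approvals, and simulation mode can be sketched as a small gate in front of the executor. This is an illustrative sketch only: the action names, the `ACTION_RISK` mapping, and the `decide` function are assumptions, and unknown actions are deliberately treated as the highest tier.

```python
from enum import Enum


class RiskLevel(Enum):
    SAFE = "safe"
    CAUTION = "caution"
    REVIEW_REQUIRED = "review_required"


# Hypothetical mapping, loosely following the risk tiers listed above.
ACTION_RISK = {
    "release_elastic_ip": RiskLevel.SAFE,
    "add_s3_lifecycle_policy": RiskLevel.SAFE,
    "stop_instance": RiskLevel.CAUTION,
    "delete_production_database": RiskLevel.REVIEW_REQUIRED,
}


def decide(action, simulation_mode=True, approved=False):
    """Return what should happen to an action: a routing decision, not execution."""
    # Anything not explicitly classified falls back to the strictest tier.
    risk = ACTION_RISK.get(action, RiskLevel.REVIEW_REQUIRED)
    if risk is RiskLevel.REVIEW_REQUIRED:
        return "manual_review"
    if risk is RiskLevel.CAUTION and not approved:
        return "pending_approval"
    # SAFE actions and approved CAUTION actions still respect simulation mode.
    return "simulate" if simulation_mode else "execute"


print(decide("release_elastic_ip"))
print(decide("stop_instance", simulation_mode=False))
print(decide("stop_instance", simulation_mode=False, approved=True))
```

Defaulting `simulation_mode` to `True` mirrors the safety-first stance: a caller has to opt in explicitly before anything mutates AWS resources.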
./deploy/deploy.sh your-project-id us-central1
gcloud run services describe cloud-cost-agent --region=us-central1
docker build -t cloud-cost-agent .
docker run -d \
  -e AWS_ACCESS_KEY_ID=... \
  -e AWS_SECRET_ACCESS_KEY=... \
  -v $(pwd)/memory:/app/memory \
  cloud-cost-agent
python cloud_cost_agent.py --mode continuous --simulation --log-level DEBUG

| Variable | Required | Description | Default |
|---|---|---|---|
| AWS_ACCESS_KEY_ID | Yes | AWS access key | - |
| AWS_SECRET_ACCESS_KEY | Yes | AWS secret key | - |
| AWS_DEFAULT_REGION | No | Primary AWS region | us-east-1 |
| GOOGLE_CLOUD_PROJECT | No | GCP project for Vertex AI | - |
| SIMULATION_MODE | No | Enable simulation mode | true |
| LOG_LEVEL | No | Logging verbosity | INFO |
| MEMORY_PATH | No | Agent memory storage | ./memory |
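For reference, the optional variables and their defaults could be read along the following lines. This is a sketch under assumptions: the `AgentConfig` class, the `load_config` helper, and the set of strings accepted as "true" are hypothetical and may not match how the project actually parses its environment.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AgentConfig:
    aws_region: str
    simulation_mode: bool
    log_level: str
    memory_path: str
    gcp_project: Optional[str]


def load_config(env=None):
    """Build a config from environment variables, using the documented defaults."""
    env = os.environ if env is None else env
    # Assumption: common truthy spellings enable simulation mode.
    simulation = env.get("SIMULATION_MODE", "true").strip().lower() in {"1", "true", "yes"}
    return AgentConfig(
        aws_region=env.get("AWS_DEFAULT_REGION", "us-east-1"),
        simulation_mode=simulation,
        log_level=env.get("LOG_LEVEL", "INFO"),
        memory_path=env.get("MEMORY_PATH", "./memory"),
        gcp_project=env.get("GOOGLE_CLOUD_PROJECT"),
    )


print(load_config({}))
```

Accepting an explicit `env` mapping keeps the loader easy to test without touching the real process environment.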
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ec2:Describe*",
"rds:Describe*",
"s3:List*",
"s3:GetBucketLifecycle*",
"elbv2:Describe*",
"elb:Describe*",
"cloudwatch:GetMetricStatistics"
],
"Resource": "*"
}
]
}
{
"Effect": "Allow",
"Action": [
"ec2:CreateSnapshot",
"ec2:DeleteVolume",
"ec2:ReleaseAddress",
"ec2:StopInstances",
"s3:PutLifecycleConfiguration"
],
"Resource": "*",
"Condition": {
"StringEquals": {
"aws:RequestedRegion": ["us-east-1", "us-west-2"]
}
}
}
- Agent startup fails
  - Reinstall dependencies: pip install -r requirements.txt
  - Verify AWS credentials: aws sts get-caller-identity
- No metrics found
  - New resources may not have CloudWatch history yet
  - Recommendations become more reliable as metrics accumulate
- High memory usage
  - Increase container memory or reduce scan scope for large accounts
- Action approval timeouts
  - Use interactive approval commands or integrate your own approval channel
python cloud_cost_agent.py --mode single --log-level DEBUG --simulation
agent> dashboard
python -c "import boto3; print(boto3.client('ec2').describe_regions())"
agent> run
- Add a detector in agents/analyzer.py
- Add pricing constants for cost estimation
- Implement execution logic in agents/executor.py
- Add safety checks and risk classification
- Test thoroughly in simulation mode
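The first two steps (a detector plus a pricing constant) might look roughly like the sketch below for unassociated Elastic IPs. Everything here is hypothetical: the `Recommendation` shape, the function name, and the price constant are illustrative and do not reflect the real interfaces in agents/analyzer.py. The hourly price is a placeholder to verify against current AWS pricing. Input records are assumed to look like boto3 `describe_addresses` entries.

```python
from dataclasses import dataclass

# Placeholder price in USD per hour; an assumption, check current AWS pricing.
ELASTIC_IP_HOURLY_USD = 0.005
HOURS_PER_MONTH = 730


@dataclass
class Recommendation:
    resource_id: str
    action: str
    risk: str
    monthly_savings_usd: float


def detect_unassociated_elastic_ips(addresses):
    """Flag Elastic IPs that have no AssociationId, i.e. are not in use."""
    recommendations = []
    for addr in addresses:
        if "AssociationId" in addr:
            continue  # Attached to an instance or network interface.
        recommendations.append(
            Recommendation(
                resource_id=addr.get("AllocationId", addr.get("PublicIp", "unknown")),
                action="release_elastic_ip",
                risk="safe",  # Matches the SAFE tier for releasing unused Elastic IPs.
                monthly_savings_usd=round(ELASTIC_IP_HOURLY_USD * HOURS_PER_MONTH, 2),
            )
        )
    return recommendations


# Made-up sample records.
sample_addresses = [
    {"AllocationId": "eipalloc-unused", "PublicIp": "203.0.113.10"},
    {"AllocationId": "eipalloc-used", "PublicIp": "203.0.113.11",
     "AssociationId": "eipassoc-123"},
]
for rec in detect_unassociated_elastic_ips(sample_addresses):
    print(rec)
```

Keeping the detector a pure function over already-fetched records makes it straightforward to exercise in simulation mode before any execution logic exists.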
This project is licensed under the MIT License; see the LICENSE file for details.
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Source: Repository home
- Always run in simulation mode before enabling live actions
- Validate recommendations against business requirements
- Monitor agent decisions and outcomes continuously
- Keep backups for critical resources
- Confirm actions comply with your organization's governance policies
The project includes adk_integration.CloudCostADKIntegration, which exposes the orchestrator through a Google Agent Development Kit agent surface.
from adk_integration import CloudCostADKIntegration
cloud_adk = CloudCostADKIntegration(project_id="my-gcp-project", simulation_mode=True)
adk_agent = cloud_adk.adk_agent
Available ADK tools include run_optimization_cycle, run_monitoring_cycle, get_dashboard_snapshot, list_pending_approvals, approve_pending_action, and toggle_simulation_mode.
Install the preview google.adk package following Google ADK guidance until the SDK is generally available.