Agentic CloudOps Lab is an Azure-hosted AI CloudOps control plane that performs safe Azure subscription reviews using live inventory collection, deterministic analysis, policy/runbook grounding, LLM summarization, response evaluation, and artifact generation.
The current implementation uses Microsoft Foundry Agent Service as the AI-facing supervisor and Azure Functions as the trusted deterministic backend.
This is a lab / proof-of-concept project. It is designed for learning, demos, architecture discussions, and controlled testing. It is not a production-ready CloudOps platform.
Cloud teams often have many subscriptions, resource groups, and services with inconsistent tagging, missing diagnostic settings, avoidable cost waste, weak governance, and configuration drift.
Traditional scripts can detect these issues, but they usually produce raw reports that require manual interpretation. General-purpose AI agents can summarize findings, but they should not directly make infrastructure changes without strong controls.
This lab combines both approaches:
- deterministic CloudOps checks for factual analysis
- Azure Function as a trusted tool backend
- Microsoft Foundry Agent as the AI-facing supervisor
- Azure OpenAI for grounded summaries
- policy and runbook grounding from Blob Storage
- evaluator logic before returning the final response
- safe remediation drafts that always require human approval
The goal is to demonstrate a practical enterprise pattern for agentic CloudOps without giving the agent destructive permissions.
This project demonstrates how an AI-assisted CloudOps review can help with:
- identifying cost optimization opportunities
- detecting missing governance controls
- reviewing basic security configuration gaps
- checking reliability and operational hygiene
- producing readable summaries for engineers and managers
- generating remediation drafts without executing them
- preserving review artifacts for audit and follow-up
The important design decision is that the agent helps review and explain. It does not automatically delete, deallocate, reconfigure, or remediate Azure resources.
High-level flow:
User / IDE / Foundry Agent
|
v
Microsoft Foundry Agent
|
| OpenAPI tool call
v
Azure Function: /api/tools/run-cloudops-review
|
| Managed Identity
v
Azure Resource Manager REST API
|
v
Live Azure inventory
|
v
Deterministic CloudOps analyzers
|
+--> Cost / governance checks
+--> Security checks
+--> Reliability checks
+--> Tagging checks
|
v
Blob-backed policy/runbook retrieval
|
v
Azure OpenAI grounded summary
|
v
Evaluator Agent safety/quality validation
|
v
Blob artifact storage
This project is intentionally designed as a safe review and planning system.
The agent does not execute remediation.
All generated remediation scripts are drafts. Every action remains human-approved.
Safety rules:
- No automatic deletion.
- No automatic deallocation.
- No automatic networking changes.
- No automatic Key Vault purge-protection changes.
- No destructive action without explicit human approval.
- Deterministic findings are the source of truth.
- LLM output is evaluated before being returned.
- Policy/runbook grounding is included when available.
- Generated remediation remains in PendingApproval.
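As an illustration of the draft-only rule, the sketch below wraps a generated script in a record that starts in PendingApproval and flags destructive-looking lines for the reviewer. It is a minimal sketch, not the lab's code: `RemediationDraft`, `make_draft`, and the `BLOCKED_FRAGMENTS` list are hypothetical names introduced here.

```python
from dataclasses import dataclass, field

# Hypothetical fragments treated as destructive in this sketch only.
BLOCKED_FRAGMENTS = (
    "az group delete",
    "az vm deallocate",
    "az keyvault purge",
    "remove-azresource",
)


@dataclass
class RemediationDraft:
    """A remediation script that the review stores but never executes."""
    finding_id: str
    script: str
    status: str = "PendingApproval"  # nothing in this sketch changes the status
    warnings: list = field(default_factory=list)


def make_draft(finding_id: str, script: str) -> RemediationDraft:
    """Wrap a generated script as a draft and flag destructive-looking lines."""
    draft = RemediationDraft(finding_id=finding_id, script=script)
    for line in script.splitlines():
        lowered = line.lower()
        if any(fragment in lowered for fragment in BLOCKED_FRAGMENTS):
            draft.warnings.append(
                f"destructive command needs explicit approval: {line.strip()}"
            )
    return draft
```

The sketch contains no code path that runs the script or moves a draft out of PendingApproval; approval would have to happen in a separate, human-driven step.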
| Area | Status |
|---|---|
| Azure Function backend | Implemented |
| ARM inventory collection | Implemented |
| Deterministic analyzers | Implemented |
| Cost / governance / security / reliability findings | Implemented |
| Policy/runbook retrieval | Implemented with lightweight keyword retrieval |
| Azure OpenAI summary | Implemented |
| Evaluator Agent | Implemented |
| Foundry OpenAPI tool endpoint | Implemented |
| Artifact upload to Blob Storage | Implemented |
| Automatic remediation | Not implemented by design |
| Production hardening | Not included |
The current retrieval method is intentionally lightweight. It uses keyword retrieval from Blob-backed Markdown documents. A future version could replace this with Azure AI Search hybrid/vector retrieval.
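Keyword retrieval of this kind can be approximated in a few lines. The sketch below ranks documents by keyword overlap with a query. It assumes the Markdown documents have already been downloaded into a dictionary; `tokenize`, `retrieve`, and the scoring rule are illustrative choices, not the lab's actual implementation.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase a string and split it into a set of alphanumeric keywords."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: dict, top_k: int = 3) -> list:
    """Rank documents by how many query keywords appear in their name or body."""
    query_terms = tokenize(query)
    scored = []
    for name, body in documents.items():
        overlap = len(query_terms & tokenize(name + " " + body))
        if overlap > 0:
            scored.append((overlap, name))
    # Highest overlap first; ties are broken alphabetically for stable output.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:top_k]]


# Sample bodies are made up for the example; only the file names come from the lab.
docs = {
    "tagging-policy.md": "Every resource must carry owner and environment tags.",
    "storage-soft-delete-policy.md": "Enable blob soft delete on storage accounts.",
}
print(retrieve("storage account missing soft delete", docs))
```

Exact-match overlap misses plurals and synonyms ("account" does not match "accounts"), which is one reason a hybrid or vector index would retrieve more reliably.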
The solution performs a safe CloudOps review of an Azure subscription.
It can:
- collect live Azure inventory from Azure Resource Manager using the Azure Function managed identity
- analyze resources for CloudOps findings across cost, security, governance, reliability, and tagging
- retrieve relevant policy/runbook documents from Blob Storage
- generate a grounded LLM summary using Azure OpenAI
- validate the final response with an Evaluator Agent
- generate safe remediation draft scripts
- upload inventory, reports, remediation drafts, and agent responses to Blob Storage
- expose a Microsoft Foundry-compatible OpenAPI tool endpoint
- keep all remediation actions in PendingApproval
- avoid executing remediation automatically
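To show what a deterministic check over collected inventory might look like, here is a minimal sketch of a tagging analyzer. The required tag names, the finding fields, and the function name `find_missing_tags` are assumptions made for illustration; the lab's analyzers may be structured differently.

```python
# Hypothetical policy for this sketch, not the lab's real rule set.
REQUIRED_TAGS = ("owner", "environment")


def find_missing_tags(resources: list, required=REQUIRED_TAGS) -> list:
    """Return one finding per resource that lacks any required tag."""
    findings = []
    for resource in resources:
        # Treat a missing or null "tags" field as an untagged resource.
        tag_keys = {key.lower() for key in (resource.get("tags") or {})}
        missing = [tag for tag in required if tag not in tag_keys]
        if missing:
            findings.append({
                "category": "tagging",
                "resource_id": resource["id"],
                "missing_tags": missing,
                "status": "PendingApproval",
            })
    return findings
```

Because the check is a pure function of the inventory, the same input always yields the same findings, which is what lets the findings serve as the source of truth for the LLM summary.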
The Foundry Agent acts as the AI-facing supervisor.
It is configured with:
- agent instructions
- OpenAPI tool definition
- Function App key connection
- tool call to run_cloudops_review
The agent decides when to call the CloudOps review tool and presents the result to the user.
The Azure Function is the trusted backend.
It performs:
- managed identity token acquisition
- Azure Resource Manager REST calls
- Blob Storage REST calls
- live inventory collection
- analyzer orchestration
- policy/runbook retrieval
- Azure OpenAI REST call
- Evaluator Agent validation
- artifact upload
The deployed Function path uses REST APIs and the Python standard library for Azure ARM, Blob Storage, and Azure OpenAI calls.
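A standard-library-only call path could look roughly like the sketch below. It assumes the App Service / Functions managed identity endpoint (the `IDENTITY_ENDPOINT` and `IDENTITY_HEADER` environment variables with `api-version=2019-08-01`) and the ARM resource list operation. The function names are hypothetical, and the lab's `function_app.py` may differ in detail.

```python
import json
import os
import urllib.parse
import urllib.request

ARM = "https://management.azure.com"


def build_token_request(resource: str = ARM + "/") -> urllib.request.Request:
    """Build the managed identity token request for App Service / Functions."""
    endpoint = os.environ["IDENTITY_ENDPOINT"]
    query = urllib.parse.urlencode(
        {"resource": resource, "api-version": "2019-08-01"}
    )
    return urllib.request.Request(
        f"{endpoint}?{query}",
        headers={"X-IDENTITY-HEADER": os.environ["IDENTITY_HEADER"]},
    )


def build_inventory_request(subscription_id: str, token: str) -> urllib.request.Request:
    """Build an ARM request that lists the resources in one subscription."""
    url = f"{ARM}/subscriptions/{subscription_id}/resources?api-version=2021-04-01"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def fetch_json(request: urllib.request.Request) -> dict:
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Inside a deployed Function, the token would come from `fetch_json(build_token_request())["access_token"]`. A complete collector would also follow the `nextLink` field that ARM returns for paged results.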
Blob Storage is used for durable artifacts and lightweight knowledge storage.
Containers used by the lab:
inventory
reports
remediation
agent-responses
rules
snapshots
verification
The rules container stores Markdown runbooks and policies used for grounding.
Example documents:
diagnostic-settings-policy.md
key-vault-protection-policy.md
app-service-security-policy.md
storage-soft-delete-policy.md
storage-lifecycle-policy.md
tagging-policy.md
remediation-safety-policy.md
cost_optimization.md
The Evaluator Agent validates the response before it is returned.
It checks that:
- finding total is included
- approval / PendingApproval language is present
- the answer does not claim remediation was executed
- the answer does not include unsafe destructive commands
- retrieved policy/runbook knowledge is referenced when available
- major finding types are mentioned
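Several of these checks can be expressed as deterministic string tests. The sketch below is one possible shape, under stated assumptions: the regular expressions, the `"warn"` status value, and the choice to treat missing grounding as a warning rather than a failure are illustrative and not taken from the lab's evaluator.

```python
import re

# Hypothetical patterns; the lab's real evaluator may use different ones.
EXECUTED_CLAIM = re.compile(
    r"\b(has been|was|were)\s+(remediated|deleted|deallocated|fixed)\b", re.I
)
DESTRUCTIVE = re.compile(
    r"\b(az\s+group\s+delete|az\s+vm\s+delete|Remove-AzResource)\b", re.I
)


def evaluate(answer: str, total_findings: int, sources: list) -> dict:
    """Run deterministic checks over an LLM answer and count failures and warnings."""
    checks = {
        "finding_total_included": str(total_findings) in answer,
        "approval_language_present": bool(
            re.search(r"PendingApproval|approval", answer, re.I)
        ),
        "no_execution_claim": EXECUTED_CLAIM.search(answer) is None,
        "no_destructive_commands": DESTRUCTIVE.search(answer) is None,
    }
    # Grounding only matters when knowledge was actually retrieved.
    grounded = not sources or any(name in answer for name in sources)
    failed = sum(1 for passed in checks.values() if not passed)
    warnings = 0 if grounded else 1
    status = "fail" if failed else ("warn" if warnings else "pass")
    return {
        "agent": "EvaluatorAgent",
        "overall_status": status,
        "failed_checks": failed,
        "warning_checks": warnings,
    }
```

Keeping the evaluator free of LLM calls means its verdict is reproducible: the same answer always produces the same counts.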
A successful evaluation looks like:
{
"agent": "EvaluatorAgent",
"version": "phase4b-deterministic-v1",
"overall_status": "pass",
"failed_checks": 0,
"warning_checks": 0
}
The Foundry-facing tool endpoint is:
POST /api/tools/run-cloudops-review
It returns a compact response designed for Foundry Agent tool calling.
Response includes:
tool_name
status
remediation_executed
approval_required
inventory
finding_summary
response_evaluation
retrieved_knowledge_sources
answer
artifacts
safety
Example behavior:
Foundry Agent
-> calls run_cloudops_review
-> Azure Function collects live inventory
-> deterministic analyzers run
-> policy/runbook retrieval runs
-> Azure OpenAI summary is generated
-> Evaluator Agent validates result
-> compact tool response returns to Foundry
Base URL:
https://<function-app-name>.azurewebsites.net/api
Endpoints:
GET /api/health
GET /api/inventory/latest
POST /api/inventory/collect
POST /api/agent/chat
POST /api/agent/run-live
POST /api/tools/run-cloudops-review
GET /api/artifacts/list/{container}
GET /api/diagnostics/imports
The primary endpoint for Foundry integration is:
POST /api/tools/run-cloudops-review
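For testing outside Foundry, the endpoint can be called from a small client. The sketch below only builds the request. It assumes the Function uses key-based authorization through the standard `x-functions-key` header and accepts an empty JSON body; the request schema is not specified here, so treat the payload as a placeholder. `build_review_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_review_request(base_url, function_key, payload=None):
    """Build a POST to the review tool endpoint, authenticated with a Function key."""
    body = json.dumps(payload or {}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/tools/run-cloudops-review",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "x-functions-key": function_key,  # Azure Functions key header
        },
    )
```

Passing the result to `urllib.request.urlopen` sends the call. Read the key from an environment variable or a secret store rather than hard-coding it.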
agentic-cloudops-lab/
backend/
analyzers and review logic
cloud-control-plane/
function_app.py
host.json
foundry/
agentic-cloudops-foundry-tools.openapi.json
cloudops_supervisor_agent_instructions.md
PHASE5_FOUNDRY_SETUP.md
infra/
Terraform infrastructure
rules/
local source copies of policy/runbook documents
tools/
deployment and test scripts
docs/
images/
sample-output/
reports/
generated local report examples, if used
If your local structure differs, treat this as the intended clean structure and adjust folder names before publishing.
The commands below show the expected flow. Adjust variable names and script names to match your local implementation.
git clone https://github.com/net9876/agentic-cloudops-lab.git
cd agentic-cloudops-lab
cd infra
cp example.tfvars dev.tfvars
Edit dev.tfvars with your Azure subscription, region, naming prefix, and required AI settings.
terraform init
terraform plan -var-file="dev.tfvars"
terraform apply -var-file="dev.tfvars"
Example PowerShell flow:
cd ..\tools
.\deploy-function.ps1
curl https://<function-app-name>.azurewebsites.net/api/health
curl -X POST https://<function-app-name>.azurewebsites.net/api/tools/run-cloudops-review
A compact review response should look similar to this:
{
"tool_name": "run_cloudops_review",
"status": "completed",
"remediation_executed": false,
"approval_required": true,
"inventory": {
"resource_count": 42,
"subscription_scope": "lab-subscription"
},
"finding_summary": {
"total_findings": 12,
"security": 4,
"cost": 3,
"reliability": 2,
"tagging": 3
},
"response_evaluation": {
"overall_status": "pass",
"failed_checks": 0,
"warning_checks": 0
},
"safety": {
"mode": "draft_only",
"pending_approval": true
}
}
See also:
docs/sample-output/cloudops-review-response.json
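A caller can verify the safety-related fields of such a response before trusting it. The following is a minimal sketch: `check_safety_invariants` is a hypothetical helper, and the rule that per-category counts add up to `total_findings` is an assumption that happens to hold for the sample response.

```python
def check_safety_invariants(response: dict) -> list:
    """Return the violated invariants; an empty list means the response looks safe."""
    problems = []
    if response.get("remediation_executed") is not False:
        problems.append("remediation_executed must be false")
    if response.get("approval_required") is not True:
        problems.append("approval_required must be true")
    summary = response.get("finding_summary", {})
    by_category = sum(
        count for key, count in summary.items() if key != "total_findings"
    )
    if summary and by_category != summary.get("total_findings"):
        problems.append("category counts do not add up to total_findings")
    if response.get("response_evaluation", {}).get("overall_status") != "pass":
        problems.append("evaluator did not pass")
    return problems
```

Running this against the saved sample output should return an empty list. A non-empty list is a signal to stop and inspect the response rather than act on it.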
This lab follows a conservative security model:
- use managed identity where possible
- do not store Azure credentials in code
- do not commit .env, .tfvars, secrets, tokens, Function keys, or local state files
- generated remediation scripts are drafts only
- remediation is not executed automatically
- restrict Function App access for real environments
- use least-privilege RBAC for managed identities
- do not test against production subscriptions without review
- rotate any credentials accidentally exposed during testing
Recommended .gitignore coverage:
.terraform/
*.tfstate
*.tfstate.*
*.tfvars
.env
.env.*
local.settings.json
__pycache__/
.venv/
*.zip
This lab may create billable Azure resources:
- Azure Function App
- Storage Account and Blob transactions
- Application Insights / Log Analytics, if enabled
- Azure OpenAI / Foundry usage
- supporting networking or monitoring resources, depending on your Terraform configuration
Destroy lab resources after testing.
If the lab was deployed with Terraform:
cd infra
terraform destroy -var-file="dev.tfvars"
Before destroying, confirm that the resource group contains only lab resources.
If artifacts were created outside Terraform, remove them manually from:
inventory
reports
remediation
agent-responses
rules
snapshots
verification
Potential improvements:
- replace keyword retrieval with Azure AI Search hybrid/vector retrieval
- add GitHub Actions validation for Terraform and Python
- add sample screenshots and saved review artifacts
- add a local mock mode for users without Azure OpenAI quota
- add reusable analyzer plugins
- add richer cost optimization checks
- add optional dashboard for review history
- add end-to-end demo video or GIF
- add Azure Policy integration
- add stricter RBAC examples
This project intentionally does not include:
- automatic remediation execution
- production-grade approval workflow
- production network isolation design
- enterprise identity lifecycle management
- multi-tenant SaaS controls
- full SIEM/SOAR integration
- guaranteed cost savings
- support for every Azure resource type
Those areas are possible future extensions, but they are intentionally outside the current lab scope.
Recommended repository description:
Azure-hosted Agentic CloudOps lab using Foundry Agent, Azure Functions, deterministic analyzers, Azure OpenAI, and safe remediation drafts.
Recommended topics:
azure
cloudops
azure-functions
azure-openai
microsoft-foundry
agentic-ai
devops
terraform
finops
governance
sre
llmops
Add or update a LICENSE file before treating this as a reusable public open-source project.
MIT is usually a practical default for this type of lab, but choose the license that fits your intent.


