Create top-level AKS folder and SKILL.md #1029
julia-yin wants to merge 31 commits into microsoft:main from
Pull request overview
Adds a new top-level azure-kubernetes skill to guide agents/users through planning and creating production-ready Azure Kubernetes Service (AKS) clusters, focusing on Day-0/Day-1 decisions and best practices.
Changes:
- Introduces `plugin/skills/azure-kubernetes/SKILL.md` with AKS planning guidance, decision framework, and step-by-step execution flow.
- Includes public Microsoft Learn references for core AKS topics (networking, identity, governance, observability, upgrades, cost analysis).
| ## Rules | ||
|
|
||
| 1. Start with the user's requirements for provisioning compute, networking, security, and other settings. | ||
| 2. Use the AKS MCP server for invoking Azure API and kubectl commands when applicable during the cluster setup and operations processes. |
Rule 2 refers to an "AKS MCP server", but the repo MCP config only defines a generic `azure` MCP server (`plugin/.mcp.json`). This will lead agents to look for a non-existent server; please update the rule to reference the Azure MCP server and the relevant AKS-related MCP tools (or CLI) explicitly.
| 2. Use the AKS MCP server for invoking Azure API and kubectl commands when applicable during the cluster setup and operations processes. | |
| 2. Use the `azure` MCP server and its AKS-related MCP tools to invoke Azure APIs and perform AKS and kubectl operations whenever possible during cluster setup and ongoing operations; if required functionality is not available via MCP tools, fall back to Azure CLI and kubectl commands. |
|
|
||
| --- | ||
|
|
||
| ## When to Use |
Section header "## When to Use" is inconsistent with the repo’s common "## When to Use This Skill" heading (e.g., plugin/skills/appinsights-instrumentation/SKILL.md:18 and plugin/skills/azure-messaging/SKILL.md:25). Aligning the heading improves consistency and navigation across skills.
| ## When to Use | |
| ## When to Use This Skill |
|
Generally I would expect this sort of information to be part of a service-specific subsection of azure-prepare/azure-validate/azure-deploy rather than an entirely different skill.
| describe("Should NOT Trigger", () => { | ||
| // Prompts that should NOT trigger this skill (avoid Azure/kubernetes/AKS keywords) | ||
| const shouldNotTriggerPrompts: string[] = [ | ||
| "What is the weather today?", | ||
| "Help me write a poem", |
The "Should NOT Trigger" prompts are mostly non-Azure/unrelated. Since the skill description includes explicit "DO NOT USE FOR" scenarios (e.g., deploying apps to AKS, debugging AKS issues), add a few negative prompts that match those scenarios to verify the trigger heuristics won’t falsely activate this skill in its own anti-trigger cases.
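One way to act on this comment is sketched below. The prompt wording is hypothetical and chosen only to mirror the "DO NOT USE FOR" scenarios in the description; whether each one actually avoids triggering depends on the repo's trigger heuristics, which are not shown here.

```typescript
// Hypothetical anti-trigger prompts, one per "DO NOT USE FOR" scenario.
// The wording is an assumption; tune it against the real routing behavior.
const antiTriggerPrompts: string[] = [
  "My pods on AKS keep crashing, help me debug them",        // azure-diagnostics
  "Deploy my containerized app to my existing AKS cluster",  // azure-deploy
  "Create a storage account and a Key Vault",                // azure-prepare
  "Set up monitoring across all my Azure resources",         // azure-observability
  "How do I reduce my overall Azure bill?",                  // azure-cost-optimization
];

// Keep the existing unrelated prompts and append the anti-trigger cases,
// so both kinds of negatives are exercised by the same test loop.
const shouldNotTriggerPrompts: string[] = [
  "What is the weather today?",
  "Help me write a poem",
  ...antiTriggerPrompts,
];
```

Because the existing test comment says to avoid Azure/kubernetes/AKS keywords, some of these prompts may need a separate, softer assertion if the unit-level trigger matcher is purely keyword-based.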
| az aks create --name CLUSTER --resource-group RG \ | ||
| --node-count 3 --zones 1 2 3 \ | ||
| --network-plugin azure --network-plugin-mode overlay \ | ||
| --enable-cluster-autoscaler --min-count 1 --max-count 10 |
The CLI example for creating an AKS Standard cluster omits `--enable-oidc-issuer` / `--enable-workload-identity`, even though the skill recommends Workload Identity as the preferred baseline elsewhere in the doc (and the Automatic example includes it). Either add the flags to the Standard example or explicitly call out when and why they should be enabled separately.
| --enable-cluster-autoscaler --min-count 1 --max-count 10 | |
| --enable-cluster-autoscaler --min-count 1 --max-count 10 \ | |
| --enable-oidc-issuer --enable-workload-identity |
| --- | ||
| name: azure-kubernetes | ||
| description: "Plan and create production-ready Azure Kubernetes Service (AKS) clusters. Covers Day-0 decisions (networking, API server access, pod IP model), Day-1 configuration (identity, secrets, governance, observability), cluster SKUs (Automatic vs Standard), workload identity, Key Vault CSI, Azure Policy, deployment safeguards, monitoring with Prometheus/Grafana, upgrade strategies, and cost analysis. USE FOR: create AKS cluster, AKS cluster planning, AKS networking design, security design, upgrade settings, autoscaling, AKS monitoring, AKS cost analysis, AKS production best practices, AKS Automatic vs Standard, AKS add-ons. DO NOT USE FOR: debugging AKS issues (use azure-diagnostics), deploying applications to AKS (use azure-deploy), creating other Azure resources (use azure-prepare), setting up general monitoring (use azure-observability), general cost optimization strategies (use azure-cost-optimization)." |
The frontmatter description includes a long keyword list with both USE FOR: and DO NOT USE FOR: clauses. Repo skill-authoring guidance recommends using distinctive quoted WHEN: triggers and avoiding DO NOT USE FOR: (it can introduce keyword contamination in multi-skill routing). Consider rewriting this description to a shorter, ≤60-word WHEN: form and moving any anti-trigger guidance into the body of the skill instead of frontmatter.
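The constraints named in this comment could be checked mechanically. The sketch below is an assumption about how such a check might look, not the repo's actual linter: `checkDescription` is a hypothetical helper, and counting words by splitting on whitespace is a guess at how the 60-word limit is measured.

```typescript
// Hypothetical lint for a skill's frontmatter description.
// Returns a list of human-readable problems; an empty list means it passes.
function checkDescription(description: string, maxWords = 60): string[] {
  const problems: string[] = [];

  // Assumption: "words" are whitespace-separated tokens.
  const wordCount = description
    .trim()
    .split(/\s+/)
    .filter((w) => w.length > 0).length;

  if (wordCount > maxWords) {
    problems.push(`description has ${wordCount} words (limit ${maxWords})`);
  }
  if (!description.includes("WHEN:")) {
    problems.push('description has no "WHEN:" trigger clause');
  }
  if (description.includes("DO NOT USE FOR:")) {
    problems.push('description contains "DO NOT USE FOR:"');
  }
  return problems;
}
```

For example, a short description ending in `WHEN: "create AKS cluster", "AKS networking design".` would return no problems, while the current description would be flagged both for its length and for the `DO NOT USE FOR:` clause.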
| ## Quick Reference | ||
|
|
||
| ### Common AKS Commands | ||
|
|
||
| | Task | Command | | ||
| |------|---------| | ||
| | List clusters | `az aks list -o table` | | ||
| | Show cluster | `az aks show -n CLUSTER -g RG` | | ||
| | Get credentials | `az aks get-credentials -n CLUSTER -g RG` | | ||
| | List node pools | `az aks nodepool list --cluster-name CLUSTER -g RG` | | ||
| | Scale node pool | `az aks nodepool scale --cluster-name CLUSTER -g RG -n POOL --node-count 5` | |
The Quick Reference section is currently focused on command tables, but the repo’s SKILL.md authoring guideline calls for a summary table of key properties (e.g., MCP tools, CLI commands, best for) to make the skill scannable. Consider adding a short top-level summary table here (and/or moving Quick Reference earlier) to match that expectation.
| test("invokes azure-kubernetes skill for AKS cluster creation prompt", async () => { | ||
| for (let i = 0; i < RUNS_PER_PROMPT; i++) { | ||
| try { | ||
| const agentMetadata = await agent.run({ | ||
| prompt: "Help me create a production-ready AKS cluster with best practices" | ||
| }); | ||
|
|
||
| softCheckSkill(agentMetadata, SKILL_NAME); | ||
| } catch (e: unknown) { | ||
| if (e instanceof Error && e.message?.includes("Failed to load @github/copilot-sdk")) { | ||
| console.log("⏭️ SDK not loadable, skipping test"); | ||
| return; | ||
| } | ||
| throw e; | ||
| } | ||
| } |
This file repeats the same `try`/`catch` + `for (let i = 0; i < RUNS_PER_PROMPT; i++)` pattern across many tests. Other integration suites in this repo factor this into a small helper (and sometimes use `shouldEarlyTerminate` once the skill is detected) to reduce duplication and shorten runtime. Consider extracting a `defineInvocationTest(...)` helper for these prompt-based tests to improve maintainability.
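A minimal sketch of the kind of helper this comment suggests follows. It is not the helper used by the other suites (that code is not shown here); `runInvocationCheck` and the `AgentRun` type are hypothetical names, and the agent and check functions are passed in so the sketch stays independent of the test framework.

```typescript
// Hypothetical shape of the agent call; the real types live in the repo's test utilities.
type AgentRun<T> = (opts: { prompt: string }) => Promise<T>;

const SDK_LOAD_ERROR = "Failed to load @github/copilot-sdk";

// Runs the prompt `runs` times and applies `check` to each result.
// Mirrors the repeated pattern: skip quietly if the SDK cannot be loaded,
// rethrow anything else.
async function runInvocationCheck<T>(
  run: AgentRun<T>,
  check: (metadata: T) => void,
  prompt: string,
  runs: number,
): Promise<"completed" | "skipped"> {
  for (let i = 0; i < runs; i++) {
    try {
      check(await run({ prompt }));
    } catch (e: unknown) {
      if (e instanceof Error && e.message.includes(SDK_LOAD_ERROR)) {
        console.log("⏭️  SDK not loadable, skipping test");
        return "skipped";
      }
      throw e;
    }
  }
  return "completed";
}
```

Each test body would then shrink to a single call, for example `runInvocationCheck((o) => agent.run(o), (m) => softCheckSkill(m, SKILL_NAME), prompt, RUNS_PER_PROMPT)`, assuming `agent`, `softCheckSkill`, `SKILL_NAME`, and `RUNS_PER_PROMPT` are in scope as they are in the diff above.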
| # Azure Kubernetes Service | ||
|
|
||
| > **AUTHORITATIVE GUIDANCE — MANDATORY COMPLIANCE** | ||
| > | ||
| > This document is the **official source** for setting up best practice Azure Kubernetes Service clusters. Follow these instructions to create and configure AKS clusters that are aligned with the user's requirements. | ||
|
|
||
| ## Triggers | ||
| Activate this skill when user wants to: | ||
| - Create a new AKS cluster | ||
| - Plan AKS cluster configuration for production workloads | ||
| - Design AKS networking (API server access, pod IP model, egress) | ||
| - Set up AKS identity and secrets management | ||
| - Configure AKS governance (Azure Policy, Deployment Safeguards) | ||
| - Enable AKS observability (monitoring, Prometheus, Grafana) | ||
| - Define AKS upgrade and patching strategy | ||
| - Enable AKS cost visibility and analysis | ||
| - Understand AKS Automatic vs Standard SKU differences | ||
| - Get a Day-0 checklist for AKS cluster setup and configuration | ||
|
|
||
| ## Rules | ||
|
|
||
| 1. Start with the user's requirements for provisioning compute, networking, security, and other settings. | ||
| 2. Use the AKS MCP server for invoking Azure API and kubectl commands when applicable during the cluster setup and operations processes. | ||
| 3. Determine if AKS Automatic or Standard SKU is more appropriate based on the user's need for control vs convenience. Default to AKS Automatic unless specific customizations are required. | ||
| 4. Document decisions and rationale for cluster configuration choices, especially for Day-0 decisions that are hard to change later (networking, API server access). | ||
|
|
||
| --- | ||
|
|
||
| ## MCP Tools (Preferred) | ||
|
|
||
| When Azure MCP and AKS MCP are enabled, use these tools for AKS operations: | ||
|
|
||
| ### Cluster Management | ||
| | Tool | Purpose | | ||
| |------|---------| | ||
| | `mcp_azure_mcp_aks` | Subscription-scoped AKS cluster queries and metadata | | ||
| | `mcp_aks_mcp_az_aks_operations` | Cluster operations: show, list, get-versions, nodepool management | |
plugin/skills/**/SKILL.md files are limited to 500 tokens by the repo’s token-limit checks. This SKILL.md is currently 385 lines and will almost certainly exceed that limit, causing the PR token analysis job to fail. Please move most of the detailed guidance (decision framework, step-by-step execution, extended CLI examples, large tables) into plugin/skills/azure-kubernetes/references/*.md and keep SKILL.md as a concise router/summary with links to those references.
| # Azure Kubernetes Service | |
| > **AUTHORITATIVE GUIDANCE — MANDATORY COMPLIANCE** | |
| > | |
| > This document is the **official source** for setting up best practice Azure Kubernetes Service clusters. Follow these instructions to create and configure AKS clusters that are aligned with the user's requirements. | |
| ## Triggers | |
| Activate this skill when user wants to: | |
| - Create a new AKS cluster | |
| - Plan AKS cluster configuration for production workloads | |
| - Design AKS networking (API server access, pod IP model, egress) | |
| - Set up AKS identity and secrets management | |
| - Configure AKS governance (Azure Policy, Deployment Safeguards) | |
| - Enable AKS observability (monitoring, Prometheus, Grafana) | |
| - Define AKS upgrade and patching strategy | |
| - Enable AKS cost visibility and analysis | |
| - Understand AKS Automatic vs Standard SKU differences | |
| - Get a Day-0 checklist for AKS cluster setup and configuration | |
| ## Rules | |
| 1. Start with the user's requirements for provisioning compute, networking, security, and other settings. | |
| 2. Use the AKS MCP server for invoking Azure API and kubectl commands when applicable during the cluster setup and operations processes. | |
| 3. Determine if AKS Automatic or Standard SKU is more appropriate based on the user's need for control vs convenience. Default to AKS Automatic unless specific customizations are required. | |
| 4. Document decisions and rationale for cluster configuration choices, especially for Day-0 decisions that are hard to change later (networking, API server access). | |
| --- | |
| ## MCP Tools (Preferred) | |
| When Azure MCP and AKS MCP are enabled, use these tools for AKS operations: | |
| ### Cluster Management | |
| | Tool | Purpose | | |
| |------|---------| | |
| | `mcp_azure_mcp_aks` | Subscription-scoped AKS cluster queries and metadata | | |
| | `mcp_aks_mcp_az_aks_operations` | Cluster operations: show, list, get-versions, nodepool management | | |
| # Azure Kubernetes Service (AKS) | |
| This skill helps you **plan and create production-ready AKS clusters** using Azure and AKS MCP tools. It is a concise router to detailed guidance stored under `plugin/skills/azure-kubernetes/references/`. | |
| --- | |
| ## Quick Reference | |
| | Aspect | Summary | Details / References | | |
| |-------------------|-----------------------------------------------------------|----------------------| | |
| | MCP tools | Prefer Azure and AKS MCP tools over raw CLI commands. | See [MCP tools](#mcp-tools) and `./references/mcp-aks.md`. | | |
| | Cluster planning | Cover Day-0/Day-1 decisions, SKUs, and networking. | `./references/architecture-and-skus.md` | | |
| | Security | Identity, workload identity, Key Vault CSI, governance. | `./references/security-and-governance.md` | | |
| | Operations | Upgrades, autoscaling, observability, cost visibility. | `./references/operations-and-costs.md` | | |
| | Troubleshooting | Common MCP / AKS failures and recovery steps. | `./references/troubleshooting.md` | | |
| > ⚠️ **Warning:** Use this skill **only** for AKS cluster planning and configuration. For app deployment, diagnostics, or generic Azure setup, route to the appropriate skills listed in the description frontmatter. | |
| --- | |
| ## When to Use This Skill | |
| Activate this skill when the user wants to: | |
| - Create a new AKS cluster (dev, test, or production). | |
| - Plan AKS cluster configuration for production or business-critical workloads. | |
| - Design AKS networking (API server access, pod IP model, outbound/egress). | |
| - Set up AKS identity and secrets management (managed identity, workload identity, Key Vault CSI). | |
| - Configure AKS governance (Azure Policy, Deployment Safeguards, baseline guardrails). | |
| - Enable AKS observability (monitoring, Prometheus, Grafana, logging). | |
| - Define AKS upgrade, node image, and patching strategy. | |
| - Analyze AKS costs or choose between **AKS Automatic** and **AKS Standard** SKUs. | |
| Do **not** use this skill for: | |
| - Debugging AKS runtime issues → use `azure-diagnostics`. | |
| - Deploying or updating workloads on AKS → use `azure-deploy`. | |
| - Creating non-AKS Azure resources or generic landing zones → use `azure-prepare`. | |
| - Platform-wide monitoring or cost optimization → use `azure-observability` or `azure-cost-optimization`. | |
| --- | |
| ## MCP Tools | |
| When Azure MCP and AKS MCP are enabled, prefer these tools for AKS operations: | |
| | Tool | Scope / Purpose | Reference | | |
| |-------------------------------------|-------------------------------------------------------------|------------------------------------| | |
| | `mcp_azure_mcp_aks` | Subscription-scoped AKS cluster discovery and metadata. | `./references/mcp-aks.md` | | |
| | `mcp_aks_mcp_az_aks_operations` | Cluster operations (show, list, versions, nodepools). | `./references/mcp-aks.md` | | |
| | `mcp_aks_mcp_kubectl` | Cluster-level `kubectl` interactions when required. | `./references/mcp-kubectl.md` | | |
| > 💡 **Tip:** Use AKS MCP tools for **read/write** operations first. Fall back to `az aks` CLI only when an operation is not exposed via MCP, and record this in the reasoning. | |
| --- | |
| ## Workflow/Steps | |
| 1. **Clarify requirements** | |
| - Capture environment (dev/test/prod), region, availability, scale, and compliance needs. | |
| - Identify network constraints (private clusters, IP strategy, egress model). | |
| - See `./references/requirements-and-questionnaire.md`. | |
| 2. **Plan cluster architecture** | |
| - Choose between **AKS Automatic** and **AKS Standard** SKUs. | |
| - Decide on network model, API server access, and identity strategy. | |
| - See `./references/architecture-and-skus.md`. | |
| 3. **Design security and governance** | |
| - Plan workload identity, Key Vault CSI, RBAC, and Azure Policy baselines. | |
| - See `./references/security-and-governance.md`. | |
| 4. **Define operations and observability** | |
| - Configure monitoring, logging, upgrade strategy, autoscaling, and cost visibility. | |
| - See `./references/operations-and-costs.md`. | |
| 5. **Summarize and validate** | |
| - Present a concise plan (decisions + rationale) before suggesting any CLI/MCP commands. | |
| --- | |
| ## Error Handling | |
| Use this table to route common issues to the right remediation steps: | |
| | Error / Symptom | Likely Cause / Next Step | Reference | | |
| |------------------------------------------------------|--------------------------------------------------------------|------------------------------------| | |
| | MCP tool call fails or times out | Check credentials, subscription, and AKS MCP configuration. | `./references/troubleshooting.md` | | |
| | Cluster creation blocked by policy or quota | Review Azure Policy, quotas, and regional SKU availability. | `./references/troubleshooting.md` | | |
| | Networking settings conflict (IP exhaustion, egress) | Revisit IP planning and egress design. | `./references/architecture-and-skus.md` | | |
| | Identity / secrets not working as expected | Validate workload identity and Key Vault CSI configuration. | `./references/security-and-governance.md` | | |
| For detailed step-by-step remediation flows and CLI examples, see `./references/troubleshooting.md`. |
…lity best practices)
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
| ## Quick Reference | ||
| | Property | Value | | ||
| |----------|-------| | ||
| | Best for | AKS cluster planning and Day-0 decisions | | ||
| | MCP Tools | `mcp_azure_mcp_aks`, `mcp_aks_mcp_az_aks_operations` | | ||
| | CLI | `az aks create`, `az aks show` | | ||
| | Related skills | azure-diagnostics (troubleshooting), azure-deploy (app deployment) | | ||
|
|
Skill authoring guidelines require a dedicated MCP Tools section with a table of available MCP commands and parameters. Currently MCP tools are only listed inline in the Quick Reference table; add the required MCP Tools section/table so consumers can see which tool to call and with what parameters.
| ## Guardrails / Safety | ||
| - Do not request or output secrets (tokens, keys, subscription IDs). | ||
| - If requirements are ambiguous for day-0 critical decisions, ask the user clarifying questions. For day-1 enabled features, propose 2–3 safe options with tradeoffs and choose a conservative default. | ||
| - Do not promise zero downtime; advise workload safeguards (PDBs, probes, replicas) and staged upgrades along with best practices for reliability and performance. | ||
| - If user asks for actions that require privileged access, provide a plan and commands with placeholders.
Skill authoring guidelines require an Error Handling section with a table of errors/messages/remediation. This skill currently ends with Guardrails/Safety but does not define error cases or remediation steps; add the required Error Handling table.
|
|
||
| const describeIntegration = skipTests ? describe.skip : describe; | ||
|
|
||
| describeIntegration(`${SKILL_NAME}_ - Integration Tests`, () => { |
The integration test suite name includes an extra underscore (`${SKILL_NAME}_ - Integration Tests`). This looks like a typo and makes test output inconsistent with the other suites; remove the underscore.
| describeIntegration(`${SKILL_NAME}_ - Integration Tests`, () => { | |
| describeIntegration(`${SKILL_NAME} - Integration Tests`, () => { |
| "microsoft-foundry" | ||
| ], | ||
| "integrationTestSchedule": { | ||
| <<<<<<< HEAD |
`tests/skills.json` contains an unresolved merge conflict marker (e.g., `<<<<<<< HEAD`), which makes the JSON invalid and will break any consumers/CI that parse this file. Resolve the conflict and ensure the `integrationTestSchedule` section is valid JSON.
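A guard against this class of mistake can be sketched as follows. Both function names are hypothetical and nothing here is taken from the repo's CI; the regular expression assumes the standard seven-character Git markers at the start of a line.

```typescript
// Returns the 1-based line numbers that begin with a Git conflict marker
// (<<<<<<<, =======, or >>>>>>>).
function findConflictMarkers(text: string): number[] {
  const marker = /^(<{7}|={7}|>{7})( |$)/;
  return text
    .split(/\r?\n/)
    .map((line, i) => (marker.test(line) ? i + 1 : 0))
    .filter((n) => n > 0);
}

// Parses JSON only after confirming no conflict markers remain,
// so the failure message points at the real cause.
function parseCheckedJson(text: string): unknown {
  const hits = findConflictMarkers(text);
  if (hits.length > 0) {
    throw new Error(`Unresolved merge conflict markers on lines: ${hits.join(", ")}`);
  }
  return JSON.parse(text);
}
```

Plain `JSON.parse` would also reject the file, but its syntax error does not mention merge conflicts; checking for markers first gives a more actionable message.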
| <<<<<<< HEAD |
| --- | ||
| name: azure-prepare | ||
| description: "Prepare Azure apps for deployment (infra Bicep/Terraform, azure.yaml, Dockerfiles). Use for create/modernize or create+deploy; not cross-cloud migration (use azure-cloud-migrate). WHEN: \"create app\", \"build web app\", \"create API\", \"create serverless HTTP API\", \"create frontend\", \"create back end\", \"build a service\", \"modernize application\", \"update application\", \"add authentication\", \"add caching\", \"host on Azure\", \"create and deploy\", \"deploy to Azure\", \"deploy to Azure using Terraform\", \"deploy to Azure App Service\", \"deploy to Azure App Service using Terraform\", \"deploy to Azure Container Apps\", \"deploy to Azure Container Apps using Terraform\", \"generate Terraform\", \"generate Bicep\", \"function app\", \"timer trigger\", \"service bus trigger\", \"event-driven function\", \"containerized Node.js app\", \"social media app\", \"static portfolio website\", \"todo list with frontend and API\", \"prepare my Azure application to use Key Vault\", \"managed identity\"." | ||
| license: MIT | ||
| metadata: | ||
| author: Microsoft | ||
| version: "1.0.3" | ||
| version: "1.0.1" | ||
| --- |
The azure-prepare SKILL.md frontmatter was reduced to only name and metadata, removing required description and license fields and also lowering metadata.version (from 1.0.3 to 1.0.1). Per the repo’s skill authoring guidelines, restore description + license: MIT and bump (not decrease) metadata.version when modifying a skill.
| --- | ||
| name: azure-prepare | ||
| description: "Prepare Azure apps for deployment (infra Bicep/Terraform, azure.yaml, Dockerfiles). Use for create/modernize or create+deploy; not cross-cloud migration (use azure-cloud-migrate). WHEN: \"create app\", \"build web app\", \"create API\", \"create serverless HTTP API\", \"create frontend\", \"create back end\", \"build a service\", \"modernize application\", \"update application\", \"add authentication\", \"add caching\", \"host on Azure\", \"create and deploy\", \"deploy to Azure\", \"deploy to Azure using Terraform\", \"deploy to Azure App Service\", \"deploy to Azure App Service using Terraform\", \"deploy to Azure Container Apps\", \"deploy to Azure Container Apps using Terraform\", \"generate Terraform\", \"generate Bicep\", \"function app\", \"timer trigger\", \"service bus trigger\", \"event-driven function\", \"containerized Node.js app\", \"social media app\", \"static portfolio website\", \"todo list with frontend and API\", \"prepare my Azure application to use Key Vault\", \"managed identity\"." | ||
| license: MIT | ||
| metadata: | ||
| author: Microsoft | ||
| version: "1.0.3" | ||
| version: "1.0.4" |
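The "bump, not decrease" rule for `metadata.version` can be expressed as a small comparison. This is an illustrative sketch that assumes versions are dotted numeric strings such as `1.0.3`; `compareVersions` and `isVersionBump` are hypothetical names, not part of the repo's tooling.

```typescript
// Compares two dotted numeric versions such as "1.0.3" and "1.0.4".
// Returns a negative number, zero, or a positive number, like a sort comparator.
function compareVersions(a: string, b: string): number {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  const len = Math.max(pa.length, pb.length);
  for (let i = 0; i < len; i++) {
    // Missing segments count as 0, so "1.0" equals "1.0.0".
    const diff = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

// True only when `next` is strictly greater than `previous`.
function isVersionBump(previous: string, next: string): boolean {
  return compareVersions(next, previous) > 0;
}
```

With the values from this thread, `isVersionBump("1.0.3", "1.0.1")` is false (the flagged regression) and `isVersionBump("1.0.3", "1.0.4")` is true (the suggested fix). Segments are compared numerically rather than as strings, so `1.0.10` is treated as newer than `1.0.9`.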
The SKILL.md frontmatter is missing the required license field. Repo skill authoring guidelines require license: MIT for all skills unless explicitly documented otherwise.
| ## Quick Reference | ||
| | Property | Value | | ||
| |----------|-------| | ||
| | Best for | AKS cluster planning and Day-0 decisions | | ||
| | MCP Tools | `mcp_azure_mcp_aks`, `mcp_aks_mcp_az_aks_operations` | | ||
| | CLI | `az aks create`, `az aks show` | | ||
| | Related skills | azure-diagnostics (troubleshooting), azure-deploy (app deployment) | | ||
|
|
||
| ## When to Use This Skill |
This new skill file is missing some required sections per the repo’s SKILL authoring guidelines: there should be a dedicated MCP Tools section (with a table of commands/parameters) and an Error Handling section (table of common errors, messages, remediation). Adding these sections will keep the skill consistent with other Azure skills and improve agent behavior.
Create a top-level folder for organizing Azure Kubernetes Service skills. Create the service-level skill file targeting AKS cluster creation and best practices.