Commit 70cb80e

Merge pull request #29 from DevExpGbb/act-3
Act 3
2 parents 857fccf + 5b6d592 commit 70cb80e

8 files changed

Lines changed: 1044 additions & 70 deletions
Lines changed: 37 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,37 @@
1+
---
2+
model: Claude Sonnet 4
3+
description: 'This prompt is used to check the health status of nodes in an Azure Kubernetes Service (AKS) cluster.'
4+
---
5+
6+
# Check for AKS Nodes Health Issues
7+
8+
Check the health status of all nodes in an Azure Kubernetes Service (AKS) cluster and identify any nodes that are not in a 'Ready' state. Provide a summary of the issues found and suggest possible remediation steps.
9+
10+
### Run these Commands
11+
12+
```bash
13+
kubectl get nodes
14+
kubectl describe node <node-name>
15+
kubectl top nodes
16+
kubectl cluster-info
17+
```
18+
19+
20+
### Output
21+
Output a report in a readable format (e.g., plain text, JSON) that includes:
22+
- Cluster Name
23+
- Node Name
24+
- Node Status
25+
- Issues Found (if any)
26+
- Suggested Remediation Steps
27+
28+
### Remediation Suggestions
29+
For nodes that are not in the 'Ready' state, suggest possible remediation steps such as:
30+
- Checking for resource constraints (CPU, memory)
31+
- Reviewing node logs for errors
32+
- Scaling the cluster if resource limits are being hit
33+
- Contacting Azure support if the issue persists
34+
35+
### Note
36+
Ensure that you have the necessary permissions to access the AKS clusters and perform the required operations.
37+
Do not generate any scripts.
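For reviewers, the "not Ready" check that this prompt asks the agent to perform can be expressed against the JSON form of the node list. The following is a minimal sketch, assuming the output of `kubectl get nodes -o json` has already been parsed into an object. The helper name `findNotReadyNodes` and the sample data are hypothetical and not part of this commit.

```javascript
// Hypothetical helper (not part of this commit): list nodes whose
// "Ready" condition is anything other than "True".
// Input: the parsed output of `kubectl get nodes -o json` (a NodeList).
function findNotReadyNodes(nodeList) {
  return (nodeList.items || [])
    .map((node) => {
      const conditions = (node.status && node.status.conditions) || [];
      const ready = conditions.find((c) => c.type === 'Ready');
      return {
        name: node.metadata.name,
        // A node without a Ready condition is treated as not ready.
        status: ready ? ready.status : 'Unknown',
        reason: ready ? ready.reason || '' : 'NoReadyCondition',
      };
    })
    .filter((n) => n.status !== 'True');
}

// Made-up sample data, for illustration only.
const sampleNodes = {
  items: [
    {
      metadata: { name: 'node-a' },
      status: { conditions: [{ type: 'Ready', status: 'True', reason: 'KubeletReady' }] },
    },
    {
      metadata: { name: 'node-b' },
      status: { conditions: [{ type: 'Ready', status: 'Unknown', reason: 'NodeStatusUnknown' }] },
    },
  ],
};

console.log(findNotReadyNodes(sampleNodes));
```

With the sample data, only `node-b` is reported, because its `Ready` condition is `Unknown` rather than `True`.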
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
1+
---
2+
model: Claude Sonnet 4.5
3+
description: 'This prompt is used to check the health status of pods in an Azure Kubernetes Service (AKS) cluster.'
4+
---
5+
6+
# Check for Pod Health Issues
7+
8+
Check the health status of all pods in an Azure Kubernetes Service (AKS) cluster and identify any pods that are not in a 'Running' state. Provide a summary of the issues found and suggest possible remediation steps.
9+
10+
### Run these Commands
11+
12+
```bash
13+
kubectl get pods -n <namespace>
14+
kubectl describe pod <pod-name> -n <namespace>
15+
kubectl logs <pod-name> -n <namespace>
16+
```
17+
18+
### Output
19+
Output a report in a readable format (e.g., plain text, JSON) that includes:
20+
- Cluster Name
21+
- Pod Name
22+
- Pod Status
23+
- Issues Found (if any)
24+
- Suggested Remediation Steps
25+
26+
### Remediation Suggestions
27+
For pods that are not in the 'Running' state, suggest possible remediation steps such as:
28+
- Checking for resource constraints (CPU, memory)
29+
- Reviewing pod logs for errors
30+
- Scaling the cluster if resource limits are being hit
31+
- Redeploying the pod if it is in a crash loop
32+
33+
### Note
34+
Do not generate any scripts.
35+
Do not directly fix the issues; only provide analysis and suggestions.
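For reviewers, the pod check described in this prompt can be sketched against the JSON form of the pod list. This is a minimal illustration, assuming the output of `kubectl get pods -n <namespace> -o json` has been parsed into an object. The helper name `findUnhealthyPods` and the sample data are hypothetical. Flagging containers stuck in a waiting state is an added assumption that goes beyond the prompt's phase-only wording.

```javascript
// Hypothetical helper (not part of this commit): flag pods that are not
// in the Running phase, or that have a container stuck in a waiting state
// (for example CrashLoopBackOff, where the pod phase can still be Running).
// Input: the parsed output of `kubectl get pods -n <namespace> -o json`.
function findUnhealthyPods(podList) {
  const findings = [];
  for (const pod of podList.items || []) {
    const status = pod.status || {};
    const waiting = (status.containerStatuses || [])
      .filter((cs) => cs.state && cs.state.waiting)
      .map((cs) => `${cs.name}: ${cs.state.waiting.reason}`);
    if (status.phase !== 'Running' || waiting.length > 0) {
      findings.push({
        name: pod.metadata.name,
        namespace: pod.metadata.namespace,
        phase: status.phase || 'Unknown',
        waiting,
      });
    }
  }
  return findings;
}

// Made-up sample data, for illustration only.
const samplePods = {
  items: [
    { metadata: { name: 'web-1', namespace: 'demo' }, status: { phase: 'Running', containerStatuses: [] } },
    {
      metadata: { name: 'api-1', namespace: 'demo' },
      status: {
        phase: 'Running',
        containerStatuses: [{ name: 'api', state: { waiting: { reason: 'CrashLoopBackOff' } } }],
      },
    },
    { metadata: { name: 'job-1', namespace: 'demo' }, status: { phase: 'Pending' } },
  ],
};

console.log(findUnhealthyPods(samplePods).map((p) => p.name));
```

Note that a strict "not Running" reading also flags pods of completed Jobs, which report the `Succeeded` phase; whether those count as issues is a judgment the report should make explicit.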
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
1+
---
2+
model: Claude Sonnet 4.5
3+
description: 'This prompt is used to provide remediation suggestions for pods in an Azure Kubernetes Service (AKS) cluster.'
4+
---
5+
6+
# AKS Remediation for cluster issues
7+
8+
Provide remediation based on analysis and suggestions from the previous steps.
9+
10+
### Proposed Remediation Steps
11+
Be specific in your remediation suggestions, including commands to run, configuration changes to make, or resources to consult. Tailor the suggestions based on the identified issues.
12+
13+
### Notes
14+
- Do not generate any scripts.
15+
- Always ask for confirmation before applying any remediation steps.

.github/workflows/argocd-deployment-failure.yml

Lines changed: 55 additions & 70 deletions
@@ -13,102 +13,87 @@ jobs:
1313
runs-on: ubuntu-latest
1414

1515
steps:
16+
- name: Verify webhook signature
17+
id: verify
18+
env:
19+
PAYLOAD: ${{ toJson(github.event.client_payload) }}
20+
WEBHOOK_SECRET: ${{ secrets.ARGOCD_WEBHOOK_SECRET }}
21+
run: |
22+
# This is a placeholder - GitHub repository_dispatch doesn't include signatures
23+
# The security comes from the GitHub token scope limitation
24+
echo "Webhook received from ArgoCD"
25+
echo "App: $(jq -r '.app_name' <<< "$PAYLOAD")"
26+
echo "Status: $(jq -r '.operation_phase' <<< "$PAYLOAD")"
27+
28+
- name: Extract deployment info
29+
id: deployment_info
30+
run: |
31+
APP_NAME="${{ github.event.client_payload.app_name }}"
32+
HEALTH_STATUS="${{ github.event.client_payload.health_status }}"
33+
SYNC_STATUS="${{ github.event.client_payload.sync_status }}"
34+
REVISION="${{ github.event.client_payload.revision }}"
35+
MESSAGE="${{ github.event.client_payload.message }}"
36+
REPO_URL="${{ github.event.client_payload.repo_url }}"
37+
TIMESTAMP="${{ github.event.client_payload.timestamp }}"
38+
39+
echo "app_name=${APP_NAME}" >> "$GITHUB_OUTPUT"
40+
echo "health_status=${HEALTH_STATUS}" >> "$GITHUB_OUTPUT"
41+
echo "sync_status=${SYNC_STATUS}" >> "$GITHUB_OUTPUT"
42+
echo "revision=${REVISION}" >> "$GITHUB_OUTPUT"
43+
1644
- name: Create GitHub Issue
1745
uses: actions/github-script@v7
1846
with:
1947
script: |
20-
const payload = context.payload.client_payload || {};
21-
const appName = payload.app_name || 'unknown';
22-
const clusterName = payload.cluster || 'in-cluster';
23-
const namespace = payload.namespace || 'default';
24-
const healthStatus = payload.health_status || 'unknown';
25-
const syncStatus = payload.sync_status || 'unknown';
26-
const message = payload.message || 'No error message available';
27-
const revision = payload.revision || 'unknown';
28-
const repoUrl = payload.repo_url || '';
29-
const timestamp = payload.timestamp || new Date().toISOString();
30-
const resources = payload.resources || [];
31-
32-
// Build degraded resources section
33-
let degradedDetails = '';
34-
const degradedResources = resources.filter(r =>
35-
r.health && (r.health.status === 'Degraded' || r.health.status === 'Missing' || r.health.status === 'Unknown')
36-
);
37-
38-
if (degradedResources.length > 0) {
39-
degradedDetails = '\n### 🔴 Degraded Resources\n\n';
40-
41-
for (const resource of degradedResources) {
42-
const kind = resource.kind || 'Unknown';
43-
const name = resource.name || 'unknown';
44-
const resourceNamespace = resource.namespace || namespace;
45-
const healthStatus = resource.health?.status || 'Unknown';
46-
const healthMessage = resource.health?.message || 'No message';
47-
const syncStatus = resource.status || 'Unknown';
48-
49-
degradedDetails += `#### ${kind}: \`${name}\`\n\n`;
50-
degradedDetails += `- **Namespace:** ${resourceNamespace}\n`;
51-
degradedDetails += `- **Health Status:** ${healthStatus}\n`;
52-
degradedDetails += `- **Sync Status:** ${syncStatus}\n`;
53-
degradedDetails += `- **Message:** ${healthMessage}\n\n`;
54-
55-
// Add kubectl command for this specific resource
56-
degradedDetails += `**Troubleshoot:**\n\`\`\`bash\n`;
57-
degradedDetails += `kubectl describe ${kind.toLowerCase()} ${name} -n ${resourceNamespace}\n`;
58-
if (kind === 'Pod' || kind === 'Deployment' || kind === 'StatefulSet' || kind === 'DaemonSet') {
59-
degradedDetails += `kubectl logs ${kind.toLowerCase()}/${name} -n ${resourceNamespace}\n`;
60-
}
61-
degradedDetails += `\`\`\`\n\n`;
62-
}
63-
}
48+
const appName = context.payload.client_payload?.app_name ?? '';
49+
const healthStatus = context.payload.client_payload?.health_status ?? '';
50+
const syncStatus = context.payload.client_payload?.sync_status ?? '';
51+
const operationPhase = context.payload.client_payload?.operation_phase ?? '';
52+
const message = context.payload.client_payload?.message ?? '';
53+
const revision = context.payload.client_payload?.revision ?? '';
54+
const repoUrl = context.payload.client_payload?.repo_url ?? '';
55+
const timestamp = context.payload.client_payload?.timestamp ?? '';
56+
const clusterName = context.payload.client_payload?.cluster_name ?? '';
57+
const clusterServer = context.payload.client_payload?.cluster_server ?? '';
58+
const destNamespace = context.payload.client_payload?.destination_namespace ?? '';
6459
6560
const issueTitle = `🚨 ArgoCD Deployment Failed: ${appName}`;
6661
6762
const issueBody = `## ArgoCD Deployment Failure
6863
6964
**Application:** \`${appName}\`
65+
**Status:** ${operationPhase}
7066
**Timestamp:** ${timestamp}
7167
72-
### Cluster Information
73-
74-
| Field | Value |
75-
|-------|-------|
76-
| Cluster Name | \`${clusterName}\` |
77-
| Namespace | \`${namespace}\` |
78-
79-
### Application Status
68+
### Details
8069
8170
| Field | Value |
8271
|-------|-------|
72+
| Cluster | \`${clusterName || clusterServer}\` |
73+
| Namespace | \`${destNamespace}\` |
8374
| Health Status | \`${healthStatus}\` |
8475
| Sync Status | \`${syncStatus}\` |
8576
| Revision | \`${revision}\` |
8677
| Repository | ${repoUrl} |
8778
79+
### Raw payload
80+
\`\`\`json
81+
${JSON.stringify(context.payload.client_payload, null, 2)}
82+
\`\`\`
83+
8884
### Error Message
8985
9086
\`\`\`
91-
${message}
87+
${message || 'No error message available'}
9288
\`\`\`
93-
${degradedDetails}
94-
### Troubleshooting Commands
95-
96-
\`\`\`bash
97-
# Check application status in ArgoCD
98-
argocd app get ${appName}
9989
100-
# Check pods in namespace
101-
kubectl get pods -n ${namespace}
90+
### Recommended Actions
10291
103-
# Describe failed pods
104-
kubectl describe pods -n ${namespace}
105-
106-
# Get pod logs
107-
kubectl logs -n ${namespace} <pod-name>
108-
109-
# Check events
110-
kubectl get events -n ${namespace} --sort-by='.lastTimestamp'
111-
\`\`\`
92+
1. Check the ArgoCD UI for detailed error logs
93+
2. Review the application manifest for syntax errors
94+
3. Verify resource quotas and limits
95+
4. Check for image pull errors or missing secrets
96+
5. Review recent commits to the source repository
11297
11398
### Quick Links
11499
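The issue body in this workflow is assembled from fields of the `repository_dispatch` payload. Below is a hedged sketch of how that assembly could be isolated in a plain function that takes an already-parsed payload object (for example `context.payload.client_payload` inside `actions/github-script`), so that values containing quotes or newlines are treated as data and cannot break the script. The function name `buildIssueBody`, the `'unknown'` fallback, and the sample payload are assumptions for illustration, not the workflow's actual code.

```javascript
// Hypothetical helper (not part of this commit): build the Markdown issue
// body from a parsed client_payload object instead of from string literals.
function buildIssueBody(payload = {}) {
  // Return the field as a string, or the fallback when it is missing or empty.
  const field = (key, fallback = 'unknown') => {
    const value = payload[key];
    return value === undefined || value === null || value === '' ? fallback : String(value);
  };

  const fence = '`'.repeat(3); // Markdown code fence, built without a literal fence
  const rows = [
    ['Cluster', field('cluster_name', field('cluster_server'))],
    ['Namespace', field('destination_namespace')],
    ['Health Status', field('health_status')],
    ['Sync Status', field('sync_status')],
    ['Revision', field('revision')],
  ];

  return [
    '## ArgoCD Deployment Failure',
    '',
    `**Application:** \`${field('app_name')}\``,
    `**Status:** ${field('operation_phase')}`,
    '',
    '| Field | Value |',
    '|-------|-------|',
    ...rows.map(([name, value]) => `| ${name} | \`${value}\` |`),
    '',
    '### Error Message',
    '',
    fence,
    field('message', 'No error message available'),
    fence,
  ].join('\n');
}

// Made-up payload: the message contains a quote and a newline on purpose.
const samplePayload = {
  app_name: 'guestbook',
  operation_phase: 'Failed',
  message: "container 'web' failed\nimage pull error",
};

console.log(buildIssueBody(samplePayload));
```

Inside the workflow, a function like this could be called with `context.payload.client_payload` and its result passed as the `body` of the created issue.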

Act-3/README.md

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
1+
# Act-3: Kubernetes Operations Don’t Scale Linearly
2+
3+
Problem:
4+
Kubernetes becomes the operational choke point and your team is having a hard time dealing with misconfigurations, failed deployments and runtime issues.
5+
Your team, platform engineering, is busy firefighting instead of improving the platform. The deep Kubernetes expertise on your team doesn't scale across teams.
6+
7+
Answer:
8+
Let agents give your team a hand, turning siloed operational knowledge into a shared capability.
9+
10+
## Crawl
11+
12+
A senior member of the team (Steve) has created reusable prompts that anyone can run on demand when they need to troubleshoot a container workload on an AKS cluster. Steve committed them to the repo, so they are available in GitHub Copilot in VS Code as "Slash Commands", provided you follow the folder and naming convention set out by GitHub/VS Code (i.e. `<repo-root>/.github/prompts/<prompt-name>.prompt.md`).
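To make the naming convention concrete, here is a small sketch of how a prompt file path maps to the slash command name. The helper `slashCommandFor` is hypothetical and only mirrors the `<prompt-name>.prompt.md` pattern quoted above; the file names used are examples, not files confirmed to exist in this repo.

```javascript
// Hypothetical helper: derive the slash command from a prompt file path that
// follows the `.github/prompts/<prompt-name>.prompt.md` convention.
// Returns null when the path does not follow the convention.
function slashCommandFor(promptPath) {
  const match = /(?:^|\/)\.github\/prompts\/([^/]+)\.prompt\.md$/.exec(promptPath);
  return match ? `/${match[1]}` : null;
}

console.log(slashCommandFor('.github/prompts/write-docs.prompt.md'));
console.log(slashCommandFor('docs/write-docs.md'));
```

A file outside `.github/prompts/`, or one without the `.prompt.md` suffix, does not match the convention and yields `null` here.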
13+
14+
Execute this prompt locally:
15+
16+
![write-prompt](images/write-prompt.png)
17+
18+
## Walk/Run
19+
20+
Create a GitHub Actions workflow that runs on each push to the repo. In this example it runs only for the main branch, but you can configure the triggers/rules that determine when the workflow runs. See the docs about [Events That Trigger Workflows](https://docs.github.com/en/actions/reference/workflows-and-actions/events-that-trigger-workflows).
21+
22+
> [!NOTE]
23+
> We will use the GitHub Copilot CLI to automate the execution of our custom prompt in a scripted CI Runner - GitHub Actions.
24+
25+
We have an example of this in [Act-2 .github/workflows](../.github/workflows/copilot.generate-docs.yml).
26+
27+
### What does this do?
28+
29+
- The GitHub Actions workflow triggers on each push to the main branch, so documentation is created when needed, whether or not anyone remembered to do it. All team members get docs generated for them, even if they did not run the `/write-docs` prompt manually before committing their changes. It can also be run manually in GitHub Actions because the `workflow_dispatch` trigger is enabled; this is optional, but it is included here as an example.
30+
- It installs the GitHub Copilot CLI
31+
- It ensures that we provide it credentials to call GitHub Copilot
32+
> [!NOTE]
33+
> Currently, calling GitHub Copilot is a user-only ability: GitHub Copilot is licensed to, and therefore only callable by, a human user account. In this example we have stored a fine-grained GitHub Personal Access Token (PAT, a user-bound API key) scoped with the `Copilot-Requests: Read-only` permission. As a result, this consumes GitHub Copilot PRUs (Premium Request Units) from the associated user account. Today this is the only billing model for consuming GitHub Copilot.
34+
- It stores the required prompt file contents in an environment variable
35+
- It passes in the prompt and calls the GitHub Copilot CLI to generate docs
