feat: add incident response prompt template #386
Fixes microsoft#319

Add incident response workflow prompt for Azure operations scenarios with:

- Initial triage and severity assessment
- Diagnostic KQL queries for Azure Monitor and Log Analytics
- Mitigation patterns and communication templates
- Root cause analysis documentation structure

Includes Azure-specific patterns for resource health, Activity Log, Application Insights, and service health monitoring.
|
|
||
| ```kql | ||
| // Check Azure Resource Health events | ||
| AzureActivity |
I don't believe we can implicitly know exactly which KQL queries are needed for the incident in question. It could involve any number of different services or causes.
I recommend having task-researcher review the Azure MCP server API and determine how a custom agent could build the KQL queries for different incidents, based on these suggested diagnostic parameters to look for.
I would then have prompt-builder use the research document to update (or replace) this prompt with how the custom agent should query the MCP tools to determine how to build the KQL queries, without putting any actual KQL queries in this prompt.
| | order by FailureCount desc | ||
| ``` | ||
|
|
||
| ### Phase 3: Mitigation Actions |
I'd also have task-researcher figure out these mitigation actions from microsoft-docs, and have prompt-builder update the instructions with how the custom agent could discover these mitigation patterns and rollback procedures.
I would suggest having this prompt discover mitigation and rollback procedure documentation, possibly in the codebase that's using this prompt (there may be documented procedures there), instead of embedding it here in this prompt. These mitigation patterns, rollback procedures, and failover considerations may not pertain to the services that are part of an incident.
| #### RCA Document Structure | ||
|
|
||
| ```markdown | ||
| # Incident Report: {Title} |
I recommend extracting this RCA document template into a markdown file in docs/templates in this codebase. Make sure you have task-researcher refer to a common and well-used RCA document template.
As an example, Google's SRE incident document is typically great: https://sre.google/sre-book/example-postmortem/
Make sure prompt-builder adds instructions to continually update the incident document and to continue from an existing incident document if re-prompted later with a cleared conversation context.
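A minimal Python sketch of the "continue from an existing incident document" behavior, for illustration only. The helper names, the one-file-per-incident layout, and the `## Timeline` heading are assumptions, not part of this PR or of any agent tooling; the point is only that the document is reopened and appended to rather than recreated.

```python
from datetime import datetime, timezone
from pathlib import Path


def open_or_create_incident_doc(doc_dir: Path, incident_id: str, title: str) -> Path:
    """Return the incident document path, creating it only if it does not exist.

    Hypothetical helper: assumes one markdown file per incident ID.
    """
    doc_dir.mkdir(parents=True, exist_ok=True)
    path = doc_dir / f"{incident_id}.md"
    if not path.exists():
        # First run: start a new document with a title and an empty timeline.
        path.write_text(f"# Incident Report: {title}\n\n## Timeline\n", encoding="utf-8")
    return path


def append_timeline_entry(path: Path, note: str) -> None:
    """Append a timestamped entry so earlier entries are never overwritten."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    with path.open("a", encoding="utf-8") as handle:
        handle.write(f"- {stamp}: {note}\n")
```

Because the existing file is reused, a later session with a cleared conversation context can read the document back and pick up the timeline where it left off.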
| 4. **Why** wasn't this prevented? → {Find gaps in controls} | ||
| 5. **Why** wasn't this detected earlier? → {Improve monitoring} | ||
|
|
||
| ## Azure Documentation References |
I recommend replacing this section with just instructions about using the microsoft-docs MCP tools. There are likely additional Azure docs references that will be needed for incident response.
| * [Application Insights](https://learn.microsoft.com/azure/azure-monitor/app/app-insights-overview) | ||
| * [Azure Service Health](https://learn.microsoft.com/azure/service-health/overview) | ||
|
|
||
| ## Escalation Criteria |
I recommend removing this Escalation Criteria section
| * **What is affected?** Services, resources, regions, user segments | ||
| * **What changed recently?** Deployments, configuration changes, scaling events | ||
|
|
||
| #### Severity Assessment |
Severity could be determined by a number of different factors based on the actual incident. I recommend providing instructions on how the agent could discover and determine severity, for example by checking whether the codebase where this prompt is used has a runbook or documentation for severity levels.
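As a rough illustration of what "discovering" severity documentation in a codebase could look like, here is a small Python sketch. The keyword list and the function name are hypothetical, and a real agent would more likely use its own search tools; this only shows the idea of looking for existing runbooks before assuming a severity scale.

```python
from pathlib import Path

# Hypothetical keywords; a real repository may use different terminology.
KEYWORDS = ("severity", "sev1", "runbook")


def find_severity_docs(repo_root: Path) -> list[Path]:
    """Return markdown files whose name or content mentions a severity keyword."""
    matches = []
    for path in sorted(repo_root.rglob("*.md")):
        try:
            text = path.read_text(encoding="utf-8", errors="ignore").lower()
        except OSError:
            # Unreadable entries (for example, a directory named *.md) are skipped.
            continue
        if any(k in path.name.lower() or k in text for k in KEYWORDS):
            matches.append(path)
    return matches
```

If the search returns nothing, the prompt could instruct the agent to ask the user for the team's severity definitions instead of inventing them.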
|
Thanks for the review, will update it.
- Replace hardcoded severity table with discovery instructions
- Remove hardcoded KQL queries, guide dynamic query building via Azure MCP
- Replace hardcoded mitigation patterns with discovery from docs/runbooks
- Extract RCA template to docs/templates/rca-template.md (Google SRE format)
- Replace static Azure docs links with microsoft-docs MCP reference
- Remove escalation criteria section

Addresses review comments from @agreaves-ms
Fixes #319
Summary
Add incident response workflow prompt for Azure operations scenarios, as outlined in the roadmap.
Changes
Created .github/prompts/incident-response.prompt.md with structured prompts for:
Updated .github/prompts/README.md to include the new prompt
Acceptance Criteria
.github/prompts/incident-response.prompt.md
Testing