@littleKitchen

Fixes #319

Summary

Add incident response workflow prompt for Azure operations scenarios, as outlined in the roadmap.

Changes

  • Created .github/prompts/incident-response.prompt.md with structured prompts for:

    • Initial Triage - Rapid assessment of incident scope and severity
    • Diagnostic Queries - KQL patterns for Azure Monitor, Log Analytics
    • Impact Analysis - Affected resources, services, and users
    • Mitigation Actions - Common remediation patterns
    • RCA Preparation - Root cause analysis documentation support
  • Updated .github/prompts/README.md to include the new prompt

Acceptance Criteria

  • File created at .github/prompts/incident-response.prompt.md
  • Frontmatter follows repository conventions
  • Prompt covers triage, diagnostics, mitigation, and RCA phases
  • Includes Azure-specific patterns (KQL, resource health, Activity Log)
  • References Azure Monitor and Log Analytics documentation

Testing

  • ✅ Markdown lint: 0 errors
  • ✅ Spell check: 0 errors
  • ✅ Frontmatter validates against schema

Fixes microsoft#319

Add incident response workflow prompt for Azure operations scenarios with:
- Initial triage and severity assessment
- Diagnostic KQL queries for Azure Monitor and Log Analytics
- Mitigation patterns and communication templates
- Root cause analysis documentation structure

Includes Azure-specific patterns for resource health, Activity Log,
Application Insights, and service health monitoring.
@littleKitchen requested a review from a team as a code owner on February 1, 2026, 00:40

```kql
// Check Azure Resource Health events
AzureActivity
```

I don't believe we can implicitly know exactly which KQL queries are needed for the incident in question. It could involve any number of different services or causes.

I recommend having task-researcher review the Azure MCP server API and determine how a custom agent could determine and build the KQL queries for different incidents based on these suggested diagnostic parameters.

I would then have prompt-builder use the research document to update (or replace) this prompt with how the custom agent should query the MCP tools to determine how to build the KQL queries, without putting any actual KQL queries in this prompt.

```kql
| order by FailureCount desc
```

### Phase 3: Mitigation Actions

I'd also give task-researcher the job of figuring out these mitigation actions from microsoft-docs, and have prompt-builder update the instructions with how the custom agent could discover these mitigation patterns and rollback procedures.

I would suggest having this prompt discover mitigation and rollback procedure documentation, possibly in the codebase that's using this prompt, since there may be documentation for procedures there, instead of embedding it here in this prompt. The mitigation patterns, rollback procedures, and failover considerations embedded here may not pertain to the services that are part of an incident.
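
As a rough illustration of the discovery step suggested above, a minimal Python sketch might scan the consuming repository for markdown files that look like runbooks or rollback procedures. The function name `find_procedure_docs`, the keyword list, and the assumption that procedures live in `.md` files are all hypothetical; nothing here comes from this PR or from any Azure or MCP API.

```python
from pathlib import Path

# Hypothetical keyword list; a real repository may use different terms.
KEYWORDS = ("runbook", "rollback", "mitigation", "failover")


def find_procedure_docs(repo_root: str, keywords=KEYWORDS) -> list[str]:
    """Return markdown files whose path or content mentions any keyword.

    Paths are returned relative to repo_root, sorted, in POSIX form.
    """
    root = Path(repo_root)
    matches = []
    for path in root.rglob("*.md"):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        haystack = rel.lower()
        try:
            haystack += "\n" + path.read_text(encoding="utf-8", errors="ignore").lower()
        except OSError:
            # Unreadable files are matched on their path only.
            pass
        if any(keyword in haystack for keyword in keywords):
            matches.append(rel)
    return sorted(matches)
```

An agent following this approach could read the returned files before proposing mitigation steps, and fall back to a documentation lookup when the list is empty.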

#### RCA Document Structure

```markdown
# Incident Report: {Title}
```

I recommend extracting this RCA document template into a markdown file in docs/templates in this codebase. Make sure you have task-researcher refer to a common and widely used RCA document template.

As an example, Google's SRE Incident document is typically great -> https://sre.google/sre-book/example-postmortem/

Make sure prompt-builder adds instructions to continually update the incident document and to continue from an existing incident document if re-prompted later with a cleared conversation context.

4. **Why** wasn't this prevented? → {Find gaps in controls}
5. **Why** wasn't this detected earlier? → {Improve monitoring}

## Azure Documentation References

I recommend replacing this section with just instructions about using the microsoft-docs MCP tools. There are likely additional Azure docs references that will be needed for incident response.

* [Application Insights](https://learn.microsoft.com/azure/azure-monitor/app/app-insights-overview)
* [Azure Service Health](https://learn.microsoft.com/azure/service-health/overview)

## Escalation Criteria

I recommend removing this Escalation Criteria section.

* **What is affected?** Services, resources, regions, user segments
* **What changed recently?** Deployments, configuration changes, scaling events

#### Severity Assessment

Severity could be determined by a number of different factors depending on the actual incident. I recommend providing instructions on how the agent could discover and determine severity. For example, the codebase where this prompt is used may have a runbook or documentation for severity levels.

@liyuuKitchen

thanks for the review, will update it

- Replace hardcoded severity table with discovery instructions
- Remove hardcoded KQL queries, guide dynamic query building via Azure MCP
- Replace hardcoded mitigation patterns with discovery from docs/runbooks
- Extract RCA template to docs/templates/rca-template.md (Google SRE format)
- Replace static Azure docs links with microsoft-docs MCP reference
- Remove escalation criteria section

Addresses review comments from @agreaves-ms

Development

Successfully merging this pull request may close these issues.

feat: add incident response prompt template

3 participants