stackrox · mtodor · May 5, 2026 · May 4, 2026
diff --git a/workflows/acs-triage/.ambient/ambient.json b/workflows/acs-triage/.ambient/ambient.json
@@ -7,9 +7,9 @@
       "jql": "project = ROX AND (type = Bug OR type = Vulnerability OR type = Weakness OR type = Ticket) AND status = New AND parent is EMPTY AND Team is EMPTY AND assignee is EMPTY AND labels NOT IN (auto-triaged) ORDER BY created",
       "autoTriagedLabel": "auto-triaged"
     },
-    "timeout": 300,
-    "maxIssues": 20
+    "timeout": 1800,
+    "maxIssues": 5
   },
-  "systemPrompt": "You are an ACS/StackRox Triage Specialist. Execute the `/triage` command for complete end-to-end pipeline (setup → fetch → classify → analyze → assign → report) or `/triage --comment` to post results to JIRA.\n\n**Key Commands:** `/triage` (JQL search for untriaged), `/triage ROX-12345` (specific issue), `/triage --comment` (writes to JIRA + adds auto-triaged label), `/comment-issues` (standalone commenting)\n\n**JQL Search:** Fetches issues matching: `project = ROX AND (type = Bug OR type = Vulnerability OR type = Weakness OR type = Ticket) AND status = New AND parent is EMPTY AND Team is EMPTY AND assignee is EMPTY AND labels NOT IN (auto-triaged) ORDER BY created`\n\n**Early Exit:** If JQL search returns 0 issues, exit immediately with \"No untriaged issues found\" (don't proceed with empty pipeline).\n\n**Label-Based Idempotency:** After posting triage comment, add `auto-triaged` label to issue. This prevents re-processing on subsequent runs. Workflow is fully idempotent and safe to run frequently.\n\n**Workflow Details:** See `.claude/commands/triage.md` for complete 7-phase pipeline. **CRITICAL:** Phases 1a+1b run in parallel, Phase 4 analysis MUST use parallel tool calls (saves 60-80s).\n\n**Domain Knowledge:** Consult `reference/*.md` files for teams, error patterns, CODEOWNERS mappings, vulnerability decision trees, and confidence thresholds. Team assignment uses 5-strategy priority system (95%-70% confidence).\n\n**Performance:** Load files once and cache. Primary JIRA query in Phase 1b. Max 3-5 additional batched queries for similar issue searches. See triage.md Performance Optimization Guidelines.\n\n**Constraints:** 300s timeout, 10-20 issues max, ≥80% confidence for auto-assignment, READ-ONLY by default.\n\n**Outputs:** All artifacts in `artifacts/acs-triage/` (setup-info.json, issues.json, triage-report.md, summary.json).\n\nFor complete documentation, see `CLAUDE.md`.",
+  "systemPrompt": "You are an ACS/StackRox Triage Specialist. Execute the `/triage` command for complete end-to-end pipeline (setup → fetch → classify → analyze → assign → report) or `/triage --comment` to post results to JIRA.\n\n**Key Commands:** `/triage` (JQL search for untriaged), `/triage ROX-12345` (specific issue), `/triage --comment` (writes to JIRA + adds auto-triaged label), `/comment-issues` (standalone commenting)\n\n**JQL Search:** Fetches issues matching: `project = ROX AND (type = Bug OR type = Vulnerability OR type = Weakness OR type = Ticket) AND status = New AND parent is EMPTY AND Team is EMPTY AND assignee is EMPTY AND labels NOT IN (auto-triaged) ORDER BY created`\n\n**Early Exit:** If JQL search returns 0 issues, exit immediately with \"No untriaged issues found\" (don't proceed with empty pipeline).\n\n**Label-Based Idempotency:** After posting triage comment, add `auto-triaged` label to issue. This prevents re-processing on subsequent runs. Workflow is fully idempotent and safe to run frequently.\n\n**Deep CI Failure Analysis:** CI_FAILURE issues automatically receive deep root cause analysis using the stackrox CI failure investigator agent methodology (read at runtime from `/tmp/triage/stackrox/.claude/agents/stackrox-ci-failure-investigator.md`). Each CI failure gets 4-5 minutes of investigation. Deep analysis results appear in comments and reports but do NOT influence team assignment.\n\n**Workflow Details:** See `.claude/commands/triage.md` for complete 7-phase pipeline. **CRITICAL:** Phases 1a+1b run in parallel, Phase 4 analysis MUST use parallel tool calls (saves 60-80s).\n\n**Domain Knowledge:** Consult `reference/*.md` files for teams, error patterns, CODEOWNERS mappings, vulnerability decision trees, and confidence thresholds. Team assignment uses 5-strategy priority system (95%-70% confidence).\n\n**Performance:** Load files once and cache. Primary JIRA query in Phase 1b. Max 3-5 additional batched queries for similar issue searches. See triage.md Performance Optimization Guidelines.\n\n**Constraints:** 1800s timeout, 5 issues max, ≥80% confidence for auto-assignment, READ-ONLY by default.\n\n**Outputs:** All artifacts in `artifacts/acs-triage/` (setup-info.json, issues.json, triage-report.md, summary.json).\n\nFor complete documentation, see `CLAUDE.md`.",
   "startupPrompt": "Greet the user and introduce yourself as an ACS Triage Specialist. Briefly explain that you execute automated triage for StackRox/ACS JIRA issues (CI failures, vulnerabilities, flaky tests) and generate comprehensive reports with intelligent team assignments using confidence scoring. Explain the available commands: `/triage` (complete pipeline, READ-ONLY), `/triage --comment` (pipeline + post to JIRA + add auto-triaged label), `/comment-issues` (standalone comment posting). Mention the workflow uses JQL search to find new untriaged issues (excludes issues with auto-triaged label), making it safe to run repeatedly."
 }
diff --git a/workflows/acs-triage/.claude/commands/comment-issues.md b/workflows/acs-triage/.claude/commands/comment-issues.md
@@ -130,6 +130,26 @@ GraphQL schema validation error pattern matches core-workflows team with 90% con
   - central/graphql/resolvers/policies.go
 - **Error Pattern:** GraphQL schema validation
 
+### CI Failure Root Cause Analysis
+
+**Root Cause:** Template logic inconsistency in GraphQL code generation causes placeholder Boolean fields to be emitted without proper resolver functions, leading to schema validation failure at startup.
+
+**Failure Category:** code-bug
+**Analysis Confidence:** High
+**Risk Assessment:** Medium
+
+**Affected Components:**
+- central/graphql/generator/codegen/codegen.go.tpl
+- central/graphql/resolvers/generated.go
+
+**Proposed Fix:** Fix template conditional to check hasAnyMethods before hasFields in the code generation template.
+
+**Relevant Logs:**
+{code}
+Error: Cannot query field "isDeprecated" on type "PolicyViolationEvent"
+  at /central/graphql/schema.go:142
+{code}
+
 ---
 *Generated by ACS Triage Workflow • 2026-04-27 15:30 UTC*
 ```

diff --git a/workflows/acs-triage/.claude/commands/triage.md b/workflows/acs-triage/.claude/commands/triage.md
@@ -32,6 +32,9 @@ Clone StackRox repository for CODEOWNERS and reference data if not already prese
 - Check if `/tmp/triage/stackrox/.github/CODEOWNERS` exists
 - If missing, clone `https://github.com/stackrox/stackrox` to `/tmp/triage/stackrox`
 - Extract current version from `VERSION` file
+- Check if `/tmp/triage/stackrox/.claude/agents/stackrox-ci-failure-investigator.md` exists
+  - If present: deep CI failure analysis will use this agent's methodology
+  - If missing: log warning "CI failure investigator agent not found - deep analysis will use description-only mode"
 
 **Output:** Setup metadata in `artifacts/acs-triage/setup-info.json`
 
@@ -57,7 +60,7 @@ Query JIRA for untriaged issues or fetch a specific issue.
   labels NOT IN (auto-triaged)
   ORDER BY created ASC
   ```
-- Limit to 10-20 issues (timeout constraint: 300s)
+- Limit to 5 issues (timeout constraint: 1800s, deep CI analysis takes 4-5 min/issue)
 - Extract: key, summary, description, labels, components, priority, status, created, updated, affectedVersions, fixVersions, comments
 - **CRITICAL:** If JQL search returns 0 issues, exit immediately with message: "No untriaged issues found. All issues either have auto-triaged label or don't match criteria. Triage complete."
 
@@ -94,6 +97,8 @@ Run type-specific analysis for each issue based on its classification:
 #### 4a. CI Failure Analysis
 For issues where `issueType === "CI_FAILURE"`:
 
+##### Stage 1: Quick Pattern Analysis
+
 - Extract error messages from description/comments
 - Classify error type: GraphQL, panic, timeout, network, test failure, etc.
 - Extract file paths from stack traces
@@ -109,6 +114,37 @@ For issues where `issueType === "CI_FAILURE"`:
   }
   ```
 
+##### Stage 2: Deep Root Cause Analysis
+
+After Stage 1, perform deep root cause investigation for each CI_FAILURE issue:
+
+**Time budget:** 4-5 minutes per CI_FAILURE issue.
+
+**Process:**
+1. Read the investigator agent methodology from `/tmp/triage/stackrox/.claude/agents/stackrox-ci-failure-investigator.md`
+   - If the file is unavailable, proceed with description-only analysis (set `investigation_method: "description_only"`)
+2. Follow the agent's methodology to analyze the failure:
+   - Investigate CI job logs and URLs found in the JIRA description and comments
+   - Analyze error messages, stack traces, and test output
+   - Correlate findings with source code in the cloned stackrox repository
+3. Populate a `deep_analysis` sub-object within `ci_analysis`:
+   ```json
+   {
+     "root_cause": "Detailed explanation of why the CI failure occurs",
+     "failure_category": "code-bug | flaky-test | infrastructure | configuration | dependency | unknown",
+     "affected_components": ["file/module paths identified during investigation"],
+     "confidence": "High | Medium | Low",
+     "risk_assessment": "Low | Medium | High",
+     "proposed_fix": "Specific description of what needs to change",
+     "relevant_logs": "Sanitized log excerpts (max 500 chars)",
+     "investigation_method": "agent_methodology | description_only"
+   }
+   ```
+
+**Sanitization rules:** NEVER include API tokens, passwords, secrets, internal URLs with credentials, IP addresses, or employee emails in `deep_analysis` output. Use `[REDACTED]` for any sensitive values found.
+
+**IMPORTANT:** The `deep_analysis` output is for reporting and JIRA comments only. It does NOT feed into Phase 5 team assignment. Team assignment continues to use only the existing 5 strategies based on Stage 1 results.
+
 #### 4b. Vulnerability Analysis
 For issues where `issueType === "VULNERABILITY"`:
 
@@ -300,11 +336,18 @@ After running this command, you should have:
 - Phase 4: Run CI/Vuln/Flaky analysis in parallel (3 concurrent tool calls)
 - Total time savings: 70-100 seconds vs sequential execution
 
+**Deep CI Failure Analysis:**
+- Time budget: 4-5 minutes per CI_FAILURE issue (Stage 2 of Phase 4a)
+- Deep analysis runs sequentially per issue (each requires significant investigation)
+- With 5 issues max and potential for all to be CI failures, worst case is ~25 minutes (5 issues × 5 minutes) for deep analysis alone, leaving ~5 minutes of the 1800s timeout for all other phases
+- The investigator agent methodology is read once from `/tmp/triage/stackrox/.claude/agents/stackrox-ci-failure-investigator.md` and applied to each issue
+
 ## Notes
 
-- **Timeout**: 300 seconds total (5 minutes)
-- **Issue Limit**: 10-20 issues to stay within timeout
-- **Parallel Analysis**: CI/Vuln/Flaky analysis MUST run concurrently (saves 60-80s)
+- **Timeout**: 1800 seconds total (30 minutes)
+- **Issue Limit**: 5 issues per run to allow time for deep CI failure analysis
+- **Deep CI Failure Analysis**: Each CI_FAILURE issue gets 4-5 minutes of deep root cause investigation using the stackrox CI failure investigator methodology. Results appear in comments and reports but do NOT influence team assignment.
+- **Parallel Analysis**: CI/Vuln/Flaky analysis MUST run concurrently (saves 60-80s). Within Phase 4a, deep analysis runs sequentially per CI_FAILURE issue.
 - **READ-ONLY by default**: Use `--comment` flag to write to JIRA
 - **High Confidence Threshold**: ≥80% for auto-assignment recommendations
 - **Version Awareness**: Automatically detects and adjusts for version mismatches
@@ -319,3 +362,4 @@ Consult these for domain knowledge:
 - `reference/vulnerability-decision-tree.md` - ProdSec workflow
 - `reference/flaky-test-patterns.md` - Known flaky test patterns
 - `reference/constants.md` - All confidence thresholds
+- `/tmp/triage/stackrox/.claude/agents/stackrox-ci-failure-investigator.md` - CI failure investigation methodology (read at runtime from cloned stackrox repo)
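
The sanitization rules and the 500-character cap on `relevant_logs` in Phase 4a Stage 2 could be enforced mechanically before anything is written to a report or JIRA comment. The following is a minimal Python sketch, not part of this change: the function name `sanitize_log_excerpt` and the regular expressions are hypothetical and illustrative only, and a real pattern list would need review against actual CI log content.

```python
import re

# Mirrors DEEP_ANALYSIS_LOG_MAX_CHARS from reference/constants.md.
MAX_LOG_CHARS = 500

# Illustrative patterns only (assumption): not an exhaustive or reviewed list.
_SENSITIVE_PATTERNS = [
    # URLs carrying embedded credentials, e.g. scheme://user:pass@host/path
    re.compile(r"[A-Za-z][A-Za-z0-9+.-]*://[^\s/@]+:[^\s/@]+@\S+"),
    # Email addresses
    re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    # IPv4 addresses
    re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    # key=value or key: value pairs for common secret names
    re.compile(r"(?i)\b(?:token|password|secret|api[_-]?key)\b\s*[=:]\s*\S+"),
]


def sanitize_log_excerpt(text: str) -> str:
    """Replace sensitive-looking values with [REDACTED], then cap the length."""
    for pattern in _SENSITIVE_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    # Truncate after redaction so a cut can never expose part of a secret.
    return text[:MAX_LOG_CHARS]
```

Redacting before truncating is a deliberate choice here: truncating first could split a secret across the cut and leave a fragment that no pattern matches.
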
diff --git a/workflows/acs-triage/CLAUDE.md b/workflows/acs-triage/CLAUDE.md
@@ -9,6 +9,7 @@ This is a **single-purpose workflow** for automated triage of StackRox/ACS JIRA
 ## Key Features
 
 - **Multi-Strategy Team Assignment**: 5-strategy priority system with 95%-70% confidence scores
+- **Deep CI Failure Analysis**: CI_FAILURE issues automatically receive root cause investigation using the stackrox CI failure investigator agent methodology (read at runtime from cloned stackrox repo). Results enrich triage comments and reports but do not influence team assignment.
 - **Version Awareness**: Detects mismatches between issue versions and current codebase
 - **Specialized Analysis**: Custom decision trees for CI failures, vulnerabilities, and flaky tests
 - **READ-ONLY Mode**: Generates reports without modifying JIRA automatically
@@ -63,10 +64,11 @@ See `reference/constants.md` for all confidence thresholds and configuration val
 ## Critical Constraints
 
 1. **READ-ONLY MODE**: Generate reports only, never modify JIRA automatically
-2. **Timeout**: Complete within 300 seconds (5 minutes)
-3. **Issue Limit**: Process 10-20 issues per session
+2. **Timeout**: Complete within 1800 seconds (30 minutes)
+3. **Issue Limit**: Process up to 5 issues per session (allows 4-5 minutes for deep CI failure analysis per issue)
 4. **Version Awareness**: Adjust confidence for version mismatches
 5. **High Confidence Threshold**: ≥80% for recommended assignments
+6. **Deep Analysis Separation**: CI failure root cause analysis is informational only -- it does NOT influence team assignment
 
 ## Automated Execution
 
@@ -80,8 +82,8 @@ The workflow is configured in `.ambient/ambient.json`:
       "jql": "project = ROX AND (type = Bug OR type = Vulnerability OR type = Weakness OR type = Ticket) AND status = New AND parent is EMPTY AND Team is EMPTY AND assignee is EMPTY AND labels NOT IN (auto-triaged) ORDER BY created",
       "autoTriagedLabel": "auto-triaged"
     },
-    "timeout": 300,
-    "maxIssues": 20
+    "timeout": 1800,
+    "maxIssues": 5
   }
 }
 ```
@@ -101,11 +103,16 @@ The triage workflow clones latest `main` branch from StackRox repo. Issues with
 - Confidence scores adjusted (see `reference/constants.md` for thresholds)
 - Reports flag with ⚠️ symbol
 
+## External Dependencies
+
+- **stackrox CI failure investigator**: Agent at `/tmp/triage/stackrox/.claude/agents/stackrox-ci-failure-investigator.md` (cloned during setup phase). Used for deep CI failure root cause analysis. If unavailable, deep analysis falls back to description-only mode.
+
 ## Reference Data Sources
 
 The workflow uses reference data from:
 
 - `/tmp/triage/stackrox/.github/CODEOWNERS` - File path → team mappings (cloned during setup phase)
+- `/tmp/triage/stackrox/.claude/agents/stackrox-ci-failure-investigator.md` - CI failure investigation methodology (cloned during setup phase)
 - `reference/*.md` - Local domain knowledge files for error patterns, team mappings, decision trees
 
 ## Output Locations

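The new timeout and issue limit interact with the per-issue deep analysis budget, and the headroom is easy to check with arithmetic. The sketch below uses the three values from `reference/constants.md` in this change; the helper name `remaining_budget_seconds` is hypothetical and only illustrates the calculation, assuming deep analysis runs sequentially and each CI_FAILURE issue consumes its full budget.

```python
TIMEOUT_SECONDS = 1800                   # from reference/constants.md
MAX_ISSUES_PER_RUN = 5                   # from reference/constants.md
DEEP_ANALYSIS_TIME_BUDGET_SECONDS = 300  # from reference/constants.md


def remaining_budget_seconds(ci_failure_count: int) -> int:
    """Seconds left for all other phases if every CI failure uses its full budget."""
    if not 0 <= ci_failure_count <= MAX_ISSUES_PER_RUN:
        raise ValueError(f"ci_failure_count must be between 0 and {MAX_ISSUES_PER_RUN}")
    return TIMEOUT_SECONDS - ci_failure_count * DEEP_ANALYSIS_TIME_BUDGET_SECONDS


# Worst case: all 5 issues are CI failures, so 5 * 300 = 1500 seconds go to deep analysis.
print(remaining_budget_seconds(MAX_ISSUES_PER_RUN))
```

In the worst case this leaves 300 seconds for setup, fetch, classification, the other analyses, assignment, and reporting, which equals the previous total timeout.
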
diff --git a/workflows/acs-triage/FIELD_REFERENCE.md b/workflows/acs-triage/FIELD_REFERENCE.md
@@ -201,6 +201,65 @@ These fields are added by the `/analyze-ci` command for CI_FAILURE issues:
 - **Purpose:** Test matches known flaky pattern
 - **Impact:** May reclassify as FLAKY_TEST
 
+### ci_analysis.deep_analysis
+
+Deep root cause analysis performed using the stackrox CI failure investigator agent methodology. This sub-object is populated for all CI_FAILURE issues. Results are informational only -- they appear in JIRA comments and reports but do NOT influence team assignment.
+
+- **Type:** object
+- **Purpose:** Deep root cause investigation results
+- **Added By:** Phase 4a Stage 2
+
+#### ci_analysis.deep_analysis.root_cause
+
+- **Type:** string
+- **Example:** "Template logic inconsistency in GraphQL code generation causes placeholder Boolean fields to be emitted without proper resolver functions"
+- **Purpose:** Detailed explanation of why the CI failure occurs
+
+#### ci_analysis.deep_analysis.failure_category
+
+- **Type:** enum
+- **Values:** "code-bug", "flaky-test", "infrastructure", "configuration", "dependency", "unknown"
+- **Purpose:** Classification of the failure's nature
+
+#### ci_analysis.deep_analysis.affected_components
+
+- **Type:** array of strings
+- **Example:** `["central/graphql/generator/codegen/codegen.go.tpl", "central/graphql/resolvers/generated.go"]`
+- **Purpose:** Files and modules identified during investigation
+
+#### ci_analysis.deep_analysis.confidence
+
+- **Type:** enum
+- **Values:** "High", "Medium", "Low"
+- **Purpose:** Confidence in the root cause determination
+
+#### ci_analysis.deep_analysis.risk_assessment
+
+- **Type:** enum
+- **Values:** "Low", "Medium", "High"
+- **Purpose:** Potential impact of the failure and its fix
+
+#### ci_analysis.deep_analysis.proposed_fix
+
+- **Type:** string
+- **Example:** "Fix template conditional to check hasAnyMethods before hasFields"
+- **Purpose:** Specific description of what needs to change to resolve the failure
+
+#### ci_analysis.deep_analysis.relevant_logs
+
+- **Type:** string (max 500 chars)
+- **Purpose:** Sanitized excerpts from CI logs relevant to the root cause
+- **Note:** NEVER contains API tokens, passwords, secrets, internal URLs with credentials, IP addresses, or employee emails
+
+#### ci_analysis.deep_analysis.investigation_method
+
+- **Type:** enum
+- **Values:** "agent_methodology", "description_only"
+- **Purpose:** Indicates how the analysis was performed
+- **Note:** "description_only" means the investigator agent file was unavailable and analysis was based solely on JIRA description/comments
+
+---
+
 ## Vulnerability Analysis Fields
 
 These fields are added by the `/analyze-vuln` command for VULNERABILITY issues:
@@ -457,7 +516,17 @@ These fields are calculated by the `/generate-report` command:
   "ci_analysis": {
     "error_type": "GraphQL",
     "file_paths": ["central/graphql/resolvers/policies.go"],
-    "error_signature_match": { "pattern": "...", "confidence": 90 }
+    "error_signature_match": { "pattern": "...", "confidence": 90 },
+    "deep_analysis": {
+      "root_cause": "Template logic inconsistency in GraphQL code generation...",
+      "failure_category": "code-bug",
+      "affected_components": ["central/graphql/generator/codegen/codegen.go.tpl"],
+      "confidence": "High",
+      "risk_assessment": "Medium",
+      "proposed_fix": "Fix template conditional to check hasAnyMethods before hasFields",
+      "relevant_logs": "Error: Cannot query field \"isDeprecated\" on type \"PolicyViolationEvent\"",
+      "investigation_method": "agent_methodology"
+    }
   },
 
   "team_assignment": {

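Because `deep_analysis` mixes free-text fields with four enums, a small shape check can catch malformed output before it reaches a report. The following Python sketch is an illustration under assumptions, not part of the workflow: the function name `validate_deep_analysis` is hypothetical, and the allowed values are copied from the field definitions in `FIELD_REFERENCE.md` above.

```python
from typing import Any

# Allowed enum values, as documented for ci_analysis.deep_analysis.
ALLOWED_VALUES = {
    "failure_category": {
        "code-bug", "flaky-test", "infrastructure",
        "configuration", "dependency", "unknown",
    },
    "confidence": {"High", "Medium", "Low"},
    "risk_assessment": {"Low", "Medium", "High"},
    "investigation_method": {"agent_methodology", "description_only"},
}
STRING_FIELDS = ("root_cause", "proposed_fix", "relevant_logs")
MAX_LOG_CHARS = 500


def validate_deep_analysis(obj: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the object looks well-formed."""
    problems = []
    for field, allowed in ALLOWED_VALUES.items():
        value = obj.get(field)
        if not isinstance(value, str) or value not in allowed:
            problems.append(f"{field}: expected one of {sorted(allowed)}")
    for field in STRING_FIELDS:
        if not isinstance(obj.get(field), str):
            problems.append(f"{field}: expected a string")
    components = obj.get("affected_components")
    if not (isinstance(components, list) and all(isinstance(c, str) for c in components)):
        problems.append("affected_components: expected a list of strings")
    logs = obj.get("relevant_logs")
    if isinstance(logs, str) and len(logs) > MAX_LOG_CHARS:
        problems.append(f"relevant_logs: longer than {MAX_LOG_CHARS} characters")
    return problems
```

Returning a list of problems rather than raising on the first one lets a report show every malformed field for an issue at once.
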
diff --git a/workflows/acs-triage/reference/constants.md b/workflows/acs-triage/reference/constants.md
@@ -50,8 +50,8 @@ Central location for all hardcoded values used throughout the ACS triage workflo
 
 | Constraint | Value | Reason |
 |-----------|-------|--------|
-| TIMEOUT_SECONDS | 300 | JIRA MCP performance limit |
-| MAX_ISSUES_PER_RUN | 10-20 | Keep within timeout |
+| TIMEOUT_SECONDS | 1800 | Allow time for deep CI failure analysis |
+| MAX_ISSUES_PER_RUN | 5 | Keep within timeout with deep analysis |
 | SETUP_CLONE_DEPTH | full | Need CODEOWNERS and version tags |
 
 ## Output Paths
@@ -92,6 +92,24 @@ Central location for all hardcoded values used throughout the ACS triage workflo
 | Database containers (central-db, scanner-db, scanner-v4-db) with npm/Go vulnerabilities | CLOSE (Obsolete) |
 | Non-main containers with npm vulnerabilities | CLOSE (Obsolete) |
 
+## Deep Analysis Constants
+
+| Constant | Value | Purpose |
+|----------|-------|---------|
+| DEEP_ANALYSIS_TIME_BUDGET_SECONDS | 300 | Max time per CI failure issue for deep root cause analysis |
+| DEEP_ANALYSIS_LOG_MAX_CHARS | 500 | Max length of sanitized log excerpts in deep analysis output |
+
+## Failure Categories
+
+| Category | Description |
+|----------|-------------|
+| code-bug | Real code defect causing the CI failure |
+| flaky-test | Intermittent test failure not caused by a code change |
+| infrastructure | CI infrastructure issue (node failures, resource exhaustion) |
+| configuration | Misconfiguration in CI pipeline, environment, or test setup |
+| dependency | External dependency failure (registry, network, third-party service) |
+| unknown | Root cause could not be determined |
+
 ## Repository Paths
 
 | Repository | Clone Path | Files Needed |