[Security] Security llm matrix automation #6960

Open
spong wants to merge 2 commits into main from security-llm-matrix-automation
@spong spong commented Jun 17, 2026

Member

Note

Currently iterating on artifact generation with @dhru42 & @patrykkopycinski over on elastic/kibana#273827. Once we have consensus on the generated artifact, I'll sort out the bucket configuration and update this PR, and then we can do a proper review. In the meantime, let me know if anything here doesn't follow current best practices; I tried as best I could to match the existing automations.

My only open question here is whether this becomes too noisy once we scale up to multiple serverless releases per week (which will put us out of sync with our weekly eval runs, so that's probably a larger conversation anyway).

Summary

Replaces the hand-maintained Security LLM performance matrix tables with auto-generated CSVs embedded via :::{csv-include}, and adds a keyless-WIF GitHub Action that keeps them current from the Elastic Security LLM evaluation pipeline. Companion to the Kibana PR (elastic/kibana#273827) that generates the matrix (closes elastic/security-team#16394).

What changed

  • solutions/security/ai/large-language-model-performance-matrix.md — the two Markdown tables are replaced by :::{csv-include} directives (same pattern as eis-supported-models.md).
  • solutions/security/ai/llm-performance-matrix/{proprietary,open-source}-models.csv — generated from a real golden-cluster run (branch main, 2026-06-15); refreshed automatically going forward.
  • .github/workflows/sync-llm-matrix-keyless.yml — weekly schedule (serverless latest → PR to main) + manual workflow_dispatch (version input → PR to the <version> branch). Mirrors sync-sheets-keyless.yml.
  • .github/scripts/llm-matrix/sync_matrix.sh — pulls the CSVs from GCS.

Note: this also moves the page to the new Agent Builder column taxonomy (Alert Triage / Detection Engineering / Investigation / KB Retrieval / Workflow Execution / Overall), replacing the previous columns.

Required repo configuration

Reuses the existing sheet2docs keyless-WIF variables (GCP_WORKLOAD_IDENTITY_PROVIDER, GCP_SERVICE_ACCOUNT_EMAIL, GCP_PROJECT_ID). One new variable: LLM_MATRIX_GCS_BUCKET (bucket name, no gs://). The reader identity needs roles/storage.objectViewer on that bucket.
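
A minimal sketch of the "bucket name, no gs://" rule for LLM_MATRIX_GCS_BUCKET. The helper name `normalize_bucket_name` is hypothetical and is not part of this PR; the actual sync script may validate the variable differently or not at all.

```python
def normalize_bucket_name(value: str) -> str:
    """Return the bare bucket name, rejecting values that include a scheme or path.

    Hypothetical helper: illustrates the documented constraint only.
    """
    name = value.strip()
    if name.startswith("gs://"):
        raise ValueError("LLM_MATRIX_GCS_BUCKET must be the bare bucket name, without gs://")
    if not name or "/" in name:
        raise ValueError("LLM_MATRIX_GCS_BUCKET must be a non-empty name with no path segments")
    return name
```

Failing fast on a `gs://` prefix avoids building a doubled scheme such as `gs://gs://...` further down the pipeline.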

Updating a versioned (released) Stack matrix

The weekly job keeps serverless/latest (→ main) current automatically. To refresh a released version (e.g. a new model in 9.2):

  1. Ensure the model exists in that version's matrix config and run the eval suites against the <version> branch so results land on the golden cluster.
  2. Run the Kibana kibana-evals-security-matrix pipeline with MATRIX_BRANCH=<version> + MATRIX_VERSION=<version> (overwrites gs://<bucket>/security/<version>/).
  3. Run this workflow via workflow_dispatch with version=<version> → opens a PR against the <version> docs branch.
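
The routing described above can be sketched as follows. This is an illustration under assumptions, not the workflow's actual logic: the function names are hypothetical, and only the versioned prefix `gs://<bucket>/security/<version>/` is taken from the steps above (the serverless source path is not stated in this PR, so it is left out).

```python
from typing import Optional


def target_branch(version: Optional[str]) -> str:
    """Scheduled runs pass no version and open a PR to main;
    manual runs open a PR against the <version> docs branch."""
    return version if version else "main"


def versioned_prefix(bucket: str, version: str) -> str:
    """GCS prefix that the Kibana pipeline overwrites for a released version."""
    return f"gs://{bucket}/security/{version}/"
```

For example, a manual run with `version=9.2` would read from `versioned_prefix(bucket, "9.2")` and target the `9.2` branch, while the weekly run targets `main`.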

Reviewer notes (scores; cc @dhru42 / @patrykkopycinski)

Values are from a real run and the math is verified end-to-end (GPT OSS 120B's Alert Triage 7.31 reproduces exactly from raw per-evaluator means). Open scoring decisions for eval owners (numbers may move as suites are tuned):

  1. Alert Triage saturates at 10 for most proprietary models — honest, but the column blends AttachmentReadCompliance (tool compliance) with answer-quality criteria, weighted equally. Keep the blend, or exclude AttachmentReadCompliance to make it quality-only.
  2. criteria counts N/A as pass and each triage scenario has a single example → all-or-nothing per criterion.
  3. Only quality evaluators feed the matrix (observability metrics excluded). Confirm the desired set.
  4. Cluster metadata is mislabeled (family: "Claude" / provider: "Elastic" for every model). The matrix is unaffected, but flag this to the eval-ingestion owner.
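
To make decisions 1 and 2 concrete, here is a hedged sketch of an equally weighted blend and of N/A-as-pass scoring. It assumes per-evaluator means lie in [0, 1] and are scaled to a 0-10 column score; that scale, the function names, the verdict labels, and the `AnswerQuality` evaluator name are all assumptions for illustration, and the numbers are made up rather than taken from the run.

```python
from statistics import mean
from typing import Iterable, Mapping, Sequence


def column_score(evaluator_means: Mapping[str, float], exclude: Iterable[str] = ()) -> float:
    """Equal-weight blend of per-evaluator means, scaled to 0-10 (assumed scale)."""
    skipped = set(exclude)
    kept = [value for name, value in evaluator_means.items() if name not in skipped]
    if not kept:
        raise ValueError("no evaluators left after exclusion")
    return round(10 * mean(kept), 2)


def criterion_pass_rate(verdicts: Sequence[str]) -> float:
    """Fraction of verdicts counted as passing, with N/A treated as a pass."""
    passed = sum(1 for verdict in verdicts if verdict in ("pass", "n/a"))
    return passed / len(verdicts)


# Hypothetical values, not results from the golden-cluster run.
example = {"AttachmentReadCompliance": 1.0, "AnswerQuality": 0.5}
blended = column_score(example)
quality_only = column_score(example, exclude=["AttachmentReadCompliance"])
```

With these made-up inputs the blended score is 7.5 and the quality-only score is 5.0, which shows how a saturated compliance evaluator can lift the column. With a single example per scenario, `criterion_pass_rate` can only return 0.0 or 1.0, which is the all-or-nothing behaviour noted in item 2.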

Test plan / how to validate

  • Confirm both :::{csv-include} tables render on the page preview.
  • Run the workflow with dry_run=true once the bucket exists; verify the CSV diff before enabling the schedule.
  • Sanity-check the model lineup + column taxonomy with eval owners.
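
As one way to support the last check, the sketch below verifies that a generated CSV header carries the Agent Builder column taxonomy named in this PR. The helper name and the assumption that columns appear verbatim in the first row are hypothetical; the real CSVs may label or order columns differently.

```python
import csv
import io
from typing import List

# Column taxonomy as listed in the PR description.
EXPECTED_COLUMNS = [
    "Alert Triage",
    "Detection Engineering",
    "Investigation",
    "KB Retrieval",
    "Workflow Execution",
    "Overall",
]


def missing_columns(csv_text: str) -> List[str]:
    """Return the expected columns that are absent from the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    present = {cell.strip() for cell in header}
    return [column for column in EXPECTED_COLUMNS if column not in present]
```

An empty list means every taxonomy column is present; anything else is worth raising with the eval owners before enabling the schedule.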

Open items

Generative AI disclosure

  1. Did you use a generative AI (GenAI) tool to assist in creating this contribution?
  • Yes
  • No
  2. If you answered "Yes" to the previous question, please specify the tool(s) and model(s) used (e.g., Google Gemini, OpenAI ChatGPT-4, etc.).

Tool(s) and model(s) used: PR developed with Cursor + Claude Opus 4.8

@spong spong requested review from dhru42 and patrykkopycinski June 17, 2026 19:01
@spong spong requested review from a team as code owners June 17, 2026 19:01
@spong spong requested a review from theletterf June 17, 2026 19:01
@github-actions

Contributor

Elastic Docs AI PR menu

Check the box to run an AI review for this pull request.

  • Review docs changes (docs-review). Status: not started.

Powered by GitHub Agentic Workflows and docs-actions. For more information, reach out to the docs team.

@github-actions

github-actions Bot commented Jun 17, 2026

Contributor

🔍 Preview links for changed docs

@github-actions

Contributor

✅ Elastic Docs Style Checker (Vale)

No issues found on modified lines!


The Vale linter checks documentation changes against the Elastic Docs style guide. To use Vale locally or report issues, refer to Elastic style guide for Vale.
