[Security] Security llm matrix automation #6960
Open
spong wants to merge 2 commits into
Note
Currently iterating on artifact generation with @dhru42 & @patrykkopycinski over on elastic/kibana#273827. Once we have consensus on the generated artifact, I'll sort out the bucket configuration and update this PR, and then we can do a proper review. In the meantime, let me know if anything here doesn't follow current best practices; I tried as best I could to match current automations.
My only open question here is whether this is too much noise once we scale up to multiple serverless releases per week (which will put us out of sync with our weekly eval runs anyway, so that's probably a larger conversation).
Summary
Replaces the hand-maintained Security LLM performance matrix tables with auto-generated CSVs embedded via `:::{csv-include}`, and adds a keyless-WIF GitHub Action that keeps them current from the Elastic Security LLM evaluation pipeline. Companion to the Kibana PR (elastic/kibana#273827) that generates the matrix (closes elastic/security-team#16394).
What changed
- `solutions/security/ai/large-language-model-performance-matrix.md`: the two Markdown tables are replaced by `:::{csv-include}` directives (same pattern as `eis-supported-models.md`).
- `solutions/security/ai/llm-performance-matrix/{proprietary,open-source}-models.csv`: generated from a real golden-cluster run (branch `main`, 2026-06-15); refreshed automatically going forward.
- `.github/workflows/sync-llm-matrix-keyless.yml`: weekly schedule (serverless `latest` → PR to `main`) + manual `workflow_dispatch` (version input → PR to the `<version>` branch). Mirrors `sync-sheets-keyless.yml`.
- `.github/scripts/llm-matrix/sync_matrix.sh`: pulls the CSVs from GCS.
Note: this also moves the page to the new Agent Builder column taxonomy (Alert Triage / Detection Engineering / Investigation / KB Retrieval / Workflow Execution / Overall), replacing the previous columns.
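The two triggers described for `sync-llm-matrix-keyless.yml` can be sketched as a small mapping from trigger to matrix source and PR base branch. This is only an illustration of the behavior stated above, not the workflow's actual logic: `SyncTarget` and `resolve_target` are hypothetical names, and treating a manual run without a `version` input as an error is an assumption.

```python
from typing import NamedTuple, Optional


class SyncTarget(NamedTuple):
    """Hypothetical container: which matrix to pull and where the PR goes."""
    source_version: str  # e.g. "latest" or "9.2"
    base_branch: str     # docs branch the PR is opened against


def resolve_target(event_name: str, version: Optional[str] = None) -> SyncTarget:
    """Map a GitHub Actions trigger name to a sync target."""
    if event_name == "schedule":
        # Weekly run: serverless "latest" results go to a PR against main.
        return SyncTarget("latest", "main")
    if event_name == "workflow_dispatch":
        if not version:
            # Assumption: a manual run must name the version to refresh.
            raise ValueError("workflow_dispatch requires a version input")
        # Manual run: a released version goes to its own <version> branch.
        return SyncTarget(version, version)
    raise ValueError(f"unsupported event: {event_name}")
```

For example, `resolve_target("workflow_dispatch", "9.2")` yields a target whose source version and base branch are both `9.2`.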
Required repo configuration
Reuses the existing `sheet2docs` keyless-WIF variables (`GCP_WORKLOAD_IDENTITY_PROVIDER`, `GCP_SERVICE_ACCOUNT_EMAIL`, `GCP_PROJECT_ID`). One new variable: `LLM_MATRIX_GCS_BUCKET` (bucket name, no `gs://`). The reader identity needs `roles/storage.objectViewer` on that bucket.
Updating a versioned (released) Stack matrix
The weekly job keeps serverless/`latest` (→ `main`) current automatically. To refresh a released version (e.g. a new model in `9.2`):
1. Run the evals against the `<version>` branch so results land on the golden cluster.
2. Run the `kibana-evals-security-matrix` pipeline with `MATRIX_BRANCH=<version>` + `MATRIX_VERSION=<version>` (overwrites `gs://<bucket>/security/<version>/`).
3. Trigger `workflow_dispatch` with `version=<version>` → opens a PR against the `<version>` docs branch.
Reviewer notes (scores, cc @dhru42 / @patrykkopycinski)
Values are from a real run and the math is verified end-to-end (GPT OSS 120B's Alert Triage `7.31` reproduces exactly from raw per-evaluator means). Open scoring decisions for eval owners (numbers may move as suites are tuned):
- The score blends `AttachmentReadCompliance` (tool compliance) with answer-quality `criteria`, weighted equally. Keep the blend, or exclude `AttachmentReadCompliance` to make it quality-only.
- `criteria` counts `N/A` as pass and each triage scenario has a single example → all-or-nothing per criterion.
- Ingested metadata reports `family: "Claude"` / `provider: "Elastic"` for every model; the matrix is unaffected, but flag to the eval-ingestion owner.
Test plan / how to validate
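To sanity-check the scoring points raised in the reviewer notes, the two behaviors (an equal-weight blend of per-evaluator means, and `N/A` counted as a pass) can be sketched as below. This is a minimal illustration under stated assumptions, not the pipeline's implementation: the function names are hypothetical, `N/A` is modeled as `None`, and the numbers in the usage note are made up on a 0 to 1 scale rather than taken from a real run.

```python
from statistics import mean
from typing import Collection, Mapping, Optional, Sequence


def criteria_pass_rate(results: Sequence[Optional[bool]]) -> float:
    """Fraction of criteria that pass, counting N/A (None) as a pass."""
    if not results:
        raise ValueError("at least one criterion result is required")
    return mean(1.0 if r is None or r else 0.0 for r in results)


def blended_score(
    evaluator_means: Mapping[str, float],
    exclude: Collection[str] = (),
) -> float:
    """Equal-weight average of per-evaluator means, minus excluded evaluators."""
    kept = [value for name, value in evaluator_means.items() if name not in exclude]
    if not kept:
        raise ValueError("no evaluators left after exclusion")
    return mean(kept)


# Made-up per-evaluator means, for illustration only.
example = {"AttachmentReadCompliance": 1.0, "criteria": 0.5}
blend = blended_score(example)
quality_only = blended_score(example, exclude={"AttachmentReadCompliance"})
```

With these made-up inputs the blend is 0.75 and the quality-only variant is 0.5, which shows how much a tool-compliance evaluator can lift a score. With a single example per scenario, each individual criterion contributes either 0 or 1, so `criteria_pass_rate([None])` is 1.0 even though nothing was actually judged.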
- `:::{csv-include}` tables render on the page preview.
- Run with `dry_run=true` once the bucket exists; verify the CSV diff before enabling the schedule.
Open items
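Once the bucket is settled, the variable format described under required repo configuration (bucket name, no `gs://`) and the `gs://<bucket>/security/<version>/` layout could be checked with a sketch like the following. `matrix_prefix` is a hypothetical helper and `example-bucket` is a placeholder, neither comes from the PR.

```python
def matrix_prefix(bucket: str, version: str) -> str:
    """Build gs://<bucket>/security/<version>/ from a bare bucket name."""
    if not bucket or not version:
        raise ValueError("bucket and version must be non-empty")
    if bucket.startswith("gs://") or "/" in bucket:
        # LLM_MATRIX_GCS_BUCKET is documented as a bare name, without gs://.
        raise ValueError("bucket must be a bare bucket name, without gs:// or a path")
    return f"gs://{bucket}/security/{version}/"


# Placeholder bucket name, for illustration only.
prefix = matrix_prefix("example-bucket", "9.2")
```

Rejecting a `gs://` prefix early avoids building a doubled scheme such as `gs://gs://...` when the variable is set incorrectly.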
- Configure the bucket and set `LLM_MATRIX_GCS_BUCKET`.
Generative AI disclosure
Tool(s) and model(s) used: PR developed with Cursor + Claude Opus 4.8