Consume Docker Hub upstream SBOMs and merge with Syft #225

Open
vpetersson wants to merge 2 commits into master from docker-hub
Conversation

@vpetersson
Contributor

Summary

  • Detect Docker Official Images (library/*) and Docker Hardened Images (dhi.io/*) — either directly via DOCKER_IMAGE or via BuildKit SLSA provenance on a user-built image — and fetch the publisher's SPDX SBOM.
  • Merge that upstream SBOM with a local Syft scan of the same image: upstream wins for base-layer packages, Syft fills gaps and overlays packages installed by the user's Dockerfile (apt, pip, COPY, …). Components are tagged with a sbomify:source property (docker-hub-upstream or syft-overlay) for auditability.
  • Factor the crane/cosign + in-toto walking code out of chainguard.py into a shared buildkit_provenance module so both detectors use the same primitives. Fixes a latent bug along the way: multi-arch indexes have one attestation-manifest sibling per platform, and the previous code sometimes pulled the wrong one (or nothing) when given a per-platform digest.
  • Surface registry failures clearly — Docker Hub rate limits (100 anonymous pulls / 6h), 401s, and 404s now log at WARNING with the concrete remediation (docker login, docker login dhi.io, etc.) instead of silently falling back.

Full pipeline verified end-to-end with --augment (local sbomify.json) and --enrich: the merge output passes through both stages with its shape unchanged and its source tags preserved.

Dedup keys on the PURL's core identity (type, namespace, name, version) and ignores qualifiers, so Docker's os_distro=trixie&os_name=debian and Syft's arch=amd64&distro=debian-13 collapse to the same package. A second-pass loose match (type + name + version, namespace ignored) catches the cases where the two generators disagree on namespace: Amazon Linux's upstream SBOM emits pkg:rpm/amazonlinux/bash while Syft emits pkg:rpm/amzn/bash. This is safe because a merge always describes a single image, so the distro is fixed.
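As an illustration of the two dedup passes described above, here is a minimal sketch of how the keys could be derived from a PURL string using only the standard library. The function names (`parse_purl`, `strict_key`, `loose_key`) are hypothetical and are not taken from sbom_merge.py; the real module may well use a PURL library instead.

```python
from urllib.parse import unquote


def parse_purl(purl: str) -> tuple[str, str, str, str]:
    """Split a PURL into (type, namespace, name, version).

    Qualifiers ("?...") and subpath ("#...") are dropped on purpose,
    because they differ between SBOM generators.
    """
    if not purl.startswith("pkg:"):
        raise ValueError(f"not a PURL: {purl!r}")
    core = purl[4:].split("#", 1)[0].split("?", 1)[0]
    ptype, _, rest = core.partition("/")
    # The version is whatever follows the last "@", if there is one.
    if "@" in rest:
        path, _, version = rest.rpartition("@")
    else:
        path, version = rest, ""
    namespace, _, name = path.rpartition("/")
    return ptype.lower(), unquote(namespace), unquote(name), unquote(version)


def strict_key(purl: str) -> tuple[str, str, str, str]:
    """First pass: full core identity, qualifiers ignored."""
    return parse_purl(purl)


def loose_key(purl: str) -> tuple[str, str, str]:
    """Second pass: namespace ignored as well."""
    ptype, _namespace, name, version = parse_purl(purl)
    return ptype, name, version
```

With these helpers, the two Debian PURLs from the paragraph above share a strict key, while the Amazon Linux pair only matches on the loose key.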

Test plan

Unit coverage — 79 new tests, full suite 2257 passing, ruff clean:

  • tests/test_buildkit_provenance.py — crane/cosign helpers, rate-limit/401/404 classifier
  • tests/test_dockerhub.py — direct + provenance detection for library/*, dhi.io/*, ref classification, multi-arch sibling matching
  • tests/test_sbom_merge.py — strict dedup, qualifier-ignore fallback, namespace-mismatch fallback, relationship/extracted-license carry-over
  • tests/test_chainguard.py — existing tests rewired to the shared module; no behavioral change

Runnable E2E (examples/docker-hub/run-e2e.sh — 7 scenarios):

  • python:3.11-slim direct, CycloneDX — 142 upstream + 2721 overlay
  • python:3.11-slim direct, SPDX — 157 packages, 72 extracted licenses, 0 validation errors
  • Distro sweep: alpine:3.20, ubuntu:24.04, rockylinux:9, amazonlinux:2, archlinux:latest, busybox:latest — all detected, merged, deduped across pkg:deb/pkg:apk/pkg:rpm/pkg:alpm/pkg:generic
  • Built FROM python:3.11-slim + apt/pip, CycloneDX — 142 upstream + 2852 overlay, requests/click/curl/jq all present as syft-overlay
  • Same, SPDX — 188 packages, 105 extracted licenses, 0 validation errors
  • Full merge + --augment (local sbomify.json) + --enrich — supplier, authors, lifecycle applied on top of merged SBOM
  • DHI end-to-end — skipped in this environment (dhi.io needs docker login dhi.io and account entitlement). Unit-tested cosign shape includes --key https://registry.scout.docker.com/keyring/dhi/latest.pub and --insecure-ignore-tlog=true; the keyring URL is verified publicly reachable.

🤖 Generated with Claude Code

Docker Official Images (library/*) and Docker Hardened Images (dhi.io/*)
ship publisher-signed SBOMs. Unlike Chainguard (where we bypass local
scanning), Docker Hub images are routinely extended by users, so this
fetches upstream's authoritative SBOM and *merges* it with a Syft scan
of the same image: upstream wins for base-layer packages, Syft fills
gaps and overlays anything the Dockerfile added on top.

Detection mirrors the Chainguard flow:
  - Direct:     DOCKER_IMAGE is itself an Official Image or DHI.
  - Provenance: DOCKER_IMAGE carries BuildKit SLSA provenance whose
                resolvedDependencies name a Docker Hub base.
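
The direct path boils down to classifying an image reference. The following is a rough sketch of such a classifier, assuming Docker's usual reference convention (the first path component is a registry host only if it contains "." or ":" or is "localhost"). `classify_image_ref` and the host set are hypothetical names for illustration, not the contents of dockerhub.py.

```python
from typing import Literal, Optional

Tier = Literal["official", "dhi"]

# Assumed aliases for Docker Hub; the real detector may accept a different set.
_DOCKER_HUB_HOSTS = {"docker.io", "index.docker.io", "registry-1.docker.io"}


def classify_image_ref(ref: str) -> Optional[Tier]:
    """Return "official", "dhi", or None for any other image reference."""
    # Drop the digest, then the tag; only the repository path matters here.
    name = ref.split("@", 1)[0]
    colon = name.rfind(":")
    if colon > name.rfind("/"):
        name = name[:colon]

    first, sep, rest = name.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        host, path = first, rest
    else:
        host, path = "docker.io", name  # no registry given: Docker Hub

    if host == "dhi.io":
        return "dhi"
    if host in _DOCKER_HUB_HOSTS:
        if "/" not in path:
            return "official"  # a bare name such as "python" implies library/
        if path.startswith("library/") and path.count("/") == 1:
            return "official"
    return None
```

The provenance path could presumably reuse the same classifier on each base-image reference found in the attestation.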

Merge policy (upstream wins; Syft fills empty upstream fields):
  - Strict dedup:  (type, namespace, name, version), qualifiers ignored.
                   Handles different qualifier conventions (Docker's
                   os_distro= vs Syft's distro=).
  - Loose dedup:   (type, name, version) fallback. Handles namespace
                   disagreement such as pkg:rpm/amazonlinux vs
                   pkg:rpm/amzn on the same Amazon Linux image.
  - Component tagging: sbomify:source = docker-hub-upstream | syft-overlay.
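
The policy above can be sketched for CycloneDX-style component dicts as follows. This is an illustration under stated assumptions, not the code in sbom_merge.py: `merge_components`, `_identity`, and `_tag` are hypothetical names, components are plain dicts with an optional `purl`, and Syft components without a PURL are simply appended as overlay.

```python
import copy
from typing import Any

Component = dict[str, Any]


def _identity(purl: str) -> tuple[str, str, str, str]:
    """(type, namespace, name, version) with qualifiers and subpath dropped."""
    core = purl.removeprefix("pkg:").split("#", 1)[0].split("?", 1)[0]
    ptype, _, rest = core.partition("/")
    path, _, version = rest.rpartition("@") if "@" in rest else (rest, "", "")
    namespace, _, name = path.rpartition("/")
    return ptype, namespace, name, version


def _tag(component: Component, source: str) -> Component:
    """Return a deep copy carrying a sbomify:source property."""
    tagged = copy.deepcopy(component)
    tagged.setdefault("properties", []).append(
        {"name": "sbomify:source", "value": source}
    )
    return tagged


def merge_components(upstream: list[Component], syft: list[Component]) -> list[Component]:
    merged = [_tag(c, "docker-hub-upstream") for c in upstream]
    strict: dict[tuple, Component] = {}
    loose: dict[tuple, Component] = {}
    for comp in merged:
        if comp.get("purl"):
            ptype, ns, name, version = _identity(comp["purl"])
            strict.setdefault((ptype, ns, name, version), comp)
            loose.setdefault((ptype, name, version), comp)

    for comp in syft:
        match = None
        if comp.get("purl"):
            ptype, ns, name, version = _identity(comp["purl"])
            match = strict.get((ptype, ns, name, version)) or loose.get((ptype, name, version))
        if match is None:
            merged.append(_tag(comp, "syft-overlay"))  # added on top of the base
            continue
        # Upstream wins; Syft only fills fields that upstream left empty.
        for field, value in comp.items():
            if field != "properties" and not match.get(field):
                match[field] = copy.deepcopy(value)
    return merged
```

Keeping two indexes means the loose lookup is only consulted when the strict one misses, which mirrors the "second-pass" wording of the policy.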

Ecosystems verified end-to-end: pkg:deb (Debian, Ubuntu), pkg:apk
(Alpine), pkg:rpm (Rocky, AlmaLinux, Fedora, Amazon Linux), pkg:alpm
(Arch), pkg:generic (BusyBox). Both CycloneDX and SPDX output; both
direct and provenance detection paths. Full pipeline verified with
--augment (local sbomify.json) and --enrich.

Factored the crane/cosign attestation-walking plumbing out of
chainguard.py into a new buildkit_provenance module shared by both
detectors. Added a platform-aware variant that picks the right
attestation sibling on multi-arch indexes (the original code walked
the wrong manifest when given a per-platform digest and silently
missed Docker Hub's SBOMs).
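
To make the sibling-matching idea concrete: BuildKit records each attestation manifest in the OCI index with the annotations `vnd.docker.reference.type: attestation-manifest` and `vnd.docker.reference.digest` pointing at the platform manifest it describes. A minimal sketch of the lookup over an already-fetched index JSON might look like this; both function names are hypothetical and the actual buildkit_provenance helpers may differ (for instance, this sketch ignores the platform `variant`).

```python
from typing import Any, Optional

Descriptor = dict[str, Any]


def resolve_platform_digest(index: dict[str, Any], os_name: str, arch: str) -> Optional[str]:
    """Find the image manifest digest for a platform such as linux/amd64."""
    for desc in index.get("manifests", []):
        platform = desc.get("platform") or {}
        if platform.get("os") == os_name and platform.get("architecture") == arch:
            return desc.get("digest")
    return None


def find_attestation_descriptor(index: dict[str, Any], platform_digest: str) -> Optional[Descriptor]:
    """Pick the attestation manifest that refers to the given platform digest."""
    for desc in index.get("manifests", []):
        annotations = desc.get("annotations") or {}
        if (
            annotations.get("vnd.docker.reference.type") == "attestation-manifest"
            and annotations.get("vnd.docker.reference.digest") == platform_digest
        ):
            return desc
    return None
```

Matching on the reference digest rather than taking the first attestation entry is what avoids reading another platform's SBOM on a multi-arch index.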

Failure modes surface clearly: rate-limit (Docker Hub anonymous pulls
are capped at 100/6h), 401, and 404 from crane or cosign are classified
and logged at WARNING with concrete remediation. Any fetch/merge error
falls back to the existing plain Syft scan path.

New files:
  sbomify_action/_generation/buildkit_provenance.py
  sbomify_action/_generation/dockerhub.py
  sbomify_action/_generation/sbom_merge.py
  tests/test_buildkit_provenance.py
  tests/test_dockerhub.py
  tests/test_sbom_merge.py
  examples/docker-hub/ (runnable E2E: 7 scenarios across distros and formats)

README: new "Docker Hub Images" section, updated tools table, and
"Docker Hub SBOM reuse" entry in the top-level feature list.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 24, 2026 10:13
Contributor

Copilot AI left a comment

Pull request overview

Adds support for consuming Docker Hub–published upstream SPDX SBOMs (Docker Official Images and DHI), then merging them with a local Syft scan so base-layer packages prefer publisher metadata while user-installed packages are overlaid.

Changes:

  • Introduces shared BuildKit attestation/provenance helpers (crane/cosign + in-toto walking) and rewires Chainguard detection to use them.
  • Implements Docker Hub image detection (direct + provenance) and upstream SBOM retrieval (crane for Official Images, cosign for DHI).
  • Adds CycloneDX + SPDX merge logic (upstream-wins with Syft gap-filling) and integrates it into the CLI pipeline, with extensive unit + E2E examples.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| sbomify_action/cli/main.py | Integrates Docker Hub detection/fetch + Syft overlay merge into the main generation pipeline. |
| sbomify_action/_generation/buildkit_provenance.py | New shared primitives for resolving platform digests and fetching BuildKit/cosign attestations. |
| sbomify_action/_generation/dockerhub.py | New Docker Hub Official/DHI detection and upstream SPDX SBOM fetch. |
| sbomify_action/_generation/sbom_merge.py | New upstream-wins merge logic for CycloneDX and SPDX documents. |
| sbomify_action/_generation/chainguard.py | Refactors Chainguard detector to use shared BuildKit provenance helpers. |
| tests/test_buildkit_provenance.py | Unit tests for shared BuildKit/crane/cosign helpers and registry error classification. |
| tests/test_dockerhub.py | Unit tests for Docker Hub detection paths and upstream SBOM retrieval behavior. |
| tests/test_sbom_merge.py | Unit tests for merge semantics (dedup, fill-empty, collision rewrites, relationship/licensing carry-over). |
| tests/test_chainguard.py | Updates mocks/patches to point at the extracted shared helper module. |
| examples/docker-hub/run-e2e.sh | Adds an end-to-end runner script covering official/provenance/DHI scenarios. |
| examples/docker-hub/README.md | Documents the E2E scenarios and how to run them locally. |
| examples/docker-hub/Dockerfile.official | Example derivative image to exercise provenance-based detection + overlay behavior. |
| examples/docker-hub/Dockerfile.dhi | Example derivative image for DHI scenarios (best-effort). |
| examples/docker-hub/hello.py | Minimal app used by the Dockerfiles/E2E runner. |
| README.md | Documents Docker Hub SBOM reuse/merge feature and updates required-tools notes. |

Comment thread sbomify_action/_generation/buildkit_provenance.py
Comment thread README.md Outdated
Comment thread sbomify_action/_generation/buildkit_provenance.py Outdated
Security
  DHI SBOM fetch now uses `cosign verify-attestation --type spdxjson`
  instead of `cosign download attestation`. The download-only command
  fetched envelopes without checking Docker's signature, so a tampered
  attestation would have been silently consumed. verify-attestation
  enforces the signature against the --key keyring and still short-
  circuits the Rekor check (DHI isn't Rekor-logged). Chainguard's
  pre-existing download-only behavior is unchanged — separate scope.
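
For readers unfamiliar with the cosign output, here is a hedged sketch of what the verified fetch could look like. The flags and keyring URL are the ones quoted in this PR; `build_verify_command` and `extract_predicates` are hypothetical helpers, and the sketch assumes cosign prints one DSSE envelope per line on stdout, each with a base64-encoded in-toto statement in `payload`.

```python
import base64
import json
from typing import Any

DHI_KEY_URL = "https://registry.scout.docker.com/keyring/dhi/latest.pub"


def build_verify_command(image_ref: str) -> list[str]:
    """argv for a signature-checked SPDX attestation fetch (no Rekor lookup)."""
    return [
        "cosign", "verify-attestation",
        "--type", "spdxjson",
        "--key", DHI_KEY_URL,
        "--insecure-ignore-tlog=true",
        image_ref,
    ]


def extract_predicates(cosign_stdout: str) -> list[Any]:
    """Decode each envelope line and return the in-toto predicates."""
    predicates = []
    for line in cosign_stdout.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip anything that is not a JSON envelope
        envelope = json.loads(line)
        statement = json.loads(base64.b64decode(envelope["payload"]))
        predicates.append(statement.get("predicate"))
    return predicates
```

A caller would run the command with `subprocess.run(..., capture_output=True, text=True)` and only parse stdout when the exit code is zero, since a non-zero exit means verification failed.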

Correctness
  SPDX merge now deep-copies the upstream doc before mutation. The
  shallow `dict(upstream_spdx)` copy worked only because upstream was
  never reused; a future cache layer would have silently leaked
  nested-list mutations.
  External-references dedup keys on (type, url) instead of url alone,
  so CycloneDX refs with the same URL but different type (e.g., vcs vs
  website) no longer collapse.
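
Both correctness fixes are easy to picture in a few lines. The sketch below is illustrative only; `working_copy` and `merge_external_references` are hypothetical names rather than the functions in sbom_merge.py.

```python
import copy
from typing import Any


def working_copy(upstream_doc: dict[str, Any]) -> dict[str, Any]:
    """Deep-copy so nested lists (packages, relationships) are not shared."""
    return copy.deepcopy(upstream_doc)


def merge_external_references(
    upstream_refs: list[dict[str, Any]], syft_refs: list[dict[str, Any]]
) -> list[dict[str, Any]]:
    """Upstream refs first, then Syft refs whose (type, url) pair is new."""
    merged = copy.deepcopy(upstream_refs)
    seen = {(ref.get("type"), ref.get("url")) for ref in merged}
    for ref in syft_refs:
        key = (ref.get("type"), ref.get("url"))
        if key not in seen:
            seen.add(key)
            merged.append(copy.deepcopy(ref))
    return merged
```

With a shallow `dict(doc)` copy, appending to `copy["packages"]` would also grow `doc["packages"]`, which is the leak the deep copy guards against.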

Error surfacing (Copilot)
  `_classify_registry_error` is shared by Chainguard and Docker Hub
  paths, so the 429/401 hints are now registry-agnostic, pointing at
  `docker login <registry>` with the placeholder left for the user to
  fill in. A new branch recognises crane's "No matching credentials"
  message and steers users toward `docker login dhi.io` — the common
  cause of DHI failures after a plain `docker login`.
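
A registry-agnostic classifier along these lines could be sketched as below. This is not the `_classify_registry_error` from the PR: the function names, the failure kinds, and the exact substrings matched in stderr are assumptions chosen for illustration.

```python
import logging
import re
from typing import NamedTuple, Optional

logger = logging.getLogger(__name__)


class RegistryFailure(NamedTuple):
    kind: str
    hint: str


def classify_registry_error(stderr: str) -> Optional[RegistryFailure]:
    """Map crane/cosign stderr to a failure kind plus a remediation hint."""
    text = stderr.lower()
    if "no matching credentials" in text:
        return RegistryFailure(
            "missing-credentials",
            "no credentials for this registry; for DHI run `docker login dhi.io`",
        )
    if re.search(r"\b429\b", text) or "toomanyrequests" in text or "too many requests" in text:
        return RegistryFailure(
            "rate-limit", "rate limited; authenticate with `docker login <registry>`"
        )
    if re.search(r"\b401\b", text) or "unauthorized" in text:
        return RegistryFailure(
            "unauthorized", "authentication required; run `docker login <registry>`"
        )
    if re.search(r"\b404\b", text) or "manifest_unknown" in text or "not found" in text:
        return RegistryFailure(
            "not-found", "image or attestation not found; check the reference"
        )
    return None


def warn_on_registry_failure(tool: str, stderr: str) -> bool:
    """Log a classified failure at WARNING; return False if unclassified."""
    failure = classify_registry_error(stderr)
    if failure is None:
        return False
    logger.warning("%s failed (%s): %s", tool, failure.kind, failure.hint)
    return True
```

Checking the credentials message first keeps the DHI-specific hint from being masked by a generic 401 match in the same stderr output.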

Docs (Copilot)
  README clarified: `sbomify:source` property is emitted on CycloneDX
  components only; SPDX output is merged but not per-package source-
  annotated.

Types / polish
  DockerHubBaseImage.tier and related helpers typed as
  Literal["official", "dhi"] instead of plain str.
  convert_spdx_to_cyclonedx docstring notes its generic SPDX 2.x → CDX
  contract and that it's reused by the Docker Hub merge path.

Tests
  +4 new tests covering the registry-agnostic classifier, the "No
  matching credentials" → dhi.io hint, externalReferences (type, url)
  dedup, and a regression guard that pins the DHI cosign call to
  `verify-attestation` (not `download`). Suite now 2261 passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>