`docs/adrs/00013-configurable-sbom-duplicate-handling.md`

# 00013. Configurable SBOM Duplicate Handling

## Status

PROPOSED

## Context

### Problem Statement

Trustify currently uses hash-based deduplication (SHA256/384/512) to detect duplicate SBOMs. However, SBOM documents have stable identifiers (`documentNamespace` for SPDX, `serialNumber` for CycloneDX) that uniquely identify them regardless of minor content changes.

**Current Limitation**: When an SBOM is regenerated with the same identifier but different content (e.g., updated timestamps), it's ingested as a new document.
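
For concreteness, here is a minimal sketch (not Trustify's actual code) of reading that stable identifier from a parsed document. The field names come from the SPDX 2.x and CycloneDX specifications; the function itself is illustrative:

```rust
use serde_json::Value;

/// Illustrative only: return the stable identifier of a parsed SBOM,
/// trying the SPDX field first and falling back to CycloneDX.
fn stable_document_id(doc: &Value) -> Option<&str> {
    doc.get("documentNamespace")             // SPDX: a URI
        .or_else(|| doc.get("serialNumber")) // CycloneDX: a urn:uuid value
        .and_then(Value::as_str)
}
```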

### Use Cases

Different scenarios require different duplicate handling behaviors:

1. **Audit/Compliance**: Keep all versions for historical tracking
2. **Latest-only**: Replace old versions to save storage and show current state
3. **Deduplication**: Ignore re-ingestion of documents with the same identifier

## Decision

Add configurable duplicate handling with three modes based on SBOM document identifiers:

### Duplicate Handling Modes

**`onDuplicate=ingest`** (default)
- Ingest as new document (current behavior)
- Hash-based deduplication still applies
- Backward compatible

**`onDuplicate=ignore`**
- Skip ingestion if SBOM with same document_id already exists
- Return existing SBOM information
- Useful for preventing re-ingestion of unchanged documents

**`onDuplicate=replace`**
- Delete existing SBOM with same document_id
- Ingest new version
- Maintains latest-only view
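
A minimal sketch of how the three modes might be modeled in code (hypothetical names, not the actual Trustify types); the `rename_all` attribute lines the variants up with the lowercase `onDuplicate` values above:

```rust
use serde::Deserialize;

/// Hypothetical model of the three modes; `Ingest` remains the default
/// so existing callers are unaffected.
#[derive(Debug, Clone, Copy, Default, Deserialize)]
#[serde(rename_all = "lowercase")]
enum OnDuplicate {
    #[default]
    Ingest,
    Ignore,
    Replace,
}
```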

## Configuration

### 1. API Upload (Per-Request)

Add optional `onDuplicate` query parameter to SBOM upload endpoint:

```bash
# Ignore duplicates - skip if it already exists
cat sbom.json | http POST localhost:8080/api/v2/sbom onDuplicate=ignore

# Replace existing - delete old, ingest new
cat sbom-v2.json | http POST localhost:8080/api/v2/sbom onDuplicate=replace

# Ingest as new (default) - current behavior
cat sbom.json | http POST localhost:8080/api/v2/sbom
```

### 2. Importer Configuration (Per-Importer)

Add `onDuplicate` field to SBOM importer configuration:

```bash
# Ignore duplicates during scheduled imports
http POST localhost:8080/api/v2/importer/my-sbom-source \
sbom[source]=https://example.com/sboms/ \
sbom[onDuplicate]=ignore \
sbom[period]=1d

# Replace with latest version
http POST localhost:8080/api/v2/importer/internal-builds \
sbom[source]=https://builds.internal/sboms/ \
sbom[onDuplicate]=replace \
sbom[period]=1h
```
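
The `sbom[onDuplicate]=ignore` syntax above makes httpie send a nested JSON body, which would deserialize into a configuration model roughly like the following sketch (reusing the hypothetical `OnDuplicate` enum above; the real `SbomImporter` carries more fields):

```rust
use serde::Deserialize;

/// Hypothetical, trimmed-down importer configuration.
#[derive(Debug, Deserialize)]
#[serde(rename_all = "camelCase")]
struct SbomImporter {
    source: String,
    period: String, // e.g. "1d" or "1h"
    /// Absent from existing configurations, so it must default to `ingest`.
    #[serde(default)]
    on_duplicate: OnDuplicate,
}
```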

## How It Works

### Duplicate Detection

1. Extract document identifier from SBOM:
- **SPDX**: `documentNamespace` field
- **CycloneDX**: `serialNumber` field

2. Check database for existing SBOM with same identifier

3. Apply the configured behavior (sketched after this list):
- **`ingest`**: Continue normal ingestion (hash-based dedup still applies)
- **`ignore`**: Skip ingestion, return existing SBOM info
- **`replace`**: Delete old SBOM and storage, then ingest new version
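
A rough sketch of that dispatch, reusing the hypothetical `OnDuplicate` enum above; the stubs stand in for Trustify's real graph and ingestion code, whose signatures will differ:

```rust
use anyhow::Result;

// Hypothetical stubs so the sketch stands alone.
struct Sbom;
enum IngestResult { Existing(Sbom), Ingested(Sbom) }
fn extract_document_id(_doc: &[u8]) -> Result<String> { todo!() }
async fn get_sbom_by_document_id(_id: &str) -> Result<Option<Sbom>> { todo!() }
async fn delete_sbom(_sbom: Sbom) -> Result<()> { todo!() }
async fn ingest_new(_doc: &[u8]) -> Result<IngestResult> { todo!() }

async fn handle_upload(doc: &[u8], mode: OnDuplicate) -> Result<IngestResult> {
    let id = extract_document_id(doc)?; // documentNamespace / serialNumber
    match (get_sbom_by_document_id(&id).await?, mode) {
        // No prior SBOM with this identifier, or explicit `ingest`:
        // proceed with normal ingestion (hash-based dedup still applies).
        (None, _) | (_, OnDuplicate::Ingest) => ingest_new(doc).await,
        // `ignore`: return the existing SBOM instead of re-ingesting.
        (Some(existing), OnDuplicate::Ignore) => Ok(IngestResult::Existing(existing)),
        // `replace`: delete the old document and its stored file, then ingest.
        (Some(existing), OnDuplicate::Replace) => {
            delete_sbom(existing).await?;
            ingest_new(doc).await
        }
    }
}
```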

### Implementation Scope

**Core Components**:
- IngestorService: Add `onDuplicate` parameter to `ingest()` method
- Graph layer: Add `get_sbom_by_document_id()` lookup function
- API endpoints: Add `onDuplicate` query parameter
- Importer config: Add `onDuplicate` field to SbomImporter

## Benefits

- ✓ Flexible handling for different use cases (audit, latest-only, deduplication)
- ✓ Backward compatible (defaults to current behavior)
- ✓ Configurable per-importer and per-upload
- ✓ Works for both SPDX and CycloneDX formats
- ✓ Prevents storage waste from duplicate documents

## Considerations

**Logging**: All duplicate-handling actions are logged to provide an audit trail

**Atomicity**: Replace operations should be atomic: a failure partway through must not delete the old SBOM without successfully ingesting its replacement
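
A minimal sketch of that guarantee using a hypothetical transaction API (`Database`, `delete_sbom_in`, and `ingest_new_in` are placeholders; the real persistence layer and its types will differ):

```rust
async fn replace_sbom(db: &Database, old: Sbom, doc: &[u8]) -> Result<IngestResult> {
    // Delete and ingest inside one transaction: an error before `commit`
    // rolls everything back, so the old record is never lost on failure.
    let tx = db.begin().await?;
    delete_sbom_in(&tx, old).await?;
    let result = ingest_new_in(&tx, doc).await?;
    tx.commit().await?;
    Ok(result)
}
```

Note that the stored document file typically lives outside the database, so its deletion cannot ride on the same transaction; deferring the file removal until after a successful commit keeps failures recoverable.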

## Open Questions

1. Should `replace` mode preserve user-added labels from the old SBOM?