Feature: User-Controlled Collection IDs for DPS STAC #91

@hrodmn

Feature: User-Controlled STAC Collection IDs

Background

Currently, STAC collection IDs are auto-assigned by DpsStacItemGenerator using a
deterministic formula derived from the DPS job's .met.json metadata file:

{username}__{algorithm_name}__{algorithm_version}__{tag}

This value is slugified (special characters replaced) and then unconditionally written
into the collection field of every STAC item before publishing to the ingestor queue —
regardless of what collection the user's catalog.json specifies. Users have requested
the ability to control the collection ID so that outputs from related jobs and algorithm
runs can be organized into a single, meaningfully named collection.
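For reference, the current formula can be sketched roughly as follows. The exact slugification rule is not spelled out in this ticket, so the `slugify` helper here (replace anything outside letters, digits, `_`, `.`, and `-` with a hyphen) is an assumption for illustration, as are the function names.

```python
import re


def slugify(value: str) -> str:
    # Assumed rule: any character outside [A-Za-z0-9_.-] becomes a hyphen.
    # The real DpsStacItemGenerator may use a different replacement scheme.
    return re.sub(r"[^A-Za-z0-9_.-]", "-", value)


def deterministic_collection_id(met: dict) -> str:
    # Join the four .met.json fields with a double underscore, then slugify.
    fields = ("username", "algorithm_name", "algorithm_version", "tag")
    return slugify("__".join(str(met[name]) for name in fields))


example_met = {
    "username": "jsmith",
    "algorithm_name": "flood detector",
    "algorithm_version": "v1.2.0",
    "tag": "run 1",
}
example_id = deterministic_collection_id(example_met)
```

Under these assumptions, `example_id` is `jsmith__flood-detector__v1.2.0__run-1`.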

This ticket proposes an initial implementation using admin-mediated collection creation,
designed to extend toward self-service and algorithm-level authorization.


How the current pipeline works

DpsStacItemGenerator is triggered by S3 event notifications when a DPS job writes a
catalog.json to the output bucket. For each event:

  1. The DPS output prefix is extracted from the S3 key path using a timestamp pattern
  2. A .met.json file is loaded from that prefix — this is the authoritative source of
    job context, containing at minimum: username, algorithm_name,
    algorithm_version, and tag
  3. A deterministic collection ID is constructed from those fields and slugified
  4. The catalog.json is read via pystac; every item's collection field is
    overwritten with the deterministic ID before publishing to the ingestor SNS topic
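Step 1 could look something like the sketch below. The timestamp layout (`YYYY/MM/DD/HH/MM/SS/<microseconds>`) and the example key are assumptions for illustration only; the actual pattern lives in DpsStacItemGenerator and may differ.

```python
import re
from typing import Optional

# Assumed DPS output layout: <anything>/YYYY/MM/DD/HH/MM/SS/<digits>/<job files>
OUTPUT_PREFIX_PATTERN = re.compile(
    r"^(.*?/\d{4}/\d{2}/\d{2}/\d{2}/\d{2}/\d{2}/\d+)/"
)


def extract_output_prefix(key: str) -> Optional[str]:
    # Return the S3 prefix up to and including the timestamp segment,
    # or None when the key does not contain the expected pattern.
    match = OUTPUT_PREFIX_PATTERN.match(key)
    return match.group(1) + "/" if match else None


# Hypothetical S3 key for a catalog.json written by a DPS job
example_key = "dps_output/my-algo/v1/tag/2025/01/15/10/30/00/123456/output/catalog.json"
example_prefix = extract_output_prefix(example_key)
```

The .met.json lookup in step 2 would then be scoped to the returned prefix.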

Some users are already setting the collection field in their STAC items, but the
current code silently overwrites it. This feature stops that overwrite and makes the
item-provided collection ID the primary routing mechanism.
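In other words, the proposed behavior is "use the item's collection if present, otherwise fall back to the deterministic ID". A minimal sketch on plain item dictionaries (the function name is hypothetical, and authorization is deliberately left out here):

```python
def resolve_collection_id(item: dict, fallback_id: str) -> str:
    # Prefer the collection declared in the STAC item itself.
    # Missing, null, or empty values fall back to the deterministic ID.
    declared = item.get("collection")
    return declared if declared else fallback_id


with_collection = {"id": "scene-1", "collection": "jsmith--flood-catalog-2025"}
without_collection = {"id": "scene-2"}
fallback = "jsmith__my-flood-detector__v1.2.0__run-1"
```

Items that do not set a collection keep today's behavior, so existing jobs are unaffected.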


Proposed Approach

Phase 1: Manual Registry of Allowed User/Collection Combinations

To start, we will manually maintain a registry of which users are allowed to contribute
to which collections. Each entry grants a user either specific collection IDs or a
prefix with a wildcard. The registry will be deployed as part of the maap-eoapi
CloudFormation stack.
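One possible shape for the lookup is sketched below. The registry structure, the usernames, and the use of `*` as the wildcard character are all assumptions; the ticket does not fix a storage format.

```python
from fnmatch import fnmatchcase

# Hypothetical registry: username -> allowed collection IDs or wildcard prefixes
REGISTRY = {
    "jsmith": ["jsmith--*"],
    "adoe": ["shared-flood-catalog", "adoe--*"],
}


def is_authorized(username: str, collection_id: str, registry: dict = REGISTRY) -> bool:
    # A user may publish to a collection when any of their registered
    # patterns matches the collection ID (case-sensitive match).
    patterns = registry.get(username, [])
    return any(fnmatchcase(collection_id, pattern) for pattern in patterns)
```

With this registry, `jsmith` can publish to `jsmith--flood-catalog-2025` but not to `shared-flood-catalog`, and unknown users are denied everything.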


Phase 2: Self-Service Collection Creation (future)

The Phase 1 design is structured so that self-service collection creation slots in
without changing the DpsStacItemGenerator authorization logic. The Lambda already
handles all authorization outcomes. Most of the future work will be managed by JPL in
the MAAP Console, where collection/user assignments will be tracked. We will also add
token-gated transaction endpoints for collections in the MAAP DPS STAC.


Algorithm Authoring Convention

The collection ID should be treated as a runtime parameter, not a hardcoded value
inside algorithm code. DPS supports arbitrary named input parameters, and algorithms
should declare a collection_id input parameter that is passed through to the STAC
item outputs at job runtime:

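# Recommended pattern in algorithm code
# (generate_stac_items and write_catalog stand in for the algorithm's own helpers)
from typing import Optional

def run(collection_id: Optional[str] = None, **kwargs):
    items = generate_stac_items(...)
    for item in items:
        # Only set the collection when the user supplied one at job submission
        if collection_id:
            item.collection_id = collection_id
    write_catalog(items)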

When a user submits a DPS job, they can then pass their target collection ID as a job
input parameter without modifying the algorithm itself:

algorithm: my-flood-detector v1.2.0
inputs:
  collection_id: jsmith--flood-catalog-2025
  ...

This convention should be documented as a best practice in the MAAP algorithm
authoring guide. Its benefits are:

  • The same algorithm version can route to different collections (dev, staging,
    production; personal vs. shared)
  • Collection governance decisions are separated from algorithm logic — the algorithm
    doesn't need to know or care about catalog organization
  • Users who don't specify a collection_id parameter get the deterministic fallback
    behavior automatically, so the convention is opt-in and backward compatible

Algorithms that hardcode a collection ID in their output items will still work — the
authorization check applies regardless of how the collection ID got into the item —
but hardcoding is discouraged because it couples a specific catalog governance decision
to algorithm code that may be shared or reused by others.


Naming Rules

  • Lowercase alphanumeric characters, hyphens, and underscores only
  • 3–64 characters; no leading or trailing hyphens or underscores
  • Case-insensitive uniqueness (my-collection and My-Collection are the same)
  • Reserved names blocked: api, admin, system, search, conformance,
    queryables, and any existing system collection patterns

Collection IDs are immutable after creation. The current deterministic ID formula
uses __ (double underscore) as a delimiter — user-specified IDs should avoid this
pattern to remain visually distinguishable from auto-assigned IDs.
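The hard rules above could be checked with something like the following sketch. The function and constant names are hypothetical, and "existing system collection patterns" are not covered because they are not enumerated here.

```python
import re

# 3-64 characters, lowercase alphanumerics plus hyphen/underscore,
# starting and ending with an alphanumeric character.
COLLECTION_ID_PATTERN = re.compile(r"^[a-z0-9][a-z0-9_-]{1,62}[a-z0-9]$")

RESERVED_NAMES = {"api", "admin", "system", "search", "conformance", "queryables"}


def is_valid_collection_id(collection_id: str) -> bool:
    # Reject reserved names first, then apply the character and length rules.
    if collection_id in RESERVED_NAMES:
        return False
    return COLLECTION_ID_PATTERN.fullmatch(collection_id) is not None


def looks_auto_assigned(collection_id: str) -> bool:
    # Soft check only: a double underscore suggests the deterministic formula.
    return "__" in collection_id
```

Because uppercase input is rejected outright, `My-Collection` can never be registered alongside `my-collection`, which gives case-insensitive uniqueness without a separate comparison step.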


Error Surfacing

DpsStacItemGenerator currently has no feedback channel back to the user after a DPS
job completes. Collection governance introduces new async failure modes — collection not
found, user not authorized, algorithm version not approved — that users need visibility
into. At minimum, the Lambda will emit structured CloudWatch log events for every
governance decision. A user-facing feedback mechanism (DPS job callback or ingestion
status dashboard) is a dependency that should be resolved before this feature ships.
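As a starting point for the log events, each governance decision could be emitted as a single JSON line so it can be filtered in CloudWatch. The field names, the logger name, and the decision and reason values below are assumptions, not an agreed schema.

```python
import json
import logging
from typing import Optional

logger = logging.getLogger("dps_stac_item_generator")


def log_governance_decision(
    decision: str,
    username: str,
    collection_id: str,
    reason: Optional[str] = None,
) -> str:
    # Build one JSON object per decision and emit it as a single log line.
    payload = {
        "event": "collection_governance",
        "decision": decision,
        "username": username,
        "collection_id": collection_id,
        "reason": reason,
    }
    line = json.dumps(payload, sort_keys=True)
    logger.info(line)
    return line


example_line = log_governance_decision(
    "denied", "jsmith", "shared-flood-catalog", "user_not_authorized"
)
```

Returning the serialized line makes the helper easy to unit test and leaves room to forward the same payload to a future user-facing channel.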

