
design: record SRN id format — UUID vs sequential vs short-random #134

@rorybyrne


Discussion seed for the one remaining open identifier question after #129. #129 settles the slug-based identity for human-named declarative resources (Convention, Ontology, by extension hooks/ingesters). It also explicitly says Record and Deposition should "keep UUIDs" — and the reasoning given (domain identifiers belong in metadata, not in identity) is sound and not what this issue questions. The narrower open question is: granted that record SRNs should be opaque server-minted IDs, what should that ID actually look like?

What the code does today

Record SRNs are minted at server/osa/domain/record/service/record.py:84-88 and :122-126, inside RecordService.bulk_publish() and publish_record():

```python
record_srn = RecordSRN(
    domain=self.node_domain,
    id=LocalId(str(uuid4())),
    version=RecordVersion(1),
)
```

That's UUID v4 (not v7, despite CLAUDE.md's "UUIDv7/ULID" claim). The LocalId grammar at server/osa/domain/shared/model/srn.py:47 is [a-z0-9-]{3,64}, so the SRN schema doesn't enforce UUID shape — that's a service-layer choice.
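As a quick check of which candidate ID shapes fit that grammar, here is a minimal sketch. It assumes the pattern quoted above is applied as a full match; how srn.py actually applies it is not shown here, and `fits_local_id` is a hypothetical helper, not a function from the codebase.

```python
import re
import uuid

# Pattern quoted above for LocalId; full-match anchoring is an assumption.
LOCAL_ID_PATTERN = re.compile(r"[a-z0-9-]{3,64}")


def fits_local_id(candidate: str) -> bool:
    """Return True if the candidate satisfies the quoted LocalId grammar."""
    return LOCAL_ID_PATTERN.fullmatch(candidate) is not None


candidates = {
    "uuid4": str(uuid.uuid4()),  # 36 chars: lowercase hex plus hyphens
    "integer": "12345",
    "short integer": "42",       # only 2 chars, below the {3,64} minimum
    "short random": "k7n2pq8x",
    "structured": "2606.00042",  # contains ".", which is outside the class
}
for label, value in candidates.items():
    print(f"{label:14} {value:38} {fits_local_id(value)}")
```

Under that assumption, integers below 100 and the YYMM.NNNNN form would not pass as written, so those options would need padding, a higher sequence start, or a relaxed grammar.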

What the code rules out

Ingesters never see or mint record SRNs. IngesterOutput (server/osa/domain/shared/port/ingester_runner.py:32-38) and IngesterRecord (server/osa/domain/ingest/model/ingester_record.py:21-54) carry only source_id (the upstream identifier — e.g. an accession from the source system), metadata, and files. The record SRN's {id} plays no role in the ingester contract.

Deduplication is on (source.type, source.id) where source.id is the composite "{convention_srn}:{upstream_source}" (publish_batch.py:99). The record SRN's {id} plays no role in idempotency either.
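To make the independence concrete, here is a small in-memory sketch of deduplication keyed on (source.type, source.id). The `Source` class, the `publish` function, and the example values are hypothetical stand-ins, not the code in publish_batch.py.

```python
from dataclasses import dataclass
from uuid import uuid4


@dataclass(frozen=True)
class Source:
    type: str
    id: str  # composite "{convention_srn}:{upstream_source}"


def publish(seen: dict[tuple[str, str], str], source: Source) -> str:
    """Return the record id for this source, minting one only on first sight."""
    key = (source.type, source.id)
    if key not in seen:
        # The minted id never participates in the dedup key.
        seen[key] = str(uuid4())
    return seen[key]


seen: dict[tuple[str, str], str] = {}
# Placeholder values; the real source type and convention SRN are not shown here.
src = Source(type="ingester", id="<convention_srn>:ACC0001")
first = publish(seen, src)
second = publish(seen, src)
print(first == second, len(seen))
```

Swapping `uuid4()` for any other minting function leaves the dedup behaviour unchanged, which is the point of this section.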

→ The format choice is purely a publish-time, server-side concern. There's no client round-trip constraint forcing a particular shape.

Options

| Option | Example SRN | Pros | Cons |
|---|---|---|---|
| UUID v4 (status quo) | urn:osa:localhost:rec:a1b2c3d4-… | No coordination; works today | Hard to communicate verbally; long; doesn't sort meaningfully |
| UUID v7 | urn:osa:localhost:rec:01938b… | Time-ordered (DB-friendly); still no coordination | Same verbal/length pain as v4 |
| Sequential integer (Postgres sequence) | urn:osa:localhost:rec:12345@1 | Citation-friendly; tiny; sortable; matches GenBank/Zenodo/Crossref conventions for public record IDs | Requires single DB sequence per node; reveals record counts |
| Short random alphanumeric (nanoid/ulid) | urn:osa:localhost:rec:k7n2pq8x | Compact; no coordination; collision-resistant | No real advantage over sequential in a domain-scoped namespace; harder to read aloud |
| Structured codes (YYMM.NNNNN) | urn:osa:localhost:rec:2606.00042@1 | Encodes time; familiar from arXiv | Couples ID to publish time; meaningless for backfill / re-publish of historical data |
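The first four shapes can be sketched as mint functions using only the standard library. All function names here are hypothetical; `mint_uuid7` is a hand-rolled layout following RFC 9562 rather than a library call, and the in-memory counter only stands in for a database sequence.

```python
import itertools
import os
import secrets
import time
import uuid

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"  # fits [a-z0-9-]
_counter = itertools.count(12345)  # in-memory stand-in for a Postgres sequence


def mint_uuid4() -> str:
    return str(uuid.uuid4())


def mint_uuid7() -> str:
    """UUIDv7 layout: 48-bit ms timestamp, version, 12 random bits, variant, 62 random bits."""
    ts_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    value = (ts_ms & ((1 << 48) - 1)) << 80       # bits 127..80: timestamp
    value |= 0x7 << 76                            # bits 79..76: version 7
    value |= (rand >> 68) << 64                   # bits 75..64: top 12 random bits
    value |= 0b10 << 62                           # bits 63..62: RFC variant
    value |= rand & ((1 << 62) - 1)               # bits 61..0: low 62 random bits
    return str(uuid.UUID(int=value))


def mint_sequential() -> str:
    return str(next(_counter))


def mint_short_random(length: int = 8) -> str:
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```

With 8 characters over a 36-symbol alphabet there are 36^8 (about 2.8 trillion) possible values. By the birthday bound, collisions become likely on the order of a couple of million IDs, so a short random scheme would still want a uniqueness check or retry at insert time.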

Code-level constraints on each option

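  • Sequential: needs a Postgres SEQUENCE (or bigserial). bulk_publish would reserve N values up front with one nextval() call per row, e.g. SELECT nextval(...) FROM generate_series(1, N); Postgres has no nextval('seq', n) form. Compatible with the current save_many + ON CONFLICT DO NOTHING pattern (infrastructure/persistence/repository/record.py:27-49): collisions are still caught at the source-composite key, not the SRN. Values reserved for rows that are then skipped are not returned to the sequence, so the numbering can have gaps.
  • UUID v7: drop-in replacement for v4 at the mint sites. Python's standard library only gained uuid.uuid7() in 3.14, so earlier versions need a dependency or a small helper.
  • ULID / nanoid: add a dep, single-line change at the mint sites. Canonical ULIDs and nanoid's default alphabet include uppercase characters (nanoid's also includes "_"), so the output must be lowercased or generated from a custom alphabet to fit the LocalId grammar.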
  • YYMM-style: would need a per-month sequence and a fresh field; not a small change.
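For the sequential option, reserving a batch of IDs might look like the sketch below. It assumes a DB-API cursor with psycopg-style `%s` parameters and a sequence named `record_id_seq`. Both the sequence name and the `reserve_record_ids` helper are hypothetical.

```python
from typing import Any, Protocol


class Cursor(Protocol):
    """Minimal slice of the DB-API cursor interface used below."""

    def execute(self, sql: str, params: tuple[Any, ...]) -> Any: ...
    def fetchall(self) -> list[tuple[Any, ...]]: ...


# One nextval() call per generated row; "record_id_seq" is a hypothetical name.
RESERVE_SQL = "SELECT nextval('record_id_seq') FROM generate_series(1, %s)"


def reserve_record_ids(cursor: Cursor, n: int) -> list[int]:
    """Reserve n sequence values in a single round trip."""
    if n < 1:
        return []
    cursor.execute(RESERVE_SQL, (n,))
    return [row[0] for row in cursor.fetchall()]
```

bulk_publish could call something like this once per batch and pair the returned integers with the records before save_many. Sequence values are not rolled back with the transaction, so any IDs assigned to rows that ON CONFLICT DO NOTHING skips would simply be unused.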

Domain scoping consideration

Per #129 and the SRN model: identifiers only need to be unique within a node's domain. So the "what if two archives both have record 42" concern is a non-issue — urn:osa:archive.uni.edu:rec:42 and urn:osa:another.org:rec:42 are different SRNs.

This actively unlocks short identifiers. It also matches common practice at major scientific archives: GenBank accessions and PDB IDs are short and scoped to their registry, and a DOI suffix only needs to be unique within its registrant prefix.

Versioning is unaffected

RecordVersion at srn.py:108-121 enforces >= 1, integer-only. Sequential IDs + integer versions compose fine: rec:12345@1, rec:12345@2, etc. None of the options here change that.
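A minimal sketch of how domain scoping and integer versions compose, assuming the string form urn:osa:{domain}:rec:{id}@{version} seen in the examples above. `RecordRef` and `parse_record_srn` are hypothetical and deliberately not the real RecordSRN model. The regex also assumes the domain contains no ":".

```python
import re
from dataclasses import dataclass

# Assumed string form, inferred from the example SRNs in this issue.
SRN_RE = re.compile(
    r"urn:osa:(?P<domain>[^:]+):rec:(?P<id>[a-z0-9-]{3,64})@(?P<version>[1-9][0-9]*)"
)


@dataclass(frozen=True)
class RecordRef:
    domain: str
    id: str
    version: int  # >= 1, integer-only

    def __str__(self) -> str:
        return f"urn:osa:{self.domain}:rec:{self.id}@{self.version}"

    def next_version(self) -> "RecordRef":
        return RecordRef(self.domain, self.id, self.version + 1)


def parse_record_srn(text: str) -> RecordRef:
    match = SRN_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a record SRN: {text!r}")
    return RecordRef(match["domain"], match["id"], int(match["version"]))


ref = parse_record_srn("urn:osa:localhost:rec:12345@1")
print(ref.next_version())
print(RecordRef("archive.uni.edu", "12345", 1) == RecordRef("another.org", "12345", 1))
```

The id portion is treated as an opaque string throughout, so the same code handles a UUID, an integer, or a short random id without change.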

Migration

Records are immutable and the SRN is the PK at records_table.srn (tables.py:64). Switching minting strategy:

  • For existing rows: leave them alone. Old records keep UUID-shaped SRNs.
  • For new rows: mint the new format from a cutoff point onward.
  • Anything that parses record SRN IDs structurally (citation rendering? URL routing?) needs to tolerate both.
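Tolerating both shapes during a transition could be as simple as the classifier sketched below. `classify_local_id` is a hypothetical helper, and whether any code path actually needs to distinguish the shapes is an open question here.

```python
import uuid


def classify_local_id(local_id: str) -> str:
    """Label an id as 'sequential', 'uuid<version>', or 'other' without rejecting any."""
    if local_id.isascii() and local_id.isdigit():
        return "sequential"
    try:
        parsed = uuid.UUID(local_id)
    except ValueError:
        return "other"
    # uuid.UUID also accepts braces, "urn:uuid:" prefixes and uppercase hex;
    # only the canonical lowercase hyphenated form counts as a UUID-shaped id.
    if str(parsed) != local_id:
        return "other"
    return f"uuid{parsed.version}" if parsed.version else "uuid"


for example in (str(uuid.uuid4()), "12345", "k7n2pq8x"):
    print(example, "->", classify_local_id(example))
```

Code that treats the id as an opaque string needs no such branching. The classifier only matters for citation rendering or routing that wants to display the two shapes differently.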

A separate one-time backfill could rewrite old IDs if desired, but the cost-benefit is probably "not worth it" — same reasoning as #129's note on migration.

Open questions for discussion

  1. Does the team prefer sequential integers (citation-friendly, matches archive precedent) or UUID v7 (no coordination, time-orderable) or status-quo UUID v4 (zero change)?
  2. If sequential: is exposing record counts a real concern, or just a theoretical one for an open archive that publishes count stats anyway?
  3. For ULID/nanoid specifically: is there a use case for client-mintable record IDs in the future (offline deposition? federated mirroring?) that would make non-sequential the safer long-term call?
  4. Should Deposition follow the same answer as Record (both system-created, both opaque) or could they diverge?
  5. Should CLAUDE.md's "UUIDv7/ULID" line be updated to reflect whatever's decided here? (Currently it's drifted from the v4 reality.)

No recommendation pushed here — surfacing options for discussion. The previous turn's analysis happened in chat without #129's "keep UUIDs" framing in view, and I want this thread to engage with that decision rather than work around it.

Related

Labels: design-needed (Needs architectural discussion before implementation), refactor (Internal restructuring, no behavior change)
