Skip to content

refactor: add record_id/version columns to records; key /data/ reads and cursors on internal ids #143

@rorybyrne

Description

@rorybyrne

Summary

Give records real record_id + version columns (or a BIGSERIAL surrogate for the keyset, plus record_id for lookups), backfilled from the SRN. Today the bare internal id exists only as a substring of the srn PK, which forces three workarounds in the read path and keeps SRNs inside /data/ engine internals against the SRN-sidelining direction.

Motivation

  • get_record_by_id is a full-table scan. It matches srn LIKE 'urn:osa:%:rec:<id>@%' — the leading wildcard defeats any index — then filters versions in Python. With a real column it becomes an indexed equality + ORDER BY version DESC LIMIT 1. At the 1M-records scale SC-002 targets, this is the dominant cost of every single-record fetch.
  • Keyset tiebreaks and cursors lean on the SRN. sort=id aliases to the srn column, the route layer has to mirror that aliasing when encoding cursors (the bug fixed in 3f6b434), and cursors carry full SRNs. With real columns the aliasing disappears and cursors carry internal ids only.
  • sort=id may become meaningful — contingent on design: record SRN id format — UUID vs sequential vs short-random #134. Record ids are currently minted as UUID v4 (see design: record SRN id format — UUID vs sequential vs short-random #134; CLAUDE.md's "UUIDv7/ULID" line has drifted from the code), so id-order is random today. If design: record SRN id format — UUID vs sequential vs short-random #134 lands on sequential, v7, or ULID, id-order ≈ creation order; if v4 stays, this bullet evaporates but the other two motivations stand on their own.

Design decisions to make first

  1. design: record SRN id format — UUID vs sequential vs short-random #134 (record SRN id format) is effectively a prerequisite. The format choice shapes this issue's schema:
  2. Composite (record_id, version) unique pair vs a single BIGSERIAL surrogate for the keyset tiebreak (moot if design: record SRN id format — UUID vs sequential vs short-random #134 picks sequential). The composite is self-describing but makes the keyset 3 columns when the primary sort is something else; the surrogate keeps 2-column keysets and an int cursor component, but adds a column whose only job is ordering. (record_id is needed either way for the by-id lookup.)

Places to look during implementation

This is a non-exhaustive orientation map, not a plan. It reflects a sweep of today's tree (symbol-level references — line numbers drift). Step one of implementation is to redo the sweep (grep -rn 'RecordSRN.parse\|records_table\|:rec:') and diff against this list; trust the grep, not the list.

Schema / migration

  • server/osa/infrastructure/persistence/tables.pyrecords_table: srn TEXT PK, no id/version columns today.
  • server/migrations/ — needs a new Alembic revision with a backfill parsing existing SRNs (':rec:{id}@{version}').

Write path

  • server/osa/domain/record/service/record.pyRecordService.publish_record / bulk_publish (where Record identity is minted; also the mint sites design: record SRN id format — UUID vs sequential vs short-random #134 discusses).
  • server/osa/infrastructure/persistence/mappers/record.pyrecord_to_dict / row_to_record (identity currently round-trips exclusively through RecordSRN.parse).
  • server/osa/infrastructure/persistence/repository/record.pyPostgresRecordRepository.save / save_many.
  • server/osa/domain/ingest/handler/publish_batch.py — bulk-publish caller (likely unaffected if the mapper owns the columns; check).

Read path (unified /data/)

  • server/osa/infrastructure/data/postgres_data_read_store.pyget_record_by_id (the LIKE scan), _records_sort (sort=idt.c.srn aliasing, srn tiebreak), _records_row_to_mapping (derives id/version by parsing the srn).
  • server/osa/application/api/v1/routes/data/_streaming.py_next_cursor (the srn/tiebreak aliasing added in 3f6b434).
  • server/osa/domain/data/model/record_summary.pyRecordSummary.flatten (id derived from srn).

Tests that pin current srn-based behavior

  • server/tests/integration/conftest.pyseed_record (raw INSERT into records).
  • server/tests/integration/test_data_read_store_postgres.pyTestStreamPaginationOrder cursor tests encode srn values.
  • server/tests/unit/application/api/v1/routes/data/test_streaming.pytest_paginated_records_id_sort_encodes_srn.
  • server/tests/unit/infrastructure/data/test_cursor_validation.py — records cursor coercion pins.
  • server/tests/integration/test_bulk_publish_dual_write.py — bulk-publish write coverage.

Explicitly unchanged (by design)

  • records.srn remains the PK and the citation/federation identity; /events changefeed and FR-015 response bodies keep carrying SRNs.
  • record_srn FKs from metadata.* / features.* tables stay on srn (metadata_table.py, feature_table.py).
  • /data/records/{id}[@{version}] route contract is unchanged — only its execution plan improves.

Sequencing

After #139 merges, and after (or jointly with) the #134 format decision. Roughly one focused PR: migration + write path + read path + tests. If #142 (infrastructure reorganization) lands first, the file paths above shift with its renames (persistence/postgres/, infrastructure/data/postgres/query/) — another reason to re-grep rather than follow the list.

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    infrastructureCI, Docker, deployment, migrationsrefactorInternal restructuring, no behavior change

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions