Reconcile native/parquet/arrow with respect to logical vs physical storage #505

> If there is a graph with 10 edges that's persisted in native tables, then exported as ice-disk and loaded into Arrow, I want count(*) to be consistent across the native, Parquet, and Arrow cases.


• Then external duplication (icebug-format cli) is not sufficient as a general solution.

  If a 10-edge native table exports to ice-disk/Arrow as 20 physical rows, Ladybug will currently treat those as 20 logical relationships. That leaks into:

  - MATCH ()-[r]->() RETURN count(*)
  - MATCH ()-[r]-() RETURN count(*)
  - relationship IDs / row offsets
  - path expansion
  - multigraph semantics
  - self-loops

  To keep native, ice-disk, and Arrow consistent, the exported/loaded representation needs to preserve 10 logical rel rows, even if it also has
  reverse-direction acceleration data.
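
The leak described above can be reproduced in a few lines. This is a minimal sketch, not Ladybug or ice-disk code: the row layout and the `rel_id` and `mirror` field names are assumptions made only for illustration.

```python
# Hypothetical 10-edge graph as (rel_id, src, dst) tuples; a simple ring.
logical_edges = [(i, i, (i + 1) % 10) for i in range(10)]

# A mirrored export writes every edge twice, once per direction.
physical_rows = []
for rel_id, src, dst in logical_edges:
    physical_rows.append({"rel_id": rel_id, "src": src, "dst": dst, "mirror": False})
    physical_rows.append({"rel_id": rel_id, "src": dst, "dst": src, "mirror": True})

# What a loader reports if it treats every physical row as a relationship.
physical_count = len(physical_rows)

# What count(*) should report: the number of distinct logical relationships.
logical_count = len({row["rel_id"] for row in physical_rows})

print(physical_count, logical_count)
```

The two numbers only agree when the loader counts logical relationships rather than stored rows.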

  The clean model is:

  - Logical edge storage: one row per relationship.
  - FWD index: source CSR over logical edge IDs/rows.
  - BWD index: destination CSR over the same logical edge IDs/rows.
  - Properties: stored once, keyed by logical edge row/ID.
  - Scans in either direction return the same logical rel ID, not the physical row in a directional mirror.
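
As a rough illustration of that model, the sketch below keeps one logical edge table and builds two CSR indexes whose `indices` arrays hold logical edge IDs instead of copies of the rows. The helper `build_csr` and the sample data are hypothetical and use plain Python lists; a real implementation would operate on Arrow or native columns.

```python
from itertools import accumulate

def build_csr(num_nodes, keys, edge_ids):
    """Return (indptr, indices); indices holds logical edge IDs grouped by key node."""
    counts = [0] * num_nodes
    for k in keys:
        counts[k] += 1
    indptr = [0] + list(accumulate(counts))
    cursor = indptr[:-1].copy()          # next free slot for each node
    indices = [0] * len(edge_ids)
    for k, eid in zip(keys, edge_ids):
        indices[cursor[k]] = eid
        cursor[k] += 1
    return indptr, indices

# Logical edge storage: one row per relationship; the row position is the edge ID.
src = [0, 0, 1, 2, 2]
dst = [1, 2, 2, 0, 2]                    # edge 4 is a self-loop on node 2
weight = [1.0, 2.5, 0.5, 4.0, 3.0]       # property stored once per logical edge
edge_ids = list(range(len(src)))

fwd_indptr, fwd_edges = build_csr(3, src, edge_ids)   # FWD: grouped by source
bwd_indptr, bwd_edges = build_csr(3, dst, edge_ids)   # BWD: grouped by destination

def out_edges(node):
    return fwd_edges[fwd_indptr[node]:fwd_indptr[node + 1]]

def in_edges(node):
    return bwd_edges[bwd_indptr[node]:bwd_indptr[node + 1]]

print(out_edges(2), in_edges(2))         # the self-loop has the same ID in both
```

Because both indexes are permutations of the same edge IDs, the relationship count is `len(src)` regardless of scan direction, and a property lookup such as `weight[eid]` never depends on which index produced `eid`.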

  For Arrow/ice-disk specifically, that implies either:

  1. Do not duplicate edges.
     Use one canonical CSR/FLAT table. BWD scans can be slower, but counts stay correct.
  2. Add a reverse index, not reverse rows.
     For CSR, add BWD indices/indptr that point to the same logical edge IDs, rather than duplicating property rows as separate relationships.
  3. If duplicated input is unavoidable, require a logical edge ID and mirror marker.
     Then scans/counts must collapse mirrors consistently. This is harder and riskier, especially for paths and multigraphs.
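
For option 3, collapsing could look roughly like the sketch below. The column names `rel_id` and `is_mirror` are assumptions, not part of any existing ice-disk schema, and the function only covers the simple case of deduplicating a loaded batch of rows.

```python
def collapse_mirrors(rows):
    """Keep one canonical row per rel_id, preferring the non-mirror copy."""
    canonical = {}
    for row in rows:
        rel_id = row["rel_id"]
        # Take the first row seen for an ID, but let a non-mirror row replace a mirror.
        if rel_id not in canonical or not row["is_mirror"]:
            canonical[rel_id] = row
    return [canonical[k] for k in sorted(canonical)]

rows = [
    # Two parallel edges a -> b (a multigraph), each stored with its mirror.
    {"rel_id": 0, "src": "a", "dst": "b", "is_mirror": False},
    {"rel_id": 0, "src": "b", "dst": "a", "is_mirror": True},
    {"rel_id": 1, "src": "a", "dst": "b", "is_mirror": False},
    {"rel_id": 1, "src": "b", "dst": "a", "is_mirror": True},
    # A self-loop: its mirror has the same endpoints, so only rel_id tells them apart.
    {"rel_id": 2, "src": "c", "dst": "c", "is_mirror": False},
    {"rel_id": 2, "src": "c", "dst": "c", "is_mirror": True},
]

logical = collapse_mirrors(rows)
print(len(rows), len(logical))
```

Deduplicating on `(src, dst)` instead of `rel_id` would wrongly merge the two parallel `a -> b` edges, which is one reason this option is riskier than keeping a single logical table.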

  So for your consistency requirement, I would not load duplicated Parquet rows as ordinary Arrow rel rows. I’d treat reverse data as an internal index
  representation, matching native-table semantics.


› Yes - I want to treat the duplicate rows as an internal index. How do I make this distinction clear in the ice-disk spec?


• Make the spec distinguish logical relationship rows from directional index rows. That is the core wording.

  Suggested model:

  Relationship Layouts

  - FLAT: one row per logical relationship.
  - CSR: one row per logical relationship in the forward CSR index.
  - DIRECTIONAL_CSR or BIDIRECTIONAL_CSR: directional CSR index files may contain mirror rows, but those rows are not additional logical relationships.

  For ice-disk, I’d write it like this:

  ## Relationship Identity

  An ice-disk relationship table represents a set of logical relationships. Query
  semantics such as `COUNT(*)`, relationship IDs, path uniqueness, and returned
  relationship values are defined over logical relationships, not over physical
  index rows.

  Some layouts may store additional directional index rows to accelerate traversal.
  These rows are an implementation detail and must not change logical cardinality.

  Then define the files explicitly:

  ## CSR Relationship Tables

  A CSR relationship table may be stored with one or two directional indexes.

  Forward index:
  - `indices_<rel>.parquet`
  - `indptr_<rel>.parquet`

  Optional backward index:
  - `indices_bwd_<rel>.parquet`
  - `indptr_bwd_<rel>.parquet`

  Each index row must include a logical relationship identifier column, for example
  `rel_id`. The same logical relationship must use the same `rel_id` in every
  directional index where it appears.

  Then make the invariant strict:

  ## Directional Index Invariants

  For a table with `N` logical relationships:

  - `rel_id` identifies the logical relationship.
  - Directional index rows are not counted as independent relationships.
  - A logical relationship may appear in both FWD and BWD indexes.
  - If it appears in both, both rows must share the same `rel_id`.
  - Properties are defined by logical relationship, not directional index row.
  - If properties are repeated in multiple indexes, repeated values must be identical.
  - Self-loops must not create duplicate logical relationships.
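
A loader could check these invariants before accepting a table. The sketch below is hypothetical: it assumes each index row has already been read into a dict with `rel_id`, `src`, `dst`, and `props` keys, that every row records the endpoints of the logical relationship (not a swapped pair), and that a `rel_id` appears at most once within a single directional index.

```python
def check_directional_invariants(fwd_rows, bwd_rows):
    """Return a list of violation messages; an empty list means the checks passed."""
    problems = []
    fwd, bwd = {}, {}
    for name, rows, seen in (("FWD", fwd_rows, fwd), ("BWD", bwd_rows, bwd)):
        for row in rows:
            if row["rel_id"] in seen:
                problems.append(f"{name}: rel_id {row['rel_id']} appears more than once")
            seen[row["rel_id"]] = row
    # Rows that share a rel_id across indexes must describe the same relationship.
    for rel_id in sorted(fwd.keys() & bwd.keys()):
        f, b = fwd[rel_id], bwd[rel_id]
        if (f["src"], f["dst"]) != (b["src"], b["dst"]):
            problems.append(f"rel_id {rel_id}: endpoints differ between FWD and BWD")
        if f["props"] != b["props"]:
            problems.append(f"rel_id {rel_id}: repeated properties differ")
    return problems

def logical_count(fwd_rows, bwd_rows):
    """N is the number of distinct rel_id values, not the number of index rows."""
    return len({r["rel_id"] for r in fwd_rows} | {r["rel_id"] for r in bwd_rows})

fwd = [
    {"rel_id": 0, "src": 1, "dst": 2, "props": {"w": 1.0}},
    {"rel_id": 1, "src": 2, "dst": 2, "props": {"w": 2.0}},   # self-loop
]
bwd_ok = list(reversed(fwd))
bwd_bad = [
    {"rel_id": 0, "src": 1, "dst": 2, "props": {"w": 9.0}},   # property mismatch
    {"rel_id": 1, "src": 2, "dst": 2, "props": {"w": 2.0}},
]

print(check_directional_invariants(fwd, bwd_ok))
print(check_directional_invariants(fwd, bwd_bad))
print(logical_count(fwd, bwd_ok))
```

Here four index rows describe two logical relationships, and the self-loop contributes exactly one of them.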

  I’d avoid saying “duplicate rows” in the normative part. Use “mirror index rows” or “directional index rows”. “Duplicate” sounds like ordinary duplicate data
  and invites the wrong interpretation.

  The most important concrete requirement is adding a stable logical rel_id. Without it, Ladybug cannot distinguish “two logical edges between same endpoints”
  from “one logical edge stored in two directional indexes”.
