If a graph with 10 edges is persisted in native tables, then exported as icedisk and loaded into Arrow, I want count(*) to be consistent across the native, Parquet, and Arrow cases.
• Then external duplication (icebug-format cli) is not sufficient as a general solution.
If a 10-edge native table exports to ice-disk/Arrow as 20 physical rows, Ladybug will currently treat those as 20 logical relationships. That leaks into:
- MATCH ()-[r]->() RETURN count(*)
- MATCH ()-[r]-() RETURN count(*)
- relationship IDs / row offsets
- path expansion
- multigraph semantics
- self-loops
To keep native, ice-disk, and Arrow consistent, the exported/loaded representation needs to preserve 10 logical rel rows, even if it also has reverse-direction acceleration data.
The clean model is:
- Logical edge storage: one row per relationship.
- FWD index: source CSR over logical edge IDs/rows.
- BWD index: destination CSR over the same logical edge IDs/rows.
- Properties: stored once, keyed by logical edge row/ID.
- Scans in either direction return the same logical rel ID, not the physical row in a directional mirror.
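A minimal sketch of that model in plain Python, to make the "one row per relationship, two indexes over the same IDs" idea concrete. The helper name `build_csr`, the tuple layout `(src, dst, weight)`, and the sample data are assumptions for illustration, not Ladybug or ice-disk APIs.

```python
from typing import List, Tuple


def build_csr(num_nodes: int, keys: List[int]) -> Tuple[List[int], List[int]]:
    """Group logical edge IDs by a key node; returns (indptr, edge_ids)."""
    indptr = [0] * (num_nodes + 1)
    for k in keys:
        indptr[k + 1] += 1
    for i in range(num_nodes):
        indptr[i + 1] += indptr[i]
    cursor = indptr[:-1].copy()
    edge_ids = [0] * len(keys)
    for eid, k in enumerate(keys):
        edge_ids[cursor[k]] = eid  # store the logical edge ID, not a copy of the row
        cursor[k] += 1
    return indptr, edge_ids


# Logical edge storage: one row per relationship (hypothetical sample data).
# Includes two parallel edges 0->1 and a self-loop 2->2.
edges = [(0, 1, 1.0), (0, 1, 2.0), (1, 2, 3.0), (2, 2, 4.0)]
num_nodes = 3

# FWD index is keyed by source, BWD index by destination; both point at edge IDs.
fwd_indptr, fwd_ids = build_csr(num_nodes, [s for s, _, _ in edges])
bwd_indptr, bwd_ids = build_csr(num_nodes, [d for _, d, _ in edges])


def out_edges(node: int) -> List[int]:
    return fwd_ids[fwd_indptr[node]:fwd_indptr[node + 1]]


def in_edges(node: int) -> List[int]:
    return bwd_ids[bwd_indptr[node]:bwd_indptr[node + 1]]


logical_count = len(edges)  # independent of how many indexes exist
```

In this sketch the self-loop appears in both `out_edges(2)` and `in_edges(2)` under the same edge ID, and properties live only in `edges`, so adding the BWD index does not change the count.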
For Arrow/ice-disk specifically, that implies either:
- Do not duplicate edges.
  Use one canonical CSR/FLAT table. BWD scans can be slower, but counts stay correct.
- Add a reverse index, not reverse rows.
  For CSR, add BWD indices/indptr that point to the same logical edge IDs, rather than duplicating property rows as separate relationships.
- If duplicated input is unavoidable, require a logical edge ID and mirror marker.
  Then scans/counts must collapse mirrors consistently. This is harder and riskier, especially for paths and multigraphs.
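To illustrate the third option, here is a rough sketch of collapsing mirrors when the input is already duplicated. The `PhysicalRow` type and its `rel_id` / `is_mirror` fields are hypothetical names, and the data is made up to match the 10-edge example; a real loader would need the same logic in every scan path, which is where the risk comes from.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PhysicalRow:
    rel_id: int      # logical relationship ID (assumed column)
    src: int
    dst: int
    is_mirror: bool  # True for the reverse-direction copy (assumed column)


def logical_count(rows: List[PhysicalRow]) -> int:
    """Count logical relationships: distinct rel_id values, not physical rows."""
    return len({r.rel_id for r in rows})


def directed_scan(rows: List[PhysicalRow]) -> List[PhysicalRow]:
    """Rows a directed scan should return: the canonical (non-mirror) ones."""
    return [r for r in rows if not r.is_mirror]


# 10 logical edges i -> i+1, each stored twice (canonical row plus mirror row).
rows: List[PhysicalRow] = []
for i in range(10):
    rows.append(PhysicalRow(rel_id=i, src=i, dst=i + 1, is_mirror=False))
    rows.append(PhysicalRow(rel_id=i, src=i + 1, dst=i, is_mirror=True))

physical_rows = len(rows)
logical_rows = logical_count(rows)
```

Here `physical_rows` is 20 while `logical_rows` is 10. Without `rel_id`, two mirror rows and two genuinely parallel edges would look the same.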
So for your consistency requirement, I would not load duplicated Parquet rows as ordinary Arrow rel rows. I’d treat reverse data as an internal index
representation, matching native-table semantics.
› Yes - I want to treat the duplicate rows as internal index. How do I make this distinction clear in icedisk spec?
• Make the spec distinguish logical relationship rows from directional index rows. That is the core wording.
Suggested model:
Relationship Layouts
- FLAT: one row per logical relationship.
- CSR: one row per logical relationship in the forward CSR index.
- DIRECTIONAL_CSR or BIDIRECTIONAL_CSR: directional CSR index files may contain mirror rows, but those rows are not additional logical relationships.
For ice-disk, I’d write it like this:
Relationship Identity
An ice-disk relationship table represents a set of logical relationships. Query
semantics such as COUNT(*), relationship IDs, path uniqueness, and returned
relationship values are defined over logical relationships, not over physical
index rows.
Some layouts may store additional directional index rows to accelerate traversal.
These rows are an implementation detail and must not change logical cardinality.
Then define the files explicitly:
CSR Relationship Tables
A CSR relationship table may be stored with one or two directional indexes.
Forward index:
indices_<rel>.parquet
indptr_<rel>.parquet
Optional backward index:
indices_bwd_<rel>.parquet
indptr_bwd_<rel>.parquet
Each index row must include a logical relationship identifier column, for example
rel_id. The same logical relationship must use the same rel_id in every
directional index where it appears.
Then make the invariant strict:
Directional Index Invariants
For a table with N logical relationships:
- rel_id identifies the logical relationship.
- Directional index rows are not counted as independent relationships.
- A logical relationship may appear in both FWD and BWD indexes.
- If it appears in both, both rows must share the same rel_id.
- Properties are defined by logical relationship, not directional index row.
- If properties are repeated in multiple indexes, repeated values must be identical.
- Self-loops must not create duplicate logical relationships.
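A loader could check these invariants before trusting the files. The following is only a sketch: the row shape `(rel_id, src, dst, properties)` and the function name are assumptions, and the rule that a rel_id appears at most once per directional index is my reading of the invariants rather than something the spec text above states outright.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical in-memory row shape: (rel_id, src, dst, properties)
Row = Tuple[int, int, int, Dict[str, Any]]


def validate_directional_indexes(fwd: List[Row], bwd: List[Row]) -> int:
    """Check the directional index invariants; return N, the logical count."""

    def by_id(rows: List[Row], name: str) -> Dict[int, Tuple[int, int, Dict[str, Any]]]:
        seen: Dict[int, Tuple[int, int, Dict[str, Any]]] = {}
        for rel_id, src, dst, props in rows:
            if rel_id in seen:
                # Assumption: one row per rel_id within a single index.
                raise ValueError(f"{name}: rel_id {rel_id} appears more than once")
            seen[rel_id] = (src, dst, props)
        return seen

    f = by_id(fwd, "FWD")
    b = by_id(bwd, "BWD")
    for rel_id in f.keys() & b.keys():
        if f[rel_id][:2] != b[rel_id][:2]:
            raise ValueError(f"rel_id {rel_id}: endpoints differ between FWD and BWD")
        if f[rel_id][2] != b[rel_id][2]:
            raise ValueError(f"rel_id {rel_id}: repeated properties are not identical")
    # Logical cardinality is the number of distinct rel_ids, not the row total.
    return len(f.keys() | b.keys())


# Two parallel edges 0->1 and a self-loop 2->2, present in both indexes.
fwd_rows: List[Row] = [(0, 0, 1, {"w": 1.0}), (1, 0, 1, {"w": 2.0}), (2, 2, 2, {"w": 3.0})]
bwd_rows: List[Row] = [(1, 0, 1, {"w": 2.0}), (0, 0, 1, {"w": 1.0}), (2, 2, 2, {"w": 3.0})]
n_logical = validate_directional_indexes(fwd_rows, bwd_rows)
```

With six physical index rows, `n_logical` comes out as 3, and the self-loop contributes one relationship even though it sits in both indexes.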
I’d avoid saying “duplicate rows” in the normative part. Use “mirror index rows” or “directional index rows”. “Duplicate” sounds like ordinary duplicate data
and invites the wrong interpretation.
The most important concrete requirement is adding a stable logical rel_id. Without it, Ladybug cannot distinguish “two logical edges between the same endpoints”
from “one logical edge stored in two directional indexes”.