Ladybug version
v0.16.1
What operating system are you using?
Ubuntu 22.04
What happened?
Summary
Large bulk ingest is still very slow for our workload even when using bulk COPY-style loading through the Python API, rather than single-row transactional inserts.
I know there is already related context in:
But this report is specifically about large-scale bulk load performance, not per-row or non-batched insert overhead.
Environment
- LadybugDB version: 0.16.1
- Python version: 3.12.13
- Platform: Linux x86_64
- Binding: Python
- Workload runner: custom benchmark harness using Ladybug as an on-disk graph store
What we are doing
This is not single-row insert mode.
We:
- Create schema up front.
- Bulk load node tables from CSV using COPY ... FROM.
- Bulk load relationship tables from CSV using COPY ... FROM.
- Use an on-disk database.
- Do not create extra secondary indexes for this benchmark.
In our harness, node tables are copied directly from CSV, and relationship CSVs are rewritten into chunks and then loaded via COPY.
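For context, the relationship chunking step looks roughly like the sketch below. It is not the exact harness code: `split_csv`, `copy_statements`, the chunk file naming, and the `(HEADER=true)` option are illustrative assumptions based on Kuzu-style COPY syntax, and the chunk size is arbitrary.

```python
import csv
from pathlib import Path


def split_csv(src: Path, out_dir: Path, rows_per_chunk: int) -> list[Path]:
    """Split one CSV into chunk files, repeating the header row in each chunk."""
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks: list[Path] = []
    with src.open(newline="") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header is None:  # empty input: nothing to split
            return chunks
        out = None
        writer = None
        for i, row in enumerate(reader):
            if i % rows_per_chunk == 0:
                # Start a new chunk file every rows_per_chunk data rows.
                if out is not None:
                    out.close()
                path = out_dir / f"{src.stem}_{len(chunks):05d}.csv"
                out = path.open("w", newline="")
                writer = csv.writer(out)
                writer.writerow(header)
                chunks.append(path)
            writer.writerow(row)
        if out is not None:
            out.close()
    return chunks


def copy_statements(table: str, chunks: list[Path]) -> list[str]:
    """Build one COPY statement per chunk (assumed Kuzu-style syntax)."""
    return [f'COPY {table} FROM "{p.as_posix()}" (HEADER=true)' for p in chunks]
```

Each returned statement is then executed in order against the on-disk database through the Python binding.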
Dataset shape
Synthetic property graph:
- 10 node labels
- 10 relationship types
- 10,000,000 nodes
- 77,790,000 edges
- skewed degree profile
- node properties:
  - 8 extra text
  - 18 extra numeric
  - 8 extra boolean
- edge properties:
  - 4 extra text
  - 10 extra numeric
  - 4 extra boolean
Observed behavior
Large ingest takes roughly 19 to 20 hours before analytical queries even begin.
Representative ingest times from repeated runs:
- 68,619,567 ms (~19.06 h)
- 69,380,471 ms (~19.27 h)
- 72,656,042 ms (~20.18 h)
So in practice, just preparing the large dataset is already prohibitively expensive for benchmarking and evaluation workflows.
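To put those timings in perspective, the short calculation below converts them into an approximate ingest rate. It assumes each node and each edge counts as one ingested record and uses only the dataset sizes and timings quoted in this report. The helper names are illustrative.

```python
NODES = 10_000_000
EDGES = 77_790_000
INGEST_MS = [68_619_567, 69_380_471, 72_656_042]


def hours(ms: int) -> float:
    """Convert milliseconds to hours."""
    return ms / 3_600_000


def records_per_second(ms: int, records: int = NODES + EDGES) -> float:
    """Average ingest rate, treating every node and edge as one record."""
    return records / (ms / 1000)


for ms in INGEST_MS:
    print(f"{ms:,} ms = {hours(ms):.2f} h, {records_per_second(ms):,.0f} records/s")
```

Across the three runs this works out to roughly 1,200 to 1,300 records per second for the whole load, averaged over nodes and edges together.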
Why this seems distinct from #302
Issue #302 discusses poor performance for non-batched / transactional single-row inserts.
Our case is different:
- we are not inserting row-by-row
- we are doing large CSV-based bulk load
- the database still takes around 19 to 20 hours to ingest this graph
That makes it feel like there may still be a separate bottleneck in the current bulk-load path at larger scales.
Minimal reproduction shape
The exact harness is custom, but the important part is:
- create 10 node tables and 10 relationship tables
- bulk import ~10M nodes and ~77.79M relationships from CSV
- use Python API with on-disk database
- large skewed graph with many properties per record
If useful, I can provide a more minimal standalone repro script that just focuses on schema creation + CSV COPY load without the rest of the benchmark machinery.
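As a starting point, such a repro could be shaped like the sketch below. It only builds the statement list and leaves execution to a caller-supplied `execute` callable, so it makes no claims about the Ladybug Python API itself. The table names (`Node0`, `Rel0`, ...), column names, column types, the ring-shaped relationship endpoints, the CSV file layout, and the Kuzu-style DDL and COPY syntax are all assumptions for illustration, not the real harness schema.

```python
from pathlib import Path
from typing import Callable


def property_columns(n_text: int, n_num: int, n_bool: int) -> list[str]:
    """Hypothetical property columns matching the counts in 'Dataset shape'."""
    return (
        [f"text_{i} STRING" for i in range(n_text)]
        + [f"num_{i} DOUBLE" for i in range(n_num)]
        + [f"flag_{i} BOOLEAN" for i in range(n_bool)]
    )


def node_table_ddl(name: str) -> str:
    cols = ["id INT64"] + property_columns(8, 18, 8)
    return f"CREATE NODE TABLE {name}({', '.join(cols)}, PRIMARY KEY (id))"


def rel_table_ddl(name: str, src: str, dst: str) -> str:
    cols = property_columns(4, 10, 4)
    return f"CREATE REL TABLE {name}(FROM {src} TO {dst}, {', '.join(cols)})"


def copy_stmt(table: str, csv_path: Path) -> str:
    return f'COPY {table} FROM "{csv_path.as_posix()}" (HEADER=true)'


def repro_statements(csv_dir: Path, n_tables: int = 10) -> list[str]:
    """Schema first, then node COPYs, then relationship COPYs."""
    nodes = [f"Node{i}" for i in range(n_tables)]
    # Assumed endpoints: Rel i connects Node i to Node (i + 1) mod n_tables.
    rels = [(f"Rel{i}", nodes[i], nodes[(i + 1) % n_tables]) for i in range(n_tables)]
    stmts = [node_table_ddl(n) for n in nodes]
    stmts += [rel_table_ddl(r, s, d) for r, s, d in rels]
    stmts += [copy_stmt(n, csv_dir / f"{n}.csv") for n in nodes]
    stmts += [copy_stmt(r, csv_dir / f"{r}.csv") for r, _, _ in rels]
    return stmts


def run(execute: Callable[[str], object], csv_dir: Path) -> None:
    """Run every statement through whatever the binding uses to execute queries."""
    for stmt in repro_statements(csv_dir):
        execute(stmt)
```

Passing the connection's query-execution method as `execute` and timing `run` would isolate schema creation and COPY load from the rest of the benchmark.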
Expected behavior
Bulk ingest should be substantially faster for this scale, or at least there should be a documented fast-ingest path that makes this practical.
Actual behavior
Large bulk ingest completes, but takes around 19 to 20 hours, which is too slow for practical large-scale benchmark preparation.
Additional note
I am intentionally keeping this issue focused on ingest time only.
In separate large runs, we also saw later query-execution instability / process kills during heavy OLAP traversal queries, but I do not want to mix that into this report unless you think the two are likely connected.
Are there known steps to reproduce?
No response