MDEV-38975: HEAP engine BLOB/TEXT/JSON/GEOMETRY column support #4735
arcivanov wants to merge 1 commit into MariaDB:10.11
Conversation
@arcivanov, a truly impressive first PR. See https://mariadb.com/docs/server/server-management/install-and-upgrade-mariadb/installing-mariadb/compiling-mariadb-from-source/compile-and-using-mariadb-with-sanitizers-asan-ubsan-tsan-msan#buildbots-msan-container for help resolving the MSAN errors. Full CI results: https://buildbot.mariadb.org/#/grid?branch=refs%2Fpull%2F4735%2Fhead Do reach out if you need help: https://mariadb.zulipchat.com
Force-pushed ced75a6 to 57fc52e
gkodinov left a comment
Thank you for your contribution! This is a preliminary review.
There's nothing formally wrong with the diff. One remark: this seems like a new feature to me. So, technically, it should be going to main according to policy.
But I'll leave that decision to the final reviewer.
LGTM. Please stay tuned for the final review.
gkodinov left a comment
Missed the tests failing. Please make sure buildbot doesn't produce any failures.
The tests are being fixed right now.
Force-pushed 08ec2cf to 7fbaf3e
gkodinov left a comment
Nothing more to add. Please stay tuned for the final review.
I have added a lot of comments about the design in https://jira.mariadb.org/browse/MDEV-38975
Force-pushed f1dc279 to 15023ad
The failing test in https://buildbot.mariadb.org/#builders/534/builds/35108 is a flaky timeout issue. It is rpl.rpl_semi_sync_shutdown_await_ack, a replication semi-sync test, nothing to do with HEAP/BLOB. It's a timeout waiting for a semi-sync ACK during shutdown, failing across all three binlog formats (stmt, row, mix). This is a flaky replication test, not caused by this feature. Not being in MariaDB, I can't press rebuild to clear it, so would anybody be so kind?
Never mind, I pushed a commit update; hopefully the next run is luckier.
Those tests do sound like the usual suspects. Don't worry, review isn't going to stop because of a few tests, and we'll retrigger if important. MSAN was happy after your first update and still is. Thanks.
This PR is finalized pending review. I don't expect any more architectural changes or refactorings unless requested. |
Allow BLOB/TEXT/JSON/GEOMETRY columns in MEMORY (HEAP) engine tables by storing blob data in variable-length continuation record chains within the existing `HP_BLOCK` structure.

**Continuation runs**: blob data is split across contiguous sequences of `recbuffer`-sized records. Each run stores a 10-byte header (`next_cont` pointer + `run_rec_count`) in the first record; inner records (rec 1..N-1) have no flags byte, so they carry a full `recbuffer` payload. Runs are linked via `next_cont` pointers. Individual runs are capped at 65,535 records (the `uint16` format limit); larger blobs are automatically split into multiple runs.

**Zero-copy reads**: single-run blobs return pointers directly into `HP_BLOCK` records, avoiding `blob_buff` reassembly entirely:

- Case A (`run_rec_count == 1`): return `chain + HP_CONT_HEADER_SIZE`
- Case B (`HP_ROW_CONT_ZEROCOPY` flag): return `chain + recbuffer`
- Case C (multi-run): walk the chain and reassemble into `blob_buff`

`HP_INFO::has_zerocopy_blobs` tracks zero-copy state; it is used by `heap_update()` to refresh the caller's record buffer after freeing old chains, preventing dangling pointers.

**Free list scavenging**: on insert, the free list is walked read-only (peek), tracking contiguous groups in descending address order (LIFO). Qualifying groups (>= `min_run_records`) are unlinked and used. The first non-qualifying group terminates the scan; the remaining data is allocated from the block tail. The free list is never disturbed when no qualifying group is found.

**Record counting**: the new `HP_SHARE::total_records` tracks all physical records (primary + continuation). `HP_SHARE::records` remains logical (primary-only) to preserve linear hash bucket mapping correctness.

**Scan/check batch-skip**: `heap_scan()` and `heap_check_heap()` read `run_rec_count` from rec 0 and skip entire continuation runs at once.
**Hash functions**: `hp_rec_hashnr()`, `hp_rec_key_cmp()`, `hp_key_cmp()`, and `hp_make_key()` are updated to handle `HA_BLOB_PART` key segments, reading the actual blob data via pointer dereference or chain materialization.

**SQL layer**: `choose_engine()` no longer rejects HEAP for blob tables (the `blob_fields` check is replaced with `reclength > HA_MAX_REC_LENGTH`). `remove_duplicates()` routes HEAP+blob to `remove_dup_with_compare()`. `ha_heap::remember_rnd_pos()` / `restart_rnd_next()` are implemented for DISTINCT deduplication support. Fixed undefined behavior in `test_if_cheaper_ordering()` where `select_limit/fanout` could overflow to infinity; it is now capped at `HA_POS_ERROR`.

https://jira.mariadb.org/browse/MDEV-38975
Added two defensive memory-continuity allocation assertions for future-proofing; nothing else changed.