
MDEV-38975: HEAP engine BLOB/TEXT/JSON/GEOMETRY column support#4735

Open
arcivanov wants to merge 1 commit into MariaDB:10.11 from arcivanov:MDEV-38975

Conversation

@arcivanov
Contributor

@arcivanov arcivanov commented Mar 5, 2026

Allow BLOB/TEXT/JSON/GEOMETRY columns in MEMORY (HEAP) engine tables
by storing blob data in variable-length continuation record chains
within the existing `HP_BLOCK` structure.

**Continuation runs**: blob data is split across contiguous sequences
of `recbuffer`-sized records. Each run stores a 10-byte header
(`next_cont` pointer + `run_rec_count`) in the first record; inner
records (rec 1..N-1) have no flags byte — full `recbuffer` payload.
Runs are linked via `next_cont` pointers. Individual runs are capped
at 65,535 records (`uint16` format limit); larger blobs are
automatically split into multiple runs.
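The split-and-cap arithmetic above can be modeled with a small helper. This is an illustrative sketch only, assuming the 10-byte header is consumed from the first record's payload while inner records carry full `recbuffer` bytes; `run_layout` is a hypothetical name, not MariaDB code.

```c
#include <assert.h>
#include <stddef.h>

#define HP_CONT_HEADER_SIZE 10   /* next_cont pointer + run_rec_count, per the description */
#define MAX_RUN_RECORDS 65535UL  /* uint16 format limit on records per run */

/* Hypothetical model: count the records and runs a blob of blob_len bytes
   would occupy, assuming rec 0 of each run loses HP_CONT_HEADER_SIZE bytes
   to the header and inner records hold a full recbuffer payload. */
static void run_layout(size_t blob_len, size_t recbuffer,
                       size_t *records, size_t *runs)
{
  size_t recs = 0, nruns = 0, remaining = blob_len;
  do
  {
    size_t cap_first = recbuffer - HP_CONT_HEADER_SIZE;
    size_t stored = remaining < cap_first ? remaining : cap_first;
    size_t run_recs = 1;
    remaining -= stored;
    while (remaining > 0 && run_recs < MAX_RUN_RECORDS)
    {
      size_t chunk = remaining < recbuffer ? remaining : recbuffer;
      remaining -= chunk;
      run_recs++;
    }
    recs += run_recs;
    nruns++;
  } while (remaining > 0);
  *records = recs;
  *runs = nruns;
}
```

Under these assumptions, with `recbuffer == 128` a 1000-byte blob occupies 8 records in a single run; only a blob spanning more than 65,535 records spills into a second run.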

**Zero-copy reads**: single-run blobs return pointers directly into
`HP_BLOCK` records, avoiding `blob_buff` reassembly entirely:

- Case A (`run_rec_count == 1`): return `chain + HP_CONT_HEADER_SIZE`
- Case B (`HP_ROW_CONT_ZEROCOPY` flag): return `chain + recbuffer`
- Case C (multi-run): walk chain, reassemble into `blob_buff`

`HP_INFO::has_zerocopy_blobs` tracks zero-copy state; it is used by
`heap_update()` to refresh the caller's record buffer after freeing
old chains, preventing dangling pointers.
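The three read paths reduce to a small dispatch over the run shape. A sketch, where the offset arithmetic follows the case list above but the constant values and the helper name `blob_read_ptr` are assumptions for illustration:

```c
#include <assert.h>
#include <stddef.h>

enum
{
  HP_CONT_HEADER_SIZE_SKETCH = 10,  /* assumed header size, per the description */
  ZEROCOPY_FLAG = 0x02              /* stand-in for HP_ROW_CONT_ZEROCOPY */
};

/* Sketch of the three read paths: A and B hand back a pointer into the
   chain itself (zero-copy); C signals that the caller must walk the
   chain and reassemble the blob into blob_buff. */
static const unsigned char *
blob_read_ptr(const unsigned char *chain, unsigned run_rec_count,
              unsigned flags, size_t recbuffer)
{
  if (run_rec_count == 1)
    return chain + HP_CONT_HEADER_SIZE_SKETCH;   /* Case A */
  if (flags & ZEROCOPY_FLAG)
    return chain + recbuffer;                    /* Case B */
  return NULL;                                   /* Case C: reassemble */
}
```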

**Free list scavenging**: on insert, the free list is walked read-only
(peek), tracking contiguous groups in descending address order (LIFO).
Qualifying groups (>= `min_run_records`) are unlinked and used. The
first non-qualifying group terminates the scan — remaining data is
allocated from the block tail. The free list is never disturbed when
no qualifying group is found.
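A simplified sketch of the peek-then-unlink walk, assuming the free list is singly linked in descending address order and "contiguous" means each node sits exactly `recbuffer` below its predecessor. `free_rec` and `scavenge` are hypothetical names, and this sketch stops after the first group for brevity, whereas the real walk presumably continues over further qualifying groups.

```c
#include <assert.h>
#include <stddef.h>

typedef struct free_rec { size_t addr; struct free_rec *next; } free_rec;

/* Peek at the head of the free list, measuring the first physically
   contiguous group (addresses descending by recbuffer, LIFO order).
   A group of at least min_run records is unlinked and returned;
   otherwise the scan stops and the list is left untouched, so the
   caller allocates from the block tail instead. */
static free_rec *scavenge(free_rec **head, size_t recbuffer, size_t min_run)
{
  free_rec *start = *head, *end;
  size_t n = 1;
  if (!start)
    return NULL;
  end = start;
  while (end->next && end->next->addr == end->addr - recbuffer)
  {
    end = end->next;
    n++;
  }
  if (n < min_run)
    return NULL;          /* first non-qualifying group ends the scan */
  *head = end->next;      /* unlink the whole group, list stays intact */
  end->next = NULL;
  return start;
}
```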

**Record counting**: new `HP_SHARE::total_records` tracks all physical
records (primary + continuation). `HP_SHARE::records` remains logical
(primary-only) to preserve linear hash bucket mapping correctness.

**Scan/check batch-skip**: `heap_scan()` and `heap_check_heap()` read
`run_rec_count` from rec 0 and skip entire continuation runs at once.
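The batch skip amounts to one 16-bit read and a multiply. A sketch, assuming `run_rec_count` sits after an 8-byte `next_cont` field in rec 0 (the exact offsets are assumptions, and `skip_continuation_run` is a hypothetical name):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch: advance a scan position past an entire continuation run by
   reading run_rec_count from record 0, instead of stepping one
   record at a time. The 8-byte next_cont offset is an assumed layout. */
static size_t skip_continuation_run(size_t pos, const unsigned char *rec0,
                                    size_t recbuffer)
{
  uint16_t run_rec_count;
  memcpy(&run_rec_count, rec0 + 8, sizeof run_rec_count);
  return pos + (size_t) run_rec_count * recbuffer;
}
```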

**Hash functions**: `hp_rec_hashnr()`, `hp_rec_key_cmp()`, `hp_key_cmp()`,
`hp_make_key()` updated to handle `HA_BLOB_PART` key segments — reading
actual blob data via pointer dereference or chain materialization.
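The key point is that a blob column stores a (length, pointer) pair in the record, so a blob-aware hash must dereference the pointer to reach the actual bytes. A sketch using an illustrative FNV-1a hash and an assumed 4-byte length prefix; this is not MariaDB's actual `hp_rec_hashnr()` layout or hash.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch: hash a blob column through its in-record representation,
   assumed here to be a 4-byte length followed by a raw pointer to the
   blob bytes. The real HA_BLOB_PART layout and hash differ. */
static uint32_t hash_blob_part(const unsigned char *rec_field)
{
  uint32_t len, h = 2166136261u;     /* FNV-1a offset basis */
  const unsigned char *data;
  memcpy(&len, rec_field, 4);
  memcpy(&data, rec_field + 4, sizeof data);
  for (uint32_t i = 0; i < len; i++)
  {
    h ^= data[i];
    h *= 16777619u;                  /* FNV-1a prime */
  }
  return h;
}
```

Hashing through the pointer means two records whose blob pointers differ but whose blob bytes match still hash equally, which is what key comparison requires.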

**SQL layer**: `choose_engine()` no longer rejects HEAP for blob tables
(replaced `blob_fields` check with `reclength > HA_MAX_REC_LENGTH`).
`remove_duplicates()` routes HEAP+blob to `remove_dup_with_compare()`.
`ha_heap::remember_rnd_pos()` / `restart_rnd_next()` implemented for
DISTINCT deduplication support. Fixed undefined behavior in
`test_if_cheaper_ordering()` where `select_limit/fanout` could overflow
to infinity — capped at `HA_POS_ERROR`.
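The overflow fix is a saturating division. A sketch of the guard, using an illustrative sentinel in place of `HA_POS_ERROR` (whose exact value is defined in the MariaDB headers, not here); `safe_limit` is a hypothetical name.

```c
#include <assert.h>
#include <math.h>

/* Illustrative stand-in for HA_POS_ERROR (a largest-representable row
   count); the real constant lives in the MariaDB headers. */
#define POS_ERROR_SENTINEL 18446744073709551615.0

/* Sketch: select_limit / fanout can overflow a double to infinity when
   fanout is tiny; saturate at the sentinel instead of propagating inf
   into later cost comparisons. */
static double safe_limit(double select_limit, double fanout)
{
  double v = select_limit / fanout;
  if (!isfinite(v) || v > POS_ERROR_SENTINEL)
    return POS_ERROR_SENTINEL;
  return v;
}
```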

https://jira.mariadb.org/browse/MDEV-38975

@arcivanov
Contributor Author

@gkodinov @dr-m

@grooverdan
Member

grooverdan commented Mar 5, 2026

@arcivanov, a truly impressive first PR.

See https://mariadb.com/docs/server/server-management/install-and-upgrade-mariadb/installing-mariadb/compiling-mariadb-from-source/compile-and-using-mariadb-with-sanitizers-asan-ubsan-tsan-msan#buildbots-msan-container for help resolving the MSAN errors. A dev_debian13-msan-clang-22 tagged MSAN container was built yesterday.

Full CI results: https://buildbot.mariadb.org/#/grid?branch=refs%2Fpull%2F4735%2Fhead

Do reach out if you need help - https://mariadb.zulipchat.com

@arcivanov arcivanov force-pushed the MDEV-38975 branch 2 times, most recently from ced75a6 to 57fc52e on March 5, 2026 at 07:59
@gkodinov gkodinov added the External Contribution label (all PRs from entities outside of MariaDB Foundation, Corporation, Codership agreements) on Mar 5, 2026
Member

@gkodinov gkodinov left a comment


Thank you for your contribution! This is a preliminary review.

There's nothing formally wrong with the diff. One remark: this seems like a new feature to me. So, technically, it should be going to main according to policy.

But I'll leave that decision to the final reviewer.

LGTM. Please stay tuned for the final review.

Member

@gkodinov gkodinov left a comment


Missed the tests failing. Please make sure buildbot doesn't produce any failures.

@arcivanov
Contributor Author

The tests are being fixed right now.

@arcivanov arcivanov force-pushed the MDEV-38975 branch 5 times, most recently from 08ec2cf to 7fbaf3e on March 5, 2026 at 11:32
Member

@gkodinov gkodinov left a comment


Nothing more to add. Please stay tuned for the final review.

@gkodinov gkodinov requested a review from montywi on March 5, 2026 at 12:07
@montywi
Contributor

montywi commented Mar 5, 2026

I have added a lot of comments about the design in https://jira.mariadb.org/browse/MDEV-38975
(looks good)

@arcivanov arcivanov force-pushed the MDEV-38975 branch 6 times, most recently from f1dc279 to 15023ad on March 6, 2026 at 05:52
@arcivanov
Contributor Author

The https://buildbot.mariadb.org/#builders/534/builds/35108 failing test is a flaky timeout issue.

This is rpl.rpl_semi_sync_shutdown_await_ack - a replication semi-sync test, nothing to do with HEAP/BLOB. It's a timeout waiting for a semi-sync ACK during shutdown, failing across all three binlog formats (stmt, row, mix). This is a flaky replication test, not caused by this feature.

Since I'm not in the MariaDB org, I can't press rebuild to clear it; would anybody be so kind?

@arcivanov arcivanov changed the title from MDEV-38975: HEAP engine BLOB/TEXT column support to MDEV-38975: HEAP engine BLOB/TEXT/JSON/GEOMETRY column support on Mar 6, 2026
@arcivanov
Contributor Author

> The https://buildbot.mariadb.org/#builders/534/builds/35108 failing test is a flaky timeout issue.
>
> This is rpl.rpl_semi_sync_shutdown_await_ack - a replication semi-sync test, nothing to do with HEAP/BLOB. It's a timeout waiting for a semi-sync ACK during shutdown, failing across all three binlog formats (stmt, row, mix). This is a flaky replication test, not caused by this feature.
>
> Since I'm not in the MariaDB org, I can't press rebuild to clear it; would anybody be so kind?

Nevermind, I pushed a commit update; hopefully the next run is luckier.

@grooverdan
Member

> The https://buildbot.mariadb.org/#builders/534/builds/35108 failing test is a flaky timeout issue.
> This is rpl.rpl_semi_sync_shutdown_await_ack - a replication semi-sync test, nothing to do with HEAP/BLOB. It's a timeout waiting for a semi-sync ACK during shutdown, failing across all three binlog formats (stmt, row, mix). This is a flaky replication test, not caused by this feature.
> Since I'm not in the MariaDB org, I can't press rebuild to clear it; would anybody be so kind?
>
> Nevermind, I pushed a commit update; hopefully the next run is luckier.

Those tests do sound like the usual suspects. Don't worry, the review isn't going to stop because of a few tests, and we'll retrigger if important.

MSAN was happy after your first update and still is.

Thanks.

@arcivanov
Contributor Author

This PR is finalized pending review. I don't expect any more architectural changes or refactorings unless requested.

@arcivanov
Contributor Author

Added two defensive memory-continuity allocation assertions to future-proof the code; nothing else changed.

