| 1 | +# Plan: RFC-0201 Blob Implementation Missions |
| 2 | + |
| 3 | +## Context |
| 4 | + |
| 5 | +RFC-0201 (Binary BLOB Type for Deterministic Hash Storage) has been moved to Accepted status in the CipherOcto repository. The spec defines native BYTEA/BLOB support for cryptographic hash storage (SHA256, HMAC-SHA256). Implementation must happen in the **stoolap** codebase (external dependency at `github.com:CipherOcto/stoolap`, branch `feat/blockchain-sql`). |
| 6 | + |
| 7 | +Two separate missions are needed: |
| 8 | +- **Mission A**: Phase 2a/2b/2c/2e — Core Blob (parser, DataType, Value, serialization, comparison, projection) |
| 9 | +- **Mission B**: Phase 2f — DFP/BigInt wire format integration |
| 10 | + |
| 11 | +--- |
| 12 | + |
| 13 | +## Mission A: RFC-0201 Phase 2a/2b/2c/2e — BYTEA Core Blob Type |
| 14 | + |
| 15 | +### 1. DataType Enum (`src/core/types.rs`) |
| 16 | + |
| 17 | +Add `Blob = 10` as the next free variant: |
| 18 | + |
| 19 | +```rust |
| 20 | +/// Binary large object for cryptographic hashes and binary data |
| 21 | +Blob = 10, |
| 22 | +``` |
| 23 | + |
| 24 | +Update `FromStr` to parse BYTEA/BLOB/BINARY/VARBINARY:
| 25 | + |
| 26 | +```rust |
| 27 | +"BYTEA" | "BLOB" | "BINARY" | "VARBINARY" => Ok(DataType::Blob), |
| 28 | +``` |
| 29 | + |
| 30 | +Update `is_numeric` → no change (Blob is not numeric). Update `is_orderable` → keep the exclusion list as `!matches!(..., DataType::Json | DataType::Vector)` and do not add `DataType::Blob` to it, because Blob IS orderable via byte comparison.
| 31 | + |
| 32 | +**Note**: `DataType::as_u8()` picks up the new variant through `#[repr(u8)]`. Check `from_u8()` as well: if it is a hand-written `match`, it needs a `10 => DataType::Blob` arm.
| 33 | + |
| 34 | +### 2. SchemaColumn Extension (`src/core/schema.rs`) |
| 35 | + |
| 36 | +Add `blob_length: Option<u32>` to `SchemaColumn`: |
| 37 | + |
| 38 | +```rust |
| 39 | +/// Fixed length for BLOB columns (None = variable length) |
| 40 | +pub blob_length: Option<u32>, |
| 41 | +``` |
| 42 | + |
| 43 | +Initialize to `None` in all constructors. Add builder method: |
| 44 | + |
| 45 | +```rust |
| 46 | +pub fn with_blob_length(mut self, len: u32) -> Self { |
| 47 | + self.blob_length = Some(len); |
| 48 | + self |
| 49 | +} |
| 50 | +``` |
| 51 | + |
| 52 | +### 3. Value::Blob Variant (`src/core/value.rs`) |
| 53 | + |
| 54 | +Add first-class Blob variant (NOT Extension): |
| 55 | + |
| 56 | +```rust |
| 57 | +/// Binary large object — stored as CompactArc<[u8]> for zero-copy sharing. |
| 58 | +/// INVARIANT: The Arc is always heap-allocated; there is no inline/blob case. |
| 59 | +Blob(CompactArc<[u8]>), |
| 60 | +``` |
| 61 | + |
| 62 | +**Remove** the comment at line 68 mentioning "Blob" as a future Extension type. |
| 63 | + |
| 64 | +### 4. Blob Constructors in Value |
| 65 | + |
| 66 | +```rust |
| 67 | +impl Value { |
| 68 | + /// Create a Blob from a byte slice (copies into CompactArc) |
| 69 | + pub fn blob(data: &[u8]) -> Self { |
| 70 | + Value::Blob(CompactArc::from(data)) |
| 71 | + } |
| 72 | + |
| 73 | + /// Create a Blob from an owned Vec (takes ownership of the Vec; whether the bytes are copied depends on CompactArc's `From<Vec<u8>>` impl)
| 74 | + pub fn blob_from_vec(data: Vec<u8>) -> Self { |
| 75 | + Value::Blob(CompactArc::from(data)) |
| 76 | + } |
| 77 | + |
| 78 | + /// Create a Blob from a CompactArc (zero-copy) |
| 79 | + pub fn blob_from_arc(data: CompactArc<[u8]>) -> Self { |
| 80 | + Value::Blob(data) |
| 81 | + } |
| 82 | + |
| 83 | + /// Extract blob data as byte slice |
| 84 | + pub fn as_blob(&self) -> Option<&[u8]> { |
| 85 | + match self { |
| 86 | + Value::Blob(data) => Some(data), |
| 87 | + _ => None, |
| 88 | + } |
| 89 | + } |
| 90 | +} |
| 91 | +``` |
| 92 | + |
| 93 | +### 5. compare_blob and BlobOrdering (`src/core/value.rs`) |
| 94 | + |
| 95 | +Per RFC-0201 Section on Comparison Semantics: |
| 96 | + |
| 97 | +```rust |
| 98 | +#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)] |
| 99 | +pub enum BlobOrdering { |
| 100 | + Less, |
| 101 | + Equal, |
| 102 | + Greater, |
| 103 | +} |
| 104 | + |
| 105 | +/// Compare two blobs byte-by-byte in deterministic order |
| 106 | +/// |
| 107 | +/// Algorithm: |
| 108 | +/// 1. Compare bytes in ascending index order until difference found |
| 109 | +/// 2. If all compared bytes are equal, compare lengths (shorter = less) |
| 110 | +/// |
| 111 | +/// Determinism: This ordering is canonical and reproducible. |
| 112 | +fn compare_blob(a: &[u8], b: &[u8]) -> BlobOrdering { |
| 113 | + let min_len = a.len().min(b.len()); |
| 114 | + for i in 0..min_len { |
| 115 | + match a[i].cmp(&b[i]) { |
| 116 | + Ordering::Less => return BlobOrdering::Less, |
| 117 | + Ordering::Greater => return BlobOrdering::Greater, |
| 118 | + Ordering::Equal => continue, |
| 119 | + } |
| 120 | + } |
| 121 | + match a.len().cmp(&b.len()) { |
| 122 | + Ordering::Less => BlobOrdering::Less, |
| 123 | + Ordering::Greater => BlobOrdering::Greater, |
| 124 | + Ordering::Equal => BlobOrdering::Equal, |
| 125 | + } |
| 126 | +} |
| 127 | +``` |
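
The ordering described above (first differing byte decides, then the shorter input sorts first) is the same lexicographic order that Rust's standard library uses for `[u8]` slices. The sketch below shows this on plain slices. It is only an illustration, not stoolap code: `compare_bytes_explicit` and `agrees_with_slice_cmp` are hypothetical names, and it returns `std::cmp::Ordering` rather than the RFC's `BlobOrdering`.

```rust
use std::cmp::Ordering;

/// Byte-by-byte comparison written out explicitly (hypothetical helper).
/// The first differing byte decides; otherwise the shorter slice is less.
fn compare_bytes_explicit(a: &[u8], b: &[u8]) -> Ordering {
    for (x, y) in a.iter().zip(b.iter()) {
        match x.cmp(y) {
            Ordering::Equal => continue,
            other => return other,
        }
    }
    // All shared positions are equal: fall back to length.
    a.len().cmp(&b.len())
}

/// Checks that the explicit loop agrees with the built-in slice ordering.
fn agrees_with_slice_cmp(a: &[u8], b: &[u8]) -> bool {
    compare_bytes_explicit(a, b) == a.cmp(b)
}
```

This agreement is what keeps the `Ord` arm (`a.cmp(b)`) and the `compare_same_type` arm (`compare_blob`) consistent, assuming `CompactArc<[u8]>` compares by its byte contents.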
| 128 | + |
| 129 | +**Important**: `BlobOrdering` is NOT `Ordering` — the RFC intentionally uses a separate type. The `Ord` impl on `BlobOrdering` is for use in BTree contexts, but `compare_blob` returns `BlobOrdering`. |
| 130 | + |
| 131 | +### 6. Value::compare Integration (`src/core/value.rs`) |
| 132 | + |
| 133 | +In `compare_same_type`, add: |
| 134 | + |
| 135 | +```rust |
| 136 | +(Value::Blob(a), Value::Blob(b)) => { |
| 137 | + Ok(match compare_blob(a, b) { |
| 138 | + BlobOrdering::Less => Ordering::Less, |
| 139 | + BlobOrdering::Equal => Ordering::Equal, |
| 140 | + BlobOrdering::Greater => Ordering::Greater, |
| 141 | + }) |
| 142 | +} |
| 143 | +``` |
| 144 | + |
| 145 | +In `PartialEq` for Value: |
| 146 | + |
| 147 | +```rust |
| 148 | +(Value::Blob(a), Value::Blob(b)) => a == b, |
| 149 | +``` |
| 150 | + |
| 151 | +In `Ord` for Value: |
| 152 | + |
| 153 | +```rust |
| 154 | +(Value::Blob(a), Value::Blob(b)) => a.cmp(b), |
| 155 | +``` |
| 156 | + |
| 157 | +In `Hash` for Value: |
| 158 | + |
| 159 | +```rust |
| 160 | +Value::Blob(data) => { |
| 161 | + // Include discriminant (10) and blob data in hash.
| 162 | + // `state` is already a `&mut H`, so write to it directly (a `let mut` rebinding trips `unused_mut`).
| 163 | + state.write_u8(10);
| 164 | + state.write(data);
| 165 | +} |
| 166 | +``` |
| 167 | + |
| 168 | +### 7. Display and as_string for Blob |
| 169 | + |
| 170 | +In `fmt::Display`: |
| 171 | + |
| 172 | +```rust |
| 173 | +Value::Blob(data) => { |
| 174 | + // Display as hex string (first 16 bytes + "..." if longer)
| 175 | + if data.len() <= 16 { |
| 176 | + write!(f, "{}", hex::encode(data)) |
| 177 | + } else { |
| 178 | + write!(f, "{}...", hex::encode(&data[..16])) |
| 179 | + } |
| 180 | +} |
| 181 | +``` |
| 182 | + |
| 183 | +In `as_string`: |
| 184 | + |
| 185 | +```rust |
| 186 | +Value::Blob(data) => Some(hex::encode(data)), |
| 187 | +``` |
| 188 | + |
| 189 | +In `as_str` → Blob does NOT implement `as_str` (binary data, not UTF-8). |
| 190 | + |
| 191 | +### 8. Type Coercion for Blob |
| 192 | + |
| 193 | +In `cast_to_type` → `DataType::Blob`: pass through if already Blob, error otherwise. |
| 194 | + |
| 195 | +In `cast_to_type` FROM Blob → Text: hex encoding. |
| 196 | + |
| 197 | +### 9. Serialization (`src/storage/mvcc/persistence.rs`) |
| 198 | + |
| 199 | +**Tag 12** is the next free tag for Blob: |
| 200 | + |
| 201 | +```rust |
| 202 | +Value::Blob(data) => { |
| 203 | + buf.push(12); |
| 204 | + buf.extend_from_slice(&(data.len() as u32).to_le_bytes()); |
| 205 | + buf.extend_from_slice(data); |
| 206 | +} |
| 207 | +``` |
| 208 | + |
| 209 | +**Deserialization** for tag 12: |
| 210 | + |
| 211 | +```rust |
| 212 | +12 => { |
| 213 | + // Blob |
| 214 | + if rest.len() < 4 { |
| 215 | + return Err(Error::internal("missing blob length")); |
| 216 | + } |
| 217 | + let len = u32::from_le_bytes(rest[..4].try_into().unwrap()) as usize; |
| 218 | + if rest.len() < 4 + len { |
| 219 | + return Err(Error::internal("missing blob data")); |
| 220 | + } |
| 221 | + let blob_data = CompactArc::from(&rest[4..4 + len]); |
| 222 | + Ok(Value::Blob(blob_data)) |
| 223 | +} |
| 224 | +``` |
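
To make the tag-12 layout concrete (tag byte, 4-byte little-endian length, raw bytes), here is a minimal round-trip sketch. It uses `Vec<u8>` and `String` errors in place of `CompactArc` and stoolap's `Error`, and `encode_blob` / `decode_blob` are hypothetical names. Unlike the snippet above, it rejects lengths that do not fit in `u32` instead of truncating with `as u32`; that check is a suggestion, not something the RFC text quoted here requires.

```rust
/// Tag value taken from the plan above.
const BLOB_TAG: u8 = 12;

/// Append tag + u32 little-endian length + payload to `buf`.
fn encode_blob(data: &[u8], buf: &mut Vec<u8>) -> Result<(), String> {
    if data.len() > u32::MAX as usize {
        return Err("blob length exceeds u32".to_string());
    }
    buf.push(BLOB_TAG);
    buf.extend_from_slice(&(data.len() as u32).to_le_bytes());
    buf.extend_from_slice(data);
    Ok(())
}

/// Decode a single tagged blob, validating tag, length prefix and payload size.
fn decode_blob(input: &[u8]) -> Result<Vec<u8>, String> {
    let (&tag, rest) = input.split_first().ok_or("empty input")?;
    if tag != BLOB_TAG {
        return Err(format!("unexpected tag {}", tag));
    }
    if rest.len() < 4 {
        return Err("missing blob length".to_string());
    }
    let len = u32::from_le_bytes([rest[0], rest[1], rest[2], rest[3]]) as usize;
    // `get` avoids both out-of-bounds panics and `4 + len` overflow.
    let body = rest[4..].get(..len).ok_or("missing blob data")?;
    Ok(body.to_vec())
}
```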
| 225 | + |
| 226 | +### 10. DDL Parser (`src/executor/ddl.rs`) |
| 227 | + |
| 228 | +Currently at line ~1131: `"BLOB" | "BINARY" | "VARBINARY" => Ok(DataType::Text)`. Change to:
| 229 | + |
| 230 | +```rust |
| 231 | +"BYTEA" | "BLOB" | "BINARY" | "VARBINARY" => Ok(DataType::Blob), |
| 232 | +``` |
| 233 | + |
| 234 | +Handle `BYTEA(N)` length constraint via regex in the DDL column parsing path, storing in `SchemaColumn.blob_length`. |
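
As an illustration of the `BYTEA(N)` handling, the sketch below extracts the optional length with plain string operations rather than the regex mentioned above, so that it stays dependency-free. `parse_blob_type` is a hypothetical helper, not an existing stoolap function; the inner `Option<u32>` corresponds to what would be stored in `SchemaColumn.blob_length`.

```rust
/// Hypothetical helper: recognise a blob type declaration.
/// Returns `Some(None)` for a bare type name, `Some(Some(n))` for `NAME(n)`,
/// and `None` when the declaration is not a blob type or the length is malformed.
fn parse_blob_type(decl: &str) -> Option<Option<u32>> {
    let upper = decl.trim().to_ascii_uppercase();
    let (name, len) = match upper.find('(') {
        Some(open) => {
            // Everything between '(' and the trailing ')' must parse as u32.
            let inner = upper[open + 1..].strip_suffix(')')?;
            let n: u32 = inner.trim().parse().ok()?;
            (upper[..open].trim_end(), Some(n))
        }
        None => (upper.as_str(), None),
    };
    match name {
        "BYTEA" | "BLOB" | "BINARY" | "VARBINARY" => Some(len),
        _ => None,
    }
}
```

Whether a zero length should be rejected is not stated in the plan, so the sketch accepts it.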
| 235 | + |
| 236 | +### 11. Projection/Selection (Phase 2c) |
| 237 | + |
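| 238 | +`Value::Blob` must serialize correctly in result set encoding. The `Display` arm from section 7 renders Blob as hex, but it truncates anything longer than 16 bytes, so a 32-byte SHA256 hash would lose its tail there. Result-set encoding that needs the full value should go through `as_string` (full hex) instead.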
| 239 | + |
| 240 | +### 12. Equality in Expression Evaluation (Phase 2b) |
| 241 | + |
| 242 | +`Value::compare` picks up Blob through the new arm in `compare_same_type`. The expression VM calls `col_val.compare(val)`, so no changes are needed in the VM, only in Value's comparison logic.
| 243 | + |
| 244 | +### 13. Phase 2a: Hash Index for Blob Columns |
| 245 | + |
| 246 | +The existing `HashIndex` uses ahash (not SipHash). Per RFC-0201: |
| 247 | + |
| 248 | +- **Acceptable for Phase 2a**: ahash is fine for non-consensus use. SipHash with persistent key is the production requirement for the hash index, but ahash is acceptable for correctness verification first. |
| 249 | +- **Implementation**: `HashIndex` already handles arbitrary `Value` types via `Value::hash`. The key insight is that `HashIndex` stores `Value::Blob` as a key — no structural changes needed. Only the hasher would differ (SipHash vs ahash), which is a Phase 2a follow-up. |
| 250 | + |
| 251 | +Acceptance for Phase 2a: `CREATE INDEX ... USING HASH ON blob_column` creates a functional hash index that correctly resolves `WHERE blob_column = $1` lookups. |
| 252 | + |
| 253 | +### 14. Null Handling |
| 254 | + |
| 255 | +Per RFC-0201: `ALTER TABLE ADD COLUMN BYTEA ... NOT NULL` and `ALTER TABLE ADD COLUMN BYTEA ... NULL` are both **rejected** until null bitmap integration is complete. The schema validation layer must reject any `CREATE TABLE` or `ALTER TABLE` that introduces a BYTEA column with a clear error: "BYTEA columns not supported: null bitmap integration required". |
| 256 | + |
| 257 | +### 15. Tests |
| 258 | + |
| 259 | +Per RFC-0201 test vectors, implement: |
| 260 | +- Blob round-trip: `Value::Blob(bytes)` → serialize → deserialize → `Value::Blob(same_bytes)` |
| 261 | +- `compare_blob` deterministic ordering (bytes-first, length as tiebreaker) |
| 262 | +- `BYTEA` in SQL parser
| 263 | +- `CREATE TABLE t(key_hash BYTEA(32) NOT NULL)` rejected |
| 264 | +- Hash index creation and lookup for BYTEA column |
| 265 | + |
| 266 | +--- |
| 267 | + |
| 268 | +## Mission B: RFC-0201 Phase 2f — DFP and BigInt Dispatcher Integration |
| 269 | + |
| 270 | +Phase 2f implements `serialize_dfp`/`deserialize_dfp` and `serialize_bigint`/`deserialize_bigint` in the RFC-0201 dispatcher, replacing the `Err(DCS_INVALID_STRUCT)` stubs. Both RFC-0104 (DFP, 24-byte canonical format) and RFC-0110 (BigInt, little-endian limb array) are Accepted. |
| 271 | + |
| 272 | +### Prerequisites |
| 273 | + |
| 274 | +- `octo-determin` crate (already a dependency in stoolap — used for `Dfp`, `Dqa`) |
| 275 | +- RFC-0104 and RFC-0110 wire format specs must be available |
| 276 | + |
| 277 | +### DFP (RFC-0104) |
| 278 | + |
| 279 | +The `octo_determin::Dfp` type already exists in stoolap (used via `Value::dfp()` etc.). The missing piece is the **dispatcher integration**:
| 280 | + |
| 281 | +In the RFC-0201 dispatcher pseudocode (implemented in stoolap's query/serialization layer): |
| 282 | + |
| 283 | +```rust |
| 284 | +(Value::Dfp(dfp_val), ColumnType::DeterministicFloat) => { |
| 285 | + let encoding = DfpEncoding::from_dfp(dfp_val).to_bytes(); |
| 286 | + Ok(serialize_dfp(&encoding)) |
| 287 | +} |
| 288 | +``` |
| 289 | + |
| 290 | +The wire format per RFC-0104 is **24 bytes**: sign(1) + exponent(2) + mantissa(21). `octo_determin::DfpEncoding` handles the conversion. |
| 291 | + |
| 292 | +### BigInt (RFC-0110) |
| 293 | + |
| 294 | +The `octo_determin::BigInt` type may not exist yet in stoolap's scope. Per RFC-0110, the wire format is:
| 295 | +- 4-byte little-endian limb count N |
| 296 | +- N × 8-byte little-endian limbs, least-significant first |
| 297 | + |
| 298 | +```rust |
| 299 | +(Value::BigInt(bigint_val), ColumnType::BigInt) => { |
| 300 | + Ok(serialize_bigint(bigint_val)) |
| 301 | +} |
| 302 | +``` |
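
The limb layout listed above (4-byte little-endian count N, then N × 8-byte little-endian limbs, least-significant first) can be sketched on a plain `&[u64]`. This covers only the framing described in this plan. Sign handling and any canonical-form rules from RFC-0110 are not reproduced here, and `serialize_limbs` / `deserialize_limbs` are hypothetical names rather than `octo_determin` APIs.

```rust
/// Encode limbs as: u32 LE count, then each limb as 8 LE bytes.
/// Assumes the limb count fits in u32 (not checked in this sketch).
fn serialize_limbs(limbs: &[u64]) -> Vec<u8> {
    let mut out = Vec::with_capacity(4 + limbs.len() * 8);
    out.extend_from_slice(&(limbs.len() as u32).to_le_bytes());
    for limb in limbs {
        out.extend_from_slice(&limb.to_le_bytes());
    }
    out
}

/// Decode the framing above; returns None on a short or over-long buffer.
fn deserialize_limbs(bytes: &[u8]) -> Option<Vec<u64>> {
    if bytes.len() < 4 {
        return None;
    }
    let n = u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]) as usize;
    let body = &bytes[4..];
    // The payload must be exactly N limbs of 8 bytes each.
    if body.len() != n.checked_mul(8)? {
        return None;
    }
    Some(
        body.chunks_exact(8)
            .map(|chunk| {
                let mut limb = [0u8; 8];
                limb.copy_from_slice(chunk);
                u64::from_le_bytes(limb)
            })
            .collect(),
    )
}
```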
| 303 | + |
| 304 | +### Dispatcher Integration Points |
| 305 | + |
| 306 | +The "dispatcher" in RFC-0201 terminology maps to stoolap's query/serialization layer. Specifically: |
| 307 | + |
| 308 | +1. **`serialize_value`** (in `src/storage/mvcc/persistence.rs`) — currently has no DFP or BigInt arm. Add: |
| 309 | + ```rust |
| 310 | + Value::Dfp(dfp) => { buf.push(13); buf.extend_from_slice(&DfpEncoding::from_dfp(dfp).to_bytes()); } |
| 311 | + Value::BigInt(bigint) => { /* limb serialization */ } |
| 312 | + ``` |
| 313 | + |
| 314 | +2. **`deserialize_value`** — currently returns `Err` for unknown tags. Add deserialization arms for tags 13 (DFP) and 14 (BigInt). |
| 315 | + |
| 316 | +3. **`Value::from_typed`** and **`cast_to_type`** — add DFP and BigInt coercion paths. |
| 317 | + |
| 318 | +### NUMERIC_SPEC_VERSION |
| 319 | + |
| 320 | +Per the RFC-0201 Phase 1 item and RFC-0110 governance, bump `NUMERIC_SPEC_VERSION` to 2 once BigInt is implemented. This is a configuration constant in the serialization layer.
| 321 | + |
| 322 | +--- |
| 323 | + |
| 324 | +## Dependencies |
| 325 | + |
| 326 | +- **Mission A**: No external RFC dependencies. RFC-0127 (DCS Blob Amendment) is already Accepted and provides the wire format foundation. |
| 327 | +- **Mission B**: RFC-0104 (DFP wire format) and RFC-0110 (BigInt wire format) are both Accepted. |
| 328 | + |
| 329 | +--- |
| 330 | + |
| 331 | +## Verification |
| 332 | + |
| 333 | +After Mission A: |
| 334 | +- `cargo test` passes with new Blob tests |
| 335 | +- `cargo clippy --all-targets --all-features -- -D warnings` passes |
| 336 | +- `CREATE TABLE t(key_hash BYTEA(32))` parses without error |
| 337 | +- `SELECT * FROM t WHERE key_hash = $1` uses hash index |
| 338 | + |
| 339 | +After Mission B: |
| 340 | +- DFP and BigInt round-trip through serialize/deserialize |
| 341 | +- `NUMERIC_SPEC_VERSION = 2` after BigInt implementation |