fix(avro): append raw bytes #850

Merged
zeroshade merged 4 commits into apache:main from matan129:avro-reader-bytes-append
Jun 10, 2026
@matan129

Contributor

Rationale for this change

appendBinaryData has no case for []byte, so a plain bytes value falls into the fmt.Append fallback, which appends the value's string-formatted representation instead of the raw bytes.

schema := `{"type":"record","name":"r","fields":[{"name":"f","type":"bytes"}]}`
var buf bytes.Buffer
enc, _ := ocf.NewEncoder(schema, &buf)
enc.Encode(map[string]any{"f": []byte{0x00, 0x01, 0xfe, 0xff}})
enc.Close()

r, _ := avro.NewOCFReader(&buf, avro.WithChunk(-1))
r.Next()
fmt.Printf("%x\n", r.RecordBatch().Column(0).(*array.Binary).Value(0))
// got:  5b30203120323534203235355d   — the 13-byte text "[0 1 254 255]"
// want: 0001feff
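
The formatting behaviour behind the "got" line can be reproduced with the standard library alone. The sketch below is not the reader's code: formatFallback and appendRaw are hypothetical helper names, and only the fmt.Append call is taken from the description above.

```go
package main

import "fmt"

// formatFallback imitates the old default branch: fmt.Append renders a
// []byte with the %v verb, i.e. as a bracketed list of decimal numbers.
// (Hypothetical helper name, for illustration only.)
func formatFallback(data any) []byte {
	return fmt.Append([]byte{}, data)
}

// appendRaw imitates the fixed behaviour for []byte input: the payload is
// copied through unchanged. (Hypothetical helper name.)
func appendRaw(data []byte) []byte {
	return append([]byte(nil), data...)
}

func main() {
	payload := []byte{0x00, 0x01, 0xfe, 0xff}
	fmt.Printf("fallback: %x\n", formatFallback(payload)) // 5b30203120323534203235355d
	fmt.Printf("raw:      %x\n", appendRaw(payload))      // 0001feff
}
```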

What changes are included in this PR?

  • appendBinaryData: append []byte as is and append string as its raw bytes
  • appendStringData: handle []byte.
  • testdata.ByteArray.MarshalJSON: Remove fmt.Sprint(<bytes>) to match the new behaviour

Are these changes tested?

Yes. Added a regression test, TestOCFReaderBytesValues, covering plain bytes and ["null","bytes"] unions.

Are there any user-facing changes?

Yes. Avro bytes columns that previously decoded to corrupted text now decode to the actual payload.

…atted text

The OCF reader's appendBinaryData only handled nil and map[string]any
(multi-branch union) inputs; a bare []byte — what hamba yields for a plain
"bytes" field or a ["null","bytes"] union — fell into the default branch,
which appends fmt-formatted text (e.g. [1 2 254]) instead of the payload,
silently corrupting every bytes column. appendStringData had the same
fmt.Sprint fallback for []byte.

Handle []byte (and string, mirroring appendFixedSizeBinaryData) explicitly,
and fix the testdata golden marshaling that base64-encoded the formatted
text rather than the raw bytes, which had been masking the bug in
TestReader/ShouldLoadExpectedRecords.
@matan129 matan129 requested a review from zeroshade as a code owner June 10, 2026 13:02

@zeroshade zeroshade left a comment

Member

Took a closer look and ran the avro package locally against this branch: the new TestOCFReaderBytesValues fails without the reader_types.go change (returns the fmt-formatted text) and passes with it, and TestReader/ShouldLoadExpectedRecords stays green — nice catch on the golden-marshaling that was masking it. Core fix looks correct. A few optional, non-blocking follow-ups inline.

Comment on lines +597 to +598
case []byte:
b.Append(dt)

Member

These explicit []byte/string cases are the right fix. Forward-looking nit: the default: branch a few lines below in this function — b.Append(fmt.Append([]byte{}, data)) — is the exact silent fallback that produced the corrupted text this PR fixes. Now that nil/[]byte/string/map[string]any are all handled it's unreachable for hamba's real outputs, so this is a good opportunity to turn it (and the identical b.Append(fmt.Sprint(data)) default in appendStringData) into a hard error/panic on the unexpected %T. That way the next decoder/type mismatch fails loudly instead of silently corrupting a column. Non-blocking.

Contributor Author

Done in 6a7b49b. Went with an error return rather than a panic since the plumbing already exists: appendFunc returns error and loadDatum propagates it to OCFReader.Err(), same as the decimal appenders. Both appendBinaryData and appendStringData now error on any type outside what hamba produces, with a test covering the error paths.
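
A reduced sketch of the error-returning default discussed in this thread, under stated assumptions: binaryBuilder is a hypothetical stand-in for the Arrow binary builder, appendBinary is an illustrative name, and the union (map[string]any) branch is left out for brevity. It is not the code from 6a7b49b.

```go
package main

import "fmt"

// binaryBuilder is a hypothetical stand-in for an Arrow binary builder;
// it only records what was appended.
type binaryBuilder struct{ values [][]byte }

func (b *binaryBuilder) Append(v []byte) { b.values = append(b.values, v) }
func (b *binaryBuilder) AppendNull()     { b.values = append(b.values, nil) }
func (b *binaryBuilder) Len() int        { return len(b.values) }

// appendBinary handles the inputs it expects and returns an error for
// anything else, instead of appending fmt-formatted text.
func appendBinary(b *binaryBuilder, data any) error {
	switch dt := data.(type) {
	case nil:
		b.AppendNull()
	case []byte:
		b.Append(dt)
	default:
		// Nothing is appended on this path, so the builder length is unchanged.
		return fmt.Errorf("unexpected type %T for binary column", dt)
	}
	return nil
}

func main() {
	b := &binaryBuilder{}
	fmt.Println(appendBinary(b, []byte{0x00, 0xff})) // <nil>
	fmt.Println(appendBinary(b, "oops"))             // unexpected type string for binary column
	fmt.Println(b.Len())                             // 1
}
```

Returning an error rather than panicking lets the caller decide how to surface the failure, which in this PR is the existing appendFunc to loadDatum to OCFReader.Err() path.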

Comment thread arrow/avro/reader_types.go Outdated
Comment on lines +599 to +600
case string:
b.Append([]byte(dt))

Member

Optional: per hamba's decoder a bytes field (and a nullable ["null","bytes"] union) decodes into any as a bare []byte, never a Go string, so this case string is effectively dead on the normal decode path. It's harmless as defensive code, but note the appendFixedSizeBinaryData you're mirroring has no string case — for consistency you could drop it. Your call.

Contributor Author

Dropped it in 6a7b49b. With the default now returning an error, a string reaching a binary column fails loudly instead of being silently coerced, which seems preferable to keeping the dead case.

case string:
b.Append([]byte(dt))
case map[string]any:
switch ct := dt["bytes"].(type) {

Member

Pre-existing (not introduced by this PR) but right next to the change: the default of this dt["bytes"] switch does b.Append(ct.([]byte)), an unchecked type assertion that will panic if the union's bytes value isn't a []byte. appendFixedSizeBinaryData handles the same shape more defensively with a typed case []byte: b.Append(v) and lets anything else fall through. Might be worth mirroring that here while you're in the file. Optional.

Contributor Author

Done in 6a7b49b — the union branch now uses a typed case []byte like appendFixedSizeBinaryData, and anything else falls into the new error default rather than panicking.
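
The difference between the two patterns in this thread can be shown in isolation. In the sketch below all function names are hypothetical; it contrasts a bare type assertion with a typed case on a union value that is not a []byte.

```go
package main

import "fmt"

// uncheckedBytes mirrors the pre-existing pattern: a bare type assertion,
// which panics when the value is not a []byte.
func uncheckedBytes(v any) []byte {
	return v.([]byte)
}

// typedBytes mirrors the reviewed pattern: a typed case, with every other
// type reported as an error.
func typedBytes(v any) ([]byte, error) {
	switch ct := v.(type) {
	case []byte:
		return ct, nil
	default:
		return nil, fmt.Errorf("unexpected type %T", ct)
	}
}

// panics reports whether calling f panics.
func panics(f func()) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	f()
	return false
}

func main() {
	// A union value whose "bytes" member is, unexpectedly, a string.
	union := map[string]any{"bytes": "not bytes"}

	fmt.Println(panics(func() { uncheckedBytes(union["bytes"]) })) // true

	_, err := typedBytes(union["bytes"])
	fmt.Println(err) // unexpected type string
}
```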

Review follow-ups:
- appendBinaryData and appendStringData now return an error on types the
  hamba decoder never produces, instead of appending fmt-formatted text.
  The error reaches the caller through the existing appendFunc -> loadDatum
  -> OCFReader.Err() path.
- Drop the dead string case in appendBinaryData (hamba yields []byte for
  bytes values), matching appendFixedSizeBinaryData.
- Replace the unchecked ct.([]byte) assertion in the bytes-union branch
  with a typed case; non-[]byte union values now error instead of panic.

@zeroshade zeroshade left a comment

Member

Thanks for the quick turnaround on 6a7b49b — all three points from the previous review are addressed, and switching to an error return (matching the decimal appenders and surfacing via OCFReader.Err()) rather than a panic is the right call.

Resolved

  • The default fmt fallbacks in appendBinaryData/appendStringData now return unexpected type %T errors instead of silently formatting the value.
  • The dead case string in appendBinaryData was dropped.
  • The bytes-union branch now uses a typed case []byte instead of the unchecked .([]byte) assertion.

Verified locally (at 6a7b49b): go test ./arrow/avro/... -race and go vet pass, including the new TestAppendBinaryAndStringDataUnexpectedTypes (which also asserts the builder length is unchanged after an errored append, so no row drift) and the existing TestReader / TestOCFReaderBytesValues.

Two non-blocking notes:

  1. Optional / consistency: appendStringData's inner map[string]any switch on dt["string"] has no default, so an unexpected union value silently appends nothing — whereas the matching appendBinaryData inner switch now returns an error. Mirroring default: return fmt.Errorf(...) there would keep the two symmetric. Trivial; your call.

  2. Pre-existing (not introduced by this PR): loadDatum ignores the appendFunc error in the list-item / map-key / map-value paths (e.g. lines 179, 208, 230, 241, 243), so these new errors — like the existing decimal-appender errors — only surface for top-level and struct fields, not for bytes/string nested as a list item or map key/value. It's defensive-only in practice (the hamba decoder doesn't emit those Go types for those schemas), so I don't think it should block this fix; might be worth a separate issue to make loadDatum propagate appendFunc errors uniformly.

Overall LGTM — no blockers from me.

matan129 added 2 commits June 10, 2026 22:20
Mirror appendBinaryData: the inner dt["string"] union switch in
appendStringData now returns an error on non-string values instead of
silently appending nothing.

loadDatum dropped appendFunc errors on the list-item, map-key and
map-value paths and from recursive loadDatum calls, so appender errors
only surfaced for top-level and struct fields.
All call sites now propagate; ErrNullStructData is filtered since it
signals a skippable null struct, not a failure.
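
The propagate-but-filter idea in the second commit message can be sketched as follows. This is an illustration under assumptions, not loadDatum itself: errNullStruct stands in for ErrNullStructData, appendItem and loadList are hypothetical names, and treating a nil item as the skippable case is a simplification made only for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// errNullStruct is a hypothetical sentinel standing in for ErrNullStructData.
var errNullStruct = errors.New("null struct data")

// appendItem is a hypothetical appender: it accepts []byte, signals a
// skippable condition for nil, and errors on any other type.
func appendItem(out *[][]byte, v any) error {
	switch dt := v.(type) {
	case nil:
		return errNullStruct
	case []byte:
		*out = append(*out, dt)
		return nil
	default:
		return fmt.Errorf("unexpected type %T", dt)
	}
}

// loadList propagates appender errors from every item, filtering only the
// sentinel that marks a skippable value.
func loadList(out *[][]byte, items []any) error {
	for i, item := range items {
		err := appendItem(out, item)
		if err != nil && !errors.Is(err, errNullStruct) {
			return fmt.Errorf("list item %d: %w", i, err)
		}
	}
	return nil
}

func main() {
	var out [][]byte
	fmt.Println(loadList(&out, []any{[]byte{1}, nil, []byte{2}})) // <nil>
	fmt.Println(len(out))                                         // 2
	fmt.Println(loadList(&out, []any{42}))                        // list item 0: unexpected type int
}
```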
@matan129

Contributor Author

Thanks for the second pass.

  • Note 1 is done in b76328e: the dt["string"] union switch now errors on unexpected types, matching appendBinaryData, with test coverage.
  • For note 2, I went ahead and fixed it in this PR. loadDatum now propagates errors on all paths.

@zeroshade zeroshade merged commit ef130b9 into apache:main Jun 10, 2026
23 checks passed
2 participants