fix(avro): append raw bytes by matan129 · Pull Request #850 · apache/arrow-go

matan129 · 2026-06-10T13:02:53Z

Rationale for this change

appendBinaryData has no case for []byte, so a plain bytes value falls into thefmt.Append fallback, which results in a string-formated value.

schema := `{"type":"record","name":"r","fields":[{"name":"f","type":"bytes"}]}`
var buf bytes.Buffer
enc, _ := ocf.NewEncoder(schema, &buf)
enc.Encode(map[string]any{"f": []byte{0x00, 0x01, 0xfe, 0xff}})
enc.Close()

r, _ := avro.NewOCFReader(&buf, avro.WithChunk(-1))
r.Next()
fmt.Printf("%x\n", r.RecordBatch().Column(0).(*array.Binary).Value(0))
// got:  5b30203120323534203235355d   — the 13-byte text "[0 1 254 255]"
// want: 0001feff

What changes are included in this PR?

appendBinaryData: append []bytes as is & append string as its raw bytes
appendStringData: handle []byte.
testdata.ByteArray.MarshalJSON: Remove fmt.Sprint(<bytes>) to match the new behaviour

Are these changes tested?

Yes. Added regression test: TestOCFReaderBytesValues covering plain bytes and ["null","bytes"]

Are there any user-facing changes?

Yes. Avro bytes columns previously decoded to corrupted text now decode to the actual payload.

…atted text The OCF reader's appendBinaryData only handled nil and map[string]any (multi-branch union) inputs; a bare []byte — what hamba yields for a plain "bytes" field or a ["null","bytes"] union — fell into the default branch, which appends fmt-formatted text (e.g. [1 2 254]) instead of the payload, silently corrupting every bytes column. appendStringData had the same fmt.Sprint fallback for []byte. Handle []byte (and string, mirroring appendFixedSizeBinaryData) explicitly, and fix the testdata golden marshaling that base64-encoded the formatted text rather than the raw bytes, which had been masking the bug in TestReader/ShouldLoadExpectedRecords.

zeroshade

Took a closer look and ran the avro package locally against this branch: the new TestOCFReaderBytesValues fails without the reader_types.go change (returns the fmt-formatted text) and passes with it, and TestReader/ShouldLoadExpectedRecords stays green — nice catch on the golden-marshaling that was masking it. Core fix looks correct. A few optional, non-blocking follow-ups inline.

zeroshade · 2026-06-10T17:57:46Z

+	case []byte:
+		b.Append(dt)


These explicit []byte/string cases are the right fix. Forward-looking nit: the default: branch a few lines below in this function — b.Append(fmt.Append([]byte{}, data)) — is the exact silent fallback that produced the corrupted text this PR fixes. Now that nil/[]byte/string/map[string]any are all handled it's unreachable for hamba's real outputs, so this is a good opportunity to turn it (and the identical b.Append(fmt.Sprint(data)) default in appendStringData) into a hard error/panic on the unexpected %T. That way the next decoder/type mismatch fails loudly instead of silently corrupting a column. Non-blocking.

Done in 6a7b49b. Went with an error return rather than a panic since the plumbing already exists: appendFunc returns error and loadDatum propagates it to OCFReader.Err(), same as the decimal appenders. Both appendBinaryData and appendStringData now error on any type outside what hamba produces, with a test covering the error paths.

zeroshade · 2026-06-10T17:57:46Z

+	case string:
+		b.Append([]byte(dt))


Optional: per hamba's decoder a bytes field (and a nullable ["null","bytes"] union) decodes into any as a bare []byte, never a Go string, so this case string is effectively dead on the normal decode path. It's harmless as defensive code, but note the appendFixedSizeBinaryData you're mirroring has no string case — for consistency you could drop it. Your call.

Dropped it in 6a7b49b. With the default now returning an error, a string reaching a binary column fails loudly instead of being silently coerced, which seems preferable to keeping the dead case.

zeroshade · 2026-06-10T17:57:46Z

+	case string:
+		b.Append([]byte(dt))
 	case map[string]any:
 		switch ct := dt["bytes"].(type) {


Pre-existing (not introduced by this PR) but right next to the change: the default of this dt["bytes"] switch does b.Append(ct.([]byte)), an unchecked type assertion that will panic if the union's bytes value isn't a []byte. appendFixedSizeBinaryData handles the same shape more defensively with a typed case []byte: b.Append(v) and lets anything else fall through. Might be worth mirroring that here while you're in the file. Optional.

Done in 6a7b49b — the union branch now uses a typed case []byte like appendFixedSizeBinaryData, and anything else falls into the new error default rather than panicking.

Review follow-ups: - appendBinaryData and appendStringData now return an error on types the hamba decoder never produces, instead of appending fmt-formatted text. The error reaches the caller through the existing appendFunc -> loadDatum -> OCFReader.Err() path. - Drop the dead string case in appendBinaryData (hamba yields []byte for bytes values), matching appendFixedSizeBinaryData. - Replace the unchecked ct.([]byte) assertion in the bytes-union branch with a typed case; non-[]byte union values now error instead of panic.

zeroshade

Thanks for the quick turnaround on 6a7b49b — all three points from the previous review are addressed, and switching to an error return (matching the decimal appenders and surfacing via OCFReader.Err()) rather than a panic is the right call.

Resolved

The default fmt fallbacks in appendBinaryData/appendStringData now return unexpected type %T errors instead of silently formatting the value.
The dead case string in appendBinaryData was dropped.
The bytes-union branch now uses a typed case []byte instead of the unchecked .([]byte) assertion.

Verified locally (at 6a7b49b): go test ./arrow/avro/... -race and go vet pass, including the new TestAppendBinaryAndStringDataUnexpectedTypes (which also asserts the builder length is unchanged after an errored append, so no row drift) and the existing TestReader / TestOCFReaderBytesValues.

Two non-blocking notes:

Optional / consistency: appendStringData's inner map[string]any switch on dt["string"] has no default, so an unexpected union value silently appends nothing — whereas the matching appendBinaryData inner switch now returns an error. Mirroring default: return fmt.Errorf(...) there would keep the two symmetric. Trivial; your call.
Pre-existing (not introduced by this PR): loadDatum ignores the appendFunc error in the list-item / map-key / map-value paths (e.g. lines 179, 208, 230, 241, 243), so these new errors — like the existing decimal-appender errors — only surface for top-level and struct fields, not for bytes/string nested as a list item or map key/value. It's defensive-only in practice (the hamba decoder doesn't emit those Go types for those schemas), so I don't think it should block this fix; might be worth a separate issue to make loadDatum propagate appendFunc errors uniformly.

Overall LGTM — no blockers from me.

Mirror appendBinaryData: the inner dt["string"] union switch in appendStringData now returns an error on non-string values instead of silently appending nothing.

loadDatum dropped appendFunc errors on the list-item, map-key and map-value paths and from recursive loadDatum calls, so appender errors only surfaced for top-level and struct fields. All call sites now propagate; ErrNullStructData is filtered since it signals a skippable null struct, not a failure.

matan129 · 2026-06-10T19:31:58Z

Thanks for the second pass.

Note 1 is done in b76328e the dt["string"] union switch now errors on unexpected types, matching appendBinaryData, with test coverage
For note 2, I went ahead and fixed it in this PR. Now loadDatum propagate errors on all paths.

matan129 requested a review from zeroshade as a code owner June 10, 2026 13:02

zeroshade reviewed Jun 10, 2026

View reviewed changes

matan129 added 2 commits June 10, 2026 22:20

fix(avro): error on unexpected string-union values

b76328e

Mirror appendBinaryData: the inner dt["string"] union switch in appendStringData now returns an error on non-string values instead of silently appending nothing.

zeroshade approved these changes Jun 10, 2026

View reviewed changes

zeroshade merged commit ef130b9 into apache:main Jun 10, 2026
23 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(avro): append raw bytes#850

fix(avro): append raw bytes#850
zeroshade merged 4 commits into
apache:mainfrom
matan129:avro-reader-bytes-append

matan129 commented Jun 10, 2026

Uh oh!

zeroshade left a comment

Uh oh!

zeroshade Jun 10, 2026

Uh oh!

matan129 Jun 10, 2026

Uh oh!

zeroshade Jun 10, 2026

Uh oh!

matan129 Jun 10, 2026

Uh oh!

zeroshade Jun 10, 2026

Uh oh!

matan129 Jun 10, 2026

Uh oh!

zeroshade left a comment

Uh oh!

matan129 commented Jun 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

matan129 commented Jun 10, 2026

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

zeroshade left a comment

Choose a reason for hiding this comment

Uh oh!

zeroshade Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

matan129 Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

zeroshade Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

matan129 Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

zeroshade Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

matan129 Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

zeroshade left a comment

Choose a reason for hiding this comment

Uh oh!

matan129 commented Jun 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants