fix(avro): append raw bytes#850
Conversation
…atted text The OCF reader's appendBinaryData only handled nil and map[string]any (multi-branch union) inputs; a bare []byte — what hamba yields for a plain "bytes" field or a ["null","bytes"] union — fell into the default branch, which appends fmt-formatted text (e.g. [1 2 254]) instead of the payload, silently corrupting every bytes column. appendStringData had the same fmt.Sprint fallback for []byte. Handle []byte (and string, mirroring appendFixedSizeBinaryData) explicitly, and fix the testdata golden marshaling that base64-encoded the formatted text rather than the raw bytes, which had been masking the bug in TestReader/ShouldLoadExpectedRecords.
zeroshade
left a comment
There was a problem hiding this comment.
Took a closer look and ran the avro package locally against this branch: the new TestOCFReaderBytesValues fails without the reader_types.go change (returns the fmt-formatted text) and passes with it, and TestReader/ShouldLoadExpectedRecords stays green — nice catch on the golden-marshaling that was masking it. Core fix looks correct. A few optional, non-blocking follow-ups inline.
| case []byte: | ||
| b.Append(dt) |
There was a problem hiding this comment.
These explicit []byte/string cases are the right fix. Forward-looking nit: the default: branch a few lines below in this function — b.Append(fmt.Append([]byte{}, data)) — is the exact silent fallback that produced the corrupted text this PR fixes. Now that nil/[]byte/string/map[string]any are all handled it's unreachable for hamba's real outputs, so this is a good opportunity to turn it (and the identical b.Append(fmt.Sprint(data)) default in appendStringData) into a hard error/panic on the unexpected %T. That way the next decoder/type mismatch fails loudly instead of silently corrupting a column. Non-blocking.
There was a problem hiding this comment.
Done in 6a7b49b. Went with an error return rather than a panic since the plumbing already exists: appendFunc returns error and loadDatum propagates it to OCFReader.Err(), same as the decimal appenders. Both appendBinaryData and appendStringData now error on any type outside what hamba produces, with a test covering the error paths.
| case string: | ||
| b.Append([]byte(dt)) |
There was a problem hiding this comment.
Optional: per hamba's decoder a bytes field (and a nullable ["null","bytes"] union) decodes into any as a bare []byte, never a Go string, so this case string is effectively dead on the normal decode path. It's harmless as defensive code, but note the appendFixedSizeBinaryData you're mirroring has no string case — for consistency you could drop it. Your call.
There was a problem hiding this comment.
Dropped it in 6a7b49b. With the default now returning an error, a string reaching a binary column fails loudly instead of being silently coerced, which seems preferable to keeping the dead case.
| case string: | ||
| b.Append([]byte(dt)) | ||
| case map[string]any: | ||
| switch ct := dt["bytes"].(type) { |
There was a problem hiding this comment.
Pre-existing (not introduced by this PR) but right next to the change: the default of this dt["bytes"] switch does b.Append(ct.([]byte)), an unchecked type assertion that will panic if the union's bytes value isn't a []byte. appendFixedSizeBinaryData handles the same shape more defensively with a typed case []byte: b.Append(v) and lets anything else fall through. Might be worth mirroring that here while you're in the file. Optional.
There was a problem hiding this comment.
Done in 6a7b49b — the union branch now uses a typed case []byte like appendFixedSizeBinaryData, and anything else falls into the new error default rather than panicking.
Review follow-ups: - appendBinaryData and appendStringData now return an error on types the hamba decoder never produces, instead of appending fmt-formatted text. The error reaches the caller through the existing appendFunc -> loadDatum -> OCFReader.Err() path. - Drop the dead string case in appendBinaryData (hamba yields []byte for bytes values), matching appendFixedSizeBinaryData. - Replace the unchecked ct.([]byte) assertion in the bytes-union branch with a typed case; non-[]byte union values now error instead of panic.
zeroshade
left a comment
There was a problem hiding this comment.
Thanks for the quick turnaround on 6a7b49b — all three points from the previous review are addressed, and switching to an error return (matching the decimal appenders and surfacing via OCFReader.Err()) rather than a panic is the right call.
Resolved
- The
defaultfmtfallbacks inappendBinaryData/appendStringDatanow returnunexpected type %Terrors instead of silently formatting the value. - The dead
case stringinappendBinaryDatawas dropped. - The bytes-union branch now uses a typed
case []byteinstead of the unchecked.([]byte)assertion.
Verified locally (at 6a7b49b): go test ./arrow/avro/... -race and go vet pass, including the new TestAppendBinaryAndStringDataUnexpectedTypes (which also asserts the builder length is unchanged after an errored append, so no row drift) and the existing TestReader / TestOCFReaderBytesValues.
Two non-blocking notes:
-
Optional / consistency:
appendStringData's innermap[string]anyswitch ondt["string"]has nodefault, so an unexpected union value silently appends nothing — whereas the matchingappendBinaryDatainner switch now returns an error. Mirroringdefault: return fmt.Errorf(...)there would keep the two symmetric. Trivial; your call. -
Pre-existing (not introduced by this PR):
loadDatumignores theappendFuncerror in the list-item / map-key / map-value paths (e.g. lines 179, 208, 230, 241, 243), so these new errors — like the existing decimal-appender errors — only surface for top-level and struct fields, not forbytes/stringnested as a list item or map key/value. It's defensive-only in practice (the hamba decoder doesn't emit those Go types for those schemas), so I don't think it should block this fix; might be worth a separate issue to makeloadDatumpropagateappendFuncerrors uniformly.
Overall LGTM — no blockers from me.
Mirror appendBinaryData: the inner dt["string"] union switch in appendStringData now returns an error on non-string values instead of silently appending nothing.
loadDatum dropped appendFunc errors on the list-item, map-key and map-value paths and from recursive loadDatum calls, so appender errors only surfaced for top-level and struct fields. All call sites now propagate; ErrNullStructData is filtered since it signals a skippable null struct, not a failure.
|
Thanks for the second pass.
|
Rationale for this change
appendBinaryDatahas no case for[]byte, so a plain bytes value falls into thefmt.Appendfallback, which results in a string-formated value.What changes are included in this PR?
appendBinaryData: append[]bytesas is & appendstringas its raw bytesappendStringData: handle[]byte.testdata.ByteArray.MarshalJSON: Removefmt.Sprint(<bytes>)to match the new behaviourAre these changes tested?
Yes. Added regression test:
TestOCFReaderBytesValuescovering plain bytes and["null","bytes"]Are there any user-facing changes?
Yes. Avro bytes columns previously decoded to corrupted text now decode to the actual payload.