Skip to content

Conversation

@serramatutu
Copy link

@serramatutu serramatutu commented Oct 30, 2025

Which issue does this PR close?

This PR implements the new arrow.timestamp_with_offset canonical extension type for arrow-go.

Rationale for this change

Be compatible with Arrow spec.

What changes are included in this PR?

This commit adds a new TimestampWithOffset extension type. This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. The offset in minutes can be primitive encoded, dictionary encoded or run-end encoded.

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes, this is a new canonical extension type.

@felipecrv felipecrv self-requested a review November 1, 2025 01:55
felipecrv added a commit to apache/arrow that referenced this pull request Dec 5, 2025
…48002)

### Rationale for this change

Closes #44248

Arrow has no built-in canonical way of representing the `TIMESTAMP WITH
TIME ZONE` SQL type, which is present across multiple different database
systems. Not having a native way to represent this forces users to
either convert to UTC and drop the time zone, which may have correctness
implications, or use bespoke workarounds. A new
`arrow.timestamp_with_offset` extension type would introduce a standard
canonical way of representing that information.

Rust implementation: apache/arrow-rs#8743
Go implementation: apache/arrow-go#558

[DISCUSS] [thread in the mailing
list](https://lists.apache.org/thread/yhbr3rj9l59yoxv92o2s6dqlop16sfnk).

### What changes are included in this PR?

Proposal and documentation for `arrow.timestamp_with_offset` canonical
extension type.

### Are these changes tested?

N/A

### Are there any user-facing changes?

Yes, this is an extension to the arrow format.

* GitHub Issue: #44248

---------

Co-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
@serramatutu serramatutu force-pushed the serramatutu/TimestampWithOffset/go branch 3 times, most recently from 95230ad to b9b8bf2 Compare December 22, 2025 14:33
@serramatutu serramatutu changed the title [DRAFT] Add TimestampWithOffset extension type Add TimestampWithOffset extension type Dec 22, 2025
@serramatutu serramatutu marked this pull request as ready for review December 22, 2025 14:42
@serramatutu serramatutu changed the title Add TimestampWithOffset extension type Add TimestampWithOffset canonical extension type Dec 22, 2025
This commit adds a new `TimestampWithOffset` extension type that can be
used to represent timestamps with per-row timezone information. It
stores information in a `struct` with 2 fields, `timestamp=[T, "UTC"]`,
where `T` can be any `arrow.TimeUnit` and `offset_minutes=int16`, which
represents the offset in minutes from the UTC timestamp.
This commit allows `TimestampWithOffset` to be dict-encoded.

 - I made `NewTimestampWithOffsetType` take in an input
  `offsetType arrow.DataType`. It returns an error if the data type is not
  valid.

- I added a new infallible `NewTimestampWithOffsetTypePrimitiveEncoded`
  to make the encoding explicit.

- I added `NewTimestampWithOffsetTypeDictionaryEncoded` which returns an
  error in case the given type is not a valid dictionary key type.

- I made all tests run in a for loop with all possible allowed encoding
  types, ensuring all encodings work.
Smartly iterate over offsets if they're run-end encoded instead of doing
a binary search at every iteration.

This makes the loops O(n) instead of O(n*logn).
@serramatutu serramatutu force-pushed the serramatutu/TimestampWithOffset/go branch from b9b8bf2 to ccdd288 Compare December 22, 2025 15:17
Comment on lines +60 to +64
func isDataTypeCompatible(storageType arrow.DataType) (arrow.TimeUnit, arrow.DataType, bool) {
timeUnit := arrow.Second
offsetType := arrow.PrimitiveTypes.Int16
switch t := storageType.(type) {
case *arrow.StructType:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a perfect case that would make sense to use named returns:

Suggested change
func isDataTypeCompatible(storageType arrow.DataType) (arrow.TimeUnit, arrow.DataType, bool) {
timeUnit := arrow.Second
offsetType := arrow.PrimitiveTypes.Int16
switch t := storageType.(type) {
case *arrow.StructType:
func isDataTypeCompatible(storageType arrow.DataType) (unit arrow.TimeUnit, offsetType arrow.DataType, ok bool) {
st, compat := storageType.(*arrow.StructType)
if !compat || st.NumFields() != 2 {
return
}

etc...

Comment on lines +69 to +80
maybeTimestamp := t.Field(0)
maybeOffset := t.Field(1)

timestampOk := false
switch timestampType := maybeTimestamp.Type.(type) {
case *arrow.TimestampType:
if timestampType.TimeZone == "UTC" {
timestampOk = true
timeUnit = timestampType.TimeUnit()
}
default:
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can simplify this with something like:

    if ts, compat := t.Field(0).Type.(*arrow.TimestampType); compat && ts.TimeZone == "UTC" {        
        timeUnit = ts.TimeUnit()
    } else {
        return
    }

    maybeOffset := t.Field(1)
    offsetType = maybeOffset.Type
    ok = t.Field(0).Name == "timestamp" && !t.Field(0).Nullable &&
        maybeOffset.Name == "offset_minutes" && isOffsetTypeOk(offsetType) &&
        !maybeOffset.Nullable
    return

Comment on lines +128 to +130
// NewTimestampWithOffsetTypePrimitiveEncoded creates a new TimestampWithOffsetType with the underlying storage type set correctly to
// Struct(timestamp=Timestamp(T, "UTC"), offset_minutes=Int16), where T is any TimeUnit.
func NewTimestampWithOffsetTypePrimitiveEncoded(unit arrow.TimeUnit) *TimestampWithOffsetType {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my personal preference for naming would be the inverse of this. The default NewTimestampWithOffsetType should be this one, while the version that takes a custom offsetType should be NewTimestampWithOffsetTypeCustomOffset or something to that effect. The simplest/most common usage should have the shorter name.

// valid Dictionary index type.
//
// The error will be populated if the index is not a valid dictionary-encoding index type.
func NewTimestampWithOffsetTypeDictionaryEncoded(unit arrow.TimeUnit, index arrow.DataType) (*TimestampWithOffsetType, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another idea that might be interesting would be to use generics to avoid the need for the error output.

For example:

func NewTimestampWithOffsetTypeDictEncoded[T arrow.IntType | arrow.UintType](unit arrow.TimeUnit) *TimestampWithOffsetType {
    offsetType := &arrow.DictionaryType{
        IndexType: arrow.GetDataType[T](),
        ValueType: arrow.PrimitiveTypes.Int16,
    }

    ...

This would be used like NewTimestampWithOffsetTypeDictEncoded[int8](arrow.Second)

or maybe:

type IntegralType interface {
    *arrow.Int8Type | *arrow.Int16Type | *arrow.Int32Type | *arrow.Int64Type |
    *arrow.Uint8Type | *arrow.Uint16Type | *arrow.Uint32Type | *arrow.Uint64Type
}

func NewTimestampWithOffsetTypeDictEncoded[T IntegralType](unit arrow.TimeUnit, index T) *TimestampWithOffsetType {
    offsetType := &arrow.DictionaryType{
        IndexType: index,
        ValueType: arrow.PrimitiveTypes.Int16,
    }

    ...

This could be used like NewTimestampWithOffsetTypeDictEncoded(arrow.Second, arrow.PrimitiveTypes.Int8)

Leveraging the generics might make the usage simpler potentially by compile time enforcement of the proper types. This is just an idea, curious what you think

// valid run-ends type.
//
// The error will be populated if runEnds is not a valid run-end encoding run-ends type.
func NewTimestampWithOffsetTypeRunEndEncoded(unit arrow.TimeUnit, runEnds arrow.DataType) (*TimestampWithOffsetType, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the previous comment about potentially leveraging generics for simplification

Comment on lines +365 to +369
values := make([]interface{}, a.Len())
a.iterValues(func(i int, timestamp *time.Time) {
values[i] = timestamp
})
return json.Marshal(values)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
values := make([]interface{}, a.Len())
a.iterValues(func(i int, timestamp *time.Time) {
values[i] = timestamp
})
return json.Marshal(values)
return json.Marshal(a.Values())

Comment on lines +495 to +496
timestamps.UnsafeAppendBoolToBitmap(false)
offsets.UnsafeAppendBoolToBitmap(false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are non-nullable I thought? We should be pushing default 0,0 in the case where !valids[i]

Comment on lines +506 to +507
timestamps.UnsafeAppendBoolToBitmap(false)
offsets.UnsafeAppendBoolToBitmap(false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above, these are non-nullable according to the spec.

Comment on lines +525 to +526
timestamps.UnsafeAppendBoolToBitmap(false)
offsets.UnsafeAppendBoolToBitmap(false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above. these are non-nullable according to the spec

Comment on lines +39 to +42
var testZone1 = time.FixedZone("UTC-08:00", -8*60*60)
var testDate2 = testDate1.In(testZone1)

var testZone2 = time.FixedZone("UTC+06:00", +6*60*60)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sure there's a test with a zone that isn't evenly divisible by 60, perhaps a zone with an offset of 5:30 or something to that effect.

Also, make sure we're testing a zone with an offset greater than 10 hours

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants