Implement support for Google Spanner #271
Conversation
client-base/src/main/kotlin/app/cash/backfila/client/RealBackfillModule.kt
...-misk-spanner/src/main/kotlin/app/cash/backfila/client/misk/spanner/SpannerBackfillModule.kt
client-misk-spanner/src/main/kotlin/app/cash/backfila/client/misk/spanner/SpannerBackfill.kt
...isk-spanner/src/main/kotlin/app/cash/backfila/client/misk/spanner/internal/SpannerBackend.kt
...isk-spanner/src/main/kotlin/app/cash/backfila/client/misk/spanner/internal/SpannerBackend.kt
...er/src/main/kotlin/app/cash/backfila/client/misk/spanner/internal/SpannerBackfillOperator.kt
...nt-misk-spanner/src/test/kotlin/app/cash/backfila/client/misk/spanner/SpannerBackfillTest.kt
val partitions = listOf(
  PrepareBackfillResponse.Partition.Builder()
    .backfill_range(request.range)
Are you requiring the range to be passed in? In other implementations we compute the ranges if you don't pass one in.
The range is actually completely ignored. Spanner is unlike many other DBs: for optimal performance, primary keys really can't be anything like a monotonically increasing range. I don't know how to compute a range without doing a full table scan, which seems... suboptimal.
You can't ask for the min/max primary key value?
Primary keys are often random values like UUIDs, left unordered for optimal performance. Min/max aren't valid concepts, as far as I can tell. Source: https://cloud.google.com/spanner/docs/schema-design#primary-key-prevent-hotspots
Backfila requires ordered key values to operate. I'm curious how you would use it if that's not the case. I haven't used Spanner, but my understanding was that it's ordered; you just want to avoid sequential writes.
And to answer the original question: we don't require a range to be passed in. That's optional.
Yeah, I'm well aware of how primary key design works in Spanner, and you can have items added within the range. That's true even with auto-increment, technically. It doesn't matter, since the expectation is that you're inserting new items that don't need backfilling.
It sounds like you're able to just ask Spanner for records and it will return them in some order; that should be fine, I guess.
Wouldn't this work like Dynamo backfills? Dynamo is somewhat different, but it has a scan mechanism we use, and I believe we don't do ranges on it either? You could check that.
Does this mean that you will essentially run your backfill single threaded?
So there must be some distributed way to process the whole data set in bulk? In Dynamo, that's the idea behind segments.
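For context, the Dynamo segment idea boils down to hashing each primary key into one of N disjoint buckets, so an unordered key space can still be split across parallel workers. A minimal sketch (function names are mine, not from this PR):

```kotlin
// Hypothetical sketch of Dynamo-style segmenting: assign each row to one of
// `segmentCount` buckets by hashing its primary key. Workers can then scan
// disjoint slices of an unordered key space in parallel.
fun segmentFor(primaryKey: String, segmentCount: Int): Int {
    // Math.floorMod keeps the result non-negative even for negative hash codes.
    return Math.floorMod(primaryKey.hashCode(), segmentCount)
}

// Group a set of keys by segment; every key lands in exactly one bucket.
fun partitionKeys(keys: List<String>, segmentCount: Int): Map<Int, List<String>> =
    keys.groupBy { segmentFor(it, segmentCount) }
```

Each segment maps naturally onto a Backfila partition, which is how the DynamoDB backend sidesteps the need for an ordered key range.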
Force-pushed from 87ad6fd to aaaebc0
Force-pushed from aaaebc0 to 1a6ad72
...er/src/main/kotlin/app/cash/backfila/client/misk/spanner/internal/SpannerBackfillOperator.kt
override fun getNextBatchRange(request: GetNextBatchRangeRequest): GetNextBatchRangeResponse {
  // Establish a range to scan - either we want to start at the first key,
  // or start from (and exclude) the last key that was scanned.
  val range = if (request.previous_end_key == null) {
I guess we're not using the backfill_range at all; that's what would be passed in by the user (or I missed it somewhere).
Yes. If I'm not mistaken, the DynamoDB backend also ignores it.
DynamoDB is pretty limited because of Dynamo itself; the Hibernate one is pretty good to copy from. Obviously, build whatever features you want, I won't be using it :P
You need some guarantees around the end key, otherwise you may be missing items, no? This was tricky with DynamoDB as well. We figured out some optimizations, but since they weren't really documented we didn't add them to the client. In Dynamo we split up by segment, but then don't complete the "batch" until the range is completed. Maybe Google has better guarantees?
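The exclusive-start scanning in the hunk above can be illustrated with plain collections. This sketch (names hypothetical, not the operator's actual code) is only correct if the store returns keys in a stable order across calls, which is exactly the guarantee this thread is asking about:

```kotlin
// Hypothetical sketch of exclusive-start batching: given keys in a stable
// sort order, take the next `batchSize` keys strictly after `previousEndKey`.
// If the store's ordering can change between calls, rows can be skipped,
// which is the risk raised in the review comment above.
fun nextBatch(sortedKeys: List<String>, previousEndKey: String?, batchSize: Int): List<String> {
    val start = if (previousEndKey == null) {
        0 // First call: start at the very first key.
    } else {
        // Resume strictly after the last key scanned; -1 means we're past the end.
        sortedKeys.indexOfFirst { it > previousEndKey }
            .let { if (it == -1) sortedKeys.size else it }
    }
    return sortedKeys.subList(start, minOf(start + batchSize, sortedKeys.size))
}
```

An empty result signals that the scan is complete, which is how the caller would know to stop issuing batch-range requests.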
escardin left a comment
Since this isn't urgent, I'll just comment.
I think this is a good start, but we'd want to make sure as much of Backfila as possible works as expected. Having a single partition is okay, but it could be somewhat challenging to scale. Upper and lower bounds for ranges are a reasonable tradeoff.
mpawliszyn left a comment
Overall looking very good. Let's avoid misk except in test.
I wonder if you can use this to be more parallel?
https://cloud.google.com/spanner/docs/reference/rpc/google.spanner.v1#google.spanner.v1.Spanner.PartitionRead
Can you share a session among different machines? Your backfill might die if the session dies though.
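For reference, the PartitionRead RPC linked above surfaces in the Java Spanner client as `BatchClient` and `BatchReadOnlyTransaction`; partition tokens from one batch transaction can be executed on different workers against the same snapshot. A rough sketch with a hypothetical table and columns, not code from this PR (and untested here, since it needs a live Spanner instance):

```kotlin
// Sketch only: uses the com.google.cloud.spanner Java client, whose batch-read
// API wraps the PartitionRead RPC. "users", "id", and "email" are made-up names.
import com.google.cloud.spanner.DatabaseId
import com.google.cloud.spanner.KeySet
import com.google.cloud.spanner.PartitionOptions
import com.google.cloud.spanner.Spanner
import com.google.cloud.spanner.TimestampBound

fun scanInPartitions(spanner: Spanner, db: DatabaseId) {
  val batchClient = spanner.getBatchClient(db)
  // A batch read-only transaction pins a snapshot timestamp, so every
  // partition observes the same state of the table.
  batchClient.batchReadOnlyTransaction(TimestampBound.strong()).use { txn ->
    val partitions = txn.partitionRead(
      PartitionOptions.getDefaultInstance(),
      "users",                 // hypothetical table
      KeySet.all(),
      listOf("id", "email"),   // hypothetical columns
    )
    // Each partition token is independent; they could be fanned out to
    // separate workers instead of the local loop below.
    for (partition in partitions) {
      txn.execute(partition).use { rows ->
        while (rows.next()) {
          // Process rows.getString("id"), rows.getString("email"), ...
        }
      }
    }
  }
}
```

This maps fairly directly onto Backfila's partition model, though as the comment notes, the batch transaction (and its session) has to outlive the whole backfill for the tokens to remain valid.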
// We do not want to leak client-base implementation details to customers.
implementation(project(":client-base"))

implementation(Dependencies.misk)
Can we limit our use of misk at least in non-test? Do we really need it?
Looking through your code, I think these only need to be testImplementation dependencies. Let's move those dependencies to test, rename the module, and add a comment so they don't leak into the main implementation.
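In the Gradle Kotlin DSL, that change would look roughly like this (a sketch; `Dependencies.miskTesting` is a guess at this repo's dependency catalog and may need adjusting):

```kotlin
dependencies {
  implementation(project(":client-base"))

  // Misk is only needed to stand up a service in tests, so keep it in the
  // test configuration where it can't leak to the main implementation.
  testImplementation(Dependencies.misk)
  testImplementation(Dependencies.miskTesting) // assumed catalog entry
}
```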
val partitions = listOf(
  PrepareBackfillResponse.Partition.Builder()
    .backfill_range(request.range)
    .partition_name("partition")
I'd prefer something like "single" or "only". This is exposed to the customer.
These changes add a new backend for backfilling Spanner databases integrated into Misk services.
I'm still adding unit tests to show that it all works, but I figured I'd put it up for some early review and to surface CI issues.