Issue 6485: Partition data when generating sketches before bulk import #6501

rtjd6554 · 2026-01-28T10:52:30Z

Make sure you have checked all steps below.

Issue

My PR fully resolves the following issues. I've referenced an issue in the PR title, for example "Issue 1234 - My
Feature". Note that before an issue is finished, you can still make a pull request by raising a separate issue
for your progress.
- Resolves Partition data when generating sketches before bulk import #6485

Tests

My PR adds the following tests based on our test strategy OR does not need testing for this extremely good reason:
- New tests within file: SleeperBuilderTest

Documentation

In case of new functionality, my PR adds documentation that describes how to use it, or I have linked to a
separate issue for that below.
If I have added new Java code, I have added Javadoc that explains it following our conventions and style.
If I have added or removed any dependencies from the project, I have updated the NOTICES file.

java/common/sketches/src/test/java/sleeper/sketches/SketchesUnionBuilderTest.java

java/common/sketches/src/main/java/sleeper/sketches/SketchesUnionBuilder.java

...t/bulk-import-runner/src/main/java/sleeper/bulkimport/runner/sketches/SketchingIterator.java

...ort/bulk-import-runner/src/main/java/sleeper/bulkimport/runner/sketches/SketchesBuilder.java

...import/bulk-import-runner/src/main/java/sleeper/bulkimport/runner/common/SparkSketchRow.java

...rt/bulk-import-runner/src/main/java/sleeper/bulkimport/runner/sketches/GenerateSketches.java

...ulk-import-runner/src/main/java/sleeper/bulkimport/runner/sketches/SketchWritingterator.java

java/partitions/splitter/src/main/java/sleeper/splitter/core/sketches/SketchesForSplitting.java

patchwork01 · 2026-01-30T13:52:45Z

...import/bulk-import-runner/src/main/java/sleeper/bulkimport/runner/common/SparkSketchRow.java


 /**
- * A reference to a sketch file written during a Spark job. Used to calculate split points when pre-splitting
+ * A reference to a sketch created during a Spark job. Used to calculate split points when pre-splitting


It's not a reference; it's a Spark row containing a sketch.

patchwork01 · 2026-01-30T13:54:56Z

...rt/bulk-import-runner/src/main/java/sleeper/bulkimport/runner/sketches/GenerateSketches.java

 /**
- * Generates a sketch of all input data, and outputs a single row per partition referencing a file that contains that
- * sketch.
+ * Generates a sketch of all input data, and outputs a single row per partition that contains that a sketch.


There's a typo at "that a sketch". I think it should just be "that sketch".

patchwork01 · 2026-01-30T14:00:29Z

java/common/sketches/src/main/java/sleeper/sketches/SketchesUnionBuilder.java

+    }
+
+    /**
+     * Updates existing unions with any new sketches provide that match on the key field.


I'm not sure it's clear what it means to "match on the key field". How about this:

"Adds sketches into the union. The sketches must include an item sketch for each row key in the Sleeper table schema."

patchwork01 · 2026-01-30T14:01:04Z

java/common/sketches/src/main/java/sleeper/sketches/SketchesUnionBuilder.java

+     * Updates existing unions with any new sketches provide that match on the key field.
+     *
+     * @param sketches sketches to add
+     *


There's an extra blank line here.

patchwork01 · 2026-01-30T14:02:40Z

java/common/sketches/src/main/java/sleeper/sketches/SketchesUnionBuilder.java

+    }
+
+    /**
+     * Creates sketches object from the mapped unions for use a single reference point.


I don't know what is meant by "mapped", or by "for use a single reference point" here. How about this:

"Gathers the results of the union into a sketches object."

patchwork01 · 2026-01-30T14:02:49Z

java/common/sketches/src/main/java/sleeper/sketches/SketchesUnionBuilder.java

+    /**
+     * Updates existing unions with any new sketches provide that match on the key field.
+     *
+     * @param sketches sketches to add


Is this missing a "the"?

patchwork01 · 2026-01-30T14:02:53Z

java/common/sketches/src/main/java/sleeper/sketches/SketchesUnionBuilder.java

+    /**
+     * Creates sketches object from the mapped unions for use a single reference point.
+     *
+     * @return sketches


Is this missing a "the"?

patchwork01 · 2026-01-30T14:05:28Z

java/common/sketches/src/main/java/sleeper/sketches/SketchesUnionBuilder.java

+import static java.util.stream.Collectors.toMap;
+
+/**
+ * Creates union of sketches.


This seems to suggest that this might build a SketchesUnion object, which does not exist. It creates sketches, but it does that by first creating a union of a number of other sketches. It builds the sketches from the results of the union. How about this:

"Creates sketches from a union of a number of other sketches."
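
To make the suggested wording concrete, here is a minimal sketch of that shape: a builder that accumulates a union per row key field and then gathers the results into a sketches object. It is an illustration under stated assumptions, not the code in this PR. `FieldSummary`, `SimpleSketches` and `SimpleSketchesUnionBuilder` are hypothetical names, and the count/min/max summary is a simplified stand-in for the real item sketches.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for an item sketch: tracks only count, min and max. */
record FieldSummary(long count, String min, String max) {
    static FieldSummary of(String value) {
        return new FieldSummary(1, value, value);
    }

    /** Combines two summaries, which is the property a union relies on. */
    FieldSummary merge(FieldSummary other) {
        String newMin = min.compareTo(other.min) <= 0 ? min : other.min;
        String newMax = max.compareTo(other.max) >= 0 ? max : other.max;
        return new FieldSummary(count + other.count, newMin, newMax);
    }
}

/** Hypothetical stand-in for a sketches object: one summary per row key field. */
record SimpleSketches(Map<String, FieldSummary> byRowKeyField) {
}

/** Creates sketches from a union of a number of other sketches. */
class SimpleSketchesUnionBuilder {
    private final List<String> rowKeyFields;
    private final Map<String, FieldSummary> unions = new LinkedHashMap<>();

    SimpleSketchesUnionBuilder(List<String> rowKeyFields) {
        this.rowKeyFields = List.copyOf(rowKeyFields);
    }

    /** Adds sketches into the union. They must include a summary for each row key. */
    void add(SimpleSketches sketches) {
        // Validate first, so a bad input never leaves the union half-updated.
        for (String field : rowKeyFields) {
            if (!sketches.byRowKeyField().containsKey(field)) {
                throw new IllegalArgumentException("No sketch for row key field: " + field);
            }
        }
        for (String field : rowKeyFields) {
            unions.merge(field, sketches.byRowKeyField().get(field), FieldSummary::merge);
        }
    }

    /** Gathers the results of the union into a sketches object. */
    SimpleSketches build() {
        return new SimpleSketches(Map.copyOf(unions));
    }
}
```

The builder has no "union" type of its own to return, which matches the point above: the only output is a sketches object.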

patchwork01 · 2026-01-30T14:09:59Z

java/common/sketches/src/test/java/sleeper/sketches/SketchesUnionBuilderTest.java

+    InstanceProperties instanceProperties = createTestInstanceProperties();
+
+    @Test
+    void shouldUnionTwoSketchFilesTogether() {


This test isn't doing what it says. It only adds one Sketches object to the union, and it doesn't deal with files.

Maybe we should have a test that only adds one Sketches object, and a test that adds two?
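
As a rough illustration of that split, the two cases could look something like the following. This is only a sketch under assumptions: `UnionExampleChecks`, `union` and `check` are hypothetical, a "sketches object" is reduced to a row count per row key field, and plain static methods stand in for the JUnit tests so the example runs on its own.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class UnionExampleChecks {

    // Hypothetical stand-in for the builder: sums the per-field counts of each input.
    static Map<String, Long> union(List<Map<String, Long>> sketchesList) {
        Map<String, Long> result = new HashMap<>();
        for (Map<String, Long> sketches : sketchesList) {
            sketches.forEach((field, count) -> result.merge(field, count, Long::sum));
        }
        return result;
    }

    // Case 1: a single sketches object should come back unchanged.
    static void shouldBuildSketchesFromOneSketchesObject() {
        Map<String, Long> result = union(List.of(Map.of("key", 3L)));
        check(result.equals(Map.of("key", 3L)), "one input should pass through unchanged");
    }

    // Case 2: two sketches objects should be combined into one result.
    static void shouldUnionTwoSketchesObjects() {
        Map<String, Long> result = union(List.of(Map.of("key", 3L), Map.of("key", 2L)));
        check(result.equals(Map.of("key", 5L)), "two inputs should be combined");
    }

    // Minimal assertion helper, used instead of a test framework.
    static void check(boolean condition, String message) {
        if (!condition) {
            throw new AssertionError(message);
        }
    }
}
```

Each method name states exactly what its body does, and neither mentions files.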

patchwork01 · 2026-01-30T14:14:01Z

java/common/sketches/src/test/java/sleeper/sketches/SketchesUnionBuilderTest.java

+    }
+
+    @Test
+    void shouldUnionSketchesWithDifferentKeys() {


This test isn't doing what it says, since it only creates one Sketches object. It also shouldn't be possible to have a Sleeper row that doesn't have values for all the row keys. In practice this would be refused during ingest, because it doesn't match the schema.

The test seems a bit confusing and unnecessary. Maybe we should remove it?

rtjd6554 and others added 18 commits January 28, 2026 10:50

6485: Initial commit with new reference class

6a0ae50

6485: Update name

85ad63b

6485 Rename SparkSketchBytesRow

170637b

6485 Test harness for SketchWritingIterator

b9cdd24

6485: Correction for object interaction within BytesRow

85d2d0d

6485: New ByteIteratorTestClass and functionality

3ffb577

6485 Write sketches as byte array

c749554

6485: partial new test

b939a68

6485 Handle multiple partitions

13c3166

6485 Adjust log message

64090ed

6485: Refactor splitOnField method declaration

2def9a9

6485 Undo moving splitOnField to interface

03735cf

6485 Use new iterator in GenerateSketchesDriver

28e0c5c

6485: Remove un-needed files

9f8eaca

Merge branch 'develop' into 6485-Sketches-Before-Bulk

24e7162

6485: Checkstyle update

0f69388

6485: Update with new tests

09298cf

6485: Rename classes to simplfy

9e2a89b

rtjd6554 marked this pull request as ready for review January 30, 2026 10:15

rtjd6554 added the needs-reviewer label ("Pull requests that need a reviewer to be assigned") Jan 30, 2026

patchwork01 reviewed Jan 30, 2026

patchwork01 removed the needs-reviewer label ("Pull requests that need a reviewer to be assigned") Jan 30, 2026

patchwork01 assigned rtjd6554 Jan 30, 2026

rtjd6554 added 7 commits January 30, 2026 11:09

6485: Undo functional interface removal

a552cb8

6485: Rename class

f646c13

6485: Remove usages of file, replaced wtih sketch

0cd3617

6485: Update javadoc

6179912

6485: Relocate and rename sketchesBuilder

a04a5e1

6485: Update reference

d65fbd8

6485: Update tests and javadoc

cbcff63

rtjd6554 and others added 2 commits January 30, 2026 12:24

6485: Re-add missing javadoc

8d9d856

Merge branch 'develop' into 6485-Sketches-Before-Bulk

5b48793

patchwork01 reviewed Jan 30, 2026

3 participants
