Issue 6485: Partition data when generating sketches before bulk import #6501
base: develop
java/common/sketches/src/test/java/sleeper/sketches/SketchesUnionBuilderTest.java
java/common/sketches/src/main/java/sleeper/sketches/SketchesUnionBuilder.java
...t/bulk-import-runner/src/main/java/sleeper/bulkimport/runner/sketches/SketchingIterator.java
...ort/bulk-import-runner/src/main/java/sleeper/bulkimport/runner/sketches/SketchesBuilder.java
...import/bulk-import-runner/src/main/java/sleeper/bulkimport/runner/common/SparkSketchRow.java
...rt/bulk-import-runner/src/main/java/sleeper/bulkimport/runner/sketches/GenerateSketches.java
...ulk-import-runner/src/main/java/sleeper/bulkimport/runner/sketches/SketchWritingterator.java
java/partitions/splitter/src/main/java/sleeper/splitter/core/sketches/SketchesForSplitting.java
/**
 * A reference to a sketch file written during a Spark job. Used to calculate split points when pre-splitting
 * A reference to a sketch created during a Spark job. Used to calculate split points when pre-splitting
It's not a reference, it's a Spark row containing a sketch.
/**
 * Generates a sketch of all input data, and outputs a single row per partition referencing a file that contains that
 * sketch.
 * Generates a sketch of all input data, and outputs a single row per partition that contains that a sketch.
There's a typo at "that a sketch". I think it should just be "that sketch".
}

/**
 * Updates existing unions with any new sketches provide that match on the key field.
I'm not sure it's clear what it means to "match on the key field". How about this:
"Adds sketches into the union. The sketches must include an item sketch for each row key in the Sleeper table schema."
 * Updates existing unions with any new sketches provide that match on the key field.
 *
 * @param sketches sketches to add
 *
There's an extra blank line here.
}

/**
 * Creates sketches object from the mapped unions for use a single reference point.
I don't know what is meant by "mapped", or by "for use a single reference point" here. How about this:
"Gathers the results of the union into a sketches object."
/**
 * Updates existing unions with any new sketches provide that match on the key field.
 *
 * @param sketches sketches to add
Is this missing a "the"?
/**
 * Creates sketches object from the mapped unions for use a single reference point.
 *
 * @return sketches
Is this missing a "the"?
import static java.util.stream.Collectors.toMap;

/**
 * Creates union of sketches.
This seems to suggest that this might build a SketchesUnion object, which does not exist. It creates sketches, but it does that by first creating a union of a number of other sketches. It builds the sketches from the results of the union. How about this:
"Creates sketches from a union of a number of other sketches."
InstanceProperties instanceProperties = createTestInstanceProperties();

@Test
void shouldUnionTwoSketchFilesTogether() {
This test isn't doing what it says. It only adds one Sketches object to the union, and it doesn't deal with files.
Maybe we should have a test that only adds one Sketches object, and a test that adds two?
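To make the suggested split concrete, here is a rough sketch of the two cases in plain Java. `ToySketches`, `ToyUnionBuilder` and `ToyUnionBuilderChecks` are hypothetical stand-ins invented for illustration, not the classes in this PR, and the toy "sketch" simply stores its values rather than summarising them the way a real sketch would.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a Sketches object: it simply holds the values it has seen.
final class ToySketches {
    final List<String> values;

    ToySketches(List<String> values) {
        this.values = List.copyOf(values);
    }
}

// Hypothetical stand-in for the builder under test: unions inputs, then builds a result.
final class ToyUnionBuilder {
    private final List<String> union = new ArrayList<>();

    void add(ToySketches sketches) {
        union.addAll(sketches.values);
    }

    ToySketches build() {
        List<String> sorted = new ArrayList<>(union);
        Collections.sort(sorted);
        return new ToySketches(sorted);
    }
}

final class ToyUnionBuilderChecks {
    // Case 1: only one Sketches object is added.
    static void shouldBuildFromOneSketchesObject() {
        ToyUnionBuilder builder = new ToyUnionBuilder();
        builder.add(new ToySketches(List.of("b", "a")));
        expectValues(builder.build(), List.of("a", "b"));
    }

    // Case 2: two Sketches objects are added and combined into one result.
    static void shouldUnionTwoSketchesObjects() {
        ToyUnionBuilder builder = new ToyUnionBuilder();
        builder.add(new ToySketches(List.of("a", "c")));
        builder.add(new ToySketches(List.of("b", "d")));
        expectValues(builder.build(), List.of("a", "b", "c", "d"));
    }

    static void expectValues(ToySketches actual, List<String> expected) {
        if (!actual.values.equals(expected)) {
            throw new AssertionError("Expected " + expected + " but was " + actual.values);
        }
    }

    public static void main(String[] args) {
        shouldBuildFromOneSketchesObject();
        shouldUnionTwoSketchesObjects();
    }
}
```

In the real test class these would presumably be two `@Test` methods, each named for the number of Sketches objects it actually adds.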
}

@Test
void shouldUnionSketchesWithDifferentKeys() {
This test isn't doing what it says, since it only creates one Sketches object. It also shouldn't be possible to have a Sleeper row that doesn't have values for all the row keys. In practice this would be refused during ingest, because it doesn't match the schema.
The test seems a bit confusing and unnecessary. Maybe we should remove it?
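For context on why such a row should never reach the union, the kind of check meant here can be sketched as below. This is an assumption-laden illustration, not Sleeper's ingest code: `RowKeyCheck` and `missingRowKeys` are hypothetical names, and a row is modelled as a plain map from field name to value.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Sleeper: finds row key fields with no value in a row.
final class RowKeyCheck {
    static List<String> missingRowKeys(List<String> rowKeyFields, Map<String, Object> row) {
        return rowKeyFields.stream()
                .filter(field -> row.get(field) == null)
                .collect(Collectors.toList());
    }
}
```

A row for which this returns a non-empty list does not match the schema, so under this reading it would be refused before any sketch is built from it.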
Make sure you have checked all steps below.
Issue
Feature". Note that before an issue is finished, you can still make a pull request by raising a separate issue
for your progress.
Tests
Documentation
separate issue for that below.