Skip to content

Conversation

@rtjd6554
Copy link
Collaborator

@rtjd6554 rtjd6554 commented Jan 28, 2026

Make sure you have checked all steps below.

Issue

Tests

  • My PR adds the following tests based on our test strategy OR does not need testing for this extremely good reason:
    • New tests within file: SleeperBuilderTest

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it, or I have linked to a
    separate issue for that below.
  • If I have added new Java code, I have added Javadoc that explains it following our conventions and style.
  • If I have added or removed any dependencies from the project, I have updated the NOTICES file.

@rtjd6554 rtjd6554 marked this pull request as ready for review January 30, 2026 10:15
@rtjd6554 rtjd6554 added the needs-reviewer Pull requests that need a reviewer to be assigned label Jan 30, 2026
@patchwork01 patchwork01 removed the needs-reviewer Pull requests that need a reviewer to be assigned label Jan 30, 2026

/**
* A reference to a sketch file written during a Spark job. Used to calculate split points when pre-splitting
* A reference to a sketch created during a Spark job. Used to calculate split points when pre-splitting
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a reference, it's a Spark row containing a sketch.

/**
* Generates a sketch of all input data, and outputs a single row per partition referencing a file that contains that
* sketch.
* Generates a sketch of all input data, and outputs a single row per partition that contains that a sketch.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a typo at "that a sketch". I think it should just be "that sketch".

}

/**
* Updates existing unions with any new sketches provide that match on the key field.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it's clear what it means to "match on the key field". How about this:

"Adds sketches into the union. The sketches must include an item sketch for each row key in the Sleeper table schema."

* Updates existing unions with any new sketches provide that match on the key field.
*
* @param sketches sketches to add
*
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's an extra blank line here.

}

/**
* Creates sketches object from the mapped unions for use a single reference point.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what is meant by "mapped", or by "for use a single reference point" here. How about this:

"Gathers the results of the union into a sketches object."

/**
* Updates existing unions with any new sketches provide that match on the key field.
*
* @param sketches sketches to add
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this missing a "the"?

/**
* Creates sketches object from the mapped unions for use a single reference point.
*
* @return sketches
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this missing a "the"?

import static java.util.stream.Collectors.toMap;

/**
* Creates union of sketches.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to suggest that this might build a SketchesUnion object, which does not exist. It creates sketches, but it does that by first creating a union of a number of other sketches. It builds the sketches from the results of the union. How about this:

"Creates sketches from a union of a number of other sketches."

InstanceProperties instanceProperties = createTestInstanceProperties();

@Test
void shouldUnionTwoSketchFilesTogether() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test isn't doing what it says. It only adds one Sketches object to the union, and it doesn't deal with files.

Maybe we should have a test that only adds one Sketches object, and a test that adds two?

}

@Test
void shouldUnionSketchesWithDifferentKeys() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test isn't doing what it says, since it only creates one Sketches object. It also shouldn't be possible to have a Sleeper row that doesn't have values for all the row keys. In practice this would be refused during ingest, because it doesn't match the schema.

The test seems a bit confusing and unnecessary, maybe we should remove it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Partition data when generating sketches before bulk import

3 participants