Pass bulk import job to Spark with table ID instead of name #6502

@patchwork01

Description

User Story

As a user of Sleeper, once a job is submitted I want the system to remember which table the job is for by its ID, so that when I rename the table the job still runs against that table.

Description / Background

Under epic:

When a bulk import job is received in the starter lambda, the lambda looks up the Sleeper table by the table name. When it submits the job to be run in Spark, it passes the job to Spark as it was when it received it, including the table name instead of the ID.

We'd like the job to be passed to Spark by the table ID instead of the name, so that if the table is renamed between the starter lambda and the Spark driver, the job will still run.
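A minimal sketch of the idea, resolving the table name to its ID at submission time so everything downstream is insensitive to renames (the class, method names, and lookup API here are illustrative assumptions, not Sleeper's actual code):

```java
import java.util.Map;

public class TableIdSketch {

    // Stand-in for the table index lookup in the starter lambda;
    // the name and signature are assumptions for illustration.
    static String lookupTableId(String tableName) {
        Map<String, String> index = Map.of("my-table", "table-1234");
        return index.get(tableName);
    }

    public static void main(String[] args) {
        String jobTableName = "my-table";
        // Resolve the name to an ID once, when the job is received.
        String jobTableId = lookupTableId(jobTableName);
        // From here on the job carries the ID, so a rename of
        // "my-table" no longer affects the running job.
        System.out.println(jobTableId);
    }
}
```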

Technical Notes / Implementation Details

The job is written to S3 with BulkImportExecutor.WriteJobToBucket. The job is read in Spark with BulkImportJobLoaderFromS3, in BulkImportJobDriver.start.

The Sleeper table is then looked up by its name in BulkImportJobDriver.run. Because the lookup is by name, it currently has to happen outside the try/catch/finally that reports failures to the job tracker. Once we load the table properties by ID instead, the lookup can move inside the try/catch/finally, so any failure there is recorded in the job tracker.
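The control-flow change can be sketched as follows (a hypothetical simplification of the driver, with assumed names; the point is only that an ID-based load can sit inside the try so its failure reaches the tracker):

```java
import java.util.ArrayList;
import java.util.List;

public class DriverSketch {

    static List<String> trackerEvents = new ArrayList<>();

    // Stand-in for loading table properties by ID; throws if the ID is unknown.
    static String loadTablePropertiesById(String tableId) {
        if (!"table-1234".equals(tableId)) {
            throw new IllegalArgumentException("Unknown table ID: " + tableId);
        }
        return "properties-for-" + tableId;
    }

    static void run(String tableId) {
        try {
            // With an ID-based lookup the load sits inside the try block,
            // so a failure here is reported to the job tracker.
            String properties = loadTablePropertiesById(tableId);
            trackerEvents.add("finished: " + properties);
        } catch (RuntimeException e) {
            trackerEvents.add("failed: " + e.getMessage());
        }
    }

    public static void main(String[] args) {
        run("missing-id");
        System.out.println(trackerEvents.get(0));
    }
}
```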
