Use async queries API instead #18

@nfultz

I've been doing some googling about PyAthenaJDBC while trying to triage #16, and came across this by @davoscollective:

https://medium.com/@davedecahedron/ive-tested-both-and-pyathenajdbc-is-a-lot-slower-i-suppose-partly-because-it-is-using-the-athena-fdf56a9b715

I have tested with various clients (Tableau, DBeaver, and basic Java app) and retrieving data is a lot slower than it should be. When you run a query with the AWS Athena console, a results csv is written very quickly to S3. I did a test with a single table of 15m rows. The Athena query and csv file to S3 completes in less than 2 minutes. I can then download it to my local machine in less than a minute. To do the same operation via the JDBC driver takes over 2 hours. I am now thinking a better strategy is to query the metadata then start the Athena query asynchronously, poll it until completion and then download the csv file directly from the s3 staging directory and combine with metadata for correct types.

This is just one anecdote, and he is writing about the 1.x version of the driver, but it may be worth exploring.
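
For reference, here is a rough sketch of the start / poll / download flow described in the quote, written against boto3-style client objects rather than the JDBC driver. It is only an illustration under assumptions: the function name `run_query_to_csv` and its parameters are made up for this example, the clients are passed in so the snippet has no hard dependency on AWS, and it skips the metadata step (column types) entirely.

```python
import time
from urllib.parse import urlparse


def run_query_to_csv(athena, s3, sql, database, staging_dir, local_path,
                     poll_seconds=2.0, sleep=time.sleep):
    """Start an Athena query, poll until it finishes, then fetch the result CSV.

    `athena` and `s3` are expected to behave like boto3 Athena and S3 clients.
    The function name and parameters are hypothetical, chosen for this sketch.
    """
    # Kick off the query; this returns immediately with an execution id.
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": staging_dir},
    )
    query_id = started["QueryExecutionId"]

    # Poll the execution status until it reaches a terminal state.
    while True:
        execution = athena.get_query_execution(
            QueryExecutionId=query_id)["QueryExecution"]
        state = execution["Status"]["State"]
        if state == "SUCCEEDED":
            break
        if state in ("FAILED", "CANCELLED"):
            reason = execution["Status"].get("StateChangeReason", "no reason given")
            raise RuntimeError(f"query {query_id} ended in state {state}: {reason}")
        sleep(poll_seconds)

    # The execution record reports where the CSV was written in the staging dir.
    location = urlparse(execution["ResultConfiguration"]["OutputLocation"])
    s3.download_file(location.netloc, location.path.lstrip("/"), local_path)
    return query_id
```

With real credentials, the clients would presumably be `boto3.client("athena")` and `boto3.client("s3")`, and `staging_dir` an `s3://` URL of your own choosing. The downloaded CSV carries no type information, so it would still need to be combined with the column metadata, as the quote suggests.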
