lake, media: add three benchmark docs with six diagrams #23113
Open
lilin90 wants to merge 11 commits into pingcap:feature/preview-cloud-lake from lilin90:add-benchmark
+427
−0
11 commits
44df531 lake: add three benchmark docs (lilin90)
583aa91 Apply suggestions from code review (lilin90)
bebaec7 Keep consistent style (lilin90)
64ca742 media: add six lake benchmark images (lilin90)
212ed79 Apply suggestions from code review (lilin90)
cae91c9 Apply suggestions from code review (lilin90)
4327066 Apply suggestions from code review (lilin90)
e3c810f Apply suggestions from code review (lilin90)
fcd8de6 Apply suggestions from code review (lilin90)
bcf21e6 Use lakesql-bin links in benchmark guides (lilin90)
de89f42 Update format (lilin90)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,182 @@ | ||
| --- | ||
| title: "{{{ .lake }}} vs. Snowflake: Data Ingestion Benchmark" | ||
| summary: This page presents a benchmark comparison of data ingestion performance and cost between {{{ .lake }}} and Snowflake, focusing on TPC-H SF100 dataset loading, ClickBench Hits dataset loading, and freshness benchmarks. | ||
| --- | ||
|
|
||
| # {{{ .lake }}} vs. Snowflake: Data Ingestion Benchmark | ||
|
|
||
| ## Overview | ||
|
|
||
| We ran four benchmarks to compare {{{ .lake }}} with Snowflake: | ||
|
|
||
| 1. **TPC-H SF100 Dataset Loading**: Measures load time and cost for a large dataset (100 GB, ~600 million rows). | ||
| 2. **ClickBench Hits Dataset Loading**: Measures load time and cost for a wide table (76 GB, ~100 million rows, 105 columns), where the high column count is the main challenge. | ||
| 3. **1-Second Freshness**: Measures how many rows each platform ingests within a 1-second freshness window. | ||
| 4. **5-Second Freshness**: Measures how many rows each platform ingests within a 5-second freshness window. | ||
|
|
||
| ## Platforms | ||
|
|
||
| - **[Snowflake](https://snowflake.com)**: A well-known cloud data platform emphasizing scalable compute and data sharing. | ||
| - **[{{{ .lake }}}](https://tidbcloud.com)**: A cloud-native data warehouse built on the open-source {{{ .lake }}} project, focusing on scalability and cost-efficiency. | ||
|
|
||
| ## Benchmark Conditions | ||
|
|
||
| All benchmarks ran on a `Small` warehouse (16 vCPUs, AWS us-east-2) and loaded data from the same S3 bucket. | ||
|
|
||
| ## Performance and Cost Comparison | ||
|
|
||
| - **TPC-H SF100 Data**: {{{ .lake }}} offers a **48% cost reduction** over Snowflake. | ||
| - **ClickBench Hits Data**: {{{ .lake }}} achieves an **84% cost reduction** over Snowflake. | ||
| - **1-Second Freshness**: {{{ .lake }}} loads **400 times** as many rows as Snowflake. | ||
| - **5-Second Freshness**: {{{ .lake }}} loads over **27 times** as many rows as Snowflake. | ||
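The headline figures above follow from the result tables later on this page. The sketch below recomputes them as a sanity check. The function and variable names are illustrative and not part of the benchmark tooling; the input values are copied from the tables.

```python
def cost_reduction_pct(baseline_cost: float, new_cost: float) -> float:
    """Percentage saved relative to the baseline cost."""
    return (baseline_cost - new_cost) / baseline_cost * 100


def volume_ratio(rows_a: int, rows_b: int) -> float:
    """How many times as many rows rows_a is compared with rows_b."""
    return rows_a / rows_b


# Costs (USD) and row counts copied from the result tables on this page.
tpch_saving = cost_reduction_pct(0.77, 0.40)
hits_saving = cost_reduction_pct(3.42, 0.53)
fresh_1s = volume_ratio(40_000, 100)
fresh_5s = volume_ratio(2_500_000, 90_000)

print(f"TPC-H SF100 cost reduction:     {tpch_saving:.1f}%")
print(f"ClickBench Hits cost reduction: {hits_saving:.1f}%")
print(f"1-second freshness ratio:       {fresh_1s:.0f}x")
print(f"5-second freshness ratio:       {fresh_5s:.1f}x")
```

The results round to the 48%, 84%, 400 times, and "over 27 times" figures quoted in the list.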
|
|
||
| ## Data Ingestion Benchmarks | ||
|
|
||
|  | ||
|
|
||
| ### TPC-H SF100 Dataset | ||
|
|
||
| | Metric | Snowflake | {{{ .lake }}} | Description | | ||
| | -------------- | --------- | -------------- | ------------------------- | | ||
| | **Total Time** | 695s | 446s | Time to load the dataset. | | ||
| | **Total Cost** | $0.77 | $0.40 | Cost of data loading. | | ||
|
|
||
| - Data Volume: 100 GB | ||
| - Rows: Approx. 600 million | ||
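To put the load times in perspective, they can be converted to an average ingestion rate. This is a rough sketch: the row count is the approximate figure given above, and the `lake` key is only a stand-in label for {{{ .lake }}}.

```python
APPROX_ROWS = 600_000_000  # approximate TPC-H SF100 row count from this page

# Total load time in seconds, copied from the table above.
load_seconds = {"Snowflake": 695, "lake": 446}

# Average rows ingested per second over the whole load.
rows_per_second = {name: APPROX_ROWS / secs for name, secs in load_seconds.items()}

for name, rate in rows_per_second.items():
    print(f"{name:<10} {rate:,.0f} rows/s")
```

Because the row count is approximate, treat the rates as order-of-magnitude indicators rather than measured throughput.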
|
|
||
| ### ClickBench Hits Dataset | ||
|
|
||
| | Metric | Snowflake | {{{ .lake }}} | Description | | ||
| | -------------- | --------- | -------------- | ------------------------- | | ||
| | **Total Time** | 51m 17s | 9m 58s | Time to load the dataset. | | ||
| | **Total Cost** | $3.42 | $0.53 | Cost of data loading. | | ||
|
|
||
| - Data Volume: 76 GB | ||
| - Rows: Approx. 100 million | ||
| - Table Width: 105 columns | ||
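The ClickBench times are reported in minutes and seconds, while the TPC-H times are in seconds only. A small helper, shown here as an illustrative sketch (`to_seconds` is not part of any benchmark tooling), normalizes both formats so the load times can be compared directly.

```python
import re


def to_seconds(duration: str) -> int:
    """Convert strings such as '51m 17s' or '695s' to whole seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?\s*(?:(\d+)s)?", duration.strip())
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {duration!r}")
    minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return minutes * 60 + seconds


# Durations copied from the ClickBench Hits table above.
snowflake_secs = to_seconds("51m 17s")
lake_secs = to_seconds("9m 58s")
speedup = snowflake_secs / lake_secs

print(f"Snowflake: {snowflake_secs}s, lake: {lake_secs}s, ratio: {speedup:.1f}x")
```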
|
|
||
| ## Freshness Benchmarks | ||
|
|
||
|  | ||
|
|
||
| ### 1-Second Freshness Benchmark | ||
|
|
||
| This benchmark measures how many rows each platform ingests within a 1-second freshness window. | ||
|
|
||
| | Metric | Snowflake | {{{ .lake }}} | Description | | ||
| | -------------- | --------- | -------------- | ----------------------------------------------- | | ||
| | **Total Time** | 1s | 1s | Loading time frame. | | ||
| | **Total Rows** | 100 Rows | 40,000 Rows | Volume of data successfully ingested within 1s. | | ||
|
|
||
| ### 5-Second Freshness Benchmark | ||
|
|
||
| This benchmark measures how many rows each platform ingests within a 5-second freshness window. | ||
|
|
||
| | Metric | Snowflake | {{{ .lake }}} | Description | | ||
| | -------------- | ----------- | -------------- | ----------------------------------------------- | | ||
| | **Total Time** | 5s | 5s | Loading time frame. | | ||
| | **Total Rows** | 90,000 Rows | 2,500,000 Rows | Volume of data successfully ingested within 5s. | | ||
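The two freshness windows are easier to compare once the row counts are normalized to rows per second. The following sketch does that with the figures from the two tables above; the names are illustrative.

```python
def rows_per_second(rows: int, window_seconds: int) -> float:
    """Average ingestion rate over the freshness window."""
    return rows / window_seconds


# (platform label, window in seconds, rows ingested), copied from the tables above.
results = [
    ("Snowflake", 1, 100),
    ("lake", 1, 40_000),
    ("Snowflake", 5, 90_000),
    ("lake", 5, 2_500_000),
]

rates = {(name, window): rows_per_second(rows, window) for name, window, rows in results}

for (name, window), rate in rates.items():
    print(f"{name:<10} {window}s window: {rate:,.0f} rows/s")
```

Normalized this way, both platforms sustain a higher per-second rate in the 5-second window than in the 1-second window.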
|
|
||
| ## Reproduce the Benchmark | ||
|
|
||
| You can reproduce the benchmark by following the steps below. | ||
|
|
||
| ### Benchmark Environment | ||
|
|
||
| The benchmark tests both Snowflake and {{{ .lake }}} under similar conditions: | ||
|
|
||
| | Parameter | Snowflake | {{{ .lake }}} | | ||
| | -------------- | -------------------------------------------------------- | ----------------------------------------- | | ||
| | Warehouse Size | Small | Small | | ||
| | vCPU | 16 | 16 | | ||
| | Price | [$4/hour](https://www.snowflake.com/en/pricing-options/) | [$3.2/hour](https://www.pingcap.com/pricing/) | | ||
| | AWS Region | us-east-2 | us-east-2 | | ||
| | Storage | AWS S3 | AWS S3 | | ||
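The reported costs are consistent with multiplying each load time by the hourly warehouse price. The sketch below assumes simple per-second proration with no minimum charge, which is an assumption made here for illustration and not a statement of either vendor's billing rules.

```python
SECONDS_PER_HOUR = 3600


def load_cost(seconds: int, price_per_hour: float) -> float:
    """Cost of a load, assuming per-second proration of the hourly price."""
    return seconds * price_per_hour / SECONDS_PER_HOUR


# Hourly prices (USD) from the table above.
SNOWFLAKE_PRICE = 4.0
LAKE_PRICE = 3.2

# Load times in seconds from the result tables (51m 17s = 3077s, 9m 58s = 598s).
costs = {
    "TPC-H SF100 / Snowflake": load_cost(695, SNOWFLAKE_PRICE),
    "TPC-H SF100 / lake": load_cost(446, LAKE_PRICE),
    "ClickBench Hits / Snowflake": load_cost(3077, SNOWFLAKE_PRICE),
    "ClickBench Hits / lake": load_cost(598, LAKE_PRICE),
}

for label, cost in costs.items():
    print(f"{label:<28} ${cost:.2f}")
```

Under that assumption, the computed values round to the $0.77, $0.40, $3.42, and $0.53 reported in the result tables.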
|
|
||
| - The TPC-H SF100 dataset is sourced from [Amazon Redshift](https://github.com/awslabs/amazon-redshift-utils/tree/master/src/CloudDataWarehouseBenchmark/Cloud-DWB-Derived-from-TPCH). | ||
| - The ClickBench dataset is sourced from [ClickBench](https://github.com/ClickHouse/ClickBench). | ||
|
|
||
| ### Prerequisites | ||
|
|
||
| - Sign up for a [Snowflake account](https://signup.snowflake.com). | ||
| - Sign up for a [{{{ .lake }}} account](https://tidbcloud.com/). | ||
|
|
||
| ### Data Ingestion Benchmark | ||
|
|
||
| The data ingestion benchmark can be reproduced using the following steps: | ||
|
|
||
| <details> | ||
| <summary>TPC-H Data Loading</summary> | ||
|
|
||
| 1. **Snowflake Data Load**: | ||
|
|
||
| - Log in to your [Snowflake account](https://app.snowflake.com/). | ||
| - Create tables corresponding to the TPC-H schema. [SQL Script](https://lakesql-bin.tidbcloud.com/datasets/tpch/tpch-100/snowflake/setup.sql). | ||
| - Use the `COPY INTO` command to load the data from AWS S3. [SQL Script](https://lakesql-bin.tidbcloud.com/datasets/tpch/tpch-100/snowflake/setup.sql). | ||
|
|
||
| 2. **{{{ .lake }}} Data Load**: | ||
|
|
||
| - Sign in to your [{{{ .lake }}} account](https://tidbcloud.com). | ||
| - Create the necessary tables as per the TPC-H schema. [SQL Script](https://lakesql-bin.tidbcloud.com/datasets/tpch/tpch-100/lake/setup.sql). | ||
| - Use the `COPY INTO` command to load the data from AWS S3, as with Snowflake. [SQL Script](https://lakesql-bin.tidbcloud.com/datasets/tpch/tpch-100/lake/setup.sql). | ||
|
|
||
| </details> | ||
|
|
||
| <details> | ||
| <summary>ClickBench Hits Data Loading</summary> | ||
|
|
||
| 1. **Snowflake Data Load**: | ||
|
|
||
| - Log in to your [Snowflake account](https://app.snowflake.com/). | ||
| - Create tables corresponding to the `hits` schema. [SQL Script](https://lakesql-bin.tidbcloud.com/datasets/tpch/hits/snowflake/schema.sql). | ||
| - Use the `COPY INTO` command to load the data from AWS S3. [SQL Script](https://lakesql-bin.tidbcloud.com/datasets/tpch/hits/snowflake/copy.sql). | ||
|
|
||
| 2. **{{{ .lake }}} Data Load**: | ||
|
|
||
| - Sign in to your [{{{ .lake }}} account](https://tidbcloud.com). | ||
| - Create the necessary tables as per the `hits` schema. [SQL Script](https://lakesql-bin.tidbcloud.com/datasets/tpch/hits/lake/schema.sql). | ||
| - Use the `COPY INTO` command to load the data from AWS S3, as with Snowflake. [SQL Script](https://lakesql-bin.tidbcloud.com/datasets/tpch/hits/lake/copy.sql). | ||
|
|
||
| </details> | ||
|
|
||
| ### Freshness Benchmark | ||
|
|
||
| Data generation and ingestion for the freshness benchmark can be reproduced using the following steps: | ||
|
|
||
| 1. Create an [external stage](/tidb-cloud-lake/sql/create-stage.md#example-2-create-external-stage-with-connection) in {{{ .lake }}}. | ||
|
|
||
| ```sql | ||
| CREATE STAGE hits_unload_stage | ||
| URL = 's3://unload/files/' | ||
| CONNECTION = ( | ||
| ACCESS_KEY_ID = '<your-access-key-id>', | ||
| SECRET_ACCESS_KEY = '<your-secret-access-key>' | ||
| ); | ||
| ``` | ||
|
|
||
| 2. Unload data to the external stage. | ||
|
|
||
| ```sql | ||
| CREATE OR REPLACE FILE FORMAT tsv_unload_format_gzip | ||
| TYPE = TSV, | ||
| COMPRESSION = gzip; | ||
|
|
||
| COPY INTO @hits_unload_stage | ||
| FROM ( | ||
| SELECT * | ||
| FROM hits LIMIT <the-rows-you-want> | ||
| ) | ||
| FILE_FORMAT = (FORMAT_NAME = 'tsv_unload_format_gzip') | ||
| DETAILED_OUTPUT = true; | ||
| ``` | ||
|
|
||
| 3. Load data from the external stage to the `hits` table. | ||
|
|
||
| ```sql | ||
| COPY INTO hits | ||
| FROM @hits_unload_stage | ||
| PATTERN = '.*[.]tsv[.]gz' | ||
| FILE_FORMAT = (TYPE = TSV, COMPRESSION = AUTO); | ||
| ``` | ||
|
|
||
| 4. Measure results from the dashboard. | ||