From 502415ac4e412d1c852cb51d3f04436c63efeee1 Mon Sep 17 00:00:00 2001
From: Kevin Liu
Date: Sat, 28 Mar 2026 15:07:59 -0700
Subject: [PATCH] s/img-responsive/img-fluid/g

---
 content/blog/2024-01-19-datafusion-34.0.0.md  | 4 +--
 content/blog/2024-03-06-comet-donation.md     | 2 +-
 .../blog/2024-07-20-datafusion-comet-0.1.0.md | 4 +--
 .../2024-08-20-python-datafusion-40.0.0.md    | 4 +--
 .../blog/2024-08-28-datafusion-comet-0.2.0.md | 4 +--
 ...string-view-german-style-strings-part-1.md | 12 ++++-----
 ...string-view-german-style-strings-part-2.md | 4 +--
 ...-fastest-single-node-parquet-clickbench.md | 14 +++++-----
 .../blog/2025-01-17-datafusion-comet-0.5.0.md | 4 +--
 .../2025-02-02-datafusion-ballista-43.0.0.md  | 8 +++---
 content/blog/2025-02-20-datafusion-45.0.0.md  | 2 +-
 content/blog/2025-03-11-ordering-analysis.md  | 2 +-
 .../blog/2025-03-20-datafusion-comet-0.7.0.md | 2 +-
 content/blog/2025-03-20-parquet-pruning.md    | 4 +--
 content/blog/2025-03-21-parquet-pushdown.md   | 10 +++----
 content/blog/2025-03-24-datafusion-46.0.0.md  | 2 +-
 .../2025-03-30-datafusion-python-46.0.0.md    | 2 +-
 .../blog/2025-04-10-fastest-tpch-generator.md | 6 ++---
 ...6-15-optimizing-sql-dataframes-part-one.md | 10 +++----
 ...6-15-optimizing-sql-dataframes-part-two.md | 16 ++++++------
 content/blog/2025-06-30-cancellation.md       | 2 +-
 .../blog/2025-07-01-datafusion-comet-0.9.0.md | 2 +-
 content/blog/2025-07-11-datafusion-47.0.0.md  | 2 +-
 ...2025-07-14-user-defined-parquet-indexes.md | 4 +--
 content/blog/2025-07-28-datafusion-49.0.0.md  | 4 +--
 .../2025-08-15-external-parquet-indexes.md    | 16 ++++++------
 content/blog/2025-09-10-dynamic-filters.md    | 10 +++----
 .../2025-09-21-custom-types-using-metadata.md | 2 +-
 content/blog/2025-09-29-datafusion-50.0.0.md  | 2 +-
 .../2025-10-21-datafusion-comet-0.11.0.md     | 4 +--
 content/blog/2025-11-25-datafusion-51.0.0.md  | 4 +--
 ...25-12-15-avoid-consecutive-repartitions.md | 26 +++++++++----------
 content/blog/2026-01-12-extending-sql.md      | 2 +-
 content/blog/2026-02-02-datafusion_case.md    | 16 ++++++------
 34 files changed, 106 insertions(+), 106 deletions(-)

diff --git a/content/blog/2024-01-19-datafusion-34.0.0.md b/content/blog/2024-01-19-datafusion-34.0.0.md
index 2f95ccce..9b49d940 100644
--- a/content/blog/2024-01-19-datafusion-34.0.0.md
+++ b/content/blog/2024-01-19-datafusion-34.0.0.md
@@ -113,7 +113,7 @@ more than 2x faster on [ClickBench] compared to version `25.0.0`, as shown below
 [ClickBench]: https://benchmark.clickhouse.com/
- Fig 1: Adaptive Arrow schema architecture overview.
+ Fig 1: Adaptive Arrow schema architecture overview.
Figure 1: Performance improvement between 25.0.0 and 34.0.0 on ClickBench. Note that DataFusion 25.0.0, could not run several queries due to @@ -122,7 +122,7 @@ more than 2x faster on [ClickBench] compared to version `25.0.0`, as shown below
- Fig 1: Adaptive Arrow schema architecture overview.
+ Fig 1: Adaptive Arrow schema architecture overview.
Figure 2: Total query runtime for DataFusion 34.0.0 and DataFusion 25.0.0.
diff --git a/content/blog/2024-03-06-comet-donation.md b/content/blog/2024-03-06-comet-donation.md
index 660a0642..44728f23 100644
--- a/content/blog/2024-03-06-comet-donation.md
+++ b/content/blog/2024-03-06-comet-donation.md
@@ -39,7 +39,7 @@ performance improvements for some workloads as shown below.
 Fig 1: Adaptive Arrow schema architecture overview.
diff --git a/content/blog/2024-07-20-datafusion-comet-0.1.0.md b/content/blog/2024-07-20-datafusion-comet-0.1.0.md index 8d41d8f2..6167225b 100644 --- a/content/blog/2024-07-20-datafusion-comet-0.1.0.md +++ b/content/blog/2024-07-20-datafusion-comet-0.1.0.md @@ -88,7 +88,7 @@ for details of the environment used for these benchmarks. Chart showing TPC-H benchmark results for Comet 0.1.0 @@ -105,7 +105,7 @@ The following chart shows how much Comet currently accelerates each query from t Chart showing TPC-H benchmark results for Comet 0.1.0 diff --git a/content/blog/2024-08-20-python-datafusion-40.0.0.md b/content/blog/2024-08-20-python-datafusion-40.0.0.md index dd3b4e66..be484ffd 100644 --- a/content/blog/2024-08-20-python-datafusion-40.0.0.md +++ b/content/blog/2024-08-20-python-datafusion-40.0.0.md @@ -72,7 +72,7 @@ release, users can fully use these tools in their workflow. Fig 1: Enhanced tooltips in an IDE.
@@ -88,7 +88,7 @@ used a function's arguments as shown in Figure 2. Fig 2: Error checking in static analysis
diff --git a/content/blog/2024-08-28-datafusion-comet-0.2.0.md b/content/blog/2024-08-28-datafusion-comet-0.2.0.md index ff17da6a..bc46485a 100644 --- a/content/blog/2024-08-28-datafusion-comet-0.2.0.md +++ b/content/blog/2024-08-28-datafusion-comet-0.2.0.md @@ -86,7 +86,7 @@ Comet 0.2.0 provides a 62% speedup compared to Spark. This is slightly better th Chart showing TPC-H benchmark results for Comet 0.2.0 @@ -98,7 +98,7 @@ Comet 0.1.0, which did not provide any speedup for this benchmark. Chart showing TPC-DS benchmark results for Comet 0.2.0 diff --git a/content/blog/2024-09-13-string-view-german-style-strings-part-1.md b/content/blog/2024-09-13-string-view-german-style-strings-part-1.md index 6f8770b9..f9f5571b 100644 --- a/content/blog/2024-09-13-string-view-german-style-strings-part-1.md +++ b/content/blog/2024-09-13-string-view-german-style-strings-part-1.md @@ -47,7 +47,7 @@ StringView support was released as part of [arrow-rs v52.2.0](https://crates.io/ End to end performance improvements for ClickBench queries @@ -61,7 +61,7 @@ Figure 1: StringView improves string-intensive ClickBench query performance by 2 Diagram of using StringArray and StringViewArray to represent the same string content @@ -121,7 +121,7 @@ On the other hand, reading Parquet data as a StringViewArray can re-use the same Diagram showing how StringViewArray can avoid copying by reusing decoded Parquet pages. @@ -147,7 +147,7 @@ Strings are stored as byte sequences. When reading data from (potentially untrus Figure showing time to load strings from Parquet and the effect of optimized UTF-8 validation. @@ -162,7 +162,7 @@ UTF-8 validation in Rust is highly optimized and favors longer strings (as shown Figure showing UTF-8 validation throughput vs string length. @@ -212,7 +212,7 @@ With StringViewArray we saw a 24% end-to-end performance improvement, as shown i Figure showing StringView improves end to end performance by 24 percent. 
diff --git a/content/blog/2024-09-13-string-view-german-style-strings-part-2.md b/content/blog/2024-09-13-string-view-german-style-strings-part-2.md index 7fb64f56..34114b11 100644 --- a/content/blog/2024-09-13-string-view-german-style-strings-part-2.md +++ b/content/blog/2024-09-13-string-view-german-style-strings-part-2.md @@ -66,7 +66,7 @@ Figure 1 illustrates the difference between the output of both string representa Diagram showing Zero-copy `take`/`filter` for StringViewArray @@ -121,7 +121,7 @@ To eliminate the impact of the faster Parquet reading using StringViewArray (see Figure showing StringViewArray reduces the filter time by 32% on ClickBench query 22. diff --git a/content/blog/2024-11-18-datafusion-fastest-single-node-parquet-clickbench.md b/content/blog/2024-11-18-datafusion-fastest-single-node-parquet-clickbench.md index a71b8a03..82762e36 100644 --- a/content/blog/2024-11-18-datafusion-fastest-single-node-parquet-clickbench.md +++ b/content/blog/2024-11-18-datafusion-fastest-single-node-parquet-clickbench.md @@ -45,14 +45,14 @@ been held by traditional C/C++-based engines. Apache DataFusion Logo ClickBench performance for DataFusion 43.0.0 @@ -97,7 +97,7 @@ Figure 2. ClickBench performance results over time for DataFusion @@ -134,7 +134,7 @@ resulted in measurable performance improvements. Illustration of how take works with StringView @@ -216,7 +216,7 @@ bypass the first phase when it is not working efficiently, shown in Figure 4. Two phase aggregation diagram from DataFusion API docs annotated to show first phase not helping @@ -253,7 +253,7 @@ length strings and binary data]. Row based storage for multiple group columns @@ -276,7 +276,7 @@ at the [one shipped in DataFusion `43.0.0`], shown in Figure 6. 
Column based storage for multiple group columns diff --git a/content/blog/2025-01-17-datafusion-comet-0.5.0.md b/content/blog/2025-01-17-datafusion-comet-0.5.0.md index bf3a9560..dc9e439e 100644 --- a/content/blog/2025-01-17-datafusion-comet-0.5.0.md +++ b/content/blog/2025-01-17-datafusion-comet-0.5.0.md @@ -52,14 +52,14 @@ Comet 0.5.0 achieves a 1.9x speedup for single-node TPC-H @ 100 GB, an improveme Chart showing TPC-H benchmark results for Comet 0.5.0 Chart showing TPC-H benchmark results for Comet 0.5.0 diff --git a/content/blog/2025-02-02-datafusion-ballista-43.0.0.md b/content/blog/2025-02-02-datafusion-ballista-43.0.0.md index 7fac0323..1a2d7cbf 100644 --- a/content/blog/2025-02-02-datafusion-ballista-43.0.0.md +++ b/content/blog/2025-02-02-datafusion-ballista-43.0.0.md @@ -91,7 +91,7 @@ Per query comparison: Per query comparison @@ -100,7 +100,7 @@ Relative speedup: Relative speedup graph @@ -109,7 +109,7 @@ The overall speedup is 2.9x Overall speedup @@ -120,7 +120,7 @@ Ballista now has a new logo, which is visually similar to other DataFusion proje New logo diff --git a/content/blog/2025-02-20-datafusion-45.0.0.md b/content/blog/2025-02-20-datafusion-45.0.0.md index 9471f5bb..d887374f 100644 --- a/content/blog/2025-02-20-datafusion-45.0.0.md +++ b/content/blog/2025-02-20-datafusion-45.0.0.md @@ -152,7 +152,7 @@ more improvements]. ClickBench performance results over time for DataFusion diff --git a/content/blog/2025-03-11-ordering-analysis.md b/content/blog/2025-03-11-ordering-analysis.md index 121adf41..72a71bba 100644 --- a/content/blog/2025-03-11-ordering-analysis.md +++ b/content/blog/2025-03-11-ordering-analysis.md @@ -332,7 +332,7 @@ using the orderings of the query intermediates.
Window Query Datafusion Optimization
Figure 1: DataFusion analyzes orderings of the sources and query intermediates to generate efficient plans
diff --git a/content/blog/2025-03-20-datafusion-comet-0.7.0.md b/content/blog/2025-03-20-datafusion-comet-0.7.0.md index 45cc12ce..1e6b2286 100644 --- a/content/blog/2025-03-20-datafusion-comet-0.7.0.md +++ b/content/blog/2025-03-20-datafusion-comet-0.7.0.md @@ -56,7 +56,7 @@ CPU and RAM. Even with **half the resources**, Comet still provides a measurable Chart showing TPC-H benchmark results for Comet 0.7.0 diff --git a/content/blog/2025-03-20-parquet-pruning.md b/content/blog/2025-03-20-parquet-pruning.md index 5c268a97..0b78ea4a 100644 --- a/content/blog/2025-03-20-parquet-pruning.md +++ b/content/blog/2025-03-20-parquet-pruning.md @@ -49,7 +49,7 @@ The diagram below illustrates the [Parquet reading pipeline] in DataFusion, high [Parquet reading pipeline]: https://docs.rs/datafusion/46.0.0/datafusion/datasource/physical_plan/parquet/source/struct.ParquetSource.html -Parquet pruning pipeline in DataFusion +Parquet pruning pipeline in DataFusion #### Background: Parquet file structure @@ -106,7 +106,7 @@ So far we have discussed techniques that prune the Parquet file using only the m Filter pushdown, also known as predicate pushdown or late materialization, is a technique that prunes data during scanning, with filters being generated and applied in the Parquet reader. -Filter pushdown in DataFusion +Filter pushdown in DataFusion Unlike metadata-based pruning which works at the row group or page level, filter pushdown operates at the row level, allowing DataFusion to filter out individual rows that don't match the query predicates during the decoding process. diff --git a/content/blog/2025-03-21-parquet-pushdown.md b/content/blog/2025-03-21-parquet-pushdown.md index 395d59d8..1da5f827 100644 --- a/content/blog/2025-03-21-parquet-pushdown.md +++ b/content/blog/2025-03-21-parquet-pushdown.md @@ -77,7 +77,7 @@ WHERE date_time > '2025-03-11' AND location = 'office'; ```
- Parquet pruning skips irrelevant files/row_groups, while filter pushdown skips irrelevant rows. Without filter pushdown, all rows from location, val, and date_time columns are decoded before `location='office'` is evaluated. Filter pushdown is especially useful when the filter is selective, i.e., removes many rows.
+ Parquet pruning skips irrelevant files/row_groups, while filter pushdown skips irrelevant rows. Without filter pushdown, all rows from location, val, and date_time columns are decoded before `location='office'` is evaluated. Filter pushdown is especially useful when the filter is selective, i.e., removes many rows.
Parquet pruning skips irrelevant files/row_groups, while filter pushdown skips irrelevant rows. Without filter pushdown, all rows from location, val, and date_time columns are decoded before `location='office'` is evaluated. Filter pushdown is especially useful when the filter is selective, i.e., removes many rows.
@@ -102,7 +102,7 @@ At a high level, the Parquet reader first builds a filter mask -- essentially a Let's dig into details of [how filter pushdown is implemented](https://github.com/apache/arrow-rs/blob/d5339f31a60a4bd8a4256e7120fe32603249d88e/parquet/src/arrow/async_reader/mod.rs#L618-L712) in the current Rust Parquet reader implementation, illustrated in the following figure.
- Implementation of filter pushdown in Rust Parquet readers
+ Implementation of filter pushdown in Rust Parquet readers
Implementation of filter pushdown in Rust Parquet readers -- the first phase builds the filter mask, the second phase applies the filter mask to the other columns
@@ -170,7 +170,7 @@ This section describes my [<700 LOC PR (with lots of comments and tests)](https:
- New decoding pipeline, building filter mask and output columns are interleaved in a single pass, allowing us to cache minimal pages for minimal amount of time
+ New decoding pipeline, building filter mask and output columns are interleaved in a single pass, allowing us to cache minimal pages for minimal amount of time
New decoding pipeline, building filter mask and output columns are interleaved in a single pass, allowing us to cache minimal pages for minimal amount of time
@@ -213,7 +213,7 @@ Parquet by default encodes data using [dictionary encoding](https://parquet.apac You can see this in action using [parquet-viewer](https://parquet-viewer.xiangpeng.systems):
- Parquet viewer shows the page layout of a column chunk
+ Parquet viewer shows the page layout of a column chunk
Parquet viewer shows the page layout of a column chunk
@@ -225,7 +225,7 @@ This is why it caches 2 pages per column: one dictionary page and one data page. The data page slot will move forward as it reads the data; but the dictionary page slot always references the first page.
- Cached two pages, one for dictionary (pinned), one for data (moves as it reads the data)
+ Cached two pages, one for dictionary (pinned), one for data (moves as it reads the data)
Cached two pages, one for dictionary (pinned), one for data (moves as it reads the data)
diff --git a/content/blog/2025-03-24-datafusion-46.0.0.md b/content/blog/2025-03-24-datafusion-46.0.0.md index 71ef758d..8cae1126 100644 --- a/content/blog/2025-03-24-datafusion-46.0.0.md +++ b/content/blog/2025-03-24-datafusion-46.0.0.md @@ -59,7 +59,7 @@ DataFusion 46.0.0 introduces a new [**SQL Diagnostics framework**](https://gith For example, if you reference an unknown table or miss a column in `GROUP BY` the error message will include the query snippet causing the error. These diagnostics are meant for end-users of applications built on DataFusion, providing clearer messages instead of generic errors. Here’s an example: -diagnostic-example +diagnostic-example Currently, diagnostics cover unresolved table/column references, missing `GROUP BY` columns, ambiguous references, wrong number of UNION columns, type mismatches, and a few others. Future releases will extend this to more error types. This feature should greatly ease debugging of complex SQL by pinpointing errors directly in the query text. We thank [@eliaperantoni](https://github.com/eliaperantoni) for his contributions in this project. diff --git a/content/blog/2025-03-30-datafusion-python-46.0.0.md b/content/blog/2025-03-30-datafusion-python-46.0.0.md index 357aa8a6..f854621a 100644 --- a/content/blog/2025-03-30-datafusion-python-46.0.0.md +++ b/content/blog/2025-03-30-datafusion-python-46.0.0.md @@ -181,7 +181,7 @@ expandable text and scroll bars. Fig 1: Example html rendering in a jupyter notebook.
diff --git a/content/blog/2025-04-10-fastest-tpch-generator.md b/content/blog/2025-04-10-fastest-tpch-generator.md index 639d4342..17182214 100644 --- a/content/blog/2025-04-10-fastest-tpch-generator.md +++ b/content/blog/2025-04-10-fastest-tpch-generator.md @@ -48,7 +48,7 @@ which takes 30 minutes1 (0.05GB/sec). On the same machine, it takes l It is finally convenient and efficient to run TPC-H queries locally when testing analytical engines such as DataFusion. -Time to create TPC-H parquet dataset for Scale Factor  1, 10, 100 and 1000 +Time to create TPC-H parquet dataset for Scale Factor  1, 10, 100 and 1000 **Figure 1**: Time to create TPC-H dataset for Scale Factor (see below) 1, 10, 100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core GCP @@ -206,7 +206,7 @@ load the data, using `dbgen`, which is not ideal for several reasons: [here is how to do so]: https://github.com/apache/datafusion/blob/507f6b6773deac69dd9d90dbe60831f5ea5abed1/datafusion/sqllogictest/test_files/tpch/create_tables.slt.part#L24-L124 -Time to generate TPC-H data in TBL format +Time to generate TPC-H data in TBL format **Figure 3**: Time to generate TPC-H data in TBL format. `tpchgen` is shown in blue. `tpchgen` restricted to a single core is shown in red. Unmodified @@ -266,7 +266,7 @@ strings. [unsafe]: https://github.com/search?q=repo%3Aclflushopt%2Ftpchgen-rs%20unsafe&type=code [skip]: https://github.com/clflushopt/tpchgen-rs/blob/c651da1fc309f9cb3872cbdf71e4796904dc62c6/tpchgen/src/text.rs#L72 -Lamb Theory on Evolution of Systems Languages +Lamb Theory on Evolution of Systems Languages **Figure 4**: Lamb Theory of System Language Evolution from [Boston University MiDAS Fall 2024 (Data Systems Seminar)] [slides(pdf)], [recording]. 
Special diff --git a/content/blog/2025-06-15-optimizing-sql-dataframes-part-one.md b/content/blog/2025-06-15-optimizing-sql-dataframes-part-one.md index a389fe93..7ac4aa8f 100644 --- a/content/blog/2025-06-15-optimizing-sql-dataframes-part-one.md +++ b/content/blog/2025-06-15-optimizing-sql-dataframes-part-one.md @@ -79,7 +79,7 @@ language—it describes what answers are desired rather than an *imperative* language such as Python, where you describe how to do the computation as shown in Figure 1. -Fig 1: Query Execution. +Fig 1: Query Execution. **Figure 1**: Query Execution: Users describe the answer they want using either SQL or a DataFrame. For SQL, a Query Planner translates the parsed query @@ -112,7 +112,7 @@ modern APIs such as [Polars' lazy API], [Apache Spark's DataFrame]. and This section motivates the value of a Query Optimizer with an example. Let’s say you have some observations of animal behavior, as illustrated in Table 1. -Table 1: Observational Data. +Table 1: Observational Data. **Table 1**: Example observational data. @@ -148,7 +148,7 @@ Figure 2. [LogicalPlan]: https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.LogicalPlan.html [this DataFusion overview video]: https://youtu.be/EzZTLiSJnhY -Fig 2: Initial Logical Plan. +Fig 2: Initial Logical Plan. **Figure 2**: Example initial `LogicalPlan` for SQL and DataFrame query. The plan is read from bottom to top, computing the results in each step. @@ -157,7 +157,7 @@ The optimizer's job is to take this query plan and rewrite it into an alternate plan that computes the same results but faster, such as the one shown in Figure 3. -Fig 3: Optimized Logical Plan. +Fig 3: Optimized Logical Plan. **Figure 3**: An example optimized plan that computes the same result as the plan in Figure 2 more efficiently. The diagram highlights where the optimizer @@ -184,7 +184,7 @@ A multi-pass design is standard because it helps: 1. Understand, implement, and test each pass in isolation 2. 
Easily extend the optimizer by adding new passes -Fig 4: Query Optimizer Passes. +Fig 4: Query Optimizer Passes. **Figure 4**: Query Optimizers are implemented as a series of rules that each rewrite the query plan. Each rule’s algorithm is expressed as a transformation diff --git a/content/blog/2025-06-15-optimizing-sql-dataframes-part-two.md b/content/blog/2025-06-15-optimizing-sql-dataframes-part-two.md index 195d115b..3dd27b2f 100644 --- a/content/blog/2025-06-15-optimizing-sql-dataframes-part-two.md +++ b/content/blog/2025-06-15-optimizing-sql-dataframes-part-two.md @@ -95,7 +95,7 @@ Optimizers will evaluate the filter before the aggregation. [evaluated after]: https://www.datacamp.com/tutorial/sql-order-of-execution -Fig 1: Filter Pushdown. +Fig 1: Filter Pushdown. **Figure 1**: Filter Pushdown. In (**A**) without filter pushdown, the operator processes more rows, reducing efficiency. In (**B**) with filter pushdown, the @@ -129,7 +129,7 @@ each column in each row must be parsed even if it is not used in the plan. [Apache Parquet]: https://parquet.apache.org/ [especially powerful in combination with filter pushdown]: https://blog.xiangpeng.systems/posts/parquet-pushdown/ -Fig 2: Projection Pushdown. +Fig 2: Projection Pushdown. **Figure 2:** In (**A**) without projection pushdown, the operator receives more columns, reducing efficiency. In (**B**) with projection pushdown, the operator @@ -156,7 +156,7 @@ opening additional files once the limit has been hit. [TopK]: https://docs.rs/datafusion/latest/datafusion/physical_plan/struct.TopK.html -Fig 3: Limit Pushdown. +Fig 3: Limit Pushdown. **Figure 3**: In (**A**), without limit pushdown all data is sorted and everything except the first few rows are discarded. In (**B**), with limit @@ -217,7 +217,7 @@ customer, but fills in the fields with `null`. All such rows will be filtered out by `customer.last_name = 'Lamb'`, and thus an INNER JOIN produces the same answer. This is illustrated in Figure 4. 
-Fig 4: Join Rewrite. +Fig 4: Join Rewrite. **Figure 4**: Rewriting `OUTER JOIN` to `INNER JOIN`. In (A) the original query contains an `OUTER JOIN` but also a filter on `customer.last_name`, which @@ -326,7 +326,7 @@ ORDER BY time_chunk ``` -Fig 5: Common Subquery Elimination. +Fig 5: Common Subquery Elimination. **Figure 5:** Adding a Projection to evaluate common complex sub expression decreases complexity for later stages. @@ -349,7 +349,7 @@ group keys or a `MergeJoin` [source]: https://docs.rs/datafusion/latest/datafusion/physical_plan/struct.TopK.html -Fig 6: Specialized Grouping. +Fig 6: Specialized Grouping. **Figure 6: **An example of specialized operation for grouping. In (**A**), input data has no specified ordering and DataFusion uses a hashing-based grouping operator ([source](https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/aggregates/row_hash.rs)) to determine distinct groups. In (**B**), when the input data is ordered by the group keys, DataFusion uses a specialized grouping operator ([source](https://github.com/apache/datafusion/tree/main/datafusion/physical-plan/src/aggregates/order)) to find boundaries that separate groups. @@ -371,7 +371,7 @@ and statistics are commonly stored in analytic file formats. For example, the [Metadata]: https://docs.rs/parquet/latest/parquet/file/metadata/index.html -Fig 7: Using Statistics. +Fig 7: Using Statistics. **Figure 7: **When the aggregation result is already stored in the statistics, the query can be evaluated using the values from statistics without looking at @@ -392,7 +392,7 @@ potentially (very) different performance. The major options in this category are [Materialized View]: https://en.wikipedia.org/wiki/Materialized_view -Fig 8: Access Path and Join Order. +Fig 8: Access Path and Join Order. **Figure 8:** Access Path and Join Order Selection in Query Optimizers. Optimizers use heuristics to enumerate some subset of potential join orders (shape) and access paths (color). 
The plan with the smallest estimated cost according to some cost model is chosen. In this case, Plan 2 with a cost of 180,000 is chosen for execution as it has the lowest estimated cost. diff --git a/content/blog/2025-06-30-cancellation.md b/content/blog/2025-06-30-cancellation.md index 244019b1..76ee19fe 100644 --- a/content/blog/2025-06-30-cancellation.md +++ b/content/blog/2025-06-30-cancellation.md @@ -359,7 +359,7 @@ To illustrate what this process looks like, let's have a look at the execution o If we assume a task budget of 1 unit, each time Tokio schedules the task would result in the following sequence of function calls.
-Sequence diagram showing how the tokio task budget is used and reset.
Tokio task budget system, assuming the task budget is set to 1, for the plan above.
diff --git a/content/blog/2025-07-01-datafusion-comet-0.9.0.md b/content/blog/2025-07-01-datafusion-comet-0.9.0.md index cd3f24b8..4c73ebd8 100644 --- a/content/blog/2025-07-01-datafusion-comet-0.9.0.md +++ b/content/blog/2025-07-01-datafusion-comet-0.9.0.md @@ -144,7 +144,7 @@ Comet now provides a tracing feature for analyzing performance and off-heap vers Comet Tracing diff --git a/content/blog/2025-07-11-datafusion-47.0.0.md b/content/blog/2025-07-11-datafusion-47.0.0.md index 27c32216..2cf56ca1 100644 --- a/content/blog/2025-07-11-datafusion-47.0.0.md +++ b/content/blog/2025-07-11-datafusion-47.0.0.md @@ -198,7 +198,7 @@ use pre-integrated community crates such as the [datafusion-tracing] crate. DataFusion telemetry project logo diff --git a/content/blog/2025-07-14-user-defined-parquet-indexes.md b/content/blog/2025-07-14-user-defined-parquet-indexes.md index e2f1452d..7f4fd08c 100644 --- a/content/blog/2025-07-14-user-defined-parquet-indexes.md +++ b/content/blog/2025-07-14-user-defined-parquet-indexes.md @@ -88,7 +88,7 @@ The Parquet format includes three main types[2](#footnote2) of option -Parquet File layout with standard index structures. +Parquet File layout with standard index structures. **Figure 1**: Parquet file layout with standard index structures (as written by arrow-rs). @@ -116,7 +116,7 @@ Figure 2 shows the resulting file layout. -Parquet File layout with custom index structures. +Parquet File layout with custom index structures. **Figure 2**: Parquet file layout with user-defined indexes. 
diff --git a/content/blog/2025-07-28-datafusion-49.0.0.md b/content/blog/2025-07-28-datafusion-49.0.0.md index 96322304..a2148f70 100644 --- a/content/blog/2025-07-28-datafusion-49.0.0.md +++ b/content/blog/2025-07-28-datafusion-49.0.0.md @@ -46,7 +46,7 @@ DataFusion continues to focus on enhancing performance, as shown in the ClickBen ClickBench performance results over time for DataFusion @@ -61,7 +61,7 @@ NOTE: Andrew is working on gathering these numbers Planning benchmark performance results over time for DataFusion diff --git a/content/blog/2025-08-15-external-parquet-indexes.md b/content/blog/2025-08-15-external-parquet-indexes.md index 53002cce..58566a31 100644 --- a/content/blog/2025-08-15-external-parquet-indexes.md +++ b/content/blog/2025-08-15-external-parquet-indexes.md @@ -80,7 +80,7 @@ needs[1](#footnote1). Using External Indexes to Accelerate Queries @@ -211,7 +211,7 @@ The standard approach is shown in Figure 2: Standard Pruning Layers. @@ -245,7 +245,7 @@ shown below. Logical Parquet File layout: Row Groups and Column Chunks. @@ -262,7 +262,7 @@ stored at the end of the file (in the footer), as shown below. Physical Parquet File layout: Metadata and Footer. @@ -289,7 +289,7 @@ The high level mechanics of Parquet predicate pushdown is shown below: Parquet Filter Pushdown: use filter predicate to skip pages. @@ -326,7 +326,7 @@ most recent 7 days. Data Skipping: Pruning Files. @@ -471,7 +471,7 @@ indexes for filtering *WITHIN* Parquet files as shown below. Data Skipping: Pruning Row Groups and DataPages @@ -724,7 +724,7 @@ Come Join Us! 
🎣 https://datafusion.apache.org/ diff --git a/content/blog/2025-09-10-dynamic-filters.md b/content/blog/2025-09-10-dynamic-filters.md index 84293a9e..ddf77893 100644 --- a/content/blog/2025-09-10-dynamic-filters.md +++ b/content/blog/2025-09-10-dynamic-filters.md @@ -70,7 +70,7 @@ SELECT * FROM hits WHERE "URL" LIKE '%google%' ORDER BY "EventTime" LIMIT 10; Q23 Performance Improvement with Dynamic Filters and Late Materialization @@ -105,7 +105,7 @@ A straightforward, though slow, plan to answer this query is shown in Figure 2. Naive Query Plan @@ -132,7 +132,7 @@ DuckDB]. The plan for Q23 using this specialized operator is shown in Figure 3. TopK Query Plan @@ -161,7 +161,7 @@ of files. The plan for Q23 with dynamic filters is shown in Figure 4. TopK Query Plan with Dynamic Filters @@ -372,7 +372,7 @@ other optimizations as shown in Figure 7. Join Performance Improvements with Dynamic Filters diff --git a/content/blog/2025-09-21-custom-types-using-metadata.md b/content/blog/2025-09-21-custom-types-using-metadata.md index b835a5be..65142b02 100644 --- a/content/blog/2025-09-21-custom-types-using-metadata.md +++ b/content/blog/2025-09-21-custom-types-using-metadata.md @@ -86,7 +86,7 @@ implementation, during processing of all user defined functions we pass the inpu field information.
- Relationship between a Record Batch, it's schema, and the underlying arrays. There is a one to one relationship between each Field in the Schema and Array entry in the Columns.
+ Relationship between a Record Batch, it's schema, and the underlying arrays. There is a one to one relationship between each Field in the Schema and Array entry in the Columns.
Figure 1: Relationship between a Record Batch, it's schema, and the underlying arrays. There is a one to one relationship between each Field in the Schema and Array entry in the Columns.
diff --git a/content/blog/2025-09-29-datafusion-50.0.0.md b/content/blog/2025-09-29-datafusion-50.0.0.md
index e8548b75..5582f3ce 100644
--- a/content/blog/2025-09-29-datafusion-50.0.0.md
+++ b/content/blog/2025-09-29-datafusion-50.0.0.md
@@ -48,7 +48,7 @@ DataFusion continues to focus on enhancing performance, as shown in ClickBench
and other benchmark results.
ClickBench performance results over time for DataFusion
+ width="100%" class="img-fluid" alt="ClickBench performance results over time for DataFusion" />
**Figure 1**: Average and median normalized query execution times for ClickBench queries for each git revision. Query times are normalized using the ClickBench definition. See the
diff --git a/content/blog/2025-10-21-datafusion-comet-0.11.0.md b/content/blog/2025-10-21-datafusion-comet-0.11.0.md
index dd22a085..1991ee2a 100644
@@ -109,7 +109,7 @@ Comet 0.11.0 continues to deliver significant performance improvements over Spar
TPC-H Overall Performance
@@ -118,7 +118,7 @@ The performance gains are consistent across individual queries, with most querie
TPC-H Query-by-Query Comparison
diff --git a/content/blog/2025-11-25-datafusion-51.0.0.md b/content/blog/2025-11-25-datafusion-51.0.0.md
index 58a23aa6..2cb8bcbf 100644
@@ -46,7 +46,7 @@ the core engine and in the Parquet reader.
Performance over time
@@ -91,7 +91,7 @@ where startup time or low latency is important.
You can read more about the upst
Metadata Parsing Performance Improvements in Arrow/Parquet 57
diff --git a/content/blog/2025-12-15-avoid-consecutive-repartitions.md b/content/blog/2025-12-15-avoid-consecutive-repartitions.md
index c3d3c4d4..7b1d8ad2 100644
@@ -37,7 +37,7 @@ Starting a journey learning about database internals can be daunting. With so ma
Database System Components
@@ -81,7 +81,7 @@ This will give you familiarity with the codebase and using your tools, like your
Noot Noot Database Meme
@@ -109,7 +109,7 @@ DataFusion implements a vectorized
@@ -134,7 +134,7 @@ Round-robin repartitioning is useful when the data grouping isn't known or when
Round-Robin Repartitioning
@@ -154,7 +154,7 @@ Hash repartitioning is useful when working with grouped data. Imagine you have a
Hash Repartitioning
@@ -166,7 +166,7 @@ Note, the benefit of hash opposed to round-robin partitioning in this scenario.
Hash Repartitioning Example
@@ -187,7 +187,7 @@ SELECT a, SUM(b) FROM data.parquet GROUP BY a;
Consecutive Repartition Query Plan
@@ -204,7 +204,7 @@ Why is this such a big deal? Well, repartitions do not process the data; their p
Consecutive Repartition Query Plan With Data
@@ -219,7 +219,7 @@ Optimally the plan should do one of two things:
Optimal Query Plans
@@ -294,7 +294,7 @@ This logic takes place in the main loop of this rule.
I find it helpful to draw
Incorrect Logic Tree
@@ -321,7 +321,7 @@ The new logic tree looks like this:
Correct Logic Tree
@@ -388,7 +388,7 @@ For the benchmarking standard, TPCH, speedups were small but consistent:
TPCH Benchmark Results
@@ -400,7 +400,7 @@ For the benchmarking standard, TPCH, speedups were small but consistent:
TPCH10 Benchmark Results
diff --git a/content/blog/2026-01-12-extending-sql.md b/content/blog/2026-01-12-extending-sql.md
index d4d9e7a7..374842e2 100644
@@ -76,7 +76,7 @@ DataFusion turns SQL into executable work in stages:
Each stage has extension points.
- DataFusion SQL processing pipeline: SQL String flows through Parser to AST, then SqlToRel (with Extension Planners) to LogicalPlan, then PhysicalPlanner to ExecutionPlan
+ DataFusion SQL processing pipeline: SQL String flows through Parser to AST, then SqlToRel (with Extension Planners) to LogicalPlan, then PhysicalPlanner to ExecutionPlan
Figure 1: SQL flows through three stages: parsing, logical planning (via SqlToRel, where the Extension Planners hook in), and physical planning. Each stage has extension points: wrap the parser, implement planner traits, or add physical operators.
diff --git a/content/blog/2026-02-02-datafusion_case.md b/content/blog/2026-02-02-datafusion_case.md
index 2f133bd6..f8dee65a 100644
--- a/content/blog/2026-02-02-datafusion_case.md
+++ b/content/blog/2026-02-02-datafusion_case.md
@@ -164,7 +164,7 @@ END
Schematically, it will look as follows:
-Schematic representation of data flow in the original CASE implementation
+Schematic representation of data flow in the original CASE implementation
One iteration of the `CASE` evaluation loop
@@ -192,7 +192,7 @@ pub trait PhysicalExpr {
Going back to the same example as before, the data flow in `evaluate_selection` looks like this:
-Schematic representation of `evaluate_selection` evaluation
+Schematic representation of `evaluate_selection` evaluation
evaluate_selection data flow
@@ -279,7 +279,7 @@ The second optimization fundamentally restructures how the results of each loop
The diagram below illustrates the optimized data flow when evaluating the `CASE WHEN col = 'b' THEN 100 ELSE 200 END` from before:
-Schematic representation of optimized evaluation loop
+Schematic representation of optimized evaluation loop
optimized evaluation loop
@@ -299,7 +299,7 @@ The diagram below illustrates how `merge_n` works for an example where three `WH
The first branch produced the result `A` for row 2, the second produced `B` for row 1, and the third produced `C` and `D` for rows 4 and 5.
-Schematic illustration of the merge_n algorithm
+Schematic illustration of the merge_n algorithm
merge_n example
@@ -329,7 +329,7 @@ FROM mailing_address
You can see that the `CASE` expression only references the columns `country` and `state`, but because all columns are being queried, projection pushdown cannot reduce the number of columns being fed in to the projection operator.
-Schematic illustration of CASE evaluation without projection
+Schematic illustration of CASE evaluation without projection
CASE evaluation without projection
@@ -339,7 +339,7 @@ As the diagram above shows, this filtering creates a reduced copy of all columns
This unnecessary copying can be avoided by first narrowing the batch to only include the columns that are actually needed.
-Schematic illustration of CASE evaluation with projection
+Schematic illustration of CASE evaluation with projection
CASE evaluation with projection
@@ -378,7 +378,7 @@ In contrast to `zip`, `merge` does not require both of its value inputs to have
Instead it requires that the sum of the length of the value inputs matches the length of the mask array.
-Schematic illustration of the merge algorithm
+Schematic illustration of the merge algorithm
merge example
@@ -435,7 +435,7 @@ The green series shows the time measurement for the `SELECT * FROM orders` to gi
All measurements were made with a target partition count of `1`.
-Performance measurements chart
+Performance measurements chart
Performance measurements