diff --git a/content/blog/2024-01-19-datafusion-34.0.0.md b/content/blog/2024-01-19-datafusion-34.0.0.md
index 2f95ccce..9b49d940 100644
--- a/content/blog/2024-01-19-datafusion-34.0.0.md
+++ b/content/blog/2024-01-19-datafusion-34.0.0.md
@@ -113,7 +113,7 @@ more than 2x faster on [ClickBench] compared to version `25.0.0`, as shown below
[ClickBench]: https://benchmark.clickhouse.com/
-
+ Figure 1: Performance improvement between 25.0.0 and 34.0.0 on ClickBench.
Note that DataFusion 25.0.0 could not run several queries due to
@@ -122,7 +122,7 @@ more than 2x faster on [ClickBench] compared to version `25.0.0`, as shown below
-
+ Figure 2: Total query runtime for DataFusion 34.0.0 and DataFusion 25.0.0.
diff --git a/content/blog/2024-03-06-comet-donation.md b/content/blog/2024-03-06-comet-donation.md
index 660a0642..44728f23 100644
--- a/content/blog/2024-03-06-comet-donation.md
+++ b/content/blog/2024-03-06-comet-donation.md
@@ -39,7 +39,7 @@ performance improvements for some workloads as shown below.
diff --git a/content/blog/2024-07-20-datafusion-comet-0.1.0.md b/content/blog/2024-07-20-datafusion-comet-0.1.0.md
index 8d41d8f2..6167225b 100644
--- a/content/blog/2024-07-20-datafusion-comet-0.1.0.md
+++ b/content/blog/2024-07-20-datafusion-comet-0.1.0.md
@@ -88,7 +88,7 @@ for details of the environment used for these benchmarks.
@@ -105,7 +105,7 @@ The following chart shows how much Comet currently accelerates each query from t
diff --git a/content/blog/2024-08-20-python-datafusion-40.0.0.md b/content/blog/2024-08-20-python-datafusion-40.0.0.md
index dd3b4e66..be484ffd 100644
--- a/content/blog/2024-08-20-python-datafusion-40.0.0.md
+++ b/content/blog/2024-08-20-python-datafusion-40.0.0.md
@@ -72,7 +72,7 @@ release, users can fully use these tools in their workflow.
@@ -88,7 +88,7 @@ used a function's arguments as shown in Figure 2.
diff --git a/content/blog/2024-08-28-datafusion-comet-0.2.0.md b/content/blog/2024-08-28-datafusion-comet-0.2.0.md
index ff17da6a..bc46485a 100644
--- a/content/blog/2024-08-28-datafusion-comet-0.2.0.md
+++ b/content/blog/2024-08-28-datafusion-comet-0.2.0.md
@@ -86,7 +86,7 @@ Comet 0.2.0 provides a 62% speedup compared to Spark. This is slightly better th
@@ -98,7 +98,7 @@ Comet 0.1.0, which did not provide any speedup for this benchmark.
diff --git a/content/blog/2024-09-13-string-view-german-style-strings-part-1.md b/content/blog/2024-09-13-string-view-german-style-strings-part-1.md
index 6f8770b9..f9f5571b 100644
--- a/content/blog/2024-09-13-string-view-german-style-strings-part-1.md
+++ b/content/blog/2024-09-13-string-view-german-style-strings-part-1.md
@@ -47,7 +47,7 @@ StringView support was released as part of [arrow-rs v52.2.0](https://crates.io/
@@ -61,7 +61,7 @@ Figure 1: StringView improves string-intensive ClickBench query performance by 2
@@ -121,7 +121,7 @@ On the other hand, reading Parquet data as a StringViewArray can re-use the same
@@ -147,7 +147,7 @@ Strings are stored as byte sequences. When reading data from (potentially untrus
@@ -162,7 +162,7 @@ UTF-8 validation in Rust is highly optimized and favors longer strings (as shown
@@ -212,7 +212,7 @@ With StringViewArray we saw a 24% end-to-end performance improvement, as shown i
diff --git a/content/blog/2024-09-13-string-view-german-style-strings-part-2.md b/content/blog/2024-09-13-string-view-german-style-strings-part-2.md
index 7fb64f56..34114b11 100644
--- a/content/blog/2024-09-13-string-view-german-style-strings-part-2.md
+++ b/content/blog/2024-09-13-string-view-german-style-strings-part-2.md
@@ -66,7 +66,7 @@ Figure 1 illustrates the difference between the output of both string representa
@@ -121,7 +121,7 @@ To eliminate the impact of the faster Parquet reading using StringViewArray (see
diff --git a/content/blog/2024-11-18-datafusion-fastest-single-node-parquet-clickbench.md b/content/blog/2024-11-18-datafusion-fastest-single-node-parquet-clickbench.md
index a71b8a03..82762e36 100644
--- a/content/blog/2024-11-18-datafusion-fastest-single-node-parquet-clickbench.md
+++ b/content/blog/2024-11-18-datafusion-fastest-single-node-parquet-clickbench.md
@@ -45,14 +45,14 @@ been held by traditional C/C++-based engines.
@@ -97,7 +97,7 @@ Figure 2.
@@ -134,7 +134,7 @@ resulted in measurable performance improvements.
@@ -216,7 +216,7 @@ bypass the first phase when it is not working efficiently, shown in Figure 4.
@@ -253,7 +253,7 @@ length strings and binary data].
@@ -276,7 +276,7 @@ at the [one shipped in DataFusion `43.0.0`], shown in Figure 6.
diff --git a/content/blog/2025-01-17-datafusion-comet-0.5.0.md b/content/blog/2025-01-17-datafusion-comet-0.5.0.md
index bf3a9560..dc9e439e 100644
--- a/content/blog/2025-01-17-datafusion-comet-0.5.0.md
+++ b/content/blog/2025-01-17-datafusion-comet-0.5.0.md
@@ -52,14 +52,14 @@ Comet 0.5.0 achieves a 1.9x speedup for single-node TPC-H @ 100 GB, an improveme
diff --git a/content/blog/2025-02-02-datafusion-ballista-43.0.0.md b/content/blog/2025-02-02-datafusion-ballista-43.0.0.md
index 7fac0323..1a2d7cbf 100644
--- a/content/blog/2025-02-02-datafusion-ballista-43.0.0.md
+++ b/content/blog/2025-02-02-datafusion-ballista-43.0.0.md
@@ -91,7 +91,7 @@ Per query comparison:
@@ -100,7 +100,7 @@ Relative speedup:
@@ -109,7 +109,7 @@ The overall speedup is 2.9x
@@ -120,7 +120,7 @@ Ballista now has a new logo, which is visually similar to other DataFusion proje
diff --git a/content/blog/2025-02-20-datafusion-45.0.0.md b/content/blog/2025-02-20-datafusion-45.0.0.md
index 9471f5bb..d887374f 100644
--- a/content/blog/2025-02-20-datafusion-45.0.0.md
+++ b/content/blog/2025-02-20-datafusion-45.0.0.md
@@ -152,7 +152,7 @@ more improvements].
diff --git a/content/blog/2025-03-11-ordering-analysis.md b/content/blog/2025-03-11-ordering-analysis.md
index 121adf41..72a71bba 100644
--- a/content/blog/2025-03-11-ordering-analysis.md
+++ b/content/blog/2025-03-11-ordering-analysis.md
@@ -332,7 +332,7 @@ using the orderings of the query intermediates. Figure 1: DataFusion analyzes orderings of the sources and query intermediates to generate efficient plans
diff --git a/content/blog/2025-03-20-datafusion-comet-0.7.0.md b/content/blog/2025-03-20-datafusion-comet-0.7.0.md
index 45cc12ce..1e6b2286 100644
--- a/content/blog/2025-03-20-datafusion-comet-0.7.0.md
+++ b/content/blog/2025-03-20-datafusion-comet-0.7.0.md
@@ -56,7 +56,7 @@ CPU and RAM. Even with **half the resources**, Comet still provides a measurable
diff --git a/content/blog/2025-03-20-parquet-pruning.md b/content/blog/2025-03-20-parquet-pruning.md
index 5c268a97..0b78ea4a 100644
--- a/content/blog/2025-03-20-parquet-pruning.md
+++ b/content/blog/2025-03-20-parquet-pruning.md
@@ -49,7 +49,7 @@ The diagram below illustrates the [Parquet reading pipeline] in DataFusion, high
[Parquet reading pipeline]: https://docs.rs/datafusion/46.0.0/datafusion/datasource/physical_plan/parquet/source/struct.ParquetSource.html
-
+
#### Background: Parquet file structure
@@ -106,7 +106,7 @@ So far we have discussed techniques that prune the Parquet file using only the m
Filter pushdown, also known as predicate pushdown or late materialization, is a technique that prunes data during scanning, with filters being generated and applied in the Parquet reader.
-
+
Unlike metadata-based pruning which works at the row group or page level, filter pushdown operates at the row level, allowing DataFusion to filter out individual rows that don't match the query predicates during the decoding process.
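The row-level mechanics can be sketched in plain Python (lists stand in for Arrow arrays; `build_filter_mask` and `apply_mask` are illustrative names, not DataFusion APIs):

```python
def build_filter_mask(column, predicate):
    """Evaluate the pushed-down predicate on one column, producing a boolean mask."""
    return [predicate(v) for v in column]

def apply_mask(column, mask):
    """Keep only the rows the mask selects, skipping the rest during decoding."""
    return [v for v, keep in zip(column, mask) if keep]

location = ["office", "home", "office", "gym"]
val = [1, 2, 3, 4]

# Rows where location = 'office' survive; other rows are never materialized.
mask = build_filter_mask(location, lambda v: v == "office")
filtered_val = apply_mask(val, mask)  # [1, 3]
```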
diff --git a/content/blog/2025-03-21-parquet-pushdown.md b/content/blog/2025-03-21-parquet-pushdown.md
index 395d59d8..1da5f827 100644
--- a/content/blog/2025-03-21-parquet-pushdown.md
+++ b/content/blog/2025-03-21-parquet-pushdown.md
@@ -77,7 +77,7 @@ WHERE date_time > '2025-03-11' AND location = 'office';
```
-
+
Parquet pruning skips irrelevant files/row_groups, while filter pushdown skips irrelevant rows. Without filter pushdown, all rows from location, val, and date_time columns are decoded before `location='office'` is evaluated. Filter pushdown is especially useful when the filter is selective, i.e., removes many rows.
@@ -102,7 +102,7 @@ At a high level, the Parquet reader first builds a filter mask -- essentially a
Let's dig into details of [how filter pushdown is implemented](https://github.com/apache/arrow-rs/blob/d5339f31a60a4bd8a4256e7120fe32603249d88e/parquet/src/arrow/async_reader/mod.rs#L618-L712) in the current Rust Parquet reader implementation, illustrated in the following figure.
-
+
Implementation of filter pushdown in Rust Parquet readers -- the first phase builds the filter mask, the second phase applies the filter mask to the other columns
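A minimal sketch of that two-phase shape (a dict of lists stands in for a row group; `two_phase_scan` is a hypothetical name, not the arrow-rs API):

```python
def two_phase_scan(columns, filter_col, predicate):
    # Phase 1: decode only the filter column and evaluate the predicate
    # to build the boolean filter mask.
    mask = [predicate(v) for v in columns[filter_col]]
    # Phase 2: decode every output column, applying the mask row by row.
    # Note the filter column is decoded a second time here -- the overhead
    # a single-pass pipeline avoids.
    return {
        name: [v for v, keep in zip(vals, mask) if keep]
        for name, vals in columns.items()
    }

batch = {"location": ["office", "home", "office"], "val": [1, 2, 3]}
result = two_phase_scan(batch, "location", lambda v: v == "office")
# result: {'location': ['office', 'office'], 'val': [1, 3]}
```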
@@ -170,7 +170,7 @@ This section describes my [<700 LOC PR (with lots of comments and tests)](https:
-
+
New decoding pipeline: building the filter mask and the output columns is interleaved in a single pass, allowing us to cache a minimal number of pages for a minimal amount of time
@@ -213,7 +213,7 @@ Parquet by default encodes data using [dictionary encoding](https://parquet.apac
You can see this in action using [parquet-viewer](https://parquet-viewer.xiangpeng.systems):
-
+
Parquet viewer shows the page layout of a column chunk
@@ -225,7 +225,7 @@ This is why it caches 2 pages per column: one dictionary page and one data page.
The data page slot moves forward as it reads the data, but the dictionary page slot always references the first page.
-
+
Cached two pages, one for dictionary (pinned), one for data (moves as it reads the data)
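The two-slot cache can be sketched like this (illustrative values; real dictionary pages hold encoded bytes, not Python lists):

```python
# A column chunk: one dictionary page followed by data pages of indices.
dictionary_page = ["red", "green", "blue"]
data_pages = [[0, 2, 2], [1, 0]]

cache = {"dict": dictionary_page, "data": None}  # the two cached slots

decoded = []
for page in data_pages:
    cache["data"] = page  # the data slot advances page by page
    # the dict slot stays pinned on the first page for the whole chunk
    decoded.extend(cache["dict"][i] for i in page)
# decoded: ['red', 'blue', 'blue', 'green', 'red']
```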
diff --git a/content/blog/2025-03-24-datafusion-46.0.0.md b/content/blog/2025-03-24-datafusion-46.0.0.md
index 71ef758d..8cae1126 100644
--- a/content/blog/2025-03-24-datafusion-46.0.0.md
+++ b/content/blog/2025-03-24-datafusion-46.0.0.md
@@ -59,7 +59,7 @@ DataFusion 46.0.0 introduces a new [**SQL Diagnostics framework**](https://gith
For example, if you reference an unknown table or miss a column in `GROUP BY`, the error message will include the query snippet causing the error. These diagnostics are meant for end users of applications built on DataFusion, providing clearer messages instead of generic errors. Here’s an example:
-
+
Currently, diagnostics cover unresolved table/column references, missing `GROUP BY` columns, ambiguous references, wrong number of UNION columns, type mismatches, and a few others. Future releases will extend this to more error types. This feature should greatly ease debugging of complex SQL by pinpointing errors directly in the query text. We thank [@eliaperantoni](https://github.com/eliaperantoni) for his contributions to this project.
diff --git a/content/blog/2025-03-30-datafusion-python-46.0.0.md b/content/blog/2025-03-30-datafusion-python-46.0.0.md
index 357aa8a6..f854621a 100644
--- a/content/blog/2025-03-30-datafusion-python-46.0.0.md
+++ b/content/blog/2025-03-30-datafusion-python-46.0.0.md
@@ -181,7 +181,7 @@ expandable text and scroll bars.
diff --git a/content/blog/2025-04-10-fastest-tpch-generator.md b/content/blog/2025-04-10-fastest-tpch-generator.md
index 639d4342..17182214 100644
--- a/content/blog/2025-04-10-fastest-tpch-generator.md
+++ b/content/blog/2025-04-10-fastest-tpch-generator.md
@@ -48,7 +48,7 @@ which takes 30 minutes1 (0.05GB/sec). On the same machine, it takes l
It is finally convenient and efficient to run TPC-H queries locally when testing
analytical engines such as DataFusion.
-
+
**Figure 1**: Time to create TPC-H dataset for Scale Factor (see below) 1, 10,
100 and 1000 as 8 individual SNAPPY compressed parquet files using a 22 core GCP
@@ -206,7 +206,7 @@ load the data, using `dbgen`, which is not ideal for several reasons:
[here is how to do so]: https://github.com/apache/datafusion/blob/507f6b6773deac69dd9d90dbe60831f5ea5abed1/datafusion/sqllogictest/test_files/tpch/create_tables.slt.part#L24-L124
-
+
**Figure 3**: Time to generate TPC-H data in TBL format. `tpchgen` is
shown in blue. `tpchgen` restricted to a single core is shown in red. Unmodified
@@ -266,7 +266,7 @@ strings.
[unsafe]: https://github.com/search?q=repo%3Aclflushopt%2Ftpchgen-rs%20unsafe&type=code
[skip]: https://github.com/clflushopt/tpchgen-rs/blob/c651da1fc309f9cb3872cbdf71e4796904dc62c6/tpchgen/src/text.rs#L72
-
+
**Figure 4**: Lamb Theory of System Language Evolution from [Boston University
MiDAS Fall 2024 (Data Systems Seminar)] [slides(pdf)], [recording]. Special
diff --git a/content/blog/2025-06-15-optimizing-sql-dataframes-part-one.md b/content/blog/2025-06-15-optimizing-sql-dataframes-part-one.md
index a389fe93..7ac4aa8f 100644
--- a/content/blog/2025-06-15-optimizing-sql-dataframes-part-one.md
+++ b/content/blog/2025-06-15-optimizing-sql-dataframes-part-one.md
@@ -79,7 +79,7 @@ language—it describes what answers are desired rather than an *imperative*
language such as Python, where you describe how to do the computation as shown
in Figure 1.
-
+
**Figure 1**: Query Execution: Users describe the answer they want using either
SQL or a DataFrame. For SQL, a Query Planner translates the parsed query
@@ -112,7 +112,7 @@ modern APIs such as [Polars' lazy API], [Apache Spark's DataFrame], and
This section motivates the value of a Query Optimizer with an example. Let’s say
you have some observations of animal behavior, as illustrated in Table 1.
-
+
**Table 1**: Example observational data.
@@ -148,7 +148,7 @@ Figure 2.
[LogicalPlan]: https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.LogicalPlan.html
[this DataFusion overview video]: https://youtu.be/EzZTLiSJnhY
-
+
**Figure 2**: Example initial `LogicalPlan` for SQL and DataFrame query. The
plan is read from bottom to top, computing the results in each step.
@@ -157,7 +157,7 @@ The optimizer's job is to take this query plan and rewrite it into an alternate
plan that computes the same results but faster, such as the one shown in Figure
3.
-
+
**Figure 3**: An example optimized plan that computes the same result as the
plan in Figure 2 more efficiently. The diagram highlights where the optimizer
@@ -184,7 +184,7 @@ A multi-pass design is standard because it helps:
1. Understand, implement, and test each pass in isolation
2. Easily extend the optimizer by adding new passes
-
+
**Figure 4**: Query Optimizers are implemented as a series of rules that each
rewrite the query plan. Each rule’s algorithm is expressed as a transformation
diff --git a/content/blog/2025-06-15-optimizing-sql-dataframes-part-two.md b/content/blog/2025-06-15-optimizing-sql-dataframes-part-two.md
index 195d115b..3dd27b2f 100644
--- a/content/blog/2025-06-15-optimizing-sql-dataframes-part-two.md
+++ b/content/blog/2025-06-15-optimizing-sql-dataframes-part-two.md
@@ -95,7 +95,7 @@ Optimizers will evaluate the filter before the aggregation.
[evaluated after]: https://www.datacamp.com/tutorial/sql-order-of-execution
-
+
**Figure 1**: Filter Pushdown. In (**A**) without filter pushdown, the operator
processes more rows, reducing efficiency. In (**B**) with filter pushdown, the
@@ -129,7 +129,7 @@ each column in each row must be parsed even if it is not used in the plan.
[Apache Parquet]: https://parquet.apache.org/
[especially powerful in combination with filter pushdown]: https://blog.xiangpeng.systems/posts/parquet-pushdown/
-
+
**Figure 2:** In (**A**) without projection pushdown, the operator receives more
columns, reducing efficiency. In (**B**) with projection pushdown, the operator
@@ -156,7 +156,7 @@ opening additional files once the limit has been hit.
[TopK]: https://docs.rs/datafusion/latest/datafusion/physical_plan/struct.TopK.html
-
+
**Figure 3**: In (**A**), without limit pushdown all data is sorted and
everything except the first few rows are discarded. In (**B**), with limit
@@ -217,7 +217,7 @@ customer, but fills in the fields with `null`. All such rows will be filtered
out by `customer.last_name = 'Lamb'`, and thus an INNER JOIN produces the same
answer. This is illustrated in Figure 4.
-
+
**Figure 4**: Rewriting `OUTER JOIN` to `INNER JOIN`. In (A) the original query
contains an `OUTER JOIN` but also a filter on `customer.last_name`, which
@@ -326,7 +326,7 @@ ORDER BY time_chunk
```
-
+
**Figure 5:** Adding a Projection to evaluate a common complex subexpression
decreases complexity for later stages.
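The idea can be sketched with a toy projection step (dicts stand in for rows; `add_time_chunk` is a hypothetical helper, not a DataFusion function):

```python
rows = [{"time": 125}, {"time": 3601}]

def add_time_chunk(row, chunk_seconds=3600):
    """Projection step: compute the shared subexpression once per row so
    later stages (sort, group) reuse it instead of re-evaluating it."""
    out = dict(row)
    out["time_chunk"] = row["time"] // chunk_seconds
    return out

projected = [add_time_chunk(r) for r in rows]
# projected: [{'time': 125, 'time_chunk': 0}, {'time': 3601, 'time_chunk': 1}]
```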
@@ -349,7 +349,7 @@ group keys or a `MergeJoin`
[source]: https://docs.rs/datafusion/latest/datafusion/physical_plan/struct.TopK.html
-
+
**Figure 6:** An example of a specialized operation for grouping. In (**A**), input data has no specified ordering and DataFusion uses a hashing-based grouping operator ([source](https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/aggregates/row_hash.rs)) to determine distinct groups. In (**B**), when the input data is ordered by the group keys, DataFusion uses a specialized grouping operator ([source](https://github.com/apache/datafusion/tree/main/datafusion/physical-plan/src/aggregates/order)) to find boundaries that separate groups.
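The contrast between the two grouping strategies can be sketched in a few lines (illustrative Python, not the actual operators):

```python
def hash_group_sum(keys, vals):
    """Hash-based grouping: works for input in any order."""
    sums = {}
    for k, v in zip(keys, vals):
        sums[k] = sums.get(k, 0) + v
    return sums

def ordered_group_sum(keys, vals):
    """Ordered grouping: input is sorted by key, so a group ends exactly
    where the key changes -- no hash table needed."""
    out, run_key, run_sum = {}, keys[0], 0
    for k, v in zip(keys, vals):
        if k != run_key:
            out[run_key], run_key, run_sum = run_sum, k, 0
        run_sum += v
    out[run_key] = run_sum
    return out

keys, vals = ["a", "a", "b", "b", "b"], [1, 2, 3, 4, 5]
# Both produce {'a': 3, 'b': 12} on sorted input.
```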
@@ -371,7 +371,7 @@ and statistics are commonly stored in analytic file formats. For example, the
[Metadata]: https://docs.rs/parquet/latest/parquet/file/metadata/index.html
-
+
**Figure 7:** When the aggregation result is already stored in the statistics,
the query can be evaluated using the values from statistics without looking at
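For example, `COUNT(*)`, `MIN`, and `MAX` can be combined directly from per-file statistics (illustrative values; real statistics live in the Parquet footer):

```python
# Per-file statistics for column x, as an optimizer might read them.
file_stats = [
    {"row_count": 100, "min": 3, "max": 90},
    {"row_count": 50,  "min": 1, "max": 42},
]

# SELECT COUNT(*), MIN(x), MAX(x) answered from metadata alone,
# without scanning any data pages.
count = sum(s["row_count"] for s in file_stats)    # 150
minimum = min(s["min"] for s in file_stats)        # 1
maximum = max(s["max"] for s in file_stats)        # 90
```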
@@ -392,7 +392,7 @@ potentially (very) different performance. The major options in this category are
[Materialized View]: https://en.wikipedia.org/wiki/Materialized_view
-
+
**Figure 8:** Access Path and Join Order Selection in Query Optimizers. Optimizers use heuristics to enumerate some subset of potential join orders (shape) and access paths (color). The plan with the smallest estimated cost according to some cost model is chosen. In this case, Plan 2 with a cost of 180,000 is chosen for execution as it has the lowest estimated cost.
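The selection step itself is simple once candidate costs are estimated; the hard part is the enumeration and the cost model. A sketch using the figure's numbers:

```python
# Candidate plans with costs from some cost model (numbers from Figure 8).
candidate_plans = [
    {"name": "Plan 1", "cost": 250_000},
    {"name": "Plan 2", "cost": 180_000},
    {"name": "Plan 3", "cost": 900_000},
]

# The optimizer picks the plan with the smallest estimated cost.
best = min(candidate_plans, key=lambda p: p["cost"])
# best["name"] == "Plan 2"
```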
diff --git a/content/blog/2025-06-30-cancellation.md b/content/blog/2025-06-30-cancellation.md
index 244019b1..76ee19fe 100644
--- a/content/blog/2025-06-30-cancellation.md
+++ b/content/blog/2025-06-30-cancellation.md
@@ -359,7 +359,7 @@ To illustrate what this process looks like, let's have a look at the execution o
If we assume a task budget of 1 unit, each time Tokio schedules the task it runs the following sequence of function calls.
-Tokio task budget system, assuming the task budget is set to 1, for the plan above.
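A minimal sketch of the budget mechanism (synchronous Python standing in for Tokio's cooperative scheduling; names are illustrative):

```python
def run_with_budget(operators, budget=1):
    """Each poll consumes one budget unit; when the budget hits zero the
    task yields back to the scheduler, which refreshes the budget on the
    next schedule."""
    calls, remaining = [], budget
    for op in operators:
        if remaining == 0:
            calls.append("yield")   # i.e. return Poll::Pending to Tokio
            remaining = budget      # refreshed when rescheduled
        calls.append(f"poll {op}")
        remaining -= 1
    return calls

# With a budget of 1, a yield point is inserted between every two polls:
run_with_budget(["FilterExec", "DataSourceExec"], budget=1)
# ['poll FilterExec', 'yield', 'poll DataSourceExec']
```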
diff --git a/content/blog/2025-07-01-datafusion-comet-0.9.0.md b/content/blog/2025-07-01-datafusion-comet-0.9.0.md
index cd3f24b8..4c73ebd8 100644
--- a/content/blog/2025-07-01-datafusion-comet-0.9.0.md
+++ b/content/blog/2025-07-01-datafusion-comet-0.9.0.md
@@ -144,7 +144,7 @@ Comet now provides a tracing feature for analyzing performance and off-heap vers
diff --git a/content/blog/2025-07-11-datafusion-47.0.0.md b/content/blog/2025-07-11-datafusion-47.0.0.md
index 27c32216..2cf56ca1 100644
--- a/content/blog/2025-07-11-datafusion-47.0.0.md
+++ b/content/blog/2025-07-11-datafusion-47.0.0.md
@@ -198,7 +198,7 @@ use pre-integrated community crates such as the [datafusion-tracing] crate.
diff --git a/content/blog/2025-07-14-user-defined-parquet-indexes.md b/content/blog/2025-07-14-user-defined-parquet-indexes.md
index e2f1452d..7f4fd08c 100644
--- a/content/blog/2025-07-14-user-defined-parquet-indexes.md
+++ b/content/blog/2025-07-14-user-defined-parquet-indexes.md
@@ -88,7 +88,7 @@ The Parquet format includes three main types[2](#footnote2) of option
-
+
**Figure 1**: Parquet file layout with standard index structures (as written by arrow-rs).
@@ -116,7 +116,7 @@ Figure 2 shows the resulting file layout.
-
+
**Figure 2**: Parquet file layout with user-defined indexes.
diff --git a/content/blog/2025-07-28-datafusion-49.0.0.md b/content/blog/2025-07-28-datafusion-49.0.0.md
index 96322304..a2148f70 100644
--- a/content/blog/2025-07-28-datafusion-49.0.0.md
+++ b/content/blog/2025-07-28-datafusion-49.0.0.md
@@ -46,7 +46,7 @@ DataFusion continues to focus on enhancing performance, as shown in the ClickBen
@@ -61,7 +61,7 @@ NOTE: Andrew is working on gathering these numbers
diff --git a/content/blog/2025-08-15-external-parquet-indexes.md b/content/blog/2025-08-15-external-parquet-indexes.md
index 53002cce..58566a31 100644
--- a/content/blog/2025-08-15-external-parquet-indexes.md
+++ b/content/blog/2025-08-15-external-parquet-indexes.md
@@ -80,7 +80,7 @@ needs[1](#footnote1).
@@ -211,7 +211,7 @@ The standard approach is shown in Figure 2:
@@ -245,7 +245,7 @@ shown below.
@@ -262,7 +262,7 @@ stored at the end of the file (in the footer), as shown below.
@@ -289,7 +289,7 @@ The high level mechanics of Parquet predicate pushdown is shown below:
@@ -326,7 +326,7 @@ most recent 7 days.
@@ -471,7 +471,7 @@ indexes for filtering *WITHIN* Parquet files as shown below.
@@ -724,7 +724,7 @@ Come Join Us! 🎣
diff --git a/content/blog/2025-09-10-dynamic-filters.md b/content/blog/2025-09-10-dynamic-filters.md
index 84293a9e..ddf77893 100644
--- a/content/blog/2025-09-10-dynamic-filters.md
+++ b/content/blog/2025-09-10-dynamic-filters.md
@@ -70,7 +70,7 @@ SELECT * FROM hits WHERE "URL" LIKE '%google%' ORDER BY "EventTime" LIMIT 10;
@@ -105,7 +105,7 @@ A straightforward, though slow, plan to answer this query is shown in Figure 2.
@@ -132,7 +132,7 @@ DuckDB]. The plan for Q23 using this specialized operator is shown in Figure 3.
@@ -161,7 +161,7 @@ of files. The plan for Q23 with dynamic filters is shown in Figure 4.
@@ -372,7 +372,7 @@ other optimizations as shown in Figure 7.
diff --git a/content/blog/2025-09-21-custom-types-using-metadata.md b/content/blog/2025-09-21-custom-types-using-metadata.md
index b835a5be..65142b02 100644
--- a/content/blog/2025-09-21-custom-types-using-metadata.md
+++ b/content/blog/2025-09-21-custom-types-using-metadata.md
@@ -86,7 +86,7 @@ implementation, during processing of all user defined functions we pass the inpu
field information.
-
+ Figure 1: Relationship between a Record Batch, its schema, and the underlying arrays. There is a one-to-one relationship between each Field in the Schema and each Array entry in the Columns.
diff --git a/content/blog/2025-09-29-datafusion-50.0.0.md b/content/blog/2025-09-29-datafusion-50.0.0.md
index e8548b75..5582f3ce 100644
--- a/content/blog/2025-09-29-datafusion-50.0.0.md
+++ b/content/blog/2025-09-29-datafusion-50.0.0.md
@@ -48,7 +48,7 @@ DataFusion continues to focus on enhancing performance, as shown in ClickBench
and other benchmark results.
+ width="100%" class="img-fluid" alt="ClickBench performance results over time for DataFusion" />
**Figure 1**: Average and median normalized query execution times for ClickBench queries for each git revision.
Query times are normalized using the ClickBench definition. See the
diff --git a/content/blog/2025-10-21-datafusion-comet-0.11.0.md b/content/blog/2025-10-21-datafusion-comet-0.11.0.md
index dd22a085..1991ee2a 100644
--- a/content/blog/2025-10-21-datafusion-comet-0.11.0.md
+++ b/content/blog/2025-10-21-datafusion-comet-0.11.0.md
@@ -109,7 +109,7 @@ Comet 0.11.0 continues to deliver significant performance improvements over Spar
@@ -118,7 +118,7 @@ The performance gains are consistent across individual queries, with most querie
diff --git a/content/blog/2025-11-25-datafusion-51.0.0.md b/content/blog/2025-11-25-datafusion-51.0.0.md
index 58a23aa6..2cb8bcbf 100644
--- a/content/blog/2025-11-25-datafusion-51.0.0.md
+++ b/content/blog/2025-11-25-datafusion-51.0.0.md
@@ -46,7 +46,7 @@ the core engine and in the Parquet reader.
@@ -91,7 +91,7 @@ where startup time or low latency is important. You can read more about the upst
diff --git a/content/blog/2025-12-15-avoid-consecutive-repartitions.md b/content/blog/2025-12-15-avoid-consecutive-repartitions.md
index c3d3c4d4..7b1d8ad2 100644
--- a/content/blog/2025-12-15-avoid-consecutive-repartitions.md
+++ b/content/blog/2025-12-15-avoid-consecutive-repartitions.md
@@ -37,7 +37,7 @@ Starting a journey learning about database internals can be daunting. With so ma
@@ -81,7 +81,7 @@ This will give you familiarity with the codebase and using your tools, like your
@@ -109,7 +109,7 @@ DataFusion implements a vectorized
@@ -134,7 +134,7 @@ Round-robin repartitioning is useful when the data grouping isn't known or when
@@ -154,7 +154,7 @@ Hash repartitioning is useful when working with grouped data. Imagine you have a
@@ -166,7 +166,7 @@ Note, the benefit of hash opposed to round-robin partitioning in this scenario.
@@ -187,7 +187,7 @@ SELECT a, SUM(b) FROM data.parquet GROUP BY a;
@@ -204,7 +204,7 @@ Why is this such a big deal? Well, repartitions do not process the data; their p
@@ -219,7 +219,7 @@ Optimally the plan should do one of two things:
@@ -294,7 +294,7 @@ This logic takes place in the main loop of this rule. I find it helpful to draw
@@ -321,7 +321,7 @@ The new logic tree looks like this:
@@ -388,7 +388,7 @@ For the benchmarking standard, TPCH, speedups were small but consistent:
@@ -400,7 +400,7 @@ For the benchmarking standard, TPCH, speedups were small but consistent:
diff --git a/content/blog/2026-01-12-extending-sql.md b/content/blog/2026-01-12-extending-sql.md
index d4d9e7a7..374842e2 100644
--- a/content/blog/2026-01-12-extending-sql.md
+++ b/content/blog/2026-01-12-extending-sql.md
@@ -76,7 +76,7 @@ DataFusion turns SQL into executable work in stages:
Each stage has extension points.
-
+ Figure 1: SQL flows through three stages: parsing, logical planning (via SqlToRel, where the Extension Planners hook in), and physical planning. Each stage has extension points: wrap the parser, implement planner traits, or add physical operators.
diff --git a/content/blog/2026-02-02-datafusion_case.md b/content/blog/2026-02-02-datafusion_case.md
index 2f133bd6..f8dee65a 100644
--- a/content/blog/2026-02-02-datafusion_case.md
+++ b/content/blog/2026-02-02-datafusion_case.md
@@ -164,7 +164,7 @@ END
Schematically, it will look as follows:
-
+One iteration of the `CASE` evaluation loop
@@ -192,7 +192,7 @@ pub trait PhysicalExpr {
Going back to the same example as before, the data flow in `evaluate_selection` looks like this:
-
+evaluate_selection data flow
@@ -279,7 +279,7 @@ The second optimization fundamentally restructures how the results of each loop
The diagram below illustrates the optimized data flow when evaluating the `CASE WHEN col = 'b' THEN 100 ELSE 200 END` from before:
-
+optimized evaluation loop
@@ -299,7 +299,7 @@ The diagram below illustrates how `merge_n` works for an example where three `WH
The first branch produced the result `A` for row 2, the second produced `B` for row 1, and the third produced `C` and `D` for rows 4 and 5.
-
+merge_n example
@@ -329,7 +329,7 @@ FROM mailing_address
You can see that the `CASE` expression only references the columns `country` and `state`, but because all columns are being queried, projection pushdown cannot reduce the number of columns being fed into the projection operator.
-
+CASE evaluation without projection
@@ -339,7 +339,7 @@ As the diagram above shows, this filtering creates a reduced copy of all columns
This unnecessary copying can be avoided by first narrowing the batch to only include the columns that are actually needed.
-
+CASE evaluation with projection
@@ -378,7 +378,7 @@ In contrast to `zip`, `merge` does not require both of its value inputs to have
Instead, it requires that the sum of the lengths of the value inputs matches the length of the mask array.
-
+merge example
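That length contract can be sketched in a few lines (plain Python lists in place of Arrow arrays; `merge` here is an illustrative model of the kernel, not its actual signature):

```python
def merge(mask, when_true, when_false):
    """Unlike zip, the value inputs are pre-filtered: their combined length
    equals the mask length, and each side is consumed in order as the mask
    selects it."""
    assert len(when_true) + len(when_false) == len(mask)
    t, f = iter(when_true), iter(when_false)
    return [next(t) if m else next(f) for m in mask]

merge([True, False, True], ["A", "C"], ["x"])
# ['A', 'x', 'C']
```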
@@ -435,7 +435,7 @@ The green series shows the time measurement for the `SELECT * FROM orders` to gi
All measurements were made with a target partition count of `1`.
-
+Performance measurements