4 changes: 2 additions & 2 deletions content/blog/2024-01-19-datafusion-34.0.0.md
@@ -113,7 +113,7 @@ more than 2x faster on [ClickBench] compared to version `25.0.0`, as shown below
[ClickBench]: https://benchmark.clickhouse.com/

<figure style="text-align: center;">
<img src="/blog/images/datafusion-34.0.0/compare-new.png" width="100%" class="img-responsive" alt="Fig 1: Adaptive Arrow schema architecture overview.">
<img src="/blog/images/datafusion-34.0.0/compare-new.png" width="100%" class="img-fluid" alt="Fig 1: Adaptive Arrow schema architecture overview.">
<figcaption>
<b>Figure 1</b>: Performance improvement between <code>25.0.0</code> and <code>34.0.0</code> on ClickBench.
Note that DataFusion <code>25.0.0</code>, could not run several queries due to
@@ -122,7 +122,7 @@ more than 2x faster on [ClickBench] compared to version `25.0.0`, as shown below
</figure>

<figure style="text-align: center;">
<img src="/blog/images/datafusion-34.0.0/compare.png" width="100%" class="img-responsive" alt="Fig 1: Adaptive Arrow schema architecture overview.">
<img src="/blog/images/datafusion-34.0.0/compare.png" width="100%" class="img-fluid" alt="Fig 1: Adaptive Arrow schema architecture overview.">
<figcaption>
<b>Figure 2</b>: Total query runtime for DataFusion <code>34.0.0</code> and DataFusion <code>25.0.0</code>.
</figcaption>
2 changes: 1 addition & 1 deletion content/blog/2024-03-06-comet-donation.md
@@ -39,7 +39,7 @@ performance improvements for some workloads as shown below.
<img
src="/blog/images/datafusion-comet/comet-architecture.png"
width="100%"
class="img-responsive"
class="img-fluid"
alt="Fig 1: Adaptive Arrow schema architecture overview."
>
<figcaption>
4 changes: 2 additions & 2 deletions content/blog/2024-07-20-datafusion-comet-0.1.0.md
@@ -88,7 +88,7 @@ for details of the environment used for these benchmarks.
<img
src="/blog/images/comet-0.1.0/tpch_allqueries.png"
width="100%"
class="img-responsive"
class="img-fluid"
alt="Chart showing TPC-H benchmark results for Comet 0.1.0"
/>

@@ -105,7 +105,7 @@ The following chart shows how much Comet currently accelerates each query from t
<img
src="/blog/images/comet-0.1.0/tpch_queries_speedup.png"
width="100%"
class="img-responsive"
class="img-fluid"
alt="Chart showing TPC-H benchmark results for Comet 0.1.0"
/>

4 changes: 2 additions & 2 deletions content/blog/2024-08-20-python-datafusion-40.0.0.md
@@ -72,7 +72,7 @@ release, users can fully use these tools in their workflow.
<img
src="/blog/images/python-datafusion-40.0.0/vscode_hover_tooltip.png"
width="100%"
class="img-responsive"
class="img-fluid"
alt="Fig 1: Enhanced tooltips in an IDE."
>
<figcaption>
@@ -88,7 +88,7 @@ used a function's arguments as shown in Figure 2.
<img
src="/blog/images/python-datafusion-40.0.0/pylance_error_checking.png"
width="100%"
class="img-responsive"
class="img-fluid"
alt="Fig 2: Error checking in static analysis"
>
<figcaption>
4 changes: 2 additions & 2 deletions content/blog/2024-08-28-datafusion-comet-0.2.0.md
@@ -86,7 +86,7 @@ Comet 0.2.0 provides a 62% speedup compared to Spark. This is slightly better th
<img
src="/blog/images/comet-0.2.0/tpch_allqueries.png"
width="100%"
class="img-responsive"
class="img-fluid"
alt="Chart showing TPC-H benchmark results for Comet 0.2.0"
/>

@@ -98,7 +98,7 @@ Comet 0.1.0, which did not provide any speedup for this benchmark.
<img
src="/blog/images/comet-0.2.0/tpcds_allqueries.png"
width="100%"
class="img-responsive"
class="img-fluid"
alt="Chart showing TPC-DS benchmark results for Comet 0.2.0"
/>

@@ -47,7 +47,7 @@ StringView support was released as part of [arrow-rs v52.2.0](https://crates.io/
<img
src="/blog/images/string-view-1/figure1-performance.png"
width="100%"
class="img-responsive"
class="img-fluid"
alt="End to end performance improvements for ClickBench queries"
/>

@@ -61,7 +61,7 @@ Figure 1: StringView improves string-intensive ClickBench query performance by 2
<img
src="/blog/images/string-view-1/figure2-string-view.png"
width="100%"
class="img-responsive"
class="img-fluid"
alt="Diagram of using StringArray and StringViewArray to represent the same string content"
/>

@@ -121,7 +121,7 @@ On the other hand, reading Parquet data as a StringViewArray can re-use the same
<img
src="/blog/images/string-view-1/figure4-copying.png"
width="100%"
class="img-responsive"
class="img-fluid"
alt="Diagram showing how StringViewArray can avoid copying by reusing decoded Parquet pages."
/>

@@ -147,7 +147,7 @@ Strings are stored as byte sequences. When reading data from (potentially untrus
<img
src="/blog/images/string-view-1/figure5-loading-strings.png"
width="100%"
class="img-responsive"
class="img-fluid"
alt="Figure showing time to load strings from Parquet and the effect of optimized UTF-8 validation."
/>

@@ -162,7 +162,7 @@ UTF-8 validation in Rust is highly optimized and favors longer strings (as shown
<img
src="/blog/images/string-view-1/figure6-utf8-validation.png"
width="100%"
class="img-responsive"
class="img-fluid"
alt="Figure showing UTF-8 validation throughput vs string length."
/>

@@ -212,7 +212,7 @@ With StringViewArray we saw a 24% end-to-end performance improvement, as shown i
<img
src="/blog/images/string-view-1/figure7-end-to-end.png"
width="100%"
class="img-responsive"
class="img-fluid"
alt="Figure showing StringView improves end to end performance by 24 percent."
/>

@@ -66,7 +66,7 @@ Figure 1 illustrates the difference between the output of both string representa
<img
src="/blog/images/string-view-2/figure1-zero-copy-take.png"
width="100%"
class="img-responsive"
class="img-fluid"
alt="Diagram showing Zero-copy `take`/`filter` for StringViewArray"
/>

@@ -121,7 +121,7 @@ To eliminate the impact of the faster Parquet reading using StringViewArray (see
<img
src="/blog/images/string-view-2/figure2-filter-time.png"
width="100%"
class="img-responsive"
class="img-fluid"
alt="Figure showing StringViewArray reduces the filter time by 32% on ClickBench query 22."
/>

@@ -45,14 +45,14 @@ been held by traditional C/C++-based engines.
<img
src="/blog/images/2x_bgwhite_original.png"
width="80%"
class="img-responsive"
class="img-fluid"
alt="Apache DataFusion Logo"
/>

<img
src="/blog/images/clickbench-datafusion-43/perf.png"
width="100%"
class="img-responsive"
class="img-fluid"
alt="ClickBench performance for DataFusion 43.0.0"
/>

@@ -97,7 +97,7 @@ Figure 2.
<img
src="/blog/images/clickbench-datafusion-43/perf-over-time.png"
width="100%"
class="img-responsive"
class="img-fluid"
alt="ClickBench performance results over time for DataFusion"
/>

@@ -134,7 +134,7 @@ resulted in measurable performance improvements.
<img
src="/blog/images/clickbench-datafusion-43/string-view-take.png"
width="80%"
class="img-responsive"
class="img-fluid"
alt="Illustration of how take works with StringView"
/>

@@ -216,7 +216,7 @@ bypass the first phase when it is not working efficiently, shown in Figure 4.
<img
src="/blog/images/clickbench-datafusion-43/skipping-partial-aggregation.png"
width="100%"
class="img-responsive"
class="img-fluid"
alt="Two phase aggregation diagram from DataFusion API docs annotated to show first phase not helping"
/>

@@ -253,7 +253,7 @@ length strings and binary data].
<img
src="/blog/images/clickbench-datafusion-43/row-based-storage.png"
width="100%"
class="img-responsive"
class="img-fluid"
alt="Row based storage for multiple group columns"
/>

@@ -276,7 +276,7 @@ at the [one shipped in DataFusion `43.0.0`], shown in Figure 6.
<img
src="/blog/images/clickbench-datafusion-43/column-based-storage.png"
width="100%"
class="img-responsive"
class="img-fluid"
alt="Column based storage for multiple group columns"
/>

4 changes: 2 additions & 2 deletions content/blog/2025-01-17-datafusion-comet-0.5.0.md
@@ -52,14 +52,14 @@ Comet 0.5.0 achieves a 1.9x speedup for single-node TPC-H @ 100 GB, an improveme
<img
src="/blog/images/comet-0.5.0/tpch_allqueries.png"
width="100%"
class="img-responsive"
class="img-fluid"
alt="Chart showing TPC-H benchmark results for Comet 0.5.0"
/>

<img
src="/blog/images/comet-0.5.0/tpch_queries_compare.png"
width="100%"
class="img-responsive"
class="img-fluid"
alt="Chart showing TPC-H benchmark results for Comet 0.5.0"
/>

8 changes: 4 additions & 4 deletions content/blog/2025-02-02-datafusion-ballista-43.0.0.md
@@ -91,7 +91,7 @@ Per query comparison:
<img
src="/blog/images/datafusion-ballista-43.0.0/tpch_queries_compare.png"
width="100%"
class="img-responsive"
class="img-fluid"
alt="Per query comparison"
/>

@@ -100,7 +100,7 @@ Relative speedup:
<img
src="/blog/images/datafusion-ballista-43.0.0/tpch_queries_speedup_rel.png"
width="100%"
class="img-responsive"
class="img-fluid"
alt="Relative speedup graph"
/>

@@ -109,7 +109,7 @@ The overall speedup is 2.9x
<img
src="/blog/images/datafusion-ballista-43.0.0/tpch_allqueries.png"
width="50%"
class="img-responsive"
class="img-fluid"
alt="Overall speedup"
/>

@@ -120,7 +120,7 @@ Ballista now has a new logo, which is visually similar to other DataFusion proje
<img
src="/blog/images/datafusion-ballista-43.0.0/ballista-logo.png"
width="50%"
class="img-responsive"
class="img-fluid"
alt="New logo"
/>

2 changes: 1 addition & 1 deletion content/blog/2025-02-20-datafusion-45.0.0.md
@@ -152,7 +152,7 @@ more improvements].
<img
src="/blog/images/datafusion-45.0.0/performance_over_time.png"
width="100%"
class="img-responsive"
class="img-fluid"
alt="ClickBench performance results over time for DataFusion"
/>

2 changes: 1 addition & 1 deletion content/blog/2025-03-11-ordering-analysis.md
@@ -332,7 +332,7 @@ using the orderings of the query intermediates.<br>
<img
src="/blog/images/ordering_analysis/query_window_plan.png"
width="80%"
class="img-responsive"
class="img-fluid"
alt="Window Query Datafusion Optimization"
/>
<figcaption><strong>Figure 1:</strong> DataFusion analyzes orderings of the sources and query intermediates to generate efficient plans</figcaption>
2 changes: 1 addition & 1 deletion content/blog/2025-03-20-datafusion-comet-0.7.0.md
@@ -56,7 +56,7 @@ CPU and RAM. Even with **half the resources**, Comet still provides a measurable
<img
src="/blog/images/comet-0.7.0/performance.png"
width="100%"
class="img-responsive"
class="img-fluid"
alt="Chart showing TPC-H benchmark results for Comet 0.7.0"
/>

4 changes: 2 additions & 2 deletions content/blog/2025-03-20-parquet-pruning.md
@@ -49,7 +49,7 @@ The diagram below illustrates the [Parquet reading pipeline] in DataFusion, high

[Parquet reading pipeline]: https://docs.rs/datafusion/46.0.0/datafusion/datasource/physical_plan/parquet/source/struct.ParquetSource.html

<img src="/blog/images/parquet-pruning/read-parquet.jpg" alt="Parquet pruning pipeline in DataFusion" width="100%" class="img-responsive">
<img src="/blog/images/parquet-pruning/read-parquet.jpg" alt="Parquet pruning pipeline in DataFusion" width="100%" class="img-fluid">


#### Background: Parquet file structure
@@ -106,7 +106,7 @@ So far we have discussed techniques that prune the Parquet file using only the m

Filter pushdown, also known as predicate pushdown or late materialization, is a technique that prunes data during scanning, with filters being generated and applied in the Parquet reader.

<img src="/blog/images/parquet-pruning/filter-pushdown.jpg" alt="Filter pushdown in DataFusion" width="100%" class="img-responsive">
<img src="/blog/images/parquet-pruning/filter-pushdown.jpg" alt="Filter pushdown in DataFusion" width="100%" class="img-fluid">

Unlike metadata-based pruning which works at the row group or page level, filter pushdown operates at the row level, allowing DataFusion to filter out individual rows that don't match the query predicates during the decoding process.

10 changes: 5 additions & 5 deletions content/blog/2025-03-21-parquet-pushdown.md
@@ -77,7 +77,7 @@ WHERE date_time > '2025-03-11' AND location = 'office';
```

<figure>
<img src="/blog/images/parquet-pushdown/pushdown-vs-no-pushdown.jpg" alt="Parquet pruning skips irrelevant files/row_groups, while filter pushdown skips irrelevant rows. Without filter pushdown, all rows from location, val, and date_time columns are decoded before `location='office'` is evaluated. Filter pushdown is especially useful when the filter is selective, i.e., removes many rows." width="80%" class="img-responsive">
<img src="/blog/images/parquet-pushdown/pushdown-vs-no-pushdown.jpg" alt="Parquet pruning skips irrelevant files/row_groups, while filter pushdown skips irrelevant rows. Without filter pushdown, all rows from location, val, and date_time columns are decoded before `location='office'` is evaluated. Filter pushdown is especially useful when the filter is selective, i.e., removes many rows." width="80%" class="img-fluid">
<figcaption>
Parquet pruning skips irrelevant files/row_groups, while filter pushdown skips irrelevant rows. Without filter pushdown, all rows from location, val, and date_time columns are decoded before `location='office'` is evaluated. Filter pushdown is especially useful when the filter is selective, i.e., removes many rows.
</figcaption>
@@ -102,7 +102,7 @@ At a high level, the Parquet reader first builds a filter mask -- essentially a
Let's dig into details of [how filter pushdown is implemented](https://github.com/apache/arrow-rs/blob/d5339f31a60a4bd8a4256e7120fe32603249d88e/parquet/src/arrow/async_reader/mod.rs#L618-L712) in the current Rust Parquet reader implementation, illustrated in the following figure.

<figure>
<img src="/blog/images/parquet-pushdown/baseline-impl.jpg" alt="Implementation of filter pushdown in Rust Parquet readers" class="img-responsive" with="70%">
<img src="/blog/images/parquet-pushdown/baseline-impl.jpg" alt="Implementation of filter pushdown in Rust Parquet readers" class="img-fluid" with="70%">
<figcaption>
Implementation of filter pushdown in Rust Parquet readers -- the first phase builds the filter mask, the second phase applies the filter mask to the other columns
</figcaption>
@@ -170,7 +170,7 @@ This section describes my [<700 LOC PR (with lots of comments and tests)](https:


<figure>
<img src="/blog/images/parquet-pushdown/new-pipeline.jpg" alt="New decoding pipeline, building filter mask and output columns are interleaved in a single pass, allowing us to cache minimal pages for minimal amount of time" width="80%" class="img-responsive">
<img src="/blog/images/parquet-pushdown/new-pipeline.jpg" alt="New decoding pipeline, building filter mask and output columns are interleaved in a single pass, allowing us to cache minimal pages for minimal amount of time" width="80%" class="img-fluid">
<figcaption>
New decoding pipeline, building filter mask and output columns are interleaved in a single pass, allowing us to cache minimal pages for minimal amount of time
</figcaption>
@@ -213,7 +213,7 @@ Parquet by default encodes data using [dictionary encoding](https://parquet.apac
You can see this in action using [parquet-viewer](https://parquet-viewer.xiangpeng.systems):

<figure>
<img src="/blog/images/parquet-pushdown/parquet-viewer.jpg" alt="Parquet viewer shows the page layout of a column chunk" width="80%" class="img-responsive">
<img src="/blog/images/parquet-pushdown/parquet-viewer.jpg" alt="Parquet viewer shows the page layout of a column chunk" width="80%" class="img-fluid">
<figcaption>
Parquet viewer shows the page layout of a column chunk
</figcaption>
@@ -225,7 +225,7 @@ This is why it caches 2 pages per column: one dictionary page and one data page.
The data page slot will move forward as it reads the data; but the dictionary page slot always references the first page.

<figure>
<img src="/blog/images/parquet-pushdown/cached-pages.jpg" alt="Cached two pages, one for dictionary (pinned), one for data (moves as it reads the data)" width="80%" class="img-responsive">
<img src="/blog/images/parquet-pushdown/cached-pages.jpg" alt="Cached two pages, one for dictionary (pinned), one for data (moves as it reads the data)" width="80%" class="img-fluid">
<figcaption>
Cached two pages, one for dictionary (pinned), one for data (moves as it reads the data)
</figcaption>
2 changes: 1 addition & 1 deletion content/blog/2025-03-24-datafusion-46.0.0.md
@@ -59,7 +59,7 @@ DataFusion 46.0.0 introduces a new [**SQL Diagnostics framework**](https://gith

For example, if you reference an unknown table or miss a column in `GROUP BY` the error message will include the query snippet causing the error. These diagnostics are meant for end-users of applications built on DataFusion, providing clearer messages instead of generic errors. Here’s an example:

<img src="/blog/images/datafusion-46.0.0/diagnostic-example.png" alt="diagnostic-example" width="80%" class="img-responsive">
<img src="/blog/images/datafusion-46.0.0/diagnostic-example.png" alt="diagnostic-example" width="80%" class="img-fluid">

Currently, diagnostics cover unresolved table/column references, missing `GROUP BY` columns, ambiguous references, wrong number of UNION columns, type mismatches, and a few others. Future releases will extend this to more error types. This feature should greatly ease debugging of complex SQL by pinpointing errors directly in the query text. We thank [@eliaperantoni](https://github.com/eliaperantoni) for his contributions in this project.

2 changes: 1 addition & 1 deletion content/blog/2025-03-30-datafusion-python-46.0.0.md
@@ -181,7 +181,7 @@ expandable text and scroll bars.
<img
src="/blog/images/python-datafusion-46.0.0/html_rendering.png"
width="100%"
class="img-responsive"
class="img-fluid"
alt="Fig 1: Example html rendering in a jupyter notebook."
>
<figcaption>