Fabric Performance Benchmark

A benchmarking framework for comparing data processing engines on Microsoft Fabric. This project evaluates Pandas, PySpark, Polars, and DuckDB across various compute configurations to provide concrete, Fabric-specific evidence for choosing the right engine for your workloads.

Key Findings

Our benchmarking reveals that for medium-scale datasets (up to ~100GB):

  • DuckDB and Polars on Python Notebooks consistently outperform PySpark, often by 2x or more
  • Cost efficiency strongly favours Python Notebooks — the cheapest Spark configuration costs 4-5x more than equivalent DuckDB runs
  • The default Python Notebook (2 vCores) is 8x cheaper per second than the default Spark Notebook
  • There's a "sweet spot" for resource allocation — more infrastructure doesn't always mean faster execution

For detailed analysis, see our blog post: Fabric Performance Benchmarking.

Prerequisites

Before you begin, ensure you have:

  1. GitHub Account — to fork the repository and enable Git integration with Fabric
  2. Microsoft Fabric Capacity — an active Fabric capacity (F32 or above). The largest default benchmark configuration (Python Notebook with 64 vCores) consumes 32 CUs per second. See Capacity Requirements for details. You can run on smaller capacities by reducing the benchmark configurations — see Scaling Back the Benchmarks
  3. Workspace with Admin permissions — a blank Fabric workspace assigned to your capacity, where you have Admin permissions. You can create this yourself or have someone create it for you

Repository Structure

fabric-performance-benchmark/
├── fabric/
│   ├── benchmark_1_spark_versus_python/
│   │   ├── analysis_of_results/
│   │   │   ├── analysis_of_results.Notebook    # Process raw benchmark data
│   │   │   ├── benchmark_analytics.Report      # Power BI report
│   │   │   ├── refresh_semantic_model.Notebook # Sync SQL endpoint & refresh semantic model
│   │   │   ├── run_analysis.DataPipeline       # Orchestrate analysis & refresh
│   │   │   └── *.SemanticModel                 # Power BI semantic model
│   │   ├── benchmark_notebooks/
│   │   │   ├── benchmark_1_variables.VariableLibrary  # Pipeline & notebook configuration
│   │   │   ├── duckdb_benchmark.Notebook       # DuckDB implementation
│   │   │   ├── helper_methods.Notebook         # Shared helpers for Python benchmarks
│   │   │   ├── helper_methods_pyspark.Notebook # Shared helpers for Spark benchmarks
│   │   │   ├── pandas_benchmark.Notebook       # Pandas implementation
│   │   │   ├── polars_benchmark.Notebook       # Polars implementation
│   │   │   └── pyspark_benchmark.Notebook      # PySpark implementation
│   │   ├── orchestrate_python_benchmark/
│   │   │   ├── child_pipelines/                # Sub-pipelines for iterations & vCores
│   │   │   └── run_benchmarks.DataPipeline     # Orchestrate Python benchmarks
│   │   ├── orchestrate_spark_benchmark/
│   │   │   ├── child_pipelines/                # Sub-pipelines for iterations & configs
│   │   │   └── run_spark_benchmarks.DataPipeline # Orchestrate Spark benchmarks
│   │   ├── sandpit/                            # Experimental notebooks
│   │   └── set_up/
│   │       ├── configure_workspace.Notebook    # Update variable library with notebook & connection IDs
│   │       └── download_data.Notebook          # Download source data
│   └── fabric_performance_benchmark_lakehouse.Lakehouse
├── notebooks/
│   ├── configure_workspace.ipynb               # Local version of workspace setup
│   ├── fabric-benchmarking-part-1.ipynb        # Local analysis notebook
│   └── fabric-benchmarking-part-1.md           # Blog content
└── src/
    ├── fabric_admin/                           # Fabric REST API client library
    └── onelake_tools/                          # OneLake helper utilities

Getting Started

1. Fork the Repository

  1. Navigate to github.com/endjin/fabric-performance-benchmark
  2. Click Fork in the top-right corner
  3. Select your GitHub organisation or personal account as the destination
  4. Wait for the fork to complete

2. Create a Fabric Workspace

If you already have a blank workspace with Admin permissions assigned to your Fabric capacity, skip to Step 3.

  1. Sign in to Microsoft Fabric
  2. Click Workspaces in the left navigation pane
  3. Click + New workspace
  4. Enter the workspace name: fabric_performance_benchmark_workspace

    Note: You can choose a different workspace name. The configure_workspace notebook will update the workspace_name variable in the variable library.

  5. Expand Advanced and select your Fabric capacity
  6. Click Apply

3. Connect Workspace to GitHub

  1. In your new workspace, click Workspace settings (gear icon)
  2. Navigate to Git integration
  3. Click Connect
  4. Select GitHub as the Git provider
  5. Authenticate with GitHub if prompted
  6. Select your forked repository
  7. Choose the main branch
  8. Set the Git folder to fabric
  9. Click Connect and sync

Fabric will import all items from the repository into your workspace. This may take a few minutes.

See Get started with Git integration for more details about this process.

4. Create a Cloud Connection

The benchmark pipelines use a shared cloud connection to trigger notebooks. You must create this connection manually.

  1. Go to Power BI Gateway Management
  2. Click + New to create a new connection
  3. Set the connection type to Fabric Data Pipelines
  4. Name it something descriptive (e.g. "Fabric Data Pipelines - Benchmark")
  5. Choose the OAuth 2.0 authentication method
  6. Click Edit credentials and complete the authentication wizard
  7. Once created, copy the Connection ID (GUID) from the connection details — you will need this in the next step

See Data source management for more information about Cloud Connections.

5. Configure the Workspace

This step writes the cloud connection GUID and notebook IDs into the variable library so the benchmark pipelines can reference them.

  1. Navigate to the set_up folder in your workspace
  2. Open the configure_workspace notebook
  3. Paste the Connection ID (GUID) from the previous step into the CONNECTION_ID variable
  4. Run all cells
  5. Verify the output confirms:
    • Variable library updated with the connection GUID and notebook IDs
    • Semantic model repointed to the lakehouse in this workspace

Note: This notebook must be run before the benchmark pipelines. The pipelines depend on the connection GUID and notebook IDs stored in the benchmark_1_variables variable library. If you re-sync from Git, you may need to re-run this notebook as Fabric assigns new item IDs.
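
The ID-resolution step can be pictured as a simple name-to-ID mapping over the workspace's item list. This is an illustrative sketch, not the notebook's actual code: the item shape mirrors the Fabric REST API's list-items response (`GET /v1/workspaces/{workspaceId}/items`), and the sample IDs are hypothetical.

```python
def map_notebook_ids(items: list[dict]) -> dict[str, str]:
    """Map notebook display names to their Fabric item IDs.

    `items` is assumed to be the `value` array from the Fabric REST API's
    list-items endpoint; only Notebook items are kept.
    """
    return {
        item["displayName"]: item["id"]
        for item in items
        if item.get("type") == "Notebook"
    }


# Hypothetical response fragment for illustration:
items = [
    {"displayName": "duckdb_benchmark", "id": "aaaa-1111", "type": "Notebook"},
    {"displayName": "run_benchmarks", "id": "bbbb-2222", "type": "DataPipeline"},
]
notebook_ids = map_notebook_ids(items)  # {"duckdb_benchmark": "aaaa-1111"}
```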

6. Download Source Data

The benchmark uses UK Land Registry house price data (1995-present, ~30 million rows, ~5GB).

  1. Navigate to the set_up folder in your workspace
  2. Open the download_data notebook
  3. The notebook defaults to downloading 3 years of data (~100MB per year). To run the full benchmark as described in our blog post, update number_of_years in the LandRegistryImporter constructor to 30 for the complete dataset (~5GB, ~30 CSV files)
  4. Run all cells to download the CSV files from the Land Registry
  5. Verify the data appears in the lakehouse Files area under land_registry/

Note: Download times depend on the number_of_years setting and network speed. The full 30-year dataset may take 10-15 minutes.
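
The download step amounts to fetching one CSV per year. A minimal sketch of the URL construction, assuming a yearly `pp-<year>.csv` naming convention for the Price Paid files — verify against the actual source the download_data notebook uses:

```python
def yearly_price_paid_urls(number_of_years: int, latest_year: int = 2024) -> list[str]:
    # Assumed URL pattern for yearly Price Paid CSVs; the real endpoint used
    # by download_data may differ.
    base = "http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com"
    return [
        f"{base}/pp-{year}.csv"
        for year in range(latest_year, latest_year - number_of_years, -1)
    ]


urls = yearly_price_paid_urls(3)  # three most recent years, newest first
```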

7. Run the Benchmarks

The benchmarks are orchestrated via Data Factory pipelines that run each engine across multiple configurations. Each pipeline iterates through its configurations sequentially (not in parallel), running 3 iterations per configuration to collect statistically meaningful results. Each pipeline takes several hours to complete.

Important: Run the two pipelines in series (one after the other), not at the same time, to avoid exceeding your Fabric capacity limits.

Run Spark Benchmarks

  1. Navigate to the orchestrate_spark_benchmark folder
  2. Open the run_spark_benchmarks pipeline
  3. Review the default parameters:
    • configurations_to_run — array of 6 Spark configurations (see Spark Notebook Configurations)
    • iterations — number of runs per configuration (default: 3)
  4. Click Run and wait for the pipeline to complete before starting the Python benchmarks
  5. Monitor progress in the pipeline run view or the Monitoring hub

The Spark pipeline tests PySpark across various executor configurations (1-4 executors, with 4/4 and 8/8 vCore/memory combinations). The largest configuration (8/8 vCores, 4 executors) consumes 20 CUs per second.

Run Python Benchmarks

  1. Navigate to the orchestrate_python_benchmark folder
  2. Open the run_benchmarks pipeline
  3. Review the default parameters:
    • vcores_to_run — array of vCore sizes (default: [2, 4, 8, 16, 32, 64])
    • iterations — number of runs per configuration (default: 3)
  4. Click Run
  5. Monitor progress in the pipeline run view or the Monitoring hub

The Python pipeline tests Polars, DuckDB, and Pandas across each vCore configuration. The largest configuration (64 vCores) consumes 32 CUs per second.

Note: Each pipeline takes several hours to run due to sequential execution. Spark benchmarks include cluster spin-up time (~3 minutes per run). Python Notebook benchmarks are faster to provision (~30 seconds for the default 2 vCore configuration).

8. Analyse Results

After the benchmarks complete, process the raw data:

  1. Navigate to the analysis_of_results folder
  2. Open the run_analysis pipeline
  3. Click Run to execute the pipeline, which:
    • Runs the analysis_of_results notebook to aggregate benchmark timing data, calculate median execution times, compute CU costs, and generate visualisations
    • Runs the refresh_semantic_model notebook to sync the SQL analytics endpoint metadata and refresh the Direct Lake semantic model
  4. At this stage you can commit the artifact changes back to your Git fork so that a later re-sync does not overwrite them.

The pipeline writes processed data to the benchmark_repository/benchmark_analytics Delta table and ensures the Power BI report reflects the latest results.
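
The aggregation itself is straightforward: group raw timings by engine and configuration, then take medians. A stdlib-only sketch with illustrative field names (the actual benchmark schema may differ), assuming the 1 CU per 2 vCores rate described under Capacity Requirements:

```python
from collections import defaultdict
from statistics import median


def summarise(runs: list[dict]) -> dict:
    """Median duration and CU cost per (engine, vCores) pair.

    Field names (engine, vcores, duration_s) are illustrative, not the
    benchmark's actual schema. CU cost assumes 1 CU per 2 vCores.
    """
    grouped = defaultdict(list)
    for run in runs:
        grouped[(run["engine"], run["vcores"])].append(run["duration_s"])
    return {
        key: {
            "median_s": median(durations),
            "median_cu_seconds": median(durations) * key[1] / 2,
        }
        for key, durations in grouped.items()
    }


# Three hypothetical iterations of one configuration:
runs = [
    {"engine": "duckdb", "vcores": 2, "duration_s": 100.0},
    {"engine": "duckdb", "vcores": 2, "duration_s": 120.0},
    {"engine": "duckdb", "vcores": 2, "duration_s": 110.0},
]
stats = summarise(runs)  # median 110.0 s, 110.0 CU-seconds
```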

9. View the Report

  1. Navigate to the analysis_of_results folder
  2. Click the Benchmark Analytics semantic model (the name may appear as a GUID in the file explorer — look for the item with type Semantic Model)
  3. Open the benchmark_analytics report
  4. Explore the interactive visualisations:
    • Execution time comparisons
    • Cost analysis
    • Stage-level breakdowns
    • Configuration comparisons

Local Development

You can run the analysis notebook locally on your machine, which provides access to modern IDEs, faster iteration, and the ability to customise the analysis. The local notebook notebooks/fabric-benchmarking-part-1.ipynb connects directly to Fabric and generates detailed analytics with Markdown commentary — this notebook formed the basis of our blog series on this topic.

Prerequisites

  • Docker (to build and run the dev container)
  • VS Code with the Dev Containers extension

Getting Started

  1. Clone the repository:

    git clone https://github.com/endjin/fabric-performance-benchmark.git
  2. Open the folder in VS Code:

    code fabric-performance-benchmark
  3. When prompted, click Reopen in Container (or use the Command Palette: Dev Containers: Reopen in Container)

  4. Wait for the container to build — this automatically installs Python 3.12 and all dependencies via uv

Run the Local Analysis Notebook

  1. Navigate to notebooks/fabric-benchmarking-part-1.ipynb

  2. Update the workspace and lakehouse names if different from defaults:

    WORKSPACE_NAME = "fabric_performance_benchmark_workspace"
    LAKEHOUSE_NAME = "fabric_performance_benchmark_lakehouse"
  3. Run all cells to:

    • Authenticate with OneLake via browser (uses InteractiveBrowserCredential)
    • Load benchmark data from the Fabric Lakehouse
    • Generate interactive Plotly visualisations
    • View detailed Markdown commentary explaining each analysis

Note: The local notebook reads data directly from OneLake, so you must have run the benchmarks in Fabric first to have data available for analysis.
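
Reading from OneLake uses its ADLS Gen2-compatible endpoint, where lakehouse files live under `<workspace>/<lakehouse>.Lakehouse/Files/`. A sketch of the URL the notebook would target (the file name here is hypothetical):

```python
from urllib.parse import quote

ONELAKE_ENDPOINT = "https://onelake.dfs.fabric.microsoft.com"


def onelake_file_url(workspace: str, lakehouse: str, relative_path: str) -> str:
    # Build the ADLS-style URL for a file in a lakehouse's Files area.
    return (
        f"{ONELAKE_ENDPOINT}/{quote(workspace)}/"
        f"{quote(lakehouse)}.Lakehouse/Files/{relative_path}"
    )


url = onelake_file_url(
    "fabric_performance_benchmark_workspace",
    "fabric_performance_benchmark_lakehouse",
    "land_registry/pp-2024.csv",  # hypothetical file name
)
```

In the notebook, a URL like this would be opened with a storage client authenticated via InteractiveBrowserCredential from azure.identity.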

Benchmark Configurations

The benchmark pipelines test each engine across multiple compute configurations. These are defined as pipeline parameters and can be customised before each run.

Python Notebook Configurations

Configured via the vcores_to_run parameter on the run_benchmarks pipeline. Each Python Notebook runs on a single compute node. Fabric allocates memory proportionally to the vCore count.

| CUs Per Second | vCores | RAM    |
|----------------|--------|--------|
| 1              | 2      | 16 GB  |
| 2              | 4      | 32 GB  |
| 4              | 8      | 64 GB  |
| 8              | 16     | 128 GB |
| 16             | 32     | 256 GB |
| 32             | 64     | 512 GB |

Spark Notebook Configurations

Configured via the configurations_to_run parameter on the run_spark_benchmarks pipeline. Each Spark session has a dedicated driver node plus one or more executor nodes. CU consumption is calculated as: total vCores ÷ 2, where total vCores = driver_cores + (executor_cores × executor_number).

| CUs Per Second | Executors | vCores (Driver/Executor) | RAM (Driver/Executor) |
|----------------|-----------|--------------------------|-----------------------|
| 4              | 1         | 4/4                      | 28 GB / 28 GB         |
| 6              | 2         | 4/4                      | 28 GB / 28 GB         |
| 8              | 1         | 8/8                      | 56 GB / 56 GB         |
| 10             | 4         | 4/4                      | 28 GB / 28 GB         |
| 12             | 2         | 8/8                      | 56 GB / 56 GB         |
| 20             | 4         | 8/8                      | 56 GB / 56 GB         |
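
The formula can be checked directly against the table rows. A minimal sketch:

```python
def spark_cus_per_second(driver_cores: int, executor_cores: int,
                         executor_number: int) -> float:
    # CU consumption = total vCores / 2 (each CU provides 2 Spark vCores).
    total_vcores = driver_cores + executor_cores * executor_number
    return total_vcores / 2


# Smallest and largest configurations from the table above:
assert spark_cus_per_second(4, 4, 1) == 4   # 1 executor, 4/4 vCores
assert spark_cus_per_second(8, 8, 4) == 20  # 4 executors, 8/8 vCores
```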

Capacity Requirements

In Fabric, each Capacity Unit (CU) provides 2 Spark vCores. A running notebook consumes CUs for the duration of its session.

The largest default Python configuration (64 vCores) consumes 32 CUs per second, making F32 the minimum Fabric SKU required to run all benchmarks as configured. The largest Spark configuration (8/8 vCores, 4 executors = 40 total vCores) consumes 20 CUs per second, which fits within F32.

| SKU | Capacity Units | Max Spark vCores | Can run all defaults? |
|-----|----------------|------------------|-----------------------|
| F2  | 2              | 4                | No                    |
| F4  | 4              | 8                | No                    |
| F8  | 8              | 16               | No                    |
| F16 | 16             | 32               | No                    |
| F32 | 32             | 64               | Yes                   |
| F64 | 64             | 128              | Yes                   |

Note: Fabric applies a 3x burst multiplier for Spark workloads, so the sustained CU consumption matters more than peak vCores. The critical constraint is that the CU consumption of a single benchmark run must not exceed your capacity's CU count.

Scaling Back the Benchmarks

If your Fabric capacity is smaller than F32, you can reduce the benchmark configurations to fit:

  • Pipeline: run_benchmarks — edit the vcores_to_run parameter to remove larger vCore sizes. For example, [2, 4, 8, 16] requires only F8 (0.5 CUs per vCore)
  • Pipeline: run_spark_benchmarks — edit the configurations_to_run parameter to remove higher-CU configurations. For example, removing the 4-executor configurations reduces the maximum to 12 CUs (F16)

You can also reduce the iterations parameter (default: 3) to shorten overall run time, though fewer iterations may reduce the statistical reliability of the results.
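
A quick way to check which SKU a trimmed vcores_to_run list needs (Python Notebook configurations only, at 1 CU per 2 vCores):

```python
F_SKUS = [2, 4, 8, 16, 32, 64]  # F2 .. F64, in Capacity Units


def smallest_sku(vcores_to_run: list[int]) -> int:
    """Smallest F SKU whose CU count covers the largest configuration."""
    peak_cus = max(vcores_to_run) / 2
    return next(cu for cu in F_SKUS if cu >= peak_cus)


assert smallest_sku([2, 4, 8, 16]) == 8           # fits F8, as noted above
assert smallest_sku([2, 4, 8, 16, 32, 64]) == 32  # full defaults need F32
```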

Data Attribution

This project uses open data from the UK Land Registry Price Paid Data, made available under the Open Government Licence v3.0.

Related Resources

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Troubleshooting

| Problem | Solution |
|---------|----------|
| Pipelines show "item not found" errors | After syncing from Git, Fabric assigns new IDs. Notebook references are resolved automatically via the Variable Library's Item Reference variables. For child pipeline references, open each pipeline and re-select them from the dropdowns. |
| Notebooks can't find the lakehouse | Open each notebook and re-attach the default lakehouse (fabric_performance_benchmark_lakehouse) via the lakehouse explorer panel. |
| Variable library errors | Ensure the benchmark_1_variables variable library exists in the workspace and contains workspace_name, lakehouse_name, raw_data_relative_path, and the notebook Item Reference variables (polars_benchmark, duckdb_benchmark, pandas_benchmark, pyspark_benchmark). After syncing from Git, verify the Item Reference variables point to the correct notebook IDs in your workspace. |
| Spark benchmarks fail with capacity errors | The full set of default configurations requires F32 capacity. Start with smaller configurations or reduce the configurations_to_run and vcores_to_run parameters. See Scaling Back the Benchmarks. |
| Download notebook fails | The Land Registry S3 endpoint may be temporarily unavailable. Retry after a few minutes. |

License

This project is licensed under the MIT License - see the LICENSE file for details.

About endjin

endjin is a technology consultancy specialising in software engineering, data analytics, AI, and the Azure platform.
