A benchmarking framework for comparing data processing engines on Microsoft Fabric. This project evaluates Pandas, PySpark, Polars, and DuckDB across various compute configurations to provide concrete, Fabric-specific evidence for choosing the right engine for your workloads.
Our benchmarking reveals that for medium-scale datasets (up to ~100GB):
- DuckDB and Polars on Python Notebooks consistently outperform PySpark, often by 2x or more
- Cost efficiency strongly favours Python Notebooks — the cheapest Spark configuration costs 4-5x more than equivalent DuckDB runs
- The default Python Notebook (2 vCores) is 8x cheaper per second than the default Spark Notebook
- There's a "sweet spot" for resource allocation — more infrastructure doesn't always mean faster execution
For detailed analysis, see our blog post: Fabric Performance Benchmarking.
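The timings behind these findings come from repeated runs per configuration with the median taken as the headline figure. A minimal sketch of that pattern (hypothetical, stdlib only — the repository's own notebooks handle this via shared helper methods) might look like:

```python
import statistics
import time


def benchmark(workload, iterations=3):
    """Time a workload several times and report the median, mirroring
    the pipelines' default of 3 iterations per configuration."""
    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Dummy workload standing in for an engine-specific query
median_seconds = benchmark(lambda: sum(i * i for i in range(100_000)))
print(f"median: {median_seconds:.4f}s")
```

Taking the median rather than the mean makes the result less sensitive to one-off outliers such as cold starts.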
Before you begin, ensure you have:
- GitHub Account — to fork the repository and enable Git integration with Fabric
- Microsoft Fabric Capacity — an active Fabric capacity (F32 or above). The largest default benchmark configuration (Python Notebook with 64 vCores) consumes 32 CUs per second. See Capacity Requirements for details. You can run on smaller capacities by reducing the benchmark configurations — see Scaling Back the Benchmarks
- Workspace with Admin permissions — a blank Fabric workspace assigned to your capacity, where you have Admin permissions. You can create this yourself or have someone create it for you
fabric-performance-benchmark/
├── fabric/
│ ├── benchmark_1_spark_versus_python/
│ │ ├── analysis_of_results/
│ │ │ ├── analysis_of_results.Notebook # Process raw benchmark data
│ │ │ ├── benchmark_analytics.Report # Power BI report
│ │ │ ├── refresh_semantic_model.Notebook # Sync SQL endpoint & refresh semantic model
│ │ │ ├── run_analysis.DataPipeline # Orchestrate analysis & refresh
│ │ │ └── *.SemanticModel # Power BI semantic model
│ │ ├── benchmark_notebooks/
│ │ │ ├── benchmark_1_variables.VariableLibrary # Pipeline & notebook configuration
│ │ │ ├── duckdb_benchmark.Notebook # DuckDB implementation
│ │ │ ├── helper_methods.Notebook # Shared helpers for Python benchmarks
│ │ │ ├── helper_methods_pyspark.Notebook # Shared helpers for Spark benchmarks
│ │ │ ├── pandas_benchmark.Notebook # Pandas implementation
│ │ │ ├── polars_benchmark.Notebook # Polars implementation
│ │ │ └── pyspark_benchmark.Notebook # PySpark implementation
│ │ ├── orchestrate_python_benchmark/
│ │ │ ├── child_pipelines/ # Sub-pipelines for iterations & vCores
│ │ │ └── run_benchmarks.DataPipeline # Orchestrate Python benchmarks
│ │ ├── orchestrate_spark_benchmark/
│ │ │ ├── child_pipelines/ # Sub-pipelines for iterations & configs
│ │ │ └── run_spark_benchmarks.DataPipeline # Orchestrate Spark benchmarks
│ │ ├── sandpit/ # Experimental notebooks
│ │ └── set_up/
│ │ ├── configure_workspace.Notebook # Update variable library with notebook & connection IDs
│ │ └── download_data.Notebook # Download source data
│ └── fabric_performance_benchmark_lakehouse.Lakehouse
├── notebooks/
│ ├── configure_workspace.ipynb # Local version of workspace setup
│ ├── fabric-benchmarking-part-1.ipynb # Local analysis notebook
│ └── fabric-benchmarking-part-1.md # Blog content
└── src/
├── fabric_admin/ # Fabric REST API client library
└── onelake_tools/ # OneLake helper utilities
- Navigate to github.com/endjin/fabric-performance-benchmark
- Click Fork in the top-right corner
- Select your GitHub organisation or personal account as the destination
- Wait for the fork to complete
If you already have a blank workspace with Admin permissions assigned to your Fabric capacity, skip to Step 3.
- Sign in to Microsoft Fabric
- Click Workspaces in the left navigation pane
- Click + New workspace
- Enter the workspace name:
  `fabric_performance_benchmark_workspace`

  Note: You can choose a different workspace name. The `configure_workspace` notebook will update the `workspace_name` variable in the variable library.
- Expand Advanced and select your Fabric capacity
- Click Apply
- In your new workspace, click Workspace settings (gear icon)
- Navigate to Git integration
- Click Connect
- Select GitHub as the Git provider
- Authenticate with GitHub if prompted
- Select your forked repository
- Choose the main branch
- Set the Git folder to `fabric`
- Click Connect and sync
Fabric will import all items from the repository into your workspace. This may take a few minutes.
See Get started with Git integration for more details about this process.
The benchmark pipelines use a shared cloud connection to trigger notebooks. You must create this connection manually.
- Go to Power BI Gateway Management
- Click + New to create a new connection
- Set the connection type to Fabric Data Pipelines
- Name it something descriptive (e.g. "Fabric Data Pipelines - Benchmark")
- Choose the OAuth 2.0 authentication method
- Click Edit credentials and complete the authentication wizard
- Once created, copy the Connection ID (GUID) from the connection details — you will need this in the next step
See Data source management for more information about Cloud Connections.
This step writes the cloud connection GUID and notebook IDs into the variable library so the benchmark pipelines can reference them.
- Navigate to the `set_up` folder in your workspace
- Open the `configure_workspace` notebook
- Paste the Connection ID (GUID) from the previous step into the `CONNECTION_ID` variable
- Run all cells
- Verify the output confirms:
- Variable library updated with the connection GUID and notebook IDs
- Semantic model repointed to the lakehouse in this workspace
Note: This notebook must be run before the benchmark pipelines. The pipelines depend on the connection GUID and notebook IDs stored in the `benchmark_1_variables` variable library. If you re-sync from Git, you may need to re-run this notebook, as Fabric assigns new item IDs.
The benchmark uses UK Land Registry house price data (1995-present, ~30 million rows, ~5GB).
- Navigate to the `set_up` folder in your workspace
- Open the `download_data` notebook
- The notebook defaults to downloading 3 years of data (~100MB per year). To run the full benchmark as described in our blog post, update `number_of_years` in the `LandRegistryImporter` constructor to `30` for the complete dataset (~5GB, ~30 CSV files)
- Run all cells to download the CSV files from the Land Registry
- Verify the data appears in the lakehouse Files area under `land_registry/`
Note: Download times depend on the `number_of_years` setting and network speed. The full 30-year dataset may take 10-15 minutes.
The benchmarks are orchestrated via Data Factory pipelines that run each engine across multiple configurations. Each pipeline iterates through its configurations sequentially (not in parallel), running 3 iterations per configuration to collect statistically meaningful results. Each pipeline takes several hours to complete.
Important: Run the two pipelines in series (one after the other), not at the same time, to avoid exceeding your Fabric capacity limits.
- Navigate to the `orchestrate_spark_benchmark` folder
- Open the `run_spark_benchmarks` pipeline
- Review the default parameters:
  - `configurations_to_run` — array of 6 Spark configurations (see Spark Notebook Configurations)
  - `iterations` — number of runs per configuration (default: `3`)
- Click Run and wait for the pipeline to complete before starting the Python benchmarks
- Monitor progress in the pipeline run view or the Monitoring hub
The Spark pipeline tests PySpark across various executor configurations (1-4 executors, with 4/4 and 8/8 vCore/memory combinations). The largest configuration (8/8 vCores, 4 executors) consumes 20 CUs per second.
- Navigate to the `orchestrate_python_benchmark` folder
- Open the `run_benchmarks` pipeline
- Review the default parameters:
  - `vcores_to_run` — array of vCore sizes (default: `[2, 4, 8, 16, 32, 64]`)
  - `iterations` — number of runs per configuration (default: `3`)
- Click Run
- Monitor progress in the pipeline run view or the Monitoring hub
The Python pipeline tests Polars, DuckDB, and Pandas across each vCore configuration. The largest configuration (64 vCores) consumes 32 CUs per second.
Note: Each pipeline takes several hours to run due to sequential execution. Spark benchmarks include cluster spin-up time (~3 minutes per run). Python Notebook benchmarks are faster to provision (~30 seconds for the default 2 vCore configuration).
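To see why each pipeline takes hours, a back-of-envelope count of sequential notebook runs, based on the default parameters above, is instructive (hypothetical arithmetic, not code from the repository):

```python
def total_runs(configurations: int, iterations: int, engines: int = 1) -> int:
    """Number of sequential notebook runs a pipeline will schedule."""
    return configurations * iterations * engines


# Spark pipeline: 6 configurations x 3 iterations, one engine (PySpark)
spark_runs = total_runs(configurations=6, iterations=3)

# Python pipeline: 6 vCore sizes x 3 iterations x 3 engines
# (Polars, DuckDB, Pandas)
python_runs = total_runs(configurations=6, iterations=3, engines=3)

print(spark_runs, python_runs)  # 18 54
```

Add per-run provisioning overhead (~3 minutes for Spark, ~30 seconds for the smallest Python Notebook) on top of the query time itself, and the multi-hour wall clock follows.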
After the benchmarks complete, process the raw data:
- Navigate to the `analysis_of_results` folder
- Open the `run_analysis` pipeline
- Click Run to execute the pipeline, which:
- Runs the analysis_of_results notebook to aggregate benchmark timing data, calculate median execution times, compute CU costs, and generate visualisations
- Runs the refresh_semantic_model notebook to sync the SQL analytics endpoint metadata and refresh the Direct Lake semantic model
- At this stage you can commit all of the artifact changes back to your Git fork so that a future re-sync won't break anything.
The pipeline writes processed data to the `benchmark_repository/benchmark_analytics` Delta table and ensures the Power BI report reflects the latest results.
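The aggregation step can be pictured as grouping raw timing rows by engine and configuration and taking medians. Here is a sketch with a hypothetical row schema — the actual notebook's schema and column names may differ:

```python
import statistics
from collections import defaultdict

# Hypothetical raw timing rows: (engine, configuration, seconds)
rows = [
    ("duckdb", "8 vCores", 41.0),
    ("duckdb", "8 vCores", 39.5),
    ("duckdb", "8 vCores", 40.2),
    ("pyspark", "4/4 x2", 95.1),
    ("pyspark", "4/4 x2", 92.8),
    ("pyspark", "4/4 x2", 97.3),
]

# Group timings by (engine, configuration)
grouped = defaultdict(list)
for engine, config, seconds in rows:
    grouped[(engine, config)].append(seconds)

# Median per group, one summary row per benchmark configuration
medians = {key: statistics.median(values) for key, values in grouped.items()}
print(medians)
```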
- Navigate to the `analysis_of_results` folder
- Click the Benchmark Analytics semantic model (the name may appear as a GUID in the file explorer — look for the item with type Semantic Model)
- Open the `benchmark_analytics` report
- Explore the interactive visualisations:
- Execution time comparisons
- Cost analysis
- Stage-level breakdowns
- Configuration comparisons
You can run the analysis notebook locally on your machine, which provides access to modern IDEs, faster iteration, and the ability to customise the analysis. The local notebook notebooks/fabric-benchmarking-part-1.ipynb connects directly to Fabric and generates detailed analytics with Markdown commentary — this notebook formed the basis of our blog series on this topic.
- VS Code with the Dev Containers extension
- Docker Desktop (or compatible container runtime)
- Access to a Fabric workspace with benchmark data
- Clone the repository: `git clone https://github.com/endjin/fabric-performance-benchmark.git`
- Open the folder in VS Code: `code fabric-performance-benchmark`
- When prompted, click Reopen in Container (or use the Command Palette: `Dev Containers: Reopen in Container`)
- Wait for the container to build — this automatically installs Python 3.12 and all dependencies via `uv`
- Navigate to `notebooks/fabric-benchmarking-part-1.ipynb`
- Update the workspace and lakehouse names if different from defaults: `WORKSPACE_NAME = "fabric_performance_benchmark_workspace"` and `LAKEHOUSE_NAME = "fabric_performance_benchmark_lakehouse"`
- Run all cells to:
  - Authenticate with OneLake via browser (uses `InteractiveBrowserCredential`)
  - Load benchmark data from the Fabric Lakehouse
  - Generate interactive Plotly visualisations
  - View detailed Markdown commentary explaining each analysis
Note: The local notebook reads data directly from OneLake, so you must have run the benchmarks in Fabric first to have data available for analysis.
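Reading directly from OneLake means addressing the lakehouse by its abfss URI. A sketch of how such a URI can be assembled is below — the host and `.Lakehouse`/`Tables` segments follow the standard OneLake addressing scheme, but treat the exact table path as an assumption and check it against your workspace:

```python
def onelake_table_uri(workspace: str, lakehouse: str, table: str) -> str:
    """Build an abfss URI for a Delta table in a Fabric lakehouse,
    following the standard OneLake addressing scheme."""
    return (
        f"abfss://{workspace}@onelake.dfs.fabric.microsoft.com/"
        f"{lakehouse}.Lakehouse/Tables/{table}"
    )


uri = onelake_table_uri(
    "fabric_performance_benchmark_workspace",
    "fabric_performance_benchmark_lakehouse",
    "benchmark_analytics",
)
print(uri)
```

A URI like this, paired with an `InteractiveBrowserCredential` token, is what lets the local notebook load the benchmark Delta table without any Fabric compute.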
The benchmark pipelines test each engine across multiple compute configurations. These are defined as pipeline parameters and can be customised before each run.
Configured via the vcores_to_run parameter on the run_benchmarks pipeline. Each Python Notebook runs on a single compute node. Fabric allocates memory proportionally to the vCore count.
| CUs Per Second | vCores | RAM |
|---|---|---|
| 1 | 2 | 16 GB |
| 2 | 4 | 32 GB |
| 4 | 8 | 64 GB |
| 8 | 16 | 128 GB |
| 16 | 32 | 256 GB |
| 32 | 64 | 512 GB |
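The table follows two simple linear relationships — 1 CU per 2 vCores, and 8 GB of RAM per vCore — which can be expressed as a quick check (illustrative helper, not part of the repository):

```python
def python_notebook_resources(vcores: int) -> tuple[float, int]:
    """CU/s and RAM (GB) for a Fabric Python Notebook, per the table
    above: CUs = vCores / 2, RAM = 8 GB per vCore."""
    return vcores / 2, vcores * 8


for v in (2, 4, 8, 16, 32, 64):
    cus, ram_gb = python_notebook_resources(v)
    print(f"{v} vCores -> {cus:g} CU/s, {ram_gb} GB RAM")
```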
Configured via the configurations_to_run parameter on the run_spark_benchmarks pipeline. Each Spark session has a dedicated driver node plus one or more executor nodes. CU consumption is calculated as: total vCores ÷ 2, where total vCores = driver_cores + (executor_cores × executor_number).
| CUs Per Second | Executors | vCores (Driver/Executor) | RAM (Driver/Executor) |
|---|---|---|---|
| 4 | 1 | 4/4 | 28G/28G |
| 6 | 2 | 4/4 | 28G/28G |
| 8 | 1 | 8/8 | 56G/56G |
| 10 | 4 | 4/4 | 28G/28G |
| 12 | 2 | 8/8 | 56G/56G |
| 20 | 4 | 8/8 | 56G/56G |
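The CU figures in the table follow directly from the formula above, which can be verified mechanically (illustrative helper, not part of the repository):

```python
def spark_cus(driver_cores: int, executor_cores: int, executors: int) -> float:
    """CU/s for a Spark session: total vCores / 2, where total vCores
    = driver_cores + (executor_cores * executors), per the formula above."""
    return (driver_cores + executor_cores * executors) / 2


# Reproduce the table rows (driver and executor sizes match in each config)
assert spark_cus(4, 4, 1) == 4
assert spark_cus(4, 4, 2) == 6
assert spark_cus(8, 8, 1) == 8
assert spark_cus(4, 4, 4) == 10
assert spark_cus(8, 8, 2) == 12
assert spark_cus(8, 8, 4) == 20
```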
In Fabric, each Capacity Unit (CU) provides 2 Spark vCores. A running notebook consumes CUs for the duration of its session.
The largest default Python configuration (64 vCores) consumes 32 CUs per second, making F32 the minimum Fabric SKU required to run all benchmarks as configured. The largest Spark configuration (8/8 vCores, 4 executors = 40 total vCores) consumes 20 CUs per second, which fits within F32.
| SKU | Capacity Units | Max Spark vCores | Can run all defaults? |
|---|---|---|---|
| F2 | 2 | 4 | No |
| F4 | 4 | 8 | No |
| F8 | 8 | 16 | No |
| F16 | 16 | 32 | No |
| F32 | 32 | 64 | Yes |
| F64 | 64 | 128 | Yes |
Note: Fabric applies a 3x burst multiplier for Spark workloads, so the sustained CU consumption matters more than peak vCores. The critical constraint is that the CU consumption of a single benchmark run must not exceed your capacity's CU count.
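The capacity constraint in the note — a single run's CU/s consumption must not exceed the capacity's CU count — can be checked with a one-liner (hypothetical helper; the CU figures come from the tables above):

```python
def can_run(sku_cus: int, run_cus: float) -> bool:
    """A benchmark run fits when its CU/s consumption does not
    exceed the capacity's CU count."""
    return run_cus <= sku_cus


largest_python_run = 32  # 64 vCore Python Notebook
largest_spark_run = 20   # 8/8 vCores, 4 executors

for sku, cus in [("F16", 16), ("F32", 32), ("F64", 64)]:
    ok = can_run(cus, largest_python_run) and can_run(cus, largest_spark_run)
    print(f"{sku}: can run all defaults = {ok}")
```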
If your Fabric capacity is smaller than F32, you can reduce the benchmark configurations to fit:
- Pipeline: `run_benchmarks` — edit the `vcores_to_run` parameter to remove larger vCore sizes. For example, `[2, 4, 8, 16]` requires only F8 (0.5 CUs per vCore)
- Pipeline: `run_spark_benchmarks` — edit the `configurations_to_run` parameter to remove higher-CU configurations. For example, removing the 4-executor configurations reduces the maximum to 12 CUs (F16)
You can also reduce the `iterations` parameter (default: `3`) to shorten overall run time, though fewer iterations may reduce the statistical reliability of the results.
This project uses open data from the UK Land Registry Price Paid Data, made available under the Open Government Licence v3.0.
- DuckDB: The Rise of In-Process Analytics
- Why Polars Matters for Decision Makers
- DuckDB Workloads on Microsoft Fabric
- Polars Workloads on Microsoft Fabric
- Microsoft Fabric Python Notebooks Documentation
Contributions are welcome! Please feel free to submit a Pull Request.
| Problem | Solution |
|---|---|
| Pipelines show "item not found" errors | After syncing from Git, Fabric assigns new IDs. Notebook references are resolved automatically via the Variable Library's Item Reference variables. For child pipeline references, open each pipeline and re-select them from the dropdowns. |
| Notebooks can't find the lakehouse | Open each notebook and re-attach the default lakehouse (fabric_performance_benchmark_lakehouse) via the lakehouse explorer panel. |
| Variable library errors | Ensure the benchmark_1_variables variable library exists in the workspace and contains workspace_name, lakehouse_name, raw_data_relative_path, and the notebook Item Reference variables (polars_benchmark, duckdb_benchmark, pandas_benchmark, pyspark_benchmark). After syncing from Git, verify the Item Reference variables point to the correct notebook IDs in your workspace. |
| Spark benchmarks fail with capacity errors | The full set of default configurations requires F32 capacity. Start with smaller configurations or reduce the configurations_to_run and vcores_to_run parameters. See Scaling Back the Benchmarks. |
| Download notebook fails | The Land Registry S3 endpoint may be temporarily unavailable. Retry after a few minutes. |
This project is licensed under the MIT License - see the LICENSE file for details.
endjin is a technology consultancy specialising in software engineering, data analytics, AI, and the Azure platform.