Improve Benchmarking Report Generation Speed - Optimise Polars Pipeline Performance #507

@lakshmi-kovvuri1

Description

Overview
Optimize the digital-land-python Polars pipeline to exceed the current 2.7× performance baseline.
Create a single Airflow DAG that runs the transform pipeline for any dataset using a dataset name parameter (agreed with Owen on 16‑Feb‑2026).

Current State
Pipeline runs with legacy phase structure and materializes data between phases.
Current speed: 2.7× faster than the old approach.

Desired State
Polars pipeline fully lazy, no intermediate collects, parallelized where possible.
Performance target: >3.2× improvement.
This report is generated locally.
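
The speed-up figures can be checked with a small timing helper. The sketch below assumes speed-up is defined as old-approach wall time divided by Polars-pipeline wall time; the helper names and the sample timings in the usage note are hypothetical, not measured values.

```python
import time
from statistics import median


def time_run(fn, repeats=5):
    """Return the median wall-clock time in seconds of calling fn() repeats times."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return median(samples)


def speedup(baseline_seconds, candidate_seconds):
    """Speed-up factor of the candidate relative to the baseline."""
    if candidate_seconds <= 0:
        raise ValueError("candidate_seconds must be positive")
    return baseline_seconds / candidate_seconds


def meets_target(baseline_seconds, candidate_seconds, target=3.2):
    """True when the speed-up is strictly greater than the target factor."""
    return speedup(baseline_seconds, candidate_seconds) > target
```

As an illustration with made-up numbers: if the old approach took 100 s, a 2.7× speed-up corresponds to about 37 s, and exceeding 3.2× requires the Polars pipeline to finish in under 31.25 s.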

Acceptance Criteria
Pipeline

  • Bottlenecks identified and optimized.
  • Polars lazy optimisations applied across all refactored phases.
  • All tests pass and the performance gain is demonstrated.

Technical Considerations
Maximize Polars lazy mode, avoid premature materialization.
Explore parallel execution where safe.
Maintain backward compatibility.
Reuse existing ECS/Airflow patterns and document supported dataset names.

Metadata

Labels

enhancement: New feature or request

Projects

Status

In Review / QA 🔎
