
Generate and Store Benchmarking Reports via ECS/Airflow Pipeline to AWS #508

@lakshmi-kovvuri1

Description


Overview
Implement an automated benchmarking workflow using Airflow on AWS ECS to generate, store, and track performance reports for the digital-land-python phase pipeline. Reports will be centrally stored in AWS for monitoring and historical analysis.

Current State

Benchmarking is run manually or locally.
No centralized storage or historical record of performance.
No automation or scheduled execution.
Current performance baseline: 2.7× improvement post‑Polars refactor.

Desired State

Automated daily/scheduled benchmarking via Airflow DAG.
ECS‑based container execution for scalability and isolation.
Reports stored in S3 with versioning + lifecycle policies.
Metrics and reports accessible via dashboards/APIs.
Ability to track historical performance trends.
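The desired workflow above could be orchestrated with a DAG along these lines. This is a minimal configuration sketch, not a decided design: the DAG id, schedule, cluster name, task-definition name, container name, and subnet are all illustrative placeholders.

```python
# Sketch of a daily benchmarking DAG driving an ECS task.
# All resource names below are assumptions for illustration.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

with DAG(
    dag_id="benchmarking_pipeline",  # assumed name
    schedule="@daily",               # assumed cadence
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    run_benchmark = EcsRunTaskOperator(
        task_id="run_benchmark",
        cluster="benchmarking-cluster",    # assumed cluster name
        task_definition="benchmark-task",  # assumed task definition
        launch_type="FARGATE",
        overrides={
            "containerOverrides": [
                # Container name and command are placeholders.
                {"name": "benchmark", "command": ["python", "-m", "benchmark"]}
            ]
        },
        network_configuration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-..."],  # placeholder subnet id
                "assignPublicIp": "ENABLED",
            }
        },
    )
```

Retries and retry delay in `default_args` give the per-task error handling called for below; a follow-up task could publish the report to S3 once the ECS run succeeds.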

Acceptance Criteria

Airflow DAG created for end‑to‑end benchmarking orchestration.
Benchmarking script containerized (Dockerfile for ECS).
IAM roles/policies configured for ECS tasks + S3 access.
S3 bucket configured with:

Versioning
Lifecycle policies
Structured folder layout: benchmarks/{date}/{report_name}
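The folder layout above can be sketched as a small key-building helper plus a lifecycle rule. The bucket name and the 365-day retention period are assumptions, not agreed values.

```python
from datetime import date


def report_key(report_name: str, run_date: date) -> str:
    """Build the structured S3 key: benchmarks/{date}/{report_name}."""
    return f"benchmarks/{run_date.isoformat()}/{report_name}"


# Example lifecycle configuration (retention period is an assumption):
LIFECYCLE_RULES = {
    "Rules": [
        {
            "ID": "expire-old-benchmarks",
            "Filter": {"Prefix": "benchmarks/"},
            "Status": "Enabled",
            "Expiration": {"Days": 365},
        }
    ]
}

# With boto3 this would be applied roughly as:
# s3.put_bucket_lifecycle_configuration(
#     Bucket="benchmark-reports",  # assumed bucket name
#     LifecycleConfiguration=LIFECYCLE_RULES,
# )

print(report_key("summary.json", date(2024, 1, 15)))
# → benchmarks/2024-01-15/summary.json
```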

Collect performance metrics including:

Phase execution times
Memory usage
Data throughput (rows/sec)
Total pipeline runtime
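Collecting the metrics above per phase could look like the following stdlib-only sketch. The function name, the callable it wraps, and the row count are illustrative; the real phases come from the digital-land-python pipeline.

```python
import time
import tracemalloc


def run_phase_with_metrics(phase_name, phase_fn, row_count):
    """Run one pipeline phase and record execution time, peak memory,
    and throughput (rows/sec). phase_fn and row_count are stand-ins."""
    tracemalloc.start()
    start = time.perf_counter()
    phase_fn()
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "phase": phase_name,
        "execution_time_s": elapsed,
        "peak_memory_bytes": peak,
        "rows_per_sec": row_count / elapsed if elapsed > 0 else None,
    }


# Usage with a stand-in phase:
metrics = run_phase_with_metrics("normalise", lambda: sum(range(100_000)), 100_000)
```

Summing `execution_time_s` across phases gives the total pipeline runtime; the dicts serialize directly to the JSON report format listed below.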

Generate reports in JSON, CSV, and HTML formats.
Airflow task dependencies, retries, and error handling in place.
CloudWatch monitoring enabled for ECS tasks.
Deployment steps fully documented.
Alerts set up for performance degradation (>10% drop).
All integration tests pass in the containerized environment.
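The >10% degradation alert could be driven by a comparison like this. It assumes the metric is throughput (rows/sec) compared against a stored baseline; the function name and default threshold are illustrative.

```python
def is_degraded(current_throughput: float, baseline_throughput: float,
                threshold: float = 0.10) -> bool:
    """Return True when throughput dropped by more than `threshold`
    (10% by default) relative to the baseline run."""
    if baseline_throughput <= 0:
        raise ValueError("baseline must be positive")
    drop = (baseline_throughput - current_throughput) / baseline_throughput
    return drop > threshold


# A 15% drop trips the alert; a 5% drop does not:
print(is_degraded(85_000, 100_000))  # → True
print(is_degraded(95_000, 100_000))  # → False
```

In the pipeline this check would run after report generation, with a positive result publishing to an alerting channel (e.g. a CloudWatch alarm or SNS topic, to be decided).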

Metadata

Labels: enhancement (New feature or request)
Status: Refinement
Milestone: no milestone
