Utility Classes for Converting Between Dictionary Objects and Polars DataFrames #496

@mattsan-dev

Description

Overview
The data pipeline is being incrementally refactored to use Polars for better performance. Parts of the system still expect dictionary objects, which creates friction during the migration. Introducing two conversion utilities will let both formats coexist, keeping the transition controlled and low-risk.

Assumptions
Data validation is handled in earlier pipeline phases, so these utilities only need simple structural checks.

Tech Approach

  • Create a utility class that takes a Python dictionary object and returns a Polars DataFrame using standard Polars constructors.
  • Create a second utility class that converts a Polars DataFrame back to a Python dictionary, preserving types and nested structures where possible.
  • Ensure both utilities include simple validation and logging so that unexpected field structures can be identified early.
  • Provide internal documentation explaining the expected input and output shapes for each utility.
  • Relevant links for guidance:

Polars DataFrame documentation: https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/index.html
Polars conversion functions overview: https://pola-rs.github.io/polars/py-polars/html/reference/api/index.html

Acceptance Criteria / Tests

  • A dictionary object can be successfully converted into a Polars DataFrame.
  • The resulting DataFrame is validated against the expected columns and row count.
  • A Polars DataFrame can be converted back into a dictionary that matches the original structure where feasible.
  • Unit tests are created for both new classes.

Resourcing and Dependencies

  • No prerequisite tickets are required, although parallel work on pipeline refactoring may influence timelines.
  • Any engineer familiar with the data pipeline and Polars can complete this ticket.
  • No dependencies on external teams, although the Data Engineering team should be informed once the utilities are ready for adoption in the migration work.

Metadata

Labels

No labels

Type

Projects

Status

In Review / QA 🔎

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
