

Claude edited this page Mar 15, 2026 · 1 revision

Data Cleaning Pipeline

The cleaning/data_cleaning_pipeline.py module provides a CleaningPipeline class for reproducible survey data cleaning with automatic logging.


Overview

The pipeline handles the most common cleaning tasks in development sector survey data:

  • Duplicate removal (by ID or full row)
  • Column name standardisation (lowercase, snake_case)
  • Missing value handling (drop, fill, median, mode)
  • Type conversion (datetime, numeric, categorical)
  • Outlier flagging (IQR method)
  • Automatic cleaning log generation

All operations are chainable and logged.
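The IQR flagging step follows the standard fence rule: values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are flagged rather than dropped. A minimal sketch of that logic, standalone and separate from the pipeline class (the function name here is illustrative, not part of the module's API):

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Mark values outside [Q1 - k*IQR, Q3 + k*IQR] as outliers."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

ages = pd.Series([25, 30, 28, 27, 31, 29, 150])  # 150 looks like an entry error
print(iqr_outlier_mask(ages).tolist())
# [False, False, False, False, False, False, True]
```

Flagging instead of dropping keeps the analyst in control: the suspicious value survives in the data with a marker column, rather than silently disappearing.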


Usage

```python
from cleaning.data_cleaning_pipeline import CleaningPipeline

pipeline = CleaningPipeline(raw_df, id_col="respondent_id")

cleaned = (
    pipeline
    .standardize_columns()
    .remove_duplicates()
    .handle_missing(strategy="median")
    .convert_types({"age": "int", "date": "datetime"})
    .flag_outliers(["age", "income"])
    .get_result()
)

# View what happened
print(pipeline.get_log())

# Save audit trail
pipeline.save_log("cleaning_log.csv")
```
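The chainable API above implies that each step mutates the frame, appends a log entry, and returns the pipeline object. A hedged sketch of that pattern (the class and method bodies below are illustrative, not the module's actual implementation):

```python
import pandas as pd
from datetime import datetime, timezone

class MiniPipeline:
    """Illustrative chainable cleaner: every step logs itself and returns self."""

    def __init__(self, df: pd.DataFrame):
        self.df = df.copy()
        self.log = []

    def _record(self, step: str, detail: str) -> None:
        self.log.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "step": step,
            "detail": detail,
            "rows": len(self.df),
        })

    def standardize_columns(self) -> "MiniPipeline":
        self.df.columns = [c.strip().lower().replace(" ", "_") for c in self.df.columns]
        self._record("standardize_columns", "Standardised column names")
        return self  # returning self is what makes the calls chainable

    def remove_duplicates(self) -> "MiniPipeline":
        before = len(self.df)
        self.df = self.df.drop_duplicates()
        self._record("remove_duplicates", f"Removed {before - len(self.df)} duplicates")
        return self

raw = pd.DataFrame({"Respondent ID": [1, 1, 2]})
result = MiniPipeline(raw).standardize_columns().remove_duplicates()
print(result.df.columns.tolist(), len(result.df))
# ['respondent_id'] 2
```

Because every step routes through one `_record` helper, no operation can run unlogged, which is what makes the audit trail trustworthy.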

Missing Value Strategies

| Strategy | Behaviour |
| --- | --- |
| "drop" | Drop columns whose missing fraction exceeds the threshold (default 50%) |
| "fill" | Fill all missing values with a specified value |
| "median" | Fill numeric columns with their median |
| "mode" | Fill all columns with their mode |
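The "median" strategy, for instance, corresponds to the usual pandas idiom of filling numeric columns only. A sketch under that assumption (the function name is illustrative):

```python
import numpy as np
import pandas as pd

def fill_median(df: pd.DataFrame) -> pd.DataFrame:
    """Fill NaNs in numeric columns with each column's median; leave others alone."""
    out = df.copy()
    num_cols = out.select_dtypes(include="number").columns
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())
    return out

df = pd.DataFrame({"age": [25, np.nan, 31], "name": ["a", None, "c"]})
print(fill_median(df)["age"].tolist())
# [25.0, 28.0, 31.0]
```

Note that non-numeric columns such as `name` keep their missing values under this strategy, which is why "mode" exists as a separate option for categorical data.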

Cleaning Log

Every step is automatically logged with timestamp, description, and row count. The log can be exported as a CSV for audit trails — useful for reproducibility requirements in evaluation reports.

| Field | Description |
| --- | --- |
| timestamp | ISO-format timestamp |
| step | Name of the cleaning operation |
| detail | What changed (e.g., "Removed 5 duplicates") |
| rows | Row count after the step |
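Since save_log writes a plain CSV, the audit trail can be reloaded and inspected with pandas in a review or evaluation workflow. A hedged sketch (the log entries shown here are invented for illustration):

```python
import io
import pandas as pd

# A saved log is plain CSV with the four documented fields; simulate one here.
csv_text = (
    "timestamp,step,detail,rows\n"
    "2026-03-15T10:02:11+00:00,remove_duplicates,Removed 5 duplicates,495\n"
    "2026-03-15T10:02:12+00:00,flag_outliers,Flagged 3 outliers in age,495\n"
)
log = pd.read_csv(io.StringIO(csv_text))
print(log["step"].tolist())
# ['remove_duplicates', 'flag_outliers']
```

Reloading the log this way lets a reviewer confirm, step by step, how the row count changed between the raw and cleaned datasets.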
