

Claude edited this page Mar 15, 2026 · 1 revision

Data Cleaning Pipeline

The cleaning/data_cleaning_pipeline.py module provides a CleaningPipeline class for reproducible survey data cleaning with automatic logging.


Overview

The pipeline handles the most common cleaning tasks in development sector survey data:

  • Duplicate removal (by ID or full row)
  • Column name standardisation (lowercase, snake_case)
  • Missing value handling (drop, fill, median, mode)
  • Type conversion (datetime, numeric, categorical)
  • Outlier flagging (IQR method)
  • Automatic cleaning log generation

All operations are chainable and logged.
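The IQR flagging step follows the standard fence rule: values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are flagged rather than dropped. A minimal sketch of that logic, standalone and separate from the pipeline class (the function name here is illustrative, not part of the module's API):

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Mark values outside [Q1 - k*IQR, Q3 + k*IQR] as outliers."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

ages = pd.Series([25, 30, 28, 27, 31, 29, 150])  # 150 looks like an entry error
print(iqr_outlier_mask(ages).tolist())
# [False, False, False, False, False, False, True]
```

Flagging instead of dropping keeps the analyst in control: the suspicious value survives in the data with a marker column, rather than silently disappearing.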


Usage

```python
from cleaning.data_cleaning_pipeline import CleaningPipeline

pipeline = CleaningPipeline(raw_df, id_col="respondent_id")

cleaned = (
    pipeline
    .standardize_columns()
    .remove_duplicates()
    .handle_missing(strategy="median")
    .convert_types({"age": "int", "date": "datetime"})
    .flag_outliers(["age", "income"])
    .get_result()
)

# View what happened
print(pipeline.get_log())

# Save audit trail
pipeline.save_log("cleaning_log.csv")
```
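The chainable API above implies that each step mutates the frame, appends a log entry, and returns the pipeline object. A hedged sketch of that pattern (the class and method bodies below are illustrative, not the module's actual implementation):

```python
import pandas as pd
from datetime import datetime, timezone

class MiniPipeline:
    """Illustrative chainable cleaner: every step logs itself and returns self."""

    def __init__(self, df: pd.DataFrame):
        self.df = df.copy()
        self.log = []

    def _record(self, step: str, detail: str) -> None:
        self.log.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "step": step,
            "detail": detail,
            "rows": len(self.df),
        })

    def standardize_columns(self) -> "MiniPipeline":
        self.df.columns = [c.strip().lower().replace(" ", "_") for c in self.df.columns]
        self._record("standardize_columns", "Standardised column names")
        return self  # returning self is what makes the calls chainable

    def remove_duplicates(self) -> "MiniPipeline":
        before = len(self.df)
        self.df = self.df.drop_duplicates()
        self._record("remove_duplicates", f"Removed {before - len(self.df)} duplicates")
        return self

raw = pd.DataFrame({"Respondent ID": [1, 1, 2]})
result = MiniPipeline(raw).standardize_columns().remove_duplicates()
print(result.df.columns.tolist(), len(result.df))
# ['respondent_id'] 2
```

Because every step routes through one `_record` helper, no operation can run unlogged, which is what makes the audit trail trustworthy.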

Missing Value Strategies

| Strategy | Behaviour |
| --- | --- |
| "drop" | Drop columns whose missing fraction exceeds the threshold (default 50%) |
| "fill" | Fill all missing values with a specified value |
| "median" | Fill numeric columns with their median |
| "mode" | Fill all columns with their mode |
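The "median" strategy, for instance, corresponds to the usual pandas idiom of filling numeric columns only. A sketch under that assumption (the function name is illustrative):

```python
import numpy as np
import pandas as pd

def fill_median(df: pd.DataFrame) -> pd.DataFrame:
    """Fill NaNs in numeric columns with each column's median; leave others alone."""
    out = df.copy()
    num_cols = out.select_dtypes(include="number").columns
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())
    return out

df = pd.DataFrame({"age": [25, np.nan, 31], "name": ["a", None, "c"]})
print(fill_median(df)["age"].tolist())
# [25.0, 28.0, 31.0]
```

Note that non-numeric columns such as `name` keep their missing values under this strategy, which is why "mode" exists as a separate option for categorical data.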

Cleaning Log

Every step is automatically logged with timestamp, description, and row count. The log can be exported as a CSV for audit trails — useful for reproducibility requirements in evaluation reports.

| Field | Description |
| --- | --- |
| timestamp | ISO-format timestamp |
| step | Name of the cleaning operation |
| detail | What changed (e.g., "Removed 5 duplicates") |
| rows | Row count after the step |
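Since save_log writes a plain CSV, the audit trail can be reloaded and inspected with pandas in a review or evaluation workflow. A hedged sketch (the log entries shown here are invented for illustration):

```python
import io
import pandas as pd

# A saved log is plain CSV with the four documented fields; simulate one here.
csv_text = (
    "timestamp,step,detail,rows\n"
    "2026-03-15T10:02:11+00:00,remove_duplicates,Removed 5 duplicates,495\n"
    "2026-03-15T10:02:12+00:00,flag_outliers,Flagged 3 outliers in age,495\n"
)
log = pd.read_csv(io.StringIO(csv_text))
print(log["step"].tolist())
# ['remove_duplicates', 'flag_outliers']
```

Reloading the log this way lets a reviewer confirm, step by step, how the row count changed between the raw and cleaned datasets.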
