# Cleaning Pipeline

Claude edited this page Mar 15, 2026 · 1 revision
The `cleaning/data_cleaning_pipeline.py` module provides a `CleaningPipeline` class for reproducible survey data cleaning with automatic logging.
The pipeline handles the most common cleaning tasks in development sector survey data:
- Duplicate removal (by ID or full row)
- Column name standardisation (lowercase, snake_case)
- Missing value handling (drop, fill, median, mode)
- Type conversion (datetime, numeric, categorical)
- Outlier flagging (IQR method)
- Automatic cleaning log generation
All operations are chainable and logged.
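The module's internals aren't shown on this page; the following is a minimal sketch of the chainable-and-logged pattern, using pandas and hypothetical names (`MiniPipeline`, `_record`), not the actual implementation:

```python
import pandas as pd

class MiniPipeline:
    """Illustrative sketch: each step transforms the frame, logs, and returns self."""

    def __init__(self, df):
        self.df = df.copy()
        self.log = []

    def _record(self, step, detail):
        # Every step appends a log entry with the post-step row count
        self.log.append({"step": step, "detail": detail, "rows": len(self.df)})

    def remove_duplicates(self):
        before = len(self.df)
        self.df = self.df.drop_duplicates()
        self._record("remove_duplicates", f"Removed {before - len(self.df)} duplicates")
        return self  # returning self is what makes the calls chainable

    def get_result(self):
        return self.df

df = pd.DataFrame({"a": [1, 1, 2]})
pipe = MiniPipeline(df)
out = pipe.remove_duplicates().get_result()
print(len(out))  # 2
```

Returning `self` from every step is the standard fluent-interface idiom; it lets the whole cleaning sequence read as a single expression while the log accumulates as a side effect.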
```python
from cleaning.data_cleaning_pipeline import CleaningPipeline

pipeline = CleaningPipeline(raw_df, id_col="respondent_id")

cleaned = (
    pipeline
    .standardize_columns()
    .remove_duplicates()
    .handle_missing(strategy="median")
    .convert_types({"age": "int", "date": "datetime"})
    .flag_outliers(["age", "income"])
    .get_result()
)

# View what happened
print(pipeline.get_log())

# Save audit trail
pipeline.save_log("cleaning_log.csv")
```

| Strategy | Behaviour |
|---|---|
| `"drop"` | Drop columns exceeding the missing threshold (default 50%) |
| `"fill"` | Fill all missing with a specified value |
| `"median"` | Fill numeric columns with their median |
| `"mode"` | Fill all columns with their mode |
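As a rough illustration of how these strategies behave, here is a standalone pandas sketch (the function name and signature are hypothetical stand-ins, not the pipeline's actual API):

```python
import numpy as np
import pandas as pd

def handle_missing(df, strategy="median", fill_value=None, drop_threshold=0.5):
    """Hypothetical stand-in for the pipeline's missing-value strategies."""
    df = df.copy()
    if strategy == "drop":
        # Keep only columns whose missing fraction is within the threshold
        keep = df.columns[df.isna().mean() <= drop_threshold]
        return df[keep]
    if strategy == "fill":
        return df.fillna(fill_value)
    if strategy == "median":
        # Median only applies to numeric columns; others are left untouched
        num = df.select_dtypes("number").columns
        df[num] = df[num].fillna(df[num].median())
        return df
    if strategy == "mode":
        # mode() can return ties; take the first row per column
        return df.fillna(df.mode().iloc[0])
    raise ValueError(f"Unknown strategy: {strategy}")

df = pd.DataFrame({"age": [20, np.nan, 40], "name": ["a", "b", None]})
print(handle_missing(df, "median")["age"].tolist())  # [20.0, 30.0, 40.0]
```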
Every step is automatically logged with timestamp, description, and row count. The log can be exported as a CSV for audit trails — useful for reproducibility requirements in evaluation reports.
| Field | Description |
|---|---|
| `timestamp` | ISO format timestamp |
| `step` | Name of the cleaning operation |
| `detail` | What changed (e.g., "Removed 5 duplicates") |
| `rows` | Row count after the step |
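A log record in this shape is straightforward to build and export with the standard library; this sketch assumes the field names above but invents the helper names (`make_log_entry`, `save_log` here is not necessarily the pipeline's own method):

```python
import csv
from datetime import datetime, timezone

FIELDS = ["timestamp", "step", "detail", "rows"]

def make_log_entry(step, detail, rows):
    """Build one log record matching the fields described above."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # ISO format
        "step": step,
        "detail": detail,
        "rows": rows,
    }

def save_log(entries, path):
    """Write the accumulated log entries to a CSV audit trail."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(entries)

log = [make_log_entry("remove_duplicates", "Removed 5 duplicates", 495)]
save_log(log, "cleaning_log.csv")
```

A plain CSV keeps the audit trail readable without any tooling, which suits the reproducibility requirements mentioned above.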