DataExcept is a production-ready Python library that provides structured, hierarchical exception classes specifically designed for data science, machine learning, and data engineering workflows. Stop debugging generic ValueErrors and RuntimeErrors -- get meaningful, actionable error messages that help you understand exactly what went wrong in your data pipeline.
| ❌ Without DataExcept | ✅ With DataExcept |
|---|---|
| `ValueError: Invalid value` | `DataValidationError: Invalid value for 'age': -1` |
| `RuntimeError: Training failed` | `ConvergenceError: Model 'RandomForest' failed to converge after 100 iterations` |
| `Exception: Prediction error` | `ModelInferenceError: Inference failed for model 'CNN': CUDA out of memory` |
| `KeyError: column not found` | `MissingColumnError: Missing required column 'customer_id' in DataFrame 'sales_data'` |

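The richer messages come from raising exceptions with explicit context. A minimal sketch, assuming only the keyword form of `DataValidationError` (`field`, `value`, `message`) that appears later in this README:

```python
from dataexcept.datascience_exceptions import DataValidationError


def check_age(age: int) -> None:
    # Passing field, value, and message yields the structured error
    # shown in the table above, e.g. "Invalid value for 'age': -1".
    if age < 0:
        raise DataValidationError(field="age", value=age, message="Age cannot be negative")
```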
Key features:

- Hierarchical Structure: Catch specific errors or broad categories
- Data Science Focused: 40+ exceptions covering ML pipelines, feature engineering, and model training
- Production Ready: Comprehensive logging helpers and error context
- Academic Quality: Proper documentation, type hints, and citation support
- Python 3.10+: Modern Python with full type safety
- Well Tested: 84% test coverage with comprehensive edge case handling
Installation:

```bash
pip install DataExcept
```

For development:

```bash
git clone https://github.com/DiogoRibeiro7/DataExcept.git
cd DataExcept
poetry install
```

Quick start:

```python
from dataexcept import ValidationError, ModelTrainingError
from dataexcept.datascience_exceptions import DataLoadingError
import pandas as pd


# Data validation with context
def validate_dataframe(df: pd.DataFrame) -> None:
    if 'customer_id' not in df.columns:
        raise ValidationError(
            field='customer_id',
            value=list(df.columns),
            message="Customer ID column is required for processing"
        )


# Model training with specific error types
def train_model(model_type: str, epochs: int) -> None:
    try:
        # Your training code here
        if epochs > 1000:
            raise ModelTrainingError(
                model_type=model_type,
                epoch=epochs,
                message=f"Training {model_type} exceeded reasonable epoch limit"
            )
    except ModelTrainingError:
        raise
    except Exception as e:
        # Wrap unknown errors with context
        raise ModelTrainingError(model_type, message=f"Unexpected error: {e}") from e


# File operations with detailed context
def load_dataset(file_path: str) -> pd.DataFrame:
    try:
        return pd.read_csv(file_path)
    except FileNotFoundError as e:
        raise DataLoadingError(source=file_path, original=e) from e
```

Catch specific errors or broad categories, thanks to the exception hierarchy:

```python
from dataexcept import JobError
from dataexcept.datascience_exceptions import ModelTrainingError, ConvergenceError

try:
    # Your ML pipeline
    train_complex_model()
except ConvergenceError:
    # Handle specific convergence issues
    logger.warning("Model didn't converge, trying with different parameters")
    train_with_fallback_params()
except ModelTrainingError:
    # Handle any training-related error
    logger.error("Training failed, falling back to simpler model")
    train_simple_model()
except JobError:
    # Handle any job-related error
    logger.error("Job failed, notifying administrators")
    send_alert()
```

Data science exceptions:

```python
from dataexcept.datascience_exceptions import *

# Data ingestion and validation
DataLoadingError("data.csv", FileNotFoundError())
DataValidationError("age", -5, "Age cannot be negative")
MissingDataError("income", "Required for credit scoring")
# Feature engineering and preprocessing
FeatureEngineeringError("log_transform", "Cannot take log of negative values")
DataNormalizationError("StandardScaler", "Division by zero in variance calculation")
DataImbalanceError(ratio=0.05, threshold=0.1)
# Model training and evaluation
ModelTrainingError("RandomForest", epoch=45)
ConvergenceError("GradientBoosting", iterations=1000)
OverfittingError(train_metric=0.98, val_metric=0.65)
BiasDetectionError("gender", bias_score=0.15, threshold=0.1)
# Model deployment and inference
ModelInferenceError("CNN", RuntimeError("CUDA out of memory"))
ModelCompatibilityError("2.1.0", "1.8.0")
```

Data engineering exceptions:

```python
from dataexcept.dataengineering_exceptions import *

ETLJobError("daily_customer_pipeline")
SchemaEvolutionError("v2.1", reason="Incompatible column type change")
DataTransformationError("currency_conversion", "Invalid exchange rate")
BatchProcessingError("batch_2023_11_13", original=TimeoutError())
```

Pandas-specific exceptions:

```python
from dataexcept.pandas_exceptions import *

MissingColumnError("customer_id", dataframe="sales_df")
DtypeMismatchError("revenue", expected=["float64", "int64"], found="object")
MergeKeyError(["customer_id"], ["cust_id"])
```

Network and database exceptions:

```python
from dataexcept.network_exceptions import *
from dataexcept.database_exceptions import *
HostUnreachableError("api.example.com")
DatabaseConnectionError("postgresql://prod-db:5432/analytics")
QueryExecutionError("SELECT * FROM large_table", original=TimeoutError())
```

Logging helpers for production pipelines:

```python
from dataexcept.logging_helpers import log_and_raise, log_exception
import logging
logger = logging.getLogger(__name__)

# Context manager for automatic logging
with log_and_raise(logger=logger, context={"job_id": "ETL_001", "batch": "2023-11-13"}):
    process_daily_batch()

# Manual exception logging with context
try:
    risky_operation()
except Exception as exc:
    log_exception(
        exc,
        logger=logger,
        context={"user_id": "12345", "operation": "feature_extraction"}
    )
    raise
```

The package also ships a small CLI:

```bash
# List all available exception classes
$ dataexcept list
JobError
ValidationError
DataScienceError
ModelTrainingError
... (40+ more)
# Check version
$ dataexcept --version
dataexcept 0.1.0
```

Typical use cases:

- Model Training: Distinguish between convergence issues, data problems, and infrastructure failures
- Feature Engineering: Track which transformation steps fail and why
- Model Serving: Provide actionable error messages for inference failures
- Data Drift: Alert when model assumptions are violated
- ETL Pipelines: Clear error categorization for debugging complex data flows
- Data Quality: Structured validation errors with field-level context
- Schema Evolution: Track migration failures and compatibility issues (see the sketch after this list)
- Batch Processing: Identify whether failures are data-related or system-related
- Reproducible Experiments: Consistent error handling across research codebases
- Citation Support: Proper academic attribution with CITATION.cff
- Documentation: Auto-generated API docs with comprehensive examples
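
A minimal sketch of the schema-evolution case referenced above, assuming only the `SchemaEvolutionError(version, reason=...)` form shown in the catalogue earlier; the expected-column set and version label are hypothetical:

```python
import pandas as pd

from dataexcept.dataengineering_exceptions import SchemaEvolutionError

# Hypothetical column contract for this sketch; not part of DataExcept.
EXPECTED_COLUMNS = {"customer_id", "revenue", "churned"}


def check_schema(df: pd.DataFrame, schema_version: str) -> None:
    # Surface incompatible upstream schema changes as one structured error
    # instead of letting a later step fail with a bare KeyError.
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise SchemaEvolutionError(
            schema_version,
            reason=f"Columns removed or renamed upstream: {sorted(missing)}"
        )
```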
"""
Complete ML pipeline with DataExcept error handling
"""
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from dataexcept import ValidationError
from dataexcept.datascience_exceptions import *
from dataexcept.pandas_exceptions import *
from dataexcept.logging_helpers import log_and_raise
import logging
def ml_pipeline(data_path: str, target_col: str):
    logger = logging.getLogger(__name__)

    with log_and_raise(logger=logger, context={"pipeline": "customer_churn"}):
        # 1. Data Loading
        try:
            df = pd.read_csv(data_path)
        except FileNotFoundError as e:
            raise DataLoadingError(source=data_path, original=e) from e

        # 2. Data Validation
        if target_col not in df.columns:
            raise MissingColumnError(target_col, dataframe="training_data")

        if df[target_col].dtype not in ['int64', 'bool']:
            raise DtypeMismatchError(
                target_col,
                expected=['int64', 'bool'],
                found=str(df[target_col].dtype)
            )

        # 3. Data Quality Checks
        missing_ratio = df.isnull().sum().sum() / (df.shape[0] * df.shape[1])
        if missing_ratio > 0.3:
            raise DataValidationError(
                field="missing_data_ratio",
                value=missing_ratio,
                message=f"Dataset has {missing_ratio:.1%} missing values, exceeds 30% threshold"
            )

        # 4. Class Imbalance Check
        class_ratio = df[target_col].value_counts().min() / df[target_col].value_counts().max()
        if class_ratio < 0.1:
            raise DataImbalanceError(ratio=class_ratio, threshold=0.1)

        # 5. Feature Engineering
        try:
            df['log_revenue'] = np.log(df['revenue'] + 1)
        except Exception as e:
            raise FeatureEngineeringError("log_transform", cause=str(e)) from e

        # 6. Model Training
        try:
            model = RandomForestClassifier(n_estimators=100)
            X = df.drop(columns=[target_col])
            y = df[target_col]
            model.fit(X, y)
        except Exception as e:
            raise ModelTrainingError("RandomForest", message=f"Training failed: {e}") from e

        # 7. Model Validation
        train_score = model.score(X, y)
        if train_score < 0.6:
            raise UnderfittingError(train_metric=train_score, threshold=0.6)

        return model

# Usage
if __name__ == "__main__":
    try:
        model = ml_pipeline("customer_data.csv", "churned")
        print("✅ Pipeline completed successfully!")
    except DataLoadingError as e:
        print(f"❌ Data loading failed: {e}")
    except MissingColumnError as e:
        print(f"❌ Schema validation failed: {e}")
    except DataImbalanceError as e:
        print(f"⚠️ Data quality issue: {e}")
    except ModelTrainingError as e:
        print(f"❌ Model training failed: {e}")
    except Exception as e:
        print(f"💥 Unexpected error: {e}")
```

We welcome contributions! See our Contributing Guide for details.
```bash
# Development setup
git clone https://github.com/DiogoRibeiro7/DataExcept.git
cd DataExcept
poetry install
poetry run pytest
poetry run flake8 dataexcept tests
```

- Full Documentation: dataexcept.readthedocs.io
- API Reference: API Docs
- Advanced Usage: Advanced Guide
- CLI Reference: CLI Guide
If you use DataExcept in your research, please cite it:
```bibtex
@software{ribeiro_dataexcept_2025,
  author    = {Ribeiro, Diogo},
  title     = {DataExcept: Structured Exception Handling for Data Science},
  url       = {https://github.com/DiogoRibeiro7/DataExcept},
  version   = {0.1.0},
  year      = {2025},
  publisher = {GitHub}
}
```

This project is licensed under the MIT License - see the LICENSE file for details.
Diogo Ribeiro is a Lead Data Scientist at Mysense.ai and a researcher/instructor at ESMAD (Instituto Politécnico do Porto). With expertise in machine learning, statistical analysis, and production ML systems, he created DataExcept to solve real-world error-handling challenges in data science workflows.
- ORCID: 0009-0001-2022-7072
- Website: diogoribeiro7.github.io
- Affiliation: ESMAD - Instituto Politécnico do Porto
⭐ Star this repo if DataExcept helps you build better data pipelines!