End-to-end ML competition project for Kaggle Titanic survival prediction from tabular passenger data.
Given train.csv and test.csv, predict Survived for unseen passengers while maintaining transparent preprocessing and a reproducible submission workflow.
- Python (notebook and script workflows)
- Jupyter Notebook
- XGBoost / classical ML preprocessing
- GitHub Actions (validation checks)
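The "classical ML preprocessing" in the stack above typically means imputing missing values and encoding categoricals before the model sees the data. A minimal illustrative sketch (not the notebook's exact code — the function name and feature set are assumptions for this example):

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative Titanic preprocessing: impute, encode, select numeric features."""
    out = df.copy()
    # Impute missing Age with the median and missing Embarked with the mode.
    out["Age"] = out["Age"].fillna(out["Age"].median())
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode()[0])
    # Encode categoricals as small integers.
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    out["Embarked"] = out["Embarked"].map({"S": 0, "C": 1, "Q": 2})
    # Keep a compact all-numeric feature set for the model.
    return out[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]]
```

Applied to `train.csv`, this yields a numeric matrix that XGBoost or any scikit-learn classifier can consume directly.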
- `data/`: competition train/test datasets
- `titanic_survival_NN.ipynb`: main notebook (EDA, preprocessing, modeling)
- `xgboost.py`: script-based model experimentation
- `solutions/`: generated submission files
- `tests/`: checks for generated output format/content
```
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Run the notebook:

```
jupyter notebook titanic_survival_NN.ipynb
```

Or run the script experiment:

```
python xgboost.py
```

Generate a deterministic baseline submission and CV report without opening notebooks:

```
python scripts/reproducible_baseline.py
```

Outputs:

- `solutions/cli_baseline_submission.csv`
- `artifacts/cv_report.json`
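Determinism in a baseline script of this kind usually comes from fixing the random seed and using an unshuffled cross-validation split, so repeated runs produce byte-identical reports. A hedged sketch of the CV-report half (the function name is hypothetical, and scikit-learn's `GradientBoostingClassifier` stands in for the repo's XGBoost model):

```python
import json
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

SEED = 42  # a fixed seed makes repeated runs identical

def cv_report(X: np.ndarray, y: np.ndarray, n_splits: int = 5) -> dict:
    """Seeded k-fold CV; returns a JSON-serialisable accuracy report."""
    model = GradientBoostingClassifier(random_state=SEED)
    # An integer cv uses an unshuffled StratifiedKFold, so folds are stable.
    scores = cross_val_score(model, X, y, cv=n_splits, scoring="accuracy")
    return {
        "seed": SEED,
        "n_splits": n_splits,
        "fold_accuracy": [round(float(s), 5) for s in scores],
        "mean_accuracy": round(float(scores.mean()), 5),
    }
```

Dumping the returned dict with `json.dump` would produce an artifact in the spirit of `artifacts/cv_report.json`.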
Local check:

```
python scripts/reproducible_baseline.py
python -m unittest discover -s tests -p "test_*.py"
```

CI (`.github/workflows/ci.yml`) validates Python syntax for `xgboost.py` and runs the solution-file tests.
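A submission-format check for this competition typically asserts the two-column `PassengerId,Survived` layout with binary labels and unique IDs. A sketch of what such a test might look like (the class name is hypothetical, and an in-memory sample replaces the real `solutions/` file for illustration):

```python
import io
import unittest
import pandas as pd

class TestSubmissionFormat(unittest.TestCase):
    # In the repo this would read solutions/cli_baseline_submission.csv;
    # here a small in-memory sample keeps the example self-contained.
    CSV = "PassengerId,Survived\n892,0\n893,1\n"

    def setUp(self):
        self.df = pd.read_csv(io.StringIO(self.CSV))

    def test_columns(self):
        self.assertEqual(list(self.df.columns), ["PassengerId", "Survived"])

    def test_labels_binary(self):
        self.assertTrue(self.df["Survived"].isin([0, 1]).all())

    def test_ids_unique(self):
        self.assertTrue(self.df["PassengerId"].is_unique)
```

`python -m unittest discover` picks up tests like this automatically when the file name matches `test_*.py`.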
- Best score in this repository: 0.78229 (Kaggle public leaderboard).
- Includes notebook-first and script-based experimentation paths.
- Includes automated checks for generated submission files.
- The workflow is still notebook-centered as the main reproducibility path.
- Hyperparameter search and CV reporting are limited.
- No single CLI command yet to reproduce final submission end-to-end.
- Add reproducible CLI pipeline for submission generation.
- Add cross-validation report and feature-importance artifacts.
- Add pinned environment lockfile for stronger reproducibility.
See CONTRIBUTING.md.