This project trains a RandomForestClassifier on the Titanic dataset and creates a submission.csv file for test predictions.
- `src/preprocess.py`: data loading, cleaning, feature engineering, and preprocessing pipeline
- `src/train.py`: model training and validation accuracy report
- `src/predict.py`: test-set prediction and `submission.csv` creation
- `notebooks/titanic_analysis.ipynb`: beginner walkthrough notebook
- `requirements.txt`: required Python packages
- Python 3.9+
- Titanic files:
  - `data/train.csv`
  - `data/test.csv`
Install dependencies:

pip install -r requirements.txt

- Reads `train.csv` and `test.csv` from `data/`
- Performs data cleaning:
  - fills missing `Age` and `Fare` with the median
  - fills missing `Embarked` with the most frequent value
- One-hot encodes categorical columns (`Sex`, `Embarked`)
- Creates new features:
  - `FamilySize = SibSp + Parch + 1`
  - `IsAlone` (1 if family size is 1, else 0)
- Trains a `RandomForestClassifier`
- Evaluates accuracy on a validation split
- Generates `submission.csv` for test predictions
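
The cleaning and feature-engineering steps listed above can be sketched in plain pandas. This is a simplified illustration, not the code in `src/preprocess.py`: the function name `clean_and_engineer` and the three-row sample frame are made up for the example, and it fills missing values from the frame it is given, whereas a real pipeline would normally learn the fill values from the training data only.

```python
import pandas as pd


def clean_and_engineer(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper that mirrors the listed steps on a single frame."""
    out = df.copy()

    # Fill missing numeric values with the column median.
    for col in ("Age", "Fare"):
        out[col] = out[col].fillna(out[col].median())

    # Fill missing Embarked with the most frequent value.
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode()[0])

    # Family features: the passenger plus siblings/spouses and parents/children.
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    out["IsAlone"] = (out["FamilySize"] == 1).astype(int)

    # One-hot encode the categorical columns.
    return pd.get_dummies(out, columns=["Sex", "Embarked"])


# Tiny made-up sample with one missing value per cleaned column.
sample = pd.DataFrame(
    {
        "Age": [22.0, None, 30.0],
        "Fare": [7.25, 71.28, None],
        "Embarked": ["S", None, "S"],
        "Sex": ["male", "female", "female"],
        "SibSp": [1, 0, 0],
        "Parch": [0, 0, 2],
    }
)
features = clean_and_engineer(sample)
print(features[["Age", "FamilySize", "IsAlone"]])
```

In this sample the missing `Age` becomes 26.0 (the median of 22 and 30), and only the second passenger has `IsAlone` set to 1.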
From project root:
python src/train.py
python src/predict.py

After running, you should see:

- `models/model.joblib`
- `submission.csv`
flowchart TD
A["train main"] --> B["load data"]
B --> C["read train csv and test csv"]
C --> D["prepare train features"]
D --> E["add family features"]
E --> F["prepare X and y"]
F --> G["train test split"]
G --> H["build preprocessor"]
H --> I["fit transform train"]
I --> J["transform validation"]
J --> K["fit random forest"]
K --> L["predict validation"]
L --> M["compute accuracy"]
M --> N["save model with joblib"]
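
The training flow above can be approximated with standard scikit-learn pieces. The sketch below is an assumption about how such a flow might look, not a copy of `src/train.py`: the column lists, the `build_preprocessor` name, the hyperparameters, and the synthetic data (where survival is determined entirely by `Sex`) are all invented so the example runs without the Kaggle files. Saving the model is omitted here.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

NUMERIC = ["Age", "Fare", "FamilySize", "IsAlone"]  # assumed column split
CATEGORICAL = ["Sex", "Embarked"]


def build_preprocessor() -> ColumnTransformer:
    """Median-impute numeric columns; impute and one-hot encode categoricals."""
    categorical = Pipeline(
        [
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]
    )
    return ColumnTransformer(
        [
            ("num", SimpleImputer(strategy="median"), NUMERIC),
            ("cat", categorical, CATEGORICAL),
        ]
    )


# Synthetic stand-in for data/train.csv (illustration only).
rng = np.random.default_rng(0)
n = 200
train = pd.DataFrame(
    {
        "Age": rng.uniform(1, 80, n),
        "Fare": rng.uniform(5, 100, n),
        "SibSp": rng.integers(0, 3, n),
        "Parch": rng.integers(0, 3, n),
        "Sex": rng.choice(["male", "female"], n),
        "Embarked": rng.choice(["S", "C", "Q"], n),
    }
)
train["Survived"] = (train["Sex"] == "female").astype(int)

# Add family features, then split into X and y.
train["FamilySize"] = train["SibSp"] + train["Parch"] + 1
train["IsAlone"] = (train["FamilySize"] == 1).astype(int)
X = train[NUMERIC + CATEGORICAL]
y = train["Survived"]

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the preprocessor on the training split only, then reuse it for validation.
preprocessor = build_preprocessor()
X_train_t = preprocessor.fit_transform(X_train)
X_val_t = preprocessor.transform(X_val)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_t, y_train)

accuracy = accuracy_score(y_val, model.predict(X_val_t))
print(f"Validation accuracy: {accuracy:.3f}")
```

Fitting the preprocessor on the training split and only transforming the validation split keeps validation rows from influencing the imputed medians and encoder categories.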
flowchart TD
A["predict main"] --> B{"model joblib exists"}
B -- "no" --> C["raise file not found"]
B -- "yes" --> D["load data"]
D --> E["select test dataframe"]
E --> F["load model and preprocessor"]
F --> G["prepare test features"]
G --> H["add family features"]
H --> I["transform test features"]
I --> J["predict survived"]
J --> K["build submission dataframe"]
K --> L["write submission csv"]
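
The prediction flow above can be sketched as follows. This is a hedged illustration rather than the contents of `src/predict.py`: it assumes the saved joblib file holds a dict with `"model"` and `"preprocessor"` keys, and the names `predict_to_csv`, `add_family_features`, and `FEATURES` are hypothetical. A tiny model is trained inline on made-up rows and written to a temporary directory so the example is self-contained.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer

FEATURES = ["Age", "Fare", "FamilySize", "IsAlone"]  # assumed feature set


def add_family_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    out["IsAlone"] = (out["FamilySize"] == 1).astype(int)
    return out


def predict_to_csv(model_path: Path, test_df: pd.DataFrame, out_path: Path) -> pd.DataFrame:
    """Load a saved bundle, predict on test_df, and write a submission file."""
    if not model_path.exists():
        raise FileNotFoundError(f"{model_path} not found; run training first")

    bundle = joblib.load(model_path)  # assumed layout: {"model", "preprocessor"}
    features = add_family_features(test_df)[FEATURES]
    transformed = bundle["preprocessor"].transform(features)

    submission = pd.DataFrame(
        {
            "PassengerId": test_df["PassengerId"],
            "Survived": bundle["model"].predict(transformed).astype(int),
        }
    )
    submission.to_csv(out_path, index=False)
    return submission


# Made-up rows standing in for the real train and test files.
train = pd.DataFrame(
    {
        "Age": [22.0, 38.0, 26.0, 35.0],
        "Fare": [7.25, 71.28, 7.92, 53.1],
        "SibSp": [1, 1, 0, 1],
        "Parch": [0, 0, 0, 0],
        "Survived": [0, 1, 1, 1],
    }
)
test = pd.DataFrame(
    {
        "PassengerId": [892, 893],
        "Age": [34.5, None],
        "Fare": [7.83, 7.0],
        "SibSp": [0, 1],
        "Parch": [0, 0],
    }
)

preprocessor = SimpleImputer(strategy="median")
X_train = preprocessor.fit_transform(add_family_features(train)[FEATURES])
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(X_train, train["Survived"])

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.joblib"
    out_path = Path(tmp) / "submission.csv"
    joblib.dump({"model": model, "preprocessor": preprocessor}, model_path)
    submission = predict_to_csv(model_path, test, out_path)
    written = pd.read_csv(out_path)

print(written)
```

Storing the fitted preprocessor next to the model means the test set is transformed with the statistics learned during training instead of being refit on test data.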
- Start with `notebooks/titanic_analysis.ipynb` if you want to understand each step interactively.
- The scripts in `src/` implement the same logic as reusable Python modules.