This project predicts individual medical insurance costs from demographic and health data. To improve model accuracy and handle the right-skewed distribution of insurance charges, a log transformation was applied to the target variable (`charges`).
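As a rough illustration of why the log transformation helps, the sketch below measures skewness before and after applying `np.log()`. The data here is synthetic (a log-normal sample standing in for `charges`), and the `skewness` helper is a hypothetical name introduced only for this example; neither comes from the notebook.

```python
import numpy as np

# Synthetic, right-skewed stand-in for the `charges` column (NOT the real dataset).
rng = np.random.default_rng(0)
charges = rng.lognormal(mean=9.0, sigma=0.9, size=5000)


def skewness(values):
    """Sample skewness: third central moment divided by variance**1.5."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5


# The natural log compresses the long right tail.
log_charges = np.log(charges)

print(f"skewness of charges:      {skewness(charges):.2f}")
print(f"skewness of log(charges): {skewness(log_charges):.2f}")

# np.exp() inverts the transform, so predictions can be mapped back to currency units.
recovered = np.exp(log_charges)
```

A skewness near zero after the transform suggests the target is closer to symmetric, which tends to suit a linear model better.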
- Analyze the impact of features like `age`, `bmi`, and `smoker` on total charges.
- Train a Linear Regression model using Scikit-Learn.
- Evaluate the model using standard regression metrics.
Based on the final evaluation, the model achieved the following results:
- R² Score: 0.894
- Mean Absolute Error (MAE): 0.21
- Mean Squared Error (MSE): 0.098
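For reference, the three metrics above can be computed with `sklearn.metrics` as sketched below. The arrays are small made-up values, not the project's predictions. The example also assumes the reported metrics are on the log scale (which the magnitude of the MAE suggests); under that assumption, an MAE of 0.21 in log units corresponds to a typical multiplicative error of about `exp(0.21) ≈ 1.23`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up log-scale targets and predictions, for illustration only.
y_true_log = np.array([8.0, 9.0, 10.0, 11.0])
y_pred_log = np.array([8.2, 8.9, 10.1, 10.8])

# Metrics on the log scale (assumed to be the scale of the results listed above).
r2 = r2_score(y_true_log, y_pred_log)
mae = mean_absolute_error(y_true_log, y_pred_log)
mse = mean_squared_error(y_true_log, y_pred_log)
print(f"R2={r2:.3f}  MAE={mae:.3f}  MSE={mse:.3f}")

# Back-transform with np.exp() to express the error in the original currency units.
mae_original_units = mean_absolute_error(np.exp(y_true_log), np.exp(y_pred_log))
print(f"MAE in original units: {mae_original_units:.2f}")

# A log-scale MAE can be read as a typical multiplicative error factor.
typical_factor = np.exp(0.21)
```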
- Data Cleaning: Handled duplicates and verified no missing values existed.
- Feature Engineering:
  - Encoded categorical variables (`sex`, `smoker`, `region`).
  - Applied `np.log()` to the `charges` column to normalize the distribution.
- Training: Split the data into training and testing sets.
- Prediction: Generated predictions on the log-scale and visualized them against actual values.
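The workflow steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the notebook's actual code: the data is synthetic (only the column names mentioned in this README are used), one-hot encoding via `pd.get_dummies` is an assumed encoding choice, and the 80/20 split with `random_state=42` is likewise assumed.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for dataset/insurance.csv (NOT the real data).
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "sex": rng.choice(["female", "male"], size=n),
    "bmi": rng.normal(30.0, 6.0, size=n).round(1),
    "smoker": rng.choice(["no", "yes"], size=n, p=[0.8, 0.2]),
    "region": rng.choice(
        ["northeast", "northwest", "southeast", "southwest"], size=n
    ),
})
log_mean = 7.0 + 0.035 * df["age"] + 0.01 * df["bmi"] + 1.5 * (df["smoker"] == "yes")
df["charges"] = np.exp(log_mean + rng.normal(0.0, 0.3, size=n))

# Data cleaning: drop duplicate rows and count missing values.
df = df.drop_duplicates().reset_index(drop=True)
missing_total = int(df.isna().sum().sum())

# Feature engineering: one-hot encode categoricals (assumed encoding), log-transform the target.
X = pd.get_dummies(
    df.drop(columns="charges"),
    columns=["sex", "smoker", "region"],
    drop_first=True,
    dtype=float,
)
y = np.log(df["charges"])

# Training: hold out 20% of the rows for testing (assumed split ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# Prediction: the model outputs log-charges; np.exp() maps them back to charges.
y_pred_log = model.predict(X_test)
r2 = r2_score(y_test, y_pred_log)
y_pred_charges = np.exp(y_pred_log)
print(f"R2 on the log scale (synthetic data): {r2:.3f}")
```

Because the model is fit on `np.log(charges)`, any comparison against actual charges in currency units needs the `np.exp()` back-transform shown in the last step.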
- Language: Python
- Libraries: Pandas, NumPy, Matplotlib, Seaborn, Scikit-Learn
- `dataset/insurance.csv`: Input dataset.
- `images/linear_trend.png`: Visualization of results.
- `medical_insurance_cost_prediction.ipynb`: Complete Python code and analysis.
