This repository documents my daily progress in mastering Data Science and Machine Learning.
For Day 1, I focused on understanding the fundamentals of Classification in Machine Learning. I explored the scikit-learn library to build a simple "Gender Predictor" based on physical attributes.
My goal was to compare how different classification algorithms handle the same small dataset. I manually created a dataset and trained 5 different models to see if they would agree on the prediction for a new user.
- Features: [Height (cm), Weight (kg), Shoe Size]
- Target: Gender ('male' or 'female')
- Language: Python
- Library: Scikit-Learn (sklearn)
- Decision Tree Classifier: Uses a flowchart-like structure to make decisions.
- K-Nearest Neighbors (KNN): Classifies based on the closest "k" data points.
- Support Vector Machine (SVM): Finds the optimal boundary to separate the classes.
- Random Forest Classifier: Uses multiple Decision Trees for better accuracy (Ensemble).
- Gaussian Naive Bayes: Uses probability to predict.
- Learned how to import specific models from sklearn.
- Understood the `.fit()` method for training and `.predict()` for prediction.
- Observed that different models might give different results depending on data complexity.
For Day 2, I moved from Classification to Regression. While classification categorizes data (A vs B), Regression predicts continuous values (like prices, temperature, or scores).
I built a simple logic that predicts a student's test score based on the number of hours they studied.
The model tries to fit a straight line through the data points using the equation y = mx + c, where:
- y: The Score (Target)
- x: The Hours Studied (Feature)
- m: The Slope (How much the score goes up per hour)
- c: The Intercept (The score if you study 0 hours)
The code generates a graph showing the actual student scores (Blue Dots) and the machine's learned pattern (Red Line).
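A minimal sketch of this with scikit-learn; the hours/score pairs and plot styling are made up for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Hours studied (feature) and test scores (target) -- illustrative data
hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
scores = np.array([35, 42, 50, 55, 62, 70, 76, 85])

model = LinearRegression()
model.fit(hours, scores)                 # learns m (slope) and c (intercept)
print('m (slope):', model.coef_[0])
print('c (intercept):', model.intercept_)

# Blue dots = actual scores, red line = the learned pattern y = m*x + c
plt.scatter(hours, scores, color='blue', label='Actual scores')
plt.plot(hours, model.predict(hours), color='red', label='Learned line')
plt.xlabel('Hours studied')
plt.ylabel('Score')
plt.legend()
plt.show()
```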
For Day 3, I learned Logistic Regression. Despite the name "Regression," this is actually a Classification algorithm used to predict binary outcomes (Yes/No, True/False, 0/1).
I built a model that predicts whether a customer will buy insurance based on their age.
Unlike Linear Regression, which fits a straight line, Logistic Regression applies the Sigmoid Function, σ(z) = 1 / (1 + e^(-z)), to squash the output between 0 and 1:
- If probability > 50%, it predicts 1 (Yes).
- If probability < 50%, it predicts 0 (No).
The red line in the graph represents the Probability.
- Bottom of S: Low probability (Younger people).
- Top of S: High probability (Older people).
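A minimal sketch of the insurance example, with a made-up age/purchase table for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Age (feature) and whether the person bought insurance (1 = yes, 0 = no)
ages   = np.array([[22], [25], [28], [35], [40], [47], [52], [58], [62], [65]])
bought = np.array([  0,    0,    0,    0,    1,    1,    1,    1,    1,    1])

model = LogisticRegression()
model.fit(ages, bought)

# The sigmoid squashes the output into a 0..1 probability;
# predict_proba returns [P(no), P(yes)] for each row
for age in [20, 38, 60]:
    p_yes = model.predict_proba([[age]])[0][1]
    print(f'Age {age}: P(buy) = {p_yes:.2f} -> prediction = {int(p_yes > 0.5)}')
```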
For Day 4, I explored KNN. Unlike other models that "learn" a complex math formula, KNN is a "Lazy Learner": it simply memorizes the entire dataset. When we ask for a prediction, it looks for the 'K' most similar data points (neighbors) and takes a vote.
I built a CLI (Command Line Interface) tool that predicts whether a user needs a Small or Large T-shirt.
- The program asks the user for their Height and Weight.
- It compares the user to a dataset of 18 existing customers.
- It predicts the size and plots the user as a Star on the graph.
KNN calculates the straight-line (Euclidean) distance between the new point and every other point to find the closest matches.
- Small K (e.g., 1): Highly sensitive to noise.
- Large K (e.g., 5): More stable, but might miss local details.
- I used K=3 for this project.
- 🔵 Blue Dots: Small Size
- 🔴 Red Dots: Large Size
- ⭐ Yellow Star: Entered value (You)
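A minimal sketch of the T-shirt CLI, with a small illustrative subset standing in for the 18-customer dataset:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# [height (cm), weight (kg)] -> T-shirt size (illustrative rows)
X = np.array([[158, 58], [160, 59], [163, 61], [165, 62],
              [170, 68], [173, 72], [178, 80], [180, 83]])
y = np.array(['Small', 'Small', 'Small', 'Small',
              'Large', 'Large', 'Large', 'Large'])

# K = 3: uses straight-line (Euclidean) distance by default, then a majority vote
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)                            # "lazy": it just stores the points

height = float(input('Enter your height (cm): '))
weight = float(input('Enter your weight (kg): '))
print('Predicted size:', knn.predict([[height, weight]])[0])
```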
For Day 5, I learned Random Forest, an "Ensemble" method. Instead of relying on a single Decision Tree (which can be biased), Random Forest creates hundreds of trees and makes them vote on the final answer.
I built an AI that mimics a Hiring Manager.
- Input: Experience, Test Score, Interview Score.
- Output: Hired vs. Rejected.
One of the best features of Random Forest is that it is "interpretable": it can measure which input feature had the biggest impact on the decision.
- Example: The model revealed that Interview Score might be more important than Experience.
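A minimal sketch of the hiring model, with made-up candidate rows and column names (assumptions, not the real dataset):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Illustrative candidates: experience (years), test score, interview score -> hired?
data = pd.DataFrame({
    'experience':      [0, 1, 2, 5, 7, 3, 8, 10],
    'test_score':      [6, 8, 5, 9, 7, 4, 9, 8],
    'interview_score': [5, 9, 4, 8, 9, 3, 10, 7],
    'hired':           [0, 1, 0, 1, 1, 0, 1, 1],
})
X, y = data.drop(columns='hired'), data['hired']

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X, y)

# Feature importance: how much each input contributed, averaged over all trees
for name, score in zip(X.columns, forest.feature_importances_):
    print(f'{name:>16}: {score:.2f}')
```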
For Day 6, I tackled a high-stakes classification problem: Breast Cancer Detection.
The system looks at patient details like age, tumor size, and medical history to predict whether a breast tumor is Malignant (Dangerous) or Benign (Not Dangerous).
- Early detection saves lives.
- Reduces human error in diagnosis.
- Source: Kaggle Breast Cancer Prediction Dataset
- Process: I cleaned the data, converted text labels to numbers, and split the data for training vs testing.
- Generated a Confusion Matrix to visualize False Positives vs. False Negatives.
- Prioritized reducing False Negatives (missing a cancer diagnosis) over raw accuracy.
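A minimal sketch of the evaluation step. As a stand-in for the Kaggle file I'm using scikit-learn's built-in breast-cancer dataset, and the classifier choice here is only illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X = data.data
y = (data.target == 0).astype(int)        # relabel so 1 = malignant, 0 = benign

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows = actual, columns = predicted; cm[1, 0] counts false negatives,
# i.e. malignant tumours the model wrongly called benign
cm = confusion_matrix(y_test, y_pred)
print(cm)
print('Recall on malignant cases:', recall_score(y_test, y_pred))
```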
For Day 7, I shifted from Supervised Learning to Unsupervised Learning. I used the K-Means Clustering algorithm to find hidden patterns in raw data without any labels or guidance.
Imagine a party where strangers naturally separate into groups (Dancers, Business people, Foodies).
- Supervised Learning: The host tells everyone where to stand.
- Unsupervised Learning: The guests group themselves based on similarity.
- Dataset: Mall Customer Segmentation Data
- Goal: Identify specific "Customer Personas" (e.g., High Income + High Spenders) for marketing.
- The Elbow Method: Calculated WCSS (Within-Cluster Sum of Squares) to find that K=5 was the optimal number of groups.
- Model Training: Trained K-Means with 5 clusters.
- Visualization: Plotted the 5 distinct market segments and their Centroids.
- Prediction: Classifies new customers into these established personas.
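A minimal sketch of the Elbow Method and clustering, assuming the standard Mall Customers columns 'Annual Income (k$)' and 'Spending Score (1-100)':

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

df = pd.read_csv('Mall_Customers.csv')
X = df[['Annual Income (k$)', 'Spending Score (1-100)']].values

# Elbow Method: compute WCSS (inertia) for K = 1..10 and look for the "bend"
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    wcss.append(km.inertia_)
plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('WCSS')
plt.show()

# Train with the elbow value (K = 5) and assign a persona to a new customer
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print('New customer cluster:', kmeans.predict([[70, 80]])[0])   # income 70k$, score 80
```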
For Day 8, I mastered XGBoost (Extreme Gradient Boosting), the algorithm that dominates Kaggle competitions. I moved from "Bagging" (Random Forest) to "Boosting" (Sequential Learning).
I built an AI system for a Bank to identify customers who are at high risk of leaving ("Churning").
- Goal: Predict `Exited` (1 = Left, 0 = Stayed).
- Business Value: Identifying at-risk customers allows the bank to offer incentives before they leave.
XGBoost works like a student taking practice exams:
- Model 1 takes the test and fails hard questions.
- Model 2 studies only those hard questions (mistakes).
- Model 3 fixes the mistakes of Model 2.

This Gradient Boosting approach achieves higher accuracy than standard trees.
- Accuracy: 86.55% (Outperforming Random Forest).
- Confusion Matrix: Successfully identified 193 high-risk customers who were about to leave.
- Feature Importance: Discovered that Age and Number of Products are the biggest drivers of churn.
- Library: `xgboost`
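A minimal sketch of the churn model via the xgboost sklearn wrapper; the file name, dropped columns, and one-hot encoding step are assumptions about the usual Kaggle bank-churn layout:

```python
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

df = pd.read_csv('Churn_Modelling.csv')                 # assumed file name
X = pd.get_dummies(
    df.drop(columns=['Exited', 'RowNumber', 'CustomerId', 'Surname']),
    columns=['Geography', 'Gender'])                    # assumed column names
y = df['Exited']                                        # 1 = left, 0 = stayed

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Boosting: each new tree focuses on the rows the previous trees got wrong
model = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=4, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```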
For Day 9, I mastered LightGBM (Light Gradient Boosting Machine), developed by Microsoft. While XGBoost is powerful (the "Ferrari"), LightGBM is designed for the Big Data era. It is 10x faster, consumes less memory, and handles categorical data automatically.
I built an AI model to analyze passenger feedback and predict customer satisfaction.
- Dataset: 100,000+ real passenger records (Kaggle).
- Goal: Predict `Satisfaction` (1 = Satisfied, 0 = Dissatisfied/Neutral).
- Challenge: Processing massive data with mixed text and numeric columns efficiently.
Unlike XGBoost, which grows trees level-by-level (horizontally), LightGBM grows Leaf-Wise (vertically).
- It finds the most promising branch and digs deep immediately.
- Result: Faster training and higher accuracy on large datasets.
- Native Handling: LightGBM can read text categories (like "Business Class") directly without needing One-Hot Encoding.
- Library: `lightgbm` (Native API).
- Data Structure: `lgb.Dataset` (binary-optimized format).
- Configuration: Used a `params` dictionary for granular control.
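A minimal sketch of how these pieces fit together with the native API; the file name and target column are assumptions, and the tuned `params` dictionary is shown right after this block:

```python
import lightgbm as lgb
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('airline_satisfaction.csv')            # assumed file name
# Native categorical handling: mark text columns as 'category' -- no One-Hot Encoding
for col in df.select_dtypes('object').columns:
    df[col] = df[col].astype('category')

X = df.drop(columns=['satisfaction'])                    # assumed target column
y = (df['satisfaction'] == 'satisfied').astype(int)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

train_set = lgb.Dataset(X_train, label=y_train)          # binary-optimized format
valid_set = lgb.Dataset(X_valid, label=y_valid, reference=train_set)

params = {'objective': 'binary', 'metric': 'binary_logloss'}   # tuned version below
model = lgb.train(params, train_set, num_boost_round=500, valid_sets=[valid_set])
preds = (model.predict(X_valid) > 0.5).astype(int)
```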
Instead of default settings, I tuned the model manually:
```python
params = {
    'objective': 'binary',          # Predicting Yes/No
    'metric': 'binary_logloss',     # Confidence scoring
    'boosting_type': 'gbdt',        # Standard Gradient Boosting
    'num_leaves': 31,               # Complexity control
    'learning_rate': 0.05,          # Slow & Steady learning
    'feature_fraction': 0.9         # Prevent overfitting
}
```

An End-to-End Machine Learning project that predicts the Air Quality Index (AQI) based on pollutant levels and city data. This project moves beyond simple analysis by deploying the model as a live FastAPI backend, ready for integration into web or mobile apps.
- Machine Learning: Uses Random Forest Regressor to predict continuous AQI values.
- Optimization: Implements GridSearchCV to automatically search for the best hyperparameters.
- City-Aware: Uses Label Encoding to adjust predictions based on the specific city (e.g., Delhi vs. Bangalore).
- Visualization: Automatically generates charts for Feature Importance and Prediction Error.
- Deployment: A professional FastAPI server that classifies health risk (Good 🟢 to Severe ⚠️).
- Language: Python 3.x
- ML Library: Scikit-Learn (Random Forest, GridSearch)
- Data Processing: Pandas, NumPy
- Visualization: Matplotlib, Seaborn
- API Framework: FastAPI, Uvicorn
- Serialization: Joblib (for saving/loading models)
```
├── city_day.csv          # The Dataset (Kaggle)
├── train_model.py        # Script to Train, Tune, and Save the Model
├── main.py               # FastAPI Server script
├── aqi_model.pkl         # Saved Model + Encoder (Generated after training)
├── requirements.txt      # Dependencies
└── README.md             # Documentation
```
⚙️ Installation

Clone the repository (or download the files), then install the dependencies:

```bash
pip install pandas numpy scikit-learn matplotlib seaborn joblib fastapi uvicorn
```

Get the data: ensure the city_day.csv file is in the project folder. Source: Kaggle: Air Quality Data in India.
🔧 Phase 1: Training the Model

Run the training script to process the data, tune the model, and generate visualizations:

```bash
python train_model.py
```

What happens?
- Cleans missing values using Median Imputation.
- Encodes City names into numbers.
- Runs GridSearchCV to find the best n_estimators and max_depth.
- Saves three plots: pollutant_effect.png, Actual_vs_Predicted.png, Truth_Vs_Predict.png.
- Saves the final artifact: aqi_model.pkl.
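A rough sketch of what the tuning step inside train_model.py could look like; the parameter grid, feature list, and saved-artifact layout are assumptions:

```python
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('city_day.csv')
df = df.fillna(df.median(numeric_only=True))             # median imputation
encoder = LabelEncoder()
df['City'] = encoder.fit_transform(df['City'])           # city name -> number

features = ['PM2.5', 'PM10', 'NO2', 'CO', 'SO2', 'O3', 'City']   # assumed feature set
X, y = df[features], df['AQI']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# GridSearchCV tries every combination and keeps the best by cross-validated R²
grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={'n_estimators': [100, 200], 'max_depth': [10, 20, None]},
    cv=3, scoring='r2')
grid.fit(X_train, y_train)

print('Best params:', grid.best_params_)
print('Test R²:', grid.best_estimator_.score(X_test, y_test))
joblib.dump({'model': grid.best_estimator_, 'encoder': encoder}, 'aqi_model.pkl')
```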
🚀 Phase 2: Running the API

Once aqi_model.pkl is created, launch the server:

```bash
python -m uvicorn main:app --reload
```

- Server URL: http://127.0.0.1:8000
- Interactive Docs: http://127.0.0.1:8000/docs

🧪 How to Test (API Usage)

1. Go to http://127.0.0.1:8000/docs.
2. Click POST /predict_aqi → Try it out.
3. Paste this JSON:
```json
{
  "pm25": 180.0,
  "pm10": 250.0,
  "no2": 60.0,
  "co": 1.5,
  "so2": 10.0,
  "o3": 45.0,
  "city": "Delhi"
}
```
Response:
```json
{
  "City": "Delhi",
  "Predicted_AQI": 312,
  "Status": "Very Poor 🔴"
}
```
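For reference, a stripped-down sketch of what the /predict_aqi endpoint in main.py might look like; the field names follow the JSON above, while the saved-artifact layout and the AQI-to-status bands are assumptions:

```python
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
artifact = joblib.load('aqi_model.pkl')                  # model + city encoder from Phase 1
model, encoder = artifact['model'], artifact['encoder']

class AQIRequest(BaseModel):
    pm25: float
    pm10: float
    no2: float
    co: float
    so2: float
    o3: float
    city: str

def status_for(aqi: float) -> str:
    # Rough illustrative bands -- adjust to the official AQI scale as needed
    if aqi <= 100:
        return 'Good 🟢'
    if aqi <= 200:
        return 'Moderate'
    if aqi <= 300:
        return 'Poor'
    if aqi <= 400:
        return 'Very Poor 🔴'
    return 'Severe ⚠️'

@app.post('/predict_aqi')
def predict_aqi(req: AQIRequest):
    city_code = encoder.transform([req.city])[0]
    features = [[req.pm25, req.pm10, req.no2, req.co, req.so2, req.o3, city_code]]
    aqi = float(model.predict(features)[0])
    return {'City': req.city, 'Predicted_AQI': int(round(aqi)), 'Status': status_for(aqi)}
```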
📊 Model Performance

- Baseline Accuracy: ~88% (R² Score)
- Tuned Accuracy: ~91% (after GridSearchCV)
- Key Insight: PM2.5 and PM10 were identified as the most critical drivers of AQI.
A Machine Learning project that uses Unsupervised Learning (K-Means Clustering) to identify distinct customer groups based on their purchasing behavior.
To analyze a dataset of mall customers and group them into specific "Tribes" (Clusters) based on two key factors:
- Annual Income (k$)
- Spending Score (1-100)
This helps marketing teams understand who their customers are (e.g., "VIPs" vs. "Savers") to create targeted strategies.
Since we do not have labels (no "correct answer"), we use Unsupervised Learning:
- Elbow Method: To scientifically determine the optimal number of groups (Clusters).
- K-Means Clustering: To group the data points.
- Centroid Analysis: To interpret what each group represents.
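A minimal sketch of the scaling and centroid-interpretation steps, using the column names listed above:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('Mall_Customers.csv')
X = df[['Annual Income (k$)', 'Spending Score (1-100)']].values

# Scale so income (k$) and spending score (1-100) contribute equally to distances
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)   # K = 5 from the Elbow Method
df['Cluster'] = kmeans.fit_predict(X_scaled)

# Centroid Analysis: convert centroids back to original units to name each "tribe"
centroids = scaler.inverse_transform(kmeans.cluster_centers_)
for i, (income, score) in enumerate(centroids):
    print(f'Cluster {i}: income ~{income:.0f}k$, spending score ~{score:.0f}')
```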
- Language: Python 3.x
- Algorithm: K-Means Clustering (Scikit-Learn)
- Data Processing: Pandas, NumPy
- Preprocessing: StandardScaler (Feature Scaling)
- Visualization: Matplotlib, Seaborn
```
├── Mall_Customers.csv    # The Dataset (Required)
├── sample.py             # Main script (Elbow Method + Training + Plotting)
└── README.md             # Documentation
```