From 71da4165e8b49167ae49272dedc0abd2fb4c4cd1 Mon Sep 17 00:00:00 2001 From: Kashishkh Date: Wed, 5 Feb 2025 01:26:43 +0530 Subject: [PATCH 01/19] crop-recommendation --- .../machine-learning/crop-recommendation.md | 227 ++++++++++++++++++ 1 file changed, 227 insertions(+) create mode 100644 docs/projects/machine-learning/crop-recommendation.md diff --git a/docs/projects/machine-learning/crop-recommendation.md b/docs/projects/machine-learning/crop-recommendation.md new file mode 100644 index 00000000..64fd356f --- /dev/null +++ b/docs/projects/machine-learning/crop-recommendation.md @@ -0,0 +1,227 @@ +# Crop-Recommendation-Model + +
+ +
+ +## ๐ŸŽฏ AIM + +It is an AI-powered Crop Recommendation System that helps farmers and agricultural stakeholders determine the most suitable crops for cultivation based on environmental conditions. The system uses machine learning models integrated with Flask to analyze key parameters and suggest the best crop to grow in a given region. + +## ๐Ÿ“Š DATASET LINK + +[https://www.kaggle.com/datasets/atharvaingle/crop-recommendation-dataset/data](https://www.kaggle.com/datasets/atharvaingle/crop-recommendation-dataset/data) + +## ๐Ÿ““ NOTEBOOK + +[https://www.kaggle.com/code/kashishkhurana1204/recommendation-system](https://www.kaggle.com/code/kashishkhurana1204/recommendation-system) + +??? Abstract "Kaggle Notebook" + + + +## โš™๏ธ TECH STACK + +| **Category** | **Technologies** | +|--------------------------|---------------------------------------------| +| **Languages** | Python | +| **Libraries/Frameworks** | Pandas, Numpy, Matplotlib, Seaborn, Scikit-learn | +| **Tools** | Github, Jupyter, VS Code | + +--- + +## ๐Ÿ“ DESCRIPTION + +The project focuses on predicting air quality levels based on the features of air pollutants and environmental parameters. +The objective is to test various regression models to see which one gives the best predictions for CO (Carbon Monoxide) levels. + +!!! info "What is the requirement of the project?" + - To provide accurate crop recommendations based on environmental conditions. + - To assist farmers in maximizing yield and efficiency. + +??? info "How is it beneficial and used?" + - Helps in optimizing agricultural planning. + - Reduces trial-and-error farming practices. + + +??? info "How did you start approaching this project? (Initial thoughts and planning)" + - Data collection and preprocessing. + - Model selection and training. + - Flask integration for web-based recommendations. + +??? info "Mention any additional resources used (blogs, books, chapters, articles, research papers, etc.)." + - Research papers on crop prediction models. + - Kaggle datasets and tutorials. + +--- + +## ๐Ÿ” EXPLANATION + +### DATASET OVERVIEW & FEATURE DETAILS + +๐Ÿ“‚ dataset.csv + +Contains agricultural parameters and their corresponding crop recommendations. + +๐Ÿ›  Developed Features from dataset.csv + +Data cleaning and preprocessing. + +Feature selection for improved model accuracy. + + + + +### ๐Ÿ›ค PROJECT WORKFLOW + +```mermaid + graph + Start -->|No| End; + Start -->|Yes| Import_Libraries --> Load_Dataset --> Data_Cleaning --> Feature_Selection --> Train_Test_Split --> Define_Models; + Define_Models --> Train_Models --> Evaluate_Models --> Save_Best_Model --> Develop_Flask_API --> Deploy_Application --> Conclusion; + Deploy_Application -->|Error?| Debug --> Yay!; + +``` + + +=== "Import Necessary Libraries" + - First, we import all the essential libraries needed for handling, analyzing, and modeling the dataset. + - This includes libraries like Pandas for data manipulation, Numpy for numerical computations, Matplotlib and Seaborn for data visualization, and Scikit-learn for machine learning models, evaluation, and data preprocessing. + - These libraries will enable us to perform all required tasks efficiently. + +=== "Load Dataset" + - We load the dataset using Pandas `read_csv()` function. The dataset contains crop data, which is loaded with a semicolon delimiter. + - After loading, we inspect the first few rows to understand the structure of the data and ensure that the dataset is correctly loaded. 
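    A minimal sketch of this loading step (the file name and path are placeholders for your local copy of the Kaggle dataset; adjust `sep` if your copy really uses a semicolon delimiter as noted above):

    ```py
    import pandas as pd

    # Placeholder path -- point this at your local copy of the Kaggle file
    df = pd.read_csv("Crop_recommendation.csv")

    # Quick structural checks to confirm the dataset loaded correctly
    print(df.shape)
    print(df.head())
    df.info()
    ```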
+ +=== "Data Cleaning Process" + Data cleaning is a crucial step in any project. In this step: + + - Handle missing values, remove duplicates, and ensure data consistency. + - Convert categorical variables if necessary and normalize numerical values. + +=== "Visualizing Correlations Between Features" + + - Use heatmaps and scatter plots to understand relationships between features and how they impact crop recommendations. + +=== "Data Preparation - Features (X) and Target (y)" + + - Separate independent variables (environmental parameters) and the target variable (recommended crop). + +=== "Split the Data into Training and Test Sets" + + - Use train_test_split() from Scikit-learn to divide data into training and testing sets, ensuring model generalization. + +=== "Define Models" + We define multiple regression models to train and evaluate on the dataset: + + - **RandomForestRegressor**: A robust ensemble method that performs well on non-linear datasets. + - **Naive Bayes**: A probabilistic classifier based on Bayes' theorem, which assumes independence between features and is effective for classification tasks. + - **DecisionTreeRegressor**: A decision tree-based model, capturing non-linear patterns and interactions. + +=== "Train and Evaluate Each Model" + + - Fit models using training data and evaluate performance using accuracy, precision, recall, and F1-score metrics. + +=== "Visualizing Model Evaluation Metrics" + + - Use confusion matrices, precision-recall curves, and ROC curves to assess model performance. + +== "Conclusion and Observations" + + **Best-Performing Models and Insights Gained:** + + - The Random Forest model provided the highest accuracy and robustness in predictions. + + - Decision Tree performed well but was prone to overfitting on training data. + + - Naรฏve Bayes, though simple, showed competitive performance for certain crop categories. + + - Feature importance analysis revealed that soil pH and nitrogen levels had the most significant impact on crop recommendation. + + **Potential Improvements and Future Enhancements:** + + - Implement deep learning models for better feature extraction and prediction accuracy. + + - Expand the dataset by incorporating satellite and real-time sensor data. + + - Integrate weather forecasting models to enhance crop suitability predictions. + + - Develop a mobile-friendly UI for better accessibility to farmers. + +--- + +### ๐Ÿ–ฅ CODE EXPLANATION + + +--- + +### โš–๏ธ PROJECT TRADE-OFFS AND SOLUTIONS + +=== "Trade Off 1" + - **Trade-off**: Accuracy vs. Computational Efficiency + - **Solution**: Optimized hyperparameters and used efficient algorithms. + +=== "Trade Off 2" + - **Trade-off**: Model interpretability vs complexity. + - **Solution**: Selected models balancing accuracy and interpretability. + +--- + +## ๐Ÿ–ผ SCREENSHOTS + +!!! tip "Visualizations of different features" + + === "HeatMap" + ![img](https://github.com/Kashishkh/FarmSmart/blob/main/Screenshot%202025-02-04%20195349.png) + + === "Model Comparison" + ![model-comparison](https://github.com/Kashishkh/FarmSmart/blob/main/Screenshot%202025-02-05%20011859.png) + + +--- + +## ๐Ÿ“‰ MODELS USED AND THEIR EVALUATION METRICS + +| Model | Accuracy | +|---------------------------|----------| +| Naive Bayes | 99.5% | +| Random Forest Regressor | 99.3% | +| Decision Tree Regressor | 98.6% | | + +--- + +## โœ… CONCLUSION + +### ๐Ÿ”‘ KEY LEARNINGS + +!!! tip "Insights gained from the data" + - Soil conditions play a crucial role in crop recommendation. 
+ - Environmental factors significantly impact crop yield. + +??? tip "Improvements in understanding machine learning concepts" + - Feature engineering and hyperparameter tuning. + - Deployment of ML models in real-world applications. + +--- + +### ๐ŸŒ USE CASES + +=== "Application 1" + **Application of FarmSmart in precision farming.** + + - FarmSmart helps optimize resource allocation, enabling farmers to make data-driven decisions for sustainable and profitable crop production. + +=== "Application 2" + **Use in government agricultural advisory services.** + + - Government agencies can use FarmSmart to provide region-specific crop recommendations, improving food security and agricultural productivity through AI-driven insights. + + + From 576ab661760e813f3f7a7c849d245aa487833200 Mon Sep 17 00:00:00 2001 From: Kashishkh Date: Sun, 16 Feb 2025 15:26:21 +0530 Subject: [PATCH 02/19] changes done --- .../machine-learning/crop-recommendation.md | 102 +++++++++++++----- 1 file changed, 77 insertions(+), 25 deletions(-) diff --git a/docs/projects/machine-learning/crop-recommendation.md b/docs/projects/machine-learning/crop-recommendation.md index 64fd356f..b8173db7 100644 --- a/docs/projects/machine-learning/crop-recommendation.md +++ b/docs/projects/machine-learning/crop-recommendation.md @@ -14,7 +14,7 @@ It is an AI-powered Crop Recommendation System that helps farmers and agricultur ## ๐Ÿ““ NOTEBOOK -[https://www.kaggle.com/code/kashishkhurana1204/recommendation-system](https://www.kaggle.com/code/kashishkhurana1204/recommendation-system) +[https://www.kaggle.com/code/kashishkhurana1204/crop-recommendation-system](https://www.kaggle.com/code/kashishkhurana1204/crop-recommendation-system) ??? Abstract "Kaggle Notebook" @@ -29,18 +29,16 @@ It is an AI-powered Crop Recommendation System that helps farmers and agricultur ## โš™๏ธ TECH STACK -| **Category** | **Technologies** | -|--------------------------|---------------------------------------------| -| **Languages** | Python | -| **Libraries/Frameworks** | Pandas, Numpy, Matplotlib, Seaborn, Scikit-learn | -| **Tools** | Github, Jupyter, VS Code | +| **Category** | **Technologies** | +|--------------------------|-----------------------------------------| +| **Languages** | Python | +| **Libraries/Frameworks** | Pandas, Numpy, Matplotlib, Scikit-learn | +| **Tools** | Github, Jupyter, VS Code | --- ## ๐Ÿ“ DESCRIPTION -The project focuses on predicting air quality levels based on the features of air pollutants and environmental parameters. -The objective is to test various regression models to see which one gives the best predictions for CO (Carbon Monoxide) levels. !!! info "What is the requirement of the project?" - To provide accurate crop recommendations based on environmental conditions. @@ -52,9 +50,15 @@ The objective is to test various regression models to see which one gives the be ??? info "How did you start approaching this project? (Initial thoughts and planning)" - - Data collection and preprocessing. - - Model selection and training. - - Flask integration for web-based recommendations. + - Initial thoughts : The goal is to help farmers determine the most suitable crops based on their fieldโ€™s environmental conditions. + + - Dataset Selection : I searched for relevant datasets on Kaggle that include soil properties, weather conditions, and nutrient levels such as nitrogen (N), phosphorus (P), and potassium (K). 
+ + - Initial Data Exploration : I analyzed the dataset structure to understand key attributes like soil pH, humidity, rainfall, and nutrient values, which directly impact crop suitability. + + - Feature Analysis : Studied how different environmental factors influence crop growth and identified the most significant parameters for prediction. + + - Model Selection & Implementation : Researched various ML models and implemented algorithms like Naรฏve Bayes, Decision Trees, and Random Forest to predict the best-suited crops. ??? info "Mention any additional resources used (blogs, books, chapters, articles, research papers, etc.)." - Research papers on crop prediction models. @@ -67,15 +71,16 @@ The objective is to test various regression models to see which one gives the be ### DATASET OVERVIEW & FEATURE DETAILS ๐Ÿ“‚ dataset.csv - -Contains agricultural parameters and their corresponding crop recommendations. - -๐Ÿ›  Developed Features from dataset.csv - -Data cleaning and preprocessing. - -Feature selection for improved model accuracy. - +| **Feature**| **Description** | **Data Type** | +|------------|-----------------|----------------| +| Soil_pH | Soil pH level | float | +| Humidity | Humidity level | float | +| Rainfall | Rainfall amount | float | +| N | Nitrogen level | int64 | +| P | Phosphorus level| int64 | +| K | Potassium level | int64 | +|Temperature | Temperature | float | +| crop | Recommended crop| categorical | @@ -159,6 +164,52 @@ Feature selection for improved model accuracy. ### ๐Ÿ–ฅ CODE EXPLANATION +=== "Code to compute F1-score, Precision, and Recall" + + ```py + from sklearn.metrics import precision_score, recall_score, f1_score, classification_report + + # Initialize a dictionary to store model scores + model_scores = {} + + # Iterate through each model and compute evaluation metrics + for name, model in models.items(): + print(f"Evaluating {name}...") + + # Train the model + model.fit(x_train, y_train) + + # Predict on the test set + y_pred = model.predict(x_test) + + # Compute metrics + precision = precision_score(y_test, y_pred, average='weighted') + recall = recall_score(y_test, y_pred, average='weighted') + f1 = f1_score(y_test, y_pred, average='weighted') + + # Store results + model_scores[name] = { + 'Precision': precision, + 'Recall': recall, + 'F1 Score': f1 + } + + # Print results for each model + print(f"Precision: {precision:.4f}") + print(f"Recall: {recall:.4f}") + print(f"F1 Score: {f1:.4f}") + print("\nClassification Report:\n") + print(classification_report(y_test, y_pred)) + print("-" * 50) + + # Print a summary of all model scores + print("\nSummary of Model Performance:\n") + for name, scores in model_scores.items(): + print(f"{name}: Precision={scores['Precision']:.4f}, Recall={scores['Recall']:.4f}, F1 Score={scores['F1 Score']:.4f}") + + ``` + + - This code evaluates multiple machine learning models and displays performance metrics such as Precision, Recall, F1 Score, and a Classification Report for each model. --- @@ -189,11 +240,11 @@ Feature selection for improved model accuracy. 
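A compact sketch of how the `models` dictionary used in the evaluation snippet above could be assembled and fitted before comparing the models in the next section. The path, column names (which follow the public Kaggle dataset), and hyperparameters are assumptions, not the exact notebook code, and the three algorithms are written in their classifier form since the task is crop classification:

```py
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Placeholder path; column names mirror the public Kaggle dataset and may differ locally
df = pd.read_csv("Crop_recommendation.csv")
X = df.drop(columns=["label"])   # N, P, K, temperature, humidity, ph, rainfall
y = df["label"]                  # the recommended crop

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Candidate models compared in this project, in their classifier form
models = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

for name, model in models.items():
    model.fit(x_train, y_train)
    print(f"{name}: test accuracy = {model.score(x_test, y_test):.4f}")
```

This sketch only illustrates the training loop assumed by the metric computation shown earlier; the reported accuracies in the table below come from the author's Kaggle notebook.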
## ๐Ÿ“‰ MODELS USED AND THEIR EVALUATION METRICS -| Model | Accuracy | -|---------------------------|----------| -| Naive Bayes | 99.5% | -| Random Forest Regressor | 99.3% | -| Decision Tree Regressor | 98.6% | | +| Model | Accuracy | Precision | Recall |F1-score| +|---------------------------|----------|-----------|--------|--------| +| Naive Bayes | 99.54% | 99.58% | 99.55% | 99.54% | +| Random Forest Regressor | 99.31% | 99.37% | 99.32% | 99.32% | +| Decision Tree Regressor | 98.63% | 98.68% | 98.64% | 98.63% | --- @@ -217,6 +268,7 @@ Feature selection for improved model accuracy. **Application of FarmSmart in precision farming.** - FarmSmart helps optimize resource allocation, enabling farmers to make data-driven decisions for sustainable and profitable crop production. + [https://github.com/Kashishkh/FarmSmart](https://github.com/Kashishkh/FarmSmart) === "Application 2" **Use in government agricultural advisory services.** From 1353a34db0ad04eff92313ee93b1ceaa01f53729 Mon Sep 17 00:00:00 2001 From: Kashishkh Date: Sun, 16 Feb 2025 15:40:46 +0530 Subject: [PATCH 03/19] changes completed --- docs/projects/machine-learning/crop-recommendation.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/projects/machine-learning/crop-recommendation.md b/docs/projects/machine-learning/crop-recommendation.md index b8173db7..0dce9170 100644 --- a/docs/projects/machine-learning/crop-recommendation.md +++ b/docs/projects/machine-learning/crop-recommendation.md @@ -61,8 +61,8 @@ It is an AI-powered Crop Recommendation System that helps farmers and agricultur - Model Selection & Implementation : Researched various ML models and implemented algorithms like Naรฏve Bayes, Decision Trees, and Random Forest to predict the best-suited crops. ??? info "Mention any additional resources used (blogs, books, chapters, articles, research papers, etc.)." - - Research papers on crop prediction models. - - Kaggle datasets and tutorials. + - [https://www.kaggle.com/datasets/atharvaingle/crop-recommendation-dataset/data](https://www.kaggle.com/datasets/atharvaingle/crop-recommendation-dataset/data) + --- From 70e539b89ad9222c243c012cd9c30875bf9da7c5 Mon Sep 17 00:00:00 2001 From: Kashishkh Date: Fri, 28 Feb 2025 21:21:25 +0530 Subject: [PATCH 04/19] eda_project --- .../statistics/exploratory-data-analysis.md | 237 ++++++++++++++++++ 1 file changed, 237 insertions(+) create mode 100644 docs/projects/statistics/exploratory-data-analysis.md diff --git a/docs/projects/statistics/exploratory-data-analysis.md b/docs/projects/statistics/exploratory-data-analysis.md new file mode 100644 index 00000000..1db03d81 --- /dev/null +++ b/docs/projects/statistics/exploratory-data-analysis.md @@ -0,0 +1,237 @@ +# ๐Ÿ“œ Exploratory Data Analysis + +
+ +
+ +## ๐ŸŽฏ AIM + +To analyze the Black Friday sales dataset, understand customer purchasing behavior, identify trends, and generate insights through data visualization and statistical analysis. + +## ๐Ÿ“Š DATASET LINK + +[https://www.kaggle.com/datasets/rajeshrampure/black-friday-sale/data](https://www.kaggle.com/datasets/rajeshrampure/black-friday-sale/data) + +## ๐Ÿ““ KAGGLE NOTEBOOK + +[https://www.kaggle.com/code/kashishkhurana1204/exploratory-data-analysis-eda](https://www.kaggle.com/code/kashishkhurana1204/exploratory-data-analysis-eda) + +??? Abstract "Kaggle Notebook" + + + +## โš™๏ธ TECH STACK + +| **Category** | **Technologies** | +|--------------------------|---------------------------------------------| +| **Languages** | Python | +| **Libraries/Frameworks** | Matplotlib, Pandas, Seaborn, Numpy | +| **Tools** | Github, Jupyter, VS Code, Kaggle | + +--- + +## ๐Ÿ“ DESCRIPTION + +!!! info "What is the requirement of the project?" + - Understanding customer purchasing behavior during Black Friday Sales. + - Identifying trends in product sales and demographics. + - Performing statistical analysis and data visualization. + +??? info "How is it beneficial and used?" + - Helps businesses in decision-making for better marketing strategies. + - Identifies key customer demographics for targeted advertising. + - Provides insights into which products perform well in sales. + +??? info "How did you start approaching this project? (Initial thoughts and planning)" + - I was thinking about a project that helps businesses in decision-making for better marketing strategies. + - I searched for relevant datasets on Kaggle that fulfill my project requirements. + - I found the Black Friday Sales dataset which is a perfect fit for my project. + - I started by understanding the dataset and its features. + - Data Cleaning: Handled missing values and corrected data types. + - Data Exploration: Analyzed purchasing patterns by customer demographics. + - Statistical Analysis: Derived insights using Pandas and Seaborn. + - Data Visualization: Created visual graphs for better understanding. + +??? info "Mention any additional resources used (blogs, books, chapters, articles, research papers, etc.)." + - [https://www.kaggle.com/datasets/rajeshrampure/black-friday-sale/data](https://www.kaggle.com/datasets/rajeshrampure/black-friday-sale/data) + + +--- + +## ๐Ÿ” PROJECT EXPLANATION + +### ๐Ÿงฉ DATASET OVERVIEW & FEATURE DETAILS + +??? example "๐Ÿ“‚ BlackFriday.csv" + + - The dataset contains transaction records of Black Friday Sales. + +| Feature Name | Description | Datatype | +|----------------------------|------------------------------------------|------------| +| User_ID | Unique identifier for customers | int64 | +| Product_ID | Unique identifier for products | object | +| Gender | Gender of customer | object | +| Age | Age group of customer | object | +| Occupation | Occupation category | int64 | +| City_Category | City category (A, B, C) | object | +| Stay_In_Current_City_Years | Duration of stay in the city | object | +| Marital_Status | Marital status of customer | int64 | +| Purchase | Amount spent by the customer | int64 | + + +--- + +### ๐Ÿ›ค PROJECT WORKFLOW + +!!! success "Project workflow" + + ``` mermaid + graph LR + A[Data Collection] --> B[Data Cleaning] + B --> C[Exploratory Data Analysis] + C --> D[Data Visualization] + D --> E[Conclusion & Insights] + ``` + +=== "Step 1" + **Data Loading and Preprocessing** + + - Importing the dataset using Pandas and checking the initial structure. 
+ + - Converting data types and renaming columns for consistency. + +=== "Step 2" + **Handling Missing Values and Outliers** + + - Identifying and filling/removing missing values using appropriate techniques. + + - Detecting and treating outliers using boxplots and statistical methods. + +=== "Step 3" + **Exploratory Data Analysis (EDA) with Pandas and Seaborn** + + - Understanding the distribution of key features through summary statistics. + + - Using groupby functions to analyze purchasing behavior based on demographics. + +=== "Step 4" + **Creating Visualizations for Insights** + + - Using Seaborn and Matplotlib to generate bar charts, histograms, and scatter plots. + + - Creating correlation heatmaps to identify relationships between variables. + +=== "Step 5" + **Identifying Trends and Patterns** + + - Analyzing seasonal variations in sales data. + + - Understanding the impact of age, gender, and occupation on purchase amounts. + +=== "Step 6" + **Conclusion and Final Report** + + - Summarizing the key findings from EDA. + + - Presenting actionable insights for business decision-making. + +--- + +### ๐Ÿ–ฅ CODE EXPLANATION + +=== "plotgraph() function" + + ```py + gender_sales = df.groupby('Gender')['Purchase'].sum() + + plt.figure(figsize=(6, 6)) + plt.pie(gender_sales, labels=gender_sales.index, autopct='%1.1f%%', startangle=140, textprops={'fontsize': 14}) + plt.title('Sales by Gender', fontsize=16) + + plt.show() + + age_gender_sales = df.groupby(['Age', 'Gender'])['Purchase'].sum().unstack() + + age_gender_sales.plot(kind='bar', figsize=(12, 6)) + plt.title('Sales by Age Group and Gender') + plt.xlabel('Age Group') + plt.ylabel('Total Sales') + plt.xticks(rotation=45) + plt.legend(title='Gender') + plt.show() + ``` + + - It displays the visualization graph of sales by age group and gender. + +--- + +### โš–๏ธ PROJECT TRADE-OFFS AND SOLUTIONS + +=== "Trade Off 1" + - **Trade-off:** High computational time due to large dataset. + - **Solution:** Used optimized Pandas functions to enhance performance. + +=== "Trade Off 2" + - **Trade-off:** Data Imbalance due to customer distribution. + - **Solution:** Applied statistical techniques to handle biases. + +--- + +## ๐Ÿ–ผ SCREENSHOTS + +!!! tip "Visualizations and EDA of different features" + + === "Sales by Age Group and Gender" + ![sales_by_age_group_and_gender](https://github.com/Kashishkh/-Exploratory-Data-Analysis-/blob/main/Screenshot%202025-02-28%20182656.png) + + === "Sales by City Category" + ![sales_by_city_category](https://github.com/Kashishkh/-Exploratory-Data-Analysis-/blob/main/Screenshot%202025-02-28%20182735.png) + + === "Sales by Occupation" + ![sales_by_occupation](https://github.com/Kashishkh/-Exploratory-Data-Analysis-/blob/main/Screenshot%202025-02-28%20182720.png) + + === "Purchase Behavior via Marital Status" + ![Purchase_behavior_via_marital_status](https://github.com/Kashishkh/-Exploratory-Data-Analysis-/blob/main/Screenshot%202025-02-28%20182621.png) + + === "Sales by Age Group" + ![sales_by_age _group](https://github.com/Kashishkh/-Exploratory-Data-Analysis-/blob/main/Screenshot%202025-02-28%20182744.png) + + === "Sales by Gender" + ![sales_by_gender](https://github.com/Kashishkh/-Exploratory-Data-Analysis-/blob/main/Screenshot%202025-02-28%20182706.png) + +--- + +## โœ… CONCLUSION + +### ๐Ÿ”‘ KEY LEARNINGS + +!!! tip "Insights gained from the data" + - Majority of purchases were made by young customers. + + - Men made more purchases compared to women. 
+ + - Electronic items and clothing were the top-selling categories. + +--- + +### ๐ŸŒ USE CASES + +=== "Application 1" + **Retail Analytics** + - Helps businesses understand customer behavior and target promotions accordingly. + +=== "Application 2" + **Sales Forecasting** + - Provides insights into seasonal trends and helps in inventory management. + +### ๐Ÿ”— USEFUL LINKS + +=== "GitHub Repository" + - [https://github.com/Kashishkh/-Exploratory-Data-Analysis-](https://github.com/Kashishkh/-Exploratory-Data-Analysis-) \ No newline at end of file From 96a8901b6576f93949860b3af67173ba14c81cc7 Mon Sep 17 00:00:00 2001 From: Kashishkh Date: Tue, 4 Mar 2025 00:23:21 +0530 Subject: [PATCH 05/19] Renamed exploratory-data-analysis.md to black-friday-sales-analysis.md --- ...xploratory-data-analysis.md => black-friday-sales-analysis.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename docs/projects/statistics/{exploratory-data-analysis.md => black-friday-sales-analysis.md} (100%) diff --git a/docs/projects/statistics/exploratory-data-analysis.md b/docs/projects/statistics/black-friday-sales-analysis.md similarity index 100% rename from docs/projects/statistics/exploratory-data-analysis.md rename to docs/projects/statistics/black-friday-sales-analysis.md From 69a9610009859e9f26f0134631ef41e9baddb728 Mon Sep 17 00:00:00 2001 From: Mohammed Abdul Rahman <130785777+that-ar-guy@users.noreply.github.com> Date: Mon, 3 Feb 2025 09:29:14 +0530 Subject: [PATCH 06/19] Add a "Featured In" Section to README with Open Source Program Details (#183) * added featured in section * Update README.md * iwoc year corrected * added officail link of iwoc --- README.md | 41 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 41 insertions(+) diff --git a/README.md b/README.md index 79edef4a..8f4270bb 100644 --- a/README.md +++ b/README.md @@ -58,7 +58,48 @@ - **VS Code** --- +### โ„๏ธ Featured in + + + + + + + + + + + + + +
+ + +
SSOC 2024 +
+
+ + +
SWOC 2025 +
+
+ + +
IWOC 2025 +
+
+ + +
KWOC 2024 +
+
+ + +
VSOC 2024 +
+
+--- ### ๐Ÿ‘ฅ **Contributors** A big shoutout and heartfelt thanks to all our amazing contributors for their incredible efforts and dedication! This project wouldnโ€™t be where it is without you. ๐Ÿ’– From a370a264f6d430ad496ff6831a3a6645adbd1834 Mon Sep 17 00:00:00 2001 From: Mohammed Abdul Rahman <130785777+that-ar-guy@users.noreply.github.com> Date: Sun, 9 Feb 2025 13:45:05 +0530 Subject: [PATCH 07/19] Add/rnn (#188) * index updated * added rnn md file * line removed --- .../deep-learning/neural-networks/index.md | 10 +- .../recurrent-neural-network.md | 122 ++++++++++++++++++ 2 files changed, 131 insertions(+), 1 deletion(-) create mode 100644 docs/algorithms/deep-learning/neural-networks/recurrent-neural-network.md diff --git a/docs/algorithms/deep-learning/neural-networks/index.md b/docs/algorithms/deep-learning/neural-networks/index.md index ef29ecba..26adbe63 100644 --- a/docs/algorithms/deep-learning/neural-networks/index.md +++ b/docs/algorithms/deep-learning/neural-networks/index.md @@ -12,5 +12,13 @@ - + + + Recurrent Neural Network +
+

Recurrent Neural Network

+

A deep learning model designed for sequential data processing.

+

๐Ÿ“… 2025-01-10 | โฑ๏ธ 3 mins

+
+
diff --git a/docs/algorithms/deep-learning/neural-networks/recurrent-neural-network.md b/docs/algorithms/deep-learning/neural-networks/recurrent-neural-network.md new file mode 100644 index 00000000..c456ab1c --- /dev/null +++ b/docs/algorithms/deep-learning/neural-networks/recurrent-neural-network.md @@ -0,0 +1,122 @@ +# ๐Ÿงช Recurrent Neural Network (RNN) + +
+ +
+ +## ๐ŸŽฏ Objective +Recurrent Neural Networks (RNNs) are a class of artificial neural networks designed to process sequential data. Unlike feedforward networks, RNNs have connections that allow information to persist, making them suitable for tasks such as speech recognition, text generation, and time-series forecasting. + +## ๐Ÿ“š Prerequisites +- Understanding of basic neural networks and deep learning +- Knowledge of activation functions and backpropagation +- Familiarity with sequence-based data processing +- Libraries: NumPy, TensorFlow, PyTorch + +--- + +## ๐Ÿงฌ Inputs +- A sequence of data points such as text, speech signals, or time-series data. +- Example: A sentence represented as a sequence of word embeddings for NLP tasks. + +## ๐ŸŽŽ Outputs +- Predicted sequence values or classifications. +- Example: Next word prediction in a sentence or stock price forecasting. + +--- + +## ๐Ÿฉ RNN Architecture +- RNNs maintain a **hidden state** that updates with each time step. +- At each step, the hidden state is computed as: + $$ h_t = f(W_h h_{t-1} + W_x x_t + b) $$ +- Variants of RNNs include **LSTMs (Long Short-Term Memory)** and **GRUs (Gated Recurrent Units)**, which help mitigate the vanishing gradient problem. + +## ๐Ÿ… Training Process +- The model is trained using **Backpropagation Through Time (BPTT)**. +- Uses optimizers like **Adam** or **SGD**. +- Typical hyperparameters: + - Learning rate: 0.001 + - Batch size: 64 + - Epochs: 30 + - Loss function: Cross-entropy for classification tasks, MSE for regression tasks. + +## ๐Ÿ“Š Evaluation Metrics +- Accuracy (for classification) +- Perplexity (for language models) +- Mean Squared Error (MSE) (for regression tasks) +- BLEU Score (for sequence-to-sequence models) + +--- + +## ๐Ÿ’ป Code Implementation +```python +import numpy as np +import torch +import torch.nn as nn +import torch.optim as optim + +# Define RNN Model +class RNN(nn.Module): + def __init__(self, input_size, hidden_size, output_size): + super(RNN, self).__init__() + self.hidden_size = hidden_size + self.rnn = nn.RNN(input_size, hidden_size, batch_first=True) + self.fc = nn.Linear(hidden_size, output_size) + + def forward(self, x, hidden): + out, hidden = self.rnn(x, hidden) + out = self.fc(out[:, -1, :]) + return out, hidden + +# Model Training +input_size = 10 # Number of input features +hidden_size = 20 # Number of hidden neurons +output_size = 1 # Output dimension + +model = RNN(input_size, hidden_size, output_size) +criterion = nn.MSELoss() +optimizer = optim.Adam(model.parameters(), lr=0.001) + +# Sample Training Loop +for epoch in range(10): + optimizer.zero_grad() + inputs = torch.randn(32, 5, input_size) # (batch_size, seq_length, input_size) + hidden = torch.zeros(1, 32, hidden_size) # Initial hidden state + outputs, hidden = model(inputs, hidden) + loss = criterion(outputs, torch.randn(32, output_size)) + loss.backward() + optimizer.step() + print(f"Epoch {epoch+1}, Loss: {loss.item()}") +``` + +## ๐Ÿ” Understanding the Code +- **Model Definition:** + - The `RNN` class defines a simple recurrent neural network with an input layer, a recurrent layer, and a fully connected output layer. +- **Forward Pass:** + - Takes an input sequence, processes it through the RNN layer, and generates an output. +- **Training Loop:** + - Uses randomly generated data for demonstration. + - Optimizes weights using the Adam optimizer and mean squared error loss. + +--- + +## ๐ŸŒŸ Advantages +- Effective for sequential data modeling. 
+- Capable of handling variable-length inputs. +- Works well for applications like text generation and speech recognition. + +## โš ๏ธ Limitations +- Struggles with long-range dependencies due to vanishing gradients. +- Training can be slow due to sequential computations. +- Alternatives like **LSTMs and GRUs** are preferred for longer sequences. + +## ๐Ÿš€ Applications +### Natural Language Processing (NLP) +- Text prediction +- Sentiment analysis +- Machine translation + +### Time-Series Forecasting +- Stock price prediction +- Weather forecasting +- Healthcare monitoring (e.g., ECG signals) \ No newline at end of file From 4fd3e0e328404831c8f0f5af952b1ba1f9cef9f2 Mon Sep 17 00:00:00 2001 From: Avdhesh-Varshney <114330097+Avdhesh-Varshney@users.noreply.github.com> Date: Sun, 9 Feb 2025 17:00:48 +0530 Subject: [PATCH 08/19] update: guidelines --- CONTRIBUTING.md | 23 ------ README.md | 212 +++++++++++++++++++++++++++++++++--------------- 2 files changed, 145 insertions(+), 90 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index b8a06cc4..b70d4eb0 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -36,29 +36,6 @@ git push -u origin --- -### Important Points to remember while submitting your work ๐Ÿ“ - -We want your work to be readable by others; therefore, we encourage you to note the following: - -1. File names should be in `kebab-case` letters (e.g., `music-genre-classification-model`, `insurance-cross-sell-prediction`). -2. Follow the [***PROJECT README TEMPLATE***](./docs/project-readme-template.md) and [***ALGORITHM README TEMPLATE***](./docs/algorithm-readme-template.md) for refrence. -3. Do not upload images or video files directly. Use a GitHub raw URL in the documentation. -4. Upload your notebook to Kaggle, make it public, and share the Kaggle embed link only. Other links are not accepted. -5. Limit commits to 3-4 unless given permission by project Admins or Mentors. -6. Keep commit messages clear and relevant; avoid unnecessary details. - -### Pull Requests Review Criteria ๐Ÿงฒ - -1. It must required to follow mentioned [do/don't](https://github.com/Avdhesh-Varshney/AI-Code/issues/9) guidelines. -2. Please fill the ***PR Template*** properly while making a Pull Request. -3. Do not commit directly to the `main` branch, or your PR will be instantly rejected. -4. Ensure all work is original and not copied from other sources. -5. Add comments to your code wherever necessary for clarity. -6. Include a working video and show integration with `AI-Code MkDocs Documentation` website as part of your PR. -7. For frontend updates, share screenshots and work samples before submitting a PR. - ---- - ### Communication and Support ๐Ÿ’ฌ - Join the project's communication channels to interact with other contributors and seek assistance. - If you have any questions or need help, don't hesitate to ask in the project's communication channels or comment on the relevant issue. diff --git a/README.md b/README.md index 8f4270bb..1f95a4d2 100644 --- a/README.md +++ b/README.md @@ -1,29 +1,71 @@ -# AI-Code +

Hey <๐šŒ๐š˜๐š๐šŽ๐š›๐šœ/>! ๐Ÿ‘‹

-![AI](https://img.shields.io/badge/AI-ff5733?style=flat-square) -![DL](https://img.shields.io/badge/DL-007bff?style=flat-square) -![ML](https://img.shields.io/badge/ML-ffc300?style=flat-square) -![GAN](https://img.shields.io/badge/GAN-6a1b9a?style=flat-square) -![NLP](https://img.shields.io/badge/NLP-28a745?style=flat-square) -![OpenCV](https://img.shields.io/badge/OpenCV-34495e?style=flat-square) -![Pre-processing](https://img.shields.io/badge/Pre--processing-e67e22?style=flat-square) +[![Typing SVG](https://readme-typing-svg.demolab.com?font=Monoton&size=85&pause=12&speed=12&color=00FF00¢er=true&vCenter=true&width=2000&height=200&lines=Hello+World!;Welcome+to+AI-Code!;Learn,+Build,+Contribute!;Master+AI+with+Hands-on+Projects!;Machine+Learning+to+LLMs!;Scratch+Code+for+Every+Algorithm!;Collaborate.+Innovate.+Inspire!;Your+AI+Journey+Starts+Here!)](https://git.io/typing-svg) ---- +
+

+ + + + + +

+ + + + + + + +
+ + + + +

+ +

+ + ![Python](https://img.shields.io/badge/python-3670A0?style=for-the-badge&logo=python&logoColor=ffdd54) + ![Markdown](https://img.shields.io/badge/markdown-%23000000.svg?style=for-the-badge&logo=markdown&logoColor=white) + ![Git](https://img.shields.io/badge/git-%23F05033.svg?style=for-the-badge&logo=git&logoColor=white) + ![GitHub](https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white) + ![Visual Studio Code](https://img.shields.io/badge/Visual%20Studio%20Code-0078d7.svg?style=for-the-badge&logo=visual-studio-code&logoColor=white) +

-### ๐ŸŒŸ **Overview** -**AI-Code** simplifies learning AI technologies with **easy-to-follow** code and **real-world project** guides for ML, DL, GAN, NLP, OpenCV, and more. +

+ + ![Statstics](https://img.shields.io/badge/Statistics-e67e22?style=for-the-badge) + ![ML](https://img.shields.io/badge/ML-%23FF7F50.svg?style=for-the-badge) + ![DL](https://img.shields.io/badge/DL-%23FF6347.svg?style=for-the-badge) + ![NLP](https://img.shields.io/badge/NLP-%23706FD3.svg?style=for-the-badge) + ![OpenCV](https://img.shields.io/badge/OpenCV-34495e?style=for-the-badge) + ![GAN](https://img.shields.io/badge/GAN-%23FF69B4.svg?style=for-the-badge) + ![LLM](https://img.shields.io/badge/LLM-%238E44AD.svg?style=for-the-badge) + ![AI](https://img.shields.io/badge/AI-%234A90E2.svg?style=for-the-badge) +

+ +
--- -### ๐Ÿ”‘ **Core Features** +#### :zap: About AI Code ๐ŸŒŸ + +**AI Code** is an open-source initiative designed to make learning **Artificial Intelligence (AI)** more accessible, structured, and hands-on. Whether you're a beginner or an experienced developer, AI-Code provides **scratch implementations** of various **AI algorithms** alongside **real-world project guides**, helping you bridge the gap between theory and practice. + +
+

:zap: Core Features ๐Ÿ”‘

+ - Scratch-level implementations of **AI algorithms** ๐Ÿง  - **Guides**, datasets, research papers, and **step-by-step tutorials** ๐Ÿ“˜ - Clear directories with focused **README** files ๐Ÿ“‚ - Fast learning with minimal complexity ๐Ÿš€ ---- +
+ +
+

:zap: Setup the Project ๐Ÿฑ

-### โšก **Setup the Project** 1. Go through the [Contributing Guidelines](./CONTRIBUTING.md) to fork and clone the project. 2. After forking and cloning the project in your local system: - Create a virtual environment: @@ -49,66 +91,102 @@ ``` 4. Open the local server URL (usually `http://127.0.0.1:8000`) in your browser. You are now ready to work on the project. ---- +
-### ๐Ÿ› ๏ธ **Tech Stack** -- **Python 3.8+** -- **Markdown** -- **Git/GitHub** -- **VS Code** +
+

:zap: Important Points to remember while submitting your work ๐Ÿ“

+ +> We want your work to be readable by others; therefore, we encourage you to note the following: + +1. File names should be in `kebab-case` letters (e.g., `music-genre-classification-model`, `insurance-cross-sell-prediction`). +2. Follow the [***PROJECT README TEMPLATE***](./docs/project-readme-template.md) and [***ALGORITHM README TEMPLATE***](./docs/algorithm-readme-template.md) for refrence. +3. Do not upload images or video files directly. Use a GitHub raw URL in the documentation. +4. Upload your notebook to Kaggle, make it public, and share the Kaggle embed link only. Other links are not accepted. +5. Limit commits to 3-4 unless given permission by project Admins or Mentors. +6. Keep commit messages clear and relevant; avoid unnecessary details. + +
+ +
+

:zap: Pull Requests Review Criteria ๐Ÿงฒ

+ +1. It must required to follow mentioned [do/don't](https://github.com/Avdhesh-Varshney/AI-Code/issues/9) guidelines. +2. Please fill the ***PR Template*** properly while making a Pull Request. +3. Do not commit directly to the `main` branch, or your PR will be instantly rejected. +4. Ensure all work is original and not copied from other sources. +5. Add comments to your code wherever necessary for clarity. +6. Include a working video and show integration with `AI-Code MkDocs Documentation` website as part of your PR. +7. For frontend updates, share screenshots and work samples before submitting a PR. + +
+ +--- + +
+ +### โ„๏ธ Open Source Programs ---- -### โ„๏ธ Featured in - - - - - - - - - - - - + + + + + + + + + +
- - -
SSOC 2024 -
-
- - -
SWOC 2025 -
-
- - -
IWOC 2025 -
-
- - -
KWOC 2024 -
-
- - -
VSOC 2024 -
-
+
+ +

SSOC

+ 2024 +
+
+
+ +

VSOC

+ 2024 +
+
+
+ +

KWOC

+ 2024 +
+
+
+ +

IWOC

+ 2025 +
+
+
+ +

SWOC

+ 2025 +
+
+
+ +

DWOC

+ 2025 +
+
---- -### ๐Ÿ‘ฅ **Contributors** +### โœจ Our Valuable Contributors - A big shoutout and heartfelt thanks to all our amazing contributors for their incredible efforts and dedication! This project wouldnโ€™t be where it is without you. ๐Ÿ’– - - Contributors - --- + + + + +![Line](https://github.com/Avdhesh-Varshney/WebMasterLog/assets/114330097/4b78510f-a941-45f8-a9d5-80ed0705e847) + +# Tip from us ๐Ÿ˜‡ +##### It always takes time to understand and learn. So, don't worry at all. We know you have got this! ๐Ÿ’ช +### Show some  โค๏ธ  by  ๐ŸŒŸ  this repository! -
-

๐Ÿ’™ Like the project?  ๐ŸŒŸ Star it!

From ff862cad58dcdd75bdab05549aa012259ff9730d Mon Sep 17 00:00:00 2001 From: Avdhesh-Varshney <114330097+Avdhesh-Varshney@users.noreply.github.com> Date: Sun, 9 Feb 2025 17:01:18 +0530 Subject: [PATCH 09/19] feat-add: libraries-exploration --- docs/libraries/index.md | 25 +++++++++++++++++++++++++ docs/libraries/numpy.md | 2 ++ docs/libraries/pandas.md | 2 ++ mkdocs.yml | 1 + 4 files changed, 30 insertions(+) create mode 100644 docs/libraries/index.md create mode 100644 docs/libraries/numpy.md create mode 100644 docs/libraries/pandas.md diff --git a/docs/libraries/index.md b/docs/libraries/index.md new file mode 100644 index 00000000..f71f26ed --- /dev/null +++ b/docs/libraries/index.md @@ -0,0 +1,25 @@ +# Libraries & Packages ๐Ÿ“š + + diff --git a/docs/libraries/numpy.md b/docs/libraries/numpy.md new file mode 100644 index 00000000..8e9eb3f4 --- /dev/null +++ b/docs/libraries/numpy.md @@ -0,0 +1,2 @@ +# Numpy + diff --git a/docs/libraries/pandas.md b/docs/libraries/pandas.md new file mode 100644 index 00000000..da5514e8 --- /dev/null +++ b/docs/libraries/pandas.md @@ -0,0 +1,2 @@ +# Pandas + diff --git a/mkdocs.yml b/mkdocs.yml index dfd4a594..655a159a 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -5,6 +5,7 @@ nav: - ๐Ÿ  Home: index.md - ๐Ÿ”ท Algorithms: algorithms/index.md - ๐ŸŽ‰ Projects: projects/index.md + - ๐Ÿ“š Libraries/Packages: libraries/index.md - ๐Ÿ“ Contribute: contribute.md - ๐Ÿงฎ Algorithm Template: algorithm-readme-template.md - ๐Ÿ“œ Project Template: project-readme-template.md From f60b1b72ec19de121a6570702ba900436be42e79 Mon Sep 17 00:00:00 2001 From: that-ar-guy Date: Wed, 12 Feb 2025 23:15:06 +0530 Subject: [PATCH 10/19] index updated --- docs/projects/deep-learning/anamoly-detection.md | 0 docs/projects/deep-learning/index.md | 9 +++++++++ 2 files changed, 9 insertions(+) create mode 100644 docs/projects/deep-learning/anamoly-detection.md diff --git a/docs/projects/deep-learning/anamoly-detection.md b/docs/projects/deep-learning/anamoly-detection.md new file mode 100644 index 00000000..e69de29b diff --git a/docs/projects/deep-learning/index.md b/docs/projects/deep-learning/index.md index 7d210a0f..068507f4 100644 --- a/docs/projects/deep-learning/index.md +++ b/docs/projects/deep-learning/index.md @@ -11,6 +11,15 @@

๐Ÿ“… 2025-01-10 | โฑ๏ธ 10 mins

+ + + +
+

LSTM Autoencoder for Time Series Anomaly Detection

+

A deep learning approach to detect anomalies in time series data.

+

๐Ÿ“… 2025-02-12 | โฑ๏ธ 10 mins

+
+
From cfeee240e5b72ae2a1a9157a7f5cf0010813aa13 Mon Sep 17 00:00:00 2001 From: that-ar-guy Date: Wed, 12 Feb 2025 23:20:03 +0530 Subject: [PATCH 11/19] page created --- .../deep-learning/anamoly-detection.md | 147 ++++++++++++++++++ 1 file changed, 147 insertions(+) diff --git a/docs/projects/deep-learning/anamoly-detection.md b/docs/projects/deep-learning/anamoly-detection.md index e69de29b..5ac9d936 100644 --- a/docs/projects/deep-learning/anamoly-detection.md +++ b/docs/projects/deep-learning/anamoly-detection.md @@ -0,0 +1,147 @@ +# Time-Series Anomaly Detection + +### AIM + +To detect anomalies in time-series data using Long Short-Term Memory (LSTM) networks. + +### DATASET + +Synthetic time-series data generated using sine wave with added noise. + +### KAGGLE NOTEBOOK +[https://www.kaggle.com/code/thatarguy/lstm-anamoly-detection/notebook](https://www.kaggle.com/code/thatarguy/lstm-anamoly-detection/notebook) + +### LIBRARIES NEEDED + + - numpy + - pandas + - yfinance + - matplotlib + - tensorflow + - scikit-learn + +--- + +### DESCRIPTION + +!!! info "What is the requirement of the project?" + - The project focuses on identifying anomalies in time-series data using an LSTM autoencoder. The model learns normal patterns and detects deviations indicating anomalies. + +??? info "Why is it necessary?" + - Anomaly detection is crucial in various domains such as finance, healthcare, and cybersecurity, where detecting unexpected behavior can prevent failures, fraud, or security breaches. + +??? info "How is it beneficial and used?" + - Businesses can use it to detect irregularities in stock market trends. + - It can help monitor industrial equipment to identify faults before failures occur. + - It can be applied in fraud detection for financial transactions. + +??? info "How did you start approaching this project? (Initial thoughts and planning)" + - Understanding time-series anomaly detection methodologies. + - Generating synthetic data to simulate real-world scenarios. + - Implementing an LSTM autoencoder to learn normal patterns and detect anomalies. + - Evaluating model performance using Mean Squared Error (MSE). + +??? info "Mention any additional resources used (blogs, books, chapters, articles, research papers, etc.)." + - Research paper: "Deep Learning for Time-Series Anomaly Detection" + - Public notebook: LSTM Autoencoder for Anomaly Detection + +--- + +### Model Architecture + - The LSTM autoencoder learns normal time-series behavior and reconstructs it. Any deviation is considered an anomaly. + - Encoder: Extracts patterns using LSTM layers. + - Bottleneck: Compresses the data representation. + - Decoder: Reconstructs the original sequence. + - The reconstruction error determines anomalies. 
+ +### Model Structure + - Input: Time-series sequence (50 time steps) + - LSTM Layers for encoding + - Repeat Vector to retain sequence information + - LSTM Layers for decoding + - TimeDistributed Dense Layer for reconstruction + - Loss Function: Mean Squared Error (MSE) + +--- + +#### WHAT I HAVE DONE + +=== "Step 1" + + Exploratory Data Analysis + + - Generate synthetic data (sine wave with noise) + - Normalize data using MinMaxScaler + - Split data into training and validation sets + +=== "Step 2" + + Data Cleaning and Preprocessing + + - Create sequential data using a rolling window approach + - Reshape data for LSTM compatibility + +=== "Step 3" + + Feature Engineering and Selection + + - Use LSTM layers for sequence modeling + - Implement autoencoder-based reconstruction + +=== "Step 4" + + Modeling + + - Train an LSTM autoencoder + - Optimize loss function using Adam optimizer + - Monitor validation loss for overfitting prevention + +=== "Step 5" + + Result Analysis + + - Compute reconstruction error for anomaly detection + - Identify threshold for anomalies using percentile-based method + - Visualize detected anomalies using Matplotlib + +--- + +#### PROJECT TRADE-OFFS AND SOLUTIONS + +=== "Trade Off 1" + + **Reconstruction Error Threshold Selection:** + Setting a high threshold may miss subtle anomalies, while a low threshold might increase false positives. + + - **Solution**: Use the 95th percentile of reconstruction errors as the threshold to balance false positives and false negatives. + +--- + +### CONCLUSION + +#### WHAT YOU HAVE LEARNED + +!!! tip "Insights gained from the data" + - Time-series anomalies often appear as sudden deviations from normal patterns. + +??? tip "Improvements in understanding machine learning concepts" + - Learned about LSTM autoencoders and their ability to reconstruct normal sequences. + +??? tip "Challenges faced and how they were overcome" + - Handling high reconstruction errors by tuning model hyperparameters. + - Selecting an appropriate anomaly threshold using statistical methods. + +--- + +#### USE CASES OF THIS MODEL + +=== "Application 1" + + - Financial fraud detection through irregular transaction patterns. + +=== "Application 2" + + - Predictive maintenance in industrial settings by identifying equipment failures. + +--- + From 7fd4bd9e108f03c19b3ef613c56827e755dbd8ab Mon Sep 17 00:00:00 2001 From: that-ar-guy Date: Tue, 18 Feb 2025 23:09:30 +0530 Subject: [PATCH 12/19] ss to be added --- .../deep-learning/anamoly-detection.md | 145 ++++++++++-------- 1 file changed, 84 insertions(+), 61 deletions(-) diff --git a/docs/projects/deep-learning/anamoly-detection.md b/docs/projects/deep-learning/anamoly-detection.md index 5ac9d936..f0fe3e8c 100644 --- a/docs/projects/deep-learning/anamoly-detection.md +++ b/docs/projects/deep-learning/anamoly-detection.md @@ -1,28 +1,34 @@ -# Time-Series Anomaly Detection +# ๐Ÿ“œ Time-Series Anomaly Detection -### AIM +
+ +
+## ๐ŸŽฏ AIM To detect anomalies in time-series data using Long Short-Term Memory (LSTM) networks. -### DATASET +## ๐Ÿ“Š DATASET LINK +[NOT USED] -Synthetic time-series data generated using sine wave with added noise. - -### KAGGLE NOTEBOOK +## ๐Ÿ““ KAGGLE NOTEBOOK [https://www.kaggle.com/code/thatarguy/lstm-anamoly-detection/notebook](https://www.kaggle.com/code/thatarguy/lstm-anamoly-detection/notebook) -### LIBRARIES NEEDED +??? Abstract "Kaggle Notebook" + + + - - numpy - - pandas - - yfinance - - matplotlib - - tensorflow - - scikit-learn +## โš™๏ธ TECH STACK + +| **Category** | **Technologies** | +|--------------------------|---------------------------------------------| +| **Languages** | Python | +| **Libraries/Frameworks** | TensorFlow, Keras, scikit-learn, numpy, pandas, matplotlib | +| **Tools** | Jupyter Notebook, VS Code | --- -### DESCRIPTION +## ๐Ÿ“ DESCRIPTION !!! info "What is the requirement of the project?" - The project focuses on identifying anomalies in time-series data using an LSTM autoencoder. The model learns normal patterns and detects deviations indicating anomalies. @@ -47,79 +53,98 @@ Synthetic time-series data generated using sine wave with added noise. --- -### Model Architecture - - The LSTM autoencoder learns normal time-series behavior and reconstructs it. Any deviation is considered an anomaly. - - Encoder: Extracts patterns using LSTM layers. - - Bottleneck: Compresses the data representation. - - Decoder: Reconstructs the original sequence. - - The reconstruction error determines anomalies. - -### Model Structure - - Input: Time-series sequence (50 time steps) - - LSTM Layers for encoding - - Repeat Vector to retain sequence information - - LSTM Layers for decoding - - TimeDistributed Dense Layer for reconstruction - - Loss Function: Mean Squared Error (MSE) +## ๐Ÿ” PROJECT EXPLANATION + +### ๐Ÿงฉ DATASET OVERVIEW & FEATURE DETAILS + +??? example "๐Ÿ“‚ Synthetic dataset" + + - The dataset consists of a sine wave with added noise. + + | Feature Name | Description | Datatype | + |--------------|-------------|:------------:| + | time | Timestamp | int64 | + | value | Sine wave value with noise | float64 | --- -#### WHAT I HAVE DONE +### ๐Ÿ›ค PROJECT WORKFLOW -=== "Step 1" +!!! success "Project workflow" - Exploratory Data Analysis + ``` mermaid + graph LR + A[Start] --> B{Generate Data}; + B --> C[Normalize Data]; + C --> D[Create Sequences]; + D --> E[Train LSTM Autoencoder]; + E --> F[Compute Reconstruction Error]; + F --> G[Identify Anomalies]; + ``` +=== "Step 1" - Generate synthetic data (sine wave with noise) - Normalize data using MinMaxScaler - Split data into training and validation sets === "Step 2" - - Data Cleaning and Preprocessing - - Create sequential data using a rolling window approach - Reshape data for LSTM compatibility === "Step 3" + - Implement LSTM autoencoder for anomaly detection + - Optimize model using Adam optimizer - Feature Engineering and Selection +=== "Step 4" + - Compute reconstruction error for anomaly detection + - Identify threshold for anomalies using percentile-based method - - Use LSTM layers for sequence modeling - - Implement autoencoder-based reconstruction +=== "Step 5" + - Visualize detected anomalies using Matplotlib -=== "Step 4" +--- - Modeling +### ๐Ÿ–ฅ CODE EXPLANATION - - Train an LSTM autoencoder - - Optimize loss function using Adam optimizer - - Monitor validation loss for overfitting prevention +=== "LSTM Autoencoder" + - The model consists of an encoder, bottleneck, and decoder. 
+ - It learns normal time-series behavior and reconstructs it. + - Deviations from normal patterns are considered anomalies. -=== "Step 5" +--- - Result Analysis +### โš–๏ธ PROJECT TRADE-OFFS AND SOLUTIONS - - Compute reconstruction error for anomaly detection - - Identify threshold for anomalies using percentile-based method - - Visualize detected anomalies using Matplotlib +=== "Reconstruction Error Threshold Selection" + - Setting a high threshold may miss subtle anomalies, while a low threshold might increase false positives. + - **Solution**: Use the 95th percentile of reconstruction errors as the threshold to balance false positives and false negatives. --- -#### PROJECT TRADE-OFFS AND SOLUTIONS +## ๐Ÿ–ผ SCREENSHOTS -=== "Trade Off 1" +!!! tip "Visualizations and EDA of different features" - **Reconstruction Error Threshold Selection:** - Setting a high threshold may miss subtle anomalies, while a low threshold might increase false positives. + === "Synthetic Data Plot" + - - **Solution**: Use the 95th percentile of reconstruction errors as the threshold to balance false positives and false negatives. +??? example "Model performance graphs" + + === "Reconstruction Error Plot" --- -### CONCLUSION +## ๐Ÿ“‰ MODELS USED AND THEIR EVALUATION METRICS -#### WHAT YOU HAVE LEARNED +| Model | Reconstruction Error (MSE) | +|------------------|---------------------------| +| LSTM Autoencoder | 0.015 | + +--- + +## โœ… CONCLUSION + +### ๐Ÿ”‘ KEY LEARNINGS !!! tip "Insights gained from the data" - Time-series anomalies often appear as sudden deviations from normal patterns. @@ -133,15 +158,13 @@ Synthetic time-series data generated using sine wave with added noise. --- -#### USE CASES OF THIS MODEL - -=== "Application 1" +### ๐ŸŒ USE CASES - - Financial fraud detection through irregular transaction patterns. +=== "Financial Fraud Detection" + - Detect irregular transaction patterns using anomaly detection. -=== "Application 2" +=== "Predictive Maintenance" + - Identify equipment failures in industrial settings before they occur. - - Predictive maintenance in industrial settings by identifying equipment failures. ---- From d5908557fdace12e7f2637f7d79206ec2c19cccb Mon Sep 17 00:00:00 2001 From: Mohammed Abdul Rahman <130785777+that-ar-guy@users.noreply.github.com> Date: Tue, 18 Feb 2025 23:13:29 +0530 Subject: [PATCH 13/19] added images need to check locally --- docs/projects/deep-learning/anamoly-detection.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/projects/deep-learning/anamoly-detection.md b/docs/projects/deep-learning/anamoly-detection.md index f0fe3e8c..2cbf0954 100644 --- a/docs/projects/deep-learning/anamoly-detection.md +++ b/docs/projects/deep-learning/anamoly-detection.md @@ -126,12 +126,13 @@ To detect anomalies in time-series data using Long Short-Term Memory (LSTM) netw !!! tip "Visualizations and EDA of different features" === "Synthetic Data Plot" - + ![img](https://github.com/user-attachments/assets/4ff144a9-756a-43e3-aba2-609d92cbacd2) + ??? 
example "Model performance graphs" === "Reconstruction Error Plot" - + ![img](https://github.com/user-attachments/assets/e33a0537-9e23-4e21-b0e5-153a78ac4000) --- ## ๐Ÿ“‰ MODELS USED AND THEIR EVALUATION METRICS From 4197e282c18cbdcdc9a82df08f37f586bc5070dd Mon Sep 17 00:00:00 2001 From: that-ar-guy Date: Tue, 18 Feb 2025 23:15:26 +0530 Subject: [PATCH 14/19] images are correct --- docs/projects/deep-learning/anamoly-detection.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/projects/deep-learning/anamoly-detection.md b/docs/projects/deep-learning/anamoly-detection.md index 2cbf0954..5615d8b4 100644 --- a/docs/projects/deep-learning/anamoly-detection.md +++ b/docs/projects/deep-learning/anamoly-detection.md @@ -126,13 +126,13 @@ To detect anomalies in time-series data using Long Short-Term Memory (LSTM) netw !!! tip "Visualizations and EDA of different features" === "Synthetic Data Plot" - ![img](https://github.com/user-attachments/assets/4ff144a9-756a-43e3-aba2-609d92cbacd2) + ![img](https://github.com/user-attachments/assets/e33a0537-9e23-4e21-b0e5-153a78ac4000) ??? example "Model performance graphs" === "Reconstruction Error Plot" - ![img](https://github.com/user-attachments/assets/e33a0537-9e23-4e21-b0e5-153a78ac4000) + ![img](https://github.com/user-attachments/assets/4ff144a9-756a-43e3-aba2-609d92cbacd2) --- ## ๐Ÿ“‰ MODELS USED AND THEIR EVALUATION METRICS From b33ec90525c6ff0b7b18bbfa5b893d198a227d93 Mon Sep 17 00:00:00 2001 From: Avdhesh-Varshney <114330097+Avdhesh-Varshney@users.noreply.github.com> Date: Mon, 24 Feb 2025 11:29:43 +0530 Subject: [PATCH 15/19] delete: algo & lib --- docs/algorithm-readme-template.md | 262 -------------- .../evolutionary-algorithms/index.md | 11 - .../expert-systems/index.md | 11 - .../artificial-intelligence/index.md | 60 ---- .../knowledge-based-systems/index.md | 11 - .../reinforcement-learning/index.md | 11 - .../search-and-optimization/index.md | 11 - .../image-augmentation/index.md | 11 - .../computer-vision/image-processing/index.md | 11 - docs/algorithms/computer-vision/index.md | 49 --- .../computer-vision/object-detection/index.md | 11 - .../semantic-segmentation/index.md | 11 - .../deep-learning/architectures/index.md | 11 - docs/algorithms/deep-learning/index.md | 49 --- .../convolutional-neural-network.md | 115 ------- .../deep-learning/neural-networks/index.md | 24 -- .../recurrent-neural-network.md | 122 ------- .../optimization-algorithms/index.md | 11 - .../deep-learning/pre-trained-models/index.md | 11 - .../generative-adversarial-networks/ac-gan.md | 170 ---------- .../basic-gan.md | 222 ------------ .../generative-adversarial-networks/c-gan.md | 251 -------------- .../generative-adversarial-networks/eb-gan.md | 225 ------------ .../generative-adversarial-networks/index.md | 56 --- .../info-gan.md | 255 -------------- docs/algorithms/index.md | 93 ----- .../large-language-models/bert/index.md | 11 - .../large-language-models/bloom/index.md | 11 - .../large-language-models/gpt-series/index.md | 11 - .../algorithms/large-language-models/index.md | 49 --- .../large-language-models/t5/index.md | 11 - .../machine-learning/boosting/index.md | 15 - .../machine-learning/boosting/light-gbm.md | 128 ------- .../data-preprocessing/encoding/index.md | 16 - .../encoding/ordinal-encoder.md | 115 ------- .../data-preprocessing/imputation/index.md | 11 - .../data-preprocessing/index.md | 40 --- .../scaling-and-normalization/index.md | 27 -- .../min-max-scaler.md | 133 -------- .../standard-scaler.md | 140 -------- 
docs/algorithms/machine-learning/index.md | 49 --- .../supervised/classifications/index.md | 10 - .../machine-learning/supervised/index.md | 27 -- .../supervised/regressions/adaboost.md | 144 -------- .../supervised/regressions/bayesian.md | 94 ----- .../supervised/regressions/decision-tree.md | 205 ----------- .../supervised/regressions/elastic-net.md | 92 ----- .../regressions/gradient-boosting.md | 218 ------------ .../supervised/regressions/huber.md | 98 ------ .../supervised/regressions/index.md | 166 --------- .../regressions/k-nearest-neighbors.md | 94 ----- .../supervised/regressions/lasso.md | 100 ------ .../supervised/regressions/linear.md | 115 ------- .../supervised/regressions/logistic.md | 174 ---------- .../supervised/regressions/neural-network.md | 128 ------- .../supervised/regressions/polynomial.md | 114 ------- .../supervised/regressions/random-forest.md | 244 ------------- .../supervised/regressions/ridge.md | 101 ------ .../supervised/regressions/support-vector.md | 140 -------- .../supervised/regressions/xg-boost.md | 127 ------- .../unsupervised/clustering/index.md | 16 - .../clustering/kmeans-clustering.md | 185 ---------- .../dimensionality-reduction/index.md | 11 - .../machine-learning/unsupervised/index.md | 27 -- .../Bag_Of_Words.md | 51 --- .../natural-language-processing/Fast_Text.md | 228 ------------- .../natural-language-processing/GloVe.md | 223 ------------ .../natural-language-processing/NLTK_Setup.md | 241 ------------- .../N_L_P_Introduction.md | 67 ---- .../Text_PreProcessing_Techniques.md | 320 ------------------ .../natural-language-processing/Tf_Idf.md | 192 ----------- .../Transformers.md | 85 ----- .../natural-language-processing/Word_2_Vec.md | 222 ------------ .../Word_Embeddings.md | 128 ------- .../natural-language-processing/index.md | 105 ------ .../statistics/descriptive/index.md | 11 - docs/algorithms/statistics/index.md | 50 --- .../statistics/inferential/index.md | 11 - .../errors/Mean_Absolute_Error.md | 18 - .../errors/Mean_Squared_Error.md | 18 - .../errors/R2_Squared_Error.md | 21 -- .../errors/Root_Mean_Squared_Error.md | 19 -- .../statistics/metrics-and-losses/index.md | 2 - .../loss-functions/Cross_Entropy_Loss.md | 170 ---------- .../loss-functions/Hinge_Loss.md | 39 --- .../Kullback_Leibler_Divergence_Loss.md | 54 --- .../loss-functions/Ranking_Losses.md | 64 ---- .../statistics/probability/index.md | 11 - docs/customs/extra.css | 49 --- docs/libraries/index.md | 25 -- docs/libraries/numpy.md | 2 - docs/libraries/pandas.md | 2 - 92 files changed, 7910 deletions(-) delete mode 100644 docs/algorithm-readme-template.md delete mode 100644 docs/algorithms/artificial-intelligence/evolutionary-algorithms/index.md delete mode 100644 docs/algorithms/artificial-intelligence/expert-systems/index.md delete mode 100644 docs/algorithms/artificial-intelligence/index.md delete mode 100644 docs/algorithms/artificial-intelligence/knowledge-based-systems/index.md delete mode 100644 docs/algorithms/artificial-intelligence/reinforcement-learning/index.md delete mode 100644 docs/algorithms/artificial-intelligence/search-and-optimization/index.md delete mode 100644 docs/algorithms/computer-vision/image-augmentation/index.md delete mode 100644 docs/algorithms/computer-vision/image-processing/index.md delete mode 100644 docs/algorithms/computer-vision/index.md delete mode 100644 docs/algorithms/computer-vision/object-detection/index.md delete mode 100644 docs/algorithms/computer-vision/semantic-segmentation/index.md delete mode 100644 
docs/algorithms/deep-learning/architectures/index.md delete mode 100644 docs/algorithms/deep-learning/index.md delete mode 100644 docs/algorithms/deep-learning/neural-networks/convolutional-neural-network.md delete mode 100644 docs/algorithms/deep-learning/neural-networks/index.md delete mode 100644 docs/algorithms/deep-learning/neural-networks/recurrent-neural-network.md delete mode 100644 docs/algorithms/deep-learning/optimization-algorithms/index.md delete mode 100644 docs/algorithms/deep-learning/pre-trained-models/index.md delete mode 100644 docs/algorithms/generative-adversarial-networks/ac-gan.md delete mode 100644 docs/algorithms/generative-adversarial-networks/basic-gan.md delete mode 100644 docs/algorithms/generative-adversarial-networks/c-gan.md delete mode 100644 docs/algorithms/generative-adversarial-networks/eb-gan.md delete mode 100644 docs/algorithms/generative-adversarial-networks/index.md delete mode 100644 docs/algorithms/generative-adversarial-networks/info-gan.md delete mode 100644 docs/algorithms/index.md delete mode 100644 docs/algorithms/large-language-models/bert/index.md delete mode 100644 docs/algorithms/large-language-models/bloom/index.md delete mode 100644 docs/algorithms/large-language-models/gpt-series/index.md delete mode 100644 docs/algorithms/large-language-models/index.md delete mode 100644 docs/algorithms/large-language-models/t5/index.md delete mode 100644 docs/algorithms/machine-learning/boosting/index.md delete mode 100644 docs/algorithms/machine-learning/boosting/light-gbm.md delete mode 100644 docs/algorithms/machine-learning/data-preprocessing/encoding/index.md delete mode 100644 docs/algorithms/machine-learning/data-preprocessing/encoding/ordinal-encoder.md delete mode 100644 docs/algorithms/machine-learning/data-preprocessing/imputation/index.md delete mode 100644 docs/algorithms/machine-learning/data-preprocessing/index.md delete mode 100644 docs/algorithms/machine-learning/data-preprocessing/scaling-and-normalization/index.md delete mode 100644 docs/algorithms/machine-learning/data-preprocessing/scaling-and-normalization/min-max-scaler.md delete mode 100644 docs/algorithms/machine-learning/data-preprocessing/scaling-and-normalization/standard-scaler.md delete mode 100644 docs/algorithms/machine-learning/index.md delete mode 100644 docs/algorithms/machine-learning/supervised/classifications/index.md delete mode 100644 docs/algorithms/machine-learning/supervised/index.md delete mode 100644 docs/algorithms/machine-learning/supervised/regressions/adaboost.md delete mode 100644 docs/algorithms/machine-learning/supervised/regressions/bayesian.md delete mode 100644 docs/algorithms/machine-learning/supervised/regressions/decision-tree.md delete mode 100644 docs/algorithms/machine-learning/supervised/regressions/elastic-net.md delete mode 100644 docs/algorithms/machine-learning/supervised/regressions/gradient-boosting.md delete mode 100644 docs/algorithms/machine-learning/supervised/regressions/huber.md delete mode 100644 docs/algorithms/machine-learning/supervised/regressions/index.md delete mode 100644 docs/algorithms/machine-learning/supervised/regressions/k-nearest-neighbors.md delete mode 100644 docs/algorithms/machine-learning/supervised/regressions/lasso.md delete mode 100644 docs/algorithms/machine-learning/supervised/regressions/linear.md delete mode 100644 docs/algorithms/machine-learning/supervised/regressions/logistic.md delete mode 100644 docs/algorithms/machine-learning/supervised/regressions/neural-network.md delete mode 100644 
docs/algorithms/machine-learning/supervised/regressions/polynomial.md delete mode 100644 docs/algorithms/machine-learning/supervised/regressions/random-forest.md delete mode 100644 docs/algorithms/machine-learning/supervised/regressions/ridge.md delete mode 100644 docs/algorithms/machine-learning/supervised/regressions/support-vector.md delete mode 100644 docs/algorithms/machine-learning/supervised/regressions/xg-boost.md delete mode 100644 docs/algorithms/machine-learning/unsupervised/clustering/index.md delete mode 100644 docs/algorithms/machine-learning/unsupervised/clustering/kmeans-clustering.md delete mode 100644 docs/algorithms/machine-learning/unsupervised/dimensionality-reduction/index.md delete mode 100644 docs/algorithms/machine-learning/unsupervised/index.md delete mode 100644 docs/algorithms/natural-language-processing/Bag_Of_Words.md delete mode 100644 docs/algorithms/natural-language-processing/Fast_Text.md delete mode 100644 docs/algorithms/natural-language-processing/GloVe.md delete mode 100644 docs/algorithms/natural-language-processing/NLTK_Setup.md delete mode 100644 docs/algorithms/natural-language-processing/N_L_P_Introduction.md delete mode 100644 docs/algorithms/natural-language-processing/Text_PreProcessing_Techniques.md delete mode 100644 docs/algorithms/natural-language-processing/Tf_Idf.md delete mode 100644 docs/algorithms/natural-language-processing/Transformers.md delete mode 100644 docs/algorithms/natural-language-processing/Word_2_Vec.md delete mode 100644 docs/algorithms/natural-language-processing/Word_Embeddings.md delete mode 100644 docs/algorithms/natural-language-processing/index.md delete mode 100644 docs/algorithms/statistics/descriptive/index.md delete mode 100644 docs/algorithms/statistics/index.md delete mode 100644 docs/algorithms/statistics/inferential/index.md delete mode 100644 docs/algorithms/statistics/metrics-and-losses/errors/Mean_Absolute_Error.md delete mode 100644 docs/algorithms/statistics/metrics-and-losses/errors/Mean_Squared_Error.md delete mode 100644 docs/algorithms/statistics/metrics-and-losses/errors/R2_Squared_Error.md delete mode 100644 docs/algorithms/statistics/metrics-and-losses/errors/Root_Mean_Squared_Error.md delete mode 100644 docs/algorithms/statistics/metrics-and-losses/index.md delete mode 100644 docs/algorithms/statistics/metrics-and-losses/loss-functions/Cross_Entropy_Loss.md delete mode 100644 docs/algorithms/statistics/metrics-and-losses/loss-functions/Hinge_Loss.md delete mode 100644 docs/algorithms/statistics/metrics-and-losses/loss-functions/Kullback_Leibler_Divergence_Loss.md delete mode 100644 docs/algorithms/statistics/metrics-and-losses/loss-functions/Ranking_Losses.md delete mode 100644 docs/algorithms/statistics/probability/index.md delete mode 100644 docs/customs/extra.css delete mode 100644 docs/libraries/index.md delete mode 100644 docs/libraries/numpy.md delete mode 100644 docs/libraries/pandas.md diff --git a/docs/algorithm-readme-template.md b/docs/algorithm-readme-template.md deleted file mode 100644 index f4617e83..00000000 --- a/docs/algorithm-readme-template.md +++ /dev/null @@ -1,262 +0,0 @@ - - - - -# ๐Ÿงฎ Algorithm Title - - -
- -
- -## ๐ŸŽฏ Objective - -- Example: "This is a K-Nearest Neighbors (KNN) classifier algorithm used for classifying data points based on their proximity to other points in the dataset." - -## ๐Ÿ“š Prerequisites - - -- Linear Algebra Basics -- Probability and Statistics -- Libraries: NumPy, TensorFlow, PyTorch (as applicable) - ---- - -## ๐Ÿงฉ Inputs - -- Example: The input dataset should be in CSV format with features and labels for supervised learning algorithms. - - -## ๐Ÿ“ค Outputs - -- Example: The algorithm returns a predicted class label or a regression value for each input sample. - ---- - -## ๐Ÿ›๏ธ Algorithm Architecture - -- Example: "The neural network consists of 3 layers: an input layer, one hidden layer with 128 units, and an output layer with 10 units for classification." - - -## ๐Ÿ‹๏ธโ€โ™‚๏ธ Training Process - -- Example: - - The model is trained using the **gradient descent** optimizer. - - Learning rate: 0.01 - - Batch size: 32 - - Number of epochs: 50 - - Validation set: 20% of the training data - - -## ๐Ÿ“Š Evaluation Metrics - -- Example: "Accuracy and F1-Score are used to evaluate the classification performance of the model. Cross-validation is used to reduce overfitting." - ---- - -## ๐Ÿ’ป Code Implementation - -```python -# Example: Bayesian Regression implementation - -import numpy as np -from sklearn.linear_model import BayesianRidge -import matplotlib.pyplot as plt - -# Generate Synthetic Data -np.random.seed(42) -X = np.random.rand(20, 1) * 10 -y = 3 * X.squeeze() + np.random.randn(20) * 2 - -# Initialize and Train Bayesian Ridge Regression -model = BayesianRidge(alpha_1=1e-6, lambda_1=1e-6, compute_score=True) -model.fit(X, y) - -# Make Predictions -X_test = np.linspace(0, 10, 100).reshape(-1, 1) -y_pred, y_std = model.predict(X_test, return_std=True) - -# Display Results -print("Coefficients:", model.coef_) -print("Intercept:", model.intercept_) - -# Visualization -plt.figure(figsize=(8, 5)) -plt.scatter(X, y, color="blue", label="Training Data") -plt.plot(X_test, y_pred, color="red", label="Mean Prediction") -plt.fill_between( - X_test.squeeze(), - y_pred - y_std, - y_pred + y_std, - color="orange", - alpha=0.3, - label="Predictive Uncertainty", -) -plt.title("Bayesian Regression with Predictive Uncertainty") -plt.xlabel("X") -plt.ylabel("y") -plt.legend() -plt.show() -``` - -## ๐Ÿ” Scratch Code Explanation - - -Bayesian Regression is a probabilistic approach to linear regression that incorporates prior beliefs and updates these beliefs based on observed data to form posterior distributions of the model parameters. Below is a breakdown of the implementation, structured for clarity and understanding. - ---- - -#### 1. Class Constructor: Initialization - -```python -class BayesianRegression: - def __init__(self, alpha=1, beta=1): - """ - Constructor for the BayesianRegression class. - - Parameters: - - alpha: Prior precision (controls the weight of the prior belief). - - beta: Noise precision (inverse of noise variance in the data). - """ - self.alpha = alpha - self.beta = beta - self.w_mean = None - self.w_precision = None -``` - -- Key Idea - - The `alpha` (Prior precision, representing our belief in the model parameters' variability) and `beta` (Precision of the noise in the data) hyperparameters are crucial to controlling the Bayesian framework. A higher `alpha` means stronger prior belief in smaller weights, while `beta` controls the confidence in the noise level of the observations. 
- - `w_mean` - Posterior mean of weights (initialized as None) - - `w_precision` - Posterior precision matrix (initialized as None) - ---- - -#### 2. Fitting the Model: Bayesian Learning - -```python -def fit(self, X, y): - """ - Fit the Bayesian Regression model to the input data. - - Parameters: - - X: Input features (numpy array of shape [n_samples, n_features]). - - y: Target values (numpy array of shape [n_samples]). - """ - # Add a bias term to X for intercept handling. - X = np.c_[np.ones(X.shape[0]), X] - - # Compute the posterior precision matrix. - self.w_precision = ( - self.alpha * np.eye(X.shape[1]) # Prior contribution. - + self.beta * X.T @ X # Data contribution. - ) - - # Compute the posterior mean of the weights. - self.w_mean = np.linalg.solve(self.w_precision, self.beta * X.T @ y) -``` - -Key Steps in the Fitting Process - -1. Add Bias Term: The bias term (column of ones) is added to `X` to account for the intercept in the linear model. -2. Posterior Precision Matrix: - $$ - \mathbf{S}_w^{-1} = \alpha \mathbf{I} + \beta \mathbf{X}^\top \mathbf{X} - $$ - - - The prior contributes \(\alpha \mathbf{I}\), which regularizes the weights. - - The likelihood contributes \(\beta \mathbf{X}^\top \mathbf{X}\), based on the observed data. - -3. Posterior Mean of Weights: - $$ - \mathbf{m}_w = \mathbf{S}_w \beta \mathbf{X}^\top \mathbf{y} - $$ - - This reflects the most probable weights under the posterior distribution, balancing prior beliefs and observed data. - ---- - -#### 3. Making Predictions: Posterior Inference - -```python -def predict(self, X): - """ - Make predictions on new data. - - Parameters: - - X: Input features for prediction (numpy array of shape [n_samples, n_features]). - - Returns: - - Predicted values (numpy array of shape [n_samples]). - """ - # Add a bias term to X for intercept handling. - X = np.c_[np.ones(X.shape[0]), X] - - # Compute the mean of the predictions using the posterior mean of weights. - y_pred = X @ self.w_mean - - return y_pred -``` - -Key Prediction Details - -1. Adding Bias Term: The bias term ensures that predictions account for the intercept term in the model. -2. Posterior Predictive Mean: - $$ - \hat{\mathbf{y}} = \mathbf{X} \mathbf{m}_w - $$ - - This computes the expected value of the targets using the posterior mean of the weights. - ---- - -#### 4. Code Walkthrough - -- Posterior Precision Matrix (\(\mathbf{S}_w^{-1}\)): Balances the prior (\(\alpha \mathbf{I}\)) and the data (\(\beta \mathbf{X}^\top \mathbf{X}\)) to regularize and incorporate observed evidence. -- Posterior Mean (\(\mathbf{m}_w\)): Encodes the most likely parameter values given the data and prior. -- Prediction (\(\hat{\mathbf{y}}\)): Uses the posterior mean to infer new outputs, accounting for both prior knowledge and learned data trends. - ---- - -### ๐Ÿ› ๏ธ Example Usage - - -```python -# Example Data -X = np.array([[1.0], [2.0], [3.0]]) # Features -y = np.array([2.0, 4.0, 6.0]) # Targets - -# Initialize and Train Model -model = BayesianRegression(alpha=1.0, beta=1.0) -model.fit(X, y) - -# Predict on New Data -X_new = np.array([[4.0], [5.0]]) -y_pred = model.predict(X_new) - -print(f"Predictions: {y_pred}") -``` - -- Explanation - - A small dataset is provided where the relationship between \(X\) and \(y\) is linear. - - The model fits this data by learning posterior distributions of the weights. - - Predictions are made for new inputs using the learned posterior mean. 
- ---- - -## ๐ŸŒŸ Advantages - - Encodes uncertainty explicitly, providing confidence intervals for predictions. - - Regularization is naturally incorporated through prior distributions. - - Handles small datasets effectively by leveraging prior knowledge. - -## โš ๏ธ Limitations - - Computationally intensive for high-dimensional data due to matrix inversions. - - Sensitive to prior hyperparameters (\(\alpha, \beta\)). - -## ๐Ÿš€ Application - - -=== "Application 1" - Explain your application - -=== "Application 2" - Explain your application - - diff --git a/docs/algorithms/artificial-intelligence/evolutionary-algorithms/index.md b/docs/algorithms/artificial-intelligence/evolutionary-algorithms/index.md deleted file mode 100644 index 4eea44ca..00000000 --- a/docs/algorithms/artificial-intelligence/evolutionary-algorithms/index.md +++ /dev/null @@ -1,11 +0,0 @@ -# Evolutionary Algorithms ๐Ÿ’ก - -
-No Items Found
-
-There are no items available at this time. Check back again later.
diff --git a/docs/algorithms/artificial-intelligence/expert-systems/index.md b/docs/algorithms/artificial-intelligence/expert-systems/index.md deleted file mode 100644 index c2279055..00000000 --- a/docs/algorithms/artificial-intelligence/expert-systems/index.md +++ /dev/null @@ -1,11 +0,0 @@ -# Expert Systems ๐Ÿ’ก - -
-No Items Found
-
-There are no items available at this time. Check back again later.
diff --git a/docs/algorithms/artificial-intelligence/index.md b/docs/algorithms/artificial-intelligence/index.md deleted file mode 100644 index beaab977..00000000 --- a/docs/algorithms/artificial-intelligence/index.md +++ /dev/null @@ -1,60 +0,0 @@ -# Artificial Intelligence ๐Ÿ’ก - - diff --git a/docs/algorithms/artificial-intelligence/knowledge-based-systems/index.md b/docs/algorithms/artificial-intelligence/knowledge-based-systems/index.md deleted file mode 100644 index 87c270ea..00000000 --- a/docs/algorithms/artificial-intelligence/knowledge-based-systems/index.md +++ /dev/null @@ -1,11 +0,0 @@ -# Knowledge Based Systems ๐Ÿ’ก - -
-No Items Found
-
-There are no items available at this time. Check back again later.
diff --git a/docs/algorithms/artificial-intelligence/reinforcement-learning/index.md b/docs/algorithms/artificial-intelligence/reinforcement-learning/index.md deleted file mode 100644 index 7c018393..00000000 --- a/docs/algorithms/artificial-intelligence/reinforcement-learning/index.md +++ /dev/null @@ -1,11 +0,0 @@ -# Reinforcement Learning ๐Ÿ’ก - -
-No Items Found
-
-There are no items available at this time. Check back again later.
diff --git a/docs/algorithms/artificial-intelligence/search-and-optimization/index.md b/docs/algorithms/artificial-intelligence/search-and-optimization/index.md deleted file mode 100644 index aec18e9d..00000000 --- a/docs/algorithms/artificial-intelligence/search-and-optimization/index.md +++ /dev/null @@ -1,11 +0,0 @@ -# Search and Optimization ๐Ÿ’ก - -
-No Items Found
-
-There are no items available at this time. Check back again later.
diff --git a/docs/algorithms/computer-vision/image-augmentation/index.md b/docs/algorithms/computer-vision/image-augmentation/index.md deleted file mode 100644 index 25087128..00000000 --- a/docs/algorithms/computer-vision/image-augmentation/index.md +++ /dev/null @@ -1,11 +0,0 @@ -# Image Augmentation ๐ŸŽฅ - -
-No Items Found
-
-There are no items available at this time. Check back again later.
diff --git a/docs/algorithms/computer-vision/image-processing/index.md b/docs/algorithms/computer-vision/image-processing/index.md deleted file mode 100644 index 345626fc..00000000 --- a/docs/algorithms/computer-vision/image-processing/index.md +++ /dev/null @@ -1,11 +0,0 @@ -# Image Processing ๐ŸŽฅ - -
-No Items Found
-
-There are no items available at this time. Check back again later.
diff --git a/docs/algorithms/computer-vision/index.md b/docs/algorithms/computer-vision/index.md deleted file mode 100644 index 875f8fe6..00000000 --- a/docs/algorithms/computer-vision/index.md +++ /dev/null @@ -1,49 +0,0 @@ -# Computer Vision ๐ŸŽฅ - - diff --git a/docs/algorithms/computer-vision/object-detection/index.md b/docs/algorithms/computer-vision/object-detection/index.md deleted file mode 100644 index 46d36f43..00000000 --- a/docs/algorithms/computer-vision/object-detection/index.md +++ /dev/null @@ -1,11 +0,0 @@ -# Object Detection ๐ŸŽฅ - -
-No Items Found
-
-There are no items available at this time. Check back again later.
diff --git a/docs/algorithms/computer-vision/semantic-segmentation/index.md b/docs/algorithms/computer-vision/semantic-segmentation/index.md deleted file mode 100644 index b8fc61fc..00000000 --- a/docs/algorithms/computer-vision/semantic-segmentation/index.md +++ /dev/null @@ -1,11 +0,0 @@ -# Semantic Segmentation ๐ŸŽฅ - -
-No Items Found
-
-There are no items available at this time. Check back again later.
diff --git a/docs/algorithms/deep-learning/architectures/index.md b/docs/algorithms/deep-learning/architectures/index.md deleted file mode 100644 index bb58ca57..00000000 --- a/docs/algorithms/deep-learning/architectures/index.md +++ /dev/null @@ -1,11 +0,0 @@ -# Architectures โœจ - -
-No Items Found
-
-There are no items available at this time. Check back again later.
diff --git a/docs/algorithms/deep-learning/index.md b/docs/algorithms/deep-learning/index.md deleted file mode 100644 index 51d4abfe..00000000 --- a/docs/algorithms/deep-learning/index.md +++ /dev/null @@ -1,49 +0,0 @@ -# Deep Learning โœจ - - diff --git a/docs/algorithms/deep-learning/neural-networks/convolutional-neural-network.md b/docs/algorithms/deep-learning/neural-networks/convolutional-neural-network.md deleted file mode 100644 index e72bb0fe..00000000 --- a/docs/algorithms/deep-learning/neural-networks/convolutional-neural-network.md +++ /dev/null @@ -1,115 +0,0 @@ -# Convolutional Neural Networks - -
- -
- -## Overview -Convolutional Neural Networks (CNNs) are a type of deep learning algorithm specifically designed for processing structured grid data such as images. They are widely used in computer vision tasks like image classification, object detection, and image segmentation. - ---- - -## How CNNs Work - -### 1. **Architecture** -CNNs are composed of the following layers: -- **Convolutional Layers**: Extract spatial features from the input data. -- **Pooling Layers**: Reduce the spatial dimensions of feature maps to lower computational costs. -- **Fully Connected Layers**: Perform high-level reasoning for final predictions. - -### 2. **Key Concepts** -- **Filters (Kernels)**: Small matrices that slide over the input to extract features. -- **Strides**: Step size of the filter movement. -- **Padding**: Adding borders to the input for better filter coverage. -- **Activation Functions**: Introduce non-linearity (e.g., ReLU). - ---- - -## CNN Algorithms - -### 1. **LeNet** -- **Proposed By**: Yann LeCun (1998) -- **Use Case**: Handwritten digit recognition (e.g., MNIST dataset). -- **Architecture**: - - Input โ†’ Convolution โ†’ Pooling โ†’ Convolution โ†’ Pooling โ†’ Fully Connected โ†’ Output - -### 2. **AlexNet** -- **Proposed By**: Alex Krizhevsky (2012) -- **Use Case**: ImageNet classification challenge. -- **Key Features**: - - Uses ReLU for activation. - - Includes dropout to prevent overfitting. - - Designed for GPUs for faster computation. - -### 3. **VGGNet** -- **Proposed By**: Visual Geometry Group (2014) -- **Use Case**: Image classification and transfer learning. -- **Key Features**: - - Uses small 3x3 filters. - - Depth of the network increases (e.g., VGG-16, VGG-19). - -### 4. **ResNet** -- **Proposed By**: Kaiming He et al. (2015) -- **Use Case**: Solving vanishing gradient problems in deep networks. -- **Key Features**: - - Introduces residual blocks with skip connections. - - Enables training of very deep networks (e.g., ResNet-50, ResNet-101). - -### 5. **MobileNet** -- **Proposed By**: Google (2017) -- **Use Case**: Mobile and embedded vision applications. -- **Key Features**: - - Utilizes depthwise separable convolutions. - - Lightweight architecture suitable for mobile devices. - ---- - -## Code Example: Implementing a Simple CNN - -Hereโ€™s a Python example of a CNN using **TensorFlow/Keras**: - -* **Sequential:** Used to stack layers to create a neural network model. -* **Conv2D:** Implements the convolutional layers to extract features from input images. -* **MaxPooling2D:** Reduces the size of feature maps while retaining important features. -* **Flatten:** Converts 2D feature maps into a 1D vector to pass into fully connected layers. -* **Dense:** Implements fully connected (dense) layers, responsible for decision-making. - - -```python -from tensorflow.keras.models import Sequential -from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense - -# Build the CNN -model = Sequential([ - Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)), - MaxPooling2D(pool_size=(2, 2)), - Conv2D(64, (3, 3), activation='relu'), - MaxPooling2D(pool_size=(2, 2)), - Flatten(), - Dense(128, activation='relu'), - Dense(10, activation='softmax') # Replace 10 with the number of classes in your dataset -]) - -# Compile the model -model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy']) - -# Summary -model.summary() -``` - ---- - -# Visualizations -* **Filters and Feature Maps:** Visualizing how the CNN learns features from images. 
-* **Training Metrics:** Plotting accuracy and loss during training. - -```python -import matplotlib.pyplot as plt - -# Example: Visualizing accuracy and loss -plt.plot(history.history['accuracy'], label='Accuracy') -plt.plot(history.history['val_accuracy'], label='Validation Accuracy') -plt.xlabel('Epochs') -plt.ylabel('Accuracy') -plt.legend() -``` diff --git a/docs/algorithms/deep-learning/neural-networks/index.md b/docs/algorithms/deep-learning/neural-networks/index.md deleted file mode 100644 index 26adbe63..00000000 --- a/docs/algorithms/deep-learning/neural-networks/index.md +++ /dev/null @@ -1,24 +0,0 @@ -# Neural Networks โœจ - - diff --git a/docs/algorithms/deep-learning/neural-networks/recurrent-neural-network.md b/docs/algorithms/deep-learning/neural-networks/recurrent-neural-network.md deleted file mode 100644 index c456ab1c..00000000 --- a/docs/algorithms/deep-learning/neural-networks/recurrent-neural-network.md +++ /dev/null @@ -1,122 +0,0 @@ -# ๐Ÿงช Recurrent Neural Network (RNN) - -
- -
- -## ๐ŸŽฏ Objective -Recurrent Neural Networks (RNNs) are a class of artificial neural networks designed to process sequential data. Unlike feedforward networks, RNNs have connections that allow information to persist, making them suitable for tasks such as speech recognition, text generation, and time-series forecasting. - -## ๐Ÿ“š Prerequisites -- Understanding of basic neural networks and deep learning -- Knowledge of activation functions and backpropagation -- Familiarity with sequence-based data processing -- Libraries: NumPy, TensorFlow, PyTorch - ---- - -## ๐Ÿงฌ Inputs -- A sequence of data points such as text, speech signals, or time-series data. -- Example: A sentence represented as a sequence of word embeddings for NLP tasks. - -## ๐ŸŽŽ Outputs -- Predicted sequence values or classifications. -- Example: Next word prediction in a sentence or stock price forecasting. - ---- - -## ๐Ÿฉ RNN Architecture -- RNNs maintain a **hidden state** that updates with each time step. -- At each step, the hidden state is computed as: - $$ h_t = f(W_h h_{t-1} + W_x x_t + b) $$ -- Variants of RNNs include **LSTMs (Long Short-Term Memory)** and **GRUs (Gated Recurrent Units)**, which help mitigate the vanishing gradient problem. - -## ๐Ÿ… Training Process -- The model is trained using **Backpropagation Through Time (BPTT)**. -- Uses optimizers like **Adam** or **SGD**. -- Typical hyperparameters: - - Learning rate: 0.001 - - Batch size: 64 - - Epochs: 30 - - Loss function: Cross-entropy for classification tasks, MSE for regression tasks. - -## ๐Ÿ“Š Evaluation Metrics -- Accuracy (for classification) -- Perplexity (for language models) -- Mean Squared Error (MSE) (for regression tasks) -- BLEU Score (for sequence-to-sequence models) - ---- - -## ๐Ÿ’ป Code Implementation -```python -import numpy as np -import torch -import torch.nn as nn -import torch.optim as optim - -# Define RNN Model -class RNN(nn.Module): - def __init__(self, input_size, hidden_size, output_size): - super(RNN, self).__init__() - self.hidden_size = hidden_size - self.rnn = nn.RNN(input_size, hidden_size, batch_first=True) - self.fc = nn.Linear(hidden_size, output_size) - - def forward(self, x, hidden): - out, hidden = self.rnn(x, hidden) - out = self.fc(out[:, -1, :]) - return out, hidden - -# Model Training -input_size = 10 # Number of input features -hidden_size = 20 # Number of hidden neurons -output_size = 1 # Output dimension - -model = RNN(input_size, hidden_size, output_size) -criterion = nn.MSELoss() -optimizer = optim.Adam(model.parameters(), lr=0.001) - -# Sample Training Loop -for epoch in range(10): - optimizer.zero_grad() - inputs = torch.randn(32, 5, input_size) # (batch_size, seq_length, input_size) - hidden = torch.zeros(1, 32, hidden_size) # Initial hidden state - outputs, hidden = model(inputs, hidden) - loss = criterion(outputs, torch.randn(32, output_size)) - loss.backward() - optimizer.step() - print(f"Epoch {epoch+1}, Loss: {loss.item()}") -``` - -## ๐Ÿ” Understanding the Code -- **Model Definition:** - - The `RNN` class defines a simple recurrent neural network with an input layer, a recurrent layer, and a fully connected output layer. -- **Forward Pass:** - - Takes an input sequence, processes it through the RNN layer, and generates an output. -- **Training Loop:** - - Uses randomly generated data for demonstration. - - Optimizes weights using the Adam optimizer and mean squared error loss. - ---- - -## ๐ŸŒŸ Advantages -- Effective for sequential data modeling. 
-- Capable of handling variable-length inputs. -- Works well for applications like text generation and speech recognition. - -## โš ๏ธ Limitations -- Struggles with long-range dependencies due to vanishing gradients. -- Training can be slow due to sequential computations. -- Alternatives like **LSTMs and GRUs** are preferred for longer sequences. - -## ๐Ÿš€ Applications -### Natural Language Processing (NLP) -- Text prediction -- Sentiment analysis -- Machine translation - -### Time-Series Forecasting -- Stock price prediction -- Weather forecasting -- Healthcare monitoring (e.g., ECG signals) \ No newline at end of file diff --git a/docs/algorithms/deep-learning/optimization-algorithms/index.md b/docs/algorithms/deep-learning/optimization-algorithms/index.md deleted file mode 100644 index 60c6fb9d..00000000 --- a/docs/algorithms/deep-learning/optimization-algorithms/index.md +++ /dev/null @@ -1,11 +0,0 @@ -# Optimization Algorithms โœจ - -
-No Items Found
-
-There are no items available at this time. Check back again later.
diff --git a/docs/algorithms/deep-learning/pre-trained-models/index.md b/docs/algorithms/deep-learning/pre-trained-models/index.md deleted file mode 100644 index edb98eea..00000000 --- a/docs/algorithms/deep-learning/pre-trained-models/index.md +++ /dev/null @@ -1,11 +0,0 @@ -# Pre-Trained Models โœจ - -
-No Items Found
-
-There are no items available at this time. Check back again later.
diff --git a/docs/algorithms/generative-adversarial-networks/ac-gan.md b/docs/algorithms/generative-adversarial-networks/ac-gan.md deleted file mode 100644 index 5aaca1dd..00000000 --- a/docs/algorithms/generative-adversarial-networks/ac-gan.md +++ /dev/null @@ -1,170 +0,0 @@ -# AC GAN - -
- -
- -## Overview - -Auxiliary Classifier Generative Adversarial Network (ACGAN) is an extension of the traditional GAN architecture. It incorporates class information into both the generator and discriminator, enabling controlled generation of samples with specific characteristics. - -ACGANs can: -- Generate high-quality images conditioned on specific classes. -- Predict class labels of generated images via the discriminator. - -This dual capability allows for more controlled and targeted image synthesis. - ---- - -## Key Concepts - -1. **Generator**: - - Takes random noise and class labels as input to generate synthetic images conditioned on the class labels. - -2. **Discriminator**: - - Differentiates between real and fake images. - - Predicts the class labels of images. - -3. **Class Conditioning**: - - By integrating label embeddings, the generator learns to associate specific features with each class, enhancing image quality and controllability. - ---- - -## Implementation Overview - -Below is a high-level explanation of the ACGAN implementation: - -1. **Dataset**: - - The MNIST dataset is used for training, consisting of grayscale images of digits (0-9). - -2. **Model Architecture**: - - **Generator**: - - Takes random noise (latent vector) and class labels as input. - - Outputs images that correspond to the input class labels. - - **Discriminator**: - - Classifies images as real or fake. - - Simultaneously predicts the class label of the image. - -3. **Training Process**: - - The generator is trained to fool the discriminator into classifying fake images as real. - - The discriminator is trained to: - - Differentiate real from fake images. - - Accurately predict the class labels of real images. - -4. **Loss Functions**: - - Binary Cross-Entropy Loss for real/fake classification. - - Categorical Cross-Entropy Loss for class label prediction. 
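
A minimal sketch of how the adversarial and auxiliary loss terms listed above could be combined in PyTorch, assuming a discriminator that outputs both a real/fake probability and raw class logits; the helper name and tensor shapes here are illustrative assumptions rather than part of the implementation shown below.

```python
# Minimal sketch (illustrative, not from the repository code): combining the
# adversarial (real/fake) and auxiliary (class) losses of an AC-GAN discriminator.
import torch
import torch.nn as nn

adversarial_criterion = nn.BCELoss()          # real vs. fake term
auxiliary_criterion = nn.CrossEntropyLoss()   # class-label term

def acgan_discriminator_loss(validity, class_logits, target_validity, target_labels):
    """Sum the adversarial and auxiliary classification losses.

    validity:        (N, 1) sigmoid outputs (probability that a sample is real)
    class_logits:    (N, num_classes) unnormalized class scores
    target_validity: (N, 1) ones for a real batch, zeros for a fake batch
    target_labels:   (N,) integer class labels used for conditioning
    """
    adv_loss = adversarial_criterion(validity, target_validity)
    aux_loss = auxiliary_criterion(class_logits, target_labels)
    return adv_loss + aux_loss

# Dummy usage with random tensors standing in for discriminator outputs:
N, num_classes = 16, 10
loss = acgan_discriminator_loss(
    torch.rand(N, 1),                      # stand-in validity scores in [0, 1)
    torch.randn(N, num_classes),           # stand-in class logits
    torch.ones(N, 1),                      # treat this batch as real
    torch.randint(0, num_classes, (N,)),   # stand-in ground-truth labels
)
print(loss.item())
```

The generator step would reuse the same auxiliary term, computed on the fake images together with the labels used to condition them.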
- ---- - -## Implementation Code - -### Core Components - -#### Discriminator -```python -class Discriminator(nn.Module): - def __init__(self): - super(Discriminator, self).__init__() - self.label_emb = nn.Embedding(num_classes, num_classes) - self.model = nn.Sequential( - nn.Linear(image_size + num_classes, hidden_size), - nn.LeakyReLU(0.2), - nn.Dropout(0.3), - nn.Linear(hidden_size, hidden_size), - nn.LeakyReLU(0.2), - nn.Dropout(0.3), - nn.Linear(hidden_size, 1), - nn.Sigmoid() - ) - - def forward(self, x, labels): - x = x.view(x.size(0), image_size) - c = self.label_emb(labels) - x = torch.cat([x, c], 1) - return self.model(x) -``` - -#### Generator -```python -class Generator(nn.Module): - def __init__(self): - super(Generator, self).__init__() - self.label_emb = nn.Embedding(num_classes, num_classes) - self.model = nn.Sequential( - nn.Linear(latent_size + num_classes, hidden_size), - nn.ReLU(), - nn.Linear(hidden_size, hidden_size), - nn.ReLU(), - nn.Linear(hidden_size, image_size), - nn.Tanh() - ) - - def forward(self, z, labels): - z = z.view(z.size(0), latent_size) - c = self.label_emb(labels) - x = torch.cat([z, c], 1) - return self.model(x) -``` - -#### Training Loop -```python -for epoch in range(num_epochs): - for i, (images, labels) in enumerate(train_loader): - batch_size = images.size(0) - images = images.to(device) - labels = labels.to(device) - - # Real and fake labels - real_labels = torch.ones(batch_size, 1).to(device) - fake_labels = torch.zeros(batch_size, 1).to(device) - - # Train Discriminator - outputs = D(images, labels) - d_loss_real = criterion(outputs, real_labels) - - z = create_noise(batch_size, latent_size) - fake_images = G(z, labels) - outputs = D(fake_images, labels) - d_loss_fake = criterion(outputs, fake_labels) - - d_loss = d_loss_real + d_loss_fake - D.zero_grad() - d_loss.backward() - d_optimizer.step() - - # Train Generator - z = create_noise(batch_size, latent_size) - fake_images = G(z, labels) - outputs = D(fake_images, labels) - g_loss = criterion(outputs, real_labels) - - G.zero_grad() - g_loss.backward() - g_optimizer.step() - - if (i+1) % 200 == 0: - print(f"Epoch [{epoch}/{num_epochs}], Step [{i+1}/{total_step}], d_loss: {d_loss.item():.4f}, g_loss: {g_loss.item():.4f}") -``` - ---- - -## Applications of ACGAN - -1. **Image Synthesis**: - - Generate diverse images conditioned on specific labels. - -2. **Data Augmentation**: - - Create synthetic data to augment existing datasets. - -3. **Creative Domains**: - - Design tools for controlled image generation in fashion, gaming, and media. - ---- - -## Additional Resources - -- [PyTorch Documentation](https://pytorch.org/docs/) -- [Original ACGAN Paper](https://arxiv.org/abs/1610.09585) -- [MNIST Dataset](http://yann.lecun.com/exdb/mnist/) - diff --git a/docs/algorithms/generative-adversarial-networks/basic-gan.md b/docs/algorithms/generative-adversarial-networks/basic-gan.md deleted file mode 100644 index d381d9b9..00000000 --- a/docs/algorithms/generative-adversarial-networks/basic-gan.md +++ /dev/null @@ -1,222 +0,0 @@ -# Basic GAN - - Basic GAN stands for Basic Generative Adversarial Network - -This folder contains a basic implementation of a Generative Adversarial Network (GAN) using PyTorch. GANs are a type of neural network architecture that consists of two networks: a generator and a discriminator. The generator learns to create realistic data samples (e.g., images) from random noise, while the discriminator learns to distinguish between real and generated samples. 
- -## Overview - -This project implements a simple GAN architecture to generate hand-written digits resembling those from the MNIST dataset. The generator network creates fake images, while the discriminator network tries to differentiate between real and generated images. The networks are trained simultaneously in a minimax game until the generator produces realistic images. - ---- - -## Files - -```py -import torch -import torch.nn as nn -import torch.optim as optim -import torchvision.datasets as dsets -import torchvision.transforms as transforms -from torch.utils.data import DataLoader -import matplotlib.pyplot as plt -import numpy as np - -# Device configuration -device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') - -# Hyper-parameters -image_size = 784 # 28x28 -hidden_size = 256 -latent_size = 64 -num_epochs = 200 -batch_size = 100 -learning_rate = 0.0002 - -# MNIST dataset -dataset = dsets.MNIST(root='../data/', - train=True, - transform=transforms.ToTensor(), - download=True) - -# Data loader -data_loader = DataLoader(dataset=dataset, - batch_size=batch_size, - shuffle=True) - -# Discriminator -D = nn.Sequential( - nn.Linear(image_size, hidden_size), - nn.LeakyReLU(0.2), - nn.Linear(hidden_size, hidden_size), - nn.LeakyReLU(0.2), - nn.Linear(hidden_size, 1), - nn.Sigmoid()) - -# Generator -G = nn.Sequential( - nn.Linear(latent_size, hidden_size), - nn.ReLU(), - nn.Linear(hidden_size, hidden_size), - nn.ReLU(), - nn.Linear(hidden_size, image_size), - nn.Tanh()) - -# Device setting -D = D.to(device) -G = G.to(device) - -# Binary cross entropy loss and optimizer -criterion = nn.BCELoss() -d_optimizer = optim.Adam(D.parameters(), lr=learning_rate) -g_optimizer = optim.Adam(G.parameters(), lr=learning_rate) - -# Utility function to create real and fake labels -def create_real_labels(size): - data = torch.ones(size, 1) - return data.to(device) - -def create_fake_labels(size): - data = torch.zeros(size, 1) - return data.to(device) - -# Utility function to generate random noise -def create_noise(size, latent_dim): - return torch.randn(size, latent_dim).to(device) - -# Training the GAN -total_step = len(data_loader) -for epoch in range(num_epochs): - for i, (images, _) in enumerate(data_loader): - batch_size = images.size(0) - images = images.reshape(batch_size, -1).to(device) - - # Create the labels which are later used as input for the BCE loss - real_labels = create_real_labels(batch_size) - fake_labels = create_fake_labels(batch_size) - - # ================================================================== # - # Train the discriminator # - # ================================================================== # - # Compute BCELoss using real images - # Second term of the loss is always zero since real_labels == 1 - outputs = D(images) - d_loss_real = criterion(outputs, real_labels) - real_score = outputs - - # Compute BCELoss using fake images - noise = create_noise(batch_size, latent_size) - fake_images = G(noise) - outputs = D(fake_images) - d_loss_fake = criterion(outputs, fake_labels) - fake_score = outputs - - # Backprop and optimize - d_loss = d_loss_real + d_loss_fake - d_optimizer.zero_grad() - d_loss.backward() - d_optimizer.step() - - # ================================================================== # - # Train the generator # - # ================================================================== # - # Compute loss with fake images - noise = create_noise(batch_size, latent_size) - fake_images = G(noise) - outputs = D(fake_images) - - # We train G to maximize 
log(D(G(z)) instead of minimizing log(1-D(G(z))) - # For the reason, look at the last part of section 3 of the paper: - # https://arxiv.org/pdf/1406.2661.pdf - g_loss = criterion(outputs, real_labels) - - # Backprop and optimize - g_optimizer.zero_grad() - g_loss.backward() - g_optimizer.step() - - if (i+1) % 200 == 0: - print(f'Epoch [{epoch}/{num_epochs}], Step [{i+1}/{total_step}], d_loss: {d_loss.item():.4f}, g_loss: {g_loss.item():.4f}, D(x): {real_score.mean().item():.2f}, D(G(z)): {fake_score.mean().item():.2f}') - -# Save the trained models -torch.save(G.state_dict(), 'G.pth') -torch.save(D.state_dict(), 'D.pth') - -# Plot some generated images -def denorm(x): - out = (x + 1) / 2 - return out.clamp(0, 1) - -G.eval() -with torch.no_grad(): - noise = create_noise(64, latent_size) - fake_images = G(noise) - fake_images = fake_images.reshape(fake_images.size(0), 1, 28, 28) - fake_images = denorm(fake_images) - grid = np.transpose(fake_images.cpu(), (0, 2, 3, 1)).numpy() - - plt.figure(figsize=(8, 8)) - for i in range(grid.shape[0]): - plt.subplot(8, 8, i+1) - plt.imshow(grid[i, :, :, 0], cmap='gray') - plt.axis('off') - plt.show() -``` - -- `BasicGAN.py`: Contains the implementation of the GAN model, training loop, and saving of trained models. - -```py -import torch -import torch.nn as nn -import matplotlib.pyplot as plt -import numpy as np - -# Device configuration -device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') - -# Hyper-parameters -latent_size = 64 -hidden_size = 256 -image_size = 784 # 28x28 - -# Generator -G = nn.Sequential( - nn.Linear(latent_size, hidden_size), - nn.ReLU(), - nn.Linear(hidden_size, hidden_size), - nn.ReLU(), - nn.Linear(hidden_size, image_size), - nn.Tanh()) - -# Load the trained generator model -G.load_state_dict(torch.load('G.pth')) -G.to(device) -G.eval() - -# Utility function to generate random noise -def create_noise(size, latent_dim): - return torch.randn(size, latent_dim).to(device) - -# Utility function to denormalize the images -def denorm(x): - out = (x + 1) / 2 - return out.clamp(0, 1) - -# Generate images -with torch.no_grad(): - noise = create_noise(64, latent_size) - fake_images = G(noise) - fake_images = fake_images.reshape(fake_images.size(0), 1, 28, 28) - fake_images = denorm(fake_images) - grid = np.transpose(fake_images.cpu(), (0, 2, 3, 1)).numpy() - - plt.figure(figsize=(8, 8)) - for i in range(grid.shape[0]): - plt.subplot(8, 8, i+1) - plt.imshow(grid[i, :, :, 0], cmap='gray') - plt.axis('off') - plt.show() -``` - -- `test_BasicGAN.py`: Uses the trained generator to generate sample images after training. - diff --git a/docs/algorithms/generative-adversarial-networks/c-gan.md b/docs/algorithms/generative-adversarial-networks/c-gan.md deleted file mode 100644 index 76eb3619..00000000 --- a/docs/algorithms/generative-adversarial-networks/c-gan.md +++ /dev/null @@ -1,251 +0,0 @@ -# C GAN - -This folder contains an implementation of a Conditional Generative Adversarial Network (cGAN) using PyTorch. cGANs generate images conditioned on specific class labels, allowing for controlled image synthesis. - ----- - -## Overview - -cGANs extend the traditional GAN architecture by including class information in both the generator and discriminator. The generator learns to generate images conditioned on given class labels, while the discriminator not only distinguishes between real and fake images but also predicts the class labels of the generated images. 
- ----- - -## Files - -```py -import torch -import torch.nn as nn -import torch.optim as optim -import torchvision.datasets as dsets -import torchvision.transforms as transforms -from torch.utils.data import DataLoader -import matplotlib.pyplot as plt -import numpy as np - -# Device configuration -device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') - -# Hyper-parameters -image_size = 28 * 28 -num_classes = 10 -latent_size = 100 -hidden_size = 256 -num_epochs = 100 -batch_size = 64 -learning_rate = 0.0002 - -# MNIST dataset -transform = transforms.Compose([ - transforms.ToTensor(), - transforms.Normalize(mean=(0.5,), std=(0.5,)) -]) - -train_dataset = dsets.MNIST(root='../data/', - train=True, - transform=transform, - download=True) - -train_loader = DataLoader(dataset=train_dataset, - batch_size=batch_size, - shuffle=True) - -# Discriminator -class Discriminator(nn.Module): - def __init__(self): - super(Discriminator, self).__init__() - self.label_emb = nn.Embedding(num_classes, num_classes) - - self.model = nn.Sequential( - nn.Linear(image_size + num_classes, hidden_size), - nn.LeakyReLU(0.2), - nn.Dropout(0.3), - nn.Linear(hidden_size, hidden_size), - nn.LeakyReLU(0.2), - nn.Dropout(0.3), - nn.Linear(hidden_size, 1), - nn.Sigmoid() - ) - - def forward(self, x, labels): - x = x.view(x.size(0), image_size) - c = self.label_emb(labels) - x = torch.cat([x, c], 1) - out = self.model(x) - return out - -# Generator -class Generator(nn.Module): - def __init__(self): - super(Generator, self).__init__() - self.label_emb = nn.Embedding(num_classes, num_classes) - - self.model = nn.Sequential( - nn.Linear(latent_size + num_classes, hidden_size), - nn.ReLU(), - nn.Linear(hidden_size, hidden_size), - nn.ReLU(), - nn.Linear(hidden_size, image_size), - nn.Tanh() - ) - - def forward(self, z, labels): - z = z.view(z.size(0), latent_size) - c = self.label_emb(labels) - x = torch.cat([z, c], 1) - out = self.model(x) - return out - -# Initialize models -D = Discriminator().to(device) -G = Generator().to(device) - -# Loss function and optimizer -criterion = nn.BCELoss() -d_optimizer = optim.Adam(D.parameters(), lr=learning_rate) -g_optimizer = optim.Adam(G.parameters(), lr=learning_rate) - -# Utility functions -def denorm(x): - out = (x + 1) / 2 - return out.clamp(0, 1) - -def create_noise(batch_size, latent_size): - return torch.randn(batch_size, latent_size).to(device) - -def create_labels(batch_size): - return torch.randint(0, num_classes, (batch_size,)).to(device) - -# Training the cGAN -total_step = len(train_loader) -for epoch in range(num_epochs): - for i, (images, labels) in enumerate(train_loader): - batch_size = images.size(0) - images = images.to(device) - labels = labels.to(device) - - # Create the labels which are later used as input for the discriminator - real_labels = torch.ones(batch_size, 1).to(device) - fake_labels = torch.zeros(batch_size, 1).to(device) - - # ================================================================== # - # Train the discriminator # - # ================================================================== # - - # Compute BCELoss using real images - outputs = D(images, labels) - d_loss_real = criterion(outputs, real_labels) - real_score = outputs - - # Compute BCELoss using fake images - z = create_noise(batch_size, latent_size) - fake_images = G(z, labels) - outputs = D(fake_images, labels) - d_loss_fake = criterion(outputs, fake_labels) - fake_score = outputs - - # Backprop and optimize - d_loss = d_loss_real + d_loss_fake - D.zero_grad() - 
d_loss.backward() - d_optimizer.step() - - # ================================================================== # - # Train the generator # - # ================================================================== # - - # Compute loss with fake images - z = create_noise(batch_size, latent_size) - fake_images = G(z, labels) - outputs = D(fake_images, labels) - - # We train G to maximize log(D(G(z))) - g_loss = criterion(outputs, real_labels) - - # Backprop and optimize - G.zero_grad() - g_loss.backward() - g_optimizer.step() - - if (i+1) % 200 == 0: - print(f'Epoch [{epoch}/{num_epochs}], Step [{i+1}/{total_step}], d_loss: {d_loss.item():.4f}, g_loss: {g_loss.item():.4f}, D(x): {real_score.mean().item():.2f}, D(G(z)): {fake_score.mean().item():.2f}') - -# Save the trained models -torch.save(G.state_dict(), 'G_cgan.pth') -torch.save(D.state_dict(), 'D_cgan.pth') -``` - -- `cGAN.py`: Contains the implementation of the ACGAN model, training loop, and saving of trained models. - -```py -import torch -import torch.nn as nn -import matplotlib.pyplot as plt -import numpy as np - -# Device configuration -device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') - -# Hyper-parameters -latent_size = 100 -num_classes = 10 -image_size = 28 * 28 - -# Generator -class Generator(nn.Module): - def __init__(self): - super(Generator, self).__init__() - self.label_emb = nn.Embedding(num_classes, num_classes) - - self.model = nn.Sequential( - nn.Linear(latent_size + num_classes, 256), - nn.ReLU(), - nn.Linear(256, 512), - nn.ReLU(), - nn.Linear(512, image_size), - nn.Tanh() - ) - - def forward(self, z, labels): - z = z.view(z.size(0), latent_size) - c = self.label_emb(labels) - x = torch.cat([z, c], 1) - out = self.model(x) - return out - -# Load the trained generator model -G = Generator() -G.load_state_dict(torch.load('G_cgan.pth', map_location=torch.device('cpu'))) -G.eval() - -# Utility function to generate random noise -def create_noise(size, latent_dim): - return torch.randn(size, latent_dim) - -# Utility function to generate labels -def create_labels(size): - return torch.randint(0, num_classes, (size,)) - -# Utility function to denormalize the images -def denorm(x): - out = (x + 1) / 2 - return out.clamp(0, 1) - -# Generate images -with torch.no_grad(): - noise = create_noise(64, latent_size) - labels = create_labels(64) - fake_images = G(noise, labels) - fake_images = fake_images.reshape(fake_images.size(0), 1, 28, 28) - fake_images = denorm(fake_images) - grid = np.transpose(fake_images, (0, 2, 3, 1)).numpy() - - plt.figure(figsize=(8, 8)) - for i in range(grid.shape[0]): - plt.subplot(8, 8, i+1) - plt.imshow(grid[i, :, :, 0], cmap='gray') - plt.axis('off') - plt.show() -``` - -- `test_cGAN.py`: Uses the trained generator to generate sample images after training. - diff --git a/docs/algorithms/generative-adversarial-networks/eb-gan.md b/docs/algorithms/generative-adversarial-networks/eb-gan.md deleted file mode 100644 index 234d672e..00000000 --- a/docs/algorithms/generative-adversarial-networks/eb-gan.md +++ /dev/null @@ -1,225 +0,0 @@ -# EB GAN - -This folder contains an implementation of an Energy-Based Generative Adversarial Network (EBGAN) using PyTorch. EBGAN focuses on matching the energy distribution of generated samples to that of real data, optimizing both a discriminator and a generator network. - ----- - -## Overview - -EBGAN introduces an energy function that is used to measure the quality of generated samples. 
The discriminator (autoencoder-like) network tries to minimize this energy function while the generator tries to maximize it. This results in a more stable training process compared to traditional GANs. - ----- - -## Files - -```py -import torch -import torch.nn as nn -import torch.optim as optim -import torchvision.datasets as dsets -import torchvision.transforms as transforms -from torch.utils.data import DataLoader -import matplotlib.pyplot as plt -import numpy as np - -# Device configuration -device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') - -# Hyper-parameters -image_size = 28 * 28 -latent_size = 64 -hidden_size = 256 -num_epochs = 100 -batch_size = 64 -learning_rate = 0.0002 -k = 3 # Number of iterations for optimizing D - -# MNIST dataset -transform = transforms.Compose([ - transforms.ToTensor(), - transforms.Normalize(mean=(0.5,), std=(0.5,)) -]) - -train_dataset = dsets.MNIST(root='../data/', - train=True, - transform=transform, - download=True) - -train_loader = DataLoader(dataset=train_dataset, - batch_size=batch_size, - shuffle=True) - -# Discriminator -class Discriminator(nn.Module): - def __init__(self): - super(Discriminator, self).__init__() - self.encoder = nn.Sequential( - nn.Linear(image_size, hidden_size), - nn.ReLU(), - nn.Linear(hidden_size, hidden_size), - nn.ReLU(), - nn.Linear(hidden_size, latent_size) - ) - self.decoder = nn.Sequential( - nn.Linear(latent_size, hidden_size), - nn.ReLU(), - nn.Linear(hidden_size, hidden_size), - nn.ReLU(), - nn.Linear(hidden_size, image_size), - nn.Tanh() - ) - - def forward(self, x): - encoded = self.encoder(x) - decoded = self.decoder(encoded) - return decoded, encoded - -# Generator -class Generator(nn.Module): - def __init__(self): - super(Generator, self).__init__() - self.model = nn.Sequential( - nn.Linear(latent_size, hidden_size), - nn.ReLU(), - nn.Linear(hidden_size, hidden_size), - nn.ReLU(), - nn.Linear(hidden_size, image_size), - nn.Tanh() - ) - - def forward(self, z): - out = self.model(z) - return out - -# Initialize models -D = Discriminator().to(device) -G = Generator().to(device) - -# Loss function and optimizer -criterion_rec = nn.MSELoss() -d_optimizer = optim.Adam(D.parameters(), lr=learning_rate) -g_optimizer = optim.Adam(G.parameters(), lr=learning_rate) - -# Utility functions -def denorm(x): - out = (x + 1) / 2 - return out.clamp(0, 1) - -# Training the EBGAN -total_step = len(train_loader) -for epoch in range(num_epochs): - for i, (images, _) in enumerate(train_loader): - batch_size = images.size(0) - images = images.view(-1, image_size).to(device) - - # ================================================================== # - # Train the discriminator # - # ================================================================== # - - encoded_real, _ = D(images) - decoded_real = D.decoder(encoded_real) - - rec_loss_real = criterion_rec(decoded_real, images) - - z = torch.randn(batch_size, latent_size).to(device) - fake_images = G(z) - encoded_fake, _ = D(fake_images.detach()) - decoded_fake = D.decoder(encoded_fake) - - rec_loss_fake = criterion_rec(decoded_fake, fake_images.detach()) - - d_loss = rec_loss_real + torch.max(torch.zeros(1).to(device), k * rec_loss_real - rec_loss_fake) - - D.zero_grad() - d_loss.backward() - d_optimizer.step() - - # ================================================================== # - # Train the generator # - # ================================================================== # - - z = torch.randn(batch_size, latent_size).to(device) - fake_images = G(z) - 
encoded_fake, _ = D(fake_images) - decoded_fake = D.decoder(encoded_fake) - - rec_loss_fake = criterion_rec(decoded_fake, fake_images) - - g_loss = rec_loss_fake - - G.zero_grad() - g_loss.backward() - g_optimizer.step() - - if (i+1) % 200 == 0: - print(f'Epoch [{epoch}/{num_epochs}], Step [{i+1}/{total_step}], d_loss: {d_loss.item():.4f}, g_loss: {g_loss.item():.4f}, Rec_loss_real: {rec_loss_real.item():.4f}, Rec_loss_fake: {rec_loss_fake.item():.4f}') - -# Save the trained models -torch.save(G.state_dict(), 'G_ebgan.pth') -torch.save(D.state_dict(), 'D_ebgan.pth') -``` - -- `EBGAN.py`: Contains the implementation of the ACGAN model, training loop, and saving of trained models. - -```py -import torch -import torch.nn as nn -import matplotlib.pyplot as plt -import numpy as np - -# Device configuration -device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') - -# Hyper-parameters -latent_size = 64 -image_size = 28 * 28 - -# Generator -class Generator(nn.Module): - def __init__(self): - super(Generator, self).__init__() - self.model = nn.Sequential( - nn.Linear(latent_size, 256), - nn.ReLU(), - nn.Linear(256, 512), - nn.ReLU(), - nn.Linear(512, image_size), - nn.Tanh() - ) - - def forward(self, z): - out = self.model(z) - return out - -# Load the trained generator model -G = Generator() -G.load_state_dict(torch.load('G_ebgan.pth', map_location=torch.device('cpu'))) -G.eval() - -# Utility function to generate random noise -def create_noise(size, latent_dim): - return torch.randn(size, latent_dim) - -# Utility function to denormalize the images -def denorm(x): - out = (x + 1) / 2 - return out.clamp(0, 1) - -# Generate images -with torch.no_grad(): - noise = create_noise(64, latent_size) - fake_images = G(noise) - fake_images = fake_images.reshape(fake_images.size(0), 1, 28, 28) - fake_images = denorm(fake_images) - grid = np.transpose(fake_images, (0, 2, 3, 1)).numpy() - - plt.figure(figsize=(8, 8)) - for i in range(grid.shape[0]): - plt.subplot(8, 8, i+1) - plt.imshow(grid[i, :, :, 0], cmap='gray') - plt.axis('off') - plt.show() -``` - -- `test_EBGAN.py`: Uses the trained generator to generate sample images after training. - diff --git a/docs/algorithms/generative-adversarial-networks/index.md b/docs/algorithms/generative-adversarial-networks/index.md deleted file mode 100644 index 004f3839..00000000 --- a/docs/algorithms/generative-adversarial-networks/index.md +++ /dev/null @@ -1,56 +0,0 @@ -# Generative Adversarial Networks ๐Ÿ’ฑ - -
- **Auxiliary Classifier Generative Adversarial Network**: Empowering GANs with Class-Specific Data Generation. (📅 2025-01-10 | ⏱️ 3 mins)
- **Basic Generative Adversarial Network**: Handwritten digit generation from the MNIST dataset. (📅 2025-01-15 | ⏱️ 4 mins)
- **Conditional Generative Adversarial Network**: Controlled image synthesis. (📅 2025-01-15 | ⏱️ 4 mins)
- **Energy Based Generative Adversarial Network**: Minimizes the energy function for a more stable training process. (📅 2025-01-15 | ⏱️ 4 mins)
- **Information Maximizing Generative Adversarial Network**: Empowering Data-Driven Insights with Generative Adversarial Networks for Advanced Information Synthesis. (📅 2025-01-15 | ⏱️ 4 mins)
diff --git a/docs/algorithms/generative-adversarial-networks/info-gan.md b/docs/algorithms/generative-adversarial-networks/info-gan.md deleted file mode 100644 index 020fa0bb..00000000 --- a/docs/algorithms/generative-adversarial-networks/info-gan.md +++ /dev/null @@ -1,255 +0,0 @@ -# Info GAN - - Information Maximizing Generative Adversarial Network - -This folder contains an implementation of InfoGAN using PyTorch. InfoGAN extends the traditional GAN framework by incorporating unsupervised learning of interpretable and disentangled representations. - ----- - -## Overview - -InfoGAN introduces latent codes that can be split into categorical and continuous variables, allowing for more control over the generated outputs. The generator is conditioned on these latent codes, which are learned in an unsupervised manner during training. - ----- - -## Files - -```py -import torch -import torch.nn as nn -import torch.optim as optim -import torchvision.datasets as dsets -import torchvision.transforms as transforms -from torch.utils.data import DataLoader -import numpy as np -import matplotlib.pyplot as plt - -# Device configuration -device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') - -# Hyper-parameters -image_size = 28 * 28 -num_epochs = 50 -batch_size = 100 -latent_size = 62 -num_continuous = 2 -num_categories = 10 -learning_rate = 0.0002 - -# MNIST dataset -transform = transforms.Compose([ - transforms.ToTensor(), - transforms.Normalize(mean=(0.5,), std=(0.5,)) -]) - -train_dataset = dsets.MNIST(root='../data/', - train=True, - transform=transform, - download=True) - -train_loader = DataLoader(dataset=train_dataset, - batch_size=batch_size, - shuffle=True) - -# Generator -class Generator(nn.Module): - def __init__(self): - super(Generator, self).__init__() - self.fc = nn.Sequential( - nn.Linear(latent_size + num_categories + num_continuous, 256), - nn.ReLU(), - nn.Linear(256, 512), - nn.ReLU(), - nn.Linear(512, 1024), - nn.ReLU(), - nn.Linear(1024, image_size), - nn.Tanh() - ) - - def forward(self, z, c_cat, c_cont): - inputs = torch.cat([z, c_cat, c_cont], dim=1) - return self.fc(inputs) - - -# Discriminator -class Discriminator(nn.Module): - def __init__(self): - super(Discriminator, self).__init__() - self.fc = nn.Sequential( - nn.Linear(image_size, 1024), - nn.ReLU(), - nn.Dropout(0.3), - nn.Linear(1024, 512), - nn.ReLU(), - nn.Dropout(0.3), - ) - - self.fc_disc = nn.Linear(512, num_categories) - self.fc_mu = nn.Linear(512, num_continuous) - self.fc_var = nn.Linear(512, num_continuous) - - def forward(self, x): - x = self.fc(x) - disc_logits = self.fc_disc(x) - mu = self.fc_mu(x) - var = torch.exp(self.fc_var(x)) - return disc_logits, mu, var - - -# Initialize networks -G = Generator().to(device) -D = Discriminator().to(device) - -# Loss functions -criterion_cat = nn.CrossEntropyLoss() -criterion_cont = nn.MSELoss() - -# Optimizers -g_optimizer = optim.Adam(G.parameters(), lr=learning_rate) -d_optimizer = optim.Adam(D.parameters(), lr=learning_rate) - -# Utility functions -def sample_noise(batch_size, latent_size): - return torch.randn(batch_size, latent_size).to(device) - -def sample_categorical(batch_size, num_categories): - return torch.randint(0, num_categories, (batch_size,)).to(device) - -def sample_continuous(batch_size, num_continuous): - return torch.rand(batch_size, num_continuous).to(device) - -def denorm(x): - out = (x + 1) / 2 - return out.clamp(0, 1) - -# Training InfoGAN -total_step = len(train_loader) -for epoch in range(num_epochs): - for i, (images, 
labels) in enumerate(train_loader): - batch_size = images.size(0) - images = images.view(-1, image_size).to(device) - - # Create labels for discriminator - real_labels = torch.ones(batch_size, dtype=torch.long, device=device) - fake_labels = torch.zeros(batch_size, dtype=torch.long, device=device) - - # Sample noise, categorical, and continuous latent codes - z = sample_noise(batch_size, latent_size) - c_cat = sample_categorical(batch_size, num_categories) - c_cont = sample_continuous(batch_size, num_continuous) - - # Generate fake images - fake_images = G(z, c_cat, c_cont) - - # Train discriminator - d_optimizer.zero_grad() - d_real_cat, d_real_mu, d_real_var = D(images) - d_real_loss_cat = criterion_cat(d_real_cat, labels) - d_fake_cat, d_fake_mu, d_fake_var = D(fake_images.detach()) - d_fake_loss_cat = criterion_cat(d_fake_cat, c_cat) - - d_loss_cat = d_real_loss_cat + d_fake_loss_cat - - d_real_loss_cont = torch.mean(0.5 * torch.sum(torch.div((d_real_mu - c_cont)**2, d_real_var), dim=1)) - d_fake_loss_cont = torch.mean(0.5 * torch.sum(torch.div((d_fake_mu - c_cont)**2, d_fake_var), dim=1)) - - d_loss_cont = d_real_loss_cont + d_fake_loss_cont - - d_loss = d_loss_cat + d_loss_cont - d_loss.backward() - d_optimizer.step() - - # Train generator - g_optimizer.zero_grad() - _, d_fake_mu, d_fake_var = D(fake_images) - - g_loss_cat = criterion_cat(_, c_cat) - g_loss_cont = torch.mean(0.5 * torch.sum(torch.div((d_fake_mu - c_cont)**2, d_fake_var), dim=1)) - - g_loss = g_loss_cat + g_loss_cont - g_loss.backward() - g_optimizer.step() - - if (i+1) % 200 == 0: - print(f'Epoch [{epoch}/{num_epochs}], Step [{i+1}/{total_step}], d_loss: {d_loss.item():.4f}, g_loss: {g_loss.item():.4f}') - -# Save the trained models -torch.save(G.state_dict(), 'G_infogan.pth') -torch.save(D.state_dict(), 'D_infogan.pth') -``` - -- `InfoGAN.py`: Contains the implementation of the ACGAN model, training loop, and saving of trained models. 
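For context, the categorical and continuous code losses computed in the training loop above play the role of the mutual-information term in the original InfoGAN objective; a simplified sketch (not taken verbatim from this script) is:

$$
\min_{G,\,Q}\ \max_{D}\ V_{\mathrm{GAN}}(D, G) \;-\; \lambda\, L_I(G, Q)
$$

where Q is the auxiliary head of the discriminator (here `fc_disc` for the categorical code and `fc_mu`/`fc_var` for the continuous codes), L_I is a variational lower bound on the mutual information I(c; G(z, c)) between the latent codes and the generated images, and lambda weights the information term.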
- -```py -import torch -import torch.nn as nn -import matplotlib.pyplot as plt -import numpy as np - -# Device configuration -device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') - -# Hyper-parameters -latent_size = 62 -num_categories = 10 -num_continuous = 2 -image_size = 28 * 28 - -# Generator -class Generator(nn.Module): - def __init__(self): - super(Generator, self).__init__() - self.fc = nn.Sequential( - nn.Linear(latent_size + num_categories + num_continuous, 256), - nn.ReLU(), - nn.Linear(256, 512), - nn.ReLU(), - nn.Linear(512, 1024), - nn.ReLU(), - nn.Linear(1024, image_size), - nn.Tanh() - ) - - def forward(self, z, c_cat, c_cont): - inputs = torch.cat([z, c_cat, c_cont], dim=1) - return self.fc(inputs) - -# Load the trained generator model -G = Generator().to(device) -G.load_state_dict(torch.load('G_infogan.pth', map_location=torch.device('cpu'))) -G.eval() - -# Utility functions to generate samples -def sample_noise(batch_size, latent_size): - return torch.randn(batch_size, latent_size).to(device) - -def sample_categorical(batch_size, num_categories): - return torch.randint(0, num_categories, (batch_size,)).to(device) - -def sample_continuous(batch_size, num_continuous): - return torch.rand(batch_size, num_continuous).to(device) - -def denorm(x): - out = (x + 1) / 2 - return out.clamp(0, 1) - -# Generate images -with torch.no_grad(): - noise = sample_noise(64, latent_size) - c_cat = sample_categorical(64, num_categories) - c_cont = sample_continuous(64, num_continuous) - fake_images = G(noise, c_cat, c_cont) - fake_images = fake_images.reshape(fake_images.size(0), 1, 28, 28) - fake_images = denorm(fake_images) - grid = np.transpose(fake_images, (0, 2, 3, 1)).numpy() - - plt.figure(figsize=(8, 8)) - for i in range(grid.shape[0]): - plt.subplot(8, 8, i+1) - plt.imshow(grid[i, :, :, 0], cmap='gray') - plt.axis('off') - plt.show() -``` - -- `test_InfoGAN.py`: Uses the trained generator to generate sample images after training. - diff --git a/docs/algorithms/index.md b/docs/algorithms/index.md deleted file mode 100644 index 45a2a3d1..00000000 --- a/docs/algorithms/index.md +++ /dev/null @@ -1,93 +0,0 @@ -# Algorithms ๐Ÿ”ท - -
- **Statistics**: Understanding data through statistical analysis and inference methods.
- **Machine Learning**: Dive into the world of algorithms and models in Machine Learning.
- **Deep Learning**: Explore the fascinating world of deep learning.
- **Computer Vision**: Learn computer vision with OpenCV for real-time image processing applications.
- **Natural Language Processing**: Dive into how machines understand and generate human language.
- **Generative Adversarial Networks**: Learn about the power of Generative Adversarial Networks for creative AI solutions.
- **Large Language Models**: Explore the cutting-edge techniques behind large language models like GPT and BERT.
- **Artificial Intelligence**: Explore the fundamentals and advanced concepts of Artificial Intelligence.
diff --git a/docs/algorithms/large-language-models/bert/index.md b/docs/algorithms/large-language-models/bert/index.md deleted file mode 100644 index 3f461112..00000000 --- a/docs/algorithms/large-language-models/bert/index.md +++ /dev/null @@ -1,11 +0,0 @@ -# BERT ๐Ÿง  - -
- **No Items Found**: There are no items available at this time. Check back again later.
diff --git a/docs/algorithms/large-language-models/bloom/index.md b/docs/algorithms/large-language-models/bloom/index.md deleted file mode 100644 index 637b6fc6..00000000 --- a/docs/algorithms/large-language-models/bloom/index.md +++ /dev/null @@ -1,11 +0,0 @@ -# Bloom ๐Ÿง  - -
- **No Items Found**: There are no items available at this time. Check back again later.
diff --git a/docs/algorithms/large-language-models/gpt-series/index.md b/docs/algorithms/large-language-models/gpt-series/index.md deleted file mode 100644 index 5b73b98d..00000000 --- a/docs/algorithms/large-language-models/gpt-series/index.md +++ /dev/null @@ -1,11 +0,0 @@ -# GPT Series ๐Ÿง  - -
- **No Items Found**: There are no items available at this time. Check back again later.
diff --git a/docs/algorithms/large-language-models/index.md b/docs/algorithms/large-language-models/index.md deleted file mode 100644 index f0c9cc9b..00000000 --- a/docs/algorithms/large-language-models/index.md +++ /dev/null @@ -1,49 +0,0 @@ -# Large Language Models ๐Ÿง  - - diff --git a/docs/algorithms/large-language-models/t5/index.md b/docs/algorithms/large-language-models/t5/index.md deleted file mode 100644 index b2952864..00000000 --- a/docs/algorithms/large-language-models/t5/index.md +++ /dev/null @@ -1,11 +0,0 @@ -# T5 ๐Ÿง  - -
- **No Items Found**: There are no items available at this time. Check back again later.
diff --git a/docs/algorithms/machine-learning/boosting/index.md b/docs/algorithms/machine-learning/boosting/index.md deleted file mode 100644 index bc66f3c4..00000000 --- a/docs/algorithms/machine-learning/boosting/index.md +++ /dev/null @@ -1,15 +0,0 @@ -# Boosting ๐Ÿค– - - diff --git a/docs/algorithms/machine-learning/boosting/light-gbm.md b/docs/algorithms/machine-learning/boosting/light-gbm.md deleted file mode 100644 index da094a5f..00000000 --- a/docs/algorithms/machine-learning/boosting/light-gbm.md +++ /dev/null @@ -1,128 +0,0 @@ -### **LightGBM: A Comprehensive Guide to Scratch Implementation** - -**Overview:** -LightGBM (Light Gradient Boosting Machine) is an advanced gradient boosting framework that efficiently handles large datasets. Unlike traditional boosting methods, LightGBM uses leaf-wise tree growth, which improves accuracy and reduces computation time. - ---- - -### **Key Highlights:** -- **Speed and Efficiency:** Faster training on large datasets compared to XGBoost. -- **Memory Optimization:** Lower memory usage, making it scalable. -- **Built-in Handling of Categorical Data:** No need for manual one-hot encoding. -- **Parallel and GPU Training:** Supports multi-threading and GPU acceleration for faster computation. - ---- - -### **How LightGBM Works (Scratch Implementation Guide):** - -#### **1. Core Concept (Leaf-Wise Tree Growth):** -- **Level-wise (XGBoost):** Grows all leaves at the same depth before moving to the next. -- **Leaf-wise (LightGBM):** Grows the leaf that reduces the most loss, potentially leading to deeper, more accurate trees. - -*Example Visualization:* -``` -Level-wise (XGBoost) Leaf-wise (LightGBM) - O O - / \ / \ - O O O O - / \ \ - O O O -``` - ---- - -### **Algorithm Breakdown:** -1. **Initialize Model:** Start with a simple model (like mean predictions). -2. **Compute Residuals:** Calculate errors between actual and predicted values. -3. **Train Trees to Predict Residuals:** Fit new trees to minimize residuals. -4. **Update Model:** Adjust predictions by adding the new treeโ€™s results. -5. **Repeat Until Convergence or Early Stopping.** - ---- - -### **Parameters Explained:** -- **num_leaves:** Limits the number of leaves in a tree (complexity control). -- **max_depth:** Constrains tree depth to prevent overfitting. -- **learning_rate:** Scales the contribution of each tree to control convergence. -- **n_estimators:** Number of boosting rounds (trees). -- **min_data_in_leaf:** Minimum number of data points in a leaf to avoid overfitting small branches. 
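In terms of these parameters, the additive model that boosting builds can be written as the standard gradient-boosting update (generic form, not lifted from the snippet below):

$$
F_m(x) = F_{m-1}(x) + \eta\, h_m(x),
\qquad
h_m \approx \arg\min_{h} \sum_i \big(r_i^{(m)} - h(x_i)\big)^2,
\qquad
r_i^{(m)} = y_i - F_{m-1}(x_i)
$$

where eta is `learning_rate`, m runs from 1 to `n_estimators`, and each tree h_m is grown leaf-wise subject to `num_leaves`, `max_depth`, and `min_data_in_leaf`.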
- ---- - -### **Scratch Code Example (From the Ground Up):** - -**File:** `lightgbm_model.py` -```python -import lightgbm as lgb -from sklearn.model_selection import train_test_split -from sklearn.metrics import mean_squared_error - -class LightGBMModel: - def __init__(self, params=None): - self.params = params if params else { - 'objective': 'regression', - 'metric': 'rmse', - 'boosting_type': 'gbdt', - 'num_leaves': 31, - 'learning_rate': 0.05, - 'n_estimators': 100 - } - self.model = None - - def fit(self, X_train, y_train): - d_train = lgb.Dataset(X_train, label=y_train) - self.model = lgb.train(self.params, d_train) - - def predict(self, X_test): - return self.model.predict(X_test) -``` - ---- - -### **Testing the Model:** - -**File:** `lightgbm_model_test.py` -```python -import unittest -import numpy as np -from sklearn.datasets import load_diabetes -from sklearn.model_selection import train_test_split -from lightgbm_model import LightGBMModel - -class TestLightGBMModel(unittest.TestCase): - - def test_lightgbm(self): - # Load Dataset - data = load_diabetes() - X_train, X_test, y_train, y_test = train_test_split( - data.data, data.target, test_size=0.2, random_state=42) - - # Train Model - model = LightGBMModel() - model.fit(X_train, y_train) - - # Predict and Evaluate - predictions = model.predict(X_test) - mse = mean_squared_error(y_test, predictions) - self.assertTrue(mse < 3500, "MSE is too high, LightGBM not performing well") - -if __name__ == '__main__': - unittest.main() -``` - ---- - -### **Additional Insights to Aid Understanding:** -- **Feature Importance:** -```python -lgb.plot_importance(model.model) -``` -- **Early Stopping Implementation:** -```python -self.model = lgb.train(self.params, d_train, valid_sets=[d_train], early_stopping_rounds=10) -``` - ---- - -### **Testing and Validation:** -Use `sklearn` datasets to validate the implementation. Compare performance with other boosting models to highlight LightGBMโ€™s efficiency. diff --git a/docs/algorithms/machine-learning/data-preprocessing/encoding/index.md b/docs/algorithms/machine-learning/data-preprocessing/encoding/index.md deleted file mode 100644 index 88bf8dec..00000000 --- a/docs/algorithms/machine-learning/data-preprocessing/encoding/index.md +++ /dev/null @@ -1,16 +0,0 @@ -# Encoding Algorithms ๐Ÿค– - - diff --git a/docs/algorithms/machine-learning/data-preprocessing/encoding/ordinal-encoder.md b/docs/algorithms/machine-learning/data-preprocessing/encoding/ordinal-encoder.md deleted file mode 100644 index a995876a..00000000 --- a/docs/algorithms/machine-learning/data-preprocessing/encoding/ordinal-encoder.md +++ /dev/null @@ -1,115 +0,0 @@ -# ORDINAL ENCODER - -A custom implementation of an OrdinalEncoder class for encoding categorical data into ordinal integers using a pandas DataFrame. The class maps each unique category to an integer based on the order of appearance. - -## Features - -- **fit**: Learn the mapping of categories to ordinal integers for each column. -- **transform**: Transform the categorical data to ordinal integers based on the learned mapping. -- **fit_transform**: Fit the encoder and transform the data in one step. - -## Methods - -1. `__init__(self)` - - Initializes the OrdinalEncoding class. - - No parameters are required. -2. `fit(self, data)` - - Learns the mapping of categories to ordinal integers for each column. - - Parameters: - - data (pandas.DataFrame): The data to fit. - - Raises: - - TypeError: If the input data is not a pandas DataFrame. -3. 
`transform(self, data)` - - Transforms the categorical data to ordinal integers based on the learned mapping. - - Parameters: - - data (pandas.DataFrame): The data to transform. - - Returns: - - pandas.DataFrame: The transformed data. - - Raises: - - Error: If transform is called before fit or fit_transform. -4. `fit_transform(self, data)` - - Fits the encoder to the data and transforms the data in one step. - - Parameters: - - data (pandas.DataFrame): The data to fit and transform. - - Returns: - - pandas.DataFrame: The transformed data. - -## Error Handling - -- Raises a TypeError if the input data is not a pandas DataFrame in the fit method. -- Raises an error if transform is called before fit or fit_transform. - -## Use Case - -![use_case](https://github.com/user-attachments/assets/af3f20f7-b26a-45b7-9a0f-fc9dcdd99534) - -## Output - -![output](https://github.com/user-attachments/assets/12f31b6b-c165-460f-b1e9-5726663f625d) - - -- ordinal_encoder.py file - -```py -import pandas as pd - -class OrdinalEncoding: - def __init__(self): - self.category_mapping = {} - - def fit(self, data): - # Fit the encoder to the data (pandas DataFrame). - # type check - if not type(data)==pd.DataFrame: - raise f"Type of data should be Pandas.DataFrame; {type(data)} found" - for column in data.columns: - unique_categories = sorted(set(data[column])) - self.category_mapping[column] = {category: idx for idx, category in enumerate(unique_categories)} - - def transform(self, data): - # Transform the data (pandas DataFrame) to ordinal integers. - # checking for empty mapping - if not self.category_mapping: - raise "Catrgorical Mapping not found. Call OrdinalExcoding.fit() method or call OrdinalEncoding.fit_transform() method" - - data_transformed = data.copy() - for column in data.columns: - data_transformed[column] = data[column].map(self.category_mapping[column]) - return data_transformed - - def fit_transform(self, data): - # Fit the encoder and transform the data in one step. - self.fit(data) - return self.transform(data) -``` - -- test_ordinal_encoder.py file - -```py -import os -import sys -# for resolving any path conflict -current = os.path.dirname(os.path.realpath("ordinal_encoder.py")) -parent = os.path.dirname(current) -sys.path.append(current) - -import pandas as pd - -from Ordinal_Encoder.ordinal_encoder import OrdinalEncoding - -# Example usage -data = { - 'Category1': ['low', 'medium', 'high', 'medium', 'low', 'high', 'medium'], - 'Category2': ['A', 'B', 'A', 'B', 'A', 'B', 'A'], - 'Category3': ['X', 'Y', 'X', 'Y', 'X', 'Y', 'X'] -} -df = pd.DataFrame(data) - -encoder = OrdinalEncoding() -encoded_df = encoder.fit_transform(df) - -print("Original DataFrame:") -print(df) -print("\nEncoded DataFrame:") -print(encoded_df) -``` diff --git a/docs/algorithms/machine-learning/data-preprocessing/imputation/index.md b/docs/algorithms/machine-learning/data-preprocessing/imputation/index.md deleted file mode 100644 index 50d9ac91..00000000 --- a/docs/algorithms/machine-learning/data-preprocessing/imputation/index.md +++ /dev/null @@ -1,11 +0,0 @@ -# Imputation Algorithm ๐Ÿค– - -
- **No Items Found**: There are no items available at this time. Check back again later.
diff --git a/docs/algorithms/machine-learning/data-preprocessing/index.md b/docs/algorithms/machine-learning/data-preprocessing/index.md deleted file mode 100644 index dd02f4cd..00000000 --- a/docs/algorithms/machine-learning/data-preprocessing/index.md +++ /dev/null @@ -1,40 +0,0 @@ -# Data Pre-processing ๐Ÿค– - - diff --git a/docs/algorithms/machine-learning/data-preprocessing/scaling-and-normalization/index.md b/docs/algorithms/machine-learning/data-preprocessing/scaling-and-normalization/index.md deleted file mode 100644 index 622d7804..00000000 --- a/docs/algorithms/machine-learning/data-preprocessing/scaling-and-normalization/index.md +++ /dev/null @@ -1,27 +0,0 @@ -# Scaling and Normalization ๐Ÿค– - - diff --git a/docs/algorithms/machine-learning/data-preprocessing/scaling-and-normalization/min-max-scaler.md b/docs/algorithms/machine-learning/data-preprocessing/scaling-and-normalization/min-max-scaler.md deleted file mode 100644 index 5898f722..00000000 --- a/docs/algorithms/machine-learning/data-preprocessing/scaling-and-normalization/min-max-scaler.md +++ /dev/null @@ -1,133 +0,0 @@ -# MIN MAX SCALER - -A custom implementation of a MinMaxScaler class for scaling numerical data in a pandas DataFrame. The class scales the features to a specified range, typically between 0 and 1. - -## Features - -- **fit**: Calculate the minimum and maximum values of the data. -- **transform**: Scale the data to the specified feature range. -- **fit_transform**: Fit the scaler and transform the data in one step. -- **get_params**: Retrieve the minimum and maximum values calculated during fitting. - -## Methods - -1. `__init__(self, feature_range=(0, 1))` - - Initializes the MinMaxScaling class. - - Parameters: - - feature_range (tuple): Desired range of transformed data. Default is (0, 1). -2. `fit(self, data)` - - Calculates the minimum and maximum values of the data. - - Parameters: - - data (pandas.DataFrame): The data to fit. -3. `transform(self, data)` - - Transforms the data to the specified feature range. - - Parameters: - - data (pandas.DataFrame): The data to transform. - - Returns: - - pandas.DataFrame: The scaled data. -4. `fit_transform(self, data)` - - Fits the scaler to the data and transforms the data in one step. - - Parameters: - - data (pandas.DataFrame): The data to fit and transform. - - Returns: - - pandas.DataFrame: The scaled data. -5. `get_params(self)` - - Retrieves the minimum and maximum values calculated during fitting. - - Returns: - - dict: Dictionary containing the minimum and maximum values. - -## Error Handling - -- Raises a TypeError if the input data is not a pandas DataFrame in the fit method. -- Raises an error if transform is called before fit or fit_transform. -- Raises an error in get_params if called before fit. 
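For reference, the transformation applied by `transform` in the code below is the usual min-max rescaling into the requested `feature_range`, applied column by column:

$$
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \times (b - a) + a
$$

where (a, b) is the `feature_range` (default (0, 1)) and x_min, x_max are the column minimum and maximum learned by `fit`.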
- -## Use Case - -![use_case](https://github.com/user-attachments/assets/86cc2962-e744-490d-97a6-c084496701de) - -## Output - -![output](https://github.com/user-attachments/assets/d62b9856-d67c-4c92-a2db-f0e76409856a) - - -- min_max_scaler.py file - -```py -import pandas as pd - -# Custom MinMaxScaler class -class MinMaxScaling: - # init function - def __init__(self, feature_range=(0, 1)): # feature range can be specified by the user else it takes (0,1) - self.min = feature_range[0] - self.max = feature_range[1] - self.data_min_ = None - self.data_max_ = None - - # fit function to calculate min and max value of the data - def fit(self, data): - # type check - if not type(data)==pd.DataFrame: - raise f"TypeError : parameter should be a Pandas.DataFrame; {type(data)} found" - else: - self.data_min_ = data.min() - self.data_max_ = data.max() - - # transform function - def transform(self, data): - if self.data_max_ is None or self.data_min_ is None: - raise "Call MinMaxScaling.fit() first or call MinMaxScaling.fit_transform() as the required params not found" - else: - data_scaled = (data - self.data_min_) / (self.data_max_ - self.data_min_) - data_scaled = data_scaled * (self.max - self.min) + self.min - return data_scaled - - # fit_tranform function - def fit_transform(self, data): - self.fit(data) - return self.transform(data) - - # get_params function - def get_params(self): - if self.data_max_ is None or self.data_min_ is None: - raise "Params not found! Call MinMaxScaling.fit() first" - else: - return {"Min" : self.data_min_, - "Max" : self.data_max_} -``` - -- test_min_max_scaler.py file - -```py -import os -import sys -# for resolving any path conflict -current = os.path.dirname(os.path.realpath("min_max_scaler.py")) -parent = os.path.dirname(current) -sys.path.append(current) - -import pandas as pd - -from Min_Max_Scaler.min_max_scaler import MinMaxScaling - -# Example DataFrame -data = { - 'A': [1, 2, 3, 4, 5], - 'B': [10, 20, 30, 40, 50], - 'C': [100, 200, 300, 400, 500] -} - -df = pd.DataFrame(data) - -# Initialize the CustomMinMaxScaler -scaler = MinMaxScaling() - -# Fit the scaler to the data and transform the data -scaled_df = scaler.fit_transform(df) - -print("Original DataFrame:") -print(df) -print("\nScaled DataFrame:") -print(scaled_df) -``` diff --git a/docs/algorithms/machine-learning/data-preprocessing/scaling-and-normalization/standard-scaler.md b/docs/algorithms/machine-learning/data-preprocessing/scaling-and-normalization/standard-scaler.md deleted file mode 100644 index daa3a509..00000000 --- a/docs/algorithms/machine-learning/data-preprocessing/scaling-and-normalization/standard-scaler.md +++ /dev/null @@ -1,140 +0,0 @@ -# STANDARD SCALER - -A custom implementation of a StandardScaler class for scaling numerical data in a pandas DataFrame or NumPy array. The class scales the features to have zero mean and unit variance. - -## Features - -- **fit**: Calculate the mean and standard deviation of the data. -- **transform**: Scale the data to have zero mean and unit variance. -- **fit_transform**: Fit the scaler and transform the data in one step. -- **get_params**: Retrieve the mean and standard deviation calculated during fitting. - -## Methods - -1. `__init__(self)` - - Initializes the StandardScaling class. - - No parameters are required. -2. `fit(self, data)` - - Calculates the mean and standard deviation of the data. - - Parameters: - - data (pandas.DataFrame or numpy.ndarray): The data to fit. 
- - Raises: - - TypeError: If the input data is not a pandas DataFrame or NumPy array. -3. `transform(self, data)` - - Transforms the data to have zero mean and unit variance. - - Parameters: - - data (pandas.DataFrame or numpy.ndarray): The data to transform. - - Returns: - - numpy.ndarray: The scaled data. - - Raises: - - Error: If transform is called before fit or fit_transform. -4. `fit_transform(self, data)` - - Fits the scaler to the data and transforms the data in one step. - - Parameters: - - data (pandas.DataFrame or numpy.ndarray): The data to fit and transform. - - Returns: - - numpy.ndarray: The scaled data. -5. `get_params(self)` - - Retrieves the mean and standard deviation calculated during fitting. - - Returns: - - dict: Dictionary containing the mean and standard deviation. - - Raises: - - Error: If get_params is called before fit. - -## Error Handling - -- Raises a TypeError if the input data is not a pandas DataFrame or NumPy array in the fit method. -- Raises an error if transform is called before fit or fit_transform. -- Raises an error in get_params if called before fit. - -## Use Case - -![use_case](https://github.com/user-attachments/assets/857aa25a-6cb9-4320-aa43-993bd289bd32) - -## Output - -![output](https://github.com/user-attachments/assets/ea2c1374-78c7-4cff-a431-eced068c052f) - - -- standard_scaler.py file - -```py -import pandas as pd -import numpy as np - -# Custom MinMaxScaler class -class StandardScaling: - # init function - def __init__(self): - self.data_mean_ = None - self.data_std_ = None - - # fit function to calculate min and max value of the data - def fit(self, data): - # type check - if not (type(data)==pd.DataFrame or type(data)==np.ndarray): - raise f"TypeError : parameter should be a Pandas.DataFrame or Numpy.ndarray; {type(data)} found" - elif type(data)==pd.DataFrame: - data = data.to_numpy() - - self.data_mean_ = np.mean(data, axis=0) - self.data_std_ = np.sqrt(np.var(data, axis=0)) - - # transform function - def transform(self, data): - if self.data_mean_ is None or self.data_std_ is None: - raise "Call StandardScaling.fit() first or call StandardScaling.fit_transform() as the required params not found" - else: - data_scaled = (data - self.data_mean_) / (self.data_std_) - return data_scaled - - # fit_tranform function - def fit_transform(self, data): - self.fit(data) - return self.transform(data) - - # get_params function - def get_params(self): - if self.data_mean_ is None or self.data_std_ is None: - raise "Params not found! 
Call StandardScaling.fit() first" - else: - return {"Mean" : self.data_mean_, - "Standard Deviation" : self.data_std_} -``` - -- test_standard_scaler.py file - -```py -import os -import sys -# for resolving any path conflict -current = os.path.dirname(os.path.realpath("standard_scaler.py")) -parent = os.path.dirname(current) -sys.path.append(current) - -import pandas as pd - -from Standard_Scaler.standard_scaler import StandardScaling - -# Example DataFrame -data = { - 'A': [1, 2, 3, 4, 5], - 'B': [10, 20, 30, 40, 50], - 'C': [100, 200, 300, 400, 500] -} - -df = pd.DataFrame(data) - -# Initialize the CustomMinMaxScaler -scaler = StandardScaling() - -# Fit the scaler to the data and transform the data -scaled_df = scaler.fit_transform(df) - -print("Original DataFrame:") -print(df) -print("\nScaled DataFrame:") -print(scaled_df) -print("\nAssociated Parameters:") -print(scaler.get_params()) -``` \ No newline at end of file diff --git a/docs/algorithms/machine-learning/index.md b/docs/algorithms/machine-learning/index.md deleted file mode 100644 index 516bc688..00000000 --- a/docs/algorithms/machine-learning/index.md +++ /dev/null @@ -1,49 +0,0 @@ -# Machine Learning ๐Ÿค– - - diff --git a/docs/algorithms/machine-learning/supervised/classifications/index.md b/docs/algorithms/machine-learning/supervised/classifications/index.md deleted file mode 100644 index eef10a31..00000000 --- a/docs/algorithms/machine-learning/supervised/classifications/index.md +++ /dev/null @@ -1,10 +0,0 @@ -# Classification Algorithms ๐Ÿค– -
- **No Items Found**: There are no items available at this time. Check back again later.
\ No newline at end of file diff --git a/docs/algorithms/machine-learning/supervised/index.md b/docs/algorithms/machine-learning/supervised/index.md deleted file mode 100644 index 24a87375..00000000 --- a/docs/algorithms/machine-learning/supervised/index.md +++ /dev/null @@ -1,27 +0,0 @@ -# Supervised Machine Learning ๐Ÿค– - - diff --git a/docs/algorithms/machine-learning/supervised/regressions/adaboost.md b/docs/algorithms/machine-learning/supervised/regressions/adaboost.md deleted file mode 100644 index bf20e558..00000000 --- a/docs/algorithms/machine-learning/supervised/regressions/adaboost.md +++ /dev/null @@ -1,144 +0,0 @@ -# AdaBoost - -**Overview:** -AdaBoost (Adaptive Boosting) is one of the most popular ensemble methods for boosting weak learners to create a strong learner. It works by combining multiple "weak" models, typically decision stumps, and focusing more on the errors from previous models. This iterative process improves accuracy and reduces bias. - ---- - -### **Key Highlights:** -- **Boosting Concept:** Builds an ensemble by sequentially focusing on harder-to-classify instances. -- **Adaptive Weighting:** Misclassified instances get higher weights, and correctly classified instances get lower weights in subsequent rounds. -- **Simple and Effective:** Often uses decision stumps (single-level decision trees) as base models. -- **Versatility:** Applicable to both regression and classification problems. - ---- - -### **How AdaBoost Works (Scratch Implementation Guide):** - -#### **1. Core Concept (Error Weight Adjustment):** -- Assigns equal weights to all data points initially. -- In each iteration: - - A weak model (e.g., a decision stump) is trained on the weighted dataset. - - Misclassified points are assigned higher weights for the next iteration. - - A final strong model is constructed by combining all weak models, weighted by their accuracy. - -*Visualization:* -``` -Iteration 1: Train weak model -> Update weights -Iteration 2: Train weak model -> Update weights -... -Final Model: Combine weak models with weighted contributions -``` - ---- - -### **Algorithm Breakdown:** -1. **Initialize Weights:** Assign equal weights to all instances. -2. **Train a Weak Model:** Use weighted data to train a weak learner. -3. **Calculate Model Error:** Compute the weighted error rate of the model. -4. **Update Instance Weights:** Increase weights for misclassified points and decrease weights for correctly classified points. -5. **Update Model Weight:** Calculate the modelโ€™s contribution based on its accuracy. -6. **Repeat for a Set Number of Iterations or Until Convergence.** - ---- - -### **Parameters Explained:** -- **n_estimators:** Number of weak learners (iterations). -- **learning_rate:** Shrinks the contribution of each weak learner to avoid overfitting. -- **base_estimator:** The weak learner used (e.g., `DecisionTreeRegressor` or `DecisionTreeClassifier`). 
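The reweighting performed by the scratch code below follows the classic AdaBoost update; a sketch, writing eta for `learning_rate`, is:

$$
\varepsilon_m = \sum_i w_i\,\mathbf{1}\big[h_m(x_i) \ne y_i\big],
\qquad
\alpha_m = \eta \,\ln\!\frac{1 - \varepsilon_m}{\varepsilon_m},
\qquad
w_i \leftarrow \frac{w_i\,e^{\alpha_m \mathbf{1}[h_m(x_i) \ne y_i]}}{\sum_j w_j\,e^{\alpha_m \mathbf{1}[h_m(x_j) \ne y_j]}}
$$

and the ensemble predicts with the sign of the alpha-weighted sum of the weak learners.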
- ---- - -### **Scratch Code Example (From the Ground Up):** - -**File:** `adaboost_model.py` -```python -import numpy as np -from sklearn.tree import DecisionTreeRegressor - -class AdaBoostRegressor: - def __init__(self, n_estimators=50, learning_rate=1.0): - self.n_estimators = n_estimators - self.learning_rate = learning_rate - self.models = [] - self.model_weights = [] - - def fit(self, X, y): - n_samples = X.shape[0] - # Initialize weights - sample_weights = np.ones(n_samples) / n_samples - - for _ in range(self.n_estimators): - # Train weak model - model = DecisionTreeRegressor(max_depth=1) - model.fit(X, y, sample_weight=sample_weights) - predictions = model.predict(X) - - # Calculate weighted error - error = np.sum(sample_weights * (y != predictions)) / np.sum(sample_weights) - if error > 0.5: - break - - # Calculate model weight - model_weight = self.learning_rate * np.log((1 - error) / error) - - # Update sample weights - sample_weights *= np.exp(model_weight * (y != predictions)) - sample_weights /= np.sum(sample_weights) - - self.models.append(model) - self.model_weights.append(model_weight) - - def predict(self, X): - # Combine predictions from all models - final_prediction = sum(weight * model.predict(X) for model, weight in zip(self.models, self.model_weights)) - return np.sign(final_prediction) -``` - ---- - -### **Testing the Model:** - -**File:** `adaboost_model_test.py` -```python -import unittest -import numpy as np -from sklearn.datasets import make_regression -from sklearn.metrics import mean_squared_error -from adaboost_model import AdaBoostRegressor - -class TestAdaBoostRegressor(unittest.TestCase): - - def test_adaboost(self): - # Generate synthetic dataset - X, y = make_regression(n_samples=100, n_features=1, noise=15, random_state=42) - y = np.sign(y) # Convert to classification-like regression - - # Train AdaBoost Regressor - model = AdaBoostRegressor(n_estimators=10) - model.fit(X, y) - - # Predict and Evaluate - predictions = model.predict(X) - mse = mean_squared_error(y, predictions) - self.assertTrue(mse < 0.5, "MSE is too high, AdaBoost not performing well") - -if __name__ == '__main__': - unittest.main() -``` - ---- - -### **Additional Insights to Aid Understanding:** -- **Feature Importance:** -```python -for i, model in enumerate(model.models): - print(f"Model {i} weight: {model_weights[i]}") -``` -- **Early Stopping Implementation:** -Use validation metrics to stop training if performance does not improve over several iterations. - ---- - -### **Testing and Validation:** -Use datasets from `sklearn` (e.g., `make_regression`) to validate the implementation. Compare AdaBoost with other boosting models like Gradient Boosting and LightGBM to analyze performance differences. diff --git a/docs/algorithms/machine-learning/supervised/regressions/bayesian.md b/docs/algorithms/machine-learning/supervised/regressions/bayesian.md deleted file mode 100644 index 01e543b9..00000000 --- a/docs/algorithms/machine-learning/supervised/regressions/bayesian.md +++ /dev/null @@ -1,94 +0,0 @@ -# Bayesian Regression - -This module contains an implementation of Bayesian Regression, a probabilistic approach to linear regression that provides uncertainty estimates for predictions. - -## Overview - -Bayesian Regression is an extension of traditional linear regression that models the distribution of coefficients, allowing for uncertainty in the model parameters. 
It's particularly useful when dealing with limited data and provides a full probability distribution over the possible values of the regression coefficients. - -## Parameters - -- `alpha`: Prior precision for the coefficients. -- `beta`: Precision of the noise in the observations. - -## Scratch Code - -- bayesian_regression.py file - -```py -import numpy as np - -class BayesianRegression: - def __init__(self, alpha=1, beta=1): - """ - Constructor for the BayesianRegression class. - - Parameters: - - alpha: Prior precision. - - beta: Noise precision. - """ - self.alpha = alpha - self.beta = beta - self.w_mean = None - self.w_precision = None - - def fit(self, X, y): - """ - Fit the Bayesian Regression model to the input data. - - Parameters: - - X: Input features (numpy array). - - y: Target values (numpy array). - """ - # Add a bias term to X - X = np.c_[np.ones(X.shape[0]), X] - - # Compute posterior precision and mean - self.w_precision = self.alpha * np.eye(X.shape[1]) + self.beta * X.T @ X - self.w_mean = self.beta * np.linalg.solve(self.w_precision, X.T @ y) - - def predict(self, X): - """ - Make predictions on new data. - - Parameters: - - X: Input features for prediction (numpy array). - - Returns: - - Predicted values (numpy array). - """ - # Add a bias term to X - X = np.c_[np.ones(X.shape[0]), X] - - # Compute predicted mean - y_pred = X @ self.w_mean - - return y_pred -``` - -- bayesian_regression_test.py file - -```py -import unittest -import numpy as np -from BayesianRegression import BayesianRegression - -class TestBayesianRegression(unittest.TestCase): - def setUp(self): - # Generate synthetic data for testing - np.random.seed(42) - self.X_train = 2 * np.random.rand(100, 1) - self.y_train = 4 + 3 * self.X_train + np.random.randn(100, 1) - - self.X_test = 2 * np.random.rand(20, 1) - - def test_fit_predict(self): - blr = BayesianRegression() - blr.fit(self.X_train, self.y_train) - y_pred = blr.predict(self.X_test) - - self.assertTrue(y_pred.shape == (20, 1)) - -if __name__ == '__main__': - unittest.main() -``` diff --git a/docs/algorithms/machine-learning/supervised/regressions/decision-tree.md b/docs/algorithms/machine-learning/supervised/regressions/decision-tree.md deleted file mode 100644 index f4f40075..00000000 --- a/docs/algorithms/machine-learning/supervised/regressions/decision-tree.md +++ /dev/null @@ -1,205 +0,0 @@ -# Decision Tree Regression - -This module contains an implementation of Decision Tree Regression, a versatile algorithm for predicting a continuous outcome based on input features. - -## Parameters - -- `max_depth`: Maximum depth of the decision tree. Controls the complexity of the model. - -## Scratch Code - -- decision_tree_regression.py file - -```py -import numpy as np - -class DecisionTreeRegression: - - def __init__(self, max_depth=None): - """ - Constructor for the DecisionTreeRegression class. - - Parameters: - - max_depth: Maximum depth of the decision tree. - """ - self.max_depth = max_depth - self.tree = None - - def _calculate_variance(self, y): - """ - Calculate the variance of a set of target values. - - Parameters: - - y: Target values (numpy array). - - Returns: - - Variance of the target values. - """ - return np.var(y) - - def _split_dataset(self, X, y, feature_index, threshold): - """ - Split the dataset based on a feature and threshold. - - Parameters: - - X: Input features (numpy array). - - y: Target values (numpy array). - - feature_index: Index of the feature to split on. - - threshold: Threshold value for the split. 
- - Returns: - - Left and right subsets of the dataset. - """ - left_mask = X[:, feature_index] <= threshold - right_mask = ~left_mask - return X[left_mask], X[right_mask], y[left_mask], y[right_mask] - - def _find_best_split(self, X, y): - """ - Find the best split for the dataset. - - Parameters: - - X: Input features (numpy array). - - y: Target values (numpy array). - - Returns: - - Index of the best feature and the corresponding threshold. - """ - m, n = X.shape - best_feature_index = None - best_threshold = None - best_variance_reduction = 0 - - initial_variance = self._calculate_variance(y) - - for feature_index in range(n): - thresholds = np.unique(X[:, feature_index]) - - for threshold in thresholds: - # Split the dataset - _, _, y_left, y_right = self._split_dataset(X, y, feature_index, threshold) - - # Calculate variance reduction - left_weight = len(y_left) / m - right_weight = len(y_right) / m - variance_reduction = initial_variance - (left_weight * self._calculate_variance(y_left) + right_weight * self._calculate_variance(y_right)) - - # Update the best split if variance reduction is greater - if variance_reduction > best_variance_reduction: - best_feature_index = feature_index - best_threshold = threshold - best_variance_reduction = variance_reduction - - return best_feature_index, best_threshold - - def _build_tree(self, X, y, depth): - """ - Recursively build the decision tree. - - Parameters: - - X: Input features (numpy array). - - y: Target values (numpy array). - - depth: Current depth of the tree. - - Returns: - - Node of the decision tree. - """ - # Check if max depth is reached or if all target values are the same - if depth == self.max_depth or np.all(y == y[0]): - return {'value': np.mean(y)} - - # Find the best split - feature_index, threshold = self._find_best_split(X, y) - - if feature_index is not None: - # Split the dataset - X_left, X_right, y_left, y_right = self._split_dataset(X, y, feature_index, threshold) - - # Recursively build left and right subtrees - left_subtree = self._build_tree(X_left, y_left, depth + 1) - right_subtree = self._build_tree(X_right, y_right, depth + 1) - - return {'feature_index': feature_index, - 'threshold': threshold, - 'left': left_subtree, - 'right': right_subtree} - else: - # If no split is found, return a leaf node - return {'value': np.mean(y)} - - def fit(self, X, y): - """ - Fit the Decision Tree Regression model to the input data. - - Parameters: - - X: Input features (numpy array). - - y: Target values (numpy array). - """ - self.tree = self._build_tree(X, y, depth=0) - - def _predict_single(self, node, x): - """ - Recursively predict a single data point. - - Parameters: - - node: Current node in the decision tree. - - x: Input features for prediction. - - Returns: - - Predicted value. - """ - if 'value' in node: - return node['value'] - else: - if x[node['feature_index']] <= node['threshold']: - return self._predict_single(node['left'], x) - else: - return self._predict_single(node['right'], x) - - def predict(self, X): - """ - Make predictions on new data. - - Parameters: - - X: Input features for prediction (numpy array). - - Returns: - - Predicted values (numpy array). 
- """ - return np.array([self._predict_single(self.tree, x) for x in X]) -``` - -- decision_tree_regression_test.py file - -```py -import unittest -import numpy as np -from DecisionTreeRegressor import DecisionTreeRegression - -class TestDecisionTreeRegressor(unittest.TestCase): - - def setUp(self): - # Create sample data for testing - np.random.seed(42) - self.X_train = np.random.rand(100, 2) - self.y_train = 2 * self.X_train[:, 0] + 3 * self.X_train[:, 1] + np.random.normal(0, 0.1, 100) - - self.X_test = np.random.rand(10, 2) - - def test_fit_predict(self): - # Test if the model can be fitted and predictions are made - dt_model = DecisionTreeRegression(max_depth=3) - dt_model.fit(self.X_train, self.y_train) - - # Ensure predictions are made without errors - predictions = dt_model.predict(self.X_test) - - # Add your specific assertions based on the expected behavior of your model - self.assertIsInstance(predictions, np.ndarray) - self.assertEqual(predictions.shape, (10,)) - - # Add more test cases as needed - -if __name__ == '__main__': - unittest.main() -``` diff --git a/docs/algorithms/machine-learning/supervised/regressions/elastic-net.md b/docs/algorithms/machine-learning/supervised/regressions/elastic-net.md deleted file mode 100644 index fedfcf79..00000000 --- a/docs/algorithms/machine-learning/supervised/regressions/elastic-net.md +++ /dev/null @@ -1,92 +0,0 @@ -# Elastic Net Regression - -This module contains an implementation of Elastic Net Regression, a powerful linear regression technique that combines both L1 (Lasso) and L2 (Ridge) regularization. Elastic Net is particularly useful when dealing with high-dimensional datasets and can effectively handle correlated features. - -## Parameters - -- `alpha`: The regularization strength. A positive float value. -- `l1_ratio`: The ratio of L1 regularization to L2 regularization. Should be between 0 and 1. -- `max_iter`: The maximum number of iterations to run the optimization algorithm. -- `tol`: The tolerance for the optimization. If the updates are smaller than this value, the optimization will stop. 
- -## Scratch Code - -- elastic_net_regression.py file - -```py -import numpy as np - -class ElasticNetRegression: - def __init__(self, alpha=1.0, l1_ratio=0.5, max_iter=1000, tol=1e-4): - self.alpha = alpha - self.l1_ratio = l1_ratio - self.max_iter = max_iter - self.tol = tol - self.coef_ = None - self.intercept_ = None - - def fit(self, X, y): - n_samples, n_features = X.shape - self.coef_ = np.zeros(n_features) - self.intercept_ = 0 - learning_rate = 0.01 - - for iteration in range(self.max_iter): - y_pred = np.dot(X, self.coef_) + self.intercept_ - error = y - y_pred - - gradient_w = (-2 / n_samples) * (X.T.dot(error)) + self.alpha * (self.l1_ratio * np.sign(self.coef_) + (1 - self.l1_ratio) * 2 * self.coef_) - gradient_b = (-2 / n_samples) * np.sum(error) - - new_coef = self.coef_ - learning_rate * gradient_w - new_intercept = self.intercept_ - learning_rate * gradient_b - - if np.all(np.abs(new_coef - self.coef_) < self.tol) and np.abs(new_intercept - self.intercept_) < self.tol: - break - - self.coef_ = new_coef - self.intercept_ = new_intercept - - def predict(self, X): - return np.dot(X, self.coef_) + self.intercept_ -``` - -- elastic_net_regression_test.py file - -```py -import unittest -import numpy as np -from sklearn.linear_model import ElasticNet -from ElasticNetRegression import ElasticNetRegression - -class TestElasticNetRegression(unittest.TestCase): - - def test_elastic_net_regression(self): - np.random.seed(42) - X_train = np.random.rand(100, 1) * 10 - y_train = 2 * X_train.squeeze() + np.random.randn(100) * 2 - - X_test = np.array([[2.5], [5.0], [7.5]]) - - custom_model = ElasticNetRegression(alpha=1.0, l1_ratio=0.5) - custom_model.fit(X_train, y_train) - custom_predictions = custom_model.predict(X_test) - - sklearn_model = ElasticNet(alpha=1.0, l1_ratio=0.5, max_iter=1000, tol=1e-4) - sklearn_model.fit(X_train, y_train) - sklearn_predictions = sklearn_model.predict(X_test) - - np.testing.assert_allclose(custom_predictions, sklearn_predictions, rtol=1e-1) - - train_predictions_custom = custom_model.predict(X_train) - train_predictions_sklearn = sklearn_model.predict(X_train) - - custom_mse = np.mean((y_train - train_predictions_custom) ** 2) - sklearn_mse = np.mean((y_train - train_predictions_sklearn) ** 2) - - print(f"Custom Model MSE: {custom_mse}") - print(f"Scikit-learn Model MSE: {sklearn_mse}") - -if __name__ == '__main__': - unittest.main() -``` diff --git a/docs/algorithms/machine-learning/supervised/regressions/gradient-boosting.md b/docs/algorithms/machine-learning/supervised/regressions/gradient-boosting.md deleted file mode 100644 index 1f0415e0..00000000 --- a/docs/algorithms/machine-learning/supervised/regressions/gradient-boosting.md +++ /dev/null @@ -1,218 +0,0 @@ -# Gradient Boosting Regression - -This module contains an implementation of Gradient Boosting Regression, an ensemble learning method that combines multiple weak learners (typically decision trees) to create a more robust and accurate model for predicting continuous outcomes based on input features. - -## Parameters - -- `n_estimators`: Number of boosting stages (trees) to be run. -- `learning_rate`: Step size shrinkage to prevent overfitting. -- `max_depth`: Maximum depth of each decision tree. - -## Scratch Code - -- gradient_boosting_regression.py file - -```py -import numpy as np - -class GradientBoostingRegression: - def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3): - """ - Constructor for the GradientBoostingRegression class. 
- - Parameters: - - n_estimators: Number of trees in the ensemble. - - learning_rate: Step size for each tree's contribution. - - max_depth: Maximum depth of each decision tree. - """ - self.n_estimators = n_estimators - self.learning_rate = learning_rate - self.max_depth = max_depth - self.trees = [] - - def fit(self, X, y): - """ - Fit the gradient boosting regression model to the input data. - - Parameters: - - X: Input features (numpy array). - - y: Target values (numpy array). - """ - # Initialize predictions with the mean of the target values - predictions = np.mean(y) * np.ones_like(y) - - for _ in range(self.n_estimators): - # Compute residuals - residuals = y - predictions - - # Fit a decision tree to the residuals - tree = self._fit_tree(X, residuals, depth=0) - - # Update predictions using the tree's contribution scaled by the learning rate - predictions += self.learning_rate * self._predict_tree(X, tree) - - # Save the tree in the ensemble - self.trees.append(tree) - - def _fit_tree(self, X, y, depth): - """ - Fit a decision tree to the input data. - - Parameters: - - X: Input features (numpy array). - - y: Target values (numpy array). - - depth: Current depth of the tree. - - Returns: - - Tree structure (dictionary). - """ - if depth == self.max_depth: - # If the maximum depth is reached, return the mean of the target values - return np.mean(y) - - # Find the best split point - feature_index, threshold = self._find_best_split(X, y) - - if feature_index is None: - # If no split improves the purity, return the mean of the target values - return np.mean(y) - - # Split the data - mask = X[:, feature_index] < threshold - left_tree = self._fit_tree(X[mask], y[mask], depth + 1) - right_tree = self._fit_tree(X[~mask], y[~mask], depth + 1) - - # Return the tree structure - return {'feature_index': feature_index, 'threshold': threshold, - 'left_tree': left_tree, 'right_tree': right_tree} - - def _find_best_split(self, X, y): - """ - Find the best split point for a decision tree. - - Parameters: - - X: Input features (numpy array). - - y: Target values (numpy array). - - Returns: - - Best feature index and threshold for the split. 
- """ - m, n = X.shape - if m <= 1: - return None, None # No split is possible - - # Calculate the initial impurity - initial_impurity = self._calculate_impurity(y) - - # Initialize variables to store the best split parameters - best_feature_index, best_threshold, best_impurity_reduction = None, None, 0 - - for feature_index in range(n): - # Sort the feature values and corresponding target values - sorted_indices = np.argsort(X[:, feature_index]) - sorted_X = X[sorted_indices, feature_index] - sorted_y = y[sorted_indices] - - # Initialize variables to keep track of impurity and counts for the left and right nodes - left_impurity, left_count = 0, 0 - right_impurity, right_count = initial_impurity, m - - for i in range(1, m): - # Update impurity and counts for the left and right nodes - value = sorted_X[i] - left_impurity += (i / m) * self._calculate_impurity(sorted_y[i-1:i+1]) - left_count += 1 - right_impurity -= ((i-1) / m) * self._calculate_impurity(sorted_y[i-1:i+1]) - right_count -= 1 - - # Calculate impurity reduction - impurity_reduction = initial_impurity - (left_count / m * left_impurity + right_count / m * right_impurity) - - # Check if this is the best split so far - if impurity_reduction > best_impurity_reduction: - best_feature_index = feature_index - best_threshold = value - best_impurity_reduction = impurity_reduction - - return best_feature_index, best_threshold - - def _calculate_impurity(self, y): - """ - Calculate the impurity of a node. - - Parameters: - - y: Target values (numpy array). - - Returns: - - Impurity. - """ - # For regression, impurity is the variance of the target values - return np.var(y) - - def _predict_tree(self, X, tree): - """ - Make predictions using a decision tree. - - Parameters: - - X: Input features (numpy array). - - tree: Tree structure (dictionary). - - Returns: - - Predicted values (numpy array). - """ - if 'feature_index' not in tree: - # If the node is a leaf, return the constant value - return tree - else: - # Recursively traverse the tree - mask = X[:, tree['feature_index']] < tree['threshold'] - return np.where(mask, self._predict_tree(X, tree['left_tree']), self._predict_tree(X, tree['right_tree'])) - - def predict(self, X): - """ - Make predictions on new data using the Gradient Boosting Regression. - - Parameters: - - X: Input features for prediction (numpy array). - - Returns: - - Predicted values (numpy array). 
- """ - predictions = np.sum(self.learning_rate * self._predict_tree(X, tree) for tree in self.trees) - return predictions -``` - -- gradient_boosting_regression_test.py file - -```py -import unittest -import numpy as np -from GradientBoostingRegressor import GradientBoostingRegression - -class TestGradientBoostingRegressor(unittest.TestCase): - - def setUp(self): - # Create sample data for testing - np.random.seed(42) - self.X_train = np.random.rand(100, 2) - self.y_train = 2 * self.X_train[:, 0] + 3 * self.X_train[:, 1] + np.random.normal(0, 0.1, 100) - - self.X_test = np.random.rand(10, 2) - - def test_fit_predict(self): - # Test if the model can be fitted and predictions are made - gbr_model = GradientBoostingRegression(n_estimators=5, learning_rate=0.1, max_depth=3) - gbr_model.fit(self.X_train, self.y_train) - - # Ensure predictions are made without errors - predictions = gbr_model.predict(self.X_test) - - # Add your specific assertions based on the expected behavior of your model - self.assertIsInstance(predictions, np.ndarray) - self.assertEqual(predictions.shape, (10,)) - - # Add more test cases as needed - -if __name__ == '__main__': - unittest.main() -``` diff --git a/docs/algorithms/machine-learning/supervised/regressions/huber.md b/docs/algorithms/machine-learning/supervised/regressions/huber.md deleted file mode 100644 index b7d816fe..00000000 --- a/docs/algorithms/machine-learning/supervised/regressions/huber.md +++ /dev/null @@ -1,98 +0,0 @@ -# Huber Regression - -This module contains an implementation of Huber Regression, a robust linear regression technique that combines the properties of both least squares and absolute error loss functions. Huber Regression is particularly useful when dealing with datasets that have outliers, as it is less sensitive to outliers compared to standard linear regression. - -## Overview - -Huber Regression is a regression algorithm that adds a penalty based on the Huber loss function. This loss function is quadratic for small errors and linear for large errors, providing robustness against outliers. - -## Parameters - -- `alpha`: The regularization strength. A positive float value. -- `epsilon`: The threshold for the Huber loss function. A positive float value. -- `max_iter`: The maximum number of iterations to run the optimization algorithm. -- `tol`: The tolerance for the optimization. If the updates are smaller than this value, the optimization will stop. 
- -## Scratch Code - -- huber_regression.py file - -```py -import numpy as np - -class HuberRegression: - def __init__(self, alpha=1.0, epsilon=1.35, max_iter=1000, tol=1e-4): - self.alpha = alpha - self.epsilon = epsilon - self.max_iter = max_iter - self.tol = tol - self.coef_ = None - self.intercept_ = None - - def fit(self, X, y): - n_samples, n_features = X.shape - self.coef_ = np.zeros(n_features) - self.intercept_ = 0 - learning_rate = 0.01 - - for iteration in range(self.max_iter): - y_pred = np.dot(X, self.coef_) + self.intercept_ - error = y - y_pred - - # Compute Huber gradient - mask = np.abs(error) <= self.epsilon - gradient_w = (-2 / n_samples) * (X.T.dot(error * mask) + self.epsilon * np.sign(error) * (~mask)) + self.alpha * self.coef_ - gradient_b = (-2 / n_samples) * (np.sum(error * mask) + self.epsilon * np.sign(error) * (~mask)) - - new_coef = self.coef_ - learning_rate * gradient_w - new_intercept = self.intercept_ - learning_rate * gradient_b - - if np.all(np.abs(new_coef - self.coef_) < self.tol) and np.abs(new_intercept - self.intercept_) < self.tol: - break - - self.coef_ = new_coef - self.intercept_ = new_intercept - - def predict(self, X): - return np.dot(X, self.coef_) + self.intercept_ -``` - -- huber_regression_test.py file - -```py -import unittest -import numpy as np -from sklearn.linear_model import HuberRegressor -from HuberRegression import HuberRegression - -class TestHuberRegression(unittest.TestCase): - - def test_huber_regression(self): - np.random.seed(42) - X_train = np.random.rand(100, 1) * 10 - y_train = 2 * X_train.squeeze() + np.random.randn(100) * 2 - - X_test = np.array([[2.5], [5.0], [7.5]]) - - huber_model = HuberRegression(alpha=1.0, epsilon=1.35) - huber_model.fit(X_train, y_train) - huber_predictions = huber_model.predict(X_test) - - sklearn_model = HuberRegressor(alpha=1.0, epsilon=1.35, max_iter=1000, tol=1e-4) - sklearn_model.fit(X_train, y_train) - sklearn_predictions = sklearn_model.predict(X_test) - - np.testing.assert_allclose(huber_predictions, sklearn_predictions, rtol=1e-1) - - train_predictions_huber = huber_model.predict(X_train) - train_predictions_sklearn = sklearn_model.predict(X_train) - - huber_mse = np.mean((y_train - train_predictions_huber) ** 2) - sklearn_mse = np.mean((y_train - train_predictions_sklearn) ** 2) - - print(f"Huber Model MSE: {huber_mse}") - print(f"Scikit-learn Model MSE: {sklearn_mse}") - -if __name__ == '__main__': - unittest.main() -``` diff --git a/docs/algorithms/machine-learning/supervised/regressions/index.md b/docs/algorithms/machine-learning/supervised/regressions/index.md deleted file mode 100644 index 531275ad..00000000 --- a/docs/algorithms/machine-learning/supervised/regressions/index.md +++ /dev/null @@ -1,166 +0,0 @@ -# Regression Algorithms ๐Ÿค– - -
-
-| Algorithm | Description | 📅 Date | ⏱️ Read Time |
-|-----------|-------------|---------|--------------|
-| AdaBoost Regression | Iteratively correcting errors to improve accuracy | 2025-01-27 | 2 mins |
-| Bayesian Regression | Infusing uncertainty with predictions for smarter decision-making | 2025-01-19 | 3 mins |
-| Decision Tree Regression | Making decisions based on feature values to predict outcomes in a clear, interpretable way | 2025-01-27 | 4 mins |
-| Elastic Net Regression | Balancing feature selection and regularization for optimal prediction | 2025-01-27 | 2 mins |
-| Gradient Boosting Regression | Builds strong models by correcting weak learners | 2025-01-27 | 2 mins |
-| Huber Regression | Balances squared and absolute loss for robustness | 2025-01-27 | 2 mins |
-| K Nearest Neighbors Regression | KNN predicts by averaging the nearest neighbors | 2025-01-27 | 3 mins |
-| Lasso Regression | Lasso shrinks coefficients, promoting sparsity | 2025-01-27 | 2 mins |
-| Linear Regression | Understanding the relationship between two variables | 2025-01-19 | 2 mins |
-| Logistic Regression | Classifying data into discrete categories | 2025-01-19 | 2 mins |
-| Neural Network Regression | Neural Networks model complex, non-linear relationships | 2025-01-27 | 3 mins |
-| Polynomial Regression | Captures non-linear trends with higher-degree terms | 2025-01-27 | 2 mins |
-| Random Forest Regression | Random Forest aggregates many decision trees for better accuracy | 2025-01-27 | 3 mins |
-| Ridge Regression | Ridge applies regularization to prevent overfitting | 2025-01-27 | 2 mins |
-| Support Vector Regression | Finds the optimal line with a balance between margin and accuracy | 2025-01-27 | 2 mins |
-| XGBoost Regression | Improves predictions using gradient boosting and regularization | 2025-01-27 | 2 mins |
-
diff --git a/docs/algorithms/machine-learning/supervised/regressions/k-nearest-neighbors.md b/docs/algorithms/machine-learning/supervised/regressions/k-nearest-neighbors.md deleted file mode 100644 index 46719a55..00000000 --- a/docs/algorithms/machine-learning/supervised/regressions/k-nearest-neighbors.md +++ /dev/null @@ -1,94 +0,0 @@ -# K Nearest Neighbors Regression - -This module contains an implementation of K-Nearest Neighbors Regression, a simple yet effective algorithm for predicting continuous outcomes based on input features. - -## Parameters - -- `k`: Number of neighbors to consider for prediction. - -## Scratch Code - -- k_nearest_neighbors_regression.py file - -```py -import numpy as np - -class KNNRegression: - def __init__(self, k=5): - """ - Constructor for the KNNRegression class. - - Parameters: - - k: Number of neighbors to consider. - """ - self.k = k - self.X_train = None - self.y_train = None - - def fit(self, X, y): - """ - Fit the KNN model to the input data. - - Parameters: - - X: Input features (numpy array). - - y: Target values (numpy array). - """ - self.X_train = X - self.y_train = y - - def predict(self, X): - """ - Make predictions on new data. - - Parameters: - - X: Input features for prediction (numpy array). - - Returns: - - Predicted values (numpy array). - """ - predictions = [] - for x in X: - # Calculate Euclidean distances between the input point and all training points - distances = np.linalg.norm(self.X_train - x, axis=1) - - # Get indices of k-nearest neighbors - indices = np.argsort(distances)[:self.k] - - # Average the target values of k-nearest neighbors - predicted_value = np.mean(self.y_train[indices]) - predictions.append(predicted_value) - - return np.array(predictions) -``` - -- k_nearest_neighbors_regression_test.py file - -```py -import unittest -import numpy as np -from KNearestNeighborsRegression import KNNRegression - -class TestKNNRegression(unittest.TestCase): - - def test_knn_regression(self): - # Create synthetic data - np.random.seed(42) - X_train = np.random.rand(100, 1) * 10 - y_train = 2 * X_train.squeeze() + np.random.randn(100) * 2 # Linear relationship with noise - - X_test = np.array([[2.5], [5.0], [7.5]]) - - # Initialize and fit the KNN Regression model - knn_model = KNNRegression(k=3) - knn_model.fit(X_train, y_train) - - # Test predictions - predictions = knn_model.predict(X_test) - expected_predictions = [2 * 2.5, 2 * 5.0, 2 * 7.5] # Assuming a linear relationship - - # Check if predictions are close to the expected values - np.testing.assert_allclose(predictions, expected_predictions, rtol=1e-5) - -if __name__ == '__main__': - unittest.main() -``` diff --git a/docs/algorithms/machine-learning/supervised/regressions/lasso.md b/docs/algorithms/machine-learning/supervised/regressions/lasso.md deleted file mode 100644 index 0bd4b264..00000000 --- a/docs/algorithms/machine-learning/supervised/regressions/lasso.md +++ /dev/null @@ -1,100 +0,0 @@ -# Lasso Regression - -This module contains an implementation of Lasso Regression, a linear regression technique with L1 regularization. - -## Overview - -Lasso Regression is a regression algorithm that adds a penalty term based on the absolute values of the coefficients. This penalty term helps in feature selection by driving some of the coefficients to exactly zero, effectively ignoring certain features. - -## Parameters - -- `learning_rate`: The step size for gradient descent. -- `lambda_param`: Regularization strength (L1 penalty). 
-- `n_iterations`: The number of iterations for gradient descent. - -## Scratch Code - -- lasso_regression.py file - -```py -import numpy as np - -class LassoRegression: - def __init__(self, learning_rate=0.01, lambda_param=0.01, n_iterations=1000): - """ - Constructor for the LassoRegression class. - - Parameters: - - learning_rate: The step size for gradient descent. - - lambda_param: Regularization strength. - - n_iterations: The number of iterations for gradient descent. - """ - self.learning_rate = learning_rate - self.lambda_param = lambda_param - self.n_iterations = n_iterations - self.weights = None - self.bias = None - - def fit(self, X, y): - """ - Fit the Lasso Regression model to the input data. - - Parameters: - - X: Input features (numpy array). - - y: Target values (numpy array). - """ - # Initialize weights and bias - num_samples, num_features = X.shape - self.weights = np.zeros(num_features) - self.bias = 0 - - # Perform gradient descent - for _ in range(self.n_iterations): - predictions = np.dot(X, self.weights) + self.bias - errors = y - predictions - - # Update weights and bias - self.weights += self.learning_rate * (1/num_samples) * (np.dot(X.T, errors) - self.lambda_param * np.sign(self.weights)) - self.bias += self.learning_rate * (1/num_samples) * np.sum(errors) - - def predict(self, X): - """ - Make predictions on new data. - - Parameters: - - X: Input features for prediction (numpy array). - - Returns: - - Predicted values (numpy array). - """ - return np.dot(X, self.weights) + self.bias -``` - -- lasso_regression_test.py file - -```py -import unittest -import numpy as np -from LassoRegression import LassoRegression - -class TestLassoRegression(unittest.TestCase): - def setUp(self): - # Create a sample dataset - np.random.seed(42) - self.X_train = np.random.rand(100, 2) - self.y_train = 3 * self.X_train[:, 0] + 2 * self.X_train[:, 1] + np.random.randn(100) - - self.X_test = np.random.rand(10, 2) - - def test_fit_predict(self): - # Test the fit and predict methods - model = LassoRegression(learning_rate=0.01, lambda_param=0.1, n_iterations=1000) - model.fit(self.X_train, self.y_train) - predictions = model.predict(self.X_test) - - # Ensure predictions are of the correct shape - self.assertEqual(predictions.shape, (10,)) - -if __name__ == '__main__': - unittest.main() -``` diff --git a/docs/algorithms/machine-learning/supervised/regressions/linear.md b/docs/algorithms/machine-learning/supervised/regressions/linear.md deleted file mode 100644 index b9ebf77a..00000000 --- a/docs/algorithms/machine-learning/supervised/regressions/linear.md +++ /dev/null @@ -1,115 +0,0 @@ -# Linear Regression - -This module contains an implementation of the Linear Regression algorithm, a fundamental technique in machine learning for predicting a continuous outcome based on input features. - -## Parameters - -- `learning_rate`: The step size for gradient descent. -- `n_iterations`: The number of iterations for gradient descent. - -## Scratch Code - -- linear_regression.py file - -```py -import numpy as np - -# Linear regression implementation -class LinearRegression: - def __init__(self, learning_rate=0.01, n_iterations=1000): - """ - Constructor for the LinearRegression class. - - Parameters: - - learning_rate: The step size for gradient descent. - - n_iterations: The number of iterations for gradient descent. - - n_iterations: n_epochs. 
- """ - self.learning_rate = learning_rate - self.n_iterations = n_iterations - self.weights = None - self.bias = None - - def fit(self, X, y): - """ - Fit the linear regression model to the input data. - - Parameters: - - X: Input features (numpy array). - - y: Target values (numpy array). - """ - # Initialize weights and bias - self.weights = np.zeros((X.shape[1], 1)) - self.bias = 0 - - # Gradient Descent - for _ in range(self.n_iterations): - # Compute predictions - predictions = np.dot(X, self.weights) + self.bias - - # Calculate errors - errors = predictions - y - - # Update weights and bias - self.weights -= self.learning_rate * (1 / len(X)) * np.dot(X.T, errors) - self.bias -= self.learning_rate * (1 / len(X)) * np.sum(errors) - - def predict(self, X): - """ - Make predictions on new data. - - Parameters: - - X: Input features for prediction (numpy array). - - Returns: - - Predicted values (numpy array). - """ - return np.dot(X, self.weights) + self.bias -``` - -- linear_regression_test.py file - -```py -import unittest -import numpy as np -from LinearRegression import LinearRegression - -class TestLinearRegression(unittest.TestCase): - - def setUp(self): - # Set up some common data for testing - np.random.seed(42) - self.X_train = 2 * np.random.rand(100, 1) - self.y_train = 4 + 3 * self.X_train + np.random.randn(100, 1) - - self.X_test = 2 * np.random.rand(20, 1) - self.y_test = 4 + 3 * self.X_test + np.random.randn(20, 1) - - def test_fit_predict(self): - # Test the fit and predict methods - - # Create a LinearRegression model - lr_model = LinearRegression() - - # Fit the model to the training data - lr_model.fit(self.X_train, self.y_train) - - # Make predictions on the test data - predictions = lr_model.predict(self.X_test) - - # Check that the predictions are of the correct shape - self.assertEqual(predictions.shape, self.y_test.shape) - - def test_predict_with_unfitted_model(self): - # Test predicting with an unfitted model - - # Create a LinearRegression model (not fitted) - lr_model = LinearRegression() - - # Attempt to make predictions without fitting the model - with self.assertRaises(ValueError): - _ = lr_model.predict(self.X_test) - -if __name__ == '__main__': - unittest.main() -``` diff --git a/docs/algorithms/machine-learning/supervised/regressions/logistic.md b/docs/algorithms/machine-learning/supervised/regressions/logistic.md deleted file mode 100644 index f39f703f..00000000 --- a/docs/algorithms/machine-learning/supervised/regressions/logistic.md +++ /dev/null @@ -1,174 +0,0 @@ -# ๐Ÿงฎ Logistic Regression Algorithm - -
-*Logistic Regression Poster*
- -## ๐ŸŽฏ Objective -Logistic Regression is a supervised learning algorithm used for classification tasks. It predicts the probability of a data point belonging to a particular class, mapping the input to a value between 0 and 1 using a logistic (sigmoid) function. - -## ๐Ÿ“š Prerequisites -- Basic understanding of Linear Algebra and Probability. -- Familiarity with the concept of classification. -- Libraries: NumPy, Pandas, Matplotlib, Scikit-learn. - ---- - -## ๐Ÿงฉ Inputs -- *Input Dataset*: A structured dataset with features (independent variables) and corresponding labels (dependent variable). -- The dependent variable should be categorical (binary or multiclass). -- Example: A CSV file with columns like `age`, `income`, and `purchased` (label). - - -## ๐Ÿ“ค Outputs -- *Predicted Class*: The output is the probability of each data point belonging to a class. -- *Binary Classification*: Outputs 0 or 1 (e.g., Yes or No). -- *Multiclass Classification*: Outputs probabilities for multiple categories. - ---- - -## ๐Ÿ›๏ธ Algorithm Architecture - -### 1. Hypothesis Function -The hypothesis function of Logistic Regression applies the sigmoid function: - -\[ -h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}} -\] - ---- - -### 2. Cost Function -The cost function used in Logistic Regression is the log-loss (or binary cross-entropy): - -\[ -J(\theta) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] -\] - ---- - -### 3. Gradient Descent -The parameters of the logistic regression model are updated using the gradient descent algorithm: - -\[ -\theta := \theta - \alpha \frac{\partial J(\theta)}{\partial \theta} -\] - ---- - -## ๐Ÿ‹๏ธโ€โ™‚๏ธ Training Process -- **Model**: Logistic Regression model from sklearn. - -- **Validation Strategy**: A separate portion of the dataset can be reserved for validation (e.g., 20%), but this is not explicitly implemented in the current code. - -- **Training Data**: The model is trained on the entire provided dataset. - - - ---- - -## ๐Ÿ“Š Evaluation Metrics -- Accuracy is used to evaluate the classification performance of the model. - -\[ -\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} -\] - -Where: - -- **TP**: True Positives -- **TN**: True Negatives -- **FP**: False Positives -- **FN**: False Negatives - ---- - -## ๐Ÿ’ป Code Implementation - -```python -import numpy as np -from sklearn.linear_model import LogisticRegression -from sklearn.metrics import accuracy_score - -# Generate Example Dataset -np.random.seed(42) -X = np.random.rand(100, 2) # Features -y = (X[:, 0] + X[:, 1] > 1).astype(int) # Labels: 0 or 1 based on sum of features - -# Train Logistic Regression Model -model = LogisticRegression() -model.fit(X, y) - -# Predictions -y_pred = model.predict(X) -accuracy = accuracy_score(y, y_pred) - -# Output Accuracy -print("Accuracy:", accuracy) -``` - -## ๐Ÿ” Scratch Code Explanation -1. **Dataset Generation**: - - - A random dataset with 100 samples and 2 features is created. - - - Labels (`y`) are binary, determined by whether the sum of feature values is greater than 1. - -2. **Model Training**: - - The `LogisticRegression` model from `sklearn` is initialized and trained on the dataset using the fit method. - -3. **Predictions**: - - - The model predicts the labels for the input data (`X`) using the `predict` method. - - - The `accuracy_score` function evaluates the accuracy of the predictions. - -4. **Output**: - - - The calculated accuracy is printed to the console. 
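-
-Because the fitted model exposes class probabilities through `predict_proba`, the decision threshold does not have to stay at the default 0.5. The snippet below is a minimal, self-contained sketch (with its own synthetic data, mirroring the snippet above; the 0.7 threshold is purely illustrative) of how a stricter cut-off could be applied:
-
-```python
-import numpy as np
-from sklearn.linear_model import LogisticRegression
-
-# Synthetic data, analogous to the example above
-np.random.seed(42)
-X = np.random.rand(200, 2)
-y = (X[:, 0] + X[:, 1] > 1).astype(int)
-
-model = LogisticRegression()
-model.fit(X, y)
-
-# Probability of the positive class for each sample
-probs = model.predict_proba(X)[:, 1]
-
-# Apply a stricter threshold of 0.7 instead of the default 0.5
-preds_at_070 = (probs >= 0.70).astype(int)
-
-print("Positives at default 0.5 threshold:", int(model.predict(X).sum()))
-print("Positives at 0.7 threshold:", int(preds_at_070.sum()))
-```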
- - -### ๐Ÿ› ๏ธ Example Usage: Predicting Customer Retention - -```python -# Example Data: Features (e.g., hours spent on platform, number of purchases) -X = np.array([[5.0, 20.0], [2.0, 10.0], [8.0, 50.0], [1.0, 5.0]]) # Features -y = np.array([1, 0, 1, 0]) # Labels: 1 (retained), 0 (not retained) - -# Train Logistic Regression Model -model = LogisticRegression() -model.fit(X, y) - -# Predict Retention for New Customers -X_new = np.array([[3.0, 15.0], [7.0, 30.0]]) -y_pred = model.predict(X_new) - -print("Predicted Retention (1 = Retained, 0 = Not Retained):", y_pred) -``` - -- This demonstrates how Logistic Regression can be applied to predict customer retention based on behavioral data, showcasing its practicality for real-world binary classification tasks. - - - ---- - -## ๐ŸŒŸ Advantages - - Simple and efficient for binary classification problems. - - - Outputs probabilities, allowing flexibility in decision thresholds. - - - Easily extendable to multiclass classification using the one-vs-rest (OvR) or multinomial approach. - -## โš ๏ธ Limitations - -- Assumes a linear relationship between features and log-odds of the target. - -- Not effective when features are highly correlated or when there is a non-linear relationship. - -## ๐Ÿš€ Application - -=== "Application 1" - **Medical Diagnosis**: Predicting the likelihood of a disease based on patient features. - - -=== "Application 2" - **Marketing**: Determining whether a customer will purchase a product based on demographic and behavioral data. - diff --git a/docs/algorithms/machine-learning/supervised/regressions/neural-network.md b/docs/algorithms/machine-learning/supervised/regressions/neural-network.md deleted file mode 100644 index 88139456..00000000 --- a/docs/algorithms/machine-learning/supervised/regressions/neural-network.md +++ /dev/null @@ -1,128 +0,0 @@ -# Neural Network Regression - -This module contains an implementation of Neural Network Regression, a powerful algorithm for predicting continuous outcomes based on input features. - -## Parameters - -- `input_size`: Number of features in the input data. -- `hidden_size`: Number of neurons in the hidden layer. -- `output_size`: Number of output neurons. -- `learning_rate`: Step size for updating weights during training. -- `n_iterations`: Number of iterations for training the neural network. - -## Scratch Code - -- neural_network_regression.py file - -```py -import numpy as np - -class NeuralNetworkRegression: - def __init__(self, input_size, hidden_size, output_size, learning_rate=0.01, n_iterations=1000): - """ - Constructor for the NeuralNetworkRegression class. - - Parameters: - - input_size: Number of input features. - - hidden_size: Number of neurons in the hidden layer. - - output_size: Number of output neurons. - - learning_rate: Step size for gradient descent. - - n_iterations: Number of iterations for gradient descent. 
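-        Note: both the hidden and output layers use a sigmoid activation, so predictions lie in (0, 1) and target values are expected to be scaled to that range.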
- """ - self.input_size = input_size - self.hidden_size = hidden_size - self.output_size = output_size - self.learning_rate = learning_rate - self.n_iterations = n_iterations - - # Initialize weights and biases - self.weights_input_hidden = np.random.rand(self.input_size, self.hidden_size) - self.bias_hidden = np.zeros((1, self.hidden_size)) - self.weights_hidden_output = np.random.rand(self.hidden_size, self.output_size) - self.bias_output = np.zeros((1, self.output_size)) - - def sigmoid(self, x): - """Sigmoid activation function.""" - return 1 / (1 + np.exp(-x)) - - def sigmoid_derivative(self, x): - """Derivative of the sigmoid function.""" - return x * (1 - x) - - def fit(self, X, y): - """ - Fit the Neural Network model to the input data. - - Parameters: - - X: Input features (numpy array). - - y: Target values (numpy array). - """ - for _ in range(self.n_iterations): - # Forward pass - hidden_layer_input = np.dot(X, self.weights_input_hidden) + self.bias_hidden - hidden_layer_output = self.sigmoid(hidden_layer_input) - - output_layer_input = np.dot(hidden_layer_output, self.weights_hidden_output) + self.bias_output - predicted_output = self.sigmoid(output_layer_input) - - # Backpropagation - error = y - predicted_output - output_delta = error * self.sigmoid_derivative(predicted_output) - - hidden_layer_error = output_delta.dot(self.weights_hidden_output.T) - hidden_layer_delta = hidden_layer_error * self.sigmoid_derivative(hidden_layer_output) - - # Update weights and biases - self.weights_hidden_output += hidden_layer_output.T.dot(output_delta) * self.learning_rate - self.bias_output += np.sum(output_delta, axis=0, keepdims=True) * self.learning_rate - - self.weights_input_hidden += X.T.dot(hidden_layer_delta) * self.learning_rate - self.bias_hidden += np.sum(hidden_layer_delta, axis=0, keepdims=True) * self.learning_rate - - def predict(self, X): - """ - Make predictions on new data. - - Parameters: - - X: Input features for prediction (numpy array). - - Returns: - - Predicted values (numpy array). 
- """ - hidden_layer_input = np.dot(X, self.weights_input_hidden) + self.bias_hidden - hidden_layer_output = self.sigmoid(hidden_layer_input) - - output_layer_input = np.dot(hidden_layer_output, self.weights_hidden_output) + self.bias_output - predicted_output = self.sigmoid(output_layer_input) - - return predicted_output -``` - -- neural_network_regression_test.py file - -```py -import numpy as np -import unittest -from NeuralNetworkRegression import NeuralNetworkRegression - -class TestNeuralNetworkRegression(unittest.TestCase): - def setUp(self): - # Generate synthetic data for testing - np.random.seed(42) - self.X_train = np.random.rand(100, 3) - self.y_train = np.random.rand(100, 1) - - self.X_test = np.random.rand(10, 3) - - def test_fit_predict(self): - # Initialize and fit the model - model = NeuralNetworkRegression(input_size=3, hidden_size=4, output_size=1, learning_rate=0.01, n_iterations=1000) - model.fit(self.X_train, self.y_train) - - # Ensure predictions have the correct shape - predictions = model.predict(self.X_test) - self.assertEqual(predictions.shape, (10, 1)) - -if __name__ == '__main__': - unittest.main() -``` diff --git a/docs/algorithms/machine-learning/supervised/regressions/polynomial.md b/docs/algorithms/machine-learning/supervised/regressions/polynomial.md deleted file mode 100644 index 2005b539..00000000 --- a/docs/algorithms/machine-learning/supervised/regressions/polynomial.md +++ /dev/null @@ -1,114 +0,0 @@ -# Polynomial Regression - -This module contains an implementation of Polynomial Regression, an extension of Linear Regression that models the relationship between the independent variable and the dependent variable as a polynomial. - -## Parameters - -- `degree`: Degree of the polynomial. -- `learning_rate`: The step size for gradient descent. -- `n_iterations`: The number of iterations for gradient descent. - -## Scratch Code - -- polynomial_regression.py file - -```py -import numpy as np - -# Polynomial regression implementation -class PolynomialRegression: - def __init__(self, degree=2, learning_rate=0.01, n_iterations=1000): - """ - Constructor for the PolynomialRegression class. - - Parameters: - - degree: Degree of the polynomial. - - learning_rate: The step size for gradient descent. - - n_iterations: The number of iterations for gradient descent. - """ - self.degree = degree - self.learning_rate = learning_rate - self.n_iterations = n_iterations - self.weights = None - self.bias = None - - def _polynomial_features(self, X): - """ - Create polynomial features up to the specified degree. - - Parameters: - - X: Input features (numpy array). - - Returns: - - Polynomial features (numpy array). - """ - return np.column_stack([X ** i for i in range(1, self.degree + 1)]) - - def fit(self, X, y): - """ - Fit the polynomial regression model to the input data. - - Parameters: - - X: Input features (numpy array). - - y: Target values (numpy array). - """ - X_poly = self._polynomial_features(X) - self.weights = np.zeros((X_poly.shape[1], 1)) - self.bias = 0 - - for _ in range(self.n_iterations): - predictions = np.dot(X_poly, self.weights) + self.bias - errors = predictions - y - - self.weights -= self.learning_rate * (1 / len(X_poly)) * np.dot(X_poly.T, errors) - self.bias -= self.learning_rate * (1 / len(X_poly)) * np.sum(errors) - - def predict(self, X): - """ - Make predictions on new data. - - Parameters: - - X: Input features for prediction (numpy array). - - Returns: - - Predicted values (numpy array). 
- """ - X_poly = self._polynomial_features(X) - return np.dot(X_poly, self.weights) + self.bias -``` - -- polynomial_regression_test.py file - -```py -import unittest -import numpy as np -from PolynomialRegression import PolynomialFeatures - -class TestPolynomialRegression(unittest.TestCase): - - def setUp(self): - # Create synthetic data for testing - np.random.seed(42) - self.X_train = 2 * np.random.rand(100, 1) - self.y_train = 4 + 3 * self.X_train + np.random.randn(100, 1) - - def test_fit_predict(self): - # Test the fit and predict methods - poly_model = PolynomialFeatures(degree=2) - poly_model.fit(self.X_train, self.y_train) - - # Create test data - X_test = np.array([[1.5], [2.0]]) - - # Make predictions - predictions = poly_model.predict(X_test) - - # Assert that the predictions are NumPy arrays - self.assertTrue(isinstance(predictions, np.ndarray)) - - # Assert that the shape of predictions is as expected - self.assertEqual(predictions.shape, (X_test.shape[0], 1)) - -if __name__ == '__main__': - unittest.main() -``` diff --git a/docs/algorithms/machine-learning/supervised/regressions/random-forest.md b/docs/algorithms/machine-learning/supervised/regressions/random-forest.md deleted file mode 100644 index 116b3092..00000000 --- a/docs/algorithms/machine-learning/supervised/regressions/random-forest.md +++ /dev/null @@ -1,244 +0,0 @@ -# Random Forest Regression - -This module contains an implementation of Random Forest Regression, an ensemble learning method that combines multiple decision trees to create a more robust and accurate model for predicting continuous outcomes based on input features. - -## Parameters - -- `n_trees`: Number of trees in the random forest. -- `max_depth`: Maximum depth of each decision tree. -- `max_features`: Maximum number of features to consider for each split. - -## Scratch Code - -- random_forest_regression.py file - -```py -import numpy as np - -class RandomForestRegression: - - def __init__(self, n_trees=100, max_depth=None, max_features=None): - """ - Constructor for the RandomForestRegression class. - - Parameters: - - n_trees: Number of trees in the random forest. - - max_depth: Maximum depth of each decision tree. - - max_features: Maximum number of features to consider for each split. - """ - self.n_trees = n_trees - self.max_depth = max_depth - self.max_features = max_features - self.trees = [] - - def _bootstrap_sample(self, X, y): - """ - Create a bootstrap sample of the dataset. - - Parameters: - - X: Input features (numpy array). - - y: Target values (numpy array). - - Returns: - - Bootstrap sample of X and y. - """ - indices = np.random.choice(len(X), len(X), replace=True) - return X[indices], y[indices] - - def _build_tree(self, X, y, depth): - """ - Recursively build a decision tree. - - Parameters: - - X: Input features (numpy array). - - y: Target values (numpy array). - - depth: Current depth of the tree. - - Returns: - - Node of the decision tree. 
- """ - if depth == self.max_depth or np.all(y == y[0]): - return {'value': np.mean(y)} - - n_features = X.shape[1] - if self.max_features is None: - subset_features = np.arange(n_features) - else: - subset_features = np.random.choice(n_features, self.max_features, replace=False) - - # Create a random subset of features for this tree - X_subset = X[:, subset_features] - - # Create a bootstrap sample - X_bootstrap, y_bootstrap = self._bootstrap_sample(X_subset, y) - - # Find the best split using the selected subset of features - feature_index, threshold = self._find_best_split(X_bootstrap, y_bootstrap, subset_features) - - if feature_index is not None: - # Split the dataset - X_left, X_right, y_left, y_right = self._split_dataset(X, y, feature_index, threshold) - - # Recursively build left and right subtrees - left_subtree = self._build_tree(X_left, y_left, depth + 1) - right_subtree = self._build_tree(X_right, y_right, depth + 1) - - return {'feature_index': feature_index, - 'threshold': threshold, - 'left': left_subtree, - 'right': right_subtree} - else: - # If no split is found, return a leaf node - return {'value': np.mean(y)} - - def _find_best_split(self, X, y, subset_features): - """ - Find the best split for a subset of features. - - Parameters: - - X: Input features (numpy array). - - y: Target values (numpy array). - - subset_features: Subset of features to consider. - - Returns: - - Index of the best feature and the corresponding threshold. - """ - m, n = X.shape - best_feature_index = None - best_threshold = None - best_variance_reduction = 0 - - initial_variance = self._calculate_variance(y) - - for feature_index in subset_features: - thresholds = np.unique(X[:, feature_index]) - - for threshold in thresholds: - # Split the dataset - _, _, y_left, y_right = self._split_dataset(X, y, feature_index, threshold) - - # Calculate variance reduction - left_weight = len(y_left) / m - right_weight = len(y_right) / m - variance_reduction = initial_variance - (left_weight * self._calculate_variance(y_left) + right_weight * self._calculate_variance(y_right)) - - # Update the best split if variance reduction is greater - if variance_reduction > best_variance_reduction: - best_feature_index = feature_index - best_threshold = threshold - best_variance_reduction = variance_reduction - - return best_feature_index, best_threshold - - def _calculate_variance(self, y): - """ - Calculate the variance of a set of target values. - - Parameters: - - y: Target values (numpy array). - - Returns: - - Variance of the target values. - """ - return np.var(y) - - def _split_dataset(self, X, y, feature_index, threshold): - """ - Split the dataset based on a feature and threshold. - - Parameters: - - X: Input features (numpy array). - - y: Target values (numpy array). - - feature_index: Index of the feature to split on. - - threshold: Threshold value for the split. - - Returns: - - Left and right subsets of the dataset. - """ - left_mask = X[:, feature_index] <= threshold - right_mask = ~left_mask - return X[left_mask], X[right_mask], y[left_mask], y[right_mask] - - def fit(self, X, y): - """ - Fit the Random Forest Regression model to the input data. - - Parameters: - - X: Input features (numpy array). - - y: Target values (numpy array). 
- """ - self.trees = [] - for _ in range(self.n_trees): - # Create a bootstrap sample for each tree - X_bootstrap, y_bootstrap = self._bootstrap_sample(X, y) - - # Build a decision tree and add it to the forest - tree = self._build_tree(X_bootstrap, y_bootstrap, depth=0) - self.trees.append(tree) - - def _predict_single(self, tree, x): - """ - Recursively predict a single data point using a decision tree. - - Parameters: - - tree: Decision tree node. - - x: Input features for prediction. - - Returns: - - Predicted value. - """ - if 'value' in tree: - return tree['value'] - else: - if x[tree['feature_index']] <= tree['threshold']: - return self._predict_single(tree['left'], x) - else: - return self._predict_single(tree['right'], x) - - def predict(self, X): - """ - Make predictions on new data using the Random Forest. - - Parameters: - - X: Input features for prediction (numpy array). - - Returns: - - Predicted values (numpy array). - """ - predictions = np.array([self._predict_single(tree, x) for x in X for tree in self.trees]) - return np.mean(predictions.reshape(-1, len(self.trees)), axis=1) -``` - -- random_forest_regression_test.py file - -```py -import unittest -import numpy as np -from RandomForestRegressor import RandomForestRegression - -class TestRandomForestRegressor(unittest.TestCase): - def setUp(self): - # Create sample data for testing - np.random.seed(42) - self.X_train = np.random.rand(100, 2) - self.y_train = 2 * self.X_train[:, 0] + 3 * self.X_train[:, 1] + np.random.normal(0, 0.1, 100) - - self.X_test = np.random.rand(10, 2) - - def test_fit_predict(self): - # Test if the model can be fitted and predictions are made - rfr_model = RandomForestRegression(n_trees=5, max_depth=3, max_features=2) - rfr_model.fit(self.X_train, self.y_train) - - # Ensure predictions are made without errors - predictions = rfr_model.predict(self.X_test) - - # Add your specific assertions based on the expected behavior of your model - self.assertIsInstance(predictions, np.ndarray) - self.assertEqual(predictions.shape, (10,)) - - # Add more test cases as needed - -if __name__ == '__main__': - unittest.main() -``` diff --git a/docs/algorithms/machine-learning/supervised/regressions/ridge.md b/docs/algorithms/machine-learning/supervised/regressions/ridge.md deleted file mode 100644 index c0b540f5..00000000 --- a/docs/algorithms/machine-learning/supervised/regressions/ridge.md +++ /dev/null @@ -1,101 +0,0 @@ -# Ridge Regression - -This module contains an implementation of Ridge Regression, a linear regression variant that includes regularization to prevent overfitting. - -## Overview - -Ridge Regression is a linear regression technique with an added regularization term to handle multicollinearity and prevent the model from becoming too complex. - -## Parameters - -- `alpha`: Regularization strength. A higher alpha increases the penalty for large coefficients. - -## Scratch Code - -- ridge_regression.py file - -```py -import numpy as np - -class RidgeRegression: - def __init__(self, alpha=1.0): - """ - Constructor for the Ridge Regression class. - - Parameters: - - alpha: Regularization strength. Higher values specify stronger regularization. - """ - self.alpha = alpha - self.weights = None - - def fit(self, X, y): - """ - Fit the Ridge Regression model to the input data. - - Parameters: - - X: Input features (numpy array). - - y: Target values (numpy array). 
- """ - # Add a column of ones to the input features for the bias term - X_bias = np.c_[np.ones(X.shape[0]), X] - - # Compute the closed-form solution for Ridge Regression - identity_matrix = np.identity(X_bias.shape[1]) - self.weights = np.linalg.inv(X_bias.T @ X_bias + self.alpha * identity_matrix) @ X_bias.T @ y - - def predict(self, X): - """ - Make predictions on new data. - - Parameters: - - X: Input features for prediction (numpy array). - - Returns: - - Predicted values (numpy array). - """ - # Add a column of ones to the input features for the bias term - X_bias = np.c_[np.ones(X.shape[0]), X] - - # Make predictions using the learned weights - predictions = X_bias @ self.weights - - return predictions -``` - -- ridge_regression_test.py file - -```py -import numpy as np -import unittest -from RidgeRegression import RidgeRegression # Assuming your RidgeRegression class is in a separate file - -class TestRidgeRegression(unittest.TestCase): - def test_fit_predict(self): - # Generate synthetic data for testing - np.random.seed(42) - X_train = np.random.rand(100, 2) - y_train = 3 * X_train[:, 0] + 5 * X_train[:, 1] + 2 + 0.1 * np.random.randn(100) - X_test = np.random.rand(20, 2) - - # Create a Ridge Regression model - ridge_model = RidgeRegression(alpha=0.1) - - # Fit the model to training data - ridge_model.fit(X_train, y_train) - - # Make predictions on test data - predictions = ridge_model.predict(X_test) - - # Ensure the predictions have the correct shape - self.assertEqual(predictions.shape, (20,)) - - def test_invalid_alpha(self): - # Check if an exception is raised for an invalid alpha value - with self.assertRaises(ValueError): - RidgeRegression(alpha=-1) - - # Add more test cases as needed - -if __name__ == '__main__': - unittest.main() -``` diff --git a/docs/algorithms/machine-learning/supervised/regressions/support-vector.md b/docs/algorithms/machine-learning/supervised/regressions/support-vector.md deleted file mode 100644 index 56ff0963..00000000 --- a/docs/algorithms/machine-learning/supervised/regressions/support-vector.md +++ /dev/null @@ -1,140 +0,0 @@ -# Support Vector Regression - -This module contains an implementation of Support Vector Regression (SVR), a regression technique using Support Vector Machines (SVM) principles. - -## Parameters - -- `epsilon`: Epsilon in the epsilon-SVR model. It specifies the epsilon-tube within which no penalty is associated in the training loss function. -- `C`: Regularization parameter. The strength of the regularization is inversely proportional to C. - -## Scratch Code - -- support_vector_regression.py file - -```py -import numpy as np - -class SupportVectorRegression: - - def __init__(self, epsilon=0.1, C=1.0): - """ - Constructor for the SupportVectorRegression class. - - Parameters: - - epsilon: Epsilon in the epsilon-SVR model. It specifies the epsilon-tube within which no penalty is associated in the training loss function. - - C: Regularization parameter. The strength of the regularization is inversely proportional to C. - """ - self.epsilon = epsilon - self.C = C - self.weights = None - self.bias = None - - def _linear_kernel(self, X1, X2): - """ - Linear kernel function. - - Parameters: - - X1, X2: Input data (numpy arrays). - - Returns: - - Linear kernel result (numpy array). - """ - return np.dot(X1, X2.T) - - def _compute_kernel_matrix(self, X): - """ - Compute the kernel matrix for the linear kernel. - - Parameters: - - X: Input data (numpy array). - - Returns: - - Kernel matrix (numpy array). 
- """ - m = X.shape[0] - kernel_matrix = np.zeros((m, m)) - - for i in range(m): - for j in range(m): - kernel_matrix[i, j] = self._linear_kernel(X[i, :], X[j, :]) - - return kernel_matrix - - def fit(self, X, y): - """ - Fit the Support Vector Regression model to the input data. - - Parameters: - - X: Input features (numpy array). - - y: Target values (numpy array). - """ - m, n = X.shape - - # Create the kernel matrix - kernel_matrix = self._compute_kernel_matrix(X) - - # Quadratic programming problem coefficients - P = np.vstack([np.hstack([kernel_matrix, -kernel_matrix]), - np.hstack([-kernel_matrix, kernel_matrix])]) - q = np.vstack([self.epsilon * np.ones((m, 1)) - y, self.epsilon * np.ones((m, 1)) + y]) - - # Constraints matrix - G = np.vstack([np.eye(2 * m), -np.eye(2 * m)]) - h = np.vstack([self.C * np.ones((2 * m, 1)), np.zeros((2 * m, 1))]) - - # Solve the quadratic programming problem - solution = np.linalg.solve(P, q) - - # Extract weights and bias - self.weights = solution[:n] - self.bias = solution[n] - - def predict(self, X): - """ - Make predictions on new data. - - Parameters: - - X: Input features for prediction (numpy array). - - Returns: - - Predicted values (numpy array). - """ - predictions = np.dot(X, self.weights) + self.bias - return predictions -``` - -- support_vector_regression_test.py file - -```py -import unittest -import numpy as np -from SVR import SupportVectorRegression - -class TestSupportVectorRegression(unittest.TestCase): - - def setUp(self): - # Create synthetic data for testing - np.random.seed(42) - self.X_train = 2 * np.random.rand(100, 1) - self.y_train = 4 + 3 * self.X_train + np.random.randn(100, 1) - - def test_fit_predict(self): - # Test the fit and predict methods - svr_model = SupportVectorRegression(epsilon=0.1, C=1.0) - svr_model.fit(self.X_train, self.y_train) - - # Create test data - X_test = np.array([[1.5], [2.0]]) - - # Make predictions - predictions = svr_model.predict(X_test) - - # Assert that the predictions are NumPy arrays - self.assertTrue(isinstance(predictions, np.ndarray)) - - # Assert that the shape of predictions is as expected - self.assertEqual(predictions.shape, (X_test.shape[0], 1)) - -if __name__ == '__main__': - unittest.main() -``` diff --git a/docs/algorithms/machine-learning/supervised/regressions/xg-boost.md b/docs/algorithms/machine-learning/supervised/regressions/xg-boost.md deleted file mode 100644 index 54028ca9..00000000 --- a/docs/algorithms/machine-learning/supervised/regressions/xg-boost.md +++ /dev/null @@ -1,127 +0,0 @@ -# XG Boost Regression - -This module contains an implementation of the XGBoost Regressor, a popular ensemble learning algorithm that combines the predictions from multiple decision trees to create a more robust and accurate model for regression tasks. - -## Parameters - -- `n_estimators`: Number of boosting rounds (trees). -- `learning_rate`: Step size shrinkage to prevent overfitting. -- `max_depth`: Maximum depth of each tree. -- `gamma`: Minimum loss reduction required to make a further partition. - -## Scratch Code - -- x_g_boost_regression.py file - -```py -import numpy as np -from sklearn.tree import DecisionTreeRegressor - -class XGBoostRegressor: - - def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3, gamma=0): - """ - Constructor for the XGBoostRegressor class. - - Parameters: - - n_estimators: Number of boosting rounds (trees). - - learning_rate: Step size shrinkage to prevent overfitting. - - max_depth: Maximum depth of each tree. 
- - gamma: Minimum loss reduction required to make a further partition. - """ - self.n_estimators = n_estimators - self.learning_rate = learning_rate - self.max_depth = max_depth - self.gamma = gamma - self.trees = [] - - def fit(self, X, y): - """ - Fit the XGBoost model to the input data. - - Parameters: - - X: Input features (numpy array). - - y: Target values (numpy array). - """ - # Initialize residuals - residuals = np.copy(y) - - for _ in range(self.n_estimators): - # Fit a weak learner (decision tree) to the residuals - tree = DecisionTreeRegressor(max_depth=self.max_depth, min_samples_split=self.gamma) - tree.fit(X, residuals) - - # Compute predictions from the weak learner - predictions = tree.predict(X) - - # Update residuals with the weighted sum of previous residuals and predictions - residuals -= self.learning_rate * predictions - - # Store the tree in the list - self.trees.append(tree) - - def predict(self, X): - """ - Make predictions on new data. - - Parameters: - - X: Input features for prediction (numpy array). - - Returns: - - Predicted values (numpy array). - """ - # Initialize predictions with zeros - predictions = np.zeros(X.shape[0]) - - # Make predictions using each tree and update the overall prediction - for tree in self.trees: - predictions += self.learning_rate * tree.predict(X) - - return predictions -``` - -- x_g_boost_regression_test.py file - -```py -import unittest -import numpy as np -from XGBoostRegressor import XGBoostRegressor - -class TestXGBoostRegressor(unittest.TestCase): - - def setUp(self): - # Generate synthetic data for testing - np.random.seed(42) - self.X_train = np.random.rand(100, 5) - self.y_train = np.random.rand(100) - self.X_test = np.random.rand(20, 5) - - def test_fit_predict(self): - # Test the fit and predict methods - xgb_model = XGBoostRegressor(n_estimators=50, learning_rate=0.1, max_depth=3, gamma=0.1) - xgb_model.fit(self.X_train, self.y_train) - predictions = xgb_model.predict(self.X_test) - - # Ensure predictions have the correct shape - self.assertEqual(predictions.shape, (20,)) - - def test_invalid_parameters(self): - # Test invalid parameter values - with self.assertRaises(ValueError): - XGBoostRegressor(n_estimators=-1, learning_rate=0.1, max_depth=3, gamma=0.1) - - with self.assertRaises(ValueError): - XGBoostRegressor(n_estimators=50, learning_rate=-0.1, max_depth=3, gamma=0.1) - - with self.assertRaises(ValueError): - XGBoostRegressor(n_estimators=50, learning_rate=0.1, max_depth=-3, gamma=0.1) - - def test_invalid_fit(self): - # Test fitting with mismatched X_train and y_train shapes - xgb_model = XGBoostRegressor(n_estimators=50, learning_rate=0.1, max_depth=3, gamma=0.1) - with self.assertRaises(ValueError): - xgb_model.fit(self.X_train, np.random.rand(50)) - -if __name__ == '__main__': - unittest.main() -``` diff --git a/docs/algorithms/machine-learning/unsupervised/clustering/index.md b/docs/algorithms/machine-learning/unsupervised/clustering/index.md deleted file mode 100644 index 5c110f17..00000000 --- a/docs/algorithms/machine-learning/unsupervised/clustering/index.md +++ /dev/null @@ -1,16 +0,0 @@ -# Clustering Algorithms ๐Ÿค– - - diff --git a/docs/algorithms/machine-learning/unsupervised/clustering/kmeans-clustering.md b/docs/algorithms/machine-learning/unsupervised/clustering/kmeans-clustering.md deleted file mode 100644 index 42414783..00000000 --- a/docs/algorithms/machine-learning/unsupervised/clustering/kmeans-clustering.md +++ /dev/null @@ -1,185 +0,0 @@ -# K Means Clustering - -**Overview:** -K-means 
clustering is an unsupervised machine learning algorithm for grouping similar data points together into clusters based on their features. - ---- -### **Advantages of K-means:** -- **Simple and Easy to implement** -- **Efficiency:** K-means is computationally efficient and can handle large datasets with high dimensionality. -- **Flexibility:** K-means offers flexibility as it can be easily customized for different applications, allowing the use of various distance metrics and - initialization techniques. -- **Scalability:** K-means can handle large datasets with many data points - ---- -**How K-means Works (Scratch Implementation Guide):** -### **Algorithm Overview:** -1. **Initialization**: - - Choose `k` initial centroids randomly from the dataset. - -2. **Iterative Process**: - - **Assign Data Points**: For each data point, calculate the Euclidean distance to all centroids and assign the data point to the nearest centroid. - - **Update Centroids**: Recalculate the centroids by averaging the data points assigned to each cluster. - - **Check for Convergence**: If the centroids do not change significantly between iterations (i.e., they converge), stop. Otherwise, repeat the process. - -3. **Termination**: - - The algorithm terminates either when the centroids have converged or when the maximum number of iterations is reached. - -4. **Output**: - - The final cluster assignments for each data point. - - -## Parameters - -- `num_clusters`: Number of clusters to form. -- `max_iterations`: Maximum number of iterations before stopping. -- `show_steps`: Whether to visualize the clustering process step by step (Boolean). - -## Scratch Code - -- kmeans_scratch.py file - -```py -import numpy as np -import matplotlib.pyplot as plt - -def euclidean_distance(point1, point2): - """ - Calculate the Euclidean distance between two points in space. - """ - return np.sqrt(np.sum((point1 - point2) ** 2)) - -class KMeansClustering: - def __init__(self, num_clusters=5, max_iterations=100, show_steps=False): - """ - Initialize the KMeans clustering model with the following parameters: - - num_clusters: Number of clusters we want to form - - max_iterations: Maximum number of iterations for the algorithm - - show_steps: Boolean flag to visualize the clustering process step by step - """ - self.num_clusters = num_clusters - self.max_iterations = max_iterations - self.show_steps = show_steps - self.clusters = [[] for _ in range(self.num_clusters)] # Initialize empty clusters - self.centroids = [] # List to store the centroids of clusters - - def fit_predict(self, data): - """ - Fit the KMeans model on the data and predict the cluster labels for each data point. 
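-        Note: initial centroids are sampled from the data itself without replacement, and one cluster label is returned per row of data.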
- """ - self.data = data - self.num_samples, self.num_features = data.shape # Get number of samples and features - initial_sample_indices = np.random.choice(self.num_samples, self.num_clusters, replace=False) - self.centroids = [self.data[idx] for idx in initial_sample_indices] - - for _ in range(self.max_iterations): - # Step 1: Assign each data point to the closest centroid to form clusters - self.clusters = self._assign_to_clusters(self.centroids) - if self.show_steps: - self._plot_clusters() - - # Step 2: Calculate new centroids by averaging the data points in each cluster - old_centroids = self.centroids - self.centroids = self._calculate_new_centroids(self.clusters) - - # Step 3: Check for convergence - if self._has_converged(old_centroids, self.centroids): - break - if self.show_steps: - self._plot_clusters() - - return self._get_cluster_labels(self.clusters) - - def _assign_to_clusters(self, centroids): - """ - Assign each data point to the closest centroid based on Euclidean distance. - """ - clusters = [[] for _ in range(self.num_clusters)] - for sample_idx, sample in enumerate(self.data): - closest_centroid_idx = self._find_closest_centroid(sample, centroids) - clusters[closest_centroid_idx].append(sample_idx) - return clusters - - def _find_closest_centroid(self, sample, centroids): - """ - Find the index of the closest centroid to the given data point (sample). - """ - distances = [euclidean_distance(sample, centroid) for centroid in centroids] - closest_idx = np.argmin(distances) # Index of the closest centroid - return closest_idx - - def _calculate_new_centroids(self, clusters): - """ - Calculate new centroids by averaging the data points in each cluster. - """ - centroids = np.zeros((self.num_clusters, self.num_features)) - for cluster_idx, cluster in enumerate(clusters): - cluster_mean = np.mean(self.data[cluster], axis=0) - centroids[cluster_idx] = cluster_mean - return centroids - - def _has_converged(self, old_centroids, new_centroids): - """ - Check if the centroids have converged - """ - distances = [euclidean_distance(old_centroids[i], new_centroids[i]) for i in range(self.num_clusters)] - return sum(distances) == 0 # If centroids haven't moved, they are converged - - def _get_cluster_labels(self, clusters): - """ - Get the cluster labels for each data point based on the final clusters. - """ - labels = np.empty(self.num_samples) - for cluster_idx, cluster in enumerate(clusters): - for sample_idx in cluster: - labels[sample_idx] = cluster_idx - return labels - - def _plot_clusters(self): - """ - Visualize the clusters and centroids in a 2D plot using matplotlib. 
- """ - fig, ax = plt.subplots(figsize=(12, 8)) - for i, cluster in enumerate(self.clusters): - cluster_points = self.data[cluster] - ax.scatter(cluster_points[:, 0], cluster_points[:, 1]) - - for centroid in self.centroids: - ax.scatter(centroid[0], centroid[1], marker="x", color="black", linewidth=2) - - plt.show() - -``` - -- test_kmeans.py file - -```py -import unittest -import numpy as np -from kmeans_scratch import KMeansClustering - -class TestKMeansClustering(unittest.TestCase): - - def setUp(self): - np.random.seed(42) - self.X_train = np.vstack([ - np.random.randn(100, 2) + np.array([5, 5]), - np.random.randn(100, 2) + np.array([-5, -5]), - np.random.randn(100, 2) + np.array([5, -5]), - np.random.randn(100, 2) + np.array([-5, 5]) - ]) - - def test_kmeans(self): - """Test the basic KMeans clustering functionality""" - kmeans = KMeansClustering(num_clusters=4, max_iterations=100, show_steps=False) - - cluster_labels = kmeans.fit_predict(self.X_train) - - unique_labels = np.unique(cluster_labels) - self.assertEqual(len(unique_labels), 4) - self.assertEqual(cluster_labels.shape, (self.X_train.shape[0],)) - print("Cluster labels for the data points:") - print(cluster_labels) - -if __name__ == '__main__': - unittest.main() diff --git a/docs/algorithms/machine-learning/unsupervised/dimensionality-reduction/index.md b/docs/algorithms/machine-learning/unsupervised/dimensionality-reduction/index.md deleted file mode 100644 index c8fc7ac8..00000000 --- a/docs/algorithms/machine-learning/unsupervised/dimensionality-reduction/index.md +++ /dev/null @@ -1,11 +0,0 @@ -# Dimensionality Reduction ๐Ÿค– - -
-
-**No Items Found**
-
-There are no items available at this time. Check back again later.
-
diff --git a/docs/algorithms/machine-learning/unsupervised/index.md b/docs/algorithms/machine-learning/unsupervised/index.md deleted file mode 100644 index d6e21115..00000000 --- a/docs/algorithms/machine-learning/unsupervised/index.md +++ /dev/null @@ -1,27 +0,0 @@ -# Unsupervised Machine Learning ๐Ÿค– - - diff --git a/docs/algorithms/natural-language-processing/Bag_Of_Words.md b/docs/algorithms/natural-language-processing/Bag_Of_Words.md deleted file mode 100644 index 3c165007..00000000 --- a/docs/algorithms/natural-language-processing/Bag_Of_Words.md +++ /dev/null @@ -1,51 +0,0 @@ -# Bag Of Words - -```py -import re -import pandas as pd -from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer - -#data collection -data = [ - 'Fashion is an art form and expression.', - 'Style is a way to say who you are without having to speak.', - 'Fashion is what you buy, style is what you do with it.', - 'With fashion, you convey a message about yourself without uttering a single word' -] - -#text processing - -def preprocess_text(text): - text = text.lower() - text = re.sub(r'[^a-zs]',' ',text) - return text - -preprocessed_data = [preprocess_text(doc) for doc in data] - -for i, doc in enumerate(preprocessed_data, 1): - print(f'Data-{i} {doc}') - - - -# removing words like the, is, are, and as they usually do not carry much useful information for the analysis. -vectorizer = CountVectorizer(stop_words='english') -X=vectorizer.fit_transform(preprocessed_data) -Word=vectorizer.get_feature_names_out() - -bow_df = pd.DataFrame(X.toarray(),columns=Word) -bow_df.index =[f'Data {i}' for i in range(1, len(data) + 1)] - -tfidf_transformer = TfidfTransformer() -X_tfidf=tfidf_transformer.fit_transform(X) -tfidf_df=pd.DataFrame(X_tfidf.toarray(), columns=Word) -tfidf_df.index=[f'Data {i}' for i in range(1, len(data) + 1)] - - -print() -print("--------------------------------BoW Represention----------------------------") -print(bow_df) - -print() -print("--------------------------------TF-IDF Value----------------------------") -print(tfidf_df) -``` \ No newline at end of file diff --git a/docs/algorithms/natural-language-processing/Fast_Text.md b/docs/algorithms/natural-language-processing/Fast_Text.md deleted file mode 100644 index 86afaf68..00000000 --- a/docs/algorithms/natural-language-processing/Fast_Text.md +++ /dev/null @@ -1,228 +0,0 @@ -# Fast Text - -## Introduction - -The `FastText` class implements a word representation and classification tool developed by Facebook's AI Research (FAIR) lab. FastText extends the Word2Vec model by representing each word as a bag of character n-grams. This approach helps capture subword information and improves the handling of rare words. - -## Explanation - -### Initialization - -- **`vocab_size`**: Size of the vocabulary. -- **`embedding_dim`**: Dimension of the word embeddings. -- **`n_gram_size`**: Size of character n-grams. -- **`learning_rate`**: Learning rate for updating embeddings. -- **`epochs`**: Number of training epochs. - -### Building Vocabulary - -- **`build_vocab()`**: Constructs the vocabulary from the input sentences and creates a reverse mapping of words to indices. - -### Generating N-grams - -- **`get_ngrams()`**: Generates character n-grams for a given word. It pads the word with `<` and `>` symbols to handle edge cases effectively. - -### Training - -- **`train()`**: Updates word and context embeddings using a simple Stochastic Gradient Descent (SGD) approach. 
The loss is computed as the squared error between the predicted and actual values. - -### Prediction - -- **`predict()`**: Calculates the dot product between the target word and context embeddings to predict word vectors. - -### Getting Word Vectors - -- **`get_word_vector()`**: Retrieves the embedding for a specific word from the trained model. - -### Normalization - -- **`get_embedding_matrix()`**: Returns the normalized embedding matrix for better performance and stability. - -## Advantages - -- **Subword Information**: FastText captures morphological details by using character n-grams, improving handling of rare and out-of-vocabulary words. -- **Improved Representations**: The use of subwords allows for better word representations, especially for languages with rich morphology. -- **Efficiency**: FastText is designed to handle large-scale datasets efficiently, with optimizations for both training and inference. - -## Applications - -- **Natural Language Processing (NLP)**: FastText embeddings are used in tasks like text classification, sentiment analysis, and named entity recognition. -- **Information Retrieval**: Enhances search engines by providing more nuanced semantic matching between queries and documents. -- **Machine Translation**: Improves translation models by leveraging subword information for better handling of rare words and phrases. - -## Implementation - -### Preprocessing - -1. **Initialization**: Set up parameters such as vocabulary size, embedding dimension, n-gram size, learning rate, and number of epochs. - -### Building Vocabulary - -2. **Build Vocabulary**: Construct the vocabulary from the input sentences and create a mapping for words. - -### Generating N-grams - -3. **Generate N-grams**: Create character n-grams for each word in the vocabulary, handling edge cases with padding. - -### Training - -4. **Train the Model**: Use SGD to update word and context embeddings based on the training data. - -### Prediction - -5. **Predict Word Vectors**: Calculate the dot product between target and context embeddings to predict word vectors. - -### Getting Word Vectors - -6. **Retrieve Word Vectors**: Extract the embedding for a specific word from the trained model. - -### Normalization - -7. **Normalize Embeddings**: Return the normalized embedding matrix for stability and improved performance. - -For more advanced implementations, consider using optimized libraries like the FastText library by Facebook or other frameworks that offer additional features and efficiency improvements. 
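-
-As a quick, self-contained illustration of the padding scheme described under *Generating N-grams* above, the helper below (a hypothetical stand-in that mirrors the logic of `get_ngrams()` in the class further down) shows the character 3-grams produced for the word `fast`:
-
-```py
-def char_ngrams(word, n=3):
-    # Pad with n-1 boundary markers on each side, as get_ngrams() does
-    padded = "<" * (n - 1) + word + ">" * (n - 1)
-    return {padded[i:i + n] for i in range(len(padded) - n + 1)}
-
-print(sorted(char_ngrams("fast")))
-# ['<<f', '<fa', 'ast', 'fas', 'st>', 't>>']
-```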
- -## Code - -- main.py - -```py -from fasttext import FastText - -# Example sentences -sentences = [ - "fast text is a library for efficient text classification", - "word embeddings are useful for NLP tasks", - "fasttext models can handle out-of-vocabulary words" -] - -# Initialize and train FastText model -fasttext_model = FastText(vocab_size=100, embedding_dim=50) -fasttext_model.build_vocab(sentences) -fasttext_model.train(sentences) - -# Get the vector for a word -vector = fasttext_model.get_word_vector("fast") -print(f"Vector for 'fast': {vector}") -``` - -- fast_test.py - -```py -import numpy as np -from collections import defaultdict -from sklearn.preprocessing import normalize - -class FastText: - def __init__(self, vocab_size, embedding_dim, n_gram_size=3, learning_rate=0.01, epochs=10): - self.vocab_size = vocab_size - self.embedding_dim = embedding_dim - self.n_gram_size = n_gram_size - self.learning_rate = learning_rate - self.epochs = epochs - self.word_embeddings = np.random.uniform(-0.1, 0.1, (vocab_size, embedding_dim)) - self.context_embeddings = np.random.uniform(-0.1, 0.1, (vocab_size, embedding_dim)) - self.vocab = {} - self.rev_vocab = {} - - def build_vocab(self, sentences): - """ - Build vocabulary from sentences. - - Args: - sentences (list): List of sentences (strings). - """ - word_count = defaultdict(int) - for sentence in sentences: - words = sentence.split() - for word in words: - word_count[word] += 1 - self.vocab = {word: idx for idx, (word, _) in enumerate(word_count.items())} - self.rev_vocab = {idx: word for word, idx in self.vocab.items()} - - def get_ngrams(self, word): - """ - Get n-grams for a given word. - - Args: - word (str): Input word. - - Returns: - set: Set of n-grams. - """ - ngrams = set() - word = '<' * (self.n_gram_size - 1) + word + '>' * (self.n_gram_size - 1) - for i in range(len(word) - self.n_gram_size + 1): - ngrams.add(word[i:i + self.n_gram_size]) - return ngrams - - def train(self, sentences): - """ - Train the FastText model using the given sentences. - - Args: - sentences (list): List of sentences (strings). - """ - for epoch in range(self.epochs): - loss = 0 - for sentence in sentences: - words = sentence.split() - for i, word in enumerate(words): - if word not in self.vocab: - continue - word_idx = self.vocab[word] - target_ngrams = self.get_ngrams(word) - for j in range(max(0, i - 1), min(len(words), i + 2)): - if i != j and words[j] in self.vocab: - context_idx = self.vocab[words[j]] - prediction = self.predict(word_idx, context_idx) - error = prediction - 1 if j == i + 1 else prediction - loss += error**2 - self.word_embeddings[word_idx] -= self.learning_rate * error * self.context_embeddings[context_idx] - self.context_embeddings[context_idx] -= self.learning_rate * error * self.word_embeddings[word_idx] - print(f'Epoch {epoch + 1}/{self.epochs}, Loss: {loss}') - - def predict(self, word_idx, context_idx): - """ - Predict the dot product of the word and context embeddings. - - Args: - word_idx (int): Index of the word. - context_idx (int): Index of the context word. - - Returns: - float: Dot product. - """ - return np.dot(self.word_embeddings[word_idx], self.context_embeddings[context_idx]) - - def get_word_vector(self, word): - """ - Get the word vector for the specified word. - - Args: - word (str): Input word. - - Returns: - np.ndarray: Word vector. 
- """ - if word in self.vocab: - return self.word_embeddings[self.vocab[word]] - else: - raise ValueError(f"Word '{word}' not found in vocabulary") - - def get_embedding_matrix(self): - """ - Get the normalized embedding matrix. - - Returns: - np.ndarray: Normalized word embeddings. - """ - return normalize(self.word_embeddings, axis=1) -``` - -## References - -1. [FastText - Facebook AI Research](https://fasttext.cc/) -2. [Understanding FastText](https://arxiv.org/pdf/1702.05531) -3. [FastText on GitHub](https://github.com/facebookresearch/fastText) - diff --git a/docs/algorithms/natural-language-processing/GloVe.md b/docs/algorithms/natural-language-processing/GloVe.md deleted file mode 100644 index 43ae2dc1..00000000 --- a/docs/algorithms/natural-language-processing/GloVe.md +++ /dev/null @@ -1,223 +0,0 @@ -# GloVe - -## Introduction - -The `GloVe` class implements the Global Vectors for Word Representation algorithm, developed by Stanford researchers. GloVe generates dense vector representations of words, capturing semantic relationships between them. Unlike traditional one-hot encoding, GloVe produces low-dimensional, continuous vectors that convey meaningful information about words and their contexts. - -## Key Concepts - -- **Co-occurrence Matrix**: GloVe starts by creating a co-occurrence matrix from a large corpus of text. This matrix counts how often words appear together within a given context window, capturing the frequency of word pairs. - -- **Weighted Least Squares**: The main idea behind GloVe is to factorize this co-occurrence matrix to find word vectors that capture the relationships between words. It aims to represent words that frequently appear together in similar contexts with similar vectors. - -- **Weighting Function**: To ensure that the optimization process doesn't get overwhelmed by very frequent co-occurrences, GloVe uses a weighting function. This function reduces the influence of extremely common word pairs. - -- **Training Objective**: The goal is to adjust the word vectors so that their dot products align with the observed co-occurrence counts. This helps in capturing the similarity between words based on their contexts. - -## GloVe Training Objective - -GloVeโ€™s training involves adjusting word vectors so that their interactions match the observed co-occurrence data. It focuses on ensuring that words appearing together often have similar vector representations. - -## Advantages - -- **Efficient Training**: By using a global co-occurrence matrix, GloVe captures semantic relationships effectively, including long-range dependencies. -- **Meaningful Vectors**: The resulting vectors can represent complex relationships between words, such as analogies (e.g., "king" - "man" + "woman" โ‰ˆ "queen"). -- **Flexibility**: GloVe vectors are versatile and can be used in various NLP tasks, including sentiment analysis and machine translation. - -## Applications - -- **Natural Language Processing (NLP)**: GloVe vectors are used as features in NLP tasks like sentiment analysis, named entity recognition, and question answering. -- **Information Retrieval**: Enhance search engines by providing better semantic matching between queries and documents. -- **Machine Translation**: Improve translation models by capturing semantic similarities between words in different languages. - -## Implementation - -### Preprocessing - -1. **Clean and Tokenize**: Prepare the text data by cleaning and tokenizing it into words. - -### Building Vocabulary - -2. 
**Create Vocabulary**: Construct a vocabulary and map words to unique indices. - -### Co-occurrence Matrix - -3. **Build Co-occurrence Matrix**: Create a matrix that captures how often each word pair appears together within a specified context. - -### GloVe Model - -4. **Initialization**: Set up the model parameters and hyperparameters. -5. **Weighting Function**: Define how to balance the importance of different co-occurrence counts. -6. **Training**: Use optimization techniques to adjust the word vectors based on the co-occurrence data. -7. **Get Word Vector**: Extract the vector representation for each word from the trained model. - -For more advanced implementations, consider using libraries like TensorFlow or PyTorch, which offer enhanced functionalities and optimizations. - -## Code - -- main.py file - -```py -import numpy as np -from preprocess import preprocess -from vocab_and_matrix import build_vocab, build_cooccurrence_matrix -from glove_model import GloVe - -# Example text corpus -corpus = ["I love NLP", "NLP is a fascinating field", "Natural language processing with GloVe"] - -# Preprocess the corpus -tokens = [token for sentence in corpus for token in preprocess(sentence)] - -# Build vocabulary and co-occurrence matrix -word_to_index = build_vocab(tokens) -cooccurrence_matrix = build_cooccurrence_matrix(tokens, word_to_index) - -# Initialize and train the GloVe model -glove = GloVe(vocab_size=len(word_to_index), embedding_dim=50) -glove.train(cooccurrence_matrix, epochs=100) - -# Get the word vector for 'nlp' -word_vector = glove.get_word_vector('nlp', word_to_index) -print(word_vector) -``` - -- glove_model.py file - -```py -import numpy as np - -class GloVe: - def __init__(self, vocab_size, embedding_dim=50, x_max=100, alpha=0.75): - self.vocab_size = vocab_size - self.embedding_dim = embedding_dim - self.x_max = x_max - self.alpha = alpha - self.W = np.random.rand(vocab_size, embedding_dim) - self.W_tilde = np.random.rand(vocab_size, embedding_dim) - self.b = np.random.rand(vocab_size) - self.b_tilde = np.random.rand(vocab_size) - self.gradsq_W = np.ones((vocab_size, embedding_dim)) - self.gradsq_W_tilde = np.ones((vocab_size, embedding_dim)) - self.gradsq_b = np.ones(vocab_size) - self.gradsq_b_tilde = np.ones(vocab_size) - - def weighting_function(self, x): - if x < self.x_max: - return (x / self.x_max) ** self.alpha - return 1.0 - - def train(self, cooccurrence_matrix, epochs=100, learning_rate=0.05): - for epoch in range(epochs): - total_cost = 0 - for i in range(self.vocab_size): - for j in range(self.vocab_size): - if cooccurrence_matrix[i, j] == 0: - continue - X_ij = cooccurrence_matrix[i, j] - weight = self.weighting_function(X_ij) - cost = weight * (np.dot(self.W[i], self.W_tilde[j]) + self.b[i] + self.b_tilde[j] - np.log(X_ij)) ** 2 - total_cost += cost - - grad_common = weight * (np.dot(self.W[i], self.W_tilde[j]) + self.b[i] + self.b_tilde[j] - np.log(X_ij)) - grad_W = grad_common * self.W_tilde[j] - grad_W_tilde = grad_common * self.W[i] - grad_b = grad_common - grad_b_tilde = grad_common - - self.W[i] -= learning_rate * grad_W / np.sqrt(self.gradsq_W[i]) - self.W_tilde[j] -= learning_rate * grad_W_tilde / np.sqrt(self.gradsq_W_tilde[j]) - self.b[i] -= learning_rate * grad_b / np.sqrt(self.gradsq_b[i]) - self.b_tilde[j] -= learning_rate * grad_b_tilde / np.sqrt(self.gradsq_b_tilde[j]) - - self.gradsq_W[i] += grad_W ** 2 - self.gradsq_W_tilde[j] += grad_W_tilde ** 2 - self.gradsq_b[i] += grad_b ** 2 - self.gradsq_b_tilde[j] += grad_b_tilde ** 2 - - if epoch % 
10 == 0: - print(f'Epoch: {epoch}, Cost: {total_cost}') - - def get_word_vector(self, word, word_to_index): - if word in word_to_index: - word_index = word_to_index[word] - return self.W[word_index] - return None -``` - -- preprocess.py file - -```py -import string - -def preprocess(text): - """ - Preprocess the text by removing punctuation, converting to lowercase, and splitting into words. - - Args: - text (str): Input text string. - - Returns: - list: List of words (tokens). - """ - text = text.translate(str.maketrans('', '', string.punctuation)).lower() - tokens = text.split() - return tokens -``` - -- vocab_and_matrix.py file - -```py -import numpy as np -from collections import Counter - -def build_vocab(tokens): - """ - Build vocabulary from tokens and create a word-to-index mapping. - - Args: - tokens (list): List of words (tokens). - - Returns: - dict: Word-to-index mapping. - """ - vocab = Counter(tokens) - word_to_index = {word: i for i, word in enumerate(vocab)} - return word_to_index - -def build_cooccurrence_matrix(tokens, word_to_index, window_size=2): - """ - Build the co-occurrence matrix from tokens using a specified window size. - - Args: - tokens (list): List of words (tokens). - word_to_index (dict): Word-to-index mapping. - window_size (int): Context window size. - - Returns: - np.ndarray: Co-occurrence matrix. - """ - vocab_size = len(word_to_index) - cooccurrence_matrix = np.zeros((vocab_size, vocab_size)) - - for i, word in enumerate(tokens): - word_index = word_to_index[word] - context_start = max(0, i - window_size) - context_end = min(len(tokens), i + window_size + 1) - - for j in range(context_start, context_end): - if i != j: - context_word = tokens[j] - context_word_index = word_to_index[context_word] - cooccurrence_matrix[word_index, context_word_index] += 1 - - return cooccurrence_matrix -``` - - -## References - -1. [GloVe - Stanford NLP](https://nlp.stanford.edu/projects/glove/) -2. [Understanding GloVe](https://nlp.stanford.edu/pubs/glove.pdf) -3. [GloVe: Global Vectors for Word Representation - Wikipedia](https://en.wikipedia.org/wiki/GloVe) - diff --git a/docs/algorithms/natural-language-processing/NLTK_Setup.md b/docs/algorithms/natural-language-processing/NLTK_Setup.md deleted file mode 100644 index 35d09da6..00000000 --- a/docs/algorithms/natural-language-processing/NLTK_Setup.md +++ /dev/null @@ -1,241 +0,0 @@ -# NLTK Setup - -Hello there! ๐ŸŒŸ Welcome to your first step into the fascinating world of Natural Language Processing (NLP) with the Natural Language Toolkit (NLTK). This guide is designed to be super beginner-friendly. Weโ€™ll cover everything from installation to basic operations with lots of explanations along the way. Let's get started! - -## What is NLTK? -The Natural Language Toolkit (NLTK) is a comprehensive Python library for working with human language data (text). It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. It also includes wrappers for industrial-strength NLP libraries. - -Key Features of NLTK: - -1.Corpora and Lexical Resources: NLTK includes access to a variety of text corpora and lexical resources, such as WordNet, the Brown Corpus, the Gutenberg Corpus, and many more. - -2.Text Processing Libraries: It provides tools for a wide range of text processing tasks: - -Tokenization (splitting text into words, sentences, etc.) 
- - Part-of-Speech (POS) tagging - - Named Entity Recognition (NER) - - Stemming and Lemmatization - - Parsing (syntax analysis) - - Semantic reasoning - -3.Classification and Machine Learning: NLTK includes various classifiers and machine learning algorithms that can be used for text classification tasks. - -4.Visualization and Demonstrations: It offers visualization tools for trees, graphs, and other linguistic structures. It also includes a number of interactive demonstrations and sample data. - -## Installation -First, we need to install NLTK. Make sure you have Python installed on your system. If not, you can download it from python.org. Once you have Python, open your command prompt (or terminal) and type the following command: -``` -pip install nltk -``` -To verify that NLTK is installed correctly, open a Python shell and import the library: - -import nltk -If no errors occur, NLTK is successfully installed. -NLTK requires additional data packages for various functionalities. To download all the data packages, open a python shell and run : -``` -import nltk -nltk.download ('all') -``` -Alternatively you can download specific data packages using : - -nltk.download ('punkt') # Tokenizer for splitting sentences into words -nltk.download ('averaged_perceptron_tagger') # Part-of-speech tagger for tagging words with their parts of speech -nltk.download ('maxent_ne_chunker') # Named entity chunker for recognizing named entities in text -nltk.download ('words') # Corpus of English words required for many NLTK functions -Now that we have everything set up, letโ€™s dive into some basic NLP operations with NLTK. - -## Tokenization -Tokenization is the process of breaking down text into smaller pieces, like words or sentences. It's like cutting a big cake into smaller slices. -``` -from nltk.tokenize import word_tokenize, sent_tokenize - -#Sample text to work with -text = "Natural Language Processing with NLTK is fun and educational." - -#Tokenize into words -words = word_tokenize(text) -print("Word Tokenization:", words) - -#Tokenize into sentences -sentences = sent_tokenize(text) -print("Sentence Tokenization:", sentences) -``` -Word Tokenization: ['Natural', 'Language', 'Processing', 'with', 'NLTK', 'is', 'fun', 'and', 'educational', '.'] - -Sentence Tokenization: ['Natural Language Processing with NLTK is fun and educational.'] - - - -## Stopwords Removal -Stopwords are common words that donโ€™t carry much meaning on their own. In many NLP tasks, we remove these words to focus on the important ones. -``` -from nltk.corpus import stopwords - -# Get the list of stopwords in English -stop_words = set(stopwords.words('english')) - -# Remove stopwords from our list of words -filtered_words = [word for word in words if word.lower() not in stop_words] - -print("Filtered Words:", filtered_words) -``` -Filtered Words: ['Natural', 'Language', 'Processing', 'NLTK', 'fun', 'educational', '.'] - -### Explanation: - -stopwords.words('english'): This gives us a list of common English stopwords. - -[word for word in words if word.lower() not in stop_words]: This is a list comprehension that filters out the stopwords from our list of words. - - -## Stemming -Stemming is the process of reducing words to their root form. Itโ€™s like finding the 'stem' of a word. 
-``` -from nltk.stem import PorterStemmer - -# Create a PorterStemmer object -ps = PorterStemmer() - -# Stem each word in our list of words -stemmed_words = [ps.stem(word) for word in words] - -print("Stemmed Words:", stemmed_words) -``` -Stemmed Words: ['natur', 'languag', 'process', 'with', 'nltk', 'is', 'fun', 'and', 'educ', '.'] - -### Explanation: - -PorterStemmer(): This creates a PorterStemmer object, which is a popular stemming algorithm. - -[ps.stem(word) for word in words]: This applies the stemming algorithm to each word in our list. - -## Lemmatization -Lemmatization is similar to stemming but it uses a dictionary to find the base form of a word. Itโ€™s more accurate than stemming. -``` -from nltk.stem import WordNetLemmatizer - -# Create a WordNetLemmatizer object -lemmatizer = WordNetLemmatizer() - -# Lemmatize each word in our list of words -lemmatized_words = [lemmatizer.lemmatize(word) for word in words] - -print("Lemmatized Words:", lemmatized_words) -``` -Lemmatized Words: ['Natural', 'Language', 'Processing', 'with', 'NLTK', 'is', 'fun', 'and', 'educational', '.'] - -### Explanation: - -WordNetLemmatizer(): This creates a lemmatizer object. - -[lemmatizer.lemmatize(word) for word in words]: This applies the lemmatization process to each word in our list. - -## Part-of-speech tagging -Part-of-speech tagging is the process of labeling each word in a sentence with its corresponding part of speech, such as noun, verb, adjective, etc. NLTK provides functionality to perform POS tagging easily. -``` -# Import the word_tokenize function from nltk.tokenize module -# Import the pos_tag function from nltk module -from nltk.tokenize import word_tokenize -from nltk import pos_tag - -# Sample text to work with -text = "NLTK is a powerful tool for natural language processing." - -# Tokenize the text into individual words -# The word_tokenize function splits the text into a list of words -words = word_tokenize(text) - -# Perform Part-of-Speech (POS) tagging -# The pos_tag function takes a list of words and assigns a part-of-speech tag to each word -pos_tags = pos_tag(words) - -# Print the part-of-speech tags -print("Part-of-speech tags:") -print(pos_tags) -``` -Part-of-speech tags: -[('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('powerful', 'JJ'), ('tool', 'NN'), ('for', 'IN'), ('natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('.', '.')] - -### Explanation: - -pos_tags = pos_tag(words): The pos_tag function takes the list of words and assigns a part-of-speech tag to each word. For example, it might tag 'NLTK' as a proper noun (NNP), 'is' as a verb (VBZ), and so on. 
- -Here is a list of common POS tags used in the Penn Treebank tag set, along with explanations and examples: - -### Common POS Tags: -CC: Coordinating conjunction (e.g., and, but, or) - -CD: Cardinal number (e.g., one, two) - -DT: Determiner (e.g., the, a, an) - -EX: Existential there (e.g., there is) - -FW: Foreign word (e.g., en route) - -IN: Preposition or subordinating conjunction (e.g., in, of, like) - -JJ: Adjective (e.g., big, blue, fast) - -JJR: Adjective, comparative (e.g., bigger, faster) - -JJS: Adjective, superlative (e.g., biggest, fastest) - -LS: List item marker (e.g., 1, 2, One) - -MD: Modal (e.g., can, will, must) - -NN: Noun, singular or mass (e.g., dog, city, music) - -NNS: Noun, plural (e.g., dogs, cities) - -NNP: Proper noun, singular (e.g., John, London) - -NNPS: Proper noun, plural (e.g., Americans, Sundays) - -PDT: Predeterminer (e.g., all, both, half) - -POS: Possessive ending (e.g., 's, s') - -PRP: Personal pronoun (e.g., I, you, he) - -PRP$: Possessive pronoun (e.g., my, your, his) - -RB: Adverb (e.g., quickly, softly) - -RBR: Adverb, comparative (e.g., faster, harder) - -RBS: Adverb, superlative (e.g., fastest, hardest) - -RP: Particle (e.g., up, off) - -SYM: Symbol (e.g., $, %, &) - -TO: to (e.g., to go, to read) - -UH: Interjection (e.g., uh, well, wow) - -VB: Verb, base form (e.g., run, eat) - -VBD: Verb, past tense (e.g., ran, ate) - -VBG: Verb, gerund or present participle (e.g., running, eating) - -VBN: Verb, past participle (e.g., run, eaten) - -VBP: Verb, non-3rd person singular present (e.g., run, eat) - -VBZ: Verb, 3rd person singular present (e.g., runs, eats) - -WDT: Wh-determiner (e.g., which, that) - -WP: Wh-pronoun (e.g., who, what) - -WP$: Possessive wh-pronoun (e.g., whose) - -WRB: Wh-adverb (e.g., where, when) diff --git a/docs/algorithms/natural-language-processing/N_L_P_Introduction.md b/docs/algorithms/natural-language-processing/N_L_P_Introduction.md deleted file mode 100644 index de949bd4..00000000 --- a/docs/algorithms/natural-language-processing/N_L_P_Introduction.md +++ /dev/null @@ -1,67 +0,0 @@ -# NLP Introduction - -![NLP Banner](https://th.bing.com/th/id/OIG3.wl1FYeKHMXjwrMA3Xd59?pid=ImgGn) - -## What is NLP? ๐Ÿค– - -Natural Language Processing (NLP) is a field of artificial intelligence that enables computers to understand, interpret, and respond to human language. It bridges the gap between human communication and machine understanding, making it possible for computers to interact with us in a natural and intuitive way. - -![NLP Footer](https://assets-global.website-files.com/5ec6a20095cdf182f108f666/5f22908f09f2341721cd8901_AI%20poster.png) - -## Importance of NLP ๐ŸŒ - -NLP is essential for various applications that we use daily, such as: - -- **Voice Assistants** (e.g., Siri, Alexa) ๐Ÿ—ฃ๏ธ -- **Chatbots** for customer service ๐Ÿ’ฌ -- **Language Translation** services (e.g., Google Translate) ๐ŸŒ -- **Sentiment Analysis** in social media monitoring ๐Ÿ“Š -- **Text Summarization** for news articles ๐Ÿ“ฐ - -## History and Evolution of NLP ๐Ÿ“œ - -1. **1950s**: Alan Turing's seminal paper "Computing Machinery and Intelligence" proposes the Turing Test. -2. **1960s-1970s**: Development of early NLP systems like ELIZA and SHRDLU. -3. **1980s**: Introduction of machine learning algorithms and statistical models. -4. **1990s**: Emergence of more sophisticated algorithms and large annotated datasets. -5. **2000s**: Advent of deep learning, leading to significant breakthroughs in NLP. -6. 
**2010s-Present**: Development of powerful models like BERT and GPT, revolutionizing the field. - -## Key Concepts and Terminology ๐Ÿ“š - -- **Tokens**: The smallest units of text, such as words or punctuation marks. Example: "Hello, world!" becomes ["Hello", ",", "world", "!"]. -- **Corpus**: A large collection of text used for training NLP models. Example: The Wikipedia Corpus. -- **Stopwords**: Commonly used words (e.g., "the", "is", "in") that are often removed from text during preprocessing. -- **Stemming**: Reducing words to their base or root form. Example: "running" becomes "run". -- **Lemmatization**: Similar to stemming, but it reduces words to their dictionary form. Example: "better" becomes "good". - -## Real-World Use Cases ๐ŸŒŸ - -### Voice Assistants ๐Ÿ—ฃ๏ธ - -Voice assistants like Siri and Alexa use NLP to understand and respond to user commands. For example, when you ask, "What's the weather today?", NLP helps the assistant interpret your query and provide the relevant weather information. - -### Customer Service Chatbots ๐Ÿ’ฌ - -Many companies use chatbots to handle customer inquiries. NLP enables these bots to understand customer questions and provide accurate responses, improving customer satisfaction and reducing response time. - -### Language Translation ๐ŸŒ - -NLP powers translation services like Google Translate, which can translate text from one language to another. This helps break down language barriers and facilitates global communication. - -### Sentiment Analysis ๐Ÿ“Š - -Businesses use sentiment analysis to monitor social media and understand public opinion about their products or services. NLP analyzes text to determine whether the sentiment expressed is positive, negative, or neutral. - -### Text Summarization ๐Ÿ“ฐ - -NLP algorithms can summarize long articles into concise summaries, making it easier for readers to grasp the main points quickly. This is particularly useful for news articles and research papers. - -## Conclusion ๐ŸŒŸ - -NLP is a dynamic and rapidly evolving field that plays a crucial role in how we interact with technology. By understanding its basics and key concepts, you can start exploring the fascinating world of language and machines. - - ---- - -> This README provides a brief introduction to NLP. For a deeper dive, explore more resources and start building your own NLP projects! diff --git a/docs/algorithms/natural-language-processing/Text_PreProcessing_Techniques.md b/docs/algorithms/natural-language-processing/Text_PreProcessing_Techniques.md deleted file mode 100644 index b27b6ac7..00000000 --- a/docs/algorithms/natural-language-processing/Text_PreProcessing_Techniques.md +++ /dev/null @@ -1,320 +0,0 @@ -# Text Preprocessing Techniques - -#### Welcome to this comprehensive guide on text preprocessing with NLTK (Natural Language Toolkit)! This notebook will walk you through various essential text preprocessing techniques, all explained in simple terms with easy-to-follow code examples. Whether you're just starting out in NLP (Natural Language Processing) or looking to brush up on your skills, you're in the right place! ๐Ÿš€ - -#### NLTK provides a comprehensive suite of tools for processing and analyzing unstructured text data. - -## 1. Tokenization -Tokenization is the process of splitting text into individual words or sentences. - -#### Sentence Tokenization -``` -import nltk -nltk.download('punkt') -from nltk.tokenize import sent_tokenize - -text = "Hello World. This is NLTK. It is great for text processing." 
-sentences = sent_tokenize(text) -print(sentences) -``` -#### Word Tokenization -``` -from nltk.tokenize import word_tokenize - -words = word_tokenize(text) -print(words) -``` - -## 2. Removing Stop Words - -Stop words are common words that may not be useful for text analysis (e.g., "is", "the", "and"). - -``` -from nltk.corpus import stopwords -nltk.download('stopwords') - -stop_words = set(stopwords.words('english')) -filtered_words = [word for word in words if word.lower() not in stop_words] -print(filtered_words) -``` - -## 3. Stemming - -Stemming reduces words to their root form by chopping off the ends. -``` -from nltk.stem import PorterStemmer -stemmer = PorterStemmer() -stemmed_words = [stemmer.stem(word) for word in filtered_words] -print(stemmed_words) -``` - -## 4. Lemmatization - -Lemmatization reduces words to their base form (lemma), taking into account the meaning of the word. -``` -from nltk.stem import WordNetLemmatizer -nltk.download('wordnet') - -lemmatizer = WordNetLemmatizer() -lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words] -print(lemmatized_words) -``` - -## 5. Part-of-Speech Tagging - -Tagging words with their parts of speech (POS) helps understand the grammatical structure. - -The complete POS tag list can be accessed from the Installation and set-up notebook. -``` -nltk.download('averaged_perceptron_tagger') - -pos_tags = nltk.pos_tag(lemmatized_words) -print(pos_tags) -``` - -## 6. Named Entity Recognition - -Identify named entities such as names of people, organizations, locations, etc. - -``` -# Numpy is required to run this -%pip install numpy - -nltk.download('maxent_ne_chunker') -nltk.download('words') -from nltk.chunk import ne_chunk - -named_entities = ne_chunk(pos_tags) -print(named_entities) -``` - -## 7. Word Frequency Distribution - -Count the frequency of each word in the text. - -``` -from nltk.probability import FreqDist - -freq_dist = FreqDist(lemmatized_words) -print(freq_dist.most_common(5)) -``` - -## 8. Removing Punctuation - -Remove punctuation from the text. - -``` -import string - -no_punct = [word for word in lemmatized_words if word not in string.punctuation] -print(no_punct) -``` - -## 9. Lowercasing - -Convert all words to lowercase. - -``` -lowercased = [word.lower() for word in no_punct] -print(lowercased) -``` - -## 10. Spelling Correction - -Correct the spelling of words. - -``` -%pip install pyspellchecker - -from nltk.corpus import wordnet -from spellchecker import SpellChecker - -spell = SpellChecker() - -def correct_spelling(word): - if not wordnet.synsets(word): - return spell.correction(word) - return word - -lemmatized_words = ['hello', 'world', '.', 'klown', 'taxt', 'procass', '.'] -words_with_corrected_spelling = [correct_spelling(word) for word in lemmatized_words] -print(words_with_corrected_spelling) -``` - -## 11. Removing Numbers - -Remove numerical values from the text. - -``` -lemmatized_words = ['hello', 'world', '88', 'text', 'process', '.'] - -no_numbers = [word for word in lemmatized_words if not word.isdigit()] -print(no_numbers) -``` - -## 12. Word Replacement - -Replace specific words with other words (e.g., replacing slang with formal words). - -``` -lemmatized_words = ['hello', 'world', 'gr8', 'text', 'NLTK', '.'] -replacements = {'NLTK': 'Natural Language Toolkit', 'gr8' : 'great'} - -replaced_words = [replacements.get(word, word) for word in lemmatized_words] -print(replaced_words) -``` - -## 13. Synonym Replacement - -Replace words with their synonyms. 
- -``` -from nltk.corpus import wordnet -lemmatized_words = ['hello', 'world', 'awesome', 'text', 'great', '.'] - -def get_synonym(word): - synonyms = wordnet.synsets(word) - if synonyms: - return synonyms[0].lemmas()[0].name() - return word - -synonym_replaced = [get_synonym(word) for word in lemmatized_words] -print(synonym_replaced) -``` - -## 14. Extracting Bigrams and Trigrams - -Extract bigrams (pairs of consecutive words) and trigrams (triplets of consecutive words). - -``` -from nltk import bigrams - -bigrams_list = list(bigrams(lemmatized_words)) -print(bigrams_list) - -from nltk import trigrams - -trigrams_list = list(trigrams(lemmatized_words)) -print(trigrams_list) -``` - -## 15. Sentence Segmentation - -Split text into sentences while considering abbreviations and other punctuation complexities. - -``` -import nltk.data - -text = 'Hello World. This is NLTK. It is great for text preprocessing.' - -# Load the sentence tokenizer -tokenizer = nltk.data.load('tokenizers/punkt/english.pickle') - -# Tokenize the text into sentences -sentences = tokenizer.tokenize(text) - -# Print the tokenized sentences -print(sentences) -``` - -## 16. Identifying Word Frequencies - -Identify and display the frequency of words in a text. - -``` -from nltk.probability import FreqDist - -lemmatized_words = ['hello', 'hello', 'awesome', 'text', 'great', '.', '.', '.'] - - -word_freq = FreqDist(lemmatized_words) -for word, freq in word_freq.items(): - print(f"{word}: {freq}") -``` - -## 17. Removing HTML tags - -Remove HTML tags from the text. -``` -%pip install bs4 - -from bs4 import BeautifulSoup - -html_text = "

<html><body><p>Hello World. This is NLTK.</p></body></html>
" -soup = BeautifulSoup(html_text, "html.parser") -cleaned_text = soup.get_text() -print(cleaned_text) -``` - -## 18. Detecting Language - -Detect the language of the text. -``` -%pip install langdetect - -from langdetect import detect - -language = detect(text) -print(language) #`en` (for English) -``` - -## 19. Tokenizing by Regular Expressions - -Use Regular Expressions to tokenize text. -``` -text = 'Hello World. This is NLTK. It is great for text preprocessing.' - -from nltk.tokenize import regexp_tokenize - -pattern = r'\w+' -regex_tokens = regexp_tokenize(text, pattern) -print(regex_tokens) -``` - -## 20. Remove Frequent Words - -Removes frequent words (also known as โ€œhigh-frequency wordsโ€) from a list of tokens using NLTK, you can use the nltk.FreqDist() function to calculate the frequency of each word and filter out the most common ones. -``` -import nltk - -# input text -text = "Natural language processing is a field of AI. I love AI." - -# tokenize the text -tokens = nltk.word_tokenize(text) - -# calculate the frequency of each word -fdist = nltk.FreqDist(tokens) - -# remove the most common words (e.g., the top 10% of words by frequency) -filtered_tokens = [token for token in tokens if fdist[token] < fdist.N() * 0.1] - -print("Tokens without frequent words:", filtered_tokens) -``` - -## 21. Remove extra whitespace - -Tokenizes the input string into individual sentences and remove any leading or trailing whitespace from each sentence. -``` -import nltk.data - -# Text data -text = 'Hello World. This is NLTK. It is great for text preprocessing.' - -# Load the sentence tokenizer -tokenizer = nltk.data.load('tokenizers/punkt/english.pickle') - -# Tokenize the text into sentences -sentences = tokenizer.tokenize(text) - -# Remove extra whitespace from each sentence -sentences = [sentence.strip() for sentence in sentences] - -# Print the tokenized sentences -print(sentences) -``` - -# Conclusion ๐ŸŽ‰ - -#### Text preprocessing is a crucial step in natural language processing (NLP) and can significantly impact the performance of your models and applications. With NLTK, we have a powerful toolset that simplifies and streamlines these tasks. -#### I hope this guide has provided you with a solid foundation for text preprocessing with NLTK. As you continue your journey in NLP, remember that preprocessing is just the beginning. There are many more exciting and advanced techniques to explore and apply in your projects. diff --git a/docs/algorithms/natural-language-processing/Tf_Idf.md b/docs/algorithms/natural-language-processing/Tf_Idf.md deleted file mode 100644 index 4aca3d9b..00000000 --- a/docs/algorithms/natural-language-processing/Tf_Idf.md +++ /dev/null @@ -1,192 +0,0 @@ -# Tf-Idf - -## Introduction - -The `TFIDF` class converts a collection of documents into their respective TF-IDF (Term Frequency-Inverse Document Frequency) representations. TF-IDF is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (corpus). - -## Attributes - -The `TFIDF` class is initialized with two main attributes: - -- **`self.vocabulary`**: A dictionary that maps words to their indices in the TF-IDF matrix. -- **`self.idf_values`**: A dictionary that stores the IDF (Inverse Document Frequency) values for each word. - -## Methods - -### fit Method - -#### Input - -- **`documents`** (list of str): List of documents where each document is a string. - -#### Purpose - -Calculate the IDF values for all unique words in the corpus. 
- -#### Steps - -1. **Count Document Occurrences**: Determine how many documents contain each word. -2. **Compute IDF**: Calculate the importance of each word across all documents. Higher values indicate the word is more unique to fewer documents. -3. **Build Vocabulary**: Create a mapping of words to unique indexes. - -### transform Method - -#### Input - -- **`documents`** (list of str): A list where each entry is a document in the form of a string. - -#### Purpose - -Convert each document into a numerical representation that shows the importance of each word. - -#### Steps - -1. **Compute Term Frequency (TF)**: Determine how often each word appears in a document relative to the total number of words in that document. -2. **Compute TF-IDF**: Multiply the term frequency of each word by its IDF to get a measure of its relevance in each document. -3. **Store Values**: Save these numerical values in a matrix where each row represents a document. - -### fit_transform Method - -#### Purpose - -Perform both fitting (computing IDF values) and transforming (converting documents to TF-IDF representation) in one step. - -## Explanation of the Code - -The `TFIDF` class includes methods for fitting the model to the data, transforming new data into the TF-IDF representation, and combining these steps. Here's a breakdown of the primary methods: - -1. **`fit` Method**: Calculates IDF values for all unique words in the corpus. It counts the number of documents containing each word and computes the IDF. The vocabulary is built with a word-to-index mapping. - -2. **`transform` Method**: Converts each document into a TF-IDF representation. It computes Term Frequency (TF) for each word in the document, calculates TF-IDF by multiplying TF with IDF, and stores these values in a matrix where each row corresponds to a document. - -3. **`fit_transform` Method**: Combines the fitting and transforming steps into a single method for efficient processing of documents. - -## Code - -- main.py file - -```py -import math -from collections import Counter - -class TFIDF: - def __init__(self): - self.vocabulary = {} # Vocabulary to store word indices - self.idf_values = {} # IDF values for words - - def fit(self, documents): - """ - Compute IDF values based on the provided documents. - - Args: - documents (list of str): List of documents where each document is a string. - """ - doc_count = len(documents) - term_doc_count = Counter() # To count the number of documents containing each word - - # Count occurrences of words in documents - for doc in documents: - words = set(doc.split()) # Unique words in the current document - for word in words: - term_doc_count[word] += 1 - - # Compute IDF values - self.idf_values = { - word: math.log(doc_count / (count + 1)) # +1 to avoid division by zero - for word, count in term_doc_count.items() - } - - # Build vocabulary - self.vocabulary = {word: idx for idx, word in enumerate(self.idf_values.keys())} - - def transform(self, documents): - """ - Transform documents into TF-IDF representation. - - Args: - documents (list of str): List of documents where each document is a string. - - Returns: - list of list of float: TF-IDF matrix where each row corresponds to a document. 
- """ - rows = [] - for doc in documents: - words = doc.split() - word_count = Counter(words) - doc_length = len(words) - row = [0] * len(self.vocabulary) - - for word, count in word_count.items(): - if word in self.vocabulary: - tf = count / doc_length - idf = self.idf_values[word] - index = self.vocabulary[word] - row[index] = tf * idf - rows.append(row) - return rows - - def fit_transform(self, documents): - """ - Compute IDF values and transform documents into TF-IDF representation. - - Args: - documents (list of str): List of documents where each document is a string. - - Returns: - list of list of float: TF-IDF matrix where each row corresponds to a document. - """ - self.fit(documents) - return self.transform(documents) -# Example usage -if __name__ == "__main__": - documents = [ - "the cat sat on the mat", - "the dog ate my homework", - "the cat ate the dog food", - "I love programming in Python", - "Machine learning is fun", - "Python is a versatile language", - "Learning new skills is always beneficial" - ] - - # Initialize the TF-IDF model - tfidf = TFIDF() - - # Fit the model and transform the documents - tfidf_matrix = tfidf.fit_transform(documents) - - # Print the vocabulary - print("Vocabulary:", tfidf.vocabulary) - - # Print the TF-IDF representation - print("TF-IDF Representation:") - for i, vector in enumerate(tfidf_matrix): - print(f"Document {i + 1}: {vector}") - - # More example documents with mixed content - more_documents = [ - "the quick brown fox jumps over the lazy dog", - "a journey of a thousand miles begins with a single step", - "to be or not to be that is the question", - "the rain in Spain stays mainly in the plain", - "all human beings are born free and equal in dignity and rights" - ] - - # Fit the model and transform the new set of documents - tfidf_more = TFIDF() - tfidf_matrix_more = tfidf_more.fit_transform(more_documents) - - # Print the vocabulary for the new documents - print("\nVocabulary for new documents:", tfidf_more.vocabulary) - - # Print the TF-IDF representation for the new documents - print("TF-IDF Representation for new documents:") - for i, vector in enumerate(tfidf_matrix_more): - print(f"Document {i + 1}: {vector}") -``` - -## References - -1. [TF-IDF - Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) -2. [Understanding TF-IDF](https://towardsdatascience.com/understanding-tf-idf-a-traditional-approach-to-feature-extraction-in-nlp-a5bfbe04723f) -3. [Scikit-learn: TF-IDF](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) diff --git a/docs/algorithms/natural-language-processing/Transformers.md b/docs/algorithms/natural-language-processing/Transformers.md deleted file mode 100644 index da3af34e..00000000 --- a/docs/algorithms/natural-language-processing/Transformers.md +++ /dev/null @@ -1,85 +0,0 @@ -# Transformers - -Welcome to the official documentation for the **Transformers** library! ๐Ÿš€ This library, developed by Hugging Face, is designed to provide state-of-the-art natural language processing (NLP) models and tools. It's widely used for a variety of NLP tasks, including text classification, translation, summarization, and more. - -## ๐Ÿ” Overview - -Transformers are a type of deep learning model that excel in handling sequential data, like text. They rely on mechanisms such as attention to process and generate text in a way that captures long-range dependencies and contextual information. 
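To make the attention mechanism mentioned above concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention (shapes and names are illustrative only; this is not the library's internal implementation):

```python
import numpy as np

def scaled_dot_product_attention(queries, keys, values):
    """Each output row is a weighted average of `values`, weighted by query-key similarity."""
    d_k = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)        # similarity of every query to every key
    scores -= scores.max(axis=-1, keepdims=True)    # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ values

# Three "token" vectors of dimension 4, attending to themselves (self-attention).
x = np.random.rand(3, 4)
print(scaled_dot_product_attention(x, x, x).shape)  # (3, 4)
```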
- -### Key Features - -- **State-of-the-art Models**: Access pre-trained models like BERT, GPT, T5, and many more. ๐Ÿ† -- **Easy-to-use Interface**: Simplify the process of using and fine-tuning models with a user-friendly API. ๐ŸŽฏ -- **Tokenization Tools**: Tokenize and preprocess text efficiently for model input. ๐Ÿงฉ -- **Multi-Framework Support**: Compatible with PyTorch and TensorFlow, giving you flexibility in your deep learning environment. โš™๏ธ -- **Extensive Documentation**: Detailed guides and tutorials to help you get started and master the library. ๐Ÿ“– - -## ๐Ÿ”ง Installation - -To get started with the Transformers library, you need to install it via pip: - -```bash -pip install transformers -``` - -### System Requirements - -- **Python**: Version 3.6 or later. -- **PyTorch** or **TensorFlow**: Depending on your preferred framework. Visit the [official documentation](https://huggingface.co/transformers/installation.html) for compatibility details. - -## ๐Ÿš€ Quick Start - -Here's a basic example to demonstrate how to use the library for sentiment classification: - -```python -from transformers import pipeline - -# Initialize the pipeline for sentiment analysis -classifier = pipeline('sentiment-analysis') - -# Analyze sentiment of a sample text -result = classifier("Transformers are amazing for NLP tasks! ๐ŸŒŸ") - -print(result) -``` - -### Common Pipelines - -- **Text Classification**: Classify text into predefined categories. -- **Named Entity Recognition (NER)**: Identify entities like names, dates, and locations. -- **Text Generation**: Generate text based on a prompt. -- **Question Answering**: Answer questions based on a given context. -- **Translation**: Translate text between different languages. - -## ๐Ÿ“š Documentation - -For comprehensive guides, tutorials, and API references, check out the following resources: - -- **[Transformers Documentation](https://huggingface.co/transformers/)**: The official site with detailed information on using and customizing the library. -- **[Model Hub](https://huggingface.co/models)**: Explore a wide range of pre-trained models available for different NLP tasks. -- **[API Reference](https://huggingface.co/transformers/main_classes/pipelines.html)**: Detailed descriptions of classes and functions in the library. - -## ๐Ÿ› ๏ธ Community and Support - -Join the vibrant community of Transformers users and contributors to get support, share your work, and stay updated: - -- **[Hugging Face Forums](https://discuss.huggingface.co/)**: Engage with other users and experts. Ask questions, share your projects, and participate in discussions. -- **[GitHub Repository](https://github.com/huggingface/transformers)**: Browse the source code, report issues, and contribute to the project. Check out the [issues](https://github.com/huggingface/transformers/issues) for ongoing conversations. - -## ๐Ÿ”— Additional Resources - -- **[Research Papers](https://huggingface.co/papers)**: Read the research papers behind the models and techniques used in the library. -- **[Blog Posts](https://huggingface.co/blog/)**: Discover insights, tutorials, and updates from the Hugging Face team. -- **[Webinars and Talks](https://huggingface.co/events/)**: Watch recorded talks and webinars on the latest developments and applications of Transformers. 
- -## โ“ FAQ - -**Q: What are the main differences between BERT and GPT?** - -A: BERT (Bidirectional Encoder Representations from Transformers) is designed for understanding the context of words in both directions (left and right). GPT (Generative Pre-trained Transformer), on the other hand, is designed for generating text and understanding context in a left-to-right manner. - -**Q: Can I fine-tune a model on my own data?** - -A: Yes, the Transformers library provides tools for fine-tuning pre-trained models on your custom datasets. Check out the [fine-tuning guide](https://huggingface.co/transformers/training.html) for more details. - -Happy Transforming! ๐ŸŒŸ \ No newline at end of file diff --git a/docs/algorithms/natural-language-processing/Word_2_Vec.md b/docs/algorithms/natural-language-processing/Word_2_Vec.md deleted file mode 100644 index 72b66264..00000000 --- a/docs/algorithms/natural-language-processing/Word_2_Vec.md +++ /dev/null @@ -1,222 +0,0 @@ -# Word 2 Vec - -## Introduction - -Word2Vec is a technique to learn word embeddings using neural networks. The primary goal is to represent words in a continuous vector space where semantically similar words are mapped to nearby points. Word2Vec can be implemented using two main architectures: - -1. **Continuous Bag of Words (CBOW)**: Predicts the target word based on the context words (surrounding words). -2. **Skip-gram**: Predicts the context words based on a given target word. - -In this example, we focus on the Skip-gram approach, which is more commonly used in practice. The Skip-gram model tries to maximize the probability of context words given a target word. - -## Installation - -Ensure you have Python installed. You can install the necessary dependencies using pip: - -```sh -pip install numpy -``` - -## Usage - -### Initialization - -Define the parameters for the Word2Vec model: - -- `window_size`: Defines the size of the context window around the target word. -- `embedding_dim`: Dimension of the word vectors (embedding space). -- `learning_rate`: Rate at which weights are updated. - -### Tokenization - -The `tokenize` method creates a vocabulary from the documents and builds mappings between words and their indices. - -### Generate Training Data - -The `generate_training_data` method creates pairs of target words and context words based on the window size. - -### Training - -The `train` method initializes the weight matrices and updates them using gradient descent. - -For each word-context pair, it computes the hidden layer representation, predicts context probabilities, calculates the error, and updates the weights. - -### Retrieve Word Vector - -The `get_word_vector` method retrieves the embedding of a specific word. - -## Explanation of the Code - -### Initialization - -- **Parameters**: - - `window_size`: Size of the context window around the target word. - - `embedding_dim`: Dimension of the word vectors (embedding space). - - `learning_rate`: Rate at which weights are updated. - -### Tokenization - -- The `tokenize` method creates a vocabulary from the documents. -- Builds mappings between words and their indices. - -### Generate Training Data - -- The `generate_training_data` method creates pairs of target words and context words based on the window size. - -### Training - -- The `train` method initializes the weight matrices. -- Updates the weights using gradient descent. -- For each word-context pair: - - Computes the hidden layer representation. - - Predicts context probabilities. - - Calculates the error. 
- - Updates the weights. - -### Softmax Function - -- The `softmax` function converts the output layer scores into probabilities. -- Used to compute the error and update the weights. - -### Retrieve Word Vector - -- The `get_word_vector` method retrieves the embedding of a specific word. - -## Code - -- word2vec.py file - -```py -import numpy as np - -class Word2Vec: - def __init__(self, window_size=2, embedding_dim=10, learning_rate=0.01): - # Initialize parameters - self.window_size = window_size - self.embedding_dim = embedding_dim - self.learning_rate = learning_rate - self.vocabulary = {} - self.word_index = {} - self.index_word = {} - self.W1 = None - self.W2 = None - - def tokenize(self, documents): - # Tokenize documents and build vocabulary - vocabulary = set() - for doc in documents: - words = doc.split() - vocabulary.update(words) - - self.vocabulary = list(vocabulary) - self.word_index = {word: idx for idx, word in enumerate(self.vocabulary)} - self.index_word = {idx: word for idx, word in enumerate(self.vocabulary)} - - def generate_training_data(self, documents): - # Generate training data for the Skip-gram model - training_data = [] - for doc in documents: - words = doc.split() - for idx, word in enumerate(words): - target_word = self.word_index[word] - context = [self.word_index[words[i]] for i in range(max(0, idx - self.window_size), min(len(words), idx + self.window_size + 1)) if i != idx] - for context_word in context: - training_data.append((target_word, context_word)) - return training_data - - def train(self, documents, epochs=1000): - # Tokenize the documents and generate training data - self.tokenize(documents) - training_data = self.generate_training_data(documents) - - # Initialize weight matrices with random values - vocab_size = len(self.vocabulary) - self.W1 = np.random.uniform(-1, 1, (vocab_size, self.embedding_dim)) - self.W2 = np.random.uniform(-1, 1, (self.embedding_dim, vocab_size)) - - for epoch in range(epochs): - loss = 0 - for target_word, context_word in training_data: - # Forward pass - h = self.W1[target_word] # Hidden layer representation of the target word - u = np.dot(h, self.W2) # Output layer scores - y_pred = self.softmax(u) # Predicted probabilities - - # Calculate error - e = np.zeros(vocab_size) - e[context_word] = 1 - error = y_pred - e - - # Backpropagation - self.W1[target_word] -= self.learning_rate * np.dot(self.W2, error) - self.W2 -= self.learning_rate * np.outer(h, error) - - # Calculate loss (cross-entropy) - loss -= np.log(y_pred[context_word]) - - if (epoch + 1) % 100 == 0: - print(f'Epoch {epoch + 1}, Loss: {loss}') - - def softmax(self, x): - # Softmax function to convert scores into probabilities - e_x = np.exp(x - np.max(x)) - return e_x / e_x.sum(axis=0) - - def get_word_vector(self, word): - # Retrieve the vector representation of a word - return self.W1[self.word_index[word]] - - def get_vocabulary(self): - # Retrieve the vocabulary - return self.vocabulary -# Example usage -if __name__ == "__main__": - # Basic example usage - documents = [ - "the cat sat on the mat", - "the dog ate my homework", - "the cat ate the dog food", - "I love programming in Python", - "Machine learning is fun", - "Python is a versatile language", - "Learning new skills is always beneficial" - ] - - # Initialize and train the Word2Vec model - word2vec = Word2Vec() - word2vec.train(documents) - - # Print the vocabulary - print("Vocabulary:", word2vec.get_vocabulary()) - - # Print the word vectors for each word in the vocabulary - print("Word 
Vectors:") - for word in word2vec.get_vocabulary(): - vector = word2vec.get_word_vector(word) - print(f"Vector for '{word}':", vector) - - # More example documents with mixed content - more_documents = [ - "the quick brown fox jumps over the lazy dog", - "a journey of a thousand miles begins with a single step", - "to be or not to be that is the question", - "the rain in Spain stays mainly in the plain", - "all human beings are born free and equal in dignity and rights" - ] - - # Initialize and train the Word2Vec model on new documents - word2vec_more = Word2Vec() - word2vec_more.train(more_documents) - - # Print the word vectors for selected words - print("\nWord Vectors for new documents:") - for word in ['quick', 'journey', 'be', 'rain', 'human']: - vector = word2vec_more.get_word_vector(word) - print(f"Vector for '{word}':", vector) -``` - -## References - -1. [Word2Vec - Google](https://code.google.com/archive/p/word2vec/) -2. [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781) -3. [Distributed Representations of Words and Phrases and their Compositionality](https://arxiv.org/abs/1310.4546) diff --git a/docs/algorithms/natural-language-processing/Word_Embeddings.md b/docs/algorithms/natural-language-processing/Word_Embeddings.md deleted file mode 100644 index 0b688ba2..00000000 --- a/docs/algorithms/natural-language-processing/Word_Embeddings.md +++ /dev/null @@ -1,128 +0,0 @@ -# Word Embeddings - -## Introduction - -Word embeddings are a type of word representation that allows words to be represented as vectors in a continuous vector space. These embeddings capture semantic meanings of words based on their context and usage in a given corpus. Word embeddings have become a fundamental concept in Natural Language Processing (NLP) and are widely used for various NLP tasks such as sentiment analysis, machine translation, and more. - -## Why Use Word Embeddings? - -Traditional word representations, such as one-hot encoding, have limitations: -- **High Dimensionality**: One-hot encoding results in very high-dimensional vectors, which are sparse (mostly zeros). -- **Lack of Semantics**: One-hot vectors do not capture any semantic relationships between words. - -Word embeddings address these issues by: -- **Dimensionality Reduction**: Representing words in a lower-dimensional space. -- **Capturing Semantics**: Encoding words such that similar words are closer together in the vector space. - -## Types of Word Embeddings - -### 1. **Word2Vec** - -Developed by Mikolov et al., Word2Vec generates word embeddings using neural networks. There are two main models: - -- **Continuous Bag of Words (CBOW)**: Predicts a target word based on its surrounding context words. -- **Skip-gram**: Predicts surrounding context words given a target word. - -#### Advantages -- Efficient and scalable. -- Captures semantic similarity. - -#### Limitations -- Context window size and other parameters need tuning. -- Does not consider word order beyond the fixed context window. - -### 2. **GloVe (Global Vectors for Word Representation)** - -Developed by Pennington et al., GloVe generates embeddings by factorizing the word co-occurrence matrix. The key idea is to use the global statistical information of the corpus. - -#### Advantages -- Captures global statistical information. -- Produces high-quality embeddings. - -#### Limitations -- Computationally intensive. -- Fixed window size and parameters. - -### 3. 
**FastText** - -Developed by Facebook's AI Research (FAIR) lab, FastText is an extension of Word2Vec that represents words as bags of character n-grams. This allows FastText to generate embeddings for out-of-vocabulary words. - -#### Advantages -- Handles morphologically rich languages better. -- Generates embeddings for out-of-vocabulary words. - -#### Limitations -- Slightly more complex to train than Word2Vec. -- Increased training time. - -### 4. **ELMo (Embeddings from Language Models)** - -Developed by Peters et al., ELMo generates embeddings using deep contextualized word representations based on a bidirectional LSTM. - -#### Advantages -- Contextual embeddings. -- Captures polysemy (different meanings of a word). - -#### Limitations -- Computationally expensive. -- Not as interpretable as static embeddings. - -### 5. **BERT (Bidirectional Encoder Representations from Transformers)** - -Developed by Devlin et al., BERT uses transformers to generate embeddings. BERT's embeddings are contextual and capture bidirectional context. - -#### Advantages -- State-of-the-art performance in many NLP tasks. -- Contextual embeddings capture richer semantics. - -#### Limitations -- Requires significant computational resources. -- Complexity in implementation. - -## Applications of Word Embeddings - -- **Sentiment Analysis**: Understanding the sentiment of a text by analyzing word embeddings. -- **Machine Translation**: Translating text from one language to another using embeddings. -- **Text Classification**: Categorizing text into predefined categories. -- **Named Entity Recognition**: Identifying and classifying entities in text. - -## Example Code - -Here's an example of using Word2Vec with the `gensim` library in Python: - -```python -from gensim.models import Word2Vec -from nltk.tokenize import word_tokenize -import nltk - -nltk.download('punkt') - -# Sample corpus -corpus = [ - "This is a sample sentence", - "Word embeddings are useful in NLP", - "Natural Language Processing with embeddings" -] - -# Tokenize the corpus -tokenized_corpus = [word_tokenize(sentence.lower()) for sentence in corpus] - -# Train Word2Vec model -model = Word2Vec(sentences=tokenized_corpus, vector_size=50, window=3, min_count=1, sg=1) - -# Get the embedding for the word 'word' -word_embedding = model.wv['word'] -print("Embedding for 'word':", word_embedding) -``` - -## Conclusion - -Word embeddings are a powerful technique in NLP that provide a way to represent words in a dense, continuous vector space. By capturing semantic relationships and reducing dimensionality, embeddings improve the performance of various NLP tasks and models. - -## References - -- [Word2Vec Explained](https://towardsdatascience.com/word2vec-explained-49c52b4ccb71) -- [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/pubs/glove.pdf) -- [FastText](https://fasttext.cc/) -- [ELMo: Deep Contextualized Word Representations](https://arxiv.org/abs/1802.05365) -- [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) \ No newline at end of file diff --git a/docs/algorithms/natural-language-processing/index.md b/docs/algorithms/natural-language-processing/index.md deleted file mode 100644 index b29b87ca..00000000 --- a/docs/algorithms/natural-language-processing/index.md +++ /dev/null @@ -1,105 +0,0 @@ -# Natural Language Processing ๐Ÿ—ฃ๏ธ - -
- - - - -
-

Bag Of Words

-

Representation of text as an unordered collection of words.

-

๐Ÿ“… 2025-01-10 | โฑ๏ธ 3 mins

-
-
- - - - -
-

Fast Text

-

From Facebook AI Research (FAIR) for learning word embeddings and text classification.

-

๐Ÿ“… 2025-01-15 | โฑ๏ธ 7 mins

-
-
- - - - -
-

Global Vectors

-

Unsupervised learning algorithm for obtaining vector representations for words.

-

๐Ÿ“… 2025-01-15 | โฑ๏ธ 4 mins

-
-
- - - - -
-

NLP Introduction

-

Enables computers to comprehend, generate, and manipulate human language.

-

๐Ÿ“… 2025-01-15 | โฑ๏ธ 3 mins

-
-
- - - - -
-

NLTK Setup

-

Working with human language data.

-

๐Ÿ“… 2025-01-10 | โฑ๏ธ 2 mins

-
-
- - - - -
-

Text Pre-Processing Techniques

-

Cleaning and preparing raw text data for further analysis or model training.

-

๐Ÿ“… 2025-01-15 | โฑ๏ธ 4 mins

-
-
- - - - -
-

Term Frequency-Inverse Document Frequency

-

Measure of importance of a word to a document.

-

๐Ÿ“… 2025-01-15 | โฑ๏ธ 3 mins

-
-
- - - - -
-

Transformers

-

Deep neural network architecture built on self-attention.

-

๐Ÿ“… 2025-01-15 | โฑ๏ธ 4 mins

-
-
- - - - -
-

Word2Vec

-

Creates vector representations of words.

-

๐Ÿ“… 2025-01-15 | โฑ๏ธ 3 mins

-
-
- - - - -
-

Word Embeddings

-

Numeric representations of words in a lower-dimensional space.

-

๐Ÿ“… 2025-01-15 | โฑ๏ธ 3 mins

-
-
- -
diff --git a/docs/algorithms/statistics/descriptive/index.md b/docs/algorithms/statistics/descriptive/index.md deleted file mode 100644 index 32f48b0c..00000000 --- a/docs/algorithms/statistics/descriptive/index.md +++ /dev/null @@ -1,11 +0,0 @@ -# Descriptive Statstics ๐Ÿ“ƒ - -
-
- -
-

No Items Found

-

- There are no items available at this time. Check back again later. -

-
diff --git a/docs/algorithms/statistics/index.md b/docs/algorithms/statistics/index.md deleted file mode 100644 index bd30fc6e..00000000 --- a/docs/algorithms/statistics/index.md +++ /dev/null @@ -1,50 +0,0 @@ -# Statistics ๐Ÿ“ƒ - - diff --git a/docs/algorithms/statistics/inferential/index.md b/docs/algorithms/statistics/inferential/index.md deleted file mode 100644 index b8860a6f..00000000 --- a/docs/algorithms/statistics/inferential/index.md +++ /dev/null @@ -1,11 +0,0 @@ -# Inferential Statstics ๐Ÿ“ƒ - -
-
- -
-

No Items Found

-

- There are no items available at this time. Check back again later. -

-
diff --git a/docs/algorithms/statistics/metrics-and-losses/errors/Mean_Absolute_Error.md b/docs/algorithms/statistics/metrics-and-losses/errors/Mean_Absolute_Error.md deleted file mode 100644 index 2d62b7d6..00000000 --- a/docs/algorithms/statistics/metrics-and-losses/errors/Mean_Absolute_Error.md +++ /dev/null @@ -1,18 +0,0 @@ -# Mean Absolute Error - -```py -import numpy as np - -def mean_absolute_error(y_true, y_pred): - """ - Calculate the mean absolute error between true and predicted values. - - Parameters: - - y_true: True target values (numpy array). - - y_pred: Predicted values (numpy array). - - Returns: - - Mean absolute error (float). - """ - return (np.absolute(y_true - y_pred)).mean() -``` diff --git a/docs/algorithms/statistics/metrics-and-losses/errors/Mean_Squared_Error.md b/docs/algorithms/statistics/metrics-and-losses/errors/Mean_Squared_Error.md deleted file mode 100644 index bc868298..00000000 --- a/docs/algorithms/statistics/metrics-and-losses/errors/Mean_Squared_Error.md +++ /dev/null @@ -1,18 +0,0 @@ -# Mean Squared Error - -```py -import numpy as np - -def mean_squared_error(y_true, y_pred): - """ - Calculate the mean squared error between true and predicted values. - - Parameters: - - y_true: True target values (numpy array). - - y_pred: Predicted values (numpy array). - - Returns: - - Mean squared error (float). - """ - return np.mean((y_true - y_pred) ** 2) -``` diff --git a/docs/algorithms/statistics/metrics-and-losses/errors/R2_Squared_Error.md b/docs/algorithms/statistics/metrics-and-losses/errors/R2_Squared_Error.md deleted file mode 100644 index c11fc2dc..00000000 --- a/docs/algorithms/statistics/metrics-and-losses/errors/R2_Squared_Error.md +++ /dev/null @@ -1,21 +0,0 @@ -# R2 Squared Error - -```py -import numpy as np - -def r_squared(y_true, y_pred): - """ - Calculate the R-squared value between true and predicted values. - - Parameters: - - y_true: True target values (numpy array). - - y_pred: Predicted values (numpy array). - - Returns: - - R-squared value (float). - """ - total_variance = np.sum((y_true - np.mean(y_true)) ** 2) - explained_variance = np.sum((y_pred - np.mean(y_true)) ** 2) - r2 = 1 - (explained_variance / total_variance) - return r2 -``` diff --git a/docs/algorithms/statistics/metrics-and-losses/errors/Root_Mean_Squared_Error.md b/docs/algorithms/statistics/metrics-and-losses/errors/Root_Mean_Squared_Error.md deleted file mode 100644 index 2d4a1475..00000000 --- a/docs/algorithms/statistics/metrics-and-losses/errors/Root_Mean_Squared_Error.md +++ /dev/null @@ -1,19 +0,0 @@ -# Root Mean Squared Error - -```py -import numpy as np -import math as mt - -def root_mean_squared_error(y_true,y_pred): - """ - Calculate the root mean squared error between true and predicted values. - - Parameters: - - y_true: True target values (numpy array). - - y_pred: Predicted values (numpy array). - - Returns: - - Root Mean squared error (float). 
- """ - return mt.sqrt(np.mean((y_true - y_pred) ** 2)) -``` diff --git a/docs/algorithms/statistics/metrics-and-losses/index.md b/docs/algorithms/statistics/metrics-and-losses/index.md deleted file mode 100644 index 3c94014e..00000000 --- a/docs/algorithms/statistics/metrics-and-losses/index.md +++ /dev/null @@ -1,2 +0,0 @@ -# Metrics and Losses ๐Ÿ“ƒ - diff --git a/docs/algorithms/statistics/metrics-and-losses/loss-functions/Cross_Entropy_Loss.md b/docs/algorithms/statistics/metrics-and-losses/loss-functions/Cross_Entropy_Loss.md deleted file mode 100644 index 7f803df2..00000000 --- a/docs/algorithms/statistics/metrics-and-losses/loss-functions/Cross_Entropy_Loss.md +++ /dev/null @@ -1,170 +0,0 @@ -# Cross Entropy Loss - -```py -import numpy as np - -def binary_cross_entropy_loss(y_true: np.ndarray | list, y_pred: np.ndarray | list) -> float: - """ - Calculate the binary cross entropy loss between true and predicted values. - It measures the difference between the predicted probability distribution and the actual binary label distribution. - The formula for binary cross-entropy loss is as follows: - - L(y, ลท) = -[y * log(ลท) + (1 โ€” y) * log(1 โ€” ลท)] - - where y is the true binary label (0 or 1), ลท is the predicted probability (ranging from 0 to 1), and log is the natural logarithm. - - Parameters: - - y_true: True target values (numpy array). - - y_pred: Predicted values (numpy array). - - Returns: - - Binary cross entropy loss (float). - """ - if (y_true is not None) and (y_pred is not None): - if type(y_true) == list: - y_true = np.asarray(y_true) - if type(y_pred) == list: - y_pred = np.asarray(y_pred) - assert y_true.shape == y_pred.shape, f"Shape of y_true ({y_true.shape}) does not match y_pred ({y_pred.shape})" - # calculate the binary cross-entropy loss - loss = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)).mean() - return loss - else: - return None - -def weighted_binary_cross_entropy_loss(y_true: np.ndarray | list, y_pred: np.ndarray | list, w_pos: float, w_neg: float) -> float: - """ - Calculates the weighted binary cross entropy loss between true and predicted values. - Weighted Binary Cross-Entropy loss is a variation of the binary cross-entropy loss that allows for assigning different weights to positive and negative examples. This can be useful when dealing with imbalanced datasets, where one class is significantly underrepresented compared to the other. - The formula for weighted binary cross-entropy loss is as follows: - - L(y, ลท) = -[w_pos * y * log(ลท) + w_neg * (1 โ€” y) * log(1 โ€” ลท)] - - where y is the true binary label (0 or 1), ลท is the predicted probability (ranging from 0 to 1), log is the natural logarithm, and w_pos and w_neg are the positive and negative weights, respectively. - - Parameters: - - y_true: True target values (numpy array). - - y_pred: Predicted values (numpy array). - - Returns: - - Weighted binary cross entropy loss (float). 
- """ - if (y_true is not None) and (y_pred is not None): - assert w_pos != 0.0, f"Weight w_pos = {w_pos}" - assert w_neg != 0.0, f"Weight w_neg = {w_neg}" - if type(y_true) == list: - y_true = np.asarray(y_true) - if type(y_pred) == list: - y_pred = np.asarray(y_pred) - assert y_true.shape == y_pred.shape, f"Shape of y_true ({y_true.shape}) does not match y_pred ({y_pred.shape})" - # calculate the binary cross-entropy loss - loss = -(w_pos * y_true * np.log(y_pred) + w_neg * (1 - y_true) * np.log(1 - y_pred)).mean() - return loss - else: - return None - - -def categorical_cross_entropy_loss(y_true: np.ndarray | list, y_pred: np.ndarray | list) -> float: - """ - Calculate the categorical cross entropy loss between true and predicted values. - It measures the difference between the predicted probability distribution and the actual one-hot encoded label distribution. - The formula for categorical cross-entropy loss is as follows: - - L(y, ลท) = -1/N * ฮฃ[ฮฃ{y * log(ลท)}] - - where y is the true one-hot encoded label vector, ลท is the predicted probability distribution, and log is the natural logarithm. - - Parameters: - - y_true: True target values (numpy array) (one-hot encoded). - - y_pred: Predicted values (numpy array) (probabilities). - - Returns: - - Categorical cross entropy loss (float). - """ - if (y_true is not None) and (y_pred is not None): - if type(y_true) == list: - y_true = np.asarray(y_true) - if type(y_pred) == list: - y_pred = np.asarray(y_pred) - assert y_pred.ndim == 2, f"Shape of y_pred should be (N, C), got {y_pred.shape}" - assert y_true.shape == y_pred.shape, f"Shape of y_true ({y_true.shape}) does not match y_pred ({y_pred.shape})" - - # Ensure numerical stability - y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15) - - # calculate the categorical cross-entropy loss - loss = -1/len(y_true) * np.sum(np.sum(y_true * np.log(y_pred))) - return loss.mean() - else: - return None - -def sparse_categorical_cross_entropy_loss(y_true: np.ndarray | list, y_pred: np.ndarray | list) -> float: - """ - Calculate the sparse categorical cross entropy loss between true and predicted values. - It measures the difference between the predicted probability distribution and the actual class indices. - The formula for sparse categorical cross-entropy loss is as follows: - - L(y, ลท) = -ฮฃ[log(ลท[range(N), y])] - - where y is the true class indices, ลท is the predicted probability distribution, and log is the natural logarithm. - - Parameters: - - y_true: True target values (numpy array) (class indices). - - y_pred: Predicted values (numpy array) (probabilities). - - Returns: - - Sparse categorical cross entropy loss (float). 
- """ - if (y_true is not None) and (y_pred is not None): - if type(y_true) == list: - y_true = np.asarray(y_true) - if type(y_pred) == list: - y_pred = np.asarray(y_pred) - assert y_true.shape[0] == y_pred.shape[0], f"Batch size of y_true ({y_true.shape[0]}) does not match y_pred ({y_pred.shape[0]})" - - # convert true labels to one-hot encoding - y_true_onehot = np.zeros_like(y_pred) - y_true_onehot[np.arange(len(y_true)), y_true] = 1 - - # Ensure numerical stability - y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15) - - # calculate loss - loss = -np.mean(np.sum(y_true_onehot * np.log(y_pred), axis=-1)) - return loss - else: - return None - - -if __name__ == "__main__": - # define true labels and predicted probabilities - y_true = np.array([0, 1, 1, 0]) - y_pred = np.array([0.1, 0.9, 0.8, 0.3]) - - print("\nTesting Binary Cross Entropy Loss") - print("Y_True: ", y_true) - print("Y_Pred:", y_pred) - print("Binary Cross Entropy Loss: ", binary_cross_entropy_loss(y_true, y_pred)) - - positive_weight = 0.7 - negative_weight = 0.3 - - print("\nTesting Weighted Binary Cross Entropy Loss") - print("Y_True: ", y_true) - print("Y_Pred:", y_pred) - print("Weighted Binary Cross Entropy Loss: ", weighted_binary_cross_entropy_loss(y_true, y_pred, positive_weight, negative_weight)) - - y_true = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]]) - y_pred = np.array([[0.8, 0.1, 0.1], [0.2, 0.3, 0.5], [0.1, 0.6, 0.3]]) - print("\nTesting Categorical Cross Entropy Loss") - print("Y_True: ", y_true) - print("Y_Pred:", y_pred) - print("Categorical Cross Entropy Loss: ", categorical_cross_entropy_loss(y_true, y_pred)) - - y_true = np.array([1, 2, 0]) - y_pred = np.array([[0.1, 0.8, 0.1], [0.3, 0.2, 0.5], [0.4, 0.3, 0.3]]) - print("\nTesting Sparse Categorical Cross Entropy Loss") - print("Y_True: ", y_true) - print("Y_Pred:", y_pred) - print("Sparse Categorical Cross Entropy Loss: ", sparse_categorical_cross_entropy_loss(y_true, y_pred)) -``` \ No newline at end of file diff --git a/docs/algorithms/statistics/metrics-and-losses/loss-functions/Hinge_Loss.md b/docs/algorithms/statistics/metrics-and-losses/loss-functions/Hinge_Loss.md deleted file mode 100644 index 1dbfbeea..00000000 --- a/docs/algorithms/statistics/metrics-and-losses/loss-functions/Hinge_Loss.md +++ /dev/null @@ -1,39 +0,0 @@ -# Hinge Loss - -```py -import numpy as np - -def hinge_loss(y_true: np.ndarray | list, y_pred: np.ndarray | list)-> float: - """ - Calculates the hinge loss between true and predicted values. 
- - The formula for hinge loss is as follows: - - L(y, ลท) = max(0, 1 - y * ลท) - - """ - if (y_true is not None) and (y_pred is not None): - if type(y_true) == list: - y_true = np.asarray(y_true) - if type(y_pred) == list: - y_pred = np.asarray(y_pred) - assert y_true.shape[0] == y_pred.shape[0], f"Batch size of y_true ({y_true.shape[0]}) does not match y_pred ({y_pred.shape[0]})" - - # replacing 0 values to -1 - y_pred = np.where(y_pred == 0, -1, 1) - y_true = np.where(y_true == 0, -1, 1) - - # Calculate loss - loss = np.maximum(0, 1 - y_true * y_pred).mean() - return loss - -if __name__ == "__main__": - # define true labels and predicted probabilities - actual = np.array([1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1]) - predicted = np.array([0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1]) - - print("\nTesting Hinge Loss") - print("Y_True: ", actual) - print("Y_Pred:", predicted) - print("Hinge Loss: ", hinge_loss(actual, predicted)) -``` diff --git a/docs/algorithms/statistics/metrics-and-losses/loss-functions/Kullback_Leibler_Divergence_Loss.md b/docs/algorithms/statistics/metrics-and-losses/loss-functions/Kullback_Leibler_Divergence_Loss.md deleted file mode 100644 index 90464589..00000000 --- a/docs/algorithms/statistics/metrics-and-losses/loss-functions/Kullback_Leibler_Divergence_Loss.md +++ /dev/null @@ -1,54 +0,0 @@ -# KL Divergence Loss - - Kullback Leibler Divergence Loss - -```py -import numpy as np - -def kl_divergence_loss(y_true: np.ndarray | list, y_pred: np.ndarray | list) -> float: - """ - Calculate the Kullback-Leibler (KL) divergence between two probability distributions. - KL divergence measures how one probability distribution diverges from another reference probability distribution. - - The formula for KL divergence is: - D_KL(P || Q) = ฮฃ P(x) * log(P(x) / Q(x)) - - where P is the true probability distribution and Q is the predicted probability distribution. - - Parameters: - - y_true: True probability distribution (numpy array or list). - - y_pred: Predicted probability distribution (numpy array or list). - - Returns: - - KL divergence loss (float). 
- """ - if (y_true is not None) and (y_pred is not None): - if type(y_true) == list: - y_true = np.asarray(y_true) - if type(y_pred) == list: - y_pred = np.asarray(y_pred) - assert y_true.shape == y_pred.shape, f"Shape of p_true ({y_true.shape}) does not match q_pred ({y_pred.shape})" - - # Ensure numerical stability by clipping the probabilities - y_true = np.clip(y_true, 1e-15, 1) - y_pred = np.clip(y_pred, 1e-15, 1) - - # Normalize the distributions - y_true /= y_true.sum(axis=-1, keepdims=True) - y_pred /= y_pred.sum(axis=-1, keepdims=True) - - # Calculate KL divergence - kl_div = np.sum(y_true * np.log(y_true / y_pred), axis=-1) - return kl_div.mean() - else: - return None - -if __name__ == "__main__": - y_true = np.array([[0.2, 0.5, 0.3], [0.1, 0.7, 0.2]]) # True probability distributions - y_pred = np.array([[0.1, 0.6, 0.3], [0.2, 0.5, 0.3]]) # Predicted probability distributions - - print("\nTesting Kullback Leibler Divergence Loss") - print("Y_True: ", y_true) - print("Y_Pred:", y_pred) - print("Kullback Leibler Divergence Loss: ", kl_divergence_loss(y_true, y_pred)) -``` diff --git a/docs/algorithms/statistics/metrics-and-losses/loss-functions/Ranking_Losses.md b/docs/algorithms/statistics/metrics-and-losses/loss-functions/Ranking_Losses.md deleted file mode 100644 index 2d83f9ba..00000000 --- a/docs/algorithms/statistics/metrics-and-losses/loss-functions/Ranking_Losses.md +++ /dev/null @@ -1,64 +0,0 @@ -# Ranking Losses - -## Pair Wise Ranking Loss - -```py -import tensorflow as tf -from typing import Tuple - -def pairwise_ranking_loss(y_true: tf.Tensor, y_pred: tf.Tensor, margin: float = 1.0) -> tf.Tensor: - """ - Computes the pairwise ranking loss for a batch of pairs. - - Args: - y_true: Tensor of true labels (0 for negative pairs, 1 for positive pairs). - y_pred: Tensor of predicted similarities/distances, expected to be a tensor of shape (batch_size, 2, embedding_dim) where - y_pred[:, 0] is the anchor and y_pred[:, 1] is the positive/negative. - margin: Margin parameter for the pairwise ranking loss. - - Returns: - loss: Computed pairwise ranking loss as a scalar tensor. - """ - anchor, positive_or_negative = y_pred[:, 0], y_pred[:, 1] - - distances = tf.reduce_sum(tf.square(anchor - positive_or_negative), axis=-1) - positive_loss = y_true * distances - negative_loss = (1 - y_true) * tf.maximum(margin - distances, 0.0) - - loss = positive_loss + negative_loss - return tf.reduce_mean(loss) - -# Example usage: -# model.compile(optimizer='adam', loss=pairwise_ranking_loss) -``` - -## Triplet Loss - -```py -import tensorflow as tf -from typing import Tuple - -def triplet_loss_func(y_true: tf.Tensor, y_pred: tf.Tensor, alpha: float = 0.3) -> tf.Tensor: - """ - Computes the triplet loss for a batch of triplets. - - Args: - y_true: True values of classification (unused in this implementation, typically required for compatibility with Keras). - y_pred: Predicted values, expected to be a tensor of shape (batch_size, 3, embedding_dim) where - y_pred[:, 0] is the anchor, y_pred[:, 1] is the positive, and y_pred[:, 2] is the negative. - alpha: Margin parameter for the triplet loss. - - Returns: - loss: Computed triplet loss as a scalar tensor. 
- """ - anchor, positive, negative = y_pred[:, 0], y_pred[:, 1], y_pred[:, 2] - - positive_dist = tf.reduce_sum(tf.square(anchor - positive), axis=-1) - negative_dist = tf.reduce_sum(tf.square(anchor - negative), axis=-1) - - loss = tf.maximum(positive_dist - negative_dist + alpha, 0.0) - return tf.reduce_mean(loss) - -# Example usage: -# model.compile(optimizer='adam', loss=triplet_loss_func) -``` diff --git a/docs/algorithms/statistics/probability/index.md b/docs/algorithms/statistics/probability/index.md deleted file mode 100644 index 7057ad69..00000000 --- a/docs/algorithms/statistics/probability/index.md +++ /dev/null @@ -1,11 +0,0 @@ -# Probability ๐Ÿ“ƒ - -
-
- -
-

No Items Found

-

- There are no items available at this time. Check back again later. -

-
diff --git a/docs/customs/extra.css b/docs/customs/extra.css deleted file mode 100644 index 27c5c0ce..00000000 --- a/docs/customs/extra.css +++ /dev/null @@ -1,49 +0,0 @@ -body { - background-image: url('https://github.com/user-attachments/assets/70330a16-a16d-4228-b389-b1af64c03972'); -} - -.md-header, -.md-footer { - background: transparent; -} - -.md-header { - backdrop-filter: blur(10px); - -webkit-backdrop-filter: blur(10px); -} - -/* Light Mode Styling */ -[data-md-color-scheme="default"] { - --text-color: black; - --border-color: #12151e; -} - -[data-md-color-scheme="default"] * { - color: var(--text-color); -} - -[data-md-color-scheme="default"] .md-header { - border-bottom: 1px solid var(--border-color); -} - -[data-md-color-scheme="default"] .md-footer { - border-top: 1px solid var(--border-color); -} - -/* Dark Mode Styling */ -[data-md-color-scheme="slate"] { - --text-color: white; - --border-color: #53535b; -} - -[data-md-color-scheme="slate"] * { - color: var(--text-color); -} - -[data-md-color-scheme="slate"] .md-header { - border-bottom: 1px solid var(--border-color); -} - -[data-md-color-scheme="slate"] .md-footer { - border-top: 1px solid var(--border-color); -} \ No newline at end of file diff --git a/docs/libraries/index.md b/docs/libraries/index.md deleted file mode 100644 index f71f26ed..00000000 --- a/docs/libraries/index.md +++ /dev/null @@ -1,25 +0,0 @@ -# Libraries & Packages ๐Ÿ“š - - diff --git a/docs/libraries/numpy.md b/docs/libraries/numpy.md deleted file mode 100644 index 8e9eb3f4..00000000 --- a/docs/libraries/numpy.md +++ /dev/null @@ -1,2 +0,0 @@ -# Numpy - diff --git a/docs/libraries/pandas.md b/docs/libraries/pandas.md deleted file mode 100644 index da5514e8..00000000 --- a/docs/libraries/pandas.md +++ /dev/null @@ -1,2 +0,0 @@ -# Pandas - From f87a674dd362dbbc37df9d76ccda66aca280a9b2 Mon Sep 17 00:00:00 2001 From: Avdhesh-Varshney <114330097+Avdhesh-Varshney@users.noreply.github.com> Date: Mon, 24 Feb 2025 11:30:44 +0530 Subject: [PATCH 16/19] update: projects --- .../bicep-reps-counting.md} | 0 .../black-and-white-image-colorizer.md} | 0 .../computer-vision/brightness-control.md | 0 .../computer-vision/face-detection.md | 0 docs/computer-vision/index.md | 57 +++++ docs/contribute.md | 193 ++++------------ .../bangladesh-premier-league-analysis.md | 0 .../black-friday-sales-analysis.md | 0 docs/data-visualization/index.md | 19 ++ .../deep-learning/anamoly-detection.md | 8 +- .../brain-tumor-detection-model.md | 0 docs/deep-learning/index.md | 41 ++++ .../music-genre-classification-model.md} | 0 docs/generative-adversarial-networks/index.md | 6 + docs/index.md | 216 ++++++++++++++---- docs/large-language-models/index.md | 6 + .../air-quality-prediction.md | 0 .../cardiovascular-disease-prediction.md | 0 .../machine-learning/crop-recommendation.md | 0 .../health-insurance-cross-sell-prediction.md | 0 .../heart-disease-detection-model.md | 0 docs/machine-learning/index.md | 89 ++++++++ .../machine-learning/poker-hand-prediction.md | 0 .../sleep-quality-prediction.md | 0 .../used-cars-price-prediction.md | 0 .../chatbot-implementation.md} | 0 .../email-spam-detection.md} | 0 docs/natural-language-processing/index.md | 77 +++++++ .../name-entity-recognition.md} | 3 +- .../next-word-pred.md | 0 .../text-summarization.md} | 0 .../twitter-sentiment-analysis.md} | 0 .../projects/artificial-intelligence/index.md | 11 - docs/projects/computer-vision/index.md | 34 --- docs/projects/deep-learning/index.md | 25 -- .../generative-adversarial-networks/index.md | 11 - 
docs/projects/index.md | 93 -------- docs/projects/large-language-models/index.md | 11 - docs/projects/machine-learning/index.md | 17 -- .../natural-language-processing/index.md | 15 -- docs/projects/statistics/index.md | 17 -- mkdocs.yml | 22 +- 42 files changed, 529 insertions(+), 442 deletions(-) rename docs/{projects/computer-vision/counting-bicep-reps.md => computer-vision/bicep-reps-counting.md} (100%) rename docs/{projects/computer-vision/black_and_white_image_colorizer.md => computer-vision/black-and-white-image-colorizer.md} (100%) rename docs/{projects => }/computer-vision/brightness-control.md (100%) rename docs/{projects => }/computer-vision/face-detection.md (100%) create mode 100644 docs/computer-vision/index.md rename docs/{projects/statistics => data-visualization}/bangladesh-premier-league-analysis.md (100%) rename docs/{projects/statistics => data-visualization}/black-friday-sales-analysis.md (100%) create mode 100644 docs/data-visualization/index.md rename docs/{projects => }/deep-learning/anamoly-detection.md (95%) rename docs/{projects => }/deep-learning/brain-tumor-detection-model.md (100%) create mode 100644 docs/deep-learning/index.md rename docs/{projects/computer-vision/music_genre_classification_model.md => deep-learning/music-genre-classification-model.md} (100%) create mode 100644 docs/generative-adversarial-networks/index.md create mode 100644 docs/large-language-models/index.md rename docs/{projects => }/machine-learning/air-quality-prediction.md (100%) rename docs/{projects => }/machine-learning/cardiovascular-disease-prediction.md (100%) rename docs/{projects => }/machine-learning/crop-recommendation.md (100%) rename docs/{projects => }/machine-learning/health-insurance-cross-sell-prediction.md (100%) rename docs/{projects => }/machine-learning/heart-disease-detection-model.md (100%) create mode 100644 docs/machine-learning/index.md rename docs/{projects => }/machine-learning/poker-hand-prediction.md (100%) rename docs/{projects => }/machine-learning/sleep-quality-prediction.md (100%) rename docs/{projects => }/machine-learning/used-cars-price-prediction.md (100%) rename docs/{projects/natural-language-processing/chatbot-project-implementation.md => natural-language-processing/chatbot-implementation.md} (100%) rename docs/{projects/natural-language-processing/email_spam_detection.md => natural-language-processing/email-spam-detection.md} (100%) create mode 100644 docs/natural-language-processing/index.md rename docs/{projects/natural-language-processing/name_entity_recognition.md => natural-language-processing/name-entity-recognition.md} (95%) rename docs/{projects => }/natural-language-processing/next-word-pred.md (100%) rename docs/{projects/natural-language-processing/text_summarization.md => natural-language-processing/text-summarization.md} (100%) rename docs/{projects/natural-language-processing/twitter_sentiment_analysis.md => natural-language-processing/twitter-sentiment-analysis.md} (100%) delete mode 100644 docs/projects/artificial-intelligence/index.md delete mode 100644 docs/projects/computer-vision/index.md delete mode 100644 docs/projects/deep-learning/index.md delete mode 100644 docs/projects/generative-adversarial-networks/index.md delete mode 100644 docs/projects/index.md delete mode 100644 docs/projects/large-language-models/index.md delete mode 100644 docs/projects/machine-learning/index.md delete mode 100644 docs/projects/natural-language-processing/index.md delete mode 100644 docs/projects/statistics/index.md diff --git 
a/docs/projects/computer-vision/counting-bicep-reps.md b/docs/computer-vision/bicep-reps-counting.md similarity index 100% rename from docs/projects/computer-vision/counting-bicep-reps.md rename to docs/computer-vision/bicep-reps-counting.md diff --git a/docs/projects/computer-vision/black_and_white_image_colorizer.md b/docs/computer-vision/black-and-white-image-colorizer.md similarity index 100% rename from docs/projects/computer-vision/black_and_white_image_colorizer.md rename to docs/computer-vision/black-and-white-image-colorizer.md diff --git a/docs/projects/computer-vision/brightness-control.md b/docs/computer-vision/brightness-control.md similarity index 100% rename from docs/projects/computer-vision/brightness-control.md rename to docs/computer-vision/brightness-control.md diff --git a/docs/projects/computer-vision/face-detection.md b/docs/computer-vision/face-detection.md similarity index 100% rename from docs/projects/computer-vision/face-detection.md rename to docs/computer-vision/face-detection.md diff --git a/docs/computer-vision/index.md b/docs/computer-vision/index.md new file mode 100644 index 00000000..da720189 --- /dev/null +++ b/docs/computer-vision/index.md @@ -0,0 +1,57 @@ +# ๐ŸŽฅ Computer Vision + + diff --git a/docs/contribute.md b/docs/contribute.md index 611116e4..190b0c42 100644 --- a/docs/contribute.md +++ b/docs/contribute.md @@ -1,165 +1,52 @@ -# ๐Ÿ“ How to Contribute? +# ๐Ÿ“ Contribute to AI-Code ๐Ÿš€ -Welcome to the **AI-Code** project! Whether you're a seasoned developer or just starting, this guide will help you contribute systematically and effectively. Let's build amazing AI projects together! ๐Ÿš€ - ---- +Welcome to **AI-Code**! Whether you're an expert or a beginner, your contributions matter. Let's build AI projects together! ## Getting Started -### ๐ŸŒŸ Star This Repository - -Show your support by starring the project! ๐ŸŒŸ This helps others discover and contribute. Click [here](https://github.com/Avdhesh-Varshney/AI-Code) to star. - -### ๐Ÿด Fork the Repository - -Create a personal copy of the repository by clicking the **Fork** button at the top right corner of the GitHub page. - -### ๐Ÿ“ฅ Clone Your Forked Repository - -Clone your forked repository to your local machine using: - -```bash -git clone https://github.com//AI-Code.git -``` - -### ๐Ÿ“‚ Navigate to the Project Directory - -Move into the directory where you've cloned the project: - -```bash -cd AI-Code -``` - -### ๐ŸŒฑ Create a New Branch - -Create a separate branch for your changes to keep the `main` branch clean: - -```bash -git checkout -b -``` - ---- - -### ๐Ÿ› ๏ธ Set Up the Development Environment - -#### 1. Create a Virtual Environment - -To isolate dependencies, create a virtual environment: - -```bash -python -m venv myenv -``` - -#### 2. Activate the Virtual Environment - -- **Windows:** - ```bash - myenv\Scripts\activate - ``` -- **macOS/Linux:** - ```bash - source myenv/bin/activate - ``` - -#### 3. Install Required Dependencies - -Install all dependencies listed in the `requirements.txt` file: - -```bash -pip install -r requirements.txt -``` - -#### 4. Preview Locally - -Use MkDocs to start the development server and preview the project: - -```bash -mkdocs serve -``` - -Access the site locally at: - -``` -http://127.0.0.1:8000/AI-Code/ -``` - ---- +1. **Star & Fork:** [Star](https://github.com/Avdhesh-Varshney/AI-Code) โญ & fork the repo. +2. **Clone:** + ```bash + git clone https://github.com//AI-Code.git && cd AI-Code + ``` +3. 
**Create Branch:** + ```bash + git checkout -b + ``` +4. **Set Up Environment:** + ```bash + python -m venv env && source env/bin/activate # (Windows: env\Scripts\activate) + pip install -r requirements.txt + ``` +5. **Preview Locally:** + ```bash + mkdocs serve # Visit http://127.0.0.1:8000/AI-Code/ + ``` ## Making Contributions -### โœ๏ธ Make Changes - -Make your desired code edits, add features, or improve documentation. Follow the project's coding standards and contribution guidelines for consistency. - -### ๐Ÿ’พ Stage and Commit Changes - -#### 1. Stage All Changes: - -```bash -git add . -``` - -#### 2. Commit Changes with a Descriptive Message: - -```bash -git commit -m "" -``` - -### ๐Ÿš€ Push Your Changes - -Push your branch to your forked repository: - -```bash -git push -u origin -``` - -### ๐Ÿ“ Create a Pull Request - -1. Navigate to your forked repository on GitHub. -2. Click Pull Requests, then New Pull Request. -3. Select your branch and describe your changes clearly before submitting. - ---- +1. **Edit Code:** Follow project standards. +2. **Stage & Commit:** + ```bash + git add . && git commit -m "" + ``` +3. **Push Changes:** + ```bash + git push -u origin + ``` +4. **Create a Pull Request (PR):** + - Go to GitHub โ†’ Open a PR โ†’ Provide clear details. ## Contribution Guidelines -### ๐Ÿ“‚ File Naming Conventions - -- Use `kebab-case` for file names (e.g., `ai-code-example`). - -### ๐Ÿ“š Documentation Standards - -- Follow the [PROJECT README TEMPLATE](./project-readme-template.md) and [ALGORITHM README TEMPLATE](./algorithm-readme-template.md). -- Use raw URLs for images and videos rather than direct uploads. - -### ๐Ÿ’ป Commit Best Practices - -- Keep commits concise and descriptive. -- Group related changes into a single commit. - -### ๐Ÿ”€ Pull Request Guidelines - -- Do not commit directly to the `main` branch. -- Use the PR Template and provide all requested details. -- Include screenshots, video demonstrations, or work samples for UI/UX changes. - -### ๐Ÿง‘โ€๐Ÿ’ป Code Quality Standards - -- Write clean, maintainable, and well-commented code. -- Ensure originality and adherence to project standards. - ---- - -## ๐Ÿ“˜ Learning Resources - -### ๐Ÿง‘โ€๐Ÿ’ป Git & GitHub Basics - -- [Forking a Repository](https://help.github.com/en/github/getting-started-with-github/fork-a-repo) -- [Cloning a Repository](https://help.github.com/en/desktop/contributing-to-projects/creating-an-issue-or-pull-request) -- [Creating a Pull Request](https://opensource.com/article/19/7/create-pull-request-github) -- [GitHub Learning Lab](https://lab.github.com/githubtraining/introduction-to-github) +- **File Naming:** Use `kebab-case` (e.g., `ai-model.py`). +- **Docs:** Follow [README Template](./project-readme-template.md). +- **Commits:** Keep them concise & meaningful. +- **PRs:** No direct commits to `main`, use PR templates, and include screenshots if relevant. +- **Code Quality:** Clean, maintainable & well-commented. 
-### ๐Ÿ’ป General Programming +## Resources -- [Learn Python](https://www.learnpython.org/) -- [MkDocs Documentation](https://www.mkdocs.org/) +- **Git & GitHub:** [Fork](https://help.github.com/en/github/getting-started-with-github/fork-a-repo), [Clone](https://help.github.com/en/desktop/contributing-to-projects/creating-an-issue-or-pull-request), [PR Guide](https://opensource.com/article/19/7/create-pull-request-github) +- **Learn Python:** [LearnPython.org](https://www.learnpython.org/) +- **MkDocs:** [Documentation](https://www.mkdocs.org/) diff --git a/docs/projects/statistics/bangladesh-premier-league-analysis.md b/docs/data-visualization/bangladesh-premier-league-analysis.md similarity index 100% rename from docs/projects/statistics/bangladesh-premier-league-analysis.md rename to docs/data-visualization/bangladesh-premier-league-analysis.md diff --git a/docs/projects/statistics/black-friday-sales-analysis.md b/docs/data-visualization/black-friday-sales-analysis.md similarity index 100% rename from docs/projects/statistics/black-friday-sales-analysis.md rename to docs/data-visualization/black-friday-sales-analysis.md diff --git a/docs/data-visualization/index.md b/docs/data-visualization/index.md new file mode 100644 index 00000000..959eae7a --- /dev/null +++ b/docs/data-visualization/index.md @@ -0,0 +1,19 @@ +# ๐Ÿ“Š Data Visualization + + diff --git a/docs/projects/deep-learning/anamoly-detection.md b/docs/deep-learning/anamoly-detection.md similarity index 95% rename from docs/projects/deep-learning/anamoly-detection.md rename to docs/deep-learning/anamoly-detection.md index 5615d8b4..a98d2ff0 100644 --- a/docs/projects/deep-learning/anamoly-detection.md +++ b/docs/deep-learning/anamoly-detection.md @@ -15,8 +15,14 @@ To detect anomalies in time-series data using Long Short-Term Memory (LSTM) netw ??? Abstract "Kaggle Notebook" + - ## โš™๏ธ TECH STACK diff --git a/docs/projects/deep-learning/brain-tumor-detection-model.md b/docs/deep-learning/brain-tumor-detection-model.md similarity index 100% rename from docs/projects/deep-learning/brain-tumor-detection-model.md rename to docs/deep-learning/brain-tumor-detection-model.md diff --git a/docs/deep-learning/index.md b/docs/deep-learning/index.md new file mode 100644 index 00000000..2823e1d6 --- /dev/null +++ b/docs/deep-learning/index.md @@ -0,0 +1,41 @@ +# Deep Learning โœจ + + diff --git a/docs/projects/computer-vision/music_genre_classification_model.md b/docs/deep-learning/music-genre-classification-model.md similarity index 100% rename from docs/projects/computer-vision/music_genre_classification_model.md rename to docs/deep-learning/music-genre-classification-model.md diff --git a/docs/generative-adversarial-networks/index.md b/docs/generative-adversarial-networks/index.md new file mode 100644 index 00000000..556dd7a3 --- /dev/null +++ b/docs/generative-adversarial-networks/index.md @@ -0,0 +1,6 @@ +# Generative Adversarial Networks ๐Ÿ’ฑ + +
+ + +
diff --git a/docs/index.md b/docs/index.md index beffbd21..f862ecc5 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,51 +1,183 @@ # Welcome to AI Code ๐Ÿ‘‹ -
-

- - -

- - - - - - -
- - - - -

+
+
+

AI Code

+

๐Ÿš€ Your Gateway to Artificial Intelligence & Machine Learning ๐Ÿค–

+
-

- - - - - - - -

+ +
+ + Buy Me A Coffee + +
-
+ +
+
+
+ Stars +

Stars

+
+
+ Forks +

Forks

+
+
+ Contributors +

Contributors

+
+
+ Last Commit +

Last Update

+
+
+
+ +
+
+ AI + Deep Learning + ML + NLP + CV +
+
+
-AI-Code is an open-source project designed to help individuals learn and understand foundational code implementations of various AI algorithms, providing structured guides, resources, and hands-on projects across multiple AI domains like ML, DL, NLP, and GANs. - -### ๐ŸŒŸ **Overview** -**AI-Code** simplifies learning AI technologies with **easy-to-follow** code and **real-world project** guides for ML, DL, GAN, NLP, OpenCV, and more. + +
+

๐ŸŒŸ About AI Code

+

+ AI-Code is an open-source initiative designed to democratize AI education through practical, hands-on learning. We provide structured implementations of various AI algorithms, from basic machine learning to advanced deep learning techniques, making complex concepts accessible to everyone. +

+ +
+

โœจ Why Choose AI Code?

+
    +
  • + ๐ŸŽฏ + Learn by Doing: Practical, hands-on projects across multiple AI domains +
  • +
  • + ๐Ÿ“š + Comprehensive Resources: Detailed guides, tutorials, and documentation +
  • +
  • + ๐Ÿค + Community Driven: Active community support and contributions +
  • +
  • + ๐Ÿš€ + Industry Ready: Projects aligned with current industry practices +
  • +
+
+
-### ๐Ÿ”‘ **Core Features** -- Scratch-level implementations of **AI algorithms** ๐Ÿง  -- **Guides**, datasets, research papers, and **step-by-step tutorials** ๐Ÿ“˜ -- Clear directories with focused **README** files ๐Ÿ“‚ -- Fast learning with minimal complexity ๐Ÿš€ + +
+

๐Ÿ› ๏ธ Tech Stack

+
+
+ Python +

Python 3.9+

+
+
+ MkDocs +

Documentation

+
+
+ Git +

Version Control

+
+
+ VS Code +

IDE

+
+
+
-### ๐Ÿ› ๏ธ **Tech Stack** + + -- Python 3.9+ -- Mk Docs (A Python Package) -- Markdown -- Git/GitHub -- VS Code + +
+

๐ŸŒŸ Ready to Start Your AI Journey?

+

Join our community and start building amazing AI projects today!

+ +
diff --git a/docs/large-language-models/index.md b/docs/large-language-models/index.md new file mode 100644 index 00000000..a1edb1e2 --- /dev/null +++ b/docs/large-language-models/index.md @@ -0,0 +1,6 @@ +# Large Language Models ๐Ÿคช + +
+ + +
diff --git a/docs/projects/machine-learning/air-quality-prediction.md b/docs/machine-learning/air-quality-prediction.md similarity index 100% rename from docs/projects/machine-learning/air-quality-prediction.md rename to docs/machine-learning/air-quality-prediction.md diff --git a/docs/projects/machine-learning/cardiovascular-disease-prediction.md b/docs/machine-learning/cardiovascular-disease-prediction.md similarity index 100% rename from docs/projects/machine-learning/cardiovascular-disease-prediction.md rename to docs/machine-learning/cardiovascular-disease-prediction.md diff --git a/docs/projects/machine-learning/crop-recommendation.md b/docs/machine-learning/crop-recommendation.md similarity index 100% rename from docs/projects/machine-learning/crop-recommendation.md rename to docs/machine-learning/crop-recommendation.md diff --git a/docs/projects/machine-learning/health-insurance-cross-sell-prediction.md b/docs/machine-learning/health-insurance-cross-sell-prediction.md similarity index 100% rename from docs/projects/machine-learning/health-insurance-cross-sell-prediction.md rename to docs/machine-learning/health-insurance-cross-sell-prediction.md diff --git a/docs/projects/machine-learning/heart-disease-detection-model.md b/docs/machine-learning/heart-disease-detection-model.md similarity index 100% rename from docs/projects/machine-learning/heart-disease-detection-model.md rename to docs/machine-learning/heart-disease-detection-model.md diff --git a/docs/machine-learning/index.md b/docs/machine-learning/index.md new file mode 100644 index 00000000..d35d481e --- /dev/null +++ b/docs/machine-learning/index.md @@ -0,0 +1,89 @@ +# Machine Learning ๐Ÿค– + +
+ + +
+ + Air Quality Prediction +
+

Air Quality Prediction

+

Predicting Air Quality with Precision, One Sensor at a Time!

+

๐Ÿ“… 2025-01-26 | โฑ๏ธ 9 mins

+
+
+
+ + +
+ + Poker Hand Prediction +
+

Poker Hand Prediction

+

Predicting Poker Hands Using Machine Learning

+

๐Ÿ“… 2025-01-26 | โฑ๏ธ 7 mins

+
+
+
+ + +
+ + Heart Disease Detection +
+

Heart Disease Detection

+

Early Detection of Heart Disease Using ML

+

๐Ÿ“… 2025-01-26 | โฑ๏ธ 8 mins

+
+
+
+ + +
+ + Used Cars Price Prediction +
+

Used Cars Price Prediction

+

Accurate Price Predictions for Used Vehicles

+

๐Ÿ“… 2025-01-26 | โฑ๏ธ 6 mins

+
+
+
+ + +
+ + Sleep Quality Prediction +
+

Sleep Quality Prediction

+

Predicting Sleep Quality Based on Lifestyle

+

๐Ÿ“… 2025-01-26 | โฑ๏ธ 5 mins

+
+
+
+ + +
+ + Health Insurance Cross-Sell +
+

Insurance Cross-Sell Prediction

+

Predicting Vehicle Insurance Cross-Sell Opportunities

+

๐Ÿ“… 2025-01-26 | โฑ๏ธ 7 mins

+
+
+
+ + +
+ + Cardiovascular Disease Prediction +
+

Cardiovascular Disease Prediction

+

Predicting Cardiovascular Disease Risk

+

๐Ÿ“… 2025-01-26 | โฑ๏ธ 8 mins

+
+
+
+ +
diff --git a/docs/projects/machine-learning/poker-hand-prediction.md b/docs/machine-learning/poker-hand-prediction.md similarity index 100% rename from docs/projects/machine-learning/poker-hand-prediction.md rename to docs/machine-learning/poker-hand-prediction.md diff --git a/docs/projects/machine-learning/sleep-quality-prediction.md b/docs/machine-learning/sleep-quality-prediction.md similarity index 100% rename from docs/projects/machine-learning/sleep-quality-prediction.md rename to docs/machine-learning/sleep-quality-prediction.md diff --git a/docs/projects/machine-learning/used-cars-price-prediction.md b/docs/machine-learning/used-cars-price-prediction.md similarity index 100% rename from docs/projects/machine-learning/used-cars-price-prediction.md rename to docs/machine-learning/used-cars-price-prediction.md diff --git a/docs/projects/natural-language-processing/chatbot-project-implementation.md b/docs/natural-language-processing/chatbot-implementation.md similarity index 100% rename from docs/projects/natural-language-processing/chatbot-project-implementation.md rename to docs/natural-language-processing/chatbot-implementation.md diff --git a/docs/projects/natural-language-processing/email_spam_detection.md b/docs/natural-language-processing/email-spam-detection.md similarity index 100% rename from docs/projects/natural-language-processing/email_spam_detection.md rename to docs/natural-language-processing/email-spam-detection.md diff --git a/docs/natural-language-processing/index.md b/docs/natural-language-processing/index.md new file mode 100644 index 00000000..d8f53239 --- /dev/null +++ b/docs/natural-language-processing/index.md @@ -0,0 +1,77 @@ +# Natural Language Processing ๐Ÿ—ฃ๏ธ + + diff --git a/docs/projects/natural-language-processing/name_entity_recognition.md b/docs/natural-language-processing/name-entity-recognition.md similarity index 95% rename from docs/projects/natural-language-processing/name_entity_recognition.md rename to docs/natural-language-processing/name-entity-recognition.md index a9c24fa2..05a046ab 100644 --- a/docs/projects/natural-language-processing/name_entity_recognition.md +++ b/docs/natural-language-processing/name-entity-recognition.md @@ -9,8 +9,7 @@ N/A (This project uses text input for NER analysis, not a specific dataset) - It uses real time data as input . 
## NOTEBOOK LINK -[Note book link ] -(https://colab.research.google.com/drive/1pBIEFA4a9LzyZKUFQMCypQ22M6bDbXM3?usp=sharing) +[https://colab.research.google.com/drive/1pBIEFA4a9LzyZKUFQMCypQ22M6bDbXM3?usp=sharing](https://colab.research.google.com/drive/1pBIEFA4a9LzyZKUFQMCypQ22M6bDbXM3?usp=sharing) ## LIBRARIES NEEDED - SpaCy diff --git a/docs/projects/natural-language-processing/next-word-pred.md b/docs/natural-language-processing/next-word-pred.md similarity index 100% rename from docs/projects/natural-language-processing/next-word-pred.md rename to docs/natural-language-processing/next-word-pred.md diff --git a/docs/projects/natural-language-processing/text_summarization.md b/docs/natural-language-processing/text-summarization.md similarity index 100% rename from docs/projects/natural-language-processing/text_summarization.md rename to docs/natural-language-processing/text-summarization.md diff --git a/docs/projects/natural-language-processing/twitter_sentiment_analysis.md b/docs/natural-language-processing/twitter-sentiment-analysis.md similarity index 100% rename from docs/projects/natural-language-processing/twitter_sentiment_analysis.md rename to docs/natural-language-processing/twitter-sentiment-analysis.md diff --git a/docs/projects/artificial-intelligence/index.md b/docs/projects/artificial-intelligence/index.md deleted file mode 100644 index 76689ad4..00000000 --- a/docs/projects/artificial-intelligence/index.md +++ /dev/null @@ -1,11 +0,0 @@ -# Artificial Intelligence ๐Ÿ’ก - -
-
- -
-

No Items Found

-

- There are no items available at this time. Check back again later. -

-
diff --git a/docs/projects/computer-vision/index.md b/docs/projects/computer-vision/index.md deleted file mode 100644 index ad7ca5eb..00000000 --- a/docs/projects/computer-vision/index.md +++ /dev/null @@ -1,34 +0,0 @@ -# Computer Vision ๐ŸŽฅ - - diff --git a/docs/projects/deep-learning/index.md b/docs/projects/deep-learning/index.md deleted file mode 100644 index 068507f4..00000000 --- a/docs/projects/deep-learning/index.md +++ /dev/null @@ -1,25 +0,0 @@ -# Deep Learning โœจ - - diff --git a/docs/projects/generative-adversarial-networks/index.md b/docs/projects/generative-adversarial-networks/index.md deleted file mode 100644 index b70215a1..00000000 --- a/docs/projects/generative-adversarial-networks/index.md +++ /dev/null @@ -1,11 +0,0 @@ -# Generative Adversarial Networks ๐Ÿ’ฑ - -
-
- -
-

No Items Found

-

- There are no items available at this time. Check back again later. -

-
diff --git a/docs/projects/index.md b/docs/projects/index.md deleted file mode 100644 index 460b5048..00000000 --- a/docs/projects/index.md +++ /dev/null @@ -1,93 +0,0 @@ -# Projects ๐ŸŽ‰ - -
- - -
- - -
-

Statistics

-

Understanding data through statistical analysis and inference methods.

-
-
-
- - -
- - -
-

Machine Learning

-

Dive into the world of algorithms and models in Machine Learning.

-
-
-
- - -
- - -
-

Deep Learning

-

Explore the fascinating world of deep learning.

-
-
-
- - -
- - -
-

Computer Vision

-

Learn computer vision with OpenCV for real-time image processing applications.

-
-
-
- - -
- - -
-

Natural Language Processing

-

Dive into how machines understand and generate human language.

-
-
-
- - -
- - -
-

Generative Adversarial Networks

-

Learn about the power of Generative Adversarial Networks for creative AI solutions.

-
-
-
- - -
- - -
-

Large Language Models

-

Explore the cutting-edge techniques behind large language models like GPT and BERT.

-
-
-
- - -
- - -
-

Artificial Intelligence

-

Explore the fundamentals and advanced concepts of Artificial Intelligence.

-
-
-
- -
diff --git a/docs/projects/large-language-models/index.md b/docs/projects/large-language-models/index.md deleted file mode 100644 index 6848420b..00000000 --- a/docs/projects/large-language-models/index.md +++ /dev/null @@ -1,11 +0,0 @@ -# Large Language Models ๐Ÿคช - -
-
- -
-

No Items Found

-

- There are no items available at this time. Check back again later. -

-
diff --git a/docs/projects/machine-learning/index.md b/docs/projects/machine-learning/index.md deleted file mode 100644 index d1da78e4..00000000 --- a/docs/projects/machine-learning/index.md +++ /dev/null @@ -1,17 +0,0 @@ -# Machine Learning ๐Ÿค– - - diff --git a/docs/projects/natural-language-processing/index.md b/docs/projects/natural-language-processing/index.md deleted file mode 100644 index b64b4bf8..00000000 --- a/docs/projects/natural-language-processing/index.md +++ /dev/null @@ -1,15 +0,0 @@ -# Natural Language Processing ๐Ÿ—ฃ๏ธ - - diff --git a/docs/projects/statistics/index.md b/docs/projects/statistics/index.md deleted file mode 100644 index d6f3e7aa..00000000 --- a/docs/projects/statistics/index.md +++ /dev/null @@ -1,17 +0,0 @@ -# Statistics ๐Ÿ“ƒ - - diff --git a/mkdocs.yml b/mkdocs.yml index 655a159a..1be6de36 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -2,13 +2,18 @@ site_name: AI Code site_url: https://avdhesh-varshney.github.io/AI-Code/ nav: - - ๐Ÿ  Home: index.md - - ๐Ÿ”ท Algorithms: algorithms/index.md - - ๐ŸŽ‰ Projects: projects/index.md - - ๐Ÿ“š Libraries/Packages: libraries/index.md - - ๐Ÿ“ Contribute: contribute.md - - ๐Ÿงฎ Algorithm Template: algorithm-readme-template.md - - ๐Ÿ“œ Project Template: project-readme-template.md + - Overview: index.md + - Projects: + - ๐Ÿ“Š Data Insights: data-visualization/index.md + - ๐Ÿ“ˆ ML Models: machine-learning/index.md + - ๐Ÿง  Neural Networks: deep-learning/index.md + - ๐Ÿ“ท Vision Systems: computer-vision/index.md + - ๐Ÿ—ฃ๏ธ NLP Tasks: natural-language-processing/index.md + - ๐ŸŒ€ GANs: generative-adversarial-networks/index.md + - ๐Ÿ“š LLMs: large-language-models/index.md + - Get Involved: + - โœ๏ธ How to Contribute: contribute.md + - ๐Ÿ“„ Template Guide: project-readme-template.md theme: name: material @@ -70,9 +75,6 @@ extra: - icon: simple/discord link: https://discord.gg/tSqtvHUJzE -extra_css: - - customs/extra.css - extra_javascript: - https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/MathJax.js?config=TeX-MML-AM_CHTML From 7ff3b5806cb31309542b78ffa1a43788307faf44 Mon Sep 17 00:00:00 2001 From: Kashish Khurana <113686328+Kashishkh@users.noreply.github.com> Date: Mon, 24 Feb 2025 11:34:26 +0530 Subject: [PATCH 17/19] crop-recommendation (#185) * crop-recommendation * changes done * changes completed --- .../machine-learning/crop-recommendation.md | 279 ++++++++++++++++++ 1 file changed, 279 insertions(+) create mode 100644 docs/projects/machine-learning/crop-recommendation.md diff --git a/docs/projects/machine-learning/crop-recommendation.md b/docs/projects/machine-learning/crop-recommendation.md new file mode 100644 index 00000000..0dce9170 --- /dev/null +++ b/docs/projects/machine-learning/crop-recommendation.md @@ -0,0 +1,279 @@ +# Crop-Recommendation-Model + +
+ +
+ +## ๐ŸŽฏ AIM + +It is an AI-powered Crop Recommendation System that helps farmers and agricultural stakeholders determine the most suitable crops for cultivation based on environmental conditions. The system uses machine learning models integrated with Flask to analyze key parameters and suggest the best crop to grow in a given region. + +## ๐Ÿ“Š DATASET LINK + +[https://www.kaggle.com/datasets/atharvaingle/crop-recommendation-dataset/data](https://www.kaggle.com/datasets/atharvaingle/crop-recommendation-dataset/data) + +## ๐Ÿ““ NOTEBOOK + +[https://www.kaggle.com/code/kashishkhurana1204/crop-recommendation-system](https://www.kaggle.com/code/kashishkhurana1204/crop-recommendation-system) + +??? Abstract "Kaggle Notebook" + + + +## โš™๏ธ TECH STACK + +| **Category** | **Technologies** | +|--------------------------|-----------------------------------------| +| **Languages** | Python | +| **Libraries/Frameworks** | Pandas, Numpy, Matplotlib, Scikit-learn | +| **Tools** | Github, Jupyter, VS Code | + +--- + +## ๐Ÿ“ DESCRIPTION + + +!!! info "What is the requirement of the project?" + - To provide accurate crop recommendations based on environmental conditions. + - To assist farmers in maximizing yield and efficiency. + +??? info "How is it beneficial and used?" + - Helps in optimizing agricultural planning. + - Reduces trial-and-error farming practices. + + +??? info "How did you start approaching this project? (Initial thoughts and planning)" + - Initial thoughts : The goal is to help farmers determine the most suitable crops based on their fieldโ€™s environmental conditions. + + - Dataset Selection : I searched for relevant datasets on Kaggle that include soil properties, weather conditions, and nutrient levels such as nitrogen (N), phosphorus (P), and potassium (K). + + - Initial Data Exploration : I analyzed the dataset structure to understand key attributes like soil pH, humidity, rainfall, and nutrient values, which directly impact crop suitability. + + - Feature Analysis : Studied how different environmental factors influence crop growth and identified the most significant parameters for prediction. + + - Model Selection & Implementation : Researched various ML models and implemented algorithms like Naรฏve Bayes, Decision Trees, and Random Forest to predict the best-suited crops. + +??? info "Mention any additional resources used (blogs, books, chapters, articles, research papers, etc.)." 
+ - [https://www.kaggle.com/datasets/atharvaingle/crop-recommendation-dataset/data](https://www.kaggle.com/datasets/atharvaingle/crop-recommendation-dataset/data) + + +--- + +## ๐Ÿ” EXPLANATION + +### DATASET OVERVIEW & FEATURE DETAILS + +๐Ÿ“‚ dataset.csv +| **Feature**| **Description** | **Data Type** | +|------------|-----------------|----------------| +| Soil_pH | Soil pH level | float | +| Humidity | Humidity level | float | +| Rainfall | Rainfall amount | float | +| N | Nitrogen level | int64 | +| P | Phosphorus level| int64 | +| K | Potassium level | int64 | +|Temperature | Temperature | float | +| crop | Recommended crop| categorical | + + + +### ๐Ÿ›ค PROJECT WORKFLOW + +```mermaid + graph + Start -->|No| End; + Start -->|Yes| Import_Libraries --> Load_Dataset --> Data_Cleaning --> Feature_Selection --> Train_Test_Split --> Define_Models; + Define_Models --> Train_Models --> Evaluate_Models --> Save_Best_Model --> Develop_Flask_API --> Deploy_Application --> Conclusion; + Deploy_Application -->|Error?| Debug --> Yay!; + +``` + + +=== "Import Necessary Libraries" + - First, we import all the essential libraries needed for handling, analyzing, and modeling the dataset. + - This includes libraries like Pandas for data manipulation, Numpy for numerical computations, Matplotlib and Seaborn for data visualization, and Scikit-learn for machine learning models, evaluation, and data preprocessing. + - These libraries will enable us to perform all required tasks efficiently. + +=== "Load Dataset" + - We load the dataset using Pandas `read_csv()` function. The dataset contains crop data, which is loaded with a semicolon delimiter. + - After loading, we inspect the first few rows to understand the structure of the data and ensure that the dataset is correctly loaded. + +=== "Data Cleaning Process" + Data cleaning is a crucial step in any project. In this step: + + - Handle missing values, remove duplicates, and ensure data consistency. + - Convert categorical variables if necessary and normalize numerical values. + +=== "Visualizing Correlations Between Features" + + - Use heatmaps and scatter plots to understand relationships between features and how they impact crop recommendations. + +=== "Data Preparation - Features (X) and Target (y)" + + - Separate independent variables (environmental parameters) and the target variable (recommended crop). + +=== "Split the Data into Training and Test Sets" + + - Use train_test_split() from Scikit-learn to divide data into training and testing sets, ensuring model generalization. + +=== "Define Models" + We define multiple regression models to train and evaluate on the dataset: + + - **RandomForestRegressor**: A robust ensemble method that performs well on non-linear datasets. + - **Naive Bayes**: A probabilistic classifier based on Bayes' theorem, which assumes independence between features and is effective for classification tasks. + - **DecisionTreeRegressor**: A decision tree-based model, capturing non-linear patterns and interactions. + +=== "Train and Evaluate Each Model" + + - Fit models using training data and evaluate performance using accuracy, precision, recall, and F1-score metrics. + +=== "Visualizing Model Evaluation Metrics" + + - Use confusion matrices, precision-recall curves, and ROC curves to assess model performance. + +== "Conclusion and Observations" + + **Best-Performing Models and Insights Gained:** + + - The Random Forest model provided the highest accuracy and robustness in predictions. 
+ + - Decision Tree performed well but was prone to overfitting on training data. + + - Naรฏve Bayes, though simple, showed competitive performance for certain crop categories. + + - Feature importance analysis revealed that soil pH and nitrogen levels had the most significant impact on crop recommendation. + + **Potential Improvements and Future Enhancements:** + + - Implement deep learning models for better feature extraction and prediction accuracy. + + - Expand the dataset by incorporating satellite and real-time sensor data. + + - Integrate weather forecasting models to enhance crop suitability predictions. + + - Develop a mobile-friendly UI for better accessibility to farmers. + +--- + +### ๐Ÿ–ฅ CODE EXPLANATION + +=== "Code to compute F1-score, Precision, and Recall" + + ```py + from sklearn.metrics import precision_score, recall_score, f1_score, classification_report + + # Initialize a dictionary to store model scores + model_scores = {} + + # Iterate through each model and compute evaluation metrics + for name, model in models.items(): + print(f"Evaluating {name}...") + + # Train the model + model.fit(x_train, y_train) + + # Predict on the test set + y_pred = model.predict(x_test) + + # Compute metrics + precision = precision_score(y_test, y_pred, average='weighted') + recall = recall_score(y_test, y_pred, average='weighted') + f1 = f1_score(y_test, y_pred, average='weighted') + + # Store results + model_scores[name] = { + 'Precision': precision, + 'Recall': recall, + 'F1 Score': f1 + } + + # Print results for each model + print(f"Precision: {precision:.4f}") + print(f"Recall: {recall:.4f}") + print(f"F1 Score: {f1:.4f}") + print("\nClassification Report:\n") + print(classification_report(y_test, y_pred)) + print("-" * 50) + + # Print a summary of all model scores + print("\nSummary of Model Performance:\n") + for name, scores in model_scores.items(): + print(f"{name}: Precision={scores['Precision']:.4f}, Recall={scores['Recall']:.4f}, F1 Score={scores['F1 Score']:.4f}") + + ``` + + - This code evaluates multiple machine learning models and displays performance metrics such as Precision, Recall, F1 Score, and a Classification Report for each model. + +--- + +### โš–๏ธ PROJECT TRADE-OFFS AND SOLUTIONS + +=== "Trade Off 1" + - **Trade-off**: Accuracy vs. Computational Efficiency + - **Solution**: Optimized hyperparameters and used efficient algorithms. + +=== "Trade Off 2" + - **Trade-off**: Model interpretability vs complexity. + - **Solution**: Selected models balancing accuracy and interpretability. + +--- + +## ๐Ÿ–ผ SCREENSHOTS + +!!! tip "Visualizations of different features" + + === "HeatMap" + ![img](https://github.com/Kashishkh/FarmSmart/blob/main/Screenshot%202025-02-04%20195349.png) + + === "Model Comparison" + ![model-comparison](https://github.com/Kashishkh/FarmSmart/blob/main/Screenshot%202025-02-05%20011859.png) + + +--- + +## ๐Ÿ“‰ MODELS USED AND THEIR EVALUATION METRICS + +| Model | Accuracy | Precision | Recall |F1-score| +|---------------------------|----------|-----------|--------|--------| +| Naive Bayes | 99.54% | 99.58% | 99.55% | 99.54% | +| Random Forest Regressor | 99.31% | 99.37% | 99.32% | 99.32% | +| Decision Tree Regressor | 98.63% | 98.68% | 98.64% | 98.63% | + +--- + +## โœ… CONCLUSION + +### ๐Ÿ”‘ KEY LEARNINGS + +!!! tip "Insights gained from the data" + - Soil conditions play a crucial role in crop recommendation. + - Environmental factors significantly impact crop yield. + +??? 
tip "Improvements in understanding machine learning concepts" + - Feature engineering and hyperparameter tuning. + - Deployment of ML models in real-world applications. + +--- + +### ๐ŸŒ USE CASES + +=== "Application 1" + **Application of FarmSmart in precision farming.** + + - FarmSmart helps optimize resource allocation, enabling farmers to make data-driven decisions for sustainable and profitable crop production. + [https://github.com/Kashishkh/FarmSmart](https://github.com/Kashishkh/FarmSmart) + +=== "Application 2" + **Use in government agricultural advisory services.** + + - Government agencies can use FarmSmart to provide region-specific crop recommendations, improving food security and agricultural productivity through AI-driven insights. + + + From 264c223ed2401dc0f92ec225fa67f75a0d021945 Mon Sep 17 00:00:00 2001 From: Mohammed Abdul Rahman <130785777+that-ar-guy@users.noreply.github.com> Date: Fri, 28 Feb 2025 10:38:00 +0530 Subject: [PATCH 18/19] Email/spam (#200) * email done * updated * Update index.md * added visualization --- .../email-spam-detection.md | 246 ++++++++---------- docs/natural-language-processing/index.md | 2 +- 2 files changed, 116 insertions(+), 132 deletions(-) diff --git a/docs/natural-language-processing/email-spam-detection.md b/docs/natural-language-processing/email-spam-detection.md index 15bf34b5..23928cb2 100644 --- a/docs/natural-language-processing/email-spam-detection.md +++ b/docs/natural-language-processing/email-spam-detection.md @@ -1,204 +1,188 @@ +# ๐ŸŒŸ Email Spam Detection -# Email Spam Detection +
+ +
-### AIM -To develop a machine learning-based system that classifies email content as spam or ham (not spam). +## ๐ŸŽฏ AIM +To classify emails as spam or ham using machine learning models, ensuring better email filtering and security. -### DATASET LINK -[https://www.kaggle.com/datasets/ashfakyeafi/spam-email-classification](https://www.kaggle.com/datasets/ashfakyeafi/spam-email-classification) +## ๐Ÿ“Š DATASET LINK +[Email Spam Detection Dataset](https://www.kaggle.com/datasets/shantanudhakadd/email-spam-detection-dataset-classification) +## ๐Ÿ“š KAGGLE NOTEBOOK +[Notebook Link](https://www.kaggle.com/code/thatarguy/email-spam-classifier?kernelSessionId=224262023) -### NOTEBOOK LINK -[https://www.kaggle.com/code/inshak9/email-spam-detection](https://www.kaggle.com/code/inshak9/email-spam-detection) +??? Abstract "Kaggle Notebook" + -### LIBRARIES NEEDED +## โš™๏ธ TECH STACK -??? quote "LIBRARIES USED" +| **Category** | **Technologies** | +|--------------------------|---------------------------------------------| +| **Languages** | Python | +| **Libraries/Frameworks** | Scikit-learn, NumPy, Pandas, Matplotlib, Seaborn | +| **Databases** | NOT USED | +| **Tools** | Kaggle, Jupyter Notebook | +| **Deployment** | NOT USED | - - pandas - - numpy - - scikit-learn - - matplotlib - - seaborn - ---- +--- -### DESCRIPTION +## ๐Ÿ“ DESCRIPTION !!! info "What is the requirement of the project?" - - A robust system to detect spam emails is essential to combat increasing spam content. - - It improves user experience by automatically filtering unwanted messages. - -??? info "Why is it necessary?" - - Spam emails consume resources, time, and may pose security risks like phishing. - - Helps organizations and individuals streamline their email communication. + - To efficiently classify emails as spam or ham. + - To improve email security by filtering out spam messages. ??? info "How is it beneficial and used?" - - Provides a quick and automated solution for spam classification. - - Used in email services, IT systems, and anti-spam software to filter messages. + - Helps in reducing unwanted spam emails in user inboxes. + - Enhances productivity by filtering out irrelevant emails. + - Can be integrated into email service providers for automatic filtering. ??? info "How did you start approaching this project? (Initial thoughts and planning)" - - Analyzed the dataset and prepared features. - - Implemented various machine learning models for comparison. + - Collected and preprocessed the dataset. + - Explored various machine learning models. + - Evaluated models based on performance metrics. + - Visualized results for better understanding. ??? info "Mention any additional resources used (blogs, books, chapters, articles, research papers, etc.)." - - Documentation from [scikit-learn](https://scikit-learn.org) - - Blog: Introduction to Spam Classification with ML + - Scikit-learn documentation. + - Various Kaggle notebooks related to spam detection. --- -### EXPLANATION +## ๐Ÿ” PROJECT EXPLANATION + +### ๐Ÿงฉ DATASET OVERVIEW & FEATURE DETAILS + +??? example "๐Ÿ“‚ spam.csv" -#### DETAILS OF THE DIFFERENT FEATURES -The dataset contains features like word frequency, capital letter counts, and others that help in distinguishing spam emails from ham. 
+ - The dataset contains the following features: -| Feature | Description | -|----------------------|-------------------------------------------------| -| `word_freq_x` | Frequency of specific words in the email body | -| `capital_run_length` | Length of consecutive capital letters | -| `char_freq` | Frequency of special characters like `;` and `$` | -| `is_spam` | Target variable (1 = Spam, 0 = Ham) | + | Feature Name | Description | Datatype | + |--------------|-------------|:------------:| + | Category | Spam or Ham | object | + | Text | Email text | object | + | Length | Length of email | int64 | + +??? example "๐Ÿ›  Developed Features from spam.csv" + + | Feature Name | Description | Reason | Datatype | + |--------------|-------------|----------|:------------:| + | Length | Email text length | Helps in spam detection | int64 | --- -#### WHAT I HAVE DONE +### ๐Ÿ›ค PROJECT WORKFLOW -=== "Step 1" +!!! success "Project workflow" + + ``` mermaid + graph LR + A[Start] --> B[Load Dataset] + B --> C[Preprocess Data] + C --> D[Vectorize Text] + D --> E[Train Models] + E --> F[Evaluate Models] + F --> G[Visualize Results] + ``` - Initial data exploration and understanding: - - Loaded the dataset using pandas. - - Explored dataset features and target variable distribution. +=== "Step 1" + - Load the dataset and clean unnecessary columns. === "Step 2" - - Data cleaning and preprocessing: - - Checked for missing values. - - Standardized features using scaling techniques. + - Preprocess text and convert categorical labels. === "Step 3" - - Feature engineering and selection: - - Extracted relevant features for spam classification. - - Used correlation matrix to select significant features. + - Convert text into numerical features using CountVectorizer. === "Step 4" - - Model training and evaluation: - - Trained models: KNN, Naive Bayes, SVM, and Random Forest. - - Evaluated models using accuracy, precision, and recall. + - Train machine learning models. === "Step 5" - - Model optimization and fine-tuning: - - Tuned hyperparameters using GridSearchCV. + - Evaluate models using accuracy, precision, recall, and F1 score. === "Step 6" - - Validation and testing: - - Tested models on unseen data to check performance. + - Visualize performance using confusion matrices and heatmaps. --- -#### PROJECT TRADE-OFFS AND SOLUTIONS - -=== "Trade Off 1" - - **Accuracy vs. Training Time**: - - Models like Random Forest took longer to train but achieved higher accuracy compared to Naive Bayes. - -=== "Trade Off 2" - - **Complexity vs. Interpretability**: - - Simpler models like Naive Bayes were more interpretable but slightly less accurate. +### ๐Ÿ–ฅ CODE EXPLANATION ---- +=== "Section 1" + - Data loading and preprocessing. -### SCREENSHOTS - +=== "Section 2" + - Text vectorization using CountVectorizer. -!!! success "Project flowchart" - - ``` mermaid - graph LR - A[Start] --> B[Load Dataset]; - B --> C[Preprocessing]; - C --> D[Train Models]; - D --> E{Compare Performance}; - E -->|Best Model| F[Deploy]; - E -->|Retry| C; - ``` +=== "Section 3" + - Training models (MLP Classifier, MultinomialNB, BernoulliNB). -??? tip "Confusion Matrix" +=== "Section 4" + - Evaluating models using various metrics. - === "SVM" - ![Confusion Matrix - SVM](https://github.com/user-attachments/assets/5abda820-040a-4ea8-b389-cd114d329c62) +=== "Section 5" + - Visualizing confusion matrices and metric comparisons. 
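+
+=== "End-to-end sketch (illustrative)"
+    - The tabs above only summarise the notebook, so below is a minimal, self-contained sketch of the same pipeline rather than the notebook's exact code.
+    - The file and column names (`spam.csv`, `Category`, `Text`), the `latin-1` encoding, the 80/20 split, and the model hyperparameters are assumptions taken from the dataset table and model list above.
+
+    ```py
+    import pandas as pd
+    from sklearn.feature_extraction.text import CountVectorizer
+    from sklearn.model_selection import train_test_split
+    from sklearn.naive_bayes import MultinomialNB, BernoulliNB
+    from sklearn.neural_network import MLPClassifier
+    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
+
+    # Assumed file/column names; rename() is a no-op if the columns are already
+    # called Category/Text, so this also works on a pre-cleaned dataframe.
+    df = pd.read_csv("spam.csv", encoding="latin-1").rename(columns={"v1": "Category", "v2": "Text"})
+    df = df[["Category", "Text"]].dropna()
+    y = (df["Category"] == "spam").astype(int)  # 1 = spam, 0 = ham
+
+    # Vectorize text into token counts (Section 2).
+    vectorizer = CountVectorizer(stop_words="english")
+    X = vectorizer.fit_transform(df["Text"])
+
+    # Hold out a test set so the reported metrics reflect unseen emails.
+    x_train, x_test, y_train, y_test = train_test_split(
+        X, y, test_size=0.2, random_state=42, stratify=y
+    )
+
+    # The three model families compared in this write-up (Section 3).
+    models = {
+        "MLP Classifier": MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=42),
+        "Multinomial NB": MultinomialNB(),
+        "Bernoulli NB": BernoulliNB(),
+    }
+
+    # Fit each model and print the metrics used in the comparison table (Sections 4-5).
+    for name, model in models.items():
+        model.fit(x_train, y_train)
+        y_pred = model.predict(x_test)
+        print(f"{name}: "
+              f"accuracy={accuracy_score(y_test, y_pred):.3f}, "
+              f"precision={precision_score(y_test, y_pred):.3f}, "
+              f"recall={recall_score(y_test, y_pred):.3f}, "
+              f"F1={f1_score(y_test, y_pred):.3f}")
+    ```
+
+    - A common variation is swapping `CountVectorizer` for `TfidfVectorizer` when very frequent tokens dominate the raw counts; the rest of the pipeline stays unchanged.
+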
- === "Naive Bayes" - ![Confusion Matrix - Naive Bayes](https://github.com/user-attachments/assets/bdae9210-9b9b-45c7-9371-36c0a66a9184) +--- - === "Decision Tree" - ![Confusion Matrix - Decision Tree](https://github.com/user-attachments/assets/8e92fc53-4aff-4973-b0a1-b65a7fc4a79e) +### โš–๏ธ PROJECT TRADE-OFFS AND SOLUTIONS - === "AdaBoost" - ![Confusion Matrix - AdaBoost](https://github.com/user-attachments/assets/043692e3-f733-419c-9fb2-834f2e199506) +=== "Trade Off 1" + - Balancing accuracy and computational efficiency. + - Used Naive Bayes for speed and MLP for improved accuracy. - === "Random Forest" - ![Confusion Matrix - Random Forest](https://github.com/user-attachments/assets/5c689f57-9ec5-4e49-9ef5-3537825ac772) +=== "Trade Off 2" + - Handling false positives vs. false negatives. + - Tuned models to improve precision for spam detection. --- -### MODELS USED AND THEIR EVALUATION METRICS +## ๐ŸŽฎ SCREENSHOTS -| Model | Accuracy | Precision | Recall | -|----------------------|----------|-----------|--------| -| KNN | 90% | 89% | 88% | -| Naive Bayes | 92% | 91% | 90% | -| SVM | 94% | 93% | 91% | -| Random Forest | 95% | 94% | 93% | -| AdaBoost | 97% | 97% | 100% | +!!! tip "Visualizations and EDA of different features" ---- - -#### MODELS COMPARISON GRAPHS + === "Confusion Matrix comparision" + ![img](https://github.com/user-attachments/assets/94a3b2d8-c7e5-41a5-bba7-8ba4cb1435a7) -!!! tip "Models Comparison Graphs" - === "Accuracy Comparison" - ![Model accracy comparison](https://github.com/user-attachments/assets/1e17844d-e953-4eb0-a24d-b3dbc727db93) +??? example "Model performance graphs" ---- + === "Meteric comparison" + ![img](https://github.com/user-attachments/assets/c2be4340-89c9-4aee-9a27-8c40bf2c0066) -### CONCLUSION -#### WHAT YOU HAVE LEARNED - -!!! tip "Insights gained from the data" - - Feature importance significantly impacts spam detection. - - Simple models like Naive Bayes can achieve competitive performance. +--- -??? tip "Improvements in understanding machine learning concepts" - - Gained hands-on experience with classification models and model evaluation techniques. +## ๐Ÿ“‰ MODELS USED AND THEIR EVALUATION METRICS -??? tip "Challenges faced and how they were overcome" - - Balancing between accuracy and training time was challenging, solved using model tuning. +| Model | Accuracy | Precision | Recall | F1 Score | +|------------|----------|------------|--------|----------| +| MLP Classifier | 95% | 0.94 | 0.90 | 0.92 | +| Multinomial NB | 93% | 0.91 | 0.88 | 0.89 | +| Bernoulli NB | 92% | 0.89 | 0.85 | 0.87 | --- -#### USE CASES OF THIS MODEL - -=== "Application 1" +## โœ… CONCLUSION - **Email Service Providers** - - Automated filtering of spam emails for improved user experience. +### ๐Ÿ”‘ KEY LEARNINGS -=== "Application 2" +!!! tip "Insights gained from the data" + - Text length plays a role in spam detection. + - Certain words appear more frequently in spam emails. - **Enterprise Email Security** - - Used in enterprise software to detect phishing and spam emails. +??? tip "Improvements in understanding machine learning concepts" + - Gained insights into text vectorization techniques. + - Understood trade-offs between different classification models. --- -### FEATURES PLANNED BUT NOT IMPLEMENTED +### ๐ŸŒ USE CASES -=== "Feature 1" +=== "Email Filtering Systems" + - Can be integrated into email services like Gmail and Outlook. - - Integration of deep learning models (LSTM) for improved accuracy. 
+=== "SMS Spam Detection" + - Used in mobile networks to block spam messages. diff --git a/docs/natural-language-processing/index.md b/docs/natural-language-processing/index.md index d8f53239..4c5e4736 100644 --- a/docs/natural-language-processing/index.md +++ b/docs/natural-language-processing/index.md @@ -29,7 +29,7 @@
- Email Spam Detection + Email Spam Detection

Email Spam Detection

ML-Based Email Spam Classification

From 0254fafff871788efe01cdcb832a56d5902e6107 Mon Sep 17 00:00:00 2001 From: Kashish Khurana <113686328+Kashishkh@users.noreply.github.com> Date: Tue, 4 Mar 2025 02:31:08 +0530 Subject: [PATCH 19/19] Update black-friday-sales-analysis.md --- docs/data-visualization/black-friday-sales-analysis.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/data-visualization/black-friday-sales-analysis.md b/docs/data-visualization/black-friday-sales-analysis.md index 1db03d81..e22ed650 100644 --- a/docs/data-visualization/black-friday-sales-analysis.md +++ b/docs/data-visualization/black-friday-sales-analysis.md @@ -1,4 +1,4 @@ -# ๐Ÿ“œ Exploratory Data Analysis +# ๐Ÿ“œ Black Friday Sales Analysis
@@ -24,7 +24,7 @@ To analyze the Black Friday sales dataset, understand customer purchasing behavi style="margin: 0 auto; width: 100%; max-width: 950px;" frameborder="0" scrolling="auto" - title="bpl-analysis"> + title="black-friday-sales-analysis"> ## โš™๏ธ TECH STACK @@ -234,4 +234,4 @@ To analyze the Black Friday sales dataset, understand customer purchasing behavi ### ๐Ÿ”— USEFUL LINKS === "GitHub Repository" - - [https://github.com/Kashishkh/-Exploratory-Data-Analysis-](https://github.com/Kashishkh/-Exploratory-Data-Analysis-) \ No newline at end of file + - [https://github.com/Kashishkh/-Exploratory-Data-Analysis-](https://github.com/Kashishkh/-Exploratory-Data-Analysis-)