🧪 Toxic Comment Classifier

🧪 Toxic Comment Classifier is a small machine learning project that detects toxic comments in English and Russian.
It uses classical NLP techniques (TF-IDF features with a logistic regression classifier) for real-time text classification.

Whether you're building a moderation system or just exploring NLP, this project is a great starting point.

📦 Description

The script:

  • Downloads and extracts English and Russian toxic comment datasets.
  • Merges them into training and testing sets.
  • Uses TfidfVectorizer to convert text into numerical features.
  • Trains a logistic regression model.
  • Saves the model and vectorizer to model.pkl.
  • Prompts the user for a comment and reports whether it is toxic.
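
The steps above can be sketched roughly as follows. This is a minimal illustration and not the contents of main.py: the toy training sentences, the variable names, and the choice to pickle the model and vectorizer together in one dictionary are all assumptions. It also pickles to bytes in memory instead of writing model.pkl, to keep the example free of side effects.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny hypothetical training set (1 = toxic, 0 = not toxic); the real script
# trains on the merged Kaggle datasets instead.
train_texts = [
    "you are stupid and nobody likes you",
    "shut up you idiot",
    "have a great day",
    "thanks for the helpful answer",
]
train_labels = [1, 1, 0, 0]

# Turn raw text into TF-IDF features, then fit a logistic regression on them.
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(train_texts)
model = LogisticRegression()
model.fit(features, train_labels)

# Store both objects together (an assumed layout): the model is only usable
# with the vectorizer that produced its feature columns.
blob = pickle.dumps({"model": model, "vectorizer": vectorizer})

# Later (or in another run): restore the pair and classify a new comment.
restored = pickle.loads(blob)
comment = "you are an idiot"
prediction = restored["model"].predict(restored["vectorizer"].transform([comment]))[0]
print("Toxic" if prediction == 1 else "Not toxic")
```

Keeping the vectorizer next to the model matters because new comments must be transformed with the same vocabulary the model was trained on.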

🗃 Datasets Used

Jigsaw Toxic Comment Classification Challenge:
https://www.kaggle.com/datasets/julian3833/jigsaw-toxic-comment-classification-challenge
Russian Language Toxic Comments:
https://www.kaggle.com/datasets/blackmoon/russian-language-toxic-comments
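
Before the two datasets can be merged, they have to be mapped to a shared (text, label) shape. The sketch below shows one way to do that with the standard csv module. The column names ("comment_text", "comment", "toxic") and the inline sample rows are assumptions made for illustration, so check the headers of the downloaded files before relying on them.

```python
import csv
import io

# Hypothetical stand-ins for the two downloaded CSV files. The column names
# are assumptions; check the real headers.
ENGLISH_CSV = 'comment_text,toxic,insult\n"Have a great day!",0,0\n"You are an idiot",1,1\n'
RUSSIAN_CSV = 'comment,toxic\n"Хорошего дня!",0.0\n"Ты идиот",1.0\n'


def load_rows(handle, text_column, label_column="toxic"):
    """Read one CSV and return (text, label) pairs with an integer 0/1 label."""
    rows = []
    for record in csv.DictReader(handle):
        # float() first so that both "1" and "1.0" parse, then collapse to 0/1.
        label = int(float(record[label_column]) > 0)
        rows.append((record[text_column].strip(), label))
    return rows


english = load_rows(io.StringIO(ENGLISH_CSV), text_column="comment_text")
russian = load_rows(io.StringIO(RUSSIAN_CSV), text_column="comment")

# Merge into a single list; any other label columns (e.g. "insult") are ignored.
merged = english + russian
texts = [text for text, _ in merged]
labels = [label for _, label in merged]
print(len(merged), labels)  # 4 [0, 1, 0, 1]
```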

📁 Project Structure

.
└── Toxic Comment Classifier AI
    ├── dataset
    ├── .gitattributes
    ├── .gitignore
    ├── LICENSE
    ├── main.py
    ├── model.pkl
    ├── README.md
    └── requirements.txt

🛠 Installation

Clone the repository:

git clone https://github.com/pashudzu/ToxicCommentClassificationAI.git
cd ToxicCommentClassificationAI

Install dependencies:

pip install -r requirements.txt

Run the script:

python main.py

🔍 Example Usage

Comment                                  Classification
"You're stupid and nobody likes you!"    ❌ Toxic
"Have a great day!"                      ✅ Kindness

📈 Model Performance

After training, the script prints the model's accuracy score.
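
As a small illustration of what that number means, accuracy is the share of predictions that match the true labels. The labels and predictions below are made up for the example and are not results from this project.

```python
from sklearn.metrics import accuracy_score

# Hypothetical held-out labels and model predictions (1 = toxic, 0 = not toxic).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# accuracy = correct predictions / total predictions
accuracy = accuracy_score(y_true, y_pred)
print(f"Accuracy: {accuracy:.2f}")  # 6 of 8 correct -> Accuracy: 0.75
```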

🧠 Technologies Used

  • Python 3
  • scikit-learn
  • NLTK
  • pickle
  • TF-IDF vectorization
  • Logistic Regression

📌 Notes

  • ✅ Supports both English and Russian comments.
  • 🧪 Uses only the toxic label (binary classification).
  • 💾 The model is saved to model.pkl and reused on later runs, so training is skipped when a saved model exists.
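
The save-and-reuse behaviour can be sketched with a small helper. The names load_or_train and fake_train are hypothetical and do not come from main.py, and the placeholder dictionary stands in for the real model and vectorizer.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_train(path, train_fn):
    """Return the object pickled at `path`, training and saving it if absent."""
    path = Path(path)
    if path.exists():
        with path.open("rb") as handle:
            return pickle.load(handle)
    artifact = train_fn()
    with path.open("wb") as handle:
        pickle.dump(artifact, handle)
    return artifact


calls = []


def fake_train():
    # Hypothetical stand-in for the real training step.
    calls.append("trained")
    return {"model": "placeholder-model", "vectorizer": "placeholder-vectorizer"}


with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.pkl"
    first = load_or_train(model_path, fake_train)   # trains and writes the file
    second = load_or_train(model_path, fake_train)  # loads the saved file

print(len(calls))  # 1: training ran only on the first call
```

Deleting model.pkl forces a retrain on the next run. Since unpickling can execute arbitrary code, only load pickle files from sources you trust.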

📜 License

This project is licensed under the MIT License. Use it freely.

Made with ❤️ by pashudzu