This project builds a machine learning model to classify news articles as real or fake using Natural Language Processing (NLP) and Logistic Regression.
Fake news poses a significant challenge in the digital world. This project aims to classify news articles based on their text content into two categories:
- Real News (label = 0)
- Fake News (label = 1)
## Dataset

- Source: Kaggle Fake News Dataset
- File used: `train.csv`
- Features: `id`, `title`, `author`, `text`, `label`
## Technologies Used

- Python (Google Colab)
- NumPy, Pandas
- NLTK for stopword removal and stemming
- scikit-learn for TF-IDF, model training, and evaluation
## Preprocessing

- Missing values: replaced with empty strings
- Content creation: combined `author` and `title` into a new `content` feature
- Text cleaning:
  - Remove non-alphabetic characters
  - Convert to lowercase
  - Tokenize
  - Remove stopwords
  - Apply stemming
- Feature extraction: used `TfidfVectorizer` to convert the cleaned text into numerical form
## Model

- Algorithm: Logistic Regression
- Data split: 80% training, 20% testing
- Input: TF-IDF features from the preprocessed text
- Evaluation: accuracy score
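The training and evaluation steps above can be sketched as follows; a synthetic matrix stands in for the real TF-IDF features, with labels 0 = real and 1 = fake as defined earlier:

```python
from sklearn.datasets import make_classification  # stand-in for TF-IDF features
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic features in place of the sparse TF-IDF matrix
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

# 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate with a plain accuracy score on the held-out 20%
acc = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {acc:.3f}")
```

`LogisticRegression` accepts the sparse matrix produced by `TfidfVectorizer` directly, so no densification is needed on the real data.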
## Results

- Successfully trained and evaluated the model.
- Achieved good accuracy on the held-out test split of the dataset.
## How to Run

- Upload `train.csv` to your Colab session
- Run the notebook cells sequentially
- The notebook handles all preprocessing, training, and prediction
## Limitations

- The model is trained and tested only on the provided dataset.
- It may not accurately classify news from outside sources due to:
  - Dataset bias
  - Lack of contextual understanding
  - No real-world generalization capability
## Future Work

- Experiment with advanced models (SVM, XGBoost)
- Use deep learning models (LSTM, BERT)
- Apply cross-validation and hyperparameter tuning
- Train on more diverse and recent data sources
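As a sketch of the cross-validation and hyperparameter-tuning item, a grid search over Logistic Regression's regularization strength `C` could look like this (synthetic data again stands in for the real TF-IDF features, and the grid values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the TF-IDF matrix and real/fake labels
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# 5-fold cross-validated search over the regularization strength C
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X, y)

print("best C:", search.best_params_["C"])
print(f"mean CV accuracy: {search.best_score_:.3f}")
```

`search.best_estimator_` is refit on the full data and can be used for prediction directly, which folds tuning and final training into one step.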