
Fake_Real_News_ML

Fake and real news dataset | Kaggle

This project was developed for the UNICA.IT University ML exam.

Starting from the datasets provided by Kaggle.com, we built an algorithm capable of distinguishing fake news from real news.

Master's degree in Computer Engineering, Cybersecurity and Artificial Intelligence - University of Cagliari

Machine Learning

Authors: Michela Lecca - Marco Mulas - Lello Molinario

Supervisor: Prof. Battista Biggio



Contents

  1. Installation
  2. Project Goal
  3. Solution Design
  4. Analysis and Conclusion


Installation

  • Download the ZIP archive or clone the repository with:

    git clone https://github.com/lmolinario/MLproject.git
  • Install the requirements with:

    pip3 install -r requirements.txt
  • Run the file entrypoint.py to start the program.

Project Goal

The proposed project aims to deepen knowledge of the key concepts and potential of machine learning algorithms. Through binary classification, the elements of a dataset are divided into two groups, and the model predicts which group each element belongs to.

Solution Design

The developed code implements a modular text classification pipeline based on the Strategy design pattern, which makes both the data representations and the classification strategies interchangeable. The Strategy design pattern defines a family of algorithms, encapsulates each one, and makes them interchangeable without altering the client code.
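As a minimal sketch of how such a strategy-based pipeline can be wired up (the class names `ClassifierStrategy`, `MajorityClassStrategy`, and `Pipeline` here are illustrative, not the repository's actual identifiers):

```python
from abc import ABC, abstractmethod

class ClassifierStrategy(ABC):
    """Common interface: each concrete strategy encapsulates one algorithm."""
    @abstractmethod
    def train(self, X, y): ...
    @abstractmethod
    def predict(self, X): ...

class MajorityClassStrategy(ClassifierStrategy):
    """Toy strategy: always predicts the most frequent training label."""
    def train(self, X, y):
        self.label = max(set(y), key=y.count)
    def predict(self, X):
        return [self.label] * len(X)

class Pipeline:
    """Client code: depends only on the interface, so strategies are swappable."""
    def __init__(self, strategy: ClassifierStrategy):
        self.strategy = strategy
    def run(self, X_train, y_train, X_test):
        self.strategy.train(X_train, y_train)
        return self.strategy.predict(X_test)

pipe = Pipeline(MajorityClassStrategy())
print(pipe.run(["a", "b", "c"], [1, 1, 0], ["d", "e"]))  # [1, 1]
```

Because `Pipeline` only depends on the abstract interface, a new classifier can be added by writing one new class, without touching any client code.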

It includes three methods of data representation:

  1. TokenizerRepresentation: converts text to integer sequences using the Keras Tokenizer and pads them to a fixed length.
  2. TextVectorizationRepresentation: uses TensorFlow's TextVectorization layer to map text to integer sequences and pad them.
  3. TFIDFRepresentation: applies TF-IDF vectorization to convert text into numeric features weighted by term frequency and inverse document frequency.

These methods transform text data into numeric formats suitable for machine learning models.

The script also provides three classification strategies:

  1. NaiveBayesClassifier: implements Naive Bayes, well suited to binary text classification tasks.
  2. PerceptronNetStrategy: A basic form of neural network, suitable for binary classification tasks.
  3. RandomForestClassifier: Implements a Random Forest algorithm with multiple decision trees for robust classification.
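The three strategies can be sketched with their scikit-learn counterparts, which share the same `fit`/`predict` interface and are therefore trivially interchangeable (the tiny corpus and labels are made up for illustration; the repository's own wrappers may differ):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import Perceptron
from sklearn.ensemble import RandomForestClassifier

# Toy corpus: 1 = real, 0 = fake.
texts = ["markets rise on jobs report", "aliens endorse candidate",
         "senate passes budget bill", "miracle pill cures everything"]
labels = [1, 0, 1, 0]

X = TfidfVectorizer().fit_transform(texts)

# All three models expose the same fit/predict interface,
# so the surrounding pipeline can swap them freely.
for model in (MultinomialNB(),
              Perceptron(),
              RandomForestClassifier(n_estimators=10, random_state=0)):
    model.fit(X, labels)
    print(type(model).__name__, model.predict(X))
```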

The performance of the classifiers is evaluated through:

  1. Average precision: Precision is the ratio of true positive predictions to the total predicted positives.
  2. Average recall: Recall is the ratio of true positive predictions to the total actual positives.
  3. Average F1 Score: The F1 score is the harmonic mean of precision and recall.

The pipeline supports k-fold cross-validation for robust evaluation of model performance.

Analysis and Conclusion

NaiveBayes

  • Best Performance Metric: TF-IDF
  • Why: Metrics are consistently high for TF-IDF, indicating that NaiveBayes works exceptionally well with this data representation. It shows significant improvements in precision, recall, and F1-score, highlighting its ability to leverage the importance-weighted word representation effectively.

PerceptronNet

  • Best Performance Metric: TF-IDF
  • Why: PerceptronNet exhibits the most dramatic improvement with TF-IDF, achieving near-perfect scores in all metrics. This suggests that PerceptronNet can capture and learn from the detailed features provided by TF-IDF, making it highly effective for this task.

RandomForest

  • Best Performance Metric: TF-IDF
  • Why: RandomForest maintains high performance across all data representations, with the best metrics for TF-IDF. The high F1-score indicates a good balance between precision and recall, suggesting that it effectively handles both false positives and false negatives. Its robust performance across different representations makes it a versatile choice.

Model with the Best Performance Overall

RandomForest and PerceptronNet are the top performers, especially with TF-IDF, achieving near-perfect scores in all three metrics.

Between the Two?

  • RandomForest consistently has high metrics across all representations, indicating robust performance regardless of the data representation used.
  • PerceptronNet achieves slightly higher precision and recall with TF-IDF, but RandomForest's performance is more consistent across all representations.

Why Does RandomForest Work Better?

  • Ensemble Learning: Uses multiple decision trees, reducing overfitting and improving generalization.
  • Robustness: Handles variance and bias balancing well by averaging multiple trees.
  • Feature Importance: Effectively handles high-dimensional data and identifies important features, improving performance across different data representations.

Conclusion

Based on the metrics provided, RandomForest with TF-IDF is slightly the best overall due to its consistent high precision, recall, and F1-score.

Overall Goal of Machine Learning

The goal of machine learning is to create systems that can learn from data, identify patterns, and make informed decisions or predictions, ultimately improving efficiency, accuracy, and customization in various applications. In our case, the goal of binary classification in machine learning is to create models that can effectively and accurately distinguish between two classes, providing reliable and actionable predictions while balancing performance and efficiency.
