This extension implements a Max Voting ensemble classifier that combines the outputs of three independently trained models: (1) a fine-tuned BERT baseline, (2) BERT + CNN, and (3) BERT + LSTM. Each model produces a probability distribution over the two classes (hate vs. non-hate). For each class, the ensemble selects the maximum predicted probability across the three models, and the final prediction is the class with the highest such maximum score.
The motivation behind Max Voting is to allow the most confident model to dominate the final decision on a per-instance basis. This is particularly useful when one architecture is better suited to a specific linguistic pattern or context. The baseline BERT captures rich contextual semantics, the CNN-based model emphasizes local n-gram features, and the BiLSTM-based model captures sequential dependencies. By selecting the strongest signal among these models rather than averaging them, Max Voting can better preserve high-confidence predictions and reduce the dilution of decisive model outputs.
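As an illustration, the per-class max selection described above can be sketched in plain Python. The probability values below are made up, and `max_voting` is a hypothetical helper, not the actual function in `Max_Voting.py`:

```python
# Hypothetical per-model class-probability outputs for two posts;
# each inner list is [P(non-hate), P(hate)]. Values are illustrative.
probs_bert = [[0.30, 0.70], [0.80, 0.20]]
probs_cnn  = [[0.45, 0.55], [0.60, 0.40]]
probs_lstm = [[0.40, 0.60], [0.90, 0.10]]

def max_voting(model_probs):
    """For each instance and class, take the maximum probability
    across models; predict the class with the highest such maximum."""
    preds = []
    for instance in zip(*model_probs):      # one post's probs from each model
        max_per_class = [max(col) for col in zip(*instance)]
        preds.append(max_per_class.index(max(max_per_class)))
    return preds

print(max_voting([probs_bert, probs_cnn, probs_lstm]))  # → [1, 0]
```

Note that for the first post the BERT baseline's 0.70 for the hate class wins, while for the second post the BiLSTM's 0.90 for the non-hate class dominates, even though the other two models are less certain.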
Note to Grader: The base learners in this ensemble were not retrained. As described later in this document, we reuse the pretrained models from previous milestones as fixed base learners, with no additional weight updates performed.
The model is evaluated on the held-out test set. Using sklearn tooling, this script reports metrics including accuracy, precision, recall, and F1 for both classes. The test-set performance of the Max Voting Ensemble Model is:
- F1 (Hate): 0.723
Please make sure the training data (train_data.csv), validation data (dev_data.csv), and test data (test_data.csv) are arranged according to the folder structure below.
Note to Grader: In addition to the data, you will also need the saved weights from the Milestone #2 fine-tune of the BERT model, the Milestone #3 training of BERT + CNN, and the Milestone #3 training of BERT + LSTM. To access the weights, here is a Google Drive link to a shared folder called Model Weights. Within this folder, you will see the following sub-folders:
- Milestone2-Baseline-BERT-FinalModel (i.e. Saved Weights from Milestone #2 Fine-Tune of BERT Model)
- Milestone3-BERT-CNN-FinalModel (i.e. Saved Weights from Milestone #3 Training of BERT + CNN)
- Milestone3-BERT-BiLSTM-FinalModel (i.e. Saved Weights from Milestone #3 Training of BERT + LSTM)
Please download the above folders and structure your local repository as follows:
Required Folder Structure
├── data/
│   ├── train_data.csv
│   ├── dev_data.csv
│   └── test_data.csv
├── Milestone #2/
│   └── Milestone2-Baseline-BERT-FinalModel   # Saved Weights from Milestone #2 Fine-Tune of BERT Model
├── Milestone #3/
│   ├── Milestone3-BERT-CNN-FinalModel        # Saved Weights from Milestone #3 Training of BERT + CNN
│   └── Milestone3-BERT-BiLSTM-FinalModel     # Saved Weights from Milestone #3 Training of BERT + LSTM
└── src/
    └── Max_Voting.py                         # Script for Max Voting ensemble
Running the Script
To execute the Max Voting ensemble, run the following command from within the src folder:

python Max_Voting.py

At the end of execution, you will see the following three files in your directory:
- Max-Voting-train-results.csv: List of evaluation metrics on Training Dataset
- Max-Voting-dev-results.csv: List of evaluation metrics on Validation/Dev Dataset
- Max-Voting-test-results.csv: List of evaluation metrics on Test Dataset
Output Metrics
The above files contain the below evaluation metrics:
- accuracy: Overall proportion of correctly classified posts (both hateful and non-hateful).
- pos_precision: Of the posts predicted as hateful (1), the fraction that are actually hateful. Formula: TP / (TP + FP).
- pos_recall: Of all truly hateful posts, the fraction correctly identified as hateful. Formula: TP / (TP + FN).
- neg_precision: Of the posts predicted as non-hateful (0), the fraction that are actually non-hateful. Formula: TN / (TN + FN).
- neg_recall: Of all truly non-hateful posts, the fraction correctly identified as non-hateful. Formula: TN / (TN + FP).
- pos_f1: F1 score for the hateful class, the harmonic mean of Pos Precision and Pos Recall. Formula: 2 * (Precision * Recall) / (Precision + Recall).
- neg_f1: F1 score for the non-hateful class, the harmonic mean of Neg Precision and Neg Recall. Formula: 2 * (Precision * Recall) / (Precision + Recall).
- f1_macro: Average of Pos F1 and Neg F1, treating both classes equally regardless of class frequency.
- f1_micro: Global F1 considering total true positives, false positives, and false negatives across all classes.
- f1_weighted: Average of Pos F1 and Neg F1 weighted by the number of instances in each class.
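The formulas above can be verified with a minimal sketch computing each metric directly from raw confusion counts (the TP/FP/TN/FN values here are hypothetical, chosen only to exercise the arithmetic):

```python
# Hypothetical confusion counts for the hateful (positive) class.
tp, fp, tn, fn = 70, 20, 80, 30
total = tp + fp + tn + fn

accuracy      = (tp + tn) / total          # correct predictions, both classes
pos_precision = tp / (tp + fp)             # predicted hateful that are hateful
pos_recall    = tp / (tp + fn)             # truly hateful that were caught
neg_precision = tn / (tn + fn)             # predicted non-hateful that are non-hateful
neg_recall    = tn / (tn + fp)             # truly non-hateful that were caught

pos_f1 = 2 * pos_precision * pos_recall / (pos_precision + pos_recall)
neg_f1 = 2 * neg_precision * neg_recall / (neg_precision + neg_recall)

f1_macro    = (pos_f1 + neg_f1) / 2        # unweighted class average
f1_weighted = ((tp + fn) * pos_f1 + (tn + fp) * neg_f1) / total

print(round(accuracy, 3), round(pos_f1, 3))  # → 0.75 0.737
```

These match what sklearn's `precision_recall_fscore_support` and `f1_score` (with the corresponding `average` settings) would report on the same confusion counts.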
Note to Grader: While multiple evaluation metrics are saved in the above files, our primary metric for assessing model performance is the F1 score for the hateful class (i.e., pos_f1). The other metrics are provided solely for additional analysis.
Metric Prefixes
All metrics in the saved CSVs are prefixed by the split:
| Split | Prefix | Example Metrics |
|---|---|---|
| Train | train_ | train_accuracy, train_pos_f1, train_f1_macro |
| Dev | dev_ | dev_neg_precision, dev_pos_recall, dev_f1_micro |
| Test | test_ | test_f1_weighted, test_neg_recall, test_accuracy |