🧠 DNLP SS25 — Gradient Descenders


A comprehensive study of NLP tasks using BERT and BART — sentiment analysis, paraphrase detection, and controlled text generation.

📖 Overview · ⚙️ Setup · 🔬 Methodology · 📊 Results · 👥 Team


📌 Project Info

| Field | Details |
| --- | --- |
| 🏷️ Group Code | 06 |
| 👩‍🏫 Tutor | Corinna Wegner |
| 👥 Members | Khalid Tariq · Shakhzod Bakhodirov · Alina Amanbayeva |

🗺️ Overview

This project explores five core NLP tasks through fine-tuning of BERT and BART foundation models. Our work goes beyond standard baselines by introducing advanced loss functions, architectural improvements, and training strategies:

| Task | Model | Key Innovation |
| --- | --- | --- |
| Semantic Textual Similarity (STS) | BERT | Siamese network + Triplet Loss |
| Sentiment Analysis (SST) | BERT | CLS + Mean Pooling + LayerNorm |
| Paraphrase Detection (QQP) | BERT | Focal Loss + Adversarial Training (FGM) |
| Paraphrase Type Detection (PTD) | BART | Custom Head + Weighted Sampling |
| Paraphrase Type Generation (PTG) | BART | Multi-Objective Loss + Bayesian Optimization |

⚙️ Setup Instructions

1. Clone the Repository

git clone https://github.com/WeskerPRO/NLP_Project.git
cd NLP_Project

2. Install Dependencies

Option A — Automatic (recommended):

bash setup.sh

Option B — Manual (conda):

conda create -n dnlp python=3.10 -y
conda activate dnlp
conda install pytorch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 pytorch-cuda=12.1 -c pytorch -c nvidia
conda install -y tqdm requests scikit-learn numpy pandas matplotlib seaborn scipy -c conda-forge
conda install -y protobuf transformers tokenizers pyarrow=20.0.0 spacy -c conda-forge
conda install -y sentence-transformers=5.0.0 -c conda-forge
pip install explainaboard-client==0.1.4 sacrebleu==2.5.1 optuna==3.6.0 importlib_metadata

3. Train BERT (Multitask)

python multitask_classifier.py --option finetune --task=sst --epochs=15 --use_gpu

Replace `sst` with the task to train: `sst`, `sts`, `qqp`, or `etpc`.

4. Train BART

# Paraphrase Type Generation
python bart_generation.py --use_gpu

# Paraphrase Type Detection
python bart_detection.py --use_gpu

# Optional: T5 model
python bart_generation.py --model=T5 --use_gpu

🔬 Methodology

1. 📐 Semantic Textual Similarity (STS)

A Siamese network architecture maps sentence pairs into a shared embedding space using a single weight-shared model.

Key techniques:

  • Triplet Loss with Hard Negative Mining — pulls similar embeddings together, pushes dissimilar ones apart. Hard negatives are mined within each batch for maximum discrimination.
  • MSE Regression Loss — directly optimizes prediction of the ground-truth similarity score in the [0, 5] range.
  • Hybrid Loss: $L = \alpha \cdot L_{\text{triplet}} + (1 - \alpha) \cdot L_{\text{regression}}$, with optimal $\alpha = 0.8$.
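The hybrid objective above can be sketched in PyTorch as follows. This is a minimal illustration, not the project's code: the function name `hybrid_sts_loss`, the margin of 0.5, the use of each pair's partner sentence as the positive, and the mapping of cosine similarity onto the [0, 5] range are all assumptions made for the example. Only $\alpha = 0.8$ and the two loss terms come from the description above.

```python
import torch
import torch.nn.functional as F

def hybrid_sts_loss(emb_a, emb_b, scores, alpha=0.8, margin=0.5):
    """alpha * triplet + (1 - alpha) * regression for a batch of sentence pairs.

    emb_a, emb_b: (batch, dim) embeddings from the shared encoder.
    scores: (batch,) gold similarity in [0, 5].
    Assumption: sentence b_i is the positive for anchor a_i, and negatives
    are the partner sentences of the other pairs in the batch.
    """
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    sim = a @ b.T                      # (batch, batch) cosine similarities
    pos = sim.diagonal()               # similarity of each true pair

    # Hard negative mining: for each anchor, the most similar non-partner.
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    hardest_neg = sim.masked_fill(eye, float("-inf")).max(dim=1).values
    triplet = F.relu(margin - pos + hardest_neg).mean()

    # Regression: map cosine similarity from [-1, 1] onto [0, 5] (assumed).
    pred = (pos + 1.0) * 2.5
    regression = F.mse_loss(pred, scores)

    return alpha * triplet + (1 - alpha) * regression

torch.manual_seed(0)
loss = hybrid_sts_loss(torch.randn(4, 16), torch.randn(4, 16),
                       torch.tensor([0.5, 2.0, 3.5, 5.0]))
```

The batch needs at least two pairs so that every anchor has a candidate negative. A real setup would likely also filter out "negatives" whose gold score is high, which this sketch does not do.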

2. 💬 Sentiment Analysis (SST)

The vanilla CLS-only baseline suffered from limited expressiveness and overfitting on this small dataset. Three improvements were introduced:

| Improvement | Description |
| --- | --- |
| Advanced Pooling | Concatenates [CLS] embedding with mean of all last-layer hidden states |
| Layer Normalization | Stabilizes the concatenated representation before the classifier |
| Strong Dropout (p=0.5) | Acts as ensemble regularization, critical for SST's limited size |
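The three improvements combine into a small classification head. The sketch below is illustrative only: the class name `SentimentHead`, the hidden size of 768 (BERT-base), the five output classes, and the choice to exclude padding tokens from the mean are assumptions not stated in this README.

```python
import torch
import torch.nn as nn

class SentimentHead(nn.Module):
    """[CLS] + mean pooling -> LayerNorm -> Dropout -> Linear."""

    def __init__(self, hidden_size=768, num_classes=5, dropout=0.5):
        super().__init__()
        self.norm = nn.LayerNorm(2 * hidden_size)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, last_hidden_state, attention_mask):
        # [CLS] is the first token of the last layer.
        cls = last_hidden_state[:, 0]
        # Mean over real tokens only (assumption: padding is masked out).
        mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
        mean = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        features = torch.cat([cls, mean], dim=-1)
        return self.classifier(self.dropout(self.norm(features)))

torch.manual_seed(0)
head = SentimentHead()
hidden = torch.randn(2, 6, 768)                      # stand-in for BERT output
mask = torch.tensor([[1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1]])
logits = head(hidden, mask)
```

Concatenation doubles the feature width, which is why the LayerNorm and the classifier both operate on `2 * hidden_size`.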

3. 🔁 Paraphrase Detection (QQP)

Five targeted improvements over the baseline:

| # | Method | Purpose |
| --- | --- | --- |
| 1 | Advanced Feature Engineering | Element-wise difference, product, cosine similarity between embeddings |
| 2 | Custom Paraphrase Head | Learns nonlinear decision boundaries from interaction features |
| 3 | Dropout Regularization | Prevents over-reliance on any single interaction type |
| 4 | Focal Loss | $FL(p_t) = -\alpha(1-p_t)^\gamma \log(p_t)$, which addresses class imbalance |
| 5 | Adversarial Training (FGM) | Adds worst-case embedding perturbations to boost robustness |
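Methods 4 and 5 can be sketched together. This is a hedged illustration under stated assumptions: `binary_focal_loss` and the `FGM` helper are hypothetical names, $\alpha = 0.25$, $\gamma = 2$ and $\epsilon = 1.0$ are common defaults rather than values taken from this project, and the tiny embedding-plus-linear model only stands in for BERT.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), averaged over the batch."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                       # bce equals -log(p_t)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

class FGM:
    """Fast Gradient Method: shift embedding weights by epsilon * g / ||g||."""

    def __init__(self, embedding, epsilon=1.0):
        self.embedding = embedding
        self.epsilon = epsilon
        self.backup = None

    def attack(self):
        grad = self.embedding.weight.grad
        if grad is None:
            return
        norm = grad.norm()
        if norm > 0 and torch.isfinite(norm):
            self.backup = self.embedding.weight.data.clone()
            self.embedding.weight.data.add_(self.epsilon * grad / norm)

    def restore(self):
        if self.backup is not None:
            self.embedding.weight.data.copy_(self.backup)
            self.backup = None

# Toy stand-in for the encoder: embedding -> mean pool -> linear.
torch.manual_seed(0)
embedding = nn.Embedding(100, 16)
classifier = nn.Linear(16, 1)
token_ids = torch.randint(0, 100, (8, 12))
targets = torch.randint(0, 2, (8,)).float()

def forward_loss():
    pooled = embedding(token_ids).mean(dim=1)
    return binary_focal_loss(classifier(pooled).squeeze(-1), targets)

fgm = FGM(embedding)
clean_weights = embedding.weight.detach().clone()
forward_loss().backward()     # gradients on clean embeddings
fgm.attack()                  # perturb embeddings along the gradient
perturbation_norm = (embedding.weight.detach() - clean_weights).norm().item()
forward_loss().backward()     # accumulate gradients of the adversarial loss
fgm.restore()                 # restore clean embeddings before optimizer.step()
```

The second backward pass adds the adversarial gradients to the clean ones, so a single optimizer step afterwards trains on both.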

4. 🏷️ Paraphrase Type Detection (PTD) — BART

Multi-label classification across 26 paraphrase categories using the ETPC dataset, evaluated with Matthews Correlation Coefficient (MCC).

Improvements:

  • Class-wise Threshold Tuning: each of the 26 classes gets its own decision threshold, selected on the dev set
  • Weighted Random Sampler: rare paraphrase types get a higher sampling weight, $w_i = \frac{1}{\text{count}(i) + \epsilon}$
  • L2 Regularization: weight decay in AdamW, corresponding to the penalized objective $\mathcal{L}_{\text{reg}}(\theta) = \mathcal{L}(\theta) + \lambda \lVert\theta\rVert_2^2$ (AdamW applies the decay directly in the parameter update rather than through the loss gradient)
  • Custom Classifier Head: Linear → SiLU → BatchNorm → Dropout → Output Linear
  • BCEWithLogitsLoss: handles multi-label prediction with per-class binary cross-entropy
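The classifier head, the loss, and class-wise threshold tuning can be sketched as below. Treat this as an assumption-laden example: `ParaphraseTypeHead`, `binary_mcc`, and `tune_thresholds` are hypothetical names, the sizes 1024 and 512 and the threshold grid are illustrative, and random tensors stand in for pooled BART features and ETPC labels.

```python
import torch
import torch.nn as nn

NUM_TYPES = 26  # paraphrase categories, as stated above

class ParaphraseTypeHead(nn.Module):
    """Linear -> SiLU -> BatchNorm -> Dropout -> Linear, one logit per type."""

    def __init__(self, hidden_size=1024, inner_size=512, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, inner_size),
            nn.SiLU(),
            nn.BatchNorm1d(inner_size),
            nn.Dropout(dropout),
            nn.Linear(inner_size, NUM_TYPES),
        )

    def forward(self, pooled):
        return self.net(pooled)

def binary_mcc(pred, target):
    """Matthews correlation of two boolean vectors (0.0 when undefined)."""
    tp = (pred & target).sum().item()
    tn = (~pred & ~target).sum().item()
    fp = (pred & ~target).sum().item()
    fn = (~pred & target).sum().item()
    denom = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

def tune_thresholds(probs, labels, grid=None):
    """Pick, for each class, the grid threshold with the highest dev-set MCC."""
    grid = grid if grid is not None else [i / 20 for i in range(1, 20)]
    labels = labels.bool()
    best = torch.full((probs.shape[1],), 0.5)
    for c in range(probs.shape[1]):
        scores = [binary_mcc(probs[:, c] >= t, labels[:, c]) for t in grid]
        best[c] = grid[max(range(len(grid)), key=scores.__getitem__)]
    return best

torch.manual_seed(0)
head = ParaphraseTypeHead()
pooled = torch.randn(8, 1024)                         # stand-in for BART features
labels = (torch.rand(8, NUM_TYPES) > 0.8).float()     # random multi-hot labels
logits = head(pooled)
loss = nn.BCEWithLogitsLoss()(logits, labels)         # per-class binary cross-entropy

head.eval()
with torch.no_grad():
    thresholds = tune_thresholds(torch.sigmoid(head(pooled)), labels)
```

Thresholds are tuned on held-out dev predictions in practice. The sketch reuses the same toy batch only to stay self-contained.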

5. ✍️ Paraphrase Type Generation (PTG) — BART

A controlled generation framework trained with a multi-objective loss:

$$L_{\text{total}} = L_{CE} + \alpha_{\text{sem}} \cdot L_{\text{sem}} + \alpha_{\text{lex}} \cdot L_{\text{lex}} + \alpha_{\text{syn}} \cdot L_{\text{syn}}$$

Key contributions:

  • Stochastic Sampling: `top_p` and `temperature` encourage diverse, non-repetitive generation
  • Multi-Objective Loss: simultaneously optimizes semantic similarity, lexical variation, and syntactic variation
  • Bayesian Hyperparameter Optimization (Optuna): searches learning rate, weight decay, loss weights, and generation parameters
  • BART vs. T5 Comparison: T5 starts strong (instruction-following pre-training), but over the epochs BART learns to stop copying the input and to generate novel paraphrases
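To make the stochastic sampling step concrete, here is a minimal sketch of temperature-scaled top-p (nucleus) sampling over a single step of logits. The function `sample_top_p` is a hypothetical helper written for this example and not necessarily how the project's generation code applies `top_p` and `temperature`.

```python
import torch

def sample_top_p(logits, top_p=0.9, temperature=0.8, generator=None):
    """Sample one token id per row from the top-p nucleus.

    Tokens are sorted by probability and kept until the cumulative
    probability exceeds top_p; the most likely token is always kept.
    """
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, dim=-1, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # A token is dropped if the mass before it already exceeds top_p.
    outside_nucleus = (cumulative - sorted_probs) > top_p
    kept = sorted_probs.masked_fill(outside_nucleus, 0.0)
    kept = kept / kept.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(kept, num_samples=1, generator=generator)
    return sorted_idx.gather(-1, choice).squeeze(-1)

torch.manual_seed(0)
peaked = torch.tensor([[10.0, 0.0, 0.0, 0.0]])   # one dominant token
flat = torch.zeros(1, 4)                          # uniform distribution
greedy_like = sample_top_p(peaked, top_p=0.5, temperature=1.0)
diverse = {sample_top_p(flat, top_p=0.95, temperature=1.0).item() for _ in range(200)}
```

With a peaked distribution the nucleus collapses to the top token, while a flat distribution keeps several candidates. Lower temperatures sharpen the distribution before the nucleus is computed.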

📊 Results

BERT Tasks

| Task | Model | Metric | Score |
| --- | --- | --- | --- |
| SST | Baseline | Accuracy | 52.20% |
| SST | Advanced Pooling + Regularization | Accuracy | 55.20% |
| QQP | Baseline | Accuracy | 75.00% |
| QQP | Feature Eng. + Focal Loss + FGM | Accuracy | 85.70% |
| STS | Baseline (MSE) | Pearson r | 0.352 |
| STS | Simple Regression + Regularization | Pearson r | 0.375 |
| STS | Siamese BERT + Contrastive Learning | Pearson r | 0.672 |

BART Tasks

Paraphrase Type Detection (PTD)

| Configuration | Accuracy | MCC |
| --- | --- | --- |
| Baseline | 87.50% | 0.180 |
| Threshold Tuning | 87.50% | 0.206 |
| WeightedSampler + BCE + L2 | 78.00% | 0.318 |
| Custom Classifier Head | 80.00% | 0.400 |
| Optuna + Layer Freezing | 56.50% | 0.350 |

Paraphrase Type Generation (PTG)

| Configuration | BLEU | Negative BLEU | Penalized BLEU |
| --- | --- | --- | --- |
| Baseline | 48.44 | 2.84 | 2.64 |
| Stochastic Sampling | 45.22 | 20.93 | 18.20 |
| Multi-Objective Loss | 44.08 | 22.61 | 19.16 |
| Optimized Controlled Generation | 42.02 | 29.79 | 24.08 |

Extended Typology Paraphrase Corpus (ETPC) — Bonus

| Configuration | Accuracy | Micro F1 | Macro F1 | Macro MCC |
| --- | --- | --- | --- | --- |
| Baseline | 0.00% | 0.000 | 0.000 | 0.000 |
| Multi-label + Concatenated Inputs | 85.60% | 0.674 | 0.147 | 0.162 |
📈 Visualizations

PTD — Training & Validation Curve

[Figure: PTD training curve]

Training loss decreases consistently. Dev loss flattens around epoch 8, where Macro MCC peaks — confirming the value of early stopping for model selection.
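Patience-based early stopping on the dev metric can be sketched in a few lines. The `EarlyStopping` class and the MCC values below are invented for illustration and are not taken from the project's training logs; only the patience of 3 matches the PTD hyperparameters listed in this README.

```python
class EarlyStopping:
    """Stop when the dev score has not improved for `patience` epochs."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best_score = float("-inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, score):
        """Record one epoch; return True when training should stop."""
        if score > self.best_score:
            self.best_score, self.best_epoch = score, epoch
            self.bad_epochs = 0          # a checkpoint would be saved here
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up dev Macro MCC per epoch, for demonstration only.
dev_mcc = [0.10, 0.18, 0.25, 0.24, 0.31, 0.30, 0.29, 0.28, 0.35]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, score in enumerate(dev_mcc, start=1):
    if stopper.step(epoch, score):
        stopped_at = epoch
        break
```

In this toy run training stops at epoch 8 and the checkpoint from epoch 5 is the one kept, so the late value of 0.35 is never reached. That is the usual trade-off of a small patience.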

PTG — Stochastic Sampling vs. Controlled Generation

[Figure 1: Stochastic Sampling approach]

[Figure 2: Optimized Controlled Generation]

The controlled generation model does not rely on sampling randomness alone for diversity: its multi-objective loss explicitly balances semantic, lexical, and syntactic objectives. It reaches the highest Penalized BLEU in the PTG results (24.08, versus 18.20 for pure stochastic sampling).


🔧 Hyperparameters

Multitask BERT (QQP, STS, SST, ETPC)

| Parameter | Value |
| --- | --- |
| Mode | finetune |
| Epochs | 15 |
| Learning Rate | 1e-5 |
| Weight Decay (L2) | 1e-2 |
| Dropout | 0.3 |
| Batch Size | 16 |
| Optimizer | AdamW |

PTD (BART Detection)

| Parameter | Value |
| --- | --- |
| Epochs | 12 |
| Learning Rate | 5e-5 |
| Weight Decay | 1e-2 |
| Dropout | 0.1 |
| Patience | 3 |
| Batch Size | 16 |

PTG (BART Generation — Bayesian Search Space)

| Parameter | Range |
| --- | --- |
| Learning Rate | 1e-5 → 1e-4 |
| Weight Decay | 1e-5 → 1e-2 |
| Epochs | 10 → 15 |
| Alpha (sem/lex/syn) | 0.1 → 1.0 each |
| Temperature | 0.5 → 1.0 |
| Top-p | 0.80 → 0.95 |
| Repetition Penalty | 1.5 → 3.0 |
| Patience | 4 → 7 |
| Batch Size | 8 |

👥 Member Contributions

⭐ Lead Contributor

Shakhzod Bakhodirov · @WeskerPRO · Matriculation: 18749742

Responsible for the three most technically demanding tasks in the project. Designed and implemented the full controlled generation pipeline for PTG including multi-objective loss and Bayesian optimization, built the Siamese BERT architecture with contrastive learning for STS, and resolved the critical preprocessing failure in ETPC to deliver a working multi-label classifier.

| Task | Contribution |
| --- | --- |
| 🥇 Paraphrase Type Generation (PTG) | Multi-objective loss · Stochastic sampling · Bayesian optimization · BART vs T5 comparison |
| 🥇 Semantic Textual Similarity (STS) | Siamese BERT · Triplet loss · Hard negative mining · Hybrid loss ($\alpha$ analysis) |
| 🥇 ETPC Bonus Task | Concatenated input design · Multi-label classification · Full pipeline fix |

🤖 AI Usage

AI support was documented in our AI Usage Card.


📚 References

  • Devlin et al. (2018) — BERT: Pre-training of Deep Bidirectional Transformers
  • Liu et al. (2019) — RoBERTa: A Robustly Optimized BERT Pretraining Approach
  • Reimers & Gurevych (2019) — Sentence-BERT: Siamese Network Structures
  • Kumar et al. (2021) — Controlled Text Generation as Continuous Optimization
  • Lin et al. (2017) — Focal Loss for Dense Object Detection (RetinaNet)
  • Mushava & Murray (2022) — Flexible Loss Functions for Binary Classification
  • Loshchilov & Hutter (2017) — Decoupled Weight Decay Regularization
  • Ioffe & Szegedy (2015) — Batch Normalization
  • Srivastava et al. (2014) — Dropout: Preventing Neural Networks from Overfitting
  • Müller, Kornblith & Hinton (2019) — When Does Label Smoothing Help?
  • Ramachandran, Zoph & Le (2017) — Searching for Activation Functions (SiLU)