🧠 DNLP SS25 — Gradient Descenders

A comprehensive study of NLP tasks using BERT and BART — sentiment analysis, paraphrase detection, and controlled text generation.

📖 Overview · ⚙️ Setup · 🔬 Methodology · 📊 Results · 👥 Team

📌 Project Info

Field	Details
🏷️ Group Code	06
👩‍🏫 Tutor	Corinna Wegner
👥 Members	Khalid Tariq · Shakhzod Bakhodirov · Alina Amanbayeva

🗺️ Overview

This project explores five core NLP tasks through fine-tuning of BERT and BART foundation models. Our work goes beyond standard baselines by introducing advanced loss functions, architectural improvements, and training strategies:

Task	Model	Key Innovation
Semantic Textual Similarity (STS)	BERT	Siamese network + Triplet Loss
Sentiment Analysis (SST)	BERT	CLS + Mean Pooling + LayerNorm
Paraphrase Detection (QQP)	BERT	Focal Loss + Adversarial Training (FGM)
Paraphrase Type Detection (PTD)	BART	Custom Head + Weighted Sampling
Paraphrase Type Generation (PTG)	BART	Multi-Objective Loss + Bayesian Optimization

⚙️ Setup Instructions

1. Clone the Repository

git clone https://github.com/WeskerPRO/NLP_Project.git
cd NLP_Project

2. Install Dependencies

Option A — Automatic (recommended):

bash setup.sh

Option B — Manual (conda):

conda create -n dnlp python=3.10 -y
conda activate dnlp
conda install pytorch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 pytorch-cuda=12.1 -c pytorch -c nvidia
conda install -y tqdm requests scikit-learn numpy pandas matplotlib seaborn scipy -c conda-forge
conda install -y protobuf transformers tokenizers pyarrow=20.0.0 spacy -c conda-forge
conda install -y sentence-transformers=5.0.0 -c conda-forge
pip install explainaboard-client==0.1.4 sacrebleu==2.5.1 optuna==3.6.0 importlib_metadata

3. Train BERT (Multitask)

# Pick a task from sst, sts, qqp, etpc (sst shown here)
python multitask_classifier.py --option finetune --task=sst --epochs=15 --use_gpu

4. Train BART

# Paraphrase Type Generation
python bart_generation.py --use_gpu

# Paraphrase Type Detection
python bart_detection.py --use_gpu

# Optional: T5 model
python bart_generation.py --model=T5 --use_gpu

🔬 Methodology

1. 📐 Semantic Textual Similarity (STS)

A Siamese network architecture maps sentence pairs into a shared embedding space using a single weight-shared model.

Key techniques:

Triplet Loss with Hard Negative Mining — pulls similar embeddings together, pushes dissimilar ones apart. Hard negatives are mined within each batch for maximum discrimination.
MSE Regression Loss — directly optimizes prediction of the ground-truth similarity score in the [0, 5] range.
Hybrid Loss: $L = \alpha \cdot L_{\text{triplet}} + (1 - \alpha) \cdot L_{\text{regression}}$, with optimal $\alpha = 0.8$.
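The hybrid objective above can be sketched in plain Python. This is only an illustration under stated assumptions: the triplets are taken as given (how anchors, positives, and negatives are derived from the STS scores is not described here), Euclidean distance and the `margin` value are hypothetical choices, and the real training code would operate on PyTorch tensors.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def batch_hard_triplet_loss(anchors, positives, negatives, margin=1.0):
    """Mean triplet loss using the hardest (closest) in-batch negative per anchor."""
    losses = []
    for a, p in zip(anchors, positives):
        d_pos = euclidean(a, p)
        # Hard negative mining: the negative nearest to the anchor.
        d_neg = min(euclidean(a, n) for n in negatives)
        losses.append(max(0.0, d_pos - d_neg + margin))
    return sum(losses) / len(losses)


def mse_loss(predictions, targets):
    """Mean squared error against the ground-truth similarity scores."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def hybrid_loss(l_triplet, l_regression, alpha=0.8):
    """L = alpha * L_triplet + (1 - alpha) * L_regression."""
    return alpha * l_triplet + (1.0 - alpha) * l_regression
```

With $\alpha = 0.8$ the triplet term dominates, so the regression term mostly calibrates the output to the [0, 5] score range.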

2. 💬 Sentiment Analysis (SST)

The vanilla CLS-only baseline suffered from limited expressiveness and overfitting on this small dataset. Three improvements were introduced:

Improvement	Description
Advanced Pooling	Concatenates `[CLS]` embedding with mean of all last-layer hidden states
Layer Normalization	Stabilizes the concatenated representation before the classifier
Strong Dropout (p=0.5)	Acts as ensemble regularization, critical for SST's limited size
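A minimal sketch of the pooling and normalization steps in the table, written in plain Python for readability. It assumes the `[CLS]` vector is the first token, that padding tokens are excluded from the mean via an attention mask, and that the layer norm has no learned scale or shift; none of these details are confirmed by the project code.

```python
import math


def mean_pool(hidden_states, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [h for h, m in zip(hidden_states, attention_mask) if m == 1]
    dim = len(hidden_states[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pooled_representation(hidden_states, attention_mask):
    """Concatenate [CLS] with the mean-pooled vector, then normalize."""
    cls_vec = hidden_states[0]  # assumption: [CLS] is the first token
    return layer_norm(cls_vec + mean_pool(hidden_states, attention_mask))
```

The concatenated vector is twice the hidden size, so the classifier that follows would need an input dimension of `2 * hidden_size`.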

3. 🔁 Paraphrase Detection (QQP)

Five targeted improvements over the baseline:

#	Method	Purpose
1	Advanced Feature Engineering	Element-wise difference, product, cosine similarity between embeddings
2	Custom Paraphrase Head	Learns nonlinear decision boundaries from interaction features
3	Dropout Regularization	Prevents over-reliance on any single interaction type
4	Focal Loss	$FL(p_t) = -\alpha(1-p_t)^\gamma \log(p_t)$ — addresses class imbalance
5	Adversarial Training (FGM)	Adds worst-case embedding perturbations to boost robustness
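To make rows 4 and 5 concrete, here is a small sketch of the focal loss formula as written in the table (with a single $\alpha$) and of the FGM perturbation $r = \epsilon \cdot g / \|g\|_2$. The default values for `alpha`, `gamma`, and `epsilon` are illustrative assumptions, not the settings used in this project.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one example.

    p is the predicted probability of the positive class, y is the 0/1 label.
    """
    p_t = p if y == 1 else 1.0 - p
    p_t = min(max(p_t, 1e-12), 1.0)  # avoid log(0)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


def fgm_perturbation(grad, epsilon=1.0):
    """FGM step: scale the embedding gradient to L2 norm epsilon."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0 for _ in grad]
    return [epsilon * g / norm for g in grad]
```

With `gamma=0` and `alpha=1` the focal loss reduces to ordinary cross-entropy; larger `gamma` shrinks the loss of well-classified examples. In an FGM training step the perturbation is typically added to the embedding weights, a second forward and backward pass accumulates the adversarial gradient, and the original embeddings are restored before the optimizer step.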

4. 🏷️ Paraphrase Type Detection (PTD) — BART

Multi-label classification across 26 paraphrase categories using the ETPC dataset, evaluated with Matthews Correlation Coefficient (MCC).

Improvements:

Class-wise Threshold Tuning — each of the 26 classes gets its own optimal threshold on the dev set
Weighted Random Sampler — rare paraphrase types get higher sampling weight: $w_i = \frac{1}{\text{count}(i) + \epsilon}$
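L2 Regularization — weight decay in AdamW, playing the role of the penalty $\mathcal{L}_{\text{reg}}(\theta) = \mathcal{L}(\theta) + \lambda \|\theta\|_2^2$ (AdamW applies the decay directly in the parameter update rather than adding it to the loss)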
Custom Classifier Head — Linear → SiLU → BatchNorm → Dropout → Output Linear
BCEWithLogitsLoss — handles multi-label prediction with per-class binary cross-entropy
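Class-wise threshold tuning and the sampling weights can be sketched as follows. This is an illustrative version in plain Python: the candidate threshold grid and the function names are assumptions, and the project may search thresholds differently.

```python
import math


def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def tune_thresholds(probs, labels,
                    candidates=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
    """Pick, per class, the candidate threshold with the highest dev-set MCC."""
    best = []
    for c in range(len(probs[0])):
        y_true = [row[c] for row in labels]
        scores = [row[c] for row in probs]
        # Ties are resolved in favor of the first (lowest) candidate.
        best.append(max(
            candidates,
            key=lambda t: mcc(y_true, [int(s >= t) for s in scores]),
        ))
    return best


def sampling_weights(class_counts, eps=1e-6):
    """w_i = 1 / (count(i) + eps): rare classes get larger weights."""
    return [1.0 / (n + eps) for n in class_counts]
```

Here `probs` and `labels` are dev-set matrices with one row per example and one column per class (26 columns for ETPC). The resulting thresholds are then applied unchanged at test time.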

5. ✍️ Paraphrase Type Generation (PTG) — BART

A controlled generation framework trained with a multi-objective loss:

$$L_{\text{total}} = L_{CE} + \alpha_{\text{sem}} \cdot L_{\text{sem}} + \alpha_{\text{lex}} \cdot L_{\text{lex}} + \alpha_{\text{syn}} \cdot L_{\text{syn}}$$

Key contributions:

Stochastic Sampling — top_p and temperature encourage diverse, non-repetitive generation
Multi-Objective Loss — simultaneously optimizes semantic similarity, lexical variation, and syntactic variation
Bayesian Hyperparameter Optimization (Optuna) — intelligently searches learning rate, weight decay, loss weights, and generation parameters
BART vs. T5 Comparison — T5 starts strong (instruction-following pre-training), but over the epochs BART learns to stop copying its input and to generate novel paraphrases
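The stochastic sampling step can be illustrated with a small nucleus (top-p) sampler over a single logit vector. This is a sketch of the general technique, not the project's decoding code; the function names are hypothetical, and with Hugging Face Transformers the same behavior is usually requested through `generate(do_sample=True, top_p=..., temperature=...)`.

```python
import math
import random


def top_p_filter(logits, temperature=1.0, top_p=0.9):
    """Return {token_index: prob} for the smallest set whose mass reaches top_p."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Walk tokens from most to least likely until the nucleus is full.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, temperature=1.0, top_p=0.9, rng=random):
    """Draw one token index from the renormalized nucleus."""
    nucleus = top_p_filter(logits, temperature, top_p)
    tokens = list(nucleus)
    return rng.choices(tokens, weights=[nucleus[t] for t in tokens], k=1)[0]
```

Lower temperatures sharpen the distribution before the cut-off is applied, so fewer tokens survive the filter; higher `top_p` values keep more of the tail and increase diversity.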

📊 Results

BERT Tasks

Task	Model	Metric	Score
SST	Baseline	Accuracy	52.20%
SST	Advanced Pooling + Regularization	Accuracy	55.20%
QQP	Baseline	Accuracy	75.00%
QQP	Feature Eng. + Focal Loss + FGM	Accuracy	85.70%
STS	Baseline (MSE)	Pearson r	0.352
STS	Simple Regression + Regularization	Pearson r	0.375
STS	Siamese BERT + Contrastive Learning	Pearson r	0.672

BART Tasks

Paraphrase Type Detection (PTD)

Configuration	Accuracy	MCC
Baseline	87.50%	0.180
Threshold Tuning	87.50%	0.206
WeightedSampler + BCE + L2	78.00%	0.318
Custom Classifier Head	80.00%	0.400
Optuna + Layer Freezing	56.50%	0.350

Paraphrase Type Generation (PTG)

Configuration	BLEU	Negative BLEU	Penalized BLEU
Baseline	48.44	2.84	2.64
Stochastic Sampling	45.22	20.93	18.20
Multi-Objective Loss	44.08	22.61	19.16
Optimized Controlled Generation	42.02	29.79	24.08

Extended Typology Paraphrase Corpus (ETPC) — Bonus

Configuration	Accuracy	Micro F1	Macro F1	Macro MCC
Baseline	0.00%	0.000	0.000	0.000
Multi-label + Concatenated Inputs	85.60%	0.674	0.147	0.162

📈 Visualizations

PTD — Training & Validation Curve

Training loss decreases consistently. Dev loss flattens around epoch 8, where Macro MCC peaks — confirming the value of early stopping for model selection.
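Patience-based early stopping of this kind can be sketched as below. The class name is hypothetical, and the sketch assumes the monitored value is the dev-set Macro MCC with higher being better.

```python
class EarlyStopping:
    """Stop when the dev metric has not improved for `patience` epochs."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best_score = float("-inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, score):
        """Record one epoch; return True when training should stop."""
        if score > self.best_score:
            self.best_score = score
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop, `step` would be called once per epoch after evaluation, and the checkpoint saved at `best_epoch` is the one kept for testing.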

PTG — Stochastic Sampling vs. Controlled Generation

Figure 1: Stochastic Sampling approach

Figure 2: Optimized Controlled Generation

The controlled generation model does more than generate diverse text: its multi-objective loss explicitly balances semantic, lexical, and syntactic objectives. In the PTG results it reaches a Penalized BLEU of 24.08 versus 18.20 for pure stochastic sampling, at the cost of a lower BLEU (42.02 versus 45.22).

🔧 Hyperparameters

Multitask BERT (QQP, STS, SST, ETPC)

Parameter	Value
Mode	`finetune`
Epochs	`15`
Learning Rate	`1e-05`
Weight Decay (L2)	`1e-2`
Dropout	`0.3`
Batch Size	`16`
Optimizer	`AdamW`

PTD (BART Detection)

Parameter	Value
Epochs	`12`
Learning Rate	`5e-5`
Weight Decay	`1e-2`
Dropout	`0.1`
Patience	`3`
Batch Size	`16`

PTG (BART Generation — Bayesian Search Space)

Parameter	Range
Learning Rate	`1e-5 → 1e-4`
Weight Decay	`1e-5 → 1e-2`
Epochs	`10 → 15`
Alpha (sem/lex/syn)	`0.1 → 1.0` each
Temperature	`0.5 → 1.0`
Top-p	`0.80 → 0.95`
Repetition Penalty	`1.5 → 3.0`
Patience	`4 → 7`
Batch Size	`8`
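The search space in the table could be declared for Optuna roughly as follows. The parameter names and the log-scale choice for learning rate and weight decay are assumptions; `trial` is expected to be an Optuna trial object exposing `suggest_float` and `suggest_int`.

```python
def suggest_ptg_params(trial):
    """Sample one PTG configuration from the ranges listed in the table."""
    return {
        # Assumption: log-uniform sampling for the optimizer settings.
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-4, log=True),
        "weight_decay": trial.suggest_float("weight_decay", 1e-5, 1e-2, log=True),
        "epochs": trial.suggest_int("epochs", 10, 15),
        # Loss weights for the semantic, lexical, and syntactic terms.
        "alpha_sem": trial.suggest_float("alpha_sem", 0.1, 1.0),
        "alpha_lex": trial.suggest_float("alpha_lex", 0.1, 1.0),
        "alpha_syn": trial.suggest_float("alpha_syn", 0.1, 1.0),
        # Generation parameters.
        "temperature": trial.suggest_float("temperature", 0.5, 1.0),
        "top_p": trial.suggest_float("top_p", 0.80, 0.95),
        "repetition_penalty": trial.suggest_float("repetition_penalty", 1.5, 3.0),
        "patience": trial.suggest_int("patience", 4, 7),
        "batch_size": 8,  # fixed, not searched
    }
```

An objective function would call `suggest_ptg_params(trial)`, train with the returned settings, and return the dev metric to optimize; it is then passed to `study.optimize(...)` on a study created with `optuna.create_study(...)`.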

👥 Member Contributions

⭐ Lead Contributor

Shakhzod Bakhodirov · @WeskerPRO · Matriculation: 18749742

Responsible for the three most technically demanding tasks in the project. Designed and implemented the full controlled generation pipeline for PTG including multi-objective loss and Bayesian optimization, built the Siamese BERT architecture with contrastive learning for STS, and resolved the critical preprocessing failure in ETPC to deliver a working multi-label classifier.

Task	Contribution
🥇 Paraphrase Type Generation (PTG)	Multi-objective loss · Stochastic sampling · Bayesian optimization · BART vs T5 comparison
🥇 Semantic Textual Similarity (STS)	Siamese BERT · Triplet loss · Hard negative mining · Hybrid loss ($\alpha$ analysis)
🥇 ETPC Bonus Task	Concatenated input design · Multi-label classification · Full pipeline fix

🤖 AI Usage

AI support was documented in our AI Usage Card.

📚 References

Devlin et al. (2018) — BERT: Pre-training of Deep Bidirectional Transformers
Liu et al. (2019) — RoBERTa: A Robustly Optimized BERT Pretraining Approach
Reimers & Gurevych (2019) — Sentence-BERT: Siamese Network Structures
Kumar et al. (2021) — Controlled Text Generation as Continuous Optimization
Lin et al. (2017) — Focal Loss for Dense Object Detection (RetinaNet)
Mushava & Murray (2022) — Flexible Loss Functions for Binary Classification
Loshchilov & Hutter (2017) — Decoupled Weight Decay Regularization
Ioffe & Szegedy (2015) — Batch Normalization
Srivastava et al. (2014) — Dropout: Preventing Neural Networks from Overfitting
Müller, Kornblith & Hinton (2019) — When Does Label Smoothing Help?
Ramachandran, Zoph & Le (2017) — Searching for Activation Functions (SiLU)
