This project implements Natural Language Inference (NLI) and Dense Passage Retrieval (DPR) using transformer models. The implementation includes fine-tuning DistilBERT for NLI, soft prompt tuning, and training DPR models for question-answer retrieval.
Fine-tune DistilBERT on the NLI task:
```bash
python scripts/run_part1_nli.py
```

This script:
- Loads the NLI dataset from `data/nli/`
- Fine-tunes a DistilBERT model for binary entailment classification
- Uses mixed precision training and gradient accumulation
- Implements learning rate warmup and scheduling (see the training-loop sketch after this list)
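As a rough illustration of how those three techniques fit together, here is a minimal sketch of such a loop using PyTorch AMP and the `transformers` linear-warmup scheduler. The `train_loader` argument and the 10% warmup fraction are assumptions for illustration, not values taken from the script:

```python
import torch
from torch.cuda.amp import GradScaler, autocast
from transformers import get_linear_schedule_with_warmup

def train(model, train_loader, epochs=10, accum_steps=2):
    """Mixed-precision fine-tuning with gradient accumulation and LR warmup.
    train_loader must yield dicts with input_ids, attention_mask, labels."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
    total = epochs * len(train_loader) // accum_steps
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.1 * total),   # 10% warmup is an assumption
        num_training_steps=total)
    scaler = GradScaler()
    for _ in range(epochs):
        for step, batch in enumerate(train_loader):
            batch = {k: v.cuda() for k, v in batch.items()}
            with autocast():                            # fp16 forward pass
                loss = model(**batch).loss / accum_steps
            scaler.scale(loss).backward()               # accumulate gradients
            if (step + 1) % accum_steps == 0:           # 128 x 2 -> effective 256
                scaler.step(optimizer)
                scaler.update()
                scheduler.step()
                optimizer.zero_grad()
```

Dividing the loss by `accum_steps` keeps gradient magnitudes comparable to a single large batch, which is why the effective batch size doubles without extra memory.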
Train soft prompts on the frozen DistilBERT model:
```bash
python scripts/run_part2_prompting.py
```

This script:
- Freezes the fine-tuned DistilBERT model
- Trains soft prompts with different configurations (p=5, 10, 20)
- Compares prompt tuning performance across these lengths (see the sketch below)
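One common way to implement this (an assumption about this script's internals) is to prepend `p` trainable vectors to the token embeddings of the frozen model, so that only those vectors receive gradients. The checkpoint path and the initialization scale below are hypothetical:

```python
import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification

class SoftPromptClassifier(nn.Module):
    """Prepends p trainable prompt vectors to a frozen DistilBERT classifier."""
    def __init__(self, base, p=10):
        super().__init__()
        self.base = base
        for param in self.base.parameters():     # freeze every base weight
            param.requires_grad = False
        dim = base.config.dim                    # 768 for DistilBERT
        self.prompt = nn.Parameter(torch.randn(p, dim) * 0.02)

    def forward(self, input_ids, attention_mask, labels=None):
        tok = self.base.get_input_embeddings()(input_ids)      # (B, T, H)
        B = input_ids.size(0)
        prompts = self.prompt.unsqueeze(0).expand(B, -1, -1)   # (B, p, H)
        mask = torch.cat([torch.ones(B, self.prompt.size(0),
                                     device=attention_mask.device),
                          attention_mask], dim=1)
        return self.base(inputs_embeds=torch.cat([prompts, tok], dim=1),
                         attention_mask=mask, labels=labels)

# "results/part1_nli" is a hypothetical path to the Part 1 checkpoint.
base = AutoModelForSequenceClassification.from_pretrained("results/part1_nli")
model = SoftPromptClassifier(base, p=10)
optimizer = torch.optim.AdamW([model.prompt], lr=1e-3)  # only the prompt trains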
Train and evaluate DPR models:
```bash
python scripts/run_part3_dpr.py
```

This script:
- Loads question-answer pairs from `data/qa/`
- Trains separate encoders for questions and passages
- Uses contrastive loss with in-batch negative sampling (sketched after this list)
- Evaluates using Recall@k and Mean Reciprocal Rank (MRR)
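The core of the in-batch objective is compact: each question's gold passage is the positive, every other passage in the batch is a negative, and the loss reduces to cross-entropy over a similarity matrix. A minimal sketch, assuming dot-product similarity (DPR's standard choice) and no temperature:

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q_emb, p_emb):
    """q_emb, p_emb: (B, D). Row i of p_emb is the gold passage for
    question i; the other B-1 rows act as negatives for free."""
    scores = q_emb @ p_emb.T                    # (B, B) similarity matrix
    targets = torch.arange(q_emb.size(0), device=q_emb.device)
    return F.cross_entropy(scores, targets)     # diagonal entries = positives
```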
Training Configuration:
- Model: DistilBERT (distilbert-base-uncased)
- Batch Size: 128 (effective 256 with gradient accumulation)
- Learning Rate: 2e-5
- Epochs: 10
- Weight Decay: 0.01
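If the script drives training through the Hugging Face `Trainer` (an assumption; a manual loop like the sketch above would work equally well), this configuration maps roughly to:

```python
from transformers import TrainingArguments

# Hypothetical mapping of the configuration above; output_dir and
# warmup_ratio are illustrative values, not taken from the script.
args = TrainingArguments(
    output_dir="results/part1_nli",
    per_device_train_batch_size=128,
    gradient_accumulation_steps=2,    # 128 x 2 = effective batch of 256
    learning_rate=2e-5,
    num_train_epochs=10,
    weight_decay=0.01,
    fp16=True,                        # mixed precision
    warmup_ratio=0.1,
)
```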
Performance:
| Epoch | Training Loss | Validation F1 Score | Global Step |
|---|---|---|---|
| 1 | 0.3963 | 0.9038 | 1,563 |
| 2 | 0.2460 | 0.9179 | 3,126 |
| 3 | 0.2175 | 0.9225 | 4,689 |
| 4 | 0.2000 | 0.9276 | 6,252 |
| 5 | 0.1866 | 0.9309 | 7,815 |
| 6 | 0.1763 | 0.9321 | 9,378 |
| 7 | 0.1665 | 0.9343 | 10,941 |
| 8 | 0.1583 | 0.9366 | 12,504 |
| 9 | 0.1507 | 0.9382 | 14,067 |
| 10 | 0.1443 | 0.9390 | 15,630 |
Example Evaluations:
The model was evaluated on 3 randomly selected validation examples:
- Example 1 ✓
  - Premise: "A group of kids listening to their band instructor, and reading music off their papers."
  - Hypothesis: "The kids are not reading."
  - True Label: Entailment (1)
  - Predicted: Entailment (1)
  - Confidence: 0.9541
- Example 2 ✓
  - Premise: "A man cooking food on the stove."
  - Hypothesis: "A man is making hot food."
  - True Label: Not Entailment (0)
  - Predicted: Not Entailment (0)
  - Confidence: 0.7855
- Example 3 ✗
  - Premise: "A baby in an indoor pool is using an inflatable tube on it's own"
  - Hypothesis: "The baby is inside"
  - True Label: Not Entailment (0)
  - Predicted: Entailment (1)
  - Confidence: 0.6960
Configuration:
- Frozen Model: DistilBERT (from Part 1)
- Prompt Lengths: p=5, 10, 20
- Learning Rate: 1e-3
- Epochs: 3
Performance by Configuration:

p = 5:
| Epoch | Train Loss | Train Acc | Val Loss | Val Acc |
|---|---|---|---|---|
| 1 | 0.1839 | 0.9275 | 0.1599 | 0.9375 |
| 2 | 0.1742 | 0.9321 | 0.1594 | 0.9376 |
| 3 | 0.1711 | 0.9335 | 0.1595 | 0.9373 |
p = 10:

| Epoch | Train Loss | Train Acc | Val Loss | Val Acc |
|---|---|---|---|---|
| 1 | 0.2168 | 0.9125 | 0.1647 | 0.9379 |
| 2 | 0.1836 | 0.9274 | 0.1613 | 0.9384 |
| 3 | 0.1747 | 0.9319 | 0.1600 | 0.9367 |
p = 20:

| Epoch | Train Loss | Train Acc | Val Loss | Val Acc |
|---|---|---|---|---|
| 1 | 0.2277 | 0.9075 | 0.1640 | 0.9366 |
| 2 | 0.1805 | 0.9288 | 0.1607 | 0.9372 |
| 3 | 0.1738 | 0.9319 | 0.1600 | 0.9373 |
Configuration:
- Model: ELECTRA-small (google/electra-small-discriminator)
- Contrastive Loss: In-batch negative sampling
- Embedding Dimension: 256
- Max Length: 16 tokens
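A minimal sketch of one encoder tower under this configuration. The first-token pooling and the linear projection head are assumptions about how the 256-dimensional embeddings are produced (ELECTRA-small's hidden size is already 256, so the projection only remaps it):

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class Tower(nn.Module):
    """One DPR encoder: ELECTRA-small with a linear head to 256 dims."""
    def __init__(self, dim=256):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(
            "google/electra-small-discriminator")
        self.proj = nn.Linear(self.backbone.config.hidden_size, dim)

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        return self.proj(out.last_hidden_state[:, 0])  # first-token pooling

tok = AutoTokenizer.from_pretrained("google/electra-small-discriminator")
q_enc, p_enc = Tower(), Tower()      # separate question and passage encoders
batch = tok(["When should garlic go in?"], max_length=16,
            truncation=True, padding="max_length", return_tensors="pt")
q_emb = q_enc(batch["input_ids"], batch["attention_mask"])   # (1, 256)
```

Keeping the question and passage encoders as separate towers lets passage embeddings be precomputed and indexed once, with only the question encoded at query time.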
Initial Evaluation (Before Training):
- Recall@3: 0.125 (12.5%)
- Mean Reciprocal Rank (MRR): 0.125
The DPR system retrieves relevant passages for questions using learned dense representations. Contrastive learning with in-batch negatives makes training efficient: every other passage in a batch serves as a negative, so no explicit negative mining is required.
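For reference, both evaluation metrics can be computed directly from the two embedding matrices. A minimal sketch, assuming passage i is the gold match for question i:

```python
import torch

def recall_and_mrr(q_emb, p_emb, k=3):
    """q_emb, p_emb: (N, D); passage i is the gold match for question i.
    Returns (Recall@k, MRR) over the N questions."""
    scores = q_emb @ p_emb.T                           # (N, N) similarities
    order = scores.argsort(dim=1, descending=True)     # passages ranked per query
    gold = torch.arange(q_emb.size(0)).unsqueeze(1)
    rank = (order == gold).int().argmax(dim=1)         # 0-based rank of gold
    recall_at_k = (rank < k).float().mean().item()
    mrr = (1.0 / (rank + 1).float()).mean().item()
    return recall_at_k, mrr
```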
Validation Examples:
Question 1:
- Title: "What type of beer is best for beer battered fish?"
- Body: "I was looking for a beer battered fish recipe the other day when I noticed most of the recipes don't state a style of beer to use. Some of the recipes use a significant amount of beer so I assume that some of the flavor profile from the beer will carry over to the fish. So I'm wondering, which style is ideal? Porter? IPA? Maybe a Hefeweizen?"
- True Answer: "The primary use of beer in a beer batter is its alcohol, which disrupts gluten formation and needs less heat than water to evaporate, improving the texture of the final crust. For flavor, most recipes using beer do best with a malty, low-bitterness beer, like a marzen, scotch ale, or (maybe) amber ale. Highly-hopped 'put hair on your chest' IPAs are a bad idea: you don't want that bitterness. Hefeweizen would be fine."
- Top Retrieved Passages: Retrieved passages were not directly relevant to the question, indicating the need for training.
Question 2:
- Title: "When sauteing should I put onion or garlic first?"
- Body: "Most of the dishes here in the Philippines involved sauteing. But I am a little bit confused on what should I put first, are there any advantages on it?"
- True Answer: "Onions always benefit from a few minutes on their own to soften and start sweetening. Garlic burns easily, especially when finely chopped or crushed, so in general should not be fried as long as onion. Having said that, when doing a quick stir fry or similar dish, you can throw in the garlic first for 10-20 seconds so that it flavours the oil."
- Top Retrieved Passages: Retrieved passages were not directly relevant to the question.
Question 3:
- Title: "How long do unrefrigerated opened canned peppers last?"
- Body: "I received a couple homemade cans of banana peppers that were canned with a jalapeno in each for some extra heat. I absolutely love the taste of them, but I am curious how long they can last once opened when there isn't access to refrigeration."
- True Answer: "Given the vinegar, these sound like pickled peppers. Pickled items are usually made to last, even when not refrigerated -- preservation was the original purpose of pickling..."
- Top Retrieved Passages: One of the top retrieved passages correctly matched the true answer about pickled peppers, demonstrating some retrieval capability even before training.

