Conversation

@zhenhuan-yang

Summary

This PR adds a new medium-difficulty Deep Learning question on computing the Direct Preference Optimization (DPO) loss for language model alignment.
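
For context, the quantity the question asks learners to compute is the standard DPO objective derived from the Bradley-Terry preference model (stated here from the DPO paper for reference, not quoted from the new question's text):

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

where $y_w$ and $y_l$ are the preferred and rejected responses, $\pi_{\text{ref}}$ is the frozen reference policy, and $\beta$ controls the strength of the implicit KL constraint.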

Question Details

  • ID: 189
  • Title: Compute Direct Preference Optimization Loss
  • Difficulty: Medium
  • Category: Deep Learning

Implementation

  • ✅ Complete solution with proper numerical stability using np.log1p (see the sketch after this list)
  • ✅ Comprehensive educational content covering DPO theory and Bradley-Terry model
  • ✅ Mathematical formulation with LaTeX
  • ✅ 4 diverse test cases with varying parameters
  • ✅ Example with detailed reasoning
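
As a rough illustration of the numerical-stability point above, here is a minimal NumPy sketch of a DPO loss using `np.log1p`; the function name, argument names, and default `beta` are hypothetical and not taken from the PR's actual solution:

```python
import numpy as np

def dpo_loss(chosen_logps, rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Hypothetical DPO loss sketch; inputs are per-example log pi(y|x)."""
    # Implicit reward margins of the policy relative to the reference model
    chosen_margin = np.asarray(chosen_logps) - np.asarray(ref_chosen_logps)
    rejected_margin = np.asarray(rejected_logps) - np.asarray(ref_rejected_logps)
    z = beta * (chosen_margin - rejected_margin)

    # -log sigmoid(z) computed as log(1 + exp(-z)) via log1p,
    # which avoids the precision loss of evaluating sigmoid directly
    losses = np.log1p(np.exp(-z))
    return losses.mean()

# Example call with toy log-probabilities
loss = dpo_loss(
    chosen_logps=[-1.2, -0.8],
    rejected_logps=[-2.0, -1.5],
    ref_chosen_logps=[-1.5, -1.0],
    ref_rejected_logps=[-1.8, -1.6],
    beta=0.5,
)
```

The `log1p` form matches the $-\log \sigma(z)$ term of the objective above while staying accurate when $\sigma(z)$ is close to 1.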

Validation

  • ✅ Build successful
  • ✅ Schema validation passed
  • ✅ All test cases pass

Educational Value

The question covers an important modern technique for LLM alignment that is simpler and more stable to train than traditional RLHF, making it highly relevant for current ML practitioners.
