New options for preference tuning: rpo_alpha, logprobs normalization, reference-free, simpo_gamma #327
Conversation
```python
    training_method_cls = TrainingMethodSFT(train_on_inputs=train_on_inputs)
elif training_method == "dpo":
    training_method_cls = TrainingMethodDPO(dpo_beta=dpo_beta)
    if simpo_gamma is not None and simpo_gamma > 0:
```
By the way, should we raise a ValueError if it's <=0?
Added, and added the same check for rpo_alpha (can't imagine a use case for negative values for these parameters).
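For reference, a minimal sketch of the validation discussed above; the helper name `validate_simpo_gamma` and the exact error message are illustrative, not the code merged in this PR:

```python
from typing import Optional


def validate_simpo_gamma(simpo_gamma: Optional[float]) -> None:
    """Reject a non-positive simpo_gamma instead of silently ignoring it."""
    if simpo_gamma is not None and simpo_gamma <= 0:
        raise ValueError("simpo_gamma should be positive when specified")
```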
```python
if rpo_alpha is not None:
    if training_method != "dpo":
        raise ValueError("rpo_alpha is only supported for DPO training")
    if not rpo_alpha >= 0.0:
```
Maybe it's wise to put an upper limit too
Not sure what a good limit would be here; let's say 10? Wdyt?
I'm not sure we should be enforcing any particular limit on this value, although it might be helpful. The problem is that this limit will apply only when users submit jobs via together-python.
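To make the trade-off concrete, here is an illustrative client-side range check with both bounds. The helper name, the `RPO_ALPHA_UPPER_BOUND` constant, and the value 10 (only floated in this thread) are assumptions rather than merged code, and as noted above such a limit would only be enforced for jobs submitted through together-python:

```python
from typing import Optional

# Hypothetical cap suggested in the review; not part of the quoted diff.
RPO_ALPHA_UPPER_BOUND = 10.0


def validate_rpo_alpha(rpo_alpha: Optional[float], training_method: str) -> None:
    """Client-side range check for rpo_alpha; runs only for SDK-submitted jobs."""
    if rpo_alpha is None:
        return
    if training_method != "dpo":
        raise ValueError("rpo_alpha is only supported for DPO training")
    if rpo_alpha < 0.0:
        raise ValueError("rpo_alpha should be non-negative")
    if rpo_alpha > RPO_ALPHA_UPPER_BOUND:
        raise ValueError(f"rpo_alpha should not exceed {RPO_ALPHA_UPPER_BOUND}")
```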
```python
    raise ValueError(
        "dpo_normalize_logratios_by_length=True is only supported for DPO training"
    )
if rpo_alpha is not None:
```
This could simply be `if rpo_alpha`.
A bit below I want to notify the user that `rpo_alpha == 0.0` throws an error.
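A quick standalone illustration (not taken from the PR) of the difference between the two checks when the user explicitly passes `rpo_alpha=0.0`:

```python
rpo_alpha = 0.0  # user explicitly passes zero

# Truthiness check: 0.0 is falsy, so the value never reaches validation
# and the user gets no feedback about it.
if rpo_alpha:
    print("truthiness check: value reaches validation")

# Explicit None check: 0.0 still enters the validation branch, so the
# error about rpo_alpha == 0.0 can actually be raised.
if rpo_alpha is not None:
    print("None check: value reaches validation")
```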
src/together/resources/finetune.py (Outdated)
length* (sorry for being nit-picky)
Have you read the Contributing Guidelines?
Issue #
Describe your changes
Clearly and concisely describe what's in this pull request. Include screenshots, if necessary.