Add PandaLM RLAIF example and enhance RFT notebooks #678
w601sxs merged 3 commits into aws-samples:main from
Conversation
@@ -0,0 +1,724 @@
{
Looks OK overall except for the placement of this comparison table. Maybe this table should be in the high-level README of the RFT folder?
Good callout, moved further up.
w601sxs
left a comment
minor comments on a few files in this commit
Change "predicted reward" to just "reward"? Or something better? Suggesting this since the reward is not "predicted".
return {
    "messages": [
        {"role": "system", "content": "You are a helpful math tutor who solves word problems step by step."},
        {"role": "user", "content": f"{row['question']} Let's think step by step and output the final answer after \"####\"."}
Thought it's better to do the final answer in \boxed{}.
\boxed{} is used for most math datasets (e.g. MATH, Minerva) and is the standard LaTeX convention; however, the above format matches the source data of GSM8K found on HF.
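To make the two answer conventions concrete, here is a minimal sketch (function names are illustrative, not from this PR) of extracting a final answer from either a GSM8K-style "####" marker or a LaTeX \boxed{} expression:

```python
import re
from typing import Optional

def extract_gsm8k_answer(text: str) -> Optional[str]:
    """Extract the final answer after the GSM8K-style '####' marker."""
    m = re.search(r"####\s*(.+)", text)
    return m.group(1).strip() if m else None

def extract_boxed_answer(text: str) -> Optional[str]:
    """Extract the final answer from a LaTeX \\boxed{...} expression."""
    m = re.search(r"\\boxed\{([^{}]*)\}", text)
    return m.group(1).strip() if m else None
```

Either convention works for RFT, as long as the reward function parses the same marker the prompt asks the model to emit.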
    matches[-1].replace("$", "").replace("%", "").replace(",", "").strip()
)
try:
    return float(num_str)
Trying to think of a condition where the answer is something like "In 2007, a total amount of $25 Million was spent on...". Then "2007," gets matched first and returned, but the answer is $25 Million (best case, $25 is returned without the Million).
Fixed: prioritize $-prefixed amounts over bare numbers and handle magnitude suffixes (million/billion/k) so "$25 Million" correctly extracts as 25000000 instead of matching "2007" first.
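A sketch of the kind of fix described (the regex details are assumptions, not the actual diff): match $-prefixed amounts with an optional magnitude suffix first, and fall back to the last bare number only when no dollar amount is present.

```python
import re

# Hypothetical magnitude table for suffixes like "$3.5k" or "$25 Million".
_SUFFIX = {"k": 1e3, "thousand": 1e3, "million": 1e6, "billion": 1e9}

def extract_numeric_answer(text: str):
    """Return the most plausible numeric answer from free-form model output."""
    # First pass: dollar amounts with an optional magnitude word, e.g. "$25 Million".
    dollar = re.findall(r"\$\s*([\d,]+(?:\.\d+)?)\s*(k|thousand|million|billion)?",
                        text, flags=re.IGNORECASE)
    if dollar:
        num_str, suffix = dollar[-1]
        value = float(num_str.replace(",", ""))
        return value * _SUFFIX[suffix.lower()] if suffix else value
    # Fallback: last bare number in the text (the original behaviour).
    bare = re.findall(r"-?[\d,]+(?:\.\d+)?", text)
    if bare:
        try:
            return float(bare[-1].replace(",", ""))
        except ValueError:
            return None
    return None
```

With this ordering, "In 2007, a total amount of $25 Million was spent..." resolves to 25000000 rather than matching "2007" first.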
w601sxs
left a comment
Thanks for all the changes!
Description of changes:
This PR adds a new RLAIF (Reinforcement Learning from AI Feedback) example using the PandaLM dataset and improves the existing RFT notebooks.
New additions:
- `nova_pandalm_rft.ipynb` - Demonstrates LLM-as-judge evaluation training
- `pandalm_rew_func.py` - Uses Bedrock to score model evaluations

Notebook improvements (applied to GSM8K, FinQA, PandaLM):
- `maxPromptLength` and `inferenceMaxTokens`
- Training data (`train-6k.jsonl`)
- IAM role (`BedrockRFT-{dataset}-Role`)

README updates
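As a rough illustration of the RLAIF reward path (the function names and judge prompt below are assumptions, not the actual contents of `pandalm_rew_func.py`), an LLM-as-judge reward function built on the Bedrock Converse API might look like:

```python
def build_judge_prompt(instruction: str, response_a: str, response_b: str) -> str:
    """Format a PandaLM-style pairwise comparison prompt for the judge model."""
    return (
        "Compare the two responses to the instruction below and reply with "
        "'1' if Response 1 is better, '2' if Response 2 is better, or 'Tie'.\n\n"
        f"Instruction: {instruction}\n\n"
        f"Response 1: {response_a}\n\n"
        f"Response 2: {response_b}\n"
    )

def parse_judgment(judge_output: str) -> float:
    """Map the judge's verdict to a scalar reward for Response 1 (the policy)."""
    verdict = judge_output.strip().lower()
    if verdict.startswith("1"):
        return 1.0
    if verdict.startswith("2"):
        return 0.0
    return 0.5  # tie or unparseable verdict

def score_with_bedrock(client, model_id, instruction, response_a, response_b):
    """Query a Bedrock judge model via the Converse API and return a reward."""
    result = client.converse(
        modelId=model_id,
        messages=[{
            "role": "user",
            "content": [{"text": build_judge_prompt(instruction, response_a, response_b)}],
        }],
    )
    return parse_judgment(result["output"]["message"]["content"][0]["text"])
```

Here `client` would be a `boto3` `bedrock-runtime` client; the prompt construction and verdict parsing are pure Python and can be unit-tested without AWS access.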