
Add PandaLM RLAIF example and enhance RFT notebooks #678

Merged

w601sxs merged 3 commits into aws-samples:main from mccartnick:rft-br
Feb 26, 2026

Conversation

@mccartnick
Contributor

Description of changes:

This PR adds a new RLAIF (Reinforcement Learning from AI Feedback) example using the PandaLM dataset and improves the existing RFT notebooks.

New additions

  • PandaLM notebook (nova_pandalm_rft.ipynb) - Demonstrates LLM-as-judge evaluation training
  • PandaLM reward function (pandalm_rew_func.py) - Uses Bedrock to score model evaluations
  • Architecture diagrams explaining RLVR vs RLAIF approaches
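To illustrate the LLM-as-judge idea behind the PandaLM reward function: a function like pandalm_rew_func.py asks a Bedrock judge model to grade the policy's evaluation, then maps the judge's textual verdict to a scalar reward. The sketch below shows only the response-parsing half; the "Score: n/10" format and the function name are assumptions for illustration, not the PR's actual implementation.

```python
import re

# Illustrative sketch only: the real pandalm_rew_func.py invokes Bedrock as a
# judge; this shows just the verdict-parsing half. The score format below is
# an assumption, not the PR's actual prompt contract.
def parse_judge_score(judge_response: str) -> float:
    """Map a judge's textual verdict to a reward in [0.0, 1.0].

    Expects the judge to end its response with a line like "Score: 7/10".
    Falls back to 0.0 when no score can be found.
    """
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)\s*/\s*(\d+)", judge_response)
    if not match:
        return 0.0
    score, scale = float(match.group(1)), float(match.group(2))
    if scale <= 0:
        return 0.0
    # Clamp to [0, 1] in case the judge over-reports.
    return min(max(score / scale, 0.0), 1.0)
```

A strict fallback to 0.0 (rather than a mid-scale default) keeps unparseable judge outputs from being rewarded during training.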

Notebook improvements (applied to GSM8K, FinQA, PandaLM)

  • Dataset analysis cell to help set maxPromptLength and inferenceMaxTokens
  • Full dataset preprocessing with size info in filenames (e.g., train-6k.jsonl)
  • Test reward function with 10 real samples (5 train + 5 val) before training
  • Use-case specific Bedrock IAM roles (BedrockRFT-{dataset}-Role)
  • Unique job/model names with date and hyperparameters for tracking
  • Inline comments explaining hyperparameter trade-offs
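The dataset-analysis cell described above can be sketched roughly as follows. The ~4-characters-per-token heuristic, the function names, and the rounding rule are assumptions for illustration; the actual notebooks may use a proper tokenizer.

```python
import statistics

# Hypothetical sketch of a dataset-analysis cell: estimate prompt lengths to
# choose maxPromptLength. The ~4 chars/token heuristic is an assumption.
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def analyze_prompt_lengths(rows: list) -> dict:
    """Summarize prompt-length statistics for chat-format rows."""
    lengths = [
        estimate_tokens(" ".join(m["content"] for m in row["messages"]))
        for row in rows
    ]
    ordered = sorted(lengths)
    p95_index = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return {
        "max": max(lengths),
        "p95": ordered[p95_index],
        "mean": round(statistics.mean(lengths), 1),
    }

rows = [
    {"messages": [{"role": "user", "content": "What is 2 + 2?"}]},
    {"messages": [{"role": "user", "content": "A train travels 60 miles in 1.5 hours. What is its speed?"}]},
]
stats = analyze_prompt_lengths(rows)
# Set maxPromptLength a bit above the observed maximum, rounded up to a
# multiple of 64 (the rounding granularity is an arbitrary choice here).
max_prompt_length = ((stats["max"] + 63) // 64) * 64
```

Sizing maxPromptLength from the observed distribution, rather than guessing, avoids both silent truncation of long prompts and wasted context budget.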

README updates

  • Added PandaLM to examples table
  • New section explaining RLVR vs RLAIF with architecture diagrams
  • Symbol explanations for the RL training loop
  • Updated directory structure and code examples

@review-notebook-app
Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

@@ -0,0 +1,724 @@
{
Contributor

@w601sxs Feb 10, 2026

Looks OK overall except for the placement of this comparison table. Maybe this table should be in the high-level README of the RFT folder?



Contributor Author

Good callout, moved further up.

Contributor

@w601sxs left a comment

minor comments on a few files in this commit

Contributor

Change "predicted reward" to just "reward"? Or something better? Suggesting this since the reward is not "predicted".

Contributor Author

Changed.

return {
    "messages": [
        {"role": "system", "content": "You are a helpful math tutor who solves word problems step by step."},
        {"role": "user", "content": f"{row['question']} Let's think step by step and output the final answer after \"####\"."},
    ]
}
Contributor

Thought it's better to put the final answer in \boxed{}.

Contributor Author

For most math datasets (e.g. MATH, Minerva), \boxed{} is the standard LaTeX convention; however, the format above matches the GSM8K source data on Hugging Face.

num_str = (
    matches[-1].replace("$", "").replace("%", "").replace(",", "").strip()
)
try:
    return float(num_str)
Contributor

Trying to think of a condition where the answer is something like "In 2007, a total amount of $25 Million was spent on...". Then "2007," gets matched first and returned, but the answer is $25 Million (best case, $25 is returned without the "Million").

Contributor Author

Fixed: prioritize $-prefixed amounts over bare numbers and handle magnitude suffixes (million/billion/k) so "$25 Million" correctly extracts as 25000000 instead of matching "2007" first.
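The fix described in that reply might look roughly like the sketch below; the function name, magnitude table, and regexes are illustrative assumptions, not the PR's actual code.

```python
import re

# Illustrative sketch of the fix described above, NOT the PR's actual code:
# prefer $-prefixed amounts over bare numbers and expand magnitude suffixes.
_MAGNITUDE = {"k": 1e3, "thousand": 1e3, "million": 1e6, "billion": 1e9}

def extract_amount(text: str):
    """Extract a numeric answer, preferring amounts like "$25 Million"."""
    # First pass: $-prefixed numbers with an optional magnitude word.
    dollar = re.findall(
        r"\$\s*([\d,]+(?:\.\d+)?)\s*(k|thousand|million|billion)?",
        text,
        flags=re.IGNORECASE,
    )
    if dollar:
        num_str, suffix = dollar[-1]
        value = float(num_str.replace(",", ""))
        return value * _MAGNITUDE[suffix.lower()] if suffix else value
    # Fallback: last bare number in the text (the old behavior).
    bare = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if bare:
        try:
            return float(bare[-1].replace(",", ""))
        except ValueError:
            return None
    return None
```

On the reviewer's example, the $-prefixed pass fires before the bare-number fallback ever sees "2007", so "$25 Million" wins.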

Contributor

@w601sxs left a comment

Thanks for all the changes!

@w601sxs w601sxs merged commit cd9b964 into aws-samples:main Feb 26, 2026

2 participants