
Add PandaLM RLAIF example and enhance RFT notebooks #678

Merged

w601sxs merged 3 commits into aws-samples:main from mccartnick:rft-br
Feb 26, 2026

Conversation

@mccartnick
Contributor

Description of changes:

This PR adds a new RLAIF (Reinforcement Learning from AI Feedback) example using the PandaLM dataset and improves the existing RFT notebooks.

New additions

  • PandaLM notebook (nova_pandalm_rft.ipynb) - Demonstrates LLM-as-judge evaluation training
  • PandaLM reward function (pandalm_rew_func.py) - Uses Bedrock to score model evaluations
  • Architecture diagrams explaining RLVR vs RLAIF approaches
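To illustrate the LLM-as-judge idea behind the PandaLM reward function: a function like pandalm_rew_func.py asks a Bedrock judge model to grade the policy's evaluation, then maps the judge's textual verdict to a scalar reward. The sketch below shows only the response-parsing half; the "Score: n/10" format and the function name are assumptions for illustration, not the PR's actual implementation.

```python
import re

# Illustrative sketch only: the real pandalm_rew_func.py invokes Bedrock as a
# judge; this shows just the verdict-parsing half. The score format below is
# an assumption, not the PR's actual prompt contract.
def parse_judge_score(judge_response: str) -> float:
    """Map a judge's textual verdict to a reward in [0.0, 1.0].

    Expects the judge to end its response with a line like "Score: 7/10".
    Falls back to 0.0 when no score can be found.
    """
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)\s*/\s*(\d+)", judge_response)
    if not match:
        return 0.0
    score, scale = float(match.group(1)), float(match.group(2))
    if scale <= 0:
        return 0.0
    # Clamp to [0, 1] in case the judge over-reports.
    return min(max(score / scale, 0.0), 1.0)
```

A strict fallback to 0.0 (rather than a mid-scale default) keeps unparseable judge outputs from being rewarded during training.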

Notebook improvements (applied to GSM8K, FinQA, PandaLM)

  • Dataset analysis cell to help set maxPromptLength and inferenceMaxTokens
  • Full dataset preprocessing with size info in filenames (e.g., train-6k.jsonl)
  • Test reward function with 10 real samples (5 train + 5 val) before training
  • Use-case specific Bedrock IAM roles (BedrockRFT-{dataset}-Role)
  • Unique job/model names with date and hyperparameters for tracking
  • Inline comments explaining hyperparameter trade-offs
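The dataset-analysis cell described above can be sketched roughly as follows. The ~4-characters-per-token heuristic, the function names, and the rounding rule are assumptions for illustration; the actual notebooks may use a proper tokenizer.

```python
import statistics

# Hypothetical sketch of a dataset-analysis cell: estimate prompt lengths to
# choose maxPromptLength. The ~4 chars/token heuristic is an assumption.
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def analyze_prompt_lengths(rows: list) -> dict:
    """Summarize prompt-length statistics for chat-format rows."""
    lengths = [
        estimate_tokens(" ".join(m["content"] for m in row["messages"]))
        for row in rows
    ]
    ordered = sorted(lengths)
    p95_index = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return {
        "max": max(lengths),
        "p95": ordered[p95_index],
        "mean": round(statistics.mean(lengths), 1),
    }

rows = [
    {"messages": [{"role": "user", "content": "What is 2 + 2?"}]},
    {"messages": [{"role": "user", "content": "A train travels 60 miles in 1.5 hours. What is its speed?"}]},
]
stats = analyze_prompt_lengths(rows)
# Set maxPromptLength a bit above the observed maximum, rounded up to a
# multiple of 64 (the rounding granularity is an arbitrary choice here).
max_prompt_length = ((stats["max"] + 63) // 64) * 64
```

Sizing maxPromptLength from the observed distribution, rather than guessing, avoids both silent truncation of long prompts and wasted context budget.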

README updates

  • Added PandaLM to examples table
  • New section explaining RLVR vs RLAIF with architecture diagrams
  • Symbol explanations for the RL training loop
  • Updated directory structure and code examples

@review-notebook-app
Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

@@ -0,0 +1,724 @@
{
Contributor

@w601sxs Feb 10, 2026

Looks OK overall except for the placement of this comparison table. Maybe this table should be in the high-level README of the RFT folder?



Contributor Author

Good callout, moved further up.

Contributor

@w601sxs left a comment

minor comments on a few files in this commit

Contributor

Change "predicted reward" to just "reward"? Or something better? Suggesting this since the reward is not "predicted".

Contributor Author

Changed.

return {
    "messages": [
        {"role": "system", "content": "You are a helpful math tutor who solves word problems step by step."},
        {"role": "user", "content": f"{row['question']} Let's think step by step and output the final answer after \"####\"."},
    ]
}
Contributor

Thought it's better to put the final answer in \boxed{}.

Contributor Author

For most math datasets (e.g. MATH, Minerva), \boxed{} is the standard LaTeX convention; however, the format above matches the GSM8K source data on Hugging Face.

num_str = (
    matches[-1].replace("$", "").replace("%", "").replace(",", "").strip()
)
try:
    return float(num_str)
Contributor

Trying to think of a condition where the answer is something like "In 2007, a total amount of $25 Million was spent on...". Then "2007," gets matched first and returned, but the answer is $25 Million (best case, $25 is returned without the "Million").

Contributor Author

Fixed: prioritize $-prefixed amounts over bare numbers and handle magnitude suffixes (million/billion/k) so "$25 Million" correctly extracts as 25000000 instead of matching "2007" first.
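The fix described in that reply might look roughly like the sketch below; the function name, magnitude table, and regexes are illustrative assumptions, not the PR's actual code.

```python
import re

# Illustrative sketch of the fix described above, NOT the PR's actual code:
# prefer $-prefixed amounts over bare numbers and expand magnitude suffixes.
_MAGNITUDE = {"k": 1e3, "thousand": 1e3, "million": 1e6, "billion": 1e9}

def extract_amount(text: str):
    """Extract a numeric answer, preferring amounts like "$25 Million"."""
    # First pass: $-prefixed numbers with an optional magnitude word.
    dollar = re.findall(
        r"\$\s*([\d,]+(?:\.\d+)?)\s*(k|thousand|million|billion)?",
        text,
        flags=re.IGNORECASE,
    )
    if dollar:
        num_str, suffix = dollar[-1]
        value = float(num_str.replace(",", ""))
        return value * _MAGNITUDE[suffix.lower()] if suffix else value
    # Fallback: last bare number in the text (the old behavior).
    bare = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if bare:
        try:
            return float(bare[-1].replace(",", ""))
        except ValueError:
            return None
    return None
```

On the reviewer's example, the $-prefixed pass fires before the bare-number fallback ever sees "2007", so "$25 Million" wins.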

Contributor

@w601sxs left a comment

Thanks for all the changes!

@w601sxs w601sxs merged commit cd9b964 into aws-samples:main Feb 26, 2026

2 participants