feat: integrate RL code and add docu #6
Merged
`rl/controlled/main_grpo_qwen14b_dapo_speed.sh` (new file, +78 lines):

```shell
#!/bin/bash
# SPDX-FileCopyrightText: (c) UIUC PurpCode Team
#
# SPDX-License-Identifier: Apache-2.0

# The config is optimized for 8xH200
# Assumes vLLM >= 0.8 so that the V1 engine is enabled by default
# Depends on: https://github.com/ganler/verl/tree/opt
set -eux

# IMPORTANT: checkout the specialized verl repository to the `opt-dapo-ds` branch instead of `opt`

export PYTHONPATH=$(pwd)

python -c "import rl.data"

if [ -z "${CUDA_VISIBLE_DEVICES+x}" ]; then
    GPUS_PER_NODE=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
else
    GPUS_PER_NODE=$(echo "$CUDA_VISIBLE_DEVICES" | awk -F',' '{print NF}')
fi

# Tips for reducing VRAM usage
# 1. Reduce MICRO_BATCH_PER_GPU (and increase GRAD_ACCUM_STEPS accordingly)
# 2. Reduce the factor (8) in PPO_MAX_TOKEN_LEN_PER_GPU, e.g. to 4

# MAIN CONFIG
DATASET=code-r1-46k-leetcode2k-kodcode-rl-codesec-78k-rl-secqa-11k-rl-safety-8k-single-turn
MODEL_PATH="outputs/purpcode-14b-ctxdistill"
MICRO_BATCH_PER_GPU=48
ROLLOUT_N_SAMPLE=8
MAX_PROMPT_LEN=2048
MAX_RESPONSE_LEN=3072
MAX_EPOCHS=1

# AUTO VALUES
ROLLOUT_N_QUERY=$((MICRO_BATCH_PER_GPU * GPUS_PER_NODE))
PPO_MAX_TOKEN_LEN_PER_GPU=$((8 * (MAX_PROMPT_LEN + MAX_RESPONSE_LEN)))

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=local_data/$DATASET/train.parquet \
    data.val_files=local_data/$DATASET/test.parquet \
    data.filter_overlong_prompts=True \
    data.train_batch_size=$ROLLOUT_N_QUERY \
    +data.max_roll_factor=4 \
    data.max_prompt_length=$MAX_PROMPT_LEN \
    data.max_response_length=$MAX_RESPONSE_LEN \
    actor_rollout_ref.actor.optim.lr=5e-7 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.model.path=$MODEL_PATH \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=$ROLLOUT_N_QUERY \
    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=$PPO_MAX_TOKEN_LEN_PER_GPU \
    actor_rollout_ref.actor.use_dynamic_bsz=True \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
    actor_rollout_ref.rollout.n=$ROLLOUT_N_SAMPLE \
    actor_rollout_ref.rollout.enforce_eager=False \
    actor_rollout_ref.rollout.free_cache_engine=False \
    algorithm.kl_ctrl.kl_coef=0.001 \
    +algorithm.filter_groups.enable=True \
    trainer.critic_warmup=0 \
    trainer.logger=['wandb'] \
    trainer.project_name='purpcode' \
    trainer.experiment_name=${DATASET}-dapo-speed \
    trainer.nnodes=1 \
    trainer.default_local_dir=./models/purpcode-rl-${DATASET}-14b-dapo-speed \
    trainer.n_gpus_per_node=$GPUS_PER_NODE \
    trainer.save_freq=32 \
    trainer.test_freq=16 \
    trainer.total_epochs=$MAX_EPOCHS \
    trainer.resume_mode=auto \
    +custom_reward_function.path=./rl/grouped_reward.py \
    reward_model.reward_manager=group "$@" 2>&1 | tee grpo.log
```
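For concreteness, the AUTO VALUES work out as follows on the 8-GPU node the config targets. This is a standalone recomputation of the script's arithmetic, not part of the script itself:

```shell
# Recompute the AUTO VALUES for the 8xH200 setup the config targets.
MICRO_BATCH_PER_GPU=48
GPUS_PER_NODE=8
MAX_PROMPT_LEN=2048
MAX_RESPONSE_LEN=3072

# 48 prompts per GPU x 8 GPUs = 384 prompts per training batch.
ROLLOUT_N_QUERY=$((MICRO_BATCH_PER_GPU * GPUS_PER_NODE))

# 8 x (2048 + 3072) = 40960 tokens per GPU for dynamic batching.
PPO_MAX_TOKEN_LEN_PER_GPU=$((8 * (MAX_PROMPT_LEN + MAX_RESPONSE_LEN)))

echo "$ROLLOUT_N_QUERY $PPO_MAX_TOKEN_LEN_PER_GPU"
```

With `ROLLOUT_N_SAMPLE=8`, each training step therefore samples 384 × 8 = 3072 rollouts, which is why the VRAM tips suggest shrinking `MICRO_BATCH_PER_GPU` or the token-length factor first.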
The second script (the `direct-rl` ablation, +77 lines) is near-identical, differing mainly in its starting checkpoint (`models/Qwen2.5-14B-Instruct-1M` instead of the context-distilled model), the experiment name, and the output directory:

```shell
#!/bin/bash

# SPDX-FileCopyrightText: (c) UIUC PurpCode Team
#
# SPDX-License-Identifier: Apache-2.0

# The config is optimized for 8xH200
# Assumes vLLM >= 0.8 so that the V1 engine is enabled by default
# Depends on: https://github.com/ganler/verl/tree/opt
set -eux

export PYTHONPATH=$(pwd)

python -c "import rl.data"

if [ -z "${CUDA_VISIBLE_DEVICES+x}" ]; then
    GPUS_PER_NODE=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
else
    GPUS_PER_NODE=$(echo "$CUDA_VISIBLE_DEVICES" | awk -F',' '{print NF}')
fi

# Tips for reducing VRAM usage
# 1. Reduce MICRO_BATCH_PER_GPU (and increase GRAD_ACCUM_STEPS accordingly)
# 2. Reduce the factor (8) in PPO_MAX_TOKEN_LEN_PER_GPU, e.g. to 4

# MAIN CONFIG
DATASET=code-r1-46k-leetcode2k-kodcode-rl-codesec-78k-rl-secqa-11k-rl-safety-8k-single-turn
MODEL_PATH="models/Qwen2.5-14B-Instruct-1M"
MICRO_BATCH_PER_GPU=48
ROLLOUT_N_SAMPLE=8
MAX_PROMPT_LEN=2048
MAX_RESPONSE_LEN=3072
MAX_EPOCHS=1

# AUTO VALUES
ROLLOUT_N_QUERY=$((MICRO_BATCH_PER_GPU * GPUS_PER_NODE))
PPO_MAX_TOKEN_LEN_PER_GPU=$((8 * (MAX_PROMPT_LEN + MAX_RESPONSE_LEN)))

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=local_data/$DATASET/train.parquet \
    data.val_files=local_data/$DATASET/test.parquet \
    data.filter_overlong_prompts=True \
    data.train_batch_size=$ROLLOUT_N_QUERY \
    +data.max_roll_factor=4 \
    data.max_prompt_length=$MAX_PROMPT_LEN \
    data.max_response_length=$MAX_RESPONSE_LEN \
    actor_rollout_ref.actor.optim.lr=5e-7 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.model.path=$MODEL_PATH \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=$ROLLOUT_N_QUERY \
    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=$PPO_MAX_TOKEN_LEN_PER_GPU \
    actor_rollout_ref.actor.use_dynamic_bsz=True \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
    actor_rollout_ref.rollout.n=$ROLLOUT_N_SAMPLE \
    actor_rollout_ref.rollout.enforce_eager=False \
    actor_rollout_ref.rollout.free_cache_engine=False \
    algorithm.kl_ctrl.kl_coef=0.001 \
    +algorithm.filter_groups.enable=True \
    trainer.critic_warmup=0 \
    trainer.logger=['wandb'] \
    trainer.project_name='purpcode' \
    trainer.experiment_name=${DATASET}-direct-rl \
    trainer.nnodes=1 \
    trainer.default_local_dir=./models/purpcode-rl-${DATASET}-14b-direct-rl-rebuttal \
    trainer.n_gpus_per_node=$GPUS_PER_NODE \
    trainer.save_freq=32 \
    trainer.test_freq=16 \
    trainer.total_epochs=$MAX_EPOCHS \
    trainer.resume_mode=auto \
    +custom_reward_function.path=./rl/grouped_reward.py \
    reward_model.reward_manager=group "$@" 2>&1 | tee grpo.log
```
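Both scripts share the same GPU-count detection: trust `CUDA_VISIBLE_DEVICES` when it is set, otherwise count devices via `nvidia-smi`. Factored into a function (a sketch; the scripts inline this logic), the `CUDA_VISIBLE_DEVICES` branch can be sanity-checked without a GPU:

```shell
# Count visible GPUs the same way both scripts do.
count_gpus() {
    if [ -z "${CUDA_VISIBLE_DEVICES+x}" ]; then
        # Variable unset: ask nvidia-smi for the full device list.
        nvidia-smi --query-gpu=name --format=csv,noheader | wc -l
    else
        # Variable set: count comma-separated device IDs.
        echo "$CUDA_VISIBLE_DEVICES" | awk -F',' '{print NF}'
    fi
}

# Exercise the CUDA_VISIBLE_DEVICES branch in a subshell so the
# assignment does not leak into the calling environment.
N=$(CUDA_VISIBLE_DEVICES="0,1,2,3"; count_gpus)
echo "$N"   # 4
```

One caveat of this logic: a set-but-empty `CUDA_VISIBLE_DEVICES` yields a count of 0 (awk sees zero fields), which would zero out `ROLLOUT_N_QUERY` downstream.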
Review discussion:

gemini-code-assist: There's a contradiction in the setup instructions. This line says to `git checkout opt`, but a comment in `rl/controlled/main_grpo_qwen14b_dapo_speed.sh` specifies that the `opt-dapo-ds` branch should be used for that script. Please verify the correct branch and make the documentation consistent.

Author: @gemini-code-assist how about adding a warning in `rl/controlled/main_grpo_qwen14b_dapo_speed.sh`?

gemini-code-assist: You're right, the primary issue is the contradiction in the `README.md`. I recommend updating `README.md:130` to reflect the correct branch. If `opt-dapo-ds` is required for `main_grpo_qwen14b_dapo_speed.sh`, change `git checkout opt` to `git checkout opt-dapo-ds`.

Author: @copilot if you can chat, can you help me add a warning in `rl/controlled/main_grpo_qwen14b_dapo_speed.sh`? since @gemini-code-assist is going the other way...

gemini-code-assist: I understand your point about adding a warning to `rl/controlled/main_grpo_qwen14b_dapo_speed.sh`. However, the `README.md` serves as the primary documentation for setting up the project, and it is crucial that it provides accurate and consistent instructions to avoid confusing anyone following the setup steps. My previous suggestion to update `README.md:130` directly resolves the contradiction at the source, ensuring the main setup guide is correct. A warning in the script might alert users who run that specific script, but it doesn't fix the misleading information in the `README.md` itself; keeping the documentation consistent is generally the best practice for long-term maintainability and user experience. Therefore, I still recommend updating `README.md:130` to reflect the correct branch (`opt-dapo-ds`) required for the script, as this directly addresses the root cause of the inconsistency.
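The in-script warning the author asks for could look like the following sketch for `rl/controlled/main_grpo_qwen14b_dapo_speed.sh`. The `VERL_DIR` location is an assumption (the PR does not say where the `ganler/verl` fork is checked out), so adjust it to the actual path:

```shell
# Hypothetical branch guard for main_grpo_qwen14b_dapo_speed.sh: warn early
# if the verl checkout is not on the opt-dapo-ds branch this script expects.
VERL_DIR="${VERL_DIR:-../verl}"   # assumed location of the ganler/verl fork
EXPECTED_BRANCH=opt-dapo-ds

# Resolve the current branch; fall back to "unknown" if the path is not a repo.
CURRENT_BRANCH=$(git -C "$VERL_DIR" rev-parse --abbrev-ref HEAD 2>/dev/null || echo unknown)

if [ "$CURRENT_BRANCH" != "$EXPECTED_BRANCH" ]; then
    echo "WARNING: verl at $VERL_DIR is on '$CURRENT_BRANCH'," \
         "but this script expects '$EXPECTED_BRANCH'" >&2
fi
```

This only warns rather than aborting, so it complements (and does not replace) fixing the `README.md` instruction itself.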