This document provides detailed instructions for evaluating the Dream model on GSM8K math problem solving and HumanEval code generation tasks.
Before running any evaluation, set the following environment variables:

```bash
export HF_ALLOW_CODE_EVAL=1
export HF_DATASETS_TRUST_REMOTE_CODE=true
```

## GSM8K

GSM8K is a dataset of 8.5K grade school math problems designed to evaluate mathematical reasoning capabilities.
```bash
task=gsm8k
length=256
block_length=32
num_fewshot=5
steps=$((length / block_length))
model="Dream-org/Dream-v0-Base-7B"
```

**Baseline**
```bash
accelerate launch eval.py --model dream \
    --model_args pretrained=${model},max_new_tokens=${length},diffusion_steps=${length},add_bos_token=true,alg=entropy,show_speed=True \
    --tasks ${task} \
    --num_fewshot ${num_fewshot} \
    --batch_size 1
```

**Prefix Cache**
```bash
accelerate launch eval.py --model dream \
    --model_args pretrained=${model},max_new_tokens=${length},diffusion_steps=${length},add_bos_token=true,alg=entropy,use_cache=true,show_speed=True \
    --tasks ${task} \
    --num_fewshot ${num_fewshot} \
    --batch_size 1
```

**Parallel Generation**
```bash
accelerate launch eval.py --model dream \
    --model_args pretrained=${model},max_new_tokens=${length},diffusion_steps=${steps},add_bos_token=true,alg=confidence_threshold,threshold=0.9,show_speed=True \
    --tasks ${task} \
    --num_fewshot ${num_fewshot} \
    --batch_size 1
```

**Prefix Cache + Parallel**
```bash
accelerate launch eval.py --model dream \
    --model_args pretrained=${model},max_new_tokens=${length},diffusion_steps=${steps},add_bos_token=true,alg=confidence_threshold,threshold=0.9,use_cache=true \
    --tasks ${task} \
    --num_fewshot ${num_fewshot} \
    --batch_size 1
```

**Dual Cache + Parallel**
```bash
accelerate launch eval.py --model dream \
    --model_args pretrained=${model},max_new_tokens=${length},diffusion_steps=${steps},add_bos_token=true,alg=confidence_threshold,threshold=0.9,use_cache=true,dual_cache=true \
    --tasks ${task} \
    --num_fewshot ${num_fewshot} \
    --batch_size 1
```

Parameter reference:

- `task`: Evaluation task (gsm8k)
- `length`: Generation length
- `block_length`: Block size for parallel generation
- `num_fewshot`: Number of few-shot examples
- `steps`: Number of generation steps (`length / block_length`)
- `model`: Model name (Dream-v0-Base-7B)
- `use_cache`: Enable prefix cache
- `dual_cache`: Enable dual cache
- `threshold`: Confidence threshold for parallel generation
- `show_speed`: Display speed metrics
- `alg`: Generation algorithm (`entropy` or `confidence_threshold`)
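As a sanity check on how these parameters relate, the number of diffusion steps used by the parallel-generation runs is simply the generation length divided by the block size:

```shell
# Diffusion steps for parallel generation: one step per block.
length=256
block_length=32
steps=$((length / block_length))
echo "steps=${steps}"   # 256 / 32 = 8
```

With the defaults above, the parallel runs therefore take 8 denoising steps instead of the baseline's 256 (`diffusion_steps=${length}`), which is the source of the speedup.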
## HumanEval

HumanEval is a dataset of 164 Python programming problems designed to evaluate code generation capabilities.

```bash
task=humaneval
length=256
block_length=32
steps=$((length / block_length))
model="Dream-org/Dream-v0-Base-7B"
```

**Baseline**
```bash
accelerate launch eval.py --model dream \
    --model_args pretrained=${model},max_new_tokens=${length},diffusion_steps=${length},add_bos_token=true,alg=entropy,show_speed=True,escape_until=true \
    --tasks ${task} \
    --batch_size 1 \
    --output_path evals_results/baseline/humaneval-ns0-${length} --log_samples \
    --confirm_run_unsafe_code
```

**Prefix Cache**
```bash
accelerate launch eval.py --model dream \
    --model_args pretrained=${model},max_new_tokens=${length},diffusion_steps=${length},add_bos_token=true,alg=entropy,use_cache=true,show_speed=True,escape_until=true \
    --tasks ${task} \
    --batch_size 1 \
    --output_path evals_results/cache/humaneval-ns0-${length} --log_samples \
    --confirm_run_unsafe_code
```

**Parallel Generation**
```bash
accelerate launch eval.py --model dream \
    --model_args pretrained=${model},max_new_tokens=${length},diffusion_steps=${steps},add_bos_token=true,alg=confidence_threshold,threshold=0.9,show_speed=True,escape_until=true \
    --tasks ${task} \
    --batch_size 1 \
    --output_path evals_results/parallel/humaneval-ns0-${length} --log_samples \
    --confirm_run_unsafe_code
```

**Prefix Cache + Parallel**
```bash
accelerate launch eval.py --model dream \
    --model_args pretrained=${model},max_new_tokens=${length},diffusion_steps=${steps},add_bos_token=true,alg=confidence_threshold,threshold=0.9,use_cache=true,escape_until=true \
    --tasks ${task} \
    --batch_size 1 \
    --output_path evals_results/cache_parallel/humaneval-ns0-${length} --log_samples \
    --confirm_run_unsafe_code
```

**Dual Cache + Parallel**
```bash
accelerate launch eval.py --model dream \
    --model_args pretrained=${model},max_new_tokens=${length},diffusion_steps=${steps},add_bos_token=true,alg=confidence_threshold,threshold=0.9,use_cache=true,dual_cache=true,escape_until=true \
    --tasks ${task} \
    --batch_size 1 \
    --output_path evals_results/dual_cache_parallel/humaneval-ns0-${length} --log_samples \
    --confirm_run_unsafe_code
```

Additional flags for HumanEval:

- `escape_until`: Enable escape-until handling for code generation
- `confirm_run_unsafe_code`: Confirm running unsafe (generated) code during evaluation
- `log_samples`: Log generated samples for analysis
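The confidence threshold controls how aggressively tokens are finalized in parallel; the commands above fix it at 0.9. A hypothetical sweep over several thresholds (the 0.7 and 0.8 values are illustrative, not from these instructions) can be sketched as a dry run that only prints each command:

```shell
# Dry-run sketch: print one eval command per candidate threshold.
# Swap `echo "$cmd"` for an actual invocation to launch the runs.
model="Dream-org/Dream-v0-Base-7B"
length=256
block_length=32
steps=$((length / block_length))
for threshold in 0.7 0.8 0.9; do
  cmd="accelerate launch eval.py --model dream --model_args pretrained=${model},max_new_tokens=${length},diffusion_steps=${steps},add_bos_token=true,alg=confidence_threshold,threshold=${threshold},escape_until=true --tasks humaneval --batch_size 1 --confirm_run_unsafe_code"
  echo "$cmd"
done
```

Printing before running keeps the sketch safe to execute and makes it easy to eyeball the generated `--model_args` strings.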
For HumanEval evaluation, post-processing is required:

```bash
python postprocess_code.py {the samples_xxx.jsonl file under output_path}
```

Notes:

- All evaluations use the Dream-v0-Base-7B model
- Results are saved in the `evals_results` directory
- For HumanEval, samples are logged for post-processing
- Speed metrics are shown for all evaluations
- Different optimization strategies can be combined, as in the Prefix Cache + Parallel and Dual Cache + Parallel commands above
- HumanEval evaluation requires additional safety confirmations (`--confirm_run_unsafe_code`)
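The post-processing step can be applied across all runs with a glob over the output layout used in the commands above. This is a sketch under the assumption that each run writes a `samples_*.jsonl` file under its `--output_path`; the exact filenames depend on the harness run:

```shell
# Sketch: post-process every logged HumanEval samples file.
# Assumes the evals_results/<strategy>/humaneval-ns0-<length>/ layout above.
count=0
for f in evals_results/*/humaneval-ns0-*/samples_*.jsonl; do
  [ -e "$f" ] || continue   # skip when the glob matches nothing
  python postprocess_code.py "$f"
  count=$((count + 1))
done
echo "post-processed ${count} file(s)"
```

The `[ -e "$f" ]` guard makes the loop a no-op when no evaluations have been run yet, so the script is safe to invoke at any time.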