This document provides detailed instructions for evaluating the Dream model on GSM8K math problem solving and HumanEval code generation tasks.
Before running any evaluation, set the following environment variables:

```bash
export HF_ALLOW_CODE_EVAL=1
export HF_DATASETS_TRUST_REMOTE_CODE=true
```

## GSM8K

GSM8K is a dataset of 8.5K grade school math problems designed to evaluate mathematical reasoning capabilities.
```bash
task=gsm8k
length=256
block_length=32
num_fewshot=5
steps=$((length / block_length))
model="Dream-org/Dream-v0-Base-7B"
```

**Baseline**
```bash
accelerate launch eval.py --model dream \
    --model_args pretrained=${model},max_new_tokens=${length},diffusion_steps=${length},add_bos_token=true,alg=entropy,show_speed=True \
    --tasks ${task} \
    --num_fewshot ${num_fewshot} \
    --batch_size 1
```

**Prefix Cache**
```bash
accelerate launch eval.py --model dream \
    --model_args pretrained=${model},max_new_tokens=${length},diffusion_steps=${length},add_bos_token=true,alg=entropy,use_cache=true,show_speed=True \
    --tasks ${task} \
    --num_fewshot ${num_fewshot} \
    --batch_size 1
```

**Parallel Generation**
```bash
accelerate launch eval.py --model dream \
    --model_args pretrained=${model},max_new_tokens=${length},diffusion_steps=${steps},add_bos_token=true,alg=confidence_threshold,threshold=0.9,show_speed=True \
    --tasks ${task} \
    --num_fewshot ${num_fewshot} \
    --batch_size 1
```

**Prefix Cache + Parallel**
```bash
accelerate launch eval.py --model dream \
    --model_args pretrained=${model},max_new_tokens=${length},diffusion_steps=${steps},add_bos_token=true,alg=confidence_threshold,threshold=0.9,use_cache=true \
    --tasks ${task} \
    --num_fewshot ${num_fewshot} \
    --batch_size 1
```

**Dual Cache + Parallel**
```bash
accelerate launch eval.py --model dream \
    --model_args pretrained=${model},max_new_tokens=${length},diffusion_steps=${steps},add_bos_token=true,alg=confidence_threshold,threshold=0.9,use_cache=true,dual_cache=true \
    --tasks ${task} \
    --num_fewshot ${num_fewshot} \
    --batch_size 1
```

Parameter reference:

- `task`: Evaluation task (gsm8k)
- `length`: Generation length
- `block_length`: Block size for parallel generation
- `num_fewshot`: Number of few-shot examples
- `steps`: Number of generation steps (`length / block_length`)
- `model`: Model name (Dream-v0-Base-7B)
- `use_cache`: Enable prefix cache
- `dual_cache`: Enable dual cache
- `threshold`: Confidence threshold for parallel generation
- `show_speed`: Display speed metrics
- `alg`: Generation algorithm (`entropy` or `confidence_threshold`)
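As a sanity check on how these parameters relate, the number of diffusion steps used by the parallel-generation runs is simply the generation length divided by the block size:

```shell
# Diffusion steps for parallel generation: one step per block.
length=256
block_length=32
steps=$((length / block_length))
echo "steps=${steps}"   # 256 / 32 = 8
```

With the defaults above, the parallel runs therefore take 8 denoising steps instead of the baseline's 256 (`diffusion_steps=${length}`), which is the source of the speedup.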
## HumanEval

HumanEval is a dataset of 164 Python programming problems designed to evaluate code generation capabilities.

```bash
task=humaneval
length=256
block_length=32
steps=$((length / block_length))
model="Dream-org/Dream-v0-Base-7B"
```

**Baseline**
```bash
accelerate launch eval.py --model dream \
    --model_args pretrained=${model},max_new_tokens=${length},diffusion_steps=${length},add_bos_token=true,alg=entropy,show_speed=True,escape_until=true \
    --tasks ${task} \
    --batch_size 1 \
    --output_path evals_results/baseline/humaneval-ns0-${length} --log_samples \
    --confirm_run_unsafe_code
```

**Prefix Cache**
```bash
accelerate launch eval.py --model dream \
    --model_args pretrained=${model},max_new_tokens=${length},diffusion_steps=${length},add_bos_token=true,alg=entropy,use_cache=true,show_speed=True,escape_until=true \
    --tasks ${task} \
    --batch_size 1 \
    --output_path evals_results/cache/humaneval-ns0-${length} --log_samples \
    --confirm_run_unsafe_code
```

**Parallel Generation**
```bash
accelerate launch eval.py --model dream \
    --model_args pretrained=${model},max_new_tokens=${length},diffusion_steps=${steps},add_bos_token=true,alg=confidence_threshold,threshold=0.9,show_speed=True,escape_until=true \
    --tasks ${task} \
    --batch_size 1 \
    --output_path evals_results/parallel/humaneval-ns0-${length} --log_samples \
    --confirm_run_unsafe_code
```

**Prefix Cache + Parallel**
```bash
accelerate launch eval.py --model dream \
    --model_args pretrained=${model},max_new_tokens=${length},diffusion_steps=${steps},add_bos_token=true,alg=confidence_threshold,threshold=0.9,use_cache=true,escape_until=true \
    --tasks ${task} \
    --batch_size 1 \
    --output_path evals_results/cache_parallel/humaneval-ns0-${length} --log_samples \
    --confirm_run_unsafe_code
```

**Dual Cache + Parallel**
```bash
accelerate launch eval.py --model dream \
    --model_args pretrained=${model},max_new_tokens=${length},diffusion_steps=${steps},add_bos_token=true,alg=confidence_threshold,threshold=0.9,use_cache=true,dual_cache=true,escape_until=true \
    --tasks ${task} \
    --batch_size 1 \
    --output_path evals_results/dual_cache_parallel/humaneval-ns0-${length} --log_samples \
    --confirm_run_unsafe_code
```

Additional flags for HumanEval:

- `escape_until`: Enable escape-until handling for code generation
- `confirm_run_unsafe_code`: Confirm running unsafe (generated) code during evaluation
- `log_samples`: Log generated samples for analysis
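The confidence threshold controls how aggressively tokens are finalized in parallel; the commands above fix it at 0.9. A hypothetical sweep over several thresholds (the 0.7 and 0.8 values are illustrative, not from these instructions) can be sketched as a dry run that only prints each command:

```shell
# Dry-run sketch: print one eval command per candidate threshold.
# Swap `echo "$cmd"` for an actual invocation to launch the runs.
model="Dream-org/Dream-v0-Base-7B"
length=256
block_length=32
steps=$((length / block_length))
for threshold in 0.7 0.8 0.9; do
  cmd="accelerate launch eval.py --model dream --model_args pretrained=${model},max_new_tokens=${length},diffusion_steps=${steps},add_bos_token=true,alg=confidence_threshold,threshold=${threshold},escape_until=true --tasks humaneval --batch_size 1 --confirm_run_unsafe_code"
  echo "$cmd"
done
```

Printing before running keeps the sketch safe to execute and makes it easy to eyeball the generated `--model_args` strings.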
For HumanEval evaluation, post-processing is required:

```bash
python postprocess_code.py {the samples_xxx.jsonl file under output_path}
```

Notes:

- All evaluations use the Dream-v0-Base-7B model
- Results are saved in the `evals_results` directory
- For HumanEval, samples are logged for post-processing
- Speed metrics are shown for all evaluations
- Different optimization strategies can be combined, as in the Prefix Cache + Parallel and Dual Cache + Parallel commands above
- HumanEval evaluation requires additional safety confirmations (`--confirm_run_unsafe_code`)
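The post-processing step can be applied across all runs with a glob over the output layout used in the commands above. This is a sketch under the assumption that each run writes a `samples_*.jsonl` file under its `--output_path`; the exact filenames depend on the harness run:

```shell
# Sketch: post-process every logged HumanEval samples file.
# Assumes the evals_results/<strategy>/humaneval-ns0-<length>/ layout above.
count=0
for f in evals_results/*/humaneval-ns0-*/samples_*.jsonl; do
  [ -e "$f" ] || continue   # skip when the glob matches nothing
  python postprocess_code.py "$f"
  count=$((count + 1))
done
echo "post-processed ${count} file(s)"
```

The `[ -e "$f" ]` guard makes the loop a no-op when no evaluations have been run yet, so the script is safe to invoke at any time.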