
# Guideline

The processed data will be saved in the `data/train` and `data/test` directories. The naming convention is `<domain>__<dataset_name>_<dataset_size>`, where `<domain>` is currently one of `math`, `codegen`, `logic`, `simulation`, `table`.
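For illustration only: a hypothetical logic split with 10000 examples would be saved as `data/train/logic__zebra_puzzle_10000.parquet`; the actual `<dataset_size>` suffix is produced by each preprocessing script.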

## Prepare the training data from scratch

### Math

#### DAPO+OR1

```bash
python data_preprocess/math/dapo_or1_merge_dedup_apr30.py
```

(See the comments in the script if you want to apply LLM-as-judge.)

Note:

- Original: OR1 (105055) + DAPO (17917)
- After dedup: 117192
- After merging and removing instances whose answers exceed 100 characters (script): 116632
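The merge script above is the source of truth; below is a minimal sketch of the dedup-and-filter logic described in the note, assuming each record is a dict with hypothetical `prompt` and `answer` fields:

```python
# Minimal illustrative sketch; see dapo_or1_merge_dedup_apr30.py for the
# actual implementation. Field names ("prompt", "answer") are assumptions.
def merge_dedup_filter(or1, dapo, max_answer_len=100):
    seen, merged = set(), []
    for example in or1 + dapo:
        key = example["prompt"]  # dedup on the question text (assumed key)
        if key in seen:
            continue
        seen.add(key)
        if len(example["answer"]) > max_answer_len:
            continue  # drop instances with overly long answers
        merged.append(example)
    return merged
```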

#### DeepScaler

```bash
python data_preprocess/math/deepscaler_preview.py --train-sample-size <train_sample_size>
```
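Here and throughout, `<train_sample_size>` (and later `<test_sample_size>`) is a placeholder for the number of examples to sample. For example, with a hypothetical size of 40000:

```bash
python data_preprocess/math/deepscaler_preview.py --train-sample-size 40000
```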

### Code

#### leetcode2k

```bash
python data_preprocess/codegen/leetcode2k.py --train-sample-size <train_sample_size>
```

#### taco

```bash
python data_preprocess/codegen/taco.py --train-sample-size <train_sample_size>
```

#### primeintellect

```bash
python data_preprocess/codegen/primeintellect.py --train-sample-size <train_sample_size>
```

#### humaneval

```bash
python data_preprocess/codegen/humaneval.py
```

#### mbpp

```bash
python data_preprocess/codegen/mbpp.py
```

#### livecodebench

```bash
python data_preprocess/codegen/livecodebench.py
```

### Logic

#### zebra_puzzle_dataset

```bash
python data_preprocess/logic/zebrapuzzle_gen/puzzle_generator.py --output_dir data/raw --num_puzzles <num_puzzles> --num_processes <num_processes>
cd ..
python data_preprocess/logic/process_zebrapuzzle_dataset.py
```

#### graph_logical_dataset

```bash
uv pip install pybind11
uv pip install Faker==37.1.0
cd data_preprocess/logic/graph_dataset_gen/
python logic.py --num_samples <num_samples>
cd ../../..  # return to Reasoning360
python data_preprocess/logic/process_graph_dataset.py
```

#### ordering_puzzle_dataset

```bash
uv pip install Faker==37.1.0
python data_preprocess/logic/puzzle_gen.py --test True --num_puzzles <num_puzzles>
python data_preprocess/logic/process_puzzles_dataset.py
```

#### ARC-AGI

```bash
python data_preprocess/logic/arcagi.py --name arcagi1
python data_preprocess/logic/arcagi.py --name arcagi2
python data_preprocess/logic/barc.py --train-sample-size <train_sample_size> --test-sample-size <test_sample_size>
```

### Simulation

```bash
python data_preprocess/simulation/codeio.py --train-sample-size <train_sample_size> --test-sample-size <test_sample_size>
```

### Table

```bash
uv pip install gdown
python data_preprocess/table/multihier.py
```

### STEM

```bash
python data_preprocess/stem/webinstruct.py
```

Congratulations, all raw data is now preprocessed into `.parquet` files. You can check the `model_filtering/` directory to perform difficulty-level filtering.
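To sanity-check an output split, you can load it with pandas (a minimal sketch; the file name below is hypothetical, and the columns depend on the preprocessing script):

```python
import pandas as pd

# Hypothetical file name; substitute one of your generated splits.
df = pd.read_parquet("data/train/math__dapo_or1_116632.parquet")
print(len(df))      # number of examples
print(df.columns)   # schema produced by the preprocessing script
print(df.iloc[0])   # inspect a single example
```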

## Add a new dataset

1. Add a new script at `data_preprocess/<domain>/<dataset_name>.py` (a minimal skeleton is sketched below).
2. Add a new entry in `tests/data_process/test_data_preprocess.py`.
3. Run `pytest tests/data_process` to check the functionality of the data preprocessing scripts.
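A new preprocessing script typically just needs to emit train/test parquet files that follow the naming convention above. Below is a minimal illustrative skeleton; the `prompt`/`answer` schema, domain, and dataset name are assumptions, so mirror an existing script in `data_preprocess/` for the fields the pipeline actually expects:

```python
# data_preprocess/<domain>/<dataset_name>.py -- illustrative skeleton only.
import os

import pandas as pd


def build_examples():
    # Replace with real loading/cleaning logic for your dataset.
    return [{"prompt": "1 + 1 = ?", "answer": "2"}]


def main():
    df = pd.DataFrame(build_examples())
    os.makedirs("data/train", exist_ok=True)
    # Naming convention: <domain>__<dataset_name>_<dataset_size>
    out_path = f"data/train/math__toy_{len(df)}.parquet"
    df.to_parquet(out_path, index=False)
    print(f"Wrote {len(df)} examples to {out_path}")


if __name__ == "__main__":
    main()
```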