The processed data will be saved in the data/train and data/test directories.
The naming convention is <domain>__<dataset_name>_<dataset_size>, where <domain> is currently one of math, codegen, logic, simulation, or table.
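As an illustration, an output path can be derived from this convention as sketched below (the dataset name and size are made-up example values, not real outputs of the scripts):

```python
import os

def processed_path(split, domain, dataset_name, dataset_size):
    # data/<split>/<domain>__<dataset_name>_<dataset_size>.parquet
    fname = f"{domain}__{dataset_name}_{dataset_size}.parquet"
    return os.path.join("data", split, fname)

# Illustrative values only; the actual names are produced by the scripts.
print(processed_path("train", "math", "deepscaler", "40k"))
```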
DAPO+OR1
python data_preprocess/math/dapo_or1_merge_dedup_apr30.py
(See the comment in the script if you want to apply LLM-as-a-judge.)
Note:
Original: OR1 (105055) + DAPO (17917)
After dedup: 117192
After merging and removing instances with answers longer than 100 characters (see the script): 116632
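The merge-dedup-filter step above can be sketched roughly as follows; this is a minimal illustration, assuming each record is a dict with "prompt" and "answer" keys (a hypothetical schema, not necessarily the script's actual one):

```python
def merge_dedup_filter(or1, dapo, max_answer_len=100):
    """Merge two record lists, dedup on the prompt, drop long answers."""
    seen = set()
    merged = []
    for record in or1 + dapo:
        key = record["prompt"]          # dedup on the question text
        if key in seen:
            continue
        seen.add(key)
        if len(record["answer"]) > max_answer_len:
            continue                    # drop instances with too-long answers
        merged.append(record)
    return merged

data = merge_dedup_filter(
    [{"prompt": "1+1?", "answer": "2"}],
    [{"prompt": "1+1?", "answer": "2"}, {"prompt": "2+2?", "answer": "4"}],
)
print(len(data))  # 2 records: the duplicate prompt is removed
```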
DeepScaler
python data_preprocess/math/deepscaler_preview.py --train-sample-size <train_sample_size>

leetcode2k
python data_preprocess/codegen/leetcode2k.py --train-sample-size <train_sample_size>

taco
python data_preprocess/codegen/taco.py --train-sample-size <train_sample_size>

primeintellect
python data_preprocess/codegen/primeintellect.py --train-sample-size <train_sample_size>

humaneval
python data_preprocess/codegen/humaneval.py

mbpp
python data_preprocess/codegen/mbpp.py

livecodebench
python data_preprocess/codegen/livecodebench.py

zebra_puzzle_dataset
python data_preprocess/logic/zebrapuzzle_gen/puzzle_generator.py --output_dir data/raw --num_puzzles <num_puzzles> --num_processes <num_processes>
cd ..
python data_preprocess/logic/process_zebrapuzzle_dataset.py

graph_logical_dataset
uv pip install pybind11
uv pip install Faker==37.1.0
cd data_preprocess/logic/graph_dataset_gen/
python logic.py --num_samples <num_samples>
cd ../../.. # return to Reasoning360
python data_preprocess/logic/process_graph_dataset.py

ordering_puzzle_dataset
uv pip install Faker==37.1.0
python data_preprocess/logic/puzzle_gen.py --test True --num_puzzles <num_puzzles>
python data_preprocess/logic/process_puzzles_dataset.py

ARC-AGI
python data_preprocess/logic/arcagi.py --name arcagi1
python data_preprocess/logic/arcagi.py --name arcagi2
python data_preprocess/logic/barc.py --train-sample-size <train_sample_size> --test-sample-size <test_sample_size>

python data_preprocess/simulation/codeio.py --train-sample-size <train_sample_size> --test-sample-size <test_sample_size>

uv pip install gdown
python data_preprocess/table/multihier.py

python data_preprocess/stem/webinstruct.py

Congratulations, all raw data is now preprocessed into .parquet files. You can check the model_filtering/ directory to perform difficulty-level filtering.
To add a new dataset:
- Add a new script in data_preprocess/<domain>/<dataset_name>.py.
- Add a new entry in tests/data_process/test_data_preprocess.py.
- Run pytest tests/data_process to check the functionality of the data preprocessing scripts.
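A new preprocessing script usually follows the same shape: load the raw data, map each record into the shared schema, optionally subsample, then write a .parquet file under the naming convention above. A stdlib-only sketch of the conversion step (the "question"/"solution" input fields and the prompt/answer output schema are assumptions for illustration; real scripts write parquet, e.g. via pandas):

```python
def preprocess(raw_examples, train_sample_size=None):
    """Map raw records into a shared prompt/answer schema (fields illustrative)."""
    examples = [
        {"prompt": r["question"], "answer": str(r["solution"])}
        for r in raw_examples
    ]
    if train_sample_size is not None:
        examples = examples[:train_sample_size]  # mirrors --train-sample-size
    return examples

raw = [{"question": "2+3?", "solution": 5}, {"question": "4*4?", "solution": 16}]
print(len(preprocess(raw, train_sample_size=1)))  # 1 after subsampling
```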