Commit a673f58
Merge pull request #911 from mlcommons/dev (Dev -> main)

2 parents 77f7aae + e534fff

5 files changed: 54 additions & 22 deletions

File tree

- .dockerignore
- dataset/README.md
- dataset/dataset_setup.py
- docs/CONTRIBUTING.md
- docs/GETTING_STARTED.md

.dockerignore

Lines changed: 1 addition & 1 deletion

@@ -1,3 +1,3 @@
 *
-!datasets/
+!dataset/
 !docker/
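
Note: this is a whitelist-style ignore file. The `*` entry excludes the entire build context from the Docker image, and the `!` entries re-include only the listed directories; the change updates `!datasets/` to `!dataset/` so the whitelist matches the `dataset/` directory name used elsewhere in this commit.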

dataset/README.md

Lines changed: 43 additions & 3 deletions

@@ -31,14 +31,15 @@ python3 dataset/dataset_setup.py \
 --<optional_flags>
 ```

-The complete benchmark uses 6 different datasets:
+The complete benchmark uses 7 different datasets:

 - [OGBG](#ogbg)
 - [WMT](#wmt)
 - [FastMRI](#fastmri)
 - [Imagenet](#imagenet)
 - [Criteo 1TB](#criteo1tb)
 - [Librispeech](#librispeech)
+- [Fineweb-edu 10B](#fineweb-edu-10b)

 Some dataset setups will require you to sign a third-party agreement with the dataset owners in order to get the download URLs.
@@ -456,11 +457,50 @@ python3 librispeech_preprocess.py --data_dir=$DATA_DIR/librispeech --tokenizer_v
 ```

 ### Fineweb-EDU 10B
-From `algorithmic-efficiency` run:
+
+The preprocessing script will download and tokenize a 10 billion token sample of FinewebEdu from Huggingface. The raw dataset will be saved in `tmp_dir/fwedu_10B_raw`, the tokenized dataset in `data_dir/fwedu_10B_tokenized`, and the train/valid split in `data_dir/fineweb_edu_10B`.

 ```bash
 python3 dataset/dataset_setup.py \
 --data_dir $DATA_DIR \
 --temp_dir $DATA_DIR/tmp \
 --fineweb_edu
-```
+```
+
+<details>
+<summary>The final directory structure should look like this:</summary>
+
+```bash
+$DATA_DIR
+├── fineweb_edu_10B
+│   ├── fwedu_10B_tokenized
+│   │   ├── data-00000-of-00080.arrow
+│   │   ├── data-00001-of-00080.arrow
+│   │   ├── data-00002-of-00080.arrow
+│   │   ├── [...]
+│   │   ├── data-00078-of-00080.arrow
+│   │   ├── data-00079-of-00080.arrow
+│   │   ├── dataset_info.json
+│   │   └── state.json
+│   ├── train
+│   │   ├── 11814516993635243069
+│   │   │   └── 00000000.shard
+│   │   │       └── 00000000.snapshot
+│   │   ├── 1309159339089188891
+│   │   ├── 13196585434617636667
+│   │   ├── 13328239765396585889
+│   │   ├── 13443989554399185472
+│   │   ├── 17062004665044410656
+│   │   ├── 832373293846386485
+│   │   ├── 9244072261762587327
+│   │   ├── dataset_spec.pb
+│   │   └── snapshot.metadata
+│   └── val
+│       ├── 8122001362029945413
+│       │   └── 00000000.shard
+│       │       └── 00000000.snapshot
+│       ├── dataset_spec.pb
+│       └── snapshot.metadata
+```
+In total, it should contain 88 files (via `find -type f | wc -l`) for a total of 112 GB (via `du -sch --apparent-size fineweb_edu_10B/`).
+</details>
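
As a quick sanity check of the layout above, the tokenized dataset can be opened with the same Hugging Face `datasets` API that `dataset_setup.py` uses. A minimal sketch, assuming `$DATA_DIR` is set as in the README and that `load_from_disk` is pointed at the Arrow directory from the tree above:

```python
# Sketch only, not part of the repository: open the tokenized
# FineWebEdu dump written by dataset_setup.py and report its shape.
import os

import datasets as hf_datasets  # same alias as in dataset_setup.py

data_dir = os.environ['DATA_DIR']
tokenized_path = os.path.join(
    data_dir, 'fineweb_edu_10B', 'fwedu_10B_tokenized')

# load_from_disk reads the data-*.arrow shards together with the
# dataset_info.json / state.json metadata written by save_to_disk.
ds = hf_datasets.load_from_disk(tokenized_path)
print(ds.num_rows)       # row count of the tokenized split
print(ds.column_names)   # whatever columns the tokenizer step wrote
```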

dataset/dataset_setup.py

Lines changed: 6 additions & 14 deletions

@@ -782,29 +782,21 @@ def download_finewebedu(
 ):
   """Download FineWebEdu-10B."""

-  if not skip_download:
-    data_dir = os.path.join(data_dir, 'fineweb_edu_10B')
-    tmp_dir = tmp_dir if tmp_dir is not None else '/tmp'
-    cache_dir = (
-        os.path.join(tmp_dir, 'lm')
-        if tmp_dir is not None
-        else os.path.expanduser('~/.cache/huggingface/datasets')
-    )
-
-    _maybe_mkdir(data_dir)
-    _maybe_mkdir(tmp_dir)
-    _maybe_mkdir(cache_dir)
+  data_dir = os.path.join(data_dir, 'fineweb_edu_10B')
+  _maybe_mkdir(data_dir)
+  _maybe_mkdir(tmp_dir)

+  if not skip_download:
     os.environ['TMPDIR'] = tmp_dir

     ds = hf_datasets.load_dataset(
         'HuggingFaceFW/fineweb-edu',
         name='sample-10BT',
         split='train',
-        cache_dir=cache_dir,
+        cache_dir=tmp_dir,
     )
     ds.save_to_disk(os.path.join(tmp_dir, 'fwedu_10B_raw'))
-  else:
+  elif not skip_tokenization:
     ds = hf_datasets.load_from_disk(os.path.join(tmp_dir, 'fwedu_10B_raw'))

   if not skip_tokenization:
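
Read as a whole, the refactor hoists the directory creation out of the download branch, so runs with `skip_download` set still get `data_dir` and `tmp_dir` created, and the raw dump is only reloaded from disk when tokenization still needs it. A sketch of the resulting control flow (the parameter list is cut off in the hunk, so the signature below is an assumption, and the tokenization body is elided):

```python
# Sketch of download_finewebedu after this commit; signature assumed
# from the names used in the body, tokenization step not shown.
import os

import datasets as hf_datasets


def _maybe_mkdir(path):
  # Stand-in for the repository helper of the same name.
  os.makedirs(path, exist_ok=True)


def download_finewebedu(
    data_dir, tmp_dir, skip_download=False, skip_tokenization=False):
  """Download FineWebEdu-10B."""
  # Created unconditionally now, so later stages can rely on the
  # directories even when the download step is skipped.
  data_dir = os.path.join(data_dir, 'fineweb_edu_10B')
  _maybe_mkdir(data_dir)
  _maybe_mkdir(tmp_dir)

  if not skip_download:
    os.environ['TMPDIR'] = tmp_dir
    ds = hf_datasets.load_dataset(
        'HuggingFaceFW/fineweb-edu',
        name='sample-10BT',
        split='train',
        cache_dir=tmp_dir,  # HF cache now lives directly in tmp_dir
    )
    ds.save_to_disk(os.path.join(tmp_dir, 'fwedu_10B_raw'))
  elif not skip_tokenization:
    # Reload the raw dump only if a later stage still needs it.
    ds = hf_datasets.load_from_disk(os.path.join(tmp_dir, 'fwedu_10B_raw'))

  if not skip_tokenization:
    ...  # tokenize ds and write fwedu_10B_tokenized (unchanged here)
```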

docs/CONTRIBUTING.md

Lines changed: 1 addition & 1 deletion

@@ -297,7 +297,7 @@ algorithm in `algorithms/target_setting_algorithms/`.
 We also have regression tests available in
 [.github/workflows/regression_tests.yml](.github/workflows/regression_tests.yml)
 that can be run semi-automatically. The regression tests are shorter end-to-end
-submissions run in a containerized environment across all 8 workloads, in both
+submissions run in a containerized environment across all 9 workloads, in both
 the JAX and PyTorch frameworks. The regression tests run on self-hosted runners
 and are triggered for pull requests that target the main branch. Typically these
 PRs will be from the `dev` branch so the tests will run containers based on

docs/GETTING_STARTED.md

Lines changed: 3 additions & 3 deletions

@@ -219,11 +219,11 @@ Users that wish to customize their images are invited to check and modify the

 ## Download the Data

-The workloads in this benchmark use 6 different datasets across 8 workloads. You
+The workloads in this benchmark use 6 different datasets across 9 workloads. You
 may choose to download some or all of the datasets as you are developing your
-submission, but your submission will be scored across all 8 workloads. For
+submission, but your submission will be scored across all 9 workloads. For
 instructions on obtaining and setting up the datasets see
-[datasets/README](/datasets/README.md#dataset-setup).
+[dataset/README](/dataset/README.md#dataset-setup).

 ## Develop your Submission
