Commit a673f58
Merge pull request #911 from mlcommons/dev (Dev -> main)

2 parents 77f7aae + e534fff

5 files changed: 54 additions & 22 deletions

File tree

- .dockerignore
- dataset/README.md
- dataset/dataset_setup.py
- docs/CONTRIBUTING.md
- docs/GETTING_STARTED.md

.dockerignore

Lines changed: 1 addition & 1 deletion

@@ -1,3 +1,3 @@
 *
-!datasets/
+!dataset/
 !docker/
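
Note: this is a whitelist-style ignore file. The `*` entry excludes the entire build context from the Docker image, and the `!` entries re-include only the listed directories; the change updates `!datasets/` to `!dataset/` so the whitelist matches the `dataset/` directory name used elsewhere in this commit.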

dataset/README.md

Lines changed: 43 additions & 3 deletions

@@ -31,14 +31,15 @@ python3 dataset/dataset_setup.py \
 --<optional_flags>
 ```

-The complete benchmark uses 6 different datasets:
+The complete benchmark uses 7 different datasets:

 - [OGBG](#ogbg)
 - [WMT](#wmt)
 - [FastMRI](#fastmri)
 - [Imagenet](#imagenet)
 - [Criteo 1TB](#criteo1tb)
 - [Librispeech](#librispeech)
+- [Fineweb-edu 10B](#fineweb-edu-10b)

 Some dataset setups will require you to sign a third-party agreement with the dataset owners in order to get the download URLs.
@@ -456,11 +457,50 @@ python3 librispeech_preprocess.py --data_dir=$DATA_DIR/librispeech --tokenizer_v
 ```

 ### Fineweb-EDU 10B
-From `algorithmic-efficiency` run:
+
+The preprocessing script will download and tokenize a 10 billion token sample of FinewebEdu from Huggingface. The raw dataset will be saved in `tmp_dir/fwedu_10B_raw`, the tokenized dataset in `data_dir/fwedu_10B_tokenized`, and the train/valid split in `data_dir/fineweb_edu_10B`.

 ```bash
 python3 dataset/dataset_setup.py \
 --data_dir $DATA_DIR \
 --temp_dir $DATA_DIR/tmp \
 --fineweb_edu
-```
+```
+
+<details>
+<summary>The final directory structure should look like this:</summary>
+
+```bash
+$DATA_DIR
+├── fineweb_edu_10B
+│   ├── fwedu_10B_tokenized
+│   │   ├── data-00000-of-00080.arrow
+│   │   ├── data-00001-of-00080.arrow
+│   │   ├── data-00002-of-00080.arrow
+│   │   ├── [...]
+│   │   ├── data-00078-of-00080.arrow
+│   │   ├── data-00079-of-00080.arrow
+│   │   ├── dataset_info.json
+│   │   └── state.json
+│   ├── train
+│   │   ├── 11814516993635243069
+│   │   │   └── 00000000.shard
+│   │   │       └── 00000000.snapshot
+│   │   ├── 1309159339089188891
+│   │   ├── 13196585434617636667
+│   │   ├── 13328239765396585889
+│   │   ├── 13443989554399185472
+│   │   ├── 17062004665044410656
+│   │   ├── 832373293846386485
+│   │   ├── 9244072261762587327
+│   │   ├── dataset_spec.pb
+│   │   └── snapshot.metadata
+│   └── val
+│       ├── 8122001362029945413
+│       │   └── 00000000.shard
+│       │       └── 00000000.snapshot
+│       ├── dataset_spec.pb
+│       └── snapshot.metadata
+```
+In total, it should contain 88 files (via `find -type f | wc -l`) for a total of 112 GB (via `du -sch --apparent-size fineweb_edu_10B/`).
+</details>
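
As a quick sanity check of the layout above, the tokenized dataset can be opened with the same Hugging Face `datasets` API that `dataset_setup.py` uses. A minimal sketch, assuming `$DATA_DIR` is set as in the README and that `load_from_disk` is pointed at the Arrow directory from the tree above:

```python
# Sketch only, not part of the repository: open the tokenized
# FineWebEdu dump written by dataset_setup.py and report its shape.
import os

import datasets as hf_datasets  # same alias as in dataset_setup.py

data_dir = os.environ['DATA_DIR']
tokenized_path = os.path.join(
    data_dir, 'fineweb_edu_10B', 'fwedu_10B_tokenized')

# load_from_disk reads the data-*.arrow shards together with the
# dataset_info.json / state.json metadata written by save_to_disk.
ds = hf_datasets.load_from_disk(tokenized_path)
print(ds.num_rows)       # row count of the tokenized split
print(ds.column_names)   # whatever columns the tokenizer step wrote
```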

dataset/dataset_setup.py

Lines changed: 6 additions & 14 deletions

@@ -782,29 +782,21 @@ def download_finewebedu(
 ):
   """Download FineWebEdu-10B."""

-  if not skip_download:
-    data_dir = os.path.join(data_dir, 'fineweb_edu_10B')
-    tmp_dir = tmp_dir if tmp_dir is not None else '/tmp'
-    cache_dir = (
-        os.path.join(tmp_dir, 'lm')
-        if tmp_dir is not None
-        else os.path.expanduser('~/.cache/huggingface/datasets')
-    )
-
-    _maybe_mkdir(data_dir)
-    _maybe_mkdir(tmp_dir)
-    _maybe_mkdir(cache_dir)
+  data_dir = os.path.join(data_dir, 'fineweb_edu_10B')
+  _maybe_mkdir(data_dir)
+  _maybe_mkdir(tmp_dir)

+  if not skip_download:
     os.environ['TMPDIR'] = tmp_dir

     ds = hf_datasets.load_dataset(
         'HuggingFaceFW/fineweb-edu',
         name='sample-10BT',
         split='train',
-        cache_dir=cache_dir,
+        cache_dir=tmp_dir,
     )
     ds.save_to_disk(os.path.join(tmp_dir, 'fwedu_10B_raw'))
-  else:
+  elif not skip_tokenization:
     ds = hf_datasets.load_from_disk(os.path.join(tmp_dir, 'fwedu_10B_raw'))

   if not skip_tokenization:
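
Read as a whole, the refactor hoists the directory creation out of the download branch, so runs with `skip_download` set still get `data_dir` and `tmp_dir` created, and the raw dump is only reloaded from disk when tokenization still needs it. A sketch of the resulting control flow (the parameter list is cut off in the hunk, so the signature below is an assumption, and the tokenization body is elided):

```python
# Sketch of download_finewebedu after this commit; signature assumed
# from the names used in the body, tokenization step not shown.
import os

import datasets as hf_datasets


def _maybe_mkdir(path):
  # Stand-in for the repository helper of the same name.
  os.makedirs(path, exist_ok=True)


def download_finewebedu(
    data_dir, tmp_dir, skip_download=False, skip_tokenization=False):
  """Download FineWebEdu-10B."""
  # Created unconditionally now, so later stages can rely on the
  # directories even when the download step is skipped.
  data_dir = os.path.join(data_dir, 'fineweb_edu_10B')
  _maybe_mkdir(data_dir)
  _maybe_mkdir(tmp_dir)

  if not skip_download:
    os.environ['TMPDIR'] = tmp_dir
    ds = hf_datasets.load_dataset(
        'HuggingFaceFW/fineweb-edu',
        name='sample-10BT',
        split='train',
        cache_dir=tmp_dir,  # HF cache now lives directly in tmp_dir
    )
    ds.save_to_disk(os.path.join(tmp_dir, 'fwedu_10B_raw'))
  elif not skip_tokenization:
    # Reload the raw dump only if a later stage still needs it.
    ds = hf_datasets.load_from_disk(os.path.join(tmp_dir, 'fwedu_10B_raw'))

  if not skip_tokenization:
    ...  # tokenize ds and write fwedu_10B_tokenized (unchanged here)
```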

docs/CONTRIBUTING.md

Lines changed: 1 addition & 1 deletion

@@ -297,7 +297,7 @@ algorithm in `algorithms/target_setting_algorithms/`.
 We also have regression tests available in
 [.github/workflows/regression_tests.yml](.github/workflows/regression_tests.yml)
 that can be run semi-automatically. The regression tests are shorter end-to-end
-submissions run in a containerized environment across all 8 workloads, in both
+submissions run in a containerized environment across all 9 workloads, in both
 the JAX and PyTorch frameworks. The regression tests run on self-hosted runners
 and are triggered for pull requests that target the main branch. Typically these
 PRs will be from the `dev` branch so the tests will run containers based on

docs/GETTING_STARTED.md

Lines changed: 3 additions & 3 deletions

@@ -219,11 +219,11 @@ Users that wish to customize their images are invited to check and modify the

 ## Download the Data

-The workloads in this benchmark use 6 different datasets across 8 workloads. You
+The workloads in this benchmark use 6 different datasets across 9 workloads. You
 may choose to download some or all of the datasets as you are developing your
-submission, but your submission will be scored across all 8 workloads. For
+submission, but your submission will be scored across all 9 workloads. For
 instructions on obtaining and setting up the datasets see
-[datasets/README](/datasets/README.md#dataset-setup).
+[dataset/README](/dataset/README.md#dataset-setup).

 ## Develop your Submission
