The preprocessing script will download and tokenize a 10 billion token sample of FinewebEdu from Huggingface. The raw dataset will be saved in `tmp_dir/fwedu_10B_raw`, the tokenized dataset in `data_dir/fwedu_10B_tokenized`, and the train/validation split in `data_dir/fineweb_edu_10B`.
```bash
python3 dataset/dataset_setup.py \
  --data_dir $DATA_DIR \
  --temp_dir $DATA_DIR/tmp \
  --fineweb_edu
```
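The command above assumes `DATA_DIR` is already set. A minimal sketch of preparing it before the run (the `mktemp` scratch location is purely illustrative, not something the script mandates):

```shell
# Illustrative only: any writable location works for DATA_DIR.
export DATA_DIR=$(mktemp -d)   # scratch directory (assumption, not required by the script)
mkdir -p "$DATA_DIR/tmp"       # becomes the --temp_dir for raw downloads
echo "$DATA_DIR"
```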
<details>
<summary>The final directory structure should look like this:</summary>

```bash
$DATA_DIR
├── fineweb_edu_10B
│   ├── fwedu_10B_tokenized
│   │   ├── data-00000-of-00080.arrow
│   │   ├── data-00001-of-00080.arrow
│   │   ├── data-00002-of-00080.arrow
│   │   ├── [...]
│   │   ├── data-00078-of-00080.arrow
│   │   ├── data-00079-of-00080.arrow
│   │   ├── dataset_info.json
│   │   └── state.json
│   ├── train
│   │   ├── 11814516993635243069
│   │   │   ├── 00000000.shard
│   │   │   └── 00000000.snapshot
│   │   ├── 1309159339089188891
│   │   ├── 13196585434617636667
│   │   ├── 13328239765396585889
│   │   ├── 13443989554399185472
│   │   ├── 17062004665044410656
│   │   ├── 832373293846386485
│   │   ├── 9244072261762587327
│   │   ├── dataset_spec.pb
│   │   └── snapshot.metadata
│   └── val
│       ├── 8122001362029945413
│       │   ├── 00000000.shard
│       │   └── 00000000.snapshot
│       ├── dataset_spec.pb
│       └── snapshot.metadata
```
In total, it should contain 88 files (via `find -type f | wc -l`), for a total of 112 GB (via `du -sch --apparent-size fineweb_edu_10B/`).

</details>
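The two verification commands can be tried end to end on a throwaway dummy tree before pointing them at the real data (run them from inside `$DATA_DIR` for the actual check; the dummy paths below are illustrative):

```shell
# Demo of the verification commands on a dummy tree; for the real dataset,
# cd into $DATA_DIR and run the same commands against fineweb_edu_10B/.
tmp=$(mktemp -d)
mkdir -p "$tmp/fineweb_edu_10B/train"
touch "$tmp/fineweb_edu_10B/train/00000000.shard" \
      "$tmp/fineweb_edu_10B/train/00000000.snapshot"
cd "$tmp"
find fineweb_edu_10B -type f | wc -l       # file count (2 here; expect 88 for the real dataset)
du -sch --apparent-size fineweb_edu_10B/   # apparent size (GNU du; expect ~112 GB for the real dataset)
```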