# Dataset Discovery Tool

A tool to recursively discover datasets in a directory and generate a blended dataset configuration for Fast-LLM.

## Overview

This tool walks through a directory tree, identifies datasets by their `fast_llm_config*.yaml` files, and generates a configuration file that blends all discovered datasets with weights proportional to token counts.

## Features

- **Recursive Discovery**: Automatically finds all dataset configs in nested directories
- **Flexible Output**: Can use file references or inline full configs
- **Token-Proportional Blending**: Automatically calculates weights based on dataset token counts for proportional sampling

## Usage

### Command Line

```bash
python tools/discover_datasets.py <directory> -o <output.yaml> [options]
```

**Arguments:**

- `directory`: Directory to search for datasets recursively (required)
- `-o, --output`: Output path for the generated config YAML file (required)
- `--no-file-refs`: Inline configs instead of using file references (optional, not recommended)
- `--ignore`: Path to ignore during dataset discovery (can be specified multiple times, optional)

**Examples:**

```bash
# Basic usage - discover all datasets and create blended config
python tools/discover_datasets.py /path/to/datasets -o blended_dataset.yaml

# Inline full configs instead of using file references
python tools/discover_datasets.py /path/to/datasets -o blended_dataset.yaml --no-file-refs

# Ignore specific paths during discovery
python tools/discover_datasets.py /path/to/datasets -o blended_dataset.yaml --ignore experiments/old --ignore tmp
```

### Config File

Create a config file:

```yaml
# discover_config.yaml
directory: /path/to/datasets
output: blended_dataset.yaml
use_file_refs: true
ignore_paths: [] # Optional list of paths to ignore
```

Run with:

```bash
python tools/discover_datasets.py --config discover_config.yaml
```

## Dataset Identification

The tool identifies datasets by looking for files matching the pattern `fast_llm_config*.yaml`:

- `fast_llm_config.yaml` - Unsplit dataset
- `fast_llm_config_training.yaml` - Training split
- `fast_llm_config_validation.yaml` - Validation split
- Any other `fast_llm_config_*.yaml` files

These files are typically generated by the `fast-llm prepare` command during dataset preparation.
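
Conceptually, discovery amounts to a recursive glob over this pattern. Here is a minimal sketch of the idea, not the tool's actual implementation (the function name and signature are illustrative):

```python
from pathlib import Path


def find_dataset_configs(root: str, ignore: tuple[str, ...] = ()) -> list[Path]:
    """Collect fast_llm_config*.yaml files under root, skipping ignored subpaths."""
    ignored = [(Path(root) / p).resolve() for p in ignore]
    return sorted(
        path.resolve()
        for path in Path(root).rglob("fast_llm_config*.yaml")
        if not any(path.resolve().is_relative_to(i) for i in ignored)
    )
```

Sorting keeps the result deterministic, which matches the alphabetical ordering noted at the end of this document.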

## Output Format

### Blended Datasets

The tool generates a blended dataset config with weights proportional to the number of tokens in each dataset:

```yaml
type: blended
name: my_datasets
datasets:
  - type: file
    path: /path/to/dataset1/fast_llm_config_training.yaml
  - type: file
    path: /path/to/dataset1/fast_llm_config_validation.yaml
  - type: file
    path: /path/to/dataset2/fast_llm_config.yaml
weights:
  - 1.5 # dataset1 training: 1.5B tokens
  - 0.5 # dataset1 validation: 0.5B tokens
  - 2.0 # dataset2: 2.0B tokens
```

With blended datasets, samples are drawn from each dataset in proportion to its weight during training (see the sketch after this list). This means:

- Larger datasets (more tokens) will be sampled more frequently
- Smaller datasets will be sampled less frequently
- The sampling is interleaved, not sequential
- Each dataset maintains its internal order, but samples from different datasets are mixed
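
For intuition, proportional interleaving can be pictured as a weighted draw at every step. This is a simplified sketch, not Fast-LLM's actual sampler (all names are illustrative):

```python
import random


def interleave(datasets: dict[str, list], weights: dict[str, float], n: int, seed: int = 0) -> list:
    """Draw n samples, picking the source dataset in proportion to its weight.

    Each dataset is consumed in its own internal order; only the mixing is random.
    """
    rng = random.Random(seed)
    names = list(datasets)
    w = [weights[name] for name in names]
    cursors = dict.fromkeys(names, 0)
    out = []
    for _ in range(n):
        name = rng.choices(names, weights=w, k=1)[0]
        data = datasets[name]
        out.append(data[cursors[name] % len(data)])  # wrap around, for the sketch only
        cursors[name] += 1
    return out


# With weights 2.0 and 1.0, roughly two thirds of the draws come from "a":
print(interleave({"a": ["a0", "a1", "a2"], "b": ["b0", "b1"]}, {"a": 2.0, "b": 1.0}, n=6))
```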

**Hierarchical blending:** When datasets are in nested directories, the tool automatically calculates proper token-proportional weights at all levels. Subdirectories are weighted by their total token count (sum of all datasets within them), ensuring accurate proportional sampling across the entire directory structure.
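
In sketch form, that bottom-up weighting is just a sum of token counts per subtree. The snippet below is illustrative only: it assumes each per-dataset config records its token count under a `num_tokens` key, which may not be the actual field name Fast-LLM writes.

```python
from pathlib import Path

import yaml


def config_tokens(config_path: Path) -> int:
    # Assumption: the prepared config stores its token count as "num_tokens".
    with open(config_path) as f:
        return int(yaml.safe_load(f)["num_tokens"])


def directory_tokens(directory: Path) -> int:
    """A subdirectory's blend weight is the total token count of everything inside it."""
    return sum(config_tokens(p) for p in directory.rglob("fast_llm_config*.yaml"))
```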

**Benefits of blended datasets:**

- **Proportional sampling**: Each dataset is sampled proportionally to its size, preventing smaller datasets from being underrepresented
- **Interleaved samples**: Unlike sequential concatenation, samples from different datasets are mixed during training
- **Automatic weight calculation**: No need to manually specify weights - they're calculated from token counts
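
For example, with the weights 1.5, 0.5, and 2.0 above (total 4.0), the three entries contribute 1.5/4.0 = 37.5%, 0.5/4.0 = 12.5%, and 2.0/4.0 = 50% of drawn samples, respectively.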

### Using in Training Config

The generated config can be used directly in a training config:

```yaml
data:
datasets:
training:
type: file
path: blended_dataset.yaml
```

## Example Workflow

### 1. Prepare Multiple Datasets

```bash
# Prepare dataset 1
fast-llm prepare --config dataset1_prepare.yaml

# Prepare dataset 2
fast-llm prepare --config dataset2_prepare.yaml

# Prepare dataset 3
fast-llm prepare --config dataset3_prepare.yaml
```

This creates a directory structure like:

```
my_datasets/
β”œβ”€β”€ dataset1/
β”‚ β”œβ”€β”€ fast_llm_config_training.yaml
β”‚ β”œβ”€β”€ fast_llm_config_validation.yaml
β”‚ β”œβ”€β”€ dataset1_training.fast_llm_dataset
β”‚ └── dataset1_validation.fast_llm_dataset
β”œβ”€β”€ dataset2/
β”‚ β”œβ”€β”€ fast_llm_config_training.yaml
β”‚ β”œβ”€β”€ fast_llm_config_validation.yaml
β”‚ β”œβ”€β”€ dataset2_training.fast_llm_dataset
β”‚ └── dataset2_validation.fast_llm_dataset
└── dataset3/
└── experiments/
β”œβ”€β”€ fast_llm_config_training.yaml
└── dataset3_training.fast_llm_dataset
```

### 2. Discover and Blend Datasets

```bash
python tools/discover_datasets.py my_datasets/ -o blended_datasets.yaml
```

This generates `blended_datasets.yaml` (weights are relative, so only their ratios matter):

```yaml
type: blended
name: my_datasets
datasets:
  - type: file
    path: /path/to/my_datasets/dataset1/fast_llm_config_training.yaml
  - type: file
    path: /path/to/my_datasets/dataset1/fast_llm_config_validation.yaml
  - type: file
    path: /path/to/my_datasets/dataset2/fast_llm_config_training.yaml
  - type: file
    path: /path/to/my_datasets/dataset2/fast_llm_config_validation.yaml
  - type: file
    path: /path/to/my_datasets/dataset3/experiments/fast_llm_config_training.yaml
weights:
  - 1500.0 # dataset1 training: 1.5B tokens
  - 500.0 # dataset1 validation: 500M tokens
  - 2000.0 # dataset2 training: 2B tokens
  - 800.0 # dataset2 validation: 800M tokens
  - 3000.0 # dataset3 training: 3B tokens
```
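
Before pointing a training run at the generated file, a quick sanity check helps. This snippet only assumes PyYAML and prints each dataset's effective sampling fraction:

```python
import yaml

with open("blended_datasets.yaml") as f:
    config = yaml.safe_load(f)

assert config["type"] == "blended"
assert len(config["datasets"]) == len(config["weights"])

total = sum(config["weights"])
for entry, weight in zip(config["datasets"], config["weights"]):
    print(f"{weight / total:6.1%}  {entry['path']}")
```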

### 3. Use in Training Config

```yaml
# training_config.yaml
model:
# ... model config ...

data:
datasets:
training:
type: file
path: blended_datasets.yaml
sampling:
shuffle: skip_first_epoch
seed: 784569

# ... rest of training config ...
```

### 4. Train

```bash
fast-llm train --config training_config.yaml
```

## Use Cases

### 1. Combining Multiple Data Sources

You have data from different sources (web scrapes, books, code, etc.) prepared separately:

```bash
python tools/discover_datasets.py /data/pretraining -o all_pretraining_data.yaml
```

### 2. Incremental Data Addition

You keep adding new datasets over time and want to automatically include all of them:

```bash
# Just add new prepared datasets to the directory
# Re-run discovery to update the combined config
python tools/discover_datasets.py /data/pretraining -o all_pretraining_data.yaml
```

### 3. Experiment Organization

You have experiments with different preprocessing or filtering:

```
experiments/
β”œβ”€β”€ baseline/
β”‚ β”œβ”€β”€ fast_llm_config_training.yaml
β”‚ └── fast_llm_config_validation.yaml
β”œβ”€β”€ filtered_v1/
β”‚ β”œβ”€β”€ fast_llm_config_training.yaml
β”‚ └── fast_llm_config_validation.yaml
└── filtered_v2/
β”œβ”€β”€ fast_llm_config_training.yaml
└── fast_llm_config_validation.yaml
```

```bash
python tools/discover_datasets.py experiments/ -o all_experiments.yaml
```

## Notes

- **File References**: By default, the tool uses `type: file` references which lazily load the actual dataset configs. This keeps the generated config small and readable.

- **Absolute Paths**: The tool uses absolute paths for file references, so the generated config keeps working regardless of the directory it is loaded from.

- **Ordering**: Datasets are discovered and ordered alphabetically by path for consistency.

- **Empty Directories**: If no `fast_llm_config*.yaml` files are found, the tool will raise an error.

- **All Files Included**: The tool blends ALL discovered config files with weights proportional to their token counts. This means that if you have both training and validation configs in the same directory, they will all end up in the blended dataset. You may want to organize your directory structure accordingly or use the `--ignore` flag to exclude specific paths; one way to split out a training-only blend is sketched below.
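
If you need a training-only blend, one simple option is to post-filter the generated config. This is a hedged sketch (PyYAML assumed; file names follow the examples above):

```python
from pathlib import Path

import yaml

with open("blended_datasets.yaml") as f:
    config = yaml.safe_load(f)

# Keep only the training-split entries; each weight stays paired with its dataset.
kept = [
    (dataset, weight)
    for dataset, weight in zip(config["datasets"], config["weights"])
    if "training" in Path(dataset["path"]).name
]
config["datasets"] = [dataset for dataset, _ in kept]
config["weights"] = [weight for _, weight in kept]

with open("blended_training.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```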

## See Also

- [Fast-LLM Data Configuration Documentation](../docs/recipes/data-configuration.md)
- [Dataset Preparation Guide](../docs/recipes/data-preparation.md)