# Dataset Discovery Tool

A tool to recursively discover datasets in a directory and generate a blended dataset configuration for Fast-LLM.

## Overview

This tool walks through a directory tree, identifies datasets by their `fast_llm_config*.yaml` files, and generates a configuration file that blends all discovered datasets with weights proportional to token counts.

## Features

- **Recursive Discovery**: Automatically finds all dataset configs in nested directories
- **Flexible Output**: Can use file references or inline full configs
- **Token-Proportional Blending**: Automatically calculates weights based on dataset token counts for proportional sampling

## Usage

### Command Line

```bash
python tools/discover_datasets.py <directory> -o <output.yaml> [options]
```

**Arguments:**

- `directory`: Directory to search for datasets recursively (required)
- `-o, --output`: Output path for the generated config YAML file (required)
- `--no-file-refs`: Inline configs instead of using file references (optional, not recommended)
- `--ignore`: Path to ignore during dataset discovery (can be specified multiple times, optional)

**Examples:**

```bash
# Basic usage - discover all datasets and create blended config
python tools/discover_datasets.py /path/to/datasets -o blended_dataset.yaml

# Inline full configs instead of using file references
python tools/discover_datasets.py /path/to/datasets -o blended_dataset.yaml --no-file-refs

# Ignore specific paths during discovery
python tools/discover_datasets.py /path/to/datasets -o blended_dataset.yaml --ignore experiments/old --ignore tmp
```

### Config File

Create a config file:

```yaml
# discover_config.yaml
directory: /path/to/datasets
output: blended_dataset.yaml
use_file_refs: true
ignore_paths: [] # Optional list of paths to ignore
```

Run with:

```bash
python tools/discover_datasets.py --config discover_config.yaml
```

## Dataset Identification

The tool identifies datasets by looking for files matching the pattern `fast_llm_config*.yaml`:

- `fast_llm_config.yaml` - Unsplit dataset
- `fast_llm_config_training.yaml` - Training split
- `fast_llm_config_validation.yaml` - Validation split
- Any other `fast_llm_config_*.yaml` files

These files are typically generated by the `fast-llm prepare` command during dataset preparation.
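
Conceptually, discovery amounts to a recursive glob over this pattern. Here is a minimal sketch of the idea, not the tool's actual implementation (the function name and signature are illustrative):

```python
from pathlib import Path


def find_dataset_configs(root: str, ignore: tuple[str, ...] = ()) -> list[Path]:
    """Collect fast_llm_config*.yaml files under root, skipping ignored subpaths."""
    ignored = [(Path(root) / p).resolve() for p in ignore]
    return sorted(
        path.resolve()
        for path in Path(root).rglob("fast_llm_config*.yaml")
        if not any(path.resolve().is_relative_to(i) for i in ignored)
    )
```

Sorting keeps the result deterministic, which matches the alphabetical ordering noted at the end of this document.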

## Output Format

### Blended Datasets

The tool generates a blended dataset config with weights proportional to the number of tokens in each dataset:

```yaml
type: blended
name: my_datasets
datasets:
  - type: file
    path: /path/to/dataset1/fast_llm_config_training.yaml
  - type: file
    path: /path/to/dataset1/fast_llm_config_validation.yaml
  - type: file
    path: /path/to/dataset2/fast_llm_config.yaml
weights:
  - 1.5 # dataset1 training: 1.5B tokens
  - 0.5 # dataset1 validation: 0.5B tokens
  - 2.0 # dataset2: 2.0B tokens
```

With blended datasets, samples are drawn from each dataset in proportion to its weight during training (see the sketch after this list). This means:

- Larger datasets (more tokens) will be sampled more frequently
- Smaller datasets will be sampled less frequently
- The sampling is interleaved, not sequential
- Each dataset maintains its internal order, but samples from different datasets are mixed
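
For intuition, proportional interleaving can be pictured as a weighted draw at every step. This is a simplified sketch, not Fast-LLM's actual sampler (all names are illustrative):

```python
import random


def interleave(datasets: dict[str, list], weights: dict[str, float], n: int, seed: int = 0) -> list:
    """Draw n samples, picking the source dataset in proportion to its weight.

    Each dataset is consumed in its own internal order; only the mixing is random.
    """
    rng = random.Random(seed)
    names = list(datasets)
    w = [weights[name] for name in names]
    cursors = dict.fromkeys(names, 0)
    out = []
    for _ in range(n):
        name = rng.choices(names, weights=w, k=1)[0]
        data = datasets[name]
        out.append(data[cursors[name] % len(data)])  # wrap around, for the sketch only
        cursors[name] += 1
    return out


# With weights 2.0 and 1.0, roughly two thirds of the draws come from "a":
print(interleave({"a": ["a0", "a1", "a2"], "b": ["b0", "b1"]}, {"a": 2.0, "b": 1.0}, n=6))
```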

**Hierarchical blending:** When datasets are in nested directories, the tool automatically calculates proper token-proportional weights at all levels. Subdirectories are weighted by their total token count (sum of all datasets within them), ensuring accurate proportional sampling across the entire directory structure.
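
In sketch form, that bottom-up weighting is just a sum of token counts per subtree. The snippet below is illustrative only: it assumes each per-dataset config records its token count under a `num_tokens` key, which may not be the actual field name Fast-LLM writes.

```python
from pathlib import Path

import yaml


def config_tokens(config_path: Path) -> int:
    # Assumption: the prepared config stores its token count as "num_tokens".
    with open(config_path) as f:
        return int(yaml.safe_load(f)["num_tokens"])


def directory_tokens(directory: Path) -> int:
    """A subdirectory's blend weight is the total token count of everything inside it."""
    return sum(config_tokens(p) for p in directory.rglob("fast_llm_config*.yaml"))
```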

**Benefits of blended datasets:**

- **Proportional sampling**: Each dataset is sampled proportionally to its size, preventing smaller datasets from being underrepresented
- **Interleaved samples**: Unlike sequential concatenation, samples from different datasets are mixed during training
- **Automatic weight calculation**: No need to manually specify weights - they're calculated from token counts
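
For example, with the weights 1.5, 0.5, and 2.0 above (total 4.0), the three entries contribute 1.5/4.0 = 37.5%, 0.5/4.0 = 12.5%, and 2.0/4.0 = 50% of drawn samples, respectively.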

### Using in Training Config

The generated config can be used directly in a training config:

```yaml
data:
datasets:
training:
type: file
path: blended_dataset.yaml
```

## Example Workflow

### 1. Prepare Multiple Datasets

```bash
# Prepare dataset 1
fast-llm prepare --config dataset1_prepare.yaml

# Prepare dataset 2
fast-llm prepare --config dataset2_prepare.yaml

# Prepare dataset 3
fast-llm prepare --config dataset3_prepare.yaml
```

This creates a directory structure like:

```
my_datasets/
β”œβ”€β”€ dataset1/
β”‚ β”œβ”€β”€ fast_llm_config_training.yaml
β”‚ β”œβ”€β”€ fast_llm_config_validation.yaml
β”‚ β”œβ”€β”€ dataset1_training.fast_llm_dataset
β”‚ └── dataset1_validation.fast_llm_dataset
β”œβ”€β”€ dataset2/
β”‚ β”œβ”€β”€ fast_llm_config_training.yaml
β”‚ β”œβ”€β”€ fast_llm_config_validation.yaml
β”‚ β”œβ”€β”€ dataset2_training.fast_llm_dataset
β”‚ └── dataset2_validation.fast_llm_dataset
└── dataset3/
└── experiments/
β”œβ”€β”€ fast_llm_config_training.yaml
└── dataset3_training.fast_llm_dataset
```

### 2. Discover and Blend Datasets

```bash
python tools/discover_datasets.py my_datasets/ -o blended_datasets.yaml
```

This generates `blended_datasets.yaml` (weights are relative, so only their ratios matter):

```yaml
type: blended
name: my_datasets
datasets:
  - type: file
    path: /path/to/my_datasets/dataset1/fast_llm_config_training.yaml
  - type: file
    path: /path/to/my_datasets/dataset1/fast_llm_config_validation.yaml
  - type: file
    path: /path/to/my_datasets/dataset2/fast_llm_config_training.yaml
  - type: file
    path: /path/to/my_datasets/dataset2/fast_llm_config_validation.yaml
  - type: file
    path: /path/to/my_datasets/dataset3/experiments/fast_llm_config_training.yaml
weights:
  - 1500.0 # dataset1 training: 1.5B tokens
  - 500.0 # dataset1 validation: 500M tokens
  - 2000.0 # dataset2 training: 2B tokens
  - 800.0 # dataset2 validation: 800M tokens
  - 3000.0 # dataset3 training: 3B tokens
```
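
Before pointing a training run at the generated file, a quick sanity check helps. This snippet only assumes PyYAML and prints each dataset's effective sampling fraction:

```python
import yaml

with open("blended_datasets.yaml") as f:
    config = yaml.safe_load(f)

assert config["type"] == "blended"
assert len(config["datasets"]) == len(config["weights"])

total = sum(config["weights"])
for entry, weight in zip(config["datasets"], config["weights"]):
    print(f"{weight / total:6.1%}  {entry['path']}")
```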

### 3. Use in Training Config

```yaml
# training_config.yaml
model:
# ... model config ...

data:
datasets:
training:
type: file
path: blended_datasets.yaml
sampling:
shuffle: skip_first_epoch
seed: 784569

# ... rest of training config ...
```

### 4. Train

```bash
fast-llm train --config training_config.yaml
```

## Use Cases

### 1. Combining Multiple Data Sources

You have data from different sources (web scrapes, books, code, etc.) prepared separately:

```bash
python tools/discover_datasets.py /data/pretraining -o all_pretraining_data.yaml
```

### 2. Incremental Data Addition

You keep adding new datasets over time and want to automatically include all of them:

```bash
# Just add new prepared datasets to the directory
# Re-run discovery to update the combined config
python tools/discover_datasets.py /data/pretraining -o all_pretraining_data.yaml
```

### 3. Experiment Organization

You have experiments with different preprocessing or filtering:

```
experiments/
β”œβ”€β”€ baseline/
β”‚ β”œβ”€β”€ fast_llm_config_training.yaml
β”‚ └── fast_llm_config_validation.yaml
β”œβ”€β”€ filtered_v1/
β”‚ β”œβ”€β”€ fast_llm_config_training.yaml
β”‚ └── fast_llm_config_validation.yaml
└── filtered_v2/
β”œβ”€β”€ fast_llm_config_training.yaml
└── fast_llm_config_validation.yaml
```

```bash
python tools/discover_datasets.py experiments/ -o all_experiments.yaml
```

## Notes

- **File References**: By default, the tool uses `type: file` references which lazily load the actual dataset configs. This keeps the generated config small and readable.

- **Absolute Paths**: The tool uses absolute paths for file references, so the generated config keeps working regardless of the directory it is loaded from.

- **Ordering**: Datasets are discovered and ordered alphabetically by path for consistency.

- **Empty Directories**: If no `fast_llm_config*.yaml` files are found, the tool will raise an error.

- **All Files Included**: The tool blends ALL discovered config files with weights proportional to their token counts. This means that if you have both training and validation configs in the same directory, they will all end up in the blended dataset. You may want to organize your directory structure accordingly or use the `--ignore` flag to exclude specific paths; one way to split out a training-only blend is sketched below.
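
If you need a training-only blend, one simple option is to post-filter the generated config. This is a hedged sketch (PyYAML assumed; file names follow the examples above):

```python
from pathlib import Path

import yaml

with open("blended_datasets.yaml") as f:
    config = yaml.safe_load(f)

# Keep only the training-split entries; each weight stays paired with its dataset.
kept = [
    (dataset, weight)
    for dataset, weight in zip(config["datasets"], config["weights"])
    if "training" in Path(dataset["path"]).name
]
config["datasets"] = [dataset for dataset, _ in kept]
config["weights"] = [weight for _, weight in kept]

with open("blended_training.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```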

## See Also

- [Fast-LLM Data Configuration Documentation](../docs/recipes/data-configuration.md)
- [Dataset Preparation Guide](../docs/recipes/data-preparation.md)