
Commit d8d9b43

Merge pull request #904 from mlcommons/dev (Dev -> main)
2 parents: 2c90454 + c823cdd

53 files changed: 3,872 additions & 114 deletions


.github/workflows/regression_tests.yml

Lines changed: 19 additions & 1 deletion
@@ -107,7 +107,16 @@ jobs:
       - name: Run containerized workload
         run: |
           docker pull us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_jax_${{ github.head_ref || github.ref_name }}
-          docker run -v $HOME/data/:/data/ -v $HOME/experiment_runs/:/experiment_runs -v $HOME/experiment_runs/logs:/logs --gpus all --ipc=host us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_jax_${{ github.head_ref || github.ref_name }} -d wmt -f jax -s algorithms/archived_paper_baselines/adamw/jax/submission.py -w wmt -t algorithms/archived_paper_baselines/adamw/tuning_search_space.json -e tests/regression_tests/adamw -m 10 -c False -o True -r false --data_bucket mlcommons-data --logs_bucket mlcommons-runs --data_bucket mlcommons-data --logs_bucket mlcommons-runs
+          docker run -v $HOME/data/:/data/ -v $HOME/experiment_runs/:/experiment_runs -v $HOME/experiment_runs/logs:/logs --gpus all --ipc=host us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_jax_${{ github.head_ref || github.ref_name }} -d wmt -f jax -s algorithms/archived_paper_baselines/adamw/jax/submission.py -w wmt -t algorithms/archived_paper_baselines/adamw/tuning_search_space.json -e tests/regression_tests/adamw -m 10 -c False -o True -r false --data_bucket mlcommons-data --logs_bucket mlcommons-runs --data_bucket mlcommons-data --logs_bucket mlcommons-runs
+  finewebedu_lm_jax:
+    runs-on: self-hosted
+    needs: build_and_push_jax_docker_image
+    steps:
+      - uses: actions/checkout@v2
+      - name: Run containerized workload
+        run: |
+          docker pull us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_jax_${{ github.head_ref || github.ref_name }}
+          docker run -v $HOME/data/:/data/ -v $HOME/experiment_runs/:/experiment_runs -v $HOME/experiment_runs/logs:/logs --gpus all --ipc=host us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_jax_${{ github.head_ref || github.ref_name }} -d fineweb_edu_10B -f jax -s algorithms/archived_paper_baselines/adamw/jax/submission.py -w finewebedu_lm -t algorithms/archived_paper_baselines/adamw/tuning_search_space.json -e tests/regression_tests/adamw -m 10 -c False -o True -r false --data_bucket mlcommons-data --logs_bucket mlcommons-runs --data_bucket mlcommons-data --logs_bucket mlcommons-runs
   fastmri_pytorch:
     runs-on: self-hosted
     needs: build_and_push_pytorch_docker_image
@@ -181,3 +190,12 @@ jobs:
         run: |
           docker pull us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_pytorch_${{ github.head_ref || github.ref_name }}
           docker run -v $HOME/data/:/data/ -v $HOME/experiment_runs/:/experiment_runs -v $HOME/experiment_runs/logs:/logs --gpus all --ipc=host us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_pytorch_${{ github.head_ref || github.ref_name }} -d wmt -f pytorch -s algorithms/archived_paper_baselines/adamw/pytorch/submission.py -w wmt -t algorithms/archived_paper_baselines/adamw/tuning_search_space.json -e tests/regression_tests/adamw -m 10 -c False -o True -r false --data_bucket mlcommons-data --logs_bucket mlcommons-runs --data_bucket mlcommons-data --logs_bucket mlcommons-runs
+  finewebedu_lm_pytorch:
+    runs-on: self-hosted
+    needs: build_and_push_pytorch_docker_image
+    steps:
+      - uses: actions/checkout@v2
+      - name: Run containerized workload
+        run: |
+          docker pull us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_pytorch_${{ github.head_ref || github.ref_name }}
+          docker run -v $HOME/data/:/data/ -v $HOME/experiment_runs/:/experiment_runs -v $HOME/experiment_runs/logs:/logs --gpus all --ipc=host us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_pytorch_${{ github.head_ref || github.ref_name }} -d fineweb_edu_10B -f pytorch -s algorithms/archived_paper_baselines/adamw/pytorch/submission.py -w finewebedu_lm -t algorithms/archived_paper_baselines/adamw/tuning_search_space.json -e tests/regression_tests/adamw -m 10 -c False -o True -r false --data_bucket mlcommons-data --logs_bucket mlcommons-runs --data_bucket mlcommons-data --logs_bucket mlcommons-runs

.gitignore

Lines changed: 1 addition & 1 deletion
@@ -25,4 +25,4 @@ scoring/plots/
 !scoring/test_data/experiment_dir/study_0/mnist_jax/trial_0/eval_measurements.csv
 !scoring/test_data/experiment_dir/study_0/mnist_jax/trial_1/eval_measurements.csv
 
-algoperf/_version.py
+algoperf/_version.py

README.md

Lines changed: 1 addition & 1 deletion
@@ -31,7 +31,7 @@ The MLCommons™ **AlgoPerf: Training Algorithms benchmark** is designed to find
 When training neural nets, practitioners face many critical yet often opaque decisions: What optimizer to choose? How should its learning rate be tuned? What learning rate schedule should be used? These choices can make or break training, yet the community has lacked a clear, standardized way to identify the state of the art.
 Unlike benchmarks focused on hardware or model architecture, AlgoPerf isolates the **training algorithm** itself, which includes the optimizer, regularization, data selection, and hyperparameters like the learning rate schedule. By standardizing the benchmark process, AlgoPerf offers a meaningful apples-to-apples comparison of training algorithms and follows the following **key principles**:
 
-- 🎯 **Fixed Target, Model & Hardware:** Submitted training algorithms must train a set of [**fixed models**](/docs/DOCUMENTATION.md#workloads) to a pre-defined validation performance target as fast as possible. All submissions use the same model architecture and are run on the same [**standardized hardware**](/docs/DOCUMENTATION.md#benchmarking-hardware) (8x NVIDIA V100 GPUs). This isolates the training algorithm's performance and allows a fair apples-to-apples comparison.
+- 🎯 **Fixed Target, Model & Hardware:** Submitted training algorithms must train a set of [**fixed models**](/docs/DOCUMENTATION.md#workloads) to a pre-defined validation performance target as fast as possible. All submissions use the same model architecture and are run on the same [**standardized hardware**](/docs/DOCUMENTATION.md#benchmarking-hardware) (4x A100 (40GB) GPUs). This isolates the training algorithm's performance and allows a fair apples-to-apples comparison.
 - ⏱️ **Time-To-Result:** Submissions are evaluated based on the total wall-clock time required to reach the target, rewarding practical and efficient algorithms.
 - 🧠 **Diverse Workloads:** The benchmark includes [**8 diverse deep learning workloads**](/docs/DOCUMENTATION.md#workloads) across domains like image classification, speech recognition, and machine translation. A submission's score is computed by aggregating its performance, using [**performance profiles**](/docs/DOCUMENTATION.md#benchmark-score-using-performance-profiles), across all workloads to ensure general-purpose algorithms.
 - 📦 **Fully-Specified Algorithms:** Submissions must be complete procedures and thus hyperparameter tuning is treated as part of the algorithm. Submissions can either provide a search space for automated tuning ([**External tuning ruleset**](/docs/DOCUMENTATION.md#external-tuning-ruleset)) or be hyperparameter-free ([**Self-tuning ruleset**](/docs/DOCUMENTATION.md#self-tuning-ruleset)) with any tuning done automatically and "on the clock". This measures an algorithm's _total_ practical cost and provides practitioners with a complete method, eliminating the guesswork of how to apply it.

algoperf/checkpoint_utils.py

Lines changed: 48 additions & 1 deletion
@@ -5,14 +5,16 @@
 """
 
 import os
-from typing import Sequence, Tuple
+from typing import Optional, Sequence, Tuple
 
 import numpy as np
+import orbax.checkpoint as ocp
 import torch
 from absl import logging
 from flax import jax_utils
 from flax.training import checkpoints as flax_checkpoints
 from flax.training.checkpoints import latest_checkpoint
+from orbax.checkpoint.type_handlers import NumpyHandler
 from tensorflow.io import gfile  # pytype: disable=import-error
 
 from algoperf import spec
@@ -30,6 +32,51 @@
 ]
 
 
+class BoolHandler(NumpyHandler):
+  """
+  An implementation of TypeHandler for np.bool_ that inherits from NumpyHandler.
+  It works by treating the scalar as a 0-dimensional array.
+  """
+
+  def typestr(self) -> str:
+    """Unique string identifier for this handler."""
+    return 'np.bool_'
+
+  async def serialize(
+    self,
+    values: Sequence[np.bool_],
+    infos: Sequence,
+    args: Optional[Sequence[ocp.SaveArgs]] = None,
+  ):
+    """
+    Serializes a sequence of np.bool_ scalars by first converting them
+    to 0-dim numpy arrays and then calling the parent NumpyHandler.
+    """
+    # Convert each scalar np.bool_ to a 0-dimensional np.ndarray
+    array_values = [np.asarray(v, dtype=np.bool_) for v in values]
+    # Use the parent class's robust serialization logic
+    return await super().serialize(array_values, infos, args)
+
+  async def deserialize(
+    self,
+    infos: Sequence,
+    args: Optional[Sequence[ocp.RestoreArgs]] = None,
+  ) -> Sequence[np.bool_]:
+    """
+    Deserializes into a sequence of np.bool_ scalars by calling the
+    parent handler and then converting the resulting 0-dim arrays.
+    """
+    # Parent deserialize will return a sequence of 0-dimensional np.ndarray
+    results = await super().deserialize(infos, args)
+
+    # Convert each 0-d array back to an np.bool_ scalar using .item()
+    scalar_results = [np.bool_(r.item()) for r in results]
+    return scalar_results
+
+
+ocp.type_handlers.register_type_handler(np.bool_, BoolHandler(), override=True)
+
+
 def maybe_restore_checkpoint(
   framework: str,
   optimizer_state: spec.OptimizerState,
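The new BoolHandler relies on round-tripping a `np.bool_` scalar through a 0-dimensional array. The trick can be seen with NumPy alone; this is a minimal sketch with hypothetical helper names (Orbax is not required to demonstrate the conversion):

```python
import numpy as np

def to_saveable(value: np.bool_) -> np.ndarray:
    # A np.bool_ scalar becomes a 0-dimensional array, which array-based
    # serialization code (like NumpyHandler) can handle.
    return np.asarray(value, dtype=np.bool_)

def from_restored(arr: np.ndarray) -> np.bool_:
    # .item() extracts the Python bool; np.bool_ restores the scalar type.
    return np.bool_(arr.item())
```

Registering the handler with `override=True`, as the diff does, makes this conversion transparent whenever a `np.bool_` leaf appears in a checkpointed pytree.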

algoperf/param_utils.py

Lines changed: 2 additions & 0 deletions
@@ -44,6 +44,8 @@ def pytorch_param_types(
       param_types[name] = spec.ParameterType.ATTENTION_BIAS
     elif 'in_proj' in name:
       param_types[name] = spec.ParameterType.ATTENTION_QKV
+    elif 'qkv' in name:
+      param_types[name] = spec.ParameterType.ATTENTION_QKV
     elif 'kv_proj' in name:
       param_types[name] = spec.ParameterType.ATTENTION_KV
     elif 'k_proj' in name or 'key' in name:
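The classification is substring matching over parameter names, and branch order matters: the new `'qkv'` branch sits after `'in_proj'` but before `'kv_proj'`, so a fused-attention name like `attn.qkv.weight` is tagged QKV before narrower checks run. A standalone sketch (the function name and string return values are illustrative stand-ins for the real `spec.ParameterType` enum; only branches visible in the diff are reproduced):

```python
def classify_param(name: str) -> str:
    # Order matters: earlier, more specific substrings win.
    if 'in_proj' in name:
        return 'ATTENTION_QKV'      # fused in-projection
    elif 'qkv' in name:
        return 'ATTENTION_QKV'      # new branch from this commit
    elif 'kv_proj' in name:
        return 'ATTENTION_KV'
    return 'UNKNOWN'                # remaining branches elided
```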

algoperf/pytorch_utils.py

Lines changed: 6 additions & 2 deletions
@@ -20,14 +20,18 @@
 
 
 def pytorch_setup() -> Tuple[bool, int, torch.device, int]:
+  torch.set_float32_matmul_precision('high')
+
   use_pytorch_ddp = 'LOCAL_RANK' in os.environ
   rank = int(os.environ['LOCAL_RANK']) if use_pytorch_ddp else 0
   device = torch.device(f'cuda:{rank}' if torch.cuda.is_available() else 'cpu')
   n_gpus = torch.cuda.device_count()
   return use_pytorch_ddp, rank, device, n_gpus
 
 
-def pytorch_init(use_pytorch_ddp: bool, rank: int, profiler: Profiler) -> None:
+def pytorch_init(
+  use_pytorch_ddp: bool, rank: int, profiler: Profiler, limit_tf_threads=True
+) -> None:
   # Make sure no GPU memory is preallocated to Jax.
   os.environ['XLA_PYTHON_CLIENT_PREALLOCATE'] = 'false'
   # Only use CPU for Jax to avoid memory issues.
@@ -39,7 +43,7 @@ def pytorch_init(use_pytorch_ddp: bool, rank: int, profiler: Profiler) -> None:
 
   if use_pytorch_ddp:
     # Avoid tf input pipeline creating too many threads.
-    if rank != 0:
+    if rank != 0 and limit_tf_threads:
       tf.config.threading.set_intra_op_parallelism_threads(1)
       tf.config.threading.set_inter_op_parallelism_threads(1)
 

algoperf/random_utils.py

Lines changed: 2 additions & 2 deletions
@@ -35,13 +35,13 @@ def _signed_to_unsigned(seed: SeedType) -> SeedType:
 
 def _fold_in(seed: SeedType, data: Any) -> List[Union[SeedType, Any]]:
   rng = np.random.RandomState(seed=_signed_to_unsigned(seed))
-  new_seed = rng.randint(MIN_INT32, MAX_INT32, dtype=np.int32)
+  new_seed = rng.randint(MIN_INT32, MAX_INT32, dtype=np.uint32)
   return [new_seed, data]
 
 
 def _split(seed: SeedType, num: int = 2) -> SeedType:
   rng = np.random.RandomState(seed=_signed_to_unsigned(seed))
-  return rng.randint(MIN_INT32, MAX_INT32, dtype=np.int32, size=[num, 2])
+  return rng.randint(MIN_INT32, MAX_INT32, dtype=np.uint32, size=[num, 2])
 
 
 def _PRNGKey(seed: SeedType) -> SeedType:  # pylint: disable=invalid-name
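These helpers derive fresh seeds deterministically from an existing seed via NumPy's `RandomState`, JAX-style. A hedged sketch of the pattern, assuming illustrative unsigned 32-bit bounds (the real module defines its own `MIN_INT32`/`MAX_INT32` and a `_signed_to_unsigned` conversion):

```python
import numpy as np

# Assumed bounds for illustration; rng.randint's upper bound is exclusive.
LOW, HIGH = 0, 2**32 - 1

def fold_in(seed, data):
    # Same seed + same call -> same derived seed (deterministic).
    rng = np.random.RandomState(seed=seed)
    new_seed = rng.randint(LOW, HIGH, dtype=np.uint32)
    return [new_seed, data]

def split(seed, num=2):
    # Produce `num` independent (2,)-shaped seeds from one parent seed.
    rng = np.random.RandomState(seed=seed)
    return rng.randint(LOW, HIGH, dtype=np.uint32, size=[num, 2])
```

Requesting `dtype=np.uint32` (the change in this commit) keeps the drawn seeds in the unsigned range that `RandomState(seed=...)` itself accepts.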

algoperf/workloads/cifar/cifar_pytorch/workload.py

Lines changed: 2 additions & 2 deletions
@@ -110,12 +110,12 @@ def _build_dataset(
       batch_size=ds_iter_batch_size,
       shuffle=not USE_PYTORCH_DDP and is_train,
       sampler=sampler,
-      num_workers=4 if is_train else self.eval_num_workers,
+      num_workers=2 * N_GPUS if is_train else self.eval_num_workers,
       pin_memory=True,
       drop_last=is_train,
     )
-    dataloader = data_utils.PrefetchedWrapper(dataloader, DEVICE)
     dataloader = data_utils.cycle(dataloader, custom_sampler=USE_PYTORCH_DDP)
+    dataloader = data_utils.dataloader_iterator_wrapper(dataloader, DEVICE)
     return dataloader
 
   def init_model_fn(self, rng: spec.RandomState) -> spec.ModelInitState:
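The `data_utils.cycle` call turns a finite DataLoader into an endless batch stream for the workload's input queue. A generic sketch of that pattern, under the assumption that `cycle` re-iterates its input indefinitely (this is not the actual `data_utils` implementation, which also handles DDP samplers and device placement):

```python
def cycle(make_loader):
    # make_loader: zero-arg callable returning a fresh finite iterable
    # (e.g. a DataLoader); yields its batches forever, epoch after epoch.
    while True:
        for batch in make_loader():
            yield batch
```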

algoperf/workloads/criteo1tb/workload.py

Lines changed: 2 additions & 2 deletions
@@ -95,11 +95,11 @@ def train_stddev(self):
 
   @property
   def max_allowed_runtime_sec(self) -> int:
-    return 7_703  # ~2.1 hours.
+    return 8_915  # ~2.4 hours.
 
   @property
   def eval_period_time_sec(self) -> int:
-    return 2 * 60  # 2 mins.
+    return 356  # approx 25 evals
 
   def _build_input_queue(
     self,
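The new `eval_period_time_sec` values appear chosen so each workload fits roughly 25 evals inside its runtime budget, as the "approx 25 evals" comments say. A quick arithmetic check (`eval_period` is a hypothetical helper, not part of the codebase):

```python
def eval_period(max_runtime_sec: int, target_evals: int = 25) -> float:
    # Period between evals if the budget is split into `target_evals` slices.
    return max_runtime_sec / target_evals

# Criteo: budget 8_915 s, diff picks 356 s; FastMRI: budget 2_745 s, diff picks 110 s.
criteo_period = eval_period(8_915)
fastmri_period = eval_period(2_745)
```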

algoperf/workloads/fastmri/workload.py

Lines changed: 2 additions & 2 deletions
@@ -95,11 +95,11 @@ def accelerations(self):
 
   @property
   def max_allowed_runtime_sec(self) -> int:
-    return 4_430  # ~1.2 hours
+    return 2_745  # ~0.7 hours
 
   @property
   def eval_period_time_sec(self) -> int:
-    return 80
+    return 110  # approx 25 evals
 
   @property
   def step_hint(self) -> int:
