The JAX implementation of the schedule-free AdamW algorithm exhibits significant training curve discrepancies compared to the PyTorch reference across several workloads, including Librispeech DeepSpeech, ImageNet, WMT, and Criteo1TB.
Following an initial debugging session with @priyakasimbeg, a key issue was identified: the JAX code incorrectly used a single variable (y) for both training and validation phases. The intended logic requires using x for validation and y for training.
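For context, the x/y split is intrinsic to the schedule-free method (Defazio et al.): gradients are evaluated at the interpolated point y, while the averaged iterate x is what should be used for evaluation/validation. A minimal sketch of that bookkeeping, using plain SGD on a 1-D quadratic rather than the actual AdamW inner step, and with illustrative names (`schedule_free_sgd`, `grad_fn`) that are not from the codebase:

```python
def schedule_free_sgd(grad_fn, z0, lr=0.1, beta=0.9, steps=1000):
    """Illustrative schedule-free update loop (not the repo's implementation).

    z: base iterate updated by the gradient step.
    x: running average of z -- use this for validation/evaluation.
    y: interpolation of z and x -- gradients (training) are taken here.
    """
    z = z0
    x = z0
    for t in range(1, steps + 1):
        y = (1 - beta) * z + beta * x   # training point: gradient is taken at y
        z = z - lr * grad_fn(y)
        c = 1.0 / t                     # uniform averaging weight
        x = (1 - c) * x + c * z         # x tracks the average of the z iterates
    return x  # evaluate/validate at x, never at y

# Minimize f(w) = (w - 3)^2, so grad_fn(w) = 2 * (w - 3).
x_final = schedule_free_sgd(lambda w: 2.0 * (w - 3.0), z0=0.0)
```

Using y in place of x at evaluation time, as the pre-fix JAX code effectively did, reports metrics at the interpolated training point rather than the averaged iterate, which is one plausible source of the curve discrepancies.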
A fix was issued in pull request #16. However, the training results for Librispeech still do not align with PyTorch, and other issues have emerged for WMT and Criteo1TB, specifically deadlocks and out-of-memory errors.
Further in-depth debugging is necessary to bring the JAX training results in line with PyTorch to finalize this MLCommons submission.