Improve pathways checkpoint load times #1345
Conversation
* Use shared memory between the JAX client and the Pathways proxy for data-heavy transfers, e.g. `device_put`s.
* Increase the `ThreadPoolExecutor` thread count from 32 (the Python default) to 192.
* Remove the memory limit from the Pathways head main container.

Callers should use a `concurrent_restore_gb` as large as possible without OOMing; otherwise the GCS read and the `device_put` won't happen in parallel. The default of 32GB is too low to achieve optimal performance with Pathways.
```diff
-# This image version extends GRPC timeout for long context models, based on jax-0.5.3-patch060625
+# This image extends GRPC timeout for long context models.
-_PATHWAYS_IMAGE_TAG = "disable_settings_20250701"
+_PATHWAYS_IMAGE_TAG = "shm_proxy"
```
Could you double-check with Shauray that this binary includes the patch extending the GRPC timeout? Or does it not need it anymore?
```diff
+# The flag below is needed for better H2D performance.
+# Rule of thumb: 3x the shard size. So 128GB to be safe.
+# Decrease if you start running out of host memory on TPU VMs.
+"--tpu_premapped_buffer_size=137438953472",
```
Let's use 1/4 of the machine type's host memory, rounded up to the nearest power of 2:
https://github.com/apple/axlearn/blob/main/axlearn/cloud/gcp/system_characteristics.py#L494-L499
```diff
 self._loop_thread.start()
-self._single_thread_pool = ThreadPoolExecutor(1)
+self._single_thread_pool = ThreadPoolExecutor(max_workers=1)
+self._multi_thread_pool = ThreadPoolExecutor(max_workers=192)
```
Can we make this a config flag? It depends on how many CPUs we allocate to the head pod: https://github.com/apple/axlearn/blob/main/axlearn/cloud/gcp/pathways_utils.py#L317
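Making the pool size configurable could look roughly like this. This is only a sketch under assumptions: axlearn configs are built with its own config system rather than dataclasses, and `max_transfer_workers` is an invented field name:

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class ProxyConfig:
    # Hypothetical knob; would default to 192 but be tuned to the
    # CPU count allocated to the pathways head pod.
    max_transfer_workers: int = 192


def make_pools(cfg: ProxyConfig):
    """Build the serial pool and the configurable transfer pool."""
    single = ThreadPoolExecutor(max_workers=1)
    multi = ThreadPoolExecutor(max_workers=cfg.max_transfer_workers)
    return single, multi
```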
```python
mem_req = f"{self.config.pathways_head_mem}Gi"
resources = {
    "requests": {"cpu": cpu_req, "memory": mem_req},
    "limits": {"cpu": cpu_req, "memory": mem_req},
```
For my education, what's the effect of having "request" and not "limit"?
This pull request has been automatically marked as stale because it has been inactive for 60 days. It will be closed in 7 days if no further activity occurs. If you would like to continue working on this, please remove the stale label.
This pull request was closed because it has been inactive for more than 7 days since being marked as stale. Please feel free to reopen it if you would like to continue.