🚀 The feature, motivation and pitch
Problem
In ExecuTorch today, models with multiple methods (e.g. prefill and decode) are exported as separate graphs. When lowering to a specific backend, each graph is lowered in isolation, with no awareness of the other graphs being lowered to the same backend. The problem arises when these separate graphs share components. In a llama model with prefill and decode, the linear layers in each method share the same weights and biases. Because the prefill and decode graphs are lowered separately, the shared weights and biases are copied and serialized into each backend payload, so the same data is stored twice. This duplication bloats the model, which is a limiting factor when bringing models to production, especially on memory-constrained devices.
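To illustrate, here is a minimal sketch using plain torch.export rather than the full ExecuTorch lowering flow; the toy model and the MethodWrapper helper are purely illustrative. Each exported program captures its own copy of the shared linear weights, so a backend that lowers the two graphs independently ends up serializing the same tensors into both payloads.

```python
# Illustrative sketch only: a toy model whose prefill and decode paths reuse
# the same linear layer. Exporting each method as its own program yields two
# graphs that each carry their own copy of the shared weights.
import torch
from torch.export import export


class TinyDecoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(64, 64)  # weights shared by prefill and decode

    def prefill(self, tokens):  # whole prompt: [seq_len, hidden]
        return self.proj(tokens)

    def decode(self, token):  # single step: [1, hidden]
        return self.proj(token)


class MethodWrapper(torch.nn.Module):
    """Wraps a single method so torch.export can trace it via forward()."""

    def __init__(self, model, method_name):
        super().__init__()
        self.model = model
        self.method_name = method_name

    def forward(self, x):
        return getattr(self.model, self.method_name)(x)


model = TinyDecoder()
prefill_ep = export(MethodWrapper(model, "prefill"), (torch.randn(8, 64),))
decode_ep = export(MethodWrapper(model, "decode"), (torch.randn(1, 64),))

# Both exported programs capture proj.weight / proj.bias independently, so a
# backend lowering each graph on its own embeds the same tensors in both payloads.
```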
Requirements
- The user flow for lowering and executing models should not change
- Backwards Compatible
- Opt-in by delegates (Delegates should see no change if they don’t implement this new feature)
- Design components introduced should be reusable and extensible to program-data separation ([RFC] Data Separation in ExecuTorch #8118)
- Changes to the delegate APIs introduced by weight sharing should be reusable for program-data separation
- Core metrics like Model Load Time, Inference Latency, and Binary Size should not be significantly affected
- No C++ standard library dependencies are introduced to ExecuTorch by this change
Goals
- Provide an AoT API for backends to identify the shared components across the graphs being lowered
- All graphs to be lowered by a given backend (across partitions and methods) are accessible to the backend before the first payload is serialized
- Backends can identify shared data across graphs and serialize it separately from the lowered payloads (reducing copying across lowered payloads)
- Read-only shared data is accessible to the backend when initializing the lowered payloads
- Shared data is loaded on request by backends
- Shared data is freeable by backends after use
Non-Goals
- While we wish to reuse design components for program-data separation, this feature will not implement program-data separation, and leveraging weight sharing does not require using program-data separation
- Loaded shared data is not cached by the ExecuTorch runtime, meaning each request to load shared data at runtime allocates new memory for the shared data.
- This does not attempt to enable sharing of weights across different backends.
Design
We propose new ahead-of-time APIs that provide backends with all of the graphs, across partitions and methods, that are to be lowered. This enables backends to identify the shared components across these graphs. Additionally, we provide a blob storage service that backends can use to serialize data shared across graphs. At runtime, backends can retrieve the shared data for any further initialization. The design details are fleshed out in the Blob Storage Design RFC (RFC: Blob Storage Design); see the 'AoT: Preprocess' and 'Runtime: NamedDataMap' sections.
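To make the proposed flow concrete, below is a minimal, hypothetical sketch of how a backend might use such an API. The names MultiMethodBackend, preprocess_all, BlobStore, and add_named_data are placeholders for this RFC, not the final ExecuTorch API. The key idea is that the backend sees every graph before serializing any payload, writes shared weights into the blob store keyed by name, and has each lowered payload carry only a reference to that data.

```python
# Hypothetical sketch of the proposed AoT flow; class and method names are
# illustrative placeholders, not the final ExecuTorch API.
from typing import Dict, List

from torch.export import ExportedProgram


class BlobStore:
    """Toy stand-in for the proposed blob storage service: each named blob is
    stored once and referenced by key from the lowered payloads."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def add_named_data(self, name: str, data: bytes) -> str:
        self._blobs.setdefault(name, data)  # serialize shared data exactly once
        return name


class MultiMethodBackend:
    """Illustrative backend preprocess that receives every graph (across
    methods and partitions) before serializing any payload, so it can factor
    shared constants out into the blob store."""

    def preprocess_all(
        self,
        named_graphs: Dict[str, ExportedProgram],
        blob_store: BlobStore,
    ) -> Dict[str, bytes]:
        payloads: Dict[str, bytes] = {}
        for method, ep in named_graphs.items():
            keys: List[str] = []
            for name, tensor in ep.state_dict.items():
                # Weights shared between methods go into the blob store; the
                # payload only records the key used to fetch them at runtime.
                raw = tensor.detach().cpu().numpy().tobytes()
                keys.append(blob_store.add_named_data(name, raw))
            payloads[method] = self._compile(ep, external_constant_keys=keys)
        return payloads

    def _compile(self, ep: ExportedProgram, external_constant_keys: List[str]) -> bytes:
        # Backend-specific serialization that omits the externalized tensors.
        raise NotImplementedError
```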
