[RFC] Data Separation in ExecuTorch

### 🚀 The feature, motivation and pitch

Currently, ExecuTorch supports one file format, ‘PTE’. The PTE file contains everything required to execute the model: instructions, delegated blobs and constant weights.

If two PTE files are based on a common model, there’s currently no way for them to share weights or other data. If a system wants to download both PTE files, the shared data must be duplicated on disk. There’s a similar problem when loading them: even if there were enough disk space, loading both PTE files at the same time would require duplicating the data in RAM. For very large models, this could mean duplicating gigabytes of data. On edge systems with constrained disk space and RAM, this is likely infeasible.

Note: This doc covers backend data separation. For the backend weight sharing doc, see [[RFC] Enable Weight Sharing across a single Backend](https://github.com/pytorch/executorch/issues/8121).


### RFC (Optional)

## Scope
### Assumptions
- We want to provide a way for backends to separate weights into multiple files.
### Goals
- Provide a way for multiple PTE files to share data, both on disk and in RAM.
- Newly added infrastructure and APIs should have minimal effect on the existing implementation and ExecuTorch flow.
   - Data separation is opt-in.
   - Do not complicate AoT and runtime APIs for users who do not use data separation.
- Do not significantly regress load time for users of data separation.
- Do not significantly increase ET runtime binary size.
- Do not introduce C++ standard library dependencies to core ExecuTorch.

### Non-goals
- Runtime retargetability. This RFC does not cover the case where a generic PTE file is created and the backend is chosen at runtime based on the available hardware.
   - Currently, a PTE file is generated with specific backends in mind. E.g., a PTE file may contain a program that’s partially lowered to XNNPACK. This means the runtime environment must have XNNPACK in order to run the PTE.
- Caching of loaded external data. Loaded external data is not necessarily cached, so each request to load shared data may allocate new memory. For now, backends must manage this under the hood to realize the memory savings from shared data.

## Overview
Data separation is a proposed new feature that allows parts of the PTE file to live in separate, sharable files. Its main purpose is to unblock data sharing between separate PTE files.

**Example**

<img src="https://github.com/user-attachments/assets/2375e02c-d9f2-49e3-8379-3578dedc391c" alt="drawing" width="600"/>

Note: each box is a separate file, and the arrows indicate dependencies. E.g., PTE1 requires data1 and shared_data to execute.

PTE1 and PTE2 are separate models that share data. An example use case is [LoRA](https://huggingface.co/docs/diffusers/en/training/lora). Multiple LoRA programs may share the same foundation weights while being optimized for different tasks, e.g., assistant or summarization. Here, PTE1 and PTE2 contain separate LoRA programs, and ‘shared_data’ contains the foundation weights for both. For LLMs, foundation weights can be on the order of gigabytes. Without sharing, PTE1 and PTE2 must each hold a copy, duplicating potentially gigabytes of data.

‘data1’ and ‘data2’ may contain LoRA adapter weights. LoRA adapter weights are usually small, on the order of megabytes, though the size varies with the degree of fine-tuning. Keeping ‘data1’ and ‘data2’ in standalone files helps with deployment efficiency: LoRA adapter weights are likely to be on a faster deployment cadence than the foundation weights, and deploying a smaller file OTA is quicker and less prone to failure. If the PTE and LoRA weights are both small, it’s reasonable to keep them in a single file and update them together.
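
To make the savings concrete, here is a minimal sketch of the on-disk footprint for the layout in the diagram, with and without sharing. The file sizes are hypothetical and chosen only for illustration, and the helper names are not part of any ExecuTorch API.

```python
# Hypothetical file sizes in MiB; the numbers are illustrative only.
SIZES_MIB = {"shared_data": 4096, "data1": 20, "data2": 30, "PTE1": 1, "PTE2": 1}

# Files each program needs in order to execute, following the diagram.
DEPENDENCIES = {
    "PTE1": ["data1", "shared_data"],
    "PTE2": ["data2", "shared_data"],
}


def footprint_without_sharing(deps, sizes):
    """Every program bundles its own copy of everything it depends on."""
    return sum(sizes[pte] + sum(sizes[d] for d in ds) for pte, ds in deps.items())


def footprint_with_sharing(deps, sizes):
    """Each distinct file is stored once, however many programs use it."""
    files = set(deps)
    for ds in deps.values():
        files.update(ds)
    return sum(sizes[f] for f in files)


print(footprint_without_sharing(DEPENDENCIES, SIZES_MIB))  # 8244
print(footprint_with_sharing(DEPENDENCIES, SIZES_MIB))     # 4148
```

Under these assumed sizes, the foundation weights dominate, so storing them once roughly halves the total footprint.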

### Design
We propose new ahead-of-time APIs that provide backends with all the graphs across partitions and methods to be lowered. This enables backends to identify the components shared across these graphs. Additionally, we provide a blob storage service that backends can use to serialize data shared across graphs. At runtime, backends can retrieve the shared data for any further initialization. The design details are fleshed out in the Blob Storage Service RFC: https://github.com/pytorch/executorch/issues/8122. See the sections ‘AoT: Preprocess’ and ‘Runtime: NamedDataMap’.
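
As a rough illustration of the idea, the sketch below shows a name-keyed blob store that keeps identical data only once. It is not the proposed API: `NamedBlobStore` and its methods are hypothetical names invented for this example, and deduplicating by content hash is an assumption about one way such a service could work. The actual interfaces are described in the Blob Storage Service RFC.

```python
import hashlib


class NamedBlobStore:
    """Hypothetical AoT-side store. Backends register blobs under string
    keys; blobs with identical bytes are stored only once."""

    def __init__(self):
        self._buffers = {}      # content hash -> bytes
        self._key_to_hash = {}  # key -> content hash

    def add(self, key: str, data: bytes) -> None:
        digest = hashlib.sha256(data).hexdigest()
        existing = self._key_to_hash.get(key)
        if existing is not None and existing != digest:
            raise ValueError(f"key {key!r} already maps to different data")
        self._buffers.setdefault(digest, data)
        self._key_to_hash[key] = digest

    def get(self, key: str) -> bytes:
        """Runtime-style lookup: retrieve a blob by its key."""
        return self._buffers[self._key_to_hash[key]]

    def unique_bytes(self) -> int:
        """Total size of the stored data after deduplication."""
        return sum(len(buf) for buf in self._buffers.values())


store = NamedBlobStore()
foundation = bytes(1024)  # stand-in for shared foundation weights
store.add("foundation.weight", foundation)  # registered while lowering graph 1
store.add("foundation.weight", foundation)  # registered again for graph 2
store.add("lora1.adapter", b"\x01" * 16)
store.add("lora2.adapter", b"\x02" * 16)

print(store.unique_bytes())  # 1056
```

Here the foundation blob is registered by two graphs but stored once, and each graph can still look it up by key.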

cc @mcr229, @iseeyuan, @dbort, @JacobSzwejbka, @tarun292 
