
[RFC] Data Separation in ExecuTorch #8118

@lucylq


🚀 The feature, motivation and pitch

Currently, ExecuTorch supports one file format, ‘PTE’. The PTE file contains everything required to execute the model: instructions, delegated blobs, and constant weights.

If two PTE files are based on a common model, there is currently no way for them to share weights or other data. If a system wants to download both PTE files, the data they have in common is duplicated on disk. Loading them has a similar problem: even if disk space were available, loading both PTE files at the same time would duplicate the shared data in RAM. For very large models, this could mean duplicating gigabytes of data. On edge systems with constrained disk space and RAM, this is often infeasible.

Note: This doc covers backend data separation. For the backend weight sharing doc, see: [RFC] Enable Weight Sharing across a single Backend

RFC (Optional)

Scope

Assumptions

  • We want to provide a way for backends to separate weights into multiple files.

Goals

  • Provide a way for multiple PTE files to share data, both on disk and in RAM.
  • Newly added infrastructure and APIs should have minimal effect on the existing implementation and the ExecuTorch flow.
    • Data separation is opt-in.
    • Do not complicate AoT and runtime APIs for users who do not use data separation.
  • Do not significantly regress load time for users of data separation.
  • Do not significantly increase ET runtime binary size.
  • Do not introduce C++ standard library dependencies to core ExecuTorch.

Non-goals

  • Runtime retargetability: this RFC does not cover the case where a generic PTE file is created and the backend is chosen at runtime based on the available hardware.
    • Currently, a PTE file is generated with specific backends in mind. For example, a PTE file may contain a program that is partially lowered to XNNPACK, so the runtime environment must include XNNPACK to run it.
  • Caching of loaded external data: loaded data is not necessarily cached, so each request to load shared data may allocate new memory. For now, backends are responsible for managing this themselves to realize the memory savings from shared data.

Overview

Data separation is a proposed new feature that allows parts of the PTE file to live in separate, sharable files. Its main benefit is that it unblocks data sharing between separate PTE files.

Example

[Figure: dependency diagram of PTE1, PTE2, data1, data2, and shared_data]

Note: each box is a separate file, and the arrows indicate dependencies. For example, PTE1 requires data1 and shared_data to execute.

PTE1 and PTE2 are separate models that share data. One example use case is LoRA: multiple LoRA programs may share the same foundation weights while being optimized for different tasks, e.g. assistant or summarization. Here, PTE1 and PTE2 contain separate LoRA programs, and ‘shared_data’ contains the foundation weights for both. For LLMs, foundation weights can be on the order of gigabytes. Without sharing, PTE1 and PTE2 must each hold a copy, duplicating potentially gigabytes of data.

‘data1’ and ‘data2’ may contain LoRA adapter weights. LoRA adapter weights are usually small, on the order of megabytes, though the size varies with the degree of fine-tuning. Keeping ‘data1’ and ‘data2’ in standalone files improves deployment efficiency: adapter weights are likely to be updated more often than the foundation weights, and deploying a smaller file over the air (OTA) is quicker and less prone to failure. If the PTE program and its LoRA weights are both small, it is reasonable to keep them in a single file and update them together.
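The disk-footprint argument in this example comes down to simple arithmetic. The sketch below models the dependency graph as a mapping from each program to the data files it needs, then compares bytes stored with and without sharing. The file sizes are hypothetical placeholders rather than measurements, PTE2's dependency on data2 is assumed by symmetry with PTE1, and the size of the program instructions themselves is ignored.

```python
# Hypothetical sizes for illustration only; real sizes depend on the model.
MB = 1024 ** 2
GB = 1024 ** 3

file_sizes = {
    "shared_data": 4 * GB,  # foundation weights (hypothetical size)
    "data1": 50 * MB,       # LoRA adapter weights for PTE1 (hypothetical size)
    "data2": 50 * MB,       # LoRA adapter weights for PTE2 (hypothetical size)
}

# Which data files each program needs to execute (PTE2 -> data2 is assumed).
dependencies = {
    "PTE1": {"data1", "shared_data"},
    "PTE2": {"data2", "shared_data"},
}


def footprint_without_sharing(deps, sizes):
    """Each program embeds its own copy of every data file it needs."""
    return sum(sizes[f] for files in deps.values() for f in files)


def footprint_with_sharing(deps, sizes):
    """Each distinct data file is stored once, however many programs use it."""
    unique_files = set().union(*deps.values())
    return sum(sizes[f] for f in unique_files)


without = footprint_without_sharing(dependencies, file_sizes)
with_sharing = footprint_with_sharing(dependencies, file_sizes)
print((without - with_sharing) // GB)  # 4: one full copy of shared_data is saved
```

The savings equal one copy of the shared file per additional program that depends on it, which is why separation matters most when the shared component dominates the total size.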

Design

We propose new ahead-of-time (AoT) APIs that give backends all the graphs, across partitions and methods, that are to be lowered. This lets backends identify the components shared across these graphs. We also provide backends with a blob storage service for serializing data that is shared across graphs. At runtime, backends can retrieve the shared data for any further initialization. The design details are fleshed out in the Blob Storage Service RFC (#8122); see the sections ‘AoT: Preprocess’ and ‘Runtime: NamedDataMap’.
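To illustrate the AoT/runtime split described above, here is a minimal sketch of a name-keyed blob store. It assumes, hypothetically, that backends register shared data under a unique string key at AoT time, that identical data registered twice under the same key is stored only once, and that the runtime looks data up by key without copying it. The names `BlobStorage`, `add_named_data`, `serialize`, and `NamedDataMap.get_data` are illustrative only and are not the ExecuTorch API; #8122 describes the actual design.

```python
from dataclasses import dataclass, field


@dataclass
class BlobStorage:
    """Hypothetical AoT-side store: backends register data under a unique name."""

    blobs: dict = field(default_factory=dict)

    def add_named_data(self, key: str, data: bytes) -> None:
        existing = self.blobs.get(key)
        if existing is None:
            self.blobs[key] = data
        elif existing != data:
            # Two partitions claimed the same name for different data.
            raise ValueError(f"conflicting data for key {key!r}")
        # Re-registering identical data is a no-op, so shared weights are stored once.

    def serialize(self) -> tuple:
        """Pack all blobs into one buffer plus a key -> (offset, size) index."""
        index, chunks, offset = {}, [], 0
        for key in sorted(self.blobs):
            data = self.blobs[key]
            index[key] = (offset, len(data))
            chunks.append(data)
            offset += len(data)
        return b"".join(chunks), index


class NamedDataMap:
    """Hypothetical runtime-side view: look up data by name without copying."""

    def __init__(self, buffer: bytes, index: dict):
        self._view = memoryview(buffer)
        self._index = index

    def get_data(self, key: str) -> memoryview:
        offset, size = self._index[key]
        # Slicing a memoryview references the original buffer; no bytes are copied.
        return self._view[offset:offset + size]


# Usage: two partitions register the same foundation weight; it is stored once.
storage = BlobStorage()
storage.add_named_data("foundation.linear.weight", b"\x01\x02\x03\x04")  # partition A
storage.add_named_data("foundation.linear.weight", b"\x01\x02\x03\x04")  # partition B
storage.add_named_data("adapter1.lora_a", b"\x05\x06")

buffer, index = storage.serialize()
data_map = NamedDataMap(buffer, index)
print(len(buffer))                                  # 6
print(bytes(data_map.get_data("adapter1.lora_a")))  # b'\x05\x06'
```

Keying by name lets two partitions, or two methods, refer to the same weights without either knowing about the other. In this sketch a conflicting registration raises an error rather than silently overwriting the earlier data.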

cc @mcr229, @iseeyuan, @dbort, @JacobSzwejbka, @tarun292

Metadata

Labels: rfc, triaged

Project status: Done