1 change: 1 addition & 0 deletions README.md
@@ -23,6 +23,7 @@ Contact e-mail addresses of WG leads can be found [here](https://wiki.lfaidata.f
| [Multi-device](https://lfaifoundation.slack.com/archives/C05JY32GCCS) | Multi-device support in ONNX |
| Safety-Related-Profile | |
| [Generative-AI](https://lfaifoundation.slack.com/archives/C08MERYU84T) | Accelerate GenAI support in ONNX |
| Probabilistic Programming | |

## Completed working groups

69 changes: 69 additions & 0 deletions generative-ai/meetings/meeting_22_Jan_14_2026.md
@@ -0,0 +1,69 @@
# Recording and Transcript:

https://zoom.us/rec/share/O77VTbYJPR6hObo4_dJgWXcfo-y6PZoGtxSHblrJdY2ZGGTLnbbjbX8NfQAVuQ5y.Slg75HZun6F9E458

# Meeting Minutes:

## Summary
The meeting focused on three primary technical areas: reviewing an open-source contribution for flex-attention, debating the long-term infrastructure for LLM model exports, and introducing a proposal for a standardized ternary storage format to optimize LLM inference.

## Key Discussion Points

### Flex-Attention Contribution
- A comprehensive open-source contribution for flex-attention was submitted as a GitHub [PR](https://github.com/onnx/onnx/pull/7534) with numerous test cases.
- WG members are requested to review the PR before the next meeting for further discussion.
- The team noted that the primary challenge going forward will be the backend implementation; they plan to discuss how to include it in upcoming releases and how to integrate it with the exporter.

### LLM Model Export Paths
- The group discussed whether to focus on Optimum ONNX or Olive as the primary path for exporting Large Language Models (LLMs) at scale.
- Intel has been using Optimum Intel and is considering integrating its quantization tool (NNCF) into Optimum ONNX to provide a consistent experience for users. Since Olive includes an Optimum pass, the same integration would also be available through Olive.
- Rama explained that any exporter-related issues in Optimum-ONNX will be supported by the exporter team. Further clarification is needed from the Olive team regarding Olive's ONNX export plans.
- Freddy raised concerns about "model variation," suggesting the need for architectural guidelines to ensure exported models remain reusable and semantically equivalent across different tools.

### Ternary Storage Format Proposal
- Soumendu introduced a proposal for a standardized storage format for Ternary LLMs (weights represented as -1, 0, 1).
- By packing five ternary weights into one byte instead of using plain 2-bit storage (a 10:8 compression; see the quick check after this list), memory traffic between DRAM and on-chip SRAM can be reduced by approximately 20%. This significantly improves power efficiency and performance during the memory-bound decoding phase (token generation).
- A formal draft of this data-independent compression scheme will be shared for review before the next meeting.
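
To make the arithmetic behind the ~20% figure concrete, here is a quick check (a sketch; the exact packing conventions are specified in the draft):

```python
# Plain 2-bit packing spends 2 bits per ternary weight, so a group of
# 5 weights costs 10 bits. Base-3 packing exploits 3**5 = 243 <= 256
# to fit the same 5 weights into a single 8-bit byte.
baseline_bits = 5 * 2                        # standard 2-bit packing
packed_bits = 8                              # one byte per group of 5
reduction = 1 - packed_bits / baseline_bits
print(f"memory reduction: {reduction:.0%}")  # -> 20%
```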

## Action Items
- All: Review the flex-attention GitHub [PR](https://github.com/onnx/onnx/pull/7534) and provide feedback.
- Yamini: Share the list of missing ORT classes from optimum-onnx in Slack for the ONNX exporter team to review.
- Soumendu: Share the draft proposal for the Ternary Storage Format.
- Yamini & Rama: Finalize a new recurring meeting time (potentially 10:30 AM) to avoid conflicts.
- Justin: Provide updates on the RFC system for tracking proposals.

## Comparison between Auto and ORT model classes in Optimum (As of 1/23/2026):

| Task | Auto Class | ORT Class |
| ---------------------------------------- | --------------------------------------- | -------------------------------------- |
| audio-classification | AutoModelForAudioClassification | ORTModelForAudioClassification |
| audio-frame-classification | AutoModelForAudioFrameClassification | ORTModelForAudioFrameClassification |
| audio-xvector | AutoModelForAudioXVector | ORTModelForAudioXVector |
| automatic-speech-recognition | AutoModelForSpeechSeq2Seq | **Missing** |
| automatic-speech-recognition | AutoModelForCTC | ORTModelForCTC |
| depth-estimation | AutoModelForDepthEstimation | **Missing** |
| feature-extraction | AutoModel | ORTModel |
| feature-extraction / sentence-similarity | SentenceTransformer | **Missing** |
| fill-mask | AutoModelForMaskedLM | ORTModelForMaskedLM |
| image-classification | AutoModelForImageClassification | ORTModelForImageClassification |
| image-text-to-text | AutoModelForImageTextToText | **Missing** |
| image-to-image | AutoModelForImageToImage | ORTModelForImageToImage |
| image-to-image | AutoPipelineForImage2Image | ORTPipelineForImage2Image |
| image-to-text | AutoModelForVision2Seq | ORTModelForVision2Seq |
| inpainting | AutoPipelineForInpainting | ORTPipelineForInpainting |
| masked-im | AutoModelForMaskedImageModeling | **Missing** |
| multiple-choice | AutoModelForMultipleChoice | ORTModelForMultipleChoice |
| object-detection | AutoModelForObjectDetection | **Missing** |
| question-answering | AutoModelForQuestionAnswering | ORTModelForQuestionAnswering |
| semantic-segmentation | AutoModelForSemanticSegmentation | ORTModelForSemanticSegmentation |
| text2text-generation | AutoModelForSeq2SeqLM | ORTModelForSeq2SeqLM |
| text-classification | AutoModelForSequenceClassification | ORTModelForSequenceClassification |
| text-generation | AutoModelForCausalLM | ORTModelForCausalLM |
| text-to-audio | AutoModelForTextToSpectrogram | **Missing** |
| text-to-audio | AutoModelForTextToWaveform | **Missing** |
| text-to-image | AutoPipelineForText2Image | ORTPipelineForText2Image |
| token-classification | AutoModelForTokenClassification | ORTModelForTokenClassification |
| visual-question-answering | AutoModelForVisualQuestionAnswering | **Missing** |
| zero-shot-image-classification | AutoModelForZeroShotImageClassification | ORTModelForZeroShotImageClassification |
| zero-shot-object-detection | AutoModelForZeroShotObjectDetection | **Missing** |
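
For readers new to the mapping above: the ORT classes are ONNX Runtime-backed counterparts of the Transformers Auto classes with a matching interface. A minimal sketch using optimum's documented `from_pretrained(..., export=True)` pattern (the checkpoint name is illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

model_id = "gpt2"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Eager PyTorch path via the Auto class:
pt_model = AutoModelForCausalLM.from_pretrained(model_id)

# ONNX Runtime path via the ORT class; export=True converts the
# checkpoint to ONNX on the fly, after which the same generate()
# interface applies.
ort_model = ORTModelForCausalLM.from_pretrained(model_id, export=True)

inputs = tokenizer("Hello, ONNX!", return_tensors="pt")
print(tokenizer.decode(ort_model.generate(**inputs, max_new_tokens=8)[0]))
```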

48 changes: 48 additions & 0 deletions generative-ai/meetings/meeting_23_Jan_28_2026.md
@@ -0,0 +1,48 @@
# Recording and Transcript:

https://zoom.us/rec/share/Tej9nFPPA46beheUTHUftH5J_bGKSOAYA7Vp6cGmdmDRzVazUiOx7ZB8aOICvc-d.YQDpi8IhNHsIukAA

# Meeting Minutes:

**Ternary Storage Format Proposal**
- Soumendu Ghosh presented a proposal ([paper](https://drive.google.com/file/d/1oeHVpCygJ9XlWKChB5e1tCWVFil9j9gY/view?usp=drive_link), [presentation](https://drive.google.com/file/d/1ofjAkMWh0XEXm2WauN1CLSwNF49kpMwb/view?usp=drive_link)) for a standardized storage format for ternary weights in LLMs, specifically targeting models like BitNet b1.58.
- Key Technical Points:
- The Concept: Ternary weights take one of three values (-1, 0, 1). While each weight theoretically requires only log2(3) ≈ 1.58 bits, ternary weights are typically stored in 2 bits.
- Proposed Encoding: Packing 5 ternary values into 1 byte (8 bits).
- 3^5 = 243 states, which fits within the 256 states available in 8 bits.
- This provides a 20% memory reduction compared to standard 2-bit packing (8 bits vs. 10 bits per group of 5 weights).
- Benefits: This is a deterministic, lossless compression scheme that reduces memory footprint and energy consumption by minimizing off-chip memory transactions.
- Hardware Support:
- Native: Hardware with decompression engines can unpack data in the data path.
- Fallback: Systems without native support can use a TernaryDecode operator at compile time or at runtime.
- Discussion & Opens:
- Block Size: Discussion around the optimal block size for alignment. A block size of 320 or 640 weights aligns exactly with 64-byte cache lines (320 weights pack into 320/5 = 64 bytes) without padding wastage.
- Lookup Table: The proposal includes a standardized 256-entry lookup table for encoding/decoding across all models and layers (see the sketch after this list).
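
A minimal sketch of the encoding described above, assuming a little-endian base-3 digit order and the mapping {-1, 0, 1} → {0, 1, 2}; the draft proposal fixes the actual conventions and the standardized table:

```python
import numpy as np

def pack_ternary(weights: np.ndarray) -> np.ndarray:
    """Pack ternary weights (-1, 0, 1), 5 per byte, via base-3 encoding."""
    assert weights.size % 5 == 0, "pad to a multiple of 5 first"
    digits = (weights.astype(np.int64) + 1).reshape(-1, 5)  # {-1,0,1} -> {0,1,2}
    powers = 3 ** np.arange(5, dtype=np.int64)              # [1, 3, 9, 27, 81]
    codes = digits @ powers                                 # max value 242 < 256
    return codes.astype(np.uint8)

# 256-entry lookup table: byte code -> 5 decoded weights.
# Codes 243..255 never occur; zero-filled here (the draft may reserve them).
_LUT = np.zeros((256, 5), dtype=np.int8)
for code in range(243):
    c = code
    for pos in range(5):
        _LUT[code, pos] = c % 3 - 1
        c //= 3

def unpack_ternary(packed: np.ndarray) -> np.ndarray:
    """Decode via the lookup table (the TernaryDecode fallback in software)."""
    return _LUT[packed].reshape(-1)

w = np.array([-1, 0, 1, 1, -1, 0, 0, 1, -1, 1], dtype=np.int8)
assert np.array_equal(unpack_ternary(pack_ternary(w)), w)
```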

**Flex Attention PR**
- The group discussed Rama’s feedback on the [Flex Attention Pull Request](https://github.com/onnx/onnx/pull/7534).
- Current State: The PR follows PyTorch's element-wise transformation approach.
- Feedback: Ganesan suggested moving to a tensor-to-tensor computation representation.
- Reasoning: Element-by-element loops are inefficient in interpreter-based systems. Tensor-oriented ops (Softmax, MatMul) leverage existing optimized backend kernels.
- Outcome: This change is expected to significantly simplify the PR by removing complex loop-building logic (a sketch of the tensor-form computation follows this list).
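
To illustrate the direction of this feedback, a minimal sketch of a tensor-to-tensor formulation, with a hypothetical `score_mod` hook standing in for flex-attention's score modification (not the PR's actual code):

```python
import numpy as np

def flex_attention_tensor_form(q, k, v, score_mod=None):
    """Attention expressed as whole-tensor ops (MatMul, Softmax) rather than
    element-by-element loops, so backends can dispatch to optimized kernels."""
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])  # MatMul
    if score_mod is not None:
        scores = score_mod(scores)      # score modification applied tensor-wide
    scores -= scores.max(axis=-1, keepdims=True)            # stable Softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                                      # MatMul

# Example: a causal mask expressed as a single tensor op, not a loop.
mask = np.triu(np.full((8, 8), -np.inf), k=1)
q = k = v = np.random.randn(8, 16)
out = flex_attention_tensor_form(q, k, v, score_mod=lambda s: s + mask)
```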

**Optimum-ONNX & Exporter Path**
- Yamini and Xavier discussed the path forward for exporting LLMs.
- Current Issues:
- Optimum-ONNX currently defaults to dynamo=False (see the export-path sketch after this section).
- Enabling Dynamo requires significant patching for different models.
- The Transformers repo has been slow to merge the necessary patches (some have been pending for six months).
- Proposed Solution:
- Xavier has a set of patches for various Transformer versions that handle cache classes and control flow.
- Yamini expressed a preference for supporting the Optimum workflow, as it is used by many developers.
- Next Steps:
- Xavier will prepare a presentation for the next meeting on Transformer patches and options.
- Xavier shared the following links:
- [Link](https://github.com/sdpython/onnx-diagnostic/tree/main/onnx_diagnostic/torch_export_patches) to the code of the patches made so far for the ONNX export
- [PR to transformers](https://github.com/huggingface/transformers/pull/41992) to check if models are exportable
- [PR to optimum-onnx](https://github.com/huggingface/optimum-onnx/pull/113), which adds a `--dynamo` parameter to trigger the dynamo exporter; it may not work for all models and fails with transformers 5.0.
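
For context, a minimal sketch of the two export paths under discussion, using PyTorch's documented `torch.onnx.export` entry point (the model and dummy inputs are illustrative; real LLM exports also need cache inputs and dynamic shapes, which is where the patches above come in):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # illustrative model
dummy_input_ids = torch.randint(0, model.config.vocab_size, (1, 8))

# Legacy TorchScript-tracing path (what optimum-onnx defaults to today):
torch.onnx.export(model, (dummy_input_ids,), "model_legacy.onnx", dynamo=False)

# torch.export/Dynamo path -- the one that needs the cache-class and
# control-flow patches discussed above for many model architectures:
torch.onnx.export(model, (dummy_input_ids,), "model_dynamo.onnx", dynamo=True)
```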

**Action Items**
- All: Review ternary storage proposal ([paper](https://drive.google.com/file/d/1oeHVpCygJ9XlWKChB5e1tCWVFil9j9gY/view?usp=drive_link), [presentation](https://drive.google.com/file/d/1ofjAkMWh0XEXm2WauN1CLSwNF49kpMwb/view?usp=drive_link)) from Soumendu and give feedback/inputs
- Xavier: Prepare presentation on Transformer/optimum patches and options to review in the next meeting
- PR contributor (mshr-h): Update [Flex Attention Pull Request](https://github.com/onnx/onnx/pull/7534) to tensor-to-tensor logic
30 changes: 30 additions & 0 deletions probabilistic-programming/README.md
@@ -0,0 +1,30 @@
<!--- SPDX-License-Identifier: Apache-2.0 -->

# ONNX Probabilistic Programming (WG)

This repository is where the ONNX Probabilistic Programming WG will capture its artifacts and deliverables.


## Working Group Status
**ACTIVE**

# Slack channel

* (to be defined)

# WG Lead(s)

* Brian Parbhu, Adam Pocock (Oracle) (February 11, 2026 - Current)

# Logistics

* WG leads will drive the meeting.


# WG Meeting Info

* Meeting (to be defined).
* TEAMS Meeting link: (to be defined)
* Meeting ID: (to be defined)

# Meeting notes

The meeting notes can be found [here](https://github.com/onnx/working-groups/tree/main/probabilistic-programming/meetings).