[LinalgExt] Implement direct vectorization for im2col op #23855

Merged
Max191 merged 2 commits into main from users/Max191/im2col-direct-vectorization on Apr 8, 2026

Conversation

Contributor

@Max191 Max191 commented Mar 19, 2026

Implements direct vectorization for im2col, and completes the support for padding on the im2col op. The padding attributes are used to compute the read mask when vectorizing. In the old path, we would have separate padding on the input and the result of the im2col, and we would try to compose those pads into a single masked read. This is fragile and difficult for cases where the im2col result dims don't map well to the input dims. With this direct vectorization approach, we can compute the mask based on the input and result padding simultaneously. This will make flattening of the spatial dimensions of convolutions possible.
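As a rough sketch of the combined-mask idea (a hypothetical NumPy model of a 1D im2col read; the function and parameter names are illustrative, not from the IREE code), both the input padding and the result padding can be folded into a single per-lane validity check:

```python
import numpy as np

# Illustrative sketch (not the IREE implementation): fold the input
# padding and the result padding into one per-lane read mask for a
# vectorized 1D im2col read.
def im2col_read_mask(out_pos, k_offsets, stride, pad_low, in_size, out_size):
    in_idx = out_pos * stride + k_offsets - pad_low      # result index -> input index
    input_ok = (in_idx >= 0) & (in_idx < in_size)        # input padding bounds
    output_ok = (out_pos >= 0) and (out_pos < out_size)  # result padding bounds
    return input_ok & output_ok

# Window offsets 0..3 at output position 0, stride 1, low padding 1:
# lane 0 maps to input index -1 (inside the low padding), so it is masked.
mask = im2col_read_mask(0, np.arange(4), 1, 1, 8, 8)
print(mask.tolist())  # [False, True, True, True]
```

Composing two separate pads after the fact would have to reconstruct this per-lane information; computing it once from both paddings avoids that.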

Performance results:

These runs were taken on different commits, but they are functionally the same (just some cleanup differences). I am only able to reproduce 3 of the regressions locally, and most of the improvements (~35 of them, with 10-40% speedups) are real. There seems to have been some noise in the runs, but overall there is a good perf improvement.

ci-extra: test_torch

@Max191 Max191 requested a review from yzhang93 March 19, 2026 16:14
@Max191 Max191 force-pushed the users/Max191/im2col-direct-vectorization branch from 155af4b to 35a5c1e on March 19, 2026 17:58
Contributor Author

Max191 commented Mar 19, 2026

This PR will depend on #23859 landing in order to avoid performance regressions, but the implementation in this PR will stay the same.

I will share benchmark results once I have them. My first run failed for some reason.

EDIT: That PR didn't merge, but the issue was fixed in #23947

Comment thread compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.cpp
Comment thread compiler/src/iree/compiler/Dialect/LinalgExt/IR/LinalgExtOps.cpp Outdated
Comment thread compiler/src/iree/compiler/Dialect/LinalgExt/IR/LinalgExtOps.cpp
@Max191 Max191 force-pushed the users/Max191/im2col-direct-vectorization branch from 35a5c1e to 12e6f6c on March 27, 2026 16:15
@Max191 Max191 marked this pull request as ready for review March 27, 2026 16:17
@Max191 Max191 requested a review from yzhang93 March 27, 2026 16:17
Max191 added a commit that referenced this pull request Mar 27, 2026
…uctionsOptimization (#23947)

The pass was bailing out on vector.transfer_read ops with non-identity
permutation maps (e.g., 1D reads from a 4D memref). After
#23855, we will frequently see 1D
reads, which need to be supported here. Ideally, we will do something
like what is done in #23859, but
that approach is causing performance regressions that are difficult to
deal with. For now, this provides a solution for the new mask types we
will be seeing.

Signed-off-by: Max Dawkins <max.dawkins@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Max191 Max191 force-pushed the users/Max191/im2col-direct-vectorization branch from 1ab1043 to 53f0f34 on March 27, 2026 16:18
Contributor

@hanhanW hanhanW left a comment


This is a very large PR. I wonder if we should break it into at least two PRs:

  • padding semantics
  • vectorization

Contributor


It is not clear how the vector sizes are derived for this naming. I categorize them into a few groups, and I wonder if we should move the tests into one of them?


Contributor Author


I'm not sure I follow exactly what you mean. Is your point that the op-specific tests (like map_store, transfer_gather, etc.) should be moved into masked_configured/masked_inferred/unmasked?

Contributor


Each file tests a different configuration. E.g., the _unmasked tests are mainly for cases where input_vector_size is empty, masked_configured is mainly for cases where input_vector_size is derived from the configuration, etc. This one belongs in _unmasked IMO.

Contributor Author


I see, thanks for the context. I'll move it. I think map_store should also be moved there, which I can do in another PR.

Contributor Author


I actually opted to implement the vector size analysis for im2col, and put the tests in the _inferred file. I want to later use divisibility analysis to determine vector sizes for im2col, and we have access to a dataflow solver in the analysis, so this is a better place for it.

Contributor Author

Max191 commented Mar 27, 2026

This is a very large PR. I wonder if we should break it into at least two PRs:

  • padding semantics
  • vectorization

This is already split up into as small of a chunk as seems reasonable to me. This is part of a larger refactor to the IGEMM path in order to enable additional flattening, and each piece needs to land without performance regressions. Going through the process of splitting out parts of the implementation, re-reviewing the split up changes, and testing/debugging on our full suite of convolutions is time consuming, and I've already spent a great deal of time on that to get to the split that I have so far. The issue is that the vectorization requires the padding semantics to be meaningful, and adding padding semantics without the vectorization and making that work with purely the decomposition without causing any performance regressions will take more time to do. One option would be to just add the fields to the im2col op but not use them, although I think that doesn't provide much benefit in terms of review (the padding PR would be very small, and it provides useful context when looking at it together with this PR anyway), and will more likely just slow down the process.

I agree that we should try to keep well-scoped PRs to help make review easier, but sometimes splitting up implementation can just slow down the process, which I believe is the case here. I feel that I've already spent too much time spinning on debugging/fixing random performance regressions that happen at intermediate stages of the split but not in the final state, and I don't want to keep wasting time on things like that.

If folks really need this PR to be split up further, then I can do it, but IMO it isn't really worth it.

@Max191 Max191 requested a review from hanhanW March 27, 2026 17:57
Contributor

@hanhanW hanhanW left a comment


This is a very large PR. I wonder if we should break it into at least two PRs:

  • padding semantics
  • vectorization

This is already split up into as small of a chunk as seems reasonable to me. This is part of a larger refactor to the IGEMM path in order to enable additional flattening, and each piece needs to land without performance regressions. Going through the process of splitting out parts of the implementation, re-reviewing the split up changes, and testing/debugging on our full suite of convolutions is time consuming, and I've already spent a great deal of time on that to get to the split that I have so far. The issue is that the vectorization requires the padding semantics to be meaningful, and adding padding semantics without the vectorization and making that work with purely the decomposition without causing any performance regressions will take more time to do. One option would be to just add the fields to the im2col op but not use them, although I think that doesn't provide much benefit in terms of review (the padding PR would be very small, and it provides useful context when looking at it together with this PR anyway), and will more likely just slow down the process.

I agree that we should try to keep well-scoped PRs to help make review easier, but sometimes splitting up implementation can just slow down the process, which I believe is the case here. I feel that I've already spent too much time spinning on debugging/fixing random performance regressions that happen at intermediate stages of the split but not in the final state, and I don't want to keep wasting time on things like that.

If folks really need this PR to be split up further, then I can do it, but IMO it isn't really worth it.

My "skimming-through read" is that there are changes in both tiling and vectorization. The vectorization is new, and the tiling implementation already exists today. So my intuition is that you can add the padding semantics with the tiling changes, and then we can add vectorization support later on. I thought you could always isolate the vectorization changes from this PR, and the rest could go in a single PR. Am I missing something?

The issue is that the vectorization requires the padding semantics to be meaningful, and adding padding semantics without the vectorization and making that work with purely the decomposition without causing any performance regressions will take more time to do.

I don't follow the performance regression issue. If you add padding semantics to the op, but nothing changes the input and the folding does not exist yet, then adding the semantics (with the existing interface implementation) does not hurt performance, right?

Comment thread compiler/src/iree/compiler/Dialect/LinalgExt/IR/LinalgExtOps.td Outdated
Contributor Author

Max191 commented Mar 27, 2026

My "skimming-through read" is that there are changes in both tiling and vectorization. The vectorization is new, and the tiling implementation already exists today. So my intuition is that you can add the padding semantics with the tiling changes, and then we can add vectorization support later on. I thought you could always isolate the vectorization changes from this PR, and the rest could go in a single PR. Am I missing something?

Let me try to split it up. I think that should be fine, since it will effectively be a non-functional change. My concern is about adding the im2col(pad) canonicalizers, which would require me to actually support decomposition with padding in the intermediate state, which would probably just cause performance regressions. I can leave decomposition with padding unimplemented in the intermediate state, though.

On a more meta-side-note: I think my view on PR reviews has changed somewhat with AI assisted coding and review, so that's probably contributing to my feeling here. I think having the changes together makes it easier to navigate the full picture, and AI helps us filter through large changes more easily. It makes it easier to review large PRs, and we typically do end up implementing large, unsplit prototypes before splitting them up into reviewable pieces. I think my opinion is becoming more and more that the splitting up part of the process is less important because having all the context in one place/PR can actually make it easier to use agents to understand and review large changes. But I feel that we are in a transition phase now, and it takes time to adapt to new workflows, so I still understand the need to split things up when the reviewers request it. I will try splitting this one up further.

Contributor

hanhanW commented Mar 27, 2026

On a more meta-side-note: I think my view on PR reviews has changed somewhat with AI assisted coding and review, so that's probably contributing to my feeling here. I think having the changes together makes it easier to navigate the full picture, and AI helps us filter through large changes more easily. It makes it easier to review large PRs, and we typically do end up implementing large, unsplit prototypes before splitting them up into reviewable pieces. I think my opinion is becoming more and more that the splitting up part of the process is less important because having all the context in one place/PR can actually make it easier to use agents to understand and review large changes. But I feel that we are in a transition phase now, and it takes time to adapt to new workflows, so I still understand the need to split things up when the reviewers request it. I will try splitting this one up further.

I agree that it is good to have a single PR for the full picture, as long as it is split into several commits. If we are going to do that, we may want to flip the merge guidance from "squash and merge" to "merge a chain of commits". You still want smaller pieces (e.g., several commits) when you want people to review the code.

The other benefit of having several PRs is that you can easily revert a problematic one if it breaks something. You could argue that the same issues would be caught in a single big PR, but that's not guaranteed, because some failures only happen post-submit or in downstream projects. It is much easier to manage the state when the changes land as several separate pieces.

AI is a good tool, and it may change our mental model. I'm actually happy that we are growing with these tools.


I was going to offer to review a single PR, or part of it, but I was in a meeting. I won't insist here, as it does not bother me much, but I'd like to share the pros of breaking up PRs/commits above.

@Max191 Max191 force-pushed the users/Max191/im2col-direct-vectorization branch from 53f0f34 to b57354f on March 27, 2026 19:33
Contributor Author

Max191 commented Mar 27, 2026

Padding + tiling implementation PR is here: #23950

@Max191 Max191 requested a review from hanhanW March 27, 2026 20:18
@Max191 Max191 force-pushed the users/Max191/im2col-direct-vectorization branch from b57354f to 2ec6ebc on March 27, 2026 20:25
@Max191 Max191 changed the title [LinalgExt] Add padding semantics and direct vectorization for im2col op [LinalgExt] Implement direct vectorization for im2col op Mar 27, 2026
Max191 added a commit that referenced this pull request Apr 1, 2026
Add optional padding fields to the im2col op:
- input_pad_low/high: per-input-dimension padding amounts
- output_pad_low/high: per-output-dimension padding amounts
- pad_value: the value for out-of-bounds positions

This includes:
- Op definition updates
- Padding accessors, setters, and verifier checks
- Tiling interface implementation
- Decomposition bail-out for padded im2col ops (full padding support in
#23855)
- Roundtrip, verifier, and tiling tests

This change is effectively non-functional, since we do not yet make use
of the padding attributes. The follow-up PR will introduce
canonicalizers that fold padding into im2col, which will enable the
padding path.

---------

Signed-off-by: Max Dawkins <max.dawkins@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
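A reference-semantics sketch of those padding fields (a hypothetical NumPy model for a 1D window; the attribute names follow the commit message, but the real op is an MLIR operation): folding an explicit pad into the op should leave the result unchanged.

```python
import numpy as np

def im2col_1d(x, window, stride):
    # Plain im2col: gather sliding windows into rows.
    out_len = (len(x) - window) // stride + 1
    return np.stack([x[i * stride : i * stride + window] for i in range(out_len)])

def padded_im2col_1d(x, window, stride, pad_low, pad_high, pad_value):
    # Model of im2col with input_pad_low/high + pad_value attributes:
    # the pad is never materialized; out-of-bounds reads yield pad_value.
    padded_len = len(x) + pad_low + pad_high
    out_len = (padded_len - window) // stride + 1
    out = np.full((out_len, window), pad_value, dtype=x.dtype)
    for i in range(out_len):
        for k in range(window):
            src = i * stride + k - pad_low   # index into the unpadded input
            if 0 <= src < len(x):
                out[i, k] = x[src]           # in bounds: read the input
    return out                               # out of bounds: keep pad_value

x = np.arange(1.0, 6.0)                      # [1. 2. 3. 4. 5.]
explicit = im2col_1d(np.pad(x, 1), window=3, stride=1)
fused = padded_im2col_1d(x, window=3, stride=1, pad_low=1, pad_high=1, pad_value=0.0)
assert np.array_equal(explicit, fused)
```

This equivalence is what the planned canonicalizers rely on: a tensor.pad feeding an im2col can be absorbed into the op's padding attributes without changing the result.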
@Max191 Max191 force-pushed the users/Max191/im2col-direct-vectorization branch 3 times, most recently from 2784ac7 to 147a995 on April 7, 2026 17:51
@hanhanW hanhanW dismissed their stale review April 7, 2026 18:01

leaving the review to others

Contributor

@yzhang93 yzhang93 left a comment


Thanks for the changes. LGTM.

@Max191 Max191 force-pushed the users/Max191/im2col-direct-vectorization branch 2 times, most recently from 638ad7c to be08704 on April 7, 2026 19:47
Add direct vectorization for im2col ops via VectorizableOpInterface,
and complete the padding support introduced in the previous commit.

This includes:
- Im2colUtils (computeIm2colSourceIndices, computeIm2colValidSize,
  chooseDimToVectorize) shared between vectorization and decomposition
- Im2colOpVectorizationModel implementing VectorizableOpInterface
- Full decomposition refactor: 1D slices, tensor.pad for padded path,
  read offset clamping to [0, dimSize-1]
- Canonicalization patterns: FoldInputPadIntoIm2col, FoldOutputPadIntoIm2col
- Pass reordering: DecomposeIm2col after GenericVectorization (vectorize
  first, decompose as fallback)
- subOfrs utility
- Tests for vectorization, decomposition, and canonicalization

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Max Dawkins <max.dawkins@gmail.com>
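The read-offset clamping described above can be sketched like this (illustrative NumPy, not the generated vector IR; `clamped_masked_read` is a hypothetical helper): clamping keeps every load in bounds, and the mask restores pad_value for the lanes that were actually out of range.

```python
import numpy as np

# Sketch of "read offset clamping to [0, dimSize-1]": clamp every source
# offset so the load itself is always legal, then use the validity mask
# to replace the lanes that were out of bounds with pad_value.
def clamped_masked_read(x, offsets, pad_value):
    clamped = np.clip(offsets, 0, len(x) - 1)    # always-legal load offsets
    mask = (offsets >= 0) & (offsets < len(x))   # which lanes were truly valid
    return np.where(mask, x[clamped], pad_value)

x = np.array([10.0, 20.0, 30.0])
result = clamped_masked_read(x, np.array([-1, 0, 1, 3]), 0.0)
print(result.tolist())  # [0.0, 10.0, 20.0, 0.0]
```

The clamped lanes read a duplicate in-bounds element, but the select discards those values, so the pad_value semantics are preserved without any out-of-bounds access.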
@Max191 Max191 force-pushed the users/Max191/im2col-direct-vectorization branch from be08704 to ed575bf on April 8, 2026 14:07
Plumb im2col tile size computation through the TileSizeLattice dataflow
analysis in MaterializeVectorTileSizes, following the same pattern as
linalg ops. Im2col's output dimensions are the iteration domain (identity
map), so tile sizes are stored directly on the result lattice.

Im2colOpVectorizationModel::vectorize now reads the vectorized dimension
from the driver-provided vectorSizes instead of recomputing via
chooseDimToVectorize inline. This decouples the vector size decision from
the vectorization implementation, enabling future analyses (such as
divisibility analysis) to influence the tile size choice without
modifying the vectorization code.

GenericVectorization's Tier 3 IR inference path is extended with an
im2col case that calls computeIm2colVectorTileSizes as a fallback for
pipelines that do not run MaterializeVectorTileSizes.

Test changes:
  - Move all im2col vectorization tests from the standalone
    generic_vectorization_im2col.mlir into
    generic_vectorization_masked_inferred.mlir, which exercises the same
    code paths as part of the broader masked-inference pipeline. Delete
    the now-empty file and update BUILD.bazel / CMakeLists.txt.
  - Add three focused tests in materialize_vector_tile_sizes.mlir
    covering the basic im2col vector tile size analysis cases: standard
    NHWC vectorization along K (width 4), non-contiguous K via
    input_k_perm with no attribute stamped, and channel vectorization
    with width 8 and input padding.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Max Dawkins <max.dawkins@gmail.com>
@Max191 Max191 force-pushed the users/Max191/im2col-direct-vectorization branch from ed575bf to 578818c on April 8, 2026 14:46
Contributor Author

Max191 commented Apr 8, 2026

VAE failures are pre-existing ever since the test was added to torch tests: https://github.com/iree-org/iree/actions/runs/23915109640/job/69748051690

@Max191 Max191 merged commit 94170e7 into main Apr 8, 2026
63 of 66 checks passed
@Max191 Max191 deleted the users/Max191/im2col-direct-vectorization branch April 8, 2026 18:27
