[LinalgExt] Implement direct vectorization for im2col op#23855
Conversation
…uctionsOptimization (#23947)

The pass was bailing out on vector.transfer_read ops with non-identity permutation maps (e.g., 1D reads from a 4D memref). After #23855, we will frequently see 1D reads, which need to be supported here. Ideally, we will do something like what is done in #23859, but that approach is causing performance regressions that are difficult to deal with. For now, this provides a solution for the new mask types we will be seeing.

Signed-off-by: Max Dawkins <max.dawkins@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
hanhanW
left a comment
This is a very large PR. I wonder if we should break it into at least two PRs:
- padding semantics
- vectorization
I'm not sure I follow exactly what you mean. Is your point that the op-specific tests (like map_store transfer_gather, etc.) should be moved into masked_configured/masked_inferred/unmasked?
Each file tests a different configuration. E.g., the _unmasked tests are mainly for cases where input_vector_size is empty; masked_configured is mainly for cases that have an input_vector_size derived from configurations, etc. This one belongs in _unmasked IMO.
I see, thanks for the context. I'll move it. I think map_store should also be moved there, which I can do in another PR.
I actually opted to implement the vector size analysis for im2col, and put the tests in the _inferred file. I want to later use divisibility analysis to determine vector sizes for im2col, and we have access to a dataflow solver in the analysis, so this is a better place for it.
This is already split up into as small of a chunk as seems reasonable to me. This is part of a larger refactor to the IGEMM path in order to enable additional flattening, and each piece needs to land without performance regressions. Going through the process of splitting out parts of the implementation, re-reviewing the split up changes, and testing/debugging on our full suite of convolutions is time consuming, and I've already spent a great deal of time on that to get to the split that I have so far. The issue is that the vectorization requires the padding semantics to be meaningful, and adding padding semantics without the vectorization and making that work with purely the decomposition without causing any performance regressions will take more time to do. One option would be to just add the fields to the im2col op but not use them, although I think that doesn't provide much benefit in terms of review (the padding PR would be very small, and it provides useful context when looking at it together with this PR anyway), and will more likely just slow down the process.

I agree that we should try to keep well-scoped PRs to help make review easier, but sometimes splitting up implementation can just slow down the process, which I believe is the case here. I feel that I've already spent too much time spinning on debugging/fixing random performance regressions that happen at intermediate stages of the split but not in the final state, and I don't want to keep wasting time on things like that.

If folks really need this PR to be split up further, then I can do it, but IMO it isn't really worth it.
hanhanW
left a comment
> This is a very large PR. I wonder if we should break it into at least two PRs:
> - padding semantics
> - vectorization
From skimming through, my read is that there are changes in both tiling and vectorization. The vectorization is new, and the tiling implementation already exists today. So my intuition is that you can add the padding semantics with the tiling changes, and then we can add vectorization support later on. I thought that you could always isolate the vectorization changes from this PR, and the rest could go in a single PR. Am I missing something?
> The issue is that the vectorization requires the padding semantics to be meaningful, and adding padding semantics without the vectorization and making that work with purely the decomposition without causing any performance regressions will take more time to do.
I don't follow the performance regression issue. If you add padding semantics to the op, but it does not change the inputs and the folding does not exist yet, then adding the semantics (with the existing interface implementation) does not hurt performance, right?
Let me try to split it up. I think that should be fine, since it will effectively be a non-functional change. My concern is about adding the im2col(pad) canonicalizers, which would require me to actually support decomposition with padding in the intermediate state, which would probably just cause performance regressions. I can leave decomposition with padding unimplemented in the intermediate state, though.

On a more meta side note: I think my view on PR reviews has changed somewhat with AI-assisted coding and review, so that's probably contributing to my feeling here. I think having the changes together makes it easier to navigate the full picture, and AI helps us filter through large changes more easily. It makes it easier to review large PRs, and we typically do end up implementing large, unsplit prototypes before splitting them up into reviewable pieces. My opinion is becoming more and more that the splitting-up part of the process is less important, because having all the context in one place/PR can actually make it easier to use agents to understand and review large changes. But I feel that we are in a transition phase now, and it takes time to adapt to new workflows, so I still understand the need to split things up when the reviewers request it. I will try splitting this one up further.
I agree that it is good to have a single PR for the full picture, as long as it is split into several commits. If we are going to do that, we may want to flip the merge guidance from "squash and merge" to "merge a chain of commits". You still want smaller pieces (e.g., several commits) when you want people to review the code.

The other benefit of having several PRs is that you can easily revert a problematic one if it breaks something. You could argue that the same issues would be caught in a single big PR, but that's not 100% certain, because some failures only happen on post-submit or in downstream projects. It is much easier to manage the states if they are several commits.

AI is a good tool, and it may change our mental model. I'm actually happy that we are growing with these tools. I was going to offer to review a single PR or part of it, but I was in a meeting. I won't insist here, as it does not bother me much, but I'd like to share the pros of breaking up PRs/commits above.
Padding + tiling implementation PR is here: #23950
Add optional padding fields to the im2col op:
- input_pad_low/high: per-input-dimension padding amounts
- output_pad_low/high: per-output-dimension padding amounts
- pad_value: the value for out-of-bounds positions

This includes:
- Op definition updates
- Padding accessors, setters, and verifier checks
- Tiling interface implementation
- Decomposition bail-out for padded im2col ops (full padding support in #23855)
- Roundtrip, verifier, and tiling tests

This change is effectively non-functional, since we do not yet make use of the padding attributes. The follow-up PR will introduce canonicalizers that fold padding into im2col, which will enable the padding path.

---------

Signed-off-by: Max Dawkins <max.dawkins@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
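As a rough conceptual model of the per-dimension low/high padding semantics described above (illustrative Python only; the names `read_padded`, `pad_low`, and `pad_value` are hypothetical and not the actual im2col op API): an index into the conceptually padded tensor is shifted back into unpadded coordinates, and any position that falls outside the original extent yields the pad value.

```python
def read_padded(data, index, pad_low, pad_value):
    """Read one element of a conceptually padded 1-D tensor.

    Hypothetical sketch: `index` addresses the padded view; subtracting
    the low padding recovers the unpadded coordinate, and anything
    outside [0, len(data)) is in the padding region.
    """
    src = index - pad_low          # shift into unpadded coordinates
    if 0 <= src < len(data):
        return data[src]           # in bounds: read the real value
    return pad_value               # out of bounds: padding region


# With pad_low=1 and pad_high=1, a 3-element input reads as a
# 5-element padded view: [0, 10, 20, 30, 0].
padded = [read_padded([10, 20, 30], i, pad_low=1, pad_value=0)
          for i in range(5)]
```

The same shift-and-bounds-check idea extends per dimension for multi-dimensional inputs and outputs.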
yzhang93
left a comment
Thanks for the changes. LGTM.
Add direct vectorization for im2col ops via VectorizableOpInterface, and complete the padding support introduced in the previous commit.

This includes:
- Im2colUtils (computeIm2colSourceIndices, computeIm2colValidSize, chooseDimToVectorize) shared between vectorization and decomposition
- Im2colOpVectorizationModel implementing VectorizableOpInterface
- Full decomposition refactor: 1D slices, tensor.pad for the padded path, read offset clamping to [0, dimSize-1]
- Canonicalization patterns: FoldInputPadIntoIm2col, FoldOutputPadIntoIm2col
- Pass reordering: DecomposeIm2col after GenericVectorization (vectorize first, decompose as fallback)
- subOfrs utility
- Tests for vectorization, decomposition, and canonicalization

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Max Dawkins <max.dawkins@gmail.com>
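The read offset clamping mentioned in the commit message can be sketched as follows (a minimal illustration, not the actual C++ implementation; the function name is hypothetical): out-of-bounds offsets are pulled into [0, dimSize-1] so that masked-off lanes still compute a legal address.

```python
def clamp_offset(offset, dim_size):
    """Clamp a read offset into [0, dim_size - 1].

    Lanes whose reads are masked off still need an in-bounds address,
    so their offsets are clamped rather than left out of range.
    """
    return max(0, min(offset, dim_size - 1))
```

For example, with a dimension of size 8, an offset of -2 clamps to 0 and an offset of 11 clamps to 7, while in-range offsets pass through unchanged.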
Plumb im2col tile size computation through the TileSizeLattice dataflow
analysis in MaterializeVectorTileSizes, following the same pattern as
linalg ops. Im2col's output dimensions are the iteration domain (identity
map), so tile sizes are stored directly on the result lattice.
Im2colOpVectorizationModel::vectorize now reads the vectorized dimension
from the driver-provided vectorSizes instead of recomputing via
chooseDimToVectorize inline. This decouples the vector size decision from
the vectorization implementation, enabling future analyses (such as
divisibility analysis) to influence the tile size choice without
modifying the vectorization code.
GenericVectorization's Tier 3 IR inference path is extended with an
im2col case that calls computeIm2colVectorTileSizes as a fallback for
pipelines that do not run MaterializeVectorTileSizes.
Test changes:
- Move all im2col vectorization tests from the standalone
generic_vectorization_im2col.mlir into
generic_vectorization_masked_inferred.mlir, which exercises the same
code paths as part of the broader masked-inference pipeline. Delete
the now-empty file and update BUILD.bazel / CMakeLists.txt.
- Add three focused tests in materialize_vector_tile_sizes.mlir
covering the basic im2col vector tile size analysis cases: standard
NHWC vectorization along K (width 4), non-contiguous K via
input_k_perm with no attribute stamped, and channel vectorization
with width 8 and input padding.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Max Dawkins <max.dawkins@gmail.com>
The VAE failures are pre-existing; they have occurred ever since the test was added to the torch tests: https://github.com/iree-org/iree/actions/runs/23915109640/job/69748051690

Implements direct vectorization for im2col, and completes the support for padding on the im2col op. The padding attributes are used to compute the read mask when vectorizing. In the old path, we would have separate padding on the input and on the result of the im2col, and we would try to compose those pads into a single masked read. This was fragile and difficult to handle for cases where the im2col result dims don't map well to the input dims. With this direct vectorization approach, we can compute the mask based on the input and result padding simultaneously. This will make flattening of the spatial dimensions of convolutions possible.
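A simplified sketch of the idea (illustrative Python; the function and parameter names are hypothetical, not the actual implementation): rather than composing an input-side mask with a separate output-side mask, each vector lane's validity is decided in one pass by checking its output position against the valid output extent and its input position against the input padding bounds.

```python
def im2col_read_mask(lane_positions, input_extent, input_pad_low,
                     out_valid_size):
    """Build a per-lane read mask for a vectorized im2col read.

    A lane is valid only if (a) its output position lies within the
    unpadded portion of the im2col result and (b) the corresponding
    input position, after undoing the low padding, lands inside the
    real input extent. Both checks are folded into a single mask.
    """
    mask = []
    for out_pos, in_pos in lane_positions:
        in_src = in_pos - input_pad_low
        valid = (out_pos < out_valid_size) and (0 <= in_src < input_extent)
        mask.append(valid)
    return mask


# Example: 4 lanes, input extent 3 with one element of low padding,
# and only the first 3 output positions valid. Lane 0 reads from the
# input padding region; lane 3 falls in the output padding region.
lanes = [(0, 0), (1, 1), (2, 2), (3, 3)]
mask = im2col_read_mask(lanes, input_extent=3, input_pad_low=1,
                        out_valid_size=3)
```

The point of the sketch is that both padding sources feed one mask computation, which is what lets the result dims and input dims be masked coherently even when they don't map one-to-one.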
Performance results:
These runs were taken on different commits, but the commits are functionally the same (just some cleanup differences). I am only able to reproduce 3 of the regressions locally, and most of the improvements (~35 of them, with 10-40% speedup) are real. There seems to have been some noise in the runs, but overall there is a good perf improvement.
ci-extra: test_torch