[Feature] Tensor::view overload that reduces rank by squeezing dropped axes #785

## Summary

Add a `Tensor::view` overload that produces a lower-rank view by squeezing
out size-1 axes (the result of a rank-reducing slice). The current
`view(shapes[], offsets[])` inherits `ndims` from the parent, so it cannot
express "slice + drop axis" in a single step, and `view(...).reshape(...)`
does not compose for sliced views.

## Motivation / Use Case

PyPTO RFC #1338 / PR #1343 added a `drop_dims` operand to `tensor.slice` so
that numpy-style indexing (`C[i]`, `C[i, j]`, `C[i, j, :, :]`, ...) can
produce a lower-rank `Tensor`. The orchestration codegen
(`src/codegen/tensor_op_codegen.cpp`, `REGISTER_ORCHESTRATION_OP(tensor_slice)`)
needs to emit a runtime `Tensor` at the reduced rank so that downstream
kernel-call bindings see the correct ndims. Without this, a kernel that
takes a sub-tensor via numpy-style indexing generates wrong-rank code. See
PyPTO follow-up issue: https://github.com/hw-native-sys/pypto/issues/1349.

The composition `view(...).reshape(...)` does not work because:

1. `view(view_shapes[], view_offsets[])` (`tensor.h:331`) inherits
   `ndims = other.ndims` (`init_with_view` at `tensor.h:233`). The view
   stays at the parent's rank with size-1 entries in dropped positions.
2. `reshape(new_shapes[], new_ndims)` (`tensor.h:358`) hard-asserts
   `is_contiguous()` (`tensor.h:360`). A rank-reducing slice produces
   `shapes[i]=1, raw_shapes[i]=B` for any non-leading dropped axis, which
   fails `is_contiguous()` (`tensor.h:341` only allows divergence in dim 0
   via `is_raw_eq_shapes`).
3. Even when the contiguity check passes, `reshape` clobbers the per-dim
   offsets that `view` just installed: `result.is_all_offset_zero = true;
   result.is_raw_eq_shapes = true;` (`tensor.h:364-365`). The data offset
   from the slice is lost.
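
The contiguity failure in point 2 can be illustrated with a toy model. This is only a sketch: `toy_is_contiguous` is a hypothetical helper that encodes the rule as described above (shapes may differ from raw shapes only in dim 0), not the actual `is_contiguous()` from `tensor.h`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model (assumption): a view is treated as contiguous only if its
// logical shape matches the raw buffer shape in every dim except dim 0.
inline bool toy_is_contiguous(const std::vector<uint32_t>& shapes,
                              const std::vector<uint32_t>& raw_shapes) {
    assert(shapes.size() == raw_shapes.size());
    for (std::size_t i = 1; i < shapes.size(); ++i) {
        if (shapes[i] != raw_shapes[i]) return false;
    }
    return true;
}

// Parent raw shape (4, 8, 16).
// C[2]    -> view shape (1, 8, 16): only dim 0 diverges, passes.
// C[:, 3] -> view shape (4, 1, 16): dim 1 diverges, fails.
inline void toy_contiguity_demo() {
    assert(toy_is_contiguous({1, 8, 16}, {4, 8, 16}));
    assert(!toy_is_contiguous({4, 1, 16}, {4, 8, 16}));
}
```

Under this model, a leading-axis drop such as `C[i]` would pass the check, while any non-leading drop such as `C[:, j]` would hit the assert in `reshape`.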

## Proposed API / Behavior

Add an overload (under `runtime/src/{a2a3,a5}/runtime/tensormap_and_ringbuffer/runtime/tensor.h`):

```cpp
// Slice-and-squeeze: produce a view at a lower rank by collapsing size-1
// axes listed in `drop_dims`. `view_shapes` / `view_offsets` are at the
// parent's rank; `view_shapes[d]` for d in drop_dims must equal 1.
Tensor view(
    const uint32_t view_shapes[],
    const uint32_t view_offsets[],
    const uint32_t drop_dims[],
    uint32_t num_drop_dims,
    bool manual_dep = false) const;
```

Semantics:

- The result's `ndims` is `parent.ndims - num_drop_dims`.
- The result's `shapes[]` is `view_shapes[]` and its `raw_shapes[]` is the
  parent's `raw_shapes[]`, each with the entries at indices in `drop_dims`
  removed.
- The element offset contributed by the dropped axes
  (`sum(view_offsets[d] * stride(d))` for `d in drop_dims`) is folded into
  `start_offset` (or equivalently into `offsets[]` of the surviving
  leading axis), so element addressing in the lower-rank view points at
  the same memory as the parent slice.
- Surviving axes keep their `view_offsets[]` as-is.

Constraint: every `d` in `drop_dims` must have `view_shapes[d] == 1`.
This matches what a rank-reducing slice produces and lets us collapse the
dimension without needing a real reshape (no contiguity requirement).
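
The semantics above can be sketched as a standalone function. This is a minimal illustration under stated assumptions, not the proposed implementation: `MiniView`, `raw_stride`, and `slice_and_squeeze` are hypothetical names, the struct models only the fields discussed here rather than the real `Tensor` layout, and strides are assumed to be row-major over the parent's raw shape.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the runtime Tensor (not the tensor.h layout).
struct MiniView {
    std::vector<uint32_t> shapes;      // logical extent of each surviving axis
    std::vector<uint32_t> raw_shapes;  // parent raw extent of each surviving axis
    std::vector<uint32_t> offsets;     // view offset of each surviving axis
    uint64_t start_offset = 0;         // element offset folded in from dropped axes
};

// Row-major element stride of axis d in the parent's raw buffer (assumption).
inline uint64_t raw_stride(const std::vector<uint32_t>& raw, std::size_t d) {
    uint64_t s = 1;
    for (std::size_t i = d + 1; i < raw.size(); ++i) s *= raw[i];
    return s;
}

inline MiniView slice_and_squeeze(const std::vector<uint32_t>& parent_raw,
                                  const std::vector<uint32_t>& view_shapes,
                                  const std::vector<uint32_t>& view_offsets,
                                  const std::vector<uint32_t>& drop_dims) {
    const std::size_t ndims = parent_raw.size();
    assert(view_shapes.size() == ndims && view_offsets.size() == ndims);

    std::vector<bool> dropped(ndims, false);
    for (uint32_t d : drop_dims) {
        assert(d < ndims);
        dropped[d] = true;
    }

    MiniView out;
    for (std::size_t d = 0; d < ndims; ++d) {
        if (dropped[d]) {
            // Constraint: a dropped axis must have extent 1 in the view.
            assert(view_shapes[d] == 1);
            // Fold the dropped axis's offset into the start offset.
            out.start_offset +=
                static_cast<uint64_t>(view_offsets[d]) * raw_stride(parent_raw, d);
        } else {
            // Surviving axes keep their shape, raw extent, and offset as-is.
            out.shapes.push_back(view_shapes[d]);
            out.raw_shapes.push_back(parent_raw[d]);
            out.offsets.push_back(view_offsets[d]);
        }
    }
    return out;
}
```

For a parent with raw shape `(4, 8, 16)`, the slice `C[:, 3, :]` corresponds to `view_shapes = {4, 1, 16}`, `view_offsets = {0, 3, 0}`, `drop_dims = {1}`. The sketch yields `shapes = {4, 16}` and a folded offset of `3 * 16 = 48` elements. The sketch computes the folded offset from the parent's strides and does not model how strides for the surviving axes are derived after the squeeze.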

## Alternatives Considered

- **`view().reshape()` composition** — Rejected, see Motivation #2/#3
  above. Breaks for any non-leading drop and loses per-dim offsets.
- **Standalone `squeeze(keep_mask[])` operation** — Composable but
  requires two C++ calls per slice in generated code and a separate
  intermediate `Tensor`. The overload above is one call and one result.
- **Keep full-rank in runtime; only drop dims in IR / kernel-call binding
  metadata** — Rejected, this diverges runtime tensor rank from IR rank
  and breaks the orchestration → kernel-call ABI inference path.

## Additional Context

- The same change is needed in both `src/a2a3/.../tensor.h` and
  `src/a5/.../tensor.h` so that the orchestration codegen path is uniform
  across architectures.
- PyPTO codegen call site that will adopt the new API:
  `src/codegen/tensor_op_codegen.cpp`, `REGISTER_ORCHESTRATION_OP(tensor_slice)`
  (around line 209). It already constructs `_shapes` and `_offsets` arrays;
  it would emit an additional `_drop_dims` array and call the new overload.
- The `tile.slice` rank-reducing path in PyPTO is purely codegen-side and
  does not need runtime support.
