Optimize torch.nn.functional.one_hot

### 🚀 The feature, motivation and pitch

`one_hot` is used in the vLLM Gemma 4 26B workload: https://github.com/vllm-project/vllm/blob/308cec5864890f5c0724e1d4531d9fe2ee0a8209/vllm/model_executor/models/gemma4.py#L214

We found that 120 D2H copies are invoked per step. They are most likely caused by the boundary check on the class values here: https://github.com/pytorch/pytorch/blob/beae96dfc1a2880f80a62f196306188a8d6dfdd9/aten/src/ATen/native/Onehot.cpp#L62

Please consider following the CUDA path and skipping this host-side check.
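For illustration only, here is a minimal sketch of a user-side workaround that avoids reading index values back to the host. It assumes `num_classes` is known ahead of time and that the range check really is the source of the copies, which has not been confirmed. The helper name `one_hot_no_check` is hypothetical and not part of PyTorch or vLLM, and it performs no validation of the indices.

```python
import torch
import torch.nn.functional as F


def one_hot_no_check(indices: torch.Tensor, num_classes: int) -> torch.Tensor:
    """Hypothetical helper: one-hot encode without a host-side range check.

    Unlike F.one_hot, this never calls .item() on min()/max() of the input,
    so it should not need a device-to-host copy. Out-of-range indices are
    NOT validated here; the caller must guarantee 0 <= index < num_classes.
    """
    # Allocate the output with an extra trailing dimension for the classes.
    out = torch.zeros(
        *indices.shape, num_classes, dtype=torch.long, device=indices.device
    )
    # Write a 1 at the position given by each index along the last dimension.
    out.scatter_(-1, indices.unsqueeze(-1), 1)
    return out


# Small CPU check against the reference implementation.
idx = torch.tensor([[0, 2], [1, 3]])
result = one_hot_no_check(idx, 4)
reference = F.one_hot(idx, num_classes=4)
```

This is only a stopgap for callers; the request in this issue is for the operator itself to skip the check, as it does for CUDA.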

### Alternatives

_No response_

### Additional context

_No response_

Optimize torch.nn.functional.one_hot #3284
