fix(hf): pass return_dict=True to apply_chat_template#1253
Open
ajbozarth wants to merge 1 commit into
Open
Conversation
apply_chat_template(return_tensors="pt") returns a 2-D torch.Tensor,
not a dict. The standard-generation call site at huggingface.py:1119-
1120 then indexes it with string keys ("input_ids", "attention_mask"),
which on torch >= 2.9 is interpreted as fancy indexing with the
codepoints of the string and raises IndexError on macOS (CPU and MPS).
Adding return_dict=True makes apply_chat_template return a
BatchEncoding, so dict access works as the surrounding code already
expects. The downstream isinstance(input_ids, torch.Tensor) ternaries
in processing()/post_processing() were already coded to handle both
the dict producer and the bare-Tensor producer from the merged-cache
path, so they continue to work for both shapes.
Latent since generative-computing#418 (transformers 5 bump). Not caught by PR CI because
the integration test is gated by @require_gpu, and not caught by the
GPU nightly because the Linux CUDA wheel of torch 2.11.0 doesn't
trigger the deprecated indexing path that fails locally on macOS.
Assisted-by: Claude Code
Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
8 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Issue
Fixes #
Description
Fixes a latent bug introduced in #418 (transformers 5 bump) where
LocalHFBackend.generate_from_contextindexes a rawtorch.Tensorwith a string key.apply_chat_template(..., return_tensors="pt")returns a 2-Dtorch.Tensor, not a dict. The standard-generation call site athuggingface.py:1119-1120then does:Per the warning at
pytorch/csrc/autograd/python_variable_indexing.cpp:353, in torch ≥ 2.9 this expression is interpreted astensor[torch.tensor(seq)]— fancy indexing with the codepoints of"input_ids"against a 2-D tensor — which raises:Reproduces locally on macOS Apple Silicon (torch 2.11.0, both CPU and MPS).
The fix is one line: pass
return_dict=Truetoapply_chat_templateso the dict access (and the.to(self._device)chain) operate on aBatchEncoding.Why
return_dict=Trueand not "pass the bare tensor"Two producers feed the local
input_idsvariable into the same downstreamprocessing()andpost_processing()consumers:_compute_input_ids_with_kv_cache_smashing, line 831) — returns a baretorch.Tensor.apply_chat_template(...)— should return aBatchEncoding.The consumers already handle both shapes via
isinstance(input_ids, torch.Tensor)ternaries at lines 1200, 1273, 1305 (Tensorbranch for the merged-cache producer,else input_ids["input_ids"]for the dict producer). The dict branch is only reachable if the standard-generation producer returns a dict, so addingreturn_dict=Truematches what consumers were coded for. The alternative would also require deleting all three dict branches and synthesizing anattention_mask.Why this slipped through
PR CI doesn't catch it.
test_telemetry/test_metrics_backend.py::test_huggingface_token_metrics_integrationis gated by@require_gpu(min_vram_gb=8)(test/predicates.py:91). GitHub Actions runners have neither CUDA nor MPS, so the test is skipped on every PR run.GPU nightly runs do not fail on it. Confirmed against the most recent main-branch nightly log (Linux + CUDA, same
torch==2.11.0peruv.lock): both parametrizations of the test ran end-to-end (the transformersmax_lengthwarning fires from insidemodel.generate(), past the broken site) and neither appears in the failure list. The pytorch deprecation warning that fires every run locally is completely absent from the cluster log.Same mellea code, same indexing expression, same torch version string, same Python, same
filterwarningsconfig — but different platform wheels (uv.lockships ~530MB Linux x86_64 with CUDA vs. ~80MB macOS ARM64; distinct compiled binaries). Canonical pytorch source fortreatSequenceAsTupleis platform-agnostic and should warn on every platform; the cluster wheel evidently diverges from that. Mechanism uninvestigated — not blocking.What didn't change
Tensor; the three downstream ternaries continue to handle both producers.apply_chat_templatecall sites: only line 1054 was buggy. The KV-cache call at line 774 usestokenize=False(returns a string) and the alora batch path at 1511-1550 already operates on a dict-shaped result.Testing
Locally on Apple Silicon (torch 2.11.0, transformers 5.x, MPS):
The
python_variable_indexing.cpp:353warning is gone. No new test added —test_huggingface_token_metrics_integrationalready exercises this path; it's just GPU-gated. Will validate on the GPU path post-merge.Attribution
Adding a new component, requirement, sampling strategy, or tool?
If your PR adds or modifies one of the types below, check the matching box. A checklist of type-specific review items will be posted as a comment.
NOTE: Please ensure you have an issue that has been acknowledged by a core contributor and routed you to open a pull request against this repository. Otherwise, please open an issue before continuing with this pull request.