Skip to content

Fix: respect -t/--tokenizer CLI flag over sibling tokenizer/ directory#629

Open
mario-sanz wants to merge 1 commit intoallenai:mainfrom
mario-sanz:fix/respect-tokenizer-cli-flag-609
Open

Fix: respect -t/--tokenizer CLI flag over sibling tokenizer/ directory#629
mario-sanz wants to merge 1 commit intoallenai:mainfrom
mario-sanz:fix/respect-tokenizer-cli-flag-609

Conversation

@mario-sanz
Copy link
Copy Markdown

Fixes #609

Problem

In convert_checkpoint_to_hf.py, the -t/--tokenizer CLI flag is silently ignored when a sibling tokenizer/ directory exists next to the checkpoint. The code unconditionally enters the if tokenizer_path.exists() branch, so the user-provided tokenizer_id is never consulted.

This is a common scenario for SFT checkpoints, which are typically saved with an adjacent tokenizer/ directory.

Fix

Add a tokenizer_id is None guard to the condition on convert_checkpoint_to_hf.py, so the sibling tokenizer/ directory is only used as a fallback when no explicit -t override was provided.

Precedence after the fix:

  1. Explicit -t/--tokenizer CLI argument (highest priority)
  2. Sibling tokenizer/ directory next to the checkpoint
  3. tokenizer_config.identifier from the checkpoint metadata

This matches the behavior already present in the hybrid converter (convert_checkpoint_to_hf_hybrid.py).

Tests

Added convert_checkpoint_to_hf_tokenizer_test.py with 5 test cases covering the full tokenizer selection logic:

  • test_tokenizer_id_used_when_no_sibling_dir: -t is used when no sibling tokenizer/ directory exists.
  • test_tokenizer_id_overrides_sibling_dir: -t takes precedence when a sibling tokenizer/ directory also exists (the core bug scenario). Before the fix, this test fails, but after the fix, it succeeds.
  • test_sibling_dir_used_when_no_tokenizer_id: The sibling tokenizer/ directory is used as a fallback when -t is not provided.
  • test_config_identifier_used_as_fallback: TokenizerConfig.identifier is used when neither -t nor a sibling directory is available.
  • test_no_tokenizer_saved_when_nothing_available: Tokenizer saving is skipped entirely when no source is available.

@tyler-romero
Copy link
Copy Markdown
Contributor

Hey! Thanks for the fix. This looks good to me but can you actually remove the test file? AI generated like 6 tests for just the if/else related to the tokenizer_id which is huge overkill.

@tyler-romero tyler-romero self-requested a review March 3, 2026 17:32
When a sibling tokenizer/ directory exists next to the checkpoint, the
convert_checkpoint_to_hf script would always use it, silently ignoring
the -t/--tokenizer CLI argument.

Fix by checking that tokenizer_id is None before falling back to the
sibling tokenizer/ directory, so an explicit -t flag always takes
precedence.

Fixes allenai#609
@mario-sanz mario-sanz force-pushed the fix/respect-tokenizer-cli-flag-609 branch from 9597b54 to 72df5ea Compare March 4, 2026 05:55
@mario-sanz
Copy link
Copy Markdown
Author

Hey! Thanks for the fix. This looks good to me but can you actually remove the test file? AI generated like 6 tests for just the if/else related to the tokenizer_id which is huge overkill.

Hi @tyler-romero - done!

Actually, I thought it would be a good idea to include tests for all possible cases (the 5 combinations of using the tokenizer with the -t flag and the tokenizer/ directory), but if you think it's not necessary, let's leave it as it is!

@mario-sanz
Copy link
Copy Markdown
Author

@tyler-romero, could you take a look at this? Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

convert_checkpoint_to_hf.py ignores -t/--tokenizer when sibling tokenizer/ directory exists

2 participants