Fix: respect -t/--tokenizer CLI flag over sibling tokenizer/ directory #629
Open
mario-sanz wants to merge 1 commit into allenai:main from
Contributor
Hey! Thanks for the fix. This looks good to me, but can you remove the test file? The AI generated something like 6 tests for just the if/else on tokenizer_id, which is huge overkill.
When a sibling tokenizer/ directory exists next to the checkpoint, the convert_checkpoint_to_hf script would always use it, silently ignoring the -t/--tokenizer CLI argument. Fix by checking that tokenizer_id is None before falling back to the sibling tokenizer/ directory, so an explicit -t flag always takes precedence. Fixes allenai#609
9597b54 to 72df5ea
Author
Hi @tyler-romero - done! Actually, I thought it would be a good idea to include tests for all possible cases (the 5 combinations of using the tokenizer with the
Author
@tyler-romero, could you take a look at this? Thanks!
Fixes #609
Problem
In convert_checkpoint_to_hf.py, the -t/--tokenizer CLI flag is silently ignored when a sibling tokenizer/ directory exists next to the checkpoint. The code unconditionally enters the if tokenizer_path.exists() branch, so the user-provided tokenizer_id is never consulted.

This is a common scenario for SFT checkpoints, which are typically saved with an adjacent tokenizer/ directory.

Fix
Add a tokenizer_id is None guard to the condition in convert_checkpoint_to_hf.py, so the sibling tokenizer/ directory is only used as a fallback when no explicit -t override was provided.

Precedence after the fix:
1. -t/--tokenizer CLI argument (highest priority)
2. tokenizer/ directory next to the checkpoint
3. tokenizer_config.identifier from the checkpoint metadata

This matches the behavior already present in the hybrid converter (convert_checkpoint_to_hf_hybrid.py).

Tests
Added convert_checkpoint_to_hf_tokenizer_test.py with 5 test cases covering the full tokenizer selection logic:
- test_tokenizer_id_used_when_no_sibling_dir: -t is used when no sibling tokenizer/ directory exists.
- test_tokenizer_id_overrides_sibling_dir: -t takes precedence when a sibling tokenizer/ directory also exists (the core bug scenario). This test fails before the fix and passes after it.
- test_sibling_dir_used_when_no_tokenizer_id: The sibling tokenizer/ directory is used as a fallback when -t is not provided.
- test_config_identifier_used_as_fallback: TokenizerConfig.identifier is used when neither -t nor a sibling directory is available.
- test_no_tokenizer_saved_when_nothing_available: Tokenizer saving is skipped entirely when no source is available.
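
For readers skimming the thread, the precedence described above can be sketched roughly as follows. This is a minimal, standalone illustration, not the code in convert_checkpoint_to_hf.py: the function name select_tokenizer_source, its parameters, and the "sibling_dir"/"identifier" labels are hypothetical and exist only for this example. It assumes the sibling directory is passed in as a path and that the config identifier may be None.

```python
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def select_tokenizer_source(
    tokenizer_id: Optional[str],
    tokenizer_path: Path,
    config_identifier: Optional[str],
) -> Optional[Tuple[str, str]]:
    """Hypothetical helper: pick the tokenizer source, or None to skip saving.

    Precedence: explicit id (-t) > sibling tokenizer/ directory > config identifier.
    """
    # The guard from the fix: only fall back to the sibling directory
    # when no explicit tokenizer id was supplied.
    if tokenizer_id is None and tokenizer_path.exists():
        return ("sibling_dir", str(tokenizer_path))

    # No explicit id and no sibling directory: try the identifier from the config.
    if tokenizer_id is None:
        tokenizer_id = config_identifier

    # Nothing available at all: the caller should skip saving a tokenizer.
    if tokenizer_id is None:
        return None

    return ("identifier", tokenizer_id)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sibling = Path(tmp) / "tokenizer"
        sibling.mkdir()
        # Core bug scenario: an explicit id should win over an existing sibling directory.
        print(select_tokenizer_source("explicit-id", sibling, "config-id"))
```

Dropping the tokenizer_id is None check from the first condition reproduces the reported behavior in this sketch: the sibling directory would win whenever it exists, regardless of what was passed on the command line.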