Fix #13882: Decrement vocab.length when memory_zone clears transient …#13931

Open
Dzhud wants to merge 1 commit into explosion:master from Dzhud:fix-vocab-length-memory-zone

Dzhud commented Mar 5, 2026

Fix #13882: Decrement vocab.length when memory_zone clears transient lexemes

Description

This PR fixes issue #13882, in which the Vocab.length counter was incremented when lexemes were added but never decremented when memory_zone cleared transient lexemes. As a result, len(vocab) grew continuously even though the lexemes themselves were correctly removed from the internal hash map, which made len(vocab) unreliable for monitoring memory_zone effectiveness in production environments.
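
To make the failure mode concrete, here is a minimal pure-Python sketch of the pre-fix behaviour. `ToyVocab` and all of its attributes are hypothetical stand-ins invented for illustration; this is not spaCy's Cython implementation, only a model of a separate counter drifting away from the container it is supposed to describe.

```python
from contextlib import contextmanager


class ToyVocab:
    """Hypothetical stand-in for a vocab; not spaCy's implementation."""

    def __init__(self):
        self._by_orth = {}      # models the internal hash map
        self._transient = []    # keys added while a memory zone is active
        self._in_zone = False
        self.length = 0         # models the separately maintained counter

    def add(self, orth):
        if orth not in self._by_orth:
            self._by_orth[orth] = object()
            self.length += 1
            if self._in_zone:
                self._transient.append(orth)

    @contextmanager
    def memory_zone(self):
        self._in_zone = True
        try:
            yield
        finally:
            self._in_zone = False
            # Behaviour being modelled: entries are removed from the map,
            # but the counter is left untouched.
            for orth in self._transient:
                self._by_orth.pop(orth, None)
            self._transient.clear()

    def __len__(self):
        return self.length

    def __iter__(self):
        return iter(self._by_orth)


vocab = ToyVocab()
vocab.add("the")
for cycle in range(3):
    with vocab.memory_zone():
        vocab.add(f"rare-{cycle}")

reported = len(vocab)            # counter-based length
actual = sum(1 for _ in vocab)   # entries really present
print(reported, actual)
```

In this model the reported length grows by one per cycle while the number of entries that can be iterated stays at the baseline, which is the mismatch the issue describes.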

Changes:

  • Modified spacy/vocab.pyx: Enhanced _clear_transient_orths() to track and decrement self.length by the number of cleared lexemes, with NULL check for edge cases
  • Added comprehensive tests in spacy/tests/vocab_vectors/test_memory_zone.py:
    • test_memory_zone_vocab_length_decremented: Verifies single memory_zone cycle
    • test_memory_zone_multiple_cycles: Verifies multiple cycles
    • Both marked with @pytest.mark.issue(13882)
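
The shape of the change and of the two tests can be sketched in plain Python. Everything below is an assumption made for illustration: `ToyVocab`, `clear_transient` and `run_zone` are hypothetical names, and the `is not None` guard only loosely models the NULL check mentioned above. The real change lives in Cython in spacy/vocab.pyx and may differ in detail.

```python
class ToyVocab:
    """Hypothetical model of a vocab with a separate length counter."""

    def __init__(self):
        self._by_orth = {}      # stands in for the internal hash map
        self._transient = []    # keys added while a memory zone is active
        self.in_memory_zone = False
        self.length = 0

    def add(self, orth):
        if orth in self._by_orth:
            return
        self._by_orth[orth] = object()
        self.length += 1
        if self.in_memory_zone:
            self._transient.append(orth)

    def clear_transient(self):
        """Remove transient entries and keep the counter in step."""
        cleared = 0
        for orth in self._transient:
            entry = self._by_orth.pop(orth, None)
            if entry is not None:  # skip stale keys (models the NULL check)
                cleared += 1
        self._transient.clear()
        self.length -= cleared
        return cleared

    def __len__(self):
        return self.length

    def __iter__(self):
        return iter(self._by_orth)


def run_zone(vocab, words):
    """Add words inside a zone, then clear them; returns the number cleared."""
    vocab.in_memory_zone = True
    try:
        for word in words:
            vocab.add(word)
    finally:
        vocab.in_memory_zone = False
    return vocab.clear_transient()


vocab = ToyVocab()
for word in ("the", "cat"):
    vocab.add(word)
baseline = len(vocab)

# Single cycle: "the" already exists, so only two entries are transient.
cleared_once = run_zone(vocab, ["zyx", "qwv", "the"])
after_one = len(vocab)

# Multiple cycles: the count should return to the baseline every time.
after_many = [
    (run_zone(vocab, [f"tmp-{i}", "cat"]), len(vocab)) for i in range(3)
]
```

Counting only the entries that were actually removed, rather than subtracting the size of the transient list, keeps the counter correct even if a transient key has no entry behind it.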

Testing:
All tests pass successfully:

  • Existing memory_zone tests still pass
  • New tests verify the fix works for both simple and complex usage patterns
  • Tested with reproduction script confirming len(vocab) now correctly decrements
  • Code formatted with black and passes flake8 linting

The fix ensures len(vocab) correctly reflects actual lexeme count and matches iteration count over vocab.

Types of change

Bug fix - fixes issue #13882 where vocab.length counter was not properly maintained when memory_zone cleared transient lexemes.

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

Dzhud force-pushed the fix-vocab-length-memory-zone branch from a2a139e to 7ea5d76 on March 25, 2026, 01:35
Development

Successfully merging this pull request may close these issues.

Bug: vocab.length not decremented when memory_zone clears transient lexemes
