Refactor ChatterBot corpus loader: pathlib, caching, dataclass, and improved error handling #2439
annuaicoder wants to merge 2 commits into gunthercox:master
gunthercox left a comment
The modernization changes to this pull request look great, but there are a few parts of it that could be good to skip for now.
Additionally, I noticed that tests are currently failing on this branch with a message that points to a possible issue with the path-related changes:
FileNotFoundError: Corpus file or directory not found for: chatterbot.corpus.english.greetings
Tests can be run locally using the unittest library, for example:
python -m unittest discover -s tests
chatterbot/corpus.py (outdated)

    DATA_DIRECTORY = Path.home() / 'chatterbot_corpus' / 'data'

    # Default corpus file extensions
    CORPUS_EXTENSIONS = ['yml', 'yaml', 'json']

The json addition here seems unexpected, considering that the data is passed to yaml.safe_load() later on. It might be better to continue supporting only YAML-formatted files, which keeps explanations of the expected corpus format simpler.
All requested changes have been made; ready for review
This PR modernizes and improves the ChatterBot corpus loader module. The changes focus on readability, maintainability, and robustness while keeping backward compatibility with existing corpus files.
Key Changes:
Pathlib for all filesystem operations
- Replaced os.path with pathlib.Path for clearer and OS-independent path handling.
- Supports both dotted paths and direct filesystem paths.
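
To illustrate how a dotted path and a direct filesystem path could both be accepted, here is a minimal sketch. The function name `resolve_corpus_path`, the `data_directory` parameter, and the rule used to tell the two kinds of input apart are assumptions for illustration, not the code in this PR.

```python
from pathlib import Path

# Assumed prefix of dotted corpus names such as "chatterbot.corpus.english.greetings"
DOTTED_PREFIX = ('chatterbot', 'corpus')


def resolve_corpus_path(name: str, data_directory: Path) -> Path:
    """Map a dotted corpus name or a filesystem path to a Path (no extension is added)."""
    candidate = Path(name)
    # Assumption: absolute paths, names with a path separator, and names that
    # already exist on disk are treated as direct filesystem paths.
    if candidate.is_absolute() or '/' in name or '\\' in name or candidate.exists():
        return candidate
    parts = name.split('.')
    # Drop the package prefix so only the corpus-specific parts remain
    if tuple(parts[:2]) == DOTTED_PREFIX:
        parts = parts[2:]
    return data_directory.joinpath(*parts)
```

One limitation of this heuristic: a relative file name such as greetings.yml that does not exist yet would be read as a dotted name, so a real implementation would likely need a stricter rule.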
Support for multiple file extensions
- Corpus files with a .yml, .yaml, or .json extension are now supported.
- The correct extension is detected automatically when loading files.
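
Extension detection of this kind might look like the sketch below. The helper name `find_corpus_file` and the search order are assumptions; only the three extensions come from the description above.

```python
from pathlib import Path
from typing import Optional, Sequence

# Extensions listed in the PR description; the order they are tried in is assumed
CORPUS_EXTENSIONS = ('yml', 'yaml', 'json')


def find_corpus_file(base: Path, extensions: Sequence[str] = CORPUS_EXTENSIONS) -> Optional[Path]:
    """Return the first existing file named base.<extension>, or None if none exists."""
    for extension in extensions:
        # Append the extension instead of using with_suffix(), so a base
        # name that already contains a dot is kept whole.
        candidate = base.parent / f'{base.name}.{extension}'
        if candidate.is_file():
            return candidate
    return None
```

Returning None leaves the decision to raise FileNotFoundError to the caller, which keeps the lookup itself free of side effects.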
Caching of corpus reads
- Prevents re-reading the same file multiple times, improving performance for repeated loads.
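
The description does not say which caching mechanism is used. As one possibility, the standard library's `functools.lru_cache` can memoize reads by path. The function name `read_corpus_text` below is hypothetical, and it returns raw text rather than parsed data to keep the sketch dependency-free.

```python
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def read_corpus_text(file_path: Path) -> str:
    """Read a corpus file once; later calls with the same path return the cached text."""
    return file_path.read_text(encoding='utf-8')
```

A cache like this does not notice when a file changes on disk, so callers that edit corpus files at runtime would need to call `read_corpus_text.cache_clear()` before reloading.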
Structured return using a dataclass
- Returns CorpusData objects instead of raw tuples.
- Attributes: conversations, categories, and file_path.
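
A dataclass with these three attributes could be declared roughly as follows. The field types are assumptions, since the description names the attributes but not their types.

```python
from dataclasses import astuple, dataclass
from pathlib import Path
from typing import List


@dataclass
class CorpusData:
    """Sketch of the structured return value; the field types are assumed."""
    conversations: List[List[str]]
    categories: List[str]
    file_path: Path


corpus = CorpusData(
    conversations=[['Hello', 'Hi there!']],
    categories=['greetings'],
    file_path=Path('greetings.yml'),
)

# Code written for the old tuple return can still unpack the values via astuple()
conversations, categories, file_path = astuple(corpus)
```

Because a dataclass instance is not iterable by default, callers that unpacked the old tuples directly would need a step like `astuple()` or attribute access.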
Improved error handling
- Raises FileNotFoundError for missing files or directories.
- Raises OptionalDependencyImportError if PyYAML is missing.
- Raises RuntimeError for invalid or unreadable corpus files.
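
The three error cases could be layered as in the sketch below. The function name `read_corpus` and the messages are hypothetical, and `OptionalDependencyImportError` is defined locally as a stand-in so the example runs on its own; ChatterBot provides its own exception class of that name.

```python
from pathlib import Path
from typing import Any, Dict


class OptionalDependencyImportError(ImportError):
    """Local stand-in for illustration; not ChatterBot's own class."""


def read_corpus(file_path: Path) -> Dict[str, Any]:
    """Load one YAML corpus file, translating failures into the three error types."""
    if not file_path.is_file():
        raise FileNotFoundError(f'Corpus file not found: {file_path}')
    try:
        import yaml  # PyYAML, an optional dependency
    except ImportError as error:
        raise OptionalDependencyImportError('PyYAML is required to read corpus files') from error
    try:
        with file_path.open(encoding='utf-8') as corpus_file:
            data = yaml.safe_load(corpus_file)
    except (OSError, UnicodeDecodeError, yaml.YAMLError) as error:
        raise RuntimeError(f'Could not read corpus file: {file_path}') from error
    # Assumption: a valid corpus file parses to a mapping at the top level
    if not isinstance(data, dict):
        raise RuntimeError(f'Unexpected corpus structure in: {file_path}')
    return data
```

Chaining with `from error` keeps the original traceback attached, which helps when diagnosing a malformed file.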
Code cleanup and typing
- Added type hints for all functions.
- Clearer and more descriptive docstrings.
- Sorted file listings for consistent behavior.
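
Sorting matters because directory iteration order depends on the filesystem. A typed listing helper might look like this; the name `list_corpus_files` and its default extension are assumptions.

```python
from pathlib import Path
from typing import List


def list_corpus_files(directory: Path, extension: str = 'yml') -> List[Path]:
    """Return matching files under directory, recursively, in a stable sorted order."""
    return sorted(path for path in directory.rglob(f'*.{extension}') if path.is_file())
```

With a stable order, training on the same corpus directory processes files in the same sequence on every platform.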
Benefits:
- Easier to read and maintain.
- More robust against missing dependencies or malformed files.
- Ready for larger corpus datasets with caching and extension flexibility.
Backward Compatibility:
- Existing YAML corpus files continue to work without any changes.
- The dotted path interface is preserved.