
Refactor ChatterBot corpus loader: pathlib, caching, dataclass, and improved error handling #2439

Open
annuaicoder wants to merge 2 commits into gunthercox:master from annuaicoder:master

@annuaicoder

This PR modernizes and improves the ChatterBot corpus loader module. The changes focus on readability, maintainability, and robustness while keeping backward compatibility with existing corpus files.

Key Changes:

Pathlib for all filesystem operations

Replaced os.path with pathlib.Path for clearer and OS-independent path handling.

Supports both dotted paths and direct filesystem paths.
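A minimal sketch of how a dotted path could be mapped onto a directory with pathlib. The helper name `resolve_corpus_path` and the `data_dir` parameter are hypothetical and are not taken from this PR's diff; file extensions are handled separately.

```python
from pathlib import Path


def resolve_corpus_path(dotted_or_path: str, data_dir: Path) -> Path:
    """Hypothetical helper: turn a dotted corpus name into a path under data_dir."""
    candidate = Path(dotted_or_path)
    # Anything that already looks like a filesystem path is returned unchanged.
    if candidate.is_absolute() or "/" in dotted_or_path or "\\" in dotted_or_path:
        return candidate
    parts = dotted_or_path.split(".")
    # Drop the conventional 'chatterbot.corpus' prefix when it is present.
    if parts[:2] == ["chatterbot", "corpus"]:
        parts = parts[2:]
    return data_dir.joinpath(*parts)
```

With this sketch, `resolve_corpus_path("chatterbot.corpus.english.greetings", data_dir)` yields `data_dir / "english" / "greetings"`, while a string containing a path separator is passed through untouched.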

Support for multiple file extensions

Corpus files with .yml, .yaml, or .json are now supported.

Automatic detection of the correct extension when loading files.
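Extension detection could be sketched roughly as follows. `find_corpus_file` is a hypothetical name, and the default extension tuple simply mirrors the list described above; it is an illustration, not the code from the diff.

```python
from pathlib import Path
from typing import Optional, Sequence


def find_corpus_file(
    base: Path, extensions: Sequence[str] = ("yml", "yaml", "json")
) -> Optional[Path]:
    """Hypothetical helper: return the first existing 'base.<ext>' file, else None."""
    for ext in extensions:
        # Append the extension to the final path component.
        candidate = base.with_name(f"{base.name}.{ext}")
        if candidate.is_file():
            return candidate
    return None
```

The order of `extensions` decides which file wins when several variants exist side by side.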

Caching of corpus reads

Prevents re-reading the same file multiple times, improving performance for repeated loads.
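One way to get this behavior is `functools.lru_cache` keyed on the file path. The sketch below is an assumption about the approach, with a hypothetical function name, and it caches the raw text only; parsing would happen on top of it.

```python
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def read_corpus_text(file_path: str) -> str:
    """Hypothetical helper: read a corpus file once; repeat calls hit the cache."""
    return Path(file_path).read_text(encoding="utf-8")
```

A cache like this does not notice edits made to a file on disk after the first read, so long-running processes may need to call `read_corpus_text.cache_clear()` before retraining.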

Structured return using a dataclass

Returns CorpusData objects instead of raw tuples.

Attributes: conversations, categories, and file_path.
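A sketch of what such a dataclass might look like. The field types are assumptions, and the `__iter__` method is an optional addition (not something described above) that would let callers which unpack a three-item tuple keep working.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator, List


@dataclass(frozen=True)
class CorpusData:
    conversations: List[List[str]]  # assumed shape: a list of conversations
    categories: List[str]
    file_path: Path

    def __iter__(self) -> Iterator[object]:
        # Assumed compatibility shim: supports `a, b, c = corpus_data`.
        return iter((self.conversations, self.categories, self.file_path))
```

For example, `conversations, categories, file_path = CorpusData([["Hi", "Hello"]], ["greetings"], Path("greetings.yml"))` unpacks the same way a tuple would.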

Improved error handling

Raises FileNotFoundError for missing files or directories.

Raises OptionalDependencyImportError if PyYAML is missing.

Raises RuntimeError for invalid or unreadable corpus files.
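The three cases could be layered roughly like this. `read_corpus` is a hypothetical name, and `OptionalDependencyImportError` is redefined locally as a stand-in so the sketch runs on its own; the real exception class lives in the ChatterBot code base.

```python
from pathlib import Path
from typing import Any


class OptionalDependencyImportError(ImportError):
    """Local stand-in for the exception named above (assumption for this sketch)."""


def read_corpus(file_path: Path) -> Any:
    """Hypothetical helper: load one YAML corpus file with explicit error mapping."""
    if not file_path.is_file():
        raise FileNotFoundError(f"Corpus file or directory not found for: {file_path}")
    try:
        import yaml  # PyYAML is an optional dependency
    except ImportError as exc:
        raise OptionalDependencyImportError(
            "PyYAML is required to load corpus files."
        ) from exc
    try:
        with file_path.open(encoding="utf-8") as handle:
            return yaml.safe_load(handle)
    except (OSError, yaml.YAMLError) as exc:
        raise RuntimeError(f"Invalid or unreadable corpus file: {file_path}") from exc
```

Chaining with `from exc` keeps the original traceback available when debugging a malformed file.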

Code cleanup and typing

Added type hints for all functions.

Clearer and more descriptive docstrings.

Sorted file listings for consistent behavior.
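Sorted listings can be sketched with `Path.rglob`. The function name and the single-extension parameter are assumptions made for illustration.

```python
from pathlib import Path
from typing import List


def list_corpus_files(directory: Path, extension: str = "yml") -> List[Path]:
    """Hypothetical helper: list corpus files under directory in a stable order."""
    # Sorting removes any dependence on the order the filesystem returns entries.
    return sorted(p for p in directory.rglob(f"*.{extension}") if p.is_file())
```

A stable order means training runs see the corpus files in the same sequence on every platform.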

Benefits:

Easier to read and maintain.

More robust against missing dependencies or malformed files.

Ready for larger corpus datasets with caching and extension flexibility.

Backward Compatibility:

Existing YAML corpus files continue to work without any changes.

The dotted path interface is preserved.

Owner

@gunthercox gunthercox left a comment


The modernization changes in this pull request look great, but there are a few parts of it that would be better to skip for now.

Additionally, I noticed that tests are currently failing on this branch with a message indicating a possible issue related to the path-related changes:

FileNotFoundError: Corpus file or directory not found for: chatterbot.corpus.english.greetings

Tests can be run locally using the unittest library, for example:

python -m unittest discover -s tests

DATA_DIRECTORY = Path.home() / 'chatterbot_corpus' / 'data'

# Default corpus file extensions
CORPUS_EXTENSIONS = ['yml', 'yaml', 'json']
Owner


The json addition here seems unexpected considering the data is passed to yaml.safe_load() later on. It might be better to continue supporting only YAML-formatted files to keep explanations of the expected corpus format simpler.

Author


All requested changes have been made; ready for review
