Merged
- Added DocumentMetadataManager for centralized metadata handling of documents.
- Introduced TextProcessor for advanced text processing with customizable configurations.
- Created chunking strategies for text segmentation, including Separator and Character strategies.
- Developed utility functions for text cleaning, keyword extraction, and text statistics.
- Updated vector store implementation to align with new features.
- Enhanced overall structure and modularity of the codebase.
- Bumped version to 0.2.0 for new features and improvements.

- Deleted `debug_chunking.py`, `debug_chunking_detailed.py`, `debug_sentence_break.py`, and `debug_text_processor.py` as they were used for testing and debugging purposes.
- These files contained various test cases for chunking text, processing sentences, and cleaning text, which are no longer needed in the repository.
Pull Request Overview
This PR refactors the library for better modularity, centralized configuration, and specialized detection components.
- Introduces dataclass-based configuration and presets for chunking, processing, loading, embedding, and vector stores.
- Adds `EncodingDetector`, `FileTypeDetector`, and a Strategy-pattern-based chunker.
- Overhauls loaders to use factory and metadata manager patterns, and updates utils to wrap the new implementations.
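
For readers unfamiliar with the pattern, here is a minimal sketch of what a Strategy-pattern-based chunker can look like. The class names `SketchChunkingConfig`, `ChunkingStrategy`, `SeparatorStrategy`, `CharacterStrategy`, and `Chunker`, and the config fields, are assumptions made for illustration; they are not taken from `lambda_rag_lite/strategies/chunking.py`.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchChunkingConfig:
    """Hypothetical config; the real ChunkingConfig fields are not shown in this PR."""

    chunk_size: int = 20
    separator: str = "\n\n"


class ChunkingStrategy(ABC):
    """Common interface that every chunking strategy implements."""

    @abstractmethod
    def split(self, text: str, config: SketchChunkingConfig) -> list[str]:
        ...


class SeparatorStrategy(ChunkingStrategy):
    """Split on a separator string and drop empty pieces."""

    def split(self, text: str, config: SketchChunkingConfig) -> list[str]:
        return [part.strip() for part in text.split(config.separator) if part.strip()]


class CharacterStrategy(ChunkingStrategy):
    """Split into fixed-size windows of characters (no overlap in this sketch)."""

    def split(self, text: str, config: SketchChunkingConfig) -> list[str]:
        size = config.chunk_size
        return [text[i : i + size] for i in range(0, len(text), size)]


class Chunker:
    """Context object: delegates to whichever strategy it was given."""

    def __init__(self, strategy: ChunkingStrategy, config: SketchChunkingConfig | None = None):
        self.strategy = strategy
        self.config = config or SketchChunkingConfig()

    def chunk(self, text: str) -> list[str]:
        return self.strategy.split(text, self.config)


paragraphs = Chunker(SeparatorStrategy()).chunk("first\n\nsecond\n\n\n\nthird")
windows = Chunker(CharacterStrategy(), SketchChunkingConfig(chunk_size=4)).chunk("abcdefghij")
```

The point of the pattern is that callers depend only on `Chunker.chunk`, so a new strategy can be added without touching existing call sites.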
Reviewed Changes
Copilot reviewed 23 out of 25 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| pyproject.toml | Bumped version to 0.2.0 and added Python 3.13 classifier. |
| lambda_rag_lite/config.py | New dataclasses for centralized, type-safe configuration and presets. |
| lambda_rag_lite/constants.py | Centralized file-extension and MIME mappings to remove duplication. |
| lambda_rag_lite/detectors/encoding.py | Added EncodingDetector for intelligent file-encoding detection. |
| lambda_rag_lite/detectors/file_type.py | Added FileTypeDetector for unified file-type inference. |
| lambda_rag_lite/strategies/chunking.py | Refactored text chunking into Strategy pattern with ChunkingConfig. |
| lambda_rag_lite/processors/text_processor.py | New TextProcessor using configurable processing and chunking. |
| lambda_rag_lite/utils.py | Replaced inline helpers with wrappers around the new architecture. |
| lambda_rag_lite/loaders.py | Updated loaders to use detectors, factory, and centralized metadata. |
| lambda_rag_lite/__init__.py | Exposed new classes and maintained backward-compatible API exports. |
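
As an illustration of the kind of work an encoding detector does, the sketch below checks for a byte-order mark and then falls back to trial decoding. It uses only the standard library, and the function name `detect_encoding` and its candidate list are assumptions; the actual `EncodingDetector` in `lambda_rag_lite/detectors/encoding.py` may work differently.

```python
import codecs

# BOMs are checked longest-first because the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def detect_encoding(data: bytes, candidates=("utf-8", "latin-1")) -> str:
    """Return a codec name that can decode `data` (hypothetical helper).

    Strategy: trust a BOM if present, otherwise return the first candidate
    that decodes without error. "latin-1" accepts any byte sequence, so it
    acts as a last-resort fallback when it is in the candidate list.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    raise ValueError("none of the candidate encodings could decode the data")


utf8_guess = detect_encoding("héllo".encode("utf-8"))
latin1_guess = detect_encoding("héllo".encode("latin-1"))
bom_guess = detect_encoding(codecs.BOM_UTF8 + b"hi")
```

Trial decoding is a heuristic: a successful decode shows the bytes are valid in that encoding, not that the encoding is the one the author used.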
Comments suppressed due to low confidence (2)
lambda_rag_lite/loaders.py:239
- TextLoader uses an inconsistent metadata key 'extension' instead of 'file_extension' used elsewhere. Consolidate metadata schema, possibly via DocumentMetadataManager, for consistency across loaders.
metadata = {
lambda_rag_lite/__init__.py:24
- [nitpick] Alias 'NewTextProcessor' may be confusing to consumers. Consider renaming to a more descriptive identifier or exposing it under its original class name 'TextProcessor' for clarity.
from .processors.text_processor import TextProcessor as NewTextProcessor
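
One way to address the metadata-key comment above is to build file metadata in a single helper and map legacy keys onto the canonical schema. This is only a sketch: `build_file_metadata`, `normalize_metadata`, and the `source` and `file_name` keys are hypothetical and do not reflect the actual `DocumentMetadataManager` interface; only the `extension` and `file_extension` keys come from the review comment.

```python
from pathlib import Path

# Legacy key -> canonical key. Only 'extension' is named in the review comment.
LEGACY_KEYS = {"extension": "file_extension"}


def build_file_metadata(path, **extra):
    """Build metadata for a file using one schema for every loader (hypothetical)."""
    p = Path(path)
    metadata = {
        "source": str(p),
        "file_name": p.name,
        "file_extension": p.suffix.lower(),
    }
    metadata.update(extra)
    return metadata


def normalize_metadata(metadata):
    """Rename legacy keys; if both spellings exist, the canonical one wins."""
    normalized = {k: v for k, v in metadata.items() if k not in LEGACY_KEYS}
    for old_key, new_key in LEGACY_KEYS.items():
        if old_key in metadata:
            normalized.setdefault(new_key, metadata[old_key])
    return normalized


fresh = build_file_metadata("docs/readme.MD", loader="TextLoader")
migrated = normalize_metadata({"source": "a.txt", "extension": ".txt"})
```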
This pull request introduces significant enhancements to the `lambda_rag_lite` library, focusing on modularization, centralized configuration, and improved document processing, chunking, and encoding detection. The most important changes are the centralized configuration classes, the modular architecture improvements, and new specialized detectors for encoding and file types.

Modular Architecture Enhancements:
- Configuration classes (`ChunkingConfig`, `TextProcessingConfig`, `LoaderConfig`, `EmbeddingConfig`, `VectorStoreConfig`) centralize settings and improve type safety. They include presets for common use cases such as small documents, academic papers, and code files. (`lambda_rag_lite/config.py`, R1-R235)
- Specialized detectors for encoding detection (`EncodingDetector`) and file type detection (`FileTypeDetector`) handle these tasks in dedicated components. (`lambda_rag_lite/detectors/encoding.py`, `lambda_rag_lite/detectors/__init__.py`)

Code Quality and File Organization:
- A `constants.py` file centralizes file extensions, programming languages, and MIME type mappings, reducing duplication across modules. (`lambda_rag_lite/constants.py`, R1-R197)
- `__init__.py`: Refactored imports in `lambda_rag_lite/__init__.py` to include new classes and maintain backward compatibility with legacy APIs. (`lambda_rag_lite/__init__.py`, R10-R68)

Configuration and Quality Tooling:
- `.qlty` configuration files (`qlty.toml`, `.yamllint.yaml`) and an updated `.gitignore` integrate code quality tools such as `yamllint`, `ruff`, and `markdownlint`. (`.qlty/configs/.yamllint.yaml`, `.qlty/qlty.toml`, `.qlty/.gitignore`)

Documentation and Versioning:
- `CHANGELOG.md` entry for version `0.2.0`, including migration guides and directory structure changes. (`CHANGELOG.md`, R8-R125)
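
To make the idea of dataclass-based configuration with presets concrete, here is a minimal sketch. The class `ChunkingSettings`, its fields, the preset names, and every numeric value are illustrative assumptions; the real `ChunkingConfig` and its presets in `lambda_rag_lite/config.py` are not shown in this PR description and may differ.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ChunkingSettings:
    """Hypothetical stand-in for a chunking config; field names are assumed."""

    chunk_size: int = 1000
    chunk_overlap: int = 200
    separator: str = "\n\n"

    def __post_init__(self):
        # Validate once at construction so invalid settings fail early.
        if self.chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        if not 0 <= self.chunk_overlap < self.chunk_size:
            raise ValueError("chunk_overlap must be in [0, chunk_size)")


# Illustrative presets; the numbers are placeholders, not the library's values.
PRESETS = {
    "small_documents": ChunkingSettings(chunk_size=300, chunk_overlap=50),
    "academic_papers": ChunkingSettings(chunk_size=1500, chunk_overlap=300),
    "code_files": ChunkingSettings(chunk_size=800, chunk_overlap=100, separator="\n"),
}


def get_preset(name: str, **overrides) -> ChunkingSettings:
    """Return a preset, optionally with some fields overridden."""
    return replace(PRESETS[name], **overrides)


code_cfg = get_preset("code_files")
tuned_cfg = get_preset("small_documents", chunk_size=400)
```

Because the dataclass is frozen, `replace` returns a new validated instance and the shared presets cannot be mutated by accident.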