Feature/refactor by dmux · Pull Request #1 · dmux/lambda-rag-lite

dmux · 2025-06-29T21:36:34Z

This pull request introduces significant enhancements to the lambda_rag_lite library, focusing on modularization, centralized configurations, and improved functionality for document processing, chunking, and encoding detection. The most important changes include the addition of centralized configuration classes, modular architecture improvements, and new specialized detectors for encoding and file types.

Modular Architecture Enhancements:

Centralized Configuration: Added multiple configuration classes (ChunkingConfig, TextProcessingConfig, LoaderConfig, EmbeddingConfig, VectorStoreConfig) to centralize settings and improve type safety. Includes presets for common use cases like small documents, academic papers, and code files. (lambda_rag_lite/config.py, R1-R235)
Specialized Modules: Introduced new modules for encoding detection (EncodingDetector) and file type detection (FileTypeDetector) to handle specific tasks more effectively. (lambda_rag_lite/detectors/encoding.py; lambda_rag_lite/detectors/__init__.py)
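
As a rough illustration of the dataclass-plus-presets approach described above, a configuration class might look like the sketch below. Only the class name ChunkingConfig comes from this PR; the field names, default values, and the PRESETS dictionary are assumptions made for the example, not the contents of lambda_rag_lite/config.py.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ChunkingConfig:
    """Illustrative stand-in; the field names are assumptions, not the library's."""

    chunk_size: int = 1000
    chunk_overlap: int = 200
    separator: str = "\n\n"

    def __post_init__(self) -> None:
        # Fail fast on settings that could never produce valid chunks.
        if self.chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        if not 0 <= self.chunk_overlap < self.chunk_size:
            raise ValueError("chunk_overlap must be in [0, chunk_size)")


# Hypothetical presets mirroring the use cases named in the description.
PRESETS = {
    "small_documents": ChunkingConfig(chunk_size=300, chunk_overlap=50),
    "academic_papers": ChunkingConfig(chunk_size=1500, chunk_overlap=300),
    "code_files": ChunkingConfig(chunk_size=800, chunk_overlap=100, separator="\n"),
}

# Derive a variant without mutating the shared (frozen) preset.
custom = replace(PRESETS["code_files"], chunk_size=400)
```

Freezing the dataclass keeps shared presets from being modified by accident, and validating in `__post_init__` means a bad combination is rejected when the config is built rather than deep inside the chunker.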

Code Quality and File Organization:

Constants Centralization: Added a constants.py file to centralize file extensions, programming languages, and MIME type mappings, reducing duplication across modules. (lambda_rag_lite/constants.py, R1-R197)
Updated __init__.py: Refactored imports in lambda_rag_lite/__init__.py to include new classes and maintain backward compatibility with legacy APIs. (lambda_rag_lite/__init__.py, R10-R68)
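
To make the benefit of centralized constants concrete, here is a minimal sketch of a single extension table shared by any code that needs to infer a language from a path. The mapping name EXTENSION_TO_LANGUAGE, its entries, and the detect_language helper are hypothetical; the actual contents of constants.py are not shown on this page.

```python
from pathlib import Path

# Hypothetical single source of truth for extension lookups.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".js": "javascript",
    ".md": "markdown",
    ".txt": "text",
}


def detect_language(path: str, default: str = "unknown") -> str:
    """Infer a language label from the file suffix, case-insensitively."""
    return EXTENSION_TO_LANGUAGE.get(Path(path).suffix.lower(), default)
```

With one table, adding support for a new extension is a one-line change instead of an edit in every loader that previously carried its own copy.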

Configuration and Quality Tooling:

Qlty Tooling Setup: Added .qlty configuration files (qlty.toml, .yamllint.yaml) and updated .gitignore to integrate code quality tools like yamllint, ruff, and markdownlint. (.qlty/configs/.yamllint.yaml; .qlty/qlty.toml; .qlty/.gitignore)

Documentation and Versioning:

Changelog Update: Documented the major refactoring and improvements in CHANGELOG.md for version 0.2.0, including migration guides and directory structure changes. (CHANGELOG.md, R8-R125)

- Added DocumentMetadataManager for centralized metadata handling of documents.
- Introduced TextProcessor for advanced text processing with customizable configurations.
- Created chunking strategies for text segmentation, including Separator and Character strategies.
- Developed utility functions for text cleaning, keyword extraction, and text statistics.
- Updated vector store implementation to align with new features.
- Enhanced overall structure and modularity of the codebase.
- Bumped version to 0.2.0 for new features and improvements.
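
The Separator and Character chunking strategies mentioned above could be arranged roughly as follows. This is a sketch of the Strategy pattern under assumed names and signatures (ChunkingStrategy, split, chunk_text); it is not the code in lambda_rag_lite/strategies/chunking.py.

```python
from abc import ABC, abstractmethod


class ChunkingStrategy(ABC):
    """Common interface so callers can swap strategies freely."""

    @abstractmethod
    def split(self, text: str) -> list[str]: ...


class CharacterStrategy(ChunkingStrategy):
    """Fixed-width windows that overlap by a set number of characters."""

    def __init__(self, chunk_size: int, overlap: int = 0) -> None:
        if chunk_size <= 0 or not 0 <= overlap < chunk_size:
            raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
        self.chunk_size = chunk_size
        self.step = chunk_size - overlap

    def split(self, text: str) -> list[str]:
        chunks, start = [], 0
        while start < len(text):
            chunks.append(text[start:start + self.chunk_size])
            if start + self.chunk_size >= len(text):
                break  # this window already reached the end of the text
            start += self.step
        return chunks


class SeparatorStrategy(ChunkingStrategy):
    """Split on a separator, then greedily pack pieces up to chunk_size."""

    def __init__(self, chunk_size: int, separator: str = "\n\n") -> None:
        self.chunk_size = chunk_size
        self.separator = separator

    def split(self, text: str) -> list[str]:
        chunks, current = [], ""
        for piece in text.split(self.separator):
            if not piece.strip():
                continue  # skip empty pieces from repeated separators
            candidate = f"{current}{self.separator}{piece}" if current else piece
            if not current or len(candidate) <= self.chunk_size:
                current = candidate  # an oversized single piece is kept whole
            else:
                chunks.append(current)
                current = piece
        if current:
            chunks.append(current)
        return chunks


def chunk_text(text: str, strategy: ChunkingStrategy) -> list[str]:
    """The caller depends only on the interface, not on a concrete strategy."""
    return strategy.split(text)
```

Because `chunk_text` only knows about the abstract interface, a new strategy (for example, sentence-based splitting) can be added without touching existing callers.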

- Deleted `debug_chunking.py`, `debug_chunking_detailed.py`, `debug_sentence_break.py`, and `debug_text_processor.py`, which were used for testing and debugging.
- These files contained test cases for chunking text, processing sentences, and cleaning text that are no longer needed in the repository.

Copilot

Pull Request Overview

This PR refactors the library for better modularity, centralized configuration, and specialized detection components.

Introduces dataclass-based configuration and presets for chunking, processing, loading, embedding, and vector stores.
Adds EncodingDetector, FileTypeDetector, and a Strategy-pattern-based chunker.
Overhauls loaders to use factory and metadata manager patterns, and updates utils to wrap the new implementations.
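
For readers unfamiliar with encoding detection, one common approach is to check for a byte-order mark and then try a short list of candidate encodings in order. The function below is a sketch of that general technique using only the standard library; the name detect_encoding and the candidate list are assumptions, and the PR's EncodingDetector may work differently.

```python
import codecs

# BOMs are checked longest-first so UTF-32 LE is not mistaken for UTF-16 LE,
# whose BOM is a prefix of it.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def detect_encoding(data: bytes, candidates=("utf-8", "cp1252")) -> str:
    """Return the name of an encoding that can decode `data`."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    for name in candidates:
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so it never fails to decode.
    return "latin-1"
```

Trial decoding cannot prove which encoding the author used, only which ones are consistent with the bytes, so the order of `candidates` acts as the tie-breaker.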

Reviewed Changes

Copilot reviewed 23 out of 25 changed files in this pull request and generated 1 comment.

File	Description
pyproject.toml	Bumped version to 0.2.0 and added Python 3.13 classifier.
lambda_rag_lite/config.py	New dataclasses for centralized, type-safe configuration and presets.
lambda_rag_lite/constants.py	Centralized file-extension and MIME mappings to remove duplication.
lambda_rag_lite/detectors/encoding.py	Added `EncodingDetector` for intelligent file-encoding detection.
lambda_rag_lite/detectors/file_type.py	Added `FileTypeDetector` for unified file-type inference.
lambda_rag_lite/strategies/chunking.py	Refactored text chunking into Strategy pattern with `ChunkingConfig`.
lambda_rag_lite/processors/text_processor.py	New `TextProcessor` using configurable processing and chunking.
lambda_rag_lite/utils.py	Replaced inline helpers with wrappers around the new architecture.
lambda_rag_lite/loaders.py	Updated loaders to use detectors, factory, and centralized metadata.
lambda_rag_lite/__init__.py	Exposed new classes and maintained backward-compatible API exports.

Comments suppressed due to low confidence (2)

lambda_rag_lite/loaders.py:239

TextLoader uses an inconsistent metadata key 'extension' instead of 'file_extension' used elsewhere. Consolidate metadata schema, possibly via DocumentMetadataManager, for consistency across loaders.

            metadata = {
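
One way to act on the comment above is to normalize legacy keys to a single canonical name before metadata leaves a loader. The key names 'extension' and 'file_extension' come from the review comment; the normalize_metadata helper and the alias table are hypothetical and not part of the library.

```python
# Legacy key -> canonical key (hypothetical alias table).
_KEY_ALIASES = {"extension": "file_extension"}


def normalize_metadata(metadata: dict) -> dict:
    """Return a copy with legacy keys renamed; canonical keys win on conflict."""
    normalized = {}
    for key, value in metadata.items():
        canonical = _KEY_ALIASES.get(key, key)
        if canonical in normalized and key != canonical:
            continue  # keep the value already stored under the canonical key
        normalized[canonical] = value
    return normalized
```

Routing every loader's output through one such function keeps the schema consistent even while older call sites still produce the legacy key.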

lambda_rag_lite/__init__.py:24

[nitpick] Alias 'NewTextProcessor' may be confusing to consumers. Consider renaming to a more descriptive identifier or exposing it under its original class name 'TextProcessor' for clarity.

from .processors.text_processor import TextProcessor as NewTextProcessor

dmux added 3 commits June 29, 2025 18:31

feat: add support for Python 3.13 in classifiers

956bbf5

dmux self-assigned this Jun 29, 2025

dmux added the documentation (Improvements or additions to documentation) and enhancement (New feature or request) labels Jun 29, 2025

dmux requested a review from Copilot June 29, 2025 21:36

Copilot AI reviewed Jun 29, 2025

dmux merged commit 10d71b1 into main Jun 29, 2025
4 checks passed

dmux merged 3 commits into main from feature/refactor

2 participants
