Description
Category: Enhancement / Refactor
Overview
This issue addresses several maintainability and code quality concerns across the `clean_validate.py` and `search_top_posts.py` modules. The goal is to make the content pipeline more robust, configurable, and easier to debug or extend.
Problem Areas
- Hardcoded Thresholds & Magic Numbers
Minimum word count, paragraph requirements, and other constants are scattered and hardcoded in logic and prompts.
Proposed Solution:
Move all such values to a shared configuration or constants module, and refer to them symbolically in code and prompts. Update documentation to show how these values can be customized.
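A minimal sketch of what such a shared constants module could look like. All names and values here (`ValidationConfig`, `min_word_count`, the `300`/`3` thresholds) are hypothetical placeholders, not taken from the actual codebase:

```python
# config.py -- hypothetical shared constants module.
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationConfig:
    """Central home for thresholds used by validation logic and prompts."""
    min_word_count: int = 300   # illustrative default, not the repo's value
    min_paragraphs: int = 3
    max_title_length: int = 120


DEFAULT_CONFIG = ValidationConfig()


def build_validation_prompt(cfg: ValidationConfig = DEFAULT_CONFIG) -> str:
    # Refer to thresholds symbolically so prompts stay in sync with code.
    return (
        f"Reject posts shorter than {cfg.min_word_count} words "
        f"or with fewer than {cfg.min_paragraphs} paragraphs."
    )
```

Because the dataclass is frozen, callers customize values by constructing a new `ValidationConfig(...)` rather than mutating globals, which keeps overrides explicit and testable.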
- Error Handling and Logging
Several `except Exception` blocks could hide bugs; there is insufficient distinction between scraper, validation, and API errors.
Debug print statements and logs could expose sensitive data or cause log noise in production.
Proposed Solution:
Refactor error handling for granularity and clarity, log actionable (but sanitized) details, and use proper log levels. Remove any direct print statements.
- Prompt & Template Handling, JSON Parsing
Prompt templates are loaded from a fixed path without robust fallback.
Ad-hoc regex parsing of LLM JSON responses is brittle and fails silently on unexpected formats.
Proposed Solution:
Implement utility functions for safe prompt/template loading and LLM JSON extraction/parsing. Provide error or fallback messages for missing templates; thoroughly test these utilities.
- Data Quality and Schema Consistency
Validation and filtering logic is duplicated. There is inconsistency in schema documentation for output data, especially on error paths.
Proposed Solution:
Move post quality validation into reusable functions and ensure all output structures are consistently documented and enforced.
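A possible shape for such a reusable validator, with a result dict that is identical on success and failure paths (the schema and thresholds are illustrative, not the repo's actual ones):

```python
def validate_post(post: dict, min_words: int = 300, min_paragraphs: int = 2) -> dict:
    """Check post quality against shared thresholds.

    Output schema (same keys on success and failure):
        valid   (bool)      -- overall pass/fail
        reasons (list[str]) -- empty when valid
        post    (dict)      -- the input, passed through unchanged
    """
    reasons = []
    body = post.get("body", "")
    if len(body.split()) < min_words:
        reasons.append(f"fewer than {min_words} words")
    # Paragraphs are assumed here to be separated by blank lines.
    if body.count("\n\n") + 1 < min_paragraphs:
        reasons.append(f"fewer than {min_paragraphs} paragraphs")
    return {"valid": not reasons, "reasons": reasons, "post": post}
```

Because the result shape never varies, downstream code and docstrings can document one schema instead of branching on error paths.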
- Testing: Fallbacks and Edge Cases
Current test suite does not mock Gemini, scraper, or search client failures. Fallback scenarios and error paths are undertested.
Proposed Solution:
Add unit and integration tests to cover all major edge/failure cases; use fixtures to simulate external dependency issues.
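A minimal example of the kind of test this could add, using `unittest.mock` to simulate a model client failure. The function and fixture names are hypothetical; the real tests would target the actual Gemini, scraper, and search client wrappers:

```python
from unittest import mock


def summarize_with_fallback(client, text: str) -> str:
    """Return the model summary, or a truncated fallback if the API fails."""
    try:
        return client.generate(text)
    except Exception:
        # Deterministic fallback path, easy to assert on in tests.
        return text[:100]


def test_falls_back_when_model_errors():
    failing_client = mock.Mock()
    failing_client.generate.side_effect = RuntimeError("quota exceeded")
    result = summarize_with_fallback(failing_client, "some post body")
    assert result == "some post body"
    failing_client.generate.assert_called_once()
```

Using `side_effect` on a `Mock` exercises the error branch without any network access, so fallback behavior stays covered in CI.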
Acceptance Criteria
- All thresholds and prompt parameters are centralized and customizable
- Exception handling is granular, with improved log messages and no sensitive data exposure or print statements
- Prompt/template loading and model response parsing use robust shared utilities
- Output dict/schema is consistent, with clear docstrings (success and failure cases)
- Tests cover key error/fallback flows and edge cases
I am a GSSoC'25 contributor and would like to take up this issue. Please assign it to me!