[REFACTOR] Robustness, Configurability, and Error Handling Improvements in Content Pipeline #42

@VasuGx

Category: Enhancement / Refactor

Overview
This issue addresses several maintainability and code quality concerns across the clean_validate.py and search_top_posts.py modules. The goal is to make the content pipeline more robust, configurable, and easier to debug or extend.

Problem Areas

  1. Hardcoded Thresholds & Magic Numbers
    Minimum word count, paragraph requirements, and other constants are scattered and hardcoded in logic and prompts.

Proposed Solution:

Move all such values to a shared configuration or constants module, and refer to them symbolically in code and prompts. Update documentation to show how these values can be customized.
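A minimal sketch of what such a shared constants module could look like. The names (PipelineConfig, PIPELINE_MIN_WORD_COUNT, render_prompt) and the default values are assumptions for illustration, not the values currently used in clean_validate.py or search_top_posts.py.

```python
import os
from dataclasses import asdict, dataclass

# Placeholder defaults for illustration only; the real thresholds
# would be copied over from the existing modules.
DEFAULT_MIN_WORD_COUNT = 300
DEFAULT_MIN_PARAGRAPHS = 3


@dataclass(frozen=True)
class PipelineConfig:
    """Single source of truth for thresholds used in code and prompts."""

    min_word_count: int = DEFAULT_MIN_WORD_COUNT
    min_paragraphs: int = DEFAULT_MIN_PARAGRAPHS

    @classmethod
    def from_env(cls, env=None):
        """Build a config, letting (hypothetical) env vars override defaults."""
        env = os.environ if env is None else env
        return cls(
            min_word_count=int(env.get("PIPELINE_MIN_WORD_COUNT", DEFAULT_MIN_WORD_COUNT)),
            min_paragraphs=int(env.get("PIPELINE_MIN_PARAGRAPHS", DEFAULT_MIN_PARAGRAPHS)),
        )


# The prompt refers to thresholds by name instead of repeating the numbers.
PROMPT_TEMPLATE = (
    "Reject posts with fewer than {min_word_count} words "
    "or {min_paragraphs} paragraphs."
)


def render_prompt(config):
    """Fill the prompt template from the config fields."""
    return PROMPT_TEMPLATE.format(**asdict(config))
```

With this shape, changing a threshold in one place updates both the validation logic and the prompt text.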

  2. Error Handling and Logging
    Several broad except Exception blocks could hide bugs, and scraper, validation, and API errors are not clearly distinguished.

Debug print statements and logs could expose sensitive data or cause log noise in production.

Proposed Solution:

Refactor error handling for granularity and clarity, log actionable (but sanitized) details, and use proper log levels. Remove any direct print statements.
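One possible shape for the granular handling described above, as a sketch. The exception classes, the sanitize helper, and process_post are hypothetical names, and the scrape and validate callables stand in for the real pipeline functions.

```python
import logging
import re

logger = logging.getLogger("content_pipeline")


class PipelineError(Exception):
    """Base class for expected pipeline failures (hypothetical hierarchy)."""


class ScraperError(PipelineError):
    pass


class ValidationError(PipelineError):
    pass


class ApiError(PipelineError):
    pass


# Masks values of query-style secrets such as api_key=..., token=..., secret=...
_SECRET_RE = re.compile(r"(?i)\b(api[_-]?key|token|secret)=([^\s&]+)")


def sanitize(text):
    """Replace secret values with *** before the text reaches a log line."""
    return _SECRET_RE.sub(lambda m: f"{m.group(1)}=***", text)


def process_post(url, scrape, validate):
    """Run scrape then validate, reporting which stage failed.

    Only the expected error types are caught; anything else propagates
    so that genuine bugs are not hidden.
    """
    try:
        raw = scrape(url)
        return {"ok": True, "stage": None, "data": validate(raw)}
    except ScraperError as exc:
        logger.warning("Scrape failed for %s: %s", sanitize(url), sanitize(str(exc)))
        return {"ok": False, "stage": "scrape", "data": None}
    except ValidationError as exc:
        logger.info("Post rejected for %s: %s", sanitize(url), sanitize(str(exc)))
        return {"ok": False, "stage": "validate", "data": None}
    except ApiError as exc:
        logger.error("API call failed for %s: %s", sanitize(url), sanitize(str(exc)))
        return {"ok": False, "stage": "api", "data": None}
```

Using a named logger with levels instead of print lets production deployments filter noise through logging configuration rather than code changes.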

  3. Prompt & Template Handling, JSON Parsing
    Prompt templates are loaded from a fixed path without robust fallback.

Ad-hoc, regex-based parsing of LLM JSON responses is brittle.

Proposed Solution:

Implement utility functions for safe prompt/template loading and LLM JSON extraction/parsing. Provide error or fallback messages for missing templates; thoroughly test these utilities.
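A sketch of the two utilities, under the assumption that model responses contain at most one top-level JSON object, possibly wrapped in prose or a code fence. The function names load_prompt and extract_json_object are illustrative, not existing helpers in the repository.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger("content_pipeline")


def load_prompt(path, fallback=None):
    """Read a prompt template, using the fallback text if the file is unreadable."""
    try:
        return Path(path).read_text(encoding="utf-8")
    except OSError as exc:
        if fallback is None:
            raise FileNotFoundError(f"Prompt template not readable: {path}") from exc
        logger.warning("Prompt template %s not readable, using fallback", path)
        return fallback


def extract_json_object(text):
    """Return the first JSON object embedded in text, or None if there is none.

    Tries a real JSON decode at every '{' instead of matching with a regex,
    so surrounding prose or Markdown fences do not matter.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = text.find("{", start + 1)
    return None
```

Returning None rather than raising leaves the decision about a fallback to the caller, which keeps the parsing utility easy to test in isolation.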

  4. Data Quality and Schema Consistency
    Validation and filtering logic is duplicated. The schema documentation for output data is inconsistent, especially on error paths.

Proposed Solution:

Move post quality validation into reusable functions and ensure all output structures are consistently documented and enforced.
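As an illustration, a single reusable validator could return the same keys on both success and failure. The ValidationResult schema, the validate_post name, and the default thresholds below are assumptions, not the current output format.

```python
import re
from typing import List, TypedDict


class ValidationResult(TypedDict):
    """Same keys on every path, so callers never need to probe for fields."""

    valid: bool
    word_count: int
    paragraph_count: int
    errors: List[str]


def validate_post(text, min_words=300, min_paragraphs=3) -> ValidationResult:
    """Check a post against word and paragraph thresholds.

    Paragraphs are taken to be blocks separated by blank lines.
    The default thresholds are placeholders for illustration.
    """
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    word_count = len(text.split())
    errors = []
    if word_count < min_words:
        errors.append(f"word_count {word_count} < {min_words}")
    if len(paragraphs) < min_paragraphs:
        errors.append(f"paragraph_count {len(paragraphs)} < {min_paragraphs}")
    return {
        "valid": not errors,
        "word_count": word_count,
        "paragraph_count": len(paragraphs),
        "errors": errors,
    }
```

Because failures are reported through the errors list rather than a different return shape, the docstring describes one schema for all cases.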

  5. Testing: Fallbacks and Edge Cases
    The current test suite does not mock Gemini, scraper, or search client failures. Fallback scenarios and error paths are undertested.

Proposed Solution:

Add unit and integration tests to cover all major edge/failure cases; use fixtures to simulate external dependency issues.
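A sketch of how an external failure could be simulated with the standard library's unittest.mock. The summarize_with_fallback function and the client's generate method are hypothetical stand-ins for the real pipeline code and Gemini client, which may look different.

```python
import unittest
from unittest.mock import Mock


def summarize_with_fallback(client, text):
    """Hypothetical stand-in for pipeline code that calls an LLM client."""
    try:
        return {"summary": client.generate(text), "fallback": False}
    except TimeoutError:
        # Fallback path: return a truncated copy of the input instead.
        return {"summary": text[:50], "fallback": True}


class FallbackTests(unittest.TestCase):
    def test_timeout_uses_fallback(self):
        client = Mock()
        # Simulate the external dependency timing out.
        client.generate.side_effect = TimeoutError("simulated timeout")
        result = summarize_with_fallback(client, "hello world")
        self.assertTrue(result["fallback"])
        self.assertEqual(result["summary"], "hello world")
        client.generate.assert_called_once_with("hello world")

    def test_success_returns_model_output(self):
        client = Mock()
        client.generate.return_value = "short summary"
        result = summarize_with_fallback(client, "hello world")
        self.assertEqual(result, {"summary": "short summary", "fallback": False})
```

Saved in a test module, this can be run with python -m unittest; the same pattern applies to scraper and search client failures by swapping the mocked object and the exception type.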

Acceptance Criteria

  • All thresholds and prompt parameters are centralized and customizable
  • Exception handling is granular, with improved log messages and no sensitive data exposure or print statements
  • Prompt/template loading and model response parsing use robust shared utilities
  • Output dict/schema is consistent, with clear docstrings (success and failure cases)
  • Tests cover key error/fallback flows and edge cases

I am a GSSoC'25 contributor and would like to take up this issue. Please assign it to me!
