Improve summarize function with enhanced processing #3

Open
Jerry-145 wants to merge 1 commit into Bracnch-1 from Jerry-145-patch-1

Conversation

@Jerry-145
Owner

Team Number : Team 161

Description

This PR significantly improves the /summarize endpoint to generate more comprehensive, structured, and higher-quality summaries.

Previously, summarization:

  • Retrieved a limited number of chunks (k=6)
  • Used a single-pass generation approach
  • Produced generic or incomplete summaries
  • Lacked length control
  • Did not enforce structured output

This update introduces a multi-stage hybrid summarization pipeline to improve coverage, structure, and flexibility.

Related Issue

Closes #9

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Code refactoring
  • Performance improvement
  • Style/UI improvement

Changes Made

  • Increased retrieval depth for summarization from k=6 to k=12
  • Implemented Extractive + Abstractive hybrid summarization
    • Stage 1: Extract key factual points
    • Stage 2: Generate structured final summary
  • Added summary length control (short, medium, long)
  • Introduced structured output format:
    • Executive Summary
    • Key Findings
    • Conclusion
  • Added basic validation to prevent incomplete summaries
  • Improved prompt engineering for better factual grounding
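
The two stages listed above could be wired together roughly as follows. This is a minimal sketch, not the code in this PR: the `generate` callable stands in for whatever model call main.py actually uses, and the function names and prompt wording are illustrative assumptions.

```python
from typing import Callable, List

# Section names come from the PR description; everything else is assumed.
SECTION_HEADINGS = ["Executive Summary", "Key Findings", "Conclusion"]


def build_extract_prompt(chunks: List[str], max_points: int = 10) -> str:
    """Stage 1 prompt: ask for key factual points drawn only from the chunks."""
    context = "\n\n".join(chunks)
    return (
        f"Extract up to {max_points} key factual points from the text below. "
        "Use only facts stated in the text, one point per line.\n\n"
        f"{context}"
    )


def build_summary_prompt(key_points: List[str]) -> str:
    """Stage 2 prompt: ask for a structured summary built from the key points."""
    points = "\n".join(f"- {p}" for p in key_points)
    headings = ", ".join(SECTION_HEADINGS)
    return (
        f"Write a summary with these sections in order: {headings}. "
        "Base it only on the key points below.\n\n"
        f"{points}"
    )


def parse_points(raw: str) -> List[str]:
    """Split model output into points, dropping blank lines and bullet markers."""
    points = []
    for line in raw.splitlines():
        cleaned = line.strip().lstrip("-*• ").strip()
        if cleaned:
            points.append(cleaned)
    return points


def hybrid_summarize(chunks: List[str], generate: Callable[[str], str]) -> str:
    """Run the extractive stage, then the abstractive stage."""
    key_points = parse_points(generate(build_extract_prompt(chunks)))
    return generate(build_summary_prompt(key_points))
```

Passing the generator in as a parameter keeps the pipeline testable with a stub and independent of any particular model library.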

Screenshots (if applicable)

Testing

  • Tested /summarize endpoint with short, medium, and long length modes
  • Verified improved structure (Executive Summary, Key Findings, Conclusion)
  • Tested with small and large PDFs
  • No runtime errors during API execution
  • Output validated for completeness and structure
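
A structure check of the kind described in the last two items could be automated along these lines. This is a hedged sketch under the assumption that the three section names appear verbatim in the output; the helper names are hypothetical and not taken from the PR.

```python
from typing import List

# Section names as listed in this PR; matching is case-insensitive.
REQUIRED_SECTIONS = ["Executive Summary", "Key Findings", "Conclusion"]


def missing_sections(summary: str) -> List[str]:
    """Return the required section names that do not appear in the summary."""
    lowered = summary.lower()
    return [name for name in REQUIRED_SECTIONS if name.lower() not in lowered]


def sections_in_order(summary: str) -> bool:
    """True if every required section is present and appears in the listed order."""
    lowered = summary.lower()
    positions = [lowered.find(name.lower()) for name in REQUIRED_SECTIONS]
    return all(pos >= 0 for pos in positions) and positions == sorted(positions)
```

A test suite could call both helpers on the response from each length mode.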

Checklist

  • My code follows the project's code style guidelines
  • I have performed a self-review of my code
  • I have commented my code where necessary
  • My changes generate no new warnings
  • I have tested my changes thoroughly
  • I have read and followed the CONTRIBUTING.md guidelines

Additional Notes

This update enhances summary quality without modifying:

  • Session handling
  • QA endpoint logic
  • Compare endpoint
  • Frontend behavior

The improvement is fully backward compatible and does not introduce breaking API changes.

Improved the /summarize endpoint to generate more structured, comprehensive, and high-quality document summaries.

Key Enhancements:

1. Increased retrieval depth:
   - Expanded similarity search from k=6 to k=12 to improve document coverage.
   - Ensures more relevant context is available before summarization.

2. Implemented hybrid summarization pipeline:
   - Stage 1 (Extractive): Extract top 10 key factual points from retrieved chunks.
   - Stage 2 (Abstractive): Generate a structured summary using extracted key points.

3. Added summary length control:
   - Supports short, medium (default), and long summary modes.
   - Dynamically adjusts generation token limits and bullet count instructions.

4. Enforced structured output format:
   - Executive Summary
   - Key Findings
   - Conclusion

5. Added basic completeness validation:
   - Ensures summary is not overly short.
   - Appends fallback note if document has limited summarizable content.
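
Items 3 and 5 could look something like the sketch below. The token limits, bullet counts, word threshold, and fallback wording are illustrative assumptions, since the commit message does not state the actual values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LengthProfile:
    max_new_tokens: int  # generation token limit for this mode
    bullet_count: int    # number of Key Findings bullets to request


# Assumed values for illustration only.
LENGTH_PROFILES = {
    "short": LengthProfile(max_new_tokens=150, bullet_count=3),
    "medium": LengthProfile(max_new_tokens=300, bullet_count=5),
    "long": LengthProfile(max_new_tokens=600, bullet_count=8),
}

MIN_SUMMARY_WORDS = 30  # assumed threshold for "overly short"
FALLBACK_NOTE = "Note: the document appears to contain limited summarizable content."


def resolve_length(length: Optional[str] = None) -> LengthProfile:
    """Map a requested mode to its profile; a missing value falls back to medium."""
    key = (length or "medium").strip().lower()
    if key not in LENGTH_PROFILES:
        raise ValueError(f"length must be one of {sorted(LENGTH_PROFILES)}")
    return LENGTH_PROFILES[key]


def ensure_complete(summary: str) -> str:
    """Append the fallback note when the summary is shorter than the threshold."""
    text = summary.strip()
    if len(text.split()) < MIN_SUMMARY_WORDS:
        return f"{text}\n\n{FALLBACK_NOTE}" if text else FALLBACK_NOTE
    return text
```

Defaulting a missing `length` to medium is what would keep existing callers working unchanged, which matches the backward-compatibility claim.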

This update significantly improves summary clarity, structure, and flexibility while remaining fully backward compatible.
No changes were made to session handling, /ask endpoint, /compare endpoint, or frontend logic.
@Jerry-145
Owner Author

Hi Maintainers 👋

While working on the /summarize improvements, I noticed a few structural issues in main.py that may cause runtime or maintenance problems:

  • Duplicate sessions dictionary definitions
  • Duplicate cleanup_expired_sessions() function definitions
  • Duplicate CompareRequest class definitions
  • Some incomplete or misplaced blocks (e.g., partially defined helper functions and indentation issues)
  • Missing imports like time, Path, and torch in certain sections
  • Minor variable inconsistencies (e.g., is_enc vs is_encoder_decoder, device not defined before use)

These may lead to unexpected behavior or errors during execution.
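
For the first two items, a consolidated version might keep one module-level registry and one cleanup function, as in this sketch. The `last_accessed` field and the TTL value are assumptions for illustration; the real session structure in main.py is not shown in this PR.

```python
import time
from typing import Any, Dict, Optional

SESSION_TTL_SECONDS = 3600  # assumed expiry window

# Single registry, defined once at module level.
sessions: Dict[str, Dict[str, Any]] = {}


def touch_session(session_id: str, now: Optional[float] = None) -> None:
    """Create the session if needed and record the access time."""
    current = time.time() if now is None else now
    sessions.setdefault(session_id, {})["last_accessed"] = current


def cleanup_expired_sessions(now: Optional[float] = None) -> int:
    """Delete sessions idle for longer than the TTL; return how many were removed."""
    current = time.time() if now is None else now
    expired = [
        sid
        for sid, data in sessions.items()
        if current - data.get("last_accessed", 0.0) > SESSION_TTL_SECONDS
    ]
    for sid in expired:
        del sessions[sid]
    return len(expired)
```

In Python a later definition silently replaces an earlier one with the same name, so the duplicates mean that only the last `sessions` dictionary and the last `cleanup_expired_sessions()` are in effect once the module has loaded.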

If this cleanup/refactor is still pending, I’d be happy to take it up and submit a structured fix to improve maintainability and stability.

Please let me know if I can work on this 🙌

— Team 161

