@nick-galluzzo
Owner

This PR improves the commit message evaluation system by refining scoring criteria, enhancing validation test cases, and adding robust WHY context enhancement capabilities.

These changes make the evaluation more accurate, consistent, and comprehensive.

Key Features

Why (External) Context Processing

  • Users can now supply external context to the generator to improve the WHY score of generated commit messages.

Enhanced Validation Suite

  • New Test Cases: Added comprehensive validation cases for documentation, feature implementations, and security fixes
  • Quality Categorization: Refactored validation cases with better categorization across quality levels:
    • Excellent (4.5-5.0) - Security fixes with business impact, performance optimizations with metrics
    • Good (3.5-4.4) - Clear feature implementations, bug fixes with context
    • Average (2.5-3.4) - Basic changes with minimal context
    • Poor (1.5-2.4) - Unclear or incomplete descriptions
    • Very Poor (0.5-1.4) - Misleading or uninformative messages
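As a rough illustration of the quality bands above, the following sketch maps a numeric score to its label. The function name `quality_level` and the `QUALITY_BANDS` table are hypothetical and not taken from the repository. The listed ranges leave small gaps (for example between 4.4 and 4.5), so this sketch assumes a score belongs to the highest band whose lower bound it meets, and it returns `None` for scores outside 0.5-5.0.

```python
from typing import Optional

# Lower bounds come from the ranges listed in the PR description.
# The names and the out-of-range handling are assumptions for illustration.
QUALITY_BANDS = [
    (4.5, "Excellent"),
    (3.5, "Good"),
    (2.5, "Average"),
    (1.5, "Poor"),
    (0.5, "Very Poor"),
]


def quality_level(score: float) -> Optional[str]:
    """Return the quality label for a score, or None if outside 0.5-5.0."""
    if not 0.5 <= score <= 5.0:
        return None
    # Bands are ordered from highest to lowest, so the first match wins.
    for lower_bound, label in QUALITY_BANDS:
        if score >= lower_bound:
            return label
    return None
```

With this assumption, `quality_level(4.45)` falls into the Good band because it does not reach the 4.5 lower bound of Excellent.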

Improved WHAT/WHY Scoring System

  • Clearer Guidelines: Enhanced evaluation criteria with specific examples and scoring rationale
  • Comprehensive Examples: Added detailed examples showing different score ranges with reasoning
  • Chain-of-Thought Evaluation: Implemented structured evaluation process for more consistent scoring

WHY Context Enhancement

  • Enum-Based Classification: Introduced ContextQuality enum for better categorization:
    • GOOD - Adds meaningful user/business context
    • BAD - Irrelevant or confusing information
    • TECHNICAL - Implementation details already clear from diff
    • REDUNDANT - Repeats information from commit message
  • Lenient Scoring Guidance: Added specific guidance for scoring low-impact changes appropriately
  • Enhanced Decision Logic: Improved criteria for when to enhance messages with external context
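A minimal sketch of what the `ContextQuality` enum might look like, based only on the four member names listed above. The string values and the `should_enhance` helper are assumptions for illustration; in particular, treating `GOOD` as the only category that triggers enhancement is a guess at the decision logic, not something this description confirms.

```python
from enum import Enum


class ContextQuality(Enum):
    """Classification of external WHY context (member names from the PR)."""

    GOOD = "good"            # adds meaningful user/business context
    BAD = "bad"              # irrelevant or confusing information
    TECHNICAL = "technical"  # implementation details already clear from diff
    REDUNDANT = "redundant"  # repeats information from commit message


def should_enhance(quality: ContextQuality) -> bool:
    """Hypothetical rule: only GOOD context is worth merging into the message."""
    return quality is ContextQuality.GOOD
```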

Service Logic Improvements

  • Fixed Enhancement Flow: Corrected generation service to properly return enhanced results when WHY context is provided
  • Separation of Concerns: Better handling of WHY context enhancement as separate step to reduce noise in initial generation

Enhanced Reporting

  • WHAT/WHY Score Breakdown: Detailed reporting of individual scoring components

Impact

This enhancement improves the accuracy and consistency of commit message evaluation, providing:

  • More reliable scoring across different types of changes
  • Better guidance for when and how to enhance commit messages with context
  • Clearer evaluation criteria that align with industry best practices
  • Enhanced user experience with detailed score breakdowns and reasoning

The changes maintain backward compatibility while providing a more robust foundation for commit message quality assessment.

…rnal why context

- Introduce new prompt template for integrating external business/contextual information
- Add method to enhance existing commit messages with why context
- Modify service layer to accept optional why_context parameter
- Separate why context handling from initial commit message generation to reduce noise and improve focus

This change addresses inconsistent "why" scores in commit messages by implementing a two-stage prompting approach. Previously, the model would prioritize the core diff over external context when both
were provided in a single prompt, resulting in strong "what" descriptions but weak "whys." The solution separates the concerns: first generating a preliminary message from the diff, then enhancing it
with a second prompt that specifically focuses on integrating the "why" context. This approach improves accuracy and reduces noise, aligning with RAG patterns for better context-aware commit messages.
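
The two-stage approach described above can be sketched roughly as follows. This is not the project's code: `generate_commit_message`, the `complete` callable (standing in for whatever model client is used), and the prompt wording are all hypothetical. Only the `why_context` parameter name and the `<ORIGINAL_COMMIT_MESSAGE>` / `<EXTERNAL_CONTEXT>` tags appear in this PR.

```python
from typing import Callable, Optional


def generate_commit_message(
    diff: str,
    complete: Callable[[str], str],
    why_context: Optional[str] = None,
) -> str:
    """Generate from the diff first, then optionally enhance with WHY context."""
    # Stage 1: the model sees only the diff, so the WHAT stays focused.
    preliminary = complete(
        f"Write a conventional commit message for this diff:\n{diff}"
    )

    # Without usable context there is nothing to enhance.
    if not why_context or not why_context.strip():
        return preliminary

    # Stage 2: a separate prompt that focuses only on integrating the WHY.
    enhance_prompt = (
        "Improve the WHY of this commit message using the external context. "
        "Do not invent information.\n"
        f"<ORIGINAL_COMMIT_MESSAGE>\n{preliminary}\n</ORIGINAL_COMMIT_MESSAGE>\n"
        f"<EXTERNAL_CONTEXT>\n{why_context}\n</EXTERNAL_CONTEXT>"
    )
    return complete(enhance_prompt)
```

Keeping the external context out of the first prompt is the point of the design: the diff is not competing with the context for the model's attention in either call.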
…mprehensive tests

- Modify CommitMessageGenerator to use message content instead of full object when enhancing with why context
- Add test coverage for why context enhancement including success, empty context, and error scenarios
- Extend GenerationService to support why context integration in commit message generation
- Add service layer tests for commit message generation with and without why context
- Improve error handling for AI enhancement failures

The previous implementation passed the full result object instead of just the message content, causing unnecessary data transfer and potential serialization issues. This change ensures only the relevant message
content is used for AI enhancement, improving efficiency and reducing complexity. Additionally, comprehensive tests were added to validate the new why context feature works correctly across different scenarios
including success cases, empty contexts, and error conditions. This provides confidence in the feature's reliability and makes future maintenance easier.
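
A hedged sketch of the fix described above: the enhancer receives only the message text, not the whole result object. `GenerationResult`, `enhance_with_why`, and the `enhance` callable are hypothetical names. Falling back to the preliminary result on empty context or on an enhancement failure is an assumption about how the success, empty-context, and error scenarios might be handled.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GenerationResult:
    """Hypothetical result object; only `message` is sent to the enhancer."""

    message: str
    model: str = "unknown"


def enhance_with_why(
    result: GenerationResult,
    why_context: str,
    enhance: Callable[[str, str], str],
) -> GenerationResult:
    """Enhance result.message with WHY context, keeping the original on failure."""
    if not why_context.strip():
        return result  # empty context: nothing to add
    try:
        # Pass the message content only, never the full result object.
        enhanced = enhance(result.message, why_context)
    except Exception:
        return result  # assumed fallback: keep the preliminary message
    return GenerationResult(message=enhanced, model=result.model)
```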
…ontext in commit messages

- Specify that preliminary message content should not be repeated
- Add criteria for including WHY context (relevance, helpfulness, impact level)
- Refine instruction to emphasize problem-solution-benefit structure
- Maintain focus on concise, conventional commit format output

The previous WHY context was too verbose and unfocused, leading to less effective commit message generation. This change streamlines the guidance to focus on impact and conciseness. By clearly stating the problem,
solution, and benefit, developers can generate more precise and useful commit messages that improve code maintainability and collaboration.
- Add / option to  command
- Pass context to
- Improve commit message generation with additional context

This enables users to provide custom context for better commit message generation, solving the issue of generic or irrelevant commit messages. By allowing users to pass context, the generated commit messages become
more accurate and meaningful, leading to better code documentation and easier maintenance.
…ples

- Add detailed scoring rubrics for WHAT and WHY components
- Include comprehensive examples covering excellent to poor quality messages
- Improve prompt logic to better distinguish valuable context from technical noise
- Update validation cases with more realistic scenarios and business impact descriptions
- Refine evaluation thresholds and criteria descriptions for consistency

This change improves the accuracy of commit message quality assessment by providing clearer guidelines and more
representative test cases.
…and instructions

- Update evaluation criteria to reference <ORIGINAL_COMMIT_MESSAGE> and <EXTERNAL_CONTEXT> placeholders
- Clarify definition of WHY in commit messages
- Improve instructions for handling external context
- Remove outdated examples that were causing confusion
- Focus on preserving original commit message when context doesn't add value
- Emphasize not making up information not present in provided context
- Streamline prompt to reduce redundancy and improve clarity

External context: Improved why context decision accuracy from 57.1% to 71.4% in benchmarks. This change correctly skips enhancement for cases like test_coverage_context that shouldn't be processed, marking the
first successful implementation of this behavior.
…prompt

Some models weren't returning JSON responses consistently. Explicitly stating the JSON requirement ensures all models
comply, addressing compatibility issues across different model providers. This change prevents parsing errors and ensures reliable evaluation results.
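
One way to picture this change is a prompt builder that always appends an explicit JSON instruction, paired with a parser that fails clearly when a model ignores it. The instruction wording, the function names, and the `what_score` / `why_score` keys are assumptions for illustration, not the project's actual prompt or schema.

```python
import json

# Hypothetical instruction text; the real prompt wording is not shown in this PR.
JSON_INSTRUCTION = (
    "Respond with a single JSON object only, with no text before or after it."
)


def build_evaluation_prompt(base_prompt: str) -> str:
    """Append the explicit JSON requirement to an evaluation prompt."""
    return f"{base_prompt.rstrip()}\n\n{JSON_INSTRUCTION}"


def parse_evaluation(raw: str) -> dict:
    """Parse a model reply, raising ValueError if it is not a JSON object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("model reply was not valid JSON") from exc
    if not isinstance(data, dict):
        raise ValueError("model reply was JSON but not an object")
    return data
```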
… suite

Adds a benchmarking suite to evaluate the effectiveness of WHY context enhancement in commit messages, ensuring it improves the WHY score without introducing noise or redundant technical details. Currently achieves a 75% success rate across most models, with areas for improvement identified in simple and average bug fix cases.
…ion output

Fixes issue where entire reasoning chain was being sent to commit message generation, causing verbose and confusing output. This change ensures only the essential prompt instructions are used, improving the quality and relevance of generated commit messages.
… truthfulness

- Add requirement for messages to focus on accuracy, validity, and truthfulness
- Update evaluation criteria to score 1 for WHAT/WHY when messages are untruthful or inaccurate
- Clarify that score 1 applies when changes are misrepresented, omitted, or described inaccurately
- Improve prompt guidance to ensure high-quality, honest commit message assessment
…hmarks

- Update validation suite with new test cases for documentation, feature implementations, and security fixes
- Refactor validation cases to better categorize quality levels (good, average, poor, very poor)
- Improve evaluation criteria for WHAT/WHY scoring with clearer guidelines
- Add WHY context guidance for lenient scoring of low-impact changes
- Update benchmark suite to use enum-based context quality classification
- Enhance result reporting with WHAT/WHY score breakdown
- Fix service logic to properly return enhanced results when why_context is provided
No functional changes; improves readability and conciseness of benchmark documentation.
@nick-galluzzo nick-galluzzo merged commit 303a53a into main Aug 7, 2025
3 checks passed
@nick-galluzzo nick-galluzzo deleted the feat/add-why-context-enhancement branch August 7, 2025 06:54