Comprehensive Chat Agent Overhaul, Testing Consolidation, and Evaluation Methodology#177
Merged
Enhance the agentic chat subsystem by refining the tool execution history metadata and resolving critical bugs identified during quality evaluation.

Key Improvements:
- Update the `ChatIntermediateAction` schema to capture `tool_name` and the raw `tool_result`, enabling a high-fidelity audit trail for AI reasoning.
- Fix a shadowing bug in the discovery executor where the `date` argument conflicted with the `date` class import, crashing trip searches.
- Resolve a missing argument in the weather enrichment pipeline by correctly passing `intent_location` to the enricher function.
- Prevent fuzzy site resolution from blocking subsequent text filters by preserving the location parameter after coordinate resolution.
- Enforce sensible PPO2 defaults in tool schemas and system prompts to prevent unnecessary LLM clarification loops.

Testing & Reliability:
- Introduce `test_chat_agent_integration.py` to verify the wiring between LLM tool calls and Python backend logic.
- Introduce `test_chat_agent_comprehensive.py` to validate complex edge cases, including fuzzy name resolution and physics calculations.
- Update the `ENTITY_ICONS` imports in the base executor to resolve `NameError`s.
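The `date` shadowing bug mentioned above is a classic Python pitfall. A minimal, self-contained sketch (function names are illustrative, not the actual Divemap executor code):

```python
from datetime import date

def search_trips_buggy(date=None):
    # The parameter `date` shadows the imported `date` class, so any later
    # use of the class inside this function hits the argument instead.
    try:
        return date.today()  # `date` is the argument (e.g. None) here!
    except AttributeError:
        return "crash"

def search_trips_fixed(trip_date=None):
    # Renaming the parameter restores access to the imported class.
    return trip_date or date.today()

print(search_trips_buggy())                    # -> "crash" with the default None
print(isinstance(search_trips_fixed(), date))  # -> True
```

The fix in this PR follows the same pattern: rename the tool argument so it no longer collides with the imported class.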
Overhaul the chat testing architecture to reduce fragmentation and improve maintainability. Merged 11 scattered test files into 3 logically organized primary files that align with the system's modular design.

Key Changes:
- Create `test_chat_agent.py`: focused on the ReAct loop, tool calling logic, context resolution, and fuzzy location name mapping.
- Create `test_chat_executors.py`: focused on backend capability logic, including spatial bounding boxes, directions, ratings, and physics.
- Update `test_chat_api.py`: maintained as the high-level REST endpoint and session management validation suite.
- Remove 9 redundant and overlapping test files to eliminate clutter.
- Fix data-dependency bugs in recommendation fixtures to ensure reliable test execution in isolated environments.

This reorganization provides a clear map for future test development and ensures 70%+ coverage on critical chat service components.
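To illustrate the kind of spatial check that now lives in `test_chat_executors.py`, here is a hedged sketch; `bounding_box_around` is a hypothetical helper, not the real Divemap implementation:

```python
import math

def bounding_box_around(lat, lon, radius_km):
    """Return (min_lat, min_lon, max_lat, max_lon) around a point."""
    dlat = radius_km / 111.0  # ~111 km per degree of latitude
    dlon = radius_km / (111.0 * math.cos(math.radians(lat)))
    return (lat - dlat, lon - dlon, lat + dlat, lon + dlon)

def test_bounding_box_contains_center():
    # A box around a point must always contain that point.
    min_lat, min_lon, max_lat, max_lon = bounding_box_around(37.9, 23.7, 50)
    assert min_lat < 37.9 < max_lat
    assert min_lon < 23.7 < max_lon

test_bounding_box_contains_center()
print("ok")
```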
- Add `calculate_distance` using the Haversine formula to compute a dynamic search radius based on bounding box size, replacing the hardcoded 100km fallback.
- Update the system prompt in `chat_service.py` to prevent LLM coordinate hallucinations for regions/cities, enforcing reliance on Nominatim.
- Resolve an `UnboundLocalError` in `discovery.py` by promoting geocoding imports to the module level.
- Introduce `SearchGearRentalTool` to handle specific gear rental intents and filter fallback diving centers strictly by `GearRentalCost` existence.
- Refine `CAREER_PATH` execution with regex tokenization and stop words to accurately extract certification entities.
- Enhance `COMPARISON` intent sorting to prioritize exact word matches, resolving overlapping mock data issues in `test_comparison_logic`.
- Increase data limits for discovery and comparison intents to provide the LLM with a denser context window.
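A minimal sketch of the Haversine distance and dynamic-radius idea described above. The exact signatures in `geo_utils.py` are assumptions; `radius_from_bbox` is an illustrative name for the clamping step (the 5km/200km bounds come from this PR's description):

```python
import math

def calculate_distance(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two WGS84 points (Haversine)."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def radius_from_bbox(south, north, west, east):
    # Scale the search radius to the bounding box diagonal, clamped so tiny
    # villages and huge regions both land in a sane range (5km to 200km).
    diagonal = calculate_distance(south, west, north, east)
    return max(5.0, min(200.0, diagonal / 2))
```

The key design point: instead of one hardcoded fallback radius, the radius tracks how large Nominatim says the matched place actually is.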
Remove the generic `search_certifications` tool in favor of highly specific schemas (`compare_certifications`, `get_certification_path`, `get_dive_site_details`, and `search_diving_trips`) to eliminate LLM confusion and regex parsing in the backend.

Add `get_user_dive_logs` and `get_reviews_and_comments` tools, routing them to new, dedicated executor modules (`user_data.py` and `reviews.py`). This allows the LLM to analyze personal logbooks and community feedback while strictly respecting the global `disable_diving_center_reviews` privacy setting.

Fix the page context resolver to correctly map `dive_site.name` and inject rich physics metadata (depths, duration, serialized gas info) into the context window so the LLM can seamlessly perform SAC calculations on specific dive logs.
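A sketch of what a narrow, dedicated tool schema and its direct dispatch might look like. The parameter names and the executor body here are assumptions for illustration, not the exact Divemap schemas:

```python
# A deliberately narrow schema: the LLM fills one well-typed field instead
# of a free-form query that the backend would have to regex-parse.
compare_certifications_tool = {
    "type": "function",
    "function": {
        "name": "compare_certifications",
        "description": "Compare two or more diving certifications side by side.",
        "parameters": {
            "type": "object",
            "properties": {
                "certification_names": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Exact certification names to compare.",
                },
            },
            "required": ["certification_names"],
        },
    },
}

# With one tool per intent, the backend dispatches directly on the tool name.
EXECUTORS = {"compare_certifications": lambda args: sorted(args["certification_names"])}
result = EXECUTORS["compare_certifications"]({"certification_names": ["AOW", "OW"]})
print(result)  # -> ['AOW', 'OW']
```

This is the core trade made in this PR: more schemas, but each one unambiguous, so no backend string parsing is needed.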
- Add `analyze_chat_quality_diff.py` and `evaluate_qualitative.py` scripts to automate double-blind A/B testing of chatbot responses using an LLM as a judge, ensuring quantitative and qualitative regressions are caught.
- Update `evaluate_chat_quality.py` to fix the typo 'Athens' -> 'Attica' in the test prompt for gear rental validation.
- Add comprehensive Markdown documentation in `docs/development/chat_evaluation_methodology.md` establishing the standard operating procedure for running the new evaluation pipeline.
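The double-blind idea behind these scripts can be sketched in a few lines: the judge sees the two answers under anonymous labels in random order, so it cannot favor "the new system". Function and variable names here are illustrative, not taken from the actual scripts:

```python
import random

def blind_pair(answer_old, answer_new, rng=None):
    """Return (blinded answers keyed 'A'/'B', key mapping label -> origin)."""
    rng = rng or random.Random()
    pair = [(answer_old, "old"), (answer_new, "new")]
    rng.shuffle(pair)  # random order: label 'A' carries no information
    blinded = {label: text for label, (text, _) in zip("AB", pair)}
    key = {label: origin for label, (_, origin) in zip("AB", pair)}
    return blinded, key  # `key` is kept aside to unblind the verdict later

blinded, key = blind_pair("baseline answer", "candidate answer")
print(sorted(blinded))  # -> ['A', 'B']
```

The judge LLM is shown only `blinded`; the `key` is applied after the verdict to attribute wins to the old or new pipeline.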
Summary
This PR delivers a massive upgrade to the agentic chat subsystem, improving its accuracy, contextual awareness, and reliability. It transitions the LLM away from ambiguous, "catch-all" tools towards highly specific tool schemas with dedicated Python executors, resulting in a strict upgrade to chat quality (achieving a 100% pass rate on quantitative evaluations).
Additionally, this PR completely reorganizes the fragmented chat test suite for better maintainability, resolves several critical runtime bugs, and introduces a robust, double-blind LLM-as-a-judge evaluation methodology for future chat development.
Changes Made
🤖 Agent & Prompt Engineering Improvements
- Replaced the generic catch-all tool (`search_certifications`) with explicit, dedicated tools: `compare_certifications`, `get_certification_path`, `get_dive_site_details`, `search_diving_trips`, `get_user_dive_logs`, and `get_reviews_and_comments`.
- Enriched tool schemas with specific fields such as `max_depth`, `difficulty`, and `shore_direction`.
- Fixed the page context resolver to correctly map `dive_site.name` and inject rich physics metadata (depths, duration, serialized gas info) into the context window, enabling seamless SAC calculations on specific dive log pages.

🗺️ Dynamic Geocoding & Spatial Search
- Replaced the hardcoded 100km fallback with a dynamic Haversine-based radius (`calculate_distance` in `geo_utils.py`). The search radius now scales proportionally to the size of the Nominatim bounding box (clamped between 5km and 200km).
- Resolved an `UnboundLocalError` in `discovery.py` by moving the `get_empirical_region_bounds` and `get_external_region_bounds` imports to the module level.

🎯 Intent Extraction Refinements
- User Data & Reviews: Moved logic out of the `others.py` conditional block to dedicated executor modules (`user_data.py` and `reviews.py`), respecting global privacy settings like `disable_diving_center_reviews`.
- Gear Rental (`SearchGearRentalTool`): Created a dedicated tool schema. Updated the fallback logic to explicitly `join(GearRentalCost)` so only centers verified to offer rentals are recommended.
- Career Path (`CAREER_PATH`): Implemented regex-based tokenization and stop-word filtering to accurately extract specific certification entities.
- Comparison (`COMPARISON`): Improved the sorting heuristic to prioritize exact whole-word matches over partial substring matches. Increased the result cap from 10 to 20 to prevent real database seed data from crowding out exact matches.

🐛 Bug Fixes & Reliability
- Fixed a shadowing bug in the discovery executor where the `date` argument conflicted with the `date` class import, crashing trip searches.
- Resolved a missing argument in the weather enrichment pipeline by correctly passing `intent_location` to the enricher function.
- Updated the `ChatIntermediateAction` schema to capture `tool_name` and the raw `tool_result`, enabling a high-fidelity audit trail.

🧪 Testing & Evaluation Methodology
- Consolidated the fragmented chat test suite into 3 logically organized primary files (`test_chat_agent.py`, `test_chat_executors.py`, `test_chat_api.py`) that align with the system's modular architecture.
- Added the `analyze_chat_quality_diff.py` and `evaluate_qualitative.py` scripts to automate double-blind A/B testing of chatbot responses.
- Added `docs/development/chat_evaluation_methodology.md`, establishing the standard operating procedure for running the new evaluation pipeline.

Testing
- Ran the full suite (`./docker-test-github-actions.sh`). All tests pass (1430/1430). The newly consolidated test files correctly handle DB seed data overlaps.
- Verified that the new tools (`get_user_dive_logs` and `get_reviews_and_comments`) correctly invoked backend logic and respected privacy constraints.

Related Issues
Additional Notes
- The most review-relevant changes are the tool schemas in `tools.py` and the dynamic Haversine radius calculation in `discovery.py`.
- A new geocoding utility module was introduced (`geo_utils`), and `DEEPSEEK_API_KEY` must be configured to run the new evaluation methodology scripts.