test: Add 4-layer integration testing framework #54
Conversation
…ion test docs

- Fix persona card detection in CardStack with fallback to options
- Add extractPersonasFromOptions and isPersonaStep helpers
- Fix TypeScript errors in ChatMessage and ProcessMapBuilder
- Improve BlueprintSidebar card stack rendering
- Add sidebar cards state to useDesignSession hook
- Update phase3 prompt for better card type compliance
- Add integration testing strategy documentation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Implements Phase 1 of the recommended integration testing strategy:

Layer 1 - Unit Tests (test_clara_tools.py):
- Tests for sanitize_ask_options, sanitize_cards, ensure_other_option
- Validates AG-UI event structure contracts
- Card envelope schema validation (stepper, personas, info, etc.)
- 52 unit tests, all passing

Layer 2 - Contract Tests (test_sse_streaming.py):
- SSE event formatting tests
- Design session API tests (create, get, delete)
- AG-UI event contract compliance tests
- CardEnvelope structure validation
- 19 integration tests, all passing

Layer 3 - LLM Compliance Spec (flows/personas_step.yml):
- YAML flow spec for developer-supervised testing
- Critical assertion: personas card must have type: "personas"
- Addresses the root cause of the persona card bug
- Includes compliance notes and failure actions

This follows the separation of concerns principle:
- CI tests (Layers 1-2): Deterministic, no LLM required
- Developer tests (Layer 3): Real LLM, supervised execution

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
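As context for reviewers, here is a minimal sketch of what a helper like `ensure_other_option` plausibly does. This is illustrative only: the option schema (`label`/`value` dicts) is an assumption, not taken from the PR's actual code.

```python
def ensure_other_option(options: list[dict]) -> list[dict]:
    """Append a catch-all "Other" choice if the LLM omitted it.

    Sketch only: assumes each option is a dict with "label" and
    "value" keys, which may not match the real schema in the
    backend tools module.
    """
    labels = {str(o.get("label", "")).strip().lower() for o in options}
    if "other" not in labels:
        return [*options, {"label": "Other", "value": "other"}]
    return options
```

The unit tests in Layer 1 would then be simple, deterministic assertions over inputs and outputs of helpers like this, with no LLM involved.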
## PR Review: 4-Layer Testing Framework

Excellent testing strategy separating deterministic CI tests from LLM compliance validation.

### Strengths
### Issues & Recommendations

#### Medium Priority
#### Low Priority
### Security: Excellent

All inputs sanitized, injection prevented, DoS protections in place.

### Performance: Good

Fast in-process tests; suggest pytest-xdist for parallel execution.

### Test Coverage: Strong

Well-covered areas, minor gaps in DB-dependent tools (not blocking).

### Verdict: APPROVE

Code: 9/10 | Architecture: 10/10 | Docs: 10/10

Ship it! This solves LLM compliance testing elegantly.

### Questions
- Add clara.testing.flow_runner module for running YAML flow specs
- Add /test-flow slash command for Claude Code
- CLI usage: uv run python -m clara.testing.flow_runner personas_step

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
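For readers unfamiliar with the pattern, a rough sketch of the general shape such a flow runner can take. This is illustrative only — not the actual `clara.testing.flow_runner` code, and the assertion schema shown (`path`/`equals` entries) is hypothetical:

```python
def check_assertions(spec: dict, events: list[dict]) -> list[str]:
    """Return a failure message for each spec assertion no event satisfies.

    Sketch only: assumes a spec shape like
      assertions:
        - path: card.type
          equals: personas
    which may differ from the real flow spec format.
    """
    failures = []
    for assertion in spec.get("assertions", []):
        path = assertion["path"].split(".")
        expected = assertion["equals"]
        matched = False
        for event in events:
            value = event
            for key in path:
                # Walk one level deeper; bail out on non-dict values.
                value = value.get(key) if isinstance(value, dict) else None
            if value == expected:
                matched = True
                break
        if not matched:
            failures.append(f"no event satisfied {assertion['path']} == {expected!r}")
    return failures

spec = {"assertions": [{"path": "card.type", "equals": "personas"}]}
print(check_assertions(spec, [{"card": {"type": "personas"}}]))  # → []
```

The actual runner additionally streams AG-UI events from a running backend over SSE; this sketch covers only the assertion-checking core that makes the personas-card compliance check possible.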
## Pull Request Review: 4-Layer Integration Testing Framework

### Overview

This PR adds a comprehensive testing strategy with three main layers:
The implementation is well-documented and follows a thoughtful approach to testing LLM-driven applications.

### ✅ Strengths

#### 1. Excellent Architecture & Documentation
#### 2. Comprehensive Unit Tests (`src/backend/tests/unit/test_clara_tools.py`)
#### 3. Well-Designed Flow Runner (`src/backend/clara/testing/flow_runner.py`)
#### 4. Security-Focused Sanitization

The new sanitization functions in
### 🔍 Issues & Recommendations

#### Critical Issues

1. Missing Import in
Creates new test flow YAML files with template and documentation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
## Pull Request Review: 4-Layer Integration Testing Framework

### Summary

This PR introduces a comprehensive testing strategy with unit tests, contract tests, and a YAML-based flow runner for LLM compliance testing. The implementation demonstrates excellent architectural thinking and addresses real testing challenges in AI-powered applications.

### Strengths

#### 1. Excellent Separation of Concerns ✅

The 4-layer testing approach is well-designed:
This separation correctly identifies that LLM non-determinism should NOT be in CI.

#### 2. Robust Input Sanitization ✅
#### 3. Well-Designed Flow Spec Format ✅
#### 4. Flexible Frontend Card Handling ✅
#### 5. Documentation Quality 📚
### Issues & Concerns

#### 🔴 Critical: Missing Test Execution in CI

The tests are written, but I don't see updates to CI configuration (e.g.,

Recommendation: Add a workflow step like:

```yaml
- name: Run Unit Tests
  run: |
    cd src/backend
    uv run pytest tests/unit/ -v
- name: Run Contract Tests
  run: |
    cd src/backend
    uv run pytest tests/integration/test_sse_streaming.py -v
```

#### 🟡 Moderate: Type Safety Gaps

Issue 1:

```python
@dataclass
class AGUIEvent:
    type: str
    data: dict[str, Any]
    name: str | None = None

self.collected_events: list[AGUIEvent] = []
```

Issue 2:

#### 🟡 Moderate: Error Handling in Flow Runner

Issue:

```python
async for line in response.aiter_lines():
    if line.startswith("data:"):
        try:
            event_data = json.loads(line[5:].strip())
            events.append(event_data)
        except json.JSONDecodeError:
            pass  # ❌ Silent failure
```

Recommendation: Log the error or count malformed events:

```python
except json.JSONDecodeError as e:
    logger.warning(f"Failed to parse SSE event: {line[:100]}...", exc_info=e)
    warnings.append(f"Malformed SSE event at line {len(events)}")
```

#### 🟡 Moderate: Database Session Management

Issue:

Recommendation: Either:
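One standard option here is a context-managed transactional scope. The sketch below assumes a SQLAlchemy-style session object (commit/rollback/close); the PR's actual session handling may differ, and `FakeSession` exists purely to demonstrate the lifecycle:

```python
from contextlib import contextmanager

@contextmanager
def session_scope(session_factory):
    """Commit on success, roll back on error, always close.

    Sketch of the common pattern; assumes a SQLAlchemy-style
    session object, which is an assumption about this codebase.
    """
    session = session_factory()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()

# Tiny fake session to demonstrate the lifecycle without a database.
class FakeSession:
    def __init__(self):
        self.calls = []
    def commit(self):
        self.calls.append("commit")
    def rollback(self):
        self.calls.append("rollback")
    def close(self):
        self.calls.append("close")
```

Tool handlers would then acquire sessions only through `session_scope(...)`, which removes the risk of leaked or half-committed sessions in test and production paths alike.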
#### 🟢 Minor: Inconsistent Card Type Naming

Observation:

Question: Is the singular form intentional? The prompt should output consistent types. Consider documenting the accepted variants or normalizing to one canonical form.

#### 🟢 Minor: Magic Numbers
```python
DATA_PREFIX = "data: "
if line.startswith(DATA_PREFIX):
    event_data = json.loads(line[len(DATA_PREFIX):].strip())
```

### Security Review ✅

Strengths:
Recommendations:
### Performance Considerations

Potential Issues:

#### 1. In-Memory Session State (

```python
_session_state: dict[str, dict[str, Any]] = {}
```

This won't scale in production. Consider:
#### 2. Flow Runner Network Timeouts

```python
def __init__(self, base_url: str = "http://localhost:8000", timeout: float = 60.0):
    self.timeout = timeout
```

### Test Coverage Assessment

Excellent Coverage:
Missing Coverage:
### Architectural Recommendations

#### 1. Extract Event Types into Shared Module

Both backend and tests define

With canonical event type definitions.

#### 2. Consider Pydantic for Flow Specs

The YAML parsing in

```python
class FlowSpec(BaseModel):
    name: str
    description: str
    steps: list[FlowStep]
    # ... with validation
```

#### 3. Add Observability Hooks

The flow runner should emit structured logs/metrics:
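As an illustration, each flow run could append one JSON line to a results log. The field names here are hypothetical, not a schema from the PR:

```python
import json
import time

def emit_flow_result(flow_name: str, passed: bool, failures: list[str],
                     duration_s: float,
                     log_path: str = "flow_results.jsonl") -> dict:
    """Append a structured record of one flow run to a JSONL file.

    Sketch only: the record fields ("flow", "passed", "failures",
    "duration_s", "ts") are invented for illustration.
    """
    record = {
        "flow": flow_name,
        "passed": passed,
        "failures": failures,
        "duration_s": round(duration_s, 3),
        "ts": time.time(),
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

A JSONL log like this is trivially greppable and can be loaded into a dataframe later to track prompt-compliance pass rates per flow.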
This would enable trend analysis over time.

### Testing the Tests (Meta)

Questions to verify:
### Nitpicks & Style
### Recommendations Summary

Must Fix (Blocking):
Should Fix (High Priority):
Nice to Have:
### Conclusion

This is excellent work that addresses a real gap in testing AI-powered conversational UIs. The architectural thinking is sound, the separation between deterministic and non-deterministic tests is correct, and the implementation quality is high.

The framework correctly identifies that LLM compliance cannot be tested deterministically in CI, and provides the right developer experience for supervised testing.

Recommendation: ✅ Approve with minor changes

Once the CI configuration is added and error handling improved, this will be a solid foundation for maintaining prompt quality over time.

### Estimated Review Time
Great work! 🎉
## Summary

### Testing Layers Implemented

### Key Tests
- `TestSanitizeCards`: Validates CardEnvelope structure including personas cards
- `TestSSEEventFormatting`: Verifies SSE event format for AG-UI compliance
- `TestAGUIEventContract`: Tests CUSTOM events with clara:ask payloads
- `personas_step.yml`: Flow spec to catch the persona card bug (type: "personas" vs "info")

### Test Results
🤖 Generated with Claude Code