[LEADS-141] Add Latency Metrics to Evaluation Reports #127
Conversation
Walkthrough
Adds capture, propagation, aggregation, and reporting of three streaming performance metrics (time_to_first_token, streaming_duration, tokens_per_second) by instrumenting the streaming parser, introducing a StreamingMetricsMixin, wiring metrics through models and pipeline, computing stats, updating outputs/tests, and adding CSV/README mentions.
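For orientation, here is a minimal sketch of how the three metrics relate, based on the walkthrough and the review comments below; the function name and values are illustrative, not code from the PR:

def throughput(output_tokens: int, ttft: float, streaming_duration: float) -> float | None:
    """Illustrative relation between the three metrics (not the PR's code)."""
    generation_time = streaming_duration - ttft   # TTFT is excluded from throughput
    return output_tokens / generation_time if output_tokens > 0 and generation_time > 0 else None

print(throughput(output_tokens=120, ttft=0.4, streaming_duration=3.4))  # 40.0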
Sequence Diagram(s)
sequenceDiagram
autonumber
participant Client
participant Parser as parse_streaming_response
participant Context as StreamingContext
participant Perf as _PerformanceTracker
participant Proc as _process_event
participant Validator as _validate_response
Client->>Parser: open SSE stream (SSE lines)
Parser->>Context: init StreamingContext (Perf.start())
loop per SSE line
Parser->>Parser: _parse_sse_line -> event
Parser->>Proc: dispatch event with Context
alt content / token event
Proc->>Perf: record TTFT if first content/tool_call
Proc->>Context: append token / update interim response
else tool_call event
Proc->>Perf: record TTFT if first content/tool_call
Proc->>Context: append tool_call
else turn_complete event
Proc->>Context: record token counts (input/output)
else end event
Proc->>Perf: record end time
Perf->>Context: compute streaming_duration & tokens_per_second
Proc->>Context: set final_response
end
end
Parser->>Validator: validate Context (final_response, conversation_id)
Validator-->>Parser: valid / raise
Parser->>Client: return serialized response (includes streaming metrics)
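In code terms, the diagrammed flow is roughly the following. This is a hedged sketch that follows the diagram's event names (token, tool_call, turn_complete) and the "data: " SSE prefix mentioned in the learnings; the payload keys and helper name are assumptions, not the repository's actual parser:

import json
import time
from typing import Any, Iterable, Optional

CONTENT_EVENTS = ("token", "tool_call")  # assumed: events that count as first content

def parse_stream_sketch(lines: Iterable[str]) -> dict[str, Any]:
    """Toy version of the diagrammed flow: capture TTFT, then duration/throughput."""
    start = time.perf_counter()
    ttft: Optional[float] = None
    response, output_tokens = "", 0
    for line in lines:
        if not line.startswith("data: "):
            continue
        payload = json.loads(line[len("data: "):])
        event = payload.get("event")
        if event in CONTENT_EVENTS and ttft is None:
            ttft = time.perf_counter() - start            # record TTFT once
        if event == "token":
            response += payload.get("token", "")
        elif event == "turn_complete":
            output_tokens = payload.get("output_tokens", 0)
    duration = time.perf_counter() - start
    tps = output_tokens / (duration - ttft) if ttft and duration > ttft and output_tokens else None
    return {"response": response, "time_to_first_token": ttft,
            "streaming_duration": duration, "tokens_per_second": tps}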
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 2
🧹 Nitpick comments (2)
src/lightspeed_evaluation/core/api/streaming_parser.py (2)
41-66: Consider making StreamingContext private for consistency. The _PerformanceTracker class uses an underscore prefix to indicate internal use. Since StreamingContext is also only used internally, consider renaming it to _StreamingContext for consistency.
120-139: Add type hint for the event_data parameter. The event_data parameter is missing a full type hint. Per coding guidelines, type hints should be provided for all functions.
🔎 Proposed fix
-def _process_event(ctx: StreamingContext, event: str, event_data: dict) -> None:
+def _process_event(ctx: StreamingContext, event: str, event_data: dict[str, Any]) -> None:
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
config/system.yaml
src/lightspeed_evaluation/core/api/streaming_parser.py
src/lightspeed_evaluation/core/constants.py
src/lightspeed_evaluation/core/models/api.py
src/lightspeed_evaluation/core/models/data.py
src/lightspeed_evaluation/pipeline/evaluation/amender.py
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
tests/unit/core/api/test_streaming_parser.py
tests/unit/core/models/test_api_additional.py
🧰 Additional context used
📓 Path-based instructions (3)
src/lightspeed_evaluation/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/lightspeed_evaluation/**/*.py: Provide type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels in lightspeed_evaluation modules
Files:
src/lightspeed_evaluation/core/models/api.py
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
src/lightspeed_evaluation/core/constants.py
src/lightspeed_evaluation/core/models/data.py
src/lightspeed_evaluation/core/api/streaming_parser.py
src/lightspeed_evaluation/pipeline/evaluation/amender.py
**/test_*.py
📄 CodeRabbit inference engine (AGENTS.md)
Use pytest mocking (mocker fixture) instead of unittest.mock
Files:
tests/unit/core/api/test_streaming_parser.py
tests/unit/core/models/test_api_additional.py
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Mirror test structure to source code organization, with test files named test_*.py and test functions/classes named test_* and Test*
Files:
tests/unit/core/api/test_streaming_parser.py
tests/unit/core/models/test_api_additional.py
🧠 Learnings (4)
📚 Learning: 2025-09-09T14:58:10.630Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/pipeline/evaluation/amender.py:32-41
Timestamp: 2025-09-09T14:58:10.630Z
Learning: In the lightspeed-evaluation framework, when API is enabled, every turn should make a fresh API call regardless of whether the turn already has response or tool_calls data. This ensures consistency and fresh responses for each evaluation run.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
🧬 Code graph analysis (2)
tests/unit/core/api/test_streaming_parser.py (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
parse_streaming_response(88-117)
tests/unit/core/models/test_api_additional.py (1)
src/lightspeed_evaluation/core/models/api.py (2)
APIResponse(80-149)
from_raw_response(114-149)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: tests (3.12)
- GitHub Check: tests (3.13)
🔇 Additional comments (13)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
72-91: LGTM! Clean implementation of streaming metrics extraction. The streaming performance metrics extraction is well-implemented (a small sketch of the pattern follows this list):
- Appropriate use of getattr with None defaults for optional fields
- Conditional debug logging only when metrics are present
- Consistent with the existing token usage pattern
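A minimal sketch of the pattern these bullets describe; api_response and turn_data are hypothetical stand-ins, not the amender's exact variables:

import logging
from typing import Any

logger = logging.getLogger(__name__)

def copy_streaming_metrics(api_response: Any, turn_data: Any) -> None:
    """Copy optional streaming metrics using getattr with None defaults (sketch)."""
    ttft = getattr(api_response, "time_to_first_token", None)
    turn_data.time_to_first_token = ttft
    turn_data.streaming_duration = getattr(api_response, "streaming_duration", None)
    turn_data.tokens_per_second = getattr(api_response, "tokens_per_second", None)
    if ttft is not None:  # debug-log only when streaming metrics are present
        logger.debug(
            "Streaming metrics: ttft=%.3fs duration=%s tps=%s",
            ttft, turn_data.streaming_duration, turn_data.tokens_per_second,
        )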
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
185-192: LGTM! Streaming metrics properly propagated to evaluation results. The streaming performance metrics are correctly extracted from turn_data and populated into EvaluationResult, following the same conditional pattern as other turn-level fields.
218-236: LGTM! Error handling improved with consistent streaming metrics. Good refactoring to use a local turn_data variable for clarity, and the error result now includes streaming metrics consistently with the success path.
tests/unit/core/api/test_streaming_parser.py (1)
260-377: LGTM! Comprehensive test coverage for streaming performance metrics. The new test class provides excellent coverage (a toy illustration of these invariants follows the list):
- Validates TTFT capture on different event types (token, tool_call, turn_complete)
- Checks metric relationships (streaming_duration ≥ time_to_first_token)
- Tests tokens_per_second calculation with and without token counts
- Verifies complete flow with all metrics
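A self-contained toy illustration of the invariants listed above; it is not taken from the test suite:

import time

def test_metric_relationships_sketch() -> None:
    """Checks the TTFT/duration/throughput relationships described above."""
    start = time.perf_counter()
    time.sleep(0.01)                          # pretend the first token arrives here
    ttft = time.perf_counter() - start
    time.sleep(0.01)                          # pretend the rest of the stream arrives
    duration = time.perf_counter() - start
    output_tokens = 5
    tokens_per_second = output_tokens / (duration - ttft)

    assert duration >= ttft                   # streaming_duration >= time_to_first_token
    assert tokens_per_second > 0              # throughput defined once tokens were produced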
src/lightspeed_evaluation/core/models/api.py (2)
96-111: LGTM! Streaming metrics properly defined with validation. The streaming performance metric fields are well-defined:
- Appropriate Optional[float] type for streaming-only metrics
- Non-negative validation with ge=0
- Clear descriptions indicating streaming-only applicability
134-148: LGTM! Streaming metrics correctly extracted and propagated. The extraction logic correctly uses .get() to retrieve optional streaming metrics from raw data and passes them to the APIResponse constructor.
tests/unit/core/models/test_api_additional.py (1)
168-228: LGTM! Thorough test coverage for APIResponse streaming metrics. The tests comprehensively cover:
- Direct instantiation with streaming metrics
- Default None values when metrics are absent
- Parsing from raw response with and without streaming metrics
- Token count propagation alongside streaming metrics
src/lightspeed_evaluation/core/models/data.py (1)
81-96: LGTM! Streaming metrics properly defined in TurnData with validation. The streaming performance metrics are correctly defined with:
- Appropriate Optional[float] type
- Non-negative validation (ge=0)
- Clear descriptions
src/lightspeed_evaluation/core/api/streaming_parser.py (5)
1-14: LGTM! The imports are appropriate for the new performance tracking functionality, and CONTENT_EVENTS as a tuple provides an immutable constant for event classification.
17-38: LGTM! Good use of time.perf_counter() for high-resolution timing. The TTFT capture logic correctly ensures single capture via the _first_content_received flag.
69-85: LGTM! The throughput calculation correctly excludes TTFT from the generation time, and the function properly handles edge cases (no tokens, missing TTFT, zero/negative generation time).
142-147: LGTM! The validation logic correctly ensures required fields are present before returning the response. The use of ValueError was noted in the earlier comment about custom exceptions.
150-196: LGTM! The existing helper functions (_parse_sse_line, _parse_tool_call, _format_tool_sequences) integrate correctly with the new context-based streaming flow.
Actionable comments posted: 0
♻️ Duplicate comments (2)
src/lightspeed_evaluation/core/api/streaming_parser.py (2)
120-140: Use custom exceptions per coding guidelines. Line 125 raises ValueError for streaming API errors. Per the coding guidelines and a previous review comment, this should use APIError from core.system.exceptions instead.
🔎 Suggested fix
Update the import at the top of the file:
+from lightspeed_evaluation.core.system.exceptions import APIError
Then update the error handling:
 def _process_event(ctx: StreamingContext, event: str, event_data: dict) -> None:
     """Process a single streaming event and update context."""
     if event == "error" and "token" in event_data:
         error_message = event_data["token"]
         logger.error("Received error event from streaming API: %s", error_message)
-        raise ValueError(f"Streaming API error: {error_message}")
+        raise APIError(f"Streaming API error: {error_message}")
Based on coding guidelines: "Use custom exceptions from core.system.exceptions for error handling".
142-147: Use custom exceptions for validation errors. Lines 145 and 147 raise ValueError for validation failures. Per the coding guidelines and a previous review comment, these should use DataValidationError from core.system.exceptions instead.
🔎 Suggested fix
Update the import at the top of the file:
+from lightspeed_evaluation.core.system.exceptions import DataValidationError
Then update the validation:
 def _validate_response(ctx: StreamingContext) -> None:
     """Validate that required response fields are present."""
     if not ctx.final_response:
-        raise ValueError("No final response found in streaming output")
+        raise DataValidationError("No final response found in streaming output")
     if not ctx.conversation_id:
-        raise ValueError("No Conversation ID found")
+        raise DataValidationError("No Conversation ID found")
Based on coding guidelines: "Use custom exceptions from core.system.exceptions for error handling".
🧹 Nitpick comments (3)
src/lightspeed_evaluation/core/api/streaming_parser.py (3)
17-38: Add type hints and docstrings for _PerformanceTracker. Per coding guidelines, all functions and methods should have type hints, and public APIs should have Google-style docstrings. While this is a private class, adding docstrings improves maintainability.
🔎 Suggested improvements
 @dataclass
 class _PerformanceTracker:
-    """Tracks streaming performance metrics (TTFT, throughput)."""
+    """Tracks streaming performance metrics.
+
+    Captures time to first token (TTFT) and calculates throughput metrics
+    for streaming responses.
+
+    Attributes:
+        stream_start_time: Monotonic timestamp when streaming started
+        time_to_first_token: Time in seconds to receive first content token
+        _first_content_received: Internal flag tracking if first content seen
+    """
     stream_start_time: float = field(default_factory=time.perf_counter)
     time_to_first_token: Optional[float] = None
     _first_content_received: bool = False

-    def capture_ttft(self) -> None:
-        """Capture time to first token if not already captured."""
+    def capture_ttft(self) -> None:
+        """Capture time to first token if not already captured.
+
+        Records the time elapsed from stream start to first content token.
+        Subsequent calls are ignored to preserve the first measurement.
+        """
         if not self._first_content_received:
             self.time_to_first_token = time.perf_counter() - self.stream_start_time
             self._first_content_received = True
             logger.debug("Time to first token: %.3f seconds", self.time_to_first_token)

-    def get_metrics(self, output_tokens: int) -> tuple[float, Optional[float]]:
-        """Calculate streaming duration and tokens per second."""
+    def get_metrics(self, output_tokens: int) -> tuple[float, Optional[float]]:
+        """Calculate streaming duration and tokens per second.
+
+        Args:
+            output_tokens: Number of output tokens generated
+
+        Returns:
+            Tuple of (streaming_duration, tokens_per_second)
+        """
         streaming_duration = time.perf_counter() - self.stream_start_time
         tokens_per_second = _calculate_tokens_per_second(
             output_tokens, self.time_to_first_token, streaming_duration
         )
         return streaming_duration, tokens_per_second
Based on coding guidelines: "Provide type hints for all public functions and methods" and "Use Google-style docstrings for all public APIs".
41-66: Enhance docstrings for StreamingContext. The class and method need more detailed Google-style docstrings per coding guidelines. The existing docstrings are minimal.
🔎 Suggested improvements
 @dataclass
 class StreamingContext:  # pylint: disable=too-many-instance-attributes
-    """Context for streaming response parsing."""
+    """Context for streaming response parsing.
+
+    Maintains state during SSE stream parsing, including response content,
+    conversation tracking, token counts, and performance measurements.
+
+    Attributes:
+        conversation_id: Unique conversation identifier from API
+        final_response: Complete response text after streaming
+        tool_calls: List of tool invocations from the stream
+        input_tokens: Number of input tokens used
+        output_tokens: Number of output tokens generated
+        perf: Performance tracker for TTFT and throughput metrics
+    """
     conversation_id: str = ""
     final_response: str = ""
     tool_calls: list[dict[str, Any]] = field(default_factory=list)
     input_tokens: int = 0
     output_tokens: int = 0
     perf: _PerformanceTracker = field(default_factory=_PerformanceTracker)

     def to_response_dict(self) -> dict[str, Any]:
-        """Convert context to response dictionary."""
+        """Convert context to response dictionary.
+
+        Returns:
+            Dictionary containing response text, tool calls, conversation ID,
+            token counts, and streaming performance metrics (TTFT, duration,
+            tokens per second).
+        """
         streaming_duration, tokens_per_second = self.perf.get_metrics(
             self.output_tokens
         )
         return {
             "response": self.final_response,
             "tool_calls": _format_tool_sequences(self.tool_calls),
             "conversation_id": self.conversation_id,
             "input_tokens": self.input_tokens,
             "output_tokens": self.output_tokens,
             "time_to_first_token": self.perf.time_to_first_token,
             "streaming_duration": streaming_duration,
             "tokens_per_second": tokens_per_second,
         }
Based on coding guidelines: "Use Google-style docstrings for all public APIs".
69-85: Consider enhancing the docstring for clarity. The function has a good implementation with proper edge case handling. The docstring could be expanded to document parameters and return value in Google-style format.
🔎 Suggested improvement
 def _calculate_tokens_per_second(
     output_tokens: int, ttft: Optional[float], total_duration: float
 ) -> Optional[float]:
-    """Calculate tokens per second, excluding TTFT from throughput calculation."""
+    """Calculate tokens per second, excluding TTFT from throughput calculation.
+
+    Args:
+        output_tokens: Number of output tokens generated
+        ttft: Time to first token in seconds, or None if not captured
+        total_duration: Total streaming duration in seconds
+
+    Returns:
+        Tokens per second throughput (output_tokens / generation_time),
+        or None if calculation is not possible (no tokens, no TTFT, or
+        negative generation time).
+    """
     if output_tokens <= 0 or ttft is None:
         return None
     generation_time = total_duration - ttft
     if generation_time <= 0:
         return None
     tokens_per_second = output_tokens / generation_time
     logger.debug(
         "Streaming performance: %.3f tokens/sec (%d tokens in %.3f sec)",
         tokens_per_second,
         output_tokens,
         generation_time,
     )
     return tokens_per_second
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
config/system.yaml
src/lightspeed_evaluation/core/api/streaming_parser.py
src/lightspeed_evaluation/core/constants.py
src/lightspeed_evaluation/core/models/api.py
src/lightspeed_evaluation/core/models/data.py
src/lightspeed_evaluation/pipeline/evaluation/amender.py
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
tests/unit/core/api/test_streaming_parser.py
tests/unit/core/models/test_api_additional.py
🚧 Files skipped from review as they are similar to previous changes (4)
- src/lightspeed_evaluation/core/models/data.py
- config/system.yaml
- src/lightspeed_evaluation/core/constants.py
- tests/unit/core/models/test_api_additional.py
🧰 Additional context used
📓 Path-based instructions (3)
**/test_*.py
📄 CodeRabbit inference engine (AGENTS.md)
Use pytest mocking (mocker fixture) instead of unittest.mock
Files:
tests/unit/core/api/test_streaming_parser.py
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Mirror test structure to source code organization, with test files named test_*.py and test functions/classes named test_* and Test*
Files:
tests/unit/core/api/test_streaming_parser.py
src/lightspeed_evaluation/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/lightspeed_evaluation/**/*.py: Provide type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels in lightspeed_evaluation modules
Files:
src/lightspeed_evaluation/core/models/api.py
src/lightspeed_evaluation/pipeline/evaluation/amender.py
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
src/lightspeed_evaluation/core/api/streaming_parser.py
🧠 Learnings (6)
📚 Learning: 2025-09-09T14:58:10.630Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/pipeline/evaluation/amender.py:32-41
Timestamp: 2025-09-09T14:58:10.630Z
Learning: In the lightspeed-evaluation framework, when API is enabled, every turn should make a fresh API call regardless of whether the turn already has response or tool_calls data. This ensures consistency and fresh responses for each evaluation run.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-12-11T10:05:06.422Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-11T10:05:06.422Z
Learning: Applies to src/lightspeed_evaluation/**/*.py : Use custom exceptions from core.system.exceptions for error handling
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
🧬 Code graph analysis (1)
tests/unit/core/api/test_streaming_parser.py (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
parse_streaming_response(88-117)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: tests (3.13)
🔇 Additional comments (8)
tests/unit/core/api/test_streaming_parser.py (2)
36-39: LGTM! Existing test appropriately updated. The test now verifies that performance metrics are present in the response. This ensures backward compatibility while validating the new metrics are populated.
260-377: Excellent test coverage for streaming performance metrics. The new test suite comprehensively validates:
- Time to First Token (TTFT) capture on different event types
- Streaming duration calculation and its relationship to TTFT
- Tokens per second calculation with and without token counts
- Complete end-to-end flow with all metrics
The tests follow pytest conventions and coding guidelines.
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
185-192: LGTM! Streaming metrics properly integrated. The streaming performance metrics are correctly extracted from turn_data with appropriate None fallbacks, consistent with the existing pattern for token usage metrics.
214-236: LGTM! Error result consistently includes streaming metrics. The refactoring to use a local turn_data variable improves readability and avoids repeated attribute access. Including streaming metrics in error results is appropriate; they provide valuable performance insights even when evaluations fail.
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
72-90: LGTM! Streaming metrics properly extracted and logged. The implementation correctly:
- Uses getattr with None defaults for optional streaming fields
- Logs metrics only when available (conditional on TTFT presence)
- Employs defensive null handling in logging to prevent crashes
- Provides clear, formatted output with appropriate precision
The code follows project guidelines and integrates cleanly with existing patterns.
src/lightspeed_evaluation/core/models/api.py (2)
96-111: LGTM! Streaming metric fields properly defined. The new fields are well-structured with:
- Appropriate types (Optional[float])
- Non-negative validation (ge=0)
- Clear documentation indicating streaming-only availability
- Consistent naming conventions
134-138: LGTM! Metric extraction and population implemented correctly. The extraction uses raw_data.get() without defaults, appropriately returning None for missing keys. This pattern is consistent with the optional nature of streaming metrics and aligns with existing token extraction logic.
Also applies to: 146-148
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
88-117: LGTM! Well-structured refactoring with clear separation of concerns. The refactored implementation:
- Uses context-based state management for cleaner code
- Separates parsing, event processing, and validation into distinct phases
- Captures TTFT at the correct point (first content event)
- Includes comprehensive documentation of performance metrics
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
120-147: Use custom exceptions from core.system.exceptions for error handling. Per the coding guidelines and previous review feedback, replace ValueError with custom exceptions:
- Line 125: Use APIError for streaming API errors
- Lines 145, 147: Use DataValidationError for validation failures
🔎 Proposed fix
+from lightspeed_evaluation.core.system.exceptions import APIError, DataValidationError
+
 def _process_event(ctx: StreamingContext, event: str, event_data: dict) -> None:
     """Process a single streaming event and update context."""
     if event == "error" and "token" in event_data:
         error_message = event_data["token"]
         logger.error("Received error event from streaming API: %s", error_message)
-        raise ValueError(f"Streaming API error: {error_message}")
+        raise APIError(f"Streaming API error: {error_message}")

 # ... rest unchanged ...

 def _validate_response(ctx: StreamingContext) -> None:
     """Validate that required response fields are present."""
     if not ctx.final_response:
-        raise ValueError("No final response found in streaming output")
+        raise DataValidationError("No final response found in streaming output")
     if not ctx.conversation_id:
-        raise ValueError("No Conversation ID found")
+        raise DataValidationError("No Conversation ID found")
🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
41-66: StreamingContext is a public class; add a Google-style docstring for the to_response_dict method. Per coding guidelines, public APIs should use Google-style docstrings. The to_response_dict method lacks a return type description.
🔎 Suggested docstring enhancement
 def to_response_dict(self) -> dict[str, Any]:
-    """Convert context to response dictionary."""
+    """Convert context to response dictionary.
+
+    Returns:
+        dict[str, Any]: Dictionary containing response data and streaming
+        performance metrics (time_to_first_token, streaming_duration,
+        tokens_per_second).
+    """
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
config/system.yaml
src/lightspeed_evaluation/core/api/streaming_parser.py
src/lightspeed_evaluation/core/constants.py
src/lightspeed_evaluation/core/models/__init__.py
src/lightspeed_evaluation/core/models/api.py
src/lightspeed_evaluation/core/models/data.py
src/lightspeed_evaluation/core/models/mixins.py
src/lightspeed_evaluation/pipeline/evaluation/amender.py
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
tests/unit/core/api/test_streaming_parser.py
tests/unit/core/models/test_api_additional.py
🚧 Files skipped from review as they are similar to previous changes (6)
- src/lightspeed_evaluation/core/constants.py
- config/system.yaml
- src/lightspeed_evaluation/pipeline/evaluation/amender.py
- tests/unit/core/api/test_streaming_parser.py
- tests/unit/core/models/test_api_additional.py
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
🧰 Additional context used
📓 Path-based instructions (1)
src/lightspeed_evaluation/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/lightspeed_evaluation/**/*.py: Provide type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels in lightspeed_evaluation modules
Files:
src/lightspeed_evaluation/core/models/mixins.py
src/lightspeed_evaluation/core/api/streaming_parser.py
src/lightspeed_evaluation/core/models/__init__.py
src/lightspeed_evaluation/core/models/api.py
src/lightspeed_evaluation/core/models/data.py
🧠 Learnings (6)
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
src/lightspeed_evaluation/core/models/api.py
📚 Learning: 2025-12-11T10:05:06.422Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-11T10:05:06.422Z
Learning: Applies to src/lightspeed_evaluation/**/*.py : Use custom exceptions from core.system.exceptions for error handling
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
src/lightspeed_evaluation/core/models/__init__.py
src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.
Applied to files:
src/lightspeed_evaluation/core/models/__init__.py
src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.
Applied to files:
src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-09-11T12:47:06.747Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.
Applied to files:
src/lightspeed_evaluation/core/models/data.py
🧬 Code graph analysis (3)
src/lightspeed_evaluation/core/models/__init__.py (1)
src/lightspeed_evaluation/core/models/mixins.py (1)
StreamingMetricsMixin(8-33)
src/lightspeed_evaluation/core/models/api.py (1)
src/lightspeed_evaluation/core/models/mixins.py (1)
StreamingMetricsMixin(8-33)
src/lightspeed_evaluation/core/models/data.py (1)
src/lightspeed_evaluation/core/models/mixins.py (1)
StreamingMetricsMixin(8-33)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: tests (3.13)
🔇 Additional comments (9)
src/lightspeed_evaluation/core/api/streaming_parser.py (3)
17-38: Well-designed performance tracker with proper encapsulation. The _PerformanceTracker dataclass cleanly encapsulates TTFT capture logic with idempotent capture_ttft() and correctly uses time.perf_counter() for high-resolution timing. The separation of concerns between timing capture and metrics calculation is good.
69-85: Tokens-per-second calculation correctly excludes TTFT from throughput. The logic properly calculates generation throughput by subtracting TTFT from total duration, and handles edge cases (zero/negative tokens, missing TTFT, zero/negative generation time) by returning None.
88-117: Refactored parsing flow is cleaner and more maintainable. The event-driven approach with StreamingContext, _process_event, and _validate_response improves readability and testability. The TTFT capture on CONTENT_EVENTS is correctly placed.
src/lightspeed_evaluation/core/models/__init__.py (1)
15-15: LGTM! The StreamingMetricsMixin import and export follow the established patterns. The categorization under the # Mixins comment is appropriate.
Also applies to: 47-48
src/lightspeed_evaluation/core/models/mixins.py (1)
1-33: Clean mixin design with proper validation constraints. The StreamingMetricsMixin follows Pydantic best practices (a toy version is sketched below):
- Uses Optional[float] with default=None for streaming-only metrics
- Applies ge=0 validation to prevent negative values
- Includes descriptive field documentation
This provides a reusable component for APIResponse, TurnData, and EvaluationResult.
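To make the reuse concrete, here is a toy version of such a mixin under the constraints described above; the field names follow the PR, but the class names and bodies are invented for this sketch:

from typing import Optional
from pydantic import BaseModel, Field

class StreamingMetricsSketch(BaseModel):
    """Stand-in for the mixin: three optional, non-negative metric fields."""
    time_to_first_token: Optional[float] = Field(default=None, ge=0)
    streaming_duration: Optional[float] = Field(default=None, ge=0)
    tokens_per_second: Optional[float] = Field(default=None, ge=0)

class TurnDataSketch(StreamingMetricsSketch):
    """Any model inheriting the mixin picks up the same validated fields."""
    response: str = ""

turn = TurnDataSketch(response="hi", time_to_first_token=0.12)
print(turn.streaming_duration)        # None until populated from a streaming response
# TurnDataSketch(time_to_first_token=-1.0) would raise a ValidationError (ge=0)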
src/lightspeed_evaluation/core/models/api.py (2)
7-8: Inheritance from StreamingMetricsMixin is appropriate. Since StreamingMetricsMixin extends BaseModel, APIResponse correctly inherits Pydantic functionality along with the streaming metric fields.
Also applies to: 82-82
119-134: Streaming metrics extraction and propagation looks correct. The from_raw_response method properly extracts the three streaming metrics using .get() (defaulting to None), which aligns with the mixin's field defaults. The metrics are passed to the constructor where Pydantic's ge=0 validation will apply.
src/lightspeed_evaluation/core/models/data.py (2)
10-10: TurnData correctly inherits streaming metrics via mixin. The inheritance change from BaseModel to StreamingMetricsMixin is appropriate. Since the mixin extends BaseModel, all existing Pydantic functionality is preserved while adding the three streaming metric fields with proper validation.
Also applies to: 36-36, 384-384
384-384: EvaluationResult correctly inherits streaming metrics via mixin. The mixin inheritance provides time_to_first_token, streaming_duration, and tokens_per_second fields with ge=0 validation, addressing the previous review feedback about validation constraints.
VladimirKadlec
left a comment
Nice work 💪, one small comment below.
Force-pushed from 3f8860a to 1834d7b
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
src/lightspeed_evaluation/core/output/generator.py (3)
208-216: Update the docstring to document the new parameter. The streaming_stats parameter was added to the method signature but the docstring was not updated to document it. According to the coding guidelines, all public APIs should have Google-style docstrings.
🔎 Proposed docstring update
 def _generate_json_summary(  # pylint: disable=too-many-arguments,too-many-positional-arguments
     self,
     results: list[EvaluationResult],
     base_filename: str,
     basic_stats: dict[str, Any],
     detailed_stats: dict[str, Any],
     api_tokens: dict[str, Any],
     streaming_stats: dict[str, Any],
 ) -> Path:
-    """Generate JSON summary report."""
+    """Generate JSON summary report.
+
+    Args:
+        results: List of evaluation results.
+        base_filename: Base filename for the output file.
+        basic_stats: Basic statistics calculated from results.
+        detailed_stats: Detailed statistics by metric and conversation.
+        api_tokens: API token usage statistics.
+        streaming_stats: Streaming performance statistics.
+
+    Returns:
+        Path to the generated JSON file.
+    """
265-273: Update the docstring to document the new parameter. The streaming_stats parameter was added to the method signature but the docstring was not updated to document it. According to the coding guidelines, all public APIs should have Google-style docstrings.
🔎 Proposed docstring update
 def _generate_text_summary(  # pylint: disable=too-many-arguments,too-many-positional-arguments
     self,
     results: list[EvaluationResult],
     base_filename: str,
     basic_stats: dict[str, Any],
     detailed_stats: dict[str, Any],
     api_tokens: dict[str, Any],
     streaming_stats: dict[str, Any],
 ) -> Path:
-    """Generate human-readable text summary."""
+    """Generate human-readable text summary.
+
+    Args:
+        results: List of evaluation results.
+        base_filename: Base filename for the output file.
+        basic_stats: Basic statistics calculated from results.
+        detailed_stats: Detailed statistics by metric and conversation.
+        api_tokens: API token usage statistics.
+        streaming_stats: Streaming performance statistics.
+
+    Returns:
+        Path to the generated text file.
+    """
208-273: Update test calls to include the streaming_stats parameter. Tests at lines 112-114 and 138-140 in tests/unit/core/output/test_generator.py are calling _generate_json_summary() and _generate_text_summary() without the required streaming_stats parameter. Add an empty dict or stats["streaming"] as the sixth argument to both calls:
json_file = handler._generate_json_summary(
    sample_results, "test", stats["basic"], stats["detailed"], api_tokens, stats.get("streaming", {})
)
txt_file = handler._generate_text_summary(
    sample_results, "test", stats["basic"], stats["detailed"], api_tokens, stats.get("streaming", {})
)
🤖 Fix all issues with AI Agents
In @src/lightspeed_evaluation/core/output/generator.py:
- Around line 326-360: The docstring for _write_streaming_stats is too minimal;
update it to a Google-style docstring that documents the function purpose,
parameters and return value. Specifically, for method
_write_streaming_stats(self, f, streaming_stats) add a description line, an Args
section describing f (file-like object or writable stream) and streaming_stats
(dict with keys like "time_to_first_token", "streaming_duration",
"tokens_per_second" and their expected subkeys), and a Returns section
indicating None; also include a short Raises section if the method can raise
(e.g., KeyError or TypeError) or state that it does not raise, keeping the
docstring consistent with Google style.
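One possible shape for that docstring, hedged since the method body is not shown in this review and the exact keys of streaming_stats are assumptions; the class name here is a stand-in:

from typing import Any, TextIO

class OutputHandlerSketch:
    def _write_streaming_stats(self, f: TextIO, streaming_stats: dict[str, Any]) -> None:
        """Write aggregated streaming performance statistics to the text summary.

        Args:
            f: Writable text stream the summary is written to.
            streaming_stats: Aggregated streaming metrics, e.g. entries for
                "time_to_first_token", "streaming_duration" and "tokens_per_second",
                each with summary sub-keys such as "count" and "mean".

        Returns:
            None.
        """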
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
config/system.yaml
src/lightspeed_evaluation/core/api/streaming_parser.py
src/lightspeed_evaluation/core/constants.py
src/lightspeed_evaluation/core/models/__init__.py
src/lightspeed_evaluation/core/models/api.py
src/lightspeed_evaluation/core/models/data.py
src/lightspeed_evaluation/core/models/mixins.py
src/lightspeed_evaluation/core/output/generator.py
src/lightspeed_evaluation/core/output/statistics.py
src/lightspeed_evaluation/pipeline/evaluation/amender.py
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
tests/unit/core/api/test_streaming_parser.py
tests/unit/core/models/test_api_additional.py
🚧 Files skipped from review as they are similar to previous changes (6)
- tests/unit/core/models/test_api_additional.py
- src/lightspeed_evaluation/pipeline/evaluation/amender.py
- tests/unit/core/api/test_streaming_parser.py
- config/system.yaml
- src/lightspeed_evaluation/core/constants.py
- src/lightspeed_evaluation/core/models/mixins.py
🧰 Additional context used
📓 Path-based instructions (1)
src/lightspeed_evaluation/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/lightspeed_evaluation/**/*.py: Provide type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels in lightspeed_evaluation modules
Files:
src/lightspeed_evaluation/core/models/api.py
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
src/lightspeed_evaluation/core/api/streaming_parser.py
src/lightspeed_evaluation/core/output/generator.py
src/lightspeed_evaluation/core/output/statistics.py
src/lightspeed_evaluation/core/models/__init__.py
src/lightspeed_evaluation/core/models/data.py
🧠 Learnings (10)
📚 Learning: 2025-09-09T14:58:10.630Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/pipeline/evaluation/amender.py:32-41
Timestamp: 2025-09-09T14:58:10.630Z
Learning: In the lightspeed-evaluation framework, when API is enabled, every turn should make a fresh API call regardless of whether the turn already has response or tool_calls data. This ensures consistency and fresh responses for each evaluation run.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
src/lightspeed_evaluation/core/models/__init__.py
src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-12-11T10:05:06.422Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-11T10:05:06.422Z
Learning: Applies to src/lightspeed_evaluation/**/*.py : Use custom exceptions from core.system.exceptions for error handling
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.
Applied to files:
src/lightspeed_evaluation/core/models/__init__.py
src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-09-11T12:47:06.747Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.
Applied to files:
src/lightspeed_evaluation/core/models/data.py
🧬 Code graph analysis (5)
src/lightspeed_evaluation/core/models/api.py (1)
src/lightspeed_evaluation/core/models/mixins.py (1)
StreamingMetricsMixin(8-33)
src/lightspeed_evaluation/core/output/generator.py (1)
src/lightspeed_evaluation/core/output/statistics.py (2)
calculate_streaming_stats(252-293)
calculate_api_token_usage(227-249)
src/lightspeed_evaluation/core/output/statistics.py (1)
src/lightspeed_evaluation/core/models/data.py (1)
EvaluationData(318-381)
src/lightspeed_evaluation/core/models/__init__.py (1)
src/lightspeed_evaluation/core/models/mixins.py (1)
StreamingMetricsMixin(8-33)
src/lightspeed_evaluation/core/models/data.py (1)
src/lightspeed_evaluation/core/models/mixins.py (1)
StreamingMetricsMixin(8-33)
🪛 GitHub Actions: Black
src/lightspeed_evaluation/core/output/generator.py
[error] 1-1: Black formatting check failed. 1 file would be reformatted. Run 'black --write' to fix code style issues.
🪛 GitHub Actions: Tests
src/lightspeed_evaluation/core/output/generator.py
[error] 1-1: TypeError in tests: OutputHandler._generate_json_summary() missing 1 required positional argument: 'streaming_stats'. This suggests a mismatch between test expectations and the method signature.
[error] 1-1: TypeError in tests: OutputHandler._generate_text_summary() missing 1 required positional argument: 'streaming_stats'. This suggests a mismatch between test expectations and the method signature.
🔇 Additional comments (12)
src/lightspeed_evaluation/core/models/__init__.py (1)
15-15: LGTM! Proper API exposure. The import and export of StreamingMetricsMixin correctly follow the module's pattern for exposing public APIs.
Also applies to: 47-48
src/lightspeed_evaluation/core/output/statistics.py (1)
252-293: LGTM! Well-structured statistics function. The implementation correctly:
- Provides type hints for all parameters and return values
- Uses Google-style docstring
- Handles edge cases (empty lists gracefully return {"count": 0})
- Guards statistics.stdev with a length check to prevent errors (see the sketch below)
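A hedged sketch of the guards described in these bullets; the helper name and dict shape are illustrative, not the module's API:

import statistics
from typing import Optional

def summarize_metric(values: list[Optional[float]]) -> dict[str, float]:
    """Summarize one streaming metric, guarding the empty and single-sample cases."""
    present = [v for v in values if v is not None]
    if not present:                      # empty input: report only a count
        return {"count": 0}
    summary = {
        "count": len(present),
        "mean": statistics.mean(present),
        "min": min(present),
        "max": max(present),
    }
    if len(present) > 1:                 # stdev needs at least two samples
        summary["stdev"] = statistics.stdev(present)
    return summary

print(summarize_metric([0.21, 0.35, None, 0.28]))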
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
185-192: LGTM! Consistent metric propagation. Streaming metrics are correctly propagated from turn_data to EvaluationResult using safe conditional access, consistent with the existing pattern for API tokens.
218-218: LGTM! Improved consistency in error handling. The introduction of the local turn_data variable improves readability and ensures consistent field access patterns. Error results now include streaming metrics with proper parity to success results.
Also applies to: 227-235
src/lightspeed_evaluation/core/models/api.py (2)
7-7: LGTM! Proper mixin integration.
APIResponse correctly inherits from StreamingMetricsMixin, which adds the three streaming performance fields while maintaining Pydantic BaseModel inheritance.
Also applies to: 82-82
119-133: LGTM! Safe metric extraction. Streaming metrics are correctly extracted using .get() (returns None if absent) and properly passed to the constructor. The inline comment clarifies that these metrics are only available for the streaming endpoint.
src/lightspeed_evaluation/core/api/streaming_parser.py (4)
17-40: LGTM! Well-structured performance tracker. The _PerformanceTracker class correctly:
- Uses Pydantic BaseModel with proper configuration
- Provides type hints for all methods
- Includes docstrings for methods
- Captures TTFT on first content and calculates throughput metrics
42-68: LGTM! Clean context class.
StreamingContext properly encapsulates streaming state and provides a clean serialization method. Type hints and docstrings are present.
71-87: LGTM! Correct throughput calculation. The function correctly:
- Excludes TTFT from throughput calculation (measures generation speed)
- Handles edge cases (zero/negative values return None)
- Provides type hints and a clear docstring
- Guards against division by zero
90-119: LGTM! Improved modularity. The refactored function correctly:
- Uses StreamingContext to manage state
- Delegates to helper functions for clarity
- Captures TTFT on first content event
- Includes updated docstring explaining performance metrics
src/lightspeed_evaluation/core/models/data.py (2)
10-10: LGTM! Streaming metrics integration is well-structured. The inheritance changes to use StreamingMetricsMixin for both TurnData and EvaluationResult are correctly implemented. This provides a consistent way to capture and propagate streaming performance metrics throughout the evaluation pipeline.
Also applies to: 36-36, 384-384
427-445: LGTM! Output fields appropriately flatten complex structures. The new optional string fields in EvaluationResult provide string-formatted versions of complex structured data from TurnData. This is appropriate for an output model that needs to support flat formats like CSV and human-readable JSON (illustrated below).
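As an illustration of that kind of flattening; the tool-call shape and helper name are invented for this sketch, not taken from the PR:

import json
from typing import Any, Optional

def flatten_tool_calls(tool_calls: Optional[list[dict[str, Any]]]) -> Optional[str]:
    """Flatten structured tool calls into a single CSV-friendly string (sketch)."""
    if not tool_calls:
        return None
    return "; ".join(
        f'{call.get("name", "?")}({json.dumps(call.get("arguments", {}))})'
        for call in tool_calls
    )

print(flatten_tool_calls([{"name": "get_pods", "arguments": {"namespace": "default"}}]))
# -> get_pods({"namespace": "default"})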
Actionable comments posted: 1
🤖 Fix all issues with AI Agents
In @src/lightspeed_evaluation/core/api/streaming_parser.py:
- Around line 122-149: Import the custom exceptions APIError and
DataValidationError from lightspeed_evaluation.core.system.exceptions and
replace all ValueError raises in the streaming parser: in _process_event() when
handling the "error" event (currently raising ValueError with the streaming API
error) change to raise APIError with the same message; in _validate_response()
replace the two ValueError raises ("No final response found in streaming output"
and "No Conversation ID found") with DataValidationError. Ensure the file
imports APIError and DataValidationError and that the exception messages remain
descriptive.
🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/models/mixins.py (1)
8-33: Consider adding an Attributes section for Google-style compliance. While the current docstring is clear and informative, the coding guidelines specify Google-style docstrings for public APIs. Consider adding an Attributes section:
class StreamingMetricsMixin(BaseModel):
    """Mixin providing streaming performance metric fields.

    These fields capture performance metrics when using the streaming query
    endpoint. The fields are None when using the non-streaming query endpoint.

    Attributes:
        time_to_first_token: Time from request start to first content token in seconds.
        streaming_duration: Total time to receive all tokens in seconds.
        tokens_per_second: Output throughput calculation (tokens/second).
    """
Based on coding guidelines, Google-style docstrings are required for public APIs.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
config/system.yaml
src/lightspeed_evaluation/core/api/streaming_parser.py
src/lightspeed_evaluation/core/constants.py
src/lightspeed_evaluation/core/models/__init__.py
src/lightspeed_evaluation/core/models/api.py
src/lightspeed_evaluation/core/models/data.py
src/lightspeed_evaluation/core/models/mixins.py
src/lightspeed_evaluation/core/output/generator.py
src/lightspeed_evaluation/core/output/statistics.py
src/lightspeed_evaluation/pipeline/evaluation/amender.py
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
tests/unit/core/api/test_streaming_parser.py
tests/unit/core/models/test_api_additional.py
🚧 Files skipped from review as they are similar to previous changes (5)
- src/lightspeed_evaluation/core/constants.py
- src/lightspeed_evaluation/pipeline/evaluation/amender.py
- src/lightspeed_evaluation/core/models/api.py
- config/system.yaml
- src/lightspeed_evaluation/core/models/__init__.py
🧰 Additional context used
📓 Path-based instructions (3)
src/lightspeed_evaluation/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/lightspeed_evaluation/**/*.py: Provide type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels in lightspeed_evaluation modules
Files:
- src/lightspeed_evaluation/core/api/streaming_parser.py
- src/lightspeed_evaluation/core/output/generator.py
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- src/lightspeed_evaluation/core/models/mixins.py
- src/lightspeed_evaluation/core/output/statistics.py
- src/lightspeed_evaluation/core/models/data.py
**/test_*.py
📄 CodeRabbit inference engine (AGENTS.md)
Use pytest mocking (mocker fixture) instead of unittest.mock
Files:
- tests/unit/core/api/test_streaming_parser.py
- tests/unit/core/models/test_api_additional.py
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Mirror test structure to source code organization, with test files named test_*.py and test functions/classes named test_* and Test*
Files:
- tests/unit/core/api/test_streaming_parser.py
- tests/unit/core/models/test_api_additional.py
🧠 Learnings (10)
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-12-11T10:05:06.422Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-11T10:05:06.422Z
Learning: Applies to src/lightspeed_evaluation/**/*.py : Use custom exceptions from core.system.exceptions for error handling
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-12-11T10:05:06.422Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-11T10:05:06.422Z
Learning: Applies to src/lightspeed_evaluation/**/*.py : Use Google-style docstrings for all public APIs
Applied to files:
src/lightspeed_evaluation/core/output/generator.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.
Applied to files:
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-11T12:47:06.747Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.
Applied to files:
src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.
Applied to files:
src/lightspeed_evaluation/core/models/data.py
🧬 Code graph analysis (5)
src/lightspeed_evaluation/core/output/generator.py (1)
src/lightspeed_evaluation/core/output/statistics.py (2)
- calculate_streaming_stats (252-293)
- calculate_api_token_usage (227-249)
tests/unit/core/api/test_streaming_parser.py (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
parse_streaming_response(90-119)
tests/unit/core/models/test_api_additional.py (1)
src/lightspeed_evaluation/core/models/api.py (2)
- APIResponse (82-134)
- from_raw_response (99-134)
src/lightspeed_evaluation/core/output/statistics.py (1)
src/lightspeed_evaluation/core/models/data.py (1)
EvaluationData(318-381)
src/lightspeed_evaluation/core/models/data.py (1)
src/lightspeed_evaluation/core/models/mixins.py (1)
StreamingMetricsMixin(8-33)
🪛 GitHub Actions: Tests
src/lightspeed_evaluation/core/output/generator.py
[error] 1-1: TypeError: OutputHandler._generate_json_summary() missing 1 required positional argument: 'streaming_stats' when running pytest.
[error] 1-1: TypeError: OutputHandler._generate_text_summary() missing 1 required positional argument: 'streaming_stats' when running pytest.
🔇 Additional comments (18)
tests/unit/core/models/test_api_additional.py (4)
168-181: LGTM! Well-structured test for streaming metrics. The test correctly verifies that streaming performance metrics are properly stored and retrieved from APIResponse instances.
182-191: LGTM! Proper validation of default behavior. The test correctly verifies that streaming metrics default to None when not provided, which is the expected behavior for non-streaming endpoints.
193-211: LGTM! Comprehensive test coverage. The test effectively validates the complete parsing flow for streaming metrics through the from_raw_response factory method, including both token counts and streaming performance metrics.
213-228: LGTM! Complete coverage for non-streaming scenario. The test properly validates that streaming metrics remain None when not present in the raw response, ensuring correct behavior for non-streaming endpoints.
src/lightspeed_evaluation/core/models/data.py (3)
10-10: LGTM! Clean import of the new mixin. The import statement correctly brings in the StreamingMetricsMixin from the mixins module.
36-36: LGTM! Clean mixin integration. The inheritance changes cleanly add streaming metrics to both TurnData and EvaluationResult through the StreamingMetricsMixin, maintaining consistency across the codebase.
Also applies to: 384-384
427-429: LGTM! Appropriate field for serialized tool calls. The string-typed tool_calls field correctly stores the JSON-serialized representation of tool calls for evaluation results, consistent with the pattern used for other serialized fields like contexts and expected_tool_calls.
src/lightspeed_evaluation/core/output/statistics.py (1)
252-293: LGTM! Well-implemented streaming statistics aggregation. The function correctly (see the sketch after this list):
- Follows coding guidelines with proper type hints and Google-style docstring
- Handles None values appropriately for optional streaming metrics
- Provides comprehensive statistics (count, mean, median, min, max, std)
- Gracefully handles edge cases (empty datasets, single values)
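For readers skimming the report format, a minimal sketch of this kind of per-metric aggregation is shown below; it follows the count/mean/median/min/max/std shape the review describes and is not the PR's exact code.

```python
# Hypothetical aggregation over per-turn metrics (not the PR's code); it
# mirrors the count/mean/median/min/max/std summary the review describes.
import statistics
from typing import Optional


def summarize(values: list[Optional[float]]) -> dict[str, float]:
    present = [v for v in values if v is not None]
    if not present:
        return {"count": 0}
    return {
        "count": len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "min": min(present),
        "max": max(present),
        # stdev requires at least two samples.
        "std": statistics.stdev(present) if len(present) > 1 else 0.0,
    }


# e.g. ttft_stats = summarize([turn.time_to_first_token for turn in turns])
```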
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
185-192: LGTM! Consistent streaming metrics integration. The streaming performance metrics are properly integrated into the success path, following the same pattern as existing API token fields with appropriate None handling for both missing turn_data and missing metric values.
218-235: LGTM! Complete streaming metrics coverage for error path. The error result construction now correctly:
- Introduces a local turn_data variable for cleaner code
- Includes streaming performance metrics matching the success path
- Maintains consistent None handling across all fields
The error and success paths now have symmetric streaming metrics handling.
tests/unit/core/api/test_streaming_parser.py (2)
36-39: LGTM! Performance metrics assertions added correctly. Good addition of assertions to verify that streaming performance metrics (time_to_first_token, streaming_duration, tokens_per_second) are included in the parsed result for complete responses.
260-377: Comprehensive test coverage for streaming performance metrics. The TestStreamingPerformanceMetrics class provides thorough coverage:
- TTFT capture on various content events (token, tool_call, turn_complete)
- Streaming duration validation (>= TTFT invariant)
- Tokens per second calculation with and without token counts
- Complete flow integration test
The test structure follows pytest conventions and mirrors the source code organization as per the coding guidelines.
src/lightspeed_evaluation/core/api/streaming_parser.py (3)
17-40: Well-structured performance tracker implementation. The _PerformanceTracker class cleanly encapsulates timing logic with appropriate use of PrivateAttr for internal state and Pydantic's Field with default_factory for initialization.
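A minimal sketch of the timing pattern being described, assuming time.perf_counter and Pydantic's PrivateAttr as the review indicates; the class and attribute names are illustrative rather than the PR's exact implementation.

```python
# Illustrative TTFT/duration tracker: stream start is captured at
# construction, TTFT only once, on the first content-bearing event.
import time
from typing import Optional

from pydantic import BaseModel, Field, PrivateAttr


class PerformanceTracker(BaseModel):
    stream_start_time: float = Field(default_factory=time.perf_counter)
    _time_to_first_token: Optional[float] = PrivateAttr(default=None)

    def capture_ttft(self) -> None:
        if self._time_to_first_token is None:
            self._time_to_first_token = time.perf_counter() - self.stream_start_time

    def get_metrics(self) -> tuple[float, Optional[float]]:
        """Return (streaming_duration, time_to_first_token)."""
        return time.perf_counter() - self.stream_start_time, self._time_to_first_token
```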
42-68: Clean context model with proper serialization. StreamingContext properly encapsulates streaming state and delegates metric calculation to the performance tracker in to_response_dict().
71-87: Correct throughput calculation with proper edge case handling. The function correctly (see the sketch after this list):
- Excludes TTFT from throughput calculation (measuring generation time only)
- Returns None for invalid scenarios (no tokens, no TTFT, zero/negative generation time)
- Includes helpful debug logging
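A small sketch of that throughput rule, with the TTFT window excluded from the denominator; the function name and signature are assumed for illustration.

```python
# Illustrative throughput rule: tokens divided by generation time, where
# generation time excludes the wait for the first token.
from typing import Optional


def tokens_per_second(
    output_tokens: int,
    streaming_duration: float,
    time_to_first_token: Optional[float],
) -> Optional[float]:
    if output_tokens <= 0 or time_to_first_token is None:
        return None
    generation_time = streaming_duration - time_to_first_token
    if generation_time <= 0:
        return None
    return output_tokens / generation_time
```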
src/lightspeed_evaluation/core/output/generator.py (3)
69-83: Streaming stats calculation integrated correctly. The conditional logic properly computes streaming stats when evaluation data is available and provides a consistent default structure (with count: 0) when not. This aligns with the structure returned by calculate_streaming_stats.
251-254: EvaluationResult properly inherits all streaming metric attributes from StreamingMetricsMixin. The attributes time_to_first_token, streaming_duration, and tokens_per_second are correctly defined as optional float fields with proper validation and type hints in StreamingMetricsMixin, which EvaluationResult extends. No changes needed.
208-216: The code correctly passes streaming_stats to both _generate_json_summary() and _generate_text_summary(). The parameter is calculated at lines 70–83 and properly passed at lines 127 and 138. The tests also correctly provide streaming_stats when calling these methods (lines 117–124 and 153–160). No pipeline failure or breaking change exists, and no test updates are needed.
Likely an incorrect or invalid review comment.
Actionable comments posted: 1
🤖 Fix all issues with AI Agents
In @src/lightspeed_evaluation/core/api/streaming_parser.py:
- Around line 122-149: Replace the generic ValueError usages with the
project-specific exceptions: import APIError and DataValidationError from
core.system.exceptions at the top of the file, change the raise in
_process_event (where streaming API error is detected) to raise APIError(...)
including the original error message, and change both raises in
_validate_response (missing final_response and missing conversation_id) to raise
DataValidationError(...) with appropriate messages; keep existing logging and
context updates intact.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
- config/system.yaml
- src/lightspeed_evaluation/core/api/streaming_parser.py
- src/lightspeed_evaluation/core/constants.py
- src/lightspeed_evaluation/core/models/__init__.py
- src/lightspeed_evaluation/core/models/api.py
- src/lightspeed_evaluation/core/models/data.py
- src/lightspeed_evaluation/core/models/mixins.py
- src/lightspeed_evaluation/core/output/generator.py
- src/lightspeed_evaluation/core/output/statistics.py
- src/lightspeed_evaluation/pipeline/evaluation/amender.py
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- tests/unit/core/api/test_streaming_parser.py
- tests/unit/core/models/test_api_additional.py
- tests/unit/core/output/test_generator.py
🚧 Files skipped from review as they are similar to previous changes (5)
- src/lightspeed_evaluation/core/models/data.py
- src/lightspeed_evaluation/core/models/mixins.py
- src/lightspeed_evaluation/core/models/__init__.py
- tests/unit/core/models/test_api_additional.py
- config/system.yaml
🧰 Additional context used
📓 Path-based instructions (3)
src/lightspeed_evaluation/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/lightspeed_evaluation/**/*.py: Provide type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels in lightspeed_evaluation modules
Files:
- src/lightspeed_evaluation/core/constants.py
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- src/lightspeed_evaluation/core/api/streaming_parser.py
- src/lightspeed_evaluation/pipeline/evaluation/amender.py
- src/lightspeed_evaluation/core/output/statistics.py
- src/lightspeed_evaluation/core/output/generator.py
- src/lightspeed_evaluation/core/models/api.py
**/test_*.py
📄 CodeRabbit inference engine (AGENTS.md)
Use pytest mocking (mocker fixture) instead of unittest.mock
Files:
- tests/unit/core/api/test_streaming_parser.py
- tests/unit/core/output/test_generator.py
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Mirror test structure to source code organization, with test files named test_*.py and test functions/classes named test_* and Test*
Files:
- tests/unit/core/api/test_streaming_parser.py
- tests/unit/core/output/test_generator.py
🧠 Learnings (8)
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.
Applied to files:
- src/lightspeed_evaluation/core/api/streaming_parser.py
- src/lightspeed_evaluation/core/models/api.py
📚 Learning: 2025-12-11T10:05:06.422Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-11T10:05:06.422Z
Learning: Applies to src/lightspeed_evaluation/**/*.py : Use custom exceptions from core.system.exceptions for error handling
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-12-11T10:05:06.422Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-11T10:05:06.422Z
Learning: Applies to src/lightspeed_evaluation/**/*.py : Use Google-style docstrings for all public APIs
Applied to files:
src/lightspeed_evaluation/core/output/generator.py
🧬 Code graph analysis (4)
tests/unit/core/api/test_streaming_parser.py (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
parse_streaming_response(90-119)
src/lightspeed_evaluation/core/output/statistics.py (1)
src/lightspeed_evaluation/core/models/data.py (1)
EvaluationData(318-381)
src/lightspeed_evaluation/core/output/generator.py (1)
src/lightspeed_evaluation/core/output/statistics.py (2)
- calculate_streaming_stats (252-293)
- calculate_api_token_usage (227-249)
src/lightspeed_evaluation/core/models/api.py (1)
src/lightspeed_evaluation/core/models/mixins.py (1)
StreamingMetricsMixin(8-33)
🔇 Additional comments (22)
src/lightspeed_evaluation/core/constants.py (1)
42-45: LGTM! Clean addition of streaming metrics columns. The new CSV columns are well-named, appropriately commented, and logically positioned after token-related columns.
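For orientation, the addition plausibly looks like a few extra entries in the CSV column list; the constant name below is assumed, only the three column names come from the PR.

```python
# Hypothetical excerpt from constants.py; the list name is illustrative.
CSV_COLUMNS = [
    # ... existing result and token-usage columns ...
    "time_to_first_token",
    "streaming_duration",
    "tokens_per_second",
]
```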
src/lightspeed_evaluation/core/output/statistics.py (1)
252-293: Well-implemented streaming statistics calculation. The function properly:
- Provides type hints and Google-style documentation
- Aggregates metrics from evaluation data
- Handles edge cases (empty lists, single values)
- Computes standard statistical measures
The nested _calc_stats helper is appropriate for this use case and keeps the logic cohesive.
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
72-90: LGTM! Streaming metrics properly integrated into data amendment flow. The implementation (see the sketch after this list):
- Follows the existing pattern for extracting optional API response fields
- Provides appropriate debug logging for visibility
- Handles None values gracefully
- Maintains consistency with existing token usage extraction
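The amendment step described above amounts to copying optional metrics from the API response onto the turn data; a hedged sketch (function and logger names are illustrative, attribute names come from the mixin):

```python
# Illustrative amendment step (not the PR's exact amender code): copy the
# optional streaming metrics from the API response onto the turn data.
import logging

logger = logging.getLogger(__name__)


def amend_streaming_metrics(turn_data, api_response) -> None:
    turn_data.time_to_first_token = api_response.time_to_first_token
    turn_data.streaming_duration = api_response.streaming_duration
    turn_data.tokens_per_second = api_response.tokens_per_second
    if api_response.time_to_first_token is not None:
        logger.debug(
            "Streaming metrics: ttft=%.3fs, duration=%.3fs",
            api_response.time_to_first_token,
            api_response.streaming_duration or 0.0,
        )
```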
tests/unit/core/api/test_streaming_parser.py (2)
36-39: Good addition of streaming metrics assertions. The assertions ensure that the performance metrics are present in the parsed response, aligning with the new functionality.
260-377: Excellent test coverage for streaming performance metrics. The new test class thoroughly validates:
- Time-to-first-token capture on various content events (token, tool_call, turn_complete)
- Streaming duration calculation
- Tokens-per-second computation with and without token counts
- Complete flow integration with all metrics
Tests follow pytest conventions and cover both happy paths and edge cases.
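As a rough illustration of the kind of assertion being added in these tests (the fixture name is invented; parse_streaming_response and the metric keys are from this PR):

```python
# Hypothetical pytest sketch: mock_streaming_response is an invented
# fixture; the duration >= TTFT invariant follows the tests described above.
from lightspeed_evaluation.core.api.streaming_parser import parse_streaming_response


def test_parse_streaming_response_includes_metrics(mock_streaming_response):
    result = parse_streaming_response(mock_streaming_response)
    assert result["time_to_first_token"] is not None
    assert result["streaming_duration"] >= result["time_to_first_token"]
    assert "tokens_per_second" in result
```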
src/lightspeed_evaluation/core/models/api.py (3)
7-8: Correct import of StreamingMetricsMixin. The mixin properly extends APIResponse with streaming performance fields.
82-82: Clean inheritance change to add streaming metrics. Inheriting from StreamingMetricsMixin instead of BaseModel adds the three streaming performance fields (time_to_first_token, streaming_duration, tokens_per_second) with proper validation constraints. This is backward compatible since the fields default to None.
119-134: Streaming metrics properly extracted and validated. The implementation:
- Extracts the three streaming metrics from raw response data
- Passes them to the constructor where pydantic validation applies
- Handles absent metrics gracefully (None values)
- Maintains consistency with existing token extraction pattern
The validation constraints defined in StreamingMetricsMixin (ge=0 for all metrics) ensure data integrity.
tests/unit/core/output/test_generator.py (2)
111-124: LGTM! Test correctly accommodates streaming metrics. The streaming_stats initialization and parameter passing align with the production code changes. The structure matches the output of calculate_streaming_stats when no data is available.
147-160: LGTM! Consistent test updates for text summary generation. The streaming_stats initialization and parameter passing match the pattern used in the JSON summary test, maintaining consistency across test cases.
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
185-192: LGTM! Streaming metrics correctly propagated in success path. The three streaming performance metrics (time_to_first_token, streaming_duration, tokens_per_second) are safely extracted from turn_data with appropriate fallbacks, consistent with the existing pattern for API token fields.
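The propagation pattern referred to here is essentially a guarded attribute copy; a minimal sketch, with the helper name invented for illustration:

```python
# Illustrative helper (name invented): pull streaming metrics from optional
# turn data so the same kwargs feed both success and error results.
from typing import Any, Optional


def streaming_metric_kwargs(turn_data: Optional[Any]) -> dict[str, Optional[float]]:
    return {
        "time_to_first_token": getattr(turn_data, "time_to_first_token", None),
        "streaming_duration": getattr(turn_data, "streaming_duration", None),
        "tokens_per_second": getattr(turn_data, "tokens_per_second", None),
    }


# e.g. EvaluationResult(..., **streaming_metric_kwargs(turn_data))
```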
218-236: LGTM! Error result includes streaming metrics with safe access. The introduction of the local turn_data variable improves readability, and streaming metrics are consistently propagated to ERROR results alongside the existing API token fields.
src/lightspeed_evaluation/core/api/streaming_parser.py (4)
17-39: LGTM! Performance tracker cleanly captures TTFT and duration. The _PerformanceTracker uses Pydantic's private attributes appropriately to ensure TTFT is captured only once on the first content event. The timing logic is sound and includes helpful debug logging.
42-68: LGTM! Context class provides clean serialization of streaming state. The StreamingContext encapsulates all streaming state and provides a clean to_response_dict method that integrates performance metrics into the response payload.
71-87: LGTM! Throughput calculation correctly excludes TTFT. The _calculate_tokens_per_second function correctly calculates throughput by excluding TTFT from the generation time, which provides a more accurate measure of token generation rate. The function handles edge cases (no tokens, no TTFT, zero generation time) safely.
90-119: LGTM! Context-driven parsing improves code organization. The refactored parse_streaming_response uses the context pattern effectively, delegating event processing to helper functions. The TTFT capture is correctly triggered for content events, and the flow is easier to follow than inline state management.
src/lightspeed_evaluation/core/output/generator.py (6)
69-83: LGTM! Streaming stats properly computed or initialized. The code correctly calls calculate_streaming_stats when evaluation data is available and provides a sensible fallback structure with zero counts when no data is present, maintaining consistency across the reporting flow.
120-140: LGTM! Streaming stats consistently passed to report generators. Both JSON and TXT generation receive streaming stats via stats.get("streaming", {}), ensuring graceful handling even if the key is missing from the stats dictionary.
215-238: LGTM! JSON summary integrates streaming performance metrics. The streaming_stats parameter is properly added to the signature, and the metrics are embedded in the summary under streaming_performance, providing a top-level view of streaming performance alongside other summary statistics.
251-254: LGTM! Per-result streaming metrics included in JSON output. Each result entry in the JSON output includes the three streaming metrics (time_to_first_token, streaming_duration, tokens_per_second), enabling per-turn analysis of streaming performance.
272-291: LGTM! Text summary generation includes streaming stats. The streaming_stats parameter is properly threaded through to _generate_text_summary and rendered via the new _write_streaming_stats helper.
326-357: LGTM! Streaming stats section renders performance metrics clearly. The _write_streaming_stats method provides a well-formatted text representation of streaming performance metrics (TTFT, duration, throughput) with mean, median, min, and max values. The early return when count == 0 prevents rendering empty sections.
asamal4
left a comment
LGTM, few minor changes. Also please update the readme.
Actionable comments posted: 1
🧹 Nitpick comments (3)
src/lightspeed_evaluation/core/api/streaming_parser.py (2)
17-40: Consider enhancing docstrings for better maintainability. While _PerformanceTracker is a private class, its methods would benefit from Google-style docstrings that document parameters and return values. This improves maintainability.
🔎 Suggested docstring improvements
```diff
 class _PerformanceTracker(BaseModel):
-    """Tracks streaming performance metrics (TTFT, throughput)."""
+    """Tracks streaming performance metrics (TTFT, throughput).
+
+    Attributes:
+        stream_start_time: Timestamp when streaming started (perf_counter).
+        time_to_first_token: Time in seconds to first content token, or None.
+    """

     def capture_ttft(self) -> None:
-        """Capture time to first token if not already captured."""
+        """Capture time to first token if not already captured.
+
+        Sets time_to_first_token on first call; subsequent calls are no-ops.
+        """

     def get_metrics(self, output_tokens: int) -> tuple[float, Optional[float]]:
-        """Calculate streaming duration and tokens per second."""
+        """Calculate streaming duration and tokens per second.
+
+        Args:
+            output_tokens: Number of output tokens generated.
+
+        Returns:
+            Tuple of (streaming_duration, tokens_per_second).
+        """
```
42-69: Enhance docstrings for public API per coding guidelines.
StreamingContext is a public class, so per the coding guidelines, it should use Google-style docstrings with proper Attributes and Returns sections.
🔎 Suggested docstring improvements
```diff
 class StreamingContext(BaseModel):
-    """Context for streaming response parsing."""
+    """Context for streaming response parsing.
+
+    Attributes:
+        conversation_id: Conversation identifier from API.
+        final_response: Complete response text after streaming.
+        tool_calls: List of tool call dictionaries.
+        input_tokens: Number of input tokens consumed.
+        output_tokens: Number of output tokens generated.
+        perf: Performance tracker for TTFT and throughput metrics.
+    """

     def to_response_dict(self) -> dict[str, Any]:
-        """Convert context to response dictionary."""
+        """Convert context to response dictionary.
+
+        Returns:
+            Dictionary containing response, tool_calls, conversation_id,
+            token counts, and streaming performance metrics.
+        """
```
src/lightspeed_evaluation/core/models/api.py (1)
98-134: Expand docstring to Google-style format. The from_raw_response classmethod has a minimal docstring that doesn't document parameters or return value. Per coding guidelines, all public APIs should use Google-style docstrings.
🔎 Proposed docstring enhancement
```diff
 @classmethod
 def from_raw_response(cls, raw_data: dict[str, Any]) -> "APIResponse":
-    """Create APIResponse from raw API response data."""
+    """Create APIResponse from raw API response data.
+
+    Args:
+        raw_data: Dictionary containing raw API response with keys:
+            - response: Response text (required)
+            - conversation_id: Conversation tracking ID (required)
+            - tool_calls: Tool call sequences (optional)
+            - rag_chunks: RAG context chunks (optional)
+            - input_tokens: Input token count (optional)
+            - output_tokens: Output token count (optional)
+            - time_to_first_token: TTFT metric for streaming (optional)
+            - streaming_duration: Total streaming time (optional)
+            - tokens_per_second: Throughput metric (optional)
+
+    Returns:
+        APIResponse instance with populated fields.
+
+    Raises:
+        ValueError: If conversation_id is missing from raw_data.
+    """
```
As per coding guidelines, all public APIs require Google-style docstrings.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
- README.md
- config/system.yaml
- src/lightspeed_evaluation/core/api/streaming_parser.py
- src/lightspeed_evaluation/core/constants.py
- src/lightspeed_evaluation/core/models/__init__.py
- src/lightspeed_evaluation/core/models/api.py
- src/lightspeed_evaluation/core/models/data.py
- src/lightspeed_evaluation/core/models/mixins.py
- src/lightspeed_evaluation/core/output/generator.py
- src/lightspeed_evaluation/core/output/statistics.py
- src/lightspeed_evaluation/pipeline/evaluation/amender.py
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- tests/unit/core/api/test_streaming_parser.py
- tests/unit/core/models/test_api_additional.py
- tests/unit/core/output/test_generator.py
🚧 Files skipped from review as they are similar to previous changes (8)
- src/lightspeed_evaluation/core/models/__init__.py
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- tests/unit/core/api/test_streaming_parser.py
- src/lightspeed_evaluation/core/output/statistics.py
- src/lightspeed_evaluation/core/models/data.py
- tests/unit/core/output/test_generator.py
- src/lightspeed_evaluation/core/constants.py
- config/system.yaml
🧰 Additional context used
📓 Path-based instructions (3)
src/lightspeed_evaluation/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/lightspeed_evaluation/**/*.py: Provide type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels in lightspeed_evaluation modules
Files:
- src/lightspeed_evaluation/core/models/mixins.py
- src/lightspeed_evaluation/core/api/streaming_parser.py
- src/lightspeed_evaluation/core/output/generator.py
- src/lightspeed_evaluation/core/models/api.py
- src/lightspeed_evaluation/pipeline/evaluation/amender.py
**/test_*.py
📄 CodeRabbit inference engine (AGENTS.md)
Use pytest mocking (mocker fixture) instead of unittest.mock
Files:
tests/unit/core/models/test_api_additional.py
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Mirror test structure to source code organization, with test files named test_*.py and test functions/classes named test_* and Test*
Files:
tests/unit/core/models/test_api_additional.py
🧠 Learnings (9)
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.
Applied to files:
- src/lightspeed_evaluation/core/api/streaming_parser.py
- src/lightspeed_evaluation/core/models/api.py
📚 Learning: 2026-01-06T10:26:12.614Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 127
File: src/lightspeed_evaluation/core/api/streaming_parser.py:122-149
Timestamp: 2026-01-06T10:26:12.614Z
Learning: In streaming_parser.py, using ValueError for response validation errors is intentional because the API boundary maps these to APIError via _handle_validation_error (see client.py lines 192-193). Ensure downstream validation consistently raises ValueError and that the API boundary translation preserves error semantics; update tests to verify that APIError is raised at the boundary with appropriate messaging.
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-12-11T10:05:06.422Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-11T10:05:06.422Z
Learning: Applies to src/lightspeed_evaluation/**/*.py : Use custom exceptions from core.system.exceptions for error handling
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2026-01-06T10:26:12.614Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 127
File: src/lightspeed_evaluation/core/api/streaming_parser.py:122-149
Timestamp: 2026-01-06T10:26:12.614Z
Learning: DataValidationError in src/lightspeed_evaluation/core/system/exceptions.py is used exclusively for data file validation scenarios (YAML parsing, Pydantic model validation, file existence checks) in validator.py, not for runtime API response validation errors.
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-08-22T09:16:29.070Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 26
File: lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py:124-151
Timestamp: 2025-08-22T09:16:29.070Z
Learning: In lsc_agent_eval project, the maintainer (asamal4) prefers reactive error handling - adding support for additional error response fields only when they occur in practice, rather than preemptively handling all possible error formats.
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-09-09T14:58:10.630Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/pipeline/evaluation/amender.py:32-41
Timestamp: 2025-09-09T14:58:10.630Z
Learning: In the lightspeed-evaluation framework, when API is enabled, every turn should make a fresh API call regardless of whether the turn already has response or tool_calls data. This ensures consistency and fresh responses for each evaluation run.
Applied to files:
README.md
📚 Learning: 2025-07-16T13:20:45.006Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:45.006Z
Learning: In the lsc_agent_eval package, evaluation results use distinct values: "FAIL" means the evaluation ran successfully but the result was negative, while "ERROR" means there was an issue executing the evaluation itself (e.g., setup script failed, API connection failed).
Applied to files:
README.md
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).
Applied to files:
README.md
📚 Learning: 2025-12-11T10:05:06.422Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-11T10:05:06.422Z
Learning: Applies to src/lightspeed_evaluation/**/*.py : Use Google-style docstrings for all public APIs
Applied to files:
src/lightspeed_evaluation/core/output/generator.py
🧬 Code graph analysis (3)
tests/unit/core/models/test_api_additional.py (1)
src/lightspeed_evaluation/core/models/api.py (2)
- APIResponse (82-134)
- from_raw_response (99-134)
src/lightspeed_evaluation/core/output/generator.py (1)
src/lightspeed_evaluation/core/output/statistics.py (1)
calculate_streaming_stats(252-293)
src/lightspeed_evaluation/core/models/api.py (1)
src/lightspeed_evaluation/core/models/mixins.py (1)
StreamingMetricsMixin(8-33)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: tests (3.13)
- GitHub Check: tests (3.12)
🔇 Additional comments (17)
README.md (1)
14-14: Documentation looks good! The streaming performance metrics feature is clearly documented with appropriate detail for users. The table format and output descriptions provide good clarity.
Also applies to: 333-346
src/lightspeed_evaluation/core/api/streaming_parser.py (5)
5-5: Imports and constants are appropriate. The new imports support the performance tracking and Pydantic model definitions, and the CONTENT_EVENTS constant is properly defined for TTFT capture logic.
Also applies to: 9-9, 14-14
71-87: LGTM! The tokens-per-second calculation correctly excludes TTFT from the throughput calculation and handles edge cases appropriately (no tokens, missing TTFT, negative generation time).
122-142: LGTM! The event processing logic is clean and correctly handles all streaming events. The ValueError exceptions are appropriate here, as they're converted to APIError at the API boundary in client.py (based on learnings).
144-149: LGTM! The validation logic correctly ensures required fields are present before returning the response. The ValueError exceptions align with the established pattern (converted to APIError at the API boundary).
90-120: Remove unnecessary pylint disable comment and use custom exceptions for error handling. The pylint: disable=too-many-branches comment is no longer needed; pylint does not report this issue after the refactoring. Additionally, replace ValueError exceptions (lines 127, 147, 149) with APIError from core.system.exceptions to comply with the framework's error handling guidelines.
tests/unit/core/models/test_api_additional.py (1)
168-228: Excellent test coverage! The streaming metrics tests cover all the important scenarios (see the sketch after this list):
- Direct instantiation with/without metrics
- Parsing from raw response with/without metrics
- Proper handling of None defaults for non-streaming endpoints
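A compressed sketch of such a parsing check; the payload keys mirror those the review lists for from_raw_response, and the concrete values are placeholders:

```python
# Hypothetical parsing check for streaming metrics survival through
# APIResponse.from_raw_response; values are illustrative only.
from lightspeed_evaluation.core.models.api import APIResponse


def test_from_raw_response_parses_streaming_metrics():
    raw = {
        "response": "hello",
        "conversation_id": "conv-1",
        "input_tokens": 10,
        "output_tokens": 42,
        "time_to_first_token": 0.25,
        "streaming_duration": 1.5,
        "tokens_per_second": 33.6,
    }
    parsed = APIResponse.from_raw_response(raw)
    assert parsed.time_to_first_token == 0.25
    assert parsed.streaming_duration == 1.5
    assert parsed.tokens_per_second == 33.6
```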
src/lightspeed_evaluation/core/models/mixins.py (1)
1-33: LGTM! Well-designed mixin. The StreamingMetricsMixin is well-documented with clear field descriptions and appropriate type constraints. The ge=0 constraint correctly allows zero values while preventing negative metrics.
src/lightspeed_evaluation/core/models/api.py (3)
7-8: LGTM! The import of StreamingMetricsMixin is clean and necessary for the inheritance change in APIResponse.
82-82: LGTM! The inheritance change from BaseModel to StreamingMetricsMixin is correct and cleanly adds streaming performance metrics to the API response model.
119-122: LGTM! The extraction of streaming metrics using .get() is appropriate and aligns with the StreamingMetricsMixin fields that default to None for non-streaming endpoints.
src/lightspeed_evaluation/core/output/generator.py (6)
20-20: LGTM! The import of calculate_streaming_stats is clean and necessary for computing streaming performance metrics.
69-79: LGTM! The conditional calculation of streaming stats is well-structured. Setting stats["streaming"] = {} when evaluation_data is absent provides a safe default and avoids downstream KeyErrors.
104-136: LGTM! Passing stats.get("streaming", {}) to both JSON and text summary methods is defensive and ensures safe handling even if the stats structure changes.
204-263: LGTM! The JSON summary generation correctly integrates streaming metrics:
- streaming_performance is conditionally included in summary_stats only when data exists (line 233), keeping the summary clean.
- Individual result entries always include streaming metric fields (lines 252-254), even if None, which provides a predictable structure for machine parsing.
Both approaches serve their respective contexts well.
265-299: LGTM! The text summary generation cleanly integrates the streaming statistics section via _write_streaming_stats, maintaining consistency with other statistics sections.
326-357: LGTM! The _write_streaming_stats method has clean logic (see the sketch after this list):
- Early return when no streaming data is available (line 333-334), avoiding empty sections.
- Conditional rendering of each metric based on count, providing flexibility for partial data.
- Appropriate numeric formatting (3 decimals for time, 1 for throughput) for readability.
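A minimal sketch of that rendering style, assuming the aggregated stats dictionary shape discussed earlier; labels, precision, and the writer interface are illustrative:

```python
# Illustrative rendering of aggregated streaming stats into the text summary.
from typing import TextIO


def write_streaming_section(out: TextIO, stats: dict[str, dict[str, float]]) -> None:
    ttft = stats.get("time_to_first_token", {})
    if not ttft.get("count"):
        return  # nothing was streamed, skip the section entirely
    out.write("STREAMING PERFORMANCE\n")
    out.write(
        f"  Time to first token: mean {ttft['mean']:.3f}s, median {ttft['median']:.3f}s, "
        f"min {ttft['min']:.3f}s, max {ttft['max']:.3f}s\n"
    )
    tps = stats.get("tokens_per_second", {})
    if tps.get("count"):
        out.write(f"  Throughput: mean {tps['mean']:.1f} tokens/s\n")
```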
VladimirKadlec
left a comment
LGTM
Force-pushed from e6e633b to 827856a
Actionable comments posted: 1
🤖 Fix all issues with AI Agents
In @src/lightspeed_evaluation/core/api/streaming_parser.py:
- Around line 90-119: The docstring for the public function
parse_streaming_response is not Google-style; update it to include an "Args"
section documenting the response parameter (type httpx.Response and that it must
be a streaming response) and a "Returns" section describing the returned
dict[str, Any] (what keys are present/meaning), while preserving the existing
description of TTFT, streaming duration and tokens/sec; keep the brief summary
line and place detailed descriptions under Args and Returns, and ensure the
docstring is triple-quoted and located immediately above
parse_streaming_response.
🧹 Nitpick comments (2)
src/lightspeed_evaluation/core/api/streaming_parser.py (2)
17-39: Consider expanding docstrings for better maintainability. While private classes aren't strictly required to have Google-style docstrings per the coding guidelines, the _PerformanceTracker class would benefit from more detailed documentation:
- Class docstring could document the attributes and their purpose
- capture_ttft() could document that it's idempotent (only captures once)
- get_metrics() could document the return tuple structure
This would improve maintainability for future developers working with the performance tracking logic.
42-68: Consider expanding docstrings for better maintainability. The StreamingContext class would benefit from more detailed documentation:
- Class docstring could document the attributes and the overall parsing state it maintains
- to_response_dict() could document the return dictionary structure with all the keys (response, tool_calls, conversation_id, tokens, metrics)
This would help future maintainers understand the streaming context lifecycle and response format.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
- README.md
- config/system.yaml
- src/lightspeed_evaluation/core/api/streaming_parser.py
- src/lightspeed_evaluation/core/constants.py
- src/lightspeed_evaluation/core/models/__init__.py
- src/lightspeed_evaluation/core/models/api.py
- src/lightspeed_evaluation/core/models/data.py
- src/lightspeed_evaluation/core/models/mixins.py
- src/lightspeed_evaluation/core/output/generator.py
- src/lightspeed_evaluation/core/output/statistics.py
- src/lightspeed_evaluation/pipeline/evaluation/amender.py
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- tests/unit/core/api/test_streaming_parser.py
- tests/unit/core/models/test_api_additional.py
- tests/unit/core/output/test_generator.py
🚧 Files skipped from review as they are similar to previous changes (7)
- src/lightspeed_evaluation/core/constants.py
- src/lightspeed_evaluation/pipeline/evaluation/amender.py
- README.md
- config/system.yaml
- src/lightspeed_evaluation/core/models/mixins.py
- src/lightspeed_evaluation/core/models/__init__.py
- tests/unit/core/models/test_api_additional.py
🧰 Additional context used
📓 Path-based instructions (3)
src/lightspeed_evaluation/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/lightspeed_evaluation/**/*.py: Provide type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels in lightspeed_evaluation modules
Files:
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- src/lightspeed_evaluation/core/models/api.py
- src/lightspeed_evaluation/core/api/streaming_parser.py
- src/lightspeed_evaluation/core/output/statistics.py
- src/lightspeed_evaluation/core/models/data.py
- src/lightspeed_evaluation/core/output/generator.py
**/test_*.py
📄 CodeRabbit inference engine (AGENTS.md)
Use pytest mocking (mocker fixture) instead of unittest.mock
Files:
- tests/unit/core/api/test_streaming_parser.py
- tests/unit/core/output/test_generator.py
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Mirror test structure to source code organization, with test files named test_*.py and test functions/classes named test_* and Test*
Files:
- tests/unit/core/api/test_streaming_parser.py
- tests/unit/core/output/test_generator.py
🧠 Learnings (13)
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.
Applied to files:
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.
Applied to files:
- src/lightspeed_evaluation/core/models/api.py
- src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2026-01-06T10:26:12.614Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 127
File: src/lightspeed_evaluation/core/api/streaming_parser.py:122-149
Timestamp: 2026-01-06T10:26:12.614Z
Learning: In streaming_parser.py, using ValueError for response validation errors is intentional because the API boundary maps these to APIError via _handle_validation_error (see client.py lines 192-193). Ensure downstream validation consistently raises ValueError and that the API boundary translation preserves error semantics; update tests to verify that APIError is raised at the boundary with appropriate messaging.
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-12-11T10:05:06.422Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-11T10:05:06.422Z
Learning: Applies to src/lightspeed_evaluation/**/*.py : Use custom exceptions from core.system.exceptions for error handling
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2026-01-06T10:26:12.614Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 127
File: src/lightspeed_evaluation/core/api/streaming_parser.py:122-149
Timestamp: 2026-01-06T10:26:12.614Z
Learning: DataValidationError in src/lightspeed_evaluation/core/system/exceptions.py is used exclusively for data file validation scenarios (YAML parsing, Pydantic model validation, file existence checks) in validator.py, not for runtime API response validation errors.
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-08-22T09:16:29.070Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 26
File: lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py:124-151
Timestamp: 2025-08-22T09:16:29.070Z
Learning: In lsc_agent_eval project, the maintainer (asamal4) prefers reactive error handling - adding support for additional error response fields only when they occur in practice, rather than preemptively handling all possible error formats.
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-09-11T12:47:06.747Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.
Applied to files:
- src/lightspeed_evaluation/core/models/data.py
- src/lightspeed_evaluation/core/output/generator.py
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.
Applied to files:
src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-12-11T10:05:06.422Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-11T10:05:06.422Z
Learning: Applies to src/lightspeed_evaluation/**/*.py : Use Google-style docstrings for all public APIs
Applied to files:
src/lightspeed_evaluation/core/output/generator.py
🧬 Code graph analysis (6)
tests/unit/core/api/test_streaming_parser.py (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
parse_streaming_response(90-119)
src/lightspeed_evaluation/core/models/api.py (1)
src/lightspeed_evaluation/core/models/mixins.py (1)
StreamingMetricsMixin(8-33)
src/lightspeed_evaluation/core/output/statistics.py (1)
src/lightspeed_evaluation/core/models/data.py (1)
EvaluationData(318-381)
src/lightspeed_evaluation/core/models/data.py (1)
src/lightspeed_evaluation/core/models/mixins.py (1)
StreamingMetricsMixin(8-33)
tests/unit/core/output/test_generator.py (1)
src/lightspeed_evaluation/core/output/generator.py (2)
- _generate_json_summary (204-263)
- _generate_text_summary (265-299)
src/lightspeed_evaluation/core/output/generator.py (1)
src/lightspeed_evaluation/core/output/statistics.py (2)
- calculate_streaming_stats (252-293)
- calculate_api_token_usage (227-249)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: tests (3.12)
🔇 Additional comments (19)
src/lightspeed_evaluation/core/output/statistics.py (1)
252-293: LGTM! Clean implementation of streaming statistics. The new calculate_streaming_stats function follows all coding guidelines:
- Proper type hints for the public API
- Google-style docstring with clear Args and Returns sections
- Correct aggregation logic that filters None values and handles empty datasets
- Consistent statistical measures (count, mean, median, min, max, std) matching the existing _finalize_metric_stats pattern
The internal helper _calc_stats appropriately returns {"count": 0} for empty inputs, preventing division-by-zero errors.
src/lightspeed_evaluation/core/models/data.py (3)
src/lightspeed_evaluation/core/models/data.py (3)
10-10: LGTM! Proper mixin import. The StreamingMetricsMixin import is correctly added to support streaming metrics integration.
36-36: LGTM! Correct inheritance for turn-level streaming metrics. TurnData now inherits from StreamingMetricsMixin, which provides the three streaming performance fields (time_to_first_token, streaming_duration, tokens_per_second). This is the appropriate level for capturing per-turn streaming metrics from API responses.
384-384: LGTM! Correct inheritance for result-level streaming metrics. EvaluationResult now inherits from StreamingMetricsMixin, enabling streaming metrics to propagate from TurnData through to evaluation results. This supports the downstream reporting requirements shown in the AI summary.
src/lightspeed_evaluation/core/models/api.py (3)
7-7: LGTM! Proper mixin import for API responses. The StreamingMetricsMixin import enables APIResponse to carry streaming metrics from the lightspeed-stack API.
82-82: LGTM! Correct inheritance at API boundary. APIResponse now inherits from StreamingMetricsMixin, which is the appropriate entry point for streaming metrics into the evaluation framework.
119-134: LGTM! Proper extraction and propagation of streaming metrics. The streaming metrics are correctly extracted from the raw API response using .get() (returns None if absent) and passed to the APIResponse constructor. This aligns with the mixin's optional field design, where metrics are None for non-streaming endpoints.
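A rough sketch of that extraction step, assuming a plain dict payload (the metric keys come from this review; the surrounding response fields are illustrative only):

```python
from typing import Any, Optional


def extract_streaming_metrics(raw: dict[str, Any]) -> dict[str, Optional[float]]:
    """Optional metrics: .get() simply yields None for non-streaming responses."""
    return {
        "time_to_first_token": raw.get("time_to_first_token"),
        "streaming_duration": raw.get("streaming_duration"),
        "tokens_per_second": raw.get("tokens_per_second"),
    }


# A non-streaming payload produces all-None metrics, matching the mixin defaults.
print(extract_streaming_metrics({"response": "Hello!", "conversation_id": "abc-123"}))
```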
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
185-192: LGTM! Streaming metrics properly included in success results. The three streaming performance metrics (time_to_first_token, streaming_duration, tokens_per_second) are correctly propagated from turn_data to the evaluation result, with proper None handling when turn_data is absent.
218-236: LGTM! Consistent streaming metrics in error results. The error result path mirrors the success path by including streaming metrics from turn_data. The local variable pattern (line 218) improves readability and maintains consistency with the success path (line 165).
tests/unit/core/output/test_generator.py (2)
111-120: LGTM! Test updated to include streaming_stats parameter. The test correctly initializes streaming_stats as an empty dict and passes it to _generate_json_summary, aligning with the updated generator signature shown in the relevant code snippets.
143-152: LGTM! Test updated to include streaming_stats parameter. The test correctly initializes streaming_stats as an empty dict and passes it to _generate_text_summary, aligning with the updated generator signature shown in the relevant code snippets.
src/lightspeed_evaluation/core/api/streaming_parser.py (2)
71-87: LGTM! Solid edge case handling. The tokens-per-second calculation correctly excludes TTFT from throughput and handles all edge cases (no tokens, no TTFT, zero/negative generation time). The debug logging is helpful for observability.
122-149: LGTM! Event processing and validation logic is correct. The event handling properly processes all streaming event types (start, turn_complete, tool_call, end, error) and validation ensures required fields are present. The use of ValueError is correct per the architecture: these exceptions are translated to APIError at the API boundary in client.py. Based on learnings, ValueError is the appropriate exception type for internal validation in streaming_parser.py.
tests/unit/core/api/test_streaming_parser.py (2)
36-39: LGTM! Good test coverage extension. The assertions verify that the performance metrics are included in the response. This aligns well with the new functionality added to the streaming parser.
260-377: LGTM! Comprehensive performance metrics test coverage. The new TestStreamingPerformanceMetrics class provides excellent coverage of the streaming metrics functionality:
- TTFT capture on various event types (token, tool_call, turn_complete)
- Streaming duration measurement
- Tokens-per-second calculation with edge cases (with/without token counts)
- Complete end-to-end flow validation
The tests properly use the pytest mocker fixture and follow the required test structure conventions.
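For readers unfamiliar with the pattern, here is a standalone illustration of making a timing assertion deterministic with the pytest-mock mocker fixture; the helper function and patch target are invented for the example and are not the project's test code.

```python
import time


def measure_ttft(tokens):
    """Toy stand-in for the parser: returns time from start to the first token."""
    start = time.perf_counter()
    for _ in tokens:
        return time.perf_counter() - start
    return None


def test_ttft_is_deterministic_when_clock_is_patched(mocker):
    # Two perf_counter() calls: stream start, then first token.
    mocker.patch("time.perf_counter", side_effect=[100.0, 100.25])
    assert measure_ttft(["hello"]) == 0.25
```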
src/lightspeed_evaluation/core/output/generator.py (4)
20-20: LGTM! Clean streaming stats integration. The streaming statistics are properly integrated into the report generation flow with the same pattern used for API token usage: calculate when data is available, otherwise initialize as empty. This ensures downstream code can safely access streaming stats without errors.
Also applies to: 69-79
204-263: LGTM! Well-structured JSON summary integration. The streaming statistics are properly integrated into the JSON output:
- Conditionally includes the streaming_performance section only when data is available (lines 233-234)
- Per-result entries include streaming metrics fields (lines 251-254)
- Function signature properly updated with streaming_stats parameter (line 211)
The conditional inclusion pattern prevents empty sections in the summary while maintaining consistent per-result structure.
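As a sketch of that conditional shape (the key names follow this review; the summary structure around them is assumed, not copied from generator.py):

```python
from typing import Any


def build_summary(results: list[dict[str, Any]], streaming_stats: dict[str, dict]) -> dict:
    """Assemble a JSON-ready summary with an optional streaming_performance section."""
    summary: dict[str, Any] = {"results": []}
    # Emit the aggregate section only when at least one metric has samples.
    if any(stats.get("count", 0) > 0 for stats in streaming_stats.values()):
        summary["streaming_performance"] = streaming_stats
    for r in results:
        summary["results"].append(
            {
                "result": r.get("result"),
                # Per-result fields stay present (possibly None) so the schema is stable.
                "time_to_first_token": r.get("time_to_first_token"),
                "streaming_duration": r.get("streaming_duration"),
                "tokens_per_second": r.get("tokens_per_second"),
            }
        )
    return summary
```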
265-299: LGTM! Consistent text summary integration. The streaming statistics are properly integrated into the text summary generation, following the same pattern as other statistics sections (overall, tokens, metrics, conversations).
326-357: LGTM! Clean streaming stats rendering with proper guards. The _write_streaming_stats method correctly:
- Returns early if no streaming data is available (line 334)
- Conditionally writes each metric section only when data exists (count > 0)
- Formats statistics consistently with mean, median, min, max values
- Maintains the same formatting style as other statistics sections
The defensive programming with count checks ensures clean output even with partial streaming data.
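A simplified sketch of that guarded rendering, assuming the stats dict shape shown earlier (the section title, units, and exact layout are placeholders, not the project's output format):

```python
import io
from typing import TextIO


def write_streaming_stats(out: TextIO, streaming_stats: dict) -> None:
    """Append a streaming-performance section; skip it entirely when there is no data."""
    if not streaming_stats:
        return
    out.write("\nSTREAMING PERFORMANCE\n")
    for name, unit, fmt in (
        ("time_to_first_token", "s", "{:.3f}"),
        ("streaming_duration", "s", "{:.3f}"),
        ("tokens_per_second", " tok/s", "{:.1f}"),
    ):
        stats = streaming_stats.get(name, {})
        if stats.get("count", 0) == 0:
            continue  # only write sections that actually have samples
        values = ", ".join(
            f"{label}={fmt.format(stats[label])}{unit}"
            for label in ("mean", "median", "min", "max")
        )
        out.write(f"{name}: {values}\n")


buf = io.StringIO()
write_streaming_stats(
    buf,
    {"tokens_per_second": {"count": 2, "mean": 41.5, "median": 41.5, "min": 40.0, "max": 43.0}},
)
print(buf.getvalue())
```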
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
114-116: Docstring mentions exception types that differ from actual raises. The Raises section documents APIError and DataValidationError, but the code raises ValueError. While the conversion happens in client.py, this could confuse developers reading just this module. Consider updating the docstring to reflect the actual exception raised, with a note about the boundary conversion:
🔎 Suggested docstring clarification
```diff
     Raises:
-        APIError: If an error event is received from the streaming API.
-        DataValidationError: If required response fields are missing.
+        ValueError: If streaming response is invalid (error event received,
+            missing final_response, or missing conversation_id). Note: These
+            are converted to APIError at the client boundary.
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
- README.md
- config/system.yaml
- src/lightspeed_evaluation/core/api/streaming_parser.py
- src/lightspeed_evaluation/core/constants.py
- src/lightspeed_evaluation/core/models/__init__.py
- src/lightspeed_evaluation/core/models/api.py
- src/lightspeed_evaluation/core/models/data.py
- src/lightspeed_evaluation/core/models/mixins.py
- src/lightspeed_evaluation/core/output/generator.py
- src/lightspeed_evaluation/core/output/statistics.py
- src/lightspeed_evaluation/pipeline/evaluation/amender.py
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- tests/unit/core/api/test_streaming_parser.py
- tests/unit/core/models/test_api_additional.py
- tests/unit/core/output/test_generator.py
🚧 Files skipped from review as they are similar to previous changes (9)
- config/system.yaml
- src/lightspeed_evaluation/core/constants.py
- tests/unit/core/output/test_generator.py
- src/lightspeed_evaluation/core/models/api.py
- tests/unit/core/models/test_api_additional.py
- src/lightspeed_evaluation/pipeline/evaluation/amender.py
- src/lightspeed_evaluation/core/models/__init__.py
- src/lightspeed_evaluation/core/output/statistics.py
- README.md
🧰 Additional context used
📓 Path-based instructions (3)
src/lightspeed_evaluation/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/lightspeed_evaluation/**/*.py: Provide type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels in lightspeed_evaluation modules
Files:
src/lightspeed_evaluation/core/models/mixins.py
src/lightspeed_evaluation/core/api/streaming_parser.py
src/lightspeed_evaluation/core/output/generator.py
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
src/lightspeed_evaluation/core/models/data.py
**/test_*.py
📄 CodeRabbit inference engine (AGENTS.md)
Use pytest mocking (mocker fixture) instead of unittest.mock
Files:
tests/unit/core/api/test_streaming_parser.py
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Mirror test structure to source code organization, with test files named test_*.py and test functions/classes named test_* and Test*
Files:
tests/unit/core/api/test_streaming_parser.py
🧠 Learnings (13)
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2026-01-06T10:26:12.614Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 127
File: src/lightspeed_evaluation/core/api/streaming_parser.py:122-149
Timestamp: 2026-01-06T10:26:12.614Z
Learning: In streaming_parser.py, using ValueError for response validation errors is intentional because the API boundary maps these to APIError via _handle_validation_error (see client.py lines 192-193). Ensure downstream validation consistently raises ValueError and that the API boundary translation preserves error semantics; update tests to verify that APIError is raised at the boundary with appropriate messaging.
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-12-11T10:05:06.422Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-11T10:05:06.422Z
Learning: Applies to src/lightspeed_evaluation/**/*.py : Use custom exceptions from core.system.exceptions for error handling
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2026-01-06T10:26:12.614Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 127
File: src/lightspeed_evaluation/core/api/streaming_parser.py:122-149
Timestamp: 2026-01-06T10:26:12.614Z
Learning: DataValidationError in src/lightspeed_evaluation/core/system/exceptions.py is used exclusively for data file validation scenarios (YAML parsing, Pydantic model validation, file existence checks) in validator.py, not for runtime API response validation errors.
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-08-22T09:16:29.070Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 26
File: lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py:124-151
Timestamp: 2025-08-22T09:16:29.070Z
Learning: In lsc_agent_eval project, the maintainer (asamal4) prefers reactive error handling - adding support for additional error response fields only when they occur in practice, rather than preemptively handling all possible error formats.
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-12-11T10:05:06.422Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-11T10:05:06.422Z
Learning: Applies to src/lightspeed_evaluation/**/*.py : Use Google-style docstrings for all public APIs
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
src/lightspeed_evaluation/core/output/generator.py
📚 Learning: 2025-09-11T12:47:06.747Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.
Applied to files:
src/lightspeed_evaluation/core/output/generator.py
src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.
Applied to files:
src/lightspeed_evaluation/core/models/data.py
🧬 Code graph analysis (3)
tests/unit/core/api/test_streaming_parser.py (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
parse_streaming_response(90-137)
src/lightspeed_evaluation/core/output/generator.py (1)
src/lightspeed_evaluation/core/output/statistics.py (2)
calculate_streaming_stats(252-293)
calculate_api_token_usage(227-249)
src/lightspeed_evaluation/core/models/data.py (1)
src/lightspeed_evaluation/core/models/mixins.py (1)
StreamingMetricsMixin(8-33)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: tests (3.13)
- GitHub Check: tests (3.12)
🔇 Additional comments (19)
src/lightspeed_evaluation/core/models/mixins.py (1)
1-33: LGTM! The StreamingMetricsMixin is well-designed as a reusable Pydantic mixin; a condensed sketch follows the list below. The implementation correctly:
- Uses Optional[float] with default=None for optional streaming metrics
- Applies ge=0 validation constraints to prevent negative values
- Includes clear field descriptions for documentation
- Provides a helpful class docstring explaining when fields are populated
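For orientation, a condensed sketch of what such a mixin looks like in Pydantic v2; the field descriptions are paraphrased from this review and the TurnData consumer is illustrative, not a copy of the project files.

```python
from typing import Optional

from pydantic import BaseModel, Field


class StreamingMetricsMixin(BaseModel):
    """Streaming metrics are optional: they stay None for non-streaming API calls."""

    time_to_first_token: Optional[float] = Field(
        default=None, ge=0, description="Seconds from request start to first content event"
    )
    streaming_duration: Optional[float] = Field(
        default=None, ge=0, description="Total streaming time in seconds"
    )
    tokens_per_second: Optional[float] = Field(
        default=None, ge=0, description="Output-token throughput during generation"
    )


class TurnData(StreamingMetricsMixin):
    """Example consumer: inherits the three optional metric fields."""

    response: str = ""


print(TurnData(response="hi", time_to_first_token=0.42).model_dump())
```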
src/lightspeed_evaluation/core/models/data.py (4)
10-10: LGTM! Clean import of the new mixin for use in this module.
36-36: LGTM! TurnData now inherits from StreamingMetricsMixin, which cleanly adds the streaming performance fields (time_to_first_token, streaming_duration, tokens_per_second) with proper validation constraints. This is the appropriate location since streaming metrics are captured at the turn level from API responses.
384-384: LGTM! EvaluationResult now inherits from StreamingMetricsMixin, ensuring streaming metrics are available in evaluation results with consistent validation (ge=0). This addresses the previous review comment about adding non-negative validation constraints to these fields.
427-429: LGTM! The tool_calls field addition is appropriate with Optional[str] type (formatted string representation) and clear description.
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
185-192: LGTM! Streaming performance metrics are correctly propagated from turn_data to the success EvaluationResult. The conditional pattern turn_data.X if turn_data else None is consistent with how other optional fields are handled.
218-235: LGTM! The error result creation now:
- Uses a local turn_data variable for cleaner access
- Propagates streaming metrics consistently with the success path
- Properly handles the case when turn_data is None
This ensures streaming metrics are captured even for failed evaluations, which is valuable for performance analysis.
turn_dataisNoneThis ensures streaming metrics are captured even for failed evaluations, which is valuable for performance analysis.
src/lightspeed_evaluation/core/api/streaming_parser.py (5)
17-39: LGTM!
_PerformanceTrackeris well-designed:
- Uses Pydantic
BaseModelfor consistency (addresses previous review feedback)PrivateAttrfor internal state (_first_content_received) that shouldn't be serializedcapture_ttft()correctly captures TTFT only onceget_metrics()cleanly delegates throughput calculation to the utility function
42-68: LGTM!
StreamingContext effectively encapsulates streaming state:
- All fields have appropriate defaults
- to_response_dict() provides clean serialization for the final response
- Composition with _PerformanceTracker keeps concerns separated
71-87: LGTM! The _calculate_tokens_per_second function correctly:
- Returns None when output_tokens ≤ 0 or TTFT is unavailable
- Excludes TTFT from the throughput calculation (generation_time = total_duration - ttft)
- Returns None if generation_time ≤ 0 (edge case protection)
- Logs performance metrics at debug level
A minimal sketch of this calculation appears below.
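The parameter names in this sketch are assumptions based on the bullets above; it shows the guarded arithmetic only, not the project's logging.

```python
from typing import Optional


def tokens_per_second(
    output_tokens: int, total_duration: float, ttft: Optional[float]
) -> Optional[float]:
    """Throughput over the generation phase only; TTFT is excluded from the denominator."""
    if output_tokens <= 0 or ttft is None:
        return None
    generation_time = total_duration - ttft
    if generation_time <= 0:
        return None  # guards against clock skew or zero-length generation
    return output_tokens / generation_time


assert tokens_per_second(0, 2.0, 0.5) is None
assert tokens_per_second(60, 2.0, 0.5) == 40.0
```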
90-137: LGTM! The refactored parse_streaming_response function:
- Has a proper Google-style docstring with Args, Returns, and Raises sections (addresses previous review feedback)
- Uses the new context-based flow for cleaner state management
- Captures TTFT on content events (token, tool_call, turn_complete)
- Delegates event processing and validation to separate functions
Note: The docstring mentions APIError and DataValidationError in the Raises section, but the code raises ValueError. Based on learnings, this is intentional, as client.py converts ValueError to APIError at the API boundary.
140-167: LGTM! The _process_event and _validate_response functions are well-structured:
- Single responsibility - each handles a specific concern
- Proper logging at each processing step
- Error handling with clear messages
The use of ValueError here is intentional per project conventions (converted to APIError in client.py).
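Illustratively, the boundary translation looks like this; the function bodies and the APIError stand-in are placeholders, since the real exception lives in core.system.exceptions and the conversion in client.py.

```python
class APIError(Exception):
    """Stand-in for the project's API-boundary exception type."""


def parse_streaming_response(lines: list[str]) -> dict:
    # Internal validation raises a plain ValueError...
    if not lines:
        raise ValueError("Streaming response missing final_response")
    return {"response": " ".join(lines)}


def query_streaming_endpoint(lines: list[str]) -> dict:
    # ...and the client boundary translates it while preserving the message.
    try:
        return parse_streaming_response(lines)
    except ValueError as exc:
        raise APIError(f"Invalid streaming response: {exc}") from exc
```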
tests/unit/core/api/test_streaming_parser.py (2)
36-39: LGTM! Good addition - verifying that performance metrics keys are present in the result ensures the API contract is maintained.
260-377: LGTM! Excellent test coverage for the new streaming performance metrics:
- test_time_to_first_token_captured: Verifies TTFT is captured and non-negative
- test_streaming_duration_captured: Verifies duration is captured and >= TTFT
- test_tokens_per_second_with_token_counts: Verifies throughput calculation with token data
- test_tokens_per_second_without_token_counts: Verifies throughput is None without tokens
- test_ttft_captured_on_token_event: Verifies TTFT capture on token events
- test_ttft_captured_on_tool_call_event: Verifies TTFT capture on tool_call events
- test_performance_metrics_with_complete_flow: End-to-end verification
The tests follow pytest conventions and use the mocker fixture as per coding guidelines.
src/lightspeed_evaluation/core/output/generator.py (5)
20-20: LGTM! Clean import of the new statistics function for streaming metrics calculation.
69-79: LGTM! Streaming stats are properly calculated when evaluation_data is available and initialized as an empty dict otherwise. This ensures consistent data structure across both paths.
227-234: LGTM! Good conditional inclusion - streaming_performance is only added to the JSON summary when there's actual streaming data. This keeps the output clean when streaming metrics aren't available.
251-254: LGTM! Per-result streaming metrics are included in the JSON output, enabling detailed analysis of individual evaluation performance.
326-357: LGTM! The _write_streaming_stats method is well-implemented:
- Returns early when no data is available (line 334)
- Checks each metric category before writing
- Uses appropriate formatting (.3f for time, .1f for throughput)
- Adds proper section headers and formatting
The docstring is concise but clear. Previous review feedback about expanding it was marked as addressed.
asamal4
left a comment
LGTM. Thank you !!
Description
Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit
New Features
Data / Schema
Analytics / Reporting
Tests
Documentation