
Conversation

@bsatapat-jpg bsatapat-jpg (Collaborator) commented Dec 26, 2025

Description

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Unit tests improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: Claude-4.5-opus-high
  • Generated by: Cursor

Related Tickets & Documents

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • New Features

    • Added three streaming performance metrics per result: time_to_first_token, streaming_duration, and tokens_per_second; CSV output now includes these columns.
  • Data / Schema

    • Core result types and models optionally carry these streaming metrics.
  • Analytics / Reporting

    • Aggregated streaming statistics included in JSON/TXT summaries and CSV outputs; per-result streaming fields surfaced in reports.
  • Tests

    • Expanded unit tests covering capture, propagation, calculation, and missing-metric cases.
  • Documentation

    • README updated to describe streaming metrics and their inclusion in outputs.


@coderabbitai coderabbitai bot (Contributor) commented Dec 26, 2025

Walkthrough

Adds capture, propagation, aggregation, and reporting of three streaming performance metrics—time_to_first_token, streaming_duration, tokens_per_second—by instrumenting the streaming parser, introducing a StreamingMetricsMixin, wiring metrics through models and pipeline, computing stats, updating outputs/tests, and adding CSV/README mentions.

Changes

  • Config - config/system.yaml: Added CSV columns: time_to_first_token, streaming_duration, tokens_per_second.
  • Streaming parser - src/lightspeed_evaluation/core/api/streaming_parser.py: Introduces public StreamingContext dataclass and private _PerformanceTracker; refactors parsing to an event-driven flow (_parse_sse_line, _process_event, _validate_response); captures TTFT, streaming duration, and tokens/sec and returns metrics with the response.
  • Constants - src/lightspeed_evaluation/core/constants.py: Inserted streaming metric names into SUPPORTED_CSV_COLUMNS.
  • Models (mixins, api, data, exports) - src/lightspeed_evaluation/core/models/mixins.py, src/lightspeed_evaluation/core/models/__init__.py, src/lightspeed_evaluation/core/models/api.py, src/lightspeed_evaluation/core/models/data.py: Added StreamingMetricsMixin (three optional non-negative float fields) and exported it; adapted APIResponse, TurnData, and EvaluationResult to include the mixin and propagate metrics; APIResponse.from_raw_response now reads streaming fields.
  • Pipeline integration - src/lightspeed_evaluation/pipeline/evaluation/amender.py, src/lightspeed_evaluation/pipeline/evaluation/evaluator.py: Amender populates streaming metrics into TurnData; evaluator includes streaming metrics in SUCCESS and ERROR EvaluationResult payloads and logs metrics conditionally.
  • Output & statistics - src/lightspeed_evaluation/core/output/generator.py, src/lightspeed_evaluation/core/output/statistics.py: Added calculate_streaming_stats to aggregate per-turn metrics; generator threads streaming_stats through JSON/TXT summaries and per-result entries; added _write_streaming_stats and updated summary method signatures to accept streaming_stats.
  • Tests - tests/unit/core/api/test_streaming_parser.py, tests/unit/core/models/test_api_additional.py, tests/unit/core/output/test_generator.py: Added unit tests for TTFT, streaming_duration, and tokens_per_second capture and calculations; updated generator tests to pass streaming_stats.
  • Docs - README.md: Documented streaming performance metrics and their inclusion in CSV/JSON/TXT outputs.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Client
    participant Parser as parse_streaming_response
    participant Context as StreamingContext
    participant Perf as _PerformanceTracker
    participant Proc as _process_event
    participant Validator as _validate_response

    Client->>Parser: open SSE stream (SSE lines)
    Parser->>Context: init StreamingContext (Perf.start())
    loop per SSE line
        Parser->>Parser: _parse_sse_line -> event
        Parser->>Proc: dispatch event with Context
        alt content / token event
            Proc->>Perf: record TTFT if first content/tool_call
            Proc->>Context: append token / update interim response
        else tool_call event
            Proc->>Perf: record TTFT if first content/tool_call
            Proc->>Context: append tool_call
        else turn_complete event
            Proc->>Context: record token counts (input/output)
        else end event
            Proc->>Perf: record end time
            Perf->>Context: compute streaming_duration & tokens_per_second
            Proc->>Context: set final_response
        end
    end
    Parser->>Validator: validate Context (final_response, conversation_id)
    Validator-->>Parser: valid / raise
    Parser->>Client: return serialized response (includes streaming metrics)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • asamal4
  • VladimirKadlec
🚥 Pre-merge checks | ✅ Passed checks (3 passed)
  • Description Check (✅ Passed): Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check (✅ Passed): The title directly summarizes the main feature: adding latency metrics to evaluation reports. The changeset comprehensively implements streaming performance metrics (TTFT, streaming duration, tokens per second) across parser, models, output generation, and test files.
  • Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.



Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
src/lightspeed_evaluation/core/api/streaming_parser.py (2)

41-66: Consider making StreamingContext private for consistency.

The _PerformanceTracker class uses an underscore prefix to indicate internal use. Since StreamingContext is also only used internally, consider renaming it to _StreamingContext for consistency.


120-139: Add type hint for event_data parameter.

The event_data parameter is missing a full type hint. Per coding guidelines, type hints should be provided for all functions.

🔎 Proposed fix
-def _process_event(ctx: StreamingContext, event: str, event_data: dict) -> None:
+def _process_event(ctx: StreamingContext, event: str, event_data: dict[str, Any]) -> None:
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 612e95c and 8f2e6c8.

📒 Files selected for processing (9)
  • config/system.yaml
  • src/lightspeed_evaluation/core/api/streaming_parser.py
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/models/api.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • tests/unit/core/api/test_streaming_parser.py
  • tests/unit/core/models/test_api_additional.py
🧰 Additional context used
📓 Path-based instructions (3)
src/lightspeed_evaluation/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/lightspeed_evaluation/**/*.py: Provide type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels in lightspeed_evaluation modules

Files:

  • src/lightspeed_evaluation/core/models/api.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/api/streaming_parser.py
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
**/test_*.py

📄 CodeRabbit inference engine (AGENTS.md)

Use pytest mocking (mocker fixture) instead of unittest.mock

Files:

  • tests/unit/core/api/test_streaming_parser.py
  • tests/unit/core/models/test_api_additional.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Mirror test structure to source code organization, with test files named test_*.py and test functions/classes named test_* and Test*

Files:

  • tests/unit/core/api/test_streaming_parser.py
  • tests/unit/core/models/test_api_additional.py
🧠 Learnings (4)
📚 Learning: 2025-09-09T14:58:10.630Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/pipeline/evaluation/amender.py:32-41
Timestamp: 2025-09-09T14:58:10.630Z
Learning: In the lightspeed-evaluation framework, when API is enabled, every turn should make a fresh API call regardless of whether the turn already has response or tool_calls data. This ensures consistency and fresh responses for each evaluation run.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.

Applied to files:

  • src/lightspeed_evaluation/core/api/streaming_parser.py
🧬 Code graph analysis (2)
tests/unit/core/api/test_streaming_parser.py (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
  • parse_streaming_response (88-117)
tests/unit/core/models/test_api_additional.py (1)
src/lightspeed_evaluation/core/models/api.py (2)
  • APIResponse (80-149)
  • from_raw_response (114-149)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: tests (3.12)
  • GitHub Check: tests (3.13)
🔇 Additional comments (13)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)

72-91: LGTM! Clean implementation of streaming metrics extraction.

The streaming performance metrics extraction is well-implemented:

  • Appropriate use of getattr with None defaults for optional fields
  • Conditional debug logging only when metrics are present
  • Consistent with the existing token usage pattern
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)

185-192: LGTM! Streaming metrics properly propagated to evaluation results.

The streaming performance metrics are correctly extracted from turn_data and populated into EvaluationResult, following the same conditional pattern as other turn-level fields.


218-236: LGTM! Error handling improved with consistent streaming metrics.

Good refactoring to use a local turn_data variable for clarity, and the error result now includes streaming metrics consistently with the success path.

tests/unit/core/api/test_streaming_parser.py (1)

260-377: LGTM! Comprehensive test coverage for streaming performance metrics.

The new test class provides excellent coverage:

  • Validates TTFT capture on different event types (token, tool_call, turn_complete)
  • Checks metric relationships (streaming_duration ≥ time_to_first_token)
  • Tests tokens_per_second calculation with and without token counts
  • Verifies complete flow with all metrics
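
As an illustration of the relationship checks listed above, a minimal sketch of the invariants might look like the following. The input is a hand-built response dict, not actual parse_streaming_response output (the SSE fixture format is not reproduced here):

from typing import Optional


def check_streaming_metric_invariants(response: dict) -> None:
    """Assert the metric relationships described above on a parsed response."""
    ttft: Optional[float] = response.get("time_to_first_token")
    duration: Optional[float] = response.get("streaming_duration")
    tps: Optional[float] = response.get("tokens_per_second")

    # Metrics are optional but never negative (mirrors the ge=0 validation).
    for value in (ttft, duration, tps):
        assert value is None or value >= 0

    # The stream cannot finish before its first token arrives.
    if ttft is not None and duration is not None:
        assert duration >= ttft

    # Throughput is only reported when TTFT and duration were captured.
    if tps is not None:
        assert ttft is not None and duration is not None


check_streaming_metric_invariants(
    {"time_to_first_token": 0.12, "streaming_duration": 1.5, "tokens_per_second": 72.4}
)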
src/lightspeed_evaluation/core/models/api.py (2)

96-111: LGTM! Streaming metrics properly defined with validation.

The streaming performance metric fields are well-defined:

  • Appropriate Optional[float] type for streaming-only metrics
  • Non-negative validation with ge=0
  • Clear descriptions indicating streaming-only applicability

134-148: LGTM! Streaming metrics correctly extracted and propagated.

The extraction logic correctly uses .get() to retrieve optional streaming metrics from raw data and passes them to the APIResponse constructor.

tests/unit/core/models/test_api_additional.py (1)

168-228: LGTM! Thorough test coverage for APIResponse streaming metrics.

The tests comprehensively cover:

  • Direct instantiation with streaming metrics
  • Default None values when metrics are absent
  • Parsing from raw response with and without streaming metrics
  • Token count propagation alongside streaming metrics
src/lightspeed_evaluation/core/models/data.py (1)

81-96: LGTM! Streaming metrics properly defined in TurnData with validation.

The streaming performance metrics are correctly defined with:

  • Appropriate Optional[float] type
  • Non-negative validation (ge=0)
  • Clear descriptions
src/lightspeed_evaluation/core/api/streaming_parser.py (5)

1-14: LGTM!

The imports are appropriate for the new performance tracking functionality, and CONTENT_EVENTS as a tuple provides an immutable constant for event classification.


17-38: LGTM!

Good use of time.perf_counter() for high-resolution timing. The TTFT capture logic correctly ensures single capture via the _first_content_received flag.


69-85: LGTM!

The throughput calculation correctly excludes TTFT from the generation time, and the function properly handles edge cases (no tokens, missing TTFT, zero/negative generation time).
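
As a concrete (illustrative, not taken from the PR) example of that formula: 100 output tokens with a TTFT of 0.5 s and a total streaming duration of 2.5 s give a generation time of 2.5 - 0.5 = 2.0 s, so tokens_per_second = 100 / 2.0 = 50.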


142-147: LGTM!

The validation logic correctly ensures required fields are present before returning the response. The use of ValueError was noted in the earlier comment about custom exceptions.


150-196: LGTM!

The existing helper functions (_parse_sse_line, _parse_tool_call, _format_tool_sequences) integrate correctly with the new context-based streaming flow.

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

♻️ Duplicate comments (2)
src/lightspeed_evaluation/core/api/streaming_parser.py (2)

120-140: Use custom exceptions per coding guidelines.

Line 125 raises ValueError for streaming API errors. Per the coding guidelines and a previous review comment, this should use APIError from core.system.exceptions instead.

🔎 Suggested fix

Update the import at the top of the file:

+from lightspeed_evaluation.core.system.exceptions import APIError

Then update the error handling:

 def _process_event(ctx: StreamingContext, event: str, event_data: dict) -> None:
     """Process a single streaming event and update context."""
     if event == "error" and "token" in event_data:
         error_message = event_data["token"]
         logger.error("Received error event from streaming API: %s", error_message)
-        raise ValueError(f"Streaming API error: {error_message}")
+        raise APIError(f"Streaming API error: {error_message}")

Based on coding guidelines: "Use custom exceptions from core.system.exceptions for error handling".


142-147: Use custom exceptions for validation errors.

Lines 145 and 147 raise ValueError for validation failures. Per the coding guidelines and a previous review comment, these should use DataValidationError from core.system.exceptions instead.

🔎 Suggested fix

Update the import at the top of the file:

+from lightspeed_evaluation.core.system.exceptions import DataValidationError

Then update the validation:

 def _validate_response(ctx: StreamingContext) -> None:
     """Validate that required response fields are present."""
     if not ctx.final_response:
-        raise ValueError("No final response found in streaming output")
+        raise DataValidationError("No final response found in streaming output")
     if not ctx.conversation_id:
-        raise ValueError("No Conversation ID found")
+        raise DataValidationError("No Conversation ID found")

Based on coding guidelines: "Use custom exceptions from core.system.exceptions for error handling".

🧹 Nitpick comments (3)
src/lightspeed_evaluation/core/api/streaming_parser.py (3)

17-38: Add type hints and docstrings for _PerformanceTracker.

Per coding guidelines, all functions and methods should have type hints, and public APIs should have Google-style docstrings. While this is a private class, adding docstrings improves maintainability.

🔎 Suggested improvements
 @dataclass
 class _PerformanceTracker:
-    """Tracks streaming performance metrics (TTFT, throughput)."""
+    """Tracks streaming performance metrics.
+    
+    Captures time to first token (TTFT) and calculates throughput metrics
+    for streaming responses.
+    
+    Attributes:
+        stream_start_time: Monotonic timestamp when streaming started
+        time_to_first_token: Time in seconds to receive first content token
+        _first_content_received: Internal flag tracking if first content seen
+    """
 
     stream_start_time: float = field(default_factory=time.perf_counter)
     time_to_first_token: Optional[float] = None
     _first_content_received: bool = False
 
-    def capture_ttft(self) -> None:
-        """Capture time to first token if not already captured."""
+    def capture_ttft(self) -> None:
+        """Capture time to first token if not already captured.
+        
+        Records the time elapsed from stream start to first content token.
+        Subsequent calls are ignored to preserve the first measurement.
+        """
         if not self._first_content_received:
             self.time_to_first_token = time.perf_counter() - self.stream_start_time
             self._first_content_received = True
             logger.debug("Time to first token: %.3f seconds", self.time_to_first_token)
 
-    def get_metrics(self, output_tokens: int) -> tuple[float, Optional[float]]:
-        """Calculate streaming duration and tokens per second."""
+    def get_metrics(self, output_tokens: int) -> tuple[float, Optional[float]]:
+        """Calculate streaming duration and tokens per second.
+        
+        Args:
+            output_tokens: Number of output tokens generated
+            
+        Returns:
+            Tuple of (streaming_duration, tokens_per_second)
+        """
         streaming_duration = time.perf_counter() - self.stream_start_time
         tokens_per_second = _calculate_tokens_per_second(
             output_tokens, self.time_to_first_token, streaming_duration
         )
         return streaming_duration, tokens_per_second

Based on coding guidelines: "Provide type hints for all public functions and methods" and "Use Google-style docstrings for all public APIs".


41-66: Enhance docstrings for StreamingContext.

The class and method need more detailed Google-style docstrings per coding guidelines. The existing docstrings are minimal.

🔎 Suggested improvements
 @dataclass
 class StreamingContext:  # pylint: disable=too-many-instance-attributes
-    """Context for streaming response parsing."""
+    """Context for streaming response parsing.
+    
+    Maintains state during SSE stream parsing, including response content,
+    conversation tracking, token counts, and performance measurements.
+    
+    Attributes:
+        conversation_id: Unique conversation identifier from API
+        final_response: Complete response text after streaming
+        tool_calls: List of tool invocations from the stream
+        input_tokens: Number of input tokens used
+        output_tokens: Number of output tokens generated
+        perf: Performance tracker for TTFT and throughput metrics
+    """
 
     conversation_id: str = ""
     final_response: str = ""
     tool_calls: list[dict[str, Any]] = field(default_factory=list)
     input_tokens: int = 0
     output_tokens: int = 0
     perf: _PerformanceTracker = field(default_factory=_PerformanceTracker)
 
     def to_response_dict(self) -> dict[str, Any]:
-        """Convert context to response dictionary."""
+        """Convert context to response dictionary.
+        
+        Returns:
+            Dictionary containing response text, tool calls, conversation ID,
+            token counts, and streaming performance metrics (TTFT, duration,
+            tokens per second).
+        """
         streaming_duration, tokens_per_second = self.perf.get_metrics(
             self.output_tokens
         )
         return {
             "response": self.final_response,
             "tool_calls": _format_tool_sequences(self.tool_calls),
             "conversation_id": self.conversation_id,
             "input_tokens": self.input_tokens,
             "output_tokens": self.output_tokens,
             "time_to_first_token": self.perf.time_to_first_token,
             "streaming_duration": streaming_duration,
             "tokens_per_second": tokens_per_second,
         }

Based on coding guidelines: "Use Google-style docstrings for all public APIs".


69-85: Consider enhancing docstring for clarity.

The function has a good implementation with proper edge case handling. The docstring could be expanded to document parameters and return value in Google-style format.

🔎 Suggested improvement
 def _calculate_tokens_per_second(
     output_tokens: int, ttft: Optional[float], total_duration: float
 ) -> Optional[float]:
-    """Calculate tokens per second, excluding TTFT from throughput calculation."""
+    """Calculate tokens per second, excluding TTFT from throughput calculation.
+    
+    Args:
+        output_tokens: Number of output tokens generated
+        ttft: Time to first token in seconds, or None if not captured
+        total_duration: Total streaming duration in seconds
+        
+    Returns:
+        Tokens per second throughput (output_tokens / generation_time),
+        or None if calculation is not possible (no tokens, no TTFT, or
+        negative generation time).
+    """
     if output_tokens <= 0 or ttft is None:
         return None
     generation_time = total_duration - ttft
     if generation_time <= 0:
         return None
     tokens_per_second = output_tokens / generation_time
     logger.debug(
         "Streaming performance: %.3f tokens/sec (%d tokens in %.3f sec)",
         tokens_per_second,
         output_tokens,
         generation_time,
     )
     return tokens_per_second
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8f2e6c8 and ed90e09.

📒 Files selected for processing (9)
  • config/system.yaml
  • src/lightspeed_evaluation/core/api/streaming_parser.py
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/models/api.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • tests/unit/core/api/test_streaming_parser.py
  • tests/unit/core/models/test_api_additional.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/lightspeed_evaluation/core/models/data.py
  • config/system.yaml
  • src/lightspeed_evaluation/core/constants.py
  • tests/unit/core/models/test_api_additional.py
🧰 Additional context used
📓 Path-based instructions (3)
**/test_*.py

📄 CodeRabbit inference engine (AGENTS.md)

Use pytest mocking (mocker fixture) instead of unittest.mock

Files:

  • tests/unit/core/api/test_streaming_parser.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Mirror test structure to source code organization, with test files named test_*.py and test functions/classes named test_* and Test*

Files:

  • tests/unit/core/api/test_streaming_parser.py
src/lightspeed_evaluation/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/lightspeed_evaluation/**/*.py: Provide type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels in lightspeed_evaluation modules

Files:

  • src/lightspeed_evaluation/core/models/api.py
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/api/streaming_parser.py
🧠 Learnings (6)
📚 Learning: 2025-09-09T14:58:10.630Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/pipeline/evaluation/amender.py:32-41
Timestamp: 2025-09-09T14:58:10.630Z
Learning: In the lightspeed-evaluation framework, when API is enabled, every turn should make a fresh API call regardless of whether the turn already has response or tool_calls data. This ensures consistency and fresh responses for each evaluation run.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.

Applied to files:

  • src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-12-11T10:05:06.422Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-11T10:05:06.422Z
Learning: Applies to src/lightspeed_evaluation/**/*.py : Use custom exceptions from core.system.exceptions for error handling

Applied to files:

  • src/lightspeed_evaluation/core/api/streaming_parser.py
🧬 Code graph analysis (1)
tests/unit/core/api/test_streaming_parser.py (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
  • parse_streaming_response (88-117)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: tests (3.13)
🔇 Additional comments (8)
tests/unit/core/api/test_streaming_parser.py (2)

36-39: LGTM! Existing test appropriately updated.

The test now verifies that performance metrics are present in the response. This ensures backward compatibility while validating the new metrics are populated.


260-377: Excellent test coverage for streaming performance metrics.

The new test suite comprehensively validates:

  • Time to First Token (TTFT) capture on different event types
  • Streaming duration calculation and its relationship to TTFT
  • Tokens per second calculation with and without token counts
  • Complete end-to-end flow with all metrics

The tests follow pytest conventions and coding guidelines.

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)

185-192: LGTM! Streaming metrics properly integrated.

The streaming performance metrics are correctly extracted from turn_data with appropriate None fallbacks, consistent with the existing pattern for token usage metrics.


214-236: LGTM! Error result consistently includes streaming metrics.

The refactoring to use a local turn_data variable improves readability and avoids repeated attribute access. Including streaming metrics in error results is appropriate—they provide valuable performance insights even when evaluations fail.

src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)

72-90: LGTM! Streaming metrics properly extracted and logged.

The implementation correctly:

  • Uses getattr with None defaults for optional streaming fields
  • Logs metrics only when available (conditional on TTFT presence)
  • Employs defensive null handling in logging to prevent crashes
  • Provides clear, formatted output with appropriate precision

The code follows project guidelines and integrates cleanly with existing patterns.
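
A minimal sketch of the getattr-plus-guarded-logging pattern described above (attribute names come from this PR; the function and the SimpleNamespace objects are simplified stand-ins, not the actual amender code):

import logging
from types import SimpleNamespace
from typing import Any

logger = logging.getLogger(__name__)


def copy_streaming_metrics(api_response: Any, turn_data: Any) -> None:
    """Copy optional streaming metrics from an API response onto turn data."""
    ttft = getattr(api_response, "time_to_first_token", None)
    duration = getattr(api_response, "streaming_duration", None)
    tps = getattr(api_response, "tokens_per_second", None)

    turn_data.time_to_first_token = ttft
    turn_data.streaming_duration = duration
    turn_data.tokens_per_second = tps

    # Log only when the request actually streamed (TTFT was captured),
    # guarding each value so a missing metric cannot break the log call.
    if ttft is not None:
        logger.debug(
            "Streaming metrics: TTFT=%.3fs, duration=%s, throughput=%s",
            ttft,
            f"{duration:.3f}s" if duration is not None else "n/a",
            f"{tps:.2f} tok/s" if tps is not None else "n/a",
        )


response = SimpleNamespace(
    time_to_first_token=0.12, streaming_duration=1.5, tokens_per_second=72.4
)
turn = SimpleNamespace()
copy_streaming_metrics(response, turn)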

src/lightspeed_evaluation/core/models/api.py (2)

96-111: LGTM! Streaming metric fields properly defined.

The new fields are well-structured with:

  • Appropriate types (Optional[float])
  • Non-negative validation (ge=0)
  • Clear documentation indicating streaming-only availability
  • Consistent naming conventions

134-138: LGTM! Metric extraction and population implemented correctly.

The extraction uses raw_data.get() without defaults, appropriately returning None for missing keys. This pattern is consistent with the optional nature of streaming metrics and aligns with existing token extraction logic.

Also applies to: 146-148

src/lightspeed_evaluation/core/api/streaming_parser.py (1)

88-117: LGTM! Well-structured refactoring with clear separation of concerns.

The refactored implementation:

  • Uses context-based state management for cleaner code
  • Separates parsing, event processing, and validation into distinct phases
  • Captures TTFT at the correct point (first content event)
  • Includes comprehensive documentation of performance metrics

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)

120-147: Use custom exceptions from core.system.exceptions for error handling.

Per the coding guidelines and previous review feedback, replace ValueError with custom exceptions:

  • Line 125: Use APIError for streaming API errors
  • Lines 145, 147: Use DataValidationError for validation failures
🔎 Proposed fix
+from lightspeed_evaluation.core.system.exceptions import APIError, DataValidationError
+
 def _process_event(ctx: StreamingContext, event: str, event_data: dict) -> None:
     """Process a single streaming event and update context."""
     if event == "error" and "token" in event_data:
         error_message = event_data["token"]
         logger.error("Received error event from streaming API: %s", error_message)
-        raise ValueError(f"Streaming API error: {error_message}")
+        raise APIError(f"Streaming API error: {error_message}")
     # ... rest unchanged ...

 def _validate_response(ctx: StreamingContext) -> None:
     """Validate that required response fields are present."""
     if not ctx.final_response:
-        raise ValueError("No final response found in streaming output")
+        raise DataValidationError("No final response found in streaming output")
     if not ctx.conversation_id:
-        raise ValueError("No Conversation ID found")
+        raise DataValidationError("No Conversation ID found")
🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)

41-66: StreamingContext is a public class—add Google-style docstring for the to_response_dict method.

Per coding guidelines, public APIs should use Google-style docstrings. The to_response_dict method lacks a return type description.

🔎 Suggested docstring enhancement
     def to_response_dict(self) -> dict[str, Any]:
-        """Convert context to response dictionary."""
+        """Convert context to response dictionary.
+
+        Returns:
+            dict[str, Any]: Dictionary containing response data and streaming
+                performance metrics (time_to_first_token, streaming_duration,
+                tokens_per_second).
+        """
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ed90e09 and ff4133d.

📒 Files selected for processing (11)
  • config/system.yaml
  • src/lightspeed_evaluation/core/api/streaming_parser.py
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/core/models/api.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/models/mixins.py
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • tests/unit/core/api/test_streaming_parser.py
  • tests/unit/core/models/test_api_additional.py
🚧 Files skipped from review as they are similar to previous changes (6)
  • src/lightspeed_evaluation/core/constants.py
  • config/system.yaml
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • tests/unit/core/api/test_streaming_parser.py
  • tests/unit/core/models/test_api_additional.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
🧰 Additional context used
📓 Path-based instructions (1)
src/lightspeed_evaluation/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/lightspeed_evaluation/**/*.py: Provide type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels in lightspeed_evaluation modules

Files:

  • src/lightspeed_evaluation/core/models/mixins.py
  • src/lightspeed_evaluation/core/api/streaming_parser.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/core/models/api.py
  • src/lightspeed_evaluation/core/models/data.py
🧠 Learnings (6)
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.

Applied to files:

  • src/lightspeed_evaluation/core/api/streaming_parser.py
  • src/lightspeed_evaluation/core/models/api.py
📚 Learning: 2025-12-11T10:05:06.422Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-11T10:05:06.422Z
Learning: Applies to src/lightspeed_evaluation/**/*.py : Use custom exceptions from core.system.exceptions for error handling

Applied to files:

  • src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.

Applied to files:

  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.

Applied to files:

  • src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-09-11T12:47:06.747Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.

Applied to files:

  • src/lightspeed_evaluation/core/models/data.py
🧬 Code graph analysis (3)
src/lightspeed_evaluation/core/models/__init__.py (1)
src/lightspeed_evaluation/core/models/mixins.py (1)
  • StreamingMetricsMixin (8-33)
src/lightspeed_evaluation/core/models/api.py (1)
src/lightspeed_evaluation/core/models/mixins.py (1)
  • StreamingMetricsMixin (8-33)
src/lightspeed_evaluation/core/models/data.py (1)
src/lightspeed_evaluation/core/models/mixins.py (1)
  • StreamingMetricsMixin (8-33)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: tests (3.13)
🔇 Additional comments (9)
src/lightspeed_evaluation/core/api/streaming_parser.py (3)

17-38: Well-designed performance tracker with proper encapsulation.

The _PerformanceTracker dataclass cleanly encapsulates TTFT capture logic with idempotent capture_ttft() and correctly uses time.perf_counter() for high-resolution timing. The separation of concerns between timing capture and metrics calculation is good.


69-85: Tokens-per-second calculation correctly excludes TTFT from throughput.

The logic properly calculates generation throughput by subtracting TTFT from total duration, and handles edge cases (zero/negative tokens, missing TTFT, zero/negative generation time) by returning None.


88-117: Refactored parsing flow is cleaner and more maintainable.

The event-driven approach with StreamingContext, _process_event, and _validate_response improves readability and testability. The TTFT capture on CONTENT_EVENTS is correctly placed.

src/lightspeed_evaluation/core/models/__init__.py (1)

15-15: LGTM!

The StreamingMetricsMixin import and export follow the established patterns. The categorization under # Mixins comment is appropriate.

Also applies to: 47-48

src/lightspeed_evaluation/core/models/mixins.py (1)

1-33: Clean mixin design with proper validation constraints.

The StreamingMetricsMixin follows Pydantic best practices:

  • Uses Optional[float] with default=None for streaming-only metrics
  • Applies ge=0 validation to prevent negative values
  • Includes descriptive field documentation

This provides a reusable component for APIResponse, TurnData, and EvaluationResult.
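
Given the properties listed above (Optional[float], default None, ge=0, field descriptions), the mixin presumably looks roughly like this sketch; the exact descriptions and wording in mixins.py may differ:

from typing import Optional

from pydantic import BaseModel, Field


class StreamingMetricsMixin(BaseModel):
    """Optional streaming performance metrics shared by several models."""

    time_to_first_token: Optional[float] = Field(
        default=None, ge=0, description="Seconds until first token (streaming only)"
    )
    streaming_duration: Optional[float] = Field(
        default=None, ge=0, description="Total streaming duration in seconds (streaming only)"
    )
    tokens_per_second: Optional[float] = Field(
        default=None, ge=0, description="Output-token throughput (streaming only)"
    )

Models such as APIResponse, TurnData, and EvaluationResult then pick up these three validated fields simply by subclassing the mixin.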

src/lightspeed_evaluation/core/models/api.py (2)

7-8: Inheritance from StreamingMetricsMixin is appropriate.

Since StreamingMetricsMixin extends BaseModel, APIResponse correctly inherits Pydantic functionality along with the streaming metric fields.

Also applies to: 82-82


119-134: Streaming metrics extraction and propagation looks correct.

The from_raw_response method properly extracts the three streaming metrics using .get() (defaulting to None), which aligns with the mixin's field defaults. The metrics are passed to the constructor where Pydantic's ge=0 validation will apply.

src/lightspeed_evaluation/core/models/data.py (2)

10-10: TurnData correctly inherits streaming metrics via mixin.

The inheritance change from BaseModel to StreamingMetricsMixin is appropriate. Since the mixin extends BaseModel, all existing Pydantic functionality is preserved while adding the three streaming metric fields with proper validation.

Also applies to: 36-36


384-384: EvaluationResult correctly inherits streaming metrics via mixin.

The mixin inheritance provides time_to_first_token, streaming_duration, and tokens_per_second fields with ge=0 validation, addressing the previous review feedback about validation constraints.

@VladimirKadlec VladimirKadlec (Contributor) left a comment

Nice work 💪, one small comment below.

@bsatapat-jpg bsatapat-jpg force-pushed the dev branch 2 times, most recently from 3f8860a to 1834d7b on January 6, 2026 09:57
@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
src/lightspeed_evaluation/core/output/generator.py (3)

208-216: Update the docstring to document the new parameter.

The streaming_stats parameter was added to the method signature but the docstring was not updated to document it. According to the coding guidelines, all public APIs should have Google-style docstrings.

🔎 Proposed docstring update
     def _generate_json_summary(  # pylint: disable=too-many-arguments,too-many-positional-arguments
         self,
         results: list[EvaluationResult],
         base_filename: str,
         basic_stats: dict[str, Any],
         detailed_stats: dict[str, Any],
         api_tokens: dict[str, Any],
         streaming_stats: dict[str, Any],
     ) -> Path:
-        """Generate JSON summary report."""
+        """Generate JSON summary report.
+
+        Args:
+            results: List of evaluation results.
+            base_filename: Base filename for the output file.
+            basic_stats: Basic statistics calculated from results.
+            detailed_stats: Detailed statistics by metric and conversation.
+            api_tokens: API token usage statistics.
+            streaming_stats: Streaming performance statistics.
+
+        Returns:
+            Path to the generated JSON file.
+        """

265-273: Update the docstring to document the new parameter.

The streaming_stats parameter was added to the method signature but the docstring was not updated to document it. According to the coding guidelines, all public APIs should have Google-style docstrings.

🔎 Proposed docstring update
     def _generate_text_summary(  # pylint: disable=too-many-arguments,too-many-positional-arguments
         self,
         results: list[EvaluationResult],
         base_filename: str,
         basic_stats: dict[str, Any],
         detailed_stats: dict[str, Any],
         api_tokens: dict[str, Any],
         streaming_stats: dict[str, Any],
     ) -> Path:
-        """Generate human-readable text summary."""
+        """Generate human-readable text summary.
+
+        Args:
+            results: List of evaluation results.
+            base_filename: Base filename for the output file.
+            basic_stats: Basic statistics calculated from results.
+            detailed_stats: Detailed statistics by metric and conversation.
+            api_tokens: API token usage statistics.
+            streaming_stats: Streaming performance statistics.
+
+        Returns:
+            Path to the generated text file.
+        """

208-273: Update test calls to include the streaming_stats parameter.

Tests at lines 112-114 and 138-140 in tests/unit/core/output/test_generator.py are calling _generate_json_summary() and _generate_text_summary() without the required streaming_stats parameter. Add an empty dict or stats["streaming"] as the sixth argument to both calls:

json_file = handler._generate_json_summary(
    sample_results, "test", stats["basic"], stats["detailed"], api_tokens, stats.get("streaming", {})
)
txt_file = handler._generate_text_summary(
    sample_results, "test", stats["basic"], stats["detailed"], api_tokens, stats.get("streaming", {})
)
🤖 Fix all issues with AI Agents
In @src/lightspeed_evaluation/core/output/generator.py:
- Around line 326-360: The docstring for _write_streaming_stats is too minimal;
update it to a Google-style docstring that documents the function purpose,
parameters and return value. Specifically, for method
_write_streaming_stats(self, f, streaming_stats) add a description line, an Args
section describing f (file-like object or writable stream) and streaming_stats
(dict with keys like "time_to_first_token", "streaming_duration",
"tokens_per_second" and their expected subkeys), and a Returns section
indicating None; also include a short Raises section if the method can raise
(e.g., KeyError or TypeError) or state that it does not raise, keeping the
docstring consistent with Google style.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a9864e3 and 3f8860a.

📒 Files selected for processing (13)
  • config/system.yaml
  • src/lightspeed_evaluation/core/api/streaming_parser.py
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/core/models/api.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/models/mixins.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • tests/unit/core/api/test_streaming_parser.py
  • tests/unit/core/models/test_api_additional.py
🚧 Files skipped from review as they are similar to previous changes (6)
  • tests/unit/core/models/test_api_additional.py
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • tests/unit/core/api/test_streaming_parser.py
  • config/system.yaml
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/models/mixins.py
🧰 Additional context used
📓 Path-based instructions (1)
src/lightspeed_evaluation/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/lightspeed_evaluation/**/*.py: Provide type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels in lightspeed_evaluation modules

Files:

  • src/lightspeed_evaluation/core/models/api.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/api/streaming_parser.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/core/models/data.py
🧠 Learnings (10)
📚 Learning: 2025-09-09T14:58:10.630Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/pipeline/evaluation/amender.py:32-41
Timestamp: 2025-09-09T14:58:10.630Z
Learning: In the lightspeed-evaluation framework, when API is enabled, every turn should make a fresh API call regardless of whether the turn already has response or tool_calls data. This ensures consistency and fresh responses for each evaluation run.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.

Applied to files:

  • src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-12-11T10:05:06.422Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-11T10:05:06.422Z
Learning: Applies to src/lightspeed_evaluation/**/*.py : Use custom exceptions from core.system.exceptions for error handling

Applied to files:

  • src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.

Applied to files:

  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-09-11T12:47:06.747Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.

Applied to files:

  • src/lightspeed_evaluation/core/models/data.py
🧬 Code graph analysis (5)
src/lightspeed_evaluation/core/models/api.py (1)
src/lightspeed_evaluation/core/models/mixins.py (1)
  • StreamingMetricsMixin (8-33)
src/lightspeed_evaluation/core/output/generator.py (1)
src/lightspeed_evaluation/core/output/statistics.py (2)
  • calculate_streaming_stats (252-293)
  • calculate_api_token_usage (227-249)
src/lightspeed_evaluation/core/output/statistics.py (1)
src/lightspeed_evaluation/core/models/data.py (1)
  • EvaluationData (318-381)
src/lightspeed_evaluation/core/models/__init__.py (1)
src/lightspeed_evaluation/core/models/mixins.py (1)
  • StreamingMetricsMixin (8-33)
src/lightspeed_evaluation/core/models/data.py (1)
src/lightspeed_evaluation/core/models/mixins.py (1)
  • StreamingMetricsMixin (8-33)
🪛 GitHub Actions: Black
src/lightspeed_evaluation/core/output/generator.py

[error] 1-1: Black formatting check failed. 1 file would be reformatted. Run 'black --write' to fix code style issues.

🪛 GitHub Actions: Tests
src/lightspeed_evaluation/core/output/generator.py

[error] 1-1: TypeError in tests: OutputHandler._generate_json_summary() missing 1 required positional argument: 'streaming_stats'. This suggests a mismatch between test expectations and the method signature.


[error] 1-1: TypeError in tests: OutputHandler._generate_text_summary() missing 1 required positional argument: 'streaming_stats'. This suggests a mismatch between test expectations and the method signature.

🔇 Additional comments (12)
src/lightspeed_evaluation/core/models/__init__.py (1)

15-15: LGTM! Proper API exposure.

The import and export of StreamingMetricsMixin correctly follows the module's pattern for exposing public APIs.

Also applies to: 47-48

src/lightspeed_evaluation/core/output/statistics.py (1)

252-293: LGTM! Well-structured statistics function.

The implementation correctly:

  • Provides type hints for all parameters and return values
  • Uses Google-style docstring
  • Handles edge cases (empty lists gracefully return {"count": 0})
  • Guards statistics.stdev with a length check to prevent errors
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)

185-192: LGTM! Consistent metric propagation.

Streaming metrics are correctly propagated from turn_data to EvaluationResult using safe conditional access, consistent with the existing pattern for API tokens.


218-218: LGTM! Improved consistency in error handling.

The introduction of the local turn_data variable improves readability and ensures consistent field access patterns. Error results now include streaming metrics with proper parity to success results.

Also applies to: 227-235

src/lightspeed_evaluation/core/models/api.py (2)

7-7: LGTM! Proper mixin integration.

APIResponse correctly inherits from StreamingMetricsMixin, which adds the three streaming performance fields while maintaining Pydantic BaseModel inheritance.

Also applies to: 82-82


119-133: LGTM! Safe metric extraction.

Streaming metrics are correctly extracted using .get() (returns None if absent) and properly passed to the constructor. The inline comment clarifies that these metrics are only available for the streaming endpoint.

src/lightspeed_evaluation/core/api/streaming_parser.py (4)

17-40: LGTM! Well-structured performance tracker.

The _PerformanceTracker class correctly:

  • Uses Pydantic BaseModel with proper configuration
  • Provides type hints for all methods
  • Includes docstrings for methods
  • Captures TTFT on first content and calculates throughput metrics

42-68: LGTM! Clean context class.

StreamingContext properly encapsulates streaming state and provides a clean serialization method. Type hints and docstrings are present.


71-87: LGTM! Correct throughput calculation.

The function correctly:

  • Excludes TTFT from throughput calculation (measures generation speed)
  • Handles edge cases (zero/negative values return None)
  • Provides type hints and clear docstring
  • Guards against division by zero

90-119: LGTM! Improved modularity.

The refactored function correctly:

  • Uses StreamingContext to manage state
  • Delegates to helper functions for clarity
  • Captures TTFT on first content event
  • Includes updated docstring explaining performance metrics
src/lightspeed_evaluation/core/models/data.py (2)

10-10: LGTM! Streaming metrics integration is well-structured.

The inheritance changes to use StreamingMetricsMixin for both TurnData and EvaluationResult are correctly implemented. This provides a consistent way to capture and propagate streaming performance metrics throughout the evaluation pipeline.

Also applies to: 36-36, 384-384


427-445: LGTM! Output fields appropriately flatten complex structures.

The new optional string fields in EvaluationResult provide string-formatted versions of complex structured data from TurnData. This is appropriate for an output model that needs to support flat formats like CSV and human-readable JSON.
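Concretely, the flattening amounts to serializing the structured data before it reaches the flat output model; a minimal sketch, assuming json.dumps as the serializer (the helper name is hypothetical):

```python
import json
from typing import Any, Optional


def flatten_tool_calls(tool_calls: Optional[list[dict[str, Any]]]) -> Optional[str]:
    """Serialize structured tool calls into a flat string for CSV/JSON-friendly output."""
    return json.dumps(tool_calls) if tool_calls is not None else None
```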

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI Agents
In @src/lightspeed_evaluation/core/api/streaming_parser.py:
- Around line 122-149: Import the custom exceptions APIError and
DataValidationError from lightspeed_evaluation.core.system.exceptions and
replace all ValueError raises in the streaming parser: in _process_event() when
handling the "error" event (currently raising ValueError with the streaming API
error) change to raise APIError with the same message; in _validate_response()
replace the two ValueError raises ("No final response found in streaming output"
and "No Conversation ID found") with DataValidationError. Ensure the file
imports APIError and DataValidationError and that the exception messages remain
descriptive.
🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/models/mixins.py (1)

8-33: Consider adding Attributes section for Google-style compliance.

While the current docstring is clear and informative, the coding guidelines specify Google-style docstrings for public APIs. Consider adding an Attributes section:

class StreamingMetricsMixin(BaseModel):
    """Mixin providing streaming performance metric fields.

    These fields capture performance metrics when using the streaming query endpoint.
    The fields are None when using the non-streaming query endpoint.

    Attributes:
        time_to_first_token: Time from request start to first content token in seconds.
        streaming_duration: Total time to receive all tokens in seconds.
        tokens_per_second: Output throughput calculation (tokens/second).
    """

Based on coding guidelines, Google-style docstrings are required for public APIs.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3f8860a and 1834d7b.

📒 Files selected for processing (13)
  • config/system.yaml
  • src/lightspeed_evaluation/core/api/streaming_parser.py
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/core/models/api.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/models/mixins.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • tests/unit/core/api/test_streaming_parser.py
  • tests/unit/core/models/test_api_additional.py
🚧 Files skipped from review as they are similar to previous changes (5)
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • src/lightspeed_evaluation/core/models/api.py
  • config/system.yaml
  • src/lightspeed_evaluation/core/models/__init__.py
🧰 Additional context used
📓 Path-based instructions (3)
src/lightspeed_evaluation/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/lightspeed_evaluation/**/*.py: Provide type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels in lightspeed_evaluation modules

Files:

  • src/lightspeed_evaluation/core/api/streaming_parser.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/models/mixins.py
  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/core/models/data.py
**/test_*.py

📄 CodeRabbit inference engine (AGENTS.md)

Use pytest mocking (mocker fixture) instead of unittest.mock

Files:

  • tests/unit/core/api/test_streaming_parser.py
  • tests/unit/core/models/test_api_additional.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Mirror test structure to source code organization, with test files named test_*.py and test functions/classes named test_* and Test*

Files:

  • tests/unit/core/api/test_streaming_parser.py
  • tests/unit/core/models/test_api_additional.py
🧠 Learnings (10)
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.

Applied to files:

  • src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-12-11T10:05:06.422Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-11T10:05:06.422Z
Learning: Applies to src/lightspeed_evaluation/**/*.py : Use custom exceptions from core.system.exceptions for error handling

Applied to files:

  • src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-12-11T10:05:06.422Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-11T10:05:06.422Z
Learning: Applies to src/lightspeed_evaluation/**/*.py : Use Google-style docstrings for all public APIs

Applied to files:

  • src/lightspeed_evaluation/core/output/generator.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-11T12:47:06.747Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.

Applied to files:

  • src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.

Applied to files:

  • src/lightspeed_evaluation/core/models/data.py
🧬 Code graph analysis (5)
src/lightspeed_evaluation/core/output/generator.py (1)
src/lightspeed_evaluation/core/output/statistics.py (2)
  • calculate_streaming_stats (252-293)
  • calculate_api_token_usage (227-249)
tests/unit/core/api/test_streaming_parser.py (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
  • parse_streaming_response (90-119)
tests/unit/core/models/test_api_additional.py (1)
src/lightspeed_evaluation/core/models/api.py (2)
  • APIResponse (82-134)
  • from_raw_response (99-134)
src/lightspeed_evaluation/core/output/statistics.py (1)
src/lightspeed_evaluation/core/models/data.py (1)
  • EvaluationData (318-381)
src/lightspeed_evaluation/core/models/data.py (1)
src/lightspeed_evaluation/core/models/mixins.py (1)
  • StreamingMetricsMixin (8-33)
🪛 GitHub Actions: Tests
src/lightspeed_evaluation/core/output/generator.py

[error] 1-1: TypeError: OutputHandler._generate_json_summary() missing 1 required positional argument: 'streaming_stats' when running pytest.


[error] 1-1: TypeError: OutputHandler._generate_text_summary() missing 1 required positional argument: 'streaming_stats' when running pytest.

🔇 Additional comments (18)
tests/unit/core/models/test_api_additional.py (4)

168-181: LGTM! Well-structured test for streaming metrics.

The test correctly verifies that streaming performance metrics are properly stored and retrieved from APIResponse instances.


182-191: LGTM! Proper validation of default behavior.

The test correctly verifies that streaming metrics default to None when not provided, which is the expected behavior for non-streaming endpoints.


193-211: LGTM! Comprehensive test coverage.

The test effectively validates the complete parsing flow for streaming metrics through the from_raw_response factory method, including both token counts and streaming performance metrics.


213-228: LGTM! Complete coverage for non-streaming scenario.

The test properly validates that streaming metrics remain None when not present in the raw response, ensuring correct behavior for non-streaming endpoints.

src/lightspeed_evaluation/core/models/data.py (3)

10-10: LGTM! Clean import of the new mixin.

The import statement correctly brings in the StreamingMetricsMixin from the mixins module.


36-36: LGTM! Clean mixin integration.

The inheritance changes cleanly add streaming metrics to both TurnData and EvaluationResult through the StreamingMetricsMixin, maintaining consistency across the codebase.

Also applies to: 384-384


427-429: LGTM! Appropriate field for serialized tool calls.

The string-typed tool_calls field correctly stores the JSON-serialized representation of tool calls for evaluation results, consistent with the pattern used for other serialized fields like contexts and expected_tool_calls.

src/lightspeed_evaluation/core/output/statistics.py (1)

252-293: LGTM! Well-implemented streaming statistics aggregation.

The function correctly:

  • Follows coding guidelines with proper type hints and Google-style docstring
  • Handles None values appropriately for optional streaming metrics
  • Provides comprehensive statistics (count, mean, median, min, max, std)
  • Gracefully handles edge cases (empty datasets, single values)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)

185-192: LGTM! Consistent streaming metrics integration.

The streaming performance metrics are properly integrated into the success path, following the same pattern as existing API token fields with appropriate None handling for both missing turn_data and missing metric values.


218-235: LGTM! Complete streaming metrics coverage for error path.

The error result construction now correctly:

  • Introduces a local turn_data variable for cleaner code
  • Includes streaming performance metrics matching the success path
  • Maintains consistent None handling across all fields

The error and success paths now have symmetric streaming metrics handling.

tests/unit/core/api/test_streaming_parser.py (2)

36-39: LGTM! Performance metrics assertions added correctly.

Good addition of assertions to verify that streaming performance metrics (time_to_first_token, streaming_duration, tokens_per_second) are included in the parsed result for complete responses.


260-377: Comprehensive test coverage for streaming performance metrics.

The TestStreamingPerformanceMetrics class provides thorough coverage:

  • TTFT capture on various content events (token, tool_call, turn_complete)
  • Streaming duration validation (>= TTFT invariant)
  • Tokens per second calculation with and without token counts
  • Complete flow integration test

The test structure follows pytest conventions and mirrors the source code organization as per the coding guidelines.

src/lightspeed_evaluation/core/api/streaming_parser.py (3)

17-40: Well-structured performance tracker implementation.

The _PerformanceTracker class cleanly encapsulates timing logic with appropriate use of PrivateAttr for internal state and Pydantic's Field with default_factory for initialization.
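For orientation, here is a minimal sketch of such a tracker, assuming the method names and return shape quoted in this review; the exact split between public fields and private attributes in the real class may differ:

```python
import time
from typing import Optional

from pydantic import BaseModel, Field, PrivateAttr


class _PerformanceTracker(BaseModel):
    """Tracks streaming performance metrics (TTFT, throughput)."""

    # Wall-clock anchor taken when the tracker (and hence the stream) starts.
    stream_start_time: float = Field(default_factory=time.perf_counter)
    # Internal state: set exactly once, on the first content event.
    _time_to_first_token: Optional[float] = PrivateAttr(default=None)

    def capture_ttft(self) -> None:
        """Record time to first token; subsequent calls are no-ops."""
        if self._time_to_first_token is None:
            self._time_to_first_token = time.perf_counter() - self.stream_start_time

    def get_metrics(self, output_tokens: int) -> tuple[float, Optional[float]]:
        """Return (streaming_duration, tokens_per_second)."""
        duration = time.perf_counter() - self.stream_start_time
        tokens_per_second = None
        if output_tokens and self._time_to_first_token is not None:
            generation_time = duration - self._time_to_first_token
            if generation_time > 0:  # avoid division by zero or negative windows
                tokens_per_second = output_tokens / generation_time
        return duration, tokens_per_second
```

The throughput math shown inline here mirrors the standalone helper discussed under the next comment.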


42-68: Clean context model with proper serialization.

StreamingContext properly encapsulates streaming state and delegates metric calculation to the performance tracker in to_response_dict().


71-87: Correct throughput calculation with proper edge case handling.

The function correctly (see the sketch after this list):

  • Excludes TTFT from throughput calculation (measuring generation time only)
  • Returns None for invalid scenarios (no tokens, no TTFT, zero/negative generation time)
  • Includes helpful debug logging
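A minimal sketch of that helper, with the signature assumed from the review notes:

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def _calculate_tokens_per_second(
    output_tokens: int,
    time_to_first_token: Optional[float],
    streaming_duration: float,
) -> Optional[float]:
    """Throughput over the generation window, i.e. with TTFT excluded."""
    if not output_tokens or time_to_first_token is None:
        return None
    generation_time = streaming_duration - time_to_first_token
    if generation_time <= 0:  # guard against zero or negative windows
        return None
    tokens_per_second = output_tokens / generation_time
    logger.debug(
        "Throughput: %d tokens in %.3fs (%.1f tok/s)",
        output_tokens, generation_time, tokens_per_second,
    )
    return tokens_per_second
```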
src/lightspeed_evaluation/core/output/generator.py (3)

69-83: Streaming stats calculation integrated correctly.

The conditional logic properly computes streaming stats when evaluation data is available and provides a consistent default structure (with count: 0) when not. This aligns with the structure returned by calculate_streaming_stats.


251-254: EvaluationResult properly inherits all streaming metric attributes from StreamingMetricsMixin.

The attributes time_to_first_token, streaming_duration, and tokens_per_second are correctly defined as optional float fields with proper validation and type hints in StreamingMetricsMixin, which EvaluationResult extends. No changes needed.


208-216: The code correctly passes streaming_stats to both _generate_json_summary() and _generate_text_summary(). The parameter is calculated at lines 70–83 and properly passed at lines 127 and 138. The tests also correctly provide streaming_stats when calling these methods (lines 117–124 and 153–160). No pipeline failure or breaking change exists, and no test updates are needed.

Likely an incorrect or invalid review comment.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI Agents
In @src/lightspeed_evaluation/core/api/streaming_parser.py:
- Around line 122-149: Replace the generic ValueError usages with the
project-specific exceptions: import APIError and DataValidationError from
core.system.exceptions at the top of the file, change the raise in
_process_event (where streaming API error is detected) to raise APIError(...)
including the original error message, and change both raises in
_validate_response (missing final_response and missing conversation_id) to raise
DataValidationError(...) with appropriate messages; keep existing logging and
context updates intact.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1834d7b and 54f7077.

📒 Files selected for processing (14)
  • config/system.yaml
  • src/lightspeed_evaluation/core/api/streaming_parser.py
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/core/models/api.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/models/mixins.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • tests/unit/core/api/test_streaming_parser.py
  • tests/unit/core/models/test_api_additional.py
  • tests/unit/core/output/test_generator.py
🚧 Files skipped from review as they are similar to previous changes (5)
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/models/mixins.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • tests/unit/core/models/test_api_additional.py
  • config/system.yaml
🧰 Additional context used
📓 Path-based instructions (3)
src/lightspeed_evaluation/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/lightspeed_evaluation/**/*.py: Provide type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels in lightspeed_evaluation modules

Files:

  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/api/streaming_parser.py
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/models/api.py
**/test_*.py

📄 CodeRabbit inference engine (AGENTS.md)

Use pytest mocking (mocker fixture) instead of unittest.mock

Files:

  • tests/unit/core/api/test_streaming_parser.py
  • tests/unit/core/output/test_generator.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Mirror test structure to source code organization, with test files named test_*.py and test functions/classes named test_* and Test*

Files:

  • tests/unit/core/api/test_streaming_parser.py
  • tests/unit/core/output/test_generator.py
🧠 Learnings (8)
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.

Applied to files:

  • src/lightspeed_evaluation/core/api/streaming_parser.py
  • src/lightspeed_evaluation/core/models/api.py
📚 Learning: 2025-12-11T10:05:06.422Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-11T10:05:06.422Z
Learning: Applies to src/lightspeed_evaluation/**/*.py : Use custom exceptions from core.system.exceptions for error handling

Applied to files:

  • src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-12-11T10:05:06.422Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-11T10:05:06.422Z
Learning: Applies to src/lightspeed_evaluation/**/*.py : Use Google-style docstrings for all public APIs

Applied to files:

  • src/lightspeed_evaluation/core/output/generator.py
🧬 Code graph analysis (4)
tests/unit/core/api/test_streaming_parser.py (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
  • parse_streaming_response (90-119)
src/lightspeed_evaluation/core/output/statistics.py (1)
src/lightspeed_evaluation/core/models/data.py (1)
  • EvaluationData (318-381)
src/lightspeed_evaluation/core/output/generator.py (1)
src/lightspeed_evaluation/core/output/statistics.py (2)
  • calculate_streaming_stats (252-293)
  • calculate_api_token_usage (227-249)
src/lightspeed_evaluation/core/models/api.py (1)
src/lightspeed_evaluation/core/models/mixins.py (1)
  • StreamingMetricsMixin (8-33)
🔇 Additional comments (22)
src/lightspeed_evaluation/core/constants.py (1)

42-45: LGTM! Clean addition of streaming metrics columns.

The new CSV columns are well-named, appropriately commented, and logically positioned after token-related columns.

src/lightspeed_evaluation/core/output/statistics.py (1)

252-293: Well-implemented streaming statistics calculation.

The function properly:

  • Provides type hints and Google-style documentation
  • Aggregates metrics from evaluation data
  • Handles edge cases (empty lists, single values)
  • Computes standard statistical measures

The nested _calc_stats helper is appropriate for this use case and keeps the logic cohesive.
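For reference, a minimal sketch of that aggregation, assuming the function receives objects exposing the three optional metric attributes (the real signature operates on the framework's evaluation data models):

```python
import statistics
from typing import Any, Sequence


def calculate_streaming_stats(items: Sequence[Any]) -> dict[str, dict[str, Any]]:
    """Aggregate per-turn streaming metrics into summary statistics."""

    def _calc_stats(values: list[float]) -> dict[str, Any]:
        if not values:  # empty input degrades gracefully
            return {"count": 0}
        return {
            "count": len(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            "min": min(values),
            "max": max(values),
            # stdev needs at least two samples
            "std": statistics.stdev(values) if len(values) > 1 else 0.0,
        }

    metric_names = ("time_to_first_token", "streaming_duration", "tokens_per_second")
    return {
        name: _calc_stats(
            [getattr(item, name) for item in items if getattr(item, name, None) is not None]
        )
        for name in metric_names
    }
```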

src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)

72-90: LGTM! Streaming metrics properly integrated into data amendment flow.

The implementation (see the sketch after this list):

  • Follows the existing pattern for extracting optional API response fields
  • Provides appropriate debug logging for visibility
  • Handles None values gracefully
  • Maintains consistency with existing token usage extraction
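A minimal sketch of that copy-through step; the method name and logger wiring are illustrative rather than the amender's actual code:

```python
import logging
from typing import Any

logger = logging.getLogger(__name__)


def _amend_streaming_metrics(turn_data: Any, api_response: Any) -> None:
    """Copy optional streaming metrics from the API response onto the turn data."""
    turn_data.time_to_first_token = api_response.time_to_first_token
    turn_data.streaming_duration = api_response.streaming_duration
    turn_data.tokens_per_second = api_response.tokens_per_second
    if api_response.time_to_first_token is not None and api_response.streaming_duration is not None:
        logger.debug(
            "Streaming metrics captured: TTFT=%.3fs, duration=%.3fs, throughput=%s tok/s",
            api_response.time_to_first_token,
            api_response.streaming_duration,
            api_response.tokens_per_second,
        )
```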
tests/unit/core/api/test_streaming_parser.py (2)

36-39: Good addition of streaming metrics assertions.

The assertions ensure that the performance metrics are present in the parsed response, aligning with the new functionality.


260-377: Excellent test coverage for streaming performance metrics.

The new test class thoroughly validates:

  • Time-to-first-token capture on various content events (token, tool_call, turn_complete)
  • Streaming duration calculation
  • Tokens-per-second computation with and without token counts
  • Complete flow integration with all metrics

Tests follow pytest conventions and cover both happy paths and edge cases.

src/lightspeed_evaluation/core/models/api.py (3)

7-8: Correct import of StreamingMetricsMixin.

The mixin properly extends APIResponse with streaming performance fields.


82-82: Clean inheritance change to add streaming metrics.

Inheriting from StreamingMetricsMixin instead of BaseModel adds the three streaming performance fields (time_to_first_token, streaming_duration, tokens_per_second) with proper validation constraints. This is backward compatible since the fields default to None.


119-134: Streaming metrics properly extracted and validated.

The implementation:

  • Extracts the three streaming metrics from raw response data
  • Passes them to the constructor where pydantic validation applies
  • Handles absent metrics gracefully (None values)
  • Maintains consistency with existing token extraction pattern

The validation constraints defined in StreamingMetricsMixin (ge=0 for all metrics) ensure data integrity.

tests/unit/core/output/test_generator.py (2)

111-124: LGTM! Test correctly accommodates streaming metrics.

The streaming_stats initialization and parameter passing align with the production code changes. The structure matches the output of calculate_streaming_stats when no data is available.


147-160: LGTM! Consistent test updates for text summary generation.

The streaming_stats initialization and parameter passing match the pattern used in the JSON summary test, maintaining consistency across test cases.

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)

185-192: LGTM! Streaming metrics correctly propagated in success path.

The three streaming performance metrics (time_to_first_token, streaming_duration, tokens_per_second) are safely extracted from turn_data with appropriate fallbacks, consistent with the existing pattern for API token fields.
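The pattern reduces to the snippet below (hypothetical helper shown for brevity; the evaluator inlines the same expressions into the EvaluationResult constructor for both SUCCESS and ERROR results):

```python
from typing import Any, Optional


def streaming_fields(turn_data: Optional[Any]) -> dict[str, Optional[float]]:
    """Streaming metrics with None fallbacks when turn data (or a metric) is missing."""
    return {
        "time_to_first_token": turn_data.time_to_first_token if turn_data else None,
        "streaming_duration": turn_data.streaming_duration if turn_data else None,
        "tokens_per_second": turn_data.tokens_per_second if turn_data else None,
    }

# e.g. EvaluationResult(..., **streaming_fields(turn_data))
```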


218-236: LGTM! Error result includes streaming metrics with safe access.

The introduction of the local turn_data variable improves readability, and streaming metrics are consistently propagated to ERROR results alongside the existing API token fields.

src/lightspeed_evaluation/core/api/streaming_parser.py (4)

17-39: LGTM! Performance tracker cleanly captures TTFT and duration.

The _PerformanceTracker uses Pydantic's private attributes appropriately to ensure TTFT is captured only once on the first content event. The timing logic is sound and includes helpful debug logging.


42-68: LGTM! Context class provides clean serialization of streaming state.

The StreamingContext encapsulates all streaming state and provides a clean to_response_dict method that integrates performance metrics into the response payload.


71-87: LGTM! Throughput calculation correctly excludes TTFT.

The _calculate_tokens_per_second function correctly calculates throughput by excluding TTFT from the generation time, which provides a more accurate measure of token generation rate. The function handles edge cases (no tokens, no TTFT, zero generation time) safely.


90-119: LGTM! Context-driven parsing improves code organization.

The refactored parse_streaming_response uses the context pattern effectively, delegating event processing to helper functions. The TTFT capture is correctly triggered for content events, and the flow is easier to follow than inline state management.
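To make the shape of that flow concrete, here is a self-contained sketch. The SSE payload schema, the per-token accounting, and where conversation_id arrives are assumptions for illustration only; the event names, the DATA_PREFIX stripping, and the validation messages come from this review.

```python
import json
import time
from typing import Any

DATA_PREFIX = "data: "  # SSE line prefix; stripping it is sufficient per the review notes
CONTENT_EVENTS = {"token", "tool_call", "turn_complete"}  # events that count as content


def parse_streaming_response(response: Any) -> dict[str, Any]:
    """Parse a streaming response, attaching TTFT/duration/throughput metrics."""
    start = time.perf_counter()
    ctx: dict[str, Any] = {
        "final_response": "",
        "tool_calls": [],
        "conversation_id": None,
        "output_tokens": 0,
        "ttft": None,
    }

    for line in response.iter_lines():  # httpx streaming iteration (assumed)
        if not line.startswith(DATA_PREFIX):
            continue
        event = json.loads(line.replace(DATA_PREFIX, ""))
        if event.get("event") in CONTENT_EVENTS and ctx["ttft"] is None:
            ctx["ttft"] = time.perf_counter() - start  # capture TTFT exactly once
        _process_event(event, ctx)

    _validate_response(ctx)
    duration = time.perf_counter() - start
    tokens_per_second = None
    if ctx["ttft"] is not None and ctx["output_tokens"] and duration > ctx["ttft"]:
        tokens_per_second = ctx["output_tokens"] / (duration - ctx["ttft"])
    return {
        "response": ctx["final_response"],
        "tool_calls": ctx["tool_calls"],
        "conversation_id": ctx["conversation_id"],
        "output_tokens": ctx["output_tokens"],
        "time_to_first_token": ctx["ttft"],
        "streaming_duration": duration,
        "tokens_per_second": tokens_per_second,
    }


def _process_event(event: dict[str, Any], ctx: dict[str, Any]) -> None:
    """Fold one SSE event into the parsing context (payload shape is illustrative)."""
    kind, data = event.get("event"), event.get("data", {})
    if kind == "error":
        # Raised as ValueError here; the API boundary maps it to APIError.
        raise ValueError(f"Streaming API error: {data}")
    if kind == "token":
        ctx["final_response"] += data.get("token", "")
        ctx["output_tokens"] += 1  # simplified accounting; real counts may come elsewhere
    elif kind == "tool_call":
        ctx["tool_calls"].append(data)
    elif kind == "turn_complete":
        ctx["final_response"] = data.get("response", ctx["final_response"])
        ctx["conversation_id"] = data.get("conversation_id")


def _validate_response(ctx: dict[str, Any]) -> None:
    """Ensure the stream produced the fields required downstream."""
    if not ctx["final_response"]:
        raise ValueError("No final response found in streaming output")
    if not ctx["conversation_id"]:
        raise ValueError("No Conversation ID found")
```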

src/lightspeed_evaluation/core/output/generator.py (6)

69-83: LGTM! Streaming stats properly computed or initialized.

The code correctly calls calculate_streaming_stats when evaluation data is available and provides a sensible fallback structure with zero counts when no data is present, maintaining consistency across the reporting flow.


120-140: LGTM! Streaming stats consistently passed to report generators.

Both JSON and TXT generation receive streaming stats via stats.get("streaming", {}), ensuring graceful handling even if the key is missing from the stats dictionary.


215-238: LGTM! JSON summary integrates streaming performance metrics.

The streaming_stats parameter is properly added to the signature, and the metrics are embedded in the summary under streaming_performance, providing a top-level view of streaming performance alongside other summary statistics.


251-254: LGTM! Per-result streaming metrics included in JSON output.

Each result entry in the JSON output includes the three streaming metrics (time_to_first_token, streaming_duration, tokens_per_second), enabling per-turn analysis of streaming performance.


272-291: LGTM! Text summary generation includes streaming stats.

The streaming_stats parameter is properly threaded through to _generate_text_summary and rendered via the new _write_streaming_stats helper.


326-357: LGTM! Streaming stats section renders performance metrics clearly.

The _write_streaming_stats method provides a well-formatted text representation of streaming performance metrics (TTFT, duration, throughput) with mean, median, min, and max values. The early return when count == 0 prevents rendering empty sections.

Collaborator

@asamal4 asamal4 left a comment

LGTM, few minor changes. Also please update the readme.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
src/lightspeed_evaluation/core/api/streaming_parser.py (2)

17-40: Consider enhancing docstrings for better maintainability.

While _PerformanceTracker is a private class, its methods would benefit from Google-style docstrings that document parameters and return values. This improves maintainability.

🔎 Suggested docstring improvements
 class _PerformanceTracker(BaseModel):
-    """Tracks streaming performance metrics (TTFT, throughput)."""
+    """Tracks streaming performance metrics (TTFT, throughput).
+    
+    Attributes:
+        stream_start_time: Timestamp when streaming started (perf_counter).
+        time_to_first_token: Time in seconds to first content token, or None.
+    """

     def capture_ttft(self) -> None:
-        """Capture time to first token if not already captured."""
+        """Capture time to first token if not already captured.
+        
+        Sets time_to_first_token on first call; subsequent calls are no-ops.
+        """

     def get_metrics(self, output_tokens: int) -> tuple[float, Optional[float]]:
-        """Calculate streaming duration and tokens per second."""
+        """Calculate streaming duration and tokens per second.
+        
+        Args:
+            output_tokens: Number of output tokens generated.
+            
+        Returns:
+            Tuple of (streaming_duration, tokens_per_second).
+        """

42-69: Enhance docstrings for public API per coding guidelines.

StreamingContext is a public class, so per the coding guidelines, it should use Google-style docstrings with proper Attributes and Returns sections.

🔎 Suggested docstring improvements
 class StreamingContext(BaseModel):
-    """Context for streaming response parsing."""
+    """Context for streaming response parsing.
+    
+    Attributes:
+        conversation_id: Conversation identifier from API.
+        final_response: Complete response text after streaming.
+        tool_calls: List of tool call dictionaries.
+        input_tokens: Number of input tokens consumed.
+        output_tokens: Number of output tokens generated.
+        perf: Performance tracker for TTFT and throughput metrics.
+    """

     def to_response_dict(self) -> dict[str, Any]:
-        """Convert context to response dictionary."""
+        """Convert context to response dictionary.
+        
+        Returns:
+            Dictionary containing response, tool_calls, conversation_id,
+            token counts, and streaming performance metrics.
+        """
src/lightspeed_evaluation/core/models/api.py (1)

98-134: Expand docstring to Google-style format.

The from_raw_response classmethod has a minimal docstring that doesn't document parameters or return value. Per coding guidelines, all public APIs should use Google-style docstrings.

🔎 Proposed docstring enhancement
     @classmethod
     def from_raw_response(cls, raw_data: dict[str, Any]) -> "APIResponse":
-        """Create APIResponse from raw API response data."""
+        """Create APIResponse from raw API response data.
+
+        Args:
+            raw_data: Dictionary containing raw API response with keys:
+                - response: Response text (required)
+                - conversation_id: Conversation tracking ID (required)
+                - tool_calls: Tool call sequences (optional)
+                - rag_chunks: RAG context chunks (optional)
+                - input_tokens: Input token count (optional)
+                - output_tokens: Output token count (optional)
+                - time_to_first_token: TTFT metric for streaming (optional)
+                - streaming_duration: Total streaming time (optional)
+                - tokens_per_second: Throughput metric (optional)
+
+        Returns:
+            APIResponse instance with populated fields.
+
+        Raises:
+            ValueError: If conversation_id is missing from raw_data.
+        """

As per coding guidelines, all public APIs require Google-style docstrings.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 54f7077 and dbe9eb8.

📒 Files selected for processing (15)
  • README.md
  • config/system.yaml
  • src/lightspeed_evaluation/core/api/streaming_parser.py
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/core/models/api.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/models/mixins.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • tests/unit/core/api/test_streaming_parser.py
  • tests/unit/core/models/test_api_additional.py
  • tests/unit/core/output/test_generator.py
🚧 Files skipped from review as they are similar to previous changes (8)
  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • tests/unit/core/api/test_streaming_parser.py
  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/core/models/data.py
  • tests/unit/core/output/test_generator.py
  • src/lightspeed_evaluation/core/constants.py
  • config/system.yaml
🧰 Additional context used
📓 Path-based instructions (3)
src/lightspeed_evaluation/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/lightspeed_evaluation/**/*.py: Provide type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels in lightspeed_evaluation modules

Files:

  • src/lightspeed_evaluation/core/models/mixins.py
  • src/lightspeed_evaluation/core/api/streaming_parser.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/models/api.py
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
**/test_*.py

📄 CodeRabbit inference engine (AGENTS.md)

Use pytest mocking (mocker fixture) instead of unittest.mock

Files:

  • tests/unit/core/models/test_api_additional.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Mirror test structure to source code organization, with test files named test_*.py and test functions/classes named test_* and Test*

Files:

  • tests/unit/core/models/test_api_additional.py
🧠 Learnings (9)
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.

Applied to files:

  • src/lightspeed_evaluation/core/api/streaming_parser.py
  • src/lightspeed_evaluation/core/models/api.py
📚 Learning: 2026-01-06T10:26:12.614Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 127
File: src/lightspeed_evaluation/core/api/streaming_parser.py:122-149
Timestamp: 2026-01-06T10:26:12.614Z
Learning: In streaming_parser.py, using ValueError for response validation errors is intentional because the API boundary maps these to APIError via _handle_validation_error (see client.py lines 192-193). Ensure downstream validation consistently raises ValueError and that the API boundary translation preserves error semantics; update tests to verify that APIError is raised at the boundary with appropriate messaging.

Applied to files:

  • src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-12-11T10:05:06.422Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-11T10:05:06.422Z
Learning: Applies to src/lightspeed_evaluation/**/*.py : Use custom exceptions from core.system.exceptions for error handling

Applied to files:

  • src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2026-01-06T10:26:12.614Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 127
File: src/lightspeed_evaluation/core/api/streaming_parser.py:122-149
Timestamp: 2026-01-06T10:26:12.614Z
Learning: DataValidationError in src/lightspeed_evaluation/core/system/exceptions.py is used exclusively for data file validation scenarios (YAML parsing, Pydantic model validation, file existence checks) in validator.py, not for runtime API response validation errors.

Applied to files:

  • src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-08-22T09:16:29.070Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 26
File: lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py:124-151
Timestamp: 2025-08-22T09:16:29.070Z
Learning: In lsc_agent_eval project, the maintainer (asamal4) prefers reactive error handling - adding support for additional error response fields only when they occur in practice, rather than preemptively handling all possible error formats.

Applied to files:

  • src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-09-09T14:58:10.630Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/pipeline/evaluation/amender.py:32-41
Timestamp: 2025-09-09T14:58:10.630Z
Learning: In the lightspeed-evaluation framework, when API is enabled, every turn should make a fresh API call regardless of whether the turn already has response or tool_calls data. This ensures consistency and fresh responses for each evaluation run.

Applied to files:

  • README.md
📚 Learning: 2025-07-16T13:20:45.006Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:45.006Z
Learning: In the lsc_agent_eval package, evaluation results use distinct values: "FAIL" means the evaluation ran successfully but the result was negative, while "ERROR" means there was an issue executing the evaluation itself (e.g., setup script failed, API connection failed).

Applied to files:

  • README.md
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).

Applied to files:

  • README.md
📚 Learning: 2025-12-11T10:05:06.422Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-11T10:05:06.422Z
Learning: Applies to src/lightspeed_evaluation/**/*.py : Use Google-style docstrings for all public APIs

Applied to files:

  • src/lightspeed_evaluation/core/output/generator.py
🧬 Code graph analysis (3)
tests/unit/core/models/test_api_additional.py (1)
src/lightspeed_evaluation/core/models/api.py (2)
  • APIResponse (82-134)
  • from_raw_response (99-134)
src/lightspeed_evaluation/core/output/generator.py (1)
src/lightspeed_evaluation/core/output/statistics.py (1)
  • calculate_streaming_stats (252-293)
src/lightspeed_evaluation/core/models/api.py (1)
src/lightspeed_evaluation/core/models/mixins.py (1)
  • StreamingMetricsMixin (8-33)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: tests (3.13)
  • GitHub Check: tests (3.12)
🔇 Additional comments (17)
README.md (1)

14-14: Documentation looks good!

The streaming performance metrics feature is clearly documented with appropriate detail for users. The table format and output descriptions provide good clarity.

Also applies to: 333-346

src/lightspeed_evaluation/core/api/streaming_parser.py (5)

5-5: Imports and constants are appropriate.

The new imports support the performance tracking and Pydantic model definitions, and the CONTENT_EVENTS constant is properly defined for TTFT capture logic.

Also applies to: 9-9, 14-14


71-87: LGTM!

The tokens-per-second calculation correctly excludes TTFT from the throughput calculation and handles edge cases appropriately (no tokens, missing TTFT, negative generation time).


122-142: LGTM!

The event processing logic is clean and correctly handles all streaming events. The ValueError exceptions are appropriate here, as they're converted to APIError at the API boundary in client.py (based on learnings).


144-149: LGTM!

The validation logic correctly ensures required fields are present before returning the response. The ValueError exceptions align with the established pattern (converted to APIError at the API boundary).


90-120: Remove unnecessary pylint disable comment and use custom exceptions for error handling.

The pylint: disable=too-many-branches comment is no longer needed—pylint does not report this issue after the refactoring. Additionally, replace ValueError exceptions (lines 127, 147, 149) with APIError from core.system.exceptions to comply with the framework's error handling guidelines.

tests/unit/core/models/test_api_additional.py (1)

168-228: Excellent test coverage!

The streaming metrics tests cover all the important scenarios:

  • Direct instantiation with/without metrics
  • Parsing from raw response with/without metrics
  • Proper handling of None defaults for non-streaming endpoints
src/lightspeed_evaluation/core/models/mixins.py (1)

1-33: LGTM! Well-designed mixin.

The StreamingMetricsMixin is well-documented with clear field descriptions and appropriate type constraints. The ge=0 constraint correctly allows zero values while preventing negative metrics.
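Concretely, the mixin described here corresponds to a model along these lines (a sketch; field descriptions are paraphrased):

```python
from typing import Optional

from pydantic import BaseModel, Field


class StreamingMetricsMixin(BaseModel):
    """Mixin providing optional streaming performance metric fields."""

    time_to_first_token: Optional[float] = Field(
        default=None, ge=0, description="Seconds from request start to first content token"
    )
    streaming_duration: Optional[float] = Field(
        default=None, ge=0, description="Total seconds to receive the streamed response"
    )
    tokens_per_second: Optional[float] = Field(
        default=None, ge=0, description="Output throughput over the generation window"
    )
```

Models such as APIResponse, TurnData, and EvaluationResult pick the fields up by inheriting from the mixin, so a value of None consistently signals the non-streaming endpoint.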

src/lightspeed_evaluation/core/models/api.py (3)

7-8: LGTM!

The import of StreamingMetricsMixin is clean and necessary for the inheritance change in APIResponse.


82-82: LGTM!

The inheritance change from BaseModel to StreamingMetricsMixin is correct and cleanly adds streaming performance metrics to the API response model.


119-122: LGTM!

The extraction of streaming metrics using .get() is appropriate and aligns with the StreamingMetricsMixin fields that default to None for non-streaming endpoints.

src/lightspeed_evaluation/core/output/generator.py (6)

20-20: LGTM!

The import of calculate_streaming_stats is clean and necessary for computing streaming performance metrics.


69-79: LGTM!

The conditional calculation of streaming stats is well-structured. Setting stats["streaming"] = {} when evaluation_data is absent provides a safe default and avoids downstream KeyErrors.


104-136: LGTM!

Passing stats.get("streaming", {}) to both JSON and text summary methods is defensive and ensures safe handling even if the stats structure changes.


204-263: LGTM!

The JSON summary generation correctly integrates streaming metrics:

  • streaming_performance is conditionally included in summary_stats only when data exists (line 233), keeping the summary clean.
  • Individual result entries always include streaming metric fields (lines 252-254), even if None, which provides a predictable structure for machine parsing.

Both approaches serve their respective contexts well.


265-299: LGTM!

The text summary generation cleanly integrates the streaming statistics section via _write_streaming_stats, maintaining consistency with other statistics sections.


326-357: LGTM!

The _write_streaming_stats method has clean logic (see the sketch after this list):

  • Early return when no streaming data is available (line 333-334), avoiding empty sections.
  • Conditional rendering of each metric based on count, providing flexibility for partial data.
  • Appropriate numeric formatting (3 decimals for time, 1 for throughput) for readability.
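A minimal sketch of that section writer, assuming a text-stream target and the per-metric stats shape sketched earlier for calculate_streaming_stats; labels and layout are illustrative:

```python
from typing import Any, TextIO

_METRIC_ROWS = [
    ("Time to First Token (s)", "time_to_first_token", "{:.3f}"),
    ("Streaming Duration (s)", "streaming_duration", "{:.3f}"),
    ("Tokens per Second", "tokens_per_second", "{:.1f}"),
]


def _write_streaming_stats(output: TextIO, streaming_stats: dict[str, Any]) -> None:
    """Append a streaming-performance section to the text summary."""
    # Early return when nothing was streamed, so no empty section is rendered.
    if all((streaming_stats.get(key) or {}).get("count", 0) == 0 for _, key, _ in _METRIC_ROWS):
        return
    output.write("\nSTREAMING PERFORMANCE\n")
    for title, key, fmt in _METRIC_ROWS:
        metric = streaming_stats.get(key) or {}
        if metric.get("count", 0) == 0:  # render each metric only when data exists
            continue
        output.write(
            f"  {title}: mean={fmt.format(metric['mean'])}, "
            f"median={fmt.format(metric['median'])}, "
            f"min={fmt.format(metric['min'])}, max={fmt.format(metric['max'])}\n"
        )
```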

Contributor

@VladimirKadlec VladimirKadlec left a comment

LGTM

@bsatapat-jpg bsatapat-jpg force-pushed the dev branch 2 times, most recently from e6e633b to 827856a on January 6, 2026 at 17:17
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI Agents
In @src/lightspeed_evaluation/core/api/streaming_parser.py:
- Around line 90-119: The docstring for the public function
parse_streaming_response is not Google-style; update it to include an "Args"
section documenting the response parameter (type httpx.Response and that it must
be a streaming response) and a "Returns" section describing the returned
dict[str, Any] (what keys are present/meaning), while preserving the existing
description of TTFT, streaming duration and tokens/sec; keep the brief summary
line and place detailed descriptions under Args and Returns, and ensure the
docstring is triple-quoted and located immediately above
parse_streaming_response.
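Following that instruction, the docstring could read roughly as below (wording illustrative; the quoted annotation avoids importing httpx in this snippet):

```python
from typing import Any


def parse_streaming_response(response: "httpx.Response") -> dict[str, Any]:
    """Parse a streaming API response and collect performance metrics.

    Args:
        response: Streaming ``httpx.Response`` whose body is consumed as SSE lines;
            it must come from the streaming query endpoint.

    Returns:
        Dict with the final response text, tool calls, conversation ID, token counts,
        and the metrics ``time_to_first_token``, ``streaming_duration``, and
        ``tokens_per_second``.
    """
```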
🧹 Nitpick comments (2)
src/lightspeed_evaluation/core/api/streaming_parser.py (2)

17-39: Consider expanding docstrings for better maintainability.

While private classes aren't strictly required to have Google-style docstrings per the coding guidelines, the _PerformanceTracker class would benefit from more detailed documentation:

  • Class docstring could document the attributes and their purpose
  • capture_ttft() could document that it's idempotent (only captures once)
  • get_metrics() could document the return tuple structure

This would improve maintainability for future developers working with the performance tracking logic.


42-68: Consider expanding docstrings for better maintainability.

The StreamingContext class would benefit from more detailed documentation:

  • Class docstring could document the attributes and the overall parsing state it maintains
  • to_response_dict() could document the return dictionary structure with all the keys (response, tool_calls, conversation_id, tokens, metrics)

This would help future maintainers understand the streaming context lifecycle and response format.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e6e633b and 827856a.

📒 Files selected for processing (15)
  • README.md
  • config/system.yaml
  • src/lightspeed_evaluation/core/api/streaming_parser.py
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/core/models/api.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/models/mixins.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • tests/unit/core/api/test_streaming_parser.py
  • tests/unit/core/models/test_api_additional.py
  • tests/unit/core/output/test_generator.py
🚧 Files skipped from review as they are similar to previous changes (7)
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • README.md
  • config/system.yaml
  • src/lightspeed_evaluation/core/models/mixins.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • tests/unit/core/models/test_api_additional.py
🧰 Additional context used
📓 Path-based instructions (3)
src/lightspeed_evaluation/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/lightspeed_evaluation/**/*.py: Provide type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels in lightspeed_evaluation modules

Files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/models/api.py
  • src/lightspeed_evaluation/core/api/streaming_parser.py
  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/output/generator.py
**/test_*.py

📄 CodeRabbit inference engine (AGENTS.md)

Use pytest mocking (mocker fixture) instead of unittest.mock

Files:

  • tests/unit/core/api/test_streaming_parser.py
  • tests/unit/core/output/test_generator.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Mirror test structure to source code organization, with test files named test_*.py and test functions/classes named test_* and Test*

Files:

  • tests/unit/core/api/test_streaming_parser.py
  • tests/unit/core/output/test_generator.py
🧠 Learnings (13)
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.

Applied to files:

  • src/lightspeed_evaluation/core/models/api.py
  • src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2026-01-06T10:26:12.614Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 127
File: src/lightspeed_evaluation/core/api/streaming_parser.py:122-149
Timestamp: 2026-01-06T10:26:12.614Z
Learning: In streaming_parser.py, using ValueError for response validation errors is intentional because the API boundary maps these to APIError via _handle_validation_error (see client.py lines 192-193). Ensure downstream validation consistently raises ValueError and that the API boundary translation preserves error semantics; update tests to verify that APIError is raised at the boundary with appropriate messaging.

Applied to files:

  • src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-12-11T10:05:06.422Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-11T10:05:06.422Z
Learning: Applies to src/lightspeed_evaluation/**/*.py : Use custom exceptions from core.system.exceptions for error handling

Applied to files:

  • src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2026-01-06T10:26:12.614Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 127
File: src/lightspeed_evaluation/core/api/streaming_parser.py:122-149
Timestamp: 2026-01-06T10:26:12.614Z
Learning: DataValidationError in src/lightspeed_evaluation/core/system/exceptions.py is used exclusively for data file validation scenarios (YAML parsing, Pydantic model validation, file existence checks) in validator.py, not for runtime API response validation errors.

Applied to files:

  • src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-08-22T09:16:29.070Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 26
File: lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py:124-151
Timestamp: 2025-08-22T09:16:29.070Z
Learning: In lsc_agent_eval project, the maintainer (asamal4) prefers reactive error handling - adding support for additional error response fields only when they occur in practice, rather than preemptively handling all possible error formats.

Applied to files:

  • src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-09-11T12:47:06.747Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.

Applied to files:

  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/output/generator.py
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.

Applied to files:

  • src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-12-11T10:05:06.422Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-11T10:05:06.422Z
Learning: Applies to src/lightspeed_evaluation/**/*.py : Use Google-style docstrings for all public APIs

Applied to files:

  • src/lightspeed_evaluation/core/output/generator.py
🧬 Code graph analysis (6)
tests/unit/core/api/test_streaming_parser.py (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
  • parse_streaming_response (90-119)
src/lightspeed_evaluation/core/models/api.py (1)
src/lightspeed_evaluation/core/models/mixins.py (1)
  • StreamingMetricsMixin (8-33)
src/lightspeed_evaluation/core/output/statistics.py (1)
src/lightspeed_evaluation/core/models/data.py (1)
  • EvaluationData (318-381)
src/lightspeed_evaluation/core/models/data.py (1)
src/lightspeed_evaluation/core/models/mixins.py (1)
  • StreamingMetricsMixin (8-33)
tests/unit/core/output/test_generator.py (1)
src/lightspeed_evaluation/core/output/generator.py (2)
  • _generate_json_summary (204-263)
  • _generate_text_summary (265-299)
src/lightspeed_evaluation/core/output/generator.py (1)
src/lightspeed_evaluation/core/output/statistics.py (2)
  • calculate_streaming_stats (252-293)
  • calculate_api_token_usage (227-249)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: tests (3.12)
🔇 Additional comments (19)
src/lightspeed_evaluation/core/output/statistics.py (1)

252-293: LGTM! Clean implementation of streaming statistics.

The new calculate_streaming_stats function follows all coding guidelines:

  • Proper type hints for the public API
  • Google-style docstring with clear Args and Returns sections
  • Correct aggregation logic that filters None values and handles empty datasets
  • Consistent statistical measures (count, mean, median, min, max, std) matching the existing _finalize_metric_stats pattern

The internal helper _calc_stats appropriately returns {"count": 0} for empty inputs, preventing division-by-zero errors.
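
Sketching that aggregation shape below; the real function takes the project's evaluation data models, whereas a plain list of per-turn metric dicts is assumed here.

```python
import statistics
from typing import Optional

STREAMING_FIELDS = ("time_to_first_token", "streaming_duration", "tokens_per_second")


def _calc_stats(values: list[float]) -> dict[str, float]:
    """Return count/mean/median/min/max/std, or {'count': 0} when there is no data."""
    if not values:
        return {"count": 0}
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "std": statistics.stdev(values) if len(values) > 1 else 0.0,
    }


def calculate_streaming_stats(
    turns: list[dict[str, Optional[float]]],
) -> dict[str, dict[str, float]]:
    """Aggregate streaming metrics across turns, skipping turns where a metric is None."""
    return {
        name: _calc_stats([t[name] for t in turns if t.get(name) is not None])
        for name in STREAMING_FIELDS
    }
```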

src/lightspeed_evaluation/core/models/data.py (3)

10-10: LGTM! Proper mixin import.

The StreamingMetricsMixin import is correctly added to support streaming metrics integration.


36-36: LGTM! Correct inheritance for turn-level streaming metrics.

TurnData now inherits from StreamingMetricsMixin, which provides the three streaming performance fields (time_to_first_token, streaming_duration, tokens_per_second). This is the appropriate level for capturing per-turn streaming metrics from API responses.


384-384: LGTM! Correct inheritance for result-level streaming metrics.

EvaluationResult now inherits from StreamingMetricsMixin, enabling streaming metrics to propagate from TurnData through to evaluation results. This supports the downstream reporting requirements shown in the AI summary.

src/lightspeed_evaluation/core/models/api.py (3)

7-7: LGTM! Proper mixin import for API responses.

The StreamingMetricsMixin import enables APIResponse to carry streaming metrics from the lightspeed-stack API.


82-82: LGTM! Correct inheritance at API boundary.

APIResponse now inherits from StreamingMetricsMixin, which is the appropriate entry point for streaming metrics into the evaluation framework.


119-134: LGTM! Proper extraction and propagation of streaming metrics.

The streaming metrics are correctly extracted from the raw API response using .get() (returns None if absent) and passed to the APIResponse constructor. This aligns with the mixin's optional field design, where metrics are None for non-streaming endpoints.
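
In miniature, the pattern being described is optional-key extraction. The helper name below is hypothetical; the real code feeds these values straight into the APIResponse constructor.

```python
from typing import Any, Optional


def extract_streaming_metrics(raw: dict[str, Any]) -> dict[str, Optional[float]]:
    """Read streaming metrics from a raw payload; absent keys become None."""
    return {
        "time_to_first_token": raw.get("time_to_first_token"),
        "streaming_duration": raw.get("streaming_duration"),
        "tokens_per_second": raw.get("tokens_per_second"),
    }


# A non-streaming payload without these keys yields all-None metrics.
print(extract_streaming_metrics({"response": "hi", "conversation_id": "c1"}))
```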

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)

185-192: LGTM! Streaming metrics properly included in success results.

The three streaming performance metrics (time_to_first_token, streaming_duration, tokens_per_second) are correctly propagated from turn_data to the evaluation result, with proper None handling when turn_data is absent.


218-236: LGTM! Consistent streaming metrics in error results.

The error result path mirrors the success path by including streaming metrics from turn_data. The local variable pattern (line 218) improves readability and maintains consistency with the success path (line 165).

tests/unit/core/output/test_generator.py (2)

111-120: LGTM! Test updated to include streaming_stats parameter.

The test correctly initializes streaming_stats as an empty dict and passes it to _generate_json_summary, aligning with the updated generator signature shown in the relevant code snippets.


143-152: LGTM! Test updated to include streaming_stats parameter.

The test correctly initializes streaming_stats as an empty dict and passes it to _generate_text_summary, aligning with the updated generator signature shown in the relevant code snippets.

src/lightspeed_evaluation/core/api/streaming_parser.py (2)

71-87: LGTM! Solid edge case handling.

The tokens-per-second calculation correctly excludes TTFT from throughput and handles all edge cases (no tokens, no TTFT, zero/negative generation time). The debug logging is helpful for observability.


122-149: LGTM! Event processing and validation logic is correct.

The event handling properly processes all streaming event types (start, turn_complete, tool_call, end, error) and validation ensures required fields are present. The use of ValueError is correct per the architecture—these exceptions are translated to APIError at the API boundary in client.py.

Based on learnings, ValueError is the appropriate exception type for internal validation in streaming_parser.py.

tests/unit/core/api/test_streaming_parser.py (2)

36-39: LGTM! Good test coverage extension.

The assertions verify that the performance metrics are included in the response. This aligns well with the new functionality added to the streaming parser.


260-377: LGTM! Comprehensive performance metrics test coverage.

The new TestStreamingPerformanceMetrics class provides excellent coverage of the streaming metrics functionality:

  • TTFT capture on various event types (token, tool_call, turn_complete)
  • Streaming duration measurement
  • Tokens-per-second calculation with edge cases (with/without token counts)
  • Complete end-to-end flow validation

The tests properly use the pytest mocker fixture and follow the required test structure conventions.

src/lightspeed_evaluation/core/output/generator.py (4)

20-20: LGTM! Clean streaming stats integration.

The streaming statistics are properly integrated into the report generation flow with the same pattern used for API token usage—calculate when data is available, otherwise initialize as empty. This ensures downstream code can safely access streaming stats without errors.

Also applies to: 69-79


204-263: LGTM! Well-structured JSON summary integration.

The streaming statistics are properly integrated into the JSON output:

  • Conditionally includes streaming_performance section only when data is available (lines 233-234)
  • Per-result entries include streaming metrics fields (lines 251-254)
  • Function signature properly updated with streaming_stats parameter (line 211)

The conditional inclusion pattern prevents empty sections in the summary while maintaining consistent per-result structure.


265-299: LGTM! Consistent text summary integration.

The streaming statistics are properly integrated into the text summary generation, following the same pattern as other statistics sections (overall, tokens, metrics, conversations).


326-357: LGTM! Clean streaming stats rendering with proper guards.

The _write_streaming_stats method correctly:

  • Returns early if no streaming data is available (line 334)
  • Conditionally writes each metric section only when data exists (count > 0)
  • Formats statistics consistently with mean, median, min, max values
  • Maintains the same formatting style as other statistics sections

The defensive programming with count checks ensures clean output even with partial streaming data.
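
As a sketch of that guarded rendering, with a module-level function standing in for the method and an assumed stats-dict shape:

```python
from typing import TextIO


def write_streaming_stats(out: TextIO, streaming_stats: dict[str, dict[str, float]]) -> None:
    """Write a streaming section per metric, skipping metrics with no data."""
    if not any(s.get("count", 0) for s in streaming_stats.values()):
        return  # no streaming data at all -> omit the whole section
    out.write("\nSTREAMING PERFORMANCE\n")
    out.write("-" * 40 + "\n")
    for name, s in streaming_stats.items():
        if s.get("count", 0) == 0:
            continue  # partial data: only render metrics that were observed
        out.write(
            f"{name}: mean={s['mean']:.3f}, median={s['median']:.3f}, "
            f"min={s['min']:.3f}, max={s['max']:.3f} (n={s['count']})\n"
        )


# Usage sketch:
# import sys; write_streaming_stats(sys.stdout, {"time_to_first_token": {"count": 0}})
```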

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)

114-116: Docstring mentions exception types that differ from what the code actually raises.

The Raises section documents APIError and DataValidationError, but the code raises ValueError. While the conversion happens in client.py, this could confuse developers reading just this module.

Consider updating the docstring to reflect the actual exception raised, with a note about the boundary conversion:

🔎 Suggested docstring clarification
     Raises:
-        APIError: If an error event is received from the streaming API.
-        DataValidationError: If required response fields are missing.
+        ValueError: If streaming response is invalid (error event received,
+            missing final_response, or missing conversation_id). Note: These
+            are converted to APIError at the client boundary.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 827856a and 3251e5a.

📒 Files selected for processing (15)
  • README.md
  • config/system.yaml
  • src/lightspeed_evaluation/core/api/streaming_parser.py
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/core/models/api.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/models/mixins.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • tests/unit/core/api/test_streaming_parser.py
  • tests/unit/core/models/test_api_additional.py
  • tests/unit/core/output/test_generator.py
🚧 Files skipped from review as they are similar to previous changes (9)
  • config/system.yaml
  • src/lightspeed_evaluation/core/constants.py
  • tests/unit/core/output/test_generator.py
  • src/lightspeed_evaluation/core/models/api.py
  • tests/unit/core/models/test_api_additional.py
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • src/lightspeed_evaluation/core/models/__init__.py
  • src/lightspeed_evaluation/core/output/statistics.py
  • README.md
🧰 Additional context used
📓 Path-based instructions (3)
src/lightspeed_evaluation/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/lightspeed_evaluation/**/*.py: Provide type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels in lightspeed_evaluation modules

Files:

  • src/lightspeed_evaluation/core/models/mixins.py
  • src/lightspeed_evaluation/core/api/streaming_parser.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/models/data.py
**/test_*.py

📄 CodeRabbit inference engine (AGENTS.md)

Use pytest mocking (mocker fixture) instead of unittest.mock

Files:

  • tests/unit/core/api/test_streaming_parser.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Mirror test structure to source code organization, with test files named test_*.py and test functions/classes named test_* and Test*

Files:

  • tests/unit/core/api/test_streaming_parser.py
🧠 Learnings (13)
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.

Applied to files:

  • src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2026-01-06T10:26:12.614Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 127
File: src/lightspeed_evaluation/core/api/streaming_parser.py:122-149
Timestamp: 2026-01-06T10:26:12.614Z
Learning: In streaming_parser.py, using ValueError for response validation errors is intentional because the API boundary maps these to APIError via _handle_validation_error (see client.py lines 192-193). Ensure downstream validation consistently raises ValueError and that the API boundary translation preserves error semantics; update tests to verify that APIError is raised at the boundary with appropriate messaging.

Applied to files:

  • src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-12-11T10:05:06.422Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-11T10:05:06.422Z
Learning: Applies to src/lightspeed_evaluation/**/*.py : Use custom exceptions from core.system.exceptions for error handling

Applied to files:

  • src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2026-01-06T10:26:12.614Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 127
File: src/lightspeed_evaluation/core/api/streaming_parser.py:122-149
Timestamp: 2026-01-06T10:26:12.614Z
Learning: DataValidationError in src/lightspeed_evaluation/core/system/exceptions.py is used exclusively for data file validation scenarios (YAML parsing, Pydantic model validation, file existence checks) in validator.py, not for runtime API response validation errors.

Applied to files:

  • src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-08-22T09:16:29.070Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 26
File: lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py:124-151
Timestamp: 2025-08-22T09:16:29.070Z
Learning: In lsc_agent_eval project, the maintainer (asamal4) prefers reactive error handling - adding support for additional error response fields only when they occur in practice, rather than preemptively handling all possible error formats.

Applied to files:

  • src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-12-11T10:05:06.422Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-11T10:05:06.422Z
Learning: Applies to src/lightspeed_evaluation/**/*.py : Use Google-style docstrings for all public APIs

Applied to files:

  • src/lightspeed_evaluation/core/api/streaming_parser.py
  • src/lightspeed_evaluation/core/output/generator.py
📚 Learning: 2025-09-11T12:47:06.747Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.

Applied to files:

  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.

Applied to files:

  • src/lightspeed_evaluation/core/models/data.py
🧬 Code graph analysis (3)
tests/unit/core/api/test_streaming_parser.py (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
  • parse_streaming_response (90-137)
src/lightspeed_evaluation/core/output/generator.py (1)
src/lightspeed_evaluation/core/output/statistics.py (2)
  • calculate_streaming_stats (252-293)
  • calculate_api_token_usage (227-249)
src/lightspeed_evaluation/core/models/data.py (1)
src/lightspeed_evaluation/core/models/mixins.py (1)
  • StreamingMetricsMixin (8-33)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: tests (3.13)
  • GitHub Check: tests (3.12)
🔇 Additional comments (19)
src/lightspeed_evaluation/core/models/mixins.py (1)

1-33: LGTM!

The StreamingMetricsMixin is well-designed as a reusable Pydantic mixin. The implementation correctly:

  • Uses Optional[float] with default=None for optional streaming metrics
  • Applies ge=0 validation constraints to prevent negative values
  • Includes clear field descriptions for documentation
  • Provides a helpful class docstring explaining when fields are populated
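
For reference, a mixin matching those bullets could look roughly like this; the field descriptions are paraphrased, and the real definition lives in core/models/mixins.py.

```python
from typing import Optional

from pydantic import BaseModel, Field


class StreamingMetricsMixin(BaseModel):
    """Optional streaming metrics, populated only for streaming API responses."""

    time_to_first_token: Optional[float] = Field(
        default=None, ge=0, description="Seconds until the first content event."
    )
    streaming_duration: Optional[float] = Field(
        default=None, ge=0, description="Total seconds spent consuming the stream."
    )
    tokens_per_second: Optional[float] = Field(
        default=None, ge=0, description="Output-token throughput, excluding TTFT."
    )
```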
src/lightspeed_evaluation/core/models/data.py (4)

10-10: LGTM!

Clean import of the new mixin for use in this module.


36-36: LGTM!

TurnData now inherits from StreamingMetricsMixin, which cleanly adds the streaming performance fields (time_to_first_token, streaming_duration, tokens_per_second) with proper validation constraints. This is the appropriate location since streaming metrics are captured at the turn level from API responses.


384-384: LGTM!

EvaluationResult now inherits from StreamingMetricsMixin, ensuring streaming metrics are available in evaluation results with consistent validation (ge=0). This addresses the previous review comment about adding non-negative validation constraints to these fields.


427-429: LGTM!

The tool_calls field addition is appropriate with Optional[str] type (formatted string representation) and clear description.

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)

185-192: LGTM!

Streaming performance metrics are correctly propagated from turn_data to the success EvaluationResult. The conditional pattern turn_data.X if turn_data else None is consistent with how other optional fields are handled.


218-235: LGTM!

The error result creation now:

  1. Uses a local turn_data variable for cleaner access
  2. Propagates streaming metrics consistently with the success path
  3. Properly handles the case when turn_data is None

This ensures streaming metrics are captured even for failed evaluations, which is valuable for performance analysis.

src/lightspeed_evaluation/core/api/streaming_parser.py (5)

17-39: LGTM!

_PerformanceTracker is well-designed:

  • Uses Pydantic BaseModel for consistency (addresses previous review feedback)
  • PrivateAttr for internal state (_first_content_received) that shouldn't be serialized
  • capture_ttft() correctly captures TTFT only once
  • get_metrics() cleanly delegates throughput calculation to the utility function
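
A minimal sketch consistent with those bullets, assuming perf_counter-based timing; the start/start_time names are assumptions, and the real get_metrics also folds in tokens/sec via the helper discussed below.

```python
import time
from typing import Optional

from pydantic import BaseModel, PrivateAttr


class _PerformanceTracker(BaseModel):
    """Track time to first token and total duration for one streamed response."""

    start_time: float = 0.0
    ttft: Optional[float] = None
    _first_content_received: bool = PrivateAttr(default=False)

    def start(self) -> None:
        """Record the moment the stream begins to be consumed."""
        self.start_time = time.perf_counter()

    def capture_ttft(self) -> None:
        """Capture TTFT exactly once; subsequent calls are no-ops."""
        if not self._first_content_received:
            self.ttft = time.perf_counter() - self.start_time
            self._first_content_received = True

    def get_metrics(self) -> tuple[Optional[float], float]:
        """Return (time_to_first_token, total_streaming_duration) in seconds."""
        return self.ttft, time.perf_counter() - self.start_time
```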

42-68: LGTM!

StreamingContext effectively encapsulates streaming state:

  • All fields have appropriate defaults
  • to_response_dict() provides clean serialization for the final response
  • Composition with _PerformanceTracker keeps concerns separated

71-87: LGTM!

The _calculate_tokens_per_second function correctly:

  • Returns None when output_tokens ≤ 0 or TTFT is unavailable
  • Excludes TTFT from throughput calculation (generation_time = total_duration - ttft)
  • Returns None if generation_time ≤ 0 (edge case protection)
  • Logs performance metrics at debug level
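
A hedged sketch of that calculation with the edge cases spelled out; the function name follows this review, while the signature and logging format are illustrative.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def _calculate_tokens_per_second(
    output_tokens: int, ttft: Optional[float], total_duration: float
) -> Optional[float]:
    """Throughput over the generation phase only, i.e. total duration minus TTFT."""
    if output_tokens <= 0 or ttft is None:
        return None
    generation_time = total_duration - ttft
    if generation_time <= 0:
        return None
    tps = output_tokens / generation_time
    logger.debug(
        "tokens/sec=%.1f (tokens=%d, generation_time=%.3fs)",
        tps, output_tokens, generation_time,
    )
    return tps


# Example: 120 output tokens, TTFT 0.5 s, total 2.5 s -> 120 / 2.0 = 60 tokens/sec.
assert _calculate_tokens_per_second(120, 0.5, 2.5) == 60.0
```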

90-137: LGTM!

The refactored parse_streaming_response function:

  • Has a proper Google-style docstring with Args, Returns, and Raises sections (addresses previous review feedback)
  • Uses the new context-based flow for cleaner state management
  • Captures TTFT on content events (token, tool_call, turn_complete)
  • Delegates event processing and validation to separate functions

Note: The docstring mentions APIError and DataValidationError in the Raises section, but the code raises ValueError. Based on learnings, this is intentional as client.py converts ValueError to APIError at the API boundary.
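
In outline, that flow might look like the skeleton below. The event names follow this review; the payload field names, the simplified dict context, and the inlined logic are assumptions standing in for StreamingContext, the tracker, and the helper functions.

```python
import json
import time
from typing import Any, Iterable

CONTENT_EVENTS = {"token", "tool_call", "turn_complete"}


def parse_stream_skeleton(lines: Iterable[str]) -> dict[str, Any]:
    """Decode SSE lines, record TTFT on the first content event, collect the answer."""
    start = time.perf_counter()
    ctx: dict[str, Any] = {"response": "", "conversation_id": None, "ttft": None}
    for raw in lines:
        if not raw.startswith("data: "):
            continue  # only 'data: ...' lines carry events
        event = json.loads(raw[len("data: "):])
        kind = event.get("event")
        if kind in CONTENT_EVENTS and ctx["ttft"] is None:
            ctx["ttft"] = time.perf_counter() - start  # first content seen
        if kind == "start":
            ctx["conversation_id"] = event.get("data", {}).get("conversation_id")
        elif kind == "turn_complete":
            ctx["response"] = event.get("data", {}).get("token", "")
    ctx["streaming_duration"] = time.perf_counter() - start
    return ctx  # validation of required fields happens in a separate step
```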


140-167: LGTM!

The _process_event and _validate_response functions are well-structured:

  • Single responsibility - each handles a specific concern
  • Proper logging at each processing step
  • Error handling with clear messages

The use of ValueError here is intentional per project conventions (converted to APIError in client.py).
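
To make the convention explicit, here is a sketch of the internal check and the boundary translation; APIError below is a stand-in for the project's exception class, and the function names are illustrative.

```python
from typing import Optional


class APIError(Exception):
    """Stand-in for the client-facing exception raised at the API boundary."""


def _validate_response(final_response: str, conversation_id: Optional[str]) -> None:
    """Internal validation: raise ValueError when required fields are missing."""
    if not final_response:
        raise ValueError("No final response received from stream")
    if not conversation_id:
        raise ValueError("No conversation_id received from stream")


def finalize(final_response: str, conversation_id: Optional[str]) -> None:
    """Boundary translation: internal ValueError surfaces to callers as APIError."""
    try:
        _validate_response(final_response, conversation_id)
    except ValueError as exc:
        raise APIError(str(exc)) from exc
```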

tests/unit/core/api/test_streaming_parser.py (2)

36-39: LGTM!

Good addition - verifying that performance metrics keys are present in the result ensures the API contract is maintained.


260-377: LGTM!

Excellent test coverage for the new streaming performance metrics:

  • test_time_to_first_token_captured: Verifies TTFT is captured and non-negative
  • test_streaming_duration_captured: Verifies duration is captured and >= TTFT
  • test_tokens_per_second_with_token_counts: Verifies throughput calculation with token data
  • test_tokens_per_second_without_token_counts: Verifies throughput is None without tokens
  • test_ttft_captured_on_token_event: Verifies TTFT capture on token events
  • test_ttft_captured_on_tool_call_event: Verifies TTFT capture on tool_call events
  • test_performance_metrics_with_complete_flow: End-to-end verification

The tests follow pytest conventions and use the mocker fixture as per coding guidelines.
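
A pytest-style sketch of such a test using the mocker fixture; the fake event payloads, the iter_lines-based fake response, and the top-level metric keys in the result are all assumptions about the parser's contract.

```python
import json

from lightspeed_evaluation.core.api.streaming_parser import parse_streaming_response


def _sse(event: str, data: dict) -> str:
    """Build one 'data: ...' SSE line for the fake stream."""
    return "data: " + json.dumps({"event": event, "data": data})


def test_streaming_metrics_present(mocker):
    """TTFT and duration should be captured and consistent for a complete stream."""
    response = mocker.Mock()
    response.iter_lines.return_value = [
        _sse("start", {"conversation_id": "c1"}),
        _sse("token", {"token": "Hello"}),
        _sse("turn_complete", {"token": "Hello world"}),
        _sse("end", {"input_tokens": 5, "output_tokens": 2}),
    ]
    result = parse_streaming_response(response)
    # Key layout assumed; adjust to the parser's actual response shape.
    assert result["time_to_first_token"] >= 0
    assert result["streaming_duration"] >= result["time_to_first_token"]
```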

src/lightspeed_evaluation/core/output/generator.py (5)

20-20: LGTM!

Clean import of the new statistics function for streaming metrics calculation.


69-79: LGTM!

Streaming stats are properly calculated when evaluation_data is available and initialized as an empty dict otherwise. This ensures consistent data structure across both paths.


227-234: LGTM!

Good conditional inclusion - streaming_performance is only added to the JSON summary when there's actual streaming data. This keeps the output clean when streaming metrics aren't available.
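
The conditional-inclusion pattern in miniature; the summary keys here are illustrative.

```python
from typing import Any


def attach_streaming_section(
    summary: dict[str, Any], streaming_stats: dict[str, dict[str, float]]
) -> dict[str, Any]:
    """Add streaming_performance only when at least one metric has data."""
    if any(s.get("count", 0) for s in streaming_stats.values()):
        summary["streaming_performance"] = streaming_stats
    return summary


# With no observed streaming data the section is simply absent from the JSON.
print(attach_streaming_section({"overall": {}}, {"time_to_first_token": {"count": 0}}))
```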


251-254: LGTM!

Per-result streaming metrics are included in the JSON output, enabling detailed analysis of individual evaluation performance.


326-357: LGTM!

The _write_streaming_stats method is well-implemented:

  • Returns early when no data is available (line 334)
  • Checks each metric category before writing
  • Uses appropriate formatting (.3f for time, .1f for throughput)
  • Adds proper section headers and formatting

The docstring is concise but clear. Previous review feedback about expanding it was marked as addressed.

Collaborator

@asamal4 asamal4 left a comment

LGTM. Thank you !!

@asamal4 asamal4 merged commit a0778f5 into lightspeed-core:main Jan 7, 2026
15 checks passed
@bsatapat-jpg bsatapat-jpg deleted the dev branch January 7, 2026 06:08