feat: Add bigframes.execution_history API to track BigQuery jobs#16588
Conversation
Code Review
This pull request implements an execution history feature to track and display BigQuery and local Polars jobs initiated during a session. Key changes include the addition of a JobMetadata dataclass, updates to ExecutionMetrics for job tracking, and a specialized _ExecutionHistory DataFrame for formatted output. Review feedback identifies opportunities to improve error logging in the HTML representation, remove redundant attribute assignments in the metrics logic, and ensure that bytes processed during local executions are consistently aggregated into session-level metrics.
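For reviewers skimming without the diff, the `JobMetadata` dataclass might look roughly like the sketch below. Only `job_id`, the Console URL, the query preview, `total_bytes_processed`, `total_slot_ms`, and `duration_seconds` are mentioned in this PR; everything else here (defaults, types, field order) is an assumption, not the actual implementation:

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical sketch of the JobMetadata dataclass described in this PR;
# exact field names, types, and defaults may differ from bigframes itself.
@dataclass
class JobMetadata:
    job_id: str
    job_url: Optional[str] = None  # Google Cloud Console URL for debugging
    query: Optional[str] = None  # truncated preview of the SQL that was run
    total_bytes_processed: int = 0
    total_slot_ms: int = 0
    duration_seconds: float = 0.0
```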
```python
except Exception:
    return super()._repr_html_()  # type: ignore
```
Using a broad `except Exception:` can hide bugs in the HTML representation logic. It's better to catch specific exceptions, or at least log the caught exception to aid debugging if the formatting fails. This will provide visibility into any issues without breaking the user's interactive session.
Suggested change:

```diff
-except Exception:
+except Exception as e:
+    logger.warning("Failed to generate custom HTML representation for execution history: %s", e)
     return super()._repr_html_()  # type: ignore
```
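As a self-contained illustration of the pattern being suggested (the class and method names here are invented for the sketch, not taken from bigframes):

```python
import logging

logger = logging.getLogger(__name__)


class HistoryReprSketch:
    # Hypothetical stand-in for the _ExecutionHistory DataFrame; only the
    # try/except fallback pattern mirrors the suggestion above.
    def _custom_html(self) -> str:
        raise ValueError("simulated formatting bug")

    def _fallback_html(self) -> str:
        return "<table>basic repr</table>"

    def _repr_html_(self) -> str:
        try:
            return self._custom_html()
        except Exception as e:
            # Logging keeps formatting bugs visible without breaking
            # the user's interactive session.
            logger.warning(
                "Failed to generate custom HTML representation: %s", e
            )
            return self._fallback_html()
```

A formatting failure in `_custom_html` is logged and the caller still gets the basic representation.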
```python
metadata.total_bytes_processed = bytes_processed
metadata.total_slot_ms = slot_millis
```
These assignments are redundant because `JobMetadata.from_job` already populates `total_bytes_processed` and `total_slot_ms` from the `query_job` object when the job is a `QueryJob`. The values from `get_performance_stats` are sourced from the same attributes on the job object. Removing these lines will make the code cleaner.
```python
if isinstance(event, bigframes.core.events.ExecutionFinished):
    if event.result and isinstance(event.result, LocalExecuteResult):
        self.execution_count += 1
        bytes_processed = event.result.total_bytes_processed or 0
```
The `execution_count` is being incremented for local Polars executions, but `bytes_processed` is not. For consistency with how other job types are handled in `count_job_stats`, `self.bytes_processed` should also be updated. This ensures that metrics like `session.bytes_processed_sum` are comprehensive. Note that the docstring for `bytes_processed_sum` might need to be updated in a separate change to reflect that it includes more than just BigQuery jobs.
Suggested change:

```diff
 bytes_processed = event.result.total_bytes_processed or 0
+self.bytes_processed += bytes_processed
```
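A minimal sketch of the aggregation being suggested; the counter names follow the snippet above, but the class and handler shape are assumptions for illustration:

```python
from dataclasses import dataclass


@dataclass
class MetricsSketch:
    # Mirrors the execution_count / bytes_processed counters discussed above.
    execution_count: int = 0
    bytes_processed: int = 0

    def record_local_result(self, total_bytes_processed) -> None:
        self.execution_count += 1
        # Aggregate local bytes too, so session-level totals such as
        # bytes_processed_sum stay comprehensive across job types.
        self.bytes_processed += total_bytes_processed or 0
```

With this change, a local execution reporting `None` bytes still increments the count but adds zero bytes.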
This PR promotes `execution_history()` to the top-level `bigframes` namespace and upgrades it to track rich metadata for every BigQuery job executed during your session.

Key User Benefits:

Easier Access: Call `bigframes.execution_history()` directly instead of digging into sub-namespaces.

Rich Metadata Tracking: Captures structured statistics for both Query Jobs and Load Jobs, including:
- `job_id` and a direct Google Cloud Console URL for easy debugging.
- Performance metrics: `total_bytes_processed`, `duration_seconds`, and `slot_millis`.
- Query details (a truncated preview of the SQL that was run).

Clean, Focused Logs: Automatically filters out internal library overhead (such as schema validations and index uniqueness checks) so your history shows only the data processing steps you actually care about.
Usage Example:
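A rough illustration of the intended call shape (this requires an authenticated BigQuery session; the exact output columns are assumptions based on the fields listed above):

```python
import bigframes
import bigframes.pandas as bpd

# Trigger a couple of jobs, then inspect what the session ran.
df = bpd.read_gbq("SELECT 1 AS x")
_ = df.to_pandas()

history = bigframes.execution_history()
# Expected to include columns such as job_id, total_bytes_processed,
# slot_millis, and a Cloud Console URL per job.
print(history)
```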
verified at:
More test cases and a notebook update will be checked in via separate PRs for easier review.
Fixes #<481840739> 🦕