[analytics engine] Add /_analytics/ppl/_explain endpoint with stage profiling #21660
❌ Gradle check result for ef94fb8: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
PR Code Analyzer ❗AI-powered 'Code-Diff-Analyzer' found issues on commit a50b194.
The diff analyzer is complaining about the REST endpoint that exists only in the test plugin.
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #21660 +/- ##
=========================================
Coverage 73.42% 73.42%
+ Complexity 75223 75182 -41
=========================================
Files 6023 6023
Lines 341475 341475
Branches 49141 49141
=========================================
+ Hits 250717 250728 +11
+ Misses 70841 70762 -79
- Partials 19917 19985 +68
AbstractStageExecution.fireOperationListeners was passing getClass().getSimpleName() as the stageType to onStageStart and friends. That value is the execution class name (e.g. "ShardFragmentStageExecution"), which is not a stable surface for an external rollup. The Stage's StageExecutionType enum (SHARD_FRAGMENT, COORDINATOR_REDUCE, LOCAL_PASSTHROUGH, LOCAL_COMPUTE) is the canonical name and is what PR opensearch-project#21660's StageProfile.execution_type already uses. Pass stage.getExecutionType().name() instead. The stats endpoint's stages_by_type buckets now key on the enum name, matching the explain API's field naming and surviving any future class-name refactor. Signed-off-by: Michael Oviedo <mikeovi@amazon.com>
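The difference between the two naming sources can be sketched as follows. The enum constants are the ones listed in the commit message; `SketchStage`, `ShardFragmentStageExecution` (as written here) and `StageTypeBuckets` are simplified, hypothetical stand-ins and not the engine's real classes.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Enum constants as listed in the commit message.
enum StageExecutionType { SHARD_FRAGMENT, COORDINATOR_REDUCE, LOCAL_PASSTHROUGH, LOCAL_COMPUTE }

// Hypothetical, simplified stand-in for the engine's Stage.
final class SketchStage {
    private final StageExecutionType executionType;

    SketchStage(StageExecutionType executionType) {
        this.executionType = executionType;
    }

    StageExecutionType getExecutionType() {
        return executionType;
    }
}

// Hypothetical stand-in for an execution class; only the naming logic is shown.
final class ShardFragmentStageExecution {
    private final SketchStage stage;

    ShardFragmentStageExecution(SketchStage stage) {
        this.stage = stage;
    }

    // Old behavior: the label follows the Java class name and changes on rename.
    String classBasedStageType() {
        return getClass().getSimpleName();
    }

    // New behavior: the label is the enum constant name.
    String enumBasedStageType() {
        return stage.getExecutionType().name();
    }
}

final class StageTypeBuckets {
    // Counts stages per enum name, in the spirit of a stages_by_type rollup.
    static Map<String, Integer> countByType(List<SketchStage> stages) {
        Map<String, Integer> buckets = new LinkedHashMap<>();
        for (SketchStage stage : stages) {
            buckets.merge(stage.getExecutionType().name(), 1, Integer::sum);
        }
        return buckets;
    }
}
```

Keying the buckets on the enum name means a rename of the execution class leaves existing dashboards and rollups untouched.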
Mirror the SQL plugin's /_explain URL-based routing pattern in the test-ppl-frontend harness. Add explain flag to PPLRequest and register the POST /_analytics/ppl/_explain route in RestPPLQueryAction. Signed-off-by: Finn Carroll <carrofin@amazon.com>
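A minimal sketch of URL-based routing for the explain flag, assuming the flag is derived purely from the request path. `PplRequestSketch` and `ExplainRouteSketch` are hypothetical names, and the real PPLRequest and RestPPLQueryAction may carry more fields and do more validation; only the `/_analytics/ppl/_explain` path comes from the PR.

```java
// Hypothetical, simplified request object: the query text plus the explain flag.
final class PplRequestSketch {
    final String query;
    final boolean explain;

    PplRequestSketch(String query, boolean explain) {
        this.query = query;
        this.explain = explain;
    }
}

final class ExplainRouteSketch {
    // Path taken from the PR title.
    static final String EXPLAIN_PATH = "/_analytics/ppl/_explain";

    // Returns true when the request path is the explain route (a trailing slash is tolerated).
    static boolean isExplainPath(String path) {
        String normalized = path.length() > 1 && path.endsWith("/")
            ? path.substring(0, path.length() - 1)
            : path;
        return EXPLAIN_PATH.equals(normalized);
    }

    // Builds the request; the flag comes from the URL, not from the body.
    static PplRequestSketch toRequest(String path, String query) {
        return new PplRequestSketch(query, isExplainPath(path));
    }
}
```

With this shape, the same handler can serve both routes, and downstream code only needs to look at `explain` to decide whether to attach a profile.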
Rebase of the explain API feature onto the new scheduler architecture (post PR opensearch-project#21699). Key changes from the previous version:
- Profile types (QueryProfile, StageProfile, TaskProfile, ProfiledResult) in analytics-api, with executeWithProfile on the QueryPlanExecutor interface
- QueryProfileBuilder rewritten to use StageExecution.tasks() directly (no TaskTracker needed, since tasks are accessible from the stage)
- Scheduler.execute() returns QueryExecution for post-execution inspection
- QueryExecution.getGraph() accessor added for profile snapshot
- DefaultPlanExecutor: unified executeInternal with profile boolean, captures planning_time_ms and execution_time_ms
- PPL frontend: /_analytics/ppl/_explain endpoint with full wiring
- Integration tests (ExplainApiIT)
Signed-off-by: Finn Carroll <carrofin@amazon.com>
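One way a builder could roll per-task data up into stage-level counters is sketched below. This is an assumption-heavy illustration: `TaskSnapshotSketch` and `StageRollupSketch` are hypothetical, the "FAILED" state name is a guess (only "FINISHED" appears in the PR's sample output), and the real QueryProfileBuilder may take stage timestamps from the stage itself rather than from its tasks.

```java
import java.util.List;

// Hypothetical snapshot of one task, using field names from the sample response.
final class TaskSnapshotSketch {
    final String node;
    final String state;
    final long startMs;
    final long endMs;

    TaskSnapshotSketch(String node, String state, long startMs, long endMs) {
        this.node = node;
        this.state = state;
        this.startMs = startMs;
        this.endMs = endMs;
    }

    long elapsedMs() {
        return endMs - startMs;
    }
}

// Hypothetical stage-level rollup: tasks_completed, tasks_failed, elapsed_ms.
final class StageRollupSketch {
    final int tasksCompleted;
    final int tasksFailed;
    final long elapsedMs;

    private StageRollupSketch(int tasksCompleted, int tasksFailed, long elapsedMs) {
        this.tasksCompleted = tasksCompleted;
        this.tasksFailed = tasksFailed;
        this.elapsedMs = elapsedMs;
    }

    static StageRollupSketch fromTasks(List<TaskSnapshotSketch> tasks) {
        int completed = 0;
        int failed = 0;
        long minStart = Long.MAX_VALUE;
        long maxEnd = Long.MIN_VALUE;
        for (TaskSnapshotSketch task : tasks) {
            if ("FINISHED".equals(task.state)) {
                completed++;
            } else if ("FAILED".equals(task.state)) { // assumed state name
                failed++;
            }
            minStart = Math.min(minStart, task.startMs);
            maxEnd = Math.max(maxEnd, task.endMs);
        }
        // Stage wall time spans the earliest task start to the latest task end.
        long elapsed = tasks.isEmpty() ? 0L : maxEnd - minStart;
        return new StageRollupSketch(completed, failed, elapsed);
    }
}
```

Reading tasks straight from the stage keeps the rollup a pure function of the stage's own state, which is the simplification the commit attributes to dropping TaskTracker.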
Fixes missingJavadoc check for the QueryProfileBuilder package. Signed-off-by: Finn Carroll <carrofin@amazon.com>
Remove the dual execute/executeWithProfile paths. Always call executeWithProfile and conditionally include the profile in the response based on the explain flag. Eliminates code duplication in the test frontend. Signed-off-by: Finn Carroll <carrofin@amazon.com>
Local IDE setting accidentally committed. Should not be in the repo. Signed-off-by: Finn Carroll <carrofin@amazon.com>
- Remove partition_id (only relevant for future shuffle exchanges)
- Remove start_ms/end_ms (redundant with elapsed_ms)
- describeTarget returns node ID for all task types instead of '(local)'
Task output now contains only: node, state, elapsed_ms.
Signed-off-by: Finn Carroll <carrofin@amazon.com>
The dsl-query-executor tests use QueryPlanExecutor as a lambda (single abstract method). Making executeWithProfile abstract broke this. Restore the default implementation that throws UnsupportedOperationException — real implementations (DefaultPlanExecutor) still override it, but test lambdas that only need execute() continue to work. Signed-off-by: Finn Carroll <carrofin@amazon.com>
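The lambda-compatibility point can be illustrated with a reduced interface. The signatures below are hypothetical and synchronous for brevity (the real QueryPlanExecutor is likely listener-based); what matters is that a `default` method does not count as an abstract method, so the interface stays usable as a lambda target.

```java
// Hypothetical, simplified version of the executor interface.
interface PlanExecutorSketch {
    // The single abstract method: this is what a lambda implements.
    String execute(String plan);

    // Default method: lambdas inherit it, real implementations override it.
    default String executeWithProfile(String plan) {
        throw new UnsupportedOperationException("profiling is not supported by this executor");
    }
}

// Stand-in for a real implementation that overrides the default.
final class ProfilingExecutorSketch implements PlanExecutorSketch {
    @Override
    public String execute(String plan) {
        return "rows(" + plan + ")";
    }

    @Override
    public String executeWithProfile(String plan) {
        return execute(plan) + "+profile";
    }
}

final class LambdaExecutorDemo {
    // A test-style executor written as a lambda; it only supplies execute().
    static PlanExecutorSketch lambdaExecutor() {
        return plan -> "rows(" + plan + ")";
    }
}
```

Had `executeWithProfile` been abstract, the lambda in `lambdaExecutor()` would no longer compile, which matches the breakage described in the commit message.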
executeWithProfile wraps failures in ProfiledResult, bypassing DefaultPlanExecutor's convertingListener which converts native exceptions (e.g. CircuitBreakingException). The non-profile path must use execute() directly so exception conversion works correctly. Signed-off-by: Finn Carroll <carrofin@amazon.com>
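The dispatch this commit describes might look roughly like the sketch below: the explain path calls the profiling variant, and every other request goes through `execute()` so that failures pass through exception conversion. All names and signatures are hypothetical and synchronous; `convert` merely stands in for whatever DefaultPlanExecutor's convertingListener does.

```java
// Hypothetical result wrapper: either rows plus a profile, or a captured failure.
final class ProfiledResultSketch {
    final String rows;
    final String profile;
    final Exception failure;

    ProfiledResultSketch(String rows, String profile, Exception failure) {
        this.rows = rows;
        this.profile = profile;
        this.failure = failure;
    }
}

final class ExplainDispatchSketch {
    // Stand-in for the conversion applied on the execute() path.
    static RuntimeException convert(Exception nativeFailure) {
        return new IllegalStateException("converted: " + nativeFailure.getMessage(), nativeFailure);
    }

    // Non-profile path: failures surface as converted exceptions.
    static String execute(String plan) {
        if (plan.isEmpty()) {
            throw convert(new Exception("native failure"));
        }
        return "rows";
    }

    // Profile path: failures are wrapped in the result and are NOT converted.
    static ProfiledResultSketch executeWithProfile(String plan) {
        if (plan.isEmpty()) {
            return new ProfiledResultSketch(null, null, new Exception("native failure"));
        }
        return new ProfiledResultSketch("rows", "profile", null);
    }

    // Only explain requests take the profiling path.
    static String handle(String plan, boolean explain) {
        if (!explain) {
            return execute(plan);
        }
        ProfiledResultSketch result = executeWithProfile(plan);
        if (result.failure != null) {
            return "error: " + result.failure.getMessage();
        }
        return result.rows + " + " + result.profile;
    }
}
```

Routing ordinary queries through `execute()` keeps their error behavior identical to what it was before the explain endpoint existed.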
Description
Adds an explain/profile API to the analytics engine, accessible via POST /_analytics/ppl/_explain. The endpoint executes the query and returns per-stage timing from the coordinator's perspective alongside normal results.
Follow-up work not included in this PR:
- RestUnifiedQueryAction.explain() to use executeWithProfile
Response format
Full response format with SQL plugin (SQL PR) with some test data.
{
  "columns": [ "avg(score)", "name" ],
  "rows": [
    [ 92.7, "eve" ],
    [ 91.0, "carol" ],
    [ 95.5, "alice" ],
    [ 88.1, "dave" ],
    [ 87.3, "bob" ]
  ],
  "profile": {
    "query_id": "a3009ddf-629b-45bc-ac61-5ab607d38cf5",
    "full_plan": [
      "OpenSearchProject(avg(score)=[$1], name=[$0], viableBackends=[[datafusion]])",
      " OpenSearchProject(name=[$0], avg(score)=[ANNOTATED_PROJECT_EXPR(id=2, backends=[datafusion], /($1, $2))], viableBackends=[[datafusion]])",
      " OpenSearchAggregate(group=[{0}], agg#0=[SUM(AGG_CALL_ANNOTATION(id=0, viableBackends=[datafusion]), $1)], agg#1=[COUNT(AGG_CALL_ANNOTATION(id=1, viableBackends=[datafusion]), $1)], mode=[FINAL], viableBackends=[[datafusion]])",
      " OpenSearchExchangeReducer(viableBackends=[[datafusion]], exchange=[ExchangeInfo[distributionType=SINGLETON, partitionKeyIndices=[]]])",
      " OpenSearchAggregate(group=[{0}], agg#0=[SUM(AGG_CALL_ANNOTATION(id=0, viableBackends=[datafusion]), $1)], agg#1=[COUNT(AGG_CALL_ANNOTATION(id=1, viableBackends=[datafusion]), $1)], mode=[PARTIAL], viableBackends=[[datafusion]])",
      " OpenSearchProject(name=[$1], score=[$2], viableBackends=[[datafusion]])",
      " OpenSearchTableScan(table=[[test_parquet]], viableBackends=[[datafusion]])"
    ],
    "planning_time_ms": 17,
    "execution_time_ms": 12,
    "stages": [
      {
        "stage_id": 0,
        "execution_type": "SHARD_FRAGMENT",
        "distribution": "SINGLETON",
        "state": "SUCCEEDED",
        "start_ms": 1779209122752,
        "end_ms": 1779209122763,
        "elapsed_ms": 11,
        "rows_processed": 5,
        "tasks_completed": 2,
        "tasks_failed": 0,
        "fragment": [
          "OpenSearchAggregate(group=[{0}], agg#0=[SUM(AGG_CALL_ANNOTATION(id=0, viableBackends=[datafusion]), $1)], agg#1=[COUNT(AGG_CALL_ANNOTATION(id=1, viableBackends=[datafusion]), $1)], mode=[PARTIAL], viableBackends=[[datafusion]])",
          " OpenSearchProject(name=[$1], score=[$2], viableBackends=[[datafusion]])",
          " OpenSearchTableScan(table=[[test_parquet]], viableBackends=[[datafusion]])"
        ],
        "tasks": [
          { "partition_id": 0, "node": "8L307qJ0RTmJLo6_vPhl4A/shard[0]", "state": "FINISHED", "start_ms": 1779209122752, "end_ms": 1779209122762, "elapsed_ms": 10 },
          { "partition_id": 1, "node": "8L307qJ0RTmJLo6_vPhl4A/shard[1]", "state": "FINISHED", "start_ms": 1779209122752, "end_ms": 1779209122763, "elapsed_ms": 11 }
        ]
      },
      {
        "stage_id": 1,
        "execution_type": "COORDINATOR_REDUCE",
        "state": "SUCCEEDED",
        "start_ms": 1779209122752,
        "end_ms": 1779209122764,
        "elapsed_ms": 12,
        "rows_processed": 0,
        "tasks_completed": 1,
        "tasks_failed": 0,
        "fragment": [
          "OpenSearchProject(avg(score)=[$1], name=[$0], viableBackends=[[datafusion]])",
          " OpenSearchProject(name=[$0], avg(score)=[ANNOTATED_PROJECT_EXPR(id=2, backends=[datafusion], /($1, $2))], viableBackends=[[datafusion]])",
          " OpenSearchAggregate(group=[{0}], agg#0=[SUM(AGG_CALL_ANNOTATION(id=0, viableBackends=[datafusion]), $1)], agg#1=[COUNT(AGG_CALL_ANNOTATION(id=1, viableBackends=[datafusion]), $1)], mode=[FINAL], viableBackends=[[datafusion]])",
          " OpenSearchExchangeReducer(viableBackends=[[datafusion]], exchange=[ExchangeInfo[distributionType=SINGLETON, partitionKeyIndices=[]]])",
          " OpenSearchStageInputScan(childStageId=[0], viableBackends=[[datafusion]])"
        ],
        "tasks": [
          { "partition_id": 0, "node": "(local)", "state": "FINISHED", "start_ms": 1779209122752, "end_ms": 1779209122764, "elapsed_ms": 12 }
        ]
      }
    ]
  }
}
Related Issues
N/A
Check List
- [ ] API changes companion pull request created, if applicable.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.