feat(cubestore): error when a query plan node materializes too many rows #10997
Open
waralexrom wants to merge 3 commits into master from cubestore-materialized-rows-limit
Commits:
- cc5ae4e feat(cubestore): error when a query plan node materializes too many rows (waralexrom)
- ed1529c feat(cubestore): report row counts in materialized rows limit error (waralexrom)
- b50385e feat(cubestore): default materialized rows limit to partition split t… (waralexrom)
201 changes: 201 additions & 0 deletions
rust/cubestore/cubestore/src/queryplanner/materialized_rows_limit.rs
@@ -0,0 +1,201 @@

```rust
use crate::CubeError;
use async_trait::async_trait;
use datafusion::arrow::datatypes::SchemaRef;
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::error::DataFusionError;
use datafusion::execution::TaskContext;
use datafusion::physical_plan::{
    DisplayAs, DisplayFormatType, ExecutionPlan, PlanProperties, RecordBatchStream,
    SendableRecordBatchStream,
};
use futures::stream::Stream;
use futures::StreamExt;
use std::any::Any;
use std::fmt::Formatter;
use std::pin::Pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll};

/// Errors out when the wrapped stream produces more than `limit` rows in total across all
/// partitions. Placed at points of the plan where rows get materialized in memory.
#[derive(Debug)]
pub struct MaterializedRowsLimitExec {
    pub input: Arc<dyn ExecutionPlan>,
    pub limit: usize,
    /// Human-readable description of the materialization point, used in the error message.
    pub stage: &'static str,
    /// Total across all partitions. Never reset: plans are built per query and executed once.
    rows: Arc<AtomicUsize>,
}

impl MaterializedRowsLimitExec {
    pub fn new(input: Arc<dyn ExecutionPlan>, limit: usize, stage: &'static str) -> Self {
        Self {
            input,
            limit,
            stage,
            rows: Arc::new(AtomicUsize::new(0)),
        }
    }
}

impl DisplayAs for MaterializedRowsLimitExec {
    fn fmt_as(&self, _t: DisplayFormatType, f: &mut Formatter) -> std::fmt::Result {
        write!(
            f,
            "MaterializedRowsLimitExec, limit: {}, stage: {}",
            self.limit, self.stage
        )
    }
}

#[async_trait]
impl ExecutionPlan for MaterializedRowsLimitExec {
    fn name(&self) -> &str {
        "MaterializedRowsLimitExec"
    }

    fn as_any(&self) -> &dyn Any {
        self
    }

    fn schema(&self) -> SchemaRef {
        self.input.schema()
    }

    fn properties(&self) -> &PlanProperties {
        self.input.properties()
    }

    fn children(&self) -> Vec<&Arc<dyn ExecutionPlan>> {
        vec![&self.input]
    }

    fn with_new_children(
        self: Arc<Self>,
        children: Vec<Arc<dyn ExecutionPlan>>,
    ) -> Result<Arc<dyn ExecutionPlan>, DataFusionError> {
        assert_eq!(children.len(), 1);
        Ok(Arc::new(MaterializedRowsLimitExec {
            input: children.into_iter().next().unwrap(),
            limit: self.limit,
            stage: self.stage,
            rows: self.rows.clone(),
        }))
    }

    fn execute(
        &self,
        partition: usize,
        context: Arc<TaskContext>,
    ) -> Result<SendableRecordBatchStream, DataFusionError> {
        if partition >= self.input.properties().partitioning.partition_count() {
            return Err(DataFusionError::Internal(format!(
                "MaterializedRowsLimitExec invalid partition {}",
                partition
            )));
        }

        let input = self.input.execute(partition, context)?;
        Ok(Box::pin(MaterializedRowsLimitStream {
            schema: self.schema(),
            limit: self.limit,
            stage: self.stage,
            rows: self.rows.clone(),
            input,
        }))
    }
}

struct MaterializedRowsLimitStream {
    schema: SchemaRef,
    limit: usize,
    stage: &'static str,
    rows: Arc<AtomicUsize>,
    input: SendableRecordBatchStream,
}

impl Stream for MaterializedRowsLimitStream {
    type Item = Result<RecordBatch, DataFusionError>;

    fn poll_next(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        self.input.poll_next_unpin(cx).map(|x| match x {
            Some(Ok(batch)) => {
                let total =
                    self.rows.fetch_add(batch.num_rows(), Ordering::Relaxed) + batch.num_rows();
                if total > self.limit {
                    Some(Err(CubeError::user(format!(
                        "Query execution stage '{}' materialized at least {} rows \
                         which exceeds the limit of {} rows. \
                         Consider creating a pre-aggregation that performs this stage \
                         ahead of time.",
                        self.stage, total, self.limit
                    ))
                    .into()))
                } else {
                    Some(Ok(batch))
                }
            }
            other => other,
        })
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        // same number of record batches
        self.input.size_hint()
    }
}

impl RecordBatchStream for MaterializedRowsLimitStream {
    fn schema(&self) -> SchemaRef {
        self.schema.clone()
    }
}

#[cfg(test)]
mod tests {
    use super::*;
    use datafusion::arrow::array::Int64Array;
    use datafusion::arrow::datatypes::{DataType, Field, Schema};
    use datafusion::physical_plan::collect;
    use datafusion_datasource::memory::MemorySourceConfig;

    fn batches(sizes: &[usize]) -> (SchemaRef, Vec<RecordBatch>) {
        let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int64, false)]));
        let batches = sizes
            .iter()
            .map(|size| {
                let array = Int64Array::from((0..*size as i64).collect::<Vec<_>>());
                RecordBatch::try_new(schema.clone(), vec![Arc::new(array)]).unwrap()
            })
            .collect();
        (schema, batches)
    }

    async fn run_with_limit(
        sizes: &[usize],
        limit: usize,
    ) -> Result<Vec<RecordBatch>, DataFusionError> {
        let (schema, batches) = batches(sizes);
        let input = MemorySourceConfig::try_new_exec(&[batches], schema, None).unwrap();
        let limited = Arc::new(MaterializedRowsLimitExec::new(input, limit, "test stage"));
        collect(limited, Arc::new(TaskContext::default())).await
    }

    #[tokio::test]
    async fn passes_under_limit() {
        let r = run_with_limit(&[3, 4], 7).await.unwrap();
        assert_eq!(r.iter().map(|b| b.num_rows()).sum::<usize>(), 7);
    }

    #[tokio::test]
    async fn errors_over_limit() {
        let err = run_with_limit(&[3, 4], 6).await.unwrap_err();
        let message = err.to_string();
        assert!(message.contains("'test stage'"), "{}", message);
        assert!(message.contains("at least 7 rows"), "{}", message);
        assert!(message.contains("limit of 6 rows"), "{}", message);
        assert!(message.contains("pre-aggregation"), "{}", message);
    }
}
```
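The counting logic in `poll_next` can be looked at in isolation. The following is a minimal sketch using only the standard library, not the PR's code: `RowBudget`, `record` and `any_partition_fails` are hypothetical names, threads stand in for partitions, and a plain `String` stands in for `CubeError`. It illustrates why the check adds the batch size back onto the result of `fetch_add` (which returns the value before the addition), and why a counter shared through `Arc` enforces one total across all partitions.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for the counter shared by all partition streams.
#[derive(Clone)]
pub struct RowBudget {
    limit: usize,
    rows: Arc<AtomicUsize>,
}

impl RowBudget {
    pub fn new(limit: usize) -> Self {
        Self {
            limit,
            rows: Arc::new(AtomicUsize::new(0)),
        }
    }

    /// Records a batch of `n` rows. Returns the running total, or an error
    /// once the total across all clones exceeds the limit.
    pub fn record(&self, n: usize) -> Result<usize, String> {
        // `fetch_add` returns the previous value, so add `n` to get the new total.
        let total = self.rows.fetch_add(n, Ordering::Relaxed) + n;
        if total > self.limit {
            Err(format!(
                "materialized at least {} rows which exceeds the limit of {} rows",
                total, self.limit
            ))
        } else {
            Ok(total)
        }
    }
}

/// Runs each partition's batch sizes through a clone of the budget on its own
/// thread and reports whether any partition hit the limit.
pub fn any_partition_fails(budget: &RowBudget, partitions: Vec<Vec<usize>>) -> bool {
    let handles: Vec<_> = partitions
        .into_iter()
        .map(|sizes| {
            let budget = budget.clone();
            // A partition stops at its first error, like a stream that yields `Err`.
            thread::spawn(move || sizes.into_iter().any(|n| budget.record(n).is_err()))
        })
        .collect();
    // Join every thread before combining the results.
    handles
        .into_iter()
        .map(|h| h.join().unwrap())
        .fold(false, |acc, failed| acc | failed)
}
```

Because the comparison is `total > limit`, a total exactly equal to the limit still passes, which matches the `passes_under_limit` test above (7 rows against a limit of 7).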
71 changes: 71 additions & 0 deletions
rust/cubestore/cubestore/src/queryplanner/optimizations/materialized_rows_limit.rs
@@ -0,0 +1,71 @@

```rust
use crate::queryplanner::materialized_rows_limit::MaterializedRowsLimitExec;
use crate::queryplanner::planning::WorkerExec;
use datafusion::error::DataFusionError;
use datafusion::physical_plan::aggregates::AggregateExec;
use datafusion::physical_plan::joins::{CrossJoinExec, HashJoinExec};
use datafusion::physical_plan::sorts::sort::SortExec;
use datafusion::physical_plan::windows::WindowAggExec;
use datafusion::physical_plan::{ExecutionPlan, InputOrderMode};
use std::sync::Arc;

/// Add `MaterializedRowsLimitExec` at the points of the plan where rows accumulate in memory:
/// sort and window inputs, join build sides, aggregation outputs and worker results. Streaming
/// nodes are left as is.
pub fn add_materialized_rows_limit_exec(
    p: Arc<dyn ExecutionPlan>,
    limit: usize,
) -> Result<Arc<dyn ExecutionPlan>, DataFusionError> {
    let p_any = p.as_any();
    if let Some(sort) = p_any.downcast_ref::<SortExec>() {
        // Sort with a fetch keeps only top `fetch` rows in memory, so it stays under the limit on
        // its own when `fetch <= limit`. Otherwise its buffer holds `min(input, fetch)` rows, so
        // counting the input errors exactly when the buffer outgrows the limit.
        if sort.fetch().map_or(true, |fetch| fetch > limit) {
            return wrap_children(&p, &[(0, "sort input")], limit);
        }
    } else if p_any.is::<HashJoinExec>() {
        // HashJoinExec always builds the hash table from its left input.
        return wrap_children(&p, &[(0, "hash join build side")], limit);
    } else if p_any.is::<CrossJoinExec>() {
        return wrap_children(&p, &[(0, "cross join left side")], limit);
    } else if p_any.is::<WindowAggExec>() {
        return wrap_children(&p, &[(0, "window input")], limit);
    } else if let Some(agg) = p_any.downcast_ref::<AggregateExec>() {
        // A sorted aggregation streams groups out instead of accumulating a hash table.
        if agg.input_order_mode() != &InputOrderMode::Sorted {
            return Ok(wrap(p, limit, "aggregation groups"));
        }
    } else if p_any.is::<WorkerExec>() {
        return wrap_children(&p, &[(0, "worker result")], limit);
    }
    Ok(p)
}

pub fn wrap(
    p: Arc<dyn ExecutionPlan>,
    limit: usize,
    stage: &'static str,
) -> Arc<dyn ExecutionPlan> {
    Arc::new(MaterializedRowsLimitExec::new(p, limit, stage))
}

fn wrap_children(
    p: &Arc<dyn ExecutionPlan>,
    wraps: &[(usize, &'static str)],
    limit: usize,
) -> Result<Arc<dyn ExecutionPlan>, DataFusionError> {
    let mut children: Vec<_> = p.children().into_iter().cloned().collect();
    let mut changed = false;
    for (i, stage) in wraps {
        // The child rows may already be counted by an adjacent limit node.
        if !children[*i].as_any().is::<MaterializedRowsLimitExec>() {
            children[*i] = wrap(children[*i].clone(), limit, stage);
            changed = true;
        }
    }
    if changed {
        p.clone().with_new_children(children)
    } else {
        Ok(p.clone())
    }
}
```
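Two decisions in this pass are easy to lose in the downcasting: when a sort needs a guard at all, and why `wrap_children` skips a child that is already a limit node. The sketch below models both on a toy plan tree, under the assumption that the pass may visit the same node more than once. `Plan`, `sort_needs_limit`, `wrap_once` and `add_limit` are hypothetical names for illustration only; the real code operates on `Arc<dyn ExecutionPlan>` and DataFusion's `SortExec`.

```rust
/// Toy plan node; a hypothetical stand-in for `Arc<dyn ExecutionPlan>`.
#[derive(Debug, Clone, PartialEq)]
pub enum Plan {
    Scan,
    Sort { fetch: Option<usize>, input: Box<Plan> },
    RowsLimit { stage: &'static str, input: Box<Plan> },
}

/// A sort needs a guarded input when it has no fetch, or when its fetch is
/// larger than the limit (a top-N buffer of `fetch <= limit` rows is already bounded).
pub fn sort_needs_limit(fetch: Option<usize>, limit: usize) -> bool {
    fetch.map_or(true, |fetch| fetch > limit)
}

/// Wraps `child` in a limit node unless it already is one, so repeated
/// applications do not stack guards on the same input.
pub fn wrap_once(child: Plan, stage: &'static str) -> Plan {
    match child {
        already @ Plan::RowsLimit { .. } => already,
        other => Plan::RowsLimit {
            stage,
            input: Box::new(other),
        },
    }
}

/// Single-node version of the pass: guards the input of an unbounded sort
/// and leaves every other node untouched.
pub fn add_limit(plan: Plan, limit: usize) -> Plan {
    match plan {
        Plan::Sort { fetch, input } if sort_needs_limit(fetch, limit) => Plan::Sort {
            fetch,
            input: Box::new(wrap_once(*input, "sort input")),
        },
        other => other,
    }
}
```

Applying `add_limit` twice to the same sort yields the same tree as applying it once, which is the property the `is::<MaterializedRowsLimitExec>()` check in `wrap_children` provides.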