Commit 69bc602

Feat!: Add the ability to control concurrency between model batches during the evaluation (#2450)

1 parent deb8e9a commit 69bc602

File tree

13 files changed: +205 −42 lines changed

docs/concepts/models/overview.md

Lines changed: 3 additions & 0 deletions

@@ -250,6 +250,9 @@ For models that are incremental, the following parameters can be specified in th
 ### batch_size
 - Batch size is used to optimize backfilling incremental data. It determines the maximum number of intervals to run in a single job. For example, if a model specifies a cron of `@hourly` and a batch_size of `12`, when backfilling 3 days of data, the scheduler will spawn 6 jobs. (3 days * 24 hours/day = 72 hour intervals to fill. 72 intervals / 12 intervals per job = 6 jobs.)
 
+### batch_concurrency
+- The maximum number of [batches](#batch_size) that can run concurrently for this model. If not specified, the concurrency is only constrained by the number of concurrent tasks set in the connection settings.
+
 ### forward_only
 - Set this to true to indicate that all changes to this model should be [forward-only](../plans.md#forward-only-plans).
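The batch arithmetic above, combined with the new `batch_concurrency` cap, can be sketched as a toy planner. This is a hypothetical helper for illustration only, not part of SQLMesh:

```python
import math
from typing import Optional, Tuple


def plan_backfill(
    total_intervals: int, batch_size: int, batch_concurrency: Optional[int]
) -> Tuple[int, int]:
    """Return (number of jobs, max jobs in flight) for a backfill.

    batch_concurrency=None mirrors the default: concurrency is bounded only
    by the connection's concurrent-task limit, so every job may run at once.
    """
    jobs = math.ceil(total_intervals / batch_size)
    in_flight = jobs if batch_concurrency is None else min(batch_concurrency, jobs)
    return jobs, in_flight


# The docs' example: @hourly cron, 3 days of data, batch_size 12.
print(plan_backfill(3 * 24, 12, None))  # (6, 6): 6 jobs, all may run at once
print(plan_backfill(3 * 24, 12, 2))     # (6, 2): 6 jobs, at most 2 concurrent
```

With `batch_concurrency` left unset, behavior matches the pre-change scheduler; setting it only adds an upper bound per model.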

docs/reference/model_configuration.md

Lines changed: 6 additions & 7 deletions

@@ -21,7 +21,6 @@ Configuration options for SQLMesh model properties. Supported by all model kinds
 | `interval_unit` | The temporal granularity of the model's data intervals. Supported values: `year`, `month`, `day`, `hour`, `half_hour`, `quarter_hour`, `five_minute`. (Default: inferred from `cron`) | str | N |
 | `start` | The date/time that determines the earliest date interval that should be processed by a model. Can be a datetime string, epoch time in milliseconds, or a relative datetime such as `1 year ago`. | str \| int | N |
 | `end` | The date/time that determines the latest date interval that should be processed by a model. Can be a datetime string, epoch time in milliseconds, or a relative datetime such as `1 year ago`. | str \| int | N |
-| `batch_size` | The maximum number of intervals that can be evaluated in a single backfill task. If this is `None`, all intervals will be processed as part of a single task. If this is set, a model's backfill will be chunked such that each individual task only contains jobs with the maximum of `batch_size` intervals. (Default: `None`) | int | N |
 | `grains` | The column(s) whose combination uniquely identifies each row in the model | str \| array[str] | N |
 | `references` | The model column(s) used to join to other models' grains | str \| array[str] | N |
 | `depends_on` | Models on which this model depends. (Default: dependencies inferred from model code) | array[str] | N |
@@ -45,7 +44,6 @@ The SQLMesh project-level `model_defaults` key supports the following options, d
 - owner
 - start
 - end
-- batch_size
 - storage_format
 
 ## Model kind properties
@@ -74,10 +72,11 @@ Python model configuration object: [FullKind()](https://sqlmesh.readthedocs.io/e
 
 Configuration options for all incremental models (in addition to [general model properties](#general-model-properties)).
 
-| Option | Description | Type | Required |
-| ------------ | ----------- | :--: | :------: |
-| `batch_size` | The maximum number of intervals that can be evaluated in a single backfill task. If this is `None`, all intervals will be processed as part of a single task. If this is set, a model's backfill will be chunked such that each individual task only contains jobs with the maximum of `batch_size` intervals. (Default: `None`) | int | N |
-| `lookback` | The number of time unit intervals prior to the current interval that should be processed. (Default: `0`) | int | N |
+| Option | Description | Type | Required |
+|---------------------|-------------|:----:|:--------:|
+| `batch_size` | The maximum number of intervals that can be evaluated in a single backfill task. If this is `None`, all intervals will be processed as part of a single task. If this is set, a model's backfill will be chunked such that each individual task only contains jobs with the maximum of `batch_size` intervals. (Default: `None`) | int | N |
+| `batch_concurrency` | The maximum number of batches that can run concurrently for this model (Default: the number of concurrent tasks set in the connection settings). | int | N |
+| `lookback` | The number of time unit intervals prior to the current interval that should be processed. (Default: `0`) | int | N |
 
 #### Incremental by time range
 
@@ -172,4 +171,4 @@ Options specified within the `kind` property's `csv_settings` property (override
 | `lineterminator` | Character used to denote a line break. More information at the [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html). | str | N |
 | `encoding` | Encoding to use for UTF when reading/writing (ex. 'utf-8'). More information at the [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html). | str | N |
 
-Python model configuration object: [SeedKind()](https://sqlmesh.readthedocs.io/en/stable/_readthedocs/html/sqlmesh/core/model/kind.html#SeedKind)
+Python model configuration object: [SeedKind()](https://sqlmesh.readthedocs.io/en/stable/_readthedocs/html/sqlmesh/core/model/kind.html#SeedKind)

sqlmesh/core/config/model.py

Lines changed: 0 additions & 4 deletions

@@ -19,9 +19,6 @@ class ModelDefaultsConfig(BaseConfig):
         start: The earliest date that the model will be backfilled for. If this is None,
             then the date is inferred by taking the most recent start date of its ancestors.
             The start date can be a static datetime or a relative datetime like "1 year ago"
-        batch_size: The maximum number of intervals that can be run per backfill job. If this is None,
-            then backfilling this model will do all of history in one job. If this is set, a model's backfill
-            will be chunked such that each individual job will only contain jobs with max `batch_size` intervals.
         storage_format: The storage format used to store the physical table, only applicable in certain engines.
             (eg. 'parquet')
     """
@@ -31,7 +28,6 @@ class ModelDefaultsConfig(BaseConfig):
     cron: t.Optional[str] = None
     owner: t.Optional[str] = None
     start: t.Optional[TimeLike] = None
-    batch_size: t.Optional[int] = None
     storage_format: t.Optional[str] = None
 
     _model_kind_validator = model_kind_validator

sqlmesh/core/model/definition.py

Lines changed: 1 addition & 5 deletions

@@ -73,7 +73,6 @@ class _Model(ModelMeta, frozen=True):
            name sushi.order_items,
            owner jen,
            cron '@daily',
-           batch_size 30,
            start '2020-01-01',
            partitioned_by ds
        );
@@ -101,9 +100,6 @@ class _Model(ModelMeta, frozen=True):
            The start date can be a static datetime or a relative datetime like "1 year ago"
        end: The date that the model will be backfilled up until. Follows the same syntax as 'start',
            should be omitted if there is no end date.
-       batch_size: The maximum number of incremental intervals that can be run per backfill job. If this is None,
-           then backfilling this model will do all of history in one job. If this is set, a model's backfill
-           will be chunked such that each individual job will only contain jobs with max `batch_size` intervals.
        lookback: The number of previous incremental intervals in the lookback window.
        storage_format: The storage format used to store the physical table, only applicable in certain engines.
            (eg. 'parquet')
@@ -750,6 +746,7 @@ def metadata_hash(self, audits: t.Dict[str, ModelAudit]) -> str:
             str(self.end) if self.end else None,
             str(self.retention) if self.retention else None,
             str(self.batch_size) if self.batch_size is not None else None,
+            str(self.batch_concurrency) if self.batch_concurrency is not None else None,
             json.dumps(self.mapping_schema, sort_keys=True),
             *sorted(self.tags),
             *sorted(ref.json(sort_keys=True) for ref in self.all_references),
@@ -2024,7 +2021,6 @@ def _refs_to_sql(values: t.Any) -> exp.Expression:
 META_FIELD_CONVERTER: t.Dict[str, t.Callable] = {
     "start": lambda value: exp.Literal.string(value),
     "cron": lambda value: exp.Literal.string(value),
-    "batch_size": lambda value: exp.Literal.number(value),
     "partitioned_by_": _single_expr_or_tuple,
     "clustered_by": _single_value_or_tuple,
     "depends_on_": lambda value: exp.Tuple(expressions=sorted(value)),

sqlmesh/core/model/kind.py

Lines changed: 2 additions & 0 deletions

@@ -254,6 +254,7 @@ def to_property(self, dialect: str = "") -> exp.Property:
 class _Incremental(_ModelKind):
     dialect: str = ""
     batch_size: t.Optional[SQLGlotPositiveInt] = None
+    batch_concurrency: t.Optional[SQLGlotPositiveInt] = None
     lookback: t.Optional[SQLGlotPositiveInt] = None
     forward_only: SQLGlotBool = False
     disable_restatement: SQLGlotBool = False
@@ -303,6 +304,7 @@ class IncrementalByUniqueKeyKind(_Incremental):
     name: Literal[ModelKindName.INCREMENTAL_BY_UNIQUE_KEY] = ModelKindName.INCREMENTAL_BY_UNIQUE_KEY
     unique_key: SQLGlotListOfFields
     when_matched: t.Optional[exp.When] = None
+    batch_concurrency: Literal[1] = 1
 
     @field_validator("when_matched", mode="before")
     @field_validator_v1_args
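The `batch_concurrency: Literal[1] = 1` annotation hard-pins the value for unique-key models: every batch merges into the same rows keyed by the unique key, so batches must never overlap. A rough sketch of the same pinning idea using plain dataclasses (SQLMesh itself relies on pydantic validation, so the class names and check below are illustrative only):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IncrementalKind:
    # None means no model-level cap; the engine's task limit still applies.
    batch_concurrency: Optional[int] = None


@dataclass
class UniqueKeyKind(IncrementalKind):
    # Concurrent merges on the same unique key would race, so pin to 1.
    batch_concurrency: int = 1

    def __post_init__(self) -> None:
        if self.batch_concurrency != 1:
            raise ValueError("unique-key models must run batches serially")


print(UniqueKeyKind().batch_concurrency)  # 1
```

The effect matches the diff: other incremental kinds accept any positive value (or none), while the unique-key kind rejects anything but 1.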

sqlmesh/core/model/meta.py

Lines changed: 5 additions & 0 deletions

@@ -315,6 +315,11 @@ def batch_size(self) -> t.Optional[int]:
         """The maximal number of units in a single task for a backfill."""
         return getattr(self.kind, "batch_size", None)
 
+    @property
+    def batch_concurrency(self) -> t.Optional[int]:
+        """The maximal number of batches that can run concurrently for a backfill."""
+        return getattr(self.kind, "batch_concurrency", None)
+
     @cached_property
     def table_properties(self) -> t.Dict[str, exp.Expression]:
         """A dictionary of table properties."""

sqlmesh/core/node.py

Lines changed: 5 additions & 0 deletions

@@ -270,6 +270,11 @@ def batch_size(self) -> t.Optional[int]:
         """The maximal number of units in a single task for a backfill."""
         return None
 
+    @property
+    def batch_concurrency(self) -> t.Optional[int]:
+        """The maximal number of batches that can run concurrently for a backfill."""
+        return None
+
     @property
     def data_hash(self) -> str:
         """

sqlmesh/core/scheduler.py

Lines changed: 15 additions & 2 deletions

@@ -361,15 +361,28 @@ def _dag(self, batches: SnapshotToBatches) -> DAG[SchedulingUnit]:
                 for i, interval in enumerate(p_intervals):
                     upstream_dependencies.append((p_sid.name, (interval, i)))
 
+            batch_concurrency = snapshot.node.batch_concurrency
+            if snapshot.depends_on_past:
+                batch_concurrency = 1
+
             for i, interval in enumerate(intervals):
                 node = (snapshot.name, (interval, i))
                 dag.add(node, upstream_dependencies)
 
                 if len(intervals) > 1:
                     dag.add((snapshot.name, terminal_node), [node])
 
-                if snapshot.depends_on_past and i > 0:
-                    dag.add(node, [(snapshot.name, (intervals[i - 1], i - 1))])
+                if batch_concurrency and i >= batch_concurrency:
+                    batch_idx_to_wait_for = i - batch_concurrency
+                    dag.add(
+                        node,
+                        [
+                            (
+                                snapshot.name,
+                                (intervals[batch_idx_to_wait_for], batch_idx_to_wait_for),
+                            )
+                        ],
+                    )
         return dag
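The scheduler change above replaces the old `depends_on_past` rule (batch `i` waits for batch `i - 1`) with a sliding window: batch `i` waits for batch `i - batch_concurrency`, so at most `batch_concurrency` batches are runnable at once, and a window of 1 reproduces the old serial behavior. A standalone sketch of that edge rule, with a plain dict standing in for SQLMesh's `DAG` class:

```python
from typing import Dict, List, Optional


def batch_dependencies(
    num_batches: int, batch_concurrency: Optional[int]
) -> Dict[int, List[int]]:
    """Map each batch index to the earlier batch it must wait for."""
    deps: Dict[int, List[int]] = {}
    for i in range(num_batches):
        if batch_concurrency and i >= batch_concurrency:
            # Wait for the batch that just left the concurrency window.
            deps[i] = [i - batch_concurrency]
        else:
            # No cap, or still inside the initial window: start immediately.
            deps[i] = []
    return deps


print(batch_dependencies(6, 2))  # {0: [], 1: [], 2: [0], 3: [1], 4: [2], 5: [3]}
print(batch_dependencies(4, 1))  # {0: [], 1: [0], 2: [1], 3: [2]}: serial chain
```

Passing `None` yields no intra-model edges at all, matching the documented default where only the connection's task limit constrains concurrency.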
Lines changed: 8 additions & 0 deletions

@@ -0,0 +1,8 @@
+"""Add the batch_concurrency attribute to the incremental model kinds.
+
+This results in a change to the metadata hash.
+"""
+
+
+def migrate(state_sync, **kwargs):  # type: ignore
+    pass

sqlmesh/schedulers/airflow/dag_generator.py

Lines changed: 18 additions & 20 deletions

@@ -6,7 +6,7 @@
 
 import pendulum
 from airflow import DAG
-from airflow.models import BaseOperator, baseoperator
+from airflow.models import BaseOperator
 from airflow.operators.python import PythonOperator
 from airflow.sensors.base import BaseSensorOperator
 
@@ -437,7 +437,7 @@ def _create_backfill_tasks(
         snapshot = snapshots[sid]
         sanitized_model_name = sanitize_name(snapshot.node.name)
 
-        snapshot_intervals_chain: t.List[t.Union[BaseOperator, t.List[BaseOperator]]] = []
+        snapshot_task_pairs: t.List[t.Tuple[BaseOperator, BaseOperator]] = []
 
         snapshot_start_task = EmptyOperator(
             task_id=f"snapshot_backfill__{sanitized_model_name}__{snapshot.identifier}__start"
@@ -457,32 +457,30 @@ def _create_backfill_tasks(
                 deployability_index=deployability_index,
                 plan_id=plan_id,
             )
-
             external_sensor_task = self._create_hwm_external_sensor(
                 snapshot, start=start, end=end
             )
             if external_sensor_task:
-                if snapshot.depends_on_past:
-                    snapshot_intervals_chain.extend([external_sensor_task, evaluation_task])
-                else:
-                    (
-                        snapshot_start_task
-                        >> external_sensor_task
-                        >> evaluation_task
-                        >> snapshot_end_task
-                    )
+                (
+                    snapshot_start_task
+                    >> external_sensor_task
+                    >> evaluation_task
+                    >> snapshot_end_task
+                )
+                snapshot_task_pairs.append((external_sensor_task, evaluation_task))
             else:
-                if snapshot.depends_on_past:
-                    snapshot_intervals_chain.append(evaluation_task)
-                else:
-                    snapshot_start_task >> evaluation_task >> snapshot_end_task
+                snapshot_start_task >> evaluation_task >> snapshot_end_task
+                snapshot_task_pairs.append((evaluation_task, evaluation_task))
 
+        batch_concurrency = snapshot.node.batch_concurrency
         if snapshot.depends_on_past:
-            baseoperator.chain(
-                snapshot_start_task, *snapshot_intervals_chain, snapshot_end_task
-            )
-        elif not intervals_per_snapshot.intervals:
+            batch_concurrency = 1
+
+        if not intervals_per_snapshot.intervals:
             snapshot_start_task >> snapshot_end_task
+        elif batch_concurrency:
+            for i in range(batch_concurrency, len(snapshot_task_pairs)):
+                snapshot_task_pairs[i - batch_concurrency][1] >> snapshot_task_pairs[i][0]
 
         snapshot_to_tasks[snapshot.snapshot_id] = (
            snapshot_start_task,
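In the Airflow generator, the same window is wired with `>>` edges between per-batch task pairs: each pair holds the first and last operator of a batch (the sensor and the evaluation task, or the evaluation task twice when there is no sensor), and the loop links the end of pair `i - batch_concurrency` to the start of pair `i`. A minimal sketch with stub operators (the `Task` class below is a stand-in, not a real Airflow class):

```python
class Task:
    """Stand-in for an Airflow operator; records downstream edges."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.downstream: list = []

    def __rshift__(self, other: "Task") -> "Task":
        # Mimic Airflow's `a >> b` dependency syntax.
        self.downstream.append(other.name)
        return other


def chain_batches(pairs: list, batch_concurrency: int) -> None:
    """Link the end of batch i - batch_concurrency to the start of batch i."""
    for i in range(batch_concurrency, len(pairs)):
        pairs[i - batch_concurrency][1] >> pairs[i][0]


# Three batches, each with a sensor + evaluation task, window of 1 (serial).
pairs = []
for i in range(3):
    sensor, evaluate = Task(f"sensor_{i}"), Task(f"eval_{i}")
    sensor >> evaluate
    pairs.append((sensor, evaluate))
chain_batches(pairs, 1)
print(pairs[0][1].downstream)  # ['sensor_1']: eval_0 gates batch 1
```

Using pairs rather than a flat chain is what lets the generator drop the old `baseoperator.chain` call: `depends_on_past` simply becomes a window of 1.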
