[FEATURE] Add Union command in PPL

 ## Proposal — Command Syntax, Options, and Precise Semantics

  ### Command

  ```ppl
  union <dataset> [, <dataset>...]

  Arguments

  dataset (required, repeatable)

  The dataset(s) whose results will be merged. Each dataset can be one of the following OpenSearch-supported sources:

  1. Index / index pattern / alias
    - Examples: logs-*, my-index, security-events-alias
  2. Subsearch result set
    - Example:
    [search index=a | where status=200 | fields user action]

  Multiple datasets must be separated by commas.

  Syntax examples:
  | union dataset_a, dataset_b, dataset_c
  | union [search index=a], [search index=b]
  | union logs-*, security-events-alias

  Out of scope
  - Saved searches as datasets
  - Semantic data models
  - Lookup tables as first-class dataset sources

  Constraints

  - At least one dataset must be specified
  - Datasets must be resolvable at analysis time

  Position in pipeline

  As first command:
  | union dataset_a, dataset_b
  All datasets must be explicit arguments.

  Mid-pipeline:
  search index=a | where status=200 | union dataset_b
  The upstream result set is implicitly included as the first dataset.

  This is equivalent to:
  | union [search index=a | where status=200], dataset_b

  ---
  Precise Semantics

  Dataset Processing

  Input
  - First command: only the explicitly specified datasets
  - Mid-pipeline: upstream result set (implicit) + explicit datasets

  Output
  - A single merged result set containing all rows from all inputs

  Schema Merging

  Output schema is the union of all field names across all datasets

  For each output row:
  - Fields present in the originating dataset retain their values
  - Fields absent in the originating dataset are set to null

  Type resolution
  - If types are compatible, coerce to a common supertype (e.g., int + float → float)
  - Otherwise, coerce to string

  Ordering Semantics

  - No implicit ordering guarantees
  - The output order is determined by dataset evaluation order
  - Deterministic ordering requires an explicit sort command

  Value Preservation

  UNION ALL semantics
  - No deduplication
  - All rows from all inputs are preserved exactly once
  - Output cardinality = sum of input cardinalities

  Error Handling

  The command fails if:
  - No datasets are provided
  - Any dataset cannot be resolved or accessed
  - Schema resolution fails due to unrecoverable type conflicts

  ---
  Examples

  1. Basic union (same schema)

  Input

  Dataset A (upstream):
  ┌──────┬────────┐
  │ user │ action │
  ├──────┼────────┤
  │ joe  │ login  │
  ├──────┼────────┤
  │ sam  │ logout │
  └──────┴────────┘
  Dataset B:
  ┌───────┬────────┐
  │ user  │ action │
  ├───────┼────────┤
  │ alice │ login  │
  ├───────┼────────┤
  │ bob   │ logout │
  └───────┴────────┘
  Query
  ... | union dataset_b

  Output
  ┌───────┬────────┐
  │ user  │ action │
  ├───────┼────────┤
  │ joe   │ login  │
  ├───────┼────────┤
  │ sam   │ logout │
  ├───────┼────────┤
  │ alice │ login  │
  ├───────┼────────┤
  │ bob   │ logout │
  └───────┴────────┘
  ---
  2. Union as first command

  Input

  Index A:
  ┌──────┬────────┐
  │ user │ action │
  ├──────┼────────┤
  │ joe  │ login  │
  ├──────┼────────┤
  │ sam  │ logout │
  └──────┴────────┘
  Index B:
  ┌───────┬────────┐
  │ user  │ action │
  ├───────┼────────┤
  │ alice │ login  │
  ├───────┼────────┤
  │ bob   │ logout │
  └───────┴────────┘
  Query
  | union [search index=a | fields user, action], [search index=b | fields user, action]

  Output
  ┌───────┬────────┐
  │ user  │ action │
  ├───────┼────────┤
  │ joe   │ login  │
  ├───────┼────────┤
  │ sam   │ logout │
  ├───────┼────────┤
  │ alice │ login  │
  ├───────┼────────┤
  │ bob   │ logout │
  └───────┴────────┘
  ---
  3. Non-overlapping schemas

  Input

  Dataset A (upstream):
  ┌──────┬────────┐
  │ user │ action │
  ├──────┼────────┤
  │ joe  │ login  │
  ├──────┼────────┤
  │ sam  │ logout │
  └──────┴────────┘
  Dataset B:
  ┌──────┬────────┐
  │ host │ status │
  ├──────┼────────┤
  │ web1 │ 200    │
  ├──────┼────────┤
  │ web2 │ 404    │
  └──────┴────────┘
  Query
  ... | union dataset_b

  Output
  ┌──────┬────────┬──────┬────────┐
  │ user │ action │ host │ status │
  ├──────┼────────┼──────┼────────┤
  │ joe  │ login  │ null │ null   │
  ├──────┼────────┼──────┼────────┤
  │ sam  │ logout │ null │ null   │
  ├──────┼────────┼──────┼────────┤
  │ null │ null   │ web1 │ 200    │
  ├──────┼────────┼──────┼────────┤
  │ null │ null   │ web2 │ 404    │
  └──────┴────────┴──────┴────────┘
  ---
  4. Partial schema overlap

  Input

  Dataset A (upstream):
  ┌──────┬────────┬───────────┐
  │ user │ action │ timestamp │
  ├──────┼────────┼───────────┤
  │ joe  │ login  │ 100       │
  └──────┴────────┴───────────┘
  Dataset B:
  ┌──────┬──────┬────────┐
  │ user │ host │ status │
  ├──────┼──────┼────────┤
  │ joe  │ web1 │ 200    │
  └──────┴──────┴────────┘
  Query
  ... | union dataset_b

  Output
  ┌──────┬────────┬───────────┬──────┬────────┐
  │ user │ action │ timestamp │ host │ status │
  ├──────┼────────┼───────────┼──────┼────────┤
  │ joe  │ login  │ 100       │ null │ null   │
  ├──────┼────────┼───────────┼──────┼────────┤
  │ joe  │ null   │ null      │ web1 │ 200    │
  └──────┴────────┴───────────┴──────┴────────┘
  ---
  5. Subsearch as dataset

  Input

  Dataset A (upstream):
  ┌──────┬────────┐
  │ user │ action │
  ├──────┼────────┤
  │ joe  │ login  │
  ├──────┼────────┤
  │ sam  │ logout │
  └──────┴────────┘
  Index B:
  ┌───────┬────────┬────────┐
  │ user  │ action │ status │
  ├───────┼────────┼────────┤
  │ alice │ login  │ 200    │
  ├───────┼────────┼────────┤
  │ bob   │ logout │ 404    │
  └───────┴────────┴────────┘
  Query
  ... | union [search index=b | where status=200 | fields user, action]

  Output
  ┌───────┬────────┐
  │ user  │ action │
  ├───────┼────────┤
  │ joe   │ login  │
  ├───────┼────────┤
  │ sam   │ logout │
  ├───────┼────────┤
  │ alice │ login  │
  └───────┴────────┘
  ---
  6. Multiple datasets

  Input

  Dataset A (upstream):
  ┌──────┬────────┐
  │ user │ action │
  ├──────┼────────┤
  │ joe  │ login  │
  └──────┴────────┘
  Dataset B:
  ┌───────┬────────┐
  │ user  │ action │
  ├───────┼────────┤
  │ alice │ logout │
  └───────┴────────┘
  Dataset C:
  ┌──────┬────────┐
  │ user │ action │
  ├──────┼────────┤
  │ bob  │ login  │
  └──────┴────────┘
  Query
  ... | union dataset_b, dataset_c

  Output
  ┌───────┬────────┐
  │ user  │ action │
  ├───────┼────────┤
  │ joe   │ login  │
  ├───────┼────────┤
  │ alice │ logout │
  ├───────┼────────┤
  │ bob   │ login  │
  └───────┴────────┘
  ---
  7. Duplicate rows preserved

  Input

  Dataset A (upstream):
  ┌──────┬────────┐
  │ user │ action │
  ├──────┼────────┤
  │ joe  │ login  │
  ├──────┼────────┤
  │ sam  │ logout │
  └──────┴────────┘
  Dataset B:
  ┌──────┬────────┐
  │ user │ action │
  ├──────┼────────┤
  │ joe  │ login  │
  ├──────┼────────┤
  │ bob  │ logout │
  └──────┴────────┘
  Query
  ... | union dataset_b

  Output
  ┌──────┬────────┐
  │ user │ action │
  ├──────┼────────┤
  │ joe  │ login  │
  ├──────┼────────┤
  │ sam  │ logout │
  ├──────┼────────┤
  │ joe  │ login  │
  ├──────┼────────┤
  │ bob  │ logout │
  └──────┴────────┘
  Note: Duplicates remain; no deduplication is performed (row joe | login appears twice).

  ---
  8. Empty dataset

  Input

  Dataset A (upstream):
  ┌──────┬────────┐
  │ user │ action │
  ├──────┼────────┤
  │ joe  │ login  │
  ├──────┼────────┤
  │ sam  │ logout │
  └──────┴────────┘
  Dataset B (empty):
  ┌──────┬────────┐
  │ user │ action │
  ├──────┼────────┤
  └──────┴────────┘
  Query
  ... | union dataset_b

  Output
  ┌──────┬────────┐
  │ user │ action │
  ├──────┼────────┤
  │ joe  │ login  │
  ├──────┼────────┤
  │ sam  │ logout │
  └──────┴────────┘
  Note: If a dataset returns zero rows, it contributes nothing to the output.

  ---
  9. Mid-pipeline union behavior

  Input

  Index A:
  ┌──────┬────────┬────────┐
  │ user │ action │ status │
  ├──────┼────────┼────────┤
  │ joe  │ login  │ 200    │
  ├──────┼────────┼────────┤
  │ sam  │ logout │ 200    │
  ├──────┼────────┼────────┤
  │ bob  │ login  │ 404    │
  └──────┴────────┴────────┘
  Index B:
  ┌───────┬────────┐
  │ user  │ action │
  ├───────┼────────┤
  │ alice │ login  │
  ├───────┼────────┤
  │ carol │ logout │
  └───────┴────────┘
  Query
  search index=a | where status=200 | union [search index=b]

  Explanation: The filtered result from index A (rows with status=200) is implicitly the first dataset, then unioned with all rows from index B.

  Output
  ┌───────┬────────┬────────┐
  │ user  │ action │ status │
  ├───────┼────────┼────────┤
  │ joe   │ login  │ 200    │
  ├───────┼────────┼────────┤
  │ sam   │ logout │ 200    │
  ├───────┼────────┼────────┤
  │ alice │ login  │ null   │
  ├───────┼────────┼────────┤
  │ carol │ logout │ null   │
  └───────┴────────┴────────┘
  ---
  10. Union with explicit ordering

  Input

  Dataset A:
  ┌──────┬───────────┐
  │ user │ timestamp │
  ├──────┼───────────┤
  │ joe  │ 200       │
  ├──────┼───────────┤
  │ sam  │ 100       │
  └──────┴───────────┘
  Dataset B:
  ┌───────┬───────────┐
  │ user  │ timestamp │
  ├───────┼───────────┤
  │ alice │ 150       │
  ├───────┼───────────┤
  │ bob   │ 250       │
  └───────┴───────────┘
  Query
  | union dataset_a, dataset_b | sort timestamp desc

  Output
  ┌───────┬───────────┐
  │ user  │ timestamp │
  ├───────┼───────────┤
  │ bob   │ 250       │
  ├───────┼───────────┤
  │ joe   │ 200       │
  ├───────┼───────────┤
  │ alice │ 150       │
  ├───────┼───────────┤
  │ sam   │ 100       │
  └───────┴───────────┘
  Note: Without explicit sort, output order is not guaranteed. The sort command provides deterministic ordering.

  ---
  Implementation Notes (OpenSearch-specific)

  Execution model

  - Implemented as a logical UNION ALL operator in the Calcite plan
  - No special execution routing or streaming classification

  Ordering

  - Ordering is not preserved or inferred
  - Explicit sort is required for deterministic results

  Performance

  - Dataset queries are evaluated independently
  - Results are appended in evaluation order

  Field alignment

  - NULL projection is applied to align schemas before union

  Delimiter syntax

  - Comma (,) is the standard delimiter between datasets
  - Whitespace around commas is optional but recommended for readability

  ---
  Non-Goals (Explicitly Out of Scope)

  - Time-based result interleaving
  - Implicit _time or @timestamp ordering
  - Saved search datasets
  - Semantic data models
  - Lookup tables as dataset sources
  - Subsearch-specific execution knobs (maxout, maxtime, caching)

  These may be explored in future iterations.

  ---
  Comparison

  OpenSearch PPL union vs Splunk SPL union
  ┌───────────────────┬─────────────────┬─────────────────────┐
  │      Feature      │   Splunk SPL    │ OpenSearch PPL │
  ├───────────────────┼─────────────────┼─────────────────────┤
  │ Deduplication     │ UNION ALL       │ UNION ALL           │
  ├───────────────────┼─────────────────┼─────────────────────┤
  │ Schema handling   │ Union of fields │ Union of fields     │
  ├───────────────────┼─────────────────┼─────────────────────┤
  │ Time interleaving │ Yes (streaming) │ No                  │
  ├───────────────────┼─────────────────┼─────────────────────┤
  │ Implicit ordering │ _time DESC      │ None                │
  ├───────────────────┼─────────────────┼─────────────────────┤
  │ Delimiter syntax  │ Space or comma  │ Comma only          │
  ├───────────────────┼─────────────────┼─────────────────────┤
  │ Data models       │ Supported       │ Not supported       │
  ├───────────────────┼─────────────────┼─────────────────────┤
  │ Saved searches    │ Supported       │ Not supported       │
  └───────────────────┴─────────────────┴─────────────────────┘


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEATURE] Add Union command in PPL #5110

Proposal — Command Syntax, Options, and Precise Semantics

Command

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[FEATURE] Add Union command in PPL #5110

Description

Proposal — Command Syntax, Options, and Precise Semantics

Command

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions