Skip to content

[FEATURE] Add Union command in PPL #5110

@srikanthpadakanti

Description

@srikanthpadakanti

Proposal — Command Syntax, Options, and Precise Semantics

Command

union <dataset> [, <dataset>...]

Arguments

dataset (required, repeatable)

The dataset(s) whose results will be merged. Each dataset can be one of the following OpenSearch-supported sources:

1. Index / index pattern / alias
  - Examples: logs-*, my-index, security-events-alias
2. Subsearch result set
  - Example:
  [search index=a | where status=200 | fields user action]

Multiple datasets must be separated by commas.

Syntax examples:
| union dataset_a, dataset_b, dataset_c
| union [search index=a], [search index=b]
| union logs-*, security-events-alias

Out of scope
- Saved searches as datasets
- Semantic data models
- Lookup tables as first-class dataset sources

Constraints

- At least one dataset must be specified
- Datasets must be resolvable at analysis time

Position in pipeline

As first command:
| union dataset_a, dataset_b
All datasets must be explicit arguments.

Mid-pipeline:
search index=a | where status=200 | union dataset_b
The upstream result set is implicitly included as the first dataset.

This is equivalent to:
| union [search index=a | where status=200], dataset_b

---
Precise Semantics

Dataset Processing

Input
- First command: only the explicitly specified datasets
- Mid-pipeline: upstream result set (implicit) + explicit datasets

Output
- A single merged result set containing all rows from all inputs

Schema Merging

Output schema is the union of all field names across all datasets

For each output row:
- Fields present in the originating dataset retain their values
- Fields absent in the originating dataset are set to null

Type resolution
- If types are compatible, coerce to a common supertype (e.g., int + float → float)
- Otherwise, coerce to string

Ordering Semantics

- No implicit ordering guarantees
- The output order is determined by dataset evaluation order
- Deterministic ordering requires an explicit sort command

Value Preservation

UNION ALL semantics
- No deduplication
- All rows from all inputs are preserved exactly once
- Output cardinality = sum of input cardinalities

Error Handling

The command fails if:
- No datasets are provided
- Any dataset cannot be resolved or accessed
- Schema resolution fails due to unrecoverable type conflicts

---
Examples

1. Basic union (same schema)

Input

Dataset A (upstream):
┌──────┬────────┐
│ user │ action │
├──────┼────────┤
│ joe  │ login  │
├──────┼────────┤
│ sam  │ logout │
└──────┴────────┘
Dataset B:
┌───────┬────────┐
│ user  │ action │
├───────┼────────┤
│ alice │ login  │
├───────┼────────┤
│ bob   │ logout │
└───────┴────────┘
Query
... | union dataset_b

Output
┌───────┬────────┐
│ user  │ action │
├───────┼────────┤
│ joe   │ login  │
├───────┼────────┤
│ sam   │ logout │
├───────┼────────┤
│ alice │ login  │
├───────┼────────┤
│ bob   │ logout │
└───────┴────────┘
---
2. Union as first command

Input

Index A:
┌──────┬────────┐
│ user │ action │
├──────┼────────┤
│ joe  │ login  │
├──────┼────────┤
│ sam  │ logout │
└──────┴────────┘
Index B:
┌───────┬────────┐
│ user  │ action │
├───────┼────────┤
│ alice │ login  │
├───────┼────────┤
│ bob   │ logout │
└───────┴────────┘
Query
| union [search index=a | fields user, action], [search index=b | fields user, action]

Output
┌───────┬────────┐
│ user  │ action │
├───────┼────────┤
│ joe   │ login  │
├───────┼────────┤
│ sam   │ logout │
├───────┼────────┤
│ alice │ login  │
├───────┼────────┤
│ bob   │ logout │
└───────┴────────┘
---
3. Non-overlapping schemas

Input

Dataset A (upstream):
┌──────┬────────┐
│ user │ action │
├──────┼────────┤
│ joe  │ login  │
├──────┼────────┤
│ sam  │ logout │
└──────┴────────┘
Dataset B:
┌──────┬────────┐
│ host │ status │
├──────┼────────┤
│ web1 │ 200    │
├──────┼────────┤
│ web2 │ 404    │
└──────┴────────┘
Query
... | union dataset_b

Output
┌──────┬────────┬──────┬────────┐
│ user │ action │ host │ status │
├──────┼────────┼──────┼────────┤
│ joe  │ login  │ null │ null   │
├──────┼────────┼──────┼────────┤
│ sam  │ logout │ null │ null   │
├──────┼────────┼──────┼────────┤
│ null │ null   │ web1 │ 200    │
├──────┼────────┼──────┼────────┤
│ null │ null   │ web2 │ 404    │
└──────┴────────┴──────┴────────┘
---
4. Partial schema overlap

Input

Dataset A (upstream):
┌──────┬────────┬───────────┐
│ user │ action │ timestamp │
├──────┼────────┼───────────┤
│ joe  │ login  │ 100       │
└──────┴────────┴───────────┘
Dataset B:
┌──────┬──────┬────────┐
│ user │ host │ status │
├──────┼──────┼────────┤
│ joe  │ web1 │ 200    │
└──────┴──────┴────────┘
Query
... | union dataset_b

Output
┌──────┬────────┬───────────┬──────┬────────┐
│ user │ action │ timestamp │ host │ status │
├──────┼────────┼───────────┼──────┼────────┤
│ joe  │ login  │ 100       │ null │ null   │
├──────┼────────┼───────────┼──────┼────────┤
│ joe  │ null   │ null      │ web1 │ 200    │
└──────┴────────┴───────────┴──────┴────────┘
---
5. Subsearch as dataset

Input

Dataset A (upstream):
┌──────┬────────┐
│ user │ action │
├──────┼────────┤
│ joe  │ login  │
├──────┼────────┤
│ sam  │ logout │
└──────┴────────┘
Index B:
┌───────┬────────┬────────┐
│ user  │ action │ status │
├───────┼────────┼────────┤
│ alice │ login  │ 200    │
├───────┼────────┼────────┤
│ bob   │ logout │ 404    │
└───────┴────────┴────────┘
Query
... | union [search index=b | where status=200 | fields user, action]

Output
┌───────┬────────┐
│ user  │ action │
├───────┼────────┤
│ joe   │ login  │
├───────┼────────┤
│ sam   │ logout │
├───────┼────────┤
│ alice │ login  │
└───────┴────────┘
---
6. Multiple datasets

Input

Dataset A (upstream):
┌──────┬────────┐
│ user │ action │
├──────┼────────┤
│ joe  │ login  │
└──────┴────────┘
Dataset B:
┌───────┬────────┐
│ user  │ action │
├───────┼────────┤
│ alice │ logout │
└───────┴────────┘
Dataset C:
┌──────┬────────┐
│ user │ action │
├──────┼────────┤
│ bob  │ login  │
└──────┴────────┘
Query
... | union dataset_b, dataset_c

Output
┌───────┬────────┐
│ user  │ action │
├───────┼────────┤
│ joe   │ login  │
├───────┼────────┤
│ alice │ logout │
├───────┼────────┤
│ bob   │ login  │
└───────┴────────┘
---
7. Duplicate rows preserved

Input

Dataset A (upstream):
┌──────┬────────┐
│ user │ action │
├──────┼────────┤
│ joe  │ login  │
├──────┼────────┤
│ sam  │ logout │
└──────┴────────┘
Dataset B:
┌──────┬────────┐
│ user │ action │
├──────┼────────┤
│ joe  │ login  │
├──────┼────────┤
│ bob  │ logout │
└──────┴────────┘
Query
... | union dataset_b

Output
┌──────┬────────┐
│ user │ action │
├──────┼────────┤
│ joe  │ login  │
├──────┼────────┤
│ sam  │ logout │
├──────┼────────┤
│ joe  │ login  │
├──────┼────────┤
│ bob  │ logout │
└──────┴────────┘
Note: Duplicates remain; no deduplication is performed (row joe | login appears twice).

---
8. Empty dataset

Input

Dataset A (upstream):
┌──────┬────────┐
│ user │ action │
├──────┼────────┤
│ joe  │ login  │
├──────┼────────┤
│ sam  │ logout │
└──────┴────────┘
Dataset B (empty):
┌──────┬────────┐
│ user │ action │
├──────┼────────┤
└──────┴────────┘
Query
... | union dataset_b

Output
┌──────┬────────┐
│ user │ action │
├──────┼────────┤
│ joe  │ login  │
├──────┼────────┤
│ sam  │ logout │
└──────┴────────┘
Note: If a dataset returns zero rows, it contributes nothing to the output.

---
9. Mid-pipeline union behavior

Input

Index A:
┌──────┬────────┬────────┐
│ user │ action │ status │
├──────┼────────┼────────┤
│ joe  │ login  │ 200    │
├──────┼────────┼────────┤
│ sam  │ logout │ 200    │
├──────┼────────┼────────┤
│ bob  │ login  │ 404    │
└──────┴────────┴────────┘
Index B:
┌───────┬────────┐
│ user  │ action │
├───────┼────────┤
│ alice │ login  │
├───────┼────────┤
│ carol │ logout │
└───────┴────────┘
Query
search index=a | where status=200 | union [search index=b]

Explanation: The filtered result from index A (rows with status=200) is implicitly the first dataset, then unioned with all rows from index B.

Output
┌───────┬────────┬────────┐
│ user  │ action │ status │
├───────┼────────┼────────┤
│ joe   │ login  │ 200    │
├───────┼────────┼────────┤
│ sam   │ logout │ 200    │
├───────┼────────┼────────┤
│ alice │ login  │ null   │
├───────┼────────┼────────┤
│ carol │ logout │ null   │
└───────┴────────┴────────┘
---
10. Union with explicit ordering

Input

Dataset A:
┌──────┬───────────┐
│ user │ timestamp │
├──────┼───────────┤
│ joe  │ 200       │
├──────┼───────────┤
│ sam  │ 100       │
└──────┴───────────┘
Dataset B:
┌───────┬───────────┐
│ user  │ timestamp │
├───────┼───────────┤
│ alice │ 150       │
├───────┼───────────┤
│ bob   │ 250       │
└───────┴───────────┘
Query
| union dataset_a, dataset_b | sort timestamp desc

Output
┌───────┬───────────┐
│ user  │ timestamp │
├───────┼───────────┤
│ bob   │ 250       │
├───────┼───────────┤
│ joe   │ 200       │
├───────┼───────────┤
│ alice │ 150       │
├───────┼───────────┤
│ sam   │ 100       │
└───────┴───────────┘
Note: Without explicit sort, output order is not guaranteed. The sort command provides deterministic ordering.

---
Implementation Notes (OpenSearch-specific)

Execution model

- Implemented as a logical UNION ALL operator in the Calcite plan
- No special execution routing or streaming classification

Ordering

- Ordering is not preserved or inferred
- Explicit sort is required for deterministic results

Performance

- Dataset queries are evaluated independently
- Results are appended in evaluation order

Field alignment

- NULL projection is applied to align schemas before union

Delimiter syntax

- Comma (,) is the standard delimiter between datasets
- Whitespace around commas is optional but recommended for readability

---
Non-Goals (Explicitly Out of Scope)

- Time-based result interleaving
- Implicit _time or @timestamp ordering
- Saved search datasets
- Semantic data models
- Lookup tables as dataset sources
- Subsearch-specific execution knobs (maxout, maxtime, caching)

These may be explored in future iterations.

---
Comparison

OpenSearch PPL union vs Splunk SPL union
┌───────────────────┬─────────────────┬─────────────────────┐
│      Feature      │   Splunk SPL    │ OpenSearch PPL │
├───────────────────┼─────────────────┼─────────────────────┤
│ Deduplication     │ UNION ALL       │ UNION ALL           │
├───────────────────┼─────────────────┼─────────────────────┤
│ Schema handling   │ Union of fields │ Union of fields     │
├───────────────────┼─────────────────┼─────────────────────┤
│ Time interleaving │ Yes (streaming) │ No                  │
├───────────────────┼─────────────────┼─────────────────────┤
│ Implicit ordering │ _time DESC      │ None                │
├───────────────────┼─────────────────┼─────────────────────┤
│ Delimiter syntax  │ Space or comma  │ Comma only          │
├───────────────────┼─────────────────┼─────────────────────┤
│ Data models       │ Supported       │ Not supported       │
├───────────────────┼─────────────────┼─────────────────────┤
│ Saved searches    │ Supported       │ Not supported       │
└───────────────────┴─────────────────┴─────────────────────┘

Metadata

Metadata

Labels

PPLPiped processing languageenhancementNew feature or requestfeature

Type

No type

Projects

Status

Not Started

Status

Todo

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions