generated from amazon-archives/__template_Custom
-
Notifications
You must be signed in to change notification settings - Fork 181
Open
Labels
PPLPiped processing languagePiped processing languageenhancementNew feature or requestNew feature or requestfeature
Description
Proposal — Command Syntax, Options, and Precise Semantics
Command
union <dataset> [, <dataset>...]
Arguments
dataset (required, repeatable)
The dataset(s) whose results will be merged. Each dataset can be one of the following OpenSearch-supported sources:
1. Index / index pattern / alias
- Examples: logs-*, my-index, security-events-alias
2. Subsearch result set
- Example:
[search index=a | where status=200 | fields user action]
Multiple datasets must be separated by commas.
Syntax examples:
| union dataset_a, dataset_b, dataset_c
| union [search index=a], [search index=b]
| union logs-*, security-events-alias
Out of scope
- Saved searches as datasets
- Semantic data models
- Lookup tables as first-class dataset sources
Constraints
- At least one dataset must be specified
- Datasets must be resolvable at analysis time
Position in pipeline
As first command:
| union dataset_a, dataset_b
All datasets must be explicit arguments.
Mid-pipeline:
search index=a | where status=200 | union dataset_b
The upstream result set is implicitly included as the first dataset.
This is equivalent to:
| union [search index=a | where status=200], dataset_b
---
Precise Semantics
Dataset Processing
Input
- First command: only the explicitly specified datasets
- Mid-pipeline: upstream result set (implicit) + explicit datasets
Output
- A single merged result set containing all rows from all inputs
Schema Merging
Output schema is the union of all field names across all datasets
For each output row:
- Fields present in the originating dataset retain their values
- Fields absent in the originating dataset are set to null
Type resolution
- If types are compatible, coerce to a common supertype (e.g., int + float → float)
- Otherwise, coerce to string
Ordering Semantics
- No implicit ordering guarantees
- The output order is determined by dataset evaluation order
- Deterministic ordering requires an explicit sort command
Value Preservation
UNION ALL semantics
- No deduplication
- All rows from all inputs are preserved exactly once
- Output cardinality = sum of input cardinalities
Error Handling
The command fails if:
- No datasets are provided
- Any dataset cannot be resolved or accessed
- Schema resolution fails due to unrecoverable type conflicts
---
Examples
1. Basic union (same schema)
Input
Dataset A (upstream):
┌──────┬────────┐
│ user │ action │
├──────┼────────┤
│ joe │ login │
├──────┼────────┤
│ sam │ logout │
└──────┴────────┘
Dataset B:
┌───────┬────────┐
│ user │ action │
├───────┼────────┤
│ alice │ login │
├───────┼────────┤
│ bob │ logout │
└───────┴────────┘
Query
... | union dataset_b
Output
┌───────┬────────┐
│ user │ action │
├───────┼────────┤
│ joe │ login │
├───────┼────────┤
│ sam │ logout │
├───────┼────────┤
│ alice │ login │
├───────┼────────┤
│ bob │ logout │
└───────┴────────┘
---
2. Union as first command
Input
Index A:
┌──────┬────────┐
│ user │ action │
├──────┼────────┤
│ joe │ login │
├──────┼────────┤
│ sam │ logout │
└──────┴────────┘
Index B:
┌───────┬────────┐
│ user │ action │
├───────┼────────┤
│ alice │ login │
├───────┼────────┤
│ bob │ logout │
└───────┴────────┘
Query
| union [search index=a | fields user, action], [search index=b | fields user, action]
Output
┌───────┬────────┐
│ user │ action │
├───────┼────────┤
│ joe │ login │
├───────┼────────┤
│ sam │ logout │
├───────┼────────┤
│ alice │ login │
├───────┼────────┤
│ bob │ logout │
└───────┴────────┘
---
3. Non-overlapping schemas
Input
Dataset A (upstream):
┌──────┬────────┐
│ user │ action │
├──────┼────────┤
│ joe │ login │
├──────┼────────┤
│ sam │ logout │
└──────┴────────┘
Dataset B:
┌──────┬────────┐
│ host │ status │
├──────┼────────┤
│ web1 │ 200 │
├──────┼────────┤
│ web2 │ 404 │
└──────┴────────┘
Query
... | union dataset_b
Output
┌──────┬────────┬──────┬────────┐
│ user │ action │ host │ status │
├──────┼────────┼──────┼────────┤
│ joe │ login │ null │ null │
├──────┼────────┼──────┼────────┤
│ sam │ logout │ null │ null │
├──────┼────────┼──────┼────────┤
│ null │ null │ web1 │ 200 │
├──────┼────────┼──────┼────────┤
│ null │ null │ web2 │ 404 │
└──────┴────────┴──────┴────────┘
---
4. Partial schema overlap
Input
Dataset A (upstream):
┌──────┬────────┬───────────┐
│ user │ action │ timestamp │
├──────┼────────┼───────────┤
│ joe │ login │ 100 │
└──────┴────────┴───────────┘
Dataset B:
┌──────┬──────┬────────┐
│ user │ host │ status │
├──────┼──────┼────────┤
│ joe │ web1 │ 200 │
└──────┴──────┴────────┘
Query
... | union dataset_b
Output
┌──────┬────────┬───────────┬──────┬────────┐
│ user │ action │ timestamp │ host │ status │
├──────┼────────┼───────────┼──────┼────────┤
│ joe │ login │ 100 │ null │ null │
├──────┼────────┼───────────┼──────┼────────┤
│ joe │ null │ null │ web1 │ 200 │
└──────┴────────┴───────────┴──────┴────────┘
---
5. Subsearch as dataset
Input
Dataset A (upstream):
┌──────┬────────┐
│ user │ action │
├──────┼────────┤
│ joe │ login │
├──────┼────────┤
│ sam │ logout │
└──────┴────────┘
Index B:
┌───────┬────────┬────────┐
│ user │ action │ status │
├───────┼────────┼────────┤
│ alice │ login │ 200 │
├───────┼────────┼────────┤
│ bob │ logout │ 404 │
└───────┴────────┴────────┘
Query
... | union [search index=b | where status=200 | fields user, action]
Output
┌───────┬────────┐
│ user │ action │
├───────┼────────┤
│ joe │ login │
├───────┼────────┤
│ sam │ logout │
├───────┼────────┤
│ alice │ login │
└───────┴────────┘
---
6. Multiple datasets
Input
Dataset A (upstream):
┌──────┬────────┐
│ user │ action │
├──────┼────────┤
│ joe │ login │
└──────┴────────┘
Dataset B:
┌───────┬────────┐
│ user │ action │
├───────┼────────┤
│ alice │ logout │
└───────┴────────┘
Dataset C:
┌──────┬────────┐
│ user │ action │
├──────┼────────┤
│ bob │ login │
└──────┴────────┘
Query
... | union dataset_b, dataset_c
Output
┌───────┬────────┐
│ user │ action │
├───────┼────────┤
│ joe │ login │
├───────┼────────┤
│ alice │ logout │
├───────┼────────┤
│ bob │ login │
└───────┴────────┘
---
7. Duplicate rows preserved
Input
Dataset A (upstream):
┌──────┬────────┐
│ user │ action │
├──────┼────────┤
│ joe │ login │
├──────┼────────┤
│ sam │ logout │
└──────┴────────┘
Dataset B:
┌──────┬────────┐
│ user │ action │
├──────┼────────┤
│ joe │ login │
├──────┼────────┤
│ bob │ logout │
└──────┴────────┘
Query
... | union dataset_b
Output
┌──────┬────────┐
│ user │ action │
├──────┼────────┤
│ joe │ login │
├──────┼────────┤
│ sam │ logout │
├──────┼────────┤
│ joe │ login │
├──────┼────────┤
│ bob │ logout │
└──────┴────────┘
Note: Duplicates remain; no deduplication is performed (row joe | login appears twice).
---
8. Empty dataset
Input
Dataset A (upstream):
┌──────┬────────┐
│ user │ action │
├──────┼────────┤
│ joe │ login │
├──────┼────────┤
│ sam │ logout │
└──────┴────────┘
Dataset B (empty):
┌──────┬────────┐
│ user │ action │
├──────┼────────┤
└──────┴────────┘
Query
... | union dataset_b
Output
┌──────┬────────┐
│ user │ action │
├──────┼────────┤
│ joe │ login │
├──────┼────────┤
│ sam │ logout │
└──────┴────────┘
Note: If a dataset returns zero rows, it contributes nothing to the output.
---
9. Mid-pipeline union behavior
Input
Index A:
┌──────┬────────┬────────┐
│ user │ action │ status │
├──────┼────────┼────────┤
│ joe │ login │ 200 │
├──────┼────────┼────────┤
│ sam │ logout │ 200 │
├──────┼────────┼────────┤
│ bob │ login │ 404 │
└──────┴────────┴────────┘
Index B:
┌───────┬────────┐
│ user │ action │
├───────┼────────┤
│ alice │ login │
├───────┼────────┤
│ carol │ logout │
└───────┴────────┘
Query
search index=a | where status=200 | union [search index=b]
Explanation: The filtered result from index A (rows with status=200) is implicitly the first dataset, then unioned with all rows from index B.
Output
┌───────┬────────┬────────┐
│ user │ action │ status │
├───────┼────────┼────────┤
│ joe │ login │ 200 │
├───────┼────────┼────────┤
│ sam │ logout │ 200 │
├───────┼────────┼────────┤
│ alice │ login │ null │
├───────┼────────┼────────┤
│ carol │ logout │ null │
└───────┴────────┴────────┘
---
10. Union with explicit ordering
Input
Dataset A:
┌──────┬───────────┐
│ user │ timestamp │
├──────┼───────────┤
│ joe │ 200 │
├──────┼───────────┤
│ sam │ 100 │
└──────┴───────────┘
Dataset B:
┌───────┬───────────┐
│ user │ timestamp │
├───────┼───────────┤
│ alice │ 150 │
├───────┼───────────┤
│ bob │ 250 │
└───────┴───────────┘
Query
| union dataset_a, dataset_b | sort timestamp desc
Output
┌───────┬───────────┐
│ user │ timestamp │
├───────┼───────────┤
│ bob │ 250 │
├───────┼───────────┤
│ joe │ 200 │
├───────┼───────────┤
│ alice │ 150 │
├───────┼───────────┤
│ sam │ 100 │
└───────┴───────────┘
Note: Without explicit sort, output order is not guaranteed. The sort command provides deterministic ordering.
---
Implementation Notes (OpenSearch-specific)
Execution model
- Implemented as a logical UNION ALL operator in the Calcite plan
- No special execution routing or streaming classification
Ordering
- Ordering is not preserved or inferred
- Explicit sort is required for deterministic results
Performance
- Dataset queries are evaluated independently
- Results are appended in evaluation order
Field alignment
- NULL projection is applied to align schemas before union
Delimiter syntax
- Comma (,) is the standard delimiter between datasets
- Whitespace around commas is optional but recommended for readability
---
Non-Goals (Explicitly Out of Scope)
- Time-based result interleaving
- Implicit _time or @timestamp ordering
- Saved search datasets
- Semantic data models
- Lookup tables as dataset sources
- Subsearch-specific execution knobs (maxout, maxtime, caching)
These may be explored in future iterations.
---
Comparison
OpenSearch PPL union vs Splunk SPL union
┌───────────────────┬─────────────────┬─────────────────────┐
│ Feature │ Splunk SPL │ OpenSearch PPL │
├───────────────────┼─────────────────┼─────────────────────┤
│ Deduplication │ UNION ALL │ UNION ALL │
├───────────────────┼─────────────────┼─────────────────────┤
│ Schema handling │ Union of fields │ Union of fields │
├───────────────────┼─────────────────┼─────────────────────┤
│ Time interleaving │ Yes (streaming) │ No │
├───────────────────┼─────────────────┼─────────────────────┤
│ Implicit ordering │ _time DESC │ None │
├───────────────────┼─────────────────┼─────────────────────┤
│ Delimiter syntax │ Space or comma │ Comma only │
├───────────────────┼─────────────────┼─────────────────────┤
│ Data models │ Supported │ Not supported │
├───────────────────┼─────────────────┼─────────────────────┤
│ Saved searches │ Supported │ Not supported │
└───────────────────┴─────────────────┴─────────────────────┘
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
PPLPiped processing languagePiped processing languageenhancementNew feature or requestNew feature or requestfeature
Type
Projects
Status
Not Started
Status
Todo