Add auto-extract mode for spath command #5140

Merged
penghuo merged 7 commits into opensearch-project:main from dai-chen:add-auto-extract-mode-support-in-spath
Feb 13, 2026

Conversation

@dai-chen
Collaborator

@dai-chen dai-chen commented Feb 12, 2026

Description

As a follow-up to #5139, this PR implements the auto-extract mode for the spath command per the proposal in #4307. When the path parameter is omitted, spath is rewritten to the json_extract_all(input) function and returns a map<string, string> with flattened keys instead of expanding dynamic columns.

Minor differences from proposal:

  • Malformed JSON returns partial results
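
To make the flattened map<string, string> from the description concrete, here is a minimal Java sketch of one possible flattening. It is not the code in JsonExtractAllFunctionImpl: the class FlattenSketch and its methods are hypothetical, it walks an already-parsed value (nested Map/List) rather than raw JSON text, and the key format (dots for nested objects, a {} suffix for arrays, repeated keys collected into one bracketed list) is inferred from the test expectations quoted later in this conversation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not the project's implementation. */
final class FlattenSketch {

  /** Flattens a parsed JSON value (Map, List, or scalar) into string-valued entries. */
  static Map<String, String> flatten(Object root) {
    Map<String, List<Object>> collected = new LinkedHashMap<>();
    walk("", root, collected);

    Map<String, String> result = new LinkedHashMap<>();
    for (Map.Entry<String, List<Object>> e : collected.entrySet()) {
      List<Object> values = e.getValue();
      // Assumption: a lone value is rendered without brackets, as in the
      // "data.items{}" -> "4" expectation; zero or several values render as a
      // bracketed list. The real rule for single-element arrays may differ.
      String text = values.size() == 1 ? String.valueOf(values.get(0)) : values.toString();
      result.put(e.getKey(), text);
    }
    return result;
  }

  private static void walk(String path, Object node, Map<String, List<Object>> out) {
    if (node instanceof Map) {
      // Nested object: extend the key with ".<field>".
      for (Map.Entry<?, ?> e : ((Map<?, ?>) node).entrySet()) {
        String child = path.isEmpty() ? String.valueOf(e.getKey()) : path + "." + e.getKey();
        walk(child, e.getValue(), out);
      }
    } else if (node instanceof List) {
      // Array: extend the key with "{}" and visit every element under that same key.
      List<?> array = (List<?>) node;
      String child = path + "{}";
      if (array.isEmpty()) {
        out.computeIfAbsent(child, k -> new ArrayList<>()); // empty array still gets a key
      }
      for (Object element : array) {
        walk(child, element, out);
      }
    } else {
      // Scalar leaf: values that share a key are all preserved, in encounter order.
      out.computeIfAbsent(path, k -> new ArrayList<>()).add(node);
    }
  }
}
```

Under these assumptions, `{"data": {"items": [[1, 2, {"hello": 3}], 4]}}` flattens to `data.items{}{}` = `"[1, 2]"`, `data.items{}{}.hello` = `"3"`, and `data.items{}` = `"4"`, which lines up with the unit-test excerpt quoted in the review thread. In this sketch, `"items": [[1, 2], [3, 4]]` produces a single `items{}{}` key holding `"[1, 2, 3, 4]"`; the thread confirms the key name but does not state the exact value the command returns.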

Related Issues

Resolves #4307 (partially; ergonomic map access syntax is left as a follow-up)

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • New PPL command checklist all confirmed.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff or -s.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Chen Dai <daichen@amazon.com>
…serve issue

Signed-off-by: Chen Dai <daichen@amazon.com>
@dai-chen dai-chen self-assigned this Feb 12, 2026
@dai-chen dai-chen added the enhancement (New feature or request) and PPL (Piped processing language) labels Feb 12, 2026
@coderabbitai
Contributor

coderabbitai bot commented Feb 12, 2026

📝 Walkthrough

Made SPath.path optional to enable an auto-extract mode; spath without a path now rewrites to json_extract_all and returns a flattened Map<String,String>. Changes touch core AST, json UDF behavior, parser/rewrite logic, tests, test data/mappings, and documentation.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Core AST: core/src/main/java/org/opensearch/sql/ast/tree/SPath.java | Made path nullable and added conditional rewrite logic in rewriteAsEval to route pathless spath to auto-extract behavior. |
| JSON UDF Implementation & Tests: core/src/main/java/org/opensearch/sql/expression/function/jsonUDF/JsonExtractAllFunctionImpl.java, core/src/test/java/org/opensearch/sql/expression/function/jsonUDF/JsonExtractAllFunctionImplTest.java | Changed json_extract_all to produce stringified values (Map<String,String>), updated return type inference to VARCHAR, added stringifyMap helper, and adapted unit tests to expect string values. |
| PPL Parser & Rewrite Tests: ppl/src/main/java/org/opensearch/sql/ppl/parser/AstBuilder.java, ppl/src/test/java/org/opensearch/sql/ppl/utils/SPathRewriteTest.java | Removed runtime validation that required a path; allow constructing SPath with null path and added tests verifying auto-extract rewrite (with/without output). |
| PPL Parser Tests / Anonymizer: ppl/src/test/java/org/opensearch/sql/ppl/parser/AstBuilderTest.java, ppl/src/test/java/org/opensearch/sql/ppl/utils/PPLQueryDataAnonymizerTest.java | Added tests covering spath with no path and anonymization of pathless spath queries. |
| Calcite PPL Tests: ppl/src/test/java/org/opensearch/sql/ppl/calcite/CalcitePPLSpathTest.java | Reworked tests to fluent API and added both path-mode and auto-extract-mode tests (with output variants). |
| Integration Tests: integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalcitePPLSpathCommandIT.java, integ-test/src/test/java/org/opensearch/sql/calcite/standalone/JsonExtractAllFunctionIT.java | Added integration tests exercising auto-extract scenarios (nested fields, arrays, duplicate keys, stringify/null handling) and adapted helpers to Map<String,String>. |
| Docs & Test Data / Mappings: docs/user/ppl/cmd/spath.md, doctest/test_data/structured.json, doctest/test_mapping/structured.json | Docs updated to describe dual modes (path-based and auto-extract) with flattening rules and examples; test data and mapping add doc_auto entries to exercise auto-extract. |
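
The summary above mentions a stringifyMap helper that turns the extracted values into strings. The following is a minimal sketch of what such a helper might look like, not the project's code: the class name StringifySketch, the method signature, and the choice to keep null values as null (rather than the text "null") are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration only; the real helper's signature is not shown in this PR page. */
final class StringifySketch {

  /** Copies the map, replacing every non-null value with its string form. */
  static Map<String, String> stringifyMap(Map<String, Object> values) {
    Map<String, String> result = new LinkedHashMap<>();
    for (Map.Entry<String, Object> e : values.entrySet()) {
      Object value = e.getValue();
      // Assumption: null stays null instead of becoming the four-character text "null".
      result.put(e.getKey(), value == null ? null : String.valueOf(value));
    }
    return result;
  }
}
```

With this shape, numbers, booleans, and lists all end up as plain strings, which is consistent with the VARCHAR value type noted in the summary.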

Sequence Diagram

sequenceDiagram
    participant User
    participant Parser as PPL Parser
    participant SPath as SPath AST
    participant Rewriter as SPath Rewriter
    participant JsonExtract as JsonExtractAll
    participant Output as Result Map

    User->>Parser: spath input=field (no path)
    Parser->>SPath: Create SPath(input=field, path=null)
    SPath->>Rewriter: rewriteAsEval(context)
    Rewriter->>Rewriter: Detect path is null
    Rewriter->>JsonExtract: Use json_extract_all(field)
    JsonExtract->>JsonExtract: Parse JSON and flatten structure
    JsonExtract->>JsonExtract: Stringify all values
    JsonExtract->>Output: Return Map<String,String>
    Output->>User: Flattened key-value pairs (strings)
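
The branch on a missing path shown in the diagram can be sketched in a few lines of Java. This is a simplified stand-in, not the SPath class from the PR: SPathSketch is a hypothetical name, the rewrite is rendered as a string instead of an AST node, and the path-mode branch (a json_extract call whose output defaults to the path) is an assumption, since this page only describes the auto-extract branch. The default of writing the result back to the input field follows the integration-test query `spath input=doc`, whose schema reports a `doc` struct.

```java
import java.util.Objects;

/** Hypothetical stand-in for the SPath AST node; not the project's class. */
final class SPathSketch {
  private final String input;  // field that holds the JSON text
  private final String output; // optional destination field, may be null
  private final String path;   // optional JSON path; null selects auto-extract mode

  SPathSketch(String input, String output, String path) {
    this.input = Objects.requireNonNull(input, "input is required");
    this.output = output;
    this.path = path;
  }

  /** Renders, as text, the eval command this node would be rewritten to. */
  String rewriteAsEval() {
    if (path == null) {
      // Auto-extract mode: flatten the whole document; output defaults to the input field.
      String target = output != null ? output : input;
      return "eval " + target + " = json_extract_all(" + input + ")";
    }
    // Path mode (assumed shape): extract one path; output defaults to the path itself.
    String target = output != null ? output : path;
    return "eval " + target + " = json_extract(" + input + ", '" + path + "')";
  }
}
```

For example, `new SPathSketch("doc", null, null).rewriteAsEval()` yields `eval doc = json_extract_all(doc)`, while supplying an output name only changes the left-hand side of the assignment.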

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

calcite

Suggested reviewers

  • ykmr1224
  • penghuo
  • ps48
  • kavithacm
  • derek-ho
  • joshuali925
  • GumpacG
  • Swiddis
  • anirudha
  • forestmvey
🚥 Pre-merge checks | ✅ 4 | ❌ 2
❌ Failed checks (2 warnings)
Docstring Coverage: ⚠️ Warning. Docstring coverage is 3.70%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
Merge Conflict Detection: ⚠️ Warning. ❌ Merge conflicts detected (16 files):

⚔️ build.gradle (content)
⚔️ core/src/main/java/org/opensearch/sql/ast/tree/SPath.java (content)
⚔️ core/src/main/java/org/opensearch/sql/executor/QueryService.java (content)
⚔️ core/src/main/java/org/opensearch/sql/expression/function/jsonUDF/JsonExtractAllFunctionImpl.java (content)
⚔️ core/src/test/java/org/opensearch/sql/executor/QueryServiceTest.java (content)
⚔️ core/src/test/java/org/opensearch/sql/expression/function/jsonUDF/JsonExtractAllFunctionImplTest.java (content)
⚔️ docs/user/ppl/cmd/spath.md (content)
⚔️ doctest/test_data/structured.json (content)
⚔️ doctest/test_mapping/structured.json (content)
⚔️ integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalcitePPLSpathCommandIT.java (content)
⚔️ integ-test/src/test/java/org/opensearch/sql/calcite/standalone/JsonExtractAllFunctionIT.java (content)
⚔️ ppl/src/main/java/org/opensearch/sql/ppl/parser/AstBuilder.java (content)
⚔️ ppl/src/test/java/org/opensearch/sql/ppl/calcite/CalcitePPLSpathTest.java (content)
⚔️ ppl/src/test/java/org/opensearch/sql/ppl/parser/AstBuilderTest.java (content)
⚔️ ppl/src/test/java/org/opensearch/sql/ppl/utils/PPLQueryDataAnonymizerTest.java (content)
⚔️ ppl/src/test/java/org/opensearch/sql/ppl/utils/SPathRewriteTest.java (content)

These conflicts must be resolved before merging into main.
Resolve conflicts locally and push changes to this branch.
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly describes the main change: adding auto-extract mode for the spath command, which aligns with the core functionality implemented across the changeset. |
| Linked Issues check | ✅ Passed | The PR implements the dynamic columns support objective from issue #4307 by enabling spath to extract all JSON attributes as a flattened map<string, string> when path is omitted, partially resolving the issue as stated. |
| Out of Scope Changes check | ✅ Passed | All code changes are scoped to implementing the auto-extract mode for spath: modifying SPath to allow nullable path, updating JsonExtractAllFunctionImpl for string-value mapping, adding comprehensive tests, and updating documentation with new auto-extract mode details. |
| Description check | ✅ Passed | The PR description clearly relates to the changeset, describing the auto-extract mode implementation for spath command when path parameter is omitted. |

Signed-off-by: Chen Dai <daichen@amazon.com>
Signed-off-by: Chen Dai <daichen@amazon.com>
Signed-off-by: Chen Dai <daichen@amazon.com>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@docs/user/ppl/cmd/spath.md`:
- Around line 41-43: There are two adjacent blockquotes ("Auto-extract mode
processes the entire input field..." and "Invalid or malformed JSON returns
partial results...") separated by a blank line which triggers MD028; remove the
blank line and either merge the two notes into a single blockquote or place
regular text between them so they are no longer consecutive blockquotes—update
the lines containing those two blockquote texts to be joined into one blockquote
or separated by non-blockquote content.

In
`@integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalcitePPLSpathCommandIT.java`:
- Around line 40-48: Add three new test documents to the auto-extract fixtures
for CalcitePPLSpathCommandIT by creating additional Request objects (similar to
autoExtractDoc) and performing them via client().performRequest: (1) a document
where the target auto-extract field(s) are explicitly null to exercise null
handling, (2) a document where those fields contain empty JSON object "{}" and
an empty JSON array "[]" to cover boundary/empty-input behavior, and (3) a
document with a deliberately malformed JSON string (e.g., truncated or missing
quotes/braces) to trigger error paths; each should use the same PUT to the
/test_spath_auto/_doc/{id}?refresh=true pattern with unique ids and matching
JSON entity strings so tests validate null, empty, and invalid inputs for the
auto-extract mode.
- Around line 59-63: The testSpathAutoExtract test only asserts schema via
verifySchema(schema("doc","struct")) which is too weak; update
testSpathAutoExtract to also assert actual extracted values and multi-document
behavior by parsing the JSONObject returned from executeQuery("source=test_spath
| spath input=doc") and adding row-level assertions that specific keys/values
exist for each document (e.g., expected map entries inside the "doc" struct),
and/or use existing row-assert helper methods to assert the expected number of
rows and exact field values; keep the initial verifySchema call but add concrete
assertions that validate the map extraction across multiple documents.

penghuo
penghuo previously approved these changes Feb 13, 2026
Signed-off-by: Chen Dai <daichen@amazon.com>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In
`@integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalcitePPLSpathCommandIT.java`:
- Around line 102-111: Add a boundary test that verifies empty-array handling
for the spath array suffix case: create a new test (or extend
testSpathAutoExtractArraySuffix) that runs executeQuery with a document whose
array field is empty (e.g., "tags": []) and assert schema via
verifySchema("result","struct") and data via verifyDataRows expecting the
extracted key "tags{}" to stringify as "[]"; reference the existing test method
testSpathAutoExtractArraySuffix, the helpers executeQuery, verifySchema and
verifyDataRows, and ensure the expected JSONObject uses "tags{}":"[]".

Signed-off-by: Chen Dai <daichen@amazon.com>
@dai-chen dai-chen requested a review from penghuo February 13, 2026 17:50
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@docs/user/ppl/cmd/spath.md`:
- Around line 161-163: The example table in spath.md has an inconsistent
representation for single-element arrays: update the second example row so the
"tags{}" output is "[python]" (i.e., change 'tags{}': 'python' to 'tags{}':
'[python]') to match the bracketed array stringification used in the other rows
and documented rules; locate the example table in spath.md and adjust that cell
text accordingly.

Collaborator

@Swiddis Swiddis left a comment

I like this approach! lgtm with comments

Map.of(
"data.items{}{}", "[1, 2]",
"data.items{}{}.hello", "3",
"data.items{}", "4",
Collaborator

@Swiddis Swiddis Feb 13, 2026

thought: Is this bracket syntax easily supported by downstream commands that want to extract these fields, or will there need to be odd escaping?

That might only be a concern for the mapping functions in a future PR, but I would still carefully review if this is the extraction syntax we want for arrays. At a glance this isn't intuitive to me but after squinting at the docs for a bit I kinda got it?

e.g. It's not obvious to me what will happen if I try "items": [[1, 2], [3, 4]], do I get two items{}{} keys? Does one overwrite the other? Are they merged into one array? (From code review I know which one it is, but not as a user given only the docs & this example)

Collaborator Author

Yes, I think we don't have a good workaround until we move to Map<String,Any> or schemaless. Currently it just preserves all the values from the "conflicting" keys. Let me verify your example.

Collaborator Author

@dai-chen dai-chen Feb 13, 2026

The result is items{}{} as expected. I can add more cases in doctest or this UT if that's your suggestion. Thanks!

@penghuo penghuo merged commit be44a8e into opensearch-project:main Feb 13, 2026
50 of 51 checks passed

Labels

enhancement New feature or request PPL Piped processing language

Projects

None yet

Development

Successfully merging this pull request may close these issues.

(Calcite PPL) spath command dynamic columns support

3 participants