Add auto-extract mode for spath command#5140
Add auto-extract mode for spath command#5140penghuo merged 7 commits intoopensearch-project:mainfrom
spath command#5140Conversation
Signed-off-by: Chen Dai <daichen@amazon.com>
…serve issue Signed-off-by: Chen Dai <daichen@amazon.com>
📝 WalkthroughWalkthroughMade SPath.path optional to enable an auto-extract mode; spath without a path now rewrites to json_extract_all and returns a flattened Map<String,String>. Changes touch core AST, json UDF behavior, parser/rewrite logic, tests, test data/mappings, and documentation. Changes
Sequence DiagramsequenceDiagram
participant User
participant Parser as PPL Parser
participant SPath as SPath AST
participant Rewriter as SPath Rewriter
participant JsonExtract as JsonExtractAll
participant Output as Result Map
User->>Parser: spath input=field (no path)
Parser->>SPath: Create SPath(input=field, path=null)
SPath->>Rewriter: rewriteAsEval(context)
Rewriter->>Rewriter: Detect path is null
Rewriter->>JsonExtract: Use json_extract_all(field)
JsonExtract->>JsonExtract: Parse JSON and flatten structure
JsonExtract->>JsonExtract: Stringify all values
JsonExtract->>Output: Return Map<String,String>
Output->>User: Flattened key-value pairs (strings)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
⚔️ Resolve merge conflicts (beta)
Comment |
Signed-off-by: Chen Dai <daichen@amazon.com>
Signed-off-by: Chen Dai <daichen@amazon.com>
Signed-off-by: Chen Dai <daichen@amazon.com>
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@docs/user/ppl/cmd/spath.md`:
- Around line 41-43: There are two adjacent blockquotes ("Auto-extract mode
processes the entire input field..." and "Invalid or malformed JSON returns
partial results...") separated by a blank line which triggers MD028; remove the
blank line and either merge the two notes into a single blockquote or place
regular text between them so they are no longer consecutive blockquotes—update
the lines containing those two blockquote texts to be joined into one blockquote
or separated by non-blockquote content.
In
`@integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalcitePPLSpathCommandIT.java`:
- Around line 40-48: Add three new test documents to the auto-extract fixtures
for CalcitePPLSpathCommandIT by creating additional Request objects (similar to
autoExtractDoc) and performing them via client().performRequest: (1) a document
where the target auto-extract field(s) are explicitly null to exercise null
handling, (2) a document where those fields contain empty JSON object "{}" and
an empty JSON array "[]" to cover boundary/empty-input behavior, and (3) a
document with a deliberately malformed JSON string (e.g., truncated or missing
quotes/braces) to trigger error paths; each should use the same PUT to the
/test_spath_auto/_doc/{id}?refresh=true pattern with unique ids and matching
JSON entity strings so tests validate null, empty, and invalid inputs for the
auto-extract mode.
- Around line 59-63: The testSpathAutoExtract test only asserts schema via
verifySchema(schema("doc","struct")) which is too weak; update
testSpathAutoExtract to also assert actual extracted values and multi-document
behavior by parsing the JSONObject returned from executeQuery("source=test_spath
| spath input=doc") and adding row-level assertions that specific keys/values
exist for each document (e.g., expected map entries inside the "doc" struct),
and/or use existing row-assert helper methods to assert the expected number of
rows and exact field values; keep the initial verifySchema call but add concrete
assertions that validate the map extraction across multiple documents.
integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalcitePPLSpathCommandIT.java
Outdated
Show resolved
Hide resolved
integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalcitePPLSpathCommandIT.java
Show resolved
Hide resolved
integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalcitePPLSpathCommandIT.java
Outdated
Show resolved
Hide resolved
Signed-off-by: Chen Dai <daichen@amazon.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In
`@integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalcitePPLSpathCommandIT.java`:
- Around line 102-111: Add a boundary test that verifies empty-array handling
for the spath array suffix case: create a new test (or extend
testSpathAutoExtractArraySuffix) that runs executeQuery with a document whose
array field is empty (e.g., "tags": []) and assert schema via
verifySchema("result","struct") and data via verifyDataRows expecting the
extracted key "tags{}" to stringify as "[]"; reference the existing test method
testSpathAutoExtractArraySuffix, the helpers executeQuery, verifySchema and
verifyDataRows, and ensure the expected JSONObject uses "tags{}":"[]".
integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalcitePPLSpathCommandIT.java
Show resolved
Hide resolved
Signed-off-by: Chen Dai <daichen@amazon.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@docs/user/ppl/cmd/spath.md`:
- Around line 161-163: The example table in spath.md has an inconsistent
representation for single-element arrays: update the second example row so the
"tags{}" output is "[python]" (i.e., change 'tags{}': 'python' to 'tags{}':
'[python]') to match the bracketed array stringification used in the other rows
and documented rules; locate the example table in spath.md and adjust that cell
text accordingly.
Swiddis
left a comment
There was a problem hiding this comment.
I like this approach! lgtm with comments
...src/main/java/org/opensearch/sql/expression/function/jsonUDF/JsonExtractAllFunctionImpl.java
Show resolved
Hide resolved
| Map.of( | ||
| "data.items{}{}", "[1, 2]", | ||
| "data.items{}{}.hello", "3", | ||
| "data.items{}", "4", |
There was a problem hiding this comment.
thought: Is this bracket syntax easily supported by downstream commands that want to extract these fields, or will there need to be odd escaping?
That might only be a concern for the mapping functions in a future PR, but I would still carefully review if this is the extraction syntax we want for arrays. At a glance this isn't intuitive to me but after squinting at the docs for a bit I kinda got it?
e.g. It's not obvious to me what will happen if I try "items": [[1, 2], [3, 4]], do I get two items{}{} keys? Does one overwrite the other? Are they merged into one array? (From code review I know which one it is, but not as a user given only the docs & this example)
There was a problem hiding this comment.
Yes, I think we don't have good workaround until we move to Map<String,Any> or schemaless. Currently it just preserves all the values from "conflicting" key. Let me verify your example.
There was a problem hiding this comment.
The result is items{}{} as expected. I can add more cases in doctest or this UT if that's your suggestion. Thanks!
Description
As a follow-up after #5139, this PR implements the auto-extract mode for the
spathcommand per the proposal in #4307. Whenpathparameter is omitted,spathrewrites tojson_extract_all(input)function and returns amap<string, string>with flattened keys instead of expanding dynamic columns.Minor differences from proposal:
Related Issues
Resolves #4307 (partially, ergonomic map access syntax are follow-ups)
Check List
--signoffor-s.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.