-
Notifications
You must be signed in to change notification settings - Fork 31
feat(cdk): Add RecordExpander component for nested array extraction #859
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft
devin-ai-integration
wants to merge
12
commits into
main
Choose a base branch
from
devin/1764690419-dpath-extractor-expansion
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
feat(cdk): Add RecordExpander component for nested array extraction #859
devin-ai-integration
wants to merge
12
commits into
main
from
devin/1764690419-dpath-extractor-expansion
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…thExtractor - Add optional expand_records_from_field parameter to extract items from nested arrays - Add optional remain_original_record parameter to preserve parent record context - Implement _expand_record method to handle array expansion logic - Add comprehensive unit tests covering all edge cases - Maintain backward compatibility with existing functionality Co-Authored-By: unknown <>
Contributor
Author
Original prompt from API User |
Contributor
Author
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
👋 Greetings, Airbyte Team Member!Here are some helpful tips and reminders for your convenience. Testing This CDK VersionYou can test this version of the CDK using the following: # Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1764690419-dpath-extractor-expansion#egg=airbyte-python-cdk[dev]' --help
# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1764690419-dpath-extractor-expansionHelpful ResourcesPR Slash CommandsAirbyte Maintainers can execute the following slash commands on your PR:
|
Co-Authored-By: unknown <>
Co-Authored-By: unknown <>
Co-Authored-By: unknown <>
2 tasks
- Create new RecordExpander class in airbyte_cdk/sources/declarative/expanders/ - Move expand_records_from_field and remain_original_record parameters from DpathExtractor to RecordExpander - Update DpathExtractor to accept optional record_expander attribute - Register RecordExpander in manifest component transformer - Update unit tests to use new RecordExpander class structure - All 24 tests passing, MyPy and Ruff checks passing This refactoring improves separation of concerns by isolating record expansion logic into a dedicated component. Co-Authored-By: unknown <>
- Add RecordExpander definition to declarative_component_schema.yaml - Add record_expander property to DpathExtractor schema - Update create_dpath_extractor in model_to_component_factory.py to handle record_expander - Auto-generate models from schema using poetry run poe build - All 24 tests passing This completes the schema registration for RecordExpander component, enabling YAML manifests to properly instantiate RecordExpander when used with DpathExtractor. Co-Authored-By: unknown <>
Apply cleaner logic using 'yield from' consistently: - When extracted is a list without record_expander, use 'yield from extracted' - Check 'if not self.record_expander' instead of nested if/else - Remove unnecessary 'yield from []' for empty case All 24 tests passing. Suggested by @DanyloGL. Co-Authored-By: unknown <>
Changes: - Add back 'else: yield from []' in DpathExtractor for explicit empty case - Update RecordExpander to return nothing when expand_records_from_field path doesn't exist or isn't a list - Update unit tests to expect no records instead of original record when expansion fails This makes RecordExpander stricter: it only emits records when successfully expanding a list. For Stripe invoice_line_items, this ensures we only emit line items, not invoice objects. All 24 tests passing. Requested by @DanyloGL. Co-Authored-By: unknown <>
Changes: 1. Remove TypeError from exception handler (only catch KeyError per dpath.get docs) 2. Add wildcard (*) support to RecordExpander for matching multiple arrays 3. Update docstring and schema to document wildcard support 4. Add 5 new unit tests for wildcard expansion scenarios 5. Regenerate models from updated schema When wildcards are used, RecordExpander: - Uses dpath.values() to find all matches - Filters for list-valued matches only - Expands items from all matched lists - Returns nothing if no list matches found All 29 tests passing. Requested by @DanyloGL. Co-Authored-By: unknown <>
MyPy was complaining that dpath.values() and dpath.get() return 'object' type. Added cast(Iterable[Any], ...) for dpath.values() and cast(Any, ...) for dpath.get() to satisfy MyPy type checking while maintaining runtime behavior. All 29 tests passing. MyPy check now passes. Co-Authored-By: unknown <>
Unified the wildcard and non-wildcard branches by collecting all arrays to process into a single list, then using one common loop for expansion. This eliminates the duplicated item iteration and record expansion logic. All 29 tests passing. MyPy check passes. Co-Authored-By: unknown <>
Changes per Daryna's feedback: 1. Removed isinstance(m, list) filter - now checking inside loop 2. Renamed 'matches' to 'extracted' 3. Removed type casts - using 'extracted: Any' instead 4. Renamed 'nested_array' to 'record' (loop var), using 'parent_record' for original 5. Removed 'if not nested_array:' check (redundant with for loop) All 29 tests passing. MyPy check passes. Co-Authored-By: unknown <>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
feat(cdk): Add RecordExpander component for nested array extraction
Summary
This PR adds a new
RecordExpandercomponent to the CDK that enables extracting items from nested array fields and emitting each item as a separate record. This is needed to fix the Stripeinvoice_line_itemsstream issue where the events endpoint returns invoice objects with nestedlines.dataarrays, but we need to emit each line item as a separate record.Key changes:
RecordExpanderclass inairbyte_cdk/sources/declarative/expanders/DpathExtractorvia optionalrecord_expanderparameter["sections", "*", "items"])remain_original_recordflag to embed parent record contextExample usage:
Review & Testing Checklist for Human
This is a YELLOW risk PR (medium confidence). Please verify:
Type safety with
Anyannotation: The code usesextracted: Anyon line 73 and 86 ofrecord_expander.pyto work around MyPy's limitations with the dpath library (which returnsobjecttype). Review whether this is acceptable or if we need a different approach (e.g.,type: ignorecomments).Variable naming clarity: The code uses
parent_record = recordto preserve the original record, then reusesrecordas a loop variable in the wildcard branch (line 74). While this works correctly, verify this naming pattern is clear enough or if we should use different variable names.Wildcard path behavior: Test with real nested data structures to ensure wildcard matching works correctly. The logic iterates extracted values and only processes items that are lists (line 75:
if isinstance(record, list)). Verify this handles all edge cases properly.Backward compatibility: Verify existing connectors without
record_expanderstill work correctly (the parameter is optional and defaults to None).End-to-end testing: This PR only includes unit tests. The real-world behavior needs to be verified with the Stripe connector in the companion PR (fix(source-stripe): Fix invoice_line_items incremental stream to emit line items instead of invoices (do not merge) airbyte#70294).
Recommended test plan:
poetry run pytest unit_tests/sources/declarative/extractors/test_dpath_extractor.pyNotes