Fix PySparkProcessor V3 ProcessingInput construction #5759

Closed
Evan-W-ang wants to merge 2 commits into aws:master from Evan-W-ang:fix/pysparkprocessor-v3-processinginput

@Evan-W-ang

Use V3-compatible ProcessingInput construction in PySparkProcessor.

PySparkProcessor still built internal ProcessingInput objects with the
legacy source/destination fields in _stage_configuration() and
_stage_submit_deps(). In V3, ProcessingInput now expects s3_input, so
those internal code paths can fail during pipeline definition or upsert
with validation errors.

This change updates both code paths to build ProcessingInput with
ProcessingS3Input while preserving the same staged S3 URIs and local
mount paths. It also adds regression tests covering configuration
staging and local dependency staging.

@Evan-W-ang

Author

Summary

This PR updates PySparkProcessor to construct ProcessingInput using the
V3-compatible s3_input=ProcessingS3Input(...) shape instead of the legacy
source / destination fields.

Problem

In V3, sagemaker.core.processing.ProcessingInput no longer accepts:

  • source
  • destination

and instead expects V3 fields such as input_name and s3_input.

However, PySparkProcessor still used the legacy constructor internally in:

  • _stage_configuration()
  • _stage_submit_deps()

This can cause validation failures during pipeline definition / upsert.

Fix

This change:

  1. replaces internal legacy ProcessingInput(source=..., destination=...) construction with
    V3-style ProcessingInput(s3_input=ProcessingS3Input(...))
  2. preserves the existing S3 staging behavior
  3. preserves the existing local mount path behavior
  4. avoids relying on legacy .destination access where an explicit local path is sufficient
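
The mapping described in the list above can be sketched as follows. This is a minimal illustration, not the code in this PR: the two dataclasses are stand-ins for the V3 models (the real classes live in the SDK), `to_v3_input` is a hypothetical helper, and the bucket and mount paths are made-up examples. The field names `s3_uri`, `local_path`, `s3_data_type`, and `s3_input_mode` follow the SageMaker CreateProcessingJob API.

```python
from dataclasses import dataclass
from typing import Optional


# Stand-ins for the V3 models, NOT the real SDK classes.
@dataclass
class ProcessingS3Input:
    s3_uri: str
    local_path: Optional[str] = None
    s3_data_type: str = "S3Prefix"
    s3_input_mode: str = "File"


@dataclass
class ProcessingInput:
    input_name: str
    s3_input: Optional[ProcessingS3Input] = None


def to_v3_input(input_name: str, source: str, destination: str) -> ProcessingInput:
    """Hypothetical helper: map a legacy (source, destination) pair to the V3 shape.

    `source` is the staged S3 URI; `destination` is the local mount path
    inside the processing container.
    """
    return ProcessingInput(
        input_name=input_name,
        s3_input=ProcessingS3Input(s3_uri=source, local_path=destination),
    )


# Illustrative values only; the real URIs are produced by the staging code.
conf_input = to_v3_input(
    input_name="conf",
    source="s3://example-bucket/job/input/conf/configuration.json",
    destination="/opt/ml/processing/input/conf",
)
print(conf_input.s3_input.s3_uri)
print(conf_input.s3_input.local_path)
```

The staged S3 URI and the mount path are carried over unchanged; only the container they travel in differs, which is why callers can read `s3_input.local_path` where they previously read `.destination`.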

Tests

Added regression tests covering:

  • _stage_configuration() building a V3-compatible ProcessingInput
  • _stage_submit_deps() building a V3-compatible ProcessingInput for local dependencies

Example failure before this change

ValidationError: 2 validation errors for ProcessingInput
source
  Extra inputs are not permitted
destination
  Extra inputs are not permitted
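
The failure above is what a strict model produces when it receives keyword arguments outside its declared fields. A rough stdlib-only sketch of that check is shown here; it is not Pydantic's implementation, `find_extra_fields` is a hypothetical helper, and the allowed field set is an assumption limited to the two V3 fields named in this PR.

```python
from typing import Any, Dict, List, Set

# Assumed field set for illustration; the PR text names input_name and s3_input.
ALLOWED_FIELDS: Set[str] = {"input_name", "s3_input"}


def find_extra_fields(
    kwargs: Dict[str, Any], allowed: Set[str] = ALLOWED_FIELDS
) -> List[str]:
    """Return the keys a strict model would reject, in the order they were passed."""
    return [name for name in kwargs if name not in allowed]


# Legacy-style arguments, as still built internally before this change.
legacy_kwargs = {
    "input_name": "conf",
    "source": "s3://example-bucket/job/input/conf/configuration.json",
    "destination": "/opt/ml/processing/input/conf",
}

for name in find_extra_fields(legacy_kwargs):
    print(f"{name}\n  Extra inputs are not permitted")
```

A check like this could also be run over existing call sites during a migration to list legacy arguments before the model constructor rejects them.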

Motivation

Users migrating to V3 typically update their own processing inputs and outputs to the new schema, but Spark processing can still fail because PySparkProcessor builds its internal inputs with the legacy fields. This patch makes that internal behavior consistent with the V3 processing models.


Test command

```bash
cd ~/sagemaker-python-sdk/sagemaker-core
. .venv/bin/activate
python -m pytest tests/unit/spark/test_processing.py tests/unit/test_processing.py -q
```

Files to include

sagemaker-core/src/sagemaker/core/spark/processing.py
sagemaker-core/tests/unit/spark/test_processing.py

@NathanCYee

NathanCYee commented May 29, 2026

Hi Evan,

Thanks for opening this PR. I noticed the spark_event_logs_s3_uri parameter also has a similar issue with ProcessingOutput.

ValidationError: 4 validation errors for ProcessingOutput
output_name
  Field required [type=missing, input_value={'source': '/opt/ml/proce...oad_mode': 'Continuous'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.13/v/missing
source
  Extra inputs are not permitted [type=extra_forbidden, input_value='/opt/ml/processing/spark-events/', input_type=str]
    For further information visit https://errors.pydantic.dev/2.13/v/extra_forbidden
destination
  Extra inputs are not permitted [type=extra_forbidden, input_value='s3://amazon-sagemaker-74...owdtqwioecn/spark-logs/', input_type=str]
    For further information visit https://errors.pydantic.dev/2.13/v/extra_forbidden
s3_upload_mode
  Extra inputs are not permitted [type=extra_forbidden, input_value='Continuous', input_type=str]
    For further information visit https://errors.pydantic.dev/2.13/v/extra_forbidden

This is blocking use of PySparkProcessor. It would be good for someone to escalate a review of this.
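
For reference, the V3-style shape that the error above points toward might look like the sketch below. The dataclasses are stand-ins rather than the real SDK models, `event_logs_output` is a hypothetical helper, and the default output name `spark-event-logs` is an assumption chosen for illustration. The local path and upload mode are taken from the error message; `s3_uri`, `local_path`, and `s3_upload_mode` follow the SageMaker CreateProcessingJob API.

```python
from dataclasses import dataclass
from typing import Optional


# Stand-ins for the V3 models, NOT the real SDK classes.
@dataclass
class ProcessingS3Output:
    s3_uri: str
    local_path: Optional[str] = None
    s3_upload_mode: str = "EndOfJob"


@dataclass
class ProcessingOutput:
    output_name: str
    s3_output: Optional[ProcessingS3Output] = None


def event_logs_output(
    spark_event_logs_s3_uri: str,
    output_name: str = "spark-event-logs",  # assumed name, for illustration only
) -> ProcessingOutput:
    """Hypothetical helper: build the event-log output in the nested V3 shape."""
    return ProcessingOutput(
        output_name=output_name,
        s3_output=ProcessingS3Output(
            s3_uri=spark_event_logs_s3_uri,
            # Path and mode as they appear in the validation error above.
            local_path="/opt/ml/processing/spark-events/",
            s3_upload_mode="Continuous",
        ),
    )


out = event_logs_output("s3://example-bucket/spark-logs/")
print(out.output_name, out.s3_output.s3_upload_mode)
```

This covers all four reported errors: `output_name` is supplied explicitly, and `source`, `destination`, and `s3_upload_mode` move into the nested `s3_output` object instead of being passed at the top level.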

@Evan-W-ang

Author

Hi @NathanCYee,

Thanks a lot for catching this issue and calling it out, especially on spark_event_logs_s3_uri and ProcessingOutput.

I’ve submitted a new code update to address it. When you have a moment, could you please take another look and review the latest changes?

Really appreciate your help on this.

Evan-W-ang closed this Jun 8, 2026