feat: Spark major version upgrade skill #113
Open: csmith49 wants to merge 4 commits into main from fix/spark-migration
Commits (4), all by csmith49:
8b9831f Add spark-version-upgrade skill with references and marketplace regis…
d70d538 Remove local references, point to official Apache Spark migration guide
8f2e4ff Move Apache migration guide reference to Phase 1
3cc55a8 updating marketplace guidance
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,54 @@ | ||
| # spark-version-upgrade | ||
|
|
||
| Upgrade Apache Spark applications between major versions (2.x→3.x, 3.x→4.x). Covers build files, deprecated APIs, configuration changes, SQL/DataFrame updates, and test validation. | ||
|
|
||
| ## Triggers | ||
|
|
||
| This skill is activated by the following keywords: | ||
|
|
||
| - `spark upgrade` | ||
| - `spark migration` | ||
| - `spark version` | ||
| - `upgrade spark` | ||
| - `spark 3` | ||
| - `spark 4` | ||
| - `pyspark upgrade` | ||
|
|
||
| ## Overview | ||
|
|
||
| This skill provides a structured, six-phase workflow for upgrading Apache Spark applications: | ||
|
|
||
| | Phase | Description | | ||
| |-------|-------------| | ||
| | 1. Inventory & Impact Analysis | Scan the codebase, identify Spark usage, document scope | | ||
| | 2. Build File Updates | Bump Spark/Scala/Java versions in Maven, SBT, Gradle, or pip | | ||
| | 3. API Migration | Replace removed/deprecated APIs (SQLContext, Accumulator, etc.) | | ||
| | 4. Configuration Migration | Rename/remove deprecated Spark config properties | | ||
| | 5. SQL & DataFrame Migration | Fix breaking SQL behavior (ANSI mode, type coercion, date parsing) | | ||
| | 6. Test Validation | Compile, test, compare output to pre-upgrade baseline | | ||
|
|
||
| ## Supported Upgrade Paths | ||
|
|
||
| - **Spark 2.x → 3.x** — Major API removals (SQLContext, HiveContext, Accumulator v1), Scala 2.12/2.13 | ||
| - **Spark 3.x → 4.x** — ANSI mode default, Java 17+ requirement, Scala 2.13 only, legacy flag removal | ||
|
|
||
| ## Languages & Build Systems | ||
|
|
||
| - **Languages**: Scala, Java, Python (PySpark) | ||
| - **Build systems**: Maven, SBT, Gradle, pip/uv | ||
|
|
||
| ## Reference Material | ||
|
|
||
| - [Apache Spark Migration Guide](https://spark.apache.org/docs/latest/migration-guide.html) — The official, up-to-date guide covering API removals, configuration changes, SQL behavior, PySpark, Structured Streaming, and MLlib for every Spark release | ||
|
|
||
| ## Example Usage | ||
|
|
||
| Ask the agent: | ||
|
|
||
| > "Upgrade this project from Spark 2.4 to Spark 3.5" | ||
|
|
||
| > "Migrate our PySpark codebase to Spark 4.0" | ||
|
|
||
| > "Fix all Spark deprecation warnings in this repo" | ||
|
|
||
| The agent will follow the six-phase workflow, producing a `spark_upgrade_impact.md` document and systematically updating build files, code, configuration, and SQL queries. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,233 @@ | ||
| --- | ||
| name: spark-version-upgrade | ||
| description: Upgrade Apache Spark applications between major versions (2.x→3.x, 3.x→4.x). Covers build files, deprecated APIs, configuration changes, SQL/DataFrame updates, and test validation. | ||
| license: MIT | ||
| compatibility: Requires Java 8, 11, or 17 for Spark 3.x and Java 17+ for Spark 4.x; Scala 2.12/2.13; Maven, Gradle, or SBT; Apache Spark | ||
| triggers: | ||
| - spark upgrade | ||
| - spark migration | ||
| - spark version | ||
| - upgrade spark | ||
| - spark 3 | ||
| - spark 4 | ||
| - pyspark upgrade | ||
| --- | ||
|
|
||
| Upgrade Apache Spark applications between major versions with a structured, phase-by-phase workflow. | ||
|
|
||
| ## When to Use | ||
|
|
||
| - Migrating from Spark 2.x → 3.x or Spark 3.x → 4.x | ||
| - Updating PySpark, Spark SQL, or Structured Streaming applications | ||
| - Resolving deprecation warnings before a Spark version bump | ||
|
|
||
| ## Workflow Overview | ||
|
|
||
| 1. **Inventory & Impact Analysis** — Scan the codebase and assess scope | ||
| 2. **Build File Updates** — Bump Spark/Scala/Java dependencies | ||
| 3. **API Migration** — Replace deprecated and removed APIs | ||
| 4. **Configuration Migration** — Update Spark config properties | ||
| 5. **SQL & DataFrame Migration** — Fix query-level breaking changes | ||
| 6. **Test Validation** — Compile, run tests, verify results | ||
|
|
||
| --- | ||
|
|
||
| ## Phase 1: Inventory & Impact Analysis | ||
|
|
||
| Before changing any code, assess what needs to change. Read the official Apache Spark migration guide for the target version — it documents every API removal, config rename, and behavioral change per release: | ||
| https://spark.apache.org/docs/latest/migration-guide.html | ||
|
|
||
| ### Checklist | ||
|
|
||
| - [ ] Read the migration guide section for the target Spark version | ||
| - [ ] Identify current Spark version (check `pom.xml`, `build.sbt`, `build.gradle`, or `requirements.txt`) | ||
| - [ ] Identify target Spark version | ||
| - [ ] Locate Spark API usage to review for deprecations: `grep -rnE 'import org\.apache\.spark|(from|import) pyspark' --include='*.scala' --include='*.java' --include='*.py' .` | ||
| - [ ] List all Spark config properties: `grep -rn 'spark\.' --include='*.conf' --include='*.properties' --include='*.scala' --include='*.java' --include='*.py' . | grep -v 'test'` | ||
| - [ ] Check for custom `SparkSession` or `SparkContext` extensions | ||
| - [ ] Identify connector dependencies (Hive, Kafka, Cassandra, Delta, Iceberg) | ||
| - [ ] Document findings in `spark_upgrade_impact.md` | ||
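The inventory steps above can be sketched as a small standard-library scanner. This is a minimal illustration, not part of the skill: the function name `scan_spark_usage`, the suffix sets, and the regular expressions are all assumptions chosen to mirror the `grep` patterns in the checklist, and unlike the `grep -v 'test'` filter it does not exclude test files.

```python
import re
from pathlib import Path

# Assumption: these suffixes mirror the --include patterns in the checklist.
SOURCE_SUFFIXES = {".scala", ".java", ".py"}
CONFIG_SUFFIXES = {".conf", ".properties"}

# JVM-style and Python-style Spark imports.
IMPORT_RE = re.compile(
    r"^\s*(?:import\s+org\.apache\.spark|from\s+pyspark|import\s+pyspark)"
)
# In source files, only count keys that appear as quoted string literals.
QUOTED_KEY_RE = re.compile(r"""["'](spark\.[A-Za-z][\w.]*)["']""")
# In config files, the key is the first token on the line.
CONF_LINE_KEY_RE = re.compile(r"^\s*(spark\.[A-Za-z][\w.]*)")


def scan_spark_usage(root):
    """Collect Spark imports and config keys found under `root`."""
    imports, config_keys = [], {}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        if path.suffix in SOURCE_SUFFIXES:
            key_pattern = QUOTED_KEY_RE
        elif path.suffix in CONFIG_SUFFIXES:
            key_pattern = CONF_LINE_KEY_RE
        else:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if path.suffix in SOURCE_SUFFIXES and IMPORT_RE.search(line):
                imports.append((str(path), lineno, line.strip()))
            for key in key_pattern.findall(line):
                # Map each key to the set of files that mention it.
                config_keys.setdefault(key, set()).add(str(path))
    return {"imports": imports, "config_keys": config_keys}
```

Restricting source-file matches to quoted literals avoids counting expressions such as `spark.read.parquet(...)` as config keys, at the cost of missing keys that are assembled dynamically.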
|
|
||
| ### Output | ||
|
|
||
| ``` | ||
| spark_upgrade_impact.md # Summary of affected files, APIs, and configs | ||
| ``` | ||
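The skill does not prescribe a layout for `spark_upgrade_impact.md`. As one possible shape, the sketch below renders inventory findings into Markdown; the function name `render_impact_report`, its parameters, and the section headings are hypothetical.

```python
def render_impact_report(current, target, api_findings, config_keys, connectors):
    """Build Markdown text for spark_upgrade_impact.md (layout is an assumption)."""
    lines = [
        "# Spark Upgrade Impact",
        "",
        f"- Current version: {current}",
        f"- Target version: {target}",
        "",
        "## Affected APIs",
    ]
    # api_findings holds (path, line number, note) tuples.
    lines += [f"- `{path}:{lineno}` {note}" for path, lineno, note in api_findings] or ["- None found"]
    lines += ["", "## Config properties"]
    lines += [f"- `{key}`" for key in sorted(config_keys)] or ["- None found"]
    lines += ["", "## Connectors"]
    lines += [f"- {name}" for name in connectors] or ["- None found"]
    return "\n".join(lines) + "\n"
```

Writing the returned string to `spark_upgrade_impact.md` at the repository root keeps the findings reviewable alongside the code changes.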
|
|
||
| --- | ||
|
|
||
| ## Phase 2: Build File Updates | ||
|
|
||
| Update dependency versions and resolve dependency conflicts; compile errors that remain are handled in Phase 3. | ||
|
|
||
| ### Maven (`pom.xml`) | ||
|
|
||
| ```xml | ||
| <!-- Update Spark version property --> | ||
| <spark.version>3.5.1</spark.version> <!-- or 4.0.0 --> | ||
| <scala.version>2.13.12</scala.version> <!-- Spark 3.x: 2.12/2.13; Spark 4.x: 2.13 --> | ||
|
|
||
| <!-- Update artifact IDs if Scala cross-version changed --> | ||
| <artifactId>spark-core_2.13</artifactId> | ||
| <artifactId>spark-sql_2.13</artifactId> | ||
| ``` | ||
|
|
||
| ### SBT (`build.sbt`) | ||
|
|
||
| ```scala | ||
| val sparkVersion = "3.5.1" // or "4.0.0" | ||
| scalaVersion := "2.13.12" | ||
|
|
||
| libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion | ||
| libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion | ||
| ``` | ||
|
|
||
| ### Gradle (`build.gradle`) | ||
|
|
||
| ```groovy | ||
| ext { | ||
| sparkVersion = '3.5.1' // or '4.0.0' | ||
| } | ||
| dependencies { | ||
| implementation "org.apache.spark:spark-core_2.13:${sparkVersion}" | ||
| implementation "org.apache.spark:spark-sql_2.13:${sparkVersion}" | ||
| } | ||
| ``` | ||
|
|
||
| ### PySpark (`requirements.txt` / `pyproject.toml`) | ||
|
|
||
| ``` | ||
| pyspark==3.5.1 # or 4.0.0 | ||
| ``` | ||
|
|
||
| ### Checklist | ||
|
|
||
| - [ ] Update Spark version in build file | ||
| - [ ] Update Scala version if crossing 2.12→2.13 boundary | ||
| - [ ] Update Java source/target level if required (Spark 4.x requires Java 17+) | ||
| - [ ] Update connector library versions to match new Spark version | ||
| - [ ] Resolve dependency conflicts (`mvn dependency:tree` / `sbt dependencyTree` / `gradle dependencies`) | ||
| - [ ] Attempt a full compile and record the errors (failures at this stage are expected and guide Phase 3) | ||
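The version constraints in this phase can be checked mechanically before the first compile. The sketch below encodes only what this document states (Scala 2.12/2.13 for Spark 3.x, Scala 2.13 and Java 17+ for Spark 4.x); the Java 8 floor for Spark 3.x, the `REQUIREMENTS` table, and the function name `check_build_versions` are assumptions to confirm against the release notes for the exact target version.

```python
# Assumed constraints per Spark major version; verify before relying on them.
REQUIREMENTS = {
    3: {"scala": ("2.12", "2.13"), "min_java": 8},
    4: {"scala": ("2.13",), "min_java": 17},
}


def check_build_versions(spark_version, scala_version, java_major, artifact_ids=()):
    """Return a list of human-readable problems; an empty list means no conflict found."""
    major = int(spark_version.split(".")[0])
    rule = REQUIREMENTS.get(major)
    if rule is None:
        return [f"no rule recorded for Spark {major}.x"]

    problems = []
    scala_binary = ".".join(scala_version.split(".")[:2])  # "2.13.12" -> "2.13"
    if scala_binary not in rule["scala"]:
        problems.append(f"Scala {scala_binary} is not listed for Spark {major}.x")
    if java_major < rule["min_java"]:
        problems.append(f"Java {java_major} is below the minimum {rule['min_java']}")
    for artifact in artifact_ids:
        # Scala-versioned artifacts end in _<binary version>, e.g. spark-core_2.13.
        _, sep, suffix = artifact.rpartition("_")
        if sep and suffix != scala_binary:
            problems.append(f"{artifact} does not match Scala {scala_binary}")
    return problems
```

For example, `check_build_versions("4.0.0", "2.12.18", 11)` reports both the Scala and the Java mismatch, while a Spark 3.5.1 build on Scala 2.13.12 and Java 17 with `_2.13` artifacts returns an empty list.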
|
|
||
| --- | ||
|
|
||
| ## Phase 3: API Migration | ||
|
|
||
| Replace removed and deprecated APIs. Work through compiler errors systematically. | ||
|
|
||
| ### Common Patterns | ||
|
|
||
| Consult the official Apache Spark migration guide for the complete list of changes for each version: | ||
| https://spark.apache.org/docs/latest/migration-guide.html | ||
|
|
||
| #### SparkSession Creation (2.x → 3.x) | ||
|
|
||
| ```scala | ||
| // BEFORE (Spark 1.x/2.x) | ||
| val sc = new SparkContext(conf) | ||
| val sqlContext = new SQLContext(sc) | ||
|
|
||
| // AFTER (Spark 2.x+/3.x) | ||
| val spark = SparkSession.builder() | ||
| .config(conf) | ||
| .enableHiveSupport() // if needed | ||
| .getOrCreate() | ||
| val sc = spark.sparkContext | ||
| ``` | ||
|
|
||
| #### RDD to DataFrame (2.x → 3.x) | ||
|
|
||
| ```scala | ||
| // BEFORE | ||
| import sqlContext.implicits._ | ||
| rdd.toDF() // implicit from SQLContext | ||
|
|
||
| // AFTER | ||
| import spark.implicits._ | ||
| rdd.toDF() // implicit from SparkSession | ||
| ``` | ||
|
|
||
| #### Accumulator API (2.x → 3.x) | ||
|
|
||
| ```scala | ||
| // BEFORE | ||
| val acc = sc.accumulator(0) | ||
|
|
||
| // AFTER | ||
| val acc = sc.longAccumulator("name") | ||
| ``` | ||
|
|
||
| ### Checklist | ||
|
|
||
| - [ ] Replace `SQLContext` / `HiveContext` with `SparkSession` | ||
| - [ ] Replace deprecated `Accumulator` with `AccumulatorV2` | ||
| - [ ] Update `DataFrame` → `Dataset[Row]` where needed | ||
| - [ ] Replace removed `RDD.mapPartitionsWithContext` with `mapPartitions` | ||
| - [ ] Fix `SparkConf` deprecated setters | ||
| - [ ] Update custom `UserDefinedFunction` registration | ||
| - [ ] Migrate `Experimental` / `DeveloperApi` usages that were removed | ||
| - [ ] Verify all compilation errors from Phase 2 are resolved | ||
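Compiler errors catch removed APIs in Scala and Java, but a text scan can build the worklist earlier and also covers files that are not compiled. The sketch below flags a few of the patterns named in this phase; the rule table `API_RULES` and the function name `find_legacy_api_calls` are illustrative assumptions, and the list is far from exhaustive.

```python
import re

# (pattern, suggestion) pairs drawn from the checklist above; not exhaustive.
API_RULES = [
    (re.compile(r"\bnew\s+(?:SQLContext|HiveContext)\b"),
     "build a SparkSession with SparkSession.builder()"),
    (re.compile(r"\b(?:sc|sparkContext)\.accumulator\("),
     "use longAccumulator / doubleAccumulator (AccumulatorV2)"),
    (re.compile(r"\.mapPartitionsWithContext\("),
     "use mapPartitions"),
]


def find_legacy_api_calls(source):
    """Return (line number, line text, suggestion) for each matching line."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for pattern, suggestion in API_RULES:
            if pattern.search(line):
                hits.append((lineno, line.strip(), suggestion))
    return hits
```

Regex matching is deliberately shallow: it cannot see aliased imports or calls split across lines, so it complements the compile step in Phase 2 and does not replace it.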
|
|
||
| --- | ||
|
|
||
| ## Phase 4: Configuration Migration | ||
|
|
||
| Spark renames and removes configuration properties between versions. The official migration guide documents every renamed and removed property per release: | ||
| https://spark.apache.org/docs/latest/migration-guide.html | ||
|
|
||
| ### Checklist | ||
|
|
||
| - [ ] Rename deprecated config keys (e.g., `spark.shuffle.file.buffer.kb` → `spark.shuffle.file.buffer`) | ||
| - [ ] Update removed configs to their replacements | ||
| - [ ] Review `spark-defaults.conf`, application code, and submit scripts | ||
| - [ ] Check for hardcoded config values in test fixtures | ||
| - [ ] Verify `SparkSession.builder().config(...)` calls use current property names | ||
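Key renames in `spark-defaults.conf` style files are mechanical enough to script. The sketch below is an assumption-laden illustration: only the single rename in `RENAMED_KEYS` comes from the checklist above, the function name `migrate_conf_text` is hypothetical, and values are left untouched, so any change in units or accepted values still needs manual review.

```python
import re

# Only this entry comes from the checklist; extend it from the migration guide.
RENAMED_KEYS = {
    "spark.shuffle.file.buffer.kb": "spark.shuffle.file.buffer",
}

# indent, key (up to whitespace or "="), remainder of the line.
# Comment lines start with "#" and therefore never match.
_LINE_RE = re.compile(r"^(\s*)([^\s=#]+)(.*)$")


def migrate_conf_text(text, renamed=None):
    """Rename deprecated keys; return (new_text, [(old_key, new_key), ...])."""
    renamed = RENAMED_KEYS if renamed is None else renamed
    new_lines, changes = [], []
    for line in text.splitlines():
        match = _LINE_RE.match(line)
        if match and match.group(2) in renamed:
            indent, key, rest = match.groups()
            changes.append((key, renamed[key]))
            line = f"{indent}{renamed[key]}{rest}"
        new_lines.append(line)
    return "\n".join(new_lines), changes
```

Returning the list of changes alongside the rewritten text makes it straightforward to record each rename in the impact document.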
|
|
||
| --- | ||
|
|
||
| ## Phase 5: SQL & DataFrame Migration | ||
|
|
||
| Spark SQL behavior changes between versions can silently alter query results. | ||
|
|
||
| ### Key Breaking Changes (2.x → 3.x) | ||
|
|
||
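| - `CAST` overflow still wraps silently unless `spark.sql.ansi.enabled=true`, but table inserts now follow ANSI store-assignment rules by default (`spark.sql.storeAssignmentPolicy`) and fail on out-of-range values | ||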
| - `SELECT` without a `FROM` clause (for example `SELECT 1`) remains valid in Spark 3.x, so such queries need no rewrite | ||
| - Column resolution order changed in subqueries | ||
| - `spark.sql.legacy.timeParserPolicy` controls date/time parsing behavior | ||
|
|
||
| ### Key Breaking Changes (3.x → 4.x) | ||
|
|
||
| - ANSI mode is default (`spark.sql.ansi.enabled=true`) | ||
| - Stricter type coercion in comparisons | ||
| - Some `spark.sql.legacy.*` flags may no longer be available; check the migration guide before relying on one | ||
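To make the effect of the ANSI default concrete, the following plain-Python model contrasts the two behaviours for an integral narrowing cast such as `BIGINT` to `INT`. It is a model for reasoning about query results, not Spark code: the function names are hypothetical, and it covers only integer overflow, not string parsing or other cast paths.

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1


def legacy_cast_to_int(value):
    """Non-ANSI model: keep the low-order 32 bits (two's complement wraparound)."""
    wrapped = value & 0xFFFFFFFF
    return wrapped - 2**32 if wrapped > INT_MAX else wrapped


def ansi_cast_to_int(value):
    """ANSI model: an out-of-range value is an error instead of a wrong number."""
    if not INT_MIN <= value <= INT_MAX:
        raise OverflowError(f"{value} does not fit in a 32-bit integer")
    return value
```

Under the first model a value just above the 32-bit maximum silently becomes a large negative number; under the second the query fails, which is why results should be compared against a pre-upgrade baseline and not just checked for successful completion.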
|
|
||
| ### Checklist | ||
|
|
||
| - [ ] Audit SQL strings and DataFrame expressions for changed behavior | ||
| - [ ] Add explicit `CAST` where implicit coercion relied on legacy behavior | ||
| - [ ] Update date/time format patterns to match new parser | ||
| - [ ] Test SQL queries with representative data and compare output to pre-upgrade baseline | ||
| - [ ] Set `spark.sql.legacy.*` flags temporarily if needed for phased migration | ||
|
|
||
| --- | ||
|
|
||
| ## Phase 6: Test Validation | ||
|
|
||
| ### Checklist | ||
|
|
||
| - [ ] All code compiles without errors | ||
| - [ ] All existing unit tests pass | ||
| - [ ] All existing integration tests pass | ||
| - [ ] Run Spark jobs locally with sample data and compare output to pre-upgrade baseline | ||
| - [ ] No deprecation warnings remain, or any that remain are documented with a migration timeline | ||
| - [ ] Update CI/CD pipeline to use new Spark version | ||
| - [ ] Document any `spark.sql.legacy.*` flags that are set temporarily | ||
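Comparing job output to the pre-upgrade baseline can be sketched as an order-insensitive row diff. The example assumes each run's output has been collected into a single CSV file; the function names `load_rows` and `diff_outputs` are hypothetical, and the comparison is on exact string values, so formatting changes in numbers or timestamps count as differences.

```python
import csv
from collections import Counter


def load_rows(path):
    """Read a CSV file into a multiset of rows so that row order is ignored."""
    with open(path, newline="", encoding="utf-8") as handle:
        return Counter(tuple(row) for row in csv.reader(handle))


def diff_outputs(baseline_path, candidate_path):
    """Return rows present in only one of the two files, with duplicates counted."""
    baseline = load_rows(baseline_path)
    candidate = load_rows(candidate_path)
    return {
        "missing": sorted((baseline - candidate).elements()),
        "unexpected": sorted((candidate - baseline).elements()),
    }
```

Using a `Counter` instead of a set keeps duplicate rows significant, which matters when an upgrade changes join or deduplication behaviour.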
|
|
||
| ## Done When | ||
|
|
||
| ✓ Project compiles against target Spark version | ||
| ✓ All tests pass | ||
| ✓ No removed APIs remain in code | ||
| ✓ Configuration properties are current | ||
| ✓ SQL queries produce correct results | ||
| ✓ Upgrade impact documented in `spark_upgrade_impact.md` | ||
csmith49 marked this conversation as resolved.
🟠 Important: Per AGENTS.md CI guidance, "every directory under `skills/` should be listed in `marketplaces/default.json`." This PR only adds the skill to `large-codebase.json`. Verify whether it also needs to be in `default.json` to prevent CI failures, or if specialized marketplace registration is sufficient.
The guidance is outdated -- skills must be in some marketplace. I've updated the AGENTS.md file accordingly.