From 8b9831f590949b1e85f8fe5a4b3f57edc4407df2 Mon Sep 17 00:00:00 2001 From: Calvin Smithu Date: Tue, 17 Mar 2026 12:20:38 -0600 Subject: [PATCH 1/4] Add spark-version-upgrade skill with references and marketplace registration MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - SKILL.md: 6-phase upgrade workflow (inventory, build, API, config, SQL, test) covering Spark 2.x→3.x and 3.x→4.x with actionable checklists - README.md: human-readable overview with trigger keywords and examples - references/api-changes.md: removed/deprecated API catalog with before/after code - references/config-changes.md: config property rename/removal mapping - Registered in marketplaces/large-codebase.json Co-authored-by: openhands --- marketplaces/large-codebase.json | 14 ++ skills/spark-version-upgrade/README.md | 55 +++++ skills/spark-version-upgrade/SKILL.md | 231 ++++++++++++++++++ .../references/api-changes.md | 152 ++++++++++++ .../references/config-changes.md | 78 ++++++ 5 files changed, 530 insertions(+) create mode 100644 skills/spark-version-upgrade/README.md create mode 100644 skills/spark-version-upgrade/SKILL.md create mode 100644 skills/spark-version-upgrade/references/api-changes.md create mode 100644 skills/spark-version-upgrade/references/config-changes.md diff --git a/marketplaces/large-codebase.json b/marketplaces/large-codebase.json index 8083923..3e07222 100644 --- a/marketplaces/large-codebase.json +++ b/marketplaces/large-codebase.json @@ -47,6 +47,20 @@ "evaluation", "reporting" ] + }, + { + "name": "spark-version-upgrade", + "source": "./spark-version-upgrade", + "description": "Upgrade Apache Spark applications between major versions (2.x→3.x, 3.x→4.x). 
Covers build files, deprecated APIs, configuration changes, SQL/DataFrame updates, and test validation.", + "category": "development", + "keywords": [ + "spark", + "upgrade", + "migration", + "pyspark", + "scala", + "big-data" + ] } ] } diff --git a/skills/spark-version-upgrade/README.md b/skills/spark-version-upgrade/README.md new file mode 100644 index 0000000..1e0d6bb --- /dev/null +++ b/skills/spark-version-upgrade/README.md @@ -0,0 +1,55 @@ +# spark-version-upgrade + +Upgrade Apache Spark applications between major versions (2.x→3.x, 3.x→4.x). Covers build files, deprecated APIs, configuration changes, SQL/DataFrame updates, and test validation. + +## Triggers + +This skill is activated by the following keywords: + +- `spark upgrade` +- `spark migration` +- `spark version` +- `upgrade spark` +- `spark 3` +- `spark 4` +- `pyspark upgrade` + +## Overview + +This skill provides a structured, six-phase workflow for upgrading Apache Spark applications: + +| Phase | Description | +|-------|-------------| +| 1. Inventory & Impact Analysis | Scan the codebase, identify Spark usage, document scope | +| 2. Build File Updates | Bump Spark/Scala/Java versions in Maven, SBT, Gradle, or pip | +| 3. API Migration | Replace removed/deprecated APIs (SQLContext, Accumulator, etc.) | +| 4. Configuration Migration | Rename/remove deprecated Spark config properties | +| 5. SQL & DataFrame Migration | Fix breaking SQL behavior (ANSI mode, type coercion, date parsing) | +| 6. 
Test Validation | Compile, test, compare output to pre-upgrade baseline | + +## Supported Upgrade Paths + +- **Spark 2.x → 3.x** — Major API removals (SQLContext, HiveContext, Accumulator v1), Scala 2.12/2.13 +- **Spark 3.x → 4.x** — ANSI mode default, Java 17+ requirement, Scala 2.13 only, legacy flag removal + +## Languages & Build Systems + +- **Languages**: Scala, Java, Python (PySpark) +- **Build systems**: Maven, SBT, Gradle, pip/uv + +## Reference Material + +- [references/api-changes.md](references/api-changes.md) — Catalog of removed/deprecated APIs with before/after code +- [references/config-changes.md](references/config-changes.md) — Spark configuration property rename/removal mapping + +## Example Usage + +Ask the agent: + +> "Upgrade this project from Spark 2.4 to Spark 3.5" + +> "Migrate our PySpark codebase to Spark 4.0" + +> "Fix all Spark deprecation warnings in this repo" + +The agent will follow the six-phase workflow, producing a `spark_upgrade_impact.md` document and systematically updating build files, code, configuration, and SQL queries. diff --git a/skills/spark-version-upgrade/SKILL.md b/skills/spark-version-upgrade/SKILL.md new file mode 100644 index 0000000..fe1ca87 --- /dev/null +++ b/skills/spark-version-upgrade/SKILL.md @@ -0,0 +1,231 @@ +--- +name: spark-version-upgrade +description: Upgrade Apache Spark applications between major versions (2.x→3.x, 3.x→4.x). Covers build files, deprecated APIs, configuration changes, SQL/DataFrame updates, and test validation. +license: MIT +compatibility: Requires Java 8+/11+/17+, Scala 2.12/2.13, Maven/Gradle/SBT, Apache Spark +triggers: + - spark upgrade + - spark migration + - spark version + - upgrade spark + - spark 3 + - spark 4 + - pyspark upgrade +--- + +Upgrade Apache Spark applications between major versions with a structured, phase-by-phase workflow. 
+ +## When to Use + +- Migrating from Spark 2.x → 3.x or Spark 3.x → 4.x +- Updating PySpark, Spark SQL, or Structured Streaming applications +- Resolving deprecation warnings before a Spark version bump + +## Workflow Overview + +1. **Inventory & Impact Analysis** — Scan the codebase and assess scope +2. **Build File Updates** — Bump Spark/Scala/Java dependencies +3. **API Migration** — Replace deprecated and removed APIs +4. **Configuration Migration** — Update Spark config properties +5. **SQL & DataFrame Migration** — Fix query-level breaking changes +6. **Test Validation** — Compile, run tests, verify results + +--- + +## Phase 1: Inventory & Impact Analysis + +Before changing any code, assess what needs to change. + +### Checklist + +- [ ] Identify current Spark version (check `pom.xml`, `build.sbt`, `build.gradle`, or `requirements.txt`) +- [ ] Identify target Spark version +- [ ] Search for deprecated APIs: `grep -rn 'import org.apache.spark' --include='*.scala' --include='*.java' --include='*.py'` +- [ ] List all Spark config properties: `grep -rn 'spark\.' --include='*.conf' --include='*.properties' --include='*.scala' --include='*.java' --include='*.py' | grep -v 'test'` +- [ ] Check for custom `SparkSession` or `SparkContext` extensions +- [ ] Identify connector dependencies (Hive, Kafka, Cassandra, Delta, Iceberg) +- [ ] Document findings in `spark_upgrade_impact.md` + +### Output + +``` +spark_upgrade_impact.md # Summary of affected files, APIs, and configs +``` + +--- + +## Phase 2: Build File Updates + +Update dependency versions and resolve compilation. 
+ +### Maven (`pom.xml`) + +```xml +<!-- Spark and Scala version properties (property names are illustrative) --> +<spark.version>3.5.1</spark.version> +<scala.version>2.13.12</scala.version> + +<!-- Artifact IDs carry the Scala binary version suffix --> +<artifactId>spark-core_2.13</artifactId> +<artifactId>spark-sql_2.13</artifactId> +``` + +### SBT (`build.sbt`) + +```scala +val sparkVersion = "3.5.1" // or "4.0.0" +scalaVersion := "2.13.12" + +libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion +libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion +``` + +### Gradle (`build.gradle`) + +```groovy +ext { + sparkVersion = '3.5.1' // or '4.0.0' +} +dependencies { + implementation "org.apache.spark:spark-core_2.13:${sparkVersion}" + implementation "org.apache.spark:spark-sql_2.13:${sparkVersion}" +} +``` + +### PySpark (`requirements.txt` / `pyproject.toml`) + +``` +pyspark==3.5.1 # or 4.0.0 +``` + +### Checklist + +- [ ] Update Spark version in build file +- [ ] Update Scala version if crossing 2.12→2.13 boundary +- [ ] Update Java source/target level if required (Spark 4.x requires Java 17+) +- [ ] Update connector library versions to match new Spark version +- [ ] Resolve dependency conflicts (`mvn dependency:tree` / `sbt dependencyTree`) +- [ ] Attempt a compile and record the errors (failures at this stage are expected and guide Phase 3) + +--- + +## Phase 3: API Migration + +Replace removed and deprecated APIs. Work through compiler errors systematically. + +### Common Patterns + +See [references/api-changes.md](references/api-changes.md) for a full catalog. 
+ +#### SparkSession Creation (2.x → 3.x) + +```scala +// BEFORE (Spark 1.x/2.x) +val sc = new SparkContext(conf) +val sqlContext = new SQLContext(sc) + +// AFTER (Spark 2.x+/3.x) +val spark = SparkSession.builder() + .config(conf) + .enableHiveSupport() // if needed + .getOrCreate() +val sc = spark.sparkContext +``` + +#### RDD to DataFrame (2.x → 3.x) + +```scala +// BEFORE +rdd.toDF() // implicit from SQLContext + +// AFTER +import spark.implicits._ +rdd.toDF() // implicit from SparkSession +``` + +#### Accumulator API (2.x → 3.x) + +```scala +// BEFORE +val acc = sc.accumulator(0) + +// AFTER +val acc = sc.longAccumulator("name") +``` + +### Checklist + +- [ ] Replace `SQLContext` / `HiveContext` with `SparkSession` +- [ ] Replace deprecated `Accumulator` with `AccumulatorV2` +- [ ] Update `DataFrame` → `Dataset[Row]` where needed +- [ ] Replace removed `RDD.mapPartitionsWithContext` with `mapPartitions` +- [ ] Fix `SparkConf` deprecated setters +- [ ] Update custom `UserDefinedFunction` registration +- [ ] Migrate `Experimental` / `DeveloperApi` usages that were removed +- [ ] Verify all compilation errors from Phase 2 are resolved + +--- + +## Phase 4: Configuration Migration + +Spark renames and removes configuration properties between versions. + +See [references/config-changes.md](references/config-changes.md) for the full mapping. + +### Checklist + +- [ ] Rename deprecated config keys (e.g., `spark.shuffle.file.buffer.kb` → `spark.shuffle.file.buffer`) +- [ ] Update removed configs to their replacements +- [ ] Review `spark-defaults.conf`, application code, and submit scripts +- [ ] Check for hardcoded config values in test fixtures +- [ ] Verify `SparkSession.builder().config(...)` calls use current property names + +--- + +## Phase 5: SQL & DataFrame Migration + +Spark SQL behavior changes between versions can silently alter query results. 
+ +### Key Breaking Changes (2.x → 3.x) + +- With `spark.sql.ansi.enabled=true` (off by default in 3.x), an out-of-range `CAST` to integer raises an error instead of silently overflowing +- Inserting a value into a column of a different type follows ANSI store assignment by default (`spark.sql.storeAssignmentPolicy=ANSI`), so unsafe conversions such as string to int fail +- Column resolution order changed in subqueries +- Date/time parsing uses a stricter `java.time`-based parser; `spark.sql.legacy.timeParserPolicy=LEGACY` restores the old behavior + +### Key Breaking Changes (3.x → 4.x) + +- ANSI mode is default (`spark.sql.ansi.enabled=true`) +- Stricter type coercion in comparisons +- Some `spark.sql.legacy.*` flags may no longer be available; check each one against the migration guide + +### Checklist + +- [ ] Audit SQL strings and DataFrame expressions for changed behavior +- [ ] Add explicit `CAST` where implicit coercion relied on legacy behavior +- [ ] Update date/time format patterns to match new parser +- [ ] Test SQL queries with representative data and compare output to pre-upgrade baseline +- [ ] Set `spark.sql.legacy.*` flags temporarily if needed for phased migration + +--- + +## Phase 6: Test Validation + +### Checklist + +- [ ] All code compiles without errors +- [ ] All existing unit tests pass +- [ ] All existing integration tests pass +- [ ] Run Spark jobs locally with sample data and compare output to pre-upgrade baseline +- [ ] No deprecation warnings remain (or are documented with a migration timeline) +- [ ] Update CI/CD pipeline to use new Spark version +- [ ] Document any `spark.sql.legacy.*` flags that are set temporarily + +## Done When + +✓ Project compiles against target Spark version +✓ All tests pass +✓ No removed APIs remain in code +✓ Configuration properties are current +✓ SQL queries produce correct results +✓ Upgrade impact documented in `spark_upgrade_impact.md` \ No newline at end of file diff --git a/skills/spark-version-upgrade/references/api-changes.md b/skills/spark-version-upgrade/references/api-changes.md new file mode 100644 index 0000000..c4cfbe1 --- /dev/null +++ b/skills/spark-version-upgrade/references/api-changes.md @@ -0,0 +1,152 @@ +# Spark API Changes Reference + +## Spark 2.x → 3.x Removals + +### Core API + 
+| Removed | Replacement | Notes | +|---------|-------------|-------| +| `SparkContext.accumulator()` | `SparkContext.longAccumulator()` / `doubleAccumulator()` | Use `AccumulatorV2` for custom types | +| `SparkContext.tachyonStore` | Removed entirely | Tachyon/Alluxio off-heap store dropped | +| `RDD.mapPartitionsWithContext` | `RDD.mapPartitions` | Task context available via `TaskContext.get()` | +| `RDD.toJavaRDD()` (implicit) | Explicit `JavaRDD.fromRDD(rdd)` | Implicit conversions tightened | + +### SQL API + +| Removed | Replacement | Notes | +|---------|-------------|-------| +| `SQLContext` | `SparkSession` | Use `spark.sql(...)` instead of `sqlContext.sql(...)` | +| `HiveContext` | `SparkSession.builder().enableHiveSupport()` | | +| `DataFrame` (type alias) | `Dataset[Row]` | `DataFrame` still works as alias but prefer `Dataset[Row]` | +| `createExternalTable` | `createTable` | Method renamed | +| `registerTempTable` | `createOrReplaceTempView` | | +| `SQLContext.read` | `SparkSession.read` | | +| `SQLContext.createDataFrame` | `SparkSession.createDataFrame` | | + +### Streaming + +| Removed | Replacement | Notes | +|---------|-------------|-------| +| `DStream` API (spark-streaming) | Structured Streaming (`spark-sql`) | DStream still works but is maintenance-only | +| `StreamingContext.awaitTermination` | `SparkSession.streams.awaitAnyTermination` | For Structured Streaming | +| `StreamingContext.remember` | Watermark-based state management | | + +### ML / MLlib + +| Removed | Replacement | Notes | +|---------|-------------|-------| +| `org.apache.spark.mllib` (RDD-based) | `org.apache.spark.ml` (DataFrame-based) | RDD-based MLlib is deprecated | +| `LabeledPoint` from mllib | `ml.feature` transformers | Use DataFrame pipelines | +| `mllib.classification.SVMWithSGD` | `ml.classification.LinearSVC` | | +| `mllib.clustering.KMeans` | `ml.clustering.KMeans` | Same algorithm, DataFrame API | + +--- + +## Spark 3.x → 4.x Removals + +### Core API + +| Removed 
| Replacement | Notes | +|---------|-------------|-------| +| `SparkContext.hadoopConfiguration` (mutable) | `SparkSession.sessionState.newHadoopConf()` | Per-session Hadoop config | +| `JavaSparkContext.sc()` | `JavaSparkContext.toSparkContext()` | Method renamed | +| Scala 2.12 support | Scala 2.13 only | All `_2.12` artifacts dropped | +| Java 8/11 support | Java 17+ required | | + +### SQL API + +| Removed | Replacement | Notes | +|---------|-------------|-------| +| `spark.sql.legacy.*` flags | No replacement — ANSI behavior is permanent | Audit all legacy flags | +| Non-ANSI `CAST` behavior | Explicit error handling or `TRY_CAST` | Overflows now throw errors | +| `spark.sql.legacy.timeParserPolicy` | New parser is default | Joda → java.time | +| Implicit type coercion in comparisons | Explicit `CAST` required | `string = int` no longer auto-casts | + +### PySpark + +| Removed | Replacement | Notes | +|---------|-------------|-------| +| `pyspark.sql.types.ArrayType.containsNull` default change | Explicitly set `containsNull=True` | Default changed | +| `DataFrame.toJSON()` returns `Dataset[String]` | `.collect()` to materialize | Behavior aligned with Scala | +| Python 3.8 support | Python 3.9+ required | | + +--- + +## Common Migration Patterns + +### Pattern: SQLContext → SparkSession + +```scala +// BEFORE +val conf = new SparkConf().setAppName("MyApp") +val sc = new SparkContext(conf) +val sqlContext = new SQLContext(sc) +val df = sqlContext.read.json("data.json") + +// AFTER +val spark = SparkSession.builder() + .appName("MyApp") + .getOrCreate() +val df = spark.read.json("data.json") +``` + +### Pattern: Accumulator v1 → v2 + +```scala +// BEFORE +val counter = sc.accumulator(0, "my-counter") +rdd.foreach(x => counter += 1) + +// AFTER +val counter = sc.longAccumulator("my-counter") +rdd.foreach(x => counter.add(1)) +``` + +### Pattern: registerTempTable → createOrReplaceTempView + +```scala +// BEFORE +df.registerTempTable("my_table") + +// AFTER 
+df.createOrReplaceTempView("my_table") +``` + +### Pattern: PySpark UDF Registration + +```python +# BEFORE (Spark 2.x) +from pyspark.sql.functions import udf +from pyspark.sql.types import StringType +my_udf = udf(lambda x: x.upper(), StringType()) + +# AFTER (Spark 3.x+ preferred) +from pyspark.sql.functions import udf +from pyspark.sql.types import StringType + +@udf(returnType=StringType()) +def my_udf(x): + return x.upper() +``` + +### Pattern: ANSI Mode Error Handling (3.x → 4.x) + +```sql +-- BEFORE (non-ANSI, returns NULL on overflow) +SELECT CAST('999999999999' AS INT) + +-- AFTER (ANSI mode, throws error — use TRY_CAST for NULL behavior) +SELECT TRY_CAST('999999999999' AS INT) +``` + +### Pattern: Date/Time Parsing (2.x → 3.x) + +```scala +// BEFORE (lenient Joda-based parsing) +spark.sql("SELECT to_date('2023-1-5', 'yyyy-MM-dd')") + +// AFTER (strict java.time parsing — single-digit month/day needs adjusted pattern) +spark.sql("SELECT to_date('2023-1-5', 'yyyy-M-d')") +// Or set legacy policy temporarily: +// spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY") +``` diff --git a/skills/spark-version-upgrade/references/config-changes.md b/skills/spark-version-upgrade/references/config-changes.md new file mode 100644 index 0000000..27a1f00 --- /dev/null +++ b/skills/spark-version-upgrade/references/config-changes.md @@ -0,0 +1,78 @@ +# Spark Configuration Changes Reference + +## Spark 2.x → 3.x: Renamed Properties + +| Old Property | New Property | +|-------------|-------------| +| `spark.shuffle.file.buffer.kb` | `spark.shuffle.file.buffer` | +| `spark.shuffle.consolidateFiles` | Removed (always consolidated) | +| `spark.reducer.maxMbInFlight` | `spark.reducer.maxSizeInFlight` | +| `spark.shuffle.memoryFraction` | Removed (unified memory management) | +| `spark.storage.memoryFraction` | Removed (unified memory management) | +| `spark.storage.unrollFraction` | Removed (unified memory management) | +| `spark.yarn.am.port` | `spark.driver.port` | 
+| `spark.tachyonStore.baseDir` | Removed | +| `spark.tachyonStore.url` | Removed | +| `spark.sql.tungsten.enabled` | Removed (always enabled) | +| `spark.sql.codegen.wholeStage` | `spark.sql.codegen.wholeStage` (default changed to `true`) | +| `spark.sql.parquet.int96AsTimestamp` | `spark.sql.parquet.int96AsTimestamp` (default changed to `true`) | +| `spark.sql.hive.convertMetastoreParquet` | Still valid but default behavior changed | +| `spark.akka.*` | Removed (Akka replaced with Netty RPC) | + +## Spark 2.x → 3.x: New Important Defaults + +| Property | Old Default | New Default | Impact | +|----------|------------|-------------|--------| +| `spark.sql.adaptive.enabled` | `false` | `true` (3.2+) | AQE auto-optimizes shuffles and joins | +| `spark.sql.ansi.enabled` | `false` | `false` (3.x) | Opt-in for strict SQL behavior | +| `spark.sql.sources.partitionOverwriteMode` | `static` | `static` | Consider `dynamic` for INSERT OVERWRITE | +| `spark.sql.legacy.timeParserPolicy` | N/A | `EXCEPTION` | Strict date/time parsing | +| `spark.sql.legacy.createHiveTableByDefault` | `true` | `false` | Tables default to data source format | + +## Spark 3.x → 4.x: Removed Properties + +| Removed Property | Migration Action | +|-----------------|-----------------| +| `spark.sql.legacy.timeParserPolicy` | Remove — new parser is permanent | +| `spark.sql.legacy.allowNegativeScaleOfDecimal` | Remove — negative scale not allowed | +| `spark.sql.legacy.createHiveTableByDefault` | Remove — data source tables default | +| `spark.sql.legacy.replaceDatabricksSparkAvro.enabled` | Remove — native Avro is standard | +| `spark.sql.legacy.setopsPrecedence.enabled` | Remove — SQL standard precedence permanent | +| `spark.sql.legacy.exponentLiteralAsDecimal.enabled` | Remove — standard behavior permanent | +| `spark.sql.legacy.allowHashOnMapType` | Remove | + +## Spark 3.x → 4.x: Changed Defaults + +| Property | Old Default (3.x) | New Default (4.x) | Impact | 
+|----------|-------------------|-------------------|--------| +| `spark.sql.ansi.enabled` | `false` | `true` | Overflow/cast errors throw instead of returning NULL | +| `spark.sql.storeAssignmentPolicy` | `ANSI` | `STRICT` | Stricter type checking on INSERT | +| `spark.sql.adaptive.coalescePartitions.enabled` | `true` | `true` | No change, but AQE behavior refined | +| `spark.sql.sources.default` | `parquet` | `parquet` | No change | + +## How to Find Config Usage in Your Codebase + +```bash +# Find all Spark config references +grep -rn 'spark\.' --include='*.scala' --include='*.java' --include='*.py' \ + --include='*.conf' --include='*.properties' --include='*.xml' --include='*.yaml' + +# Find legacy flags specifically +grep -rn 'spark.sql.legacy' --include='*.scala' --include='*.java' --include='*.py' \ + --include='*.conf' --include='*.properties' + +# Find spark-defaults.conf +find . -name 'spark-defaults.conf' -o -name 'spark-env.sh' + +# Find spark-submit scripts with --conf flags +grep -rn '\-\-conf spark\.' --include='*.sh' --include='*.bash' --include='Makefile' +``` + +## Migration Strategy for Legacy Flags + +When upgrading to Spark 4.x, `spark.sql.legacy.*` flags are removed. To migrate safely: + +1. **Audit**: List all `spark.sql.legacy.*` flags in your codebase +2. **Test without them**: Remove each flag on Spark 3.x and run tests to surface failures +3. **Fix code**: Update SQL/DataFrame code to work under non-legacy behavior +4. 
**Then upgrade**: Bump to Spark 4.x after all legacy flags are eliminated From d70d538dac0bd095dcc8ebca54a459288e4a85f9 Mon Sep 17 00:00:00 2001 From: Calvin Smithu Date: Tue, 17 Mar 2026 12:22:46 -0600 Subject: [PATCH 2/4] Remove local references, point to official Apache Spark migration guide Replace references/api-changes.md and references/config-changes.md with the canonical upstream URL: https://spark.apache.org/docs/latest/migration-guide.html Co-authored-by: openhands --- skills/spark-version-upgrade/README.md | 3 +- skills/spark-version-upgrade/SKILL.md | 8 +- .../references/api-changes.md | 152 ------------------ .../references/config-changes.md | 78 --------- 4 files changed, 5 insertions(+), 236 deletions(-) delete mode 100644 skills/spark-version-upgrade/references/api-changes.md delete mode 100644 skills/spark-version-upgrade/references/config-changes.md diff --git a/skills/spark-version-upgrade/README.md b/skills/spark-version-upgrade/README.md index 1e0d6bb..282bd15 100644 --- a/skills/spark-version-upgrade/README.md +++ b/skills/spark-version-upgrade/README.md @@ -39,8 +39,7 @@ This skill provides a structured, six-phase workflow for upgrading Apache Spark ## Reference Material -- [references/api-changes.md](references/api-changes.md) — Catalog of removed/deprecated APIs with before/after code -- [references/config-changes.md](references/config-changes.md) — Spark configuration property rename/removal mapping +- [Apache Spark Migration Guide](https://spark.apache.org/docs/latest/migration-guide.html) — The official, up-to-date guide covering API removals, configuration changes, SQL behavior, PySpark, Structured Streaming, and MLlib for every Spark release ## Example Usage diff --git a/skills/spark-version-upgrade/SKILL.md b/skills/spark-version-upgrade/SKILL.md index fe1ca87..979902b 100644 --- a/skills/spark-version-upgrade/SKILL.md +++ b/skills/spark-version-upgrade/SKILL.md @@ -115,7 +115,8 @@ Replace removed and deprecated APIs. 
Work through compiler errors systematically ### Common Patterns -See [references/api-changes.md](references/api-changes.md) for a full catalog. +Consult the official Apache Spark migration guide for the complete list of changes for each version: +https://spark.apache.org/docs/latest/migration-guide.html #### SparkSession Creation (2.x → 3.x) @@ -168,9 +169,8 @@ val acc = sc.longAccumulator("name") ## Phase 4: Configuration Migration -Spark renames and removes configuration properties between versions. - -See [references/config-changes.md](references/config-changes.md) for the full mapping. +Spark renames and removes configuration properties between versions. The official migration guide documents every renamed and removed property per release: +https://spark.apache.org/docs/latest/migration-guide.html ### Checklist diff --git a/skills/spark-version-upgrade/references/api-changes.md b/skills/spark-version-upgrade/references/api-changes.md deleted file mode 100644 index c4cfbe1..0000000 --- a/skills/spark-version-upgrade/references/api-changes.md +++ /dev/null @@ -1,152 +0,0 @@ -# Spark API Changes Reference - -## Spark 2.x → 3.x Removals - -### Core API - -| Removed | Replacement | Notes | -|---------|-------------|-------| -| `SparkContext.accumulator()` | `SparkContext.longAccumulator()` / `doubleAccumulator()` | Use `AccumulatorV2` for custom types | -| `SparkContext.tachyonStore` | Removed entirely | Tachyon/Alluxio off-heap store dropped | -| `RDD.mapPartitionsWithContext` | `RDD.mapPartitions` | Task context available via `TaskContext.get()` | -| `RDD.toJavaRDD()` (implicit) | Explicit `JavaRDD.fromRDD(rdd)` | Implicit conversions tightened | - -### SQL API - -| Removed | Replacement | Notes | -|---------|-------------|-------| -| `SQLContext` | `SparkSession` | Use `spark.sql(...)` instead of `sqlContext.sql(...)` | -| `HiveContext` | `SparkSession.builder().enableHiveSupport()` | | -| `DataFrame` (type alias) | `Dataset[Row]` | `DataFrame` still works as 
alias but prefer `Dataset[Row]` | -| `createExternalTable` | `createTable` | Method renamed | -| `registerTempTable` | `createOrReplaceTempView` | | -| `SQLContext.read` | `SparkSession.read` | | -| `SQLContext.createDataFrame` | `SparkSession.createDataFrame` | | - -### Streaming - -| Removed | Replacement | Notes | -|---------|-------------|-------| -| `DStream` API (spark-streaming) | Structured Streaming (`spark-sql`) | DStream still works but is maintenance-only | -| `StreamingContext.awaitTermination` | `SparkSession.streams.awaitAnyTermination` | For Structured Streaming | -| `StreamingContext.remember` | Watermark-based state management | | - -### ML / MLlib - -| Removed | Replacement | Notes | -|---------|-------------|-------| -| `org.apache.spark.mllib` (RDD-based) | `org.apache.spark.ml` (DataFrame-based) | RDD-based MLlib is deprecated | -| `LabeledPoint` from mllib | `ml.feature` transformers | Use DataFrame pipelines | -| `mllib.classification.SVMWithSGD` | `ml.classification.LinearSVC` | | -| `mllib.clustering.KMeans` | `ml.clustering.KMeans` | Same algorithm, DataFrame API | - ---- - -## Spark 3.x → 4.x Removals - -### Core API - -| Removed | Replacement | Notes | -|---------|-------------|-------| -| `SparkContext.hadoopConfiguration` (mutable) | `SparkSession.sessionState.newHadoopConf()` | Per-session Hadoop config | -| `JavaSparkContext.sc()` | `JavaSparkContext.toSparkContext()` | Method renamed | -| Scala 2.12 support | Scala 2.13 only | All `_2.12` artifacts dropped | -| Java 8/11 support | Java 17+ required | | - -### SQL API - -| Removed | Replacement | Notes | -|---------|-------------|-------| -| `spark.sql.legacy.*` flags | No replacement — ANSI behavior is permanent | Audit all legacy flags | -| Non-ANSI `CAST` behavior | Explicit error handling or `TRY_CAST` | Overflows now throw errors | -| `spark.sql.legacy.timeParserPolicy` | New parser is default | Joda → java.time | -| Implicit type coercion in comparisons | Explicit `CAST` 
required | `string = int` no longer auto-casts | - -### PySpark - -| Removed | Replacement | Notes | -|---------|-------------|-------| -| `pyspark.sql.types.ArrayType.containsNull` default change | Explicitly set `containsNull=True` | Default changed | -| `DataFrame.toJSON()` returns `Dataset[String]` | `.collect()` to materialize | Behavior aligned with Scala | -| Python 3.8 support | Python 3.9+ required | | - ---- - -## Common Migration Patterns - -### Pattern: SQLContext → SparkSession - -```scala -// BEFORE -val conf = new SparkConf().setAppName("MyApp") -val sc = new SparkContext(conf) -val sqlContext = new SQLContext(sc) -val df = sqlContext.read.json("data.json") - -// AFTER -val spark = SparkSession.builder() - .appName("MyApp") - .getOrCreate() -val df = spark.read.json("data.json") -``` - -### Pattern: Accumulator v1 → v2 - -```scala -// BEFORE -val counter = sc.accumulator(0, "my-counter") -rdd.foreach(x => counter += 1) - -// AFTER -val counter = sc.longAccumulator("my-counter") -rdd.foreach(x => counter.add(1)) -``` - -### Pattern: registerTempTable → createOrReplaceTempView - -```scala -// BEFORE -df.registerTempTable("my_table") - -// AFTER -df.createOrReplaceTempView("my_table") -``` - -### Pattern: PySpark UDF Registration - -```python -# BEFORE (Spark 2.x) -from pyspark.sql.functions import udf -from pyspark.sql.types import StringType -my_udf = udf(lambda x: x.upper(), StringType()) - -# AFTER (Spark 3.x+ preferred) -from pyspark.sql.functions import udf -from pyspark.sql.types import StringType - -@udf(returnType=StringType()) -def my_udf(x): - return x.upper() -``` - -### Pattern: ANSI Mode Error Handling (3.x → 4.x) - -```sql --- BEFORE (non-ANSI, returns NULL on overflow) -SELECT CAST('999999999999' AS INT) - --- AFTER (ANSI mode, throws error — use TRY_CAST for NULL behavior) -SELECT TRY_CAST('999999999999' AS INT) -``` - -### Pattern: Date/Time Parsing (2.x → 3.x) - -```scala -// BEFORE (lenient Joda-based parsing) -spark.sql("SELECT 
to_date('2023-1-5', 'yyyy-MM-dd')")
-
-// AFTER (strict java.time parsing — single-digit month/day needs adjusted pattern)
-spark.sql("SELECT to_date('2023-1-5', 'yyyy-M-d')")
-// Or set legacy policy temporarily:
-// spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
-```
diff --git a/skills/spark-version-upgrade/references/config-changes.md b/skills/spark-version-upgrade/references/config-changes.md
deleted file mode 100644
index 27a1f00..0000000
--- a/skills/spark-version-upgrade/references/config-changes.md
+++ /dev/null
@@ -1,78 +0,0 @@
-# Spark Configuration Changes Reference
-
-## Spark 2.x → 3.x: Renamed Properties
-
-| Old Property | New Property |
-|-------------|-------------|
-| `spark.shuffle.file.buffer.kb` | `spark.shuffle.file.buffer` |
-| `spark.shuffle.consolidateFiles` | Removed (always consolidated) |
-| `spark.reducer.maxMbInFlight` | `spark.reducer.maxSizeInFlight` |
-| `spark.shuffle.memoryFraction` | Removed (unified memory management) |
-| `spark.storage.memoryFraction` | Removed (unified memory management) |
-| `spark.storage.unrollFraction` | Removed (unified memory management) |
-| `spark.yarn.am.port` | `spark.driver.port` |
-| `spark.tachyonStore.baseDir` | Removed |
-| `spark.tachyonStore.url` | Removed |
-| `spark.sql.tungsten.enabled` | Removed (always enabled) |
-| `spark.sql.codegen.wholeStage` | `spark.sql.codegen.wholeStage` (default changed to `true`) |
-| `spark.sql.parquet.int96AsTimestamp` | `spark.sql.parquet.int96AsTimestamp` (default changed to `true`) |
-| `spark.sql.hive.convertMetastoreParquet` | Still valid but default behavior changed |
-| `spark.akka.*` | Removed (Akka replaced with Netty RPC) |
-
-## Spark 2.x → 3.x: New Important Defaults
-
-| Property | Old Default | New Default | Impact |
-|----------|------------|-------------|--------|
-| `spark.sql.adaptive.enabled` | `false` | `true` (3.2+) | AQE auto-optimizes shuffles and joins |
-| `spark.sql.ansi.enabled` | `false` | `false` (3.x) | Opt-in for strict SQL behavior |
-| `spark.sql.sources.partitionOverwriteMode` | `static` | `static` | Consider `dynamic` for INSERT OVERWRITE |
-| `spark.sql.legacy.timeParserPolicy` | N/A | `EXCEPTION` | Strict date/time parsing |
-| `spark.sql.legacy.createHiveTableByDefault` | `true` | `false` | Tables default to data source format |
-
-## Spark 3.x → 4.x: Removed Properties
-
-| Removed Property | Migration Action |
-|-----------------|-----------------|
-| `spark.sql.legacy.timeParserPolicy` | Remove — new parser is permanent |
-| `spark.sql.legacy.allowNegativeScaleOfDecimal` | Remove — negative scale not allowed |
-| `spark.sql.legacy.createHiveTableByDefault` | Remove — data source tables default |
-| `spark.sql.legacy.replaceDatabricksSparkAvro.enabled` | Remove — native Avro is standard |
-| `spark.sql.legacy.setopsPrecedence.enabled` | Remove — SQL standard precedence permanent |
-| `spark.sql.legacy.exponentLiteralAsDecimal.enabled` | Remove — standard behavior permanent |
-| `spark.sql.legacy.allowHashOnMapType` | Remove |
-
-## Spark 3.x → 4.x: Changed Defaults
-
-| Property | Old Default (3.x) | New Default (4.x) | Impact |
-|----------|-------------------|-------------------|--------|
-| `spark.sql.ansi.enabled` | `false` | `true` | Overflow/cast errors throw instead of returning NULL |
-| `spark.sql.storeAssignmentPolicy` | `ANSI` | `STRICT` | Stricter type checking on INSERT |
-| `spark.sql.adaptive.coalescePartitions.enabled` | `true` | `true` | No change, but AQE behavior refined |
-| `spark.sql.sources.default` | `parquet` | `parquet` | No change |
-
-## How to Find Config Usage in Your Codebase
-
-```bash
-# Find all Spark config references
-grep -rn 'spark\.' --include='*.scala' --include='*.java' --include='*.py' \
-  --include='*.conf' --include='*.properties' --include='*.xml' --include='*.yaml'
-
-# Find legacy flags specifically
-grep -rn 'spark.sql.legacy' --include='*.scala' --include='*.java' --include='*.py' \
-  --include='*.conf' --include='*.properties'
-
-# Find spark-defaults.conf
-find . -name 'spark-defaults.conf' -o -name 'spark-env.sh'
-
-# Find spark-submit scripts with --conf flags
-grep -rn '\-\-conf spark\.' --include='*.sh' --include='*.bash' --include='Makefile'
-```
-
-## Migration Strategy for Legacy Flags
-
-When upgrading to Spark 4.x, `spark.sql.legacy.*` flags are removed. To migrate safely:
-
-1. **Audit**: List all `spark.sql.legacy.*` flags in your codebase
-2. **Test without them**: Remove each flag on Spark 3.x and run tests to surface failures
-3. **Fix code**: Update SQL/DataFrame code to work under non-legacy behavior
-4. **Then upgrade**: Bump to Spark 4.x after all legacy flags are eliminated
From 8f2e4ffa9b2eb6304c0436a893b0d00094014f42 Mon Sep 17 00:00:00 2001
From: Calvin Smithu
Date: Tue, 17 Mar 2026 12:23:54 -0600
Subject: [PATCH 3/4] Move Apache migration guide reference to Phase 1
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The guide is most relevant during inventory — read it before making changes.

Co-authored-by: openhands
---
 skills/spark-version-upgrade/SKILL.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/skills/spark-version-upgrade/SKILL.md b/skills/spark-version-upgrade/SKILL.md
index 979902b..41cfd47 100644
--- a/skills/spark-version-upgrade/SKILL.md
+++ b/skills/spark-version-upgrade/SKILL.md
@@ -34,10 +34,12 @@ Upgrade Apache Spark applications between major versions with a structured, phas
 
 ## Phase 1: Inventory & Impact Analysis
 
-Before changing any code, assess what needs to change.
+Before changing any code, assess what needs to change. Read the official Apache Spark migration guide for the target version — it documents every API removal, config rename, and behavioral change per release:
+https://spark.apache.org/docs/latest/migration-guide.html
 
 ### Checklist
 
+- [ ] Read the migration guide section for the target Spark version
 - [ ] Identify current Spark version (check `pom.xml`, `build.sbt`, `build.gradle`, or `requirements.txt`)
 - [ ] Identify target Spark version
 - [ ] Search for deprecated APIs: `grep -rn 'import org.apache.spark' --include='*.scala' --include='*.java' --include='*.py'`
From 3cc55a8bff3a28ce7877e837fd2f1acf2736cfbc Mon Sep 17 00:00:00 2001
From: Calvin Smithu
Date: Tue, 17 Mar 2026 12:32:26 -0600
Subject: [PATCH 4/4] updating marketplace guidance

---
 AGENTS.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/AGENTS.md b/AGENTS.md
index e7636b9..cba9865 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -92,8 +92,7 @@ When editing or adding skills in this repo, follow these rules (and add new skil
 
 ## CI / validation gotchas
 
-- The test suite expects **every directory under `skills/`** to be listed in `marketplaces/default.json`.
-  - If you add a new skill (or rebase onto a main branch that added skills), update the marketplace file or CI will fail with `Skills missing from marketplace: [...]`.
+- The test suite expects **every directory under `skills/`** to be listed in a marketplace. If you add a new skill (or rebase onto a main branch that added skills), update the appropriate marketplace file or CI will fail with `Skills missing from marketplace: [...]`.
 
 ## PR review plugin notes