diff --git a/AGENTS.md b/AGENTS.md
index e7636b9..cba9865 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -92,8 +92,7 @@ When editing or adding skills in this repo, follow these rules (and add new skil
 
 ## CI / validation gotchas
 
-- The test suite expects **every directory under `skills/`** to be listed in `marketplaces/default.json`.
-  - If you add a new skill (or rebase onto a main branch that added skills), update the marketplace file or CI will fail with `Skills missing from marketplace: [...]`.
+- The test suite expects **every directory under `skills/`** to be listed in a marketplace. If you add a new skill (or rebase onto a main branch that added skills), update the appropriate marketplace file or CI will fail with `Skills missing from marketplace: [...]`.
 
 ## PR review plugin notes
diff --git a/marketplaces/large-codebase.json b/marketplaces/large-codebase.json
index 8083923..3e07222 100644
--- a/marketplaces/large-codebase.json
+++ b/marketplaces/large-codebase.json
@@ -47,6 +47,20 @@
         "evaluation",
         "reporting"
       ]
+    },
+    {
+      "name": "spark-version-upgrade",
+      "source": "./spark-version-upgrade",
+      "description": "Upgrade Apache Spark applications between major versions (2.x→3.x, 3.x→4.x). Covers build files, deprecated APIs, configuration changes, SQL/DataFrame updates, and test validation.",
+      "category": "development",
+      "keywords": [
+        "spark",
+        "upgrade",
+        "migration",
+        "pyspark",
+        "scala",
+        "big-data"
+      ]
     }
   ]
 }
diff --git a/skills/spark-version-upgrade/README.md b/skills/spark-version-upgrade/README.md
new file mode 100644
index 0000000..282bd15
--- /dev/null
+++ b/skills/spark-version-upgrade/README.md
@@ -0,0 +1,54 @@
+# spark-version-upgrade
+
+Upgrade Apache Spark applications between major versions (2.x→3.x, 3.x→4.x). Covers build files, deprecated APIs, configuration changes, SQL/DataFrame updates, and test validation.
+
+## Triggers
+
+This skill is activated by the following keywords:
+
+- `spark upgrade`
+- `spark migration`
+- `spark version`
+- `upgrade spark`
+- `spark 3`
+- `spark 4`
+- `pyspark upgrade`
+
+## Overview
+
+This skill provides a structured, six-phase workflow for upgrading Apache Spark applications:
+
+| Phase | Description |
+|-------|-------------|
+| 1. Inventory & Impact Analysis | Scan the codebase, identify Spark usage, document scope |
+| 2. Build File Updates | Bump Spark/Scala/Java versions in Maven, SBT, Gradle, or pip |
+| 3. API Migration | Replace removed/deprecated APIs (SQLContext, Accumulator, etc.) |
+| 4. Configuration Migration | Rename/remove deprecated Spark config properties |
+| 5. SQL & DataFrame Migration | Fix breaking SQL behavior (ANSI mode, type coercion, date parsing) |
+| 6. Test Validation | Compile, test, compare output to pre-upgrade baseline |
+
+## Supported Upgrade Paths
+
+- **Spark 2.x → 3.x** — Major API removals (SQLContext, HiveContext, Accumulator v1), Scala 2.12/2.13
+- **Spark 3.x → 4.x** — ANSI mode default, Java 17+ requirement, Scala 2.13 only, legacy flag removal
+
+## Languages & Build Systems
+
+- **Languages**: Scala, Java, Python (PySpark)
+- **Build systems**: Maven, SBT, Gradle, pip/uv
+
+## Reference Material
+
+- [Apache Spark Migration Guide](https://spark.apache.org/docs/latest/migration-guide.html) — The official, up-to-date guide covering API removals, configuration changes, SQL behavior, PySpark, Structured Streaming, and MLlib for every Spark release
+
+## Example Usage
+
+Ask the agent:
+
+> "Upgrade this project from Spark 2.4 to Spark 3.5"
+
+> "Migrate our PySpark codebase to Spark 4.0"
+
+> "Fix all Spark deprecation warnings in this repo"
+
+The agent will follow the six-phase workflow, producing a `spark_upgrade_impact.md` document and systematically updating build files, code, configuration, and SQL queries.
diff --git a/skills/spark-version-upgrade/SKILL.md b/skills/spark-version-upgrade/SKILL.md
new file mode 100644
index 0000000..41cfd47
--- /dev/null
+++ b/skills/spark-version-upgrade/SKILL.md
@@ -0,0 +1,233 @@
+---
+name: spark-version-upgrade
+description: Upgrade Apache Spark applications between major versions (2.x→3.x, 3.x→4.x). Covers build files, deprecated APIs, configuration changes, SQL/DataFrame updates, and test validation.
+license: MIT
+compatibility: Requires Java 8, 11, or 17+ (depending on Spark version), Scala 2.12/2.13, Maven/Gradle/SBT, Apache Spark
+triggers:
+  - spark upgrade
+  - spark migration
+  - spark version
+  - upgrade spark
+  - spark 3
+  - spark 4
+  - pyspark upgrade
+---
+
+Upgrade Apache Spark applications between major versions with a structured, phase-by-phase workflow.
+
+## When to Use
+
+- Migrating from Spark 2.x → 3.x or Spark 3.x → 4.x
+- Updating PySpark, Spark SQL, or Structured Streaming applications
+- Resolving deprecation warnings before a Spark version bump
+
+## Workflow Overview
+
+1. **Inventory & Impact Analysis** — Scan the codebase and assess scope
+2. **Build File Updates** — Bump Spark/Scala/Java dependencies
+3. **API Migration** — Replace deprecated and removed APIs
+4. **Configuration Migration** — Update Spark config properties
+5. **SQL & DataFrame Migration** — Fix query-level breaking changes
+6. **Test Validation** — Compile, run tests, verify results
+
+---
+
+## Phase 1: Inventory & Impact Analysis
+
+Before changing any code, assess what needs to change.
+Read the official Apache Spark migration guide for the target version — it documents every API removal, config rename, and behavioral change per release:
+https://spark.apache.org/docs/latest/migration-guide.html
+
+### Checklist
+
+- [ ] Read the migration guide section for the target Spark version
+- [ ] Identify the current Spark version (check `pom.xml`, `build.sbt`, `build.gradle`, or `requirements.txt`)
+- [ ] Identify the target Spark version
+- [ ] Search for deprecated APIs: `grep -rn 'import org.apache.spark' --include='*.scala' --include='*.java' --include='*.py'`
+- [ ] List all Spark config properties: `grep -rn 'spark\.' --include='*.conf' --include='*.properties' --include='*.scala' --include='*.java' --include='*.py' | grep -v 'test'`
+- [ ] Check for custom `SparkSession` or `SparkContext` extensions
+- [ ] Identify connector dependencies (Hive, Kafka, Cassandra, Delta, Iceberg)
+- [ ] Document findings in `spark_upgrade_impact.md`
+
+### Output
+
+```
+spark_upgrade_impact.md   # Summary of affected files, APIs, and configs
+```
+
+---
+
+## Phase 2: Build File Updates
+
+Update dependency versions so the build resolves against the target Spark release.
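Crossing the Scala 2.12 → 2.13 boundary means every `_2.12` artifact suffix must change along with the Scala version itself, and a mismatch is one of the most common build failures at this stage. A quick consistency check can be sketched in Python (an illustrative helper, not part of the skill's tooling):

```python
import re

def scala_suffix_matches(artifact_id: str, scala_version: str) -> bool:
    """True if an artifact suffix like '_2.13' agrees with a full Scala
    version such as '2.13.12' (compares binary versions only)."""
    m = re.search(r"_(\d+\.\d+)$", artifact_id)
    binary = ".".join(scala_version.split(".")[:2])
    return bool(m) and m.group(1) == binary

print(scala_suffix_matches("spark-sql_2.13", "2.13.12"))   # consistent
print(scala_suffix_matches("spark-core_2.12", "2.13.12"))  # stale suffix
```

SBT's `%%` operator appends the suffix automatically; Maven and Gradle builds spell it out, which is exactly where checks like this pay off.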
+
+### Maven (`pom.xml`)
+
+```xml
+<properties>
+  <spark.version>3.5.1</spark.version>  <!-- or 4.0.0 -->
+  <scala.version>2.13.12</scala.version>
+</properties>
+
+<dependencies>
+  <dependency>
+    <groupId>org.apache.spark</groupId>
+    <artifactId>spark-core_2.13</artifactId>
+    <version>${spark.version}</version>
+  </dependency>
+  <dependency>
+    <groupId>org.apache.spark</groupId>
+    <artifactId>spark-sql_2.13</artifactId>
+    <version>${spark.version}</version>
+  </dependency>
+</dependencies>
+```
+
+### SBT (`build.sbt`)
+
+```scala
+val sparkVersion = "3.5.1" // or "4.0.0"
+scalaVersion := "2.13.12"
+
+libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion
+libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion
+```
+
+### Gradle (`build.gradle`)
+
+```groovy
+ext {
+  sparkVersion = '3.5.1' // or '4.0.0'
+}
+dependencies {
+  implementation "org.apache.spark:spark-core_2.13:${sparkVersion}"
+  implementation "org.apache.spark:spark-sql_2.13:${sparkVersion}"
+}
+```
+
+### PySpark (`requirements.txt` / `pyproject.toml`)
+
+```
+pyspark==3.5.1  # or 4.0.0
+```
+
+### Checklist
+
+- [ ] Update the Spark version in the build file
+- [ ] Update the Scala version if crossing the 2.12 → 2.13 boundary
+- [ ] Update the Java source/target level if required (Spark 4.x requires Java 17+)
+- [ ] Update connector library versions to match the new Spark version
+- [ ] Resolve dependency conflicts (`mvn dependency:tree` / `sbt dependencyTree`)
+- [ ] Confirm the project compiles (errors at this stage are expected — they guide Phase 3)
+
+---
+
+## Phase 3: API Migration
+
+Replace removed and deprecated APIs. Work through compiler errors systematically.
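The compiler only drives this phase for Scala and Java. In PySpark, removed APIs surface as runtime errors, so a rough text scan gives a worklist up front. A minimal Python sketch (the pattern table is a small illustrative subset, not the full removal list — the migration guide is authoritative):

```python
import re

# Illustrative subset of APIs removed or replaced in Spark 3.x.
REMOVED_APIS = {
    r"\bnew\s+SQLContext\b": "use SparkSession.builder().getOrCreate()",
    r"\bnew\s+HiveContext\b": "use SparkSession.builder().enableHiveSupport()",
    r"\bsc\.accumulator\(": "use sc.longAccumulator or AccumulatorV2",
}

def scan(source: str) -> list[tuple[str, str]]:
    """Return (pattern, suggested fix) pairs for each removed API found."""
    return [(p, fix) for p, fix in REMOVED_APIS.items() if re.search(p, source)]

sample = "val sqlContext = new SQLContext(sc)\nval acc = sc.accumulator(0)"
for pattern, fix in scan(sample):
    print(pattern, "->", fix)
```

Running this over the files inventoried in Phase 1 turns the migration into a checklist of concrete call sites rather than a trial-and-error loop.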
+
+### Common Patterns
+
+Consult the official Apache Spark migration guide for the complete list of changes in each version:
+https://spark.apache.org/docs/latest/migration-guide.html
+
+#### SparkSession Creation (2.x → 3.x)
+
+```scala
+// BEFORE (Spark 1.x/2.x)
+val sc = new SparkContext(conf)
+val sqlContext = new SQLContext(sc)
+
+// AFTER (Spark 2.x+/3.x)
+val spark = SparkSession.builder()
+  .config(conf)
+  .enableHiveSupport() // if needed
+  .getOrCreate()
+val sc = spark.sparkContext
+```
+
+#### RDD to DataFrame (2.x → 3.x)
+
+```scala
+// BEFORE
+rdd.toDF() // implicit from SQLContext
+
+// AFTER
+import spark.implicits._
+rdd.toDF() // implicit from SparkSession
+```
+
+#### Accumulator API (2.x → 3.x)
+
+```scala
+// BEFORE
+val acc = sc.accumulator(0)
+
+// AFTER
+val acc = sc.longAccumulator("name")
+```
+
+### Checklist
+
+- [ ] Replace `SQLContext` / `HiveContext` with `SparkSession`
+- [ ] Replace the deprecated `Accumulator` with `AccumulatorV2`
+- [ ] Update `DataFrame` → `Dataset[Row]` where needed
+- [ ] Replace the removed `RDD.mapPartitionsWithContext` with `mapPartitions`
+- [ ] Fix deprecated `SparkConf` setters
+- [ ] Update custom `UserDefinedFunction` registration
+- [ ] Migrate `Experimental` / `DeveloperApi` usages that were removed
+- [ ] Verify all compilation errors from Phase 2 are resolved
+
+---
+
+## Phase 4: Configuration Migration
+
+Spark renames and removes configuration properties between versions.
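Mechanically, this phase is a key-rename pass over every place configs are set. A Python sketch, seeded with the one well-known rename from the checklist in this phase (a real migration needs the full table from the migration guide, and some values also need manual review):

```python
# Sketch: rewrite deprecated Spark config keys to their replacements.
# Only one rename is shown; build the full mapping for your
# source/target versions.
RENAMED_KEYS = {
    "spark.shuffle.file.buffer.kb": "spark.shuffle.file.buffer",
}

def migrate_conf(conf: dict) -> dict:
    """Return a copy of conf with deprecated keys renamed. Values are
    left untouched, so renames that also change value semantics (such
    as implicit-kb numbers becoming values with unit suffixes) still
    need a manual pass."""
    return {RENAMED_KEYS.get(k, k): v for k, v in conf.items()}

old = {"spark.shuffle.file.buffer.kb": "32", "spark.app.name": "etl"}
print(migrate_conf(old))
```

The same mapping can drive a search-and-replace over `spark-defaults.conf`, submit scripts, and `SparkSession.builder().config(...)` calls so all sources of configuration move together.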
+The official migration guide documents every renamed and removed property per release:
+https://spark.apache.org/docs/latest/migration-guide.html
+
+### Checklist
+
+- [ ] Rename deprecated config keys (e.g., `spark.shuffle.file.buffer.kb` → `spark.shuffle.file.buffer`)
+- [ ] Update removed configs to their replacements
+- [ ] Review `spark-defaults.conf`, application code, and submit scripts
+- [ ] Check for hardcoded config values in test fixtures
+- [ ] Verify `SparkSession.builder().config(...)` calls use current property names
+
+---
+
+## Phase 5: SQL & DataFrame Migration
+
+Spark SQL behavior changes between versions can silently alter query results.
+
+### Key Breaking Changes (2.x → 3.x)
+
+- `CAST` to integer no longer truncates silently — set `spark.sql.ansi.enabled` if needed
+- `FROM` clause is required in `SELECT` (no more `SELECT 1`)
+- Column resolution order changed in subqueries
+- `spark.sql.legacy.timeParserPolicy` controls date/time parsing behavior
+
+### Key Breaking Changes (3.x → 4.x)
+
+- ANSI mode is the default (`spark.sql.ansi.enabled=true`)
+- Stricter type coercion in comparisons
+- Many `spark.sql.legacy.*` flags removed
+
+### Checklist
+
+- [ ] Audit SQL strings and DataFrame expressions for changed behavior
+- [ ] Add explicit `CAST` where implicit coercion relied on legacy behavior
+- [ ] Update date/time format patterns to match the new parser
+- [ ] Test SQL queries with representative data and compare output to the pre-upgrade baseline
+- [ ] Set `spark.sql.legacy.*` flags temporarily if needed for a phased migration
+
+---
+
+## Phase 6: Test Validation
+
+### Checklist
+
+- [ ] All code compiles without errors
+- [ ] All existing unit tests pass
+- [ ] All existing integration tests pass
+- [ ] Run Spark jobs locally with sample data and compare output to the pre-upgrade baseline
+- [ ] No deprecation warnings remain (or they are documented with a migration timeline)
+- [ ] Update the CI/CD pipeline to use the new Spark version
+- [ ] Document any
+`spark.sql.legacy.*` flags that are set temporarily
+
+## Done When
+
+✓ Project compiles against the target Spark version
+✓ All tests pass
+✓ No removed APIs remain in the code
+✓ Configuration properties are current
+✓ SQL queries produce correct results
+✓ Upgrade impact documented in `spark_upgrade_impact.md`
\ No newline at end of file
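The baseline comparison called for in Phases 5 and 6 can start as a plain row-level diff of exported results. A sketch, assuming both runs wrote deterministic, identically sorted CSV (the helper takes CSV text directly to stay self-contained):

```python
import csv
import io

def diff_rows(baseline_csv: str, upgraded_csv: str):
    """Yield (row_number, baseline_row, upgraded_row) for rows that
    differ; extra trailing rows on either side also count as diffs."""
    a = list(csv.reader(io.StringIO(baseline_csv)))
    b = list(csv.reader(io.StringIO(upgraded_csv)))
    for i in range(max(len(a), len(b))):
        ra = a[i] if i < len(a) else None
        rb = b[i] if i < len(b) else None
        if ra != rb:
            yield i + 1, ra, rb

baseline = "id,total\n1,10\n2,20\n"
upgraded = "id,total\n1,10\n2,21\n"
for row_no, ra, rb in diff_rows(baseline, upgraded):
    print(row_no, ra, rb)
```

An empty diff against the pre-upgrade run is the strongest signal that the ANSI-mode and type-coercion changes from Phase 5 did not silently alter results; any hit points at the exact row to investigate.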