From 7203478081605fe07260885a4d7eccc1e10b4f31 Mon Sep 17 00:00:00 2001 From: Jason O'Sullivan Date: Wed, 4 Mar 2026 12:14:55 +0000 Subject: [PATCH 1/5] HDDS-14303. updating spark3 user guide --- .../04-user-guide/02-integrations/06-spark.md | 189 ++++++++++++++++- .../04-user-guide/03-integrations/06-spark.md | 190 +++++++++++++++++- 2 files changed, 373 insertions(+), 6 deletions(-) diff --git a/docs/04-user-guide/02-integrations/06-spark.md b/docs/04-user-guide/02-integrations/06-spark.md index 10f30e6887..c55e811e30 100644 --- a/docs/04-user-guide/02-integrations/06-spark.md +++ b/docs/04-user-guide/02-integrations/06-spark.md @@ -1,8 +1,189 @@ --- -draft: true +sidebar_label: Spark --- -# Spark +# Using Apache Spark with Ozone -**TODO:** File a subtask under [HDDS-9858](https://issues.apache.org/jira/browse/HDDS-9858) and complete this page or section. -**TODO:** Uncomment link to this page in src/pages/index.js +Apache Spark is a widely used unified analytics engine for large-scale data processing. Ozone can serve as a scalable storage layer for Spark applications, allowing you to read and write data directly from/to Ozone clusters using familiar Spark APIs. + +:::note +This guide covers Apache Spark 3.x. Examples were tested with Spark 3.5.x and Apache Ozone 2.1.0. +::: + +## Overview + +Spark interacts with Ozone primarily through the OzoneFileSystem (ofs) connector, which allows access using the `ofs://` URI scheme. You can also use the older `o3fs://` scheme, though `ofs://` is generally recommended, especially in CDP environments. + +Key benefits include: + +- Storing large datasets generated or consumed by Spark jobs directly in Ozone. +- Leveraging Ozone's scalability and object storage features for Spark workloads. +- Using standard Spark DataFrame and RDD APIs to interact with Ozone data. + +## Prerequisites + +1. **Ozone Cluster:** A running Ozone cluster. +2. 
**Ozone Client JARs:** The `ozone-filesystem-hadoop3.jar` must be available on the Spark driver and executor classpath. +3. **Hadoop 3.4.x runtime (Ozone 2.1.0+):** Ozone 2.1.0 removed bundled copies of several Hadoop classes (`LeaseRecoverable`, `SafeMode`, `SafeModeAction`) and now requires them from the runtime classpath ([HDDS-13574](https://issues.apache.org/jira/browse/HDDS-13574)). Since Spark 3.5.x ships with Hadoop 3.3.4, you must add `hadoop-common-3.4.x.jar` to the Spark classpath alongside the existing Hadoop JARs. +4. **Configuration:** Spark needs access to Ozone configuration (`core-site.xml` and potentially `ozone-site.xml`) to connect to the Ozone cluster. + +## Configuration + +### 1. Core Site (`core-site.xml`) + +For `core-site.xml` configuration, refer to the [Ozone File System (ofs) Configuration section](../01-client-interfaces/02-ofs.md#configuration). + +### 2. Spark Configuration (`spark-defaults.conf` or `--conf`) + +While Spark often picks up settings from `core-site.xml` on the classpath, explicitly setting the implementation can sometimes be necessary: + +```properties +spark.hadoop.fs.ofs.impl=org.apache.hadoop.fs.ozone.RootedOzoneFileSystem +spark.hadoop.fs.o3fs.impl=org.apache.hadoop.fs.ozone.OzoneFileSystem +``` + +### 3. Security (Kerberos) + +If your Ozone and Spark clusters are Kerberos-enabled, Spark needs permission to obtain delegation tokens for Ozone. Configure the following property in `spark-defaults.conf`or via`--conf`, specifying your Ozone filesystem URI: + +```properties +# For YARN deployments in spark3+ +spark.kerberos.access.hadoopFileSystems=ofs://ozone1/ +``` + +Replace `ozone1` with your OM Service ID. Ensure the user running the Spark job has a valid Kerberos ticket (`kinit`). + +## Usage Examples + +You can read and write data using `ofs://` URIs like any other Hadoop-compatible filesystem. 
+ +**URI Format:** `ofs://///path/to/key>` + +### Reading Data (Scala) + +```scala +import org.apache.spark.sql.SparkSession + +val spark = SparkSession.builder.appName("Ozone Spark Read Example").getOrCreate() + +// Read a CSV file from Ozone +val df = spark.read.format("csv") + .option("header", "true") + .option("inferSchema", "true") + .load("ofs://ozone1/volume1/bucket1/input/data.csv") + +df.show() + +spark.stop() +``` + +### Writing Data (Scala) + +```scala +import org.apache.spark.sql.SparkSession + +val spark = SparkSession.builder.appName("Ozone Spark Write Example").getOrCreate() + +// Assume 'df' is a DataFrame you want to write +val data = Seq(("Alice", 1), ("Bob", 2), ("Charlie", 3)) +val df = spark.createDataFrame(data).toDF("name", "id") + +// Write DataFrame to Ozone as Parquet files +df.write.mode("overwrite") + .parquet("ofs://ozone1/volume1/bucket1/output/users.parquet") + +spark.stop() +``` + +### Reading Data (Python) + +```python +from pyspark.sql import SparkSession + +spark = SparkSession.builder.appName("Ozone Spark Read Example").getOrCreate() + +# Read a CSV file from Ozone +df = spark.read.format("csv") \ + .option("header", "true") \ + .option("inferSchema", "true") \ + .load("ofs://ozone1/volume1/bucket1/input/data.csv") + +df.show() + +spark.stop() +``` + +### Writing Data (Python) + +```python +from pyspark.sql import SparkSession + +spark = SparkSession.builder.appName("Ozone Spark Write Example").getOrCreate() + +# Assume 'df' is a DataFrame you want to write +data = [("Alice", 1), ("Bob", 2), ("Charlie", 3)] +columns = ["name", "id"] +df = spark.createDataFrame(data, columns) + +# Write DataFrame to Ozone as Parquet files +df.write.mode("overwrite") \ + .parquet("ofs://ozone1/volume1/bucket1/output/users.parquet") + +spark.stop() +``` + +## Spark on Kubernetes + +The recommended approach for running Spark on Kubernetes with Ozone is to bake the ozone-filesystem-hadoop3-client-*.jar, the hadoop-common-3.4.x.jar (if using Ozone 
2.1.0+), and core-site.xml directly into a custom Spark image. + +1. **Build a Custom Spark Image:** Place the Ozone client JAR and Hadoop compatibility JAR in /opt/spark/jars/, which is on the default Spark classpath, and core-site.xml in /opt/spark/conf/: +```dockerfile +FROM apache/spark:3.5.8-scala2.12-java11-python3-ubuntu + +USER root + +ADD https://repo1.maven.org/maven2/org/apache/ozone/ozone-filesystem-hadoop3-client/2.1.0/ozone-filesystem-hadoop3-client-2.1.0.jar \ + /opt/spark/jars/ + +# Ozone 2.1.0+ requires Hadoop 3.4.x classes (HDDS-13574). +# Add alongside (not replacing) Spark's bundled hadoop-common-3.3.4.jar. +ADD https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-common/3.4.2/hadoop-common-3.4.2.jar \ + /opt/spark/jars/ + +COPY core-site.xml /opt/spark/conf/core-site.xml +COPY ozone_write.py /opt/spark/work-dir/ozone_write.py + +USER spark +``` +Where core-site.xml contains at minimum: +```xml + + + + + fs.ofs.impl + org.apache.hadoop.fs.ozone.RootedOzoneFileSystem + + + fs.o3fs.impl + org.apache.hadoop.fs.ozone.OzoneFileSystem + + + ozone.om.address + om-host.example.com:9862 + + +``` +2. **Submit `Spark-submit`:** + ```bash + ./bin/spark-submit \ + --master k8s://https://:6443 \ + --deploy-mode cluster \ + --name spark-ozone-example \ + --conf spark.executor.instances=2 \ + --conf spark.kubernetes.container.image=/spark-ozone:latest \ + --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ + --conf spark.kubernetes.namespace= \ + local:///opt/spark/work-dir/ozone_example.py + ``` +Replace , , and with your environment values. 
\ No newline at end of file diff --git a/versioned_docs/version-2.1.0/04-user-guide/03-integrations/06-spark.md b/versioned_docs/version-2.1.0/04-user-guide/03-integrations/06-spark.md index 5d0235c29e..c55e811e30 100644 --- a/versioned_docs/version-2.1.0/04-user-guide/03-integrations/06-spark.md +++ b/versioned_docs/version-2.1.0/04-user-guide/03-integrations/06-spark.md @@ -1,3 +1,189 @@ -# Spark +--- +sidebar_label: Spark +--- -**TODO:** File a subtask under [HDDS-9858](https://issues.apache.org/jira/browse/HDDS-9858) and complete this page or section. +# Using Apache Spark with Ozone + +Apache Spark is a widely used unified analytics engine for large-scale data processing. Ozone can serve as a scalable storage layer for Spark applications, allowing you to read and write data directly from/to Ozone clusters using familiar Spark APIs. + +:::note +This guide covers Apache Spark 3.x. Examples were tested with Spark 3.5.x and Apache Ozone 2.1.0. +::: + +## Overview + +Spark interacts with Ozone primarily through the OzoneFileSystem (ofs) connector, which allows access using the `ofs://` URI scheme. You can also use the older `o3fs://` scheme, though `ofs://` is generally recommended, especially in CDP environments. + +Key benefits include: + +- Storing large datasets generated or consumed by Spark jobs directly in Ozone. +- Leveraging Ozone's scalability and object storage features for Spark workloads. +- Using standard Spark DataFrame and RDD APIs to interact with Ozone data. + +## Prerequisites + +1. **Ozone Cluster:** A running Ozone cluster. +2. **Ozone Client JARs:** The `ozone-filesystem-hadoop3.jar` must be available on the Spark driver and executor classpath. +3. **Hadoop 3.4.x runtime (Ozone 2.1.0+):** Ozone 2.1.0 removed bundled copies of several Hadoop classes (`LeaseRecoverable`, `SafeMode`, `SafeModeAction`) and now requires them from the runtime classpath ([HDDS-13574](https://issues.apache.org/jira/browse/HDDS-13574)). 
Since Spark 3.5.x ships with Hadoop 3.3.4, you must add `hadoop-common-3.4.x.jar` to the Spark classpath alongside the existing Hadoop JARs. +4. **Configuration:** Spark needs access to Ozone configuration (`core-site.xml` and potentially `ozone-site.xml`) to connect to the Ozone cluster. + +## Configuration + +### 1. Core Site (`core-site.xml`) + +For `core-site.xml` configuration, refer to the [Ozone File System (ofs) Configuration section](../01-client-interfaces/02-ofs.md#configuration). + +### 2. Spark Configuration (`spark-defaults.conf` or `--conf`) + +While Spark often picks up settings from `core-site.xml` on the classpath, explicitly setting the implementation can sometimes be necessary: + +```properties +spark.hadoop.fs.ofs.impl=org.apache.hadoop.fs.ozone.RootedOzoneFileSystem +spark.hadoop.fs.o3fs.impl=org.apache.hadoop.fs.ozone.OzoneFileSystem +``` + +### 3. Security (Kerberos) + +If your Ozone and Spark clusters are Kerberos-enabled, Spark needs permission to obtain delegation tokens for Ozone. Configure the following property in `spark-defaults.conf`or via`--conf`, specifying your Ozone filesystem URI: + +```properties +# For YARN deployments in spark3+ +spark.kerberos.access.hadoopFileSystems=ofs://ozone1/ +``` + +Replace `ozone1` with your OM Service ID. Ensure the user running the Spark job has a valid Kerberos ticket (`kinit`). + +## Usage Examples + +You can read and write data using `ofs://` URIs like any other Hadoop-compatible filesystem. 
+ +**URI Format:** `ofs://///path/to/key>` + +### Reading Data (Scala) + +```scala +import org.apache.spark.sql.SparkSession + +val spark = SparkSession.builder.appName("Ozone Spark Read Example").getOrCreate() + +// Read a CSV file from Ozone +val df = spark.read.format("csv") + .option("header", "true") + .option("inferSchema", "true") + .load("ofs://ozone1/volume1/bucket1/input/data.csv") + +df.show() + +spark.stop() +``` + +### Writing Data (Scala) + +```scala +import org.apache.spark.sql.SparkSession + +val spark = SparkSession.builder.appName("Ozone Spark Write Example").getOrCreate() + +// Assume 'df' is a DataFrame you want to write +val data = Seq(("Alice", 1), ("Bob", 2), ("Charlie", 3)) +val df = spark.createDataFrame(data).toDF("name", "id") + +// Write DataFrame to Ozone as Parquet files +df.write.mode("overwrite") + .parquet("ofs://ozone1/volume1/bucket1/output/users.parquet") + +spark.stop() +``` + +### Reading Data (Python) + +```python +from pyspark.sql import SparkSession + +spark = SparkSession.builder.appName("Ozone Spark Read Example").getOrCreate() + +# Read a CSV file from Ozone +df = spark.read.format("csv") \ + .option("header", "true") \ + .option("inferSchema", "true") \ + .load("ofs://ozone1/volume1/bucket1/input/data.csv") + +df.show() + +spark.stop() +``` + +### Writing Data (Python) + +```python +from pyspark.sql import SparkSession + +spark = SparkSession.builder.appName("Ozone Spark Write Example").getOrCreate() + +# Assume 'df' is a DataFrame you want to write +data = [("Alice", 1), ("Bob", 2), ("Charlie", 3)] +columns = ["name", "id"] +df = spark.createDataFrame(data, columns) + +# Write DataFrame to Ozone as Parquet files +df.write.mode("overwrite") \ + .parquet("ofs://ozone1/volume1/bucket1/output/users.parquet") + +spark.stop() +``` + +## Spark on Kubernetes + +The recommended approach for running Spark on Kubernetes with Ozone is to bake the ozone-filesystem-hadoop3-client-*.jar, the hadoop-common-3.4.x.jar (if using Ozone 
2.1.0+), and core-site.xml directly into a custom Spark image. + +1. **Build a Custom Spark Image:** Place the Ozone client JAR and Hadoop compatibility JAR in /opt/spark/jars/, which is on the default Spark classpath, and core-site.xml in /opt/spark/conf/: +```dockerfile +FROM apache/spark:3.5.8-scala2.12-java11-python3-ubuntu + +USER root + +ADD https://repo1.maven.org/maven2/org/apache/ozone/ozone-filesystem-hadoop3-client/2.1.0/ozone-filesystem-hadoop3-client-2.1.0.jar \ + /opt/spark/jars/ + +# Ozone 2.1.0+ requires Hadoop 3.4.x classes (HDDS-13574). +# Add alongside (not replacing) Spark's bundled hadoop-common-3.3.4.jar. +ADD https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-common/3.4.2/hadoop-common-3.4.2.jar \ + /opt/spark/jars/ + +COPY core-site.xml /opt/spark/conf/core-site.xml +COPY ozone_write.py /opt/spark/work-dir/ozone_write.py + +USER spark +``` +Where core-site.xml contains at minimum: +```xml + + + + + fs.ofs.impl + org.apache.hadoop.fs.ozone.RootedOzoneFileSystem + + + fs.o3fs.impl + org.apache.hadoop.fs.ozone.OzoneFileSystem + + + ozone.om.address + om-host.example.com:9862 + + +``` +2. **Submit `Spark-submit`:** + ```bash + ./bin/spark-submit \ + --master k8s://https://:6443 \ + --deploy-mode cluster \ + --name spark-ozone-example \ + --conf spark.executor.instances=2 \ + --conf spark.kubernetes.container.image=/spark-ozone:latest \ + --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ + --conf spark.kubernetes.namespace= \ + local:///opt/spark/work-dir/ozone_example.py + ``` +Replace , , and with your environment values. \ No newline at end of file From 02e9e8895296797594e659e6122b609a8b943eab Mon Sep 17 00:00:00 2001 From: Jason O'Sullivan Date: Wed, 4 Mar 2026 12:30:47 +0000 Subject: [PATCH 2/5] HDDS-14303. 
updating spark3 user guide --- .../04-user-guide/02-integrations/06-spark.md | 48 +++++++++++-------- .../04-user-guide/03-integrations/06-spark.md | 42 +++++++++------- 2 files changed, 53 insertions(+), 37 deletions(-) diff --git a/docs/04-user-guide/02-integrations/06-spark.md b/docs/04-user-guide/02-integrations/06-spark.md index c55e811e30..69f01ef20c 100644 --- a/docs/04-user-guide/02-integrations/06-spark.md +++ b/docs/04-user-guide/02-integrations/06-spark.md @@ -12,13 +12,13 @@ This guide covers Apache Spark 3.x. Examples were tested with Spark 3.5.x and Ap ## Overview -Spark interacts with Ozone primarily through the OzoneFileSystem (ofs) connector, which allows access using the `ofs://` URI scheme. You can also use the older `o3fs://` scheme, though `ofs://` is generally recommended, especially in CDP environments. +Spark interacts with Ozone primarily through the OzoneFileSystem (ofs) connector, which allows access using the `ofs://` URI scheme. You can also use the older `o3fs://` scheme, though `ofs://` is generally recommended. Key benefits include: - Storing large datasets generated or consumed by Spark jobs directly in Ozone. - Leveraging Ozone's scalability and object storage features for Spark workloads. -- Using standard Spark DataFrame and RDD APIs to interact with Ozone data. +- Using standard Spark DataFrame and `RDD` APIs to interact with Ozone data. 
## Prerequisites @@ -103,9 +103,9 @@ from pyspark.sql import SparkSession spark = SparkSession.builder.appName("Ozone Spark Read Example").getOrCreate() # Read a CSV file from Ozone -df = spark.read.format("csv") \ - .option("header", "true") \ - .option("inferSchema", "true") \ +df = spark.read.format("csv") + .option("header", "true") + .option("inferSchema", "true") .load("ofs://ozone1/volume1/bucket1/input/data.csv") df.show() @@ -134,9 +134,12 @@ spark.stop() ## Spark on Kubernetes -The recommended approach for running Spark on Kubernetes with Ozone is to bake the ozone-filesystem-hadoop3-client-*.jar, the hadoop-common-3.4.x.jar (if using Ozone 2.1.0+), and core-site.xml directly into a custom Spark image. +The recommended approach for running Spark on Kubernetes with Ozone is to bake the `ozone-filesystem-hadoop3-client-*.jar` JAR, the `hadoop-common-3.4.x.jar` JAR (if using Ozone 2.1.0+), and core-site.xml directly into a custom Spark image. + +### Build a Custom Spark Image + +Place the Ozone client JAR and Hadoop compatibility JAR in /opt/spark/jars/, which is on the default Spark classpath, and core-site.xml in /opt/spark/conf/: -1. **Build a Custom Spark Image:** Place the Ozone client JAR and Hadoop compatibility JAR in /opt/spark/jars/, which is on the default Spark classpath, and core-site.xml in /opt/spark/conf/: ```dockerfile FROM apache/spark:3.5.8-scala2.12-java11-python3-ubuntu @@ -155,7 +158,9 @@ COPY ozone_write.py /opt/spark/work-dir/ozone_write.py USER spark ``` + Where core-site.xml contains at minimum: + ```xml @@ -174,16 +179,19 @@ Where core-site.xml contains at minimum: ``` -2. 
**Submit `Spark-submit`:** - ```bash - ./bin/spark-submit \ - --master k8s://https://:6443 \ - --deploy-mode cluster \ - --name spark-ozone-example \ - --conf spark.executor.instances=2 \ - --conf spark.kubernetes.container.image=/spark-ozone:latest \ - --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ - --conf spark.kubernetes.namespace= \ - local:///opt/spark/work-dir/ozone_example.py - ``` -Replace , , and with your environment values. \ No newline at end of file + +### Submit `Spark-submit` + +```bash +./bin/spark-submit \ + --master k8s://https://YOUR_KUBERNETES_API_SERVER:6443 \ + --deploy-mode cluster \ + --name spark-ozone-example \ + --conf spark.executor.instances=2 \ + --conf spark.kubernetes.container.image=YOUR_REPO/spark-ozone:latest \ + --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ + --conf spark.kubernetes.namespace=YOUR_NAMESPACE \ + local:///opt/spark/work-dir/ozone_example.py +``` + +Replace `YOUR_KUBERNETES_API_SERVER`, `YOUR_REPO`, and `YOUR_NAMESPACE` with your environment values. diff --git a/versioned_docs/version-2.1.0/04-user-guide/03-integrations/06-spark.md b/versioned_docs/version-2.1.0/04-user-guide/03-integrations/06-spark.md index c55e811e30..1a7e61f0b3 100644 --- a/versioned_docs/version-2.1.0/04-user-guide/03-integrations/06-spark.md +++ b/versioned_docs/version-2.1.0/04-user-guide/03-integrations/06-spark.md @@ -12,13 +12,13 @@ This guide covers Apache Spark 3.x. Examples were tested with Spark 3.5.x and Ap ## Overview -Spark interacts with Ozone primarily through the OzoneFileSystem (ofs) connector, which allows access using the `ofs://` URI scheme. You can also use the older `o3fs://` scheme, though `ofs://` is generally recommended, especially in CDP environments. +Spark interacts with Ozone primarily through the OzoneFileSystem (ofs) connector, which allows access using the `ofs://` URI scheme. You can also use the older `o3fs://` scheme, though `ofs://` is generally recommended. 
Key benefits include: - Storing large datasets generated or consumed by Spark jobs directly in Ozone. - Leveraging Ozone's scalability and object storage features for Spark workloads. -- Using standard Spark DataFrame and RDD APIs to interact with Ozone data. +- Using standard Spark DataFrame and `RDD` APIs to interact with Ozone data. ## Prerequisites @@ -134,9 +134,12 @@ spark.stop() ## Spark on Kubernetes -The recommended approach for running Spark on Kubernetes with Ozone is to bake the ozone-filesystem-hadoop3-client-*.jar, the hadoop-common-3.4.x.jar (if using Ozone 2.1.0+), and core-site.xml directly into a custom Spark image. +The recommended approach for running Spark on Kubernetes with Ozone is to bake the `ozone-filesystem-hadoop3-client-*.jar` JAR, the `hadoop-common-3.4.x.jar` JAR (if using Ozone 2.1.0+), and core-site.xml directly into a custom Spark image. + +### Build a Custom Spark Image + +Place the Ozone client JAR and Hadoop compatibility JAR in /opt/spark/jars/, which is on the default Spark classpath, and core-site.xml in /opt/spark/conf/: -1. **Build a Custom Spark Image:** Place the Ozone client JAR and Hadoop compatibility JAR in /opt/spark/jars/, which is on the default Spark classpath, and core-site.xml in /opt/spark/conf/: ```dockerfile FROM apache/spark:3.5.8-scala2.12-java11-python3-ubuntu @@ -155,7 +158,9 @@ COPY ozone_write.py /opt/spark/work-dir/ozone_write.py USER spark ``` + Where core-site.xml contains at minimum: + ```xml @@ -174,16 +179,19 @@ Where core-site.xml contains at minimum: ``` -2. 
**Submit `Spark-submit`:** - ```bash - ./bin/spark-submit \ - --master k8s://https://:6443 \ - --deploy-mode cluster \ - --name spark-ozone-example \ - --conf spark.executor.instances=2 \ - --conf spark.kubernetes.container.image=/spark-ozone:latest \ - --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ - --conf spark.kubernetes.namespace= \ - local:///opt/spark/work-dir/ozone_example.py - ``` -Replace , , and with your environment values. \ No newline at end of file + +### Submit `Spark-submit` + +```bash +./bin/spark-submit \ + --master k8s://https://YOUR_KUBERNETES_API_SERVER:6443 \ + --deploy-mode cluster \ + --name spark-ozone-example \ + --conf spark.executor.instances=2 \ + --conf spark.kubernetes.container.image=YOUR_REPO/spark-ozone:latest \ + --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ + --conf spark.kubernetes.namespace=YOUR_NAMESPACE \ + local:///opt/spark/work-dir/ozone_example.py +``` + +Replace `YOUR_KUBERNETES_API_SERVER`, `YOUR_REPO`, and `YOUR_NAMESPACE` with your environment values. From d619f3fcf47b2a4d93928827f6359877de612510 Mon Sep 17 00:00:00 2001 From: Jason O'Sullivan Date: Wed, 4 Mar 2026 12:39:21 +0000 Subject: [PATCH 3/5] HDDS-14303. updating spark3 user guide --- .../04-user-guide/02-integrations/06-spark.md | 19 +++++++++++++------ .../04-user-guide/03-integrations/06-spark.md | 13 ++++++++++--- 2 files changed, 23 insertions(+), 9 deletions(-) diff --git a/docs/04-user-guide/02-integrations/06-spark.md b/docs/04-user-guide/02-integrations/06-spark.md index 69f01ef20c..765835577d 100644 --- a/docs/04-user-guide/02-integrations/06-spark.md +++ b/docs/04-user-guide/02-integrations/06-spark.md @@ -4,7 +4,7 @@ sidebar_label: Spark # Using Apache Spark with Ozone -Apache Spark is a widely used unified analytics engine for large-scale data processing. 
Ozone can serve as a scalable storage layer for Spark applications, allowing you to read and write data directly from/to Ozone clusters using familiar Spark APIs. +[Apache Spark](https://spark.apache.org/) is a widely used unified analytics engine for large-scale data processing. Ozone can serve as a scalable storage layer for Spark applications, allowing you to read and write data directly from/to Ozone clusters using familiar Spark APIs. :::note This guide covers Apache Spark 3.x. Examples were tested with Spark 3.5.x and Apache Ozone 2.1.0. @@ -12,7 +12,10 @@ This guide covers Apache Spark 3.x. Examples were tested with Spark 3.5.x and Ap ## Overview -Spark interacts with Ozone primarily through the OzoneFileSystem (ofs) connector, which allows access using the `ofs://` URI scheme. You can also use the older `o3fs://` scheme, though `ofs://` is generally recommended. +Spark interacts with Ozone primarily through the OzoneFileSystem connector, which allows access using the `ofs://` URI scheme. +Spark can also access Ozone through the S3 Gateway using the `s3a://` protocol, which is useful for porting existing cloud-native Spark applications to Ozone without changing application code. + +The older `o3fs://` scheme is supported for legacy compatibility but is not recommended for new deployments. Key benefits include: @@ -39,7 +42,6 @@ While Spark often picks up settings from `core-site.xml` on the classpath, expli ```properties spark.hadoop.fs.ofs.impl=org.apache.hadoop.fs.ozone.RootedOzoneFileSystem -spark.hadoop.fs.o3fs.impl=org.apache.hadoop.fs.ozone.OzoneFileSystem ``` ### 3. 
Security (Kerberos) @@ -103,9 +105,9 @@ from pyspark.sql import SparkSession spark = SparkSession.builder.appName("Ozone Spark Read Example").getOrCreate() # Read a CSV file from Ozone -df = spark.read.format("csv") - .option("header", "true") - .option("inferSchema", "true") +df = spark.read.format("csv") \ + .option("header", "true") \ + .option("inferSchema", "true") \ .load("ofs://ozone1/volume1/bucket1/input/data.csv") df.show() @@ -195,3 +197,8 @@ Where core-site.xml contains at minimum: ``` Replace `YOUR_KUBERNETES_API_SERVER`, `YOUR_REPO`, and `YOUR_NAMESPACE` with your environment values. + +## Using the S3A Protocol + +Spark can also access Ozone through the S3 Gateway using the `s3a://` protocol. This is useful for porting existing cloud-native Spark applications to Ozone without changing application code. +For configuration details, refer to the [S3A documentation](../01-client-interfaces/04-s3a.md). diff --git a/versioned_docs/version-2.1.0/04-user-guide/03-integrations/06-spark.md b/versioned_docs/version-2.1.0/04-user-guide/03-integrations/06-spark.md index 1a7e61f0b3..765835577d 100644 --- a/versioned_docs/version-2.1.0/04-user-guide/03-integrations/06-spark.md +++ b/versioned_docs/version-2.1.0/04-user-guide/03-integrations/06-spark.md @@ -4,7 +4,7 @@ sidebar_label: Spark # Using Apache Spark with Ozone -Apache Spark is a widely used unified analytics engine for large-scale data processing. Ozone can serve as a scalable storage layer for Spark applications, allowing you to read and write data directly from/to Ozone clusters using familiar Spark APIs. +[Apache Spark](https://spark.apache.org/) is a widely used unified analytics engine for large-scale data processing. Ozone can serve as a scalable storage layer for Spark applications, allowing you to read and write data directly from/to Ozone clusters using familiar Spark APIs. :::note This guide covers Apache Spark 3.x. Examples were tested with Spark 3.5.x and Apache Ozone 2.1.0. 
@@ -12,7 +12,10 @@ This guide covers Apache Spark 3.x. Examples were tested with Spark 3.5.x and Ap ## Overview -Spark interacts with Ozone primarily through the OzoneFileSystem (ofs) connector, which allows access using the `ofs://` URI scheme. You can also use the older `o3fs://` scheme, though `ofs://` is generally recommended. +Spark interacts with Ozone primarily through the OzoneFileSystem connector, which allows access using the `ofs://` URI scheme. +Spark can also access Ozone through the S3 Gateway using the `s3a://` protocol, which is useful for porting existing cloud-native Spark applications to Ozone without changing application code. + +The older `o3fs://` scheme is supported for legacy compatibility but is not recommended for new deployments. Key benefits include: @@ -39,7 +42,6 @@ While Spark often picks up settings from `core-site.xml` on the classpath, expli ```properties spark.hadoop.fs.ofs.impl=org.apache.hadoop.fs.ozone.RootedOzoneFileSystem -spark.hadoop.fs.o3fs.impl=org.apache.hadoop.fs.ozone.OzoneFileSystem ``` ### 3. Security (Kerberos) @@ -195,3 +197,8 @@ Where core-site.xml contains at minimum: ``` Replace `YOUR_KUBERNETES_API_SERVER`, `YOUR_REPO`, and `YOUR_NAMESPACE` with your environment values. + +## Using the S3A Protocol + +Spark can also access Ozone through the S3 Gateway using the `s3a://` protocol. This is useful for porting existing cloud-native Spark applications to Ozone without changing application code. +For configuration details, refer to the [S3A documentation](../01-client-interfaces/04-s3a.md). From 091a3ef728d96477bec6dde1758d099fafd2d2b7 Mon Sep 17 00:00:00 2001 From: Jason O'Sullivan Date: Wed, 4 Mar 2026 12:52:35 +0000 Subject: [PATCH 4/5] HDDS-14303. 
updating spark3 user guide --- .../04-user-guide/02-integrations/06-spark.md | 17 +++++++-------- .../04-user-guide/03-integrations/06-spark.md | 21 +++++++++---------- 2 files changed, 18 insertions(+), 20 deletions(-) diff --git a/docs/04-user-guide/02-integrations/06-spark.md b/docs/04-user-guide/02-integrations/06-spark.md index 765835577d..8791213f17 100644 --- a/docs/04-user-guide/02-integrations/06-spark.md +++ b/docs/04-user-guide/02-integrations/06-spark.md @@ -26,7 +26,7 @@ Key benefits include: ## Prerequisites 1. **Ozone Cluster:** A running Ozone cluster. -2. **Ozone Client JARs:** The `ozone-filesystem-hadoop3.jar` must be available on the Spark driver and executor classpath. +2. **Ozone Client JARs:** The `ozone-filesystem-hadoop3-client-*.jar` must be available on the Spark driver and executor classpath. 3. **Hadoop 3.4.x runtime (Ozone 2.1.0+):** Ozone 2.1.0 removed bundled copies of several Hadoop classes (`LeaseRecoverable`, `SafeMode`, `SafeModeAction`) and now requires them from the runtime classpath ([HDDS-13574](https://issues.apache.org/jira/browse/HDDS-13574)). Since Spark 3.5.x ships with Hadoop 3.3.4, you must add `hadoop-common-3.4.x.jar` to the Spark classpath alongside the existing Hadoop JARs. 4. **Configuration:** Spark needs access to Ozone configuration (`core-site.xml` and potentially `ozone-site.xml`) to connect to the Ozone cluster. @@ -46,7 +46,9 @@ spark.hadoop.fs.ofs.impl=org.apache.hadoop.fs.ozone.RootedOzoneFileSystem ### 3. Security (Kerberos) -If your Ozone and Spark clusters are Kerberos-enabled, Spark needs permission to obtain delegation tokens for Ozone. Configure the following property in `spark-defaults.conf`or via`--conf`, specifying your Ozone filesystem URI: +If your Ozone and Spark clusters are Kerberos-enabled, Spark needs permission to obtain delegation tokens for Ozone. 
+ +Configure the following property in `spark-defaults.conf` or via `--conf`, specifying your Ozone filesystem URI: ```properties # For YARN deployments in spark3+ @@ -59,7 +61,7 @@ Replace `ozone1` with your OM Service ID. Ensure the user running the Spark job You can read and write data using `ofs://` URIs like any other Hadoop-compatible filesystem. -**URI Format:** `ofs://///path/to/key>` +**URI Format:** `ofs://///path/to/key` ### Reading Data (Scala) @@ -171,10 +173,6 @@ Where core-site.xml contains at minimum: fs.ofs.impl org.apache.hadoop.fs.ozone.RootedOzoneFileSystem - - fs.o3fs.impl - org.apache.hadoop.fs.ozone.OzoneFileSystem - ozone.om.address om-host.example.com:9862 @@ -182,7 +180,7 @@ Where core-site.xml contains at minimum: ``` -### Submit `Spark-submit` +### Submit a Spark Job ```bash ./bin/spark-submit \ @@ -193,7 +191,7 @@ Where core-site.xml contains at minimum: --conf spark.kubernetes.container.image=YOUR_REPO/spark-ozone:latest \ --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ --conf spark.kubernetes.namespace=YOUR_NAMESPACE \ - local:///opt/spark/work-dir/ozone_example.py + local:///opt/spark/work-dir/ozone_write.py ``` Replace `YOUR_KUBERNETES_API_SERVER`, `YOUR_REPO`, and `YOUR_NAMESPACE` with your environment values. @@ -201,4 +199,5 @@ Replace `YOUR_KUBERNETES_API_SERVER`, `YOUR_REPO`, and `YOUR_NAMESPACE` with you ## Using the S3A Protocol Spark can also access Ozone through the S3 Gateway using the `s3a://` protocol. This is useful for porting existing cloud-native Spark applications to Ozone without changing application code. + For configuration details, refer to the [S3A documentation](../01-client-interfaces/04-s3a.md). 
diff --git a/versioned_docs/version-2.1.0/04-user-guide/03-integrations/06-spark.md b/versioned_docs/version-2.1.0/04-user-guide/03-integrations/06-spark.md index 765835577d..9035ba092b 100644 --- a/versioned_docs/version-2.1.0/04-user-guide/03-integrations/06-spark.md +++ b/versioned_docs/version-2.1.0/04-user-guide/03-integrations/06-spark.md @@ -26,7 +26,7 @@ Key benefits include: ## Prerequisites 1. **Ozone Cluster:** A running Ozone cluster. -2. **Ozone Client JARs:** The `ozone-filesystem-hadoop3.jar` must be available on the Spark driver and executor classpath. +2. **Ozone Client JARs:** The `ozone-filesystem-hadoop3-client-*.jar` must be available on the Spark driver and executor classpath. 3. **Hadoop 3.4.x runtime (Ozone 2.1.0+):** Ozone 2.1.0 removed bundled copies of several Hadoop classes (`LeaseRecoverable`, `SafeMode`, `SafeModeAction`) and now requires them from the runtime classpath ([HDDS-13574](https://issues.apache.org/jira/browse/HDDS-13574)). Since Spark 3.5.x ships with Hadoop 3.3.4, you must add `hadoop-common-3.4.x.jar` to the Spark classpath alongside the existing Hadoop JARs. 4. **Configuration:** Spark needs access to Ozone configuration (`core-site.xml` and potentially `ozone-site.xml`) to connect to the Ozone cluster. @@ -34,7 +34,7 @@ Key benefits include: ### 1. Core Site (`core-site.xml`) -For `core-site.xml` configuration, refer to the [Ozone File System (ofs) Configuration section](../01-client-interfaces/02-ofs.md#configuration). +For `core-site.xml` configuration, refer to the [Ozone File System (ofs) Configuration section](../client-interfaces/ofs#configuration). ### 2. Spark Configuration (`spark-defaults.conf` or `--conf`) @@ -46,7 +46,9 @@ spark.hadoop.fs.ofs.impl=org.apache.hadoop.fs.ozone.RootedOzoneFileSystem ### 3. Security (Kerberos) -If your Ozone and Spark clusters are Kerberos-enabled, Spark needs permission to obtain delegation tokens for Ozone. 
Configure the following property in `spark-defaults.conf`or via`--conf`, specifying your Ozone filesystem URI:
+If your Ozone and Spark clusters are Kerberos-enabled, Spark needs permission to obtain delegation tokens for Ozone.
+
+Configure the following property in `spark-defaults.conf` or via `--conf`, specifying your Ozone filesystem URI:
 
 ```properties
 # For YARN deployments in spark3+
@@ -59,7 +61,7 @@ Replace `ozone1` with your OM Service ID. Ensure the user running the Spark job
 
 You can read and write data using `ofs://` URIs like any other Hadoop-compatible filesystem.
 
-**URI Format:** `ofs://///path/to/key>`
+**URI Format:** `ofs://///path/to/key`
 
 ### Reading Data (Scala)
 
@@ -171,10 +173,6 @@ Where core-site.xml contains at minimum:
     <name>fs.ofs.impl</name>
     <value>org.apache.hadoop.fs.ozone.RootedOzoneFileSystem</value>
   </property>
-  <property>
-    <name>fs.o3fs.impl</name>
-    <value>org.apache.hadoop.fs.ozone.OzoneFileSystem</value>
-  </property>
   <property>
     <name>ozone.om.address</name>
     <value>om-host.example.com:9862</value>
@@ -182,7 +180,7 @@ Where core-site.xml contains at minimum:
 </configuration>
 ```
 
-### Submit `Spark-submit`
+### Submit a Spark Job
 
 ```bash
 ./bin/spark-submit \
@@ -193,7 +191,7 @@ Where core-site.xml contains at minimum:
   --conf spark.kubernetes.container.image=YOUR_REPO/spark-ozone:latest \
   --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
   --conf spark.kubernetes.namespace=YOUR_NAMESPACE \
-  local:///opt/spark/work-dir/ozone_example.py
+  local:///opt/spark/work-dir/ozone_write.py
 ```
 
 Replace `YOUR_KUBERNETES_API_SERVER`, `YOUR_REPO`, and `YOUR_NAMESPACE` with your environment values.
@@ -201,4 +199,5 @@ Replace `YOUR_KUBERNETES_API_SERVER`, `YOUR_REPO`, and `YOUR_NAMESPACE` with you
 ## Using the S3A Protocol
 
 Spark can also access Ozone through the S3 Gateway using the `s3a://` protocol. This is useful for porting existing cloud-native Spark applications to Ozone without changing application code.
-For configuration details, refer to the [S3A documentation](../01-client-interfaces/04-s3a.md).
+
+For configuration details, refer to the [S3A documentation](../client-interfaces/s3a).

From 881abf935ca91d65dd16585e2821078a42cad7ab Mon Sep 17 00:00:00 2001
From: Jason O'Sullivan
Date: Wed, 4 Mar 2026 14:09:56 +0000
Subject: [PATCH 5/5] HDDS-14303. updating spark3 user guide

---
 docs/04-user-guide/02-integrations/06-spark.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/04-user-guide/02-integrations/06-spark.md b/docs/04-user-guide/02-integrations/06-spark.md
index 8791213f17..9035ba092b 100644
--- a/docs/04-user-guide/02-integrations/06-spark.md
+++ b/docs/04-user-guide/02-integrations/06-spark.md
@@ -34,7 +34,7 @@ Key benefits include:
 
 ### 1. Core Site (`core-site.xml`)
 
-For `core-site.xml` configuration, refer to the [Ozone File System (ofs) Configuration section](../01-client-interfaces/02-ofs.md#configuration).
+For `core-site.xml` configuration, refer to the [Ozone File System (ofs) Configuration section](../client-interfaces/ofs#configuration).
 
 ### 2. Spark Configuration (`spark-defaults.conf` or `--conf`)
 
@@ -200,4 +200,4 @@ Replace `YOUR_KUBERNETES_API_SERVER`, `YOUR_REPO`, and `YOUR_NAMESPACE` with you
 
 Spark can also access Ozone through the S3 Gateway using the `s3a://` protocol. This is useful for porting existing cloud-native Spark applications to Ozone without changing application code.
 
-For configuration details, refer to the [S3A documentation](../01-client-interfaces/04-s3a.md).
+For configuration details, refer to the [S3A documentation](../client-interfaces/s3a).