
Commit e511a66

Create new example notebook for Sessions + Spark
1 parent 54b2745 commit e511a66

File tree: 1 file changed, +362 −0

Lines changed: 362 additions & 0 deletions
@@ -0,0 +1,362 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "aura"
    ]
   },
   "source": [
    "# Aura Graph Analytics with Spark"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "<a target=\"_blank\" href=\"https://colab.research.google.com/github/neo4j/graph-data-science-client/blob/main/examples/graph-analytics-serverless.ipynb\">\n",
    "  <img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/>\n",
    "</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This Jupyter notebook is hosted [here](https://github.com/neo4j/graph-data-science-client/blob/main/examples/graph-analytics-serverless.ipynb) in the Neo4j Graph Data Science Client GitHub repository.\n",
    "\n",
    "The notebook shows how to use the `graphdatascience` Python library to create, manage, and use a GDS Session together with Apache Spark.\n",
    "\n",
    "We use a graph of bike trips between stations as a simple example to show how to project data from a Spark DataFrame into a GDS Session, run algorithms, and eventually stream the analytical results back into Spark.\n",
    "We will cover all management operations: creation, listing, and deletion.\n",
    "\n",
    "If you are using a self-managed database, follow [this example](../graph-analytics-serverless-self-managed)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prerequisites\n",
    "\n",
    "This notebook requires the Aura Graph Analytics [feature](https://neo4j.com/docs/aura/graph-analytics/#aura-gds-serverless) to be enabled for your Neo4j Aura project.\n",
    "\n",
    "You also need to have the `graphdatascience` Python library installed, version `1.18a2` or later, as well as `pyspark`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install \"graphdatascience>=1.18a2\" python-dotenv \"pyspark[sql]\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from dotenv import load_dotenv\n",
    "\n",
    "# This loads required secrets from the `sessions.env` file in the local directory.\n",
    "# The file can include Aura API credentials and database credentials.\n",
    "# If the file does not exist, this is a no-op.\n",
    "load_dotenv(\"sessions.env\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Connecting to a Spark Session\n",
    "\n",
    "To interact with the Spark cluster, we first need to instantiate a Spark session. In this example we use a local Spark session, which runs Spark on the same machine.\n",
    "Working with a remote Spark cluster works similarly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "from pyspark.sql import SparkSession\n",
    "\n",
    "# If necessary, point JAVA_HOME to your local Java installation, for example:\n",
    "# os.environ[\"JAVA_HOME\"] = \"/path/to/your/java\"\n",
    "\n",
    "spark = SparkSession.builder.master(\"local[4]\").appName(\"GraphAnalytics\").getOrCreate()\n",
    "\n",
    "# Enable Arrow-based columnar data transfers\n",
    "spark.conf.set(\"spark.sql.execution.arrow.pyspark.enabled\", \"true\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Aura API credentials\n",
    "\n",
    "The entry point for managing GDS Sessions is the `GdsSessions` object, which requires creating [Aura API credentials](https://neo4j.com/docs/aura/api/authentication)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "from graphdatascience.session import AuraAPICredentials, GdsSessions\n",
    "\n",
    "# You can also use AuraAPICredentials.from_env() to load credentials from environment variables\n",
    "api_credentials = AuraAPICredentials(\n",
    "    client_id=os.environ[\"CLIENT_ID\"],\n",
    "    client_secret=os.environ[\"CLIENT_SECRET\"],\n",
    "    # If your account is a member of several projects, you must also specify the project ID to use\n",
    "    project_id=os.environ.get(\"PROJECT_ID\", None),\n",
    ")\n",
    "\n",
    "sessions = GdsSessions(api_credentials=api_credentials)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating a new session\n",
    "\n",
    "A new session is created by calling `sessions.get_or_create()` with the following parameters:\n",
    "\n",
    "* A session name, which lets you reconnect to an existing session by calling `get_or_create` again.\n",
    "* The session memory.\n",
    "* The cloud location.\n",
    "* A time-to-live (TTL), which ensures that the session is automatically deleted after being unused for the set time, to avoid incurring costs.\n",
    "\n",
    "See the API reference [documentation](https://neo4j.com/docs/graph-data-science-client/current/api/sessions/gds_sessions/#graphdatascience.session.gds_sessions.GdsSessions.get_or_create) or the manual for more details on the parameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from datetime import timedelta\n",
    "\n",
    "from graphdatascience.session import CloudLocation, SessionMemory\n",
    "\n",
    "# Create a GDS session!\n",
    "gds = sessions.get_or_create(\n",
    "    # we give it a representative name\n",
    "    session_name=\"bike_trips_spark\",\n",
    "    memory=SessionMemory.m_2GB,\n",
    "    ttl=timedelta(minutes=30),\n",
    "    cloud_location=CloudLocation(\"gcp\", \"europe-west1\"),\n",
    ")"
   ]
  },
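  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Listing sessions\n",
    "\n",
    "To see which sessions currently exist for the project, we can list them. This is a minimal sketch of the listing operation mentioned in the introduction; the exact information returned depends on the client version."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# List the GDS Sessions visible to the current Aura API credentials\n",
    "for session in sessions.list():\n",
    "    print(session)"
   ]
  },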
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Adding a dataset\n",
    "\n",
    "As the next step, we will set up a dataset in Spark. In this example we will use the New York Bike trips dataset (https://www.kaggle.com/datasets/gabrielramos87/bike-trips)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import io\n",
    "import os\n",
    "import zipfile\n",
    "\n",
    "import requests\n",
    "\n",
    "download_path = \"bike_trips_data\"\n",
    "if not os.path.exists(download_path):\n",
    "    url = \"https://www.kaggle.com/api/v1/datasets/download/gabrielramos87/bike-trips\"\n",
    "\n",
    "    response = requests.get(url)\n",
    "    response.raise_for_status()\n",
    "\n",
    "    # Unzip the content\n",
    "    with zipfile.ZipFile(io.BytesIO(response.content)) as z:\n",
    "        z.extractall(download_path)\n",
    "\n",
    "df = spark.read.csv(download_path, header=True, inferSchema=True)\n",
    "df.createOrReplaceTempView(\"bike_trips\")\n",
    "df.limit(10).show()"
   ]
  },
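  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an optional sanity check (not strictly required), we can inspect the schema that Spark inferred for the CSV files. The projection below assumes that the dataset contains `start_station_id` and `end_station_id` columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Print the inferred schema to verify the station id columns used for the projection\n",
    "df.printSchema()"
   ]
  },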
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Projecting Graphs\n",
    "\n",
    "Now that we have our dataset available within our Spark session, it is time to project it to the GDS Session.\n",
    "\n",
    "We first need to get access to the GDSArrowClient via `gds.arrow_client()`. This client allows us to communicate directly with the Arrow Flight server provided by the session.\n",
    "\n",
    "Our input data already resembles edge triplets, where each row represents an edge from a source station to a target station. This allows us to use the Arrow server's import-from-triplets functionality, which requires the following protocol:\n",
    "\n",
    "1. Send an action `v2/graph.project.fromTriplets`.\n",
    "   This initializes the import process and allows us to specify the graph name, as well as settings like `undirected_relationship_types`. It returns a job id that we need to reference the import job in the following steps.\n",
    "2. Send the data in batches to the Arrow server.\n",
    "3. Send another action called `v2/graph.project.fromTriplets.done` to tell the import process that no more data will be sent. This triggers the final graph creation inside the session.\n",
    "4. Wait for the import process to reach the `DONE` state.\n",
    "\n",
    "While the overall process is straightforward, we need to tell Spark to run the upload on its workers. We do this with the `mapInArrow` function, which hands the data to our upload function as Arrow record batches that we can forward to the session."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "\n",
    "import pandas as pd\n",
    "import pyarrow\n",
    "from pyspark.sql import functions\n",
    "\n",
    "graph_name = \"bike_trips\"\n",
    "\n",
    "arrow_client = gds.arrow_client()\n",
    "\n",
    "# 1. Start the import process\n",
    "job_id = arrow_client.create_graph_from_triplets(graph_name, concurrency=4)\n",
    "\n",
    "\n",
    "# Define a function that receives Arrow record batches and uploads them to the session\n",
    "def upload_batch(iterator):\n",
    "    for batch in iterator:\n",
    "        arrow_client.upload_triplets(job_id, [batch])\n",
    "        yield pyarrow.RecordBatch.from_pandas(pd.DataFrame({\"batch_rows_imported\": [len(batch)]}))\n",
    "\n",
    "\n",
    "# Select the source/target pairs from our source data\n",
    "source_target_pairs = spark.sql(\"\"\"\n",
    "    SELECT start_station_id AS sourceNode, end_station_id AS targetNode\n",
    "    FROM bike_trips\n",
    "\"\"\")\n",
    "\n",
    "# 2. Use the `mapInArrow` function to upload the data to the session. Returns a DataFrame with a single column containing the batch sizes.\n",
    "uploaded_batches = source_target_pairs.mapInArrow(upload_batch, \"batch_rows_imported long\")\n",
    "\n",
    "# Aggregate the batch sizes to obtain the total row count.\n",
    "uploaded_batches.agg(functions.sum(\"batch_rows_imported\").alias(\"rows_imported\")).show()\n",
    "\n",
    "# 3. Finish the import process\n",
    "arrow_client.triplet_load_done(job_id)\n",
    "\n",
    "# 4. Wait for the import to finish, polling the job status periodically\n",
    "while not arrow_client.job_status(job_id).succeeded():\n",
    "    time.sleep(1)\n",
    "\n",
    "G = gds.v2.graph.get(graph_name)\n",
    "G"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Running Algorithms\n",
    "\n",
    "We can run algorithms on the constructed graph using the standard GDS Python Client API. See the other tutorials for more examples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Running PageRank ...\")\n",
    "pr_result = gds.v2.page_rank.mutate(G, mutate_property=\"pagerank\")"
   ]
  },
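  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `mutate` call returns a summary of the computation. Displaying it is an optional check; the exact fields depend on the client version."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional: inspect the summary returned by the PageRank mutate call\n",
    "pr_result"
   ]
  },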
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sending the computation result back to Spark\n",
    "\n",
    "Once the computation is done, we might want to further use the result in Spark.\n",
    "We can do this similarly to the projection, by streaming batches of data into each of the Spark workers.\n",
    "Retrieving the data is a bit more complicated, since we need some input DataFrame in order to trigger computations on the Spark workers.\n",
    "We use a range equal to the number of workers in our cluster as our driving table.\n",
    "On the workers, we disregard the input and instead stream the computation results from the GDS Session."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 1. Start the node property export on the session\n",
    "job_id = arrow_client.get_node_properties(G.name(), [\"pagerank\"])\n",
    "\n",
    "\n",
    "# Define a function that receives data from the GDS Session and turns it into data batches\n",
    "def retrieve_data(ignored):\n",
    "    stream_data = arrow_client.stream_job(G.name(), job_id)\n",
    "    batches = pyarrow.Table.from_pandas(stream_data).to_batches(1000)\n",
    "    for b in batches:\n",
    "        yield b\n",
    "\n",
    "\n",
    "# Create DataFrame with a single column and one row per worker\n",
    "input_partitions = spark.range(spark.sparkContext.defaultParallelism).toDF(\"batch_id\")\n",
    "# 2. Stream the data from the GDS Session into the Spark workers\n",
    "received_batches = input_partitions.mapInArrow(retrieve_data, \"nodeId long, pagerank double\")\n",
    "# Optional: Repartition the data to make sure it is distributed equally\n",
    "result = received_batches.repartition(numPartitions=spark.sparkContext.defaultParallelism)\n",
    "\n",
    "result.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Cleanup\n",
    "\n",
    "Now that we have finished our analysis, we can delete the GDS Session and stop the Spark session.\n",
    "\n",
    "Deleting the session will release all resources associated with it, and stop incurring costs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "gds.delete()\n",
    "spark.stop()"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
