diff --git a/autovec_unstructured/__frontmatter__.md b/autovec_unstructured/__frontmatter__.md deleted file mode 100644 index a33b8fb8..00000000 --- a/autovec_unstructured/__frontmatter__.md +++ /dev/null @@ -1,18 +0,0 @@ ---- -# frontmatter -path: "/tutorial-couchbase-autovectorization-langchain" -title: Auto-Vectorization on Unstructured Data Stored in S3 Buckets Using Couchbase Capella AI Services -short_title: Auto-Vectorization on Unstructured Data Stored in S3 Buckets -description: - - Learn how to use Couchbase Capella's AI Services auto-vectorization feature to automatically convert your unstructured data into vector embeddings. - - This tutorial demonstrates how to set up automated embedding generation workflows and perform semantic search using LangChain. -content_type: tutorial -filter: sdk -technology: - - Artificial Intelligence -tags: - - LangChain -sdk_language: - - python -length: 20 Mins ---- diff --git a/autovec_unstructured/autovec_unstructured.ipynb b/autovec_unstructured/autovec_unstructured.ipynb index 9b647f48..df573de3 100644 --- a/autovec_unstructured/autovec_unstructured.ipynb +++ b/autovec_unstructured/autovec_unstructured.ipynb @@ -1,27 +1,16 @@ { "cells": [ - { - "cell_type": "markdown", - "id": "6f623039", - "metadata": { - "jp-MarkdownHeadingCollapsed": true - }, - "source": [ - "# Auto-Vectorization on Unstructured Data Stored in S3 Buckets Using Couchbase Capella AI Services \n", - "This comprehensive tutorial demonstrates how to use Couchbase Capella's new AI Services Auto-Vectorization feature to automatically convert your unstructured data stored in S3 buckets to import it in Capella and convert it into vector embeddings and perform semantic search using LangChain." - ] - }, { "cell_type": "markdown", "id": "a4d47a8a", "metadata": {}, "source": [ - "# 1. Create and Deploy Your Operational cluster on Capella\n", + "# Create and Deploy Your Operational cluster on Capella\n", "To get started with Couchbase Capella, create an account and use it to deploy a cluster. \n", "\n", "Make sure that you deploy a `Multi-node` cluster with `data`, `index`, `query` and `eventing` services enabled. To know more, please follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).\n", " \n", - "### Couchbase Capella Configuration\n", + "## Couchbase Capella Configuration\n", "When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met.\n", "- Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the bucket you will be using for this tutorial (e.g., `Unstructured_data_bucket`) with Read and Write permissions.\n", "- [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running." @@ -34,9 +23,9 @@ "jp-MarkdownHeadingCollapsed": true }, "source": [ - "# 2. Deploying the Model\n", - "Now, before we actually create embeddings for the documents, we need to deploy a model that will create the embeddings for us.\n", - "## 2.1: Selecting the Model \n", + "# Deploying the Model\n", + "Now, before we actually create embeddings for the documents, we need to deploy a model that will create the embeddings for us. Make sure the model is deployed in the same region as that of database for workflows to work. 
To learn more about the Model Service, click [here](https://docs.couchbase.com/ai/build/model-service/deploy-embed-model.html).\n",
+    "## Selecting the Model \n",
    "1. To select the model, you first need to navigate to the \"AI Services\" tab, then select \"Models\" and click on \"Deploy New Model\".\n",
    " \n",
    " \n",
@@ -45,7 +34,7 @@
    " \n",
    " \n",
    "\n",
-    "## 2.2: Access Control to the Model\n",
+    "## Access Control to the Model\n",
    "\n",
    "1. After deploying the model, go to the \"Models\" tab in the AI Services and click on \"Setup Access\".\n",
    "\n",
@@ -65,7 +54,7 @@
   "id": "e7552113",
   "metadata": {},
   "source": [
-    "# 3. Data upload from S3 bucket to Couchbase (with chunking and vectorization)"
+    "# Data upload from S3 bucket to Couchbase (with chunking and vectorization)"
   ]
  },
  {
@@ -103,9 +92,9 @@
    "6) On selection of the S3 bucket, various options will be displayed as described below.\n",
    "\n",
    " \n",
-    "- `Index Configuration` allows the workflow to **automatically create a Search index** on the generated embeddings. This Search index is essential for performing vector similarity searches. \n",
+    "- `Index Configuration` allows the workflow to **automatically create a Hyperscale Vector Search index** on the generated embeddings. This Vector Search index is essential for performing vector similarity searches. \n",
    "  - If you enable this option (recommended), the workflow will create a properly configured Search index that includes vector field mappings for your embeddings.\n",
-    "  - If you skip this step, you'll need to manually create a Search index later before you can perform vector searches. See the [Search Index Creation Guide](https://docs.couchbase.com/server/current/search/create-search-indexes.html) below for manual setup instructions.\n",
+    "  - If you skip this step, you'll need to manually create a Vector Search index later to perform optimised vector searches. See the [Vector Search Index Creation Guide](https://docs.couchbase.com/server/current/vector-index/vectors-and-indexes-overview.html) for manual setup instructions.\n",
    "- `Destination Cluster` helps choose the cluster, bucket, scope and collection in which the data needs to be imported.\n",
    "- The `Estimated Cost` dialogue box in blue (on the right) will show you the cost of the operation per document.\n",
    "- Click on `Next`.\n",
@@ -126,7 +115,7 @@
    "\n",
    " \n",
    " \n",
-    "  - For this tutorial, Capella-based embedding model is used as can be seen in the image above. API credentials can be uploaded using the file downloaded in `step 2.2` or it can be entered manually as well.\n",
+    "  - For this tutorial, a Capella-based embedding model is used, as can be seen in the image above. API credentials can be uploaded using the file downloaded in the model deployment section, or they can be entered manually.\n",
    "  - You can choose between private and insecure networking.\n",
    "  - Clicking `Next` will take you to the final page of the workflow.\n",
    " \n",
@@ -147,7 +136,7 @@
   "id": "4f7321a7",
   "metadata": {},
   "source": [
-    "# 4. Vector Search Using Couchbase Search Service\n",
+    "# Vector Search Using Couchbase Search Service\n",
    "\n",
    "The following code cells implement semantic vector search against the embeddings generated by the Auto-Vectorization workflow. 
These searches are powered by **Couchbase's Search service**.\n", "\n", @@ -165,7 +154,7 @@ }, "outputs": [], "source": [ - "!pip install langchain-couchbase langchain-openai" + "!pip install langchain-couchbase==1.0.1 langchain-openai" ] }, { @@ -173,8 +162,8 @@ "id": "ea920e0f-bd81-4a74-841a-86a11cb8aec4", "metadata": {}, "source": [ - "`langchain-couchbase - Version: 0.5.0` \\\n", - "`pip install langchain-openai - Version: 0.3.34` \n", + "`langchain-couchbase >= Version: 1.0.1` \\\n", + "`langchain-openai - Version: 0.3.34` \n", "\n", "Now, please proceed to execute the cells in order to run the vector similarity search.\n", "\n", @@ -183,7 +172,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 8, "id": "5e8ba0fc", "metadata": {}, "outputs": [], @@ -193,7 +182,8 @@ "from couchbase.options import ClusterOptions\n", "\n", "from langchain_openai import OpenAIEmbeddings\n", - "from langchain_couchbase.vectorstores import CouchbaseSearchVectorStore\n", + "from langchain_couchbase.vectorstores import CouchbaseQueryVectorStore\n", + "from langchain_couchbase.vectorstores import DistanceStrategy\n", "\n", "from datetime import timedelta" ] @@ -209,20 +199,28 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 9, "id": "f44ea528-1ec1-41ce-90db-bdd0d87b5cff", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Connected!\n" + ] + } + ], "source": [ - "endpoint = \"CLUSTER_CONNECTION_STRING\" # Replace this with Connection String\n", - "username = \"YOUR_USERNAME\" # Replace this with your username\n", - "password = \"YOUR_PASSWORD\" # Replace this with your password\n", - "auth = PasswordAuthenticator(username, password)\n", + "endpoint = \"COUCHBASE_CAPELLA_ENDPOINT\" # Replace this with Connection String\n", + "username = \"COUCHBASE_CAPELLA_USERNAME\"\n", + "password = \"COUCHBASE_CAPELLA_PASSWORD\" \n", "\n", + "auth = PasswordAuthenticator(username, password)\n", "options = ClusterOptions(auth)\n", "cluster = Cluster(endpoint, options)\n", - "\n", - "cluster.wait_until_ready(timedelta(seconds=5))" + "cluster.wait_until_ready(timedelta(seconds=10))\n", + "print(\"Connected!\")" ] }, { @@ -232,9 +230,9 @@ "source": [ "# Selection of Buckets / Scope / Collection / Index / Embedder\n", " - Sets the bucket, scope, and collection where the documents (with vector fields) live.\n", - " - `index_name` specifies the **Capella Search index name**. This is the Search index created automatically during the workflow setup (step 3.6) or manually as described in the same step. You can find this index name in the **Search** tab of your Capella cluster.\n", + " - `index_name` specifies the **Capella Search index name**. This is the Search index created automatically in the workflow setup section or manually as described in the same step. 
You can find this index name in the **Search** tab of your Capella cluster.\n",
    "  - `embedder` instantiates the NVIDIA embedding model that will transform the user's natural language query into a vector at search time.\n",
-    "  - `open_api_key` is the api key token created in `step 2.3`.\n",
+    "  - `openai_api_key` is the API key created in the model deployment section.\n",
    "  - `openai_api_base` is the Capella model services endpoint found in the models section.\n",
    "  - For more details, visit [OpenAIEmbeddings](https://docs.langchain.com/oss/python/integrations/text_embedding/openai).\n",
    "\n",
@@ -243,7 +241,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 35,
   "id": "1d77404b",
   "metadata": {},
   "outputs": [],
@@ -251,13 +249,13 @@
   "source": [
    "bucket_name = \"Unstructured_data_bucket\"\n",
    "scope_name = \"_default\"\n",
    "collection_name = \"_default\"\n",
-    "index_name = \"hyperscale_autovec_workflow_text-to-embed\" # This is the name of the search index that was created in step 3.6 and can also be seen in the search tab of the cluster.\n",
+    "index_name = \"hyperscale_autovec_workflow_text-embedding\" # This is the name of the index created by the Auto-Vectorization workflow; it can also be seen in the Search tab of the cluster.\n",
    " \n",
    "# The Capella model services expose an OpenAI-compatible API, so they work with the OpenAIEmbeddings class in LangChain\n",
    "embedder = OpenAIEmbeddings(\n",
-    "    model=\"nvidia/llama-3.2-nv-embedqa-1b-v2\", # This is the model that will be used to create the embedding of the query.\n",
-    "    openai_api_key=\"CAPELLA_MODEL_KEY\",\n",
-    "    openai_api_base=\"CAPELLA_MODEL_ENDPOINT/v1\",\n",
+    "    model=\"nvidia/llama-3.2-nv-embedqa-1b-v2\", # This is the model that will be used to create the embedding of the query.\n",
+    "    openai_api_key=\"COUCHBASE_CAPELLA_MODEL_API_KEY\",\n",
+    "    openai_api_base=\"COUCHBASE_CAPELLA_MODEL_ENDPOINT/v1\",\n",
    "    check_embedding_ctx_length=False,\n",
    "    tiktoken_enabled=False, \n",
@@ -269,32 +267,31 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# VectorStore Construction\n",
-    " - Creates a [CouchbaseSearchVectorStore](https://couchbase-ecosystem.github.io/langchain-couchbase/langchain_couchbase.html#couchbase-search-vector-store) instance that interfaces with **Couchbase's Search service** to perform vector similarity searches.\n",
+    " - Creates a [CouchbaseQueryVectorStore](https://couchbase-ecosystem.github.io/langchain-couchbase/langchain_couchbase.html#couchbase-query-vector-store) instance that interfaces with **Couchbase's Query service** to perform vector similarity searches using [Hyperscale/Composite](https://docs.couchbase.com/cloud/vector-index/use-vector-indexes.html) indexes. \n",
    " - The vector store:\n",
    "   * Knows where to read documents (`bucket/scope/collection`).\n",
    "   * Knows the embedding field (the vector produced by the Auto-Vectorization workflow).\n",
    "   * Uses the provided embedder to embed queries on-demand for similarity search.\n",
    " - If your Auto-Vectorization workflow produced a different vector field name, update `embedding_key` accordingly.\n",
    " - If you mapped multiple fields into a single vector, you can choose any representative field for `text_key`, or modify the VectorStore wrapper to concatenate fields.\n",
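    "\n",
    "As a minimal usage sketch (assuming the `vector_store` defined in the next cell is constructed successfully; the query string here is only an illustrative example), the store can also be wrapped as a standard LangChain retriever:\n",
    "\n",
    "```python\n",
    "# Sketch: expose the vector store as a LangChain retriever for RAG-style pipelines\n",
    "retriever = vector_store.as_retriever(search_kwargs={\"k\": 3})  # top-3 nearest chunks\n",
    "docs = retriever.invoke(\"How do I set up the Java SDK?\")        # embeds the query, then searches\n",
    "for doc in docs:\n",
    "    print(doc.page_content)\n",
    "```"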
]
  },
  {
   "cell_type": "code",
-   "execution_count": 8,
+   "execution_count": 11,
   "id": "8efd0e80",
   "metadata": {},
   "outputs": [],
   "source": [
-    "vector_store = CouchbaseSearchVectorStore(\n",
+    "vector_store = CouchbaseQueryVectorStore(\n",
    "    cluster=cluster,\n",
    "    bucket_name=bucket_name,\n",
    "    scope_name=scope_name,\n",
    "    collection_name=collection_name,\n",
    "    embedding=embedder,\n",
-    "    index_name=index_name,\n",
    "    text_key=\"text-to-embed\",      # Your document's text field\n",
-    "    embedding_key=\"text-embedding\" # This is the field in which your vector (embedding) is stored in the cluster.\n",
+    "    embedding_key=\"text-embedding\",           # This is the field in which your vector (embedding) is stored in the cluster.\n",
+    "    distance_metric=DistanceStrategy.COSINE   # Distance metric used to compare the query embedding with the stored vectors.\n",
    ")"
   ]
  },
@@ -304,8 +301,8 @@
   "metadata": {},
   "source": [
    "# Performing a Similarity Search\n",
-    "  - Defines a natural language query (e.g., \"How to setup java SDK?\").\n",
-    "  - Calls `similarity_search_with_score(k=3)` to retrieve the top 3 most semantically similar documents using **Couchbase's Search service**.\n",
+    "  - Defines a natural language query (e.g., \"What are the pre-requisite for java SDK?\").\n",
+    "  - Calls `similarity_search(query, k=3)` to retrieve the top 3 most semantically similar documents using **Couchbase's Hyperscale Vector Search** service.\n",
    "  - The Search service performs efficient vector similarity search using the index created earlier.\n",
    "  - Prints ranked results, extracting the chosen `text_key` (here `text-to-embed`).\n",
    "  - Change `query` to any descriptive phrase related to the imported data (e.g., \"connect the Java SDK to a cluster\").\n",
    "  - Re-run the cell to see updated results."
   ]
  },
@@ -314,7 +311,7 @@
  {
   "cell_type": "code",
-   "execution_count": 11,
+   "execution_count": 34,
   "id": "eb87c6e6",
   "metadata": {},
@@ -322,22 +319,27 @@
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
-      "1. — Score: 0.8052 — Content: Section Title: Set Up the Java SDK\n",
-      "Content: Run the command mvn install to pull in all the dependencies and finish your SDK setup.\n",
-      "2. — Score: 0.7971 — Content: Section Title: Set Up the Java SDK\n",
-      "Content: To set up the Java SDK: Create the following directory structure on your computer: In the student directory, create a new file called pom. xml. Paste the following code block into your pom. xm1 file: Open a terminal window and navigate to your student directory.\n",
-      "3. — Score: 0.7745 — Content: Section Title: Prerequisites\n",
-      "Content: e You have installed the Java Software Development Kit (version 8, 11, 17, or 21). o The recommended version is the latest Java LTS release. Make sure to install the highest available patch for the LTS version.\n"
+      "\n",
+      "--- Result 1 ---\n",
+      "Section Title: Prerequisites\n",
+      "Content: You have installed the Java Software Development Kit (version 8, 11, 17, or 21). The recommended version is the latest Java LTS release. Make sure to install the highest available patch for the LTS version.\n",
+      "\n",
+      "--- Result 2 ---\n",
+      "Section Title: Connect the SDK to Your Cluster\n",
+      "Content: Important: directory whenever you\n",
+      "\n",
+      "--- Result 3 ---\n",
+      "Section Title: Set Up the Java SDK\n",
+      "Content: To set up the Java SDK: Create the following directory structure on your computer: In the student directory, create a new file called pom.xml . Paste the following code block into your pom.xml file: Open a terminal window and navigate to your student directory. Run the command mvn install to pull in all the dependencies and finish your SDK setup. Next, connect the Java SDK to your cluster.\n"
     ]
    }
   ],
   "source": [
-    "query = \"How to setup java SDK?\"\n",
-    "results = vector_store.similarity_search_with_score(query, k=3)\n",
-    "\n",
-    "for rank, (doc, score) in enumerate(results, start=1):\n",
-    "    text = getattr(doc, \"page_content\", None)\n",
-    "    print(f\"{rank}. — Score: {score:.4f} — Content: {text}\")\n"
+    "query = \"What are the pre-requisite for java SDK?\"\n",
+    "results = vector_store.similarity_search(query, k=3)\n",
+    "for i, doc in enumerate(results, 1):\n",
+    "    print(f\"\\n--- Result {i} ---\")\n",
+    "    print(doc.page_content)"
   ]
  },
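  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an optional check (a minimal sketch, not part of the original workflow): the store also supports the standard LangChain `similarity_search_with_score` variant, which returns each document together with its raw distance/similarity value. The exact score semantics depend on the configured `distance_metric`, so treat the numbers as relative rather than absolute.\n",
    "\n",
    "```python\n",
    "# Sketch: retrieve the same top-3 results together with their scores\n",
    "results_with_scores = vector_store.similarity_search_with_score(query, k=3)\n",
    "for doc, score in results_with_scores:\n",
    "    print(f\"score={score:.4f} :: {doc.page_content[:80]}\")\n",
    "```"
   ]
  },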
@@ -345,17 +347,9 @@
  {
   "cell_type": "markdown",
   "id": "b5ab91ee",
   "metadata": {},
   "source": [
-    "# Results and Interpretation\n",
-    "\n",
-    "As we can see, 3 (or `k`) ranked results are printed in the output.\n",
-    "\n",
-    "### What Each Part Means\n",
-    "- Leading number (1, 2, 3): The result rank (1 = most similar to your query).\n",
-    "- Content text: This is the value of the field you configured as `text_key` (in this tutorial: `text-to-embed`). It represents the human-readable content we chose to display.\n",
-    "\n",
    "### How the Ranking Works with Search Service\n",
    "1. Your natural language query (e.g., `query = \"What are the pre-requisite for java SDK?\"`) is embedded using the NVIDIA model (`nvidia/llama-3.2-nv-embedqa-1b-v2`).\n",
-    "2. The vector store compares the query embedding to stored document embeddings in the field you configured (`embedding_key = \"text-embedding\"`).\n",
+    "2. The query embedding is compared against the stored document embeddings in the `embedding_key` field (`\"text-embedding\"`).\n",
    "3. Results are sorted by vector similarity. Higher similarity = closer semantic meaning.\n",
    "\n",
    "\n",
@@ -379,7 +373,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.13.7"
+   "version": "3.13.10"
  }
 },
 "nbformat": 4,
diff --git a/autovec_unstructured/frontmatter.md b/autovec_unstructured/frontmatter.md
new file mode 100644
index 00000000..8e34b003
--- /dev/null
+++ b/autovec_unstructured/frontmatter.md
@@ -0,0 +1,21 @@
+---
+# frontmatter
+path: "/tutorial-couchbase-autovectorization-workflows-with-unstructured-data-and-langchain"
+title: Auto-Vectorization on Unstructured Data Stored in S3 Buckets Using Couchbase Capella AI Services
+short_title: Auto-Vectorization on Unstructured Data Stored in S3 Buckets
+description:
+  - Learn how to use Couchbase Capella's AI Services Auto-Vectorization feature to automatically process unstructured data from S3 buckets.
+  - Configure workflows to chunk and vectorize documents (PDFs, images, etc.) and import them into Capella collections.
+  - Perform semantic vector search using LangChain and the generated embeddings.
+content_type: tutorial
+filter: sdk
+technology:
+  - vector search
+tags:
+  - Hyperscale Vector Index
+  - Artificial Intelligence
+  - LangChain
+sdk_language:
+  - python
+length: 20 Mins
+---