Add VertexAI Text vectorizer (#52)

tylerhutcherson · web-flow · commit 1e97e3cca6ec · 2023-09-06T23:03:56.000-04:00
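
Reviewer note: the batching flow of the new `embed_many` can be exercised offline, without GCP credentials. The snippet below is a minimal sketch, not the code in this commit. `FakeEmbedding`, `FakeEmbeddingClient`, and the standalone `batchify` and `embed_many` functions are hypothetical stand-ins. They only mirror how the diff uses `get_embeddings` (a list of strings in, a list of objects with a `.values` attribute out), and the slicing behaviour of `batchify` is an assumption about the base-class helper.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Optional


@dataclass
class FakeEmbedding:
    # Hypothetical response object; the diff reads `.values` from each item.
    values: List[float]


class FakeEmbeddingClient:
    """Hypothetical offline stand-in for the Vertex AI embedding client."""

    def __init__(self) -> None:
        self.calls: List[List[str]] = []

    def get_embeddings(self, texts: List[str]) -> List[FakeEmbedding]:
        # Record each request so the batching can be inspected afterwards.
        self.calls.append(list(texts))
        # Toy deterministic vector: [character count, word count].
        return [FakeEmbedding([float(len(t)), float(len(t.split()))]) for t in texts]


def batchify(
    texts: List[str], batch_size: int, preprocess: Optional[Callable] = None
) -> Iterator[List[str]]:
    # Assumed behaviour of the base-class helper: fixed-size slices,
    # with the optional preprocess callable applied to every text.
    for start in range(0, len(texts), batch_size):
        batch = texts[start : start + batch_size]
        yield [preprocess(t) for t in batch] if preprocess else batch


def embed_many(
    client: FakeEmbeddingClient,
    texts: List[str],
    preprocess: Optional[Callable] = None,
    batch_size: int = 10,
) -> List[List[float]]:
    # Same input checks as in the diff: a list whose first item is a str.
    if not isinstance(texts, list):
        raise TypeError("Must pass in a list of str values to embed.")
    if len(texts) > 0 and not isinstance(texts[0], str):
        raise TypeError("Must pass in a list of str values to embed.")

    embeddings: List[List[float]] = []
    for batch in batchify(texts, batch_size, preprocess):
        response = client.get_embeddings(batch)
        embeddings += [r.values for r in response]
    return embeddings


client = FakeEmbeddingClient()
vectors = embed_many(client, ["a b", "cc", "d e f"], batch_size=2)
```

With three texts and `batch_size=2`, the fake client receives two requests, which is the behaviour the retry decorator in the diff wraps as a whole: a failure in a later batch would re-run every batch, not just the failing one.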
diff --git a/.github/workflows/run_tests.yml b/.github/workflows/run_tests.yml
@@ -35,14 +35,22 @@ jobs:
       run: |
         REDIS_URL=redis://localhost:6379
         echo REDIS_URL=$REDIS_URL >> $GITHUB_ENV
+    - name: Authenticate to Google Cloud
+      uses: google-github-actions/auth@v1
+      with:
+        credentials_json: ${{ secrets.GOOGLE_CREDENTIALS }}
     - name: Run tests
       env:
         OPENAI_API_KEY: ${{ secrets.OPENAI_KEY }}
+        GCP_LOCATION: ${{ secrets.GCP_LOCATION }}
+        GCP_PROJECT_ID: ${{ secrets.GCP_PROJECT_ID }}
       run: |
         make test-cov
     - name: Run notebooks
       env:
         OPENAI_API_KEY: ${{ secrets.OPENAI_KEY }}
+        GCP_LOCATION: ${{ secrets.GCP_LOCATION }}
+        GCP_PROJECT_ID: ${{ secrets.GCP_PROJECT_ID }}
       run: |
         cd docs/ && treon -v --exclude="./examples/openai_qna.ipynb"
     - name: Publish coverage results
diff --git a/docs/api/vectorizer.rst b/docs/api/vectorizer.rst
@@ -43,4 +43,21 @@ OpenAITextVectorizer
    :members:
 
 
+VertexAITextVectorizer
+======================
+
+.. _vertexaitextvectorizer_api:
+
+.. currentmodule:: redisvl.vectorize.text.vertexai
+
+.. autosummary::
+
+    VertexAITextVectorizer.__init__
+    VertexAITextVectorizer.embed
+    VertexAITextVectorizer.embed_many
+
+.. autoclass:: VertexAITextVectorizer
+   :show-inheritance:
+   :inherited-members:
+   :members:
 
diff --git a/docs/user_guide/vectorizers_03.ipynb b/docs/user_guide/vectorizers_03.ipynb
@@ -10,18 +10,19 @@
     "In this notebook, we will show how to use RedisVL to create embeddings using the built-in text embedding vectorizers. Today RedisVL supports:\n",
     "1. OpenAI\n",
     "2. HuggingFace\n",
+    "3. Vertex AI\n",
     "\n",
     "Before running this notebook, be sure to\n",
     "1. Have installed ``redisvl`` and have that environment active for this notebook.\n",
-    "2. Have a running Redis instance with RediSearch > 2.4 running.\n",
+    "2. Have a running Redis Stack instance with RediSearch > 2.4 active.\n",
     "\n",
-    "For example, you can run Redis locally with Docker:\n",
+    "For example, you can run Redis Stack locally with Docker:\n",
     "\n",
     "```bash\n",
     "docker run -d -p 6379:6379 -p 8001:8001 redis/redis-stack:latest\n",
     "```\n",
     "\n",
-    "which will run Redis on port 6379 and RedisInsight at http://localhost:8001."
+    "This will run Redis on port 6379 and RedisInsight at http://localhost:8001."
    ]
   },
   {
@@ -107,6 +108,7 @@
    "source": [
     "from redisvl.vectorize.text import OpenAITextVectorizer\n",
     "\n",
+    "# create a vectorizer\n",
     "oai = OpenAITextVectorizer(\n",
     "    model=\"text-embedding-ada-002\",\n",
     "    api_config={\"api_key\": api_key},\n",
@@ -179,7 +181,7 @@
    "source": [
     "### Huggingface\n",
     "\n",
-    "Huggingface is a popular NLP library that has a number of pre-trained models. RedisVL supports using Huggingface to create embeddings from these models. To use Huggingface, you will need to install the ``sentence-transformers`` library.\n",
+    "[Huggingface](https://huggingface.co/models) is a popular NLP platform that has a number of pre-trained models you can use off the shelf. RedisVL supports using Huggingface \"Sentence Transformers\" to create embeddings from text. To use Huggingface, you will need to install the ``sentence-transformers`` library.\n",
     "\n",
     "```bash\n",
     "pip install sentence-transformers\n",
@@ -188,43 +190,16 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 6,
+   "execution_count": null,
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "/Users/sam.partee/.virtualenvs/rvl/lib/python3.8/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
-      "  from .autonotebook import tqdm as notebook_tqdm\n"
-     ]
-    },
-    {
-     "data": {
-      "text/plain": [
-       "[0.00037813105154782534,\n",
-       " -0.05080341547727585,\n",
-       " -0.03514720872044563,\n",
-       " -0.023251093924045563,\n",
-       " -0.04415826499462128,\n",
-       " 0.020487893372774124,\n",
-       " 0.0014619074063375592,\n",
-       " 0.03126181662082672,\n",
-       " 0.056051574647426605,\n",
-       " 0.0188154224306345]"
-      ]
-     },
-     "execution_count": 6,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
+   "outputs": [],
    "source": [
     "os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\n",
     "from redisvl.vectorize.text import HFTextVectorizer\n",
     "\n",
     "\n",
-    "# create a provider\n",
+    "# create a vectorizer\n",
+    "# choose your model from the huggingface website\n",
     "hf = HFTextVectorizer(model=\"sentence-transformers/all-mpnet-base-v2\")\n",
     "\n",
     "# embed a sentence\n",
@@ -242,6 +217,44 @@
     "embeddings = hf.embed_many(sentences, as_buffer=True)\n"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### VertexAI\n",
+    "\n",
+    "[VertexAI](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings) is GCP's fully-featured AI platform including a number of pretrained LLMs. RedisVL supports using VertexAI to create embeddings from these models. To use VertexAI, you will first need to install the ``google-cloud-aiplatform`` library.\n",
+    "\n",
+    "```bash\n",
+    "pip install \"google-cloud-aiplatform>=1.26\"\n",
+    "```\n",
+    "\n",
+    "1. Then you need to gain access to a [Google Cloud Project](https://cloud.google.com/gcp?hl=en) and provide [access to credentials](https://cloud.google.com/docs/authentication/application-default-credentials). This is typically accomplished with the `GOOGLE_APPLICATION_CREDENTIALS` environment variable pointing to the path of a JSON key file downloaded from your service account on GCP.\n",
+    "2. Lastly, you need to find your [project ID](https://support.google.com/googleapi/answer/7014113?hl=en) and [geographic region for VertexAI](https://cloud.google.com/vertex-ai/docs/general/locations)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from redisvl.vectorize.text import VertexAITextVectorizer\n",
+    "\n",
+    "\n",
+    "# create a vectorizer\n",
+    "vtx = VertexAITextVectorizer(\n",
+    "    api_config={\n",
+    "        \"project_id\": os.environ[\"GCP_PROJECT_ID\"],\n",
+    "        \"location\": os.environ[\"GCP_LOCATION\"]\n",
+    "    }\n",
+    ")\n",
+    "\n",
+    "# embed a sentence\n",
+    "test = vtx.embed(\"This is a test sentence.\")\n",
+    "test[:10]"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -377,7 +390,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.10"
+   "version": "3.9.12"
   },
   "orig_nbformat": 4,
   "vscode": {
diff --git a/redisvl/vectorize/text/__init__.py b/redisvl/vectorize/text/__init__.py
@@ -1,7 +1,5 @@
 from redisvl.vectorize.text.huggingface import HFTextVectorizer
 from redisvl.vectorize.text.openai import OpenAITextVectorizer
+from redisvl.vectorize.text.vertexai import VertexAITextVectorizer
 
-__all__ = [
-    "OpenAITextVectorizer",
-    "HFTextVectorizer",
-]
+__all__ = ["OpenAITextVectorizer", "HFTextVectorizer", "VertexAITextVectorizer"]
diff --git a/redisvl/vectorize/text/openai.py b/redisvl/vectorize/text/openai.py
@@ -80,7 +80,7 @@ def embed_many(
                 perform before vectorization. Defaults to None.
             batch_size (int, optional): Batch size of texts to use when creating
                 embeddings. Defaults to 10.
-            as_buffer (Optional[float], optional): Whether to convert the raw embedding
+            as_buffer (bool, optional): Whether to convert the raw embedding
                 to a byte string. Defaults to False.
 
         Returns:
@@ -197,7 +197,7 @@ async def aembed(
             text (str): Chunk of text to embed.
             preprocess (Optional[Callable], optional): Optional preprocessing callable to
                 perform before vectorization. Defaults to None.
-            as_buffer (float, optional): Whether to convert the raw embedding
+            as_buffer (bool, optional): Whether to convert the raw embedding
                 to a byte string. Defaults to False.
 
         Returns:
diff --git a/redisvl/vectorize/text/vertexai.py b/redisvl/vectorize/text/vertexai.py
@@ -0,0 +1,141 @@
+from typing import Callable, Dict, List, Optional
+
+from tenacity import retry, stop_after_attempt, wait_random_exponential
+from tenacity.retry import retry_if_not_exception_type
+
+from redisvl.vectorize.base import BaseVectorizer
+
+
+class VertexAITextVectorizer(BaseVectorizer):
+    """VertexAI text vectorizer
+
+    This vectorizer uses the VertexAI PaLM 2 embedding model API to create embeddings for text. It requires an
+    active GCP project, location, and application credentials.
+    """
+
+    def __init__(
+        self, model: str = "textembedding-gecko", api_config: Optional[Dict] = None
+    ):
+        """Initialize the VertexAI vectorizer.
+
+        Args:
+            model (str): Model to use for embedding.
+            api_config (Optional[Dict], optional): Dictionary containing the GCP
+                "project_id" and "location". Defaults to None.
+
+        Raises:
+            ImportError: If the google-cloud-aiplatform library is not installed.
+            ValueError: If the GCP project id or location is not provided.
+        """
+        super().__init__(model)
+
+        if (
+            not api_config
+            or "project_id" not in api_config
+            or "location" not in api_config
+        ):
+            raise ValueError(
+                "GCP project id and valid location are required in the api_config"
+            )
+
+        try:
+            import vertexai
+            from vertexai.preview.language_models import TextEmbeddingModel
+
+            vertexai.init(
+                project=api_config["project_id"], location=api_config["location"]
+            )
+        except ImportError:
+            raise ImportError(
+                "VertexAI vectorizer requires the google-cloud-aiplatform library. "
+                "Please install with pip install 'google-cloud-aiplatform>=1.26'"
+            )
+
+        self._model_client = TextEmbeddingModel.from_pretrained(model)
+        self._dims = self._set_model_dims()
+
+    def _set_model_dims(self) -> int:
+        try:
+            embedding = self._model_client.get_embeddings(["dimension test"])[0].values
+        except (KeyError, IndexError) as ke:
+            raise ValueError(f"Unexpected response from the VertexAI API: {str(ke)}")
+        except Exception as e:  # pylint: disable=broad-except
+            # fall back (TODO get more specific)
+            raise ValueError(f"Error setting embedding model dimensions: {str(e)}")
+        return len(embedding)
+
+    @retry(
+        wait=wait_random_exponential(min=1, max=60),
+        stop=stop_after_attempt(6),
+        retry=retry_if_not_exception_type(TypeError),
+    )
+    def embed_many(
+        self,
+        texts: List[str],
+        preprocess: Optional[Callable] = None,
+        batch_size: int = 10,
+        as_buffer: bool = False,
+    ) -> List[List[float]]:
+        """Embed many chunks of texts using the VertexAI API.
+
+        Args:
+            texts (List[str]): List of text chunks to embed.
+            preprocess (Optional[Callable], optional): Optional preprocessing callable to
+                perform before vectorization. Defaults to None.
+            batch_size (int, optional): Batch size of texts to use when creating
+                embeddings. Defaults to 10.
+            as_buffer (bool, optional): Whether to convert the raw embedding
+                to a byte string. Defaults to False.
+
+        Returns:
+            List[List[float]]: List of embeddings.
+
+        Raises:
+            TypeError: If the wrong input type is passed in for the text.
+        """
+        if not isinstance(texts, list):
+            raise TypeError("Must pass in a list of str values to embed.")
+        if len(texts) > 0 and not isinstance(texts[0], str):
+            raise TypeError("Must pass in a list of str values to embed.")
+
+        embeddings: List = []
+        for batch in self.batchify(texts, batch_size, preprocess):
+            response = self._model_client.get_embeddings(batch)
+            embeddings += [
+                self._process_embedding(r.values, as_buffer) for r in response
+            ]
+        return embeddings
+
+    @retry(
+        wait=wait_random_exponential(min=1, max=60),
+        stop=stop_after_attempt(6),
+        retry=retry_if_not_exception_type(TypeError),
+    )
+    def embed(
+        self,
+        text: str,
+        preprocess: Optional[Callable] = None,
+        as_buffer: bool = False,
+    ) -> List[float]:
+        """Embed a chunk of text using the VertexAI API.
+
+        Args:
+            text (str): Chunk of text to embed.
+            preprocess (Optional[Callable], optional): Optional preprocessing callable to
+                perform before vectorization. Defaults to None.
+            as_buffer (bool, optional): Whether to convert the raw embedding
+                to a byte string. Defaults to False.
+
+        Returns:
+            List[float]: Embedding.
+
+        Raises:
+            TypeError: If the wrong input type is passed in for the text.
+        """
+        if not isinstance(text, str):
+            raise TypeError("Must pass in a str value to embed.")
+
+        if preprocess:
+            text = preprocess(text)
+        result = self._model_client.get_embeddings([text])
+        return self._process_embedding(result[0].values, as_buffer)
diff --git a/setup.py b/setup.py
@@ -14,7 +14,8 @@ def read_dev_requirements():
 extras_require = {
     "all": [
         "openai>=0.26.4",
-        "sentence-transformers>=2.2.2"
+        "sentence-transformers>=2.2.2",
+        "google-cloud-aiplatform>=1.26"
     ],
     "dev": read_dev_requirements()
 }
diff --git a/tests/integration/test_vectorizers.py b/tests/integration/test_vectorizers.py
@@ -2,10 +2,14 @@
 
 import pytest
 
-from redisvl.vectorize.text import HFTextVectorizer, OpenAITextVectorizer
+from redisvl.vectorize.text import (
+    HFTextVectorizer,
+    OpenAITextVectorizer,
+    VertexAITextVectorizer,
+)
 
 
-@pytest.fixture(params=[HFTextVectorizer, OpenAITextVectorizer])
+@pytest.fixture(params=[HFTextVectorizer, OpenAITextVectorizer, VertexAITextVectorizer])
 def vectorizer(request, openai_key):
     # Here we use actual models for integration test
     if request.param == HFTextVectorizer:
@@ -14,6 +18,15 @@ def vectorizer(request, openai_key):
         return request.param(
             model="text-embedding-ada-002", api_config={"api_key": openai_key}
         )
+    elif request.param == VertexAITextVectorizer:
+        # also need to set GOOGLE_APPLICATION_CREDENTIALS env var
+        return request.param(
+            model="textembedding-gecko",
+            api_config={
+                "location": os.environ["GCP_LOCATION"],
+                "project_id": os.environ["GCP_PROJECT_ID"],
+            },
+        )
 
 
 def test_vectorizer_embed(vectorizer):