diff --git a/index.toml b/index.toml index 7eeaf19..dcc7712 100644 --- a/index.toml +++ b/index.toml @@ -370,3 +370,9 @@ title = "Tabular Data Processing with Prior Labs MCP" notebook = "prior_labs_agent.ipynb" new = true topics = ["Agents", "MCP", "Data Processing"] + +[[cookbook]] +title = "Build a YouTube Transcript RAG Pipeline with Haystack" +notebook = "youtube_transcript_rag_haystack.ipynb" +topics = ["RAG", "Question-Answering"] +new = true diff --git a/notebooks/youtube_transcript_rag_haystack.ipynb b/notebooks/youtube_transcript_rag_haystack.ipynb new file mode 100644 index 0000000..d2c90f9 --- /dev/null +++ b/notebooks/youtube_transcript_rag_haystack.ipynb @@ -0,0 +1,429 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "provenance": [], + "authorship_tag": "ABX9TyOxBySgJruHsXHgPCaSJJ10", + "include_colab_link": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "language_info": { + "name": "python" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "source": [ + "# YouTube Transcript RAG with Haystack and HuggingFace\n", + "\n", + "> This cookbook shows how to build a RAG pipeline over any YouTube video\n", + "> transcript using [Haystack](https://github.com/deepset-ai/haystack).\n", + "\n", + "## Overview\n", + "\n", + "## What this cookbook does\n", + "\n", + "- [Fetches the transcript of any public YouTube video automatically](#step-1-fetch-youtube-transcript)\n", + "- [Splits it into searchable chunks](#step-2-split-and-index-transcript-using-haystack)\n", + "- [Embeds chunks using BAAI/bge-base-en-v1.5](#step-2-split-and-index-transcript-using-haystack)\n", + "- [Retrieves the most relevant chunks using semantic search](#step-3-build-rag-query-pipeline-using-haystack)\n", + "- [Generates answers using Qwen2.5 via HuggingFace Inference API 
(free)](#step-3-build-rag-query-pipeline-using-haystack)\n", + "\n", + "## Components used\n", + "- [YouTubeTranscriptApi](https://github.com/jdepoix/youtube-transcript-api): to fetch video transcripts\n", + "- [DocumentSplitter](https://docs.haystack.deepset.ai/docs/documentsplitter): to split transcript into overlapping chunks\n", + "- [SentenceTransformersDocumentEmbedder](https://docs.haystack.deepset.ai/docs/sentencetransformersdocumentembedder): to embed chunks using `BAAI/bge-base-en-v1.5`\n", + "- [InMemoryDocumentStore](https://docs.haystack.deepset.ai/docs/inmemorydocumentstore): to store and search embedded chunks\n", + "- [InMemoryEmbeddingRetriever](https://docs.haystack.deepset.ai/docs/inmemoryembeddingretriever): to retrieve relevant chunks for a query\n", + "- [InferenceClient](https://huggingface.co/docs/huggingface_hub/guides/inference): to generate answers using `Qwen2.5-7B-Instruct`\n", + "\n" + ], + "metadata": { + "id": "bjn02OYcPdGj" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Install the dependencies" + ], + "metadata": { + "id": "dw24tO4oSIDQ" + } + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true, + "id": "Ggq2FdGNHh-h" + }, + "outputs": [], + "source": [ + "!pip install haystack-ai youtube-transcript-api sentence-transformers -q" + ] + }, + { + "cell_type": "code", + "source": [ + "from haystack import Document, Pipeline\n", + "from haystack.document_stores.in_memory import InMemoryDocumentStore\n", + "from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder\n", + "from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever\n", + "from haystack.components.builders import ChatPromptBuilder\n", + "from haystack.components.generators.chat import HuggingFaceAPIChatGenerator\n", + "from haystack.components.preprocessors import DocumentSplitter\n", + "from haystack.dataclasses import ChatMessage\n", + "from 
huggingface_hub import InferenceClient\n", + "from youtube_transcript_api import YouTubeTranscriptApi\n", + "import os" + ], + "metadata": { + "id": "TewyL8tbHnTw" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Setting up your HuggingFace API Token\n", + "\n", + "This cookbook uses HuggingFace's free Inference API to generate answers.\n", + "You'll need a free HuggingFace account and API token to proceed.\n", + "\n", + "### Step 1 — Create a free HuggingFace account\n", + "Go to [huggingface.co/join](https://huggingface.co/join) and sign up for free.\n", + "\n", + "### Step 2 — Generate an API token\n", + "1. Go to [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)\n", + "2. Click **\"New token\"**\n", + "3. Give it any name (e.g. `haystack-notebook`)\n", + "4. Set the role to **\"Read\"**\n", + "5. Click **\"Generate a token\"** and copy it\n", + "\n", + "### Step 3 — Add the token to Colab Secrets\n", + "1. Click the **key icon** in the left sidebar of Colab\n", + "2. Click **\"Add new secret\"**\n", + "3. Set the name to exactly `HF_TOKEN`\n", + "4. Paste your token as the value\n", + "5. Toggle the **notebook access** switch to ON\n" + ], + "metadata": { + "id": "mjO6vV7g_cq_" + } + }, + { + "cell_type": "code", + "source": [ + "from google.colab import userdata\n", + "import os\n", + "\n", + "os.environ[\"HF_API_TOKEN\"] = userdata.get('HF_TOKEN')" + ], + "metadata": { + "id": "7OtDfuZ1Hz7a" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Step 1 — Fetch YouTube Transcript\n", + "\n", + "\n", + "\n", + "We use `youtube-transcript-api` to fetch the full transcript of any public\n", + "YouTube video. 
The transcript is joined into a single string, which Step 2 wraps\n", + "in a Haystack [`Document`](https://docs.haystack.deepset.ai/docs/data-classes#document)\n", + "object — the core data structure Haystack uses to represent text content\n", + "throughout the entire pipeline.\n", + "\n", + "> A `Document` in Haystack can hold text content, metadata, and embeddings\n", + "> all in one place — making it easy to pass data between pipeline components." + ], + "metadata": { + "id": "mzTvxtxeQCv0" + } + }, + { + "cell_type": "code", + "source": [ + "def get_transcript(video_url):\n", + "\n", + "    # Extract the video ID from a URL like https://www.youtube.com/watch?v=VIDEO_ID (short youtu.be links are not handled)\n", + "    video_id = video_url.split(\"v=\")[-1].split(\"&\")[0]\n", + "\n", + "    # Fetch the transcript with the instance-based API (youtube-transcript-api 1.0+)\n", + "    ytt_api = YouTubeTranscriptApi()\n", + "    transcript = ytt_api.fetch(video_id)\n", + "\n", + "    # Join all transcript snippets into one string\n", + "    full_text = \" \".join(entry.text for entry in transcript)\n", + "\n", + "    return full_text\n", + "\n", + "# Enter any YouTube URL here\n", + "video_url = input(\"Enter YouTube URL: \")\n", + "transcript_text = get_transcript(video_url)\n", + "\n", + "print(\"Transcript fetched successfully!\")\n", + "print(f\"Total characters: {len(transcript_text)}\")\n", + "print(f\"\\nPreview:\\n{transcript_text[:300]}...\")" + ], + "metadata": { + "id": "75G-QJasIhYI" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Step 2 — Split and Index Transcript using Haystack\n", + "\n", + "This step uses three Haystack components to prepare the transcript for retrieval.\n", + "\n", + "### DocumentSplitter\n", + "[`DocumentSplitter`](https://docs.haystack.deepset.ai/docs/documentsplitter)\n", + "breaks the transcript `Document` into smaller chunks of **150 words** with\n", + "a **20-word overlap**.\n", + "\n", + "Haystack's `DocumentSplitter` supports several split units, including `word`, `sentence`, and\n", + "`passage` — we use `word` here for consistent chunk sizes. 
The overlap helps\n", + "preserve context across chunk boundaries.\n", + "\n", + "### SentenceTransformersDocumentEmbedder\n", + "[`SentenceTransformersDocumentEmbedder`](https://docs.haystack.deepset.ai/docs/sentencetransformersdocumentembedder)\n", + "converts each chunk into a dense vector embedding using `BAAI/bge-base-en-v1.5`,\n", + "an open embedding model that ranks well on the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard).\n", + "\n", + "Haystack's embedder automatically adds the embedding to each `Document` object's\n", + "`embedding` field, keeping text and its vector representation together.\n", + "\n", + "### InMemoryDocumentStore\n", + "[`InMemoryDocumentStore`](https://docs.haystack.deepset.ai/docs/inmemorydocumentstore)\n", + "stores all embedded `Document` objects in memory for fast similarity search.\n", + "\n", + "`InMemoryDocumentStore` is ideal for lightweight single-video demos like this one.\n", + "\n", + "For larger-scale use cases that span multiple videos, Haystack integrates with production document stores, including\n", + "[Qdrant](https://haystack.deepset.ai/integrations/qdrant),\n", + "[Weaviate](https://haystack.deepset.ai/integrations/weaviate), and\n", + "[Elasticsearch](https://haystack.deepset.ai/integrations/elasticsearch).\n", + "\n" + ], + "metadata": { + "id": "VyljdgLkQSgB" + } + }, + { + "cell_type": "code", + "source": [ + "# Initialize document store; use cosine similarity for embedding retrieval (the default is dot product)\n", + "document_store = InMemoryDocumentStore(embedding_similarity_function=\"cosine\")\n", + "\n", + "# Convert to Haystack Document\n", + "docs = [Document(content=transcript_text)]\n", + "\n", + "# Split into chunks\n", + "splitter = DocumentSplitter(\n", + "    split_by=\"word\",\n", + "    split_length=150,\n", + "    split_overlap=20\n", + ")\n", + "\n", + "split_docs = splitter.run(documents=docs)[\"documents\"]\n", + "\n", + "# Embed chunks\n", + "embedder = SentenceTransformersDocumentEmbedder(\n", + "    model=\"BAAI/bge-base-en-v1.5\"\n", + ")\n", + "embedder.warm_up()\n", + 
"embedded_docs = embedder.run(documents=split_docs)[\"documents\"]\n", + "\n", + "# Write to store\n", + "document_store.write_documents(embedded_docs)\n", + "\n", + "# Verify documents are stored correctly\n", + "print(f\"{document_store.count_documents()} chunks indexed successfully!\")" + ], + "metadata": { + "id": "YxHrQXgeIpHb" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Step 3 — Build RAG Query Pipeline using Haystack\n", + "\n", + "The query pipeline runs three steps in sequence:\n", + "\n", + "### 1. Embed the question\n", + "The question is embedded using\n", + "[`SentenceTransformersTextEmbedder`](https://docs.haystack.deepset.ai/docs/sentencetransformerstextembedder)\n", + "with the same `BAAI/bge-base-en-v1.5` model used during indexing.\n", + "\n", + "> It is critical to use the **same model** for both document and query\n", + "> embedding — different models produce vectors in different spaces, making\n", + "> similarity search meaningless.\n", + "\n", + "### 2. Retrieve relevant chunks\n", + "[`InMemoryEmbeddingRetriever`](https://docs.haystack.deepset.ai/docs/inmemoryembeddingretriever)\n", + "finds the **top 5 most similar chunks** using cosine similarity between the\n", + "question embedding and all stored chunk embeddings.\n", + "\n", + "This is semantic search: it matches on *meaning*, not just on keywords.\n", + "\n", + "### 3. Generate the answer\n", + "The 5 retrieved chunks are passed as context to `Qwen2.5-7B-Instruct` via\n", + "HuggingFace's free\n", + "[Inference API](https://huggingface.co/docs/api-inference/index).\n", + "\n", + "The model is explicitly instructed to answer **only** based on the provided\n", + "transcript context, which reduces the risk of hallucinations and helps keep answers grounded\n", + "in the video content." 
+ ], + "metadata": { + "id": "BIgYgRcH3yIN" + } + }, + { + "cell_type": "code", + "source": [ + "# Initialize components\n", + "text_embedder = SentenceTransformersTextEmbedder(\n", + "    model=\"BAAI/bge-base-en-v1.5\"\n", + ")\n", + "text_embedder.warm_up()\n", + "\n", + "retriever = InMemoryEmbeddingRetriever(\n", + "    document_store=document_store,\n", + "    top_k=5\n", + ")\n", + "\n", + "client = InferenceClient(\n", + "    model=\"Qwen/Qwen2.5-7B-Instruct\",\n", + "    token=os.environ[\"HF_API_TOKEN\"]\n", + ")\n", + "\n", + "def ask(question):\n", + "\n", + "    # Embed question\n", + "    embedded_question = text_embedder.run(text=question)\n", + "\n", + "    # Retrieve relevant chunks\n", + "    retrieved = retriever.run(\n", + "        query_embedding=embedded_question[\"embedding\"]\n", + "    )\n", + "\n", + "    # Build context\n", + "    context = \"\\n\".join([doc.content for doc in retrieved[\"documents\"]])\n", + "\n", + "    # Build prompt\n", + "    prompt = f\"\"\"You are a helpful assistant that answers questions\n", + " based on YouTube video transcripts.\n", + "\n", + "Context:{context}\n", + "\n", + "Question:{question}\n", + "\n", + "Answer based only on the transcript content above.\n", + "\"\"\"\n", + "    # Generate answer\n", + "    result = client.chat_completion(\n", + "        messages=[{\"role\": \"user\", \"content\": prompt}],\n", + "        max_tokens=300\n", + "    )\n", + "\n", + "    print(f\"Question: {question}\")\n", + "    print(f\"\\nAnswer: {result.choices[0].message.content}\")\n", + "    print(\"\\n\" + \"─\" * 60 + \"\\n\")\n", + "\n", + "print(\"Pipeline ready!\")" + ], + "metadata": { + "id": "QZXG_t0jNflk" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Step 4 — Ask Questions\n", + "\n", + "The pipeline is ready! 
Ask questions about the video you loaded in Step 1.\n", + "\n", + "> **Tips for best results:**\n", + "> - Ask specific questions rather than vague ones\n", + "> - Conceptual questions like *\"How does X work?\"* perform better than\n", + "> opinion-based ones\n", + "> - The answer quality depends on how clearly the topic is covered in the video\n", + "\n", + "To query a different video, re-run Steps 1 and 2 with a new URL, then ask your own questions below!" + ], + "metadata": { + "id": "gL6iDKKKQueP" + } + }, + { + "cell_type": "code", + "source": [ + "# Ask your own questions (re-run Steps 1 and 2 to switch videos)\n", + "ask(\"What is the main topic of this video?\")\n", + "ask(\"What are the key concepts explained?\")\n", + "ask(\"What prerequisites does the speaker recommend?\")" + ], + "metadata": { + "id": "XYbAa-rVNzrC" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## You built a YouTube RAG Pipeline!\n", + "\n", + "In this cookbook you built a complete RAG pipeline that can answer questions about any YouTube video using Haystack and HuggingFace — completely free.\n", + "\n", + "### Further Reading\n", + "- [Haystack Documentation](https://docs.haystack.deepset.ai)\n", + "- [Haystack Tutorials](https://haystack.deepset.ai/tutorials)\n", + "- [HuggingFace Inference API](https://huggingface.co/docs/api-inference/index)\n", + "- [Other Haystack Cookbooks](https://github.com/deepset-ai/haystack-cookbook)" + ], + "metadata": { + "id": "t7OyRvZsI8FM" + } + }, + { + "cell_type": "code", + "source": [], + "metadata": { + "id": "ttPI1eCYJDkT" + }, + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file