diff --git a/week4/community-contributions/IbrahimSheriff/README.md b/week4/community-contributions/IbrahimSheriff/README.md new file mode 100644 index 0000000000..9f5edcaad0 --- /dev/null +++ b/week4/community-contributions/IbrahimSheriff/README.md @@ -0,0 +1,47 @@ +# Week 4: Code gen and unit-test AI + +## What it is + +A **Gradio app** (launched from the notebook) that uses a frontier LLM via **OpenRouter** to: + +- **Convert Python → TypeScript** — Generate idiomatic, typed TypeScript from Python with equivalent behavior. +- **Run code** — Run the Python or TypeScript in the app and compare outputs. +- **Generate unit tests** — Choose **Python** (pytest) or **TypeScript** (Jest) and generate tests for the code in the corresponding box. + +Notebook: **`week4/community-contributions/IbrahimSheriff/python_to_typescript.ipynb`**. + +## Setup + +1. **API key:** Put your OpenRouter API key in a `.env` file in the repo root: + + ``` + OPENROUTER_API_KEY=sk-or-v1-... + ``` + + Get a key at [openrouter.ai](https://openrouter.ai). + +2. **Dependencies:** From the repo root run: + ```bash + uv sync + ``` + Or install: `python-dotenv`, `openai`, `gradio`. + +## How to run + +1. Open **`week4/community-contributions/IbrahimSheriff/python_to_typescript.ipynb`**. +2. Run all cells to load the client and start the Gradio app. +3. Use the local URL (e.g. `http://127.0.0.1:7860`) in your browser. +4. **Convert:** Paste Python, pick a model, click **Convert** to get TypeScript. +5. **Run:** Use **Run Python** and **Run TypeScript** to see outputs. +6. **Unit tests:** Choose **Python** or **TypeScript** in the dropdown and click **Generate unit tests** to get pytest or Jest tests. + +## Optional: Run TypeScript in the app + +The **Run TypeScript** button uses `npx tsx` (or `npx ts-node`). You need: + +- **Node.js** installed. +- **tsx** or **ts-node**, e.g.: + - `npx tsx` (recommended, no global install), + - or `npm i -g ts-node`. + +If neither is available, the app will show an install message. 
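The runner fallback described in the README above (prefer `tsx`, fall back to `ts-node`, otherwise show an install hint) can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's own code: `pick_ts_runner` and its `which` parameter are hypothetical names, and the sketch assumes both tools are launched through `npx`, so it only checks that `npx` is on the PATH.

```python
import shutil
from typing import Callable, Optional


def pick_ts_runner(
    ts_file: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[list[str]]:
    """Return candidate commands for running ts_file, in preference order.

    Hypothetical helper: an empty list means npx (and so Node.js) was not found.
    """
    if which("npx") is None:
        return []
    # Try tsx first, then ts-node, matching the order described in the README.
    return [
        ["npx", "--yes", "tsx", ts_file],
        ["npx", "--yes", "ts-node", ts_file],
    ]
```

A caller could try each returned command in turn with `subprocess.run` and show the install message when the list is empty. Passing `which` as a parameter keeps the lookup easy to stub out in tests.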
diff --git a/week4/community-contributions/IbrahimSheriff/python_to_typescript.ipynb b/week4/community-contributions/IbrahimSheriff/python_to_typescript.ipynb new file mode 100644 index 0000000000..016b8e52fe --- /dev/null +++ b/week4/community-contributions/IbrahimSheriff/python_to_typescript.ipynb @@ -0,0 +1,512 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "intro", + "metadata": {}, + "source": [ + "# Python to TypeScript Code Generator\n", + "\n", + "Use a frontier LLM via **OpenRouter only** to generate idiomatic, typed TypeScript code from Python that produces equivalent behavior.\n", + "\n", + "**Setup:** Set `OPENROUTER_API_KEY` in your `.env` file. Get a key at [openrouter.ai](https://openrouter.ai).\n", + "\n", + "**Run TypeScript:** The \"Run TypeScript\" button requires Node.js and `ts-node` (or `tsx`) installed, e.g. `npm i -g ts-node` or use `npx ts-node main.ts`." + ] + }, + { + "cell_type": "markdown", + "id": "cfb14d27", + "metadata": {}, + "source": [ + "## Imports\n", + "\n", + "Dependencies: `dotenv` for API key, `openai` for the OpenRouter client, `gradio` for the UI." + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "id": "imports", + "metadata": {}, + "outputs": [], + "source": [ + "# imports\n", + "import os\n", + "import io\n", + "import sys\n", + "import subprocess\n", + "from dotenv import load_dotenv\n", + "from openai import OpenAI\n", + "import gradio as gr\n", + "\n", + "OPENROUTER_BASE_URL = \"https://openrouter.ai/api/v1\"" + ] + }, + { + "cell_type": "markdown", + "id": "9d996b5e", + "metadata": {}, + "source": [ + "## API key\n", + "\n", + "Load `OPENROUTER_API_KEY` from `.env`. Get a key at [openrouter.ai](https://openrouter.ai)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 33, + "id": "load_env", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "OpenRouter API Key exists and begins sk-or-v1\n" + ] + } + ], + "source": [ + "load_dotenv(override=True)\n", + "openrouter_api_key = os.getenv('OPENROUTER_API_KEY')\n", + "\n", + "if openrouter_api_key:\n", + " print(f\"OpenRouter API Key exists and begins {openrouter_api_key[:8]}\")\n", + "else:\n", + " print(\"OpenRouter API Key not set - please add OPENROUTER_API_KEY to your .env file\")" + ] + }, + { + "cell_type": "markdown", + "id": "9c96290b", + "metadata": {}, + "source": [ + "## OpenRouter client and models\n", + "\n", + "Single client for OpenRouter; model dropdown uses OpenRouter model IDs (e.g. GPT-4o, Claude, Gemini)." + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "id": "client", + "metadata": {}, + "outputs": [], + "source": [ + "# Connect to OpenRouter - single client for all models\n", + "client = OpenAI(\n", + " api_key=openrouter_api_key,\n", + " base_url=OPENROUTER_BASE_URL,\n", + ")\n", + "\n", + "# Model dropdown: OpenRouter model IDs\n", + "MODELS = [\n", + " \"openai/gpt-4o\",\n", + " \"anthropic/claude-3-5-sonnet\",\n", + " \"google/gemini-2.0-flash-001\",\n", + " \"openai/gpt-4o-mini\",\n", + " \"anthropic/claude-3-haiku\",\n", + "]" + ] + }, + { + "cell_type": "markdown", + "id": "50246369", + "metadata": {}, + "source": [ + "## Prompts for TypeScript\n", + "\n", + "System prompt asks for idiomatic TypeScript with types; user prompt wraps the Python code. `messages_for(python)` builds the chat message list." + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "id": "prompts", + "metadata": {}, + "outputs": [], + "source": [ + "system_prompt = \"\"\"\n", + "Your task is to convert Python code into idiomatic TypeScript.\n", + "Use proper types, interfaces, and type annotations. 
The TypeScript code must produce the same runtime behavior and output as the Python code.\n", + "Respond only with TypeScript code. Do not wrap the code in markdown code fences or provide explanations.\n", + "\"\"\"\n", + "\n", + "def user_prompt_for(python):\n", + " return f\"\"\"\n", + "Port this Python code to equivalent TypeScript. Use proper types and idiomatic TypeScript. The code will be run with ts-node.\n", + "Respond only with TypeScript code.\n", + "\n", + "Python code to port:\n", + "\n", + "```python\n", + "{python}\n", + "```\n", + "\"\"\"\n", + "\n", + "def messages_for(python):\n", + " return [\n", + " {\"role\": \"system\", \"content\": system_prompt},\n", + " {\"role\": \"user\", \"content\": user_prompt_for(python)}\n", + " ]" + ] + }, + { + "cell_type": "markdown", + "id": "4d701368", + "metadata": {}, + "source": [ + "## Write output\n", + "\n", + "Writes the generated TypeScript to a file (default `main.ts`, UTF-8)." + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "id": "write_output", + "metadata": {}, + "outputs": [], + "source": [ + "def write_output(ts_code, path=\"main.ts\"):\n", + " with open(path, \"w\", encoding=\"utf-8\") as f:\n", + " f.write(ts_code)" + ] + }, + { + "cell_type": "markdown", + "id": "49911b40", + "metadata": {}, + "source": [ + "## Port (convert)\n", + "\n", + "Calls the LLM with `messages_for(python)`, strips markdown code fences from the reply, writes to file, and returns the TypeScript for display." 
 + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "id": "port", + "metadata": {}, + "outputs": [], + "source": [ + "def port(model, python):\n", + " kwargs = {\"model\": model, \"messages\": messages_for(python)}\n", + " if 'gpt' in model:\n", + " kwargs[\"reasoning_effort\"] = \"high\"\n", + " response = client.chat.completions.create(**kwargs)\n", + " reply = response.choices[0].message.content or \"\"\n", + " # Strip markdown code fences\n", + " reply = reply.strip()\n", + " for prefix in ('```typescript', '```ts', '```'):\n", + " if reply.startswith(prefix):\n", + " reply = reply[len(prefix):].lstrip('\\n')\n", + " break\n", + " if reply.endswith('```'):\n", + " reply = reply[:-3].rstrip()\n", + " write_output(reply)\n", + " return reply" + ] + }, + { + "cell_type": "markdown", + "id": "380223e4", + "metadata": {}, + "source": [ + "## Run Python\n", + "\n", + "Executes the Python code with `exec` in the notebook's own process (not an isolated sandbox, so run only code you trust), captures stdout, and returns it (or an error message)." + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "id": "run_python", + "metadata": {}, + "outputs": [], + "source": [ + "def run_python(code):\n", + " globals_dict = {\"__builtins__\": __builtins__}\n", + " buffer = io.StringIO()\n", + " old_stdout = sys.stdout\n", + " sys.stdout = buffer\n", + " try:\n", + " exec(code, globals_dict)\n", + " output = buffer.getvalue()\n", + " except Exception as e:\n", + " output = f\"Error: {e}\"\n", + " finally:\n", + " sys.stdout = old_stdout\n", + " return output" + ] + }, + { + "cell_type": "markdown", + "id": "3dffba39", + "metadata": {}, + "source": [ + "## Run TypeScript\n", + "\n", + "Writes the TypeScript to a temporary file in the working directory, runs it with `npx tsx` (or `ts-node`), and returns stdout (or stderr on failure). Requires Node.js and tsx/ts-node." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 39, + "id": "run_typescript", + "metadata": {}, + "outputs": [], + "source": [ + "def run_typescript(ts_code):\n", + " if not ts_code or not ts_code.strip():\n", + " return \"No TypeScript code to run. Convert Python first.\"\n", + " cwd = os.getcwd()\n", + " run_file = os.path.join(cwd, \"_gradio_ts_run.ts\")\n", + " try:\n", + " with open(run_file, \"w\", encoding=\"utf-8\") as f:\n", + " f.write(ts_code)\n", + " # Prefer tsx (captures stdout reliably); fall back to ts-node\n", + " for cmd in [[\"npx\", \"--yes\", \"tsx\", run_file], [\"npx\", \"--yes\", \"ts-node\", run_file]]:\n", + " try:\n", + " result = subprocess.run(\n", + " cmd,\n", + " cwd=cwd,\n", + " capture_output=True,\n", + " text=True,\n", + " encoding=\"utf-8\",\n", + " timeout=30,\n", + " env={**os.environ, \"FORCE_COLOR\": \"0\"},\n", + " )\n", + " break\n", + " except FileNotFoundError:\n", + " continue\n", + " else:\n", + " return \"Neither tsx nor ts-node found. Install with: npx tsx or npm i -g ts-node\"\n", + " if result.returncode != 0:\n", + " return result.stderr or result.stdout or \"Non-zero exit\"\n", + " out = (result.stdout or \"\").strip()\n", + " err = (result.stderr or \"\").strip()\n", + " if out:\n", + " return out\n", + " if err:\n", + " return err\n", + " return \"(Program ran successfully with no output.)\"\n", + " except subprocess.TimeoutExpired:\n", + " return \"Execution timed out.\"\n", + " except Exception as e:\n", + " return f\"Error: {e}\"\n", + " finally:\n", + " if os.path.exists(run_file):\n", + " try:\n", + " os.unlink(run_file)\n", + " except OSError:\n", + " pass" + ] + }, + { + "cell_type": "markdown", + "id": "69cfa03e", + "metadata": {}, + "source": [ + "## Generate unit tests\n", + "\n", + "Uses the LLM to generate unit tests. Choose **Python** (pytest) to test the Python code, or **TypeScript** (Jest) to test the TypeScript code. Output is shown in a separate text area." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 40, + "id": "f63db500", + "metadata": {}, + "outputs": [], + "source": [ + "UNIT_TEST_SYSTEM_PROMPT_PYTHON = \"\"\"\n", + "Your task is to write unit tests for the given Python code.\n", + "Use pytest style: test functions named test_*, assert statements, and pytest features (e.g. parametrize, fixtures) when helpful.\n", + "Cover the main behavior and edge cases. Respond only with Python test code. Do not wrap in markdown code fences or add explanations.\n", + "\"\"\"\n", + "\n", + "UNIT_TEST_SYSTEM_PROMPT_TS = \"\"\"\n", + "Your task is to write unit tests for the given TypeScript code.\n", + "Use Jest: describe/it, expect(), and Jest matchers. Use proper types. Cover the main behavior and edge cases.\n", + "Respond only with TypeScript test code. Do not wrap in markdown code fences or add explanations.\n", + "\"\"\"\n", + "\n", + "def user_prompt_for_tests_python(python_code):\n", + " return f\"\"\"Write unit tests for this Python code. Use pytest. Respond only with the test code.\n", + "\n", + "Python code to test:\n", + "\n", + "```python\n", + "{python_code}\n", + "```\n", + "\"\"\"\n", + "\n", + "def user_prompt_for_tests_ts(ts_code):\n", + " return f\"\"\"Write unit tests for this TypeScript code. Use Jest. 
Respond only with the test code.\n", + "\n", + "TypeScript code to test:\n", + "\n", + "```typescript\n", + "{ts_code}\n", + "```\n", + "\"\"\"\n", + "\n", + "def messages_for_tests_python(python_code):\n", + " return [\n", + " {\"role\": \"system\", \"content\": UNIT_TEST_SYSTEM_PROMPT_PYTHON},\n", + " {\"role\": \"user\", \"content\": user_prompt_for_tests_python(python_code)}\n", + " ]\n", + "\n", + "def messages_for_tests_ts(ts_code):\n", + " return [\n", + " {\"role\": \"system\", \"content\": UNIT_TEST_SYSTEM_PROMPT_TS},\n", + " {\"role\": \"user\", \"content\": user_prompt_for_tests_ts(ts_code)}\n", + " ]\n", + "\n", + "def generate_unit_tests(model, test_type, python_code, ts_code):\n", + " code = python_code if test_type == \"Python\" else ts_code\n", + " if not code or not code.strip():\n", + " msg = \"Paste Python code above\" if test_type == \"Python\" else \"Convert to TypeScript first or paste TypeScript code\"\n", + " return f\"{msg}, then click Generate unit tests.\"\n", + " messages = messages_for_tests_python(python_code) if test_type == \"Python\" else messages_for_tests_ts(ts_code)\n", + " kwargs = {\"model\": model, \"messages\": messages}\n", + " if \"gpt\" in model:\n", + " kwargs[\"reasoning_effort\"] = \"high\"\n", + " response = client.chat.completions.create(**kwargs)\n", + " reply = (response.choices[0].message.content or \"\").strip()\n", + " if test_type == \"Python\":\n", + " prefixes = (\"```python\", \"```py\", \"```\")\n", + " else:\n", + " prefixes = (\"```typescript\", \"```ts\", \"```\")\n", + " for prefix in prefixes:\n", + " if reply.startswith(prefix):\n", + " reply = reply[len(prefix):].lstrip(\"\\n\")\n", + " break\n", + " if reply.endswith(\"```\"):\n", + " reply = reply[:-3].rstrip()\n", + " return reply" + ] + }, + { + "cell_type": "markdown", + "id": "31f2008d", + "metadata": {}, + "source": [ + "## Default example\n", + "\n", + "Short Python snippet shown in the first code box so you can click Convert immediately." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 41, + "id": "default_example", + "metadata": {}, + "outputs": [], + "source": [ + "DEFAULT_PYTHON = '''print(\"Hello from Python\")\n", + "x = [1, 2, 3]\n", + "print(sum(x))'''" + ] + }, + { + "cell_type": "markdown", + "id": "fe6b1375", + "metadata": {}, + "source": [ + "## Gradio UI\n", + "\n", + "Two code areas (Python input, TypeScript output), model dropdown, Convert button, and Run Python / Run TypeScript with result text areas. Launch in browser." + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "id": "gradio_ui", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "* Running on local URL: http://127.0.0.1:7864\n", + "* To create a public link, set `share=True` in `launch()`.\n" + ] + }, + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [] + }, + "execution_count": 42, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "with gr.Blocks() as ui:\n", + " gr.Markdown(\"## Python to TypeScript - By Sheriff Ibrahm\")\n", + " with gr.Row():\n", + " python = gr.Textbox(label=\"Python code:\", lines=16, value=DEFAULT_PYTHON)\n", + " ts = gr.Textbox(label=\"TypeScript code:\", lines=16)\n", + " with gr.Row():\n", + " model = gr.Dropdown(MODELS, label=\"Model\", value=MODELS[0])\n", + " convert_btn = gr.Button(\"Convert\")\n", + " convert_btn.click(port, inputs=[model, python], outputs=[ts])\n", + "\n", + " with gr.Row():\n", + " py_result = gr.Textbox(label=\"Python output\", lines=4)\n", + " ts_result = gr.Textbox(label=\"TypeScript output\", lines=4)\n", + " with gr.Row():\n", + " run_py_btn = gr.Button(\"Run Python\")\n", + " run_ts_btn = gr.Button(\"Run TypeScript\")\n", + " run_py_btn.click(run_python, inputs=[python], outputs=[py_result])\n", + " run_ts_btn.click(run_typescript, inputs=[ts], outputs=[ts_result])\n", + "\n", + " gr.Markdown(\"### Generate unit tests\")\n", + " TEST_TYPES = [\"Python\", \"TypeScript\"]\n", + " with gr.Row():\n", + " test_type = gr.Dropdown(TEST_TYPES, label=\"Test language\", value=\"Python\")\n", + " unit_tests = gr.Textbox(label=\"Unit tests\", lines=16)\n", + " gen_tests_btn = gr.Button(\"Generate unit tests\")\n", + " gen_tests_btn.click(generate_unit_tests, inputs=[model, test_type, python, ts], outputs=[unit_tests])\n", + "\n", + "ui.launch(inbrowser=True)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": 
"3.12.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/week5/community-contributions/IbrahimSheriff/README.md b/week5/community-contributions/IbrahimSheriff/README.md new file mode 100644 index 0000000000..5a44acea91 --- /dev/null +++ b/week5/community-contributions/IbrahimSheriff/README.md @@ -0,0 +1,90 @@ +# Galdunx RAG Chatbot & Evaluation + +A RAG (Retrieval-Augmented Generation) chatbot and evaluation suite for the **Galdunx** knowledge base. Chat over Galdunx docs via a Gradio UI and run retrieval (MRR) and answer-quality (LLM-as-judge) evals. + +## Overview + +- **Chat UI** — Ask questions about Galdunx; answers use requery + rerank over a ChromaDB vector store. +- **Evaluation** — Gradio app to run MRR (retrieval) and LLM-as-judge (answer quality) on a configurable eval set. + +## Prerequisites + +- Python 3.10+ +- Dependencies from the repo (e.g. `langchain-*`, `chromadb`, `gradio`, `python-dotenv`). + +### Environment + +Create a `.env` in this directory (or repo root) if using API-backed embeddings/LLM: + +- **`OPENROUTER_API_KEY`** — If set, uses OpenRouter for embeddings (`openai/text-embedding-3-small`) and chat (`openai/gpt-4o-mini`). Otherwise uses local HuggingFace embeddings and OpenAI `gpt-4o-mini` for chat (OpenAI key required for that). + +## Quick Start + +Run all commands from `week5/IbrahimSheriff` (or ensure this directory is on `PYTHONPATH`). + +### 1. Ingest the knowledge base + +Builds the vector store from `knowledge-base/*.md`: + +```bash +python ingest.py +``` + +If the Gradio chat app is running, stop it first or ingestion may fail (DB locked). + +### 2. Chat app + +```bash +python app.py +``` + +Opens a Gradio chat interface. Use **Re-ingest knowledge base** to refresh the vector store from the UI. + +### 3. Evaluation app + +```bash +python eval_app.py +``` + +- **Eval set:** Use the default Galdunx eval set or paste custom JSON. 
+- **Run evaluation:** Runs MRR (retrieval), then RAG answers + LLM-as-judge; shows summary and per-example details. + +Custom eval JSON format: + +```json +[ + { + "question": "Your question?", + "expected_sources": ["filename.md"], + "expected_answer": "Optional expected answer for judge." + } +] +``` + +Default eval set is loaded from `eval_data.json` when present. + +## Project structure + +| File / folder | Purpose | +|----------------------|--------| +| `app.py` | Gradio chat UI; calls `answer_question` and optional re-ingest. | +| `answer.py` | RAG pipeline: requery, retrieve, merge, rerank, then LLM answer. | +| `ingest.py` | Loads `knowledge-base/*.md`, chunks, embeds, writes to `vector_db/`. | +| `eval_app.py` | Gradio UI for running MRR + LLM-as-judge evaluation. | +| `evaluation.py` | Eval logic: MRR, default eval set, LLM judge; used by `eval_app.py`. | +| `eval_data.json` | Default eval set (questions + expected sources/answers). | +| `knowledge-base/` | Source Markdown files (e.g. about Galdunx, web dev, UI/UX, 10K Store). | +| `vector_db/` | ChromaDB persistence (created by `ingest.py`). | + +## RAG pipeline (answer.py) + +1. **Requery** — LLM generates an extra retrieval-oriented query from the user question and history. +2. **Retrieve** — Vector search for both the combined user input and the AI query; results merged and deduped. +3. **Rerank** — Top chunks reordered by relevance with an LLM (structured output). +4. **Answer** — Top reranked chunks are used as context for the final answer. + +## Notes + +- Embeddings and DB path are shared between `ingest.py` and `answer.py` so the chat app uses the same vector store. +- If you see “vector database is locked”, stop the Gradio chat app before re-running ingest. +- For local embeddings, NumPy 2.x can conflict with some setups; the ingest script suggests using `OPENROUTER_API_KEY` or `uv pip install 'numpy<2'` if needed. 
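For readers unfamiliar with the retrieval metric named in the README above, mean reciprocal rank (MRR) can be sketched as below. This is an illustrative assumption about how such a metric is commonly computed, not necessarily the logic in `evaluation.py`; `reciprocal_rank` and `mean_reciprocal_rank` are hypothetical names, and sources are compared by filename, as in the custom eval JSON format.

```python
def reciprocal_rank(retrieved_sources: list[str], expected_sources: list[str]) -> float:
    """Return 1/rank of the first retrieved source that is expected, or 0.0 if none match."""
    expected = set(expected_sources)
    for rank, source in enumerate(retrieved_sources, start=1):
        if source in expected:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], list[str]]]) -> float:
    """Average the reciprocal rank over (retrieved, expected) pairs; 0.0 for an empty list."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, e) for r, e in runs) / len(runs)
```

Under this definition, a query whose expected file is retrieved first scores 1.0, one retrieved second scores 0.5, and a miss scores 0.0; the mean over all queries is the reported MRR.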
diff --git a/week5/community-contributions/IbrahimSheriff/answer.py b/week5/community-contributions/IbrahimSheriff/answer.py new file mode 100644 index 0000000000..4da57e3cdb --- /dev/null +++ b/week5/community-contributions/IbrahimSheriff/answer.py @@ -0,0 +1,181 @@ +import logging +import os +from pathlib import Path + +from langchain_openai import ChatOpenAI, OpenAIEmbeddings +from langchain_chroma import Chroma +from langchain_huggingface import HuggingFaceEmbeddings +from langchain_core.messages import SystemMessage, HumanMessage, convert_to_messages +from langchain_core.documents import Document +from pydantic import BaseModel, Field + +from dotenv import load_dotenv + +load_dotenv(override=True) + +OPENROUTER_MODEL = "openai/gpt-4o-mini" +OPENAI_MODEL = "gpt-4o-mini" +DB_NAME = str(Path(__file__).parent / "vector_db") + +OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY") +OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1" + +# Must match ingest: OpenRouter (openai/text-embedding-3-small) or OpenAI or HuggingFace +if OPENROUTER_API_KEY: + embeddings = OpenAIEmbeddings( + model="openai/text-embedding-3-small", + openai_api_key=OPENROUTER_API_KEY, + openai_api_base=OPENROUTER_BASE_URL, + ) +else: + embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2") + +RETRIEVAL_K_PER_QUERY = 6 +RERANK_TOP_K = 5 +CHUNKS_FOR_RERANK = 10 + +SYSTEM_PROMPT = """ +You are a knowledgeable, friendly assistant representing Galdunx, a digital product and creative technology studio. +You are chatting with a user about Galdunx and its services. +If relevant, use the given context to answer any question. +If you don't know the answer, say so. 
+Context: +{context} +""" + + +def _get_retriever(k: int | None = None): + """Build retriever on each use so we see the current collection after re-ingest.""" + vs = Chroma(persist_directory=DB_NAME, embedding_function=embeddings) + search_k = k if k is not None else RETRIEVAL_K_PER_QUERY + return vs.as_retriever(search_kwargs={"k": search_k}) + +if OPENROUTER_API_KEY: + llm = ChatOpenAI( + temperature=0, + model_name=OPENROUTER_MODEL, + openai_api_key=OPENROUTER_API_KEY, + openai_api_base=OPENROUTER_BASE_URL, + ) +else: + llm = ChatOpenAI(temperature=0.5, model_name=OPENAI_MODEL) + + +class RankOrder(BaseModel): + """Order of chunk ids from most to least relevant (1-based).""" + order: list[int] = Field( + description="The order of relevance of chunks, from most relevant to least relevant, by chunk id number", + ) + + +def requery_question(question: str, history: list[dict]) -> str: + """ + Use the LLM to produce a short, retrieval-oriented query that complements the user question + so that searching with both user and this AI query gives broader coverage. + """ + history_text = "\n".join( + f"{m['role']}: {m['content']}" for m in history + ) or "(no prior messages)" + prompt = f"""You are helping answer questions about Galdunx (a digital product and creative technology studio) using a knowledge base. + +Conversation history: +{history_text} + +Current user question: +{question} + +Respond with a single short search query that complements the user's question and is likely to surface relevant content when used together with their question. Focus on key entities and intent. 
Output ONLY the search query, nothing else.""" + response = llm.invoke([HumanMessage(content=prompt)]) + return (response.content or "").strip() + + +def fetch_context_unranked(question: str, k: int) -> list[Document]: + """Retrieve up to k context documents for a question (no reranking).""" + return _get_retriever(k=k).invoke(question) + + +def merge_chunks(user_chunks: list[Document], ai_chunks: list[Document]) -> list[Document]: + """Merge AI chunks into user chunks, deduplicating by page_content.""" + merged = list(user_chunks) + existing = {doc.page_content for doc in user_chunks} + for doc in ai_chunks: + if doc.page_content not in existing: + merged.append(doc) + existing.add(doc.page_content) + return merged + + +def rerank(question: str, chunks: list[Document]) -> list[Document]: + """Reorder chunks by relevance to the question using the LLM. Returns reordered list.""" + if not chunks: + return [] + system_prompt = """You are a document re-ranker. You are given a question and a list of text chunks from a knowledge base. Rank the chunks by relevance to the question, most relevant first. Reply only with a JSON object with one key "order": a list of chunk ids (1-based), in relevance order. Include every chunk id exactly once.""" + user_prompt = f"Question:\n\n{question}\n\nChunks:\n\n" + for i, doc in enumerate(chunks): + user_prompt += f"# CHUNK ID: {i + 1}:\n\n{doc.page_content}\n\n" + user_prompt += "Reply only with the JSON object (e.g. {\"order\": [3, 1, 2, ...]})." + structured_llm = llm.with_structured_output(RankOrder) + result = structured_llm.invoke( + [SystemMessage(content=system_prompt), HumanMessage(content=user_prompt)] + ) + return [chunks[i - 1] for i in result.order] + + +def fetch_context(question: str, history: list[dict]) -> list[Document]: + """ + Requery with AI, retrieve from user + AI query, merge and dedupe, take 10, rerank, return top 5. 
 + """ + combined = combined_question(question, history) + logging.info(f"Combined user question/input: {combined}") + + ai_query = requery_question(question, history) + logging.info(f"Requery (AI search query): {ai_query}") + + user_chunks = fetch_context_unranked(combined, RETRIEVAL_K_PER_QUERY) + ai_chunks = fetch_context_unranked(ai_query, RETRIEVAL_K_PER_QUERY) + logging.info(f"Retrieved {len(user_chunks)} from user query, {len(ai_chunks)} from AI query.") + + merged = merge_chunks(user_chunks, ai_chunks) + chunks_for_rerank = merged[:CHUNKS_FOR_RERANK] + logging.info(f"Merged to {len(merged)} unique chunks, using up to {len(chunks_for_rerank)} for rerank.") + + reranked = rerank(question, chunks_for_rerank) + docs = reranked[:RERANK_TOP_K] + logging.info(f"Reranked, using top {len(docs)} chunks for context.") + + return docs + + +def combined_question(question: str, history: list[dict] | None = None) -> str: + """ + Combine the user's prior messages and the current question into a single string. + """ + prior = "\n".join(m["content"] for m in (history or []) if m["role"] == "user") + return prior + "\n" + question + + +def answer_question(question: str, history: list[dict] | None = None) -> tuple[str, list[Document]]: + """ + Answer the given question with RAG (requery + rerank); return the answer and the context documents. + Logs each result for debugging or tracing. 
+ """ + logging.basicConfig(level=logging.INFO) + + docs = fetch_context(question, history or []) + for idx, doc in enumerate(docs): + logging.info(f"Document {idx + 1}: {doc.page_content[:300]}") + + context = "\n\n".join(doc.page_content for doc in docs) + logging.info(f"Context passed to system prompt: {context[:500]}") + + system_prompt = SYSTEM_PROMPT.format(context=context) + messages = [SystemMessage(content=system_prompt)] + messages.extend(convert_to_messages(history or [])) + messages.append(HumanMessage(content=question)) + + logging.info(f"Messages sent to LLM: {[str(getattr(m, 'content', ''))[:200] for m in messages]}") + + response = llm.invoke(messages) + logging.info(f"LLM response: {response.content}") + + return response.content, docs diff --git a/week5/community-contributions/IbrahimSheriff/app.py b/week5/community-contributions/IbrahimSheriff/app.py new file mode 100644 index 0000000000..eede82b22c --- /dev/null +++ b/week5/community-contributions/IbrahimSheriff/app.py @@ -0,0 +1,75 @@ +""" +Gradio chat UI for the Galdunx RAG chatbot. +Run from week5/IbrahimSheriff: python app.py +Or from repo root: python -m week5.IbrahimSheriff.app (if week5 is on path). +""" +import gradio as gr + +from answer import answer_question +from ingest import fetch_documents, create_chunks, create_embeddings + +_INGEST_MSG = ( + "The knowledge base is not loaded yet. 
Please click **Re-ingest knowledge base** below, " + "or run from the terminal: `python week5/IbrahimSheriff/ingest.py`" +) + + +def _history_to_dicts(history): + """Convert Gradio chat history to list[dict] with role and content.""" + out = [] + for msg in history or []: + if isinstance(msg, (list, tuple)) and len(msg) >= 2: + user_msg, assistant_msg = msg[0], msg[1] + out.append({"role": "user", "content": user_msg}) + out.append({"role": "assistant", "content": assistant_msg}) + elif isinstance(msg, dict): + out.append({"role": msg.get("role", "user"), "content": msg.get("content", "")}) + else: + role = getattr(msg, "role", "user") + content = getattr(msg, "content", str(msg)) + out.append({"role": role, "content": content}) + return out + + +def chat(message, history): + """Chat handler for Gradio: answer using RAG and return the reply text.""" + if not message or not message.strip(): + return "" + history_dicts = _history_to_dicts(history) + try: + answer, _ = answer_question(message, history_dicts) + return answer + except Exception as e: + if "does not exist" in str(e) or "NotFoundError" in type(e).__name__: + return _INGEST_MSG + raise + + +def run_ingest(): + """Run ingestion and return a status message.""" + try: + documents = fetch_documents() + chunks = create_chunks(documents) + create_embeddings(chunks) + return "Re-ingestion complete. Vector store updated." 
+ except Exception as e: + return f"Re-ingestion failed: {e}" + + +def main(): + with gr.Blocks(title="Galdunx RAG Chat") as demo: + gr.Markdown("# Galdunx Knowledge Assistant\nChat with the assistant about Galdunx and its services.") + gr.ChatInterface( + fn=chat, + type="messages", + textbox=gr.Textbox(placeholder="Ask about Galdunx...", container=False), + ) + with gr.Row(): + ingest_btn = gr.Button("Re-ingest knowledge base") + status = gr.Textbox(label="Status", interactive=False) + ingest_btn.click(fn=run_ingest, outputs=status) + demo.launch(inbrowser=True) + + +if __name__ == "__main__": + main() diff --git a/week5/community-contributions/IbrahimSheriff/eval_app.py b/week5/community-contributions/IbrahimSheriff/eval_app.py new file mode 100644 index 0000000000..6e3063a9cb --- /dev/null +++ b/week5/community-contributions/IbrahimSheriff/eval_app.py @@ -0,0 +1,144 @@ +""" +Gradio UI to run RAG evaluation: MRR (retrieval) and LLM-as-judge (answer quality). +Run from week5/IbrahimSheriff: python eval_app.py +""" +from __future__ import annotations + +import json +from pathlib import Path + +import gradio as gr + +from evaluation import ( + EvalExample, + default_eval_set, + compute_mrr, + run_answer_eval, + load_eval_set_from_json, +) + + +def _format_doc_source(doc) -> str: + raw = (doc.metadata or {}).get("source", "") + return Path(raw).name if raw else "(no source)" + + +def run_evaluation(use_default: bool, custom_json: str, progress: gr.Progress | None = None): + """ + Run full eval: MRR then LLM judge on each answer. + Yields (summary, details) so the UI updates during the run instead of staying stagnant. 
 + """ + def update_progress(frac: float, msg: str) -> None: + if progress is not None: + progress(frac, desc=msg) + + # Update UI immediately so it doesn't look stagnant + yield "⏳ **Starting evaluation...**", "" + + # Parse eval set + if use_default: + eval_set = default_eval_set() + else: + if not (custom_json and custom_json.strip()): + yield "⚠️ Provide a JSON eval set (list of {question, expected_sources?, expected_answer?}).", "" + return + try: + data = json.loads(custom_json) + eval_set = [ + EvalExample( + question=item["question"], + expected_sources=item.get("expected_sources", []), + expected_answer=item.get("expected_answer"), + ) + for item in data + ] + except json.JSONDecodeError as e: + yield f"⚠️ Invalid JSON: {e}", "" + return + except (KeyError, TypeError) as e: + yield f"⚠️ Each item must have 'question'; optional 'expected_sources', 'expected_answer': {e}", "" + return + + if not eval_set: + yield "⚠️ Eval set is empty.", "" + return + + try: + update_progress(0.0, "Running retrieval (MRR)...") + yield "⏳ **Running retrieval (MRR)...** This may take a minute.", "" + + mean_mrr, mrr_results = compute_mrr(eval_set) + mrr_detail_lines = [] + for ex, docs, mrr in mrr_results: + sources = ", ".join(_format_doc_source(d) for d in docs[:5]) + mrr_detail_lines.append(f"- **Q:** {ex.question[:60]}… → MRR = {mrr:.3f} | Top sources: {sources}") + + update_progress(0.4, "Running RAG answers + LLM judge...") + yield "⏳ **MRR done.** Running RAG answers and LLM judge (this is slow)...", "" + + mean_score, answer_results = run_answer_eval(eval_set) + detail_lines = [] + for i, (ex, answer, docs, judge) in enumerate(answer_results, 1): + sources = ", ".join(_format_doc_source(d) for d in docs[:5]) + detail_lines.append( + f"### Example {i}: {ex.question}\n" + f"- **Retrieved sources:** {sources}\n" + f"- **Model answer:** {answer[:400]}{'…' if len(answer) > 400 else ''}\n" + f"- **Judge score:** {judge.score}/5 — {judge.reasoning}\n" + ) + + summary = ( + f"## 
Evaluation results\n\n" + f"- **MRR (retrieval):** {mean_mrr:.4f}\n\n" + f"- **Mean LLM judge score (1–5):** {mean_score:.2f}\n\n" + f"### MRR per query\n" + "\n".join(mrr_detail_lines) + ) + details = "## Per-example details (answer + judge)\n\n" + "\n\n".join(detail_lines) + update_progress(1.0, "Done") + yield summary, details + except Exception as e: + import traceback + yield f"❌ **Error:** {e}\n\n```\n{traceback.format_exc()}\n```", "" + + +def main(): + base = Path(__file__).parent + default_json = "" + eval_data_path = base / "eval_data.json" + if eval_data_path.exists(): + default_json = eval_data_path.read_text(encoding="utf-8") + + with gr.Blocks(title="RAG Evaluation") as demo: + gr.Markdown("# RAG Evaluation — MRR & LLM-as-judge\nRun retrieval + answer evals and see MRR and per-response judge scores.") + use_default = gr.Radio( + choices=["default", "custom"], + value="default", + label="Eval set", + info="Use default Galdunx eval set or paste custom JSON.", + ) + custom_json = gr.Textbox( + label="Custom eval set (JSON)", + placeholder='[{"question": "...", "expected_sources": ["file.md"], "expected_answer": "..."}, ...]', + value=default_json, + lines=12, + ) + run_btn = gr.Button("Run evaluation", variant="primary") + summary = gr.Markdown(label="Summary") + details = gr.Markdown(label="Details") + + def run(choice: str, json_text: str): + """Generator: yield progress updates so the UI is not stagnant.""" + yield from run_evaluation(choice == "default", json_text, progress=None) + + run_btn.click( + fn=run, + inputs=[use_default, custom_json], + outputs=[summary, details], + show_progress=True, + ) + + demo.launch(inbrowser=True) + + +if __name__ == "__main__": + main() diff --git a/week5/community-contributions/IbrahimSheriff/evaluation.py b/week5/community-contributions/IbrahimSheriff/evaluation.py new file mode 100644 index 0000000000..b85fd20420 --- /dev/null +++ b/week5/community-contributions/IbrahimSheriff/evaluation.py @@ -0,0 +1,173 @@ +""" 
+RAG evaluation: MRR (retrieval) and LLM-as-judge (answer quality). +Used by eval_app.py Gradio UI. +""" +from __future__ import annotations + +import json +import logging +from dataclasses import dataclass, field +from pathlib import Path + +from langchain_core.documents import Document +from langchain_core.messages import HumanMessage, SystemMessage +from pydantic import BaseModel, Field + +from answer import answer_question, fetch_context + +# Reuse answer.py LLM for judge +from answer import llm + +logging.basicConfig(level=logging.WARNING) + + +@dataclass +class EvalExample: + """Single evaluation example: question + optional expected answer/sources.""" + question: str + expected_sources: list[str] = field(default_factory=list) # e.g. ["about-galdunx.md"] + expected_answer: str | None = None # optional; used for LLM judge + + +def _doc_source(doc: Document) -> str: + """Normalize source to filename for matching (e.g. 'knowledge-base/about-galdunx.md' -> 'about-galdunx.md').""" + raw = (doc.metadata or {}).get("source", "") + return Path(raw).name if raw else "" + + +def compute_mrr_one(question: str, retrieved_docs: list[Document], expected_sources: list[str]) -> float: + """ + MRR for one query: 1/rank of first relevant doc, or 0 if none. + Relevant = doc source filename (e.g. about-galdunx.md) in expected_sources. + """ + if not expected_sources: + return 0.0 + expected_set = {s.strip().lower() for s in expected_sources} + for rank, doc in enumerate(retrieved_docs, start=1): + src = _doc_source(doc).lower() + if any(src == e or src.endswith(e) for e in expected_set): + return 1.0 / rank + return 0.0 + + +def compute_mrr(eval_set: list[EvalExample], fetch_context_fn=None) -> tuple[float, list[tuple[EvalExample, list[Document], float]]]: + """ + Run retrieval (fetch_context) for each example and compute MRR. + Returns (mean_MRR, list of (example, retrieved_docs, mrr_score)). 
+ """ + if fetch_context_fn is None: + fetch_context_fn = lambda q, h: fetch_context(q, h or []) + results: list[tuple[EvalExample, list[Document], float]] = [] + for ex in eval_set: + docs = fetch_context_fn(ex.question, []) + mrr = compute_mrr_one(ex.question, docs, ex.expected_sources) + results.append((ex, docs, mrr)) + n = len(results) + mean_mrr = sum(r[2] for r in results) / n if n else 0.0 + return mean_mrr, results + + +class JudgeResult(BaseModel): + """LLM judge output for one answer.""" + score: int = Field(description="Score from 1 to 5, where 5 is fully correct and relevant") + reasoning: str = Field(description="Brief explanation for the score") + + +def llm_judge_one(question: str, model_answer: str, context_snippet: str, expected_answer: str | None = None) -> JudgeResult: + """ + Use LLM to judge one RAG answer: relevance, correctness, completeness. + If expected_answer is provided, judge against it; otherwise judge against context only. + """ + expected_part = "" + if expected_answer: + expected_part = f"\nExpected answer (reference): {expected_answer}\n" + system_prompt = """You are an evaluator for a RAG (retrieval-augmented generation) system. You will see: +1. The user question +2. The context that was retrieved (excerpt) +3. The model's answer +4. Optionally a reference expected answer + +Score the model's answer from 1 to 5: +- 5: Fully correct, relevant, and complete; uses context appropriately. +- 4: Mostly correct and relevant; minor omissions or slight inaccuracies. +- 3: Partially correct; some relevant info but missing key points or some inaccuracy. +- 2: Weak; little relevance or several inaccuracies. +- 1: Wrong, irrelevant, or contradicts context. 
+ +Reply with a JSON object with keys "score" (integer 1-5) and "reasoning" (brief explanation).""" + user_prompt = f"""Question: {question} +{expected_part} +Context excerpt (first 800 chars): {context_snippet[:800]} + +Model answer: {model_answer} + +Provide your score and reasoning (JSON with "score" and "reasoning").""" + structured = llm.with_structured_output(JudgeResult) + out = structured.invoke([SystemMessage(content=system_prompt), HumanMessage(content=user_prompt)]) + return out + + +def run_answer_eval( + eval_set: list[EvalExample], + answer_fn=None, +) -> tuple[float, list[tuple[EvalExample, str, list[Document], JudgeResult]]]: + """ + Run full RAG (answer_question) for each example, then LLM judge each answer. + Returns (mean_judge_score, list of (example, answer, docs, judge_result)). + """ + if answer_fn is None: + answer_fn = answer_question + results: list[tuple[EvalExample, str, list[Document], JudgeResult]] = [] + for ex in eval_set: + answer, docs = answer_fn(ex.question, []) + context_snippet = "\n\n".join(d.page_content for d in docs)[:1200] + judge = llm_judge_one(ex.question, answer, context_snippet, ex.expected_answer) + results.append((ex, answer, docs, judge)) + n = len(results) + mean_score = sum(r[3].score for r in results) / n if n else 0.0 + return mean_score, results + + +def default_eval_set() -> list[EvalExample]: + """Default eval set for Galdunx knowledge base.""" + return [ + EvalExample( + question="What is Galdunx and what does it do?", + expected_sources=["about-galdunx.md"], + expected_answer="Galdunx is a digital product and creative technology studio that designs and builds digital solutions. 
It offers web development, UI/UX design, and helps businesses turn ideas into scalable products.", + ), + EvalExample( + question="What web technologies does Galdunx use?", + expected_sources=["web-development.md"], + expected_answer="Next.js, React, Node.js, Nest.js, REST APIs, and related web technologies.", + ), + EvalExample( + question="What is the 10K Store?", + expected_sources=["10k-store.md"], + expected_answer="An affordable e-commerce solution: ready-to-launch online store with product listing, payments, and mobile-friendly design.", + ), + EvalExample( + question="What UI/UX design services does Galdunx offer?", + expected_sources=["uiux-design.md"], + expected_answer="User interfaces for SaaS/dashboards, user experience design, design systems; tools like Figma.", + ), + EvalExample( + question="Which industries does Galdunx serve?", + expected_sources=["about-galdunx.md"], + expected_answer="SaaS, blockchain/Web3, fintech, healthcare, and other industries.", + ), + ] + + +def load_eval_set_from_json(path: str | Path) -> list[EvalExample]: + """Load eval set from JSON: list of {question, expected_sources?, expected_answer?}.""" + with open(path, encoding="utf-8") as f: + data = json.load(f) + out = [] + for item in data: + out.append(EvalExample( + question=item["question"], + expected_sources=item.get("expected_sources", []), + expected_answer=item.get("expected_answer"), + )) + return out diff --git a/week5/community-contributions/IbrahimSheriff/exercise.ipynb b/week5/community-contributions/IbrahimSheriff/exercise.ipynb new file mode 100644 index 0000000000..0add9aa1b4 --- /dev/null +++ b/week5/community-contributions/IbrahimSheriff/exercise.ipynb @@ -0,0 +1,157 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "ff60bebf", + "metadata": {}, + "source": [ + "# Galdunx RAG Chatbot\n", + "\n", + "This notebook runs the ingestion and chat app for the Galdunx knowledge base.\n", + "\n", + "1. 
**Run the first code cell** to ingest the knowledge base (loads `knowledge-base/*.md`, chunks, embeds via OpenRouter or a local HuggingFace model if `OPENROUTER_API_KEY` is not set, stores in ChromaDB).\n", + "2. **Run the second code cell** (optional) to plot a 2D t-SNE view of the stored embeddings.\n", + "3. **Run the third code cell** to launch the Gradio chat UI.\n", + "\n", + "**To re-ingest** (e.g. after editing `.md` files): **stop the Gradio app first** (interrupt the kernel or stop the server), then run the first cell again. The database cannot be written to while the chat app is using it." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "afc69f45", + "metadata": {}, + "outputs": [], + "source": [ + "import sys\n", + "from pathlib import Path\n", + "\n", + "folder = Path.cwd()\n", + "if not (folder / \"ingest.py\").exists():\n", + " folder = folder / \"week5\" / \"IbrahimSheriff\"\n", + "sys.path.insert(0, str(folder))\n", + "\n", + "import ingest\n", + "\n", + "documents = ingest.fetch_documents()\n", + "chunks = ingest.create_chunks(documents)\n", + "ingest.create_embeddings(chunks)\n", + "print(\"Ingestion complete.\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "79d0224f", + "metadata": {}, + "outputs": [], + "source": [ + "import sys\n", + "from pathlib import Path\n", + "\n", + "folder = Path.cwd()\n", + "if not (folder / \"ingest.py\").exists():\n", + " folder = folder / \"week5\" / \"IbrahimSheriff\"\n", + "sys.path.insert(0, str(folder))\n", + "\n", + "import numpy as np\n", + "import plotly.graph_objects as go\n", + "from sklearn.manifold import TSNE\n", + "import chromadb\n", + "\n", + "DB_NAME = str(folder / \"vector_db\")\n", + "client = chromadb.PersistentClient(path=DB_NAME)\n", + "collections = client.list_collections()\n", + "if not collections:\n", + " raise FileNotFoundError(\"No vector DB found. 
Run the ingest cell first.\")\n", + "coll = collections[0]\n", + "data = coll.get(include=[\"embeddings\", \"metadatas\", \"documents\"])\n", + "\n", + "embeddings = np.array(data[\"embeddings\"], dtype=np.float32)\n", + "n_vectors, n_dims = embeddings.shape\n", + "print(f\"Loaded {n_vectors} vectors with {n_dims} dimensions\")\n", + "\n", + "tsne = TSNE(n_components=2, perplexity=min(15, n_vectors - 1), random_state=42)\n", + "coords_2d = tsne.fit_transform(embeddings)\n", + "print(\"t-SNE reduction complete.\")\n", + "\n", + "metadatas = data[\"metadatas\"] or []\n", + "documents = data[\"documents\"] or []\n", + "sources = []\n", + "for i in range(n_vectors):\n", + " meta = metadatas[i] if i < len(metadatas) else {}\n", + " s = meta.get(\"source\", f\"chunk_{i}\")\n", + " if isinstance(s, str) and \"/\" in s:\n", + " s = s.split(\"/\")[-1]\n", + " sources.append(s)\n", + "unique_sources = list(dict.fromkeys(sources))\n", + "palette = [\"blue\", \"green\", \"red\", \"orange\", \"purple\", \"brown\", \"pink\", \"gray\"]\n", + "colors = [palette[unique_sources.index(s) % len(palette)] for s in sources]\n", + "\n", + "# Hover text: source + document snippet (like day2)\n", + "hover_texts = [\n", + " f\"Source: {s}
Text: {(doc or '')[:100]}...\" \n", + " for s, doc in zip(sources, documents)\n", + "]\n", + "\n", + "fig = go.Figure(data=[go.Scatter(\n", + " x=coords_2d[:, 0],\n", + " y=coords_2d[:, 1],\n", + " mode=\"markers\",\n", + " marker=dict(size=10, color=colors, opacity=0.8),\n", + " text=hover_texts,\n", + " hoverinfo=\"text\",\n", + ")])\n", + "fig.update_layout(\n", + " title=\"Knowledge-base embeddings (2D t-SNE)\",\n", + " xaxis_title=\"t-SNE 1\",\n", + " yaxis_title=\"t-SNE 2\",\n", + " width=800,\n", + " height=600,\n", + " margin=dict(r=20, b=10, l=10, t=40),\n", + ")\n", + "fig.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4c2cfa1e", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "import sys\n", + "from pathlib import Path\n", + "\n", + "folder = Path.cwd()\n", + "if not (folder / \"app.py\").exists():\n", + " folder = folder / \"week5\" / \"IbrahimSheriff\"\n", + "sys.path.insert(0, str(folder))\n", + "\n", + "import app\n", + "app.main()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/week5/community-contributions/IbrahimSheriff/ingest.py b/week5/community-contributions/IbrahimSheriff/ingest.py new file mode 100644 index 0000000000..dc0abe880e --- /dev/null +++ b/week5/community-contributions/IbrahimSheriff/ingest.py @@ -0,0 +1,88 @@ +""" +Ingest Galdunx knowledge-base into ChromaDB. +Uses the same DB path as answer.py so the chat app finds the collection. 
+Run from week5/IbrahimSheriff: python ingest.py +""" +import os +from pathlib import Path + +import numpy # Must be imported before HuggingFaceEmbeddings/SentenceTransformer use it + +from langchain_community.document_loaders import DirectoryLoader, TextLoader +from langchain_text_splitters import RecursiveCharacterTextSplitter +from langchain_chroma import Chroma +from langchain_huggingface import HuggingFaceEmbeddings +from langchain_openai import OpenAIEmbeddings + +from dotenv import load_dotenv + +load_dotenv(override=True) + +# Same path as answer.py so both use the same vector store +DB_NAME = str(Path(__file__).parent / "vector_db") +KNOWLEDGE_BASE = str(Path(__file__).parent / "knowledge-base") + +# Must match answer.py: OpenRouter or HuggingFace +if os.getenv("OPENROUTER_API_KEY"): + embeddings = OpenAIEmbeddings( + model="openai/text-embedding-3-small", + openai_api_key=os.getenv("OPENROUTER_API_KEY"), + openai_api_base="https://openrouter.ai/api/v1", + ) +else: + embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2") + + +def fetch_documents(): + """Load all .md files from knowledge-base (flat or nested).""" + loader = DirectoryLoader( + KNOWLEDGE_BASE, + glob="**/*.md", + loader_cls=TextLoader, + loader_kwargs={"encoding": "utf-8"}, + ) + documents = loader.load() + return documents + + +def create_chunks(documents): + # chunk_overlap must be smaller than chunk_size, otherwise the splitter raises ValueError + text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=200) + return text_splitter.split_documents(documents) + + +def create_embeddings(chunks): + try: + if os.path.exists(DB_NAME): + Chroma(persist_directory=DB_NAME, embedding_function=embeddings).delete_collection() + + vectorstore = Chroma.from_documents( + documents=chunks, embedding=embeddings, persist_directory=DB_NAME + ) + except Exception as e: + if "readonly" in str(e).lower() or "read-only" in str(e).lower(): + raise RuntimeError( + "The vector database is locked because the Gradio chat app is still running.\n" + "Stop the app first: interrupt 
the notebook kernel (e.g. click Stop), or close the\n" + "Gradio tab and stop the server. Then re-run the ingest cell." + ) from e + if "Numpy is not available" in str(e): + raise RuntimeError( + "NumPy 2.x is incompatible with PyTorch in this setup. Either:\n" + " 1. Set OPENROUTER_API_KEY in .env to use API embeddings, or\n" + " 2. Run: uv pip install 'numpy<2' then re-run ingest." + ) from e + raise + + collection = vectorstore._collection + count = collection.count() + sample_embedding = collection.get(limit=1, include=["embeddings"])["embeddings"][0] + dimensions = len(sample_embedding) + print(f"There are {count:,} vectors with {dimensions:,} dimensions in the vector store") + return vectorstore + + +if __name__ == "__main__": + documents = fetch_documents() + chunks = create_chunks(documents) + create_embeddings(chunks) + print("Ingestion complete.") diff --git a/week5/community-contributions/IbrahimSheriff/knowledge-base/Archive.zip b/week5/community-contributions/IbrahimSheriff/knowledge-base/Archive.zip new file mode 100644 index 0000000000..b1fd262638 Binary files /dev/null and b/week5/community-contributions/IbrahimSheriff/knowledge-base/Archive.zip differ
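
Review note: the MRR logic in `evaluation.py` (`compute_mrr_one` and `compute_mrr`) can be sanity-checked without LangChain or a vector store. The sketch below is a simplified stand-in, not the PR's code: `reciprocal_rank`, `mean_reciprocal_rank`, and the sample `runs` are hypothetical names and data, and it matches on exact filename only, whereas the PR also accepts an `endswith` match.

```python
from pathlib import PurePosixPath


def reciprocal_rank(retrieved_sources: list[str], expected_sources: list[str]) -> float:
    """Return 1/rank of the first retrieved source whose filename is expected, else 0.0."""
    # Normalize expected names the same way the PR does: strip and lowercase.
    expected = {s.strip().lower() for s in expected_sources}
    for rank, source in enumerate(retrieved_sources, start=1):
        # Compare on the filename only, e.g. 'knowledge-base/a.md' -> 'a.md'.
        name = PurePosixPath(source).name.lower()
        if name in expected:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], list[str]]]) -> float:
    """Average the per-query reciprocal ranks; an empty eval set scores 0.0."""
    scores = [reciprocal_rank(retrieved, expected) for retrieved, expected in runs]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical retrieval results: (retrieved sources in rank order, expected sources).
runs = [
    (["knowledge-base/web-development.md", "knowledge-base/about-galdunx.md"], ["about-galdunx.md"]),  # hit at rank 2
    (["knowledge-base/10k-store.md"], ["10k-store.md"]),  # hit at rank 1
    (["knowledge-base/uiux-design.md"], ["about-galdunx.md"]),  # no hit
]

print(mean_reciprocal_rank(runs))  # (0.5 + 1.0 + 0.0) / 3 = 0.5
```

A query with no expected sources scores 0.0 here, as in the PR, so such examples pull the mean down rather than being skipped; that may be worth stating in the UI next to the MRR figure.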