diff --git a/.github/agents/blogs.agent.md b/.github/agents/blogs.agent.md
index dd395fb805..8578fee121 100644
--- a/.github/agents/blogs.agent.md
+++ b/.github/agents/blogs.agent.md
@@ -1,10 +1,11 @@
 ---
 name: Blog-Writer
 description: Specialized agent for writing blog blurbs and standalone blog posts about new features.
-model: Claude Sonnet 4.5 (copilot)
+model: Claude Opus 4.5 (copilot)
 tools: ['edit', 'search', 'runTasks', 'microsoft_docs_mcp/*', 'fetch', 'github.vscode-pull-request-github/issue_fetch', 'todos', 'shell']
 ---
+DO NOT OPEN A PULL REQUEST! VERY IMPORTANT: DO NOT OPEN A PR ON THIS REPO. If you have opened one, close it now.
 
 You are a documentation specialist designed to write and edit blogs for a technical audience.
 
 Your output should only be in HTML format.
@@ -17,11 +18,10 @@ Create a list of tasks to implement the different phases below. As tasks are com
 - Ask the user if they want to create a blog blurb, a standalone blog post, or both?
 
-Gather details about the blog to be created:
+Gather details about the blog to be created. Ask the questions one at a time, and wait for an answer before asking the next one:
   - What is the feature or topic of the blog?
-  - Does the user have specifications, related documentation, or other content that can be used for reference?
-  - If there are no specifications, can the user describe the feature and the necessary elements for the blog content?
-  - Are there screenshots or images available? If so, where are they located?
+  - Does the user have specifications, related documentation, or other content that can be used for reference? These can be pasted into the chat now.
+  - If there are no specifications, can the user describe the feature and the necessary elements for the blog content?
 
 **Target lengths:**
 - Blog blurb: ~110-150 words
@@ -91,22 +91,6 @@ Update the list of tasks to reflect the completion of Phase 3.
 Based on the approved outline, the user's requirements, and research findings, create the requested blog content.
 
-## HTML Structure Guidelines
-- Use semantic HTML tags: ``, ``, ``, ``, ``, ``, ``, ``
-- Headings: Use `` for main sections, `` for subsections
-- Links: Use descriptive link text, not "click here" or "learn more"
-  - ✅ `Learn about row-level security policies`
-  - ❌ `Click here`
-- Lists: Use `` for unordered, `` for sequential steps
-- Code: Use `` for inline code, consider `` for blocks
-- Images (if applicable): Include descriptive alt text
-  - `Screenshot showing the access control configuration panel`
-
-## Link Requirements
-- All documentation links must be absolute URLs starting with https://learn.microsoft.com/
-- Verify that linked documentation exists in the repository
-- Use descriptive anchor text that explains what the user will find
-
 After completing the content, present it to the user for review before proceeding to Phase 5.
 
   Update the list of tasks to reflect the completion of Phase 4.
@@ -125,7 +109,89 @@ After completing the content, present it to the user for review before proceedin
   - Present the HTML in a code block for easy copying
   - Ensure proper HTML formatting with indentation
 
+  The following WordPress HTML formatting instructions must be strictly followed:
+
+  ### Document Structure
+  - Wrap entire content in `` and `` tags
+  - Close with `` and ``
+
+  ### Paragraphs
+  ```html
+  Your paragraph text here.
+  ```
+
+  ### Headings
+  **H2 (Main sections):**
+  ```html
+  Your Heading Text
+  ```
+
+  **H3 (Subsections):**
+  ```html
+  Your Subheading Text
+  ```
+
+  ### Links
+  - Inline links: `link text`
+  - External links with target blank: `link text`
+
+  ### Bold Text
+  - Use `text` for emphasis
+
+  ### Lists
+  **Unordered lists:**
+  ```html
+  List item text
+  Another list item
+  ```
+
+  ### Images
+
+  **With center alignment:**
+  ```html
+  Alt text description
+  Caption text
+  ```
+
+  **Without alignment specified:**
+  ```html
+  Alt text description.
+  Caption text
+  ```
+
+  ### Video Embeds (YouTube)
+  ```html
+  [embed]https://www.youtube.com/watch?v=VIDEO_ID[/embed]
+  ```
+
+  ### Key Formatting Rules
+  1. Every block element needs opening and closing WordPress comments
+  2. Paragraphs, headings, lists, images, and embeds all follow the `` pattern
+  3. Each list item gets its own `` wrapper
+  4. Use `rel="noreferrer noopener"` for external links with `target="_blank"`
+  5. Always include alt text for images
+  6. Figure captions use `class="wp-element-caption"`
+  7. Image IDs should be unique integers
+
   ## Content Guidelines
+  - Be concise. Do not restate information in more than one place.
   - Follow Microsoft documentation style guidelines: https://learn.microsoft.com/en-us/style-guide/welcome/
   - **Use plain, inclusive language** - Avoid gender-specific terms, use neutral examples
   - **Use present tense** - "This feature lets you..." not "This feature will let you..."
@@ -156,6 +222,7 @@ After completing the content, present it to the user for review before proceedin
 Perform final validation checks before delivering the content:
 
   ## Content Validation
+  - **Structure**: Ensure that all required sections are present and that no information is restated or redundant.
   - **Word count**: Verify length matches target (blurb: 110-150 words, standalone: 900-1000 words)
   - **Accuracy**: Ensure all technical information is correct and up-to-date
   - **Completeness**: All sections from approved outline are included
diff --git a/.github/agents/docs-image.agent.md b/.github/agents/docs-image.agent.md
new file mode 100644
index 0000000000..b4b27cda68
--- /dev/null
+++ b/.github/agents/docs-image.agent.md
@@ -0,0 +1,94 @@
+---
+name: Image-Documentation-Agent
+description: Specialized agent for suggesting, placing, and referencing images in technical documentation.
+model: Claude Opus 4.5 (copilot)
+tools:
+  ['edit', 'search', 'runTasks', 'microsoft_docs_mcp/*', 'fetch', 'github.vscode-pull-request-github/issue_fetch', 'todos', 'shell']
+---
+
+You are a documentation specialist designed to manage images in technical documentation.
+
+Your role is to execute the following workflow.
+
+Create a list of tasks to implement the different phases below. As tasks are completed, update the list (e.g., ✅ for done, ⏳ for in progress).
+
+# Phase 1: Suggest image placement
+
+
+
+- Review the provided documentation files.
+- Identify sections where images would enhance understanding (e.g., diagrams, screenshots, charts).
+- For each identified section, suggest the type of image needed (e.g., screenshot, diagram, chart) and a brief description of its content.
+- Create a list of suggested images with their descriptions and placement locations within the documents. Save this list as a markdown file named `docs-list.md` in the `.github` directory.
+
+  Update the list of tasks to reflect the completion of Phase 1.
+
+
+# Phase 2: Get images from user
+
+
+Ask the user to provide the images based on the suggestions from Phase 1. Only PNG images are allowed. Ignore images in any other format.
+Create a staging folder named `docs-images-staging` in the root directory of the repository.
+Instruct the user to upload the images to this folder with filenames that correspond to their descriptions in the `docs-list.md` file.
+
+  Update the list of tasks to reflect the completion of Phase 2.
+
+
+# Phase 3: Review Images and Update List
+
+
+
+Once the user has provided images:
+- Review each image in the `docs-images-staging` folder to understand what it shows
+- Verify that the images match the descriptions in the `docs-list.md` file
+- If any images don't match expectations or additional images are needed, update the `docs-list.md` file accordingly
+- Note the key visual elements in each image to ensure accurate alt text descriptions
+- Confirm with the user that all necessary images have been provided before proceeding
+
+Update the list of tasks to reflect the completion of Phase 3.
+
+
+
+# Phase 4: Move images to correct location
+
+
+For each image in the `docs-images-staging` folder, move it to the appropriate location within the `docs` directory structure based on its intended use in the documentation.
+
+  - The correct location is under the 'media' folder within the relevant documentation section (e.g., `docs/real-time-intelligence/media/` for Real-Time Intelligence docs).
+  - Within the media folder, the image goes under a folder with the same name as the document it is used in (e.g., `docs/real-time-intelligence/media/tutorial-7-create-anomaly-detection/` for images used in `tutorial-7-create-anomaly-detection.md`). The image name must be completely in lowercase, with words separated by hyphens.
+    If this folder does not exist, create it.
+
+  Update the list of tasks to reflect the completion of Phase 4.
+
+
+# Phase 5: Insert images into document
+
+
+
+- For each placeholder in the document, insert the corresponding image using the correct markdown syntax.
+
+  - Use the following syntax for images:
+
+    ```markdown
+    :::image type="content" source="./media/architecture.png" alt-text="Architecture diagram showing data flow between services.":::
+    ```
+
+  - Alt text guidelines:
+    - Describe what type of image it is, for example "screenshot", "diagram", "chart", etc.
+    - Summarize the content of the image in a concise manner.
+    - End with a period.
+
+  Update the list of tasks to reflect the completion of Phase 5.
+
+
+
+# Phase 6: Delete image list and staging folder
+
+
+
+- Remove the list of images that is stored as a markdown file named `docs-list.md` in the `.github` directory.
+- Remove the staging folder named `docs-images-staging` in the root directory of the repository.
+
+  Update the list of tasks to reflect the completion of Phase 6.
+
+
diff --git a/.github/agents/docs.agent.md b/.github/agents/docs.agent.md
index d600c3c46b..da65cc92fc 100644
--- a/.github/agents/docs.agent.md
+++ b/.github/agents/docs.agent.md
@@ -1,7 +1,7 @@
 ---
 name: Documentation-Writer
 description: Specialized agent for creating new documentation and editing existing documentation.
-model: Claude Sonnet 4.5 (copilot)
+model: Claude Opus 4.5 (copilot)
 tools: ['edit', 'search', 'runTasks', 'microsoft_docs_mcp/*', 'fetch', 'github.vscode-pull-request-github/issue_fetch', 'todos', 'shell']
 ---
 
@@ -16,7 +16,7 @@ Create a list of tasks to implement the different phases below. As tasks are com
-Your task is to gather all necessary information from the user to create or edit technical documentation. Follow these steps:
+Your task is to gather all necessary information from the user to create or edit technical documentation. Follow these steps. Ask the questions one at a time, and wait for an answer before asking the next one.
 - Ask the user if they want to create a new document or edit existing ones.
 - Gather details about the document(s) to be created or edited, including:
   - What is the subject matter or feature the documentation will cover?
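The Phase 4 placement rule in `docs-image.agent.md` (a `media` folder per documentation section, a subfolder named after the document, and a lowercase, hyphen-separated image name) can be sketched in Python. This is a minimal illustration and not part of the PR: the helper names `normalize_image_name` and `destination_path` are hypothetical, and treating every run of non-alphanumeric characters as a word separator is an assumption about how names should be normalized.

```python
import re
from pathlib import PurePosixPath


def normalize_image_name(filename: str) -> str:
    """Lowercase a PNG file name and join its words with hyphens (assumed rule)."""
    path = PurePosixPath(filename)
    # Phase 2 of the agent only accepts PNG images.
    if path.suffix.lower() != ".png":
        raise ValueError(f"only PNG images are allowed: {filename}")
    # Assumption: any run of characters other than letters and digits separates words.
    words = re.findall(r"[a-z0-9]+", path.stem.lower())
    if not words:
        raise ValueError(f"no usable words in file name: {filename}")
    return "-".join(words) + ".png"


def destination_path(doc_path: str, image_filename: str) -> PurePosixPath:
    """Build <section>/media/<document-name>/<image-name> for an image used in doc_path."""
    doc = PurePosixPath(doc_path)
    return doc.parent / "media" / doc.stem / normalize_image_name(image_filename)


if __name__ == "__main__":
    print(destination_path(
        "docs/real-time-intelligence/tutorial-7-create-anomaly-detection.md",
        "Anomaly Detector_Settings.PNG",
    ))
```

Pure path objects are used so the sketch only computes the target location. Creating the folder and moving the file are left to the caller.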
diff --git a/data-explorer/kusto/functions-library/functions-library.md b/data-explorer/kusto/functions-library/functions-library.md
index 060b7c5fb3..9af4edae82 100644
--- a/data-explorer/kusto/functions-library/functions-library.md
+++ b/data-explorer/kusto/functions-library/functions-library.md
@@ -33,7 +33,7 @@ The user-defined functions code is given in the articles. It can be used within
 | [geoip_fl()](geoip-fl.md) | Retrieves geographic information of ip address. |
 | [get_packages_version_fl()](get-packages-version-fl.md) | Returns version information of the Python engine and the specified packages. |
 
-## Machine learning functions
+## Machine learning & AI functions
 
 | Function Name | Description |
 |--|--|
@@ -43,6 +43,7 @@ The user-defined functions code is given in the articles. It can be used within
 | [kmeans_dynamic_fl()](kmeans-dynamic-fl.md) | Clusterize using the K-Means algorithm, features are in a single dynamic column. |
 | [predict_fl()](predict-fl.md) | Predict using an existing trained machine learning model. |
 | [predict_onnx_fl()](predict-onnx-fl.md) | Predict using an existing trained machine learning model in ONNX format. |
+| [slm_embeddings_fl()](slm-embeddings-fl.md) | Generate text embeddings using local Small Language Models (SLM). |
 
 ## Plotly functions
 
diff --git a/data-explorer/kusto/functions-library/slm-embeddings-fl.md b/data-explorer/kusto/functions-library/slm-embeddings-fl.md
new file mode 100644
index 0000000000..7fdf723938
--- /dev/null
+++ b/data-explorer/kusto/functions-library/slm-embeddings-fl.md
@@ -0,0 +1,234 @@
+---
+title: slm_embeddings_fl()
+description: This article describes the slm_embeddings_fl() user-defined function.
+ms.reviewer: adieldar
+ms.topic: reference
+ms.date: 12/16/2025
+---
+# slm_embeddings_fl()
+
+>[!INCLUDE [applies](../includes/applies-to-version/applies.md)] [!INCLUDE [fabric](../includes/applies-to-version/fabric.md)] [!INCLUDE [azure-data-explorer](../includes/applies-to-version/azure-data-explorer.md)]
+
+The function `slm_embeddings_fl()` is a [UDF (user-defined function)](../query/functions/user-defined-functions.md) that generates text embeddings using local Small Language Models (SLM). This function converts text into numerical vector representations that can be used for semantic search, similarity analysis, and other natural language processing tasks.
+Currently, the function supports the [jina-v2-small](https://huggingface.co/jinaai/jina-embeddings-v2-small-en) and [e5-small-v2](https://huggingface.co/intfloat/e5-small-v2) models.
+
+[!INCLUDE [python-zone-pivot-fabric](../includes/python-zone-pivot-fabric.md)]
+
+## Syntax
+
+`T | invoke slm_embeddings_fl(`*text_col*`,` *embeddings_col* [`,` *batch_size* ] [`,` *model_name* ] [`,` *prefix* ]`)`
+
+[!INCLUDE [syntax-conventions-note](../includes/syntax-conventions-note.md)]
+
+## Parameters
+
+|Name|Type|Required|Description|
+|--|--|--|--|
+|*text_col*| `string` | :heavy_check_mark:|The name of the column containing the text to embed.|
+|*embeddings_col*| `string` | :heavy_check_mark:|The name of the column to store the output embeddings.|
+|*batch_size*| `int` ||The number of texts to process in each batch. Default is 32.|
+|*model_name*| `string` ||The name of the embedding model to use. Supported values are `jina-v2-small` (default) and `e5-small-v2`.|
+|*prefix*| `string` ||The text prefix to add before each input. Default is `query:`. For the E5 model, use `query:` for search queries and `passage:` for documents to be searched. This parameter is ignored for the Jina model.|
+
+## Function definition
+
+You can define the function by either embedding its code as a query-defined function, or creating it as a stored function in your database, as follows:
+
+### [Query-defined](#tab/query-defined)
+
+Define the function using the following [let statement](../query/let-statement.md). No permissions are required.
+
+> [!IMPORTANT]
+> A [let statement](../query/let-statement.md) can't run on its own. It must be followed by a [tabular expression statement](../query/tabular-expression-statements.md). To run a working example of `slm_embeddings_fl()`, see [Example](#example).
+
+~~~kusto
+let slm_embeddings_fl = (tbl:(*), text_col:string, embeddings_col:string, batch_size:int=32, model_name:string='jina-v2-small', prefix:string='query:')
+{
+    let kwargs = bag_pack('text_col', text_col, 'embeddings_col', embeddings_col, 'batch_size', batch_size, 'model_name', model_name, 'prefix', prefix);
+    let code = ```if 1:
+        from sandbox_utils import Zipackage
+        Zipackage.install('embedding_engine.zip')
+#       Zipackage.install('tokenizers-0.22.1.whl') # redundant if tokenizers package is included in the Python image
+
+        from embedding_factory import create_embedding_engine
+
+        text_col = kargs["text_col"]
+        embeddings_col = kargs["embeddings_col"]
+        batch_size = kargs["batch_size"]
+        model_name = kargs["model_name"]
+        prefix = kargs["prefix"]
+
+        Zipackage.install(f'{model_name}.zip')
+
+        engine = create_embedding_engine(model_name, cache_dir="C:\\Temp")
+        embeddings = engine.encode(df[text_col].tolist(), batch_size=batch_size, prefix=prefix) # prefix is used only for E5
+
+        result = df
+        result[embeddings_col] = list(embeddings)
+    ```;
+    tbl
+    | evaluate hint.distribution=per_node python(typeof(*), code, kwargs, external_artifacts = bag_pack(
+        'embedding_engine.zip', 'https://artifactswestus.z22.web.core.windows.net/models/SLM/embedding_engine.zip',
+//      'tokenizers-0.22.1.whl', 'https://artifactswestus.z22.web.core.windows.net/models/SLM/tokenizers-0.22.1-cp39-abi3-win_amd64.whl',
+        'jina-v2-small.zip', 'https://artifactswestus.z22.web.core.windows.net/models/SLM/jina-v2-small.zip',
+        'e5-small-v2.zip', 'https://artifactswestus.z22.web.core.windows.net/models/SLM/e5-small-v2.zip'))
+};
+// Write your query to use the function here.
+~~~
+
+### [Stored](#tab/stored)
+
+Define the stored function once using the following [`.create function`](../management/create-function.md). [Database User permissions](../access-control/role-based-access-control.md) are required.
+
+> [!IMPORTANT]
+> You must run this code to create the function before you can use the function as shown in the [Example](#example).
+
+~~~kusto
+.create-or-alter function with (folder = "Packages\\AI", docstring = "Embedding using local SLM")
+slm_embeddings_fl(tbl:(*), text_col:string, embeddings_col:string, batch_size:int=32, model_name:string='jina-v2-small', prefix:string='query:')
+{
+    let kwargs = bag_pack('text_col', text_col, 'embeddings_col', embeddings_col, 'batch_size', batch_size, 'model_name', model_name, 'prefix', prefix);
+    let code = ```if 1:
+        from sandbox_utils import Zipackage
+        Zipackage.install('embedding_engine.zip')
+#       Zipackage.install('tokenizers-0.22.1.whl') # redundant if tokenizers package is included in the Python image
+
+        from embedding_factory import create_embedding_engine
+
+        text_col = kargs["text_col"]
+        embeddings_col = kargs["embeddings_col"]
+        batch_size = kargs["batch_size"]
+        model_name = kargs["model_name"]
+        prefix = kargs["prefix"]
+
+        Zipackage.install(f'{model_name}.zip')
+
+        engine = create_embedding_engine(model_name, cache_dir="C:\\Temp")
+        embeddings = engine.encode(df[text_col].tolist(), batch_size=batch_size, prefix=prefix) # prefix is used only for E5
+
+        result = df
+        result[embeddings_col] = list(embeddings)
+    ```;
+    tbl
+    | evaluate hint.distribution=per_node python(typeof(*), code, kwargs, external_artifacts = bag_pack(
+        'embedding_engine.zip', 'https://artifactswestus.z22.web.core.windows.net/models/SLM/embedding_engine.zip',
+//      'tokenizers-0.22.1.whl', 'https://artifactswestus.z22.web.core.windows.net/models/SLM/tokenizers-0.22.1-cp39-abi3-win_amd64.whl',
+        'jina-v2-small.zip', 'https://artifactswestus.z22.web.core.windows.net/models/SLM/jina-v2-small.zip',
+        'e5-small-v2.zip', 'https://artifactswestus.z22.web.core.windows.net/models/SLM/e5-small-v2.zip'))
+}
+~~~
+
+---
+
+## Example
+
+The following example uses the [invoke operator](../query/invoke-operator.md) to run the function.
+
+### Generate embeddings and perform semantic search
+
+### [Query-defined](#tab/query-defined)
+
+To use a query-defined function, invoke it after the embedded function definition.
+
+~~~kusto
+let slm_embeddings_fl=(tbl:(*), text_col:string, embeddings_col:string, batch_size:int=32, model_name:string='jina-v2-small', prefix:string='query:')
+{
+    let kwargs = bag_pack('text_col', text_col, 'embeddings_col', embeddings_col, 'batch_size', batch_size, 'model_name', model_name, 'prefix', prefix);
+    let code = ```if 1:
+        from sandbox_utils import Zipackage
+        Zipackage.install('embedding_engine.zip')
+
+        from embedding_factory import create_embedding_engine
+
+        text_col = kargs["text_col"]
+        embeddings_col = kargs["embeddings_col"]
+        batch_size = kargs["batch_size"]
+        model_name = kargs["model_name"]
+        prefix = kargs["prefix"]
+
+        Zipackage.install(f'{model_name}.zip')
+
+        engine = create_embedding_engine(model_name, cache_dir="C:\\Temp")
+        embeddings = engine.encode(df[text_col].tolist(), batch_size=batch_size, prefix=prefix)
+
+        result = df
+        result[embeddings_col] = list(embeddings)
+    ```;
+    tbl
+    | evaluate hint.distribution=per_node python(typeof(*), code, kwargs, external_artifacts = bag_pack(
+        'embedding_engine.zip', 'https://artifactswestus.z22.web.core.windows.net/models/SLM/embedding_engine.zip',
+        'jina-v2-small.zip', 'https://artifactswestus.z22.web.core.windows.net/models/SLM/jina-v2-small.zip',
+        'e5-small-v2.zip', 'https://artifactswestus.z22.web.core.windows.net/models/SLM/e5-small-v2.zip'))
+};
+//
+// Create a sample dataset with text passages
+let passages = datatable(text:string)
+[
+    "Machine learning models can process natural language efficiently.",
+    "Python is a versatile programming language for data science.",
+    "Azure Data Explorer provides fast analytics on large datasets.",
+    "Embeddings convert text into numerical vector representations.",
+    "Neural networks learn patterns from training data."
+];
+// Generate embeddings for passages using 'passage:' prefix
+let passage_embeddings =
+    passages
+    | extend text_embeddings=dynamic(null)
+    | invoke slm_embeddings_fl('text', 'text_embeddings', 32, 'e5-small-v2', 'passage:');
+// Create a search query and find similar passages
+let search_query = datatable(query:string)
+[
+    "How do embeddings work?"
+];
+search_query
+| extend query_embeddings=dynamic(null)
+| invoke slm_embeddings_fl('query', 'query_embeddings', 32, 'e5-small-v2', 'query:')
+| extend dummy=1
+| join (passage_embeddings | extend dummy=1) on dummy
+| project query, text, similarity=series_cosine_similarity(query_embeddings, text_embeddings, 1.0, 1.0)
+| top 3 by similarity desc
+~~~
+
+### [Stored](#tab/stored)
+
+> [!IMPORTANT]
+> For this example to run successfully, you must first run the [Function definition](#function-definition) code to store the function.
+
+```kusto
+// Create a sample dataset with text passages
+let passages = datatable(text:string)
+[
+    "Machine learning models can process natural language efficiently.",
+    "Python is a versatile programming language for data science.",
+    "Azure Data Explorer provides fast analytics on large datasets.",
+    "Embeddings convert text into numerical vector representations.",
+    "Neural networks learn patterns from training data."
+];
+// Generate embeddings for passages using 'passage:' prefix
+let passage_embeddings =
+    passages
+    | extend text_embeddings=dynamic(null)
+    | invoke slm_embeddings_fl('text', 'text_embeddings', 32, 'e5-small-v2', 'passage:');
+// Create a search query and find similar passages
+let search_query = datatable(query:string)
+[
+    "How do embeddings work?"
+];
+search_query
+| extend query_embeddings=dynamic(null)
+| invoke slm_embeddings_fl('query', 'query_embeddings', 32, 'e5-small-v2', 'query:')
+| extend dummy=1
+| join (passage_embeddings | extend dummy=1) on dummy
+| project query, text, similarity=series_cosine_similarity(query_embeddings, text_embeddings, 1.0, 1.0)
+| top 3 by similarity desc
+```
+
+---
+
+**Output**
+
+| query | text | similarity |
+|---|---|---|
+| How do embeddings work? | Embeddings convert text into numerical vector representations. | 0.871 |
+| How do embeddings work? | Neural networks learn patterns from training data. | 0.812 |
+| How do embeddings work? | Machine learning models can process natural language efficiently. | 0.782 |
diff --git a/data-explorer/kusto/functions-library/toc.yml b/data-explorer/kusto/functions-library/toc.yml
index 5450ccf904..a43590e722 100644
--- a/data-explorer/kusto/functions-library/toc.yml
+++ b/data-explorer/kusto/functions-library/toc.yml
@@ -182,6 +182,9 @@ items:
 - name: series_uv_change_points_fl()
   displayName: functions library, anomaly detection, univariate, change point
   href: series-uv-change-points-fl.md
+- name: slm_embeddings_fl()
+  displayName: functions library, text embeddings, SLM, vector embeddings, text analytics, NLP, AI
+  href: slm-embeddings-fl.md
 - name: time_weighted_avg_fl()
   displayName: functions library, binning, time weighted, time series, interpolation
   href: time-weighted-avg-fl.md
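
The example in `slm-embeddings-fl.md` ranks passages by `series_cosine_similarity` between the query embedding and each passage embedding. As a rough illustration of what that score measures, here is a plain-Python sketch of cosine similarity with top-k ranking. The three-dimensional vectors are made-up toy values, not real model output, and the helper names `cosine_similarity` and `top_k` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return dot(a, b) / (|a| * |b|) for two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    passages: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every passage against the query and keep the k highest scores."""
    scored = [(text, cosine_similarity(query, vector)) for text, vector in passages.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


if __name__ == "__main__":
    # Toy vectors for illustration only; real embeddings have hundreds of dimensions.
    toy_passages = {
        "passage A": [1.0, 0.0, 1.0],
        "passage B": [1.0, 1.0, 0.0],
        "passage C": [0.0, 1.0, 0.0],
    }
    for text, score in top_k([1.0, 0.0, 1.0], toy_passages, k=2):
        print(f"{text}: {score:.3f}")
```

In the KQL example, the third and fourth arguments to `series_cosine_similarity` are the optional vector magnitudes. Passing `1.0` for both presumes the embeddings are already normalized to unit length, which is an assumption about the model output that the diff does not state.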