
# Tutorial 3: Running evaluation on a logged RAG chain

This tutorial walks you through using Evaluation Suite to evaluate the quality of a RAG chain built with RAG Studio.

Open the notebook to get started.

For fully worked examples that log and evaluate a chain, please see the [3_pdf_rag_with_single_turn_chat](../../RAG Cookbook/3_pdf_rag_with_single_turn_chat/) or [4_pdf_rag_with_multi_turn_chat](../../RAG Cookbook/4_pdf_rag_with_multi_turn_chat/) examples.

Evaluation Harness supports two approaches for providing your chain's outputs in order to generate quality/cost/latency metrics. This tutorial shows you approach #1, evaluating a chain that has been logged with MLflow.

| Approach | Description | When to use |
|---|---|---|
| 1 | **Eval Harness runs the chain on your behalf.** Pass a reference to the chain itself so Evaluation Harness can generate the outputs for you. | - Your chain is logged using MLflow with MLflow Tracing enabled<br>- Your chain is available as a Python function in your local notebook |
| 2 | **Run the chain yourself, pass outputs to Eval Harness.** Run the chain being evaluated yourself, capturing the chain's outputs and passing them as a Pandas DataFrame (see the sketch below). | - Your chain is developed outside of Databricks<br>- You want to evaluate outputs from a chain already running in production<br>- You are testing different evaluation / LLM Judge configurations and your chain doesn't produce deterministic outputs (e.g., the LLM has a high temperature) |
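
For contrast, here is a minimal sketch of approach #2. It assumes the Evaluation Harness accepts precomputed outputs in a `response` column when no `model` argument is supplied; consult the Evaluation Harness Input Schema documentation (referenced in Step 1 below) for the authoritative column names.

```python
import pandas as pd
import mlflow

# Hypothetical outputs captured from a chain you ran yourself.
outputs_df = pd.DataFrame([
    {
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "response": "reduceByKey combines values per key before shuffling; groupByKey shuffles all values.",
    }
])

# No `model` argument: the harness evaluates the precomputed outputs directly.
with mlflow.start_run(run_name="precomputed_outputs"):
    results = mlflow.evaluate(
        data=outputs_df,
        model_type="databricks-rag",
    )
```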

## Step 1: Gather your chain's outputs (and optionally ground truth) to evaluate

First, you need to collect sample requests, optionally with ground truth, for evaluation to run on. `evaluate(...)` takes a `data` parameter: a Pandas DataFrame containing your Evaluation Set, optionally with ground truth. Let's look at a few examples.

- For full details on the schema, view the Evaluation Harness Input Schema section of the RAG Studio documentation.
- For full details on the metrics available, view the LLM Judges & Metrics section of the RAG Studio documentation.

These dictionary-based examples are provided just to show the schema. You do NOT have to start from a dictionary - you can use any existing Pandas or Spark DataFrame with this schema.

Below, we walk you through the three levels of data that you may have available in your Evaluation Set. The more data you provide, the more functionality Evaluation Suite can offer.

| | Level A | Level B | Level C |
|---|:---:|:---:|:---:|
| **Required data** | | | |
| Evaluation set: `request` | ✓ | ✓ | ✓ |
| Evaluation set: `expected_response` | | ✓ | ✓ |
| Evaluation set: `expected_retrieved_context` | | | ✓ |
| **Supported metrics** | | | |
| `response/llm_judged/relevance_to_query_rating` | ✓ | ✓ | ✓ |
| `response/llm_judged/harmfulness_rating/average` | ✓ | ✓ | ✓ |
| `retrieval/llm_judged/chunk_relevance_precision/average` | ✓ | ✓ | ✓ |
| `response/llm_judged/groundedness_rating/average` | ✓ | ✓ | ✓ |
| `chain/request_token_count` | ✓ | ✓ | ✓ |
| `chain/response_token_count` | ✓ | ✓ | ✓ |
| `chain/total_token_count` | ✓ | ✓ | ✓ |
| `chain/input_token_count` | ✓ | ✓ | ✓ |
| `chain/output_token_count` | ✓ | ✓ | ✓ |
| Customer-defined LLM judges | ✓ | ✓ | ✓ |
| `response/llm_judged/correctness_rating/average` | | ✓ | ✓ |
| `retrieval/ground_truth/document_recall/average` | | | ✓ |
| `retrieval/ground_truth/document_precision/average` | | | ✓ |
### Level A: Evaluation set contains just sample requests

```python
level_A_data = [
    {
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
    }
]
```
### Level B: Evaluation set contains labeled ground truth answers

```python
level_B_data = [
    {
        "request_id": "your-request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_response": "There's no significant difference.",
    }
]
```
### Level C: Evaluation set contains labeled ground truth answers & retrieval context

```python
level_C_data = [
    {
        "request_id": "your-request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_retrieved_context": [
            {
                "doc_uri": "doc_uri_2_1",
            },
            {
                "doc_uri": "doc_uri_2_2",
            },
        ],
        "expected_response": "There's no significant difference.",
    }
]
```
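
Since `evaluate(...)` takes a Pandas DataFrame, any of these example lists can be converted in one line. Step 2 below reuses the Level C result:

```python
import pandas as pd

# Build the Evaluation Set DataFrame from the Level C records above.
level_C_data_df = pd.DataFrame(level_C_data)
```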

## Step 2: Run evaluation

Once you have your data available, running evaluation is a single-line command.

There are four options for passing your chain to the Evaluation Harness. Option 1 is the most commonly used, since you can simply call `evaluate(...)` on the output of `log_model(...)`.

- **Option 1.** Reference to an MLflow logged model in the current MLflow Experiment (see the sketch after this list for how such a reference is produced)

  ```python
  model = "runs:/6b69501828264f9s9a64eff825371711/chain"
  ```

  `6b69501828264f9s9a64eff825371711` is the `run_id` and `chain` is the `artifact_path` that was passed when calling `mlflow.xxx.log_model(...)`.

  This value can be accessed via `model_info.model_uri` if you called `model_info = mlflow.xxx.log_model(...)`, where `xxx` is `langchain` or `pyfunc`.

- **Option 2.** Reference to a Unity Catalog registered model

  ```python
  model = "models:/catalog.schema.model_name/1"  # 1 is the version number
  ```

- **Option 3.** A PyFunc model that is loaded in the Notebook

  ```python
  model = mlflow.pyfunc.load_model(...)
  ```

- **Option 4.** A local function in the Notebook

  ```python
  def model_fn(model_input):
      # Call your chain here and return its response
      response = "the answer!"
      return response

  model = model_fn
  ```
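For context on Option 1, here is a minimal runnable sketch of how a `model_uri` typically comes into existence. `EchoChain` is a hypothetical stand-in for a real RAG chain, not the chain logged in this tutorial.

```python
import mlflow

class EchoChain(mlflow.pyfunc.PythonModel):
    # Hypothetical stand-in for a real RAG chain.
    def predict(self, context, model_input):
        return "the answer!"

with mlflow.start_run() as run:
    model_info = mlflow.pyfunc.log_model(
        artifact_path="chain",      # becomes the `chain` segment of the URI
        python_model=EchoChain(),
    )

# A URI of the form "runs:/<run_id>/chain", ready to pass to `evaluate(...)`.
model = model_info.model_uri
```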

Here we use `level_C_data_df` (built above from `level_C_data`), but you can replace this with a DataFrame built from any of the other examples above.

```python
import pandas as pd
import mlflow

# If you do not start an MLflow run, `evaluate(...)` will start a Run on your behalf.
with mlflow.start_run(run_name="level_C_data"):
    evaluation_results = mlflow.evaluate(
        data=level_C_data_df,
        model="runs:/a828658a8c9f46eeb7ef346e65228394/chain",
        model_type="databricks-rag",
    )
```

## Step 3: Use the data & metrics

Evaluation Harness produces several outputs:

- Aggregated metric values across the entire Evaluation Set
  - Average numerical result of each metric
- Data about each question in the Evaluation Set
  - In the same schema as the Evaluation Input
    - Inputs sent to the chain
    - All chain-generated data used in evaluation, e.g., `response`, `retrieved_context`, `trace`, etc.
  - Numeric result of each metric, e.g., 1 or 0
  - Ratings & rationales from each Databricks and Customer-defined LLM judge

These outputs are available in two locations:

1. Stored inside the MLflow Run & Experiment as raw data & visualizations
2. Returned as DataFrames & Dictionaries by `mlflow.evaluate(...)`

Note: The data is identical between these two locations, so which view you use is a matter of preference.

```python
# Access aggregated metric values across the entire Evaluation Set
metrics_as_dict = evaluation_results.metrics

# Access the data produced on each question in the Evaluation Set
per_question_results_df = evaluation_results.tables['eval_results']
display(per_question_results_df)
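```

As a small usage sketch, the aggregated metrics object behaves like an ordinary `dict`, so you can iterate over it to inspect results (the metric names printed follow the patterns listed in Step 1; the exact set depends on your data level):

```python
# Print each aggregated metric,
# e.g. "retrieval/llm_judged/chunk_relevance_precision/average".
for metric_name, metric_value in metrics_as_dict.items():
    print(f"{metric_name}: {metric_value}")
```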