This tutorial walks you through using Evaluation Suite to evaluate the quality of a RAG chain built with RAG Studio.
Open the notebook to get started.
For fully worked examples that log and evaluate a chain, see the [3_pdf_rag_with_single_turn_chat](../../RAG Cookbook/3_pdf_rag_with_single_turn_chat/) or [4_pdf_rag_with_multi_turn_chat](../../RAG Cookbook/4_pdf_rag_with_multi_turn_chat/) examples.
Evaluation Harness supports 2 approaches for providing your chain's outputs in order to generate quality/cost/latency Metrics. This tutorial shows you approach #1, where Evaluation Harness runs the chain on your behalf (a sketch of both call shapes follows the table below).
| Approach | Description | When to use |
|---|---|---|
| 1 | **Eval Harness runs the chain on your behalf.** Pass a reference to the chain itself so Evaluation Harness can generate the outputs for you. | - Your chain is logged using MLflow with MLflow Tracing enabled<br>- Your chain is available as a Python function in your local notebook |
| 2 | **Run the chain yourself, pass outputs to Eval Harness.** Run the chain being evaluated yourself, capturing the chain's outputs and passing them as a Pandas DataFrame. | - Your chain is developed outside of Databricks<br>- You want to evaluate outputs from a chain already running in production<br>- You are testing different evaluation / LLM Judge configurations and your chain doesn't produce deterministic outputs (e.g., the LLM has a high temperature) |
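To make the two call shapes concrete, here is a minimal sketch. The run URI is a placeholder, and the `response` column name for pre-computed outputs in approach #2 is an assumption; check the Evaluation Harness Input Schema section of the documentation for the exact field names.

```python
import mlflow
import pandas as pd

eval_df = pd.DataFrame([{"request": "What is Spark?"}])

# Approach 1: pass a reference to the chain; Evaluation Harness generates the outputs.
mlflow.evaluate(
    data=eval_df,
    model="runs:/<run_id>/chain",  # placeholder MLflow run URI
    model_type="databricks-rag",
)

# Approach 2: run the chain yourself and include its outputs in `data`.
# The `response` column name is an assumption; see the Input Schema docs.
eval_df_with_outputs = eval_df.assign(
    response="Spark is a distributed compute engine."
)
mlflow.evaluate(data=eval_df_with_outputs, model_type="databricks-rag")
```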
First, you need to collect sample requests, optionally with ground truth, for the evaluation to run on. `evaluate(...)` takes a `data` parameter: a Pandas DataFrame containing your Evaluation Set, optionally with ground truth. Let's look at a few examples.
- For full details on the schema, view the Evaluation Harness Input Schema section of the RAG Studio documentation.
- For full details on the metrics available, view the LLM Judges & Metrics section of the RAG Studio documentation.
These dictionary-based examples are provided just to show the schema. You do NOT have to start from a dictionary - you can use any existing Pandas or Spark DataFrame with this schema (a conversion sketch follows the examples below).
Below, we walk you through the 3 levels of data that you may have available in your Evaluation Set. As your data level increases, Evaluation Suite can offer you additional functionality.
| | Level A | Level B | Level C |
|---|---|---|---|
| **Required data** | | | |
| Evaluation set: `request` | ✔ | ✔ | ✔ |
| Evaluation set: `expected_response` | X | ✔ | ✔ |
| Evaluation set: `expected_retrieved_context` | X | X | ✔ |
| **Supported metrics** | | | |
| `response/llm_judged/relevance_to_query_rating` | ✔ | ✔ | ✔ |
| `response/llm_judged/harmfulness_rating/average` | ✔ | ✔ | ✔ |
| `retrieval/llm_judged/chunk_relevance_precision/average` | ✔ | ✔ | ✔ |
| `response/llm_judged/groundedness_rating/average` | ✔ | ✔ | ✔ |
| `chain/request_token_count` | ✔ | ✔ | ✔ |
| `chain/response_token_count` | ✔ | ✔ | ✔ |
| `chain/total_token_count` | ✔ | ✔ | ✔ |
| `chain/input_token_count` | ✔ | ✔ | ✔ |
| `chain/output_token_count` | ✔ | ✔ | ✔ |
| Customer-defined LLM judges | ✔ | ✔ | ✔ |
| `response/llm_judged/correctness_rating/average` | X | ✔ | ✔ |
| `retrieval/ground_truth/document_recall/average` | X | X | ✔ |
| `retrieval/ground_truth/document_precision/average` | X | X | ✔ |
```python
level_A_data = [
    {
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
    }
]

level_B_data = [
    {
        "request_id": "your-request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_response": "There's no significant difference.",
    }
]

level_C_data = [
    {
        "request_id": "your-request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_retrieved_context": [
            {
                "doc_uri": "doc_uri_2_1",
            },
            {
                "doc_uri": "doc_uri_2_2",
            },
        ],
        "expected_response": "There's no significant difference.",
    }
]
```
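The evaluation code below expects DataFrames (it references `level_C_data_df`), so convert the dictionary examples first. A minimal sketch follows; the Spark table name is a placeholder.

```python
import pandas as pd

# Convert the dictionary examples above into Pandas DataFrames.
level_A_data_df = pd.DataFrame(level_A_data)
level_B_data_df = pd.DataFrame(level_B_data)
level_C_data_df = pd.DataFrame(level_C_data)

# Or start from any existing Spark DataFrame with the same schema, e.g.:
# level_C_data_df = spark.table("catalog.schema.eval_set").toPandas()  # placeholder table name
```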
Once you have your data available, running an evaluation is a simple one-line command.
There are 4 options for passing your chain to the Evaluation Harness. Option 1 is the most commonly used, since you can simply call `evaluate(...)` on the output of `log_model(...)`.
- **Option 1. Reference to an MLflow logged model in the current MLflow Experiment**

  ```python
  model = "runs:/6b69501828264f9s9a64eff825371711/chain"
  ```

  `6b69501828264f9s9a64eff825371711` is the run_id and `chain` is the artifact_path that was passed when calling `mlflow.xxx.log_model(...)`. This value can be accessed via `model_info.model_uri` if you called `model_info = mlflow.xxx.log_model(...)`, where `xxx` is `langchain` or `pyfunc`.

- **Option 2. Reference to a Unity Catalog registered model**

  ```python
  model = "models:/catalog.schema.model_name/1"  # 1 is the version number
  ```

- **Option 3. A PyFunc model that is loaded in the Notebook**

  ```python
  model = mlflow.pyfunc.load_model(...)
  ```

- **Option 4. A local function in the Notebook**

  ```python
  def model_fn(model_input):
      # do stuff
      response = "the answer!"
      return response

  model = model_fn
  ```
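As a quick usage example, Option 4 plugs into the same `evaluate(...)` call as the other options. This is a minimal sketch, reusing `level_A_data_df` from the conversion step above.

```python
import mlflow

# Evaluate a local Python function instead of a logged model.
evaluation_results = mlflow.evaluate(
    data=level_A_data_df,  # requests-only Evaluation Set from the examples above
    model=model_fn,        # the local function defined in Option 4
    model_type="databricks-rag",
)
```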
Here we use `level_C_data_df`, but you can replace it with any of the other example DataFrames above.
```python
import pandas as pd
import mlflow

# If you do not start an MLflow run, `evaluate(...)` will start a Run on your behalf.
with mlflow.start_run(run_name="level_C_data"):
    evaluation_results = mlflow.evaluate(
        data=level_C_data_df,
        model="runs:/a828658a8c9f46eeb7ef346e65228394/chain",
        model_type="databricks-rag",
    )
```
Evaluation Harness produces several outputs:
- Aggregated metric values across the entire Evaluation Set
  - Average numerical result of each metric
- Data about each question in the Evaluation Set
  - In the same schema as the Evaluation Input
  - Inputs sent to the chain
  - All chain-generated data used in evaluation (e.g., response, retrieved_context, trace)
  - Numeric result of each metric (e.g., 1 or 0)
  - Ratings & rationales from each Databricks and Customer-defined LLM judge
These outputs are available in 2 locations:
- Stored inside the MLflow Run & Experiment as raw data & visualizations
- Returned as DataFrames & Dictionaries by `mlflow.evaluate(...)`
Note: The data is identical between these 2 locations, so which view you use is a matter of your preference.
```python
# Access aggregated metric values across the entire Evaluation Set
metrics_as_dict = evaluation_results.metrics

# Access the data produced on each question in the Evaluation Set
per_question_results_df = evaluation_results.tables["eval_results"]
display(per_question_results_df)
```
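From here you can slice the results however you like. The exact metric keys and per-question column names depend on your run, so a safe first step is to enumerate them; a minimal sketch:

```python
# List the aggregated metric names and values computed for this run.
for metric_name, value in metrics_as_dict.items():
    print(f"{metric_name}: {value}")

# Inspect the per-question columns before filtering or sorting on them.
print(per_question_results_df.columns.tolist())
```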