---
title: "Quickstart - Pt 2"
icon: "square-2"
---
This continues from Quickstart Part 1, where we built a cake recipe generator prompt.
In Part 1, you created a prompt, tested it in the playground, and learned about versioning. Now let's evaluate prompt quality, test different models, and connect everything to your code.
Before deploying a prompt, you want to know if it's actually good. PromptLayer lets you build evaluation pipelines that score your prompt's outputs automatically.
Evaluations run against a dataset - a collection of test cases with inputs and expected outputs. Let's create one for our cake recipe prompt.
Click New → Dataset and name it "cake-recipes-test".
Add a few test cases. Each row needs the input variables your prompt expects (cake_type, serving_size) and optionally an expected output to compare against:
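For example, rows might look like this (the values below are just illustrative):

| cake_type | serving_size | expected_output |
|---|---|---|
| Chocolate Cake | 8 | A chocolate cake recipe with Overview, Ingredients, and Instructions sections |
| Vanilla Cake | 12 | A vanilla cake recipe with Overview, Ingredients, and Instructions sections |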
Download this CSV or add rows manually in the UI.
Learn more about Datasets.
Now let's build a pipeline that runs your prompt against each test case and scores the results.
Click New → Evaluation and select your dataset.
First, add a Prompt Template column. This runs your prompt against each row in the dataset, using the column values as input variables. The output appears in a new column.
Next, add an LLM-as-judge scoring column. This uses AI to score each output against criteria you define. For our recipe prompt, we might check:
- Does the recipe include all required sections (Overview, Ingredients, Instructions)?
- Are measurements provided in both metric and US units?
- Is the serving size correct?
You can also add an Equality Comparison column to compare the prompt output against the expected_output column in your dataset.
Run the evaluation to see scores across all test cases. Learn more about Evaluations.
Beyond LLM-as-judge, PromptLayer supports:
- Human grading: Collect scores from domain experts
- Equality Comparison: Compare outputs to expected results
- Cosine similarity: Measure semantic similarity between outputs
- Code evaluators: Write custom Python scoring functions
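For the last of these, a custom scoring function might look like the sketch below. It checks whether the required recipe sections appear in an output; treat it as illustrative, since the exact evaluator signature PromptLayer expects is described in the Evaluations docs.

```python
# Sketch of a custom code evaluator (illustrative; PromptLayer's exact
# evaluator signature may differ - see the Evaluations docs).
REQUIRED_SECTIONS = ["Overview", "Ingredients", "Instructions"]

def score_recipe(output: str) -> float:
    """Return the fraction of required recipe sections present in the output."""
    present = sum(1 for section in REQUIRED_SECTIONS if section in output)
    return present / len(REQUIRED_SECTIONS)
```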
Workflow nodes work the same way in eval pipelines.
Want to compare how your prompt performs across GPT, Claude, and Gemini? Create a new evaluation for model comparison.
Add multiple Prompt Template columns, each configured with a different model override. The pipeline runs your prompt on each model and shows results side by side.
This helps you find the best price/performance balance for your use case. PromptLayer supports all major providers, plus any OpenAI-compatible API via custom base URLs.
Once your prompt is in production, you'll have real request logs. Use these to test new prompt versions against actual user inputs.
Go to Datasets and click Add from Request History. This opens a request log browser where you can filter and select requests.
Filter by prompt name, date range, metadata, or search content. Select the requests you want and click Add Requests. This captures the real inputs your users sent, along with the outputs your current prompt produced.
Create an evaluation that runs your new prompt version against this historical dataset. Add columns for:
- New prompt output: The response from your updated prompt version
- Equality Comparison: Side-by-side comparison highlighting changes from the original output
Backtests are powerful because they show impact on real user data without needing to define exact pass/fail criteria upfront. You can quickly spot if a change produces dramatically different outputs.
Learn more about backtesting.
Attach an evaluation pipeline to run automatically every time you save a new prompt version - similar to GitHub Actions running tests on each commit.
When saving a prompt, the commit dialog lets you select an evaluation pipeline. Choose one and click Next.
From then on, each new version you create will run through the eval and show its score in the version history. This makes it easy to spot regressions before they reach production.
Learn more about continuous integration.
Evaluations aren't just for testing prompts - they're one of the best tools for running prompts in batch. Think of it like a spreadsheet where each column can be an AI-powered computation.
Upload a CSV or create a dataset, add prompt columns, run the batch, and export the results. Common use cases:
- Data labeling: Run GPT over production data to create labeled training datasets
- Research: Use web search prompts to find information about a list of companies or people
- Content generation: Process hundreds of items (commits, support tickets, reviews) to generate summaries or emails
- Data enrichment: Take a list of names and enrich with company, location, or other attributes
You don't need to build permanent evaluation infrastructure for this. One-off, ad-hoc batch runs are a perfectly valid use case. Create a dataset, add your prompt columns, run it, export, and move on.
PromptLayer serves as the source of truth for your prompts. Your application fetches prompts by name, keeping prompt logic out of your codebase and enabling non-engineers to make changes.
Install the SDK:

```bash
pip install promptlayer
```

```bash
npm install promptlayer
```

Initialize the client and run a prompt. The SDK fetches the prompt template from PromptLayer, runs it against your configured LLM provider locally, then logs the result back.
```python
from promptlayer import PromptLayer

client = PromptLayer()  # Uses PROMPTLAYER_API_KEY env var

response = client.run(
    prompt_name="cake-recipe",
    input_variables={
        "cake_type": "Chocolate Cake",
        "serving_size": "8"
    }
)
```

```typescript
import { PromptLayer } from "promptlayer";

const client = new PromptLayer();

const response = await client.run({
  promptName: "cake-recipe",
  inputVariables: {
    cake_type: "Chocolate Cake",
    serving_size: "8"
  }
});
```

Read the response through the prompt_blueprint:

```python
print(response["prompt_blueprint"]["prompt_template"]["messages"][-1]["content"])
```

This lets you switch models without changing how you read responses.
After running, head to Logs in PromptLayer to see your request with the full input, output, and latency.
Use prompt_release_label="production" to fetch the version labeled for production. Use prompt_version=3 to pin to a specific version number. Workflows work the same way - just pass the workflow name. Store API keys as environment variables (PROMPTLAYER_API_KEY, OPENAI_API_KEY) - the client reads these automatically.
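For example, building on the snippet above (the parameter names come from the text; the exact call shape is a sketch):

```python
# Fetch whichever version currently carries the "production" release label
response = client.run(
    prompt_name="cake-recipe",
    prompt_release_label="production",
    input_variables={"cake_type": "Chocolate Cake", "serving_size": "8"}
)

# Or pin to a specific version number
response = client.run(
    prompt_name="cake-recipe",
    prompt_version=3,
    input_variables={"cake_type": "Chocolate Cake", "serving_size": "8"}
)
```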
Add metadata to track requests by user, session, or feature flag:
```python
response = client.run(
    prompt_name="cake-recipe",
    input_variables={"cake_type": "Chocolate", "serving_size": "8"},
    tags=["production", "recipe-feature"],
    metadata={
        "user_id": "user_123",
        "session_id": "sess_abc"
    }
)

# Add a score after reviewing the output
client.track.score(request_id=response["request_id"], score=95)
```

Your metadata and tags appear in the log details, letting you filter and search by user or feature.
Learn more about metadata and tagging.
For workflows or other complex pipelines, enable tracing with client = PromptLayer(enable_tracing=True) and use the @client.traceable decorator on your functions to see each step as spans. Learn more about traces.
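A minimal sketch of what that can look like (the decorated function name below is illustrative):

```python
from promptlayer import PromptLayer

client = PromptLayer(enable_tracing=True)

@client.traceable  # each call to this function appears as a span in the trace
def generate_recipe(cake_type: str, serving_size: str):
    return client.run(
        prompt_name="cake-recipe",
        input_variables={"cake_type": cake_type, "serving_size": serving_size}
    )

generate_recipe("Chocolate Cake", "8")
```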
Use organizations and workspaces to manage teams and environments. Common setups include separate workspaces for Production, Staging, and Development.
Key features:
- Role-based access control: Owner, Admin, Editor, and Viewer roles at organization and workspace levels
- Audit logs: Track who changed what and when
- Author attribution: See who created and modified each prompt version
- Centralized billing: Manage usage across all workspaces
Each workspace has its own API key. Switch between workspaces using the dropdown in the top navigation.
- Integration Patterns - Caching, webhooks, and production patterns
- Tutorial Videos - Watch walkthroughs of common workflows
- API Reference - Full SDK documentation