How to build LLM evaluation framework - Blog Post (recipe #2) #14500
base: master
Conversation
Hi @Radu-Raicea! As discussed, here's my first stab at a blog post focused on LLM analytics. Would love your eyes on it when you get a chance. A few things I'm hoping you can help with: an accuracy check — did I misrepresent the product or any features? I'm not an expert here, so I want to make sure I got it right. For context: this is meant to be a high-level tutorial targeting SEO/LLM traffic to drive awareness and visibility, not a deep technical dive. Any and all feedback welcome!
| After AI: you write code that calls an LLM and... hope. Hope it doesn't hallucinate, go off-brand, or end up screenshotted on Twitter. Except hope is not a strategy for production systems. |
| LLMs fail differently than traditional code; bad outputs don't throw errors, they just land on users' plates. It's like serving food you haven't tasted, and hoping you eyeballed the right amount of salt. |
If you're a chef, you don't really taste it though :D
| This recipe gives you a quality control system. |
| [image] |
The Y in Telemetry is not bolded
| ### Want to cook with PostHog? Great choice. |
| If you haven't set it up yet, [start here](/docs/getting-started/install). For LLM analytics specifically: |
Should we link the LLMA getting started page here instead?
| 1. Install the [PostHog SDK](/docs/llm-analytics/installation) with your LLM provider wrapper ([OpenAI](/docs/llm-analytics/installation/openai), [Anthropic](/docs/llm-analytics/installation/anthropic), [LangChain](/docs/llm-analytics/installation/langchain), etc.) |
I would mention the manual capture method too, in case we don't support their framework with a wrapper.
| 2. Make sure you're calling [`posthog.identify()`](/docs/product-analytics/identify) so you can connect AI outputs to specific users |
| 3. Your generations will start appearing in [LLM Analytics](https://app.posthog.com/llm-analytics) automatically |
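To make the three steps above concrete, here's roughly what the setup looks like in Python with the OpenAI wrapper. This is a minimal sketch that assumes the `posthog.ai.openai` wrapper and the `posthog_distinct_id` parameter described in the installation docs; exact names are worth double-checking against the pages linked above.

```python
from posthog import Posthog
from posthog.ai.openai import OpenAI  # PostHog's drop-in wrapper around the OpenAI client

# 1. Initialize PostHog with your project API key and host
posthog = Posthog("<ph_project_api_key>", host="https://us.i.posthog.com")

# 2. Identify the user so generations are connected to a real person
posthog.identify("user_123", properties={"plan": "free"})

# The wrapped client captures each generation (input, output, model, tokens, latency)
client = OpenAI(api_key="<openai_api_key>", posthog_client=posthog)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Suggest a weeknight dinner recipe"}],
    posthog_distinct_id="user_123",  # ties this generation to the identified user
)
print(response.choices[0].message.content)

# 3. The generation now shows up in LLM Analytics with no extra capture code
```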
| The first **100K** LLM events per month are free with 30-day retention! |
We could either not mention the retention, or specify that we don't exactly drop the whole LLM event, but rather just the input/output messages. We still keep the events for their metadata (cost, number, users associated) past 30 days.
| An LLM evaluation framework is only as good as its data. Every generation should be captured with enough context to debug later. |
| Set up your LLM observability so that, at a minimum, it records: |
| - **Input** (the user's prompt and any system context) |
and available tools that the LLM can call
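For stacks without a wrapper, the same minimum fields can be recorded with a manual capture call along these lines. The `$ai_generation` event name and `$ai_*` property names here are assumptions based on PostHog's manual capture approach and should be verified against the docs before publishing.

```python
import uuid
from posthog import Posthog

posthog = Posthog("<ph_project_api_key>", host="https://us.i.posthog.com")

def record_generation(distinct_id, model, messages, tools, output, latency_s):
    """Capture one LLM generation with enough context to debug it later."""
    posthog.capture(
        distinct_id=distinct_id,
        event="$ai_generation",
        properties={
            "$ai_trace_id": str(uuid.uuid4()),  # groups related calls into a single trace
            "$ai_provider": "openai",
            "$ai_model": model,
            "$ai_input": messages,              # the user's prompt plus system context
            "$ai_tools": tools,                 # tools the LLM was allowed to call
            "$ai_output_choices": output,       # what the model actually returned
            "$ai_latency": latency_s,
        },
    )
```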
| <details> |
| <summary><strong>👨‍🍳 Chef's tip</strong></summary> |
| Capture more than you think you need. Storage is cheap; re-instrumenting later is not. |
It's more like: backfilling past data that you didn't capture is impossible
re-instrumenting later is not that bad, but your new "complete" data only starts from that moment, can't rewrite history
| ### Method 1: Automated LLM evaluations |
| Automated evaluations use LLM-as-a-judge to score sampled production traffic against your quality criteria. Each evaluation runs automatically on live data and returns a pass/fail result with reasoning, so you know why something failed. |
Currently we only support LLMaaJ, but we're going to support other evaluators soon.
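As an illustration of the LLM-as-a-judge pattern itself (not PostHog's built-in evaluator; the rubric, model, and pass/fail schema here are made up for the example):

```python
import json
from openai import OpenAI

judge = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = (
    "You are a strict quality reviewer. Given a user prompt and a model answer, "
    "return JSON with keys 'pass' (boolean) and 'reason' (one sentence). "
    "Fail anything off-brand, unsupported by the prompt, or likely hallucinated."
)

def judge_generation(prompt: str, answer: str) -> dict:
    """Score one sampled production generation and explain the verdict."""
    result = judge.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Prompt:\n{prompt}\n\nAnswer:\n{answer}"},
        ],
    )
    return json.loads(result.choices[0].message.content)

# judge_generation("What's included in the free plan?", "Everything is unlimited forever!")
# -> {'pass': False, 'reason': 'The answer asserts details the prompt does not support.'}
```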
| <details> |
| <summary><strong>👨‍🍳 Chef's tip</strong></summary> |
| Don't over-index on any single method. Automated evals catch issues at scale, user feedback provides ground truth, and manual review keeps everything honest. The value is in how you combine them. |
Exactly, I was going to mention this.
PostHog AI has a weekly traces hour doing method 3.
| Everything connects automatically. Filter [session replays](/docs/session-replay) by LLM events or eval results. Build [funnels](/docs/product-analytics/funnels) or [retention](/docs/product-analytics/retention) insights broken down by AI quality metrics. Click from a generation directly to its associated replay. |
| Not sure where to start? Ask [PostHog AI](/ai). You can ask questions like "Which users had the most failed evaluations last week?" or "Show me retention for users who got hallucinations vs. those who didn't" – and it'll guide you through building the right insight. |
Not sure if it works yet through PostHog AI. We haven't specifically built it/tried it. It's coming this quarter though.
So it's up to you to decide whether to leave it because it will be added towards the end of Q1, or to reword/remove this.
| 1. Go to [Experiments](https://app.posthog.com/experiments) → **New experiment** |
| 2. Link it to your feature flag (the one controlling your prompt variant) |
| 3. Set your goal metrics — include both eval metrics (pass rate, feedback score) and product metrics (retention, conversion, task completion) |
This is very rudimentary and we want to improve synergy between products. We don't expect anyone to actually be able to do this easily yet. Perhaps next quarter, we'll have better support with Experiments (especially as we launch Prompt Management).
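The feature flag half of step 2 does work today, though. Selecting a prompt variant per user looks roughly like this (a sketch; the flag key and variant names are placeholders):

```python
from posthog import Posthog

posthog = Posthog("<ph_project_api_key>", host="https://us.i.posthog.com")

# A multivariate flag decides which prompt each user gets
variant = posthog.get_feature_flag("prompt-variant-experiment", "user_123")

PROMPTS = {
    "control": "You are a helpful cooking assistant.",
    "test": "You are a concise, slightly cheeky cooking assistant.",
}
system_prompt = PROMPTS.get(variant, PROMPTS["control"])
```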
Radu-Raicea left a comment
Looks great! 🎉
Hey @ivanagas, tagging you for review here, but there's no massive rush on this one; it can wait till next week if you have bigger priorities. Also, I'll implement all feedback at once (including Radu's), so I'll wait for your pass before diving in again :) thx!
Second article in the PostHog recipe series, a step-by-step guide to building an LLM evaluation framework for production AI systems.
Targeting "LLM evaluation framework" (350/mo) as primary keyword, with secondary coverage of:
- LLM evaluation (1.1k)
- LLM evaluation metrics (400)
- LLM monitoring (350)
- LLM monitoring tools (200)
- LLM debugging (150)
- LLM evaluation tools (150)
- LLM evaluation benchmarks (100)
- LLM cost optimization (90)
- LLM evaluation methods (90)
- best LLM evaluation tools (90)
- LLM latency (70)
Supports the Evaluations feature launch and aligns with paid ads push on LLM analytics
Showcases LLM Analytics, Evaluations, Session Replay, Product Analytics, Surveys, Feature Flags, and Experiments as a connected stack
Checklist
Article checklist