Ankit Shukla provides an intuitive walkthrough of AI evaluations, highlighting their importance for Product Managers.

Author

This guide to AI evals was written by Aakash Gupta, based on an episode with Ankit Shukla, published February 19, 2026.

Why AI Evals Matter for PMs

AI evaluations are a critical new skill for all Product Managers because AI functions differently from traditional deterministic code.

AI is Probabilistic

Unlike traditional code that is deterministic (e.g., 2+2 always equals 4), AI outputs are probabilistic and can vary.

Traditional Testing Insufficient

The probabilistic nature of AI means traditional feature testing methods are insufficient for AI features.

Three-Part Eval System

A comprehensive AI evaluation system requires three types of evaluations: offline, online, and human.

PMs Should Own Evals

AI product managers should own evaluations because they are uniquely positioned to understand business goals, customer needs, and technology outcomes.

Cost of Not Doing Evals

Skipping AI evaluations can lead to features failing in production, increased support tickets, and wasted development effort.

Fundamental Nature of LLMs

Understanding how Large Language Models (LLMs) fundamentally work is crucial before building effective evaluations.

LLMs Are Probabilistic

LLMs are statistical models that predict the next token based on probability distributions, not by 'knowing' facts.

The Temperature Problem

Temperature controls the randomness of LLM outputs, ranging from near-deterministic (temperature = 0) to more random (temperature = 1, with some APIs allowing higher values).

The Context Window Problem

LLMs have limited context windows, and response quality degrades as the window fills, especially for information in the middle.

The Prompt Sensitivity Problem

Small changes in prompts can lead to large and unpredictable changes in LLM outputs, making AI products fragile.

The Hallucination Problem

LLMs can hallucinate, making up facts, citing non-existent sources, and inventing details because they predict plausible, not accurate, text.

Build Evaluation Rubric

Building an evaluation rubric is essential to define 'good' quality and measure the performance of AI features.

Start with User Scenarios

Begin rubric development by identifying the top 10 scenarios users will encounter with the AI feature.

Define Success Criteria

For each scenario, define specific, measurable, and unambiguous criteria for what constitutes a successful AI response.

Build Rubric Categories

A good rubric typically has 4-6 categories, each with a 1-5 scoring scale.

Create Reference Examples

For each category and score level, create reference examples to establish ground truth for evaluators.

Test Your Rubric

Have 2-3 people independently grade the same 10 outputs using the rubric and calculate inter-rater reliability to refine it.

Evaluation Metrics Framework

Different AI use cases require specific metrics for effective evaluation, as there is no 'one size fits all' approach.

Retrieval Metrics

Metrics for AI systems that retrieve information, such as RAG systems, focus on relevance and surfacing the right documents.

Generation Metrics

Metrics for AI systems that generate text focus on similarity and overlap with reference texts.

Task-Specific Metrics

Some AI tasks require custom metrics tailored to the specific requirements and desired outcomes of that task.

LLM-as-Judge Metrics

Using an LLM to grade outputs based on a rubric can correlate better with human judgment than traditional metrics.

Choosing the Right Metrics

The choice of metrics depends on the AI's primary function, often requiring a combination of multiple metrics.

Build LLM Judges Step-by-Step

Implementing LLM judges involves a step-by-step process from defining evaluation prompts to continuous evaluation.

Step 1: Define Evaluation Prompt

The evaluation prompt for an LLM judge needs four components: the rubric, reference examples, input query, and output to evaluate.

Step 2: Test on Known Examples

Manually test the LLM judge with 10 outputs that have known correct scores, comparing judge scores to ground truth and refining the prompt as needed.

Step 3: Implement with Claude Code

Automate the eval pipeline using tools like Claude Code by providing the evaluation prompt and test dataset to generate the necessary script.

Step 4: Read Results and Find Gaps

Generate a summary dashboard from eval results, focusing on mean scores, distributions, and worst/best performing examples to identify patterns.

Step 5: Set Quality Thresholds

Define minimum acceptable scores for each criterion and add them as pass/fail gates in the evaluation pipeline.

Step 6: Run Evals Continuously

Implement continuous evaluation before every release, daily in production, after prompt changes, and after model updates to catch regressions.

Common LLM Judge Pitfalls

Several common pitfalls can undermine the effectiveness of LLM judges if not addressed.

Production Monitoring That Works

Effective AI evaluation extends beyond launch with robust production monitoring to ensure ongoing performance and value.

Three Layers of Monitoring

Production monitoring involves three layers: System Metrics, Quality Metrics, and Business Metrics.

Setting Up Automatic Alerts

Establish alerts for critical performance deviations to enable immediate investigation.

The Human Review Queue

Each day, sample about 1% of production traffic and have a human review 10-20 of those real user interactions against the rubric.

The Feedback Loop

Production monitoring findings should feed back into the evaluation dataset to continuously improve evals.

When to Rollback

Define clear rollback criteria beforehand to quickly reverse a feature if it fails in production.

Final Words: Key Actions

AI evals combine product sense, technical understanding, and statistical thinking, forming a new essential PM skill.

APEX

AI Evals Explained Simply

CONC

Why AI Evals Matter for PMs

AI evaluations are a critical new skill for all Product Managers because AI functions differently from traditional deterministic code.

JUST

AI is Probabilistic

Unlike traditional code that is deterministic (e.g., 2+2 always equals 4), AI outputs are probabilistic and can vary.

INSG

Traditional Testing Insufficient

The probabilistic nature of AI means traditional feature testing methods are insufficient for AI features.

CONC

Three-Part Eval System

A comprehensive AI evaluation system requires three types of evaluations: offline, online, and human.

SUBC

Offline Evals

Offline evals are used for testing AI features before they are launched.

SUBC

Online Evals

Online evals are used for monitoring AI feature performance after launch in production.

SUBC

Human Evals

Human evals involve spot-checking quality to determine if users actually like the AI feature.

INSG

Need All Three Evals

Teams commonly skip online and human evals, but all three types are necessary to prevent feature failure.

CONC

PMs Should Own Evals

AI product managers should own evaluations because they are uniquely positioned to understand business goals, customer needs, and technology outcomes.

JUST

PMs Understand Outcomes

Product Managers understand what success looks like, customer needs, and business value metrics.

DETL

Industry Leaders Agree

Product leaders like Todd Olson (CEO of Pendo) and Rachel Wolan (CPO of Webflow) agree that AI evals are the most important new skill for PMs.

CONC

Cost of Not Doing Evals

Skipping AI evaluations can lead to features failing in production, increased support tickets, and wasted development effort.

EXMP

Failed Feature Scenario

A prototype works great in demo but fails after launch, causing user complaints, hallucinations, and a feature rollback.

STAT

Work Lost

Six months of work can be lost if a feature is rolled back due to lack of proper evaluations.

CONC

Fundamental Nature of LLMs

Understanding how Large Language Models (LLMs) fundamentally work is crucial before building effective evaluations.

SUBC

LLMs Are Probabilistic

LLMs are statistical models that predict the next token based on probability distributions, not by 'knowing' facts.

EXMP

Capital of France Query

When asked 'What’s the capital of France?', an LLM predicts 'The capital of France is Paris' based on probability, not a lookup.

JUST

Different Outputs Explanation

The same prompt can yield different results because the model samples from a probability distribution, leading to varied outputs.

SUBC

The Temperature Problem

Temperature controls the randomness of LLM outputs, ranging from near-deterministic (temperature = 0) to more random (temperature = 1, with some APIs allowing higher values).

DETL

Typical Production Temperature

Most products use a temperature setting between 0.3 and 0.7 for LLMs.

INSG

Temperature Impact on Evals

The product under test must be evaluated at the same temperature setting used in production, or the results will not reflect what users actually see.
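
To make the effect of temperature concrete, here is a minimal sketch of temperature-scaled sampling. The token list and scores are made up for illustration, and real LLM APIs implement this internally; the point is only that the same scores yield one fixed answer at temperature 0 and a spread of answers at higher values.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: all probability on the top score.
        top = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(tokens, logits, temperature, rng):
    """Draw one token according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical next-token candidates and scores (not from any real model).
tokens = ["Paris", "Lyon", "London"]
logits = [5.0, 2.0, 1.0]

rng = random.Random(0)
greedy_samples = [sample_token(tokens, logits, 0.0, rng) for _ in range(5)]
probs_low = softmax_with_temperature(logits, 0.5)
probs_high = softmax_with_temperature(logits, 1.0)
```

With these numbers, `greedy_samples` contains only "Paris", while `probs_high` leaves more probability on the alternatives than `probs_low` does, which is why eval runs at one temperature say little about behaviour at another.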

SUBC

The Context Window Problem

LLMs have limited context windows, and response quality degrades as the window fills, especially for information in the middle.

STAT

GPT-4 Context Window

GPT-4 Turbo has a context window of 128K tokens; the original GPT-4 shipped with 8K and 32K variants.

STAT

Claude Context Window

Claude has a context window of 200K tokens.

DETL

Lost in the Middle Problem

Critical information placed in the middle of a long prompt might be missed by the LLM, even within the context window.

INSG

Context Length Impact on Evals

Evaluations must test different context lengths, as prompt performance can vary significantly.

SUBC

The Prompt Sensitivity Problem

Small changes in prompts can lead to large and unpredictable changes in LLM outputs, making AI products fragile.

EXMP

Prompt Variation Example

Changing 'Please summarize this document' to 'Summarize this document' or adding 'Be concise' yields different results.

INSG

Prompt Variation Impact on Evals

Evaluations must test multiple prompt variations that users might actually type, not just one canonical prompt.

SUBC

The Hallucination Problem

LLMs can hallucinate, making up facts, citing non-existent sources, and inventing details because they predict plausible, not accurate, text.

INSG

Verification Required

It is essential to verify that LLM output is factually correct, not just that it 'looks good'.

CONC

Build Evaluation Rubric

Building an evaluation rubric is essential to define 'good' quality and measure the performance of AI features.

DETL

Start with User Scenarios

Begin rubric development by identifying the top 10 scenarios users will encounter with the AI feature.

EXMP

Chatbot Scenarios

For a customer support chatbot, scenarios include questions about return policy, shipping times, product specs, account issues, and unclear queries.

EXMP

Code Generation Tool Scenarios

For a code generation tool, scenarios include requests for simple functions, complex algorithms, refactoring, bug fixes, and tests.

INSG

User Scenarios as Test Cases

The identified user scenarios become the test cases for evaluating the AI feature.

DETL

Define Success Criteria

For each scenario, define specific, measurable, and unambiguous criteria for what constitutes a successful AI response.

EXMP

Good vs Bad Criteria

Bad criteria: 'The response is helpful'; Good criteria: 'The response contains the correct return window (30 days) and includes the return portal link'.

EXMP

Code Success Criteria

Good code success criteria: 'The code passes all test cases, follows project style guide, and includes error handling'.
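
A criterion that is specific and measurable can usually be checked in code. The sketch below assumes the return-policy example above; the function name and the portal URL are hypothetical placeholders, not part of the source.

```python
import re

def check_return_policy_response(response, return_window_days=30,
                                 portal_url="https://example.com/returns"):
    """Check the two concrete criteria: correct return window and portal link present."""
    # Matches "30 days", "30-day", "30 day" but not "130 days".
    window_pattern = rf"\b{return_window_days}[\s-]*days?\b"
    has_window = re.search(window_pattern, response, re.IGNORECASE) is not None
    has_link = portal_url in response  # hypothetical URL, swap in the real one
    return {
        "has_return_window": has_window,
        "has_portal_link": has_link,
        "passed": has_window and has_link,
    }

good = "You can return items within 30 days. Start here: https://example.com/returns"
vague = "We are always happy to help with returns!"
good_result = check_return_policy_response(good)
vague_result = check_return_policy_response(vague)
```

The vague reply might well be rated "helpful" by a casual reader, yet it fails both checks, which is the practical difference between the bad and good criteria.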

DETL

Build Rubric Categories

A good rubric typically has 4-6 categories, each with a 1-5 scoring scale.

DETL

Rubric Categories

Key categories include Correctness, Completeness, Clarity, Tone, Safety, and Efficiency.

DETL

Scoring Scale Definition

The 1-5 scale ranges from 'Completely fails' (1) to 'Fully succeeds' (5), with clear definitions for each score.

DETL

Create Reference Examples

For each category and score level, create reference examples to establish ground truth for evaluators.

EXMP

Correctness Example Scores

Examples for 'Correctness' in a support chatbot demonstrate scores of 5 (full success), 3 (partial success), and 1 (factual error).

JUST

Reference Examples Purpose

These examples show human or LLM evaluators what specific quality levels look like.

DETL

Test Your Rubric

Have 2-3 people independently grade the same 10 outputs using the rubric and calculate inter-rater reliability to refine it.

JUST

Reliable Evals Need Good Rubric

A bad rubric leads to bad evals, while a good rubric ensures reliable evaluations.
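
The source does not say which inter-rater reliability statistic to use. One common choice for two raters is Cohen's kappa, sketched below in plain Python with made-up scores. This unweighted form treats a 4-versus-5 disagreement the same as a 1-versus-5 disagreement; a weighted variant may suit a 1-5 scale better.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of outputs")
    n = len(rater_a)
    # Fraction of outputs where the two raters gave the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's own score frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Illustrative scores from two people grading the same 10 outputs.
rater_1 = [5, 4, 4, 3, 2, 5, 1, 3, 4, 2]
rater_2 = [5, 4, 3, 3, 2, 5, 1, 2, 4, 2]
kappa = cohen_kappa(rater_1, rater_2)
```

Here the raters agree on 8 of 10 outputs and kappa comes out at roughly 0.75. A low value is a signal to tighten the rubric wording or add reference examples before trusting any scores.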

CONC

Evaluation Metrics Framework

Different AI use cases require specific metrics for effective evaluation, as there is no 'one size fits all' approach.

SUBC

Retrieval Metrics

Metrics for AI systems that retrieve information, such as RAG systems, focus on relevance and surfacing the right documents.

DETL

Key Retrieval Metrics

Key metrics include Precision, Recall, F1 Score (the harmonic mean of precision and recall), Mean Reciprocal Rank (MRR), and Normalized Discounted Cumulative Gain (NDCG).

INSG

Most Important Retrieval Metrics

For most AI Product Manager use cases, F1 and NDCG are the most important retrieval metrics.
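
The retrieval metrics named above can be computed in a few lines. This is a sketch using the standard textbook definitions; the document IDs and relevance grades are invented, and MRR is simply the mean of `reciprocal_rank` over many queries.

```python
import math

def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and their harmonic mean (F1)."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, or 0 if none was retrieved."""
    relevant_set = set(relevant)
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant_set:
            return 1.0 / rank
    return 0.0

def ndcg(retrieved, gains):
    """NDCG: discounted gain of this ranking divided by that of the ideal ranking."""
    dcg = sum(gains.get(doc, 0) / math.log2(i + 2) for i, doc in enumerate(retrieved))
    ideal = sorted(gains.values(), reverse=True)[:len(retrieved)]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0

# Hypothetical query: the system returned four documents, three are truly relevant.
retrieved = ["d3", "d1", "d7", "d2"]
gains = {"d1": 3, "d2": 2, "d5": 1}  # graded relevance, higher is better
precision, recall, f1 = precision_recall_f1(retrieved, gains.keys())
rr = reciprocal_rank(retrieved, gains.keys())
score = ndcg(retrieved, gains)
```

In this example precision is 0.5, recall is 2/3, and the reciprocal rank is 0.5 because the first relevant document appears in second position. NDCG additionally penalises the ranking for placing the most relevant document below an irrelevant one.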

SUBC

Generation Metrics

Metrics for AI systems that generate text focus on similarity and overlap with reference texts.

DETL

Text Generation Metrics

Metrics include BLEU (translation), ROUGE (summarization), METEOR (synonyms/stemming), and BERTScore (semantic similarity).

INSG

Most Robust Generation Metric

BERTScore is recommended as the most robust generation metric for most AI products.
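
To see why word-overlap metrics can be brittle, consider a simplified ROUGE-1-style F1 (unigram overlap, whitespace tokenisation, no stemming). This is an illustrative sketch, not the official ROUGE implementation, and the sentences are invented.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Simplified ROUGE-1: F1 over overlapping lowercase whitespace tokens."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

reference = "returns are accepted within 30 days"
candidate = "you can return items within 30 days"
overlap_score = rouge1_f1(candidate, reference)
```

The two sentences say the same thing, yet the score is only about 0.46 because "return" and "returns" count as different words. Embedding-based metrics such as BERTScore are designed to be less sensitive to this kind of surface variation.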

SUBC

Task-Specific Metrics

Some AI tasks require custom metrics tailored to the specific requirements and desired outcomes of that task.

EXMP

Code Generation Metrics

Metrics for code generation include compilation success, test pass rate, style guide adherence, and cyclomatic complexity.

EXMP

Customer Support Metrics

Metrics for customer support include correct information, required links, brand tone matching, and issue resolution.

EXMP

Summarization Metrics

Metrics for summarization include capturing key points, omitting irrelevant details, length, and coherence.

SUBC

LLM-as-Judge Metrics

Using an LLM to grade outputs based on a rubric can correlate better with human judgment than traditional metrics.

DETL

LLM Judge Method

Ask an LLM (e.g., GPT-4 or Claude) to score a response on a 1-5 scale for helpfulness or other criteria.

JUST

Good Rubric is Key for LLM Judge

The effectiveness of an LLM judge relies on providing it with a well-defined rubric and reference examples.

DCSN

Choosing the Right Metrics

The choice of metrics depends on the AI's primary function, often requiring a combination of multiple metrics.

DETL

Metrics Decision Tree

Use Precision/Recall/F1 for document retrieval, BERTScore when generated text should resemble a reference, task-specific metrics when the task has its own checkable requirements (e.g., code that must compile), and LLM-as-judge for holistic quality.

CONC

Build LLM Judges Step-by-Step

Implementing LLM judges involves a step-by-step process from defining evaluation prompts to continuous evaluation.

DETL

Step 1: Define Evaluation Prompt

The evaluation prompt for an LLM judge needs four components: the rubric, reference examples, input query, and output to evaluate.

EXMP

Evaluation Prompt Template

A template includes criteria like Correctness, Completeness, Clarity, and Tone with 1-5 scoring, reference examples, user query, and assistant response.
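
One way to assemble the four components into a single prompt is sketched below. The section labels, wording, and requested reply format are assumptions for illustration; the source does not give the exact template text.

```python
def build_judge_prompt(rubric, reference_examples, user_query, assistant_response):
    """Combine rubric, reference examples, query, and response into one judge prompt."""
    rubric_lines = "\n".join(
        f"- {name} (1-5): {description}" for name, description in rubric.items()
    )
    example_blocks = "\n\n".join(
        f"Query: {ex['query']}\nResponse: {ex['response']}\nScores: {ex['scores']}"
        for ex in reference_examples
    )
    return (
        "You are grading an AI assistant's response.\n\n"
        f"RUBRIC\n{rubric_lines}\n\n"
        f"REFERENCE EXAMPLES\n{example_blocks}\n\n"
        f"USER QUERY\n{user_query}\n\n"
        f"ASSISTANT RESPONSE\n{assistant_response}\n\n"
        "Reply with one line per criterion in the form 'Criterion: score'."
    )

# Hypothetical rubric and reference example for a support chatbot.
rubric = {
    "Correctness": "Facts match company policy",
    "Completeness": "All parts of the question are answered",
    "Clarity": "Easy to follow",
    "Tone": "Matches brand voice",
}
examples = [{
    "query": "What is your return window?",
    "response": "30 days from delivery.",
    "scores": "Correctness: 5, Completeness: 3, Clarity: 5, Tone: 4",
}]
prompt = build_judge_prompt(rubric, examples, "Can I return shoes?", "Yes, within 30 days.")
```

Asking for a fixed reply format makes the scores easy to parse later in the pipeline.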

DETL

Step 2: Test on Known Examples

Manually test the LLM judge with 10 outputs that have known correct scores, comparing judge scores to ground truth and refining the prompt as needed.

DETL

Common Prompt Fixes

Common fixes include adding more reference examples, making criteria more specific, adding chain-of-thought reasoning, or using a better model (e.g., GPT-4 vs GPT-3.5).
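
Comparing judge scores to ground truth can be as simple as the sketch below. The choice of statistics (exact match, within one point, mean absolute error) and the sample scores are assumptions; the source only says to compare and refine.

```python
def judge_agreement(judge_scores, human_scores, tolerance=1):
    """Summarise how closely LLM judge scores track known-correct human scores."""
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("Need the same non-empty set of examples for both")
    n = len(judge_scores)
    diffs = [abs(j - h) for j, h in zip(judge_scores, human_scores)]
    return {
        "exact_match_rate": sum(d == 0 for d in diffs) / n,
        "within_tolerance_rate": sum(d <= tolerance for d in diffs) / n,
        "mean_abs_error": sum(diffs) / n,
    }

# Illustrative scores for 10 outputs with known ground truth.
judge = [5, 4, 3, 2, 5, 1, 4, 3, 2, 5]
human = [5, 4, 4, 2, 3, 1, 4, 3, 3, 5]
report = judge_agreement(judge, human)
```

Here the judge matches exactly on 7 of 10 examples and is within one point on 9 of 10. The one example that is off by two points is the place to look first when refining the prompt.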

DETL

Step 3: Implement with Claude Code

Automate the eval pipeline using tools like Claude Code by providing the evaluation prompt and test dataset to generate the necessary script.

EXMP

Claude Code Prompt

A prompt for Claude Code can request an eval pipeline taking a CSV, running through Claude as a judge, parsing scores, and outputting a summary CSV with averages and flagged low scores.

INSG

PM Role in Implementation

The PM's role is defining the rubric, curating test cases, and interpreting results, while implementation is largely automated.
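
For a sense of what such a generated script might contain, here is a minimal sketch of the CSV-to-summary flow. The `fake_judge` function is a stand-in for a real LLM call, and the column names, reply format, and flagging cutoff are all assumptions.

```python
import csv
import io
import re
from statistics import mean

# Matches reply lines such as "Correctness: 4".
SCORE_LINE = re.compile(r"^\s*(\w+)\s*:\s*([1-5])\s*$", re.MULTILINE)

def parse_scores(judge_reply):
    """Extract {criterion: score} from a judge reply with one 'Name: N' per line."""
    return {name.lower(): int(score) for name, score in SCORE_LINE.findall(judge_reply)}

def run_eval(rows, judge, flag_below=3):
    """Score every row with the judge and flag rows with any low score."""
    results = []
    for row in rows:
        scores = parse_scores(judge(row["query"], row["response"]))
        flagged = any(s < flag_below for s in scores.values())
        results.append({**row, **scores, "flagged": flagged})
    return results

def fake_judge(query, response):
    """Hypothetical stand-in for an LLM API call; replace with a real client."""
    if "14 days" in response:
        return "Correctness: 2\nClarity: 4"
    return "Correctness: 5\nClarity: 4"

# In practice this would be a file on disk; an in-memory CSV keeps the sketch runnable.
dataset = io.StringIO(
    "query,response\n"
    "What is the return window?,30 days\n"
    "What is the return window?,14 days\n"
)
rows = list(csv.DictReader(dataset))
results = run_eval(rows, fake_judge)
averages = {c: mean(r[c] for r in results) for c in ("correctness", "clarity")}
```

The second row is flagged because its correctness score falls below the cutoff, and `averages` gives the per-criterion means a summary CSV would report.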

DETL

Step 4: Read Results and Find Gaps

Generate a summary dashboard from eval results, focusing on mean scores, distributions, and worst/best performing examples to identify patterns.

EXMP

Claude Code Dashboard Request

Tell Claude Code to generate a dashboard showing mean score per criterion, score distribution, and top 10 worst/best performers.

INSG

Product Sense for Gaps

Product sense is crucial to interpret numbers, determine why failures occurred, and decide on corrective actions.

DCSN

Step 5: Set Quality Thresholds

Define minimum acceptable scores for each criterion and add them as pass/fail gates in the evaluation pipeline.

EXMP

Example Thresholds

Example thresholds include Correctness and Completeness ≥4.0 average, and Clarity and Tone ≥3.5 average.

EXMP

Claude Code Threshold Integration

Instruct Claude Code to add pass/fail checks to the eval pipeline, flagging runs if any criterion average drops below defined thresholds.
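
A pass/fail gate over the example thresholds above might look like the following sketch. The result rows are invented, and the lowercase criterion keys are an assumed naming convention.

```python
from statistics import mean

# Minimum acceptable average per criterion, taken from the example thresholds.
THRESHOLDS = {"correctness": 4.0, "completeness": 4.0, "clarity": 3.5, "tone": 3.5}

def check_thresholds(results, thresholds=THRESHOLDS):
    """Average each criterion across results and report any that fall below its gate."""
    averages = {c: mean(r[c] for r in results) for c in thresholds}
    failures = {c: avg for c, avg in averages.items() if avg < thresholds[c]}
    return {"averages": averages, "failures": failures, "passed": not failures}

# Illustrative judge scores for three test cases.
results = [
    {"correctness": 5, "completeness": 4, "clarity": 4, "tone": 4},
    {"correctness": 4, "completeness": 3, "clarity": 4, "tone": 3},
    {"correctness": 4, "completeness": 4, "clarity": 3, "tone": 4},
]
gate = check_thresholds(results)
```

In this run only completeness fails, since its average of about 3.67 is under the 4.0 gate. In a CI setting the script could exit with a non-zero status whenever `gate["passed"]` is false.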

DETL

Step 6: Run Evals Continuously

Implement continuous evaluation before every release, daily in production, after prompt changes, and after model updates to catch regressions.

EXMP

Claude Code Continuous Eval Setup

Tell Claude Code to run the eval pipeline nightly against a 1% random sample of production traffic and send Slack notifications for threshold drops.

CONC

Common LLM Judge Pitfalls

Several common pitfalls can undermine the effectiveness of LLM judges if not addressed.

DETL

Pitfall 1: Same Model for Judge/Product

Using the same model as both the judge and the product can lead to biased evaluations.

DCSN

Solution: Stronger Judge Model

Use a stronger model as the judge (e.g., GPT-4 to judge GPT-3.5 outputs).

DETL

Pitfall 2: Not Calibrating Judge

Failing to calibrate the LLM judge regularly can result in unreliable scores.

DCSN

Solution: Calibrate Regularly

Regularly compare judge scores to human scores and adjust prompts to ensure accuracy.

DETL

Pitfall 3: Too Many Dimensions

Judging too many dimensions at once can overwhelm the LLM judge and reduce accuracy.

DCSN

Solution: Break Down Rubrics

Break complex rubrics into multiple judge calls to evaluate fewer dimensions per call.

DETL

Pitfall 4: Non-Zero Temperature

Using a temperature greater than 0 for LLM judges introduces randomness, making scores inconsistent.

DCSN

Solution: Use Temperature=0

Always use temperature=0 for the LLM judge (as opposed to the product under test, which should run at its production temperature) so that scores are as consistent as possible across runs.

CONC

Production Monitoring That Works

Effective AI evaluation extends beyond launch with robust production monitoring to ensure ongoing performance and value.

SUBC

Three Layers of Monitoring

Production monitoring involves three layers: System Metrics, Quality Metrics, and Business Metrics.

DETL

Layer 1: System Metrics

These are basic health metrics indicating infrastructure issues.

DETL

Examples of System Metrics

System metrics include Latency (p50, p95, p99), Error rate, Token usage, API costs, and Timeout rate.

DETL

Layer 2: Quality Metrics

These measure the performance and quality of the AI's output.

DETL

Examples of Quality Metrics

Quality metrics include average LLM judge scores, human feedback scores (thumbs up/down), task success rate, and hallucination rate.

DETL

Layer 3: Business Metrics

These measure if the AI is delivering value and achieving business objectives.

DETL

Examples of Business Metrics

Business metrics include feature adoption rate, user retention, customer satisfaction (CSAT/NPS), support ticket deflection, and revenue impact.

INSG

All Layers Are Critical

All three layers are necessary; ignoring any layer leaves gaps in understanding AI performance and value.

DETL

Setting Up Automatic Alerts

Establish alerts for critical performance deviations to enable immediate investigation.

EXMP

Alert Criteria Examples

Alerts should be set for quality scores dropping below threshold, error rates spiking above 1%, p95 latency exceeding 3 seconds, or hallucination rate exceeding 5%.
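
The alert criteria above translate directly into checks over a monitoring window. This sketch uses a nearest-rank percentile and invented numbers; the function names, the quality threshold of 4.0, and the input shape are assumptions rather than any monitoring tool's API.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def check_alerts(latencies_s, error_count, request_count,
                 hallucination_count, judged_count,
                 mean_quality, quality_threshold=4.0):
    """Return the list of alert messages triggered in this window."""
    alerts = []
    if mean_quality < quality_threshold:
        alerts.append("quality score below threshold")
    if error_count / request_count > 0.01:
        alerts.append("error rate above 1%")
    if percentile(latencies_s, 95) > 3.0:
        alerts.append("p95 latency above 3s")
    if hallucination_count / judged_count > 0.05:
        alerts.append("hallucination rate above 5%")
    return alerts

# Illustrative window: most requests are fast, two are slow.
latencies = [0.8] * 18 + [3.5, 4.2]
alerts = check_alerts(
    latencies_s=latencies, error_count=5, request_count=1000,
    hallucination_count=2, judged_count=100, mean_quality=4.2,
)
```

Only the latency alert fires here: the average latency looks healthy, but the 95th percentile is 3.5 seconds, which is why the criteria are stated in terms of p95 rather than the mean.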

DETL

The Human Review Queue

Each day, sample about 1% of production traffic and have a human review 10-20 of those real user interactions against the rubric.

JUST

Human Review Catches Misses

Human review catches issues LLM judges miss and keeps the team connected to real user experiences.

DCSN

Recalibrate Judge if Divergent

If human scores diverge from LLM judge scores, recalibrate the LLM judge.
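
Building the daily review queue can be a small random-sampling step. The sketch below assumes one reading of the guidance: aim for roughly 1% of the day's traffic, but keep the queue between 10 and 20 items so it stays reviewable. The function name and ID format are hypothetical.

```python
import random

def build_review_queue(interaction_ids, sample_rate=0.01,
                       min_items=10, max_items=20, seed=None):
    """Pick a random, bounded set of interactions for human review."""
    rng = random.Random(seed)  # pass a seed only when reproducibility is needed
    target = round(len(interaction_ids) * sample_rate)
    # Clamp to the 10-20 range, and never ask for more items than exist.
    size = min(len(interaction_ids), max(min_items, min(max_items, target)))
    return rng.sample(interaction_ids, size)

# Illustrative day with 1,500 logged interactions.
todays_ids = [f"interaction-{i}" for i in range(1500)]
queue = build_review_queue(todays_ids, seed=42)
```

Random sampling matters here: reviewing only flagged or complained-about interactions would hide the cases where the LLM judge was confidently wrong.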

DETL

The Feedback Loop

Production monitoring findings should feed back into the evaluation dataset to continuously improve evals.

DETL

Feedback Loop Steps

Add bad production outputs to the test dataset, label them with correct scores, and rerun evals to ensure the system catches them.

INSG

Virtuous Cycle of Evals

This feedback loop creates a virtuous cycle, continuously improving the evaluation system over time.

DCSN

When to Rollback

Define clear rollback criteria beforehand to quickly reverse a feature if it fails in production.

EXMP

Rollback Criteria Examples

Criteria include quality score drop >10% from baseline, error rate exceeding 5%, more than 3 critical bugs in 24 hours, or negative business impact.
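
Writing the rollback criteria down as code before launch removes ambiguity during an incident. This sketch assumes the 10% quality drop is relative to the baseline score; the function name and inputs are illustrative.

```python
def should_rollback(baseline_quality, current_quality, error_rate,
                    critical_bugs_24h, negative_business_impact=False):
    """Return the list of rollback criteria that are currently met (empty means keep going)."""
    reasons = []
    # Assumption: "drop >10%" is measured relative to the baseline score.
    relative_drop = (baseline_quality - current_quality) / baseline_quality
    if relative_drop > 0.10:
        reasons.append("quality dropped more than 10% from baseline")
    if error_rate > 0.05:
        reasons.append("error rate above 5%")
    if critical_bugs_24h > 3:
        reasons.append("more than 3 critical bugs in 24 hours")
    if negative_business_impact:
        reasons.append("negative business impact")
    return reasons

# Illustrative check: average judge score fell from 4.2 to 3.6.
reasons = should_rollback(baseline_quality=4.2, current_quality=3.6,
                          error_rate=0.02, critical_bugs_24h=1)
```

A drop from 4.2 to 3.6 is about 14% of the baseline, so this example returns one reason and the feature would be rolled back even though error rate and bug count look fine.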

CONC

Final Words: Key Actions

AI evals combine product sense, technical understanding, and statistical thinking, forming a new essential PM skill.

DCSN

Build Eval Rubric

Build a comprehensive evaluation rubric specifically for your product.

DCSN

Implement Right Metrics

Implement appropriate metrics, including retrieval, generation, task-specific, and LLM-as-judge metrics.

DCSN

Create LLM Judge

Create an LLM judge to automate the evaluation process.

DCSN

Set Quality Thresholds

Establish clear quality thresholds before launching any AI feature.

DCSN

Monitor Quality Continuously

Monitor AI quality continuously once it is in production.

DCSN

Feed Back Production Learnings

Integrate learnings from production back into your evaluation dataset for ongoing improvement.

INSG

Do Not Ship Without Evals

It is crucial never to ship an AI product without a robust evaluation system in place.