This guide to AI evals was written by Aakash Gupta, based on an episode with Ankit Shukla, published February 19, 2026.
AI evaluations are a critical new skill for all Product Managers because AI functions differently from traditional deterministic code.
Unlike traditional code that is deterministic (e.g., 2+2 always equals 4), AI outputs are probabilistic and can vary.
The probabilistic nature of AI means traditional feature testing methods are insufficient for AI features.
A comprehensive AI evaluation system requires three types of evaluations: offline, online, and human.
Offline evals are used for testing AI features before they are launched.
Online evals are used for monitoring AI feature performance after launch in production.
Human evals involve spot-checking quality to determine if users actually like the AI feature.
Teams commonly skip online and human evals, but all three types are necessary to prevent feature failure.
AI product managers should own evaluations due to their unique position understanding business, customers, and technology outcomes.
Product Managers understand what success looks like, customer needs, and business value metrics.
Product leaders like Todd Olson (CEO of Pendo) and Rachel Wolan (CPO of Webflow) agree that AI evals are the most important new skill for PMs.
Skipping AI evaluations can lead to features failing in production, increased support tickets, and wasted development effort.
A common failure pattern: a prototype works well in a demo but fails after launch, leading to user complaints, hallucinations, and a feature rollback.
Six months of work can be lost if a feature is rolled back due to lack of proper evaluations.
Understanding how Large Language Models (LLMs) fundamentally work is crucial before building effective evaluations.
LLMs are statistical models that predict the next token based on probability distributions, not by 'knowing' facts.
When asked 'What’s the capital of France?', an LLM predicts 'The capital of France is Paris' based on probability, not a lookup.
The same prompt can yield different results because the model samples from a probability distribution, leading to varied outputs.
Temperature controls the randomness of LLM outputs, ranging from near-deterministic (temperature = 0) to more random (temperature = 1, with some APIs allowing higher values).
Most products use a temperature setting between 0.3 and 0.7 for LLMs.
Evaluations must be conducted at the same temperature setting used in production, or they become meaningless.
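To make the effect of temperature concrete, here is a minimal sketch of temperature-scaled sampling over a toy next-token distribution. The token list and the `logits` values are invented for illustration, and treating temperature 0 as greedy decoding is an assumption about how a given API behaves, not a guarantee.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        # Assumption: temperature 0 is treated as greedy decoding (all mass on the top token).
        top = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(tokens, logits, temperature, rng):
    """Draw one token according to the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical candidates and scores for "The capital of France is ..."
tokens = ["Paris", "Lyon", "Marseille"]
logits = [4.0, 2.0, 1.0]

rng = random.Random(0)
greedy = [sample_token(tokens, logits, 0.0, rng) for _ in range(5)]
low_temp = softmax_with_temperature(logits, 0.3)
high_temp = softmax_with_temperature(logits, 1.0)
```

In this toy setup the top token gets more probability mass at 0.3 than at 1.0, which is why an eval run at one temperature says little about behavior at another.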
LLMs have limited context windows, and response quality degrades as the window fills, especially for information in the middle.
GPT-4 Turbo has a context window of 128K tokens.
Claude models such as Claude 3 have a context window of 200K tokens.
Critical information placed in the middle of a long prompt might be missed by the LLM, even within the context window.
Evaluations must test different context lengths, as prompt performance can vary significantly.
Small changes in prompts can lead to large and unpredictable changes in LLM outputs, making AI products fragile.
Changing 'Please summarize this document' to 'Summarize this document' or adding 'Be concise' yields different results.
Evaluations must test multiple prompt variations that users might actually type, not just one canonical prompt.
LLMs can hallucinate, making up facts, citing non-existent sources, and inventing details because they predict plausible, not accurate, text.
It is essential to verify that LLM output is factually correct, not just that it 'looks good'.
Building an evaluation rubric is essential to define 'good' quality and measure the performance of AI features.
Begin rubric development by identifying the top 10 scenarios users will encounter with the AI feature.
For a customer support chatbot, scenarios include questions about return policy, shipping times, product specs, account issues, and unclear queries.
For a code generation tool, scenarios include requests for simple functions, complex algorithms, refactoring, bug fixes, and tests.
The identified user scenarios become the test cases for evaluating the AI feature.
For each scenario, define specific, measurable, and unambiguous criteria for what constitutes a successful AI response.
Bad criteria: 'The response is helpful'; Good criteria: 'The response contains the correct return window (30 days) and includes the return portal link'.
Good code success criteria: 'The code passes all test cases, follows project style guide, and includes error handling'.
A good rubric typically has 4-6 categories, each with a 1-5 scoring scale.
Key categories include Correctness, Completeness, Clarity, Tone, Safety, and Efficiency.
The 1-5 scale ranges from 'Completely fails' (1) to 'Fully succeeds' (5), with clear definitions for each score.
For each category and score level, create reference examples to establish ground truth for evaluators.
Examples for 'Correctness' in a support chatbot demonstrate scores of 5 (full success), 3 (partial success), and 1 (factual error).
These examples show human or LLM evaluators what specific quality levels look like.
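One way to keep a rubric consistent across human and LLM graders is to store it as data. The sketch below is an assumption about how that might look: the `Criterion` and `Rubric` classes are hypothetical, and the example texts are invented around the return-policy criterion mentioned earlier.

```python
from dataclasses import dataclass, field

@dataclass
class Criterion:
    name: str
    definition: str
    # Maps a score (1-5) to a reference output that deserves that score.
    reference_examples: dict = field(default_factory=dict)

@dataclass
class Rubric:
    criteria: list

    def validate(self, scores):
        """Check that every criterion has an integer score from 1 to 5."""
        for criterion in self.criteria:
            value = scores.get(criterion.name)
            if not isinstance(value, int) or not 1 <= value <= 5:
                raise ValueError(f"{criterion.name}: expected 1-5, got {value!r}")
        return scores

# Hypothetical 'Correctness' criterion for a support chatbot.
correctness = Criterion(
    name="Correctness",
    definition="Contains the correct return window (30 days) and the return portal link.",
    reference_examples={
        5: "You can return items within 30 days. Start here: <return portal link>",
        3: "You can return items within 30 days.",  # right window, missing link
        1: "You can return items within 90 days.",  # factual error
    },
)
rubric = Rubric(criteria=[correctness])
```

Calling `rubric.validate({"Correctness": 5})` returns the scores unchanged, while a missing or out-of-range score raises `ValueError`, so malformed grades are caught before they reach any averages.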
Have 2-3 people independently grade the same 10 outputs using the rubric and calculate inter-rater reliability to refine it.
A bad rubric leads to bad evals, while a good rubric ensures reliable evaluations.
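The guide does not say which inter-rater reliability statistic to use. Cohen's kappa is one common choice for two raters, sketched here under that assumption with invented scores. With three raters, one option is to average the pairwise values.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for agreement expected by chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must grade the same non-empty set of outputs")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same score independently.
    expected = sum((counts_a[s] / n) * (counts_b[s] / n) for s in counts_a)
    if expected == 1.0:
        return 1.0  # both raters used a single identical score throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical grades from two people on the same 10 outputs.
grades_a = [5, 4, 3, 5, 2, 4, 4, 3, 5, 1]
grades_b = [5, 4, 4, 5, 2, 3, 4, 3, 5, 1]
kappa = cohen_kappa(grades_a, grades_b)
```

A kappa near 1 suggests the rubric is read the same way by both graders. A low value usually points to ambiguous criteria or missing reference examples.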
Different AI use cases require specific metrics for effective evaluation, as there is no 'one size fits all' approach.
Metrics for AI systems that retrieve information, such as RAG systems, focus on relevance and surfacing the right documents.
Key metrics include Precision, Recall, F1 Score (harmonic mean), Mean Reciprocal Rank (MRR), and Normalized Discounted Cumulative Gain (NDCG).
For most AI Product Manager use cases, F1 and NDCG are the most important retrieval metrics.
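The retrieval metrics named above can be computed in a few lines. This is a minimal sketch using the standard textbook definitions, and the document IDs and relevance grades are invented for illustration.

```python
import math

def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and their harmonic mean (F1)."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average of 1/rank of the first relevant document, over all queries."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for position, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / position
                break
    return total / len(ranked_lists)

def ndcg(relevances, k=None):
    """NDCG over graded relevance scores listed in the order the system ranked them."""
    k = len(relevances) if k is None else k

    def dcg(scores):
        return sum(rel / math.log2(i + 1) for i, rel in enumerate(scores[:k], start=1))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical query: four documents retrieved, three are actually relevant.
p, r, f1 = precision_recall_f1(["d1", "d2", "d3", "d4"], {"d1", "d3", "d5"})
mrr = mean_reciprocal_rank([["d2", "d1"], ["d7", "d8"]], [{"d1"}, {"d7"}])
```

F1 summarizes whether the right documents were found at all, while NDCG also rewards putting the most relevant ones first.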
Metrics for AI systems that generate text focus on similarity and overlap with reference texts.
Metrics include BLEU (translation), ROUGE (summarization), METEOR (synonyms/stemming), and BERTScore (semantic similarity).
BERTScore is recommended as the most robust generation metric for most AI products.
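To show what overlap-based metrics measure, here is a simplified unigram-overlap score in the spirit of ROUGE-1. It is a sketch with whitespace tokenization and invented sentences, not the official ROUGE implementation.

```python
from collections import Counter

def unigram_overlap(candidate, reference):
    """Simplified ROUGE-1-style precision, recall, and F1 on lowercase whitespace tokens."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(cand_tokens) if cand_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

# Hypothetical model output and reference answer with the same meaning.
candidate = "returns accepted within 30 days"
reference = "items can be returned within 30 days"
precision, recall, f1 = unigram_overlap(candidate, reference)
```

The two sentences say the same thing, yet "returns" and "returned" do not match, so the score is only moderate. That gap between wording and meaning is what embedding-based metrics such as BERTScore are designed to narrow.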
Some AI tasks require custom metrics tailored to the specific requirements and desired outcomes of that task.
Metrics for code generation include compilation success, test passage, style guide adherence, and cyclomatic complexity.
Metrics for customer support include correct information, required links, brand tone matching, and issue resolution.
Metrics for summarization include capturing key points, omitting irrelevant details, length, and coherence.
Using an LLM to grade outputs based on a rubric can correlate better with human judgment than traditional metrics.
Ask an LLM (e.g., GPT-4 or Claude) to score a response on a 1-5 scale for helpfulness or other criteria.
The effectiveness of an LLM judge relies on providing it with a well-defined rubric and reference examples.
The choice of metrics depends on the AI's primary function, often requiring a combination of multiple metrics.
Use Precision/Recall/F1 for document retrieval, BERTScore for generating text similar to references, task-specific metrics for specific tasks, and LLM-as-judge for holistic quality.
Implementing LLM judges involves a step-by-step process from defining evaluation prompts to continuous evaluation.
The evaluation prompt for an LLM judge needs four components: the rubric, reference examples, input query, and output to evaluate.
A template includes criteria like Correctness, Completeness, Clarity, and Tone with 1-5 scoring, reference examples, user query, and assistant response.
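A judge prompt with these four components might be assembled as follows. The wording of the template, the function name `build_judge_prompt`, and the JSON reply format are all assumptions for illustration. The source does not give the exact text.

```python
def build_judge_prompt(rubric_text, reference_examples, user_query, assistant_response):
    """Combine rubric, reference examples, input query, and output into one judge prompt."""
    examples = "\n".join(
        f"- Score {score}: {text}"
        for score, text in sorted(reference_examples.items(), reverse=True)
    )
    return (
        "You are grading an assistant response against a rubric.\n\n"
        f"RUBRIC (score each criterion from 1 to 5):\n{rubric_text}\n\n"
        f"REFERENCE EXAMPLES:\n{examples}\n\n"
        f"USER QUERY:\n{user_query}\n\n"
        f"ASSISTANT RESPONSE:\n{assistant_response}\n\n"
        # Assumed reply format, chosen so scores are easy to parse.
        'Reply with JSON only, for example: '
        '{"correctness": 4, "completeness": 5, "clarity": 4, "tone": 5}'
    )

# Hypothetical inputs for a support chatbot.
prompt = build_judge_prompt(
    rubric_text="Correctness, Completeness, Clarity, Tone",
    reference_examples={
        5: "States the 30-day return window and links the return portal.",
        1: "States the wrong return window.",
    },
    user_query="How long do I have to return an item?",
    assistant_response="You have 30 days. Use the return portal to start.",
)
```

Asking for a fixed JSON shape is a design choice that keeps score parsing simple and makes malformed judge replies easy to detect.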
Manually test the LLM judge with 10 outputs that have known correct scores, comparing judge scores to ground truth and refining the prompt as needed.
Common fixes include adding more reference examples, making criteria more specific, adding chain-of-thought reasoning, or using a better model (e.g., GPT-4 vs GPT-3.5).
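The comparison between judge scores and known-correct scores can be reduced to a few numbers. This sketch uses exact agreement, within-one-point agreement, and mean absolute error, which are one reasonable set of choices rather than anything the source prescribes. The scores are invented.

```python
def calibration_report(judge_scores, human_scores):
    """Compare LLM judge scores with ground-truth human scores on the same outputs."""
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("need the same non-empty set of outputs for both")
    diffs = [abs(j - h) for j, h in zip(judge_scores, human_scores)]
    n = len(diffs)
    return {
        "exact_agreement": sum(d == 0 for d in diffs) / n,
        "within_one": sum(d <= 1 for d in diffs) / n,
        "mean_abs_error": sum(diffs) / n,
    }

# Hypothetical scores on 10 outputs with known correct grades.
judge = [5, 4, 3, 2, 5, 1, 4, 3, 2, 5]
human = [5, 4, 4, 2, 5, 1, 3, 3, 1, 5]
report = calibration_report(judge, human)
```

If exact agreement is low but within-one agreement is high, the judge probably understands the rubric and needs sharper boundaries between adjacent scores. Large errors suggest the criteria themselves are unclear.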
Automate the eval pipeline using tools like Claude Code by providing the evaluation prompt and test dataset to generate the necessary script.
A prompt for Claude Code can request an eval pipeline taking a CSV, running through Claude as a judge, parsing scores, and outputting a summary CSV with averages and flagged low scores.
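The script such a request might produce could look roughly like the sketch below. The CSV column names (`id`, `query`, `response`), the JSON reply format, and `stub_judge` are assumptions. In a real pipeline the stub would be replaced by a call to the judge model.

```python
import csv
import io
import json
from statistics import mean

CRITERIA = ["correctness", "completeness", "clarity", "tone"]

def parse_scores(raw_reply):
    """Parse the judge's JSON reply and reject anything outside the 1-5 scale."""
    data = json.loads(raw_reply)
    scores = {}
    for name in CRITERIA:
        value = data[name]
        if not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"bad score for {name}: {value!r}")
        scores[name] = value
    return scores

def run_eval(csv_text, judge, flag_below=3):
    """Score every CSV row with the judge, flag low scores, and average per criterion."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        scores = parse_scores(judge(row["query"], row["response"]))
        flagged = any(score < flag_below for score in scores.values())
        rows.append({"id": row["id"], **scores, "flagged": flagged})
    averages = {name: mean(r[name] for r in rows) for name in CRITERIA}
    return rows, averages

def stub_judge(query, response):
    """Placeholder for the real judge call; returns canned JSON so the sketch runs offline."""
    score = 5 if "30 days" in response else 2
    return json.dumps({name: score for name in CRITERIA})

# Hypothetical test dataset.
dataset = (
    "id,query,response\n"
    "t1,What is the return window?,You have 30 days to return items.\n"
    "t2,What is the return window?,Returns are always possible.\n"
)
rows, averages = run_eval(dataset, stub_judge)
```

Passing the judge in as a function keeps the parsing and aggregation logic testable without network access.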
The PM's role is defining the rubric, curating test cases, and interpreting results, while implementation is largely automated.
Generate a summary dashboard from eval results, focusing on mean scores, distributions, and worst/best performing examples to identify patterns.
Tell Claude Code to generate a dashboard showing mean score per criterion, score distribution, and top 10 worst/best performers.
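The numbers behind such a dashboard are simple aggregates. The sketch below computes them from a list of scored results. The field names and sample rows are hypothetical, and ranking examples by their mean across criteria is one possible definition of "worst" and "best".

```python
from collections import Counter
from statistics import mean

def summarize(results, criteria, top_n=10):
    """Mean per criterion, score distribution, and worst/best examples by overall mean."""
    means = {c: round(mean(r[c] for r in results), 2) for c in criteria}
    distribution = {
        c: dict(sorted(Counter(r[c] for r in results).items())) for c in criteria
    }
    ranked = sorted(results, key=lambda r: mean(r[c] for c in criteria))
    return {
        "means": means,
        "distribution": distribution,
        "worst": [r["id"] for r in ranked[:top_n]],
        "best": [r["id"] for r in ranked[::-1][:top_n]],
    }

# Hypothetical eval results.
results = [
    {"id": "a", "correctness": 5, "clarity": 4},
    {"id": "b", "correctness": 2, "clarity": 3},
    {"id": "c", "correctness": 4, "clarity": 4},
]
summary = summarize(results, ["correctness", "clarity"], top_n=1)
```

Reading the worst examples side by side is usually where patterns show up, such as one scenario type that fails repeatedly.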
Product sense is crucial to interpret numbers, determine why failures occurred, and decide on corrective actions.
Define minimum acceptable scores for each criterion and add them as pass/fail gates in the evaluation pipeline.
Example thresholds include Correctness and Completeness ≥4.0 average, and Clarity and Tone ≥3.5 average.
Instruct Claude Code to add pass/fail checks to the eval pipeline, flagging runs if any criterion average drops below defined thresholds.
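A pass/fail gate built on the example thresholds above could be as small as this sketch. The function name and the choice to treat a missing criterion as a failure are assumptions.

```python
# Example thresholds from the text: minimum acceptable average per criterion.
THRESHOLDS = {"correctness": 4.0, "completeness": 4.0, "clarity": 3.5, "tone": 3.5}

def check_gates(averages, thresholds=THRESHOLDS):
    """Return (passed, failures), where failures maps each failing criterion to its average."""
    failures = {}
    for name, minimum in thresholds.items():
        average = averages.get(name)
        # Assumption: a criterion with no score fails the gate rather than passing silently.
        if average is None or average < minimum:
            failures[name] = average
    return len(failures) == 0, failures

# Hypothetical averages from one eval run.
passed, failures = check_gates(
    {"correctness": 4.3, "completeness": 3.8, "clarity": 3.6, "tone": 3.5}
)
```

Here the run fails on completeness alone, which tells the team where to look before the release goes out.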
Implement continuous evaluation before every release, daily in production, after prompt changes, and after model updates to catch regressions.
Tell Claude Code to run the eval pipeline nightly against a 1% random sample of production traffic and send Slack notifications for threshold drops.
Several common pitfalls can undermine the effectiveness of LLM judges if not addressed.
Using the same model as both the judge and the product can lead to biased evaluations.
Use a stronger model as the judge (e.g., GPT-4 to judge GPT-3.5 outputs).
Failing to calibrate the LLM judge regularly can result in unreliable scores.
Regularly compare judge scores to human scores and adjust prompts to ensure accuracy.
Judging too many dimensions at once can overwhelm the LLM judge and reduce accuracy.
Break complex rubrics into multiple judge calls to evaluate fewer dimensions per call.
Using a temperature greater than 0 for LLM judges introduces randomness, making scores inconsistent.
Always use temperature=0 for LLM evaluations so that scores are as deterministic and consistent as the model allows.
Effective AI evaluation extends beyond launch with robust production monitoring to ensure ongoing performance and value.
Production monitoring involves three layers: System Metrics, Quality Metrics, and Business Metrics.
System metrics are basic health metrics that point to infrastructure issues.
They include latency (p50, p95, p99), error rate, token usage, API costs, and timeout rate.
Quality metrics measure the performance and quality of the AI's output.
They include average LLM judge scores, human feedback scores (thumbs up/down), task success rate, and hallucination rate.
Business metrics measure whether the AI is delivering value and achieving business objectives.
They include feature adoption rate, user retention, customer satisfaction (CSAT/NPS), support ticket deflection, and revenue impact.
All three layers are necessary; ignoring any layer leaves gaps in understanding AI performance and value.
Establish alerts for critical performance deviations to enable immediate investigation.
Alerts should be set for quality scores dropping below threshold, error rates spiking above 1%, p95 latency exceeding 3 seconds, or hallucination rate exceeding 5%.
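These alert rules translate directly into checks. The sketch below assumes rates are expressed as fractions, latencies are in seconds, and p95 is computed with the nearest-rank method. The function names and sample data are hypothetical.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def check_alerts(latencies_s, error_rate, hallucination_rate, quality_score, quality_threshold):
    """Return the list of alert conditions that are currently firing."""
    alerts = []
    if quality_score < quality_threshold:
        alerts.append("quality below threshold")
    if error_rate > 0.01:
        alerts.append("error rate above 1%")
    if percentile(latencies_s, 95) > 3.0:
        alerts.append("p95 latency above 3s")
    if hallucination_rate > 0.05:
        alerts.append("hallucination rate above 5%")
    return alerts

# Hypothetical window: 10% of requests take 4 seconds, everything else looks healthy.
latencies = [1.0] * 90 + [4.0] * 10
firing = check_alerts(latencies, error_rate=0.005, hallucination_rate=0.02,
                      quality_score=4.2, quality_threshold=4.0)
```

Averages would hide this problem, since the mean latency here is only 1.3 seconds. That is the reason to alert on a tail percentile rather than the mean.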
Each day, sample about 1% of production traffic and have a person review 10-20 of those real user interactions against the rubric.
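Drawing the daily review set can be done with a plain random sample. This sketch assumes interactions are available as a list of IDs. The function name, the cap of 20, and the fixed seed are illustrative choices.

```python
import random

def sample_for_review(interactions, rate=0.01, max_reviews=20, seed=None):
    """Randomly sample a fraction of traffic, then cap the human review queue."""
    if not interactions:
        return []
    rng = random.Random(seed)
    sample_size = max(1, round(len(interactions) * rate))
    sampled = rng.sample(interactions, sample_size)
    # The sample is already in random order, so taking a prefix stays unbiased.
    return sampled[:max_reviews]

# Hypothetical day of traffic: 5,000 interactions.
traffic = [f"chat-{i}" for i in range(5000)]
review_queue = sample_for_review(traffic, seed=7)
```

Sampling uniformly, rather than picking conversations that look interesting, keeps the human scores comparable with the LLM judge's averages.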
Human review catches issues LLM judges miss and keeps the team connected to real user experiences.
If human scores diverge from LLM judge scores, recalibrate the LLM judge.
Production monitoring findings should feed back into the evaluation dataset to continuously improve evals.
Add bad production outputs to the test dataset, label them with correct scores, and rerun evals to ensure the system catches them.
This feedback loop creates a virtuous cycle, continuously improving the evaluation system over time.
Define clear rollback criteria beforehand to quickly reverse a feature if it fails in production.
Criteria include quality score drop >10% from baseline, error rate exceeding 5%, more than 3 critical bugs in 24 hours, or negative business impact.
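Writing the rollback criteria as code before launch removes debate during an incident. This is a minimal sketch of the criteria listed above. The function name and the way negative business impact is passed in as a boolean are assumptions.

```python
def rollback_reasons(baseline_quality, current_quality, error_rate,
                     critical_bugs_24h, negative_business_impact):
    """Return every rollback criterion that is currently met (empty list means keep going)."""
    reasons = []
    # Relative drop in quality score compared with the pre-launch baseline.
    drop = (baseline_quality - current_quality) / baseline_quality
    if drop > 0.10:
        reasons.append("quality dropped more than 10% from baseline")
    if error_rate > 0.05:
        reasons.append("error rate above 5%")
    if critical_bugs_24h > 3:
        reasons.append("more than 3 critical bugs in 24 hours")
    if negative_business_impact:
        reasons.append("negative business impact")
    return reasons

# Hypothetical readings: quality fell from 4.2 to 3.7, everything else is fine.
reasons = rollback_reasons(4.2, 3.7, error_rate=0.01,
                           critical_bugs_24h=0, negative_business_impact=False)
```

Returning the reasons rather than a bare yes/no makes the rollback decision easy to explain afterwards.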
AI evals combine product sense, technical understanding, and statistical thinking, forming a new essential PM skill.
Build a comprehensive evaluation rubric specifically for your product.
Implement appropriate metrics, including retrieval, generation, task-specific, and LLM-as-judge metrics.
Create an LLM judge to automate the evaluation process.
Establish clear quality thresholds before launching any AI feature.
Monitor AI quality continuously once it is in production.
Integrate learnings from production back into your evaluation dataset for ongoing improvement.
It is crucial never to ship an AI product without a robust evaluation system in place.