{"apex":{"type":"APEX","text":"Ankit Shukla provides an intuitive walkthrough of AI evaluations, highlighting their importance for Product Managers.","id":"n1","children":[{"children":[],"type":"DETL","text":"This guide to AI evals was written by Aakash Gupta, based on an episode with Ankit Shukla, published February 19, 2026.","id":"n2","parentId":"n1","label":"Author"},{"label":"Why AI Evals Matter for PMs","type":"CONC","text":"AI evaluations are a critical new skill for all Product Managers because AI functions differently from traditional deterministic code.","parentId":"n1","id":"n3","children":[{"children":[],"type":"JUST","text":"Unlike traditional code that is deterministic (e.g., 2+2 always equals 4), AI outputs are probabilistic and can vary.","id":"n4","parentId":"n3","label":"AI is Probabilistic"},{"label":"Traditional Testing Insufficient","children":[],"type":"INSG","text":"The probabilistic nature of AI means traditional feature testing methods are insufficient for AI features.","id":"n5","parentId":"n3"},{"parentId":"n3","id":"n6","type":"CONC","text":"A comprehensive AI evaluation system requires three types of evaluations: offline, online, and human.","children":[{"label":"Offline Evals","parentId":"n6","id":"n7","type":"SUBC","text":"Offline evals are used for testing AI features before they are launched.","children":[]},{"type":"SUBC","text":"Online evals are used for monitoring AI feature performance after launch in production.","parentId":"n6","id":"n8","children":[],"label":"Online Evals"},{"children":[],"id":"n9","parentId":"n6","type":"SUBC","text":"Human evals involve spot-checking quality to determine if users actually like the AI feature.","label":"Human Evals"},{"label":"Need All Three Evals","parentId":"n6","id":"n10","type":"INSG","text":"Teams commonly skip online and human evals, but all three types are necessary to prevent feature failure.","children":[]}],"label":"Three-Part Eval System"},{"label":"PMs Should Own 
Evals","id":"n11","parentId":"n3","type":"CONC","text":"AI product managers should own evaluations due to their unique position understanding business, customers, and technology outcomes.","children":[{"label":"PMs Understand Outcomes","children":[],"parentId":"n11","id":"n12","type":"JUST","text":"Product Managers understand what success looks like, customer needs, and business value metrics."},{"children":[],"parentId":"n11","id":"n13","type":"DETL","text":"Product leaders like Todd Olson (CEO of Pendo) and Rachel Wolan (CPO of Webflow) agree that AI evals are the most important new skill for PMs.","label":"Industry Leaders Agree"}]},{"type":"CONC","text":"Skipping AI evaluations can lead to features failing in production, increased support tickets, and wasted development effort.","parentId":"n3","id":"n14","children":[{"parentId":"n14","id":"n15","type":"EXMP","text":"A prototype works great in demo but fails after launch, causing user complaints, hallucinations, and a feature rollback.","children":[],"label":"Failed Feature Scenario"},{"label":"Work Lost","type":"STAT","text":"Six months of work can be lost if a feature is rolled back due to lack of proper evaluations.","parentId":"n14","id":"n16","children":[]}],"label":"Cost of Not Doing Evals"}]},{"type":"CONC","text":"Understanding how Large Language Models (LLMs) fundamentally work is crucial before building effective evaluations.","parentId":"n1","id":"n17","children":[{"label":"LLMs Are Probabilistic","children":[{"parentId":"n18","id":"n19","type":"EXMP","text":"When asked 'What’s the capital of France?', an LLM predicts 'The capital of France is Paris' based on probability, not a lookup.","children":[],"label":"Capital of France Query"},{"label":"Different Outputs Explanation","type":"JUST","text":"The same prompt can yield different results because the model samples from a probability distribution, leading to varied outputs.","parentId":"n18","id":"n20","children":[]}],"type":"SUBC","text":"LLMs are 
statistical models that predict the next token based on probability distributions, not by 'knowing' facts.","parentId":"n17","id":"n18"},{"parentId":"n17","id":"n21","type":"SUBC","text":"Temperature controls the randomness of LLM outputs, ranging from near-deterministic (Temperature = 0) to more random (Temperature = 1).","children":[{"label":"Typical Production Temperature","children":[],"parentId":"n21","id":"n22","type":"DETL","text":"Most products use a temperature setting between 0.3 and 0.7 for LLMs."},{"children":[],"type":"INSG","text":"Outputs under evaluation must be generated at the same temperature setting used in production, or the results will not reflect what users actually see.","id":"n23","parentId":"n21","label":"Temperature Impact on Evals"}],"label":"The Temperature Problem"},{"label":"The Context Window Problem","type":"SUBC","text":"LLMs have limited context windows, and response quality degrades as the window fills, especially for information in the middle.","id":"n24","parentId":"n17","children":[{"label":"GPT-4 Context Window","children":[],"type":"STAT","text":"GPT-4 Turbo has a context window of 128K tokens.","id":"n25","parentId":"n24"},{"children":[],"type":"STAT","text":"Claude has a context window of 200K tokens.","id":"n26","parentId":"n24","label":"Claude Context Window"},{"type":"DETL","text":"Critical information placed in the middle of a long prompt might be missed by the LLM, even within the context window.","id":"n27","parentId":"n24","children":[],"label":"Lost in the Middle Problem"},{"label":"Context Length Impact on Evals","children":[],"type":"INSG","text":"Evaluations must test different context lengths, as prompt performance can vary significantly.","id":"n28","parentId":"n24"}]},{"type":"SUBC","text":"Small changes in prompts can lead to large and unpredictable changes in LLM outputs, making AI products fragile.","id":"n29","parentId":"n17","children":[{"children":[],"parentId":"n29","id":"n30","type":"EXMP","text":"Changing 'Please summarize this document' to 'Summarize 
this document' or adding 'Be concise' yields different results.","label":"Prompt Variation Example"},{"children":[],"type":"INSG","text":"Evaluations must test multiple prompt variations that users might actually type, not just one canonical prompt.","parentId":"n29","id":"n31","label":"Prompt Variation Impact on Evals"}],"label":"The Prompt Sensitivity Problem"},{"label":"The Hallucination Problem","id":"n32","parentId":"n17","type":"SUBC","text":"LLMs can hallucinate, making up facts, citing non-existent sources, and inventing details because they predict plausible, not accurate, text.","children":[{"label":"Verification Required","children":[],"type":"INSG","text":"It is essential to verify that LLM output is factually correct, not just that it 'looks good'.","id":"n33","parentId":"n32"}]}],"label":"Fundamental Nature of LLMs"},{"type":"CONC","text":"Building an evaluation rubric is essential to define 'good' quality and measure the performance of AI features.","parentId":"n1","id":"n34","children":[{"label":"Start with User Scenarios","children":[{"children":[],"parentId":"n35","id":"n36","type":"EXMP","text":"For a customer support chatbot, scenarios include questions about return policy, shipping times, product specs, account issues, and unclear queries.","label":"Chatbot Scenarios"},{"label":"Code Generation Tool Scenarios","children":[],"type":"EXMP","text":"For a code generation tool, scenarios include requests for simple functions, complex algorithms, refactoring, bug fixes, and tests.","parentId":"n35","id":"n37"},{"label":"User Scenarios as Test Cases","type":"INSG","text":"The identified user scenarios become the test cases for evaluating the AI feature.","id":"n38","parentId":"n35","children":[]}],"id":"n35","parentId":"n34","type":"DETL","text":"Begin rubric development by identifying the top 10 scenarios users will encounter with the AI feature."},{"children":[{"label":"Good vs Bad Criteria","children":[],"type":"EXMP","text":"Bad criteria: 'The 
response is helpful'; Good criteria: 'The response contains the correct return window (30 days) and includes the return portal link'.","id":"n40","parentId":"n39"},{"label":"Code Success Criteria","children":[],"type":"EXMP","text":"Good code success criteria: 'The code passes all test cases, follows project style guide, and includes error handling'.","id":"n41","parentId":"n39"}],"type":"DETL","text":"For each scenario, define specific, measurable, and unambiguous criteria for what constitutes a successful AI response.","id":"n39","parentId":"n34","label":"Define Success Criteria"},{"type":"DETL","text":"A good rubric typically has 4-6 categories, each with a 1-5 scoring scale.","parentId":"n34","id":"n42","children":[{"parentId":"n42","id":"n43","type":"DETL","text":"Key categories include Correctness, Completeness, Clarity, Tone, Safety, and Efficiency.","children":[],"label":"Rubric Categories"},{"label":"Scoring Scale Definition","children":[],"type":"DETL","text":"The 1-5 scale ranges from 'Completely fails' (1) to 'Fully succeeds' (5), with clear definitions for each score.","parentId":"n42","id":"n44"}],"label":"Build Rubric Categories"},{"label":"Create Reference Examples","parentId":"n34","id":"n45","type":"DETL","text":"For each category and score level, create reference examples to establish ground truth for evaluators.","children":[{"children":[],"id":"n46","parentId":"n45","type":"EXMP","text":"Examples for 'Correctness' in a support chatbot demonstrate scores of 5 (full success), 3 (partial success), and 1 (factual error).","label":"Correctness Example Scores"},{"children":[],"type":"JUST","text":"These examples show human or LLM evaluators what specific quality levels look like.","parentId":"n45","id":"n47","label":"Reference Examples Purpose"}]},{"children":[{"type":"JUST","text":"A bad rubric leads to bad evals, while a good rubric ensures reliable evaluations.","id":"n49","parentId":"n48","children":[],"label":"Reliable Evals Need Good 
Rubric"}],"id":"n48","parentId":"n34","type":"DETL","text":"Have 2-3 people independently grade the same 10 outputs using the rubric and calculate inter-rater reliability to refine it.","label":"Test Your Rubric"}],"label":"Build Evaluation Rubric"},{"label":"Evaluation Metrics Framework","id":"n50","parentId":"n1","type":"CONC","text":"Different AI use cases require specific metrics for effective evaluation, as there is no 'one size fits all' approach.","children":[{"label":"Retrieval Metrics","children":[{"type":"DETL","text":"Key metrics include Precision, Recall, F1 Score (harmonic mean), Mean Reciprocal Rank (MRR), and Normalized Discounted Cumulative Gain (NDCG).","parentId":"n51","id":"n52","children":[],"label":"Key Retrieval Metrics"},{"label":"Most Important Retrieval Metrics","children":[],"id":"n53","parentId":"n51","type":"INSG","text":"For most AI Product Manager use cases, F1 and NDCG are the most important retrieval metrics."}],"type":"SUBC","text":"Metrics for AI systems that retrieve information, such as RAG systems, focus on relevance and surfacing the right documents.","parentId":"n50","id":"n51"},{"label":"Generation Metrics","type":"SUBC","text":"Metrics for AI systems that generate text, focusing on similarity and overlap with reference texts.","parentId":"n50","id":"n54","children":[{"label":"Text Generation Metrics","children":[],"type":"DETL","text":"Metrics include BLEU (translation), ROUGE (summarization), METEOR (synonyms/stemming), and BERTScore (semantic similarity).","id":"n55","parentId":"n54"},{"id":"n56","parentId":"n54","type":"INSG","text":"BERTScore is recommended as the most robust generation metric for most AI products.","children":[],"label":"Most Robust Generation Metric"}]},{"label":"Task-Specific Metrics","children":[{"type":"EXMP","text":"Metrics for code generation include compilation success, test passage, style guide adherence, and cyclomatic complexity.","parentId":"n57","id":"n58","children":[],"label":"Code 
Generation Metrics"},{"label":"Customer Support Metrics","children":[],"id":"n59","parentId":"n57","type":"EXMP","text":"Metrics for customer support include correct information, required links, brand tone matching, and issue resolution."},{"children":[],"id":"n60","parentId":"n57","type":"EXMP","text":"Metrics for summarization include capturing key points, omitting irrelevant details, length, and coherence.","label":"Summarization Metrics"}],"id":"n57","parentId":"n50","type":"SUBC","text":"Some AI tasks require custom metrics tailored to the specific requirements and desired outcomes of that task."},{"children":[{"label":"LLM Judge Method","id":"n62","parentId":"n61","type":"DETL","text":"Ask an LLM (e.g., GPT-4 or Claude) to score a response on a 1-5 scale for helpfulness or other criteria.","children":[]},{"label":"Good Rubric is Key for LLM Judge","type":"JUST","text":"The effectiveness of an LLM judge relies on providing it with a well-defined rubric and reference examples.","parentId":"n61","id":"n63","children":[]}],"type":"SUBC","text":"Using an LLM to grade outputs based on a rubric can correlate better with human judgment than traditional metrics.","id":"n61","parentId":"n50","label":"LLM-as-Judge Metrics"},{"children":[{"children":[],"id":"n65","parentId":"n64","type":"DETL","text":"Use Precision/Recall/F1 for document retrieval, BERTScore for generating text similar to references, task-specific metrics for specific tasks, and LLM-as-judge for holistic quality.","label":"Metrics Decision Tree"}],"type":"DCSN","text":"The choice of metrics depends on the AI's primary function, often requiring a combination of multiple metrics.","parentId":"n50","id":"n64","label":"Choosing the Right Metrics"}]},{"label":"Build LLM Judges Step-by-Step","parentId":"n1","id":"n66","type":"CONC","text":"Implementing LLM judges involves a step-by-step process from defining evaluation prompts to continuous evaluation.","children":[{"label":"Step 1: Define Evaluation 
Prompt","id":"n67","parentId":"n66","type":"DETL","text":"The evaluation prompt for an LLM judge needs four components: the rubric, reference examples, input query, and output to evaluate.","children":[{"children":[],"parentId":"n67","id":"n68","type":"EXMP","text":"A template includes criteria like Correctness, Completeness, Clarity, and Tone with 1-5 scoring, reference examples, user query, and assistant response.","label":"Evaluation Prompt Template"}]},{"label":"Step 2: Test on Known Examples","children":[{"label":"Common Prompt Fixes","type":"DETL","text":"Common fixes include adding more reference examples, making criteria more specific, adding chain-of-thought reasoning, or using a better model (e.g., GPT-4 vs GPT-3.5).","parentId":"n69","id":"n70","children":[]}],"parentId":"n66","id":"n69","type":"DETL","text":"Manually test the LLM judge with 10 outputs that have known correct scores, comparing judge scores to ground truth and refining the prompt as needed."},{"children":[{"label":"Claude Code Prompt","type":"EXMP","text":"A prompt for Claude Code can request an eval pipeline taking a CSV, running through Claude as a judge, parsing scores, and outputting a summary CSV with averages and flagged low scores.","id":"n72","parentId":"n71","children":[]},{"label":"PM Role in Implementation","children":[],"type":"INSG","text":"The PM's role is defining the rubric, curating test cases, and interpreting results, while implementation is largely automated.","id":"n73","parentId":"n71"}],"type":"DETL","text":"Automate the eval pipeline using tools like Claude Code by providing the evaluation prompt and test dataset to generate the necessary script.","parentId":"n66","id":"n71","label":"Step 3: Implement with Claude Code"},{"parentId":"n66","id":"n74","type":"DETL","text":"Generate a summary dashboard from eval results, focusing on mean scores, distributions, and worst/best performing examples to identify patterns.","children":[{"label":"Claude Code Dashboard 
Request","children":[],"id":"n75","parentId":"n74","type":"EXMP","text":"Tell Claude Code to generate a dashboard showing mean score per criterion, score distribution, and top 10 worst/best performers."},{"parentId":"n74","id":"n76","type":"INSG","text":"Product sense is crucial to interpret numbers, determine why failures occurred, and decide on corrective actions.","children":[],"label":"Product Sense for Gaps"}],"label":"Step 4: Read Results and Find Gaps"},{"id":"n77","parentId":"n66","type":"DCSN","text":"Define minimum acceptable scores for each criterion and add them as pass/fail gates in the evaluation pipeline.","children":[{"label":"Example Thresholds","children":[],"parentId":"n77","id":"n78","type":"EXMP","text":"Example thresholds include Correctness and Completeness ≥4.0 average, and Clarity and Tone ≥3.5 average."},{"label":"Claude Code Threshold Integration","children":[],"type":"EXMP","text":"Instruct Claude Code to add pass/fail checks to the eval pipeline, flagging runs if any criterion average drops below defined thresholds.","id":"n79","parentId":"n77"}],"label":"Step 5: Set Quality Thresholds"},{"label":"Step 6: Run Evals Continuously","id":"n80","parentId":"n66","type":"DETL","text":"Implement continuous evaluation before every release, daily in production, after prompt changes, and after model updates to catch regressions.","children":[{"label":"Claude Code Continuous Eval Setup","type":"EXMP","text":"Tell Claude Code to run the eval pipeline nightly against a 1% random sample of production traffic and send Slack notifications for threshold drops.","parentId":"n80","id":"n81","children":[]}]},{"label":"Common LLM Judge Pitfalls","children":[{"children":[{"parentId":"n83","id":"n84","type":"DCSN","text":"Use a stronger model as the judge (e.g., GPT-4 to judge GPT-3.5 outputs).","children":[],"label":"Solution: Stronger Judge Model"}],"id":"n83","parentId":"n82","type":"DETL","text":"Using the same model as both the judge and the product can 
lead to biased evaluations.","label":"Pitfall 1: Same Model for Judge/Product"},{"label":"Pitfall 2: Not Calibrating Judge","children":[{"id":"n86","parentId":"n85","type":"DCSN","text":"Regularly compare judge scores to human scores and adjust the judge prompt when they diverge.","children":[],"label":"Solution: Calibrate Regularly"}],"parentId":"n82","id":"n85","type":"DETL","text":"Failing to calibrate the LLM judge regularly can result in unreliable scores."},{"children":[{"children":[],"type":"DCSN","text":"Break complex rubrics into multiple judge calls to evaluate fewer dimensions per call.","id":"n88","parentId":"n87","label":"Solution: Break Down Rubrics"}],"parentId":"n82","id":"n87","type":"DETL","text":"Judging too many dimensions at once can overwhelm the LLM judge and reduce accuracy.","label":"Pitfall 3: Too Many Dimensions"},{"type":"DETL","text":"Using a temperature greater than 0 for LLM judges introduces randomness, making scores inconsistent.","parentId":"n82","id":"n89","children":[{"id":"n90","parentId":"n89","type":"DCSN","text":"Always use temperature=0 for LLM judge calls so that scores are as repeatable and consistent as possible.","children":[],"label":"Solution: Use Temperature=0"}],"label":"Pitfall 4: Non-Zero Temperature"}],"parentId":"n66","id":"n82","type":"CONC","text":"Several common pitfalls can undermine the effectiveness of LLM judges if not addressed."}]},{"label":"Production Monitoring That Works","children":[{"type":"SUBC","text":"Production monitoring involves three layers: System Metrics, Quality Metrics, and Business Metrics.","parentId":"n91","id":"n92","children":[{"children":[{"children":[],"parentId":"n93","id":"n94","type":"DETL","text":"System metrics include Latency (p50, p95, p99), Error rate, Token usage, API costs, and Timeout rate.","label":"Examples of System Metrics"}],"type":"DETL","text":"These are basic health metrics indicating infrastructure issues.","id":"n93","parentId":"n92","label":"Layer 1: System 
Metrics"},{"label":"Layer 2: Quality Metrics","children":[{"label":"Examples of Quality Metrics","id":"n96","parentId":"n95","type":"DETL","text":"Quality metrics include average LLM judge scores, human feedback scores (thumbs up/down), task success rate, and hallucination rate.","children":[]}],"type":"DETL","text":"These measure the performance and quality of the AI's output.","id":"n95","parentId":"n92"},{"label":"Layer 3: Business Metrics","children":[{"children":[],"parentId":"n97","id":"n98","type":"DETL","text":"Business metrics include feature adoption rate, user retention, customer satisfaction (CSAT/NPS), support ticket deflection, and revenue impact.","label":"Examples of Business Metrics"}],"type":"DETL","text":"These measure whether the AI is delivering value and achieving business objectives.","parentId":"n92","id":"n97"},{"label":"All Layers Are Critical","children":[],"type":"INSG","text":"All three layers are necessary; ignoring any layer leaves gaps in understanding AI performance and value.","parentId":"n92","id":"n99"}],"label":"Three Layers of Monitoring"},{"label":"Setting Up Automatic Alerts","type":"DETL","text":"Establish alerts for critical performance deviations to enable immediate investigation.","parentId":"n91","id":"n100","children":[{"type":"EXMP","text":"Alerts should be set for quality scores dropping below threshold, error rates spiking above 1%, p95 latency exceeding 3 seconds, or hallucination rate exceeding 5%.","id":"n101","parentId":"n100","children":[],"label":"Alert Criteria Examples"}]},{"label":"The Human Review Queue","type":"DETL","text":"Sample 1% of production traffic each day and have humans review 10-20 real user interactions against the rubric.","parentId":"n91","id":"n102","children":[{"children":[],"type":"JUST","text":"Human review catches issues LLM judges miss and keeps the team connected to real user experiences.","parentId":"n102","id":"n103","label":"Human Review Catches Misses"},{"label":"Recalibrate Judge if 
Divergent","children":[],"type":"DCSN","text":"If human scores diverge from LLM judge scores, recalibrate the LLM judge.","id":"n104","parentId":"n102"}]},{"label":"The Feedback Loop","id":"n105","parentId":"n91","type":"DETL","text":"Production monitoring findings should feed back into the evaluation dataset to continuously improve evals.","children":[{"children":[],"type":"DETL","text":"Add bad production outputs to the test dataset, label them with correct scores, and rerun evals to ensure the system catches them.","id":"n106","parentId":"n105","label":"Feedback Loop Steps"},{"children":[],"type":"INSG","text":"This feedback loop creates a virtuous cycle, continuously improving the evaluation system over time.","id":"n107","parentId":"n105","label":"Virtuous Cycle of Evals"}]},{"label":"When to Rollback","id":"n108","parentId":"n91","type":"DCSN","text":"Define clear rollback criteria beforehand to quickly reverse a feature if it fails in production.","children":[{"label":"Rollback Criteria Examples","children":[],"parentId":"n108","id":"n109","type":"EXMP","text":"Criteria include quality score drop >10% from baseline, error rate exceeding 5%, more than 3 critical bugs in 24 hours, or negative business impact."}]}],"type":"CONC","text":"Effective AI evaluation extends beyond launch with robust production monitoring to ensure ongoing performance and value.","id":"n91","parentId":"n1"},{"parentId":"n1","id":"n110","type":"CONC","text":"AI evals combine product sense, technical understanding, and statistical thinking, forming a new essential PM skill.","children":[{"label":"Build Eval Rubric","id":"n111","parentId":"n110","type":"DCSN","text":"Build a comprehensive evaluation rubric specifically for your product.","children":[]},{"type":"DCSN","text":"Implement appropriate metrics, including retrieval, generation, task-specific, and LLM-as-judge metrics.","id":"n112","parentId":"n110","children":[],"label":"Implement Right 
Metrics"},{"id":"n113","parentId":"n110","type":"DCSN","text":"Create an LLM judge to automate the evaluation process.","children":[],"label":"Create LLM Judge"},{"label":"Set Quality Thresholds","type":"DCSN","text":"Establish clear quality thresholds before launching any AI feature.","parentId":"n110","id":"n114","children":[]},{"type":"DCSN","text":"Monitor AI quality continuously once it is in production.","parentId":"n110","id":"n115","children":[],"label":"Monitor Quality Continuously"},{"label":"Feed Back Production Learnings","children":[],"type":"DCSN","text":"Integrate learnings from production back into your evaluation dataset for ongoing improvement.","id":"n116","parentId":"n110"},{"label":"Do Not Ship Without Evals","children":[],"id":"n117","parentId":"n110","type":"INSG","text":"Never ship an AI product without a robust evaluation system in place."}],"label":"Final Words: Key Actions"}],"label":"AI Evals Explained Simply"},"sourceType":"text","slug":"ai-evals-explained-simply-56403b","contentType":"Explainer","sourceUrl":null,"sharedAt":{"_seconds":1780211299,"_nanoseconds":25000000},"title":"AI Evals Explained Simply"}