{"title":"Eval-Driven Development: How to Ship AI Systems to Production Without Guessing","sourceUrl":"https://www.linkedin.com/pulse/eval-driven-development-how-ship-ai-systems-production-namohar-m-zzvfc/","sourceType":"url","contentType":"Explainer","apex":{"id":"n1","type":"APEX","label":"Eval-Driven Development for AI production","text":"Eval-Driven Development (EDD) ensures AI systems behave correctly in production, moving beyond good demos to measurable quality.","children":[{"id":"n2","type":"SECT","label":"Why production AI needs more than a demo","text":"Shipping AI without evaluations risks deploying broken systems that users discover first.","parentId":"n1","children":[{"id":"n3","type":"SUP","label":"Shipping AI without evals is like deploying code without tests","text":"Deploying AI without evaluations is comparable to releasing code without tests, leading to user-discovered breakage.","parentId":"n2","children":[]},{"id":"n4","type":"NTRL","label":"First wave of enterprise AI prototypes","text":"The initial enterprise AI prototypes generated excitement with surprisingly good answers from simple chat windows and clever prompts.","parentId":"n2","children":[]},{"id":"n5","type":"EVID","label":"Production reality arrived","text":"Production reality revealed issues where prompt changes, cheaper models, retrieval updates, or safety guardrails caused problems.","parentId":"n2","children":[{"id":"n6","type":"DATA","label":"Prompt change reduced accuracy","text":"A prompt change improved tone but unfortunately reduced the system's accuracy.","parentId":"n5","children":[]},{"id":"n7","type":"DATA","label":"Cheaper model missed edge cases","text":"A cheaper model reduced cost, but it failed to address critical edge cases.","parentId":"n5","children":[]},{"id":"n8","type":"DATA","label":"Retrieval update surfaced wrong policy","text":"An update to the retrieval system resulted in the display of incorrect policy 
information.","parentId":"n5","children":[]},{"id":"n9","type":"DATA","label":"Agent called tool with wrong parameter","text":"An AI agent called the right tool but passed an incorrect parameter.","parentId":"n5","children":[]},{"id":"n10","type":"DATA","label":"Safety guardrail blocked legitimate users","text":"A safety guardrail inadvertently prevented legitimate users from accessing the system.","parentId":"n5","children":[]},{"id":"n11","type":"DATA","label":"Customer found hallucination first","text":"A customer discovered a hallucination before the engineering team identified it.","parentId":"n5","children":[]}]},{"id":"n12","type":"INSG","label":"Eval-Driven Development closes the gap","text":"Eval-Driven Development addresses the gap between impressive demos and the realities of production AI behavior.","parentId":"n2","children":[]},{"id":"n13","type":"SUP","label":"Better question for production AI","text":"The key question for production AI is whether the team can prove the system behaves correctly across workflows, risks, users, and constraints.","parentId":"n2","children":[{"id":"n14","type":"JUST","label":"Evals essential for production","text":"Evals are essential to verify correct system behavior across various critical aspects for production readiness.","parentId":"n13","children":[]}]}]},{"id":"n15","type":"SECT","label":"What is Eval-Driven Development","text":"Eval-Driven Development (EDD) is the practice of defining, running, analyzing, and improving evaluations throughout the AI application lifecycle.","parentId":"n1","children":[{"id":"n16","type":"SUP","label":"Eval definition","text":"An eval is a structured measurement of an AI system's performance for specific tasks regarding correctness, safety, and reliability.","parentId":"n15","children":[]},{"id":"n17","type":"EVID","label":"Production evals are tailored","text":"Unlike public benchmarks, production evaluations are customized to specific business contexts, including workflows, customers, and 
policies.","parentId":"n15","children":[{"id":"n18","type":"JUST","label":"Public benchmarks are insufficient","text":"Public benchmarks indicate general model strength but cannot confirm specific business policy adherence or data exposure avoidance.","parentId":"n17","children":[]}]},{"id":"n19","type":"INSG","label":"Evals need to be product-specific","text":"Evaluations must be tailored to specific products to ensure relevance and effectiveness in measuring AI system performance.","parentId":"n15","children":[]},{"id":"n20","type":"CMPR","label":"TDD vs EDD for software development","text":"Eval-Driven Development for AI extends Test-Driven Development principles to account for the probabilistic nature of AI systems.","table":{"cols":["Development Approach","Traditional Software (TDD)","AI Software (EDD)"],"rows":[{"label":"Steps","cells":["Write tests, write code, run tests, refactor, repeat","Define quality, build scenarios, run AI system, grade outputs, analyze failures, improve system, gate production releases"]}]},"parentId":"n15","children":[]},{"id":"n21","type":"JUST","label":"Goal of EDD is measurable development","text":"The primary objective of Eval-Driven Development is to make AI development quantifiable, not to achieve deterministic AI.","parentId":"n15","children":[]}]},{"id":"n22","type":"SECT","label":"Traditional testing insufficient for AI","text":"Traditional software tests are often deterministic, which is not suitable for the complex, probabilistic nature of AI systems.","parentId":"n1","children":[{"id":"n23","type":"SUP","label":"Traditional tests are deterministic","text":"Traditional software tests typically have a single expected answer and use binary assertions.","parentId":"n22","children":[{"id":"n24","type":"EXMP","label":"Calculator addition example","text":"An example of a deterministic test is Assert.Equal(4, Calculator.Add(2, 2)), expecting one exact 
outcome.","parentId":"n23","children":[]}]},{"id":"n25","type":"OPP","label":"AI systems are different","text":"AI systems often have multiple acceptable responses, depending on various contextual factors like policy and order history.","parentId":"n22","children":[{"id":"n26","type":"EXMP","label":"Customer refund request example","text":"A customer's refund request for defective headphones after 45 days demonstrates the complexity of AI responses.","parentId":"n25","children":[]},{"id":"n27","type":"EVID","label":"Factors influencing best AI answer","text":"The best AI answer depends on policy, order history, defect status, product category, warranty rules, escalation, tone, and safety.","parentId":"n25","children":[]},{"id":"n28","type":"EVID","label":"Questions for useful AI eval","text":"A useful AI evaluation must ask if the answer correctly explained policy, avoided invention, asked about defects, used right sources, called right tools, met limits, and escalated.","parentId":"n25","children":[]}]},{"id":"n29","type":"INSG","label":"EDD is broader than unit testing","text":"Eval-Driven Development encompasses more than unit testing by combining various methods to assess AI quality comprehensively.","parentId":"n22","children":[{"id":"n30","type":"EVID","label":"EDD methods combined","text":"EDD combines deterministic checks, rubric scoring, LLM-as-judge, human review, trace analysis, statistical reporting, and production monitoring.","parentId":"n29","children":[]}]},{"id":"n31","type":"JUST","label":"TDD vs EDD for system behavior","text":"TDD verifies if code works, while EDD confirms if the AI system behaves correctly, addressing their different natures.","parentId":"n22","children":[]}]},{"id":"n32","type":"SECT","label":"Production eval stack components","text":"A mature eval practice requires a system with specific components, not just a single spreadsheet or script.","parentId":"n1","children":[{"id":"n33","type":"SUP","label":"1. 
Representative test dataset","text":"The eval dataset must contain scenarios the AI system will encounter in production, including diverse cases.","parentId":"n32","children":[{"id":"n34","type":"EVID","label":"Dataset scenario types","text":"Scenarios include happy-path, edge cases, adversarial prompts, historical failures, synthetic cases, multi-turn workflows, tool failures, and high-risk cases.","parentId":"n33","children":[]},{"id":"n35","type":"EXMP","label":"Customer support agent dataset examples","text":"For customer support, this includes refund questions, product defects, angry customers, privacy attacks, policy exceptions, and payment disputes.","parentId":"n33","children":[]},{"id":"n36","type":"EXMP","label":"Claims processing agent dataset examples","text":"For claims processing, examples include clean claims, incomplete documents, fraud indicators, policy exclusions, appeals, and conflicting evidence.","parentId":"n33","children":[]},{"id":"n37","type":"INSG","label":"Dataset needs representativeness","text":"A good eval dataset is effective because it is representative of real-world situations, not just large in size.","parentId":"n33","children":[]}]},{"id":"n38","type":"SUP","label":"2. Clear grading rubric","text":"A rubric defines what constitutes \"good,\" \"acceptable,\" and \"bad\" performance for the AI system.","parentId":"n32","children":[{"id":"n39","type":"EVID","label":"Common scoring dimensions","text":"Common scoring dimensions include correctness, completeness, grounding, safety, privacy, compliance, tone, formatting, tool use, escalation, latency, and cost.","parentId":"n38","children":[]},{"id":"n40","type":"JUST","label":"Rubric enables decision-making","text":"A rubric allows teams to make objective decisions about AI quality, resolving subjective arguments.","parentId":"n38","children":[]}]},{"id":"n41","type":"SUP","label":"3. 
Right grading method","text":"Each requirement should be graded using the most cost-effective and reliable method available.","parentId":"n32","children":[{"id":"n42","type":"CMPR","label":"Grading method suitability","text":"Different grading methods are best suited for specific types of AI evaluations and carry distinct risks.","table":{"cols":["Grading method","Best for","Risk"],"rows":[{"label":"Deterministic checks","cells":["JSON schema, exact labels, required fields, tool calls","Too brittle for open-ended answers"]},{"label":"Human review","cells":["High-risk, regulated, subjective cases","Expensive and slower"]},{"label":"LLM-as-judge","cells":["Semantic quality, tone, groundedness, completeness","Judge bias and inconsistency"]},{"label":"Hybrid evaluation","cells":["Most production AI systems","Requires orchestration and calibration"]}]},"parentId":"n41","children":[]},{"id":"n43","type":"DCSN","label":"Practical grading approach","text":"A practical approach uses deterministic checks where possible, LLM judges for semantic quality, and human review for high-risk or disputed cases.","parentId":"n41","children":[]}]},{"id":"n44","type":"SUP","label":"4. 
Full run reproducibility","text":"Every eval run must record its exact configuration to ensure reproducibility and aid in debugging regressions.","parentId":"n32","children":[{"id":"n45","type":"EVID","label":"Configuration details to record","text":"Configuration includes model provider/version, temperature, prompt/retrieval index/tool schema versions, orchestration logic, safety policies, code commit, dataset/grader version.","parentId":"n44","children":[]},{"id":"n46","type":"JUST","label":"Debugging regressions requires reproducibility","text":"Without the ability to reproduce an eval run, teams cannot confidently debug regressions in AI systems.","parentId":"n44","children":[]},{"id":"n47","type":"INSG","label":"AI equivalent of build lineage","text":"Reproducibility in AI evals is considered the AI equivalent of build lineage in traditional software development.","parentId":"n44","children":[]}]},{"id":"n48","type":"SUP","label":"5. Metrics that matter","text":"Relying on one average score is insufficient; metrics should be tracked by workflow, risk tier, customer segment, and model version.","parentId":"n32","children":[{"id":"n49","type":"EVID","label":"Useful metrics to track","text":"Useful metrics include task success rate, hallucination rate, groundedness, tool-use accuracy, safety violation rate, privacy leakage rate, refusal quality, human escalation rate, p50 and p95 latency, cost per task, and confidence intervals.","parentId":"n48","children":[]},{"id":"n50","type":"INSG","label":"Segmentation exposes risk","text":"Averages conceal risks, while segmentation effectively reveals them by breaking down performance.","parentId":"n48","children":[]}]},{"id":"n51","type":"SUP","label":"6. 
Shared failure taxonomy","text":"A common failure taxonomy provides a shared language for teams to discuss and categorize AI quality issues.","parentId":"n32","children":[{"id":"n52","type":"EVID","label":"Useful failure categories","text":"Categories include hallucination, wrong tool selection, wrong tool input, tool output misuse, incomplete answer, missing citation, unsafe response, privacy leak, policy violation, poor escalation, over/under-refusal, bad formatting, timeout, and looping.","parentId":"n51","children":[]},{"id":"n53","type":"INSG","label":"Precision enables engineering fixes","text":"Precise feedback, like an agent passing an ungrounded customer ID in a refund workflow, allows engineering teams to implement targeted fixes.","parentId":"n51","children":[]}]}]},{"id":"n54","type":"SECT","label":"EDD process workflow","text":"A practical Eval-Driven Development workflow consists of ten sequential steps, from prototype to production.","parentId":"n1","children":[{"id":"n55","type":"STEP","label":"Step 1: Define production goal","text":"Clearly define the production goal, avoiding vague objectives.","parentId":"n54","children":[{"id":"n56","type":"EXMP","label":"Weak vs stronger goal example","text":"\"Build an AI assistant for support\" is a weak goal; \"a support agent that resolves refund and replacement requests for eligible orders, follows policy, protects data, escalates exceptions, maintains p95 latency under 8 seconds, and stays within budget\" is stronger.","parentId":"n55","children":[]},{"id":"n57","type":"JUST","label":"Stronger goals are measurable","text":"The stronger, more detailed goal definition is valuable because it makes the objective measurable.","parentId":"n55","children":[]}]},{"id":"n58","type":"STEP","label":"Step 2: Identify critical workflows","text":"Prioritize workflows based on their value and associated risks.","parentId":"n54","children":[{"id":"n59","type":"EXMP","label":"Workflow value and risk example","text":"Workflows like 
\"Order status\" are high value/low risk, while \"Request for another user’s data\" is low value/critical risk.","parentId":"n58","children":[]},{"id":"n60","type":"DATA","label":"Workflow value and risk table","text":"Workflow value and risk prioritization ranges from high value/low risk (Order status) to low value/critical risk (Request for another user’s data).","table":{"cols":["Workflow","Value","Risk"],"rows":[{"label":"Order status","cells":["High","Low"]},{"label":"Refund eligibility","cells":["High","Medium"]},{"label":"Replacement initiation","cells":["High","Medium"]},{"label":"Payment dispute","cells":["Medium","High"]},{"label":"Policy exception","cells":["Medium","High"]},{"label":"Request for another user’s data","cells":["Low","Critical"]}]},"parentId":"n58","children":[]},{"id":"n61","type":"INSG","label":"Single global score insufficient","text":"A single global score is inadequate because readiness varies by workflow: a system may be ready for order status but not for payment disputes.","parentId":"n58","children":[]}]},{"id":"n62","type":"STEP","label":"Step 3: Define success and failure criteria","text":"Establish explicit thresholds for success and failure, which can evolve over time.","parentId":"n54","children":[{"id":"n63","type":"EXMP","label":"Explicit criteria example","text":"Example criteria include a minimum task success rate (0.92), groundedness (0.95), and tool-use accuracy (0.95), plus a maximum p95 latency (8 seconds) and cost ($80 per 1,000 tasks).","parentId":"n62","children":[]},{"id":"n64","type":"DATA","label":"Critical failures example","text":"Critical failures specify maximum policy violations (0), privacy leaks (0), and unsafe completions (0).","parentId":"n62","children":[]},{"id":"n65","type":"JUST","label":"Criteria must be explicit","text":"The criteria must be explicit, even if the specific numbers change over time.","parentId":"n62","children":[]}]},{"id":"n66","type":"STEP","label":"Step 4: Build initial eval 
dataset","text":"Start with a small but realistic initial eval suite, expanding it as production usage increases.","parentId":"n54","children":[{"id":"n67","type":"EVID","label":"Initial suite composition","text":"A useful initial suite includes 50 happy-path examples, 50 edge cases, 25 adversarial cases, 25 historical failures, 25 tool-use workflows, and 25 safety/compliance cases.","parentId":"n66","children":[]},{"id":"n68","type":"SUBS","label":"Add real user traces and incidents","text":"As production usage grows, incorporate real user traces and examples derived from incidents into the dataset.","parentId":"n66","children":[]}]},{"id":"n69","type":"STEP","label":"Step 5: Create baseline","text":"Run the current AI system against the eval suite to establish a baseline for performance.","parentId":"n54","children":[{"id":"n70","type":"CMPR","label":"Example baseline metrics","text":"An example baseline shows metrics like task success rate, hallucination rate, and p95 latency.","table":{"cols":["Metric","Baseline"],"rows":[{"label":"Task success rate","cells":["78%"]},{"label":"Hallucination rate","cells":["9%"]},{"label":"Tool-use accuracy","cells":["83%"]},{"label":"Safety violation rate","cells":["1.2%"]},{"label":"p95 latency","cells":["11.8 seconds"]},{"label":"Cost per 1,000 tasks","cells":["$145"]}]},"parentId":"n69","children":[]},{"id":"n71","type":"JUST","label":"Baseline enables evidence-based improvement","text":"A baseline allows teams to make improvements based on evidence rather than relying on intuition.","parentId":"n69","children":[]}]},{"id":"n72","type":"STEP","label":"Step 6: Improve system iteratively","text":"Use identified failure patterns to guide decisions on what aspects of the system to fix.","parentId":"n54","children":[{"id":"n73","type":"CMPR","label":"Failure pattern to likely fix","text":"Various failure patterns, like hallucinations or wrong tool arguments, are associated with specific likely fixes.","table":{"cols":["Failure 
pattern","Likely fix"],"rows":[{"label":"Hallucination","cells":["Better retrieval, stricter grounding prompt, citation requirement"]},{"label":"Wrong tool","cells":["Improve tool descriptions, routing logic, examples"]},{"label":"Wrong tool arguments","cells":["Stronger schema validation, deterministic checks"]},{"label":"Incomplete answer","cells":["Add decomposition step or improve prompt rubric"]},{"label":"Unsafe answer","cells":["Safety classifier, refusal examples, policy guardrails"]},{"label":"High latency","cells":["Model routing, caching, fewer retrieval calls"]},{"label":"High cost","cells":["Smaller model for low-risk cases, shorter context, batching"]}]},"parentId":"n72","children":[]},{"id":"n74","type":"JUST","label":"Eval suite guides next improvements","text":"The eval suite not only grades the system but also indicates where the next improvements should be made.","parentId":"n72","children":[]}]},{"id":"n75","type":"STEP","label":"Step 7: Add evals to CI/CD","text":"Integrate evals into the CI/CD pipeline, triggering them for any changes to prompts, models, retrieval, tools, workflows, or guardrails.","parentId":"n54","children":[{"id":"n76","type":"CMPR","label":"Eval tiers and purpose","text":"Different tiers of evaluations are triggered at various stages of development and deployment, each serving a specific purpose.","table":{"cols":["Tier","Trigger","Purpose"],"rows":[{"label":"Smoke eval","cells":["Pull request","Fast check for obvious breakage"]},{"label":"Regression eval","cells":["Merge to main","Compare with baseline"]},{"label":"Full eval","cells":["Before staging or production","Release readiness"]},{"label":"Safety eval","cells":["Before production and scheduled","Red-team and policy validation"]},{"label":"Online eval","cells":["Production","Drift, incidents, feedback, canary quality"]}]},"parentId":"n75","children":[]},{"id":"n77","type":"INSG","label":"EDD part of engineering system","text":"Integrating evals into CI/CD makes 
Eval-Driven Development an intrinsic part of the engineering system.","parentId":"n75","children":[]}]},{"id":"n78","type":"STEP","label":"Step 8: Define release gates","text":"Convert risk tolerance into specific engineering thresholds that must be met before release.","parentId":"n54","children":[{"id":"n79","type":"EXMP","label":"Release gates example","text":"Example release gates: minimum overall task success rate (0.92), maximum hallucination rate (0.02), minimum tool-use accuracy (0.95), maximum p95 latency (8 seconds), and maximum cost ($80 per 1,000 tasks).","parentId":"n78","children":[]},{"id":"n80","type":"DATA","label":"Critical safety release gates","text":"Critical safety gates allow zero PII leaks, zero critical policy violations, and zero harmful completions.","parentId":"n78","children":[]},{"id":"n81","type":"DATA","label":"Regression release gates","text":"Regression gates limit the drop versus baseline to at most 0.01 for task success and 0.01 for groundedness.","parentId":"n78","children":[]},{"id":"n82","type":"JUST","label":"Failure to meet gates blocks shipment","text":"The rule is straightforward: if the system fails to meet the release gate criteria, it will not be shipped.","parentId":"n78","children":[]}]},{"id":"n83","type":"STEP","label":"Step 9: Monitor production behavior","text":"Supplement offline evals with production monitoring to capture real-world scenarios not covered by test sets.","parentId":"n54","children":[{"id":"n84","type":"EVID","label":"Production monitoring captures","text":"Monitoring captures live traces, tool calls, retrieved context, user feedback, human escalations, safety triggers, latency/cost spikes, failure categories, topic/document drift, and incident reviews.","parentId":"n83","children":[]},{"id":"n85","type":"INSG","label":"Production is source of best eval cases","text":"Production is not the final stage of evaluation; it is the source of the most valuable eval 
cases.","parentId":"n83","children":[]}]},{"id":"n86","type":"STEP","label":"Step 10: Turn failures into regression tests","text":"Convert every significant production failure into a future eval case to strengthen the eval suite over time.","parentId":"n54","children":[{"id":"n87","type":"SUBS","label":"Failure loop","text":"The loop involves identifying a production issue, incident review, root cause analysis, adding an eval case, fixing the system, rerunning the suite, and gating future releases.","parentId":"n86","children":[]},{"id":"n88","type":"INSG","label":"Eval suite becomes smarter","text":"This iterative process ensures the evaluation suite continuously improves and becomes more effective over time.","parentId":"n86","children":[]}]}]},{"id":"n89","type":"SECT","label":"Evaluating AI agents requires process evals","text":"For AI agents, evaluating outcomes alone is insufficient; process failures can occur even with correct final answers.","parentId":"n1","children":[{"id":"n90","type":"SUP","label":"Chatbot vs agent distinction","text":"A chatbot produces text, whereas an agent takes actions, a distinction that fundamentally changes evaluation needs.","parentId":"n89","children":[{"id":"n91","type":"EVID","label":"Production agent actions","text":"A production agent may retrieve documents, call APIs, update records, create tickets, ask questions, escalate, and decide when to stop.","parentId":"n90","children":[]},{"id":"n92","type":"WARN","label":"Only evaluating final answer is risky","text":"Evaluating only the final answer risks missing dangerous process failures, even if the outcome appears correct.","parentId":"n90","children":[]}]},{"id":"n93","type":"EVID","label":"Agent evaluation layers","text":"Agents require evaluation at two layers: outcome evals and process evals.","parentId":"n89","children":[{"id":"n94","type":"SUP","label":"Outcome evals","text":"Outcome evals assess the quality of the final result produced by the 
agent.","parentId":"n93","children":[{"id":"n95","type":"EXMP","label":"Outcome eval examples","text":"Examples include: Was the task completed? Was the answer correct/grounded? Was intent resolved? Did it escalate as needed?","parentId":"n94","children":[]}]},{"id":"n96","type":"SUP","label":"Process evals","text":"Process evals examine whether the agent followed the correct operational path.","parentId":"n93","children":[{"id":"n97","type":"EXMP","label":"Process eval examples","text":"Examples include: Did it select the right tool? Were tool inputs valid? Did it avoid unnecessary tools? Did it use outputs correctly? Did calls succeed? Did it loop/stall? Did it follow workflow sequence?","parentId":"n96","children":[]}]}]},{"id":"n98","type":"REQ","label":"Process evals non-negotiable for high-risk","text":"Process evaluations are absolutely necessary for high-risk workflows, as a good final answer does not excuse policy violations.","parentId":"n89","children":[]}]},{"id":"n99","type":"SECT","label":"How evals reduce AI cost","text":"Evals provide evidence for cost optimization decisions, especially when considering replacing expensive frontier models with cheaper ones.","parentId":"n1","children":[{"id":"n100","type":"JUST","label":"Cost optimization answer not from demo","text":"Decisions about replacing expensive frontier models with cheaper ones should be based on evals, not just demos.","parentId":"n99","children":[]},{"id":"n101","type":"SUP","label":"Four common cost reduction strategies","text":"There are four common strategies to reduce AI costs using evaluations.","parentId":"n99","children":[{"id":"n102","type":"EVID","label":"1. Full replacement","text":"Use a cheaper model only if it consistently meets the same production thresholds as the expensive one.","parentId":"n101","children":[]},{"id":"n103","type":"EVID","label":"2. 
Routing","text":"Route easy, low-risk tasks to a cheaper model, while sending complex or high-risk tasks to a stronger model.","parentId":"n101","children":[]},{"id":"n104","type":"EVID","label":"3. Fallback","text":"Allow a cheaper model to attempt the task first, escalating to a stronger model or human if confidence is low or risks are detected.","parentId":"n101","children":[]},{"id":"n105","type":"EVID","label":"4. Distillation or prompt optimization","text":"Improve smaller model behavior using high-quality outputs from stronger models, then verify these improvements with evaluations.","parentId":"n101","children":[]}]},{"id":"n106","type":"JUST","label":"Evals enable cost optimization without guessing","text":"Evaluations enable teams to optimize costs effectively without making blind decisions or guessing about performance.","parentId":"n99","children":[]}]},{"id":"n107","type":"SECT","label":"Best practices for EDD","text":"Follow these best practices to effectively implement Eval-Driven Development.","parentId":"n1","children":[{"id":"n108","type":"TIP","label":"Start with real user journeys","text":"Base evals on specific workflows, policies, customers, documents, tools, and risks, not generic model benchmarks.","parentId":"n107","children":[]},{"id":"n109","type":"TIP","label":"Start small but representative","text":"A small, representative 50-case eval suite addressing common failure modes is better than a large one with many easy examples.","parentId":"n107","children":[]},{"id":"n110","type":"TIP","label":"Separate low-risk and high-risk workflows","text":"Avoid a single global score; a model may be sufficient for FAQ lookups but inadequate for claims adjudication.","parentId":"n107","children":[]},{"id":"n111","type":"TIP","label":"Evaluate traces, not just final answers","text":"For agents, trace evaluation is crucial as bugs often reside in tool calls, parameters, retrieval results, and intermediate 
decisions.","parentId":"n107","children":[]},{"id":"n112","type":"TIP","label":"Use deterministic checks wherever possible","text":"Automate machine-checkable requirements such as JSON validity, schema compliance, exact labels, tool arguments, SQL execution, code tests, and citation fields.","parentId":"n107","children":[]},{"id":"n113","type":"TIP","label":"Calibrate LLM judges","text":"LLM-as-judge is useful but not perfect, requiring calibration against human labels and recalibration when prompts, rubrics, or models change.","parentId":"n107","children":[]},{"id":"n114","type":"TIP","label":"Track quality, latency, and cost together","text":"An AI system can be accurate but still fail in production if it's too slow or too expensive, so quality, latency, and cost should be tracked together.","parentId":"n107","children":[]},{"id":"n115","type":"TIP","label":"Turn every incident into an eval","text":"The most effective eval cases often emerge from real failures, making every incident an opportunity to strengthen the suite.","parentId":"n107","children":[]}]},{"id":"n116","type":"SECT","label":"Maturity model for EDD","text":"Organizations typically progress through various levels of maturity in their Eval-Driven Development adoption.","parentId":"n1","children":[{"id":"n117","type":"CMPR","label":"EDD Maturity Levels","text":"The EDD maturity model outlines six levels (Level 0 through Level 5), from basic demo testing to continuous optimization, each with a clear next step for progression.","table":{"cols":["Level","What it looks like","Next step"],"rows":[{"label":"Level 0: Demo testing","cells":["A few hand-picked prompts, no rubric","Capture representative examples"]},{"label":"Level 1: Manual evals","cells":["Spreadsheet review, small rubric","Convert cases to versioned JSONL or CSV"]},{"label":"Level 2: Automated offline evals","cells":["Eval runner, basic metrics","Store run lineage and compare baselines"]},{"label":"Level 3: CI/CD regression evals","cells":["Release gates block bad changes","Add 
trace-level agent evals and safety suites"]},{"label":"Level 4: Production monitoring","cells":["Online traces, feedback, drift, incidents","Convert failures into evals automatically"]},{"label":"Level 5: Continuous optimization","cells":["Cost-quality routing and fallback","Tune routing, models, prompts, and workflows continuously"]}]},"parentId":"n116","children":[]},{"id":"n118","type":"REQ","label":"Suggested EDD targets","text":"Suggested EDD maturity targets vary by development stage, from Level 1 for prototypes to Level 5 for scaled AI platforms.","parentId":"n116","children":[{"id":"n119","type":"DATA","label":"Prototype EDD target","text":"A prototype should aim for a minimum of Level 1 maturity in Eval-Driven Development.","parentId":"n118","children":[]},{"id":"n120","type":"DATA","label":"Internal pilot EDD target","text":"An internal pilot should target a minimum of Level 2 maturity in Eval-Driven Development.","parentId":"n118","children":[]},{"id":"n121","type":"DATA","label":"External production EDD target","text":"External production systems require a minimum of Level 3 maturity in Eval-Driven Development.","parentId":"n118","children":[]},{"id":"n122","type":"DATA","label":"Regulated/high-risk production EDD target","text":"Regulated or high-risk production systems need a minimum of Level 4 maturity in Eval-Driven Development.","parentId":"n118","children":[]},{"id":"n123","type":"DATA","label":"Scaled AI platform EDD target","text":"A scaled AI platform should aim for a Level 5 maturity target in Eval-Driven Development.","parentId":"n118","children":[]}]}]},{"id":"n124","type":"SECT","label":"Leader questions before AI production","text":"Leaders should ask specific questions before approving production AI to ensure readiness and accountability.","parentId":"n1","children":[{"id":"n125","type":"EVID","label":"Key questions for leaders","text":"Leaders should inquire about top workflows, eval dataset size/coverage, failure severity, pass rates by 
workflow/risk tier, comparison to baseline, hallucination rate, tool-use accuracy, safety/privacy evals, p95 latency, cost per task, human escalation rate, release gates, monitoring, and incident-to-regression-test process.","parentId":"n124","children":[]},{"id":"n126","type":"JUST","label":"Unanswered questions mean not production-ready","text":"If a team cannot answer these questions, the AI system is not considered ready for production deployment.","parentId":"n124","children":[]}]},{"id":"n127","type":"SECT","label":"Final takeaway","text":"Eval-Driven Development builds trust and confidence in AI systems by making their behavior measurable, just as TDD did for code.","parentId":"n1","children":[{"id":"n128","type":"INSG","label":"EDD is not bureaucracy","text":"Eval-Driven Development is a method for AI teams to move quickly without pretending the system is deterministic.","parentId":"n127","children":[]},{"id":"n129","type":"EVID","label":"Demos create excitement, evals create trust","text":"Demos generate initial excitement, but evaluations are what ultimately build trust in AI systems.","parentId":"n127","children":[]},{"id":"n130","type":"JUST","label":"EDD provides discipline for probabilistic systems","text":"EDD offers AI teams the same discipline as TDD for probabilistic systems: define, measure, analyze, improve, and block regressions.","parentId":"n127","children":[]},{"id":"n131","type":"INSG","label":"Strongest feedback loop wins","text":"Teams with the most robust feedback loops, not the flashiest demos, will ultimately succeed with production AI.","parentId":"n127","children":[]},{"id":"n132","type":"SUP","label":"Shipping AI without evals is like deploying code without tests","text":"Deploying AI without evaluations is comparable to releasing code without tests, meaning you won't know what broke until users do.","parentId":"n127","children":[]},{"id":"n133","type":"INSG","label":"EDD reveals breakage before users","text":"Eval-Driven 
Development ensures that AI teams discover system breakage before users encounter it.","parentId":"n127","children":[]}]}]},"slug":"evaldriven-development-how-to-ship-ai-sy-f6f0f2","sharedAt":{"_seconds":1781259086,"_nanoseconds":275000000}}