2025 has been described as the year of the agent. We have seen incredible advances in what AI agents are able to accomplish. As the role of the agent grows, one fact remains true: evals are more crucial to the success of AI adoption than ever before.
What Are Evals?
Evals are systematic tests that measure whether your AI produces the right outputs. They answer: "Is the agent working as intended?"
Evaluations measure specific qualities such as:
- Accuracy against a known ground truth
- Relevance to the question that was asked
- Risk profile of a given response
- Regulatory compliance
- Any other relevant aspects of a response
Additionally, they are ideally automated, repeatable, and integrated into the development workflow.
Most teams start with golden test cases: curated examples with known correct answers that establish a baseline and track improvement over time. Once a system proves reliable against golden data, teams layer in production evals to measure real-world performance and user satisfaction.
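To make this concrete, here is a minimal sketch of a golden test case harness in Python. The questions, answers, and `run_agent` callable are hypothetical stand-ins for whatever your system and SMEs actually provide.

```python
# A minimal golden test case harness. The cases themselves come from SMEs;
# `run_agent` is whatever callable invokes your system (hypothetical here).
from dataclasses import dataclass
from typing import Callable

@dataclass
class GoldenCase:
    question: str
    expected_answer: str

GOLDEN_CASES = [
    GoldenCase("What is the warranty period for Model X?", "2 years"),
    GoldenCase("Does Model Y support USB-C charging?", "Yes"),
]

def evaluate_golden_cases(run_agent: Callable[[str], str]) -> float:
    """Return the fraction of golden cases the agent answers exactly."""
    passed = sum(
        run_agent(case.question).strip().lower() == case.expected_answer.lower()
        for case in GOLDEN_CASES
    )
    return passed / len(GOLDEN_CASES)
```

Exact-match scoring is rarely enough on its own, but it establishes the baseline that later, more nuanced evals build on.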
The Problem: Generic Evals Create a Measurement Gap
Two simple, off-the-shelf evals that teams often reach for are hallucination and toxicity. The goal of these evals is to make sure that our AI is not making up information or pulling unsupported claims from its training data, and to ensure that responses from our AI tools do not use harmful language.
Both of these sound great! Why wouldn't a team reach for these on every project?
The problem is that as AI has improved, these behaviors have become baseline expectations. Today's foundation models are excellent at following instructions and often simply need to be told to respond only with information found in the current context window.
Teams that implement these evals may well get confirmation that the system is doing what we ask of it, but this is a case of missing the forest for the trees. Neither of these evals tells us how effective the agent is at solving our domain-specific task.
Custom, domain-specific evals, on the other hand, let us quantitatively measure the failure modes our solutions actually exhibit and their direct impact on end users. Rather than running generic evals and delivering generic value, we begin measuring what matters for our use case. Crafting domain-specific evaluations empowers teams to target a given issue with confidence instead of moving the needle on metrics that don't affect user satisfaction.
Crafting Business-Specific Evals
The process of crafting business-specific evals varies from domain to domain, but a few common stages apply universally.
Start With Realistic Scenarios
Arguably the most important step is taking a user-centric approach to eval design. Off-the-shelf evals often miss this, favoring quantity of evaluations over quality. We must understand how our system should work, and the way we accomplish this is by interviewing subject matter experts and understanding the problem domain as deeply as we can. Ideally, these interviews cover what questions users are expected to ask and what a golden solution looks like. If that solution is simply a defined answer, that's a great starting point. If we can get a full picture of how these experts arrive at the answer, even better: agents thrive when given examples of how similar problems have been solved before, and this will strengthen their performance going forward.
Using the real examples from our SMEs, potentially alongside synthetically generated data for our use case, we perform error analysis across our domain's problem set and classify, by hand, where the AI succeeds and, more importantly, where it fails. When the AI does not succeed, it is imperative that we understand what mistake it made so that we can group these issues by general failure mode. The most important part of this stage is that teams read the AI traces and understand the trajectory of the model's work. By performing error analysis by hand, we distill the failures into a finite set of failure modes, which become our first custom evaluations.
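To make the hand-labeling step concrete, here is a minimal sketch of how annotated traces might be tallied into failure modes. The `Trace` fields and the labels themselves are hypothetical; real labels only emerge from reading your own agent traces.

```python
# A minimal sketch of tallying hand-labeled traces into failure modes.
# The Trace fields and the labels are hypothetical examples.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Trace:
    question: str
    agent_output: str
    failure_mode: str | None  # None means the agent succeeded

labeled_traces = [
    Trace("Compare battery life of A vs B", "...", "overly_technical_language"),
    Trace("Which model is waterproof?", "...", None),
    Trace("Compare camera specs", "...", "wrong_product_retrieved"),
]

failure_counts = Counter(
    t.failure_mode for t in labeled_traces if t.failure_mode is not None
)
print(failure_counts.most_common())  # the most frequent modes become our first evals
```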
An Example Failure Mode
A failure mode is only useful when it can be clearly measured and tracked over time. To do this, teams must first define what success looks like, then choose the simplest evaluation method capable of reliably measuring it. Say we are creating an AI agent that summarizes product information for a customer and helps them understand the difference between two technical specifications. Our agent correctly identifies the products in question, pulls the right specifications, and even summarizes the differences between them; however, the language it uses is overly technical and hard to understand. This issue appears repeatedly across our domain's problem set and is commonly where we see customers struggle. It is a key failure mode that, once addressed, lets the team directly improve the customer experience. We will see how it informs our evaluations in the next section.
Defining Our Custom Evals
Now that we understand how our AI generally performs out of the box on our tasks, we also understand where it falls short. Shortcomings can take many forms, such as choosing the wrong tool or using a tool incorrectly.
Examples of using a tool incorrectly include:
- Writing incorrect SQL queries
- Calling the wrong API endpoint
- Generating keywords for a document search that do not yield the correct documents
Even simpler: perhaps our AI returns the correct data and uses the correct tools but presents the results to our end user in a confusing manner that leaves them frustrated.
All of these are situations that may warrant creating an eval to track that specific failure mode over time. The simplest goal at this point in the process is to name the failure mode and describe what a successful interaction looks like.
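One lightweight way to capture this, sketched below with hypothetical fields and examples, is to write each failure mode down as a small record alongside its success criterion before any eval code exists.

```python
# A lightweight, hypothetical record for documenting a failure mode and
# what success looks like, before any eval code is written.
from dataclasses import dataclass

@dataclass
class FailureMode:
    name: str
    description: str
    success_criterion: str

FAILURE_MODES = [
    FailureMode(
        name="overly_technical_language",
        description="Comparisons use jargon the customer cannot follow.",
        success_criterion="A comparison can be understood without prerequisite technical knowledge.",
    ),
    FailureMode(
        name="wrong_tool_choice",
        description="The agent queries the pricing API when the question is about specifications.",
        success_criterion="The agent calls the tool that matches the user's intent.",
    ),
]
```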
Implementing Our Eval
Once an eval metric has been defined and understood, we have a few options available to us in order to implement the evaluation.
There are 3 primary classes of evaluation. These are:
- Complete LLM-as-a-Judge
- Best used for nuanced evaluations requiring human-like judgement, ideally framed as true/false questions. The LLM is asked whether a specific quality is present in the produced result.
- Example: Answering a question like "Is this document relevant to the question asked?"
- Structured LLM-as-a-Judge
- A mix of LLM-as-Judge and Code Based evaluation. Leverages an LLM to extract or classify data from a given result and then uses that data to programmatically determine a final score.
- Example: Calculate answer correctness by using an LLM to pull facts out of the generated and ground-truth answers, then programmatically computing what percentage of them overlap (see the sketch after this list).
- Code-Based
- Grades an output from our AI system using a process defined entirely in code. It does not rely on AI to produce a final grade and is therefore extremely scalable.
- Examples: Expected message format, correct tool choice. These do not require any reasoning and can instead be directly tested.
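As an illustration of the structured approach, here is a minimal sketch of the answer-correctness example. `call_llm` is a hypothetical placeholder for whatever LLM client you use, and only the overlap math is deterministic code.

```python
# A minimal sketch of a structured LLM-as-a-Judge eval for answer correctness.
# `call_llm` is a hypothetical placeholder for your LLM client of choice.
def call_llm(prompt: str) -> str:
    """Send a prompt to an LLM and return its text response."""
    raise NotImplementedError

def extract_facts(answer: str) -> set[str]:
    """Ask the LLM to list atomic facts, one per line, then normalize them."""
    response = call_llm(
        "List each distinct factual claim in the following answer, "
        f"one per line, with no extra commentary:\n\n{answer}"
    )
    return {line.strip().lower() for line in response.splitlines() if line.strip()}

def answer_correctness(generated: str, ground_truth: str) -> float:
    """Fraction of ground-truth facts that also appear in the generated answer."""
    generated_facts = extract_facts(generated)
    truth_facts = extract_facts(ground_truth)
    if not truth_facts:
        return 0.0
    return len(generated_facts & truth_facts) / len(truth_facts)
```

In practice the overlap check would need fuzzier matching than the exact set intersection shown here, but the shape of the metric stays the same.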
Each approach has its pros and cons, but a general rule of thumb is to prefer code-based evaluation when reasonably possible and reserve LLM-as-a-Judge for the more nuanced metrics that would typically rely on a human's judgement. We prefer code-based evals when feasible because they are deterministic rather than stochastic, and they are generally simpler to implement and faster to run, oftentimes completing in a fraction of the time an LLM-as-a-Judge takes.
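A code-based eval can be as small as a deterministic assertion over the agent's trace. The sketch below assumes a hypothetical trace format and checks two of the examples above: correct tool choice and an expected (JSON) message format.

```python
# A minimal sketch of code-based evals over a hypothetical agent trace format.
# No LLM is involved, so these checks are fast, cheap, and deterministic.
import json

def used_expected_tool(trace: dict, expected_tool: str) -> bool:
    """Pass if the agent called the tool we expected for this scenario."""
    tool_calls = [step["tool"] for step in trace.get("steps", []) if step.get("tool")]
    return expected_tool in tool_calls

def response_is_valid_json(trace: dict) -> bool:
    """Pass if the final response parses as JSON (our expected message format)."""
    try:
        json.loads(trace["final_response"])
        return True
    except (KeyError, TypeError, json.JSONDecodeError):
        return False
```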
The Example Evaluation
Looking back at our example above, where the agent's common failure mode was presenting information in an overly technical manner, we can create this eval as a Complete LLM-as-a-Judge metric. The metric's goal is simple: for each product comparison we give it, it should answer the question "Can this comparison be understood without prerequisite technical knowledge?" By leveraging an LLM-as-a-Judge, we get a human-like reading of the generated text without the results being too rigid.
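A minimal sketch of that judge might look like the following. `call_llm` is again a hypothetical placeholder for whatever client the team has adopted, and the prompt wording is illustrative rather than prescriptive.

```python
# A minimal sketch of the readability judge. `call_llm` is a hypothetical
# placeholder for your LLM client; the prompt wording is illustrative.
def call_llm(prompt: str) -> str:
    """Send a prompt to an LLM and return its text response."""
    raise NotImplementedError

def is_understandable(comparison_text: str) -> bool:
    """True if the judge says the comparison needs no prerequisite technical knowledge."""
    verdict = call_llm(
        "Can the following product comparison be understood without prerequisite "
        "technical knowledge? Answer with exactly YES or NO.\n\n"
        f"{comparison_text}"
    )
    return verdict.strip().upper().startswith("YES")
```

Keeping the verdict to a strict YES/NO makes the judge easy to aggregate across a test set and harder for the model to hedge.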
Improving Our System
Once we have a first iteration of an eval defined, we can begin testing inputs and give ourselves the freedom to iterate on the agents themselves with confidence, knowing that we are continuing to solve problems for our users rather than wasting time measuring data that does not impact our outcomes.
Make Evals Part of Your Workflow
Now that we have created evals for the issues encountered in our first round of feedback, we can create a flywheel for improvement. Our product should contain a mechanism for users to provide feedback: thumbs up and down are a common starting point, but allowing free-text feedback is even better. Once we have this, we can continue the process in a loop. Teams should regularly review traces to understand how the tool's effectiveness changes over time in response to new information, new workflows, and other outside influences, and should continuously understand their customers and meet them where they are. If customers are encountering new failure modes, that may be a sign that a new eval should be added to the system, and if use cases shift, existing evals may need to shift their definitions as well. Just as the business evolves, evals will not remain static; teams should stay flexible and keep tuning them.
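One possible shape for that feedback mechanism, sketched here with hypothetical fields, is to store each piece of feedback next to the trace it refers to so that it feeds straight back into error analysis and new evals.

```python
# A hypothetical sketch of capturing user feedback alongside the trace it
# refers to; persistence is left to whatever storage the team already uses.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class FeedbackRecord:
    trace_id: str
    thumbs_up: bool
    comment: str | None
    created_at: datetime

def record_feedback(trace_id: str, thumbs_up: bool, comment: str | None = None) -> FeedbackRecord:
    """Build a feedback record tied to the trace that produced the response."""
    return FeedbackRecord(trace_id, thumbs_up, comment, datetime.now(timezone.utc))
```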
Conclusion
The rapid growth of powerful language models has raised the bar for AI quality, turning generic, off-the-shelf evaluations like hallucination and toxicity into baseline checks that no longer measure success for end users.
Successful AI adoption requires custom, domain-specific evaluations. This is achieved through a user-centric journey, leveraging subject matter experts and error analysis to quantitatively determine if you are solving problems for your users.
- Start with Realistic Scenarios: Interview subject matter experts (SMEs) to gather golden solutions and understand the problem domain.
- Perform Error Analysis: Use real examples to classify AI failures into a finite set of failure modes.
- Define Custom Evals: Create metrics to track specific failure modes, determining what a successful interaction looks like.
- Prioritize Code-Based Evals: Prefer deterministic and scalable code-based evaluation when feasible.
- Make Evals Part of Your Workflow: Establish a continuous feedback loop where user feedback directly informs new evals.
By prioritizing code-based evals and reserving LLM-as-a-Judge for nuanced, human-like metrics, teams gain the confidence to iterate and improve their systems rapidly.
This continuous focus on evaluation as a core element of the product development lifecycle creates a flywheel of improvement, allowing user feedback to directly inform new evals and focusing your resources on the components that truly matter. Stop wasting time measuring data that doesn't impact outcomes, and start measuring what matters most: the agent's ability to drive tangible user satisfaction and business success.
