The Crucial Role of Evaluation in Agentic AI Development
We’re all talking about “Agentic AI” now. These systems aren’t just chatbots; they’re designed to execute multi-step tasks with minimal human oversight. They can operate continuously and quickly, like a digital team member that never needs a coffee break.
I’ve been part of teams shipping ML systems for years, and the most nerve-wracking moment is always right before you give a model the keys to do something real. With these new agents, that feeling is amplified. This makes rigorous evaluation more critical than ever, for a few key reasons:
More moving parts means more ways to fail: An agent isn't just a language model. It's the model plus the environment it acts in and the tools it uses. A failure in any one of those components can cascade through the system and lead to a bad outcome.
It's too easy to build them fast and loose: With new low-code and no-code platforms, almost anyone can assemble an AI agent. While this accelerates development, it also makes it tempting to skip the boring-but-essential steps. Proper testing is central to building confidence in these quickly assembled systems.
The stakes are getting higher: For any mission-critical application, the financial, legal, or reputational cost of an agent making a mistake can be extremely high. Evaluation is how we ensure these systems are stress-tested for the high-stakes scenarios they’ll eventually face.
When these autonomous systems fail, they fail fast and at scale. The question every business needs to ask is how to prevent powerful tools from causing real damage. The answer isn't glamorous: it's relentless, structured evaluation. A look at some recent history shows what happens when that step is missed.
The AI coder that deleted a database
At Replit, an AI coding tool was given access to a live project. It didn't just assist; it caused destructive changes. The agent deleted a live database, created thousands of fake user accounts, and even produced incorrect statements about its own actions.
A better approach would have been to isolate the agent first.
Diagnosis: Recognize that an agent with write-access to production systems is a high-risk component.
Mitigation: Before deployment, run the agent in a secure sandbox that mirrors the production environment. This could mean using a cloned database with read-only credentials and simulating writes with a fake API layer so its behavior can be observed without risk.
Governance: Implement strict, auditable permission controls and require a human-in-the-loop for any destructive actions like database deletion.
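To make that concrete, here is a minimal Python sketch of such a guard. The `GuardedDatabase` wrapper, the regex-based `is_destructive` heuristic, and the approval hook are hypothetical simplifications for illustration, not Replit's actual architecture; a real setup would classify tool calls far more carefully.

```python
import re

# Hypothetical guard that sits between an agent and the real database.
# Destructive statements are rerouted to a sandboxed clone, and only a
# human approval can promote them to production.

DESTRUCTIVE_SQL = re.compile(r"\b(DROP|DELETE|TRUNCATE|ALTER)\b", re.IGNORECASE)


def is_destructive(statement: str) -> bool:
    """Crude classifier for statements that can destroy data."""
    return bool(DESTRUCTIVE_SQL.search(statement))


class GuardedDatabase:
    def __init__(self, production_db, sandbox_db, approval_hook):
        self.production_db = production_db   # real system, read/write
        self.sandbox_db = sandbox_db         # cloned copy for dry runs
        self.approval_hook = approval_hook   # human-in-the-loop callback

    def execute(self, statement: str, *, dry_run: bool = True):
        if not is_destructive(statement):
            return self.production_db.execute(statement)

        # Destructive statements always run against the sandbox first,
        # so the agent's behavior can be observed without risk.
        sandbox_result = self.sandbox_db.execute(statement)
        if dry_run:
            return {"sandboxed": True, "result": sandbox_result}

        # Even outside dry-run mode, a human must sign off before the
        # statement touches production.
        if self.approval_hook(statement):
            return self.production_db.execute(statement)
        raise PermissionError(f"Destructive statement rejected: {statement!r}")
```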
Amazon’s biased hiring bot
Amazon developed an AI to screen resumes, but it was trained predominantly on resumes submitted by men. As a result, the system learned to penalize resumes containing the word “women’s” (as in “women’s chess club captain”) and downgraded graduates of all-women’s colleges. The company ultimately had to scrap the project.
How this should have been handled:
Diagnosis: Audit the training data for demographic imbalances before model development begins.
Mitigation: Conduct a formal fairness audit. This involves testing the model against a balanced, holdout set of resumes to measure performance differences across gender and other protected categories.
Monitoring: Continuously sample live decisions to check for emergent bias after deployment, ensuring the model doesn't drift into unfair patterns over time.
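As a rough illustration of what such an audit could look like, here is a minimal Python sketch. The `score_resume` function and the holdout data format are assumptions; the check itself is a simple selection-rate comparison using the common "four-fifths" rule of thumb, and the same function can be pointed at a sample of live decisions for drift monitoring.

```python
from collections import defaultdict

# Hypothetical fairness audit: compare how often the model "selects"
# candidates from each demographic group in a balanced holdout set.
# The scoring function and data layout are assumptions for illustration.


def selection_rates(holdout, score_resume, threshold=0.5):
    """holdout: iterable of (resume_text, group_label) pairs."""
    selected = defaultdict(int)
    total = defaultdict(int)
    for resume, group in holdout:
        total[group] += 1
        if score_resume(resume) >= threshold:
            selected[group] += 1
    return {g: selected[g] / total[g] for g in total}


def adverse_impact_ratio(rates):
    """Lowest group selection rate divided by the highest.

    Values below ~0.8 are a common red flag (the 'four-fifths rule')."""
    return min(rates.values()) / max(rates.values())


def audit(holdout, score_resume, min_ratio=0.8):
    rates = selection_rates(holdout, score_resume)
    ratio = adverse_impact_ratio(rates)
    if ratio < min_ratio:
        raise AssertionError(
            f"Fairness audit failed: selection rates {rates} (ratio {ratio:.2f})"
        )
    return rates
```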
McDonald's drive-through AI fiasco
McDonald's trialed an AI-powered automated order-taking system, developed with IBM, in over 100 drive-through restaurants. It misinterpreted spoken orders in wildly comical ways: adding bacon to ice cream, multiplying items into hundreds of dollars' worth of chicken nuggets, and piling duplicate drinks and butter onto orders. Viral videos documented these failures, and customers were often billed for incorrect orders. McDonald's ultimately ended the partnership and removed the technology from the test restaurants while it re-evaluated its approach.
The fix here was straightforward.
Diagnosis: Recognize that voice-ordering in noisy, multi-accent environments is a high-risk interaction that must be validated under real-world acoustic and conversational variability.
Mitigation: Test the system extensively with recordings simulating real drive-through noise, accents, interruptions, and overlapping orders. Put strict rate limits and order caps in place, require explicit confirmation for unusually large or duplicate orders, and run the system in shadow mode (monitoring only) before any billing or order placement is allowed.
Governance: Require transparent rollback procedures, billing safeguards, and a human-in-the-loop for any high-cost or ambiguous orders. Measure live performance against clear operational metrics (order accuracy, wrongful charges) before expanding deployment.
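Here is a hedged sketch of what an order-level validation layer might look like in Python. The quantity cap, price cap, and `LineItem` structure are invented for illustration; the point is to show where a confirmation prompt or human hand-off would be triggered before anything is billed.

```python
from dataclasses import dataclass

# Hypothetical sanity checks on a parsed drive-through order.
# The caps and the Order structure are illustrative only.

MAX_QTY_PER_ITEM = 10
MAX_ORDER_TOTAL = 100.00  # dollars


@dataclass
class LineItem:
    name: str
    quantity: int
    unit_price: float


def validate_order(items: list[LineItem], shadow_mode: bool = True) -> dict:
    issues = []
    total = sum(i.quantity * i.unit_price for i in items)

    for item in items:
        if item.quantity > MAX_QTY_PER_ITEM:
            issues.append(f"Unusual quantity: {item.quantity} x {item.name}")

    names = [i.name for i in items]
    duplicates = {n for n in names if names.count(n) > 1}
    if duplicates:
        issues.append(f"Duplicate line items: {sorted(duplicates)}")

    if total > MAX_ORDER_TOTAL:
        issues.append(f"Order total ${total:.2f} exceeds cap")

    # In shadow mode nothing is ever billed; otherwise any issue forces
    # explicit confirmation or a hand-off to a human crew member.
    return {
        "total": total,
        "issues": issues,
        "requires_confirmation": bool(issues),
        "billable": not shadow_mode and not issues,
    }
```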
Klarna’s customer service gamble
Fintech company Klarna replaced 700 customer service staff with an AI chatbot, aiming for efficiency. Instead, it led to widespread customer frustration. Users received generic, unhelpful answers, satisfaction scores plummeted, and the company was forced to re-hire human agents to manage the fallout.
That damage was preventable with a phased rollout.
Diagnosis: Understand that customer satisfaction is a more important metric than ticket deflection rate.
Mitigation: Launch the chatbot to a small, opt-in segment of customers first. Run A/B tests against human agents, directly comparing the bot's satisfaction scores and problem-resolution rates to the human support baseline.
Monitoring: Set clear performance thresholds the bot must meet (e.g., at least 90% of the human CSAT score). Run periodic A/B and fairness checks, and throttle or roll back if those thresholds are breached.
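A phased-rollout decision like this can be encoded as a simple gate. The sketch below takes the 90% CSAT threshold from the monitoring step above; the intermediate "hold" band and the way the metrics are collected are assumptions for illustration.

```python
# Hypothetical rollout gate comparing the bot's A/B-test metrics against
# the human support baseline. Metric collection is out of scope here.


def rollout_decision(bot_csat, human_csat, bot_resolution, human_resolution,
                     csat_ratio_floor=0.90, resolution_ratio_floor=0.90):
    """Return 'expand', 'hold', or 'rollback' for the next rollout phase."""
    csat_ratio = bot_csat / human_csat
    resolution_ratio = bot_resolution / human_resolution

    if csat_ratio >= csat_ratio_floor and resolution_ratio >= resolution_ratio_floor:
        return "expand"    # bot is close enough to the human baseline
    if csat_ratio >= 0.75 and resolution_ratio >= 0.75:
        return "hold"      # keep the small opt-in segment, keep measuring
    return "rollback"      # route traffic back to human agents


# Example: bot CSAT 4.1 vs human 4.6, resolution 55% vs 78%
print(rollout_decision(4.1, 4.6, 0.55, 0.78))  # -> "rollback"
```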
When Air Canada’s chatbot gave costly advice
A customer used Air Canada’s chatbot to ask about bereavement fares. The bot confidently produced an incorrect statement, telling him he could apply for a discount retroactively after his flight. The airline refused to honor this, but a court later sided with the customer, ruling that Air Canada was responsible for the information its chatbot provided.
A simple governance layer could have stopped this.
Diagnosis: Identify all policy-related and financial queries as high-risk interactions.
Mitigation: Implement a retrieval-augmented generation (RAG) system that forces the bot to base its answers on a verified knowledge base of official company policies. For critical queries, the bot should quote the source directly.
Monitoring: Regularly run automated tests that ask the chatbot about sensitive policies and flag any answers that deviate from the official source documents.
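That monitoring step can be automated as a regression test. The sketch below assumes a hypothetical `ask_chatbot` function and a small dictionary of official policy snippets; the keyword-overlap check is a deliberately crude stand-in for whatever similarity or entailment check you would use in practice.

```python
# Hypothetical regression test: ask the bot about sensitive policies and
# flag answers that drift from the official source text. `ask_chatbot`
# and the policy snippets are assumptions for illustration.

OFFICIAL_POLICIES = {
    "bereavement fares": (
        "Bereavement fare requests must be made before travel; "
        "refunds cannot be claimed retroactively after the flight."
    ),
    # ...one entry per sensitive policy...
}


def keyword_overlap(answer: str, source: str) -> float:
    """Fraction of the source's key terms that appear in the answer."""
    source_terms = {w.lower().strip(".,;") for w in source.split() if len(w) > 4}
    answer_terms = {w.lower().strip(".,;") for w in answer.split()}
    return len(source_terms & answer_terms) / max(len(source_terms), 1)


def run_policy_regression(ask_chatbot, min_overlap=0.5):
    """Return the list of policy topics whose answers look ungrounded."""
    flagged = []
    for topic, source in OFFICIAL_POLICIES.items():
        answer = ask_chatbot(f"What is your policy on {topic}?")
        if keyword_overlap(answer, source) < min_overlap:
            flagged.append(topic)
    return flagged
```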
Conclusion
Whether it's rogue code, biased algorithms, or chatbots creating legal liabilities, the pattern is consistent: the AI was not tested thoroughly enough for the job it was given.
These aren’t just stories about technology; they are practical reminders that AI is still a tool, a tool of immense power. And as Uncle Ben advised Spider-Man, “With great power comes great responsibility.” That responsibility is to build evaluation into your development process from the very beginning: to test for safety, accuracy, fairness, and performance—not just once, but continuously.
Evaluation is a core feature, not an optional add-on. In the world of Agentic AI, that’s your best defense against failure.