AI's Little Secret: We've Been Doing Testing All Along

I had the opportunity to speak at QonfX Berlin, a QA and testing leadership event hosted by The Test Tribe. Coming from the Data Science, Machine Learning, and AI side of the world, it was a slightly different room from the ones I usually find myself in. I had a thesis based on my GenAI project experiences and wanted to engage with the audience to confirm whether it makes sense. The title of my talk was:

AI’s Little Secret: We’ve been doing testing all along. We just called it evals!

The core of the thesis was this: as software becomes more AI-driven — both in terms of how it is built but also in the form of actual AI-driven product features — traditional testing practices should also evolve to look like evaluations, or “evals”.

In traditional software development, we usually move through recognizable stages: requirements, design, implementation, testing, deployment, and monitoring. But with AI products, especially chatbots and agents, some of those stages are getting compressed into a prompt. A product manager might not write a detailed PRD for every behavior and an engineer might not implement every rule as deterministic code. Instead, the desired behavior and constraints are increasingly captured and live inside a prompt.

The nature of software is changing from deterministic to probabilistic as we increasingly adopt AI. When we were building a traditional form, we could validate the email field, restrict input types and write deterministic assertions. But increasingly the main interface is a large text box where anything goes — users can say anything and the system may respond differently each time. That changes the testing problem fundamentally. And how do we even test that?

My thesis is that this is where the language of AI starts sounding a lot like the language of testing:

Test cases become eval datasets.
Assertions become graders.
Regression testing becomes running evals across new versions.
Edge cases become adversarial prompts, prompt injection attempts, and unexpected user behavior.
Production monitoring becomes real-world evals on user conversations.

In the talk, I drew from my experience of working with customers to launch a chatbot. The naive version is familiar: write a prompt, define a few do’s and don’ts, test a handful of chats manually, and launch.

Then the bug reports arrive.

A sales lead in Italy says that the bot responded to her in German. The finance team complained that the bot refuses to answer financing-related questions that it should have handled. Each of these incidents becomes a failure scenario. And once you write those failure scenarios down, you have the beginning of an eval suite.

That was the central bridge I wanted to build for the audience: AI evals are not some alien new discipline. They are a continuation of testing, adapted to a world where behavior is less deterministic and more language-driven.

I enjoyed the interactions that followed. A lot of the folks in the audience were already wrestling with these questions. How do you test an agentic flow? Do you care only about the final outcome or should you also look at each step in the trace? How can one possibly go through each step manually? Can an LLM-as-judge pattern be used here?

I certainly wasn’t expecting to have such detailed discussions and learnt so much in the process. We discussed tooling and infrastructure: observability tools like Langfuse, distinguishing between offline evals and online monitoring, which role in a product team is involved at each decision step and how they can be enabled. It became clear that my thesis was not as far out as it may have sounded — a lot of QA and testing professionals are already moving in this direction, whether or not they call it evals yet.

I came away with the impression that the fundamentals haven’t really changed even though we use different terminology. We already have the necessary expertise, structure, and practices from years of software development and just need to sit together and figure it out.

Download slides (PDF)