Systematic Error Detection for AI Systems

How to Know if Your Application Actually Works

So you’ve built your RAG bot, your customer service chatbot, or your AI-powered application. It looks great in demos, but how do you actually know if it works when real users start interacting with it? This is where systematic error detection comes in: a methodical approach borrowed from qualitative research in the social sciences that’s now essential for AI applications.

This approach can transform an AI product from burying users in information to giving them exactly what’s relevant. But you won’t discover those improvements without systematically analyzing how your AI actually behaves in practice. Just as computer networks rely on checksums to catch corrupted data in transit, AI systems need a deliberate process for catching failures before users do.

AI Error Detection Through Dimensional Sampling: Creating Your Bootstrap Dataset

When launching a new AI system, you face a classic chicken-and-egg problem: you need user data to identify errors, but you don’t have users yet. The solution is bootstrapping an initial dataset through strategic synthetic data generation, so you can start finding failure modes before they ever reach real users.

The rule of thumb: generate at least 100 diverse traces to get a good picture of failure modes. But these can’t be random queries; you need a structured method that deliberately covers edge cases and the messy inputs real users actually produce. Systematic input generation is what makes the resulting analysis trustworthy.

Creating Meaningful Dimensions

Rather than asking an LLM “give me potential user queries,” you need to think systematically about dimensions that vary meaningfully across different users and use cases. This surfaces errors early and ensures comprehensive coverage. Here’s how this might look for an apartment leasing assistant:

Intent/Feature Dimensions:

  • Check availability
  • Schedule tours/showings
  • Maintenance requests
  • Application processes
  • Rent collection

User Persona Dimensions:

  • Prospective residents (luxury vs budget-conscious)
  • Current residents
  • Past residents
  • Property managers

Query Complexity/Scenario Dimensions:

  • Highly ambiguous queries
  • Crystal clear, well-formed requests
  • Urgent/time-sensitive requests
  • Multi-step complex scenarios

The key insight: even Google Search, the world’s largest search engine, only handles “tens of different types of search queries.” Your AI system likely doesn’t have infinite query types either, which makes this dimensional approach both practical and tractable.

The Two-Step Synthesis Process

Once you have your dimensions, follow this two-step synthesis process:

Step 1: Generate Tuples/Combinations

Create combinations of your dimensions first. For example:

  • Feature: “Schedule maintenance”
  • Persona: “Current resident who is super busy”
  • Scenario: “Very ambiguous query”

Step 2: Generate Queries from Tuples

Feed these combinations to an LLM to generate natural-language user queries. This separation ensures diversity: you’ve explicitly specified diverse tuples, then generated queries from them.

Critical point: Always manually review these synthetic queries and discard unrealistic ones. The goal isn’t exhaustive coverage; it’s finding the dimensions most likely to induce failures as quickly as possible, and surfacing systematic biases or blind spots before deployment.
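Here’s a minimal sketch of the two-step process in Python. It assumes the openai SDK and a placeholder model name; the dimension values are illustrative and should come from your own product:

```python
import itertools
import random

from openai import OpenAI  # assumes the openai Python SDK; swap in your own client

# Illustrative dimensions for an apartment leasing assistant.
INTENTS = ["check availability", "schedule a tour", "maintenance request",
           "application process", "rent payment"]
PERSONAS = ["budget-conscious prospect", "luxury prospect", "current resident",
            "past resident", "property manager"]
SCENARIOS = ["highly ambiguous", "clear and well-formed", "urgent", "multi-step"]

# Step 1: enumerate tuples explicitly, then sample enough for ~100 traces.
tuples = list(itertools.product(INTENTS, PERSONAS, SCENARIOS))
sampled = random.sample(tuples, k=min(100, len(tuples)))

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def query_from_tuple(intent: str, persona: str, scenario: str) -> str:
    """Step 2: turn one dimension tuple into a natural-language user query."""
    prompt = (
        "Write one realistic message a user might send to an apartment leasing assistant.\n"
        f"Intent: {intent}\nPersona: {persona}\nStyle: {scenario}\n"
        "Return only the message text."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

synthetic_queries = [query_from_tuple(*t) for t in sampled]
```

Because every query traces back to an explicit tuple, you can see at a glance which combinations you’ve covered and which you still owe yourself.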

Data Integrity in AI Systems: Understanding Traces and Complete Data Storage

Before diving into analysis, let’s define what we’re analyzing. A trace is a complete, end-to-end record of what happens when a user submits a query, capturing every aspect of the interaction so you can inspect it later:

  • The initial user query and input data
  • All LLM inputs and outputs
  • Every tool call and tool response
  • Intermediate reasoning steps
  • Metadata and system information stored for analysis
  • Everything that happens during the interaction

Capturing all of this lets you verify each step of the AI’s decision-making process and pinpoint exactly where errors creep in.
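There’s no single required schema, but as a rough sketch a trace record might look like this (the field names are illustrative; match them to whatever your stack actually emits):

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ToolCall:
    name: str        # e.g. "check_pet_policy"
    arguments: dict  # arguments the model passed to the tool
    response: Any    # what the tool actually returned

@dataclass
class Trace:
    trace_id: str
    user_query: str                                           # the initial user input
    llm_messages: list[dict] = field(default_factory=list)    # every prompt and completion
    tool_calls: list[ToolCall] = field(default_factory=list)
    reasoning_steps: list[str] = field(default_factory=list)  # intermediate reasoning, if logged
    final_response: str = ""
    metadata: dict = field(default_factory=dict)              # model, latency, timestamps, etc.
```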

Common pitfall: Traces that don’t include tool calls. Without this information, you can’t diagnose whether failures stem from poor reasoning, hallucinated tools, or actual tool failures. It’s like troubleshooting a network without the packets: with key data missing, diagnosis turns into guesswork.

Systematic Detection of Errors: The Open Coding Process

Once you have traces, the real work begins. This is where we borrow from social science research methodology, specifically a technique called open coding, a method long used across disciplines to identify patterns in complex qualitative data.

How Open Coding Works to Detect Errors

  1. Read each trace end-to-end to understand the complete interaction
  2. Write brief, descriptive notes as thoughts occur to you—this helps identify errors that might otherwise be missed
  3. Document any observed problems or surprising actions that indicate potential AI failures
  4. Note unexpected behaviors that feel wrong or could lead to poor user experience
  5. Include positive observations and learnings that show when the system is working correctly

Think of this as “journaling on the outputs.” You’re not trying to fix anything yet; you’re building an understanding of how your AI system actually behaves. That understanding is what lets you make informed decisions about where to focus improvement efforts.
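A minimal way to do this journaling, assuming your traces are saved as JSON lines with fields like those sketched above, is a simple terminal loop that shows one trace at a time and records your note:

```python
import json

def open_code(trace_path: str, notes_path: str = "open_coding_notes.jsonl") -> None:
    """Show each trace and record a free-form note; no fixing yet, just observing."""
    with open(trace_path) as f:
        traces = [json.loads(line) for line in f]

    with open(notes_path, "w") as out:
        for trace in traces:
            print("\nUser:", trace["user_query"])
            print("Assistant:", trace["final_response"])
            for call in trace.get("tool_calls", []):
                print("Tool:", call["name"], "->", call["response"])
            note = input("Note (what surprised you, what went wrong?): ")
            out.write(json.dumps({"trace_id": trace["trace_id"], "note": note}) + "\n")
```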

Real Examples from Practice

Let me walk you through some actual examples to illustrate the process:

Example 1: Ambiguous Query

  • User query: “Hello there, what’s up to 4 months rent?”
  • AI response: Immediately jumped to specials and concessions, offered “up to 8 weeks free”
  • Open coding note: “I have no idea what this query means. Assistant should have asked follow-up questions for clarification.”

Example 2: Missing Information

  • User query: “Are pets allowed at your community?”
  • AI response: “Yes, pets are welcome! We have a pet-friendly policy…”
  • Problem: No tool call was made to actually retrieve pet policy information
  • Open coding note: “Don’t respond unless you’ve invoked a tool to get the right information. Potential hallucination.”

Example 3: Poor UX Design

  • User query: “Can we please schedule a tour?”
  • AI response: Listed 15 different time slots in bullet points: “Monday 4:40 PM, Tuesday 9:20 AM, Wednesday 2:15 PM…”
  • Open coding note: “This UX is terrible. Should provide a calendar widget or scheduling link instead of overwhelming text list.”

Notice how these notes focus on behavior rather than technical implementation. You’re not debugging code—you’re identifying user experience problems.

Error Correction Through Categorization: From AI Failures to Actionable Insights

After reviewing 100 traces (which takes about an hour once you get into flow), you’ll have 100+ open-ended notes documenting various AI errors and observations. Now comes axial coding—transforming those individual observations into structured failure categories that enable systematic error correction.

The Systematic Approach to Error Categorization

Imagine each note as a sticky note on a wall. Your task is to group similar sticky notes together and create a taxonomy (categorization system) that helps identify errors efficiently. For example:

High-level category: “Hallucination Issues”

  • Tool call hallucinations (claims to have scheduled without calling scheduling tool)
  • Information hallucinations (states pet policy without checking actual data)
  • Feature hallucinations (offers services that don’t exist)

High-level category: “UX/Product Issues”

  • Poor scheduling interface (text lists vs. widgets)
  • Missing visual content (no pool photos for pool inquiries)
  • Inefficient handoff processes

This systematic approach ensures you capture both technical AI failures and user experience problems, since either can produce a bad outcome for the user.
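Once the sticky notes are grouped, the taxonomy and its frequency counts fit in a few lines of code. A sketch, with illustrative labels and trace IDs:

```python
from collections import Counter

# Axial coding output: each open-coding note mapped to one or more failure-mode labels.
TAXONOMY = {
    "tool_call_hallucination": "Claims an action happened without calling the tool",
    "information_hallucination": "States facts (pet policy, pricing) without retrieving them",
    "feature_hallucination": "Offers services that don't exist",
    "poor_scheduling_ux": "Dumps raw time slots instead of a scheduling widget",
    "missing_visual_content": "Text-only answer where a photo was expected",
}

coded_notes = [
    {"trace_id": "t-014", "labels": ["information_hallucination"]},
    {"trace_id": "t-027", "labels": ["poor_scheduling_ux", "missing_visual_content"]},
    # ... one entry per open-coding note
]

frequencies = Counter(label for note in coded_notes for label in note["labels"])
for label, count in frequencies.most_common():
    print(f"{count:>3}  {label}: {TAXONOMY[label]}")
```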

Making Categories Binary and Testable

Structure your failure modes as clear yes/no criteria to ensure accurate detection:

  • ✅ “Assistant made tool call without user requesting action”
  • ❌ “Assistant was somewhat unhelpful” (too subjective)

Binary categories make human annotation faster and more consistent while enabling reliable measurement later: each one is a simple yes/no check for whether a specific error type is present. This removes ambiguity and keeps evaluation consistent across different team members.
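In practice, that means the annotation schema is just a set of yes/no flags per trace. A minimal sketch, with hypothetical failure-mode names:

```python
FAILURE_MODES = [
    "unrequested_tool_call",       # assistant acted without the user asking
    "answered_without_tool_call",  # stated facts it never retrieved
    "overwhelming_slot_list",      # listed raw time slots instead of offering a widget
]

def blank_annotation(trace_id: str) -> dict:
    """Every failure mode starts as an explicit no; the reviewer flips it to yes or leaves it."""
    return {"trace_id": trace_id, **{mode: False for mode in FAILURE_MODES}}

def failure_rates(annotations: list[dict]) -> dict:
    """Because every judgment is binary, rates fall straight out of the labels."""
    total = len(annotations) or 1
    return {mode: sum(a[mode] for a in annotations) / total for mode in FAILURE_MODES}
```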

Translating Findings into Actionable Improvements

Your error categories become your improvement roadmap. Each cluster represents a specific failure mode that you can address through:

  • Prompt engineering: Adjust system prompts to handle common failure patterns
  • Tool integration: Add capabilities to address functional gaps
  • User experience changes: Modify interfaces to prevent problematic interactions
  • Training data improvements: Address knowledge gaps in your models
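As one illustration of the first route, a system-prompt addition targeting the pet-policy hallucination from Example 2 might look roughly like this (the wording is hypothetical, not a drop-in fix):

```python
# Hypothetical guardrail added after error analysis surfaced tool-call and
# information hallucinations plus the time-slot UX problem.
SYSTEM_PROMPT_ADDENDUM = """
Before answering any question about policies, pricing, or availability, call the
relevant lookup tool and base your answer only on its response. If no tool covers
the question, say you will check with the leasing office instead of guessing.
When a user asks to schedule a tour, share the scheduling link rather than
listing individual time slots.
"""
```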

How to Identify Errors: Common Pitfalls in AI Systems Analysis

Based on extensive experience with teams implementing error detection in AI systems, here are the most frequent mistakes to avoid:

1. Unrepresentative Synthetic Data

Problem: Asking an LLM “give me user queries for my recipe bot” without dimensional thinking
Solution: Define meaningful dimensions (user personas, query complexity, feature intent) before generating queries. Unrepresentative queries produce an analysis that misses real-world failures.

2. Coming with Fixed Expectations

Problem: Starting with a predetermined list of failure modes and only looking for those specific errors
Solution: Let failure modes emerge from the data. Your actual AI failures will surprise you and differ from theoretical problems.

3. Trying to Skip Steps

Problem: Jumping straight to automated evaluation without understanding actual failure modes
Solution: Do the manual analysis first. There are no shortcuts to building deep understanding of how your technology actually behaves.

4. Working in Isolation

Problem: Doing error analysis alone without domain expert input
Solution: Talk to 2-3 people on your team. Different perspectives reveal different failure dimensions and help identify errors you might miss.

5. Focusing on Implementation Instead of Behavior

Problem: “Why did the system prompt make it do X?”
Solution: “What behavior happened and what should have happened instead?” Focus on user impact rather than technical details.

6. Incomplete Data Collection

Problem: Traces that don’t show tool usage, making diagnosis impossible
Solution: Capture everything: every tool call, response, and intermediate step. You can’t diagnose a failure you didn’t record.

Building Sustainable AI Error Detection Systems

Start Simple, Iterate Often

You don’t need complex software to begin. A simple custom interface lets you:

  • Review traces quickly with good/bad labels
  • Add open-ended notes explaining the reasoning
  • Generate automatic summaries of failure mode frequencies
  • Create actionable bug tickets and feature requests directly from analysis

The key insight: “As a product owner, this analysis gives me enough information to just get to work—creating bug tickets, feature requests, and directing engineering focus.” This approach ensures informed decisions based on actual data rather than assumptions.

Essential Interface Features

Whether you build custom tools or use existing platforms (Braintrust, LangSmith), prioritize features that help identify errors quickly:

  • Fast trace navigation: Click between traces quickly to maintain analysis flow
  • Clear trace visualization: See the conversation flow clearly without missing details
  • Easy annotation: Add notes without friction to capture insights in real-time
  • Category assignment: Apply failure modes from dropdowns for consistent labeling
  • Summary dashboards: See failure mode frequencies at a glance to prioritize fixes
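If you’d rather prototype before adopting a platform, even a throwaway terminal reviewer covers most of these features. A minimal sketch, assuming traces stored as dictionaries with the fields described earlier and reusing the hypothetical failure-mode names from above:

```python
import json
from collections import Counter

FAILURE_MODES = ["none", "unrequested_tool_call", "answered_without_tool_call",
                 "overwhelming_slot_list"]

def review(traces: list[dict], out_path: str = "annotations.jsonl") -> None:
    """Label, note, and categorize each trace in one pass, then print a summary."""
    annotations = []
    for trace in traces:
        print(f"\n=== {trace['trace_id']} ===")
        print("User:", trace["user_query"])
        print("Assistant:", trace["final_response"])
        label = input("good/bad? ").strip().lower()
        note = input("Note: ")
        for i, mode in enumerate(FAILURE_MODES):
            print(f"  [{i}] {mode}")
        mode = FAILURE_MODES[int(input("Failure mode #: ") or "0")]
        annotations.append({"trace_id": trace["trace_id"], "label": label,
                            "note": note, "failure_mode": mode})

    summary = Counter(a["failure_mode"] for a in annotations if a["label"] == "bad")
    print("\nFailure-mode frequencies:", dict(summary))
    with open(out_path, "w") as f:
        f.writelines(json.dumps(a) + "\n" for a in annotations)
```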

Make It Cyclical, Not One-Time

This isn’t a pre-launch checklist item; it’s ongoing product development. Schedule regular error analysis cycles to keep pace with how the system changes:

  • High-frequency period: Weekly during initial deployment to catch early problems
  • Steady state: Every 6-8 weeks to monitor for new error patterns
  • Triggered analysis: After major feature releases or user complaints

As your AI system evolves, new failure modes emerge. Regular analysis keeps you ahead of user frustration and keeps your AI accurate and reliable. Like any production system, an AI application needs continuous monitoring to maintain its performance standards.

The Bigger Picture: Why This Matters

Competitive Advantage Through Understanding

As one example from practitioners shows: “If you do this systematically, it’s highly likely you’ll know more about your AI’s behavior than anyone else on your team.” This knowledge compounds over time—you’re building institutional understanding that improves every product decision and helps prevent errors before they impact users.

Beyond Bug Fixes: Product Evolution

Error detection doesn’t just find bugs—it drives product evolution and helps identify opportunities for improvement. Example implementations:

  • Scheduling Agent discovered users expected scheduling widgets, not text lists → built calendar integration that reduced booking errors
  • Pool inquiry patterns revealed users wanted photos → added visual content requirements to improve accuracy
  • Handoff failures showed when AI should transfer to humans → improved escalation logic to prevent wrong responses

These insights came from systematic analysis, not intuition. The process helps organizations make informed decisions about feature priorities and user experience improvements.

Customer Discovery for AI Applications

This systematic error detection mirrors established qualitative research methods (open coding, axial coding) used in customer discovery. You’re treating AI failures not as technical bugs but as user experience problems requiring systematic understanding.

The transformation: scattered user complaints → clear improvement roadmap → targeted engineering focus → better user outcomes. This method has proven effective across the tech industry for improving AI systems and ensuring they provide accurate, reliable results.

