Strategies for Testing AI-Based Systems

The rapid proliferation of Artificial Intelligence (AI) and autonomous, agentic systems presents unique and complex challenges for traditional software testing practices. Testers must evolve their skills to evaluate the quality, reliability, security, and ethical behavior of these intelligent systems.

If you're interested in cutting through the complexity and mastering the tools necessary for this new frontier of quality assurance, this course is for you. In this hands-on class, you will learn how to apply specialized testing techniques, tools, and methodologies to validate the performance and trustworthiness of AI systems. A variety of techniques and tools will be introduced to help testers as they plan, execute, automate, and report testing activities specific to AI.

Key takeaways from this class include:

Understand the fundamental differences between traditional software testing and AI systems testing.
Learn how to define quality metrics and testing strategies for Machine Learning (ML) models and data.
Master techniques for testing agentic AI systems, including goal-directed behavior, planning, and emergent properties.
Understand how to perform adversarial and robustness testing on AI components.
Take home information on analyzing and reporting on the ethical and safety aspects of AI system performance.
Learn to leverage specialized tools for data quality analysis, model explainability (XAI), and continuous validation.

Who Should Attend

This course is ideal for software testers, quality assurance engineers, and test managers who are or will be responsible for validating systems that incorporate AI, Machine Learning, or Agentic AI components. A foundational understanding of software testing principles and a high-level knowledge of AI/ML concepts are necessary.

Laptop and RDP Required

This class involves hands-on activities using sample software to better facilitate learning. Each student should bring a laptop with a remote desktop protocol (RDP) client pre-installed. Connection specifics and credentials will be supplied during class. Please work with your IT Admin before class to verify that your RDP client can be used to access a virtual machine running in the Microsoft Azure environment. If you or your Admin have questions about the specific applications involved, contact our Client Support team.

Course Outline

Day 1: Foundations of AI QA and Testing GenAI Solutions

Morning Session: The New Paradigm & The Golden Set

1. Introduction: The QA Shift from Deterministic to Probabilistic

The Core Problem: Moving from Assert.AreEqual(expected, actual) to evaluate range-based valid outputs.
The "Oracle" Problem: Why traditional QA lacks a single source of truth for AI responses.
Defining the System Under Test:
- GenAI Solutions: RAG chatbots and content generators.
- Agentic AI: Systems that reason, plan, and execute actions via tools.
Scope of Testing: Why we focus on "Inference" (Solution-level) rather than "Training" (Model-level).

2. The "Golden Dataset": The New Regression Suite

Concept: Creating a curated set of high-quality inputs and ideal outcomes to serve as a baseline.
Why it is Critical: Detecting regression when prompts or underlying models change.
Creation Strategy:
- Sourcing real production data vs. synthetic data generation.
- Labeling data: "Good" (Pass) vs. "Bad" (Fail) examples.
Hands-on Lab: Building a "Golden Test Set"

Afternoon Session: Testing RAG & LLM Integration

3. Evaluation Metrics & Methodologies (Beyond Accuracy)

Key Quality Metrics:
- Faithfulness/Groundedness: Is the answer supported by the context?.
- Relevance: Did it answer the user's specific query?.
- Coherence & Toxicity: Readability and safety checks.
Evaluation Techniques:
- Heuristics: Keyword matching and similarity thresholds (Exact match vs. Semantic similarity).
- LLM-as-a-Judge: Using a strong model (e.g., GPT-4) to grade the response of the system under test.

4. Deep Dive: Testing Retrieval-Augmented Generation (RAG)

Testing the Retrieval Component:
- Context Precision vs. Context Recall: Did the system find the right document chunk?.
- Technique: Verifying document IDs in the Golden Dataset.
Testing the Generation Component:
- Prompt Engineering Tests: A/B testing prompt templates (Zero-shot vs. Few-shot).
- Hallucination Detection: Identifying when the model "makes things up" outside of provided context.
Hands-on Lab: Testing retrieval accuracy and faithfulness checks

Day 2: Advanced Testing – Agents, Robustness, and Security

Morning Session: Testing Agentic AI & Integration

5. The Challenge of Agentic AI

What makes Agents different? Statefulness, memory, and complex execution paths.
Testing Tool Use:
- Verifying the agent selects the correct API/Tool for the job.
- Parameter Verification: Did the agent extract the correct variables (e.g., origin='JFK', dest='SFO')?.
Discuss Agentic flow of: Thought -> Plan -> Action -> Observation.

6. Trajectory Analysis & Multi-Agent Systems Trajectory

Testing: Evaluating how the agent arrived at the answer, not just the final result.
- Bad Trajectory: Inefficient loops or asking redundant questions.
- Good Trajectory: Efficient planning and execution.
Multi-Agent Considerations:
- Testing handoffs between agents (Routing).
- System stability and infinite loops in agent conversations.
Hands-on Lab: Testing Agents. Analyzing logs to determine if the agent took the most efficient path

Afternoon Session:
NFRs, Security, and Operations

7. Robustness, Error Handling, and Non-Functional Requirements

Robustness Testing:
- API Failures: How does the AI behave when a tool (e.g., Flight Search API) returns a 500 error? Does it degrade gracefully?
- Boundary Conditions: Testing out-of-distribution inputs and edge cases.
Non-Functional Testing:
- Latency: Measuring "Time to First Token" vs. Total Generation Time.
- Cost: Tracking token consumption per query to prevent budget overrun.

8. Security, Fairness, and "Red Teaming"

Adversarial Testing (Red Teaming):
- Prompt Injection: Attempts to hijack the system instructions (e.g., "Ignore previous instructions").
- PII Leakage: Ensuring the model does not reveal sensitive data from the knowledge base.
Fairness & Ethics:
- Testing for bias in responses (e.g., gender, race, or socio-economic bias). Implementing safety guardrails and output filters.
Hands-on Lab: Performing a "Red Team" attack

9. Automation & MLOps for Testers

Continuous Testing in CI/CD:
- Automating the Golden Set execution in the build pipeline.
- Tools Overview: Introduction to frameworks like Deepchecks, LangSmith, and prompt evaluation tools.
Production Monitoring:
- Shift-Right Testing: Monitoring for "Drift" (answers getting worse over time) and user feedback loops.

10. Wrap-up & Retro: Discussion on the future of AI testing (Self-healing systems, Formal verification).

Class Daily Schedule

Sign-In/Registration 7:30 - 8:30 a.m.
Morning Session 8:30 a.m. - 12:00 p.m.
Lunch 12:00 - 1:00 p.m.
Afternoon Session 1:00 - 5:00 p.m.
Times represent the typical daily schedule. Please confirm your schedule at registration.

Training Course Fee Includes

• Digital course materials
• Continental breakfasts and refreshment breaks
• Lunches

Big Data, Analytics, AI/Machine Learning for Testing

Experienced Tester

Test Management

Test Automation Engineer

Jeffery Payne

Coveros

Jeffery Payne is CEO and founder of Coveros, Inc., a company that helps organizations accelerate software delivery using agile methods. Prior to founding Coveros, he was the co-founder of application security company Cigital, where he served as CEO for 16 years. Jeffery is a recognized software expert and popular keynote speaker at both business and technology conferences on a variety of software quality, security, DevOps, and agile topics. He has testified in front of congress on issues such as digital rights mgmt., software quality, and software research.

Strategies for Testing AI-Based Systems

Jeffery Payne

Related Sessions