STARWEST 2026 - Testing AI Systems
Wednesday, September 23
How Testers Can Break AI: Practical Techniques to Find Bias, Hallucinations, and Accessibility
As AI-powered features (especially generative AI) are rapidly integrated into modern software, testing teams face a critical challenge. Traditional testing approaches focus on correctness and performance but fail to uncover ethical risks such as bias, hallucinations, and accessibility regressions. In real projects, this has led to AI systems that technically “work” yet exclude users, generate misleading outputs, or erode trust. In this talk, Aditi addresses this gap by reframing AI quality as a testable concern and applying practical, tester-led techniques rather than data-science-heavy...
Evaluating Agentic LLM Apps: Beyond Vibes
"It seems to work" isn't a deployment strategy. As AI agents move from demos to production, teams discover that traditional software testing falls apart: outputs are non-deterministic, "correct" is subjective, and yesterday's perfect prompt fails mysteriously today. This talk tackles the unique challenges of verifying agentic applications. Rushabh will explore why agent evaluation is fundamentally harder than traditional ML testing: multi-step reasoning chains, tool use side effects, and the compounding uncertainty problem. You'll learn practical approaches to building evaluation datasets...
Testing AI Systems That Change Over Time
Modern software systems increasingly rely on AI-driven features such as recommendations, copilots, and automated decision-making. Unlike traditional software, these systems evolve over time as data changes and user behavior shifts, making them difficult to test using deterministic test cases alone. Many testing teams struggle with unpredictable outputs, flaky tests, and failures that only appear after deployment. In this session, Dr. Longe will address the challenge of testing AI-enabled systems that change over time and explain how testers can adapt familiar testing principles to these...