Wednesday, September 23, 2026 - 1:30pm to 2:30pm

Evaluating Agentic LLM Apps: Beyond Vibes

"It seems to work" isn't a deployment strategy. As AI agents move from demos to production, teams discover that traditional software testing falls apart: outputs are non-deterministic, "correct" is subjective, and yesterday's perfect prompt fails mysteriously today. This talk tackles the unique challenges of verifying agentic applications. Rushabh will explore why agent evaluation is fundamentally harder than traditional ML testing: multi-step reasoning chains, tool-use side effects, and the compounding uncertainty problem. You'll learn practical approaches to building evaluation datasets that reflect production scenarios rather than cherry-picked examples. He will cover automated evaluation pipelines using LLM-as-judge patterns, their surprising failure modes, and when human-in-the-loop review is essential. You'll also learn how to detect regressions before users do by setting up continuous evaluation that catches prompt drift and model degradation. Finally, the talk will examine when benchmarks lie: why your agent aces public evals but fails in production, and how to build evaluations that predict real-world performance.

Big Data, Analytics, AI/Machine Learning for Testing

Developer

Rushabh Mehta

Meta

Rushabh Mehta is a Tech Lead at Meta with 12 years of experience building AI/ML infrastructure at billion-user scale. He currently leads development of environments to train and evaluate LLMs for agentic and tool-use capabilities, including onboarding evaluation benchmarks for rigorous model assessment. He built a Distributed Training Framework adopted by 80+ models, with a focus on training reliability. His cross-org initiatives have driven significant cost savings through GPU optimization and reduced idle time. Previously at Amazon, Rushabh led teams building Alexa's personalization platform and Prime Video's digital rights infrastructure. His backend work spans security, trust, and privacy across voice AI, payments, and streaming domains. Rushabh holds a Master's in Computer Science from Cornell University.
