"It seems to work" isn't a deployment strategy. As AI agents move from demos to production, teams discover that traditional software testing falls apart — outputs are non-deterministic, "correct" is subjective, and yesterday's perfect prompt fails mysteriously today. This talk tackles the unique challenges of verifying agentic applications. Rushabh will explore why agent evaluation is fundamentally harder than traditional ML testing: multi-step reasoning chains, tool use side effects, and the compounding uncertainty problem. You'll learn practical approaches to building evaluation datasets...
Rushabh Mehta

Rushabh Mehta is a Tech Lead at Meta with 12 years of experience building AI/ML infrastructure at billion-user scale. He currently leads development of environments to train and evaluate LLMs for agentic and tool-use capabilities, including onboarding evaluation benchmarks for rigorous model assessment. He built a Distributed Training Framework adopted by 80+ models, with a focus on training reliability. His cross-org initiatives have driven significant cost savings through GPU optimization and reduced idle time. Previously at Amazon, Rushabh led teams building Alexa's personalization platform and Prime Video's digital rights infrastructure. His backend work spans security, trust, and privacy across voice AI, payments, and streaming domains. Rushabh holds a Master's in Computer Science from Cornell University.