Thursday, September 24, 2026 - 1:30pm to 2:30pm

Taming the Stochastic Beast: Building AI Evaluation Pipelines for GenAI Releases

If you've ever shipped a GenAI feature wondering “is this actually good enough?”, you're not alone. Traditional pass/fail QA breaks down when outputs are non-deterministic, and teams end up making release decisions based on subjective “vibe checks” rather than data. This session shows how Product Managers can partner with QA to replace intuition with a systematic AI evaluation pipeline. You'll learn how to define quality as measurable dimensions (groundedness, tone, helpfulness, safety), build a representative test set, and design rubrics that align product goals with engineering constraints. Nixalkumar will walk through a practical LLM-as-a-judge pattern used in production, including calibration against human ratings, inter-rater consistency checks, and a lightweight error taxonomy to scale grading across thousands of interactions. You'll also see how to turn scores into release gates: setting thresholds, catching regressions, monitoring drift, and maintaining an audit trail for leadership and compliance. You'll leave with templates you can adapt immediately: a rubric framework, test set plan, scorecard format, and a cross-functional operating model that helps teams ship GenAI features with confidence instead of crossed fingers.
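As a rough illustration of two ideas mentioned in the abstract (checking an LLM judge against human ratings, and turning rubric scores into a release gate), here is a minimal Python sketch. It is not taken from the session materials. The threshold values, the 1-5 rubric scale, the `MAX_REGRESSION` tolerance, and the function names are all assumptions made for the example; Cohen's kappa is used as one common agreement measure, not necessarily the one the speaker uses.

```python
from collections import Counter
from statistics import mean

# Hypothetical minimum mean scores per quality dimension (1-5 rubric scale).
THRESHOLDS = {"groundedness": 4.0, "tone": 3.5, "helpfulness": 4.0, "safety": 4.5}
# Hypothetical tolerated drop in mean score versus the previous release.
MAX_REGRESSION = 0.2


def cohen_kappa(human, judge):
    """Chance-corrected agreement between human labels and judge labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_freq, judge_freq = Counter(human), Counter(judge)
    # Agreement expected if both raters labeled at random with their own rates.
    expected = sum(
        (human_freq[label] / n) * (judge_freq[label] / n)
        for label in set(human) | set(judge)
    )
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


def release_gate(scores, baseline):
    """Return (passed, failures) for per-dimension score lists.

    scores:   dimension -> list of rubric scores for the candidate release
    baseline: dimension -> mean score of the previous release
    """
    failures = []
    for dim, minimum in THRESHOLDS.items():
        avg = mean(scores[dim])
        if avg < minimum:
            failures.append(f"{dim}: mean {avg:.2f} below threshold {minimum}")
        if dim in baseline and baseline[dim] - avg > MAX_REGRESSION:
            failures.append(f"{dim}: regressed from {baseline[dim]:.2f} to {avg:.2f}")
    return (not failures, failures)
```

In practice a team would likely check the judge's agreement with human raters first, and only trust the gate once that agreement is acceptable for their use case.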

Big Data, Analytics, AI/Machine Learning for Testing

Test Automation

Test Leadership

Test Strategy, Planning, Metrics

Test Management

Test Automation Engineer

Quality Assurance

Product Owner

Nixalkumar Patel

LG Electronics

Nixalkumar Patel is a Product Manager at LG Electronics, where he works at the messy intersection of "does this AI actually work?" and "how do we prove it?" He partners with engineering and QA teams to operationalize Generative AI features for consumer products, focusing on AI evaluation: defining measurable quality standards, building rubric-based scorecards, and scaling assessment through automated judging. He has led cross-functional initiatives connecting product goals to test coverage, regression gates, and monitoring for LLM-driven behaviors. Previously, he supported data products integrating an enterprise foundation model into smart-home use cases. He advocates for "evals" as the new unit tests for AI and shares practical frameworks teams can adopt without exposing proprietary data.
