Thursday, September 24, 2026 - 3:00pm to 4:00pm

RAG Testing That Holds Up: Evaluating LLMs for Faithfulness, Boundaries, and Trust

Many teams are adopting RAG to constrain LLMs to internal documents, policies, and knowledge bases, but “using RAG” does not guarantee trustworthy behavior. In practice, models still hallucinate, blend in outside knowledge, ignore source boundaries, and produce confident answers that the retrieved evidence does not support. Traditional test approaches (happy-path assertions, correctness spot checks, performance metrics) often miss these failures because the output reads as plausibly correct. Drawing on real evaluation work on document-constrained enterprise systems, this session presents a practical, tester-friendly approach to validating RAG reliability. Katryna will cover how to design test scenarios that probe grounding and faithfulness, how to detect boundary violations, and how to structure adversarial and exploratory prompts that reliably surface failure modes. Attendees will leave with a lightweight test plan template (risk tiers, pass/fail criteria, and reproducible bug reports) they can apply immediately to their own RAG pipelines.

Big Data, Analytics, AI/Machine Learning for Testing

Exploratory Testing

Test Strategy, Planning, Metrics

User Experience (UX) Testing

Test Management

Quality Assurance

Consultant

Katryna Peart

Independent Researcher, AI Governance

Katryna Peart is an AI strategist focused on how large language models behave inside real organizations. She has evaluated retrieval-constrained and enterprise RAG systems for grounding, alignment, and reliability, contributing to institutional and municipal documentation used in production environments. With 15+ years of experience in regulated and public-sector communications, Katryna brings a rare human-centered lens to AI testing—bridging technical evaluation with governance, trust, and compliance. Her work centers on uncovering where AI “sounds right” but fails in ways that matter, and translating those risks into practical safeguards teams can use.
