RAG Testing That Holds Up: Evaluating LLMs for Faithfulness, Boundaries, and Trust
Many teams are adopting retrieval-augmented generation (RAG) to constrain LLMs to internal documents, policies, and knowledge bases, but “using RAG” does not guarantee trustworthy behavior. In practice, models still hallucinate, blend in outside knowledge, ignore source boundaries, and produce confident answers that the retrieved evidence does not support. Traditional test approaches (happy-path assertions, correctness spot checks, performance metrics) often miss these failures because the output reads as plausibly correct. Drawing on real evaluation work on document-constrained enterprise systems, Katryna presents a practical, tester-friendly approach for validating RAG reliability. She will cover how to design test scenarios that probe grounding and faithfulness, how to detect boundary violations, and how to structure adversarial and exploratory prompts that reliably surface failure modes. Attendees will leave with a lightweight test plan template (risk tiers, pass/fail criteria, and reproducible bug reports) they can apply immediately to their own RAG pipelines.
Katryna Peart is an AI strategist focused on how large language models behave inside real organizations. She has evaluated retrieval-constrained and enterprise RAG systems for grounding, alignment, and reliability, contributing to institutional and municipal documentation used in production environments. With 15+ years of experience in regulated and public-sector communications, Katryna brings a rare human-centered lens to AI testing, bridging technical evaluation with governance, trust, and compliance. Her work centers on uncovering where AI “sounds right” but fails in ways that matter, and translating those risks into practical safeguards teams can use.
