Abstract

Large language model (LLM)–based agents demonstrate strong capabilities in isolated tasks, yet deploying them as reliable components of real-world systems remains a fundamental challenge. In production environments, failures rarely stem from individual model errors alone; instead, they emerge from compounding uncertainty, weak validation boundaries, and tightly coupled workflows that amplify small mistakes over time.
In this talk, I present a production-driven study of trustworthy agentic AI, grounded in the design and deployment of EnviStor, a multi-agent research data management system operating under real-world constraints. I argue that trustworthiness in agentic systems requires structure at two complementary levels. At the agent level, reliability depends on explicit internal organization—separating behaviors, domain knowledge, and skills, and balancing stability with adaptation through a Dual-Helix architecture. At the system level, reliability requires structural isolation between agents, enforced through role separation, privilege boundaries, and audited handoffs, forming a multi-agent system (MAS) designed to contain, rather than eliminate, inevitable errors.
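The abstract does not specify how EnviStor implements these boundaries, but the system-level idea it describes (role separation, privilege boundaries, and audited handoffs between agents) can be sketched minimally. Everything below is hypothetical illustration: the agent names, the `Orchestrator` class, and the privilege sets are assumptions, not details from the talk.

```python
# Minimal sketch of system-level isolation between agents: each agent has
# an explicit role and privilege set, and every handoff is checked against
# the receiver's privileges and recorded in an audit log. All names here
# are illustrative, not taken from EnviStor.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Agent:
    name: str
    role: str
    privileges: frozenset  # actions this agent is allowed to perform

@dataclass
class Orchestrator:
    audit_log: list = field(default_factory=list)

    def handoff(self, sender: Agent, receiver: Agent, action: str, payload):
        # Privilege boundary: the receiving agent must hold the privilege
        # for the requested action, or the handoff is refused.
        if action not in receiver.privileges:
            self.audit_log.append(("denied", sender.name, receiver.name, action))
            raise PermissionError(f"{receiver.name} lacks privilege: {action}")
        # Audited handoff: every successful transfer is logged, so errors
        # are contained and traceable rather than silently propagated.
        self.audit_log.append(("handoff", sender.name, receiver.name, action))
        return payload

# Hypothetical roles: a read-only ingest agent and a read/write curator.
ingest = Agent("ingest", "reader", frozenset({"read"}))
curator = Agent("curator", "writer", frozenset({"read", "write"}))
orch = Orchestrator()

orch.handoff(ingest, curator, "write", {"dataset": "obs-2024"})  # permitted
try:
    orch.handoff(curator, ingest, "write", {"dataset": "obs-2024"})  # refused
except PermissionError:
    pass
```

The point of the sketch is that errors are contained by structure: a mis-routed action fails at the privilege boundary and leaves an audit trail, rather than relying on every agent behaving correctly.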
Drawing on operational experience, I show that even well-engineered agents exhibit bounded accuracy in open-ended, multi-step tasks, making perfect autonomy neither realistic nor desirable. I conclude by discussing open challenges in validation, governance, and orchestration, and outline future research directions focused on understanding how trust boundaries can be progressively formalized and shifted within agentic systems under real-world constraints.

Here is the public livestream link. 

Please reach out to Sam Scalice (sscalice@ucar.edu) with any questions you may have.