IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems

Elad Levi, Ilan Kadar

Type: Preprint

Publication Date: 2025-01-19

Citations: 0

DOI: https://doi.org/10.48550/arxiv.2501.11067

Abstract

Large Language Models (LLMs) are transforming artificial intelligence, evolving into task-oriented systems capable of autonomous planning and execution. One of the primary applications of LLMs is conversational AI systems, which must navigate multi-turn dialogues, integrate domain-specific APIs, and adhere to strict policy constraints. However, evaluating these agents remains a significant challenge, as traditional methods fail to capture the complexity and variability of real-world interactions. We introduce IntellAgent, a scalable, open-source multi-agent framework designed to evaluate conversational AI systems comprehensively. IntellAgent automates the creation of diverse, synthetic benchmarks by combining policy-driven graph modeling, realistic event generation, and interactive user-agent simulations. This innovative approach provides fine-grained diagnostics, addressing the limitations of static and manually curated benchmarks with coarse-grained metrics. IntellAgent represents a paradigm shift in evaluating conversational AI. By simulating realistic, multi-policy scenarios across varying levels of complexity, IntellAgent captures the nuanced interplay of agent capabilities and policy constraints. Unlike traditional methods, it employs a graph-based policy model to represent relationships, likelihoods, and complexities of policy interactions, enabling highly detailed diagnostics. IntellAgent also identifies critical performance gaps, offering actionable insights for targeted optimization. Its modular, open-source design supports seamless integration of new domains, policies, and APIs, fostering reproducibility and community collaboration. Our findings demonstrate that IntellAgent serves as an effective framework for advancing conversational AI by addressing challenges in bridging research and deployment. The framework is available at https://github.com/plurai-ai/intellagent
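The abstract describes a graph-based policy model that captures the relationships, likelihoods, and complexities of policy interactions. The sketch below illustrates one way such a graph could be sampled to produce multi-policy scenarios at a chosen complexity level. It is not taken from the IntellAgent code base: the policy names, complexity scores, edge weights, and the `sample_policies` function are all hypothetical, and the weighted random walk is an assumed sampling strategy chosen for illustration.

```python
import random

# Hypothetical policy graph (illustrative values, not from the paper).
# Node value: complexity score of a single policy.
COMPLEXITY = {
    "refund": 2,
    "identity_check": 1,
    "flight_change": 3,
    "baggage": 1,
}
# Edge value: assumed likelihood that two policies occur in the same conversation.
EDGE_WEIGHT = {
    ("refund", "identity_check"): 0.9,
    ("refund", "flight_change"): 0.4,
    ("flight_change", "baggage"): 0.6,
    ("identity_check", "baggage"): 0.2,
}


def neighbors(node):
    """Return {neighbor: weight} for a node, treating edges as undirected."""
    out = {}
    for (a, b), weight in EDGE_WEIGHT.items():
        if a == node:
            out[b] = weight
        elif b == node:
            out[a] = weight
    return out


def sample_policies(target_complexity, rng):
    """Weighted random walk that collects distinct policies until their
    summed complexity reaches the target or no unvisited neighbor remains."""
    current = rng.choice(sorted(COMPLEXITY))
    chosen = [current]
    total = COMPLEXITY[current]
    while total < target_complexity:
        candidates = {
            name: weight
            for name, weight in neighbors(current).items()
            if name not in chosen
        }
        if not candidates:
            break  # dead end: return what was collected so far
        names = list(candidates)
        weights = [candidates[name] for name in names]
        current = rng.choices(names, weights=weights, k=1)[0]
        chosen.append(current)
        total += COMPLEXITY[current]
    return chosen, total


if __name__ == "__main__":
    policies, total = sample_policies(target_complexity=4, rng=random.Random(0))
    print(policies, total)
```

In a sketch like this, raising `target_complexity` yields scenarios that combine more policies, which is one plausible reading of "varying levels of complexity" in the abstract. Each sampled policy set would then seed an event and a simulated user-agent dialogue.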

Locations

arXiv (Cornell University)

Similar Works

Title	Year	Authors
Towards Objectively Benchmarking Social Intelligence for Language Agents at Action Level	2024	Chenxu Wang Bin Dai Huaping Liu Baoyuan Wang
AgentBench: Evaluating LLMs as Agents	2023	Xiao Liu Hao Yu Hanchen Zhang Yifan Xu Xuanyu Lei Hanyu Lai Yu‐Cheng Gu Hangliang Ding Kaiwen Men Kejuan Yang
TDAG: A Multi-Agent Framework based on Dynamic Task Decomposition and Agent Generation	2024	Yaoxiang Wang Zhiyong Wu Junfeng Yao Jinsong Su
CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents	2024	Tianqi Xu Linyao Chen Dai-Jie Wu Yanjun Chen Zecheng Zhang Xiang Yao Zhiqiang Xie Yongchao Chen Shilong Liu Bochen Qian
Multi-Agent Large Language Models for Conversational Task-Solving	2024	J. K. Becker
AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios	2024	Xinyi Mou Jingcong Liang Jiayu Lin Xinnong Zhang Xiawei Liu Shiyue Yang Rong Ye Lei Chen Haoyu Kuang Xuanjing Huang
clembench-2024: A Challenging, Dynamic, Complementary, Multilingual Benchmark and Underlying Flexible Framework for LLMs as Multi-Action Agents	2024	Anne Beyer Kranti Chalamalasetti Sherzod Hakimov Brielen Madureira Philipp Sadler David Schlangen
Bel Esprit: Multi-Agent Framework for Building AI Model Pipelines	2024	Yunsu Kim AhmedElmogtaba Abdelaziz Thiago Castro Ferreira Mohamed Al-Badrashiny Hassan Sawaf
AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents	2024	Chang Ma Junlei Zhang Zhihao Zhu Cheng Yang Yujiu Yang Yaohui Jin Zhenzhong Lan Lingpeng Kong Junxian He
ReSpAct: Harmonizing Reasoning, Speaking, and Acting Towards Building Large Language Model-Based Conversational AI Agents	2024	Vardhan Dongre Xiaocheng Yang Emre Can Acikgoz Suvodip Dey Gökhan Tür Hakkani-Tur Dilek
A Survey on Complex Tasks for Goal-Directed Interactive Agents	2024	Mareike Hartmann Alexander Koller
DialSim: A Real-Time Simulator for Evaluating Long-Term Dialogue Understanding of Conversational Agents	2024	Jiho Kim Woosog Chay H.-M. Hwang Daeun Kyung Hyunseung Chung Eunbyeol Cho Yohan Jo Edward Kwok Yiu Choi
Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence	2024	Weize Chen Ziming You Li Ran Yitong Guan Qian Chen Chenyang Zhao Cheng Yang Ruobing Xie Zhiyuan Liu Maosong Sun
Plug-and-Play Policy Planner for Large Language Model Powered Dialogue Agents	2023	Yang Deng Wenxuan Zhang Wai Lam See-Kiong Ng Tat‐Seng Chua
AutoAgents: A Framework for Automatic Agent Generation	2024	Guangyao Chen Siwei Dong Shu Yu Ge Zhang Jaward Sesay Börje F. Karlsson Jie Fu Yemin Shi
AutoAgents: A Framework for Automatic Agent Generation	2023	Guangyao Chen Siwei Dong Yu Shu Ge Zhang Jaward Sesay Börje F. Karlsson Jie Fu Yemin Shi
Smurfs: Leveraging Multiple Proficiency Agents with Context-Efficiency for Tool Planning	2024	J. D.Z. Chen Juhao Liang Benyou Wang
LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models	2023	Marwa Abdulhai Isadora White Charlie Snell Charles Sun Joey Hong Yuexiang Zhai Kelvin Xu Sergey Levine
Beyond Prompts: Dynamic Conversational Benchmarking of Large Language Models	2024	David Castillo-Bolado Joseph K. Davidson Fran Gray M. Khairul Amri Rosa
Planning with Large Language Models for Conversational Agents	2024	Zhigen Li Jianxiang Peng Yanmeng Wang Tianhao Shen Minghui Zhang Linxi Su Shang Wu Yihang Wu Yuqian Wang Ye Wang
