Abstract
Large Language Models (LLMs) have demonstrated impressive capabilities as intelligent agents for solving complex problems. However, effective planning in scenarios involving dependencies between API or tool calls, particularly in multi-turn conversations, remains a significant challenge. To address this, we introduce T1, a tool-augmented, multi-domain, multi-turn conversational dataset specifically designed to capture and manage inter-tool dependencies across diverse domains. T1 enables rigorous evaluation of agents' ability to coordinate tool use across nine distinct domains (four single-domain and five multi-domain) with the help of an integrated caching mechanism for both short- and long-term memory, while supporting dynamic replanning, such as deciding whether to recompute or reuse cached results. Beyond facilitating research on tool use and planning, T1 also serves as a benchmark for evaluating the performance of open-source language models. We present results powered by T1-Agent, highlighting the ability of these models to plan and reason in complex, tool-dependent scenarios.
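The reuse-or-recompute decision mentioned in the abstract can be illustrated with a minimal sketch. This is not the T1 or T1-Agent implementation, which the abstract does not describe. The names `ToolCache` and `search_flights` and the flight data are hypothetical, and the sketch assumes a tool result can be reused whenever the same tool is called again with identical arguments.

```python
from typing import Any, Callable, Dict, Tuple

# A cache key is the tool name plus its sorted (argument, value) pairs.
CacheKey = Tuple[str, Tuple[Tuple[str, Any], ...]]


class ToolCache:
    """Hypothetical cache of tool results (an assumption, not T1's design)."""

    def __init__(self) -> None:
        self._store: Dict[CacheKey, Any] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(tool: str, args: Dict[str, Any]) -> CacheKey:
        # Sort arguments so that keyword order does not change the key.
        return tool, tuple(sorted(args.items()))

    def call(self, tool: str, fn: Callable[..., Any], **args: Any) -> Any:
        key = self._key(tool, args)
        if key in self._store:
            # Same tool and same arguments as an earlier turn: reuse.
            self.hits += 1
            return self._store[key]
        # New tool or new arguments: recompute and remember the result.
        self.misses += 1
        result = fn(**args)
        self._store[key] = result
        return result


def search_flights(origin: str, destination: str) -> list:
    # Stand-in for a real tool call; the returned record is made up.
    return [{"flight": "XX100", "origin": origin, "destination": destination}]


cache = ToolCache()
# Turn 1: the result is computed and cached.
first = cache.call("search_flights", search_flights, origin="SFO", destination="JFK")
# Turn 2: the same request is served from the cache.
second = cache.call("search_flights", search_flights, origin="SFO", destination="JFK")
# Turn 3: the destination changed, so the tool is called again.
third = cache.call("search_flights", search_flights, origin="SFO", destination="BOS")
```

Keying on the tool name and its arguments is only one possible policy. A real agent might also invalidate entries when the user revises an earlier constraint, or keep separate short- and long-term stores.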
URL
https://arxiv.org/abs/2505.16986