MMInA: Benchmarking Multihop Multimodal Internet Agents

Abstract

Autonomous embodied agents live on an Internet of multimedia websites. Can they hop around multimodal websites to complete complex user tasks? Existing benchmarks fail to assess them in a realistic, evolving environment that spans multiple websites. To answer this question, we present MMInA, a multihop and multimodal benchmark for evaluating embodied agents on compositional Internet tasks, with several appealing properties: 1) Evolving real-world multimodal websites. Our benchmark uniquely operates on evolving real-world websites, ensuring a high degree of realism and applicability to natural user tasks. Our data includes 1,050 human-written tasks covering various domains such as shopping and travel, with each task requiring the agent to autonomously extract multimodal information from web pages as observations; 2) Multihop web browsing. Our dataset features naturally compositional tasks that require information from, or actions on, multiple websites, so that long-range reasoning capabilities on web tasks can be assessed; 3) Holistic evaluation. We propose a novel protocol for evaluating an agent's progress in completing multihop tasks. We experiment with both standalone (multimodal) language models and heuristic-based web agents. Extensive experiments demonstrate that while long-chain multihop web tasks are easy for humans, they remain challenging for state-of-the-art web agents. We find that agents are more likely to fail on the early hops when a task has more hops, which results in lower task success rates. To address this issue, we propose a simple memory augmentation approach that replays past action trajectories so the agent can reflect on them. Our method significantly improved both the single-hop and multihop web browsing abilities of agents. See our code and data at https://www.mmina.org/
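The abstract distinguishes an agent's progress through the hops of a task from overall task success. The following is a minimal sketch of how such numbers could be computed, not the authors' evaluation protocol. The result format (one list of per-hop booleans per task) and the function names `hop_success_rates` and `task_success_rate` are assumptions made for illustration.

```python
from typing import List


def hop_success_rates(results: List[List[bool]]) -> List[float]:
    """Fraction of tasks that completed hop k, for each hop index k.

    results[i][k] is True if task i completed its k-th hop
    (hypothetical format, assumed for this sketch). Tasks with
    fewer than k+1 hops are excluded from the rate for hop k.
    """
    max_hops = max(len(task) for task in results)
    rates = []
    for k in range(max_hops):
        outcomes = [task[k] for task in results if len(task) > k]
        rates.append(sum(outcomes) / len(outcomes))
    return rates


def task_success_rate(results: List[List[bool]]) -> float:
    """Fraction of tasks in which every hop was completed."""
    return sum(all(task) for task in results) / len(results)


# Four made-up tasks: three with two hops, one with three hops.
results = [
    [True, True],
    [True, False],
    [False, False],
    [True, True, True],
]
print(hop_success_rates(results))  # [0.75, 0.5, 1.0]
print(task_success_rate(results))  # 0.5
```

Reporting per-hop rates alongside the task-level rate makes it visible where in a chain an agent tends to fail, which is the kind of pattern the abstract describes for early hops.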

URL

https://arxiv.org/abs/2404.09992

PDF

https://arxiv.org/pdf/2404.09992.pdf
