MMInA: Benchmarking Multihop Multimodal Internet Agents

2024-04-15 17:59:50
Ziniu Zhang, Shulin Tian, Liangyu Chen, Ziwei Liu

Abstract

Autonomous embodied agents live on an Internet of multimedia websites. Can they hop around multimodal websites to complete complex user tasks? Existing benchmarks fail to assess them in a realistic, evolving environment for their embodiment across websites. To answer this question, we present MMInA, a multihop and multimodal benchmark to evaluate embodied agents on compositional Internet tasks, with several appealing properties: 1) Evolving real-world multimodal websites. Our benchmark uniquely operates on evolving real-world websites, ensuring a high degree of realism and applicability to natural user tasks. Our data includes 1,050 human-written tasks covering various domains such as shopping and travel, with each task requiring the agent to autonomously extract multimodal information from web pages as observations; 2) Multihop web browsing. Our dataset features naturally compositional tasks that require information from or actions on multiple websites to solve, to assess long-range reasoning capabilities on web tasks; 3) Holistic evaluation. We propose a novel protocol for evaluating an agent's progress in completing multihop tasks. We experiment with both standalone (multimodal) language models and heuristic-based web agents. Extensive experiments demonstrate that while long-chain multihop web tasks are easy for humans, they remain challenging for state-of-the-art web agents. We identify that agents are more likely to fail on the early hops when solving tasks with more hops, which results in lower task success rates. To address this issue, we propose a simple memory augmentation approach that replays past action trajectories so the agent can reflect on them. Our method significantly improved both the single-hop and multihop web browsing abilities of agents. See our code and data at https://www.mmina.org/

URL

https://arxiv.org/abs/2404.09992

PDF

https://arxiv.org/pdf/2404.09992.pdf

