Egocentric interactive world models are essential for augmented reality and embodied AI, where visual generation must respond to user input with low latency, geometric consistency, and long-term stability. We study egocentric interaction generation from a single scene image under free-space hand gestures, aiming to synthesize photorealistic videos in which hands enter the scene, interact with objects, and induce plausible world dynamics under head motion. This setting introduces fundamental challenges, including distribution shift between free-space gestures and contact-heavy training data, ambiguity between hand motion and camera motion in monocular views, and the need for arbitrary-length video generation. We present Hand2World, a unified autoregressive framework that addresses these challenges through occlusion-invariant hand conditioning based on projected 3D hand meshes, allowing visibility and occlusion to be inferred from scene context rather than encoded in the control signal. To stabilize egocentric viewpoint changes, we inject explicit camera geometry via per-pixel Plücker-ray embeddings, disentangling camera motion from hand motion and preventing background drift. We further develop a fully automated monocular annotation pipeline and distill a bidirectional diffusion model into a causal generator, enabling arbitrary-length synthesis. Experiments on three egocentric interaction benchmarks show substantial improvements in perceptual quality and 3D consistency while supporting camera control and long-horizon interactive generation.
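The per-pixel Plücker-ray conditioning mentioned above can be pictured with a short sketch: each pixel's camera ray is encoded by its direction and moment, giving a 6-D embedding that exposes camera geometry to the generator. This is a generic construction under an assumed pinhole model and pose convention, not code from the paper.

```python
import numpy as np

def plucker_ray_embedding(K, R, t, height, width):
    """Per-pixel 6-D Plucker embedding (ray direction d, moment m = o x d).

    Assumes a pinhole camera with intrinsics K and world-to-camera pose [R | t];
    the conventions here are illustrative, not the paper's exact choices.
    """
    # Camera center in world coordinates: o = -R^T t
    o = -R.T @ t                                        # (3,)

    # Pixel grid in homogeneous image coordinates
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # (3, H*W)

    # Back-project to world-space ray directions and normalize
    d = R.T @ (np.linalg.inv(K) @ pix)                  # (3, H*W)
    d = d / np.linalg.norm(d, axis=0, keepdims=True)

    # Moment vector encodes where in space the ray passes
    m = np.cross(o[:, None], d, axis=0)                 # (3, H*W)

    return np.concatenate([d, m], axis=0).T.reshape(height, width, 6)

# Example: identity pose with simple intrinsics
K = np.array([[500.0, 0, 128], [0, 500.0, 128], [0, 0, 1]])
emb = plucker_ray_embedding(K, np.eye(3), np.zeros(3), 256, 256)
print(emb.shape)  # (256, 256, 6)
```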
https://arxiv.org/abs/2602.09600
A quintessential feature of human intelligence is the ability to create ad hoc conventions over time to achieve shared goals efficiently. We investigate how communication strategies evolve through repeated collaboration as people coordinate on shared procedural abstractions. To this end, we conducted an online unimodal study (n = 98) using natural language to probe abstraction hierarchies. In a follow-up lab study (n = 40), we examined how multimodal communication (speech and gestures) changed during physical collaboration. Pairs used augmented reality to isolate their partner's hand and voice; one participant viewed a 3D virtual tower and sent instructions to the other, who built the physical tower. Participants became faster and more accurate by establishing linguistic and gestural abstractions and using cross-modal redundancy to emphasize key changes from previous interactions. Based on these findings, we extend probabilistic models of convention formation to multimodal settings, capturing shifts in modality preferences. Our findings and model provide building blocks for designing convention-aware intelligent agents situated in the physical world.
https://arxiv.org/abs/2602.08914
Gestures are a key component of non-verbal communication in traffic, often helping pedestrian-to-driver interactions when formal traffic rules may be insufficient. This problem becomes more apparent when autonomous vehicles (AVs) struggle to interpret such gestures. In this study, we present a gesture classification framework using 2D pose estimation applied to real-world video sequences from the WIVW dataset. We categorise gestures into four primary classes (Stop, Go, Thank & Greet, and No Gesture) and extract 76 static and dynamic features from normalised keypoints. Our analysis demonstrates that hand position and movement velocity are especially discriminative in distinguishing between gesture classes, achieving a classification accuracy of 87%. These findings not only improve the perceptual capabilities of AV systems but also contribute to the broader understanding of pedestrian behaviour in traffic contexts.
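To make the feature extraction concrete, the sketch below computes a handful of static (wrist height relative to the shoulders) and dynamic (wrist speed) descriptors from normalized 2D keypoints. The keypoint indices and the specific features are illustrative assumptions, not the paper's 76-feature set or the WIVW annotation scheme.

```python
import numpy as np

def hand_gesture_features(keypoints):
    """Toy static/dynamic features from normalized 2D pose keypoints.

    keypoints: (T, K, 2) array of per-frame keypoints, already normalized to
    the torso. The indices below (wrists, shoulders) follow a COCO-style
    layout and are assumptions for illustration only.
    """
    R_WRIST, L_WRIST, R_SHOULDER, L_SHOULDER = 4, 7, 2, 5

    # Static: wrist height relative to the shoulders, averaged over the clip
    wrist_rel = keypoints[:, [R_WRIST, L_WRIST], 1] - keypoints[:, [R_SHOULDER, L_SHOULDER], 1]
    static = wrist_rel.mean(axis=0)                     # (2,)

    # Dynamic: mean and peak wrist speed (frame-to-frame displacement)
    vel = np.diff(keypoints[:, [R_WRIST, L_WRIST], :], axis=0)
    speed = np.linalg.norm(vel, axis=-1)                # (T-1, 2)
    dynamic = np.concatenate([speed.mean(axis=0), speed.max(axis=0)])

    return np.concatenate([static, dynamic])            # 6-D toy feature vector

# A random 30-frame, 18-keypoint sequence as a stand-in for real pose tracks
feats = hand_gesture_features(np.random.rand(30, 18, 2))
print(feats.shape)  # (6,)
```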
https://arxiv.org/abs/2602.08479
The looking-in-looking-out (LILO) framework has enabled intelligent vehicle applications that understand both the outside scene and the driver state to improve safety outcomes, with examples in smart airbag deployment, takeover time prediction in autonomous control transitions, and driver attention monitoring. In this research, we propose an augmentation to this framework, making a case for the audio modality as an additional source of information to understand the driver, and in the evolving autonomy landscape, also the passengers and those outside the vehicle. We expand LILO by incorporating audio signals, forming the looking-and-listening inside-and-outside (L-LIO) framework to enhance driver state assessment and environment understanding through multimodal sensor fusion. We evaluate three example cases where audio enhances vehicle safety: supervised learning on driver speech audio to classify potential impairment states (e.g., intoxication), collection and analysis of passenger natural language instructions (e.g., "turn after that red building") to motivate how spoken language can interface with planning systems through audio-aligned instruction data, and limitations of vision-only systems where audio may disambiguate the guidance and gestures of external agents. Datasets include custom-collected in-vehicle and external audio samples in real-world environments. Pilot findings show that audio yields safety-relevant insights, particularly in nuanced or context-rich scenarios where sound is critical to safe decision-making or visual signals alone are insufficient. Challenges include ambient noise interference, privacy considerations, and robustness across human subjects, motivating further work on reliability in dynamic real-world contexts. L-LIO augments driver and scene understanding through multimodal fusion of audio and visual sensing, offering new paths for safety intervention.
"观察与倾听框架(LILO)使得智能车辆应用能够在了解外部场景和驾驶员状态的基础上提高安全性能,例如智能气囊部署、自动驾驶控制转换时的接管时间预测以及驾驶员注意力监控。在此研究中,我们提议对此框架进行增强,并强调音频模态作为理解驾驶员及其他人员的重要信息来源,在不断发展的自主驾驶环境中尤其如此。我们将LILO扩展为结合声音信号的形式,形成“倾听与观察内部和外部”(L-LIO)框架,通过多模态传感器融合提高对驾驶员状态的评估及环境的理解能力。 我们在三个案例中评估了音频如何增强车辆安全性:使用监督学习分析司机语音音频来分类潜在的身体状况(如酒后驾驶),收集并分析乘客自然语言指令(例如,“在那个红房子之后转弯”)以展示口语如何通过与视觉信号对齐的指令数据与规划系统交互,以及仅依赖视觉系统的局限性,在这种情况下声音可以消除外部代理行为指导和手势中的歧义。这些案例的数据集包括从真实世界环境中定制收集的车内及车外音频样本。 初步研究结果表明,音频提供了重要的安全信息,尤其是在细微或背景丰富的场景中,其中声音对于做出安全决策至关重要或是视觉信号本身是不够的时候。然而,挑战还包括环境噪音干扰、隐私考虑以及在不同人类受试者间实现稳健性的难题,这需要我们在动态的真实世界环境中进一步增强系统的可靠性。 L-LIO框架通过多模态融合的听觉和视觉感应来提高驾驶员状态和场景的理解能力,并为安全干预提供了新的路径。"
https://arxiv.org/abs/2602.07668
Effective human-robot interaction requires emotionally rich multimodal expressions, yet most humanoid robots lack coordinated speech, facial expressions, and gestures. Meanwhile, real-world deployment demands on-device solutions that can operate autonomously without continuous cloud connectivity. To bridge \underline{\textit{S}}peech, \underline{\textit{E}}motion, and \underline{\textit{M}}otion, we present \textit{SeM$^2$}, a Vision Language Model-based framework that orchestrates emotionally coherent multimodal interactions through three key components: a multimodal perception module capturing user contextual cues, a Chain-of-Thought reasoning stage for response planning, and a novel Semantic-Sequence Aligning Mechanism (SSAM) that ensures precise temporal coordination between verbal content and physical expressions. We implement both cloud-based and \underline{\textit{e}}dge-deployed versions (\textit{SeM$^2_e$}), with the latter knowledge-distilled to operate efficiently on edge hardware while maintaining 95\% of the relative performance. Comprehensive evaluations demonstrate that our approach significantly outperforms unimodal baselines in naturalness, emotional clarity, and modal coherence, advancing socially expressive humanoid robotics for diverse real-world environments.
https://arxiv.org/abs/2602.07434
Human Activity Recognition (HAR) on resource-constrained wearables requires models that balance accuracy against strict memory and computational budgets. State-of-the-art lightweight architectures such as TinierHAR (34K parameters) and TinyHAR (55K parameters) achieve strong accuracy, but exceed the memory budgets of microcontrollers with limited SRAM once operating-system overhead is considered. We present MicroBi-ConvLSTM, an ultra-lightweight convolutional-recurrent architecture achieving 11.4K parameters on average through two-stage convolutional feature extraction with 4x temporal pooling and a single bidirectional LSTM layer. This represents a 2.9x parameter reduction versus TinierHAR and 11.9x versus DeepConvLSTM while preserving linear O(N) complexity. Evaluation across eight diverse HAR benchmarks shows that MicroBi-ConvLSTM maintains competitive performance within the ultra-lightweight regime: 93.41% macro F1 on UCI-HAR, 94.46% on SKODA assembly gestures, and 88.98% on Daphnet gait-freeze detection. Systematic ablation reveals task-dependent component contributions, where bidirectionality benefits episodic event detection but provides marginal gains on periodic locomotion. INT8 post-training quantization incurs only 0.21% average F1-score degradation, yielding a 23.0 KB average deployment footprint suitable for memory-constrained edge devices.
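The overall topology described in the abstract (two conv stages with 4x temporal pooling feeding a single bidirectional LSTM) can be sketched as below. Channel widths, kernel sizes, and the classifier head are illustrative guesses, so the parameter count will not match the paper's 11.4K figure.

```python
import torch
import torch.nn as nn

class TinyConvBiLSTM(nn.Module):
    """Sketch of a two-stage conv front end with 4x temporal pooling feeding a
    single bidirectional LSTM, in the spirit of MicroBi-ConvLSTM. Layer sizes
    are illustrative assumptions, not the published configuration."""

    def __init__(self, n_channels: int, n_classes: int, hidden: int = 32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_channels, 16, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),                        # first 2x temporal pooling
            nn.Conv1d(16, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),                        # total 4x temporal pooling
        )
        self.lstm = nn.LSTM(32, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                           # x: (batch, channels, time)
        h = self.features(x).transpose(1, 2)        # (batch, time/4, 32)
        out, _ = self.lstm(h)
        return self.head(out[:, -1])                # classify from the last step

model = TinyConvBiLSTM(n_channels=9, n_classes=6)
logits = model(torch.randn(8, 9, 128))              # 8 windows of 128 samples
print(logits.shape)  # torch.Size([8, 6])
```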
https://arxiv.org/abs/2602.06523
We introduce PuppetAI, a modular soft robot interaction platform. This platform offers a scalable cable-driven actuation system and a customizable, puppet-inspired robot gesture framework, supporting a wide variety of gesture-based interactive robot designs. The platform comprises a four-layer decoupled software architecture that includes perceptual processing, affective modeling, motion scheduling, and low-level actuation. We also implemented an affective expression loop that connects human input to the robot platform by producing real-time emotional gestural responses to human vocal input. For our own designs, we have worked with nuanced gestures enacted by "soft robots" with enhanced dexterity and "pleasant-to-touch" plush exteriors. By reducing operational complexity and production costs while enhancing customizability, our work creates an adaptable and accessible foundation for future tactile-based expressive robot research. Our goal is to provide a platform that allows researchers to independently construct or refine highly specific gestures and movements performed by social robots.
https://arxiv.org/abs/2602.04787
Spiking neural networks (SNNs) compute with discrete spikes and exploit temporal structure, yet most adversarial attacks change intensities or event counts instead of timing. We study a timing-only adversary that retimes existing spikes while preserving spike counts and amplitudes in event-driven SNNs, thus remaining rate-preserving. We formalize a capacity-1 spike-retiming threat model with a unified trio of budgets: per-spike jitter $\mathcal{B}_{\infty}$, total delay $\mathcal{B}_{1}$, and tamper count $\mathcal{B}_{0}$. Feasible adversarial examples must satisfy timeline consistency and non-overlap, which makes the search space discrete and constrained. To optimize such retimings at scale, we use projected-in-the-loop (PIL) optimization: shift-probability logits yield a differentiable soft retiming for backpropagation, and a strict projection in the forward pass produces a feasible discrete schedule that satisfies capacity-1, non-overlap, and the chosen budget at every step. The objective maximizes task loss on the projected input and adds a capacity regularizer together with budget-aware penalties, which stabilizes gradients and aligns optimization with evaluation. Across event-driven benchmarks (CIFAR10-DVS, DVS-Gesture, N-MNIST) and diverse SNN architectures, we evaluate under binary and integer event grids and a range of retiming budgets, and also test models trained with timing-aware adversarial training designed to counter timing-only attacks. For example, on DVS-Gesture the attack attains high success (over $90\%$) while touching fewer than $2\%$ of spikes under $\mathcal{B}_{0}$. Taken together, our results show that spike retiming is a practical and stealthy attack surface that current defenses struggle to counter, providing a clear reference for temporal robustness in event-driven SNNs. Code is available at this https URL.
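The three budgets and the feasibility constraints of the capacity-1 retiming threat model can be summarized with a small check over an original and a retimed spike train. The grid handling and units below are assumptions made for illustration; the paper's projection step enforces these constraints during optimization rather than only verifying them afterwards.

```python
import numpy as np

def retiming_is_feasible(t_orig, t_adv, b_inf, b_1, b_0):
    """Check a retimed spike train against the three budgets from the abstract.

    t_orig, t_adv: sorted integer spike times on the event grid (same length,
    so spike count and amplitude are preserved and the attack stays
    rate-preserving). b_inf: max per-spike jitter, b_1: total absolute delay,
    b_0: max number of spikes whose timing may change.
    """
    t_orig, t_adv = np.asarray(t_orig), np.asarray(t_adv)
    shift = np.abs(t_adv - t_orig)

    per_spike_ok = shift.max() <= b_inf                   # B_inf budget
    total_ok = shift.sum() <= b_1                         # B_1 budget
    tamper_ok = np.count_nonzero(shift) <= b_0            # B_0 budget

    # Capacity-1 / non-overlap: at most one spike per time bin, order preserved
    non_overlap = len(np.unique(t_adv)) == len(t_adv)
    monotone = np.all(np.diff(t_adv) > 0)

    return per_spike_ok and total_ok and tamper_ok and non_overlap and monotone

print(retiming_is_feasible([2, 5, 9], [3, 5, 8], b_inf=1, b_1=2, b_0=2))  # True
```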
https://arxiv.org/abs/2602.03284
We introduce Speech-to-Spatial, a referent disambiguation framework that converts verbal remote-assistance instructions into spatially grounded AR guidance. Unlike prior systems that rely on additional cues (e.g., gesture, gaze) or manual expert annotations, Speech-to-Spatial infers the intended target solely from spoken references (speech input). Motivated by our formative study of speech referencing patterns, we characterize recurring ways people specify targets (Direct Attribute, Relational, Remembrance, and Chained) and ground them to our object-centric relational graph. Given an utterance, referent cues are parsed and rendered as persistent in-situ AR visual guidance, reducing iterative micro-guidance ("a bit more to the right", "now, stop.") during remote guidance. We demonstrate the use cases of our system with remote guided assistance and intent disambiguation scenarios. Our evaluation shows that Speech-to-Spatial improves task efficiency, reduces cognitive load, and enhances usability compared to a conventional voice-only baseline, transforming disembodied verbal instruction into visually explainable, actionable guidance on a live shared view.
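A toy example of grounding a "Relational" reference against an object-centric scene graph is sketched below. The object schema, predicate, and parsing step are hypothetical; in the real system, the parsed referent cues would drive a persistent AR highlight rather than a printed result.

```python
# Toy resolution of a "Relational" spoken reference ("the cup left of the
# laptop") against an object-centric scene representation. Object fields,
# predicate names, and coordinates are hypothetical stand-ins.
objects = {
    "cup_1":    {"type": "cup",    "x": 0.2, "y": 0.5},
    "cup_2":    {"type": "cup",    "x": 0.9, "y": 0.5},
    "laptop_1": {"type": "laptop", "x": 0.6, "y": 0.5},
}

def left_of(a, b):
    return objects[a]["x"] < objects[b]["x"]

def resolve_relational(target_type, relation, anchor_type):
    """Return candidate referents of target_type that satisfy the relation
    with respect to at least one anchor object."""
    anchors = [o for o, v in objects.items() if v["type"] == anchor_type]
    candidates = [o for o, v in objects.items() if v["type"] == target_type]
    return [c for c in candidates if any(relation(c, a) for a in anchors)]

# "the cup left of the laptop" -> a unique referent to highlight in AR
print(resolve_relational("cup", left_of, "laptop"))  # ['cup_1']
```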
https://arxiv.org/abs/2602.03059
Mobile manipulators in the home can enable people with cervical spinal cord injury (cSCI) to perform daily physical household tasks that they could not otherwise do themselves. However, paralysis in these users often limits access to traditional robot control interfaces such as joysticks or keyboards. In this work, we introduce and deploy the first system that enables a user with quadriplegia to control a mobile manipulator in their own home using bimanual high-density electromyography (HDEMG). We develop a pair of custom, fabric-integrated HDEMG forearm sleeves, worn on both arms, that capture residual neuromotor activity from clinically paralyzed degrees of freedom and support real-time gesture-based robot control. Second, by integrating vision, language, and motion planning modules, we introduce a shared autonomy framework that supports robust and user-driven teleoperation, with particular benefits for navigation-intensive tasks in home environments. Finally, to demonstrate the system in the wild, we present a twelve-day in-home user study evaluating real-time use of the wearable EMG interface for daily robot control. Together, these system components enable effective robot control for performing activities of daily living and other household tasks in a real home environment.
https://arxiv.org/abs/2602.02773
Accurate and responsive myoelectric prosthesis control typically relies on complex, dense multi-sensor arrays, which limits consumer accessibility. This paper presents a novel, data-efficient deep learning framework designed to achieve precise and accurate control using minimal sensor hardware. Leveraging an external dataset of 8 subjects, our approach implements a hybrid Transformer optimized for sparse, two-channel surface electromyography (sEMG). Unlike standard architectures that use fixed positional encodings, we integrate Time2Vec learnable temporal embeddings to capture the stochastic temporal warping inherent in biological signals. Furthermore, we employ a normalized additive fusion strategy that aligns the latent distributions of spatial and temporal features, preventing the destructive interference common in standard implementations. A two-stage curriculum learning protocol is utilized to ensure robust feature extraction despite data scarcity. The proposed architecture achieves a state-of-the-art multi-subject F1-score of 95.7% $\pm$ 0.20% for a 10-class movement set, statistically outperforming both a standard Transformer with fixed encodings and a recurrent CNN-LSTM model. Architectural optimization reveals that a balanced allocation of model capacity between spatial and temporal dimensions yields the highest stability. Furthermore, while direct transfer to a new unseen subject led to poor accuracy due to domain shifts, a rapid calibration protocol utilizing only two trials per gesture recovered performance from 21.0% $\pm$ 2.98% to 96.9% $\pm$ 0.52%. By validating that high-fidelity temporal embeddings can compensate for low spatial resolution, this work challenges the necessity of high-density sensing. The proposed framework offers a robust, cost-effective blueprint for next-generation prosthetic interfaces capable of rapid personalization.
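Time2Vec itself is a published embedding (a learnable linear component plus periodic sine components). The sketch below is a generic PyTorch rendering of that formulation; how it is wired into the paper's hybrid Transformer for two-channel sEMG is not shown and is not implied by this code.

```python
import torch
import torch.nn as nn

class Time2Vec(nn.Module):
    """Learnable temporal embedding: one linear component plus k periodic
    (sine) components, following the standard Time2Vec formulation."""

    def __init__(self, k: int):
        super().__init__()
        self.w0 = nn.Parameter(torch.randn(1))
        self.b0 = nn.Parameter(torch.randn(1))
        self.w = nn.Parameter(torch.randn(k))
        self.b = nn.Parameter(torch.randn(k))

    def forward(self, t):                               # t: (batch, seq) timestamps
        linear = self.w0 * t + self.b0                                # (batch, seq)
        periodic = torch.sin(t.unsqueeze(-1) * self.w + self.b)      # (batch, seq, k)
        return torch.cat([linear.unsqueeze(-1), periodic], dim=-1)   # (batch, seq, k+1)

t = torch.arange(200.0).unsqueeze(0) / 200.0            # normalized sample times
emb = Time2Vec(k=15)(t)
print(emb.shape)  # torch.Size([1, 200, 16])
```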
https://arxiv.org/abs/2602.01855
We explore the use of large language models (LLMs) for next-utterance prediction in human dialogue. Despite recent advances in LLMs demonstrating their ability to engage in natural conversations with users, we show that even leading models surprisingly struggle to predict a human speaker's next utterance. Instead, humans can readily anticipate forthcoming utterances based on multimodal cues, such as gestures, gaze, and emotional tone, from the context. To systematically examine whether LLMs can reproduce this ability, we propose SayNext-Bench, a benchmark that evaluates LLMs and Multimodal LLMs (MLLMs) on anticipating context-conditioned responses from multimodal cues spanning a variety of real-world scenarios. To support this benchmark, we build SayNext-PC, a novel large-scale dataset containing dialogues with rich multimodal cues. Building on this, we further develop a dual-route prediction MLLM, SayNext-Chat, that incorporates cognitively inspired design to emulate predictive processing in conversation. Experimental results demonstrate that our model outperforms state-of-the-art MLLMs in terms of lexical overlap, semantic similarity, and emotion consistency. Our results prove the feasibility of next-utterance prediction with LLMs from multimodal cues and emphasize the (i) indispensable role of multimodal cues and (ii) actively predictive processing as the foundation of natural human interaction, which is missing in current MLLMs. We hope that this exploration offers a new research entry toward more human-like, context-sensitive AI interaction for human-centered AI. Our benchmark and model can be accessed at this https URL.
https://arxiv.org/abs/2602.00327
Reliable control of myoelectric prostheses is often hindered by high inter-subject variability and the clinical impracticality of high-density sensor arrays. This study proposes a deep learning framework for accurate gesture recognition using only two surface electromyography (sEMG) channels. The method employs a Convolutional Sparse Autoencoder (CSAE) to extract temporal feature representations directly from raw signals, eliminating the need for heuristic feature engineering. On a 6-class gesture set, our model achieved a multi-subject F1-score of 94.3% $\pm$ 0.3%. To address subject-specific differences, we present a few-shot transfer learning protocol that improved performance on unseen subjects from a baseline of 35.1% $\pm$ 3.1% to 92.3% $\pm$ 0.9% with minimal calibration data. Furthermore, the system supports functional extensibility through an incremental learning strategy, allowing for expansion to a 10-class set with a 90.0% $\pm$ 0.2% F1-score without full model retraining. By combining high precision with minimal computational and sensor overhead, this framework provides a scalable and efficient approach for the next generation of affordable and adaptive prosthetic systems.
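The few-shot transfer protocol can be pictured as keeping the learned encoder fixed and refitting only a small classification head on a handful of labeled trials from the new subject. Which layers are actually adapted, and how the CSAE features are used, are assumptions in this sketch.

```python
import torch
import torch.nn as nn

def calibrate_to_new_subject(encoder, head, calib_x, calib_y, steps=50, lr=1e-3):
    """Few-shot calibration sketch: freeze the feature encoder and fine-tune
    only the head on a few labeled windows from the unseen subject."""
    for p in encoder.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(head(encoder(calib_x)), calib_y)
        loss.backward()
        opt.step()
    return head

# Stand-in modules and a handful of calibration windows (2 trials x 6 gestures)
encoder = nn.Sequential(nn.Conv1d(2, 16, 5, padding=2), nn.ReLU(),
                        nn.AdaptiveAvgPool1d(1), nn.Flatten())
head = nn.Linear(16, 6)
x, y = torch.randn(12, 2, 200), torch.randint(0, 6, (12,))
calibrate_to_new_subject(encoder, head, x, y)
```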
https://arxiv.org/abs/2601.23011
Plants offer a paradoxical model for interaction: they are ambient, low-demand presences that nonetheless shape atmosphere, routines, and relationships through temporal rhythms and subtle expressions. In contrast, most human-robot interaction (HRI) has been grounded in anthropomorphic and zoomorphic paradigms, producing overt, high-demand forms of engagement. Using a Research through Design (RtD) methodology, we explore plants as metaphoric inspiration for HRI; we conducted iterative cycles of ideation, prototyping, and reflection to investigate what design primitives emerge from plant metaphors and morphologies, and how these primitives can be combined into expressive robotic forms. We present a suite of speculative, open-source prototypes that help probe plant-inspired presence, temporality, form, and gestures. We deepened our learnings from design and prototyping through prototype-centered workshops that explored people's perceptions and imaginaries of plant-inspired robots. This work contributes: (1) Set of plant-inspired robotic artifacts; (2) Designerly insights on how people perceive plant-inspired robots; and (3) Design consideration to inform how to use plant metaphors to reshape HRI.
https://arxiv.org/abs/2601.22387
Tendon-driven anthropomorphic robotic hands often lack direct joint angle sensing, as the integration of joint encoders can compromise mechanical compactness and dexterity. This paper presents a computational method for estimating joint positions from measured tendon displacements and tensions. An efficient kinematic modeling framework for anthropomorphic hands is first introduced based on the Denavit-Hartenberg convention. Using a simplified tendon model, a system of nonlinear equations relating tendon states to joint positions is derived and solved via a nonlinear optimization approach. The estimated joint angles are then employed for closed-loop control through a Jacobian-based proportional-integral (PI) controller augmented with a feedforward term, enabling gesture tracking without direct joint sensing. The effectiveness and limitations of the proposed estimation and control framework are demonstrated in the MuJoCo simulation environment using the Anatomically Correct Biomechatronic Hand, featuring five degrees of freedom for each long finger and six degrees of freedom for the thumb.
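A minimal version of the estimation step can be written as a nonlinear least-squares fit of joint angles to measured tendon excursions. The moment-arm matrix and the linear excursion model below are simplifying assumptions for illustration; the paper's framework uses DH-based kinematics and also incorporates tendon tensions.

```python
import numpy as np
from scipy.optimize import least_squares

# Simplified tendon model: each tendon's excursion is a function of the joint
# angles through a constant moment-arm matrix, delta = R_moment @ q.
R_moment = np.array([[8.0, 5.0, 3.0],     # mm of excursion per rad, tendon 1
                     [0.0, 6.0, 4.0],     # tendon 2
                     [7.0, 0.0, 2.0],     # tendon 3
                     [2.0, 3.0, 6.0]])    # tendon 4

def residuals(q, measured_excursion):
    return R_moment @ q - measured_excursion

q_true = np.array([0.3, 0.6, 0.2])                        # "unknown" joint angles
measured = R_moment @ q_true + np.random.normal(0, 0.05, 4)

sol = least_squares(residuals, x0=np.zeros(3), args=(measured,),
                    bounds=(0.0, np.pi / 2))              # joint-limit bounds
print(sol.x)  # estimated joint angles, close to q_true
```

The estimated angles would then feed the Jacobian-based PI controller for gesture tracking, which is not shown here.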
https://arxiv.org/abs/2601.20682
Lipreading, the technology of decoding spoken content from silent videos of lip movements, holds significant application value in fields such as public security. However, due to the subtle nature of articulatory gestures, existing lipreading methods often suffer from limited feature discriminability and poor generalization capabilities. To address these challenges, this paper delves into the purification of visual features along temporal, spatial, and channel dimensions. We propose a novel method named Multi-Attention Lipreading Network (MA-LipNet). The core of MA-LipNet lies in its sequential application of three dedicated attention modules. Firstly, a \textit{Channel Attention (CA)} module is employed to adaptively recalibrate channel-wise features, thereby mitigating interference from less informative channels. Subsequently, two spatio-temporal attention modules with distinct granularities, \textit{Joint Spatial-Temporal Attention (JSTA)} and \textit{Separate Spatial-Temporal Attention (SSTA)}, are leveraged to suppress the influence of irrelevant pixels and video frames. The JSTA module performs a coarse-grained filtering by computing a unified weight map across the spatio-temporal dimensions, while the SSTA module conducts a more fine-grained refinement by separately modeling temporal and spatial attentions. Extensive experiments conducted on the CMLR and GRID datasets demonstrate that MA-LipNet significantly reduces the Character Error Rate (CER) and Word Error Rate (WER), validating its effectiveness and superiority over several state-of-the-art methods. Our work highlights the importance of multi-dimensional feature refinement for robust visual speech recognition.
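As a rough illustration of the channel recalibration idea, the sketch below implements a squeeze-and-excitation-style channel attention over a video feature map. It is a generic stand-in for the CA module named in the abstract; the paper's exact formulation (and the JSTA/SSTA modules) may differ.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style channel recalibration over a video feature
    map of shape (batch, channels, time, height, width)."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # Squeeze: global average over time and space -> per-channel descriptor
        w = x.mean(dim=(2, 3, 4))                  # (batch, channels)
        # Excite: per-channel gates downweight less informative channels
        w = self.fc(w).view(*w.shape, 1, 1, 1)
        return x * w

feat = torch.randn(2, 64, 16, 22, 22)              # (batch, C, T, H, W) lip features
print(ChannelAttention(64)(feat).shape)            # torch.Size([2, 64, 16, 22, 22])
```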
https://arxiv.org/abs/2601.20881
Modern mobile applications rely on hidden interactions--gestures without visual cues like long presses and swipes--to provide functionality without cluttering interfaces. While experienced users may discover these interactions through prior use or onboarding tutorials, their implicit nature makes them difficult for most users to uncover. Similarly, mobile agents--systems designed to automate tasks on mobile user interfaces, powered by vision language models (VLMs)--struggle to detect veiled interactions or determine actions for completing tasks. To address this challenge, we present GhostUI, a new dataset designed to enable the detection of hidden interactions in mobile applications. GhostUI provides before-and-after screenshots, simplified view hierarchies, gesture metadata, and task descriptions, allowing VLMs to better recognize concealed gestures and anticipate post-interaction states. Quantitative evaluations with VLMs show that models fine-tuned on GhostUI outperform baseline VLMs, particularly in predicting hidden interactions and inferring post-interaction screens, underscoring GhostUI's potential as a foundation for advancing mobile task automation.
https://arxiv.org/abs/2601.19258
Generating holistic co-speech gestures that integrate full-body motion with facial expressions suffers from semantically incoherent body-motion coordination and spatially unstable, meaningless movements under existing part-decomposed or frame-level regression methods. We introduce 3DGesPolicy, a novel action-based framework that reformulates holistic gesture generation as a continuous trajectory control problem through diffusion policy from robotics. By modeling frame-to-frame variations as unified holistic actions, our method effectively learns inter-frame holistic gesture motion patterns and ensures both spatially and semantically coherent movement trajectories that adhere to realistic motion manifolds. To further bridge the gap in expressive alignment, we propose a Gesture-Audio-Phoneme (GAP) fusion module that deeply integrates and refines multi-modal signals, ensuring structured and fine-grained alignment between speech semantics, body motion, and facial expressions. Extensive quantitative and qualitative experiments on the BEAT2 dataset demonstrate the effectiveness of 3DGesPolicy over other state-of-the-art methods in generating natural, expressive, and highly speech-aligned holistic gestures.
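The action-based reformulation amounts to treating frame-to-frame variations of the holistic pose as the quantities a policy predicts, then integrating them back into an absolute trajectory. The sketch below shows only that delta/rollout bookkeeping; the pose parameterization and any normalization are assumptions, and the diffusion policy itself is not shown.

```python
import numpy as np

def poses_to_actions(poses):
    """Frame-to-frame variations of a holistic pose sequence, treated as the
    'actions' a policy would predict. poses: (T, D) flattened body+face params."""
    return np.diff(poses, axis=0)                  # (T-1, D) deltas

def rollout(start_pose, actions):
    """Integrate predicted deltas back into an absolute motion trajectory."""
    return start_pose + np.cumsum(np.vstack([np.zeros_like(start_pose), actions]), axis=0)

poses = np.cumsum(np.random.randn(60, 165) * 0.01, axis=0)   # smooth fake motion
actions = poses_to_actions(poses)
recon = rollout(poses[0], actions)
print(np.allclose(recon, poses))  # True: deltas losslessly encode the trajectory
```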
https://arxiv.org/abs/2601.18451
With the widespread adoption of Graphical User Interface (GUI) agents for automating GUI interaction tasks, substantial research has focused on improving GUI perception to ground task instructions into concrete action steps. However, the step execution capability of these agents has gradually emerged as a new bottleneck for task completion. In particular, existing GUI agents often adopt overly simplified strategies for handling swipe interactions, preventing them from accurately replicating human-like behavior. To address this limitation, we decompose human swipe gestures into multiple quantifiable dimensions and propose an automated pipeline, SwipeGen, to synthesize human-like swipe interactions through GUI exploration. Based on this pipeline, we construct and release the first benchmark for evaluating the swipe execution capability of GUI agents. Furthermore, leveraging the synthesized data, we propose GUISwiper, a GUI agent with enhanced interaction execution capabilities. Experimental results demonstrate that GUISwiper achieves a swipe execution accuracy of 69.07%, representing a 214% improvement over existing VLM baselines.
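To make "quantifiable dimensions of a swipe" concrete, the sketch below synthesizes a touch-point trajectory from a few such dimensions (duration, arc curvature, ease-in/ease-out velocity profile). These particular dimensions and values are a hypothetical subset chosen for illustration, not SwipeGen's actual decomposition or pipeline.

```python
import numpy as np

def synthesize_swipe(start, end, duration_ms=350, n_points=30, curvature=0.15):
    """Sample a human-like swipe as timestamped touch points (t, x, y)."""
    start, end = np.asarray(start, float), np.asarray(end, float)
    # Ease-in/ease-out progress (smoothstep) instead of constant velocity
    s = np.linspace(0.0, 1.0, n_points)
    progress = 3 * s**2 - 2 * s**3
    # Straight-line path bent by a perpendicular arc (finger curvature)
    direction = end - start
    normal = np.array([-direction[1], direction[0]])
    arc = curvature * np.sin(np.pi * progress)
    points = start + progress[:, None] * direction + arc[:, None] * normal
    timestamps = s * duration_ms
    return np.column_stack([timestamps, points])        # (n_points, 3)

trace = synthesize_swipe(start=(540, 1600), end=(540, 600))
print(trace[0], trace[-1])   # begins at the start point, ends at the end point
```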
https://arxiv.org/abs/2601.18305
This paper presents a novel neuromorphic control architecture for upper-limb prostheses that combines surface electromyography (sEMG) with gaze-guided computer vision. The system uses a spiking neural network deployed on the neuromorphic processor AltAi to classify EMG patterns in real time while an eye-tracking headset and scene camera identify the object within the user's focus. In our prototype, the same EMG recognition model that was originally developed for a conventional GPU is deployed as a spiking network on AltAi, achieving comparable accuracy while operating in a sub-watt power regime, which enables a lightweight, wearable implementation. For six distinct functional gestures recorded from upper-limb amputees, the system achieves robust recognition performance comparable to state-of-the-art myoelectric interfaces. When the vision pipeline restricts the decision space to three context-appropriate gestures for the currently viewed object, recognition accuracy increases to roughly 95% while excluding unsafe, object-inappropriate grasps. These results indicate that the proposed neuromorphic, context-aware controller can provide energy-efficient and reliable prosthesis control and has the potential to improve safety and usability in everyday activities for people with upper-limb amputation.
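The gaze-guided restriction of the decision space can be illustrated as masking the EMG classifier's per-gesture scores to the subset appropriate for the currently viewed object before taking the argmax. The object-to-gesture mapping and the gesture vocabulary below are hypothetical stand-ins for what the vision pipeline and the spiking classifier would provide.

```python
import numpy as np

# Hypothetical mapping from the gaze-identified object to its three
# context-appropriate gestures.
CONTEXT_GESTURES = {
    "mug": ["power_grasp", "tripod_pinch", "hook"],
    "key": ["lateral_pinch", "tripod_pinch", "open_hand"],
}
GESTURES = ["power_grasp", "tripod_pinch", "hook", "lateral_pinch", "open_hand", "point"]

def contextual_decision(emg_scores, gazed_object):
    """Restrict the gesture decision space to those appropriate for the viewed
    object, excluding unsafe or object-inappropriate grasps before the argmax."""
    allowed = CONTEXT_GESTURES.get(gazed_object, GESTURES)
    masked = [s if g in allowed else -np.inf for g, s in zip(GESTURES, emg_scores)]
    return GESTURES[int(np.argmax(masked))]

scores = [0.20, 0.18, 0.05, 0.35, 0.12, 0.10]    # raw per-gesture scores from the SNN
print(contextual_decision(scores, "mug"))         # 'power_grasp' (lateral_pinch excluded)
```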
https://arxiv.org/abs/2601.17991