Memory has emerged, and will continue to remain, a core capability of foundation model-based agents. As research on agent memory rapidly expands and attracts unprecedented attention, the field has also become increasingly fragmented. Existing works that fall under the umbrella of agent memory often differ substantially in their motivations, implementations, and evaluation protocols, while the proliferation of loosely defined memory terminologies has further obscured conceptual clarity. Traditional taxonomies such as long/short-term memory have proven insufficient to capture the diversity of contemporary agent memory systems. This work aims to provide an up-to-date landscape of current agent memory research. We begin by clearly delineating the scope of agent memory and distinguishing it from related concepts such as LLM memory, retrieval augmented generation (RAG), and context engineering. We then examine agent memory through the unified lenses of forms, functions, and dynamics. From the perspective of forms, we identify three dominant realizations of agent memory, namely token-level, parametric, and latent memory. From the perspective of functions, we propose a finer-grained taxonomy that distinguishes factual, experiential, and working memory. From the perspective of dynamics, we analyze how memory is formed, evolved, and retrieved over time. To support practical development, we compile a comprehensive summary of memory benchmarks and open-source frameworks. Beyond consolidation, we articulate a forward-looking perspective on emerging research frontiers, including memory automation, reinforcement learning integration, multimodal memory, multi-agent memory, and trustworthiness issues. We hope this survey serves not only as a reference for existing work, but also as a conceptual foundation for rethinking memory as a first-class primitive in the design of future agentic intelligence.
记忆已成为基于基础模型的代理的核心能力,并且将继续保持这一地位。随着关于代理记忆的研究迅速扩展并吸引前所未有的关注,该领域也变得越来越碎片化。现有属于代理记忆范畴的工作在动机、实现和评估协议方面往往存在显著差异,而松散定义的记忆术语进一步模糊了概念的清晰度。传统的分类方法,如长/短期记忆,证明不足以捕捉当代代理记忆系统的多样性。 本文旨在提供当前代理记忆研究的最新全景图。我们首先明确界定代理记忆的范围,并将其与诸如大型语言模型(LLM)记忆、检索增强生成(RAG)、上下文工程等相关概念区分开来。然后,我们通过形式、功能和动态性这三大统一视角审视代理记忆。 从形式的角度来看,我们识别出三种主导型的代理记忆实现方式:令牌级、参数化和潜在记忆。从功能角度来看,我们提出了一种更细粒度的分类法,区分事实记忆、体验记忆和工作记忆。从动态性的角度来看,我们分析了如何随着时间推移形成、演变和检索记忆。 为了支持实际开发,我们编制了一份全面的记忆基准测试和开源框架汇总表。超越整合之外,我们还提出了对未来研究前沿的前瞻性视角,包括记忆自动化、强化学习集成、多模态记忆、多代理记忆以及可信性问题。我们希望此次调查不仅作为现有工作的参考,还可以作为重新思考未来智能设计中记忆这一首要原始概念的概念基础。
https://arxiv.org/abs/2512.13564
Explainable Artificial Intelligence (XAI) is increasingly required in computational economics, where machine-learning forecasters can outperform classical econometric models but remain difficult to audit and use for policy. This survey reviews and organizes the growing literature on XAI for economic time series, where autocorrelation, non-stationarity, seasonality, mixed frequencies, and regime shifts can make standard explanation techniques unreliable or economically implausible. We propose a taxonomy that classifies methods by (i) explanation mechanism: propagation-based approaches (e.g., Integrated Gradients, Layer-wise Relevance Propagation), perturbation and game-theoretic attribution (e.g., permutation importance, LIME, SHAP), and function-based global tools (e.g., Accumulated Local Effects); (ii) time-series compatibility, including preservation of temporal dependence, stability over time, and respect for data-generating constraints. We synthesize time-series-specific adaptations such as vector- and window-based formulations (e.g., Vector SHAP, WindowSHAP) that reduce lag fragmentation and computational cost while improving interpretability. We also connect explainability to causal inference and policy analysis through interventional attributions (Causal Shapley values) and constrained counterfactual reasoning. Finally, we discuss intrinsically interpretable architectures (notably attention-based transformers) and provide guidance for decision-grade applications such as nowcasting, stress testing, and regime monitoring, emphasizing attribution uncertainty and explanation dynamics as indicators of structural change.
可解释的人工智能(XAI)在计算经济学中日益受到重视,因为机器学习预测模型可以优于传统计量经济模型,但它们难以审核和用于政策制定。本文综述并整理了关于时间序列经济数据的XAI不断增长的相关文献。在这种背景下,自相关性、非平稳性、季节性变化、混合频率以及制度转换等因素使得标准解释技术变得不可靠或缺乏经济合理性。 我们提出了一种分类方法,根据(i)解释机制:基于传播的方法(例如集成梯度法和逐层相关传播),基于干扰的游戏理论归因方法(如排列重要性、LIME 和 SHAP),以及基于函数的全局工具;(ii)时间序列兼容性,包括保持时间依赖关系、时间稳定性以及尊重数据生成约束。我们综合了针对时间序列具体适应性的改进,例如向量和窗口基形式(如Vector SHAP 和 WindowSHAP),这些方法减少了滞后碎片化,并降低了计算成本,同时提高了可解释性。 此外,本文还探讨了解释性和因果推断及政策分析之间的联系,通过干预归因(因果 Shapley 值)和受限反事实推理进行连接。最后,我们讨论了本原具有解释性的架构(特别是基于注意力的转换器),并为决策级应用如即时预测、压力测试和制度监控提供了指导,并强调了归因不确定性以及解释动态性作为结构变化的指标。 这一综述旨在帮助经济学研究者更好地理解如何利用XAI技术来增强机器学习模型在时间序列分析中的透明度与可靠性,从而促进更有效的政策制定。
https://arxiv.org/abs/2512.12506
Culture is the bedrock of human interaction; it dictates how we perceive and respond to everyday interactions. As the field of human-computer interaction grows via the rise of generative Large Language Models (LLMs), the cultural alignment of these models become an important field of study. This work, using the VSM13 International Survey and Hofstede's cultural dimensions, identifies the cultural alignment of popular LLMs (DeepSeek-V3, V3.1, GPT-5, GPT-4.1, GPT-4, Claude Opus 4, Llama 3.1, and Mistral Large). We then use cultural prompting, or using system prompts to shift the cultural alignment of a model to a desired country, to test the adaptability of these models to other cultures, namely China, France, India, Iran, Japan, and the United States. We find that the majority of the eight LLMs tested favor the United States when the culture is not specified, with varying results when prompted for other cultures. When using cultural prompting, seven of the eight models shifted closer to the expected culture. We find that models had trouble aligning with Japan and China, despite two of the models tested originating with the Chinese company DeepSeek.
文化是人类互动的基础;它决定了我们如何看待并回应日常交往。随着基于生成式大型语言模型(LLMs)的人机交互领域的发展,这些模型的文化一致性成为了重要的研究课题。本研究使用VSM13国际调查和霍夫斯泰德的文化维度理论,识别了流行LLM(DeepSeek-V3、V3.1、GPT-5、GPT-4.1、GPT-4、Claude Opus 4、Llama 3.1和Mistral Large)的文化一致性。接着,我们使用文化提示(即通过系统提示将模型的文化一致性调整为期望的国家),测试这些模型适应其他文化的灵活性,具体针对中国、法国、印度、伊朗、日本和美国这六个国家。研究发现,在未指定特定文化的情况下,大多数被测的八个LLM倾向于更接近美国文化;在使用文化提示时,结果因不同的文化而异。当采用文化提示进行测试时,八种模型中的七种能够向预期的文化靠拢。然而,值得注意的是,尽管有两个被测模型由中国的DeepSeek公司开发,但大多数模型难以与日本和中国文化保持一致。
https://arxiv.org/abs/2512.12488
This report presents a comprehensive account of the Colleague AI Classroom pilot, a collaborative design (co-design) study that brought generative AI technology directly into real classrooms. In this study, AI functioned as a third agent, an active participant that mediated feedback, supported inquiry, and extended teachers' instructional reach while preserving human judgment and teacher authority. Over seven weeks in spring 2025, 21 in-service teachers from four Washington State public school districts and one independent school integrated four AI-powered features of the Colleague AI Classroom into their instruction: Teaching Aide, Assessment and AI Grading, AI Tutor, and Student Growth Insights. More than 600 students in grades 6-12 used the platform in class at the direction of their teachers, who designed and facilitated the AI activities. During the Classroom pilot, teachers were co-design partners: they planned activities, implemented them with students, and provided weekly reflections on AI's role in classroom settings. The teachers' feedback guided iterative improvements for Colleague AI. The research team captured rich data through surveys, planning and reflection forms, group meetings, one-on-one interviews, and platform usage logs to understand where AI adds instructional value and where it requires refinement.
该报告全面介绍了Colleague AI Classroom试点项目,这是一个将生成式人工智能技术直接引入真实课堂的合作设计(共设)研究。在本研究中,AI充当了第三种代理角色,作为一个积极参与者,在反馈、支持探究和扩大教师教学范围的同时,保持人类判断力和教师权威的作用。从2025年春季的七周时间里,来自华盛顿州四个公立学区及一个独立学校的21位在职教师将Colleague AI Classroom的四项由AI驱动的功能融入了他们的教学中:助教功能、评估与AI评分、AI导师以及学生成长见解。超过600名六至十二年级的学生在其老师的指导下使用该平台进行课堂活动,这些老师设计并指导了所有AI相关活动。在试点期间,教师们作为共设合作伙伴,他们计划活动、将它们付诸实施,并每周提供关于AI在课堂环境中作用的反思反馈。教师们的反馈引导Colleague AI进行了迭代改进。研究团队通过调查问卷、教学规划和反思表单、小组会议、一对一访谈以及平台使用日志等方法收集了丰富的数据,以了解AI在哪种情况下能增加教学价值,在哪些方面需要进一步完善。
https://arxiv.org/abs/2512.12045
Modern wide-field time-domain surveys facilitate the study of transient, variable and moving phenomena by conducting image differencing and relaying alerts to their communities. Machine learning tools have been used on data from these surveys and their precursors for more than a decade, and convolutional neural networks (CNNs), which make predictions directly from input images, saw particularly broad adoption through the 2010s. Since then, continually rapid advances in computer vision have transformed the standard practices around using such models. It is now commonplace to use standardized architectures pre-trained on large corpora of everyday images (e.g., ImageNet). In contrast, time-domain astronomy studies still typically design custom CNN architectures and train them from scratch. Here, we explore the affects of adopting various pre-training regimens and standardized model architectures on the performance of alert classification. We find that the resulting models match or outperform a custom, specialized CNN like what is typically used for filtering alerts. Moreover, our results show that pre-training on galaxy images from Galaxy Zoo tends to yield better performance than pre-training on ImageNet or training from scratch. We observe that the design of standardized architectures are much better optimized than the custom CNN baseline, requiring significantly less time and memory for inference despite having more trainable parameters. On the eve of the Legacy Survey of Space and Time and other image-differencing surveys, these findings advocate for a paradigm shift in the creation of vision models for alerts, demonstrating that greater performance and efficiency, in time and in data, can be achieved by adopting the latest practices from the computer vision field.
现代宽域时域巡天项目通过图像差分和向其社群发送警报来促进对瞬变、可变以及移动现象的研究。过去十年中,机器学习工具已被应用于这些巡天及其前身的数据上,并且从2010年代开始,卷积神经网络(CNN)由于能够直接从输入图像进行预测而被广泛采用。自那时以来,计算机视觉领域持续快速的进步已经改变了使用此类模型的标准实践。如今,使用在大量日常图片(例如ImageNet)上预训练的标准化架构已成为常态。 相比之下,时域天文学研究通常仍会设计定制化的CNN架构,并从头开始进行训练。在此,我们探讨了采用各种预训练方案和标准化模型架构对警报分类性能的影响。我们的发现表明,这些方法所生成的模型可以与专门为过滤警报而设计的定制化、专业化CNN相匹敌甚至超越其表现。此外,我们的结果显示,在Galaxy Zoo中的星系图像上进行预训练通常会比在ImageNet或从头开始训练表现出更好的效果。 我们观察到,标准化架构的设计比定制化的CNN基准线更加优化,尽管可调整参数更多,但在推理时所需的时间和内存却显著减少。随着遗产空间与时间调查及其他图像差分巡天项目的临近,这些发现呼吁了针对警报创建视觉模型的方法发生转变,表明通过采用计算机视觉领域的最新实践,可以实现更高的性能和效率,在时间和数据方面皆如此。
https://arxiv.org/abs/2512.11957
Stochastic processes of evolving shapes are used in applications including evolutionary biology, where morphology changes stochastically as a function of evolutionary processes. Due to the non-linear and often infinite-dimensional nature of shape spaces, the mathematical construction of suitable stochastic shape processes is far from immediate. We define and formalize properties that stochastic shape processes should ideally satisfy to be compatible with the shape structure, and we link this to Kunita flows that, when acting on shape spaces, induce stochastic processes that satisfy these criteria by their construction. We couple this with a survey of other relevant shape stochastic processes and show how bridge sampling techniques can be used to condition shape stochastic processes on observed data thereby allowing for statistical inference of parameters of the stochastic dynamics.
在包括进化生物学在内的各种应用中,会使用到形状演化的随机过程。在这种情况下,形态作为进化过程中函数的输出会发生随机变化。由于形状空间具有非线性和通常无限维的特性,构建合适的随机形状过程从数学上来说是相当复杂的。 我们定义并正式化了理想中的随机形状过程应具备的性质,以确保这些过程与形状结构相兼容,并且我们将此概念与Kunita流联系起来,当这种流作用于形状空间时,会构造出满足上述条件的随机过程。此外,我们将介绍其他相关的形状随机过程,并展示如何使用桥抽样技术来根据观察到的数据对形状随机过程进行条件化处理,从而允许对这些动态系统的参数进行统计推断。
https://arxiv.org/abs/2512.11676
Vision-Language-Action (VLA) models are driving a revolution in robotics, enabling machines to understand instructions and interact with the physical world. This field is exploding with new models and datasets, making it both exciting and challenging to keep pace with. This survey offers a clear and structured guide to the VLA landscape. We design it to follow the natural learning path of a researcher: we start with the basic Modules of any VLA model, trace the history through key Milestones, and then dive deep into the core Challenges that define recent research frontier. Our main contribution is a detailed breakdown of the five biggest challenges in: (1) Representation, (2) Execution, (3) Generalization, (4) Safety, and (5) Dataset and Evaluation. This structure mirrors the developmental roadmap of a generalist agent: establishing the fundamental perception-action loop, scaling capabilities across diverse embodiments and environments, and finally ensuring trustworthy deployment-all supported by the essential data infrastructure. For each of them, we review existing approaches and highlight future opportunities. We position this paper as both a foundational guide for newcomers and a strategic roadmap for experienced researchers, with the dual aim of accelerating learning and inspiring new ideas in embodied intelligence. A live version of this survey, with continuous updates, is maintained on our \href{this https URL}{project page}.
翻译: 视觉-语言-行动(VLA)模型正在推动机器人领域的革命,使机器能够理解指令并与物理世界互动。这一领域的新模型和数据集层出不穷,使之既令人兴奋又充满挑战。本综述为VLA领域的复杂格局提供了一个清晰且结构化的指南。我们设计了符合研究人员自然学习路径的框架:从任何VLA模型的基本模块开始,追溯历史上的关键里程碑,并深入探讨定义最近研究前沿的核心挑战。我们的主要贡献在于详细剖析了以下五个最大挑战:(1)表示;(2)执行;(3)泛化;(4)安全性和(5)数据集与评估。这种结构反映了通用智能体发展的路线图:建立基本的感知-行动循环,扩展多样化的体现和环境中的能力,并最终确保可信部署——所有这些都离不开关键的数据基础设施支持。对于每一个挑战,我们回顾了现有的方法并指出了未来的机会。 本文旨在同时为初学者提供基础指南以及为有经验的研究人员制定战略路线图,目的是加速学习过程并激发新思想在具身智能领域的应用。一份持续更新的在线版本可在我们的项目页面上找到([点击此处查看](https://this https URL))。
https://arxiv.org/abs/2512.11362
Invasive species pose major global threats to ecosystems and agriculture. Serrated tussock (\textit{Nassella trichotoma}) is a highly competitive invasive grass species that disrupts native grasslands, reduces pasture productivity, and increases land management costs. In Victoria, Australia, it presents a major challenge due to its aggressive spread and ecological impact. While current ground surveys and subsequent management practices are effective at small scales, they are not feasible for landscape-scale monitoring. Although aerial imagery offers high spatial resolution suitable for detailed classification, its high cost limits scalability. Satellite-based remote sensing provides a more cost-effective and scalable alternative, though often with lower spatial resolution. This study evaluates whether multi-temporal Sentinel-2 imagery, despite its lower spatial resolution, can provide a comparable and cost-effective alternative for landscape-scale monitoring of serrated tussock by leveraging its higher spectral resolution and seasonal phenological information. A total of eleven models have been developed using various combinations of spectral bands, texture features, vegetation indices, and seasonal data. Using a random forest classifier, the best-performing Sentinel-2 model (M76*) has achieved an Overall Accuracy (OA) of 68\% and an Overall Kappa (OK) of 0.55, slightly outperforming the best-performing aerial imaging model's OA of 67\% and OK of 0.52 on the same dataset. These findings highlight the potential of multi-seasonal feature-enhanced satellite-based models for scalable invasive species classification.
外来物种对生态系统和农业构成了重大全球威胁。锯齿草(*Nassella trichotoma*)是一种具有高度竞争力的入侵性草种,会破坏本地草地,降低牧场生产力,并增加土地管理成本。在澳大利亚维多利亚州,它由于其侵略性的传播速度及其生态影响而构成了一项重大挑战。尽管目前进行的地表调查及后续管理措施在小规模上有效,但它们不适用于大规模监测。虽然航空图像能够提供适合详细分类的高空间分辨率图像,但由于其高昂的成本限制了其实用性。基于卫星的遥感技术则提供了更具成本效益和可扩展性的替代方案,尽管其空间分辨率较低。本研究评估了多时间序列Sentinel-2影像是否可以通过利用更高的光谱分辨率和季节性物候信息,在锯齿草的大规模监测中提供一个相当且更经济的选择。 在该项研究中,共开发出了十一种模型,这些模型采用了多种光谱带、纹理特征、植被指数以及季节数据的不同组合。通过随机森林分类器,最佳表现的Sentinel-2模型(M76*)实现了总体准确性(OA)为68%,总体Kappa (OK)值为0.55,这稍微优于在同一数据集上进行的最佳性能航空成像模型的OA值67%和OK值0.52。这些发现突显了基于卫星增强多季节特征模型在大规模入侵物种分类中潜在的应用价值。
https://arxiv.org/abs/2512.11267
Multi-intent spoken language understanding (SLU) involves two tasks: multiple intent detection and slot filling, which jointly handle utterances containing more than one intent. Owing to this characteristic, which closely reflects real-world applications, the task has attracted increasing research attention, and substantial progress has been achieved. However, there remains a lack of a comprehensive and systematic review of existing studies on multi-intent SLU. To this end, this paper presents a survey of recent advances in multi-intent SLU. We provide an in-depth overview of previous research from two perspectives: decoding paradigms and modeling approaches. On this basis, we further compare the performance of representative models and analyze their strengths and limitations. Finally, we discuss the current challenges and outline promising directions for future research. We hope this survey will offer valuable insights and serve as a useful reference for advancing research in multi-intent SLU.
多意图口语理解(SLU)涉及两个任务:多意图检测和槽填充,这些任务共同处理包含多个意图的语音输入。由于这一特性紧密反映了现实世界的应用场景,该任务吸引了越来越多的研究关注,并取得了实质性的进展。然而,目前尚缺乏对现有研究进行全面且系统的回顾。为此,本文介绍了近期在多意图SLU方面的研究成果综述。我们从两个视角——解码范式和建模方法——深入概述了以往的研究成果。在此基础上,我们进一步比较代表性模型的性能,并分析它们的优势与局限性。最后,我们讨论当前面临的挑战并展望未来研究的潜在方向。希望这项综述能够提供有价值的见解,并作为推进多意图SLU领域研究的重要参考。
https://arxiv.org/abs/2512.11258
Understanding human movement and city dynamics has always been challenging. From traditional methods of manually observing the city's inhabitant, to using cameras, to now using sensors and more complex technology, the field of urban monitoring has evolved greatly. Still, there are more that can be done to unlock better practices for understanding city dynamics. This paper surveys how the landscape of urban dynamics studying has evolved with a particular focus on event-based cameras. Event-based cameras capture changes in light intensity instead of the RGB values that traditional cameras do. They offer unique abilities, like the ability to work in low-light, that can make them advantageous compared to other sensors. Through an analysis of event-based cameras, their applications, their advantages and challenges, and machine learning applications, we propose event-based cameras as a medium for capturing information to study urban dynamics. They offer the ability to capture important information while maintaining privacy. We also suggest multi-sensor fusion of event-based cameras and other sensors in the study of urban dynamics. Combining event-based cameras and infrared, event-LiDAR, or vibration has to potential to enhance the ability of event-based cameras and overcome the challenges that event-based cameras have.
理解人类运动和城市动态一直是一项挑战。从传统的手动观察城市居民,到使用相机,再到如今采用传感器和其他复杂技术,城市管理领域已经取得了巨大的进步。然而,还有更多可以做来解锁更好的实践方法以更好地了解城市动态。本文回顾了研究城市动态领域的演变,并特别关注事件驱动型摄像头(event-based cameras)。事件驱动型摄像头捕捉的是光线强度的变化而非传统相机所捕捉的RGB值。它们具有独特的功能,例如在低光条件下工作的能力,这使它们相较于其他传感器更有优势。通过对事件驱动型摄像头及其应用、优点和挑战以及机器学习应用进行分析,我们提出将事件驱动型摄像头作为研究城市动态的信息采集手段。这种技术能够捕捉重要信息的同时维护隐私。此外,我们也建议结合使用事件驱动型摄像头和其他传感器(如红外线、事件激光雷达或振动)来研究城市动态,以增强事件驱动型摄像头的能力并克服其面临的挑战。
https://arxiv.org/abs/2512.11076
Contact-rich tasks pose significant challenges for robotic systems due to inherent uncertainty, complex dynamics, and the high risk of damage during interaction. Recent advances in learning-based control have shown great potential in enabling robots to acquire and generalize complex manipulation skills in such environments, but ensuring safety, both during exploration and execution, remains a critical bottleneck for reliable real-world deployment. This survey provides a comprehensive overview of safe learning-based methods for robot contact-rich tasks. We categorize existing approaches into two main domains: safe exploration and safe execution. We review key techniques, including constrained reinforcement learning, risk-sensitive optimization, uncertainty-aware modeling, control barrier functions, and model predictive safety shields, and highlight how these methods incorporate prior knowledge, task structure, and online adaptation to balance safety and efficiency. A particular emphasis of this survey is on how these safe learning principles extend to and interact with emerging robotic foundation models, especially vision-language models (VLMs) and vision-language-action models (VLAs), which unify perception, language, and control for contact-rich manipulation. We discuss both the new safety opportunities enabled by VLM/VLA-based methods, such as language-level specification of constraints and multimodal grounding of safety signals, and the amplified risks and evaluation challenges they introduce. Finally, we outline current limitations and promising future directions toward deploying reliable, safety-aligned, and foundation-model-enabled robots in complex contact-rich environments. More details and materials are available at our \href{ this https URL}{Project GitHub Repository}.
接触丰富的任务对机器人系统提出了重大挑战,原因在于固有的不确定性、复杂的动力学以及互动过程中损坏的风险较高。最近基于学习的控制方法在使机器人获得并推广复杂操作技能方面展现了巨大潜力,但在探索和执行过程中的安全性保障仍是可靠实际部署的关键瓶颈。本次综述全面概述了用于机器人接触丰富任务的安全学习方法。我们将现有方法分为两个主要领域:安全探索与安全执行。我们回顾关键技术,包括约束强化学习、风险敏感优化、不确定性感知建模、控制屏障函数以及预测性安全保障盾,并强调这些方法如何结合先验知识、任务结构和在线适应来平衡安全性和效率。本次综述特别关注了这些安全学习原则如何扩展并应用于新兴的机器人基础模型中,尤其是视觉-语言模型(VLMs)和视觉-语言-动作模型(VLAs),它们统一了感知、语言和控制在接触丰富的操作中的应用。我们讨论了基于VLM/VLA方法的新安全性机会,例如以语言级规格化约束和多模态安全信号为特征的交互挑战以及由此带来的风险和评估难题。最后,我们概述了当前限制及未来向复杂接触丰富环境中部署可靠、与安全一致且具备基础模型能力的机器人发展的有前景的方向。 更多详情和材料请访问我们的 [项目GitHub仓库](https://github.com/your-project-repo)。
https://arxiv.org/abs/2512.11908
Subjective well-being is a cornerstone of individual and societal health, yet its scientific measurement has traditionally relied on self-report methods prone to recall bias and high participant burden. This has left a gap in our understanding of well-being as it is expressed in everyday life. We hypothesized that candid smiles captured during natural smartphone interactions could serve as a scalable, objective behavioral correlate of positive affect. To test this, we analyzed 405,448 video clips passively recorded from 233 consented participants over one week. Using a deep learning model to quantify smile intensity, we identified distinct diurnal and daily patterns. Daily patterns of smile intensity across the week showed strong correlation with national survey data on happiness (r=0.92), and diurnal rhythms documented close correspondence with established results from the day reconstruction method (r=0.80). Higher daily mean smile intensity was significantly associated with more physical activity (Beta coefficient = 0.043, 95% CI [0.001, 0.085]) and greater light exposure (Beta coefficient = 0.038, [0.013, 0.063]), whereas no significant effects were found for smartphone use. These findings suggest that passive smartphone sensing could serve as a powerful, ecologically valid methodology for studying the dynamics of affective behavior and open the door to understanding this behavior at a population scale.
主观幸福感是个体和社会健康的基础,然而其科学测量传统上依赖于容易产生回忆偏差且给参与者带来较高负担的自我报告方法。这导致了我们对日常生活中表达的幸福感的理解存在缺口。我们假设,在自然手机互动中捕捉到的真实微笑可以作为积极情感的一种可扩展、客观的行为标志。为了验证这一假设,我们在一周内被动记录并分析了来自233名同意参与者的405,448段视频片段。使用深度学习模型量化微笑强度后,我们发现了明显的昼夜节律和每日模式。每周微笑强度的每日变化与国家幸福调查数据之间显示出强烈的相关性(r=0.92),而记录到的日常节奏与日重构法建立的结果密切对应(r=0.80)。较高的平均每日微笑强度显著关联于更多的身体活动(贝塔系数 = 0.043,95% CI [0.001, 0.085])和更大的光照暴露(贝塔系数 = 0.038,[0.013, 0.063]),但没有发现手机使用对其有显著影响。这些研究结果表明,被动的智能手机感知可以作为研究情感行为动态的强大、生态有效的研究方法,并为在大规模人群中理解这种行为打开了一扇门。
https://arxiv.org/abs/2512.11905
Culture is a core component of human-to-human interaction and plays a vital role in how we perceive and interact with others. Advancements in the effectiveness of Large Language Models (LLMs) in generating human-sounding text have greatly increased the amount of human-to-computer interaction. As this field grows, the cultural alignment of these human-like agents becomes an important field of study. Our work uses Hofstede's VSM13 international surveys to understand the cultural alignment of these models. We use a combination of prompt language and cultural prompting, a strategy that uses a system prompt to shift a model's alignment to reflect a specific country, to align flagship LLMs to different cultures. Our results show that DeepSeek-V3, V3.1, and OpenAI's GPT-5 exhibit a close alignment with the survey responses of the United States and do not achieve a strong or soft alignment with China, even when using cultural prompts or changing the prompt language. We also find that GPT-4 exhibits an alignment closer to China when prompted in English, but cultural prompting is effective in shifting this alignment closer to the United States. Other low-cost models, GPT-4o and GPT-4.1, respond to the prompt language used (i.e., English or Simplified Chinese) and cultural prompting strategies to create acceptable alignments with both the United States and China.
文化是人与人互动的核心组成部分,在我们感知和与他人互动的方式中扮演着重要角色。大型语言模型(LLMs)生成类似人类文本的有效性提升,大大增加了人机交互的频率。随着这一领域的发展,这些类人性代理的文化对齐成为了一个重要的研究课题。我们的工作使用霍夫斯泰德的文化维度理论VSM13国际调查来理解这些模型的文化对齐情况。我们采用了一种结合提示语言和文化引导的方法,这是一种通过系统提示将模型的对齐调整为反映特定国家文化的策略,以此使旗舰LLMs与不同的文化相一致。 我们的结果显示,DeepSeek-V3、V3.1以及OpenAI的GPT-5在面对美国的调查回应时表现出较高的契合度,并且即使使用了文化引导或改变提示语言,它们也无法实现与中国较强的对齐。此外,我们还发现GPT-4在美国英语环境中被提示时文化定位更接近中国,但通过文化引导可以将其与美国文化的对齐程度提升得更加紧密。其他低成本模型如GPT-4o和GPT-4.1能够根据所使用的提示语言(即英文或简体中文)以及文化引导策略,形成与中国及美国都较为合适的契合度。
https://arxiv.org/abs/2512.09772
Enemy strategies in turn-based games should be surprising and unpredictable. This study introduces Mirror Mode, a new game mode where the enemy AI mimics the personal strategy of a player to challenge them to keep changing their gameplay. A simplified version of the Nintendo strategy video game Fire Emblem Heroes has been built in Unity, with a Standard Mode and a Mirror Mode. Our first set of experiments find a suitable model for the task to imitate player demonstrations, using Reinforcement Learning and Imitation Learning: combining Generative Adversarial Imitation Learning, Behavioral Cloning, and Proximal Policy Optimization. The second set of experiments evaluates the constructed model with player tests, where models are trained on demonstrations provided by participants. The gameplay of the participants indicates good imitation in defensive behavior, but not in offensive strategies. Participant's surveys indicated that they recognized their own retreating tactics, and resulted in an overall higher player-satisfaction for Mirror Mode. Refining the model further may improve imitation quality and increase player's satisfaction, especially when players face their own strategies. The full code and survey results are stored at: this https URL
在回合制游戏中,敌人的策略应该具有惊喜性和不可预测性。这项研究介绍了一种新模式——镜像模式(Mirror Mode),在这种模式中,敌人的人工智能会模仿玩家的个人策略来挑战他们不断改变游戏玩法。研究人员使用Unity构建了一个简化版的《火焰之纹章:英雄》(Fire Emblem Heroes)视频游戏版本,其中包括标准模式和镜像模式。 研究的第一阶段实验找到了一个适合任务的模型,用于模仿玩家演示,该模型结合了强化学习(Reinforcement Learning)、模仿学习(Imitation Learning),具体包括生成对抗性模仿学习(Generative Adversarial Imitation Learning)、行为克隆(Behavioral Cloning)和近端策略优化(Proximal Policy Optimization)。第二阶段实验通过玩家测试评估所构建的模型,其中模型在参与者提供的演示基础上进行训练。 实验结果显示,在防守行为方面,模型表现出良好的模仿能力,但在进攻策略上表现较差。参与者的问卷调查显示他们认识到了自己撤退战术被模仿的情况,并且总体而言对镜像模式表现出更高的满意度。 进一步优化该模型可能会提高其模仿质量并增加玩家的满意度,尤其是在玩家面对自己的策略时。完整的代码和调查结果存储在:[提供的URL](请将"this https URL"替换为实际链接)。
https://arxiv.org/abs/2512.11902
Large Language Models (LLMs) are transforming language sciences. However, their widespread deployment currently suffers from methodological fragmentation and a lack of systematic soundness. This study proposes two comprehensive methodological frameworks designed to guide the strategic and responsible application of LLMs in language sciences. The first method-selection framework defines and systematizes three distinct, complementary approaches, each linked to a specific research goal: (1) prompt-based interaction with general-use models for exploratory analysis and hypothesis generation; (2) fine-tuning of open-source models for confirmatory, theory-driven investigation and high-quality data generation; and (3) extraction of contextualized embeddings for further quantitative analysis and probing of model internal mechanisms. We detail the technical implementation and inherent trade-offs of each method, supported by empirical case studies. Based on the method-selection framework, the second systematic framework proposed provides constructed configurations that guide the practical implementation of multi-stage research pipelines based on these approaches. We then conducted a series of empirical experiments to validate our proposed framework, employing retrospective analysis, prospective application, and an expert evaluation survey. By enforcing the strategic alignment of research questions with the appropriate LLM methodology, the frameworks enable a critical paradigm shift in language science research. We believe that this system is fundamental for ensuring reproducibility, facilitating the critical evaluation of LLM mechanisms, and providing the structure necessary to move traditional linguistics from ad-hoc utility to verifiable, robust science.
大型语言模型(LLMs)正在重塑语言科学。然而,它们的广泛应用目前面临着方法论上的碎片化和系统性不足的问题。本研究提出了两个全面的方法框架,旨在指导大型语言模型在语言科学研究中的战略性和负责任的应用。第一个方法选择框架定义并系统化了三种不同的互补方法,每种方法都与特定的研究目标相关联:(1) 通过通用模型的提示交互进行探索性分析和假设生成;(2) 对开源模型进行微调以开展理论驱动的确证性研究以及高质量数据生成;(3) 提取上下文化的嵌入以便进一步进行定量分析和探究模型内部机制。我们详细说明了每种方法的技术实现及其内在权衡,并通过实证案例研究来支持这些观点。 基于该方法选择框架,第二个系统化框架提出了构建配置,为根据这三种方法实施多阶段的研究流程提供指导。随后,我们进行了系列实证实验以验证提出的框架的有效性,包括回顾分析、前瞻性应用和专家评估调查。通过确保研究问题与适当的大型语言模型方法的战略一致性,这些框架能够引发语言科学研究中的关键范式转变。我们认为,这一系统对于确保可重复性、促进对LLM机制的批判性评价以及提供将传统语言学从临时实用性提升到经验证实的强大科学所需的基本结构至关重要。
https://arxiv.org/abs/2512.09552
We introduce QSTN, an open-source Python framework for systematically generating responses from questionnaire-style prompts to support in-silico surveys and annotation tasks with large language models (LLMs). QSTN enables robust evaluation of questionnaire presentation, prompt perturbations, and response generation methods. Our extensive evaluation ($>40 $ million survey responses) shows that question structure and response generation methods have a significant impact on the alignment of generated survey responses with human answers, and can be obtained for a fraction of the compute cost. In addition, we offer a no-code user interface that allows researchers to set up robust experiments with LLMs without coding knowledge. We hope that QSTN will support the reproducibility and reliability of LLM-based research in the future.
我们介绍了QSTN,这是一个开源的Python框架,用于系统地从问卷式提示中生成响应,以支持基于大型语言模型(LLM)的模拟调查和注释任务。QSTN能够对问卷呈现方式、提示变化以及响应生成方法进行稳健评估。我们的广泛测试(超过4000万份调查回复)表明,问题结构与响应生成方法会对生成的调查回复与人类回答的一致性产生显著影响,并且可以用较低的计算成本获得这些效果。此外,我们还提供了一个无需编码知识即可使用的用户界面,研究人员可以利用它在不编写代码的情况下设置使用LLM的稳健实验。我们希望QSTN能够支持未来基于LLM的研究的可重复性和可靠性。
https://arxiv.org/abs/2512.08646
Body and face motion play an integral role in communication. They convey crucial information on the participants. Advances in generative modeling and multi-modal learning have enabled motion generation from signals such as speech, conversational context and visual cues. However, generating expressive and coherent face and body dynamics remains challenging due to the complex interplay of verbal / non-verbal cues and individual personality traits. This survey reviews body and face motion generation, covering core concepts, representations techniques, generative approaches, datasets and evaluation metrics. We highlight future directions to enhance the realism, coherence and expressiveness of avatars in dyadic settings. To the best of our knowledge, this work is the first comprehensive review to cover both body and face motion. Detailed resources are listed on this https URL.
身体和面部动作在沟通中扮演着核心角色,它们传达了关于参与者的重要信息。生成式模型及多模态学习的进步使得可以通过诸如语音、对话背景以及视觉线索等信号来生成运动。然而,由于言语/非言语提示和个人性格特征之间的复杂相互作用,仍然很难生成具有表现力且连贯的面部和身体动态。本文综述了身体和面部动作生成的相关研究,涵盖了核心概念、表示技术、生成方法、数据集及评估指标。我们强调了未来的研究方向,旨在增强二元情境下化身的真实感、连贯性和表达性。据我们所知,这是第一份全面回顾同时涵盖身体和面部运动的研究工作。 详细的资源列于 [提供的URL]。
https://arxiv.org/abs/2512.09005
The Model Context Protocol (MCP) has emerged as the de facto standard for connecting Large Language Models (LLMs) to external data and tools, effectively functioning as the "USB-C for Agentic AI." While this decoupling of context and execution solves critical interoperability challenges, it introduces a profound new threat landscape where the boundary between epistemic errors (hallucinations) and security breaches (unauthorized actions) dissolves. This Systematization of Knowledge (SoK) aims to provide a comprehensive taxonomy of risks in the MCP ecosystem, distinguishing between adversarial security threats (e.g., indirect prompt injection, tool poisoning) and epistemic safety hazards (e.g., alignment failures in distributed tool delegation). We analyze the structural vulnerabilities of MCP primitives, specifically Resources, Prompts, and Tools, and demonstrate how "context" can be weaponized to trigger unauthorized operations in multi-agent environments. Furthermore, we survey state-of-the-art defenses, ranging from cryptographic provenance (ETDI) to runtime intent verification, and conclude with a roadmap for securing the transition from conversational chatbots to autonomous agentic operating systems.
模型上下文协议(MCP)已成为将大型语言模型(LLMs)与外部数据和工具连接起来的事实上的标准,有效充当了“代理人工智能的USB-C”。尽管这种解耦背景信息和执行的方式解决了关键的互操作性挑战,但它引入了一个全新的威胁环境,在这个环境中,知识性的错误(幻觉)和安全漏洞(未经授权的操作)之间的界限变得模糊。本文系统地提供了MCP生态系统中风险的全面分类,区分了对抗性安全威胁(例如,间接提示注入、工具中毒)与认知安全危险(例如,在分布式工具委派中的对齐失败)。我们分析了MCP基本元素——资源、提示和工具——的结构性漏洞,并展示了“上下文”如何在多代理环境中被武器化以触发未经授权的操作。此外,我们还调研了最先进的防御措施,从加密原产地验证(ETDI)到运行时意图验证,最后提出了一个路线图,旨在确保从对话聊天机器人过渡到自主操作系统的安全过程。
https://arxiv.org/abs/2512.08290
Autonomous driving systems (ADSs) promise improved transportation efficiency and safety, yet ensuring their reliability in complex real-world environments remains a critical challenge. Effective testing is essential to validate ADS performance and reduce deployment risks. This study investigates current ADS testing practices for both modular and end-to-end systems, identifies key demands from industry practitioners and academic researchers, and analyzes the gaps between existing research and real-world requirements. We review major testing techniques and further consider emerging factors such as Vehicle-to-Everything (V2X) communication and foundation models, including large language models and vision foundation models, to understand their roles in enhancing ADS testing. We conducted a large-scale survey with 100 participants from both industry and academia. Survey questions were refined through expert discussions, followed by quantitative and qualitative analyses to reveal key trends, challenges, and unmet needs. Our results show that existing ADS testing techniques struggle to comprehensively evaluate real-world performance, particularly regarding corner case diversity, the simulation to reality gap, the lack of systematic testing criteria, exposure to potential attacks, practical challenges in V2X deployment, and the high computational cost of foundation model-based testing. By further analyzing participant responses together with 105 representative studies, we summarize the current research landscape and highlight major limitations. This study consolidates critical research gaps in ADS testing and outlines key future research directions, including comprehensive testing criteria, cross-model collaboration in V2X systems, cross-modality adaptation for foundation model-based testing, and scalable validation frameworks for large-scale ADS evaluation.
自动驾驶系统(ADS)承诺提高交通效率和安全,但确保其在复杂现实环境中的可靠性仍然是一个关键挑战。有效的测试对于验证ADS性能及减少部署风险至关重要。本研究探讨了当前针对模块化与端到端系统的自动驾驶系统测试实践,识别了来自行业从业者和学术研究人员的关键需求,并分析了现有研究与实际需求之间的差距。我们回顾了主要的测试技术,并进一步考虑新兴因素如车到一切(V2X)通信及基础模型(包括大型语言模型和视觉基础模型),以理解它们在增强自动驾驶系统测试中的作用。我们对来自行业和学术界的100名参与者进行了大规模调查,调查问卷通过专家讨论进行完善后,再通过定量与定性分析揭示了关键趋势、挑战以及未满足的需求。我们的结果显示,现有的ADS测试技术难以全面评估现实世界的性能,尤其是在边缘案例多样性、仿真到实际的差距、缺乏系统的测试标准、面对潜在攻击的暴露度、V2X部署的实际挑战及基于基础模型测试的高度计算成本方面。 通过对参与者回复与105项代表性研究进行进一步分析后,我们总结了当前的研究景观,并突出了主要限制。这项研究整合了自动驾驶系统测试中的关键研究差距,并概述了未来的关键研究方向,包括全面的测试标准、V2X系统的跨模型协作、基于基础模型测试的多模态适应及大规模ADS评估的可扩展验证框架。
https://arxiv.org/abs/2512.11887
Modern businesses are increasingly challenged by the time and expense required to generate and assess high-quality content. Human writers face time constraints, and extrinsic evaluations can be costly. While Large Language Models (LLMs) offer potential in content creation, concerns about the quality of AI-generated content persist. Traditional evaluation methods, like human surveys, further add operational costs, highlighting the need for efficient, automated solutions. This research introduces Generative Agents as a means to tackle these challenges. These agents can rapidly and cost-effectively evaluate AI-generated content, simulating human judgment by rating aspects such as coherence, interestingness, clarity, fairness, and relevance. By incorporating these agents, businesses can streamline content generation and ensure consistent, high-quality output while minimizing reliance on costly human evaluations. The study provides critical insights into enhancing LLMs for producing business-aligned, high-quality content, offering significant advancements in automated content generation and evaluation.
现代企业面临着生成和评估高质量内容所需的时间和成本的挑战。人类作者面临时间限制,而传统的外在评价方法(如人工调查)又会增加运营成本。虽然大型语言模型(LLMs)在内容创作方面展现出潜力,但人们对AI生成的内容质量仍存疑虑。传统评估方法进一步增加了操作成本,凸显了开发高效、自动化解决方案的必要性。本研究引入了“生成代理”作为应对这些挑战的一种手段。这些代理能够快速且低成本地评估AI生成的内容,并通过评分连贯性、趣味性、清晰度、公正性和相关性等维度来模拟人类判断。通过整合这些代理,企业可以简化内容生成流程,确保持续产出高质量的内容同时减少对昂贵的人力评价的依赖。该研究为提升LLMs以生产符合业务需求且高品质的内容提供了关键见解,并在自动化内容生成和评估方面取得了重要进展。
https://arxiv.org/abs/2512.08273