Curating foundation speech models for edge and IoT settings, where computational resources vary over time, requires dynamic architectures featuring adaptable reduction strategies. One emerging approach is layer dropping ($\mathcal{LD}$), which skips a fraction of the layers of a backbone network during inference to reduce the computational load, thereby transforming static models into dynamic ones. However, existing approaches exhibit limitations either in how they select layers or by significantly modifying the neural architecture. To this end, we propose input-driven $\mathcal{LD}$, which employs the network's input features and a lightweight layer-selecting network to determine the optimal combination of processing layers. Extensive experimentation on 4 public speech and audio benchmarks, using two different pre-trained foundation models, demonstrates the effectiveness of our approach, consistently outperforming random dropping and producing results on par with (or better than) early exit.
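The abstract does not detail the selector; as a rough illustration only (module sizes, the pooled-feature input, and the top-k keep rule are assumptions, not the paper's design), a lightweight layer-selecting network over a stack of backbone blocks might look like this in PyTorch:

```python
import torch
import torch.nn as nn

class InputDrivenLayerDrop(nn.Module):
    """Sketch: score each backbone layer from the input features and execute
    only the top-scoring fraction at inference (hypothetical design)."""

    def __init__(self, layers, feat_dim, keep_ratio=0.5):
        super().__init__()
        self.layers = nn.ModuleList(layers)            # pre-trained backbone blocks
        self.selector = nn.Sequential(                 # lightweight layer-selecting net
            nn.Linear(feat_dim, 64), nn.ReLU(),
            nn.Linear(64, len(layers)),
        )
        self.keep_ratio = keep_ratio                   # fraction of layers to run

    def forward(self, x):                              # x: (batch, time, feat_dim)
        scores = self.selector(x.mean(dim=1))          # one score per layer
        k = max(1, int(self.keep_ratio * len(self.layers)))
        keep = set(scores.mean(dim=0).topk(k).indices.tolist())
        for i, layer in enumerate(self.layers):
            if i in keep:                              # skipped layers cost no compute
                x = layer(x)
        return x

blocks = [nn.TransformerEncoderLayer(d_model=80, nhead=4, batch_first=True)
          for _ in range(6)]
model = InputDrivenLayerDrop(blocks, feat_dim=80, keep_ratio=0.5)
print(model(torch.randn(2, 100, 80)).shape)            # torch.Size([2, 100, 80])
```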
https://arxiv.org/abs/2507.07954
Human Activity Recognition (HAR) on resource-constrained wearable devices demands inference models that harmonize accuracy with computational efficiency. This paper introduces TinierHAR, an ultra-lightweight deep learning architecture that synergizes residual depthwise separable convolutions, gated recurrent units (GRUs), and temporal aggregation to achieve SOTA efficiency without compromising performance. Evaluated across 14 public HAR datasets, TinierHAR reduces parameters by 2.7x (vs. TinyHAR) and 43.3x (vs. DeepConvLSTM), and MACs by 6.4x and 58.6x, respectively, while maintaining the average F1-scores. Beyond quantitative gains, this work provides the first systematic ablation study dissecting the contributions of spatial-temporal components across the proposed TinierHAR, the prior SOTA TinyHAR, and the classical DeepConvLSTM, offering actionable insights for designing efficient HAR systems. We finally discuss the findings and suggest principled design guidelines for future efficient HAR. To catalyze edge-HAR research, we open-source all materials in this work for future benchmarking\footnote{this https URL}
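For readers unfamiliar with the named building blocks, the sketch below wires them together in PyTorch (layer sizes and depth are placeholders of ours, not the published TinierHAR configuration):

```python
import torch
import torch.nn as nn

class DepthwiseSeparableBlock(nn.Module):
    """Depthwise + pointwise 1D convolution with a residual connection."""
    def __init__(self, ch):
        super().__init__()
        self.depthwise = nn.Conv1d(ch, ch, kernel_size=5, padding=2, groups=ch)
        self.pointwise = nn.Conv1d(ch, ch, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(x + self.pointwise(self.depthwise(x)))

class HARNetSketch(nn.Module):
    """Illustrative pipeline: conv stem -> residual depthwise-separable blocks
    -> GRU -> temporal mean pooling -> classifier."""
    def __init__(self, in_ch, num_classes, hidden=64):
        super().__init__()
        self.stem = nn.Conv1d(in_ch, hidden, kernel_size=5, padding=2)
        self.blocks = nn.Sequential(*[DepthwiseSeparableBlock(hidden) for _ in range(2)])
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x):                    # x: (batch, sensor_channels, time)
        h = self.blocks(self.stem(x))        # spatial feature extraction
        h, _ = self.gru(h.transpose(1, 2))   # temporal modelling
        return self.head(h.mean(dim=1))      # temporal aggregation

print(HARNetSketch(in_ch=6, num_classes=8)(torch.randn(2, 6, 128)).shape)  # [2, 8]
```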
https://arxiv.org/abs/2507.07949
The recent advances in generative models such as diffusion models have raised several risks and concerns related to privacy, copyright infringement, and data stewardship. To better understand and control the risks, various researchers have created techniques, experiments, and attacks that reconstruct images, or parts of images, from the training set. While these techniques already establish that data from the training set can be reconstructed, they often rely on high resources, access to the training set, and well-engineered prompts. In this work, we devise a new attack that requires low resources, assumes little to no access to the actual training set, and identifies seemingly benign prompts that lead to potentially risky image reconstruction. This highlights the risk that images might be reconstructed by an uninformed user, even unintentionally. For example, we identified that, with regard to one existing model, the prompt ``blue Unisex T-Shirt'' can generate the face of a real-life human model. Our method builds on an intuition from previous works that leverages domain knowledge, and it identifies a fundamental vulnerability stemming from the use of data scraped from e-commerce platforms, where templated layouts and images are tied to pattern-like prompts.
https://arxiv.org/abs/2507.07947
While Vision-Language Models (VLMs) have shown promising progress in general multimodal tasks, they often struggle in industrial anomaly detection and reasoning, particularly in delivering interpretable explanations and generalizing to unseen categories. This limitation stems from the inherently domain-specific nature of anomaly detection, which hinders the applicability of existing VLMs in industrial scenarios that require precise, structured, and context-aware analysis. To address these challenges, we propose SAGE, a VLM-based framework that enhances anomaly reasoning through Self-Guided Fact Enhancement (SFE) and Entropy-aware Direct Preference Optimization (E-DPO). SFE integrates domain-specific knowledge into visual reasoning via fact extraction and fusion, while E-DPO aligns model outputs with expert preferences using entropy-aware optimization. Additionally, we introduce AD-PL, a preference-optimized dataset tailored for industrial anomaly reasoning, consisting of 28,415 question-answering instances with expert-ranked responses. To evaluate anomaly reasoning models, we develop Multiscale Logical Evaluation (MLE), a quantitative framework analyzing model logic and consistency. SAGE demonstrates superior performance on industrial anomaly datasets under zero-shot and one-shot settings. The code, model and dataset are available at this https URL.
https://arxiv.org/abs/2507.07939
Given the rapid adoption of generative AI and its potential to impact a wide range of tasks, understanding the effects of AI on the economy is one of society's most important questions. In this work, we take a step toward that goal by analyzing the work activities people do with AI and how successfully and broadly those activities are done, and combining that with data on which occupations perform those activities. We analyze a dataset of 200k anonymized and privacy-scrubbed conversations between users and Microsoft Bing Copilot, a publicly available generative AI system. We find that the most common work activities people seek AI assistance for involve gathering information and writing, while the most common activities that AI itself performs are providing information and assistance, writing, teaching, and advising. Combining these activity classifications with measurements of task success and scope of impact, we compute an AI applicability score for each occupation. We find the highest AI applicability scores for knowledge-work occupation groups such as computer and mathematical, and office and administrative support, as well as occupations such as sales whose work activities involve providing and communicating information. Additionally, we characterize the types of work activities performed most successfully, how wage and education correlate with AI applicability, and how real-world usage compares to predictions of occupational AI impact.
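The score itself is only described informally here; a toy calculation along these lines (all numbers, weights, and field names are invented for illustration) conveys how activity-level measurements could roll up into an occupation-level score:

```python
# Hypothetical per-activity measurements: how successfully AI assists with the
# activity and how broad its scope of impact is (values are made up).
activities = {
    "gathering information": {"success": 0.80, "scope": 0.70},
    "writing":               {"success": 0.85, "scope": 0.75},
    "advising":              {"success": 0.70, "scope": 0.60},
}

def ai_applicability(activity_weights):
    """Weight each work activity of an occupation by measured AI task success
    and scope of impact, then sum (a simplified stand-in for the paper's score)."""
    return sum(w * activities[a]["success"] * activities[a]["scope"]
               for a, w in activity_weights.items())

# An occupation spending half its time gathering information, etc.
print(round(ai_applicability({"gathering information": 0.5,
                              "writing": 0.3,
                              "advising": 0.2}), 3))
```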
https://arxiv.org/abs/2507.07935
The past decade has seen incredible scaling of AI systems by a few companies, leading to inequality in AI model performance. This paper argues that, contrary to prevailing intuition, the diminishing returns to compute scaling will lead to a convergence of AI model capabilities. In other words, meek models (those with a limited computation budget) shall inherit the earth, approaching the performance level of the best models overall. We develop a model illustrating that, under a fixed-distribution next-token objective, the marginal capability returns to raw compute shrink substantially. Given current scaling practices, we argue that these diminishing returns are strong enough that even companies that can scale their models exponentially faster than other organizations will eventually have little advantage in capabilities. As part of our argument, we give several reasons why proxies like training-loss differences capture important capability measures, drawing on evidence from benchmark data and theoretical performance models. In addition, we analyze empirical data on the capability difference of AI models over time. Finally, in light of the increasing ability of meek models, we argue that AI strategy and policy require reexamination, and we outline the areas this shift will affect.
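The formal model is not reproduced in the abstract; the numeric sketch below only illustrates the qualitative claim using a generic power-law loss curve (the constants are placeholders, not the paper's fitted values): a fixed 100x compute advantage buys progressively less loss reduction as both parties scale.

```python
# Generic scaling-law form L(C) = L_inf + a * C**(-alpha); constants are
# illustrative placeholders only.
L_inf, a, alpha = 1.7, 2.0, 0.05

def loss(compute):
    return L_inf + a * compute ** (-alpha)

for c in [1e21, 1e23, 1e25]:
    meek, frontier = loss(c), loss(100 * c)       # frontier has 100x more compute
    print(f"compute={c:.0e}  loss gap to frontier={meek - frontier:.4f}")
# The printed gap shrinks as compute grows, i.e. diminishing returns narrow the
# capability difference between "meek" and heavily scaled models.
```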
https://arxiv.org/abs/2507.07931
Background: Public speaking is a vital professional skill, yet it remains a source of significant anxiety for many individuals. Traditional training relies heavily on expert coaching, but recent advances in AI have led to novel types of commercial automated public speaking feedback tools. However, most research has focused on prototypes rather than commercial applications, and little is known about how public speaking experts perceive these tools. Objectives: This study aims to evaluate expert opinions on the efficacy and design of commercial AI-based public speaking training tools and to propose guidelines for their improvement. Methods: The research involved 16 semi-structured interviews and 2 focus groups with public speaking experts. Participants discussed their views on current commercial tools, their potential integration into traditional coaching, and suggestions for enhancing these systems. Results and Conclusions: Experts acknowledged the value of AI tools in handling repetitive, technical aspects of training, allowing coaches to focus on higher-level skills. However, they found key issues in current tools, emphasising the need for personalised, understandable, carefully selected feedback and clear instructional design. Overall, they supported a hybrid model combining traditional coaching with AI-supported exercises.
https://arxiv.org/abs/2507.07930
Continuous, automated monitoring of laboratory mice enables more accurate data collection and improves animal welfare through real-time insights. Researchers can achieve a more dynamic and clinically relevant characterization of disease progression and therapeutic effects by integrating behavioral and physiological monitoring in the home cage. However, providing individual mouse metrics is difficult because of their housing density, similar appearances, high mobility, and frequent interactions. To address these challenges, we develop a real-time identification (ID) algorithm that accurately assigns ID predictions to mice wearing custom ear tags in digital home cages monitored by cameras. Our pipeline consists of three parts: (1) a custom multiple object tracker (MouseTracks) that combines appearance and motion cues from mice; (2) a transformer-based ID classifier (Mouseformer); and (3) a linear program that associates tracklets and assigns final ID predictions to them (MouseMap). Our models assign an animal ID based on custom ear tags at 30 frames per second with 24/7 cage coverage. We show that, compared to current mouse tracking methods, our custom tracking and ID pipeline improves tracking efficiency and lowers ID switches across mouse strains and various environmental factors.
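The abstract names the final stage only as a linear program over tracklets; one standard way to realize that kind of assignment (shown here as an independent sketch with made-up scores, not the MouseMap formulation) is a Hungarian-algorithm solve over a tracklet-versus-identity score matrix:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Rows = tracklets, columns = ear-tag identities; each entry is the mean
# ID-classifier probability for that pairing (values below are invented).
scores = np.array([
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.20, 0.30, 0.50],
])

rows, cols = linear_sum_assignment(-scores)   # negate to maximize total score
for t, ident in zip(rows, cols):
    print(f"tracklet {t} -> identity {ident} (score {scores[t, ident]:.2f})")
```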
https://arxiv.org/abs/2507.07929
Cerebrovascular pathology significantly contributes to cognitive decline and neurological disorders, underscoring the need for advanced tools to assess vascular integrity. Three-dimensional Time-of-Flight Magnetic Resonance Angiography (3D TOF MRA) is widely used to visualize cerebral vasculature; however, clinical evaluations generally focus on major arterial abnormalities, overlooking quantitative metrics critical for understanding subtle vascular changes. Existing methods for extracting structural, geometrical, and morphological arterial features from MRA - whether manual or automated - face challenges including user-dependent variability, steep learning curves, and a lack of standardized quantitative validation. We propose a novel semi-supervised artery evaluation framework, named ArteryX, a MATLAB-based toolbox that quantifies vascular features with high accuracy and efficiency, achieving processing times of ~10-15 minutes per subject at 0.5 mm resolution with minimal user intervention. ArteryX employs a vessel-fused, network-based landmarking approach to reliably track and manage tracings, effectively addressing the issue of dangling/disconnected vessels. Validation on human subjects with cerebral small vessel disease demonstrated its improved sensitivity to subtle vascular changes and better performance than an existing semi-automated method. Importantly, the ArteryX toolbox enables quantitative feature validation by integrating an in-vivo-like artery simulation framework utilizing vessel-fused graph nodes and predefined ground-truth features for specific artery types. Thus, the ArteryX framework holds promise for benchmarking feature extraction toolboxes and for seamless integration into clinical workflows, enabling early detection of cerebrovascular pathology and standardized comparisons across patient cohorts to advance understanding of vascular contributions to brain health.
https://arxiv.org/abs/2507.07920
The explosive growth of textual data over time presents a significant challenge in uncovering evolving themes and trends. Existing dynamic topic modeling techniques, while powerful, often exist in fragmented pipelines that lack robust support for interpretation and user-friendly exploration. We introduce DTECT (Dynamic Topic Explorer & Context Tracker), an end-to-end system that bridges the gap between raw textual data and meaningful temporal insights. DTECT provides a unified workflow that supports data preprocessing, multiple model architectures, and dedicated evaluation metrics to analyze the topic quality of temporal topic models. It significantly enhances interpretability by introducing LLM-driven automatic topic labeling, trend analysis via temporally salient words, interactive visualizations with document-level summarization, and a natural language chat interface for intuitive data querying. By integrating these features into a single, cohesive platform, DTECT empowers users to more effectively track and understand thematic dynamics. DTECT is open-source and available at this https URL.
https://arxiv.org/abs/2507.07910
Remote photoplethysmography (rPPG) has emerged as a promising non-invasive method for monitoring physiological signals using a camera. Although various domain adaptation and generalization methods have been proposed to promote the adaptability of deep rPPG models in unseen deployment environments, considerations such as privacy concerns and real-time adaptation restrict their application in real-world deployment. Thus, in this work we propose a novel fully Test-Time Adaptation (TTA) strategy tailored for rPPG tasks. Specifically, based on prior knowledge in physiology and our observations, we noticed that not only is there spatio-temporal consistency in the frequency domain of rPPG signals, but there is also significant inconsistency in the time domain. Given this, by leveraging both consistency and inconsistency priors, we introduce an innovative expert-knowledge-based self-supervised \textbf{C}onsistency-\textbf{i}n\textbf{C}onsistency-\textbf{i}ntegration (\textbf{CiCi}) framework to enhance model adaptation during inference. Besides, our approach further incorporates a gradient dynamic control mechanism to mitigate potential conflicts between priors, ensuring stable adaptation across instances. Through extensive experiments on five diverse datasets under the TTA protocol, our method consistently outperforms existing techniques, presenting state-of-the-art performance in real-time self-supervised adaptation without accessing source data. The code will be released later.
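The exact CiCi losses are not given in the abstract; purely as a sketch of the general test-time-adaptation pattern (the crop choice, frequency-domain agreement loss, and single-step update below are our assumptions), an adaptation step might look like:

```python
import torch

def test_time_adapt(model, clip, optimizer, steps=1):
    """Illustrative TTA loop, not the paper's CiCi objective: encourage the
    predicted rPPG signals of two overlapping temporal crops of the same clip
    to agree in the frequency domain, then predict with the adapted weights."""
    model.train()
    for _ in range(steps):
        a, b = clip[:, :-30], clip[:, 30:]            # two overlapping crops
        spec_a = torch.fft.rfft(model(a)).abs()       # frequency-domain view
        spec_b = torch.fft.rfft(model(b)).abs()
        n = min(spec_a.shape[-1], spec_b.shape[-1])
        loss = torch.mean((spec_a[..., :n] - spec_b[..., :n]) ** 2)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    model.eval()
    with torch.no_grad():
        return model(clip)                            # adapted prediction
```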
https://arxiv.org/abs/2507.07908
Tracking the strategic focus of companies through topics in their earnings calls is a key task in financial analysis. However, as industries evolve, traditional topic modeling techniques struggle to dynamically capture emerging topics and their relationships. In this work, we propose an LLM-agent-driven approach to discover and retrieve emerging topics from quarterly earnings calls. The agent extracts topics from documents, structures them into a hierarchical ontology, and establishes relationships between new and existing topics through a topic ontology. We demonstrate the use of extracted topics to infer company-level insights and emerging trends over time. We evaluate our approach by measuring ontology coherence, topic evolution accuracy, and its ability to surface emerging financial trends.
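At a high level, one agent step amounts to "extract, embed, then link or create"; the sketch below shows that loop with caller-supplied LLM and embedding callables (the prompt, threshold, and toy stand-ins are ours, not the paper's):

```python
import numpy as np

def update_ontology(transcript, ontology, call_llm, embed, tau=0.8):
    """Sketch of one agent step: extract topics with an LLM, then link each to
    its nearest existing ontology node or register it as an emerging topic."""
    raw = call_llm("List the key topics discussed in this earnings call:\n" + transcript)
    for topic in (t.strip("-• ").strip() for t in raw.splitlines() if t.strip()):
        v = embed(topic)
        best, sim = None, -1.0
        for name, vec in ontology.items():
            s = float(v @ vec / (np.linalg.norm(v) * np.linalg.norm(vec) + 1e-9))
            if s > sim:
                best, sim = name, s
        if best is not None and sim >= tau:
            print(f"'{topic}' linked to existing topic '{best}' (sim={sim:.2f})")
        else:
            ontology[topic] = v                       # new emerging topic node
    return ontology

# Toy demo with stand-in callables; a real system would plug in an LLM and an
# embedding model here.
demo_llm = lambda prompt: "supply chain costs\nAI data center capex"
demo_embed = lambda text: np.array([len(text), text.count("a"), text.count("c")], float)
seed = {"capital expenditure": demo_embed("capital expenditure")}
print(sorted(update_ontology("...", seed, demo_llm, demo_embed)))
```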
https://arxiv.org/abs/2507.07906
Accurate position estimation is essential for modern navigation systems deployed in autonomous platforms, including ground vehicles, marine vessels, and aerial drones. In this context, Visual Simultaneous Localisation and Mapping (VSLAM) - which includes Visual Odometry - relies heavily on the reliable extraction of salient feature points from the visual input data. In this work, we propose an embedded implementation of an unsupervised architecture capable of detecting and describing feature points. It is based on a quantised SuperPoint convolutional neural network. Our objective is to minimise the computational demands of the model while preserving high detection quality, thus facilitating efficient deployment on platforms with limited resources, such as mobile or embedded systems. We implemented the solution on an FPGA System-on-Chip (SoC) platform, specifically the AMD/Xilinx Zynq UltraScale+, where we evaluated the performance of Deep Learning Processing Units (DPUs) and we also used the Brevitas library and the FINN framework to perform model quantisation and hardware-aware optimisation. This allowed us to process 640 x 480 pixel images at up to 54 fps on an FPGA platform, outperforming state-of-the-art solutions in the field. We conducted experiments on the TUM dataset to demonstrate and discuss the impact of different quantisation techniques on the accuracy and performance of the model in a visual odometry task.
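The actual deployment relies on Brevitas/FINN tooling and DPU evaluation; to make the core idea concrete, here is only a generic symmetric weight-quantization sketch in plain PyTorch (bit-width, layer shape, and input size are arbitrary):

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Symmetric uniform quantization of a weight tensor (illustration only;
    the paper uses Brevitas/FINN rather than this code)."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.clamp(torch.round(w / scale), -qmax, qmax) * scale

conv = torch.nn.Conv2d(1, 64, kernel_size=3, padding=1)    # a SuperPoint-like first layer
with torch.no_grad():
    conv.weight.copy_(fake_quantize(conv.weight, bits=4))  # 4-bit weights
x = torch.randn(1, 1, 480, 640)                            # grayscale 640x480 frame
print(conv(x).shape)                                       # torch.Size([1, 64, 480, 640])
```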
https://arxiv.org/abs/2507.07903
Multimodal Large Language Models (MLLMs) have significantly advanced AI-assisted medical diagnosis, but they often generate factually inconsistent responses that deviate from established medical knowledge. Retrieval-Augmented Generation (RAG) enhances factual accuracy by integrating external sources, but it presents two key challenges. First, insufficient retrieval can miss critical information, whereas excessive retrieval can introduce irrelevant or misleading content, disrupting model output. Second, even when the model initially provides correct answers, over-reliance on retrieved data can lead to factual errors. To address these issues, we introduce the Multimodal Intelligent Retrieval and Augmentation (MIRA) framework, designed to optimize factual accuracy in MLLMs. MIRA consists of two key components: (1) a calibrated Rethinking and Rearrangement module that dynamically adjusts the number of retrieved contexts to manage factual risk, and (2) a medical RAG framework integrating image embeddings and a medical knowledge base with a query-rewrite module for efficient multimodal reasoning. This enables the model to effectively integrate both its inherent knowledge and external references. Our evaluation on publicly available medical VQA and report generation benchmarks demonstrates that MIRA substantially enhances factual accuracy and overall performance, achieving new state-of-the-art results. Code is released at this https URL.
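The Rethinking and Rearrangement module is described only at a high level; a deliberately simple stand-in rule (entirely our assumption, not MIRA's) illustrates what "dynamically adjusting the number of retrieved contexts to manage factual risk" can mean in code:

```python
import math

def adjust_retrieval(answer_probs, contexts, low=0.5, high=1.5):
    """Toy rule: keep few retrieved contexts when the model's answer
    distribution is confident, more when it is uncertain."""
    entropy = -sum(p * math.log(p + 1e-12) for p in answer_probs)
    if entropy < low:            # confident: lean on the model's own knowledge
        k = 1
    elif entropy < high:         # moderately uncertain
        k = 3
    else:                        # highly uncertain: bring in more evidence
        k = 8
    return contexts[:k]

print(adjust_retrieval([0.9, 0.05, 0.05], [f"ctx{i}" for i in range(10)]))  # ['ctx0']
```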
https://arxiv.org/abs/2507.07902
The rapid development of artificial intelligence has positioned large language models as fundamental components of intelligent legal systems. However, these models face significant limitations in legal dispute analysis, including insufficient legal knowledge representation, limited concept understanding, and reasoning deficiencies. This research proposes an enhanced framework integrating prompt engineering with multidimensional knowledge graphs. The framework introduces a three-stage hierarchical prompt structure comprising task definition, knowledge background, and reasoning guidance, supplemented by legal-specific reasoning templates and dynamic optimization mechanisms. A three-layer knowledge graph architecture is constructed with legal classification ontology, representation, and instance layers. Four complementary methods enable precise legal concept retrieval: direct legal norm code matching, domain-specific semantic vector similarity, ontology-based path reasoning, and specialized lexical segmentation. These components integrate with web search technology to establish a knowledge-enhanced framework for legal decision-making. Experimental results demonstrate significant performance improvements in legal dispute analysis, enabling accurate legal application analysis for complex cases while exhibiting nuanced understanding of judicial decision-making logic, providing a novel technical approach for implementing intelligent legal assistance systems.
https://arxiv.org/abs/2507.07893
Molecular dynamics (MD) simulations are an essential tool for understanding protein structure, dynamics, and function at the atomic level. However, preparing high-quality input files for MD simulations can be a time-consuming and error-prone process. In this work, we introduce an automated pipeline that leverages Large Language Models (LLMs), specifically Gemini 2.0 Flash, in conjunction with Python scripting and Selenium-based web automation to streamline the generation of MD input files. The pipeline exploits CHARMM GUI's comprehensive web-based interface for preparing simulation-ready inputs for NAMD. By integrating Gemini's code generation and iterative refinement capabilities, simulation scripts are automatically written, executed, and revised to navigate CHARMM GUI, extract appropriate parameters, and produce the required NAMD input files. Post-processing is performed using additional software to further refine the simulation outputs, thereby enabling a complete and largely hands-free workflow. Our results demonstrate that this approach reduces setup time, minimizes manual errors, and offers a scalable solution for handling multiple protein systems in parallel. This automated framework paves the way for broader application of LLMs in computational structural biology, offering a robust and adaptable platform for future developments in simulation automation.
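To make the automation step concrete, here is a minimal Selenium sketch of the kind of browser interaction the LLM-generated scripts perform; the URL path and element locators are placeholders of ours, not CHARMM GUI's actual form fields, and running it requires a local ChromeDriver.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    driver.get("https://www.charmm-gui.org/?doc=input")        # placeholder entry page
    pdb_box = driver.find_element(By.NAME, "pdb_id")           # hypothetical locator
    pdb_box.send_keys("1UBQ")                                  # example PDB entry
    driver.find_element(By.XPATH, "//input[@type='submit']").click()
    # Subsequent steps (solvation, ionization, NAMD input export) would be
    # navigated the same way, with the LLM revising selectors when a step fails.
finally:
    driver.quit()
```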
https://arxiv.org/abs/2507.07887
Existing pruning methods are typically applied during training or at compile time and often rely on structured sparsity. While compatible with low-power microcontrollers (MCUs), structured pruning underutilizes the opportunity for fine-grained efficiency on devices without SIMD support or parallel compute. To address these limitations, we introduce UnIT (Unstructured Inference-Time pruning), a lightweight method that dynamically identifies and skips unnecessary multiply-accumulate (MAC) operations during inference, guided by input-specific activation patterns. Unlike structured pruning, UnIT embraces irregular sparsity and does not require retraining or hardware specialization. It transforms pruning decisions into lightweight comparisons, replacing multiplications with threshold checks and approximated divisions. UnIT further optimizes compute by reusing threshold computations across multiple connections and applying layer- and group-specific pruning sensitivity. We present three fast, hardware-friendly division approximations tailored to the capabilities of common embedded platforms. Demonstrated on the MSP430 microcontroller, UnIT achieves 11.02% to 82.03% MAC reduction, 27.30% to 84.19% faster inference, and 27.33% to 84.38% lower energy consumption compared to training-time pruned models, while maintaining accuracy within 0.48-7%. Under domain shift, UnIT matches or exceeds the accuracy of retrained models while requiring significantly fewer MACs. These results establish unstructured inference-time pruning as a viable and practical solution for efficient, retraining-free deployment of deep neural networks on MCUs.
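The core mechanism, replacing multiplications with threshold checks, can be shown in a few lines; the sketch below is a naive reference version (the thresholds are a single toy constant here, whereas UnIT derives, reuses, and approximates them per layer and group):

```python
import numpy as np

def dense_with_skipping(x, W, b, thresholds):
    """Inference-time unstructured skipping: perform a MAC only when the input
    activation clears the per-connection threshold; otherwise pay only a compare."""
    out = b.copy()
    skipped = 0
    for j in range(W.shape[1]):                # output neurons
        for i in range(W.shape[0]):            # input connections
            if abs(x[i]) > thresholds[i, j]:
                out[j] += x[i] * W[i, j]       # MAC executed
            else:
                skipped += 1                   # comparison instead of a MAC
    return out, skipped

rng = np.random.default_rng(0)
x, W, b = rng.standard_normal(16), rng.standard_normal((16, 4)), np.zeros(4)
thr = np.full((16, 4), 0.5)                    # toy threshold: skip small activations
y, skipped = dense_with_skipping(x, W, b, thr)
print(y.round(3), f"skipped {skipped}/{W.size} MACs")
```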
https://arxiv.org/abs/2507.07885
Deep learning-based machine listening is broadening the scope of industrial acoustic analysis for applications like anomaly detection and predictive maintenance, thereby improving manufacturing efficiency and reliability. Nevertheless, its reliance on large, task-specific annotated datasets for every new task limits widespread implementation on shop floors. While emerging sound foundation models aim to alleviate data dependency, they are too large and computationally expensive, requiring cloud infrastructure or high-end hardware that is impractical for on-site, real-time deployment. We address this gap with LISTEN (Lightweight Industrial Sound-representable Transformer for Edge Notification), a kilobyte-sized industrial sound foundation model. Using knowledge distillation, LISTEN runs in real time on low-cost edge devices. On benchmark downstream tasks, it performs nearly identically to its much larger parent model, even when fine-tuned with minimal datasets and training resources. Beyond the model itself, we demonstrate its real-world utility by integrating LISTEN into a complete machine monitoring framework on an edge device with an Industrial Internet of Things (IIoT) sensor and system, validating its performance and generalization capabilities on a live manufacturing shop floor.
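The distillation recipe itself is not spelled out in the abstract; a generic soft-label distillation step (a sketch under our own assumptions, not LISTEN's training code) looks like this:

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, audio_batch, optimizer, T=2.0):
    """One knowledge-distillation step: the small student matches the frozen
    foundation teacher's softened outputs on unlabeled industrial audio."""
    with torch.no_grad():
        t_logits = teacher(audio_batch)
    s_logits = student(audio_batch)
    loss = F.kl_div(
        F.log_softmax(s_logits / T, dim=-1),
        F.softmax(t_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```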
https://arxiv.org/abs/2507.07879
Underwater image restoration algorithms seek to restore the color, contrast, and appearance of a scene that is imaged underwater. They are a critical tool in applications ranging from marine ecology and aquaculture to underwater construction and archaeology. While existing pixel-domain diffusion-based image restoration approaches are effective at restoring simple scenes with limited depth variation, they are computationally intensive and often generate unrealistic artifacts when applied to scenes with complex geometry and significant depth variation. In this work we overcome these limitations by combining a novel network architecture (SLURPP) with an accurate synthetic data generation pipeline. SLURPP combines pretrained latent diffusion models -- which encode strong priors on the geometry and depth of scenes -- with an explicit scene decomposition -- which allows one to model and account for the effects of light attenuation and backscattering. To train SLURPP we design a physics-based underwater image synthesis pipeline that applies varied and realistic underwater degradation effects to existing terrestrial image datasets. This approach enables the generation of diverse training data with dense medium/degradation annotations. We evaluate our method extensively on both synthetic and real-world benchmarks and demonstrate state-of-the-art performance. Notably, SLURPP is over 200X faster than existing diffusion-based methods while offering ~ 3 dB improvement in PSNR on synthetic benchmarks. It also offers compelling qualitative improvements on real-world data. Project website this https URL.
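The synthesis pipeline builds on the standard underwater image-formation model; the sketch below applies that model to a terrestrial RGB image and its depth map (the attenuation and backscatter coefficients are placeholders, not the paper's sampled water conditions):

```python
import numpy as np

def synthesize_underwater(J, depth, beta=(0.40, 0.20, 0.10), B=(0.05, 0.25, 0.35)):
    """I = J * exp(-beta * d) + B * (1 - exp(-beta * d)), per RGB channel:
    direct signal attenuated with depth plus backscattered veiling light."""
    beta = np.asarray(beta)[None, None, :]
    B = np.asarray(B)[None, None, :]
    t = np.exp(-beta * depth[..., None])       # transmission from scene depth
    return J * t + B * (1.0 - t)               # attenuation + backscatter

J = np.random.rand(480, 640, 3)                         # clean image in [0, 1]
depth = np.random.uniform(0.5, 10.0, (480, 640))        # metric depth map
I = synthesize_underwater(J, depth)
print(I.shape, float(I.min()), float(I.max()))
```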
https://arxiv.org/abs/2507.07878
Recent advances in Automatic Speech Recognition (ASR) have demonstrated remarkable accuracy and robustness in diverse audio applications, such as live transcription and voice command processing. However, deploying these models on resource-constrained edge devices (e.g., IoT devices, wearables) still presents substantial challenges due to strict limits on memory, compute, and power. Quantization, particularly Post-Training Quantization (PTQ), offers an effective way to reduce model size and inference cost without retraining. Despite its importance, the performance implications of various advanced quantization methods and bit-width configurations on ASR models remain unclear. In this work, we present a comprehensive benchmark of eight state-of-the-art (SOTA) PTQ methods applied to two leading edge-ASR model families, Whisper and Moonshine. We systematically evaluate model performance (i.e., accuracy, memory I/O, and bit operations) across seven diverse datasets from the open ASR leaderboard, analyzing the impact of quantization and various configurations on both weights and activations. Built on an extension of the LLM compression toolkit, our framework integrates edge-ASR models, diverse advanced quantization algorithms, a unified calibration and evaluation data pipeline, and detailed analysis tools. Our results characterize the trade-offs between efficiency and accuracy, demonstrating that even 3-bit quantization can succeed on high-capacity models when using advanced PTQ techniques. These findings provide valuable insights for optimizing ASR models on low-power, always-on edge devices.
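As a point of reference for the simplest end of the quantization spectrum, here is a bare-bones min-max PTQ sketch for one linear layer (a naive baseline of ours, not one of the eight advanced methods evaluated):

```python
import torch

def calibrate_scale(samples, bits):
    """Min-max calibration: map the observed range onto a signed integer grid."""
    return samples.abs().max() / (2 ** (bits - 1) - 1)

def quantize(t, bits, scale):
    qmax = 2 ** (bits - 1) - 1
    return torch.clamp(torch.round(t / scale), -qmax - 1, qmax) * scale

layer = torch.nn.Linear(384, 384)                        # a Whisper-sized projection
calib = torch.randn(64, 384)                             # calibration activations
w_scale = calibrate_scale(layer.weight.data, bits=4)     # 4-bit weights
a_scale = calibrate_scale(calib, bits=8)                 # 8-bit activations
with torch.no_grad():
    layer.weight.copy_(quantize(layer.weight, 4, w_scale))
out = layer(quantize(calib, 8, a_scale))
print(out.shape)                                         # torch.Size([64, 384])
```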
https://arxiv.org/abs/2507.07877