Positron emission tomography (PET) combined with computed tomography (CT) imaging is routinely used in cancer diagnosis and prognosis by providing complementary information. Automatically segmenting tumors in PET/CT images can significantly improve examination efficiency. Traditional multi-modal segmentation solutions mainly rely on concatenation operations for modality fusion, which fail to effectively model the non-linear dependencies between PET and CT modalities. Recent studies have investigated various approaches to optimize the fusion of modality-specific features for enhancing joint representations. However, modality-specific encoders used in these methods operate independently, inadequately leveraging the synergistic relationships inherent in PET and CT modalities, for example, the complementarity between semantics and structure. To address these issues, we propose a Hierarchical Adaptive Interaction and Weighting Network termed H2ASeg to explore the intrinsic cross-modal correlations and transfer potential complementary information. Specifically, we design a Modality-Cooperative Spatial Attention (MCSA) module that performs intra- and inter-modal interactions globally and locally. Additionally, a Target-Aware Modality Weighting (TAMW) module is developed to highlight tumor-related features within multi-modal features, thereby refining tumor segmentation. By embedding these modules across different layers, H2ASeg can hierarchically model cross-modal correlations, enabling a nuanced understanding of both semantic and structural tumor features. Extensive experiments demonstrate the superiority of H2ASeg, outperforming state-of-the-art methods on AutoPet-II and Hecktor2022 benchmarks. The code is released at this https URL.
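The intra- and inter-modal interaction at the heart of MCSA builds on standard cross-attention. A rough, generic sketch of that underlying pattern — not the authors' module; shapes, names, and the PET/CT role assignment are illustrative:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(query_feat, kv_feat):
    """Generic cross-attention: queries from one modality (e.g. PET)
    attend over the other modality's (e.g. CT) spatial positions.
    Both inputs are (N, d) flattened feature maps."""
    d = query_feat.shape[-1]
    attn = softmax(query_feat @ kv_feat.T / np.sqrt(d), axis=-1)
    return attn @ kv_feat  # (N, d) fused features

rng = np.random.default_rng(0)
pet, ct = rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
fused = cross_modal_attention(pet, ct)
print(fused.shape)  # (16, 8)
```

Each fused position is a convex combination of the other modality's features, which is how semantic cues from PET can be transferred onto structural positions in CT (and vice versa).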
https://arxiv.org/abs/2403.18339
User-generated data sources have gained significance in uncovering Adverse Drug Reactions (ADRs), with an increasing number of discussions occurring in the digital world. However, the existing clinical corpora predominantly revolve around scientific articles in English. This work presents a multilingual corpus of texts concerning ADRs gathered from diverse sources, including patient fora, social media, and clinical reports in German, French, and Japanese. Our corpus contains annotations covering 12 entity types, four attribute types, and 13 relation types. It contributes to the development of real-world multilingual language models for healthcare. We provide statistics to highlight certain challenges associated with the corpus and conduct preliminary experiments resulting in strong baselines for extracting entities and relations between these entities, both within and across languages.
https://arxiv.org/abs/2403.18336
This work proposes a physics-informed deep learning (PIDL)-based constitutive model for investigating the viscoelastic-viscoplastic behavior of short fiber-reinforced nanoparticle-filled epoxies under various ambient conditions. The deep-learning model is trained to enforce thermodynamic principles, leading to a thermodynamically consistent constitutive model. To accomplish this, a long short-term memory network is combined with a feed-forward neural network to predict internal variables required for characterizing the internal dissipation of the nanocomposite materials. In addition, another feed-forward neural network is used to represent the free-energy function, which enables defining the thermodynamic state of the entire system. The PIDL model is initially developed for the three-dimensional case by generating synthetic data from a classical constitutive model. The model is then trained on data extracted directly from cyclic loading-unloading experimental tests. Numerical examples show that the PIDL model can accurately predict the mechanical behavior of epoxy-based nanocomposites for different volume fractions of fibers and nanoparticles under various hygrothermal conditions.
https://arxiv.org/abs/2403.18310
Recommender systems have been actively studied and applied in various domains to deal with information overload. Although there are numerous studies on recommender systems for movies, music, and e-commerce, comparatively less attention has been paid to recommender systems for NFTs despite the continuous growth of the NFT market. This paper presents a recommender system for NFTs that utilizes a variety of data sources, from NFT transaction records to external item features, to generate precise recommendations that cater to individual preferences. We develop a data-efficient graph-based recommender system to efficiently capture the complex relationship between each item and users and generate node (item) embeddings which incorporate both node feature information and graph structure. Furthermore, we exploit inputs beyond user-item interactions, such as image, text, and price features. Numerical experiments verify that the performance of the graph-based recommender system improves significantly after utilizing all types of item features as side information, thereby outperforming all other baselines.
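The core of graph-based embeddings that "incorporate both node feature information and graph structure" is neighborhood aggregation. A minimal one-step mean aggregation, in the spirit of GraphSAGE-style propagation (not the paper's exact architecture):

```python
import numpy as np

def mean_aggregate(features, adj):
    """One propagation step: each node averages its own and its
    neighbors' features, so the resulting embedding mixes node
    attributes with graph structure."""
    A = adj + np.eye(adj.shape[0])      # add self-loops
    deg = A.sum(axis=1, keepdims=True)  # node degrees for the mean
    return (A @ features) / deg

# two connected items with scalar features 1 and 3 both average to 2
feats = np.array([[1.0], [3.0]])
adj = np.array([[0.0, 1.0], [1.0, 0.0]])
print(mean_aggregate(feats, adj))  # [[2.], [2.]]
```

Side information (image, text, price features) would simply be concatenated into `features` before aggregation.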
https://arxiv.org/abs/2403.18305
Convolutional Neural Networks (CNNs) have advanced significantly in visual representation learning and recognition. However, they face notable challenges in performance and computational efficiency when dealing with real-world, multi-scale image inputs. Conventional methods rescale all input images into a fixed size, wherein a larger fixed size favors performance, but rescaling small images to a larger size incurs digitization noise and increased computation cost. In this work, we carry out a comprehensive, layer-wise investigation of CNN models in response to scale variation, based on Centered Kernel Alignment (CKA) analysis. The observations reveal that lower layers are more sensitive to input image scale variations than high-level layers. Inspired by this insight, we propose the Multi-Scale Unified Network (MSUN), consisting of multi-scale subnets, a unified network, and a scale-invariant constraint. Our method divides the shallow layers into multi-scale subnets to enable feature extraction from multi-scale inputs, and the low-level features are unified in deep layers for extracting high-level semantic features. A scale-invariant constraint is imposed to maintain feature consistency across different scales. Extensive experiments on ImageNet and other scale-diverse datasets demonstrate that MSUN achieves significant improvements in both model performance and computational efficiency. Particularly, MSUN yields an accuracy increase of up to 44.53% and diminishes FLOPs by 7.01-16.13% in multi-scale scenarios.
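Linear CKA, the similarity index behind the layer-wise sensitivity analysis, reduces to a few lines of NumPy (function and variable names are ours):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between (n_samples, n_features)
    activation matrices taken from two layers on the same inputs."""
    X = X - X.mean(axis=0)   # center each feature dimension
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 32))
print(round(linear_cka(A, A), 6))      # 1.0: identical representations
print(round(linear_cka(A, 3 * A), 6))  # 1.0: invariant to isotropic scaling
```

Comparing a layer's activations on an image at two scales with this index is how scale sensitivity per layer can be quantified: values near 1 mean the representation barely changed.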
https://arxiv.org/abs/2403.18294
Transformer structure has achieved great success in multiple applied machine learning communities, such as natural language processing (NLP), computer vision (CV) and information retrieval (IR). Transformer architecture's core mechanism -- attention requires $O(n^2)$ time complexity in training and $O(n)$ time complexity in inference. Many works have been proposed to improve the attention mechanism's scalability, such as Flash Attention and Multi-query Attention. A different line of work aims to design new mechanisms to replace attention. Recently, a notable model structure -- Mamba, which is based on state space models, has achieved transformer-equivalent performance in multiple sequence modeling tasks. In this work, we examine Mamba's efficacy through the lens of a classical IR task -- document ranking. A reranker model takes a query and a document as input, and predicts a scalar relevance score. This task demands the language model's ability to comprehend lengthy contextual inputs and to capture the interaction between query and document tokens. We find that (1) Mamba models achieve competitive performance compared to transformer-based models with the same training recipe; (2) but also have a lower training throughput in comparison to efficient transformer implementations such as flash attention. We hope this study can serve as a starting point to explore Mamba models in other classical IR tasks. Our code implementation and trained checkpoints are made public to facilitate reproducibility (this https URL).
https://arxiv.org/abs/2403.18276
The quality of images captured outdoors is often affected by the weather. One factor that interferes with sight is rain, which can obstruct the view of observers and computer vision applications that rely on those images. The work aims to recover rain images by removing rain streaks via Self-supervised Reinforcement Learning (RL) for image deraining (SRL-Derain). We locate rain streak pixels from the input rain image via dictionary learning and use pixel-wise RL agents to take multiple inpainting actions to remove rain progressively. To our knowledge, this work is the first attempt where self-supervised RL is applied to image deraining. Experimental results on several benchmark image-deraining datasets show that the proposed SRL-Derain performs favorably against state-of-the-art few-shot and self-supervised deraining and denoising methods.
https://arxiv.org/abs/2403.18270
As drone technology advances, using unmanned aerial vehicles for aerial surveys has become the dominant trend in modern low-altitude remote sensing. The surge in aerial video data necessitates accurate prediction for future scenarios and motion states of the interested target, particularly in applications like traffic management and disaster response. Existing video prediction methods focus solely on predicting future scenes (video frames), suffering from the neglect of explicitly modeling target's motion states, which is crucial for aerial video interpretation. To address this issue, we introduce a novel task called Target-Aware Aerial Video Prediction, aiming to simultaneously predict future scenes and motion states of the target. Further, we design a model specifically for this task, named TAFormer, which provides a unified modeling approach for both video and target motion states. Specifically, we introduce Spatiotemporal Attention (STA), which decouples the learning of video dynamics into spatial static attention and temporal dynamic attention, effectively modeling the scene appearance and motion. Additionally, we design an Information Sharing Mechanism (ISM), which elegantly unifies the modeling of video and target motion by facilitating information interaction through two sets of messenger tokens. Moreover, to alleviate the difficulty of distinguishing targets in blurry predictions, we introduce Target-Sensitive Gaussian Loss (TSGL), enhancing the model's sensitivity to both target's position and content. Extensive experiments on UAV123VP and VisDroneVP (derived from single-object tracking datasets) demonstrate the exceptional performance of TAFormer in target-aware video prediction, showcasing its adaptability to the additional requirements of aerial video interpretation for target awareness.
https://arxiv.org/abs/2403.18238
Large-scale robotic policies trained on data from diverse tasks and robotic platforms hold great promise for enabling general-purpose robots; however, reliable generalization to new environment conditions remains a major challenge. Toward addressing this challenge, we propose a novel approach for uncertainty-aware deployment of pre-trained language-conditioned imitation learning agents. Specifically, we use temperature scaling to calibrate these models and exploit the calibrated model to make uncertainty-aware decisions by aggregating the local information of candidate actions. We implement our approach in simulation using three such pre-trained models, and showcase its potential to significantly enhance task completion rates. The accompanying code is accessible at the link: this https URL
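Temperature scaling itself is a one-line transform on the logits; the threshold check below is only a hypothetical illustration of acting on calibrated confidence, not the authors' aggregation scheme over candidate actions:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def calibrated_probs(logits, T):
    """Temperature scaling: divide logits by a scalar T > 0 fitted on
    held-out data; T > 1 softens overconfident predictions."""
    return softmax(logits / T)

logits = np.array([4.0, 1.0, 0.5])
raw = softmax(logits)
cal = calibrated_probs(logits, T=2.0)
print(raw.max() > cal.max())  # True: calibration reduces overconfidence
# act only when calibrated confidence clears a threshold (illustrative)
print(cal.max() > 0.5)
```

Because T is a single scalar, calibration never changes the argmax action, only how much confidence is attached to it — which is what makes it suitable for uncertainty-aware deployment of a frozen policy.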
https://arxiv.org/abs/2403.18222
Reinforcement learning (RL) algorithms have become indispensable tools in artificial intelligence, empowering agents to acquire optimal decision-making policies through interactions with their environment and feedback mechanisms. This study explores the performance of RL agents in both two-dimensional (2D) and three-dimensional (3D) environments, aiming to research the dynamics of learning across different spatial dimensions. A key aspect of this investigation is the absence of pre-made libraries for learning, with the algorithm developed exclusively through computational mathematics. The methodological framework centers on RL principles, employing a Q-learning agent class and distinct environment classes tailored to each spatial dimension. The research aims to address the question: How do reinforcement learning agents adapt and perform in environments of varying spatial dimensions, particularly in 2D and 3D settings? Through empirical analysis, the study evaluates agents' learning trajectories and adaptation processes, revealing insights into the efficacy of RL algorithms in navigating complex, multi-dimensional spaces. Reflections on the findings prompt considerations for future research, particularly in understanding the dynamics of learning in higher-dimensional environments.
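A minimal tabular Q-learning loop of the kind the study builds without pre-made libraries, here on a 1-D chain rather than the paper's 2D/3D environments; all hyperparameters are illustrative:

```python
import numpy as np

def train_q(n_states=5, episodes=300, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    """Tabular Q-learning on a 1-D chain: start at state 0, reward 1 on
    reaching the last state. Actions: 0 = step left, 1 = step right."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, 2))
    for _ in range(episodes):
        s = 0
        for _ in range(200):  # step cap per episode
            # epsilon-greedy with random tie-breaking
            explore = rng.random() < eps or Q[s, 0] == Q[s, 1]
            a = int(rng.integers(2)) if explore else int(Q[s].argmax())
            s2 = max(s - 1, 0) if a == 0 else min(s + 1, n_states - 1)
            r = 1.0 if s2 == n_states - 1 else 0.0
            Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])  # TD update
            s = s2
            if s == n_states - 1:
                break
    return Q

Q = train_q()
# the learned greedy policy moves right in every non-terminal state
print(all(int(Q[s].argmax()) == 1 for s in range(4)))  # True
```

Extending the same agent class to 2D or 3D only changes the state indexing and the action set, which is exactly the dimension comparison the study examines.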
https://arxiv.org/abs/2403.18219
Reinforcement learning (RL) has been widely used in decision-making tasks, but it cannot guarantee the agent's safety in the training process due to the requirements of interaction with the environment, which seriously limits its industrial applications such as autonomous driving. Safe RL methods are developed to handle this issue by constraining the expected safety violation costs as a training objective, but they still permit unsafe state occurrence, which is unacceptable in autonomous driving tasks. Moreover, these methods are difficult to achieve a balance between the cost and return expectations, which leads to learning performance degradation for the algorithms. In this paper, we propose a novel algorithm based on the long and short-term constraints (LSTC) for safe RL. The short-term constraint aims to guarantee the short-term state safety that the vehicle explores, while the long-term constraint ensures the overall safety of the vehicle throughout the decision-making process. In addition, we develop a safe RL method with dual-constraint optimization based on the Lagrange multiplier to optimize the training process for end-to-end autonomous driving. Comprehensive experiments were conducted on the MetaDrive simulator. Experimental results demonstrate that the proposed method achieves higher safety in continuous state and action tasks, and exhibits higher exploration performance in long-distance decision-making tasks compared with state-of-the-art methods.
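The dual-constraint optimization rests on the standard Lagrange-multiplier update: ascend the multiplier on constraint violation, descend the primal variable on the Lagrangian. A toy version on a convex problem (minimize x² subject to x ≥ 1, so L(x, λ) = x² + λ(1 − x), optimum x* = 1, λ* = 2) shows the primal-dual dynamics; the RL-specific cost estimates are not reproduced here:

```python
# Gradient descent on x, projected gradient ascent on the multiplier lam.
x, lam = 0.0, 0.0
lr = 0.05
for _ in range(2000):
    grad_x = 2 * x - lam                 # dL/dx
    x -= lr * grad_x
    lam = max(0.0, lam + lr * (1 - x))   # dL/dlam, kept non-negative
print(round(x, 2), round(lam, 2))  # converges to about 1.0 and 2.0
```

In safe RL the `(1 - x)` term is replaced by an estimated constraint violation (e.g. expected short- or long-term safety cost minus its budget), with one multiplier per constraint.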
https://arxiv.org/abs/2403.18209
There has been significant progress in implementing deep learning models in disease diagnosis using chest X-rays. Despite these advancements, inherent biases in these models can lead to disparities in prediction accuracy across protected groups. In this study, we propose a framework to achieve accurate diagnostic outcomes and ensure fairness across intersectional groups in high-dimensional chest X-ray multi-label classification. Transcending traditional protected attributes, we consider complex interactions within social determinants, enabling a more granular benchmark and evaluation of fairness. We present a simple and robust method that involves retraining the last classification layer of pre-trained models using a balanced dataset across groups. Additionally, we account for fairness constraints and integrate class-balanced fine-tuning for multi-label settings. The evaluation of our method on the MIMIC-CXR dataset demonstrates that our framework achieves an optimal tradeoff between accuracy and fairness compared to baseline methods.
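The balanced-dataset retraining step hinges on equalizing group counts before refitting the last layer. A hypothetical helper for the resampling part (names ours; the paper's exact balancing scheme may differ):

```python
import numpy as np

def balanced_indices(groups, rng):
    """Subsample so every (intersectional) group contributes the same
    number of examples: the size of the smallest group."""
    groups = np.asarray(groups)
    uniq = np.unique(groups)
    n = min(int((groups == g).sum()) for g in uniq)
    idx = np.concatenate(
        [rng.choice(np.where(groups == g)[0], size=n, replace=False)
         for g in uniq]
    )
    rng.shuffle(idx)
    return idx

rng = np.random.default_rng(0)
groups = ["a", "a", "a", "b", "b"]
idx = balanced_indices(groups, rng)
print(sorted(np.asarray(groups)[idx]))  # ['a', 'a', 'b', 'b']
```

The returned indices would select the features and labels used to retrain the final classification layer while the backbone stays frozen.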
https://arxiv.org/abs/2403.18196
Recent advances in interactive large language models like ChatGPT have revolutionized various domains; however, their behavior in natural and role-play conversation settings remains underexplored. In our study, we address this gap by deeply investigating how ChatGPT behaves during conversations in different settings by analyzing its interactions in both a normal way and a role-play setting. We introduce a novel dataset of broad range of human-AI conversations annotated with user motives and model naturalness to examine (i) how humans engage with the conversational AI model, and (ii) how natural are AI model responses. Our study highlights the diversity of user motives when interacting with ChatGPT and variable AI naturalness, showing not only the nuanced dynamics of natural conversations between humans and AI, but also providing new avenues for improving the effectiveness of human-AI communication.
https://arxiv.org/abs/2403.18121
Earthquake monitoring is necessary to promptly identify the affected areas, the severity of the events, and, finally, to estimate damages and plan the actions needed for the restoration process. The use of seismic stations to monitor the strength and origin of earthquakes is limited when dealing with remote areas (we cannot have global capillary coverage). Identification and analysis of all affected areas is mandatory to support areas not monitored by traditional stations. Using social media images in crisis management has proven effective in various situations. However, they are still limited by the possibility of using communication infrastructures in case of an earthquake and by the presence of people in the area. Moreover, social media images and messages cannot be used to estimate the actual severity of earthquakes and their characteristics effectively. The employment of satellites to monitor changes around the globe grants the possibility of exploiting instrumentation that is not limited by the visible spectrum, the presence of land infrastructures, and people in the affected areas. In this work, we propose a new dataset composed of images taken from Sentinel-1 and a new series of tasks to help monitor earthquakes from a new detailed view. Coupled with the data, we provide a series of traditional machine learning and deep learning models as baselines to assess the effectiveness of ML-based models in earthquake analysis.
https://arxiv.org/abs/2403.18116
The Segment Anything Model (SAM) has drawn significant attention from researchers who work on medical image segmentation because of its generalizability. However, researchers have found that SAM may have limited performance on medical images compared to state-of-the-art non-foundation models. Regardless, the community sees potential in extending, fine-tuning, modifying, and evaluating SAM for analysis of medical imaging. An increasing number of works have been published focusing on the mentioned four directions, where variants of SAM are proposed. To this end, a unified platform helps push the boundary of the foundation model for medical images, facilitating the use, modification, and validation of SAM and its variants in medical image segmentation. In this work, we introduce SAMM Extended (SAMME), a platform that integrates new SAM variant models, adopts faster communication protocols, accommodates new interactive modes, and allows for fine-tuning of subcomponents of the models. These features can expand the potential of foundation models like SAM, and the results can be translated to applications such as image-guided therapy, mixed reality interaction, robotic navigation, and data augmentation.
https://arxiv.org/abs/2403.18114
Numerous works concerning head pose estimation (HPE) offer algorithms or propose neural network-based approaches for extracting Euler angles from either facial key points or directly from images of the head region. However, many works fail to provide clear definitions of the coordinate systems and Euler or Tait-Bryan angle orders in use. It is a well-known fact that rotation matrices depend on coordinate systems, and yaw, roll, and pitch angles are sensitive to their application order. Without precise definitions, it becomes challenging to validate the correctness of the output head pose and the drawing routines employed in prior works. In this paper, we thoroughly examine the Euler angles defined in the 300W-LP dataset, head pose estimation methods such as 3DDFA-v2, 6D-RepNet, and WHENet, and the validity of their drawing routines for the Euler angles. When necessary, we infer the coordinate system and the sequence of yaw, roll, pitch from the provided code. This paper presents (1) code and algorithms for inferring the coordinate system from provided source code, code for the Euler angle application order, and code for extracting precise rotation matrices and Euler angles, (2) code and algorithms for converting poses from one rotation system to another, (3) novel formulae for 2D augmentations of the rotation matrices, and (4) derivations and code for the correct drawing routines for rotation matrices and poses. This paper also addresses the feasibility of defining rotations with the right-handed coordinate system used in Wikipedia and SciPy, which makes Euler angle extraction much easier for full-range head pose research.
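To make the convention issue concrete: under one common choice — intrinsic yaw-pitch-roll (Z-Y-X) order in a right-handed coordinate system — composition and extraction look like this. Other conventions in the surveyed works need different formulas, which is exactly the paper's point:

```python
import numpy as np

def rot_zyx(yaw, pitch, roll):
    """R = Rz(yaw) @ Ry(pitch) @ Rx(roll), right-handed axes."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    return Rz @ Ry @ Rx

def euler_from_zyx(R):
    """Inverse of rot_zyx, valid away from gimbal lock |pitch| = pi/2."""
    pitch = -np.arcsin(R[2, 0])
    yaw = np.arctan2(R[1, 0], R[0, 0])
    roll = np.arctan2(R[2, 1], R[2, 2])
    return yaw, pitch, roll

R = rot_zyx(0.3, -0.2, 0.5)
print(np.allclose(euler_from_zyx(R), (0.3, -0.2, 0.5)))  # True: round trip
```

Feeding angles extracted under this convention into a drawing routine written for a different axis handedness or application order silently produces a wrong head pose, with no error raised — hence the need for the inference tools the paper provides.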
https://arxiv.org/abs/2403.18104
Although mobile robots have on-board sensors to perform navigation, their efficiency in completing paths can be enhanced by planning to avoid human interaction. Infrastructure cameras can capture human activity continuously for the purpose of compiling activity analytics to choose efficient times and routes. We describe a cascade temporal filtering method to efficiently extract short- and long-term activity in two time dimensions, isochronal and chronological, for use in global path planning and local navigation respectively. The temporal filter has application either independently, or, if object recognition is also required, it can be used as a pre-filter to perform activity-gating of the more computationally expensive neural network processing. For a testbed 32-camera network, we show how this hybrid approach can achieve over 8 times improvement in frames per second throughput and 6.5 times reduction of system power use. We also show how the cost map of static objects in the ROS robot software development framework is augmented with dynamic regions determined from the temporal filter.
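The activity-gating idea — run the expensive neural network only when recent frames actually change — can be sketched with simple frame differencing. This is an illustration of the gating pattern, not the paper's cascade temporal filter; the threshold and names are ours:

```python
import numpy as np

def activity_gate(frames, thresh=10.0):
    """Return indices of frames whose mean absolute difference from the
    previous frame exceeds thresh; only these frames would be passed on
    to the costly object-recognition stage."""
    active, prev = [], None
    for i, f in enumerate(frames):
        cur = f.astype(float)
        if prev is not None and np.abs(cur - prev).mean() > thresh:
            active.append(i)
        prev = cur
    return active

quiet = np.zeros((8, 8), dtype=np.uint8)
busy = np.full((8, 8), 200, dtype=np.uint8)
print(activity_gate([quiet, quiet, busy, quiet]))  # [2, 3]
```

Dropping static frames before neural-network processing is what drives the reported throughput and power gains: in a mostly quiet scene, the detector runs on a small fraction of frames.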
https://arxiv.org/abs/2403.18096
Hand function is critical for our interactions and quality of life. Spinal cord injuries (SCI) can impair hand function, reducing independence. A comprehensive evaluation of function in home and community settings requires a hand grasp taxonomy for individuals with impaired hand function. Developing such a taxonomy is challenging due to unrepresented grasp types in standard taxonomies, uneven data distribution across injury levels, and limited data. This study aims to automatically identify the dominant distinct hand grasps in egocentric video using semantic clustering. Egocentric video recordings collected in the homes of 19 individuals with cervical SCI were used to cluster grasping actions with semantic significance. A deep learning model integrating posture and appearance data was employed to create a personalized hand taxonomy. Quantitative analysis reveals a cluster purity of 67.6% ± 24.2% with 18.0% ± 21.8% redundancy. Qualitative assessment revealed meaningful clusters in video content. This methodology provides a flexible and effective strategy to analyze hand function in the wild. It offers researchers and clinicians an efficient tool for evaluating hand function, aiding sensitive assessments and tailored intervention plans.
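Cluster purity, the headline metric above, has a standard definition: each cluster votes for its majority label, and the matched counts are summed over clusters. A minimal sketch:

```python
from collections import Counter

def cluster_purity(labels_true, labels_pred):
    """Sum each cluster's majority-class count, divide by total samples."""
    total = 0
    for c in set(labels_pred):
        members = [t for t, p in zip(labels_true, labels_pred) if p == c]
        total += Counter(members).most_common(1)[0][1]
    return total / len(labels_true)

# cluster 0 holds {A, A, B} (majority count 2), cluster 1 holds {B} (count 1)
print(cluster_purity(["A", "A", "B", "B"], [0, 0, 0, 1]))  # 0.75
```

Note purity alone rewards over-segmentation (many tiny clusters score high), which is why the study also reports redundancy — the fraction of extra clusters that repeat an already-covered grasp type.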
https://arxiv.org/abs/2403.18094
Video repetition counting infers the number of repetitions of recurring actions or motion within a video. We propose an exemplar-based approach that discovers visual correspondence of video exemplars across repetitions within target videos. Our proposed Every Shot Counts (ESCounts) model is an attention-based encoder-decoder that encodes videos of varying lengths alongside exemplars from the same and different videos. In training, ESCounts regresses locations of high correspondence to the exemplars within the video. In tandem, our method learns a latent that encodes representations of general repetitive motions, which we use for exemplar-free, zero-shot inference. Extensive experiments over commonly used datasets (RepCount, Countix, and UCFRep) showcase ESCounts obtaining state-of-the-art performance across all three datasets. On RepCount, ESCounts increases the off-by-one accuracy from 0.39 to 0.56 and decreases the mean absolute error from 0.38 to 0.21. Detailed ablations further demonstrate the effectiveness of our method.
https://arxiv.org/abs/2403.18074
Recently, there have been significant advancements in large language models (LLMs), particularly focused on the English language. These advancements have enabled these LLMs to understand and execute complex instructions with unprecedented accuracy and fluency. However, despite these advancements, there remains a noticeable gap in the development of Chinese instruction tuning. The unique linguistic features and cultural depth of the Chinese language pose challenges for instruction tuning tasks. Existing datasets are either derived from English-centric LLMs or are ill-suited for aligning with the interaction patterns of real-world Chinese users. To bridge this gap, we introduce COIG-CQIA, a high-quality Chinese instruction tuning dataset. Our aim is to build a diverse, wide-ranging instruction-tuning dataset to better align model behavior with human interactions. To this end, we collect a high-quality human-written corpus from various sources on the Chinese Internet, including Q&A communities, Wikis, examinations, and existing NLP datasets. This corpus was rigorously filtered and carefully processed to form the COIG-CQIA dataset. Furthermore, we train models of various scales on different subsets of CQIA, following in-depth evaluation and analyses. The findings from our experiments offer valuable insights for selecting and developing Chinese instruction-tuning datasets. We also find that models trained on CQIA-Subset achieve competitive results in human assessment as well as knowledge and security benchmarks. Data are available at this https URL
https://arxiv.org/abs/2403.18058