The interactions between tumor cells and the tumor microenvironment (TME) dictate the therapeutic efficacy of radiation and many systemic therapies in breast cancer. However, to date, there is no widely available method to reproducibly measure tumor and immune phenotypes for each patient's tumor. Given this unmet clinical need, we applied multiple instance learning (MIL) algorithms to assess activity of ten biologically relevant pathways from the hematoxylin and eosin (H&E) slide of primary breast tumors. We employed different feature extraction approaches and state-of-the-art model architectures. Using binary classification, our models attained area under the receiver operating characteristic (AUROC) scores above 0.70 for nearly all gene expression pathways and in some cases exceeded 0.80. Attention maps suggest that our trained models recognize biologically relevant spatial patterns of cell sub-populations from H&E. These efforts represent a first step towards developing computational H&E biomarkers that reflect facets of the TME and hold promise for augmenting precision oncology.
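The abstract does not include code; as a rough illustration of the core mechanism, attention-based MIL pooling over slide tiles (in the style of Ilse et al., not necessarily the authors' exact architecture; all names and dimensions here are illustrative) can be sketched as:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_mil_pool(instance_feats, V, w):
    """Attention-based MIL pooling: score each tile, normalize the scores,
    and form a weighted slide-level embedding. The attention weights are
    what an 'attention map' visualizes over the H&E slide.
    instance_feats: (n_tiles, d) tile embeddings from a feature extractor.
    V: (d, h) projection, w: (h,) attention vector (toy random weights)."""
    scores = np.tanh(instance_feats @ V) @ w      # (n_tiles,)
    attn = softmax(scores)                        # weights sum to 1
    bag = attn @ instance_feats                   # (d,) slide-level embedding
    return bag, attn

rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 8))   # 5 tiles, 8-dim features
V = rng.normal(size=(8, 4))
w = rng.normal(size=4)
bag, attn = attention_mil_pool(feats, V, w)
```

The bag embedding would then feed a binary classifier per pathway; in practice the attention parameters are trained end-to-end rather than sampled.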
https://arxiv.org/abs/2404.16397
Learning from Demonstration (LfD) stands as an efficient framework for imparting human-like skills to robots. Nevertheless, designing an LfD framework capable of seamlessly imitating, generalizing, and reacting to disturbances for long-horizon manipulation tasks in dynamic environments remains a challenge. To tackle this challenge, we present Logic Dynamic Movement Primitives (Logic-DMP), which combines Task and Motion Planning (TAMP) with an optimal control formulation of DMP, allowing us to incorporate motion-level via-point specifications and to handle task-level variations or disturbances in dynamic environments. We conduct a comparative analysis of our proposed approach against several baselines, evaluating its generalization ability and reactivity across three long-horizon manipulation tasks. Our experiments demonstrate the fast generalization and reactivity of Logic-DMP for handling task-level variants and disturbances in long-horizon manipulation tasks.
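For readers unfamiliar with DMPs, a minimal single-DOF rollout of the standard discrete DMP (a critically damped point attractor shaped by a phase-gated forcing term; gains and the zero forcing function are illustrative, and this is not the paper's optimal control formulation) looks like:

```python
import numpy as np

def dmp_rollout(y0, goal, f=lambda x: 0.0, tau=1.0, dt=0.01, T=2.0,
                alpha=25.0, beta=6.25, alpha_x=3.0):
    """Integrate a 1-DOF discrete DMP with explicit Euler steps.
    The transformation system is a spring-damper pulled toward `goal`,
    modulated by forcing term f(x) that fades as the canonical phase x
    decays. With f == 0 this reduces to a pure point attractor."""
    y, v, x = y0, 0.0, 1.0
    traj = [y]
    for _ in range(int(T / dt)):
        fx = f(x) * x * (goal - y0)               # forcing, gated by phase
        a = alpha * (beta * (goal - y) - v) + fx  # transformation system
        v += a * dt / tau
        y += v * dt / tau
        x += -alpha_x * x * dt / tau              # canonical system decays
        traj.append(y)
    return np.array(traj)

traj = dmp_rollout(0.0, 1.0)   # start at 0, converge to goal 1
```

Logic-DMP's contribution sits above this primitive: TAMP sequences such primitives, while the optimal control view admits via-points and online replanning under disturbances.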
https://arxiv.org/abs/2404.16138
Benefiting from strong generalization ability, pre-trained vision language models (VLMs), e.g., CLIP, have been widely utilized in zero-shot scene understanding. Unlike simple recognition tasks, grounded situation recognition (GSR) requires the model not only to classify salient activity (verb) in the image, but also to detect all semantic roles that participate in the action. This complex task usually involves three steps: verb recognition, semantic role grounding, and noun recognition. Directly employing class-based prompts with VLMs and grounding models for this task suffers from several limitations, e.g., it struggles to distinguish ambiguous verb concepts, accurately localize roles with fixed verb-centric template input, and achieve context-aware noun predictions. In this paper, we argue that these limitations stem from the model's poor understanding of verb/noun classes. To this end, we introduce a new approach for zero-shot GSR via Language EXplainer (LEX), which significantly boosts the model's comprehensive capabilities through three explainers: 1) verb explainer, which generates general verb-centric descriptions to enhance the discriminability of different verb classes; 2) grounding explainer, which rephrases verb-centric templates for clearer understanding, thereby enhancing precise semantic role localization; and 3) noun explainer, which creates scene-specific noun descriptions to ensure context-aware noun recognition. By equipping each step of the GSR process with an auxiliary explainer, LEX facilitates complex scene understanding in real-world scenarios. Our extensive validations on the SWiG dataset demonstrate LEX's effectiveness and interoperability in zero-shot GSR.
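The class-based prompting baseline LEX improves on can be sketched with plain cosine similarity between one image embedding and one text embedding per class prompt (a CLIP-style scoring stub with random toy embeddings, not the authors' code):

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def zero_shot_scores(img_emb, text_embs, temperature=0.01):
    """CLIP-style zero-shot classification: cosine similarity between an
    image embedding and each class-prompt embedding, softmax over classes.
    LEX replaces the raw class prompts with generated verb/noun
    descriptions to sharpen these similarities."""
    sims = l2norm(text_embs) @ l2norm(img_emb)        # (n_classes,)
    logits = sims / temperature
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(1)
img = rng.normal(size=64)
prompts = rng.normal(size=(3, 64))                    # e.g. 3 verb classes
prompts[2] = img + 0.1 * rng.normal(size=64)          # make class 2 match
probs = zero_shot_scores(img, prompts)
```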
https://arxiv.org/abs/2404.15785
Report generation models offer fine-grained textual interpretations of medical images like chest X-rays, yet they often lack interactivity (i.e. the ability to steer the generation process through user queries) and localized interpretability (i.e. visually grounding their predictions), which we deem essential for future adoption in clinical practice. While there have been efforts to tackle these issues, they are either limited in their interactivity by not supporting textual queries or fail to also offer localized interpretability. Therefore, we propose a novel multitask architecture and training paradigm integrating textual prompts and bounding boxes for diverse aspects like anatomical regions and pathologies. We call this approach the Chest X-Ray Explainer (ChEX). Evaluations across a heterogeneous set of 9 chest X-ray tasks, including localized image interpretation and report generation, showcase its competitiveness with SOTA models while additional analysis demonstrates ChEX's interactive capabilities.
https://arxiv.org/abs/2404.15770
Video anomaly detection (VAD) is a challenging task aiming to recognize anomalies in video frames, and existing large-scale VAD research primarily focuses on road traffic and human activity scenes. In industrial scenes, there are often a variety of unpredictable anomalies, and VAD methods can play a significant role in these scenarios. However, there is a lack of applicable datasets and methods specifically tailored for industrial production scenarios due to concerns regarding privacy and security. To bridge this gap, we propose a new dataset, IPAD, specifically designed for VAD in industrial scenarios. The industrial processes in our dataset are chosen through on-site factory research and discussions with engineers. This dataset covers 16 different industrial devices and contains over 6 hours of both synthetic and real-world video footage. Moreover, we annotate the key feature of the industrial process, i.e., periodicity. Based on the proposed dataset, we introduce a period memory module and a sliding window inspection mechanism to effectively exploit the periodic information within a basic reconstruction model. Our framework leverages a LoRA adapter to explore the effective migration of pretrained models, which are initially trained on synthetic data, into real-world scenarios. Our proposed dataset and method will fill the gap in the field of industrial video anomaly detection and advance video understanding tasks as well as smart factory deployment.
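The intuition behind a period memory plus sliding-window inspection can be conveyed on a 1-D toy signal (this simplified stand-in keeps a running memory of one "normal" cycle and flags windows whose reconstruction error spikes; thresholds and the EMA update are illustrative, not the paper's modules):

```python
import numpy as np

def periodic_window_flags(signal, period, threshold=0.5):
    """Split the signal into non-overlapping windows of one period, keep a
    memory of the normal cycle, and flag windows whose mean absolute
    reconstruction error against that memory exceeds `threshold`."""
    n = len(signal) // period
    flags, memory = [], None
    for k in range(n):
        w = np.asarray(signal[k * period:(k + 1) * period], dtype=float)
        if memory is None:
            memory = w.copy()
            flags.append(False)          # first cycle seeds the memory
            continue
        err = np.abs(w - memory).mean()  # per-window reconstruction error
        flags.append(err > threshold)
        if err <= threshold:             # only normal windows update memory
            memory = 0.9 * memory + 0.1 * w
    return flags

t = np.arange(100)
sig = np.sin(2 * np.pi * t / 10)   # period-10 process
sig[53:57] += 3.0                  # inject an anomaly into the 6th cycle
flags = periodic_window_flags(sig, period=10)
```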
https://arxiv.org/abs/2404.15033
Driver activity classification is crucial for ensuring road safety, with applications ranging from driver assistance systems to autonomous vehicle control transitions. In this paper, we present a novel approach leveraging generalizable representations from vision-language models for driver activity classification. Our method employs a Semantic Representation Late Fusion Neural Network (SRLF-Net) to process synchronized video frames from multiple perspectives. Each frame is encoded using a pretrained vision-language encoder, and the resulting embeddings are fused to generate class probability predictions. By leveraging contrastively-learned vision-language representations, our approach achieves robust performance across diverse driver activities. We evaluate our method on the Naturalistic Driving Action Recognition Dataset, demonstrating strong accuracy across many classes. Our results suggest that vision-language representations offer a promising avenue for driver monitoring systems, providing both accuracy and interpretability through natural language descriptors.
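Late fusion of per-view predictions is simple enough to sketch directly (one plausible reading of the fusion step: softmax each view's logits, then average across views; the real SRLF-Net fusion head may differ):

```python
import numpy as np

def late_fuse(per_view_logits):
    """Late fusion over synchronized camera views: convert each view's
    class logits to probabilities, then average the distributions."""
    per_view_logits = np.asarray(per_view_logits, dtype=float)
    e = np.exp(per_view_logits - per_view_logits.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)   # (n_views, n_classes)
    return probs.mean(axis=0)                  # fused class distribution

fused = late_fuse([[2.0, 0.1, 0.1],    # e.g. dashboard view
                   [1.5, 0.2, 0.1],    # rear view agrees
                   [0.3, 0.2, 2.5]])   # side view disagrees
```

Averaging distributions rather than picking one view makes the prediction robust when a single perspective is occluded or ambiguous.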
https://arxiv.org/abs/2404.14906
In the field of fraud detection, the availability of comprehensive and privacy-compliant datasets is crucial for advancing machine learning research and developing effective anti-fraud systems. Traditional datasets often focus on transaction-level information, which, while useful, overlooks the broader context of customer behavior patterns that are essential for detecting sophisticated fraud schemes. The scarcity of such data, primarily due to privacy concerns, significantly hampers the development and testing of predictive models that can operate effectively at the customer level. Addressing this gap, our study introduces a benchmark containing structured datasets specifically designed for customer-level fraud detection. The benchmark not only adheres to strict privacy guidelines to ensure user confidentiality but also provides a rich source of information by encapsulating customer-centric features. It allows for the comprehensive evaluation of various machine learning models, facilitating a deeper understanding of their strengths and weaknesses in predicting fraudulent activities. Through this work, we seek to bridge the existing gap in data availability, offering researchers and practitioners a valuable resource that empowers the development of next-generation fraud detection techniques.
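The transaction-level vs. customer-level distinction is concrete: customer-level modeling starts by rolling transaction rows up into per-customer behavioural features. A minimal sketch (field names and aggregates are illustrative, not the benchmark's schema):

```python
from collections import defaultdict

def customer_features(transactions):
    """Aggregate transaction-level rows into customer-level features
    (counts, totals, max amount), the kind of representation a
    customer-level fraud benchmark enables."""
    agg = defaultdict(lambda: {"n_tx": 0, "total": 0.0, "max_amount": 0.0})
    for tx in transactions:
        c = agg[tx["customer_id"]]
        c["n_tx"] += 1
        c["total"] += tx["amount"]
        c["max_amount"] = max(c["max_amount"], tx["amount"])
    return dict(agg)

feats = customer_features([
    {"customer_id": "A", "amount": 10.0},
    {"customer_id": "A", "amount": 250.0},
    {"customer_id": "B", "amount": 5.0},
])
```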
https://arxiv.org/abs/2404.14746
In this paper, we investigate a new problem called narrative action evaluation (NAE). NAE aims to generate professional commentary that evaluates the execution of an action. Unlike traditional tasks such as score-based action quality assessment and video captioning involving superficial sentences, NAE focuses on creating detailed narratives in natural language. These narratives provide intricate descriptions of actions along with objective evaluations. NAE is a more challenging task because it requires both narrative flexibility and evaluation rigor. One existing possible solution is to use multi-task learning, where narrative language and evaluative information are predicted separately. However, this approach results in reduced performance for individual tasks because of variations between tasks and differences in modality between language information and evaluation information. To address this, we propose a prompt-guided multimodal interaction framework. This framework utilizes a pair of transformers to facilitate the interaction between different modalities of information. It also uses prompts to transform the score regression task into a video-text matching task, thus enabling task interactivity. To support further research in this field, we re-annotate the MTL-AQA and FineGym datasets with high-quality and comprehensive action narration. Additionally, we establish benchmarks for NAE. Extensive experiment results prove that our method outperforms separate learning methods and naive multi-task learning methods. Data and code are released at \href{this https URL }{here}.
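The "prompts turn score regression into video-text matching" idea can be illustrated with a toy: one prompt embedding per score bin, with the prediction as a similarity-weighted average of bin values (embeddings, bins, and the sharpening factor are all illustrative; the paper's mechanism operates inside its transformer framework):

```python
import numpy as np

def score_via_matching(video_emb, bin_prompts, bin_values):
    """Score regression recast as matching: compare the video embedding
    against one text-prompt embedding per score bin (e.g. 'a poor dive'
    ... 'a near-perfect dive') and average the bin values by similarity."""
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    sims = norm(bin_prompts) @ norm(video_emb)   # cosine per bin
    w = np.exp(5 * sims)                         # sharpen
    w /= w.sum()                                 # normalise to weights
    return float(w @ bin_values)

rng = np.random.default_rng(2)
prompts = rng.normal(size=(4, 16))
values = np.array([2.5, 5.0, 7.5, 10.0])
video = prompts[3] + 0.05 * rng.normal(size=16)  # resembles the top bin
score = score_via_matching(video, prompts, values)
```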
https://arxiv.org/abs/2404.14471
The user base of short video apps has experienced unprecedented growth in recent years, resulting in a significant demand for video content analysis. In particular, text-video retrieval, which aims to find the top matching videos given text descriptions from a vast video corpus, is an essential function, the primary challenge of which is to bridge the modality gap. Nevertheless, most existing approaches treat texts merely as discrete tokens and neglect their syntax structures. Moreover, the abundant spatial and temporal clues in videos are often underutilized due to the lack of interaction with text. To address these issues, we argue that using texts as guidance to focus on relevant temporal frames and spatial regions within videos is beneficial. In this paper, we propose a novel Syntax-Hierarchy-Enhanced text-video retrieval method (SHE-Net) that exploits the inherent semantic and syntax hierarchy of texts to bridge the modality gap from two perspectives. First, to facilitate a more fine-grained integration of visual content, we employ the text syntax hierarchy, which reveals the grammatical structure of text descriptions, to guide the visual representations. Second, to further enhance the multi-modal interaction and alignment, we also utilize the syntax hierarchy to guide the similarity calculation. We evaluated our method on four public text-video retrieval datasets of MSR-VTT, MSVD, DiDeMo, and ActivityNet. The experimental results and ablation studies confirm the advantages of our proposed method.
https://arxiv.org/abs/2404.14066
Understanding cognitive processes in the brain demands sophisticated models capable of replicating neural dynamics at large scales. We present a physiologically inspired speech recognition architecture, compatible and scalable with deep learning frameworks, and demonstrate that end-to-end gradient descent training leads to the emergence of neural oscillations in the central spiking neural network. Significant cross-frequency couplings, indicative of these oscillations, are measured within and across network layers during speech processing, whereas no such interactions are observed when handling background noise inputs. Furthermore, our findings highlight the crucial inhibitory role of feedback mechanisms, such as spike frequency adaptation and recurrent connections, in regulating and synchronising neural activity to improve recognition performance. Overall, on top of developing our understanding of synchronisation phenomena notably observed in the human auditory pathway, our architecture exhibits dynamic and efficient information processing, with relevance to neuromorphic technology.
https://arxiv.org/abs/2404.14024
In traditional human living environment landscape design, the establishment of three-dimensional models is an essential step for designers to intuitively present the spatial relationships of design elements, as well as a foundation for conducting landscape analysis on the site. Rapidly and effectively generating beautiful and realistic landscape spaces is a significant challenge faced by designers. Although generative design has been widely applied in related fields, existing approaches mostly generate three-dimensional models through the restriction of indicator parameters. However, the elements of landscape design are complex and have unique requirements, making it difficult to generate designs solely from indicator limitations. To address these issues, this study proposes a park space generative design system based on deep learning technology. This system generates design plans based on the topological relationships of landscape elements, then vectorizes the plan element information, and uses Grasshopper to generate three-dimensional models while synchronously fine-tuning parameters, rapidly completing the entire process from basic site conditions to model effect analysis. Experimental results show that: (1) the system, with the aid of AI-assisted technology, can rapidly generate green space schemes that meet the designer's intent based on site conditions; (2) this study has vectorized and three-dimensionalized various types of landscape design elements based on semantic information; (3) the analysis and visualization module constructed in this study can perform landscape analysis on the generated three-dimensional models and produce node effect diagrams, allowing users to modify the design in real time based on the effects, thus enhancing the system's interactivity.
https://arxiv.org/abs/2404.16067
Temporal sentence grounding involves the retrieval of a video moment with a natural language query. Many existing works directly incorporate the given video and temporally localized query for temporal grounding, overlooking the inherent domain gap between different modalities. In this paper, we utilize pseudo-query features containing extensive temporally global textual knowledge sourced from the same video-query pair, to enhance the bridging of domain gaps and attain a heightened level of similarity between multi-modal features. Specifically, we propose a Pseudo-query Intermediary Network (PIN) to achieve an improved alignment of visual and comprehensive pseudo-query features within the feature space through contrastive learning. Subsequently, we utilize learnable prompts to encapsulate the knowledge of pseudo-queries, propagating them into the textual encoder and multi-modal fusion module, further enhancing the feature alignment between visual and language for better temporal grounding. Extensive experiments conducted on the Charades-STA and ActivityNet-Captions datasets demonstrate the effectiveness of our method.
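The contrastive alignment at PIN's core is the standard InfoNCE objective: matched (visual, pseudo-query) pairs in a batch are pulled together and mismatched pairs pushed apart. A minimal numpy version (dimensions and the toy batch are illustrative):

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """Batch InfoNCE: row i of `anchors` should match row i of
    `positives`; every other row in the batch is a negative. Returns the
    mean cross-entropy of picking the correct positive."""
    def norm(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    logits = norm(anchors) @ norm(positives).T / temperature   # (B, B)
    labels = np.arange(len(anchors))
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[labels, labels].mean())

rng = np.random.default_rng(3)
pos = rng.normal(size=(4, 32))
aligned = info_nce(pos + 0.01 * rng.normal(size=(4, 32)), pos)
shuffled = info_nce(rng.normal(size=(4, 32)), pos)   # unrelated features
```

The loss is near zero when features are aligned and large when they are unrelated, which is exactly the gradient signal that narrows the visual-textual domain gap.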
https://arxiv.org/abs/2404.13611
Weakly-supervised temporal action localization (WTAL) aims to recognize and localize action instances with only video-level labels. Despite the significant progress, existing methods suffer from severe performance degradation when transferring to different distributions and thus may hardly adapt to real-world scenarios. To address this problem, we propose the Generalizable Temporal Action Localization task (GTAL), which focuses on improving the generalizability of action localization methods. We observed that the performance decline can be primarily attributed to the lack of generalizability to different action scales. To address this problem, we propose STAT (Self-supervised Temporal Adaptive Teacher), which leverages a teacher-student structure for iterative refinement. Our STAT features a refinement module and an alignment module. The former iteratively refines the model's output by leveraging contextual information and helps adapt to the target scale. The latter improves the refinement process by promoting a consensus between student and teacher models. We conduct extensive experiments on three datasets, THUMOS14, ActivityNet1.2, and HACS, and the results show that our method significantly improves over baseline methods under the cross-distribution evaluation setting, even approaching the same-distribution evaluation performance.
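Teacher-student consensus schemes like this typically keep the teacher as an exponential moving average (EMA) of the student, which stabilises the refinement targets. A minimal sketch of that common ingredient (the EMA update is a standard mean-teacher recipe, not STAT's full pipeline):

```python
def ema_update(teacher, student, momentum=0.99):
    """Mean-teacher update: each teacher weight drifts toward the
    corresponding student weight at rate (1 - momentum), so the teacher
    is a smoothed, slowly-moving copy of the student."""
    return {k: momentum * teacher[k] + (1 - momentum) * student[k]
            for k in teacher}

teacher = {"w": 0.0}
student = {"w": 1.0}
for _ in range(300):     # student held fixed; teacher converges toward it
    teacher = ema_update(teacher, student)
```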
https://arxiv.org/abs/2404.13311
Decoding visual information from human brain activity has seen remarkable advancements in recent research. However, due to the significant variability in cortical parcellation and cognition patterns across subjects, current approaches personalize deep models for each subject, constraining the practicality of this technology in real-world contexts. To tackle these challenges, we introduce Wills Aligner, a robust multi-subject brain representation learner. Our Wills Aligner initially aligns different subjects' brains at the anatomical level. Subsequently, it incorporates a mixture of brain experts to learn individual cognition patterns. Additionally, it decouples the multi-subject learning task into a two-stage training, propelling the deep model and its plugin network to learn inter-subject commonality knowledge and various cognition patterns, respectively. Wills Aligner enables us to overcome anatomical differences and to efficiently leverage a single model for multi-subject brain representation learning. We meticulously evaluate the performance of our approach across coarse-grained and fine-grained visual decoding tasks. The experimental results demonstrate that our Wills Aligner achieves state-of-the-art performance.
https://arxiv.org/abs/2404.13282
Deepfakes, as AI-generated media, have increasingly threatened media integrity and personal privacy with realistic yet fake digital content. In this work, we introduce an open-source and user-friendly online platform, DeepFake-O-Meter v2.0, that integrates state-of-the-art methods for detecting Deepfake images, videos, and audio. Built upon DeepFake-O-Meter v1.0, we have made significant upgrades and improvements in platform architecture design, including user interaction, detector integration, job balancing, and security management. The platform aims to offer everyday users a convenient service for analyzing DeepFake media using multiple state-of-the-art detection algorithms. It ensures secure and private delivery of the analysis results. Furthermore, it serves as an evaluation and benchmarking platform for researchers in digital media forensics to compare the performance of multiple algorithms on the same input. We have also conducted detailed usage analysis based on the collected data to gain deeper insights into our platform's statistics. This involves analyzing two-month trends in user activity and evaluating the processing efficiency of each detector.
https://arxiv.org/abs/2404.13146
Decoding natural visual scenes from brain activity has flourished, with extensive research on single-subject tasks but comparatively little on cross-subject tasks. Reconstructing high-quality images in cross-subject tasks is a challenging problem due to profound individual differences between subjects and the scarcity of data annotation. In this work, we propose MindTuner for cross-subject visual decoding, which achieves high-quality and rich-semantic reconstructions using only 1 hour of fMRI training data, benefiting from the phenomenon of visual fingerprints in the human visual system and a novel fMRI-to-text alignment paradigm. Firstly, we pre-train a multi-subject model among 7 subjects and fine-tune it with scarce data on new subjects, where LoRAs with Skip-LoRAs are utilized to learn the visual fingerprint. Then, we take the image modality as the intermediate pivot modality to achieve fMRI-to-text alignment, which achieves impressive fMRI-to-text retrieval performance and corrects fMRI-to-image reconstruction with fine-tuned semantics. The results of both qualitative and quantitative analyses demonstrate that MindTuner surpasses state-of-the-art cross-subject visual decoding models on the Natural Scenes Dataset (NSD), whether using training data of 1 hour or 40 hours.
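The LoRA idea that makes 1-hour fine-tuning feasible is a low-rank additive update to a frozen weight matrix. A minimal numpy sketch (standard LoRA, as in Hu et al.; not the paper's Skip-LoRA variant, and all dimensions are illustrative):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """LoRA-adapted linear layer: the frozen weight W is augmented by a
    trainable low-rank update B @ A, so adapting to a new subject trains
    only r*(d_in + d_out) parameters instead of d_in*d_out."""
    return x @ (W + alpha * (B @ A)).T

d_out, d_in, r = 6, 8, 2
rng = np.random.default_rng(4)
W = rng.normal(size=(d_out, d_in))    # frozen pretrained weight
A = rng.normal(size=(r, d_in))        # trainable down-projection
B = np.zeros((d_out, r))              # zero-init up-projection, so the
                                      # adapter starts as a no-op
x = rng.normal(size=(3, d_in))
y0 = lora_forward(x, W, A, B)         # equals x @ W.T at initialization
```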
https://arxiv.org/abs/2404.12630
The rise of large-scale multimodal models has paved the pathway for groundbreaking advances in generative modeling and reasoning, unlocking transformative applications in a variety of complex tasks. However, a pressing question that remains is their genuine capability for stronger forms of generalization, which has been largely underexplored in the multimodal setting. Our study aims to address this by examining sequential compositional generalization using \textsc{CompAct} (\underline{Comp}ositional \underline{Act}ivities)\footnote{Project Page: \url{this http URL}}, a carefully constructed, perceptually grounded dataset set within a rich backdrop of egocentric kitchen activity videos. Each instance in our dataset is represented with a combination of raw video footage, naturally occurring sound, and crowd-sourced step-by-step descriptions. More importantly, our setup ensures that the individual concepts are consistently distributed across training and evaluation sets, while their compositions are novel in the evaluation set. We conduct a comprehensive assessment of several unimodal and multimodal models. Our findings reveal that bi-modal and tri-modal models exhibit a clear edge over their text-only counterparts. This highlights the importance of multimodality while charting a trajectory for future research in this domain.
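The split protocol described above (atoms shared across splits, compositions novel at evaluation) is easy to state as a check; a small sketch with hypothetical verb-noun pairs:

```python
def compositional_split_ok(train_pairs, eval_pairs):
    """Verify a CompAct-style compositional split: every primitive
    concept seen at evaluation must appear somewhere in training, while
    every evaluation *combination* must be absent from training."""
    train_verbs = {v for v, _ in train_pairs}
    train_nouns = {n for _, n in train_pairs}
    atoms_covered = all(v in train_verbs and n in train_nouns
                        for v, n in eval_pairs)
    combos_novel = not (set(eval_pairs) & set(train_pairs))
    return atoms_covered and combos_novel

train = [("cut", "onion"), ("wash", "cup"), ("cut", "bread")]
good_eval = [("wash", "onion"), ("cut", "cup")]   # known atoms, new pairs
bad_eval = [("cut", "onion")]                     # composition seen in training
```

Under this protocol, above-chance evaluation performance cannot come from memorizing compositions; it must come from recombining known concepts.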
https://arxiv.org/abs/2404.12013
Online learning has soared in popularity in the educational landscape of COVID-19 and carries the benefits of increased flexibility and access to far-away training resources. However, it also restricts communication between peers and teachers, limits physical interactions and confines learning to the computer screen and keyboard. In this project, we designed a novel way to engage students in collaborative online learning by using haptic-enabled tangible robots, Cellulo. We built a library which connects two robots remotely for a learning activity based around the structure of a biological cell. To discover how separate modes of haptic feedback might differentially affect collaboration, two modes of haptic force-feedback were implemented (haptic co-location and haptic consensus). With a case study, we found that the haptic co-location mode seemed to stimulate collectivist behaviour to a greater extent than the haptic consensus mode, which was associated with individualism and less interaction. While the haptic co-location mode seemed to encourage information pooling, participants using the haptic consensus mode tended to focus more on technical co-ordination. This work introduces a novel system that can provide interesting insights on how to integrate haptic feedback into collaborative remote learning activities in future.
https://arxiv.org/abs/2404.11876
Internet of Things (IoT) devices generate heterogeneous data over time, and relying solely on individual data points is inadequate for accurate analysis. Segmentation is a common preprocessing step in many IoT applications, including IoT-based activity recognition, aiming to address the limitations of individual events and streamline the process. However, this step introduces at least two families of uncontrollable biases. The first is caused by the changes made by the segmentation process to the initial problem space, such as dividing the input data into 60-second windows. The second category of biases results from the segmentation process itself, including the fixed choice of segmentation method and its parameters. To address these biases, we propose to redefine the segmentation problem as a special case of a decomposition problem, comprising three key components: a decomposer, resolutions, and a composer. The inclusion of the composer task in the segmentation process facilitates an assessment of the relationship between the original problem and the problem after segmentation. This improves the evaluation process and, consequently, the selection of the appropriate segmentation method. We then formally introduce our novel meta-decomposition, or learning-to-decompose, approach. It reduces segmentation biases by treating the segmentation as a hyperparameter to be optimized by the outer learning problem. Therefore, meta-decomposition improves overall system performance by dynamically selecting the appropriate segmentation method without introducing the aforementioned biases. Extensive experiments on four real-world datasets demonstrate the effectiveness of our proposal.
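The "segmentation as an outer hyperparameter" idea can be shown in miniature: instead of fixing one window size a priori, sweep candidate sizes and keep whichever scores best on the downstream task (the decomposer, scoring function, and candidates below are all illustrative stand-ins, not the paper's formalism):

```python
import numpy as np

def fixed_windows(events, window):
    """Decomposer: split a 1-D event stream into fixed-size windows."""
    n = len(events) // window
    return [events[i * window:(i + 1) * window] for i in range(n)]

def select_window_size(events, candidates, score_fn):
    """Outer loop of a meta-decomposition sketch: the window size is
    treated as a hyperparameter and selected by downstream score rather
    than fixed in advance."""
    best, best_score = None, -np.inf
    for w in candidates:
        segments = fixed_windows(events, w)
        score = score_fn(segments)
        if score > best_score:
            best, best_score = w, score
    return best

events = np.arange(120)
# toy downstream score that happens to prefer four segments (30 samples each)
best = select_window_size(events,
                          candidates=[10, 30, 60],
                          score_fn=lambda segs: -abs(len(segs) - 4))
```

In the real approach the score comes from the composer's assessment of the recomposed problem, so the selection reflects the original task rather than an arbitrary windowing convention.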
https://arxiv.org/abs/2404.11742
Satellite imagery is regarded as a great opportunity for citizen-based monitoring of activities of interest. Relevant imagery may however not be available at sufficiently high resolution, quality, or cadence -- let alone be uniformly accessible to open-source analysts. This limits an assessment of the true long-term potential of citizen-based monitoring of nuclear activities using publicly available satellite imagery. In this article, we demonstrate how modern game engines combined with advanced machine-learning techniques can be used to generate synthetic imagery of sites of interest with the ability to choose relevant parameters upon request; these include time of day, cloud cover, season, or level of activity onsite. At the same time, resolution and off-nadir angle can be adjusted to simulate different characteristics of the satellite. While there are several possible use-cases for synthetic imagery, here we focus on its usefulness to support tabletop exercises in which simple monitoring scenarios can be examined to better understand verification capabilities enabled by new satellite constellations and very short revisit times.
https://arxiv.org/abs/2404.11461