Light field (LF) imaging captures both angular and spatial light distributions, enabling advanced photographic techniques. However, micro-lens array (MLA)-based cameras face a spatial-angular resolution tradeoff due to a single shared sensor. We propose a novel light field framework for resolution enhancement, employing a modular approach. The first module generates a high-resolution, all-in-focus image. The second module, a texture transformer network, enhances the resolution of each light field perspective independently, using the output of the first module as a reference image. The final module leverages light field regularity to jointly improve resolution across all LF image perspectives. Our approach demonstrates performance superior to existing methods in both qualitative and quantitative evaluations.
https://arxiv.org/abs/2405.02787
Retrieval-augmented large language models (LLMs) leverage relevant content retrieved by information retrieval systems to generate correct responses, aiming to alleviate the hallucination problem. However, existing retriever-responder methods typically append relevant documents to the prompt of LLMs to perform text generation tasks without considering the interaction of fine-grained structural semantics between the retrieved documents and the LLMs. This issue is particularly important for accurate response generation, as LLMs tend to ``lose in the middle'' when dealing with input prompts augmented with lengthy documents. In this work, we propose a new pipeline named ``Reinforced Retriever-Reorder-Responder'' (R$^4$) to learn document orderings for retrieval-augmented LLMs, thereby further enhancing their generation abilities while the large number of parameters of the LLMs remains frozen. The reordering learning process is divided into two steps according to the quality of the generated responses: document order adjustment and document representation enhancement. Specifically, document order adjustment aims to organize retrieved document orderings into beginning, middle, and end positions based on graph attention learning, which maximizes the reinforced reward of response quality. Document representation enhancement further refines the representations of retrieved documents for responses of poor quality via document-level gradient adversarial learning. Extensive experiments demonstrate that our proposed pipeline achieves better factual question-answering performance on knowledge-intensive tasks compared to strong baselines across various public datasets. The source code and trained models will be released upon paper acceptance.
https://arxiv.org/abs/2405.02659
Molecular property prediction is a key component of AI-driven drug discovery and molecular characterization learning. Despite recent advances, existing methods still face challenges such as limited ability to generalize and inadequate representation learning from unlabeled data, especially for tasks specific to molecular structures. To address these limitations, we introduce DIG-Mol, a novel self-supervised graph neural network framework for molecular property prediction. This architecture leverages the power of contrastive learning with dual interaction mechanisms and unique molecular graph enhancement strategies. DIG-Mol integrates a momentum distillation network with two interconnected networks to efficiently improve molecular characterization. The framework's ability to extract key information about molecular structure and higher-order semantics is supported by minimizing the contrastive loss. We have established DIG-Mol's state-of-the-art performance through extensive experimental evaluation on a variety of molecular property prediction tasks. In addition to demonstrating superior transferability in few-shot learning scenarios, our visualizations highlight DIG-Mol's enhanced interpretability and representation capabilities. These findings confirm the effectiveness of our approach in overcoming challenges faced by traditional methods and mark a significant advance in molecular property prediction.
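DIG-Mol's two training signals named above, momentum distillation and a contrastive objective, can be sketched in a few lines. This is an illustrative sketch only: the EMA update rule, the InfoNCE loss form, and the momentum coefficient `m` are common-practice assumptions, not details taken from the paper.

```python
import numpy as np

def ema_update(teacher, student, m=0.99):
    """Momentum-distillation update: teacher params track an EMA of the student."""
    return {k: m * teacher[k] + (1 - m) * student[k] for k in teacher}

def contrastive_loss(z1, z2, tau=0.5):
    """InfoNCE over two augmented views (rows are embeddings of the same
    molecules under different graph augmentations; positives on the diagonal)."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / tau                               # pairwise similarities
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

The teacher network's parameters track an exponential moving average of the student's, while the contrastive loss pulls together embeddings of two augmented views of the same molecular graph and pushes apart those of different molecules.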
https://arxiv.org/abs/2405.02628
In recent years, autonomous driving has garnered escalating attention for its potential to relieve drivers' burdens and improve driving safety. Vision-based 3D occupancy prediction, which predicts the spatial occupancy status and semantics of 3D voxel grids around the autonomous vehicle from image inputs, is an emerging perception task suitable for cost-effective perception systems in autonomous driving. Although numerous studies have demonstrated the advantages of 3D occupancy prediction over object-centric perception tasks, there is still a lack of a dedicated review focusing on this rapidly developing field. In this paper, we first introduce the background of vision-based 3D occupancy prediction and discuss the challenges in this task. Secondly, we conduct a comprehensive survey of the progress in vision-based 3D occupancy prediction from three aspects: feature enhancement, deployment friendliness, and label efficiency, and provide an in-depth analysis of the potentials and challenges of each category of methods. Finally, we present a summary of prevailing research trends and propose some inspiring future outlooks. To provide a valuable reference for researchers, a regularly updated collection of related papers, datasets, and codes is organized at this https URL.
https://arxiv.org/abs/2405.02595
Computed tomography (CT) is a widely used non-invasive medical imaging technique for disease diagnosis. The diagnostic accuracy is often affected by image resolution, which can be insufficient in practice. For medical CT images, the through-plane resolution is often worse than the in-plane resolution and there can be overlap between slices, causing difficulties in diagnoses. Self-supervised methods for through-plane resolution enhancement, which train on in-plane images and infer on through-plane images, have shown promise for both CT and MRI imaging. However, existing self-supervised methods either neglect overlap or can only handle specific cases with fixed combinations of resolution and overlap. To address these limitations, we propose a self-supervised method called SR4ZCT. It employs the same off-axis training approach while being capable of handling arbitrary combinations of resolution and overlap. Our method explicitly models the relationship between resolutions and voxel spacings of different planes to accurately simulate training images that match the original through-plane images. We highlight the significance of accurate modeling in self-supervised off-axis training and demonstrate the effectiveness of SR4ZCT using a real-world dataset.
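The explicit resolution/overlap modeling described above can be illustrated with a 1-D sketch: to create training pairs, a high-resolution in-plane profile is degraded into thick, possibly overlapping slices by box-averaging. The box slice profile, the function name, and its parameters are simplifying assumptions, not SR4ZCT's actual degradation model.

```python
import numpy as np

def simulate_through_plane(profile, spacing_hr, thickness, spacing_lr):
    """Resample a high-resolution 1-D profile (pixel size `spacing_hr`) into
    thick slices: each output sample averages the input over a window of
    width `thickness`, with window centres `spacing_lr` apart.
    thickness > spacing_lr models overlapping slices."""
    length = len(profile) * spacing_hr
    centres = np.arange(thickness / 2, length - thickness / 2 + 1e-9, spacing_lr)
    out = []
    for c in centres:
        lo = int(round((c - thickness / 2) / spacing_hr))
        hi = int(round((c + thickness / 2) / spacing_hr))
        out.append(profile[lo:hi].mean())
    return np.array(out)
```

With `thickness=2` and `spacing_lr=1` (in the units of `spacing_hr`), consecutive output samples average overlapping windows, mimicking overlapping CT slices.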
https://arxiv.org/abs/2405.02515
Predictive Coding (PC) is a theoretical framework in cognitive science suggesting that the human brain processes cognition through spatiotemporal prediction of the visual world. Existing studies have developed spatiotemporal prediction neural networks based on the PC theory, emulating its two core mechanisms: correcting predictions from residuals and hierarchical learning. However, these models do not show enhanced prediction skills on real-world forecasting tasks and ignore the precision weighting mechanism of PC theory. The precision weighting mechanism posits that the brain allocates more attention to signals with lower precision, contributing to the cognitive ability of human brains. This work introduces the Cognitive Diffusion Probabilistic Models (CogDPM), which demonstrate the connection between diffusion probabilistic models and PC theory. CogDPM features a precision estimation method based on the hierarchical sampling capabilities of diffusion models and weights the guidance with precision weights estimated from the inherent properties of diffusion models. We experimentally show that the precision weights effectively estimate the data predictability. We apply CogDPM to real-world prediction tasks using the United Kingdom precipitation and ERA surface wind datasets. Our results demonstrate that CogDPM outperforms both existing domain-specific operational models and general deep prediction models by providing more proficient forecasting.
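The precision-weighting idea can be sketched as follows: estimate per-location precision from the spread of an ensemble of diffusion samples, then scale a guidance term by a weight map derived from it. Both functions are illustrative assumptions (CogDPM's estimator is built into the diffusion hierarchy, and whether guidance is up- or down-weighted by precision is a modeling choice; here low-variance regions get stronger guidance purely for illustration).

```python
import numpy as np

def precision_weights(samples):
    """Estimate per-pixel precision from an ensemble of diffusion samples:
    low variance across samples -> high precision. Normalised to [0, 1]."""
    var = samples.var(axis=0)
    prec = 1.0 / (var + 1e-8)
    return prec / prec.max()

def weighted_guidance(uncond, cond, weights, scale=2.0):
    """Classifier-free-guidance-style update with a spatial weight map
    (a sketch; CogDPM's exact weighting of the guidance term may differ)."""
    return uncond + scale * weights * (cond - uncond)
```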
https://arxiv.org/abs/2405.02384
The complex challenge of detecting sarcasm in Arabic speech on social media is compounded by the language's diversity and the nature of sarcastic expressions. There is a significant gap in the capability of existing models to effectively interpret sarcasm in Arabic, which mandates the need for more sophisticated and precise detection methods. In this paper, we investigate the impact of a fundamental preprocessing component on sarcasm speech detection. While emojis play a crucial role in compensating for the absence of body language and facial expressions in modern communication, their impact on automated text analysis, particularly in sarcasm detection, remains underexplored. We investigate the impact of excluding emojis from datasets on the performance of sarcasm detection models in social media content for Arabic, a vocabulary-rich language. This investigation includes the adaptation and enhancement of AraBERT pre-training models, specifically by excluding emojis, to improve sarcasm detection capabilities. We use AraBERT pre-training to refine the specified models, demonstrating that the removal of emojis can significantly boost the accuracy of sarcasm detection. This approach facilitates a more refined interpretation of language, eliminating the potential confusion introduced by non-textual elements. The evaluated AraBERT models, through the focused strategy of emoji removal, adeptly navigate the complexities of Arabic sarcasm. This study establishes new benchmarks in Arabic natural language processing and presents valuable insights for social media platforms.
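Emoji exclusion itself is a simple preprocessing step. A minimal sketch follows; the Unicode ranges below cover common emoji blocks only, and the paper's exact cleaning pipeline is not specified in the abstract.

```python
import re

# Simplified subset of Unicode emoji blocks (an assumption, not exhaustive).
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols
    "\u2600-\u27BF"          # misc symbols and dingbats
    "]+"
)

def strip_emojis(text: str) -> str:
    """Remove emoji characters and collapse the whitespace left behind."""
    return re.sub(r"\s+", " ", EMOJI_PATTERN.sub(" ", text)).strip()
```

Arabic text passes through unchanged, since only emoji code points are touched.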
https://arxiv.org/abs/2405.02195
The continuous evolution of pre-trained speech models has greatly advanced Speech Emotion Recognition (SER). However, there is still potential for enhancement in the performance of these methods. In this paper, we present GMP-ATL (Gender-augmented Multi-scale Pseudo-label Adaptive Transfer Learning), a novel HuBERT-based adaptive transfer learning framework for SER. Specifically, GMP-ATL initially employs the pre-trained HuBERT, implementing multi-task learning and multi-scale k-means clustering to acquire frame-level gender-augmented multi-scale pseudo-labels. Then, to fully leverage both obtained frame-level and utterance-level emotion labels, we incorporate model retraining and fine-tuning methods to further optimize GMP-ATL. Experiments on IEMOCAP show that our GMP-ATL achieves superior recognition performance, with a WAR of 80.0\% and a UAR of 82.0\%, surpassing state-of-the-art unimodal SER methods, while also yielding comparable results with multimodal SER approaches.
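The frame-level pseudo-labeling step can be sketched as plain k-means over frame features, run at several cluster counts to obtain multi-scale labels. Everything here (Lloyd's algorithm, the example scales, the function names) is an illustrative assumption; GMP-ATL additionally augments the labels with gender information, which this sketch omits.

```python
import numpy as np

def kmeans_pseudo_labels(frames, k, iters=20, seed=0):
    """Assign each frame-level feature a cluster id to use as a pseudo-label
    (plain Lloyd's k-means)."""
    rng = np.random.default_rng(seed)
    centers = frames[rng.choice(len(frames), k, replace=False)]
    for _ in range(iters):
        d = ((frames[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = frames[labels == j].mean(0)
    return labels

def multi_scale_labels(frames, scales=(2, 4)):
    """One pseudo-label sequence per clustering granularity (illustrative scales)."""
    return {k: kmeans_pseudo_labels(frames, k) for k in scales}
```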
https://arxiv.org/abs/2405.02151
Deep learning has the potential to enhance speech signals and increase their intelligibility for users of hearing aids. Deep models suited for real-world application should feature a low computational complexity and low processing delay of only a few milliseconds. In this paper, we explore deep speech enhancement that matches these requirements and contrast monaural and binaural processing algorithms in two complex acoustic scenes. Both algorithms are evaluated with objective metrics and in experiments with hearing-impaired listeners performing a speech-in-noise test. Results are compared to two traditional enhancement strategies, i.e., adaptive differential microphone processing and binaural beamforming. While in diffuse noise, all algorithms perform similarly, the binaural deep learning approach performs best in the presence of spatial interferers. Through a post-analysis, this can be attributed to improvements at low SNRs and to precise spatial filtering.
https://arxiv.org/abs/2405.01967
This paper aims to efficiently enable large language models (LLMs) to use external knowledge and goal guidance in conversational recommender system (CRS) tasks. Advanced LLMs (e.g., ChatGPT) are limited in domain-specific CRS tasks in 1) generating grounded responses with recommendation-oriented knowledge and 2) proactively leading the conversations through different dialogue goals. In this work, we first analyze those limitations through a comprehensive evaluation, showing the necessity of external knowledge and goal guidance, which contribute significantly to recommendation accuracy and language quality. In light of this finding, we propose a novel ChatCRS framework to decompose the complex CRS task into several sub-tasks through the implementation of 1) a knowledge retrieval agent using a tool-augmented approach to reason over external Knowledge Bases and 2) a goal-planning agent for dialogue goal prediction. Experimental results on two multi-goal CRS datasets reveal that ChatCRS sets new state-of-the-art benchmarks, improving language quality of informativeness by 17% and proactivity by 27%, and achieving a tenfold enhancement in recommendation accuracy.
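At a high level, the agent decomposition reads as a three-stage pipeline: predict a dialogue goal, retrieve knowledge conditioned on it, then prompt the frozen LLM. The sketch below only fixes that control flow; the function arguments, the prompt template, and the injected components (`plan_goal`, `retrieve`, `llm`) are hypothetical stand-ins for the trained agents.

```python
def chatcrs_respond(dialogue, retrieve, plan_goal, llm):
    """High-level flow suggested by the abstract (all components injected;
    the real agents use tool-augmented KB reasoning and a trained planner)."""
    goal = plan_goal(dialogue)            # e.g. "recommendation", "chit-chat"
    facts = retrieve(dialogue, goal)      # knowledge triples / passages
    prompt = (f"Dialogue goal: {goal}\n"
              f"Knowledge: {'; '.join(facts)}\n"
              f"History: {dialogue}\nResponse:")
    return llm(prompt)                    # frozen LLM generates the reply
```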
https://arxiv.org/abs/2405.01868
Diagnosing language disorders associated with autism is a complex and nuanced challenge, often hindered by the subjective nature and variability of traditional assessment methods. Traditional diagnostic methods not only require intensive human effort but also often result in delayed interventions due to their lack of speed and specificity. In this study, we explored the application of ChatGPT, a state-of-the-art large language model, to overcome these obstacles by enhancing diagnostic accuracy and profiling specific linguistic features indicative of autism. Leveraging ChatGPT's advanced natural language processing capabilities, this research aims to streamline and refine the diagnostic process. Specifically, we compared ChatGPT's performance with that of conventional supervised learning models, including BERT, a model acclaimed for its effectiveness in various natural language processing tasks. We showed that ChatGPT substantially outperformed these models, achieving over 13% improvement in both accuracy and F1 score in a zero-shot learning configuration. This marked enhancement highlights the model's potential as a superior tool for neurological diagnostics. Additionally, we identified ten distinct features of autism-associated language disorders that vary significantly across different experimental scenarios. These features, which included echolalia, pronoun reversal, and atypical language usage, were crucial for accurately diagnosing ASD and customizing treatment plans. Together, our findings advocate for adopting sophisticated AI tools like ChatGPT in clinical settings to assess and diagnose developmental disorders. Our approach not only promises greater diagnostic precision but also aligns with the goals of personalized medicine, potentially transforming the evaluation landscape for autism and similar neurological conditions.
https://arxiv.org/abs/2405.01799
As human-machine interaction continues to evolve, the capacity for environmental perception is becoming increasingly crucial. Integrating the two most common types of sensory data, images, and point clouds, can enhance detection accuracy. However, currently, no model exists that can simultaneously detect an object's position in both point clouds and images and ascertain their corresponding relationship. This information is invaluable for human-machine interactions, offering new possibilities for their enhancement. In light of this, this paper introduces an end-to-end Consistency Object Detection (COD) algorithm framework that requires only a single forward inference to simultaneously obtain an object's position in both point clouds and images and establish their correlation. Furthermore, to assess the accuracy of the object correlation between point clouds and images, this paper proposes a new evaluation metric, Consistency Precision (CP). To verify the effectiveness of the proposed framework, an extensive set of experiments has been conducted on the KITTI and DAIR-V2X datasets. The study also explored how the proposed consistency detection method performs on images when the calibration parameters between images and point clouds are disturbed, compared to existing post-processing methods. The experimental results demonstrate that the proposed method exhibits excellent detection performance and robustness, achieving end-to-end consistency detection. The source code will be made publicly available at this https URL.
https://arxiv.org/abs/2405.01258
We propose TRAMBA, a hybrid transformer and Mamba architecture for acoustic and bone conduction speech enhancement, suitable for mobile and wearable platforms. Bone conduction speech enhancement has been impractical to adopt in mobile and wearable platforms for several reasons: (i) data collection is labor-intensive, resulting in scarcity; (ii) there exists a performance gap between state-of-the-art models with memory footprints of hundreds of MBs and methods better suited for resource-constrained systems. To adapt TRAMBA to vibration-based sensing modalities, we pre-train TRAMBA with audio speech datasets that are widely available. Then, users fine-tune with a small amount of bone conduction data. TRAMBA outperforms state-of-the-art GANs by up to 7.3% in PESQ and 1.8% in STOI, with an order of magnitude smaller memory footprint and an inference speed-up of up to 465 times. We integrate TRAMBA into real systems and show that TRAMBA (i) improves battery life of wearables by up to 160% by requiring less data sampling and transmission; (ii) generates higher quality voice in noisy environments than over-the-air speech; (iii) requires a memory footprint of less than 20.0 MB.
https://arxiv.org/abs/2405.01242
Large Language Models (LLMs) have achieved remarkable success across diverse tasks, yet they remain vulnerable to adversarial attacks, notably the well-documented \textit{jailbreak} attack. Recently, the Greedy Coordinate Gradient (GCG) attack has demonstrated efficacy in exploiting this vulnerability by optimizing adversarial prompts through a combination of gradient heuristics and greedy search. However, the efficiency of this attack has become a bottleneck in the attacking process. To mitigate this limitation, in this paper we rethink the generation of adversarial prompts through an optimization lens, aiming to stabilize the optimization process and harness more heuristic insights from previous iterations. Specifically, we introduce the \textbf{M}omentum \textbf{A}ccelerated G\textbf{C}G (\textbf{MAC}) attack, which incorporates a momentum term into the gradient heuristic. Experimental results showcase the notable enhancement achieved by MAC in gradient-based attacks on aligned language models. Our code is available at this https URL.
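The momentum term itself is the classical heavy-ball accumulation applied to GCG's token-gradient scores. The sketch below shows that accumulation on random stand-in scores; the update rule, `beta`, and the toy dimensions are assumptions rather than MAC's exact formulation.

```python
import numpy as np

def momentum_step(grad, velocity, beta=0.9):
    # v_t = beta * v_{t-1} + grad : classical momentum accumulation,
    # smoothing the per-step gradient heuristic across iterations.
    return beta * velocity + grad

# Toy GCG-style loop: scores[i, j] stands in for the (negated) gradient of the
# loss w.r.t. substituting candidate token j at prompt position i; the greedy
# search then picks substitutions from the momentum-smoothed scores.
rng = np.random.default_rng(0)
velocity = np.zeros((4, 10))            # 4 positions x 10 candidate tokens
for step in range(5):
    grad = rng.normal(size=(4, 10))     # stand-in for back-propagated scores
    velocity = momentum_step(grad, velocity)
    pos, tok = np.unravel_index(np.argmax(velocity), velocity.shape)
    # ...substitute token `tok` at position `pos`, re-evaluate the loss...
```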
https://arxiv.org/abs/2405.01229
Change detection, an interdisciplinary topic in computer vision and remote sensing, is currently receiving extensive attention and research. Due to the rapid development of society, the geographic information captured by remote sensing satellites is changing faster and becoming more complex, which undoubtedly poses a higher challenge for, and highlights the value of, change detection tasks. We propose MFDS-Net, a Multi-Scale Feature Depth-Supervised Network for remote sensing change detection with global semantic and detail information, with the aim of achieving a more refined description of changing buildings and geographic information, enhancing the localisation of changing targets and the acquisition of weak features. To achieve the research objectives, we use a modified ResNet_34 as the backbone network for feature extraction and DO-Conv as an alternative to traditional convolution, to better focus on the associations between feature information and obtain better training results. We propose the Global Semantic Enhancement Module (GSEM) to enhance the processing of high-level semantic information from a global perspective. The Differential Feature Integration Module (DFIM) is proposed to strengthen the fusion of feature information at different depths, achieving learning and extraction of differential features. The entire network is trained and optimized using a deep supervision mechanism. The experimental outcomes of MFDS-Net surpass those of current mainstream change detection networks. On the LEVIR dataset, it achieved an F1 score of 91.589 and an IoU of 84.483; on the WHU dataset, the scores were F1: 92.384 and IoU: 86.807; and on the GZ-CD dataset, the scores were F1: 86.377 and IoU: 76.021. The code is available at this https URL
https://arxiv.org/abs/2405.01065
Data-Free Meta-Learning (DFML) aims to extract knowledge from a collection of pre-trained models without requiring the original data, presenting practical benefits in contexts constrained by data privacy concerns. Current DFML methods primarily focus on the data recovery from these pre-trained models. However, they suffer from slow recovery speed and overlook gaps inherent in heterogeneous pre-trained models. In response to these challenges, we introduce the Faster and Better Data-Free Meta-Learning (FREE) framework, which contains: (i) a meta-generator for rapidly recovering training tasks from pre-trained models; and (ii) a meta-learner for generalizing to new unseen tasks. Specifically, within the module Faster Inversion via Meta-Generator, each pre-trained model is perceived as a distinct task. The meta-generator can rapidly adapt to a specific task in just five steps, significantly accelerating the data recovery. Furthermore, we propose Better Generalization via Meta-Learner and introduce an implicit gradient alignment algorithm to optimize the meta-learner. This is achieved as aligned gradient directions alleviate potential conflicts among tasks from heterogeneous pre-trained models. Empirical experiments on multiple benchmarks affirm the superiority of our approach, marking a notable speed-up (20$\times$) and performance enhancement (1.42\% $\sim$ 4.78\%) in comparison to the state-of-the-art.
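The abstract does not spell out the implicit gradient alignment algorithm; as a rough illustration of why aligned gradient directions alleviate task conflicts, here is a PCGrad-style projection that removes the conflicting component of one task's gradient. This is an explicit stand-in for intuition, not the paper's method.

```python
import numpy as np

def align_gradients(g1, g2):
    """If two task gradients conflict (negative inner product), drop g1's
    component along g2 before the optimizer combines them, so the update
    no longer pulls the shared meta-learner in opposing directions."""
    if g1 @ g2 < 0:
        g1 = g1 - (g1 @ g2) / (g2 @ g2) * g2   # remove component along g2
    return g1
```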
https://arxiv.org/abs/2405.00984
Background and Purpose: Identifying the thromboembolism source in ischemic stroke is crucial for treatment and secondary prevention yet is often undetermined. This study describes a self-supervised deep learning approach in digital pathology of emboli for classifying ischemic stroke clot origin from histopathological images. Methods: The dataset included whole slide images (WSI) from the STRIP AI Kaggle challenge, consisting of retrieved clots from ischemic stroke patients following mechanical thrombectomy. Transformer-based deep learning models were developed using transfer learning and self-supervised pretraining for classifying WSI. Customizations included an attention pooling layer, weighted loss function, and threshold optimization. Various model architectures were tested and compared, and model performances were primarily evaluated using weighted logarithmic loss. Results: The model achieved a logloss score of 0.662 in cross-validation and 0.659 on the test set. Different model backbones were compared, with swin_large_patch4_window12_384 showing higher performance. Thresholding techniques for clot origin classification were employed to balance false positives and negatives. Conclusion: The study demonstrates the efficacy of transformer-based deep learning models in identifying ischemic stroke clot origins from histopathological images and emphasizes the need for refined modeling techniques specifically adapted to thrombus WSIs. Further research is needed to improve model performance and interpretability and to validate the approach's effectiveness. Future enhancements could include integrating larger patient cohorts, advanced preprocessing strategies, and ensemble multimodal methods for enhanced diagnostic accuracy.
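Two of the customizations, the weighted logarithmic loss and threshold optimization, can be sketched directly. The class weights, the balanced-accuracy criterion, and the threshold grid are assumptions for illustration; the challenge's official weights are not given in the abstract.

```python
import numpy as np

def weighted_log_loss(y_true, p_pred, class_weights, eps=1e-15):
    """Per-class weighted logarithmic loss: each class's mean log loss is
    weighted, then normalised by the weight sum (assumed metric form)."""
    p = np.clip(p_pred, eps, 1 - eps)
    total, wsum = 0.0, 0.0
    for c, w in class_weights.items():
        mask = y_true == c
        if mask.any():
            total += w * -np.mean(np.log(p[mask, c]))
            wsum += w
    return total / wsum

def best_threshold(y_true, scores, grid=np.linspace(0.1, 0.9, 81)):
    """Pick the decision threshold that balances false positives/negatives
    by maximising balanced accuracy on held-out data (one possible criterion)."""
    def balanced_acc(t):
        pred = scores >= t
        tpr = (pred & (y_true == 1)).sum() / max((y_true == 1).sum(), 1)
        tnr = (~pred & (y_true == 0)).sum() / max((y_true == 0).sum(), 1)
        return (tpr + tnr) / 2
    return max(grid, key=balanced_acc)
```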
https://arxiv.org/abs/2405.00908
The lane graph is a key component for building high-definition (HD) maps and crucial for downstream tasks such as autonomous driving or navigation planning. Previously, He et al. (2022) explored the extraction of the lane-level graph from aerial imagery utilizing a segmentation based approach. However, segmentation networks struggle to achieve perfect segmentation masks resulting in inaccurate lane graph extraction. We explore additional enhancements to refine this segmentation-based approach and extend it with a diffusion probabilistic model (DPM) component. This combination further improves the GEO F1 and TOPO F1 scores, which are crucial indicators of the quality of a lane graph, in the undirected graph in non-intersection areas. We conduct experiments on a publicly available dataset, demonstrating that our method outperforms the previous approach, particularly in enhancing the connectivity of such a graph, as measured by the TOPO F1 score. Moreover, we perform ablation studies on the individual components of our method to understand their contribution and evaluate their effectiveness.
https://arxiv.org/abs/2405.00620
Fundus photography, in combination with ultra-wide-angle fundus (UWF) techniques, has become an indispensable diagnostic tool in clinical settings by offering a more comprehensive view of the retina. However, unlike UWF scanning laser ophthalmoscopy (UWF-SLO), UWF fluorescein angiography (UWF-FA) requires the injection of a fluorescent dye into the patient's hand or elbow. To mitigate the potential adverse effects associated with injections, researchers have proposed cross-modality medical image generation algorithms capable of converting UWF-SLO images into their UWF-FA counterparts. Current image generation techniques applied to fundus photography struggle to produce high-resolution retinal images, particularly when capturing minute vascular lesions. To address these issues, we introduce a novel conditional generative adversarial network (UWAFA-GAN) to synthesize UWF-FA from UWF-SLO. This approach employs multi-scale generators and an attention transmit module to efficiently extract both global structures and local lesions. Additionally, to counteract the image blurriness that arises from training on misaligned data, a registration module is integrated into the framework. Our method performs competitively on inception scores and fine-detail generation. Clinical user studies further indicate that the UWF-FA images generated by UWAFA-GAN are clinically comparable to authentic images in terms of diagnostic reliability. Empirical evaluations on our proprietary UWF image datasets show that UWAFA-GAN outperforms existing methods. The code is accessible at this https URL.
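The registration module compensates for misaligned SLO/FA training pairs. As a classical stand-in for that learned component (not the paper's module), phase correlation recovers a global integer translation between two images:

```python
import numpy as np

def estimate_shift(ref, mov):
    """Estimate the integer (dy, dx) circular shift that maps `ref` to
    `mov` via phase correlation: normalize the cross-power spectrum to
    pure phase, then locate the peak of its inverse FFT."""
    R = np.conj(np.fft.fft2(ref)) * np.fft.fft2(mov)
    R /= np.abs(R) + 1e-9          # keep only phase information
    corr = np.fft.ifft2(R).real    # impulse at the relative shift
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    h, w = ref.shape
    if dy > h // 2:                # wrap large indices to negative shifts
        dy -= h
    if dx > w // 2:
        dx -= w
    return int(dy), int(dx)
```

A learned registration module handles deformations far beyond pure translation, but the goal is the same: align the pair before computing the reconstruction loss so that misalignment is not penalized as blur.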
https://arxiv.org/abs/2405.00542
Addressing the challenge of automated geometry math problem-solving in artificial intelligence (AI) involves understanding multi-modal information and mathematics. Current methods struggle to interpret geometry diagrams accurately, which hinders effective problem-solving. To tackle this issue, we present the Geometry problem sOlver with natural Language Description (GOLD) model. GOLD enhances the extraction of geometric relations by separately processing symbols and geometric primitives within the diagram. It then converts the extracted relations into natural language descriptions, efficiently leveraging large language models to solve geometry math problems. Experiments show that the GOLD model outperforms Geoformer, the previous best method on the UniGeo dataset, with accuracy improvements of 12.7% and 42.1% on the calculation and proving subsets, respectively. It also surpasses PGPSNet, the former best model on the PGPS9K and Geometry3K datasets, with accuracy gains of 1.8% and 3.2%, respectively.
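GOLD's second stage verbalizes extracted diagram relations before handing them to a large language model. A minimal sketch of template-based verbalization (the predicate names and templates below are hypothetical, not GOLD's actual relation set):

```python
# Hypothetical relation-to-text templates: each extracted relation is a
# (predicate, *args) tuple that is rendered into a natural-language sentence.
TEMPLATES = {
    "on_line": "Point {0} lies on line {1}.",
    "length": "Segment {0} has length {1}.",
    "angle": "Angle {0} measures {1} degrees.",
    "perpendicular": "Line {0} is perpendicular to line {1}.",
}

def verbalize(relations):
    """Render extracted geometric relations as a natural-language
    description suitable for prompting a large language model."""
    return " ".join(TEMPLATES[pred].format(*args) for pred, *args in relations)
```

Feeding the LLM text instead of raw diagram features is the key design choice: it lets a frozen, general-purpose language model consume the diagram's structure without any vision-specific adaptation.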
https://arxiv.org/abs/2405.00494