Low-Light Image Enhancement (LLIE) is a key task in computational photography and imaging. The problem of enhancing images captured at night or in dark environments has been well studied in the image signal processing literature. However, current deep learning-based solutions struggle with efficiency and robustness in real-world scenarios (e.g., scenes with noise, saturated pixels, or poor illumination). We propose a lightweight neural network that combines image processing in the frequency and spatial domains. Our method, FLOL+, is one of the fastest models for this task, achieving state-of-the-art results on popular real-scene datasets such as LOL and LSRW. Moreover, we are able to process 1080p images in under 12 ms. Code and models at this https URL
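The abstract does not spell out the architecture, so the following numpy sketch only illustrates the general idea of frequency-domain enhancement that FLOL+ combines with spatial processing: brightening the Fourier amplitude of a low-light image while preserving phase. The single scalar gain is a placeholder for whatever per-frequency transform the learned model applies.

```python
import numpy as np

def frequency_enhance(img: np.ndarray, gain: float = 2.0) -> np.ndarray:
    """Brighten a grayscale image by scaling its Fourier amplitude.

    Global illumination is carried largely by the amplitude spectrum, so
    scaling it while keeping the phase intact lifts brightness without
    destroying structure; a network would predict a per-frequency gain
    instead of this single scalar.
    """
    spectrum = np.fft.fft2(img.astype(np.float64))
    amplitude, phase = np.abs(spectrum), np.angle(spectrum)
    enhanced = np.fft.ifft2(gain * amplitude * np.exp(1j * phase))
    return np.clip(enhanced.real, 0.0, 255.0)

dark = np.random.default_rng(0).uniform(0, 40, size=(64, 64))  # toy low-light frame
print(frequency_enhance(dark).mean() / dark.mean())            # ~2x brighter
```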
https://arxiv.org/abs/2501.09718
Recent advances in large language models (LLMs) have demonstrated significant progress in performing complex tasks. While Reinforcement Learning from Human Feedback (RLHF) has been effective in aligning LLMs with human preferences, it is susceptible to spurious correlations in reward modeling. Consequently, it often introduces biases, such as length bias, sycophancy, conceptual bias, and discrimination, that hinder the model's ability to capture true causal relationships. To address this, we propose a novel causal reward modeling approach that integrates causal inference to mitigate these spurious correlations. Our method enforces counterfactual invariance, ensuring reward predictions remain consistent when irrelevant variables are altered. Through experiments on both synthetic and real-world datasets, we show that our approach mitigates various types of spurious correlations effectively, resulting in more reliable and fair alignment of LLMs with human preferences. As a drop-in enhancement to the existing RLHF workflow, our causal reward modeling provides a practical way to improve the trustworthiness and fairness of LLM finetuning.
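The paper's exact objective is not given in the abstract; as a minimal PyTorch sketch of the idea, one can add a counterfactual-invariance penalty to the standard Bradley-Terry reward loss. The perturbation input (`chosen_cf`, e.g. the chosen response re-padded to a different length) and the weight `lam` are assumptions for illustration, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def causal_reward_loss(reward_model, chosen, rejected, chosen_cf, lam=1.0):
    """Bradley-Terry preference loss plus a counterfactual-invariance penalty.

    chosen_cf is the chosen input with an irrelevant attribute altered;
    the MSE term pushes the reward to stay constant under that change.
    """
    r_c, r_r = reward_model(chosen), reward_model(rejected)
    pref_loss = -F.logsigmoid(r_c - r_r).mean()           # standard reward modeling
    inv_loss = F.mse_loss(reward_model(chosen_cf), r_c)   # counterfactual invariance
    return pref_loss + lam * inv_loss

# Toy usage on feature vectors standing in for encoded responses.
net = torch.nn.Linear(16, 1)
rm = lambda x: net(x).squeeze(-1)
x_c, x_r, x_cf = torch.randn(8, 16), torch.randn(8, 16), torch.randn(8, 16)
causal_reward_loss(rm, x_c, x_r, x_cf).backward()
```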
https://arxiv.org/abs/2501.09620
Video synthetic aperture radar (ViSAR) has attracted substantial attention in the moving target detection (MTD) field due to its ability to continuously monitor changes in the target area. In ViSAR, the shadows of moving targets do not shift or defocus, which is widely exploited as a feature for MTD. However, the shadows are difficult to distinguish from low-scattering regions in the background, which causes more missed detections and false alarms. Therefore, it is worth investigating how to enhance the distinction between shadows and the background. In this study, we propose the Shadow Enhancement and Background Suppression for ViSAR (SE-BSFV) algorithm. The SE-BSFV algorithm is based on low-rank representation (LRR) theory and adopts an online subspace learning technique to enhance shadows and suppress the background in ViSAR images. Firstly, we use a registration algorithm to register the ViSAR images and utilize a Gaussian mixture distribution (GMD) to model the ViSAR data. Secondly, the knowledge learned from previous frames is leveraged to estimate the GMD parameters of the current frame, and the expectation-maximization (EM) algorithm is used to estimate the subspace parameters. Then, the foreground matrix of the current frame can be obtained. Finally, the alternating direction method of multipliers (ADMM) is used to eliminate strong scattering objects in the foreground matrix to obtain the final results. The experimental results indicate that the SE-BSFV algorithm significantly enhances the shadows' saliency and greatly improves detection performance while ensuring efficiency, compared with several other advanced pre-processing algorithms.
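SE-BSFV itself is an online method with GMD modeling, but the batch low-rank-plus-sparse decomposition underlying this family of foreground/background separators is easy to sketch: ADMM alternates singular-value thresholding for the low-rank background with soft thresholding for the sparse foreground. The defaults below follow common robust-PCA heuristics, not the paper's settings.

```python
import numpy as np

def soft(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def rpca_admm(D, lam=None, mu=None, iters=100):
    """ADMM for  min ||L||_* + lam*||S||_1  s.t.  D = L + S,
    with frames stacked as columns of D; moving shadows land in S."""
    m, n = D.shape
    lam = 1.0 / np.sqrt(max(m, n)) if lam is None else lam
    mu = 0.25 * m * n / np.abs(D).sum() if mu is None else mu
    L, S, Y = (np.zeros_like(D) for _ in range(3))
    for _ in range(iters):
        U, sig, Vt = np.linalg.svd(D - S + Y / mu, full_matrices=False)
        L = U @ np.diag(soft(sig, 1.0 / mu)) @ Vt   # singular-value thresholding
        S = soft(D - L + Y / mu, lam / mu)          # sparse shrinkage
        Y += mu * (D - L - S)                       # dual ascent on D = L + S
    return L, S
```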
https://arxiv.org/abs/2501.09341
This work introduces a novel Retention Layer mechanism for Transformer-based architectures, addressing their inherent lack of intrinsic retention capabilities. Unlike human cognition, which can encode and dynamically recall symbolic templates, Generative Pretrained Transformers rely solely on fixed pretrained weights and ephemeral context windows, limiting their adaptability. The proposed Retention Layer incorporates a persistent memory module capable of real-time data population, dynamic recall, and guided output generation. This enhancement allows models to store, update, and reuse observed patterns across sessions, enabling incremental learning and bridging the gap between static pretraining and dynamic, context-sensitive adaptation. The Retention Layer design parallels social learning processes, encompassing attention, retention, reproduction, and motivation stages. Technically, it integrates a memory attention mechanism and episodic buffers to manage memory scalability, mitigate overfitting, and ensure efficient recall. Applications span adaptive personal assistants, real-time fraud detection, autonomous robotics, content moderation, and healthcare diagnostics. In each domain, the retention mechanism enables systems to learn incrementally, personalize outputs, and respond to evolving real-world challenges effectively. By emulating key aspects of human learning, this retention-enhanced architecture fosters a more fluid and responsive AI paradigm, paving the way for dynamic, session-aware models that extend the capabilities of traditional Transformers into domains requiring continual adaptation.
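No equations are given in the abstract, so the PyTorch sketch below is only one plausible reading of a persistent memory module: an attention read over stored key-value slots plus a ring-buffer write that survives across calls. The slot count, mean-pooled write rule, and residual combine are all assumptions.

```python
import torch
import torch.nn.functional as F

class RetentionLayer(torch.nn.Module):
    """Toy persistent memory: attention read over (key, value) slots that
    persist across forward calls, i.e. across sessions."""
    def __init__(self, dim: int, slots: int = 128):
        super().__init__()
        self.register_buffer("mem_k", torch.zeros(slots, dim))
        self.register_buffer("mem_v", torch.zeros(slots, dim))
        self.register_buffer("ptr", torch.zeros((), dtype=torch.long))
        self.query = torch.nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (seq, dim)
        slot = int(self.ptr) % self.mem_k.size(0)        # ring-buffer position
        self.mem_k[slot] = self.mem_v[slot] = x.mean(0).detach()  # write first
        self.ptr += 1
        attn = F.softmax(self.query(x) @ self.mem_k.T / x.size(-1) ** 0.5, dim=-1)
        read = attn @ self.mem_v                         # recall stored patterns
        return x + read                                  # residual combine

layer = RetentionLayer(dim=32)
print(layer(torch.randn(10, 32)).shape)  # torch.Size([10, 32])
```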
https://arxiv.org/abs/2501.09166
Accurate and resilient object detection for structural damage assessment is important for ensuring the continued use of civil infrastructure. However, achieving robustness in object detectors remains a persistent challenge, impacting their ability to generalize effectively. This study proposes DetectorX, a robust framework for structural damage detection coupled with a micro drone. DetectorX addresses the challenges of object detector robustness by incorporating two innovative modules: a stem block and a spiral pooling technique. The stem block introduces a dynamic visual modality by leveraging the outputs of two Deep Convolutional Neural Network (DCNN) models. The framework employs the proposed event-based reward reinforcement learning to constrain the actions of a parent and child DCNN model leading to a reward. This results in the induction of two dynamic visual modalities alongside the Red, Green, and Blue (RGB) data. This enhancement significantly augments DetectorX's perception and adaptability in diverse environmental situations. Further, a spiral pooling technique, an online image augmentation method, strengthens the framework by increasing feature representations through the concatenation of spiraled and average/max pooled features. In three extensive experiments, namely (1) a comparative study and (2) a robustness study, both using the Pacific Earthquake Engineering Research Hub ImageNet dataset, and (3) a field experiment, DetectorX performed satisfactorily across varying metrics, including precision (0.88), recall (0.84), average precision (0.91), mean average precision (0.76), and mean average recall (0.73), compared to competing detectors including You Only Look Once X-medium (YOLOX-m) and others. The study's findings indicate that DetectorX can provide satisfactory results and demonstrate resilience in challenging environments.
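The abstract describes spiral pooling only as "concatenating spiraled and average/max pooled features"; the numpy sketch below is a literal reading of that sentence (an outside-in spiral scan of the feature map concatenated with global average and max statistics) and should not be taken as the paper's exact formulation.

```python
import numpy as np

def spiral_flatten(fm: np.ndarray) -> np.ndarray:
    """Read a 2-D feature map in clockwise spiral order, outside-in."""
    out = []
    while fm.size:
        out.extend(fm[0].tolist())   # peel off the top row
        fm = np.rot90(fm[1:])        # rotate the rest so the next edge is on top
    return np.asarray(out)

def spiral_pool(fm: np.ndarray, k: int = 16) -> np.ndarray:
    """Concatenate the first k spiral samples with average/max pooled features."""
    return np.concatenate([spiral_flatten(fm)[:k], [fm.mean()], [fm.max()]])

fm = np.arange(25, dtype=float).reshape(5, 5)
print(spiral_pool(fm).shape)  # (18,)
```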
https://arxiv.org/abs/2501.08807
Animating clipart images with seamless motion while maintaining visual fidelity and temporal coherence presents significant challenges. Existing methods, such as AniClipart, effectively model spatial deformations but often fail to ensure smooth temporal transitions, resulting in artifacts like abrupt motions and geometric distortions. Similarly, text-to-video (T2V) and image-to-video (I2V) models struggle to handle clipart due to the mismatch in statistical properties between natural video and clipart styles. This paper introduces FlexiClip, a novel approach designed to overcome these limitations by addressing the intertwined challenges of temporal consistency and geometric integrity. FlexiClip extends traditional Bézier curve-based trajectory modeling with key innovations: temporal Jacobians to correct motion dynamics incrementally, continuous-time modeling via probability flow ODEs (pfODEs) to mitigate temporal noise, and a flow matching loss inspired by GFlowNet principles to optimize smooth motion transitions. These enhancements ensure coherent animations across complex scenarios involving rapid movements and non-rigid deformations. Extensive experiments validate the effectiveness of FlexiClip in generating animations that are not only smooth and natural but also structurally consistent across diverse clipart types, including humans and animals. By integrating spatial and temporal modeling with pre-trained video diffusion models, FlexiClip sets a new standard for high-quality clipart animation, offering robust performance across a wide range of visual content. Project Page: this https URL
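As background for the trajectory model FlexiClip extends, a cubic Bézier curve evaluated per frame gives each keypoint a smooth path determined by four control points; the paper's temporal-Jacobian corrections, pfODE sampling, and flow matching loss are not reproduced in this toy sketch.

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier trajectory at parameters t in [0, 1]."""
    t = np.asarray(t)[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

# A keypoint's 2-D path over a 24-frame clip (control points are toy values).
p0, p1, p2, p3 = map(np.array, [(0, 0), (1, 2), (3, 2), (4, 0)])
path = cubic_bezier(p0, p1, p2, p3, np.linspace(0, 1, 24))
print(path.shape)  # (24, 2)
```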
https://arxiv.org/abs/2501.08676
Stereo matching recovers depth from image correspondences. Existing methods struggle to handle ill-posed regions with limited matching cues, such as occlusions and textureless areas. To address this, we propose MonSter, a novel method that leverages the complementary strengths of monocular depth estimation and stereo matching. MonSter integrates monocular depth and stereo matching into a dual-branch architecture in which the two iteratively improve each other. Confidence-based guidance adaptively selects reliable stereo cues for monodepth scale-shift recovery. The refined monodepth in turn guides stereo matching effectively in ill-posed regions. Such iterative mutual enhancement enables MonSter to evolve monodepth priors from coarse object-level structures to pixel-level geometry, fully unlocking the potential of stereo matching. As shown in Fig. 1, MonSter ranks 1st across the five most commonly used leaderboards -- SceneFlow, KITTI 2012, KITTI 2015, Middlebury, and ETH3D -- achieving up to a 49.5% improvement (Bad 1.0 on ETH3D) over the previous best method. Comprehensive analysis verifies the effectiveness of MonSter in ill-posed regions. In terms of zero-shot generalization, MonSter significantly and consistently outperforms state-of-the-art methods across the board. The code is publicly available at: this https URL.
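Monocular depth is only defined up to an affine ambiguity, so the scale-shift recovery step presumably reduces to an alignment of the kind below: a least-squares fit of scale and shift against stereo depths on pixels the confidence map trusts. The threshold and closed form are illustrative assumptions, not MonSter's actual module.

```python
import numpy as np

def recover_scale_shift(mono, stereo, conf, thresh=0.9):
    """Fit s, t so that s*mono + t matches stereo depth on high-confidence
    pixels, resolving the affine ambiguity of monocular depth."""
    m = conf > thresh
    A = np.stack([mono[m], np.ones(m.sum())], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, stereo[m], rcond=None)
    return s * mono + t, (s, t)

mono = np.random.rand(240, 320)
stereo = 2.0 * mono + 0.3                       # toy ground-truth relation
conf = np.random.rand(240, 320)
aligned, (s, t) = recover_scale_shift(mono, stereo, conf)
print(round(s, 3), round(t, 3))                 # ~2.0, ~0.3
```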
https://arxiv.org/abs/2501.08643
Autonomous unmanned aerial vehicles (UAVs) integrated with edge computing capabilities empower real-time data processing directly on the device, dramatically reducing latency in critical scenarios such as wildfire detection. This study underscores Transfer Learning's (TL) significance in boosting the performance of object detectors for identifying wildfire smoke and flames, especially when trained on limited datasets, and investigates the impact TL has on edge computing metrics, focusing on how TL-enhanced You Only Look Once (YOLO) models perform in terms of inference time, power usage, and energy consumption on edge computing devices. This study utilizes the Aerial Fire and Smoke Essential (AFSE) dataset as the target, with the Flame and Smoke Detection Dataset (FASDD) and the Microsoft Common Objects in Context (COCO) dataset serving as source datasets. We explore a two-stage cascaded TL method, utilizing D-Fire or FASDD as initial-stage target datasets and AFSE as the subsequent stage. Through fine-tuning, TL significantly enhances detection precision, achieving up to 79.2% mean Average Precision (mAP@0.5), reduces training time, and increases model generalizability across the AFSE dataset. However, cascaded TL yielded no notable improvements, and TL alone did not benefit the edge computing metrics evaluated. Lastly, this work found that YOLOv5n remains a powerful model when hardware acceleration is lacking, processing images nearly twice as fast as its newer counterpart, YOLO11n. Overall, the results affirm TL's role in augmenting the accuracy of object detectors while also illustrating that additional enhancements are needed to improve edge computing performance.
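With the ultralytics package, the two-stage cascade amounts to two chained fine-tuning runs; the sketch below assumes that package plus placeholder dataset YAMLs (`fasdd.yaml`, `afse.yaml`) and the default run directory, since the paper's exact configurations are not given.

```python
from ultralytics import YOLO

# Stage 1: adapt a COCO-pretrained checkpoint to the fire/smoke source dataset.
model = YOLO("yolov5n.pt")  # resolved by ultralytics to its YOLOv5n weights
model.train(data="fasdd.yaml", epochs=50, imgsz=640)

# Stage 2: fine-tune the stage-1 weights on the AFSE target dataset.
model = YOLO("runs/detect/train/weights/best.pt")
model.train(data="afse.yaml", epochs=50, imgsz=640)

metrics = model.val()       # mAP@0.5 on the AFSE validation split
print(metrics.box.map50)
```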
https://arxiv.org/abs/2501.08639
Modern web services rely heavily on REST APIs, typically documented using the OpenAPI specification. The widespread adoption of this standard has resulted in the development of many black-box testing tools that generate tests based on these specifications. Recent advancements in Natural Language Processing (NLP), particularly with Large Language Models (LLMs), have enhanced REST API testing by extracting actionable rules and generating input values from the human-readable portions of the specification. However, these advancements overlook the potential of continuously refining the identified rules and test inputs based on server responses. To address this limitation, we present LlamaRestTest, a novel approach that employs two custom LLMs to generate realistic test inputs and uncover parameter dependencies during the testing process by incorporating server responses. These LLMs are created by fine-tuning the Llama3-8b model, using mined datasets of REST API example values and inter-parameter dependencies. We evaluated LlamaRestTest on 12 real-world services (including popular services such as Spotify), comparing it against RESTGPT, a GPT-powered specification-enhancement tool, as well as several state-of-the-art REST API testing tools, including RESTler, MoRest, EvoMaster, and ARAT-RL. Our results show that fine-tuning enables smaller LLMs to outperform larger models in detecting actionable rules and generating inputs for REST API testing. We evaluated configurations from the base Llama3-8B to fine-tuned versions and explored 2-bit, 4-bit, and 8-bit quantization for efficiency. LlamaRestTest surpasses state-of-the-art tools in code coverage and error detection, even with RESTGPT-enhanced specifications, and an ablation study highlights the impact of its novel components.
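One concrete piece of the efficiency study is easy to show: loading an 8B model with weight quantization through transformers and bitsandbytes. The model id is a placeholder for the paper's fine-tuned checkpoints, and only 4-bit/8-bit loading is covered by this API; the 2-bit setting the paper explores requires other toolchains.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B"   # placeholder for the fine-tuned LLMs

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto")
tok = AutoTokenizer.from_pretrained(model_id)

prompt = "Example value for the 'market' query parameter of GET /v1/search:"
out = model.generate(**tok(prompt, return_tensors="pt").to(model.device),
                     max_new_tokens=16)
print(tok.decode(out[0], skip_special_tokens=True))
```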
https://arxiv.org/abs/2501.08598
The rapid advancement of Large Vision-Language Models (LVLMs) has enhanced their capabilities, offering potential applications from content creation to productivity enhancement. Despite their innovative potential, LVLMs exhibit vulnerabilities, especially in generating potentially toxic or unsafe responses. Malicious actors can exploit these vulnerabilities to propagate toxic content in an automated (or semi-automated) manner, leveraging the susceptibility of LVLMs to deception via strategically crafted prompts without fine-tuning or compute-intensive procedures. Despite red-teaming efforts and the inherent potential risks associated with LVLMs, the exploration of their vulnerabilities remains nascent and has yet to be fully addressed in a systematic manner. This study systematically examines the vulnerabilities of open-source LVLMs, including LLaVA, InstructBLIP, Fuyu, and Qwen, using adversarial prompt strategies that simulate real-world social manipulation tactics informed by social theories. Our findings show that (i) toxicity and insulting are the most prevalent behaviors, with mean rates of 16.13% and 9.75%, respectively; (ii) Qwen-VL-Chat, LLaVA-v1.6-Vicuna-7b, and InstructBLIP-Vicuna-7b are the most vulnerable models, exhibiting toxic response rates of 21.50%, 18.30%, and 17.90%, and insulting responses of 13.40%, 11.70%, and 10.10%, respectively; (iii) prompting strategies incorporating dark humor and multimodal toxic prompt completion significantly elevated these vulnerabilities. Despite being fine-tuned for safety, these models still generate content with varying degrees of toxicity when prompted with adversarial inputs, highlighting the urgent need for enhanced safety mechanisms and robust guardrails in LVLM development.
https://arxiv.org/abs/2501.09039
Glioblastoma, a highly aggressive brain tumor, poses major challenges due to its poor prognosis and high morbidity rates. Partial differential equation-based models offer promising potential to enhance therapeutic outcomes by simulating patient-specific tumor behavior for improved radiotherapy planning. However, model calibration remains a bottleneck due to the high computational demands of optimization methods like Monte Carlo sampling and evolutionary algorithms. To address this, we recently introduced an approach leveraging a neural forward solver with gradient-based optimization to significantly reduce calibration time. This approach requires a highly accurate and fully differentiable forward model. We investigate multiple architectures, including (i) an enhanced TumorSurrogate, (ii) a modified nnU-Net, and (iii) a 3D Vision Transformer (ViT). The optimized TumorSurrogate achieved the best overall results, excelling in both tumor outline matching and voxel-level prediction of tumor cell concentration. It halved the MSE relative to the baseline model and achieved the highest Dice score across all tumor cell concentration thresholds. Our study demonstrates significant enhancement in forward solver performance and outlines important future research directions.
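The calibration loop the authors accelerate is simple to sketch once the forward model is differentiable: treat the tumor parameters as tensors with gradients and minimize the mismatch to the observed cell-density map with Adam. The analytic Gaussian field below is only a stand-in for the paper's trained surrogates (TumorSurrogate, nnU-Net, ViT).

```python
import torch
import torch.nn.functional as F

def forward_model(params, coords):
    """Stand-in differentiable surrogate: an isotropic Gaussian cell-density
    field parameterized by seed location (3 values) and spread (1 value)."""
    center, spread = params[:3], F.softplus(params[3])
    return torch.exp(-((coords - center) ** 2).sum(-1) / (2 * spread ** 2))

coords = torch.rand(4096, 3)          # voxel coordinates in the unit cube
target = forward_model(torch.tensor([0.4, 0.6, 0.5, 0.0]), coords)  # synthetic scan

params = torch.tensor([0.5, 0.5, 0.5, 0.5], requires_grad=True)
opt = torch.optim.Adam([params], lr=0.05)
for _ in range(300):                  # gradient-based calibration
    opt.zero_grad()
    loss = F.mse_loss(forward_model(params, coords), target)
    loss.backward()
    opt.step()
print(params.detach().numpy().round(2), float(loss))
```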
https://arxiv.org/abs/2501.08226
The accelerated MRI reconstruction poses a challenging ill-posed inverse problem due to the significant undersampling in k-space. Deep neural networks, such as CNNs and ViTs, have shown substantial performance improvements for this task while encountering the dilemma between global receptive fields and efficient computation. To this end, this paper pioneers the exploration of Mamba, a new paradigm for long-range dependency modeling with linear complexity, for efficient and effective MRI reconstruction. However, directly applying Mamba to MRI reconstruction faces three significant issues: (1) Mamba's row-wise and column-wise scanning disrupts k-space's unique spectrum, leaving its potential in k-space learning unexplored. (2) Existing Mamba methods unfold feature maps with multiple lengthy scanning paths, leading to long-range forgetting and high computational burden. (3) Mamba struggles with spatially-varying contents, resulting in limited diversity of local representations. To address these, we propose a dual-domain multi-scale Mamba for MRI reconstruction from the following perspectives: (1) We pioneer vision Mamba in k-space learning. A circular scanning scheme is customized for spectrum unfolding, benefiting the global modeling of k-space. (2) We propose a multi-scale Mamba with an efficient scanning strategy in both image and k-space domains. It mitigates long-range forgetting and achieves a better trade-off between efficiency and performance. (3) We develop a local diversity enhancement module to improve the spatially-varying representation of Mamba. Extensive experiments are conducted on three public datasets for MRI reconstruction under various undersampling patterns. Comprehensive results demonstrate that our method significantly outperforms state-of-the-art methods with lower computational cost. Implementation code will be available at this https URL.
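The Mamba blocks themselves are involved, but the dual-domain backbone rests on a standard data-consistency step worth making concrete: wherever k-space was actually measured, the network estimate's spectrum is overwritten with the measurement. The torch sketch below shows only that step, not the paper's circular-scan modules.

```python
import torch

def data_consistency(x, k_meas, mask):
    """Overwrite the estimate's k-space values with measured samples wherever
    the undersampling mask is 1 (standard step in dual-domain reconstruction)."""
    k_est = torch.fft.fft2(x)
    k_mix = torch.where(mask.bool(), k_meas, k_est)
    return torch.fft.ifft2(k_mix).real

x = torch.rand(1, 1, 64, 64)                  # network output (image domain)
mask = (torch.rand(64, 64) < 0.25).float()    # ~4x undersampling pattern
k_meas = torch.fft.fft2(torch.rand(1, 1, 64, 64)) * mask
print(data_consistency(x, k_meas, mask).shape)  # torch.Size([1, 1, 64, 64])
```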
https://arxiv.org/abs/2501.08163
The audio-visual benefit in speech perception, where congruent visual input enhances auditory processing, is well documented across age groups, particularly in challenging listening conditions and among individuals with varying hearing abilities. However, most studies rely on highly controlled laboratory environments with scripted stimuli. Here, we examine the audio-visual benefit using unscripted, natural speech from untrained speakers within a virtual acoustic environment. Using electroencephalography (EEG) and cortical speech tracking, we assessed neural responses across audio-visual, audio-only, visual-only, and masked-lip conditions to isolate the role of lip movements. Additionally, we analysed individual differences in acoustic and visual features of the speakers, including pitch, jitter, and lip openness, to explore their influence on the audio-visual speech tracking benefit. Results showed a significant audio-visual enhancement in speech tracking with background noise, with the masked-lip condition performing similarly to the audio-only condition, emphasizing the importance of lip movements in adverse listening situations. Our findings reveal the feasibility of cortical speech tracking with naturalistic stimuli and underscore the impact of individual speaker characteristics on audio-visual integration in real-world listening contexts.
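Cortical speech tracking is commonly quantified with a temporal response function: ridge regression from time-lagged speech-envelope features to the EEG, scored by the correlation between predicted and recorded signals. The abstract does not name its exact estimator, so the numpy sketch below assumes this standard approach with toy data.

```python
import numpy as np

def lagged(env, lags):
    """Stack time-lagged copies of the speech envelope as regressors."""
    X = np.stack([np.roll(env, lag) for lag in lags], axis=1)
    X[: max(lags)] = 0.0        # zero out wrap-around samples
    return X

rng = np.random.default_rng(0)
fs, lags = 64, range(16)                       # 0-250 ms lags at 64 Hz
env = rng.standard_normal(fs * 60)             # toy 1-minute speech envelope
eeg = np.convolve(env, rng.standard_normal(16), mode="same") \
      + rng.standard_normal(env.size)          # toy channel = response + noise

X, ridge = lagged(env, lags), 1e2
w = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ eeg)  # TRF weights
print(np.corrcoef(X @ w, eeg)[0, 1])           # tracking score (higher = stronger)
```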
https://arxiv.org/abs/2501.08124
As critical visual details become obscured, the low visibility and high ISO noise in extremely low-light images pose a significant challenge to human pose estimation. Current methods fail to provide high-quality representations due to their reliance on pixel-level enhancements that compromise semantics and their inability to handle extreme low-light conditions for robust feature learning. In this work, we propose a frequency-based framework for low-light human pose estimation, rooted in the "divide-and-conquer" principle. Instead of uniformly enhancing the entire image, our method focuses on task-relevant information. By applying dynamic illumination correction to the low-frequency components and low-rank denoising to the high-frequency components, we effectively enhance both the semantic and texture information essential for accurate pose estimation. This targeted enhancement yields robust, high-quality representations, significantly improving pose estimation performance. Extensive experiments demonstrate its superiority over state-of-the-art methods in various challenging low-light scenarios.
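The "divide" step is a frequency split of the kind below: a Gaussian low-pass mask in the Fourier domain separates the illumination-dominated low band from the texture-dominated high band, so each can receive its own correction. The gamma curve stands in for the paper's dynamic illumination correction, and the low-rank denoiser is omitted.

```python
import numpy as np

def split_frequencies(img, sigma=8.0):
    """Split an image into low/high frequency parts with a Gaussian mask."""
    h, w = img.shape
    fy, fx = np.meshgrid(np.fft.fftfreq(h) * h, np.fft.fftfreq(w) * w,
                         indexing="ij")
    low_pass = np.exp(-(fx ** 2 + fy ** 2) / (2 * sigma ** 2))
    low = np.fft.ifft2(np.fft.fft2(img) * low_pass).real
    return low, img - low

img = np.random.rand(128, 128) * 0.1          # toy low-light frame in [0, 1]
low, high = split_frequencies(img)
corrected = np.clip(low, 1e-3, 1.0) ** 0.45 + high   # gamma as a stand-in
print(img.mean(), corrected.mean())           # illumination lifted, texture kept
```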
https://arxiv.org/abs/2501.08038
Semantic segmentation of remote sensing images is essential for various applications, including vegetation monitoring, disaster management, and urban planning. Previous studies have demonstrated that the self-attention mechanism (SA) is an effective approach for designing segmentation networks that can capture long-range pixel dependencies. SA enables the network to model the global dependencies between the input features, resulting in improved segmentation outcomes. However, the high density of attentional feature maps used in this mechanism causes exponential increases in computational complexity. Additionally, it introduces redundant information that negatively impacts the feature representation. Inspired by traditional threshold segmentation algorithms, we propose a novel threshold attention mechanism (TAM). This mechanism significantly reduces computational effort while also better modeling the correlation between different regions of the feature map. Based on TAM, we present a threshold attention network (TANet) for semantic segmentation. TANet consists of an attentional feature enhancement module (AFEM) for global feature enhancement of shallow features and a threshold attention pyramid pooling module (TAPP) for acquiring feature information at different scales for deep features. We have conducted extensive experiments on the ISPRS Vaihingen and Potsdam datasets. The results demonstrate the validity and superiority of our proposed TANet compared to state-of-the-art models.
https://arxiv.org/abs/2501.07984
Aviation safety is paramount, demanding precise analysis of safety occurrences during different flight phases. This study employs Natural Language Processing (NLP) and Deep Learning models, including LSTM, CNN, Bidirectional LSTM (BLSTM), and simple Recurrent Neural Networks (sRNN), to classify flight phases in safety reports from the Australian Transport Safety Bureau (ATSB). The models exhibited high accuracy, precision, recall, and F1 scores, with LSTM achieving the highest performance of 87%, 88%, 87%, and 88%, respectively. This performance highlights their effectiveness in automating safety occurrence analysis. The integration of NLP and Deep Learning technologies promises transformative enhancements in aviation safety analysis, enabling targeted safety measures and streamlined report handling.
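The classification setup is a standard text pipeline; a minimal Keras sketch of the LSTM variant is below. The vocabulary size, sequence length, and the five-phase label set are placeholders, since the ATSB data preparation is not described in the abstract.

```python
import tensorflow as tf

NUM_PHASES = 5   # placeholder label set, e.g. taxi/takeoff/climb/cruise/landing
vectorize = tf.keras.layers.TextVectorization(max_tokens=20_000,
                                              output_sequence_length=200)
# vectorize.adapt(train_texts)   # fit the vocabulary on ATSB report text

model = tf.keras.Sequential([
    vectorize,
    tf.keras.layers.Embedding(20_000, 64),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(NUM_PHASES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_texts, train_labels, validation_split=0.1, epochs=10)
```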
https://arxiv.org/abs/2501.07923
Vision-language models (VLMs) have exhibited remarkable generalization capabilities, and prompt learning for VLMs has attracted great attention for its ability to adapt pre-trained VLMs to specific downstream tasks. However, existing studies mainly focus on single-modal prompts or uni-directional modality interaction, overlooking the powerful alignment effects resulting from the interaction between the vision and language modalities. To this end, we propose a novel prompt learning method called Bi-directional Modality Interaction Prompt (BMIP), which dynamically weights bi-modal information through learning the information of the attention layer, enhancing trainability and inter-modal consistency compared to simple information aggregation methods. To evaluate the effectiveness of prompt learning methods, we propose a more realistic evaluation paradigm called open-world generalization, complementing the widely adopted cross-dataset transfer and domain generalization tasks. Comprehensive experiments on various datasets reveal that BMIP not only outperforms current state-of-the-art methods across all three evaluation paradigms but is also flexible enough to be combined with other prompt-based methods for consistent performance enhancement.
https://arxiv.org/abs/2501.07769
This study underscores the pivotal role of syntax feedback in augmenting the syntactic proficiency of students. Recognizing the challenges faced by learners in mastering syntactic nuances, we introduce a specialized dataset named Essay-Syntax-Instruct designed to enhance the understanding and application of English syntax among these students. Leveraging the capabilities of Large Language Models (LLMs) such as GPT3.5-Turbo, Llama-2-7b-chat-hf, Llama-2-13b-chat-hf, and Mistral-7B-Instruct-v0.2, this work embarks on a comprehensive fine-tuning process tailored to the syntax improvement task. Through meticulous evaluation, we demonstrate that the fine-tuned LLMs exhibit a marked improvement in addressing syntax-related challenges, thereby serving as a potent tool for students to identify and rectify their syntactic errors. The findings not only highlight the effectiveness of the proposed dataset in elevating the performance of LLMs for syntax enhancement but also illuminate a promising path for utilizing advanced language models to support language acquisition efforts. This research contributes to the broader field of language learning technology by showcasing the potential of LLMs in facilitating the linguistic development of students.
https://arxiv.org/abs/2501.07740
Large Language Models (LLMs) have demonstrated outstanding capabilities across various domains, but the increasing complexity of new challenges demands enhanced performance and adaptability. Traditional benchmarks, although comprehensive, often lack the granularity needed for detailed capability analysis. This study introduces the Cognitive Diagnostic Synthesis (CDS) method, which employs Cognitive Diagnosis Theory (CDT) for precise evaluation and targeted enhancement of LLMs. By decomposing complex tasks into discrete knowledge points, CDS accurately identifies and synthesizes data targeting model weaknesses, thereby enhancing the model's performance. This framework proposes a comprehensive pipeline driven by knowledge point evaluation, synthesis, data augmentation, and filtering, which significantly improves the model's mathematical and coding capabilities, achieving up to an 11.12% improvement in optimal scenarios.
https://arxiv.org/abs/2501.07674
Improving robustness to uncertainty and rejection of external disturbances represents a significant challenge in aerial robotics. Nonlinear controllers based on Incremental Nonlinear Dynamic Inversion (INDI), known for their ability to estimate disturbances from measured, filtered data, have been notably used in such applications. Typically, these controllers comprise two cascaded loops: an inner loop employing nonlinear dynamic inversion and an outer loop generating the virtual control inputs via linear controllers. In this paper, a novel methodology is introduced that combines the advantages of INDI with the robustness of linear structured $\mathcal{H}_\infty$ controllers. A full cascaded architecture is proposed to control the dynamics of a multirotor drone, covering both stabilization and guidance. In particular, low-order $\mathcal{H}_\infty$ controllers are designed for the outer loop by properly structuring the problem and solving it through non-smooth optimization. A comparative analysis is conducted between an existing INDI/PD approach and the proposed INDI/$\mathcal{H}_\infty$ strategy, showing a notable enhancement in the rejection of external disturbances. The comparison is carried out first using MATLAB simulations involving a nonlinear model of a Parrot Bebop quadcopter drone, and then experimentally using a customized quadcopter built by the ENAC team. The results show an improvement of more than 50% in the rejection of disturbances such as gusts.
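The inner-loop relation the abstract alludes to is the standard INDI increment: the new input corrects the previous one by the inverted control-effectiveness model applied to the error between the commanded virtual control and the measured (filtered) acceleration, so unmodeled disturbances are cancelled implicitly. The sketch below uses toy numbers; the outer-loop $\mathcal{H}_\infty$ synthesis itself is not reproduced.

```python
import numpy as np

def indi_increment(u_prev, accel_meas, nu, B):
    """One INDI step:  u = u_prev + B^{-1} (nu - accel_meas).

    Only the increment is inverted, so whatever disturbance is present shows
    up in accel_meas and is compensated at the next step.
    """
    return u_prev + np.linalg.solve(B, nu - accel_meas)

B = np.diag([2.0, 2.0, 4.0])         # control effectiveness (toy values)
u = np.zeros(3)
nu = np.array([0.5, 0.0, 0.2])       # virtual control from the outer linear loop
accel = np.array([0.1, -0.05, 0.0])  # measured, filtered angular acceleration
print(indi_increment(u, accel, nu, B))
```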
https://arxiv.org/abs/2501.07223