This study introduces a novel method for irony detection, applying Large Language Models (LLMs) with prompt-based learning to facilitate emotion-centric text augmentation. Traditional irony detection techniques typically fall short due to their reliance on static linguistic features and predefined knowledge bases, often overlooking the nuanced emotional dimensions integral to irony. In contrast, our methodology augments the detection process by integrating subtle emotional cues, augmented through LLMs, into three benchmark pre-trained NLP models (BERT, T5, and GPT-2) that are widely recognized as foundational in irony detection. We assessed our method using the SemEval-2018 Task 3 dataset and observed substantial enhancements in irony detection capabilities.
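As a minimal illustration of the prompt-based, emotion-centric augmentation step, the sketch below builds a prompt asking an LLM to surface a text's emotional cues and rewrite it accordingly; the prompt wording and the `call_llm` callable are illustrative assumptions, not the paper's actual templates.

```python
# Hypothetical sketch of emotion-centric text augmentation via LLM prompting.
# The prompt template and `call_llm` interface are illustrative assumptions.
from typing import Callable, List

PROMPT_TEMPLATE = (
    "Text: \"{text}\"\n"
    "1. Name the emotion the speaker literally expresses.\n"
    "2. Name the emotion the speaker most likely intends.\n"
    "3. Rewrite the text so both emotional cues are explicit, "
    "keeping the original meaning."
)

def augment_with_emotion(texts: List[str], call_llm: Callable[[str], str]) -> List[str]:
    """Return one emotion-enriched variant per input text."""
    augmented = []
    for text in texts:
        prompt = PROMPT_TEMPLATE.format(text=text)
        augmented.append(call_llm(prompt))  # wrapper around any chat API
    return augmented

# Usage: pair each original tweet with its augmented variant before
# fine-tuning BERT/T5/GPT-2 on the SemEval-2018 Task 3 data.
```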
https://arxiv.org/abs/2404.12291
Surveillance footage is a valuable resource and opportunity for conducting gait analysis. However, the typical low quality and high noise levels in such footage can severely impact the accuracy of pose estimation algorithms, which are foundational for reliable gait analysis. Existing literature suggests a direct correlation between the efficacy of pose estimation and the subsequent gait analysis results. A common mitigation strategy involves fine-tuning pose estimation models on noisy data to improve robustness. However, this approach may degrade the downstream model's performance on the original high-quality data, leading to a trade-off that is undesirable in practice. We propose a processing pipeline that incorporates a task-targeted artifact correction model specifically designed to pre-process and enhance surveillance footage before pose estimation. Our artifact correction model is optimized to work alongside a state-of-the-art pose estimation network, HRNet, without requiring repeated fine-tuning of the pose estimation model. Furthermore, we propose a simple and robust method for automatically obtaining low-quality videos annotated with poses, for the purpose of training the artifact correction model. We systematically evaluate our artifact correction model on a range of noisy surveillance data and demonstrate that our approach not only achieves improved pose estimation on low-quality surveillance footage, but also preserves the integrity of pose estimation on high-resolution footage. Our experiments show a clear enhancement in gait analysis performance, supporting the viability of the proposed method as a superior alternative to direct fine-tuning strategies. Our contributions pave the way for more reliable gait analysis using surveillance data in real-world applications, regardless of data quality.
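One plausible reading of the automatic annotation scheme is to run the pose estimator on clean footage and synthetically degrade the same frames, so each noisy frame inherits the clean frame's pose label. The degradation recipe below is an assumption for illustration, not the paper's pipeline.

```python
# Hypothetical sketch: build (noisy frame, pose) pairs automatically by
# degrading high-quality frames whose poses were estimated on the clean copy.
import numpy as np

def degrade(frame: np.ndarray, scale: int = 4, noise_sigma: float = 8.0) -> np.ndarray:
    """Downsample-upsample plus Gaussian noise as a stand-in for surveillance artifacts."""
    h, w = frame.shape[:2]
    small = frame[::scale, ::scale]                       # crude downsampling
    up = np.repeat(np.repeat(small, scale, 0), scale, 1)[:h, :w]
    noisy = up.astype(np.float32) + np.random.normal(0, noise_sigma, up.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def make_training_pairs(clean_frames, pose_estimator):
    """pose_estimator: e.g. HRNet applied to the clean frame."""
    for frame in clean_frames:
        pose = pose_estimator(frame)   # label taken from the high-quality frame
        yield degrade(frame), pose     # noisy input, clean-frame supervision
```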
https://arxiv.org/abs/2404.12183
Embedding models play a pivotal role in modern NLP applications such as information retrieval (IR) and retrieval-augmented generation (RAG). While the context limit of LLMs has been pushed beyond 1 million tokens, embedding models are still confined to a narrow context window not exceeding 8k tokens, precluding application scenarios that require long inputs, such as legal contracts. This paper explores context window extension of existing embedding models, pushing the limit to 32k without requiring additional training. First, we examine the performance of current embedding models for long-context retrieval on our newly constructed LongEmbed benchmark. LongEmbed comprises two synthetic tasks and four carefully chosen real-world tasks, featuring documents of varying length and dispersed target information. Benchmarking results underscore huge room for improvement in these models. Based on this, comprehensive experiments show that training-free context window extension strategies like position interpolation can effectively extend the context window of existing embedding models severalfold, whether their original context is 512 or beyond 4k tokens. Furthermore, for models employing absolute position encoding (APE), we show the possibility of further fine-tuning to harvest notable performance gains while strictly preserving original behavior for short inputs. For models using rotary position embedding (RoPE), significant enhancements are observed when employing RoPE-specific methods such as NTK and SelfExtend, indicating RoPE's superiority over APE for context window extension. To facilitate future research, we release E5-Base-4k and E5-RoPE-Base, along with the LongEmbed benchmark.
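For reference, linear position interpolation for RoPE simply rescales positions by the extension factor so a 32k input falls within the rotation range seen during training; NTK and SelfExtend differ only in how positions or the frequency base are rescaled. The sketch below is a generic illustration, not the released E5 code.

```python
# Minimal sketch of RoPE with linear position interpolation (PI).
# Generic illustration; not the released E5-Base-4k / E5-RoPE-Base code.
import torch

def rope_angles(seq_len: int, dim: int, base: float = 10000.0, scale: float = 1.0):
    """Rotation angles; scale > 1 compresses positions (position interpolation)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    pos = torch.arange(seq_len).float() / scale   # PI: t -> t / scale
    return torch.outer(pos, inv_freq)             # (seq_len, dim/2)

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """x: (seq_len, dim) query/key; rotate channel pairs by the angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    return torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1).flatten(-2)

# Extending a 4k-token model to 32k amounts to scale = 32768 / 4096 = 8,
# with no further training required.
q = torch.randn(32768, 64)
q_rot = apply_rope(q, rope_angles(32768, 64, scale=8.0))
```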
https://arxiv.org/abs/2404.12096
Accurate localization is fundamental for autonomous underwater vehicles (AUVs) to carry out precise tasks, such as manipulation and construction. Vision-based solutions using fiducial markers are promising, but extremely challenging underwater because of harsh lighting conditions. This paper introduces a gradient-based active camera exposure control method to tackle sharp lighting variations during image acquisition, establishing a better foundation for subsequent image enhancement procedures. Considering a typical underwater operation scenario in which visual tags are used, we conducted several experiments comparing our method with other state-of-the-art exposure control methods, including Active Exposure Control (AEC) and Gradient-based Exposure Control (GEC). Results show a significant improvement in the accuracy of robot localization. This method is an important component that can be used in a vision-based state estimation pipeline to improve overall localization accuracy.
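The core of such a controller can be sketched as hill-climbing the exposure time on an image-gradient information metric; the specific metric and update rule below are illustrative assumptions rather than the paper's exact algorithm.

```python
# Hypothetical sketch of gradient-based exposure control: pick the exposure
# that maximizes an image-gradient information metric.
import numpy as np

def gradient_metric(img: np.ndarray) -> float:
    """Sum of log-compressed gradient magnitudes (illustrative choice)."""
    gy, gx = np.gradient(img.astype(np.float32))
    mag = np.sqrt(gx ** 2 + gy ** 2)
    return float(np.sum(np.log1p(mag)))

def adjust_exposure(capture, exposure_us: float, step: float = 1.1) -> float:
    """capture(exposure) -> grayscale frame; one hill-climbing update."""
    m0 = gradient_metric(capture(exposure_us))
    m_up = gradient_metric(capture(exposure_us * step))
    m_dn = gradient_metric(capture(exposure_us / step))
    if m_up >= m0 and m_up >= m_dn:
        return exposure_us * step
    if m_dn > m0:
        return exposure_us / step
    return exposure_us
```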
https://arxiv.org/abs/2404.12055
The widespread use of pre-trained language models (PLMs) in natural language processing (NLP) has greatly improved performance outcomes. However, these models' vulnerability to adversarial attacks (e.g., camouflaged hints from drug dealers), particularly in the Chinese language with its rich character variation and complex structures, raises serious concerns. In this study, we propose a novel method, CHinese vAriatioN Graph Enhancement (CHANGE), to increase the robustness of PLMs against character variation attacks in Chinese content. CHANGE presents a novel approach for incorporating a Chinese character variation graph into PLMs. By designing different supplementary tasks that utilize the graph structure, CHANGE essentially enhances PLMs' interpretation of adversarially manipulated text. Experiments conducted on a multitude of NLP tasks show that CHANGE outperforms current language models in combating adversarial attacks and serves as a valuable contribution to robust language model research. These findings contribute to the groundwork on robust language models and highlight the substantial potential of graph-guided pre-training strategies for real-world applications.
https://arxiv.org/abs/2404.12014
We focus on a very challenging task: imaging nighttime dynamic scenes. Most previous methods rely on the low-light enhancement of a conventional RGB camera. However, they inevitably face a dilemma between the long exposure times required at night and the motion blur of dynamic scenes. Event cameras react to dynamic changes with higher temporal resolution (microseconds) and higher dynamic range (120 dB), offering an alternative solution. In this work, we present a novel nighttime dynamic imaging method with an event camera. Specifically, we discover that events at nighttime exhibit temporal trailing characteristics and a spatially non-stationary distribution. Consequently, we propose a nighttime event reconstruction network (NER-Net), which mainly includes a learnable event timestamps calibration module (LETC) to align the temporally trailing events and a non-uniform illumination aware module (NIAM) to stabilize the spatiotemporal distribution of events. Moreover, we construct a paired real low-light event dataset (RLED) through a co-axial imaging system, including 64,200 spatially and temporally aligned ground-truth images and low-light events. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art methods in terms of visual quality and generalization ability on real-world nighttime datasets. The project is available at: this https URL.
https://arxiv.org/abs/2404.11884
We propose a novel approach to mitigate biases in computer vision models by utilizing counterfactual generation and fine-tuning. While counterfactuals have been used to analyze and address biases in DNN models, the counterfactuals themselves are often generated from biased generative models, which can introduce additional biases or spurious correlations. To address this issue, we propose using adversarial images, that is, images that deceive a deep neural network but not humans, as counterfactuals for fair model training. Our approach leverages a curriculum learning framework combined with a fine-grained adversarial loss to fine-tune the model using adversarial examples. By incorporating adversarial images into the training data, we aim to prevent biases from propagating through the pipeline. We validate our approach through both qualitative and quantitative assessments, demonstrating improved bias mitigation and accuracy compared to existing methods. Qualitatively, our results indicate that, post-training, the model's decisions are less dependent on the sensitive attribute, and our model better disentangles the relationship between sensitive attributes and classification variables.
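For concreteness, the standard FGSM construction, one common way to obtain images that fool a network but look unchanged to humans, is sketched below; how such images are folded into fairness fine-tuning follows the paper's idea only loosely.

```python
# Standard FGSM attack: generate an adversarial image that deceives a
# network while appearing unchanged to humans.
import torch
import torch.nn.functional as F

def fgsm_counterfactual(model, x: torch.Tensor, y: torch.Tensor,
                        eps: float = 4 / 255) -> torch.Tensor:
    """Return x perturbed in the direction that increases the model's loss."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0, 1).detach()

# In the paper's setting, such images are folded back into training
# (via a curriculum and a fine-grained adversarial loss) so that
# decisions depend less on sensitive attributes.
```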
https://arxiv.org/abs/2404.11819
In this paper, we present a novel approach termed Prompt-Driven Feature Diffusion (PDFD) within a semi-supervised learning framework for Open World Semi-Supervised Learning (OW-SSL). At its core, PDFD deploys an efficient feature-level diffusion model with the guidance of class-specific prompts to support discriminative feature representation learning and feature generation, tackling the challenge posed by the unavailability of labeled data for unseen classes in OW-SSL. In particular, PDFD utilizes class prototypes as prompts in the diffusion model, leveraging their class-discriminative and semantic generalization ability to condition and guide the diffusion process across all the seen and unseen classes. Furthermore, PDFD incorporates a class-conditional adversarial loss for diffusion model training, ensuring that the features generated via the diffusion process are discriminatively aligned with the class-conditional features of the real data. Additionally, the class prototypes of the unseen classes are computed using only unlabeled instances with confident predictions within the semi-supervised learning framework. We conduct extensive experiments to evaluate the proposed PDFD. The empirical results show that PDFD exhibits remarkable performance enhancements over many existing state-of-the-art methods.
https://arxiv.org/abs/2404.11795
Traditional cameras face a trade-off between low-light performance and high-speed imaging: exposures long enough to capture sufficient light result in motion blur, whereas shorter exposures result in Poisson-corrupted noisy images. While burst photography techniques help mitigate this trade-off, conventional cameras are fundamentally limited in their sensor noise characteristics. Event cameras and single-photon avalanche diode (SPAD) sensors have emerged as promising alternatives to conventional cameras due to their desirable properties. SPADs are capable of single-photon sensitivity with microsecond temporal resolution, and event cameras can measure brightness changes up to 1 MHz with low bandwidth requirements. We show that these properties are complementary, and can help achieve low-light, high-speed image reconstruction with low bandwidth requirements. We introduce a sensor fusion framework that combines SPADs with event cameras to improve the reconstruction of high-speed, low-light scenes while reducing the high bandwidth cost associated with using every SPAD frame. Our evaluation, on both synthetic and real sensor data, demonstrates significant enhancements (> 5 dB PSNR) in reconstructing low-light scenes at high temporal resolution (100 kHz) compared to conventional cameras. Event-SPAD fusion shows great promise for real-world applications, such as robotics or medical imaging.
https://arxiv.org/abs/2404.11511
A significant challenge in the field of object detection lies in the system's performance under non-ideal imaging conditions, such as rain, fog, low illumination, or raw Bayer images that lack ISP processing. Our study introduces "Feature Corrective Transfer Learning", a novel approach that leverages transfer learning and a bespoke loss function to facilitate the end-to-end detection of objects in these challenging scenarios without the need to convert non-ideal images into their RGB counterparts. In our methodology, we initially train a comprehensive model on a pristine RGB image dataset. Subsequently, non-ideal images are processed by comparing their feature maps against those from the initial ideal-RGB model. This comparison employs the Extended Area Novel Structural Discrepancy Loss (EANSDL), a novel loss function designed to quantify these similarities and integrate them into the detection loss. This approach refines the model's ability to perform object detection across varying conditions through direct feature-map correction, encapsulating the essence of Feature Corrective Transfer Learning. Experimental validation on variants of the KITTI dataset demonstrates a significant improvement in mean Average Precision (mAP): a 3.8-8.1% relative enhancement in detection under non-ideal conditions compared to the baseline model, and a marginal performance difference, within 1.3%, of the mAP@[0.5:0.95] achieved under ideal conditions by the standard Faster R-CNN algorithm.
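The abstract does not define EANSDL, but a heavily simplified stand-in for the general mechanism, penalizing the discrepancy between a non-ideal image's feature maps and those of the frozen ideal-RGB model, might look like this:

```python
# Simplified stand-in for feature-map correction: penalize the discrepancy
# between student features (non-ideal input) and frozen ideal-RGB teacher
# features. The actual EANSDL loss is more elaborate than this MSE proxy.
import torch
import torch.nn.functional as F

def feature_correction_loss(student_feats, teacher_feats, weights=None):
    """student_feats / teacher_feats: lists of (B, C, H, W) feature maps."""
    weights = weights or [1.0] * len(student_feats)
    loss = 0.0
    for w, fs, ft in zip(weights, student_feats, teacher_feats):
        loss = loss + w * F.mse_loss(fs, ft.detach())  # teacher is frozen
    return loss

# total_loss = detection_loss + lambda_fc * feature_correction_loss(...)
```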
https://arxiv.org/abs/2404.11214
Gesture recognition based on surface electromyography (sEMG) has been gaining importance in many 3D interactive scenes. However, sEMG is easily influenced by various forms of noise in real-world environments, leading to challenges in providing long-term stable interactions through sEMG. Existing methods often struggle to enhance model noise resilience through various predefined data augmentation techniques. In this work, we revisit the problem from a short-term enhancement perspective to improve precision and robustness against various common noisy scenarios, using learnable denoising based on sEMG's intrinsic pattern information and sliding-window attention. We propose a Short Term Enhancement Module (STEM), which can be easily integrated with various models. STEM offers several benefits: 1) learnable denoising, enabling noise reduction without manual data augmentation; 2) scalability, adaptable to various models; and 3) cost-effectiveness, achieving short-term enhancement through minimal weight sharing in an efficient attention mechanism. In particular, we incorporate STEM into a transformer, creating the Short Term Enhanced Transformer (STET). Compared with best-competing approaches, the impact of noise on STET is reduced by more than 20%. We also report promising results on both classification and regression datasets and demonstrate that STEM generalizes across different gesture recognition tasks.
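A minimal mask-based variant of the sliding-window self-attention that STEM relies on is sketched below; the window size and tensor shapes are illustrative placeholders.

```python
# Minimal sliding-window self-attention via a band mask (illustrative).
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window: int = 8):
    """q, k, v: (B, T, D). Each position attends only to neighbors within `window`."""
    B, T, D = q.shape
    scores = q @ k.transpose(-2, -1) / D ** 0.5            # (B, T, T)
    idx = torch.arange(T)
    band = (idx[None, :] - idx[:, None]).abs() <= window   # (T, T) band mask
    scores = scores.masked_fill(~band, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

x = torch.randn(2, 200, 64)                                # e.g. sEMG frame features
out = sliding_window_attention(x, x, x, window=8)
```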
https://arxiv.org/abs/2404.11213
Constructing vectorized high-definition maps from surround-view cameras has garnered significant attention in recent years. However, the multi-stage sequential workflow commonly employed in prevailing approaches often leads to the loss of early-stage information, particularly in perspective-view features. Usually, such loss is observed as missing instances or mismatched shapes in the final bird's-eye-view predictions. To address this concern, we propose a novel approach, namely HybriMap, which effectively exploits clues from hybrid features to ensure the delivery of valuable information. Specifically, we design the Dual Enhancement Module to enable both explicit integration and implicit modification under the guidance of hybrid features. Additionally, perspective keypoints are utilized as supervision, further directing the feature enhancement process. Extensive experiments conducted on existing benchmarks have demonstrated the state-of-the-art performance of our proposed approach.
https://arxiv.org/abs/2404.11155
In this challenge, we disentangle the deep filters from the original DeepFilterNet and incorporate them into our Spec-UNet-based network to further improve a hybrid Demucs (hdemucs) based remixing pipeline. The motivation behind the use of the deep filter component lies in its potential for better handling temporal fine structures. We demonstrate an incremental improvement in both the Signal-to-Distortion Ratio (SDR) and the Hearing Aid Audio Quality Index (HAAQI) metrics when comparing the performance of hdemucs against different versions of our model.
https://arxiv.org/abs/2404.11116
Video Frame Interpolation (VFI) is a crucial technique in applications such as slow-motion generation, frame rate conversion, and video frame restoration. This paper introduces an efficient video frame interpolation framework that aims to strike a favorable balance between efficiency and quality. Our framework follows a general paradigm consisting of a flow estimator and a refinement module, while incorporating carefully designed components. First, we adopt depth-wise convolution with large kernels in the flow estimator, which simultaneously reduces the parameters and enlarges the receptive field for encoding rich context and handling complex motion. Second, diverging from the common UNet-structured (encoder-decoder) design for the refinement module, which we find redundant, our decoder-only refinement module directly enhances the result from coarse to fine features, offering a more efficient process. In addition, to address the challenge of handling high-definition frames, we introduce an innovative HD-aware augmentation strategy during training, leading to consistent enhancement on HD images. Extensive experiments are conducted on diverse datasets: Vimeo90K, UCF101, Xiph, and SNU-FILM. The results demonstrate that our approach achieves state-of-the-art performance with clear improvement while requiring far fewer FLOPs and parameters, reaching a better spot for balancing efficiency and quality.
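The first design choice, depth-wise convolution with large kernels, is compact to express in PyTorch; the channel count and kernel size below are placeholders, not the paper's configuration.

```python
# Depth-wise convolution with a large kernel: per-channel filtering
# (groups == channels) keeps parameters low while enlarging the
# receptive field. Sizes below are placeholders.
import torch
import torch.nn as nn

channels, kernel = 64, 31
dw_large = nn.Conv2d(channels, channels, kernel_size=kernel,
                     padding=kernel // 2, groups=channels)
pw_mix = nn.Conv2d(channels, channels, kernel_size=1)  # 1x1 conv mixes channels

x = torch.randn(1, channels, 128, 128)
y = pw_mix(dw_large(x))
# Parameter count: depth-wise 64*31*31 ≈ 61.5k weights, versus a dense
# 64->64 conv with the same kernel at 64*64*31*31 ≈ 3.9M weights.
```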
https://arxiv.org/abs/2404.11108
In this paper, we address the Bracket Image Restoration and Enhancement (BracketIRE) task, which requires restoring a high-quality high dynamic range (HDR) image from a sequence of noisy, blurred, low dynamic range (LDR) multi-exposure RAW inputs, using a novel framework. To overcome this challenge, we present IREANet, which improves multiple-exposure alignment and aggregation with a Flow-guide Feature Alignment Module (FFAM) and an Enhanced Feature Aggregation Module (EFAM). Specifically, the proposed FFAM incorporates inter-frame optical flow as guidance to facilitate the deformable alignment and spatial attention modules for better feature alignment. The EFAM further employs the proposed Enhanced Residual Block (ERB) as a foundational component, wherein a unidirectional recurrent network aggregates the aligned temporal features to better reconstruct the results. To improve model generalization and performance, we additionally employ the Bayer-preserving augmentation (BayerAug) strategy to augment the multi-exposure RAW inputs. Our experimental evaluations demonstrate that the proposed IREANet achieves state-of-the-art performance compared with previous methods.
https://arxiv.org/abs/2404.10358
The burgeoning volume of digital content across diverse modalities necessitates efficient storage and retrieval methods. Conventional approaches struggle to cope with the escalating complexity and scale of multimedia data. In this paper, our proposed framework addresses this challenge by fusing AI-native multi-modal search capabilities with neural image compression. First, we analyze the intricate relationship between compressibility and searchability, recognizing the pivotal role each plays in the efficiency of storage and retrieval systems. A simple adapter is then used to bridge the features of Learned Image Compression (LIC) and Contrastive Language-Image Pretraining (CLIP) while retaining semantic fidelity and supporting retrieval of multi-modal data. Experimental evaluations on the Kodak dataset demonstrate the efficacy of our approach, showcasing significant enhancements in compression efficiency and search accuracy compared to existing methodologies. Our work marks a significant advancement towards scalable and efficient multi-modal search systems in the era of big data.
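A "simple adapter" bridging LIC latents to CLIP's embedding space could be as small as a two-layer MLP trained to match frozen CLIP image embeddings; the layer sizes and cosine objective below are assumptions for illustration.

```python
# Hypothetical adapter: map a pooled LIC latent into CLIP's embedding space
# so compressed images remain searchable. Sizes and loss are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LICtoCLIPAdapter(nn.Module):
    def __init__(self, lic_dim: int = 192, clip_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(lic_dim, 1024), nn.GELU(), nn.Linear(1024, clip_dim))

    def forward(self, lic_latent: torch.Tensor) -> torch.Tensor:
        pooled = lic_latent.mean(dim=(-2, -1))      # (B, C, H, W) -> (B, C)
        return F.normalize(self.net(pooled), dim=-1)

def alignment_loss(pred: torch.Tensor, clip_emb: torch.Tensor) -> torch.Tensor:
    """Drive adapter outputs toward the frozen CLIP image embeddings."""
    return 1.0 - F.cosine_similarity(pred, F.normalize(clip_emb, dim=-1)).mean()
```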
https://arxiv.org/abs/2404.10234
Large Language Models (LLMs) embed complex biases and stereotypes that can lead to detrimental user experiences and societal consequences, often without conscious awareness from the models themselves. This paper emphasizes the importance of equipping LLMs with mechanisms for better self-reflection and bias recognition. Our experiments demonstrate that by informing LLMs that their generated content does not represent their own views and questioning them about bias, their capability to identify and address biases improves. This enhancement is attributed to the internal attention mechanisms and potential internal sensitivity policies of LLMs. Building upon these findings, we propose a novel method to diminish bias in LLM outputs. This involves engaging LLMs in multi-role scenarios, acting as different roles tasked with exposing bias, with an impartial referee at the end of each debate loop. A ranking scoring mechanism is employed to quantify bias levels, enabling more refined reflections and superior output quality. Comparative experimental results confirm that our method outperforms existing approaches in reducing bias, making it a valuable contribution to efforts towards more ethical AI systems.
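A skeletal version of the multi-role debate loop, with roles probing for bias and a referee scoring each round, can be written over any chat-model callable; the role prompts, rubric, and `ask` interface are illustrative assumptions.

```python
# Skeletal multi-role debate loop for bias reduction. Role prompts,
# referee rubric, and the `ask` callable are illustrative assumptions.
from typing import Callable, List

def debate_round(draft: str, roles: List[str], ask: Callable[[str], str]) -> str:
    critiques = [ask(f"As {role}, point out biases in:\n{draft}") for role in roles]
    verdict = ask(
        "As an impartial referee, rank these critiques by severity (1-10 each)\n"
        "and rewrite the text to remove the identified biases:\n"
        + "\n---\n".join(critiques) + f"\n\nText:\n{draft}")
    return verdict

def debias(draft: str, ask: Callable[[str], str], loops: int = 3) -> str:
    roles = ["a sociologist", "an affected community member", "a devil's advocate"]
    for _ in range(loops):
        draft = debate_round(draft, roles, ask)  # refine over several debate loops
    return draft
```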
https://arxiv.org/abs/2404.10160
This study addresses the evolving challenges in urban traffic monitoring and detection systems based on fisheye-lens cameras by proposing a framework that improves the efficacy and accuracy of these systems. In the context of urban infrastructure and transportation management, advanced traffic monitoring systems have become critical for managing the complexities of urbanization and increasing vehicle density. Traditional monitoring methods, which rely on static cameras with narrow fields of view, are ineffective in dynamic urban environments, necessitating the installation of multiple cameras, which raises costs. Fisheye lenses, which were recently introduced, provide wide and omnidirectional coverage in a single frame, making them a transformative solution. However, issues such as distorted views and blurriness arise, preventing accurate object detection on these images. Motivated by these challenges, this study proposes a novel approach that combines a transformer-based image enhancement framework and an ensemble learning technique to address these challenges and improve traffic monitoring accuracy, making significant contributions to the future of intelligent traffic management systems. Our proposed methodological framework won 5th place in the 2024 AI City Challenge, Track 4, with an F1 score of 0.5965 on experimental validation data. The experimental results demonstrate the effectiveness, efficiency, and robustness of the proposed system. Our code is publicly available at this https URL.
https://arxiv.org/abs/2404.10078
Image restoration, which aims to recover high-quality images from their corrupted counterparts, often faces the challenge of being an ill-posed problem that allows multiple solutions for a single input. However, most deep learning based works simply employ an l1 loss to train their network in a deterministic way, resulting in over-smoothed predictions with inferior perceptual quality. In this work, we propose a novel method that shifts the focus from a deterministic pixel-by-pixel comparison to a statistical perspective, emphasizing the learning of distributions rather than individual pixel values. The core idea is to introduce spatial entropy into the loss function to measure the distribution difference between predictions and targets. To make this spatial entropy differentiable, we employ kernel density estimation (KDE) to approximate the probability of each pixel's specific intensity value from its neighboring area. Specifically, we pair the entropy loss with diffusion models, aiming for superior accuracy and enhanced perceptual quality over the l1-based noise-matching loss. In our experiments, we evaluate the proposed method for low-light enhancement on two datasets and in the NTIRE 2024 challenge. All these results illustrate the effectiveness of our statistics-based entropy loss. Code is available at this https URL.
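A differentiable rendering of the spatial-entropy idea can be sketched with Gaussian soft histograms (a KDE) over each pixel's neighborhood, compared by KL divergence; the bin count, bandwidth, and window size below are illustrative choices, not the paper's settings.

```python
# Differentiable spatial-entropy sketch: Gaussian KDE soft histograms over
# local neighborhoods, compared with KL divergence. Bin count, bandwidth,
# and window size are illustrative choices, not the paper's exact settings.
import torch
import torch.nn.functional as F

def local_distributions(img, bins: int = 16, window: int = 7, bw: float = 0.05):
    """img: (B, 1, H, W) in [0, 1] -> per-pixel intensity distributions."""
    patches = F.unfold(img, window, padding=window // 2)     # (B, win*win, H*W)
    centers = torch.linspace(0, 1, bins, device=img.device)
    # Gaussian kernel between every neighborhood value and every bin center.
    k = torch.exp(-0.5 * ((patches.unsqueeze(1) - centers.view(1, bins, 1, 1)) / bw) ** 2)
    hist = k.sum(dim=2)                                      # (B, bins, H*W)
    return hist / hist.sum(dim=1, keepdim=True).clamp_min(1e-8)

def spatial_entropy_loss(pred, target):
    """Distribution difference between prediction and target neighborhoods."""
    p, q = local_distributions(pred), local_distributions(target)
    return F.kl_div(p.clamp_min(1e-8).log(), q, reduction="batchmean")
```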
https://arxiv.org/abs/2404.09735
Improving instance-specific image goal navigation (InstanceImageNav), which locates the identical object in a real-world environment from a query image, is essential for robotic systems to assist users in finding desired objects. The challenge lies in the domain gap between the low-quality images observed by the moving robot, characterized by motion blur and low resolution, and the high-quality query images provided by the user. Such domain gaps can significantly reduce the task success rate but have not been the focus of previous work. To address this, we propose a novel method called Few-shot Cross-quality Instance-aware Adaptation (CrossIA), which employs contrastive learning with an instance classifier to align features between a massive number of low-quality images and a few high-quality ones. This approach effectively reduces the domain gap by bringing the latent representations of cross-quality images closer on an instance basis. Additionally, the system integrates an object image collection with a pre-trained deblurring model to enhance the observed image quality. Our method fine-tunes the SimSiam model, pre-trained on ImageNet, using CrossIA. We evaluated our method's effectiveness through an InstanceImageNav task with 20 different types of instances, in which the robot must identify in a real-world environment the same instance shown in a high-quality query image. Our experiments showed that our method improves the task success rate by up to three times compared to the baseline, a conventional approach based on SuperGlue. These findings highlight the potential of leveraging contrastive learning and image enhancement techniques to bridge the domain gap and improve object localization in robotic applications. The project website is this https URL.
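The instance-level contrastive alignment at the heart of CrossIA can be sketched as an InfoNCE-style loss in which a low-quality view and the high-quality view of the same instance form the positive pair; the temperature and batching below are illustrative.

```python
# Sketch of instance-aware cross-quality contrastive alignment: embeddings of
# the same instance (low- and high-quality views) are pulled together while
# other instances in the batch act as negatives. Settings are illustrative.
import torch
import torch.nn.functional as F

def cross_quality_infonce(z_low, z_high, tau: float = 0.1):
    """z_low, z_high: (N, D) embeddings; row i of each is the same instance."""
    z_low, z_high = F.normalize(z_low, dim=1), F.normalize(z_high, dim=1)
    logits = z_low @ z_high.t() / tau          # (N, N) similarity matrix
    labels = torch.arange(z_low.size(0), device=z_low.device)
    return F.cross_entropy(logits, labels)     # positives lie on the diagonal
```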
https://arxiv.org/abs/2404.09645