Recent advancements in large language models (LLMs) have shown strong general reasoning abilities, yet their effectiveness in financial reasoning remains underexplored. In this study, we comprehensively evaluate 16 powerful reasoning and general LLMs on three complex financial tasks involving financial text, tabular data, and equations, assessing numerical reasoning, tabular interpretation, financial terminology comprehension, long-context processing, and equation-based problem solving. Our results show that while better datasets and pretraining improve financial reasoning, general enhancements such as CoT fine-tuning do not always yield consistent gains. Moreover, all reasoning strategies face challenges in improving performance on long-context and multi-table tasks. To address these limitations, we develop a financial reasoning-enhanced model based on Llama-3.1-8B-Instruct through CoT fine-tuning and reinforcement learning with domain-specific reasoning paths. Even with simple fine-tuning on a single financial dataset, our model achieves a consistent 10% performance improvement across tasks, surpassing all 8B models and even Llama3-70B-Instruct and Llama3.1-70B-Instruct on average. Our results highlight the need for domain-specific adaptations in financial tasks, pointing to future directions such as multi-table reasoning, long-context processing, and financial terminology comprehension. All our datasets, models, and code are publicly available. Furthermore, we introduce a leaderboard for benchmarking future datasets and models.
https://arxiv.org/abs/2502.08127
Hypercomplex image processing extends conventional techniques within a unified paradigm encompassing algebraic and geometric principles. This work leverages quaternions and the two-dimensional orthogonal planes split framework (splitting a quaternion, representing a pixel, into pairs of orthogonal 2D planes) for natural/biomedical image analysis through the following computational workflows and outcomes: natural/biomedical image re-colorization, natural image de-colorization, natural/biomedical image contrast enhancement, computational re-staining and stain separation in histological images, and performance gains in machine/deep learning pipelines for histological images. The workflows are analyzed separately for natural and biomedical images to showcase the effectiveness of the proposed approaches. The proposed workflows can regulate color appearance (e.g., with alternative renditions and grayscale conversion) and image contrast, be part of automated image processing pipelines (e.g., isolating stain components, boosting learning models), and assist in digital pathology applications (e.g., enhancing biomarker visibility, enabling colorblind-friendly renditions). Employing only basic arithmetic and matrix operations, this work offers a computationally accessible methodology in the hypercomplex domain that showcases versatility and consistency across image processing tasks and a range of computer vision and biomedical applications. The proposed non-data-driven methods achieve results comparable to or better than those reported in the literature (particularly in cases involving well-known methods), showcasing the potential of robust theoretical frameworks with practical effectiveness. Results, methods, and limitations are detailed alongside discussion of promising extensions, emphasizing the potential of feature-rich mathematical/computational frameworks for natural and biomedical images.
https://arxiv.org/abs/2502.07758
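To make the split concrete, here is a minimal NumPy sketch of the orthogonal planes split q± = (q ± fqg)/2 applied to a single RGB pixel encoded as a pure quaternion. The axis choice f = g = (i + j + k)/√3 (the gray line) is a common convention rather than the paper's exact workflow, and all names are illustrative.

```python
import numpy as np

def qmul(a, b):
    """Hamilton product of quaternions stored as [w, x, y, z]."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

# An RGB pixel encoded as the pure quaternion r*i + g*j + b*k.
pixel = np.array([0.0, 0.8, 0.5, 0.2])

# Gray-line axis mu = (i + j + k) / sqrt(3), a common choice of split axis.
mu = np.array([0.0, 1.0, 1.0, 1.0]) / np.sqrt(3.0)

# Orthogonal planes split q_plus/q_minus = (q +/- f q g) / 2, with f = g = mu.
fqg = qmul(mu, qmul(pixel, mu))
q_plus, q_minus = 0.5 * (pixel + fqg), 0.5 * (pixel - fqg)

print(q_minus)  # component along the gray line (luminance-like)
print(q_plus)   # component in the orthogonal plane (chrominance-like)
print(np.allclose(pixel, q_plus + q_minus))  # True: the split is exact
```

Only the qmul products and vector sums above are needed, which is what makes the approach computationally accessible.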
DiT-based video generation has achieved remarkable results, but research into enhancing existing models remains relatively unexplored. In this work, we introduce Enhance-A-Video, a training-free approach to enhance the coherence and quality of DiT-based generated videos. The core idea is to enhance cross-frame correlations based on non-diagonal temporal attention distributions. Thanks to its simple design, our approach can be easily applied to most DiT-based video generation frameworks without any retraining or fine-tuning. Across various DiT-based video generation models, our approach demonstrates promising improvements in both temporal consistency and visual quality. We hope this research can inspire future explorations in video generation enhancement.
https://arxiv.org/abs/2502.07508
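The abstract leaves the precise enhancement rule to the paper; the sketch below illustrates only the underlying idea, deriving a scalar cross-frame intensity from the non-diagonal entries of a temporal attention map and using it to rescale the attention output. The formula and the scale parameter are assumptions.

```python
import torch

def enhance_temporal_output(attn, out, scale=1.0):
    """Illustrative sketch: derive a cross-frame intensity from the
    off-diagonal mass of a temporal attention map and amplify the
    temporal-attention output with it.

    attn: (F, F) post-softmax temporal attention weights
    out:  (F, D) temporal attention output at one spatial location
    """
    f = attn.shape[0]
    off_diag = attn * (1.0 - torch.eye(f, dtype=attn.dtype))
    cfi = off_diag.sum() / (f * (f - 1))   # mean non-diagonal attention
    return out * (1.0 + scale * cfi)       # stronger cross-frame mixing

attn = torch.softmax(torch.randn(16, 16), dim=-1)
out = torch.randn(16, 64)
print(enhance_temporal_output(attn, out).shape)  # torch.Size([16, 64])
```

Because the rule touches only the temporal attention output, it can be dropped into an existing DiT block without retraining, which matches the training-free claim.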
This paper introduces CodeQUEST, a novel framework leveraging Large Language Models (LLMs) to iteratively evaluate and enhance code quality across multiple dimensions, including readability, maintainability, efficiency, and security. The framework is divided into two main components: an Evaluator that assesses code quality across ten dimensions, providing both quantitative scores and qualitative summaries, and an Optimizer that iteratively improves the code based on the Evaluator's feedback. Our study demonstrates that CodeQUEST can effectively and robustly evaluate code quality, with its assessments aligning closely with established code quality metrics. Through a series of experiments using a curated dataset of Python and JavaScript examples, CodeQUEST demonstrated significant improvements in code quality, achieving a mean relative percentage improvement of 52.6%. The framework's evaluations were validated against a set of proxy metrics comprising the Pylint score, the Radon Maintainability Index, and Bandit output logs, showing a meaningful correlation. This highlights the potential of LLMs in automating code quality evaluation and improvement processes, presenting a significant advancement toward enhancing software development practices. The code implementation of the framework is available at: this https URL.
https://arxiv.org/abs/2502.07399
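A schematic sketch of the Evaluator/Optimizer loop described above. The `llm` callable stands in for any chat-completion API; the four-dimension subset (the paper uses ten), the prompts, the toy score parser, and the stopping rule are all illustrative, not CodeQUEST's actual design.

```python
import re

def parse_score(reply: str) -> float:
    """Pull the first 1-5 rating out of an LLM reply (toy parser)."""
    m = re.search(r"\b([1-5](?:\.\d)?)\b", reply)
    return float(m.group(1)) if m else 0.0

DIMENSIONS = ["readability", "maintainability", "efficiency", "security"]

def evaluate(code, llm):
    """Evaluator: score each quality dimension and keep the rationale."""
    scores, notes = {}, {}
    for dim in DIMENSIONS:
        reply = llm(f"Rate the {dim} of this code from 1 to 5 and explain:\n{code}")
        scores[dim], notes[dim] = parse_score(reply), reply
    return scores, notes

def quest(code, llm, rounds=5, target=4.5):
    """Optimizer: rewrite the code against the Evaluator's feedback."""
    for _ in range(rounds):
        scores, notes = evaluate(code, llm)
        if sum(scores.values()) / len(scores) >= target:
            break  # quality goal reached, stop iterating
        feedback = "\n".join(notes.values())
        code = llm(f"Improve the code using this feedback:\n{feedback}\n\n{code}")
    return code
```

The same loop structure is what allows the quantitative scores to double as a stopping criterion while the qualitative summaries drive the rewrite.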
Salient object detection (SOD) plays a critical role in vision-driven measurement systems (VMS), facilitating the detection and segmentation of key visual elements in an image. However, adverse imaging conditions such as haze during the day, low light, and haze at night severely degrade image quality, complicating the SOD process. To address these challenges, we propose a multi-task-oriented nighttime haze imaging enhancer (MToIE), which integrates three tasks: daytime dehazing, low-light enhancement, and nighttime dehazing. MToIE incorporates two key innovative components. First, the network employs a task-oriented node learning mechanism to handle three specific degradation types (daytime haze, low light, and nighttime haze), with an embedded self-attention module enhancing its performance in nighttime imaging. Second, a multi-receptive-field enhancement module efficiently extracts multi-scale features through three parallel depthwise separable convolution branches with different dilation rates, capturing comprehensive spatial information with minimal computational overhead. To ensure optimal image reconstruction quality and visual characteristics, we propose a hybrid loss function. Extensive experiments on different types of weather/imaging conditions illustrate that MToIE surpasses existing methods, significantly enhancing the accuracy and reliability of vision systems across diverse imaging scenarios. The code is available at this https URL.
https://arxiv.org/abs/2502.07351
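The multi-receptive-field idea is straightforward to sketch in PyTorch: three parallel depthwise separable 3x3 branches with different dilation rates, concatenated and fused by a 1x1 convolution. The dilation rates (1, 2, 4) and the fusion layer are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultiReceptiveField(nn.Module):
    """Sketch of a multi-receptive-field block: parallel depthwise
    separable 3x3 branches at different dilation rates, fused by 1x1."""

    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList()
        for d in dilations:
            self.branches.append(nn.Sequential(
                # depthwise 3x3 with dilation d (padding=d keeps spatial size)
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d,
                          groups=channels, bias=False),
                # pointwise 1x1 completes the depthwise separable convolution
                nn.Conv2d(channels, channels, 1, bias=False),
                nn.ReLU(inplace=True),
            ))
        self.fuse = nn.Conv2d(channels * len(dilations), channels, 1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

x = torch.randn(1, 32, 64, 64)
print(MultiReceptiveField(32)(x).shape)  # torch.Size([1, 32, 64, 64])
```

Depthwise separable branches keep the parameter and FLOP count low, which is what lets the block gather multi-scale context "with minimal computational overhead."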
Zero-shot composed image retrieval (ZS-CIR) enables image search using a reference image and text prompt without requiring specialized text-image composition networks trained on large-scale paired data. However, current ZS-CIR approaches face three critical limitations in their reliance on composed text embeddings: static query embedding representations, insufficient utilization of image embeddings, and suboptimal performance when fusing text and image embeddings. To address these challenges, we introduce the Prompt Directional Vector (PDV), a simple yet effective training-free enhancement that captures semantic modifications induced by user prompts. PDV enables three key improvements: (1) dynamic composed text embeddings where prompt adjustments are controllable via a scaling factor, (2) composed image embeddings through semantic transfer from text prompts to image features, and (3) weighted fusion of composed text and image embeddings that enhances retrieval by balancing visual and semantic similarity. Our approach serves as a plug-and-play enhancement for existing ZS-CIR methods with minimal computational overhead. Extensive experiments across multiple benchmarks demonstrate that PDV consistently improves retrieval performance when integrated with state-of-the-art ZS-CIR approaches, particularly for methods that generate accurate compositional embeddings. The code will be publicly available.
https://arxiv.org/abs/2502.07215
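One plausible instantiation of the three improvements, assuming CLIP-style unit-norm embeddings; the variable names, the use of a reference-caption embedding, and the specific scaling/fusion weights are illustrative rather than the paper's exact formulas.

```python
import numpy as np

def l2norm(v):
    return v / np.linalg.norm(v)

# Assumed CLIP-style embeddings (stand-ins for a real encoder's outputs).
e_img      = l2norm(np.random.randn(512))  # reference image embedding
e_caption  = l2norm(np.random.randn(512))  # text embedding of the reference's caption
e_composed = l2norm(np.random.randn(512))  # text embedding of caption + user prompt

# Prompt directional vector: the shift the prompt induces in text space.
pdv = e_composed - e_caption

alpha, beta, w = 1.0, 0.5, 0.6                    # scaling and fusion weights
q_text  = l2norm(e_caption + alpha * pdv)          # (1) dynamic composed text embedding
q_image = l2norm(e_img + beta * pdv)               # (2) semantic transfer onto the image
query   = l2norm(w * q_text + (1 - w) * q_image)   # (3) weighted fusion for retrieval
```

Because the query is assembled from existing embeddings with two scalars and one mixing weight, the enhancement is training-free and plug-and-play, consistent with the abstract's claim of minimal overhead.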
Insulators are crucial insulation components and structural supports in power grids, playing a vital role in transmission lines. Due to temperature fluctuations, internal stress, or damage from hail, insulators are prone to damage. Automatic detection of damaged insulators faces challenges such as diverse defect types, small defect targets, and complex backgrounds and shapes. Most research on detecting insulator defects has focused on a single defect type or a specific material. However, the insulators in the grid's transmission lines have different colors and materials, various insulator defects coexist, and existing methods have difficulty meeting practical application requirements: current methods suffer from low detection accuracy, and their mAP@0.5 cannot meet application requirements. This paper proposes an improved YOLOv7 model for multi-type insulator defect detection. First, our model replaces the SPPCSPC module with the RFB module to enhance the network's feature extraction capability. Second, a CA mechanism is introduced into the head part to enhance the network's feature representation ability and improve detection accuracy. Third, a WIoU loss function is employed to address the low-quality samples that hinder model generalization during training, thereby improving the model's overall performance. The experimental results indicate that the proposed model improves across various performance metrics: a 1.6% gain in mAP@0.5, a corresponding 1.6% gain in mAP@0.5:0.95, a 1.3% increase in precision, and a 1% increase in recall. Moreover, the model reduces parameters by 3.2 million, decreasing computational cost by 2.5 GFLOPS. Notably, single-image detection is also 2.81 milliseconds faster.
https://arxiv.org/abs/2502.07179
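For reference, a hedged sketch of the WIoU loss as commonly described for Wise-IoU v1: an exponential center-distance penalty, normalized by the gradient-detached enclosing-box diagonal, multiplied onto the IoU loss. The paper may use a later WIoU variant with a different focusing term.

```python
import torch

def wiou_v1(pred, gt, eps=1e-7):
    """Hedged sketch of Wise-IoU v1 for boxes given as (x1, y1, x2, y2)."""
    # intersection / union
    lt = torch.maximum(pred[..., :2], gt[..., :2])
    rb = torch.minimum(pred[..., 2:], gt[..., 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_g = (gt[..., 2] - gt[..., 0]) * (gt[..., 3] - gt[..., 1])
    iou = inter / (area_p + area_g - inter + eps)

    # center distance over the enclosing box diagonal (detached from the graph)
    cp = (pred[..., :2] + pred[..., 2:]) / 2
    cg = (gt[..., :2] + gt[..., 2:]) / 2
    enc = torch.maximum(pred[..., 2:], gt[..., 2:]) - torch.minimum(pred[..., :2], gt[..., :2])
    diag2 = (enc[..., 0] ** 2 + enc[..., 1] ** 2).detach()
    r = torch.exp(((cp - cg) ** 2).sum(-1) / (diag2 + eps))
    return r * (1 - iou)

p = torch.tensor([[0.0, 0.0, 10.0, 10.0]])
g = torch.tensor([[1.0, 1.0, 11.0, 11.0]])
print(wiou_v1(p, g))  # modest penalty: high overlap, small center offset
```

Detaching the normalizer is the detail that down-weights low-quality samples without letting them distort the gradient, which is the generalization benefit the abstract cites.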
Ischaemic stroke, a leading cause of death and disability, critically relies on neuroimaging for characterising the anatomical pattern of injury. Diffusion-weighted imaging (DWI) provides the highest expressivity in ischaemic stroke but poses substantial challenges for automated lesion segmentation: susceptibility artefacts, morphological heterogeneity, age-related comorbidities, time-dependent signal dynamics, instrumental variability, and limited labelled data. Current U-Net-based models therefore underperform, a problem accentuated by inadequate evaluation metrics that focus on mean performance, neglecting anatomical, subpopulation, and acquisition-dependent variability. Here, we present a high-performance DWI lesion segmentation tool that addresses these challenges through optimized vision-transformer-based architectures, integration of 3563 annotated lesions from multi-site data, and algorithmic enhancements, achieving state-of-the-art results. We further propose a novel evaluative framework assessing model fidelity, equity (across demographics and lesion subtypes), anatomical precision, and robustness to instrumental variability, promoting clinical and research utility. This work advances stroke imaging by reconciling model expressivity with domain-specific challenges and by redefining performance benchmarks to prioritize equity and generalizability, both critical for personalized medicine and mechanistic research.
https://arxiv.org/abs/2502.06939
Computational neuroimaging involves analyzing brain images or signals to provide mechanistic insights and predictive tools for human cognition and behavior. While diffusion models have shown stability and high-quality generation in natural images, there is increasing interest in adapting them to analyze brain data for various neurological tasks such as data enhancement, disease diagnosis, and brain decoding. This survey provides an overview of recent efforts to integrate diffusion models into computational neuroimaging. We begin by introducing the common neuroimaging data modalities, followed by the diffusion formulations and conditioning mechanisms. We then discuss how variations in the denoising starting point, condition input, and generation target of diffusion models have been developed to enhance specific neuroimaging tasks. For a comprehensive overview of the ongoing research, we provide a publicly available repository at this https URL.
https://arxiv.org/abs/2502.06552
Customized generation has achieved significant progress in image synthesis, yet personalized video generation remains challenging due to temporal inconsistencies and quality degradation. In this paper, we introduce CustomVideoX, an innovative framework leveraging the video diffusion transformer for personalized video generation from a reference image. CustomVideoX capitalizes on pre-trained video networks by exclusively training the LoRA parameters to extract reference features, ensuring both efficiency and adaptability. To facilitate seamless interaction between the reference image and video content, we propose 3D Reference Attention, which enables direct and simultaneous engagement of reference image features with all video frames across spatial and temporal dimensions. To mitigate the excessive influence of reference image features and textual guidance on generated video content during inference, we implement the Time-Aware Reference Attention Bias (TAB) strategy, dynamically modulating reference bias over different time steps. Additionally, we introduce the Entity Region-Aware Enhancement (ERAE) module, aligning highly activated regions of key entity tokens with reference feature injection by adjusting attention bias. To thoroughly evaluate personalized video generation, we establish a new benchmark, VideoBench, comprising over 50 objects and 100 prompts for extensive assessment. Experimental results show that CustomVideoX significantly outperforms existing methods in terms of video consistency and quality.
https://arxiv.org/abs/2502.06527
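A hedged sketch of how reference-image tokens might join the attention computation and receive a time-dependent bias. The concatenation of reference keys/values and the decreasing bias schedule are assumptions for illustration; the paper's 3D Reference Attention and TAB schedule are defined over full spatio-temporal token grids.

```python
import torch

def time_aware_bias(t, T, gamma=1.0):
    """Assumed schedule: weaken reference influence at early (noisy)
    diffusion steps and strengthen it later; the actual TAB may differ."""
    return gamma * (1.0 - t / T)

def reference_attention(q, k_vid, v_vid, k_ref, v_ref, t, T):
    """Sketch: every video token attends jointly to video tokens and
    reference-image tokens, with a time-aware additive bias applied to
    the reference slots. Shapes: q, k_vid, v_vid (N, D); k_ref, v_ref (M, D)."""
    k = torch.cat([k_vid, k_ref], dim=0)
    v = torch.cat([v_vid, v_ref], dim=0)
    logits = q @ k.T / k.shape[-1] ** 0.5
    logits[:, k_vid.shape[0]:] += time_aware_bias(t, T)  # bias reference slots
    return torch.softmax(logits, dim=-1) @ v

q = torch.randn(16, 64)
out = reference_attention(q, torch.randn(16, 64), torch.randn(16, 64),
                          torch.randn(4, 64), torch.randn(4, 64), t=100, T=1000)
print(out.shape)  # torch.Size([16, 64])
```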
This paper introduces a cost-efficient active learning (AL) framework for classification, featuring a novel query design called candidate set query. Unlike traditional AL queries requiring the oracle to examine all possible classes, our method narrows down the set of candidate classes likely to include the ground-truth class, significantly reducing the search space and labeling cost. Moreover, we leverage conformal prediction to dynamically generate small yet reliable candidate sets, adapting to model enhancement over successive AL rounds. To this end, we introduce an acquisition function designed to prioritize data points that offer high information gain at lower cost. Empirical evaluations on CIFAR-10, CIFAR-100, and ImageNet64x64 demonstrate the effectiveness and scalability of our framework. Notably, it reduces labeling cost by 42% on ImageNet64x64.
https://arxiv.org/abs/2502.06209
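A minimal split-conformal sketch of how small yet reliable candidate sets can be built from a model's softmax outputs. The 1 − max-probability nonconformity score and the quantile threshold are standard split-conformal choices, not necessarily the paper's exact recipe.

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Calibrate a score threshold so candidate sets cover the true
    class with probability ~1 - alpha (split conformal prediction)."""
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    k = int(np.ceil((len(scores) + 1) * (1 - alpha))) - 1
    return np.sort(scores)[min(k, len(scores) - 1)]

def candidate_set(probs, qhat):
    """Classes the oracle is asked to choose among for one sample."""
    return np.where(1.0 - probs <= qhat)[0]

# toy usage with a 10-class model's softmax outputs
rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(10), size=500)   # held-out calibration set
cal_labels = rng.integers(0, 10, size=500)
qhat = conformal_threshold(cal_probs, cal_labels)
print(candidate_set(rng.dirichlet(np.ones(10)), qhat))
```

As the model improves over AL rounds, recalibrating `qhat` naturally shrinks the candidate sets, which is what drives the labeling-cost reduction the abstract reports.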
Time delay estimation (TDE) plays a key role in acoustic echo cancellation (AEC) using the adaptive filter method; considerable residual echo remains if estimation errors arise. In this paper, we propose an adaptive-filter-bank-based neural network approach in which the delay is estimated by a bank of adaptive filters with overlapped time scopes: the energies of all filter weights are concatenated and fed to a classification network, and the index with the maximal probability is chosen as the estimated delay. Based on this TDE, an AEC scheme is designed using a neural network for residual echo and noise suppression, and the optimally-modified log-spectral amplitude (OMLSA) algorithm is adopted for robustness. In addition, a robust automatic gain control (AGC) scheme with a spectrum smoothing method is designed to amplify speech segments. Performance evaluations reveal that our scheme achieves higher performance.
https://arxiv.org/abs/2502.06098
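A toy NumPy sketch of the filter-bank idea: each NLMS filter covers an overlapped delay scope, and the concatenated weight energies form the classifier's input. Here a simple argmax stands in for the paper's classification network, and all sizes are illustrative.

```python
import numpy as np

def nlms(x, d, taps, mu=0.5, eps=1e-6):
    """Normalized LMS: adapt weights so that sum_i w[i]*x[n-i] tracks d[n]."""
    w = np.zeros(taps)
    for n in range(taps - 1, len(x)):
        u = x[n - taps + 1:n + 1][::-1]     # u[i] = x[n - i]
        e = d[n] - w @ u
        w += mu * e * u / (u @ u + eps)
    return w

rng = np.random.default_rng(0)
x = rng.standard_normal(4000)               # far-end reference signal
true_delay = 180
d = np.concatenate([np.zeros(true_delay), x[:-true_delay]])  # delayed echo

taps, hop, n_filters = 128, 64, 8           # scopes: [0,128), [64,192), ...
energies = []
for k in range(n_filters):
    offset = k * hop
    xs = np.concatenate([np.zeros(offset), x[:len(x) - offset]])
    energies.append(nlms(xs, d, taps) ** 2)

features = np.concatenate(energies)         # what the paper's classifier would see
k_best = max(range(n_filters), key=lambda k: energies[k].max())
print(k_best * hop + int(np.argmax(energies[k_best])))  # -> 180
```

Overlapping scopes guarantee every delay falls well inside at least one filter's span, so the weight-energy peak is never at a scope boundary.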
The usage of digital content (photos and videos) in a variety of applications has increased due to the popularity of multimedia devices. These uses include advertising campaigns, educational resources, and social networking platforms. There is an increasing need for high-quality graphic information as people become more visually focused. However, captured images frequently have poor visibility and a high amount of noise due to the limitations of image-capturing devices and lighting conditions. Improving the visual quality of images taken in low illumination is the aim of low-illumination image enhancement. Traditional image enhancement techniques address this problem by adjusting noise, brightness, and contrast; deep learning-based methods, however, have dominated recent advances in this area. These methods have effectively reduced noise while preserving important information, showing promising results in the improvement of low-illumination images. This paper provides an extensive summary of image signal processing methods for enhancing low-illumination images. The review classifies approaches into three categories: traditional approaches, deep learning-based methods, and hybrid techniques. Conventional techniques include denoising, automated white balancing, and noise reduction. Deep learning-based techniques use convolutional neural networks (CNNs) to recognize and extract characteristics from low-light images. To obtain better results, hybrid approaches combine deep learning-based methodologies with more conventional methods. The review also discusses the advantages and limitations of each approach and provides insights into future research directions in this field.
https://arxiv.org/abs/2502.05995
Medical images often exhibit low and blurred contrast between lesions and surrounding tissues, with considerable variation in lesion edges and shapes even within the same disease, leading to significant challenges in segmentation. Precise segmentation of lesions has therefore become an essential prerequisite for patient condition assessment and the formulation of treatment plans. Significant achievements have been made in research related to the U-Net model in recent years: it improves segmentation performance and is extensively applied in the semantic segmentation of medical images, offering technical support for consistent quantitative lesion analysis methods. First, this paper classifies medical image datasets on the basis of their imaging modalities and then examines U-Net and its various improved models from the perspective of structural modifications. The research objectives, innovative designs, and limitations of each approach are discussed in detail. Second, we summarize the four central improvement mechanisms of the U-Net and U-Net variant algorithms: the skip-connection mechanism, the residual-connection mechanism, 3D U-Net, and the transformer mechanism. Finally, we examine the relationships among the four core enhancement mechanisms and commonly utilized medical datasets and propose potential avenues and strategies for future advancements. This paper provides a systematic summary and reference for researchers in related fields, and we look forward to the design of more efficient and stable medical image segmentation network models based on the U-Net network.
https://arxiv.org/abs/2502.06895
Controllable 3D scene generation has extensive applications in virtual reality and interior design, where the generated scenes should exhibit high levels of realism and controllability in terms of geometry. Scene graphs provide a suitable data representation that facilitates these applications. However, current graph-based methods for scene generation are constrained to text-based inputs and exhibit insufficient adaptability to flexible user inputs, hindering the ability to precisely control object geometry. To address this issue, we propose MMGDreamer, a dual-branch diffusion model for scene generation that incorporates a novel Mixed-Modality Graph, visual enhancement module, and relation predictor. The mixed-modality graph allows object nodes to integrate textual and visual modalities, with optional relationships between nodes. It enhances adaptability to flexible user inputs and enables meticulous control over the geometry of objects in the generated scenes. The visual enhancement module enriches the visual fidelity of text-only nodes by constructing visual representations using text embeddings. Furthermore, our relation predictor leverages node representations to infer absent relationships between nodes, resulting in more coherent scene layouts. Extensive experimental results demonstrate that MMGDreamer exhibits superior control of object geometry, achieving state-of-the-art scene generation performance. Project page: this https URL.
https://arxiv.org/abs/2502.05874
Offline preference alignment for language models, such as Direct Preference Optimization (DPO), is favored for its effectiveness and simplicity, eliminating the need for costly reinforcement learning. Various offline algorithms have been developed for different data settings, yet they lack a unified understanding. In this study, we introduce Prior-Informed Preference Alignment (PIPA), a unified, RL-free probabilistic framework that formulates language model preference alignment as a Maximum Likelihood Estimation (MLE) problem with prior constraints. This method effectively accommodates both paired and unpaired data, as well as answer- and step-level annotations. We illustrate that DPO and KTO are special cases with different prior constraints within our framework. By integrating different types of prior information, we develop two variations of PIPA: PIPA-M and PIPA-N. Both algorithms demonstrate a $3\sim10\%$ performance enhancement on the GSM8K and MATH benchmarks across all configurations, achieving these gains without additional training or computational costs compared to existing algorithms.
https://arxiv.org/abs/2502.05773
Large Language Models (LLMs) have made significant strides in code generation and problem solving. Current approaches employ external tool-based iterative debuggers that use compiler or other tool-based runtime feedback to refine coarse programs generated by various methods. However, the effectiveness of these approaches heavily relies on the quality of the initial code generation, which remains an open challenge. In this paper, we introduce CodeSim, a novel multi-agent code generation framework that comprehensively addresses the stages of program synthesis (planning, coding, and debugging) through a human-like perception approach. Just as humans verify their understanding of an algorithm through visual simulation, CodeSim uniquely features plan verification and internal debugging through the step-by-step simulation of input/output. Extensive experiments across seven challenging competitive problem-solving and program synthesis benchmarks demonstrate CodeSim's remarkable code generation capabilities. Our framework achieves new state-of-the-art (pass@1) results: HumanEval 95.1%, MBPP 90.7%, APPS 22%, and CodeContests 29.1%. Furthermore, our method shows potential for even greater enhancement when cascaded with external debuggers. To facilitate further research and development in this area, we have open-sourced our framework at this link (this https URL).
https://arxiv.org/abs/2502.05664
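A schematic sketch of the simulation-based plan verification idea: stepping a plan through a sample input the way a human traces an algorithm by hand. The `llm` callable and the prompts are placeholders, not CodeSim's actual agent design.

```python
def verify_plan(plan_steps, sample_input, expected_output, llm):
    """Sketch: simulate a plan step by step on one input/output pair
    and ask whether the traced final state matches the expected output."""
    state = f"input = {sample_input!r}"
    for step in plan_steps:
        # the LLM plays the role of the human tracer, updating the state
        state = llm(f"Current state: {state}\n"
                    f"Simulate this step: {step}\n"
                    "Return the new state.")
    verdict = llm(f"Final state: {state}\n"
                  f"Expected output: {expected_output!r}\n"
                  "Does the plan produce the expected output? Answer yes/no.")
    return verdict.strip().lower().startswith("yes")
```

Verifying the plan before any code is written is what differentiates this internal debugging from the external, compiler-feedback debuggers mentioned above.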
A truly intelligent Large Language Model (LLM) should be capable of correcting errors in its responses through external interactions. However, even the most advanced models often face challenges in improving their outputs. In this paper, we explore how to cultivate LLMs with the self-refinement capability through iterative preference training, and how this ability can be leveraged to improve model performance during inference. To this end, we introduce a novel post-training and inference framework called ARIES: Adaptive Refinement and Iterative Enhancement Structure. This method iteratively performs preference training and self-refinement-based data collection. During training, ARIES strengthens the model's direct question-answering capability while simultaneously unlocking its self-refinement potential. During inference, ARIES harnesses this self-refinement capability to generate a series of progressively refined responses, which are then filtered using either Reward Model Scoring or a simple yet effective Rule-Based Selection mechanism, specifically tailored to our approach, to construct a dataset for the next round of preference training. Experimental results demonstrate the remarkable performance of ARIES. When applied to the Llama-3.1-8B model under the self-refinement setting, ARIES surpasses powerful models such as GPT-4o, achieving 62.3% length-controlled (LC) and 63.3% raw win rates on AlpacaEval 2 (outperforming Iterative DPO by 27.8% and 35.5%, respectively), as well as a 50.3% win rate on Arena-Hard (surpassing Iterative DPO by 26.6%). Furthermore, ARIES consistently enhances performance on mathematical reasoning tasks such as GSM8K and MATH.
https://arxiv.org/abs/2502.05605
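A schematic sketch of the inference-time loop: generate, self-refine, and keep the highest-scoring response. The `llm` and `reward` callables are placeholders for the tuned model and a reward-model scorer; the prompts and keep-the-best rule are illustrative, and ARIES also supports a rule-based selection in place of reward scoring.

```python
def aries_refine(question, llm, reward, rounds=3):
    """Sketch of inference-time self-refinement with reward filtering."""
    best = llm(question)
    best_score = reward(question, best)
    for _ in range(rounds):
        refined = llm(
            f"Question: {question}\n"
            f"Previous answer: {best}\n"
            "Review the previous answer and produce an improved one."
        )
        score = reward(question, refined)
        if score > best_score:           # keep only genuine improvements
            best, best_score = refined, score
    return best
```

In the full framework, the filtered refinement chains are not just returned to the user; they also become the preference pairs for the next training round, closing the train/infer loop.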
In this survey, we systematically analyze techniques used to adapt large multimodal models (LMMs) for low-resource (LR) languages, examining approaches ranging from visual enhancement and data creation to cross-modal transfer and fusion strategies. Through a comprehensive analysis of 106 studies across 75 LR languages, we identify key patterns in how researchers tackle the challenges of limited data and computational resources. We find that visual information often serves as a crucial bridge for improving model performance in LR settings, though significant challenges remain in areas such as hallucination mitigation and computational efficiency. We aim to provide researchers with a clear understanding of current approaches and remaining challenges in making LMMs more accessible to speakers of LR (understudied) languages. We complement our survey with an open-source repository available at: this https URL.
https://arxiv.org/abs/2502.05568
Large language models (LLMs) have significantly advanced human language understanding and generation, with pretraining data quality and organization being crucial to their performance. Multi-stage pretraining is a promising approach, but existing methods often lack quantitative criteria for data partitioning and instead rely on intuitive heuristics. In this paper, we propose the novel Four-quadRAnt Multi-stage prEtraining Strategy (FRAMES), guided by the established principle of organizing the pretraining process into four stages that achieve four significant loss reductions. This principle is grounded in two key findings: training on high-perplexity (PPL) data followed by low-PPL data, and training on low-PPL-difference (PD) data followed by high-PD data, each of which causes the loss to drop significantly twice and enhances performance. By partitioning data into four quadrants and strategically ordering them, FRAMES achieves a remarkable 16.8% average improvement over random sampling across MMLU and CMMLU, effectively boosting LLM performance.
https://arxiv.org/abs/2502.05551
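A hedged sketch of the quadrant partitioning: split documents by PPL and PPL difference (PD), then order the resulting stages per the two findings. The median cut points, the definition of PD (here, PPL under a second model minus PPL under the target model), and the combined stage ordering are assumptions; the paper defines its own quantitative criteria.

```python
import numpy as np

def frames_quadrants(ppl, ppl_other):
    """Partition document indices into four PPL x PD quadrants and
    return them in an assumed four-stage training order."""
    ppl = np.asarray(ppl)
    pd = np.asarray(ppl_other) - ppl          # assumed PD definition
    hi_ppl = ppl >= np.median(ppl)
    hi_pd = pd >= np.median(pd)
    quadrants = {
        ("high_ppl", "low_pd"):  np.where(hi_ppl & ~hi_pd)[0],
        ("high_ppl", "high_pd"): np.where(hi_ppl & hi_pd)[0],
        ("low_ppl", "low_pd"):   np.where(~hi_ppl & ~hi_pd)[0],
        ("low_ppl", "high_pd"):  np.where(~hi_ppl & hi_pd)[0],
    }
    # Stage order per the two findings (high PPL before low PPL,
    # low PD before high PD); the combined ordering is an assumption.
    order = [("high_ppl", "low_pd"), ("high_ppl", "high_pd"),
             ("low_ppl", "low_pd"), ("low_ppl", "high_pd")]
    return [quadrants[key] for key in order]

rng = np.random.default_rng(0)
stages = frames_quadrants(rng.random(1000) * 10, rng.random(1000) * 10)
print([len(s) for s in stages])  # roughly equal quadrant sizes
```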