Cancer is an abnormal growth with potential to invade locally and metastasize to distant organs. Accurate auto-segmentation of the tumor and surrounding normal tissues is required for radiotherapy treatment plan optimization. Recent AI-based segmentation models are generally trained on large public datasets, which lack the heterogeneity of local patient populations. While these studies advance AI-based medical image segmentation, research on local datasets is necessary to develop and integrate AI tumor segmentation models directly into hospital software for efficient and accurate oncology treatment planning and execution. This study enhances tumor segmentation using computationally efficient hybrid UNet-Transformer models on magnetic resonance imaging (MRI) datasets acquired from a local hospital under strict privacy protection. We developed a robust data pipeline for seamless DICOM extraction and preprocessing, followed by extensive image augmentation to ensure model generalization across diverse clinical settings, resulting in a total dataset of 6080 images for training. Our novel architecture integrates UNet-based convolutional neural networks with a transformer bottleneck and complementary attention modules, including efficient attention, Squeeze-and-Excitation (SE) blocks, Convolutional Block Attention Module (CBAM), and ResNeXt blocks. To accelerate convergence and reduce computational demands, we used a maximum batch size of 8 and initialized the encoder with pretrained ImageNet weights, training the model on dual NVIDIA T4 GPUs via checkpointing to overcome Kaggle's runtime limits. Quantitative evaluation on the local MRI dataset yielded a Dice similarity coefficient of 0.764 and an Intersection over Union (IoU) of 0.736, demonstrating competitive performance despite limited data and underscoring the importance of site-specific model development for clinical deployment.
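As an illustration of the attention components named above, here is a minimal PyTorch sketch (not the authors' code) of a squeeze-and-excitation block and a transformer bottleneck of the kind combined with a UNet encoder; channel sizes and depths are illustrative assumptions.

```python
# Minimal sketch, assuming 512-channel deepest features: an SE block plus a
# transformer bottleneck of the kind the abstract combines with a UNet encoder.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Channel attention: squeeze (global pool) then excite (gated MLP)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )
    def forward(self, x):                      # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))        # squeeze -> (B, C)
        return x * w[:, :, None, None]         # reweight channels

class TransformerBottleneck(nn.Module):
    """Flatten the deepest feature map into tokens and self-attend over them."""
    def __init__(self, channels: int = 512, heads: int = 8, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C)
        return self.encoder(tokens).transpose(1, 2).reshape(b, c, h, w)

feat = torch.randn(2, 512, 8, 8)               # deepest UNet encoder features
out = TransformerBottleneck()(SEBlock(512)(feat))
```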
https://arxiv.org/abs/2506.15562
We examine the intrinsic (within the attention head) and extrinsic (amongst the attention heads) structure of the self-attention mechanism in transformers. Theoretical evidence for the invariance of the self-attention mechanism to the softmax activation is obtained by appealing to paradifferential calculus (and is supported by computational examples); this result relies on the intrinsic organization of the attention heads. Furthermore, we use an existing methodology for the hierarchical organization of tensors to examine network structure, constructing hierarchical partition trees with respect to the query, key, and head axes of network 3-tensors. Such an organization is consequential since it allows one to profitably execute common signal processing tasks on a geometry where the organized network 3-tensors exhibit regularity. We exemplify this qualitatively, by visualizing the hierarchical organization of the tree comprised of attention heads and the diffusion map embeddings, and quantitatively, by investigating network sparsity through the expansion coefficients of individual attention heads and of the entire network with respect to the bi- and tri-Haar bases (respectively) on the space of queries, keys, and heads of the network. To showcase the utility of our theoretical and methodological findings, we provide computational examples using vision and language transformers. The ramifications of these findings are two-fold: (1) a subsequent step in interpretability analysis is theoretically admitted and can be exploited empirically for downstream interpretability tasks; and (2) one can use the network 3-tensor organization for empirical network applications such as model pruning (by virtue of network sparsity) and network architecture comparison.
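As a rough illustration of the 3-tensor viewpoint, the sketch below stacks per-head attention matrices into a (head, query, key) tensor and hierarchically organizes the head axis; plain SciPy agglomerative linkage stands in here for the paper's partition trees and Haar-type bases.

```python
# Illustrative sketch, not the paper's method: build the network 3-tensor and
# organize the head axis by similarity of the flattened attention maps.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

heads, n_q, n_k = 12, 16, 16
rng = np.random.default_rng(0)
logits = rng.normal(size=(heads, n_q, n_k))
attn = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)  # softmax per query row

# Network 3-tensor: axis 0 = heads, axis 1 = queries, axis 2 = keys.
tensor = attn                                     # shape (12, 16, 16)

flat = tensor.reshape(heads, -1)
tree = linkage(flat, method="average", metric="cosine")
order = dendrogram(tree, no_plot=True)["leaves"]  # heads reordered by the tree
organized = tensor[order]                         # regularity emerges along axis 0
```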
https://arxiv.org/abs/2506.15541
We propose Hierarchical Audio Codec (HAC), a unified neural speech codec that factorizes its bottleneck into three linguistic levels (acoustic, phonetic, and lexical) within a single model. HAC leverages two knowledge distillation objectives: one from a pre-trained speech encoder (HuBERT) for phoneme-level structure, and another from a text-based encoder (LaBSE) for lexical cues. Experiments on English and multilingual data show that HAC's factorized bottleneck yields disentangled token sets: one aligns with phonemes, while another captures word-level semantics. Quantitative evaluations confirm that HAC tokens preserve naturalness and provide interpretable linguistic information, outperforming single-level baselines in both disentanglement and reconstruction quality. These findings underscore HAC's potential as a unified discrete speech representation, bridging acoustic detail and lexical meaning for downstream speech generation and understanding tasks.
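A hedged sketch of the two distillation objectives: one token stream is regressed onto HuBERT-style frame features, another is pooled and matched to a LaBSE-style sentence embedding. Dimensions, loss forms, and weights are stand-ins, not the released model's.

```python
# Sketch under assumed shapes: phoneme-level MSE distillation plus lexical
# cosine distillation, combined into one training objective.
import torch
import torch.nn.functional as F

def distill_loss(phonetic_tokens, hubert_feats, lexical_tokens, labse_emb,
                 w_phon: float = 1.0, w_lex: float = 1.0):
    # Phoneme level: frame-wise regression onto the frozen speech encoder.
    l_phon = F.mse_loss(phonetic_tokens, hubert_feats)
    # Lexical level: pool the lexical stream, match the sentence embedding.
    pooled = lexical_tokens.mean(dim=1)
    l_lex = 1.0 - F.cosine_similarity(pooled, labse_emb, dim=-1).mean()
    return w_phon * l_phon + w_lex * l_lex

loss = distill_loss(torch.randn(4, 100, 768), torch.randn(4, 100, 768),
                    torch.randn(4, 100, 768), torch.randn(4, 768))
```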
https://arxiv.org/abs/2506.15456
Large language model based multi-agent systems have demonstrated significant potential in social simulation and complex task resolution domains. However, current frameworks face critical challenges in system architecture design, cross-domain generalizability, and performance guarantees, particularly as task complexity and the number of agents increase. We introduce AgentGroupChat-V2, a novel framework addressing these challenges through three core innovations: (1) a divide-and-conquer fully parallel architecture that decomposes user queries into hierarchical task forest structures, enabling dependency management and distributed concurrent processing; (2) an adaptive collaboration engine that dynamically selects heterogeneous LLM combinations and interaction modes based on task characteristics; and (3) agent organization optimization strategies combining divide-and-conquer approaches for efficient problem decomposition. Extensive experiments demonstrate AgentGroupChat-V2's superior performance across diverse domains, achieving 91.50% accuracy on GSM8K (exceeding the best baseline by 5.6 percentage points), 30.4% accuracy on competition-level AIME (nearly doubling other methods), and 79.20% pass@1 on HumanEval. Performance advantages become increasingly pronounced with higher task difficulty, particularly on Level 5 MATH problems, where improvements exceed 11 percentage points compared to state-of-the-art baselines. These results confirm that AgentGroupChat-V2 provides a comprehensive solution for building efficient, general-purpose LLM multi-agent systems with significant advantages in complex reasoning scenarios. Code is available at this https URL.
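A minimal sketch of the divide-and-conquer idea in (1): a query decomposed into a forest of tasks with dependencies, executed concurrently once their parents finish. The Task and run_forest names are illustrative; the real engine also selects LLMs and interaction modes per task.

```python
# Sketch, assuming tasks arrive in topological order; each "LLM call" is a sleep.
import asyncio
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    deps: list = field(default_factory=list)   # tasks that must finish first

async def run(task: Task, done: dict):
    await asyncio.gather(*(done[d] for d in task.deps))  # wait on dependencies
    await asyncio.sleep(0.01)                  # placeholder for an LLM call
    print(f"finished {task.name}")

async def run_forest(tasks: list):
    done = {}
    for t in tasks:                            # topological order assumed
        done[t.name] = asyncio.create_task(run(t, done))
    await asyncio.gather(*done.values())

asyncio.run(run_forest([Task("parse"), Task("solve_a", ["parse"]),
                        Task("solve_b", ["parse"]),
                        Task("merge", ["solve_a", "solve_b"])]))
```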
https://arxiv.org/abs/2506.15451
In the semiconductor sector, high demand combined with strong and increasing competition makes time to market and quality key factors in securing significant market share across application areas. Thanks to the success of deep learning methods in computer vision in recent years, Industry 4.0 and 5.0 applications such as defect classification have achieved remarkable results. In particular, Domain Adaptation (DA) has proven highly effective, since it focuses on using the knowledge learned on a (source) domain to adapt and perform effectively on a different but related (target) domain. By improving robustness and scalability, DA minimizes the need for extensive manual re-labeling or re-training of models. This not only reduces computational and resource costs but also allows human experts to focus on high-value tasks. We therefore tested the efficacy of DA techniques in semi-supervised and unsupervised settings within the context of the semiconductor field. Moreover, we propose the DBACS approach, a CycleGAN-inspired model enhanced with additional loss terms to improve performance. All the approaches are studied and validated on real-world electron microscope images in both unsupervised and semi-supervised settings, proving the usefulness of our method in advancing DA techniques for the semiconductor field.
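For orientation, here is a sketch of the CycleGAN-style objective that DBACS builds on (the paper's additional loss terms are not reproduced here). G maps source to target, F_inv maps back, and D_t discriminates target-domain images; all three are assumed networks.

```python
# Sketch of a standard CycleGAN generator objective; DBACS adds further terms.
import torch
import torch.nn.functional as F

def cycle_gan_losses(G, F_inv, D_t, x_src, x_tgt, lam: float = 10.0):
    fake_tgt = G(x_src)                      # translate source -> target domain
    logits = D_t(fake_tgt)
    # Adversarial term: the generator tries to make D_t label its fakes real (1).
    l_adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    # Cycle consistency: translating there and back should reproduce the input.
    l_cyc = F.l1_loss(F_inv(fake_tgt), x_src) + F.l1_loss(G(F_inv(x_tgt)), x_tgt)
    return l_adv + lam * l_cyc
```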
https://arxiv.org/abs/2506.15260
Medical imaging data contain sensitive patient information requiring strong privacy protection. Many analytical setups require data to be sent to a server for inference purposes. Homomorphic encryption (HE) provides a solution by allowing computations to be performed on encrypted data without revealing the original information. However, HE inference is computationally expensive, particularly for large images (e.g., chest X-rays). In this study, we propose an HE inference framework for medical images that uses VQGAN to compress images into latent representations, thereby significantly reducing the computational burden while preserving image quality. We approximate the activation functions with lower-degree polynomials to balance accuracy and efficiency in compliance with HE requirements. We observed that a downsampling factor of eight for compression achieved an optimal balance between performance and computational cost. We further adapted the squeeze-and-excitation module, which is known to improve traditional CNNs, to enhance the HE framework. Our method was tested on two chest X-ray datasets for multi-label classification tasks using vanilla CNN backbones. Although HE inference remains relatively slow and introduces minor performance differences compared with unencrypted inference, our approach shows strong potential for practical use on medical images.
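The activation trick is easy to make concrete: HE schemes evaluate additions and multiplications but not comparisons, so ReLU is replaced by a low-degree polynomial. The sketch below fits a degree-2 polynomial on [-5, 5]; the paper's exact polynomials and degrees may differ.

```python
# Sketch: least-squares polynomial approximation of ReLU for HE-friendly inference.
import numpy as np

xs = np.linspace(-5, 5, 1001)
coeffs = np.polyfit(xs, np.maximum(xs, 0), deg=2)   # degree-2 fit to ReLU

def poly_relu(x):
    # Only + and * are used, so this maps directly onto ciphertext operations.
    return coeffs[0] * x * x + coeffs[1] * x + coeffs[2]

print(np.abs(poly_relu(xs) - np.maximum(xs, 0)).max())  # worst-case fit error
```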
https://arxiv.org/abs/2506.15258
Speech enhancement, particularly denoising, is vital in improving the intelligibility and quality of speech signals for real-world applications, especially in noisy environments. While prior research has introduced various deep learning models for this purpose, many struggle to balance noise suppression, perceptual quality, and speaker-specific feature preservation, leaving a critical research gap in their comparative performance evaluation. This study benchmarks three state-of-the-art models (Wave-U-Net, CMGAN, and U-Net) on the SpEAR, VPQAD, and Clarkson datasets. These models were chosen for their relevance in the literature and code accessibility. The evaluation reveals that U-Net achieves high noise suppression, with SNR improvements of +71.96% on SpEAR, +64.83% on VPQAD, and +364.2% on the Clarkson dataset. CMGAN outperforms in perceptual quality, attaining the highest PESQ scores of 4.04 on SpEAR and 1.46 on VPQAD, making it well-suited for applications prioritizing natural and intelligible speech. Wave-U-Net balances these attributes with improvements in speaker-specific feature retention, evidenced by VeriSpeak score gains of +10.84% on SpEAR and +27.38% on VPQAD. This research demonstrates how advanced methods can optimize trade-offs between noise suppression, perceptual quality, and speaker recognition. The findings may contribute to advancing voice biometrics, forensic audio analysis, telecommunication, and speaker verification in challenging acoustic conditions.
https://arxiv.org/abs/2506.15000
Anatomical trees play an important role in clinical diagnosis and treatment planning. Yet, accurately representing these structures poses significant challenges owing to their intricate and varied topology and geometry. Most existing methods for synthesizing vasculature are rule-based and, despite providing some degree of control and variation in the structures produced, fail to capture the diversity and complexity of actual anatomical data. We developed a Recursive variational Neural Network (RvNN) that fully exploits the hierarchical organization of the vessel and learns a low-dimensional manifold encoding branch connectivity along with geometry features describing the target surface. After training, the RvNN latent space can be sampled to generate new vessel geometries. By leveraging the power of generative neural networks, we generate 3D models of blood vessels that are both accurate and diverse, which is crucial for medical and surgical training, hemodynamic simulations, and many other purposes. These results closely resemble real data, achieving high similarity in vessel radii, length, and tortuosity across various datasets, including those with aneurysms. To the best of our knowledge, this work is the first to utilize this technique for synthesizing blood vessels.
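An illustrative sketch of recursive, bottom-up encoding over a vessel tree: each node's geometry features are merged with its already-encoded children into a single latent code. The real RvNN is variational and includes a decoder; the sizes and binary-branching assumption here are for illustration only.

```python
# Sketch under assumed feature sizes: a deterministic recursive tree encoder.
import torch
import torch.nn as nn

class Node:
    def __init__(self, geom, children=()):
        self.geom = geom                     # e.g. radius, length, tortuosity features
        self.children = list(children)

class RecursiveEncoder(nn.Module):
    def __init__(self, d_geom: int = 8, d_lat: int = 32):
        super().__init__()
        self.leaf = nn.Linear(d_geom, d_lat)
        self.merge = nn.Sequential(nn.Linear(d_geom + 2 * d_lat, d_lat), nn.Tanh())
    def forward(self, node: Node) -> torch.Tensor:
        if not node.children:
            return torch.tanh(self.leaf(node.geom))
        kids = [self.forward(c) for c in node.children]
        kids += [torch.zeros_like(kids[0])] * (2 - len(kids))  # pad to binary branching
        return self.merge(torch.cat([node.geom, *kids], dim=-1))

root = Node(torch.randn(8), [Node(torch.randn(8)), Node(torch.randn(8))])
z = RecursiveEncoder()(root)                 # one latent code for the whole tree
```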
https://arxiv.org/abs/2506.14914
Aerial bombardment of the Gaza Strip beginning October 7, 2023 is one of the most intense bombing campaigns of the twenty-first century, driving widespread urban damage. Characterizing damage over a geographically dynamic and protracted armed conflict requires active monitoring. Synthetic aperture radar (SAR) has an established precedent for mapping disaster-induced damage with bi-temporal methods, but applications to active monitoring during sustained crises remain limited. Using interferometric SAR data from Sentinel-1, we apply a long temporal-arc coherent change detection (LT-CCD) approach to track weekly damage trends over the first year of the 2023 Israel-Hamas War. We detect 92.5% of damage labels in reference data from the United Nations with a negligible (1.2%) false positive rate. The temporal fidelity of our approach reveals rapidly increasing damage during the first three months of the war, focused in northern Gaza; a notable pause in damage during a temporary ceasefire; and surges of new damage as conflict hot-spots shift from north to south. Three-fifths (191,263) of all buildings were damaged or destroyed by the end of the study. Given the massive need for timely data on damage in armed conflict zones, our low-cost, low-latency approach enables rapid uptake of damage information at humanitarian and journalistic organizations.
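A toy sketch of the underlying signal: interferometric coherence between consecutive acquisitions stays high over stable ground and drops sharply when structures are damaged. The threshold and the long-temporal-arc statistics of LT-CCD are simplified away here.

```python
# Toy sketch, synthetic data: flag pixels whose week-over-week coherence collapses.
import numpy as np

rng = np.random.default_rng(1)
weeks, h, w = 52, 64, 64
coh = np.clip(rng.normal(0.8, 0.05, size=(weeks, h, w)), 0, 1)   # stable scene
coh[30:, 10:20, 10:20] = rng.normal(0.2, 0.05, size=(22, 10, 10))  # damage at week 30

drop = coh[:-1] - coh[1:]                      # week-over-week coherence loss
damaged = (drop > 0.3).any(axis=0)             # pixels with an abrupt coherence drop
first_week = np.argmax(drop > 0.3, axis=0)     # when the drop first occurred
print(damaged.sum(), "pixels flagged; first flagged week:", first_week[damaged].min())
```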
https://arxiv.org/abs/2506.14730
Integrating General Models (GMs), such as Large Language Models (LLMs), with Specialized Models (SMs) in autonomous driving tasks presents a promising approach to mitigating challenges in the data diversity and model capacity of existing specialized driving models. However, this integration leads to problems of asynchronous systems, which arise from the distinct characteristics inherent in GMs and SMs. To tackle this challenge, we propose NetRoller, an adapter that incorporates a set of novel mechanisms to facilitate the seamless integration of GMs and specialized driving models. Specifically, our mechanisms for interfacing the asynchronous GMs and SMs are organized into three key stages. NetRoller first harvests semantically rich and computationally efficient representations from the reasoning processes of LLMs using an early stopping mechanism, which preserves critical insights on driving context while maintaining low overhead. It then applies learnable query embeddings, nonsensical embeddings, and positional layer embeddings to facilitate robust and efficient cross-modality translation. Finally, it employs computationally efficient Query Shift and Feature Shift mechanisms to enhance the performance of SMs through few-epoch fine-tuning. Based on the mechanisms formalized in these three stages, NetRoller enables specialized driving models to operate at their native frequencies while maintaining situational awareness of the GM. Experiments conducted on the nuScenes dataset demonstrate that integrating a GM through NetRoller significantly improves human similarity and safety in planning tasks, and it also achieves noticeable precision improvements in detection and mapping tasks for end-to-end autonomous driving. The code and models are available at this https URL.
https://arxiv.org/abs/2506.14589
We introduce DreamLight, a model for universal image relighting that can seamlessly composite subjects into a new background while maintaining aesthetic uniformity in terms of lighting and color tone. The background can be specified by natural images (image-based relighting) or generated from unlimited text prompts (text-based relighting). Existing studies primarily focus on image-based relighting, with scant exploration of text-based scenarios. Some works employ intricate disentanglement pipeline designs that rely on environment maps to provide relevant information, an approach that grapples with the expensive data cost of intrinsic decomposition and light-source estimation. Other methods treat this task as an image translation problem and perform pixel-level transformation with an autoencoder architecture. While these methods achieve decent harmonization effects, they struggle to generate realistic and natural light interaction effects between the foreground and background. To alleviate these challenges, we reorganize the input data into a unified format and leverage the semantic prior provided by a pretrained diffusion model to facilitate the generation of natural results. Moreover, we propose a Position-Guided Light Adapter (PGLA) that condenses light information from different directions in the background into designed light query embeddings and modulates the foreground with direction-biased masked attention. In addition, we present a post-processing module named Spectral Foreground Fixer (SFF) that adaptively reorganizes the different frequency components of the subject and the relighted background, which helps enhance the consistency of the foreground appearance. Extensive comparisons and a user study demonstrate that DreamLight achieves remarkable relighting performance.
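A speculative sketch of the PGLA idea: a few learnable light-query embeddings cross-attend to background features, with an additive bias favoring positions in each query's assigned direction. Shapes and the bias form are assumptions, not the paper's implementation.

```python
# Speculative sketch: direction-biased cross-attention from light queries to
# background tokens; the resulting light tokens would modulate the foreground.
import torch

B, N, C, Q = 2, 256, 64, 4                       # batch, bg tokens, dim, light queries
bg = torch.randn(B, N, C)                        # background feature tokens
light_q = torch.nn.Parameter(torch.randn(Q, C))  # one query per light direction
dir_bias = torch.randn(Q, N)                     # positive = token lies in that direction

attn = torch.einsum("qc,bnc->bqn", light_q, bg) / C**0.5 + dir_bias
light_tokens = torch.einsum("bqn,bnc->bqc", attn.softmax(-1), bg)  # (B, Q, C)
# light_tokens would then modulate foreground features (e.g. via FiLM or attention).
```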
https://arxiv.org/abs/2506.14549
Retrieval-Augmented Generation (RAG) enriches Large Language Models (LLMs) by combining their internal, parametric knowledge with external, non-parametric sources, with the goal of improving factual correctness and minimizing hallucinations. The LiveRAG 2025 challenge explores RAG solutions that maximize accuracy on DataMorgana's QA pairs, which are composed of single-hop and multi-hop questions. The challenge provides access to sparse OpenSearch and dense Pinecone indices of the Fineweb 10BT dataset. It restricts model use to LLMs with at most 10B parameters and requires final answer generation with Falcon-3-10B. A judge LLM, alongside human evaluators, assesses the submitted answers. By exploring distinct retriever combinations and RAG solutions under the challenge conditions, our final solution emerged: InstructRAG in combination with a Pinecone retriever and a BGE reranker. Our solution achieved a correctness score of 1.13 and a faithfulness score of 0.55, placing fourth in the SIGIR 2025 LiveRAG Challenge.
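A minimal retrieve-rerank-generate sketch matching the shape of this configuration: the retrieve and generate arguments are placeholder callables (standing in for the Pinecone index and Falcon-3-10B), while the BGE reranker is loaded as a cross-encoder.

```python
# Sketch, assuming caller-supplied retrieve/generate functions.
from sentence_transformers import CrossEncoder

def answer(query: str, retrieve, generate, k: int = 20, top: int = 5) -> str:
    candidates = retrieve(query, k)                        # k passages from the index
    reranker = CrossEncoder("BAAI/bge-reranker-base")
    scores = reranker.predict([(query, p) for p in candidates])
    ranked = [p for _, p in sorted(zip(scores, candidates), reverse=True)][:top]
    prompt = ("Answer using the context.\n\n" + "\n\n".join(ranked)
              + f"\n\nQ: {query}\nA:")
    return generate(prompt)                                # e.g. Falcon-3-10B
```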
https://arxiv.org/abs/2506.14412
Deep learning in medical imaging faces obstacles: limited data diversity, ethical issues, high acquisition costs, and the need for precise annotations. Bleeding detection and localization during surgery is especially challenging due to the scarcity of high-quality datasets that reflect real surgical scenarios. We propose orGAN, a GAN-based system for generating high-fidelity, annotated surgical images of bleeding. By leveraging small "mimicking organ" datasets (synthetic models that replicate tissue properties and bleeding), our approach reduces ethical concerns and data-collection costs. orGAN builds on StyleGAN with Relational Positional Learning to simulate bleeding events realistically and mark bleeding coordinates. A LaMa-based inpainting module then restores clean, pre-bleed visuals, enabling precise pixel-level annotations. In evaluations, a balanced dataset of orGAN and mimicking-organ images achieved 90% detection accuracy in surgical settings and up to 99% frame-level accuracy. While our development data lack diverse organ morphologies and contain intraoperative artifacts, orGAN markedly advances the ethical, efficient, and cost-effective creation of realistic annotated bleeding datasets, supporting broader integration of AI in surgical practice.
https://arxiv.org/abs/2506.14303
Advances in treatment technology now allow for the use of customizable 3D-printed hydrogel wound dressings for patients with osteoradionecrosis (ORN) of the jaw (ONJ). Meanwhile, deep learning has enabled precise segmentation of 3D medical images using tools like nnUNet. However, the scarcity of labeled data in ONJ imaging makes supervised training impractical. This study aims to develop an unsupervised training approach for automatically identifying anomalies in imaging scans. We propose a novel two-stage training pipeline. In the first stage, a VQ-GAN is trained to accurately reconstruct normal subjects. In the second stage, random cube masking and ONJ-specific masking are applied to train a new encoder capable of recovering the data. The proposed method achieves successful segmentation on both simulated and real patient data. This approach provides a fast initial segmentation solution, reducing the burden of manual labeling. Additionally, it has the potential to be directly used for 3D printing when combined with hand-tuned post-processing.
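The scoring idea can be sketched generically: a reconstruction model trained only on normal anatomy fails to reproduce anomalies, so large residuals flag them. The paper's masking-based second-stage encoder is abstracted into a reconstruct callable here.

```python
# Sketch, assuming a caller-supplied normal-appearance reconstruction model.
import numpy as np

def anomaly_mask(volume: np.ndarray, reconstruct, thresh: float = 0.15) -> np.ndarray:
    recon = reconstruct(volume)                 # reconstruction of "normal" anatomy
    residual = np.abs(volume - recon)
    return residual > thresh                    # voxels the model cannot explain

# e.g. seg = anomaly_mask(ct_scan, vqgan_recon); then hand-tuned post-processing.
```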
https://arxiv.org/abs/2506.14209
Every individual carries a unique and personal life story shaped by their memories and experiences. However, these memories are often scattered and difficult to organize into a coherent narrative, a challenge that defines the task of autobiography writing. Existing conversational writing assistants tend to rely on generic user interactions and pre-defined guidelines, making it difficult for these systems to capture personal memories and develop a complete biography over time. We introduce StorySage, a user-driven software system designed to meet the needs of a diverse group of users, supporting flexible conversation and a structured approach to autobiography writing. Powered by a multi-agent framework composed of an Interviewer, Session Scribe, Planner, Section Writer, and Session Coordinator, our system iteratively collects user memories, updates their autobiography, and plans for future conversations. In experimental simulations, StorySage demonstrates its ability to navigate multiple sessions and capture user memories across many conversations. User studies (N=28) highlight how StorySage achieves improved conversational flow, narrative completeness, and higher user satisfaction when compared to a baseline. In summary, StorySage contributes both a novel architecture for autobiography writing and insights into how multi-agent systems can enhance human-AI creative partnerships.
https://arxiv.org/abs/2506.14159
Data plays the most prominent role in how language models acquire skills and knowledge. The lack of massive, well-organized pre-training datasets results in costly and inaccessible data pipelines. We present Essential-Web v1.0, a 24-trillion-token dataset in which every document is annotated with a twelve-category taxonomy covering topic, format, content complexity, and quality. Taxonomy labels are produced by EAI-Distill-0.5b, a fine-tuned 0.5b-parameter model that achieves an annotator agreement within 3% of Qwen2.5-32B-Instruct. With nothing more than SQL-style filters, we obtain competitive web-curated datasets in math (-8.0% relative to SOTA), web code (+14.3%), STEM (+24.5%) and medical (+8.6%). Essential-Web v1.0 is available on HuggingFace: this https URL
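An illustrative SQL-style filter of the kind described, run with DuckDB against the taxonomy columns; the column and label names below are guesses at the schema, not the dataset's documented fields.

```python
# Sketch with hypothetical column names (topic, quality, content_complexity).
import duckdb

math_subset = duckdb.sql("""
    SELECT url, text
    FROM read_parquet('essential-web-v1/*.parquet')
    WHERE topic = 'mathematics'
      AND quality IN ('high', 'medium')
      AND content_complexity IN ('advanced', 'expert')
""")
```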
https://arxiv.org/abs/2506.14111
Neural networks are a powerful tool for learning patterns from data. However, they do not respect known scientific laws, nor can they reveal novel scientific insights due to their black-box nature. In contrast, scientific reasoning distills biological or physical principles from observations and controlled experiments, and quantitatively interprets them with process-based models made of mathematical equations. Yet, process-based models rely on numerous free parameters that must be set in an ad-hoc manner, and thus often fit observations poorly in cross-scale predictions. While prior work has embedded process-based models in conventional neural networks, discovering interpretable relationships between parameters in process-based models and input features is still a grand challenge for scientific discovery. We thus propose Scientifically-Interpretable Reasoning Network (ScIReN), a fully-transparent framework that combines interpretable neural and process-based reasoning. An interpretable encoder predicts scientifically-meaningful latent parameters, which are then passed through a differentiable process-based decoder to predict labeled output variables. ScIReN also uses a novel hard-sigmoid constraint layer to restrict latent parameters to meaningful ranges defined by scientific prior knowledge, further enhancing its interpretability. While the embedded process-based model enforces established scientific knowledge, the encoder reveals new scientific mechanisms and relationships hidden in conventional black-box models. We apply ScIReN on two tasks: simulating the flow of organic carbon through soils, and modeling ecosystem respiration from plants. In both tasks, ScIReN outperforms black-box networks in predictive accuracy while providing substantial scientific interpretability: it can infer latent scientific mechanisms and their relationships with input features.
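The constraint layer is simple to sketch: raw encoder outputs pass through a hard sigmoid and are rescaled into per-parameter ranges set by scientific prior knowledge. The clamp form below follows the standard hard sigmoid; the example bounds are invented.

```python
# Sketch of a hard-sigmoid range constraint; lo/hi encode the scientific prior.
import torch
import torch.nn as nn

class HardSigmoidRange(nn.Module):
    def __init__(self, lo: torch.Tensor, hi: torch.Tensor):
        super().__init__()
        self.register_buffer("lo", lo)
        self.register_buffer("hi", hi)
    def forward(self, x):
        u = torch.clamp(x / 6 + 0.5, 0.0, 1.0)    # hard sigmoid -> [0, 1]
        return self.lo + (self.hi - self.lo) * u  # rescale into the prior range

# e.g. a turnover-rate parameter constrained to invented bounds [0.001, 0.1]:
layer = HardSigmoidRange(lo=torch.tensor([0.001]), hi=torch.tensor([0.1]))
params = layer(torch.randn(4, 1))                 # always inside [0.001, 0.1]
```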
https://arxiv.org/abs/2506.14054
This review explores recent advances in commonsense reasoning and intent detection, two key challenges in natural language understanding. We analyze 28 papers from ACL, EMNLP, and CHI (2020-2025), organizing them by methodology and application. Commonsense reasoning is reviewed across zero-shot learning, cultural adaptation, structured evaluation, and interactive contexts. Intent detection is examined through open-set models, generative formulations, clustering, and human-centered systems. By bridging insights from NLP and HCI, we highlight emerging trends toward more adaptive, multilingual, and context-aware models, and identify key gaps in grounding, generalization, and benchmark design.
https://arxiv.org/abs/2506.14040
Elastic Decision Transformers (EDTs) have proved to be particularly successful in offline reinforcement learning, offering a flexible framework that unifies sequence modeling with decision-making under uncertainty. Recent research has shown that incorporating intrinsic motivation mechanisms into EDTs improves performance across exploration tasks, yet the representational mechanisms underlying these improvements remain unexplored. In this paper, we introduce a systematic post-hoc explainability framework to analyze how intrinsic motivation shapes learned embeddings in EDTs. Through statistical analysis of embedding properties (including covariance structure, vector magnitudes, and orthogonality), we reveal that different intrinsic motivation variants create fundamentally different representational structures. Our analysis demonstrates environment-specific correlation patterns between embedding metrics and performance that explain why intrinsic motivation improves policy learning. These findings show that intrinsic motivation operates beyond simple exploration bonuses, acting as a representational prior that shapes embedding geometry in biologically plausible ways, creating environment-specific organizational structures that facilitate better decision-making.
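A sketch of the embedding statistics such an analysis rests on (covariance structure, vector magnitudes, pairwise orthogonality); emb stands in for embeddings extracted from a trained EDT.

```python
# Sketch with random stand-in embeddings; real ones come from a trained EDT.
import torch

emb = torch.randn(1000, 128)                     # (num states, embedding dim)

cov = torch.cov(emb.T)                           # (128, 128) covariance structure
eff_rank = cov.trace() / torch.linalg.matrix_norm(cov, ord=2)  # spectral spread
magnitudes = emb.norm(dim=1)                     # per-vector magnitude distribution
unit = emb / magnitudes[:, None]
gram = (unit @ unit.T).abs()                     # |cosine| between all pairs
ortho = (gram.sum() - gram.diagonal().sum()) / (gram.numel() - len(emb))
print(eff_rank.item(), magnitudes.mean().item(), ortho.item())  # ortho ~ 0 if orthogonal
```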
https://arxiv.org/abs/2506.13958
An effective reward model plays a pivotal role in reinforcement learning for post-training enhancement of visual generative models. However, current approaches to reward modeling suffer from implementation complexity due to their reliance on extensive human-annotated preference data or on meticulously engineered quality dimensions that are often incomplete and engineering-intensive. Inspired by adversarial training in generative adversarial networks (GANs), this paper proposes GAN-RM, an efficient reward modeling framework that eliminates manual preference annotation and explicit quality dimension engineering. Our method trains the reward model through discrimination between a small set of representative, unpaired target samples (denoted as Preference Proxy Data) and model-generated ordinary outputs, requiring only a few hundred target samples. Comprehensive experiments demonstrate GAN-RM's effectiveness across multiple key applications, including test-time scaling implemented as Best-of-N sample filtering and post-training approaches such as Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO).
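A hedged sketch of the recipe: train a binary head to separate a few hundred preference-proxy samples from ordinary model outputs, then reuse its logit as a reward, e.g. for Best-of-N filtering. The frozen feature extractor and the linear head are assumptions.

```python
# Sketch: discriminator-as-reward on top of assumed 512-d frozen features.
import torch
import torch.nn as nn
import torch.nn.functional as F

reward_head = nn.Linear(512, 1)                  # on top of frozen image features

def rm_step(proxy_feats, gen_feats, opt):
    logits = reward_head(torch.cat([proxy_feats, gen_feats]))
    labels = torch.cat([torch.ones(len(proxy_feats), 1),   # proxy data = "preferred"
                        torch.zeros(len(gen_feats), 1)])
    loss = F.binary_cross_entropy_with_logits(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def best_of_n(candidate_feats):                  # test-time scaling via the reward
    return int(reward_head(candidate_feats).argmax())
```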
https://arxiv.org/abs/2506.13846