We propose AdapTok, an adaptive temporal causal video tokenizer that can flexibly allocate tokens for different frames based on video content. AdapTok is equipped with a block-wise masking strategy that randomly drops tail tokens of each block during training, and a block causal scorer to predict the reconstruction quality of video frames using different numbers of tokens. During inference, an adaptive token allocation strategy based on integer linear programming is further proposed to adjust token usage given predicted scores. This design allows for sample-wise, content-aware, and temporally dynamic token allocation under a controllable overall budget. Extensive experiments on video reconstruction and generation on UCF-101 and Kinetics-600 demonstrate the effectiveness of our approach. Without additional image data, AdapTok consistently improves reconstruction quality and generation performance under different token budgets, allowing for more scalable and token-efficient generative video modeling.
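To make the allocation step concrete, here is a minimal sketch of budget-constrained, block-wise token selection posed as an integer linear program, assuming a scorer has already produced a quality estimate for every (block, token-count) pair. The function, variable names, and the use of scipy.optimize.milp are illustrative, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def allocate_tokens(scores, candidate_counts, budget):
    """scores: (B, K) predicted quality of block b when keeping candidate_counts[k] tokens.
    Picks one count per block, maximizing total predicted quality within the budget."""
    B, K = scores.shape
    n = B * K
    c = -scores.ravel()                       # milp minimizes, so negate to maximize quality
    pick_one = np.zeros((B, n))               # each block selects exactly one candidate count
    for b in range(B):
        pick_one[b, b * K:(b + 1) * K] = 1.0
    usage = np.tile(np.asarray(candidate_counts, dtype=float), B)
    constraints = [
        LinearConstraint(pick_one, lb=1, ub=1),
        LinearConstraint(usage[None, :], lb=0, ub=budget),   # total token budget
    ]
    res = milp(c=c, constraints=constraints,
               integrality=np.ones(n), bounds=Bounds(0, 1))  # binary selection variables
    choice = res.x.reshape(B, K).argmax(axis=1)
    return [candidate_counts[k] for k in choice]

# example: 4 blocks, each using 4, 8, or 16 tokens, under a 40-token budget
# allocate_tokens(np.random.rand(4, 3), [4, 8, 16], budget=40)
```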
https://arxiv.org/abs/2505.17011
Driving simulation plays a crucial role in developing reliable driving agents by providing controlled, evaluative environments. To enable meaningful assessments, a high-quality driving simulator must satisfy several key requirements: multi-modal sensing capabilities (e.g., camera and LiDAR) with realistic scene rendering to minimize observational discrepancies; closed-loop evaluation to support free-form trajectory behaviors; highly diverse traffic scenarios for thorough evaluation; multi-agent cooperation to capture interaction dynamics; and high computational efficiency to ensure affordability and scalability. However, existing simulators and benchmarks fail to comprehensively meet these fundamental criteria. To bridge this gap, this paper introduces RealEngine, a novel driving simulation framework that holistically integrates 3D scene reconstruction and novel view synthesis techniques to achieve realistic and flexible closed-loop simulation in the driving context. By leveraging real-world multi-modal sensor data, RealEngine reconstructs background scenes and foreground traffic participants separately, allowing for highly diverse and realistic traffic scenarios through flexible scene composition. This synergistic fusion of scene reconstruction and view synthesis enables photorealistic rendering across multiple sensor modalities, ensuring both perceptual fidelity and geometric accuracy. Building upon this environment, RealEngine supports three essential driving simulation categories: non-reactive simulation, safety testing, and multi-agent interaction, collectively forming a reliable and comprehensive benchmark for evaluating the real-world performance of driving agents.
https://arxiv.org/abs/2505.16902
Most neural speech codecs achieve bitrate adjustment through intra-frame mechanisms, such as codebook dropout, at a Constant Frame Rate (CFR). However, speech segments inherently have time-varying information density (e.g., silent intervals versus voiced regions). This property makes CFR suboptimal in terms of both bitrate and token sequence length, hindering efficiency in real-time applications. In this work, we propose a Temporally Flexible Coding (TFC) technique, introducing variable frame rate (VFR) into neural speech codecs for the first time. TFC enables seamlessly tunable average frame rates and dynamically allocates frame rates based on temporal entropy. Experimental results show that a codec with TFC achieves optimal reconstruction quality with high flexibility, and maintains competitive performance even at lower frame rates. Our approach is a promising complement to other efforts to develop low-frame-rate neural speech codecs for more efficient downstream tasks.
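As a rough illustration of entropy-driven frame allocation, the sketch below scores each frame by how much new content it carries and keeps only enough frames to reach a target average frame rate. This is a simplification under assumed feature inputs, not the TFC algorithm itself.

```python
import numpy as np

def select_frames(features, target_ratio):
    """features: (T, D) per-frame acoustic features (e.g., log-mel frames).
    target_ratio: fraction of frames to keep (0.5 keeps half the base frame rate).
    Keeps the frames carrying the most new content; steady or silent stretches go first."""
    T = len(features)
    # novelty = spectral change relative to the previous frame; the first frame is always kept
    novelty = np.r_[np.inf, np.linalg.norm(np.diff(features, axis=0), axis=1)]
    keep = max(1, int(round(target_ratio * T)))
    return np.sort(np.argsort(-novelty)[:keep])   # indices of kept frames, in temporal order
```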
https://arxiv.org/abs/2505.16845
Event-based cameras offer unique advantages such as high temporal resolution, high dynamic range, and low power consumption. However, the massive storage requirements and I/O burdens of existing synthetic data generation pipelines, together with the scarcity of real data, prevent event-based training datasets from scaling up, limiting the development and generalization capabilities of event vision models. To address this challenge, we introduce Video-to-Voxel (V2V), an approach that directly converts conventional video frames into event-based voxel grid representations, bypassing the storage-intensive event stream generation entirely. V2V enables a 150-fold reduction in storage requirements while supporting on-the-fly parameter randomization for enhanced model robustness. Leveraging this efficiency, we train several video reconstruction and optical flow estimation model architectures on 10,000 diverse videos totaling 52 hours, an order of magnitude more than existing event datasets, and obtain substantial improvements.
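A toy version of the frame-to-voxel idea, assuming a thresholded log-intensity difference model and fixed temporal bins; the actual V2V pipeline and its on-the-fly parameter randomization are more elaborate.

```python
import numpy as np

def frames_to_voxel(frames, timestamps, num_bins=5, threshold=0.2):
    """frames: (T, H, W) grayscale video in [0, 1]; timestamps: (T,) seconds.
    Returns a (num_bins, H, W) voxel grid of signed brightness-change counts,
    accumulated straight from frame differences without materializing an event stream."""
    log_i = np.log(frames.astype(np.float64) + 1e-3)          # event cameras respond to log intensity
    voxel = np.zeros((num_bins,) + frames.shape[1:])
    t0, t1 = timestamps[0], timestamps[-1]
    for i in range(1, len(frames)):
        diff = log_i[i] - log_i[i - 1]
        counts = np.abs(diff) // threshold                     # number of threshold crossings per pixel
        b = min(num_bins - 1, int((timestamps[i] - t0) / max(t1 - t0, 1e-9) * num_bins))
        voxel[b] += np.sign(diff) * counts                     # signed event count for this frame pair
    return voxel
```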
https://arxiv.org/abs/2505.16797
Masked diffusion models (MDMs) have achieved notable progress in modeling discrete data, while their potential in molecular generation remains underexplored. In this work, we explore this potential and report the surprising result that naively applying standard MDMs severely degrades performance. We identify the critical cause of this issue as a state-clashing problem, where the forward diffusion of distinct molecules collapses into a common state, resulting in a mixture of reconstruction targets that cannot be learned by a typical reverse diffusion process with unimodal predictions. To mitigate this, we propose Masked Element-wise Learnable Diffusion (MELD), which orchestrates per-element corruption trajectories to avoid collisions between distinct molecular graphs. This is achieved through a parameterized noise scheduling network that assigns distinct corruption rates to individual graph elements, i.e., atoms and bonds. Extensive experiments on diverse molecular benchmarks reveal that MELD markedly enhances overall generation quality compared to element-agnostic noise scheduling, increasing the chemical validity of vanilla MDMs on ZINC250K from 15% to 93%. Furthermore, it achieves state-of-the-art property alignment in conditional generation tasks.
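A hedged sketch of what a per-element corruption schedule could look like in PyTorch: a small network assigns each atom or bond its own masking rate at time t, so distinct graphs are unlikely to be corrupted into the same intermediate state. Module names and tensor shapes here are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ElementNoiseScheduler(nn.Module):
    """Predicts a per-element corruption rate in (0, 1) for a masked diffusion step."""
    def __init__(self, feat_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim + 1, hidden), nn.SiLU(), nn.Linear(hidden, 1))

    def forward(self, elem_feats, t):
        # elem_feats: (N, feat_dim) features of atoms and bonds; t: diffusion time in [0, 1]
        t_col = torch.full((elem_feats.size(0), 1), float(t), device=elem_feats.device)
        return torch.sigmoid(self.net(torch.cat([elem_feats, t_col], dim=-1))).squeeze(-1)

def corrupt(tokens, rates, mask_id):
    """Forward process: mask each graph element independently with its own rate."""
    keep = torch.rand_like(rates) >= rates
    return torch.where(keep, tokens, torch.full_like(tokens, mask_id))
```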
https://arxiv.org/abs/2505.16790
In group decision-making (GDM) scenarios, uncertainty, dynamic social structures, and vague information present major challenges for traditional opinion dynamics models. To address these issues, this study proposes a novel social network group decision-making (SNGDM) framework that integrates three-way decision (3WD) theory, dynamic network reconstruction, and linguistic opinion representation. First, the 3WD mechanism is introduced to explicitly model hesitation and ambiguity in agent judgments, thereby preventing irrational decisions. Second, a connection adjustment rule based on opinion similarity is developed, enabling agents to adaptively update their communication links and better reflect the evolving nature of social relationships. Third, linguistic terms are used to describe agent opinions, allowing the model to handle subjective, vague, or incomplete information more effectively. Finally, an integrated multi-agent decision-making framework is constructed, which simultaneously considers individual uncertainty, opinion evolution, and network dynamics. The proposed model is applied to a multi-UAV cooperative decision-making scenario, where simulation results and consensus analysis demonstrate its effectiveness. Experimental comparisons further verify the advantages of the algorithm in enhancing system stability and representing realistic decision-making behaviors.
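To make the mechanics concrete, the toy sketch below shows two of the ingredients in isolation: a classic three-way decision rule with acceptance and rejection thresholds, and an opinion-similarity-based link update. It is a simplification, not the paper's full model.

```python
import numpy as np

def three_way_decision(score, alpha=0.7, beta=0.4):
    """Classic 3WD rule: accept above alpha, reject below beta, defer in between."""
    if score >= alpha:
        return "accept"
    if score <= beta:
        return "reject"
    return "defer"            # hesitation region: postpone and gather more evidence

def rewire(opinions, sim_threshold=0.2):
    """Adaptive link update: agents stay (or become) connected iff their opinions are close.
    opinions: (N,) values in [0, 1]; returns a symmetric (N, N) 0/1 adjacency matrix."""
    diff = np.abs(opinions[:, None] - opinions[None, :])
    adj = (diff <= sim_threshold).astype(int)
    np.fill_diagonal(adj, 0)
    return adj
```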
https://arxiv.org/abs/2505.16781
While recent diffusion-based generative image codecs have shown impressive performance, their iterative sampling process introduces undesirable latency. In this work, we revisit the design of a diffusion-based codec and argue that multi-step sampling is not necessary for generative compression. Based on this insight, we propose OneDC, a One-step Diffusion-based generative image Codec, which integrates a latent compression module with a one-step diffusion generator. Recognizing the critical role of semantic guidance in one-step diffusion, we propose using the hyperprior as a semantic signal, overcoming the limitations of text prompts in representing complex visual content. To further enhance the semantic capability of the hyperprior, we introduce a semantic distillation mechanism that transfers knowledge from a pretrained generative tokenizer to the hyperprior codec. Additionally, we adopt a hybrid pixel- and latent-domain optimization to jointly enhance both reconstruction fidelity and perceptual realism. Extensive experiments demonstrate that OneDC achieves SOTA perceptual quality even with one-step generation, offering over 40% bitrate reduction and 20x faster decoding compared to prior multi-step diffusion-based codecs. Code will be released later.
https://arxiv.org/abs/2505.16687
We present a novel framework for dynamic 3D scene reconstruction that integrates three key components: an explicit tri-plane deformation field, a view-conditioned canonical radiance field with spherical harmonics (SH) attention, and a temporally-aware latent diffusion prior. Our method encodes 4D scenes using three orthogonal 2D feature planes that evolve over time, enabling efficient and compact spatiotemporal representation. These features are explicitly warped into a canonical space via a deformation offset field, eliminating the need for MLP-based motion modeling. In canonical space, we replace traditional MLP decoders with a structured SH-based rendering head that synthesizes view-dependent color via attention over learned frequency bands, improving both interpretability and rendering efficiency. To further enhance fidelity and temporal consistency, we introduce a transformer-guided latent diffusion module that refines the tri-plane and deformation features in a compressed latent space. This generative module denoises scene representations under ambiguous or out-of-distribution (OOD) motion, improving generalization. Our model is trained in two stages: the diffusion module is first pre-trained independently, and then fine-tuned jointly with the full pipeline using a combination of image reconstruction, diffusion denoising, and temporal consistency losses. We demonstrate state-of-the-art results on synthetic benchmarks, surpassing recent methods such as HexPlane and 4D Gaussian Splatting in visual quality, temporal coherence, and robustness to sparse-view dynamic inputs.
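For readers unfamiliar with tri-plane representations, the core lookup reduces to three bilinear samples, one per orthogonal feature plane. The PyTorch sketch below uses assumed shapes, names, and a simple concatenation of the three plane features.

```python
import torch
import torch.nn.functional as F

def sample_triplane(planes, xyz):
    """planes: dict with 'xy', 'xz', 'yz' feature maps, each of shape (1, C, H, W).
    xyz: (N, 3) query points in [-1, 1]^3. Returns (N, 3*C) concatenated plane features."""
    feats = []
    for name, dims in (("xy", [0, 1]), ("xz", [0, 2]), ("yz", [1, 2])):
        grid = xyz[:, dims].view(1, -1, 1, 2)                       # (1, N, 1, 2) sample locations
        f = F.grid_sample(planes[name], grid, mode="bilinear", align_corners=True)
        feats.append(f.squeeze(-1).squeeze(0).t())                  # (N, C) features from this plane
    return torch.cat(feats, dim=-1)
```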
https://arxiv.org/abs/2505.16535
3D Gaussian Splatting (3DGS) has emerged as a high-fidelity and efficient paradigm for online free-viewpoint video (FVV) reconstruction, offering viewers rapid responsiveness and immersive experiences. However, existing online methods face prohibitive storage requirements, primarily because their point-wise modeling fails to exploit motion properties. To address this limitation, we propose a novel Compact Gaussian Streaming (ComGS) framework that leverages the locality and consistency of motion in dynamic scenes and models object-consistent Gaussian point motion through a keypoint-driven motion representation. By transmitting only the keypoint attributes, this framework provides a more storage-efficient solution. Specifically, we first identify a sparse set of motion-sensitive keypoints localized within motion regions using a viewspace gradient difference strategy. Equipped with these keypoints, we propose an adaptive motion-driven mechanism that predicts a spatial influence field for propagating keypoint motion to neighboring Gaussian points with similar motion. Moreover, ComGS adopts an error-aware correction strategy for key frame reconstruction that selectively refines erroneous regions and mitigates error accumulation without unnecessary overhead. Overall, ComGS achieves a remarkable storage reduction of over 159× compared to 3DGStream and 14× compared to the SOTA method QUEEN, while maintaining competitive visual fidelity and rendering speed. Our code will be released.
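A simplified sketch of keypoint-driven motion propagation: each Gaussian center moves by a radial-basis-weighted average of nearby keypoint displacements. The Gaussian kernel and parameters below are illustrative assumptions, not ComGS's exact influence field.

```python
import numpy as np

def propagate_motion(gauss_xyz, key_xyz, key_delta, radius=0.1):
    """gauss_xyz: (G, 3) Gaussian centers; key_xyz: (K, 3) keypoints;
    key_delta: (K, 3) keypoint translations for the current frame.
    Each Gaussian moves by an RBF-weighted average of nearby keypoint motions."""
    d2 = ((gauss_xyz[:, None, :] - key_xyz[None, :, :]) ** 2).sum(-1)      # (G, K) squared distances
    w = np.exp(-d2 / (2 * radius ** 2))                                    # spatial influence field
    w_sum = w.sum(-1, keepdims=True)
    motion = (w[:, :, None] * key_delta[None, :, :]).sum(1) / np.maximum(w_sum, 1e-8)
    motion[w_sum[:, 0] < 1e-6] = 0.0      # Gaussians far from every keypoint stay static
    return gauss_xyz + motion
```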
https://arxiv.org/abs/2505.16533
We present a novel implicit neural shape optimization framework for 3D high-contrast Electrical Impedance Tomography (EIT), addressing scenarios where conductivity exhibits sharp discontinuities across material interfaces. These high-contrast cases, prevalent in metallic implant monitoring and industrial defect detection, challenge traditional reconstruction methods due to severe ill-posedness. Our approach synergizes shape optimization with implicit neural representations, introducing key innovations including a shape derivative-based optimization scheme that explicitly incorporates high-contrast interface conditions and an efficient latent space representation that reduces variable dimensionality. Through rigorous theoretical analysis of algorithm convergence and extensive numerical experiments, we demonstrate substantial performance improvements, establishing our framework as promising for practical applications in medical imaging with metallic implants and industrial non-destructive testing.
https://arxiv.org/abs/2505.16487
Panchromatic (PAN)-assisted Dual-Camera Compressive Hyperspectral Imaging (DCCHI) is a key technology in snapshot hyperspectral imaging. Existing research primarily focuses on exploring spectral information from 2D compressive measurements and spatial information from PAN images in an explicit manner, leading to a bottleneck in hyperspectral image (HSI) reconstruction. Various physical factors, such as temperature, emissivity, and multiple reflections between objects, play a critical role in the process of a sensor acquiring hyperspectral thermal signals. Inspired by this, we investigate the interrelationships between physical properties to provide deeper theoretical insights for HSI reconstruction. In this paper, we propose a Physics-Informed Cross-Modal State Space Model Network (PCMamba) for DCCHI, which incorporates the forward physical imaging process of HSI into the linear-complexity Mamba architecture to facilitate lightweight and high-quality HSI reconstruction. Specifically, we analyze the imaging process of hyperspectral thermal signals to enable the network to disentangle the three key physical properties: temperature, emissivity, and texture. By fully exploiting the potential information embedded in 2D measurements and PAN images, the HSIs are reconstructed through a physics-driven synthesis process. Furthermore, we design a Cross-Modal Scanning Mamba Block (CSMB) that introduces inter-modal pixel-wise interaction with positional inductive bias by cross-scanning the backbone features and PAN features. Extensive experiments conducted on both real and simulated datasets demonstrate that our method significantly outperforms SOTA methods in both quantitative and qualitative metrics.
https://arxiv.org/abs/2505.16373
In this paper, we propose to compress human body video with interactive semantics, which facilitates video coding that is interactive and controllable by manipulating semantic-level representations embedded in the coded bitstream. In particular, the proposed encoder employs a 3D human model to disentangle the nonlinear dynamics and complex motion of the human body signal into a series of configurable embeddings, which are controllably edited, compactly compressed, and efficiently transmitted. Moreover, the proposed decoder can evolve mesh-based motion fields from these decoded semantics to realize high-quality human body video reconstruction. Experimental results illustrate that the proposed framework can achieve promising compression performance for human body videos at ultra-low bitrate ranges compared with the state-of-the-art video coding standard Versatile Video Coding (VVC) and the latest generative compression schemes. Furthermore, the proposed framework enables interactive human body video coding without any additional pre-/post-manipulation processes, which is expected to shed light on metaverse-related digital human communication in the future.
https://arxiv.org/abs/2505.16152
Social media's rise establishes user-generated content (UGC) as pivotal for travel decisions, yet analytical methods lack scalability. This study introduces a dual-method LLM framework: unsupervised expectation extraction from UGC paired with survey-informed supervised fine-tuning. Findings reveal leisure/social expectations drive engagement more than foundational natural/emotional factors. By establishing LLMs as precision tools for expectation quantification, we advance tourism analytics methodology and propose targeted strategies for experience personalization and social travel promotion. The framework's adaptability extends to consumer behavior research, demonstrating computational social science's transformative potential in marketing optimization.
https://arxiv.org/abs/2505.16118
Pretrained latent diffusion models have shown strong potential for lossy image compression, owing to their powerful generative priors. Most existing diffusion-based methods reconstruct images by iteratively denoising from random noise, guided by compressed latent representations. While these approaches have achieved high reconstruction quality, their multi-step sampling process incurs substantial computational overhead. Moreover, they typically require training separate models for different compression bit-rates, leading to significant training and storage costs. To address these challenges, we propose a one-step diffusion codec across multiple bit-rates, termed OSCAR. Specifically, our method views compressed latents as noisy variants of the original latents, where the level of distortion depends on the bit-rate. This perspective allows them to be modeled as intermediate states along a diffusion trajectory. By establishing a mapping from the compression bit-rate to a pseudo diffusion timestep, we condition a single generative model to support reconstructions at multiple bit-rates. Meanwhile, we argue that the compressed latents retain rich structural information, thereby making one-step denoising feasible. Thus, OSCAR replaces iterative sampling with a single denoising pass, significantly improving inference efficiency. Extensive experiments demonstrate that OSCAR achieves superior performance in both quantitative and visual quality metrics. The code and models will be released at this https URL.
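The conditioning trick can be illustrated with a toy monotone mapping from bit-rate to a pseudo diffusion timestep: lower bit-rates correspond to noisier latents and therefore later timesteps. The log-linear interpolation below is an assumption; the paper establishes this correspondence for its own codec.

```python
import numpy as np

def bitrate_to_timestep(bpp, bpp_min=0.05, bpp_max=1.0, t_max=999):
    """Map bits-per-pixel to a pseudo diffusion timestep: lower bit-rates mean
    noisier latents, hence later (larger) timesteps."""
    bpp = np.clip(bpp, bpp_min, bpp_max)
    frac = (np.log(bpp_max) - np.log(bpp)) / (np.log(bpp_max) - np.log(bpp_min))
    return int(round(frac * t_max))

# one-step reconstruction would then be a single conditioned call, e.g.
#   x_hat = denoiser(z_compressed, t=bitrate_to_timestep(bpp))
```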
https://arxiv.org/abs/2505.16091
We present the new bidirectional variational autoencoder (BVAE) network architecture. The BVAE uses a single neural network both to encode and decode instead of an encoder-decoder network pair. The network encodes in the forward direction and decodes in the backward direction through the same synaptic web. Simulations compared BVAEs and ordinary VAEs on the four image tasks of image reconstruction, classification, interpolation, and generation. The image datasets included MNIST handwritten digits, Fashion-MNIST, CIFAR-10, and CelebA-64 face images. The bidirectional structure of BVAEs cut the parameter count by almost 50% and still slightly outperformed the unidirectional VAEs.
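The weight-sharing idea is easy to show in code: the decoder reuses the encoder's weight matrices transposed, roughly halving the parameter count. Below is a minimal deterministic tied-weight autoencoder sketch; the actual BVAE additionally includes the variational reparameterization and runs both directions through the same full network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedAutoencoder(nn.Module):
    """Encodes forward through W1, W2 and decodes backward through W2^T, W1^T."""
    def __init__(self, in_dim=784, hidden=256, latent=32):
        super().__init__()
        self.W1 = nn.Parameter(torch.randn(hidden, in_dim) * 0.01)
        self.W2 = nn.Parameter(torch.randn(latent, hidden) * 0.01)

    def encode(self, x):
        return F.linear(torch.tanh(F.linear(x, self.W1)), self.W2)

    def decode(self, z):
        h = torch.tanh(F.linear(z, self.W2.t()))      # backward pass through the same weights
        return torch.sigmoid(F.linear(h, self.W1.t()))

    def forward(self, x):
        return self.decode(self.encode(x))
```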
https://arxiv.org/abs/2505.16074
Sparse autoencoders (SAEs) are commonly used to interpret the internal activations of large language models (LLMs) by mapping them to human-interpretable concept representations. While existing evaluations of SAEs focus on metrics such as the reconstruction-sparsity tradeoff, human (auto-)interpretability, and feature disentanglement, they overlook a critical aspect: the robustness of concept representations to input perturbations. We argue that robustness must be a fundamental consideration for concept representations, reflecting the fidelity of concept labeling. To this end, we formulate robustness quantification as input-space optimization problems and develop a comprehensive evaluation framework featuring realistic scenarios in which adversarial perturbations are crafted to manipulate SAE representations. Empirically, we find that tiny adversarial input perturbations can effectively manipulate concept-based interpretations in most scenarios without notably affecting the outputs of the base LLMs themselves. Overall, our results suggest that SAE concept representations are fragile and may be ill-suited for applications in model monitoring and oversight.
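The robustness probe can be phrased as projected gradient descent in input (embedding) space: find a small perturbation that suppresses a chosen SAE feature while staying within an L-infinity budget. The sketch below uses hypothetical handles (llm_layer, sae.encode) and a simplified objective, not the authors' exact formulation.

```python
import torch

def attack_sae_feature(x, llm_layer, sae, feature_idx, eps=0.01, steps=50, lr=1e-3):
    """x: (B, D) input embeddings. Finds a small perturbation (L-inf budget eps) that
    suppresses the activation of one SAE feature via projected gradient descent."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        acts = sae.encode(llm_layer(x + delta))     # concept activations at the probed layer
        loss = acts[:, feature_idx].sum()           # drive the chosen concept toward zero
        loss.backward()
        with torch.no_grad():
            delta -= lr * delta.grad.sign()
            delta.clamp_(-eps, eps)                 # project back into the perturbation budget
        delta.grad.zero_()
    return (x + delta).detach()
```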
https://arxiv.org/abs/2505.16004
We consider the limits of super-resolution using imaging constraints. Due to various theoretical and practical limitations, reconstruction-based methods have been largely restricted to small increases in resolution. In addition, motion-blur is usually seen as a nuisance that impedes super-resolution. We show that by using high-precision motion information, sparse image priors, and convex optimization, it is possible to increase resolution by large factors. A key operation in super-resolution is deconvolution with a box. In general, convolution with a box is not invertible. However, we obtain perfect reconstructions of sparse signals using convex optimization. We also show that motion blur can be helpful for super-resolution. We demonstrate that using pseudo-random motion it is possible to reconstruct a high-resolution target using a single low-resolution image. We present numerical experiments with simulated data and results with real data captured by a camera mounted on a computer controlled stage.
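A simplified 1-D rendition of the recovery step using cvxpy: basis pursuit finds the sparsest high-resolution signal whose box-averaged, downsampled version matches the observation. The sizes and synthetic signal below are assumptions; the paper works with images and measured camera motion.

```python
import numpy as np
import cvxpy as cp

def box_operator(n, box_len, stride):
    """Rows average the high-resolution signal over a box and then downsample."""
    rows = []
    for start in range(0, n - box_len + 1, stride):
        r = np.zeros(n)
        r[start:start + box_len] = 1.0 / box_len
        rows.append(r)
    return np.array(rows)

rng = np.random.default_rng(0)
n, box_len, stride = 200, 8, 2
x_true = np.zeros(n)
x_true[rng.choice(n, size=10, replace=False)] = rng.standard_normal(10)
A = box_operator(n, box_len, stride)
y = A @ x_true                                   # low-resolution, box-blurred observation

# convolution with a box is not invertible, but an L1 prior (basis pursuit)
# often recovers the sparse high-resolution signal from overlapping measurements
x = cp.Variable(n)
cp.Problem(cp.Minimize(cp.norm1(x)), [A @ x == y]).solve()
print("max reconstruction error:", np.abs(x.value - x_true).max())
```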
https://arxiv.org/abs/2505.15961
Decoding visual experiences from fMRI offers a powerful avenue to understand human perception and develop advanced brain-computer interfaces. However, current progress often prioritizes maximizing reconstruction fidelity while overlooking interpretability, an essential aspect for deriving neuroscientific insight. To address this gap, we propose MoRE-Brain, a neuro-inspired framework designed for high-fidelity, adaptable, and interpretable visual reconstruction. MoRE-Brain uniquely employs a hierarchical Mixture-of-Experts architecture where distinct experts process fMRI signals from functionally related voxel groups, mimicking specialized brain networks. The experts are first trained to encode fMRI into the frozen CLIP space. A finetuned diffusion model then synthesizes images, guided by expert outputs through a novel dual-stage routing mechanism that dynamically weighs expert contributions across the diffusion process. MoRE-Brain offers three main advancements: First, it introduces a novel Mixture-of-Experts architecture grounded in brain network principles for neuro-decoding. Second, it achieves efficient cross-subject generalization by sharing core expert networks while adapting only subject-specific routers. Third, it provides enhanced mechanistic insight, as the explicit routing reveals precisely how different modeled brain regions shape the semantic and spatial attributes of the reconstructed image. Extensive experiments validate MoRE-Brain's high reconstruction fidelity, with bottleneck analyses further demonstrating its effective utilization of fMRI signals, distinguishing genuine neural decoding from over-reliance on generative priors. Consequently, MoRE-Brain marks a substantial advance towards more generalizable and interpretable fMRI-based visual decoding. Code will be publicly available soon: this https URL.
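Stripped to its essentials, the routing idea can be sketched as a mixture of per-voxel-group experts whose CLIP-space outputs are mixed by a learned router. The PyTorch sketch below uses assumed module names and omits the dual-stage routing inside the diffusion process.

```python
import torch
import torch.nn as nn

class VoxelGroupMoE(nn.Module):
    """Each expert encodes one functional voxel group into CLIP space; a router
    weighs the experts per sample and mixes their outputs."""
    def __init__(self, group_sizes, clip_dim=768):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(g, clip_dim) for g in group_sizes])
        self.router = nn.Linear(sum(group_sizes), len(group_sizes))

    def forward(self, voxel_groups):
        # voxel_groups: list of (B, g_i) fMRI responses, one tensor per brain network
        outs = torch.stack([e(v) for e, v in zip(self.experts, voxel_groups)], dim=1)   # (B, E, D)
        weights = torch.softmax(self.router(torch.cat(voxel_groups, dim=-1)), dim=-1)   # (B, E)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)    # (B, D) CLIP-space embedding
```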
https://arxiv.org/abs/2505.15946
Structure from Motion (SfM) refers to the problem of recovering both structure (i.e., 3D coordinates of points in the scene) and motion (i.e., camera matrices) starting from point correspondences in multiple images. It has attracted significant attention over the years, spanning practical reconstruction pipelines as well as theoretical results. This paper is conceived as a conceptual review of SfM methods, which are grouped into three main categories according to which part of the problem, motion or structure, they focus on. The proposed taxonomy brings a new perspective on existing SfM approaches as well as insights into open problems and possible future research directions. Particular emphasis is placed on identifying the theoretical conditions that make SfM well posed, which depend on the problem formulation being considered.
https://arxiv.org/abs/2505.15814
The intricacy of brain signals drives research that leverages multimodal AI to align brain modalities with visual and textual data for explainable descriptions. However, most existing studies are limited to coarse interpretations, lacking essential details on object descriptions, locations, attributes, and their relationships. This leads to imprecise and ambiguous reconstructions when using such cues for visual decoding. To address this, we analyze different choices of vision feature spaces from pre-trained visual components within Multimodal Large Language Models (MLLMs) and introduce a zero-shot multimodal brain decoding method that interacts with these models to decode across multiple levels of granularity. To assess a model's ability to decode fine details from brain signals, we propose the Multi-Granularity Brain Detail Understanding Benchmark (MG-BrainDub). This benchmark includes two key tasks: detailed descriptions and salient question-answering, with metrics highlighting key visual elements like objects, attributes, and relationships. Our approach enhances neural decoding precision and supports more accurate neuro-decoding applications. Code will be available at this https URL.
https://arxiv.org/abs/2505.15755