Autonomous driving demands safe motion planning, especially in critical "long-tail" scenarios. Recent end-to-end autonomous driving systems leverage large language models (LLMs) as planners to improve generalizability to rare events. However, using LLMs at test time introduces high computational costs. To address this, we propose DiMA, an end-to-end autonomous driving system that maintains the efficiency of an LLM-free (or vision-based) planner while leveraging the world knowledge of an LLM. DiMA distills the information from a multi-modal LLM to a vision-based end-to-end planner through a set of specially designed surrogate tasks. Under a joint training strategy, a scene encoder common to both networks produces structured representations that are semantically grounded as well as aligned to the final planning objective. Notably, the LLM is optional at inference, enabling robust planning without compromising on efficiency. Training with DiMA results in a 37% reduction in the L2 trajectory error and an 80% reduction in the collision rate of the vision-based planner, as well as a 44% trajectory error reduction in long-tail scenarios. DiMA also achieves state-of-the-art performance on the nuScenes planning benchmark.
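The distillation setup lends itself to a simple joint objective. Below is a minimal sketch, assuming a learned projection that maps planner scene tokens into the LLM feature space and a cosine alignment term added to the planning loss; the dimensions, loss form, and weighting are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistillationHead(nn.Module):
    """Projects planner scene tokens into the (frozen) LLM's feature space."""
    def __init__(self, planner_dim=256, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(planner_dim, llm_dim)

    def forward(self, planner_tokens):
        return self.proj(planner_tokens)

def distillation_loss(planner_tokens, llm_tokens, head):
    # Cosine alignment between projected planner features and LLM features
    # computed on the same scene; both inputs are (batch, tokens, dim).
    projected = head(planner_tokens)
    return 1.0 - F.cosine_similarity(projected, llm_tokens, dim=-1).mean()

# Joint training: total = planning_loss + lambda_distill * distillation_loss(...).
# At inference the LLM branch and this head are simply dropped, keeping the
# vision-based planner's runtime cost unchanged.
```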
https://arxiv.org/abs/2501.09757
We introduce SynthLight, a diffusion model for portrait relighting. Our approach frames image relighting as a re-rendering problem, where pixels are transformed in response to changes in environmental lighting conditions. Using a physically-based rendering engine, we synthesize a dataset to simulate this lighting-conditioned transformation with 3D head assets under varying lighting. We propose two training and inference strategies to bridge the gap between the synthetic and real image domains: (1) multi-task training that takes advantage of real human portraits without lighting labels; (2) an inference-time diffusion sampling procedure based on classifier-free guidance that leverages the input portrait to better preserve details. Our method generalizes to diverse real photographs and produces realistic illumination effects, including specular highlights and cast shadows, while preserving the subject's identity. Our quantitative experiments on Light Stage data demonstrate results comparable to state-of-the-art relighting methods. Our qualitative results on in-the-wild images showcase rich and unprecedented illumination effects. Project page: this https URL
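For intuition, here is a minimal sketch of what a two-condition classifier-free guidance step that leverages the input portrait could look like. The composition below, with separate image and lighting guidance weights, is one plausible form under stated assumptions; the paper's exact sampling procedure and weights may differ.

```python
def guided_eps(model, x_t, t, portrait, lighting, w_img=1.5, w_light=2.0):
    """One denoising step's noise prediction with image and lighting guidance.
    `model` is a stand-in for the diffusion network; passing None drops a
    condition (classifier-free training is assumed)."""
    eps_uncond = model(x_t, t, portrait=None, lighting=None)
    eps_img = model(x_t, t, portrait=portrait, lighting=None)   # preserves identity/details
    eps_full = model(x_t, t, portrait=portrait, lighting=lighting)
    # Guide first toward the input portrait, then toward the target lighting.
    return eps_uncond + w_img * (eps_img - eps_uncond) + w_light * (eps_full - eps_img)
```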
https://arxiv.org/abs/2501.09756
Visual tokenization via auto-encoding empowers state-of-the-art image and video generative models by compressing pixels into a latent space. Although scaling Transformer-based generators has been central to recent advances, the tokenizer component itself is rarely scaled, leaving open questions about how auto-encoder design choices influence both its objective of reconstruction and downstream generative performance. Our work explores scaling in auto-encoders to fill this gap. To facilitate this exploration, we replace the typical convolutional backbone with an enhanced Vision Transformer architecture for Tokenization (ViTok). We train ViTok on large-scale image and video datasets far exceeding ImageNet-1K, removing data constraints on tokenizer scaling. We first study how scaling the auto-encoder bottleneck affects both reconstruction and generation -- and find that while it is highly correlated with reconstruction, its relationship with generation is more complex. We next explore the effect of separately scaling the auto-encoder's encoder and decoder on reconstruction and generation performance. Crucially, we find that scaling the encoder yields minimal gains for either reconstruction or generation, while scaling the decoder boosts reconstruction but the benefits for generation are mixed. Building on our exploration, we design ViTok as a lightweight auto-encoder that achieves competitive performance with state-of-the-art auto-encoders on ImageNet-1K and COCO reconstruction tasks (256p and 512p) while outperforming existing auto-encoders on 16-frame 128p video reconstruction for UCF-101, all with 2-5x fewer FLOPs. When integrated with Diffusion Transformers, ViTok demonstrates competitive performance on image generation for ImageNet-1K and sets new state-of-the-art benchmarks for class-conditional video generation on UCF-101.
https://arxiv.org/abs/2501.09755
Our objective is to translate continuous sign language into spoken language text. Inspired by the way human interpreters rely on context for accurate translation, we incorporate additional contextual cues together with the signing video, into a new translation framework. Specifically, besides visual sign recognition features that encode the input video, we integrate complementary textual information from (i) captions describing the background show, (ii) translation of previous sentences, as well as (iii) pseudo-glosses transcribing the signing. These cues are automatically extracted and provided, along with the visual features, to a pre-trained large language model (LLM), which we fine-tune to generate spoken language translations in text form. Through extensive ablation studies, we show the positive contribution of each input cue to the translation performance. We train and evaluate our approach on BOBSL -- the largest British Sign Language dataset currently available. We show that our contextual approach significantly enhances the quality of the translations compared to previously reported results on BOBSL, and also to state-of-the-art methods that we implement as baselines. Furthermore, we demonstrate the generality of our approach by applying it also to How2Sign, an American Sign Language dataset, and achieve competitive results.
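As a rough sketch, the contextual cues could be assembled into a single multimodal input as below; the field names and prompt template are illustrative assumptions, since the abstract does not specify the exact format.

```python
def build_translation_input(visual_features, background_caption,
                            prev_translation, pseudo_glosses):
    """Bundle the three textual cues with the visual sign features for the LLM."""
    context = (
        f"Background: {background_caption}\n"
        f"Previous sentence: {prev_translation}\n"
        f"Pseudo-glosses: {' '.join(pseudo_glosses)}\n"
        "Translate the signing into English:"
    )
    return {"text": context, "visual": visual_features}
```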
https://arxiv.org/abs/2501.09754
Convolutional neural networks (CNNs) are essential tools for computer vision tasks, but they lack traditionally desired properties of extracted features that could further improve model performance, e.g., rotational equivariance. Such properties are ubiquitous in biomedical images, which often lack explicit orientation. While current work largely relies on data augmentation or explicit modules to capture orientation information, this comes at the expense of increased training costs or ineffective approximations of the desired equivariance. To overcome these challenges, we propose a novel and efficient implementation of the Symmetric Rotation-Equivariant (SRE) Convolution (SRE-Conv) kernel, designed to learn rotation-invariant features while simultaneously compressing the model size. The SRE-Conv kernel can easily be incorporated into any CNN backbone. We validate the ability of a deep SRE-CNN to capture equivariance to rotation using the public MedMNISTv2 dataset (16 total tasks). SRE-Conv-CNN improved classification accuracy on rotated images across all 16 test datasets, in both 2D and 3D, while increasing efficiency with fewer parameters and a reduced memory footprint. The code is available at this https URL.
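To make the idea concrete, here is a minimal sketch of a radially tied convolution kernel: all cells at (approximately) the same distance from the kernel center share one learned weight, which both enforces symmetry under rotation and shrinks the parameter count. This is an illustrative construction, not necessarily the paper's exact parameterization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SymmetricRadialConv2d(nn.Module):
    """Conv layer whose kernel weights depend only on distance from the center,
    making it exactly invariant to 90-degree rotations and approximately
    invariant to arbitrary ones."""
    def __init__(self, in_ch, out_ch, kernel_size=5):
        super().__init__()
        c = (kernel_size - 1) / 2
        ys, xs = torch.meshgrid(
            torch.arange(kernel_size, dtype=torch.float32),
            torch.arange(kernel_size, dtype=torch.float32),
            indexing="ij",
        )
        radius = ((ys - c) ** 2 + (xs - c) ** 2).sqrt().round()
        rings, ring_index = torch.unique(radius, return_inverse=True)
        self.register_buffer("ring_index", ring_index)       # (k, k) ring ids
        # One parameter per (out, in, ring) instead of per (out, in, y, x).
        self.weight = nn.Parameter(0.1 * torch.randn(out_ch, in_ch, len(rings)))
        self.padding = kernel_size // 2

    def forward(self, x):
        kernel = self.weight[:, :, self.ring_index]          # (out, in, k, k)
        return F.conv2d(x, kernel, padding=self.padding)
```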
https://arxiv.org/abs/2501.09753
Machine writing with large language models often relies on retrieval-augmented generation. However, these approaches remain confined within the boundaries of the model's predefined scope, limiting the generation of content with rich information. Specifically, vanilla-retrieved information tends to lack depth and utility, and suffers from redundancy, which negatively impacts the quality of generated articles, leading to shallow, repetitive, and unoriginal outputs. To address these issues, we propose OmniThink, a machine writing framework that emulates the human-like process of iterative expansion and reflection. The core idea behind OmniThink is to simulate the cognitive behavior of learners as they progressively deepen their knowledge of the topics. Experimental results demonstrate that OmniThink improves the knowledge density of generated articles without compromising metrics such as coherence and depth. Human evaluations and expert feedback further highlight the potential of OmniThink to address real-world challenges in the generation of long-form articles.
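The expand-and-reflect loop can be pictured with a few lines of pseudocode-style Python; `llm` and `retrieve` are stand-ins for any text-generation and retrieval calls (both returning strings here), and the prompts and stopping rule below are assumptions rather than OmniThink's actual implementation.

```python
def write_article(topic, llm, retrieve, max_rounds=3):
    knowledge = retrieve(topic)  # initial, typically shallow, retrieval
    outline = llm(f"Draft an outline for an article on {topic}:\n{knowledge}")
    for _ in range(max_rounds):
        # Expansion: find under-developed points and gather deeper material.
        gaps = llm(f"List sections of this outline that lack depth:\n{outline}")
        knowledge += retrieve(gaps)
        # Reflection: revise the outline in light of the new material.
        outline = llm(f"Revise the outline using this information:\n{knowledge}\n{outline}")
    return llm(f"Write the article following this outline:\n{outline}\n{knowledge}")
```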
https://arxiv.org/abs/2501.09751
Recent large language models (LLMs) have demonstrated exceptional performance on general-purpose text embedding tasks. While dense embeddings have dominated related research, we introduce the first Lexicon-based EmbeddiNgS (LENS) leveraging LLMs that achieve competitive performance on these tasks. To address the inherent tokenization redundancy and the unidirectional attention limitations of traditional causal LLMs, LENS consolidates the vocabulary space through token embedding clustering, and investigates bidirectional attention and various pooling strategies. Specifically, LENS simplifies lexicon matching by assigning each dimension to a specific token cluster, where semantically similar tokens are grouped together, and unlocks the full potential of LLMs through bidirectional attention. Extensive experiments demonstrate that LENS outperforms dense embeddings on the Massive Text Embedding Benchmark (MTEB), delivering compact feature representations that match the sizes of dense counterparts. Notably, combining LENS with dense embeddings achieves state-of-the-art performance on the retrieval subset of MTEB (i.e. BEIR).
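A rough sketch of the clustering idea follows: cluster the vocabulary's token embeddings so that each lexicon dimension stands for a group of semantically similar tokens, then pool per-token scores into per-cluster weights. K-means and max-pooling here are assumptions chosen for illustration; the paper investigates several pooling strategies.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_cluster_map(token_embeddings, n_clusters=4000):
    """token_embeddings: (vocab_size, dim) array of the LLM's token embeddings."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(token_embeddings)
    return km.labels_  # token id -> cluster id (one embedding dimension each)

def lexicon_embedding(token_logits, labels, n_clusters=4000):
    """Max-pool per-token scores into one weight per cluster of similar tokens."""
    emb = np.full(n_clusters, -np.inf)
    np.maximum.at(emb, labels, token_logits)
    return emb
```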
https://arxiv.org/abs/2501.09749
Autoregressive sequence models, such as Transformer-based vision-language action (VLA) policies, can be tremendously effective for capturing complex and generalizable robotic behaviors. However, such models require us to choose a tokenization of our continuous action signals, which determines how the discrete symbols predicted by the model map to continuous robot actions. We find that current approaches for robot action tokenization, based on simple per-dimension, per-timestep binning schemes, typically perform poorly when learning dexterous skills from high-frequency robot data. To address this challenge, we propose a new compression-based tokenization scheme for robot actions, based on the discrete cosine transform. Our tokenization approach, Frequency-space Action Sequence Tokenization (FAST), enables us to train autoregressive VLAs for highly dexterous and high-frequency tasks where standard discretization methods fail completely. Based on FAST, we release FAST+, a universal robot action tokenizer, trained on 1M real robot action trajectories. It can be used as a black-box tokenizer for a wide range of robot action sequences, with diverse action spaces and control frequencies. Finally, we show that, when combined with the pi0 VLA, our method can scale to training on 10k hours of robot data and match the performance of diffusion VLAs, while reducing training time by up to 5x.
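The core of the tokenization can be illustrated in a few lines: apply the discrete cosine transform along the time axis so that a smooth, densely sampled action chunk concentrates its energy into a few significant coefficients, then quantize. The scale factor and rounding below are illustrative assumptions, and the released tokenizer additionally compresses the integer tokens with byte-pair encoding.

```python
import numpy as np
from scipy.fft import dct, idct

def encode_actions(actions, scale=10.0):
    """actions: (T, D) chunk of continuous robot actions, one column per DoF."""
    coeffs = dct(actions, axis=0, norm="ortho")       # energy concentrates in low frequencies
    return np.round(coeffs * scale).astype(np.int32)  # quantized integer tokens

def decode_actions(tokens, scale=10.0):
    return idct(tokens.astype(np.float64) / scale, axis=0, norm="ortho")

# Round-trip check on a smooth synthetic trajectory (50 steps, 7 DoF).
chunk = np.random.randn(50, 7).cumsum(axis=0) * 0.01
recon = decode_actions(encode_actions(chunk))
print(np.abs(chunk - recon).max())  # small, quantization-bounded error
```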
https://arxiv.org/abs/2501.09747
Machine learning developers frequently use interactive computational notebooks, such as Jupyter notebooks, to host code for data processing and model training. Jupyter notebooks provide a convenient tool for writing machine learning pipelines and interactively observing outputs, however, maintaining Jupyter notebooks, e.g., to add new features or fix bugs, can be challenging due to the length and complexity of the notebooks. Moreover, there is no existing benchmark related to developer edits on Jupyter notebooks. To address this, we present the first dataset of 48,398 Jupyter notebook edits derived from 20,095 revisions of 792 machine learning repositories on GitHub, and perform the first study of using LLMs to predict code edits in Jupyter notebooks. Our dataset captures granular details of cell-level and line-level modifications, offering a foundation for understanding real-world maintenance patterns in machine learning workflows. We observed that the edits on Jupyter notebooks are highly localized, with changes averaging only 166 lines of code in repositories. While larger models outperform smaller counterparts in code editing, all models have low accuracy on our dataset even after finetuning, demonstrating the complexity of real-world machine learning maintenance tasks. Our findings emphasize the critical role of contextual information in improving model performance and point toward promising avenues for advancing large language models' capabilities in engineering machine learning code.
https://arxiv.org/abs/2501.09745
The objective of BioCreative8 Track 3 is to extract phenotypic key medical findings embedded within EHR texts and subsequently normalize these findings to their Human Phenotype Ontology (HPO) terms. However, the presence of diverse surface forms in phenotypic findings makes it challenging to accurately normalize them to the correct HPO terms. To address this challenge, we explored various models for named entity recognition and implemented data augmentation techniques such as synonym marginalization to enhance the normalization step. Our pipeline resulted in an exact extraction and normalization F1 score 2.6% higher than the mean score of all submissions received in response to the challenge. Furthermore, in terms of the normalization F1 score, our approach surpassed the average performance by 1.9%. These findings contribute to the advancement of automated medical data extraction and normalization techniques, showcasing potential pathways for future research and application in the biomedical domain.
https://arxiv.org/abs/2501.09744
Existing video anomaly detection datasets are inadequate for representing complex anomalies that occur due to the interactions between objects. The absence of complex anomalies in previous video anomaly detection datasets affects research by shifting the focus onto simple anomalies. To address this problem, we introduce a new large-scale dataset: ComplexVAD. In addition, we propose a novel method to detect complex anomalies via modeling the interactions between objects using a scene graph with spatio-temporal attributes. With our proposed method and two other state-of-the-art video anomaly detection methods, we obtain baseline scores on ComplexVAD and demonstrate that our new method outperforms existing works.
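As a concrete illustration, a scene graph with spatio-temporal attributes might be organized as below; the field names are hypothetical, since the abstract does not spell out the paper's exact graph definition.

```python
from dataclasses import dataclass, field

@dataclass
class ObjectNode:
    track_id: int
    category: str
    bbox: tuple      # (x1, y1, x2, y2) in the current frame (spatial attribute)
    velocity: tuple  # (dx, dy) per frame (temporal attribute)

@dataclass
class InteractionEdge:
    src: int         # track_id of one object
    dst: int         # track_id of the other
    distance: float  # spatial relation between the pair
    duration: int    # frames the interaction has persisted

@dataclass
class SceneGraph:
    frame_index: int
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)
```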
https://arxiv.org/abs/2501.09733
Generative models have made significant impacts across various domains, largely due to their ability to scale during training by increasing data, computational resources, and model size, a phenomenon characterized by the scaling laws. Recent research has begun to explore inference-time scaling behavior in Large Language Models (LLMs), revealing how performance can further improve with additional computation during inference. Unlike LLMs, diffusion models inherently possess the flexibility to adjust inference-time computation via the number of denoising steps, although the performance gains typically flatten after a few dozen. In this work, we explore the inference-time scaling behavior of diffusion models beyond increasing denoising steps and investigate how the generation performance can further improve with increased computation. Specifically, we consider a search problem aimed at identifying better noises for the diffusion sampling process. We structure the design space along two axes: the verifiers used to provide feedback, and the algorithms used to find better noise candidates. Through extensive experiments on class-conditioned and text-conditioned image generation benchmarks, our findings reveal that increasing inference-time compute leads to substantial improvements in the quality of samples generated by diffusion models, and, given the complex nature of images, combinations of the components in the framework can be chosen to suit different application scenarios.
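The simplest instantiation of this search is best-of-N over initial noises, sketched below; `sampler` and `verifier` are stand-ins for any denoising loop and any feedback model, and the paper explores richer search algorithms than this.

```python
import torch

def best_of_n_noise(sampler, verifier, prompt, shape, n_candidates=8):
    """Spend extra inference compute searching over initial noises."""
    best_img, best_score = None, float("-inf")
    for _ in range(n_candidates):
        noise = torch.randn(shape)      # candidate starting noise
        img = sampler(noise, prompt)    # full denoising run, fixed step count
        score = verifier(img, prompt)   # feedback signal guiding the search
        if score > best_score:
            best_img, best_score = img, score
    return best_img
```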
https://arxiv.org/abs/2501.09732
This article analyzes the use of two parallel multi-objective soft computing algorithms to automatically search for high-quality settings of the Ad hoc On-Demand Distance Vector (AODV) routing protocol for vehicular networks. These methods are based on an evolutionary algorithm and on a swarm intelligence approach. The experimental analysis demonstrates that the configurations computed by our optimization algorithms outperform other state-of-the-art optimized ones. In turn, the computational efficiency achieved by all the parallel versions is greater than 87%. Therefore, the line of work presented in this article represents an efficient framework to improve vehicular communications.
https://arxiv.org/abs/2501.09725
With the increased use of the internet and social networks for online discussions, the spread of toxic and inappropriate content on social networking sites has also increased. Several studies have been conducted in different languages. However, less work has been done on identifying inappropriate content in South Asian languages using deep learning techniques. Urdu spellings are not standardized: people write the same word with several common spellings and mix in other languages, such as English, which makes the text more challenging to process, and limited research is available on applying strong algorithms to such a language. Adding an attention layer to a deep learning model can help handle long-term dependencies and increase its efficiency. To explore the effects of the attention layer, this study proposes an attention-based bidirectional GRU hybrid model for identifying inappropriate content in Urdu Unicode text. Four baseline deep learning models, LSTM, Bi-LSTM, GRU, and TCN, are used to compare the performance of the proposed model. The results of these models were compared based on evaluation metrics, dataset size, and the impact of the word embedding layer. Pre-trained Urdu word2Vec embeddings were utilized in our case. Our proposed model, BiGRU-A, outperformed all other baseline models, yielding 84% accuracy without using the pre-trained word2Vec layer. From our experiments, we established that the attention layer improves the model's efficiency, and that pre-trained word2Vec embeddings do not work well with an inappropriate-content dataset.
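A minimal PyTorch sketch of such an attention-based BiGRU classifier follows; the dimensions and the simple additive attention pooling are assumptions, since the abstract does not give the model's hyperparameters.

```python
import torch
import torch.nn as nn

class BiGRUAttention(nn.Module):
    """BiGRU encoder with attention pooling for binary content classification."""
    def __init__(self, vocab_size, embed_dim=128, hidden=64, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, token_ids):
        h, _ = self.gru(self.embed(token_ids))        # (batch, time, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)  # attention over timesteps
        context = (weights * h).sum(dim=1)            # captures long-range cues
        return self.out(context)
```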
https://arxiv.org/abs/2501.09722
The multimodal language models (MLMs) based on generative pre-trained Transformers are considered powerful candidates for unifying various domains and tasks. MLMs developed for remote sensing (RS) have demonstrated outstanding performance in multiple tasks, such as visual question answering and visual grounding. In addition to visual grounding, which detects specific objects corresponding to a given instruction, aerial detection, which detects all objects of multiple categories, is also a valuable and challenging task for RS foundation models. However, aerial detection has not been explored by existing RS MLMs because the autoregressive prediction mechanism of MLMs differs significantly from the detection outputs. In this paper, we present a simple baseline for applying MLMs to aerial detection for the first time, named LMMRotate. Specifically, we first introduce a normalization method to transform detection outputs into textual outputs to be compatible with the MLM framework. Then, we propose an evaluation method, which ensures a fair comparison between MLMs and conventional object detection models. We construct the baseline by fine-tuning open-source general-purpose MLMs and achieve impressive detection performance comparable to conventional detectors. We hope that this baseline will serve as a reference for future MLM development, enabling more comprehensive capabilities for understanding RS images. Code is available at this https URL.
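To illustrate the normalization step, here is a hypothetical serialization of rotated-box detections into text that an autoregressive model can emit; the coordinate binning and the output template are assumptions, not LMMRotate's published format.

```python
def boxes_to_text(detections, bins=1000):
    """Serialize rotated boxes as 'label cx cy w h angle' with binned coordinates."""
    lines = []
    for det in detections:
        w, h = det["image_size"]
        coords = [
            int(det["cx"] / w * bins), int(det["cy"] / h * bins),
            int(det["bw"] / w * bins), int(det["bh"] / h * bins),
            int(det["angle_deg"]),
        ]
        lines.append(det["label"] + " " + " ".join(str(c) for c in coords))
    return "\n".join(lines)

print(boxes_to_text([{"label": "plane", "image_size": (1024, 1024),
                      "cx": 312.0, "cy": 455.5, "bw": 60.2, "bh": 24.8,
                      "angle_deg": 37}]))
# -> plane 304 444 58 24 37
```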
https://arxiv.org/abs/2501.09720
This study conducts a systematic assessment of the capabilities of 12 machine learning models and model variations in detecting economic ideology. As an evaluation benchmark, I use manifesto data spanning six elections in the United Kingdom and pre-annotated by expert and crowd coders. The analysis assesses the performance of several generative, fine-tuned, and zero-shot models at the granular and aggregate levels. The results show that generative models such as GPT-4o and Gemini 1.5 Flash consistently outperform other models against all benchmarks. However, they pose issues of accessibility and resource availability. Fine-tuning yielded competitive performance and offers a reliable alternative through domain-specific optimization. But its dependency on training data severely limits scalability. Zero-shot models consistently face difficulties with identifying signals of economic ideology, often resulting in negative associations with human coding. Using general knowledge for the domain-specific task of ideology scaling proved to be unreliable. Other key findings include considerable within-party variation, fine-tuning benefiting from larger training data, and zero-shot's sensitivity to prompt content. The assessments include the strengths and limitations of each model and derive best practices for automated analyses of political content.
https://arxiv.org/abs/2501.09719
Low-Light Image Enhancement (LLIE) is a key task in computational photography and imaging. The problem of enhancing images captured during night or in dark environments has been well-studied in the image signal processing literature. However, current deep learning-based solutions struggle with efficiency and robustness in real-world scenarios (e.g. scenes with noise, saturated pixels, bad illumination). We propose a lightweight neural network that combines image processing in the frequency and spatial domains. Our method, FLOL+, is one of the fastest models for this task, achieving state-of-the-art results on popular real scenes datasets such as LOL and LSRW. Moreover, we are able to process 1080p images under 12ms. Code and models at this https URL
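For intuition on mixing frequency- and spatial-domain processing, a sketch of a frequency branch is given below: it modifies the amplitude spectrum, where global illumination largely resides, while keeping the phase intact. This is an illustrative assumption; the abstract does not describe FLOL+'s actual blocks.

```python
import torch
import torch.nn as nn

class FrequencyBranch(nn.Module):
    """Enhance the amplitude spectrum with a tiny conv stack, preserving phase."""
    def __init__(self, channels=3):
        super().__init__()
        self.amp_conv = nn.Sequential(
            nn.Conv2d(channels, channels, 1), nn.ReLU(),
            nn.Conv2d(channels, channels, 1), nn.ReLU(),  # keep magnitudes >= 0
        )

    def forward(self, x):
        freq = torch.fft.rfft2(x, norm="ortho")
        amp, phase = freq.abs(), freq.angle()
        amp = self.amp_conv(amp)              # global illumination lives in amplitude
        freq = torch.polar(amp, phase)        # recombine with the original phase
        return torch.fft.irfft2(freq, s=x.shape[-2:], norm="ortho")
```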
https://arxiv.org/abs/2501.09718
Many non-traditional students in cybersecurity programs often lack access to advice from peers, family members and professors, which can hinder their educational experiences. Additionally, these students may not fully benefit from various LLM-powered AI assistants due to issues like content relevance, locality of advice, minimum expertise, and timing. This paper addresses these challenges by introducing an application designed to provide comprehensive support by answering questions related to knowledge, skills, and career preparation advice tailored to the needs of these students. We developed a learning tool platform, CyberMentor, to address the diverse needs and pain points of students majoring in cybersecurity. Powered by agentic workflow and Generative Large Language Models (LLMs), the platform leverages Retrieval-Augmented Generation (RAG) for accurate and contextually relevant information retrieval to achieve accessibility and personalization. We demonstrated its value in addressing knowledge requirements for cybersecurity education and for career marketability, in tackling skill requirements for analytical and programming assignments, and in delivering real-time, on-demand learning support. Through three usage scenarios, we showcased CyberMentor in facilitating knowledge acquisition and career preparation and providing seamless skill-based guidance and support. We also employed the LangChain prompt-based evaluation methodology to evaluate the platform's impact, confirming its strong performance in helpfulness, correctness, and completeness. These results underscore the system's ability to support students in developing practical cybersecurity skills while improving equity and sustainability within higher education. Furthermore, CyberMentor's open-source design allows for adaptation across other disciplines, fostering educational innovation and broadening its potential impact.
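At its core, the retrieval-augmented answering flow can be sketched as below; `embed`, `index`, and `llm` are stand-ins for the embedding model, vector store, and generator, and CyberMentor's actual agentic workflow is more elaborate.

```python
def answer_question(question, index, embed, llm, k=4):
    """Ground the LLM's answer in the k most relevant curated snippets."""
    docs = index.search(embed(question), k=k)
    context = "\n\n".join(doc.text for doc in docs)
    prompt = (
        "Use the context to advise a cybersecurity student.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)
```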
https://arxiv.org/abs/2501.09709
Values or principles are key elements of human society that influence people to behave and function according to an accepted standard set of social rules to maintain social order. As AI systems are becoming ubiquitous in human society, it is a major concern that they could violate these norms or values and potentially cause harm. Thus, to prevent intentional or unintentional harm, AI systems are expected to take actions that align with these principles. Training systems to exhibit this type of behavior is difficult and often requires a specialized dataset. This work presents a multi-modal dataset illustrating normative and non-normative behavior in real-life situations described through natural language and artistic images. The training set contains curated sets of images that are designed to teach young children about social principles. Given this, we argue that it is an ideal dataset for training socially normative agents.
https://arxiv.org/abs/2501.09707
We present the e-Llama models: 8 billion and 70 billion parameter large language models that are adapted towards the e-commerce domain. These models are meant as foundation models with deep knowledge about e-commerce, forming a base for instruction tuning and fine-tuning. The e-Llama models are obtained by continuously pretraining the Llama 3.1 base models on 1 trillion tokens of domain-specific data. We discuss our approach and motivate our choice of hyperparameters with a series of ablation studies. To quantify how well the models have been adapted to the e-commerce domain, we define and implement a set of multilingual, e-commerce specific evaluation tasks. We show that, when carefully choosing the training setup, the Llama 3.1 models can be adapted towards the new domain without sacrificing significant performance on general domain tasks. We also explore the possibility of merging the adapted model and the base model for a better control of the performance trade-off between domains.
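A short sketch of the merging idea follows: linearly interpolate the base and domain-adapted weights to trade off general-domain against e-commerce performance. Whether e-Llama uses plain linear interpolation is not stated in the abstract, so `alpha` is an assumed knob and identical architectures are assumed.

```python
import torch

def merge_state_dicts(base_sd, adapted_sd, alpha=0.5):
    """alpha=0 keeps the base model; alpha=1 keeps the adapted model."""
    return {
        name: (1 - alpha) * base_sd[name] + alpha * adapted_sd[name]
        for name in base_sd
    }

# Usage (hypothetical): bias the merge toward general-domain ability.
# merged = merge_state_dicts(base.state_dict(), e_llama.state_dict(), alpha=0.3)
# base.load_state_dict(merged)
```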
https://arxiv.org/abs/2501.09706