This paper makes three key contributions. First, using a substantial corpus of 51,278 interview questions sourced from 888 YouTube videos of mock interviews of Indian civil service candidates, we demonstrate stark gender bias in the broad nature of questions asked of male and female candidates. Second, our experiments with large language models show a strong presence of gender bias in the explanations provided by the LLMs on the gender inference task. Finally, we present a novel dataset of 51,278 interview questions that can inform future social science studies.
https://arxiv.org/abs/2409.12194
We embark on the age-old quest: unveiling the hidden dimensions of objects from mere glimpses of their visible parts. To address this, we present Vista3D, a framework that realizes swift and consistent 3D generation within a mere 5 minutes. At the heart of Vista3D lies a two-phase approach: the coarse phase and the fine phase. In the coarse phase, we rapidly generate initial geometry with Gaussian Splatting from a single image. In the fine phase, we extract a Signed Distance Function (SDF) directly from the learned Gaussian Splatting and optimize it with a differentiable isosurface representation. Furthermore, Vista3D elevates the quality of generation by using a disentangled representation with two independent implicit functions to capture both the visible and obscured aspects of objects. Additionally, it harmonizes gradients from a 2D diffusion prior with those from 3D-aware diffusion priors through angular diffusion prior composition. Through extensive evaluation, we demonstrate that Vista3D effectively sustains a balance between the consistency and diversity of the generated 3D objects. Demos and code will be available at this https URL.
https://arxiv.org/abs/2409.12193
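The Vista3D abstract above does not spell out how the angular composition of priors works; one plausible reading is that the score-distillation gradients from the 2D prior and the 3D-aware priors are blended with a weight that depends on the azimuth of the rendered view. The sketch below illustrates only that reading; the blending schedule and function names are assumptions, not the paper's method.

```python
import numpy as np

def compose_prior_gradients(grad_2d, grad_3d, azimuth_rad, sharpness=2.0):
    """Blend gradients from a 2D diffusion prior and a 3D-aware prior.

    Views near the reference (front) azimuth lean on the 2D prior; views far
    from it lean on the 3D-aware prior. `sharpness` is a hypothetical knob
    controlling how quickly the weight shifts with angle.
    """
    w = 0.5 * (1.0 + np.cos(azimuth_rad)) ** sharpness   # weight in [0, 1]
    return w * grad_2d + (1.0 - w) * grad_3d

# toy usage: gradients w.r.t. a rendered image of shape (H, W, 3)
g2d = np.random.randn(64, 64, 3)
g3d = np.random.randn(64, 64, 3)
print(compose_prior_gradients(g2d, g3d, azimuth_rad=np.pi / 3).shape)
```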
Imitation learning has proven to be a powerful tool for training complex visuomotor policies. However, current methods often require hundreds to thousands of expert demonstrations to handle high-dimensional visual observations. A key reason for this poor data efficiency is that visual representations are predominantly either pretrained on out-of-domain data or trained directly through a behavior cloning objective. In this work, we present DynaMo, a new in-domain, self-supervised method for learning visual representations. Given a set of expert demonstrations, we jointly learn a latent inverse dynamics model and a forward dynamics model over a sequence of image embeddings, predicting the next frame in latent space, without augmentations, contrastive sampling, or access to ground truth actions. Importantly, DynaMo does not require any out-of-domain data such as Internet datasets or cross-embodied datasets. On a suite of six simulated and real environments, we show that representations learned with DynaMo significantly improve downstream imitation learning performance over prior self-supervised learning objectives and pretrained representations. Gains from using DynaMo hold across policy classes such as Behavior Transformer, Diffusion Policy, MLP, and nearest neighbors. Finally, we ablate over key components of DynaMo and measure their impact on downstream policy performance. Robot videos are best viewed at this https URL
https://arxiv.org/abs/2409.12192
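To make the DynaMo objective above concrete, here is a minimal PyTorch sketch of the kind of training step it describes: an encoder maps frames to embeddings, an inverse dynamics model infers a latent "action" from consecutive embeddings, a forward model predicts the next embedding from the current embedding and that latent action, and the loss is the prediction error in latent space, with no ground-truth actions. The network sizes and the stop-gradient on the target are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):          # image -> embedding
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))
    def forward(self, x):
        return self.net(x)

class InverseDynamics(nn.Module):  # (z_t, z_{t+1}) -> latent action
    def __init__(self, dim=128, act_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, 256), nn.ReLU(), nn.Linear(256, act_dim))
    def forward(self, z_t, z_next):
        return self.net(torch.cat([z_t, z_next], dim=-1))

class ForwardDynamics(nn.Module):  # (z_t, latent action) -> predicted z_{t+1}
    def __init__(self, dim=128, act_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + act_dim, 256), nn.ReLU(), nn.Linear(256, dim))
    def forward(self, z_t, a):
        return self.net(torch.cat([z_t, a], dim=-1))

enc, inv, fwd = Encoder(), InverseDynamics(), ForwardDynamics()
opt = torch.optim.Adam(list(enc.parameters()) + list(inv.parameters()) + list(fwd.parameters()), lr=1e-4)

frames = torch.randn(8, 2, 3, 64, 64)        # (batch, time, C, H, W) demonstration frames
z_t, z_next = enc(frames[:, 0]), enc(frames[:, 1])
a_latent = inv(z_t, z_next)                  # no ground-truth actions needed
z_pred = fwd(z_t, a_latent)
# next-frame prediction in latent space; stop-gradient on the target is an
# illustrative choice to discourage representation collapse
loss = nn.functional.mse_loss(z_pred, z_next.detach())
loss.backward(); opt.step()
print(float(loss))
```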
We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size (with versions at 2B, 8B, and 72B parameters) and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at \url{this https URL}.
https://arxiv.org/abs/2409.12191
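The abstract above does not give the exact rule that maps image resolution to token count; the sketch below only illustrates the general idea behind dynamic resolution, assuming a ViT-style patch size and a 2x2 merge of adjacent patch tokens. The patch size, merge factor, and token cap are illustrative assumptions, not the model's documented configuration.

```python
def visual_token_count(height, width, patch=14, merge=2, max_tokens=4096):
    """Map an image of arbitrary resolution to a number of visual tokens.

    The image is snapped to the nearest multiple of the patch size, patch
    tokens are counted, and adjacent tokens are merged in merge x merge
    groups. If the result exceeds max_tokens, the caller would downscale the
    image first (not shown). All constants are illustrative.
    """
    h = max(patch, round(height / patch) * patch)
    w = max(patch, round(width / patch) * patch)
    tokens = (h // patch) * (w // patch) // (merge * merge)
    return min(tokens, max_tokens)

for size in [(224, 224), (448, 672), (1080, 1920)]:
    print(size, "->", visual_token_count(*size), "visual tokens")
```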
Bundle adjustment (BA) is a critical technique in various robotic applications, such as simultaneous localization and mapping (SLAM), augmented reality (AR), and photogrammetry. BA optimizes parameters such as camera poses and 3D landmarks to align them with observations. With the growing importance of deep learning in perception systems, there is an increasing need to integrate BA with deep learning frameworks for enhanced reliability and performance. However, widely-used C++-based BA frameworks, such as GTSAM, g$^2$o, and Ceres, lack native integration with modern deep learning libraries like PyTorch. This limitation affects their flexibility, adaptability, ease of debugging, and overall implementation efficiency. To address this gap, we introduce an eager-mode BA framework seamlessly integrated with PyPose, providing PyTorch-compatible interfaces with high efficiency. Our approach includes GPU-accelerated, differentiable, and sparse operations designed for 2nd-order optimization, Lie group and Lie algebra operations, and linear solvers. Our eager-mode BA on GPU demonstrates substantial runtime efficiency, achieving an average speedup of 18.5$\times$, 22$\times$, and 23$\times$ compared to GTSAM, g$^2$o, and Ceres, respectively.
https://arxiv.org/abs/2409.12190
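As a toy illustration of what eager-mode, differentiable bundle adjustment looks like in PyTorch, the sketch below refines a single camera translation against fixed 3D landmarks with Gauss-Newton steps, using autograd to build the Jacobian of the reprojection residual. This is a generic illustration under simplified assumptions (no rotation, no sparsity), not the PyPose interface described in the paper.

```python
import torch

fx = fy = 500.0
cx = cy = 320.0
points = torch.tensor([[0.0, 0.0, 5.0], [1.0, -0.5, 6.0], [-1.0, 1.0, 4.0]])  # 3D landmarks

def project(t):
    """Pinhole reprojection of the landmarks under camera translation t."""
    p = points + t                              # camera-frame points (rotation omitted)
    u = fx * p[:, 0] / p[:, 2] + cx
    v = fy * p[:, 1] / p[:, 2] + cy
    return torch.stack([u, v], dim=-1).reshape(-1)

t_true = torch.tensor([0.1, -0.2, 0.3])
observed = project(t_true)                      # synthetic observations

t = torch.zeros(3)
for _ in range(5):
    r = project(t) - observed                                           # residual (2N,)
    J = torch.autograd.functional.jacobian(lambda x: project(x) - observed, t)
    step = torch.linalg.solve(J.T @ J + 1e-6 * torch.eye(3), -J.T @ r)  # damped Gauss-Newton
    t = t + step
print(t)  # converges toward t_true
```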
Forecasting long-term 3D human motion is challenging: the stochasticity of human behavior makes it hard to generate realistic human motion from the input sequence alone. Information on the scene environment and the motion of nearby people can greatly aid the generation process. We propose a scene-aware social transformer model (SAST) to forecast long-term (10s) human motion. Unlike previous models, our approach can model interactions between widely varying numbers of people and objects in a scene. We combine a temporal convolutional encoder-decoder architecture with a Transformer-based bottleneck that allows us to efficiently combine motion and scene information. We model the conditional motion distribution using denoising diffusion models. We benchmark our approach on the Humans in Kitchens dataset, which contains 1 to 16 persons and 29 to 50 objects that are visible simultaneously. Our model outperforms other approaches in terms of realism and diversity on different metrics and in a user study. Code is available at this https URL.
https://arxiv.org/abs/2409.12189
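Since SAST models the conditional motion distribution with denoising diffusion, here is a minimal sketch of how one would sample motion from such a model: a denoiser predicts the noise given the noisy motion, the diffusion step, and a context embedding (standing in for the scene and nearby people), and the standard DDPM update is applied. The placeholder MLP, joint count, and noise schedule are assumptions for illustration, not the paper's convolutional/Transformer architecture.

```python
import torch
import torch.nn as nn

T, motion_dim, ctx_dim = 50, 3 * 17, 64      # assumed: 17 joints in 3D, 50 diffusion steps

denoiser = nn.Sequential(nn.Linear(motion_dim + ctx_dim + 1, 256), nn.ReLU(),
                         nn.Linear(256, motion_dim))

betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample(context, steps=T):
    """Draw motion samples conditioned on a context embedding."""
    x = torch.randn(context.shape[0], motion_dim)          # start from pure noise
    for t in reversed(range(steps)):
        t_embed = torch.full((x.shape[0], 1), t / steps)
        eps = denoiser(torch.cat([x, context, t_embed], dim=-1))
        # standard DDPM posterior mean, then add noise except at the last step
        mean = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x

motion = sample(context=torch.randn(4, ctx_dim))
print(motion.shape)
```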
In this report, we introduce the Qwen2.5-Coder series, a significant upgrade from its predecessor, CodeQwen1.5. This series includes two models: Qwen2.5-Coder-1.5B and Qwen2.5-Coder-7B. As a code-specific model, Qwen2.5-Coder is built upon the Qwen2.5 architecture and further pretrained on a vast corpus of over 5.5 trillion tokens. Through meticulous data cleaning, scalable synthetic data generation, and balanced data mixing, Qwen2.5-Coder demonstrates impressive code generation capabilities while retaining general versatility. The model has been evaluated on a wide range of code-related tasks, achieving state-of-the-art (SOTA) performance across more than 10 benchmarks, including code generation, completion, reasoning, and repair, consistently outperforming larger models. We believe that the release of the Qwen2.5-Coder series will not only push the boundaries of research in code intelligence but also, through its permissive licensing, encourage broader adoption by developers in real-world applications.
https://arxiv.org/abs/2409.12186
Chain-of-thought (CoT) via prompting is the de facto method for eliciting reasoning capabilities from large language models (LLMs). But for what kinds of tasks is this extra ``thinking'' really helpful? To analyze this, we conducted a quantitative meta-analysis covering over 100 papers using CoT and ran our own evaluations of 20 datasets across 14 models. Our results show that CoT gives strong performance benefits primarily on tasks involving math or logic, with much smaller gains on other types of tasks. On MMLU, directly generating the answer without CoT leads to almost identical accuracy as CoT unless the question or model's response contains an equals sign, indicating symbolic operations and reasoning. Following this finding, we analyze the behavior of CoT on these problems by separating planning and execution and comparing against tool-augmented LLMs. Much of CoT's gain comes from improving symbolic execution, but it underperforms relative to using a symbolic solver. Our results indicate that CoT can be applied selectively, maintaining performance while saving inference costs. Furthermore, they suggest a need to move beyond prompt-based CoT to new paradigms that better leverage intermediate computation across the whole range of LLM applications.
https://arxiv.org/abs/2409.12183
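The practical takeaway of the study above is selective application of chain-of-thought. A minimal sketch of that routing idea, using the equals-sign signal reported in the abstract as the trigger (the prompt wording and the `llm` callable are placeholders, not any real API):

```python
def needs_cot(question: str, draft_answer: str = "") -> bool:
    """Route to chain-of-thought only when the question (or a cheap direct
    draft) shows signs of symbolic work, e.g. an equals sign; otherwise
    answer directly and save inference tokens."""
    return "=" in question or "=" in draft_answer

def answer(question: str, llm) -> str:
    # `llm` is a placeholder callable (prompt -> completion).
    if needs_cot(question):
        prompt = f"{question}\nLet's think step by step."
    else:
        prompt = f"{question}\nAnswer directly."
    return llm(prompt)

stub = lambda p: f"[completion for: {p!r}]"
print(answer("Solve for x: 3x + 2 = 11", llm=stub))   # routed through CoT
print(answer("Which continent is Kenya in?", llm=stub))  # answered directly
```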
Broad textual understanding and in-context learning require language models that utilize full document contexts. Due to the implementation challenges associated with directly training long-context models, many methods have been proposed for extending models to handle long contexts. However, owing to differences in data and model classes, it has been challenging to compare these approaches, leading to uncertainty as to how to evaluate long-context performance and whether it differs from standard evaluation. We implement a controlled protocol for extension methods with a standardized evaluation, utilizing consistent base models and extension data. Our study yields several insights into long-context behavior. First, we reaffirm the critical role of perplexity as a general-purpose performance indicator even in longer-context tasks. Second, we find that current approximate attention methods systematically underperform across long-context tasks. Finally, we confirm that exact fine-tuning based methods are generally effective within the range of their extension, whereas extrapolation remains challenging. All codebases, models, and checkpoints will be made available open-source, promoting transparency and facilitating further research in this critical area of AI development.
https://arxiv.org/abs/2409.12181
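Since the study above reaffirms perplexity as a general-purpose indicator for long-context models, here is a small sketch of how token-level perplexity over a long sequence is computed from a causal LM's logits. The stub "model" is a placeholder so the snippet runs standalone; a real evaluation would pass an actual language model and tokenized document.

```python
import math
import torch
import torch.nn.functional as F

def perplexity(model, token_ids: torch.Tensor) -> float:
    """Token-level perplexity of `model` on one long sequence.

    `model` is any callable returning logits of shape (1, L, vocab); the
    negative log-likelihood is averaged over all next-token predictions.
    """
    with torch.no_grad():
        logits = model(token_ids.unsqueeze(0))               # (1, L, vocab)
    nll = F.cross_entropy(logits[0, :-1], token_ids[1:])     # predict token t+1 from t
    return math.exp(nll.item())

# toy usage with a stub model (random logits over a 100-token vocabulary)
ids = torch.randint(0, 100, (4096,))
ppl = perplexity(lambda x: torch.randn(1, x.shape[1], 100), ids)
print(f"perplexity ~ {ppl:.1f}")
```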
Large language models (LLMs) are increasingly employed in information-seeking and decision-making tasks. Despite their broad utility, LLMs tend to generate information that conflicts with real-world facts, and their persuasive style can make these inaccuracies appear confident and convincing. As a result, end-users struggle to consistently align the confidence expressed by LLMs with the accuracy of their predictions, often leading to either blind trust in all outputs or a complete disregard for their reliability. In this work, we explore supervised finetuning on uncertainty-augmented predictions as a method to develop models that produce linguistic expressions of uncertainty. Specifically, we measure the calibration of pre-trained models and then fine-tune language models to generate calibrated linguistic expressions of uncertainty. Through experiments on various question-answering datasets, we demonstrate that LLMs are well-calibrated in assessing their predictions, and supervised finetuning based on the model's own confidence leads to well-calibrated expressions of uncertainty, particularly for single-claim answers.
https://arxiv.org/abs/2409.12180
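Two pieces of the recipe above can be sketched compactly: measuring calibration of a model's stated confidence, and turning a prediction plus its confidence into a finetuning target that verbalizes uncertainty. The ECE estimate below is a standard binned formulation; the hedging templates are illustrative assumptions, not the paper's exact phrasing.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Gap between stated confidence and empirical accuracy, averaged over
    equal-width confidence bins (a standard ECE estimate)."""
    confidences, correct = np.asarray(confidences, float), np.asarray(correct, float)
    ece, edges = 0.0, np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

def uncertainty_augmented_target(answer: str, confidence: float) -> str:
    """Build a finetuning target that verbalizes the model's own confidence.
    The phrasing buckets are illustrative, not the paper's templates."""
    if confidence > 0.9:
        hedge = "I am confident that"
    elif confidence > 0.6:
        hedge = "I believe that"
    else:
        hedge = "I am unsure, but possibly"
    return f"{hedge} {answer}"

print(expected_calibration_error([0.9, 0.8, 0.55, 0.95], [1, 1, 0, 1]))
print(uncertainty_augmented_target("the Nile is the longest river.", 0.72))
```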
We study the computational complexity theory of smooth, finite-dimensional dynamical systems. Building off of previous work, we give definitions for what it means for a smooth dynamical system to simulate a Turing machine. We then show that 'chaotic' dynamical systems (more precisely, Axiom A systems) and 'integrable' dynamical systems (more generally, measure-preserving systems) cannot robustly simulate universal Turing machines, although such machines can be robustly simulated by other kinds of dynamical systems. Subsequently, we show that any Turing machine that can be encoded into a structurally stable one-dimensional dynamical system must have a decidable halting problem, and moreover an explicit time complexity bound in instances where it does halt. More broadly, our work elucidates what it means for one 'machine' to simulate another, and emphasizes the necessity of defining low-complexity 'encoders' and 'decoders' to translate between the dynamics of the simulation and the system being simulated. We highlight how the notion of a computational dynamical system leads to questions at the intersection of computational complexity theory, dynamical systems theory, and real algebraic geometry.
https://arxiv.org/abs/2409.12179
While significant progress has been made on the text-to-SQL task, recent solutions repeatedly encode the same database schema for every question, resulting in unnecessarily high inference costs and often overlooking crucial database knowledge. To address these issues, we propose You Only Read Once (YORO), a novel paradigm that directly internalizes database knowledge into the parametric knowledge of a text-to-SQL model during training and eliminates the need for schema encoding during inference. YORO significantly reduces the input token length by 66%-98%. Despite its shorter inputs, our empirical results demonstrate YORO's competitive performance with traditional systems on three benchmarks as well as its significant outperformance on large databases. Furthermore, YORO excels in handling questions with challenging value retrievals, such as abbreviations.
https://arxiv.org/abs/2409.12172
Brain Tumor Segmentation (BraTS) plays a critical role in clinical diagnosis, treatment planning, and monitoring the progression of brain tumors. However, due to the variability in tumor appearance, size, and intensity across different MRI modalities, automated segmentation remains a challenging task. In this study, we propose a novel Transformer-based framework, multiPI-TransBTS, which integrates multi-physical information to enhance segmentation accuracy. The model leverages spatial information, semantic information, and multi-modal imaging data, addressing the inherent heterogeneity in brain tumor characteristics. The multiPI-TransBTS framework consists of an encoder, an Adaptive Feature Fusion (AFF) module, and a multi-source, multi-scale feature decoder. The encoder incorporates a multi-branch architecture to separately extract modality-specific features from different MRI sequences. The AFF module fuses information from multiple sources using channel-wise and element-wise attention, ensuring effective feature recalibration. The decoder combines both common and task-specific features through a Task-Specific Feature Introduction (TSFI) strategy, producing accurate segmentation outputs for Whole Tumor (WT), Tumor Core (TC), and Enhancing Tumor (ET) regions. Comprehensive evaluations on the BraTS2019 and BraTS2020 datasets demonstrate the superiority of multiPI-TransBTS over the state-of-the-art methods. The model consistently achieves better Dice coefficients, Hausdorff distances, and Sensitivity scores, highlighting its effectiveness in addressing the BraTS challenges. Our results also indicate the need for further exploration of the balance between precision and recall in the ET segmentation task. The proposed framework represents a significant advancement in BraTS, with potential implications for improving clinical outcomes for brain tumor patients.
https://arxiv.org/abs/2409.12167
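The Adaptive Feature Fusion (AFF) module above is described as fusing multi-source features with channel-wise and element-wise attention. Below is a small PyTorch sketch of a block in that spirit: two feature maps are merged, reweighted per channel (squeeze-and-excitation style), and then per element with a spatial gate. Layer sizes and the exact gating form are assumptions for illustration, not the paper's module.

```python
import torch
import torch.nn as nn

class AdaptiveFeatureFusion(nn.Module):
    """Sketch of an AFF-style block for 3D (volumetric) MRI features."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.proj = nn.Conv3d(2 * channels, channels, kernel_size=1)
        self.channel_gate = nn.Sequential(                      # channel-wise attention
            nn.AdaptiveAvgPool3d(1), nn.Conv3d(channels, channels // reduction, 1),
            nn.ReLU(), nn.Conv3d(channels // reduction, channels, 1), nn.Sigmoid())
        self.spatial_gate = nn.Sequential(                      # element-wise attention
            nn.Conv3d(channels, 1, kernel_size=3, padding=1), nn.Sigmoid())

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        fused = self.proj(torch.cat([feat_a, feat_b], dim=1))   # merge the two sources
        fused = fused * self.channel_gate(fused)                # recalibrate channels
        fused = fused * self.spatial_gate(fused)                # recalibrate voxels
        return fused

aff = AdaptiveFeatureFusion(channels=16)
a, b = torch.randn(1, 16, 8, 32, 32), torch.randn(1, 16, 8, 32, 32)  # two modality branches
print(aff(a, b).shape)
```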
The intermittency of solar power, due to occlusion from cloud cover, is one of the key factors inhibiting its widespread use in both commercial and residential settings. Hence, real-time forecasting of solar irradiance for grid-connected photovoltaic systems is necessary to schedule and allocate resources across the grid. Ground-based imagers that capture wide field-of-view images of the sky are commonly used to monitor cloud movement around a particular site in an effort to forecast solar irradiance. However, these wide-FOV imagers capture a distorted image of the sky, where regions near the horizon are heavily compressed. This hinders the ability to precisely predict cloud motion near the horizon, which especially affects prediction over longer time horizons. In this work, we combat the aforementioned constraint by introducing a deep learning method to predict a future sky image frame with higher resolution than previous methods. Our main contribution is to derive an optimal warping method to counter the adverse effects of clouds at the horizon, and to learn a framework for future sky image prediction which better determines cloud evolution for longer time horizons.
https://arxiv.org/abs/2409.12162
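To illustrate the warping idea in the abstract above (not the paper's derived optimal warp), the toy remap below assumes an all-sky fisheye image with the zenith at the centre and the horizon at the outer edge, and spreads the outer radii over more output pixels so the horizon band is less compressed. The radial schedule and nearest-neighbour sampling are purely illustrative assumptions.

```python
import numpy as np

def horizon_expanding_warp(sky_img: np.ndarray, gamma: float = 2.0) -> np.ndarray:
    """Toy radial remap of an all-sky image.

    Output radius r_out samples input radius r_in = r_out**(1/gamma) in
    normalized units, so the band near the horizon (r_in close to 1) is
    spread over a wider range of output radii.
    """
    h, w = sky_img.shape[:2]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    radius = min(cx, cy)
    ys, xs = np.mgrid[0:h, 0:w]
    r_out = np.hypot(ys - cy, xs - cx) / radius            # normalized output radius
    theta = np.arctan2(ys - cy, xs - cx)
    r_in = np.clip(r_out, 0, 1) ** (1.0 / gamma)           # expand the horizon band
    src_y = np.clip(np.round(cy + r_in * radius * np.sin(theta)), 0, h - 1).astype(int)
    src_x = np.clip(np.round(cx + r_in * radius * np.cos(theta)), 0, w - 1).astype(int)
    return sky_img[src_y, src_x]

dummy_sky = np.random.rand(256, 256, 3)
print(horizon_expanding_warp(dummy_sky).shape)
```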
There is a large population of wheelchair users, most of whom need help with daily tasks. However, according to recent reports, their needs are not properly satisfied due to the lack of caregivers. Therefore, in this project, we develop WeHelp, a shared autonomy system aimed at wheelchair users. A robot with the WeHelp system has three modes: a following mode, a remote-control mode, and a teleoperation mode. In the following mode, the robot follows the wheelchair user automatically via visual tracking; the user can ask the robot to follow from behind, on the left, or on the right. When the wheelchair user asks for help, the robot recognizes the command via speech recognition and then switches to the teleoperation mode or the remote-control mode. In the teleoperation mode, the wheelchair user takes over the robot with a joystick and controls it to complete complex tasks, such as opening doors, moving obstacles out of the way, or reaching objects on a high shelf or low on the ground. In the remote-control mode, a remote assistant takes over the robot and helps the wheelchair user complete such tasks. Our evaluation shows that the pipeline is useful and practical for wheelchair users. Source code and a demo of the paper are available at \url{this https URL}.
https://arxiv.org/abs/2409.12159
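The three-mode behaviour described above is essentially a small state machine driven by recognized speech commands. The sketch below illustrates that control flow only; the command vocabulary and follow offsets are hypothetical, not the project's actual speech grammar.

```python
from enum import Enum, auto

class Mode(Enum):
    FOLLOWING = auto()
    TELEOPERATION = auto()
    REMOTE_CONTROL = auto()

class WeHelpController:
    """Toy mode switcher for the three modes described above."""
    def __init__(self):
        self.mode = Mode.FOLLOWING
        self.follow_side = "behind"            # behind / left / right

    def on_speech(self, utterance: str) -> None:
        text = utterance.lower()
        if "follow" in text:
            self.mode = Mode.FOLLOWING
            for side in ("left", "right", "behind"):
                if side in text:
                    self.follow_side = side
        elif "help" in text and "remote" in text:
            self.mode = Mode.REMOTE_CONTROL    # hand over to a remote assistant
        elif "help" in text:
            self.mode = Mode.TELEOPERATION     # user drives the robot by joystick

ctrl = WeHelpController()
for cmd in ["follow me on the left", "I need help opening the door", "follow me"]:
    ctrl.on_speech(cmd)
    print(cmd, "->", ctrl.mode.name, ctrl.follow_side)
```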
We introduce a novel method for joint expression and audio-guided talking face generation. Recent approaches either struggle to preserve the speaker identity or fail to produce faithful facial expressions. To address these challenges, we propose a NeRF-based network. Since we train our network on monocular videos without any ground truth, it is essential to learn disentangled representations for audio and expression. We first learn audio features in a self-supervised manner, given utterances from multiple subjects. By incorporating a contrastive learning technique, we ensure that the learned audio features are aligned to the lip motion and disentangled from the muscle motion of the rest of the face. We then devise a transformer-based architecture that learns expression features, capturing long-range facial expressions and disentangling them from the speech-specific mouth movements. Through quantitative and qualitative evaluation, we demonstrate that our method can synthesize high-fidelity talking face videos, achieving state-of-the-art facial expression transfer along with lip synchronization to unseen audio.
https://arxiv.org/abs/2409.12156
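The contrastive step described above, aligning audio features with lip motion, can be illustrated with a standard InfoNCE-style loss over a batch of time windows. The symmetric form, embedding sizes, and temperature below are illustrative assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def audio_lip_infonce(audio_emb: torch.Tensor, lip_emb: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    """Pull each audio embedding toward the lip-motion embedding of the same
    time window and push it away from other windows in the batch."""
    a = F.normalize(audio_emb, dim=-1)
    l = F.normalize(lip_emb, dim=-1)
    logits = a @ l.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(a.shape[0], device=a.device)  # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

audio = torch.randn(16, 128)   # one window of audio features per item
lips = torch.randn(16, 128)    # matching window of lip-motion features
print(float(audio_lip_infonce(audio, lips)))
```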
Lesion segmentation in PET/CT imaging is essential for precise tumor characterization, which supports personalized treatment planning and enhances diagnostic precision in oncology. However, accurate manual segmentation of lesions is time-consuming and prone to inter-observer variability. Given the rising demand and clinical use of PET/CT, automated segmentation methods, particularly deep-learning-based approaches, have become increasingly relevant. The autoPET III Challenge focuses on advancing automated segmentation of tumor lesions in PET/CT images in a multitracer multicenter setting, addressing the clinical need for quantitative, robust, and generalizable solutions. Building on previous challenges, the third iteration of the autoPET challenge introduces a more diverse dataset featuring two different tracers (FDG and PSMA) from two clinical centers. To this end, we developed a classifier that identifies the tracer of a given PET/CT scan based on the Maximum Intensity Projection of the PET scan. We trained two individual nnUNet ensembles, one for each tracer, with anatomical labels included as a multi-label task to enhance the model's performance. Our final submission achieves cross-validation Dice scores of 76.90% and 61.33% on the publicly available FDG and PSMA datasets, respectively. The code is available at this https URL .
https://arxiv.org/abs/2409.12155
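The tracer-routing step above starts from a maximum intensity projection of the PET volume; the sketch below shows that projection and a placeholder classifier call. The classifier callable is a stub so the snippet runs; the real model in the paper is a trained network.

```python
import numpy as np

def maximum_intensity_projection(pet_volume: np.ndarray, axis: int = 1) -> np.ndarray:
    """Collapse a 3D PET volume (z, y, x) into a 2D maximum intensity
    projection along the chosen axis, the input used by the tracer classifier."""
    return pet_volume.max(axis=axis)

def classify_tracer(mip: np.ndarray, classifier) -> str:
    """`classifier` is a placeholder callable (2D array -> 'FDG' | 'PSMA')."""
    return classifier(mip)

pet = np.random.rand(200, 192, 192).astype(np.float32)   # dummy PET volume
mip = maximum_intensity_projection(pet)
tracer = classify_tracer(mip, classifier=lambda m: "FDG" if m.mean() > 0.5 else "PSMA")
print(mip.shape, tracer)   # the predicted tracer then selects the matching nnUNet ensemble
```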
Abductive explanations (AXp's) are widely used for understanding decisions of classifiers. Existing definitions are suitable when features are independent. However, we show that ignoring constraints when they exist between features may lead to an explosion in the number of redundant or superfluous AXp's. We propose three new types of explanations that take into account constraints and that can be generated from the whole feature space or from a sample (such as a dataset). They are based on a key notion of coverage of an explanation, the set of instances it explains. We show that coverage is powerful enough to discard redundant and superfluous AXp's. For each type, we analyse the complexity of finding an explanation and investigate its formal properties. The final result is a catalogue of different forms of AXp's with different complexities and different formal guarantees.
https://arxiv.org/abs/2409.12154
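To illustrate how coverage can prune superfluous explanations, the toy below represents an explanation as a partial feature assignment, computes its coverage as the set of sample instances it matches, and drops any explanation whose coverage is strictly contained in another's. This is a simplified illustration of the coverage idea, not the paper's formal definitions or complexity-aware algorithms.

```python
def coverage(explanation: dict, sample: list) -> frozenset:
    """Indices of `sample` instances matched by a partial assignment
    {feature: value} (toy representation of an explanation)."""
    return frozenset(i for i, inst in enumerate(sample)
                     if all(inst.get(f) == v for f, v in explanation.items()))

def drop_redundant(explanations: list, sample: list) -> list:
    """Keep only explanations whose coverage is not strictly contained in the
    coverage of another explanation."""
    covs = [coverage(e, sample) for e in explanations]
    keep = []
    for i, e in enumerate(explanations):
        dominated = any(covs[i] < covs[j] for j in range(len(explanations)) if j != i)
        if not dominated:
            keep.append(e)
    return keep

sample = [{"age": "young", "income": "high"}, {"age": "young", "income": "low"},
          {"age": "old", "income": "high"}]
explanations = [{"age": "young"}, {"age": "young", "income": "high"}]
print(drop_redundant(explanations, sample))   # the more specific explanation is dominated
```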
Robots can influence people to accomplish their tasks more efficiently: autonomous cars can inch forward at an intersection to pass through, and tabletop manipulators can go for an object on the table first. However, a robot's ability to influence can also compromise the safety of nearby people if naively executed. In this work, we pose and solve a novel robust reach-avoid dynamic game which enables robots to be maximally influential, but only when a safety backup control exists. On the human side, we model the human's behavior as goal-driven but conditioned on the robot's plan, enabling us to capture influence. On the robot side, we solve the dynamic game in the joint physical and belief space, enabling the robot to reason about how its uncertainty in human behavior will evolve over time. We instantiate our method, called SLIDE (Safely Leveraging Influence in Dynamic Environments), in a high-dimensional (39-D) simulated human-robot collaborative manipulation task solved via offline game-theoretic reinforcement learning. We compare our approach to a robust baseline that treats the human as a worst-case adversary, a safety controller that does not explicitly reason about influence, and an energy-function-based safety shield. We find that SLIDE consistently enables the robot to leverage the influence it has on the human when it is safe to do so, ultimately allowing the robot to be less conservative while still ensuring a high safety rate during task execution.
https://arxiv.org/abs/2409.12153
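The core behaviour of SLIDE described above, being influential only while a safety backup exists, can be caricatured as a simple gate on the executed action. The scalar safety value below is a stand-in for the reach-avoid value the paper computes game-theoretically; it is an illustration of the interface, not the method.

```python
def slide_style_action(influence_action, safe_action, safety_value: float,
                       threshold: float = 0.0):
    """Execute the influence-seeking action only while a safety backup remains
    available, summarized here by a scalar value (positive means a backup
    controller can still keep the interaction safe)."""
    return influence_action if safety_value > threshold else safe_action

# toy usage: nudge forward at an intersection only while the safety margin holds
print(slide_style_action("inch_forward", "yield", safety_value=0.4))
print(slide_style_action("inch_forward", "yield", safety_value=-0.1))
```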
We present Residual Descent Differential Dynamic Game (RD3G), a Newton-based solver for constrained multi-agent game-control problems. The proposed solver seeks a local Nash equilibrium for problems where agents are coupled through their rewards and state constraints. We compare the proposed method against competing state-of-the-art techniques and showcase the computational benefits of the RD3G algorithm on several example problems.
https://arxiv.org/abs/2409.12152
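As a rough illustration of a Newton-type search for a local Nash equilibrium, the toy below stacks each player's gradient of its own cost with respect to its own decision variable into a residual and drives it to zero with Newton steps. The quadratic costs and finite-difference derivatives are purely illustrative; the RD3G solver additionally handles dynamics, coupling through state, and constraints.

```python
import numpy as np

def cost1(u1, u2): return (u1 - 1.0) ** 2 + 0.5 * u1 * u2
def cost2(u1, u2): return (u2 + 2.0) ** 2 + 0.5 * u1 * u2

def residual(u, eps=1e-6):
    """First-order Nash conditions: each player's own-cost gradient."""
    u1, u2 = u
    g1 = (cost1(u1 + eps, u2) - cost1(u1 - eps, u2)) / (2 * eps)   # d cost1 / d u1
    g2 = (cost2(u1, u2 + eps) - cost2(u1, u2 - eps)) / (2 * eps)   # d cost2 / d u2
    return np.array([g1, g2])

def jacobian(u, eps=1e-5):
    J = np.zeros((2, 2))
    for k in range(2):
        d = np.zeros(2); d[k] = eps
        J[:, k] = (residual(u + d) - residual(u - d)) / (2 * eps)
    return J

u = np.zeros(2)
for _ in range(10):
    u = u - np.linalg.solve(jacobian(u), residual(u))   # Newton step on the stacked residual
print(u, residual(u))   # near-zero residual: first-order Nash conditions hold
```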