Action recognition from video data forms a cornerstone of computer vision with wide-ranging applications. Single-view action recognition faces limitations due to its reliance on a single viewpoint. In contrast, multi-view approaches capture complementary information from various viewpoints for improved accuracy. Recently, event cameras have emerged as innovative bio-inspired sensors, leading to advancements in event-based action recognition. However, existing work predominantly focuses on single-view scenarios, leaving a gap in multi-view event data exploitation, particularly regarding challenges such as information deficit and semantic misalignment. To bridge this gap, we introduce HyperMV, a multi-view event-based action recognition framework. HyperMV converts discrete event data into frame-like representations and extracts view-related features using a shared convolutional network. By treating segments as vertices and constructing hyperedges using rule-based and KNN-based strategies, a multi-view hypergraph neural network that captures relationships across viewpoint and temporal features is established. A vertex-attention hypergraph propagation mechanism is also introduced for enhanced feature fusion. To promote research in this area, we present the largest multi-view event-based action dataset, $\text{THU}^{\text{MV-EACT}}\text{-50}$, comprising 50 actions from 6 viewpoints, which surpasses existing datasets by over tenfold. Experimental results show that HyperMV significantly outperforms baselines in both cross-subject and cross-view scenarios, and also exceeds the state of the art in frame-based multi-view action recognition.
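For illustration, the KNN-based hyperedge strategy named above could be sketched as follows (a minimal example assuming plain Euclidean distance over per-segment feature vectors; the function name and toy data are hypothetical, not the paper's implementation):

```python
import math

def knn_hyperedges(features, k=2):
    """KNN-based hyperedge strategy: each segment (vertex) forms one
    hyperedge with its k nearest neighbours in feature space."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    edges = []
    for i, fi in enumerate(features):
        # rank all other vertices by Euclidean distance to vertex i
        neighbours = sorted(
            (j for j in range(len(features)) if j != i),
            key=lambda j: dist(fi, features[j]),
        )[:k]
        edges.append(frozenset([i, *neighbours]))
    return edges

# four segment features forming two well-separated groups
feats = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
edges = knn_hyperedges(feats, k=1)
```

Each hyperedge groups a segment with its most similar segments across views and time, which is what lets the hypergraph model relations beyond pairwise edges.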
https://arxiv.org/abs/2403.19316
Scene reconstruction from multi-view images is a fundamental problem in computer vision and graphics. Recent neural implicit surface reconstruction methods have achieved high-quality results; however, editing and manipulating the 3D geometry of reconstructed scenes remains challenging due to the absence of naturally decomposed object entities and complex object/background compositions. In this paper, we present Total-Decom, a novel method for decomposed 3D reconstruction with minimal human interaction. Our approach seamlessly integrates the Segment Anything Model (SAM) with hybrid implicit-explicit neural surface representations and a mesh-based region-growing technique for accurate 3D object decomposition. Total-Decom requires minimal human annotations while providing users with real-time control over the granularity and quality of decomposition. We extensively evaluate our method on benchmark datasets and demonstrate its potential for downstream applications, such as animation and scene editing. The code is available at \href{this https URL}{this https URL}.
https://arxiv.org/abs/2403.19314
Recent advancements in robotics have led to the development of numerous interfaces to enhance the intuitiveness of robot navigation. However, the reliance on traditional 2D displays imposes limitations on the simultaneous visualization of information. Mixed Reality (MR) technology addresses this issue by enhancing the dimensionality of information visualization, allowing users to perceive multiple pieces of information concurrently. This paper proposes MRNaB, a mixed reality-based robot navigation interface using an optical see-through MR-beacon: a novel approach that incorporates an MR-beacon, situated atop the real-world environment, to function as a signal transmitter for robot navigation. This MR-beacon is designed to be persistent, eliminating the need for repeated navigation inputs for the same location. Our system comprises four primary functions: "Add", "Move", "Delete", and "Select". These allow for the addition of an MR-beacon, the movement of its location, its deletion, and the selection of an MR-beacon for navigation purposes, respectively. The effectiveness of the proposed method was validated through experiments comparing it with a traditional 2D system. The results show that MRNaB improves user performance, both subjectively and objectively, when navigating to a given location. For additional material, please check: this https URL
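The four functions can be pictured as operations on a persistent beacon store; the sketch below is hypothetical (the class, storage layout, and return convention are assumptions, not the paper's implementation):

```python
class BeaconManager:
    """Hypothetical sketch of the Add / Move / Delete / Select operations
    on persistent MR-beacons, stored as named world positions."""

    def __init__(self):
        self.beacons = {}      # name -> (x, y, z) position
        self.selected = None   # name of the current navigation target

    def add(self, name, pos):
        self.beacons[name] = pos

    def move(self, name, pos):
        if name in self.beacons:
            self.beacons[name] = pos

    def delete(self, name):
        self.beacons.pop(name, None)
        if self.selected == name:
            self.selected = None

    def select(self, name):
        # returns the goal position, e.g. to hand to a navigation stack
        if name in self.beacons:
            self.selected = name
            return self.beacons[name]

mgr = BeaconManager()
mgr.add("desk", (1.0, 2.0, 0.0))
goal = mgr.select("desk")
```

Because beacons persist in the store, the same location never needs to be re-entered, which mirrors the persistence property described above.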
https://arxiv.org/abs/2403.19310
The development of an intelligent agricultural decision-support system for crop selection and disease forecasting in Bangladesh is the main objective of this work. The economy of the nation depends heavily on agriculture. However, choosing crops with better production rates and efficiently controlling crop diseases are obstacles that farmers have to face. This research addresses these issues by utilizing machine learning methods and real-world datasets. The proposed approach uses a variety of datasets on crop production, soil conditions, agro-meteorological regions, crop diseases, and meteorological factors. These datasets offer insightful information on disease trends, the soil nutrition demands of crops, and agricultural production history. By incorporating this knowledge, the model first recommends a list of primarily selected crops based on the soil nutrition of a particular user location. Then, predictions of meteorological variables such as temperature, rainfall, and humidity are made using SARIMAX models. These weather predictions are then used to forecast the possibility of diseases for the primary crop list using a support vector classifier. Finally, the developed model uses a decision tree regression model to forecast crop yield and provides a final crop list along with the associated disease forecast. Utilizing the model's output, farmers may choose the most productive crops as well as prevent crop diseases and reduce output losses by taking preventive actions. Consequently, planning and decision-making processes are supported and farmers can anticipate possible crop yields. Overall, by offering a detailed decision support system for crop selection and disease prediction, this work can play a vital role in advancing agricultural practices in Bangladesh.
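The forecasting chain (SARIMAX weather forecasts feeding a support vector classifier for disease risk, then a decision-tree regressor for yield) can be sketched with toy stand-ins; the real system would use statsmodels and scikit-learn models, so the rules below are hypothetical placeholders that only show the data flow:

```python
def seasonal_naive_forecast(series, period=12, steps=3):
    """Placeholder for the SARIMAX forecaster: repeat last season's values."""
    return [series[-period + (i % period)] for i in range(steps)]

def disease_risk(temp, rainfall, humidity):
    """Placeholder for the support vector classifier: a toy threshold rule."""
    return "high" if humidity > 80 and 20 <= temp <= 30 else "low"

def recommend_crops(candidates, soil_nitrogen, weather):
    """Filter crops by soil nutrition need, then attach a disease forecast."""
    temp, rain, humidity = weather
    return [
        (crop, disease_risk(temp, rain, humidity))
        for crop, nitrogen_need in candidates
        if nitrogen_need <= soil_nitrogen
    ]

monthly_rainfall = list(range(1, 13))   # 12 months of history
forecast = seasonal_naive_forecast(monthly_rainfall, steps=3)
picks = recommend_crops([("rice", 40), ("maize", 70)], 50, (25, 100, 90))
```

Swapping the placeholders for fitted SARIMAX, SVC, and decision-tree models yields the pipeline described above without changing the overall flow.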
https://arxiv.org/abs/2403.19273
Laparoscopic video tracking primarily focuses on two target types: surgical instruments and anatomy. The former could be used for skill assessment, while the latter is necessary for the projection of virtual overlays. Where instrument and anatomy tracking have often been considered two separate problems, in this paper, we propose a method for joint tracking of all structures simultaneously. Based on a single 2D monocular video clip, we train a neural field to represent a continuous spatiotemporal scene, used to create 3D tracks of all surfaces visible in at least one frame. Due to the small size of instruments, they generally cover a small part of the image only, resulting in decreased tracking accuracy. Therefore, we propose enhanced class weighting to improve the instrument tracks. We evaluate tracking on video clips from laparoscopic cholecystectomies, where we find mean tracking accuracies of 92.4% for anatomical structures and 87.4% for instruments. Additionally, we assess the quality of depth maps obtained from the method's scene reconstructions. We show that these pseudo-depths have comparable quality to a state-of-the-art pre-trained depth estimator. On laparoscopic videos in the SCARED dataset, the method predicts depth with an MAE of 2.9 mm and a relative error of 9.2%. These results show the feasibility of using neural fields for monocular 3D reconstruction of laparoscopic scenes.
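The enhanced class weighting is one way to counteract the small pixel footprint of instruments; a plausible sketch, assuming inverse-frequency weights with an extra instrument boost (the boost factor is a hypothetical parameter, not the paper's exact scheme):

```python
def class_weights(pixel_counts, boost=(), boost_factor=4.0):
    """Inverse-frequency class weights, with an optional extra boost for
    classes that cover few pixels (here: the instrument classes)."""
    total = sum(pixel_counts.values())
    n_classes = len(pixel_counts)
    weights = {c: total / (n_classes * n) for c, n in pixel_counts.items()}
    for c in boost:
        weights[c] *= boost_factor
    return weights

# instruments cover ~10% of the pixels in this toy example
weights = class_weights({"anatomy": 900, "instrument": 100}, boost=["instrument"])
```

Multiplying the per-pixel loss by these weights makes the neural field pay proportionally more attention to instrument pixels during training.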
https://arxiv.org/abs/2403.19265
While large-scale pre-trained text-to-image models can synthesize diverse and high-quality human-centered images, novel challenges arise with the nuanced task of "identity fine editing": precisely modifying specific features of a subject while maintaining its inherent identity and context. Existing personalization methods, which are adept at "identity re-contextualization", either require time-consuming optimization or learn additional encoders. However, they often struggle with detailed and sensitive tasks like human face editing. To address these challenges, we introduce DreamSalon, a noise-guided, staged-editing framework that uniquely focuses on detailed image manipulation and identity-context preservation. By discerning editing and boosting stages via the frequency and gradient of predicted noises, DreamSalon first performs detailed manipulations on specific features in the editing stage, guided by high-frequency information, and then employs stochastic denoising in the boosting stage to improve image quality. For more precise editing, DreamSalon semantically mixes source and target textual prompts, guided by differences in their embedding covariances, to direct the model's focus to specific manipulation areas. Our experiments demonstrate DreamSalon's ability to efficiently and faithfully edit fine details on human faces, outperforming existing methods both qualitatively and quantitatively.
https://arxiv.org/abs/2403.19235
The digitisation campaigns carried out by libraries and archives in recent years have facilitated access to documents in their collections. However, exploring and exploiting these documents remain difficult tasks due to the sheer quantity of documents available for consultation. In this article, we show how the semantic annotation of the textual content of study corpora of archival documents facilitates their exploitation and valorisation. First, we present a methodological framework for the construction of new interfaces based on textual semantics, then address the current technological obstacles and their potential solutions. We conclude by presenting a practical case study applying this framework.
https://arxiv.org/abs/2403.19201
Supervised image captioning approaches have made great progress, but it is challenging to collect high-quality human-annotated image-text data. Recently, large-scale vision and language models (e.g., CLIP) and large-scale generative language models (e.g., GPT-2) have shown strong performance in various tasks, which also provides some new solutions for image captioning with web-paired data, unpaired data, or even text-only data. Among them, the mainstream solution is to project image embeddings into the text embedding space with the assistance of consistent representations between image-text pairs from the CLIP model. However, the current methods still face several challenges in adapting to the diversity of data configurations in a unified solution, accurately estimating image-text embedding bias, and correcting unsatisfactory prediction results in the inference stage. This paper proposes a new Text data-centric approach with Interactive Prompts for image Captioning, named TIPCap. 1) We consider four different settings which gradually reduce the dependence on paired data. 2) We construct a mapping module driven by a multivariate Gaussian distribution to mitigate the modality gap, which is applicable to the above four different settings. 3) We propose a prompt interaction module that can incorporate optional prompt information before generating captions. Extensive experiments show that our TIPCap outperforms other weakly or unsupervised image captioning methods and achieves a new state-of-the-art performance on two widely used datasets, i.e., MS-COCO and Flickr30K.
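The mapping module driven by a multivariate Gaussian is a learned component in the paper; the sketch below only illustrates the underlying idea of modelling the image-text embedding gap as a Gaussian (a diagonal covariance and all names here are simplifying assumptions):

```python
import random

def fit_gap(text_embs, image_embs):
    """Estimate a diagonal Gaussian over the image-text embedding gap."""
    dims = len(text_embs[0])
    gaps = [[i[d] - t[d] for d in range(dims)]
            for t, i in zip(text_embs, image_embs)]
    mean = [sum(g[d] for g in gaps) / len(gaps) for d in range(dims)]
    var = [sum((g[d] - mean[d]) ** 2 for g in gaps) / len(gaps)
           for d in range(dims)]
    return mean, var

def map_text_to_image(text_emb, mean, var, rng):
    """Shift a text embedding toward the image embedding space by adding
    noise drawn from the estimated gap distribution."""
    return [t + rng.gauss(mean[d], var[d] ** 0.5)
            for d, t in enumerate(text_emb)]

# toy paired embeddings whose gap is exactly (+1, 0) in every pair
mean, var = fit_gap([[0.0, 0.0], [1.0, 1.0]], [[1.0, 0.0], [2.0, 1.0]])
mapped = map_text_to_image([0.0, 0.0], mean, var, random.Random(0))
```

At inference time, text embeddings shifted this way stand in for image embeddings, which is how text-only training data can drive a captioner.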
https://arxiv.org/abs/2403.19193
This study investigates the efficacy of six major Generative AI (GenAI) text detectors when confronted with machine-generated content that has been modified using techniques designed to evade detection by these tools (n=805). The results demonstrate that the detectors' already low accuracy rates (39.5%) show major reductions in accuracy (17.4%) when faced with manipulated content, with some techniques proving more effective than others at evading detection. The accuracy limitations and the potential for false accusations demonstrate that these tools cannot currently be recommended for determining whether violations of academic integrity have occurred, underscoring the challenges educators face in maintaining inclusive and fair assessment practices. However, they may have a role in supporting student learning and maintaining academic integrity when used in a non-punitive manner. These results underscore the need for a combined approach to addressing the challenges posed by GenAI in academia to promote the responsible and equitable use of these emerging technologies. The study concludes that the current limitations of AI text detectors require a critical approach to any possible implementation in higher education (HE) and highlight possible alternatives to AI assessment strategies.
https://arxiv.org/abs/2403.19148
Conventional GAN-based models for talking head generation often suffer from limited quality and unstable training. Recent approaches based on diffusion models aim to address these limitations and improve fidelity. However, they still face challenges, including extensive sampling times and difficulties in maintaining temporal consistency due to the high stochasticity of diffusion models. To overcome these challenges, we propose a novel motion-disentangled diffusion model for high-quality talking head generation, dubbed MoDiTalker. We introduce two modules: audio-to-motion (AToM), designed to generate synchronized lip motion from audio, and motion-to-video (MToV), designed to produce high-quality head video following the generated motion. AToM excels at capturing subtle lip movements by leveraging an audio attention mechanism. In addition, MToV enhances temporal consistency by leveraging an efficient tri-plane representation. Our experiments conducted on standard benchmarks demonstrate that our model achieves superior performance compared to existing models. We also provide comprehensive ablation studies and user study results.
https://arxiv.org/abs/2403.19144
Ensuring stable object placement is crucial to prevent objects from toppling over, breaking, or causing spills. When an object makes initial contact with a surface and some force is exerted, the moment of rotation caused by the instability of the placement can cause the object to rotate in a certain direction (henceforth referred to as the direction of corrective rotation). Existing methods often employ a Force/Torque (F/T) sensor to estimate the direction of corrective rotation by detecting the moment of rotation as a torque. However, its effectiveness may be hampered by sensor noise and the tension of the robot's external cabling. To address these issues, we propose a method for stable object placement using GelSight vision-based tactile sensors as an alternative to F/T sensors. Our method estimates the direction of corrective rotation of objects using the displacement of the black-dot pattern on the elastomeric surface of the GelSight. Using vector analysis, we calculate the Curl, indicative of the magnitude and direction of the rotational field of the black-dot displacement. Simultaneously, we calculate the difference (Diff) in displacement between the black dots of the left and right fingers' GelSights. The robot can then manipulate the object's pose using the Curl and Diff features, facilitating stable placement. Across experiments handling 18 differently characterized objects, our method achieves precise placement accuracy (less than 1-degree error) in nearly 100% of cases. An accompanying video is available at the following link: this https URL
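The Curl and Diff features can be illustrated with a crude discrete proxy on tracked dot displacements (the circulation-style estimate below is only an illustration, not the paper's exact vector-analysis formulation):

```python
def curl_feature(dots):
    """dots: list of (x, y, dx, dy) dot positions and displacements.
    Circulation-style proxy for the Curl feature: mean cross product of
    position (relative to the centroid) with the displacement vector."""
    cx = sum(d[0] for d in dots) / len(dots)
    cy = sum(d[1] for d in dots) / len(dots)
    return sum((x - cx) * dy - (y - cy) * dx
               for x, y, dx, dy in dots) / len(dots)

def diff_feature(left_dots, right_dots):
    """Diff feature: difference in mean displacement magnitude between
    the left and right fingers' dot patterns."""
    def mean_mag(dots):
        return sum((dx * dx + dy * dy) ** 0.5
                   for _, _, dx, dy in dots) / len(dots)
    return mean_mag(left_dots) - mean_mag(right_dots)

# a purely rotational displacement field around the origin
rotating = [(1.0, 0.0, 0.0, 1.0), (0.0, 1.0, -1.0, 0.0),
            (-1.0, 0.0, 0.0, -1.0), (0.0, -1.0, 1.0, 0.0)]
```

A positive Curl indicates counter-clockwise corrective rotation of the gel surface; the sign of Diff indicates which finger slipped more.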
https://arxiv.org/abs/2403.19129
In today's fast-paced industry, professionals face the challenge of summarizing a large number of documents and extracting vital information from them on a daily basis. These metrics are frequently hidden away in tables and/or their nested hyperlinks. To address this challenge, the approach of Table Question Answering (QA) has been developed to extract the relevant information. However, traditional Table QA training tasks that provide a table and an answer(s) from a gold cell coordinate(s) for a question may not always ensure extracting the accurate answer(s). Recent advancements in Large Language Models (LLMs) have opened up new possibilities for extracting information from tabular data using prompts. In this paper, we introduce the Multi-hop Few-shot Open Rich Table QA (MFORT-QA) approach, which consists of two major steps. The first step involves Few-Shot Learning (FSL), where relevant tables and associated contexts of hyperlinks are retrieved based on a given question. The retrieved content is then used to construct few-shot prompts as inputs to an LLM, such as ChatGPT. To tackle the challenge of answering complex questions, the second step leverages Chain-of-thought (CoT) prompting to decompose the complex question into a sequential chain of questions and reasoning thoughts in a multi-hop manner. Retrieval-Augmented Generation (RAG) enhances this process by retrieving relevant tables and contexts of hyperlinks that are relevant to the resulting reasoning thoughts and questions. These additional contexts are then used to supplement the prompt used in the first step, resulting in more accurate answers from an LLM. Empirical results from OTT-QA demonstrate that our abstractive QA approach significantly improves the accuracy of extractive Table QA methods.
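As a rough illustration of the first step, a few-shot prompt can be assembled from retrieved tables and hyperlink contexts (the layout and function name below are hypothetical, not the paper's exact template):

```python
def build_fewshot_prompt(question, retrieved_context, examples):
    """Assemble a few-shot prompt: worked Q/A examples first, then the
    retrieved table rows and hyperlink context, then the target question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    context = "\n".join(retrieved_context)
    parts.append(f"Context:\n{context}\nQ: {question}\nA:")
    return "\n\n".join(parts)

prompt = build_fewshot_prompt(
    "What was the total revenue?",
    ["| year | revenue |", "| 2023 | 10 |"],
    [("Which year is listed?", "2023")],
)
```

In the second step, the same builder would be re-invoked per reasoning hop, with RAG supplying fresh context for each decomposed sub-question.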
https://arxiv.org/abs/2403.19116
The rise of additive manufacturing comes with unique opportunities and challenges. Massive part customization and rapid design changes are made possible with additive manufacturing; however, manufacturing industries that desire the implementation of robotic automation to improve production efficiency could face challenges in gripper design and grasp planning due to the highly complex geometrical shapes resulting from massive part customization. Yet, current gripper designs for such objects are often manual and rely on ad-hoc design intuition. This is limiting, as such grippers lack the ability to grasp different objects or grasp points, which is important for practical implementations. Hence, we introduce a fast, end-to-end approach to customize rigid gripper fingerpads that achieves precise and stable grasping of different objects at multiple grasp points. Our approach relies on two key components: (i) a method based on set Boolean operations, e.g. intersections, subtractions, and unions, to extract object features and synthesize gripper surfaces that conform to different local shapes to form caging grasps; and (ii) a method to evaluate the grasp quality of the synthesized grippers. We experimentally demonstrate the validity of our approach by synthesizing fingerpads that, once mounted on a physical robot gripper, are able to grasp different objects at multiple grasp points, all with tightly constrained grasps.
https://arxiv.org/abs/2403.19102
Explaining deep learning models is becoming increasingly important in the face of daily emerging multimodal models, particularly in safety-critical domains like medical imaging. However, the lack of detailed investigation into the performance of explainability methods on these models is widening the gap between their development and safe deployment. In this work, we analyze the performance of various explainable AI methods on a vision-language model, MedCLIP, to demystify its inner workings. We also provide a simple methodology to overcome the shortcomings of these methods. Our work offers a new perspective on the explainability of a well-known recent VLM in the medical domain, and our assessment method is generalizable to other current and possible future VLMs.
https://arxiv.org/abs/2403.18996
Hallucination has emerged as the most vulnerable aspect of contemporary Large Language Models (LLMs). In this paper, we introduce Sorry, Come Again (SCA) prompting, aimed at avoiding LLM hallucinations by enhancing comprehension through: (i) optimal paraphrasing and (ii) injecting [PAUSE] tokens to delay LLM generation. First, we provide an in-depth analysis of linguistic nuances: the formality, readability, and concreteness of prompts for 21 LLMs, and elucidate how these nuances contribute to hallucinated generation. Prompts with lower readability, formality, or concreteness pose comprehension challenges for LLMs, similar to those faced by humans. In such scenarios, an LLM tends to speculate and generate content based on its imagination (associative memory) to fill these information gaps. Although these speculations may occasionally align with factual information, their accuracy is not assured, often resulting in hallucination. Recent studies reveal that an LLM often neglects the middle sections of extended prompts, a phenomenon termed "lost in the middle". While a specific paraphrase may suit one LLM, the same paraphrased version may elicit a different response from another LLM. Therefore, we propose an optimal paraphrasing technique to identify the most comprehensible paraphrase of a given prompt, evaluated using Integrated Gradients (and its variations) to guarantee that the LLM accurately processes all words. While reading lengthy sentences, humans often pause at various points to better comprehend the meaning read thus far. We have fine-tuned an LLM with injected [PAUSE] tokens, allowing the LLM to pause while reading lengthier prompts. This brings several key contributions: (i) determining the optimal positions to inject [PAUSE], (ii) determining the number of [PAUSE] tokens to be inserted, and (iii) introducing reverse proxy tuning to fine-tune the LLM for [PAUSE] insertion.
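The [PAUSE]-injection idea can be sketched at the token level (a toy scheme assuming tokens are inserted at fixed intervals; the paper instead learns the optimal positions and counts):

```python
def inject_pause(tokens, every=4, n_pause=1):
    """Insert n_pause [PAUSE] tokens after every `every` prompt tokens,
    giving the model explicit breathing points in long prompts."""
    out = []
    for i, tok in enumerate(tokens, start=1):
        out.append(tok)
        if i % every == 0 and i < len(tokens):
            out.extend(["[PAUSE]"] * n_pause)
    return out

paused = inject_pause(["The", "report", "covers", "three", "quarters", "of"],
                      every=2)
```

A fine-tuned model would treat [PAUSE] as a real vocabulary token, so where and how many to insert become tunable hyperparameters, exactly the questions addressed by contributions (i) and (ii).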
https://arxiv.org/abs/2403.18976
Recent advancements in Large Language Models (LLMs), particularly those built on Transformer architectures, have significantly broadened the scope of natural language processing (NLP) applications, transcending their initial use in chatbot technology. This paper investigates the multifaceted applications of these models, with an emphasis on the GPT series. This exploration focuses on the transformative impact of artificial intelligence (AI) driven tools in revolutionizing traditional tasks like coding and problem-solving, while also paving new paths in research and development across diverse industries. From code interpretation and image captioning to facilitating the construction of interactive systems and advancing computational domains, Transformer models exemplify a synergy of deep learning, data analysis, and neural network design. This survey provides an in-depth look at the latest research in Transformer models, highlighting their versatility and the potential they hold for transforming diverse application sectors, thereby offering readers a comprehensive understanding of the current and future landscape of Transformer-based LLMs in practical applications.
https://arxiv.org/abs/2403.18969
This paper introduces a novel approach to temporal action localization (TAL) in few-shot learning. Our work addresses the inherent limitations of conventional single-prompt learning methods that often lead to overfitting due to the inability to generalize across varying contexts in real-world videos. Recognizing the diversity of camera views, backgrounds, and objects in videos, we propose a multi-prompt learning framework enhanced with optimal transport. This design allows the model to learn a set of diverse prompts for each action, capturing general characteristics more effectively and distributing the representation to mitigate the risk of overfitting. Furthermore, by employing optimal transport theory, we efficiently align these prompts with action features, optimizing for a comprehensive representation that adapts to the multifaceted nature of video data. Our experiments demonstrate significant improvements in action localization accuracy and robustness in few-shot settings on the standard challenging datasets of THUMOS-14 and EpicKitchens100, highlighting the efficacy of our multi-prompt optimal transport approach in overcoming the challenges of conventional few-shot TAL methods.
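The optimal-transport alignment between a set of prompts and action features can be illustrated with a small Sinkhorn iteration (uniform marginals and a toy cost matrix are assumptions of this sketch, not the paper's training objective):

```python
import math

def sinkhorn(cost, reg=0.1, iters=200):
    """Entropy-regularised optimal transport between uniform marginals:
    returns a soft assignment (transport plan) for the given cost matrix."""
    n, m = len(cost), len(cost[0])
    K = [[math.exp(-c / reg) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # alternately rescale rows and columns to match the marginals
        u = [(1.0 / n) / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# cost of matching 2 prompts to 2 action features (lower = better match)
plan = sinkhorn([[0.0, 1.0], [1.0, 0.0]])
```

The plan concentrates mass on low-cost pairs, so each prompt specializes on the features it matches best while every prompt still receives its share of the assignment, the property that spreads representation across the prompt set.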
https://arxiv.org/abs/2403.18915
Faithful human performance capture and free-view rendering from sparse RGB observations is a long-standing problem in Vision and Graphics. The main challenges are the lack of observations and the inherent ambiguities of the setting, e.g. occlusions and depth ambiguity. As a result, radiance fields, which have shown great promise in capturing high-frequency appearance and geometry details in dense setups, perform poorly when naïvely supervised on sparse camera views, as the field simply overfits to the sparse-view inputs. To address this, we propose MetaCap, a method for efficient and high-quality geometry recovery and novel view synthesis given very sparse or even a single view of the human. Our key idea is to meta-learn the radiance field weights solely from potentially sparse multi-view videos, which can serve as a prior when fine-tuning them on sparse imagery depicting the human. This prior provides a good network weight initialization, thereby effectively addressing ambiguities in sparse-view capture. Due to the articulated structure of the human body and motion-induced surface deformations, learning such a prior is non-trivial. Therefore, we propose to meta-learn the field weights in a pose-canonicalized space, which reduces the spatial feature range and makes feature learning more effective. Consequently, one can fine-tune our field parameters to quickly generalize to unseen poses, novel illumination conditions as well as novel and sparse (even monocular) camera views. For evaluating our method under different scenarios, we collect a new dataset, WildDynaCap, which contains subjects captured in both a dense camera dome and in-the-wild sparse camera rigs, and demonstrate superior results compared to recent state-of-the-art methods on both public datasets and WildDynaCap.
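The meta-learned weight prior can be illustrated in miniature with a Reptile-style loop on a toy quadratic objective (this is a generic meta-learning sketch, not the paper's radiance-field training; all names and constants are illustrative):

```python
def inner_adapt(weights, target, lr=0.5, steps=5):
    """Inner loop: gradient steps on a per-task loss 0.5 * |w - target|^2."""
    for _ in range(steps):
        weights = [w + lr * (t - w) for w, t in zip(weights, target)]
    return weights

def reptile_meta_init(tasks, meta_lr=0.1, rounds=60):
    """Meta-learn an initialization (the 'prior') that adapts quickly to
    every task, starting from a deliberately poor point."""
    meta = [5.0, 5.0]
    for r in range(rounds):
        adapted = inner_adapt(list(meta), tasks[r % len(tasks)])
        # nudge the meta-initialization toward the adapted weights
        meta = [m + meta_lr * (a - m) for m, a in zip(meta, adapted)]
    return meta

# two "tasks" whose optima sit at (0,0) and (2,2); the learned
# initialization settles between them, ready to adapt to either
prior = reptile_meta_init([[0.0, 0.0], [2.0, 2.0]])
```

The same principle, at radiance-field scale, is what lets fine-tuning from the meta-learned initialization resolve sparse-view ambiguities instead of overfitting.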
https://arxiv.org/abs/2403.18820
We present SplatFace, a novel Gaussian splatting framework designed for 3D human face reconstruction without reliance on accurate pre-determined geometry. Our method is designed to simultaneously deliver both high-quality novel view rendering and accurate 3D mesh reconstructions. We incorporate a generic 3D Morphable Model (3DMM) to provide a surface geometric structure, making it possible to reconstruct faces with a limited set of input images. We introduce a joint optimization strategy that refines both the Gaussians and the morphable surface through a synergistic non-rigid alignment process. A novel distance metric, splat-to-surface, is proposed to improve alignment by considering both the Gaussian position and covariance. The surface information is also utilized to incorporate a world-space densification process, resulting in superior reconstruction quality. Our experimental analysis demonstrates that the proposed method is competitive with both other Gaussian splatting techniques in novel view synthesis and other 3D reconstruction methods in producing 3D face meshes with high geometric precision.
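The splat-to-surface metric is described only at a high level; a covariance-aware distance in its spirit might look like the following minimal sketch (a minimum Mahalanobis distance over sampled surface points; this is an assumption, not the paper's exact definition):

```python
def splat_to_surface(mu, cov_inv, surface_pts):
    """Covariance-aware distance from a Gaussian splat to a surface:
    minimum Mahalanobis distance from the splat (position mu, inverse
    covariance cov_inv) to a set of sampled surface points."""
    def mahalanobis(p):
        d = [pi - mi for pi, mi in zip(p, mu)]
        quad = sum(d[i] * sum(cov_inv[i][j] * d[j] for j in range(len(d)))
                   for i in range(len(d)))
        return quad ** 0.5
    return min(mahalanobis(p) for p in surface_pts)

identity = [[1.0, 0.0], [0.0, 1.0]]
```

Because the covariance enters the distance, an elongated splat counts as closer to surface points along its long axis than an isotropic splat at the same position, which is the point of considering both position and covariance in the alignment.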
https://arxiv.org/abs/2403.18784
This paper introduces a novel K-means clustering algorithm, an advancement on the conventional Big-means methodology. The proposed method efficiently integrates parallel processing, stochastic sampling, and competitive optimization to create a scalable variant designed for big data applications. It addresses scalability and computation time challenges typically faced with traditional techniques. The algorithm adjusts sample sizes dynamically for each worker during execution, optimizing performance. Data from these sample sizes are continually analyzed, facilitating the identification of the most efficient configuration. By incorporating a competitive element among workers using different sample sizes, efficiency within the Big-means algorithm is further stimulated. In essence, the algorithm balances computational time and clustering quality by employing a stochastic, competitive sampling strategy in a parallel computing setting.
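The competitive, stochastic-sampling scheme can be sketched sequentially (real workers would run in parallel, and plain Lloyd's algorithm below is a simplified stand-in for the full Big-means pipeline; names and constants are illustrative):

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lloyd(points, k, iters=10, rng=None):
    """Plain Lloyd's k-means on a (sub)sample of the data."""
    rng = rng or random.Random(0)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda c: dist2(p, centers[c]))].append(p)
        centers = [
            tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers

def inertia(points, centers):
    return sum(min(dist2(p, c) for c in centers) for p in points)

def competitive_big_means(points, k, sample_sizes, rounds=3, seed=0):
    """Each 'worker' clusters a random sample of its own size; the centers
    achieving the lowest full-data inertia win across all rounds."""
    rng = random.Random(seed)
    best_score, best_centers = float("inf"), None
    for _ in range(rounds):
        for size in sample_sizes:
            sample = rng.sample(points, min(size, len(points)))
            centers = lloyd(sample, k, rng=rng)
            score = inertia(points, centers)
            if score < best_score:
                best_score, best_centers = score, centers
    return best_centers

# two well-separated clusters; workers use different sample sizes
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers = competitive_big_means(pts, 2, sample_sizes=[4, 6])
```

Tracking which sample size keeps winning is what lets the algorithm adjust worker sample sizes toward the most efficient configuration.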
https://arxiv.org/abs/2403.18766