PyPose is an open-source library for robot learning. It combines a learning-based approach with physics-based optimization, which enables seamless end-to-end robot learning. It has been used in many tasks due to its meticulously designed application programming interface (API) and efficient implementation. Since its initial launch in early 2022, PyPose has been significantly enhanced and now incorporates a wide variety of new features. To satisfy the growing demand for understanding and using the library, and to reduce the learning curve for new users, we present the fundamental design principle of the imperative programming interface and showcase the flexible usage of diverse functionalities and modules using an extremely simple Dubins car example. We also demonstrate that PyPose can easily be used to navigate a real quadruped robot with a few lines of code.
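For readers unfamiliar with the Dubins car mentioned above, the following is a minimal sketch of its kinematics in plain Python. It does not use or imitate the PyPose API; the function names `dubins_step` and `rollout`, the explicit-Euler integrator, and the `(x, y, theta)` state layout are all assumptions made for illustration.

```python
import math


def dubins_step(state, v, omega, dt):
    """Advance a Dubins car by one explicit-Euler step.

    state is (x, y, theta); v is the forward speed and omega the turn rate.
    These names are illustrative and are not taken from PyPose.
    """
    x, y, theta = state
    x_next = x + v * math.cos(theta) * dt
    y_next = y + v * math.sin(theta) * dt
    theta_next = theta + omega * dt
    return (x_next, y_next, theta_next)


def rollout(state, controls, dt):
    """Apply a sequence of (v, omega) controls and return every visited state."""
    trajectory = [state]
    for v, omega in controls:
        state = dubins_step(state, v, omega, dt)
        trajectory.append(state)
    return trajectory
```

Calling `rollout((0.0, 0.0, 0.0), [(1.0, 0.0), (1.0, 0.5)], dt=0.1)` returns the start state followed by one state per control input.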
The rapid development of AR/VR brings tremendous demands for 3D content. While the widely-used Computer-Aided Design (CAD) method requires a time-consuming and labor-intensive modeling process, sketch-based 3D modeling offers a potential solution as a natural form of computer-human interaction. However, the sparsity and ambiguity of sketches make it challenging to generate high-fidelity content reflecting creators' ideas. Precise drawing from multiple views or strategic step-by-step drawings is often required to tackle the challenge but is not friendly to novice users. In this work, we introduce a novel end-to-end approach, Deep3DSketch+, which performs 3D modeling using only a single free-hand sketch without inputting multiple sketches or view information. Specifically, we introduce a lightweight generation network for efficient inference in real-time and a structural-aware adversarial training approach with a Stroke Enhancement Module (SEM) to capture the structural information to facilitate learning of the realistic and fine-detailed shape structures for high-fidelity performance. Extensive experiments demonstrated the effectiveness of our approach with the state-of-the-art (SOTA) performance on both synthetic and real datasets.
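The rapid development of augmented reality (AR) and virtual reality (VR) has sharply increased the demand for 3D content. The widely used Computer-Aided Design (CAD) approach requires a time-consuming and labor-intensive modeling process, whereas sketch-based 3D modeling, as a natural form of human-computer interaction, offers a potential solution. However, the sparsity and ambiguity of sketches make it very difficult to generate high-fidelity content; precise multi-view drawings or strategic step-by-step drawing are usually required, which does not suit novice users. In this work, we introduce a novel end-to-end approach, Deep3DSketch+, which performs 3D modeling from a single free-hand sketch without requiring multiple sketches or view information. Specifically, we introduce a lightweight generation network for efficient real-time inference, together with a structure-aware adversarial training approach and a Stroke Enhancement Module (SEM) that captures structural information, making it easier to learn realistic and fine-detailed shape structures for high-fidelity performance. Extensive experiments demonstrate that our approach achieves state-of-the-art (SOTA) performance on both synthetic and real datasets.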
During the COVID-19 pandemic, medical imaging techniques like computed tomography (CT) scans have demonstrated effectiveness in combating the rapid spread of the virus. Therefore, it is crucial to conduct research on computerized models for the detection of COVID-19 using CT imaging. A novel processing method has been developed, utilizing radiomic features, to assist in the CT-based diagnosis of COVID-19. Given the lower specificity of traditional features in distinguishing between different causes of pulmonary diseases, the objective of this study is to develop a CT-based radiomics framework for the differentiation of COVID-19 from other lung diseases. The model is designed to focus on outlining COVID-19 lesions, as traditional features often lack specificity in this aspect. The model categorizes images into three classes: COVID-19, non-COVID-19, or normal. It employs enhancement auto-segmentation principles using intensity dark channel prior (IDCP) and deep neural networks (ALS-IDCP-DNN) within a defined range of analysis thresholds. A publicly available dataset comprising COVID-19, normal, and non-COVID-19 classes was utilized to validate the proposed model's effectiveness. The best performing classification model, Residual Neural Network with 50 layers (Resnet-50), attained an average accuracy, precision, recall, and F1-score of 98.8%, 99%, 98%, and 98% respectively. These results demonstrate the capability of our model to accurately classify COVID-19 images, which could aid radiologists in diagnosing suspected COVID-19 patients. Furthermore, our model's performance surpasses that of more than 10 current state-of-the-art studies conducted on the same dataset.
The ICASSP 2023 Acoustic Echo Cancellation Challenge is intended to stimulate research in acoustic echo cancellation (AEC), which is an important area of speech enhancement and is still a top issue in audio communication. This is the fourth AEC challenge and it is enhanced by adding a second track for personalized acoustic echo cancellation, reducing the algorithmic + buffering latency to 20ms, as well as including a full-band version of AECMOS. We open source two large datasets to train AEC models under both single talk and double talk scenarios. These datasets consist of recordings from more than 10,000 real audio devices and human speakers in real environments, as well as a synthetic dataset. We open source an online subjective test framework and provide an objective metric for researchers to quickly test their results. The winners of this challenge were selected based on the average mean opinion score (MOS) achieved across all scenarios and the word accuracy (WAcc) rate.
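The ICASSP 2023 Acoustic Echo Cancellation Challenge aims to stimulate research in acoustic echo cancellation (AEC), an important area of speech enhancement that remains a major problem in audio communication. This is the fourth AEC challenge; it is enhanced by adding a second track for personalized acoustic echo cancellation, reducing the algorithmic plus buffering latency to 20ms, and including a full-band version of AECMOS. We open source two large datasets for training AEC models in both single-talk and double-talk scenarios. These datasets include recordings from more than 10,000 real audio devices and human speakers in real environments, as well as a synthetic dataset. We open source an online subjective test framework and provide an objective metric so that researchers can quickly test their results. The winners of the challenge were selected based on the average mean opinion score (MOS) achieved across all scenarios and the word accuracy (WAcc) rate.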
The objective of this work is the effective extraction of spatial and dynamic features for Continuous Sign Language Recognition (CSLR). To accomplish this, we utilise a two-pathway SlowFast network, where each pathway operates at distinct temporal resolutions to separately capture spatial (hand shapes, facial expressions) and dynamic (movements) information. In addition, we introduce two distinct feature fusion methods, carefully designed for the characteristics of CSLR: (1) Bi-directional Feature Fusion (BFF), which facilitates the transfer of dynamic semantics into spatial semantics and vice versa; and (2) Pathway Feature Enhancement (PFE), which enriches dynamic and spatial representations through auxiliary subnetworks, while avoiding the need for extra inference time. As a result, our model further strengthens spatial and dynamic representations in parallel. We demonstrate that the proposed framework outperforms the current state-of-the-art performance on popular CSLR datasets, including PHOENIX14, PHOENIX14-T, and CSL-Daily.
Neural network approaches to single-channel speech enhancement have received much recent attention. In particular, mask-based architectures have achieved significant performance improvements over conventional methods. This paper proposes a multiscale autoencoder (MSAE) for mask-based end-to-end neural network speech enhancement. The MSAE performs spectral decomposition of an input waveform within separate band-limited branches, each operating with a different rate and scale, to extract a sequence of multiscale embeddings. The proposed framework features intuitive parameterization of the autoencoder, including a flexible spectral band design based on the Constant-Q transform. Additionally, the MSAE is constructed entirely of differentiable operators, allowing it to be implemented within an end-to-end neural network, and be discriminatively trained. The MSAE draws motivation both from recent multiscale network topologies and from traditional multiresolution transforms in speech processing. Experimental results show the MSAE to provide clear performance benefits relative to conventional single-branch autoencoders. Additionally, the proposed framework is shown to outperform a variety of state-of-the-art enhancement systems, both in terms of objective speech quality metrics and automatic speech recognition accuracy.
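Neural network approaches to single-channel speech enhancement have recently received wide attention. In particular, mask-based architectures have achieved significant performance improvements over conventional methods. This paper proposes a multiscale autoencoder (MSAE) for mask-based end-to-end neural network speech enhancement. The MSAE performs spectral decomposition of the input waveform within separate band-limited branches, each operating at a different rate and scale, to extract a sequence of multiscale embeddings. The proposed framework features an intuitive parameterization of the autoencoder, including a flexible spectral band design based on the Constant-Q transform. In addition, the MSAE is built entirely from differentiable operators, so it can be implemented inside an end-to-end neural network and trained discriminatively. The MSAE draws motivation from recent multiscale network topologies and from traditional multiresolution transforms in speech processing. Experimental results show that the MSAE provides clear performance benefits over conventional single-branch autoencoders. Furthermore, the proposed framework outperforms a variety of state-of-the-art enhancement systems in terms of both objective speech quality metrics and automatic speech recognition accuracy.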
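The MSAE abstract mentions a spectral band design based on the Constant-Q transform. As a rough illustration of what "constant-Q" spacing means, the sketch below computes geometrically spaced centre frequencies whose bandwidth grows in proportion to frequency. It uses the standard Constant-Q relations only; the function name `constant_q_bands` and its parameters are assumptions, and it is not the band layout used by the MSAE authors.

```python
def constant_q_bands(f_min, bins_per_octave, n_bands):
    """Return (Q, [(center_hz, bandwidth_hz), ...]) for constant-Q spacing.

    Centre frequencies follow f_k = f_min * 2**(k / bins_per_octave), and
    every band shares the same quality factor Q = f_k / bandwidth_k.
    """
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    bands = []
    for k in range(n_bands):
        center = f_min * 2.0 ** (k / bins_per_octave)
        bands.append((center, center / q))
    return q, bands
```

With one bin per octave, `constant_q_bands(100.0, 1, 3)` places the centres one octave apart, and each band is as wide as its own centre frequency because Q equals 1 in that case.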
To integrate action recognition methods into autonomous robotic systems, it is crucial to consider adverse situations involving target occlusions. Such a scenario, despite its practical relevance, is rarely addressed in existing self-supervised skeleton-based action recognition methods. To empower robots with the capacity to address occlusion, we propose a simple and effective method. We first pre-train using occluded skeleton sequences, then use k-means clustering (KMeans) on sequence embeddings to group semantically similar samples. Next, we employ K-nearest-neighbor (KNN) to fill in missing skeleton data based on the closest sample neighbors. Imputing incomplete skeleton sequences to create relatively complete sequences as input provides significant benefits to existing skeleton-based self-supervised models. Meanwhile, building on the state-of-the-art Partial Spatio-Temporal Learning (PSTL), we introduce an Occluded Partial Spatio-Temporal Learning (OPSTL) framework. This enhancement utilizes Adaptive Spatial Masking (ASM) for better use of high-quality, intact skeletons. The effectiveness of our imputation methods is verified on the challenging occluded versions of the NTURGB+D 60 and NTURGB+D 120. The source code will be made publicly available at this https URL.
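To integrate action recognition methods into autonomous robotic systems, adverse situations involving target occlusion must be considered. Despite its practical relevance, this scenario is rarely addressed by existing self-supervised skeleton-based action recognition methods. To give robots the ability to handle occlusion, we propose a simple and effective method. We first pre-train on occluded skeleton sequences, then apply k-means clustering (KMeans) to the sequence embeddings to group semantically similar samples. Next, we use K-nearest-neighbor (KNN) to fill in missing skeleton data from the closest sample neighbors. Imputing incomplete skeleton sequences so that relatively complete sequences serve as input brings significant benefits to existing skeleton-based self-supervised models. Meanwhile, building on the state-of-the-art Partial Spatio-Temporal Learning (PSTL), we introduce an Occluded Partial Spatio-Temporal Learning (OPSTL) framework. This enhancement uses Adaptive Spatial Masking (ASM) to make better use of high-quality, intact skeletons. The effectiveness of our imputation methods is verified on the challenging occluded versions of NTURGB+D 60 and NTURGB+D 120. The source code will be made publicly available at this https URL.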
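The imputation step described above (cluster the embeddings, then fill gaps from the nearest neighbours) can be sketched as follows. This is a simplified illustration under several assumptions: cluster labels are taken as given (for example from a k-means run), each sequence is flattened to a list of numbers with `None` marking an occluded value, and gaps are filled with the mean of the neighbours' observed values. The function name `knn_impute` is hypothetical and the code is not the authors' implementation.

```python
import math


def knn_impute(embeddings, sequences, labels, k=2):
    """Fill None entries of each sequence from its k nearest same-cluster neighbours.

    embeddings: one feature vector per sample (used only to measure distance).
    sequences:  one flat list of values per sample; None marks a missing value.
    labels:     cluster id per sample, assumed to come from a prior k-means step.
    """
    filled = []
    for i, seq in enumerate(sequences):
        # Candidate neighbours are the other samples in the same cluster,
        # ordered by Euclidean distance in embedding space.
        candidates = [j for j in range(len(sequences))
                      if j != i and labels[j] == labels[i]]
        candidates.sort(key=lambda j: math.dist(embeddings[i], embeddings[j]))
        neighbours = candidates[:k]

        new_seq = []
        for t, value in enumerate(seq):
            if value is not None:
                new_seq.append(value)
                continue
            # Average the observed values at the same position; stay None
            # if no neighbour observed this position either.
            known = [sequences[j][t] for j in neighbours
                     if sequences[j][t] is not None]
            new_seq.append(sum(known) / len(known) if known else None)
        filled.append(new_seq)
    return filled
```

Restricting the neighbour search to one cluster keeps the fill-in values semantically close to the occluded sample, which is the motivation the abstract gives for clustering first.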
In recent years, significant progress has been made in video instance segmentation (VIS), with many offline and online methods achieving state-of-the-art performance. While offline methods have the advantage of producing temporally consistent predictions, they are not suitable for real-time scenarios. Conversely, online methods are more practical, but maintaining temporal consistency remains a challenging task. In this paper, we propose a novel online method for video instance segmentation, called TCOVIS, which fully exploits the temporal information in a video clip. The core of our method consists of a global instance assignment strategy and a spatio-temporal enhancement module, which improve the temporal consistency of the features from two aspects. Specifically, we perform global optimal matching between the predictions and ground truth across the whole video clip, and supervise the model with the global optimal objective. We also capture the spatial feature and aggregate it with the semantic feature between frames, thus realizing the spatio-temporal enhancement. We evaluate our method on four widely adopted VIS benchmarks, namely YouTube-VIS 2019/2021/2022 and OVIS, and achieve state-of-the-art performance on all benchmarks without bells-and-whistles. For instance, on YouTube-VIS 2021, TCOVIS achieves 49.5 AP and 61.3 AP with ResNet-50 and Swin-L backbones, respectively. Code is available at this https URL.
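In recent years, video instance segmentation (VIS) has made great progress, with many offline and online methods achieving state-of-the-art performance. Offline methods have the advantage of producing temporally consistent predictions, but they are not suitable for real-time scenarios. Online methods, by contrast, are more practical, but maintaining temporal consistency remains a challenging task. In this paper, we propose a novel online method for video instance segmentation, called TCOVIS, which fully exploits the temporal information in a video clip. The core of our method consists of a global instance assignment strategy and a spatio-temporal enhancement module, which improve the temporal consistency of the features from two aspects. Specifically, we perform globally optimal matching between the predictions and the ground truth across the whole video clip and supervise the model with the globally optimal objective. We also capture spatial features and aggregate them with semantic features between frames, thereby realizing spatio-temporal enhancement. We evaluate our method on four widely adopted VIS benchmarks, namely YouTube-VIS 2019/2021/2022 and OVIS, and achieve state-of-the-art performance on all of them without bells and whistles. For instance, on YouTube-VIS 2021, TCOVIS achieves 49.5 AP and 61.3 AP with ResNet-50 and Swin-L backbones, respectively. Code is available at this https URL.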
Wireless communications at high-frequency bands with large antenna arrays face challenges in beam management, which can potentially be improved by multimodality sensing information from cameras, LiDAR, radar, and GPS. In this paper, we present a multimodal transformer deep learning framework for sensing-assisted beam prediction. We employ a convolutional neural network to extract the features from a sequence of images, point clouds, and radar raw data sampled over time. At each convolutional layer, we use transformer encoders to learn the hidden relations between feature tokens from different modalities and time instances over abstraction space and produce encoded vectors for the next-level feature extraction. We train the model on a combination of different modalities with supervised learning. We try to enhance the model over imbalanced data by utilizing focal loss and exponential moving average. We also evaluate data processing and augmentation techniques such as image enhancement, segmentation, background filtering, multimodal data flipping, radar signal transformation, and GPS angle calibration. Experimental results show that our solution trained on image and GPS data produces the best distance-based accuracy of predicted beams at 78.44%, with effective generalization to unseen day scenarios near 73% and night scenarios over 84%. This outperforms using other modalities and arbitrary data processing techniques, which demonstrates the effectiveness of transformers with feature fusion in performing radio beam prediction from images and GPS. Furthermore, our solution could be pretrained on large sequences of multimodality wireless data and then fine-tuned for multiple downstream radio network tasks.
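The abstract above names focal loss and exponential moving average (EMA) as its tools for imbalanced data. The sketch below shows the textbook form of both in plain Python, for a single sample and a flat list of weights. The function names, the default `gamma` and `decay` values, and the omission of a class-weighting term are assumptions for illustration; the authors' actual settings are not stated in the abstract.

```python
import math


def focal_loss(probs, target, gamma=2.0):
    """Focal loss for one sample: -(1 - p_t)**gamma * log(p_t).

    probs is the predicted class distribution and target the true class index.
    With gamma = 0 this reduces to ordinary cross-entropy.
    """
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def ema_update(ema_weights, weights, decay=0.999):
    """Blend the current weights into a running exponential moving average."""
    return [decay * e + (1.0 - decay) * w
            for e, w in zip(ema_weights, weights)]
```

The `(1 - p_t)**gamma` factor shrinks the loss of samples the model already classifies confidently, so rare or hard classes contribute relatively more to the gradient; the EMA copy of the weights is typically the one used at evaluation time.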
Extreme head postures pose a common challenge across a spectrum of facial analysis tasks, including face detection, facial landmark detection (FLD), and head pose estimation (HPE). These tasks are interdependent: accurate FLD relies on robust face detection, and HPE is closely tied to the detected landmarks. This paper focuses on the integration of these tasks, particularly when addressing the complexities posed by large-angle face poses. The primary contribution of this study is a real-time multi-task detection system capable of jointly detecting faces, facial landmarks, and head poses. The system builds upon the widely adopted YOLOv8 detection framework. It extends the original object detection head with an additional landmark regression head, enabling efficient localization of crucial facial landmarks. Furthermore, we optimize and enhance various modules within the original YOLOv8 framework. To validate the effectiveness and real-time performance of our proposed model, we conduct extensive experiments on the 300W-LP and AFLW2000-3D datasets. The results verify the capability of our model to tackle large-angle face pose challenges while delivering real-time performance across these interconnected tasks.
We present Clinical Prediction with Large Language Models (CPLLM), a method that involves fine-tuning a pre-trained Large Language Model (LLM) for clinical disease prediction. We utilized quantization and fine-tuned the LLM using prompts, with the task of predicting whether patients will be diagnosed with a target disease during their next visit or in the subsequent diagnosis, leveraging their historical diagnosis records. We compared our results versus various baselines, including Logistic Regression, RETAIN, and Med-BERT, which is the current state-of-the-art model for disease prediction using structured EHR data. Our experiments have shown that CPLLM surpasses all the tested models in terms of both PR-AUC and ROC-AUC metrics, displaying noteworthy enhancements compared to the baseline models.
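We present Clinical Prediction with Large Language Models (CPLLM), a method that fine-tunes a pre-trained Large Language Model (LLM) for clinical disease prediction. We use quantization and fine-tune the LLM with prompts; the task is to predict, from patients' historical diagnosis records, whether they will be diagnosed with a target disease during their next visit or in a subsequent diagnosis. We compare our results with various baselines, including Logistic Regression, RETAIN, and Med-BERT, the current state-of-the-art model for disease prediction using structured EHR data. Our experiments show that CPLLM surpasses all tested models on both the PR-AUC and ROC-AUC metrics, with notable improvements over the baseline models.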
We consider speech enhancement for signals picked up in one noisy environment that must be rendered to a listener in another noisy environment. For both far-end noise reduction and near-end listening enhancement, it has been shown that excessive focus on noise suppression or intelligibility maximization may lead to excessive speech distortions and quality degradations in favorable noise conditions, where intelligibility is already at ceiling level. Recently, [1,2] proposed to remedy this with a minimum processing framework that reduces noise or enhances listening by only the minimum amount needed to satisfy a given intelligibility criterion. Additionally, it has been shown that joint consideration of both environments improves speech enhancement performance. In this paper, we formulate a joint far- and near-end minimum processing framework that improves intelligibility while limiting speech distortions in favorable noise conditions. We provide closed-form solutions to specific boundary scenarios and investigate performance for the general case using numerical optimization. We also show that concatenating existing minimum processing far- and near-end enhancement methods preserves the effects of the initial methods. Results show that the joint optimization can further improve performance compared to the concatenated approach.
Unsupervised sentence representation learning aims to transform input sentences into fixed-length vectors enriched with intricate semantic information while obviating the reliance on labeled data. Recent progress within this field, propelled by contrastive learning and prompt engineering, has significantly bridged the gap between unsupervised and supervised strategies. Nonetheless, the potential of Chain-of-Thought remains largely untapped in this line of work. To unlock latent capabilities within pre-trained models, such as BERT, we propose a two-stage approach for sentence representation: comprehension and summarization. Subsequently, the output of the latter phase is harnessed as the vectorized representation of the input sentence. For further performance enhancement, we meticulously refine both the contrastive learning loss function and the template denoising technique for prompt engineering. Rigorous experimentation substantiates that our method, CoT-BERT, surpasses a suite of robust baselines without requiring other text representation models or external databases.
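Unsupervised sentence representation learning aims to convert input sentences into fixed-length vectors that carry rich semantic information while avoiding reliance on labeled data. Recent progress in this field, driven by contrastive learning and prompt engineering, has significantly narrowed the gap between unsupervised and supervised strategies. However, the potential of Chain-of-Thought remains largely untapped along this path. To unlock the latent capabilities of pre-trained models such as BERT, we propose a two-stage approach to sentence representation: comprehension and summarization. The output of the latter stage is then used as the vector representation of the input sentence. To further improve performance, we carefully refine both the contrastive learning loss function and the template denoising technique for prompt engineering. Rigorous experiments support our method, CoT-BERT, which surpasses a set of robust baselines without requiring other text representation models or external databases.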
Recent studies have increasingly acknowledged the advantages of incorporating visual data into speech enhancement (SE) systems. In this paper, we introduce a novel audio-visual SE approach, termed DCUC-Net (deep complex U-Net with conformer network). The proposed DCUC-Net leverages complex domain features and a stack of conformer blocks. The encoder and decoder of DCUC-Net are designed using a complex U-Net-based framework. The audio and visual signals are processed using a complex encoder and a ResNet-18 model, respectively. These processed signals are then fused using the conformer blocks and transformed into enhanced speech waveforms via a complex decoder. The conformer blocks consist of a combination of self-attention mechanisms and convolutional operations, enabling DCUC-Net to effectively capture both global and local audio-visual dependencies. Our experimental results demonstrate the effectiveness of DCUC-Net, as it outperforms the baseline model from the COG-MHEAR AVSE Challenge 2023 by a notable margin of 0.14 in terms of PESQ. Additionally, the proposed DCUC-Net performs comparably to a state-of-the-art model and outperforms all other compared models on the Taiwan Mandarin speech with video (TMSV) dataset.
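Recent studies have shown that incorporating visual data into speech enhancement (SE) systems offers considerable advantages. In this paper, we introduce a novel audio-visual SE approach called DCUC-Net (deep complex U-Net with conformer network), which leverages complex-domain features and a stack of conformer blocks. The encoder and decoder of DCUC-Net are designed using a complex U-Net-based framework. The audio and visual signals are processed by a complex encoder and a ResNet-18 model, respectively. The processed signals are then fused by the conformer blocks and transformed into enhanced speech waveforms by a complex decoder. The conformer blocks combine self-attention mechanisms with convolutional operations, enabling DCUC-Net to capture both global and local audio-visual dependencies effectively. Our experimental results demonstrate the effectiveness of DCUC-Net: it outperforms the baseline model from the COG-MHEAR AVSE Challenge 2023 by 0.14 in terms of PESQ. In addition, the proposed DCUC-Net performs comparably to a state-of-the-art model and outperforms all other compared models on the Taiwan Mandarin speech with video (TMSV) dataset.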
Visual Odometry (VO) plays a pivotal role in autonomous systems, with a principal challenge being the lack of depth information in camera images. This paper introduces OCC-VO, a novel framework that capitalizes on recent advances in deep learning to transform 2D camera images into 3D semantic occupancy, thereby circumventing the traditional need for concurrent estimation of ego poses and landmark locations. Within this framework, we utilize the TPV-Former to convert surround view cameras' images into 3D semantic occupancy. Addressing the challenges presented by this transformation, we have specifically tailored a pose estimation and mapping algorithm that incorporates Semantic Label Filter, Dynamic Object Filter, and finally, utilizes Voxel PFilter for maintaining a consistent global semantic map. Evaluations on the Occ3D-nuScenes not only showcase a 20.6% improvement in Success Ratio and a 29.6% enhancement in trajectory accuracy against ORB-SLAM3, but also emphasize our ability to construct a comprehensive map. Our implementation is open-sourced and available at: this https URL.
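Visual Odometry (VO) plays a key role in autonomous systems, where a principal challenge is the lack of depth information in camera images. This paper introduces OCC-VO, a novel framework that uses recent advances in deep learning to transform 2D camera images into 3D semantic occupancy, thereby circumventing the traditional need to estimate ego poses and landmark locations concurrently. Within this framework, we use TPV-Former to convert surround-view camera images into 3D semantic occupancy. To address the challenges raised by this transformation, we specifically tailor a pose estimation and mapping algorithm that incorporates a Semantic Label Filter and a Dynamic Object Filter, and finally uses a Voxel PFilter to maintain a consistent global semantic map. Evaluations on Occ3D-nuScenes not only show a 20.6% improvement in Success Ratio and a 29.6% improvement in trajectory accuracy over ORB-SLAM3, but also highlight our ability to construct a comprehensive map. Our implementation is open-sourced and available at this https URL.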
Electroaerodynamic (EAD) propulsion, where thrust is produced by collisions between electrostatically-accelerated ions and neutral air, is a potentially transformative method for indoor flight owing to its silent and solid-state nature. Like rotors, EAD thrusters exhibit changes in performance based on proximity to surfaces. Unlike rotors, they have no fragile and quickly spinning parts that have to avoid those surfaces; taking advantage of the efficiency benefits from proximity effects may be a route towards longer-duration indoor operation of ion-propelled fliers. This work presents the first empirical study of ground proximity effects for EAD propulsors, both individually and as quad-thruster arrays. It focuses on multi-stage ducted centimeter-scale actuators suitable for use on small robots envisioned for deployment in human-proximal and indoor environments. Three specific effects (ground, suckdown, and fountain lift), each occurring with a different magnitude at a different spacing from the ground plane, are investigated and shown to have strong dependencies on geometric parameters including thruster-to-thruster spacing, thruster protrusion from the fuselage, and inclusion of flanges or strakes. Peak thrust enhancement ranging from 300 to 600% is found for certain configurations operated in close proximity (0.2 mm) to the ground plane and as much as a 20% increase is measured even when operated centimeters away.
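Electroaerodynamic (EAD) propulsion, in which thrust is produced by collisions between electrostatically accelerated ions and neutral air, is a potentially transformative method for indoor flight owing to its silent and solid-state nature. Like rotors, EAD thrusters change in performance depending on their proximity to surfaces. Unlike rotors, they have no fragile, rapidly spinning parts that must avoid those surfaces; exploiting the efficiency benefits of proximity effects may therefore be a route to longer-duration indoor operation of ion-propelled fliers. This work presents the first empirical study of ground proximity effects for EAD propulsors, both individually and as quad-thruster arrays. It focuses on multi-stage ducted centimeter-scale actuators suitable for small robots intended for deployment in human-proximal and indoor environments. Three specific effects (ground, suckdown, and fountain lift), each occurring with a different magnitude at a different spacing from the ground plane, are investigated and shown to depend strongly on geometric parameters, including thruster-to-thruster spacing, thruster protrusion from the fuselage, and the inclusion of flanges or strakes. Peak thrust enhancement of 300 to 600% is found for certain configurations operated in close proximity (0.2 mm) to the ground plane, and an increase of as much as 20% is measured even when operating centimeters away.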
Large language models (LLMs) with billions of parameters have demonstrated outstanding performance on various natural language processing tasks. This report presents OpenBA, an open-sourced 15B bilingual asymmetric seq2seq model, to contribute an LLM variant to the Chinese-oriented open-source model community. We enhance OpenBA with effective and efficient techniques and adopt a three-stage training strategy to train the model from scratch. Our solution achieves very competitive performance with only 380B tokens: it is better than LLaMA-70B on the BELEBELE benchmark, BLOOM-176B on the MMLU benchmark, and GLM-130B on the C-Eval (hard) benchmark. This report provides the main details needed to pre-train an analogous model, including pre-training data processing, Bilingual Flan data collection, the empirical observations that inspire our model architecture design, training objectives of different stages, and other enhancement techniques. We have refactored our code to follow the design principles of the Huggingface Transformers Library, making it more convenient for developers to use, and released checkpoints of different training stages at this https URL. More details of our project are available at this https URL.
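Large language models (LLMs) with billions of parameters have shown outstanding performance on a wide range of natural language processing tasks. This report introduces OpenBA, an open-source 15B bilingual asymmetric seq2seq model, to contribute an LLM variant to the Chinese-oriented open-source model community. We enhance OpenBA with effective and efficient techniques and adopt a three-stage training strategy to train the model from scratch. With only 380B tokens, our solution achieves very competitive performance: it is better than LLaMA-70B on the BELEBELE benchmark, BLOOM-176B on the MMLU benchmark, and GLM-130B on the C-Eval (hard) benchmark. This report provides the main details for pre-training a similar model, including pre-training data processing, Bilingual Flan data collection, the empirical observations that inspired our model architecture design, the training objectives of the different stages, and other enhancement techniques. We have refactored our code to follow the design principles of the Huggingface Transformers Library, making it easier for developers to use, and released checkpoints from different training stages at this https URL. More information about our project is available at this https URL.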
In this work, we explore the influence of entropy change in deep learning systems by adding noise to the inputs/latent features. The applications in this paper focus on deep learning tasks within computer vision, but the proposed theory can be further applied to other fields. Noise is conventionally viewed as a harmful perturbation in various deep learning architectures, such as convolutional neural networks (CNNs) and vision transformers (ViTs), as well as different learning tasks like image classification and transfer learning. However, this paper aims to rethink whether the conventional proposition always holds. We demonstrate that specific noise can boost the performance of various deep architectures under certain conditions. We theoretically prove the enhancement gained from positive noise by reducing the task complexity defined by information entropy and experimentally show the significant performance gain in large image datasets, such as the ImageNet. Herein, we use the information entropy to define the complexity of the task. We categorize the noise into two types, positive noise (PN) and harmful noise (HN), based on whether the noise can help reduce the complexity of the task. Extensive experiments of CNNs and ViTs have shown performance improvements by proactively injecting positive noise, where we achieved an unprecedented top 1 accuracy of over 95% on ImageNet. Both theoretical analysis and empirical evidence have confirmed that the presence of positive noise can benefit the learning process, while the traditionally perceived harmful noise indeed impairs deep learning models. The different roles of noise offer new explanations for deep models on specific tasks and provide a new paradigm for improving model performance. Moreover, it reminds us that we can influence the performance of learning systems via information entropy change.
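The abstract above defines task complexity through information entropy. As a reminder of the underlying quantity, the sketch below computes the Shannon entropy of a discrete distribution in bits. It illustrates only the generic definition; the paper's exact complexity measure is not given in the abstract, and the function name `shannon_entropy` is an assumption.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy H = -sum(p * log2(p)) of a discrete distribution, in bits.

    Zero-probability outcomes contribute nothing and are skipped.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0.0)
```

A uniform distribution over two classes has entropy of one bit, while a distribution that puts all mass on one class has entropy zero, so a perturbation that sharpens the label distribution lowers this measure.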
Existing nighttime unmanned aerial vehicle (UAV) trackers follow an "Enhance-then-Track" architecture - first using a light enhancer to brighten the nighttime video, then employing a daytime tracker to locate the object. This separate enhancement and tracking fails to build an end-to-end trainable vision system. To address this, we propose a novel architecture called Darkness Clue-Prompted Tracking (DCPT) that achieves robust UAV tracking at night by efficiently learning to generate darkness clue prompts. Without a separate enhancer, DCPT directly encodes anti-dark capabilities into prompts using a darkness clue prompter (DCP). Specifically, DCP iteratively learns emphasizing and undermining projections for darkness clues. It then injects these learned visual prompts into a daytime tracker with fixed parameters across transformer layers. Moreover, a gated feature aggregation mechanism enables adaptive fusion between prompts and between prompts and the base model. Extensive experiments show state-of-the-art performance for DCPT on multiple dark scenario benchmarks. The unified end-to-end learning of enhancement and tracking in DCPT enables a more trainable system. The darkness clue prompting efficiently injects anti-dark knowledge without extra modules. Code and models will be released.
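Existing nighttime unmanned aerial vehicle (UAV) trackers follow an "Enhance-then-Track" architecture: they first use a light enhancer to brighten the nighttime video and then employ a daytime tracker to locate the object. This separation of enhancement and tracking fails to build an end-to-end trainable vision system. To address this, we propose a novel architecture called Darkness Clue-Prompted Tracking (DCPT), which achieves robust UAV tracking at night by efficiently learning to generate darkness clue prompts. Without a separate enhancer, DCPT directly encodes anti-dark capabilities into prompts using a darkness clue prompter (DCP). Specifically, DCP iteratively learns emphasizing and undermining projections for darkness clues. It then injects these learned visual prompts across the transformer layers of a daytime tracker whose parameters are fixed. Moreover, a gated feature aggregation mechanism enables adaptive fusion between prompts, and between prompts and the base model. Extensive experiments show that DCPT achieves state-of-the-art performance on multiple dark scenario benchmarks. The unified end-to-end learning of enhancement and tracking in DCPT yields a more trainable system, and darkness clue prompting injects anti-dark knowledge efficiently without extra modules. Code and models will be released.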
3D scene graphs offer a more efficient representation of the environment by hierarchically organizing diverse semantic entities and the topological relationships among them. Fiducial markers, on the other hand, offer a valuable mechanism for encoding comprehensive information pertaining to environments and the objects within them. In the context of Visual SLAM (VSLAM), especially when the reconstructed maps are enriched with practical semantic information, these markers have the potential to enhance the map by augmenting valuable semantic information and fostering meaningful connections among the semantic objects. In this regard, this paper exploits the potential of fiducial markers to incorporate a VSLAM framework with hierarchical representations that generates optimizable multi-layered vision-based situational graphs. The framework comprises a conventional VSLAM system with low-level feature tracking and mapping capabilities bolstered by the incorporation of a fiducial marker map. The fiducial markers aid in identifying walls and doors in the environment, subsequently establishing meaningful associations with high-level entities, including corridors and rooms. Experimental results are conducted on a real-world dataset collected using various legged robots and benchmarked against a Light Detection And Ranging (LiDAR)-based framework (S-Graphs) as the ground truth. Consequently, our framework not only excels in crafting a richer, multi-layered hierarchical map of the environment but also shows enhancement in robot pose accuracy when contrasted with state-of-the-art methodologies.
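3D scene graphs provide a more efficient representation of the environment by hierarchically organizing diverse semantic entities and the topological relationships among them. Fiducial markers, in turn, provide a valuable mechanism for encoding comprehensive information about environments and the objects within them. In the context of Visual SLAM (VSLAM), especially when the reconstructed maps are enriched with practical semantic information, these markers have the potential to enhance the map by adding valuable semantic information and fostering meaningful connections among semantic objects. In this regard, this paper exploits the potential of fiducial markers to build a VSLAM framework with hierarchical representations that generates optimizable multi-layered vision-based situational graphs. The framework comprises a conventional VSLAM system with low-level feature tracking and mapping capabilities, bolstered by the incorporation of a fiducial marker map. The fiducial markers help identify walls and doors in the environment, which are then associated with high-level entities such as corridors and rooms. Experiments are conducted on a real-world dataset collected with various legged robots and benchmarked against a Light Detection And Ranging (LiDAR)-based framework (S-Graphs) as the ground truth. Our framework not only builds a richer, multi-layered hierarchical map of the environment but also improves robot pose accuracy compared with state-of-the-art methods.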