The representations of neural networks are often compared to those of biological systems by performing regression between the neural network responses and those measured from biological systems. Many different state-of-the-art deep neural networks yield similar neural predictions, but it remains unclear how to differentiate among models that perform equally well at predicting neural responses. To gain insight into this, we use a recent theoretical framework that relates the generalization error from regression to the spectral bias of the model activations and the alignment of the neural responses onto the learnable subspace of the model. We extend this theory to the case of regression between model activations and neural responses, and define geometrical properties describing the error embedding geometry. We test a large number of deep neural networks that predict visual cortical activity and show that there are multiple types of geometries that result in low neural prediction error as measured via regression. The work demonstrates that carefully decomposing representational metrics can provide interpretability of how models are capturing neural activity and points the way towards improved models of neural activity.
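The representations of neural networks are often compared with those of biological systems by regressing neural network responses against responses measured from biological systems. Many state-of-the-art deep neural networks yield similar neural predictions, yet it remains unclear how to distinguish among models that predict neural responses equally well. To address this, we use a recent theoretical framework that relates the generalization error of regression to the spectral bias of the model activations and to the alignment of the neural responses with the model's learnable subspace. We extend this theory to regression between model activations and neural responses, and define geometric properties that describe the error embedding geometry. We test a large number of deep neural networks that predict visual cortical activity and show that several types of geometry lead to low neural prediction error as measured by regression. This work shows that carefully decomposing representational metrics can make it interpretable how models capture neural activity, and it points toward improved models of neural activity.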
Video analysis is a major computer vision task that has received a lot of attention in recent years. The current state-of-the-art performance for video analysis is achieved with Deep Neural Networks (DNNs) that have high computational costs and need large amounts of labeled data for training. Spiking Neural Networks (SNNs) have significantly lower computational costs (thousands of times lower) than regular non-spiking networks when implemented on neuromorphic hardware. They have been used for video analysis with methods like 3D Convolutional Spiking Neural Networks (3D CSNNs). However, these networks have a significantly larger number of parameters than spiking 2D CSNNs. This not only increases the computational costs but also makes these networks more difficult to implement on neuromorphic hardware. In this work, we use CSNNs trained in an unsupervised manner with the Spike Timing-Dependent Plasticity (STDP) rule, and we introduce, for the first time, Spiking Separated Spatial and Temporal Convolutions (S3TCs) to reduce the number of parameters required for video analysis. This unsupervised learning has the advantage of not needing large amounts of labeled data for training. Factorizing a single spatio-temporal spiking convolution into a spatial and a temporal spiking convolution decreases the number of parameters of the network. We test our network on the KTH, Weizmann, and IXMAS datasets, and we show that S3TCs successfully extract spatio-temporal information from videos while increasing the output spiking activity and outperforming spiking 3D convolutions.
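Video analysis is a computer vision task that has attracted considerable attention in recent years. The best current performance in video analysis is achieved by deep neural networks (DNNs), which are computationally expensive and require large amounts of labeled training data. When implemented on neuromorphic hardware, Spiking Neural Networks (SNNs) cost far less to run (by a factor of thousands) than conventional non-spiking networks. They have been applied to video analysis through methods such as 3D Convolutional Spiking Neural Networks (3D CSNNs). However, these networks have considerably more parameters than spiking 2D CSNNs, which both raises the computational cost and makes them harder to implement on neuromorphic hardware. In this work, we use CSNNs trained without supervision using the Spike Timing-Dependent Plasticity (STDP) rule, and we introduce, for the first time, Spiking Separated Spatial and Temporal Convolutions (S3TCs) to reduce the number of parameters needed for video analysis. The advantage of this unsupervised learning is that it does not need large amounts of labeled data. Factorizing a single spatio-temporal spiking convolution into one spatial and one temporal spiking convolution lowers the number of parameters in the network. We test our network on the KTH, Weizmann, and IXMAS datasets and show that S3TCs successfully extract spatio-temporal information from videos while increasing the output spiking activity and outperforming spiking 3D convolutions.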
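To make the parameter argument concrete, here is a minimal sketch that counts the weights of a full spatio-temporal kernel and of a spatial kernel followed by a temporal one. The layer sizes (16 input channels, 32 output channels, a 3x5x5 kernel) and the helper names are hypothetical and are not taken from the paper; biases are ignored.

```python
def conv3d_params(c_in, c_out, k_t, k_h, k_w):
    """Weights in one full spatio-temporal convolution (biases ignored)."""
    return c_in * c_out * k_t * k_h * k_w


def separated_params(c_in, c_mid, c_out, k_t, k_h, k_w):
    """Weights in a spatial (1 x k_h x k_w) convolution followed by a
    temporal (k_t x 1 x 1) convolution (biases ignored)."""
    spatial = c_in * c_mid * k_h * k_w
    temporal = c_mid * c_out * k_t
    return spatial + temporal


# Hypothetical layer sizes, chosen only for illustration.
full = conv3d_params(c_in=16, c_out=32, k_t=3, k_h=5, k_w=5)
split = separated_params(c_in=16, c_mid=32, c_out=32, k_t=3, k_h=5, k_w=5)
print(full, split)
```

With these assumed sizes the factorized pair needs fewer weights than the single 3D kernel. The saving depends on the channel counts: with very few input channels, the intermediate channels of the temporal convolution can outweigh it.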
Target-Speaker Voice Activity Detection (TS-VAD) utilizes a set of speaker profiles alongside an input audio signal to perform speaker diarization. While its superiority over conventional methods has been demonstrated, the method can suffer from errors in speaker profiles, as those profiles are typically obtained by running a traditional clustering-based diarization method over the input signal. This paper proposes an extension to TS-VAD, called Profile-Error-Tolerant TS-VAD (PET-TSVAD), which is robust to such speaker profile errors. This is achieved by employing transformer-based TS-VAD that can handle a variable number of speakers and further introducing a set of additional pseudo-speaker profiles to handle speakers undetected during the first pass diarization. During training, we use speaker profiles estimated by multiple different clustering algorithms to reduce the mismatch between the training and testing conditions regarding speaker profiles. Experimental results show that PET-TSVAD consistently outperforms the existing TS-VAD method on both the VoxConverse and DIHARD-I datasets.
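Target-Speaker Voice Activity Detection (TS-VAD) uses a set of speaker profiles together with the input audio signal to perform speaker diarization. Although its advantage over conventional methods has been demonstrated, the method can suffer from errors in the speaker profiles, because these profiles are usually obtained by running a traditional clustering-based diarization method on the input signal. This paper proposes an extension called Profile-Error-Tolerant TS-VAD (PET-TSVAD), which is robust to such speaker profile errors. This is achieved by using a transformer-based TS-VAD that can handle a variable number of speakers and by introducing a set of additional pseudo-speaker profiles to handle speakers who went undetected in the first-pass diarization. During training, we use speaker profiles estimated by several different clustering algorithms to reduce the mismatch between training and testing conditions. Experimental results show that PET-TSVAD consistently outperforms the existing TS-VAD method on the VoxConverse and DIHARD-I datasets.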
We present Large Language Model for Mixed Reality (LLMR), a framework for the real-time creation and modification of interactive Mixed Reality experiences using LLMs. LLMR leverages novel strategies to tackle difficult cases where ideal training data is scarce, or where the design goal requires the synthesis of internal dynamics, intuitive analysis, or advanced interactivity. Our framework relies on text interaction and the Unity game engine. By incorporating techniques for scene understanding, task planning, self-debugging, and memory management, LLMR outperforms the standard GPT-4 by 4x in average error rate. We demonstrate LLMR's cross-platform interoperability with several example worlds, and evaluate it on a variety of creation and modification tasks to show that it can produce and edit diverse objects, tools, and scenes. Finally, we conducted a usability study (N=11) with a diverse set of participants, which revealed that they had positive experiences with the system and would use it again.
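We present Large Language Model for Mixed Reality (LLMR), a framework for creating and modifying interactive Mixed Reality experiences in real time using LLMs. LLMR uses novel strategies to handle difficult cases in which ideal training data is scarce or the design goal requires synthesizing internal dynamics, intuitive analysis, or advanced interactivity. Our framework relies on text interaction and the Unity game engine. By bringing in techniques for scene understanding, task planning, self-debugging, and memory management, LLMR outperforms standard GPT-4 by 4x in average error rate. We demonstrate LLMR's cross-platform interoperability with several example worlds and evaluate it on a range of creation and modification tasks, showing that it can produce and edit diverse objects, tools, and scenes. Finally, we ran a usability study (N=11) with a diverse group, which showed that participants had positive experiences with the system and would use it again.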
Current speech anti-spoofing countermeasures (CMs) show excellent performance on specific datasets. However, removing the silence from test speech through Voice Activity Detection (VAD) can severely degrade performance. In this paper, the impact of silence on speech anti-spoofing is analyzed. First, the reasons for the impact are explored, including the proportion of silence duration and the content of silence. The proportion of silence duration in spoofed speech generated by text-to-speech (TTS) algorithms is lower than that in bonafide speech, and the content of the silence generated by different waveform generators differs from that of bonafide speech. Then the impact of silence on model prediction is explored. Even after retraining, spoofed speech generated by neural-network-based end-to-end TTS algorithms suffers a significant rise in error rates when the silence is removed. To demonstrate the reasons for the impact of silence on CMs, the attention distribution of a CM is visualized through class activation mapping (CAM). Furthermore, experiments that mask silence or non-silence demonstrate the significance of the proportion of silence duration for detecting TTS and the importance of silence content for detecting voice conversion (VC). Based on the experimental results, masking silence is also proposed as a way to improve the robustness of CMs against unknown spoofing attacks. Finally, attacks on anti-spoofing CMs through concatenating silence, and the mitigation of VAD and silence attacks through low-pass filtering, are introduced.
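Current speech anti-spoofing countermeasures (CMs) perform very well on specific datasets. However, removing silence from the test speech with Voice Activity Detection (VAD) can severely degrade performance. This paper analyzes the impact of silence on speech anti-spoofing. First, the causes of the impact are explored, including the proportion of silence duration and the content of the silence. Spoofed speech generated by text-to-speech (TTS) algorithms has a lower proportion of silence than bonafide speech, and the silence content produced by different waveform generators also differs from that of bonafide speech. Next, the impact of silence on model prediction is explored. Even after retraining, spoofed speech generated by neural-network-based end-to-end TTS algorithms shows a marked rise in error rates once silence is removed. To demonstrate the causes of this impact, the attention distribution of a CM is visualized with class activation mapping (CAM). In addition, experiments that mask silence or non-silence show the importance of the proportion of silence duration for detecting TTS and of silence content for detecting voice conversion (VC). Based on the experimental results, masking silence is proposed as a way to make CMs more robust against unknown spoofing attacks. Finally, attacks on anti-spoofing CMs by concatenating silence, and the mitigation of VAD and silence attacks through low-pass filtering, are introduced.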
This paper plans to develop an Equitable and Responsible AI framework with enabling techniques and algorithms for the Internet of Energy (IoE), in short, RAI4IoE. The energy sector is going through substantial changes fueled by two key drivers: building a zero-carbon energy sector and the digital transformation of the energy infrastructure. We expect the convergence of these two drivers to result in the IoE, where renewable distributed energy resources (DERs), such as electric cars, storage batteries, wind turbines and photovoltaics (PV), can be connected and integrated for reliable energy distribution by leveraging advanced 5G-6G networks and AI technology. This allows DER owners, as prosumers, to participate in the energy market and derive economic incentives. DERs are inherently asset-driven and face equity challenges (i.e., fairness, diversity and inclusion). Without equitable access, privileged individuals, groups and organizations can participate and benefit at the cost of disadvantaged groups. The real-time management of DER resources not only brings the equity problem into the IoE, it also collects highly sensitive location-, time- and activity-dependent data, which must be handled responsibly (e.g., with respect to privacy, security and safety) for AI-enhanced prediction, optimization and prioritization services and for the automated management of flexible resources. The vision of our project is to ensure equitable participation of community members and responsible use of their data in the IoE, so that it can reap the benefits of advances in AI to provide safe, reliable and sustainable energy services.
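This paper plans to develop an Equitable and Responsible AI framework, with enabling techniques and algorithms, for the Internet of Energy (IoE), or RAI4IoE for short. The energy sector is undergoing major change driven by two key factors: building a zero-carbon energy sector and the digital transformation of energy infrastructure. We expect these two drivers to converge and give rise to the IoE, in which renewable distributed energy resources (DERs) such as electric cars, storage batteries, wind turbines, and photovoltaics (PV) can be connected and integrated for reliable energy distribution using advanced 5G-6G networks and AI technology. This lets DER owners take part in the energy market as prosumers and profit from it. DERs are inherently asset-driven and face equity challenges (i.e., fairness, diversity, and inclusion). Without equitable access, privileged individuals, groups, and organizations can participate and benefit at the expense of disadvantaged groups. Real-time management of DER resources not only brings the equity problem into the IoE but also collects highly sensitive location-, time-, and activity-dependent data, which must be handled responsibly (e.g., with respect to privacy, security, and safety) for AI-enhanced prediction, optimization, and prioritization services and for automated management of flexible resources. The goal of our project is to ensure equitable participation of community members and responsible use of their data in the IoE, so that it can draw on advances in AI to provide safe, reliable, and sustainable energy services.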
Solar energetic particles are mainly protons and originate from the Sun during solar flares or coronal shock waves. Forecasting the Solar Energetic Protons (SEP) flux is critical for several operational sectors, such as communication and navigation systems, space exploration missions, and aviation, because the hazardous radiation may endanger the health of astronauts, aviation crew, and passengers, as well as the delicate electronic components of satellites, space stations, and ground power stations. Therefore, the prediction of the SEP flux is of high importance to our lives and may help mitigate the negative impacts of one of the serious transient space weather phenomena on the near-Earth space environment. Numerous SEP prediction models are being developed with a variety of approaches, such as empirical models, probabilistic models, physics-based models, and AI-based models. In this work, we use the bi-directional long short-term memory (BiLSTM) neural network architecture to train SEP forecasting models for 3 standard integral GOES channels (>10 MeV, >30 MeV, and >60 MeV) with 3 forecast windows (1, 2, and 3 days ahead), based on daily data obtained from the OMNIWeb database from 1976 to 2019. As the SEP variability is modulated by the solar cycle, we select input parameters that capture both the short-term fluctuations in solar activity (typically within a span of a few hours) and the long-term fluctuations (typically spanning several days). We take the F10.7 index, the sunspot number, the time series of the logarithm of the X-ray flux, the solar wind speed, and the average strength of the interplanetary magnetic field as input parameters to our model. The results are validated with an out-of-sample testing set and benchmarked against other types of models.
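Solar energetic particles are mainly protons and are produced by the Sun during solar flares or coronal shock waves. Forecasting the solar energetic proton (SEP) flux matters greatly to several operational sectors, such as communication and navigation systems, space exploration missions, and aviation, because this hazardous radiation can endanger the health of astronauts, aviation crew, and passengers, as well as the delicate electronic components of satellites, space stations, and ground power stations. Predicting the SEP flux is therefore very important to our lives and may help reduce the negative effects of one of the serious transient space weather phenomena on the near-Earth space environment. Many SEP prediction models are being developed with a variety of approaches, including empirical, probabilistic, physics-based, and AI-based models. In this work, we use the bi-directional long short-term memory (BiLSTM) neural network architecture to train SEP forecasting models for 3 standard integral GOES channels (>10 MeV, >30 MeV, and >60 MeV) and 3 forecast windows (1, 2, and 3 days ahead), based on daily data from the OMNIWeb database covering 1976 to 2019. Because SEP variability is modulated by the solar cycle, we choose input parameters that capture both short-term fluctuations in solar activity, typically within a few hours, and long-term fluctuations, typically over several days. We use the F10.7 index, the sunspot number, the time series of the logarithm of the X-ray flux, the solar wind speed, and the average strength of the interplanetary magnetic field as inputs to our model. The results are validated on an out-of-sample test set and benchmarked against other types of models.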
We present SkeleTR, a new framework for skeleton-based action recognition. In contrast to prior work, which focuses mainly on controlled environments, we target more general scenarios that typically involve a variable number of people and various forms of interaction between people. SkeleTR works with a two-stage paradigm. It first models the intra-person skeleton dynamics for each skeleton sequence with graph convolutions, and then uses stacked Transformer encoders to capture person interactions that are important for action recognition in general scenarios. To mitigate the negative impact of inaccurate skeleton associations, SkeleTR takes relatively short skeleton sequences as input and increases the number of sequences. As a unified solution, SkeleTR can be directly applied to multiple skeleton-based action tasks, including video-level action classification, instance-level action detection, and group-level activity recognition. It also enables transfer learning and joint training across different action tasks and datasets, which results in performance improvements. When evaluated on various skeleton-based action recognition benchmarks, SkeleTR achieves state-of-the-art performance.
In recent years, the explosion of web videos makes text-video retrieval increasingly essential and popular for video filtering, recommendation, and search. Text-video retrieval aims to rank relevant text/video higher than irrelevant ones. The core of this task is to precisely measure the cross-modal similarity between texts and videos. Recently, contrastive learning methods have shown promising results for text-video retrieval, most of which focus on the construction of positive and negative pairs to learn text and video representations. Nevertheless, they do not pay enough attention to hard negative pairs and lack the ability to model different levels of semantic similarity. To address these two issues, this paper improves contrastive learning using two novel techniques. First, to exploit hard examples for robust discriminative power, we propose a novel Dual-Modal Attention-Enhanced Module (DMAE) to mine hard negative pairs from textual and visual clues. By further introducing a Negative-aware InfoNCE (NegNCE) loss, we are able to adaptively identify all these hard negatives and explicitly highlight their impacts in the training loss. Second, our work argues that triplet samples can better model fine-grained semantic similarity compared to pairwise samples. We thereby present a new Triplet Partial Margin Contrastive Learning (TPM-CL) module to construct partial order triplet samples by automatically generating fine-grained hard negatives for matched text-video pairs. The proposed TPM-CL designs an adaptive token masking strategy with cross-modal interaction to model subtle semantic differences. Extensive experiments demonstrate that the proposed approach outperforms existing methods on four widely-used text-video retrieval datasets, including MSR-VTT, MSVD, DiDeMo and ActivityNet.
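In recent years, the explosion of web videos has made text-video retrieval increasingly important and popular for video filtering, recommendation, and search. Text-video retrieval aims to rank relevant texts or videos above irrelevant ones. The core of the task is to measure the cross-modal similarity between texts and videos precisely. Recently, contrastive learning methods have shown promising results for text-video retrieval, and most of them focus on constructing positive and negative pairs to learn text and video representations. However, they pay too little attention to hard negative pairs and cannot model different levels of semantic similarity. To address these two issues, this paper improves contrastive learning with two new techniques. First, to exploit hard examples for robust discriminative power, we propose a new Dual-Modal Attention-Enhanced Module (DMAE) that mines hard negative pairs from textual and visual clues. By further introducing a Negative-aware InfoNCE (NegNCE) loss, we can adaptively identify all of these hard negatives and explicitly highlight their effect in the training loss. Second, we argue that triplet samples model fine-grained semantic similarity better than pairwise samples. We therefore present a new Triplet Partial Margin Contrastive Learning (TPM-CL) module that constructs partial-order triplet samples by automatically generating fine-grained hard negatives for matched text-video pairs. TPM-CL uses an adaptive token masking strategy with cross-modal interaction to model subtle semantic differences. Extensive experiments show that the proposed approach outperforms existing methods on four widely used text-video retrieval datasets: MSR-VTT, MSVD, DiDeMo, and ActivityNet.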
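As background for the loss discussed above, the following is a minimal sketch of the standard InfoNCE objective over a batch similarity matrix. It is not the paper's NegNCE loss, whose exact form the abstract does not give. The function name, the temperature value, and the toy matrices are assumptions made for illustration.

```python
import math


def info_nce(sim, temperature=0.05):
    """Mean text-to-video InfoNCE loss for a square similarity matrix.

    sim[i][j] is the similarity between text i and video j; the diagonal
    holds the matched (positive) pairs and every other entry is a negative.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim[i]]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum - logits[i]  # -log softmax of the positive pair
    return total / n


# Toy batches (hypothetical values): negatives far from vs. close to the positives.
easy = [[1.0, 0.0], [0.0, 1.0]]
hard = [[1.0, 0.9], [0.9, 1.0]]
print(info_nce(easy), info_nce(hard))
```

In this sketch, a negative whose similarity is close to that of the positive pair raises the loss far more than a distant one, which is why hard negatives dominate the training signal.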
Pathology diagnosis based on EEG signals and decoding brain activity holds immense importance in understanding neurological disorders. With the advancement of artificial intelligence methods and machine learning techniques, the potential for accurate data-driven diagnoses and effective treatments has grown significantly. However, applying machine learning algorithms to real-world datasets presents diverse challenges at multiple levels. The scarcity of labelled data, especially in low-data regimes where real patient cohorts are limited by high recruitment costs, underscores the importance of deploying scaling and transfer learning techniques. In this study, we explore a real-world pathology classification task to highlight the effectiveness of data and model scaling and cross-dataset knowledge transfer. We observe varying performance improvements through data scaling, indicating the need for careful evaluation and labelling. Additionally, we identify the challenges of possible negative transfer and emphasize the significance of some key components for overcoming distribution shifts and potential spurious correlations and achieving positive transfer. When only a small amount of labelled data is available, the performance of the target model on the target dataset (NMT) improves when knowledge from the source dataset (TUAB) is used. Our findings indicate that a small and generic model (e.g. ShallowNet) performs well on a single dataset, whereas a larger model (e.g. TCN) performs better at transfer and at learning from a larger and more diverse dataset.
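Pathology diagnosis based on EEG signals and the decoding of brain activity are highly important for understanding neurological disorders. As artificial intelligence and machine learning techniques advance, the potential for accurate data-driven diagnosis and effective treatment has grown substantially. However, applying machine learning algorithms to real-world datasets poses different challenges at several levels. The shortage of labelled data, especially in low-data settings where real patient cohorts are limited by high recruitment costs, underlines the importance of deploying scaling and transfer learning techniques. In this study, we explore a real-world pathology classification task to highlight the effectiveness of data and model scaling and of cross-dataset knowledge transfer. We observe varying performance gains from data scaling, which points to the need for careful evaluation and labelling. In addition, we identify the challenges of possible negative transfer and stress the importance of certain key components for overcoming distribution shifts and potential spurious correlations and for achieving positive transfer. Using knowledge from the source dataset (TUAB), we see the target model improve on the target dataset (NMT) when only a small amount of labelled data is available. Our findings indicate that a small, generic model (e.g. ShallowNet) performs well on a single dataset, whereas a larger model (e.g. TCN) performs better at transfer and at learning from a larger, more diverse dataset.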
NASA's Solar Dynamics Observatory (SDO) mission collects large data volumes of the Sun's daily activity. Data compression is crucial for space missions to reduce data storage and video bandwidth requirements by eliminating redundancies in the data. In this paper, we present a novel neural Transformer-based video compression approach specifically designed for the SDO images. Our primary objective is to efficiently exploit the temporal and spatial redundancies inherent in solar images to obtain a high compression ratio. Our proposed architecture benefits from a novel Transformer block called Fused Local-aware Window (FLaWin), which incorporates window-based self-attention modules and an efficient fused local-aware feed-forward (FLaFF) network. This architectural design allows us to simultaneously capture short-range and long-range information while facilitating the extraction of rich and diverse contextual representations. Moreover, this design choice results in reduced computational complexity. Experimental results demonstrate the significant contribution of the FLaWin Transformer block to the compression performance, outperforming conventional hand-engineered video codecs such as H.264 and H.265 in terms of rate-distortion trade-off.
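NASA's Solar Dynamics Observatory (SDO) mission collects very large volumes of data on the Sun's daily activity. Data compression is crucial for space missions because it reduces data storage and video bandwidth requirements by eliminating redundancy in the data. In this paper, we present a new neural Transformer-based video compression approach designed specifically for SDO images. Our main goal is to exploit the temporal and spatial redundancies inherent in solar images efficiently in order to reach a high compression ratio. Our architecture benefits from a new Transformer block called Fused Local-aware Window (FLaWin), which combines window-based self-attention modules with an efficient fused local-aware feed-forward (FLaFF) network. This design lets us capture short-range and long-range information at the same time while making it easier to extract rich and diverse contextual representations. The design also lowers computational complexity. Experimental results show that the FLaWin Transformer block contributes significantly to compression performance and outperforms conventional hand-engineered video codecs such as H.264 and H.265 in terms of the rate-distortion trade-off.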
Metaphors and sarcasm are precious fruits of our highly evolved social communication skills. However, children with Asperger syndrome are known to have difficulties in comprehending sarcasm, even if they possess a level of verbal IQ sufficient for understanding metaphors. Given that, a screening test that scores the ability to understand metaphor and sarcasm has been used to differentiate Asperger syndrome from other conditions that present with similar external behaviors (e.g., attention-deficit/hyperactivity disorder). This study uses the standardized test to examine the capability of recent large language models (LLMs) to understand nuanced human communication. The results revealed that, whereas their ability to comprehend metaphors improved as the number of model parameters increased, no improvement in sarcasm understanding was observed. This implies that an alternative approach is needed to give LLMs the capacity to grasp sarcasm, which in humans has been associated with the amygdala, a pivotal cerebral region for emotional learning.
Cellular traffic prediction is a crucial activity for optimizing fifth-generation (5G) networks and beyond, as accurate forecasting is essential for intelligent network design, resource allocation and anomaly mitigation. Although machine learning (ML) is a promising approach to effectively predict network traffic, the centralization of massive data in a single data center raises issues regarding confidentiality, privacy and data transfer demands. To address these challenges, federated learning (FL) emerges as an appealing ML training framework which offers highly accurate predictions through parallel distributed computations. However, the environmental impact of these methods is often overlooked, which calls into question their sustainability. In this paper, we address the trade-off between accuracy and energy consumption in FL by proposing a novel sustainability indicator that allows assessing the feasibility of ML models. Then, we comprehensively evaluate state-of-the-art deep learning (DL) architectures in a federated scenario using real-world measurements from base station (BS) sites in the area of Barcelona, Spain. Our findings indicate that larger ML models achieve marginally improved performance but have a significant environmental impact in terms of carbon footprint, which makes them impractical for real-world applications.
The canonical approach to video-text retrieval leverages a coarse-grained or fine-grained alignment between visual and textual information. However, retrieving the correct video according to the text query is often challenging as it requires the ability to reason about both high-level (scene) and low-level (object) visual clues and how they relate to the text query. To this end, we propose a Unified Coarse-to-fine Alignment model, dubbed UCoFiA. Specifically, our model captures the cross-modal similarity information at different granularity levels. To alleviate the effect of irrelevant visual clues, we also apply an Interactive Similarity Aggregation module (ISA) to consider the importance of different visual features while aggregating the cross-modal similarity to obtain a similarity score for each granularity. Finally, we apply the Sinkhorn-Knopp algorithm to normalize the similarities of each level before summing them, alleviating over- and under-representation issues at different levels. By jointly considering the crossmodal similarity of different granularity, UCoFiA allows the effective unification of multi-grained alignments. Empirically, UCoFiA outperforms previous state-of-the-art CLIP-based methods on multiple video-text retrieval benchmarks, achieving 2.4%, 1.4% and 1.3% improvements in text-to-video retrieval R@1 on MSR-VTT, Activity-Net, and DiDeMo, respectively. Our code is publicly available at this https URL.
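The usual approach to video-text retrieval relies on a coarse-grained or fine-grained alignment between visual and textual information. However, retrieving the correct video for a text query is often difficult, because it requires reasoning about both high-level (scene) and low-level (object) visual clues and about how they relate to the query. To this end, we propose a Unified Coarse-to-fine Alignment model, called UCoFiA. Specifically, our model captures cross-modal similarity information at different levels of granularity. To reduce the effect of irrelevant visual clues, we also apply an Interactive Similarity Aggregation module (ISA), which weighs the importance of different visual features while aggregating the cross-modal similarity into a similarity score for each granularity. Finally, we apply the Sinkhorn-Knopp algorithm to normalize the similarities at each level before summing them, which alleviates over- and under-representation at different levels. By jointly considering cross-modal similarity at different granularities, UCoFiA effectively unifies multi-grained alignments. Empirically, UCoFiA outperforms previous state-of-the-art CLIP-based methods on several video-text retrieval benchmarks, improving text-to-video retrieval R@1 by 2.4%, 1.4%, and 1.3% on MSR-VTT, Activity-Net, and DiDeMo, respectively. Our code is publicly available at this https URL.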
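The normalization step mentioned above can be illustrated with a minimal sketch of Sinkhorn-Knopp iteration, which alternately rescales the rows and columns of a positive matrix so that no single row or column dominates. The abstract does not say how the algorithm is applied to the similarity scores, so the function name, the iteration count, and the toy matrix are assumptions; the input is assumed to be strictly positive (for example, exponentiated similarities).

```python
def sinkhorn_knopp(matrix, n_iters=50, eps=1e-12):
    """Alternately normalize the rows and columns of a strictly positive
    square matrix; the result approaches a doubly stochastic matrix."""
    m = [row[:] for row in matrix]  # work on a copy
    for _ in range(n_iters):
        # Row step: make every row sum to 1.
        for row in m:
            s = sum(row) + eps
            for j in range(len(row)):
                row[j] /= s
        # Column step: make every column sum to 1.
        for j in range(len(m[0])):
            s = sum(row[j] for row in m) + eps
            for row in m:
                row[j] /= s
    return m


# Hypothetical positive similarity scores between 3 texts and 3 videos.
scores = [[0.9, 0.1, 0.3],
          [0.8, 0.2, 0.5],
          [0.4, 0.6, 0.7]]
balanced = sinkhorn_knopp(scores)
print([round(sum(row), 6) for row in balanced])
```

After enough iterations every row and every column of `balanced` sums to approximately 1, so a video that scores highly against all queries no longer outweighs the others when similarities from different levels are added.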
As the most critical components in a sentence, subject, predicate and object require special attention in the video captioning task. To implement this idea, we design a novel framework, named COllaborative three-Stream Transformers (COST), to model the three parts separately and complement each other for better representation. Specifically, COST is formed by three branches of transformers to exploit the visual-linguistic interactions of different granularities in spatial-temporal domain between videos and text, detected objects and text, and actions and text. Meanwhile, we propose a cross-granularity attention module to align the interactions modeled by the three branches of transformers, then the three branches of transformers can support each other to exploit the most discriminative semantic information of different granularities for accurate predictions of captions. The whole model is trained in an end-to-end fashion. Extensive experiments conducted on three large-scale challenging datasets, i.e., YouCookII, ActivityNet Captions and MSVD, demonstrate that the proposed method performs favorably against the state-of-the-art methods.
In the era of information explosion, spatio-temporal data mining serves as a critical part of urban management. Considering the various fields demanding attention, e.g., traffic state, human activity, and social events, predicting multiple spatio-temporal attributes simultaneously can alleviate regulatory pressure and foster smart city construction. However, current research cannot handle spatio-temporal multi-attribute prediction well because of the complex relationships between diverse attributes. The key challenge lies in how to address the common spatio-temporal patterns while tackling their distinctions. In this paper, we propose an effective solution for spatio-temporal multi-attribute prediction, PromptST. We devise a spatio-temporal transformer and a parameter-sharing training scheme to address the common knowledge among different spatio-temporal attributes. Then, we develop a spatio-temporal prompt tuning strategy to fit the specific attributes in a lightweight manner. Through the pretraining and prompt tuning phases, PromptST is able to better capture attribute-specific spatio-temporal characteristics by prompting the backbone model to fit the specific target attribute while maintaining the learned common knowledge. Extensive experiments on real-world datasets verify that PromptST attains state-of-the-art performance. Furthermore, we also show that PromptST has good transferability to unseen spatio-temporal attributes, which brings promising application potential in urban computing. The implementation code is available to ease reproducibility.
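In the era of information explosion, spatio-temporal data mining is an important part of urban management. Given the many areas that need attention, such as traffic state, human activity, and social events, predicting several spatio-temporal attributes at once can ease regulatory pressure and promote smart city construction. However, current research does not handle spatio-temporal multi-attribute prediction well, because the relationships between different attributes are complex. The key challenge is how to address the common spatio-temporal patterns while also handling their differences. In this paper, we propose an effective solution called PromptST. We design a spatio-temporal transformer and a parameter-sharing training scheme to address the knowledge shared among different spatio-temporal attributes. We then develop a spatio-temporal prompt tuning strategy that fits specific attributes in a lightweight way. Through the pretraining and prompt tuning phases, PromptST better captures attribute-specific spatio-temporal characteristics by prompting the backbone model to fit the target attribute while keeping the common knowledge it has learned. Extensive experiments on real-world datasets confirm that PromptST reaches state-of-the-art performance. We also show that PromptST transfers well to unseen spatio-temporal attributes, which gives it promising application potential in urban computing. The implementation code is available to ease reproducibility.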
As Large Language Models are deployed within Artificial Intelligence systems that are increasingly integrated with human society, it becomes more important than ever to study their internal structures. Higher-level abilities of LLMs such as GPT-3.5 emerge in large part due to the informative language representations they induce from raw text data during pre-training on trillions of words. These embeddings exist in vector spaces of several thousand dimensions, and their processing involves mapping between multiple vector spaces, with a total number of parameters on the order of trillions. Furthermore, these language representations are induced by gradient optimization, resulting in a black box system that is hard to interpret. In this paper, we take a look at the topological structure of neuronal activity in the "brain" of Chat-GPT's foundation language model, and analyze it with respect to a metric representing the notion of fairness. We develop a novel approach to visualize GPT's moral dimensions. We first compute a fairness metric, inspired by social psychology literature, to identify factors that typically influence fairness assessments in humans, such as legitimacy, need, and responsibility. Subsequently, we summarize the manifold's shape using a lower-dimensional simplicial complex, whose topology is derived from this metric. We color it with a heat map associated with this fairness metric, producing human-readable visualizations of the high-dimensional sentence manifold. Our results show that sentence embeddings based on GPT-3.5 can be decomposed into two submanifolds corresponding to fair and unfair moral judgments. This indicates that GPT-based language models develop a moral dimension within their representation spaces and induce an understanding of fairness during their training process.
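Large Language Models are being deployed in Artificial Intelligence systems that are increasingly integrated with human society, so studying their internal structures is becoming ever more important. The higher-level abilities of LLMs such as GPT-3.5 come in large part from the informative language representations they induce from raw text during pre-training on trillions of words. These embeddings live in vector spaces of several thousand dimensions, and processing them involves mappings between multiple vector spaces, with a total parameter count on the order of trillions. Moreover, these language representations are induced by gradient optimization, which yields a black box system that is hard to interpret. In this paper, we examine the topological structure of neuronal activity in the "brain" of Chat-GPT's foundation language model and analyze it with respect to a metric that represents fairness. We develop a new approach to visualizing GPT's moral dimensions. We first compute a fairness metric, inspired by the social psychology literature, to identify factors that typically influence human fairness assessments, such as legitimacy, need, and responsibility. We then summarize the shape of the manifold with a lower-dimensional simplicial complex whose topology is derived from this metric, and color it with a heat map tied to the fairness metric, producing human-readable visualizations of the high-dimensional sentence manifold. Our results show that sentence embeddings based on GPT-3.5 can be decomposed into two submanifolds that correspond to fair and unfair moral judgments. This indicates that GPT-based language models develop a moral dimension within their representation spaces and acquire an understanding of fairness during training.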
Trajectory forecasting has become an interesting research area driven by advancements in wearable sensing technology. Sensors can be seamlessly integrated into clothing using cutting-edge electronic textiles technology, allowing long-term recording of human movements outside the laboratory. Motivated by the recent findings that clothing-attached sensors can achieve higher activity recognition accuracy than body-attached sensors, this work investigates motion prediction and trajectory forecasting using rigid-attached and clothing-attached sensors. The future trajectory is forecast from a probabilistic trajectory model formulated as a left-to-right hidden Markov model (LR-HMM), and motion prediction accuracy is computed by the classification rule. Surprisingly, the results show that clothing-attached sensors can forecast the future trajectory and perform better than body-attached sensors in terms of motion prediction accuracy. In some cases, the clothing-attached sensor can enhance accuracy by 45% compared to the body-attached sensor and requires approximately 80% less historical trajectory duration to achieve the same level of accuracy as the body-attached sensor.
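Trajectory forecasting has become an interesting research area driven by advances in wearable sensing technology. With state-of-the-art electronic textile technology, sensors can be integrated seamlessly into clothing, which allows human movement to be recorded over long periods outside the laboratory. Motivated by recent findings that clothing-attached sensors can reach higher activity recognition accuracy than body-attached sensors, this work studies motion prediction and trajectory forecasting with rigid-attached and clothing-attached sensors. The future trajectory is forecast from a probabilistic trajectory model based on a left-to-right hidden Markov model (LR-HMM), and motion prediction accuracy is computed with the classification rule. Surprisingly, the results show that clothing-attached sensors can forecast the future trajectory and outperform body-attached sensors in motion prediction accuracy. In some cases, the clothing-attached sensor improves accuracy by 45% over the body-attached sensor and needs roughly 80% less historical trajectory duration to reach the same accuracy as the body-attached sensor.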
Sound event localization and detection (SELD) systems estimate direction-of-arrival (DOA) and temporal activation for sets of target classes. Neural network (NN)-based SELD systems have performed well on various sets of target classes, but they only output the DOA and temporal activation of preset classes that are trained before inference. To customize target classes after training, we tackle zero- and few-shot SELD tasks, in which we set new classes with a text sample or a few audio samples. While zero-shot sound classification tasks are achievable with embeddings from contrastive language-audio pretraining (CLAP), zero-shot SELD tasks require assigning an activity and a DOA to each embedding, especially in overlapping cases. To tackle the assignment problem in overlapping cases, we propose an embed-ACCDOA model, which is trained to output a track-wise CLAP embedding and the corresponding activity-coupled Cartesian direction-of-arrival (ACCDOA). In our experimental evaluations on zero- and few-shot SELD tasks, the embed-ACCDOA model showed better location-dependent scores than a straightforward combination of the CLAP audio encoder and a DOA estimation model. Moreover, the proposed combination of the embed-ACCDOA model and the CLAP audio encoder with zero- or few-shot samples performed comparably to an official baseline system trained with the complete training data in an evaluation dataset.
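To illustrate the ACCDOA representation named above, here is a minimal sketch of how a single activity-coupled Cartesian vector can be decoded: its length is read as the activity and its direction as the DOA. The function name, the 0.5 activity threshold, and the axis convention are assumptions made for illustration; the abstract does not describe the decoding used by the embed-ACCDOA model.

```python
import math


def decode_accdoa(vec, threshold=0.5):
    """Decode one ACCDOA vector (x, y, z) into activity and direction.

    The vector norm is treated as the activity score; the direction gives
    azimuth and elevation in degrees. The threshold is an assumed value.
    """
    x, y, z = vec
    norm = math.sqrt(x * x + y * y + z * z)
    active = norm > threshold
    azimuth = math.degrees(math.atan2(y, x))
    elevation = math.degrees(math.atan2(z, math.hypot(x, y)))
    return active, azimuth, elevation


# Hypothetical track outputs: one confident source and one near-silent track.
print(decode_accdoa((0.0, 0.9, 0.0)))
print(decode_accdoa((0.1, 0.0, 0.0)))
```

Coupling activity and direction in one vector means a track needs no separate detection output: a short vector simply reads as an inactive track.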
Monitoring wellbeing and stress is one of the problems covered by ambient intelligence, as stress is a significant cause of human illnesses directly affecting our emotional state. The primary aim was to propose a deliberation architecture for an ambient intelligence healthcare application. The architecture provides a plan for comforting stressed seniors suffering from negative emotions in an assisted living home and executes the plan considering the environment's dynamic nature. Literature was reviewed to identify the convergence between deliberation and ambient intelligence and the latter's latest healthcare trends. A deliberation function was designed to achieve context-aware dynamic human-robot interaction, perception, planning capabilities, reactivity, and context-awareness with regard to the environment. A number of experimental case studies in a simulated assisted living home scenario were conducted to demonstrate the approach's behavior and validity. The proposed methods were validated to show classification accuracy. The validation showed that the deliberation function has effectively achieved its deliberative objectives.
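Monitoring wellbeing and stress is one of the problems covered by ambient intelligence, because stress is a significant cause of human illness and directly affects our emotional state. The primary aim was to propose a deliberation architecture for an ambient intelligence healthcare application. The architecture provides a plan for comforting stressed seniors who suffer from negative emotions in an assisted living home, and it executes the plan while taking the dynamic nature of the environment into account. The literature was reviewed to identify where deliberation and ambient intelligence converge and to survey the latest healthcare trends in ambient intelligence. A deliberation function was designed to achieve context-aware dynamic human-robot interaction, perception, planning capabilities, and reactivity with regard to the environment. Several experimental case studies were carried out in a simulated assisted living home scenario to demonstrate the behavior and validity of the approach. The proposed methods were validated to show classification accuracy. The validation showed that the deliberation function effectively achieved its deliberative objectives.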