Given the enormous number of instructional videos available online, learning a diverse array of multi-step task models from videos is an appealing goal. We introduce a new pre-trained video model, VideoTaskformer, focused on representing the semantics and structure of instructional videos. We pre-train VideoTaskformer using a simple and effective objective: predicting weakly supervised textual labels for steps that are randomly masked out from an instructional video (masked step modeling). Compared to prior work which learns step representations locally, our approach involves learning them globally, leveraging video of the entire surrounding task as context. From these learned representations, we can verify if an unseen video correctly executes a given task, as well as forecast which steps are likely to be taken after a given step. We introduce two new benchmarks for detecting mistakes in instructional videos, to verify if there is an anomalous step and if steps are executed in the right order. We also introduce a long-term forecasting benchmark, where the goal is to predict long-range future steps from a given step. Our method outperforms previous baselines on these tasks, and we believe the tasks will be a valuable way for the community to measure the quality of step representations. Additionally, we evaluate VideoTaskformer on 3 existing benchmarks -- procedural activity recognition, step classification, and step forecasting -- and demonstrate on each that our method outperforms existing baselines and achieves new state-of-the-art performance.
https://arxiv.org/abs/2303.13519
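A minimal sketch of the masked step modeling objective described above, assuming pre-extracted per-step clip features and a fixed vocabulary of weakly supervised step labels; the module sizes, masking rate, and names are illustrative rather than the paper's actual implementation.

```python
# Hedged sketch of masked step modeling: mask random steps, encode the whole task,
# predict the weak step labels of the masked positions.
import torch
import torch.nn as nn

class MaskedStepModel(nn.Module):
    def __init__(self, feat_dim=512, num_step_labels=1000, num_layers=4, num_heads=8):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(feat_dim))
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(feat_dim, num_step_labels)  # predicts weak step labels

    def forward(self, step_feats, mask):
        # step_feats: (B, S, D) one feature per step; mask: (B, S) bool, True = masked
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(step_feats), step_feats)
        x = self.encoder(x)               # every step attends to the whole task as context
        return self.head(x)               # (B, S, num_step_labels)

# Toy training step
model = MaskedStepModel()
feats = torch.randn(2, 8, 512)             # 2 videos, 8 steps each
labels = torch.randint(0, 1000, (2, 8))     # weakly supervised step labels
mask = torch.rand(2, 8) < 0.25              # mask ~25% of steps
logits = model(feats, mask)
loss = nn.functional.cross_entropy(logits[mask], labels[mask])
loss.backward()
```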
In TV services, dialogue level personalization is key to meeting user preferences and needs. When dialogue and background sounds are not separately available from the production stage, Dialogue Separation (DS) can estimate them to enable personalization. DS was shown to provide clear benefits for the end user. Still, the estimated signals are not perfect, and some leakage can be introduced. This is undesired, especially during passages without dialogue. We propose to combine DS and Voice Activity Detection (VAD), both recently proposed for TV audio. When their combination suggests dialogue inactivity, background components leaking in the dialogue estimate are reassigned to the background estimate. A clear improvement of the audio quality is shown for dialogue-free signals, without performance drops when dialogue is active. A post-processed VAD estimate with improved detection accuracy is also generated. It is concluded that DS and VAD can improve each other and are better used together.
https://arxiv.org/abs/2303.13453
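The reassignment rule described above can be sketched frame-wise as follows; the signal layout and the binary per-frame VAD decision are assumptions for illustration, not the paper's exact processing.

```python
# Schematic of the DS + VAD combination: when a frame is flagged as dialogue-inactive,
# the dialogue estimate there is treated as background leakage and moved to the background.
import numpy as np

def reassign_leakage(dialogue_est, background_est, vad_active):
    dialogue_out = dialogue_est.copy()
    background_out = background_est.copy()
    inactive = ~vad_active                        # frames with no detected speech
    background_out[inactive] += dialogue_out[inactive]
    dialogue_out[inactive] = 0.0
    return dialogue_out, background_out

# Toy example: 5 frames of (frame, samples) audio, dialogue inactive in frames 4-5
d = np.random.randn(5, 1024)
b = np.random.randn(5, 1024)
vad = np.array([True, True, True, False, False])
d_clean, b_full = reassign_leakage(d, b, vad)
```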
Detecting sets of relevant patterns from a given dataset is an important challenge in data mining. The relevance of a pattern, also called utility in the literature, is a subjective measure and can actually be assessed from very different points of view. Rule-based languages like Answer Set Programming (ASP) seem well suited for specifying user-provided criteria to assess pattern utility in the form of constraints; moreover, the declarativity of ASP allows for a very easy switch between several criteria in order to analyze the dataset from different points of view. In this paper, we make steps toward extending the notion of High Utility Pattern Mining (HUPM); in particular, we introduce a new framework that allows for new classes of utility criteria not considered in the previous literature. We also show how recent extensions of ASP with external functions can support a fast and effective encoding and testing of the new framework. To demonstrate the potential of the proposed framework, we exploit it as a building block for the definition of an innovative method for predicting ICU admission for COVID-19 patients. Finally, an extensive experimental activity demonstrates both from a quantitative and a qualitative point of view the effectiveness of the proposed approach. Under consideration in Theory and Practice of Logic Programming (TPLP).
https://arxiv.org/abs/2303.13191
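As a toy illustration of the "pluggable utility criterion" idea behind HUPM, the brute-force Python sketch below mines patterns under two interchangeable criteria; the paper's ASP-based framework is declarative and far more general.

```python
# Toy high-utility pattern mining with a swappable utility function.
from itertools import combinations

transactions = [
    {"a": 2, "b": 1},            # item -> quantity per transaction
    {"a": 1, "c": 3},
    {"a": 2, "b": 2, "c": 1},
]
prices = {"a": 5.0, "b": 8.0, "c": 1.5}

def monetary_utility(pattern):
    # one classical criterion: money spent on the pattern across supporting transactions
    return sum(sum(t[i] * prices[i] for i in pattern)
               for t in transactions if pattern <= t.keys())

def frequency_utility(pattern):
    # a different criterion, swapped in without touching the mining loop
    return sum(1 for t in transactions if pattern <= t.keys())

def high_utility_patterns(utility, threshold):
    items = sorted({i for t in transactions for i in t})
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            pattern = set(combo)
            if utility(pattern) >= threshold:
                yield pattern, utility(pattern)

print(list(high_utility_patterns(monetary_utility, 20.0)))
print(list(high_utility_patterns(frequency_utility, 2)))
```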
Video captioning aims to describe the content of videos using natural language. Although significant progress has been made, there is still much room to improve the performance for real-world applications, mainly due to the long-tail words challenge. In this paper, we propose a text with knowledge graph augmented transformer (TextKG) for video captioning. Notably, TextKG is a two-stream transformer, formed by the external stream and internal stream. The external stream is designed to absorb additional knowledge, which models the interactions between the additional knowledge, e.g., pre-built knowledge graph, and the built-in information of videos, e.g., the salient object regions, speech transcripts, and video captions, to mitigate the long-tail words challenge. Meanwhile, the internal stream is designed to exploit the multi-modality information in videos (e.g., the appearance of video frames, speech transcripts, and video captions) to ensure the quality of caption results. In addition, the cross attention mechanism is also used in between the two streams for sharing information. In this way, the two streams can help each other for more accurate results. Extensive experiments conducted on four challenging video captioning datasets, i.e., YouCookII, ActivityNet Captions, MSRVTT, and MSVD, demonstrate that the proposed method performs favorably against the state-of-the-art methods. Specifically, the proposed TextKG method outperforms the best published results by improving 18.7% absolute CIDEr scores on the YouCookII dataset.
https://arxiv.org/abs/2303.12423
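A hedged sketch of cross attention between an internal and an external stream; the single block and dimensions below are illustrative and much smaller than the full TextKG architecture.

```python
# One cross-attention block in which each stream queries the other and keeps a residual path.
import torch
import torch.nn as nn

class CrossStreamBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn_int_to_ext = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_ext_to_int = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, internal, external):
        # internal: (B, Ti, D) video/speech/caption tokens
        # external: (B, Te, D) knowledge-graph tokens plus built-in video information
        int_out, _ = self.attn_int_to_ext(internal, external, external)  # internal queries external
        ext_out, _ = self.attn_ext_to_int(external, internal, internal)  # external queries internal
        return internal + int_out, external + ext_out

block = CrossStreamBlock()
inter, exter = torch.randn(2, 30, 512), torch.randn(2, 50, 512)
inter2, exter2 = block(inter, exter)
```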
Weakly-supervised temporal action localization aims to locate action regions and identify action categories in untrimmed videos, only taking video-level labels as the supervised information. Pseudo label generation is a promising strategy to solve the challenging problem, but most existing methods are limited to employing snippet-wise classification results to guide the generation, and they ignore that the natural temporal structure of the video can also provide rich information to assist such a generation process. In this paper, we propose a novel weakly-supervised temporal action localization method by inferring snippet-feature affinity. First, we design an affinity inference module that exploits the affinity relationship between temporal neighbor snippets to generate initial coarse pseudo labels. Then, we introduce an information interaction module that refines the coarse labels by enhancing the discriminative nature of snippet-features through exploring intra- and inter-video relationships. Finally, the high-fidelity pseudo labels generated from the information interaction module are used to supervise the training of the action localization network. Extensive experiments on two publicly available datasets, i.e., THUMOS14 and ActivityNet v1.3, demonstrate our proposed method achieves significant improvements compared to the state-of-the-art methods.
https://arxiv.org/abs/2303.12332
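The affinity idea can be illustrated as follows: cosine similarities between temporally neighboring snippet features redistribute snippet-wise class scores into coarse pseudo labels. The exact affinity definition and the refinement module in the paper differ; this is only a sketch.

```python
# Smooth snippet-wise class scores along temporal-neighbor affinities, then take argmax.
import torch
import torch.nn.functional as F

def coarse_pseudo_labels(snippet_feats, snippet_scores, window=2):
    # snippet_feats: (T, D) features, snippet_scores: (T, C) classification scores
    feats = F.normalize(snippet_feats, dim=-1)
    affinity = feats @ feats.t()                          # (T, T) cosine similarities
    T = affinity.size(0)
    idx = torch.arange(T)
    neighbor_mask = (idx[:, None] - idx[None, :]).abs() <= window
    affinity = affinity.masked_fill(~neighbor_mask, float('-inf'))
    weights = F.softmax(affinity, dim=-1)                 # only temporal neighbors contribute
    smoothed = weights @ snippet_scores                   # propagate scores along affinities
    return smoothed.argmax(dim=-1)                        # coarse snippet-level pseudo labels

labels = coarse_pseudo_labels(torch.randn(100, 256), torch.rand(100, 20))
```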
This paper presents a data-driven approach to mitigate the effects of air pollution from industrial plants on nearby cities by linking operational decisions with weather conditions. Our method combines predictive and prescriptive machine learning models to forecast short-term wind speed and direction and recommend operational decisions to reduce or pause the industrial plant's production. We exhibit several trade-offs between reducing environmental impact and maintaining production activities. The predictive component of our framework employs various machine learning models, such as gradient-boosted tree-based models and ensemble methods, for time series forecasting. The prescriptive component utilizes interpretable optimal policy trees to propose multiple trade-offs, such as reducing dangerous emissions by 33-47% and unnecessary costs by 40-63%. Our deployed models significantly reduced forecasting errors, with a range of 38-52% for less than 12-hour lead time and 14-46% for 12 to 48-hour lead time compared to official weather forecasts. We have successfully implemented the predictive component at the OCP Safi site, which is Morocco's largest chemical industrial plant, and are currently in the process of deploying the prescriptive component. Our framework enables sustainable industrial development by eliminating the pollution-industrial activity trade-off through data-driven weather-based operational decisions, significantly enhancing factory optimization and sustainability. This modernizes factory planning and resource allocation while maintaining environmental compliance. The predictive component has boosted production efficiency, leading to cost savings and reduced environmental impact by minimizing air pollution.
https://arxiv.org/abs/2303.12285
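A minimal sketch of the predictive component, assuming lagged wind-speed features and a single gradient-boosted tree regressor; the deployed system combines several models, features, and horizons.

```python
# Time-series forecasting with lagged features and a gradient-boosted tree model.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
wind_speed = rng.gamma(2.0, 3.0, size=2000)              # synthetic hourly wind speed

def make_lagged(series, n_lags=24, horizon=12):
    X, y = [], []
    for t in range(n_lags, len(series) - horizon):
        X.append(series[t - n_lags:t])                    # past 24 hours as features
        y.append(series[t + horizon])                     # wind speed 12 hours ahead
    return np.array(X), np.array(y)

X, y = make_lagged(wind_speed)
split = int(0.8 * len(X))
model = GradientBoostingRegressor(n_estimators=200, max_depth=3)
model.fit(X[:split], y[:split])
print("test MAE:", np.abs(model.predict(X[split:]) - y[split:]).mean())
```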
Video summarization aims to distill the most important information from a source video to produce either an abridged clip or a textual narrative. Traditionally, different methods have been proposed depending on whether the output is a video or text, thus ignoring the correlation between the two semantically related tasks of visual summarization and textual summarization. We propose a new joint video and text summarization task. The goal is to generate a shortened video clip along with the corresponding textual summary from a long video, collectively referred to as a cross-modal summary. The generated shortened video clip and text narratives should be semantically well aligned. To this end, we first build a large-scale human-annotated dataset -- VideoXum (X refers to different modalities). The dataset is reannotated based on ActivityNet. After we filter out the videos that do not meet the length requirements, 14,001 long videos remain in our new dataset. Each video in our reannotated dataset has human-annotated video summaries and the corresponding narrative summaries. We then design a novel end-to-end model -- VTSUM-BLIP -- to address the challenges of our proposed task. Moreover, we propose a new metric called VT-CLIPScore to help evaluate the semantic consistency of the cross-modal summary. The proposed model achieves promising performance on this new task and establishes a benchmark for future research.
https://arxiv.org/abs/2303.12060
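The abstract does not define VT-CLIPScore precisely; the sketch below only illustrates a CLIP-based video-text consistency score in that spirit, averaging frame-text cosine similarities with an off-the-shelf CLIP checkpoint.

```python
# Assumed formulation: mean cosine similarity between CLIP embeddings of summary frames
# and the CLIP embedding of the textual summary. Not the paper's exact metric.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def vt_consistency(frames, text_summary):
    image_inputs = processor(images=frames, return_tensors="pt")
    text_inputs = processor(text=[text_summary], return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(**image_inputs)    # (F, D)
        txt_emb = model.get_text_features(**text_inputs)      # (1, D)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.t()).mean().item()              # mean frame-text similarity

frames = [Image.fromarray(np.uint8(np.random.rand(224, 224, 3) * 255)) for _ in range(4)]
print(vt_consistency(frames, "a person slices vegetables and fries them in a pan"))
```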
Recent works show that speech separation guided diarization (SSGD) is an increasingly promising direction, mainly thanks to the recent progress in speech separation. It performs diarization by first separating the speakers and then applying voice activity detection (VAD) on each separated stream. In this work we conduct an in-depth study of SSGD in the conversational telephone speech (CTS) domain, focusing mainly on low-latency streaming diarization applications. We consider three state-of-the-art speech separation (SSep) algorithms and study their performance both in online and offline scenarios, considering non-causal and causal implementations as well as continuous SSep (CSS) windowed inference. We compare different SSGD algorithms on two widely used CTS datasets: CALLHOME and Fisher Corpus (Part 1 and 2) and evaluate both separation and diarization performance. To improve performance, a novel, causal and computationally efficient leakage removal algorithm is proposed, which significantly decreases false alarms. We also explore, for the first time, fully end-to-end SSGD integration between SSep and VAD modules. Crucially, this enables fine-tuning on real-world data for which oracle speaker sources are not available. In particular, our best model achieves 8.8% DER on CALLHOME, which outperforms the current state-of-the-art end-to-end neural diarization model, despite being trained on an order of magnitude less data and having significantly lower latency, i.e., 0.1 vs. 1 seconds. Finally, we show that the separated signals can also be readily used for automatic speech recognition, reaching performance close to using oracle sources in some configurations.
https://arxiv.org/abs/2303.12002
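A schematic of the SSGD pipeline, with the neural separation and VAD modules replaced by trivial stubs (an identity-style split and an energy threshold); only the separate-then-VAD structure is meant to be illustrative.

```python
# SSGD skeleton: separate the mixture into per-speaker streams, then run VAD on each stream.
import numpy as np

def separate_speakers(mixture):
    """Placeholder for a 2-speaker speech separation model; real SSep returns two sources."""
    return mixture * 0.5, mixture * 0.5

def frame_energy_vad(signal, frame=400, hop=160, threshold=1e-3):
    frames = [signal[i:i + frame] for i in range(0, len(signal) - frame, hop)]
    return np.array([np.mean(f ** 2) > threshold for f in frames])

def ssgd(mixture):
    sources = separate_speakers(mixture)
    # Diarization output = per-source voice activity over time
    return [frame_energy_vad(src) for src in sources]

activity = ssgd(np.random.randn(16000 * 5) * 0.01)    # 5 s of toy audio at 16 kHz
```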
In this paper, we consider the problem of temporal action localization under the low-shot (zero-shot & few-shot) scenario, with the goal of detecting and classifying action instances from arbitrary categories within some untrimmed videos, even those not seen at training time. We adopt a Transformer-based two-stage action localization architecture with class-agnostic action proposal, followed by open-vocabulary classification. We make the following contributions. First, to compensate image-text foundation models with temporal motions, we improve category-agnostic action proposal by explicitly aligning embeddings of optical flows, RGB and texts, which has largely been ignored in existing low-shot methods. Second, to improve open-vocabulary action classification, we construct classifiers with strong discriminative power, i.e., avoiding lexical ambiguities. To be specific, we propose to prompt the pre-trained CLIP text encoder either with detailed action descriptions (acquired from large-scale language models), or with visually-conditioned instance-specific prompt vectors. Third, we conduct thorough experiments and ablation studies on THUMOS14 and ActivityNet1.3, demonstrating the superior performance of our proposed model, which outperforms existing state-of-the-art approaches by a significant margin.
https://arxiv.org/abs/2303.11732
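The open-vocabulary classification step can be sketched as cosine-similarity matching between proposal features and CLIP text embeddings of per-class action descriptions; the embeddings are assumed precomputed, and the prompt generation from language models is not shown.

```python
# Open-vocabulary classification of class-agnostic proposals against text embeddings.
import torch
import torch.nn.functional as F

def open_vocab_classify(proposal_feats, class_text_embs, temperature=0.07):
    # proposal_feats: (N, D) visual features of class-agnostic action proposals
    # class_text_embs: (C, D) CLIP text embeddings of per-class action descriptions
    v = F.normalize(proposal_feats, dim=-1)
    t = F.normalize(class_text_embs, dim=-1)
    logits = v @ t.t() / temperature            # cosine similarity scaled by temperature
    return logits.softmax(dim=-1)               # per-proposal distribution over action classes

probs = open_vocab_classify(torch.randn(10, 512), torch.randn(200, 512))
```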
There is a surge in interest in the development of open-domain chatbots, driven by the recent advancements of large language models. The "openness" of the dialogue is expected to be maximized by providing minimal information to the users about the common ground they can expect, including the presumed joint activity. However, evidence suggests that the effect is the opposite. Asking users to "just chat about anything" results in a very narrow form of dialogue, which we refer to as the "open-domain paradox". In this paper, we explain this paradox through the theory of common ground as the basis for human-like communication. Furthermore, we question the assumptions behind open-domain chatbots and identify paths forward for enabling common ground in human-computer dialogue.
https://arxiv.org/abs/2303.11708
Sensor-based human activity segmentation and recognition are two important and challenging problems in many real-world applications, and they have drawn increasing attention from the deep learning community in recent years. Most of the existing deep learning works were designed based on pre-segmented sensor streams and treated activity segmentation and recognition as two separate tasks. In practice, performing data stream segmentation is very challenging. We believe that both activity segmentation and recognition may convey unique information which can complement each other to improve the performance of the two tasks. In this paper, we first propose a new multitask deep neural network to solve the two tasks simultaneously. The proposed neural network adopts selective convolution and features multiscale windows to segment activities of long or short time durations. First, multiple windows of different scales are generated to center on each unit of the feature sequence. Then, the model is trained to predict, for each window, the activity class and the offset to the true activity boundaries. Finally, overlapping windows are filtered out by non-maximum suppression, and adjacent windows of the same activity are concatenated to complete the segmentation task. Extensive experiments were conducted on eight popular benchmarking datasets, and the results show that our proposed method outperforms the state-of-the-art methods for both activity recognition and segmentation.
https://arxiv.org/abs/2303.11100
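The window generation and non-maximum suppression steps described above can be sketched as follows; the scales, IoU threshold, and the network that scores windows are illustrative placeholders.

```python
# Generate multiscale windows centered on each unit, then keep the best non-overlapping ones.
import numpy as np

def multiscale_windows(seq_len, scales=(8, 16, 32)):
    windows = []
    for center in range(seq_len):
        for s in scales:
            windows.append((center - s // 2, center + s // 2))   # one window per scale per unit
    return np.array(windows)

def nms_1d(windows, scores, iou_thr=0.5):
    order = np.argsort(scores)[::-1]                             # highest-scoring windows first
    keep = []
    while len(order) > 0:
        i = order[0]
        keep.append(i)
        s1, e1 = windows[i]
        s2, e2 = windows[order[1:]].T
        inter = np.maximum(0, np.minimum(e1, e2) - np.maximum(s1, s2))
        union = (e1 - s1) + (e2 - s2) - inter
        order = order[1:][inter / union <= iou_thr]               # drop heavily overlapping windows
    return keep

w = multiscale_windows(100)
kept = nms_1d(w, np.random.rand(len(w)))
```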
Voice-enabled technology is quickly becoming ubiquitous, and is constituted from machine learning (ML)-enabled components such as speech recognition and voice activity detection. However, these systems don't yet work well for everyone. They exhibit bias - the systematic and unfair discrimination against individuals or cohorts of individuals in favour of others (Friedman & Nissenbaum, 1996) - across axes such as age, gender and accent. ML is reliant on large datasets for training. Dataset documentation is designed to give ML Practitioners (MLPs) a better understanding of a dataset's characteristics. However, there is a lack of empirical research on voice dataset documentation specifically. Additionally, while MLPs are frequent participants in fairness research, little work focuses on those who work with voice data. Our work makes an empirical contribution to this gap. Here, we combine two methods to form an exploratory study. First, we undertake 13 semi-structured interviews, exploring multiple perspectives of voice dataset documentation practice. Using open and axial coding methods, we explore MLPs' practices through the lenses of roles and tradeoffs. Drawing from this work, we then purposively sample voice dataset documents (VDDs) for 9 voice datasets. Our findings then triangulate these two methods, using the lenses of MLP roles and trade-offs. We find that current VDD practices are inchoate, inadequate and incommensurate. The characteristics of voice datasets are codified in fragmented, disjoint ways that often do not meet the needs of MLPs. Moreover, they cannot be readily compared, presenting a barrier to practitioners' bias reduction efforts. We then discuss the implications of these findings for bias practices in voice data and speech technologies. We conclude by setting out a program of future work to address these findings -- that is, how we may "right the docs".
https://arxiv.org/abs/2303.10721
A computer vision system using low-resolution image sensors can provide intelligent services (e.g., activity recognition) but preserve unnecessary visual privacy information from the hardware level. However, preserving visual privacy and enabling accurate machine recognition have adversarial needs on image resolution. Modeling the trade-off of privacy preservation and machine recognition performance can guide future privacy-preserving computer vision systems using low-resolution image sensors. In this paper, using the at-home activity of daily livings (ADLs) as the scenario, we first obtained the most important visual privacy features through a user survey. Then we quantified and analyzed the effects of image resolution on human and machine recognition performance in activity recognition and privacy awareness tasks. We also investigated how modern image super-resolution techniques influence these effects. Based on the results, we proposed a method for modeling the trade-off of privacy preservation and activity recognition on low-resolution images.
https://arxiv.org/abs/2303.10435
Deep Neural Networks (DNNs) often fail in out-of-distribution scenarios. In this paper, we introduce a tool to visualize and understand such failures. We draw inspiration from concepts from neural electrophysiology, which are based on inspecting the internal functioning of a neural network by analyzing the feature tuning and invariances of individual units. Deep Electrophysiology, in short Deephys, provides insights into the DNN's failures in out-of-distribution scenarios by comparative visualization of the neural activity in in-distribution and out-of-distribution datasets. Deephys provides seamless analyses of individual neurons, individual images, and a set of images from a category, and it is capable of revealing failures due to the presence of spurious features and novel features. We substantiate the validity of the qualitative visualizations of Deephys through quantitative analyses using convolutional and transformer architectures, on several datasets and distribution shifts (namely, colored MNIST, CIFAR-10 and ImageNet).
https://arxiv.org/abs/2303.11912
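A minimal sketch of the comparative-activity idea behind Deephys: per-unit statistics are contrasted between in-distribution and out-of-distribution data to flag units whose tuning shifts. Deephys itself is an interactive visualization tool; the tiny model and random data here are stand-ins.

```python
# Compare mean per-unit activations between ID and OOD data to find the most shifted units.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))
hidden = model[:3]                                    # probe activity at the ReLU layer

def unit_activity(images):
    with torch.no_grad():
        return hidden(images).mean(dim=0)             # mean activation per unit

id_images = torch.rand(256, 1, 28, 28)                # stand-in for in-distribution data
ood_images = torch.rand(256, 1, 28, 28) * 2.0         # stand-in for a distribution shift
shift = (unit_activity(ood_images) - unit_activity(id_images)).abs()
suspect_units = shift.topk(5).indices                 # units whose tuning changes most under shift
print(suspect_units)
```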
Despite their potential, markerless hand tracking technologies are not yet applied in practice to the diagnosis or monitoring of the activity in inflammatory musculoskeletal diseases. One reason is that the focus of most methods lies in the reconstruction of coarse, plausible poses for gesture recognition or AR/VR applications, whereas in the clinical context, accurate, interpretable, and reliable results are required. Therefore, we propose ShaRPy, the first RGB-D Shape Reconstruction and hand Pose tracking system, which provides uncertainty estimates of the computed pose to guide clinical decision-making. Our method requires only a light-weight setup with a single consumer-level RGB-D camera, yet it is able to distinguish similar poses with only small joint angle deviations. This is achieved by combining a data-driven dense correspondence predictor with traditional energy minimization, optimizing for both pose and hand shape parameters. We evaluate ShaRPy on a keypoint detection benchmark and show qualitative results on recordings of a patient.
https://arxiv.org/abs/2303.10042
Existing text-video retrieval solutions are, in essence, discriminant models focused on maximizing the conditional likelihood, i.e., p(candidates|query). While straightforward, this de facto paradigm overlooks the underlying data distribution p(query), which makes it challenging to identify out-of-distribution data. To address this limitation, we creatively tackle this task from a generative viewpoint and model the correlation between the text and the video as their joint probability p(candidates,query). This is accomplished through a diffusion-based text-video retrieval framework (DiffusionRet), which models the retrieval task as a process of gradually generating joint distribution from noise. During training, DiffusionRet is optimized from both the generation and discrimination perspectives, with the generator being optimized by generation loss and the feature extractor trained with contrastive loss. In this way, DiffusionRet cleverly leverages the strengths of both generative and discriminative methods. Extensive experiments on five commonly used text-video retrieval benchmarks, including MSRVTT, LSMDC, MSVD, ActivityNet Captions, and DiDeMo, with superior performances, justify the efficacy of our method. More encouragingly, without any modification, DiffusionRet even performs well in out-domain retrieval settings. We believe this work brings fundamental insights into the related fields. Code will be available at this https URL.
https://arxiv.org/abs/2303.09867
Medications often impose temporal constraints on everyday patient activity. Violations of such medical temporal constraints (MTCs) lead to a lack of treatment adherence, in addition to poor health outcomes and increased healthcare expenses. These MTCs are found in drug usage guidelines (DUGs) in both patient education materials and clinical texts. Computationally representing MTCs in DUGs will advance patient-centric healthcare applications by helping to define safe patient activity patterns. We define a novel taxonomy of MTCs found in DUGs and develop a novel context-free grammar (CFG) based model to computationally represent MTCs from unstructured DUGs. Additionally, we release three new datasets with a combined total of N = 836 DUGs labeled with normalized MTCs. We develop an in-context learning (ICL) solution for automatically extracting and normalizing MTCs found in DUGs, achieving an average F1 score of 0.62 across all datasets. Finally, we rigorously investigate ICL model performance against a baseline model, across datasets and MTC types, and through in-depth error analysis.
https://arxiv.org/abs/2303.09366
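To make the CFG idea concrete, here is a toy grammar for a single drug-usage temporal constraint written with NLTK; the rules are illustrative and far simpler than the MTC taxonomy and grammar defined in the paper.

```python
# Toy context-free grammar for one medical temporal constraint, parsed with NLTK.
import nltk

grammar = nltk.CFG.fromstring("""
MTC -> ACTION CONSTRAINT
ACTION -> 'take' DRUG
DRUG -> 'medication'
CONSTRAINT -> FREQ | TIMING
FREQ -> 'every' NUM UNIT
TIMING -> 'with' 'food' | 'before' 'bedtime'
NUM -> '8' | '12'
UNIT -> 'hours'
""")

parser = nltk.ChartParser(grammar)
tokens = "take medication every 8 hours".split()
for tree in parser.parse(tokens):       # prints the parse tree of the temporal constraint
    print(tree)
```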
A comprehensive model of natural language processing in the brain must accommodate four components: representations, operations, structures and encoding. It further requires a principled account of how these components mechanistically, and causally, relate to one another. While previous models have isolated regions of interest for structure-building and lexical access, many gaps remain with respect to bridging distinct scales of neural complexity. By expanding existing accounts of how neural oscillations can index various linguistic processes, this article proposes a neurocomputational architecture for syntax, termed the ROSE model (Representation, Operation, Structure, Encoding). Under ROSE, the basic data structures of syntax are atomic features, types of mental representations (R), and are coded at the single-unit and ensemble level. Elementary computations (O) that transform these units into manipulable objects accessible to subsequent structure-building levels are coded via high frequency gamma activity. Low frequency synchronization and cross-frequency coupling code for recursive categorial inferences (S). Distinct forms of low frequency coupling and phase-amplitude coupling (delta-theta coupling via pSTS-IFG; theta-gamma coupling via IFG to conceptual hubs) then encode these structures onto distinct workspaces (E). Causally connecting R to O is spike-phase/LFP coupling; connecting O to S is phase-amplitude coupling; connecting S to E is a system of frontotemporal traveling oscillations; connecting E to lower levels is low-frequency phase resetting of spike-LFP coupling. ROSE is reliant on neurophysiologically plausible mechanisms, is supported at all four levels by a range of recent empirical research, and provides an anatomically precise and falsifiable grounding for the basic property of natural language syntax: hierarchical, recursive structure-building.
https://arxiv.org/abs/2303.08877
Birth asphyxia is a major newborn mortality problem in low-resource countries. International guidelines provide treatment recommendations; however, the importance and effect of the different treatments are not fully explored. The available data were collected in Tanzania during newborn resuscitation, for analysis of the resuscitation activities and the response of the newborn. An important step in the analysis is to create activity timelines of the episodes, where activities include ventilation, suction, stimulation, etc. Methods: The available recordings are noisy real-world videos with large variations. We propose a two-step process in order to detect activities possibly overlapping in time. The first step is to detect and track the relevant objects, like the bag-mask resuscitator, heart rate sensors, etc., and the second step is to use this information to recognize the resuscitation activities. The topic of this paper is the first step, and the object detection and tracking are based on convolutional neural networks followed by post-processing. Results: The performance of the object detection during activities was 96.97 % (ventilations), 100 % (attaching/removing heart rate sensor) and 75 % (suction) on a test set of 20 videos. The system also estimates the number of health care providers present, with a performance of 71.16 %. Conclusion: The proposed object detection and tracking system provides promising results in noisy newborn resuscitation videos. Significance: This is the first step in a thorough analysis of newborn resuscitation episodes, which could provide important insight about the importance and effect of different newborn resuscitation activities.
https://arxiv.org/abs/2303.07790
Objective: Birth asphyxia is one of the leading causes of neonatal deaths. A key for survival is performing immediate and continuous quality newborn resuscitation. A dataset of recorded signals during newborn resuscitation, including videos, has been collected in Haydom, Tanzania, and the aim is to analyze the treatment and its effect on the newborn outcome. An important step is to generate timelines of relevant resuscitation activities, including ventilation, stimulation, suction, etc., during the resuscitation episodes. Methods: We propose a two-step deep neural network system, ORAA-net, utilizing low-quality video recordings of resuscitation episodes to do activity recognition during newborn resuscitation. The first step is to detect and track relevant objects using Convolutional Neural Networks (CNN) and post-processing, and the second step is to analyze the proposed activity regions from step 1 to do activity recognition using 3D CNNs. Results: The system recognized the activities newborn uncovered, stimulation, ventilation and suction with a mean precision of 77.67 %, a mean recall of 77.64 %, and a mean accuracy of 92.40 %. Moreover, the accuracy of the estimated number of Health Care Providers (HCPs) present during the resuscitation episodes was 68.32 %. Conclusion: The results indicate that the proposed CNN-based two-step ORAA-net could be used for object detection and activity recognition in noisy low-quality newborn resuscitation videos. Significance: A thorough analysis of the effect the different resuscitation activities have on the newborn outcome could potentially allow us to optimize treatment guidelines, training, debriefing, and local quality improvement in newborn resuscitation.
https://arxiv.org/abs/2303.07789
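The second stage of the two-step design, activity recognition over proposed regions with a 3D CNN, can be sketched as below; the backbone choice, clip size, and class list are illustrative assumptions, not the paper's configuration.

```python
# A 3-D CNN classifies a short clip cropped around one proposed activity region.
import torch
from torchvision.models.video import r3d_18

ACTIVITIES = ["ventilation", "stimulation", "suction", "newborn uncovered"]
model = r3d_18(num_classes=len(ACTIVITIES))

clip = torch.randn(1, 3, 16, 112, 112)        # one proposed region: 16 RGB frames, 112x112
probs = model(clip).softmax(dim=-1)
print(dict(zip(ACTIVITIES, probs[0].tolist())))
```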