Detecting and recognizing text in camera-captured images and videos remains a highly challenging research problem. Despite advances that achieve high accuracy, current methods still require substantial improvement before they are practical. Diverging from generic text detection in images and videos, this paper addresses text detection within license plates by combining multiple frames captured from distinct perspectives. For each viewpoint, the proposed method extracts descriptive features characterizing the text components of the license plate, specifically corner points and area. Concretely, we present three viewpoints (view-1, view-2, and view-3) to identify the nearest neighboring components, facilitating the restoration of text components belonging to the same license plate line based on similarity and distance estimates. Subsequently, we employ the CnOCR method for text recognition within license plates. Experimental results on a self-collected dataset (PTITPlates), comprising pairs of images in various scenarios, and on the publicly available Stanford Cars Dataset demonstrate the superiority of the proposed method over existing approaches.
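The line-restoration idea, grouping nearby text components into the same plate line by geometric distance, can be sketched minimally. The box format and tolerance below are illustrative assumptions, not the paper's actual corner-point and area features:

```python
# Minimal sketch: group character boxes into text lines by vertical
# proximity. The (x, y, w, h) box format and the y_tol threshold are
# illustrative assumptions, not the paper's actual feature set.

def group_into_lines(boxes, y_tol=0.5):
    """Cluster boxes whose vertical centers lie within y_tol * box height."""
    lines = []
    for box in sorted(boxes, key=lambda b: (b[1], b[0])):
        cy, h = box[1] + box[3] / 2, box[3]
        for line in lines:
            ref = line[0]
            ref_cy = ref[1] + ref[3] / 2
            if abs(cy - ref_cy) <= y_tol * max(h, ref[3]):
                line.append(box)
                break
        else:
            lines.append([box])
    # Sort each line left-to-right, as characters appear on a plate.
    return [sorted(line, key=lambda b: b[0]) for line in lines]

# Two-row plate: characters at two distinct heights.
boxes = [(60, 5, 20, 30), (10, 8, 20, 30), (35, 6, 20, 30),
         (12, 50, 20, 30), (40, 52, 20, 30)]
lines = group_into_lines(boxes)
```

A real system would additionally use the per-view similarity estimates the abstract mentions before merging components across frames.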
https://arxiv.org/abs/2309.12972
Coral reefs are among the most diverse ecosystems on our planet, and hundreds of millions of people depend on them. Unfortunately, most coral reefs are existentially threatened by global climate change and local anthropogenic pressures. To better understand the dynamics underlying the deterioration of reefs, monitoring at high spatial and temporal resolution is key. However, conventional monitoring methods for quantifying coral cover and species abundance are limited in scale due to the extensive manual labor required. Although computer vision tools have been employed to aid in this process, in particular SfM photogrammetry for 3D mapping and deep neural networks for image segmentation, analysis of the data products creates a bottleneck, effectively limiting their scalability. This paper presents a new paradigm for mapping underwater environments from ego-motion video, unifying 3D mapping systems that use machine learning to adapt to challenging underwater conditions with a modern approach to semantic segmentation of images. The method is exemplified on coral reefs in the northern Gulf of Aqaba, Red Sea, demonstrating high-precision 3D semantic mapping at unprecedented scale with significantly reduced labor costs: a 100 m video transect acquired within 5 minutes of diving with a cheap consumer-grade camera can be fully automatically analyzed within 5 minutes. Our approach significantly scales up coral reef monitoring by taking a leap towards fully automatic analysis of video transects. The method democratizes coral reef transects by reducing the labor, equipment, logistics, and computing costs, which can help inform conservation policies more efficiently. The underlying computational method of learning-based Structure-from-Motion has broad implications for fast, low-cost mapping of underwater environments other than coral reefs.
https://arxiv.org/abs/2309.12804
We present a method based on natural language processing (NLP) for studying the influence of interest groups (lobbies) on the law-making process in the European Parliament (EP). We collect and analyze novel datasets of lobbies' position papers and of speeches made by members of the EP (MEPs). By comparing these texts on the basis of semantic similarity and entailment, we are able to discover interpretable links between MEPs and lobbies. In the absence of a ground-truth dataset of such links, we perform an indirect validation by comparing the discovered links with a dataset we curate of retweet links between MEPs and lobbies, and with the publicly disclosed meetings of MEPs. Our best method achieves an AUC score of 0.77 and performs significantly better than several baselines. Moreover, an aggregate analysis of the discovered links between groups of related lobbies and political groups of MEPs corresponds to expectations based on the ideology of the groups (e.g., center-left groups are associated with social causes). We believe that this work, which encompasses the methodology, datasets, and results, is a step towards enhancing the transparency of the intricate decision-making processes within democratic institutions.
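As a rough illustration of similarity-based link discovery: the paper uses transformer-based semantic similarity and entailment, but the core retrieval step can be mimicked with a toy bag-of-words cosine, linking each speech to its most similar position paper. All texts below are invented examples:

```python
# Illustrative sketch only: a toy bag-of-words cosine stands in for the
# paper's transformer-based similarity/entailment scoring. Each MEP
# speech is linked to its most similar lobby position paper.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def link(speeches, papers):
    """Return, for each speech, the index of the most similar paper."""
    vecs_p = [Counter(p.lower().split()) for p in papers]
    out = []
    for s in speeches:
        vs = Counter(s.lower().split())
        sims = [cosine(vs, vp) for vp in vecs_p]
        out.append(max(range(len(sims)), key=sims.__getitem__))
    return out

papers = ["reduce carbon emissions in the energy sector",
          "strengthen data privacy rules for platforms"]
speeches = ["we must cut carbon emissions from energy production",
            "platforms need stricter privacy rules for user data"]
```

Here `link(speeches, papers)` pairs each speech with the topically matching paper; the validation against retweets and disclosed meetings described above would then assess such discovered links.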
https://arxiv.org/abs/2309.11381
Democratizing access to natural language processing (NLP) technology is crucial, especially for underrepresented and extremely low-resource languages. Previous research has focused on developing labeled and unlabeled corpora for these languages through online scraping and document translation. While these methods have proven effective and cost-efficient, we have identified limitations in the resulting corpora, including a lack of lexical diversity and cultural relevance to local communities. To address this gap, we conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content. In addition, we present the NusaWrites benchmark, encompassing 12 underrepresented and extremely low-resource languages spoken by millions of individuals in Indonesia. Our empirical experiments using existing multilingual large language models demonstrate the need to extend these models to more underrepresented languages. We release the NusaWrites dataset at this https URL.
https://arxiv.org/abs/2309.10661
Customizing machine translation models to comply with fine-grained attributes such as formality has seen tremendous progress recently. However, current approaches mostly rely on at least some supervised data with attribute annotation. Data scarcity therefore remains a bottleneck to democratizing such customization possibilities to a wider range of languages, lower-resource ones in particular. Given recent progress in pretrained massively multilingual translation models, we use them as a foundation to transfer attribute-controlling capabilities to languages without supervised data. In this work, we present a comprehensive analysis of transferring attribute controllers based on a pretrained NLLB-200 model. We investigate both training- and inference-time control techniques under various data scenarios, and uncover their relative strengths and weaknesses in zero-shot performance and domain robustness. We show that both paradigms are complementary, as demonstrated by consistent improvements on 5 zero-shot directions. Moreover, a human evaluation on a real low-resource language, Bengali, confirms our findings on zero-shot transfer to new target languages. The code is available at this https URL.
https://arxiv.org/abs/2309.08565
News recommender systems play an increasingly influential role in shaping information access within democratic societies. However, tailoring recommendations to users' specific interests can result in the divergence of information streams. Fragmented access to information poses challenges to the integrity of the public sphere, thereby influencing democracy and public discourse. The Fragmentation metric quantifies the degree of fragmentation of information streams in news recommendations. Accurate measurement of this metric requires the application of Natural Language Processing (NLP) to identify distinct news events, stories, or timelines. This paper presents an extensive investigation of various approaches for quantifying Fragmentation in news recommendations. These approaches are evaluated both intrinsically, by measuring performance on news story clustering, and extrinsically, by assessing the Fragmentation scores of different simulated news recommender scenarios. Our findings demonstrate that agglomerative hierarchical clustering coupled with SentenceBERT text representation is substantially better at detecting Fragmentation than earlier implementations. Additionally, the analysis of simulated scenarios yields valuable insights and recommendations for stakeholders concerning the measurement and interpretation of Fragmentation.
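The two ingredients of the measurement pipeline, clustering articles into stories and scoring the divergence between users' recommendation streams, can be sketched in heavily simplified form. The toy 2-D vectors stand in for SentenceBERT embeddings, and the overlap-based score is only a proxy for the paper's actual Fragmentation metric:

```python
# Toy sketch: single-linkage agglomerative clustering over pre-computed
# embeddings (stand-ins for SentenceBERT vectors), then a simple
# overlap-based fragmentation proxy between two users' recommendation
# lists. The paper's actual metric and clustering setup differ.
import math

def cos(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def cluster(vecs, thresh=0.9):
    """Single-linkage agglomerative clustering: merge while any
    cross-cluster pair has cosine similarity >= thresh."""
    clusters = [[i] for i in range(len(vecs))]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(cos(vecs[a], vecs[b]) >= thresh
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i] += clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters

# Four articles: items 0/1 cover one news story, 2/3 another.
vecs = [(1.0, 0.0), (0.99, 0.1), (0.0, 1.0), (0.1, 0.99)]
labels = {i: k for k, c in enumerate(cluster(vecs)) for i in c}

def fragmentation(user_a, user_b):
    """Proxy: 1 - overlap of story clusters seen by the two users."""
    sa = {labels[i] for i in user_a}
    sb = {labels[i] for i in user_b}
    return 1 - len(sa & sb) / len(sa | sb)
```

Two users who see different articles about the same stories score 0 (no fragmentation), while users whose streams cover disjoint stories score 1.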
https://arxiv.org/abs/2309.06192
The field of visual document understanding has witnessed rapid growth in emerging challenges and powerful multi-modal strategies. However, these methods rely on an extensive amount of document data to learn their pretext objectives in a "pre-train-then-fine-tune" paradigm and thus suffer a significant performance drop in real-world online industrial settings. One major reason is the over-reliance on OCR engines to extract local positional information within a document page, which hinders the model's generalizability, flexibility, and robustness due to the lack of capturing global information within a document image. We introduce TransferDoc, a cross-modal transformer-based architecture pre-trained in a self-supervised fashion using three novel pretext objectives. TransferDoc learns richer semantic concepts by unifying language and visual representations, which enables the production of more transferable models. Besides, two novel downstream tasks have been introduced for a "closer-to-real" industrial evaluation scenario, in which TransferDoc outperforms other state-of-the-art approaches.
https://arxiv.org/abs/2309.05756
In this work, we use large language models (LLMs) to augment and accelerate research on the P versus NP problem, one of the most important open problems in theoretical computer science and mathematics. Specifically, we propose Socratic reasoning, a general framework that promotes in-depth thinking with LLMs for complex problem-solving. Socratic reasoning encourages LLMs to recursively discover, solve, and integrate problems while facilitating self-evaluation and refinement. Our pilot study on the P vs. NP problem shows that GPT-4 successfully produces a proof schema and engages in rigorous reasoning throughout 97 dialogue turns, concluding "P $\neq$ NP", which is in alignment with (Xu and Zhou, 2023). The investigation uncovers novel insights within the extensive solution space of LLMs, shedding light on LLM for Science.
https://arxiv.org/abs/2309.05689
We introduce the structured scene-text spotting task, which requires a scene-text OCR system to spot text in the wild according to a query regular expression. Contrary to generic scene text OCR, structured scene-text spotting seeks to dynamically condition both scene text detection and recognition on user-provided regular expressions. To tackle this task, we propose the Structured TExt sPotter (STEP), a model that exploits the provided text structure to guide the OCR process. STEP is able to deal with regular expressions that contain spaces and it is not bound to detection at the word-level granularity. Our approach enables accurate zero-shot structured text spotting in a wide variety of real-world reading scenarios and is solely trained on publicly available data. To demonstrate the effectiveness of our approach, we introduce a new challenging test dataset that contains several types of out-of-vocabulary structured text, reflecting important reading applications of fields such as prices, dates, serial numbers, license plates etc. We demonstrate that STEP can provide specialised OCR performance on demand in all tested scenarios.
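For contrast, a naive alternative to STEP is to run generic word-level OCR and filter the transcriptions with the query expression afterwards. The sketch below shows that baseline (the OCR output and boxes are hypothetical); its word-level granularity and post-hoc filtering are exactly the limitations STEP's conditioned detection and recognition avoid:

```python
# Naive baseline for contrast with STEP: run generic OCR first, then
# filter the transcriptions with the query regular expression. Note
# that a post-hoc filter cannot recover matches that word-level
# detection has split apart, one limitation STEP addresses.
import re

def spot(ocr_words, pattern):
    """Return (text, box) pairs whose full text matches the pattern."""
    rx = re.compile(pattern)
    return [(w, box) for w, box in ocr_words if rx.fullmatch(w)]

# Hypothetical word-level OCR output: (text, bounding box).
ocr = [("Total:", (0, 0, 60, 20)),
       ("$12.99", (70, 0, 130, 20)),
       ("SN-4821", (0, 30, 80, 50)),
       ("2023-09-01", (90, 30, 190, 50))]

prices = spot(ocr, r"\$\d+\.\d{2}")       # price-like strings
dates = spot(ocr, r"\d{4}-\d{2}-\d{2}")   # ISO dates
```

STEP instead feeds the query structure into the OCR process itself, so it can handle expressions containing spaces and detections beyond word granularity.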
https://arxiv.org/abs/2309.02356
In the digital era, the integration of artificial intelligence (AI) in education has ushered in transformative changes, redefining teaching methodologies, curriculum planning, and student engagement. This review paper delves into the rapidly evolving landscape of digital education by contrasting the capabilities and impact of pioneering text generation tools such as Bing Chat, Bard, and Ernie, with a keen focus on OpenAI's novel ChatGPT. Grounded in a typology that views education through the lenses of system, process, and result, the paper navigates the multifaceted applications of AI. From decentralizing global education and personalizing curriculums to digitally documenting competence-based outcomes, AI stands at the forefront of educational modernization. Highlighting ChatGPT's meteoric rise to one million users in just five days, the study underscores its role in democratizing education, fostering autodidacticism, and magnifying student engagement. However, with such transformative power comes the potential for misuse, as text-generation tools can inadvertently challenge academic integrity. By juxtaposing the promise and pitfalls of AI in education, this paper advocates for a harmonized synergy between AI tools and the educational community, emphasizing the urgent need for ethical guidelines, pedagogical adaptations, and strategic collaborations.
https://arxiv.org/abs/2309.02029
Material reconstruction from a photograph is a key component of 3D content creation democratization. We propose to formulate this ill-posed problem as a controlled synthesis one, leveraging the recent progress in generative deep networks. We present ControlMat, a method which, given a single photograph with uncontrolled illumination as input, conditions a diffusion model to generate plausible, tileable, high-resolution physically-based digital materials. We carefully analyze the behavior of diffusion models for multi-channel outputs, adapt the sampling process to fuse multi-scale information and introduce rolled diffusion to enable both tileability and patched diffusion for high-resolution outputs. Our generative approach further permits exploration of a variety of materials which could correspond to the input image, mitigating the unknown lighting conditions. We show that our approach outperforms recent inference and latent-space-optimization methods, and carefully validate our diffusion process design choices. Supplemental materials and additional details are available at: this https URL.
https://arxiv.org/abs/2309.01700
Traditionally, social choice theory has only been applicable to choices among a few predetermined alternatives but not to more complex decisions such as collectively selecting a textual statement. We introduce generative social choice, a framework that combines the mathematical rigor of social choice theory with large language models' capability to generate text and extrapolate preferences. This framework divides the design of AI-augmented democratic processes into two components: first, proving that the process satisfies rigorous representation guarantees when given access to oracle queries; second, empirically validating that these queries can be approximately implemented using a large language model. We illustrate this framework by applying it to the problem of generating a slate of statements that is representative of opinions expressed as free-form text, for instance in an online deliberative process.
https://arxiv.org/abs/2309.01291
Modern NLP breakthroughs include large multilingual models capable of performing tasks across more than 100 languages. State-of-the-art language models have come a long way, starting from simple one-hot representations of words and progressing to models capable of natural language understanding, common-sense reasoning, and question answering, thus capturing both the syntax and semantics of texts. At the same time, language models are expanding beyond our known language boundaries, even performing competitively on very low-resource dialects of endangered languages. However, problems remain in ensuring an equitable representation of texts through a unified modeling space across languages and speakers. In this survey, we shed light on this iterative progression of multilingual text representation and discuss the driving factors that ultimately led to the current state of the art. Subsequently, we discuss how the full potential of language democratization could be obtained, reaching beyond the known limits, and what the scope of improvement in that space is.
https://arxiv.org/abs/2309.00949
This paper focuses on enhancing Bengali Document Layout Analysis (DLA) using the YOLOv8 model and innovative post-processing techniques. We tackle challenges unique to the complex Bengali script by employing data augmentation for model robustness. After meticulous validation-set evaluation, we fine-tune our approach on the complete dataset, leading to a two-stage prediction strategy for accurate element segmentation. Our ensemble model, combined with post-processing, outperforms individual base architectures, addressing issues identified in the BaDLAD dataset. By leveraging this approach, we aim to advance Bengali document analysis, contributing to improved OCR and document comprehension; BaDLAD serves as a foundational resource for this endeavor, aiding future research in the field. Furthermore, our experiments provided key insights for incorporating new strategies into the established solution.
https://arxiv.org/abs/2309.00848
Linking information across sources is fundamental to a variety of analyses in social science, business, and government. While large language models (LLMs) offer enormous promise for improving record linkage in noisy datasets, in many domains approximate string matching packages in popular software such as R and Stata remain predominant. These packages have clean, simple interfaces and can be easily extended to a diversity of languages. Our open-source package LinkTransformer aims to extend the familiarity and ease-of-use of popular string matching methods to deep learning. It is a general purpose package for record linkage with transformer LLMs that treats record linkage as a text retrieval problem. At its core is an off-the-shelf toolkit for applying transformer models to record linkage with four lines of code. LinkTransformer contains a rich repository of pre-trained transformer semantic similarity models for multiple languages and supports easy integration of any transformer language model from Hugging Face or OpenAI. It supports standard functionality such as blocking and linking on multiple noisy fields. LinkTransformer APIs also perform other common text data processing tasks, e.g., aggregation, noisy de-duplication, and translation-free cross-lingual linkage. Importantly, LinkTransformer also contains comprehensive tools for efficient model tuning, to facilitate different levels of customization when off-the-shelf models do not provide the required accuracy. Finally, to promote reusability, reproducibility, and extensibility, LinkTransformer makes it easy for users to contribute their custom-trained models to its model hub. By combining transformer language models with intuitive APIs that will be familiar to many users of popular string matching packages, LinkTransformer aims to democratize the benefits of LLMs among those who may be less familiar with deep learning frameworks.
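The "record linkage as text retrieval" framing can be sketched with stand-in embeddings. The character-trigram vectors below are an illustrative assumption and do not reflect LinkTransformer's actual API or its transformer models; they merely show the embed-then-retrieve-nearest-neighbor shape of the problem:

```python
# Sketch of record linkage as text retrieval: embed both tables, then
# link each query record to its nearest neighbour in the corpus.
# Character-trigram vectors stand in for transformer embeddings; this
# is not LinkTransformer's API.
import math
from collections import Counter

def embed(s):
    s = f"  {s.lower()}  "                       # pad so edges count
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    return dot / (math.sqrt(sum(v * v for v in a.values())) *
                  math.sqrt(sum(v * v for v in b.values())))

def link(queries, corpus):
    """Top-1 retrieval: index of the closest corpus record per query."""
    cvecs = [embed(c) for c in corpus]
    return [max(range(len(corpus)),
                key=lambda j: cosine(embed(q), cvecs[j]))
            for q in queries]

corpus = ["Acme Corporation, New York", "Widget Industries, Boston"]
queries = ["ACME Corp., NY", "widget industries boston"]
```

In a transformer-based linker, the embedding function would capture semantic rather than purely orthographic similarity, which is what lets it outperform approximate string matching on noisy records.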
https://arxiv.org/abs/2309.00789
Thyroid disorders are most commonly diagnosed using high-resolution Ultrasound (US). Longitudinal nodule tracking is a pivotal diagnostic protocol for monitoring changes in pathological thyroid morphology. This task, however, imposes a substantial cognitive load on clinicians due to the inherent challenge of maintaining a mental 3D reconstruction of the organ. We thus present a framework for automated US image slice localization within a 3D shape representation to ease how such sonographic diagnoses are carried out. Our proposed method learns a common latent embedding space between US image patches and the 3D surface of an individual's thyroid shape, or a statistical aggregation in the form of a statistical shape model (SSM), via contrastive metric learning. Using cross-modality registration and Procrustes analysis, we leverage features from our model to register US slices to a 3D mesh representation of the thyroid shape. We demonstrate that our multi-modal registration framework can localize images on the 3D surface topology of a patient-specific organ and the mean shape of an SSM. Experimental results indicate slice positions can be predicted within an average of 1.2 mm of the ground-truth slice location on the patient-specific 3D anatomy and 4.6 mm on the SSM, exemplifying its usefulness for slice localization during sonographic acquisitions. Code is publicly available at this https URL.
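The Procrustes-analysis step can be illustrated in isolation. The following is a generic orthogonal Procrustes solver for aligning two point sets, not the paper's full cross-modality registration pipeline; the synthetic rotation is an invented example:

```python
# Generic orthogonal Procrustes sketch: find the rotation R minimizing
# ||A @ R - B||_F, the kind of rigid alignment used when registering
# features to a 3D mesh. Not the paper's full pipeline.
import numpy as np

def orthogonal_procrustes(A, B):
    """Rotation R (det +1 enforced) best mapping rows of A onto B."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    R = U @ Vt
    if np.linalg.det(R) < 0:      # flip a singular vector to avoid reflections
        U[:, -1] *= -1
        R = U @ Vt
    return R

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 3))      # 10 synthetic 3D feature points
theta = 0.3                       # known rotation about the z-axis
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
B = A @ R_true
R_est = orthogonal_procrustes(A, B)
```

Given exact correspondences, the SVD-based solution recovers the true rotation; in practice the correspondences come from the learned latent embedding space.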
https://arxiv.org/abs/2309.00372
Existing emotion prediction benchmarks contain coarse emotion labels which do not consider the diversity of emotions that an image and text can elicit in humans for various reasons. Learning diverse reactions to multimodal content is important as intelligent machines take a central role in generating and delivering content to society. To address this gap, we propose Socratis, a societal reactions benchmark, where each image-caption (IC) pair is annotated with multiple emotions and the reasons for feeling them. Socratis contains 18K free-form reactions for 980 emotions on 2075 image-caption pairs from 5 widely-read news and image-caption (IC) datasets. We benchmark the capability of state-of-the-art multimodal large language models to generate the reasons for feeling an emotion given an IC pair. Based on a preliminary human study, we observe that humans prefer human-written reasons more than twice as often as machine-generated ones. This shows our task is harder than standard generation tasks, in stark contrast to recent findings that humans cannot, for instance, tell apart machine- from human-written news articles. We further see that current captioning metrics based on large vision-language models also fail to correlate with human preferences. We hope that these findings and our benchmark will inspire further research on training emotionally aware models.
https://arxiv.org/abs/2308.16741
Text-based Visual Question Answering (TextVQA) aims at answering questions about the text in images. Most works in this field focus on designing network structures or pre-training tasks. All these methods list the OCR texts in reading order (from left to right and top to bottom) to form a sequence, which is treated as a natural language "sentence". However, they ignore the fact that most OCR words in the TextVQA task do not have a semantic contextual relationship. In addition, these approaches use 1-D position embedding to construct the spatial relation between OCR tokens sequentially, which is not reasonable: 1-D position embedding can only represent the left-right sequence relationship between words in a sentence, not the complex spatial position relationship. To tackle these problems, we propose a novel method named Separate and Locate (SaL) that explores text contextual cues and designs spatial position embedding to construct spatial relations between OCR texts. Specifically, we propose a Text Semantic Separate (TSS) module that helps the model recognize whether words have semantic contextual relations. Then, we introduce a Spatial Circle Position (SCP) module that helps the model better construct and reason about the spatial position relationships between OCR texts. Our SaL model outperforms the baseline model by 4.44% and 3.96% accuracy on the TextVQA and ST-VQA datasets. Compared with the state-of-the-art method pre-trained on 64 million samples, our method, without any pre-training tasks, still achieves 2.68% and 2.52% accuracy improvements on TextVQA and ST-VQA. Our code and models will be released at this https URL.
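The point about 1-D position embeddings can be made concrete: serializing OCR boxes in reading order assigns each token a single sequence index, discarding the 2-D layout. In the invented example below, two tokens that are vertically adjacent on the page end up two positions apart in the sequence, while a spatially distant token sits right next to one of them:

```python
# Sketch of the limitation SaL targets: reading-order serialization
# gives OCR tokens only a 1-D index, losing 2-D layout. Token (text,
# x, y) coordinates are illustrative assumptions.

def reading_order(tokens, row_tol=10):
    """Sort (text, x, y) tokens top-to-bottom, then left-to-right."""
    return sorted(tokens, key=lambda t: (round(t[2] / row_tol), t[1]))

tokens = [("total", 5, 40), ("shop", 5, 0), ("$9", 80, 40), ("best", 80, 0)]
ordered = reading_order(tokens)
seq_pos = {t[0]: i for i, t in enumerate(ordered)}

# "shop" and "total" share an x-coordinate and are vertically adjacent,
# yet their 1-D positions differ by 2; the far-right "best" is index 1.
```

A spatial position embedding such as SaL's SCP module instead encodes the actual 2-D relationships between boxes rather than the serialized index.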
https://arxiv.org/abs/2308.16383
Typical text recognition methods rely on an encoder-decoder structure, in which the encoder extracts features from an image, and the decoder produces recognized text from these features. In this study, we propose a simpler and more effective method for text recognition, known as the Decoder-only Transformer for Optical Character Recognition (DTrOCR). This method uses a decoder-only Transformer to take advantage of a generative language model that is pre-trained on a large corpus. We examined whether a generative language model that has been successful in natural language processing can also be effective for text recognition in computer vision. Our experiments demonstrated that DTrOCR outperforms current state-of-the-art methods by a large margin in the recognition of printed, handwritten, and scene text in both English and Chinese.
https://arxiv.org/abs/2308.15996
The study investigates the potential of post-OCR models to overcome limitations in OCR models and explores the impact of incorporating glyph embedding on post-OCR correction performance. In this study, we have developed our own post-OCR correction model. The novelty of our approach lies in embedding the OCR output using CharBERT and our unique embedding technique, capturing the visual characteristics of characters. Our findings show that post-OCR correction effectively addresses deficiencies in inferior OCR models, and glyph embedding enables the model to achieve superior results, including the ability to correct individual words.
https://arxiv.org/abs/2308.15262