Paper Reading AI Learner

Can Large Language Models put 2 and 2 together? Probing for Entailed Arithmetical Relationships

2024-04-30 10:28:04
D. Panas, S. Seth, V. Belle

Abstract

Two major areas of interest in the era of Large Language Models concern what LLMs know, and whether and how they may be able to reason (or rather, approximately reason). Since to date these lines of work have progressed largely in parallel (with notable exceptions), we are interested in investigating the intersection: probing for reasoning about implicitly held knowledge. Suspecting that performance is lacking in this area, we use a very simple setup of comparisons between cardinalities associated with elements of various subjects (e.g. the number of legs a bird has versus the number of wheels on a tricycle). We empirically demonstrate that although LLMs make steady progress in knowledge acquisition and (pseudo)reasoning with each new GPT release, their capabilities are limited to statistical inference. It is difficult to argue that pure statistical learning can cope with the combinatorial explosion inherent in many commonsense reasoning tasks, especially once arithmetical notions are involved. Further, we argue that bigger is not always better, and that chasing purely statistical improvements is flawed at the core, since it only exacerbates the dangerous conflation of producing correct answers with genuine reasoning ability.
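
The probing setup the abstract describes lends itself to a compact illustration. The Python sketch below is not the authors' code; it shows one plausible reading of the method. The FACTS table of (subject, attribute) -> count pairs is illustrative world knowledge chosen here, and the OpenAI chat-completions call (with the model name "gpt-4") is an assumed querying backend; any chat-style LLM API could be substituted.

# A minimal sketch of the cardinality-comparison probe described in the
# abstract, not the authors' actual code. FACTS holds illustrative
# commonsense cardinalities; the OpenAI call is an assumed backend.
import itertools
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Implicitly held knowledge the probe relies on: (subject, attribute) -> count.
FACTS = {
    ("a bird", "legs"): 2,
    ("a tricycle", "wheels"): 3,
    ("a spider", "legs"): 8,
    ("a car", "wheels"): 4,
}

def make_probe(a, b):
    """Build a yes/no question whose answer is entailed by two stored facts."""
    (subj_a, attr_a), (subj_b, attr_b) = a, b
    question = (f"Does {subj_a} have more {attr_a} than "
                f"{subj_b} has {attr_b}? Answer yes or no.")
    gold = "yes" if FACTS[a] > FACTS[b] else "no"
    return question, gold

def query_llm(prompt, model="gpt-4"):
    """Ask the model a single probe question and return its raw answer."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    correct = total = 0
    for a, b in itertools.permutations(FACTS, 2):  # all ordered pairs
        question, gold = make_probe(a, b)
        answer = query_llm(question).strip().lower()
        correct += answer.startswith(gold)  # bool counts as 0/1
        total += 1
    print(f"accuracy: {correct}/{total}")

Note how even four facts already yield twelve ordered comparisons; the number of probes grows quadratically in the number of facts, which is the combinatorial explosion the abstract argues pure statistical learning cannot cover.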

URL

https://arxiv.org/abs/2404.19432

PDF

https://arxiv.org/pdf/2404.19432.pdf

