Layout-aware Webpage Quality Assessment

2023-01-28 10:27:53
Anfeng Cheng, Yiding Liu, Weibin Li, Qian Dong, Shuaiqiang Wang, Zhengjie Huang, Shikun Feng, Zhicong Cheng, Dawei Yin

Abstract

Identifying high-quality webpages is fundamental for real-world search engines, because it lets them fulfil users' information needs with less cognitive burden. Early studies of webpage quality assessment usually design hand-crafted features that may only work on particular categories of webpages (e.g., shopping websites, medical websites). Such features can hardly be applied to real-world search engines that serve trillions of webpages of various types and purposes. In this paper, we propose a novel layout-aware webpage quality assessment model, which is currently deployed in our search engine. Intuitively, layout is a universal and critical dimension for assessing the quality of different categories of webpages. Based on this, we directly use the metadata that describes a webpage, i.e., its Document Object Model (DOM) tree, as the input of our model. The DOM tree unifies the representation of webpages with different categories and purposes and indicates their layout. To assess webpage quality from complex DOM tree data, we propose a graph neural network (GNN) based method that extracts rich layout-aware information implying webpage quality in an end-to-end manner. Moreover, we improve the GNN method with an attentive readout function, external web categories, and a category-aware sampling method. We conduct rigorous offline and online experiments to show that our proposed solution is effective in real search engines, improving overall usability and user experience.
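
The pipeline the abstract describes (DOM tree as a graph, GNN over it, attentive readout into one page-level vector) can be illustrated with a small sketch. This is not the authors' model: the abstract gives no architecture details, so everything below is an assumption made for illustration. The names `DomGraphBuilder`, `dom_to_graph`, `featurize`, `message_pass`, and `attentive_readout` are hypothetical, node features are plain one-hot tag indicators, the single message-passing step is an unweighted neighbor mean, and the attention vector `w` is an untrained placeholder.

```python
import math
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, enough for the sketch).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class DomGraphBuilder(HTMLParser):
    """Collects DOM element nodes and parent-child edges (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = []    # node id -> tag name
        self.edges = []   # (parent id, child id)
        self._stack = []  # ids of currently open elements

    def handle_starttag(self, tag, attrs):
        node_id = len(self.tags)
        self.tags.append(tag)
        if self._stack:
            self.edges.append((self._stack[-1], node_id))
        if tag not in VOID_TAGS:
            self._stack.append(node_id)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag name, if there is one.
        for i in range(len(self._stack) - 1, -1, -1):
            if self.tags[self._stack[i]] == tag:
                del self._stack[i:]
                break


def dom_to_graph(html):
    """Return (tags, edges) for the element tree of an HTML string."""
    builder = DomGraphBuilder()
    builder.feed(html)
    builder.close()
    return builder.tags, builder.edges


def featurize(tags, vocab):
    """One-hot tag features; a stand-in for real node features."""
    return [[1.0 if t == v else 0.0 for v in vocab] for t in tags]


def message_pass(feats, edges):
    """One step of mean aggregation over each node and its tree neighbors."""
    neighbors = [[i] for i in range(len(feats))]  # start with a self-loop
    for parent, child in edges:
        neighbors[parent].append(child)
        neighbors[child].append(parent)
    dim = len(feats[0])
    return [
        [sum(feats[j][d] for j in nb) / len(nb) for d in range(dim)]
        for nb in neighbors
    ]


def attentive_readout(feats, w):
    """Softmax-weighted sum of node vectors; w is an assumed scoring vector."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in feats]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(feats[0])
    graph_vec = [sum(a * x[d] for a, x in zip(alphas, feats)) for d in range(dim)]
    return graph_vec, alphas


# Usage: turn a tiny page into one fixed-size vector.
page = "<html><body><div><p>hi</p><br><p>there</p></div></body></html>"
tags, edges = dom_to_graph(page)
vocab = sorted(set(tags))
hidden = message_pass(featurize(tags, vocab), edges)
w = [0.1 * (i + 1) for i in range(len(vocab))]  # placeholder, not learned
graph_vec, alphas = attentive_readout(hidden, w)
```

In a trained system the scoring vector and the aggregation weights would be learned, and the page-level vector would feed a quality classifier or regressor. The external web categories and category-aware sampling mentioned in the abstract are not covered by this sketch.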

URL

https://arxiv.org/abs/2301.12152

PDF

https://arxiv.org/pdf/2301.12152.pdf

