Caption
2023-01-31
UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers
Dachuan Shi, Chaofan Tao, Ying Jin, Zhendong Yang, Chun Yuan, Jiaqi Wang
arXiv_CV
Image_Caption
Transformer
Optimization
Image_Retrieval
Pose
Classification
VQA
Attention
Caption
Image_Classification
PDF
2023-01-30
STAIR: Learning Sparse Text and Image Representation in Grounded Tokens
Chen Chen, Bowen Zhang, Liangliang Cao, Jiguang Shen, Tom Gunter, Albin Madappally Jose, Alexander Toshev, Jonathon Shlens, Ruoming Pang, Yinfei Yang
arXiv_CV
Image_Caption
Embedding
Zero-Shot
Sparse
Matching
PDF
2023-01-30
PromptMix: Text-to-image diffusion models enhance the performance of lightweight networks
Arian Bakhtiarnia, Qi Zhang, Alexandros Iosifidis
arXiv_CV
Image_Caption
Deep_Learning
Caption
PDF
2023-01-30
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi
arXiv_CV
Image_Caption
Transformer
Zero-Shot
Represenation_Learning
Pose
VQA
Text_Generation
Language_Model
QA
PDF
2023-01-28
ACL-Fig: A Dataset for Scientific Figure Classification
Zeba Karishma, Shaurya Rohatgi, Kavya Shrinivas Puranik, Jian Wu, C. Lee Giles
arXiv_AI
Pose
Classification
Caption
PDF
2023-01-27
Semi-Parametric Video-Grounded Text Generation
Sungdong Kim, Jin-Hwa Kim, Jiyoung Lee, Minjoon Seo
arXiv_CV
Video_Caption
Pose
Action
VQA
Attention
Text_Generation
Caption
Activity
Language_Model
QA
PDF
2023-01-26
Style-Aware Contrastive Learning for Multi-Style Image Captioning
Yucheng Zhou, Guodong Long
arXiv_CV
Image_Caption
Pose
Contrastive_Learning
Relation
Caption
PDF
2023-01-26
MusicLM: Generating Music From Text
Andrea Agostinelli, Timo I. Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, Matt Sharifi, Neil Zeghidour, Christian Frank
arXiv_SD
Pose
Caption
PDF
2023-01-26
Improving Statistical Fidelity for Neural Image Compression with Implicit Local Likelihood Models
Matthew J. Muckley, Alaaeldin El-Nouby, Karen Ullrich, Hervé Jégou, Jakob Verbeek
arXiv_AI
Image_Caption
Reconstruction
Adversarial
Image_Compression
PDF
2023-01-26
Semi-Supervised Image Captioning by Adversarially Propagating Labeled Data
Dong-Jin Kim, Tae-Hyun Oh, Jinsoo Choi, In So Kweon
arXiv_AI
Image_Caption
Adversarial
Pose
Relation
Caption
PDF
2023-01-26
Explaining Visual Biases as Words by Generating Captions
Younghyun Kim, Sangwoo Mo, Minkyu Kim, Kyungmin Lee, Jaeho Lee, Jinwoo Shin
arXiv_CV
Transformer
Embedding
Zero-Shot
Pose
Relation
Caption
Language_Model
PDF
2023-01-26
Paraphrase Acquisition from Image Captions
Marcel Gohsen, Matthias Hagen, Martin Potthast, Benno Stein
arXiv_CL
Image_Caption
Pose
Caption
PDF
2023-01-23
OvarNet: Towards Open-vocabulary Object Attribute Recognition
Keyan Chen, Xiaolong Jiang, Yao Hu, Xu Tang, Yan Gao, Jianqi Chen, Weidi Xie
arXiv_CV
Image_Caption
Recognition
Weakly_Supervised
Knowledge
Pose
Classification
Detection
Object_Detection
Caption
Prediction
PDF
2023-01-23
Lexi: Self-Supervised Learning of the UI Language
Pratyay Banerjee, Shweti Mahajan, Kushal Arora, Chitta Baral, Oriana Riva
arXiv_AI
Recognition
Image_Retrieval
Self-Supervised
Pose
Face
Action
Caption
Language_Model
PDF
2023-01-23
Self-Supervised Image Representation Learning: Transcending Masking with Paired Image Overlay
Yinheng Li, Han Ding, Shaofei Wang
arXiv_CV
Image_Caption
Represenation_Learning
Self-Supervised
Pose
Contrastive_Learning
PDF
2023-01-22
Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction
Razvan-George Pasca, Alexey Gavryushin, Yen-Ling Kuo, Otmar Hilliges, Xi Wang
arXiv_CV
Image_Caption
Transformer
Knowledge
Pose
Action
Relation
Caption
Prediction
PDF
2023-01-22
Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision
Jilan Xu, Junlin Hou, Yuejie Zhang, Rui Feng, Yi Wang, Yu Qiao, Weidi Xie
arXiv_CV
Transformer
Segmentation
Embedding
Semantic_Segmentation
Zero-Shot
Pose
Attention
Caption
Prediction
PDF
2023-01-21
Raw or Cooked? Object Detection on RAW Images
William Ljungbergh, Joakim Johnander, Christoffer Petersson, Michael Felsberg
arXiv_CV
Image_Caption
Pose
Detection
Object_Detection
PDF
2023-01-20
Same Words, Different Meanings: Interpretable Predictions of Polarization Trends in Broadcast Media Language and Granger Causal Effects on Public Discourse
Xiaohan Ding, Mike Horning, Eugenia H. Rho
arXiv_CL
Salient
Relation
Caption
Prediction
PDF
2023-01-20
Visual Semantic Relatedness Dataset for Image Captioning
Ahmed Sabir, Francesc Moreno-Noguer, Lluís Padró
arXiv_CV
Image_Caption
Knowledge
Pose
Relation
Caption
PDF
2023-01-19
Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture
Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, Nicolas Ballas
arXiv_AI
Image_Caption
Transformer
Embedding
Self-Supervised
Action
Classification
Prediction
PDF
2023-01-18
Temporal Perceiving Video-Language Pre-training
Fan Ma, Xiaojie Jin, Heng Wang, Jingjia Huang, Linchao Zhu, Jiashi Feng, Yi Yang
arXiv_AI
Transformer
Video_Caption
Action_Localization
Contrastive_Learning
Action
Caption
Video_Retrieval
PDF
2023-01-18
Towards Models that Can See and Read
Roy Ganz, Oren Nuriel, Aviad Aberdam, Yair Kittenplon, Shai Mazor, Ron Litman
arXiv_CV
Image_Caption
Transformer
Pose
VQA
Caption
Language_Model
QA
PDF
2023-01-18
ViT-AE++: Improving Vision Transformer Autoencoder for Self-supervised Medical Image Representations
Chinmay Prabhakar, Hongwei Bran Li, Jiancheng Yang, Suprosana Shit, Benedikt Wiestler, Bjoern Menze
arXiv_CV
Image_Caption
Transformer
Reconstruction
3D
Self-Supervised
Pose
Contrastive_Learning
Attention
Medical
PDF
2023-01-17
Embodied Agents for Efficient Exploration and Smart Scene Description
Roberto Bigazzi, Marcella Cornia, Silvia Cascianelli, Lorenzo Baraldi, Rita Cucchiara
arXiv_CV
Image_Caption
Knowledge
Pose
Action
Quantitative
Relation
Caption
Autonomous
PDF
2023-01-17
GLIGEN: Open-Set Grounded Text-to-Image Generation
Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, Yong Jae Lee
arXiv_AI
Zero-Shot
Knowledge
Pose
Caption
PDF
2023-01-17
Building Scalable Video Understanding Benchmarks through Sports
Aniket Agarwal, Alex Zhang, Karthik Narasimhan, Igor Gilitschenski, Vishvak Murahari, Yash Kant
arXiv_CV
Video_Caption
Action
PDF
2023-01-15
Generating Templated Caption for Video Grounding
Hongxiang Li, Meng Cao, Xuxin Cheng, Zhihong Zhu, Yaowei Li, Yuexian Zou
arXiv_CV
Video_Caption
Pose
Contrastive_Learning
Action
Relation
Attention
Caption
Activity
Matching
PDF
2023-01-14
Music Playlist Title Generation Using Artist Information
Haven Kim, SeungHeon Doh, Junwon Lee, Juhan Nam
arXiv_CL
Pose
Caption
Recommendation
PDF
2023-01-12
See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning
Zhenfang Chen, Qinhong Zhou, Yikang Shen, Yining Hong, Hao Zhang, Chuang Gan
arXiv_AI
Knowledge
Pose
Few-Shot
Caption
Language_Model
PDF
2023-01-09
Explainable, Physics Aware, Trustworthy AI Paradigm Shift for Synthetic Aperture Radar
Mihai Datcu, Zhongling Huang, Andrei Anghel, Juanping Zhao, Remus Cacoveanu
arXiv_AI
Image_Caption
Recognition
Knowledge
Pose
PDF
2023-01-09
Cursive Caption Text Detection in Videos
Ali Mirza, Imran Siddiqi
arXiv_AI
Detection
Object_Detection
Summarization
Caption
CNN
PDF
2023-01-08
STPrivacy: Spatio-Temporal Tubelet Sparsification and Anonymization for Privacy-preserving Action Recognition
Ming Li, Jun Liu, Hehe Fan, Jia-Wei Liu, Jiahe Li, Mike Zheng Shou, Jussi Keppo
arXiv_CV
Transformer
Recognition
Video_Caption
Pose
Action_Recognition
Action
CNN
PDF
2023-01-06
End-to-End 3D Dense Captioning with Vote2Cap-DETR
Sijin Chen, Hongyuan Zhu, Xin Chen, Yinjie Lei, Tao Chen, Gang Yu
arXiv_CV
Transformer
3D
Pose
Detection
Caption
Prediction
PDF
2023-01-06
An Image captioning algorithm based on the Hybrid Deep Learning Technique
Rana Adnan Ahmad, Muhammad Azhar, Hina Sattar
arXiv_AI
Image_Caption
Reconstruction
RNN
Deep_Learning
Caption
PDF
2023-01-05
EgoDistill: Egocentric Head Motion Distillation for Efficient Video Understanding
Shuhan Tan, Tushar Nagarajan, Kristen Grauman
arXiv_CV
Video_Caption
Sparse
Self-Supervised
Pose
PDF
2023-01-05
ANNA: Abstractive Text-to-Image Synthesis with Filtered News Captions
Aashish Anantha Ramakrishnan, Sharon X. Huang, Dongwon Lee
arXiv_CV
Image_Caption
Transfer_Learning
Relation
Caption
PDF
2023-01-05
Test of Time: Instilling Video-Language Models with a Sense of Time
Piyush Bagad, Makarand Tapaswi, Cees G. M. Snoek
arXiv_AI
Video_Caption
Zero-Shot
Pose
Relation
Language_Model
PDF
2023-01-05
Adaptively Clustering Neighbor Elements for Image Captioning
Zihua Wang, Xu Yang, Haiyang Xu, Hanwang Zhang, Chenliang Li, Songfang Huang, Fei Huang, Yu Zhang
arXiv_CV
Image_Caption
Transformer
Bert
Attention
Caption
PDF
2023-01-03
An Empirical Investigation into the Use of Image Captioning for Automated Software Documentation
Kevin Moran, Ali Yachnes, George Purnell, Junayed Mahmud, Michele Tufano, Carlos Bernal-Cárdenas, Denys Poshyvanyk, Zach H'Doubler
arXiv_AI
Image_Caption
Salient
Face
Quantitative
Caption
PDF
2023-01-02
Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation
Jianzong Wu, Xiangtai Li, Henghui Ding, Xia Li, Guangliang Cheng, Yunhai Tong, Chen Change Loy
arXiv_CV
Image_Caption
Transformer
Segmentation
Pose
Caption
PDF
2023-01-02
PCRLv2: A Unified Visual Information Preservation Framework for Self-supervised Pre-training in Medical Image Analysis
Hong-Yu Zhou, Chixiang Lu, Chaoqi Chen, Sibei Yang, Yizhou Yu
arXiv_CV
Image_Caption
Transformer
Segmentation
3D
Restoration
Optimization
Self-Supervised
Pose
Detection
Attention
GAN
Medical
PDF
2022-12-31
Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?
Wenhao Wu, Haipeng Luo, Bo Fang, Jingdong Wang, Wanli Ouyang
arXiv_CV
Video_Caption
Zero-Shot
Knowledge
Pose
Action
Caption
Video_Retrieval
Matching
PDF
2022-12-29
Learning Multimodal Data Augmentation in Feature Space
Zichang Liu, Zhiqiang Tang, Xingjian Shi, Aston Zhang, Mu Li, Anshumali Shrivastava, Andrew Gordon Wilson
arXiv_CV
Classification
Deep_Learning
Relation
Caption
Image_Classification
PDF
2022-12-28
Joint Engagement Classification using Video Augmentation Techniques for Multi-person Human-robot Interaction
Yubin Kim, Huili Chen, Sharifa Alghowinem, Cynthia Breazeal, Hae Won Park
arXiv_CV
Recognition
Video_Caption
Action
Classification
Deep_Learning
Autonomous
PDF
2022-12-28
Curator: Creating Large-Scale Curated Labelled Datasets using Self-Supervised Learning
Tarun Narayanan, Ajay Krishnan, Anirudh Koul, Siddha Ganju
arXiv_CV
Image_Caption
Self-Supervised
PDF
2022-12-27
Using Large Language Models to Generate Engaging Captions for Data Visualizations
Ashley Liew, Klaus Mueller
arXiv_AI
Pose
Deep_Learning
Caption
Language_Model
PDF
2022-12-27
Noise-aware Learning from Web-crawled Image-Text Data for Image Captioning
Wooyoung Kang, Jonghwan Mun, Sungjun Lee, Byungseok Roh
arXiv_AI
Image_Caption
Zero-Shot
Image_Retrieval
Knowledge
Pose
Caption
Inference
PDF
2022-12-24
On Realization of Intelligent Decision-Making in the Real World: A Foundation Decision Model Perspective
Ying Wen, Ziyu Wan, Ming Zhou, Shufang Hou, Zhe Cao, Chenyang Le, Jingxiao Chen, Zheng Tian, Weinan Zhang, Jun Wang
arXiv_AI
Transformer
Reinforcement_Learning
Text_Generation
Caption
Autonomous
PDF
2022-12-23
Do DALL-E and Flamingo Understand Each Other?
Hang Li, Jindong Gu, Rajat Koner, Sahand Sharifzadeh, Volker Tresp
arXiv_CV
Image_Caption
Transformer
Represenation_Learning
Pose
Relation
Caption
PDF
2022-12-22
When are Lemons Purple? The Concept Association Bias of CLIP
Yutaro Yamada, Yingtian Tang, Ilker Yildirim
arXiv_CV
Image_Caption
Transformer
Embedding
Zero-Shot
Classification
VQA
Image_Classification
Language_Model
Prediction
QA
PDF
2022-12-22
Confidence-Aware Paced-Curriculum Learning by Label Smoothing for Surgical Scene Understanding
Mengya Xu, Mobarakol Islam, Ben Glocker, Hongliang Ren
arXiv_CV
Segmentation
Semantic_Segmentation
Pose
Classification
Attention
Caption
Prediction
PDF
2022-12-21
Generalized Decoding for Pixel, Image, and Language
Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, Nanyun Peng, Lijuan Wang, Yong Jae Lee, Jianfeng Gao
arXiv_CV
Transformer
Segmentation
Zero-Shot
Action
Caption
PDF
2022-12-21
Contrastive Language-Vision AI Models Pretrained on Web-Scraped Multimodal Data Exhibit Sexual Objectification Bias
Robert Wolfe, Yiwei Yang, Bill Howe, Aylin Caliskan
arXiv_AI
Image_Caption
Embedding
Recognition
Emotion
GAN
Caption
PDF
2022-12-21
RECAP: Retrieval Augmented Music Captioner
Zihao He, Weituo Hao, Xuchen Song
arXiv_CL
Pose
Contrastive_Learning
Attention
Caption
Recommendation
PDF
2022-12-20
METEOR Guided Divergence for Video Captioning
Daniel Lukas Rothenpieler, Shahin Amiriparian
arXiv_CV
Transformer
Reinforcement_Learning
Video_Caption
Pose
Action
Attention
Caption
Activity
PDF
2022-12-20
Does CLIP Bind Concepts? Probing Compositionality in Large Image Models
Martha Lewis, Qinan Yu, Jack Merullo, Ellie Pavlick
arXiv_AI
Knowledge
Relation
Caption
Language_Model
PDF
2022-12-20
Towards Unsupervised Visual Reasoning: Do Off-The-Shelf Features Know How to Reason?
Monika Wysoczańska, Tom Monnier, Tomasz Trzciński, David Picard
arXiv_AI
Image_Caption
Unsupervised
Represenation_Learning
Pose
Action
Relation
VQA
Attention
PDF
2022-12-19
MetaCLUE: Towards Comprehensive Visual Metaphors Research
Arjun R. Akula, Brendan Driscoll, Pradyumna Narayana, Soravit Changpinyo, Zhiwei Jia, Suyash Damle, Garima Pruthi, Sugato Basu, Leonidas Guibas, William T. Freeman, Yuanzhen Li, Varun Jampani
arXiv_CV
Action
Classification
Relation
Caption
PDF
2022-12-19
Position-guided Text Prompt for Vision-Language Pre-training
Alex Jinpeng Wang, Pan Zhou, Mike Zheng Shou, Shuicheng Yan
arXiv_CV
Transformer
Zero-Shot
Pose
Detection
Object_Detection
Caption
Inference
PDF
2022-12-19
MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering
Fangyu Liu, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Yasemin Altun, Nigel Collier, Julian Martin Eisenschlos
arXiv_AI
Image_Caption
Transformer
Pose
Language_Model
QA
PDF
2022-12-19
Transferring General Multimodal Pretrained Models to Text Recognition
Junyang Lin, Xuancheng Ren, Yichang Zhang, Gao Liu, Peng Wang, An Yang, Chang Zhou
arXiv_CV
Image_Caption
Transformer
Recognition
OCR
Pose
Caption
PDF
2022-12-18
Efficient Image Captioning for Edge Devices
Ning Wang, Jiangrong Xie, Hang Luo, Qinglin Cheng, Jihao Wu, Mingbo Jia, Linlin Li
arXiv_CV
Image_Caption
Pose
Detection
Object_Detection
Caption
Inference
Prediction
PDF
2022-12-17
Inductive Attention for Video Action Anticipation
Tsung-Ming Tai, Giuseppe Fiameni, Cheng-Kuang Lee, Simon See, Oswald Lanz
arXiv_CV
Recognition
Video_Caption
Pose
Action_Recognition
Action
Attention
Prediction
PDF
2022-12-15
Objaverse: A Universe of Annotated 3D Objects
Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, Ali Farhadi
arXiv_AI
Segmentation
3D
Caption
PDF
2022-12-15
Are Multimodal Models Robust to Image and Text Perturbations?
Jielin Qiu, Yi Zhu, Xingjian Shi, Florian Wenzel, Zhiqiang Tang, Ding Zhao, Bo Li, Mu Li
arXiv_CV
Image_Caption
Pose
Caption
PDF
2022-12-14
Towards Smooth Video Composition
Qihang Zhang, Ceyuan Yang, Yujun Shen, Yinghao Xu, Bolei Zhou
arXiv_CV
Video_Caption
Knowledge
Adversarial
Pose
Relation
GAN
PDF
2022-12-14
NLIP: Noise-robust Language-Image Pre-training
Runhui Huang, Yanxin Long, Jianhua Han, Hang Xu, Xiwen Liang, Chunjing Xu, Xiaodan Liang
arXiv_CV
Image_Caption
Transformer
Zero-Shot
Regularization
Pose
Classification
Caption
PDF
2022-12-14
A novel state connection strategy for quantum computing to represent and compress digital images
Md Ershadul Haque, Manoranjan Paul, Tanmoy Debnath
arXiv_CV
Image_Caption
Pose
Attention
PDF
2022-12-14
Cross-Modal Similarity-Based Curriculum Learning for Image Captioning
Hongkuan Zhang, Saku Sugawara, Akiko Aizawa, Lei Zhou, Ryohei Sasano, Koichi Takeda
arXiv_CV
Image_Caption
Transformer
Pose
Caption
Language_Model
PDF
2022-12-13
CREPE: Can Vision-Language Foundation Models Reason Compositionally?
Zixian Ma, Jerry Hong, Mustafa Omer Gul, Mona Gandhi, Irena Gao, Ranjay Krishna
arXiv_CV
Transformer
Caption
PDF
2022-12-13
LidarCLIP or: How I Learned to Talk to Point Clouds
Georg Hess, Adam Tonderski, Christoffer Petersson, Lennart Svensson, Kalle Åström
arXiv_CV
Embedding
Point_Cloud
Pose
Detection
Attention
Caption
PDF
2022-12-13
Egocentric Video Task Translation
Zihui Xue, Yale Song, Kristen Grauman, Lorenzo Torresani
arXiv_CV
Tracking
Video_Caption
Pose
Action
PDF
2022-12-12
Contextual Explainable Video Representation: Human Perception-based Understanding
Khoa Vo, Kashu Yamazaki, Phong X. Nguyen, Phat Nguyen, Khoa Luu, Ngan Le
arXiv_CV
Recognition
Video_Caption
Action_Recognition
Action
Detection
Relation
Caption
Video_Retrieval
PDF
2022-12-11
MAViC: Multimodal Active Learning for Video Captioning
Gyanendra Das, Xavier Thomas, Anant Raj, Vikram Gupta
arXiv_CV
Video_Caption
Pose
Caption
PDF
2022-12-10
REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory
Ziniu Hu, Ahmet Iscen, Chen Sun, Zirui Wang, Kai-Wei Chang, Yizhou Sun, Cordelia Schmid, David A. Ross, Alireza Fathi
arXiv_AI
Image_Caption
Transformer
Knowledge
Knowledge_Graph
Pose
VQA
Caption
Language_Model
PDF
2022-12-09
Audiovisual Masked Autoencoders
Mariana-Iuliana Georgescu, Eduardo Fonseca, Radu Tudor Ionescu, Mario Lucic, Cordelia Schmid, Anurag Arnab
arXiv_CV
Image_Caption
Represenation_Learning
Self-Supervised
Classification
PDF
2022-12-09
Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners
Shen Yan, Tao Zhu, Zirui Wang, Yuan Cao, Mi Zhang, Soham Ghosh, Yonghui Wu, Jiahui Yu
arXiv_CV
Embedding
Video_Caption
Zero-Shot
Classification
VQA
Attention
Caption
Activity
QA
Video_Retrieval
Video_Classification
PDF
2022-12-08
PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers using Synthetic Scene Data
Roei Herzig, Ofir Abramovich, Elad Ben-Avraham, Assaf Arbelle, Leonid Karlinsky, Ariel Shamir, Trevor Darrell, Amir Globerson
arXiv_CV
Transformer
Recognition
Video_Caption
3D
Pose
Action_Recognition
Action
Relation
PDF
2022-12-08
CiaoSR: Continuous Implicit Attention-in-Attention Network for Arbitrary-Scale Image Super-Resolution
Jiezhang Cao, Qin Wang, Yongqin Xian, Yawei Li, Bingbing Ni, Zhiming Pi, Kai Zhang, Yulun Zhang, Radu Timofte, Luc Van Gool
arXiv_CV
Image_Caption
Super_Resolution
Pose
Attention
PDF
2022-12-08
Generating and Weighting Semantically Consistent Sample Pairs for Ultrasound Contrastive Learning
Yixiong Chen, Chunhui Zhang, Chris H. Q. Ding, Li Liu
arXiv_CV
Image_Caption
Transformer
Segmentation
Self-Supervised
Pose
Contrastive_Learning
Classification
Detection
Medical
PDF
2022-12-06
Switching to Discriminative Image Captioning by Relieving a Bottleneck of Reinforcement Learning
Ukyo Honda, Taro Watanabe, Yuji Matsumoto
arXiv_CV
Image_Caption
Reinforcement_Learning
Pose
Classification
Caption
PDF
2022-12-06
InternVideo: General Video Foundation Models via Generative and Discriminative Learning
Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, Limin Wang, Yu Qiao
arXiv_CV
Recognition
Video_Caption
Self-Supervised
Contrastive_Learning
Action_Recognition
Action
Detection
PDF
2022-12-06
Semantic-Conditional Diffusion Networks for Image Captioning
Jianjie Luo, Yehao Li, Yingwei Pan, Ting Yao, Jianlin Feng, Hongyang Chao, Tao Mei
arXiv_CV
Image_Caption
Transformer
Knowledge
Pose
Caption
PDF
2022-12-06
Adaptive Testing of Computer Vision Models
Irena Gao, Gabriel Ilharco, Scott Lundberg, Marco Tulio Ribeiro
arXiv_CV
Image_Caption
Face
Classification
Detection
Object_Detection
Caption
PDF
2022-12-06
Attend Who is Weak: Pruning-assisted Medical Image Localization under Sophisticated and Implicit Imbalances
Ajay Jaiswal, Tianlong Chen, Justin F. Rousseau, Yifan Peng, Ying Ding, Zhangyang Wang
arXiv_CV
Image_Caption
Pose
Classification
Attention
Medical
Image_Classification
PDF
2022-12-05
Towards Generating Diverse Audio Captions via Adversarial Training
Xinhao Mei, Xubo Liu, Jianyuan Sun, Mark D. Plumbley, Wenwu Wang
arXiv_AI
Adversarial
Pose
Attention
GAN
Caption
PDF
2022-12-04
Controllable Image Captioning via Prompting
Ning Wang, Jiahao Xie, Jihao Wu, Mingbo Jia, Linlin Li
arXiv_CV
Image_Caption
Embedding
Pose
Emotion
Caption
Inference
PDF
2022-12-03
Named Entity and Relation Extraction with Multi-Modal Retrieval
Xinyu Wang, Jiong Cai, Yong Jiang, Pengjun Xie, Kewei Tu, Wei Lu
arXiv_CL
Recognition
Knowledge
Pose
Action
Relation
Relation_Extraction
Caption
Prediction
PDF
2022-12-02
Generative Reasoning Integrated Label Noise Robust Deep Image Representation Learning in Remote Sensing
Gencer Sumbul, Begüm Demir
arXiv_CV
Image_Caption
Represenation_Learning
Deep_Learning
Attention
PDF
2022-12-02
3D-TOGO: Towards Text-Guided Cross-Category 3D Object Generation
Zutao Jiang, Guangsong Lu, Xiaodan Liang, Jihua Zhu, Wei Zhang, Xiaojun Chang, Hang Xu
arXiv_AI
3D
Optimization
Pose
Contrastive_Learning
Caption
PDF
2022-12-02
SimpleMind adds thinking to deep neural networks
Youngwon Choi, M. Wasil Wahi-Anwar, Matthew S. Brown
arXiv_AI
Image_Caption
Embedding
Optimization
Knowledge
Pose
Relation
Medical
Prediction
PDF
2022-12-02
QFF: Quantized Fourier Features for Neural Field Representations
Jae Yong Lee, Yuqun Wu, Chuhang Zou, Shenlong Wang, Derek Hoiem
arXiv_CV
Image_Caption
Pose
PDF
2022-12-01
Weakly Supervised Annotations for Multi-modal Greeting Cards Dataset
Sidra Hanif, Longin Jan Latecki
arXiv_AI
Image_Caption
Embedding
Weakly_Supervised
Pose
Caption
PDF
2022-12-01
Focus! Relevant and Sufficient Context Selection for News Image Captioning
Mingyang Zhou, Grace Luo, Anna Rohrbach, Zhou Yu
arXiv_CV
Image_Caption
Pose
Action
Relation
Relation_Extraction
Caption
PDF
2022-12-01
UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding
Dave Zhenyu Chen, Ronghang Hu, Xinlei Chen, Matthias Nießner, Angel X. Chang
arXiv_CV
Transformer
3D
Knowledge
Pose
Relation
Caption
PDF
2022-12-01
GRiT: A Generative Region-to-text Transformer for Object Understanding
Jialian Wu, Jianfeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan, Lijuan Wang
arXiv_CV
Transformer
Action
Detection
Object_Detection
Caption
PDF
2022-11-30
Uncertainty-Aware Image Captioning
Zhengcong Fei, Mingyuan Fan, Li Zhu, Junshi Huang, Xiaoming Wei, Xiaolin Wei
arXiv_CV
Image_Caption
Pose
Caption
Inference
PDF
2022-11-30
Improving Cross-Modal Retrieval with Set of Diverse Embeddings
Dongwon Kim, Namyup Kim, Suha Kwak
arXiv_CV
Embedding
Pose
Attention
Caption
Inference
Prediction
PDF
2022-11-30
Iterative Scene Graph Generation with Generative Transformers
Sanjoy Kundu, Sathyanarayanan N. Aakur
arXiv_CV
Transformer
Pose
Classification
Detection
Relation
Object_Detection
Caption
Inference
Prediction
PDF
2022-11-29
Procedural Image Programs for Representation Learning
Manel Baradad, Chun-Fu Chen, Jonas Wulff, Tongzhou Wang, Rogerio Feris, Antonio Torralba, Phillip Isola
arXiv_CV
Image_Caption
Transformer
Unsupervised
Represenation_Learning
Knowledge
Pose
PDF
2022-11-29
Language-driven Open-Vocabulary 3D Scene Understanding
Runyu Ding, Jihan Yang, Chuhui Xue, Wenqing Zhang, Song Bai, Xiaojuan Qi
arXiv_CV
Transformer
Segmentation
Embedding
3D
Zero-Shot
Represenation_Learning
Knowledge
Pose
Contrastive_Learning
Caption
PDF
2022-11-28
Satlas: A Large-Scale, Multi-Task Dataset for Remote Sensing Image Understanding
Favyen Bastani, Piper Wolters, Ritwik Gupta, Joe Ferdinando, Aniruddha Kembhavi
arXiv_CV
Image_Caption
Transformer
Tracking
Pose
PDF
2022-11-28
E2E Segmentation in a Two-Pass Cascaded Encoder ASR Model
W. Ronny Huang, Shuo-Yiin Chang, Tara N. Sainath, Yanzhang He, David Rybach, Robert David, Rohit Prabhavalkar, Cyril Allauzen, Cal Peyser, Trevor D. Strohman
arXiv_CL
Segmentation
Pose
Caption
Inference
PDF
2022-11-28
Task-Aware Asynchronous Multi-Task Model with Class Incremental Contrastive Learning for Surgical Scene Understanding
Lalithkumar Seenivasan, Mobarakol Islam, Mengya Xu, Chwee Ming Lim, Hongliang Ren
arXiv_AI
Transformer
Recognition
Optimization
Pose
Contrastive_Learning
Action_Recognition
Action
Detection
Attention
Caption
Prediction
PDF
2022-11-28
SLAN: Self-Locator Aided Network for Cross-Modal Understanding
Jiang-Tian Zhai, Qi Zhang, Tong Wu, Xing-Yu Chen, Jiang-Jiang Liu, Bo Ren, Ming-Ming Cheng
arXiv_CV
Image_Caption
Zero-Shot
Image_Retrieval
Pose
Detection
Object_Detection
PDF
2022-11-28
VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning
Kashu Yamazaki, Khoa Vo, Sang Truong, Bhiksha Raj, Ngan Le
arXiv_CV
Transformer
Embedding
Pose
Action
Relation
Caption
Activity
PDF
2022-11-28
Refined Semantic Enhancement towards Frequency Diffusion for Video Captioning
Xian Zhong, Zipeng Li, Shuqin Chen, Kui Jiang, Chen Chen, Mang Ye
arXiv_AI
Enhancement
Video_Caption
Pose
Caption
PDF
2022-11-28
Renmin University of China at TRECVID 2022: Improving Video Search by Feature Fusion and Negation Understanding
Xirong Li, Aozhu Chen, Ziyue Wang, Fan Hu, Kaibin Tian, Xinru Chen, Chengbo Dong
arXiv_CV
Transformer
Action
Attention
Caption
Language_Model
Video_Retrieval
PDF
2022-11-27
MGDoc: Pre-training with Multi-granular Hierarchy for Document Image Understanding
Zilong Wang, Jiuxiang Gu, Chris Tensmeyer, Nikolaos Barmpalios, Ani Nenkova, Tong Sun, Jingbo Shang, Vlad I. Morariu
arXiv_CV
Image_Caption
Transformer
Embedding
Pose
Relation
Attention
GAN
Language_Model
PDF
2022-11-27
CLID: Controlled-Length Image Descriptions with Limited Data
Elad Hirsch, Ayellet Tal
arXiv_CV
Image_Caption
Pose
Caption
PDF
2022-11-26
Predictive linguistic cues for fake news: a societal artificial intelligence problem
Sandhya Aneja, Nagender Aneja, Ponnurangam Kumaraguru
arXiv_CL
Image_Caption
Pose
Relation
Caption
PDF
2022-11-25
Aesthetically Relevant Image Captioning
Zhipeng Zhong, Fei Zhou, Guoping Qiu
arXiv_CV
Image_Caption
Action
Caption
QA
PDF
2022-11-25
Aggregated Text Transformer for Scene Text Detection
Zhao Zhou, Xiangcheng Du, Yingbin Zheng, Cheng Jin
arXiv_CV
Image_Caption
Transformer
Pose
Scene_Text
Action
Detection
Attention
PDF
2022-11-24
Self-supervised vision-language pretraining for Medical visual question answering
Pengfei Li, Gang Liu, Lin Tan, Jinying Liao, Shenjun Zhong
arXiv_AI
Image_Caption
Transformer
Self-Supervised
Pose
Contrastive_Learning
VQA
Medical
Caption
Language_Model
QA
Matching
PDF
2022-11-24
Shifted Diffusion for Text-to-image Generation
Yufan Zhou, Bingchen Liu, Yizhe Zhu, Xiao Yang, Changyou Chen, Jinhui Xu
arXiv_AI
Embedding
Zero-Shot
Knowledge
Pose
Quantitative
Caption
PDF
2022-11-23
InDiReCT: Language-Guided Zero-Shot Deep Metric Learning for Images
Konstantin Kobs, Michael Steininger, Andreas Hotho
arXiv_AI
Image_Caption
Embedding
Zero-Shot
Image_Retrieval
Pose
PDF
2022-11-23
Dynamic Appearance: A Video Representation for Action Recognition with Joint Training
Guoxi Huang, Adrian G. Bors
arXiv_CV
Transformer
Recognition
Video_Caption
Optimization
Pose
Action_Recognition
Action
PDF
2022-11-22
Retrieval-Augmented Multimodal Language Modeling
Michihiro Yasunaga, Armen Aghajanyan, Weijia Shi, Rich James, Jure Leskovec, Percy Liang, Mike Lewis, Luke Zettlemoyer, Wen-tau Yih
arXiv_CV
Image_Caption
Transformer
Knowledge
Pose
Text_Generation
Caption
Language_Model
PDF
2022-11-22
Progressive Learning with Cross-Window Consistency for Semi-Supervised Semantic Segmentation
Bo Dang, Yansheng Li
arXiv_CV
Image_Caption
Segmentation
Semantic_Segmentation
Pose
Medical
PDF
2022-11-22
Aligning Source Visual and Target Language Domains for Unpaired Video Captioning
Fenglin Liu, Xian Wu, Chenyu You, Shen Ge, Yuexian Zou, Xu Sun
arXiv_CV
Video_Caption
Pose
Caption
Inference
PDF
2022-11-22
Explaining Image Classifiers with Multiscale Directional Image Representation
Stefan Kolek, Robert Windesheim, Hector Andrade Loarca, Gitta Kutyniok, Ron Levie
arXiv_CV
Image_Caption
Regularization
Pose
PDF
2022-11-22
Knowledge Prompting for Few-shot Action Recognition
Yuheng Shi, Xinxiao Wu, Hanxi Lin
arXiv_CV
Transformer
Recognition
Knowledge
Pose
Action_Recognition
Action
Classification
Few-Shot
Caption
Language_Model
Matching
PDF
2022-11-21
Exploring Discrete Diffusion Models for Image Captioning
Zixin Zhu, Yixuan Wei, Jianfeng Wang, Zhe Gan, Zheng Zhang, Le Wang, Gang Hua, Lijuan Wang, Zicheng Liu, Han Hu
arXiv_CV
Image_Caption
Transformer
Pose
Attention
Caption
Inference
Prediction
PDF
2022-11-21
Place Recognition under Occlusion and Changing Appearance via Disentangled Representations
Yue Chen, Xingyu Chen
arXiv_CV
Image_Caption
Unsupervised
Recognition
Pose
Autonomous
PDF
2022-11-21
Self adaptive global-local feature enhancement for radiology report generation
Yuhao Wang, Kai Wang, Xiaohong Liu, Tianrun Gao, Jingyue Zhang, Guangyu Wang
arXiv_AI
Enhancement
Pose
Attention
Medical
Caption
PDF
2022-11-21
VectorFusion: Text-to-SVG by Abstracting Pixel-Based Diffusion Models
Ajay Jain, Amber Xie, Pieter Abbeel
arXiv_AI
3D
Sketch
Knowledge
Caption
PDF
2022-11-21
Contrastive Masked Autoencoders for Self-Supervised Video Hashing
Yuting Wang, Jinpeng Wang, Bin Chen, Ziyun Zeng, Shutao Xia
arXiv_AI
Video_Caption
Self-Supervised
Pose
Relation
Attention
Activity
Video_Retrieval
PDF
2022-11-21
Unifying Vision-Language Representation Space with Single-tower Transformer
Jiho Jang, Chaerin Kong, Donghyeon Jeon, Seonhoon Kim, Nojun Kwak
arXiv_CV
Transformer
Zero-Shot
Represenation_Learning
Pose
Contrastive_Learning
Caption
PDF
2022-11-20
How to Describe Images in a More Funny Way? Towards a Modular Approach to Cross-Modal Sarcasm Generation
Jie Ruan, Yue Wu, Xiaojun Wan, Yuesheng Zhu
arXiv_CV
Pose
Action
Relation
Sentiment
Text_Generation
Caption
PDF
2022-11-20
Real-time Local Feature with Global Visual Information Enhancement
Jinyu Miao, Haosong Yue, Zhong Liu, Xingming Wu, Zaojun Fang, Guilin Yang
arXiv_CV
Image_Caption
Enhancement
Reinforcement_Learning
Pose
Deep_Learning
CNN
Matching
PDF
2022-11-20
Synthesizing Coherent Story with Auto-Regressive Latent Diffusion Models
Xichen Pan, Pengda Qin, Yuhong Li, Hui Xue, Wenhu Chen
arXiv_CV
Knowledge
Pose
Quantitative
Caption
PDF
2022-11-20
ESTAS: Effective and Stable Trojan Attacks in Self-supervised Encoders with One Target Unlabelled Sample
Jiaqi Xue, Qian Lou
arXiv_CV
arXiv_CV
Image_Caption
Optimization
Self-Supervised
Pose
Action
PDF
2022-11-19
ArtELingo: A Million Emotion Annotations of WikiArt with Emphasis on Diversity over Language and Culture
Youssef Mohamed, Mohamed Abdelfattah, Shyma Alhuwaider, Feifan Li, Xiangliang Zhang, Kenneth Ward Church, Mohamed Elhoseiny
arXiv_AI
arXiv_AI
Emotion
Caption
PDF
2022-11-19
A Unified Model for Video Understanding and Knowledge Embedding with Heterogeneous Knowledge Graph Dataset
Jiaxin Deng, Dong Shen, Haojie Pan, Xiangyu Wu, Ximan Liu, Gaofeng Meng, Fan Yang, Size Li, Ruiji Fu, Zhongyuan Wang
arXiv_CV
arXiv_CV
Embedding
Video_Caption
Knowledge
Knowledge_Graph
Pose
Classification
Relation
Inference
Recommendation
Video_Retrieval
PDF
2022-11-19
Rethinking Batch Sample Relationships for Data Representation: A Batch-Graph Transformer based Approach
Xixi Wang, Bo Jiang, Xiao Wang, Bin Luo
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Sparse
Pose
Relation
Attention
PDF
2022-11-18
Impact of visual assistance for automated audio captioning
Wim Boes, Hugo Van hamme
arXiv_SD
arXiv_SD
Transformer
Embedding
Transfer_Learning
Detection
Caption
PDF
2022-11-18
Masked Autoencoders for Egocentric Video Understanding @ Ego4D Challenge 2022
Jiachen Lei, Shuang Ma, Zhongjie Ba, Sai Vemprala, Ashish Kapoor, Kui Ren
arXiv_CV
arXiv_CV
Video_Caption
Classification
PDF
2022-11-17
I Can't Believe There's No Images! Learning Visual Tasks Using only Language Data
Sophia Gu, Christopher Clark, Aniruddha Kembhavi
arXiv_CV
arXiv_CV
Image_Caption
Embedding
VQA
Caption
Language_Model
PDF
2022-11-17
UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer
Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Limin Wang, Yu Qiao
arXiv_CV
arXiv_CV
Transformer
Recognition
Video_Caption
Knowledge
Pose
Relation
Attention
Activity
PDF
2022-11-17
InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges
Guo Chen, Sen Xing, Zhe Chen, Yi Wang, Kunchang Li, Yizhuo Li, Yi Liu, Jiahao Wang, Yin-Dong Zheng, Bingkun Huang, Zhiyu Zhao, Junting Pan, Yifei Huang, Zun Wang, Jiashuo Yu, Yinan He, Hongjie Zhang, Tong Lu, Yali Wang, Limin Wang, Yu Qiao
arXiv_CV
arXiv_CV
Video_Caption
Action
Detection
Object_Detection
Prediction
PDF
2022-11-17
Visual Commonsense-aware Representation Network for Video Captioning
Pengpeng Zeng, Haonan Zhang, Lianli Gao, Xiangpeng Li, Jin Qian, Heng Tao Shen
arXiv_CV
arXiv_CV
Video_Caption
Knowledge
Pose
Relation
Caption
Inference
PDF
2022-11-17
Progressive Tree-Structured Prototype Network for End-to-End Image Captioning
Pengpeng Zeng, Jinkuan Zhu, Jingkuan Song, Lianli Gao
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Embedding
Pose
Relation
Caption
Inference
Prediction
PDF
2022-11-17
Feedback is Needed for Retakes: An Explainable Poor Image Notification Framework for the Visually Impaired
Kazuya Ohata, Shunsuke Kitada, Hitoshi Iyatomi
arXiv_CV
arXiv_CV
Image_Caption
Pose
Detection
Caption
PDF
2022-11-17
CapEnrich: Enriching Caption Semantics for Web Images via Cross-modal Pre-trained Knowledge
Linli Yao, Weijing Chen, Qin Jin
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Knowledge
Pose
Caption
Recommendation
PDF
2022-11-17
Learning Domain and Pose Invariance for Thermal-to-Visible Face Recognition
Cedric Nimpa Fondje, Shuowen Hu, Benjamin S. Riggan
arXiv_CV
arXiv_CV
Image_Caption
Recognition
Pose
Face
Face_Recognition
Matching
PDF
2022-11-16
A Creative Industry Image Generation Dataset Based on Captions
Xiang Yuejia, Lv Chuanhao, Liu Qingdazhu, Yang Xiaocui, Liu Bo, Ju Meizhi
arXiv_CV
arXiv_CV
Sketch
Caption
PDF
2022-11-16
Exploring State Change Capture of Heterogeneous Backbones @ Ego4D Hands and Objects Challenge 2022
Yin-Dong Zheng, Guo Chen, Jiahao Wang, Tong Lu, Limin Wang
arXiv_CV
arXiv_CV
Transformer
Video_Caption
3D
Action
Classification
PDF
2022-11-15
PromptCap: Prompt-Guided Task-Aware Image Captioning
Yushi Hu, Hang Hua, Zhengyuan Yang, Weijia Shi, Noah A. Smith, Jiebo Luo
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Zero-Shot
Pose
Scene_Text
VQA
Caption
Language_Model
QA
PDF
2022-11-15
Versatile Diffusion: Text, Images and Variations All in One Diffusion Model
Xingqian Xu, Zhangyang Wang, Eric Zhang, Kai Wang, Humphrey Shi
arXiv_CV
arXiv_CV
Image_Caption
PDF
2022-11-14
Is my automatic audio captioning system so bad? spider-max: a metric to consider several caption candidates
Etienne Labbé (IRIT-SAMoVA, UT3), Thomas Pellegrini (IRIT-SAMoVA, UT3), Julien Pinquier (IRIT-SAMoVA, UT3)
arXiv_SD
arXiv_SD
Pose
Caption
PDF
2022-11-14
Learning to Model Multimodal Semantic Alignment for Story Visualization
Bowen Li, Thomas Lukasiewicz
arXiv_CV
arXiv_CV
Image_Caption
Segmentation
Face
Action
GAN
Caption
PDF
2022-11-14
Zero-shot Image Captioning by Anchor-augmented Vision-Language Space Alignment
Junyang Wang, Yi Zhang, Ming Yan, Ji Zhang, Jitao Sang
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Unsupervised
Zero-Shot
Image_Retrieval
Pose
Classification
Relation
Attention
Caption
Language_Model
PDF
2022-11-14
ContextCLIP: Contextual Alignment of Image-Text pairs on CLIP visual representations
Chanda Grover, Indra Deep Mastan, Debayan Gupta
arXiv_CV
arXiv_CV
Image_Caption
Embedding
Zero-Shot
Image_Retrieval
Pose
Contrastive_Learning
Quantitative
Classification
Caption
PDF
2022-11-13
Large-Scale Bidirectional Training for Zero-Shot Image Captioning
Taehoon Kim, Mark Marsden, Pyunghwan Ahn, Sangyun Kim, Sihaeng Lee, Alessandra Sala, Seung Hwan Kim
arXiv_CV
arXiv_CV
Image_Caption
Zero-Shot
Pose
Action
Caption
Inference
PDF
2022-11-12
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation
Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov
arXiv_SD
arXiv_SD
Zero-Shot
Represenation_Learning
Pose
Contrastive_Learning
Classification
Caption
PDF
2022-11-12
AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities
Zhongzhi Chen, Guang Liu, Bo-Wen Zhang, Fulong Ye, Qinghong Yang, Ledell Wu
arXiv_CL
arXiv_CL
Image_Caption
Contrastive_Learning
PDF
2022-11-12
DeltaNet: Conditional Medical Report Generation for COVID-19 Diagnosis
Xian Wu, Shuxin Yang, Zhaopeng Qiu, Shen Ge, Yangtian Yan, Xingwang Wu, Yefeng Zheng, S. Kevin Zhou, Li Xiao
arXiv_CV
arXiv_CV
Image_Caption
Pose
Medical
Caption
PDF
2022-11-12
Investigations in Audio Captioning: Addressing Vocabulary Imbalance and Evaluating Suitability of Language-Centric Performance Metrics
Sandeep Kothinti, Dimitra Emmanouilidou
arXiv_SD
arXiv_SD
Image_Caption
Pose
Action
Caption
PDF
2022-11-10
VieCap4H - VLSP 2021: ObjectAoA -- Enhancing performance of Object Relation Transformer with Attention on Attention for Vietnamese image captioning
Nghia Hieu Nguyen, Duong T.D. Vo, Minh-Quan Ha
arXiv_CL
arXiv_CL
Image_Caption
Transformer
Pose
Relation
Attention
Caption
PDF
2022-11-09
Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models
Patrick Schramowski, Manuel Brack, Björn Deiseroth, Kristian Kersting
arXiv_AI
arXiv_AI
Image_Caption
PDF
2022-11-09
Understanding Cross-modal Interactions in V&L Models that Generate Scene Descriptions
Michele Cafagna, Albert Gatt, Kees van Deemter
arXiv_CL
arXiv_CL
Image_Caption
Action
Caption
Language_Model
PDF
2022-11-09
Visual Named Entity Linking: A New Dataset and A Baseline
Wenxiang Sun, Yixing Fan, Jiafeng Guo, Ruqing Zhang, Xueqi Cheng
arXiv_AI
arXiv_AI
Image_Caption
Image_Retrieval
Knowledge
Pose
VQA
Caption
PDF
2022-11-06
Distilling Representations from GAN Generator via Squeeze and Span
Yu Yang, Xiaotian Cheng, Chang Liu, Hakan Bilen, Xiangyang Ji
arXiv_CV
arXiv_CV
Image_Caption
Represenation_Learning
Knowledge
Adversarial
Self-Supervised
Pose
GAN
PDF
2022-11-05
Semantic Metadata Extraction from Dense Video Captioning
Johannes Scherer, Ansgar Scherp, Deepayan Bhowmik
arXiv_CV
arXiv_CV
Transformer
Video_Caption
Pose
Action
Relation
Caption
Activity
PDF
2022-11-04
OSIC: A New One-Stage Image Captioner Coined
Bo Wang, Zhao Zhang, Mingbo Zhao, Xiaojie Jin, Mingliang Xu, Meng Wang
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Embedding
Pose
Action
Detection
Object_Detection
Text_Generation
Caption
Language_Model
PDF
2022-11-03
Book Cover Synthesis from the Summary
Emdadul Haque, Md. Faraz Kabir Khan, Mohammad Imrul Jubair, Jarin Anjum, Abrar Zahir Niloy
arXiv_CV
arXiv_CV
Pose
Face
Action
Relation
GAN
Caption
PDF
2022-11-02
MuMIC -- Multimodal Embedding for Multi-label Image Classification with Tempered Sigmoid
Fengjun Wang, Sarai Mizrachi, Moran Beladev, Guy Nadav, Gil Amsalem, Karen Lastmann Assaraf, Hadas Harush Boker
arXiv_CV
arXiv_CV
Image_Caption
Embedding
Transfer_Learning
Zero-Shot
Optimization
Represenation_Learning
Knowledge
Pose
Classification
Image_Classification
Prediction
PDF
2022-11-01
Why is Winoground Hard? Investigating Failures in Visuolinguistic Compositionality
Anuj Diwan, Layne Berry, Eunsol Choi, David Harwath, Kyle Mahowald
arXiv_CV
arXiv_CV
Video_Caption
Image_Retrieval
Pose
Caption
PDF
2022-11-01
Text-Only Training for Image Captioning using Noise-Injected CLIP
David Nukrai, Ron Mokady, Amir Globerson
arXiv_AI
arXiv_AI
Image_Caption
Embedding
Style_Transfer
Zero-Shot
Pose
Caption
PDF
2022-11-01
Training Vision-Language Models with Less Bimodal Supervision
Elad Segal, Ben Bogin, Jonathan Berant
arXiv_CV
arXiv_CV
Transformer
VQA
Caption
Language_Model
QA
PDF
2022-10-31
Generative Negative Text Replay for Continual Vision-Language Pretraining
Shipeng Yan, Lanqing Hong, Hang Xu, Jianhua Han, Tinne Tuytelaars, Zhenguo Li, Xuming He
arXiv_CV
arXiv_CV
Transformer
Zero-Shot
Knowledge
Pose
Classification
Attention
Caption
Image_Classification
Prediction
PDF
2022-10-31
Improving Audio-Language Learning with MixGen and Multi-Level Test-Time Augmentation
Eungbeom Kim, Jinhee Kim, Yoori Oh, Kyungsu Kim, Minju Park, Jaeheon Sim, Jinwoo Lee, Kyogu Lee
arXiv_CL
arXiv_CL
Transformer
Pose
Deep_Learning
Caption
PDF
2022-10-28
DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention
Fenglin Liu, Xian Wu, Shen Ge, Xuancheng Ren, Wei Fan, Xu Sun, Yuexian Zou
arXiv_CL
arXiv_CL
Image_Caption
Transformer
Bert
Pose
Classification
Relation
Attention
Caption
Language_Model
PDF
2022-10-28
Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention
Xubo Liu, Qiushi Huang, Xinhao Mei, Haohe Liu, Qiuqiang Kong, Jianyuan Sun, Shengchen Li, Tom Ko, Yu Zhang, Lilian H. Tang, Mark D. Plumbley, Volkan Kılıç, Wenwu Wang
arXiv_SD
arXiv_SD
Recognition
Pose
Attention
Caption
PDF
2022-10-28
UPainting: Unified Text-to-Image Diffusion Generation with Cross-modal Guidance
Wei Li, Xue Xu, Xinyan Xiao, Jiachen Liu, Hu Yang, Guohao Li, Zhanpeng Wang, Zhifan Feng, Qiaoqiao She, Yajuan Lyu, Hua Wu
arXiv_CV
arXiv_CV
Transformer
Pose
Caption
Language_Model
Matching
PDF
2022-10-27
Towards Reliable Zero Shot Classification in Self-Supervised Models with Conformal Prediction
Bhawesh Kumar, Anil Palepu, Rudraksh Tuwani, Andrew Beam
arXiv_CV
arXiv_CV
Zero-Shot
Self-Supervised
Pose
Classification
Detection
Medical
Caption
Prediction
PDF
2022-10-27
Iterative pseudo-forced alignment by acoustic CTC loss for self-supervised ASR domain adaptation
Fernando López, Jordi Luque
arXiv_CL
arXiv_CL
Recognition
Speech
Self-Supervised
Pose
Classification
Speech_Recognition
Caption
PDF
2022-10-26
FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning
Suvir Mirchandani, Licheng Yu, Mengjiao Wang, Animesh Sinha, Wenwen Jiang, Tao Xiang, Ning Zhang
arXiv_AI
arXiv_AI
Image_Caption
Transformer
Image_Retrieval
Pose
Caption
PDF
2022-10-26
Visual Semantic Parsing: From Images to Abstract Meaning Representation
Mohamed Ashraf Abdelsalam, Zhan Shi, Federico Fancellu, Kalliopi Basioti, Dhaivat J. Bhatt, Vladimir Pavlovic, Afsaneh Fazly
arXiv_CV
arXiv_CV
Image_Caption
Pose
Relation
Attention
PDF
2022-10-26
Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks
Colin Leong, Joshua Nemecek, Jacob Mansdorfer, Anna Filighera, Abraham Owodunni, Daniel Whitenack
arXiv_AI
arXiv_AI
Image_Caption
Recognition
Speech
Face
Caption
Language_Model
PDF
2022-10-24
Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision
Tzu-Jui Julius Wang, Jorma Laaksonen, Tomas Langer, Heikki Arponen, Tom E. Bishop
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Bert
Image_Retrieval
Pose
Action
VQA
Caption
PDF
2022-10-24
Language-free Training for Zero-shot Video Grounding
Dahye Kim, Jungin Park, Jiyoung Lee, Seongheon Park, Kwanghoon Sohn
arXiv_CV
arXiv_CV
Video_Caption
Zero-Shot
Pose
Caption
PDF
2022-10-23
Extending Phrase Grounding with Pronouns in Visual Dialogues
Panzhong Lu, Xin Zhang, Meishan Zhang, Min Zhang
arXiv_CV
arXiv_CV
Caption
CNN
PDF
2022-10-23
RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data
Yang Zhan, Zhitong Xiong, Yuan Yuan
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Embedding
Salient
Pose
Deep_Learning
VQA
Caption
PDF
2022-10-21
Improving the Factual Correctness of Radiology Report Generation with Semantic Rewards
Jean-Benoit Delbrouck, Pierre Chambon, Christian Bluethgen, Emily Tsai, Omar Almusa, Curtis P. Langlotz
arXiv_AI
arXiv_AI
Image_Caption
Recognition
Pose
Face
Relation
Medical
PDF
2022-10-21
Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding
Yuechen Wang, Wengang Zhou, Houqiang Li
arXiv_CV
arXiv_CV
Weakly_Supervised
Pose
Action
Caption
Activity
PDF
2022-10-21
Boosting vision transformers for image retrieval
Chull Hwan Song, Jooyoung Yoon, Shunghyun Choi, Yannis Avrithis
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Image_Retrieval
Pose
Action
Classification
Detection
CNN
Image_Classification
PDF
2022-10-21
Collaborative Image Understanding
Koby Bibas, Oren Sar Shalom, Dietmar Jannach
arXiv_CV
arXiv_CV
Image_Caption
Pose
Classification
GAN
Image_Classification
Recommendation
PDF
2022-10-21
PoseScript: 3D Human Poses from Natural Language
Ginger Delmas, Philippe Weinzaepfel, Thomas Lucas, Francesc Moreno-Noguer, Grégory Rogez
arXiv_CV
arXiv_CV
Image_Caption
3D
Pose
Relation
VQA
Caption
PDF
2022-10-21
Generative Range Imaging for Learning Scene Priors of 3D LiDAR Data
Kazuto Nakashima, Yumi Iwashita, Ryo Kurazume
arXiv_CV
arXiv_CV
Image_Caption
Segmentation
Semantic_Segmentation
3D
Restoration
Adversarial
Pose
GAN
Autonomous
PDF
2022-10-20
Communication breakdown: On the low mutual intelligibility between human and neural captioning
Roberto Dessì, Eleonora Gualdoni, Francesca Franzon, Gemma Boleda, Marco Baroni
arXiv_CL
arXiv_CL
Caption
PDF
2022-10-20
Image-Text Retrieval with Binary and Continuous Label Supervision
Zheng Li, Caili Guo, Zerun Feng, Jenq-Neng Hwang, Ying Jin, Yufeng Zhang
arXiv_CV
arXiv_CV
Image_Caption
Embedding
Pose
Relation
Caption
PDF
2022-10-20
Context-driven Visual Object Recognition based on Knowledge Graphs
Sebastian Monka, Lavdim Halilaj, Achim Rettinger
arXiv_AI
arXiv_AI
Image_Caption
Transfer_Learning
Recognition
Knowledge
Knowledge_Graph
Pose
Deep_Learning
Prediction
PDF
2022-10-20
VideoPipe 2022 Challenge: Real-World Video Understanding for Urban Pipe Inspection
Yi Liu, Xuan Zhang, Ying Li, Guixin Liang, Yabing Jiang, Lixia Qiu, Haiping Tang, Fei Xie, Wei Yao, Yi Dai, Yu Qiao, Yali Wang
arXiv_CV
arXiv_CV
Recognition
Video_Caption
Pose
Face
Action_Recognition
Action
Classification
PDF
2022-10-20
General Image Descriptors for Open World Image Retrieval using ViT CLIP
Marcos V. Conde, Ivan Aerlic, Simon Jégou
arXiv_AI
arXiv_AI
Image_Caption
Transformer
Embedding
Zero-Shot
Image_Retrieval
PDF
2022-10-20
Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation
Yu Zhao, Jianguo Wei, Zhichao Lin, Yueheng Sun, Meishan Zhang, Min Zhang
arXiv_CV
arXiv_CV
Image_Caption
Pose
Classification
Relation
Attention
Text_Generation
Caption
PDF
2022-10-20
SSiT: Saliency-guided Self-supervised Image Transformer for Diabetic Retinopathy Grading
Yijin Huang, Junyan Lyu, Pujin Cheng, Roger Tam, Xiaoying Tang
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Segmentation
Salient
Knowledge
Self-Supervised
Pose
Contrastive_Learning
Medical
PDF
2022-10-19
Prophet Attention: Predicting Attention with Future Attention for Improved Image Captioning
Fenglin Liu, Xuewei Ma, Xuancheng Ren, Xian Wu, Wei Fan, Yuexian Zou, Xu Sun
arXiv_CV
arXiv_CV
Image_Caption
Pose
Attention
Caption
PDF
2022-10-19
Grounded Video Situation Recognition
Zeeshan Khan, C.V. Jawahar, Makarand Tapaswi
arXiv_CV
arXiv_CV
Transformer
Embedding
Recognition
Video_Caption
Weakly_Supervised
Pose
Face
Action
Relation
Caption
Prediction
PDF
2022-10-19
Image Semantic Relation Generation
Mingzhe Du
arXiv_CV
arXiv_CV
Image_Caption
Segmentation
Image_Retrieval
Pose
Detection
Relation
VQA
Text_Generation
Visual_Relation
Autonomous
PDF
2022-10-19
Temporal Action Segmentation: An Analysis of Modern Technique
Guodong Ding, Fadime Sener, Angela Yao
arXiv_CV
arXiv_CV
Segmentation
Video_Caption
Review
Pose
Survey
Action
PDF
2022-10-18
Aligning MAGMA by Few-Shot Learning and Finetuning
Jean-Charles Layoun, Alexis Roger, Irina Rish
arXiv_AI
arXiv_AI
Image_Caption
Transformer
Face
VQA
Few-Shot
Caption
Language_Model
PDF
2022-10-18
MedCLIP: Contrastive Learning from Unpaired Medical Images and Text
Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, Jimeng Sun
arXiv_CV
arXiv_CV
Transformer
Embedding
Zero-Shot
Knowledge
Pose
Contrastive_Learning
Classification
Medical
Caption
Prediction
Matching
PDF
2022-10-18
How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios
Mantas Mazeika, Eric Tang, Andy Zou, Steven Basart, Jun Shern Chan, Dawn Song, David Forsyth, Jacob Steinhardt, Dan Hendrycks
arXiv_CV
arXiv_CV
Video_Caption
Pose
Emotion
Action
Contour
PDF
2022-10-18
Probing Cross-modal Semantics Alignment Capability from the Textual Perspective
Zheng Ma, Shi Zong, Mianzhi Pan, Jianbing Zhang, Shujian Huang, Xinyu Dai, Jiajun Chen
arXiv_CL
arXiv_CL
Image_Caption
Transformer
Bert
Pose
Attention
Caption
PDF
2022-10-17
Vision-Language Pre-training: Basics, Recent Advances, and Future Trends
Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, Jianfeng Gao
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Segmentation
Video_Caption
Knowledge
Review
Face
Survey
Classification
Detection
VQA
Few-Shot
Object_Detection
Caption
Image_Classification
PDF
2022-10-17
Weakly Supervised Face Naming with Symmetry-Enhanced Contrastive Loss
Tingyu Qu, Tinne Tuytelaars, Marie-Francine Moens
arXiv_CV
arXiv_CV
Image_Caption
Weakly_Supervised
Pose
Contrastive_Learning
Face
Caption
PDF
2022-10-17
Social Biases in Automatic Evaluation Metrics for NLG
Mingqi Gao, Xiaojun Wan
arXiv_AI
arXiv_AI
Image_Caption
Embedding
Pose
Relation
Summarization
Text_Generation
Caption
Language_Model
PDF
2022-10-17
Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training
Anthony Meng Huat Tiong, Junnan Li, Boyang Li, Silvio Savarese, Steven C.H. Hoi
arXiv_CV
arXiv_CV
Image_Caption
Zero-Shot
Pose
VQA
Caption
Language_Model
QA
PDF
2022-10-17
Runner-Up Solution to Google Universal Image Embedding Competition 2022
Xiaolong Huang, QianKun Li
arXiv_CV
arXiv_CV
Image_Caption
Embedding
Recognition
Classification
Image_Classification
PDF
2022-10-14
EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning
Tiannan Wang, Wangchunshu Zhou, Yan Zeng, Xinsong Zhang
arXiv_CV
arXiv_CV
Transformer
Knowledge
Pose
VQA
Caption
Inference
Language_Model
QA
PDF
2022-10-14
Learning image representations for anomaly detection: application to discovery of histological alterations in drug development
Igor Zingman, Birgit Stierstorfer, Charlotte Lempp, Fabian Heinemann
arXiv_AI
arXiv_AI
Image_Caption
Pose
Detection
GAN
CNN
PDF
2022-10-13
Caption supervision enables robust learners
Benjamin Feuer, Ameya Joshi, Chinmay Hegde
arXiv_CV
arXiv_CV
Caption
Language_Model
PDF
2022-10-13
MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting
Oscar Mañas, Pau Rodriguez, Saba Ahmadi, Aida Nematzadeh, Yash Goyal, Aishwarya Agrawal
arXiv_AI
arXiv_AI
Image_Caption
Transformer
Pose
VQA
Few-Shot
Caption
PDF
2022-10-13
Learning with Style: Continual Semantic Segmentation Across Tasks and Domains
Marco Toldo, Umberto Michieli, Pietro Zanuttigh
arXiv_CV
arXiv_CV
Image_Caption
Segmentation
Semantic_Segmentation
Style_Transfer
Knowledge
Pose
Face
Deep_Learning
Autonomous
PDF
2022-10-13
DE-FAKE: Detection and Attribution of Fake Images Generated by Text-to-Image Diffusion Models
Zeyang Sha, Zheng Li, Ning Yu, Yang Zhang
arXiv_CV
arXiv_CV
Pose
Detection
Attention
Caption
PDF
2022-10-12
Self-supervised video pretraining yields strong image representations
Nikhil Parthasarathy, S. M. Ali Eslami, João Carreira, Olivier J. Hénaff
arXiv_AI
arXiv_AI
Image_Caption
Segmentation
Transfer_Learning
Semantic_Segmentation
Knowledge
Self-Supervised
Pose
Contrastive_Learning
Detection
Object_Detection
PDF
2022-10-11
Visual Language Maps for Robot Navigation
Chenguang Huang, Oier Mees, Andy Zeng, Wolfram Burgard
arXiv_AI
arXiv_AI
Image_Caption
Reconstruction
3D
Pose
Caption
Autonomous
Language_Model
Matching
PDF
2022-10-11
Like a bilingual baby: The advantage of visually grounding a bilingual language model
Khai-Nguyen Nguyen, Zixin Tang, Ankur Mali, Alex Kelly
arXiv_CL
arXiv_CL
RNN
Caption
Language_Model
PDF
2022-10-10
Fighting FIRe with FIRE: Assessing the Validity of Text-to-Video Retrieval Benchmarks
Pedro Rodriguez, Mahmoud Azab, Becka Silvert, Renato Sanchez, Linzy Labson, Hardik Shah, Seungwhan Moon
arXiv_CL
arXiv_CL
Video_Caption
Pose
Caption
Recommendation
Video_Retrieval
PDF
2022-10-10
Automated Audio Captioning via Fusion of Low- and High- Dimensional Features
Jianyuan Sun, Xubo Liu, Xinhao Mei, Mark D. Plumbley, Volkan Kilic, Wenwu Wang
arXiv_SD
arXiv_SD
Transformer
Pose
Caption
PDF
2022-10-10
Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis
Wenda Xu, Yilin Tuan, Yujie Lu, Michael Saxon, Lei Li, William Yang Wang
arXiv_CL
arXiv_CL
Image_Caption
Unsupervised
Relation
Text_Generation
Caption
PDF
2022-10-10
What the DAAM: Interpreting Stable Diffusion Using Cross Attention
Raphael Tang, Akshat Pandey, Zhiying Jiang, Gefei Yang, Karun Kumar, Jimmy Lin, Ferhan Ture
arXiv_CV
arXiv_CV
Segmentation
Unsupervised
Knowledge
Speech
Pose
Denoising
Attention
Caption
PDF
2022-10-10
Generating image captions with external encyclopedic knowledge
Sofia Nikiforova, Tejaswini Deoskar, Denis Paperno, Yoad Winter
arXiv_CV
arXiv_CV
Image_Caption
Knowledge
Caption
PDF
2022-10-10
Using Whole Slide Image Representations from Self-Supervised Contrastive Learning for Melanoma Concordance Regression
Sean Grullon, Vaughn Spurrier, Jiayi Zhao, Corey Chivers, Yang Jiang, Kiran Motaparthi, Michael Bonham, Julianna Ianni
arXiv_AI
arXiv_AI
Image_Caption
Salient
Self-Supervised
Contrastive_Learning
Action
Deep_Learning
PDF
2022-10-10
LSEH: Semantically Enhanced Hard Negatives for Cross-modal Information Retrieval
Yan Gong, Georgina Cosma
arXiv_CV
arXiv_CV
Image_Caption
Embedding
Pose
PDF
2022-10-10
YFACC: A Yorùbá speech-image dataset for cross-lingual keyword localisation through visual grounding
Kayode Olaleye, Dan Oneata, Herman Kamper
arXiv_CL
arXiv_CL
Speech
Attention
Caption
PDF
2022-10-10
CLIP-Diffusion-LM: Apply Diffusion Model on Image Captioning
Shitong Xu
arXiv_CV
arXiv_CV
Image_Caption
Denoising
Text_Generation
Caption
Inference
PDF
2022-10-09
ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval
Adriano Fragomeni, Michael Wray, Dima Damen
arXiv_CV
arXiv_CV
Transformer
Embedding
Knowledge
Pose
Action
Caption
Activity
Video_Retrieval
PDF
2022-10-09
Students taught by multimodal teachers are superior action recognizers
Gorjan Radevski, Dusan Grujicic, Matthew Blaschko, Marie-Francine Moens, Tinne Tuytelaars
arXiv_CV
arXiv_CV
Transformer
Recognition
Video_Caption
Knowledge
Action_Recognition
Action
Detection
Object_Detection
Optical_Flow
Inference
PDF
2022-10-09
Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP
Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, Diana Marculescu
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Segmentation
Semantic_Segmentation
Pose
Caption
Language_Model
PDF
2022-10-09
VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment
Shraman Pramanick, Li Jing, Sayan Nag, Jiachen Zhu, Hardik Shah, Yann LeCun, Rama Chellappa
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Segmentation
Pose
Detection
Object_Detection
Caption
Matching
PDF
2022-10-09
Adapting Pretrained Vision-Language Foundational Models to Medical Imaging Domains
Pierre Chambon, Christian Bluethgen, Curtis P. Langlotz, Akshay Chaudhari
arXiv_AI
arXiv_AI
Transformer
Quantitative
Medical
Caption
PDF
2022-10-08
EgoTaskQA: Understanding Human Tasks in Egocentric Videos
Baoxiong Jia, Ting Lei, Song-Chun Zhu, Siyuan Huang
arXiv_AI
arXiv_AI
Video_Caption
Action_Localization
Action
Prediction
QA
PDF
2022-10-08
Contextual Modeling for 3D Dense Captioning on Point Clouds
Yufeng Zhong, Long Xu, Jiebo Luo, Lin Ma
arXiv_CV
arXiv_CV
Transformer
Point_Cloud
3D
Pose
Relation
Caption
PDF
2022-10-07
Learning to embed semantic similarity for joint image-text retrieval
Noam Malali, Yosi Keller
arXiv_CV
arXiv_CV
Embedding
Quantization
Pose
Deep_Learning
Caption
PDF
2022-10-07
C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval
Andrew Rouditchenko, Yung-Sung Chuang, Nina Shvetsova, Samuel Thomas, Rogerio Feris, Brian Kingsbury, Leonid Karlinsky, David Harwath, Hilde Kuehne, James Glass
arXiv_CV
arXiv_CV
Knowledge
Pose
Caption
Prediction
Video_Retrieval
PDF
2022-10-07
Towards Multi-Modal Sarcasm Detection via Hierarchical Congruity Modeling with Knowledge Enhancement
Hui Liu, Wenya Wang, Haoliang Li
arXiv_CL
arXiv_CL
Image_Caption
Enhancement
Knowledge
Pose
Detection
Attention
Caption
PDF
2022-10-07
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding
Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova
arXiv_CV
arXiv_CV
Image_Caption
OCR
Face
Caption
Language_Model
PDF
2022-10-07
Unsupervised Neural Stylistic Text Generation using Transfer learning and Adapters
Vinayshekhar Bannihatti Kumar, Rashmi Gangadharaiah, Dan Roth
arXiv_CL
arXiv_CL
Unsupervised
Transfer_Learning
Pose
Text_Generation
Caption
PDF
2022-10-06
Compressed Vision for Efficient Video Understanding
Olivia Wiles, Joao Carreira, Iain Barr, Andrew Zisserman, Mateusz Malinowski
arXiv_CV
arXiv_CV
Video_Caption
Pose
PDF
2022-10-06
Text-driven Video Prediction
Xue Song, Jingjing Chen, Bin Zhu, Yu-Gang Jiang
arXiv_CV
arXiv_CV
Embedding
Video_Prediction
Pose
Caption
Inference
Prediction
PDF
2022-10-06
What Should the System Do Next?: Operative Action Captioning for Estimating System Actions
Taiki Nakamura, Seiya Kawano, Akishige Yuguchi, Yasutomo Kawanishi, Koichiro Yoshino
arXiv_RO
arXiv_RO
Pose
Action
Caption
Prediction
PDF
2022-10-05
Transferring dense object detection models to event-based data
Vincenz Mechler, Pavel Rojtberg
arXiv_CV
arXiv_CV
Image_Caption
Sparse
Pose
Detection
Object_Detection
PDF
2022-10-05
Active Image Indexing
Pierre Fernandez, Matthijs Douze, Hervé Jégou, Teddy Furon
arXiv_AI
arXiv_AI
Image_Caption
Quantization
Detection
PDF
2022-10-05
SoccerNet 2022 Challenges Results
Silvio Giancola, Anthony Cioppa, Adrien Deliège, Floriane Magera, Vladimir Somers, Le Kang, Xin Zhou, Olivier Barnich, Christophe De Vleeschouwer, Alexandre Alahi, Bernard Ghanem, Marc Van Droogenbroeck, Abdulrahman Darwish, Adrien Maglo, Albert Clapés, Andreas Luyts, Andrei Boiarov, Artur Xarles, Astrid Orcesi, Avijit Shah, Baoyu Fan, Bharath Comandur, Chen Chen, Chen Zhang, Chen Zhao, Chengzhi Lin, Cheuk-Yiu Chan, Chun Chuen Hui, Dengjie Li, Fan Yang, Fan Liang, Fang Da, Feng Yan, Fufu Yu, Guanshuo Wang, H. Anthony Chan, He Zhu, Hongwei Kan, Jiaming Chu, Jianming Hu, Jianyang Gu, Jin Chen, João V. B. Soares, Jonas Theiner, Jorge De Corte, José Henrique Brito, Jun Zhang, Junjie Li, Junwei Liang, Leqi Shen, Lin Ma, Lingchi Chen, Miguel Santos Marques, Mike Azatov, Nikita Kasatkin, et al. (39 additional authors not shown)
arXiv_CV
arXiv_CV
Tracking
Video_Caption
Object_Tracking
Pose
Action
GAN
Re-identification
PDF
2022-10-04
When and why vision-language models behave like bag-of-words models, and what to do about it?
Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, James Zou
arXiv_AI
arXiv_AI
Transformer
Contrastive_Learning
Face
Relation
Caption
Language_Model
PDF
2022-10-04
VICRegL: Self-Supervised Learning of Local Visual Features
Adrien Bardes, Jean Ponce, Yann LeCun
arXiv_AI
arXiv_AI
Image_Caption
Segmentation
Self-Supervised
Pose
Classification
Detection
CNN
PDF
2022-10-04
Learning to Collocate Visual-Linguistic Neural Modules for Image Captioning
Xu Yang, Hanwang Zhang, Chongyang Gao, Jianfei Cai
arXiv_CV
arXiv_CV
Image_Caption
Speech
Pose
VQA
Attention
Caption
QA
PDF
2022-10-03
Text-to-Audio Grounding Based Novel Metric for Evaluating Audio Caption Similarity
Swapnil Bhosale, Rupayan Chakraborty, Sunil Kumar Kopparapu
arXiv_AI
arXiv_AI
Image_Caption
Pose
Relation
Text_Generation
Caption
PDF
2022-10-03
SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model
Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Layne Berry, Hung-yi Lee, David Harwath
arXiv_CL
arXiv_CL
Bert
Zero-Shot
Speech
Pose
Caption
Language_Model
PDF
2022-09-30
Contrastive Corpus Attribution for Explaining Representations
Chris Lin, Hugh Chen, Chanwoo Kim, Su-In Lee
arXiv_AI
arXiv_AI
Image_Caption
Unsupervised
Zero-Shot
Pose
Contrastive_Learning
Quantitative
PDF
2022-09-30
Medical Image Understanding with Pretrained Vision Language Models: A Comprehensive Study
Ziyuan Qin, Huahui Yi, Qicheng Lao, Kang Li
arXiv_CV
arXiv_CV
Image_Caption
Zero-Shot
Knowledge
Medical
Language_Model
PDF
2022-09-30
AudioGen: Textually Guided Audio Generation
Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre Défossez, Jade Copet, Devi Parikh, Yaniv Taigman, Yossi Adi
arXiv_CL
arXiv_CL
Pose
Caption
Inference
PDF
2022-09-30
SmallCap: Lightweight Image Captioning Prompted with Retrieval Augmentation
Rita Ramos, Bruno Martins, Desmond Elliott, Yova Kementchedjhieva
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Attention
Caption
PDF
2022-09-30
Learning Transferable Spatiotemporal Representations from Natural Script Knowledge
Ziyun Zeng, Yuying Ge, Xihui Liu, Bin Chen, Ping Luo, Shu-Tao Xia, Yixiao Ge
arXiv_AI
arXiv_AI
Transformer
Video_Caption
Represenation_Learning
Knowledge
Speech
GAN
Caption
PDF
2022-09-30
Linearly Mapping from Image to Text Space
Jack Merullo, Louis Castricato, Carsten Eickhoff, Ellie Pavlick
arXiv_CL
arXiv_CL
Image_Caption
Salient
VQA
Caption
Language_Model
PDF
2022-09-29
REST: REtrieve & Self-Train for generative action recognition
Adrian Bulat, Enrique Sanchez, Brais Martinez, Georgios Tzimiropoulos
arXiv_AI
arXiv_AI
Unsupervised
Recognition
Zero-Shot
Knowledge
Pose
Contrastive_Learning
Action_Recognition
Action
Caption
PDF
2022-09-29
Speeding Up Action Recognition Using Dynamic Accumulation of Residuals in Compressed Domain
Ali Abdari, Pouria Amirjan, Azadeh Mansouri
arXiv_CV
arXiv_CV
Recognition
Video_Caption
Pose
Action_Recognition
Action
Classification
Attention
PDF
2022-09-29
Prompt-guided Scene Generation for 3D Zero-Shot Learning
Majid Nasiri, Ali Cheraghian, Townim Faisal Chowdhury, Sahar Ahmadi, Morteza Saberi, Shafin Rahman
arXiv_CV
arXiv_CV
Point_Cloud
3D
Bert
Zero-Shot
Scene_Generation
Pose
Contrastive_Learning
Action
Caption
Language_Model
PDF
2022-09-28
Audio Retrieval with WavText5K and CLAP Training
Soham Deshmukh, Benjamin Elizalde, Huaming Wang
arXiv_AI
arXiv_AI
Pose
Contrastive_Learning
Caption
PDF
2022-09-28
Weighted Contrastive Hashing
Jiaguo Yu, Huming Qiu, Dubing Chen, Haofeng Zhang
arXiv_CV
arXiv_CV
Image_Caption
Unsupervised
Pose
Contrastive_Learning
Relation
Attention
PDF
2022-09-28
Medical Image Captioning via Generative Pretrained Transformers
Alexander Selivanov, Oleg Y. Rogov, Daniil Chesakov, Artem Shelmanov, Irina Fedulova, Dmitry V. Dylov
arXiv_AI
arXiv_AI
Image_Caption
Transformer
Pose
Medical
Caption
Language_Model
PDF
2022-09-28
Thinking Hallucination for Video Captioning
Nasib Ullah, Partha Pratim Mohanta
arXiv_CV
arXiv_CV
Video_Caption
Pose
Action
Caption
Language_Model
PDF
2022-09-28
Streaming Video Temporal Action Segmentation In Real Time
Wujun Wen, Yunheng Li, Zhuben Dong, Lin Feng, Wanxiao Yang, Shenlan Liu
arXiv_CV
arXiv_CV
Segmentation
Video_Caption
Knowledge
Pose
Action
Language_Model
PDF
2022-09-26
Word to Sentence Visual Semantic Similarity for Caption Generation: Lessons Learned
Ahmed Sabir
arXiv_CV
arXiv_CV
Image_Caption
Pose
Caption
PDF
2022-09-26
Improving Document Image Understanding with Reinforcement Finetuning
Bao-Sinh Nguyen, Dung Tien Le, Hieu M. Vu, Tuan Anh D. Nguyen, Minh-Tien Nguyen, Hung Le
arXiv_CV
arXiv_CV
Image_Caption
Reinforcement_Learning
Action
PDF
2022-09-25
Paraphrasing Is All You Need for Novel Object Captioning
Cheng-Fu Yang, Yao-Hung Hubert Tsai, Wan-Cyuan Fan, Ruslan Salakhutdinov, Louis-Philippe Morency, Yu-Chiang Frank Wang
arXiv_CV
arXiv_CV
Optimization
Caption
Language_Model
PDF
2022-09-23
Semantic scene descriptions as an objective of human vision
Adrien Doerig, Tim C Kietzmann, Emily Allen, Yihan Wu, Thomas Naselaris, Kendrick Kay, Ian Charest
arXiv_CV
arXiv_CV
Reconstruction
Embedding
Deep_Learning
Relation
Caption
Activity
CNN
PDF
2022-09-22
DRAMA: Joint Risk Localization and Captioning in Driving
Srikanth Malla, Chiho Choi, Isht Dwivedi, Joon Hee Choi, Jiachen Li
arXiv_AI
arXiv_AI
Pose
Face
Caption
Autonomous
Prediction
PDF
2022-09-21
Show, Interpret and Tell: Entity-aware Contextualised Image Captioning in Wikipedia
Khanh Nguyen, Ali Furkan Biten, Andres Mafla, Lluis Gomez, Dimosthenis Karatzas
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Knowledge
Pose
Caption
PDF
2022-09-21
Toward 3D Spatial Reasoning for Human-like Text-based Visual Question Answering
Hao Li, Jinfa Huang, Peng Jin, Guoli Song, Qi Wu, Jie Chen
arXiv_CV
arXiv_CV
Image_Caption
Transformer
OCR
3D
Knowledge
Pose
Scene_Text
Face
Relation
VQA
Attention
Caption
Prediction
QA
PDF
2022-09-21
Recipe Generation from Unsegmented Cooking Videos
Taichi Nishimura, Atsushi Hashimoto, Yoshitaka Ushiku, Hirotaka Kameko, Shinsuke Mori
arXiv_CV
arXiv_CV
Transformer
Video_Caption
Pose
Caption
PDF
2022-09-20
Language-based Audio Retrieval Task in DCASE 2022 Challenge
Huang Xie, Samuel Lipping, Tuomas Virtanen
arXiv_SD
arXiv_SD
Relation
Caption
PDF
2022-09-20
DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection
Lewei Yao, Jianhua Han, Youpeng Wen, Xiaodan Liang, Dan Xu, Wei Zhang, Zhenguo Li, Chunjing Xu, Hang Xu
arXiv_CV
arXiv_CV
Transformer
Zero-Shot
Knowledge
Pose
Action
Detection
Relation
Object_Detection
Caption
PDF
2022-09-19
Panoramic Vision Transformer for Saliency Detection in 360° Videos
Heeseung Yun, Sehun Lee, Gunhee Kim
arXiv_CV
arXiv_CV
Transformer
Video_Caption
Salient
Detection
Relation
VQA
Prediction
QA
PDF
2022-09-19
Attentive Symmetric Autoencoder for Brain MRI Segmentation
Junjia Huang, Haofeng Li, Guanbin Li, Xiang Wan
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Reconstruction
Segmentation
3D
Self-Supervised
Pose
Relation
Attention
Medical
PDF
2022-09-19
On the Shift Invariance of Max Pooling Feature Maps in Convolutional Neural Networks
Hubert Leterme (UGA, LJK), Kévin Polisano (UGA, LJK), Valérie Perrier (Grenoble INP, LJK), Karteek Alahari (LJK)
arXiv_AI
arXiv_AI
Image_Caption
Classification
Relation
CNN
Image_Classification
PDF
2022-09-17
Learning Distinct and Representative Modes for Image Captioning
Qi Chen, Chaorui Deng, Qi Wu
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Embedding
Pose
Caption
PDF
2022-09-16
Belief Revision based Caption Re-ranker with Visual Semantic Information
Ahmed Sabir, Francesc Moreno-Noguer, Pranava Madhyastha, Lluís Padró
arXiv_CV
arXiv_CV
Image_Caption
Pose
Caption
PDF
2022-09-15
LAVIS: A Library for Language-Vision Intelligence
Dongxu Li, Junnan Li, Hung Le, Guangsen Wang, Silvio Savarese, Steven C.H. Hoi
arXiv_CL
arXiv_CL
Transformer
Face
Classification
Deep_Learning
VQA
Caption
Language_Model
PDF
2022-09-15
OmniVL: One Foundation Model for Image-Language and Video-Language Tasks
Junke Wang, Dongdong Chen, Zuxuan Wu, Chong Luo, Luowei Zhou, Yucheng Zhao, Yujia Xie, Ce Liu, Yu-Gang Jiang, Lu Yuan
arXiv_CV
arXiv_CV
Transformer
Recognition
Pose
Action_Recognition
Action
Classification
Caption
Image_Classification
Language_Model
PDF
2022-09-15
Distribution Aware Metrics for Conditional Natural Language Generation
David M Chan, Yiming Ni, Austin Myers, Sudheendra Vijayanarasimhan, David A Ross, John Canny
arXiv_CV
arXiv_CV
Recognition
Speech
Pose
Summarization
Speech_Recognition
Caption
Matching
PDF
2022-09-15
Exploring Visual Interpretability for Contrastive Language-Image Pre-training
Yi Li, Hualiang Wang, Yiqun Duan, Hang Xu, Xiaomeng Li
arXiv_CV
arXiv_CV
Transformer
Segmentation
Recognition
Zero-Shot
Knowledge
Self-Supervised
Pose
Attention
Caption
Prediction
PDF
2022-09-15
Can We Solve 3D Vision Tasks Starting from A 2D Vision Transformer?
Yi Wang, Zhiwen Fan, Tianlong Chen, Hehe Fan, Zhangyang Wang
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Segmentation
Embedding
Point_Cloud
3D
Classification
Detection
PDF
2022-09-15
VIPHY: Probing 'Visible' Physical Commonsense Knowledge
Shikhar Singh, Ehsan Qasemi, Muhao Chen
arXiv_CL
arXiv_CL
Transformer
Bert
Knowledge
Caption
Language_Model
PDF
2022-09-14
PaLI: A Jointly-Scaled Multilingual Language-Image Model
Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme, Andreas Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut
arXiv_CV
arXiv_CV
Transformer
Face
VQA
Caption
Language_Model
PDF
2022-09-14
WildQA: In-the-Wild Video Question Answering
Santiago Castro, Naihao Deng, Pingxuan Huang, Mihai Burzo, Rada Mihalcea
arXiv_CV
arXiv_CV
Video_Caption
Pose
Action
Attention
QA
PDF
2022-09-14
Learning to Evaluate Performance of Multi-modal Semantic Localization
Zhiqiang Yuan, Wenkai Zhang, Chongyang Li, Zhaoying Pan, Yongqiang Mao, Jialiang Chen, Shouke Li, Hongqi Wang, Xian Sun
arXiv_CV
arXiv_CV
Pose
Action
Detection
Attention
Caption
PDF
2022-09-14
CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment
Hongwei Xue, Yuchong Sun, Bei Liu, Jianlong Fu, Ruihua Song, Houqiang Li, Jiebo Luo
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Pose
Activity
PDF
2022-09-13
Do Androids Laugh at Electric Sheep? Humor 'Understanding' Benchmarks from The New Yorker Caption Contest
Jack Hessel, Ana Marasović, Jena D. Hwang, Lillian Lee, Jeff Da, Rowan Zellers, Robert Mankoff, Yejin Choi
arXiv_CV
arXiv_CV
Face
Relation
Caption
Language_Model
PDF
2022-09-13
StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation
Adyasha Maharana, Darryl Hannan, Mohit Bansal
arXiv_AI
arXiv_AI
Transformer
Adversarial
Pose
GAN
Caption
PDF
2022-09-12
Towards Multi-Lingual Visual Question Answering
Soravit Changpinyo, Linting Xue, Idan Szpektor, Ashish V. Thapliyal, Julien Amelot, Xi Chen, Radu Soricut
arXiv_CV
arXiv_CV
Pose
VQA
Caption
QA
PDF
2022-09-12
A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language
Bing Su, Dazhao Du, Zhao Yang, Yujie Zhou, Jiangmeng Li, Anyi Rao, Hao Sun, Zhiwu Lu, Ji-Rong Wen
arXiv_AI
arXiv_AI
Knowledge
Pose
Contrastive_Learning
Caption
Prediction
PDF
2022-09-11
MAiVAR: Multimodal Audio-Image and Video Action Recognizer
Muhammad Bilal Shaikh, Douglas Chai, Syed Mohammed Shamsul Islam, Naveed Akhtar
arXiv_CV
arXiv_CV
Image_Caption
Recognition
Pose
Action_Recognition
Action
PDF
2022-09-09
Pre-training image-language transformers for open-vocabulary tasks
AJ Piergiovanni, Weicheng Kuo, Anelia Angelova
arXiv_CV
arXiv_CV
Transformer
VQA
Caption
PDF
2022-09-09
EchoCoTr: Estimation of the Left Ventricular Ejection Fraction from Spatiotemporal Echocardiography
Rand Muhtaseb, Mohammad Yaqub
arXiv_CV
arXiv_CV
Transformer
Video_Caption
Bert
Pose
Face
Action
Medical
CNN
PDF
2022-09-08
FETA: Towards Specializing Foundation Models for Expert Task Applications
Amit Alfassy, Assaf Arbelle, Oshri Halimi, Sivan Harary, Roei Herzig, Eli Schwartz, Rameswar Panda, Michele Dolfi, Christoph Auer, Kate Saenko, Peter W. J. Staar, Rogerio Feris, Leonid Karlinsky
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Zero-Shot
Pose
Action
PDF
2022-09-07
Neural Feature Fusion Fields: 3D Distillation of Self-Supervised 2D Image Representations
Vadim Tschernezki, Iro Laina, Diane Larlus, Andrea Vedaldi
arXiv_CV
arXiv_CV
Image_Caption
Segmentation
3D
Self-Supervised
PDF
2022-09-07
Foundations and Recent Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions
Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency
arXiv_AI
arXiv_AI
Video_Caption
Pose
Autonomous
PDF
2022-09-07
Multi-Grained Angle Representation for Remote Sensing Object Detection
Hao Wang, Zhanchao Huang, Zhengchao Chen, Ying Song, Wei Li
arXiv_CV
arXiv_CV
Image_Caption
Pose
Face
Classification
Detection
Object_Detection
Prediction
PDF
2022-09-06
Improving the Accuracy and Robustness of CNNs Using a Deep CCA Neural Data Regularizer
Cassidy Pirlot, Richard C. Gerum, Cory Efird, Joel Zylberberg, Alona Fyshe
arXiv_CV
arXiv_CV
Image_Caption
Recognition
Regularization
Adversarial
Classification
Relation
CNN
PDF
2022-09-05
Design of the topology for contrastive visual-textual alignment
Zhun Sun
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Embedding
Zero-Shot
Optimization
Pose
PDF
2022-09-05
Bridging Music and Text with Crowdsourced Music Comments: A Sequence-to-Sequence Framework for Thematic Music Comments Generation
Peining Zhang, Junliang Guo, Linli Xu, Mu You, Junming Yin
arXiv_CL
arXiv_CL
Image_Caption
Pose
Text_Generation
Caption
CNN
PDF
2022-09-04
Every picture tells a story: Image-grounded controllable stylistic story generation
Holy Lovenia, Bryan Wilie, Romain Barraud, Samuel Cahyawijaya, Willy Chung, Pascale Fung
arXiv_CL
arXiv_CL
Image_Caption
Pose
Action
Text_Generation
Caption
PDF
2022-09-04
Pseudo-LiDAR for Visual Odometry
Huiying Deng, Guangming Wang, Zhiheng Feng, Chaokang Jiang, Xinrui Wu, Yanzi Miao, Hesheng Wang
arXiv_RO
arXiv_RO
Image_Caption
Point_Cloud
3D
Sparse
Knowledge
Pose
Matching
PDF
2022-09-04
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling
Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, Zicheng Liu
arXiv_CV
arXiv_CV
Transformer
Video_Caption
Caption
Optical_Flow
Video_Retrieval
PDF
2022-09-03
Label Structure Preserving Contrastive Embedding for Multi-Label Learning with Missing Labels
Zhongchen Ma, Lisha Li, Qirong Mao, Songcan Chen
arXiv_CV
arXiv_CV
Image_Caption
Embedding
Unsupervised
Represenation_Learning
Knowledge
Pose
Contrastive_Learning
Classification
GAN
Image_Classification
PDF
2022-09-03
vieCap4H-VLSP 2021: Vietnamese Image Captioning for Healthcare Domain using Swin Transformer and Attention-based LSTM
Thanh Tin Nguyen, Long H. Nguyen, Nhat Truong Pham, Liu Tai Nguyen, Van Huong Do, Hai Nguyen, Ngoc Duy Nguyen
arXiv_CV
arXiv_CV
Image_Caption
Transformer
RNN
Speech
Pose
Attention
Caption
CNN
PDF
2022-08-26
Partially Relevant Video Retrieval
Jianfeng Dong, Xianke Chen, Minsong Zhang, Xun Yang, Shujie Chen, Xirong Li, Xun Wang
arXiv_CV
arXiv_CV
Pose
Caption
Activity
Video_Retrieval
PDF
2022-08-26
Few-Shot Learning Meets Transformer: Unified Query-Support Transformers for Few-Shot Classification
Xixi Wang, Xiao Wang, Bo Jiang, Bin Luo
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Pose
Classification
Few-Shot
Attention
PDF
2022-08-25
Riesz-Quincunx-UNet Variational Auto-Encoder for Satellite Image Denoising
Duy H. Thai, Xiqi Fei, Minh Tri Le, Andreas Züfle, Konrad Wessels
arXiv_CV
arXiv_CV
Image_Caption
Segmentation
Pose
Quantitative
Deep_Learning
Denoising
Medical
CNN
PDF
2022-08-25
Multiresolution Neural Networks for Imaging
Hallison Paz, Tiago Novello, Vinicius Silva, Luiz Schirmer, Guilherme Schardong, Luiz Velho
arXiv_CV
arXiv_CV
Image_Caption
Pose
PDF
2022-08-24
Improving Natural-Language-based Audio Retrieval with Transfer Learning and Audio & Text Augmentations
Paul Primus, Gerhard Widmer
arXiv_SD
arXiv_SD
Transformer
Embedding
Transfer_Learning
Optimization
Pose
Deep_Learning
Caption
PDF
2022-08-24
Visual Subtitle Feature Enhanced Video Outline Generation
Qi Lv, Ziqiang Cao, Wenrui Xie, Derui Wang, Jingwen Wang, Zhiyong Hu, Tangkun Zhang, Yuan Ba, Yuanhang Li, Min Cao, Wenjie Li, Sujian Li, Guohong Fu
arXiv_CL
arXiv_CL
Segmentation
Video_Caption
OCR
Pose
Attention
Summarization
PDF
2022-08-23
IMPaSh: A Novel Domain-shift Resistant Representation for Colorectal Cancer Tissue Classification
Trinh Thi Le Vuong, Quoc Dang Vu, Mostafa Jahanifar, Simon Graham, Jin Tae Kwak, Nasir Rajpoot
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Self-Supervised
Pose
Contrastive_Learning
Classification
Deep_Learning
PDF
2022-08-22
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, Furu Wei
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Segmentation
Semantic_Segmentation
Pose
Classification
Detection
VQA
Object_Detection
Caption
Image_Classification
QA
PDF
2022-08-22
A Medical Semantic-Assisted Transformer for Radiographic Report Generation
Zhanyu Wang, Mingkang Tang, Lei Wang, Xiu Li, Luping Zhou
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Sparse
Pose
Action
Attention
Medical
Caption
PDF
2022-08-22
Revising Image-Text Retrieval via Multi-Modal Entailment
Xu Yan, Chunhui Ai, Ziqiang Cao, Min Cao, Sujian Li, Wenjie Chen, Guohong Fu
arXiv_AI
arXiv_AI
Pose
Caption
Matching
PDF
2022-08-22
Identifying Auxiliary or Adversarial Tasks Using Necessary Condition Analysis for Adversarial Multi-task Video Understanding
Stephen Su, Samuel Kwong, Qingyu Zhao, De-An Huang, Juan Carlos Niebles, Ehsan Adeli
arXiv_AI
arXiv_AI
Recognition
Video_Caption
Adversarial
Pose
Action_Recognition
Action
Relation
PDF
2022-08-19
Aspect-based Sentiment Classification with Sequential Cross-modal Semantic Graph
Yufeng Huang, Zhuo Chen, Wen Zhang, Jiaoyan Chen, Jeff Z. Pan, Zhen Yao, Yujie Xie, Huajun Chen
arXiv_CV
arXiv_CV
Image_Caption
Sentiment_Classification
Pose
Classification
Relation
Sentiment
Caption
PDF
2022-08-19
Diverse Video Captioning by Adaptive Spatio-temporal Attention
Zohreh Ghaderi, Leonard Salewski, Hendrik P. A. Lensch
arXiv_CV
arXiv_CV
Transformer
Video_Caption
Relation
Attention
Text_Generation
Caption
Inference
PDF
2022-08-18
VAuLT: Augmenting the Vision-and-Language Transformer with the Propagation of Deep Language Representations
Georgios Chochlakis, Tejas Srinivasan, Jesse Thomason, Shrikanth Narayanan (University of Southern California)
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Bert
Pose
Relation
Caption
Inference
Language_Model
PDF
2022-08-18
GSRFormer: Grounded Situation Recognition Transformer with Alternate Semantic Attention Refinement
Zhi-Qi Cheng, Qi Dai, Siyao Li, Teruko Mitamura, Alexander Hauptmann
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Recognition
Salient
Pose
Detection
Relation
Object_Detection
Attention
Caption
Activity
PDF
2022-08-18
Discovering Bugs in Vision Models using Off-the-shelf Image Generation and Captioning
Olivia Wiles, Isabela Albuquerque, Sven Gowal
arXiv_CV
arXiv_CV
Image_Caption
Adversarial
Relation
Caption
PDF
2022-08-18
Towards Label-efficient Automatic Diagnosis and Analysis: A Comprehensive Survey of Advanced Deep Learning-based Weakly-supervised, Semi-supervised and Self-supervised Techniques in Histopathological Image Analysis
Linhao Qu, Siyu Liu, Xiaoyu Liu, Manning Wang, Zhijian Song
arXiv_CV
arXiv_CV
Image_Caption
Weakly_Supervised
Represenation_Learning
Review
Self-Supervised
Survey
Deep_Learning
CNN
Prediction
PDF
2022-08-17
ILLUME: Rationalizing Vision-Language Models by Interacting with their Jabber
Manuel Brack, Patrick Schramowski, Björn Deiseroth, Kristian Kersting
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Pose
VQA
Caption
Language_Model
PDF
2022-08-17
DeepSportradar-v1: Computer Vision Dataset for Sports Understanding with High Quality Annotations
Gabriel Van Zandycke, Vladimir Somers, Maxime Istasse, Carlo Del Don, Davide Zambrano
arXiv_CV
arXiv_CV
Segmentation
Video_Caption
3D
Pose
Deep_Learning
Attention
GAN
Re-identification
PDF
2022-08-17
Towards Open-vocabulary Scene Graph Generation with Prompt-based Finetuning
Tao He, Lianli Gao, Jingkuan Song, Yuan-Fang Li
arXiv_CV
arXiv_CV
Pose
Relation
Visual_Relation
Caption
Inference
QA
PDF
2022-08-17
Boosting Modern and Historical Handwritten Text Recognition with Deformable Convolutions
Silvia Cascianelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
arXiv_CV
arXiv_CV
Image_Caption
Recognition
Pose
Action
CNN
PDF
2022-08-17
Multimodal Lecture Presentations Dataset: Understanding Multimodality in Educational Slides
Dong Won Lee, Chaitanya Ahuja, Paul Pu Liang, Sanika Natu, Louis-Philippe Morency
arXiv_CV
arXiv_CV
Transformer
Knowledge
Speech
Caption
PDF
2022-08-16
M2HF: Multi-level Multi-modal Hybrid Fusion for Text-Video Retrieval
Shuo Liu, Weize Quan, Ming Zhou, Sihong Chen, Jian Kang, Zhe Zhao, Chen Chen, Dong-Ming Yan
arXiv_CV
arXiv_CV
Pose
Action
Relation
Caption
Activity
Video_Retrieval
PDF
2022-08-15
C3-DINO: Joint Contrastive and Non-contrastive Self-Supervised Learning for Speaker Verification
Chunlei Zhang, Dong Yu
arXiv_SD
arXiv_SD
Image_Caption
Embedding
Speech
Self-Supervised
Pose
Contrastive_Learning
Attention
PDF
2022-08-13
Self-Contained Entity Discovery from Captioned Videos
Melika Ayoughi, Pascal Mettes, Paul Groth
arXiv_CV
arXiv_CV
Knowledge
Pose
Face
Caption
PDF
2022-08-13
Medical image analysis based on transformer: A Review
Zhaoshan Liu, Lei Shen
arXiv_CV
arXiv_CV
Transformer
Segmentation
Review
Classification
Detection
Denoising
Attention
GAN
Medical
Caption
PDF
2022-08-13
ExpansionNet v2: Block Static Expansion in fast end to end training for Image Captioning
Jia Cheng Hu, Roberto Cavicchioli, Alessandro Capotondi
arXiv_CV
arXiv_CV
Image_Caption
Deep_Learning
Caption
PDF
2022-08-12
BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers
Zhiliang Peng, Li Dong, Hangbo Bao, Qixiang Ye, Furu Wei
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Reconstruction
Segmentation
Semantic_Segmentation
Represenation_Learning
Knowledge
Self-Supervised
Pose
Classification
Image_Classification
Prediction
PDF
2022-08-12
An investigation on selecting audio pre-trained models for audio captioning
Peiran Yan, Shengchen Li
arXiv_SD
arXiv_SD
Pose
Relation
Caption
PDF
2022-08-12
Facial Expression Recognition and Image Description Generation in Vietnamese
Khang Nhut Lam, Kim-Ngoc Thi Nguyen, Loc Huu Nguy, Jugal Kalita
arXiv_CV
arXiv_CV
Image_Caption
Recognition
RNN
Pose
Emotion
PDF
2022-08-12
Motion Sensitive Contrastive Learning for Self-supervised Video Representation
Jingcheng Ni, Nan Zhou, Jie Qin, Qian Wu, Junqi Liu, Boxun Li, Di Huang
arXiv_CV
arXiv_CV
Video_Caption
3D
Represenation_Learning
Self-Supervised
Pose
Contrastive_Learning
Classification
Optical_Flow
Video_Retrieval
Video_Classification
PDF
2022-08-11
MILAN: Masked Image Pretraining on Language Assisted Representation
Zejiang Hou, Fei Sun, Yen-Kuang Chen, Yuan Xie, Sun-Yuan Kung
arXiv_CL
arXiv_CL
Transformer
Reconstruction
Segmentation
Semantic_Segmentation
Weakly_Supervised
Pose
Attention
Caption
PDF
2022-08-11
Figure Descriptive Text Extraction using Ontological Representation
Gilchan Park, Julia Rayz, Line Pouchard
arXiv_CL
arXiv_CL
Recognition
Knowledge
Action
Classification
Caption
PDF
2022-08-11
PPMN: Pixel-Phrase Matching Network for One-Stage Panoptic Narrative Grounding
Zihan Ding, Zi-han Ding, Tianrui Hui, Junshi Huang, Xiaoming Wei, Xiaolin Wei, Si Liu
arXiv_CV
arXiv_CV
Segmentation
Sparse
Pose
Caption
Matching
PDF
2022-08-10
Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP
Thao Nguyen, Gabriel Ilharco, Mitchell Wortsman, Sewoong Oh, Ludwig Schmidt
arXiv_CV
arXiv_CV
Transformer
Action
Caption
PDF
2022-08-10
Exploring Anchor-based Detection for Ego4D Natural Language Query
Sipeng Zheng, Qi Zhang, Bei Liu, Qin Jin, Jianlong Fu
arXiv_AI
arXiv_AI
Video_Caption
Pose
Face
Detection
PDF
2022-08-10
Aesthetic Visual Question Answering of Photographs
Xin Jin, Wu Zhou, Xinghui Zhou, Shuai Cui, Le Zhang, Jianwen Lv, Shu Zhao
arXiv_CV
arXiv_CV
Pose
Sentiment
VQA
Caption
QA
PDF
2022-08-10
Alternating Cross-attention Vision-Language Model for Efficient Learning with Medical Image and Report without Curation
Sangjoon Park, Eun Sun Lee, Jeong Eun Lee, Jong Chul Ye
arXiv_CV
arXiv_CV
Transformer
Detection
Attention
Medical
Caption
Language_Model
PDF
2022-08-09
Automatic Ultrasound Image Segmentation of Supraclavicular Nerve Using Dilated U-Net Deep Learning Architecture
Mizuki Miyatake, Subhash Nerella, David Simpson, Natalia Pawlowicz, Sarah Stern, Patrick Tighe, Parisa Rashidi
arXiv_CV
arXiv_CV
Image_Caption
Segmentation
Recognition
Deep_Learning
Detection
Medical
PDF
2022-08-09
Sports Video Analysis on Large-Scale Data
Dekun Wu, He Zhao, Xingce Bao, Richard P. Wildes
arXiv_CV
arXiv_CV
Transformer
Segmentation
Recognition
Salient
Pose
Action_Recognition
Action
Caption
PDF
2022-08-09
Aesthetic Attributes Assessment of Images with AMANv2 and DPC-CaptionsV2
Xinghui Zhou, Xin Jin, Jianwen Lv, Heng Huang, Ming Mao, Shuai Cui
arXiv_CV
arXiv_CV
Image_Caption
Transformer
RNN
Knowledge
Pose
Caption
PDF
2022-08-08
Semi-Supervised Cross-Modal Salient Object Detection with U-Structure Networks
Yunqing Bao, Hang Dai, Abdulmotaleb Elsaddik
arXiv_CV
arXiv_CV
Image_Caption
Segmentation
Salient
Pose
Detection
Object_Detection
Attention
Caption
PDF
2022-08-08
Distinctive Image Captioning via CLIP Guided Group Optimization
Youyuan Zhang, Jiuniu Wang, Hao Wu, Wenjia Xu
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Optimization
Pose
Caption
PDF
2022-08-08
Boosting Video-Text Retrieval with Explicit High-Level Semantics
Haoran Wang, Di Xu, Dongliang He, Fu Li, Zhong Ji, Jungong Han, Errui Ding
arXiv_CV
arXiv_CV
Video_Caption
Pose
Action
Caption
PDF
2022-08-08
Analog Bits: Generating Discrete Data using Diffusion Models with Self-Conditioning
Ting Chen, Ruixiang Zhang, Geoffrey Hinton
arXiv_AI
arXiv_AI
Image_Caption
Pose
Caption
PDF
2022-08-07
Adaptive Local Implicit Image Function for Arbitrary-scale Super-resolution
Hongwei Li, Tao Dai, Yiming Li, Xueyi Zou, Shu-Tao Xia
arXiv_AI
arXiv_AI
Image_Caption
Super_Resolution
Pose
PDF
2022-08-06
Deep Uncalibrated Photometric Stereo via Inter-Intra Image Feature Fusion
Fangzhou Gao, Meng Wang, Lianghao Zhang, Li Wang, Jiawan Zhang
arXiv_CV
arXiv_CV
Image_Caption
Optimization
Pose
Face
Action
Deep_Learning
PDF
2022-08-05
RadTex: Learning Efficient Radiograph Representations from Text Reports
Keegan Quigley, Miriam Cha, Ruizhi Liao, Geeticka Chauhan, Steven Horng, Seth Berkowitz, Polina Golland
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Classification
Deep_Learning
Medical
Caption
CNN
Image_Classification
PDF
2022-08-05
ChiQA: A Large Scale Image-based Real-World Question Answering Dataset for Multi-Modal Understanding
Bingning Wang, Feiyang Lv, Ting Yao, Yiming Yuan, Jin Ma, Yu Luo, Haijin Liang
arXiv_CV
arXiv_CV
Image_Caption
VQA
Caption
Language_Model
QA
PDF
2022-08-04
MVSFormer: Learning Robust Image Representations via Transformers and Temperature-based Depth for Multi-View Stereo
Chenjie Cao, Xinlin Ren, Yanwei Fu
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Represenation_Learning
Pose
Classification
CNN
PDF
2022-08-04
SA-NET.v2: Real-time vehicle detection from oblique UAV images with use of uncertainty estimation in deep meta-learning
Mehdi Khoshboresh-Masouleh, Reza Shah-Hosseini
arXiv_CV
arXiv_CV
Segmentation
Semantic_Segmentation
Video_Caption
Pose
Detection
Attention
PDF
2022-08-03
Word-Level Fine-Grained Story Visualization
Bowen Li, Thomas Lukasiewicz
arXiv_CV
arXiv_CV
Segmentation
Pose
Attention
Caption
PDF
2022-08-03
A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval
Alex Falcon, Giuseppe Serra, Oswald Lanz
arXiv_CV
arXiv_CV
Pose
Attention
Caption
Video_Retrieval
PDF
2022-08-03
Integrating Object-aware and Interaction-aware Knowledge for Weakly Supervised Scene Graph Generation
Xingchen Li, Long Chen, Wenbo Ma, Yi Yang, Jun Xiao
arXiv_CV
arXiv_CV
Weakly_Supervised
Knowledge
Pose
Action
Relation
Visual_Relation
Caption
PDF
2022-08-02
Two-Stream Transformer Architecture for Long Video Understanding
Edward Fish, Jon Weinbren, Andrew Gilbert
arXiv_CV
arXiv_CV
Transformer
Recognition
Video_Caption
Pose
Action_Recognition
Action
Classification
Attention
Video_Classification
PDF
2022-08-01
BATMAN: Bilateral Attention Transformer in Motion-Appearance Neighboring Space for Video Object Segmentation
Ye Yu, Jialin Yuan, Gaurav Mittal, Li Fuxin, Mei Chen
arXiv_CV
arXiv_CV
Transformer
Segmentation
Video_Caption
Pose
Face
Attention
Optical_Flow
PDF
2022-08-01
MAFW: A Large-scale, Multi-modal, Compound Affective Database for Dynamic Facial Expression Recognition in the Wild
Yuanyuan Liu, Wei Dai, Chuanxu Feng, Wenbin Wang, Guanghao Yin, Jiabei Zeng, Shiguang Shan
arXiv_CV
arXiv_CV
Transformer
Recognition
Knowledge
Pose
Emotion
Relation
Caption
PDF
2022-07-31
Neuro-Symbolic Learning: Principles and Applications in Ophthalmology
Muhammad Hassan, Haifei Guan, Aikaterini Melliou, Yuqi Wang, Qianhui Sun, Sen Zeng, Wen Liang, Yiwei Zhang, Ziheng Zhang, Qiuyue Hu, Yang Liu, Shunkai Shi, Lin An, Shuyue Ma, Ijaz Gul, Muhammad Akmal Rahee, Zhou You, Canyang Zhang, Vijay Kumar Pandey, Yuxing Han, Yongbing Zhang, Ming Xu, Qiming Huang, Jiefu Tan, Qi Xing, Peiwu Qin, Dongmei Yu
arXiv_AI
arXiv_AI
Image_Caption
Embedding
Knowledge
Review
Survey
Deep_Learning
Caption
PDF
2022-07-30
Point Primitive Transformer for Long-Term 4D Point Cloud Video Understanding
Hao Wen, Yunze Liu, Jingwei Huang, Bo Duan, Li Yi
arXiv_CV
arXiv_CV
Transformer
Point_Cloud
Video_Caption
Pose
PDF
2022-07-29
Can Shuffling Video Benefit Temporal Bias Problem: A Novel Training Framework for Temporal Grounding
Jiachang Hao, Haifeng Sun, Pengfei Ren, Jingyu Wang, Qi Qi, Jianxin Liao
arXiv_CV
arXiv_CV
Pose
Caption
Activity
Matching
PDF
2022-07-29
High Dynamic Range and Super-Resolution from Raw Image Bursts
Bruno Lecouat, Thomas Eboli, Jean Ponce, Julien Mairal
arXiv_CV
arXiv_CV
Image_Caption
Reconstruction
Super_Resolution
Restoration
Optimization
Knowledge
Pose
PDF
2022-07-29
Paired Cross-Modal Data Augmentation for Fine-Grained Image-to-Text Retrieval
Hao Wang, Guosheng Lin, Steven C. H. Hoi, Chunyan Miao
arXiv_CV
arXiv_CV
Image_Caption
Pose
GAN
Caption
PDF
2022-07-28
Separable Quaternion Matrix Factorization for Polarization Images
Junjun Pan, Michael K. Ng
arXiv_CV
arXiv_CV
Image_Caption
Pose
PDF
2022-07-28
Self-supervised learning with rotation-invariant kernels
Léon Zheng (DANTE), Gilles Puy, Elisa Riccietti (DANTE), Patrick Pérez, Rémi Gribonval (DANTE)
arXiv_AI
arXiv_AI
Image_Caption
Embedding
Regularization
Self-Supervised
Pose
PDF
2022-07-27
D3C2-Net: Dual-Domain Deep Convolutional Coding Network for Compressive Sensing
Weiqi Li, Bin Chen, Jian Zhang
arXiv_CV
arXiv_CV
Image_Caption
Optimization
Pose
Compressive_Sensing
CNN
PDF
2022-07-27
Reducing the Vision and Language Bias for Temporal Sentence Grounding
Daizong Liu, Xiaoye Qu, Wei Hu
arXiv_CV
arXiv_CV
Pose
Caption
Activity
PDF
2022-07-27
Uncertainty-based Visual Question Answering: Estimating Semantic Inconsistency between Image and Knowledge Base
Jinyeong Chae, Jihie Kim
arXiv_AI
arXiv_AI
Knowledge
Pose
VQA
Caption
QA
PDF
2022-07-26
Retrieval-Augmented Transformer for Image Captioning
Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
arXiv_AI
arXiv_AI
Image_Caption
Transformer
Knowledge
Action
Attention
Caption
PDF
2022-07-26
Unsupervised Contrastive Learning of Image Representations from Ultrasound Videos with Hard Negative Mining
Soumen Basu, Somanshu Singla, Mayank Gupta, Pratyaksha Rana, Pankaj Gupta, Chetan Arora
arXiv_CV
arXiv_CV
Image_Caption
Embedding
Unsupervised
Pose
Contrastive_Learning
Detection
GAN
PDF
2022-07-26
NewsStories: Illustrating articles with visual summaries
Reuben Tan, Bryan A. Plummer, Kate Saenko, JP Lewis, Avneesh Sud, Thomas Leung
arXiv_AI
arXiv_AI
Zero-Shot
Self-Supervised
Relation
Caption
PDF
2022-07-26
Static and Dynamic Concepts for Self-supervised Video Representation Learning
Rui Qian, Shuangrui Ding, Xian Liu, Dahua Lin
arXiv_CV
arXiv_CV
Video_Caption
Represenation_Learning
Regularization
Self-Supervised
Pose
Attention
PDF
2022-07-25
Is GPT-3 all you need for Visual Question Answering in Cultural Heritage?
Pietro Bongini, Federico Becattini, Alberto Del Bimbo
arXiv_CL
arXiv_CL
Pose
Deep_Learning
VQA
Caption
PDF
2022-07-25
ConceptBeam: Concept Driven Target Speech Extraction
Yasunori Ohishi, Marc Delcroix, Tsubasa Ochiai, Shoko Araki, Daiki Takeuchi, Daisuke Niizumi, Akisato Kimura, Noboru Harada, Kunio Kashino
arXiv_SD
arXiv_SD
Embedding
Recognition
Speech
Pose
Action
Caption
PDF
2022-07-24
SAVCHOI: Detecting Suspicious Activities using Dense Video Captioning with Human Object Interactions
Ansh Mittal, Shuvam Ghosal, Rishibha Bansal, Dat Nguyen
arXiv_AI
arXiv_AI
Surveillance
Video_Caption
Pose
Action
Caption
PDF
2022-07-23
Robots Enact Malignant Stereotypes
Andrew Hundt, William Agnew, Vicky Zeng, Severin Kacianka, Matthew Gombolay
arXiv_AI
arXiv_AI
Face
Caption
Autonomous
PDF
2022-07-23
Arbitrary Style Transfer with Structure Enhancement by Combining the Global and Local Loss
Lizhen Long, Chi-Man Pun
arXiv_CV
arXiv_CV
Image_Caption
Style_Transfer
Enhancement
Classification
PDF
2022-07-22
Egocentric scene context for human-centric environment understanding from video
Tushar Nagarajan, Santhosh Kumar Ramakrishnan, Ruta Desai, James Hillis, Kristen Grauman
arXiv_CV
arXiv_CV
Video_Caption
Scene_Classification
3D
Pose
Classification
PDF
2022-07-22
Video Swin Transformers for Egocentric Video Understanding @ Ego4D Challenges 2022
Maria Escobar, Laura Daza, Cristina González, Jordi Pont-Tuset, Pablo Arbeláez
arXiv_CV
arXiv_CV
Transformer
Video_Caption
Classification
PDF
2022-07-22
Rethinking the Reference-based Distinctive Image Captioning
Yangjun Mao, Long Chen, Zhihong Jiang, Dong Zhang, Zhimeng Zhang, Jian Shao, Jun Xiao
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Pose
Attention
Caption
Matching
PDF
2022-07-22
Zero-Shot Video Captioning with Evolving Pseudo-Tokens
Yoad Tewel, Yoav Shalev, Roy Nadler, Idan Schwartz, Lior Wolf
arXiv_CV
arXiv_CV
Image_Caption
Video_Caption
Zero-Shot
Knowledge
Caption
Language_Model
Matching
PDF
2022-07-22
Efficient Modeling of Future Context for Image Captioning
Zhengcong Fei, Junshi Huang, Xiaoming Wei, Xiaolin Wei
arXiv_CV
arXiv_CV
Image_Caption
Pose
Relation
Caption
Inference
PDF
2022-07-21
An Efficient Spatio-Temporal Pyramid Transformer for Action Detection
Yuetian Weng, Zizheng Pan, Mingfei Han, Xiaojun Chang, Bohan Zhuang
arXiv_CV
arXiv_CV
Transformer
Video_Caption
3D
Pose
Action
Detection
Attention
PDF
2022-07-20
Spotting Temporally Precise, Fine-Grained Events in Video
James Hong, Haotian Zhang, Michaël Gharbi, Matthew Fisher, Kayvon Fatahalian
arXiv_CV
arXiv_CV
Segmentation
Video_Caption
Pose
Action
Detection
PDF
2022-07-20
GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features
Van-Quang Nguyen, Masanori Suganuma, Takayuki Okatani
arXiv_AI
arXiv_AI
Image_Caption
Transformer
Pose
Detection
Object_Detection
Caption
Inference
PDF
2022-07-20
Explicit Image Caption Editing
Zhen Wang, Long Chen, Wenbo Ma, Guangxing Han, Yulei Niu, Jian Shao, Jun Xiao
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Pose
GAN
Caption
PDF
2022-07-19
Shrinking the Semantic Gap: Spatial Pooling of Local Moment Invariants for Copy-Move Forgery Detection
Chao Wang, Zhiqiu Huang, Shuren Qi, Yaoshen Yu, Guohua Shen
arXiv_CV
arXiv_CV
Image_Caption
Salient
Pose
Detection
Matching
PDF
2022-07-19
Relational Future Captioning Model for Explaining Likely Collisions in Daily Tasks
Motonari Kambara, Komei Sugiura
arXiv_CV
arXiv_CV
Transformer
Pose
Action
Relation
Attention
Caption
PDF
2022-07-18
Unifying Event Detection and Captioning as Sequence Generation via Pre-Training
Qi Zhang, Yuqing Song, Qin Jin
arXiv_CV
arXiv_CV
Transformer
Video_Caption
Pose
Action
Detection
Caption
Activity
PDF
2022-07-18
Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding
Quande Liu, Youpeng Wen, Jianhua Han, Chunjing Xu, Hang Xu, Xiaodan Liang
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Segmentation
Embedding
Semantic_Segmentation
Zero-Shot
Pose
Relation
Attention
Caption
PDF
2022-07-16
SVGraph: Learning Semantic Graphs from Instructional Videos
Madeline C. Schiappa, Yogesh S. Rawat
arXiv_CV
arXiv_CV
Video_Caption
Self-Supervised
Pose
Attention
PDF
2022-07-16
Dual-branch Hybrid Learning Network for Unbiased Scene Graph Generation
Chaofan Zheng, Lianli Gao, Xinyu Lyu, Pengpeng Zeng, Abdulmotaleb El Saddik, Heng Tao Shen
arXiv_CV
arXiv_CV
Image_Caption
Pose
Caption
Inference
QA
PDF
2022-07-16
Clover: Towards A Unified Video-Language Alignment and Fusion Model
Jingjia Huang, Yinan Li, Jiashi Feng, Xiaoshuai Sun, Rongrong Ji
arXiv_CV
arXiv_CV
Transformer
Video_Caption
Zero-Shot
Pose
Language_Model
Video_Retrieval
PDF
2022-07-15
Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning
Shibani Santurkar, Yann Dubois, Rohan Taori, Percy Liang, Tatsunori Hashimoto
arXiv_CV
arXiv_CV
Transformer
Represenation_Learning
Classification
Caption
PDF
2022-07-15
LineCap: Line Charts for Data Visualization Captioning Models
Anita Mahinpei, Zona Kostic, Chris Tanner
arXiv_CV
arXiv_CV
Image_Caption
Pose
Deep_Learning
Caption
PDF
2022-07-13
Is Appearance Free Action Recognition Possible?
Filip Ilic, Thomas Pock, Richard P. Wildes
arXiv_CV
arXiv_CV
Recognition
Video_Caption
Action_Recognition
Action
Optical_Flow
PDF
2022-07-12
Camera Pose Auto-Encoders for Improving Pose Regression
Yoli Shavit, Yosi Keller
arXiv_AI
arXiv_AI
Image_Caption
Optimization
Pose
PDF
2022-07-12
Skeletal Human Action Recognition using Hybrid Attention based Graph Convolutional Network
Hao Xing, Darius Burschka
arXiv_CV
arXiv_CV
Image_Caption
Recognition
Pose
Action_Recognition
Action
Relation
Attention
CNN
PDF
2022-07-12
A Baseline for Detecting Out-of-Distribution Examples in Image Captioning
Gabi Shalev, Gal-Lev Shalev, Joseph Keshet
arXiv_CV
arXiv_CV
Image_Caption
Detection
Caption
PDF
2022-07-12
Towards Hard-Positive Query Mining for DETR-based Human-Object Interaction Detection
Xubin Zhong, Changxing Ding, Zijian Li, Shaoli Huang
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Pose
Action
Detection
Object_Detection
Attention
Prediction
PDF
2022-07-11
Adaptive Fine-Grained Predicates Learning for Scene Graph Generation
Xinyu Lyu, Lianli Gao, Pengpeng Zeng, Heng Tao Shen, Jingkuan Song
arXiv_AI
arXiv_AI
Image_Caption
Pose
Classification
Relation
Caption
Image_Classification
Prediction
QA
PDF
2022-07-09
A Study on Self-Supervised Object Detection Pretraining
Trung Dang, Simon Kornblith, Huy Thong Nguyen, Peter Chin, Maryam Khademi
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Represenation_Learning
Self-Supervised
Action
Detection
Object_Detection
PDF
2022-07-09
Towards Multimodal Vision-Language Models Generating Non-Generic Text
Wes Robbins, Zanyar Zohourianshahzadi, Jugal Kalita
arXiv_AI
arXiv_AI
Image_Caption
Transformer
Recognition
Optical_Character
Caption
Language_Model
PDF
2022-07-08
Automated Audio Captioning and Language-Based Audio Retrieval
Clive Gomes, Hyejin Park, Patrick Kollman, Yi Song
arXiv_CL
arXiv_CL
Caption
PDF
2022-07-08
The Power of Transfer Learning in Agricultural Applications: AgriNet
Zahraa Al Sahili, Mariette Awad
arXiv_CV
arXiv_CV
Transfer_Learning
Recognition
Pose
Face
Classification
Deep_Learning
Detection
Face_Recognition
Medical
Caption
PDF
2022-07-07
Predicting Word Learning in Children from the Performance of Computer Vision Systems
Sunayana Rane, Mira L. Nencheva, Zeyu Wang, Casey Lew-Williams, Olga Russakovsky, Thomas L. Griffiths
arXiv_AI
arXiv_AI
Classification
Relation
Caption
PDF
2022-07-07
ExpansionNet: exploring the sequence length bottleneck in the Transformer for Image Captioning
Jia Cheng Hu
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Pose
Caption
CNN
PDF
2022-07-07
Deformer: Towards Displacement Field Learning for Unsupervised Medical Image Registration
Jiashun Chen, Donghuan Lu, Yu Zhang, Dong Wei, Munan Ning, Xinyu Shi, Zhe Xu, Yefeng Zheng
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Unsupervised
Pose
Relation
Attention
Medical
CNN
Prediction
PDF
2022-07-07
Improving Few-Shot Image Classification Using Machine- and User-Generated Natural Language Descriptions
Kosuke Nishida, Kyosuke Nishida, Shuichi Nishioka
arXiv_CV
arXiv_CV
Image_Caption
Knowledge
Pose
Classification
Few-Shot
Image_Classification
Prediction
PDF
2022-07-07
Dual-Stream Transformer for Generic Event Boundary Captioning
Xin Gu, Hanhua Ye, Guang Chen, Yufei Wang, Libo Zhang, Longyin Wen
arXiv_CV
arXiv_CV
Transformer
Video_Caption
Pose
Caption
PDF
2022-07-06
PIC 4th Challenge: Semantic-Assisted Multi-Feature Encoding and Multi-Head Decoding for Dense Video Captioning
Yifan Lu, Ziqi Zhang, Yuxin Chen, Chunfeng Yuan, Bing Li, Weiming Hu
arXiv_CV
arXiv_CV
Video_Caption
Classification
Detection
Object_Detection
Caption
PDF
2022-07-06
Unsupervised Learning for Human Sensing Using Radio Signals
Tianhong Li, Lijie Fan, Yuan Yuan, Dina Katabi
arXiv_CV
arXiv_CV
Unsupervised
Recognition
Represenation_Learning
Pose_Estimation
Pose
Contrastive_Learning
Action_Recognition
Action
Re-identification
Caption
PDF
2022-07-05
Zero-shot Cross-Linguistic Learning of Event Semantics
Malihe Alikhani, Thomas Kober, Bashar Alhafni, Yue Chen, Mert Inan, Elizabeth Nielsen, Shahab Raji, Mark Steedman, Matthew Stone
arXiv_CL
arXiv_CL
Zero-Shot
Salient
Face
Caption
PDF
2022-07-05
TractoFormer: A Novel Fiber-level Whole Brain Tractography Analysis Framework Using Spectral Embedding and Vision Transformers
Fan Zhang, Tengfei Xue, Weidong Cai, Yogesh Rathi, Carl-Fredrik Westin, Lauren J O'Donnell
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Embedding
3D
Pose
Quantitative
Classification
Relation
Attention
PDF
2022-07-05
Detecting and Recovering Sequential DeepFake Manipulation
Rui Shao, Tianxing Wu, Ziwei Liu
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Pose
Face
Detection
Caption
Prediction
PDF
2022-07-05
Federated Self-supervised Learning for Video Understanding
Yasar Abbas Ur Rehman, Yan Gao, Jiajun Shen, Pedro Porto Buarque de Gusmao, Nicholas Lane
arXiv_CV
arXiv_CV
Video_Caption
Self-Supervised
Pose
PDF
2022-07-05
Entity Linking in Tabular Data Needs the Right Attention
Miltiadis Marios Katsakioris, Yiwei Zhou, Daniele Masato
arXiv_CL
arXiv_CL
Sparse
Knowledge
Attention
Caption
PDF
2022-07-05
Scene-Aware Prompt for Multi-modal Dialogue Understanding and Generation
Bin Li, Yixuan Weng, Ziyu Ma, Bin Sun, Shutao Li
arXiv_CV
arXiv_CV
Pose
Caption
PDF
2022-07-04
Are metrics measuring what they should? An evaluation of image captioning task metrics
Othón González-Chávez, Guillermo Ruiz, Daniela Moctezuma, Tania A. Ramirez-delReal
arXiv_CV
arXiv_CV
Image_Caption
Relation
Caption
PDF
2022-07-04
GraphVid: It Only Takes a Few Nodes to Understand a Video
Eitan Kosman, Dotan Di Castro
arXiv_AI
arXiv_AI
Video_Caption
Pose
CNN
Inference
PDF
2022-07-03
Exploiting Context Information for Generic Event Boundary Captioning
Jinrui Zhang, Teng Wang, Feng Zheng, Ran Cheng, Ping Luo
arXiv_CV
arXiv_CV
Pose
Action
Caption
PDF
2022-07-02
Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation Learning and Retrieval
Keyu Wen, Zhenshan Tan, Qingrong Cheng, Cheng Chen, Xiaodong Gu
arXiv_AI
arXiv_AI
Image_Caption
Transformer
Represenation_Learning
Image_Retrieval
Knowledge
Contrastive_Learning
Action
Caption
Matching
PDF
2022-07-02
Syntax Controlled Knowledge Graph-to-Text Generation with Order and Semantic Consistency
Jin Liu, Chongfeng Fan, Fengyu Zhou, Huijuan Xu
arXiv_AI
arXiv_AI
Optimization
Regularization
Knowledge
Knowledge_Graph
Speech
Text_Generation
Caption
Prediction
PDF
2022-07-01
American == White in Multimodal Language-and-Image AI
Robert Wolfe, Aylin Caliskan
arXiv_AI
arXiv_AI
Image_Caption
Embedding
Face
VQA
GAN
Caption
PDF
2022-07-01
(Un)likelihood Training for Interpretable Embedding
Jiaxin Wu, Chong-Wah Ngo, Wing-Kwong Chan, Zhijian Hou
arXiv_CV
arXiv_CV
Embedding
Video_Caption
Represenation_Learning
Regularization
Knowledge
Pose
PDF
2022-06-30
Rethinking Surgical Captioning: End-to-End Window-Based MLP Transformer Using Patches
Mengya Xu, Mobarakol Islam, Hongliang Ren
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Video_Caption
3D
Pose
Detection
Object_Detection
Attention
Caption
Inference
Prediction
PDF
2022-06-30
Submission to Generic Event Boundary Detection Challenge@CVPR 2022: Local Context Modeling and Global Boundary Decoding Approach
Jiaqi Tang, Zhaoyang Liu, Jing Tan, Chen Qian, Wayne Wu, Limin Wang
arXiv_CV
arXiv_CV
Video_Caption
Pose
Boundary_Detection
Detection
PDF
2022-06-29
Technical Report for CVPR 2022 LOVEU AQTC Challenge
Hyeonyu Kim, Jongeun Kim, Jeonghun Kang, Sanguk Park, Dongchan Park, Taehwan Kim
arXiv_CV
arXiv_CV
Video_Caption
Face
Attention
PDF
2022-06-28
ZoDIAC: Zoneout Dropout Injection Attention Calculation
Zanyar Zohourianshahzadi, Jugal Kalita
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Pose
Face
Action
Classification
Relation
Attention
Caption
Image_Classification
PDF
2022-06-27
Parameter-Efficient Image-to-Video Transfer Learning
Junting Pan, Ziyi Lin, Xiatian Zhu, Jing Shao, Hongsheng Li
arXiv_CV
arXiv_CV
Image_Caption
Transfer_Learning
Recognition
Video_Caption
Knowledge
Pose
Action_Recognition
Action
PDF
2022-06-27
Lesion-Aware Contrastive Representation Learning for Histopathology Whole Slide Images Analysis
Jun Li, Yushan Zheng, Kun Wu, Jun Shi, Fengying Xie, Zhiguo Jiang
arXiv_CV
arXiv_CV
Image_Caption
Represenation_Learning
Self-Supervised
Pose
Contrastive_Learning
Classification
Attention
PDF
2022-06-26
VLCap: Vision-Language with Contrastive Learning for Coherent Video Paragraph Captioning
Kashu Yamazaki, Sang Truong, Khoa Vo, Michael Kidd, Chase Rainwater, Khoa Luu, Ngan Le
arXiv_CV
arXiv_CV
Transformer
Pose
Contrastive_Learning
Action
Relation
Caption
Activity
PDF
2022-06-24
Using BERT Embeddings to Model Word Importance in Conversational Transcripts for Deaf and Hard of Hearing Users
Akhter Al Amin, Saad Hassan, Cecilia O. Alm, Matt Huenerfauth
arXiv_CL
arXiv_CL
Embedding
Bert
Classification
Relation
Caption
Language_Model
PDF
2022-06-24
Competence-based Multimodal Curriculum Learning for Medical Report Generation
Fenglin Liu, Shen Ge, Xian Wu
arXiv_CV
arXiv_CV
Image_Caption
Pose
Medical
Caption
PDF
2022-06-24
Deep embedded clustering algorithm for clustering PACS repositories
Teo Manojlović, Matija Milanič, Ivan Štajduhar
arXiv_CV
arXiv_CV
Image_Caption
Embedding
Unsupervised
Action
Medical
CNN
PDF
2022-06-22
Prototypical Contrastive Language Image Pretraining
Delong Chen, Zhao Wu, Fan Liu, Zaiquan Yang, Yixiang Huang, Yiping Bao, Erjin Zhou
arXiv_CV
arXiv_CV
Zero-Shot
Knowledge
Pose
Classification
Attention
Caption
PDF
2022-06-21
Transformer-Based Multi-modal Proposal and Re-Rank for Wikipedia Image-Caption Matching
Nicola Messina, Davide Alessandro Coccomini, Andrea Esuli, Fabrizio Falchi
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Pose
Caption
Inference
Matching
PDF
2022-06-21
SemMAE: Semantic-Guided Masking for Learning Masked Autoencoders
Gang Li, Heliang Zheng, Daqing Liu, Bing Su, Changwen Zheng
arXiv_CV
arXiv_CV
Image_Caption
Segmentation
Semantic_Segmentation
Recognition
Self-Supervised
Relation
Attention
Language_Model
PDF
2022-06-21
KTN: Knowledge Transfer Network for Learning Multi-person 2D-3D Correspondences
Xuanhan Wang, Lianli Gao, Yixuan Zhou, Jingkuan Song, Meng Wang
arXiv_CV
arXiv_CV
Image_Caption
Segmentation
3D
Pose_Estimation
Knowledge
Knowledge_Graph
Pose
Detection
CNN
PDF
2022-06-21
Bypass Network for Semantics Driven Image Paragraph Captioning
Qi Zheng, Chaoyue Wang, Dadong Wang
arXiv_CV
arXiv_CV
Pose
Attention
Caption
PDF
2022-06-20
DALL-E for Detection: Language-driven Context Image Synthesis for Object Detection
Yunhao Ge, Jiashu Xu, Brian Nlong Zhao, Laurent Itti, Vibhav Vineet
arXiv_CV
arXiv_CV
Image_Caption
Recognition
Zero-Shot
Pose
Detection
Object_Detection
Caption
PDF
2022-06-19
A Self-Guided Framework for Radiology Report Generation
Jun Li, Shibo Li, Ying Hu, Huiren Tao
arXiv_CV
arXiv_CV
Image_Caption
Unsupervised
Knowledge
Pose
Deep_Learning
Medical
Caption
PDF
2022-06-19
What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs
Tal Shaharabany, Yoad Tewel, Lior Wolf
arXiv_CV
arXiv_CV
Image_Caption
Segmentation
Weakly_Supervised
Caption
Inference
Matching
PDF
2022-06-18
REVECA -- Rich Encoder-decoder framework for Video Event CAptioner
Jaehyuk Heo, YongGi Jeong, Sunwoo Kim, Jaehee Kim, Pilsung Kang
arXiv_CV
arXiv_CV
Segmentation
Embedding
Semantic_Segmentation
Video_Caption
Attention
Caption
PDF
2022-06-17
Intra-Instance VICReg: Bag of Self-Supervised Image Patch Embedding
Yubei Chen, Adrien Bardes, Zengyi Li, Yann LeCun
arXiv_CV
arXiv_CV
Image_Caption
Embedding
Knowledge
Self-Supervised
Pose
PDF
2022-06-17
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, Aniruddha Kembhavi
arXiv_CV
arXiv_CV
Transformer
Pose_Estimation
Pose
Detection
VQA
Object_Detection
Caption
QA
PDF
2022-06-16
Beyond Supervised vs. Unsupervised: Representative Benchmarking and Analysis of Image Representation Learning
Matthew Gwilliam, Abhinav Shrivastava
arXiv_CV
arXiv_CV
Image_Caption
Embedding
Unsupervised
Represenation_Learning
Pose
Contrastive_Learning
Classification
Prediction
PDF
2022-06-16
Channel Importance Matters in Few-Shot Image Classification
Xu Luo, Jing Xu, Zenglin Xu
arXiv_CV
arXiv_CV
Image_Caption
Pose
Classification
Few-Shot
Attention
CNN
Image_Classification
PDF
2022-06-16
Image Captioning based on Feature Refinement and Reflective Decoding
Ghadah Alabduljabbar, Hafida Benhidour, Said Kerrache
arXiv_CV
arXiv_CV
Image_Caption
Salient
Pose
Action
Deep_Learning
Attention
Caption
PDF
2022-06-16
Multimodal Dialogue State Tracking
Hung Le, Nancy F. Chen, Steven C.H. Hoi
arXiv_AI
arXiv_AI
Transformer
Tracking
Video_Caption
Knowledge
Self-Supervised
Pose
Prediction
PDF
2022-06-15
Prefix Language Models are Unified Modal Learners
Shizhe Diao, Wangchunshu Zhou, Xinsong Zhang, Jiawei Wang
arXiv_CV
arXiv_CV
Transformer
Zero-Shot
Pose
Classification
VQA
Text_Generation
Caption
Language_Model
QA
PDF
2022-06-15
A Unified Sequence Interface for Vision Tasks
Ting Chen, Saurabh Saxena, Lala Li, Tsung-Yi Lin, David J. Fleet, Geoffrey Hinton
arXiv_CV
arXiv_CV
Image_Caption
Segmentation
Face
Detection
Object_Detection
Caption
PDF
2022-06-15
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
Zi-Yi Dou, Aishwarya Kamath, Zhe Gan, Pengchuan Zhang, Jianfeng Wang, Linjie Li, Zicheng Liu, Ce Liu, Yann LeCun, Nanyun Peng, Jianfeng Gao, Lijuan Wang
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Detection
VQA
Object_Detection
Attention
Caption
QA
PDF
2022-06-15
Recent Advances in Scene Image Representation and Classification
Chiranjibi Sitaula, Tej Bahadur Shahi, Faezeh Marzbanrad
arXiv_CV
arXiv_CV
Image_Caption
Review
Pose
Survey
Quantitative
Classification
Deep_Learning
Image_Classification
PDF
2022-06-14
Measuring Representational Harms in Image Captioning
Angelina Wang, Solon Barocas, Kristen Laird, Hanna Wallach
arXiv_CV
arXiv_CV
Image_Caption
Pose
Face
Caption
PDF
2022-06-14
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling
Linjie Li, Zhe Gan, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Ce Liu, Lijuan Wang
arXiv_CV
arXiv_CV
Transformer
Video_Caption
Zero-Shot
Face
Few-Shot
Caption
Language_Model
Video_Retrieval
PDF
2022-06-14
Self-Supervision on Images and Text Reduces Reliance on Visual Shortcut Features
Anil Palepu, Andrew L Beam
arXiv_CV
arXiv_CV
Image_Caption
Self-Supervised
Deep_Learning
Medical
PDF
2022-06-14
ReCo: Retrieve and Co-segment for Zero-shot Transfer
Gyungin Shin, Weidi Xie, Samuel Albanie
arXiv_CV
arXiv_CV
Image_Caption
Transformer
Segmentation
Unsupervised
Semantic_Segmentation
Zero-Shot
Knowledge
Classification
Prediction
PDF
2022-06-14
Stand-Alone Inter-Frame Attention in Video Models
Fuchen Long, Zhaofan Qiu, Yingwei Pan, Ting Yao, Jiebo Luo, Tao Mei
arXiv_CV
arXiv_CV
Transformer
Video_Caption
3D
Deep_Learning
Attention
Prediction
PDF
2022-06-14
Comprehending and Ordering Semantics for Image Captioning
Yehao Li, Yingwei Pan, Ting Yao, Tao Mei
arXiv_CL
arXiv_CL
Image_Caption
Transformer
Pose
Detection
Object_Detection
Caption
PDF
2022-06-13
Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens
Elad Ben-Avraham, Roei Herzig, Karttikeya Mangalam, Amir Bar, Anna Rohrbach, Leonid Karlinsky, Trevor Darrell, Amir Globerson
arXiv_CV
arXiv_CV
Transformer
Recognition
Video_Caption
Pose
Action_Recognition
Action
Relation
PDF
2022-06-12
GLIPv2: Unifying Localization and Vision-Language Understanding
Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Harold Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, Jianfeng Gao
arXiv_AI
arXiv_AI
Image_Caption
Transformer
Segmentation
Zero-Shot
Contrastive_Learning
Detection
VQA
Few-Shot
Object_Detection
GAN
Caption
Language_Model
QA
PDF
2022-06-10
Zero-Shot Audio Classification using Image Embeddings
Duygu Dogan, Huang Xie, Toni Heittola, Tuomas Virtanen
arXiv_SD
arXiv_SD
Image_Caption
Embedding
Zero-Shot
Classification
Relation
PDF
2022-06-09
Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs
Jinguo Zhu, Xizhou Zhu, Wenhai Wang, Xiaohua Wang, Hongsheng Li, Xiaogang Wang, Jifeng Dai
arXiv_CV
arXiv_CV
Video_Caption
Zero-Shot
Sparse
Pose
Caption
Inference
PDF
2022-06-09
SAR Despeckling using a Denoising Diffusion Probabilistic Model
Malsha V. Perera, Nithin Gopalakrishnan Nair, Wele Gedara Chaminda Bandara, Vishal M. Patel
arXiv_CV
arXiv_CV
Image_Caption
Recognition
Pose
Quantitative
Detection
Denoising
Inference
PDF
2022-06-08
Words are all you need? Capturing human sensory similarity with textual descriptors
Raja Marjieh, Pol van Rijn, Ilia Sucholutsky, Theodore R. Sumers, Harin Lee, Thomas L. Griffiths, Nori Jacoby
arXiv_CL
arXiv_CL
Relation
Caption
Prediction
PDF
2022-06-07
Improving Image Captioning with Control Signal of Sentence Quality
Zhangzi Zhu, Hong Qu
arXiv_CV
arXiv_CV
Image_Caption
Pose
Caption
PDF
2022-06-07
Intra-agent speech permits zero-shot task acquisition
Chen Yan, Federico Carnevale, Petko Georgiev, Adam Santoro, Aurelia Guy, Alistair Muldal, Chia-Chun Hung, Josh Abramson, Timothy Lillicrap, Gregory Wayne
arXiv_AI
arXiv_AI
Image_Caption
3D
Zero-Shot
Speech
Pose