Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation

Bibliographic Details
Title: Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation
Authors: Zhao, Yu; Wei, Jianguo; Lin, Zhichao; Sun, Yueheng; Zhang, Meishan; Zhang, Min
Publication Year: 2022
Collection: Computer Science
Subject Terms: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Description: Image-to-text tasks, such as open-ended image captioning and controllable image description, have received extensive attention for decades. Here, we further advance this line of work by presenting Visual Spatial Description (VSD), a new perspective for image-to-text toward spatial semantics. Given an image and two objects inside it, VSD aims to produce one description focusing on the spatial perspective between the two objects. Accordingly, we manually annotate a dataset to facilitate the investigation of the newly introduced task and build several benchmark encoder-decoder models by using VL-BART and VL-T5 as backbones. In addition, we investigate pipeline and joint end-to-end architectures for incorporating visual spatial relationship classification (VSRC) information into our model. Finally, we conduct experiments on our benchmark dataset to evaluate all our models. Results show that our models are impressive, providing accurate and human-like spatial-oriented text descriptions. Meanwhile, VSRC has great potential for VSD, and the joint end-to-end architecture is the better choice for their integration. We make the dataset and code public for research purposes.
Document Type: Working Paper
Open Access: http://arxiv.org/abs/2210.11109
Accession Number: edsarx.2210.11109
Database: arXiv