CHAMPAGNE: Learning Real-world Conversation from Large-Scale Web Videos

Bibliographic Details
Title: CHAMPAGNE: Learning Real-world Conversation from Large-Scale Web Videos
Authors: Han, Seungju; Hessel, Jack; Dziri, Nouha; Choi, Yejin; Yu, Youngjae
Publication Information: arXiv, 2023.
Publication Year: 2023
Subject Terms: FOS: Computer and information sciences, Computer Science - Computation and Language, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computation and Language (cs.CL)
Description: Visual information is central to conversation: body gestures and facial expressions, for example, contribute to meaning that transcends words alone. To date, however, most neural conversational models are limited to just text. We introduce CHAMPAGNE, a generative model of conversations that can account for visual contexts. To train CHAMPAGNE, we collect and release YTD-18M, a large-scale corpus of 18M video-based dialogues. YTD-18M is constructed from web videos: crucial to our data collection pipeline is a pretrained language model that converts error-prone automatic transcripts to a cleaner dialogue format while maintaining meaning. Human evaluation reveals that YTD-18M is more sensible and specific than prior resources (MMDialog, 1M dialogues), while maintaining visual-groundedness. Experiments demonstrate that 1) CHAMPAGNE learns to conduct conversation from YTD-18M; and 2) when fine-tuned, it achieves state-of-the-art results on four vision-language tasks focused on real-world conversations. We release data, models, and code at https://seungjuhan.me/champagne.
DOI: 10.48550/arxiv.2303.09713
Open Access: https://explore.openaire.eu/search/publication?articleId=doi_dedup___::d18567acaae77b0f059168ab5bd73f5a
Rights: OPEN
Accession Number: edsair.doi.dedup.....d18567acaae77b0f059168ab5bd73f5a
Database: OpenAIRE