Improved Cross-view Completion Pre-training for Stereo Matching and Optical Flow

Bibliographic Details
Title: Improved Cross-view Completion Pre-training for Stereo Matching and Optical Flow
Authors: Weinzaepfel, Philippe; Lucas, Thomas; Leroy, Vincent; Cabon, Yohann; Arora, Vaibhav; Brégier, Romain; Csurka, Gabriela; Antsfeld, Leonid; Chidlovskii, Boris; Revaud, Jérôme
Publication Information: arXiv, 2022.
Publication Year: 2022
Subject Terms: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Description: Despite impressive performance for high-level downstream tasks, self-supervised pre-training methods have not yet fully delivered on dense geometric vision tasks such as stereo matching or optical flow. The application of self-supervised concepts, such as instance discrimination or masked image modeling, to geometric tasks is an active area of research. In this work, we build on the recent cross-view completion framework, a variation of masked image modeling that leverages a second view from the same scene, which makes it well suited for binocular downstream tasks. The applicability of this concept has so far been limited in at least two ways: (a) by the difficulty of collecting real-world image pairs -- in practice only synthetic data have been used -- and (b) by the lack of generalization of vanilla transformers to dense downstream tasks for which relative position is more meaningful than absolute position. We explore three avenues of improvement. First, we introduce a method to collect suitable real-world image pairs at large scale. Second, we experiment with relative positional embeddings and show that they enable vision transformers to perform substantially better. Third, we scale up vision transformer based cross-completion architectures, which is made possible by the use of large amounts of data. With these improvements, we show for the first time that state-of-the-art results on stereo matching and optical flow can be reached without using any classical task-specific techniques like correlation volume, iterative estimation, image warping or multi-scale reasoning, thus paving the way towards universal vision models.
DOI: 10.48550/arxiv.2211.10408
Open Access: https://explore.openaire.eu/search/publication?articleId=doi_dedup___::38f84fefddf0ec6f91e2c9408b17afafTest
Rights: OPEN
Accession Number: edsair.doi.dedup.....38f84fefddf0ec6f91e2c9408b17afaf
Database: OpenAIRE