Title:
Unified Approaches for Multi-Task Vision-Language Interactions
Authors:
Publication Year:
2024
Collection:
Columbia University: Academic Commons
Document Type:
Dissertation/Thesis
Language:
English
DOI:
10.7916/n044-ra77
Accession Number:
edsbas.2B83F9CB
Database:
BASE

Further Information

Vision and language are two major modalities that humans rely on to perceive the environment and understand the world. Recent advances in Artificial Intelligence (AI) have enabled a variety of vision-language tasks derived from the diverse multimodal interactions of daily life, such as image captioning, image-text matching, visual question answering (VQA), and text-to-image generation. Despite their remarkable performance, most previous state-of-the-art models are specialized for a single vision-language task and lack generalizability across multiple tasks. Moreover, such specialized models complicate algorithm design and introduce redundancy at deployment when handling complex scenes.

In this study, we investigate unified approaches capable of solving various vision-language interactions in a multi-task manner. We argue that unified multi-task methods offer several potential advantages: (1) a single framework for multiple tasks reduces the human effort of designing a different model for each task; (2) reusing and sharing parameters across tasks improves efficiency; (3) some tasks are complementary to others, so multi-tasking can boost performance; and (4) they can handle complex tasks that require the joint collaboration of multiple basic tasks, enabling new applications.

In the first part of this thesis, we explore unified multi-task models with the goal of sharing and reusing as many parameters as possible across tasks. We start by unifying several vision-language question-answering tasks, such as visual entailment, outside-knowledge VQA, and visual commonsense reasoning, within a simple iterative divide-and-conquer framework.
Specifically, the framework iteratively decomposes the original text question into sub-questions, solves each sub-question, and derives the answer to the original question from the sub-answers, allowing it to uniformly handle reasoning of various types and semantic levels within one framework. In the next work, we take one step further ...
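The control flow of such an iterative divide-and-conquer loop can be illustrated with a minimal sketch. Note this is an assumption-laden toy: `decompose` and `solve_sub` stand in for the learned decomposition and answering modules in the thesis, replaced here by simple rule-based functions so the recursion itself is runnable.

```python
# Toy sketch of iterative divide-and-conquer question answering.
# `decompose` and `solve_sub` are hypothetical stand-ins for the learned
# modules; here they use trivial rules so the loop can be executed.

def decompose(question):
    """Split a compound question into sub-questions (toy rule: split on ' and ')."""
    parts = [p.strip().rstrip("?") + "?" for p in question.split(" and ")]
    return parts if len(parts) > 1 else []  # [] means the question is atomic

def solve_sub(question, knowledge):
    """Answer an atomic sub-question against a toy knowledge base."""
    return knowledge.get(question, "unknown")

def answer(question, knowledge):
    """Recursively decompose until questions are atomic, then aggregate sub-answers."""
    subs = decompose(question)
    if not subs:  # atomic question: solve directly
        return solve_sub(question, knowledge)
    # Derive the original answer from the sub-answers (toy aggregation: join).
    return " and ".join(answer(s, knowledge) for s in subs)

kb = {"What animal is shown?": "a dog", "What is it doing?": "running"}
print(answer("What animal is shown? and What is it doing?", kb))
# -> a dog and running
```

In the actual framework these components would be neural models conditioned on the image as well as the text, but the same decompose-solve-derive structure applies.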