Treffer: Aiding Complex Multimodal Reasoning with Contextual and Structural Information

Title:
Aiding Complex Multimodal Reasoning with Contextual and Structural Information
Publication Year:
2025
Collection:
Columbia University: Academic Commons
Document Type:
Dissertation thesis
Language:
English
DOI:
10.7916/2fqk-3t63
Accession Number:
edsbas.E074FA8F
Database:
BASE

Weitere Informationen

Multimodal reasoning involves integrating information from multiple data modalities —such as text, images, and videos — to perform a wide range of tasks, including Visual Question Answering, Visual Entailment, Captioning, Retrieval, and more. While current multimodal models have shown strong performance on these standard tasks, they often struggle with more complex scenarios that demand external knowledge, richer context, or the ability to process large and intricate data inputs. This thesis explores these challenging aspects of multimodal reasoning and proposes targeted solutions to help models overcome these limitations. We focus on two broad categories of complex tasks: those that benefit from external context and those that benefit from structured representations of data. External context refers to additional information necessary for solving a task, beyond what is explicitly available in the input — for instance, knowledge from external sources or temporal context derived from surrounding data. Structured representations, on the other hand, involve organizing raw inputs into meaningful forms, such as graphs or hierarchies, which help simplify reasoning over complex or large-scale data. To address these challenges, we propose a set of complementary solutions. For tasks requiringexternal knowledge, we introduce methods that integrate information from sources such as Google Search and Large Language Models. To address temporal context gaps, we develop techniques to extract relevant contextual cues from the source video surrounding the multimodal task instance. Additionally, to handle large or densely structured data, we propose methods that convert raw inputs into compact, structured representations — such as graphs or hierarchies — which make the data more accessible for models to interpret and reason over. Collectively, the solutions proposed in this thesis aim to enhance the reasoning capabilities of multimodal models, equipping them to handle a broader spectrum of real-world scenarios. These advancements ...