Bridging Perception and Reasoning in Multimodal Models / Xingyu Fu
- Format:
- Book
- Thesis/Dissertation
- Author/Creator:
- Fu, Xingyu, author.
- Language:
- English
- Subjects (All):
- 0464.
- 0800.
- 0984.
- Local Subjects:
- 0464.
- 0800.
- 0984.
- Physical Description:
- 1 electronic resource (208 pages)
- Contained In:
- Dissertations Abstracts International 87-07B
- Place of Publication:
- Ann Arbor : ProQuest Dissertations and Theses, 2025
- Language Note:
- English
- Summary:
- Most human knowledge is acquired through rich, multimodal cognitive experiences, where both vision and language play essential roles. For a long time, computer vision (CV) has aimed to reconstruct the physical world from sensory data, while natural language processing (NLP) has sought to achieve human-level understanding and reasoning through communication. According to Plato, the highest knowledge is attained through dialectical reasoning about the world of Forms, facilitated by language. In contrast, Thomas Aquinas stated that "There is nothing in the mind that was not first in the senses". While these philosophical arguments have persisted through the ages, it is evident that humans are inherently multimodal beings, unable to rely solely on either language or vision for thought. In this thesis, we take the view that artificial intelligence (AI) is being developed in order to free people from repetitive real-world tasks that typically involve both language and vision. To accomplish that, we need to craft reliable multimodal intelligent systems that emulate human-level intelligence to bridge visual perceptual understanding and language-based reasoning, answer questions about them on demand, and develop a commonsense understanding of the world. We first describe our efforts to understand which multimodal abilities current models lack. We design high-quality benchmarks to uncover emergent limitations in multimodal models, focusing on visual challenges beyond language in multimodal Large Language Models (LLMs), and reasoning challenges beyond vision in text-to-image (T2I) generative models. Then, we show how to address these two challenges by combining language and vision. Specifically, we develop data-centric post-training methods that bridge perception and reasoning to address the identified challenges.
For multimodal LLMs, we explore thinking with images through coding, which also enables the generation of high-quality synthetic data for post-training. For T2I models, we apply adversarial synthetic data collection guided by LLMs to improve alignment in the post-training stage. Moreover, we provide an extended analysis demystifying the influence of multimodal supervision data across various tasks. Finally, we demonstrate a real-world application to AI-generated video fakeness detection, where combined perception and reasoning are needed.
- Notes:
- Advisors: Roth, Dan. Committee members: Callison-Burch, Chris; Yatskar, Mark; Daniilidis, Kostas; Vondrick, Carl; Zettlemoyer, Luke.
- Source: Dissertations Abstracts International, Volume: 87-07, Section: B.
- Ph.D. University of Pennsylvania 2025
- Vendor supplied data
- Local Notes:
- School code: 0175
- ISBN:
- 9798276006000
- Access Restriction:
- Restricted for use by site license