Author ORCID Identifier
0009-0009-1714-6405
Date of Award
Summer 8-31-2025
Document Type
Open Access Dissertation
Degree Name
Doctor of Philosophy (PhD)
Department
Computer Science
First Advisor
Marc Pomplun
Second Advisor
Xiaohui Liang
Abstract
Multimodal learning has emerged as a critical paradigm for developing intelligent systems that can understand and reason across diverse inputs such as images, text, and audio. Despite significant advances, deploying multimodal models effectively in practice remains challenging. This dissertation explores how multimodal learning can be applied to high-stakes, real-world scenarios, with a focus on enhancing feature representation and training efficiency. Specifically, this research investigates multimodal learning strategies in two key domains: healthcare and surveillance.
In the healthcare domain, we explore data fusion and alignment approaches for diagnosing cognitive decline. First, we propose LOVEMA, a multimodal model for diagnosing mild cognitive impairment (MCI). We develop MCI classification and regression models using audio, textual, intent, and multimodal fusion features. We find that the command-generation task outperforms the command-reading task, reaching an average classification accuracy of 82% when leveraging multimodal fusion features. In addition, generated commands correlate more strongly with the memory and attention subdomains than read commands. Second, we explore image-text alignment for dementia detection. We propose dementia detection models that take both the picture and the description texts as input and incorporate knowledge from large pre-trained image-text alignment models. We develop three advanced models that preprocess the samples based on their relevance to the picture, sub-images, and focused areas. The evaluation results show that our advanced models, equipped with knowledge of multimodal features, achieve state-of-the-art performance with a detection accuracy of 83.44%, surpassing the text-only baseline accuracy of 79.91%.
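To make the fusion idea concrete, the following is a minimal sketch of late multimodal fusion for MCI classification: each modality's pre-extracted features are projected into a shared space, concatenated, and passed to a classification head. The feature dimensions, layer sizes, and class names are illustrative assumptions, not the dissertation's actual LOVEMA configuration.

```python
# Minimal late-fusion sketch; dimensions are hypothetical, not LOVEMA's real setup.
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, audio_dim=128, text_dim=768, intent_dim=32, n_classes=2):
        super().__init__()
        # Project each modality into a shared space before concatenation.
        self.audio_proj = nn.Linear(audio_dim, 64)
        self.text_proj = nn.Linear(text_dim, 64)
        self.intent_proj = nn.Linear(intent_dim, 64)
        # Fused representation -> MCI vs. control logits.
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(64 * 3, n_classes))

    def forward(self, audio_feat, text_feat, intent_feat):
        fused = torch.cat(
            [self.audio_proj(audio_feat),
             self.text_proj(text_feat),
             self.intent_proj(intent_feat)], dim=-1)
        return self.head(fused)

# Example: one batch of pre-extracted features per modality.
model = FusionClassifier()
logits = model(torch.randn(4, 128), torch.randn(4, 768), torch.randn(4, 32))
```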
In the surveillance domain, we present a novel auto-distill pipeline that leverages open-vocabulary vision-language models to automatically generate annotations, followed by knowledge distillation to train lightweight detectors for edge deployment. We propose OpenYOLO, a two-stage framework that uses vision-language foundation models such as Grounding DINO and GPT-4 Vision for open-vocabulary annotation and distills their knowledge into lightweight YOLO detectors for real-time deployment. Finally, we evaluate the scene-understanding capabilities of multimodal foundation models such as GPT-4 Vision, Gemini Vision Pro, and Claude 3 using a standardized prompt-based protocol. Results show the potential of top-performing models for automated situational interpretation in surveillance contexts.
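The sketch below illustrates the two-stage auto-distill idea: an open-vocabulary vision-language model pseudo-labels raw frames, and a lightweight YOLO student is then trained on those labels. The annotate_frame() helper, the prompt list, and the file layout are hypothetical placeholders rather than OpenYOLO's actual implementation.

```python
# Hypothetical sketch of the auto-distill pipeline; not OpenYOLO's real code.
from pathlib import Path
from ultralytics import YOLO  # lightweight student detector

def annotate_frame(image_path, prompts):
    """Placeholder for the open-vocabulary teacher (e.g. Grounding DINO or
    GPT-4 Vision): returns [(class_id, x_center, y_center, w, h), ...]
    in normalized YOLO format."""
    raise NotImplementedError("plug in the teacher model here")

def build_pseudo_labels(frame_dir, label_dir, prompts):
    # Stage 1: write one YOLO-format label file per frame.
    label_dir.mkdir(parents=True, exist_ok=True)
    for img in Path(frame_dir).glob("*.jpg"):
        boxes = annotate_frame(img, prompts)
        lines = [f"{int(c)} {x:.6f} {y:.6f} {w:.6f} {h:.6f}"
                 for c, x, y, w, h in boxes]
        (label_dir / f"{img.stem}.txt").write_text("\n".join(lines))

# Stage 1: teacher annotation (prompts define the open vocabulary).
# build_pseudo_labels(Path("frames"), Path("labels"), ["person", "vehicle"])

# Stage 2: distill into a compact student for edge deployment.
# student = YOLO("yolov8n.pt")
# student.train(data="surveillance.yaml", epochs=50, imgsz=640)
```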
Together, the findings of this dissertation demonstrate that multimodal learning can significantly enhance task performance, reduce reliance on manual annotations, and enable zero-shot or few-shot generalization in real-world applications. By addressing the challenges of feature fusion, open-vocabulary adaptation, and model distillation, this work contributes practical solutions and novel insights for the development of scalable, adaptive, and intelligent systems in complex environments.
Recommended Citation
Lin, Nana, "Multimodal Learning in Real-world Application: Enhancing Feature Representation and Training Strategies" (2025). Graduate Doctoral Dissertations. 1094.
https://scholarworks.umb.edu/doctoral_dissertations/1094
Included in
Computer and Systems Architecture Commons, Digital Communications and Networking Commons, Other Computer Engineering Commons, Robotics Commons