Author ORCID Identifier
0009-0009-1714-6405
Date of Award
Summer 8-31-2025
Document Type
Open Access Dissertation
Degree Name
Doctor of Philosophy (PhD)
Department
Computer Science
First Advisor
Marc Pomplun
Second Advisor
Xiaohui Liang
Abstract
Multimodal learning has emerged as a critical paradigm for developing intelligent systems that can understand and reason across diverse inputs such as images, text, and audio. Despite significant advances, deploying multimodal models effectively in practice remains challenging. This dissertation explores how multimodal learning can be applied to high-stakes, real-world scenarios, with a focus on enhancing feature representation and training efficiency. Specifically, this research investigates multimodal learning strategies in two key domains: healthcare and surveillance.
In the healthcare domain, we explore data fusion and alignment approaches for diagnosing cognitive decline. First, we propose LOVEMA, a multimodal model for diagnosing mild cognitive impairment (MCI). We develop MCI classification and regression models using audio, textual, intent, and multimodal fusion features. We find that the command-generation task outperforms the command-reading task, reaching an average classification accuracy of 82% when leveraging multimodal fusion features. In addition, generated commands correlate more strongly with the memory and attention subdomains than read commands. Second, we explore image-text alignment for dementia detection. We propose dementia detection models that take both the picture and the description texts as input and incorporate knowledge from large pre-trained image-text alignment models. We develop three advanced models that preprocess the samples based on their relevance to the picture, sub-images, and focused areas. The evaluation results show that our advanced models, equipped with knowledge of multimodal features, achieve state-of-the-art performance with a detection accuracy of 83.44%, surpassing the text-only baseline accuracy of 79.91%.
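To make the fusion idea concrete, the following is a minimal sketch of late multimodal fusion for MCI classification: each modality's pre-extracted features are projected into a shared space, concatenated, and passed to a classification head. The feature dimensions, layer sizes, and class names are illustrative assumptions, not the dissertation's actual LOVEMA configuration.

```python
# Minimal late-fusion sketch; dimensions are hypothetical, not LOVEMA's real setup.
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, audio_dim=128, text_dim=768, intent_dim=32, n_classes=2):
        super().__init__()
        # Project each modality into a shared space before concatenation.
        self.audio_proj = nn.Linear(audio_dim, 64)
        self.text_proj = nn.Linear(text_dim, 64)
        self.intent_proj = nn.Linear(intent_dim, 64)
        # Fused representation -> MCI vs. control logits.
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(64 * 3, n_classes))

    def forward(self, audio_feat, text_feat, intent_feat):
        fused = torch.cat(
            [self.audio_proj(audio_feat),
             self.text_proj(text_feat),
             self.intent_proj(intent_feat)], dim=-1)
        return self.head(fused)

# Example: one batch of pre-extracted features per modality.
model = FusionClassifier()
logits = model(torch.randn(4, 128), torch.randn(4, 768), torch.randn(4, 32))
```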
In the surveillance domain, we present a novel auto-distill pipeline that leverages open-vocabulary vision-language models to automatically generate annotations, followed by knowledge distillation to train lightweight detectors for edge deployment. We propose OpenYOLO, a two-stage framework that uses vision-language foundation models such as Grounding DINO and GPT-4 Vision for open-vocabulary annotation and distills their knowledge into lightweight YOLO detectors for real-time deployment. Finally, we evaluate the scene-understanding capabilities of multimodal foundation models such as GPT-4 Vision, Gemini Vision Pro, and Claude 3 using a standardized prompt-based protocol. Results show the potential of top-performing models for automated situational interpretation in surveillance contexts.
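The sketch below illustrates the two-stage auto-distill idea: an open-vocabulary vision-language model pseudo-labels raw frames, and a lightweight YOLO student is then trained on those labels. The annotate_frame() helper, the prompt list, and the file layout are hypothetical placeholders rather than OpenYOLO's actual implementation.

```python
# Hypothetical sketch of the auto-distill pipeline; not OpenYOLO's real code.
from pathlib import Path
from ultralytics import YOLO  # lightweight student detector

def annotate_frame(image_path, prompts):
    """Placeholder for the open-vocabulary teacher (e.g. Grounding DINO or
    GPT-4 Vision): returns [(class_id, x_center, y_center, w, h), ...]
    in normalized YOLO format."""
    raise NotImplementedError("plug in the teacher model here")

def build_pseudo_labels(frame_dir, label_dir, prompts):
    # Stage 1: write one YOLO-format label file per frame.
    label_dir.mkdir(parents=True, exist_ok=True)
    for img in Path(frame_dir).glob("*.jpg"):
        boxes = annotate_frame(img, prompts)
        lines = [f"{int(c)} {x:.6f} {y:.6f} {w:.6f} {h:.6f}"
                 for c, x, y, w, h in boxes]
        (label_dir / f"{img.stem}.txt").write_text("\n".join(lines))

# Stage 1: teacher annotation (prompts define the open vocabulary).
# build_pseudo_labels(Path("frames"), Path("labels"), ["person", "vehicle"])

# Stage 2: distill into a compact student for edge deployment.
# student = YOLO("yolov8n.pt")
# student.train(data="surveillance.yaml", epochs=50, imgsz=640)
```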
Together, the findings of this dissertation demonstrate that multimodal learning can significantly enhance task performance, reduce reliance on manual annotations, and enable zero-shot or few-shot generalization in real-world applications. By addressing the challenges of feature fusion, open-vocabulary adaptation, and model distillation, this work contributes practical solutions and novel insights for the development of scalable, adaptive, and intelligent systems in complex environments.
Recommended Citation
Lin, Nana, "Multimodal Learning in Real-world Application: Enhancing Feature Representation and Training Strategies" (2025). Graduate Doctoral Dissertations. 1094.
https://scholarworks.umb.edu/doctoral_dissertations/1094
Included in
Computer and Systems Architecture Commons, Digital Communications and Networking Commons, Other Computer Engineering Commons, Robotics Commons