Date of Award
5-2021
Document Type
Campus Access Dissertation
Degree Name
Doctor of Philosophy (PhD)
Department
Computer Science
First Advisor
Wei Ding
Second Advisor
Ping Chen
Third Advisor
Scott E. Crouter
Abstract
Advances in artificial intelligence and machine learning have begun a revolution in the understanding and analysis of data across nearly every industry. AI and ML methods (particularly deep neural models) have been successfully scaled to fit the massive datasets available today, especially in image- and text-based tasks. However, in many settings, the application of these advanced methods is held back by underlying data issues that hamstring the models’ generalization performance.
This work considers two such challenges. The first is data-dependent uncertainty in ground-truth labels. This uncertainty can arise from ambiguity in the labeling process (e.g., different annotators may disagree on whether a certain song should be labeled ‘folk’ or ‘country’) or from low data quality that induces annotation mistakes. A new neighborhood-based scoring system is introduced to identify the data examples that may have suspect labels. A means for translating those uncertainty scores to sample weights is then provided so that the influence of label mistakes on a model’s decision boundary can be reduced.
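As a rough illustration of the general idea of neighborhood-based label scoring, the sketch below scores each example by the fraction of its k nearest neighbors that carry a different label, then maps scores to sample weights. This is not the dissertation's scoring system, which the abstract does not specify; the function names, the Euclidean metric, the choice of k, and the linear score-to-weight mapping with a floor are all assumptions made for this example.

```python
import numpy as np

def neighbor_disagreement_scores(X, y, k=3):
    """Hypothetical score: fraction of each sample's k nearest
    neighbors (Euclidean, self excluded) with a different label."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    diffs = X[:, None, :] - X[None, :, :]
    d2 = (diffs ** 2).sum(axis=-1)       # pairwise squared distances
    np.fill_diagonal(d2, np.inf)         # a point is not its own neighbor
    nn = np.argsort(d2, axis=1)[:, :k]   # indices of the k nearest neighbors
    return (y[nn] != y[:, None]).mean(axis=1)

def scores_to_weights(scores, floor=0.1):
    """Assumed mapping: higher uncertainty gives lower weight,
    never dropping below `floor`."""
    return np.maximum(1.0 - np.asarray(scores), floor)

# Toy data: two well-separated clusters; the point at index 3 sits in
# the first cluster but carries the second cluster's label.
X = [[0, 0], [0, 1], [1, 0], [1, 1],
     [10, 10], [10, 11], [11, 10], [11, 11]]
y = [0, 0, 0, 1, 1, 1, 1, 1]

scores = neighbor_disagreement_scores(X, y, k=3)
weights = scores_to_weights(scores)
```

Weights of this kind could be passed to any training routine that accepts per-sample weights, so that suspect examples pull less on the decision boundary.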
The second challenge is uneven generalization performance across individuals, which leads to unfair deployed models. When a deep neural network is deployed to make predictions on data from unseen users, some of those new users could experience poor performance if their data is not typical of that in the training set. To improve model fairness over individuals in a deep learning setting, we use mode connectivity, a technique from the study of neural network loss landscapes, to explore the region around a trained network in parameter space. This exploration identifies a feasible set of weight configurations with similar overall performance but different distributions of performance over individuals. Multi-objective optimization over that feasible set can then be used to select the best model by observed fairness, a process we call Fairness Maximization via Mode Connectivity (FMMC).
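The selection step can be pictured with a minimal sketch, given here under stated assumptions rather than as the FMMC procedure itself. Mode connectivity work commonly parameterizes a low-loss path between two trained weight vectors as a quadratic Bezier curve; the sketch assumes that form, and it replaces the multi-objective optimization with a simple rule: keep candidates whose mean accuracy is within a tolerance of the best, then pick the one with the highest worst-individual accuracy. The function names, the tolerance, and the accuracy table are hypothetical.

```python
import numpy as np

def bezier_point(theta_a, theta_c, theta_b, t):
    """Point at position t in [0, 1] on a quadratic Bezier curve in
    parameter space with endpoints theta_a, theta_b and bend theta_c."""
    return ((1 - t) ** 2 * theta_a
            + 2 * t * (1 - t) * theta_c
            + t ** 2 * theta_b)

def select_fairest(acc, tol=0.02):
    """acc[i, j] is the accuracy of candidate i on individual j.
    Among candidates whose mean accuracy is within `tol` of the best
    mean, return the index with the highest worst-individual accuracy."""
    acc = np.asarray(acc, dtype=float)
    mean_acc = acc.mean(axis=1)
    feasible = np.flatnonzero(mean_acc >= mean_acc.max() - tol)
    worst_case = acc[feasible].min(axis=1)
    return int(feasible[np.argmax(worst_case)])

# Toy flattened weight vectors (stand-ins for real network parameters).
theta_a = np.array([0.0, 0.0])
theta_b = np.array([2.0, 0.0])
theta_c = np.array([1.0, 1.0])
candidates = [bezier_point(theta_a, theta_c, theta_b, t)
              for t in np.linspace(0.0, 1.0, 5)]

# Made-up per-individual accuracies for three candidates and two users;
# in practice these would come from evaluating each candidate on
# held-out data from each individual.
acc_table = [[0.95, 0.60],
             [0.80, 0.76],
             [0.70, 0.70]]
best = select_fairest(acc_table, tol=0.02)
```

In the toy table, the first two candidates have nearly equal mean accuracy, but the second serves its worst-off user far better, so it is the one selected.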
These methods have been validated in real-world settings on time-distributed data, including two human activity recognition datasets and a music genre classification task. Our fairness approach is further validated on a Tamil handwriting classification dataset. Each method is shown to surpass the performance of current baseline approaches.
Recommended Citation
Almeida, Matthew, "Effects of Real-World Data Challenges on Generalization in Applied Machine Learning and Time Series Modeling" (2021). Graduate Doctoral Dissertations. 660.
https://scholarworks.umb.edu/doctoral_dissertations/660
Comments
Free and open access to this Campus Access Dissertation is made available to the UMass Boston community by ScholarWorks at UMass Boston. Those not on campus and those without a UMass Boston campus username and password may gain access to this dissertation through resources like ProQuest Dissertations & Theses Global or through Interlibrary Loan. If you have a UMass Boston campus username and password and would like to download this work from off-campus, click on the "Off-Campus UMass Boston Users" link above.