Date of Award
Campus Access Dissertation
Doctor of Philosophy (PhD)
Scott E. Crouter
Advances in artificial intelligence and machine learning have begun a revolution in the understanding and analysis of data across nearly every industry. AI and ML methods (particularly deep neural models) have been successfully scaled to fit the massive datasets available today, especially in image- and text-based tasks. However, in many settings, the application of these advanced methods is held back by underlying data issues that hamstring the models’ generalization performance.
This work considers two such challenges. The first is data-dependent uncertainty in ground-truth labels. This uncertainty can arise from ambiguity in the labeling process (e.g., different annotators may disagree on whether a given song should be labeled ‘folk’ or ‘country’) or from low data quality that induces annotation mistakes. We introduce a new neighborhood-based scoring system to identify data examples whose labels may be suspect, and then provide a means of translating those uncertainty scores into sample weights so that the influence of label mistakes on a model’s decision boundary can be reduced.
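As a rough illustration of the idea of neighborhood-based scoring followed by sample weighting, the sketch below may help. It is not the scoring system developed in the dissertation, whose details the abstract does not give. It assumes a simple stand-in: each example's score is the fraction of its k nearest neighbors (Euclidean distance in feature space) that carry a different label, and the weight is one minus that score, clipped at a floor. The function names, `k`, and `floor` are all hypothetical.

```python
import numpy as np

def neighborhood_label_scores(X, y, k=5):
    """Assumed stand-in score: fraction of each example's k nearest
    neighbors whose label differs from the example's own label."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise squared Euclidean distances, shape (n, n).
    diffs = X[:, None, :] - X[None, :, :]
    d2 = np.einsum("ijk,ijk->ij", diffs, diffs)
    np.fill_diagonal(d2, np.inf)  # an example is not its own neighbor
    nn_idx = np.argsort(d2, axis=1)[:, :k]  # indices of the k nearest
    disagree = y[nn_idx] != y[:, None]
    return disagree.mean(axis=1)

def scores_to_weights(scores, floor=0.1):
    """Assumed mapping: higher suspicion gives lower weight, never below `floor`."""
    return np.maximum(1.0 - np.asarray(scores, dtype=float), floor)

# Toy data: two well-separated clusters, with one point in the first
# cluster (index 3) carrying the second cluster's label.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]])
y = np.array([0, 0, 0, 1, 1, 1, 1, 1])
scores = neighborhood_label_scores(X, y, k=3)
weights = scores_to_weights(scores)
```

Weights of this kind could then be passed to any training routine that accepts per-sample weights, so that suspect examples contribute less to the loss.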
The second challenge is uneven generalization performance across individuals, which leads to unfair deployed models. When a deep neural network is deployed to make predictions on data from unseen users, some of those users may experience poor performance if their data is not typical of the training set. To improve model fairness over individuals in a deep learning setting, we use mode connectivity, a technique from the study of neural network loss landscapes, to explore the region of parameter space around a trained network. This exploration identifies a feasible set of weight configurations with similar overall performance but different distributions of performance over individuals. Multi-objective optimization over that feasible set can then be used to select the best model by observed fairness, a process we call Fairness Maximization via Mode Connectivity (FMMC).
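The two ingredients described above, a family of candidate weight configurations and a fairness-aware selection among them, can be sketched as follows. This is an illustration under stated assumptions, not the FMMC procedure itself. It assumes candidates are sampled from a quadratic Bezier curve between two weight vectors (a parameterization commonly used in the mode connectivity literature), that "similar overall performance" means mean accuracy within a tolerance of the best candidate, and that fairness is measured by worst-individual accuracy. The dissertation's multi-objective optimization may differ on every one of these points, and all names below are hypothetical.

```python
import numpy as np

def bezier_point(theta_a, theta_b, theta_bend, t):
    """Point at position t in [0, 1] on a quadratic Bezier curve that runs
    from theta_a to theta_b and is shaped by theta_bend."""
    return ((1 - t) ** 2 * theta_a
            + 2 * t * (1 - t) * theta_bend
            + t ** 2 * theta_b)

def select_by_fairness(per_individual_acc, tol=0.01):
    """Pick a candidate index from an (n_candidates, n_individuals) accuracy table.

    Feasible set: candidates whose mean accuracy is within `tol` of the best
    mean (assumes each individual contributes equally to the mean).
    Selection: the feasible candidate with the highest worst-individual accuracy.
    """
    acc = np.asarray(per_individual_acc, dtype=float)
    overall = acc.mean(axis=1)
    feasible = np.flatnonzero(overall >= overall.max() - tol)
    worst = acc[feasible].min(axis=1)
    return int(feasible[np.argmax(worst)])

# Candidate weight vectors along a curve between two toy "trained" endpoints.
theta_a, theta_b = np.zeros(3), np.full(3, 2.0)
theta_bend = np.full(3, 4.0)
candidates = [bezier_point(theta_a, theta_b, theta_bend, t)
              for t in np.linspace(0.0, 1.0, 5)]

# Hypothetical per-individual accuracies for three candidates (rows)
# evaluated on three individuals (columns).
acc_table = [[0.90, 0.90, 0.60],
             [0.80, 0.80, 0.795],
             [0.50, 0.50, 0.50]]
chosen = select_by_fairness(acc_table, tol=0.01)
```

In practice each candidate weight vector would be loaded into the network and evaluated per individual on held-out data to fill the accuracy table. The table here is made up solely to exercise the selection step.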
These methods have been validated in real-world settings on time-distributed data, including two human activity recognition datasets and a music genre classification task. Our fairness approach is further validated on a Tamil handwriting classification dataset. Each method is shown to outperform current baseline approaches.
Almeida, Matthew, "Effects of Real-World Data Challenges on Generalization in Applied Machine Learning and Time Series Modeling" (2021). Graduate Doctoral Dissertations. 660.