Date of Award

12-2019

Document Type

Campus Access Thesis

Degree Name

Master of Science (MS)

Department

Physics, Applied

First Advisor

Rahul Kulkarni

Second Advisor

Niraj Kumar

Third Advisor

Jonathan Celli

Abstract

With the advent of high-throughput biological data in the past twenty years, the scientific community has put significant effort into devising new techniques to analyze and make sense of these data. This effort falls into two categories. The first deals with developing more accurate and efficient techniques for the acquisition, storage, and organization of the data. The second deals with advanced methods for digging into the collected data and making valuable predictions. In this work we focus on methods suited to the analysis of one type of such biological data, namely gene expression data. Gene expression data obtained in typical biological experiments contain expression values for thousands of genes across a much smaller number of samples. This feature immediately poses great difficulty for statistical analysis. The main difficulty is overfitting of models, caused by the large number of predictors (genes) relative to the number of samples. We therefore look at methods that overcome this difficulty, all of which involve some type of regularization of the model. We examine in detail how different regularization techniques work and introduce a class of regularization methods that shrink the parameter space to arbitrarily small sets. The ability to build sparse statistical models can potentially alleviate the overfitting problem while also making the resulting model more interpretable and better at predicting on future input data. We start with a basic regression setting and add the nonsparse ridge penalty to show how regularization can overcome overfitting. We then introduce sparse penalties, starting with the lasso and followed by the elastic net, group lasso, and sparse group lasso. Finally, we apply these techniques to a real gene expression dataset, showing how shrinkage of the parameter space can help prediction accuracy.
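As a rough illustration of the contrast the abstract draws between the nonsparse ridge penalty and the sparse lasso penalty, the following is a minimal sketch and not the thesis's own code. It assumes synthetic Gaussian data with many more predictors than samples in place of real gene expression values; the function names (`ridge_fit`, `lasso_fit`, `soft_threshold`), the penalty strengths, and the problem sizes are all chosen here for illustration only.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: solve (X^T X + lam * I) beta = X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


def soft_threshold(z, t):
    """Shrink z toward zero by t, setting it exactly to zero if |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_fit(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n) * ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.copy()  # equals y - X @ beta, since beta starts at zero
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            resid += X[:, j] * beta[j]  # take out coordinate j's contribution
            rho = X[:, j] @ resid / n
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
            resid -= X[:, j] * beta[j]  # put the updated contribution back
    return beta


# Synthetic stand-in for expression data (assumption): 40 samples, 200 "genes",
# of which only the first 5 actually influence the response.
rng = np.random.default_rng(0)
n, p, k = 40, 200, 5
X = rng.standard_normal((n, p))
true_beta = np.zeros(p)
true_beta[:k] = 3.0
y = X @ true_beta + 0.5 * rng.standard_normal(n)

beta_ridge = ridge_fit(X, y, lam=1.0)
beta_lasso = lasso_fit(X, y, lam=0.5)

n_nonzero_ridge = np.count_nonzero(beta_ridge)
n_nonzero_lasso = np.count_nonzero(beta_lasso)
print("nonzero ridge coefficients:", n_nonzero_ridge)
print("nonzero lasso coefficients:", n_nonzero_lasso)
```

In this sketch the ridge solution shrinks coefficients but keeps essentially all of them nonzero, while the soft-thresholding step in the lasso update sets many coefficients exactly to zero, which is the sparsity property the abstract refers to. The elastic net, group lasso, and sparse group lasso mentioned in the abstract use different penalties and are not shown here.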

Comments

Free and open access to this Campus Access Thesis is made available to the UMass Boston community by ScholarWorks at UMass Boston. Those not on campus and those without a UMass Boston campus username and password may gain access to this thesis through resources like ProQuest Dissertations & Theses Global or through Interlibrary Loan. If you have a UMass Boston campus username and password and would like to download this work from off-campus, click on the "Off-Campus UMass Boston Users" link above.
