Campus Access Thesis
Master of Science (MS)
With the advent of high-throughput biological data over the past twenty years, the scientific community has devoted a significant amount of effort to devising new techniques to analyze and make sense of these data. This effort falls into two categories. The first deals with developing more accurate and efficient techniques for the acquisition, storage, and organization of the data. The second deals with advanced methods for mining the collected data and making valuable predictions. In this work we focus on methods suited to the analysis of one type of such biological data, namely gene expression data. Gene expression data obtained in typical biological experiments consist of expression values for thousands of genes measured across a much smaller number of samples. This feature immediately poses great difficulty for the statistical analysis of gene expression data. The main difficulty is overfitting of models, because the number of predictors (genes) is large compared to the number of samples. We therefore look at methods to overcome this difficulty, all of which involve some type of regularization of the model. We examine in detail how different regularization techniques work and introduce a class of regularization methods that shrink the parameter space to arbitrarily smaller sets. The ability to build sparse statistical models can potentially alleviate the overfitting problem, while also making the resulting model more interpretable and better able to make predictions on future input data. We start with a basic regression setting and add the nonsparse ridge penalty as our starting point to show how regularization can overcome overfitting. Sparse penalties are then introduced, starting with the lasso and followed by the elastic net penalty, the group lasso, and the sparse group lasso. Finally, we apply these techniques to a real gene expression dataset, showing how shrinkage of the parameter space can help prediction accuracy.
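As a rough illustration of the contrast between the nonsparse ridge penalty and the sparse lasso penalty described above, the following sketch fits both to a small synthetic problem with more predictors than samples. It is not taken from the thesis: the data are simulated rather than real gene expression values, the function names (`soft_threshold`, `elastic_net_cd`, `count_nonzero`) and the penalty weights are hypothetical, and coordinate descent is used only because it is a standard way to solve these problems. The assumed objective is (1/(2n))·||y − Xβ||² + lam1·||β||₁ + (lam2/2)·||β||², so `lam1 = 0` gives ridge and `lam2 = 0` gives the lasso.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def elastic_net_cd(X, y, lam1, lam2, n_iter=200):
    """Cyclic coordinate descent for the (assumed) elastic net objective
    (1/(2n))*||y - X b||^2 + lam1*||b||_1 + (lam2/2)*||b||_2^2.
    X is a list of n rows, each a list of p floats; y is a list of n floats.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, valid because beta starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # Covariance of column j with the partial residual
            # (the residual with coordinate j's contribution added back).
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n
            rho += col_sq[j] * beta[j]
            new = soft_threshold(rho, lam1) / (col_sq[j] + lam2)
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def count_nonzero(beta):
    """Number of coefficients that are not exactly zero."""
    return sum(1 for b in beta if b != 0.0)


# Synthetic stand-in for expression data: 20 samples, 50 "genes",
# of which only the first three influence the response (an assumption
# made purely for illustration).
rng = random.Random(0)
n_samples, n_genes = 20, 50
X = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(n_samples)]
y = [3.0 * row[0] - 2.0 * row[1] + 1.5 * row[2] + rng.gauss(0.0, 0.1)
     for row in X]

ridge = elastic_net_cd(X, y, lam1=0.0, lam2=0.5)  # ridge: shrinks, no zeros
lasso = elastic_net_cd(X, y, lam1=0.5, lam2=0.0)  # lasso: shrinks and selects

print("nonzero ridge coefficients:", count_nonzero(ridge))
print("nonzero lasso coefficients:", count_nonzero(lasso))
```

The ridge fit shrinks coefficients without setting them exactly to zero, whereas the soft-thresholding step in the lasso fit can zero out predictors entirely, which is the sense in which a sparse penalty restricts the model to a smaller set of parameters. Setting both `lam1` and `lam2` to positive values in the same function gives an elastic net fit under the same assumed objective.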
Borji, Mehdi, "Sparse Statistical Learning Techniques for Analysis of High-Dimentional Gene Expression Data" (2019). Graduate Masters Theses. 599.