Date of Award


Document Type

Campus Access Dissertation

Degree Name

Doctor of Philosophy (PhD)


Business Administration

First Advisor

Jeffrey Keisler

Second Advisor

Ehsan Elahi

Third Advisor

Josephine Namayanja


Accurately discovering knowledge from the huge volume of short messages generated daily on social media platforms is a critical challenge. Conventional topic models such as Latent Dirichlet Allocation (LDA) and its variants, which are widely used to automatically extract thematic information from regular-sized documents, fail to discover essential information from short texts. Compared to regular-sized documents such as news articles, short texts such as tweets lack word co-occurrence information, which leads to very sparse, high-dimensional vector representations. This extreme sparsity makes it difficult to apply conventional topic models to short social media texts. In this study, a novel heuristic topic model, the Hashtag-Cluster-based Aggregation model (HCA), is developed to address this sparsity problem. The model treats tweets as semi-structured texts and uses hashtag relations to aggregate related tweets into larger text documents for topic modeling. At the application level, the HCA model is used to study the topics of public conversations on Twitter about prescription opioids and of joint discussions about prescription opioids and marijuana. Monitoring the topics of these discussions at population scale, such as among Twitter users, can help policymakers and public health officials better understand public perception of prescription opioids, study the association between prescription opioids and marijuana, detect abuse behaviors, and track trends in related incidents. The proposed HCA model is evaluated using TF-IDF, GloVe, and FastText representations in comparison with the Hashtag-based Aggregation model (HA), the most common heuristic hashtag-based Twitter aggregation model in the literature. The evaluation shows the benefit of incorporating hashtag information into topic modeling: HCA generates more coherent topics than the HA model, which aggregates only tweets that share the same hashtag.
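As a rough illustration of the baseline named in the abstract, the following is a minimal sketch of hashtag-based aggregation (HA), in which tweets sharing a hashtag are pooled into one larger pseudo-document. It is not the dissertation's implementation: the function name `aggregate_by_hashtag`, the hashtag regular expression, the lowercasing of tags, and the sample tweets are all assumptions made for illustration. The HCA step of grouping related hashtags before pooling is not shown, because the abstract does not specify how those relations are computed.

```python
import re
from collections import defaultdict

# Hypothetical pattern: a hashtag is '#' followed by word characters.
HASHTAG_RE = re.compile(r"#(\w+)")


def aggregate_by_hashtag(tweets):
    """Pool tweets that share a hashtag into one pseudo-document per hashtag."""
    pools = defaultdict(list)
    for text in tweets:
        # A tweet with several hashtags joins several pools;
        # tweets without any hashtag are skipped in this sketch.
        for tag in {t.lower() for t in HASHTAG_RE.findall(text)}:
            pools[tag].append(text)
    # Concatenate each pool into a single, longer document.
    return {tag: " ".join(texts) for tag, texts in pools.items()}


# Made-up sample tweets, for illustration only.
tweets = [
    "Refill day again #opioids",
    "New state policy announced #opioids #marijuana",
    "Dispensary opens downtown #marijuana",
    "No hashtag in this one",
]
docs = aggregate_by_hashtag(tweets)
```

Each value in `docs` could then be passed to a topic model in place of the individual tweets, which is the sense in which aggregation counters the sparsity of short texts.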


Free and open access to this Campus Access Dissertation is made available to the UMass Boston community by ScholarWorks at UMass Boston. Those not on campus and those without a UMass Boston campus username and password may gain access to this dissertation through resources like ProQuest Dissertations & Theses Global or through Interlibrary Loan.