Date of Award

8-2024

Document Type

Campus Access Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Computer Science

First Advisor

Ping Chen

Second Advisor

Marc Pomplun

Third Advisor

Wei Ding, Dan Simovici

Abstract

Recent developments in machine learning, especially deep learning, have significantly advanced Natural Language Processing (NLP). To train machines to understand natural language, learning informative semantic representations is a fundamental and crucial step. Despite impressive progress in semantic representation learning, key challenges remain. In this dissertation, we address some of these challenges by proposing novel approaches to learning representations at the word and sentence levels, and by developing an automatic short-answer grading system that applies NLP techniques in the education field.

The first section of this dissertation explores semantic representation learning at the word level, where a word is treated as an indivisible atomic unit of information. Word embeddings, dense vector representations of words, are widely used in current machine learning for NLP. We propose a novel modular neuro-symbolic approach to learn richer semantic information, such as denotational information. It designs a small neural network for each word, treats this representation as a module, and uses the symbolic structure of the dependency parse tree to connect word modules into a neural network for the whole sentence, which is then trained on sentence-level NLP tasks. Experiments on a Linguistic Acceptability task test this approach's potential to learn informative semantic representations with far less training data and a much smaller model size.

The second section presents our work on representation learning at the sentence level. We develop a framework that applies contrastive learning to publicly available labeled Natural Language Inference (NLI) corpora to improve the learning of sentence representations. The framework is model-agnostic and can be applied on top of any existing encoder.
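To make the contrastive setup concrete, the sketch below shows an InfoNCE-style supervised contrastive loss of the kind commonly used with NLI data, where each premise embedding is pulled toward its entailment hypothesis and pushed away from contradiction hypotheses and other in-batch candidates. This is an illustrative assumption, not the dissertation's actual implementation; the arrays stand in for encoder outputs.

```python
import numpy as np

def info_nce_loss(anchors, positives, negatives, temperature=0.05):
    """InfoNCE-style supervised contrastive loss over a batch.

    anchors, positives, negatives: (batch, dim) arrays of sentence
    embeddings (e.g. premise, entailment hypothesis, contradiction
    hypothesis). Each anchor's own positive is the correct "class";
    all other positives and negatives act as in-batch negatives.
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    a = normalize(anchors)
    candidates = np.concatenate([normalize(positives),
                                 normalize(negatives)], axis=0)
    sims = a @ candidates.T / temperature          # (batch, 2 * batch)
    # Log-softmax cross-entropy with the i-th positive as the target.
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    batch = anchors.shape[0]
    return -log_probs[np.arange(batch), np.arange(batch)].mean()

# Toy check: anchors aligned with their positives give a lower loss
# than anchors paired with random embeddings.
rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 8))
aligned = info_nce_loss(anchors,
                        anchors + 0.01 * rng.normal(size=(4, 8)),
                        rng.normal(size=(4, 8)))
random_ = info_nce_loss(anchors,
                        rng.normal(size=(4, 8)),
                        rng.normal(size=(4, 8)))
```

The temperature of 0.05 is a typical choice for cosine-similarity contrastive objectives; the loss is minimized when each anchor is most similar to its own positive.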
Using BERT as the encoder, experiments on a series of Text Similarity tasks demonstrate that this simple approach is effective. In the last section, we present SteLLA, a structured automatic grading system that uses Large Language Models (LLMs) with Retrieval-Augmented Generation. The system also applies other NLP techniques, such as question generation and question answering, to provide structured grades and feedback. We evaluate it on a real-world dataset from college-level Biology course exams and show that our grading system achieves substantial agreement with human graders. A systematic analysis of the LLM outputs provides practical insights into applying LLMs to the grading task.
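The structured-grading idea can be sketched as follows: decompose a reference answer into key points, check each against the student's response, and aggregate per-point verdicts into a grade. In this minimal sketch a lexical-overlap scorer stands in for the LLM question-answering step, and the key points, threshold, and function names are all illustrative assumptions, not SteLLA's actual components.

```python
def _words(text: str) -> list:
    """Lowercase tokens with surrounding punctuation stripped."""
    return [w.strip(".,!?;:").lower() for w in text.split()]

def overlap_score(key_point: str, response: str) -> float:
    """Toy stand-in for an LLM answer-checking step: the fraction of
    the key point's content words (length > 3) found in the response."""
    content = [w for w in _words(key_point) if len(w) > 3]
    resp = set(_words(response))
    return sum(w in resp for w in content) / len(content) if content else 0.0

def grade(key_points: list, student_answer: str, threshold: float = 0.5) -> dict:
    """Produce a structured grade: one verdict per key point plus a total."""
    per_point = []
    for kp in key_points:
        score = overlap_score(kp, student_answer)
        per_point.append({"key_point": kp, "score": score,
                          "credited": score >= threshold})
    total = sum(p["credited"] for p in per_point) / len(per_point)
    return {"per_point": per_point, "total": total}

# Hypothetical Biology key points and a partial student answer.
key_points = [
    "mitochondria produce ATP through cellular respiration",
    "the inner membrane folds into cristae to increase surface area",
]
result = grade(key_points, "Mitochondria make ATP via cellular respiration.")
print(result["total"])  # first key point credited, second missed -> 0.5
```

The per-point structure is what makes the grade interpretable: each credited or missed key point doubles as targeted feedback, which is the motivation for structured grading over a single holistic score.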

Comments

Free and open access to this Campus Access Thesis is made available to the UMass Boston community by ScholarWorks at UMass Boston. Those not on campus and those without a UMass Boston campus username and password may gain access to this thesis through resources like Proquest Dissertations & Theses Global (https://www.proquest.com/) or through Interlibrary Loan. If you have a UMass Boston campus username and password and would like to download this work from off-campus, click on the "Off-Campus UMass Boston Users
