Date of Award

5-31-2026

Document Type

Campus Access Dissertation

Degree Name

Master of Science (MS)

Department

Computer Science

First Advisor

Beatrice Perez

Abstract

Over the past few years, large language models (LLMs) have made it possible to generate dialogue that is remarkably fluent and often convincingly human in short interactions (Brown et al., 2020; Devlin et al., 2019). Yet it remains difficult to determine when such dialogue actually feels realistic to human users, especially in multi-turn settings where identity, context, and emotional tone must be sustained over time. This thesis introduces a multi-dimensional evaluation framework that breaks conversational realism into five dimensions: coherence, empathy, naturalness, consistency, and contextual appropriateness. To instantiate this framework, the thesis constructs a dataset of multi-turn dialogue segments drawn from movies, television shows, and video games, and obtains fine-grained human ratings along each of these dimensions. These annotations support both supervised regression experiments and unsupervised clustering analyses that probe how different types of conversational behavior relate to perceived realism. In parallel, the work explores a set of computational indicators as interpretable proxies for aspects of the realism dimensions: perplexity and repetition patterns for naturalness, sentiment-oriented measures for empathy, semantic similarity for coherence, contradiction signals for consistency, and context-response similarity for contextual appropriateness. The work is motivated by a longer-term goal of building AI patient simulation agents for medical training, where realistic conversational behavior is essential to educational value rather than a cosmetic feature. In representative experiments on a held-out test set, models trained on the annotated dataset achieve Pearson correlations of approximately 0.77 with human judgments, Spearman correlations of approximately 0.57, and mean absolute errors around 0.46 across dimensions, indicating that the framework captures learnable structure in realism assessments.
Overall, the proposed framework provides a more structured basis for evaluating dialogue systems and lays groundwork for future simulation agents that can sustain more believable, contextually appropriate interactions.
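
For readers unfamiliar with the three agreement measures reported above, the following is a minimal sketch of how Pearson correlation, Spearman correlation, and mean absolute error can be computed between model predictions and human ratings. It is not the thesis's evaluation code: the function names and the toy ratings are assumptions made purely for illustration, and the toy numbers are not results from the thesis.

```python
import math


def pearson(x, y):
    """Pearson correlation: linear agreement between two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def mean_absolute_error(x, y):
    """Average absolute gap between paired scores."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Hypothetical ratings for one realism dimension (made-up values).
human_ratings = [4.0, 2.5, 3.0, 5.0, 1.5]
model_scores = [3.6, 2.9, 3.1, 4.4, 2.2]

print("Pearson: ", round(pearson(human_ratings, model_scores), 3))
print("Spearman:", round(spearman(human_ratings, model_scores), 3))
print("MAE:     ", round(mean_absolute_error(human_ratings, model_scores), 3))
```

Pearson correlation rewards predictions that track human ratings linearly, Spearman correlation depends only on whether the ordering of segments is preserved, and mean absolute error measures how far predictions fall from the ratings on the rating scale itself, which is why the three figures can differ for the same model.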

Comments

Free and open access to this work is made available to the UMass Boston community by ScholarWorks at UMass Boston. Those not on campus and those without a UMass Boston campus username and password may gain access to this work through Interlibrary Loan. If you have a UMass Boston campus username and password and would like to download this work from off-campus, click on the “Off-Campus Users” button.
