Guess what I’m gonna say!
Problem: We explore the use of large language models (LLMs) for next-utterance prediction in human dialogue. Although recent advances have enabled LLMs to engage in natural conversations with users, we show that even leading models surprisingly struggle to predict a human speaker's next utterance. In contrast, humans readily anticipate forthcoming utterances from multimodal cues in the context, such as gestures, gaze, and emotional tone.
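To make the task concrete, the following minimal sketch shows one way next-utterance prediction could be posed to an off-the-shelf LLM: the dialogue history and textual descriptions of the observed non-verbal cues are packed into a single prompt. The `generate_fn` callable and the cue fields are illustrative assumptions, not the interface used in SayNext-Bench.

```python
from typing import Callable, Dict, List

def build_next_utterance_prompt(history: List[Dict[str, str]],
                                cues: Dict[str, str]) -> str:
    """Format dialogue turns plus textualized multimodal cues into a prompt
    asking the model to predict the next speaker's exact utterance."""
    turns = "\n".join(f'{t["speaker"]}: {t["text"]}' for t in history)
    cue_lines = "\n".join(f"- {name}: {desc}" for name, desc in cues.items())
    return (
        "Dialogue so far:\n" + turns + "\n\n"
        "Observed non-verbal cues of the next speaker:\n" + cue_lines + "\n\n"
        "Predict the next speaker's exact next utterance:"
    )

def predict_next_utterance(history: List[Dict[str, str]],
                           cues: Dict[str, str],
                           generate_fn: Callable[[str], str]) -> str:
    """`generate_fn` is any text-in/text-out LLM call (hypothetical here)."""
    return generate_fn(build_next_utterance_prompt(history, cues))
```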
Method: To systematically examine whether LLMs can reproduce this ability, we propose SayNext-Bench, a benchmark that evaluates LLMs and multimodal LLMs (MLLMs) on anticipating context-conditioned responses from multimodal cues across a variety of real-world scenarios. To support this benchmark, we build SayNext-PC, a novel large-scale dataset of dialogues with rich multimodal cues. Building on it, we further develop SayNext-Chat, a dual-route prediction MLLM with a cognitively inspired design that emulates predictive processing in conversation. Experimental results demonstrate that our model outperforms state-of-the-art MLLMs in terms of lexical overlap, semantic similarity, and emotion consistency. These results verify the feasibility of next-utterance prediction with LLMs from multimodal cues and underscore the indispensable role of non-verbal cues as a foundation of natural human interaction.
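As a rough sketch of the three reported evaluation axes, the code below measures lexical overlap as token-level F1, semantic similarity as cosine similarity of sentence embeddings, and emotion consistency as label agreement. The caller-supplied `embed_fn` and `emotion_fn` are assumptions standing in for whatever sentence encoder and emotion classifier the benchmark actually uses.

```python
import math
from collections import Counter
from typing import Callable, Sequence

def lexical_f1(pred: str, ref: str) -> float:
    """Token-level F1 between predicted and reference utterances
    (a simple stand-in for lexical-overlap metrics such as BLEU/ROUGE)."""
    p, r = pred.lower().split(), ref.lower().split()
    if not p or not r:
        return 0.0
    overlap = sum((Counter(p) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(p), overlap / len(r)
    return 2 * prec * rec / (prec + rec)

def semantic_similarity(pred: str, ref: str,
                        embed_fn: Callable[[str], Sequence[float]]) -> float:
    """Cosine similarity between sentence embeddings; `embed_fn` is any
    sentence encoder chosen by the evaluator (assumed, not prescribed)."""
    a, b = embed_fn(pred), embed_fn(ref)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def emotion_consistency(pred: str, ref: str,
                        emotion_fn: Callable[[str], str]) -> float:
    """1.0 if a caller-supplied emotion classifier assigns the same label
    to prediction and reference, else 0.0."""
    return float(emotion_fn(pred) == emotion_fn(ref))
```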
Contribution: We present SAYNEXT and position it as an initial step toward bridging the gap between fluent dialogue generation and genuine cognitive understanding in human–AI interaction. We believe this exploration not only opens a new direction toward more human-like, context-sensitive AI interaction but also offers a pathway to uncovering cognitive concepts from dialogue data for human-centered AI.