SAYNEXT: A Benchmark and Cognitively Inspired Framework for Next-Utterance Prediction with Multimodal LLMs

Guess what I’m gonna say!


Figure 1: Illustration of next-utterance prediction. Given a video and the interviewer’s question turn, SayNext-Chat predicts the subsequent response using a dual-route framework. In evaluation, key factors predicted by our model (green) align with those in the ground truth (blue), whereas other MLLMs often yield irrelevant or incorrect factors, or fail to predict them entirely (red).

Abstract

Problem: We explore the use of large language models (LLMs) for next-utterance prediction in human dialogue. Although recent advances show that LLMs can engage in natural conversations with users, we find that even leading models surprisingly struggle to predict a human speaker’s next utterance. In contrast, humans readily anticipate forthcoming utterances based on multimodal cues from the context, such as gestures, gaze, and emotional tone.

Method: To systematically examine whether LLMs can reproduce this ability, we propose SayNext-Bench, a benchmark that evaluates LLMs and Multimodal LLMs (MLLMs) on anticipating context-conditioned responses from multimodal cues spanning a variety of real-world scenarios. To support this benchmark, we build SayNext-PC, a novel large-scale dataset containing dialogues with rich multimodal cues. Building on this, we further develop a dual-route prediction MLLM, SayNext-Chat, that incorporates a cognitively inspired design to emulate predictive processing in conversation. Experimental results demonstrate that our model outperforms state-of-the-art MLLMs in terms of lexical overlap, semantic similarity, and emotion consistency. Our results verify the feasibility of next-utterance prediction with LLMs from multimodal cues, and emphasize the indispensable role of non-verbal cues as the foundation of natural human interaction.
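For intuition, the sketch below shows one way the three reported dimensions (lexical overlap, semantic similarity, emotion consistency) could be scored for a predicted utterance against the ground truth. The specific libraries and model names (rouge_score, sentence-transformers with all-MiniLM-L6-v2, and an off-the-shelf emotion classifier) are illustrative assumptions, not the official SayNext-Bench scoring pipeline.

```python
# Minimal scoring sketch (assumed tools; not the official SayNext-Bench harness).
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
embedder = SentenceTransformer("all-MiniLM-L6-v2")
# Assumed off-the-shelf emotion classifier; any label-compatible model works.
emotion = pipeline("text-classification",
                   model="j-hartmann/emotion-english-distilroberta-base")

def score_prediction(predicted: str, reference: str) -> dict:
    # Lexical overlap: ROUGE-L F1 between prediction and ground truth.
    lexical = rouge.score(reference, predicted)["rougeL"].fmeasure
    # Semantic similarity: cosine similarity of sentence embeddings.
    semantic = util.cos_sim(
        embedder.encode(predicted, convert_to_tensor=True),
        embedder.encode(reference, convert_to_tensor=True)).item()
    # Emotion consistency: do predicted and reference utterances share a label?
    emotion_match = float(
        emotion(predicted)[0]["label"] == emotion(reference)[0]["label"])
    return {"lexical_overlap": lexical,
            "semantic_similarity": semantic,
            "emotion_consistency": emotion_match}

print(score_prediction("I was thrilled to get the offer.",
                       "Honestly, I was so excited when they called."))
```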

Contribution: We present SAYNEXT and position it as an initial step toward bridging the gap between fluent dialogue generation and genuine cognitive understanding in human–AI interaction. We believe this exploration not only opens a new direction toward more human-like, context-sensitive AI interaction but also offers a pathway to uncovering cognitive concepts from dialogue data for human-centered AI.

Dataset Viewer




Leaderboard



SayNext-Chat Performance



Experimental Findings

  • Clear improvements with vision modality: Incorporating visual cues consistently improves next-utterance prediction performance.
  • SayNext-Chat outperforms baseline MLLMs: Across all three evaluation dimensions, SayNext-Chat consistently surpasses zero-shot baselines, including frontier large-scale MLLMs, open-source models of comparable scale, and emotion-specific MLLMs.
  • Priming tokens significantly boost emotional alignment: While fine-tuning on domain-specific corpora increases both lexical overlap and semantic similarity, priming tokens further improve the emotion accuracy of predicted utterances by 3%.
  • Cross-scenario generalization and scalability: SayNext-Chat maintains superior performance over compared baselines when evaluated on larger-scale datasets and across different scenarios in the zero-shot setting.
  • Efficacy in human & LLM evaluations: SayNext-Chat achieves higher scores in subjective human assessments, slightly surpassing GPT-4o and showing a clear margin over open-source MLLMs with comparable parameter scales.

SayNext-Chat Framework


Overview of the SayNext-Chat framework.

Inspired by a cognitive neuroscience perspective, we propose SayNext-Chat, a dual-route prediction framework for anticipating forthcoming utterances. It combines learnable priming tokens, which represent high-level belief priors inferred from visual inputs, with low-level cues perceived directly from the multimodal inputs.


We design SayNext-Chat with two complementary predictive routes: a fast route that directly maps low-level visual and textual cues to a response, and a deep route that infers high-level priors (priming factors) to guide generation.
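As a rough illustration of this dual-route idea, the sketch below builds a prefix for a response generator from two sources: learnable priming tokens that attend to the visual stream (deep route) and a direct projection of pooled visual and textual cues (fast route). The module structure, dimensions, and fusion strategy are assumptions for illustration, not the released SayNext-Chat implementation.

```python
# Illustrative PyTorch sketch of a dual-route prefix builder (assumed design).
import torch
import torch.nn as nn

class DualRoutePredictor(nn.Module):
    def __init__(self, vis_dim=768, txt_dim=768, llm_dim=4096, n_priming=8):
        super().__init__()
        # Fast route: map pooled low-level visual/textual cues straight into
        # the LLM embedding space.
        self.fast_proj = nn.Linear(vis_dim + txt_dim, llm_dim)
        # Deep route: learnable priming tokens act as high-level belief priors;
        # cross-attention lets them read from the visual stream.
        self.priming_tokens = nn.Parameter(torch.randn(n_priming, llm_dim))
        self.prior_attn = nn.MultiheadAttention(llm_dim, num_heads=8,
                                                batch_first=True)
        self.vis_to_llm = nn.Linear(vis_dim, llm_dim)

    def forward(self, vis_feats, txt_feats):
        # vis_feats: (B, Tv, vis_dim), txt_feats: (B, Tt, txt_dim)
        fast = self.fast_proj(
            torch.cat([vis_feats.mean(1), txt_feats.mean(1)], dim=-1))
        fast = fast.unsqueeze(1)                                # (B, 1, llm_dim)

        vis_ctx = self.vis_to_llm(vis_feats)                    # (B, Tv, llm_dim)
        priming = self.priming_tokens.unsqueeze(0).expand(vis_feats.size(0), -1, -1)
        priors, _ = self.prior_attn(priming, vis_ctx, vis_ctx)  # (B, n_priming, llm_dim)

        # Concatenated prefix tokens handed to the language model for generation.
        return torch.cat([priors, fast], dim=1)

prefix = DualRoutePredictor()(torch.randn(2, 16, 768), torch.randn(2, 32, 768))
print(prefix.shape)  # torch.Size([2, 9, 4096])
```

In this sketch the two routes are fused simply by concatenating their outputs as prefix embeddings; the actual model may condition the language model on them differently.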

Case Study


Case study examples of next-utterance prediction.