SayNext-Bench: Why Do LLMs Struggle with Next-Utterance Prediction?

The First Benchmark for Human Next-Utterance Prediction with MLLMs


Figure 1: Illustration of next-utterance prediction in SayNext-Bench. Given a question utterance and the corresponding human reaction video, the task requires MLLMs to predict the human's subsequent response. Predicted responses from SayNext-Chat (green) are compared with ground-truth utterances (blue) and responses from other MLLMs (red); key factors are extracted for interpretability.

Abstract

Problem: We explore the use of large language models (LLMs) for next-utterance prediction in human dialogue. Although recent advances have demonstrated that LLMs can engage in natural conversations with users, we show that even leading models surprisingly struggle to predict a human speaker's next utterance. In contrast, humans readily anticipate forthcoming utterances from multimodal cues in the context, such as gestures, gaze, and emotional tone.

Method: To systematically examine whether LLMs can reproduce this ability, we propose SayNext-Bench, a benchmark that evaluates LLMs and Multimodal LLMs (MLLMs) on anticipating context-conditioned responses from multimodal cues across a variety of real-world scenarios. To support this benchmark, we build SayNext-PC, a novel large-scale dataset of dialogues with rich multimodal cues. Building on this, we further develop a dual-route prediction MLLM, SayNext-Chat, which incorporates a cognitively inspired design to emulate predictive processing in conversation. Experimental results demonstrate that our model outperforms state-of-the-art MLLMs in terms of lexical overlap, semantic similarity, and emotion consistency.
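As a concrete illustration of these three evaluation dimensions, the minimal sketch below scores a predicted utterance against the ground-truth one. The specific metric choices (ROUGE-L for lexical overlap, sentence-embedding cosine similarity for semantic similarity, and label agreement of an off-the-shelf emotion classifier for emotion consistency) are illustrative assumptions, not necessarily the exact metrics used in SayNext-Bench.

```python
# Minimal sketch of the three evaluation dimensions (assumed metric choices,
# not necessarily those used by SayNext-Bench).
from rouge_score import rouge_scorer                           # lexical overlap
from sentence_transformers import SentenceTransformer, util    # semantic similarity
from transformers import pipeline                              # emotion consistency

def score_prediction(predicted: str, ground_truth: str) -> dict:
    # 1) Lexical overlap: ROUGE-L F1 between prediction and reference.
    rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    lexical = rouge.score(ground_truth, predicted)["rougeL"].fmeasure

    # 2) Semantic similarity: cosine similarity of sentence embeddings.
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    emb = embedder.encode([predicted, ground_truth], convert_to_tensor=True)
    semantic = util.cos_sim(emb[0], emb[1]).item()

    # 3) Emotion consistency: do the labels of an off-the-shelf
    #    emotion classifier agree between prediction and reference?
    emotion_clf = pipeline("text-classification",
                           model="j-hartmann/emotion-english-distilroberta-base")
    same_emotion = (emotion_clf(predicted)[0]["label"]
                    == emotion_clf(ground_truth)[0]["label"])

    return {"lexical_overlap": lexical,
            "semantic_similarity": semantic,
            "emotion_consistent": same_emotion}

print(score_prediction("I'd love that, thank you!", "Sure, I'd be happy to."))
```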

Contribution: Our results demonstrate the feasibility of next-utterance prediction with LLMs from multimodal cues and emphasize (i) the indispensable role of multimodal cues and (ii) active predictive processing, a foundation of natural human interaction that current MLLMs lack. We hope this exploration offers a new research entry point toward more human-like, context-sensitive interaction for human-centered AI.

Dataset Viewer
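As a rough illustration of what a single SayNext-PC example pairs together (a question utterance, the listener's reaction video, the ground-truth next utterance, and key factors for interpretability), the hypothetical record below sketches one possible layout in Python. The field names and values are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical layout of one SayNext-PC example (field names and values are
# illustrative assumptions, not the dataset's actual schema).
example = {
    "video": "clips/scene_0421.mp4",        # hypothetical path to the reaction video
    "question_utterance": "Do you want to grab dinner after the talk?",
    "next_utterance": "Honestly, I'm exhausted. Could we just order in?",  # ground truth
    "key_factors": ["tired posture", "hesitant tone"],  # cues extracted for interpretability
    "scenario": "casual conversation",      # hypothetical scenario tag
}
```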




Leaderboard



SayNext-Chat Performance

Experimental Findings

  • Clear improvements with vision modality: Incorporating visual cues consistently improves next-utterance prediction performance (see the zero-shot query sketch after this list).
  • SayNext-Chat outperforms baseline MLLMs: Across all three evaluation dimensions, SayNext-Chat consistently surpasses zero-shot baselines, including frontier large-scale MLLMs, open-source models of comparable scale, and emotion-specific MLLMs.
  • Priming tokens significantly boost emotional alignment: While fine-tuning on domain-specific corpora increases both lexical overlap and semantic similarity, priming tokens further improve the emotion accuracy of predicted utterances by 3%.
  • Cross-scenario generalization and scalability: SayNext-Chat maintains superior performance over compared baselines when evaluated on larger-scale datasets and across different scenarios in the zero-shot setting.
  • Efficacy in human & LLM evaluations: SayNext-Chat achieves higher scores in subjective human assessments, slightly surpassing GPT-4o and showing a clear margin over open-source MLLMs with comparable parameter scales.
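Referring back to the first finding above, the sketch below shows one way a zero-shot baseline can be queried with visual cues: a few frames are sampled from the reaction video and passed, together with the question utterance, to an off-the-shelf MLLM (GPT-4o via the OpenAI API, purely as an example). The prompt wording, frame count, and the helper functions sample_frames and predict_next_utterance are illustrative assumptions, not the benchmark's actual evaluation harness.

```python
# Sketch of a zero-shot baseline query: sample frames from the reaction video
# and ask an MLLM for the speaker's next utterance (prompt wording is assumed).
import base64

import cv2
from openai import OpenAI

def sample_frames(video_path: str, num_frames: int = 4) -> list[str]:
    """Return a few evenly spaced frames as base64-encoded JPEGs."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in range(0, total, max(1, total // num_frames)):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            break
        _, buf = cv2.imencode(".jpg", frame)
        frames.append(base64.b64encode(buf.tobytes()).decode())
    cap.release()
    return frames[:num_frames]

def predict_next_utterance(question: str, video_path: str) -> str:
    """Query an example MLLM with the question utterance plus sampled frames."""
    client = OpenAI()
    content = [{"type": "text",
                "text": (f'The speaker was just asked: "{question}". '
                         "Based on their reaction in these frames, predict "
                         "their next utterance verbatim.")}]
    for b64 in sample_frames(video_path):
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content
```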

SayNext-Chat Framework



Inspired by a cognitive neuroscience perspective, we propose SayNext-Chat, a dual-route prediction framework for anticipating forthcoming utterances. It combines learnable priming tokens, which represent high-level belief priors derived from visual inputs, with low-level cues perceived directly from the multimodal inputs.


We design SayNext-Chat with two complementary predictive routes: a fast route that directly maps low-level visual and textual cues to a response, and a deep route that infers high-level priors (priming factors) to guide generation.
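As a rough structural sketch of this idea (not the actual SayNext-Chat implementation), the PyTorch module below shows one way the two routes could be wired together: learnable priming tokens attend over visual features to form high-level priors (deep route), while low-level visual and textual features are projected directly into the language-model input sequence (fast route). All module names, dimensions, and the DualRoutePredictor class itself are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DualRoutePredictor(nn.Module):
    """Illustrative dual-route sketch; names and dims are assumptions, not the paper's code."""

    def __init__(self, d_model: int = 1024, n_priming: int = 8):
        super().__init__()
        # Learnable priming tokens: high-level belief priors (deep route).
        self.priming_tokens = nn.Parameter(torch.randn(n_priming, d_model))
        # Deep route: condition the priming tokens on the visual context.
        self.prior_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        # Fast route: project low-level visual features into the LM embedding space.
        self.visual_proj = nn.Linear(d_model, d_model)

    def forward(self, visual_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # visual_feats: (B, Nv, D) frame/patch features; text_embeds: (B, Nt, D) token embeddings.
        B = visual_feats.size(0)
        # Deep route: priming tokens attend over visual features to form priors.
        priming = self.priming_tokens.unsqueeze(0).expand(B, -1, -1)
        priors, _ = self.prior_attn(priming, visual_feats, visual_feats)
        # Fast route: low-level visual cues mapped directly into the LM input space.
        low_level = self.visual_proj(visual_feats)
        # Concatenate [priors | low-level visual | text] as the input sequence;
        # a language backbone (omitted here) would generate the next utterance from it.
        return torch.cat([priors, low_level, text_embeds], dim=1)
```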

Case Study

