The First Benchmark for Anticipating Human Next Utterances with MLLMs
Problem: We explore the use of multimodal large language models (MLLMs) for next-utterance anticipation in human dialogue. Although recent advances show that these models can engage in natural conversation with users, we find that even leading models surprisingly struggle to anticipate a human speaker's next utterance. Humans, in contrast, readily anticipate forthcoming utterances from multimodal cues in the context, such as gestures, gaze, and emotional tone.
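To make the task concrete, the sketch below shows what a single next-utterance anticipation instance could look like. The field names and structure are illustrative assumptions for exposition only, not the actual SayNext-PC schema.

```python
# Hypothetical next-utterance anticipation instance.
# All field names are illustrative assumptions, not the SayNext-PC schema.
instance = {
    "video_clip": "clip_0421.mp4",  # visual context: gestures, gaze, expression
    "dialogue_history": [
        {"speaker": "A", "utterance": "I got the results back from the lab today."},
        {"speaker": "B", "utterance": "And? Don't keep me waiting!"},
    ],
    "target_speaker": "A",  # whose next utterance must be anticipated
    "reference_next_utterance": "They came back clean. I can finally relax.",
}

# A model under evaluation receives the multimodal context and must
# generate the target speaker's next utterance, e.g.:
# prediction = model.anticipate(instance["video_clip"],
#                               instance["dialogue_history"])
```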
Method: To systematically examine this gap, we propose SayNext-Bench, a benchmark that evaluates MLLMs on anticipating context-conditioned responses across diverse real-world scenarios. To support it, we build SayNext-PC, a large-scale multimodal dialogue dataset, and design a multi-level evaluation framework spanning lexical similarity, emotion-intention consistency, and LLM-based overall alignment. Building on this benchmark, we develop SayNext-Chat, a cognitively inspired dual-route MLLM that incorporates learnable priming tokens to fuse perceptual cues with anticipatory priors. Extensive experiments show that SayNext-Chat consistently outperforms state-of-the-art MLLMs at every evaluation level, a result corroborated by user studies and LLM-as-judge evaluations.
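The three evaluation levels could be composed along the following lines. This is a minimal sketch under assumptions: ROUGE-L stands in for lexical similarity, an off-the-shelf emotion classifier approximates emotion-intention consistency, and a simple judge prompt illustrates LLM-based alignment; none of these are necessarily the exact metrics used in SayNext-Bench.

```python
from rouge_score import rouge_scorer  # pip install rouge-score
from transformers import pipeline     # pip install transformers

# Level 1: lexical similarity (ROUGE-L chosen here for illustration).
_rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def lexical_similarity(reference: str, prediction: str) -> float:
    return _rouge.score(reference, prediction)["rougeL"].fmeasure

# Level 2: emotion-intention consistency, approximated here by label
# agreement under an assumed off-the-shelf emotion classifier.
_emotion = pipeline("text-classification",
                    model="j-hartmann/emotion-english-distilroberta-base")

def emotion_consistency(reference: str, prediction: str) -> float:
    ref_label = _emotion(reference)[0]["label"]
    pred_label = _emotion(prediction)[0]["label"]
    return 1.0 if ref_label == pred_label else 0.0

# Level 3: LLM-as-judge overall alignment; the prompt wording below is
# illustrative and would be sent to a judge model of choice.
JUDGE_PROMPT = (
    "Given the conversation context:\n{context}\n\n"
    "Reference next utterance: {reference}\n"
    "Predicted next utterance: {prediction}\n"
    "Rate from 1 to 10 how well the prediction matches the reference in "
    "meaning, emotion, and intention. Reply with a single number."
)

def judge_prompt(context: str, reference: str, prediction: str) -> str:
    return JUDGE_PROMPT.format(context=context, reference=reference,
                               prediction=prediction)
```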
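The dual-route design with learnable priming tokens could be realized roughly as follows. This is a speculative sketch of one plausible implementation, not the authors' actual architecture; all module names, dimensions, and the cross-attention fusion choice are our assumptions.

```python
import torch
import torch.nn as nn

class PrimingFusion(nn.Module):
    """Speculative sketch: fuse perceptual cues with anticipatory priors
    via learnable priming tokens (names and dimensions are assumptions)."""

    def __init__(self, d_model: int = 4096, num_priming_tokens: int = 16,
                 num_heads: int = 8):
        super().__init__()
        # Learnable priming tokens serve as anticipatory priors.
        self.priming_tokens = nn.Parameter(
            torch.randn(num_priming_tokens, d_model) * 0.02)
        # Cross-attention: priming tokens (queries) attend to perceptual
        # features (keys/values) from the vision/audio encoders.
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads,
                                                batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, perceptual_feats: torch.Tensor) -> torch.Tensor:
        # perceptual_feats: (batch, seq_len, d_model)
        batch = perceptual_feats.size(0)
        queries = self.priming_tokens.unsqueeze(0).expand(batch, -1, -1)
        fused, _ = self.cross_attn(queries, perceptual_feats,
                                   perceptual_feats)
        # Residual connection keeps the anticipatory prior while
        # grounding it in the observed perceptual cues.
        return self.norm(fused + queries)

# The fused tokens would then be prepended to the LLM's input embeddings,
# alongside the dialogue-history token embeddings, before decoding the
# anticipated next utterance.
```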
Contribution: Our results highlight (i) the indispensable role of multimodal cues and (ii) the importance of active anticipatory processing, two foundations of natural human interaction that current MLLMs lack. We hope this exploration offers a new entry point toward more human-like, context-sensitive interaction for human-centered AI.