The First Benchmark for Anticipating Human Next Utterances with MLLMs
Problem: We explore the use of multimodal large language models (MLLMs) for next-utterance anticipation in human dialogue. Although recent advances show that these models can engage in natural conversation with users, we find that even leading models surprisingly struggle to anticipate a human speaker's next utterance. Humans, in contrast, readily anticipate forthcoming utterances from multimodal cues in the context, such as gestures, gaze, and emotional tone.
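To make the task concrete, the sketch below shows what a single next-utterance anticipation instance could look like. The field names and structure are illustrative assumptions for exposition only, not the actual SayNext-PC schema.

```python
# Hypothetical next-utterance anticipation instance.
# All field names are illustrative assumptions, not the SayNext-PC schema.
instance = {
    "video_clip": "clip_0421.mp4",  # visual context: gestures, gaze, expression
    "dialogue_history": [
        {"speaker": "A", "utterance": "I got the results back from the lab today."},
        {"speaker": "B", "utterance": "And? Don't keep me waiting!"},
    ],
    "target_speaker": "A",  # whose next utterance must be anticipated
    "reference_next_utterance": "They came back clean. I can finally relax.",
}

# A model under evaluation receives the multimodal context and must
# generate the target speaker's next utterance, e.g.:
# prediction = model.anticipate(instance["video_clip"],
#                               instance["dialogue_history"])
```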
Method: To systematically examine this gap, we propose SayNext-Bench, a benchmark that evaluates MLLMs on anticipating context-conditioned responses across diverse real-world scenarios. To support it, we build SayNext-PC, a large-scale multimodal dialogue dataset, and design a multi-level evaluation framework spanning lexical similarity, emotion-intention consistency, and LLM-based overall alignment. Building on this benchmark, we develop SayNext-Chat, a cognitively inspired dual-route MLLM that incorporates learnable priming tokens to fuse perceptual cues with anticipatory priors. Extensive experiments show that SayNext-Chat consistently outperforms state-of-the-art MLLMs at every evaluation level, a result corroborated by user studies and LLM-as-judge evaluations.
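The three evaluation levels could be composed along the following lines. This is a minimal sketch under assumptions: ROUGE-L stands in for lexical similarity, an off-the-shelf emotion classifier approximates emotion-intention consistency, and a simple judge prompt illustrates LLM-based alignment; none of these are necessarily the exact metrics used in SayNext-Bench.

```python
from rouge_score import rouge_scorer  # pip install rouge-score
from transformers import pipeline     # pip install transformers

# Level 1: lexical similarity (ROUGE-L chosen here for illustration).
_rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def lexical_similarity(reference: str, prediction: str) -> float:
    return _rouge.score(reference, prediction)["rougeL"].fmeasure

# Level 2: emotion-intention consistency, approximated here by label
# agreement under an assumed off-the-shelf emotion classifier.
_emotion = pipeline("text-classification",
                    model="j-hartmann/emotion-english-distilroberta-base")

def emotion_consistency(reference: str, prediction: str) -> float:
    ref_label = _emotion(reference)[0]["label"]
    pred_label = _emotion(prediction)[0]["label"]
    return 1.0 if ref_label == pred_label else 0.0

# Level 3: LLM-as-judge overall alignment; the prompt wording below is
# illustrative and would be sent to a judge model of choice.
JUDGE_PROMPT = (
    "Given the conversation context:\n{context}\n\n"
    "Reference next utterance: {reference}\n"
    "Predicted next utterance: {prediction}\n"
    "Rate from 1 to 10 how well the prediction matches the reference in "
    "meaning, emotion, and intention. Reply with a single number."
)

def judge_prompt(context: str, reference: str, prediction: str) -> str:
    return JUDGE_PROMPT.format(context=context, reference=reference,
                               prediction=prediction)
```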
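The dual-route design with learnable priming tokens could be realized roughly as follows. This is a speculative sketch of one plausible implementation, not the authors' actual architecture; all module names, dimensions, and the cross-attention fusion choice are our assumptions.

```python
import torch
import torch.nn as nn

class PrimingFusion(nn.Module):
    """Speculative sketch: fuse perceptual cues with anticipatory priors
    via learnable priming tokens (names and dimensions are assumptions)."""

    def __init__(self, d_model: int = 4096, num_priming_tokens: int = 16,
                 num_heads: int = 8):
        super().__init__()
        # Learnable priming tokens serve as anticipatory priors.
        self.priming_tokens = nn.Parameter(
            torch.randn(num_priming_tokens, d_model) * 0.02)
        # Cross-attention: priming tokens (queries) attend to perceptual
        # features (keys/values) from the vision/audio encoders.
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads,
                                                batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, perceptual_feats: torch.Tensor) -> torch.Tensor:
        # perceptual_feats: (batch, seq_len, d_model)
        batch = perceptual_feats.size(0)
        queries = self.priming_tokens.unsqueeze(0).expand(batch, -1, -1)
        fused, _ = self.cross_attn(queries, perceptual_feats,
                                   perceptual_feats)
        # Residual connection keeps the anticipatory prior while
        # grounding it in the observed perceptual cues.
        return self.norm(fused + queries)

# The fused tokens would then be prepended to the LLM's input embeddings,
# alongside the dialogue-history token embeddings, before decoding the
# anticipated next utterance.
```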
Contribution: Our results highlight (i) the indispensable role of multimodal cues and (ii) the importance of active anticipatory processing, two foundations of natural human interaction that current MLLMs lack. We hope this exploration offers a new entry point toward more human-like, context-sensitive interaction for human-centered AI.