Abstract
[Purpose/Significance] Multi-turn dialogue has become a dominant interaction paradigm for large language models (LLMs) in complex cognitive tasks, yet mainstream evaluations rely on information-complete single-turn tasks that fail to capture the distributional risks introduced by multi-turn interaction. Although the "Lost in Conversation" phenomenon has been identified in English settings, it remains systematically unexplored whether it generalizes to the Chinese context, which task types are most vulnerable, and which variables drive the instability. This study aims to provide controlled, verifiable quantitative evidence on the existence, structural characteristics, and key drivers of this phenomenon in Chinese multi-turn dialogue.

[Method/Process] This study constructed a controlled comparative framework grounded in an "information equivalence" principle, under which the single-turn and multi-turn conditions shared identical semantic content and constraints, with only the presentation mode varied. A total of 250 single-turn samples spanning five task types (reading comprehension, logical reasoning, text classification, long-text processing, and summarization) were curated from established Chinese benchmarks (CMRC 2018, CLUE, DuReader, Math23K, LCSTS), and each sample was converted into a paired multi-turn counterpart through a three-stage pipeline of instruction slicing, alignment verification, and human final review. Four LLMs with distinct architectures and parameter scales (Gemini-2.5-Pro, Claude-3.7-Sonnet, Qwen1.5-72B-Chat, Yi-1.5-34B-Chat) each ran 10 repeated trials per instance under low-entropy decoding, yielding 20,000 scored observations evaluated by an LLM-as-a-Judge protocol supplemented with human spot-checks. The study assessed output distributions through mean performance, percentile bounds, and fluctuation ranges, and verified robustness via Cohen's d, 95% confidence intervals, and two-way ANOVA. Controlled-effect experiments on 52 underperforming instances further isolated the independent contributions of temporal order, interaction frequency, and language context.

[Result/Conclusion] The results confirm that multi-turn interaction in the Chinese context consistently reduces mean performance and amplifies uncertainty, exhibiting a structural pattern of "stable upper bounds, precipitous lower-bound drops, and expanded fluctuation." Mean scores decrease by 8.05 to 18.03 points across models, with all Cohen's d values exceeding 0.93. Task structure significantly moderates degradation severity: long-text processing and summarization suffer the most severe lower-bound breakdowns (P10 dropping to 43 and 52), while text classification remains comparatively resilient. Two-way ANOVA confirms that both main effects and their interaction are highly significant (p < 0.0001). Controlled experiments reveal that interaction frequency is the dominant driver: compressing dialogue turns substantially recovers performance and narrows fluctuation across all task types. These findings point to a cascading mechanism that runs from cognitive load accumulation through attention dilution to state drift, and suggest that designing fewer turns with denser instructions and embedding intermediate checkpoint mechanisms are essential practical strategies for reliable human-AI collaboration in high-stakes environments.

Keywords
large language models/human-computer interaction/multi-turn dialogue/lost in conversation phenomenon/controlled experiment/prompt engineering
Classification
Social sciences
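
The distributional measures named in the abstract (mean performance, percentile bounds, fluctuation range, and Cohen's d) can be illustrated with a minimal sketch. Several choices here are assumptions rather than details taken from the study: the fluctuation range is taken to be P90 minus P10, percentiles use linear interpolation, Cohen's d uses the pooled sample standard deviation, and the function names and the score lists are hypothetical placeholders, not data from the paper.

```python
import math
import statistics


def percentile(values, q):
    """Return the q-th percentile (q in [0, 100]) using linear interpolation."""
    xs = sorted(values)
    if len(xs) == 1:
        return float(xs[0])
    pos = (len(xs) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def summarize(scores):
    """Mean, lower bound (P10), upper bound (P90), and fluctuation range.

    Assumption: the fluctuation range is defined as P90 - P10.
    """
    p10 = percentile(scores, 10)
    p90 = percentile(scores, 90)
    return {
        "mean": statistics.fmean(scores),
        "p10": p10,
        "p90": p90,
        "range": p90 - p10,
    }


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled sample SD."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (statistics.fmean(a) - statistics.fmean(b)) / pooled_sd


if __name__ == "__main__":
    # Hypothetical scores for one instance: 10 single-turn and 10 multi-turn trials.
    single_turn = [90, 92, 88, 91, 89, 93, 90, 87, 92, 91]
    multi_turn = [88, 60, 85, 45, 90, 72, 86, 50, 89, 78]
    print("single-turn:", summarize(single_turn))
    print("multi-turn: ", summarize(multi_turn))
    print("Cohen's d:  ", cohens_d(single_turn, multi_turn))
```

With placeholder data of this shape, the multi-turn P90 stays close to the single-turn values while P10 falls sharply, which is the kind of "stable upper bound, dropping lower bound" pattern the abstract describes.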