智能系统学报 (CAAI Transactions on Intelligent Systems), 2025, Vol. 20, Issue (5): 1281-1293, 13. DOI: 10.11992/tis.202411001
Evaluation of Chinese multiskill dialogues (中文多技能对话评估)
Abstract
The accurate evaluation of the capabilities of a multiskilled dialogue system is important for satisfying the diverse demands of users, including social banter, profound knowledge-based discussions, role-playing conversations, and dialogue recommendations. Current benchmarks concentrate on assessing specific dialogue skills and cannot efficiently evaluate multiple dialogue skills concurrently. To facilitate the evaluation of multiskill dialogues, this study establishes a Chinese multiskill evaluation benchmark, the Multi-Skill Dialogue Evaluation Benchmark (MSDE). MSDE contains 1,781 dialogues and 21,218 utterances, covering four common dialogue tasks: chit-chat, knowledge-grounded dialogue, persona-based dialogue, and dialogue recommendation. We performed extensive experiments on MSDE and examined the correlation between automatic and human evaluation metrics. The results indicate that (1) among the four dialogue tasks, chit-chat is the most difficult to analyze, while knowledge-grounded dialogue is the easiest; (2) significant differences exist in the performance of various metrics on MSDE; (3) for human evaluation, the analysis complexity of each metric differs across dialogue tasks. Part of the data will be made available at https://github.com/IRIP-LLM/MSDE, and all data will be released after curation.

Keywords
multiskill dialogue / dialogue evaluation / chit-chat / open-domain dialogue / conversational recommendation / persona-chat / knowledge-grounded dialogue / large language model

Classification
Information Technology and Security Science

Cite this article
柳泽明, 程子豪, 刘晶晶, 杨晓, 郭园方, 王蕴红. Evaluation of Chinese multiskill dialogues (中文多技能对话评估)[J]. 智能系统学报 (CAAI Transactions on Intelligent Systems), 2025, 20(5): 1281-1293, 13.

Funding
National Key R&D Program of China (2023YFF0725600)
National Natural Science Foundation of China (62406015)