Academic Journal of Chinese PLA Medical School (解放军医学院学报), 2025, Vol. 46, Issue 10: 988-993, 6. DOI: 10.12435/j.issn.2095-5227.25042102
国内大语言模型在烧伤辅助诊疗多跳推理任务中的性能评估与比较
Performance evaluation and comparison of domestic large language models in multi-hop reasoning tasks for burn injury diagnosis and treatment assistance
Abstract
Background: Burn care demands rapid integration of multidimensional clinical information to support accurate decision-making, and multi-hop reasoning plays a key role in this process.

Objective: To evaluate the performance differences among four domestic large language models (DeepSeek R1, DeepSeek V3, Doubao, and KiMi) in multi-hop reasoning tasks for assisted burn diagnosis and treatment, and to provide a theoretical reference for model selection and optimization in clinical and field emergency environments.

Methods: A total of 30 burn cases were randomly selected from patients discharged from Chinese PLA General Hospital between January 2023 and February 2025. Three burn-care experts performed a blinded evaluation, rating the accuracy of the diagnostic results on a 5-point Likert scale. Overall comparisons used randomized block ANOVA; subgroup analyses (question word count, burn site, burn area, severity) used the Mann-Whitney U test; and mixed-effects models assessed the interaction between large language model and subgroup factors.

Results: The experts' scores had a Cronbach's alpha of 0.809. DeepSeek R1 achieved a mean score of 4.2 ± 0.62, significantly outperforming DeepSeek V3 (2.4 ± 1.06), Doubao (3.2 ± 1.31), and KiMi (1.6 ± 0.86) (P < 0.001). In subgroup analyses, DeepSeek R1 consistently performed best across all defined subgroups: cases with word counts ≤2 000 versus ≥2 000, single-site versus multi-site burn injuries, total body surface area (TBSA) involvement <10% versus ≥10%, and burn severity below deep partial-thickness versus deep partial-thickness or greater. Mixed-effects modeling revealed significant interactions between model and prompt length (P = 0.006), number of burn sites (P = 0.007), and burn area (P = 0.001).

Conclusion: Significant performance differences exist among domestic large language models on multi-hop reasoning tasks for burn-care diagnostic support, with DeepSeek R1 demonstrating superior capability. These findings underscore the promise of multi-hop reasoning techniques for integrating complex clinical data and facilitating rapid decision-making, and they offer important guidance for optimizing large models in emergency burn-care settings.

Key words
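large language models / multi-hop reasoning / burn care diagnostics / artificial intelligence / clinical decision support
Classification
Medicine and health
Cite this article
张文一, 郭九宫, 郑金光, 王庆梅, 李林. Performance evaluation and comparison of domestic large language models in multi-hop reasoning tasks for burn injury diagnosis and treatment assistance [J]. Academic Journal of Chinese PLA Medical School, 2025, 46(10): 988-993, 6.
Funding
Provincial or ministerial-level project ()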
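Illustrative note on the consistency statistic: the abstract reports a Cronbach's alpha of 0.809 for the three experts' scores. The sketch below shows one common way such a value can be computed, treating each rater as an "item" and each rated answer as a row. The function name `cronbach_alpha` and the sample ratings are hypothetical and are not taken from the study; the authors' actual software and data layout are not stated in the abstract.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha for a table of scores.

    ratings: list of rows, one row per rated answer,
             each row holding one score per rater (same rater order).
    Uses sample variances (n - 1 denominator).
    """
    k = len(ratings[0])  # number of raters
    if k < 2 or len(ratings) < 2:
        raise ValueError("need at least 2 raters and 2 rated answers")
    # Variance of each rater's scores across all rated answers.
    rater_vars = [variance([row[j] for row in ratings]) for j in range(k)]
    # Variance of the per-answer total score (sum over raters).
    total_var = variance([sum(row) for row in ratings])
    return k / (k - 1) * (1 - sum(rater_vars) / total_var)


# Hypothetical 5-point Likert scores from three raters (NOT study data).
example = [
    [5, 4, 5],
    [3, 3, 2],
    [4, 4, 4],
    [2, 1, 2],
    [5, 5, 4],
]
alpha = cronbach_alpha(example)
print(f"alpha = {alpha:.3f}")
```

Values closer to 1 indicate that the raters rank the answers more consistently; when every rater gives identical scores, the formula yields exactly 1.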