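Performance evaluation and comparison of domestic large language models in multi-hop reasoning tasks for burn injury diagnosis and treatment assistance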

张文一, 郭九宫, 郑金光, 王庆梅, 李林

解放军医学院学报 2025, Vol. 46, Issue (10): 988-993, 6. DOI: 10.12435/j.issn.2095-5227.25042102

张文一¹, 郭九宫², 郑金光³, 王庆梅⁴, 李林¹

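Author information

1. Department of Medical Innovation Research, Chinese PLA General Hospital, Beijing 100853
2. Health Service Teaching and Research Office, Joint Logistics College, National Defense University, Beijing 100091
3. Department of Burns and Plastic Surgery, the Fourth Medical Center of Chinese PLA General Hospital, Beijing 100048
4. Department of Military Patient Management, the First Medical Center of Chinese PLA General Hospital, Beijing 100853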
Abstract

Background Burn care demands rapid integration of multidimensional clinical information to support accurate decision-making, and multi-hop reasoning plays a key role in this process.

Objective To evaluate performance differences among four domestic large language models (DeepSeek R1, DeepSeek V3, Doubao, and KiMi) in multi-hop reasoning tasks for burn diagnosis and treatment assistance, and to provide a theoretical reference for model selection and optimization in clinical and field emergency environments.

Methods A total of 30 burn cases were randomly selected from patients discharged from Chinese PLA General Hospital between January 2023 and February 2025. Three burn-care experts performed a blinded evaluation, rating the accuracy of the diagnostic results on a 5-point Likert scale. Overall comparisons used randomized block ANOVA; subgroup analyses (question word count, burn site, burn area, severity) used the Mann-Whitney U test; and mixed-effects models were used to assess interactions between the large language models and the subgroup factors.

Results Cronbach's alpha for the experts' scores was 0.809. DeepSeek R1 achieved a mean score of 4.2 ± 0.62, significantly outperforming DeepSeek V3 (2.4 ± 1.06), Doubao (3.2 ± 1.31), and KiMi (1.6 ± 0.86) (P < 0.001). In subgroup analysis, DeepSeek R1 consistently showed superior performance in every defined subgroup: cases with word counts ≤2 000 versus ≥2 000, single-site versus multi-site burn injuries, total body surface area (TBSA) involvement <10% versus ≥10%, and burn severity below deep partial-thickness versus deep partial-thickness or greater. Mixed-effects modeling revealed significant interactions between the model and prompt length (P = 0.006), number of burn sites (P = 0.007), and burn area (P = 0.001).

Conclusion Significant performance differences exist among domestic large language models on multi-hop reasoning tasks for burn-care diagnostic support, with DeepSeek R1 demonstrating superior capability. These findings underscore the promise of multi-hop reasoning techniques for integrating complex clinical data and facilitating rapid decision-making, and they offer important guidance for optimizing large models in emergency burn-care settings.
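
The abstract reports Cronbach's alpha as the measure of consistency among the three experts' scores. As a rough illustration of how that statistic is computed, the following minimal sketch treats each rater as an "item" and each rated answer as a row. The function name `cronbach_alpha` and the score matrix are hypothetical and are not data or code from the study; the authors' actual software and procedure are not described in the abstract.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha for a matrix with one row per rated answer
    and one column per rater (sample variances, ddof = 1)."""
    n_rows = len(ratings)
    k = len(ratings[0])  # number of raters ("items")
    if k < 2 or n_rows < 2:
        raise ValueError("need at least 2 raters and 2 rated answers")

    # Variance of each rater's scores across all rated answers.
    rater_columns = list(zip(*ratings))
    sum_item_variances = sum(variance(col) for col in rater_columns)

    # Variance of the per-answer total score (sum over raters).
    total_variance = variance([sum(row) for row in ratings])

    return k / (k - 1) * (1 - sum_item_variances / total_variance)


# Hypothetical 5-point Likert scores from three raters (NOT study data).
hypothetical_ratings = [
    [5, 4, 5],
    [2, 2, 3],
    [4, 4, 4],
    [1, 2, 1],
    [3, 3, 4],
]

alpha = cronbach_alpha(hypothetical_ratings)
print(f"Cronbach's alpha: {alpha:.3f}")
```

Values closer to 1 indicate that the raters' scores move together across answers. The sketch uses only the Python standard library so that it stays self-contained.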

Key words

large language models / multi-hop reasoning / burn care diagnostics / artificial intelligence / clinical decision support

Classification

Medicine and health

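Cite this article

张文一, 郭九宫, 郑金光, 王庆梅, 李林. Performance evaluation and comparison of domestic large language models in multi-hop reasoning tasks for burn injury diagnosis and treatment assistance [J]. 解放军医学院学报, 2025, 46(10): 988-993, 6.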

Funding

Provincial/ministerial-level research project ()

解放军医学院学报

ISSN: 2095-5227
