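Performance evaluation and comparison of domestic large language models in multi-hop reasoning tasks for burn injury diagnosis and treatment assistance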

张文一 郭九宫 郑金光 王庆梅 李林

Academic Journal of Chinese PLA Medical School (解放军医学院学报), 2025, Vol. 46, Issue 10: 988-993, 6. DOI: 10.12435/j.issn.2095-5227.25042102

Performance evaluation and comparison of domestic large language models in multi-hop reasoning tasks for burn injury diagnosis and treatment assistance

张文一¹, 郭九宫², 郑金光³, 王庆梅⁴, 李林¹

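Author information

  • 1. Medical Innovation Research Division, Chinese PLA General Hospital, Beijing 100853
  • 2. Department of Health Service, Joint Logistics College, National Defense University, Beijing 100091
  • 3. Department of Burns and Plastic Surgery, Fourth Medical Center, Chinese PLA General Hospital, Beijing 100048
  • 4. Department of Military Patient Management, First Medical Center, Chinese PLA General Hospital, Beijing 100853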

Abstract

Background: Burn care demands rapid integration of multidimensional clinical information to support accurate decision-making, and multi-hop reasoning plays a key role in this process.

Objective: To evaluate performance differences among four domestic large language models (DeepSeek R1, DeepSeek V3, Doubao, and Kimi) in multi-hop reasoning tasks for assisted burn diagnosis and treatment, and to provide a reference for model selection and optimization in clinical and field emergency environments.

Methods: Thirty burn cases were randomly selected from patients discharged from Chinese PLA General Hospital between January 2023 and February 2025. Three burn-care experts performed a blind evaluation, rating the accuracy of the diagnostic results on a 5-point Likert scale. Overall comparisons used randomized block ANOVA; subgroup analyses (question word count, burn site, burn area, and burn severity) used the Mann-Whitney U test; and mixed-effects models assessed the interaction between the large language models and the subgroup factors.

Results: Cronbach's alpha for the experts' scores was 0.809. DeepSeek R1 achieved a mean score of 4.2 ± 0.62, significantly higher than DeepSeek V3 (2.4 ± 1.06), Doubao (3.2 ± 1.31), and Kimi (1.6 ± 0.86) (P < 0.001). In subgroup analyses, DeepSeek R1 consistently outperformed the other models in all defined subgroups: cases with word counts ≤2 000 versus ≥2 000, single-site versus multi-site burns, total body surface area (TBSA) involvement <10% versus ≥10%, and burn severity below deep partial-thickness versus deep partial-thickness or greater. Mixed-effects modeling revealed significant interactions between model score and prompt length (P = 0.006), number of burn sites (P = 0.007), and burn area (P = 0.001).

Conclusion: Significant performance differences exist among domestic large language models on multi-hop reasoning tasks for burn-care diagnostic support, with DeepSeek R1 demonstrating superior capability. These findings underscore the promise of multi-hop reasoning techniques for integrating complex clinical data and facilitating rapid decision-making, and they offer guidance for optimizing large models in emergency burn-care settings.
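The abstract reports Cronbach's alpha as the measure of consistency among the three expert raters. As a minimal sketch of how such a coefficient is commonly computed, the Python below treats each rater as an "item" and each rated response as a row. The function name `cronbach_alpha` and the `toy_ratings` values are hypothetical and for illustration only; they are not the study's data, and the abstract does not state which software or variance convention the authors used (sample variance is assumed here).

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha for a table of scores.

    ratings: list of rows, one row per rated response,
             each row holding one score per rater.
    Assumption: sample variance (n - 1 denominator) is used throughout.
    """
    if len(ratings) < 2 or len(ratings[0]) < 2:
        raise ValueError("need at least two rows and two raters")
    k = len(ratings[0])  # number of raters
    # Variance of each rater's scores across all rated responses
    rater_vars = [variance([row[j] for row in ratings]) for j in range(k)]
    # Variance of the per-response total score (sum over raters)
    total_var = variance([sum(row) for row in ratings])
    return k / (k - 1) * (1 - sum(rater_vars) / total_var)


# Hypothetical 5-point Likert scores from three raters (not the study's data)
toy_ratings = [
    [5, 4, 5],
    [2, 2, 3],
    [4, 4, 4],
    [1, 2, 1],
    [3, 3, 4],
]

if __name__ == "__main__":
    print(f"alpha = {cronbach_alpha(toy_ratings):.3f}")
```

Values closer to 1 indicate that the raters' scores move together across responses.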

Key words

large language models/multi-hop reasoning/burn care diagnostics/artificial intelligence/clinical decision support

Classification

Medicine and health

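Cite this article

张文一, 郭九宫, 郑金光, 王庆梅, 李林. Performance evaluation and comparison of domestic large language models in multi-hop reasoning tasks for burn injury diagnosis and treatment assistance [J]. Academic Journal of Chinese PLA Medical School, 2025, 46(10): 988-993, 6.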

Funding

Provincial- and ministerial-level research project ()

Academic Journal of Chinese PLA Medical School (解放军医学院学报)

ISSN 2095-5227
