Journal of Data Acquisition and Processing (数据采集与处理), 2025, Vol. 40, Issue 3: 603-615 (13 pages). DOI: 10.16337/j.1004-9037.2025.03.004
Construction of High-Quality Dataset in Aero-engine Domain Based on Large Language Model
(基于大语言模型的航空发动机领域高质量数据集构建)
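Zou Guanyun (邹冠沄) ¹, Wang Cunjun (王存俊) ², Kong Yinhao (孔寅豪) ², Ma Xiaoqing (马小庆) ², Li Piji (李丕绩) ¹
Author information
- 1. College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics, Nanjing 211106; MIIT Key Laboratory of Pattern Analysis and Machine Intelligence (Nanjing University of Aeronautics and Astronautics), Nanjing 211106
- 2. Shanghai Aircraft Design and Research Institute, Commercial Aircraft Corporation of China, Ltd., Shanghai 201210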
Abstract
With the rapid advancement of artificial intelligence technology, large language models (LLMs) are increasingly being applied across various domains. However, the lack of high-quality, manually curated question-answering datasets in the aero-engine field has hindered the practical application of expert-level question-answering models. To address this issue, this paper proposes an automated method for constructing question-answering datasets based on LLMs, which generates high-quality open-domain question-answering data without human intervention. During the data generation phase, the method employs in-context learning and input-priority generation strategies to enhance the stability of the generated data. In the data filtering phase, a dual evaluation mechanism is established, combining faithfulness assessment based on source text similarity with semantic quality evaluation using LLMs, to automatically filter out hallucinated or anomalous data and ensure factual reliability. Experimental results demonstrate that the proposed method significantly improves the quality of the generated dataset. Models fine-tuned on this dataset exhibit notable performance improvements in aero-engine domain knowledge question-answering tasks. The findings of this study not only provide a solid foundation for the application of LLMs in the aero-engine domain but also offer valuable insights for automated dataset construction in other complex engineering fields.
Keywords
large language model / vertical domain large language model / question-answering data generation / quality assessment of question-answering data
Classification
Computer and Automation
Cite this article
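Zou Guanyun, Wang Cunjun, Kong Yinhao, Ma Xiaoqing, Li Piji. Construction of high-quality dataset in aero-engine domain based on large language model [J]. Journal of Data Acquisition and Processing, 2025, 40(3): 603-615 (13 pages).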
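Illustrative sketch

The abstract describes a dual filter that combines a source-similarity faithfulness check with an LLM-based quality score. The abstract does not give the similarity metric, the thresholds, or the judging prompt, so the sketch below is only one possible reading and not the authors' implementation. The names `overlap_faithfulness`, `keep_pair`, and `placeholder_judge`, the token-overlap metric, and both threshold values are assumptions made for illustration. Whitespace tokenization is also an assumption; Chinese text would need a word segmenter.

```python
from collections import Counter
from typing import Callable


def overlap_faithfulness(answer: str, source: str) -> float:
    """Fraction of answer tokens that also occur in the source text.

    Assumed stand-in for the similarity measure, which the abstract
    does not specify.
    """
    answer_tokens = Counter(answer.lower().split())
    source_tokens = Counter(source.lower().split())
    total = sum(answer_tokens.values())
    if total == 0:
        return 0.0
    shared = sum(min(n, source_tokens[tok]) for tok, n in answer_tokens.items())
    return shared / total


def placeholder_judge(question: str, answer: str) -> float:
    """Hypothetical stand-in for an LLM quality score in [0, 1].

    A real pipeline would call a language model here; this stub only
    checks that both fields are non-empty.
    """
    return 1.0 if question.strip() and answer.strip() else 0.0


def keep_pair(
    question: str,
    answer: str,
    source: str,
    judge: Callable[[str, str], float] = placeholder_judge,
    sim_threshold: float = 0.6,      # assumed value
    quality_threshold: float = 0.7,  # assumed value
) -> bool:
    """Keep a QA pair only if it passes both checks."""
    if overlap_faithfulness(answer, source) < sim_threshold:
        return False  # answer drifts too far from the source text
    return judge(question, answer) >= quality_threshold


source = "the high pressure compressor raises air pressure before combustion"
grounded = keep_pair("What does the compressor do?",
                     "the compressor raises air pressure", source)
ungrounded = keep_pair("What does the compressor do?",
                       "the turbine is cooled by liquid nitrogen", source)
```

Running the cheap similarity check first means the more expensive judge is only invoked on pairs that already look grounded in the source.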