Computer Engineering and Applications, 2026, Vol. 62, Issue (9): 122-132, 11. DOI: 10.3778/j.issn.1002-8331.2506-0066
Unsupervised Data-Driven Framework for Fine-Tuning Dataset Generation in Vertical Domain Large Models
Abstract
Large language models have achieved remarkable progress in natural language processing, but general-purpose models struggle in vertical domains because they lack domain-specific knowledge and domain data is scarce. This paper proposes GEO, an unsupervised, data-driven framework for generating fine-tuning datasets. By integrating generation, evaluation, and optimization modules, the framework automatically constructs and refines question-answer pairs to produce high-quality, domain-adaptive training data. The framework is applied to the low-carbon energy domain to build a specialized dataset. LoRA-based fine-tuning experiments are conducted on Chinese models including Qwen2.5-7B, Baichuan2-7B, and ChatGLM3-6B. The results show that the GEO-generated dataset significantly improves text generation quality, semantic consistency, and domain adaptability while preserving the general capabilities of the base models, which validates the effectiveness of the approach.
Key words
large language model / vertical domain / data generation / LoRA fine-tuning
Classification
Information Technology and Security Science
Cite this article
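师瑞峰, 轩顺德, 叶禹江, 时文刚, 宋寅, 滕婧. Unsupervised Data-Driven Framework for Fine-Tuning Dataset Generation in Vertical Domain Large Models[J]. Computer Engineering and Applications, 2026, 62(9): 122-132, 11.
Funding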
National Key Research and Development Program of China (2021YFB2601300)
National Natural Science Foundation of China (62373148)
Fundamental Research Funds for the Central Universities (2025JC005)