大数据 (Big Data Research), 2025, Vol. 11, Issue 6: 57-71, 15. DOI: 10.11959/j.issn.2096-0271.2025085
Research on the Status of Large Language Model Datasets in Artificial Intelligence and Countermeasures for Enriching Them (人工智能大语言模型数据集现状和充实对策研究)
Abstract
Training data for artificial intelligence large language models is typically characterized by large scale, high quality, and diverse data types. Although China has abundant data resources, high-quality Chinese training data for large language models remains scarce, and a gap persists between China and the world's leading countries in both the quantity and the quality of such data. Drawing on public datasets and the datasets of typical general-purpose large language models in China and abroad, this paper compares the state of large language model datasets at home and abroad, analyzes the challenges and problems facing the development of large language model datasets in China, and puts forward countermeasures and suggestions for enriching the supply of large language model datasets for artificial intelligence.
Keywords
artificial intelligence / large language model / dataset
Classification
Computers and Automation
Cite this article
胡晓女, 李涛, 李姗姗. 人工智能大语言模型数据集现状和充实对策研究[J]. 大数据, 2025, 11(6): 57-71, 15.
Funding
Ministry of Industry and Information Technology of the People's Republic of China 2024 Guidance Soft Topic (No. GXZK2024-49)
The 2024 National Academic Society Service National Strategy Special Project: Key Technology Roadmap for AI Computing Power Network (No. (2024)837)