Journal of East China Normal University (Natural Science), Issue (5): 14-24, 11. DOI: 10.3969/j.issn.1000-5641.2025.05.002
Application and evaluation of large language models in open source project topic annotation
He Dexin (何德鑫) 1, Han Fanyu (韩凡宇) 1, Wang Wei (王伟) 1
Author information
- 1. School of Data Science and Engineering, East China Normal University, Shanghai 200062
Abstract
With the rapid development of open source communities, the number of GitHub projects has grown exponentially. However, a considerable portion of these projects lack explicit topic labels, which hinders developers in technology selection and project retrieval. Existing topic generation methods rely primarily on supervised learning and therefore depend heavily on high-quality annotated data, among other limitations. To address the accuracy and efficiency of project topic annotation in open source communities, this study presents the first comprehensive evaluation of large language models on the GitHub project topic prediction task. We constructed a dataset of 3,000 popular GitHub projects, selected using a quantitative metric designed to evaluate the activity and influence of open source projects; each project is described by multidimensional features including the repository name, README document, and description. Comparative experiments were conducted on several mainstream large language models from China and abroad: Claude 3.7 Sonnet, DeepSeek-V3, Gemini 2.0 Flash, GPT-4o, and Qwen-Plus. Claude 3.7 Sonnet achieved the best performance on most evaluation metrics, and the performance of all models tended to stabilize as the dataset grew. The experiments show that large language models are well suited to project topic annotation, although performance differs significantly among models. These findings provide a reference for open source community project management and the design of intelligent annotation systems.
Keywords
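large language model/repository mining/topic annotation/open source dataset
Classification
Information technology and security science
Cite this article
He Dexin, Han Fanyu, Wang Wei. Application and evaluation of large language models in open source project topic annotation [J]. Journal of East China Normal University (Natural Science), 2025, (5): 14-24, 11.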