Computer Engineering and Applications, 2023, Vol. 59, Issue 24: 277-288, 12. DOI: 10.3778/j.issn.1002-8331.2209-0402
CodeBERT-Based Code Classification Method
Abstract
With the continuing growth of code big data, the volume of source code in code repositories keeps increasing, which makes software code management more complex. Classifying and managing the code in a repository quickly and effectively is therefore of great importance to software engineering. This article introduces pre-trained models into code classification research for the first time and proposes an optimized code classification method, CBBCC. First, WordPiece is used to preprocess the source code. Second, a CodeBERT pre-trained model is used to represent the source code. Finally, the pre-trained model is fine-tuned on the classification task. To verify the effectiveness of the proposed model, experiments are conducted on the POJ104 dataset. In comparison with seven baseline models, CBBCC achieves more than 98% on all classification metrics. Its accuracy is 1.1 percentage points higher than that of the current best model, reaching the state of the art (SOTA) for classification on the POJ104 code classification dataset. CBBCC can effectively annotate code, improve the management of source code in open source communities, and promote the development of the software engineering field.
Keywords
code classification/code representation/CodeBERT/transfer training/code snippets
Classification
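Information Technology and Security Science
Cite this article
CHENG Siqiang, LIU Jianxun, PENG Zhenlian, CAO Ben. CodeBERT-Based Code Classification Method [J]. Computer Engineering and Applications, 2023, 59(24): 277-288, 12.
Funding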
National Natural Science Foundation of China (61872139)
Key Project of the Hunan Provincial Department of Education (20A175)
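Illustration
The abstract names WordPiece preprocessing as the first step of CBBCC. The following is a minimal sketch of greedy longest-match-first WordPiece splitting applied to code tokens, offered only to make that step concrete. The function names, the whitespace-based pre-splitting, the lowercasing, and the toy vocabulary are all assumptions for illustration; they are not taken from the article and do not reproduce the vocabulary or tokenizer shipped with CodeBERT.

```python
def wordpiece_tokenize(token, vocab, unk="[UNK]"):
    """Split one token into sub-word pieces by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(token):
        end = len(token)
        match = None
        # Shrink the candidate from the right until it appears in the vocabulary.
        while start < end:
            candidate = token[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole token maps to the unknown symbol
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only (not CodeBERT's vocabulary).
TOY_VOCAB = {"int", "get", "max", "##max", "##value", "##_", "##val"}


def tokenize_code(snippet, vocab=TOY_VOCAB):
    """Split a snippet on whitespace, lowercase each token, then apply WordPiece."""
    pieces = []
    for token in snippet.split():
        pieces.extend(wordpiece_tokenize(token.lower(), vocab))
    return pieces


if __name__ == "__main__":
    print(tokenize_code("int getMaxValue"))
```

Under these assumptions, an identifier such as `getMaxValue` is broken into pieces that already exist in the vocabulary, so unseen identifiers can still be represented by known sub-words before being passed to a pre-trained model.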