
Research on Code Classification Based on CodeBERT

Cheng Siqiang, Liu Jianxun, Peng Zhenlian, Cao Ben

Computer Engineering and Applications, 2023, Vol. 59, Issue 24: 277-288 (12 pages). DOI: 10.3778/j.issn.1002-8331.2209-0402

CodeBERT Based Code Classification Method


Author information

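  • 1. School of Computer Science and Engineering, Hunan University of Science and Technology, Xiangtan, Hunan 411201; Hunan Key Laboratory for Services Computing and Novel Software Technology, Hunan University of Science and Technology, Xiangtan, Hunan 411201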

Abstract

As code big data continues to grow, the volume of source code in code repositories keeps increasing, which makes software code management more complex. Classifying and managing the code in a repository quickly and effectively is therefore important to the development of software engineering. This article introduces pre-trained models to code classification research for the first time and proposes an optimized code classification method, CBBCC. First, the source code is pre-processed with WordPiece. Second, a pre-trained CodeBERT model is used to produce a representation of the source code. Finally, the pre-trained model is fine-tuned on the classification task. To verify the effectiveness of the proposed model, experiments are conducted on the POJ104 dataset. The results show that, in comparison with seven baseline models, CBBCC achieves more than 98% on all classification metrics. Its accuracy is 1.1 percentage points higher than that of the best existing model, reaching the state of the art (SOTA) for classification on the POJ104 code classification dataset. CBBCC can effectively annotate code, improve the management of source code in open source communities, and promote the development of the software engineering field.
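The WordPiece pre-processing step mentioned in the abstract can be illustrated with a minimal sketch of greedy longest-match-first subword splitting. The vocabulary `TOY_VOCAB`, the `##` continuation prefix, and the function names are assumptions made for illustration only; the abstract does not give the actual vocabulary or tokenizer configuration used by CBBCC.

```python
def wordpiece_tokenize(token, vocab, unk="[UNK]", prefix="##"):
    """Split one token into subwords by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(token):
        end = len(token)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = token[start:end]
            if start > 0:
                # Pieces that continue a word carry the continuation prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole token maps to the unknown symbol.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {"int", "main", "print", "##f", "count", "##er", "(", ")", "[UNK]"}


def tokenize_code(source, vocab=TOY_VOCAB):
    """Whitespace-split a code string, then apply WordPiece to each token."""
    out = []
    for token in source.split():
        out.extend(wordpiece_tokenize(token, vocab))
    return out
```

With this toy vocabulary, an identifier such as `printf` is split into a known stem and a continuation piece, while a token with no matching pieces collapses to `[UNK]`. In a full pipeline, the resulting subword sequence would be mapped to ids and passed to the pre-trained encoder before fine-tuning a classification layer.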

Key words

code classification/code representation/CodeBERT/transfer training/code snippets

Classification

Information Technology and Security Science

Cite this article

Cheng Siqiang, Liu Jianxun, Peng Zhenlian, Cao Ben. CodeBERT Based Code Classification Method[J]. Computer Engineering and Applications, 2023, 59(24): 277-288.

Funding

National Natural Science Foundation of China (61872139)

Key Project of the Education Department of Hunan Province (20A175)

Computer Engineering and Applications

Indexing: OA, Peking University Core, CSCD, CSTPCD

ISSN 1002-8331
