Journal of Jilin University (Information Science Edition), 2024, Vol. 42, Issue 4: 747-753.
Research on Source Code Plagiarism Detection Based on Pre-Trained Transformer Language Model

Abstract
To address source code plagiarism detection and the limitations of existing methods, which require large amounts of training data and are restricted to specific languages, we propose a detection method based on pre-trained Transformer language models, combined with word embeddings, similarity measures, and classification models. The proposed method supports multiple programming languages and achieves good detection performance without requiring any training samples labeled as plagiarism. Experimental results show that it achieves state-of-the-art detection performance on multiple public datasets. In addition, for scenarios where only a few labeled plagiarism training samples are available, we propose a method that incorporates supervised classification models to further improve detection performance. The method is applicable to source code plagiarism detection scenarios where training data is scarce, computational resources are limited, and programming languages are diverse.

Key words: source code plagiarism detection / Transformer model / pre-trained model / machine learning / deep learning

Classification: Computer Science and Automation
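To make the embedding-and-similarity pipeline described in the abstract concrete, here is a minimal sketch. It is not the paper's implementation: as a hypothetical simplification, Transformer embeddings are replaced by bag-of-token count vectors, and the decision threshold of 0.8 is an assumed value, not one reported by the authors.

```python
# Sketch of similarity-based plagiarism detection.
# Hypothetical: bag-of-token vectors stand in for Transformer embeddings;
# the 0.8 threshold is an assumed value for illustration only.
import math
import re
from collections import Counter


def tokenize(code: str) -> list[str]:
    # Split source code into identifiers, numbers, and single operator chars.
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)


def embed(code: str) -> Counter:
    # Stand-in "embedding": a token-frequency vector.
    return Counter(tokenize(code))


def cosine_similarity(a: Counter, b: Counter) -> float:
    # Cosine of the angle between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_plagiarism(code1: str, code2: str, threshold: float = 0.8) -> bool:
    # Flag a pair as suspected plagiarism when similarity exceeds the threshold.
    return cosine_similarity(embed(code1), embed(code2)) >= threshold
```

In the paper's actual setting, `embed` would instead query a pre-trained Transformer model, which is what makes the approach language-agnostic and training-free.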
Citation: 钱亮宏, 王福德, 孙晓海. Research on Source Code Plagiarism Detection Based on Pre-Trained Transformer Language Model [J]. Journal of Jilin University (Information Science Edition), 2024, 42(4): 747-753.

Funding: Industrialization Cultivation Fund of the Education Department of Jilin Province (JJKH20240274CY)