Journal of Jilin University (Information Science Edition), 2024, Vol. 42, Issue 4: 747-753.
Research on Source Code Plagiarism Detection Based on Pre-Trained Transformer Language Model

Abstract
To address source code plagiarism detection and the limitations of existing methods, which require large amounts of training data and are restricted to specific languages, we propose a detection method based on pre-trained Transformer language models, combined with word embeddings, similarity measures, and classification models. The proposed method supports multiple programming languages and achieves good detection performance without requiring any training samples labeled as plagiarism. Experimental results show that it achieves state-of-the-art detection performance on multiple public datasets. In addition, for scenarios where only a few labeled plagiarism training samples are available, we propose a method that incorporates supervised classification models to further improve detection performance. The method is applicable to source code plagiarism detection scenarios where training data is scarce, computational resources are limited, and programming languages are diverse.

Key words: source code plagiarism detection / Transformer model / pre-trained model / machine learning / deep learning

Classification: Computer Science and Automation
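To make the embedding-and-similarity pipeline described in the abstract concrete, here is a minimal sketch. It is not the paper's implementation: as a hypothetical simplification, Transformer embeddings are replaced by bag-of-token count vectors, and the decision threshold of 0.8 is an assumed value, not one reported by the authors.

```python
# Sketch of similarity-based plagiarism detection.
# Hypothetical: bag-of-token vectors stand in for Transformer embeddings;
# the 0.8 threshold is an assumed value for illustration only.
import math
import re
from collections import Counter


def tokenize(code: str) -> list[str]:
    # Split source code into identifiers, numbers, and single operator chars.
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)


def embed(code: str) -> Counter:
    # Stand-in "embedding": a token-frequency vector.
    return Counter(tokenize(code))


def cosine_similarity(a: Counter, b: Counter) -> float:
    # Cosine of the angle between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_plagiarism(code1: str, code2: str, threshold: float = 0.8) -> bool:
    # Flag a pair as suspected plagiarism when similarity exceeds the threshold.
    return cosine_similarity(embed(code1), embed(code2)) >= threshold
```

In the paper's actual setting, `embed` would instead query a pre-trained Transformer model, which is what makes the approach language-agnostic and training-free.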
Citation: 钱亮宏, 王福德, 孙晓海. Research on Source Code Plagiarism Detection Based on Pre-Trained Transformer Language Model [J]. Journal of Jilin University (Information Science Edition), 2024, 42(4): 747-753.

Funding: Industrialization Cultivation Fund of the Education Department of Jilin Province (JJKH20240274CY)