Advanced Engineering Sciences, 2025, Vol. 57, Issue 6: 119-130, 12. DOI: 10.12454/j.jsuese.202400082
Text Restoration of Folk Literature Based on Knowledge Distillation
Abstract
Objective Folk literature is an important carrier of the social life and cultural perspectives of ordinary people. Owing to natural, historical, or human factors, the characters in folk literature texts are often ambiguous, difficult to identify, or missing altogether, so effective research and dissemination require restoring these incomplete texts. Folk literary text data differ significantly from the corpora used during the pre-training phase of pre-trained language models, for example in their specialized vocabulary and structural features. These differences lead to catastrophic forgetting when a pre-trained language model is fine-tuned directly: the model must make extensive parameter adjustments and can forget previously learned general language knowledge. The two main challenges are therefore to avoid catastrophic forgetting in the restoration task and to ensure that the restored sentences align with the linguistic characteristics of folk literature. A knowledge-distillation-based method for folk literature text restoration is proposed to address these issues.

Methods Considering the limited annotated data, the specialized vocabulary, and the structured nature of folk literary texts, this study adopted a pre-trained language model as the teacher in an extended knowledge distillation scheme to train a student network that automatically restores incomplete folk literary sentences. First, the pre-trained language model and the student network were used to extract basic character-level feature vectors from folk literary texts. These basic feature vectors were then assembled into semantic feature matrices, which underwent intermediate feature knowledge distillation: the SmoothL1 loss was computed between the semantic feature matrices of corresponding layers in the pre-trained language model and the student network, minimizing the distribution difference between the output features of the student and teacher networks. The teacher network's character-level general knowledge thereby enhanced the student network's comprehension of the overall semantic meaning of sentences. Then, the structural relationships among the basic feature vectors in the semantic feature matrix were treated as the structural knowledge of folk literary sentences: a structural feature matrix was constructed and subjected to structural feature knowledge distillation, reinforcing the constraints of structural knowledge during the student network's parameter updates and enhancing the structural regularity of the restored sentences.
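To make the two distillation losses concrete, the following PyTorch sketch is offered as an illustration only, not the authors' implementation: the tensor shapes, the one-to-one layer pairing between teacher and student, and the use of a cosine-similarity (Gram-style) matrix as the structural feature matrix are assumptions introduced here.

```python
# Minimal sketch of the two distillation losses (illustrative only; shapes,
# layer pairing, and the Gram-style structural matrix are assumptions).
import torch
import torch.nn.functional as F

def intermediate_feature_loss(student_feats, teacher_feats):
    """Intermediate feature knowledge distillation.

    student_feats / teacher_feats: lists of per-layer semantic feature
    matrices, each of shape [batch, seq_len, hidden]. The SmoothL1 loss
    pulls the student's layer-wise features toward the teacher's.
    """
    return sum(F.smooth_l1_loss(s, t)
               for s, t in zip(student_feats, teacher_feats))

def structural_feature_loss(student_feats, teacher_feats):
    """Structural feature knowledge distillation.

    Treats the pairwise relations among a sentence's character vectors as
    structural knowledge: builds a [batch, seq_len, seq_len] cosine-similarity
    matrix per layer and matches student to teacher with SmoothL1.
    """
    loss = torch.tensor(0.0)
    for s, t in zip(student_feats, teacher_feats):
        s_rel = torch.bmm(F.normalize(s, dim=-1),
                          F.normalize(s, dim=-1).transpose(1, 2))
        t_rel = torch.bmm(F.normalize(t, dim=-1),
                          F.normalize(t, dim=-1).transpose(1, 2))
        loss = loss + F.smooth_l1_loss(s_rel, t_rel)
    return loss

# Combined objective: the task loss on the masked characters plus both
# distillation terms (the weights alpha and beta are hypothetical):
# total = task_loss + alpha * intermediate_feature_loss(sf, tf) \
#         + beta * structural_feature_loss(sf, tf)
```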
Results and Discussions Datasets were constructed for three typical genres of folk literature, and experimental studies were conducted on them. In the comparative experiments, BERT applied to the constructed datasets showed improvements in average bilingual evaluation understudy (BLEU) scores of 0.12%, 0.80%, and 0.29%, and reductions in perplexity (PPL) of 146.07, 168.80, and 72.52, respectively. GPT showed improvements in average BLEU of 1.00%, 1.28%, and 0.66%, and reductions in PPL of 233.25, 303.39, and 144.96, respectively. BART showed improvements in average BLEU of 6.19%, 6.41%, and 11.67%, and reductions in PPL of 38.75%, 7.48%, and 14.82%, demonstrating the effectiveness of the proposed method. In the ablation experiments, the average BLEU of the w/S model was 0.3% higher on average than that of the w/F model, indicating that structural feature knowledge distillation improves the accuracy of folk literary sentences more than intermediate feature knowledge distillation does; its PPL was 22 higher on average, indicating that intermediate feature knowledge distillation improves fluency more. The ablation results also showed that combining the two distillation methods further improved average BLEU and reduced PPL compared with both the w/S and w/F models. In the mask rate experiments, the combined knowledge distillation method improved average BLEU and reduced PPL relative to traditional fine-tuning, demonstrating its robustness. In addition, the improvements in average BLEU and PPL were typically largest at mask ratios of 15% or 20%, indicating that the combined method captures crucial linguistic information from incomplete sentences most effectively when the number of missing characters is moderate, providing the student network with rich and accurate semantic and structural knowledge. A case study illustrated the output of the combined method, showing that it generates coherent, complete, and well-formatted sentences.

Conclusions Considering the specialized vocabulary and structural features of folk literature, the catastrophic forgetting faced by existing controllable text generation methods, and their insufficient generalization on data from this vertical domain, a combined knowledge distillation method is proposed. The method constructs semantic and structural feature matrices and performs knowledge distillation on both. Experimental results demonstrate that it effectively prevents catastrophic forgetting in pre-trained language models and generates sentences with more accurate semantics, more complete content, and better alignment with the formatting requirements of folk literature texts.
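The mask rate experiments above vary the fraction of obscured characters in a sentence before restoration. A minimal sketch of such character-level masking follows; the mask token, character granularity, and uniform sampling scheme are assumptions for illustration. PPL is then the exponential of the mean cross-entropy the language model assigns to a sentence.

```python
import random

def mask_characters(sentence: str, ratio: float = 0.15,
                    mask_token: str = "[MASK]") -> str:
    """Replace a given ratio of characters with a mask token, simulating
    missing characters in a folk-literature sentence (illustrative only)."""
    chars = list(sentence)
    if not chars:
        return sentence
    n_mask = max(1, round(len(chars) * ratio))
    for i in random.sample(range(len(chars)), n_mask):
        chars[i] = mask_token
    return "".join(chars)
```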
Key words folk literature / text restoration / knowledge distillation / catastrophic forgetting / structural knowledge
Classification Computer and Automation
Citation Cao Xiongneng, Wang Jiahui, Yue Kun, Duan Liang, Zhang Duo. Text restoration of folk literature based on knowledge distillation[J]. Advanced Engineering Sciences, 2025, 57(6): 119-130, 12.
Funding
Yunnan Provincial Basic Research Program General Project (202201AT070394)
Yunnan Key Laboratory of Intelligent Systems and Computing Project (202405AV340009)
National Social Science Fund of China (20CZW059)
Yunnan University Graduate Research Innovation Project (KC‒22221717)