Computer Engineering, 2025, Vol. 51, Issue (5): 93-102, 10. DOI: 10.19678/j.issn.1000-3428.0069492
Vietnamese Text Error Detection Method for Multi-source Text
Abstract
Text error detection is an active research topic in natural language processing that aims to automatically identify the location and type of erroneous words in input text. The task has broad applications across downstream text processing and directly affects daily life. While text error detection models for English and Chinese have achieved high accuracy, models for Vietnamese are held back by a scarcity of corpus resources and manually labeled data. The low quality of training samples hampers the performance of error detection models for Vietnamese. Error detection for multi-source text adds further complexity: error types vary across sources and are unevenly distributed. Consequently, general-purpose text error detection models struggle to learn how to detect source-specific error types, which leads to suboptimal performance. To address these challenges, this study proposes a method for constructing a Vietnamese text error detection corpus for multi-source text. The approach draws on datasets from Vietnamese Optical Character Recognition (OCR), Vietnamese speech recognition, and Vietnamese-English translation to create an initial corpus. An error corpus is then constructed with the multi-source Vietnamese error detection corpus generation method, and an automatic labeling algorithm for error detection corpora is used to generate labeled training data. Furthermore, a Vietnamese text error detection sequence labeling model that incorporates multi-source information features is introduced. By integrating scene features into the multilingual Bidirectional Encoder Representations from Transformers (BERT) encoding layer, the model adapts to specific error types based on the context of the input text. Experimental results show that the proposed method improves the F0.5 and F1 values by 1.91 and 1.80 percentage points, respectively, over the baseline model. These results validate the necessity of each component of the model and the effectiveness of the dataset construction approach.
Keywords
natural language processing/machine learning/deep learning/text error detection/Vietnamese
Classification
Computer and Automation
Cite this article
庄紫薇, 朱俊国. Vietnamese Text Error Detection Method for Multi-source Text[J]. Computer Engineering, 2025, 51(5): 93-102, 10.
Funding
General Program of the Yunnan Provincial Department of Science and Technology (202101AT070077).
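Note on the reported metrics
The abstract reports gains in F0.5 and F1 without stating how the two differ. Both are instances of the standard F-beta score, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where P is precision and R is recall; beta = 0.5 weights precision more heavily than recall, while beta = 1 weights them equally. The sketch below is a minimal illustration of that formula only. The function name `f_beta` and the counts are hypothetical and are not taken from the paper, and the paper's exact evaluation granularity (for example, token level or span level) is not given in the abstract.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """Return the F-beta score from raw detection counts.

    tp: erroneous words correctly flagged
    fp: correct words wrongly flagged as errors
    fn: erroneous words that were missed
    """
    if tp == 0:
        # With no true positives, precision and recall are both zero
        # (or undefined), so the score is taken to be 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts, chosen only to illustrate the formula.
tp, fp, fn = 80, 20, 40
print(f"F1   = {f_beta(tp, fp, fn, beta=1.0):.4f}")
print(f"F0.5 = {f_beta(tp, fp, fn, beta=0.5):.4f}")
```

With these illustrative counts precision (0.80) exceeds recall (about 0.67), so F0.5 comes out higher than F1, which is the usual reason error detection work reports F0.5: a false alarm on correct text is often treated as more costly than a missed error.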