Vietnamese Text Error Detection Method for Multi-source Text (面向多源文本的越南语文本检错方法)

Computer Engineering (计算机工程), 2025, Vol. 51, Issue 5: 93-102, 10. DOI: 10.19678/j.issn.1000-3428.0069492

Vietnamese Text Error Detection Method for Multi-source Text

Zhuang Ziwei (庄紫薇)¹, Zhu Junguo (朱俊国)¹

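Author information

1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, Yunnan, China; Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, Yunnan, China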

Abstract

Text error detection is an active research topic in natural language processing. The task is to automatically identify the location and type of erroneous words in input text; it has broad applications across downstream text processing and directly affects daily life. While error detection models for English and Chinese have achieved high accuracy, models for Vietnamese are held back by scarce corpus resources and manually labeled data, and the resulting low quality of training samples hampers their performance. Error detection for multi-source text adds further complexity: error types vary across sources and are unevenly distributed, so generalized text error detection models struggle to learn how to detect specific error types and perform suboptimally. To address these challenges, this study proposes a method for constructing a Vietnamese text error detection corpus for multi-source text. The approach draws on datasets from Vietnamese Optical Character Recognition (OCR), Vietnamese speech recognition, and Vietnamese-English translation to create an initial corpus. An error corpus is then constructed with the multi-source Vietnamese error detection corpus generation method, and an automatic labeling algorithm for the error detection corpus generates labeled training data. Furthermore, a Vietnamese text error detection sequence labeling model that incorporates multi-source information features is introduced. By integrating scene features into the multilingual Bidirectional Encoder Representations from Transformers (BERT) encoding layer, the model adapts to specific error types based on the context of the input text. Experimental results show that the proposed method improves the F0.5 and F1 scores by 1.91 and 1.80 percentage points, respectively, over the baseline model. These results validate the necessity of each component of the model as well as the effectiveness of the dataset construction approach.
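For readers unfamiliar with the two reported metrics, the following is a minimal sketch of the standard F-beta definition, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where P is precision and R is recall. The function name `f_beta` and the precision and recall values are hypothetical and chosen only for illustration; they are not figures from the paper.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Standard F-beta score: a weighted harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing is detected correctly
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical detector scores (illustrative values, not results from the paper).
p, r = 0.80, 0.50
f1 = f_beta(p, r, beta=1.0)   # precision and recall weighted equally
f05 = f_beta(p, r, beta=0.5)  # precision weighted more heavily than recall
print(f"F1 = {f1:.4f}, F0.5 = {f05:.4f}")  # F1 = 0.6154, F0.5 = 0.7143
```

With beta = 0.5, precision counts more than recall, so a detector that rarely raises false alarms scores higher on F0.5 than on F1, as in the example above.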

Keywords

natural language processing; machine learning; deep learning; text error detection; Vietnamese

Classification

Information Technology and Security Science

Cite this article

Zhuang Ziwei, Zhu Junguo. Vietnamese Text Error Detection Method for Multi-source Text [J]. Computer Engineering, 2025, 51(5): 93-102, 10.

Funding

General Program of the Yunnan Provincial Department of Science and Technology (202101AT070077).

Computer Engineering (计算机工程)

Open access; Peking University Core Journal

ISSN: 1000-3428
