Computer Engineering and Applications (计算机工程与应用), Issue (13): 153-157, 186, 6. DOI: 10.3778/j.issn.1002-8331.1310-0106
Building and profiling Chinese-English 3-tuple comparable corpora (构建和剖析中英三元组可比语料库)
Abstract
Chinese-English parallel corpora carry an inherently skewed language model because of the influence of translationese. Natural language processing systems trained on these corpora, including machine translation and cross-language information retrieval, inherit the skewed language model, which seriously degrades application performance. To address this inherent defect of parallel corpora, this paper proposes building and profiling Chinese-English 3-tuple comparable corpora. The study adopts comparable-corpus and automatic language profiling technologies and applies a combined statistical and rule-based method to analyze native English and Chinglish in 3-tuple comparable corpora consisting of native English, Chinglish and standard Chinese. On this basis, automatic extraction techniques such as n-grams and key clusters are used to mine native-language-based bilingual resources, with the aim of improving natural language processing applications such as machine translation.
Keywords
3-tuple comparable corpora / language transfer / automatic language profiling / n-grams
Classification
Information technology and security science
Cite this article
胡小鹏, 袁琦, 耿鑫辉, 朱姝. Building and profiling Chinese-English 3-tuple comparable corpora [J]. Computer Engineering and Applications, 2014, (13): 153-157, 186, 6.
Funding
National Natural Science Foundation of China (No. 61172101, No. 61172102).