Building and profiling Chinese-English 3-tuple comparable corpora

HU Xiaopeng¹, YUAN Qi¹, GENG Xinhui¹, ZHU Shu¹

Computer Engineering and Applications, Issue (13): 153-157, 186, 6. DOI: 10.3778/j.issn.1002-8331.1310-0106

Author information

1. China Center for Information Industry Development, Beijing 100044, China

Abstract

Chinese-English parallel corpora carry an inherently skewed language model because of translationese. Natural language processing systems trained on such corpora, including machine translation and cross-language information retrieval systems, inherit this skew, which seriously degrades their performance. To address this inherent defect of parallel corpora, this paper studies the building and profiling of Chinese-English 3-tuple comparable corpora, which consist of native English, Chinglish, and standard Chinese. The study adopts comparable-corpus and automatic language profiling techniques and applies a combined statistical and rule-based method to analyze the native English and Chinglish components. On this basis, automatic extraction techniques such as n-grams and key clusters are used to mine bilingual resources grounded in native-language text, with the aim of improving natural language processing applications such as machine translation.
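
The n-gram extraction step mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name a scoring statistic, so the log-likelihood keyness score used here is an assumption, as are the whitespace tokenizer, the function names, and the toy sentences.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of contiguous n-token tuples in `tokens`."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(sentences, n):
    """Count n-grams over lower-cased, whitespace-tokenized sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.lower().split(), n))
    return counts


def log_likelihood(a, b, total_a, total_b):
    """Log-likelihood keyness for an item seen a times in corpus A, b times in corpus B."""
    expected_a = total_a * (a + b) / (total_a + total_b)
    expected_b = total_b * (a + b) / (total_a + total_b)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / expected_a)
    if b > 0:
        ll += b * math.log(b / expected_b)
    return 2.0 * ll


def key_ngrams(study, reference, n=2, top=5):
    """Rank n-grams that are relatively more frequent in `study` than in `reference`."""
    study_counts = count_ngrams(study, n)
    ref_counts = count_ngrams(reference, n)
    total_s = sum(study_counts.values())
    total_r = sum(ref_counts.values())
    if total_s == 0 or total_r == 0:
        return []
    scored = []
    for gram, a in study_counts.items():
        b = ref_counts.get(gram, 0)
        # Keep only n-grams overused in the study corpus.
        if a / total_s > b / total_r:
            scored.append((log_likelihood(a, b, total_s, total_r), gram))
    scored.sort(reverse=True)
    return [(" ".join(gram), round(score, 2)) for score, gram in scored[:top]]


# Toy, made-up sentences standing in for the Chinglish and native English components.
chinglish = ["we must pay attention to the study", "pay attention to safety"]
native = ["we need to focus on the study", "keep safety in mind"]
print(key_ngrams(chinglish, native, n=2))
```

Swapping the two arguments ranks the n-grams that are characteristic of native English instead, which is the direction of interest when mining native-language resources.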

Keywords

3-tuple comparable corpora / language transfer / automatic language profiling / n-grams

Classification

Information Technology and Security Science

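Cite this article

HU Xiaopeng, YUAN Qi, GENG Xinhui, ZHU Shu. Building and profiling Chinese-English 3-tuple comparable corpora[J]. Computer Engineering and Applications, 2014, (13): 153-157, 186, 6.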

Funding

National Natural Science Foundation of China (No. 61172101, No. 61172102).

Computer Engineering and Applications

OA / PKU Core / CSCD / CSTPCD

ISSN: 1002-8331
