
构建和剖析中英三元组可比语料库

胡小鹏 袁琦 耿鑫辉 朱姝

计算机工程与应用 (Computer Engineering and Applications), Issue (13): 153-157, 186, 6. DOI: 10.3778/j.issn.1002-8331.1310-0106

Building and profiling Chinese-English 3-tuple comparable corpora

胡小鹏 1, 袁琦 1, 耿鑫辉 1, 朱姝 1

Author information

  • 1. China Center for Information Industry Development (中国电子信息产业发展研究院), Beijing 100044

Abstract

Chinese-English parallel corpora carry an inherently skewed language model because of translationese. Natural language processing systems trained on these corpora, including machine translation and cross-language information retrieval, inherit the skew, which seriously degrades application performance. To address this inherent defect of parallel corpora, this paper studies how to build and profile Chinese-English 3-tuple comparable corpora, each consisting of native English, Chinglish, and standard Chinese. The study adopts comparable-corpus and automatic language profiling technologies and applies a combined statistical and rule-based method to analyze the native English and Chinglish components of the corpora. On this basis, automatic extraction technologies such as n-grams and key clusters are used to mine native-language-based bilingual resources, with the aim of improving and developing natural language processing applications such as machine translation.
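The abstract gives no implementation details. As a rough illustration of what n-gram extraction and keyness ranking across two components of a comparable corpus can look like, here is a minimal Python sketch. It is built on assumptions rather than on the authors' method. "Key clusters" are taken to mean n-grams that are statistically overrepresented in one corpus relative to another, and they are scored with the log-likelihood statistic that is common in corpus linguistics. The function names and the toy sentences are hypothetical and are not data from the paper.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-token tuples of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(sentences, n):
    """Count n-grams over lowercased, whitespace-tokenized sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.lower().split(), n))
    return counts


def log_likelihood(a, b, total_a, total_b):
    """Log-likelihood keyness for an item seen a times in A and b times in B."""
    expected_a = total_a * (a + b) / (total_a + total_b)
    expected_b = total_b * (a + b) / (total_a + total_b)
    score = 0.0
    if a > 0:
        score += a * math.log(a / expected_a)
    if b > 0:
        score += b * math.log(b / expected_b)
    return 2 * score


def key_ngrams(study, reference, n, top=5):
    """Rank n-grams that are relatively more frequent in `study` than in `reference`."""
    s, r = ngram_counts(study, n), ngram_counts(reference, n)
    total_s, total_r = sum(s.values()), sum(r.values())
    scored = []
    for gram, a in s.items():
        b = r.get(gram, 0)
        # Keep only n-grams overrepresented in the study corpus.
        if a / total_s > b / max(total_r, 1):
            scored.append((log_likelihood(a, b, total_s, total_r), gram))
    scored.sort(reverse=True)
    return [(gram, round(score, 2)) for score, gram in scored[:top]]


# Invented toy sentences, for illustration only (not taken from the paper).
chinglish = [
    "we must improve the level of service",
    "they improve the level of management",
]
native = [
    "we must improve our service",
    "they improve how they manage",
]

for gram, score in key_ngrams(chinglish, native, n=2, top=3):
    print(" ".join(gram), score)
```

With realistic corpora, the whitespace tokenizer would be replaced by a proper one, and a minimum frequency threshold would usually be applied before ranking, since log-likelihood scores on very small counts are unreliable.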

Key words

3-tuple comparable corpora (三元组可比语料库) / language transfer (语言迁移) / automatic language profiling (自动语言剖析) / n-grams (n-元词串)

Classification

Information technology and security science

Cite this article

胡小鹏,袁琦,耿鑫辉,朱姝.构建和剖析中英三元组可比语料库[J].计算机工程与应用,2014,(13):153-157,186,6.

Funding

National Natural Science Foundation of China (No. 61172101, No. 61172102).

Journal: 计算机工程与应用 (Computer Engineering and Applications). ISSN 1002-8331. Listed as: OA, 北大核心 (PKU Core), CSCD, CSTPCD.
