Analysis on n-gram statistics and linguistic features of whole genome protein sequences

DONG Qi-wen, WANG Xiao-long, LIN Lei

Journal of Harbin Institute of Technology (English Edition), 2008, Vol. 15, Issue 5: 694-698 (5 pages).

Author information

  • 1. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China

Abstract

To support statistical analysis of the large number of genomic and proteomic sequences available for different organisms, the n-grams of whole genome protein sequences from 20 organisms were extracted. Their linguistic features were analyzed with two tests, the Zipf power law and Shannon entropy, both developed for the analysis of natural languages and symbolic sequences. Natural genome proteins and artificial genome proteins were compared, and several statistical features of n-grams were identified. The results show that: the n-grams of whole genome protein sequences approximately follow the Zipf law when n is larger than 4; the Shannon n-gram entropy of natural genome proteins is lower than that of artificial proteins; a simple unigram model can distinguish different organisms; and there are organism-specific usages of "phrases" in protein sequences. It is suggested that further detailed analysis of the n-grams of whole genome protein sequences will result in a powerful model for mapping the relationship among protein sequence, structure and function.
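The abstract does not describe the authors' implementation. As a minimal sketch of the two tests it names, the Python below counts overlapping n-grams, computes the Shannon entropy of their empirical distribution, and estimates a Zipf exponent as the least-squares slope of log frequency against log rank. The function names, the toy sequences, and the choice of an ordinary least-squares fit are all assumptions made for illustration.

```python
from collections import Counter
import math


def ngram_counts(sequences, n):
    """Count overlapping n-grams across a collection of protein sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts


def shannon_entropy(counts):
    """Shannon entropy, in bits, of the empirical n-gram distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def zipf_slope(counts):
    """Least-squares slope of log(frequency) against log(rank).

    A slope near -1 is the classic Zipf case. Needs at least two
    distinct n-grams, otherwise the rank variance is zero.
    """
    freqs = sorted(counts.values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cov_xy / var_x


if __name__ == "__main__":
    # Hypothetical toy sequences, not data from the paper.
    toy = ["MKVLAAGIVGLL", "MKTAYIAKQR"]
    bigrams = ngram_counts(toy, 2)
    print(shannon_entropy(bigrams), zipf_slope(bigrams))
```

On real proteomes the n-gram counts would come from all protein sequences of one organism, and the entropy and slope could then be compared across organisms or against shuffled (artificial) sequences, in the spirit of the comparison the abstract describes.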

Key words

n-gram statistics/protein sequence/Zipf law

Classification

Biological sciences

Cite this article

DONG Qi-wen, WANG Xiao-long, LIN Lei. Analysis on n-gram statistics and linguistic features of whole genome protein sequences[J]. Journal of Harbin Institute of Technology (English Edition), 2008, 15(5): 694-698.

Funding

Sponsored by the National Natural Science Foundation of China (Grant No. 60435020).

Journal of Harbin Institute of Technology (English Edition)

ISSN 1005-9113
