计算机工程与应用 (Computer Engineering and Applications), Issue (21): 111-115, 5. DOI: 10.3778/j.issn.1002-8331.1212-0115
Twitter中重复消息的分析和处理
Twitter repeat messages analysis and processing
Abstract
Twitter has become a representative microblogging application. Analysis of Twitter data shows that many messages (tweets) are identical or similar. These duplicates complicate both analysis and storage, so they need to be removed. Based on the short-text characteristics of tweets, this paper proposes the following approach: identical tweets are first handled according to their specific format, and similar tweets are then handled with Simhash. The experiment uses 240 million tweets crawled from the Internet and processes only Chinese and English tweets. Repeated messages (tweets) account for 10 percent of all the Chinese and English tweets.
Keywords
Twitter / microblog / Simhash / short text duplicate removal
Classification
Information Technology and Security Science
Cite this article
徐凯, 沙瀛, 李阳, 单既喜, 王晓岩. Twitter中重复消息的分析和处理[J]. 计算机工程与应用, 2014, (21): 111-115, 5.
Funding
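National Natural Science Foundation of China (No. 61070184); Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDA06030200); National Key Technology R&D Program (No. 2012BAH46B03).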
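Illustrative sketch
The two-stage approach described in the abstract (exact duplicates first, then near duplicates via Simhash) might look roughly like the Python sketch below. It is not the authors' implementation. The abstract does not say what the "specific format" handling consists of, so the normalization rules here (dropping a leading retweet marker, URLs and @mentions) are assumptions, as are the character trigram features, the MD5-based 64-bit fingerprint, the Hamming threshold of 3, and every function name.

```python
import hashlib
import re

BITS = 64  # fingerprint width (assumed; not stated in the abstract)


def normalize(tweet):
    """Assumed format cleanup: drop a leading 'RT @user:', URLs and @mentions."""
    text = re.sub(r"^\s*RT\s+@\w+:?\s*", "", tweet)
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"@\w+", "", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def features(text, n=3):
    """Character n-grams, usable for both Chinese and English short text."""
    compact = text.replace(" ", "")
    if len(compact) <= n:
        return [compact] if compact else []
    return [compact[i:i + n] for i in range(len(compact) - n + 1)]


def simhash(text):
    """Each feature votes +1/-1 per bit; the sign of the sum sets the bit."""
    weights = [0] * BITS
    for feat in features(text):
        digest = hashlib.md5(feat.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def deduplicate(tweets, max_distance=3):
    """Keep the first tweet of every exact or near-duplicate group."""
    seen_exact = set()
    kept = []  # (fingerprint, original tweet)
    for tweet in tweets:
        norm = normalize(tweet)
        if norm in seen_exact:  # stage 1: identical after normalization
            continue
        seen_exact.add(norm)
        fp = simhash(norm)
        # stage 2: near duplicate of a tweet that was already kept
        if any(hamming(fp, other) <= max_distance for other, _ in kept):
            continue
        kept.append((fp, tweet))
    return [tweet for _, tweet in kept]
```

For example, `deduplicate(["Hello world", "RT @alice: Hello world http://t.co/abc"])` keeps only the first tweet, because both normalize to the same string. The linear scan over kept fingerprints is quadratic and only suitable for small samples; a corpus of the size mentioned in the abstract would need an index over the fingerprints.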