Computer Engineering, Issue (5): 12-16,20,6. DOI: 10.3969/j.issn.1000-3428.2014.05.003
Universal Crawling Algorithm for Microblogging Data
Abstract
Web crawlers and the microblog API, the two common ways of collecting microblog data, struggle to meet the data demands of public opinion systems. To address this problem, this paper presents a feasible solution that logs in to the microblog site the way a browser does and captures data from the resulting Web pages, which makes it easy to obtain all data of any microblog user. On this basis, it constructs a microblogging network from the interconnections among users and discovers new users through it. To obtain high-quality data, it builds a mathematical model that calculates each user's influence index from the number of posts, posting frequency, number of followers, number of forwards, and number of comments. It then builds a priority queue ordered by the calculated influence index, so that users with a larger index are crawled more frequently. Finally, it calculates a time interval to balance the lower acquisition frequency of inactive microblog users. The experimental results show that the method is simple to run and fast, obtains high-quality information, and is highly versatile.
Key words
microblogging data / simulated login / user network / user influence / Internet public opinion / priority queue
Classification
Information Technology and Security Science
Cite this article
LU Tiguang, LIU Xin, LIU Renren. Universal Crawling Algorithm for Microblogging Data [J]. Computer Engineering, 2014, (5): 12-16,20,6.
Funding
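Natural Science Foundation of Hunan Province (12JJ3066); Hunan Provincial University Science and Technology Achievement Industrialization Cultivation Fund (11CY018); Hunan Provincial Key Discipline Fund.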
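Illustrative sketch

The abstract describes scheduling crawls by user influence but does not give the model itself. The following Python sketch shows one way such a scheduler might look. Everything specific in it is an assumption made for illustration: the log-scaled weighted sum, the weights in `WEIGHTS`, the interval formula and its bounds, and the names `UserStats`, `influence_index`, `crawl_interval`, and `CrawlScheduler`. None of it is taken from the paper.

```python
import heapq
import math
from dataclasses import dataclass


@dataclass
class UserStats:
    user_id: str
    posts: int            # total number of posts
    posts_per_day: float  # posting frequency
    followers: int        # "fans number" in the abstract
    reposts: int          # times the user's posts were forwarded
    comments: int         # comments received


# Hypothetical weights: the abstract names the five features but not how
# the paper combines them.
WEIGHTS = {
    "posts": 0.1,
    "posts_per_day": 0.3,
    "followers": 0.3,
    "reposts": 0.2,
    "comments": 0.1,
}


def influence_index(user: UserStats) -> float:
    """Weighted sum of log-scaled features (an illustrative stand-in)."""
    return sum(w * math.log1p(getattr(user, name)) for name, w in WEIGHTS.items())


def crawl_interval(influence: float,
                   min_interval: float = 60.0,
                   max_interval: float = 86400.0) -> float:
    """Seconds between visits: shorter for influential users.

    The upper bound keeps inactive users from being starved; the lower
    bound keeps very influential users from being crawled too often.
    """
    return max(min_interval, max_interval / (1.0 + influence))


class CrawlScheduler:
    """Priority queue keyed by next due time (assumed design)."""

    def __init__(self) -> None:
        self._heap: list[tuple[float, str]] = []
        self._users: dict[str, UserStats] = {}

    def add(self, user: UserStats, now: float = 0.0) -> None:
        self._users[user.user_id] = user
        due = now + crawl_interval(influence_index(user))
        heapq.heappush(self._heap, (due, user.user_id))

    def pop_next(self) -> tuple[float, str]:
        """Return the next (due_time, user_id) and reschedule that user."""
        due, uid = heapq.heappop(self._heap)
        interval = crawl_interval(influence_index(self._users[uid]))
        heapq.heappush(self._heap, (due + interval, uid))
        return due, uid
```

Keying the heap by next due time rather than by raw influence is a design choice of this sketch: influential users still come up more often, while every user, however inactive, is eventually revisited within `max_interval`.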