工程科学与技术 (Advanced Engineering Sciences), 2017, Vol. 49, Issue 2: 100-106 (7 pages). DOI: 10.15961/j.jsuese.201601032
Research on Chinese-English Cross-Language Topic Discovery Based on the ICE-LDA Model
Analysis and Research on Cross Language Topic Discovery in Chinese and English
Abstract
With the rapid development of the Internet under globalization, mining cross-language text from network data has become one of the most active research areas in public opinion analysis. Detecting hot topics effectively and promptly in both Chinese and English texts plays a crucial role in tracking how public opinion develops. Internet news, an important part of online public opinion, has become a significant source of information for netizens. First, Internet news in Chinese and English was collected. Second, the ICE-LDA model, based on the LDA model, was proposed to detect co-occurrence topics in the mixed dataset. The JS distance and cosine similarity of the topic-text distributions were then used to calculate the distance between two topics in the ICE-LDA model. Third, a comparable corpus and a non-comparable corpus were each constructed from the mixed Chinese and English news data. During model building, the TF-IDF algorithm was used to remove noise words from the text. Finally, two kinds of topic vectors were used to detect the co-occurrence topics. The experimental results showed that the proposed improved topic model can detect topics not only in the comparable corpus dataset but also in the non-comparable corpus dataset.
Keywords
topic discovery / topic model / cross-language (Chinese-English) text / ICE-LDA model / TF-IDF feature word extraction / co-occurrence topic
Classification
Information Technology and Security Science
Cite this article
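CHEN Xingshu, LUO Liang, WANG Haizhou, WANG Wenxian, GAO Yue. Analysis and Research on Cross Language Topic Discovery in Chinese and English [J]. 工程科学与技术 (Advanced Engineering Sciences), 2017, 49(2): 100-106, 7.
Funding
National Key Technology R&D Program of China (2012BAH18B05)
National Natural Science Foundation of China (61272447)
Sichuan University Young Teachers Start-up Fund (2015SCU11079)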
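Illustrative sketch

The abstract states that the JS distance and cosine similarity of topic distributions are used to measure how far apart two topics are. The abstract gives no formulas, so the following is only a minimal sketch of the two standard measures, not the authors' implementation. The function names, the base-2 logarithm, and the example vectors are all assumptions made for illustration; how ICE-LDA combines the two measures is not specified in the abstract and is not modeled here.

```python
import math


def cosine_similarity(p, q):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0 or norm_q == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_p * norm_q)


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Terms with p_i == 0 contribute nothing; the caller must ensure
    q_i > 0 wherever p_i > 0.
    """
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two probability distributions.

    With a base-2 logarithm the value lies in [0, 1].
    """
    m = [(a + b) / 2 for a, b in zip(p, q)]  # mixture distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def js_distance(p, q):
    """Square root of the JS divergence, which is a true metric."""
    return math.sqrt(max(js_divergence(p, q), 0.0))


# Hypothetical topic distributions (made-up values, each sums to 1).
topic_zh = [0.70, 0.20, 0.10]
topic_en = [0.60, 0.25, 0.15]

print("cosine similarity:", cosine_similarity(topic_zh, topic_en))
print("JS divergence:", js_divergence(topic_zh, topic_en))
print("JS distance:", js_distance(topic_zh, topic_en))
```

A small JS value together with a cosine similarity near 1 would suggest that two topics are close; whether "JS distance" in the abstract means the divergence or its square root is not stated, so both forms are shown.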