

Semi-supervised website topic classification based on heterogeneous graph neural network


The rapid growth in the number of Internet websites makes it difficult for existing methods to classify website topics accurately. URL-based methods cannot handle topic information that is not reflected in the URL, while content-based methods are limited by data sparsity and by the difficulty of capturing semantic relationships. To address these problems, HGNN-SWT, a semi-supervised website topic classification method based on a heterogeneous graph neural network, is proposed. The method uses website text features to compensate for the shortcomings of relying on URL features alone, and models the sparse relationships between website texts and words as a heterogeneous graph, improving classification performance by exploiting the node and edge relationships in the graph. It also introduces a neighbor node sampling method based on random walks that takes into account both the local features of nodes and the global graph structure, and proposes a feature fusion strategy that captures contextual relationships and feature interactions in website text data. Experiments on a self-built Chinaz Website dataset show that HGNN-SWT achieves higher accuracy than existing methods on the website topic classification task.
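The abstract does not give the details of the graph construction or the sampling procedure, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a two-type graph in which a website node is linked to every word in its text, and it assumes a random walk with restart whose visit counts rank candidate neighbors; the function names, the restart probability, and the per-type `top_k` cutoff are all hypothetical choices made for illustration.

```python
import random
from collections import Counter, defaultdict


def build_graph(site_texts):
    """Build an undirected website-word graph from {site_id: text}.

    Nodes are (type, name) tuples, with type "site" or "word" (assumed schema).
    """
    graph = defaultdict(set)
    for site, text in site_texts.items():
        for word in set(text.lower().split()):
            graph[("site", site)].add(("word", word))
            graph[("word", word)].add(("site", site))
    return graph


def sample_neighbors(graph, start, num_walks=50, walk_length=6,
                     restart_prob=0.3, top_k=3, seed=0):
    """Rank nodes by how often random walks with restart from `start` visit them.

    Returns {node_type: [nodes]} with at most `top_k` nodes per type,
    most frequently visited first. All parameters are illustrative.
    """
    rng = random.Random(seed)
    visits = Counter()
    for _ in range(num_walks):
        current = start
        for _ in range(walk_length):
            if rng.random() < restart_prob:
                current = start  # jump back to the start node
                continue
            neighbors = sorted(graph.get(current, ()))
            if not neighbors:
                break  # isolated node: nothing to walk to
            current = rng.choice(neighbors)
            if current != start:
                visits[current] += 1
    sampled = defaultdict(list)
    for node, _count in visits.most_common():
        if len(sampled[node[0]]) < top_k:
            sampled[node[0]].append(node)
    return dict(sampled)


if __name__ == "__main__":
    # Toy data, invented for this example only.
    sites = {
        "a": "cheap flights hotel",
        "b": "hotel booking deals",
        "c": "football scores",
    }
    graph = build_graph(sites)
    print(sample_neighbors(graph, ("site", "a"), top_k=2))
```

Frequently visited nodes close to the start reflect local structure, while longer walks can reach websites that share no word with the start node directly, which is one way to read the abstract's remark about combining local features with global graph structure.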


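School of Computer Science and Engineering, North Minzu University, Yinchuan, Ningxia 750000; School of Electrical and Information Engineering, North Minzu University, Yinchuan, Ningxia 750000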



website topic; heterogeneous graph neural network; semi-supervised; feature fusion

Computer Engineering & Science, 2024 (004)

635-646 / 12


