Multi-view webpage classification dataset construction and evaluation (多视角网页分类数据集构建及性能评估)

Journal of Nanjing University (Natural Sciences) (南京大学学报(自然科学版)), 2024, Vol. 60, Issue 3: 406-415 (10 pages). DOI: 10.13232/j.cnki.jnju.2024.03.005

孙辰星¹, 刘伟¹, 卢彬¹, 梁诗宇¹, 诸云强², 甘小莺¹

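Author information

  • 1. School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240
  • 2. Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101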

Abstract

Webpage classification is an important task in Internet data mining and plays a crucial role in applications such as information retrieval, recommendation systems, and knowledge discovery. However, existing public webpage datasets are scarce, drawn from single sources, and carry insufficient information, which hinders the development of webpage classification techniques. To address these issues, we propose Web-Minds, a publicly available webpage classification dataset that incorporates multi-view features and is built through a three-step "collection-processing-annotation" process. Specifically, relevant webpage data are first collected and integrated from the open Internet. A webpage parsing tool is then employed to extract and clean multi-view information from the collected data, including text, structure, and keywords, among others. We design an annotation strategy that combines a large language model with a "human-in-the-loop" step to assign two types of labels: webpage type and webpage topic. Furthermore, we establish an algorithmic evaluation benchmark on the Web-Minds dataset, covering machine learning, text classification, and webpage classification methods. The results demonstrate that, compared with using single-view features alone, the comprehensive use of multi-view features significantly improves algorithm accuracy, with increases of 5.49% and 5.61% on the webpage type and webpage topic classification tasks, respectively.

Keywords

webpage dataset/webpage classification/text classification/data mining/deep learning

Classification

Information technology and security science

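Cite this article

孙辰星, 刘伟, 卢彬, 梁诗宇, 诸云强, 甘小莺. Multi-view webpage classification dataset construction and evaluation [J]. Journal of Nanjing University (Natural Sciences), 2024, 60(3): 406-415.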

Funding

National Key Research and Development Program of China (2022YFB3904204); National Natural Science Foundation of China (62272301, 42050105, 62020106005, 62061146002, 61960206002)

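Journal of Nanjing University (Natural Sciences)

Open access; indexed in the Peking University Core Journals list (北大核心) and CSTPCD

ISSN 0469-5097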
