
Research on Crawler Data Sampling Method for Price Index Compilation

Open access; indexed in PKU Core (北大核心), CHSSCD, CSSCI, CSTPCD


To address the high cost of compiling price indices from full crawler data, this paper proposes a sampling method built on a "big data, small data" idea. In the base period, a web crawler captures the complete commodity transaction data of an e-commerce platform, and this full dataset forms the sampling frame. The continuous survey that follows relies on sampling: following the idea of stratified sampling, a clustering algorithm and its silhouette coefficient are used to stratify the population, and unequal-probability random sampling draws a representative sample from each stratum. Because selected samples may not respond during the continuous survey, the paper introduces formal and alternative samples: for each formal sample, nearest-neighbor matching selects several alternative samples, and when a formal sample does not respond, an alternative sample substitutes for it so that the price index can still be compiled. The method is validated on the grain and oil category of Tmall. The base-period full crawl contains 18,351 records; over continuous-survey periods 2 to 8, the average sampling ratio is 10.18% and the average relative sampling error is 0.59%, which indicates that the method is feasible.
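The abstract names the steps of the method but gives no formulas or parameters, so the following is only a minimal sketch of how they could fit together, not the authors' implementation. Everything specific in it is an assumption made for illustration: the toy sampling frame, one-dimensional k-means on base price as the clustering algorithm, sales volume as the size measure for unequal-probability sampling, successive draws without replacement as the sampling scheme, a 20% within-stratum sampling ratio, two alternative samples per formal sample, and all function names.

```python
import random

def kmeans_1d(values, k, iters=50):
    """Plain Lloyd's k-means on scalars; returns one cluster label per value."""
    srt = sorted(values)
    # Deterministic start: spread the initial centers over the sorted values.
    centers = [srt[(2 * j + 1) * len(srt) // (2 * k)] for j in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        for j in range(k):
            members = [v for v, l in zip(values, labels) if l == j]
            if members:
                centers[j] = sum(members) / len(members)
    return labels

def silhouette(values, labels):
    """Mean silhouette coefficient with absolute difference as the distance."""
    clusters = {}
    for v, l in zip(values, labels):
        clusters.setdefault(l, []).append(v)
    if len(clusters) < 2:
        return -1.0
    scores = []
    for v, l in zip(values, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(abs(v - u) for u in own) / (len(own) - 1)
        b = min(sum(abs(v - u) for u in other) / len(other)
                for m, other in clusters.items() if m != l)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

def pps_sample(pool, size, n, rng):
    """Draw n distinct ids; each draw is proportional to size among those left."""
    pool, chosen = list(pool), []
    for _ in range(min(n, len(pool))):
        pick = rng.choices(pool, weights=[size[i] for i in pool], k=1)[0]
        chosen.append(pick)
        pool.remove(pick)
    return chosen

def nearest_alternatives(target, candidates, feature, m):
    """Return the m candidates closest to the target on the matching feature."""
    return sorted(candidates, key=lambda c: abs(feature[c] - feature[target]))[:m]

# Toy sampling frame (hypothetical): 30 items in three base-price bands.
values = [band + 0.2 * j for band in (10, 50, 90) for j in range(10)]
ids = list(range(len(values)))
base_price = dict(zip(ids, values))
sales = {i: 10 + (i * 7) % 13 for i in ids}

# Choose the number of strata with the highest mean silhouette coefficient.
best_k = max(range(2, 6), key=lambda k: silhouette(values, kmeans_1d(values, k)))
strata = {}
for i, l in zip(ids, kmeans_1d(values, best_k)):
    strata.setdefault(l, []).append(i)

rng = random.Random(42)
plan = {}  # formal sample id -> ordered list of alternative sample ids
for members in strata.values():
    formal = pps_sample(members, sales, max(1, round(0.2 * len(members))), rng)
    rest = [i for i in members if i not in formal]
    for f in formal:
        plan[f] = nearest_alternatives(f, rest, base_price, m=2)

def resolve(formal_id, observed):
    """Return the formal sample if it is still observed, else its first observed alternative."""
    for cand in [formal_id] + plan[formal_id]:
        if cand in observed:
            return cand
    return None
```

In a later survey period, `resolve(f, observed)` would be called for every formal sample `f` with the set of item ids whose prices could still be crawled, and the price index would then be computed from the returned items. Restricting alternatives to the same stratum keeps a substitute comparable to the item it replaces, which is the purpose the abstract gives for the matching step.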

Lei Bing (雷兵); Liang Kaikai (梁凯凯); Liu Wei (刘维)

School of Management, Henan University of Technology, Zhengzhou 450000, China

Statistics

price index; crawler data; stratified sampling; clustering algorithm; sample matching

Statistics & Decision (《统计与决策》), 2024 (012)

pp. 24-28 (5 pages)

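General Program of the National Social Science Fund of China (18BGL268); Innovation Team Support Program for Philosophy and Social Sciences in Henan Province Universities (2019-CXTD-04)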

10.13546/j.cnki.tjyjc.2024.12.004
