Advantage Analysis and Methodological Approaches for Constructing High Quality Datasets in Publishing Industry

Wang Jun (王钧), Wang Biao (王飚), Li Suhang (李苏航)

科技与出版 (Science-Technology & Publishing), Issue (6): 64-72, 9.

Advantage Analysis and Methodological Approaches for Constructing High Quality Datasets in Publishing Industry

Wang Jun (王钧)¹, Wang Biao (王飚)², Li Suhang (李苏航)³

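Author information

  • 1. Institute of Cultural and Creative Industry, Shanghai Jiao Tong University, Shanghai 200240
  • 2. Chinese Academy of Press and Publication, Beijing 100073
  • 3. University College London, London WC1E 6BT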

Abstract

Countries worldwide are actively developing and leveraging data resources. These resources exhibit economic characteristics, including externalities, non-rivalry, and non-excludability, alongside sociological attributes such as shareability, spatiotemporal relevance, and public accessibility. Data quality is a critical determinant of model performance in generative artificial intelligence (GenAI) systems, and the lack of high-quality training datasets remains a significant challenge across sectors. Whereas previous research on data elements has focused on implementation, this study examines the underlying rationale and methodologies.

The paper establishes the connotation and extension of high-quality datasets, identifying four quality dimensions within a three-dimensional, six-tier analytical framework whose dimensions are (1) structural, (2) spatiotemporal, and (3) security. The core requirements for constructing high-quality datasets are categorized into four dimensions: (1) the data unit level; (2) the dataset level; (3) the social benefit perspective; (4) the economic benefit perspective. This framework integrates technical specifications with governance principles, addressing both operational efficiency and societal value creation.

The analysis then examines industry-specific characteristics and resource endowments to show why the publishing sector bears a unique social responsibility in constructing high-quality datasets. Publishing data has four inherent advantages: (1) quantity: rich diversity of types and abundant reserves; (2) quality: rigorous supply mechanisms and strict review processes; (3) externality: traceable ownership and privacy clearance; (4) standardization: technical support and cross-referencing capabilities. At the data unit level, publishing data undergoes comprehensive peer review and expert verification, ensuring greater accuracy and reliability than alternative data sources, and its comprehensive industry coverage gives it substantial completeness and richness. At the dataset level, professional editorial teams carry out secondary knowledge production during data aggregation. They integrate technology with publishing workflows in processes such as packaging, delivery, error correction, and iterative updates, establishing sustainable version control mechanisms. Regarding benefits, publishing data is inherently desensitized and aligned with mainstream ideological values, which helps balance data protection against public accessibility. Moreover, the publishing industry's established ownership tracing and benefit distribution mechanisms provide a foundation for business evolution, facilitating trust networks and incentive-compatible business models between data providers and users.

From a meso-theoretical perspective, this study takes a best-practice approach, using mature image databases in the digital copyright trading industry as case studies. It analyzes principles and methodologies for constructing high-quality datasets, proposes operational and training recommendations, and aligns theory with practice. The marginal contributions of this paper are threefold: first, clarifying the scope and definition of high-quality datasets; second, analyzing the publishing industry's characteristics and advantages to identify key stakeholders; and third, recommending standards, operational principles, and construction methods for high-quality datasets.

Keywords

high-quality dataset / publishing / construction / generative artificial intelligence (GenAI)

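Cite this article

Wang Jun, Wang Biao, Li Suhang. Advantage Analysis and Methodological Approaches for Constructing High Quality Datasets in Publishing Industry [J]. 科技与出版 (Science-Technology & Publishing), 2025, (6): 64-72, 9.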

科技与出版 (Science-Technology & Publishing)

Open access (OA); Peking University Core Journal

ISSN 1005-0590
