China Scientific Data (Chinese and English online edition), 2024, Vol. 9, Issue 1: 356-365, 10. DOI: 10.11922/11-6035.csd.2022.0030.zh
MedicalQA: A dataset of medical domain for machine reading comprehension
马宁 1, 吕文蓉 1, 郭泽晨 1
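Author information
- 1. Northwest Minzu University, Key Laboratory of China's Ethnic Languages and Information Technology, Lanzhou 730030 || Northwest Minzu University, Key Laboratory of Ethnic Language Intelligent Processing of Gansu Province, Lanzhou 730030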
Abstract
Machine reading comprehension aims to enable computers to understand the semantics of a paragraph and to answer users' questions algorithmically. The quality of the dataset used in this task can directly affect a model's experimental results. To enrich the medical-domain datasets available for machine reading comprehension, this paper constructs MedicalQA, a medical-domain dataset for machine reading comprehension, using a combination of web crawling and manual annotation. The dataset takes two medical platforms, Xunyiwenyao Network and 39 Health Network, as its main data sources and includes 19,502 paragraphs and Q&A pairs covering 9 medical departments, such as internal medicine, surgery, and obstetrics and gynecology. The dataset is formatted as an Excel file organized into five columns: the first column gives the paragraph ID; the second, the department to which the paragraph belongs; the third, the paragraph content; the fourth, the question; and the fifth, the corresponding answer. The construction of this dataset is conducive to building machine reading comprehension models in the medical domain and can also promote the sharing of medical datasets in the field of machine reading comprehension.
Keywords
machine reading comprehension / medical domain / dataset
Cite this article
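马宁, 吕文蓉, 郭泽晨. MedicalQA: A dataset of medical domain for machine reading comprehension [J]. China Scientific Data (Chinese and English online edition), 2024, 9(1): 356-365, 10.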
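The five-column layout described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' tooling: the class name `MedicalQARecord`, the function `parse_rows`, and all field names are hypothetical, the sample row is made up, and the sketch assumes that rows have already been read from the Excel file as plain tuples (for example with a spreadsheet library) and that any header row has been skipped by the caller.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Sequence


@dataclass(frozen=True)
class MedicalQARecord:
    """One row of the five-column layout; field names are hypothetical."""
    paragraph_id: str   # column 1: paragraph ID
    department: str     # column 2: medical department of the paragraph
    paragraph: str      # column 3: paragraph content
    question: str       # column 4: question
    answer: str         # column 5: answer to the question


def parse_rows(rows: Iterable[Sequence[object]]) -> Iterator[MedicalQARecord]:
    """Convert raw five-cell rows into records, skipping fully empty rows."""
    for index, row in enumerate(rows, start=1):
        # Normalise every cell to a stripped string; empty cells become "".
        cells = ["" if cell is None else str(cell).strip() for cell in row]
        if not any(cells):
            continue  # blank spreadsheet row
        if len(cells) != 5:
            raise ValueError(f"row {index}: expected 5 columns, got {len(cells)}")
        yield MedicalQARecord(*cells)


# Made-up example rows, not taken from the dataset.
sample_rows = [
    (1, "internal medicine", "Example paragraph text.", "Example question?", "Example answer."),
    (None, None, None, None, None),
]
records = list(parse_rows(sample_rows))
```

Keeping each row as a typed record makes it straightforward to filter by department or to pair each paragraph with its question and answer before feeding a reading comprehension model.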