Journal of East China University of Science and Technology (Natural Science Edition), 2024, Vol. 50, Issue 1: 114-120, 7. DOI: 10.14135/j.cnki.1006-3080.20221122001
An Automatic Text Summarization Algorithm Based on Clause Extraction
Abstract
As the volume of information grows exponentially, automatic summarization gives readers a way to obtain useful content in a short time. A key issue is how to extract the key information from redundant, unstructured long text while keeping the extracted result concise and fluent. The TextRank algorithm and improved variants such as SWTextRank are widely used to generate extractive summaries, but they do not effectively solve the redundancy problem in those summaries. This paper therefore proposes PTextRank, an automatic text summarization algorithm based on clause extraction. First, the text is preprocessed and split into sentences, each sentence is annotated using Sinica Treebank (STB), and extraction units are defined at the clause level. Next, BERT is used to construct feature vectors for the title and for each clause, and the similarity between clause feature vectors is computed and stored in a similarity matrix. Finally, the clause similarity matrix is adjusted according to clause position and the similarity between each clause and the title, the scores are computed iteratively until convergence, and the highest-scoring clauses are selected as the final summary. Experiments and analysis show that PTextRank effectively avoids redundant information across multiple sentences. Compared with traditional TextRank and the improved SWTextRank, PTextRank improves summary accuracy by at least 6% and produces higher-quality summaries. Because PTextRank uses clauses as its extraction units, it works at a finer granularity than sentence-level extraction, which is what lets it avoid that redundancy.
Keywords
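TextRank / abstract extraction / redundancy processing / Sinica Treebank / textual structure
Classification
Information Technology and Security Science
Cite this article
ZHU Bingbing, LUO Fei, LUO Yongjun, DING Weichao, HUANG Hao. An Automatic Text Summarization Algorithm Based on Clause Extraction[J]. Journal of East China University of Science and Technology (Natural Science Edition), 2024, 50(1): 114-120, 7.
Funding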
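Natural Science Foundation of Shanghai (22ZR1416500)
Shanghai Sailing Program for Young Science and Technology Talents (20YF1410900)
Shanghai 2021 "Science and Technology Innovation Action Plan", Yangtze River Delta Science and Technology Innovation Community Project (21002411000)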
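Illustrative sketch
The ranking stage described in the abstract (clause similarity matrix, adjustment by clause position and title similarity, iteration until convergence, selection of the top clauses) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: clause and title vectors are assumed to be precomputed (for example by BERT), the function names `cosine`, `rank_clauses` and `top_clauses` are hypothetical, and the position and title weighting formulas are illustrative choices, since the abstract does not give the actual ones.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_clauses(clause_vecs, title_vec, damping=0.85, tol=1e-6, max_iter=200):
    """Score clauses with a TextRank-style iteration over a weighted similarity graph.

    ASSUMPTION: the position and title weights below are illustrative,
    not the weighting used in the paper.
    """
    n = len(clause_vecs)
    if n == 0:
        return []

    # Hypothetical per-clause weight: earlier and more title-like clauses count more.
    weights = []
    for i, vec in enumerate(clause_vecs):
        position_w = 1.0 + 1.0 / (i + 1)
        title_w = 1.0 + max(cosine(vec, title_vec), 0.0)
        weights.append(position_w * title_w)

    # Edge i -> j: clause similarity, scaled by the weight of the target clause j.
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                s = max(cosine(clause_vecs[i], clause_vecs[j]), 0.0)
                sim[i][j] = s * weights[j]

    out_sum = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        new_scores = []
        for j in range(n):
            incoming = sum(
                sim[i][j] / out_sum[i] * scores[i]
                for i in range(n)
                if out_sum[i] > 0.0
            )
            new_scores.append((1.0 - damping) / n + damping * incoming)
        converged = max(abs(a - b) for a, b in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            break
    return scores


def top_clauses(scores, k):
    """Indices of the k highest-scoring clauses, returned in document order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Toy 2-D vectors standing in for BERT embeddings (made-up values).
    clauses = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    title = [1.0, 0.0]
    scores = rank_clauses(clauses, title)
    print(top_clauses(scores, 2))
```

Returning the selected indices in document order keeps the extracted clauses readable as a summary. A real pipeline would also need the clause segmentation and embedding steps, which are omitted here.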