
Large Language Model-based Idempotent Summarization Method for Educational Text

OA | 北大核心 (Peking University Core Journal) | CSTPCD

Abstract

Large Language Models (LLMs) are developing vigorously in the field of Natural Language Processing (NLP), yet their application to educational digitalization still faces significant challenges. To address the scarcity of domain-specific data in education and the instability of summary length, which causes information loss or redundancy, this study proposes a lightweight idempotent model framework for educational text summarization, the Idempotent Generative Language Model (IGLM). The model first employs multi-source training for adaptive augmentation to increase data diversity, and then applies several fine-tuning procedures to the downstream text summarization task. To reduce the influence of text length, an idempotent summary generation strategy is designed that pulls the first-pass summary toward its idempotent form, constraining the model and mitigating the bias caused by unevenly distributed corpora; combined with quantization techniques, this allows more precise and fluent summaries to be generated under low-resource conditions. Experiments use Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores as the evaluation metric and validate the framework on the publicly available Chinese text summarization datasets Large-scale Chinese Short Text Summarization (LCSTS), EDUCATION, and Natural Language Processing and Chinese Computing (NLPCC). The results show clear improvements in the accuracy and fluency of the generated summaries: relative to the baseline model, ROUGE-1/2/L scores improve by 7.9, 7.4, and 8.7 percentage points on LCSTS, by 12.9, 15.4, and 15.7 percentage points on EDUCATION, and by 12.2, 11.7, and 12.7 percentage points on NLPCC, confirming the framework's effectiveness.
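The core technique named in the abstract is the idempotency constraint: a good summary should behave as a fixed point of the summarizer, so summarizing a summary should change little. The paper's implementation is not reproduced here; the following Python sketch is only one illustrative reading of that constraint, and every name in it (idempotent_loss, toy_summarize, toy_encode) as well as the embedding-similarity formulation is an assumption, not the authors' code.

    # Illustrative sketch (not the authors' code) of an idempotency
    # constraint: penalize the distance between summarize(x) and
    # summarize(summarize(x)) so summaries behave as fixed points.
    import torch
    import torch.nn.functional as F

    def idempotent_loss(summarize, encode, text, reference, alpha=0.5):
        """summarize: str -> str; encode: str -> 1-D torch.Tensor;
        alpha weighs the idempotency term against the supervised term."""
        s1 = summarize(text)   # first-pass summary
        s2 = summarize(s1)     # second pass: summary of the summary
        # Supervised term: pull the first-pass summary toward the reference.
        supervised = 1.0 - F.cosine_similarity(encode(s1), encode(reference), dim=0)
        # Idempotency term: keep the second pass close to the first pass,
        # discouraging length drift and redundant content.
        idem = 1.0 - F.cosine_similarity(encode(s1), encode(s2), dim=0)
        return supervised + alpha * idem

    # Toy stand-ins so the sketch runs end to end; a real system would use
    # an LLM for summarize() and a sentence encoder for encode().
    def toy_summarize(text):
        return " ".join(text.split()[:8])          # naive head truncation

    def toy_encode(text):
        vec = torch.zeros(256)
        for tok in text.split():
            vec[hash(tok) % 256] += 1.0            # hashed bag of words
        return vec

    loss = idempotent_loss(
        toy_summarize, toy_encode,
        text="large language models develop quickly but educational "
             "corpora are scarce and generated summaries drift in length",
        reference="educational summaries drift under scarce corpora",
    )
    print(float(loss))

Note that toy_summarize (head truncation) is exactly idempotent, so its penalty term is zero; a trainable LLM summarizer would instead be pushed toward this fixed-point behaviour, which is how the abstract says length drift and redundancy are suppressed.

The abstract also mentions quantization for low-resource settings and ROUGE-1/2/L as the metric, without naming a backend or checkpoint. The snippet below is a hedged sketch assuming 4-bit loading through HuggingFace Transformers with bitsandbytes and scoring with the rouge_score package; the checkpoint name is a placeholder, and rouge_score splits on word boundaries, so Chinese text would need character- or word-level segmentation first.

    # Hedged sketch of a low-resource setup: 4-bit quantized loading
    # (requires bitsandbytes, accelerate, and a GPU) plus ROUGE-1/2/L scoring.
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from rouge_score import rouge_scorer

    model_name = "Qwen/Qwen1.5-1.8B"               # placeholder, not from the paper
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # ~4x less memory
        device_map="auto",
    )

    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
    scores = scorer.score(target="reference summary text",
                          prediction="generated summary text")
    print({name: round(s.fmeasure, 3) for name, s in scores.items()})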

杨兴睿;马斌;李森垚;钟忺

School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Wuhan 430070, Hubei, China; Informatization Office, Wuhan University of Technology, Wuhan 430070, Hubei, China

Computer and Automation

Keywords: educational digitalization; text summarization; Large Language Model (LLM); low-resource scenarios; idempotent; augmentation

《计算机工程》 (Computer Engineering), 2024(007)

Pages: 32-41 (10 pages)

Supported by the National Natural Science Foundation of China (62271361).

DOI: 10.19678/j.issn.1000-3428.0068625
