Journal of South-Central Minzu University (Natural Science Edition), 2026, Vol. 45, Issue (4): 548-558, 11. DOI: 10.20056/j.cnki.ZNMDZK.20260708
Application research of Vision Transformer with enhanced FPN in document layout analysis tasks
Abstract
Compared with traditional methods based on Convolutional Neural Networks (CNN), document layout analysis models based on Vision Transformer can provide robust semantic and visual representations for downstream tasks through multi-modal pre-training. However, current multi-scale feature generation modules and the cross-resolution feature fusion process tend to lose category attributes and boundary details, which leads to category confusion and blurred boundaries. To address this bottleneck, Local Feature Enhancement Generation (LFEG) and Global-to-Local Feature Enhancement Fusion (GLEF) techniques are proposed and combined into an enhanced Feature Pyramid Network (FPN) structure. Specifically, the LFEG module optimizes the four resolution modification modules that generate the multi-scale features, while the GLEF module optimizes the traditional top-down fusion approach. Experimental results demonstrate that the proposed enhanced FPN structure effectively improves the category consistency and boundary clarity of multi-scale feature maps, providing key technical support for improving the accuracy of document layout analysis based on Vision Transformer.
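For orientation, the baseline that the abstract sets out to improve can be sketched in a few lines. This is a toy illustration of plain multi-scale generation followed by traditional top-down fusion, not the LFEG or GLEF modules themselves, whose internals the abstract does not describe. It assumes the four resolution branches rescale a single ViT feature map by 4x, 2x, 1x and 0.5x (a common choice for plain ViT backbones, not confirmed here), and it uses single-channel nested lists in place of learned convolutions; all function names are hypothetical.

```python
# Toy sketch of a baseline ViT feature pyramid with top-down fusion.
# Assumption: scales 4x, 2x, 1x, 0.5x; real models use learned
# (de)convolutions and lateral 1x1 convolutions on multi-channel tensors.

def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D list."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def maxpool2x(fmap):
    """2x2 max pooling with stride 2 (assumes even height and width)."""
    return [
        [max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
         for j in range(0, len(fmap[0]), 2)]
        for i in range(0, len(fmap), 2)
    ]

def add(a, b):
    """Element-wise sum of two equally sized 2D lists."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def build_pyramid(vit_map):
    """Four resolution branches, finest first: 4x, 2x, 1x, 0.5x."""
    return [
        upsample2x(upsample2x(vit_map)),
        upsample2x(vit_map),
        vit_map,
        maxpool2x(vit_map),
    ]

def top_down_fuse(pyramid):
    """fused[i] = pyramid[i] + upsample(fused[i + 1]), coarsest to finest."""
    fused = [None] * len(pyramid)
    fused[-1] = pyramid[-1]
    for i in range(len(pyramid) - 2, -1, -1):
        fused[i] = add(pyramid[i], upsample2x(fused[i + 1]))
    return fused

vit_map = [[1, 2], [3, 4]]          # stand-in for a ViT output feature map
pyramid = build_pyramid(vit_map)
fused = top_down_fuse(pyramid)
shapes = [(len(f), len(f[0])) for f in fused]
```

In this sketch every finer level receives only an upsampled copy of the coarser level, so nearest-neighbour upsampling spreads each coarse value over a block of fine pixels. That blockiness is one intuitive source of the boundary blurring the abstract attributes to traditional top-down fusion.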
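Key words
Vision Transformer/FPN/feature fusion/document layout analysis
Classification
Information Technology and Security Science
Cite this article
ZHANG Fa, LI Yanhong, WU Longyu, LONG Han. Application research of Vision Transformer with enhanced FPN in document layout analysis tasks [J]. Journal of South-Central Minzu University (Natural Science Edition), 2026, 45(4): 548-558, 11.
Funding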
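Natural Science Foundation of Hubei Province (2017CFB135)
Fundamental Research Funds for the Central Universities (CZY23019)
Research Project on Practical Teaching of Courses for Network Innovation and Application-Oriented Talents (first batch of 2019)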