计算机应用与软件 2024, Vol. 41, Issue (7): 145-149, 5. DOI: 10.3969/j.issn.1000-386x.2024.07.022
基于全局自适应宽度注意力改进的Transformer
IMPROVED TRANSFORMER BASED ON GLOBAL ADAPTIVE WIDTH ATTENTION
Abstract
The Transformer is widely used in natural language processing, but long texts cause the input to be truncated and GPU memory consumption to grow excessively. An existing solution lets the model dynamically determine the attention width of each layer, so that it can attend over the optimal sequence length while keeping computation and memory overhead under control. However, this approach has the drawback that the per-layer optimal attention widths do not reach the optimal attention width of the model as a whole. For this reason, we propose global adaptive width attention (GAA): the attention range of each layer is coupled to a global range, so that the model reaches the globally optimal attention range, and the model's feed-forward layer is replaced with a gated-linear-unit feed-forward layer (FFNGLU). Experiments on the enwik8 and text8 datasets show that the method outperforms the baseline while using only 25% of the training computation cost.
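To make the two components named in the abstract concrete, the following is a minimal, illustrative PyTorch sketch rather than the authors' implementation. It assumes GAA follows a soft-masking formulation in the style of adaptive attention span, with each layer's span derived from a single learnable global parameter plus a small per-layer offset, and that FFNGLU is the standard gated-linear-unit feed-forward variant. All names (GlobalAdaptiveSpan, FFNGLU, max_span, ramp_len) are assumptions, and distances are measured from the most recent position as a simplification.

```python
# Illustrative sketch only; the exact GAA formulation in the paper may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalAdaptiveSpan(nn.Module):
    """Soft mask over key distances whose width is tied to a shared global span."""

    def __init__(self, global_span: nn.Parameter, max_span: int = 512, ramp_len: int = 32):
        super().__init__()
        self.global_span = global_span                      # one parameter shared by all layers
        self.layer_offset = nn.Parameter(torch.zeros(1))    # small per-layer adjustment
        self.max_span = max_span
        self.ramp_len = ramp_len

    def forward(self, attn_scores: torch.Tensor) -> torch.Tensor:
        # attn_scores: (batch, heads, query_len, key_len), keys ordered oldest -> newest
        key_len = attn_scores.size(-1)
        span = (self.global_span + self.layer_offset).clamp(0, 1) * self.max_span
        # distance of each key from the most recent position (simplified, not per-query)
        dist = torch.arange(key_len - 1, -1, -1,
                            device=attn_scores.device, dtype=attn_scores.dtype)
        # soft ramp: 1 inside the span, decaying linearly to 0 over `ramp_len` tokens
        mask = ((span + self.ramp_len - dist) / self.ramp_len).clamp(0, 1)
        probs = F.softmax(attn_scores, dim=-1) * mask
        return probs / probs.sum(dim=-1, keepdim=True).clamp_min(1e-8)


class FFNGLU(nn.Module):
    """Feed-forward layer with a gated linear unit replacing the usual ReLU MLP."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # FFN_GLU(x) = (sigmoid(x W_gate) * x W_up) W_down
        return self.w_down(torch.sigmoid(self.w_gate(x)) * self.w_up(x))
```

Because the span parameter is shared, shrinking or growing the global attention range affects every layer jointly, which is the behaviour the abstract attributes to GAA.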
Keywords: Transformer / Global adaptive width attention / FFNGLU
Classification: Information Technology and Security Science
Citation: 曾庆威, 张建, 张鸿昌, 谭雨阳, 沈文枫. 基于全局自适应宽度注意力改进的Transformer[J]. 计算机应用与软件, 2024, 41(7): 145-149, 5.
Funding
Shanghai Engineering Research Center of Intelligent Computing System project (19DZ2252600)
National Key Research and Development Program of China (2017YFB0701600)
Science and Technology Commission of Shanghai Municipality project (19511121002)