Word Embedding Methods in Natural Language Processing: A Review

Abstract

Word embedding, the first step in natural language processing (NLP) tasks, transforms input natural language text into numerical vectors that a model can process. These vectors are known as word vectors, or distributed representations of words. As the foundation of NLP, word vectors are a prerequisite for all downstream NLP tasks. However, most existing reviews of word embedding, in China and abroad, focus only on the technical routes of individual methods; they do not analyze the tokenization methods that precede word embedding or the complete evolution of word embedding methods. Taking the word2vec model and the Transformer model as dividing points, this paper classifies word embedding methods as static or dynamic, according to whether the generated word vectors can dynamically change their implicit semantic information to fit the overall semantics of the input sentence, and discusses both classes. It also compares and analyzes the tokenization methods used in word embedding, including whole-word and subword segmentation. It enumerates and explains in detail the evolution of the language models used to train word vectors, from probabilistic language models to neural probabilistic language models and on to today's deep contextual language models. It summarizes and discusses the training strategies used in pre-training language models. Finally, the paper summarizes methods for evaluating word vector quality, analyzes the current state of word embedding methods, and offers an outlook on their future development.

Authors: Zeng Jun (曾骏); Wang Ziwei (王子威); Yu Yang (于扬); Wen Junhao (文俊浩); Gao Min (高旻)

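Affiliations: School of Big Data and Software Engineering, Chongqing University, Chongqing 401331; Key Laboratory of Dependable Service Computing in Cyber Physical Society (Chongqing University), Ministry of Education, Chongqing 400044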

Classification: Computer Science and Automation

Keywords: word vector; word embedding; natural language processing; language model; tokenization; word vector evaluation

Journal: Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》), 2024, Issue 001

Pages: 24-43 (20 pages)

Funding: This work was supported by the National Key Research and Development Program of China (2019YFB1706104), the General Project of Natural Science Foundation of Chongqing (cstc2020jcyj-msxmX0900), the Entrepreneurship and Innovation Support Program for Returned Overseas Students (cx2021125), and the Fundamental Research Funds for the Central Universities of China (2020CDJ-LHZZ-040).

DOI: 10.3778/j.issn.1673-9418.2303056
