Word Embedding Methods in Natural Language Processing: A Review

曾骏 王子威 于扬 文俊浩 高旻

Journal of Frontiers of Computer Science and Technology (计算机科学与探索), 2024, Vol. 18, Issue 1: 24-43, 20. DOI: 10.3778/j.issn.1673-9418.2303056

Word Embedding Methods in Natural Language Processing: A Review

曾骏 ¹, 王子威 ², 于扬 ², 文俊浩 ¹, 高旻 ¹

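Author information

  • 1. School of Big Data and Software Engineering, Chongqing University, Chongqing 401331; Key Laboratory of Dependable Service Computing in Cyber Physical Society (Chongqing University), Ministry of Education, Chongqing 400044
  • 2. School of Big Data and Software Engineering, Chongqing University, Chongqing 401331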

Abstract

Word embedding, the first step in natural language processing (NLP) tasks, transforms input natural language text into numerical vectors, known as word vectors or distributed representations, that artificial intelligence models can process. Word vectors are the foundation of NLP and a prerequisite for its downstream tasks. However, most existing reviews of word embedding methods focus on the technical routes of individual methods and neglect a comprehensive analysis of tokenization methods and of the complete evolution of word embedding. This paper takes the introduction of the word2vec model and of the Transformer model as pivotal points. Based on whether the generated word vectors can dynamically change their implicit semantic information to adapt to the overall semantics of the input sentence, it divides word embedding methods into static and dynamic approaches and discusses this classification extensively. It also compares and analyzes the tokenization methods used in word embedding, including whole-word and subword segmentation. The paper further traces in detail the evolution of the language models used to train word vectors, from probabilistic language models to neural probabilistic language models and the current deep contextual language models, and it summarizes and explores the training strategies employed in pre-training language models. Finally, it summarizes methods for evaluating word vector quality, analyzes the current state of word embedding methods, and offers an outlook on their development.

Key words

word vector, word embedding, natural language processing, language model, tokenization, word vector evaluation

Classification

Information Technology and Security Science

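Cite this article

曾骏, 王子威, 于扬, 文俊浩, 高旻. Word Embedding Methods in Natural Language Processing: A Review [J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(1): 24-43, 20.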

Funding

This work was supported by the National Key Research and Development Program of China (2019YFB1706104), the General Project of the Natural Science Foundation of Chongqing (cstc2020jcyj-msxmX0900), the Entrepreneurship and Innovation Support Program for Returned Overseas Students (cx2021125), and the Fundamental Research Funds for the Central Universities of China (2020CDJ-LHZZ-040).

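Journal of Frontiers of Computer Science and Technology (计算机科学与探索)

Open access; indexed in the Peking University Core Journals list (北大核心) and CSTPCD

ISSN 1673-9418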
