电力非结构化大文本特征提取研究

王家凯 黄佩卓 李勇乐 盛爽 刘洋 郑玲 魏振华

浙江电力 (Zhejiang Electric Power), 2024, Vol. 43, Issue 6: 117-124 (8 pages). DOI: 10.19585/j.zjdl.202406013

Research on feature extraction of unstructured large power texts

王家凯¹, 黄佩卓¹, 李勇乐¹, 盛爽¹, 刘洋¹, 郑玲², 魏振华²

Author information

  • 1. Big Data Center of State Grid Corporation of China, Beijing 100052
  • 2. North China Electric Power University, Beijing 100026

Abstract

Large power texts contain numerous abbreviations of technical terms, alternative names, and irregular expressions. Existing word segmentation tools often fail to identify specialized vocabulary in the electrical engineering field, which significantly hinders the analysis and utilization of unstructured texts. To address this challenge, this paper proposes a set of indexing rules tailored to the characteristics of unstructured texts in electrical engineering. Segmentation based on these rules can significantly improve segmentation accuracy, laying a solid foundation for feature extraction from power texts. Furthermore, an effective long-text segmentation algorithm is used to preserve the semantic information of the original text, and the text-level features extracted by the BERT model are jointly embedded with the word-level features extracted by Word2Vec. This combined approach enables the extraction of precise features from large unstructured power texts. Finally, experimental results demonstrate the effectiveness of the proposed method for extracting features from large unstructured power texts.
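
The pipeline described in the abstract (split a long text, encode each piece, then combine text-level and word-level features) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the segmentation algorithm or the fusion operator, so the overlapping character windows, the mean pooling, and the fusion by concatenation below are all assumptions, and the names `split_long_text`, `mean_pool`, and `joint_embedding` are hypothetical. The two encoders are passed in as plain callables; in practice they would be stand-ins for a BERT sentence encoder and a Word2Vec lookup.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def split_long_text(text: str, max_len: int = 510, overlap: int = 50) -> List[str]:
    """Split text into overlapping character windows of at most max_len.

    Assumption: 510 leaves room for the two special tokens within BERT's
    usual 512-position limit; the paper's own algorithm may differ.
    """
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("require max_len > 0 and 0 <= overlap < max_len")
    step = max_len - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_len])
        if start + max_len >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Element-wise average of equally sized vectors."""
    if not vectors:
        raise ValueError("cannot pool an empty list of vectors")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def joint_embedding(
    text: str,
    words: Sequence[str],
    text_encoder: Callable[[str], Vector],  # stand-in for a BERT encoder
    word_encoder: Callable[[str], Vector],  # stand-in for a Word2Vec lookup
    max_len: int = 510,
    overlap: int = 50,
) -> Vector:
    """Concatenate pooled chunk features with pooled word features.

    Assumption: fusion by concatenation, chosen only for illustration.
    """
    chunk_vecs = [text_encoder(c) for c in split_long_text(text, max_len, overlap)]
    word_vecs = [word_encoder(w) for w in words]
    return mean_pool(chunk_vecs) + mean_pool(word_vecs)
```

Here `words` would be the output of the rule-based segmentation step, which is outside the scope of this sketch. The overlap between windows is one simple way to avoid cutting a sentence's context at a chunk boundary.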

Key words

large power text/feature extraction/BERT/text segmentation/joint embedding

Cite this article

王家凯, 黄佩卓, 李勇乐, 盛爽, 刘洋, 郑玲, 魏振华. 电力非结构化大文本特征提取研究[J]. 浙江电力, 2024, 43(6): 117-124.

Funding

National Natural Science Foundation of China (62373150)

Science and Technology Special Project of the Big Data Center of State Grid Corporation of China (SGSJ0000YYJS2310054)

浙江电力 (Zhejiang Electric Power)

OA, CSTPCD

ISSN 1007-1881
