Research on a Text Classification Method for Court Electronic Case Files [OA; PKU Core (北大核心); CSTPCD]

TEXT CLASSIFICATION METHOD FOR COURT ELECTRONIC FILE

Abstract (Chinese)

Abstract (English)

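This paper addresses the main problems in classifying the text of court electronic case files and proposes a solution to each. A multi-dimensional semantic representation of case-file documents is proposed to obtain more accurate and comprehensive text features. A Gaussian-kernel KELM (Kernel Extreme Learning Machine) is used to learn the text classifier, which obtains the globally optimal solution while greatly improving training efficiency. An RLS (Recursive Least Squares)-based sequential optimization model, KOS-ELM, iteratively updates the model parameters with new samples, giving the classifier an online self-learning ability and reducing its dependence on the initial samples. Comparative experiments show that the Gaussian-kernel KELM classifier improves accuracy by 2.66 and 4.43 percentage points over the BP network model and LSSVM respectively, while its training time is only 1/6 and 1/10 of theirs. Feeding the model with the multi-dimensional semantic representation improves accuracy by 8.84 and 2.33 percentage points over the text-vector and word-vector representations respectively. When the RLS-based sequential optimization model KOS-ELM is used to iteratively optimize a weak classifier, classification accuracy improves significantly after 20 iterations at each of 4 different step sizes.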

This paper provides solutions to the main problems in the text classification of court electronic files. We propose a multi-dimensional semantic representation method for court case files to obtain more accurate and comprehensive text feature information. A Gaussian kernel-based kernel extreme learning machine (KELM) is used to learn the text classifier, which yields the globally optimal solution while greatly improving training efficiency. The sequence optimization model KOS-ELM, based on recursive least squares (RLS), iteratively updates the model parameters with new samples, which enables the classification model to learn online and reduces its dependence on the initial samples. Comparative experiments show that the accuracy of the Gaussian kernel-based KELM classification model is 2.66 and 4.43 percentage points higher than that of the BP network model and LSSVM respectively, while its training time is only 1/6 and 1/10 of theirs. Using the multi-dimensional semantic representation method to provide input for the model raises accuracy by 8.84 and 2.33 percentage points over the text vector and word vector representation methods respectively. When the RLS-based sequence optimization model KOS-ELM is used to iteratively optimize a weak classifier, the classification accuracy improves significantly after 20 iterations with each of 4 different step sizes.
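The abstract names two mechanisms without giving their equations: a Gaussian-kernel KELM trained in closed form, and an RLS-style update that absorbs new samples. The sketch below illustrates both in their generic textbook form, under stated assumptions. It is not the paper's KOS-ELM: the class name `GaussianKELM`, the defaults `C` and `gamma`, the one-sample-at-a-time update, and the synthetic two-cluster data are all hypothetical choices made for illustration, and no court documents or semantic features are involved.

```python
import numpy as np


def gaussian_kernel(A, B, gamma):
    """Gaussian (RBF) kernel matrix with entries exp(-gamma * ||a - b||^2)."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


class GaussianKELM:
    """Generic kernel ELM with an optional sample-by-sample update (illustrative)."""

    def __init__(self, C=10.0, gamma=0.5):
        self.C = C          # regularization strength (hypothetical default)
        self.gamma = gamma  # kernel width parameter (hypothetical default)

    def fit(self, X, y, n_classes):
        self.X = np.asarray(X, dtype=float)
        self.T = np.eye(n_classes)[np.asarray(y)]           # one-hot targets
        K = gaussian_kernel(self.X, self.X, self.gamma)
        # Closed-form solution beta = (K + I/C)^-1 T; no iterative training.
        self.A = np.linalg.inv(K + np.eye(len(self.X)) / self.C)
        self.beta = self.A @ self.T
        return self

    def partial_fit(self, x_new, y_new):
        """Absorb one labeled sample by growing the stored inverse."""
        x_new = np.asarray(x_new, dtype=float).reshape(1, -1)
        k = gaussian_kernel(self.X, x_new, self.gamma)      # shape (n, 1)
        k_nn = 1.0 + 1.0 / self.C                           # K(x, x) = 1 for this kernel
        a = self.A @ k
        r = k_nn - float((k * a).sum())                     # Schur complement
        # Block-matrix inversion: reuse the old inverse instead of refactorizing.
        self.A = np.block([[self.A + a @ a.T / r, -a / r],
                           [-a.T / r, np.array([[1.0 / r]])]])
        self.X = np.vstack([self.X, x_new])
        self.T = np.vstack([self.T, np.eye(self.T.shape[1])[[y_new]]])
        self.beta = self.A @ self.T
        return self

    def predict(self, X):
        K = gaussian_kernel(np.asarray(X, dtype=float), self.X, self.gamma)
        return (K @ self.beta).argmax(axis=1)


# Synthetic two-class data standing in for document feature vectors.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
init = np.r_[0:15, 20:35]    # samples available at the start
rest = np.r_[15:20, 35:40]   # samples that arrive later

online = GaussianKELM().fit(X[init], y[init], n_classes=2)
for x_i, y_i in zip(X[rest], y[rest]):
    online.partial_fit(x_i, y_i)

# Retraining from scratch on all data, for comparison with the online model.
batch = GaussianKELM().fit(np.vstack([X[init], X[rest]]),
                           np.concatenate([y[init], y[rest]]), n_classes=2)
```

In this sketch the incrementally updated model and the model retrained on all samples should agree up to floating-point error, which is the property that lets an online variant reduce dependence on the initial training set. The stored inverse grows with every added sample, so a practical system would need some bound on its size; how the paper handles this is not stated in the abstract.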

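Authors: Wang Xiao (王霄); Wan Yuqing (万玉晴)

Affiliation: Taiji Computer Corporation Limited (太极计算机股份有限公司), Beijing 100102

Category: Computing and Automation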

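Keywords (Chinese): court electronic case files; text classification; semantic representation; kernel extreme learning machine; recursive least squares

Keywords (English): Court electronic file; Text classification; Semantic representation; Kernel extreme learning machine; Recursive least squares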

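Journal: Computer Applications and Software (《计算机应用与软件》), 2024 (006)

Pages / page count: 101-107, 133 / 8

Funding: National Key Research and Development Program of China (2018YFC0807700).

DOI: 10.3969/j.issn.1000-386x.2024.06.015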
