A Chinese question-matching method based on gradual machine learning

Chinese Journal of Engineering (工程科学学报), 2025, Vol. 47, Issue 1: 79-90 (12 pages). DOI: 10.13374/j.issn2095-9389.2023.11.05.002

Question-matching approach based on gradual machine learning

He Xuejian¹, Chen Anqi², Guo Zhiqiang¹, Wang Zhiru³, Chen Qun⁴

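Author information

1. Henan Forestry Vocational College, Luoyang 471002
2. School of Software, Northwestern Polytechnical University, Xi'an 710072
3. School of Computer Science, Northwestern Polytechnical University, Xi'an 710072
4. School of Software, Northwestern Polytechnical University, Xi'an 710072; School of Computer Science, Northwestern Polytechnical University, Xi'an 710072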
Abstract

Question matching aims to determine whether two different questions express similar intentions. Recently, large-scale pretrained deep neural network (DNN) language models have achieved state-of-the-art question-matching performance. However, because these models rely on the independent and identically distributed (i.i.d.) assumption, their performance in real-world scenarios is limited by the adequacy of the training data and by distribution drift between the target and training data. In this study, we propose a novel gradual machine learning (GML)-based approach for Chinese question matching. Beginning with initially labeled instances, this approach gradually labels target instances in order of increasing hardness via iterative factor inference on a factor graph. The proposed solution first extracts diverse semantic features from different perspectives and then constructs a factor graph by fusing the extracted features to facilitate gradual learning from easy to hard. In feature modeling, we extract and model two complementary types of features: 1) TF-IDF-based keyword features, which capture the shallow semantic similarity between two questions; and 2) DNN-based deep semantic features, which capture the latent semantic similarity between two questions. We model keyword features as unary factors in the factor graph, which define their influence on the matching status of the two questions. The DNN-based features comprise global and local features: the global features correspond to a question pair's matching probability as estimated by a DNN model, and the local features correspond to the semantic similarity between two neighboring question pairs, estimated from their vector representations in the DNN's embedding space. To facilitate gradual inference, we model the DNN-based global and local features as unary and binary factors, respectively, in the factor graph. Finally, we implement a GML solution for question matching based on an open-source GML inference engine. We validated the efficacy of the proposed approach through a comparative study on two open-source Chinese benchmark datasets, LCQMC and the BQ corpus. Extensive experiments demonstrate that, compared with pure deep learning models, the proposed solution effectively improves the accuracy of question matching, and its performance advantage generally increases as the amount of labeled training data decreases. Our experiments also demonstrate that the performance of the proposed solution is robust with respect to key algorithmic parameters, indicating its applicability in real-world scenarios. In addition, our work on the GML solution is orthogonal to existing deep learning-based question-matching algorithms because our solution can easily accommodate and leverage other deep language models.
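
The easy-to-hard labeling idea described in the abstract can be illustrated with a small Python sketch. This is not the authors' factor-graph inference or the open-source GML engine they mention; it is a simplified greedy stand-in under stated assumptions. Unary evidence is given as a log-odds score per question pair, binary evidence as a similarity to neighboring pairs, and "hardness" is approximated by how close the combined score is to zero. The function name `gradual_label`, its parameters, and the toy data are all hypothetical.

```python
def gradual_label(unary, neighbors, seeds, weight=1.0):
    """Toy easy-to-hard labeling (illustrative only, not real factor inference).

    unary:     {pair_id: log-odds that the question pair matches}
    neighbors: {pair_id: [(other_pair_id, similarity in [0, 1]), ...]}
    seeds:     {pair_id: 0 or 1}, the initially labeled instances
    weight:    assumed strength of the binary (neighbor) evidence
    """
    labels = dict(seeds)
    order = []
    unlabeled = set(unary) - set(labels)
    while unlabeled:
        best_id, best_score = None, None
        for pid in sorted(unlabeled):
            # Start from the unary evidence for this pair.
            score = unary[pid]
            # Add evidence from neighbors that already carry a label.
            for other, sim in neighbors.get(pid, []):
                if other in labels:
                    sign = 1.0 if labels[other] == 1 else -1.0
                    score += weight * sim * sign
            # The easiest instance is the one with the most decisive score.
            if best_score is None or abs(score) > abs(best_score):
                best_id, best_score = pid, score
        labels[best_id] = 1 if best_score > 0 else 0
        order.append(best_id)
        unlabeled.remove(best_id)
    return labels, order


# Hypothetical toy data: "s" is a labeled seed, "b" is a hard pair whose
# own score is slightly negative but whose neighbors suggest a match.
unary = {"s": 0.0, "a": 2.0, "b": -0.3, "c": -1.5}
neighbors = {"a": [("s", 0.5)], "b": [("a", 0.9), ("s", 0.2)], "c": []}
labels, order = gradual_label(unary, neighbors, seeds={"s": 1})
print(order)   # ['a', 'c', 'b']
print(labels)  # {'s': 1, 'a': 1, 'c': 0, 'b': 1}
```

In this toy run the hard pair "b" is labeled last, after its neighbor "a" has been labeled, so the neighbor evidence outweighs its weak negative unary score. A real GML system would replace the greedy score with probabilistic inference over the factor graph.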

Key words

natural language understanding/Chinese question matching/gradual machine learning/natural language pretraining model/factor graph inference

Classification

Mining and metallurgy

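Cite this article

He Xuejian, Chen Anqi, Guo Zhiqiang, Wang Zhiru, Chen Qun. Question-matching approach based on gradual machine learning [J]. Chinese Journal of Engineering, 2025, 47(1): 79-90, 12.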

Funding

General Program of the National Natural Science Foundation of China (62172335)

Chinese Journal of Engineering (工程科学学报)

Open access; Peking University Core Journal (北大核心)

ISSN: 2095-9389
