Journal of Library and Information Science in Agriculture (农业图书情报学报), 2025, Vol. 37, Issue 8: 61-77 (17 pages). DOI: 10.13998/j.cnki.issn1002-1248.25-0365
多维特征文本复杂度框架与知识库增强模型
A Multi-dimensional Feature Text Complexity Framework and Knowledge Base Augmentation Model
Abstract
[Purpose/Significance] In cross-domain natural language processing (NLP) tasks, the performance of deep learning models often varies across texts with distinct domain characteristics, which reduces their generalization capability. Text complexity is one of the most explanatory factors influencing model generalization.
[Method/Process] This paper makes two contributions. First, it constructs a multi-dimensional text complexity calculation framework grounded in systemic functional linguistics. The framework quantifies complexity hierarchically. At the lexical level, it dynamically identifies four types of non-standard expressions (abbreviations, emoticons, internet buzzwords, and alphanumeric mixed words) and calculates a normative score using a non-linear formula. At the sentence level, it proposes an inverse fusion enhancement method (IFEM) that integrates punctuation anomaly density (weight 0.1), colloquial word ratio (weight 0.4), semantic ambiguity (weight 0.2), and sentence length features (weight 0.3), and generates a structural score by modeling feature synergy and suppression effects together with an adaptive weighting mechanism. At the corpus level, a weighted fusion yields the global corpus complexity assessment. Experimental results show that the framework quantifies intrinsic differences between domain texts: for instance, the measured complexity of the waimai_10k dataset is 0.703, significantly higher than the 0.552 of the ChnSentiCorp_htl_all dataset, and the framework accurately captures complexity changes after text reduction and substitution operations within a dataset. Second, the paper designs a knowledge base-enhanced dynamic adaptive CNN-BiLSTM model with the following mechanisms: 1) the knowledge base adopts a dual mapping architecture of text-label and vector-label, supporting the loading of historical experience knowledge and real-time error recording; 2) feature weights are adjusted based on the knowledge base content, for example by strengthening positive semantic representations or weakening negative expressions. The model architecture integrates multi-scale CNN convolutional kernels for local feature extraction, a bidirectional long short-term memory (BiLSTM) network for capturing long-distance dependencies, and an attention mechanism for focusing on key information. To validate the proposed methods, experiments were conducted on four Chinese datasets.
[Results/Conclusions] The complexity calculation framework is robust: complexity fluctuates by less than 3.3% after a 20% sample reduction and increases by at most 13.8% when short-text data are injected. The framework also quantifies and differentiates text complexity effectively, as shown by the complexity of 0.703 for waimai_10k versus 0.552 for ChnSentiCorp_htl_all. In addition, the proposed model achieves the best performance on both the most standardized dataset, ChnSentiCorp_htl_all, and the most challenging dataset, waimai_10k (accuracies of 0.9238 and 0.9434, respectively), significantly outperforming Transformer and several large language models such as deepseek-v3 and qwen-plus.
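The abstract gives the four sentence-level feature weights but not the IFEM formula itself. As a rough illustration only, the sketch below combines the four features with a plain weighted sum. It is a simplified stand-in, not the authors' IFEM: it omits the synergy and suppression modeling and the adaptive weighting, and the function name, the feature keys, and the assumption that every feature is already normalized to [0, 1] are hypothetical.

```python
from typing import Mapping

# Sentence-level weights as stated in the abstract.
WEIGHTS = {
    "punctuation_anomaly_density": 0.1,
    "colloquial_word_ratio": 0.4,
    "semantic_ambiguity": 0.2,
    "sentence_length": 0.3,
}


def linear_structural_score(features: Mapping[str, float]) -> float:
    """Plain weighted sum of the four sentence-level features.

    Assumption: each feature value is already normalized to [0, 1].
    This is NOT the paper's IFEM; synergy/suppression effects and
    adaptive weighting are deliberately left out.
    """
    missing = WEIGHTS.keys() - features.keys()
    if missing:
        raise KeyError(f"missing features: {sorted(missing)}")
    for name in WEIGHTS:
        value = features[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return sum(WEIGHTS[name] * features[name] for name in WEIGHTS)


if __name__ == "__main__":
    # Hypothetical feature values for a single sentence.
    example = {
        "punctuation_anomaly_density": 0.5,
        "colloquial_word_ratio": 0.25,
        "semantic_ambiguity": 0.0,
        "sentence_length": 1.0,
    }
    print(linear_structural_score(example))
```

Because the stated weights sum to 1, a score built this way stays in [0, 1] whenever the inputs do, which makes it directly comparable across sentences; whether the paper's structural score shares this property is not stated in the abstract.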
Key words
cross-domain sentiment analysis/text complexity quantification/knowledge base enhancement/deep learning/dynamic self-adaptation
Classification
Information Technology and Security Science
Cite this article
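Chang Hao (常郝), Xu Taotao (徐涛涛), Li Feng (李峰). A Multi-dimensional Feature Text Complexity Framework and Knowledge Base Augmentation Model[J]. Journal of Library and Information Science in Agriculture, 2025, 37(8): 61-77.
Funding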
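National Natural Science Foundation of China, "Research on Key Technologies for Through-Silicon Via Testing of Three-Dimensional Integrated Circuits Based on Delay Characteristics" (61704001)
Natural Science Foundation of Anhui Province, "Research on Reliability and Yield Improvement Methods for 3D Chips Oriented to TSV Delay Testing" (1808085QF196)
Natural Science Research Project of Anhui Higher Education Institutions, "Research on Speech Emotion Recognition with Multimodal Data Fusion" (2024AH050018)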