
基于Transformer的预训练语言模型在生物医学领域的应用


厦门大学学报(自然科学版), 2024, Vol. 63, Issue 5: 883-893, 11. DOI: 10.6043/j.issn.0438-0479.202404005


Application of Transformer-based pretrained language models in the biomedical domain

游至宇¹, 阳倩¹, 傅姿晴¹, 陈庆超², 李奇渊¹

Author Information

  • 1. School of Medicine and National Institute of Health Data Science, Xiamen University, Xiamen 361102, Fujian, China
  • 2. National Institute of Health Data Science, Peking University; National Key Laboratory of Cross-Media General Artificial Intelligence, Beijing 100191, China


Abstract

[Background] The rapid development of artificial intelligence (AI) has had a profound impact across various scientific disciplines, with natural language processing (NLP) emerging as a cornerstone technology in biomedical research. NLP's ability to analyze vast amounts of sequenced biomedical data, including not only medical texts but also complex sequences such as proteins and DNA, has become indispensable for tasks like clinical decision support and genomics interpretation. The introduction of Transformer-based pretrained language models (T-PLMs) represents a major breakthrough in this field, fundamentally transforming the way biomedical sequences are processed and understood. These models, particularly BERT and its variants, have significantly surpassed traditional rule-based and feature-engineering approaches. By leveraging deep learning, T-PLMs can capture the intricate patterns and relationships within biomedical data that were previously difficult to discern. This advancement has enabled the extraction of complex, meaningful medical knowledge from large-scale sequence data, fueling innovations in personalized medicine and improving overall healthcare outcomes. The shift from conventional methods to T-PLMs marks a pivotal milestone in the field, equipping researchers with powerful tools to uncover new insights, drive forward biomedical research, and ultimately transform patient care on a global scale.

[Progress] T-PLMs have evolved significantly, transforming how biomedical data are processed and analyzed. The training paradigms for T-PLMs in biomedicine typically involve two major phases: pre-training and fine-tuning. In the pre-training phase, models are trained on vast amounts of general language data and then further specialized through domain-specific pre-training on biomedical corpora. This process enables T-PLMs to develop a strong foundational understanding of language, which can then be fine-tuned for specific biomedical tasks. Fine-tuning is often tailored to particular applications, such as clinical decision support, where models are adjusted to improve their accuracy and relevance to medical contexts. Advanced techniques like continual pre-training and prompt engineering have further enhanced the adaptability and performance of T-PLMs in specialized tasks. T-PLMs have found diverse applications in biomedicine, ranging from text representation and knowledge extraction to complex tasks like protein structure prediction and molecular representation. These models have significantly improved the accuracy of tasks such as named entity recognition, relation extraction, and even drug discovery. Moreover, their ability to integrate and process multimodal data, such as combining text with medical images, has opened new frontiers in medical diagnostics and research. The advancements in T-PLMs have not only improved data analysis but have also paved the way for personalized medicine and more informed clinical practices, demonstrating their critical role in modern biomedical research.

[Perspective] While T-PLMs have achieved considerable success in biomedicine, several ongoing challenges must be addressed to fully capitalize on their potential. One of the most pressing issues is the interpretability of these models, particularly in clinical settings, where understanding the rationale behind AI-generated decisions is crucial for ensuring patient safety and fostering trust in AI-driven healthcare. Enhancing the transparency and explainability of T-PLMs is essential, with approaches like explainable AI and advanced model visualization playing a critical role. Another key challenge is the integration and fusion of multimodal data, which involves combining diverse data types such as text, medical images, and genetic sequences. This is particularly complex due to the heterogeneity of these data sources, which can lead to difficulties in data alignment and fusion. Addressing these challenges is vital for advancing the field. For example, research is increasingly focusing on multimodal learning frameworks that can seamlessly integrate different types of biomedical data, potentially enabling breakthroughs in diagnostics and personalized medicine. Additionally, data privacy concerns remain paramount, especially when sensitive patient information is used to train and deploy T-PLMs. Implementing robust privacy-preserving techniques, such as federated learning and blockchain, is essential to maintaining data security while still leveraging AI's potential in healthcare. Furthermore, the scalability of T-PLMs poses another significant challenge: training these models on large-scale, complex biomedical datasets demands substantial computational resources. Future research must focus on developing more efficient training paradigms and optimizing resource utilization to make T-PLMs more accessible for a broader range of applications in both research and clinical settings. Addressing these challenges will be key to unlocking the full potential of T-PLMs and driving continued innovation in the biomedical field.
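The masked-language-model objective behind BERT-style pre-training, mentioned above, can be illustrated with a minimal self-contained sketch. This is pure Python for illustration only (real T-PLM pipelines use subword tokenizers and large corpora); the 80/10/10 corruption split follows the original BERT recipe, and the example sentence and vocabulary are hypothetical:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """BERT-style masking: each position is selected with probability
    mask_prob; of the selected positions, 80% become [MASK], 10% a random
    vocabulary token, and 10% stay unchanged. Returns the corrupted
    sequence and the prediction targets (position -> original token)."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok  # the model must recover this token
            r = rng.random()
            if r < 0.8:
                corrupted[i] = mask_token
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)
            # else: keep the original token unchanged
    return corrupted, targets

# Hypothetical clinical-text fragment used as a toy corpus
tokens = "the patient was treated with metformin for type 2 diabetes".split()
corrupted, targets = mask_tokens(tokens, vocab=tokens, mask_prob=0.3)
print(corrupted)
print(targets)
```

During pre-training, the model sees `corrupted` as input and is optimized to predict the entries of `targets`; domain-specific pre-training simply runs this same objective over biomedical corpora instead of general text.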
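Federated learning, noted above as a privacy-preserving option, can be reduced to its core aggregation step (FedAvg): each site trains locally and only model parameters are shared and averaged, weighted by local dataset size, so raw patient records never leave the hospital. A minimal sketch with hypothetical site parameters and sizes:

```python
def fedavg(client_params, client_sizes):
    """One FedAvg aggregation round: a weighted average of each site's
    model parameters, with weights proportional to local dataset size."""
    total = sum(client_sizes)
    n = len(client_params[0])
    return [
        sum(p[j] * s for p, s in zip(client_params, client_sizes)) / total
        for j in range(n)
    ]

# Two hypothetical hospital sites: parameter vectors after local training,
# with 100 and 300 local records respectively (all values illustrative).
site_a = [0.2, 0.4]
site_b = [0.6, 0.8]
global_params = fedavg([site_a, site_b], [100, 300])
print(global_params)  # pulled toward site_b, which holds more data
```

In a real deployment the server would broadcast `global_params` back to the sites and repeat for many rounds; secure aggregation or differential privacy can be layered on top, but the weighted average above is the essential mechanism.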


Key words

natural language processing/biomedical application/pretrained language model/multimodal learning/medical text mining

Classification

Information technology and security science

Citation

游至宇, 阳倩, 傅姿晴, 陈庆超, 李奇渊. 基于Transformer的预训练语言模型在生物医学领域的应用[J]. 厦门大学学报(自然科学版), 2024, 63(5): 883-893, 11.

Funding

National Natural Science Foundation of China, General Program (82272944)

厦门大学学报(自然科学版) · ISSN 0438-0479 · OA · 北大核心 · CSTPCD
