计算机科学与探索 (Journal of Frontiers of Computer Science and Technology), 2025, Vol. 19, Issue 10: 2815-2830. DOI: 10.3778/j.issn.1673-9418.2409063
面向扩散大模型的多模态人脸生成方法
Multimodal Face Generation Method for Diffusion Large Models
Abstract
Face generation is a cutting-edge topic in computer vision, with broad application prospects in areas such as criminal investigation and virtual reality. Recently, diffusion models have exhibited outstanding generative capabilities, producing images with high semantic consistency under specified conditions, and their application to face generation has become a new trend. However, existing methods based on conventional diffusion models handle the details of conditional information inadequately and fail to fully exploit such information for precise face generation. Methods based on large diffusion models typically require substantial computational resources to fine-tune the model, or add extra complex networks, without achieving a balanced integration of multimodal conditional information. To address these challenges, this paper proposes a multimodal face generation method for diffusion large models, called MA-adapter. By incorporating a compact auxiliary network that extracts visual structural information and integrates semantic guidance, the method harnesses the generative capabilities of diffusion large models while avoiding the substantial computational cost of fine-tuning. The model first enhances image-modality prompts with a multi-head attention module, focusing more on key information. It then extracts multi-scale feature information through a multi-scale feature module, which underpins precise generation guidance. Finally, an adaptive adjustment mechanism adjusts the generation guidance coefficients of different feature layers to achieve better performance. Experimental results on the MM-CelebA-HQ (multi-modal-CelebA-HQ) dataset show that, compared with the current state-of-the-art method T2I-adapter, the MA-adapter reduces the perceptual similarity metric LPIPS by approximately 18.4%, increases the image-text matching metric CLIP-Score by about 13.6%, and improves the feature similarity metric CLIP-I by approximately 14.8%. Extensive experimental results fully validate the effectiveness and superiority of the MA-adapter.
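Since this page reproduces only the abstract, the following is a minimal PyTorch sketch of the three components the abstract names: attention-enhanced image prompts, multi-scale feature extraction, and adaptive per-layer guidance coefficients. Every module name, shape, and the fusion scheme here is an assumption for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of an MA-adapter-style auxiliary network.
# All names and shapes are assumptions; the real architecture may differ.
import torch
import torch.nn as nn

class MultiScaleFeature(nn.Module):
    """Extracts feature maps at several resolutions from an image prompt."""
    def __init__(self, in_ch: int = 3, widths=(64, 128, 256, 512)):
        super().__init__()
        self.stages = nn.ModuleList()
        ch = in_ch
        for w in widths:
            # Each stride-2 conv halves the spatial resolution.
            self.stages.append(nn.Sequential(
                nn.Conv2d(ch, w, 3, stride=2, padding=1),
                nn.SiLU(),
            ))
            ch = w

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # one feature map per scale

class MAAdapterSketch(nn.Module):
    """(1) multi-head attention over the image prompt, (2) multi-scale
    features, (3) learnable per-scale guidance coefficients."""
    def __init__(self, embed_dim: int = 512, num_heads: int = 8, num_scales: int = 4):
        super().__init__()
        self.proj_in = nn.Conv2d(3, embed_dim, 1)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.proj_out = nn.Conv2d(embed_dim, 3, 1)
        self.pyramid = MultiScaleFeature(in_ch=3)
        # Learnable coefficients standing in for the "adaptive adjustment
        # mechanism"; squashed to (0, 1) before scaling each layer.
        self.coeffs = nn.Parameter(torch.ones(num_scales))

    def forward(self, image_prompt):
        b, _, h, w = image_prompt.shape
        # Self-attention over spatial tokens to emphasize key regions.
        tokens = self.proj_in(image_prompt).flatten(2).transpose(1, 2)  # B, HW, C
        tokens, _ = self.attn(tokens, tokens, tokens)
        enhanced = self.proj_out(tokens.transpose(1, 2).reshape(b, -1, h, w))
        feats = self.pyramid(enhanced)
        # Each scaled map would be added to the matching encoder feature
        # of the frozen diffusion UNet, as in adapter-style guidance.
        return [c * f for c, f in zip(torch.sigmoid(self.coeffs), feats)]

# Usage: four guidance maps at 1/2, 1/4, 1/8, 1/16 of the input resolution.
adapter = MAAdapterSketch()
guidance = adapter(torch.randn(1, 3, 64, 64))
```

Making the per-scale coefficients learnable is one plausible reading of the abstract's adaptive adjustment mechanism; it contrasts with T2I-adapter-style designs, where the per-scale fusion weights are fixed.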
Key words: face generation; multimodal; diffusion model; intelligent generation; attention mechanism
Classification: Information Technology and Security Science
Cite this article:
黄万鑫, 任英杰, 芦天亮, 杨刚, 袁梦娇, 曾高俊. 面向扩散大模型的多模态人脸生成方法[J]. 计算机科学与探索, 2025, 19(10): 2815-2830.
Funding
This work was supported by the Double First-Class Innovation Research Project in Cyberspace Security Law Enforcement Technology of People's Public Security University of China (2023SYL07) and the Fundamental Research Funds for the Central Universities of Ministry of Education of China (2022JKF02022).