Application Research of Computers, 2025, Vol. 42, Issue 11: 3421-3429. DOI: 10.19734/j.issn.1001-3695.2025.02.0063
Multimodal Large Model Training Strategy for Function Image Data
Abstract
In recent years, multimodal large language models have undergone rapid development and demonstrated excellent performance on a variety of multimodal downstream tasks. However, current mainstream multimodal large language models still perform unsatisfactorily on function image reasoning tasks, which require a model not only to possess strong visual perception but also to carry out chain-of-thought reasoning in order to accurately understand and answer questions involving mathematical functions. To address these issues, this paper first constructed FunctionQA, an instruction fine-tuning dataset specially designed for function image reasoning tasks; in addition to a standard question-answer pair, each data point includes a detailed chain-of-thought reasoning process, ensuring that the model can learn complex reasoning steps during training. Second, it designed a four-stage fine-tuning strategy for function image reasoning tasks that progressively optimizes the visual encoder, the multimodal adapter, and the large language model, incorporating LoRA to reduce training cost. Experimental results show that the mFunction-4B model, built on the LLaVA framework and optimized with the FunctionQA dataset and the four-stage fine-tuning strategy, achieves an accuracy of 43.55% on the FunctionQA subset of MathVista testmini with only 4B parameters, representing a 14.52% improvement over the baseline model LLaVA-1.5-7B and validating the feasibility and effectiveness of the proposed method.
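To make the dataset description concrete, the following is a minimal sketch of what a single FunctionQA instance might look like, assuming a LLaVA-style instruction-tuning layout; the field names, image path, and sample contents are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical FunctionQA instance: a question-answer pair about a plotted
# function, augmented with an explicit chain-of-thought. All field names
# and values below are illustrative assumptions, not the released schema.
example = {
    "image": "plots/parabola_0001.png",  # rendered function image
    "question": "What is the minimum value of the function in the plot?",
    "chain_of_thought": [
        "The curve is an upward-opening parabola, so it attains a minimum.",
        "Its vertex is located at x = 1.",
        "The y-coordinate of the vertex is -2.",
    ],
    "answer": "-2",
}
```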
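Similarly, the four-stage strategy can be pictured as progressively widening the set of trainable parameters. The sketch below, using the Hugging Face transformers and peft libraries, shows one way to freeze and unfreeze the visual encoder, multimodal adapter, and language model and to attach LoRA adapters; the stage boundaries, module choices, model checkpoint, and hyperparameters are assumptions for illustration, not the paper's reported configuration.

```python
# A minimal sketch of staged fine-tuning in the spirit of the paper's
# four-stage strategy (assumed stages; the paper's exact schedule may differ).
import torch
from transformers import LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf", torch_dtype=torch.float16
)

def set_trainable(module: torch.nn.Module, trainable: bool) -> None:
    """Toggle requires_grad for every parameter of a submodule."""
    for p in module.parameters():
        p.requires_grad = trainable

# Stage 1 (assumed): train only the multimodal adapter (projector),
# keeping the rest of the model frozen.
set_trainable(model, False)
set_trainable(model.multi_modal_projector, True)

# Stage 2 (assumed): additionally unfreeze the visual encoder.
set_trainable(model.vision_tower, True)

# Stages 3-4 (assumed): attach LoRA adapters to the language model's
# attention projections so that only low-rank matrices (plus the
# projector) receive gradients, which keeps training cost low.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # Regex restricts LoRA to the language model, leaving the vision
    # tower's identically named projection layers untouched.
    target_modules=r"language_model.*\.(q_proj|k_proj|v_proj|o_proj)",
    modules_to_save=["multi_modal_projector"],  # keep projector trainable
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
```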
Keywords
multimodal large language model (MLLM) / chain-of-thought reasoning / instruction fine-tuning / low-rank adaptation (LoRA)
Classification
Information Technology and Security Science
Citation
明一博, 陈彦敏, 赵嘉璐. Multimodal large model training strategy for function image data [J]. Application Research of Computers, 2025, 42(11): 3421-3429.
Funding
Natural Science Foundation of Xinjiang Uygur Autonomous Region (2022D01A227)
Key Research and Development Program of Xinjiang Uygur Autonomous Region (2022B01007-1)