Chinese Journal of Intelligent Science and Technology, 2025, Vol. 7, Issue 3: 290-303. DOI: 10.11959/j.issn.2096-6652.202536
Vision-Language-Action Models under the Parallel Intelligence Paradigm: The State of the Art and Future Perspectives
Abstract
Vision-language-action (VLA) models are a comprehensive modeling approach for embodied intelligence that integrates visual perception, natural language understanding, and action execution within a unified framework, aiming to establish a continuous loop from environmental perception to task planning and action control. Their operational logic corresponds closely to the paradigm of parallel intelligence articulated in the early 21st century. That paradigm comprises artificial systems, computational experiments, and parallel execution, emphasizing virtual modeling, reproducible inference, and closed-loop interaction between the virtual and the real. The initial stage of VLA development, driven by multimodal deep learning, can be regarded as prototypical work within artificial systems; the subsequent stage, characterized by large-scale models and cross-domain training, expanded the scope of computational experiments; and the more recent focus on hierarchical control and virtual-real closed loops reflects the feedback correction and normative guidance emphasized in parallel execution. VLA models exhibit deep coupling between semantics and action, iterative cycles linking simulation with reality, and steadily improving verifiability. Nonetheless, challenges remain in generalization, semantic alignment, safety and interpretability, and deployment efficiency. Addressing these issues calls for contract-based task semantics, repairable long-horizon hierarchical planning, engineering-oriented use of world models, multi-level feedback and safety governance, and cross-platform transfer with human-machine collaboration. Examining VLA through the lens of parallel intelligence clarifies its developmental logic and provides methodological support for advancing toward trustworthy real-world applications.

Keywords
parallel intelligence / vision-language-action model / embodied intelligence / multimodal fusion / virtual-real interaction

Classification
Information Technology and Security Science
Citation: 李柏, 郝金第, 孙跃硕, 孟雨晴, 黄峻, 田永林, 贺正冰. Vision-Language-Action Models under the Parallel Intelligence Paradigm: The State of the Art and Future Perspectives [J]. Chinese Journal of Intelligent Science and Technology, 2025, 7(3): 290-303.

Funding
Science and Technology Development Fund, Macao Special Administrative Region (No. 0157/2024/RIA2, No. 0093/2023/RIA2, No. 0145/2023/RIA3)
National Natural Science Foundation of China (No. 62103139)
Hibiscus Mutabilis Youth Talent Program of Hunan Province (No. 2023RC3115)