A survey of visual intelligence development

魏云超 任中伟 方岩

Journal of Beijing Jiaotong University, 2025, Vol. 49, Issue 5: 66-81, 16. DOI: 10.11860/j.issn.1673-0291.20250133

魏云超 ¹, 任中伟 ¹, 方岩 ¹

Author information

  • 1. School of Computer Science and Technology, Beijing Jiaotong University, Beijing 100044

Abstract

Visual intelligence, as a core branch of AI, seeks to endow machines with human-like capabilities for visual understanding and interaction. Since the breakthrough of deep learning in computer vision in 2012, the field has undergone four progressive stages of evolution. The first stage, represented by AlexNet, VGGNet and ResNet, leveraged large annotated datasets such as ImageNet to achieve remarkable success in closed-domain tasks (e.g., image classification and object detection), but its dependence on labeled data highlighted inherent limitations. The second stage witnessed the rise of self-supervised learning, with models such as MoCo, DINO and MAE learning powerful visual representations from massive unlabeled data through contrastive, distillation, and masked reconstruction methods. The third stage marked a shift toward multimodal intelligence, where models like CLIP and GPT-4V integrated vision and language, enabling open-vocabulary understanding and advancing toward fine-grained, intent-driven reasoning. The current frontier is world models, exemplified by Sora, which aim not only to perceive and describe but also to simulate and predict the physical world, paving the way for embodied intelligence capable of interacting with reality. This fundamental transformation from discriminative understanding to generative simulation of the world marks a new consensus: generative modeling is the new deep learning. This survey follows this developmental trajectory, analyzing the core ideas, representative models, and methodological paradigms at each stage, while highlighting ongoing challenges in robustness, reasoning, and generalization.

Key words

visual intelligence / deep learning / self-supervised learning / multimodal AI / world model

Classification

Information technology and security science

Cite this article

魏云超, 任中伟, 方岩. A survey of visual intelligence development[J]. Journal of Beijing Jiaotong University, 2025, 49(5): 66-81, 16.

Funding

National Natural Science Foundation of China (92470203)

Journal of Beijing Jiaotong University (open access; Peking University Core Journal), ISSN 1673-0291