Depth-Guided Vision Transformer With Normalizing Flows for Monocular 3D Object Detection
Monocular 3D object detection is challenging due to the lack of accurate depth information. Some methods estimate pixel-wise depth maps with off-the-shelf depth estimators and then use them as an additional input to augment the RGB images. Depth-based methods either convert the estimated depth maps to pseudo-LiDAR and then apply LiDAR-based object detectors, or focus on learning a fusion of image and depth features. However, they show limited performance and efficiency because of depth inaccuracy and complex convolution-based fusion. Different from these approaches, our proposed depth-guided vision transformer with normalizing flows (NF-DVT) network uses normalizing flows to build priors in depth maps and thereby obtain more accurate depth information. We then develop a novel Swin-Transformer-based backbone with a fusion module that processes RGB image patches and depth map patches in two separate branches and fuses them using cross-attention so that the branches exchange information with each other. Furthermore, with the help of pixel-wise relative depth values in depth maps, we develop new relative position embeddings in the cross-attention mechanism to capture a more accurate sequence ordering of input tokens. Our method is the first Swin-Transformer-based backbone architecture for monocular 3D object detection. Experimental results on the KITTI and the challenging Waymo Open datasets show the effectiveness of our proposed method and its superior performance over previous counterparts.
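The cross-attention fusion with depth-aware position information described above can be illustrated with a minimal sketch. It is not the NF-DVT implementation: the function name `depth_biased_cross_attention`, the `bias_scale` parameter, and the bias form (a penalty proportional to the absolute depth difference between two tokens) are assumptions chosen for illustration, standing in for the relative position embeddings the abstract mentions. Queries come from one branch (for example RGB tokens) and keys and values from the other (for example depth tokens).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def depth_biased_cross_attention(queries, keys, values,
                                 query_depths, key_depths,
                                 bias_scale=1.0):
    """Single-head cross-attention with an additive depth-based bias.

    queries:      token vectors from one branch (e.g. RGB patches)
    keys, values: token vectors from the other branch (e.g. depth patches)
    query_depths, key_depths: one relative depth scalar per token
    bias_scale:   hypothetical weight of the depth bias (not from the paper)
    """
    dim = len(queries[0])
    outputs = []
    for q, dq in zip(queries, query_depths):
        scores = []
        for k, dk in zip(keys, key_depths):
            dot = sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            # Assumed bias: tokens at similar depth attend to each other more.
            scores.append(dot - bias_scale * abs(dq - dk))
        weights = softmax(scores)
        # Weighted sum of value vectors, one output vector per query token.
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs


if __name__ == "__main__":
    rgb_tokens = [[1.0, 0.0]]
    depth_tokens = [[1.0, 0.0], [1.0, 0.0]]
    depth_values = [[1.0, 0.0], [0.0, 1.0]]
    fused = depth_biased_cross_attention(
        rgb_tokens, depth_tokens, depth_values,
        query_depths=[0.2], key_depths=[0.2, 0.9],
    )
    print(fused)
```

In this toy input the two key vectors are identical, so only the depth bias separates them: the query attends more strongly to the key whose depth is closer to its own. A learned embedding table indexed by quantized relative depth would be a natural replacement for the fixed penalty, but that is likewise an assumption rather than a detail stated in the abstract.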
Cong Pan; Junran Peng; Zhaoxiang Zhang
Center for Research on Intelligent Perception and Computing (CRIPAC), National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing 100190 || School of Future Technology, University of Chinese Academy of Sciences (UCAS), Beijing 100049, China; Huawei Inc., Beijing 100085, China; Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing 100190 || University of Chinese Academy of Sciences (UCAS), Beijing 100049 || Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences (HKISI CAS), Hong Kong 999077, China
Monocular 3D object detection; normalizing flows; Swin Transformer
Acta Automatica Sinica (English Edition), 2024, No. 3
Pages 673-689 (17 pages)
This work was supported in part by the Major Project for New Generation of AI (2018AAA0100400), the National Natural Science Foundation of China (61836014, U21B2042, 62072457, 62006231), and the InnoHK Program.