Advanced Engineering Sciences, 2026, Vol. 58, Issue 2: 46-56, 11. DOI: 10.12454/j.jsuese.202400467
基于改进YOLOv5和CombineSORT的车联网路侧视觉感知
Roadside Visual Perception in Internet of Vehicles Based on Improved YOLOv5 and CombineSORT
Abstract
Objective Visual inspection is a critical technology for roadside perception in vehicle-road cooperative systems. However, in practical applications, achieving both high detection accuracy and computational efficiency simultaneously remains challenging due to limited computing resources. This study proposes a novel method based on an improved YOLOv5 combined with CombineSORT for image recognition and target tracking, which achieves strong detection performance while maintaining low computational time cost, as demonstrated through experimental results.

Methods Firstly, Multi-scale Feature Enhancement (MFE) was applied to the FPN of YOLOv5 to extract shallow target details. This module was mainly composed of Scale Fusion, CombineFPN, and Pixel-Region Attention. A super-efficient IOU (SEIOU) loss function and network pruning were applied to improve convergence and reduce model complexity. In this process, the loss was calculated based on differences in length, width, and diagonal between the detection boxes and the ground-truth boxes, while batch normalization (BN) layer sparsification was applied for convolutional channel filtering. Secondly, by combining DeepSORT, StrongSORT, and Bot-SORT, a new multi-target tracking method named CombineSORT was presented. In this approach, the basic framework of DeepSORT was adopted, and a BotNet with ResNet50 as the backbone was utilized to extract appearance features. Kalman filtering was replaced by polynomial fitting to improve trajectory smoothness, while the joint similarity matrix from StrongSORT was utilized to match targets with trajectories. Based on the operational procedure of the proposed algorithm, a series of experiments was designed to validate its effectiveness. Using images from real intersections, ablation tests verified the effectiveness and data-volume contribution of each improved module. The algorithm was then compared to classical methods using intersection video streams with varying traffic volumes, all of which were executed on a mobile edge computer (MEC) with limited computing resources.

Results and Discussions In the ablation tests, the original YOLOv5 achieved an mAP@90 of 0.894 with 21.2 M parameters. Scale Fusion, CombineFPN, and Pixel-Region Attention increased the mAP@90 of the original model to 0.91, 0.923, and 0.916, respectively, while the parameter quantity increased to 24.4, 25.3, and 24.1 M, respectively. The YOLOv5 model integrating all three modules achieved an mAP@90 of 0.939 with 31.0 M parameters, after which network pruning reduced the parameter quantity to 6.6 M while maintaining an mAP@90 of 0.937. Across three groups of real intersection experiments, the average recall rates for Groups 1 to 3 were 97.68%, 95.83%, and 96.76%, while the multiple object tracking accuracy (MOTA) values were 0.944, 0.890, and 0.910, respectively. Among all target categories, pedestrians and non-motorized vehicles exhibited relatively poor detection performance. In Group 2 in particular, the recall rate and MOTA for pedestrians were 89.98% and 0.75, respectively, while those for non-motorized vehicles were as low as 84.5% and 0.675. This occurred because these two target types were relatively small and did not strictly follow traffic rules, which caused frequent occlusion and increased trajectory-prediction difficulty. In addition, the recall rates of buses and trucks were nearly 3 percentage points lower than those of cars, especially in Group 2, where the recall rates were only 94.81% and 94.92%, respectively. This issue occurred because box trucks and buses exhibited similar appearance features, which increased the likelihood of misidentification from rear perspectives. When comparing the overall processing performance of different algorithms at low-volume intersections, the worst test result achieved a recall rate of 96.54% with a MOTA value of 0.938, while the best achieved a recall rate of 97.69% with a MOTA value of 0.946. These results indicated that most algorithms achieved good detection performance under sparse target conditions, and that lightweight models demonstrated advantages when computational resource constraints were considered. At high-volume intersections, however, although the lightweight algorithm based on EfficientNet and ByteTrack exhibited the shortest computation delay, its recall rate and MOTA value were only 91.75% and 0.817, respectively. In contrast, algorithms based on YOLOv5, YOLOX, YOLOv7, and the improved YOLOv5 proposed in this study achieved recall rates ranging from 95.26% to 96.28%, while algorithms combined with DeepSORT, StrongSORT, Bot-SORT, and CombineSORT achieved MOTA values ranging from 0.887 to 0.901. However, most of these methods exhibited computation times exceeding 80 ms, which prevented real-time operation. Among algorithms with computation times below 80 ms, the proposed method based on improved YOLOv5 and CombineSORT achieved the best overall performance, with a recall rate of 96.27% and a MOTA value of 0.900, confirming its ability to balance detection accuracy and computational efficiency.

Conclusions This study focuses on traffic-target perception from a fixed roadside perspective, and the results demonstrate the effectiveness and accuracy of the proposed algorithm. Compared to other commonly used algorithms, the proposed approach simultaneously achieves higher detection accuracy and lower time cost at high-volume intersections, indicating strong application potential in vehicle-road collaboration scenarios. For improved engineering practice, further research can be conducted to enhance recognition and tracking performance based on continuous image sequences under adverse weather conditions.
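The SEIOU loss described in the Methods, which penalizes differences in box width, height, and diagonal between predicted and ground-truth boxes, can be sketched as follows. This is a minimal illustration assuming an EIOU-style formulation; the abstract does not give the authors' exact definition or term weighting, so the function name `seiou_loss` and the equal weights used here are assumptions.

```python
def seiou_loss(pred, gt):
    """Sketch of an SEIOU-style bounding-box loss (assumed formulation).

    Boxes are (x1, y1, x2, y2). The penalty combines 1 - IOU with
    normalized center-distance (diagonal), width, and height terms,
    following the description in the Methods; equal weights assumed.
    """
    eps = 1e-9

    # Intersection area of the two boxes
    iw = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    ih = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = iw * ih

    # Union area and plain IOU
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    iou = inter / (area_p + area_g - inter + eps)

    # Smallest enclosing box; its diagonal normalizes the penalties
    cw = max(pred[2], gt[2]) - min(pred[0], gt[0])
    ch = max(pred[3], gt[3]) - min(pred[1], gt[1])
    c2 = cw * cw + ch * ch + eps  # squared diagonal of enclosing box

    # Center-distance term (diagonal difference), as in DIOU/EIOU
    dx = (pred[0] + pred[2]) / 2 - (gt[0] + gt[2]) / 2
    dy = (pred[1] + pred[3]) / 2 - (gt[1] + gt[3]) / 2
    dist = (dx * dx + dy * dy) / c2

    # Width and height difference terms, each normalized separately
    dw = ((pred[2] - pred[0]) - (gt[2] - gt[0])) ** 2 / (cw * cw + eps)
    dh = ((pred[3] - pred[1]) - (gt[3] - gt[1])) ** 2 / (ch * ch + eps)

    return 1.0 - iou + dist + dw + dh
```

With this formulation, the loss approaches 0 for coincident boxes and grows with center offset and shape mismatch, which is what drives the improved convergence the Methods attribute to SEIOU.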
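The Methods also state that CombineSORT replaces Kalman filtering with polynomial fitting for trajectory smoothing and prediction. A minimal sketch of that idea is below; the polynomial degree, history window, and prediction horizon are assumptions, since the abstract does not specify them.

```python
import numpy as np

def smooth_and_predict(xs, ys, ts, degree=2, horizon=1.0):
    """Sketch of polynomial-fit trajectory prediction (assumed settings).

    Fits one least-squares polynomial per coordinate over the recent
    track history (timestamps ts) and extrapolates `horizon` time units
    past the last observation, returning the predicted (x, y) center.
    """
    ts = np.asarray(ts, dtype=float)
    # One polynomial per coordinate; degree 2 is an assumed default
    px = np.polyfit(ts, np.asarray(xs, dtype=float), degree)
    py = np.polyfit(ts, np.asarray(ys, dtype=float), degree)
    t_next = ts[-1] + horizon
    return float(np.polyval(px, t_next)), float(np.polyval(py, t_next))
```

For a target moving at constant velocity, the fit degenerates to a line and the extrapolation continues that motion; unlike a Kalman filter, this scheme keeps no state beyond the raw history, which is one plausible reason it yields smoother trajectories on short windows.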
Keywords: vehicle-road cooperation / roadside perception / image recognition / YOLOv5 / CombineSORT

Classification: Traffic Engineering
Citation: LI Xiaohui, YANG Jie, XIA Qin. Roadside Visual Perception in Internet of Vehicles Based on Improved YOLOv5 and CombineSORT[J]. Advanced Engineering Sciences, 2026, 58(2): 46-56, 11.
Funding: Key Research and Development Program of Jiangxi Province (20243BBG71033)