Data augmentation plays an important role in boosting the performance of 3D models,while very few studies handle the 3D point cloud data with this technique.Global augmentation and cut-paste are commonly used augmenta...Data augmentation plays an important role in boosting the performance of 3D models,while very few studies handle the 3D point cloud data with this technique.Global augmentation and cut-paste are commonly used augmentation techniques for point clouds,where global augmentation is applied to the entire point cloud of the scene,and cut-paste samples objects from other frames into the current frame.Both types of data augmentation can improve performance,but the cut-paste technique cannot effectively deal with the occlusion relationship between the foreground object and the background scene and the rationality of object sampling,which may be counterproductive and may hurt the overall performance.In addition,LiDAR is susceptible to signal loss,external occlusion,extreme weather and other factors,which can easily cause object shape changes,while global augmentation and cut-paste cannot effectively enhance the robustness of the model.To this end,we propose Syn-Aug,a synchronous data augmentation framework for LiDAR-based 3D object detection.Specifically,we first propose a novel rendering-based object augmentation technique(Ren-Aug)to enrich training data while enhancing scene realism.Second,we propose a local augmentation technique(Local-Aug)to generate local noise by rotating and scaling objects in the scene while avoiding collisions,which can improve generalisation performance.Finally,we make full use of the structural information of 3D labels to make the model more robust by randomly changing the geometry of objects in the training frames.We verify the proposed framework with four different types of 3D object detectors.Experimental results show that our proposed Syn-Aug significantly improves the performance of various 3D object detectors in the KITTI and nuScenes datasets,proving the effectiveness and generality of Syn-Aug.On KITTI,four different types of baseline models using Syn-Aug improved mAP by 0.89%,1.35%,1.61%and 1.14%respectively.On nuScenes,four different types of baseline models using Syn-Aug improved mAP by 14.93%,10.42%,8.47%and 6.81%respectively.The code is available at https://github.com/liuhuaijjin/Syn-Aug.展开更多
Object detection in occluded environments remains a core challenge in computer vision(CV),especially in domains such as autonomous driving and robotics.While Convolutional Neural Network(CNN)-based twodimensional(2D)a...Object detection in occluded environments remains a core challenge in computer vision(CV),especially in domains such as autonomous driving and robotics.While Convolutional Neural Network(CNN)-based twodimensional(2D)and three-dimensional(3D)object detection methods havemade significant progress,they often fall short under severe occlusion due to depth ambiguities in 2D imagery and the high cost and deployment limitations of 3D sensors such as Light Detection and Ranging(LiDAR).This paper presents a comparative review of recent 2D and 3D detection models,focusing on their occlusion-handling capabilities and the impact of sensor modalities such as stereo vision,Time-of-Flight(ToF)cameras,and LiDAR.In this context,we introduce FuDensityNet,our multimodal occlusion-aware detection framework that combines Red-Green-Blue(RGB)images and LiDAR data to enhance detection performance.As a forward-looking direction,we propose a monocular depth-estimation extension to FuDensityNet,aimed at replacing expensive 3D sensors with a more scalable CNN-based pipeline.Although this enhancement is not experimentally evaluated in this manuscript,we describe its conceptual design and potential for future implementation.展开更多
The inherent limitations of 2D object detection,such as inadequate spatial reasoning and susceptibility to environmental occlusions,pose significant risks to the safety and reliability of autonomous driving systems.To...The inherent limitations of 2D object detection,such as inadequate spatial reasoning and susceptibility to environmental occlusions,pose significant risks to the safety and reliability of autonomous driving systems.To address these challenges,this paper proposes an enhanced 3D object detection framework(FastSECOND)based on an optimized SECOND architecture,designed to achieve rapid and accurate perception in autonomous driving scenarios.Key innovations include:(1)Replacing the Rectified Linear Unit(ReLU)activation functions with the Gaussian Error Linear Unit(GELU)during voxel feature encoding and region proposal network stages,leveraging partial convolution to balance computational efficiency and detection accuracy;(2)Integrating a Swin-Transformer V2 module into the voxel backbone network to enhance feature extraction capabilities in sparse data;and(3)Introducing an optimized position regression loss combined with a geometry-aware Focal-EIoU loss function,which incorporates bounding box geometric correlations to accelerate network convergence.While this study currently focuses exclusively on the detection of the Car category,with experiments conducted on the Car class of the KITTI dataset,future work will extend to other categories such as Pedestrian and Cyclist to more comprehensively evaluate the generalization capability of the proposed framework.Extensive experimental results demonstrate that our framework achieves a more effective trade-off between detection accuracy and speed.Compared to the baseline SECOND model,it achieves a 21.9%relative improvement in 3D bounding box detection accuracy on the hard subset,while reducing inference time by 14 ms.These advancements underscore the framework’s potential for enabling real-time,high-precision perception in autonomous driving applications.展开更多
As the number and complexity of sensors in autonomous vehicles continue to rise,multimodal fusionbased object detection algorithms are increasingly being used to detect 3D environmental information,significantly advan...As the number and complexity of sensors in autonomous vehicles continue to rise,multimodal fusionbased object detection algorithms are increasingly being used to detect 3D environmental information,significantly advancing the development of perception technology in autonomous driving.To further promote the development of fusion algorithms and improve detection performance,this paper discusses the advantages and recent advancements of multimodal fusion-based object detection algorithms.Starting fromsingle-modal sensor detection,the paper provides a detailed overview of typical sensors used in autonomous driving and introduces object detection methods based on images and point clouds.For image-based detection methods,they are categorized into monocular detection and binocular detection based on different input types.For point cloud-based detection methods,they are classified into projection-based,voxel-based,point cluster-based,pillar-based,and graph structure-based approaches based on the technical pathways for processing point cloud features.Additionally,multimodal fusion algorithms are divided into Camera-LiDAR fusion,Camera-Radar fusion,Camera-LiDAR-Radar fusion,and other sensor fusion methods based on the types of sensors involved.Furthermore,the paper identifies five key future research directions in this field,aiming to provide insights for researchers engaged in multimodal fusion-based object detection algorithms and to encourage broader attention to the research and application of multimodal fusion-based object detection.展开更多
Aiming at the limitations of the existing railway foreign object detection methods based on two-dimensional(2D)images,such as short detection distance,strong influence of environment and lack of distance information,w...Aiming at the limitations of the existing railway foreign object detection methods based on two-dimensional(2D)images,such as short detection distance,strong influence of environment and lack of distance information,we propose Rail-PillarNet,a three-dimensional(3D)LIDAR(Light Detection and Ranging)railway foreign object detection method based on the improvement of PointPillars.Firstly,the parallel attention pillar encoder(PAPE)is designed to fully extract the features of the pillars and alleviate the problem of local fine-grained information loss in PointPillars pillars encoder.Secondly,a fine backbone network is designed to improve the feature extraction capability of the network by combining the coding characteristics of LIDAR point cloud feature and residual structure.Finally,the initial weight parameters of the model were optimised by the transfer learning training method to further improve accuracy.The experimental results on the OSDaR23 dataset show that the average accuracy of Rail-PillarNet reaches 58.51%,which is higher than most mainstream models,and the number of parameters is 5.49 M.Compared with PointPillars,the accuracy of each target is improved by 10.94%,3.53%,16.96%and 19.90%,respectively,and the number of parameters only increases by 0.64M,which achieves a balance between the number of parameters and accuracy.展开更多
Multi-modal 3D object detection has achieved remarkable progress,but it is often limited in practical industrial production because of its high cost and low efficiency.The multi-view camera-based method provides a fea...Multi-modal 3D object detection has achieved remarkable progress,but it is often limited in practical industrial production because of its high cost and low efficiency.The multi-view camera-based method provides a feasible solution due to its low cost.However,camera data lacks geometric depth,and only using camera data to obtain high accuracy is challenging.This paper proposes a multi-modal Bird-Eye-View(BEV)distillation framework(MMDistill)to make a trade-off between them.MMDistill is a carefully crafted two-stage distillation framework based on teacher and student models for learning cross-modal knowledge and generating multi-modal features.It can improve the performance of unimodal detectors without introducing additional costs during inference.Specifically,our method can effectively solve the cross-gap caused by the heterogeneity between data.Furthermore,we further propose a Light Detection and Ranging(LiDAR)-guided geometric compensation module,which can assist the student model in obtaining effective geometric features and reduce the gap between different modalities.Our proposed method generally requires fewer computational resources and faster inference speed than traditional multi-modal models.This advancement enables multi-modal technology to be applied more widely in practical scenarios.Through experiments,we validate the effectiveness and superiority of MMDistill on the nuScenes dataset,achieving an improvement of 4.1%mean Average Precision(mAP)and 4.6%NuScenes Detection Score(NDS)over the baseline detector.In addition,we also present detailed ablation studies to validate our method.展开更多
Monocular 3D object detection is challenging due to the lack of accurate depth information.Some methods estimate the pixel-wise depth maps from off-the-shelf depth estimators and then use them as an additional input t...Monocular 3D object detection is challenging due to the lack of accurate depth information.Some methods estimate the pixel-wise depth maps from off-the-shelf depth estimators and then use them as an additional input to augment the RGB images.Depth-based methods attempt to convert estimated depth maps to pseudo-LiDAR and then use LiDAR-based object detectors or focus on the perspective of image and depth fusion learning.However,they demonstrate limited performance and efficiency as a result of depth inaccuracy and complex fusion mode with convolutions.Different from these approaches,our proposed depth-guided vision transformer with a normalizing flows(NF-DVT)network uses normalizing flows to build priors in depth maps to achieve more accurate depth information.Then we develop a novel Swin-Transformer-based backbone with a fusion module to process RGB image patches and depth map patches with two separate branches and fuse them using cross-attention to exchange information with each other.Furthermore,with the help of pixel-wise relative depth values in depth maps,we develop new relative position embeddings in the cross-attention mechanism to capture more accurate sequence ordering of input tokens.Our method is the first Swin-Transformer-based backbone architecture for monocular 3D object detection.The experimental results on the KITTI and the challenging Waymo Open datasets show the effectiveness of our proposed method and superior performance over previous counterparts.展开更多
3D object detection is one of the most challenging research tasks in computer vision. In order to solve the problem of template information dependency of 3D object proposal in the method of 3D object detection based o...3D object detection is one of the most challenging research tasks in computer vision. In order to solve the problem of template information dependency of 3D object proposal in the method of 3D object detection based on 2.5D information, we proposed a 3D object detector based on fusion of vanishing point and prior orientation, which estimates an accurate 3D proposal from 2.5D data, and provides an excellent start point for 3D object classification and localization. The algorithm first calculates three mutually orthogonal vanishing points by the Euler angle principle and projects them into the pixel coordinate system. Then, the top edge of the 2D proposal is sampled by the preset sampling pitch, and the first one vertex is taken. Finally, the remaining seven vertices of the 3D proposal are calculated according to the linear relationship between the three vanishing points and the vertices, and the complete information of the 3D proposal is obtained. The experimental results show that this proposed method improves the Mean Average Precision score by 2.7% based on the Amodal3Det method.展开更多
The self-attention networks and Transformer have dominated machine translation and natural language processing fields,and shown great potential in image vision tasks such as image classification and object detection.I...The self-attention networks and Transformer have dominated machine translation and natural language processing fields,and shown great potential in image vision tasks such as image classification and object detection.Inspired by the great progress of Transformer,we propose a novel general and robust voxel feature encoder for 3D object detection based on the traditional Transformer.We first investigate the permutation invariance of sequence data of the self-attention and apply it to point cloud processing.Then we construct a voxel feature layer based on the self-attention to adaptively learn local and robust context of a voxel according to the spatial relationship and context information exchanging between all points within the voxel.Lastly,we construct a general voxel feature learning framework with the voxel feature layer as the core for 3D object detection.The voxel feature with Transformer(VFT)can be plugged into any other voxel-based 3D object detection framework easily,and serves as the backbone for voxel feature extractor.Experiments results on the KITTI dataset demonstrate that our method achieves the state-of-the-art performance on 3D object detection.展开更多
Nowadays, 3D object detection, which uses the color and depth information to find object localization in the 3D world and estimate their physical size and pose, is one of the most important 3D perception tasks in the ...Nowadays, 3D object detection, which uses the color and depth information to find object localization in the 3D world and estimate their physical size and pose, is one of the most important 3D perception tasks in the field of computer vision. In order to solve the problem of mixed segmentation results when multiple instances appear in one frustum in the F-PointNet method and in the occlusion that leads to the loss of depth information, a 3D object detection approach based on instance segmentation and image restoration is proposed in this paper. Firstly, instance segmentation with Mask R-CNN on an RGB image is used to avoid mixed segmentation results. Secondly, for the detected occluded objects, we remove the occluding object first in the depth map and then restore the empty pixel region by utilizing the Criminisi Algorithm to recover the missing depth information of the object. The experimental results show that the proposed method improves the average precision score compared with the F-PointNet method.展开更多
In complex traffic environment scenarios,it is very important for autonomous vehicles to accurately perceive the dynamic information of other vehicles around the vehicle in advance.The accuracy of 3D object detection ...In complex traffic environment scenarios,it is very important for autonomous vehicles to accurately perceive the dynamic information of other vehicles around the vehicle in advance.The accuracy of 3D object detection will be affected by problems such as illumination changes,object occlusion,and object detection distance.To this purpose,we face these challenges by proposing a multimodal feature fusion network for 3D object detection(MFF-Net).In this research,this paper first uses the spatial transformation projection algorithm to map the image features into the feature space,so that the image features are in the same spatial dimension when fused with the point cloud features.Then,feature channel weighting is performed using an adaptive expression augmentation fusion network to enhance important network features,suppress useless features,and increase the directionality of the network to features.Finally,this paper increases the probability of false detection and missed detection in the non-maximum suppression algo-rithm by increasing the one-dimensional threshold.So far,this paper has constructed a complete 3D target detection network based on multimodal feature fusion.The experimental results show that the proposed achieves an average accuracy of 82.60%on the Karlsruhe Institute of Technology and Toyota Technological Institute(KITTI)dataset,outperforming previous state-of-the-art multimodal fusion networks.In Easy,Moderate,and hard evaluation indicators,the accuracy rate of this paper reaches 90.96%,81.46%,and 75.39%.This shows that the MFF-Net network has good performance in 3D object detection.展开更多
The high bandwidth and low latency of 6G network technology enable the successful application of monocular 3D object detection on vehicle platforms.Monocular 3D-object-detection-based Pseudo-LiDAR is a low-cost,lowpow...The high bandwidth and low latency of 6G network technology enable the successful application of monocular 3D object detection on vehicle platforms.Monocular 3D-object-detection-based Pseudo-LiDAR is a low-cost,lowpower solution compared to LiDAR solutions in the field of autonomous driving.However,this technique has some problems,i.e.,(1)the poor quality of generated Pseudo-LiDAR point clouds resulting from the nonlinear error distribution of monocular depth estimation and(2)the weak representation capability of point cloud features due to the neglected global geometric structure features of point clouds existing in LiDAR-based 3D detection networks.Therefore,we proposed a Pseudo-LiDAR confidence sampling strategy and a hierarchical geometric feature extraction module for monocular 3D object detection.We first designed a point cloud confidence sampling strategy based on a 3D Gaussian distribution to assign small confidence to the points with great error in depth estimation and filter them out according to the confidence.Then,we present a hierarchical geometric feature extraction module by aggregating the local neighborhood features and a dual transformer to capture the global geometric features in the point cloud.Finally,our detection framework is based on Point-Voxel-RCNN(PV-RCNN)with high-quality Pseudo-LiDAR and enriched geometric features as input.From the experimental results,our method achieves satisfactory results in monocular 3D object detection.展开更多
LIDAR point cloud-based 3D object detection aims to sense the surrounding environment by anchoring objects with the Bounding Box(BBox).However,under the three-dimensional space of autonomous driving scenes,the previou...LIDAR point cloud-based 3D object detection aims to sense the surrounding environment by anchoring objects with the Bounding Box(BBox).However,under the three-dimensional space of autonomous driving scenes,the previous object detection methods,due to the pre-processing of the original LIDAR point cloud into voxels or pillars,lose the coordinate information of the original point cloud,slow detection speed,and gain inaccurate bounding box positioning.To address the issues above,this study proposes a new two-stage network structure to extract point cloud features directly by PointNet++,which effectively preserves the original point cloud coordinate information.To improve the detection accuracy,a shell-based modeling method is proposed.It roughly determines which spherical shell the coordinates belong to.Then,the results are refined to ground truth,thereby narrowing the localization range and improving the detection accuracy.To improve the recall of 3D object detection with bounding boxes,this paper designs a self-attention module for 3D object detection with a skip connection structure.Some of these features are highlighted by weighting them on the feature dimensions.After training,it makes the feature weights that are favorable for object detection get larger.Thus,the extracted features are more adapted to the object detection task.Extensive comparison experiments and ablation experiments conducted on the KITTI dataset verify the effectiveness of our proposed method in improving recall and precision.展开更多
Currently,worldwide industries and communities are concerned with building,expanding,and exploring the assets and resources found in the oceans and seas.More precisely,to analyze a stock,archaeology,and surveillance,s...Currently,worldwide industries and communities are concerned with building,expanding,and exploring the assets and resources found in the oceans and seas.More precisely,to analyze a stock,archaeology,and surveillance,sev-eral cameras are installed underseas to collect videos.However,on the other hand,these large size videos require a lot of time and memory for their processing to extract relevant information.Hence,to automate this manual procedure of video assessment,an accurate and efficient automated system is a greater necessity.From this perspective,we intend to present a complete framework solution for the task of video summarization and object detection in underwater videos.We employed a perceived motion energy(PME)method tofirst extract the keyframes followed by an object detection model approach namely YoloV3 to perform object detection in underwater videos.The issues of blurriness and low contrast in underwater images are also taken into account in the presented approach by applying the image enhancement method.Furthermore,the suggested framework of underwater video summarization and object detection has been evaluated on a publicly available brackish dataset.It is observed that the proposed framework shows good performance and hence ultimately assists several marine researchers or scientists related to thefield of underwater archaeology,stock assessment,and surveillance.展开更多
The exploration of building detection plays an important role in urban planning,smart city and military.Aiming at the problem of high overlapping ratio of detection frames for dense building detection in high resoluti...The exploration of building detection plays an important role in urban planning,smart city and military.Aiming at the problem of high overlapping ratio of detection frames for dense building detection in high resolution remote sensing images,we present an effective YOLOv3 framework,corner regression-based YOLOv3(Correg-YOLOv3),to localize dense building accurately.This improved YOLOv3 algorithm establishes a vertex regression mechanism and an additional loss item about building vertex offsets relative to the center point of bounding box.By extending output dimensions,the trained model is able to output the rectangular bounding boxes and the building vertices meanwhile.Finally,we evaluate the performance of the Correg-YOLOv3 on our self-produced data set and provide a comparative analysis qualitatively and quantitatively.The experimental results achieve high performance in precision(96.45%),recall rate(95.75%),F1 score(96.10%)and average precision(98.05%),which were 2.73%,5.4%,4.1%and 4.73%higher than that of YOLOv3.Therefore,our proposed algorithm effectively tackles the problem of dense building detection in high resolution images.展开更多
In order to solve difficult detection of far and hard objects due to the sparseness and insufficient semantic information of LiDAR point cloud,a 3D object detection network with multi-modal data adaptive fusion is pro...In order to solve difficult detection of far and hard objects due to the sparseness and insufficient semantic information of LiDAR point cloud,a 3D object detection network with multi-modal data adaptive fusion is proposed,which makes use of multi-neighborhood information of voxel and image information.Firstly,design an improved ResNet that maintains the structure information of far and hard objects in low-resolution feature maps,which is more suitable for detection task.Meanwhile,semantema of each image feature map is enhanced by semantic information from all subsequent feature maps.Secondly,extract multi-neighborhood context information with different receptive field sizes to make up for the defect of sparseness of point cloud which improves the ability of voxel features to represent the spatial structure and semantic information of objects.Finally,propose a multi-modal feature adaptive fusion strategy which uses learnable weights to express the contribution of different modal features to the detection task,and voxel attention further enhances the fused feature expression of effective target objects.The experimental results on the KITTI benchmark show that this method outperforms VoxelNet with remarkable margins,i.e.increasing the AP by 8.78%and 5.49%on medium and hard difficulty levels.Meanwhile,our method achieves greater detection performance compared with many mainstream multi-modal methods,i.e.outperforming the AP by 1%compared with that of MVX-Net on medium and hard difficulty levels.展开更多
Artificial Intelligence(AI)and Computer Vision(CV)advancements have led to many useful methodologies in recent years,particularly to help visually-challenged people.Object detection includes a variety of challenges,fo...Artificial Intelligence(AI)and Computer Vision(CV)advancements have led to many useful methodologies in recent years,particularly to help visually-challenged people.Object detection includes a variety of challenges,for example,handlingmultiple class images,images that get augmented when captured by a camera and so on.The test images include all these variants as well.These detection models alert them about their surroundings when they want to walk independently.This study compares four CNN-based pre-trainedmodels:ResidualNetwork(ResNet-50),Inception v3,DenseConvolutional Network(DenseNet-121),and SqueezeNet,predominantly used in image recognition applications.Based on the analysis performed on these test images,the study infers that Inception V3 outperformed other pre-trained models in terms of accuracy and speed.To further improve the performance of the Inception v3 model,the thermal exchange optimization(TEO)algorithm is applied to tune the hyperparameters(number of epochs,batch size,and learning rate)showing the novelty of the work.Better accuracy was achieved owing to the inclusion of an auxiliary classifier as a regularizer,hyperparameter optimizer,and factorization approach.Additionally,Inception V3 can handle images of different sizes.This makes Inception V3 the optimum model for assisting visually challenged people in real-world communication when integrated with Internet of Things(IoT)-based devices.展开更多
Light detection and ranging(LiDAR)sensors play a vital role in acquiring 3D point cloud data and extracting valuable information about objects for tasks such as autonomous driving,robotics,and virtual reality(VR).Howe...Light detection and ranging(LiDAR)sensors play a vital role in acquiring 3D point cloud data and extracting valuable information about objects for tasks such as autonomous driving,robotics,and virtual reality(VR).However,the sparse and disordered nature of the 3D point cloud poses significant challenges to feature extraction.Overcoming limitations is critical for 3D point cloud processing.3D point cloud object detection is a very challenging and crucial task,in which point cloud processing and feature extraction methods play a crucial role and have a significant impact on subsequent object detection performance.In this overview of outstanding work in object detection from the 3D point cloud,we specifically focus on summarizing methods employed in 3D point cloud processing.We introduce the way point clouds are processed in classical 3D object detection algorithms,and their improvements to solve the problems existing in point cloud processing.Different voxelization methods and point cloud sampling strategies will influence the extracted features,thereby impacting the final detection performance.展开更多
The perception of Bird's Eye View(BEV)has become a widely adopted approach in 3D object detection due to its spatial and dimensional consistency.However,the increasing complexity of neural network architectures ha...The perception of Bird's Eye View(BEV)has become a widely adopted approach in 3D object detection due to its spatial and dimensional consistency.However,the increasing complexity of neural network architectures has resulted in higher training memory,thereby limiting the scalability of model training.To address these challenges,we propose a novel model,RevFB-BEV,which is based on the Reversible Swin Transformer(RevSwin)with Forward-Backward View Transformation(FBVT)and LiDAR Guided Back Projection(LGBP).This approach includes the RevSwin backbone network,which employs a reversible architecture to minimise training memory by recomputing intermediate parameters.Moreover,we introduce the FBVT module that refines BEV features extracted from forward projection,yielding denser and more precise camera BEV representations.The LGBP module further utilises LiDAR BEV guidance for back projection to achieve more accurate camera BEV features.Extensive experiments on the nuScenes dataset demonstrate notable performance improvements,with our model achieving over a 4 x reduction in training memory and a more than 12x decrease in single-backbone training memory.These efficiency gains become even more pronounced with deeper network architectures.Additionally,RevFB-BEV achieves 68.1 mAP(mean Average Precision)on the validation set and 68.9 mAP on the test set,which is nearly on par with the baseline BEVFusion,underscoring its effectiveness in resource-constrained scenarios.展开更多
针对小尺度目标在检测时精确率低且易出现漏检和误检等问题,提出一种改进的YOLOv3(You Only Look Once version 3)小目标检测算法。在网络结构方面,为提高基础网络的特征提取能力,使用DenseNet-121密集连接网络替换原Darknet-53网络作...针对小尺度目标在检测时精确率低且易出现漏检和误检等问题,提出一种改进的YOLOv3(You Only Look Once version 3)小目标检测算法。在网络结构方面,为提高基础网络的特征提取能力,使用DenseNet-121密集连接网络替换原Darknet-53网络作为其基础网络,同时修改卷积核尺寸,进一步降低特征图信息的损耗,并且为增强检测模型对小尺度目标的鲁棒性,额外增加第4个尺寸为104×104像素的特征检测层;在对特征图融合操作方面,使用双线性插值法进行上采样操作代替原最近邻插值法上采样操作,解决大部分检测算法中存在的特征严重损失问题;在损失函数方面,使用广义交并比(GIoU)代替交并比(IoU)来计算边界框的损失值,同时引入Focal Loss焦点损失函数作为边界框的置信度损失函数。实验结果表明,改进算法在VisDrone2019数据集上的均值平均精度(mAP)为63.3%,较原始YOLOv3检测模型提高了13.2百分点,并且在GTX 1080 Ti设备上可实现52帧/s的检测速度,对小目标有着较好的检测性能。展开更多
基金supported by National Natural Science Foundation of China(61673186 and 61871196)Beijing Normal University Education Reform Project(jx2024040)Guangdong Undergraduate Universities Teaching Quality and Reform Project(jx2024309).
文摘Data augmentation plays an important role in boosting the performance of 3D models,while very few studies handle the 3D point cloud data with this technique.Global augmentation and cut-paste are commonly used augmentation techniques for point clouds,where global augmentation is applied to the entire point cloud of the scene,and cut-paste samples objects from other frames into the current frame.Both types of data augmentation can improve performance,but the cut-paste technique cannot effectively deal with the occlusion relationship between the foreground object and the background scene and the rationality of object sampling,which may be counterproductive and may hurt the overall performance.In addition,LiDAR is susceptible to signal loss,external occlusion,extreme weather and other factors,which can easily cause object shape changes,while global augmentation and cut-paste cannot effectively enhance the robustness of the model.To this end,we propose Syn-Aug,a synchronous data augmentation framework for LiDAR-based 3D object detection.Specifically,we first propose a novel rendering-based object augmentation technique(Ren-Aug)to enrich training data while enhancing scene realism.Second,we propose a local augmentation technique(Local-Aug)to generate local noise by rotating and scaling objects in the scene while avoiding collisions,which can improve generalisation performance.Finally,we make full use of the structural information of 3D labels to make the model more robust by randomly changing the geometry of objects in the training frames.We verify the proposed framework with four different types of 3D object detectors.Experimental results show that our proposed Syn-Aug significantly improves the performance of various 3D object detectors in the KITTI and nuScenes datasets,proving the effectiveness and generality of Syn-Aug.On KITTI,four different types of baseline models using Syn-Aug improved mAP by 0.89%,1.35%,1.61%and 1.14%respectively.On nuScenes,four different types of baseline models using Syn-Aug improved mAP by 14.93%,10.42%,8.47%and 6.81%respectively.The code is available at https://github.com/liuhuaijjin/Syn-Aug.
文摘Object detection in occluded environments remains a core challenge in computer vision(CV),especially in domains such as autonomous driving and robotics.While Convolutional Neural Network(CNN)-based twodimensional(2D)and three-dimensional(3D)object detection methods havemade significant progress,they often fall short under severe occlusion due to depth ambiguities in 2D imagery and the high cost and deployment limitations of 3D sensors such as Light Detection and Ranging(LiDAR).This paper presents a comparative review of recent 2D and 3D detection models,focusing on their occlusion-handling capabilities and the impact of sensor modalities such as stereo vision,Time-of-Flight(ToF)cameras,and LiDAR.In this context,we introduce FuDensityNet,our multimodal occlusion-aware detection framework that combines Red-Green-Blue(RGB)images and LiDAR data to enhance detection performance.As a forward-looking direction,we propose a monocular depth-estimation extension to FuDensityNet,aimed at replacing expensive 3D sensors with a more scalable CNN-based pipeline.Although this enhancement is not experimentally evaluated in this manuscript,we describe its conceptual design and potential for future implementation.
基金funded by the National Key R&D Program of China(Grant No.2022YFB4703400)the China Three Gorges Corporation(Grant No.2324020012)+2 种基金the National Natural Science Foundation of China(Grant No.62476080)the Jiangsu Province Natural Science Foundation(Grant No.BK20231186)Key Laboratory about Maritime Intelligent Network Information Technology of the Ministry of Education(EKLMIC202405).
文摘The inherent limitations of 2D object detection,such as inadequate spatial reasoning and susceptibility to environmental occlusions,pose significant risks to the safety and reliability of autonomous driving systems.To address these challenges,this paper proposes an enhanced 3D object detection framework(FastSECOND)based on an optimized SECOND architecture,designed to achieve rapid and accurate perception in autonomous driving scenarios.Key innovations include:(1)Replacing the Rectified Linear Unit(ReLU)activation functions with the Gaussian Error Linear Unit(GELU)during voxel feature encoding and region proposal network stages,leveraging partial convolution to balance computational efficiency and detection accuracy;(2)Integrating a Swin-Transformer V2 module into the voxel backbone network to enhance feature extraction capabilities in sparse data;and(3)Introducing an optimized position regression loss combined with a geometry-aware Focal-EIoU loss function,which incorporates bounding box geometric correlations to accelerate network convergence.While this study currently focuses exclusively on the detection of the Car category,with experiments conducted on the Car class of the KITTI dataset,future work will extend to other categories such as Pedestrian and Cyclist to more comprehensively evaluate the generalization capability of the proposed framework.Extensive experimental results demonstrate that our framework achieves a more effective trade-off between detection accuracy and speed.Compared to the baseline SECOND model,it achieves a 21.9%relative improvement in 3D bounding box detection accuracy on the hard subset,while reducing inference time by 14 ms.These advancements underscore the framework’s potential for enabling real-time,high-precision perception in autonomous driving applications.
基金funded by the Yangtze River Delta Science and Technology Innovation Community Joint Research Project(2023CSJGG1600)the Natural Science Foundation of Anhui Province(2208085MF173)Wuhu“ChiZhu Light”Major Science and Technology Project(2023ZD01,2023ZD03).
文摘As the number and complexity of sensors in autonomous vehicles continue to rise,multimodal fusionbased object detection algorithms are increasingly being used to detect 3D environmental information,significantly advancing the development of perception technology in autonomous driving.To further promote the development of fusion algorithms and improve detection performance,this paper discusses the advantages and recent advancements of multimodal fusion-based object detection algorithms.Starting fromsingle-modal sensor detection,the paper provides a detailed overview of typical sensors used in autonomous driving and introduces object detection methods based on images and point clouds.For image-based detection methods,they are categorized into monocular detection and binocular detection based on different input types.For point cloud-based detection methods,they are classified into projection-based,voxel-based,point cluster-based,pillar-based,and graph structure-based approaches based on the technical pathways for processing point cloud features.Additionally,multimodal fusion algorithms are divided into Camera-LiDAR fusion,Camera-Radar fusion,Camera-LiDAR-Radar fusion,and other sensor fusion methods based on the types of sensors involved.Furthermore,the paper identifies five key future research directions in this field,aiming to provide insights for researchers engaged in multimodal fusion-based object detection algorithms and to encourage broader attention to the research and application of multimodal fusion-based object detection.
基金supported by a grant from the National Key Research and Development Project(2023YFB4302100)Key Research and Development Project of Jiangxi Province(No.20232ACE01011)Independent Deployment Project of Ganjiang Innovation Research Institute,Chinese Academy of Sciences(E255J001).
文摘Aiming at the limitations of the existing railway foreign object detection methods based on two-dimensional(2D)images,such as short detection distance,strong influence of environment and lack of distance information,we propose Rail-PillarNet,a three-dimensional(3D)LIDAR(Light Detection and Ranging)railway foreign object detection method based on the improvement of PointPillars.Firstly,the parallel attention pillar encoder(PAPE)is designed to fully extract the features of the pillars and alleviate the problem of local fine-grained information loss in PointPillars pillars encoder.Secondly,a fine backbone network is designed to improve the feature extraction capability of the network by combining the coding characteristics of LIDAR point cloud feature and residual structure.Finally,the initial weight parameters of the model were optimised by the transfer learning training method to further improve accuracy.The experimental results on the OSDaR23 dataset show that the average accuracy of Rail-PillarNet reaches 58.51%,which is higher than most mainstream models,and the number of parameters is 5.49 M.Compared with PointPillars,the accuracy of each target is improved by 10.94%,3.53%,16.96%and 19.90%,respectively,and the number of parameters only increases by 0.64M,which achieves a balance between the number of parameters and accuracy.
基金supported by the National Natural Science Foundation of China(GrantNo.62302086)the Natural Science Foundation of Liaoning Province(Grant No.2023-MSBA-070)the Fundamental Research Funds for the Central Universities(Grant No.N2317005).
文摘Multi-modal 3D object detection has achieved remarkable progress,but it is often limited in practical industrial production because of its high cost and low efficiency.The multi-view camera-based method provides a feasible solution due to its low cost.However,camera data lacks geometric depth,and only using camera data to obtain high accuracy is challenging.This paper proposes a multi-modal Bird-Eye-View(BEV)distillation framework(MMDistill)to make a trade-off between them.MMDistill is a carefully crafted two-stage distillation framework based on teacher and student models for learning cross-modal knowledge and generating multi-modal features.It can improve the performance of unimodal detectors without introducing additional costs during inference.Specifically,our method can effectively solve the cross-gap caused by the heterogeneity between data.Furthermore,we further propose a Light Detection and Ranging(LiDAR)-guided geometric compensation module,which can assist the student model in obtaining effective geometric features and reduce the gap between different modalities.Our proposed method generally requires fewer computational resources and faster inference speed than traditional multi-modal models.This advancement enables multi-modal technology to be applied more widely in practical scenarios.Through experiments,we validate the effectiveness and superiority of MMDistill on the nuScenes dataset,achieving an improvement of 4.1%mean Average Precision(mAP)and 4.6%NuScenes Detection Score(NDS)over the baseline detector.In addition,we also present detailed ablation studies to validate our method.
基金supported in part by the Major Project for New Generation of AI (2018AAA0100400)the National Natural Science Foundation of China (61836014,U21B2042,62072457,62006231)the InnoHK Program。
文摘Monocular 3D object detection is challenging due to the lack of accurate depth information.Some methods estimate the pixel-wise depth maps from off-the-shelf depth estimators and then use them as an additional input to augment the RGB images.Depth-based methods attempt to convert estimated depth maps to pseudo-LiDAR and then use LiDAR-based object detectors or focus on the perspective of image and depth fusion learning.However,they demonstrate limited performance and efficiency as a result of depth inaccuracy and complex fusion mode with convolutions.Different from these approaches,our proposed depth-guided vision transformer with a normalizing flows(NF-DVT)network uses normalizing flows to build priors in depth maps to achieve more accurate depth information.Then we develop a novel Swin-Transformer-based backbone with a fusion module to process RGB image patches and depth map patches with two separate branches and fuse them using cross-attention to exchange information with each other.Furthermore,with the help of pixel-wise relative depth values in depth maps,we develop new relative position embeddings in the cross-attention mechanism to capture more accurate sequence ordering of input tokens.Our method is the first Swin-Transformer-based backbone architecture for monocular 3D object detection.The experimental results on the KITTI and the challenging Waymo Open datasets show the effectiveness of our proposed method and superior performance over previous counterparts.
基金Supported by the National Natural Science Foundation of China(61772328,61802253,61831018)
文摘3D object detection is one of the most challenging research tasks in computer vision. In order to solve the problem of template information dependency of 3D object proposal in the method of 3D object detection based on 2.5D information, we proposed a 3D object detector based on fusion of vanishing point and prior orientation, which estimates an accurate 3D proposal from 2.5D data, and provides an excellent start point for 3D object classification and localization. The algorithm first calculates three mutually orthogonal vanishing points by the Euler angle principle and projects them into the pixel coordinate system. Then, the top edge of the 2D proposal is sampled by the preset sampling pitch, and the first one vertex is taken. Finally, the remaining seven vertices of the 3D proposal are calculated according to the linear relationship between the three vanishing points and the vertices, and the complete information of the 3D proposal is obtained. The experimental results show that this proposed method improves the Mean Average Precision score by 2.7% based on the Amodal3Det method.
基金National Natural Science Foundation of China(No.61806006)Innovation Program for Graduate of Jiangsu Province(No.KYLX160-781)University Superior Discipline Construction Project of Jiangsu Province。
文摘The self-attention networks and Transformer have dominated machine translation and natural language processing fields,and shown great potential in image vision tasks such as image classification and object detection.Inspired by the great progress of Transformer,we propose a novel general and robust voxel feature encoder for 3D object detection based on the traditional Transformer.We first investigate the permutation invariance of sequence data of the self-attention and apply it to point cloud processing.Then we construct a voxel feature layer based on the self-attention to adaptively learn local and robust context of a voxel according to the spatial relationship and context information exchanging between all points within the voxel.Lastly,we construct a general voxel feature learning framework with the voxel feature layer as the core for 3D object detection.The voxel feature with Transformer(VFT)can be plugged into any other voxel-based 3D object detection framework easily,and serves as the backbone for voxel feature extractor.Experiments results on the KITTI dataset demonstrate that our method achieves the state-of-the-art performance on 3D object detection.
基金Supported by the National Natural Science Foundation of China(61603242)Collaborative Innovation Center for Economics Crime Investigation and Prevention Technology of Jiangxi Province(JXJZXTCX-030)
文摘Nowadays, 3D object detection, which uses the color and depth information to find object localization in the 3D world and estimate their physical size and pose, is one of the most important 3D perception tasks in the field of computer vision. In order to solve the problem of mixed segmentation results when multiple instances appear in one frustum in the F-PointNet method and in the occlusion that leads to the loss of depth information, a 3D object detection approach based on instance segmentation and image restoration is proposed in this paper. Firstly, instance segmentation with Mask R-CNN on an RGB image is used to avoid mixed segmentation results. Secondly, for the detected occluded objects, we remove the occluding object first in the depth map and then restore the empty pixel region by utilizing the Criminisi Algorithm to recover the missing depth information of the object. The experimental results show that the proposed method improves the average precision score compared with the F-PointNet method.
基金The authors would like to thank the financial support of Natural Science Foundation of Anhui Province(No.2208085MF173)the key research and development projects of Anhui(202104a05020003)+2 种基金the anhui development and reform commission supports R&D and innovation project([2020]479)the national natural science foundation of China(51575001)Anhui university scientific research platform innovation team building project(2016-2018).
文摘In complex traffic environment scenarios,it is very important for autonomous vehicles to accurately perceive the dynamic information of other vehicles around the vehicle in advance.The accuracy of 3D object detection will be affected by problems such as illumination changes,object occlusion,and object detection distance.To this purpose,we face these challenges by proposing a multimodal feature fusion network for 3D object detection(MFF-Net).In this research,this paper first uses the spatial transformation projection algorithm to map the image features into the feature space,so that the image features are in the same spatial dimension when fused with the point cloud features.Then,feature channel weighting is performed using an adaptive expression augmentation fusion network to enhance important network features,suppress useless features,and increase the directionality of the network to features.Finally,this paper increases the probability of false detection and missed detection in the non-maximum suppression algo-rithm by increasing the one-dimensional threshold.So far,this paper has constructed a complete 3D target detection network based on multimodal feature fusion.The experimental results show that the proposed achieves an average accuracy of 82.60%on the Karlsruhe Institute of Technology and Toyota Technological Institute(KITTI)dataset,outperforming previous state-of-the-art multimodal fusion networks.In Easy,Moderate,and hard evaluation indicators,the accuracy rate of this paper reaches 90.96%,81.46%,and 75.39%.This shows that the MFF-Net network has good performance in 3D object detection.
基金supported by the National Key Research and Development Program of China(2020YFB1807500)the National Natural Science Foundation of China(62072360,62001357,62172438,61901367)+4 种基金the key research and development plan of Shaanxi province(2021ZDLGY02-09,2023-GHZD-44,2023-ZDLGY-54)the Natural Science Foundation of Guangdong Province of China(2022A1515010988)Key Project on Artificial Intelligence of Xi'an Science and Technology Plan(2022JH-RGZN-0003,2022JH-RGZN-0103,2022JH-CLCJ-0053)Xi'an Science and Technology Plan(20RGZN0005)the Proof-ofconcept fund from Hangzhou Research Institute of Xidian University(GNYZ2023QC0201).
文摘The high bandwidth and low latency of 6G network technology enable the successful application of monocular 3D object detection on vehicle platforms.Monocular 3D-object-detection-based Pseudo-LiDAR is a low-cost,lowpower solution compared to LiDAR solutions in the field of autonomous driving.However,this technique has some problems,i.e.,(1)the poor quality of generated Pseudo-LiDAR point clouds resulting from the nonlinear error distribution of monocular depth estimation and(2)the weak representation capability of point cloud features due to the neglected global geometric structure features of point clouds existing in LiDAR-based 3D detection networks.Therefore,we proposed a Pseudo-LiDAR confidence sampling strategy and a hierarchical geometric feature extraction module for monocular 3D object detection.We first designed a point cloud confidence sampling strategy based on a 3D Gaussian distribution to assign small confidence to the points with great error in depth estimation and filter them out according to the confidence.Then,we present a hierarchical geometric feature extraction module by aggregating the local neighborhood features and a dual transformer to capture the global geometric features in the point cloud.Finally,our detection framework is based on Point-Voxel-RCNN(PV-RCNN)with high-quality Pseudo-LiDAR and enriched geometric features as input.From the experimental results,our method achieves satisfactory results in monocular 3D object detection.
基金This work was supported,in part,by the National Nature Science Foundation of China under grant numbers 62272236in part,by the Natural Science Foundation of Jiangsu Province under grant numbers BK20201136,BK20191401in part,by the Priority Academic Program Development of Jiangsu Higher Education Institutions(PAPD)fund.
文摘LIDAR point cloud-based 3D object detection aims to sense the surrounding environment by anchoring objects with the Bounding Box(BBox).However,under the three-dimensional space of autonomous driving scenes,the previous object detection methods,due to the pre-processing of the original LIDAR point cloud into voxels or pillars,lose the coordinate information of the original point cloud,slow detection speed,and gain inaccurate bounding box positioning.To address the issues above,this study proposes a new two-stage network structure to extract point cloud features directly by PointNet++,which effectively preserves the original point cloud coordinate information.To improve the detection accuracy,a shell-based modeling method is proposed.It roughly determines which spherical shell the coordinates belong to.Then,the results are refined to ground truth,thereby narrowing the localization range and improving the detection accuracy.To improve the recall of 3D object detection with bounding boxes,this paper designs a self-attention module for 3D object detection with a skip connection structure.Some of these features are highlighted by weighting them on the feature dimensions.After training,it makes the feature weights that are favorable for object detection get larger.Thus,the extracted features are more adapted to the object detection task.Extensive comparison experiments and ablation experiments conducted on the KITTI dataset verify the effectiveness of our proposed method in improving recall and precision.
基金supported by the National Research Foundation of Korea(NRF)grant funded by the Korea government(MSIT)(No.2020R1G1A1099559).
文摘Currently,worldwide industries and communities are concerned with building,expanding,and exploring the assets and resources found in the oceans and seas.More precisely,to analyze a stock,archaeology,and surveillance,sev-eral cameras are installed underseas to collect videos.However,on the other hand,these large size videos require a lot of time and memory for their processing to extract relevant information.Hence,to automate this manual procedure of video assessment,an accurate and efficient automated system is a greater necessity.From this perspective,we intend to present a complete framework solution for the task of video summarization and object detection in underwater videos.We employed a perceived motion energy(PME)method tofirst extract the keyframes followed by an object detection model approach namely YoloV3 to perform object detection in underwater videos.The issues of blurriness and low contrast in underwater images are also taken into account in the presented approach by applying the image enhancement method.Furthermore,the suggested framework of underwater video summarization and object detection has been evaluated on a publicly available brackish dataset.It is observed that the proposed framework shows good performance and hence ultimately assists several marine researchers or scientists related to thefield of underwater archaeology,stock assessment,and surveillance.
基金National Natural Science Foundation of China(No.41871305)National Key Research and Development Program of China(No.2017YFC0602204)+2 种基金Fundamental Research Funds for the Central Universities,China University of Geosciences(Wuhan)(No.CUGQY1945)Open Fund of Key Laboratory of Geological Survey and Evaluation of Ministry of Education and the Fundamental Research Funds for the Central Universities(No.GLAB2019ZR02)Open Fund of Laboratory of Urban Land Resources Monitoring and Simulation,Ministry of Natural Resources,China(No.KF-2020-05-068)。
文摘The exploration of building detection plays an important role in urban planning,smart city and military.Aiming at the problem of high overlapping ratio of detection frames for dense building detection in high resolution remote sensing images,we present an effective YOLOv3 framework,corner regression-based YOLOv3(Correg-YOLOv3),to localize dense building accurately.This improved YOLOv3 algorithm establishes a vertex regression mechanism and an additional loss item about building vertex offsets relative to the center point of bounding box.By extending output dimensions,the trained model is able to output the rectangular bounding boxes and the building vertices meanwhile.Finally,we evaluate the performance of the Correg-YOLOv3 on our self-produced data set and provide a comparative analysis qualitatively and quantitatively.The experimental results achieve high performance in precision(96.45%),recall rate(95.75%),F1 score(96.10%)and average precision(98.05%),which were 2.73%,5.4%,4.1%and 4.73%higher than that of YOLOv3.Therefore,our proposed algorithm effectively tackles the problem of dense building detection in high resolution images.
基金National Youth Natural Science Foundation of China(No.61806006)Innovation Program for Graduate of Jiangsu Province(No.KYLX160-781)Jiangsu University Superior Discipline Construction Project。
文摘In order to solve difficult detection of far and hard objects due to the sparseness and insufficient semantic information of LiDAR point cloud,a 3D object detection network with multi-modal data adaptive fusion is proposed,which makes use of multi-neighborhood information of voxel and image information.Firstly,design an improved ResNet that maintains the structure information of far and hard objects in low-resolution feature maps,which is more suitable for detection task.Meanwhile,semantema of each image feature map is enhanced by semantic information from all subsequent feature maps.Secondly,extract multi-neighborhood context information with different receptive field sizes to make up for the defect of sparseness of point cloud which improves the ability of voxel features to represent the spatial structure and semantic information of objects.Finally,propose a multi-modal feature adaptive fusion strategy which uses learnable weights to express the contribution of different modal features to the detection task,and voxel attention further enhances the fused feature expression of effective target objects.The experimental results on the KITTI benchmark show that this method outperforms VoxelNet with remarkable margins,i.e.increasing the AP by 8.78%and 5.49%on medium and hard difficulty levels.Meanwhile,our method achieves greater detection performance compared with many mainstream multi-modal methods,i.e.outperforming the AP by 1%compared with that of MVX-Net on medium and hard difficulty levels.
基金Princess Nourah bint Abdulrahman University Researchers Supporting Project number(PNURSP2023R191)Princess Nourah bint Abdulrahman University,Riyadh,Saudi Arabia.The authors would like to thank the Deanship of Scientific Research at Umm Al-Qura University for supporting this work by Grant Code:(22UQU4310373DSR61)This study is supported via funding from Prince Sattam bin Abdulaziz University project number(PSAU/2023/R/1444).
文摘Artificial Intelligence(AI)and Computer Vision(CV)advancements have led to many useful methodologies in recent years,particularly to help visually-challenged people.Object detection includes a variety of challenges,for example,handlingmultiple class images,images that get augmented when captured by a camera and so on.The test images include all these variants as well.These detection models alert them about their surroundings when they want to walk independently.This study compares four CNN-based pre-trainedmodels:ResidualNetwork(ResNet-50),Inception v3,DenseConvolutional Network(DenseNet-121),and SqueezeNet,predominantly used in image recognition applications.Based on the analysis performed on these test images,the study infers that Inception V3 outperformed other pre-trained models in terms of accuracy and speed.To further improve the performance of the Inception v3 model,the thermal exchange optimization(TEO)algorithm is applied to tune the hyperparameters(number of epochs,batch size,and learning rate)showing the novelty of the work.Better accuracy was achieved owing to the inclusion of an auxiliary classifier as a regularizer,hyperparameter optimizer,and factorization approach.Additionally,Inception V3 can handle images of different sizes.This makes Inception V3 the optimum model for assisting visually challenged people in real-world communication when integrated with Internet of Things(IoT)-based devices.
文摘Light detection and ranging(LiDAR)sensors play a vital role in acquiring 3D point cloud data and extracting valuable information about objects for tasks such as autonomous driving,robotics,and virtual reality(VR).However,the sparse and disordered nature of the 3D point cloud poses significant challenges to feature extraction.Overcoming limitations is critical for 3D point cloud processing.3D point cloud object detection is a very challenging and crucial task,in which point cloud processing and feature extraction methods play a crucial role and have a significant impact on subsequent object detection performance.In this overview of outstanding work in object detection from the 3D point cloud,we specifically focus on summarizing methods employed in 3D point cloud processing.We introduce the way point clouds are processed in classical 3D object detection algorithms,and their improvements to solve the problems existing in point cloud processing.Different voxelization methods and point cloud sampling strategies will influence the extracted features,thereby impacting the final detection performance.
基金supported by the Baima Lake Laboratory Joint Funds of the Zhejiang Provincial Natural Science Foundation of China under Grant LBMHD25F030001in part by NSFC No.62088101+1 种基金The authors certify that there are no competing financial interests or personal relationships influencing this work.Financial support originated exclusively from public research funds:(1)Baima Lake Laboratory Joint Funds(Zhejiang Provincial NSF,China)Grant LBMHD25F030001National Natural Science Foundation of China Grant 62088101 for the'Autonomous Intelligent Unmanned Systems'project.
文摘The perception of Bird's Eye View(BEV)has become a widely adopted approach in 3D object detection due to its spatial and dimensional consistency.However,the increasing complexity of neural network architectures has resulted in higher training memory,thereby limiting the scalability of model training.To address these challenges,we propose a novel model,RevFB-BEV,which is based on the Reversible Swin Transformer(RevSwin)with Forward-Backward View Transformation(FBVT)and LiDAR Guided Back Projection(LGBP).This approach includes the RevSwin backbone network,which employs a reversible architecture to minimise training memory by recomputing intermediate parameters.Moreover,we introduce the FBVT module that refines BEV features extracted from forward projection,yielding denser and more precise camera BEV representations.The LGBP module further utilises LiDAR BEV guidance for back projection to achieve more accurate camera BEV features.Extensive experiments on the nuScenes dataset demonstrate notable performance improvements,with our model achieving over a 4 x reduction in training memory and a more than 12x decrease in single-backbone training memory.These efficiency gains become even more pronounced with deeper network architectures.Additionally,RevFB-BEV achieves 68.1 mAP(mean Average Precision)on the validation set and 68.9 mAP on the test set,which is nearly on par with the baseline BEVFusion,underscoring its effectiveness in resource-constrained scenarios.
文摘针对小尺度目标在检测时精确率低且易出现漏检和误检等问题,提出一种改进的YOLOv3(You Only Look Once version 3)小目标检测算法。在网络结构方面,为提高基础网络的特征提取能力,使用DenseNet-121密集连接网络替换原Darknet-53网络作为其基础网络,同时修改卷积核尺寸,进一步降低特征图信息的损耗,并且为增强检测模型对小尺度目标的鲁棒性,额外增加第4个尺寸为104×104像素的特征检测层;在对特征图融合操作方面,使用双线性插值法进行上采样操作代替原最近邻插值法上采样操作,解决大部分检测算法中存在的特征严重损失问题;在损失函数方面,使用广义交并比(GIoU)代替交并比(IoU)来计算边界框的损失值,同时引入Focal Loss焦点损失函数作为边界框的置信度损失函数。实验结果表明,改进算法在VisDrone2019数据集上的均值平均精度(mAP)为63.3%,较原始YOLOv3检测模型提高了13.2百分点,并且在GTX 1080 Ti设备上可实现52帧/s的检测速度,对小目标有着较好的检测性能。