Object detection in occluded environments remains a core challenge in computer vision (CV), especially in domains such as autonomous driving and robotics. While Convolutional Neural Network (CNN)-based two-dimensional (2D) and three-dimensional (3D) object detection methods have made significant progress, they often fall short under severe occlusion due to depth ambiguities in 2D imagery and the high cost and deployment limitations of 3D sensors such as Light Detection and Ranging (LiDAR). This paper presents a comparative review of recent 2D and 3D detection models, focusing on their occlusion-handling capabilities and the impact of sensor modalities such as stereo vision, Time-of-Flight (ToF) cameras, and LiDAR. In this context, we introduce FuDensityNet, our multimodal occlusion-aware detection framework that combines Red-Green-Blue (RGB) images and LiDAR data to enhance detection performance. As a forward-looking direction, we propose a monocular depth-estimation extension to FuDensityNet, aimed at replacing expensive 3D sensors with a more scalable CNN-based pipeline. Although this enhancement is not experimentally evaluated in this manuscript, we describe its conceptual design and potential for future implementation.
Self-supervised monocular depth estimation has emerged as a major research focus in recent years, primarily due to the elimination of ground-truth depth dependence. However, the prevailing architectures in this domain suffer from inherent limitations: existing pose network branches infer camera ego-motion exclusively under static-scene and Lambertian-surface assumptions. These assumptions are often violated in real-world scenarios due to dynamic objects, non-Lambertian reflectance, and unstructured background elements, leading to pervasive artifacts such as depth discontinuities ("holes"), structural collapse, and ambiguous reconstruction. To address these challenges, we propose a novel framework that integrates scene dynamic pose estimation into the conventional self-supervised depth network, enhancing its ability to model complex scene dynamics. Our contributions are threefold: (1) a pixel-wise dynamic pose estimation module that jointly resolves the pose transformations of moving objects and localized scene perturbations; (2) a physically-informed loss function that couples dynamic pose and depth predictions, designed to mitigate depth errors arising from high-speed distant objects and geometrically inconsistent motion profiles; (3) an efficient SE(3) transformation parameterization that streamlines network complexity and temporal pre-processing. Extensive experiments on the KITTI and NYU-V2 benchmarks show that our framework achieves state-of-the-art performance in both quantitative metrics and qualitative visual fidelity, significantly improving the robustness and generalization of monocular depth estimation under dynamic conditions.
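To make the third contribution concrete, the sketch below shows a standard SE(3) exponential map that turns a 6-vector twist into a 4x4 rigid transform, the kind of parameterization a pose branch typically regresses. The paper's exact streamlined variant is not given in the abstract, so every detail here is a generic assumption.

```python
import numpy as np

def se3_exp(xi, eps=1e-8):
    """Map a 6-vector twist xi = (v, w) in se(3) to a 4x4 rigid transform.

    A standard exponential-map parameterization; the paper's streamlined
    variant is not specified in the abstract."""
    v, w = xi[:3], xi[3:]
    theta = np.linalg.norm(w)
    K = np.array([[0.0, -w[2], w[1]],
                  [w[2], 0.0, -w[0]],
                  [-w[1], w[0], 0.0]])          # hat(w)
    if theta < eps:
        R = np.eye(3) + K                       # first-order approximation
        V = np.eye(3) + 0.5 * K
    else:
        A = np.sin(theta) / theta
        B = (1 - np.cos(theta)) / theta**2
        C = (1 - A) / theta**2
        R = np.eye(3) + A * K + B * (K @ K)     # Rodrigues' formula
        V = np.eye(3) + B * K + C * (K @ K)     # left Jacobian of SO(3)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ v
    return T

# Example: a small forward motion with a slight yaw.
print(se3_exp(np.array([0.1, 0.0, 1.0, 0.0, 0.02, 0.0])))
```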
Precise and robust three-dimensional object detection (3DOD) presents a promising opportunity in the field of mobile robot (MR) navigation. Monocular 3DOD techniques typically involve extending existing two-dimensional object detection (2DOD) frameworks to predict the three-dimensional bounding box (3DBB) of objects captured in 2D RGB images. However, these methods often require multiple images, making them less feasible for various real-time scenarios. To address these challenges, the emergence of agile convolutional neural networks (CNNs) capable of inferring depth from a single image opens a new avenue for investigation. This paper proposes a novel ELDENet network designed to produce cost-effective 3D Bounding Box Estimation (3D-BBE) from a single image. The framework comprises PP-LCNet as the encoder and a fast convolutional decoder. Additionally, this integration includes a Squeeze-Exploit (SE) module utilizing the Math Kernel Library for Deep Neural Networks (MKLDNN) optimizer to enhance convolutional efficiency and streamline model size during training. Meanwhile, the proposed multi-scale sub-pixel decoder generates high-quality depth maps while maintaining a compact structure. Furthermore, the generated depth maps provide a clear perspective with distance details of objects in the environment. These depth insights are combined with 2DOD for precise estimation of 3D bounding boxes, facilitating scene understanding and optimal route planning for mobile robots. Based on the estimated object center of the 3DBB, a Deep Reinforcement Learning (DRL)-based obstacle avoidance strategy for MRs is developed. Experimental results demonstrate that our model achieves state-of-the-art performance across three datasets: NYU-V2, KITTI, and Cityscapes. Overall, this framework shows significant potential for adaptation in intelligent mechatronic systems, particularly in developing knowledge-driven systems for mobile robot navigation.
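The abstract combines estimated depth with 2D detections to place 3D boxes. A minimal pinhole back-projection shows how a detected box center plus a depth value yields a 3D object center; the intrinsics and pixel values below are hypothetical, KITTI-like numbers, not taken from the paper.

```python
import numpy as np

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift a pixel (u, v) with metric depth to a 3D camera-frame point
    using the pinhole model: X = (u-cx)*z/fx, Y = (v-cy)*z/fy, Z = z."""
    z = float(depth)
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

# Hypothetical KITTI-like intrinsics and a detected box center at 11.6 m.
fx, fy, cx, cy = 721.5, 721.5, 609.6, 172.9
center_3d = backproject(650.0, 180.0, 11.6, fx, fy, cx, cy)
print(center_3d)  # approximate 3D object center used for route planning
```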
Depth maps play a crucial role in various practical applications such as computer vision, augmented reality, and autonomous driving. Obtaining clear and accurate depth information in video depth estimation is a significant challenge in the field of computer vision: existing monocular video depth estimation models tend to produce blurred or inaccurate depth information in regions with object edges and low texture. To address this issue, we propose a monocular depth estimation model architecture guided by semantic segmentation masks, which introduces semantic information into the model to correct ambiguous depth regions. We have evaluated the proposed method, and experimental results show that it improves the accuracy of edge depth, demonstrating the effectiveness of our approach.
A method of source depth estimation based on the multi-path time delay difference is proposed. When the minimum-time arrivals at all receiver depths are snapped to a common time on the time delay-depth plane, the time delay curves of the surface-bottom and bottom-surface reflections intersect at the source depth. At least two hydrophones deployed vertically with a certain interval are required. If the receiver depths are known, the pair of time delays can be used to estimate the source depth. With the proposed method, the source depth can be estimated successfully at moderate ranges in the deep ocean, without complicated matched-field calculations, in both simulations and experiments.
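A minimal sketch of the underlying geometry, assuming a constant sound speed and the image method for reflections: it computes arrival times of the direct, surface-bottom, and bottom-surface paths at two vertically separated receivers. All depths, the range, and the sound speed are illustrative values, and the snapping/intersection step on the time delay-depth plane is not reproduced.

```python
import numpy as np

def multipath_delays(zs, zr, r, D, c=1500.0):
    """Arrival times of the first few ray paths for a source at depth zs
    and a receiver at depth zr (horizontal range r, water depth D), using
    the isovelocity image method; each reflection sequence corresponds to
    an image source at a mirrored depth."""
    images = {
        "direct":         zs,
        "surface":        -zs,
        "bottom":         2 * D - zs,
        "surface-bottom": 2 * D + zs,
        "bottom-surface": -2 * D + zs,
    }
    return {k: np.hypot(r, zr - z) / c for k, z in images.items()}

# Two vertically separated hydrophones, as the method requires.
for zr in (300.0, 500.0):
    t = multipath_delays(zs=150.0, zr=zr, r=10_000.0, D=4000.0)
    t0 = t["direct"]
    print(zr, {k: round(v - t0, 4) for k, v in t.items()})  # delays re direct
```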
Depth estimation of subsurface faults is one of the problems in gravity interpretation. We tried using the support vector classifier (SVC) method in the estimation. Using forward and nonlinear inverse techniques, detecting the depth of subsurface faults with an associated error is possible, but an initial guess for the depth is required, and this initial guess usually comes from non-gravity data. In this paper we introduce SVC as a tool for estimating the depth of subsurface faults using gravity data. Each subsurface fault depth can be treated as a class, with SVC as the classification algorithm. To use the SVC algorithm effectively, we select proper depth-estimation features using a feature selection (FS) algorithm. In this research, we produce a training set of synthetic gravity profiles created by subsurface faults at different depths to train the SVC code to estimate the depth of real subsurface faults. We then test the trained SVC code on a testing set of other synthetic gravity profiles created by subsurface faults at different depths, and also on real data.
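A toy version of this workflow, assuming a standard thin-sheet fault forward model and scikit-learn's SVC. The paper's feature selection step is omitted and raw profile samples are used as features, so this only sketches the idea that each fault depth is a class.

```python
import numpy as np
from sklearn.svm import SVC

G = 6.674e-11  # gravitational constant, m^3 kg^-1 s^-2

def fault_profile(x, depth, rho=400.0, t=100.0):
    """Gravity anomaly (mGal) over a thin faulted sheet at a given depth:
    g(x) = 2*G*rho*t*(pi/2 + arctan(x/z)) -- a standard thin-sheet model,
    used here only to synthesize training profiles."""
    return 2 * G * rho * t * (np.pi / 2 + np.arctan(x / depth)) * 1e5

x = np.linspace(-5000, 5000, 101)
depths = [500.0, 1000.0, 1500.0, 2000.0]      # each depth is one class
rng = np.random.default_rng(0)

X_train = np.array([fault_profile(x, d) + rng.normal(0, 0.02, x.size)
                    for d in depths for _ in range(50)])
y_train = np.repeat(depths, 50)

clf = SVC(kernel="rbf", C=10.0).fit(X_train, y_train)

test = fault_profile(x, 1000.0) + rng.normal(0, 0.02, x.size)
print("estimated depth:", clf.predict(test[None, :])[0])  # should recover 1000.0
```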
Monocular depth estimation is a basic task in computer vision. Its accuracy has improved tremendously over the past decade with the development of deep learning. However, blurry boundaries in the depth map remain a serious problem. Researchers find that boundary blur is mainly caused by two factors. First, low-level features containing boundary and structure information may be lost in deep networks during the convolution process. Second, the model ignores the errors introduced by the boundary area during backpropagation, because the boundary occupies only a small portion of the whole image. Focusing on these factors, two countermeasures are proposed to mitigate the boundary blur problem. Firstly, we design a scene understanding module and a scale transform module to build a lightweight fused feature pyramid, which deals effectively with low-level feature loss. Secondly, we propose a boundary-aware depth loss function that attends to the depth values of boundary pixels. Extensive experiments show that our method can predict depth maps with clearer boundaries, and its depth accuracy on NYU-Depth V2, SUN RGB-D, and iBims-1 is competitive.
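A sketch of the second countermeasure: a per-pixel L1 depth loss re-weighted near boundaries, with boundaries taken from a Sobel filter on the ground truth. The paper's actual loss formulation is not specified in the abstract, so the weighting scheme below is an assumption.

```python
import torch
import torch.nn.functional as F

def boundary_aware_l1(pred, gt, boost=5.0):
    """Per-pixel L1 loss that up-weights pixels near depth boundaries.
    Boundaries are detected with a Sobel filter on the ground truth;
    pred, gt: (B, 1, H, W) depth maps. The boost factor is an assumed
    hyperparameter."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    ky = kx.t()
    sobel = torch.stack([kx, ky]).unsqueeze(1).to(gt)        # (2,1,3,3)
    grad = F.conv2d(gt, sobel, padding=1)                    # (B,2,H,W)
    edge = grad.norm(dim=1, keepdim=True)
    weight = 1.0 + boost * edge / (edge.amax(dim=(2, 3), keepdim=True) + 1e-6)
    return (weight * (pred - gt).abs()).mean()

pred = torch.rand(2, 1, 64, 64)
gt = torch.rand(2, 1, 64, 64)
print(boundary_aware_l1(pred, gt))
```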
Depth estimation has been an active research area with the development of stereo vision in recent years. It is one of the key technologies for handling the large data volumes of stereo vision communication. Depth estimation still faces problems such as occlusion, fuzzy edges, and real-time processing. Many software-based algorithms have been proposed, but the performance of computer configurations limits their processing speed. The alternative is hardware design: the great advances in digital signal processors (DSPs), application-specific integrated circuits (ASICs), and field-programmable gate arrays (FPGAs) provide the opportunity for flexible applications. In this work, by analyzing the procedures of depth estimation, we propose algorithms suitable for hardware implementation of real-time depth estimation. Different methods of calibration, matching, and post-processing are analyzed against the hardware design requirements. Finally, tests of the algorithms are analyzed. The results show that the proposed algorithms can provide credible depth maps for further view synthesis and are suitable for hardware design.
Learning-based multi-task models have been widely used in various scene understanding tasks, where the tasks complement each other, i.e., prior semantic information can be considered to better infer depth. We boost unsupervised monocular depth estimation by using semantic segmentation as an auxiliary task. To address the lack of cross-domain datasets and the catastrophic forgetting problems encountered in multi-task training, we utilize existing methodology to obtain redundant segmentation maps to build our cross-domain dataset, which not only provides a new way to conduct multi-task training, but also helps us to evaluate results against those of other algorithms. In addition, to comprehensively use the extracted features of the two tasks in the early perception stage, we use a weight-sharing strategy in the network to fuse cross-domain features, and introduce a novel multi-task loss function to further smooth the depth values. Extensive experiments on the KITTI and Cityscapes datasets show that our method achieves state-of-the-art performance in the depth estimation task, as well as improved semantic segmentation.
For traffic object detection in foggy environments based on convolutional neural networks (CNNs), data sets collected in fog-free environments are generally used to train the network directly. As a result, the network cannot learn the characteristics of objects in fog from the training set, and the detection effect is poor. To improve traffic object detection in foggy environments, we propose a method for generating foggy images from fog-free images, approaching the problem from the perspective of data set construction. First, taking the KITTI object detection data set as the source of original fog-free images, we generate the depth image of each original image using an improved Monodepth unsupervised depth estimation method. Then, a geometric prior depth template is constructed, and the image entropy, taken as a weight, is fused with the depth image. After that, a foggy image is synthesized from the depth image based on the atmospheric scattering model. Finally, we take two typical object-detection frameworks, the two-stage Faster region-based convolutional neural network (Faster-RCNN) and the one-stage network YOLOv4, and train them on the original data set, the foggy data set, and the mixed data set, respectively. According to test results on the RESIDE-RTTS data set, captured in outdoor natural foggy environments, the models trained on the mixed data set perform best: the mean average precision (mAP) values increase by 5.6% for YOLOv4 and by 5.0% for Faster-RCNN. This proves that the proposed method can effectively improve object identification in foggy environments.
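The fog synthesis step relies on the atmospheric scattering model I(x) = J(x)*t(x) + A*(1 - t(x)) with transmission t(x) = exp(-beta*d(x)). A minimal sketch, omitting the entropy-weighted depth-template fusion; beta and the airlight A are illustrative values.

```python
import numpy as np

def add_fog(img, depth, beta=0.05, airlight=0.9):
    """Synthesize fog via the atmospheric scattering model
    I(x) = J(x)*t(x) + A*(1 - t(x)),  t(x) = exp(-beta*d(x)).
    img: clean RGB in [0, 1]; depth: per-pixel distance in metres."""
    t = np.exp(-beta * depth)[..., None]          # transmission map
    return img * t + airlight * (1.0 - t)

rng = np.random.default_rng(0)
img = rng.random((240, 320, 3))
depth = np.linspace(5, 80, 320)[None, :].repeat(240, axis=0)  # toy depth
foggy = add_fog(img, depth)
print(foggy.shape, foggy.min(), foggy.max())
```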
Most approaches that estimate a scene's 3D depth from a single image model the point spread function (PSF) as a 2D Gaussian function. However, those methods suffer from noise and find it difficult to achieve high-quality depth recovery. We present a simple yet effective approach to estimate exactly the amount of spatially varying defocus blur at edges, based on a Cauchy distribution model for the PSF. The raw image is re-blurred twice using two known Cauchy distribution kernels, and the defocus blur amount at edges is derived from the gradient ratio between the two re-blurred images. By propagating the blur amount at edge locations to the entire image using matting interpolation, a full depth map is then recovered. Experimental results on several real images demonstrate both the feasibility and the effectiveness of our method, a non-Gaussian model for the PSF, in providing a better estimation of the defocus map from a single un-calibrated defocused image. These results also show that our method is robust to image noise, inaccurate edge locations, and interference from neighboring edges. It can generate more accurate scene depth maps than most existing methods that use a Gaussian PSF model.
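A 1D sketch of the gradient-ratio idea: because Cauchy scale parameters add under convolution, re-blurring a Cauchy-blurred step edge with two known kernels yields the unknown scale in closed form, gamma = (gamma2 - R*gamma1)/(R - 1). Kernel sizes and scales are illustrative, and the matting-based propagation to a full depth map is not reproduced.

```python
import numpy as np

def cauchy_kernel(gamma, half=200):
    x = np.arange(-half, half + 1, dtype=float)
    k = gamma / (np.pi * (x**2 + gamma**2))       # Cauchy (Lorentzian) PSF
    return k / k.sum()

# A step edge blurred by an unknown Cauchy PSF of scale gamma_true.
gamma_true = 3.0
edge = np.zeros(1001); edge[500:] = 1.0
blurred = np.convolve(edge, cauchy_kernel(gamma_true), mode="same")

# Re-blur twice with known kernels. Cauchy scales add under convolution,
# so the edge-gradient ratio gives gamma in closed form.
g1, g2 = 1.0, 2.0
r1 = np.convolve(blurred, cauchy_kernel(g1), mode="same")
r2 = np.convolve(blurred, cauchy_kernel(g2), mode="same")
i = np.abs(np.gradient(blurred)).argmax()         # edge location
R = np.gradient(r1)[i] / np.gradient(r2)[i]       # R = (g+g2)/(g+g1)
print("estimated gamma:", (g2 - R * g1) / (R - 1))  # ~3.0
```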
Background: Monocular depth estimation aims to predict a dense depth map from a single RGB image, and has important applications in 3D reconstruction, autonomous driving, and augmented reality. However, existing methods feed the original RGB image directly into the model to extract depth features, without avoiding the interference of depth-irrelevant information on depth-estimation accuracy, which leads to inferior performance. Methods: To remove the influence of depth-irrelevant information and improve depth-prediction accuracy, we propose RADepthNet, a novel reflectance-guided network that fuses boundary features. Specifically, our method predicts depth maps in the following three steps: (1) Intrinsic image decomposition. We propose a reflectance extraction module consisting of an encoder-decoder structure to extract the depth-related reflectance. Through an ablation study, we demonstrate that the module reduces the influence of illumination on depth estimation. (2) Boundary detection. A boundary extraction module, consisting of an encoder, a refinement block, and an upsample block, is proposed to better predict depth at object boundaries by utilizing gradient constraints. (3) Depth prediction. We use an encoder different from that in (2) to obtain depth features from the reflectance map and fuse boundary features to predict depth. In addition, we propose FIFADataset, a depth-estimation dataset for soccer scenarios. Results: Extensive experiments on a public dataset and our proposed FIFADataset show that our method achieves state-of-the-art performance.
Imaging of surface-enhanced Raman scattering (SERS) nanoparticles (NPs) has been intensively studied for cancer detection due to its high sensitivity, unconstrained low signal-to-noise ratios, and multiplexing detection capability. Furthermore, conjugating SERS NPs with various biomarkers is straightforward, resulting in numerous successful studies on cancer detection and diagnosis. However, Raman spectroscopy only provides spectral data from an imaging area without co-registered anatomic context.
Light Field (LF) depth estimation is an important research direction in the areas of computer vision and computational photography, aiming to infer the depth of different objects in three-dimensional scenes by capturing LF data. Given its growing significance, this article presents a survey of the key concepts, methods, novel applications, and future trends in this area. We summarize LF depth estimation methods, which are usually based on the interaction of radiance from rays in all directions of the LF data, such as epipolar-plane, multi-view geometry, focal stack, and deep learning approaches. We analyze the many challenges facing each of these approaches, including complex algorithms, large amounts of computation, and speed requirements. In addition, this survey summarizes most of the currently available methods, conducts comparative experiments, discusses the results, and investigates novel directions in LF depth estimation.
Remarkable progress has been made in self-supervised monocular depth estimation (SS-MDE) by exploring cross-view consistency, e.g., photometric consistency and 3D point cloud consistency. However, they are very vulnerable to illumination variance, occlusions, texture-less regions, as well as moving objects, making them not robust enough to deal with various scenes. To address this challenge, we study two kinds of robust cross-view consistency in this paper. Firstly, the spatial offset field between adjacent frames is obtained by reconstructing the reference frame from its neighbors via deformable alignment, which is used to align the temporal depth features via a depth feature alignment (DFA) loss. Secondly, the 3D point clouds of each reference frame and its nearby frames are calculated and transformed into voxel space, where the point density in each voxel is calculated and aligned via a voxel density alignment (VDA) loss. In this way, we exploit the temporal coherence in both depth feature space and 3D voxel space for SS-MDE, shifting the “point-to-point” alignment paradigm to the “region-to-region” one. Compared with the photometric consistency loss as well as the rigid point cloud alignment loss, the proposed DFA and VDA losses are more robust owing to the strong representation power of deep features as well as the high tolerance of voxel density to the aforementioned challenges. Experimental results on several outdoor benchmarks show that our method outperforms current state-of-the-art techniques. Extensive ablation study and analysis validate the effectiveness of the proposed losses, especially in challenging scenes. The code and models are available at https://github.com/sunnyHelen/RCVC-depth.
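A numpy sketch of the voxel density alignment idea, region-to-region rather than point-to-point: voxelize two point clouds on a shared grid and compare per-voxel densities with an L1 penalty. The paper's differentiable implementation is more involved; the grid size and bounds here are arbitrary.

```python
import numpy as np

def voxel_density(points, bounds, n_voxels=32):
    """Histogram a point cloud into a voxel grid and return per-voxel
    point densities (fraction of points falling in each voxel)."""
    edges = [np.linspace(lo, hi, n_voxels + 1) for lo, hi in bounds]
    hist, _ = np.histogramdd(points, bins=edges)
    return hist / max(points.shape[0], 1)

def vda_loss(p_ref, p_src, bounds):
    """L1 distance between the voxel densities of two clouds: a
    region-level comparison that tolerates point-wise mismatches."""
    return np.abs(voxel_density(p_ref, bounds)
                  - voxel_density(p_src, bounds)).mean()

rng = np.random.default_rng(0)
cloud = rng.uniform(-10, 10, (5000, 3))
shifted = cloud + rng.normal(0, 0.05, cloud.shape)  # mildly perturbed view
bounds = [(-12, 12)] * 3
print(vda_loss(cloud, shifted, bounds))
```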
Depth information is important for autonomous systems to perceive environments and estimate their own state. Traditional depth estimation methods, like structure from motion and stereo vision matching, are built on feature correspondences of multiple viewpoints, and the predicted depth maps are sparse. Inferring depth information from a single image (monocular depth estimation) is an ill-posed problem. With the rapid development of deep neural networks, monocular depth estimation based on deep learning has been widely studied recently and has achieved promising accuracy; dense depth maps are estimated from single images by deep neural networks in an end-to-end manner. To improve the accuracy of depth estimation, different kinds of network frameworks, loss functions, and training strategies have been proposed. We therefore survey current deep learning-based monocular depth estimation methods in this review. We first summarize several widely used datasets and evaluation indicators in deep learning-based depth estimation. Furthermore, we review some representative existing methods according to their training manner: supervised, unsupervised, and semi-supervised. Finally, we discuss the challenges and provide some ideas for future research in monocular depth estimation.
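Since the review discusses common evaluation indicators, here is a minimal implementation of the standard ones (absolute relative error, RMSE, and the delta-threshold accuracies), run on synthetic data for illustration.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Widely used monocular-depth evaluation indicators: AbsRel, RMSE,
    and the delta < 1.25^k accuracies, over valid ground-truth pixels."""
    mask = gt > 0
    p, g = pred[mask], gt[mask]
    abs_rel = np.mean(np.abs(p - g) / g)
    rmse = np.sqrt(np.mean((p - g) ** 2))
    ratio = np.maximum(p / g, g / p)
    deltas = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]
    return abs_rel, rmse, deltas

rng = np.random.default_rng(0)
gt = rng.uniform(1, 80, (240, 320))
pred = gt * rng.normal(1.0, 0.08, gt.shape)   # a noisy mock prediction
print(depth_metrics(pred, gt))
```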
This paper aims to address the problem of supervised monocular depth estimation. We start with a meticulous pilot study to demonstrate that long-range correlation is essential for accurate depth estimation. Moreover, the Transformer and convolution are good at long-range and close-range depth estimation, respectively. Therefore, we propose to adopt a parallel encoder architecture consisting of a Transformer branch and a convolution branch. The former can model global context with the effective attention mechanism, and the latter aims to preserve local information, as the Transformer lacks the spatial inductive bias in modeling such contents. However, independent branches lead to a shortage of connections between features. To bridge this gap, we design a hierarchical aggregation and heterogeneous interaction module to enhance the Transformer features and model the affinity between the heterogeneous features in a set-to-set translation manner. Due to the unbearable memory cost introduced by global attention on high-resolution feature maps, we adopt the deformable scheme to reduce the complexity. Extensive experiments on the KITTI, NYU, and SUN RGB-D datasets demonstrate that our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods by prominent margins. The effectiveness of each proposed module is evaluated through meticulous and intensive ablation studies.
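A toy two-branch block in the spirit of the proposed encoder: a Transformer layer for global context in parallel with a convolutional path for local detail, fused by a 1x1 convolution. The hierarchical aggregation, heterogeneous interaction, and deformable attention modules are deliberately omitted, so this only sketches the parallel-branch layout.

```python
import torch
import torch.nn as nn

class ParallelEncoderBlock(nn.Module):
    """Toy parallel encoder: attention over flattened tokens models global
    context; the convolution branch preserves local structure; a 1x1
    convolution fuses the two (a simplification of the paper's design)."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.conv = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1))
        self.fuse = nn.Conv2d(2 * dim, dim, 1)

    def forward(self, x):                       # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # (B, H*W, C) for attention
        glb = self.attn(tokens).transpose(1, 2).reshape(b, c, h, w)
        loc = self.conv(x)
        return self.fuse(torch.cat([glb, loc], dim=1))

x = torch.rand(2, 64, 16, 16)
print(ParallelEncoderBlock()(x).shape)   # torch.Size([2, 64, 16, 16])
```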
We present an estimation of the depth of anomalous bodies using the normalized full gradient (NFG) of the gravity anomaly. Maxima in the NFG map can locate the bodies and indicate their depth. A model calculation using a sphere and application of the NFG method to gravity anomalies over salt domes in the USA and Denmark show the effectiveness of the method. However, the accuracy of the depth estimation strongly depends on the number of terms N in the Fourier series used to calculate the NFG. An optimum N can be determined by testing.
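A rough sketch of the NFG computation on a profile, assuming an FFT-based truncated-spectrum downward continuation: the field is continued to each trial depth, horizontal and vertical derivatives are formed in the wavenumber domain, and the full gradient is normalized by its mean at that level. The abstract's point about the term count N appears here as the n_terms truncation, and the test anomaly is a toy point source; the result depends on the choice of N, exactly as the abstract warns.

```python
import numpy as np

def nfg(gz, dx, depths, n_terms=30):
    """Normalized full gradient of a gravity profile gz(x) on a grid of
    continuation depths, using a truncated Fourier spectrum of N terms.
    Maxima of the returned map are taken to indicate source position
    and depth; N controls the stability of the downward continuation."""
    n = gz.size
    spec = np.fft.rfft(gz)
    k = 2 * np.pi * np.fft.rfftfreq(n, d=dx)      # wavenumbers (rad/m)
    spec[n_terms:] = 0                            # truncate the series
    out = []
    for z in depths:
        s = spec * np.exp(k * z)                  # downward continuation
        vx = np.fft.irfft(1j * k * s, n)          # horizontal derivative
        vz = np.fft.irfft(k * s, n)               # vertical derivative
        g = np.hypot(vx, vz)
        out.append(g / g.mean())                  # per-level normalization
    return np.array(out)                          # (len(depths), n)

# Toy anomaly of a point source buried at 100 m depth.
x = np.arange(-500, 500, 5.0)
z0 = 100.0
gz = z0 / (x**2 + z0**2) ** 1.5
depths = np.arange(10.0, 200.0, 10.0)
m = nfg(gz, dx=5.0, depths=depths)
iz, ix = np.unravel_index(m.argmax(), m.shape)
print("NFG maximum at depth", depths[iz], "m, x =", x[ix], "m")
```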
Estimating depth from images captured by camera sensors is crucial for the advancement of autonomous driving technologies and has gained significant attention in recent years. However, most previous methods rely on stacked pooling or strided convolution to extract high-level features, which can limit network performance and lead to information redundancy. This paper proposes an improved bidirectional feature pyramid module (BiFPN) and a channel attention module (Seblock: squeeze and excitation) to address these issues in existing methods based on a monocular camera sensor. The Seblock redistributes channel feature weights to enhance useful information, while the improved BiFPN facilitates efficient fusion of multi-scale features. The proposed method is an end-to-end solution without any additional post-processing, resulting in efficient depth estimation. Experimental results show that the proposed method is competitive with state-of-the-art algorithms and preserves the fine-grained texture of scene depth.
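The channel attention module described here follows the well-known squeeze-and-excitation pattern; below is a minimal PyTorch version (the reduction ratio is an assumed hyperparameter, not taken from the paper).

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel attention: global-average-pool the
    spatial dims ("squeeze"), pass through a small bottleneck MLP, and
    rescale each channel ("excitation"), redistributing channel weights
    as the abstract describes."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                      # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                 # squeeze: (B, C)
        w = self.fc(w)[:, :, None, None]       # excitation weights
        return x * w                           # channel-wise rescaling

x = torch.rand(2, 64, 32, 32)
print(SEBlock(64)(x).shape)   # torch.Size([2, 64, 32, 32])
```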
Depth estimation is a fundamental computer vision problem that infers three-dimensional (3D) structures from a given scene. As it is an ill-posed problem, traditional methods generally require massive amounts of annotated data to fit the projection function from the given scene to the 3D structure. Such pixel-level annotation is quite labor-consuming, especially when addressing reflective surfaces such as mirrors or water. The widespread application of deep learning further intensifies the demand for large amounts of annotated data. Therefore, it is urgent and necessary to propose a framework that reduces the requirement on the amount of data. In this paper, we propose a novel semi-supervised learning framework to infer the 3D structure of a given scene. First, semantic information is employed to make the depth inference more accurate. Second, we make both the depth estimation and semantic segmentation coarse-to-fine frameworks; thus, the depth estimation can be gradually guided by semantic segmentation. We compare our model with state-of-the-art methods. The experimental results demonstrate that our method is better than many supervised learning-based methods, which proves the effectiveness of the proposed method.