A novel dual-branch decoding fusion convolutional neural network model (DDFNet), specifically designed for real-time salient object detection (SOD) on steel surfaces, is proposed. DDFNet is based on a standard encoder–decoder architecture and integrates three key innovations. First, we introduce a novel, lightweight multi-scale progressive aggregation residual network that effectively suppresses background interference and refines defect details, enabling efficient salient feature extraction. Second, we propose an innovative dual-branch decoding fusion structure, comprising the refined defect representation branch and the enhanced defect representation branch, which enhances accuracy in defect region identification and feature representation. Third, to further improve the detection of small and complex defects, we incorporate a multi-scale attention fusion module. Experimental results on the public ESDIs-SOD dataset show that DDFNet, with only 3.69 million parameters, achieves detection performance comparable to current state-of-the-art models, demonstrating its potential for real-time industrial applications. Furthermore, our DDFNet-L variant consistently outperforms leading methods in detection performance. The code is available at https://github.com/13140W/DDFNet.
Salient object detection (SOD) has achieved considerable progress. However, even the methods that perform well still suffer from inadequate detection accuracy; for example, missed and false detections still occur. Effectively optimizing features to capture key information, and better integrating different levels of features to enhance their complementarity, are two significant challenges in the domain of SOD. In response to these challenges, this study proposes a novel SOD method based on multi-strategy feature optimization. We propose the multi-size feature extraction module (MSFEM), which uses the attention mechanism, multi-level feature fusion, and the residual block to obtain finer features. This module provides robust support for the subsequent accurate detection of the salient object. In addition, we use two rounds of feature fusion and a feedback mechanism to optimize the features obtained by the MSFEM and improve detection accuracy. The first round of feature fusion integrates the features extracted by the MSFEM to obtain more refined features. Subsequently, the feedback mechanism and the second round of feature fusion refine the features further, thereby providing a stronger foundation for accurately detecting salient objects. To improve the fusion effect, we propose the feature enhancement module (FEM) and the feature optimization module (FOM). The FEM integrates the upper and lower features with the optimized features obtained by the FOM to enhance feature complementarity. The FOM uses different receptive fields, the attention mechanism, and the residual block to capture key information more effectively. Experimental results demonstrate that our method outperforms 10 state-of-the-art SOD methods.
The goal of salient object detection is to estimate the regions that are most likely to attract human visual attention. Although it is an important image preprocessing procedure for reducing computational complexity, salient object detection is still a challenging problem in computer vision. In this paper, we propose a salient object detection model that integrates local and global superpixel contrast at multiple scales. Three features are computed to estimate the saliency of each superpixel. Two optimization measures are used to refine the resulting saliency map. Extensive comparisons with state-of-the-art saliency models on four public datasets demonstrate the effectiveness of the proposed model.
Video salient object detection (VSOD) aims at locating the most attractive objects in a video by exploring spatial and temporal features. VSOD is a challenging task in computer vision, as it involves processing complex spatial data that is also influenced by temporal dynamics. Despite the progress made by existing VSOD models, they still struggle in scenes with great background diversity within and between frames. Additionally, they encounter difficulties related to accumulated noise and high time consumption when extracting temporal features over a long duration. We propose a multi-stream temporal enhanced network (MSTENet) to address these problems. It investigates the collaboration of saliency cues in the spatial domain with a multi-stream structure to deal with the challenge of great background diversity. A straightforward yet efficient approach to temporal feature extraction is developed to avoid accumulated noise and reduce time consumption. MSTENet differs from other VSOD methods in that it incorporates both foreground supervision and background supervision, facilitating enhanced extraction of collaborative saliency cues. Another notable difference is the integration of spatial and temporal features: the temporal module is integrated into the multi-stream structure, enabling comprehensive spatial-temporal interactions within an end-to-end framework. Extensive experimental results demonstrate that the proposed method achieves state-of-the-art performance on five benchmark datasets while maintaining a real-time speed of 27 fps (Titan XP). Our code and models are available at https://github.com/RuJiaLe/MSTENet.
Characterizing the integrity and fineness of non-connected regions and contours is a major challenge for existing salient object detection methods. The key to addressing it is how to make full use of the subjective and objective structural information obtained in different steps. Therefore, by simulating the human visual mechanism, this paper proposes a novel multi-decoder matching correction network and a subjective structural loss. Specifically, the loss pays different levels of attention to the foreground, boundary, and background of the ground truth map in a top-down structure. The perceived saliency is then mapped to the corresponding objective structure of the prediction map, which is extracted in a bottom-up manner. Thus, multi-level salient features can be effectively detected with the loss as a constraint. Next, through the mapping of an improved binary cross entropy loss, the differences between salient regions and objects are checked so that attention is paid to error-prone regions, achieving excellent error sensitivity. Finally, by tracking the identifying feature horizontally and vertically, the subjective and objective interaction is maximized. Extensive experiments on five benchmark datasets demonstrate that, compared with 12 state-of-the-art methods, the algorithm has higher recall and precision, lower error, and strong robustness and generalization ability, and can predict complete and refined saliency maps.
Recently, weak supervision has received growing attention in the field of salient object detection due to the convenience of labelling. However, there is a large performance gap between weakly supervised and fully supervised salient object detectors because scribble annotations can only provide very limited foreground/background information. Therefore, an intuitive idea is to infer annotations that cover more complete object and background regions for training. To this end, a label inference strategy is proposed based on the assumption that pixels with similar colours and close positions should have consistent labels. Specifically, the k-means clustering algorithm is first performed on both the colours and the coordinates of the original annotations, and the same labels are then assigned to points whose colours are similar to the colour cluster centres and whose positions are near the coordinate cluster centres. Next, pixels with similar colours within each kernel neighbourhood are further given the same annotations. Extensive experiments on six benchmarks demonstrate that our method can significantly improve performance and achieve state-of-the-art results.
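The first step of the label inference strategy described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`kmeans`, `infer_labels`), the label encoding (0 = unlabelled, 1 = foreground, 2 = background), the thresholds `colour_tol` and `dist_tol`, and the tie-breaking rule are all hypothetical, and the later kernel-neighbourhood step is omitted.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns the k cluster centres."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    centres = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=-1)
        assign = d.argmin(axis=1)
        for j in range(k):
            members = points[assign == j]
            if len(members):  # keep the old centre if a cluster empties
                centres[j] = members.mean(axis=0)
    return centres

def infer_labels(image, scribble, k=2, colour_tol=0.1, dist_tol=5.0):
    """Grow scribble labels (0 = unlabelled, 1 = fg, 2 = bg; assumed encoding).

    An unlabelled pixel receives a class label when its colour lies within
    colour_tol of one of that class's colour centres AND its position lies
    within dist_tol of one of that class's coordinate centres.
    """
    h, w, channels = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([ys, xs], axis=-1).reshape(-1, 2).astype(float)
    colours = image.reshape(-1, channels).astype(float)
    flat = scribble.reshape(-1)
    out = flat.copy()
    for label in (1, 2):
        idx = flat == label
        if idx.sum() < k:  # not enough scribble pixels to form k clusters
            continue
        c_centres = kmeans(colours[idx], k)  # clusters in colour space
        p_centres = kmeans(coords[idx], k)   # clusters in coordinate space
        c_dist = np.linalg.norm(colours[:, None] - c_centres[None], axis=-1).min(axis=1)
        p_dist = np.linalg.norm(coords[:, None] - p_centres[None], axis=-1).min(axis=1)
        # Only still-unlabelled pixels are grown, so the first class wins ties
        # (an arbitrary choice made for this sketch).
        grow = (out == 0) & (c_dist < colour_tol) & (p_dist < dist_tol)
        out[grow] = label
    return out.reshape(h, w)
```

Requiring both conditions at once mirrors the stated assumption that pixels with similar colours and close positions should share a label; either condition alone would spread labels across distant regions of the same colour or across nearby regions of a different colour.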
Exploring the interaction between red, green, blue (RGB) and thermal infrared modalities is critical to the success of RGB-thermal (RGB-T) salient object detection (RGB-T SOD). In this paper, a cross-modal attention and reinforcement network (CAR-Net) is proposed to explore the implicit relationship between the two modalities, fully leveraging their beneficial expression and complementary fusion. Specifically, CAR-Net has a cross-modal attention module (CAM) that enables efficient interaction and key information extraction through joint attention. It also includes a feature strengthener module (FSM) for improved representation using channel rank and loop methods. Extensive experiments show that CAR-Net achieves the best performance on three publicly available datasets.
Salient object detection remains one of the most important and active research topics in computer vision, with wide-ranging applications to object recognition, scene understanding, image retrieval, context-aware image editing, image compression, etc. Most existing methods directly determine salient objects by exploring various salient object features. Here, we propose a novel graph-based ranking method to detect and segment the most salient object in a scene according to its relationship to image border (background) regions, i.e., the background feature. Firstly, we use regions/super-pixels as graph nodes, which are fully connected so that both long-range and short-range relations can be modeled. The relationship of each region to the image border (background) is evaluated in two stages: (i) ranking with hard background queries, and (ii) ranking with soft foreground queries. We experimentally show that this two-stage ranking-based salient object detection method is complementary to traditional methods, and that the integrated results outperform both. Our method exploits intrinsic image structure to achieve high-quality salient object determination using a quadratic optimization framework with a closed-form solution that can be easily computed. Extensive evaluation and comparison on three challenging saliency datasets demonstrate that our method consistently outperforms 10 state-of-the-art models by a large margin.
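The two-stage ranking can be illustrated with a short sketch. The abstract does not give the objective, so this assumes the standard manifold-ranking closed form f = (D - alpha * W)^(-1) y on a fully connected graph with Gaussian affinities; the function names, the parameters `alpha` and `sigma`, and the way stage-one scores are reused as soft foreground queries are assumptions made for illustration, not the paper's exact formulation.

```python
import numpy as np

def _rescale(x, eps=1e-12):
    """Map scores to the range [0, 1)."""
    return (x - x.min()) / (x.max() - x.min() + eps)

def rank_nodes(features, queries, alpha=0.99, sigma=0.1):
    """Closed-form graph ranking: solve (D - alpha * W) f = y."""
    features = np.asarray(features, dtype=float)
    y = np.asarray(queries, dtype=float)
    # Fully connected graph: Gaussian affinity between every pair of nodes.
    diff = features[:, None, :] - features[None, :, :]
    W = np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(axis=1))
    # For 0 < alpha < 1 the matrix is strictly diagonally dominant,
    # so the linear system has a unique solution.
    return np.linalg.solve(D - alpha * W, y)

def two_stage_saliency(features, border_mask, alpha=0.99, sigma=0.1):
    """Stage 1: hard background queries. Stage 2: soft foreground queries."""
    border = np.asarray(border_mask, dtype=float)   # 1 for border nodes, else 0
    background_rank = rank_nodes(features, border, alpha, sigma)
    stage1 = 1.0 - _rescale(background_rank)        # far from background = salient
    foreground_rank = rank_nodes(features, stage1, alpha, sigma)
    return _rescale(foreground_rank)
```

For example, with four nodes whose features resemble the border and two that do not, `two_stage_saliency` assigns the two dissimilar nodes the highest scores. Solving the linear system with `np.linalg.solve` avoids forming an explicit matrix inverse.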
Salient object detection (SOD) is a long-standing research topic in computer vision with increasing interest in the past decade. Since light fields record comprehensive information of natural scenes that benefits SOD in a number of ways, using light field inputs to improve saliency detection over conventional RGB inputs is an emerging trend. This paper provides the first comprehensive review and a benchmark for light field SOD, which has long been lacking in the saliency community. Firstly, we introduce light fields, including theory and data forms, and then review existing studies on light field SOD, covering ten traditional models, seven deep learning-based models, a comparative study, and a brief review. Existing datasets for light field SOD are also summarized. Secondly, we benchmark nine representative light field SOD models together with several cutting-edge RGB-D SOD models on four widely used light field datasets, providing insightful discussions and analyses, including a comparison between light field SOD and RGB-D SOD models. Due to the inconsistency of current datasets, we further generate complete data and supplement focal stacks, depth maps, and multi-view images for them, making them consistent and uniform. Our supplemental data make a universal benchmark possible. Lastly, light field SOD is a specialised problem because of its diverse data representations and high dependency on acquisition hardware, so it differs greatly from other saliency detection tasks. We provide nine observations on challenges and future directions, and outline several open issues. All the materials, including models, datasets, benchmarking results, and supplemented light field datasets, are publicly available at https://github.com/kerenfu/LFSOD-Survey.
Recently, a new research trend in the video salient object detection (VSOD) community has focused on enhancing detection results via model self-fine-tuning using sparsely mined high-quality keyframes from the given sequence. Although such a learning scheme is generally effective, it has a critical limitation: a model learned on sparse frames possesses only weak generalization ability. This situation can become worse on "long" videos, since they tend to have intensive scene variations. Moreover, in such videos, keyframe information from a longer time span is less relevant to earlier frames, which can also cause learning conflicts and degrade model performance. Thus, the learning scheme is usually incapable of handling complex pattern modeling. To solve this problem, we propose a divide-and-conquer framework, which converts a complex problem domain into multiple simple ones. First, we devise a novel background consistency analysis (BCA), which effectively divides the mined frames into disjoint groups. Then, for each group, we assign an individual deep model to capture its key attributes during the fine-tuning phase. During the testing phase, we design a model-matching strategy that dynamically selects the best-matched model from the fine-tuned ones to handle the given testing frame. Comprehensive experiments show that our method can adapt to severe background appearance variation coupled with object movement and obtains robust saliency detection compared with the previous scheme and state-of-the-art methods.
Salient object detection is used as a preprocessing step in many computer vision tasks (such as salient object segmentation, video salient object detection, etc.). When performing salient object detection, depth information can provide clues to the location of target objects, so effective fusion of RGB and depth feature information is important. In this paper, we propose a new feature information aggregation approach, weighted group integration (WGI), to effectively integrate RGB and depth feature information. We use a dual-branch structure to slice the input RGB image and depth map separately and then merge the results separately by concatenation. As grouped features may lose global information about the target object, we also make use of the idea of residual learning, taking the features captured by the original fusion method as supplementary information to ensure both accuracy and completeness of the fused information. Experiments on five datasets show that our model performs better than typical existing approaches on four evaluation metrics.
Segment Anything Model (SAM) is a cutting-edge model that has shown impressive performance in general object segmentation. The advent of SAM is a groundbreaking step towards creating a universal intelligent model. Due to its superior performance in general object segmentation, it quickly gained attention and interest. This makes SAM particularly attractive for industrial surface defect segmentation, especially for complex industrial scenes with limited training data. However, its segmentation ability for specific industrial scenes remains unknown. Therefore, in this work, we select three representative and complex industrial surface defect detection scenarios, namely strip steel surface defects, tile surface defects, and rail surface defects, to evaluate the segmentation performance of SAM. Our results show that although SAM has great potential in general object segmentation, it cannot achieve satisfactory performance in complex industrial scenes. Our test results are available at: https://github.com/VDT-2048/SAM-IS.
Saliency detection of stacked fruits of the same kind can assist robots in completing sorting tasks, which is an important prerequisite for the grading and packing of fruits. To accurately obtain the salient targets among stacked fruits of the same kind under overexposure, non-uniform illumination, and low illumination, a method for detecting stacked fruits under poor illumination based on RGB-D visual saliency is proposed. Based on the Res2Net network, features from each layer of the two images are obtained. To realize the complementary advantages of RGB features and depth features, the input RGB images are preprocessed using depth weighting to obtain purified RGB features. To increase the information interaction between branches of different scales and better balance the fusion features and modality-exclusive features, a multi-scale progressive fusion module is proposed. To minimize the difference between the initial saliency maps generated by different features and improve the accuracy of the final predicted saliency maps, a multi-branch hybrid supervised method is used. Comprehensive experiments on a self-made dataset of stacked fruits of the same kind show that the proposed algorithm is superior to five state-of-the-art RGB-D SOD methods on four key indicators: the S value, F value, and MAE value, which are 0.979, 0.992, and 0.006, respectively, and the P-R curve, which lies closer to the upper right corner of the graph. These values demonstrate that the proposed algorithm can accurately obtain salient targets among stacked fruits of the same kind. The results of this study can promote automation in the fruit production and packaging industry.
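Two of the indicators reported above, the MAE value and the F value, are simple to compute from a predicted saliency map and its ground-truth mask. The sketch below is illustrative only: it assumes maps scaled to [0, 1], a fixed binarization threshold of 0.5, and the weighting beta^2 = 0.3 that is conventional in the SOD literature; the abstract does not state which settings were used, and the S value and P-R curve are not covered.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a saliency map and its ground truth."""
    return float(np.mean(np.abs(pred.astype(float) - gt.astype(float))))

def f_measure(pred, gt, threshold=0.5, beta2=0.3):
    """Weighted F-measure of the thresholded map (beta2 is an assumed default)."""
    binary = pred >= threshold
    mask = gt > 0.5
    true_pos = np.logical_and(binary, mask).sum()
    precision = true_pos / max(binary.sum(), 1)  # guard against empty prediction
    recall = true_pos / max(mask.sum(), 1)       # guard against empty ground truth
    denom = beta2 * precision + recall
    if denom == 0:
        return 0.0
    return float((1 + beta2) * precision * recall / denom)
```

Lower MAE is better, while a higher F value is better, which matches the direction of the reported figures (0.006 and 0.992).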
Previous video object segmentation approaches mainly focus on simplex solutions linking appearance and motion, limiting effective feature collaboration between these two cues. In this work, we study a novel and efficient full-duplex strategy network (FSNet) to address this issue, by considering a better mutual restraint scheme linking motion and appearance, allowing exploitation of cross-modal features from the fusion and decoding stage. Specifically, we introduce a relational cross-attention module (RCAM) to achieve bidirectional message propagation across embedding sub-spaces. To improve the model’s robustness and update inconsistent features from the spatiotemporal embeddings, we adopt a bidirectional purification module after the RCAM. Extensive experiments on five popular benchmarks show that our FSNet is robust to various challenging scenarios (e.g., motion blur and occlusion), and compares well to leading methods both for video object segmentation and video salient object detection. The project is publicly available at https://github.com/GewelsJI/FSNet.
Salient object detection (SOD) in RGB and depth images has attracted increasing research interest. Existing RGB-D SOD models usually adopt fusion strategies to learn a shared representation from RGB and depth modalities, while few methods explicitly consider how to preserve modality-specific characteristics. In this study, we propose a novel framework, the specificity-preserving network (SPNet), which improves SOD performance by exploring both the shared information and modality-specific properties. Specifically, we use two modality-specific networks and a shared learning network to generate individual and shared saliency prediction maps. To effectively fuse cross-modal features in the shared learning network, we propose a cross-enhanced integration module (CIM) and propagate the fused feature to the next layer to integrate cross-level information. Moreover, to capture rich complementary multi-modal information to boost SOD performance, we use a multi-modal feature aggregation (MFA) module to integrate the modality-specific features from each individual decoder into the shared decoder. By using skip connections between encoder and decoder layers, hierarchical features can be fully combined. Extensive experiments demonstrate that our SPNet outperforms cutting-edge approaches on six popular RGB-D SOD and three camouflaged object detection benchmarks. The project is publicly available at https://github.com/taozh2017/SPNet.
In this paper, we consider salient instance segmentation. As well as producing bounding boxes, our network also outputs high-quality instance-level segments as initial selections to indicate the regions of interest. Taking into account the category-independent property of each target, we design a single-stage salient instance segmentation framework with a novel segmentation branch. Our new branch considers not only the local context inside each detection window but also the surrounding context, enabling us to distinguish instances in the same scope even with partial occlusion. Our network is end-to-end trainable and is fast (running at 40 fps for images with resolution 320 × 320). We evaluate our approach on a publicly available benchmark and show that it outperforms alternative solutions. We also provide a thorough analysis of our design choices to help readers better understand the function of each part of our network. Source code can be found at https://github.com/RuochenFan/S4Net.
The burgeoning field of Camouflaged Object Detection (COD) seeks to identify objects that blend into their surroundings. Despite the impressive performance of recent learning-based models, their robustness is limited, as existing methods may misclassify salient objects as camouflaged ones, even though the two have contradictory characteristics. This limitation may stem from the lack of multi-pattern training images, leading to reduced robustness against salient objects. To overcome the scarcity of multi-pattern training images, we introduce CamDiff, a novel approach inspired by AI-Generated Content (AIGC). Specifically, we leverage a latent diffusion model to synthesize salient objects in camouflaged scenes, while using the zero-shot image classification ability of the Contrastive Language-Image Pre-training (CLIP) model to prevent synthesis failures and ensure that the synthesized objects align with the input prompt. Consequently, the synthesized image retains its original camouflage label while incorporating salient objects, yielding camouflaged scenes with richer characteristics. The results of user studies show that the salient objects in our synthesized scenes attract the user’s attention more; thus, such samples pose a greater challenge to existing COD models. Our CamDiff enables flexible editing and efficient large-scale dataset generation at a low cost. It significantly enhances the training and testing phases of COD baselines, granting them robustness across diverse domains. Our newly generated datasets and source code are available at https://github.com/drlxj/CamDiff.
We introduce a novel bilateral reference framework (BiRefNet) for high-resolution dichotomous image segmentation (DIS). It comprises two essential components: the localization module (LM) and the reconstruction module (RM) with our proposed bilateral reference (BiRef). The LM aids in object localization using global semantic information. Within the RM, we utilize BiRef for the reconstruction process, where hierarchical patches of images provide the source reference, and gradient maps serve as the target reference. These components collaborate to generate the final predicted maps. We also introduce auxiliary gradient supervision to enhance the focus on regions with finer details. In addition, we outline practical training strategies tailored for DIS to improve map quality and the training process. To validate the general applicability of our approach, we conduct extensive experiments on four tasks, showing that BiRefNet exhibits remarkable performance, outperforming task-specific cutting-edge methods across all benchmarks. Our code is publicly available at https://github.com/ZhengPeng7/BiRefNet.
Funding (DDFNet): Supported in part by the National Key R&D Program of China (Grant No. 2023YFB3307604); the Shanxi Province Basic Research Program Youth Science Research Project (Grant Nos. 202303021212054 and 202303021212046); the Key Projects Supported by Hebei Natural Science Foundation (Grant No. E2024203125); the National Science Foundation of China (Grant No. 52105391); the Hebei Provincial Science and Technology Major Project (Grant No. 23280101Z); and the National Key Laboratory of Metal Forming Technology and Heavy Equipment Open Fund (Grant No. S2308100.W17).
Funding (multi-scale superpixel contrast model): Supported by the Natural Science Foundation of China (Nos. 61602349, 61375053, and 61273225); the China Scholarship Council (No. 201508420248); and the Hubei Chengguang Talented Youth Development Foundation (No. 2015B22).
Funding (MSTENet): Funded by the Natural Science Foundation China (NSFC) under Grant No. 62203192.
Funding: Supported by the National Natural Science Foundation of China (No. 52174021) and the Key Research and Development Project of Hainan Province (No. ZDYF2022GXJS003).
Abstract: Characterizing non-connected regions and contours completely and finely is a major challenge for existing salient object detection methods. The key to addressing it is making full use of the subjective and objective structural information obtained in different steps. Therefore, by simulating the human visual mechanism, this paper proposes a novel multi-decoder matching correction network and a subjective structural loss. Specifically, the loss pays different levels of attention to the foreground, boundary, and background of the ground truth map in a top-down structure. The perceived saliency is mapped to the corresponding objective structure of the prediction map, which is extracted in a bottom-up manner. Thus, multi-level salient features can be effectively detected with the loss as a constraint. Then, through the mapping of an improved binary cross-entropy loss, the differences between salient regions and objects are checked so that attention is paid to error-prone regions, achieving excellent error sensitivity. Finally, by tracking the identifying features horizontally and vertically, the subjective and objective interaction is maximized. Extensive experiments on five benchmark datasets demonstrate that, compared with 12 state-of-the-art methods, the algorithm has higher recall and precision, lower error, and strong robustness and generalization ability, and can predict complete and refined saliency maps.
Abstract: Recently, weak supervision has received growing attention in the field of salient object detection due to the convenience of labelling. However, there is a large performance gap between weakly supervised and fully supervised salient object detectors because scribble annotation can only provide very limited foreground/background information. Therefore, an intuitive idea is to infer annotations that cover more complete object and background regions for training. To this end, a label inference strategy is proposed based on the assumption that pixels with similar colours and close positions should have consistent labels. Specifically, the k-means clustering algorithm is first performed on both the colours and the coordinates of the original annotations, and the same labels are then assigned to points that have colours similar to the colour cluster centres and lie near the coordinate cluster centres. Next, pixels with similar colours within each kernel neighbourhood are further given the same annotations. Extensive experiments on six benchmarks demonstrate that our method can significantly improve performance and achieve state-of-the-art results.
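The label inference idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`kmeans`, `infer_labels`), the dictionary-based image representation, and the tolerance parameters `colour_tol` and `dist_tol` are all assumptions made for illustration, and the final kernel-neighbourhood step is omitted.

```python
import math


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means with a deterministic init (evenly spaced samples)."""
    k = min(k, len(points))
    step = len(points) / k
    centres = [points[int(i * step)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda c: math.dist(p, centres[c]))
            groups[nearest].append(p)
        # Recompute each centre as the mean of its group; keep it if the group is empty.
        centres = [
            tuple(sum(v) / len(g) for v in zip(*g)) if g else c
            for g, c in zip(groups, centres)
        ]
    return centres


def infer_labels(image, scribbles, k=2, colour_tol=30.0, dist_tol=3.0):
    """Extend scribble labels to unlabelled pixels (hypothetical sketch).

    image:     dict mapping (row, col) -> (r, g, b)
    scribbles: dict mapping (row, col) -> label for the annotated pixels
    """
    centres = {}
    for label in set(scribbles.values()):
        positions = sorted(p for p, l in scribbles.items() if l == label)
        colours = [image[p] for p in positions]
        # Cluster colours and coordinates separately, per label.
        centres[label] = (kmeans(colours, k), kmeans(positions, k))

    labels = dict(scribbles)
    for p, colour in image.items():
        if p in labels:
            continue
        matches = [
            label
            for label, (colour_centres, pos_centres) in centres.items()
            if min(math.dist(colour, c) for c in colour_centres) <= colour_tol
            and min(math.dist(p, c) for c in pos_centres) <= dist_tol
        ]
        # Assign a label only when exactly one class matches; otherwise leave unlabelled.
        if len(matches) == 1:
            labels[p] = matches[0]
    return labels


# Toy 1x6 image: three dark pixels followed by three bright pixels,
# with one background scribble (label 0) and one foreground scribble (label 1).
image = {(0, c): (10, 10, 10) if c < 3 else (240, 240, 240) for c in range(6)}
scribbles = {(0, 0): 0, (0, 5): 1}
inferred = infer_labels(image, scribbles)
```

Pixels that match no class, or more than one, stay unlabelled in this sketch, which keeps ambiguous regions out of the training signal.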
Funding: Supported by the National Natural Science Foundation of China (62471124) and the Heilongjiang Province Natural Science Foundation (LH2022F005).
Abstract: Exploring the interaction between the red, green, blue (RGB) and thermal infrared modalities is critical to the success of RGB-thermal (RGB-T) salient object detection (RGB-T SOD). In this paper, a cross-modal attention and reinforcement network (CAR-Net) is proposed to explore the implicit relationship between the two modalities, fully leveraging their beneficial expression and complementary fusion. Specifically, CAR-Net has a cross-modal attention module (CAM) that enables efficient interaction and key information extraction through joint attention. It also includes a feature strengthener module (FSM) for improved representation using channel rank and loop methods. Extensive experiments show that CAR-Net achieves the best performance on three publicly available datasets.
Funding: Funded by the National Natural Science Foundation of China under project Nos. 61231014 and 61572264, respectively; supported by the Defense Advanced Research Projects Agency (No. HR001110-C-0034); the National Science Foundation (No. BCS-0827764); and the Army Research Office (No. W911NF-08-1-0360).
Abstract: Salient object detection remains one of the most important and active research topics in computer vision, with wide-ranging applications to object recognition, scene understanding, image retrieval, context-aware image editing, image compression, etc. Most existing methods directly determine salient objects by exploring various salient object features. Here, we propose a novel graph-based ranking method to detect and segment the most salient object in a scene according to its relationship to image border (background) regions, i.e., the background feature. First, we use regions/superpixels as graph nodes, which are fully connected to enable both long-range and short-range relations to be modeled. The relationship of each region to the image border (background) is evaluated in two stages: (i) ranking with hard background queries, and (ii) ranking with soft foreground queries. We experimentally show that this two-stage ranking-based salient object detection method is complementary to traditional methods, and that integrated results outperform both. Our method allows the exploitation of intrinsic image structure to achieve high-quality salient object determination using a quadratic optimization framework, with a closed-form solution that can be easily computed. Extensive method evaluation and comparison using three challenging saliency datasets demonstrate that our method consistently outperforms 10 state-of-the-art models by a large margin.
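The abstract above does not state its objective function. As a hedged illustration only, one standard quadratic formulation of graph-based ranking with a closed-form solution is shown below. All symbols are assumptions introduced here, not taken from the source: $W = [w_{ij}]$ is a symmetric affinity matrix over the graph nodes, $D$ is the diagonal degree matrix with $d_{ii} = \sum_j w_{ij}$, $y$ is the query indicator vector, $f$ is the ranking score vector, and $\mu > 0$ balances smoothness against fidelity to the queries.

```latex
% Normalized affinity matrix
S = D^{-1/2} W D^{-1/2}

% Quadratic objective: smoothness over the graph plus a fitting term
Q(f) = \frac{1}{2} \sum_{i,j} w_{ij}
       \left( \frac{f_i}{\sqrt{d_{ii}}} - \frac{f_j}{\sqrt{d_{jj}}} \right)^{2}
       + \mu \sum_{i} (f_i - y_i)^{2}
     = f^{\top} (I - S) f + \mu \, \lVert f - y \rVert^{2}

% Setting the gradient to zero
\nabla Q(f) = 2 (I - S) f + 2 \mu (f - y) = 0
\;\Longrightarrow\; (1 + \mu) f - S f = \mu y

% Closed-form solution, with \alpha = 1 / (1 + \mu)
f^{*} = (1 - \alpha) \, (I - \alpha S)^{-1} y
```

Under this formulation, a two-stage scheme would solve the same linear system twice with different query vectors $y$: first built from border (background) nodes, then from the foreground estimate produced by the first stage.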
Funding: Supported by the National Natural Science Foundation of China (Nos. 62176169 and 61703077); the SCU-Luzhou Municipal People's Government Strategic Cooperation Project (No. 2020CDLZ-10); the National Natural Science Foundation of China (No. 62172228); and the National Natural Science Foundation of China (No. 61773270).
Abstract: Salient object detection (SOD) is a long-standing research topic in computer vision with increasing interest in the past decade. Since light fields record comprehensive information of natural scenes that benefit SOD in a number of ways, using light field inputs to improve saliency detection over conventional RGB inputs is an emerging trend. This paper provides the first comprehensive review and a benchmark for light field SOD, which has long been lacking in the saliency community. Firstly, we introduce light fields, including theory and data forms, and then review existing studies on light field SOD, covering ten traditional models, seven deep learning-based models, a comparative study, and a brief review. Existing datasets for light field SOD are also summarized. Secondly, we benchmark nine representative light field SOD models together with several cutting-edge RGB-D SOD models on four widely used light field datasets, providing insightful discussions and analyses, including a comparison between light field SOD and RGB-D SOD models. Due to the inconsistency of current datasets, we further generate complete data and supplement focal stacks, depth maps, and multi-view images for them, making them consistent and uniform. Our supplemental data make a universal benchmark possible. Lastly, light field SOD is a specialised problem because of its diverse data representations and high dependency on acquisition hardware, so it differs greatly from other saliency detection tasks. We provide nine observations on challenges and future directions, and outline several open issues. All the materials, including models, datasets, benchmarking results, and supplemented light field datasets, are publicly available at https://github.com/kerenfu/LFSOD-Survey.
Funding: Supported in part by the CAMS Innovation Fund for Medical Sciences, China (No. 2019-I2M5-016); the National Natural Science Foundation of China (No. 62172246); the Youth Innovation and Technology Support Plan of Colleges and Universities in Shandong Province, China (No. 2021KJ062); and the National Science Foundation of USA (Nos. IIS-1715985 and IIS-1812606).
Abstract: Recently, a new research trend in the video salient object detection (VSOD) community has focused on enhancing detection results via model self-fine-tuning using sparsely mined high-quality keyframes from the given sequence. Although such a learning scheme is generally effective, it has a critical limitation, i.e., a model learned on sparse frames possesses only weak generalization ability. This situation could become worse on “long” videos since they tend to have intensive scene variations. Moreover, in such videos, keyframe information from a longer time span is less relevant to earlier frames, which could also cause learning conflicts and degrade model performance. Thus, the learning scheme is usually incapable of handling complex pattern modeling. To solve this problem, we propose a divide-and-conquer framework, which can convert a complex problem domain into multiple simple ones. First, we devise a novel background consistency analysis (BCA) that effectively divides the mined frames into disjoint groups. Then, for each group, we assign an individual deep model to capture its key attributes during the fine-tuning phase. During the testing phase, we design a model-matching strategy that dynamically selects the best-matched model from the fine-tuned ones to handle the given testing frame. Comprehensive experiments show that our method can adapt to severe background appearance variation coupled with object movement and obtain robust saliency detection compared with the previous scheme and state-of-the-art methods.
Funding: Supported by the NEPU Natural Science Foundation under Grant Nos. 2017PY ZL05, 2018QNL-51, JY CX CX062018, JY CX JG062018, and JY CX 142020.
Abstract: Salient object detection is used as a preprocessing step in many computer vision tasks (such as salient object segmentation, video salient object detection, etc.). When performing salient object detection, depth information can provide clues to the location of target objects, so effective fusion of RGB and depth feature information is important. In this paper, we propose a new feature information aggregation approach, weighted group integration (WGI), to effectively integrate RGB and depth feature information. We use a dual-branch structure to slice the input RGB image and depth map separately and then merge the results separately by concatenation. As grouped features may lose global information about the target object, we also make use of the idea of residual learning, taking the features captured by the original fusion method as supplementary information to ensure both accuracy and completeness of the fused information. Experiments on five datasets show that our model performs better than typical existing approaches on four evaluation metrics.
Funding: Supported by the National Natural Science Foundation of China (51805078), the Project of National Key Laboratory of Advanced Casting Technologies (CAT2023-002), and the 111 Project (B16009).
Abstract: The Segment Anything Model (SAM) is a cutting-edge model that has shown impressive performance in general object segmentation. Its introduction is a groundbreaking step towards creating a universal intelligent model. Due to its superior performance in general object segmentation, it quickly gained attention and interest. This makes SAM particularly attractive for industrial surface defect segmentation, especially for complex industrial scenes with limited training data. However, its segmentation ability for specific industrial scenes remains unknown. Therefore, in this work, we select three representative and complex industrial surface defect detection scenarios, namely strip steel surface defects, tile surface defects, and rail surface defects, to evaluate the segmentation performance of SAM. Our results show that although SAM has great potential in general object segmentation, it cannot achieve satisfactory performance in complex industrial scenes. Our test results are available at: https://github.com/VDT-2048/SAM-IS.
Abstract: Saliency detection of stacked fruits of the same kind can assist robots in completing sorting tasks, which is an important prerequisite for the grading and packing of fruits. To accurately obtain salient targets among stacked fruits of the same kind under overexposure, non-uniform illumination, and low illumination, a method for detecting stacked fruits under poor illumination based on RGB-D visual saliency is proposed. Based on the Res2Net network, features from each layer of the two images are obtained. To realize the complementary advantages of RGB features and depth features, the input RGB images are preprocessed using depth weighting to obtain purified RGB features. To increase the information interaction between branches of different scales and better balance the fusion features and modality-exclusive features, a multi-scale progressive fusion module is proposed. To minimize the difference between the initial saliency maps generated by different features and improve the accuracy of the final predicted saliency maps, a multi-branch hybrid supervised method is used. Comprehensive experiments on a self-made dataset of stacked fruits of the same kind show that the proposed algorithm outperforms five state-of-the-art RGB-D SOD methods on four key indicators: the S value, F value, and MAE value, which are 0.979, 0.992, and 0.006, respectively, and the P-R curve, which lies closer to the upper right corner of the graph. These values demonstrate that the proposed algorithm can accurately obtain salient targets among stacked fruits of the same kind. The results of this study can promote the development of automation in the fruit production and packaging industry.
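Two of the indicators named in the abstract above, the MAE value and the F value, can be computed as in the following minimal sketch. The abstract does not give the exact definitions it uses, so the fixed binarization threshold of 0.5 and the weighting `beta2 = 0.3` are assumptions (the latter is a common convention in SOD evaluation), and the function names are hypothetical. The S value and the P-R curve are not covered here.

```python
def mae(pred, gt):
    """Mean absolute error between a saliency map in [0, 1] and a binary mask."""
    total = sum(abs(p - g) for prow, grow in zip(pred, gt) for p, g in zip(prow, grow))
    count = sum(len(row) for row in gt)
    return total / count


def f_measure(pred, gt, threshold=0.5, beta2=0.3):
    """Weighted F-measure of the thresholded prediction against the binary mask."""
    tp = fp = fn = 0
    for prow, grow in zip(pred, gt):
        for p, g in zip(prow, grow):
            predicted = p >= threshold
            actual = g > 0.5
            if predicted and actual:
                tp += 1
            elif predicted and not actual:
                fp += 1
            elif actual:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)


# Toy 2x2 example: one true salient pixel and one false positive.
pred = [[0.9, 0.2], [0.6, 0.1]]
gt = [[1, 0], [0, 0]]
mae_value = mae(pred, gt)
f_value = f_measure(pred, gt)
```

A lower MAE and a higher F value both indicate a prediction closer to the ground truth, which is the direction of the 0.006 and 0.992 figures quoted in the abstract.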
Funding: This work was supported by the National Natural Science Foundation of China (62176169, 61703077, and 62102207).
Abstract: Previous video object segmentation approaches mainly focus on simplex solutions linking appearance and motion, limiting effective feature collaboration between these two cues. In this work, we study a novel and efficient full-duplex strategy network (FSNet) to address this issue by considering a better mutual restraint scheme linking motion and appearance, allowing exploitation of cross-modal features from the fusion and decoding stages. Specifically, we introduce a relational cross-attention module (RCAM) to achieve bidirectional message propagation across embedding sub-spaces. To improve the model’s robustness and update inconsistent features from the spatiotemporal embeddings, we adopt a bidirectional purification module after the RCAM. Extensive experiments on five popular benchmarks show that our FSNet is robust to various challenging scenarios (e.g., motion blur and occlusion), and compares well to leading methods both for video object segmentation and video salient object detection. The project is publicly available at https://github.com/GewelsJI/FSNet.
Funding: Supported in part by the National Natural Science Foundation of China under Grant No. 62172228, and in part by an Open Project of the Key Laboratory of System Control and Information Processing, Ministry of Education (Shanghai Jiao Tong University, No. Scip202102).
Abstract: Salient object detection (SOD) in RGB and depth images has attracted increasing research interest. Existing RGB-D SOD models usually adopt fusion strategies to learn a shared representation from RGB and depth modalities, while few methods explicitly consider how to preserve modality-specific characteristics. In this study, we propose a novel framework, the specificity-preserving network (SPNet), which improves SOD performance by exploring both the shared information and modality-specific properties. Specifically, we use two modality-specific networks and a shared learning network to generate individual and shared saliency prediction maps. To effectively fuse cross-modal features in the shared learning network, we propose a cross-enhanced integration module (CIM) and propagate the fused feature to the next layer to integrate cross-level information. Moreover, to capture rich complementary multi-modal information to boost SOD performance, we use a multi-modal feature aggregation (MFA) module to integrate the modality-specific features from each individual decoder into the shared decoder. By using skip connections between encoder and decoder layers, hierarchical features can be fully combined. Extensive experiments demonstrate that our SPNet outperforms cutting-edge approaches on six popular RGB-D SOD and three camouflaged object detection benchmarks. The project is publicly available at https://github.com/taozh2017/SPNet.
Funding: Supported by the National Natural Science Foundation of China (61521002, 61572264, 61620106008); the National Youth Talent Support Program; the Tianjin Natural Science Foundation (17JCJQJC43700, 18ZXZNGX00110); and the Fundamental Research Funds for the Central Universities (Nankai University, No. 63191501).
Abstract: In this paper, we consider salient instance segmentation. As well as producing bounding boxes, our network also outputs high-quality instance-level segments as initial selections to indicate the regions of interest. Taking into account the category-independent property of each target, we design a single-stage salient instance segmentation framework with a novel segmentation branch. Our new branch considers not only the local context inside each detection window but also the surrounding context, enabling us to distinguish instances in the same scope even with partial occlusion. Our network is end-to-end trainable and is fast (running at 40 fps for images with resolution 320 × 320). We evaluate our approach on a publicly available benchmark and show that it outperforms alternative solutions. We also provide a thorough analysis of our design choices to help readers better understand the function of each part of our network. Source code can be found at https://github.com/RuochenFan/S4Net.
Abstract: The burgeoning field of Camouflaged Object Detection (COD) seeks to identify objects that blend into their surroundings. Despite the impressive performance of recent learning-based models, their robustness is limited, as existing methods may misclassify salient objects as camouflaged ones, despite their contradictory characteristics. This limitation may stem from the lack of multi-pattern training images, leading to reduced robustness against salient objects. To overcome the scarcity of multi-pattern training images, we introduce CamDiff, a novel approach inspired by AI-Generated Content (AIGC). Specifically, we leverage a latent diffusion model to synthesize salient objects in camouflaged scenes, while using the zero-shot image classification ability of the Contrastive Language-Image Pre-training (CLIP) model to prevent synthesis failures and ensure that the synthesized objects align with the input prompt. Consequently, the synthesized image retains its original camouflage label while incorporating salient objects, yielding camouflaged scenes with richer characteristics. The results of user studies show that the salient objects in our synthesized scenes attract the user’s attention more; thus, such samples pose a greater challenge to existing COD models. Our CamDiff enables flexible editing and efficient large-scale dataset generation at a low cost. It significantly enhances the training and testing phases of COD baselines, granting them robustness across diverse domains. Our newly generated datasets and source code are available at https://github.com/drlxj/CamDiff.
Funding: Supported by the Fundamental Research Funds for the Central Universities (Nankai University, No. 63243150).
Abstract: We introduce a novel bilateral reference framework (BiRefNet) for high-resolution dichotomous image segmentation (DIS). It comprises two essential components: the localization module (LM) and the reconstruction module (RM) with our proposed bilateral reference (BiRef). The LM aids in object localization using global semantic information. Within the RM, we utilize BiRef for the reconstruction process, where hierarchical patches of images provide the source reference and gradient maps serve as the target reference. These components collaborate to generate the final predicted maps. We also introduce auxiliary gradient supervision to enhance the focus on regions with finer details. In addition, we outline practical training strategies tailored for DIS to improve map quality and the training process. To validate the general applicability of our approach, we conduct extensive experiments on four tasks, which show that BiRefNet exhibits remarkable performance, outperforming task-specific cutting-edge methods across all benchmarks. Our code is publicly available at https://github.com/ZhengPeng7/BiRefNet.