Although conventional object detection methods achieve high accuracy through extensively annotated datasets, acquiring such large-scale labeled data remains challenging and cost-prohibitive in numerous real-world applications. Few-shot object detection aims to localize and classify objects in images using only limited annotated examples. However, its inherent challenge is that the few available samples lack the diversity needed to fully characterize the feature distribution, which degrades model performance. Inspired by contrastive learning principles, we propose an Implicit Feature Contrastive Learning (IFCL) module to address this limitation and augment feature diversity for more robust representation learning. This module generates augmented support sample features in a mixed feature space and implicitly contrasts them with query Region of Interest (RoI) features. This approach facilitates more comprehensive learning of both intra-class feature similarity and inter-class feature diversity, thereby enhancing the model’s object classification and localization capabilities. Extensive experiments on PASCAL VOC show that our method improves on the baseline model FPD by 3.2%, 1.8%, and 2.3% in the 10-shot setting on the three Novel Sets, respectively.
The self-attention mechanism of Transformers, which captures long-range contextual information, has demonstrated significant potential in image segmentation. However, the ability of Transformers to learn local, contextual relationships between pixels requires further improvement. Previous methods struggle to efficiently manage multi-scale features of different granularities from the encoder backbone, leaving room for improvement in their global representation and feature extraction capabilities. To address these challenges, we propose a novel Decoder with Multi-Head Feature Receptors (DMHFR), which receives multi-scale features from the encoder backbone and organizes them into three feature groups with different granularities: coarse, fine-grained, and full set. These groups are subsequently processed by Multi-Head Feature Receptors (MHFRs) after feature capture and modeling operations. The MHFRs comprise two Three-Head Feature Receptors (THFRs) and one Four-Head Feature Receptor (FHFR). Each group of features is passed through these MHFRs and then fed into axial transformers, which help the model capture long-range dependencies within the features. The three MHFRs produce three distinct feature outputs. The output from the FHFR serves as auxiliary features in the prediction head, and the prediction outputs and their losses are ultimately aggregated. Experimental results show that the Transformer using DMHFR outperforms 15 state-of-the-art (SOTA) methods on five public datasets. Specifically, it achieved significant improvements in mean Dice scores over the classic Parallel Reverse Attention Network (PraNet) method, with gains of 4.1%, 2.2%, 1.4%, 8.9%, and 16.3% on the CVC-ClinicDB, Kvasir-SEG, CVC-T, CVC-ColonDB, and ETIS-LaribPolypDB datasets, respectively.
Recently, there have been several attempts to apply Transformers to 3D point cloud classification. To reduce computation, most existing methods focus on local spatial attention but ignore point content and fail to establish relationships between distant but relevant points. To overcome this limitation of local spatial attention, we propose a point content-based Transformer architecture, called PointConT for short. It exploits the locality of points in the feature space (content-based): it clusters the sampled points with similar features into the same class and computes self-attention within each class, thus enabling an effective trade-off between capturing long-range dependencies and computational complexity. We further introduce an inception feature aggregator for point cloud classification, which uses parallel structures to aggregate high-frequency and low-frequency information in each branch separately. Extensive experiments show that our PointConT model achieves remarkable performance on point cloud shape classification. In particular, our method achieves 90.3% Top-1 accuracy on the hardest setting of ScanObjectNN. The source code for this paper is available at https://github.com/yahuiliu99/PointConT.
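The idea of grouping points by feature similarity and attending only within each group can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the PointConT implementation: the abstract does not say how the clustering is done, so the nearest-centroid assignment, the fixed `centroids` argument, and the function names `assign_clusters`, `self_attention`, and `clustered_attention` are all hypothetical, and the attention uses the features directly as queries, keys, and values with no learned projections.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def assign_clusters(features, centroids):
    """Assign each feature vector to its nearest centroid (assumed clustering rule)."""
    def nearest(feature):
        return min(
            range(len(centroids)),
            key=lambda k: sum((x - c) ** 2 for x, c in zip(feature, centroids[k])),
        )
    return [nearest(f) for f in features]


def self_attention(group):
    """Scaled dot-product self-attention with identity projections (assumption)."""
    scale = math.sqrt(len(group[0]))
    attended = []
    for query in group:
        weights = softmax([dot(query, key) / scale for key in group])
        attended.append(
            [sum(w * value[d] for w, value in zip(weights, group))
             for d in range(len(query))]
        )
    return attended


def clustered_attention(features, centroids):
    """Run self-attention separately inside each feature-space cluster."""
    labels = assign_clusters(features, centroids)
    output = [None] * len(features)
    for k in range(len(centroids)):
        members = [i for i, label in enumerate(labels) if label == k]
        if not members:
            continue
        attended = self_attention([features[i] for i in members])
        for i, vector in zip(members, attended):
            output[i] = vector
    return output, labels
```

Because attention is computed per cluster, the cost grows with the squared cluster size rather than the squared number of points, while two points that are far apart in space can still attend to each other if their features are similar.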
Plant diseases present a significant threat to global agricultural productivity, endangering both crop yields and quality. Traditional detection methods largely rely on manual inspection, a process that is not only labor-intensive and time-consuming but also subject to subjective biases and dependent on operators’ expertise. Recent advancements in Transformer-based architectures have shown substantial progress in image classification tasks, particularly excelling in global feature extraction. However, despite their strong performance, the high computational complexity and large parameter requirements of Transformer models limit their practical application in plant disease detection. To address these constraints, this study proposes an optimized Efficient Swin Transformer specifically engineered to reduce computational complexity while enhancing classification accuracy. This model is an improvement over the Swin-T architecture, incorporating two pivotal modules: the Selective Token Generator and the Feature Fusion Aggregator. The Selective Token Generator minimizes the number of tokens processed, significantly increasing computational efficiency and facilitating multi-scale feature extraction. Concurrently, the Feature Fusion Aggregator adaptively integrates static and dynamic features, thereby enhancing the model’s ability to capture complex details within intricate environmental contexts. Empirical evaluations conducted on the PlantDoc dataset demonstrate the model’s superior classification performance, achieving a precision of 80.14% and a recall of 76.27%. Compared to the standard Swin-T model, the Efficient Swin Transformer achieves approximately 20.89% reduction in parameter size while improving precision by 4.29%. This study substantiates the potential of efficient token conversion techniques within Transformer architectures, presenting an effective and accurate solution for plant disease detection in the agricultural sector.
Siamese tracking algorithms usually use convolutional neural networks (CNNs) as feature extractors owing to their capability of extracting deep discriminative features. However, the convolution kernels in CNNs have limited receptive fields, making it difficult to capture global feature dependencies, which are important for object detection, especially when the target undergoes large-scale variations or movement. In view of this, we develop a novel network called the effective convolution mixed Transformer Siamese network (SiamCMT) for visual tracking, which integrates CNN-based and Transformer-based architectures to capture both local information and long-range dependencies. Specifically, we design a Transformer-based module named lightweight multi-head attention (LWMHA), which can be flexibly embedded into stage-wise CNNs and improves the network’s representation ability. Additionally, we introduce a stage-wise feature aggregation mechanism that integrates features learned from multiple stages. By leveraging both location and semantic information, this mechanism helps SiamCMT locate the target more accurately. Moreover, to distinguish the contributions of different channels, a channel-wise attention mechanism is introduced to enhance the important channels and suppress the others. Extensive experiments on seven challenging benchmarks, i.e., OTB2015, UAV123, GOT10K, LaSOT, DTB70, UAVTrack112_L, and VOT2018, demonstrate the effectiveness of the proposed algorithm. In particular, the proposed method outperforms the baseline by 3.5% and 3.1% in terms of precision and success rates, respectively, with a real-time speed of 59.77 FPS on UAV123.
Dynamic sign language recognition holds significant importance, particularly with the application of deep learning to address its complexity. However, existing methods face several challenges. First, recognizing dynamic sign language requires identifying the keyframes that best represent the signs, and missing these keyframes reduces accuracy. Second, some methods do not focus enough on hand regions, which are small within the overall frame, leading to information loss. To address these challenges, we propose a novel Video Transformer Attention-based Network (VTAN) for dynamic sign language recognition. Our approach prioritizes informative frames and hand regions. To tackle the first issue, we designed a keyframe extraction module enhanced by a convolutional autoencoder, which focuses on selecting information-rich frames and eliminating redundant ones from the video sequences. For the second issue, we developed a soft attention-based transformer module that emphasizes extracting features from hand regions, ensuring that the network pays more attention to hand information within sequences. This dual-focus approach improves dynamic sign language recognition by addressing the key challenges of identifying critical frames and emphasizing hand regions. Experimental results on two public benchmark datasets demonstrate the effectiveness of our network, which outperforms most typical methods in sign language recognition tasks.
Few-shot learning has emerged as a crucial technique for coral species classification, addressing the challenge of limited labeled data in underwater environments. This study introduces an optimized few-shot learning model that enhances classification accuracy while minimizing reliance on extensive data collection. The proposed model integrates a hybrid similarity measure combining Euclidean distance and cosine similarity, effectively capturing both feature magnitude and directional relationships. This approach achieves a notable accuracy of 71.8% under a 5-way 5-shot evaluation, outperforming state-of-the-art models such as Prototypical Networks, FEAT, and ESPT by up to 10%. Notably, the model demonstrates high precision in classifying Siderastreidae (87.52%) and Fungiidae (88.95%), underscoring its effectiveness in distinguishing subtle morphological differences. To further enhance performance, we incorporate a self-supervised learning mechanism based on contrastive learning, enabling the model to extract robust representations by leveraging local structural patterns in corals. This enhancement significantly improves classification accuracy, particularly for species with high intra-class variation, leading to an overall accuracy of 76.52% under a 5-way 10-shot evaluation. Additionally, the model exploits the repetitive structures inherent in corals, introducing a local feature aggregation strategy that refines classification through spatial information integration. Beyond its technical contributions, this study presents a scalable and efficient approach for automated coral reef monitoring, reducing annotation costs while maintaining high classification accuracy. By improving few-shot learning performance in underwater environments, our model enhances monitoring accuracy by up to 15% compared to traditional methods, offering a practical solution for large-scale coral conservation efforts.
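A hybrid similarity of the kind described above can be sketched as follows. The abstract does not give the actual formula, so the linear blend, the weight `alpha`, and the names `hybrid_score`, `class_prototype`, and `classify` are assumptions made for illustration; the prototype-averaging step is borrowed from Prototypical Networks rather than taken from this paper.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance; sensitive to feature magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; sensitive to direction only."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def hybrid_score(query, prototype, alpha=0.5):
    """Higher is more similar. alpha is a hypothetical blending weight."""
    return (alpha * cosine_similarity(query, prototype)
            - (1.0 - alpha) * euclidean_distance(query, prototype))


def class_prototype(support_embeddings):
    """Mean of the support embeddings of one class."""
    count = len(support_embeddings)
    return [sum(column) / count for column in zip(*support_embeddings)]


def classify(query, support_by_class, alpha=0.5):
    """Return the label whose prototype has the highest hybrid score."""
    prototypes = {
        label: class_prototype(embeddings)
        for label, embeddings in support_by_class.items()
    }
    return max(prototypes, key=lambda label: hybrid_score(query, prototypes[label], alpha))
```

In an N-way K-shot episode, `support_by_class` would hold K embeddings for each of the N classes. Setting `alpha` to 1.0 or 0.0 falls back to pure cosine or pure Euclidean matching, which makes it easy to check what the blend contributes.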
The subsurface of cities is becoming increasingly congested. Timely records of subsurface structures are vital for the maintenance and management of urban infrastructure beneath or above the ground. Ground-penetrating radar (GPR) is a nondestructive testing method that can survey and image the subsurface without excavation. However, the interpretation of GPR data relies on the operator’s experience. An automatic workflow is proposed for recognizing and classifying subsurface structures with GPR using computer vision and machine learning techniques. The workflow comprises three stages: first, full-cover GPR measurements are processed to form the C-scans; second, the abnormal areas are extracted from the full-cover C-scans with the coefficient of variation-active contour model (CV-ACM); finally, the extracted segments are recognized and classified from the corresponding B-scans with aggregate channel features (ACF) to produce a semantic map. The selected computer vision methods were validated by a controlled test in the laboratory, and the entire workflow was evaluated with a real, on-site case study. The results of both the controlled test and the on-site case were promising. This study establishes the necessity of a full-cover 3D GPR survey, illustrates the feasibility of integrating advanced computer vision techniques to analyze a large amount of 3D GPR survey data, and paves the way for automating subsurface modeling with GPR.
As an important part of the new generation of information technology, the Internet of Things (IoT) has attracted wide attention and is regarded as an enabling technology for the next generation of health care systems. Fundus photography equipment is connected to the cloud platform through the IoT, enabling real-time uploading of fundus images and the rapid issuance of diagnostic suggestions by artificial intelligence. At the same time, important security and privacy issues have emerged. The data uploaded to the cloud platform involves patients’ personal attributes, health status, and medical application data. Once leaked, abused, or improperly disclosed, personal information security will be violated. Therefore, it is important to address the security and privacy issues of massive medical and healthcare equipment connecting to the infrastructure of IoT healthcare and health systems. To meet this challenge, we propose MIA-UNet, a multi-scale iterative aggregation U-network, which aims to achieve accurate and efficient retinal vessel segmentation for ophthalmic auxiliary diagnosis while keeping computational complexity low enough for mobile terminals. In this way, users do not need to upload the data to the cloud platform and can analyze and process the fundus images on their own mobile terminals, thus eliminating the leakage of personal information. Specifically, the interconnection between the encoder and decoder, as well as the internal connections between decoder subnetworks in the classic U-Net, are redefined and redesigned. Furthermore, we propose a hybrid loss function to smooth the gradient and deal with the imbalance between foreground and background. Compared with U-Net, the segmentation performance of the proposed network is significantly improved while the number of parameters increases by only 2%. When applied to three publicly available datasets, DRIVE, STARE, and CHASE DB1, the proposed network achieves accuracy/F1-score values of 96.33%/84.34%, 97.12%/83.17%, and 97.06%/84.10%, respectively. The experimental results show that MIA-UNet is superior to the state-of-the-art methods.
Background: Cross-modal retrieval has attracted widespread attention in many cross-media similarity search applications, particularly image-text retrieval in the fields of computer vision and natural language processing. Recently, visual and semantic embedding (VSE) learning has shown promising improvements in image-text retrieval tasks. Most existing VSE models employ two unrelated encoders to extract features and then use complex methods to contextualize and aggregate these features into holistic embeddings. Despite recent advances, existing approaches still suffer from two limitations: (1) without considering intermediate interactions and adequate alignment between different modalities, these models cannot guarantee the discriminative ability of representations; and (2) existing feature aggregators are susceptible to certain noisy regions, which may lead to unreasonable pooling coefficients and affect the quality of the final aggregated features. Methods: To address these challenges, we propose a novel cross-modal retrieval model containing a well-designed alignment module and a novel multimodal fusion encoder that aims to learn the adequate alignment and interaction of aggregated features to effectively bridge the modality gap. Results: Experiments on the Microsoft COCO and Flickr30k datasets demonstrated the superiority of our model over state-of-the-art methods.
In a study of Ca-Mg silicate crystalline glazes, we found several disequilibrium crystallization phenomena, such as non-crystallographic small-angle forking and spheroidal growth, parasitic and wedge-form crystals, dendritic growth, and secondary nucleation. These phenomena possibly resulted from two factors: (1) a partial temperature gradient, caused by heat asymmetry in the electrical resistance furnace when crystals crystallize from the silicate melt; and (2) constitutional supercooling near the surface of the crystals. Differences in disequilibrium crystallization phenomena among the main crystalline phases produce the various morphological features of the crystal aggregates. At the same time, disequilibrium crystallization leaves large residual stress in the crystals, which results in cracks in the glazes when the temperature drops. Based on these results, the authors analyze the phenomena and present the corresponding figures and data.
Space-time video super-resolution (STVSR) aims to reconstruct high-resolution, high-frame-rate videos from their low-resolution, low-frame-rate counterparts. Recent approaches utilize end-to-end deep learning models to achieve STVSR. They first interpolate intermediate frame features between given frames, then perform local and global refinement on the feature sequence, and finally increase the spatial resolution of these features. However, in the most important feature interpolation phase, they capture spatial-temporal information only from the most adjacent frame features and do not model long-term spatial-temporal correlations between multiple neighbouring frames, which are needed to restore variable-speed object movements and maintain long-term motion continuity. In this paper, we propose a novel long-term temporal feature aggregation network (LTFA-Net) for STVSR. Specifically, we design a long-term mixture of experts (LTMoE) module for feature interpolation. LTMoE contains multiple experts that extract mutual and complementary spatial-temporal information from multiple consecutive adjacent frame features; several gating nets then combine these with different weights to obtain the interpolation results. Next, we perform local and global feature refinement using the Locally-temporal Feature Comparison (LFC) module and a bidirectional deformable ConvLSTM layer, respectively. Experimental results on two standard benchmarks, Adobe240 and GoPro, indicate the effectiveness and superiority of our approach over state-of-the-art methods.
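The gated combination of experts described for feature interpolation can be illustrated with a toy example. This is a sketch of the general mixture-of-experts pattern only, not LTMoE itself: the two averaging experts, the window sizes, and the names `short_range_expert`, `long_range_expert`, and `interpolate` are hypothetical, and the gate logits are passed in by hand, whereas in a trained model a gating network would predict them from the input.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_of(frames):
    """Element-wise mean of a list of equally sized feature vectors."""
    count = len(frames)
    return [sum(values) / count for values in zip(*frames)]


def short_range_expert(frames, t):
    """Toy expert: uses only the two frames adjacent to the gap (t and t+1)."""
    return mean_of(frames[t:t + 2])


def long_range_expert(frames, t):
    """Toy expert: uses a wider window of up to four neighbouring frames."""
    return mean_of(frames[max(t - 1, 0):t + 3])


def interpolate(frames, t, gate_logits):
    """Blend the expert outputs with softmax gate weights."""
    expert_outputs = [short_range_expert(frames, t), long_range_expert(frames, t)]
    weights = softmax(gate_logits)
    dim = len(expert_outputs[0])
    return [
        sum(w * output[d] for w, output in zip(weights, expert_outputs))
        for d in range(dim)
    ]
```

For the accelerating one-dimensional sequence `[[0.0], [1.0], [4.0], [9.0]]` and `t = 1`, the short-range expert returns 2.5 and the long-range expert 3.5, so equal gate logits yield 3.0. Shifting the gate towards one expert moves the result towards that expert's estimate.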
Rain streaks introduced by atmospheric precipitation significantly degrade image quality and impair the reliability of high-level vision tasks. We present a novel image deraining framework built on a three-stage dual-residual architecture that progressively restores rain-degraded content while preserving fine structural details. Each stage begins with a multi-scale feature extractor and a channel attention module that adaptively emphasizes informative representations for rain removal. The core restoration is achieved via enhanced dual-residual blocks, which stabilize training and mitigate feature degradation across layers. To further refine representations, we integrate cross-dimensional spatial attention supervised by ground-truth guidance, ensuring that only high-quality features propagate to subsequent stages. Inter-stage feature fusion modules are employed to aggregate complementary information, reinforcing reconstruction continuity and consistency. Extensive experiments on five benchmark datasets (Rain100H, Rain100L, RainKITTI2012, RainKITTI2015, and JRSRD) demonstrate that our method establishes new state-of-the-art results in both fidelity and perceptual quality, effectively removing rain streaks while preserving natural textures and structural integrity.
Deep learning methods are increasingly applied to structured data. In typical methods, low-order features are discarded after being combined into high-order features for prediction tasks. However, in structured data, ignoring low-order features may reduce prediction accuracy. To address this issue, this paper proposes a deeper attention-based network (DAN). To keep both low- and high-order features, DAN uses an attention average pooling layer to aggregate the features of each order. Furthermore, through shortcut connections from each layer to the attention average pooling layer, DAN can be built extremely deep to obtain sufficient capacity. Experimental results show that DAN performs well and works effectively.
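Attention-weighted pooling over the outputs of every layer can be sketched as below. The abstract does not define DAN's attention average pooling layer, so this shows one common form under stated assumptions: each layer's output (one feature order) is scored by a dot product with a hypothetical attention vector `query`, and the softmax of those scores weights the average. The name `attention_average_pool` is illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_average_pool(layer_features, query):
    """Aggregate one feature vector per layer into a single vector.

    layer_features: list of equally sized vectors, from low order to high order,
                    as delivered by shortcut connections from each layer.
    query:          hypothetical attention vector (learned in a real model).
    Returns the pooled vector and the attention weight given to each layer.
    """
    scores = [
        sum(f * q for f, q in zip(features, query))
        for features in layer_features
    ]
    weights = softmax(scores)
    dim = len(layer_features[0])
    pooled = [
        sum(w * features[d] for w, features in zip(weights, layer_features))
        for d in range(dim)
    ]
    return pooled, weights
```

With an all-zero `query` every layer gets the same weight and the result is a plain average; a non-zero `query` lets the pooling emphasize whichever orders score highest, so low-order features remain available to the predictor instead of being discarded.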
Salient object detection (SOD) in RGB and depth images has attracted increasing research interest. Existing RGB-D SOD models usually adopt fusion strategies to learn a shared representation from RGB and depth modalities, while few methods explicitly consider how to preserve modality-specific characteristics. In this study, we propose a novel framework, the specificity-preserving network (SPNet), which improves SOD performance by exploring both the shared information and modality-specific properties. Specifically, we use two modality-specific networks and a shared learning network to generate individual and shared saliency prediction maps. To effectively fuse cross-modal features in the shared learning network, we propose a cross-enhanced integration module (CIM) and propagate the fused feature to the next layer to integrate cross-level information. Moreover, to capture rich complementary multi-modal information to boost SOD performance, we use a multi-modal feature aggregation (MFA) module to integrate the modality-specific features from each individual decoder into the shared decoder. By using skip connections between encoder and decoder layers, hierarchical features can be fully combined. Extensive experiments demonstrate that our SPNet outperforms cutting-edge approaches on six popular RGB-D SOD and three camouflaged object detection benchmarks. The project is publicly available at https://github.com/taozh2017/SPNet.
Funding (IFCL few-shot object detection study): funded by the China Chongqing Municipal Science and Technology Bureau, grant numbers CSTB2024TIAD-CYKJCXX0009, CSTB2024NSCQ-LZX0043, and CSTB2022NSCQ-MSX0288; the Chongqing Municipal Commission of Housing and Urban-Rural Development, grant number CKZ2024-87; the Chongqing University of Technology Graduate Education High-Quality Development Project, grant number gzlsz202401; the Chongqing University of Technology-Chongqing LINGLUE Technology Co., Ltd. Electronic Information (Artificial Intelligence) Graduate Joint Training Base; the Postgraduate Education and Teaching Reform Research Project in Chongqing, grant number yjg213116; and the Chongqing University of Technology-CISDI Chongqing Information Technology Co., Ltd. Computer Technology Graduate Joint Training Base.
Funding (DMHFR image segmentation study): supported by the Xiamen Medical and Health Guidance Project in 2021 (No. 3502Z20214ZD1070) and by a grant from the Guangxi Key Laboratory of Machine Vision and Intelligent Control, China (No. 2023B02).
Funding (PointConT point cloud classification study): supported in part by the National Natural Science Foundation of China (61876011), the National Key Research and Development Program of China (2022YFB4703700), the Key Research and Development Program 2020 of Guangzhou (202007050002), and the Key-Area Research and Development Program of Guangdong Province (2020B090921003).
Funding: Supported by the National Natural Science Foundation of China (Grant No. 62033007) and the Major Fundamental Research Program of Shandong Province (Grant No. ZR2023ZD37).
Abstract: Siamese tracking algorithms usually take convolutional neural networks (CNNs) as feature extractors owing to their capability of extracting deep discriminative features. However, the convolution kernels in CNNs have limited receptive fields, making it difficult to capture global feature dependencies, which are important for object detection, especially when the target undergoes large-scale variations or movement. In view of this, we develop a novel network called the effective convolution mixed Transformer Siamese network (SiamCMT) for visual tracking, which integrates CNN-based and Transformer-based architectures to capture both local information and long-range dependencies. Specifically, we design a Transformer-based module named lightweight multi-head attention (LWMHA), which can be flexibly embedded into stage-wise CNNs and improves the network’s representation ability. Additionally, we introduce a stage-wise feature aggregation mechanism that integrates features learned from multiple stages. By leveraging both location and semantic information, this mechanism helps SiamCMT better locate and find the target. Moreover, to distinguish the contributions of different channels, a channel-wise attention mechanism is introduced to enhance the important channels and suppress the others. Extensive experiments on seven challenging benchmarks, i.e., OTB2015, UAV123, GOT10K, LaSOT, DTB70, UAVTrack112_L, and VOT2018, demonstrate the effectiveness of the proposed algorithm. In particular, the proposed method outperforms the baseline by 3.5% and 3.1% in terms of precision and success rates, with a real-time speed of 59.77 FPS on UAV123.
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 62076117 and 62166026 and the Jiangxi Provincial Key Laboratory of Virtual Reality under Grant No. 2024SSY03151.
Abstract: Dynamic sign language recognition holds significant importance, particularly with the application of deep learning to address its complexity. However, existing methods face several challenges. First, recognizing dynamic sign language requires identifying the keyframes that best represent the signs, and missing these keyframes reduces accuracy. Second, some methods do not focus enough on hand regions, which are small within the overall frame, leading to information loss. To address these challenges, we propose a novel Video Transformer Attention-based Network (VTAN) for dynamic sign language recognition. Our approach prioritizes informative frames and hand regions. To tackle the first issue, we designed a keyframe extraction module enhanced by a convolutional autoencoder, which focuses on selecting information-rich frames and eliminating redundant ones from the video sequences. For the second issue, we developed a soft attention-based transformer module that emphasizes extracting features from hand regions, ensuring that the network pays more attention to hand information within sequences. This dual-focus approach improves dynamic sign language recognition by addressing the key challenges of identifying critical frames and emphasizing hand regions. Experimental results on two public benchmark datasets demonstrate the effectiveness of our network, which outperforms most typical methods in sign language recognition tasks.
Funding: Funded by the National Science and Technology Council (NSTC), Taiwan, under grant numbers NSTC 112-2634-F-019-001 and NSTC 113-2634-F-A49-007.
Abstract: Few-shot learning has emerged as a crucial technique for coral species classification, addressing the challenge of limited labeled data in underwater environments. This study introduces an optimized few-shot learning model that enhances classification accuracy while minimizing reliance on extensive data collection. The proposed model integrates a hybrid similarity measure combining Euclidean distance and cosine similarity, effectively capturing both feature magnitude and directional relationships. This approach achieves an accuracy of 71.8% under a 5-way 5-shot evaluation, outperforming state-of-the-art models such as Prototypical Networks, FEAT, and ESPT by up to 10%. Notably, the model demonstrates high precision in classifying Siderastreidae (87.52%) and Fungiidae (88.95%), underscoring its effectiveness in distinguishing subtle morphological differences. To further enhance performance, we incorporate a self-supervised learning mechanism based on contrastive learning, enabling the model to extract robust representations by leveraging local structural patterns in corals. This enhancement significantly improves classification accuracy, particularly for species with high intra-class variation, leading to an overall accuracy of 76.52% under a 5-way 10-shot evaluation. Additionally, the model exploits the repetitive structures inherent in corals, introducing a local feature aggregation strategy that refines classification through spatial information integration. Beyond its technical contributions, this study presents a scalable and efficient approach for automated coral reef monitoring, reducing annotation costs while maintaining high classification accuracy. By improving few-shot learning performance in underwater environments, our model enhances monitoring accuracy by up to 15% compared to traditional methods, offering a practical solution for large-scale coral conservation efforts.
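A hybrid similarity measure of the kind this abstract mentions can be sketched as follows. The abstract does not say how the two terms are combined, so the linear mix and the weight `alpha` are assumptions made for illustration, as are the prototype-based classifier and the class names `"A"` and `"B"`.

```python
import math


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def hybrid_score(query, prototype, alpha=0.5):
    """Higher means more similar.

    Assumption: a linear mix of cosine similarity (direction) and negated
    Euclidean distance (magnitude); alpha is a hypothetical weight.
    """
    return (alpha * cosine_similarity(query, prototype)
            - (1.0 - alpha) * euclidean_distance(query, prototype))


def class_prototype(support_features):
    """Mean of the support embeddings of one class."""
    n = len(support_features)
    return [sum(col) / n for col in zip(*support_features)]


def classify(query, prototypes, alpha=0.5):
    """Return the label whose prototype has the highest hybrid score."""
    return max(prototypes, key=lambda label: hybrid_score(query, prototypes[label], alpha))


# Toy 2-way 2-shot episode with made-up 2-D embeddings.
prototypes = {
    "A": class_prototype([[1.0, 0.0], [0.8, 0.2]]),
    "B": class_prototype([[0.0, 1.0], [0.2, 0.8]]),
}
predicted = classify([0.7, 0.3], prototypes)
```

The cosine term is insensitive to feature scale while the distance term is not, which is one way to capture both the directional and the magnitude relationships the abstract describes.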
Funding: Supported by Shenzhen University [860-000002111308].
Abstract: The subsurface of cities is becoming increasingly congested. Timely records of subsurface structures are of vital importance for the maintenance and management of urban infrastructure beneath or above the ground. Ground-penetrating radar (GPR) is a nondestructive testing method that can survey and image the subsurface without excavation. However, the interpretation of GPR data relies on the operator’s experience. An automatic workflow was proposed for recognizing and classifying subsurface structures with GPR using computer vision and machine learning techniques. The workflow comprises three stages: first, full-cover GPR measurements are processed to form the C-scans; second, the abnormal areas are extracted from the full-cover C-scans with the coefficient of variation-active contour model (CV-ACM); finally, the extracted segments are recognized and classified from the corresponding B-scans with aggregate channel features (ACF) to produce a semantic map. The selected computer vision methods were validated by a controlled test in the laboratory, and the entire workflow was evaluated with a real, on-site case study. The results of both the controlled and on-site cases were promising. This study establishes the necessity of a full-cover 3D GPR survey, illustrates the feasibility of integrating advanced computer vision techniques to analyze a large amount of 3D GPR survey data, and paves the way for automating subsurface modeling with GPR.
Funding: This work was supported in part by the National Natural Science Foundation of China (Nos. 62072074, 62076054, 62027827, 61902054), the Frontier Science and Technology Innovation Projects of National Key R&D Program (No. 2019QY1405), the Sichuan Science and Technology Innovation Platform and Talent Plan (No. 2020JDJQ0020), the Sichuan Science and Technology Support Plan (No. 2020YFSY0010), and the Natural Science Foundation of Guangdong Province (No. 2018A030313354).
Abstract: As an important part of the new generation of information technology, the Internet of Things (IoT) has attracted wide attention and is regarded as an enabling technology for next-generation health care systems. Fundus photography equipment is connected to a cloud platform through the IoT, enabling real-time uploading of fundus images and rapid issuance of diagnostic suggestions by artificial intelligence. At the same time, important security and privacy issues have emerged. The data uploaded to the cloud platform involve patients’ personal attributes, health status, and medical application data. Once these data are leaked, abused, or improperly disclosed, personal information security is violated. Therefore, it is important to address the security and privacy issues of the massive number of medical and healthcare devices connecting to IoT healthcare infrastructure. To meet this challenge, we propose MIA-UNet, a multi-scale iterative aggregation U-network, which aims to achieve accurate and efficient retinal vessel segmentation for ophthalmic auxiliary diagnosis while keeping computational complexity low enough for mobile terminals. In this way, users do not need to upload data to the cloud platform and can analyze and process fundus images on their own mobile terminals, thus eliminating the leakage of personal information. Specifically, the interconnection between the encoder and decoder, as well as the internal connections between decoder subnetworks in the classic U-Net, are redefined and redesigned. Furthermore, we propose a hybrid loss function to smooth the gradient and deal with the imbalance between foreground and background. Compared with U-Net, the segmentation performance of the proposed network is significantly improved while the number of parameters increases by only 2%. When applied to three publicly available datasets, DRIVE, STARE, and CHASE DB1, the proposed network achieves accuracy/F1-score values of 96.33%/84.34%, 97.12%/83.17%, and 97.06%/84.10%, respectively. The experimental results show that MIA-UNet is superior to the state-of-the-art methods.
Funding: Supported by the National Natural Science Foundation of China (62172109, 62072118), the National Science Foundation of Guangdong Province (2022A1515010322), the Guangdong Basic and Applied Basic Research Foundation (2021B1515120010), and the Huangpu International Sci&Tech Cooperation Foundation of Guangzhou (2021GH12).
Abstract: Background: Cross-modal retrieval has attracted widespread attention in many cross-media similarity search applications, particularly image-text retrieval in the fields of computer vision and natural language processing. Recently, visual and semantic embedding (VSE) learning has shown promising improvements in image-text retrieval tasks. Most existing VSE models employ two unrelated encoders to extract features and then use complex methods to contextualize and aggregate these features into holistic embeddings. Despite recent advances, existing approaches still suffer from two limitations: (1) without considering intermediate interactions and adequate alignment between different modalities, these models cannot guarantee the discriminative ability of representations; and (2) existing feature aggregators are susceptible to certain noisy regions, which may lead to unreasonable pooling coefficients and affect the quality of the final aggregated features. Methods: To address these challenges, we propose a novel cross-modal retrieval model containing a well-designed alignment module and a novel multimodal fusion encoder that aims to learn the adequate alignment and interaction of aggregated features to effectively bridge the modality gap. Results: Experiments on the Microsoft COCO and Flickr30k datasets demonstrated the superiority of our model over state-of-the-art methods.
Funding: Supported by the Natural Science Foundation of Fujian Province (No. D0210012).
Abstract: In a study of Ca-Mg silicate crystalline glazes, we found several disequilibrated crystallization phenomena, such as non-crystallographic small-angle forking and spheroidal growth, parasitism and wedging-form of crystals, dendritic growth, and secondary nucleation. These phenomena possibly resulted from two factors: (1) a partial temperature gradient, caused by heat asymmetry in the electrical resistance furnace while crystals crystallize from the silicate melt; and (2) constitutional supercooling near the surface of crystals. The disparity of disequilibrated crystallization phenomena among different main crystalline phases produces various morphological features of the crystal aggregates. At the same time, disequilibrated crystallization leaves great stress in the crystals, which results in cracks in the glazes when the temperature drops. Based on these results, the authors analyze these phenomena and present the relevant figures and data.
Abstract: Space-time video super-resolution (STVSR) aims to reconstruct high-resolution, high-frame-rate videos from their low-resolution, low-frame-rate counterparts. Recent approaches utilize end-to-end deep learning models to achieve STVSR. They first interpolate intermediate frame features between given frames, then perform local and global refinement across the feature sequence, and finally increase the spatial resolution of these features. However, in the most important feature interpolation phase, they only capture spatial-temporal information from the most adjacent frame features and do not model the long-term spatial-temporal correlations between multiple neighbouring frames that are needed to restore variable-speed object movements and maintain long-term motion continuity. In this paper, we propose a novel long-term temporal feature aggregation network (LTFA-Net) for STVSR. Specifically, we design a long-term mixture of experts (LTMoE) module for feature interpolation. LTMoE contains multiple experts that extract mutual and complementary spatial-temporal information from multiple consecutive adjacent frame features; several gating nets then combine the expert outputs with different weights to obtain the interpolation results. Next, we perform local and global feature refinement using the Locally-temporal Feature Comparison (LFC) module and a bidirectional deformable ConvLSTM layer, respectively. Experimental results on two standard benchmarks, Adobe240 and GoPro, indicate the effectiveness and superiority of our approach over the state of the art.
Funding: Supported by the Key Scientific and Technological Research Program of Henan Province (Grant No. 252102211111).
Abstract: Rain streaks introduced by atmospheric precipitation significantly degrade image quality and impair the reliability of high-level vision tasks. We present a novel image deraining framework built on a three-stage dual-residual architecture that progressively restores rain-degraded content while preserving fine structural details. Each stage begins with a multi-scale feature extractor and a channel attention module that adaptively emphasizes informative representations for rain removal. The core restoration is achieved via enhanced dual-residual blocks, which stabilize training and mitigate feature degradation across layers. To further refine representations, we integrate cross-dimensional spatial attention supervised by ground-truth guidance, ensuring that only high-quality features propagate to subsequent stages. Inter-stage feature fusion modules are employed to aggregate complementary information, reinforcing reconstruction continuity and consistency. Extensive experiments on five benchmark datasets (Rain100H, Rain100L, RainKITTI2012, RainKITTI2015, and JRSRD) demonstrate that our method establishes new state-of-the-art results in both fidelity and perceptual quality, effectively removing rain streaks while preserving natural textures and structural integrity.
Funding: Sichuan Science and Technology Program (2018GZDZX0042, 2018HH0061).
Abstract: Deep learning methods are applied to structured data, and in typical methods low-order features are discarded after being combined into high-order features for prediction tasks. However, in structured data, ignoring low-order features may lead to a low prediction rate. To address this issue, this paper proposes a deeper attention-based network (DAN). To keep both low- and high-order features, DAN uses an attention average pooling layer to aggregate the features of each order. Furthermore, with shortcut connections from each layer to the attention average pooling layer, DAN can be built extremely deep to obtain sufficient capacity. Experimental results show that DAN performs well and works effectively.
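One way to read the attention average pooling described in this abstract is as a softmax-weighted average over the feature vectors produced by each layer, with every layer feeding the pooling step directly through a shortcut. The sketch below follows that reading. It is not the DAN implementation: the dot-product scoring and the `attention_vector` parameter are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_average_pool(layer_features, attention_vector):
    """Softmax-weighted average of per-layer feature vectors.

    Assumption: each layer's score is the dot product between its feature
    vector and a (normally learned) attention vector.
    """
    scores = [
        sum(a * h for a, h in zip(attention_vector, feat))
        for feat in layer_features
    ]
    weights = softmax(scores)
    dim = len(layer_features[0])
    pooled = [
        sum(w * feat[d] for w, feat in zip(weights, layer_features))
        for d in range(dim)
    ]
    return pooled, weights


# Toy features from three layers, ordered from low-order to high-order.
layer_features = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
# A zero attention vector gives every layer the same weight (plain average).
attention_vector = [0.0, 0.0]
pooled, weights = attention_average_pool(layer_features, attention_vector)
```

Since every layer contributes to the pooled vector, low-order features remain available to the prediction head no matter how many layers are stacked.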
Funding: Supported in part by the National Natural Science Foundation of China under Grant No. 62172228 and in part by an Open Project of the Key Laboratory of System Control and Information Processing, Ministry of Education (Shanghai Jiao Tong University, No. Scip202102).
Abstract: Salient object detection (SOD) in RGB and depth images has attracted increasing research interest. Existing RGB-D SOD models usually adopt fusion strategies to learn a shared representation from the RGB and depth modalities, while few methods explicitly consider how to preserve modality-specific characteristics. In this study, we propose a novel framework, the specificity-preserving network (SPNet), which improves SOD performance by exploring both the shared information and modality-specific properties. Specifically, we use two modality-specific networks and a shared learning network to generate individual and shared saliency prediction maps. To effectively fuse cross-modal features in the shared learning network, we propose a cross-enhanced integration module (CIM) and propagate the fused feature to the next layer to integrate cross-level information. Moreover, to capture rich complementary multi-modal information and boost SOD performance, we use a multi-modal feature aggregation (MFA) module to integrate the modality-specific features from each individual decoder into the shared decoder. By using skip connections between encoder and decoder layers, hierarchical features can be fully combined. Extensive experiments demonstrate that our SPNet outperforms cutting-edge approaches on six popular RGB-D SOD benchmarks and three camouflaged object detection benchmarks. The project is publicly available at https://github.com/taozh2017/SPNet.