Funding: Supported by the National Natural Science Foundation of China (Grant No. 82220108007).
Abstract: Background: Infertility is gradually increasing worldwide, and male sperm problems are a main factor; as a result, more and more couples are using computer-assisted sperm analysis (CASA) to assist in the analysis and treatment of infertility. Meanwhile, the rapid development of deep learning (DL) has led to strong results in image classification tasks. However, the classification of sperm images has not been well studied with current deep learning methods, and sperm images are often affected by noise in practical CASA applications. The purpose of this article is to investigate the anti-noise robustness of deep learning classification methods applied to sperm images. Methods: The SVIA dataset is a publicly available large-scale sperm dataset containing three subsets. In this work, we used subset-C, which provides more than 125,000 independent sperm and impurity images, including 121,401 sperm images and 4,479 impurity images. To investigate the anti-noise robustness of deep learning classification methods applied to sperm images, we conducted a comprehensive comparative study using many convolutional neural network (CNN) and visual transformer (VT) deep learning methods to find the model with the most stable anti-noise robustness. Results: This study showed that VTs had strong robustness for the classification of tiny-object (sperm and impurity) image datasets under several types of conventional noise and some adversarial attacks. In particular, under the influence of Poisson noise, accuracy changed from 91.45% to 91.08%, impurity precision changed from 92.7% to 91.3%, impurity recall changed from 88.8% to 89.5%, and impurity F1-score changed from 90.7% to 90.4%. Meanwhile, sperm precision changed from 90.9% to 90.5%, sperm recall changed from 92.5% to 93.8%, and sperm F1-score changed from 92.1% to 90.4%. Conclusion: Sperm image classification may be strongly affected by noise in current deep learning methods; the noise robustness of VT methods, which are based on global information, is greater than that of CNN methods, which are based on local information, indicating that noise robustness is reflected mainly in global information.
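As a rough illustration of the noise-robustness protocol described above, the sketch below corrupts test images with Poisson (shot) noise and compares clean versus noisy accuracy. The model choice, data loader, and intensity scaling are assumptions for illustration, not the study's actual pipeline.

```python
import torch
import torchvision.models as models

def add_poisson_noise(images: torch.Tensor, scale: float = 255.0) -> torch.Tensor:
    """Simulate shot noise by resampling pixel intensities from a Poisson law."""
    counts = torch.poisson(images.clamp(0, 1) * scale)  # photon-count simulation
    return (counts / scale).clamp(0, 1)

@torch.no_grad()
def accuracy(model, loader, corrupt=None) -> float:
    """Top-1 accuracy (%) on a loader, optionally applying a corruption first."""
    model.eval()
    correct = total = 0
    for x, y in loader:
        if corrupt is not None:
            x = corrupt(x)
        pred = model(x).argmax(dim=1)
        correct += (pred == y).sum().item()
        total += y.numel()
    return 100.0 * correct / total

# Usage (assumes a two-class sperm/impurity test_loader already exists):
# model = models.vit_b_16(num_classes=2)   # one VT baseline; the study compares many models
# clean_acc = accuracy(model, test_loader)
# noisy_acc = accuracy(model, test_loader, corrupt=add_poisson_noise)
```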
Abstract: While traditional Convolutional Neural Network (CNN)-based semantic segmentation methods have proven effective, they often encounter significant computational challenges due to the requirement for dense pixel-level predictions, which complicates real-time implementation. To address this, we introduce an advanced real-time semantic segmentation strategy specifically designed for autonomous driving, utilizing the capabilities of Visual Transformers. By leveraging the self-attention mechanism inherent in Visual Transformers, our method enhances global contextual awareness, refining the representation of each pixel in relation to the overall scene. This enhancement is critical for quickly and accurately interpreting the complex elements within driving scenarios, a fundamental need for autonomous vehicles. Our experiments on the DriveSeg autonomous driving dataset indicate that our model surpasses traditional segmentation methods, achieving a significant 4.5% improvement in Mean Intersection over Union (mIoU) while maintaining real-time responsiveness. This paper not only underscores the potential for optimized semantic segmentation but also establishes a promising direction for real-time processing in autonomous navigation systems. Future work will focus on integrating this technique with other perception modules in autonomous driving to further improve the robustness and efficiency of self-driving perception frameworks, opening new pathways for research and practical applications in scenarios requiring rapid and precise decision-making. Further experimentation and adaptation of this model could have broader implications for machine learning and computer vision, particularly in enhancing the interaction between automated systems and their dynamic environments.
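For reference, the following is a minimal sketch of the Mean Intersection over Union (mIoU) metric reported above, in its standard per-class formulation; it is not the authors' evaluation code.

```python
import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """pred and target are integer label maps of identical shape."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:                  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

# Example with two random 4x4 label maps and 3 classes:
# pred = np.random.randint(0, 3, (4, 4)); target = np.random.randint(0, 3, (4, 4))
# print(mean_iou(pred, target, num_classes=3))
```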
Abstract: Most economists approach China's economy from a single perspective, considering it a special form of transition economy. Based on an analysis from this single perspective, the paper puts forward a dual perspective that treats China's economy as having both transition and transformation features, and attempts to study it from this dual perspective.
Funding: Supported by the National Key R&D Program of China under Grant No. 2020AAA0106200 and the National Natural Science Foundation of China under Grant Nos. 61832016 and U20B2070.
Abstract: Transformers, the dominant architecture for natural language processing, have also recently attracted much attention from computational visual media researchers due to their capacity for long-range representation and high performance. Transformers are sequence-to-sequence models that use a self-attention mechanism rather than the sequential structure of RNNs. Thus, such models can be trained in parallel and can represent global information. This study comprehensively surveys recent visual transformer works. We categorize them according to task scenario: backbone design, high-level vision, low-level vision and generation, and multimodal learning. Their key ideas are also analyzed. Differing from previous surveys, we mainly focus on visual transformer methods for low-level vision and generation. The latest works on backbone design are also reviewed in detail. For ease of understanding, we precisely describe the main contributions of the latest works in the form of tables. In addition to quantitative comparisons, we present image results for low-level vision and generation tasks. Computational costs and source code links for various important works are also given in this survey to assist further development.
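To make the mechanism shared by the surveyed models concrete, here is a minimal scaled dot-product self-attention sketch; the token count and dimensions are arbitrary illustrations, not taken from any surveyed work.

```python
import torch
import torch.nn.functional as F

def self_attention(x: torch.Tensor, w_q: torch.Tensor, w_k: torch.Tensor,
                   w_v: torch.Tensor) -> torch.Tensor:
    """x: (batch, tokens, dim). Every token attends to every other token, which is
    why transformers capture global context and parallelize over the sequence,
    unlike a step-by-step RNN."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)  # pairwise similarities
    return F.softmax(scores, dim=-1) @ v                     # weighted mix of values

# Usage with random projections:
# d = 64
# x = torch.randn(2, 196, d)                      # e.g., 14 x 14 image patches
# w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
# out = self_attention(x, w_q, w_k, w_v)          # shape (2, 196, 64)
```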
Funding: Supported by the National Science and Technology Major Project (No. 2022YFB4500602).
Abstract: Gaze following aims to interpret human-scene interactions by predicting a person's focal point of gaze. Prevailing approaches often adopt a two-stage framework, whereby multi-modality information is extracted in the initial stage for gaze target prediction. Consequently, the efficacy of these methods depends heavily on the precision of the preceding modality extraction. Other approaches use a single modality with complex decoders, increasing the network's computational load. Inspired by the remarkable success of pre-trained plain vision transformers (ViTs), we introduce a novel single-modality gaze following framework called ViTGaze. In contrast to previous methods, it builds the framework mainly on powerful encoders (the decoder accounts for less than 1% of the parameters). Our principal insight is that the inter-token interactions within self-attention can be transferred to interactions between humans and scenes. Leveraging this insight, we formulate a framework consisting of a 4D interaction encoder and a 2D spatial guidance module to extract human-scene interaction information from self-attention maps. Furthermore, our investigation reveals that ViTs with self-supervised pre-training have an enhanced ability to extract correlation information. Extensive experiments demonstrate the performance of the proposed method. Our method achieves state-of-the-art performance among all single-modality methods (a 3.4% improvement in the area under curve score and a 5.1% improvement in average precision) and very comparable performance against multi-modality methods with 59% fewer parameters.
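The sketch below illustrates the general idea of reading human-scene interactions out of ViT self-attention maps, loosely in the spirit of the 2D spatial guidance described above. It is a conceptual illustration, not the ViTGaze implementation; the patch grid size and head-position handling are assumptions.

```python
import torch

def head_token_heatmap(attn: torch.Tensor, head_xy: tuple, grid: int = 14) -> torch.Tensor:
    """attn: (tokens, tokens) self-attention over patch tokens, averaged over heads.
    head_xy: normalized (x, y) position of the person's head in [0, 1]."""
    col = min(int(head_xy[0] * grid), grid - 1)
    row = min(int(head_xy[1] * grid), grid - 1)
    idx = row * grid + col                   # patch token lying under the head
    return attn[idx].reshape(grid, grid)     # how that patch attends to the rest of the scene

# attn = torch.rand(196, 196).softmax(dim=-1)   # placeholder attention matrix
# heatmap = head_token_heatmap(attn, head_xy=(0.3, 0.25))
```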
Funding: Funded by the Heilongjiang Provincial Natural Science Foundation of China (Grant No. LH2022F030).
Abstract: 3D model classification has emerged as a significant research focus in computer vision. However, traditional convolutional neural networks (CNNs) often struggle to capture global dependencies across both the height and width dimensions simultaneously, leading to limited feature representation capabilities when handling complex visual tasks. To address this challenge, we propose a novel 3D model classification network named ViT-GE (Vision Transformer with Global and Efficient Attention), which integrates Global Grouped Coordinate Attention (GGCA) and Efficient Channel Attention (ECA) mechanisms. Specifically, the Vision Transformer (ViT) is employed to extract comprehensive global features from multi-view inputs using its self-attention mechanism, effectively capturing 3D shape characteristics. To further enhance spatial feature modeling, the GGCA module introduces a grouping strategy and global context interactions. Concurrently, the ECA module strengthens inter-channel information flow, enabling the network to adaptively emphasize key features and improve feature fusion. Finally, a voting mechanism is adopted to enhance classification accuracy, robustness, and stability. Experimental results on the ModelNet10 dataset demonstrate that our method achieves a classification accuracy of 93.50%, validating its effectiveness and superior performance.
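For reference, the block below sketches a standard Efficient Channel Attention (ECA) layer of the kind named above, following the commonly published ECA-Net formulation; the kernel size is illustrative, and the paper's GGCA module and ViT backbone are not reproduced here.

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Channel attention via a 1D convolution over pooled channel descriptors."""
    def __init__(self, k_size: int = 3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=k_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) -> (B, 1, C) channel descriptor -> local cross-channel interaction
        y = self.pool(x).squeeze(-1).transpose(-1, -2)                  # (B, 1, C)
        y = self.sigmoid(self.conv(y)).transpose(-1, -2).unsqueeze(-1)  # (B, C, 1, 1)
        return x * y                                                    # channel-wise reweighting

# eca = ECA(k_size=3)
# out = eca(torch.randn(4, 64, 32, 32))   # same shape as the input
```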
Funding: Supported by the National Key Research and Development Program of China (2021YFE0113700), the National Natural Science Foundation of China (32360705, 31960555), the Guizhou Provincial Science and Technology Program (2019-1410, HZJD[2022]001, GCC[2023]070), the Outstanding Young Scientist Program of Guizhou Province (KY2021-026), and the Program for Introducing Talents to Chinese Universities (111 Program, D20023).
Abstract: Accurate identification of plant diseases is important for ensuring the safety of agricultural production. Convolutional neural networks (CNNs) and visual transformers (VTs) can extract effective representations of images and have been widely used for the intelligent recognition of plant disease images. However, CNNs have excellent local perception but poor global perception, while VTs have excellent global perception but poor local perception. This makes it difficult to further improve the performance of either CNNs or VTs on plant disease recognition tasks. In this paper, we propose a local and global feature-aware dual-branch network, named LGNet, for the identification of plant diseases. More specifically, we first design a dual-branch structure based on CNNs and VTs to extract local and global features. Then, an adaptive feature fusion (AFF) module is designed to fuse the local and global features, driving the model to dynamically perceive the weights of different features. Finally, we design a hierarchical mixed-scale unit-guided feature fusion (HMUFF) module to mine the key information in the features at different levels and fuse the differentiated information among them, thereby enhancing the model's multiscale perception capability. Extensive experiments were then conducted on the AI Challenger 2018 dataset and a self-collected corn disease (SCD) dataset. The experimental results demonstrate that our proposed LGNet achieves state-of-the-art recognition performance on both the AI Challenger 2018 dataset and the SCD dataset, with accuracies of 88.74% and 99.08%, respectively.
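To make the dual-branch idea concrete, here is a minimal, hypothetical sketch of fusing local (CNN) and global (VT) features with input-dependent weights; it illustrates the spirit of adaptive feature fusion but is not the authors' AFF or HMUFF module.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Blend a CNN (local) feature and a VT (global) feature with learned gates."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, local_feat: torch.Tensor, global_feat: torch.Tensor) -> torch.Tensor:
        # local_feat, global_feat: (batch, dim) pooled features from the two branches
        w = self.gate(torch.cat([local_feat, global_feat], dim=-1))  # per-channel weight in (0, 1)
        return w * local_feat + (1.0 - w) * global_feat              # adaptive blend

# fuse = GatedFusion(dim=256)
# out = fuse(torch.randn(8, 256), torch.randn(8, 256))   # shape (8, 256)
```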