Accurate identification of breast cancer molecular subtypes is critical for guiding treatment and improving patient outcomes. Traditional molecular subtyping via immunohistochemistry (IHC) testing is invasive, time-consuming, and may not fully represent tumor heterogeneity. This study proposes a non-invasive approach that applies deep learning to digital mammography images to classify breast cancer molecular subtypes. Four pretrained models, comprising two convolutional neural networks (MobileNet_V3_Large and VGG-16) and two Vision Transformers (ViT_B_16 and ViT_Base_Patch16_Clip_224), were fine-tuned to classify images into HER2-enriched, Luminal, Normal-like, and Triple Negative subtypes. Hyperparameter tuning, including learning rate adjustment and layer freezing strategies, was applied to optimize performance. Among the evaluated models, ViT_Base_Patch16_Clip_224 achieved the highest test accuracy (94.44%), with equally high precision, recall, and F1-score of 0.94, demonstrating excellent generalization. MobileNet_V3_Large achieved the same accuracy but showed less training stability. In contrast, VGG-16 recorded the lowest performance, indicating limited generalizability for this classification task. The study also highlighted the superior performance of the Vision Transformer models over the CNNs, attributed to their ability to capture global contextual features and to the CLIP-based pretraining of ViT_Base_Patch16_Clip_224. To enhance clinical applicability, a graphical user interface (GUI) named “BCMS Dx” was developed for streamlined subtype prediction. Deep learning applied to mammography has proven effective for accurate and non-invasive molecular subtyping, and the proposed Vision Transformer-based model and supporting GUI offer a promising direction for augmenting diagnostic workflows, minimizing the need for invasive procedures, and advancing personalized breast cancer management.
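To make the fine-tuning recipe concrete, here is a minimal PyTorch sketch of the strategy the abstract describes: load a pretrained backbone (torchvision's ViT-B/16 stands in for the four models), freeze most layers, replace the head for the four subtypes, and train at a small learning rate. The depth of unfreezing (last encoder block only) and the learning rate are illustrative assumptions, not the study's reported settings.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained ViT-B/16 (one of the four backbones evaluated).
model = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)

# Freeze the whole backbone, then unfreeze only the last encoder block:
# one possible "layer freezing strategy"; the abstract does not specify depth.
for p in model.parameters():
    p.requires_grad = False
for p in model.encoder.layers[-1].parameters():
    p.requires_grad = True

# Replace the classification head for the four molecular subtypes.
num_classes = 4  # HER2-enriched, Luminal, Normal-like, Triple Negative
model.heads.head = nn.Linear(model.heads.head.in_features, num_classes)

# Optimise only the trainable parameters at a small fine-tuning learning rate.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
criterion = nn.CrossEntropyLoss()
```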
The vision transformer (ViT) architecture, with its attention mechanism built on multi-head attention layers, has been widely adopted in computer-aided diagnosis tasks due to its effectiveness in processing medical image information. ViTs are notably complex, requiring high-performance GPUs or CPUs for efficient training and for deployment in real-world medical diagnostic devices, which makes them more demanding than convolutional neural networks (CNNs). This difficulty is compounded in histopathology image analysis, where the available images are both limited in number and complex in content. In response to these challenges, this study proposes TokenMixer, a hybrid architecture that combines the strengths of CNNs and ViTs. The hybrid architecture aims to improve feature extraction and classification accuracy with shorter training time and fewer parameters by minimizing the number of input patches used during training, while tokenizing input patches with convolutional layers and processing them through transformer encoder layers across all network layers for fast and accurate breast cancer tumor subtype classification. The TokenMixer mechanism is inspired by the ConvMixer and TokenLearner models. First, the ConvMixer component dynamically generates spatial attention maps using convolutional layers, enabling the extraction of patches from input images to minimize the number of input patches used in training. Second, the TokenLearner component extracts relevant regions from the selected input patches, tokenizes them to improve feature extraction, and trains all tokenized patches in a transformer encoder network. We evaluated the TokenMixer model on the public BreakHis dataset, comparing it with ViT-based and other state-of-the-art methods. Our approach achieved strong results for both binary and multi-class classification of breast cancer subtypes across magnification levels (40×, 100×, 200×, 400×). The model demonstrated accuracies of 97.02% for binary classification and 93.29% for multi-class classification, with decision times of 391.71 and 1173.56 s, respectively. These results highlight the potential of our hybrid ViT-CNN architecture for advancing tumor classification in histopathological images. The source code is accessible at https://github.com/abimouloud/TokenMixer.
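The following sketch illustrates the general shape of such a hybrid: convolutional tokenization of the image, a learned selector that keeps only the most salient tokens (in the spirit of TokenLearner), and a small transformer encoder over the reduced sequence. All layer sizes and the top-k selection rule are illustrative assumptions; the repository above holds the actual TokenMixer design.

```python
import torch
import torch.nn as nn

class ConvTokenizer(nn.Module):
    """Convolutional tokenization: a strided conv turns the image into a
    grid of tokens, standing in for the ConvMixer-style patch stage."""
    def __init__(self, in_ch=3, dim=128, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):
        t = self.proj(x)                      # (B, dim, H/patch, W/patch)
        return t.flatten(2).transpose(1, 2)   # (B, N, dim)

class TokenSelector(nn.Module):
    """TokenLearner-style reduction: score tokens and keep only the top-k,
    shrinking the sequence the transformer encoder must train on."""
    def __init__(self, dim, keep=16):
        super().__init__()
        self.score = nn.Linear(dim, 1)
        self.keep = keep

    def forward(self, tokens):
        w = self.score(tokens).squeeze(-1)            # (B, N) saliency scores
        idx = w.topk(self.keep, dim=1).indices        # keep most salient tokens
        return torch.gather(tokens, 1,
                            idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))

class HybridClassifier(nn.Module):
    def __init__(self, dim=128, classes=2, keep=16):
        super().__init__()
        self.tok = ConvTokenizer(dim=dim)
        self.sel = TokenSelector(dim, keep)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, classes)

    def forward(self, x):
        z = self.enc(self.sel(self.tok(x)))
        return self.head(z.mean(dim=1))   # mean-pool tokens, then classify

model = HybridClassifier()
print(model(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 2])
```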
The rapid growth of unlabeled time-series data in domains such as wireless communications, radar, biomedical engineering, and the Internet of Things (IoT) has driven advancements in unsupervised learning. This review synthesizes recent progress in applying autoencoders and vision transformers to unsupervised signal analysis, focusing on their architectures, applications, and emerging trends. We explore how these models enable feature extraction, anomaly detection, and classification across diverse signal types, including electrocardiograms, radar waveforms, and IoT sensor data. The review highlights the strengths of hybrid architectures and self-supervised learning, while identifying challenges in interpretability, scalability, and domain generalization. By bridging methodological innovations and practical applications, this work offers a roadmap for developing robust, adaptive models for signal intelligence.
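As a concrete instance of the autoencoder-based pipelines this review covers, the sketch below shows the standard pattern: a 1-D convolutional autoencoder compresses and reconstructs a signal window, and the reconstruction error serves as an unsupervised anomaly score. The window length and layer sizes are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

class SignalAutoencoder(nn.Module):
    """1-D convolutional autoencoder: compress a window of signal samples,
    reconstruct it, and use reconstruction error as an anomaly score."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 16, 8, stride=4, padding=2), nn.ReLU(),
            nn.Conv1d(16, 32, 8, stride=4, padding=2), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(32, 16, 8, stride=4, padding=2), nn.ReLU(),
            nn.ConvTranspose1d(16, 1, 8, stride=4, padding=2),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = SignalAutoencoder()
x = torch.randn(4, 1, 256)            # e.g. four ECG / IoT sensor windows
recon = model(x)
# Per-window anomaly score: mean squared reconstruction error.
score = ((x - recon) ** 2).mean(dim=(1, 2))
print(recon.shape, score.shape)       # (4, 1, 256) and (4,)
```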
Vision Transformers (ViTs) have achieved remarkable success across various artificial intelligence-based computer vision applications. However, their demanding computational and memory requirements pose significant challenges for deployment on resource-constrained edge devices. Although post-training quantization (PTQ) provides a promising solution by reducing model precision with minimal calibration data, aggressive low-bit quantization typically leads to substantial performance degradation. To address this challenge, we present the truncated uniform-log2 quantizer and progressive bit-decline reconstruction method for vision Transformer quantization (TP-ViT). It is an innovative PTQ framework specifically designed for ViTs, featuring two key technical contributions: (1) a truncated uniform-log2 quantizer, a novel quantization approach that effectively handles outlier values in post-Softmax activations, significantly reducing quantization errors; (2) a bit-decline optimization strategy, which employs transition weights to gradually reduce bit precision while maintaining model performance under extreme quantization conditions. Comprehensive experiments on image classification, object detection, and instance segmentation tasks demonstrate TP-ViT’s superior performance compared to state-of-the-art PTQ methods, particularly in challenging 3-bit quantization scenarios. Our framework achieves a notable improvement of 6.18 percentage points in top-1 accuracy for ViT-small under 3-bit quantization. These results validate TP-ViT’s robustness and general applicability, paving the way for more efficient deployment of ViT models in computer vision applications on edge hardware.
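To illustrate why log2-style quantization suits post-Softmax activations, the sketch below quantizes attention probabilities by rounding their base-2 exponent, which concentrates the few available levels near zero, where most softmax mass lies. This is a plain log2 quantizer for intuition only, not TP-ViT's truncated uniform-log2 scheme.

```python
import torch

def log2_quantize(attn, bits=3, eps=1e-8):
    """Log2 quantization of post-Softmax attention (values in (0, 1]).
    Softmax outputs are heavily skewed toward 0, so quantizing the exponent
    rather than the value spends the 2**bits levels where the mass lives."""
    levels = 2 ** bits - 1
    # Round -log2(p) to the nearest integer level, clamped to the bit budget.
    exp = torch.clamp(torch.round(-torch.log2(attn + eps)), 0, levels)
    return 2.0 ** (-exp)                    # dequantized value

attn = torch.softmax(torch.randn(1, 8, 16, 16), dim=-1)   # toy attention map
q = log2_quantize(attn, bits=3)
print((attn - q).abs().mean())              # mean quantization error
```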
Accurate estimation of diffuse horizontal irradiance (DHI) is critical for optimising photovoltaic system performance and energy forecasting, yet remains challenging in regions lacking comprehensive ground-based instrumentation. Recent advancements using Vision Transformers (ViTs) trained on extensive sky image datasets have shown promise in replacing costly irradiance measurement equipment, but the scarcity of long-term, high-quality sky imagery significantly restricts practical implementation. Addressing this critical gap, this study proposes a novel dual-framework approach designed for data-scarce scenarios. First, calculated atmospheric parameters, including extraterrestrial irradiance and cyclic time encodings, are integrated to represent sky conditions without any instrumentation. Next, a sequential pipeline first predicts synthetic global horizontal irradiance (GHI) and uses it as a feature to refine DHI estimation. Finally, a dual-parallel architecture simultaneously processes raw and overlay-enhanced fisheye sky images. Overlays are generated through unsupervised, physics-informed cloud segmentation to highlight dynamic sky features. Empirical validation is performed using data from the Chilbolton Observatory, chosen for its temperate climate and frequent cloud variability. To simulate data-scarce conditions, models are trained on a single month (e.g., January) and evaluated across a temporally disjoint, full-year test set. Under this setup, the sequential and dual-parallel frameworks achieve RMSE values within 2-3 W/m² and 1-6 W/m², respectively, of a state-of-the-art ViT trained on the complete dataset. By combining physics-informed modelling with unsupervised segmentation, the proposed method provides a scalable and cost-effective solution for DHI estimation, advancing solar resource assessment in data-constrained environments.
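A minimal sketch of the instrument-free inputs named in the abstract: cyclic encodings place hour-of-day and day-of-year on the unit circle so that 23:00/00:00 and Dec 31/Jan 1 stay numerically adjacent, and extraterrestrial irradiance follows the standard eccentricity-correction approximation. The exact feature set used in the study is an assumption here.

```python
import numpy as np
import pandas as pd

def sky_condition_features(timestamps):
    """Instrument-free features: cyclic time encodings plus extraterrestrial
    irradiance via the textbook eccentricity correction
    G_on ~= 1361 * (1 + 0.033 * cos(2*pi*n/365)) W/m^2, n = day of year."""
    ts = pd.to_datetime(timestamps)
    hour = ts.hour + ts.minute / 60.0
    doy = ts.dayofyear.to_numpy().astype(float)
    return pd.DataFrame({
        # Unit-circle encodings keep the wrap-around points adjacent.
        "hour_sin": np.sin(2 * np.pi * hour / 24.0),
        "hour_cos": np.cos(2 * np.pi * hour / 24.0),
        "doy_sin": np.sin(2 * np.pi * doy / 365.25),
        "doy_cos": np.cos(2 * np.pi * doy / 365.25),
        "extraterrestrial": 1361.0 * (1 + 0.033 * np.cos(2 * np.pi * doy / 365.0)),
    })

print(sky_condition_features(["2024-01-15 12:30", "2024-06-21 06:00"]))
```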
Recent advances in deep learning have significantly improved flood detection and segmentation from aerial and satellite imagery. However, conventional convolutional neural networks (CNNs) often struggle in complex flood scenarios involving reflections, occlusions, or indistinct boundaries due to limited contextual modeling. To address these challenges, we propose a hybrid flood segmentation framework that integrates a Vision Transformer (ViT) encoder with a U-Net decoder, enhanced by a novel Flood-Aware Refinement Block (FARB). The FARB module improves boundary delineation and suppresses noise by combining residual smoothing with spatial-channel attention mechanisms. We evaluate our model on a UAV-acquired flood imagery dataset, demonstrating that the proposed ViTUNet+FARB architecture outperforms existing CNN- and Transformer-based models in terms of accuracy and mean Intersection over Union (mIoU). Detailed ablation studies further validate the contribution of each component, confirming that the FARB design significantly enhances segmentation quality. Owing to its better performance and computational efficiency, the proposed framework is well-suited for flood monitoring and disaster response applications, particularly in resource-constrained environments.
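The sketch below shows one plausible reading of a refinement block that combines residual smoothing with spatial-channel attention, as the abstract describes for FARB; the kernel sizes, reduction ratio, and ordering of the attention stages are illustrative assumptions rather than the published design.

```python
import torch
import torch.nn as nn

class RefinementBlock(nn.Module):
    """Sketch in the spirit of FARB: residual 3x3 smoothing, followed by
    squeeze-excite-style channel attention and a 2-channel spatial
    attention map. Illustrative only, not the paper's exact module."""
    def __init__(self, ch):
        super().__init__()
        self.smooth = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
        )
        self.channel = nn.Sequential(              # channel attention
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // 4, 1), nn.ReLU(),
            nn.Conv2d(ch // 4, ch, 1), nn.Sigmoid(),
        )
        self.spatial = nn.Sequential(              # spatial attention
            nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid(),
        )

    def forward(self, x):
        y = x + self.smooth(x)                     # residual smoothing
        y = y * self.channel(y)                    # reweight channels
        s = torch.cat([y.mean(1, keepdim=True),
                       y.max(1, keepdim=True).values], dim=1)
        return y * self.spatial(s)                 # reweight locations

block = RefinementBlock(64)
print(block(torch.randn(1, 64, 32, 32)).shape)     # torch.Size([1, 64, 32, 32])
```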
Lung cancer remains a major global health challenge, with early diagnosis crucial for improved patient survival. Traditional diagnostic techniques, including manual histopathology and radiological assessments, are prone to errors and variability. Deep learning methods, particularly Vision Transformers (ViT), have shown promise for improving diagnostic accuracy by effectively extracting global features. However, ViT-based approaches face challenges related to computational complexity and limited generalizability. This research proposes the DualSet ViT-PSO-SVM framework, integrating a ViT with dual attention mechanisms, Particle Swarm Optimization (PSO), and Support Vector Machines (SVM), aiming for efficient and robust lung cancer classification across multiple medical image datasets. The study utilized three publicly available datasets: LIDC-IDRI, LUNA16, and TCIA, encompassing computed tomography (CT) scans and histopathological images. Data preprocessing included normalization, augmentation, and segmentation. Dual attention mechanisms enhanced the ViT’s feature extraction capabilities, PSO optimized feature selection, and an SVM performed classification. Model performance was evaluated on individual and combined datasets, benchmarked against CNN-based and standard ViT approaches. The DualSet ViT-PSO-SVM significantly outperformed existing methods, achieving superior accuracy rates of 97.85% (LIDC-IDRI), 98.32% (LUNA16), and 96.75% (TCIA). Cross-dataset evaluations demonstrated strong generalization capabilities and stability across similar imaging modalities. The proposed framework effectively bridges advanced deep learning techniques with clinical applicability, offering a robust diagnostic tool for lung cancer detection, reducing complexity, and improving diagnostic reliability and interpretability.
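The backbone of such a pipeline, ViT features feeding an SVM, can be sketched in a few lines; in the full framework, PSO would sit between the two stages to select a feature subset, a step omitted here for brevity. The frozen torchvision ViT and RBF kernel are stand-in choices, not the study's configuration.

```python
import torch
from torchvision import models
from sklearn.svm import SVC

# Use a pretrained ViT as a frozen feature extractor ...
vit = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)
vit.heads = torch.nn.Identity()      # drop the classifier, keep CLS features
vit.eval()

with torch.no_grad():
    feats = vit(torch.randn(8, 3, 224, 224)).numpy()   # stand-in CT patches
labels = [0, 1] * 4                                    # stand-in labels

# ... and classify the (optionally PSO-selected) features with an SVM.
clf = SVC(kernel="rbf").fit(feats, labels)
print(clf.predict(feats[:2]))
```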
In contemporary computer vision, convolutional neural networks (CNNs) and vision transformers (ViTs) represent the two primary architectural paradigms for image recognition. While both approaches have been widely adopted in medical imaging applications, they operate on fundamentally different computational principles. This report provides brief application notes on ViTs and CNNs, focusing on the scenarios that guide the selection of one architecture over the other in practical medical implementations. Generally, CNNs rely on convolutional kernels, localized receptive fields, and weight sharing, enabling efficient hierarchical feature extraction. These properties contribute to strong performance in detecting spatially constrained patterns such as textures, edges, and anatomical boundaries, while maintaining relatively low computational requirements. ViTs, on the other hand, decompose images into smaller segments referred to as tokens and employ self-attention mechanisms to model relationships across the entire image. This global modeling capability allows ViTs to capture long-range dependencies that may be difficult for convolution-based architectures to learn. However, ViTs typically achieve optimal performance when trained on extremely large datasets or when supported by extensive pretraining, as their reduced inductive bias requires greater data exposure to learn robust representations. This report briefly examines the architectural structure, underlying mathematical foundations, and relative performance characteristics of CNNs and ViTs, drawing upon recent findings from contemporary research. Emphasis is placed on understanding how differences in data availability, computational resources, and task requirements influence model effectiveness across medical imaging domains. Most importantly, the report serves as a concise application guide for practitioners seeking informed implementation decisions between these two influential deep learning frameworks.
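The contrast between the two paradigms is visible in a few lines of PyTorch: a convolution applies one small shared kernel over local neighbourhoods, while a ViT-style pipeline first embeds 16×16 patches as tokens and then computes a full token-to-token attention map in a single step. The dimensions here are arbitrary illustration values.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 224, 224)

# CNN view: a 3x3 convolution with weight sharing; each output value sees
# only a local neighbourhood, and the same small kernel slides everywhere.
conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)
print("conv params:", sum(p.numel() for p in conv.parameters()))

# ViT view: cut the image into 16x16 patches ("tokens"), then let
# self-attention relate every token to every other token in one step.
patchify = nn.Conv2d(3, 64, kernel_size=16, stride=16)    # patch embedding
tokens = patchify(x).flatten(2).transpose(1, 2)           # (1, 196, 64)
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
out, weights = attn(tokens, tokens, tokens)
print("tokens:", tokens.shape, "attention map:", weights.shape)  # (1,196,196)
```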
Foreign body classification on coal conveyor belts is a critical component of intelligent coal mining systems. Previous approaches have primarily utilized convolutional neural networks (CNNs) to integrate spatial and semantic information. However, the classification accuracy of CNN-based methods remains limited, primarily due to insufficient exploration of local image characteristics. Unlike CNNs, the Vision Transformer (ViT) captures discriminative features by modeling relationships between local image patches, but such methods typically require a large number of training samples to perform effectively. In the context of foreign body classification on coal conveyor belts, the limited availability of training samples hinders the full exploitation of ViT's capabilities. To address this issue, we propose an efficient approach, termed Key Part-level Attention Vision Transformer (KPA-ViT), which incorporates key local information into the transformer architecture to enrich the training signal. It comprises three main components: a key-point detection module, a key local mining module, and an attention module. To extract key local regions, a key-point detection strategy is first employed to identify the positions of key points. Subsequently, the key local mining module extracts the relevant local features based on these detected points. Finally, an attention module composed of self-attention and cross-attention blocks is introduced to integrate global and key part-level information, thereby enhancing the model's ability to learn discriminative features. Compared to recent transformer-based frameworks such as ViT, Swin-Transformer, and EfficientViT, the proposed KPA-ViT achieves performance improvements of 9.3%, 6.6%, and 2.8%, respectively, on the CUMT-BelT dataset, demonstrating its effectiveness.
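A hedged sketch of the fusion idea: global patch tokens pass through self-attention and then cross-attend to a small set of key part-level tokens, injecting local evidence into the global representation. Token counts, dimensions, and the residual/normalization layout are assumptions for illustration, not KPA-ViT's published module.

```python
import torch
import torch.nn as nn

class GlobalKeyPartFusion(nn.Module):
    """Illustrative attention module in the spirit of KPA-ViT: global patch
    tokens attend among themselves, then cross-attend to key part-level
    tokens (e.g. features cropped around detected key points)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, global_tokens, part_tokens):
        g, _ = self.self_attn(global_tokens, global_tokens, global_tokens)
        g = self.norm1(global_tokens + g)          # self-attention over patches
        c, _ = self.cross_attn(g, part_tokens, part_tokens)
        return self.norm2(g + c)                   # inject key part information

fusion = GlobalKeyPartFusion()
g = torch.randn(2, 196, 256)   # global patch tokens
p = torch.randn(2, 8, 256)     # tokens from 8 detected key regions
print(fusion(g, p).shape)      # torch.Size([2, 196, 256])
```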
Funding (breast cancer molecular subtyping study): Funded by the Ministry of Higher Education (MoHE) Malaysia through the Fundamental Research Grant Scheme—Early Career Researcher (FRGS-EC), grant number FRGSEC/1/2024/ICT02/UNIMAP/02/8.
Funding (TokenMixer study): Supported by the Deanship of Scientific Research at Northern Border University, Arar, KSA, through project number “NBU-FFR-2024-2439-05”.
Funding (TP-ViT study): Supported by the National Natural Science Foundation of China (Nos. 62301092 and 62301093).
Funding (DHI estimation study): Supported by the Engineering and Physical Sciences Research Council (EPSRC) Doctoral Training Partnership (DTP) funding and by EPSRC grant EP/X033333/1.
Funding (flood segmentation study): Supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2024-00405278), and partially supported by the Jeju Industry-University Convergence District Project for Promoting Industry-Campus Cooperation funded by the Ministry of Trade, Industry and Energy (MOTIE, Korea) [Project Number: P0029950].
Funding (KPA-ViT study): Funded by the National Key Research and Development Program of China (grant number 2023YFC2907600), the National Natural Science Foundation of China (grant number 52504132), and the Tiandi Science and Technology Co., Ltd. Science and Technology Innovation Venture Capital Special Project (grant number 2023-TD-ZD011-004).