Funding: Funded by the Ministry of Higher Education (MoHE) Malaysia through the Fundamental Research Grant Scheme - Early Career Researcher (FRGS-EC), grant number FRGSEC/1/2024/ICT02/UNIMAP/02/8.
Abstract: ...critical for guiding treatment and improving patient outcomes. Traditional molecular subtyping via immunohistochemistry (IHC) testing is invasive, time-consuming, and may not fully represent tumor heterogeneity. This study proposes a non-invasive approach that uses digital mammography images and deep learning to classify breast cancer molecular subtypes. Four pretrained models, two Convolutional Neural Networks (MobileNet_V3_Large and VGG-16) and two Vision Transformers (ViT_B_16 and ViT_Base_Patch16_Clip_224), were fine-tuned to classify images into HER2-enriched, Luminal, Normal-like, and Triple-Negative subtypes. Hyperparameter tuning, including learning rate adjustment and layer freezing strategies, was applied to optimize performance. Among the evaluated models, ViT_Base_Patch16_Clip_224 achieved the highest test accuracy (94.44%), with precision, recall, and F1-score all at 0.94, demonstrating excellent generalization. MobileNet_V3_Large achieved the same accuracy but showed less training stability. In contrast, VGG-16 recorded the lowest performance, indicating limited generalizability for this classification task. The study also highlighted the superior performance of the Vision Transformer models over CNNs, attributed to their ability to capture global contextual features and to the benefit of CLIP-based pretraining in ViT_Base_Patch16_Clip_224. To enhance clinical applicability, a graphical user interface (GUI) named “BCMS Dx” was developed for streamlined subtype prediction. Deep learning applied to mammography has proven effective for accurate and non-invasive molecular subtyping. The proposed Vision Transformer-based model and supporting GUI offer a promising direction for augmenting diagnostic workflows, minimizing the need for invasive procedures, and advancing personalized breast cancer management.
Funding: Supported by the National Natural Science Foundation of China (Nos. 62301092 and 62301093).
Abstract: Vision Transformers (ViTs) have achieved remarkable success across various artificial intelligence-based computer vision applications. However, their demanding computational and memory requirements pose significant challenges for deployment on resource-constrained edge devices. Although post-training quantization (PTQ) provides a promising solution by reducing model precision with minimal calibration data, aggressive low-bit quantization typically leads to substantial performance degradation. To address this challenge, we present the truncated uniform-log2 quantizer and progressive bit-decline reconstruction method for vision Transformer quantization (TP-ViT). It is a PTQ framework specifically designed for ViTs, featuring two key technical contributions: (1) a truncated uniform-log2 quantizer, a novel quantization approach that effectively handles outlier values in post-Softmax activations, significantly reducing quantization errors; (2) a bit-decline optimization strategy, which employs transition weights to gradually reduce bit precision while maintaining model performance under extreme quantization conditions. Comprehensive experiments on image classification, object detection, and instance segmentation tasks demonstrate TP-ViT’s superior performance compared to state-of-the-art PTQ methods, particularly in challenging 3-bit quantization scenarios. Our framework achieves a notable improvement of 6.18 percentage points in top-1 accuracy for ViT-small under 3-bit quantization. These results validate TP-ViT’s robustness and general applicability, paving the way for more efficient deployment of ViT models in computer vision applications on edge hardware.
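To make the post-Softmax quantization problem concrete, the sketch below compares a plain uniform quantizer with a plain log2 quantizer on one attention row. This is a generic illustration only, not the truncated uniform-log2 quantizer of TP-ViT, whose details the abstract does not give. The function names and the sample row are hypothetical.

```python
import math

def uniform_quantize(x, bits):
    """Uniform quantizer on [0, 1]; returns the dequantized value."""
    levels = 2 ** bits - 1
    return round(x * levels) / levels

def log2_quantize(x, bits):
    """Plain log2 quantizer: integer code q such that x is approximated by 2**-q."""
    max_code = 2 ** bits - 1
    if x <= 0.0:
        return max_code  # zero is mapped to the smallest representable value
    q = round(-math.log2(x))
    return min(max(q, 0), max_code)

def log2_dequantize(q):
    """Recover the approximate value from a log2 code."""
    return 2.0 ** -q

# Hypothetical post-Softmax attention row: one dominant entry, several small ones.
row = [0.70, 0.20, 0.06, 0.03, 0.01]
uniform_row = [uniform_quantize(x, 3) for x in row]
log2_row = [log2_dequantize(log2_quantize(x, 3)) for x in row]
```

With 3 bits, the uniform quantizer rounds the three smallest entries to zero, while the log2 quantizer keeps them distinct at the cost of a coarser approximation of the largest entry. That trade-off is one common motivation for combining the two schemes.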
Funding: Supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2024-00405278), and partially supported by the Jeju Industry-University Convergence District Project for Promoting Industry-Campus Cooperation, funded by the Ministry of Trade, Industry and Energy (MOTIE, Korea) [Project Name: Jeju Industry-University Convergence District Project for Promoting Industry-Campus Cooperation / Project Number: P0029950].
Abstract: Recent advances in deep learning have significantly improved flood detection and segmentation from aerial and satellite imagery. However, conventional convolutional neural networks (CNNs) often struggle in complex flood scenarios involving reflections, occlusions, or indistinct boundaries due to limited contextual modeling. To address these challenges, we propose a hybrid flood segmentation framework that integrates a Vision Transformer (ViT) encoder with a U-Net decoder, enhanced by a novel Flood-Aware Refinement Block (FARB). The FARB module improves boundary delineation and suppresses noise by combining residual smoothing with spatial-channel attention mechanisms. We evaluate our model on a UAV-acquired flood imagery dataset, demonstrating that the proposed ViTUNet+FARB architecture outperforms existing CNN and Transformer-based models in terms of accuracy and mean Intersection over Union (mIoU). Detailed ablation studies further validate the contribution of each component, confirming that the FARB design significantly enhances segmentation quality. Owing to its better performance and computational efficiency, the proposed framework is well-suited for flood monitoring and disaster response applications, particularly in resource-constrained environments.
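For readers unfamiliar with the mIoU metric cited above, here is a minimal sketch of how it is commonly computed from flattened per-pixel label lists. The function name and the toy labels are hypothetical. Skipping classes that are absent from both prediction and ground truth is one common convention, and the abstract does not say which convention the authors use.

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Toy example: 0 = background, 1 = flood (hypothetical labels).
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
score = mean_iou(pred, target, num_classes=2)
```

In this toy case each class has an intersection of 2 pixels and a union of 4, so both per-class IoUs are 0.5 and the mean is 0.5.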
Abstract: Lung cancer remains a major global health challenge, with early diagnosis crucial for improved patient survival. Traditional diagnostic techniques, including manual histopathology and radiological assessments, are prone to errors and variability. Deep learning methods, particularly Vision Transformers (ViT), have shown promise for improving diagnostic accuracy by effectively extracting global features. However, ViT-based approaches face challenges related to computational complexity and limited generalizability. This research proposes the DualSet ViT-PSO-SVM framework, integrating a ViT with dual attention mechanisms, Particle Swarm Optimization (PSO), and Support Vector Machines (SVM), aiming for efficient and robust lung cancer classification across multiple medical image datasets. The study utilized three publicly available datasets: LIDC-IDRI, LUNA16, and TCIA, encompassing computed tomography (CT) scans and histopathological images. Data preprocessing included normalization, augmentation, and segmentation. Dual attention mechanisms enhanced ViT’s feature extraction capabilities. PSO optimized feature selection, and SVM performed classification. Model performance was evaluated on individual and combined datasets and benchmarked against CNN-based and standard ViT approaches. The DualSet ViT-PSO-SVM significantly outperformed existing methods, achieving accuracy rates of 97.85% (LIDC-IDRI), 98.32% (LUNA16), and 96.75% (TCIA). Cross-dataset evaluations demonstrated strong generalization capabilities and stability across similar imaging modalities. The proposed framework effectively bridges advanced deep learning techniques with clinical applicability, offering a robust diagnostic tool for lung cancer detection, reducing complexity, and improving diagnostic reliability and interpretability.
Funding: Funded by the National Key Research and Development Program of China (grant number 2023YFC2907600), the National Natural Science Foundation of China (grant number 52504132), and the Tiandi Science and Technology Co., Ltd. Science and Technology Innovation Venture Capital Special Project (grant number 2023-TD-ZD011-004).
Abstract: Foreign body classification on coal conveyor belts is a critical component of intelligent coal mining systems. Previous approaches have primarily utilized convolutional neural networks (CNNs) to effectively integrate spatial and semantic information. However, the performance of CNN-based methods remains limited in classification accuracy, primarily due to insufficient exploration of local image characteristics. Unlike CNNs, Vision Transformer (ViT) captures discriminative features by modeling relationships between local image patches. However, such methods typically require a large number of training samples to perform effectively. In the context of foreign body classification on coal conveyor belts, the limited availability of training samples hinders the full exploitation of ViT’s capabilities. To address this issue, we propose an efficient approach, termed Key Part-level Attention Vision Transformer (KPA-ViT), which incorporates key local information into the transformer architecture to enrich the training information. It comprises three main components: a key-point detection module, a key local mining module, and an attention module. To extract key local regions, a key-point detection strategy is first employed to identify the positions of key points. Subsequently, the key local mining module extracts the relevant local features based on these detected points. Finally, an attention module composed of self-attention and cross-attention blocks is introduced to integrate global and key part-level information, thereby enhancing the model’s ability to learn discriminative features. Compared to recent transformer-based frameworks (ViT, Swin-Transformer, and EfficientViT), the proposed KPA-ViT achieves performance improvements of 9.3%, 6.6%, and 2.8%, respectively, on the CUMT-BelT dataset, demonstrating its effectiveness.
Abstract: Detecting pavement cracks is critical for road safety and infrastructure management. Traditional methods, relying on manual inspection and basic image processing, are time-consuming and prone to errors. Recent deep-learning (DL) methods automate crack detection, but many still struggle with variable crack patterns and environmental conditions. This study aims to address these limitations by introducing the Masker Transformer, a novel hybrid deep learning model that integrates the precise localization capabilities of Mask Region-based Convolutional Neural Network (Mask R-CNN) with the global contextual awareness of Vision Transformer (ViT). The research focuses on leveraging the strengths of both architectures to enhance segmentation accuracy and adaptability across different pavement conditions. We evaluated the performance of the Masker Transformer against other state-of-the-art models such as U-Net, Transformer U-Net (TransUNet), U-Net Transformer (UNETr), Swin U-Net Transformer (Swin-UNETr), You Only Look Once version 8 (YoloV8), and Mask R-CNN using two benchmark datasets: Crack500 and DeepCrack. The findings reveal that the Masker Transformer significantly outperforms the existing models, achieving the highest Dice Similarity Coefficient (DSC), precision, recall, and F1-Score across both datasets. Specifically, the model attained a DSC of 80.04% on Crack500 and 91.37% on DeepCrack, demonstrating superior segmentation accuracy and reliability. The high precision and recall rates further substantiate its effectiveness in real-world applications, suggesting that the Masker Transformer can serve as a robust tool for automated pavement crack detection, potentially replacing more traditional methods.
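The Dice Similarity Coefficient, precision, and recall reported above can all be derived from the same pixel counts. The sketch below shows the standard binary-mask definitions. The function name and the toy masks are hypothetical, and this is not the authors' evaluation code.

```python
def dice_precision_recall(pred, target):
    """Binary-mask metrics from flattened 0/1 lists (1 = crack pixel)."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    # DSC = 2|A ∩ B| / (|A| + |B|) = 2TP / (2TP + FP + FN)
    denom = 2 * tp + fp + fn
    dice = 2 * tp / denom if denom else 1.0  # two empty masks agree perfectly
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return dice, precision, recall

# Toy masks (hypothetical): 2 true positives, 1 false positive, 1 false negative.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
dice, precision, recall = dice_precision_recall(pred, target)
```

For a single binary mask, the DSC is algebraically identical to the pixel-level F1-score, so the two differ in a report only when they are computed or averaged at different levels (for example per image versus per dataset).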
Funding: Funded by Woosong University Academic Research 2024.
Abstract: This study investigates the application of Learnable Memory Vision Transformers (LMViT) for detecting metal surface flaws, comparing their performance with traditional CNNs, specifically ResNet18 and ResNet50, as well as other transformer-based models including Token to Token ViT, ViT without memory, and Parallel ViT. Leveraging a widely used steel surface defect dataset, the research applies data augmentation and t-distributed stochastic neighbor embedding (t-SNE) to enhance feature extraction and understanding. These techniques mitigated overfitting, stabilized training, and improved generalization capabilities. The LMViT model achieved a test accuracy of 97.22%, significantly outperforming ResNet18 (88.89%) and ResNet50 (88.90%), as well as the Token to Token ViT (88.46%), ViT without memory (87.18%), and Parallel ViT (91.03%). Furthermore, LMViT exhibited superior training and validation performance, attaining a validation accuracy of 98.2% compared to 91.0% for ResNet18, 96.0% for ResNet50, and 89.12%, 87.51%, and 91.21% for Token to Token ViT, ViT without memory, and Parallel ViT, respectively. The findings highlight LMViT’s ability to capture long-range dependencies in images, an area where CNNs struggle due to their reliance on local receptive fields and hierarchical feature extraction. The additional transformer-based models also demonstrate improved performance in capturing complex features over CNNs, with LMViT excelling particularly at detecting subtle and complex defects, which is critical for maintaining product quality and operational efficiency in industrial applications. For instance, the LMViT model successfully identified fine scratches and minor surface irregularities that CNNs often misclassify. This study not only demonstrates LMViT’s potential for real-world defect detection but also underscores the promise of other transformer-based architectures such as Token to Token ViT, ViT without memory, and Parallel ViT in industrial scenarios where complex spatial relationships are key. Future research may focus on enhancing LMViT’s computational efficiency for deployment in real-time quality control systems.