Funding: This work was supported by the Heilongjiang Provincial Natural Science Foundation of China (Grant Number LH2022F030).
Abstract: 3D model classification has emerged as a significant research focus in computer vision. However, traditional convolutional neural networks (CNNs) often struggle to capture global dependencies across the height and width dimensions simultaneously, which limits their feature representation capability on complex visual tasks. To address this challenge, we propose a novel 3D model classification network named ViT-GE (Vision Transformer with Global and Efficient Attention), which integrates Global Grouped Coordinate Attention (GGCA) and Efficient Channel Attention (ECA) mechanisms. Specifically, the Vision Transformer (ViT) is employed to extract comprehensive global features from multi-view inputs using its self-attention mechanism, effectively capturing 3D shape characteristics. To further enhance spatial feature modeling, the GGCA module introduces a grouping strategy and global context interactions. Concurrently, the ECA module strengthens inter-channel information flow, enabling the network to adaptively emphasize key features and improve feature fusion. Finally, a voting mechanism is adopted to enhance classification accuracy, robustness, and stability. Experimental results on the ModelNet10 dataset show that our method achieves a classification accuracy of 93.50%, validating its effectiveness.
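The abstract does not say how the voting mechanism combines per-view predictions, so the following is only a minimal sketch of one common choice: hard majority voting over the views of a single object, with ties broken by summed class score. The function name `vote_over_views`, the score layout, and the tie-breaking rule are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter
from typing import Sequence


def vote_over_views(view_scores: Sequence[Sequence[float]]) -> int:
    """Return the class index chosen by majority vote across views.

    view_scores[v][c] is the score a classifier assigned to class c for
    rendered view v (hypothetical layout, not taken from the paper).
    Ties in vote count are broken by the summed score of the tied classes.
    """
    if not view_scores:
        raise ValueError("at least one view is required")

    num_classes = len(view_scores[0])

    # Each view casts one vote for its highest-scoring class.
    votes = [
        max(range(num_classes), key=lambda c: per_view[c])
        for per_view in view_scores
    ]

    counts = Counter(votes)
    top_count = max(counts.values())
    tied = [c for c, n in counts.items() if n == top_count]

    # Tie-break (assumption): total score each tied class received over all views.
    return max(tied, key=lambda c: sum(per_view[c] for per_view in view_scores))


# Three views of one object, three candidate classes.
example_scores = [
    [0.7, 0.2, 0.1],  # view 0 votes for class 0
    [0.1, 0.8, 0.1],  # view 1 votes for class 1
    [0.6, 0.3, 0.1],  # view 2 votes for class 0
]
predicted = vote_over_views(example_scores)  # class 0 wins with two of three votes
```

Averaging the per-view scores before taking the maximum (soft voting) would be an equally plausible reading of the abstract; the hard-vote variant is shown here only because it makes the effect of a single misleading view easy to see.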