Funding: This work was supported by the Natural Science Foundation of Guangdong Province (Grant Nos. 2022A1515140119 and 2023A1515011307), the National Key Laboratory of Air-based Information Perception and Fusion and the Aeronautic Science Foundation of China (Grant No. 20220001068001), the Dongguan Science and Technology Special Commissioner Project (Grant No. 20221800500362), and the National Natural Science Foundation of China (Grant Nos. 62376261, 61972090, and U21A20487).
Abstract: With more multi-modal data available for visual classification tasks, human action recognition has become an increasingly attractive topic. However, one of the main challenges is to effectively extract complementary features from different modalities for action recognition. In this work, a novel multimodal supervised learning framework based on convolutional neural networks (ConvNets) is proposed to facilitate extracting compensation features from different modalities for human action recognition. Built on an information aggregation mechanism and deep ConvNets, our recognition framework represents spatial-temporal information from the base modalities with a designed frame difference aggregation spatial-temporal module (FDA-STM), while the network bridges information from skeleton data through a multimodal supervised compensation block (SCB) to supervise the extraction of compensation features. We evaluate the proposed recognition framework on three human action datasets: NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD. The results demonstrate that our model with FDA-STM and SCB achieves state-of-the-art recognition performance on all three benchmark datasets.
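Below is a minimal, hedged sketch (in PyTorch) of one way a frame-difference aggregation module could be built; the layer sizes and aggregation scheme are illustrative assumptions and do not reproduce the authors' exact FDA-STM design.

```python
# Illustrative sketch only: aggregate frame-to-frame differences of an RGB
# clip into a single motion map and encode it with a small ConvNet.
import torch
import torch.nn as nn

class FrameDifferenceAggregation(nn.Module):
    def __init__(self, in_channels=3, out_channels=64):
        super().__init__()
        # 2D conv applied to the aggregated frame differences (assumed design).
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, clip):
        # clip: (batch, time, channels, height, width)
        diffs = clip[:, 1:] - clip[:, :-1]      # frame-to-frame differences
        agg = diffs.abs().mean(dim=1)           # aggregate motion over time
        return self.conv(agg)                   # spatial-temporal feature map

x = torch.randn(2, 8, 3, 112, 112)              # 8-frame RGB clip
features = FrameDifferenceAggregation()(x)      # (2, 64, 112, 112)
```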
Funding: Financial support from the National Natural Science Foundation of China (NSFC Grant No. 22076138) and the National Natural Science Foundation of China (NSFC Grant No. 62174119).
Abstract: Bladder cancer (BC) is a common malignancy and among the leading causes of cancer death worldwide. Analysis of BC cells is of great significance for clinical diagnosis and disease treatment. Current approaches rely mainly on imaging-based technology, which requires complex staining and sophisticated instrumentation. In this work, we develop a label-free method based on artificial intelligence (AI)-assisted impedance-based flow cytometry (IFC) to differentiate between various BC cells and epithelial cells at single-cell resolution. By applying multiple-frequency excitations, the electrical characteristics of cells, including membrane and nuclear opacities, are extracted, allowing distinctions to be made between epithelial cells, low-grade BC cells, and high-grade BC cells. Through the use of a constriction channel, the electro-mechanical properties associated with the active deformation behavior of cells are investigated, and it is demonstrated that BC cells have a greater capability of shape recovery, an observation that further increases differentiation accuracy. With the assistance of a convolutional neural network-based AI algorithm, IFC is able to effectively differentiate various BC and epithelial cells with accuracies of over 95%. In addition, different grades of BC cells are successfully differentiated in both spiked mixed samples and bladder tumor tissues.
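As a rough illustration of the classification stage, the following sketch shows a small 1D CNN over multi-frequency impedance traces; the number of frequency channels, trace length, and layer sizes are assumptions, since the paper's actual network and preprocessing are not described in the abstract.

```python
# Illustrative sketch only: classify single-cell impedance traces recorded at
# several excitation frequencies into epithelial / low-grade BC / high-grade BC.
import torch
import torch.nn as nn

class ImpedanceCNN(nn.Module):
    def __init__(self, n_freqs=4, n_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_freqs, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        # 3 assumed classes: epithelial, low-grade BC, high-grade BC
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):
        # x: (batch, n_freqs, trace_len) impedance signal per cell transit
        return self.classifier(self.features(x).squeeze(-1))

logits = ImpedanceCNN()(torch.randn(8, 4, 256))   # (8, 3) class scores
```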
Funding: Funding for this work was provided through Research Group No. KS-2024-376.
Abstract: Arabic Sign Language (ArSL) recognition plays a vital role in enhancing communication for the Deaf and Hard of Hearing (DHH) community. Researchers have proposed multiple methods for automated recognition of ArSL; however, these methods face multiple challenges, including high gesture variability, occlusions, limited signer diversity, and the scarcity of large annotated datasets. Existing methods, often relying solely on either skeletal data or video-based features, struggle with generalization and robustness, especially in dynamic and real-world conditions. This paper proposes a novel multimodal ensemble classification framework that integrates geometric features derived from 3D skeletal joint distances and angles with temporal features extracted from RGB videos using the Inflated 3D ConvNet (I3D). By fusing these complementary modalities at the feature level and applying a majority-voting ensemble of XGBoost, Random Forest, and Support Vector Machine classifiers, the framework robustly captures both the spatial configurations and the motion dynamics of sign gestures. Feature selection using the Pearson Correlation Coefficient further enhances efficiency by reducing redundancy. Extensive experiments on the ArabSign dataset, which includes RGB videos and corresponding skeletal data, demonstrate that the proposed approach significantly outperforms state-of-the-art methods, achieving an average F1-score of 97% with the majority-voting ensemble and improving recognition accuracy by more than 7% over the previous best methods. This work not only advances the technical state-of-the-art in ArSL recognition but also provides a scalable, real-time solution for practical deployment in educational, social, and assistive communication technologies. Although this study focuses on Arabic Sign Language, the proposed framework can be extended to other sign languages, opening the possibility of worldwide applicability in sign language recognition tasks.
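The fusion-and-ensemble stage can be sketched as follows; the feature dimensions, correlation threshold, and classifier hyperparameters are placeholders, and only the overall pipeline (feature-level fusion, Pearson-based redundancy removal, hard-voting ensemble) follows the abstract.

```python
# Illustrative sketch only: fuse skeletal and I3D features, drop highly
# correlated columns, and hard-vote across XGBoost, Random Forest, and SVM.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

def drop_redundant(X, threshold=0.95):
    """Remove one of each feature pair whose |Pearson r| exceeds threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = np.ones(X.shape[1], dtype=bool)
    for i in range(X.shape[1]):
        for j in range(i + 1, X.shape[1]):
            if keep[j] and corr[i, j] > threshold:
                keep[j] = False
    return X[:, keep], keep

# Placeholder data standing in for the real extracted features.
rng = np.random.default_rng(0)
skeletal_feats = rng.normal(size=(100, 60))   # joint distances/angles (assumed dims)
video_feats = rng.normal(size=(100, 400))     # pooled I3D features (assumed dims)
labels = rng.integers(0, 10, size=100)        # sign classes (placeholder)

X = np.hstack([skeletal_feats, video_feats])  # feature-level fusion
X, kept = drop_redundant(X)

ensemble = VotingClassifier(
    estimators=[
        ("xgb", XGBClassifier(n_estimators=300)),
        ("rf", RandomForestClassifier(n_estimators=300)),
        ("svm", SVC(kernel="rbf")),
    ],
    voting="hard",  # majority vote over the three classifiers
)
ensemble.fit(X, labels)
```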
Abstract: Hand gestures are a natural way for human-robot interaction. Vision-based dynamic hand gesture recognition has become a hot research topic due to its various applications. This paper presents a novel deep learning network for hand gesture recognition. The network integrates several well-proven modules to learn both short-term and long-term features from video inputs while avoiding intensive computation. To learn short-term features, each video input is segmented into a fixed number of frame groups. A frame is randomly selected from each group and represented as an RGB image as well as an optical flow snapshot. These two entities are fused and fed into a convolutional neural network (ConvNet) for feature extraction. The ConvNets for all groups share parameters. To learn long-term features, outputs from all ConvNets are fed into a long short-term memory (LSTM) network, by which a final classification result is predicted. The new model has been tested on two popular hand gesture datasets, namely the Jester dataset and the Nvidia dataset. Compared with other models, our model produced very competitive results. The robustness of the new model has also been demonstrated on an augmented dataset with enhanced diversity of hand gestures.
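A minimal sketch of the described pipeline is given below: one randomly sampled frame per group fused from RGB and optical-flow channels, a weight-shared ConvNet per group, and an LSTM over the group features; the channel counts, layer sizes, and class count are illustrative assumptions.

```python
# Illustrative sketch only: shared ConvNet over per-group RGB+flow frames,
# followed by an LSTM for long-term temporal modeling.
import torch
import torch.nn as nn

class ShortLongTermNet(nn.Module):
    def __init__(self, n_classes=27, feat_dim=256):
        super().__init__()
        # Shared ConvNet: RGB (3 ch) fused with optical flow (2 ch) = 5 channels.
        self.convnet = nn.Sequential(
            nn.Conv2d(5, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, 128, batch_first=True)
        self.classifier = nn.Linear(128, n_classes)

    def forward(self, x):
        # x: (batch, n_groups, 5, H, W) -- one fused RGB+flow frame per group
        b, g = x.shape[:2]
        feats = self.convnet(x.flatten(0, 1)).view(b, g, -1)  # shared weights
        out, _ = self.lstm(feats)
        return self.classifier(out[:, -1])     # classify from the last time step

logits = ShortLongTermNet()(torch.randn(2, 8, 5, 112, 112))  # (2, 27)
```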
Funding: This work is supported by the National Key R&D Program of China [grant number 2018YFB0505400], the National Natural Science Foundation of China (NSFC) [grant number 41901407], the LIESMARS Special Research Funding [grant number 2021], and the College Students' Innovative Entrepreneurial Training Plan Program [grant number S2020634016].
Abstract: Image-based relocalization in outdoor environments has attracted renewed interest because it is an important problem with many applications. PoseNet introduced a Convolutional Neural Network (CNN) for the first time to realize real-time camera pose estimation from a single image. To address the precision and robustness limitations of PoseNet and its improved variants in complex environments, this paper proposes and implements a new visual relocalization method based on deep convolutional neural networks (VNLSTM-PoseNet). Firstly, the method directly resizes the input image without cropping to increase the receptive field of the training image. Then, the image and the corresponding pose labels are fed into the improved Long Short-Term Memory based (LSTM-based) PoseNet network for training, and the network is optimized by the Nadam optimizer. Finally, the trained network is used for image localization to obtain the camera pose. Experimental results on outdoor public datasets show that our VNLSTM-PoseNet leads to drastic improvements in relocalization performance compared with existing state-of-the-art CNN-based methods.
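The following hedged sketch shows an LSTM-based pose-regression network in the spirit of VNLSTM-PoseNet: a convolutional encoder, an LSTM pass over the spatial feature map, and separate heads for camera position and orientation, optimized with Nadam; the backbone and layer sizes are placeholders rather than the paper's exact architecture.

```python
# Illustrative sketch only: CNN encoder -> LSTM over spatial positions ->
# translation (x, y, z) and rotation (quaternion) regression heads.
import torch
import torch.nn as nn

class LSTMPoseRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=4, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)
        self.fc_xyz = nn.Linear(128, 3)    # camera position
        self.fc_quat = nn.Linear(128, 4)   # camera orientation as a quaternion

    def forward(self, img):
        f = self.encoder(img)                              # (B, 64, H', W')
        b, c, h, w = f.shape
        seq = f.permute(0, 2, 3, 1).reshape(b, h * w, c)   # spatial positions as a sequence
        out, _ = self.lstm(seq)
        last = out[:, -1]
        return self.fc_xyz(last), self.fc_quat(last)

model = LSTMPoseRegressor()
optimizer = torch.optim.NAdam(model.parameters(), lr=1e-4)  # Nadam, as named in the abstract
xyz, quat = model(torch.randn(2, 3, 256, 256))
```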
Funding: Supported by the National Key R&D Program of China (Project No. 2021ZD0112902), the National Natural Science Foundation of China (Project No. 62220106003), and the Tsinghua-Tencent Joint Laboratory for Internet Innovation Technology.
Abstract: While originally designed for natural language processing tasks, the self-attention mechanism has recently taken various computer vision areas by storm. However, the 2D nature of images brings three challenges for applying self-attention in computer vision: (1) treating images as 1D sequences neglects their 2D structures; (2) the quadratic complexity is too expensive for high-resolution images; (3) it only captures spatial adaptability but ignores channel adaptability. In this paper, we propose a novel linear attention named large kernel attention (LKA) to enable self-adaptive and long-range correlations in self-attention while avoiding its shortcomings. Furthermore, we present a neural network based on LKA, namely the Visual Attention Network (VAN). While extremely simple, VAN achieves results comparable to similarly sized convolutional neural networks (CNNs) and vision transformers (ViTs) in various tasks, including image classification, object detection, semantic segmentation, panoptic segmentation, pose estimation, etc. For example, VAN-B6 achieves 87.8% accuracy on the ImageNet benchmark and sets new state-of-the-art performance (58.2% PQ) for panoptic segmentation. Besides, VAN-B2 surpasses Swin-T by 4% mIoU (50.1% vs. 46.1%) for semantic segmentation on the ADE20K benchmark and by 2.6% AP (48.8% vs. 46.2%) for object detection on the COCO dataset. It provides a novel method and a simple yet strong baseline for the community. The code is available at https://github.com/Visual-Attention-Network.
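For reference, the LKA operation can be sketched as a decomposition into a depthwise convolution, a depthwise dilated convolution, and a pointwise convolution whose output re-weights the input element-wise, following the structure used in the authors' public code; the channel count below is arbitrary.

```python
# Sketch of large kernel attention (LKA): the attention map produced by the
# decomposed large-kernel convolution modulates the input feature map.
import torch
import torch.nn as nn

class LKA(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.conv0 = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)   # depthwise 5x5
        self.conv_spatial = nn.Conv2d(dim, dim, 7, padding=9,
                                      groups=dim, dilation=3)        # depthwise dilated 7x7
        self.conv1 = nn.Conv2d(dim, dim, 1)                          # pointwise 1x1

    def forward(self, x):
        attn = self.conv1(self.conv_spatial(self.conv0(x)))
        return attn * x      # attention map re-weights the input

y = LKA(64)(torch.randn(1, 64, 56, 56))   # output has the same shape as the input
```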