Abstract: Current 3D-ConvNet-based action recognition algorithms commonly use global average pooling (GAP) to compress feature information, which causes information loss, information redundancy, and network overfitting. To address these problems and better preserve the high-level semantic information extracted by the convolutional layers, an action recognition algorithm based on global frequency domain pooling (GFDP) is proposed. First, viewed through the discrete cosine transform (DCT), GAP is a special case of feature decomposition in the frequency domain; introducing more frequency components therefore increases the distinctiveness among feature channels and reduces redundancy after information compression. Second, to better suppress overfitting, the batch normalization strategy used in the convolutional layers is extended to the fully connected layers of an action recognition model built on an ERB (efficient residual block)-Res3D backbone to optimize the data distribution. Finally, the method is validated on the UCF101 dataset. The results show that the model requires 3.5 GFLOPs of computation with 7.4 M parameters, and the final recognition accuracy improves by 3.9% over the ERB-Res3D model and by 17.4% over the original Res3D model, efficiently achieving more accurate action recognition.
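The following is a minimal sketch (assuming PyTorch) of the frequency-domain pooling idea: GAP equals, up to a constant scale, the lowest-frequency (DC) component of a 2D DCT, so pooling each channel group with a different DCT component keeps more spectral information. The specific frequency pairs chosen below are illustrative assumptions, not the paper's exact GFDP configuration.

```python
# Frequency-domain pooling sketch: GAP is the (0, 0) DCT component (up to scale);
# adding higher-frequency components increases channel-wise specificity.
import math
import torch
import torch.nn as nn


def dct_basis(h, w, u, v):
    """2D DCT-II basis function of frequency (u, v) on an h x w grid."""
    ys = torch.arange(h).float()
    xs = torch.arange(w).float()
    by = torch.cos(math.pi * (ys + 0.5) * u / h)
    bx = torch.cos(math.pi * (xs + 0.5) * v / w)
    return torch.outer(by, bx)  # (h, w); (u, v) = (0, 0) recovers the GAP case


class FrequencyDomainPooling(nn.Module):
    """Pool each group of channels with a different DCT frequency component."""

    def __init__(self, channels, h, w, freqs=((0, 0), (0, 1), (1, 0), (1, 1))):
        super().__init__()
        assert channels % len(freqs) == 0
        self.group = channels // len(freqs)
        basis = torch.stack([dct_basis(h, w, u, v) for u, v in freqs])  # (F, h, w)
        self.register_buffer("basis", basis)

    def forward(self, x):                        # x: (N, C, H, W)
        n, c, h, w = x.shape
        f = self.basis.shape[0]
        x = x.view(n, f, self.group, h, w)       # split channels into frequency groups
        weights = self.basis.view(1, f, 1, h, w)
        return (x * weights).sum(dim=(-2, -1)).view(n, c)  # (N, C) descriptor


pooled = FrequencyDomainPooling(64, 7, 7)(torch.randn(2, 64, 7, 7))
print(pooled.shape)  # torch.Size([2, 64])
```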
Funding: This work received financial support from the National Natural Science Foundation of China (NSFC Grant Nos. 22076138 and 62174119).
Abstract: Bladder cancer (BC) is a common malignancy and among the leading causes of cancer death worldwide. Analysis of BC cells is of great significance for clinical diagnosis and disease treatment. Current approaches rely mainly on imaging-based technology, which requires complex staining and sophisticated instrumentation. In this work, we develop a label-free method based on artificial intelligence (AI)-assisted impedance-based flow cytometry (IFC) to differentiate between various BC cells and epithelial cells at single-cell resolution. By applying multiple-frequency excitations, the electrical characteristics of cells, including membrane and nuclear opacities, are extracted, allowing distinctions to be made between epithelial cells, low-grade BC cells, and high-grade BC cells. Through the use of a constriction channel, the electro-mechanical properties associated with the active deformation behavior of cells are investigated, and it is demonstrated that BC cells have a greater capability of shape recovery, an observation that further increases differentiation accuracy. With the assistance of a convolutional neural network-based AI algorithm, IFC is able to effectively differentiate various BC and epithelial cells with accuracies of over 95%. In addition, different grades of BC cells are successfully differentiated in both spiked mixed samples and bladder tumor tissues.
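As a rough illustration of the opacity-style electrical features mentioned above, the sketch below (NumPy) computes ratio features from per-cell impedance magnitudes measured at several excitation frequencies. In the IFC literature, "opacity" is commonly a high-frequency to low-frequency impedance ratio; the specific frequencies and the membrane/nuclear assignments used here are assumptions, not the authors' exact definitions.

```python
# Opacity-style feature sketch for multi-frequency impedance measurements.
import numpy as np


def opacity_features(z_mag):
    """z_mag: (n_cells, 3) impedance magnitudes at [low, mid, high] frequencies."""
    low, mid, high = z_mag[:, 0], z_mag[:, 1], z_mag[:, 2]
    membrane_opacity = mid / low   # mid-frequency response probes the membrane (assumed)
    nuclear_opacity = high / low   # high-frequency response probes the interior/nucleus (assumed)
    size_proxy = low               # low-frequency magnitude scales with cell volume
    return np.stack([size_proxy, membrane_opacity, nuclear_opacity], axis=1)


# Example: three cells measured at three frequencies (hypothetical values).
z = np.array([[1.00, 0.62, 0.55],
              [1.30, 0.90, 0.80],
              [0.95, 0.70, 0.50]])
print(opacity_features(z))
```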
Funding: This work is supported by the National Key R&D Program of China [grant number 2018YFB0505400], the National Natural Science Foundation of China (NSFC) [grant number 41901407], the LIESMARS Special Research Funding [grant number 2021], and the College Students' Innovative Entrepreneurial Training Plan Program [grant number S2020634016].
Abstract: Image-based relocalization in outdoor environments has attracted renewed interest because it is an important problem with many applications. PoseNet introduced a convolutional neural network (CNN) for the first time to solve for the camera pose in real time from a single image. To address the precision and robustness problems of PoseNet and its improved variants in complex environments, this paper proposes and implements a new visual relocalization method based on deep convolutional neural networks (VNLSTM-PoseNet). First, the method directly resizes the input image without cropping to increase the receptive field of the training image. Then, the image and the corresponding pose labels are fed into the improved Long Short-Term Memory based (LSTM-based) PoseNet network for training, and the network is optimized with the Nadam optimizer. Finally, the trained network is used for image localization to obtain the camera pose. Experimental results on outdoor public datasets show that our VNLSTM-PoseNet leads to drastic improvements in relocalization performance compared to existing state-of-the-art CNN-based methods.
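The following is a minimal sketch (assuming PyTorch and torchvision) of an LSTM-based, PoseNet-style pose regressor: a CNN backbone produces a feature map, an LSTM summarizes it, and two heads regress the 3D position and orientation (quaternion). The backbone choice, feature routing, and hyperparameters are assumptions rather than the paper's exact VNLSTM-PoseNet design; only the Nadam optimizer is taken from the abstract.

```python
# LSTM-based pose regression sketch in the PoseNet family.
import torch
import torch.nn as nn
from torchvision import models


class LSTMPoseRegressor(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        backbone = models.resnet18(weights=None)
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # (N, 512, h, w)
        self.lstm = nn.LSTM(input_size=512, hidden_size=hidden, batch_first=True)
        self.fc_xyz = nn.Linear(hidden, 3)    # camera position
        self.fc_quat = nn.Linear(hidden, 4)   # camera orientation (quaternion)

    def forward(self, img):
        f = self.features(img)                # (N, 512, h, w)
        seq = f.flatten(2).permute(0, 2, 1)   # (N, h*w, 512) spatial sequence for the LSTM
        _, (h_n, _) = self.lstm(seq)
        h = h_n[-1]
        quat = nn.functional.normalize(self.fc_quat(h), dim=1)
        return self.fc_xyz(h), quat


model = LSTMPoseRegressor()
optimizer = torch.optim.NAdam(model.parameters(), lr=1e-4)  # Nadam, as in the abstract
xyz, quat = model(torch.randn(2, 3, 224, 224))
print(xyz.shape, quat.shape)  # torch.Size([2, 3]) torch.Size([2, 4])
```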
Funding: This work was supported by the Natural Science Foundation of Guangdong Province (Grant Nos. 2022A1515140119 and 2023A1515011307), the National Key Laboratory of Air-based Information Perception and Fusion and the Aeronautic Science Foundation of China (Grant No. 20220001068001), the Dongguan Science and Technology Special Commissioner Project (Grant No. 20221800500362), and the National Natural Science Foundation of China (Grant Nos. 62376261, 61972090, and U21A20487).
Abstract: With more multi-modal data available for visual classification tasks, human action recognition has become an increasingly attractive topic. However, one of the main challenges is to effectively extract complementary features from different modalities for action recognition. In this work, a novel multimodal supervised learning framework based on convolutional neural networks (ConvNets) is proposed to facilitate extracting compensation features from different modalities for human action recognition. Built on an information aggregation mechanism and deep ConvNets, our recognition framework represents spatial-temporal information from the base modalities with a designed frame difference aggregation spatial-temporal module (FDA-STM), and bridges information from skeleton data through a multimodal supervised compensation block (SCB) to supervise the extraction of compensation features. We evaluate the proposed recognition framework on three human action datasets: NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD. The results demonstrate that our model with FDA-STM and SCB achieves state-of-the-art recognition performance on all three benchmark datasets.
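The following is a minimal sketch (assuming PyTorch) of frame-difference aggregation in the spirit of the FDA-STM described above: inter-frame differences highlight motion, and a small 3D-conv block aggregates them with the appearance stream. The aggregation scheme shown here (channel concatenation) is an assumption for illustration, not the paper's exact module.

```python
# Frame-difference aggregation sketch for spatial-temporal feature extraction.
import torch
import torch.nn as nn


class FrameDifferenceAggregation(nn.Module):
    def __init__(self, in_ch=3, out_ch=32):
        super().__init__()
        self.conv = nn.Conv3d(2 * in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, clip):                       # clip: (N, C, T, H, W)
        diff = clip[:, :, 1:] - clip[:, :, :-1]    # frame differences (N, C, T-1, H, W)
        diff = torch.cat([diff, diff[:, :, -1:]], dim=2)   # pad back to length T
        fused = torch.cat([clip, diff], dim=1)     # appearance + motion channels
        return torch.relu(self.conv(fused))        # spatial-temporal aggregation


feat = FrameDifferenceAggregation()(torch.randn(2, 3, 16, 112, 112))
print(feat.shape)  # torch.Size([2, 32, 16, 112, 112])
```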
Funding: This work was funded through Research Group No. KS-2024-376.
Abstract: Arabic Sign Language (ArSL) recognition plays a vital role in enhancing communication for the Deaf and Hard of Hearing (DHH) community. Researchers have proposed multiple methods for automated recognition of ArSL; however, these methods face multiple challenges, including high gesture variability, occlusions, limited signer diversity, and the scarcity of large annotated datasets. Existing methods, often relying solely on either skeletal data or video-based features, struggle with generalization and robustness, especially in dynamic and real-world conditions. This paper proposes a novel multimodal ensemble classification framework that integrates geometric features derived from 3D skeletal joint distances and angles with temporal features extracted from RGB videos using the Inflated 3D ConvNet (I3D). By fusing these complementary modalities at the feature level and applying a majority-voting ensemble of XGBoost, Random Forest, and Support Vector Machine classifiers, the framework robustly captures both the spatial configurations and the motion dynamics of sign gestures. Feature selection using the Pearson Correlation Coefficient further enhances efficiency by reducing redundancy. Extensive experiments on the ArabSign dataset, which includes RGB videos and corresponding skeletal data, demonstrate that the proposed approach significantly outperforms state-of-the-art methods, achieving an average F1-score of 97% and improving recognition accuracy by more than 7% over previous best methods. This work not only advances the technical state of the art in ArSL recognition but also provides a scalable, real-time solution for practical deployment in educational, social, and assistive communication technologies. Although this study focuses on Arabic Sign Language, the proposed framework can be extended to other sign languages, creating possibilities for worldwide applicability in sign language recognition tasks.
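A minimal sketch (assuming scikit-learn and xgboost) of the fusion-and-ensemble stage described above follows: skeletal geometric features and I3D video features are concatenated, redundant features are dropped using Pearson correlation, and a hard (majority) vote combines XGBoost, Random Forest, and SVM. The threshold, hyperparameters, and placeholder feature dimensions are illustrative assumptions.

```python
# Feature-level fusion, Pearson-correlation redundancy removal, and majority-voting ensemble.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier


def drop_redundant(features, threshold=0.95):
    """Drop features whose |Pearson correlation| with an already-kept feature exceeds the threshold."""
    corr = np.abs(np.corrcoef(features, rowvar=False))
    keep = []
    for j in range(features.shape[1]):
        if all(corr[j, k] < threshold for k in keep):
            keep.append(j)
    return features[:, keep], keep


rng = np.random.default_rng(0)
skeletal = rng.normal(size=(200, 60))     # joint-distance and joint-angle features (placeholder)
i3d = rng.normal(size=(200, 1024))        # I3D clip embeddings (placeholder)
labels = rng.integers(0, 10, size=200)    # sign classes (placeholder)

fused, kept = drop_redundant(np.hstack([skeletal, i3d]))   # feature-level fusion + selection
ensemble = VotingClassifier(
    estimators=[("xgb", XGBClassifier(n_estimators=200)),
                ("rf", RandomForestClassifier(n_estimators=200)),
                ("svm", SVC(kernel="rbf"))],
    voting="hard",                                          # majority vote across classifiers
)
ensemble.fit(fused, labels)
print(ensemble.predict(fused[:5]))
```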
Abstract: Hand gestures are a natural way for human-robot interaction. Vision-based dynamic hand gesture recognition has become a hot research topic due to its various applications. This paper presents a novel deep learning network for hand gesture recognition. The network integrates several well-proven modules to learn both short-term and long-term features from video inputs while avoiding intensive computation. To learn short-term features, each video input is segmented into a fixed number of frame groups. A frame is randomly selected from each group and represented as an RGB image as well as an optical flow snapshot. These two entities are fused and fed into a convolutional neural network (ConvNet) for feature extraction. The ConvNets for all groups share parameters. To learn long-term features, the outputs from all ConvNets are fed into a long short-term memory (LSTM) network, which predicts the final classification result. The new model has been tested on two popular hand gesture datasets, the Jester dataset and the Nvidia dataset. Compared with other models, our model produced very competitive results. The robustness of the new model has also been demonstrated on an augmented dataset with enhanced diversity of hand gestures.
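The following is a minimal sketch (assuming PyTorch) of the group-sampling pipeline described above: one frame per group is encoded by a shared 2D ConvNet from fused RGB and optical-flow channels, and an LSTM aggregates the per-group features for classification. The fusion-by-concatenation, the tiny backbone, and the default class count are assumptions for illustration.

```python
# Shared per-group ConvNet + LSTM sketch for dynamic hand gesture recognition.
import torch
import torch.nn as nn


class GestureNet(nn.Module):
    def __init__(self, num_classes=27, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(                                   # shared across all frame groups
            nn.Conv2d(5, 32, 3, stride=2, padding=1), nn.ReLU(),    # 3 RGB + 2 optical-flow channels
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.lstm = nn.LSTM(64, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):                        # x: (N, G, 5, H, W), one fused frame per group
        n, g = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1)).view(n, g, -1)   # short-term features per group
        _, (h_n, _) = self.lstm(feats)                     # long-term temporal modeling
        return self.fc(h_n[-1])


logits = GestureNet()(torch.randn(2, 8, 5, 112, 112))
print(logits.shape)  # torch.Size([2, 27])
```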
Abstract: Loop closure detection is an important component of simultaneous localization and mapping (SLAM): it effectively reduces the accumulated error in a SLAM system, and if tracking is lost during localization and mapping, loop closure detection can also be used for relocalization. Compared with traditional hand-crafted features, image features learned by neural networks offer better environmental invariance and semantic recognition ability. Considering that landmark-based convolutional features can overcome the sensitivity of whole-image features to viewpoint changes, this paper proposes a new loop closure detection algorithm. It first uses the convolutional layers of a convolutional neural network to directly identify regions of interest in the image and generate landmarks, then extracts convolutional features for each identified landmark to form the final image representation for loop closure detection. To verify the effectiveness of the algorithm, comparative experiments are conducted on typical datasets; the results show that the proposed algorithm achieves excellent performance and remains highly robust even under extreme viewpoint and appearance changes.
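The following is a minimal sketch (assuming PyTorch and torchvision) of landmark-based convolutional features for loop closure: salient regions of a conv feature map serve as landmarks, each pooled into a descriptor, and two images are scored by mutual matching of their landmark descriptors. The landmark selection rule here (top-k activation peaks) and the similarity score are assumptions, not the paper's exact region-proposal and matching scheme.

```python
# Landmark descriptor and matching sketch for loop-closure candidate scoring.
import torch
import torch.nn.functional as F
from torchvision import models

backbone = models.resnet18(weights=None).eval()
conv = torch.nn.Sequential(*list(backbone.children())[:-2])   # conv feature map (N, 512, h, w)


def landmark_descriptors(img, k=10):
    with torch.no_grad():
        fmap = conv(img)[0]                        # (512, h, w)
    saliency = fmap.sum(0).flatten()               # activation strength per spatial location
    idx = saliency.topk(k).indices                 # top-k locations act as landmarks
    desc = fmap.flatten(1)[:, idx].t()             # (k, 512): one descriptor per landmark
    return F.normalize(desc, dim=1)


def similarity(img_a, img_b):
    da, db = landmark_descriptors(img_a), landmark_descriptors(img_b)
    sim = da @ db.t()                              # cosine similarities between landmarks
    return 0.5 * (sim.max(1).values.mean() + sim.max(0).values.mean())


score = similarity(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224))
print(float(score))  # higher scores suggest a loop-closure candidate
```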
Funding: This work was supported by the National Key R&D Program of China (Project No. 2021ZD0112902), the National Natural Science Foundation of China (Project No. 62220106003), and the Tsinghua-Tencent Joint Laboratory for Internet Innovation Technology.
Abstract: Although originally designed for natural language processing tasks, the self-attention mechanism has recently taken various computer vision areas by storm. However, the 2D nature of images brings three challenges for applying self-attention in computer vision: (1) treating images as 1D sequences neglects their 2D structure; (2) the quadratic complexity is too expensive for high-resolution images; (3) it captures only spatial adaptability and ignores channel adaptability. In this paper, we propose a novel linear attention named large kernel attention (LKA) to enable self-adaptive and long-range correlations in self-attention while avoiding its shortcomings. Furthermore, we present a neural network based on LKA, namely the Visual Attention Network (VAN). While extremely simple, VAN achieves results comparable to similarly sized convolutional neural networks (CNNs) and vision transformers (ViTs) in various tasks, including image classification, object detection, semantic segmentation, panoptic segmentation, and pose estimation. For example, VAN-B6 achieves 87.8% accuracy on the ImageNet benchmark and sets new state-of-the-art performance (58.2% PQ) for panoptic segmentation. Besides, VAN-B2 surpasses Swin-T by 4% mIoU (50.1% vs. 46.1%) for semantic segmentation on the ADE20K benchmark and by 2.6% AP (48.8% vs. 46.2%) for object detection on the COCO dataset. It provides a novel method and a simple yet strong baseline for the community. The code is available at https://github.com/Visual-Attention-Network.
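The following is a minimal sketch (assuming PyTorch) of large kernel attention (LKA): a large-kernel convolution is decomposed into a depth-wise convolution, a depth-wise dilated convolution, and a 1x1 convolution, and the result is used as an attention map that reweights the input. The kernel sizes follow the commonly cited VAN design (5x5 plus 7x7 with dilation 3, approximating a 21x21 kernel); check the released code above for the exact configuration.

```python
# Large kernel attention (LKA) sketch: decomposed large-kernel conv used as attention.
import torch
import torch.nn as nn


class LKA(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dw_conv = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)                  # local depth-wise conv
        self.dw_dilated = nn.Conv2d(dim, dim, 7, padding=9, groups=dim, dilation=3)   # long-range depth-wise conv
        self.pointwise = nn.Conv2d(dim, dim, 1)                                       # channel mixing

    def forward(self, x):
        attn = self.pointwise(self.dw_dilated(self.dw_conv(x)))
        return attn * x   # attention reweights the input: spatial and channel adaptability


out = LKA(64)(torch.randn(2, 64, 56, 56))
print(out.shape)  # torch.Size([2, 64, 56, 56])
```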