Representation learning from unlabeled skeleton data is a challenging task. Prior unsupervised learning algorithms mainly rely on the modeling ability of recurrent neural networks to extract action representations. However, the structural information of the skeleton data, which also plays a critical role in action recognition, is rarely explored in existing unsupervised methods. To address this limitation, we propose a novel two-stream autoencoder network that combines the topological information of skeleton data with its temporal information. Specifically, we encode the graph structure with a graph convolutional network (GCN) and integrate the extracted GCN-based representations into the gated recurrent unit stream. We then design a transfer module to merge the representations of the two streams adaptively. According to the characteristics of the two-stream autoencoder, a unified loss function composed of multiple tasks is proposed to update the learnable parameters of our model. Comprehensive experiments on the NW-UCLA, UWA3D, and NTU-RGBD 60 datasets demonstrate that the proposed method achieves excellent performance among unsupervised skeleton-based methods and even performs comparably to, or better than, numerous supervised skeleton-based methods.
Graph convolutional networks (GCNs), an essential tool in human action recognition, have achieved excellent performance in previous studies. However, most current GCN-based methods for skeleton-based action recognition use a shared topology, which cannot flexibly adapt to the diverse correlations between joints under different motion features. In addition, the video-shooting angle or the occlusion of body parts may introduce errors when human pose coordinates are extracted with estimation algorithms. In this work, we propose a novel graph convolutional learning framework, called PCCTR-GCN, which integrates pose correction and channel topology refinement for skeleton-based human action recognition. Firstly, a pose correction module (PCM) is introduced, which corrects the pose coordinates fed to the network to reduce the error in pose feature extraction. Secondly, channel topology refinement graph convolution (CTR-GC) is employed, which can dynamically learn the topology features and aggregate joint features in different channel dimensions so as to enhance the performance of graph convolution networks in feature extraction. Finally, considering that the joint stream and bone stream of skeleton data and their dynamic information are also important for distinguishing different actions, we employ a multi-stream data fusion approach to improve the network’s recognition performance. We evaluate the model using top-1 and top-5 classification accuracy. On the benchmark datasets iMiGUE and Kinetics, the top-1 classification accuracy reaches 55.08% and 36.5%, respectively, while the top-5 classification accuracy reaches 89.98% and 59.2%, respectively. On the NTU RGB+D dataset, for the two benchmark settings (X-Sub and X-View), the classification accuracy reaches 89.7% and 95.4%, respectively.
Skeleton-based human action recognition focuses on identifying actions from dynamic skeletal data, which contains both temporal and spatial characteristics. However, this approach faces challenges such as viewpoint variations, low recognition accuracy, and high model complexity. Skeleton-based graph convolutional networks (GCNs) generally outperform other deep learning methods in recognition accuracy. However, they often underutilize temporal features and suffer from high model complexity, leading to increased training and validation costs, especially on large-scale datasets. This paper proposes a dual-channel graph convolutional network with multi-order information fusion (DM-AGCN) for human action recognition. The network integrates high frame rate skeleton channels to capture action dynamics and low frame rate channels to preserve static semantic information, effectively balancing temporal and spatial features. This dual-channel architecture allows for separate processing of temporal and spatial information. Additionally, DM-AGCN extracts joint keypoints and bidirectional bone vectors from skeleton sequences, and employs a three-stream graph convolutional structure to extract features that describe human movement. Experimental results on the NTU-RGB+D dataset demonstrate that DM-AGCN achieves an accuracy of 89.4% on X-Sub and 95.8% on X-View, while reducing model complexity to 3.68 GFLOPs (giga floating-point operations). On the Kinetics-Skeleton dataset, the model achieves a Top-1 accuracy of 37.2% and a Top-5 accuracy of 60.3%, further validating its effectiveness across different benchmarks.
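As a rough illustration of the inputs this abstract describes, the following Python sketch derives bidirectional bone vectors from joint coordinates and builds a high and a low frame rate view of one sequence. The five-joint toy skeleton, the `PARENTS` table, and the subsampling stride are hypothetical choices made for this example; they are not details taken from DM-AGCN.

```python
# Hypothetical 5-joint toy skeleton (an assumption for illustration only):
# PARENTS[j] is the parent of joint j, and the root (joint 0) is its own parent.
PARENTS = [0, 0, 1, 1, 3]


def bone_vectors(joints, parents=PARENTS):
    """Return (outward, inward) bone vectors for one frame.

    outward[j] is joint j minus its parent; inward[j] is the same bone
    pointing the other way, so each bone is described in both directions.
    """
    outward = []
    inward = []
    for j, p in enumerate(parents):
        out = tuple(a - b for a, b in zip(joints[j], joints[p]))
        outward.append(out)
        inward.append(tuple(-c for c in out))
    return outward, inward


def subsample(frames, stride):
    """Keep every `stride`-th frame to obtain a lower frame rate view."""
    return frames[::stride]


# One frame of 3D joint coordinates for the toy skeleton.
frame = [(0, 0, 0), (0, 1, 0), (1, 1, 0), (-1, 1, 0), (-2, 1, 0)]
outward, inward = bone_vectors(frame)

# A sequence of 8 frame indices: the full sequence plays the role of the
# high frame rate channel, the strided one the low frame rate channel.
high_rate = list(range(8))
low_rate = subsample(high_rate, 4)
```

The root joint yields a zero-length bone in both directions, which keeps the bone arrays the same shape as the joint array; whether DM-AGCN handles the root this way is not stated in the abstract.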
In recent years, skeleton-based action recognition has made great achievements in computer vision. A graph convolutional network (GCN) is effective for action recognition, modelling the human skeleton as a spatio-temporal graph. Most GCNs define the graph topology by the physical relations of the human joints. However, this predefined graph ignores the spatial relationship between non-adjacent joint pairs in special actions and the behavior dependence between joint pairs, resulting in a low recognition rate for specific actions with implicit correlation between joint pairs. In addition, existing methods ignore the trend correlation between adjacent frames within an action and context clues, leading to erroneous recognition of actions with similar poses. Therefore, this study proposes a learnable GCN based on behavior dependence, which considers implicit joint correlation by constructing a dynamic learnable graph that extracts the specific behavior dependence of joint pairs. By using the weight relationship between the joint pairs, an adaptive model is constructed. The study also designs a self-attention module to obtain the inter-frame topological relationship of the joint pairs for exploring the context of actions. Combining the shared topology and the multi-head self-attention map, the module obtains the context-based clue topology to update the dynamic graph convolution, achieving accurate recognition of different actions with similar poses. Detailed experiments on public datasets demonstrate that the proposed method achieves better results and realizes higher quality representation of actions under various evaluation protocols compared to state-of-the-art methods.
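For readers unfamiliar with the predefined topology that this abstract contrasts with its learnable graph, the sketch below builds a fixed adjacency matrix from physical bone connections and applies one graph convolution step. It uses the widely known symmetric normalization of the adjacency matrix with self-loops; this is an assumption chosen for illustration, not the method proposed in the paper, and the three-joint chain and all function names are hypothetical.

```python
import math


def normalized_adjacency(num_joints, edges):
    """Build D^(-1/2) (A + I) D^(-1/2) from a list of undirected bone edges."""
    # Start from the identity so every joint also aggregates its own feature.
    a = [[1.0 if i == j else 0.0 for j in range(num_joints)]
         for i in range(num_joints)]
    for i, j in edges:
        a[i][j] = 1.0
        a[j][i] = 1.0
    degree = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(degree[i] * degree[j])
             for j in range(num_joints)]
            for i in range(num_joints)]


def graph_conv(a_hat, features, weight):
    """One propagation step: (a_hat @ features) @ weight, without activation."""
    n = len(features)
    c_in = len(weight)
    c_out = len(weight[0])
    # Aggregate features from physically connected joints.
    aggregated = [[sum(a_hat[i][k] * features[k][c] for k in range(n))
                   for c in range(c_in)]
                  for i in range(n)]
    # Mix channels with the (normally learned) weight matrix.
    return [[sum(aggregated[i][c] * weight[c][o] for c in range(c_in))
             for o in range(c_out)]
            for i in range(n)]


# Toy chain of three joints: 0 - 1 - 2. Joints 0 and 2 are not adjacent,
# so a fixed physical graph passes no information between them in one step.
a_hat = normalized_adjacency(3, [(0, 1), (1, 2)])
features = [[1.0], [2.0], [3.0]]
output = graph_conv(a_hat, features, [[1.0]])
```

In this toy graph the entry `a_hat[0][2]` is zero, which is the limitation the abstract points to: non-adjacent joint pairs are invisible to a one-step convolution over a purely physical topology.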
Skeleton-based action recognition has recently made significant progress. However, data imbalance is still a great challenge in real-world scenarios. The performance of current action recognition algorithms declines sharply when training data suffers from heavy class imbalance. The imbalanced data actually degrades the representations learned by these methods and becomes the bottleneck for action recognition. How to learn unbiased representations from imbalanced action data is the key to long-tailed action recognition. In this paper, we propose a novel balanced representation learning method to address the long-tailed problem in action recognition. Firstly, a spatial-temporal action exploration strategy is presented to expand the sample space effectively, generating more valuable samples in a rebalanced manner. Secondly, we design a detached action-aware learning schedule to further mitigate the bias in the representation space. The schedule detaches the representation learning of tail classes from training and proposes an action-aware loss to impose more effective constraints. Additionally, a skip-type representation is proposed to provide complementary structural information. The proposed method is validated on four skeleton datasets: NTU RGB+D 60, NTU RGB+D 120, NW-UCLA, and Kinetics. It not only achieves consistently large improvement compared to the state-of-the-art (SOTA) methods, but also demonstrates a superior generalization capacity through extensive experiments. Our code is available at https://github.com/firework8/BRL.
Three-dimensional skeleton-based action recognition (3D SAR) has gained important attention within the computer vision community, owing to the inherent advantages offered by skeleton data. As a result, a plethora of impressive works, including those based on conventional handcrafted features and learned feature extraction methods, have been conducted over the years. However, prior surveys on action recognition have primarily focused on video or red-green-blue (RGB) data-dominated approaches, with limited coverage of reviews related to skeleton data. Furthermore, despite the extensive application of deep learning methods in this field, there has been a notable absence of research that provides an introductory or comprehensive review from the perspective of deep learning architectures. To address these limitations, this survey first underscores the importance of action recognition and emphasizes the significance of 3-dimensional (3D) skeleton data as a valuable modality. Subsequently, we provide a comprehensive introduction to mainstream action recognition techniques based on 4 fundamental deep architectures, i.e., recurrent neural networks, convolutional neural networks, graph convolutional networks, and Transformers. All methods with the corresponding architectures are then presented in a data-driven manner with detailed discussion. Finally, we offer insights into the current largest 3D skeleton dataset, NTU-RGB+D, and its new edition, NTU-RGB+D 120, along with an overview of several top-performing algorithms on these datasets. To the best of our knowledge, this research represents the first comprehensive discussion of deep learning-based action recognition using 3D skeleton data.
The task of student action recognition in the classroom is to precisely capture and analyze the actions of students in classroom videos, providing a foundation for realizing intelligent and accurate teaching. However, the complex nature of the classroom environment makes student action recognition more difficult. In this article, a lightweight multi-modal fusion action recognition approach is put forward for real classroom scenarios in which students are prone to occlusion and computing resources are restricted. The proposed method enhances the accuracy of student action recognition while reducing the number of model parameters and the amount of computation, thereby achieving more efficient and accurate recognition. In the feature extraction stage, the method fuses the keypoint heatmap with the RGB (Red-Green-Blue color model) image. In order to fully utilize the unique information of different modalities for feature complementarity, a Feature Fusion Module (FFE) is introduced. The FFE encodes and fuses the unique features of the two modalities during the feature extraction process. This fusion strategy not only achieves fusion and complementarity between modalities, but also improves the overall model performance. Furthermore, to reduce the computational load and parameter scale of the model, we use keypoint information to crop RGB images. At the same time, the first three networks of the lightweight feature extraction network X3D are used to extract dual-branch features. These methods significantly reduce the computational load and parameter scale. The number of parameters of the model is 1.40 million, and the computation amount is 5.04 billion floating-point operations (GFLOPs), achieving an efficient lightweight design. On the Student Classroom Action Dataset (SCAD), the accuracy of the model is 88.36%. On NTU RGB+D 60 (Nanyang Technological University Red-Green-Blue-Depth dataset with 60 categories), the accuracies on X-Sub (the people in the training set are different from those in the test set) and X-View (the perspectives of the training set and the test set are different) are 95.76% and 98.82%, respectively. On the NTU RGB+D 120 dataset (Nanyang Technological University Red-Green-Blue-Depth dataset with 120 categories), the accuracies on X-Sub and X-Set (the camera setups of the training set and the test set are different) are 91.97% and 93.45%, respectively. The model achieves a balance among accuracy, computation amount, and number of parameters.
Action recognition, a fundamental task in the field of video understanding, has been extensively researched and applied. In contrast to an image, a video introduces an extra temporal dimension. However, many existing action recognition networks either perform simple temporal fusion through averaging or rely on pre-trained models from image recognition, resulting in limited temporal information extraction capabilities. This work proposes a highly efficient temporal decoding module that can be seamlessly integrated into any action recognition backbone network to enhance the focus on temporal relationships between video frames. Firstly, the decoder initializes a set of learnable queries, termed video-level action category prediction queries. Then, after self-attention learning, the queries are combined with the video frame features extracted by the backbone network to extract video context information. Finally, these prediction queries with rich temporal features are used for category prediction. Experimental results on the HMDB51, MSRDailyAct3D, Diving48, and Breakfast datasets show that adding the proposed temporal decoder to TokShift-Transformer and VideoMAE, used as encoders, significantly improves Top-1 accuracy over the original models. The temporal decoder yields an average performance increase exceeding 11% for TokShift-Transformer and nearly 5% for VideoMAE across the four datasets. Furthermore, the work explores the combination of the decoder with various action recognition networks, including Timesformer, as encoders. This results in an average accuracy improvement of more than 3.5% on the HMDB51 dataset. The code is available at https://github.com/huangturbo/TempDecoder.
Reliable human action recognition (HAR) in video sequences is critical for a wide range of applications, such as security surveillance, healthcare monitoring, and human-computer interaction. Several automated systems have been designed for this purpose; however, existing methods, such as 2-stream networks or 3D convolutional neural networks (CNNs), often struggle to effectively integrate spatial and temporal information from input samples, which limits their accuracy in discriminating numerous human actions. Therefore, this study introduces a novel deep learning framework called ARNet, designed for robust HAR. ARNet consists of two main modules, namely, a refined InceptionResNet-V2-based CNN and a Bi-LSTM (Long Short-Term Memory) network. The refined InceptionResNet-V2 employs a parametric rectified linear unit (PReLU) activation strategy within convolutional layers to enhance spatial feature extraction from individual video frames. PReLU improves the spatial information capturing ability of the approach because it uses learnable parameters to adaptively control the slope of the negative part of the activation function, allowing richer gradient flow during backpropagation and resulting in robust information capturing and stable model training. These spatial features, which hold essential pixel characteristics, are then processed by the Bi-LSTM module for temporal analysis, which helps ARNet understand the dynamic behavior of actions over time. ARNet integrates three additional dense layers after the Bi-LSTM module to ensure a comprehensive computation of both spatial and temporal patterns and to further boost the feature representation. The model is validated experimentally on 3 benchmark datasets, HMDB51, KTH, and UCF Sports, and reports accuracies of 93.82%, 99%, and 99.16%, respectively. The precision results on the HMDB51, KTH, and UCF Sports datasets are 97.41%, 99.54%, and 99.01%; the recall values are 98.87%, 98.60%, and 99.08%; and the F1-scores are 98.13%, 99.07%, and 99.04%, respectively. These results highlight the robustness of the ARNet approach and its potential as a versatile tool for accurate HAR across various real-world applications.
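The precision, recall, and F1 figures reported for ARNet can be cross-checked with a few lines of Python. This is a minimal sketch that assumes each F1-score is the harmonic mean of the dataset-level precision and recall quoted in the abstract; if the authors averaged per-class F1 values instead, small differences would be expected. The function and variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %) pairs as quoted in the abstract.
reported = {
    "HMDB51": (97.41, 98.87),
    "KTH": (99.54, 98.60),
    "UCF Sports": (99.01, 99.08),
}

# Recompute F1 from the quoted pairs, rounded to two decimals.
recomputed = {name: round(f1_score(p, r), 2) for name, (p, r) in reported.items()}
print(recomputed)
```

Under that assumption the recomputed values are 98.13, 99.07, and 99.04, which agree with the F1-scores stated in the abstract.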
Real-time surveillance relies on recognizing the variety of actions performed by humans. Human Action Recognition (HAR) is a technique that recognizes human actions from a video stream. The wide range of variations in human actions makes them difficult to recognize with considerable accuracy. This paper presents a novel deep neural network architecture called Attention RB-Net for HAR using video frames. The input is provided to the model in the form of video frames. The proposed deep architecture is based on the unique structuring of residual blocks with several filter sizes. Features are extracted from each frame via several operations with specific parameters defined in the presented novel Attention-based Residual Bottleneck (Attention-RB) DCNN architecture. A fully connected layer receives an attention-based feature matrix, and final classification is performed. Several hyperparameters of the proposed model are initialized using Bayesian Optimization (BO) and later utilized in the trained model for testing. In testing, features are extracted from the self-attention layer and passed to neural network classifiers for the final action classification. Two highly cited datasets, HMDB51 and UCF101, were used to validate the proposed architecture, which obtained average accuracies of 87.70% and 97.30%, respectively. The deep convolutional neural network (DCNN) architecture is compared with state-of-the-art (SOTA) methods, including pre-trained models, inside blocks, and recently published techniques, and performs better.
With the rapid development of artificial intelligence and Internet of Things technologies, video action recognition technology is widely applied in various scenarios, such as personal life and industrial production. However, while enjoying the convenience brought by this technology, it is crucial to effectively protect the privacy of users’ video data. Therefore, this paper proposes a video action recognition method based on personalized federated learning and spatiotemporal features. Under the framework of federated learning, a video action recognition method leveraging spatiotemporal features is designed. For the local spatiotemporal features of the video, a new differential information extraction scheme is proposed to extract differential features with a single RGB frame as the center, and a spatial-temporal module based on local information is designed to improve the effectiveness of local feature extraction; for the global temporal features, a method of extracting action rhythm features using differential technology is proposed, and a time module based on global information is designed. Different translational strides are used in the module to obtain bidirectional differential features under different action rhythms. Additionally, to address user data privacy issues, the method divides model parameters into local private parameters and public parameters based on the structure of the video action recognition model. This approach enhances model training performance and ensures the security of video data. The experimental results show that under personalized federated learning conditions, an average accuracy of 97.792% was achieved on the UCF-101 dataset, which is non-independent and identically distributed (non-IID). This research provides technical support for privacy protection in video action recognition.
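The split between local private parameters and public parameters can be pictured with the short Python sketch below. It is only an illustration under stated assumptions: the abstract does not say which layers are private or how the server aggregates, so the parameter names `backbone` and `head`, the choice of which one is public, and the equal-weight averaging are all hypothetical.

```python
def aggregate_public(client_params, public_keys):
    """Average the public parameters across clients with equal weights.

    Private parameters are never read here, so they never leave the client.
    """
    n = len(client_params)
    averaged = {}
    for key in public_keys:
        length = len(client_params[0][key])
        averaged[key] = [sum(client[key][i] for client in client_params) / n
                         for i in range(length)]
    return averaged


def apply_public(local_params, averaged):
    """Overwrite a client's public parameters; keep its private ones as they are."""
    updated = dict(local_params)
    updated.update(averaged)
    return updated


# Two clients, each holding a (hypothetically) public "backbone" and a
# (hypothetically) private "head", stored as flat lists of floats.
clients = [
    {"backbone": [1.0, 3.0], "head": [10.0]},
    {"backbone": [3.0, 5.0], "head": [20.0]},
]
shared = aggregate_public(clients, ["backbone"])
personalized = [apply_public(client, shared) for client in clients]
```

After one round, both clients hold the same averaged `backbone` while each keeps its own `head`, which is the sense in which the resulting models are personalized.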
Smart grid substation operations often take place in hazardous environments and pose significant threats to the safety of power personnel. Relying solely on manual supervision can lead to inadequate oversight. In response to the demand for technology to identify improper operations in substation work scenarios, this paper proposes a substation safety action recognition technology to avoid misoperation and enhance safety management. In general, this paper utilizes a dual-branch transformer network to extract spatial and temporal information from a video dataset of operational behaviors in complex substation environments. Firstly, in order to capture the spatial-temporal correlation of people's behaviors in smart grid substations, we devise a sparse attention module and a segmented linear attention module that are embedded into the spatial branch transformer and the temporal branch transformer, respectively. To avoid the redundancy of spatial and temporal information, we fuse the temporal and spatial features in a decoupled manner using a tensor decomposition fusion module. Experimental results indicate that our proposed method accurately detects improper operational behaviors in substation work scenarios, outperforming other existing methods in terms of detection and recognition accuracy.
Regular exercise is a crucial aspect of daily life, as it enables individuals to stay physically active, lowers the likelihood of developing illnesses, and enhances life expectancy. The recognition of workout actions in video streams holds significant importance in computer vision research, as it aims to enhance exercise adherence, enable instant recognition, advance fitness tracking technologies, and optimize fitness routines. However, existing action datasets often lack diversity and specificity for workout actions, hindering the development of accurate recognition models. To address this gap, the Workout Action Video dataset (WAVd) has been introduced as a significant contribution. WAVd comprises a diverse collection of labeled workout action videos, meticulously curated to encompass various exercises performed by numerous individuals in different settings. This research proposes an innovative framework based on the Attention driven Residual Deep Convolutional-Gated Recurrent Unit (ResDC-GRU) network for workout action recognition in video streams. Unlike image-based action recognition, videos contain spatio-temporal information, making the task more complex and challenging. While substantial progress has been made in this area, challenges persist in detecting subtle and complex actions, handling occlusions, and managing the computational demands of deep learning approaches. The proposed ResDC-GRU Attention model demonstrated exceptional classification performance with 95.81% accuracy in classifying workout action videos and also outperformed various state-of-the-art models. The method also yielded 81.6%, 97.2%, 95.6%, and 93.2% accuracy on established benchmark datasets, namely HMDB51, Youtube Actions, UCF50, and UCF101, respectively, showcasing its superiority and robustness in action recognition. The findings suggest practical implications in real-world scenarios where precise video action recognition is paramount, addressing the persisting challenges in the field. The WAVd dataset serves as a catalyst for the development of more robust and effective fitness tracking systems and ultimately promotes healthier lifestyles through improved exercise monitoring and analysis.
Classroom behavior recognition is a hot research topic that plays a vital role in assessing and improving the quality of classroom teaching. However, existing classroom behavior recognition methods struggle to reach high recognition accuracy on datasets with problems such as blurred pictures and inconsistent objects. To address this challenge, we propose an effective, lightweight object detection method called the RFNet model (YOLO-FR). Specifically, for efficient multi-scale feature extraction, an effective feature pyramid shared convolutional (FPSC) module is designed to improve feature extraction from the input image in the backbone by leveraging convolutional layers with varying dilation rates. Secondly, to address the problem of multi-scale variability in the scene, we design the Rep Ghost fusion Cross Stage Partial and Efficient Layer Aggregation Network (RGCSPELAN) to further improve network performance and reduce the amount of computation and the number of parameters. In addition, we conduct an experimental evaluation on the SCB dataset3 and the STBD-08 dataset. Experimental results indicate that, compared to the baseline model, the RFNet model increases mean average precision (mAP@50) from 69.6% to 71.0% on the SCB dataset3 and from 91.8% to 93.1% on the STBD-08 dataset. The RFNet approach attains a precision of 68.6%, surpassing the baseline method (YOLOv11) by 3.3%, and achieves the smallest model size (4.9 M) on the SCB dataset3. Finally, a comparison with other algorithms shows that RFNet accurately detects student behavior in complex classroom environments; the results confirm that RFNet is well suited to recognizing classroom behaviors efficiently and in real time.
Multimodal-based action recognition methods have achieved high success using the pose and RGB modalities. However, skeleton sequences lack appearance depiction, and RGB images suffer from irrelevant noise due to modality limitations. To address this, the authors introduce the human parsing feature map as a novel modality, since it can selectively retain effective semantic features of the body parts while filtering out most irrelevant noise. The authors propose a new dual-branch framework called ensemble human parsing and pose network (EPP-Net), which is the first to leverage both skeletons and human parsing modalities for action recognition. The first, human pose branch feeds robust skeletons into a graph convolutional network to model pose features, while the second, human parsing branch leverages depictive parsing feature maps to model parsing features via convolutional backbones. The two high-level features are effectively combined through a late fusion strategy for better action recognition. Extensive experiments on the NTU RGB+D and NTU RGB+D 120 benchmarks consistently verify the effectiveness of the proposed EPP-Net, which outperforms the existing action recognition methods. Our code is available at https://github.com/liujf69/EPP-Net-Action.
Electric power training is essential for ensuring the safety and reliability of the system. In this study, we introduce a novel Abnormal Action Recognition (AAR) system that utilizes a Lightweight Pose Estimation Network (LPEN) to efficiently and effectively detect abnormal fall-down and trespass incidents in electric power training scenarios. The LPEN network, comprising three stages (MobileNet, Initial Stage, and Refinement Stage), is employed to swiftly extract image features, detect human key points, and refine them for accurate analysis. Subsequently, a Pose-aware Action Analysis Module (PAAM) captures the positional coordinates of human skeletal points in each frame. Finally, an Abnormal Action Inference Module (AAIM) evaluates whether abnormal fall-down or unauthorized trespass behavior is occurring. For fall-down recognition, three criteria are considered: falling speed, main angles of skeletal points, and the person’s bounding box. To identify unauthorized trespass, emphasis is placed on the position of the ankles. Extensive experiments validate the effectiveness and efficiency of the proposed system in ensuring the safety and reliability of electric power training.
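The inference rules named in this abstract (falling speed, skeletal angles, and bounding box for falls; ankle position for trespass) can be sketched as simple geometric checks on pose keypoints. The Python below is a hypothetical illustration only: every threshold, the choice of neck and hip as the torso line, the rectangular restricted zone, and the decision to require all three fall criteria at once are assumptions, since the abstract gives none of these details.

```python
import math

# All thresholds are hypothetical placeholders, not values from the paper.
SPEED_THRESHOLD = 0.5    # downward hip displacement per frame (normalized units)
ANGLE_THRESHOLD = 45.0   # torso angle from the vertical axis, in degrees
ASPECT_THRESHOLD = 1.0   # bounding-box width divided by height


def torso_angle(neck, hip):
    """Angle in degrees between the neck-hip line and the vertical image axis."""
    dx = neck[0] - hip[0]
    dy = neck[1] - hip[1]
    return math.degrees(math.atan2(abs(dx), abs(dy)))


def is_fall(prev_hip_y, hip_y, neck, hip, box_w, box_h):
    """Flag a fall only when all three criteria agree (an assumed combination)."""
    falling_fast = (hip_y - prev_hip_y) > SPEED_THRESHOLD  # image y grows downward
    tilted = torso_angle(neck, hip) > ANGLE_THRESHOLD
    wide_box = box_w / box_h > ASPECT_THRESHOLD
    return falling_fast and tilted and wide_box


def is_trespass(ankles, zone):
    """Flag trespass when any ankle lies inside a rectangular restricted zone."""
    x_min, y_min, x_max, y_max = zone
    return any(x_min <= x <= x_max and y_min <= y <= y_max for x, y in ankles)


# A person whose hip dropped quickly, whose torso is horizontal, and whose
# bounding box is wider than it is tall.
fell = is_fall(0.2, 0.9, neck=(1.0, 0.9), hip=(0.0, 0.9), box_w=2.0, box_h=1.0)
# One ankle inside the zone spanning (0, 0) to (10, 10).
entered = is_trespass([(5, 5), (20, 20)], zone=(0, 0, 10, 10))
```

Using the ankles rather than the body center for the trespass check matches the abstract's emphasis: the feet mark where a person actually stands on the ground plane.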
Recognition of human gesture actions is a challenging issue due to the complex patterns in both visual and skeletal features. Existing gesture action recognition (GAR) methods typically analyze visual and skeletal data, failing to meet the demands of various scenarios. Furthermore, multi-modal approaches lack the versatility to efficiently process both uniform and disparate input patterns. Thus, in this paper, an attention-enhanced pseudo-3D residual model, called HgaNets, is proposed to address the GAR problem. This model comprises two independent components designed for modeling visual RGB (red, green and blue) images and 3D skeletal heatmaps, respectively. More specifically, each component consists of two main parts: 1) a multi-dimensional attention module for capturing important spatial, temporal and feature information in human gestures; 2) a spatiotemporal convolution module that utilizes pseudo-3D residual convolution to characterize spatiotemporal features of gestures. Then, the output weights of the two components are fused to generate the recognition results. Finally, we conducted experiments on four datasets to assess the efficiency of the proposed model. The results show that the accuracy on the four datasets reaches 85.40%, 91.91%, 94.70%, and 95.30%, respectively, while the inference time is 0.54 s and the parameter count is 2.74 M. These findings highlight that the proposed model outperforms other existing approaches in terms of recognition accuracy.
In recent years, wearable device-based Human Activity Recognition (HAR) models have received significant attention. Previously developed HAR models use hand-crafted features to recognize human activities, leading to the extraction of only basic features. The images captured by wearable sensors contain advanced features, allowing them to be analyzed by deep learning algorithms to enhance the detection and recognition of human actions. Poor lighting and limited sensor capabilities can impact data quality, making the recognition of human actions a challenging task. Unimodal HAR approaches are not suitable in a real-time environment. Therefore, an updated HAR model is developed using multiple types of data and an advanced deep learning approach. Firstly, the required signals and sensor data are collected from standard databases. From these signals, the wave features are retrieved. Then the extracted wave features and sensor data are given as the input to recognize the human activity. An Adaptive Hybrid Deep Attentive Network (AHDAN) is developed by incorporating a 1D Convolutional Neural Network (1DCNN) with a Gated Recurrent Unit (GRU) for the human activity recognition process. Additionally, the Enhanced Archerfish Hunting Optimizer (EAHO) is suggested to fine-tune the network parameters for enhancing the recognition process. An experimental evaluation is performed on various deep learning networks and heuristic algorithms to confirm the effectiveness of the proposed HAR model. The EAHO-based HAR model outperforms traditional deep learning networks with an accuracy of 95.36, a recall of 95.25, a specificity of 95.48, and a precision of 95.47. The results show that the developed model is effective in recognizing human actions in less time. Additionally, it reduces the computational complexity and the overfitting issue by using an optimization approach.
To improve the recognition performance of human actions in video, an approach that models video actions in a hierarchical way is proposed. This hierarchical model summarizes the action contents over different spatio-temporal domains according to the properties of human body movement. First, the temporal gradient combined with the constraint of a coherent motion pattern is used to extract stable and dense motion features that are viewed as point features; then the mean-shift clustering algorithm with an adaptive scale kernel is used to label these features. After pooling the features with the same label to generate a part-based representation, the visual word responses within one large-scale volume are collected as the video object representation. On the benchmark KTH (Kungliga Tekniska Högskolan) and UCF (University of Central Florida) Sports action datasets, the experimental results show that the proposed method enhances the representative and discriminative power of action features and improves recognition rates. Compared with other related literature, the proposed method obtains superior performance.
To exploit the logical structure of video sequences and improve the recognition accuracy of human actions, a novel hybrid human action detection method based on three descriptors and decision-level fusion is proposed. Firstly, the minimal 3D spatial region of the human action is detected by combining the frame difference method and the ViBe algorithm, and the three-dimensional histogram of oriented gradients (HOG3D) is extracted. At the same time, global descriptors based on frequency domain filtering (FDF) and local descriptors based on spatial-temporal interest points (STIP) are extracted. Principal component analysis (PCA) is applied to reduce the dimension of the gradient histogram and the global descriptor, and a bag-of-words (BoW) model is applied to describe the local descriptors based on STIP. Finally, a linear support vector machine (SVM) is used to create a new decision-level fusion classifier. Experiments are conducted to verify the performance of the multiple features, and the results show that they have good representation ability and generalization ability. Moreover, the proposed scheme obtains very competitive results on well-known datasets in terms of mean average precision.
Abstract: Representation learning from unlabeled skeleton data is a challenging task. Prior unsupervised learning algorithms mainly rely on the modeling ability of recurrent neural networks to extract action representations. However, the structural information of the skeleton data, which also plays a critical role in action recognition, is rarely explored in existing unsupervised methods. To address this limitation, we propose a novel two-stream autoencoder network that combines the topological information with the temporal information of skeleton data. Specifically, we encode the graph structure with a graph convolutional network (GCN) and integrate the extracted GCN-based representations into the gated recurrent unit stream. We then design a transfer module to merge the representations of the two streams adaptively. Based on the characteristics of the two-stream autoencoder, a unified loss function composed of multiple tasks is proposed to update the learnable parameters of our model. Comprehensive experiments on the NW-UCLA, UWA3D, and NTU-RGBD 60 datasets demonstrate that the proposed method achieves excellent performance among unsupervised skeleton-based methods and even performs comparably to or better than numerous supervised skeleton-based methods.
Funding: The Fundamental Research Funds for the Central Universities provided financial support for this research.
Abstract: Graph convolutional networks (GCNs), an essential tool in human action recognition tasks, have achieved excellent performance in previous studies. However, most current GCN-based methods for skeleton-based action recognition use a shared topology, which cannot flexibly adapt to the diverse correlations between joints under different motion features. The video-shooting angle or the occlusion of body parts may introduce errors when the human pose coordinates are extracted with estimation algorithms. In this work, we propose a novel graph convolutional learning framework, called PCCTR-GCN, which integrates pose correction and channel topology refinement for skeleton-based human action recognition. Firstly, a pose correction module (PCM) is introduced, which corrects the pose coordinates input to the network to reduce the error in pose feature extraction. Secondly, channel topology refinement graph convolution (CTR-GC) is employed, which can dynamically learn the topology features and aggregate joint features in different channel dimensions so as to enhance the performance of graph convolutional networks in feature extraction. Finally, considering that the joint stream and bone stream of skeleton data and their dynamic information are also important for distinguishing different actions, we employ a multi-stream data fusion approach to improve the network's recognition performance. We evaluate the model using top-1 and top-5 classification accuracy. On the benchmark datasets iMiGUE and Kinetics, the top-1 classification accuracy reaches 55.08% and 36.5%, respectively, while the top-5 classification accuracy reaches 89.98% and 59.2%, respectively. On the NTU dataset, for the two benchmark RGB+D settings (X-Sub and X-View), the classification accuracy reaches 89.7% and 95.4%, respectively.
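The top-1 and top-5 metrics reported above can be illustrated with a minimal sketch. This is not the authors' evaluation code; the function name `top_k_accuracy` and the toy scores are assumptions made for illustration only.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes.

    scores: list of per-class score lists, one per sample (hypothetical toy input).
    labels: list of true class indices.
    """
    hits = 0
    for row, label in zip(scores, labels):
        # Rank class indices from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy example with three samples and three classes.
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
top1 = top_k_accuracy(scores, labels, 1)
top2 = top_k_accuracy(scores, labels, 2)
```

Top-k accuracy never decreases as k grows, which is why the top-5 figures in the abstract are higher than the top-1 figures on the same dataset.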
Funding: Supported by the National Natural Science Foundation of China (No. 62303163) and the Science and Technology Key Project of the Science and Technology Department of Henan Province (No. 252102211041).
Abstract: Skeleton-based human action recognition focuses on identifying actions from dynamic skeletal data, which contains both temporal and spatial characteristics. However, this approach faces challenges such as viewpoint variations, low recognition accuracy, and high model complexity. Skeleton-based graph convolutional networks (GCNs) generally outperform other deep learning methods in recognition accuracy. However, they often underutilize temporal features and suffer from high model complexity, leading to increased training and validation costs, especially on large-scale datasets. This paper proposes a dual-channel graph convolutional network with multi-order information fusion (DM-AGCN) for human action recognition. The network integrates high-frame-rate skeleton channels to capture action dynamics and low-frame-rate channels to preserve static semantic information, effectively balancing temporal and spatial features. This dual-channel architecture allows for separate processing of temporal and spatial information. Additionally, DM-AGCN extracts joint keypoints and bidirectional bone vectors from skeleton sequences, and employs a three-stream graph convolutional structure to extract features that describe human movement. Experimental results on the NTU-RGB+D dataset demonstrate that DM-AGCN achieves an accuracy of 89.4% on X-Sub and 95.8% on X-View, while reducing model complexity to 3.68 GFLOPs (giga floating-point operations). On the Kinetics-Skeleton dataset, the model achieves a Top-1 accuracy of 37.2% and a Top-5 accuracy of 60.3%, further validating its effectiveness across different benchmarks.
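As a rough illustration of the inputs described above, the following sketch derives bone vectors in both directions from joint coordinates and subsamples a sequence at two frame rates. The five-joint skeleton, the `BONES` pairing, the interpretation of "bidirectional" as a vector and its negation, and the stride values are all assumptions for illustration; the abstract does not specify them.

```python
# Hypothetical toy skeleton: (child, parent) joint index pairs.
BONES = [(1, 0), (2, 1), (3, 1), (4, 1)]


def bone_vectors(joints, bones=BONES):
    """Return (forward, backward) bone vectors for one frame.

    forward  = child joint minus parent joint
    backward = parent joint minus child joint (assumed meaning of "bidirectional")
    """
    forward = [
        tuple(c - p for c, p in zip(joints[child], joints[parent]))
        for child, parent in bones
    ]
    backward = [tuple(-v for v in vec) for vec in forward]
    return forward, backward


def subsample(frames, stride):
    """Keep every stride-th frame; a larger stride gives a lower frame rate."""
    return frames[::stride]


# One frame of 3D joints: hip, chest, head, left hand, right hand.
frame = [(0, 0, 0), (0, 1, 0), (0, 2, 0), (-1, 1, 0), (1, 1, 0)]
forward, backward = bone_vectors(frame)

# Two channels over an 8-frame clip: dense for dynamics, sparse for static semantics.
clip = list(range(8))
fast_channel = subsample(clip, 1)
slow_channel = subsample(clip, 4)
```

The joint, forward-bone, and backward-bone sequences would then feed the three streams mentioned in the abstract.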
Funding: Supported in part by the 2023 Key Supported Project of the 14th Five-Year Plan for Education and Science in Hunan Province (No. ND230795).
Abstract: In recent years, skeleton-based action recognition has made great progress in computer vision. A graph convolutional network (GCN) is effective for action recognition, modelling the human skeleton as a spatio-temporal graph. Most GCNs define the graph topology by the physical relations of the human joints. However, this predefined graph ignores the spatial relationship between non-adjacent joint pairs in special actions and the behavior dependence between joint pairs, resulting in a low recognition rate for specific actions with implicit correlation between joint pairs. In addition, existing methods ignore the trend correlation between adjacent frames within an action and context clues, leading to erroneous recognition of actions with similar poses. Therefore, this study proposes a learnable GCN based on behavior dependence, which considers implicit joint correlation by constructing a dynamic learnable graph that extracts the specific behavior dependence of joint pairs. An adaptive model is constructed using the weight relationship between the joint pairs. The study also designs a self-attention module to obtain the inter-frame topological relationship for exploring the context of actions. By combining the shared topology and the multi-head self-attention map, the module obtains the context-based clue topology to update the dynamic graph convolution, achieving accurate recognition of different actions with similar poses. Detailed experiments on public datasets demonstrate that the proposed method achieves better results and a higher-quality representation of actions under various evaluation protocols compared to state-of-the-art methods.
Funding: Supported by the National Natural Science Foundation of China (Nos. 62276263, 62006225 and 62071468), the Strategic Priority Research Program of the Chinese Academy of Sciences (CAS), China (No. XDA27040700), and the National Key Research and Development Program of China (No. 2022YFC3310400).
Abstract: Skeleton-based action recognition has recently made significant progress. However, data imbalance is still a great challenge in real-world scenarios. The performance of current action recognition algorithms declines sharply when training data suffers from heavy class imbalance. The imbalanced data degrades the representations learned by these methods and becomes the bottleneck for action recognition. How to learn unbiased representations from imbalanced action data is the key to long-tailed action recognition. In this paper, we propose a novel balanced representation learning method to address the long-tailed problem in action recognition. Firstly, a spatial-temporal action exploration strategy is presented to expand the sample space effectively, generating more valuable samples in a rebalanced manner. Secondly, we design a detached action-aware learning schedule to further mitigate the bias in the representation space. The schedule detaches the representation learning of tail classes from training and proposes an action-aware loss to impose more effective constraints. Additionally, a skip-type representation is proposed to provide complementary structural information. The proposed method is validated on four skeleton datasets: NTU RGB+D 60, NTU RGB+D 120, NW-UCLA, and Kinetics. It not only achieves consistently large improvements compared to state-of-the-art (SOTA) methods, but also demonstrates superior generalization capacity through extensive experiments. Our code is available at https://github.com/firework8/BRL.
Funding: Supported by the National Natural Science Foundation of China (No. 62203476) and the Natural Science Foundation of Shenzhen (No. JCYJ20230807120801002).
Abstract: Three-dimensional skeleton-based action recognition (3D SAR) has gained considerable attention within the computer vision community, owing to the inherent advantages offered by skeleton data. As a result, a plethora of impressive works, including those based on conventional handcrafted features and learned feature extraction methods, have been conducted over the years. However, prior surveys on action recognition have primarily focused on video or red-green-blue (RGB) data-dominated approaches, with limited coverage of reviews related to skeleton data. Furthermore, despite the extensive application of deep learning methods in this field, there has been a notable absence of research that provides an introductory or comprehensive review from the perspective of deep learning architectures. To address these limitations, this survey first underscores the importance of action recognition and emphasizes the significance of three-dimensional (3D) skeleton data as a valuable modality. Subsequently, we provide a comprehensive introduction to mainstream action recognition techniques based on four fundamental deep architectures, i.e., recurrent neural networks, convolutional neural networks, graph convolutional networks, and Transformers. All methods with the corresponding architectures are then presented in a data-driven manner with detailed discussion. Finally, we offer insights into the current largest 3D skeleton dataset, NTU-RGB+D, and its new edition, NTU-RGB+D 120, along with an overview of several top-performing algorithms on these datasets. To the best of our knowledge, this research represents the first comprehensive discussion of deep learning-based action recognition using 3D skeleton data.
Funding: Supported by the National Natural Science Foundation of China under Grant 62107034, the Major Science and Technology Project of Yunnan Province (202402AD080002), and the Yunnan International Joint R&D Center of China-Laos-Thailand Educational Digitalization (202203AP140006).
Abstract: The task of student action recognition in the classroom is to precisely capture and analyze the actions of students in classroom videos, providing a foundation for intelligent and accurate teaching. However, the complex nature of the classroom environment makes student action recognition more difficult. In this article, a lightweight multi-modal fusion action recognition approach is put forward for real classroom scenarios in which students are prone to occlusion and computing resources are restricted. The proposed method enhances the accuracy of student action recognition while reducing the number of model parameters and the amount of computation, thereby achieving more efficient and accurate recognition. In the feature extraction stage, the method fuses the keypoint heatmap with the RGB (red-green-blue color model) image. In order to fully utilize the unique information of different modalities for feature complementarity, a Feature Fusion Module (FFE) is introduced. The FFE encodes and fuses the unique features of the two modalities during the feature extraction process. This fusion strategy not only achieves fusion and complementarity between modalities, but also improves the overall model performance. Furthermore, to reduce the computational load and parameter scale of the model, we use keypoint information to crop RGB images. At the same time, the first three networks of the lightweight feature extraction network X3D are used to extract dual-branch features. These methods significantly reduce the computational load and parameter scale. The number of parameters of the model is 1.40 million, and the computation amount is 5.04 billion floating-point operations (GFLOPs), achieving an efficient lightweight design. On the Student Classroom Action Dataset (SCAD), the accuracy of the model is 88.36%. On NTU 60 (the Nanyang Technological University RGB+D dataset with 60 categories), the accuracies on X-Sub (the people in the training set are different from those in the test set) and X-View (the perspectives of the training set and the test set are different) are 95.76% and 98.82%, respectively. On NTU 120 (the Nanyang Technological University RGB+D dataset with 120 categories), the accuracies on X-Sub and X-Set (the camera setups of the training set and the test set are different) are 91.97% and 93.45%, respectively. The model achieves a balance between accuracy, computation amount, and number of parameters.
Funding: Shanghai Municipal Commission of Economy and Information Technology, China (No. 202301054).
Abstract: Action recognition, a fundamental task in the field of video understanding, has been extensively researched and applied. In contrast to an image, a video introduces an extra temporal dimension. However, many existing action recognition networks either perform simple temporal fusion through averaging or rely on pre-trained models from image recognition, resulting in limited temporal information extraction capabilities. This work proposes a highly efficient temporal decoding module that can be seamlessly integrated into any action recognition backbone network to enhance the focus on temporal relationships between video frames. Firstly, the decoder initializes a set of learnable queries, termed video-level action category prediction queries. Then, after self-attention learning, they are combined with the video frame features extracted by the backbone network to extract video context information. Finally, these prediction queries with rich temporal features are used for category prediction. Experimental results on the HMDB51, MSRDailyAct3D, Diving48 and Breakfast datasets show that using TokShift-Transformer and VideoMAE as encoders with the proposed temporal decoder yields a significant improvement in Top-1 accuracy compared to the original models. The introduction of the temporal decoder results in an average performance increase exceeding 11% for TokShift-Transformer and nearly 5% for VideoMAE across the four datasets. Furthermore, the work explores the combination of the decoder with various action recognition networks, including Timesformer, as encoders. This results in an average accuracy improvement of more than 3.5% on the HMDB51 dataset. The code is available at https://github.com/huangturbo/TempDecoder.
Funding: Supported and funded by the Deanship of Scientific Research at Imam Mohammad Ibn Saud Islamic University (IMSIU) (grant number IMSIU-DDRSP2504).
Abstract: Reliable human action recognition (HAR) in video sequences is critical for a wide range of applications, such as security surveillance, healthcare monitoring, and human-computer interaction. Several automated systems have been designed for this purpose; however, existing methods, such as two-stream networks or 3D convolutional neural networks (CNNs), often struggle to effectively integrate spatial and temporal information from input samples, which limits their accuracy in discriminating numerous human actions. Therefore, this study introduces a novel deep learning framework called ARNet, designed for robust HAR. ARNet consists of two main modules, namely, a refined InceptionResNet-V2-based CNN and a Bi-LSTM (bidirectional Long Short-Term Memory) network. The refined InceptionResNet-V2 employs a parametric rectified linear unit (PReLU) activation strategy within convolutional layers to enhance spatial feature extraction from individual video frames. The inclusion of PReLU improves the spatial information capturing ability of the approach, as it uses learnable parameters to adaptively control the slope of the negative part of the activation function, allowing richer gradient flow during backpropagation and resulting in robust information capturing and stable model training. These spatial features, which hold essential pixel characteristics, are then processed by the Bi-LSTM module for temporal analysis, which helps ARNet understand the dynamic behavior of actions over time. ARNet integrates three additional dense layers after the Bi-LSTM module to ensure a comprehensive computation of both spatial and temporal patterns and further boost the feature representation. The model is validated experimentally on three benchmark datasets, HMDB51, KTH, and UCF Sports, and reports accuracies of 93.82%, 99%, and 99.16%, respectively. The precision results on the HMDB51, KTH, and UCF Sports datasets are 97.41%, 99.54%, and 99.01%; the recall values are 98.87%, 98.60%, and 99.08%; and the F1-scores are 98.13%, 99.07%, and 99.04%, respectively. These results highlight the robustness of the ARNet approach and its potential as a versatile tool for accurate HAR across various real-world applications.
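The PReLU behavior described above (a learnable slope on the negative part) can be sketched in a few lines. This is a scalar illustration of the standard PReLU definition, not the ARNet implementation; the slope value 0.25 is only an example.

```python
def prelu(x, a):
    """Parametric ReLU: identity for positive inputs, slope a for negative inputs."""
    return x if x > 0 else a * x


def prelu_grad_a(x):
    """Derivative of prelu(x, a) with respect to the learnable slope a.

    It is zero for positive inputs and equal to x otherwise, so only
    negative activations contribute to updating the slope.
    """
    return 0.0 if x > 0 else x


pos = prelu(2.0, 0.25)    # positive input passes through unchanged
neg = prelu(-2.0, 0.25)   # negative input is scaled by the slope
```

With `a = 0` this reduces to the ordinary ReLU, and with a small fixed `a` it reduces to a leaky ReLU; PReLU differs in that `a` is trained along with the other weights.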
Funding: Nourah bint Abdulrahman University, Riyadh, Saudi Arabia; supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2018R1A5A7059549); the Competitive Research Fund of The University of Aizu, Japan.
Abstract: Real-time surveillance depends on recognizing the variety of actions performed by humans. Human Action Recognition (HAR) is a technique that recognizes human actions from a video stream. The wide range of variations in human actions makes them difficult to recognize with considerable accuracy. This paper presents a novel deep neural network architecture called Attention RB-Net for HAR using video frames. The input is provided to the model in the form of video frames. The proposed deep architecture is based on a unique structuring of residual blocks with several filter sizes. Features are extracted from each frame via several operations with specific parameters defined in the presented Attention-based Residual Bottleneck (Attention-RB) deep convolutional neural network (DCNN) architecture. A fully connected layer receives an attention-based feature matrix, and the final classification is performed. Several hyperparameters of the proposed model are initialized using Bayesian Optimization (BO) and later used in the trained model for testing. In testing, features are extracted from the self-attention layer and passed to neural network classifiers for the final action classification. Two highly cited datasets, HMDB51 and UCF101, were used to validate the proposed architecture, which obtained average accuracies of 87.70% and 97.30%, respectively. The DCNN architecture is compared with state-of-the-art (SOTA) methods, including pre-trained models, inside blocks, and recently published techniques, and performs better.
Funding: Supported by the National Natural Science Foundation of China (Grant No. 62071098) and the Sichuan Science and Technology Program (Grants 2022YFG0319, 2023YFG0301 and 2023YFG0018).
Abstract: With the rapid development of artificial intelligence and Internet of Things technologies, video action recognition technology is widely applied in various scenarios, such as personal life and industrial production. However, while enjoying the convenience brought by this technology, it is crucial to effectively protect the privacy of users' video data. Therefore, this paper proposes a video action recognition method based on personalized federated learning and spatiotemporal features. Under the framework of federated learning, a video action recognition method leveraging spatiotemporal features is designed. For the local spatiotemporal features of the video, a new differential information extraction scheme is proposed to extract differential features with a single RGB frame as the center, and a spatial-temporal module based on local information is designed to improve the effectiveness of local feature extraction. For the global temporal features, a method of extracting action rhythm features using differential technology is proposed, and a time module based on global information is designed. Different translational strides are used in the module to obtain bidirectional differential features under different action rhythms. Additionally, to address user data privacy issues, the method divides model parameters into local private parameters and public parameters based on the structure of the video action recognition model. This approach enhances model training performance and ensures the security of video data. The experimental results show that under personalized federated learning conditions, an average accuracy of 97.792% was achieved on the UCF-101 dataset, which is non-independent and identically distributed (non-IID). This research provides technical support for privacy protection in video action recognition.
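The split into local private parameters and public parameters can be illustrated with a minimal sketch in which only the public parameters are averaged across clients. The parameter names (`backbone.w`, `head.w`), the scalar values, and the plain unweighted average are assumptions for illustration; the abstract does not say which layers are private or how aggregation is weighted.

```python
def aggregate_public(client_params, public_keys):
    """Average only the shared (public) parameters across all clients."""
    n = len(client_params)
    return {key: sum(p[key] for p in client_params) / n for key in public_keys}


def personalize(local_params, global_public):
    """Overwrite a client's public parameters; its private ones stay untouched."""
    updated = dict(local_params)
    updated.update(global_public)
    return updated


# Two hypothetical clients, each with one public and one private parameter.
clients = [
    {"backbone.w": 1.0, "head.w": 10.0},
    {"backbone.w": 3.0, "head.w": 20.0},
]
public = aggregate_public(clients, public_keys=["backbone.w"])
client0 = personalize(clients[0], public)
```

Because private parameters never leave the client, each client keeps a model adapted to its own non-IID data while still benefiting from the shared part.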
Abstract: Smart grid substation operations often take place in hazardous environments and pose significant threats to the safety of power personnel. Relying solely on manual supervision can lead to inadequate oversight. In response to the demand for technology that identifies improper operations in substation work scenarios, this paper proposes a substation safety action recognition technology to avoid misoperation and enhance safety management. Overall, the paper uses a dual-branch transformer network to extract spatial and temporal information from a video dataset of operational behaviors in complex substation environments. Firstly, in order to capture the spatial-temporal correlation of people's behaviors in smart grid substations, we devise a sparse attention module and a segmented linear attention module that are embedded into the spatial branch transformer and the temporal branch transformer, respectively. Secondly, to avoid redundancy between spatial and temporal information, we fuse the temporal and spatial features in a decoupled manner using a tensor decomposition fusion module. Experimental results indicate that the proposed method accurately detects improper operational behaviors in substation work scenarios, outperforming other existing methods in terms of detection and recognition accuracy.
Abstract: Regular exercise is a crucial aspect of daily life, as it enables individuals to stay physically active, lowers the likelihood of developing illnesses, and enhances life expectancy. The recognition of workout actions in video streams holds significant importance in computer vision research, as it aims to enhance exercise adherence, enable instant recognition, advance fitness tracking technologies, and optimize fitness routines. However, existing action datasets often lack diversity and specificity for workout actions, hindering the development of accurate recognition models. To address this gap, the Workout Action Video dataset (WAVd) has been introduced as a significant contribution. WAVd comprises a diverse collection of labeled workout action videos, meticulously curated to encompass various exercises performed by numerous individuals in different settings. This research proposes an innovative framework based on the Attention-driven Residual Deep Convolutional-Gated Recurrent Unit (ResDC-GRU) network for workout action recognition in video streams. Unlike image-based action recognition, videos contain spatio-temporal information, making the task more complex and challenging. While substantial progress has been made in this area, challenges persist in detecting subtle and complex actions, handling occlusions, and managing the computational demands of deep learning approaches. The proposed ResDC-GRU Attention model demonstrated exceptional classification performance with 95.81% accuracy in classifying workout action videos and also outperformed various state-of-the-art models. The method also yielded 81.6%, 97.2%, 95.6%, and 93.2% accuracy on established benchmark datasets, namely HMDB51, YouTube Actions, UCF50, and UCF101, respectively, showcasing its superiority and robustness in action recognition. The findings suggest practical implications in real-world scenarios where precise video action recognition is paramount, addressing the persisting challenges in the field. The WAVd dataset serves as a catalyst for the development of more robust and effective fitness tracking systems and ultimately promotes healthier lifestyles through improved exercise monitoring and analysis.
Funding: Supported by the Fundamental Research Grant Scheme (FRGS) of Universiti Sains Malaysia, Research Number: FRGS/1/2024/ICT02/USM/02/1.
Abstract: Classroom behavior recognition is a hot research topic that plays a vital role in assessing and improving the quality of classroom teaching. However, existing classroom behavior recognition methods struggle to achieve high recognition accuracy on datasets with problems such as blurred pictures and inconsistent objects. To address this challenge, we propose an effective, lightweight object detection method called the RFNet model (YOLO-FR). Specifically, for efficient multi-scale feature extraction, an effective feature pyramid shared convolutional (FPSC) module is designed in the backbone to improve feature extraction performance by applying convolutional layers with varying dilation rates to the input image. Secondly, to address the problem of multi-scale variability in the scene, we design the Rep Ghost fusion Cross Stage Partial and Efficient Layer Aggregation Network (RGCSPELAN) to further improve network performance and reduce the amount of computation and the number of parameters. In addition, experimental evaluation is conducted on the SCB dataset3 and the STBD-08 dataset. Experimental results indicate that, compared to the baseline model, the RFNet model increases the mean average precision (mAP@50) from 69.6% to 71.0% on the SCB dataset3 and from 91.8% to 93.1% on the STBD-08 dataset. The RFNet approach achieves a precision of 68.6%, surpassing the baseline method (YOLOv11) by 3.3%, and achieves the minimal size (4.9 M) on the SCB dataset3. Finally, a comparison with other algorithms shows that it accurately detects student behavior in complex classroom environments, and the results confirm that RFNet is well suited to recognizing classroom behaviors efficiently and in real time.
Funding: National Natural Science Foundation of China, Grant/Award Number: 62203476; Natural Science Foundation of Guangdong Province, Grant/Award Number: 2024A1515012089; Natural Science Foundation of Shenzhen, Grant/Award Number: JCYJ20230807120801002; Shenzhen Innovation in Science and Technology Foundation for The Excellent Youth Scholars, Grant/Award Number: RCYX20231211090248064.
Abstract: Multimodal action recognition methods have achieved high success using the pose and RGB modalities. However, skeleton sequences lack appearance depiction and RGB images suffer from irrelevant noise due to modality limitations. To address this, the authors introduce the human parsing feature map as a novel modality, since it can selectively retain effective semantic features of the body parts while filtering out most irrelevant noise. The authors propose a new dual-branch framework called ensemble human parsing and pose network (EPP-Net), which is the first to leverage both skeleton and human parsing modalities for action recognition. The first branch, for human pose, feeds robust skeletons into a graph convolutional network to model pose features, while the second branch, for human parsing, leverages depictive parsing feature maps to model parsing features via convolutional backbones. The two high-level features are combined through a late fusion strategy for better action recognition. Extensive experiments on the NTU RGB+D and NTU RGB+D 120 benchmarks consistently verify the effectiveness of the proposed EPP-Net, which outperforms existing action recognition methods. Our code is available at https://github.com/liujf69/EPP-Net-Action.
Funding: Supported by the Natural Science Foundation of Jiangsu Province (No. BK20230696).
Abstract: Electric power training is essential for ensuring the safety and reliability of the system. In this study, we introduce a novel Abnormal Action Recognition (AAR) system that utilizes a Lightweight Pose Estimation Network (LPEN) to efficiently and effectively detect abnormal fall-down and trespass incidents in electric power training scenarios. The LPEN, comprising three stages (MobileNet, Initial Stage, and Refinement Stage), is employed to swiftly extract image features, detect human key points, and refine them for accurate analysis. Subsequently, a Pose-aware Action Analysis Module (PAAM) captures the positional coordinates of human skeletal points in each frame. Finally, an Abnormal Action Inference Module (AAIM) evaluates whether abnormal fall-down or unauthorized trespass behavior is occurring. For fall-down recognition, three criteria are considered: falling speed, main angles of skeletal points, and the person's bounding box. To identify unauthorized trespass, emphasis is placed on the position of the ankles. Extensive experiments validate the effectiveness and efficiency of the proposed system in ensuring the safety and reliability of electric power training.
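The three fall-down criteria and the ankle-based trespass check named in this abstract can be sketched as simple rules on 2D key points. This is an illustration only: the abstract gives no thresholds or formulas, so the threshold values, the use of the neck-hip segment as the "main angle", the requirement that all three criteria hold, and the rectangular restricted zone are all assumptions, as are the function names.

```python
import math


def torso_angle_deg(neck, hip):
    """Angle in degrees between the neck-hip segment and the vertical axis."""
    dx = neck[0] - hip[0]
    dy = neck[1] - hip[1]
    return math.degrees(math.atan2(abs(dx), abs(dy)))


def is_fall(prev_hip_y, curr_hip_y, dt, neck, hip, bbox,
            speed_thresh=200.0, angle_thresh=60.0, ratio_thresh=1.0):
    """Flag a fall when all three criteria exceed their thresholds.

    Coordinates are in pixels with the image y axis pointing down.
    All thresholds are hypothetical defaults, not values from the paper.
    """
    speed = (curr_hip_y - prev_hip_y) / dt           # downward hip speed, px/s
    angle = torso_angle_deg(neck, hip)               # body tilt from upright
    x_min, y_min, x_max, y_max = bbox
    height = y_max - y_min
    ratio = (x_max - x_min) / height if height > 0 else float("inf")
    return speed > speed_thresh and angle > angle_thresh and ratio > ratio_thresh


def is_trespass(ankles, zone):
    """Return True if any ankle point lies inside the restricted rectangle."""
    x_min, y_min, x_max, y_max = zone
    return any(x_min <= x <= x_max and y_min <= y <= y_max for x, y in ankles)


if __name__ == "__main__":
    fallen = is_fall(150.0, 200.0, 0.1, neck=(60.0, 195.0), hip=(140.0, 200.0),
                     bbox=(40.0, 180.0, 220.0, 230.0))
    inside = is_trespass([(10.0, 10.0), (55.0, 60.0)], (50.0, 50.0, 100.0, 100.0))
    print(fallen, inside)
```

Requiring all criteria at once is a conservative choice that reduces false alarms from bending or sitting; a real system might instead weight the criteria or smooth them over several frames.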
Funding: The National Natural Science Foundation of China under Grant No. 62072255.
Abstract: Recognition of human gesture actions is a challenging issue due to the complex patterns in both visual and skeletal features. Existing gesture action recognition (GAR) methods typically analyze visual and skeletal data, failing to meet the demands of various scenarios. Furthermore, multi-modal approaches lack the versatility to efficiently process both uniform and disparate input patterns. Thus, in this paper, an attention-enhanced pseudo-3D residual model, called HgaNets, is proposed to address the GAR problem. This model comprises two independent components designed for modeling visual RGB (red, green and blue) images and 3D skeletal heatmaps, respectively. More specifically, each component consists of two main parts: 1) a multi-dimensional attention module for capturing important spatial, temporal and feature information in human gestures; 2) a spatiotemporal convolution module that utilizes pseudo-3D residual convolution to characterize spatiotemporal features of gestures. Then, the output weights of the two components are fused to generate the recognition results. Finally, we conducted experiments on four datasets to assess the efficiency of the proposed model. The results show that the accuracy on the four datasets reaches 85.40%, 91.91%, 94.70%, and 95.30%, respectively, while the inference time is 0.54 s and the parameter count is 2.74 M. These findings highlight that the proposed model outperforms other existing approaches in terms of recognition accuracy.
Abstract: In recent years, wearable device-based Human Activity Recognition (HAR) models have received significant attention. Previously developed HAR models use hand-crafted features to recognize human activities, leading to the extraction of only basic features. The images captured by wearable sensors contain advanced features, allowing them to be analyzed by deep learning algorithms to enhance the detection and recognition of human actions. Poor lighting and limited sensor capabilities can impact data quality, making the recognition of human actions a challenging task. Unimodal HAR approaches are not suitable in a real-time environment. Therefore, an updated HAR model is developed using multiple types of data and an advanced deep learning approach. Firstly, the required signals and sensor data are collected from standard databases. From these signals, the wave features are retrieved. Then the extracted wave features and sensor data are given as the input to recognize the human activity. An Adaptive Hybrid Deep Attentive Network (AHDAN) is developed by combining a 1D Convolutional Neural Network (1DCNN) with a Gated Recurrent Unit (GRU) for the human activity recognition process. Additionally, the Enhanced Archerfish Hunting Optimizer (EAHO) is suggested to fine-tune the network parameters and enhance the recognition process. An experimental evaluation is performed on various deep learning networks and heuristic algorithms to confirm the effectiveness of the proposed HAR model. The EAHO-based HAR model outperforms traditional deep learning networks with an accuracy of 95.36, a recall of 95.25, a specificity of 95.48, and a precision of 95.47. The results show that the developed model is effective in recognizing human actions while taking less time. Additionally, it reduces computational complexity and overfitting by using an optimization approach.
Funding: The National Natural Science Foundation of China (No. 60971098, 61201345).
Abstract: To improve the recognition performance of video human actions, an approach that models the video actions in a hierarchical way is proposed. This hierarchical model summarizes the action contents with different spatio-temporal domains according to the properties of human body movement. First, the temporal gradient combined with the constraint of coherent motion pattern is utilized to extract stable and dense motion features that are viewed as point features. Then the mean-shift clustering algorithm with the adaptive scale kernel is used to label these features. After pooling the features with the same label to generate a part-based representation, the visual word responses within one large-scale volume are collected as the video object representation. On the benchmark KTH (Kungliga Tekniska Högskolan) and UCF (University of Central Florida)-sports action datasets, the experimental results show that the proposed method enhances the representative and discriminative power of action features and improves recognition rates. Compared with other related literature, the proposed method obtains superior performance.
Funding: Supported by the National Natural Science Foundation of China under Grant No. 61503424; the Research Project by The State Ethnic Affairs Commission under Grant No. 14ZYZ017; the Jiangsu Future Networks Innovation Institute-Prospective Research Project on Future Networks under Grant No. BY2013095-2-14; the Fundamental Research Funds for the Central Universities No. FRF-TP-14-046A2; and the first-class discipline construction transitional funds of Minzu University of China.
Abstract: To take advantage of the logical structure of video sequences and improve the recognition accuracy of human actions, a novel hybrid human action detection method based on three descriptors and decision-level fusion is proposed. Firstly, the minimal 3D spatial region containing the human action is detected by combining the frame difference method and the ViBe algorithm, and the three-dimensional histogram of oriented gradients (HOG3D) is extracted. At the same time, global descriptors based on frequency domain filtering (FDF) and local descriptors based on spatial-temporal interest points (STIP) are extracted. Principal component analysis (PCA) is applied to reduce the dimension of the gradient histogram and the global descriptor, and a bag of words (BoW) model is applied to describe the local descriptors based on STIP. Finally, a linear support vector machine (SVM) is used to create a new decision-level fusion classifier. Experiments are carried out to verify the performance of the multiple features, and the results show that they have good representation ability and generalization ability. Moreover, the proposed scheme obtains very competitive results on well-known datasets in terms of mean average precision.
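The bag of words (BoW) step mentioned for the STIP local descriptors can be sketched as follows. This is a generic illustration, not the paper's implementation: it assumes a codebook of visual words is already available (such codebooks are commonly learned with k-means, which the abstract does not specify), and the function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def bow_histogram(descriptors, codebook):
    """Encode a set of local descriptors as a normalized word histogram.

    Each descriptor votes for its nearest codeword; the counts are then
    divided by the number of descriptors so that videos with different
    numbers of interest points remain comparable.
    """
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        nearest = min(
            range(len(codebook)),
            key=lambda k: squared_distance(descriptor, codebook[k]),
        )
        counts[nearest] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(codebook)
    return [count / total for count in counts]


if __name__ == "__main__":
    # Toy 2D descriptors and a two-word codebook (assumed, for illustration).
    codebook = [[0.0, 0.0], [10.0, 10.0]]
    descriptors = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0], [11.0, 10.0]]
    print(bow_histogram(descriptors, codebook))
```

The resulting fixed-length histogram is what makes a variable number of interest points usable as input to a classifier such as the linear SVM named in the abstract.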