Journal Articles
1,455 articles found
1. Two-Stream Auto-Encoder Network for Unsupervised Skeleton-Based Action Recognition
Authors: WANG Gang, GUAN Yaonan, LI Dewei. Journal of Shanghai Jiaotong University (Science), 2025, Issue 2, pp. 330-336.
Representation learning from unlabeled skeleton data is a challenging task. Prior unsupervised learning algorithms mainly rely on the modeling ability of recurrent neural networks to extract action representations. However, the structural information of the skeleton data, which also plays a critical role in action recognition, is rarely explored in existing unsupervised methods. To address this limitation, we propose a novel two-stream autoencoder network that combines the topological information with the temporal information of skeleton data. Specifically, we encode the graph structure with a graph convolutional network (GCN) and integrate the extracted GCN-based representations into the gated recurrent unit (GRU) stream. We then design a transfer module to merge the representations of the two streams adaptively. According to the characteristics of the two-stream autoencoder, a unified loss function composed of multiple tasks is proposed to update the learnable parameters of our model. Comprehensive experiments on the NW-UCLA, UWA3D, and NTU-RGBD 60 datasets demonstrate that our proposed method achieves excellent performance among unsupervised skeleton-based methods and even performs comparably or superior to numerous supervised skeleton-based methods.
Keywords: representation learning; skeleton-based action recognition; unsupervised deep learning
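The pairing of a topology stream and a temporal stream described in this abstract can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch illustration, not the authors' released model; the joint count, hidden size, and the gated merge are assumptions:

```python
# Minimal sketch (not the authors' code) of a two-stream encoder: a GCN
# layer encodes joint topology per frame, a GRU encodes temporal dynamics,
# and a small "transfer" gate merges the two streams.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim, num_joints):
        super().__init__()
        # Learnable adjacency initialized to a uniform graph (placeholder
        # for the skeleton's true bone connectivity).
        self.A = nn.Parameter(torch.full((num_joints, num_joints), 1.0 / num_joints))
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):              # x: (batch, frames, joints, in_dim)
        return torch.relu(self.proj(torch.einsum("vw,btwc->btvc", self.A, x)))

class TwoStreamEncoder(nn.Module):
    def __init__(self, num_joints=25, coord_dim=3, hidden=128):
        super().__init__()
        self.gcn = GCNLayer(coord_dim, hidden, num_joints)
        self.gru = nn.GRU(num_joints * coord_dim, hidden, batch_first=True)
        self.gate = nn.Linear(2 * hidden, hidden)   # adaptive transfer/merge

    def forward(self, x):              # x: (batch, frames, joints, 3)
        g = self.gcn(x).mean(dim=(1, 2))            # topology stream summary
        t, _ = self.gru(x.flatten(2))               # temporal stream
        return torch.tanh(self.gate(torch.cat([g, t[:, -1]], dim=-1)))

z = TwoStreamEncoder()(torch.randn(4, 30, 25, 3))   # (4, 128) action codes
```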
2. Skeleton-Based Action Recognition Using Graph Convolutional Network with Pose Correction and Channel Topology Refinement
Authors: Yuxin Gao, Xiaodong Duan, Qiguo Dai. Computers, Materials & Continua, 2025, Issue 4, pp. 701-718.
Graph convolutional networks (GCNs), an essential tool in human action recognition tasks, have achieved excellent performance in previous studies. However, most current GCN-based skeleton action recognition methods use a shared topology, which cannot flexibly adapt to the diverse correlations between joints under different motion features. Moreover, the video-shooting angle or the occlusion of body parts may introduce errors when extracting human pose coordinates with estimation algorithms. In this work, we propose a novel graph convolutional learning framework, called PCCTR-GCN, which integrates pose correction and channel topology refinement for skeleton-based human action recognition. First, a pose correction module (PCM) is introduced to correct the pose coordinates fed to the network, reducing errors in pose feature extraction. Second, channel topology refinement graph convolution (CTR-GC) is employed, which dynamically learns topology features and aggregates joint features in different channel dimensions, enhancing the feature extraction of graph convolutional networks. Finally, considering that the joint stream and bone stream of skeleton data and their dynamic information are also important for distinguishing actions, we employ a multi-stream data fusion approach to improve recognition performance. We evaluate the model using top-1 and top-5 classification accuracy. On the benchmark datasets iMiGUE and Kinetics, top-1 accuracy reaches 55.08% and 36.5%, respectively, while top-5 accuracy reaches 89.98% and 59.2%. On the NTU RGB+D dataset, for the two benchmark settings (X-Sub and X-View), classification accuracy reaches 89.7% and 95.4%, respectively.
Keywords: pose correction; multi-stream fusion; GCN; action recognition
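CTR-GC is a published graph-convolution variant; the sketch below is a simplified, hedged rendering of the channel-wise topology refinement idea rather than the paper's implementation. The per-channel refinements are averaged into a single topology here for brevity, and all sizes are assumptions:

```python
# Hedged sketch of channel-wise topology refinement graph convolution
# (in the spirit of CTR-GC): a shared skeleton adjacency is refined by a
# correlation inferred from pairwise joint feature differences.
import torch
import torch.nn as nn

class CTRGC(nn.Module):
    def __init__(self, in_ch, out_ch, num_joints=25, rel_ch=8):
        super().__init__()
        self.shared_A = nn.Parameter(torch.eye(num_joints))  # base topology
        self.theta = nn.Conv2d(in_ch, rel_ch, 1)   # query-like projection
        self.phi = nn.Conv2d(in_ch, rel_ch, 1)     # key-like projection
        self.out = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):            # x: (batch, channels, frames, joints)
        q = self.theta(x).mean(dim=2)               # (B, rel_ch, V)
        k = self.phi(x).mean(dim=2)                 # (B, rel_ch, V)
        # Sample-specific refinement from pairwise joint differences.
        refine = torch.tanh(q.unsqueeze(-1) - k.unsqueeze(-2))  # (B, rel_ch, V, V)
        A = self.shared_A + refine.mean(dim=1)      # refined topology (B, V, V)
        y = torch.einsum("bvw,bctw->bctv", A, x)    # aggregate neighbor features
        return self.out(y)

y = CTRGC(3, 64)(torch.randn(2, 3, 30, 25))          # (2, 64, 30, 25)
```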
3. Dual-channel graph convolutional network with multi-order information fusion for skeleton-based action recognition
Authors: JIANG Tao, HU Zhentao, WANG Kaige, QIU Qian, REN Xing. High Technology Letters, 2025, Issue 3, pp. 257-265.
Skeleton-based human action recognition focuses on identifying actions from dynamic skeletal data, which contains both temporal and spatial characteristics. However, this approach faces challenges such as viewpoint variations, low recognition accuracy, and high model complexity. Skeleton-based graph convolutional networks (GCNs) generally outperform other deep learning methods in recognition accuracy, but they often underutilize temporal features and suffer from high model complexity, leading to increased training and validation costs, especially on large-scale datasets. This paper proposes a dual-channel graph convolutional network with multi-order information fusion (DM-AGCN) for human action recognition. The network integrates high-frame-rate skeleton channels to capture action dynamics and low-frame-rate channels to preserve static semantic information, effectively balancing temporal and spatial features. This dual-channel architecture allows separate processing of temporal and spatial information. Additionally, DM-AGCN extracts joint keypoints and bidirectional bone vectors from skeleton sequences and employs a three-stream graph convolutional structure to extract features that describe human movement. Experimental results on the NTU-RGB+D dataset demonstrate that DM-AGCN achieves an accuracy of 89.4% on X-Sub and 95.8% on X-View, while reducing model complexity to 3.68 GFLOPs (giga floating-point operations). On the Kinetics-Skeleton dataset, the model achieves a top-1 accuracy of 37.2% and a top-5 accuracy of 60.3%, further validating its effectiveness across different benchmarks.
Keywords: human action recognition; graph convolutional network; spatiotemporal fusion; feature extraction
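A minimal sketch of the dual frame-rate idea follows, with GRU branches standing in for the graph-convolutional channels; the subsampling stride and dimensions are illustrative, not the DM-AGCN configuration:

```python
# Illustrative sketch (assumptions, not the DM-AGCN release): a high
# frame-rate channel for motion dynamics and a temporally subsampled
# low frame-rate channel for static semantics, fused for classification.
import torch
import torch.nn as nn

class DualRateSkeletonNet(nn.Module):
    def __init__(self, num_joints=25, coord_dim=3, hidden=64, classes=60, stride=4):
        super().__init__()
        self.stride = stride                       # low-rate subsampling factor
        feat = num_joints * coord_dim
        self.fast = nn.GRU(feat, hidden, batch_first=True)  # full frame rate
        self.slow = nn.GRU(feat, hidden, batch_first=True)  # subsampled frames
        self.head = nn.Linear(2 * hidden, classes)

    def forward(self, x):                          # x: (batch, frames, joints, 3)
        flat = x.flatten(2)
        _, hf = self.fast(flat)                    # dynamics at high FPS
        _, hs = self.slow(flat[:, ::self.stride])  # semantics at low FPS
        return self.head(torch.cat([hf[-1], hs[-1]], dim=-1))

logits = DualRateSkeletonNet()(torch.randn(2, 64, 25, 3))   # (2, 60)
```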
4. BCCLR: A Skeleton-Based Action Recognition with Graph Convolutional Network Combining Behavior Dependence and Context Clues (Cited: 4)
Authors: Yunhe Wang, Yuxin Xia, Shuai Liu. Computers, Materials & Continua (SCIE, EI), 2024, Issue 3, pp. 4489-4507.
In recent years, skeleton-based action recognition has made great achievements in computer vision. A graph convolutional network (GCN) is effective for action recognition, modelling the human skeleton as a spatio-temporal graph. Most GCNs define the graph topology by physical relations of the human joints. However, this predefined graph ignores the spatial relationship between non-adjacent joint pairs in special actions and the behavior dependence between joint pairs, resulting in a low recognition rate for specific actions with implicit correlation between joint pairs. In addition, existing methods ignore the trend correlation between adjacent frames within an action and context clues, leading to erroneous recognition of actions with similar poses. Therefore, this study proposes a learnable GCN based on behavior dependence, which considers implicit joint correlation by constructing a dynamic learnable graph that extracts the specific behavior dependence of joint pairs. Using the weight relationship between joint pairs, an adaptive model is constructed. The study also designs a self-attention module to obtain the inter-frame topological relationship for exploring the context of actions. Combining the shared topology and the multi-head self-attention map, the module obtains a context-based clue topology to update the dynamic graph convolution, achieving accurate recognition of different actions with similar poses. Detailed experiments on public datasets demonstrate that the proposed method achieves better results and higher-quality representations of actions under various evaluation protocols compared with state-of-the-art methods.
Keywords: action recognition; deep learning; GCN; behavior dependence; context clue; self-attention
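The combination of a shared learnable topology with a multi-head self-attention map can be sketched as follows; this is an illustrative simplification, not the BCCLR implementation, and the module names are hypothetical:

```python
# Hedged sketch of the "shared topology + self-attention map" idea: an
# attention map computed from per-frame joint features is added to a
# learnable shared adjacency before graph aggregation.
import torch
import torch.nn as nn

class AttentionAugmentedGC(nn.Module):
    def __init__(self, dim, num_joints=25, heads=4):
        super().__init__()
        self.shared_A = nn.Parameter(torch.eye(num_joints))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                # x: (batch, frames, joints, dim)
        b, t, v, c = x.shape
        tokens = x.reshape(b * t, v, c)  # joints as tokens, per frame
        _, attn_map = self.attn(tokens, tokens, tokens,
                                need_weights=True, average_attn_weights=True)
        A = self.shared_A + attn_map     # context-based clue topology
        return torch.relu(self.proj(torch.bmm(A, tokens))).reshape(b, t, v, c)

out = AttentionAugmentedGC(16)(torch.randn(2, 10, 25, 16))   # (2, 10, 25, 16)
```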
5. Multi-Scale Adaptive Large Kernel Graph Convolutional Network for Skeleton-Based Action Recognition
Authors: Yu-Qing Zhang, Chen Pang, Pei Geng, Xue-Quan Lu, Lei Lyu. Journal of Computer Science & Technology, 2025, Issue 5, pp. 1285-1300.
Graph convolutional networks (GCNs) have become a dominant approach for skeleton-based action recognition tasks. Although GCNs have made significant progress in modeling skeletons as spatial-temporal graphs, they often require stacking multiple graph convolution layers to effectively capture long-distance relationships among nodes. This stacking not only increases the computational burden but also raises the risk of over-smoothing, which can lead to the neglect of crucial local action features. To address this issue, we propose a novel multi-scale adaptive large kernel graph convolutional network (MSLK-GCN) to effectively aggregate local and global spatio-temporal correlations while maintaining computational efficiency. The core components of the network are two multi-scale large kernel graph convolution (LKGC) modules, a multi-channel adaptive graph convolution (MAGC) module, and a multi-scale temporal self-attention convolution (MSTC) module. The LKGC module adaptively focuses on active motion regions by utilizing a large convolution kernel and a gating mechanism, effectively capturing long-distance dependencies within the skeleton sequence. Meanwhile, the MAGC module dynamically learns relationships between different joints by adjusting the connection weights between nodes. To further enhance the ability to capture temporal dynamics, the MSTC module aggregates temporal information by integrating efficient channel attention (ECA) with multi-scale convolution. In addition, we use a multi-stream fusion strategy to make full use of different skeleton modalities, including bone, joint, joint motion, and bone motion. Exhaustive experiments on three scale-varying datasets, i.e., NTU-60, NTU-120, and NW-UCLA, demonstrate that our MSLK-GCN achieves state-of-the-art performance with fewer parameters.
Keywords: skeleton-based action recognition; graph convolutional network (GCN); multi-scale large kernel attention
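Two of the named ingredients, a gated large-kernel temporal convolution and efficient channel attention (ECA), can be sketched together; the kernel size and layout are assumptions rather than the MSLK-GCN settings:

```python
# Illustrative sketch: depthwise large-kernel temporal convolution with a
# sigmoid gate for long-range mixing, followed by an ECA-style channel
# reweighting (a 1D convolution over pooled channel statistics).
import torch
import torch.nn as nn

class GatedLargeKernelTemporal(nn.Module):
    def __init__(self, ch, kernel=31):
        super().__init__()
        pad = kernel // 2
        self.conv = nn.Conv1d(ch, ch, kernel, padding=pad, groups=ch)
        self.gate = nn.Conv1d(ch, ch, kernel, padding=pad, groups=ch)
        self.eca = nn.Conv1d(1, 1, 3, padding=1)   # ECA: 1D conv across channels

    def forward(self, x):                 # x: (batch, channels, frames)
        y = self.conv(x) * torch.sigmoid(self.gate(x))   # gated long-range mixing
        w = self.eca(y.mean(dim=-1).unsqueeze(1))        # (B, 1, C) channel stats
        return y * torch.sigmoid(w).transpose(1, 2)      # reweight channels

out = GatedLargeKernelTemporal(64)(torch.randn(2, 64, 100))  # (2, 64, 100)
```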
6. Balanced Representation Learning for Long-tailed Skeleton-based Action Recognition
Authors: Hongda Liu, Yunlong Wang, Min Ren, Junxing Hu, Zhengquan Luo, Guangqi Hou, Zhenan Sun. Machine Intelligence Research, 2025, Issue 3, pp. 466-483.
Skeleton-based action recognition has recently made significant progress. However, data imbalance is still a great challenge in real-world scenarios. The performance of current action recognition algorithms declines sharply when training data suffers from heavy class imbalance. The imbalanced data degrades the representations learned by these methods and becomes the bottleneck for action recognition. Learning unbiased representations from imbalanced action data is therefore the key to long-tailed action recognition. In this paper, we propose a novel balanced representation learning method to address the long-tailed problem in action recognition. First, a spatial-temporal action exploration strategy is presented to expand the sample space effectively, generating more valuable samples in a rebalanced manner. Second, we design a detached action-aware learning schedule to further mitigate bias in the representation space. The schedule detaches the representation learning of tail classes from training and proposes an action-aware loss to impose more effective constraints. Additionally, a skip-type representation is proposed to provide complementary structural information. The proposed method is validated on four skeleton datasets: NTU RGB+D 60, NTU RGB+D 120, NW-UCLA, and Kinetics. It not only achieves consistently large improvements compared with state-of-the-art (SOTA) methods but also demonstrates superior generalization capacity through extensive experiments. Our code is available at https://github.com/firework8/BRL.
Keywords: action recognition; skeleton sequence; long-tailed visual recognition; imbalance learning
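The rebalancing idea can be illustrated generically with a class-balanced sampler; this is a standard technique shown under assumed labels, not the paper's spatial-temporal exploration strategy:

```python
# Sketch of rebalanced sampling: a class-balanced sampler that draws
# tail-class skeleton sequences more often during training.
import torch
from torch.utils.data import WeightedRandomSampler

def balanced_sampler(labels):
    """labels: 1D LongTensor of per-sample class ids."""
    counts = torch.bincount(labels).float()
    weights = 1.0 / counts[labels]          # inverse-frequency per sample
    return WeightedRandomSampler(weights, num_samples=len(labels),
                                 replacement=True)

labels = torch.tensor([0] * 90 + [1] * 9 + [2] * 1)   # heavy head/tail skew
sampler = balanced_sampler(labels)          # pass to DataLoader(sampler=...)
```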
7. Lightweight Multiscale Spatio-Temporal Graph Convolutional Network for Skeleton-Based Action Recognition
Authors: Zhiyun Zheng, Qilong Yuan, Huaizhu Zhang, Yizhou Wang, Junfeng Wang. Big Data Mining and Analytics, 2025, Issue 2, pp. 310-325.
Using skeletal information to model and recognize human actions is currently a hot research subject in the realm of human action recognition (HAR). Graph convolutional networks (GCNs) have gained popularity in this discipline due to their capacity to efficiently process graph-structured data. However, it is challenging for current models to handle the distant dependencies that commonly exist between human skeleton nodes, which hinders the development of algorithms in related fields. To solve these problems, the lightweight multiscale spatio-temporal graph convolutional network (LMSTGCN) is proposed. First, the lightweight multiscale spatial graph convolutional network (LMSGCN) is constructed to capture information at various hierarchies, and multiple inner connections between skeleton joints are captured by dividing the input features into a number of subsets along the channel direction. Second, dilated convolution is incorporated into the temporal convolution to construct the lightweight multiscale temporal convolutional network (LMTCN), which obtains a wider receptive field while keeping the size of the convolution kernel unchanged. Third, the spatio-temporal location attention (STLAtt) module is used to identify the most informative joints in the skeleton sequence at a specific frame, improving the model's ability to extract features and recognize actions. Finally, a multi-stream data fusion input structure is used to enhance the input data and expand the feature information. Experiments on three public datasets illustrate the effectiveness of the proposed network.
Keywords: human action recognition (HAR); skeleton data; graph convolutional network (GCN); attention mechanism
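The dilated multiscale temporal convolution described above widens the receptive field without enlarging the kernel; a minimal sketch follows, with branch counts and dilations assumed:

```python
# Hedged sketch of a multiscale dilated temporal convolution: parallel
# branches with increasing dilation share a fixed kernel size, widen the
# receptive field, and are concatenated along channels.
import torch
import torch.nn as nn

class MultiScaleDilatedTemporal(nn.Module):
    def __init__(self, ch, kernel=3, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(ch, ch // len(dilations), kernel,
                      padding=d * (kernel // 2), dilation=d)
            for d in dilations)

    def forward(self, x):                 # x: (batch, channels, frames)
        return torch.cat([b(x) for b in self.branches], dim=1)

out = MultiScaleDilatedTemporal(66)(torch.randn(2, 66, 50))   # (2, 66, 50)
```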
8. A Survey on 3D Skeleton-Based Action Recognition Using Learning Method (Cited: 1)
Authors: Bin Ren, Mengyuan Liu, Runwei Ding, Hong Liu. Cyborg and Bionic Systems, 2024, Issue 1, pp. 410-425.
Three-dimensional skeleton-based action recognition (3D SAR) has gained important attention within the computer vision community, owing to the inherent advantages of skeleton data. As a result, a plethora of impressive works, including those based on conventional handcrafted features and learned feature extraction methods, have been conducted over the years. However, prior surveys on action recognition have primarily focused on video- or red-green-blue (RGB) data-dominated approaches, with limited coverage of reviews related to skeleton data. Furthermore, despite the extensive application of deep learning methods in this field, there has been a notable absence of research that provides an introductory or comprehensive review from the perspective of deep learning architectures. To address these limitations, this survey first underscores the importance of action recognition and emphasizes the significance of three-dimensional (3D) skeleton data as a valuable modality. Subsequently, we provide a comprehensive introduction to mainstream action recognition techniques based on four fundamental deep architectures: recurrent neural networks, convolutional neural networks, graph convolutional networks, and Transformers. All methods with the corresponding architectures are then presented in a data-driven manner with detailed discussion. Finally, we offer insights into the current largest 3D skeleton dataset, NTU-RGB+D, and its new edition, NTU-RGB+D 120, along with an overview of several top-performing algorithms on these datasets. To the best of our knowledge, this research represents the first comprehensive discussion of deep learning-based action recognition using 3D skeleton data.
Keywords: skeleton data; conventional handcrafted features; learned feature extraction methods; action recognition; computer vision; deep learning; 3D SAR
9. Video action recognition meets vision-language models exploring human factors in scene interaction: a review
Authors: GUO Yuping, GAO Hongwei, YU Jiahui, GE Jinchao, HAN Meng, JU Zhaojie. Optoelectronics Letters, 2025, Issue 10, pp. 626-640.
Video action recognition (VAR) aims to analyze dynamic behaviors in videos and achieve semantic understanding. VAR faces challenges such as temporal dynamics, action-scene coupling, and the complexity of human interactions. Existing methods can be categorized into motion-level, event-level, and story-level approaches based on spatiotemporal granularity. However, single-modal approaches struggle to capture complex behavioral semantics and human factors. Therefore, in recent years, vision-language models (VLMs) have been introduced into this field, providing new research perspectives for VAR. In this paper, we systematically review spatiotemporal hierarchical methods in VAR and explore how the introduction of large models has advanced the field. Additionally, we propose the concept of "Factor" to identify and integrate key information from both visual and textual modalities, enhancing multimodal alignment. We also summarize various multimodal alignment methods and provide in-depth analysis and insights into future research directions.
Keywords: human factors; video action recognition (VAR); vision-language models; spatiotemporal granularity; multimodal alignment; scene interaction
10. Lightweight Classroom Student Action Recognition Method Based on Spatiotemporal Multimodal Feature Fusion
Authors: Shaodong Zou, Di Wu, Jianhou Gan, Juxiang Zhou, Jiatian Mei. Computers, Materials & Continua, 2025, Issue 4, pp. 1101-1116.
The task of student action recognition in the classroom is to precisely capture and analyze the actions of students in classroom videos, providing a foundation for intelligent and accurate teaching. However, the complex nature of the classroom environment adds challenges and difficulties to student action recognition. In this article, addressing the circumstances where students are prone to occlusion and computing resources are restricted in real classroom scenarios, a lightweight multi-modal fusion action recognition approach is put forward. The proposed method enhances the accuracy of student action recognition while reducing the number of model parameters and the computation amount, achieving more efficient and accurate recognition. In the feature extraction stage, the method fuses the keypoint heatmap with the RGB image. To fully utilize the unique information of different modalities for feature complementarity, a feature fusion module (FFE) is introduced, which encodes and fuses the unique features of the two modalities during feature extraction. This fusion strategy not only achieves fusion and complementarity between modalities but also improves overall model performance. Furthermore, to reduce the computational load and parameter scale of the model, keypoint information is used to crop the RGB images, and the first three stages of the lightweight feature extraction network X3D are used to extract dual-branch features. These measures significantly reduce the computational load and parameter scale: the model has 1.40 million parameters and requires 5.04 GFLOPs, an efficient lightweight design. On the Student Classroom Action Dataset (SCAD), the model's accuracy is 88.36%. On NTU RGB+D 60 (a 60-category dataset from Nanyang Technological University), the accuracies on X-Sub (training and test sets contain different subjects) and X-View (training and test sets use different camera views) are 95.76% and 98.82%, respectively. On NTU RGB+D 120 (the 120-category edition), the accuracies on X-Sub and X-Set (training and test sets use different camera setups) are 91.97% and 93.45%, respectively. The model achieves a balance among accuracy, computation amount, and the number of parameters.
Keywords: action recognition; student classroom action; multimodal fusion; lightweight model design
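The keypoint-heatmap-plus-RGB early fusion can be sketched as below; the heatmap generator and the fusion layer are hypothetical stand-ins for the paper's FFE module, with assumed shapes:

```python
# Minimal sketch (assumed shapes, not the paper's FFE module) of fusing a
# keypoint heatmap branch with an RGB branch by channel concatenation.
import torch
import torch.nn as nn

def keypoint_heatmaps(kpts, size=56, sigma=2.0):
    """kpts: (batch, joints, 2) pixel coords in [0, size) -> Gaussian maps."""
    ys = torch.arange(size).view(1, 1, size, 1).float()
    xs = torch.arange(size).view(1, 1, 1, size).float()
    d2 = (xs - kpts[..., 0, None, None]) ** 2 + (ys - kpts[..., 1, None, None]) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))        # (batch, joints, size, size)

class FuseRGBPose(nn.Module):
    def __init__(self, joints=17):
        super().__init__()
        self.mix = nn.Conv2d(3 + joints, 32, 3, padding=1)  # early fusion

    def forward(self, rgb, heat):        # rgb: (B,3,56,56), heat: (B,17,56,56)
        return torch.relu(self.mix(torch.cat([rgb, heat], dim=1)))

heat = keypoint_heatmaps(torch.rand(2, 17, 2) * 56)
out = FuseRGBPose()(torch.randn(2, 3, 56, 56), heat)   # (2, 32, 56, 56)
```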
11. An Efficient Temporal Decoding Module for Action Recognition
Authors: HUANG Qiubo, MEI Jianmin, ZHAO Wupeng, LU Yiru, WANG Mei, CHEN Dehua. Journal of Donghua University (English Edition), 2025, Issue 2, pp. 187-196.
Action recognition, a fundamental task in video understanding, has been extensively researched and applied. In contrast to an image, a video introduces an extra temporal dimension. However, many existing action recognition networks either perform simple temporal fusion through averaging or rely on models pre-trained for image recognition, resulting in limited temporal information extraction. This work proposes a highly efficient temporal decoding module that can be seamlessly integrated into any action recognition backbone network to enhance the focus on temporal relationships between video frames. First, the decoder initializes a set of learnable queries, termed video-level action category prediction queries. Then, after self-attention learning, these queries are combined with the video frame features extracted by the backbone network to extract video context information. Finally, the prediction queries, now rich in temporal features, are used for category prediction. Experimental results on the HMDB51, MSRDailyAct3D, Diving48, and Breakfast datasets show that using TokShift-Transformer and VideoMAE as encoders yields a significant improvement in top-1 accuracy over the original models after introducing the proposed temporal decoder: an average increase exceeding 11% for TokShift-Transformer and nearly 5% for VideoMAE across the four datasets. Furthermore, the work explores combining the decoder with various action recognition networks, including TimeSformer, as encoders, yielding an average accuracy improvement of more than 3.5% on the HMDB51 dataset. The code is available at https://github.com/huangturbo/TempDecoder.
Keywords: action recognition; video understanding; temporal relationship; temporal decoder; Transformer
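The decoder mechanism described here, learnable queries that self-attend and then cross-attend to backbone frame features, maps naturally onto a standard transformer decoder. Below is a hedged sketch with illustrative dimensions, not the released TempDecoder:

```python
# Sketch of the learnable-query decoder idea: category-prediction queries
# self-attend, then cross-attend to per-frame features from any backbone.
# nn.TransformerDecoderLayer provides both steps in one layer.
import torch
import torch.nn as nn

class TemporalDecoder(nn.Module):
    def __init__(self, dim=256, num_queries=8, classes=51):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(dim, classes)

    def forward(self, frame_feats):      # frame_feats: (batch, frames, dim)
        q = self.queries.unsqueeze(0).expand(frame_feats.size(0), -1, -1)
        ctx = self.decoder(q, frame_feats)          # self-attn + cross-attn
        return self.head(ctx.mean(dim=1))           # video-level logits

logits = TemporalDecoder()(torch.randn(4, 16, 256))   # (4, 51)
```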
12. ARNet: Integrating Spatial and Temporal Deep Learning for Robust Action Recognition in Videos
Authors: Hussain Dawood, Marriam Nawaz, Tahira Nazir, Ali Javed, Abdul Khader Jilani Saudagar, Hatoon S. AlSagri. Computer Modeling in Engineering & Sciences, 2025, Issue 7, pp. 429-459.
Reliable human action recognition (HAR) in video sequences is critical for a wide range of applications, such as security surveillance, healthcare monitoring, and human-computer interaction. Several automated systems have been designed for this purpose; however, existing methods often struggle to effectively integrate spatial and temporal information from input samples, as in two-stream networks or 3D convolutional neural networks (CNNs), which limits their accuracy in discriminating numerous human actions. Therefore, this study introduces a novel deep learning framework called ARNet, designed for robust HAR. ARNet consists of two main modules: a refined InceptionResNet-V2-based CNN and a Bi-LSTM (bidirectional long short-term memory) network. The refined InceptionResNet-V2 employs a parametric rectified linear unit (PReLU) activation strategy within convolutional layers to enhance spatial feature extraction from individual video frames. PReLU improves the spatial information-capturing ability of the approach, as it uses learnable parameters to adaptively control the slope of the negative part of the activation function, allowing richer gradient flow during backpropagation and resulting in robust information capture and stable model training. These spatial features, holding essential pixel characteristics, are then processed by the Bi-LSTM module for temporal analysis, which helps ARNet understand the dynamic behavior of actions over time. ARNet integrates three additional dense layers after the Bi-LSTM module to ensure comprehensive computation of both spatial and temporal patterns and further boost the feature representation. The model is validated on three benchmark datasets, HMDB51, KTH, and UCF Sports, with accuracies of 93.82%, 99%, and 99.16%, respectively. The precision results on HMDB51, KTH, and UCF Sports are 97.41%, 99.54%, and 99.01%; the recall values are 98.87%, 98.60%, and 99.08%; and the F1-scores are 98.13%, 99.07%, and 99.04%, respectively. These results highlight the robustness of ARNet and its potential as a versatile tool for accurate HAR across various real-world applications.
Keywords: action recognition; Bi-LSTM; computer vision; deep learning; InceptionResNet-V2; PReLU
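The CNN-plus-Bi-LSTM pattern can be shown in miniature; in this sketch a tiny PReLU CNN stands in for the refined InceptionResNet-V2, and all sizes are assumptions:

```python
# Hedged sketch of the CNN + Bi-LSTM pattern: per-frame spatial features
# feed a bidirectional LSTM for temporal analysis, then dense layers.
import torch
import torch.nn as nn

class CNNBiLSTM(nn.Module):
    def __init__(self, hidden=128, classes=51):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.PReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.PReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.lstm = nn.LSTM(32, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(nn.Linear(2 * hidden, 64), nn.ReLU(),
                                  nn.Linear(64, classes))

    def forward(self, video):            # video: (batch, frames, 3, H, W)
        b, t = video.shape[:2]
        f = self.cnn(video.flatten(0, 1)).flatten(1).reshape(b, t, -1)
        out, _ = self.lstm(f)            # temporal context in both directions
        return self.head(out[:, -1])

logits = CNNBiLSTM()(torch.randn(2, 8, 3, 64, 64))   # (2, 51)
```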
13. A Novel Attention-Based Parallel Blocks Deep Architecture for Human Action Recognition
Authors: Yasir Khan Jadoon, Yasir Noman Khalid, Muhammad Attique Khan, Jungpil Shin, Fatimah Alhayan, Hee-Chan Cho, Byoungchol Chang. Computer Modeling in Engineering & Sciences, 2025, Issue 7, pp. 1143-1164.
Real-time surveillance depends on recognizing the variety of actions performed by humans. Human action recognition (HAR) is a technique that recognizes human actions from a video stream. The range of variations in human actions makes them difficult to recognize with considerable accuracy. This paper presents a novel deep neural network architecture called Attention RB-Net for HAR using video frames. The input is provided to the model in the form of video frames. The proposed deep architecture is based on a unique structuring of residual blocks with several filter sizes. Features are extracted from each frame via several operations with specific parameters defined in the presented attention-based residual bottleneck (Attention-RB) DCNN architecture. A fully connected layer receives an attention-based feature matrix, and final classification is performed. Several hyperparameters of the proposed model are initialized using Bayesian optimization (BO) and later utilized in the trained model for testing. In testing, features are extracted from the self-attention layer and passed to neural network classifiers for the final action classification. Two highly cited datasets, HMDB51 and UCF101, were used to validate the proposed architecture, with average accuracies of 87.70% and 97.30%, respectively. The deep convolutional neural network (DCNN) architecture compares favorably with state-of-the-art (SOTA) methods, including pre-trained models, inside blocks, and recently published techniques.
Keywords: human action recognition; self-attention; video streams; residual bottleneck; classification; neural networks
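A residual bottleneck gated by squeeze-style attention, loosely in the spirit of the Attention-RB block described above; filter sizes and the gating placement are assumptions:

```python
# Illustrative attention-gated residual bottleneck: a 1x1-3x3-1x1 body
# whose output is reweighted by pooled channel attention before the
# residual addition.
import torch
import torch.nn as nn

class AttentionResBottleneck(nn.Module):
    def __init__(self, ch, squeeze=4):
        super().__init__()
        mid = ch // squeeze
        self.body = nn.Sequential(
            nn.Conv2d(ch, mid, 1), nn.ReLU(),
            nn.Conv2d(mid, mid, 3, padding=1), nn.ReLU(),
            nn.Conv2d(mid, ch, 1))
        self.attn = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(ch, ch, 1), nn.Sigmoid())

    def forward(self, x):
        y = self.body(x)
        return torch.relu(x + y * self.attn(y))   # gated residual connection

out = AttentionResBottleneck(64)(torch.randn(2, 64, 28, 28))  # (2, 64, 28, 28)
```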
14. Video Action Recognition Method Based on Personalized Federated Learning and Spatiotemporal Features
Authors: Rongsen Wu, Jie Xu, Yuhang Zhang, Changming Zhao, Yiweng Xie, Zelei Wu, Yunji Li, Jinhong Guo, Shiyang Tang. Computers, Materials & Continua, 2025, Issue 6, pp. 4961-4978.
With the rapid development of artificial intelligence and Internet of Things technologies, video action recognition is widely applied in scenarios ranging from personal life to industrial production. However, while enjoying the convenience brought by this technology, it is crucial to effectively protect the privacy of users' video data. Therefore, this paper proposes a video action recognition method based on personalized federated learning and spatiotemporal features. Under the federated learning framework, a video action recognition method leveraging spatiotemporal features is designed. For the local spatiotemporal features of the video, a new differential information extraction scheme is proposed to extract differential features centered on a single RGB frame, and a spatial-temporal module based on local information is designed to improve the effectiveness of local feature extraction. For the global temporal features, a method of extracting action rhythm features using differential techniques is proposed, and a time module based on global information is designed; different translational strides are used in the module to obtain bidirectional differential features under different action rhythms. Additionally, to address user data privacy, the method divides model parameters into local private parameters and public parameters based on the structure of the video action recognition model. This approach enhances model training performance while ensuring the security of video data. The experimental results show that, under personalized federated learning conditions, an average accuracy of 97.792% was achieved on the UCF-101 dataset with non-independent and identically distributed (non-IID) data. This research provides technical support for privacy protection in video action recognition.
Keywords: video action recognition; personalized federated learning; spatiotemporal features; data privacy
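The private/public parameter split is the part of this method that is easy to show generically; the sketch below assumes a name-prefix rule for deciding which parameters stay local, which is an illustration rather than the paper's scheme:

```python
# Sketch of a private/public parameter split for personalized federated
# learning: only "public" parameters are shared with the server and
# averaged; "private" ones never leave the client.
import torch
import torch.nn as nn

def split_state(model, private_prefixes=("head.",)):
    """Return (public, private) state dicts by parameter-name prefix."""
    public, private = {}, {}
    for name, tensor in model.state_dict().items():
        bucket = private if name.startswith(private_prefixes) else public
        bucket[name] = tensor
    return public, private

def fed_avg(public_states):
    """Average the clients' public parameters (a simple FedAvg step)."""
    return {k: torch.stack([s[k].float() for s in public_states]).mean(0)
            for k in public_states[0]}

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
# e.g., treat the final layer ("2.") as private, the rest as public:
public, private = split_state(model, private_prefixes=("2.",))
merged = fed_avg([public, public])      # server-side aggregation of clients
```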
15. Dual-branch spatial-temporal decoupled fusion transformer for safety action recognition in smart grid substation
Authors: HAO Yu, ZHENG Hao, WANG Tongwen, WANG Yu, SUN Wei, ZHANG Shujuan. Optoelectronics Letters, 2025, Issue 8, pp. 507-512.
Smart grid substation operations often take place in hazardous environments and pose significant threats to the safety of power personnel. Relying solely on manual supervision can lead to inadequate oversight. In response to the demand for technology that identifies improper operations in substation work scenarios, this paper proposes a substation safety action recognition technology to avoid misoperation and enhance safety management. The paper utilizes a dual-branch transformer network to extract spatial and temporal information from a video dataset of operational behaviors in complex substation environments. First, to capture the spatial-temporal correlation of personnel behaviors in smart grid substations, we devise a sparse attention module and a segmented linear attention module that are embedded into the spatial-branch and temporal-branch transformers, respectively. To avoid redundancy between spatial and temporal information, we fuse the temporal and spatial features with a tensor decomposition fusion module in a decoupled manner. Experimental results indicate that the proposed method accurately detects improper operational behaviors in substation work scenarios, outperforming other existing methods in detection and recognition accuracy.
Keywords: substation safety action recognition; dual-branch transformer; spatial-temporal features; decoupled fusion; safety management
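Decoupled dual-branch processing, one encoder attending over joints within a frame and another over frames, can be sketched as follows; the paper's sparse and segmented linear attention variants and its tensor-decomposition fusion are not modeled here:

```python
# Hedged sketch of a decoupled dual-branch transformer: a spatial encoder
# attends across tokens within each frame, a temporal encoder attends
# across frames, and the two summaries are concatenated for the head.
import torch
import torch.nn as nn

class DualBranchST(nn.Module):
    def __init__(self, dim=32, classes=10):
        super().__init__()
        enc = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), 1)
        self.spatial, self.temporal = enc(), enc()
        self.head = nn.Linear(2 * dim, classes)

    def forward(self, x):               # x: (batch, frames, tokens, dim)
        b, t, v, c = x.shape
        s = self.spatial(x.reshape(b * t, v, c)).mean(1).reshape(b, t, c).mean(1)
        tm = self.temporal(x.mean(2)).mean(1)       # attend across frames
        return self.head(torch.cat([s, tm], dim=-1))

logits = DualBranchST()(torch.randn(2, 16, 25, 32))   # (2, 10)
```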
16. Action recognition using a hierarchy of feature groups
Authors: 周同驰, 程旭, 李拟珺, 徐勤军, 周琳, 吴镇扬. Journal of Southeast University (English Edition) (EI, CAS), 2015, Issue 3, pp. 327-332.
To improve the recognition performance of video human actions, an approach that models video actions hierarchically is proposed. This hierarchical model summarizes the action contents over different spatio-temporal domains according to the properties of human body movement. First, the temporal gradient combined with the constraint of a coherent motion pattern is utilized to extract stable and dense motion features that are viewed as point features; then the mean-shift clustering algorithm with an adaptive scale kernel is used to label these features. After pooling the features with the same label to generate a part-based representation, the visual word responses within one large-scale volume are collected as the video object representation. On the benchmark KTH (Kungliga Tekniska Högskolan) and UCF (University of Central Florida) Sports action datasets, the experimental results show that the proposed method enhances the representative and discriminative power of action features and improves recognition rates. Compared with other related work, the proposed method obtains superior performance.
Keywords: action recognition; coherent motion pattern; feature groups; part-based representation
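The grouping step, mean-shift labeling of dense motion point features followed by part-based pooling, can be sketched with scikit-learn; the motion descriptors here are mocked with random data:

```python
# Sketch of the feature-grouping step: point features are labeled by
# mean-shift clustering on their locations, then descriptors sharing a
# label are pooled into part-level descriptors.
import numpy as np
from sklearn.cluster import MeanShift

rng = np.random.default_rng(0)
points = rng.normal(size=(200, 2)) + rng.choice([0, 8], size=(200, 1))
descs = rng.normal(size=(200, 16))          # stand-in motion descriptors

labels = MeanShift(bandwidth=2.0).fit_predict(points)   # group by location
parts = np.stack([descs[labels == k].mean(axis=0)       # part-based pooling
                  for k in np.unique(labels)])
print(parts.shape)                          # (num_parts, 16)
```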
17. A Novel Human Action Recognition Algorithm Based on Decision Level Multi-Feature Fusion (Cited: 4)
Authors: SONG Wei, LIU Ningning, YANG Guosheng, YANG Pei. China Communications (SCIE, CSCD), 2015, Issue S2, pp. 93-102.
To take advantage of the logical structure of video sequences and improve the recognition accuracy of human actions, a novel hybrid human action detection method based on three descriptors and decision-level fusion is proposed. First, the minimal 3D space region of the human action is detected by combining the frame-difference method and the ViBe algorithm, and the three-dimensional histogram of oriented gradients (HOG3D) is extracted. At the same time, global descriptors based on frequency-domain filtering (FDF) and local descriptors based on spatio-temporal interest points (STIP) are extracted. Principal component analysis (PCA) is applied to reduce the dimension of the gradient histogram and the global descriptor, and a bag-of-words (BoW) model is applied to describe the local STIP descriptors. Finally, a linear support vector machine (SVM) is used to create a new decision-level fusion classifier. Experiments verify the performance of the multiple features, and the results show that they have good representation and generalization ability. Moreover, the proposed scheme obtains very competitive results on well-known datasets in terms of mean average precision.
Keywords: human action recognition; feature fusion; HOG3D
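Decision-level fusion with linear SVMs can be shown end to end on mock descriptors; the three feature sets below stand in for HOG3D, FDF, and the STIP bag-of-words, so this is an illustration of the fusion scheme, not the paper's pipeline:

```python
# Sketch of decision-level fusion: three per-descriptor linear SVMs are
# trained separately and their decision scores are fused by a final SVM.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=120)                    # mock binary action labels
feats = [rng.normal(size=(120, d)) + y[:, None] for d in (64, 32, 128)]

base = [LinearSVC().fit(f, y) for f in feats]       # per-descriptor SVMs
scores = np.column_stack([m.decision_function(f)    # decision-level inputs
                          for m, f in zip(base, feats)])
fusion = LinearSVC().fit(scores, y)                 # fused final classifier
print(fusion.score(scores, y))
```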
18. Multiple Feature Fusion in Convolutional Neural Networks for Action Recognition (Cited: 5)
Authors: LI Hongyang, CHEN Jun, HU Ruimin. Wuhan University Journal of Natural Sciences (CAS, CSCD), 2017, Issue 1, pp. 73-78.
Action recognition is important for understanding human behaviors in video, and the video representation is the basis for action recognition. This paper provides a new video representation based on convolutional neural networks (CNNs). To capture human motion information in one CNN, we take both optical flow maps and gray images as input and combine multiple convolutional features by max pooling across frames. In another CNN, we input a single color frame to capture context information. Finally, we take the top fully connected layer vectors as the video representation and train classifiers with a linear support vector machine. The experimental results show that the representation integrating optical flow maps and gray images is more discriminative than representations that depend on only one element. On the most challenging datasets, HMDB51 and UCF101, this video representation obtains competitive performance.
Keywords: action recognition; video; deep-learned representation; convolutional neural network; feature fusion
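The cross-frame max-pooling idea is easy to sketch: per-frame convolutional features from stacked flow and gray inputs are pooled over time before classification; channel counts here are assumptions:

```python
# Hedged sketch of the motion stream: per-frame features from stacked
# flow-x/flow-y/gray inputs are combined by max pooling across frames.
import torch
import torch.nn as nn

class MotionStream(nn.Module):
    def __init__(self, classes=101):
        super().__init__()
        # 3 input channels per frame: flow-x, flow-y, gray image.
        self.conv = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1),
                                  nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(32, classes)

    def forward(self, clip):             # clip: (batch, frames, 3, H, W)
        b, t = clip.shape[:2]
        f = self.conv(clip.flatten(0, 1)).flatten(1).reshape(b, t, -1)
        return self.fc(f.max(dim=1).values)   # max pooling across frames

logits = MotionStream()(torch.randn(2, 10, 3, 64, 64))   # (2, 101)
```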
19. Study of Human Action Recognition Based on Improved Spatio-temporal Features (Cited: 7)
Authors: Xiao-Fei Ji, Qian-Qian Wu, Zhao-Jie Ju, Yang-Yang Wang. International Journal of Automation and Computing (EI, CSCD), 2014, Issue 5, pp. 500-509.
Most existing action recognition methods mainly utilize spatio-temporal descriptors of single interest points while ignoring their potential integral information, such as spatial distribution information. By combining local spatio-temporal features and the global positional distribution information (PDI) of interest points, a novel motion descriptor is proposed in this paper. The proposed method detects interest points with an improved interest point detection method. Then, 3-dimensional scale-invariant feature transform (3D SIFT) descriptors are extracted for every interest point. To obtain a compact description and efficient computation, principal component analysis (PCA) is applied twice, to the 3D SIFT descriptors of single frames and of multiple frames. Simultaneously, the PDI of the interest points is computed and combined with the above features. The combined features are quantified, selected, and finally tested with the support vector machine (SVM) recognition algorithm on the public KTH dataset. The testing results show that the recognition rate is significantly improved and that the proposed features describe human motion more accurately, with high adaptability to scenarios.
Keywords: action recognition; spatio-temporal interest points; 3-dimensional scale-invariant feature transform (3D SIFT); positional distribution information; dimension reduction
20. Structural iMoSIFT for Human Action Recognition (Cited: 2)
Authors: CHEN Huafeng, CHEN Jun, HU Ruimin. Wuhan University Journal of Natural Sciences (CAS, CSCD), 2016, Issue 3, pp. 262-266.
Classic local space-time features are successful representations for action recognition in videos. However, these features often confuse object motions with camera motions, which seriously affects the accuracy of action recognition. In this paper, we propose the improved motion scale-invariant feature transform (iMoSIFT) algorithm to eliminate the negative effects caused by camera motions. Based on iMoSIFT, we consider the spatial-temporal structural relationships among iMoSIFT interest points and adopt locally weighted word context descriptors to encode these relationships. Then, we use a two-layer BoW representation for every video clip. The proposed approach is evaluated on the Weizmann, KTH, and UCF Sports datasets. The experimental results clearly demonstrate the effectiveness of the proposed approach.
Keywords: action recognition; iMoSIFT; locally weighted word context; PCA