Funding: This work was supported by the Key Lab of Intelligent and Green Flexographic Printing under Grant ZBKT202301.
Abstract: Current spatio-temporal action detection methods lack sufficient capability to extract and comprehend spatio-temporal information. This paper introduces an end-to-end Adaptive Cross-Scale Fusion Encoder-Decoder (ACSF-ED) network that predicts actions and locates objects efficiently. In the Adaptive Cross-Scale Fusion Spatio-Temporal Encoder (ACSF ST-Encoder), an Asymptotic Cross-scale Feature-fusion Module (ACCFM) is designed to address the information degradation caused by propagating high-level semantic information, thereby extracting high-quality multi-scale features for subsequent spatio-temporal modeling. Within the Shared-Head Decoder structure, a shared classification and regression detection head is constructed. A multi-constraint loss function composed of one-to-one, one-to-many, and contrastive denoising losses is designed to address the insufficient constraints that traditional methods place on predictions; it improves the accuracy of classification predictions and brings regression position predictions closer to ground-truth objects. The proposed model is evaluated on the popular UCF101-24 and JHMDB-21 datasets. Experimental results demonstrate that the proposed method achieves 81.52% on the Frame-mAP metric, surpassing existing methods.
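The multi-constraint loss described above can be pictured as a weighted sum of its three terms. The following is a minimal sketch, not the paper's implementation: the weights `w_o2m` and `w_cdn` and the function name are illustrative assumptions.

```python
# Hedged sketch: combine one-to-one matching, one-to-many matching, and
# contrastive-denoising loss terms into a single training objective.
# The weights below are assumptions, not values from the paper.
def multi_constraint_loss(loss_o2o, loss_o2m, loss_cdn, w_o2m=1.0, w_cdn=1.0):
    return loss_o2o + w_o2m * loss_o2m + w_cdn * loss_cdn

# Example: three already-computed scalar loss terms for one batch.
total = multi_constraint_loss(0.8, 1.2, 0.5)  # 0.8 + 1.2 + 0.5 = 2.5
```

In practice each term would itself be a differentiable tensor produced by the matcher and denoising branches; the sketch only shows how the constraints are aggregated.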
Funding: Supported by the Philosophy and Social Sciences Planning Project of Guangdong Province of China (GD23XGL099), the Guangdong General Universities Young Innovative Talents Project (2023KQNCX247), and the Research Project of Shanwei Institute of Technology (SWKT22-019).
Abstract: Laboratory safety is a critical area of broad societal concern, particularly in the detection of abnormal actions. To enhance the efficiency and accuracy of detecting such actions, this paper introduces a novel method called TubeRAPT (Tubelet Transformer based on Adapter and Prefix Training Module). This method comprises three key components: the TubeR network, an adaptive clustering attention mechanism, and a prefix training module. These components work in synergy to address the challenge of preserving knowledge in models pretrained on large datasets while maintaining training efficiency. The TubeR network serves as the backbone for spatio-temporal feature extraction, while the adaptive clustering attention mechanism refines the focus on relevant information. The prefix training module facilitates efficient fine-tuning and knowledge transfer. Experimental results demonstrate the effectiveness of TubeRAPT, which achieves 68.44% mean Average Precision (mAP) on the small-scale CLA (Crazy Lab Activity) dataset, a significant improvement of 1.53% over the previous TubeR method. This research not only showcases potential applications of TubeRAPT in abnormal action detection but also offers innovative ideas and technical support for the future development of laboratory safety monitoring technologies. The proposed method has implications for improving safety management systems in various laboratory environments, potentially reducing accidents and enhancing overall workplace safety.
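The prefix-training idea mentioned above keeps the pretrained backbone frozen and learns only a small set of prefix vectors prepended to the token sequence. The sketch below illustrates that mechanism in generic numpy terms; the shapes, names, and initialization are assumptions, not TubeRAPT's actual code.

```python
import numpy as np

# Hedged sketch of prefix training: the backbone's token features are treated
# as frozen, and only the prepended prefix vectors would receive gradients.
def prepend_prefix(tokens, prefix):
    """tokens: (seq_len, d) frozen features; prefix: (num_prefix, d) trainable."""
    return np.concatenate([prefix, tokens], axis=0)

rng = np.random.default_rng(0)
tokens = np.zeros((16, 8))        # stand-in for frozen backbone outputs
prefix = rng.standard_normal((4, 8))  # the only "trainable" parameters here
extended = prepend_prefix(tokens, prefix)
```

Because only the prefix (and typically small adapter layers) is updated, fine-tuning touches a tiny fraction of the parameters, which is what preserves the pretrained knowledge while keeping training cheap.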
Funding: Our research has been supported in part by the National Natural Science Foundation of China under Grants 61673261 and 61703273. We gratefully acknowledge the support from some companies.
Abstract: Most intelligent surveillance systems in industry care only about the safety of the workers. It is useful if the camera can know what, where, and how a worker has performed an action in real time. In this paper, we propose a lightweight and robust algorithm to meet these requirements. Using only the trajectories of the two hands, our algorithm requires no Graphics Processing Unit (GPU) acceleration and can run on low-cost devices. In the training stage, spectral clustering with the eigengap heuristic is applied to cluster trajectory points in order to find potential topological structures of the training trajectories. A gradient-descent-based algorithm is proposed to find the topological structures, which reflect the main representations of each cluster. In the fine-tuning stage, a topological optimization algorithm is proposed to fine-tune the parameters of the topological structures over all training data. Our method not only performs more robustly than some popular offline action detection methods but also obtains better detection accuracy on extended action sequences.
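The eigengap heuristic used above picks the number of clusters from the spectrum of the graph Laplacian: the count of eigenvalues before the largest gap. Here is an illustrative numpy sketch of that heuristic (not the authors' implementation); the similarity matrix and `k_max` bound are assumptions.

```python
import numpy as np

# Illustrative eigengap heuristic for spectral clustering: build the
# unnormalized graph Laplacian L = D - W, sort its eigenvalues, and choose
# k as the position of the largest gap between consecutive eigenvalues.
def eigengap_num_clusters(similarity, k_max=4):
    degree = similarity.sum(axis=1)
    laplacian = np.diag(degree) - similarity
    eigvals = np.linalg.eigvalsh(laplacian)[:k_max + 1]  # ascending order
    gaps = np.diff(eigvals)
    return int(np.argmax(gaps)) + 1  # number of eigenvalues before the gap

# Two disconnected groups of trajectory points: the heuristic should pick k = 2.
block = np.ones((3, 3))
similarity = np.block([[block, np.zeros((3, 3))],
                       [np.zeros((3, 3)), block]])
k = eigengap_num_clusters(similarity)  # k == 2 for this toy graph
```

For a graph with c well-separated components, L has exactly c near-zero eigenvalues, so the largest gap falls right after them, which is why the heuristic recovers the cluster count.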
Funding: This work was supported by the National Key Research and Development Program, China (2020YFB1708400), and the National Defense Fundamental Research Program, China (JCKY2020210B006, JCKY2017204B053), awarded to TL.
Abstract: Background: Intelligent monitoring of human action in production is an important step toward standardizing production processes and rapidly constructing a digital twin shop-floor. Human action has a significant impact on the production safety and efficiency of a shop-floor; however, because of the high individual initiative of humans, real-time action detection in a digital twin shop-floor is difficult to realize. Methods: We propose a real-time detection approach for shop-floor production actions. The approach takes continuous human skeleton joint sequences as input. We reconstructed the Joint Classification-Regression Recurrent Neural Network (JCR-RNN) based on a Temporal Convolution Network (TCN) and a Graph Convolution Network (GCN), and call the result the Temporal Action Detection Net (TAD-Net), which realizes real-time shop-floor production action detection. Results: Verification experiments showed that our approach achieves a high temporal positioning score, recognition speed, and accuracy on the existing Online Action Detection (OAD) dataset and the Nanjing University of Science and Technology 3 Dimensions (NJUST3D) dataset. TAD-Net can meet the actual needs of the digital twin shop-floor. Conclusions: Our method has higher recognition accuracy, higher temporal positioning accuracy, and faster running speed than other mainstream network models; it better meets practical application requirements and has research value and practical significance for standardizing shop-floor production processes, reducing production safety risks, and contributing to the understanding of real-time production actions.
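The GCN component above propagates features along the skeleton graph. As a hedged illustration, the following sketch applies one graph-convolution step with the common symmetric normalization D^-1/2 (A + I) D^-1/2 X W; this is the generic scheme, not necessarily TAD-Net's exact formulation, and the 3-joint chain is a toy assumption.

```python
import numpy as np

# One GCN propagation step over a skeleton-joint graph: add self-loops,
# symmetrically normalize the adjacency, mix neighbor features, apply ReLU.
def gcn_layer(adj, feats, weight):
    a_hat = adj + np.eye(adj.shape[0])                      # A + I (self-loops)
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]  # D^-1/2 A_hat D^-1/2
    return np.maximum(norm @ feats @ weight, 0.0)           # ReLU activation

rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)                    # toy 3-joint chain
out = gcn_layer(adj, rng.standard_normal((3, 4)), rng.standard_normal((4, 2)))
```

Stacking such layers lets each joint's feature incorporate information from progressively larger skeletal neighborhoods, which the temporal convolutions then track over time.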
Funding: Supported by the National Educational Science 13th Five-Year Plan Project (JYKYB2019012), the Basic Research Fund for the Engineering University of PAP (WJY201907), and the Basic Research Fund of the Engineering University of PAP (WJY202120).
Abstract: Action recognition and detection is an important research topic in computer vision. At present, the distinction between action recognition and action detection is not clear, and the relevant reviews are not comprehensive. This paper therefore surveys deep-learning-based action recognition and detection methods and datasets to accurately present the research status of the field. Firstly, according to how temporal and spatial features are extracted, the commonly used action recognition models are divided by architecture into two-stream models, temporal models, spatiotemporal models, and Transformer models; the paper briefly analyzes the characteristics of these four model families and reports the accuracy of various algorithms on common datasets. Then, from the perspective of the task to be completed, action detection is further divided into temporal action detection and spatiotemporal action detection, and commonly used datasets are introduced. Various temporal action detection algorithms are reviewed from the perspectives of two-stage and one-stage methods, and spatiotemporal action detection algorithms are summarized in detail. Finally, the relationships between the different parts of action recognition and detection are discussed, the difficulties faced by current research are summarized in detail, and future development directions are discussed.
Funding: Supported by the Shaanxi Province Key Research and Development Project (2021GY-280) and the National Natural Science Foundation of China (Nos. 61834005, 61772417, and 61802304).
Abstract: Micro-expressions are spontaneous, unconscious movements that reveal true emotions. Accurate facial movement information and network training methods are crucial for micro-expression recognition. However, most existing micro-expression recognition technologies focus on modeling a single category of micro-expression images and the neural network structure. To address the problems of low recognition rate and weak model generalization in micro-expression recognition, a micro-expression recognition algorithm based on a graph convolution network (GCN) and a Transformer model is proposed. Firstly, action unit (AU) features are detected, and facial muscle nodes in the neighborhood are divided into three subsets for recognition. Then, graph convolution layers are used to learn the layout of dependencies between AU nodes for micro-expression classification. Finally, the multiple attentional features of each facial action are enriched with a Transformer model to include more sequence information before the overall correlation of each region is calculated. The proposed method is validated on the CASME II and CAS(ME)^2 datasets, reaching a recognition rate of 69.85%.
Funding: Supported by the Science and Technology Development Fund of Macao (No. 0035/2023/ITP1), the National Natural Science Foundation of China (Nos. U1836220 and 61672267), the Postgraduate Research & Practice Innovation Program of Jiangsu Province (No. KYCX19_1616), the Qing Lan Talent Program of Jiangsu Province, and the Jiangsu Province Key Research and Development Plan (Industry Foresight and Key Core Technology) Competitive Project (No. BE2020036).
Abstract: Micro-Expression Recognition (MER) is a challenging task, as subtle changes occur over different action regions of a face. Changes in facial action regions form Action Units (AUs), and AUs in micro-expressions can be seen as the actors in cooperative group activities. In this paper, we propose a novel deep neural network model for objective class-based MER that simultaneously detects AUs and aggregates AU-level features into a micro-expression-level representation through Graph Convolutional Networks (GCN). Specifically, we propose two new strategies in our AU detection module for more effective AU feature learning: an attention mechanism and a balanced detection loss function. With these two strategies, features are learned for all AUs in a unified model, eliminating the error-prone landmark detection process and the tedious separate training for each AU. Moreover, our model incorporates a tailored objective class-based AU knowledge graph, which facilitates the GCN in aggregating AU-level features into a micro-expression-level feature representation. Extensive experiments on two tasks in MEGC 2018 show that our approach outperforms the current state-of-the-art methods in MER. Additionally, we report our single-model micro-expression AU detection results.
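The attention mechanism in the AU detection module above weighs AU-level features by relevance. As a generic hedged sketch (not the paper's module), the following shows a single scaled dot-product attention step over a set of AU feature vectors; all dimensions and names are assumptions.

```python
import numpy as np

# Generic scaled dot-product attention over AU-level feature vectors:
# each output row is a relevance-weighted mixture of the value rows.
def attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((5, 8))   # 5 hypothetical AU query vectors
k = rng.standard_normal((5, 8))
v = rng.standard_normal((5, 8))
out = attention(q, k, v)
```

In an AU-graph setting the resulting attended features would then be aggregated by the GCN into the micro-expression-level representation the abstract describes.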