Funding: Supported by the National Natural Science Foundation of China (No. 62277010), the Fuzhou-Xiamen-Quanzhou National Independent Innovation Demonstration Zone Collaborative Innovation Platform Project (No. 2022FX6), and the Fujian Provincial Health Commission Technology Plan Project (No. 2021CXA001).
Abstract: Action segmentation has made significant progress, but segmenting and recognizing actions in untrimmed long videos remains a challenging problem. Most state-of-the-art methods focus on designing models based on temporal convolution. However, the difficulty of modeling long-term temporal dependencies and the inflexibility of temporal convolutions limit the potential of these models. To address the over-segmentation issue in existing action segmentation methods, which leads to classification errors and reduced segmentation quality, this paper proposes an action segmentation method based on a global spatial-temporal information encoder-decoder. The proposed method uses the global temporal information captured by a refinement layer to help the Encoder-Decoder (ED) structure locate action segmentation points more accurately and, at the same time, suppress the over-segmentation caused by the ED structure. The method achieves 93% frame accuracy on a real Tai Chi action dataset constructed by the authors. The experimental results show that the method can accurately and efficiently perform long-video action segmentation.
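To make the encoder-decoder-plus-refinement idea concrete, the following is a minimal PyTorch sketch of a temporal encoder-decoder whose frame-wise predictions are smoothed by a dilated-convolution refinement stage; the layer sizes, pooling factors, and dilation rates are illustrative assumptions, not the architecture reported in the paper.

```python
# Minimal sketch: temporal encoder-decoder with a refinement stage.
# All hyperparameters below are illustrative assumptions, not the paper's design.
import torch
import torch.nn as nn

class EncoderDecoderSegmenter(nn.Module):
    def __init__(self, in_dim, hidden, num_classes):
        super().__init__()
        # Encoder: temporal convolutions with pooling to enlarge the receptive field.
        self.encoder = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),
        )
        # Decoder: upsample back to the original temporal resolution.
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv1d(hidden, num_classes, kernel_size=3, padding=1),
        )
        # Refinement stage: dilated convolutions over the initial frame-wise
        # predictions inject longer-range temporal context and smooth
        # over-segmented boundaries.
        self.refine = nn.Sequential(
            nn.Conv1d(num_classes, hidden, kernel_size=3, padding=4, dilation=4), nn.ReLU(),
            nn.Conv1d(hidden, num_classes, kernel_size=1),
        )

    def forward(self, x):
        # x: (batch, in_dim, T) frame features, T divisible by 4.
        logits = self.decoder(self.encoder(x))          # initial frame-wise predictions
        refined = self.refine(torch.softmax(logits, dim=1))
        return logits, refined
```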
Abstract: Although Human-Robot Collaboration Assembly (HRCA) is gradually transforming traditional manufacturing, challenges remain in a robot's ability to understand and predict human assembly intentions. This study aims to enhance the robot's comprehension and prediction of operator assembly intentions by capturing and analyzing operator behavior and movements. We propose a video feature extraction method based on the Temporal Shift Module network (TSM-ResNet50) to extract spatiotemporal features from assembly videos and distinguish different assembly actions using feature differences between video frames. Furthermore, we construct an action recognition and segmentation model based on the Refined Multi-Scale Temporal Convolutional Network (Refined-MS-TCN) to identify assembly action intervals and accurately obtain action categories. Experiments on our self-built reducer assembly action dataset show that the network can classify assembly actions frame by frame with an accuracy of 83%. Additionally, we develop a Hidden Markov Model (HMM) integrated with assembly task constraints, which predicts operator assembly intentions from the probability transition matrix and the task constraints. The experimental results show that our method achieves an intention prediction accuracy of 90.6%, a 13.3% improvement over the HMM without task constraints.
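As an illustration of how assembly task constraints can be injected into an HMM, the sketch below masks a transition matrix with a feasibility matrix derived from task precedence and then decodes the most likely action sequence with Viterbi over frame-wise class probabilities; the function names, matrices, and probabilities are hypothetical placeholders, not the paper's implementation.

```python
# Illustrative sketch: constraining HMM transitions with task feasibility, then Viterbi decoding.
# All values below are made-up placeholders, not the paper's parameters.
import numpy as np

def constrain_transitions(A, feasible):
    """Zero out transitions forbidden by the assembly task constraints and renormalize rows."""
    A = A * feasible
    return A / A.sum(axis=1, keepdims=True)

def viterbi(obs_prob, A, pi):
    """obs_prob: (T, K) frame-wise class probabilities from the segmentation network."""
    T, K = obs_prob.shape
    log_A, log_pi = np.log(A + 1e-12), np.log(pi + 1e-12)
    log_obs = np.log(obs_prob + 1e-12)
    delta = log_pi + log_obs[0]
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A          # (K, K): previous state x next state
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_obs[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Example with 3 hypothetical assembly actions, where action 2 may not follow action 0
# and action 1 may not follow action 2 (precedence constraints).
A = np.full((3, 3), 1.0 / 3)
feasible = np.array([[1, 1, 0],
                     [1, 1, 1],
                     [1, 0, 1]], dtype=float)
A = constrain_transitions(A, feasible)
```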