Previous multi-view 3D human pose estimation methods neither correlate different human joints within each view nor explicitly model learnable correlations between the same joints in different views, meaning that skeleton structure information is not utilized and multi-view pose information is not completely fused. Moreover, existing graph convolutional operations do not consider the specificity of different joints and different views of pose information when processing skeleton graphs, so the correlation weights between nodes in the graph and their neighborhood nodes are shared. Existing Graph Convolutional Networks (GCNs) cannot efficiently extract global and deep-level skeleton structure information and view correlations. To solve these problems, pre-estimated multi-view 2D poses are organized as a multi-view skeleton graph that explicitly fuses skeleton priors and view correlations to handle the occlusion problem, with the skeleton-edge and symmetry-edge representing the structure correlations between adjacent joints in each view of the skeleton graph and the view-edge representing the view correlations between the same joints in different views. To make the graph convolution operation mine detailed and sufficient skeleton structure information and view correlations, different correlation weights are assigned to different categories of neighborhood nodes and further assigned to each node in the graph. Based on this graph convolution operation, a Residual Graph Convolution (RGC) module is designed as the basic module and combined with a simplified Hourglass architecture to construct Hourglass-GCN as our 3D pose estimation network. Hourglass-GCN, with a symmetrical and concise architecture, processes three scales of multi-view skeleton graphs to efficiently extract local-to-global scale and shallow-to-deep level skeleton features. Experimental results on the common large-scale 3D pose datasets Human3.6M and MPI-INF-3DHP show that Hourglass-GCN outperforms some excellent methods in 3D pose estimation accuracy.
Advancements in animal behavior quantification methods have driven the development of computational ethology, enabling fully automated behavior analysis. Existing multi-animal pose estimation workflows rely on tracking-by-detection frameworks for either bottom-up or top-down approaches, requiring retraining to accommodate diverse animal appearances. This study introduces InteBOMB, an integrated workflow that enhances top-down approaches by incorporating generic object tracking, eliminating the need for prior knowledge of target animals while maintaining broad generalizability. InteBOMB includes two key strategies for tracking and segmentation in laboratory environments and two techniques for pose estimation in natural settings. The “background enhancement” strategy optimizes the foreground-background contrastive loss, generating more discriminative correlation maps. The “online proofreading” strategy stores human-in-the-loop long-term memory and dynamic short-term memory, enabling adaptive updates to object visual features. The “automated labeling suggestion” technique reuses the visual features saved during tracking to identify representative frames for training set labeling. Additionally, the “joint behavior analysis” technique integrates these features with multimodal data, expanding the latent space for behavior classification and clustering. To evaluate the framework, six datasets of mice and six datasets of nonhuman primates were compiled, covering laboratory and natural scenes. Benchmarking results demonstrated a 24% improvement in zero-shot generic tracking and a 21% enhancement in joint latent space performance across datasets, highlighting the effectiveness of this approach in robust, generalizable behavior analysis.
This paper presents a manifold-optimized Error-State Kalman Filter (ESKF) framework for unmanned aerial vehicle (UAV) pose estimation, integrating Inertial Measurement Unit (IMU) data with GPS or LiDAR to enhance estimation accuracy and robustness. We employ a manifold-based optimization approach, leveraging exponential and logarithmic mappings to transform rotation vectors into rotation matrices. The proposed ESKF framework ensures state variables remain near the origin, effectively mitigating singularity issues and enhancing numerical stability. Additionally, due to the small magnitude of state variables, second-order terms can be neglected, simplifying Jacobian matrix computation and improving computational efficiency. Furthermore, we introduce a novel Kalman filter gain computation strategy that dynamically adapts to low-dimensional and high-dimensional observation equations, enabling efficient processing across different sensor modalities. Specifically, for resource-constrained UAV platforms, this method significantly reduces computational cost, making it highly suitable for real-time UAV applications.
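The exponential and logarithmic mappings mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: it applies Rodrigues' formula in plain Python, the function names `so3_exp` and `so3_log` are hypothetical, and the logarithm makes no attempt to handle rotation angles close to π.

```python
import math

def skew(v):
    """Return the 3x3 skew-symmetric matrix [v]x of a 3-vector."""
    x, y, z = v
    return [[0.0, -z, y],
            [z, 0.0, -x],
            [-y, x, 0.0]]

def matmul(a, b):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def so3_exp(w):
    """Exponential map: rotation vector -> rotation matrix (Rodrigues)."""
    theta = math.sqrt(sum(c * c for c in w))
    K = skew(w)
    if theta < 1e-8:
        # Small-angle limits of sin(t)/t and (1 - cos(t))/t^2.
        a, b = 1.0, 0.5
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / (theta * theta)
    K2 = matmul(K, K)
    return [[(1.0 if i == j else 0.0) + a * K[i][j] + b * K2[i][j]
             for j in range(3)] for i in range(3)]

def so3_log(R):
    """Logarithmic map: rotation matrix -> rotation vector (angle < pi)."""
    cos_theta = (R[0][0] + R[1][1] + R[2][2] - 1.0) / 2.0
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding
    theta = math.acos(cos_theta)
    scale = 0.5 if theta < 1e-8 else theta / (2.0 * math.sin(theta))
    return [scale * (R[2][1] - R[1][2]),
            scale * (R[0][2] - R[2][0]),
            scale * (R[1][0] - R[0][1])]
```

For example, `so3_log(so3_exp([0.1, -0.2, 0.3]))` recovers the original rotation vector up to floating-point rounding. An error-state filter keeps such vectors small, which is why the small-angle branch matters in practice.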
Pose estimation of spacecraft targets is a key technology for space operation tasks such as the removal of failed satellites and the detection and scanning of non-cooperative targets. This paper reviews target pose estimation methods based on image feature extraction and PnP, target pose estimation methods based on registration, and spacecraft target pose estimation methods based on deep learning, and introduces the corresponding research methods.
Vision-based relative pose estimation plays a pivotal role in various space missions. Deep learning enhances monocular spacecraft pose estimation, but high computational demands necessitate model simplification for onboard systems. In this paper, we aim to achieve an optimal balance between accuracy and computational efficiency. We present a Perspective-n-Point (PnP) based method for spacecraft pose estimation, leveraging lightweight neural networks to localize semantic keypoints and reduce computational load. Since the accuracy of keypoint localization is closely related to the heatmap resolution, we devise an efficient upsampling module to increase the resolution of heatmaps with minimal overhead. Furthermore, the heatmaps predicted by the lightweight models tend to show high-level noise. To tackle this issue, we propose a weighting strategy based on the statistical characteristics of the predicted semantic keypoints, which substantially improves the pose estimation accuracy. The experiments carried out on the SPEED dataset underscore the prospect of our method in engineering applications. We dramatically reduce the model parameters to 0.7 M, merely 2.5% of that required by the top-performing method, and achieve lower pose estimation error and better real-time performance.
[Objective] Fish pose estimation (FPE) provides fish physiological information, facilitating health monitoring in aquaculture. It aids decision-making in areas such as fish behavior recognition. When fish are injured or deficient, they often display abnormal behaviors and noticeable changes in the positioning of their body parts. Moreover, the unpredictable posture and orientation of fish during swimming, combined with their rapid swimming speed, restrict the current scope of research in FPE. In this research, an FPE model named HPFPE is presented to capture the swimming posture of fish and accurately detect their key points. [Methods] On the one hand, this model incorporated the CBAM module into the HRNet framework. The attention module enhanced accuracy without adding computational complexity, while effectively capturing a broader range of contextual information. On the other hand, the model incorporated dilated convolution to increase the receptive field, allowing it to capture more spatial context. [Results and Discussions] Experiments showed that, compared with the baseline method, the average precision (AP) of HPFPE based on different backbones and input sizes on the Oplegnathus punctatus datasets increased by 0.62, 1.35, 1.76, and 1.28 percentage points, respectively, while the average recall (AR) also increased by 0.85, 1.50, 1.40, and 1.00, respectively. Additionally, HPFPE outperformed other mainstream methods, including DeepPose, CPM, SCNet, and Lite-HRNet. Furthermore, when compared to other methods using the ornamental fish data, HPFPE achieved the highest AP and AR values of 52.96% and 59.50%, respectively. [Conclusions] The proposed HPFPE can accurately estimate fish posture and assess their swimming patterns, serving as a valuable reference for applications such as fish behavior recognition.
Background: Quantifying the rich home-cage activities of tree shrews provides a reliable basis for understanding their daily routines and building disease models. However, due to the lack of effective behavioral methods, most efforts on tree shrew behavior are limited to simple measures, resulting in the loss of much behavioral information. Methods: To address this issue, we present a deep learning (DL) approach to achieve markerless pose estimation and recognize multiple spontaneous behaviors of tree shrews, including drinking, eating, resting, and staying in the dark house. Results: This high-throughput approach can monitor the home-cage activities of 16 tree shrews simultaneously over an extended period. Additionally, we demonstrated an innovative system with reliable apparatus, paradigms, and analysis methods for investigating food grasping behavior. The median duration for each bout of grasping was 0.20 s. Conclusion: This study provides an efficient tool for quantifying and understanding tree shrews' natural behaviors.
With the rapid progress of artificial intelligence (AI) technology and the mobile internet, 3D hand pose estimation has become critical to various intelligent application areas, e.g., human-computer interaction. To avoid the low accuracy of single-modal estimation and the high complexity of traditional multi-modal 3D estimation, this paper proposes a novel multi-modal multi-view (MMV) 3D hand pose estimation system, which introduces a registration before translation (RT)-translation before registration (TR) jointed conditional generative adversarial network (cGAN) to train a multi-modal registration network, and then employs multi-modal feature fusion to achieve high-quality estimation, with low hardware and software costs in both data acquisition and processing. Experimental results demonstrate that the MMV system is effective and feasible in various scenarios. It is promising for the MMV system to be used in broad intelligent application areas.
Human pose estimation has received much attention from the research community because of its wide range of applications. However, current pose estimation methods are usually complex and computationally intensive, and they suffer from feature loss in the feature fusion process. To address these problems, we propose a lightweight human pose estimation network based on a multi-attention mechanism (LMANet). In our method, network parameters are significantly reduced by lightweighting the bottleneck blocks of the high-resolution network with depth-wise separable convolution. We also introduce a multi-attention mechanism to improve the model prediction accuracy, and a channel attention module is added in the initial stage of the network to enhance the local cross-channel information interaction. More importantly, we inject a spatial cross-awareness module in the multi-scale feature fusion stage to reduce the spatial information loss during feature extraction. Extensive experiments on the COCO2017 and MPII datasets show that LMANet achieves higher prediction accuracy with fewer network parameters and less computation. Compared with the high-resolution network HRNet, the number of parameters and the computational complexity of the network are reduced by 67% and 73%, respectively.
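To make the parameter saving from depth-wise separable convolution concrete, the sketch below counts weights for a standard convolution and for its depth-wise separable replacement. The layer sizes are illustrative assumptions and are not taken from LMANet; the helper names are hypothetical and biases are ignored.

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depth-wise k x k conv followed by a 1 x 1 point-wise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise

# Illustrative sizes only (assumed, not from the paper).
k, c_in, c_out = 3, 64, 64
standard = conv_params(k, c_in, c_out)
separable = depthwise_separable_params(k, c_in, c_out)
print(standard, separable, round(separable / standard, 4))
# prints: 36864 4672 0.1267
```

For this assumed 3 x 3 layer with 64 input and 64 output channels, the separable form needs roughly one eighth of the weights. The overall 67% reduction reported in the abstract depends on the full network, which this sketch does not model.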
Due to self-occlusion and the high degree of freedom, estimating 3D hand pose from a single RGB image is a highly challenging problem. Graph convolutional networks (GCNs) use graphs to describe the physical connection relationships between hand joints and improve the accuracy of 3D hand pose regression. However, GCNs cannot effectively describe the relationships between non-adjacent hand joints. Recently, hypergraph convolutional networks (HGCNs) have received much attention as they can describe multi-dimensional relationships between nodes through hyperedges; therefore, this paper proposes a framework for 3D hand pose estimation based on HGCN, which can better extract correlated relationships between adjacent and non-adjacent hand joints. To overcome the shortcomings of predefined hypergraph structures, a dynamic hypergraph convolutional network is proposed, in which hyperedges are constructed dynamically based on hand joint feature similarity. To better explore the local semantic relationships between nodes, a semantic dynamic hypergraph convolution is proposed. The proposed method is evaluated on publicly available benchmark datasets. Qualitative and quantitative experimental results both show that the proposed HGCN and improved methods for 3D hand pose estimation are better than GCN, and achieve state-of-the-art performance compared with existing methods.
Virtual maintenance, as an important means of industrial training and education, places strict requirements on the accuracy of participant pose perception and assessment of motion standardization. However, existing research mainly focuses on human pose estimation in general scenarios, lacking specialized solutions for maintenance scenarios. This paper proposes a virtual maintenance human pose estimation method based on multi-scale feature enhancement (VMHPE), which integrates adaptive input feature enhancement, multi-scale feature correction for improved expression of fine movements and complex poses, and multi-scale feature fusion to enhance keypoint localization accuracy. Meanwhile, this study constructs the first virtual maintenance-specific human keypoint dataset (VMHKP), which records standard action sequences of professional maintenance personnel in five typical maintenance tasks and provides a reliable benchmark for evaluating operator motion standardization. The dataset is publicly available at. Using high-precision keypoint prediction results, an action assessment system utilizing topological structure similarity was established. Experiments show that our method achieves significant performance improvements: average precision (AP) reaches 94.4%, an increase of 2.3 percentage points over baseline methods; average recall (AR) reaches 95.6%, an increase of 1.3 percentage points. This research establishes a scientific four-level evaluation standard based on comparative motion analysis and provides a reliable solution for standardizing industrial maintenance training.
Real-time 6 Degree-of-Freedom (DoF) pose estimation is of paramount importance for various on-orbit tasks. Benefiting from the development of deep learning, the use of Convolutional Neural Networks (CNNs) for feature extraction has yielded impressive achievements in spacecraft pose estimation. To improve the robustness and interpretability of CNNs, this paper proposes a Pose Estimation approach based on Variational Auto-Encoder structure (PE-VAE) and a Feature-Aided pose estimation approach based on Variational Auto-Encoder structure (FA-VAE), which aim to accurately estimate the 6 DoF pose of a target spacecraft. Both methods treat the pose vector as latent variables, employing an encoder-decoder network with a Variational Auto-Encoder (VAE) structure. To enhance the precision of pose estimation, PE-VAE uses the VAE structure to introduce a reconstruction mechanism over the whole image. Furthermore, FA-VAE enforces feature shape constraints by exclusively reconstructing the segment of the target spacecraft with the desired shape. Comparative evaluation against leading methods on public datasets reveals similar accuracy with a threefold improvement in processing speed, showcasing the significant contribution of VAE structures to accuracy enhancement, and the additional benefit of incorporating global shape prior features.
Error or drift is frequently produced in pose estimation based on geometric "feature detection and tracking" monocular visual odometry (VO) when the speed of camera movement exceeds 1.5 m/s. In most VO methods based on deep learning, meanwhile, the weight factors take fixed values, which can easily lead to overfitting. A new measurement system for monocular visual odometry, named Deep Learning Visual Odometry (DLVO), is proposed based on neural networks. In this system, a Convolutional Neural Network (CNN) is used to extract features and perform feature matching. Moreover, a Recurrent Neural Network (RNN) is used for sequence modeling to estimate the camera's 6-DoF poses. Instead of fixed CNN weight values, a Bayesian distribution of weight factors is introduced in order to effectively solve the problem of network overfitting. The 18,726 frame images in the KITTI dataset are used to train the network. This system can increase the generalization ability of the network model in the prediction process. Compared with the original Recurrent Convolutional Neural Network (RCNN), our method reduces the loss of the test model by 5.33%. It also improves the robustness of translation and rotation estimates compared with traditional VO methods.
This paper introduces a new algorithm for estimating the relative pose of a moving camera using consecutive frames of a video sequence. State-of-the-art algorithms for calculating the relative pose between two images use matching features to estimate the essential matrix. The essential matrix is then decomposed into the relative rotation and normalized translation between frames. To be robust to noise and feature match outliers, these methods generate a large number of essential matrix hypotheses from randomly selected minimal subsets of feature pairs, and then score these hypotheses on all feature pairs. In contrast, the algorithm introduced in this paper calculates relative pose hypotheses by directly optimizing the rotation and normalized translation between frames, rather than calculating the essential matrix and then performing the decomposition. The resulting algorithm improves computation time by an order of magnitude. If an inertial measurement unit (IMU) is available, it is used to seed the optimizer; in addition, we reuse the best hypothesis at each iteration to seed the optimizer, thereby reducing the number of relative pose hypotheses that must be generated and scored. These advantages greatly speed up performance and enable the algorithm to run in real time on low-cost embedded hardware. We show an application of our algorithm to visual multi-target tracking (MTT) in the presence of parallax and demonstrate its real-time performance on a 640 × 480 video sequence captured on a UAV. Video results are available at https://youtu.be/HhK-p2hXNnU.
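The idea of scoring a relative pose hypothesis directly from a rotation and a normalized translation can be sketched as follows. This is an illustration under assumptions, not the paper's algorithm: the abstract does not state the cost function, so the sketch uses the algebraic epipolar residual x2ᵀ [t]× R x1, assumes the convention X2 = R X1 + t, and all function names are hypothetical.

```python
import math

def skew(t):
    """Skew-symmetric matrix [t]x so that [t]x v equals t cross v."""
    x, y, z = t
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]

def matvec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]

def matmul(a, b):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def epipolar_residual(R, t, x1, x2):
    """Algebraic residual x2^T [t]x R x1 for one normalized feature pair."""
    E = matmul(skew(t), R)  # essential matrix implied by the hypothesis
    Ex1 = matvec(E, x1)
    return sum(x2[i] * Ex1[i] for i in range(3))

def project(X):
    """Project a 3D point to normalized homogeneous image coordinates."""
    return [X[0] / X[2], X[1] / X[2], 1.0]

# Synthetic feature pair: rotate 0.1 rad about the y axis, translate along x.
c, s = math.cos(0.1), math.sin(0.1)
R_true = [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]
t_true = [1.0, 0.0, 0.0]  # unit direction; the scale is not observable
X1 = [0.5, -0.3, 4.0]
X2 = [a + b for a, b in zip(matvec(R_true, X1), t_true)]
x1, x2 = project(X1), project(X2)

good = epipolar_residual(R_true, t_true, x1, x2)          # close to zero
bad = epipolar_residual(R_true, [0.0, 1.0, 0.0], x1, x2)  # wrong direction
```

A hypothesis consistent with the motion gives a residual near zero for every inlier pair, while a wrong translation direction gives a clearly non-zero value. An optimizer over rotation and translation direction could minimize the sum of such residuals over the feature pairs.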
In this paper we present a CNN-based approach for real-time 3D hand pose estimation from depth sequences. Prior discriminative approaches have achieved remarkable success but face two main challenges. Firstly, the methods are fully supervised and hence require large amounts of annotated training data to extract the dynamic information from a hand representation. Secondly, they rely on unreliable hand detectors based on strong assumptions, or on weak detectors, which often fail in situations such as complex environments and multiple hands. In contrast to these methods, this paper presents an approach that can be considered semi-supervised: it performs predictive coding of image sequences of hand poses in order to capture latent features underlying a given image without supervision. The hand is modelled using a novel latent tree dependency model (LDTM), which transforms internal joint locations into an explicit representation. The modelled hand topology is then integrated with the pose estimator using a data-dependent method to jointly learn latent variables of the posterior pose appearance and the pose configuration, respectively. Finally, an unsupervised error term, which is part of the recurrent architecture, ensures smooth estimations of the final pose. Experiments on three challenging public datasets, ICVL, MSRA, and NYU, demonstrate that the performance of the proposed method is comparable to or better than state-of-the-art approaches.
Human pose estimation (HPE) is a procedure for determining the structure of the body pose and it is considered a challenging issue in the computer vision (CV) communities. HPE finds applications in several fields, such as activity recognition and human-computer interfaces. Despite the benefits of HPE, it is still a challenging process due to variations in visual appearance, lighting, occlusions, dimensionality, etc. To resolve these issues, this paper presents a squirrel search optimization with a deep convolutional neural network for HPE (SSDCNN-HPE) technique. The major intention of the SSDCNN-HPE technique is to identify the human pose accurately and efficiently. First, the video frame conversion process is performed and pre-processing takes place via a bilateral filtering-based noise removal process. Then, the EfficientNet model is applied to identify the body points of a person with no problem constraints. In addition, the hyperparameters of the EfficientNet model are tuned using the squirrel search algorithm (SSA). In the final stage, the multiclass support vector machine (M-SVM) technique is used for the identification and classification of human poses. The design of bilateral filtering followed by an SSA-based EfficientNet model for HPE constitutes the novelty of the work. To demonstrate the enhanced outcomes of the SSDCNN-HPE approach, a series of simulations are executed. The experimental results show that the SSDCNN-HPE system improves on recent existing techniques in terms of different measures.
In the new era of technology, daily human activities are becoming more challenging in terms of monitoring complex scenes and backgrounds. To understand the scenes and activities from human life logs, human-object interaction (HOI) is important in terms of visual relationship detection and human pose estimation. Activity understanding and interaction recognition between humans and objects, along with pose estimation and interaction modeling, have been explained. Some existing algorithms and feature extraction procedures are complicated, with difficulties including accurate detection of rare human postures, occluded regions, and unsatisfactory detection of objects, especially small-sized objects. The existing HOI detection techniques are instance-centric (object-based), where interaction is predicted between all the pairs. Such estimation depends on appearance features and spatial information. Therefore, we propose a novel approach to demonstrate that the appearance features alone are not sufficient to predict the HOI. Furthermore, we detect the human body parts by using the Gaussian Mixture Model (GMM), followed by object detection using YOLO. We predict the interaction points, which directly classify the interaction, and pair them with densely predicted HOI vectors by using the interaction algorithm. The interactions are linked with the human and object to predict the actions. The experiments have been performed on two benchmark HOI datasets, demonstrating the proposed approach.
Controlling multiple multi-joint fish-like robots has long captivated the attention of engineers and biologists, for which a fundamental but challenging topic is to robustly track the postures of the individuals in real time. This requires detecting multiple robots, estimating multi-joint postures, and tracking identities, as well as fast processing in real time. To the best of our knowledge, this challenge has not been tackled in previous studies. In this paper, to precisely track the planar postures of multiple swimming multi-joint fish-like robots in real time, we propose a novel deep neural network-based method, named TAB-IOL. Its TAB part fuses the top-down and bottom-up approaches for vision-based pose estimation, while the IOL part, with long short-term memory, considers the motion constraints among joints for precise pose tracking. The satisfactory performance of TAB-IOL is verified by testing on a group of freely swimming fish-like robots in various scenarios with strong disturbances and by an in-depth comparison of accuracy, speed, and robustness with most state-of-the-art algorithms. Further, based on the precise pose estimation and tracking realized by TAB-IOL, several formation control experiments are conducted for the group of fish-like robots. The results clearly demonstrate that TAB-IOL lays a solid foundation for the coordination control of multiple fish-like robots in a real working environment. We believe our proposed method will facilitate the growth and development of related fields.
3D human pose estimation is a major focus area in the field of computer vision, which plays an important role in practical applications. This article summarizes the framework and research progress related to estimation from monocular RGB images and videos. An overall perspective of methods integrated with deep learning is introduced. Novel image-based and video-based inputs are proposed as the analysis framework. From this viewpoint, common problems are discussed. The diversity of human postures usually leads to problems such as occlusion and ambiguity, and the lack of training datasets often results in poor generalization ability of the model. Regression methods are crucial for solving such problems. Considering image-based input, the multi-view method is commonly used to solve occlusion problems. Here, the multi-view method is analyzed comprehensively. For video-based input, human prior knowledge of restricted motion is used to predict human postures. In addition, structural constraints are widely used as prior knowledge. Furthermore, weakly supervised learning methods are studied and discussed for these two types of inputs to improve the model generalization ability. The problem of insufficient training datasets must also be considered, especially because 3D datasets are usually biased and limited. Finally, emerging and popular datasets and evaluation indicators are discussed. The characteristics of the datasets and the relationships of the indicators are explained and highlighted. Thus, this article can be useful and instructive for researchers who lack experience and find this field confusing. In addition, by providing an overview of 3D human pose estimation, this article sorts and refines recent studies on 3D human pose estimation. It describes core problems and common useful methods, and discusses the scope for further research.
Face image analysis is one of several important cues in computer vision. Over the last five decades, methods for face analysis have received immense attention due to large-scale applications in various face analysis tasks. Face parsing strongly benefits various human face image analysis tasks, including face pose estimation. In this paper we propose a 3D head pose estimation framework built on a prior end-to-end deep face parsing model. We have developed an end-to-end face parts segmentation framework through deep convolutional neural networks (DCNNs). For training a deep face parts parsing model, we label face images for seven different classes, including eyes, brows, nose, hair, mouth, skin, and background. We extract features from grayscale images by using DCNNs. We train a classifier using the extracted features. We use the probabilistic classification method to produce grayscale images in the form of probability maps for each dense semantic class. We use a next stage of DCNNs and extract features from the grayscale images created as probability maps during the segmentation phase. We assess the performance of our newly proposed model on four standard head pose datasets, including Pointing’04, Annotated Facial Landmarks in the Wild (AFLW), Boston University (BU), and ICT-3DHP, obtaining superior results compared to previous results.
Funding: Supported in part by the National Natural Science Foundation of China under Grants 61973065, U20A20197, and 61973063.
Abstract: Previous multi-view 3D human pose estimation methods neither correlate different human joints within each view nor explicitly model learnable correlations between the same joints in different views, so skeleton structure information is not used and multi-view pose information is not completely fused. Moreover, existing graph convolution operations do not consider the specificity of different joints and different views of pose information when processing skeleton graphs, so the correlation weights between nodes in the graph and their neighborhood nodes are shared. Existing Graph Convolutional Networks (GCNs) also cannot efficiently extract global and deep-level skeleton structure information and view correlations. To solve these problems, pre-estimated multi-view 2D poses are organized into a multi-view skeleton graph that explicitly fuses skeleton priors and view correlations to handle occlusion: skeleton-edges and symmetry-edges represent the structural correlations between adjacent joints in each view of the skeleton graph, and view-edges represent the view correlations between the same joints in different views. So that the graph convolution operation mines detailed and sufficient skeleton structure information and view correlations, different correlation weights are assigned to different categories of neighborhood nodes and further assigned to each node in the graph. Based on this graph convolution operation, a Residual Graph Convolution (RGC) module is designed as the basic module and combined with a simplified Hourglass architecture to construct Hourglass-GCN as our 3D pose estimation network. With a symmetrical and concise architecture, Hourglass-GCN processes three scales of multi-view skeleton graphs to efficiently extract local-to-global scale and shallow-to-deep level skeleton features. Experimental results on the common large 3D pose datasets Human3.6M and MPI-INF-3DHP show that Hourglass-GCN outperforms some excellent methods in 3D pose estimation accuracy.
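The three edge categories described in this abstract can be illustrated with a small graph-construction sketch. This is not the authors' implementation: the five-joint skeleton, the joint indices, the function name `build_multiview_graph`, and the node numbering (`view * num_joints + joint`) are all assumptions chosen for illustration.

```python
from itertools import combinations


def build_multiview_graph(num_views, num_joints, skeleton_pairs, symmetry_pairs):
    """Return {edge_type: set of undirected (u, v) pairs}.

    Node id = view * num_joints + joint (a hypothetical numbering scheme).
    """
    edges = {"skeleton": set(), "symmetry": set(), "view": set()}
    # Skeleton-edges and symmetry-edges are replicated inside every view.
    for v in range(num_views):
        offset = v * num_joints
        for a, b in skeleton_pairs:
            edges["skeleton"].add((offset + a, offset + b))
        for a, b in symmetry_pairs:
            edges["symmetry"].add((offset + a, offset + b))
    # View-edges link the same joint across every pair of distinct views.
    for j in range(num_joints):
        for v1, v2 in combinations(range(num_views), 2):
            edges["view"].add((v1 * num_joints + j, v2 * num_joints + j))
    return edges


# Toy skeleton (assumed): 0 pelvis, 1 left hip, 2 right hip, 3 left knee, 4 right knee.
SKELETON = [(0, 1), (0, 2), (1, 3), (2, 4)]
SYMMETRY = [(1, 2), (3, 4)]
graph = build_multiview_graph(num_views=2, num_joints=5,
                              skeleton_pairs=SKELETON, symmetry_pairs=SYMMETRY)
for edge_type, pairs in graph.items():
    print(edge_type, sorted(pairs))
```

Keeping the edge categories in separate sets makes it straightforward to give each category its own correlation weights in a graph convolution layer, which is the idea the abstract describes.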
Funding: Supported by the STI 2030-Major Projects (2022ZD0211900, 2022ZD0211902), the STI 2030-Major Projects (2021ZD0204500, 2021ZD0204503), the National Natural Science Foundation of China (32171461), and the National Key Research and Development Program of China (2023YFC3208303).
Abstract: Advancements in animal behavior quantification methods have driven the development of computational ethology, enabling fully automated behavior analysis. Existing multi-animal pose estimation workflows rely on tracking-by-detection frameworks for either bottom-up or top-down approaches, requiring retraining to accommodate diverse animal appearances. This study introduces InteBOMB, an integrated workflow that enhances top-down approaches by incorporating generic object tracking, eliminating the need for prior knowledge of target animals while maintaining broad generalizability. InteBOMB includes two key strategies for tracking and segmentation in laboratory environments and two techniques for pose estimation in natural settings. The "background enhancement" strategy optimizes a foreground-background contrastive loss, generating more discriminative correlation maps. The "online proofreading" strategy stores human-in-the-loop long-term memory and dynamic short-term memory, enabling adaptive updates to object visual features. The "automated labeling suggestion" technique reuses the visual features saved during tracking to identify representative frames for training set labeling. Additionally, the "joint behavior analysis" technique integrates these features with multimodal data, expanding the latent space for behavior classification and clustering. To evaluate the framework, six datasets of mice and six datasets of nonhuman primates were compiled, covering laboratory and natural scenes. Benchmarking results demonstrated a 24% improvement in zero-shot generic tracking and a 21% enhancement in joint latent space performance across datasets, highlighting the effectiveness of this approach in robust, generalizable behavior analysis.
Funding: National Natural Science Foundation of China (Grant No. 62266045); National Science and Technology Major Project of China (No. 2022YFE0138600).
Abstract: This paper presents a manifold-optimized Error-State Kalman Filter (ESKF) framework for unmanned aerial vehicle (UAV) pose estimation, integrating Inertial Measurement Unit (IMU) data with GPS or LiDAR to enhance estimation accuracy and robustness. We employ a manifold-based optimization approach, leveraging exponential and logarithmic mappings to convert between rotation vectors and rotation matrices. The proposed ESKF framework keeps the error-state variables near the origin, effectively mitigating singularity issues and enhancing numerical stability. Additionally, because the state variables are small in magnitude, second-order terms can be neglected, which simplifies Jacobian matrix computation and improves computational efficiency. Furthermore, we introduce a novel Kalman filter gain computation strategy that dynamically adapts to low-dimensional and high-dimensional observation equations, enabling efficient processing across different sensor modalities. For resource-constrained UAV platforms in particular, this method significantly reduces computational cost, making it highly suitable for real-time UAV applications.
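The exponential and logarithmic mappings mentioned in this abstract are standard on SO(3) and can be sketched with the Rodrigues formula. The following is a minimal illustration, not the paper's code; the function names `exp_so3` and `log_so3` are assumptions, and the logarithm shown here is not robust for rotation angles close to pi.

```python
import math


def exp_so3(w):
    """Rodrigues formula: rotation vector (axis * angle) -> 3x3 rotation matrix."""
    wx, wy, wz = w
    theta = math.sqrt(wx * wx + wy * wy + wz * wz)
    K = [[0.0, -wz, wy], [wz, 0.0, -wx], [-wy, wx, 0.0]]  # skew-symmetric matrix of w
    if theta < 1e-8:
        a, b = 1.0, 0.5  # small-angle limits of sin(t)/t and (1 - cos t)/t^2
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / (theta * theta)
    K2 = [[sum(K[i][k] * K[k][j] for k in range(3)) for j in range(3)] for i in range(3)]
    return [[(1.0 if i == j else 0.0) + a * K[i][j] + b * K2[i][j]
             for j in range(3)] for i in range(3)]


def log_so3(R):
    """Rotation matrix -> rotation vector (assumes the angle is well below pi)."""
    cos_theta = max(-1.0, min(1.0, (R[0][0] + R[1][1] + R[2][2] - 1.0) / 2.0))
    theta = math.acos(cos_theta)
    # vee(R - R^T) equals 2 * sin(theta) / theta times the rotation vector.
    v = [R[2][1] - R[1][2], R[0][2] - R[2][0], R[1][0] - R[0][1]]
    scale = 0.5 if theta < 1e-8 else theta / (2.0 * math.sin(theta))
    return [scale * c for c in v]


w = [0.1, -0.2, 0.3]
R = exp_so3(w)
w_back = log_so3(R)
print(w_back)  # approximately recovers w
```

In an error-state filter, the small-angle branch is the one exercised most often, because the error rotation stays close to the identity.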
Abstract: Pose estimation of spacecraft targets is a key technology for space operation tasks such as the removal of failed satellites and the detection and scanning of non-cooperative targets. This paper reviews target pose estimation methods based on image feature extraction and Perspective-n-Point (PnP), target pose estimation methods based on registration, and spacecraft target pose estimation methods based on deep learning, and introduces the corresponding research methods.
Funding: Co-supported by the National Natural Science Foundation of China (Nos. 12302252 and 12472189) and the Research Program of National University of Defense Technology, China (No. ZK24-31).
Abstract: Vision-based relative pose estimation plays a pivotal role in various space missions. Deep learning enhances monocular spacecraft pose estimation, but its high computational demands necessitate model simplification for onboard systems. In this paper, we aim to balance accuracy and computational efficiency. We present a Perspective-n-Point (PnP) based method for spacecraft pose estimation that uses lightweight neural networks to localize semantic keypoints and reduce computational load. Since the accuracy of keypoint localization is closely related to the heatmap resolution, we devise an efficient upsampling module that increases the resolution of heatmaps with minimal overhead. Furthermore, the heatmaps predicted by lightweight models tend to show a high level of noise. To tackle this issue, we propose a weighting strategy based on the statistical characteristics of the predicted semantic keypoints, which substantially improves pose estimation accuracy. Experiments on the SPEED dataset underscore the potential of our method for engineering applications. We reduce the model parameters to 0.7 M, merely 2.5% of that required by the top-performing method, while achieving lower pose estimation error and better real-time performance.
Abstract: [Objective] Fish pose estimation (FPE) provides fish physiological information, facilitating health monitoring in aquaculture. It aids decision-making in areas such as fish behavior recognition. When fish are injured or deficient, they often display abnormal behaviors and noticeable changes in the positioning of their body parts. Moreover, the unpredictable posture and orientation of fish during swimming, combined with their rapid swimming speed, restrict the current scope of research in FPE. In this research, an FPE model named HPFPE is presented to capture the swimming posture of fish and accurately detect their key points. [Methods] On the one hand, the model incorporates the CBAM module into the HRNet framework. The attention module enhances accuracy without adding computational complexity, while effectively capturing a broader range of contextual information. On the other hand, the model incorporates dilated convolution to increase the receptive field, allowing it to capture more spatial context. [Results and Discussions] Experiments showed that, compared with the baseline method, the average precision (AP) of HPFPE with different backbones and input sizes on the Oplegnathus punctatus dataset increased by 0.62, 1.35, 1.76, and 1.28 percentage points, respectively, while the average recall (AR) increased by 0.85, 1.50, 1.40, and 1.00 percentage points, respectively. Additionally, HPFPE outperformed other mainstream methods, including DeepPose, CPM, SCNet, and Lite-HRNet. Furthermore, compared with other methods on the ornamental fish data, HPFPE achieved the highest AP and AR values of 52.96% and 59.50%, respectively. [Conclusions] The proposed HPFPE can accurately estimate fish posture and assess swimming patterns, serving as a valuable reference for applications such as fish behavior recognition.
Funding: Supported by grants from the National Key Research and Development Program of China (2023YFF0724902), the China Postdoctoral Science Foundation (2020M670027, 2023TQ0183), and the Local Standards Research of BeiJing Laboratory Tree Shrew (CHYX-2023-DGB001).
Abstract: Background: Quantifying the rich home-cage activities of tree shrews provides a reliable basis for understanding their daily routines and building disease models. However, due to the lack of effective behavioral methods, most efforts on tree shrew behavior are limited to simple measures, resulting in the loss of much behavioral information. Methods: To address this issue, we present a deep learning (DL) approach to achieve markerless pose estimation and recognize multiple spontaneous behaviors of tree shrews, including drinking, eating, resting, and staying in the dark house. Results: This high-throughput approach can monitor the home-cage activities of 16 tree shrews simultaneously over an extended period. Additionally, we demonstrated an innovative system with reliable apparatus, paradigms, and analysis methods for investigating food grasping behavior. The median duration of each bout of grasping was 0.20 s. Conclusion: This study provides an efficient tool for quantifying and understanding tree shrews' natural behaviors.
Abstract: With the rapid progress of artificial intelligence (AI) technology and the mobile internet, 3D hand pose estimation has become critical to various intelligent application areas, e.g., human-computer interaction. To avoid the low accuracy of single-modal estimation and the high complexity of traditional multi-modal 3D estimation, this paper proposes a novel multi-modal multi-view (MMV) 3D hand pose estimation system. The system introduces a conditional generative adversarial network (cGAN) that jointly combines registration before translation (RT) and translation before registration (TR) to train a multi-modal registration network, and then employs multi-modal feature fusion to achieve high-quality estimation with low hardware and software costs in both data acquisition and processing. Experimental results demonstrate that the MMV system is effective and feasible in various scenarios. The MMV system is promising for use in broad intelligent application areas.
Funding: National Natural Science Foundation of China (Nos. 61775139, 62072126, 61772164, and 61872242).
Abstract: Human pose estimation has received much attention from the research community because of its wide range of applications. However, current pose estimation methods are usually complex and computationally intensive, and they suffer in particular from feature loss during feature fusion. To address these problems, we propose a lightweight human pose estimation network based on a multi-attention mechanism (LMANet). In our method, network parameters are significantly reduced by lightweighting the bottleneck blocks of the high-resolution network with depth-wise separable convolution. We also introduce a multi-attention mechanism to improve prediction accuracy: a channel attention module is added in the initial stage of the network to enhance local cross-channel information interaction. More importantly, we inject a spatial cross-awareness module in the multi-scale feature fusion stage to reduce spatial information loss during feature extraction. Extensive experiments on the COCO2017 and MPII datasets show that LMANet achieves higher prediction accuracy with fewer network parameters and less computational effort. Compared with the high-resolution network HRNet, the number of parameters and the computational complexity of the network are reduced by 67% and 73%, respectively.
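The parameter savings from depth-wise separable convolution follow from simple counting, which the short sketch below works through for a single layer. The layer sizes (3x3 kernel, 64 input and 64 output channels) and function names are illustrative assumptions, biases are ignored, and the result is a per-layer figure that is unrelated to the network-level 67% reduction reported in the abstract.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel, c_in, c_out):
    """Depth-wise k x k convolution (one filter per input channel)
    followed by a 1 x 1 point-wise convolution (bias ignored)."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


std = standard_conv_params(3, 64, 64)
sep = separable_conv_params(3, 64, 64)
print(std, sep, round(sep / std, 4))  # ratio equals 1/c_out + 1/kernel**2
```

The ratio `1/c_out + 1/kernel**2` shows why the saving approaches a factor of nine for 3x3 kernels once the channel count is large.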
Funding: National Key Research and Development Program of China (No. 2021ZD0111902) and National Natural Science Foundation of China (Nos. 62172022 and U21B2038).
Abstract: Due to self-occlusion and the high degree of freedom of the hand, estimating 3D hand pose from a single RGB image is a very challenging problem. Graph convolutional networks (GCNs) use graphs to describe the physical connections between hand joints and improve the accuracy of 3D hand pose regression. However, GCNs cannot effectively describe the relationships between non-adjacent hand joints. Recently, hypergraph convolutional networks (HGCNs) have received much attention because they can describe multi-dimensional relationships between nodes through hyperedges. This paper therefore proposes a framework for 3D hand pose estimation based on HGCN, which better extracts correlations between adjacent and non-adjacent hand joints. To overcome the shortcomings of predefined hypergraph structures, a dynamic hypergraph convolutional network is proposed in which hyperedges are constructed dynamically based on hand joint feature similarity. To better explore the local semantic relationships between nodes, a semantic dynamic hypergraph convolution is proposed. The proposed method is evaluated on publicly available benchmark datasets. Qualitative and quantitative experimental results both show that the proposed HGCN and its improved variants outperform GCN for 3D hand pose estimation and achieve state-of-the-art performance compared with existing methods.
Funding: Funded by the Joint Development Project with Pharmapack Technologies Corporation: Open Multi-Person Collaborative Virtual Assembly/Disassembly Training and Virtual Engineering Visualization Platform, Grant Number 23HK0101.
Abstract: Virtual maintenance, as an important means of industrial training and education, places strict requirements on the accuracy of participant pose perception and the assessment of motion standardization. However, existing research mainly focuses on human pose estimation in general scenarios and lacks specialized solutions for maintenance scenarios. This paper proposes a virtual maintenance human pose estimation method based on multi-scale feature enhancement (VMHPE), which integrates adaptive input feature enhancement, multi-scale feature correction for improved expression of fine movements and complex poses, and multi-scale feature fusion to enhance keypoint localization accuracy. This study also constructs the first virtual maintenance-specific human keypoint dataset (VMHKP), which records standard action sequences of professional maintenance personnel in five typical maintenance tasks and provides a reliable benchmark for evaluating operator motion standardization. The dataset is publicly available at. Using the high-precision keypoint predictions, an action assessment system based on topological structure similarity was established. Experiments show that our method achieves significant performance improvements: average precision (AP) reaches 94.4%, an increase of 2.3 percentage points over baseline methods, and average recall (AR) reaches 95.6%, an increase of 1.3 percentage points. This research establishes a four-level evaluation standard based on comparative motion analysis and provides a reliable solution for standardizing industrial maintenance training.
Funding: Supported by the National Natural Science Foundation of China (No. 52272390), the Natural Science Foundation of Heilongjiang Province of China (No. YQ2022A009), and the Shanghai Sailing Program, China (No. 20YF1417300).
Abstract: Real-time 6 Degree-of-Freedom (DoF) pose estimation is of paramount importance for various on-orbit tasks. Benefiting from the development of deep learning, the use of Convolutional Neural Networks (CNNs) for feature extraction has yielded impressive results in spacecraft pose estimation. To improve the robustness and interpretability of CNNs, this paper proposes a Pose Estimation approach based on a Variational Auto-Encoder structure (PE-VAE) and a Feature-Aided pose estimation approach based on a Variational Auto-Encoder structure (FA-VAE), both of which aim to accurately estimate the 6 DoF pose of a target spacecraft. Both methods treat the pose vector as latent variables and employ an encoder-decoder network with a Variational Auto-Encoder (VAE) structure. To enhance the precision of pose estimation, PE-VAE uses the VAE structure to introduce a reconstruction mechanism over the whole image. FA-VAE further enforces feature shape constraints by reconstructing only the segment of the target spacecraft with the desired shape. Comparative evaluation against leading methods on public datasets reveals similar accuracy with a threefold improvement in processing speed, showcasing the contribution of VAE structures to accuracy enhancement and the additional benefit of incorporating global shape prior features.
Funding: Supported by the National Key R&D Plan (2017YFB1301104), NSFC (61877040, 61772351), Sci-Tech Innovation Fundamental Scientific Research Funds (025195305000) (19210010005), and the Academy for Multidisciplinary Study of Capital Normal University.
Abstract: Monocular visual odometry (VO) based on geometric "feature detection and tracking" frequently produces error or drift in pose estimation when the camera moves faster than 1.5 m/s. Meanwhile, in most deep-learning-based VO methods the weight factors take fixed values, which easily leads to overfitting. A new measurement system for monocular visual odometry based on neural networks, named Deep Learning Visual Odometry (DLVO), is proposed. In this system, a Convolutional Neural Network (CNN) is used to extract features and perform feature matching, and a Recurrent Neural Network (RNN) is used for sequence modeling to estimate the camera's 6-DoF poses. Instead of fixed CNN weight values, a Bayesian distribution over the weight factors is introduced to address network overfitting. The 18,726 frames of the KITTI dataset are used to train the network. The system increases the generalization ability of the network model during prediction. Compared with the original Recurrent Convolutional Neural Network (RCNN), our method reduces the loss of the test model by 5.33%. It also estimates translation and rotation more robustly than traditional VO methods.
Funding: Funded by the Center for Unmanned Aircraft Systems (C-UAS), a National Science Foundation Industry/University Cooperative Research Center (I/UCRC), under NSF award numbers IIP-1161036 and CNS-1650547, along with significant contributions from C-UAS industry members.
Abstract: This paper introduces a new algorithm for estimating the relative pose of a moving camera using consecutive frames of a video sequence. State-of-the-art algorithms for calculating the relative pose between two images use matching features to estimate the essential matrix. The essential matrix is then decomposed into the relative rotation and normalized translation between frames. To be robust to noise and feature match outliers, these methods generate a large number of essential matrix hypotheses from randomly selected minimal subsets of feature pairs, and then score these hypotheses on all feature pairs. In contrast, the algorithm introduced in this paper calculates relative pose hypotheses by directly optimizing the rotation and normalized translation between frames, rather than calculating the essential matrix and then performing the decomposition. The resulting algorithm improves computation time by an order of magnitude. If an inertial measurement unit (IMU) is available, it is used to seed the optimizer. In addition, we reuse the best hypothesis at each iteration to seed the optimizer, thereby reducing the number of relative pose hypotheses that must be generated and scored. These advantages greatly speed up performance and enable the algorithm to run in real time on low-cost embedded hardware. We show an application of our algorithm to visual multi-target tracking (MTT) in the presence of parallax and demonstrate its real-time performance on a 640 × 480 video sequence captured on a UAV. Video results are available at https://youtu.be/HhK-p2hXNnU.
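The relationship between a relative pose and the essential matrix can be made concrete with a short sketch. It builds E = [t]x R from a rotation R and translation t and evaluates the epipolar residual x2^T E x1 for a feature pair in normalized image coordinates, which is one common way to score a pose hypothesis. This is a generic illustration under assumed conventions (points map as X2 = R X1 + t), not the scoring function or optimizer used in the paper; all names and numeric values are hypothetical.

```python
def skew(t):
    """Skew-symmetric matrix [t]x so that skew(t) @ v == cross(t, v)."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def matvec(A, x):
    return [sum(A[i][k] * x[k] for k in range(3)) for i in range(3)]


def epipolar_residual(R, t, x1, x2):
    """x2^T [t]x R x1; zero for a noise-free correspondence under pose (R, t)."""
    E = matmul(skew(t), R)
    Ex1 = matvec(E, x1)
    return sum(a * b for a, b in zip(x2, Ex1))


def project(X):
    """Normalized homogeneous image coordinates of a 3D point with z > 0."""
    return [X[0] / X[2], X[1] / X[2], 1.0]


# Assumed toy pose: 90-degree rotation about z and unit translation along x.
R = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
t = [1.0, 0.0, 0.0]
X1 = [0.5, 0.2, 4.0]                              # point in the first camera frame
X2 = [a + b for a, b in zip(matvec(R, X1), t)]    # same point in the second frame
good = epipolar_residual(R, t, project(X1), project(X2))
bad = epipolar_residual(R, t, project(X1), [0.2, 0.3, 1.0])  # mismatched feature
print(good, bad)
```

Because the residual depends on (R, t) directly, a hypothesis can be scored or refined without ever extracting a pose from an estimated essential matrix, which is the design choice the abstract highlights.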
Funding: Supported in part by the Fundamental Research Funds for the Central Universities (WK2350000002).
Abstract: In this paper we present a CNN-based approach for real-time 3D hand pose estimation from depth sequences. Prior discriminative approaches have achieved remarkable success but face two main challenges. First, the methods are fully supervised and hence require large amounts of annotated training data to extract the dynamic information from a hand representation. Second, they rely on unreliable hand detectors, either based on strong assumptions or weak detectors, which often fail in situations such as complex environments and multiple hands. In contrast to these methods, this paper presents an approach that can be considered semi-supervised: it performs predictive coding of image sequences of hand poses in order to capture the latent features underlying a given image without supervision. The hand is modeled using a novel latent tree dependency model (LDTM), which transforms internal joint locations into an explicit representation. The modeled hand topology is then integrated with the pose estimator using a data-dependent method to jointly learn the latent variables of the posterior pose appearance and the pose configuration. Finally, an unsupervised error term, which is part of the recurrent architecture, ensures smooth estimates of the final pose. Experiments on three challenging public datasets, ICVL, MSRA, and NYU, demonstrate that the proposed method performs comparably to or better than state-of-the-art approaches.
Abstract: Human pose estimation (HPE) is a procedure for determining the structure of the body pose, and it is considered a challenging problem in the computer vision (CV) community. HPE finds applications in several fields, such as activity recognition and human-computer interfaces. Despite its benefits, HPE remains challenging due to variations in visual appearance, lighting, occlusion, dimensionality, etc. To resolve these issues, this paper presents a squirrel search optimization with a deep convolutional neural network for HPE (SSDCNN-HPE) technique. The main goal of the SSDCNN-HPE technique is to identify the human pose accurately and efficiently. First, video frame conversion is performed, and pre-processing takes place via a bilateral filtering-based noise removal process. Then, the EfficientNet model is applied to identify the body points of a person with no problem constraints. The hyperparameters of the EfficientNet model are tuned using the squirrel search algorithm (SSA). In the final stage, a multiclass support vector machine (M-SVM) is used to identify and classify human poses. The combination of bilateral filtering with an SSA-tuned EfficientNet model for HPE constitutes the novelty of the work. To demonstrate the enhanced outcomes of the SSDCNN-HPE approach, a series of simulations is executed. The experimental results show that the SSDCNN-HPE system improves on recent existing techniques in terms of different measures.
Funding: Supported by the Priority Research Centers Program through NRF funded by MEST (2018R1A6A1A03024003) and the Grand Information Technology Research Center support program IITP-2020-2020-0-01612 supervised by the IITP by MSIT, Korea.
Abstract: In the new era of technology, monitoring daily human activities is becoming more challenging because of complex scenes and backgrounds. To understand the scenes and activities in human life logs, human-object interaction (HOI) is important for visual relationship detection and human pose estimation. Activity understanding and interaction recognition between humans and objects, along with pose estimation and interaction modeling, are explained. Some existing algorithms and feature extraction procedures are complicated and struggle with the accurate detection of rare human postures, occluded regions, and objects, especially small objects. Existing HOI detection techniques are instance-centric (object-based), with interaction predicted between all pairs. Such estimation depends on appearance features and spatial information. We therefore propose a novel approach to demonstrate that appearance features alone are not sufficient to predict the HOI. Furthermore, we detect human body parts using a Gaussian Mixture Model (GMM), followed by object detection using YOLO. We predict interaction points, which directly classify the interaction, and pair them with densely predicted HOI vectors using the interaction algorithm. The interactions are linked with the human and the object to predict the actions. Experiments on two benchmark HOI datasets demonstrate the proposed approach.
Funding: This work was supported in part by the National Natural Science Foundation of China (61973007, 61633002).
Abstract: Controlling multiple multi-joint fish-like robots has long captivated the attention of engineers and biologists, and a fundamental but challenging part of it is robustly tracking the postures of the individuals in real time. This requires detecting multiple robots, estimating multi-joint postures, and tracking identities, all with fast real-time processing. To the best of our knowledge, this challenge has not been tackled in previous studies. In this paper, to precisely track the planar postures of multiple swimming multi-joint fish-like robots in real time, we propose a novel deep neural network-based method named TAB-IOL. Its TAB part fuses the top-down and bottom-up approaches for vision-based pose estimation, while the IOL part, with long short-term memory, considers the motion constraints among joints for precise pose tracking. The performance of TAB-IOL is verified by testing on a group of freely swimming fish-like robots in various scenarios with strong disturbances and by a detailed comparison of accuracy, speed, and robustness with most state-of-the-art algorithms. Further, based on the precise pose estimation and tracking realized by TAB-IOL, several formation control experiments are conducted for the group of fish-like robots. The results clearly demonstrate that TAB-IOL lays a solid foundation for the coordination control of multiple fish-like robots in a real working environment. We believe our proposed method will facilitate the growth and development of related fields.
Funding: Supported by the Program of Entrepreneurship and Innovation Ph.D. in Jiangsu Province (JSSCBS20211175), the School Ph.D. Talent Funding (Z301B2055), and the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (21KJB520002).
Abstract: 3D human pose estimation is a major focus area in the field of computer vision and plays an important role in practical applications. This article summarizes the framework and research progress related to estimation from monocular RGB images and videos. An overall perspective of methods integrated with deep learning is introduced. Novel image-based and video-based inputs are proposed as the analysis framework. From this viewpoint, common problems are discussed. The diversity of human postures usually leads to problems such as occlusion and ambiguity, and the lack of training datasets often results in poor generalization ability of the model. Regression methods are crucial for solving such problems. For image-based input, the multi-view method is commonly used to solve occlusion problems, and it is analyzed comprehensively here. For video-based input, human prior knowledge of restricted motion is used to predict human postures. In addition, structural constraints are widely used as prior knowledge. Furthermore, weakly supervised learning methods are studied and discussed for these two types of inputs to improve model generalization ability. The problem of insufficient training datasets must also be considered, especially because 3D datasets are usually biased and limited. Finally, emerging and popular datasets and evaluation indicators are discussed. The characteristics of the datasets and the relationships between the indicators are explained and highlighted. This article can therefore be useful and instructive for researchers who lack experience and find this field confusing. In addition, by providing an overview of 3D human pose estimation, this article sorts and refines recent studies on the topic. It describes core problems and common useful methods, and discusses the scope for further research.
Funding: Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (2020-0-01592) and Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education under Grant (2019R1F1A1058548) and Grant (2020R1G1A1013221).
Abstract: Face image analysis is one of several important cues in computer vision. Over the last five decades, methods for face analysis have received immense attention due to large-scale applications in various face analysis tasks. Face parsing strongly benefits various human face image analysis tasks, including face pose estimation. In this paper we propose a 3D head pose estimation framework built on a prior end-to-end deep face parsing model. We have developed an end-to-end face parts segmentation framework using deep convolutional neural networks (DCNNs). To train the deep face parts parsing model, we label face images with seven classes: eyes, brows, nose, hair, mouth, skin, and background. We extract features from grayscale images using DCNNs and train a classifier on the extracted features. We use a probabilistic classification method to produce grayscale images in the form of probability maps for each dense semantic class. A further stage of DCNNs then extracts features from the grayscale probability maps created during the segmentation phase. We assess the performance of the proposed model on four standard head pose datasets, Pointing'04, Annotated Facial Landmarks in the Wild (AFLW), Boston University (BU), and ICT-3DHP, obtaining superior results compared with previous work.