Abstract: Background Generally, it is difficult to obtain accurate pose and depth for a non-rigid moving object from a single RGB camera to create augmented reality (AR). In this study, we build an augmented reality system from a single RGB camera for a non-rigid moving human by accurately computing pose and depth, for which the two key tasks are segmentation and monocular Simultaneous Localization and Mapping (SLAM). Most existing monocular SLAM systems are designed for static scenes, whereas in this AR system the human body is always moving and non-rigid. Methods To make the SLAM system suitable for a moving human, we first segment the rigid parts of the human in each frame. A segmented moving body part can be regarded as a static object, and the relative motion between each moving body part and the camera can be treated as camera motion, so typical SLAM systems designed for static scenes can then be applied. In the segmentation step of this AR system, we first employ the proposed BowtieNet, which adds the atrous spatial pyramid pooling (ASPP) of DeepLab between the encoder and decoder of SegNet, to segment the human in the original frame; we then use color information to extract the face from the segmented human area. Results Based on the human segmentation results and a monocular SLAM, the system can change the video background and attach a virtual object to the human. Conclusions Experiments on human image segmentation datasets show that BowtieNet achieves state-of-the-art human image segmentation performance at speeds sufficient for real-time segmentation. Experiments on videos show that the proposed AR system can robustly attach a virtual object to the human and accurately change the video background.
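A minimal sketch of the BowtieNet idea described in this abstract: an ASPP block (borrowed from DeepLab) inserted between a SegNet-style encoder and decoder. The layer sizes and dilation rates below are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch only: a SegNet-like encoder/decoder with an ASPP bottleneck.
# Channel counts and dilation rates are placeholders, not BowtieNet's.
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Atrous spatial pyramid pooling: parallel dilated 3x3 convolutions."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r)
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        feats = torch.cat([b(x) for b in self.branches], dim=1)
        return self.project(feats)

class BowtieNetSketch(nn.Module):
    """Encoder -> ASPP -> decoder, ending in a 2-class (human/background) map."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(  # stand-in for the SegNet encoder
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.aspp = ASPP(128, 128)
        self.decoder = nn.Sequential(  # stand-in for the SegNet decoder
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 2, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.aspp(self.encoder(x)))
```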
Funding: This research was financially supported by the Ministry of Small and Medium-sized Enterprises (SMEs) and Startups (MSS), Korea, under the "Regional Specialized Industry Development Program (R&D, S3091627)" supervised by the Korea Institute for Advancement of Technology (KIAT).
Abstract: A wide range of camera apps and online video-conferencing services support changing the background in real time for aesthetic, privacy, and security reasons. Numerous studies show that Deep Learning (DL) is a suitable option for human segmentation and that an ensemble of multiple DL-based segmentation models can improve the segmentation result. However, these approaches are not as effective when applied directly to image segmentation in a video. This paper proposes an Adaptive N-Frames Ensemble (AFE) approach for high-movement human segmentation in a video using an ensemble of multiple DL models. In contrast to a conventional ensemble, which executes multiple DL models simultaneously on every video frame, the proposed AFE approach executes only a single DL model on the current frame, and it combines the segmentation outputs of previous frames into the final segmentation output when the frame difference is below a particular threshold. Our method builds on the N-Frames Ensemble (NFE) method, which ensembles the image segmentations of the current and previous video frames; however, NFE is suitable neither for segmenting fast-moving objects nor for videos with low frame rates. The proposed AFE approach addresses these limitations. Our experiments use three human segmentation models, namely the Fully Convolutional Network (FCN), DeepLabv3, and Mediapipe. We evaluated our approach on 1711 single-person videos from the TikTok50f dataset, a reconstruction of the publicly available TikTok dataset obtained by cropping, resizing, and dividing it into videos of 50 frames each. This paper compares the proposed AFE with single models, the Two-Models Ensemble, and the NFE models. The experimental results show that AFE is suitable for both low-movement and high-movement human segmentation in a video.
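A hedged sketch of the AFE control flow described above: run only one of the available models per frame, and fuse the last few masks only when the frame barely changed. The model-rotation scheme, frame-difference measure, threshold, and fusion rule below are illustrative assumptions.

```python
# Sketch only: AFE-style adaptive per-frame ensembling with NumPy.
import numpy as np

def afe_segment(frames, models, n=3, diff_threshold=10.0):
    """frames: iterable of HxWx3 uint8 arrays; models: callables that
    return a binary HxW mask for a frame. Yields one mask per frame."""
    history, prev_frame = [], None
    for i, frame in enumerate(frames):
        # Run exactly one DL model on the current frame (rotation is
        # a simplification of how AFE picks the model).
        mask = models[i % len(models)](frame)
        history = (history + [mask])[-n:]  # keep the last n masks
        if prev_frame is not None:
            # Mean absolute pixel difference as a simple motion proxy.
            diff = np.abs(frame.astype(float) - prev_frame.astype(float)).mean()
            if diff < diff_threshold and len(history) > 1:
                # Low movement: majority-vote the last n masks.
                mask = (np.mean(history, axis=0) >= 0.5).astype(np.uint8)
        prev_frame = frame
        yield mask
```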
Funding: Supported by the MSIT (Ministry of Science and ICT), Republic of Korea, under the Convergence Security Core Talent Training Business Support Program (IITP-2025-RS-2023-00266605) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation).
Abstract: Surveillance systems take various forms, and gait-based surveillance is emerging as a powerful approach because it can identify individuals without requiring their cooperation. Several approaches to gait recognition have been suggested in existing studies; nevertheless, the performance of existing systems often degrades in real-world conditions due to covariate factors such as occlusions, clothing changes, walking speed, and varying camera viewpoints. Furthermore, most existing research focuses on single-person gait recognition; counting, tracking, detecting, and recognizing individuals in dual-subject settings with occlusions remains a challenging task. This research therefore proposes an automated gait model for occluded dual-subject walk scenarios. More precisely, we design a deep learning (DL)-based dual-subject gait model (DSG) comprising three modules. The first module handles silhouette segmentation, localization, and counting (SLC) using Mask-RCNN with MobileNetV2. The second uses a Convolutional Block Attention Module (CBAM)-based Siamese network for frame-level tracking with a modified gallery setting. Finally, region-based deep learning is used for dual-subject gait recognition. Tested on the Shri Mata Vaishno Devi University (SMVDU) Multi-Gait and Single-Gait datasets, the proposed method shows strong performance, with 94.00% segmentation, 58.36% tracking, and 63.04% gait recognition accuracy in dual-subject walk scenarios.
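A structural sketch of the three-module DSG pipeline described above. The module internals are stubbed callables standing in for Mask-RCNN+MobileNetV2, the CBAM-based Siamese tracker, and the region-based recognizer; all names are hypothetical, not the authors' implementation.

```python
# Sketch only: the SLC -> tracking -> recognition flow for up to two walkers.
def dsg_pipeline(frames, slc_model, tracker, recognizer):
    """Run the three DSG stages on a video and return an identity per track."""
    identities = {}
    for frame in frames:
        # Module 1 (SLC): segment, localize, and count silhouettes.
        silhouettes = slc_model(frame)        # list of per-subject masks
        # Module 2: frame-level tracking via the Siamese matcher.
        tracks = tracker.assign(silhouettes)  # {track_id: silhouette}
        # Module 3: region-based gait recognition per tracked subject.
        for tid, sil in tracks.items():
            identities.setdefault(tid, []).append(recognizer(sil))
    # Majority vote over per-frame predictions for each track.
    return {tid: max(set(preds), key=preds.count)
            for tid, preds in identities.items()}
```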
Funding: Supported by the National Basic Research Program of China (Grant No. 2006CB303103) and the Key Program of the National Natural Science Foundation of China (Grant No. 60833009).
Abstract: We address the problem of 3D human pose estimation from a single real-scene image. Normally, 3D pose estimation from real images relies on background subtraction to extract the appropriate features; we do not make that assumption. In this paper, a two-step approach is proposed. First, instead of applying background subtraction to segment the human, we combine segmentation with human detection using an ISM-based detector. Then a silhouette feature is extracted, and 3D pose estimation is solved as a regression problem, using Relevance Vector Machines (RVMs) and ridge regression. The results show the robustness and accuracy of our method.
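A minimal sketch of the regression step described above: ridge regression from a silhouette descriptor to 3D joint coordinates. The descriptor, data shapes, and scikit-learn's Ridge are illustrative choices (the paper also uses RVMs, not shown here), and the arrays below are random placeholders, not real data.

```python
# Sketch only: silhouette -> descriptor -> multi-output ridge regression.
import numpy as np
from sklearn.linear_model import Ridge

def silhouette_descriptor(mask):
    """Stand-in descriptor: an 8x8 grid of foreground occupancy.
    Assumes the mask's height and width are divisible by 8 (e.g. 64x64)."""
    h, w = mask.shape
    grid = mask.reshape(8, h // 8, 8, w // 8).mean(axis=(1, 3))
    return grid.ravel()

# Placeholder training set: 100 descriptors and 17-joint 3D poses.
X = np.random.rand(100, 64)
Y = np.random.rand(100, 17 * 3)
model = Ridge(alpha=1.0).fit(X, Y)

# Estimate a pose for one new silhouette.
mask = (np.random.rand(64, 64) > 0.5).astype(float)  # placeholder silhouette
pose = model.predict(silhouette_descriptor(mask)[None, :])  # shape (1, 51)
```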
Funding: Supported by the National Natural Science Foundation of China (Project No. 61521002) and a research grant from the Beijing Higher Institution Engineering Research Center.
Abstract: Current image-editing tools do not meet the demands of personalized image manipulation, one application of which is changing the clothes in user-captured images. Previous work can change single-color clothes using parametric human warping methods. In this paper, we propose an image-based clothes-changing system that exploits body-factor extraction and content-aware image warping. Image segmentation and mask generation are first applied to the user input. Afterwards, we determine joint positions via a neural network. Then body-shape matching is performed, and the shape of the model is warped to the user's shape. Finally, head swapping is performed to produce realistic virtual results. We also provide a supervision and labeling tool for refinement and further assistance when creating a dataset.
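A structural sketch of the clothes-changing pipeline described above. Each stage is a stub callable; the function names and calling conventions are hypothetical, not the authors' implementation.

```python
# Sketch only: segmentation -> joints -> shape-matched warp -> head swap.
def change_clothes(user_img, model_img, segmenter, joint_net, warper,
                   head_swapper):
    """Transfer the model image's clothes onto the user image."""
    user_mask = segmenter(user_img)        # segmentation + mask generation
    user_joints = joint_net(user_img)      # joint positions via a neural net
    model_joints = joint_net(model_img)
    # Warp the clothed model to the user's body shape by matching joints
    # (the body-shape matching and content-aware warping step).
    warped_body = warper(model_img, src=model_joints, dst=user_joints)
    # Swap the user's head onto the warped body for a realistic result.
    return head_swapper(warped_body, user_img, user_mask)
```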