Funding: Supported in part by the National Natural Science Foundation of China (62176139, 62106128, 62176141), the Major Basic Research Project of Shandong Natural Science Foundation (ZR2021ZD15), the Natural Science Foundation of Shandong Province (ZR2021QF001), the Young Elite Scientists Sponsorship Program by CAST (2021QNRC001), the Open Project of the Key Laboratory of Artificial Intelligence, Ministry of Education, the Shandong Provincial Natural Science Foundation for Distinguished Young Scholars (ZR2021JQ26), and the Taishan Scholar Project of Shandong Province (tsqn202103088).
Abstract: We introduce a novel method, built on a new generative model, that automatically learns effective representations of target and background appearance to detect, segment, and track each instance in a video sequence. Unlike current discriminative tracking-by-detection solutions, our hierarchical structural embedding learning predicts higher-quality masks with accurate boundary details over the spatio-temporal space via normalizing flows. We formulate instance inference as hierarchical spatio-temporal embedding learning across time and space. Given a video clip, our method first coarsely locates the pixels belonging to a particular instance with a Gaussian distribution, and then builds a novel mixing distribution that refines the instance boundary by fusing hierarchical appearance embedding information in a coarse-to-fine manner. For the mixing distribution, we estimate the distribution parameters with a factorized conditional normalizing flow to improve segmentation performance. Comprehensive qualitative, quantitative, and ablation experiments on three representative video instance segmentation benchmarks (YouTube-VIS19, YouTube-VIS21, and OVIS) demonstrate the effectiveness of the proposed method. More impressively, the superior performance of our model on an unsupervised video object segmentation dataset (DAVIS19) shows its generalizability. Our implementation is publicly available at https://github.com/zyqin19/HEVis.
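The coarse-to-fine idea in the abstract above can be illustrated with a toy sketch: a single Gaussian first gates pixel embeddings toward an instance, then a mixture over hierarchical appearance components refines the score near the boundary. This is a hypothetical simplification with fixed, hand-supplied parameters; the paper's actual method estimates the mixture parameters with a conditional normalizing flow, which is not reproduced here.

```python
import numpy as np

def gaussian_logpdf(x, mean, var):
    """Log-density of an isotropic Gaussian over embedding vectors."""
    d = x.shape[-1]
    return -0.5 * (np.sum((x - mean) ** 2, axis=-1) / var
                   + d * np.log(2 * np.pi * var))

def coarse_to_fine_score(pixel_emb, inst_mean, inst_var,
                         comp_means, comp_vars, weights):
    """Toy coarse-to-fine instance scoring: a coarse Gaussian localizes the
    instance, and a mixture over appearance components sharpens the boundary.
    Higher scores mean the pixel more likely belongs to the instance."""
    coarse = np.exp(gaussian_logpdf(pixel_emb, inst_mean, inst_var))   # coarse localization
    fine = sum(w * np.exp(gaussian_logpdf(pixel_emb, m, v))            # mixture refinement
               for w, m, v in zip(weights, comp_means, comp_vars))
    return coarse * fine  # fuse coarse and fine evidence
```

With a pixel embedding near the instance mean and one far away, the near pixel receives a strictly higher fused score, mirroring the coarse-then-refine assignment described in the abstract.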
Funding: This research was financially supported by the Ministry of Small and Medium-sized Enterprises (SMEs) and Startups (MSS), Korea, under the "Regional Specialized Industry Development Program (R&D, S3091627)" supervised by the Korea Institute for Advancement of Technology (KIAT).
Abstract: A wide range of camera apps and online video conferencing services support changing the background in real time for aesthetic, privacy, and security reasons. Numerous studies show that Deep Learning (DL) is a suitable option for human segmentation, and that an ensemble of multiple DL-based segmentation models can improve the segmentation result. However, these approaches are not as effective when applied directly to image segmentation in a video. This paper proposes an Adaptive N-Frames Ensemble (AFE) approach for high-movement human segmentation in a video using an ensemble of multiple DL models. In contrast to a conventional ensemble, which executes multiple DL models simultaneously on every video frame, the proposed AFE approach executes only a single DL model on the current frame. It combines the segmentation outputs of previous frames into the final segmentation output when the frame difference is below a particular threshold. Our method builds on the N-Frames Ensemble (NFE) method, which ensembles the image segmentations of the current video frame and previous video frames. However, NFE is suitable neither for segmenting fast-moving objects in a video nor for videos with low frame rates. The proposed AFE approach addresses these limitations of NFE. Our experiments use three human segmentation models, namely the Fully Convolutional Network (FCN), DeepLabv3, and MediaPipe. We evaluated our approach on 1711 videos of the TikTok50f dataset with a single-person view. The TikTok50f dataset is a reconstructed version of the publicly available TikTok dataset, obtained by cropping, resizing, and dividing it into videos of 50 frames each. This paper compares the proposed AFE with single models, a Two-Models Ensemble, and NFE models. The experimental results show that the proposed AFE is suitable for low-movement as well as high-movement human segmentation in a video.
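The per-frame decision rule described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the frame-difference metric, the majority-vote fusion, and the threshold value are all assumptions made for the sketch; `run_model` stands in for any single segmentation model (e.g. FCN, DeepLabv3, or MediaPipe).

```python
import numpy as np

def frame_difference(prev, curr):
    """Mean absolute pixel difference between consecutive frames, scaled to 0..1."""
    return np.mean(np.abs(curr.astype(float) - prev.astype(float))) / 255.0

def afe_step(curr_frame, prev_frame, run_model, prev_masks, threshold=0.05):
    """One AFE step: run a single DL model on the current frame; if the scene is
    nearly static (difference below threshold), fuse the cached masks of previous
    frames with the fresh mask by majority vote; otherwise keep only the fresh
    mask, since stale masks would lag a fast-moving subject."""
    mask = run_model(curr_frame)  # only one model executes per frame
    if prev_frame is not None and frame_difference(prev_frame, curr_frame) < threshold:
        stacked = np.stack(prev_masks + [mask])          # reuse history for the ensemble
        return (stacked.mean(axis=0) > 0.5).astype(np.uint8)
    return mask                                          # high movement: fresh mask only
```

The adaptivity is entirely in the threshold test: a static scene gets the accuracy benefit of an ensemble over time, while a high-movement scene falls back to the latest single-model mask, avoiding the ghosting that a fixed N-frame ensemble produces on fast motion or low frame rates.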