Funding: Supported by the National Key R&D Program of China (No. 2022YFF0902302) and the National Natural Science Foundation of China (Grant Nos. 62172357 & 62322209).
Abstract: We present an animatable 3D Gaussian representation for synthesizing high-fidelity human videos under novel views and poses in real time. Given multi-view videos of a human subject, we learn a collection of 3D Gaussians in the canonical space of the rest pose. Each Gaussian is associated with a few basic properties (i.e., position, opacity, scale, rotation, and spherical harmonics coefficients) representing the average human appearance across all video frames, as well as a latent code and a set of blend weights for dynamic appearance correction and pose transformation. The latent code is fed to a multi-layer perceptron (MLP) together with a target pose to correct the Gaussians in the canonical space, capturing appearance changes under the target pose. The corrected Gaussians are then transformed to the target pose using linear blend skinning (LBS) with their blend weights. High-fidelity human images under novel views and poses can be rendered in real time through Gaussian splatting. Compared to state-of-the-art NeRF-based methods, our animatable Gaussian representation produces more compelling results with well-captured details and achieves superior rendering performance.
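To make the described pipeline concrete, below is a minimal PyTorch sketch of the two steps the abstract outlines: an MLP that corrects canonical Gaussians from a per-Gaussian latent code and a target pose, followed by linear blend skinning of the corrected positions. All names, tensor sizes, and the particular corrected attributes (CorrectionMLP, lbs_transform, NUM_JOINTS, LATENT_DIM, a position/color correction) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of pose-dependent correction + LBS for canonical 3D Gaussians.
# Shapes, module sizes, and names are assumptions for illustration only.
import torch
import torch.nn as nn

NUM_GAUSSIANS = 100_000
NUM_JOINTS = 24            # e.g., an SMPL-like skeleton (assumption)
LATENT_DIM = 32
POSE_DIM = NUM_JOINTS * 3  # axis-angle pose vector (assumption)


class CorrectionMLP(nn.Module):
    """Maps (per-Gaussian latent code, target pose) to corrections of the
    canonical Gaussian properties (here: position and color offsets)."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + POSE_DIM, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 3 + 3),  # delta position + delta color (assumption)
        )

    def forward(self, latent, pose):
        pose = pose.expand(latent.shape[0], -1)        # broadcast pose to all Gaussians
        delta = self.net(torch.cat([latent, pose], dim=-1))
        return delta[:, :3], delta[:, 3:]


def lbs_transform(positions, blend_weights, bone_transforms):
    """Linear blend skinning: blend per-joint rigid transforms by the Gaussians'
    skinning weights and apply the result to canonical positions."""
    # positions: (N, 3), blend_weights: (N, J), bone_transforms: (J, 4, 4)
    blended = torch.einsum('nj,jab->nab', blend_weights, bone_transforms)  # (N, 4, 4)
    homo = torch.cat([positions, torch.ones_like(positions[:, :1])], dim=-1)
    return torch.einsum('nab,nb->na', blended, homo)[:, :3]


# Canonical-space Gaussian attributes (toy initialization)
positions = torch.randn(NUM_GAUSSIANS, 3)
latent_codes = torch.randn(NUM_GAUSSIANS, LATENT_DIM)
blend_weights = torch.softmax(torch.randn(NUM_GAUSSIANS, NUM_JOINTS), dim=-1)

target_pose = torch.randn(1, POSE_DIM)
bone_transforms = torch.eye(4).repeat(NUM_JOINTS, 1, 1)   # identity bones for the demo

mlp = CorrectionMLP()
dpos, dcolor = mlp(latent_codes, target_pose)             # pose-dependent correction
posed_positions = lbs_transform(positions + dpos, blend_weights, bone_transforms)
# posed_positions would then be rendered with a Gaussian splatting rasterizer.
```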
Abstract: In this paper, we present an efficient algorithm that generates lip-synchronized facial animation from a given vocal audio clip. By combining a spectral-dimensional bidirectional long short-term memory network with a temporal attention mechanism, we design a lightweight speech encoder that learns useful and robust vocal features from the input audio without resorting to pre-trained speech recognition modules or large training data. To learn subject-independent facial motion, we use deformation gradients as the internal representation, which allows nuanced local motions to be synthesized better than with vertex offsets. Compared with state-of-the-art automatic-speech-recognition-based methods, our model is much smaller but achieves similar robustness and quality most of the time, and noticeably better results in certain challenging cases.
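The following is a hedged PyTorch sketch of a speech encoder in the spirit described above: a bidirectional LSTM run along the spectral axis of each audio frame, followed by attention over the temporal axis. The mel-spectrogram front end, layer sizes, and the pooling-by-learned-query design are assumptions for illustration, not the paper's architecture.

```python
# Sketch of a lightweight speech encoder: BiLSTM over each frame's spectral bins,
# then attention over time. All sizes and design details are assumptions.
import torch
import torch.nn as nn

N_MELS = 80          # spectral bins per frame (assumption)
HIDDEN = 64


class SpectralBiLSTMEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # BiLSTM over the spectral dimension: each mel bin is one "step".
        self.spectral_lstm = nn.LSTM(input_size=1, hidden_size=HIDDEN,
                                     bidirectional=True, batch_first=True)
        # Scaled dot-product style attention over the temporal axis.
        self.query = nn.Parameter(torch.randn(2 * HIDDEN))
        self.proj = nn.Linear(2 * HIDDEN, 2 * HIDDEN)

    def forward(self, mel):                        # mel: (B, T, N_MELS)
        B, T, F = mel.shape
        x = mel.reshape(B * T, F, 1)               # each frame's bins form a sequence
        _, (h, _) = self.spectral_lstm(x)          # h: (2, B*T, HIDDEN)
        frame_feat = h.transpose(0, 1).reshape(B, T, 2 * HIDDEN)
        # Temporal attention: weight frames by similarity to a learned query.
        scores = torch.einsum('btd,d->bt', self.proj(frame_feat), self.query)
        weights = torch.softmax(scores / (2 * HIDDEN) ** 0.5, dim=-1)
        context = torch.einsum('bt,btd->bd', weights, frame_feat)
        return frame_feat, context                 # per-frame features + pooled vector


encoder = SpectralBiLSTMEncoder()
mel = torch.randn(2, 50, N_MELS)                   # 2 clips, 50 frames each (toy input)
frame_feat, context = encoder(mel)
print(frame_feat.shape, context.shape)             # (2, 50, 128) (2, 128)
```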
Funding: Supported by the National Natural Science Foundation of China (Grant No. 61772462) and the 100 Talents Program of Zhejiang University.
Abstract: Unsupervised image translation (UIT) studies the mapping between two image domains. Since such mappings are under-constrained, existing research has pursued various desirable properties such as distributional matching or two-way consistency. In this paper, we re-examine UIT from a new perspective: distributional semantics consistency, based on the observation that data variations contain semantics, e.g., shoes varying in color. Further, the semantics can be multi-dimensional, e.g., shoes also varying in style, functionality, etc. Given two image domains, matching these semantic dimensions during UIT produces mappings with explicable correspondences, which has not been investigated previously. We propose distributional semantics mapping (DSM), the first UIT method that explicitly matches semantics between two domains. We show that distributional semantics has rarely been considered within and beyond UIT, even though it is a common problem in deep learning. We evaluate DSM on several benchmark datasets, demonstrating its general ability to capture distributional semantics. Extensive comparisons show that DSM not only produces explicable mappings but also improves image quality in general.
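As a rough illustration of what matching a semantic dimension's distribution across domains can look like, the toy sketch below aligns the per-dimension means and variances of two domains' latent codes. This is only an assumed stand-in for the general idea of distributional semantics matching; it is not the DSM objective proposed in the paper.

```python
# Illustrative toy example: encourage two domains' latent semantic dimensions to
# share a distribution via per-dimension moment matching. Hedged sketch only;
# the latent codes, dimensions, and loss are assumptions, not the paper's method.
import torch


def moment_matching_loss(z_a, z_b):
    """Match mean and variance of each latent semantic dimension across domains."""
    mean_diff = (z_a.mean(dim=0) - z_b.mean(dim=0)).pow(2).sum()
    var_diff = (z_a.var(dim=0) - z_b.var(dim=0)).pow(2).sum()
    return mean_diff + var_diff


# Toy latent codes: 256 images per domain, 8 semantic dimensions (e.g., color, style).
z_domain_a = torch.randn(256, 8)
z_domain_b = 0.5 * torch.randn(256, 8) + 1.0
loss = moment_matching_loss(z_domain_a, z_domain_b)
print(loss.item())   # such a term could be added to a translation objective
```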