Funding: Supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2018R1A5A7059549).
Abstract: The generation of high-quality, realistic face images has emerged as a key field of research in computer vision. This paper proposes a robust approach that combines a Super-Resolution Generative Adversarial Network (SRGAN) with a Pyramid Attention Module (PAM) to enhance the quality of deep face generation. The SRGAN framework is designed to improve the resolution of generated images, addressing common challenges such as blurriness and a lack of intricate detail. The Pyramid Attention Module complements this process by focusing on multi-scale feature extraction, enabling the network to capture finer details and complex facial features more effectively. The proposed method was trained and evaluated over 100 epochs on the CelebA dataset, demonstrating consistent improvements in image quality and a marked decrease in generator and discriminator losses, reflecting the model's capacity to learn and synthesize high-quality images effectively given adequate computational resources. Experimental outcomes show that the SRGAN model with the PAM outperforms the alternatives, yielding an aggregate discriminator loss of 0.055 on real images and 0.043 on fake images, and a generator loss of 10.58 after training for 100 epochs. The model achieves a structural similarity index measure (SSIM) of 0.923, outperforming the other models considered in this study.
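To make the attention idea concrete, the following PyTorch snippet sketches what a pyramid attention module can look like: channel gating computed from features pooled at several scales. The module name, pooling scales, and layer sizes here are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of a pyramid attention module for an SRGAN generator.
# Pooling scales and layer sizes are assumptions, not the paper's settings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidAttention(nn.Module):
    """Gates features with attention computed from a multi-scale pooling pyramid."""

    def __init__(self, channels: int, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        # One 1x1 conv per pyramid level produces attention logits.
        self.convs = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=1) for _ in scales
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        logits = torch.zeros_like(x)
        for scale, conv in zip(self.scales, self.convs):
            # Pool to a coarser grid, transform, then upsample back to (h, w).
            pooled = F.adaptive_avg_pool2d(x, (max(h // scale, 1), max(w // scale, 1)))
            logits = logits + F.interpolate(conv(pooled), size=(h, w),
                                            mode="bilinear", align_corners=False)
        # Sigmoid gate lets the network emphasize multi-scale facial detail.
        return x * torch.sigmoid(logits)
```

Inserted after a generator's residual stages, such a block would let super-resolution features attend to facial structure at several granularities.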
Funding: National Natural Science Foundation of China (No. 62006039); National Key Research and Development Program of China (No. 2019YFE0190500).
Abstract: Near infrared-visible (NIR-VIS) face recognition is the task of matching an NIR face image to a VIS image. Its main challenges are the gap caused by the cross-modality difference and the lack of sufficient paired NIR-VIS face images for training. This paper focuses on the generation of paired NIR-VIS face images and proposes a dual variational generator based on ResNeSt (RS-DVG). RS-DVG can generate a large number of paired NIR-VIS face images from noise, and these generated images can be used as a training set together with real NIR-VIS face images. In addition, a triplet loss function is introduced, and a novel triplet selection method is proposed specifically for training the face recognition model, which maximizes the inter-class distance and minimizes the intra-class distance of the input face images. The proposed method was evaluated on the CASIA NIR-VIS 2.0 and BUAA-VisNir datasets and obtained relatively good results.
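The triplet objective itself is standard; a minimal sketch over face embeddings follows, where the margin and the way triplets are drawn are placeholders rather than the paper's selection method.

```python
# Minimal triplet loss sketch on face embeddings. The margin is a placeholder;
# the paper's novel triplet selection strategy is not reproduced here.
import torch
import torch.nn.functional as F

def triplet_loss(anchor: torch.Tensor, positive: torch.Tensor,
                 negative: torch.Tensor, margin: float = 0.3) -> torch.Tensor:
    """Pulls same-identity embeddings together and pushes different ones apart."""
    d_pos = F.pairwise_distance(anchor, positive)  # intra-class distance
    d_neg = F.pairwise_distance(anchor, negative)  # inter-class distance
    return F.relu(d_pos - d_neg + margin).mean()
```

Hard triplets, those with small inter-class and large intra-class distances, dominate this loss, which is why the choice of triplet selection method matters.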
Funding: Sponsored by the Collaborative Education Projects Between Industry and Academia of the Ministry of Education (Grant No. 230801065261444) and the Humanities and Social Sciences Pre-Research Fund Project of Zhejiang University of Technology (Grant No. SKY-ZX-20220207).
Abstract: Synthesizing a real-time, high-resolution, lip-sync digital human is a challenging task. Although the Wav2Lip model represents a remarkable advancement in real-time lip-sync, its clarity is still limited. To address this, we enhanced the Wav2Lip model in this study and trained it on a high-resolution video dataset produced in our laboratory. Experimental results indicate that the improved Wav2Lip model produces digital humans with greater clarity than the original model, while maintaining its real-time performance and accurate lip-sync. We implemented the improved Wav2Lip model in a government interface application, generating a government digital human. Testing revealed that this government digital human can interact seamlessly with users in real time, delivering clear visuals and synthesized speech that closely resembles a human voice.
Funding: Project supported by the National Natural Science Foundation of China (Nos. 51176061 and 51006043), the Research Foundation for Outstanding Young Teachers of Huazhong University of Science and Technology (No. 2012QN168), and the Research Fund for the Doctoral Program of Higher Education of China (No. 20100142120048).
Abstract: The flow over a backward-facing step (BFS) has been taken as a useful prototype for investigating the intrinsic mechanisms of separated flow with heat transfer. To date, however, the open literature contains no study of the effect of the Richardson number on entropy generation over the BFS, although both the flow pattern and the heat transfer characteristics are significantly influenced by variation of the Richardson number in many practical applications, such as microelectromechanical systems and aircraft. The effect of the Richardson number on entropy generation in the BFS flow is reported in this paper for the first time. The entropy generation analysis is conducted by numerically solving the entropy generation equation. The velocity and temperature fields, which are the inputs to the entropy generation equation, are evaluated by the lattice Boltzmann method. It is found that the distributions of the local entropy generation number and the Bejan number are significantly influenced by variation of the Richardson number. The total entropy generation number is a monotonically decreasing function of the Richardson number, whereas the average Bejan number is a monotonically increasing function of it.
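For reference, the standard dimensional forms of the quantities involved, as commonly defined in entropy generation analyses (the paper's exact nondimensionalization may differ), are:

```latex
% Standard definitions; the paper's nondimensional forms may differ.
\[
  \dot{S}'''_{\mathrm{gen}}
    = \underbrace{\frac{k}{T^{2}}\,\lvert \nabla T \rvert^{2}}_{\text{heat transfer}}
    + \underbrace{\frac{\mu}{T}\,\Phi}_{\text{fluid friction}},
  \qquad
  \mathrm{Be} = \frac{\dot{S}_{\mathrm{heat}}}{\dot{S}_{\mathrm{heat}} + \dot{S}_{\mathrm{friction}}},
  \qquad
  \mathrm{Ri} = \frac{\mathrm{Gr}}{\mathrm{Re}^{2}},
\]
where $k$ is the thermal conductivity, $\mu$ the dynamic viscosity, $\Phi$ the
viscous dissipation function, and $\mathrm{Gr}$ and $\mathrm{Re}$ the Grashof
and Reynolds numbers.
```

A Bejan number near 1 thus indicates irreversibility dominated by heat transfer, and a value near 0 indicates irreversibility dominated by fluid friction.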
Funding: Supported by the National Natural Science Foundation of China (Nos. 62201342 and 62101325) and the Shanghai Municipal Science and Technology Major Project (No. 2021SHZDZX0102).
Abstract: Conversation is an essential component of virtual avatar activities in the metaverse. With the development of natural language processing, significant breakthroughs have been made in text and voice conversation generation. However, face-to-face conversations account for the vast majority of daily conversations, while most existing methods focus on single-person talking head generation. In this work, we take a step further and consider generating realistic face-to-face conversation videos. Conversation generation is more challenging than single-person talking head generation because it requires not only photo-realistic individual talking heads but also the listener's response to the speaker. In this paper, we propose a novel unified framework based on the neural radiance field (NeRF) to address these challenges. Specifically, we model both the speaker and the listener with a NeRF framework under different conditions to control individual expressions. The speaker is driven by the audio signal, while the response of the listener depends on both visual and acoustic information. In this way, face-to-face conversation videos are generated between human avatars, with all the interlocutors modeled within the same network. Moreover, to facilitate future research on this task, we collected a new human conversation dataset containing 34 video clips. Quantitative and qualitative experiments evaluate our method in different aspects, e.g., image quality, pose sequence trend, and natural rendering of the scene in the generated videos. Experimental results demonstrate that the avatars in the resulting videos can carry on a realistic conversation and maintain individual styles.
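The conditioning idea can be sketched as a radiance field that takes an extra condition vector per interlocutor: audio features drive the speaker, while audio plus visual features drive the listener. The PyTorch sketch below is illustrative; the positional-encoding sizes and layer widths are assumptions, not the paper's architecture.

```python
# Illustrative condition-driven NeRF field; encoding sizes and widths are
# assumptions. `cond` would hold audio features (speaker) or fused
# audio-visual features (listener).
import torch
import torch.nn as nn

class ConditionalNeRF(nn.Module):
    def __init__(self, pos_dim: int = 63, dir_dim: int = 27,
                 cond_dim: int = 64, width: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim + cond_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        self.sigma = nn.Linear(width, 1)   # volume density
        self.rgb = nn.Sequential(          # view- and condition-dependent color
            nn.Linear(width + dir_dim, width // 2), nn.ReLU(),
            nn.Linear(width // 2, 3), nn.Sigmoid(),
        )

    def forward(self, x_enc, d_enc, cond):
        # Density and color at an encoded point, steered by the condition.
        h = self.trunk(torch.cat([x_enc, cond], dim=-1))
        return self.sigma(h), self.rgb(torch.cat([h, d_enc], dim=-1))
```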
Funding: This work is supported by the National Key Research and Development Program of China (2018YFF0214700).
Abstract: Recent studies have shown remarkable success in the face image generation task. However, existing approaches offer limited diversity, quality, and controllability in their results. To address these issues, we propose a novel end-to-end learning framework that generates diverse, realistic, and controllable face images guided by face masks. The face mask provides a good geometric constraint for a face by specifying the size and location of its different components, such as the eyes, nose, and mouth. The framework consists of four components: a style encoder, a style decoder, a generator, and a discriminator. The style encoder produces a style code that represents the style of the resulting face; the generator translates the input face mask into a real face based on the style code; the style decoder learns to reconstruct the style code from the generated face image; and the discriminator classifies an input face image as real or fake. With the style code, the proposed model can generate different face images matching the input face mask, and by manipulating the face mask we can finely control the generated face image. We empirically demonstrate the effectiveness of our approach on the mask-guided face image synthesis task.
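To show how the four components interact, the sketch below wires them into a single generator-side training step; the loss form and equal weighting are illustrative assumptions, not the paper's objective.

```python
# Sketch of one generator-side step of the four-component framework.
# Losses and weights are assumptions, not the paper's objective.
import torch
import torch.nn.functional as F

def generator_step(style_encoder, style_decoder, generator, discriminator,
                   face: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    s = style_encoder(face)            # style code from a reference face
    fake = generator(mask, s)          # render a face obeying the mask geometry
    s_rec = style_decoder(fake)        # recover the style code from the result
    style_loss = F.mse_loss(s_rec, s)  # ties output style to the input code
    adv_loss = F.softplus(-discriminator(fake)).mean()  # non-saturating GAN loss
    return adv_loss + style_loss
```

Sampling a different style code with the same mask changes appearance while the geometry stays fixed, which is the source of the claimed controllability.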
Abstract: Generating emotional talking faces from a single portrait image remains a significant challenge. Achieving expressive emotional talking and accurate lip-sync simultaneously is particularly difficult, as expressiveness is often compromised for lip-sync accuracy. Prevailing generative works usually struggle to jointly produce subtle variations of emotional expression and lip-synchronized talking. To address these challenges, we propose MagicTalk, a unified framework that models the implicit and explicit correlations between audio and emotional talking faces. As human emotional expressions usually have subtle, implicit relations with speech audio, we incorporate audio and emotional style embeddings into the diffusion-based generation process for realistic generation that concentrates on emotional expressions. We then propose lip-based explicit correlation learning to construct a strong mapping from audio to lip motions, ensuring lip-audio synchronization. Furthermore, we deploy a video-to-video rendering module to transfer expressions and lip motions from a proxy 3D avatar to an arbitrary portrait. Both quantitatively and qualitatively, MagicTalk outperforms state-of-the-art methods in expressiveness, lip-sync, and perceptual quality.
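The embedding-conditioning step can be sketched as a small fusion module whose output steers the diffusion denoiser at each step; the dimensions and fusion architecture below are assumptions, not MagicTalk's implementation.

```python
# Illustrative fusion of audio and emotional style embeddings into one
# conditioning vector for a diffusion denoiser; dimensions are assumptions.
import torch
import torch.nn as nn

class ConditionFusion(nn.Module):
    def __init__(self, audio_dim: int = 128, emo_dim: int = 32,
                 cond_dim: int = 256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim + emo_dim, cond_dim), nn.SiLU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, audio_emb: torch.Tensor,
                emo_emb: torch.Tensor) -> torch.Tensor:
        # The denoiser would consume this vector alongside the noisy sample
        # and timestep, so emotion and speech jointly steer each step.
        return self.proj(torch.cat([audio_emb, emo_emb], dim=-1))
```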
Funding: Partially funded by the National Key Research and Development Program of China (Grant No. 2020AAA0140004).
Abstract: Existing lip synchronization (lip-sync) methods generate accurately synchronized mouths and faces in a generated video. However, they still suffer from artifacts in regions of non-interest (RONI), e.g., the background and other parts of the face, which decrease the overall visual quality. To solve these problems, we innovatively introduce diverse image inpainting to lip-sync generation. We propose the Modulated Inpainting Lip-sync GAN (MILG), an audio-constrained inpainting network that predicts synchronous mouths. MILG uses prior knowledge of the RONI and audio sequences to predict lip shape instead of generating whole images, which keeps the RONI consistent. Specifically, we integrate modulated spatially probabilistic diversity normalization (MSPD Norm) into our inpainting network, which helps it generate fine-grained, diverse mouth movements guided by continuous audio features. Furthermore, to lower the training overhead, we modify the contrastive loss in lip-sync to support small-batch-size and few-sample training. Extensive experiments demonstrate that our approach outperforms the existing state of the art in image quality and authenticity while preserving lip-sync.
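The inpainting formulation can be sketched by masking only the mouth region and handing the network the untouched remainder of the frame together with the mask; the input layout below is an illustrative assumption, not MILG's exact interface.

```python
# Sketch of an audio-constrained inpainting input: only the mouth region is
# predicted, so regions of non-interest stay verbatim. The mask layout is an
# illustrative assumption.
import torch

def build_inpainting_input(frame: torch.Tensor,
                           mouth_mask: torch.Tensor) -> torch.Tensor:
    """frame: (B, 3, H, W); mouth_mask: (B, 1, H, W) with 1 inside the mouth."""
    masked = frame * (1.0 - mouth_mask)            # erase only the mouth region
    return torch.cat([masked, mouth_mask], dim=1)  # mask channel marks the hole
```

Because everything outside the mask passes through unchanged, background artifacts cannot be introduced, which is the point of predicting lip shape by inpainting rather than regenerating the whole frame.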