Funding: Supported by the Natural Science Foundation of Shandong Province of China under Grant No. ZR2023MF041, the National Natural Science Foundation of China under Grant No. 62072469, the Shandong Data Open Innovative Application Laboratory, the Spanish Ministry of Economy and Competitiveness (MINECO), and the European Regional Development Fund (ERDF) under Project No. PID2020-120611RBI00/AEI/10.13039/501100011033.
Abstract: Facial expression generation from pure textual descriptions is widely applied in human-computer interaction, computer-aided design, assisted education, etc. However, this task is challenging due to the intricate facial structure and the complex mapping between texts and images. Existing methods face limitations in generating high-resolution images or in capturing diverse facial expressions. In this study, we propose a novel generation approach, named FaceCLIP, to tackle these problems. The proposed method utilizes a CLIP-based multi-stage generative adversarial model to produce vivid facial expressions at high resolutions. With strong semantic priors from multi-modal textual and visual cues, the proposed method effectively disentangles facial attributes, enabling attribute editing and semantic reasoning. To facilitate text-to-expression generation, we build a new dataset, the FET dataset, which contains facial expression images and corresponding textual descriptions. Experiments on this dataset demonstrate improved image quality and semantic consistency compared with state-of-the-art methods.
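The coarse-to-fine idea behind a CLIP-conditioned multi-stage generator can be illustrated with a minimal sketch. This is a hypothetical toy, not the authors' implementation: the real CLIP text encoder, the adversarial discriminators, and learned upsampling layers are replaced here by a hashing-based text embedding, a nearest-neighbour upsampler, and a trivial scalar modulation, so that only the staged pipeline structure (text embedding conditioning each resolution stage) is shown. All function names and dimensions are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 64            # assumed text-embedding size (stand-in for CLIP's 512)
STAGE_RES = [16, 32, 64]  # each stage doubles the spatial resolution

def encode_text(tokens: list) -> np.ndarray:
    """Stand-in for a CLIP text encoder: hash tokens into a fixed-size vector."""
    vec = np.zeros(EMBED_DIM)
    for t in tokens:
        vec[hash(t) % EMBED_DIM] += 1.0
    return vec / max(np.linalg.norm(vec), 1e-8)

def stage(image: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """One refinement stage: upsample 2x, then modulate by the text embedding."""
    up = image.repeat(2, axis=0).repeat(2, axis=1)  # nearest-neighbour 2x upsample
    gain = 1.0 + 0.1 * text_emb.mean()              # toy text conditioning
    return np.tanh(up * gain)                       # keep values in [-1, 1]

def generate(description: str, z: np.ndarray) -> np.ndarray:
    """Run the noise seed through all stages, conditioned on the text."""
    text_emb = encode_text(description.split())
    img = z.reshape(8, 8)                           # 8x8 seed from the noise vector
    for _ in STAGE_RES:                             # 8 -> 16 -> 32 -> 64
        img = stage(img, text_emb)
    return img

z = rng.standard_normal(64)
out = generate("a smiling young woman with raised eyebrows", z)
print(out.shape)  # (64, 64)
```

In the actual multi-stage GAN setting, each stage would instead be a trained generator block with its own discriminator enforcing realism and text-image consistency at that resolution; the sketch only mirrors the data flow.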