Funding: supported by the National Natural Science Foundation of China (Nos. NSFC 61925105, 62322109, 62171257 and U22B2001); the Xplorer Prize in Information and Electronics Technologies; and the Tsinghua University (Department of Electronic Engineering)-Nantong Research Institute for Advanced Communication Technologies Joint Research Center for Space, Air, Ground and Sea Cooperative Communication Network Technology.
Abstract: Multimedia semantic communication has been receiving increasing attention due to its significant enhancement of communication efficiency. Semantic coding, which extracts and encodes the key semantics of video for transmission, is a key aspect of the multimedia semantic communication framework. In this paper, we propose a low-bitrate facial video semantic coding method based on the temporal continuity of video semantics. At the sender's end, we selectively transmit facial keypoints and deformation information, allocating distinct bitrates to different keypoints across frames. Compression techniques involving sampling and quantization reduce the bitrate while retaining the key facial semantic information. At the receiver's end, a GAN-based generative network is used for reconstruction, effectively mitigating the block artifacts and buffering problems of traditional codecs at low bitrates. The performance of the proposed approach is validated on multiple datasets, such as VoxCeleb and TalkingHead-1kH, using metrics such as LPIPS, DISTS, and AKD. Experimental results demonstrate significant advantages over traditional codecs, achieving up to an approximately 10-fold bitrate reduction in prolonged, stable head-pose scenarios across diverse conversational video settings.
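The abstract does not detail the sampling-and-quantization scheme, so the following is only a minimal sketch of the general idea: uniformly quantizing normalized facial keypoint coordinates at a per-keypoint bit depth (more bits for semantically important points) and counting the resulting payload. The keypoint values and bit allocations are hypothetical.

```python
def quantize_keypoints(keypoints, bits_per_coord):
    """Uniformly quantize normalized (x, y) keypoints in [0, 1].

    keypoints: list of (x, y) tuples; bits_per_coord: one bit depth per
    keypoint. Returns (integer codes, total payload size in bits).
    """
    codes, total_bits = [], 0
    for (x, y), b in zip(keypoints, bits_per_coord):
        levels = (1 << b) - 1                      # largest integer code
        codes.append((round(x * levels), round(y * levels)))
        total_bits += 2 * b                        # two coordinates per point
    return codes, total_bits

def dequantize(codes, bits_per_coord):
    """Map integer codes back to approximate coordinates in [0, 1]."""
    return [(cx / ((1 << b) - 1), cy / ((1 << b) - 1))
            for (cx, cy), b in zip(codes, bits_per_coord)]

# Hypothetical example: 3 keypoints, the first allocated more bits.
kps = [(0.25, 0.75), (0.5, 0.5), (0.9, 0.1)]
codes, nbits = quantize_keypoints(kps, [10, 6, 6])
recon = dequantize(codes, [10, 6, 6])
print(nbits)  # → 44
```

Allocating unequal bit depths across keypoints is what lets a per-frame budget favor the most expressive facial regions.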
Funding: supported by the Key Research Program of the Chinese Academy of Sciences (grant number ZDRW-ZS-2021-1-2).
Abstract: Pulse rate is one of the important characteristics in traditional Chinese medicine pulse diagnosis and is of great significance for determining the cold or heat nature of diseases. Predicting pulse rate from facial video is an exciting research direction for obtaining palpation information through observation. However, most studies focus on optimizing algorithms on small samples of participants without systematically investigating multiple influencing factors. A total of 209 participants and 2,435 facial videos, drawn from our self-constructed Multi-Scene Sign Dataset and public datasets, were used to perform a multi-level, multi-factor comprehensive comparison. The effects of different datasets, blood volume pulse (BVP) signal extraction algorithms, regions of interest, time windows, color spaces, pulse rate calculation methods, and video recording scenes were analyzed. Furthermore, we proposed a BVP signal quality optimization strategy based on the inverse Fourier transform and an improved pulse rate estimation strategy based on signal-to-noise ratio threshold sliding. We found that video-based pulse rate estimation performed better on the Multi-Scene Sign Dataset and the Pulse Rate Detection Dataset than on other datasets. Compared with the Fast independent component analysis and Single Channel algorithms, the chrominance-based and plane-orthogonal-to-skin algorithms showed stronger anti-interference ability and higher robustness. The five-organ fusion area and the full-face area outperformed single sub-regions, and fewer motion artifacts and better lighting improved the precision of pulse rate estimation.
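The abstract names an inverse-Fourier-transform quality optimization but does not specify it; a common reading of the idea, sketched below under that assumption, is to zero FFT bins outside a plausible pulse band (0.7-3 Hz here, an assumed range) and inverse-transform, then read the pulse rate off the dominant remaining frequency. The synthetic BVP signal is fabricated for illustration.

```python
import numpy as np

def bandpass_ifft(signal, fs, f_lo=0.7, f_hi=3.0):
    """Zero FFT bins outside [f_lo, f_hi] Hz, then inverse-transform."""
    spec = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    spec[(freqs < f_lo) | (freqs > f_hi)] = 0.0
    return np.fft.irfft(spec, n=len(signal))

def pulse_rate_bpm(signal, fs):
    """Pulse rate from the dominant in-band frequency, in beats/min."""
    clean = bandpass_ifft(signal, fs)
    spec = np.abs(np.fft.rfft(clean))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return 60.0 * freqs[np.argmax(spec)]

# Synthetic BVP: 1.2 Hz pulse (72 bpm) plus slow 0.2 Hz illumination drift.
fs = 30.0
t = np.arange(0, 20, 1 / fs)
bvp = np.sin(2 * np.pi * 1.2 * t) + 0.8 * np.sin(2 * np.pi * 0.2 * t)
print(round(pulse_rate_bpm(bvp, fs)))  # → 72
```

The drift component lies below the band and is removed, so the spectral peak used for the estimate comes from the pulse alone.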
Funding: supported by the National Key R&D Program of China (No. 2022YFF0902302); the National Natural Science Foundation of China (No. 62322209); RCUK grant CAMERA (Nos. EP/M023281/1, EP/T022523/1); and a gift from Adobe.
Abstract: The ability to exhibit appropriate emotions is crucial for the expressiveness and attractiveness of facial videos. However, it is difficult to control the level of emotion, even for experienced actors and amateur podcasters on social networks. In this study, we aim to solve the novel problem of semantically amplifying the emotions in a facial video. This poses new challenges for effectively editing a sequence of video frames in terms of face semantics, emotion adaptiveness, and temporal coherence. Our approach is based on semantic face editing in the disentangled latent space of a state-of-the-art StyleGAN model. We present a new face dataset with diverse emotions to fine-tune the pre-trained StyleGAN and improve the expressiveness of its original emotion-biased latent space. An emotion-editing subspace is constructed to allow adaptive emotion amplification while preserving other facial attributes. We further propose an effective stitching-tuning technique to ensure temporally coherent video frames. Our method produces plausible emotion amplification for a wide range of facial videos. Qualitative and quantitative evaluations demonstrate its advantages over baseline methods. The proposed dataset and research code will be made publicly available.
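The emotion-editing subspace itself is not specified in this abstract; as a schematic of the general latent-editing idea (not the authors' implementation), amplifying an emotion can be viewed as moving a latent code along an emotion direction while projecting out of the edit any component along directions tied to attributes that must be preserved. The 3-D latents and directions below are made up.

```python
def amplify(latent, emotion_dir, alpha, preserve_dirs=()):
    """Move `latent` along `emotion_dir`, scaled by `alpha`, after removing
    from the edit any component along the preserved-attribute directions
    (a schematic stand-in for an emotion-editing subspace)."""
    edit = list(emotion_dir)
    for d in preserve_dirs:                       # project out preserved attrs
        dot = sum(e * x for e, x in zip(edit, d))
        norm2 = sum(x * x for x in d)
        edit = [e - dot / norm2 * x for e, x in zip(edit, d)]
    return [w + alpha * e for w, e in zip(latent, edit)]

# Hypothetical 3-D latent; the third axis encodes identity and is preserved.
w = [0.2, -0.1, 0.5]
smile_dir = [1.0, 0.0, 1.0]
w_edit = amplify(w, smile_dir, alpha=2.0, preserve_dirs=[[0.0, 0.0, 1.0]])
print([round(v, 3) for v in w_edit])  # → [2.2, -0.1, 0.5]
```

Raising `alpha` strengthens the emotion; the projection step is what keeps the identity axis untouched.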
Abstract: Deepfake technology can be used to replace people's faces in videos or pictures, showing them saying or doing things they never said or did. Deepfake media are often used to extort, defame, and manipulate public opinion. However, despite deepfake technology's risks, current deepfake detection methods lack generalization and are inconsistent when applied to unknown videos, i.e., videos on which they have not been trained. The purpose of this study is to develop a generalizable deepfake detection model by training convolutional neural networks (CNNs) to classify human facial features in videos. The study formulated the research question: "How effectively does the developed model provide reliable generalizations?" A CNN model was trained to distinguish between real and fake videos using the facial features of human subjects. The model was trained, validated, and tested using the FaceForensics++ dataset, which contains more than 500,000 frames, and subsets of the DFDC dataset, totaling more than 22,000 videos. The study demonstrated high generalizability, as accuracy on the unknown dataset was only marginally (about 1%) lower than on the known dataset. The findings indicate that detection systems can be more generalizable, lighter, and faster by focusing on just a small region (the human face) of an entire video.
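The CNN in the study scores individual face crops, but the abstract does not say how frame scores become a video-level verdict; the helper below is a generic sketch of one common scheme, averaging per-frame fake probabilities and thresholding the mean. The probabilities and threshold are illustrative, not taken from the paper.

```python
def classify_video(frame_probs, threshold=0.5):
    """Aggregate per-frame 'fake' probabilities into a video-level label.

    frame_probs: iterable of CNN outputs in [0, 1], one per sampled face
    crop. Returns ('fake' | 'real', mean probability).
    """
    probs = list(frame_probs)
    if not probs:
        raise ValueError("no face crops were scored")
    mean_p = sum(probs) / len(probs)
    return ("fake" if mean_p >= threshold else "real"), mean_p

label, score = classify_video([0.9, 0.8, 0.7, 0.2])
print(label, round(score, 2))  # → fake 0.65
```

Averaging over many crops makes the verdict robust to the occasional frame where the detector is fooled.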
Abstract: Micro-expression spotting aims to locate expression intervals of subtle amplitude and short duration in video. The difficulty lies in effectively extracting dynamic correlation features among facial regions and multi-scale temporal features, so as to precisely capture the relationships among the small movements of the facial regions. To address these problems, a micro-expression spotting network fusing adaptive graph attention and multi-scale deformable dilated convolution (AG-DDNet) is proposed. Learnable parameter matrices perform the key-value feature transforms; a dynamic adjacency matrix is obtained by computing similarities between facial-region feature vectors, and a graph attention mechanism computes inter-region weight coefficients to achieve dynamic feature fusion. A multi-scale deformable dilated convolution module uses a predictor combining adaptive pooling and convolution to generate dynamic receptive fields for multi-scale feature extraction. A natural-gradient optimization mechanism based on the Fisher information matrix is introduced: the Fisher Adam optimizer captures the geometric structure of the parameter space and adaptively adjusts the learning rate, significantly enhancing the joint detection of micro- and macro-expressions. On the micro-expression spotting task, the algorithm improves performance over representative algorithms of the same class by 54.20% and 20.11% on the CAS(ME)2 and SAMM Long Videos datasets, respectively, and over the latest algorithms by 38.43% and 6.81%, demonstrating its superior performance on long-video micro-expression spotting.
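The dynamic-adjacency step can be illustrated in a simplified form; the sketch below (with the learnable key-value transforms omitted, and hypothetical region features) builds pairwise cosine similarities between facial-region feature vectors and turns each row into attention weights with a softmax, which is the generic graph-attention recipe rather than AG-DDNet's exact formulation.

```python
import math

def attention_weights(region_feats):
    """Pairwise cosine similarity between facial-region feature vectors,
    normalized per row with a softmax so each region's incoming weights
    sum to 1 (a simplified dynamic adjacency matrix)."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    sim = [[cos(a, b) for b in region_feats] for a in region_feats]
    weights = []
    for row in sim:                      # row-wise softmax
        e = [math.exp(s) for s in row]
        z = sum(e)
        weights.append([v / z for v in e])
    return weights

# Hypothetical features for three facial regions (e.g. brow, eye, mouth).
W = attention_weights([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
```

Because the similarities are recomputed from the current features, the adjacency adapts per sample, which is the point of making it dynamic.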
Abstract: Image photoplethysmography enables low-cost, easy-to-operate, non-contact heart rate detection from facial video, effectively overcoming the limitations of traditional contact methods in daily vital-sign monitoring. However, it is hard to obtain accurate heart rate measurements under conditions such as facial movement, weak ambient light, and long detection distance. In this article, a non-contact heart rate detection method based on face tracking is proposed, which effectively improves detection accuracy in practical applications. A corner-tracking algorithm tracks the face to reduce the motion artifacts caused by the subject's facial movement and to improve the usable quality of the signal, and a maximum ratio combining algorithm weights the pixel-wise pulse wave signals in the facial region of interest to improve pulse wave extraction accuracy. We analyzed facial images collected at different distances and in different motion states. The proposed method significantly reduces the error rate compared with the independent component analysis method. Theoretical analysis and experimental verification show that the method reduces the error rate under different experimental variables and agrees well with heart rate values collected by a medical physiological vest. This method will help improve the accuracy of non-contact heart rate detection in complex environments.
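Maximum ratio combining can be sketched in its textbook form, under the assumption that each pixel's pulse signal is weighted in proportion to its estimated SNR (the paper's SNR estimation itself is not described here and is omitted); the two pixel signals and SNR values below are fabricated.

```python
def max_ratio_combine(pixel_signals, snrs):
    """Schematic maximum ratio combining: each pixel's pulse wave is
    weighted in proportion to its estimated SNR, so cleaner pixels
    dominate the combined signal."""
    total = sum(snrs)
    if total <= 0:
        raise ValueError("need at least one pixel with positive SNR")
    weights = [s / total for s in snrs]
    n = len(pixel_signals[0])
    return [sum(w * sig[t] for w, sig in zip(weights, pixel_signals))
            for t in range(n)]

# Hypothetical 4-sample signals from two pixels; the second is noisier.
combined = max_ratio_combine(
    [[1.0, 0.0, -1.0, 0.0], [0.8, 0.3, -0.9, 0.1]],
    snrs=[3.0, 1.0])
print([round(v, 3) for v in combined])  # → [0.95, 0.075, -0.975, 0.025]
```

Compared with a plain spatial average, the SNR weighting suppresses pixels whose pulse signal is buried in noise.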
Abstract: Objective: To establish a localized Chinese Facial Expression Video System (CFEVS) to broaden the range of materials available for emotion research. Methods: Video clips of facial expressions of joy, sadness, surprise, fear, anger, disgust, and neutrality (both expressionless and chewing) were recorded at three intensity levels. After two rounds of preliminary screening, 50 Chinese university students rated the remaining clips by self-report for expression type, valence, arousal, and the performer's attractiveness. Clips with high inter-rater consistency in expression type, valence, and arousal, and whose expression type was consistent with their valence, were included in the CFEVS. Distribution analyses were performed, and the effects of rater gender and performer attractiveness on valence and arousal scores were analyzed. Results: The CFEVS included 61 joy clips (18 male, 43 female), 51 sadness clips (23 male, 28 female), 31 expressionless neutral clips (13 male, 17 female), and 24 chewing neutral clips (7 male, 17 female). Scatter plots showed that the CFEVS covered a wide range of valence and arousal. Analysis of variance indicated that the effects of rater gender and performer attractiveness on a clip's valence and arousal depended on its expression type. Conclusion: This study preliminarily established a CFEVS containing joy, sadness, and neutral expressions, and found that rater gender and performer attractiveness can affect experimental results.