Funding: supported by the National Natural Science Foundation of China (Nos. 61925105, 62322109, 62171257, and U22B2001), the Xplorer Prize in Information and Electronics technologies, and the Tsinghua University (Department of Electronic Engineering)-Nantong Research Institute for Advanced Communication Technologies Joint Research Center for Space, Air, Ground and Sea Cooperative Communication Network Technology.
Abstract: Multimedia semantic communication has been receiving increasing attention due to its significant enhancement of communication efficiency. Semantic coding, which is oriented towards extracting and encoding the key semantics of video for transmission, is a key aspect of the multimedia semantic communication framework. In this paper, we propose a low-bitrate facial video semantic coding method based on the temporal continuity of video semantics. At the sender's end, we selectively transmit facial keypoints and deformation information, allocating distinct bitrates to different keypoints across frames. Compression techniques involving sampling and quantization are employed to reduce the bitrate while retaining key facial semantic information. At the receiver's end, a GAN-based generative network is used for reconstruction, effectively mitigating the block artifacts and buffering problems present in traditional codecs at low bitrates. The performance of the proposed approach is validated on multiple datasets, including VoxCeleb and TalkingHead-1kH, using metrics such as LPIPS, DISTS, and AKD. Experimental results demonstrate significant advantages over traditional codecs, achieving up to an approximately 10-fold bitrate reduction in prolonged, stable head-pose scenarios across diverse conversational video settings.
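The bitrate savings in such a scheme come from transmitting a handful of quantized keypoint coordinates instead of pixel data. As a rough illustration of the sampling-and-quantization idea (not the paper's actual codec; the keypoint count, bit depth, and frame rate below are assumptions chosen for the example), uniformly quantizing normalized keypoint coordinates yields a per-frame budget of only a few hundred bits:

```python
import numpy as np

def quantize_keypoints(keypoints, bits=6):
    """Uniformly quantize normalized keypoint coordinates to `bits` bits each.

    keypoints: array of shape (K, 2) with values in [0, 1].
    Returns the integer codes and the reconstructed (dequantized) coordinates.
    """
    levels = (1 << bits) - 1
    codes = np.round(np.clip(keypoints, 0.0, 1.0) * levels).astype(np.int32)
    recon = codes / levels
    return codes, recon

def frame_bits(num_keypoints, bits_per_coord):
    # Two coordinates (x, y) per keypoint.
    return num_keypoints * 2 * bits_per_coord

# Example: 10 keypoints at 6 bits per coordinate in a 25 fps stream.
kp = np.random.default_rng(0).random((10, 2))
codes, recon = quantize_keypoints(kp, bits=6)
# Quantization error is bounded by half a quantization step.
assert np.abs(recon - kp).max() <= 0.5 / 63 + 1e-9
print(frame_bits(10, 6) * 25, "bits/s")  # 3000 bits/s, i.e. ~3 kbps
```

At these assumed settings the keypoint stream costs about 3 kbps, orders of magnitude below typical conversational video bitrates, which is where the reported ~10-fold savings become plausible.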
Funding: supported by the Key Research Program of the Chinese Academy of Sciences (grant number ZDRW-ZS-2021-1-2).
Abstract: Pulse rate is one of the important characteristics of traditional Chinese medicine pulse diagnosis, and it is of great significance for determining the cold or heat nature of diseases. Predicting pulse rate from facial video is an exciting research direction for obtaining palpation information through observational diagnosis. However, most studies focus on optimizing algorithms on small samples of participants without systematically investigating the multiple influencing factors. A total of 209 participants and 2,435 facial videos, drawn from our self-constructed Multi-Scene Sign Dataset and from public datasets, were used to perform a multi-level, multi-factor comprehensive comparison. The effects of different datasets, blood volume pulse signal extraction algorithms, regions of interest, time windows, color spaces, pulse rate calculation methods, and video recording scenes were analyzed. Furthermore, we propose a blood volume pulse signal quality optimization strategy based on the inverse Fourier transform, and an improved pulse rate estimation strategy based on signal-to-noise ratio threshold sliding. We found that video-based pulse rate estimation performed better on the Multi-Scene Sign Dataset and the Pulse Rate Detection Dataset than on the other datasets. Compared with the Fast independent component analysis and Single Channel algorithms, the chrominance-based and plane-orthogonal-to-skin algorithms show stronger anti-interference ability and higher robustness. The five-organ fusion area and the full-face area outperformed single sub-regions, and fewer motion artifacts and better lighting improve the precision of pulse rate estimation.
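The chrominance-based method compared above projects the mean skin-pixel RGB trace onto two chrominance axes so that illumination and motion distortions largely cancel. A minimal sketch of that projection (an illustrative re-implementation of the standard CHROM formulation, not the exact pipeline evaluated in the study):

```python
import numpy as np

def chrom_bvp(rgb):
    """Sketch of chrominance-based (CHROM) blood-volume-pulse extraction.

    rgb: array of shape (T, 3) holding the mean R, G, B of a skin ROI per frame.
    Returns a 1-D blood-volume-pulse signal of length T.
    """
    rgb = np.asarray(rgb, dtype=float)
    norm = rgb / rgb.mean(axis=0)          # remove the average skin tone
    x = 3.0 * norm[:, 0] - 2.0 * norm[:, 1]
    y = 1.5 * norm[:, 0] + norm[:, 1] - 1.5 * norm[:, 2]
    alpha = x.std() / (y.std() + 1e-12)    # balance the two chrominance signals
    return x - alpha * y
```

In practice the trace would then be band-pass filtered to the physiological pulse band before the pulse rate is read off its dominant frequency; that stage, and the SNR-threshold-sliding refinement, are omitted here.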
Funding: supported by the National Key R&D Program of China (No. 2022YFF0902302), the National Natural Science Foundation of China (No. 62322209), RCUK grant CAMERA (Nos. EP/M023281/1, EP/T022523/1), and a gift from Adobe.
Abstract: The ability to exhibit appropriate emotions is crucial for the expressiveness and attractiveness of facial videos. However, it is difficult to control the level of emotion, even for experienced actors and amateur podcasters on social networks. In this study, we aim to solve the novel problem of semantically amplifying the emotions of a facial video. This poses new challenges in editing a sequence of video frames with respect to face semantics, emotion adaptiveness, and temporal coherence. Our approach is based on semantic face editing in the disentangled latent space of a state-of-the-art StyleGAN model. We present a new face dataset with diverse emotions to fine-tune the pre-trained StyleGAN and improve the expressiveness of its original, emotion-biased latent space. An emotion-editing subspace is constructed to allow adaptive emotion amplification while preserving other facial attributes. We further propose an effective stitching-tuning technique to ensure temporally coherent video frames. Our method yields plausible emotion amplification for a wide range of facial videos. Qualitative and quantitative evaluations demonstrate its advantages over baseline methods. The proposed dataset and research code will be made publicly available.
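In its simplest linear form, amplification in an emotion-editing subspace amounts to moving a frame's latent code along an emotion axis while leaving orthogonal attributes untouched. The sketch below shows only that linear step with a hypothetical direction vector; the paper's subspace construction and stitching-tuning are considerably richer:

```python
import numpy as np

def amplify_emotion(w, direction, strength):
    """Move a latent code along a (hypothetical) emotion direction.

    w:         latent code, shape (latent_dim,)
    direction: vector spanning one axis of the emotion-editing subspace
    strength:  scalar amplification factor (0 keeps the frame unchanged)
    """
    direction = direction / np.linalg.norm(direction)
    return w + strength * direction
```

Because the edit is a pure translation along a unit vector, the amplification level varies continuously with `strength`, which is what makes the degree of emotion controllable per frame.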
Abstract: Deepfake technology can be used to replace people's faces in videos or pictures to show them saying or doing things they never said or did. Deepfake media are often used to extort, defame, and manipulate public opinion. However, despite deepfake technology's risks, current deepfake detection methods lack generalization and are inconsistent when applied to unknown videos, i.e., videos on which they have not been trained. The purpose of this study is to develop a generalizable deepfake detection model by training convolutional neural networks (CNNs) to classify human facial features in videos. The study formulated the research question: "How effectively does the developed model provide reliable generalizations?" A CNN model was trained to distinguish between real and fake videos using the facial features of human subjects. The model was trained, validated, and tested using the FaceForensics++ dataset, which contains more than 500,000 frames, and subsets of the DFDC dataset, totaling more than 22,000 videos. The study demonstrated high generalizability, as accuracy on the unknown dataset was only marginally (about 1%) lower than on the known dataset. The findings indicate that detection systems can be made more generalizable, lighter, and faster by focusing on just a small region (the human face) of each video.
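When a CNN classifies individual face crops, as described above, the per-frame scores still have to be aggregated into one video-level verdict. The aggregation rule below is a generic sketch of that final step (mean-probability thresholding is an assumption for illustration, not necessarily the study's rule):

```python
import numpy as np

def video_label(frame_probs, threshold=0.5):
    """Aggregate per-frame fake probabilities into a video-level decision.

    frame_probs: iterable of CNN outputs in [0, 1], one per sampled face crop.
    Returns "fake" if the mean probability reaches the threshold, else "real".
    The CNN itself is assumed upstream; only frame-to-video aggregation is shown.
    """
    p = float(np.mean(list(frame_probs)))
    return "fake" if p >= threshold else "real"
```

Averaging over many sampled crops makes the verdict robust to a few misclassified frames, which is one reason crop-level classifiers can stay small and fast.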
Abstract: Image photoplethysmography enables low-cost, easy-to-operate, non-contact heart rate detection from facial video, effectively overcoming the limitations of traditional contact methods in daily vital sign monitoring. However, it is hard to obtain accurate heart rate readings when the subject's face is moving, ambient light is weak, or the detection distance is long. In this article, a non-contact heart rate detection method based on face tracking is proposed, which effectively improves the accuracy of non-contact heart rate detection in practical applications. A corner-tracking algorithm is used to follow the face, reducing the motion artifacts caused by the subject's facial movement and improving the usability of the signal. A maximum ratio combining algorithm is then used to weight the pixel-space pulse wave signals within the facial region of interest, improving pulse wave extraction accuracy. We analyzed facial images collected at different distances and in different motion states. The proposed method significantly reduces the error rate compared with the independent component analysis method. Theoretical analysis and experimental verification show that the method reduces the error rate across the experimental variables and agrees well with heart rate values collected by a medical physiological vest. This method will help improve the accuracy of non-contact heart rate detection in complex environments.
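Maximum ratio combining weights each candidate pulse-wave signal in proportion to its estimated quality, so that clean sub-regions dominate the combined trace. A minimal sketch of the combining step (how the per-region SNRs are estimated is assumed to happen upstream; the linear-scale SNR weights here are illustrative):

```python
import numpy as np

def mrc_combine(signals, snrs):
    """Maximum-ratio combining of pixel-region pulse-wave signals.

    signals: array of shape (N, T), one candidate signal per sub-region.
    snrs:    array of shape (N,), estimated SNR of each signal (linear scale).
    Each signal contributes in proportion to its SNR.
    """
    signals = np.asarray(signals, dtype=float)
    w = np.asarray(snrs, dtype=float)
    w = w / w.sum()               # normalize weights to sum to 1
    return w @ signals            # weighted sum over the N sub-regions
```

The design choice mirrors diversity combining in communications: regions drowned in noise receive near-zero weight instead of being discarded outright, so no usable signal energy is thrown away.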
Abstract: Objective: To establish a localized Chinese Facial Expression Video System (CFEVS) to broaden the range of materials available for emotion research. Methods: Video clips of facial expressions of joy, sadness, surprise, fear, anger, and disgust at three intensity levels, plus two kinds of neutral clips (expressionless and chewing), were recorded. After two rounds of preliminary screening, 50 Chinese university students gave self-report ratings of the remaining clips' expression type, valence, arousal, and the performer's appearance. Clips with high rater consistency in expression type, valence, and arousal, and whose expression type agreed with their valence, were included in the CFEVS; a distribution analysis was conducted, and the effects of rater gender and performer appearance on valence and arousal scores were analyzed. Results: The CFEVS included 61 joy clips (18 male, 43 female), 51 sadness clips (23 male, 28 female), 31 expressionless neutral clips (13 male, 17 female), and 24 chewing neutral clips (7 male, 17 female). Scatter plots showed that the CFEVS is broadly distributed in valence and arousal. Analysis of variance showed that the effects of rater gender and performer appearance on a clip's valence and arousal depended on its expression type. Conclusion: This study preliminarily established a CFEVS containing joy, sadness, and neutral expressions, and found that rater gender and performer appearance can influence experimental results.