Emotion recognition via facial expressions (ERFE) has attracted a great deal of interest with recent advances in artificial intelligence and pattern recognition. Most studies are based on 2D images, and they are usually computationally expensive. In this paper, we propose a real-time emotion recognition approach based on both 2D and 3D facial expression features captured by Kinect sensors. To capture the deformation of the 3D mesh during facial expression, we combine the features of animation units (AUs) and feature point positions (FPPs) tracked by Kinect. A fusion algorithm based on improved emotional profiles (IEPs) and maximum confidence is proposed to recognize emotions with these real-time facial expression features. Experiments on both an emotion dataset and a real-time video show the superior performance of our method.
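The maximum-confidence fusion mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not define how confidence is computed, so the code below assumes, purely for illustration, that each feature stream (AU-based and FPP-based) produces an emotion profile of per-class probabilities and that confidence is the margin between the two highest probabilities. The names `confidence` and `fuse_max_confidence` are hypothetical and are not taken from the paper.

```python
from typing import Dict, Tuple


def confidence(profile: Dict[str, float]) -> float:
    """Assumed confidence measure: gap between the two largest class probabilities."""
    ranked = sorted(profile.values(), reverse=True)
    return ranked[0] - ranked[1] if len(ranked) > 1 else ranked[0]


def fuse_max_confidence(profiles: Dict[str, Dict[str, float]]) -> Tuple[str, str]:
    """Pick the feature stream with the highest confidence and return
    (stream name, emotion predicted by that stream)."""
    source = max(profiles, key=lambda name: confidence(profiles[name]))
    best = profiles[source]
    return source, max(best, key=best.get)


# Hypothetical emotion profiles from the two feature streams.
profiles = {
    "AU": {"happy": 0.5, "sad": 0.3, "angry": 0.2},
    "FPP": {"happy": 0.1, "sad": 0.8, "angry": 0.1},
}
print(fuse_max_confidence(profiles))
```

In this example the FPP stream has the larger margin, so its prediction is the one that is kept.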
Functional paralanguage includes considerable emotion information, and it is insensitive to speaker changes. To improve the emotion recognition accuracy under the condition of speaker-independence, a fusion method combining the functional paralanguage features with the accompanying paralanguage features is proposed for speaker-independent speech emotion recognition. Using this method, functional paralanguages, such as laughter, crying, and sighing, are used to assist speech emotion recognition. The contributions of our work are threefold. First, an emotional speech database including six kinds of functional paralanguage and six typical emotions was recorded by our research group. Second, the functional paralanguage is put forward to recognize the speech emotions combined with the accompanying paralanguage features. Third, a fusion algorithm based on confidences and probabilities is proposed to combine the functional paralanguage features with the accompanying paralanguage features for speech emotion recognition. We evaluate the usefulness of the functional paralanguage features and the fusion algorithm in terms of precision, recall, and F1-measure on the emotional speech database recorded by our research group. The overall recognition accuracy achieved for six emotions is over 67% in the speaker-independent condition using the functional paralanguage features.
Emotion-based features are critical for achieving high performance in a speech emotion recognition (SER) system. In general, it is difficult to develop these features due to the ambiguity of the ground truth. In this paper, we apply several unsupervised feature learning algorithms (including K-means clustering, the sparse auto-encoder, and sparse restricted Boltzmann machines), which have promise for learning task-related features by using unlabeled data, to speech emotion recognition. We then evaluate the performance of the proposed approach and present a detailed analysis of the effect of two important factors in the model setup, the content window size and the number of hidden layer nodes. Experimental results show that larger content windows and more hidden nodes contribute to higher performance. We also show that the two-layer network cannot explicitly improve performance compared to a single-layer network.
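An Ad Hoc network is a wireless network without infrastructure. Because such a network is very easy to set up, it can be widely used in disaster relief, battlefield command, temporary meetings, and similar settings. These applications usually share a common characteristic: one-to-many or many-to-many data transmission. Multicast routing protocols therefore play a very important role in Ad Hoc networks.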
In dimensional affect recognition, the machine learning methods used to model and predict affect are mostly classification and regression. However, the annotation in the dimensional affect space usually takes the form of a continuous real value which has an ordinal property. The aforementioned methods do not focus on taking advantage of this important information. Therefore, we propose an affective rating ranking framework for affect recognition based on face images in the valence and arousal dimensional space. Our approach can appropriately use the ordinal information among affective ratings which are generated by discretizing continuous annotations. Specifically, we first train a series of basic cost-sensitive binary classifiers, each of which uses all samples relabeled according to the comparison results between corresponding ratings and a given rank of a binary classifier. We obtain the final affective ratings by aggregating the outputs of binary classifiers. By comparing the experimental results with the baseline and deep-learning-based classification and regression methods on the benchmarking database of the AVEC 2015 Challenge and the selected subset of the SEMAINE database, we find that our ordinal ranking method is effective in both arousal and valence dimensions.
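The three steps named in this abstract (discretizing annotations into ratings, relabeling all samples for each rank, and aggregating binary outputs) can be sketched as follows. This is only an illustration under stated assumptions: the abstract gives neither the binning scheme, the cost definition, nor the aggregation rule, so the equal-width bins, the distance-based cost, and the "one plus the number of positive votes" aggregation below are assumed, and the function names `discretize`, `relabel`, and `aggregate` are hypothetical.

```python
from typing import List, Tuple


def discretize(value: float, low: float, high: float, num_ratings: int) -> int:
    """Map a continuous annotation in [low, high] to a rating in 1..num_ratings
    using equal-width bins (assumed binning scheme)."""
    if not low <= value <= high:
        raise ValueError("value outside annotation range")
    width = (high - low) / num_ratings
    # min() keeps value == high inside the top bin.
    return min(int((value - low) / width), num_ratings - 1) + 1


def relabel(ratings: List[int], rank: int) -> List[Tuple[int, int]]:
    """Relabel every sample for the binary question 'is the rating > rank?'.
    Returns (label, cost) pairs; the cost grows with the distance between
    the rating and the decision boundary (assumed cost definition)."""
    pairs = []
    for r in ratings:
        label = 1 if r > rank else 0
        cost = r - rank if label else rank - r + 1
        pairs.append((label, cost))
    return pairs


def aggregate(binary_outputs: List[int]) -> int:
    """Combine the outputs of the K-1 binary classifiers into one rating:
    1 plus the number of classifiers answering 'greater' (assumed rule)."""
    return 1 + sum(binary_outputs)


ratings = [discretize(v, -1.0, 1.0, 5) for v in (-1.0, 0.0, 1.0)]
print(ratings)                      # ratings for three continuous annotations
print(relabel([1, 2, 3, 4], 2))     # training targets for the rank-2 classifier
print(aggregate([1, 1, 0]))         # final rating from three binary outputs
```

Each binary classifier sees every training sample, which is how the ordinal relation between ratings enters the training targets.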
Much recent progress in monaural speech separation (MSS) has been achieved through a series of deep learning architectures based on autoencoders, which use an encoder to condense the input signal into compressed features and then feed these features into a decoder to construct a specific audio source of interest. However, these approaches can neither learn generative factors of the original input for MSS nor construct each audio source in mixed speech. In this study, we propose a novel weighted-factor autoencoder (WFAE) model for MSS, which introduces a regularization loss in the objective function to isolate one source without containing other sources. By incorporating a latent attention mechanism and a supervised source constructor in the separation layer, WFAE can learn source-specific generative factors and a set of discriminative features for each source, leading to MSS performance improvement. Experiments on benchmark datasets show that our approach outperforms the existing methods. In terms of three important metrics, WFAE has great success on a relatively challenging MSS case, i.e., speaker-independent MSS.
Large-scale datasets are driving the rapid development of deep convolutional neural networks for visual sentiment analysis. However, the annotation of large-scale datasets is expensive and time consuming. Instead, it is easy to obtain weakly labeled web images from the Internet. However, noisy labels still lead to seriously degraded performance when we use images directly from the web for training networks. To address this drawback, we propose an end-to-end weakly supervised learning network, which is robust to mislabeled web images. Specifically, the proposed attention module automatically eliminates the distraction of those samples with incorrect labels by reducing their attention scores in the training process. On the other hand, the special-class activation map module is designed to stimulate the network by focusing on the significant regions from the samples with correct labels in a weakly supervised learning approach. Besides the process of feature learning, applying regularization to the classifier is considered to minimize the distance of those samples within the same class and maximize the distance between different class centroids. Quantitative and qualitative evaluations on well-labeled and mislabeled web image datasets demonstrate that the proposed algorithm outperforms the related methods.
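The classifier regularization described at the end of this abstract (pull samples of the same class together, push class centroids apart) can be illustrated with a small sketch. The abstract does not give the formula, so the code assumes one plausible form: mean squared distance of each sample to its class centroid, minus a weighted mean squared distance between centroid pairs. The names `class_centroids`, `centroid_regularizer`, and the `weight` parameter are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Sequence


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def class_centroids(features: List[Sequence[float]],
                    labels: List[int]) -> Dict[int, List[float]]:
    """Mean feature vector of each class."""
    groups = defaultdict(list)
    for feat, label in zip(features, labels):
        groups[label].append(feat)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()}


def centroid_regularizer(features: List[Sequence[float]],
                         labels: List[int],
                         weight: float = 1.0) -> float:
    """Assumed regularizer: intra-class spread minus weighted centroid separation.
    Lower values mean tighter classes and better separated centroids."""
    centroids = class_centroids(features, labels)
    intra = sum(sq_dist(f, centroids[l]) for f, l in zip(features, labels)) / len(features)
    pairs = list(combinations(centroids.values(), 2))
    inter = sum(sq_dist(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
    return intra - weight * inter


features = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]]
labels = [0, 0, 1, 1]
print(centroid_regularizer(features, labels))
```

In a real training loop such a term would be added to the classification loss and computed on learned features; here it is evaluated on fixed toy vectors only.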
Funding (Kinect-based facial expression emotion recognition): Project supported by the National Natural Science Foundation of China (No. 61272211) and the Six Talent Peaks Project in Jiangsu Province of China (No. DZXX-026).
Funding (functional paralanguage speech emotion recognition): Supported by the National Natural Science Foundation of China (Nos. 61272211 and 61170126), the Natural Science Foundation of Jiangsu Province (No. BK2011521), and the Research Foundation for Talented Scholars of Jiangsu University (No. 10JDG065), China.
Funding (unsupervised feature learning for speech emotion recognition): Supported by the National Natural Science Foundation of China (Nos. 61272211 and 61170126) and the Six Talent Peaks Foundation of Jiangsu Province, China (No. DZXX027).
Funding (ordinal ranking for dimensional affect recognition): Supported by the National Natural Science Foundation of China (Nos. 61272211 and 61672267), the Open Project Program of the National Laboratory of Pattern Recognition (No. 201700022), the China Postdoctoral Science Foundation (No. 2015M570413), and the Innovation Project of Undergraduate Students in Jiangsu University (No. 16A235).
Funding (weighted-factor autoencoder for monaural speech separation): Supported by the Key Project of the National Natural Science Foundation of China (No. U1836220), the National Natural Science Foundation of China (No. 61672267), the Qing Lan Talent Program of Jiangsu Province, China, and the Key Innovation Project of Undergraduate Students in Jiangsu Province, China (No. 201810299045Z).
Funding (weakly supervised visual sentiment analysis): Project supported by the Key Project of the National Natural Science Foundation of China (No. U1836220), the National Natural Science Foundation of China (No. 61672267), the Qing Lan Talent Program of Jiangsu Province, China, the Jiangsu Key Laboratory of Security Technology for Industrial Cyberspace, China, the Finnish Cultural Foundation, the Jiangsu Specially-Appointed Professor Program, China (No. 3051107219003), the Jiangsu Joint Research Project of Sino-Foreign Cooperative Education Platform, China, and the Talent Startup Project of Nanjing Institute of Technology, China (No. YKJ201982).