With the explosive growth of Internet text information,the task of text classification is more important.As a part of text classification,Chinese news text classification also plays an important role.In public securit...With the explosive growth of Internet text information,the task of text classification is more important.As a part of text classification,Chinese news text classification also plays an important role.In public security work,public opinion news classification is an important topic.Effective and accurate classification of public opinion news is a necessary prerequisite for relevant departments to grasp the situation of public opinion and control the trend of public opinion in time.This paper introduces a combinedconvolutional neural network text classification model based on word2vec and improved TF-IDF:firstly,the word vector is trained through word2vec model,then the weight of each word is calculated by using the improved TFIDF algorithm based on class frequency variance,and the word vector and weight are combined to construct the text vector representation.Finally,the combined-convolutional neural network is used to train and test the Thucnews data set.The results show that the classification effect of this model is better than the traditional Text-RNN model,the traditional Text-CNN model and word2vec-CNN model.The test accuracy is 97.56%,the accuracy rate is 97%,the recall rate is 97%,and the F1-score is 97%.展开更多
先前的语音驱动面部表情的动画研究从音频信号中产生了较为逼真和精确的嘴唇运动和面部表情。传统的方法主要集中在学习从语音到动画的确定性映射,最近的研究开始探讨语音驱动的3D人脸动画的多样性,即通过利用扩散模型的多样性能力来捕...先前的语音驱动面部表情的动画研究从音频信号中产生了较为逼真和精确的嘴唇运动和面部表情。传统的方法主要集中在学习从语音到动画的确定性映射,最近的研究开始探讨语音驱动的3D人脸动画的多样性,即通过利用扩散模型的多样性能力来捕捉音频和面部运动之间复杂的多对多关系来完成任务。本文的Self-Diffuser方法使用预训练的大语言模型wav2vec 2.0对音频输入进行编码,通过引入基于扩散的技术,将其与Transformer相结合来完成生成任务。本研究不仅克服了传统回归模型在生成具有唇读可理解性的真实准确唇运动方面的局限性,还探讨了精确的嘴唇同步和创造与语音无关的面部表情之间的权衡。通过对比、分析当前最先进的方法,本文的Self-Diffuser方法,使得语音驱动的面部动画产生了更精确的唇运动;在与说话松散相关的上半部表情方面也产生了更贴近于真实说话表情的面部运动;同时本文模型引入的扩散机制使得生成3D人脸动画序列的多样性能力也大大提高。Previous research on speech-driven facial expression animation has achieved realistic and accurate lip movements and facial expressions from audio signals. Traditional methods primarily focused on learning deterministic mappings from speech to animation. Recent studies have started exploring the diversity of speech-driven 3D facial animation, aiming to capture the complex many-to-many relationships between audio and facial motion by leveraging the diversity capabilities of diffusion models. In this study, the Self-Diffuser method is proposed by utilizing the pre-trained large-scale language model wav2vec 2.0 to encode audio inputs. By introducing diffusion-based techniques and combining them with Transformers, the generation task is accomplished. This research not only overcomes the limitations of traditional regression models in generating lip movements that are both realistic and lip-reading comprehensible, but also explores the trade-off between precise lip synchronization and creating facial expressions independent of speech. Through comparisons and analysis with the current state-of-the-art methods, the Self-Diffuser method in this paper achieves more accurate lip movements in speech-driven facial animation. It also produces facial motions that closely resemble real speaking expressions in the upper face region correlated with speech looseness. Additionally, the introduced diffusion mechanism significantly enhances the diversity capabilities in generating 3D facial animation sequences.展开更多
Speech Emotion Recognition(SER)has received widespread attention as a crucial way for understanding human emotional states.However,the impact of irrelevant information on speech signals and data sparsity limit the dev...Speech Emotion Recognition(SER)has received widespread attention as a crucial way for understanding human emotional states.However,the impact of irrelevant information on speech signals and data sparsity limit the development of SER system.To address these issues,this paper proposes a framework that incorporates the Attentive Mask Residual Network(AM-ResNet)and the self-supervised learning model Wav2vec 2.0 to obtain AM-ResNet features and Wav2vec 2.0 features respectively,together with a cross-attention module to interact and fuse these two features.The AM-ResNet branch mainly consists of maximum amplitude difference detection,mask residual block,and an attention mechanism.Among them,the maximum amplitude difference detection and the mask residual block act on the pre-processing and the network,respectively,to reduce the impact of silent frames,and the attention mechanism assigns different weights to unvoiced and voiced speech to reduce redundant emotional information caused by unvoiced speech.In the Wav2vec 2.0 branch,this model is introduced as a feature extractor to obtain general speech features(Wav2vec 2.0 features)through pre-training with a large amount of unlabeled speech data,which can assist the SER task and cope with data sparsity problems.In the cross-attention module,AM-ResNet features and Wav2vec 2.0 features are interacted with and fused to obtain the cross-fused features,which are used to predict the final emotion.Furthermore,multi-label learning is also used to add ambiguous emotion utterances to deal with data limitations.Finally,experimental results illustrate the usefulness and superiority of our proposed framework over existing state-of-the-art approaches.展开更多
Speech emotion recognition is essential for frictionless human-machine interaction,where machines respond to human instructions with context-aware actions.The properties of individuals’voices vary with culture,langua...Speech emotion recognition is essential for frictionless human-machine interaction,where machines respond to human instructions with context-aware actions.The properties of individuals’voices vary with culture,language,gender,and personality.These variations in speaker-specific properties may hamper the performance of standard representations in downstream tasks such as speech emotion recognition(SER).This study demonstrates the significance of speaker-specific speech characteristics and how considering them can be leveraged to improve the performance of SER models.In the proposed approach,two wav2vec-based modules(a speaker-identification network and an emotion classification network)are trained with the Arcface loss.The speaker-identification network has a single attention block to encode an input audio waveform into a speaker-specific representation.The emotion classification network uses a wav2vec 2.0-backbone as well as four attention blocks to encode the same input audio waveform into an emotion representation.These two representations are then fused into a single vector representation containing emotion and speaker-specific information.Experimental results showed that the use of speaker-specific characteristics improves SER performance.Additionally,combining these with an angular marginal loss such as the Arcface loss improves intra-class compactness while increasing inter-class separability,as demonstrated by the plots of t-distributed stochastic neighbor embeddings(t-SNE).The proposed approach outperforms previous methods using similar training strategies,with a weighted accuracy(WA)of 72.14%and unweighted accuracy(UA)of 72.97%on the Interactive Emotional Dynamic Motion Capture(IEMOCAP)dataset.This demonstrates its effectiveness and potential to enhance human-machine interaction through more accurate emotion recognition in speech.展开更多
With the development of Internet technology,the explosive growth of Internet information presentation has led to difficulty in filtering effective information.Finding a model with high accuracy for text classification...With the development of Internet technology,the explosive growth of Internet information presentation has led to difficulty in filtering effective information.Finding a model with high accuracy for text classification has become a critical problem to be solved by text filtering,especially for Chinese texts.This paper selected the manually calibrated Douban movie website comment data for research.First,a text filtering model based on the BP neural network has been built;Second,based on the Term Frequency-Inverse Document Frequency(TF-IDF)vector space model and the doc2vec method,the text word frequency vector and the text semantic vector were obtained respectively,and the text word frequency vector was linearly reduced by the Principal Component Analysis(PCA)method.Third,the text word frequency vector after dimensionality reduction and the text semantic vector were combined,add the text value degree,and the text synthesis vector was constructed.Experiments show that the model combined with text word frequency vector degree after dimensionality reduction,text semantic vector,and text value has reached the highest accuracy of 84.67%.展开更多
Human-virus protein-protein interactions(PPIs)play critical roles in viral infection.For example,the spike protein of severe acute respiratory syndrome coronavirus 2(SARS-CoV-2)binds primarily to human angiotensinconv...Human-virus protein-protein interactions(PPIs)play critical roles in viral infection.For example,the spike protein of severe acute respiratory syndrome coronavirus 2(SARS-CoV-2)binds primarily to human angiotensinconverting enzyme 2(ACE2)protein to infect human cells.Thus,identifying and blocking these PPIs contribute to controlling and preventing viruses.However,wet-lab experiment-based identification of human-virus PPIs is usually expensive,labor-intensive,and time-consuming,which presents the need for computational methods.Many machine-learning methods have been proposed recently and achieved good results in predicting humanvirus PPIs.However,most methods are based on protein sequence features and apply manually extracted features,such as statistical characteristics,phylogenetic profiles,and physicochemical properties.In this work,we present an embedding-based neural framework with convolutional neural network(CNN)and bi-directional long short-term memory unit(Bi-LSTM)architecture,named Emvirus,to predict human-virus PPIs(including human-SARS-CoV-2 PPIs).In addition,we conduct cross-viral experiments to explore the generalization ability of Emvirus.Compared to other feature extraction methods,Emvirus achieves better prediction accuracy.展开更多
Existing pre-trained models like Distil HuBERT excel at uncovering hidden patterns and facilitating accurate recognition across diverse data types, such as audio and visual information. We harnessed this capability to...Existing pre-trained models like Distil HuBERT excel at uncovering hidden patterns and facilitating accurate recognition across diverse data types, such as audio and visual information. We harnessed this capability to develop a deep learning model that utilizes Distil HuBERT for jointly learning these combined features in speech emotion recognition (SER). Our experiments highlight its distinct advantages: it significantly outperforms Wav2vec 2.0 in both offline and real-time accuracy on RAVDESS and BAVED datasets. Although slightly trailing HuBERT’s offline accuracy, Distil HuBERT shines with comparable performance at a fraction of the model size, making it an ideal choice for resource-constrained environments like mobile devices. This smaller size does come with a slight trade-off: Distil HuBERT achieved notable accuracy in offline evaluation, with 96.33% on the BAVED database and 87.01% on the RAVDESS database. In real-time evaluation, the accuracy decreased to 79.3% on the BAVED database and 77.87% on the RAVDESS database. This decrease is likely a result of the challenges associated with real-time processing, including latency and noise, but still demonstrates strong performance in practical scenarios. Therefore, Distil HuBERT emerges as a compelling choice for SER, especially when prioritizing accuracy over real-time processing. Its compact size further enhances its potential for resource-limited settings, making it a versatile tool for a wide range of applications.展开更多
基金This work was supported by Ministry of public security technology research program[Grant No.2020JSYJC22ok]Fundamental Research Funds for the Central Universities(No.2021JKF215)+1 种基金Open Research Fund of the Public Security Behavioral Science Laboratory,People’s Public Security University of China(2020SYS03)Police and people build/share a smart community(PJ13-201912-0525).
文摘With the explosive growth of Internet text information,the task of text classification is more important.As a part of text classification,Chinese news text classification also plays an important role.In public security work,public opinion news classification is an important topic.Effective and accurate classification of public opinion news is a necessary prerequisite for relevant departments to grasp the situation of public opinion and control the trend of public opinion in time.This paper introduces a combinedconvolutional neural network text classification model based on word2vec and improved TF-IDF:firstly,the word vector is trained through word2vec model,then the weight of each word is calculated by using the improved TFIDF algorithm based on class frequency variance,and the word vector and weight are combined to construct the text vector representation.Finally,the combined-convolutional neural network is used to train and test the Thucnews data set.The results show that the classification effect of this model is better than the traditional Text-RNN model,the traditional Text-CNN model and word2vec-CNN model.The test accuracy is 97.56%,the accuracy rate is 97%,the recall rate is 97%,and the F1-score is 97%.
文摘先前的语音驱动面部表情的动画研究从音频信号中产生了较为逼真和精确的嘴唇运动和面部表情。传统的方法主要集中在学习从语音到动画的确定性映射,最近的研究开始探讨语音驱动的3D人脸动画的多样性,即通过利用扩散模型的多样性能力来捕捉音频和面部运动之间复杂的多对多关系来完成任务。本文的Self-Diffuser方法使用预训练的大语言模型wav2vec 2.0对音频输入进行编码,通过引入基于扩散的技术,将其与Transformer相结合来完成生成任务。本研究不仅克服了传统回归模型在生成具有唇读可理解性的真实准确唇运动方面的局限性,还探讨了精确的嘴唇同步和创造与语音无关的面部表情之间的权衡。通过对比、分析当前最先进的方法,本文的Self-Diffuser方法,使得语音驱动的面部动画产生了更精确的唇运动;在与说话松散相关的上半部表情方面也产生了更贴近于真实说话表情的面部运动;同时本文模型引入的扩散机制使得生成3D人脸动画序列的多样性能力也大大提高。Previous research on speech-driven facial expression animation has achieved realistic and accurate lip movements and facial expressions from audio signals. Traditional methods primarily focused on learning deterministic mappings from speech to animation. Recent studies have started exploring the diversity of speech-driven 3D facial animation, aiming to capture the complex many-to-many relationships between audio and facial motion by leveraging the diversity capabilities of diffusion models. In this study, the Self-Diffuser method is proposed by utilizing the pre-trained large-scale language model wav2vec 2.0 to encode audio inputs. By introducing diffusion-based techniques and combining them with Transformers, the generation task is accomplished. This research not only overcomes the limitations of traditional regression models in generating lip movements that are both realistic and lip-reading comprehensible, but also explores the trade-off between precise lip synchronization and creating facial expressions independent of speech. Through comparisons and analysis with the current state-of-the-art methods, the Self-Diffuser method in this paper achieves more accurate lip movements in speech-driven facial animation. It also produces facial motions that closely resemble real speaking expressions in the upper face region correlated with speech looseness. Additionally, the introduced diffusion mechanism significantly enhances the diversity capabilities in generating 3D facial animation sequences.
基金supported by Chongqing University of Posts and Telecommunications Ph.D.Innovative Talents Project(Grant No.BYJS202106)Chongqing Postgraduate Research Innovation Project(Grant No.CYB21203).
文摘Speech Emotion Recognition(SER)has received widespread attention as a crucial way for understanding human emotional states.However,the impact of irrelevant information on speech signals and data sparsity limit the development of SER system.To address these issues,this paper proposes a framework that incorporates the Attentive Mask Residual Network(AM-ResNet)and the self-supervised learning model Wav2vec 2.0 to obtain AM-ResNet features and Wav2vec 2.0 features respectively,together with a cross-attention module to interact and fuse these two features.The AM-ResNet branch mainly consists of maximum amplitude difference detection,mask residual block,and an attention mechanism.Among them,the maximum amplitude difference detection and the mask residual block act on the pre-processing and the network,respectively,to reduce the impact of silent frames,and the attention mechanism assigns different weights to unvoiced and voiced speech to reduce redundant emotional information caused by unvoiced speech.In the Wav2vec 2.0 branch,this model is introduced as a feature extractor to obtain general speech features(Wav2vec 2.0 features)through pre-training with a large amount of unlabeled speech data,which can assist the SER task and cope with data sparsity problems.In the cross-attention module,AM-ResNet features and Wav2vec 2.0 features are interacted with and fused to obtain the cross-fused features,which are used to predict the final emotion.Furthermore,multi-label learning is also used to add ambiguous emotion utterances to deal with data limitations.Finally,experimental results illustrate the usefulness and superiority of our proposed framework over existing state-of-the-art approaches.
基金supported by the Chung-Ang University Graduate Research Scholarship in 2021.
文摘Speech emotion recognition is essential for frictionless human-machine interaction,where machines respond to human instructions with context-aware actions.The properties of individuals’voices vary with culture,language,gender,and personality.These variations in speaker-specific properties may hamper the performance of standard representations in downstream tasks such as speech emotion recognition(SER).This study demonstrates the significance of speaker-specific speech characteristics and how considering them can be leveraged to improve the performance of SER models.In the proposed approach,two wav2vec-based modules(a speaker-identification network and an emotion classification network)are trained with the Arcface loss.The speaker-identification network has a single attention block to encode an input audio waveform into a speaker-specific representation.The emotion classification network uses a wav2vec 2.0-backbone as well as four attention blocks to encode the same input audio waveform into an emotion representation.These two representations are then fused into a single vector representation containing emotion and speaker-specific information.Experimental results showed that the use of speaker-specific characteristics improves SER performance.Additionally,combining these with an angular marginal loss such as the Arcface loss improves intra-class compactness while increasing inter-class separability,as demonstrated by the plots of t-distributed stochastic neighbor embeddings(t-SNE).The proposed approach outperforms previous methods using similar training strategies,with a weighted accuracy(WA)of 72.14%and unweighted accuracy(UA)of 72.97%on the Interactive Emotional Dynamic Motion Capture(IEMOCAP)dataset.This demonstrates its effectiveness and potential to enhance human-machine interaction through more accurate emotion recognition in speech.
基金Supported by the Sichuan Science and Technology Program (2021YFQ0003).
文摘With the development of Internet technology,the explosive growth of Internet information presentation has led to difficulty in filtering effective information.Finding a model with high accuracy for text classification has become a critical problem to be solved by text filtering,especially for Chinese texts.This paper selected the manually calibrated Douban movie website comment data for research.First,a text filtering model based on the BP neural network has been built;Second,based on the Term Frequency-Inverse Document Frequency(TF-IDF)vector space model and the doc2vec method,the text word frequency vector and the text semantic vector were obtained respectively,and the text word frequency vector was linearly reduced by the Principal Component Analysis(PCA)method.Third,the text word frequency vector after dimensionality reduction and the text semantic vector were combined,add the text value degree,and the text synthesis vector was constructed.Experiments show that the model combined with text word frequency vector degree after dimensionality reduction,text semantic vector,and text value has reached the highest accuracy of 84.67%.
文摘Human-virus protein-protein interactions(PPIs)play critical roles in viral infection.For example,the spike protein of severe acute respiratory syndrome coronavirus 2(SARS-CoV-2)binds primarily to human angiotensinconverting enzyme 2(ACE2)protein to infect human cells.Thus,identifying and blocking these PPIs contribute to controlling and preventing viruses.However,wet-lab experiment-based identification of human-virus PPIs is usually expensive,labor-intensive,and time-consuming,which presents the need for computational methods.Many machine-learning methods have been proposed recently and achieved good results in predicting humanvirus PPIs.However,most methods are based on protein sequence features and apply manually extracted features,such as statistical characteristics,phylogenetic profiles,and physicochemical properties.In this work,we present an embedding-based neural framework with convolutional neural network(CNN)and bi-directional long short-term memory unit(Bi-LSTM)architecture,named Emvirus,to predict human-virus PPIs(including human-SARS-CoV-2 PPIs).In addition,we conduct cross-viral experiments to explore the generalization ability of Emvirus.Compared to other feature extraction methods,Emvirus achieves better prediction accuracy.
文摘Existing pre-trained models like Distil HuBERT excel at uncovering hidden patterns and facilitating accurate recognition across diverse data types, such as audio and visual information. We harnessed this capability to develop a deep learning model that utilizes Distil HuBERT for jointly learning these combined features in speech emotion recognition (SER). Our experiments highlight its distinct advantages: it significantly outperforms Wav2vec 2.0 in both offline and real-time accuracy on RAVDESS and BAVED datasets. Although slightly trailing HuBERT’s offline accuracy, Distil HuBERT shines with comparable performance at a fraction of the model size, making it an ideal choice for resource-constrained environments like mobile devices. This smaller size does come with a slight trade-off: Distil HuBERT achieved notable accuracy in offline evaluation, with 96.33% on the BAVED database and 87.01% on the RAVDESS database. In real-time evaluation, the accuracy decreased to 79.3% on the BAVED database and 77.87% on the RAVDESS database. This decrease is likely a result of the challenges associated with real-time processing, including latency and noise, but still demonstrates strong performance in practical scenarios. Therefore, Distil HuBERT emerges as a compelling choice for SER, especially when prioritizing accuracy over real-time processing. Its compact size further enhances its potential for resource-limited settings, making it a versatile tool for a wide range of applications.