Journal Articles
15 articles found
1. A Deepfake Speech Detection Method Based on Wav2Vec2.0 Feature Fusion and Joint Loss (cited 1 time)
Authors: 陈飞飞, 郭海燕, 郭延民, 葛子瑞, 陆华庆. 《信号处理》 (PKU Core), 2025, No. 9, pp. 1547-1557 (11 pages).
The speech pre-training model Wav2Vec2.0 extracts rich multi-layer embedding features through its many hidden layers and has shown strong performance in deepfake speech detection. Fusing the features of Wav2Vec2.0's layers is an effective way to mine deeper representations of speech data, and improving how those layer features are fused promises further gains in detection performance. Building on a Wav2Vec2.0-based deepfake speech detection architecture, this paper introduces the Convolutional Block Attention Module (CBAM) to fuse the embedding features of each Wav2Vec2.0 layer; by combining channel attention and spatial attention in a weighted fusion, the model adaptively strengthens key features and improves its feature extraction ability. On this basis, considering that forged speech comes in many types whose difficulty of discrimination can differ markedly, and in order to avoid bias when the model handles hard-to-discriminate samples while keeping intra-class features compact and inter-class features well separated, the paper combines cross-entropy loss, center loss, and focal loss into the model's overall loss function, exploiting the strengths of each to improve discrimination and generalization across diverse forgery scenarios. Experiments on the ASVspoof 2019 LA, ASVspoof 2021 LA, ASVspoof 2021 DF, and CFAD datasets show that the proposed method performs well on the standard metrics of equal error rate (EER) and minimum tandem detection cost function (min t-DCF). On ASVspoof 2021 LA in particular, it significantly outperforms AASIST, ECAPA-TDNN, ResNet, and several other baselines that use Wav2Vec2.0 for front-end feature extraction.
Keywords: deepfake speech detection, wav2vec2.0, feature fusion, joint loss
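For readers who want a concrete starting point, below is a minimal PyTorch sketch of the joint loss the abstract describes (cross-entropy + focal + center loss). The loss weights, feature dimension, and two-class setup are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CenterLoss(nn.Module):
    """Pulls each embedding toward its class center (compact intra-class)."""
    def __init__(self, num_classes: int, feat_dim: int):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats, labels):
        return ((feats - self.centers[labels]) ** 2).sum(dim=1).mean()

def focal_loss(logits, labels, gamma: float = 2.0):
    """Down-weights easy samples so hard-to-discriminate forgeries dominate."""
    log_p = F.log_softmax(logits, dim=1)
    log_pt = log_p.gather(1, labels.unsqueeze(1)).squeeze(1)
    return (-((1 - log_pt.exp()) ** gamma) * log_pt).mean()

class JointLoss(nn.Module):
    """Cross-entropy + focal + center loss; weights are illustrative."""
    def __init__(self, num_classes=2, feat_dim=256, lambda_focal=1.0, lambda_center=0.01):
        super().__init__()
        self.center = CenterLoss(num_classes, feat_dim)
        self.lambda_focal = lambda_focal
        self.lambda_center = lambda_center

    def forward(self, logits, feats, labels):
        return (F.cross_entropy(logits, labels)
                + self.lambda_focal * focal_loss(logits, labels)
                + self.lambda_center * self.center(feats, labels))

# Usage: logits from the classifier head, feats from the fused Wav2Vec2.0 layers.
criterion = JointLoss()
loss = criterion(torch.randn(8, 2), torch.randn(8, 256), torch.randint(0, 2, (8,)))
```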
2. Chinese News Text Classification Based on Convolutional Neural Network (cited 2 times)
Authors: Hanxu Wang, Xin Li. Journal on Big Data, 2022, No. 1, pp. 41-60 (20 pages).
With the explosive growth of Internet text information, the task of text classification has become increasingly important. As a part of text classification, Chinese news text classification also plays an important role. In public security work, classifying public opinion news is an important topic: effective and accurate classification is a necessary prerequisite for the relevant departments to grasp the state of public opinion and control its trend in time. This paper introduces a combined-convolutional neural network text classification model based on word2vec and an improved TF-IDF. First, word vectors are trained with the word2vec model; then the weight of each word is calculated with an improved TF-IDF algorithm based on class frequency variance, and the word vectors and weights are combined to construct the text vector representation. Finally, the combined-convolutional neural network is trained and tested on the THUCNews dataset. The results show that the classification effect of this model is better than the traditional Text-RNN model, the traditional Text-CNN model, and the word2vec-CNN model, with a test accuracy of 97.56%, precision of 97%, recall of 97%, and F1-score of 97%.
Keywords: Chinese news text classification, word2vec model, improved TF-IDF, combined-convolutional neural network, public opinion news
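The abstract's class-frequency-variance TF-IDF is not given as a formula; the sketch below shows one plausible reading, in which a word's IDF weight is scaled up by the variance of its document frequency across classes. The exact scaling used in the paper may differ.

```python
import numpy as np
from collections import Counter

def class_variance_tfidf(docs, labels, vocab):
    """TF-IDF-style weights boosted by class-frequency variance.

    Words whose document frequency varies strongly between classes are more
    discriminative and receive a larger weight. This is one plausible reading
    of the paper's improved TF-IDF, not its exact formula.
    """
    n_docs = len(docs)
    classes = sorted(set(labels))
    df = Counter(w for doc in docs for w in set(doc))   # document frequency
    class_freq = {w: [] for w in vocab}                 # per-class relative df
    for c in classes:
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        for w in vocab:
            class_freq[w].append(sum(w in d for d in class_docs) / max(len(class_docs), 1))
    weights = {}
    for w in vocab:
        idf = np.log(n_docs / (1 + df[w]))
        var = np.var(class_freq[w])                     # class-frequency variance
        weights[w] = idf * (1.0 + var)
    return weights

docs = [["stock", "market", "rises"], ["team", "wins", "match"], ["market", "falls"]]
labels = ["finance", "sports", "finance"]
vocab = {"stock", "market", "rises", "team", "wins", "match", "falls"}
print(class_variance_tfidf(docs, labels, vocab))
```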
3. Experimental Design of Speech Emotion Recognition Using a Multi-Task Teacher-Student Model
Authors: 孙林慧, 李平安, 雷云龙, 张子晓. 《实验科学与技术》, 2025, No. 4, pp. 1-11 (11 pages).
Targeting the research focus on speech emotion recognition in intelligent human-computer interaction, noisy-speech emotion recognition based on a multi-task-constrained teacher-student model is designed as a research-oriented teaching experiment, in which students observe the guiding role of the teacher model, the learning process of the student model, and the constraining effect of the multi-level enhancement loss. A Wav2vec 2.0-based teacher-student model and a multi-level enhancement loss mechanism are designed, and a speech-enhancement auxiliary task is introduced into the student model so that, through learning, the student acquires the teacher model's feature representation ability. During testing, the student model extracts key emotional features directly from noisy speech for emotion classification, and extensive experiments analyze the performance and robustness of the emotion recognition system. The experimental design helps develop students' analytical thinking, research innovation, and spirit of exploration.
Keywords: speech emotion recognition, multi-task constraints, speech enhancement, Wav2vec 2.0, teacher-student model
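A hedged sketch of the multi-level loss such a teacher-student setup might use: feature imitation at selected hidden layers, an auxiliary enhancement reconstruction term, and the emotion classification loss. The supervised layers and the weights alpha/beta are assumptions for illustration, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def multilevel_distill_loss(student_feats, teacher_feats, enhanced_wave,
                            clean_wave, logits, labels, alpha=1.0, beta=0.5):
    """Multi-level loss for a noisy-speech teacher-student setup (sketch):
    - feature level: student (noisy input) mimics teacher (clean input);
    - enhancement level: an auxiliary decoder reconstructs the clean waveform;
    - task level: cross-entropy on the emotion labels."""
    feat_loss = sum(F.mse_loss(s, t.detach())
                    for s, t in zip(student_feats, teacher_feats))
    enhance_loss = F.l1_loss(enhanced_wave, clean_wave)
    task_loss = F.cross_entropy(logits, labels)
    return task_loss + alpha * feat_loss + beta * enhance_loss

# Dummy shapes: 2 supervised hidden layers, batch of 4, 4 emotion classes.
s = [torch.randn(4, 100, 768, requires_grad=True) for _ in range(2)]
t = [torch.randn(4, 100, 768) for _ in range(2)]
loss = multilevel_distill_loss(s, t, torch.randn(4, 16000), torch.randn(4, 16000),
                               torch.randn(4, 4), torch.randint(0, 4, (4,)))
loss.backward()
```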
4. Wav2Vec 2.0-LSTM Speech Emotion Recognition Based on Multi-Head Attention
Authors: 张红兵, 孙惠民. 《电声技术》, 2025, No. 8, pp. 27-29, 79 (4 pages).
Traditional speech emotion recognition methods rely on hand-crafted features and struggle to capture the complex emotional information in speech and classify it accurately. To address this, a speech emotion recognition model is proposed that combines a Wav2Vec 2.0 model with multi-head attention and a Long Short-Term Memory (LSTM) network. Using weighted accuracy and unweighted accuracy as evaluation metrics, experiments are conducted on two public emotion datasets, IEMOCAP and RAVDESS. The results show that, compared with other baseline models, the new model achieves higher recognition accuracy on the speech emotion recognition task.
Keywords: speech emotion recognition, Wav2Vec 2.0 model, Long Short-Term Memory (LSTM) network, multi-head attention
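A compact sketch of the described pipeline using PyTorch and the transformers library: Wav2Vec 2.0 frame features, multi-head self-attention, an LSTM, and a classifier head. The checkpoint name, hidden sizes, head count, and mean-pooling are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class W2VAttnLSTM(nn.Module):
    """Wav2Vec 2.0 frame features -> multi-head self-attention -> LSTM ->
    utterance-level emotion logits. Sizes and pooling are illustrative."""
    def __init__(self, num_emotions=4, hidden=256, heads=8):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
        self.attn = nn.MultiheadAttention(embed_dim=768, num_heads=heads, batch_first=True)
        self.lstm = nn.LSTM(768, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_emotions)

    def forward(self, wave):                           # wave: (batch, samples), 16 kHz
        feats = self.encoder(wave).last_hidden_state   # (batch, frames, 768)
        attended, _ = self.attn(feats, feats, feats)   # multi-head self-attention
        out, _ = self.lstm(attended)
        return self.fc(out.mean(dim=1))                # mean-pool over frames

model = W2VAttnLSTM()
logits = model(torch.randn(2, 16000))                  # two 1-second clips
```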
5. Research on Fault Diagnosis of Point Machines Based on Acoustic Signals (cited 1 time)
Authors: 梁续继, 戴胜华. 《铁道标准设计》 (PKU Core), 2025, No. 2, pp. 183-190 (8 pages).
Point machines in railway signalling systems have a relatively high failure rate, calling for intelligent fault-diagnosis solutions. Traditional solutions are based on electrical signals and do not fully exploit the physical characteristics of electromechanical equipment. To address this, fault diagnosis is performed on the sound a point machine makes while operating. First, six common mechanical faults that affect the acoustic signal are identified from the machine's operating characteristics. Then, following different routes to feature extraction, three technical schemes are adopted. The end-to-end scheme trains and classifies directly with the wav2vec2.0 speech recognition framework. The feature-matrix scheme extracts Mel-frequency cepstral coefficients (MFCC) from the acoustic signal, applies principal component analysis (PCA) to obtain a fixed-size feature matrix, and classifies faults with a multi-class support vector machine (SVM). The sound-imaging scheme generates spectrograms of the acoustic signal and feeds them into a lightweight variant of the VGG16 convolutional neural network for training and recognition. Experimental results show that all three schemes effectively diagnose seven operating states (normal operation plus the six fault types), with accuracies of 99.8%, 94.2%, and 96.6% respectively. This verifies the feasibility of acoustic fault diagnosis for point machines and demonstrates the value of speech-processing techniques in this setting.
Keywords: point machine, fault diagnosis, acoustic signal, feature extraction, wav2vec2.0, MFCC, spectrogram
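Of the three schemes, the feature-matrix route (MFCC, PCA, multi-class SVM) is the easiest to sketch; below is a minimal librosa/scikit-learn version. The file names are placeholders, and fixed-length truncation plus variance-based PCA stands in for the paper's exact resizing step.

```python
import numpy as np
import librosa
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def mfcc_features(path, n_mfcc=20, n_frames=200):
    """MFCC matrix truncated/padded to a fixed size, then flattened."""
    y, sr = librosa.load(path, sr=16000)
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    m = librosa.util.fix_length(m, size=n_frames, axis=1)
    return m.flatten()

# paths/labels are placeholders; substitute real point-machine recordings
# covering the seven operating states (normal plus six fault types).
paths, labels = ["normal_01.wav", "fault_gap_01.wav"], [0, 1]
X = np.stack([mfcc_features(p) for p in paths])
clf = make_pipeline(StandardScaler(), PCA(n_components=0.95), SVC(kernel="rbf"))
clf.fit(X, labels)   # SVC handles multi-class one-vs-one by default
```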
6. A Multi-Level Channel Fusion Method for Speech Emotion Recognition (cited 1 time)
Authors: 张丽敏, 李扬, 蔡浩, 燕浩. 《计算机科学与探索》 (PKU Core), 2025, No. 8, pp. 2219-2228 (10 pages).
Speech emotion recognition is key to a machine's emotional-cognition ability and essential to improving the quality of human-computer interaction. However, existing studies mostly analyze shallow features, overlook the benefits of multi-feature fusion, and work with limited data, which hurts generalization and keeps recognition accuracy below expectations. To improve accuracy, a speech emotion recognition method based on data augmentation and multi-level channel fusion is proposed. The raw speech is augmented in three ways, by adding Gaussian white noise, pitch shifting, and mixing, which improves model robustness. A multi-level parallel-channel network built on wav2vec 2.0 and CNN models is proposed: the first channel uses wav2vec 2.0 as the backbone to learn deep representations of the speech, followed by a two-layer convolutional CNN; the second channel takes shallow emotional features as input and learns shallow representations with a five-layer convolutional CNN, giving a more complete analysis of both deep and shallow representations. The outputs of the two channels are fused into a multi-level emotional feature hierarchy that combines deep and shallow cues. The proposed model reaches accuracies of 94.38% and 98.75% on the RAVDESS and CASIA datasets respectively, validating the effectiveness of the method.
Keywords: speech emotion recognition, multi-level channel fusion, wav2vec 2.0, convolutional neural network (CNN)
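A minimal sketch of the three augmentations named in the abstract (Gaussian white noise, pitch shift, mixing) using librosa and numpy. The SNR, semitone shift, and mixing weight are illustrative values, not the paper's.

```python
import numpy as np
import librosa

def add_gaussian_noise(y, snr_db=20.0):
    """Add white Gaussian noise at a target signal-to-noise ratio."""
    noise = np.random.randn(len(y))
    scale = np.sqrt(np.mean(y ** 2) / (10 ** (snr_db / 10) * np.mean(noise ** 2)))
    return y + scale * noise

def pitch_shift(y, sr=16000, steps=2.0):
    """Shift pitch by `steps` semitones without changing duration."""
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=steps)

def mix(y1, y2, w=0.5):
    """Blend two utterances; pad the shorter one to match."""
    n = max(len(y1), len(y2))
    y1 = np.pad(y1, (0, n - len(y1)))
    y2 = np.pad(y2, (0, n - len(y2)))
    return w * y1 + (1 - w) * y2

y, sr = librosa.load(librosa.example("trumpet"), sr=16000)
augmented = [add_gaussian_noise(y), pitch_shift(y, sr), mix(y, y[::-1].copy())]
```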
7. Conversational Speech Emotion Recognition Based on Wav2vec2.0 and Contextual Emotional Information Compensation (cited 4 times)
Authors: 曹荣贺, 吴晓龙, 冯畅, 郑方, 徐明星, 哈妮克孜·伊拉洪, 艾斯卡尔·艾木都拉. 《信号处理》 (CSCD, PKU Core), 2023, No. 4, pp. 698-707 (10 pages).
Emotion plays an important role in human interaction. In everyday conversation, utterances often carry weak emotional color, complex emotion categories, and high ambiguity, which makes conversational speech emotion recognition a challenging task. Much existing work retrieves emotional information from the whole conversation and uses this global information for prediction. However, when the emotion shifts sharply between utterances, indiscriminately importing preceding emotional information can interfere with the current prediction. This paper proposes a method based on Wav2vec2.0 and contextual emotional information compensation, which selects from the preceding context the emotional information most relevant to the current utterance as compensation. A context-compensation module first selects, from the dialogue history, the prosodic information of the utterance most likely to influence the current one, and a long short-term memory network turns that prosody into a contextual emotion-compensation representation. The pre-trained Wav2vec2.0 model then extracts an embedding of the current utterance, and the embedding and the contextual representation are fused for emotion recognition. The method achieves 69.0% weighted accuracy (WA) on the IEMOCAP dataset, clearly surpassing the baseline model.
Keywords: emotion recognition, dyadic dialogue, emotion compensation, wav2vec2.0
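A hedged sketch of the compensation idea: choose the history utterance whose prosody is most similar to the current one (cosine similarity stands in for the paper's selection rule), encode it with an LSTM, and fuse it with the Wav2vec2.0 embedding. All dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextCompensation(nn.Module):
    """Select the most relevant history utterance by prosodic similarity,
    encode its prosody with an LSTM, and fuse with the current embedding."""
    def __init__(self, prosody_dim=32, hidden=128, emb_dim=768, num_emotions=4):
        super().__init__()
        self.lstm = nn.LSTM(prosody_dim, hidden, batch_first=True)
        self.fc = nn.Linear(emb_dim + hidden, num_emotions)

    def forward(self, current_emb, current_prosody, history_prosody):
        # history_prosody: (num_past, frames, prosody_dim)
        sims = F.cosine_similarity(history_prosody.mean(dim=1),
                                   current_prosody.mean(dim=0, keepdim=True), dim=1)
        best = history_prosody[sims.argmax()].unsqueeze(0)   # most relevant utterance
        _, (h, _) = self.lstm(best)                          # context representation
        fused = torch.cat([current_emb, h[-1].squeeze(0)], dim=-1)
        return self.fc(fused.unsqueeze(0))

m = ContextCompensation()
logits = m(torch.randn(768), torch.randn(50, 32), torch.randn(3, 50, 32))
```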
8. Research on an NLP-Based Management Platform for Enterprise Scientific and Technological Achievements
Authors: 韩光明, 车坚女, 郭龙, 韩玉林, 王继鹏. 《天然气与石油》, 2025, No. 1, pp. 43-50 (8 pages).
Enterprise scientific and technological achievements involve complex data, much of it sensitive, and existing text classification results cannot meet real confidentiality-management needs, leaving risks of data leakage or unauthorized access. This paper therefore designs an enterprise achievement-management platform based on Natural Language Processing (NLP) to solve the classic problem that keyword retrieval cannot accurately classify confidential text. A CNN-SVM model is built in which Convolutional Neural Networks (CNN) automatically extract text features and a Support Vector Machine (SVM) serves as the final classifier. Convolution kernels of several sizes perform the convolutions, a fully connected layer receives and processes the output of the attention layer, and the SVM classifier categorizes the achievement documents. An attachment-management module deploys the Swift Object Storage Service (Swift), and the Advanced Encryption Standard (AES) algorithm encrypts achievement text data in transit and at rest, completing the platform design. To validate the design, comparative experiments against systems A and B show that, under data-theft attacks of varying frequency, no more than 1 MB of achievement data is stolen, retrieval consistency exceeds 90%, and the recall of semantic confidentiality checks on classified documents reaches up to 97%, indicating good automatic document classification and real protection for enterprise intellectual property. By combining NLP with strong encryption, the platform effectively raises the confidentiality-management level of achievement documents, largely prevents data leakage and unauthorized access, and preserves classification accuracy and efficiency.
Keywords: NLP, SVM, CNN, word vectorization, Swift, enterprise scientific and technological achievement management, AES algorithm
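The classifier aside, the encryption step is straightforward to illustrate. Below is a minimal sketch of AES document encryption with the `cryptography` package; AES-GCM is one common mode choice, since the abstract does not name the mode or key size actually used.

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_document(plaintext: bytes, key: bytes) -> bytes:
    """Encrypt a document for transit/storage with AES-256-GCM.
    The mode and key size are assumptions; the paper only says 'AES'."""
    nonce = os.urandom(12)                       # unique per message
    ciphertext = AESGCM(key).encrypt(nonce, plaintext, None)
    return nonce + ciphertext                    # store nonce with the ciphertext

def decrypt_document(blob: bytes, key: bytes) -> bytes:
    nonce, ciphertext = blob[:12], blob[12:]
    return AESGCM(key).decrypt(nonce, ciphertext, None)

key = AESGCM.generate_key(bit_length=256)
blob = encrypt_document("机密科技成果文档".encode("utf-8"), key)
assert decrypt_document(blob, key).decode("utf-8") == "机密科技成果文档"
```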
9. A Wav2vec2.0 Neural Network-Based Piezoelectric-Array Ultrasonic Guided Wave Method for Locating Rail Damage in Rail Transit (cited 2 times)
Authors: 刘思昊, 钱鲁斌, 梅曜华, 邢宇辉. 《城市轨道交通研究》 (PKU Core), 2023, No. 6, pp. 101-105, 110 (6 pages).
Ordinary ultrasonic testing cannot inspect rail over long distances, and structural health monitoring (SHM) based on ultrasonic guided waves struggles to extract damage features from response signals, limiting localization accuracy. A piezoelectric-array ultrasonic guided wave localization method based on the Wav2vec2.0 neural network is therefore proposed for locating rail damage in rail transit. The method is briefly introduced in light of the characteristics of piezoelectric-array guided-wave data. A guided-wave inspection system for rail damage was built and used to collect a dataset; in addition, a three-dimensional finite element model of guided-wave rail inspection was built in ABAQUS and used to collect a further dataset. Wavelet signal processing reconstructs the experimental guided-wave signals for denoising; random noise is added to the simulated signals, and the noise-augmented simulations serve as a supplementary dataset. Model performance is evaluated by the accuracy and error of damage localization. Results show that training accuracy reaches 100% by the 120th epoch, and that the Wav2vec 2.0-based piezoelectric-array guided-wave method can accurately locate rail damage in rail transit.
Keywords: rail transit, rail damage, piezoelectric-array ultrasonic guided wave localization, wav2vec2.0 neural network
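A minimal sketch of the wavelet denoising step applied to the experimental guided-wave signals, using PyWavelets with soft thresholding. The wavelet family, decomposition level, and universal threshold are illustrative assumptions, not the paper's stated settings.

```python
import numpy as np
import pywt

def wavelet_denoise(signal, wavelet="db4", level=4):
    """Soft-threshold wavelet denoising of a guided-wave response."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    # Universal threshold estimated from the finest detail coefficients.
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thresh = sigma * np.sqrt(2 * np.log(len(signal)))
    denoised = [coeffs[0]] + [pywt.threshold(c, thresh, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(denoised, wavelet)[: len(signal)]

t = np.linspace(0, 1e-3, 4000)                       # 1 ms acquisition window
clean = np.sin(2 * np.pi * 100e3 * t) * np.exp(-((t - 3e-4) ** 2) / 1e-9)
noisy = clean + 0.2 * np.random.randn(t.size)        # simulated sensor noise
recovered = wavelet_denoise(noisy)
```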
10. Self-Diffuser: Research on Speech-Driven Facial Expression Animation
Authors: 臧梦利, 王少波, 智宇, 陈昂. 《计算机科学与应用》, 2024, No. 8, pp. 236-249 (14 pages).
Previous research on speech-driven facial expression animation has achieved realistic and accurate lip movements and facial expressions from audio signals. Traditional methods primarily focused on learning deterministic mappings from speech to animation; recent studies have started exploring the diversity of speech-driven 3D facial animation, aiming to capture the complex many-to-many relationships between audio and facial motion by leveraging the diversity capabilities of diffusion models. In this study, the Self-Diffuser method is proposed: the pre-trained speech model wav2vec 2.0 encodes the audio input, and diffusion-based techniques combined with a Transformer carry out the generation task. This research not only overcomes the limitations of traditional regression models in generating lip movements that are both realistic and lip-reading comprehensible, but also explores the trade-off between precise lip synchronization and creating facial expressions independent of speech. Comparisons with current state-of-the-art methods show that Self-Diffuser achieves more accurate lip movements in speech-driven facial animation, produces upper-face motion that more closely resembles real speaking expressions in regions loosely correlated with speech, and, through the introduced diffusion mechanism, significantly enhances the diversity of the generated 3D facial animation sequences.
Keywords: wav2vec 2.0, Transformer, diffusion mechanism, speech-driven, facial animation
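A toy sketch of the diffusion idea behind such a model: noise a facial-motion sequence at a random timestep and train a Transformer, conditioned on audio features, to predict the noise. This is a generic DDPM-style training step with invented dimensions, not the Self-Diffuser implementation.

```python
import torch
import torch.nn as nn

class MotionDenoiser(nn.Module):
    """Transformer encoder that predicts the noise added to a facial-motion
    sequence, conditioned on audio features by simple concatenation."""
    def __init__(self, motion_dim=64, audio_dim=768, width=256):
        super().__init__()
        self.proj = nn.Linear(motion_dim + audio_dim + 1, width)
        layer = nn.TransformerEncoderLayer(d_model=width, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(width, motion_dim)

    def forward(self, noisy_motion, audio_feats, t):
        t_emb = t.float().view(-1, 1, 1).expand(-1, noisy_motion.size(1), 1) / 1000.0
        x = torch.cat([noisy_motion, audio_feats, t_emb], dim=-1)
        return self.out(self.backbone(self.proj(x)))

def diffusion_training_step(model, motion, audio_feats, T=1000):
    """One DDPM step: noise the motion at a random t, predict the noise."""
    betas = torch.linspace(1e-4, 0.02, T)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)
    t = torch.randint(0, T, (motion.size(0),))
    eps = torch.randn_like(motion)
    ab = alpha_bar[t].view(-1, 1, 1)
    noisy = ab.sqrt() * motion + (1 - ab).sqrt() * eps
    return nn.functional.mse_loss(model(noisy, audio_feats, t), eps)

model = MotionDenoiser()
loss = diffusion_training_step(model, torch.randn(2, 30, 64), torch.randn(2, 30, 768))
loss.backward()
```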
11. Cross-feature fusion speech emotion recognition based on attention mask residual network and Wav2vec 2.0
Authors: Xiaoke Li, Zufan Zhang. Digital Communications and Networks, 2025, No. 5, pp. 1567-1577 (11 pages).
Speech Emotion Recognition (SER) has received widespread attention as a crucial way of understanding human emotional states. However, the impact of irrelevant information on speech signals and data sparsity limit the development of SER systems. To address these issues, this paper proposes a framework that incorporates the Attentive Mask Residual Network (AM-ResNet) and the self-supervised learning model Wav2vec 2.0 to obtain AM-ResNet features and Wav2vec 2.0 features respectively, together with a cross-attention module to interact and fuse these two features. The AM-ResNet branch mainly consists of maximum amplitude difference detection, a mask residual block, and an attention mechanism. The maximum amplitude difference detection and the mask residual block act on the pre-processing and the network, respectively, to reduce the impact of silent frames, and the attention mechanism assigns different weights to unvoiced and voiced speech to reduce redundant emotional information caused by unvoiced speech. In the Wav2vec 2.0 branch, this model is introduced as a feature extractor to obtain general speech features (Wav2vec 2.0 features) through pre-training with a large amount of unlabeled speech data, which can assist the SER task and cope with data sparsity problems. In the cross-attention module, AM-ResNet features and Wav2vec 2.0 features are interacted with and fused to obtain the cross-fused features, which are used to predict the final emotion. Furthermore, multi-label learning is also used to add ambiguous emotion utterances to deal with data limitations. Finally, experimental results illustrate the usefulness and superiority of the proposed framework over existing state-of-the-art approaches.
Keywords: speech emotion recognition, residual network, mask, attention, Wav2vec 2.0, cross-feature fusion
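A minimal sketch of a bidirectional cross-attention fusion between two feature streams such as the AM-ResNet and Wav2vec 2.0 branches. Projection to a common dimension is assumed to have happened upstream, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Each branch queries the other; the attended outputs are pooled,
    concatenated, and classified. Dimensions are illustrative."""
    def __init__(self, dim=256, heads=4, num_emotions=4):
        super().__init__()
        self.a2b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.b2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fc = nn.Linear(2 * dim, num_emotions)

    def forward(self, resnet_feats, w2v_feats):
        # resnet_feats, w2v_feats: (batch, frames, dim) after projection
        a, _ = self.a2b(resnet_feats, w2v_feats, w2v_feats)   # ResNet queries Wav2vec
        b, _ = self.b2a(w2v_feats, resnet_feats, resnet_feats)
        fused = torch.cat([a.mean(dim=1), b.mean(dim=1)], dim=-1)
        return self.fc(fused)

fusion = CrossAttentionFusion()
logits = fusion(torch.randn(2, 50, 256), torch.randn(2, 50, 256))
```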
12. Using Speaker-Specific Emotion Representations in Wav2vec 2.0-Based Modules for Speech Emotion Recognition
Authors: Somin Park, Mpabulungi Mark, Bogyung Park, Hyunki Hong. Computers, Materials & Continua (SCIE, EI), 2023, No. 10, pp. 1009-1030 (22 pages).
Speech emotion recognition is essential for frictionless human-machine interaction, where machines respond to human instructions with context-aware actions. The properties of individuals' voices vary with culture, language, gender, and personality. These variations in speaker-specific properties may hamper the performance of standard representations in downstream tasks such as speech emotion recognition (SER). This study demonstrates the significance of speaker-specific speech characteristics and how they can be leveraged to improve the performance of SER models. In the proposed approach, two wav2vec-based modules (a speaker-identification network and an emotion classification network) are trained with the ArcFace loss. The speaker-identification network has a single attention block to encode an input audio waveform into a speaker-specific representation. The emotion classification network uses a wav2vec 2.0 backbone as well as four attention blocks to encode the same input audio waveform into an emotion representation. These two representations are then fused into a single vector representation containing emotion and speaker-specific information. Experimental results showed that the use of speaker-specific characteristics improves SER performance. Additionally, combining these with an angular margin loss such as the ArcFace loss improves intra-class compactness while increasing inter-class separability, as demonstrated by plots of t-distributed stochastic neighbor embeddings (t-SNE). The proposed approach outperforms previous methods using similar training strategies, with a weighted accuracy (WA) of 72.14% and unweighted accuracy (UA) of 72.97% on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset. This demonstrates its effectiveness and potential to enhance human-machine interaction through more accurate emotion recognition in speech.
Keywords: attention block, IEMOCAP dataset, speaker-specific representation, speech emotion recognition, wav2vec 2.0
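The ArcFace loss the abstract builds on is standard enough to sketch: an additive angular margin on the true-class angle before a scaled softmax, which tightens intra-class compactness and widens inter-class separation. The scale and margin below follow common defaults, not necessarily the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceLoss(nn.Module):
    """Additive angular margin loss (ArcFace-style); s and m are common defaults."""
    def __init__(self, feat_dim, num_classes, s=30.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.s, self.m = s, m

    def forward(self, feats, labels):
        # Cosine similarity between normalized features and class weights.
        cos = F.linear(F.normalize(feats), F.normalize(self.weight))
        cos = cos.clamp(-1 + 1e-7, 1 - 1e-7)
        target = torch.cos(torch.acos(cos) + self.m)      # margin on the true class
        onehot = F.one_hot(labels, cos.size(1)).bool()
        logits = self.s * torch.where(onehot, target, cos)
        return F.cross_entropy(logits, labels)

criterion = ArcFaceLoss(feat_dim=192, num_classes=4)
loss = criterion(torch.randn(8, 192), torch.randint(0, 4, (8,)))
```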
13. An Optimized Chinese Filtering Model Using Value Scale Extended Text Vector
Authors: Siyu Lu, Ligao Cai, Zhixin Liu, Shan Liu, Bo Yang, Lirong Yin, Mingzhe Liu, Wenfeng Zheng. Computer Systems Science & Engineering (SCIE, EI), 2023, No. 11, pp. 1881-1899 (19 pages).
With the development of Internet technology, the explosive growth of Internet information has made it difficult to filter out effective information. Finding a high-accuracy model for text classification has become a critical problem for text filtering, especially for Chinese texts. This paper uses manually calibrated comment data from the Douban movie website for the research. First, a text filtering model based on a BP neural network is built. Second, based on the Term Frequency-Inverse Document Frequency (TF-IDF) vector space model and the doc2vec method, a text word-frequency vector and a text semantic vector are obtained respectively, and the word-frequency vector is linearly reduced in dimension with Principal Component Analysis (PCA). Third, the reduced word-frequency vector and the semantic vector are combined, the text value degree is added, and a composite text vector is constructed. Experiments show that the model combining the dimensionality-reduced word-frequency vector, the text semantic vector, and the text value degree reaches the highest accuracy of 84.67%.
Keywords: Chinese text filtering, text vector, word frequency vectors, text semantic vectors, value degree, BP neural network, TF-IDF, doc2vec, PCA
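A minimal sketch of the composite text vector the abstract describes: a PCA-reduced TF-IDF word-frequency vector, a doc2vec semantic vector, and a per-text value degree concatenated together. The tiny corpus, dimensions, and value scores are placeholders.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = ["这部电影 非常 精彩", "剧情 拖沓 无聊", "演员 表演 出色"]  # pre-tokenized comments
values = np.array([[0.9], [0.2], [0.8]])   # per-text "value degree" (placeholder scores)

# Word-frequency vector: TF-IDF reduced with PCA.
tfidf = TfidfVectorizer().fit_transform(docs).toarray()
freq_vec = PCA(n_components=2).fit_transform(tfidf)

# Semantic vector: doc2vec trained on the same corpus.
tagged = [TaggedDocument(d.split(), [i]) for i, d in enumerate(docs)]
d2v = Doc2Vec(tagged, vector_size=16, min_count=1, epochs=50)
sem_vec = np.stack([d2v.dv[i] for i in range(len(docs))])

# Composite text vector: frequency + semantics + value degree, as described.
composite = np.hstack([freq_vec, sem_vec, values])
print(composite.shape)   # (3, 2 + 16 + 1)
```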
14. Emvirus: An embedding-based neural framework for human-virus protein-protein interactions prediction (cited 1 time)
Authors: Pengfei Xie, Jujuan Zhuang, Geng Tian, Jialiang Yang. Biosafety and Health (CAS, CSCD), 2023, No. 3, pp. 152-158 (7 pages).
Human-virus protein-protein interactions (PPIs) play critical roles in viral infection. For example, the spike protein of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) binds primarily to the human angiotensin-converting enzyme 2 (ACE2) protein to infect human cells. Thus, identifying and blocking these PPIs contributes to controlling and preventing viruses. However, wet-lab experiment-based identification of human-virus PPIs is usually expensive, labor-intensive, and time-consuming, which creates the need for computational methods. Many machine-learning methods have been proposed recently and have achieved good results in predicting human-virus PPIs. However, most methods are based on protein sequence features and apply manually extracted features, such as statistical characteristics, phylogenetic profiles, and physicochemical properties. In this work, we present an embedding-based neural framework with a convolutional neural network (CNN) and bi-directional long short-term memory unit (Bi-LSTM) architecture, named Emvirus, to predict human-virus PPIs (including human-SARS-CoV-2 PPIs). In addition, we conduct cross-viral experiments to explore the generalization ability of Emvirus. Compared to other feature extraction methods, Emvirus achieves better prediction accuracy.
Keywords: SARS-CoV-2, human-virus PPI, word embedding, Doc2vec, neural networks
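A hedged sketch of an Emvirus-style pipeline: embed amino-acid tokens, pick up local motifs with a 1-D CNN, model longer-range context with a Bi-LSTM, and score a human-virus protein pair. Vocabulary size, dimensions, and the pairing scheme are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EmbedCnnBiLstm(nn.Module):
    """Embedding -> 1-D CNN (local motifs) -> Bi-LSTM (context) -> pair score."""
    def __init__(self, vocab=26, emb=32, channels=64, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb, padding_idx=0)
        self.conv = nn.Conv1d(emb, channels, kernel_size=7, padding=3)
        self.lstm = nn.LSTM(channels, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(4 * hidden, 1)    # two proteins, each 2*hidden

    def encode(self, seq):                    # seq: (batch, length) token ids
        x = self.emb(seq).transpose(1, 2)     # (batch, emb, length)
        x = torch.relu(self.conv(x)).transpose(1, 2)
        out, _ = self.lstm(x)
        return out.mean(dim=1)                # (batch, 2*hidden)

    def forward(self, human_seq, virus_seq):
        pair = torch.cat([self.encode(human_seq), self.encode(virus_seq)], dim=-1)
        return torch.sigmoid(self.fc(pair)).squeeze(-1)   # interaction probability

model = EmbedCnnBiLstm()
prob = model(torch.randint(1, 26, (2, 500)), torch.randint(1, 26, (2, 300)))
```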
15. Self-supervised Learning for Speech Emotion Recognition Task Using Audio-visual Features and Distil HuBERT Model on BAVED and RAVDESS Databases (cited 1 time)
Authors: Karim Dabbabi, Abdelkarim Mars. Journal of Systems Science and Systems Engineering (SCIE, EI, CSCD), 2024, No. 5, pp. 576-606 (31 pages).
Existing pre-trained models like Distil HuBERT excel at uncovering hidden patterns and facilitating accurate recognition across diverse data types, such as audio and visual information. We harnessed this capability to develop a deep learning model that uses Distil HuBERT to jointly learn these combined features for speech emotion recognition (SER). Our experiments highlight its distinct advantages: it significantly outperforms Wav2vec 2.0 in both offline and real-time accuracy on the RAVDESS and BAVED datasets. Although slightly trailing HuBERT's offline accuracy, Distil HuBERT delivers comparable performance at a fraction of the model size, making it an ideal choice for resource-constrained environments like mobile devices. This smaller size does come with a slight trade-off: Distil HuBERT achieved notable accuracy in offline evaluation, with 96.33% on the BAVED database and 87.01% on the RAVDESS database, while in real-time evaluation accuracy decreased to 79.3% on BAVED and 77.87% on RAVDESS. This decrease likely results from the challenges of real-time processing, including latency and noise, but still demonstrates strong performance in practical scenarios. Distil HuBERT therefore emerges as a compelling choice for SER, especially when accuracy is prioritized over real-time processing, and its compact size further enhances its potential for resource-limited settings, making it a versatile tool for a wide range of applications.
Keywords: Wav2vec 2.0, Distil HuBERT, HuBERT, SER, audio and audio-visual features
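A minimal sketch of extracting DistilHuBERT features for SER with the transformers library. Assuming the public `ntu-spml/distilhubert` checkpoint on the Hugging Face Hub stands in for the model used here; the linear head is untrained and only shows how a classifier would attach.

```python
import torch
from transformers import AutoFeatureExtractor, AutoModel

# "ntu-spml/distilhubert" is the public DistilHuBERT release on the Hub;
# swap in another identifier if you use a different checkpoint.
extractor = AutoFeatureExtractor.from_pretrained("ntu-spml/distilhubert")
model = AutoModel.from_pretrained("ntu-spml/distilhubert")

wave = torch.randn(16000)                       # 1 s of 16 kHz audio (placeholder)
inputs = extractor(wave.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, frames, 768)

# A lightweight SER head on mean-pooled features; it still has to be
# trained on BAVED / RAVDESS emotion labels.
num_emotions = 8
head = torch.nn.Linear(hidden.size(-1), num_emotions)
logits = head(hidden.mean(dim=1))
```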