To solve the problem that traditional Mel Frequency Cepstral Coefficient (MFCC) features cannot fully represent dynamic speech characteristics, this paper introduces first-order and second-order differences on top of static MFCC features to extract dynamic MFCC features, and constructs a hybrid model (TWM: TIM-NET (Temporal-aware Bi-directional Multi-scale Network) + WGAN-GP (Wasserstein Generative Adversarial Network with Gradient Penalty) + multi-head attention) that combines a multi-head attention mechanism and an improved WGAN-GP on the basis of the TIM-NET network. The multi-head attention mechanism not only effectively prevents gradient vanishing, but also allows deeper networks that capture long-range dependencies and learn from information at different time steps, improving the accuracy of the model; WGAN-GP addresses insufficient sample size by improving the quality of generated speech samples. Experimental results show that this method significantly improves the accuracy and robustness of speech emotion recognition on the RAVDESS and EMO-DB datasets.
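The first- and second-order differences mentioned above are conventionally computed with the standard delta-regression formula over a window of neighboring frames. A minimal numpy sketch (the MFCC matrix here is a random placeholder and the window half-width N=2 is an illustrative assumption, not a value from the paper):

```python
import numpy as np

def delta(features, N=2):
    """Regression-based delta over a (frames, coeffs) matrix:
    d_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum n^2).
    Edge frames are handled by repeating the first/last frame."""
    denom = 2 * sum(n * n for n in range(1, N + 1))
    padded = np.pad(features, ((N, N), (0, 0)), mode="edge")
    out = np.zeros_like(features, dtype=float)
    for t in range(features.shape[0]):
        out[t] = sum(
            n * (padded[t + N + n] - padded[t + N - n]) for n in range(1, N + 1)
        ) / denom
    return out

# Static MFCCs (placeholder matrix), plus first- and second-order deltas:
mfcc = np.random.randn(100, 13)           # 100 frames, 13 coefficients
d1 = delta(mfcc)                          # first-order difference
d2 = delta(d1)                            # second-order difference
dynamic_mfcc = np.hstack([mfcc, d1, d2])  # 39-dimensional dynamic feature
```

Stacking statics with their deltas is the usual way to form the "dynamic MFCC" representation; a constant signal yields all-zero deltas, as expected.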
Dynamic time warping (DTW) and dynamic spectral warping (DSW) techniques are introduced into the learning vector quantization (LVQ) algorithm to construct a "dynamic" Bayes classifier for speech recognition. It can produce highly discriminative "dynamic" reference vectors that represent the temporal and spectral variabilities of speech. Recognition experiments on 19 Chinese consonants show that the "dynamic" classifier significantly outperforms the original "static" classifier.
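The DTW alignment underlying such "dynamic" reference vectors follows the classic dynamic-programming recursion. A generic sketch (the absolute-difference local cost is an illustrative choice, not the paper's):

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two 1-D sequences,
    using absolute difference as the local cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)  # D[i, j]: best cost aligning a[:i], b[:j]
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Identical sequences align at zero cost; a time-stretched copy stays at zero
# too, because warping absorbs the stretch:
print(dtw_distance([1, 2, 3], [1, 2, 3]))        # prints 0.0
print(dtw_distance([1, 2, 3], [1, 1, 2, 2, 3]))  # prints 0.0
```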
In order to distinguish faces viewed from various angles during face recognition, an algorithm combining an approximate dynamic programming (ADP) variant called action-dependent heuristic dynamic programming (ADHDP) with particle swarm optimization (PSO) is presented. ADP is used to dynamically adapt the values of the PSO parameters. During face recognition, the discrete cosine transform (DCT) is first applied to reduce negative effects. Then, the Karhunen-Loeve (K-L) transform is used to compress images and reduce data dimensionality. Following principal component analysis (PCA), the principal components of the vectors are extracted for data representation. Finally, a radial basis function (RBF) neural network is trained to recognize faces, with the training driven by ADP-PSO. Experiments on the ORL Face Database demonstrate the accuracy and efficiency of the approach.
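The K-L transform / PCA compression step described above amounts to projecting centred data onto the leading eigenvectors of its covariance matrix. A minimal sketch with toy dimensions (real inputs would be flattened, DCT-preprocessed face images):

```python
import numpy as np

def pca_project(X, k):
    """Karhunen-Loeve transform: project the rows of X onto the top-k
    eigenvectors of the covariance matrix of the centred data."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # top-k principal directions
    return Xc @ top

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))   # 50 flattened face vectors (toy dimensions)
Z = pca_project(X, 3)          # compressed to 3 principal components
```

The projected components are mutually uncorrelated, which is what makes the representation compact for the downstream RBF classifier.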
In the past several years, support vector machines (SVMs) have achieved great success in many fields, especially in pattern recognition. However, the standard SVM cannot handle length-variable vectors, which is a severe obstacle to its application in important areas such as speech recognition and part-of-speech tagging. This paper proposes a novel SVM with discriminative dynamic time alignment (DDTA-SVM) to solve this problem. When training the DDTA-SVM classifier, different time alignment strategies are adopted in the kernel functions according to the category information of the training samples, which greatly improves the training speed and generalization capability of the classifier. Since the alignment operator is embedded in the kernel functions, the training algorithms of the standard SVM remain compatible with DDTA-SVM. To increase the reliability of the classification, a new classification algorithm is suggested. Preliminary experimental results on a Chinese confusable-syllable speech classification task show that DDTA-SVM achieves faster convergence and better classification performance than the dynamic time alignment kernel SVM (DTAK-SVM). Moreover, DDTA-SVM also gives higher classification precision than the conventional HMM. This shows that the proposed method is effective, especially for confusable length-variable pattern classification tasks.
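One common way to embed an alignment operator in a kernel, sketched here only as a generic illustration and not the paper's DDTA construction, is a Gaussian-style kernel over the DTW distance, which lets an SVM compare sequences of different lengths:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic DTW distance with absolute-difference local cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def dtw_kernel(a, b, gamma=0.1):
    """Similarity between length-variable sequences: exp(-gamma * DTW)."""
    return np.exp(-gamma * dtw_distance(a, b))
```

A caveat worth noting: such distance-substitution alignment kernels are not guaranteed to be positive semi-definite, which is one reason specialized constructions like DTAK and the paper's DDTA variant matter in practice.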
Obtaining training material for rarely used English words and for common given names from countries where English is not spoken is difficult due to excessive time, storage, and cost factors. Considering personal privacy, language-independent (LI) recognition with lightweight speaker-dependent (SD) automatic speech recognition (ASR) is a convenient option to solve the problem. The dynamic time warping (DTW) algorithm is the state-of-the-art algorithm for small-footprint SD ASR in real-time applications with limited storage and small vocabularies. These applications include voice dialing on mobile devices, menu-driven recognition, and voice control in vehicles and robotics. However, traditional DTW has several limitations, such as high computational complexity, constraint-induced coarse approximation, and inaccuracy. In this paper, we introduce the merge-weighted dynamic time warping (MWDTW) algorithm. This method defines a template confidence index for measuring the similarity between merged training data and testing data while following the core DTW process. MWDTW is simple, efficient, and easy to implement. With extensive experiments on three representative SD speech recognition datasets, we demonstrate that our method significantly outperforms DTW, DTW on merged speech data, and the hidden Markov model (HMM), and is also six times faster than DTW overall.
Gesture recognition utilizing flexible strain sensors is a highly valuable technology widely applied in human-machine interfaces. However, achieving rapid detection of subtle motions and timely processing of dynamic signals remains a challenge for sensors. Here, highly resilient and durable ionogels are developed by introducing micro-scale incompatible phases into a macroscopically homogeneous polymeric network. The compatible network disperses in a conductive ionic liquid to form a highly resilient and stretchable skeleton, while the incompatible phase forms hydrogen bonds that dissipate energy and thus strengthen the ionogels. The ionogel-derived strain sensors show high sensitivity, fast response time (<10 ms), a low detection limit (~50 μm), and remarkable durability (>5000 cycles), allowing precise monitoring of human motions. More importantly, a self-adaptive recognition program empowered by deep-learning algorithms is designed to compensate for the sensors, creating a comprehensive system capable of dynamic gesture recognition. This system can analyze both the temporal and spatial features of sensor data, enabling a deeper understanding of the dynamic process underlying gestures. The system classifies 10 hand gestures across five participants with an impressive accuracy of 93.66%. Moreover, it maintains robust recognition performance without further training even when different sensors or subjects are involved. This technological breakthrough paves the way for intuitive and seamless interaction between humans and machines, presenting significant opportunities in diverse applications such as human-robot interaction, virtual reality control, and assistive devices for disabled individuals.
Speech emotion recognition (SER) aims to give computers the ability to accurately identify emotional states in speech signals, and how to efficiently represent the emotional features in speech has long been a research focus of SER. At present, most studies use deep learning to learn optimal features directly from raw speech or spectrograms; this learning paradigm can extract more complete feature information, but it neglects the deeper, refined information of specific features and cannot guarantee feature interpretability. To address these problems, a progressive representation learning SER method based on convolutional neural networks (CnnPRL) is proposed, which uses convolutional neural networks (CNNs) to progressively extract interpretable, refined emotional features on top of acoustic speech features. First, interpretable shallow features are extracted by hand and an optimal feature set is selected. Second, a cascaded CNN and dynamic fusion structure is proposed to refine the shallow features and learn deep emotional representations. Finally, parallel heterogeneous CNNs are constructed to extract complementary features at different scales, and a fusion module performs multi-feature fusion, capturing multi-granularity features and integrating deep emotional information from different feature scales. Experimental results show that, with comparable time complexity, CnnPRL improves the weighted average recall (WAR) metric by at least 0.86, 2.92, and 1.46 percentage points on the IEMOCAP (Interactive EMOtional dyadic motion CAPture database), CASIA (Institute of Automation, Chinese Academy of Sciences), and EMODB (Berlin EMOtional DataBase) datasets, respectively, compared with methods such as SpeechFormer++, TLFMRF (Two-Layer Fuzzy Multiple Random Forest), and TIM-Net (Temporal-aware bI-direction Multi-scale Network), verifying the effectiveness of CnnPRL; ablation results verify that each module of CnnPRL contributes to the overall performance of the model.
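The parallel multi-scale branch idea can be loosely illustrated in numpy. The averaging filters below merely stand in for learned convolution kernels, and the branch widths and pooling statistics are illustrative assumptions, not CnnPRL's actual design:

```python
import numpy as np

def conv1d_valid(x, kernel):
    """Valid-mode 1-D correlation of signal x with a filter kernel."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

def multi_scale_features(x, widths=(3, 5, 7)):
    """Run the same signal through filters of several receptive-field widths
    and pool each branch, mimicking parallel multi-scale CNN branches whose
    outputs are fused into one multi-granularity feature vector."""
    feats = []
    for w in widths:
        kernel = np.ones(w) / w                          # placeholder filter
        response = conv1d_valid(x, kernel)
        feats.append([response.mean(), response.max()])  # simple pooling
    return np.concatenate(feats)                         # fused feature vector

x = np.sin(np.linspace(0, 6.28, 64))  # stand-in for an acoustic feature track
f = multi_scale_features(x)           # 3 scales x 2 pooled statistics = 6 dims
```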
A deep neural network model for speech emotion recognition is proposed, in which Mel-frequency cepstral coefficients, chroma features, and dynamic energy changes are concatenated into a combined feature vector and fed into the network to recognize emotion. In the data preprocessing stage, principal component analysis is used to improve the computational efficiency and generalization ability of the model, and the Borderline-SMOTE method is adopted to strengthen the model's ability to recognize minority classes. The model uses the classic architecture of a convolutional neural network cascaded with a long short-term memory network. The Berlin Emotional Speech Database (EMODB), the Surrey Audio-Visual Expressed Emotion (SAVEE) dataset, and the Chinese Academy of Sciences Institute of Automation (CASIA) dataset are used to evaluate three languages: German, English, and Chinese. Experiments yield the following unweighted accuracies on the different databases: 89.72% on EMODB, 65.62% on SAVEE, and 65.42% on CASIA.
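The feature-concatenation and PCA preprocessing described above can be sketched as follows. The feature matrices are random stand-ins with illustrative dimensions, and the Borderline-SMOTE oversampling step is omitted (it normally comes from a dedicated library such as imbalanced-learn):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for per-utterance features (real values would come from MFCC,
# chroma, and energy extraction over each recording):
mfcc   = rng.normal(size=(100, 40))   # 100 utterances, 40 MFCC statistics
chroma = rng.normal(size=(100, 12))   # 12 chroma bins
energy = rng.normal(size=(100, 4))    # dynamic energy descriptors

# Concatenate into one combined feature vector per utterance:
X = np.hstack([mfcc, chroma, energy])  # shape (100, 56)

# Rough PCA step via SVD, retaining the top-k directions, standing in for
# the dimensionality-reduction preprocessing described in the abstract:
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X_reduced = Xc @ Vt[:20].T             # keep 20 components
```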
Funding: This work was supported by the Natural Science Foundation of Huazhong University of Science and Technology, PRC (No. 2007Q006B).
Funding: Sponsored by the National Natural Science Foundation of China (Grant No. 60575030), the Scientific Research Foundation of Harbin Institute of Technology (Grant No. HIT.2002.70), and the Heilongjiang Scientific Research Foundation for Scholars Returned from Abroad (Grant No. LC03C10).
Funding: Supported by the Research Plan Project of the National University of Defense Technology under Grant No. JC13-06-01, and the OCRit Project made possible by the Global Leadership Round in Genomics & Life Sciences Grant (GL2).
Funding: Supported by the National Key Research and Development Program of China (No. 2021YFA1401103) and the National Natural Science Foundation of China (Nos. 61825403, 61921005, and 82370520).