Abstract: Dynamic time warping (DTW) and dynamic spectral warping (DSW) techniques are introduced into the learning vector quantization (LVQ) algorithm to construct a "dynamic" Bayes classifier for speech recognition. It can produce highly discriminative "dynamic" reference vectors that represent the temporal and spectral variabilities of speech. Recognition experiments on 19 Chinese consonants show that the "dynamic" classifier significantly outperforms the original "static" classifier.
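The abstract does not spell out the DTW recurrence it builds on, so as a point of reference the Python sketch below computes a basic DTW distance between two sequences of feature vectors. The Euclidean frame distance and the simple three-way step pattern are assumptions, not details taken from the paper; DSW applies the same kind of alignment along the spectral axis rather than the time axis.

```python
import numpy as np

def dtw_distance(x, y):
    """Minimal dynamic time warping distance between two sequences of feature
    vectors x (n, d) and y (m, d), using Euclidean frame distance and the basic
    symmetric step pattern (an assumption; the paper's local constraints are
    not given in the abstract)."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])  # local frame distance
            # extend the cheapest of the three allowed predecessor cells
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

# Toy usage: two random sequences of 13-dimensional MFCC-like frames.
rng = np.random.default_rng(0)
a = rng.normal(size=(20, 13))
b = rng.normal(size=(24, 13))
print(dtw_distance(a, b))
```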
Funding: This work was supported by the Natural Science Foundation of Huazhong University of Science and Technology, PRC (No. 2007Q006B).
Abstract: In order to distinguish faces viewed from various angles during face recognition, an algorithm combining a form of approximate dynamic programming (ADP) known as action-dependent heuristic dynamic programming (ADHDP) with particle swarm optimization (PSO) is presented. ADP is used to dynamically adjust the values of the PSO parameters. In the recognition process, the discrete cosine transform (DCT) is first applied to reduce negative effects. The Karhunen-Loeve (K-L) transform is then used to compress the images and reduce the data dimensionality, and principal component analysis (PCA) extracts the principal components of the vectors for data representation. Finally, a radial basis function (RBF) neural network is trained to recognize the faces, and its training is carried out by ADP-PSO. Experiments on the ORL Face Database demonstrate the accuracy and efficiency of the method.
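The abstract's key loop is PSO searching for the RBF network's parameters while ADHDP adapts the PSO coefficients online. The ADHDP critic is beyond what the abstract specifies, so the sketch below shows plain PSO minimizing a fitness function, with the inertia and acceleration coefficients supplied by a pluggable schedule; the linear decay used when no schedule is given is a stand-in for the ADP controller, not the paper's method.

```python
import numpy as np

def pso_minimize(fitness, dim, n_particles=30, iters=100, bounds=(-1.0, 1.0),
                 param_schedule=None, seed=0):
    """Plain particle swarm optimization. `param_schedule(t)` returns
    (w, c1, c2) for iteration t; in the paper these values come from an
    ADHDP controller, which is replaced here by a simple decay (assumption)."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    pos = rng.uniform(lo, hi, size=(n_particles, dim))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_val = np.array([fitness(p) for p in pos])
    gbest = pbest[np.argmin(pbest_val)].copy()

    for t in range(iters):
        if param_schedule is None:
            w, c1, c2 = 0.9 - 0.5 * t / iters, 2.0, 2.0  # stand-in schedule
        else:
            w, c1, c2 = param_schedule(t)
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)
        vals = np.array([fitness(p) for p in pos])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        gbest = pbest[np.argmin(pbest_val)].copy()
    return gbest, pbest_val.min()

# Toy usage: in the paper's setting, `fitness` would evaluate the RBF
# network's recognition error on the PCA-compressed face features.
best, err = pso_minimize(lambda p: np.sum(p ** 2), dim=10)
print(err)
```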
Funding: Sponsored by the National Natural Science Foundation of China (Grant No. 60575030), the Scientific Research Foundation of Harbin Institute of Technology (Grant No. HIT.2002.70), and the Heilongjiang Scientific Research Foundation for Scholars Returned from Abroad (Grant No. LC03C10).
Abstract: In the past several years, support vector machines (SVM) have achieved huge success in many fields, especially in pattern recognition. However, the standard SVM cannot handle length-variable vectors, which is a severe obstacle to its application in some important areas, such as speech recognition and part-of-speech tagging. This paper proposes a novel SVM with discriminative dynamic time alignment (DDTA-SVM) to solve this problem. When training the DDTA-SVM classifier, different time-alignment strategies are adopted in the kernel functions according to the category information of the training samples, which greatly improves the training speed and the generalization capability of the classifier. Since the alignment operator is embedded in the kernel functions, the training algorithms of the standard SVM remain compatible with DDTA-SVM. In order to increase the reliability of the classification, a new classification algorithm is also suggested. Preliminary experimental results on a Chinese confusable-syllable speech classification task show that DDTA-SVM achieves faster convergence and better classification performance than the dynamic time alignment kernel SVM (DTAK-SVM). Moreover, DDTA-SVM also gives higher classification precision than the conventional HMM. This shows that the proposed method is effective, especially for confusable length-variable pattern classification tasks.
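The abstract's central idea is embedding the time-alignment operator inside the kernel so that standard SVM training still applies. The class-dependent (discriminative) alignment strategy of DDTA-SVM is not detailed, so the sketch below only illustrates the general pattern: a Gaussian of the DTW distance used as a precomputed kernel with scikit-learn's SVC. The kernel width `gamma` and the toy data are assumptions, and such DTW-based kernels are not guaranteed to be positive semi-definite.

```python
import numpy as np
from sklearn.svm import SVC

def dtw_distance(x, y):
    """Basic DTW distance between sequences of feature vectors
    (same recurrence as the earlier sketch)."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def alignment_kernel_matrix(seqs_a, seqs_b, gamma=1e-4):
    """Gaussian of the DTW distance, used as a precomputed SVM kernel.
    DDTA-SVM's class-dependent alignment is not reproduced here."""
    K = np.zeros((len(seqs_a), len(seqs_b)))
    for i, a in enumerate(seqs_a):
        for j, b in enumerate(seqs_b):
            K[i, j] = np.exp(-gamma * dtw_distance(a, b) ** 2)
    return K

# Toy usage with random variable-length "utterances" of 13-dim frames.
rng = np.random.default_rng(1)
train = [rng.normal(size=(rng.integers(15, 30), 13)) for _ in range(20)]
labels = np.array([0] * 10 + [1] * 10)
clf = SVC(kernel="precomputed").fit(alignment_kernel_matrix(train, train), labels)
test = [rng.normal(size=(rng.integers(15, 30), 13)) for _ in range(5)]
print(clf.predict(alignment_kernel_matrix(test, train)))
```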
Funding: Supported by the Research Plan Project of the National University of Defense Technology under Grant No. JC13-06-01, and by the OCRit Project made possible by the Global Leadership Round in Genomics & Life Sciences Grant (GL2).
Abstract: Obtaining training material for rarely used English words and for common given names from countries where English is not spoken is difficult due to excessive time, storage, and cost factors. Considering personal privacy, language-independent (LI), lightweight speaker-dependent (SD) automatic speech recognition (ASR) is a convenient option to solve the problem. The dynamic time warping (DTW) algorithm is the state-of-the-art algorithm for small-footprint SD ASR in real-time applications with limited storage and small vocabularies. These applications include voice dialing on mobile devices, menu-driven recognition, and voice control in vehicles and robotics. However, traditional DTW has several limitations, such as high computational complexity, constraint-induced coarse approximation, and inaccuracy problems. In this paper, we introduce the merge-weighted dynamic time warping (MWDTW) algorithm. This method defines a template confidence index for measuring the similarity between merged training data and testing data, while following the core DTW process. MWDTW is simple, efficient, and easy to implement. Extensive experiments on three representative SD speech recognition datasets demonstrate that our method significantly outperforms DTW, DTW on merged speech data, and the hidden Markov model (HMM), and is also six times faster than DTW overall.
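MWDTW's merging rule and template confidence index are only named, not defined, in the abstract. The sketch below shows one plausible reading purely for illustration: training utterances are DTW-aligned onto a reference and averaged frame by frame into a merged template, and a test utterance is scored with a confidence index derived from the length-normalized DTW distance. Both the averaging rule and the exponential confidence measure are assumptions, not the published algorithm.

```python
import numpy as np

def dtw_path(x, y):
    """DTW cost plus a backtracked warping path between sequences x (n, d), y (m, d)."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    path, (i, j) = [], (n, m)
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        i, j = (i - 1, j - 1) if step == 0 else ((i - 1, j) if step == 1 else (i, j - 1))
    return cost[n, m], path[::-1]

def merge_templates(utterances):
    """Hypothetical merging rule: align every utterance to the first one and
    average the frames that map onto each reference frame."""
    ref = utterances[0]
    acc, cnt = ref.copy(), np.ones(len(ref))
    for utt in utterances[1:]:
        _, path = dtw_path(ref, utt)
        for i, j in path:
            acc[i] += utt[j]
            cnt[i] += 1
    return acc / cnt[:, None]

def confidence_index(template, test):
    """Hypothetical confidence: exp(-normalized DTW distance); higher means more similar."""
    dist, path = dtw_path(template, test)
    return np.exp(-dist / len(path))

# Toy usage with random feature sequences standing in for training utterances.
rng = np.random.default_rng(2)
train = [rng.normal(size=(rng.integers(18, 25), 13)) for _ in range(3)]
template = merge_templates(train)
print(confidence_index(template, rng.normal(size=(22, 13))))
```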
Funding: Supported by the National Key Research and Development Program of China (No. 2021YFA1401103) and the National Natural Science Foundation of China (Nos. 61825403, 61921005, and 82370520).
Abstract: Gesture recognition utilizing flexible strain sensors is a highly valuable technology widely applied in human-machine interfaces. However, rapid detection of subtle motions and timely processing of dynamic signals remain a challenge for such sensors. Here, highly resilient and durable ionogels are developed by introducing micro-scale incompatible phases into a macroscopically homogeneous polymeric network. The compatible network disperses in the conductive ionic liquid to form a highly resilient and stretchable skeleton, while the incompatible phase forms hydrogen bonds that dissipate energy and thus strengthen the ionogels. The ionogel-derived strain sensors show high sensitivity, fast response time (<10 ms), a low detection limit (~50 μm), and remarkable durability (>5000 cycles), allowing precise monitoring of human motions. More importantly, a self-adaptive recognition program empowered by deep-learning algorithms is designed to compensate for sensor variability, creating a comprehensive system capable of dynamic gesture recognition. This system analyzes both the temporal and spatial features of the sensor data, enabling a deeper understanding of the dynamic process underlying gestures. It accurately classifies 10 hand gestures across five participants with an accuracy of 93.66%, and it maintains robust recognition performance without further training even when different sensors or subjects are involved. This technological breakthrough paves the way for intuitive and seamless interaction between humans and machines, presenting significant opportunities in diverse applications such as human-robot interaction, virtual reality control, and assistive devices for individuals with disabilities.
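The abstract describes the recognition side only at a high level (deep learning over the temporal and spatial features of the sensor signals), so the following Keras sketch is purely illustrative: 1D convolutions over time on multi-channel strain windows, followed by an LSTM and a 10-way softmax. The window length, sensor count, and every layer size are assumptions.

```python
import numpy as np
from tensorflow.keras import layers, models

def build_gesture_model(timesteps=200, n_sensors=5, n_gestures=10):
    """Hypothetical classifier for windows of multi-channel strain-sensor data:
    Conv1D layers capture local temporal patterns across sensor channels,
    an LSTM summarizes the dynamics, and a softmax outputs the gesture class."""
    model = models.Sequential([
        layers.Input(shape=(timesteps, n_sensors)),
        layers.Conv1D(32, 7, padding="same", activation="relu"),
        layers.MaxPooling1D(2),
        layers.Conv1D(64, 5, padding="same", activation="relu"),
        layers.MaxPooling1D(2),
        layers.LSTM(64),
        layers.Dense(n_gestures, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Toy usage with random windows in place of real sensor recordings.
model = build_gesture_model()
x = np.random.randn(32, 200, 5).astype("float32")
y = np.random.randint(0, 10, size=32)
model.fit(x, y, epochs=1, verbose=0)
```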
Abstract: A deep neural network model for speech emotion recognition is proposed, in which Mel-frequency cepstral coefficients, chroma features, and dynamic energy changes are concatenated into a combined feature representation and fed into the network to recognize emotions. In the data preprocessing stage, principal component analysis is applied to improve the computational efficiency and generalization ability of the model, while the Borderline-SMOTE method is used to strengthen the model's ability to recognize minority classes. The model adopts the classic architecture of a convolutional neural network cascaded with a long short-term memory network. The Berlin Emotional Speech Database (EMODB), the Surrey Audio-Visual Expressed Emotion (SAVEE) dataset, and the Chinese Academy of Sciences Institute of Automation (CASIA) dataset are used to evaluate the model on three languages: German, English, and Chinese. Experiments show that the unweighted accuracy of emotion recognition is 89.72% on EMODB, 65.62% on SAVEE, and 65.42% on CASIA.
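The pipeline in this abstract is concrete enough to outline in code: frame-level MFCC, chroma, and dynamic-energy features are concatenated, reduced with PCA, rebalanced with Borderline-SMOTE, and fed to a CNN cascaded with an LSTM. The sketch below uses librosa, scikit-learn, imbalanced-learn, and Keras; the frame budget, PCA dimensionality, layer sizes, and the way SMOTE is applied to flattened sequences are simplifying assumptions rather than the authors' exact configuration.

```python
import numpy as np
import librosa
from sklearn.decomposition import PCA
from imblearn.over_sampling import BorderlineSMOTE
from tensorflow.keras import layers, models

MAX_FRAMES = 200  # assumed fixed utterance length in frames

def extract_features(path):
    """MFCC + chroma + dynamic energy (RMS and its delta) per frame,
    padded or cropped to MAX_FRAMES; returns shape (MAX_FRAMES, 27)."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    rms = librosa.feature.rms(y=y)
    d_rms = librosa.feature.delta(rms)              # "dynamic energy change"
    feat = np.vstack([mfcc, chroma, rms, d_rms]).T  # (T, 27)
    feat = feat[:MAX_FRAMES]
    return np.pad(feat, ((0, MAX_FRAMES - len(feat)), (0, 0)))

def preprocess(X, y, n_components=20):
    """Frame-level PCA, then Borderline-SMOTE on the flattened sequences."""
    n, t, f = X.shape
    pca = PCA(n_components=n_components).fit(X.reshape(-1, f))
    Xp = pca.transform(X.reshape(-1, f)).reshape(n, t, n_components)
    Xr, yr = BorderlineSMOTE(random_state=0).fit_resample(Xp.reshape(n, -1), y)
    return Xr.reshape(-1, t, n_components), yr

def build_model(timesteps, n_features, n_classes):
    """CNN cascaded with an LSTM, as described in the abstract (layer sizes assumed)."""
    model = models.Sequential([
        layers.Input(shape=(timesteps, n_features)),
        layers.Conv1D(64, 5, padding="same", activation="relu"),
        layers.MaxPooling1D(2),
        layers.LSTM(128),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Training would then call `model.fit` on the resampled sequences, with the unweighted accuracy evaluated separately on EMODB, SAVEE, and CASIA.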
Funding: Supported by the Shanghai Science and Technology Commission under Grant No. 23010501500.
Abstract: Recognizing emotions from speech is of great significance for enhancing human-machine interaction. When capturing temporal dynamic features, convolutional neural networks (CNNs) continuously compress the feature size and stack weighted values, so that important dynamic features in different channels and at different depths are lost during extraction, which reduces recognition accuracy. To address this issue, this study employs the U-Net architecture for speech emotion recognition and proposes an extended version of the U-Net structure. Specifically, the temporal dynamic features of the audio signal are extracted by rectangular convolutions to generate a Mel-spectrogram dynamic feature map. The U-Net architecture then establishes connections between feature maps of different scales, while channel-selection attention assesses dissimilarities among the dynamic features across channels. Experimental results on the combined CER dataset show that the enhanced U-Net effectively filters the essential temporal dynamic features, yielding a 4.29 percentage point improvement in recognition accuracy over the baseline model.
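The three named ingredients here are rectangular convolutions over the Mel-spectrogram to capture temporal dynamics, U-Net skip connections across scales, and channel-selection attention. The Keras sketch below wires a small version of these pieces together; the (3, 9) kernel shape, the network depth, and the squeeze-and-excitation form chosen for the channel attention are assumptions rather than the authors' exact design.

```python
from tensorflow.keras import layers, models

def channel_attention(x, reduction=4):
    """Squeeze-and-excitation style channel gating, standing in for the paper's
    channel-selection attention (an assumption)."""
    c = x.shape[-1]
    w = layers.GlobalAveragePooling2D()(x)
    w = layers.Dense(c // reduction, activation="relu")(w)
    w = layers.Dense(c, activation="sigmoid")(w)
    return layers.Multiply()([x, layers.Reshape((1, 1, c))(w)])

def build_unet_ser(n_mels=64, frames=256, n_classes=7):
    """Small U-Net over a Mel-spectrogram. Rectangular (3, 9) kernels span more
    time frames than Mel bands to emphasize temporal dynamics."""
    inp = layers.Input(shape=(n_mels, frames, 1))

    e1 = layers.Conv2D(16, (3, 9), padding="same", activation="relu")(inp)
    p1 = layers.MaxPooling2D(2)(e1)
    e2 = layers.Conv2D(32, (3, 9), padding="same", activation="relu")(p1)
    p2 = layers.MaxPooling2D(2)(e2)

    b = layers.Conv2D(64, (3, 9), padding="same", activation="relu")(p2)

    u2 = layers.UpSampling2D(2)(b)
    d2 = layers.Conv2D(32, (3, 9), padding="same", activation="relu")(
        layers.Concatenate()([u2, e2]))          # skip connection across scales
    d2 = channel_attention(d2)
    u1 = layers.UpSampling2D(2)(d2)
    d1 = layers.Conv2D(16, (3, 9), padding="same", activation="relu")(
        layers.Concatenate()([u1, e1]))
    d1 = channel_attention(d1)

    out = layers.Dense(n_classes, activation="softmax")(
        layers.GlobalAveragePooling2D()(d1))
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_unet_ser()
model.summary()
```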