Adversarial attacks pose significant security concerns for intelligent systems such as speaker recognition systems (SRSs). Most attacks assume that the neural networks in the system are known beforehand, whereas black-box attacks are designed to work without such information, which better matches practical situations. Existing black-box attacks improve transferability by integrating multiple models or training on multiple datasets, but these methods are costly. Motivated by an optimisation strategy that exploits spatial information on the perturbed paths and samples, we propose a Dual Spatial Momentum Iterative Fast Gradient Sign Method (DS-MI-FGSM) to improve the transferability of black-box attacks against SRSs. Specifically, DS-MI-FGSM needs only a single data sample and one model as input; by extending to the neighbouring spaces of the data and the model, it generates adversarial examples against the integrated models. To reduce the risk of overfitting, DS-MI-FGSM also introduces gradient masking to improve transferability. We conduct extensive experiments on the speaker recognition task, and the results demonstrate the effectiveness of the method, which achieves up to a 92% attack success rate on the victim model in black-box scenarios with only one known model.
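As its name indicates, DS-MI-FGSM extends the momentum iterative fast gradient sign method (MI-FGSM). In that method, a momentum term accumulates L1-normalised gradients, and each step moves the input by the sign of the momentum, clipped to an epsilon-ball around the original sample. The sketch below shows only that baseline update on a toy linear loss. It does not implement the dual spatial neighbourhood sampling or the gradient masking described above. The function `mi_fgsm`, the step size `eps / steps` and the decay `mu = 1.0` are illustrative assumptions.

```python
from typing import Callable, List

Vector = List[float]


def sign(v: float) -> int:
    """Return -1, 0 or +1 according to the sign of v."""
    return (v > 0) - (v < 0)


def mi_fgsm(x0: Vector, grad_fn: Callable[[Vector], Vector],
            eps: float, steps: int, mu: float = 1.0) -> Vector:
    """Baseline MI-FGSM only (not DS-MI-FGSM): ascend a loss whose gradient is grad_fn."""
    alpha = eps / steps                      # assumed per-step size: steps * alpha == eps
    x = list(x0)
    g = [0.0] * len(x0)                      # accumulated momentum
    for _ in range(steps):
        grad = grad_fn(x)
        l1 = sum(abs(v) for v in grad) or 1.0    # L1 norm; guard against an all-zero gradient
        g = [mu * gi + gr / l1 for gi, gr in zip(g, grad)]
        x = [xi + alpha * sign(gi) for xi, gi in zip(x, g)]
        # project back into the L-infinity ball of radius eps around the clean input
        x = [min(max(xi, ci - eps), ci + eps) for xi, ci in zip(x, x0)]
    return x


# Toy "model" (hypothetical): loss(x) = w . x, so its gradient is w everywhere.
w = [0.5, -2.0, 0.0]
x_adv = mi_fgsm([0.0, 0.0, 0.0], lambda x: w, eps=0.1, steps=10)
print([round(v, 6) for v in x_adv])   # [0.1, -0.1, 0.0]
```

In a real attack, `grad_fn` would return the gradient of the speaker model's loss with respect to the waveform. The toy linear loss keeps the example verifiable by hand.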
With the development of globalization, the use of English is no longer restricted to native speakers (NSs) but has also spread widely to non-native speakers (NNSs). The importance of English learning is also acknowledged in the Expanding and Outer Circles, and English as a foreign language (EFL) education plays a significant role in China's education. Although non-native English-speaking teachers (NNESTs) make up a large proportion of English teachers, English language teaching (ELT) is still greatly influenced by native-speakerism. This research investigates the language ideologies reflected in Chinese foreign language education policy (FLEP) at the higher education level, and Chinese English learners' attitudes towards native-speakerism and English teachers. A mixed method of policy analysis and survey is adopted. The analysis of two FLEPs at the higher education level finds that linguistic instrumentalism is the prominent language ideology, although native-speakerism and standard English ideology are implicitly demonstrated. A questionnaire was used to investigate the attitudes of 58 Chinese English learners, and it revealed that most participants show no bias towards either NESTs or NNESTs. Instead, participants identify the strengths and weaknesses of both NESTs and NNESTs, though they adhere to native-speakerism in terms of English variety. Overall, English learners' attitudes are consistent with the language ideologies in the FLEPs. This research may provide implications for future studies on addressing native-speakerism in Chinese FLEPs, and on the relationship between students' attitudes and language policies.
A novel emotional speaker recognition system (ESRS) is proposed to compensate for emotion variability. First, emotion recognition is adopted as a pre-processing step to classify speech as neutral or emotional. Then, the recognized emotional speech is adjusted by prosody modification. Several methods, including Gaussian normalization, the Gaussian mixture model (GMM) and support vector regression (SVR), are adopted to define the F0 mapping rules between emotional and neutral speech, and the average linear ratio is used for duration modification. Finally, the modified emotional speech is used for speaker recognition. The experimental results show that the proposed ESRS significantly improves the performance of emotional speaker recognition, with an identification rate (IR) higher than that of the traditional recognition system. After F0 and duration modification, the emotional speech is closer to neutral speech.
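Of the three F0 mapping rules mentioned, Gaussian normalization is the simplest. Each voiced emotional F0 value is standardised with the emotional statistics and rescaled with the neutral statistics. The sketch below assumes linear-frequency F0 values, with 0 marking unvoiced frames. It also interprets the "average linear ratio" for duration as mean neutral duration divided by mean emotional duration. Both are assumptions, and the GMM and SVR mappings are not shown.

```python
from statistics import mean
from typing import List


def gaussian_f0_mapping(f0_emotional: List[float], mu_emo: float, sigma_emo: float,
                        mu_neu: float, sigma_neu: float) -> List[float]:
    """Map emotional F0 onto the neutral F0 distribution.

    Assumption: linear-frequency F0 in Hz, with 0.0 marking unvoiced frames (left as 0.0).
    """
    return [(f - mu_emo) / sigma_emo * sigma_neu + mu_neu if f > 0 else 0.0
            for f in f0_emotional]


def duration_ratio(neutral_durations: List[float], emotional_durations: List[float]) -> float:
    """Assumed reading of the 'average linear ratio': mean neutral / mean emotional duration."""
    return mean(neutral_durations) / mean(emotional_durations)


mapped = gaussian_f0_mapping([300.0, 0.0, 250.0], mu_emo=250.0, sigma_emo=50.0,
                             mu_neu=200.0, sigma_neu=25.0)
print(mapped)                                     # [225.0, 0.0, 200.0]
print(duration_ratio([1.0, 2.0], [2.0, 4.0]))     # 0.5
```

Multiplying each emotional segment's duration by this ratio would stretch or compress it towards neutral timing.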
This paper argues that in the age of 'World Englishes', it is not necessary to differentiate native speaker teachers from non-native speaker teachers. It concludes that non-native speaker teachers can be as effective as their native colleagues and have an equal chance of professional success, even though native speaker teachers have great advantages over non-native teachers in some respects. It is time for employers, as well as ELT professionals, to look past the glaring differences between native and non-native speaker teachers and make the best use of such unique resources.
The target of much language teaching and learning is to make students approximate native speakers, and it is widely held that the only rightful speakers of a language are its native speakers. Contrary to these contemporary views, this paper argues that the obligation of the language teacher is to help students use the L2 effectively, not simply to imitate native speakers. A successful L2 user from the group of L2 learners can be a model for students. Therefore, non-native teachers with a high degree of language proficiency and good teaching skills can be ideal and qualified language teachers.
This study examined NNSs' ability to modify their interlanguage utterances through modified comprehensible output in response to other-initiation and self-initiation, in both NS-NNS and NNS-NNS interactions. It was a qualitative study that used two different tasks, a picture-dictation task and an opinion-exchange task, to collect data. There were 32 participants aged 22 to 37. The author proposed two hypotheses, based on the expectation that NNS-NNS interactions would give NNS participants more opportunities than NS-NNS interactions to produce comprehensible output in response to other-initiated clarification requests and in self-initiated clarification attempts. The author made good use of numbers to illustrate and describe the data.
An important concern for the deaf community is the partial or total inability to hear. This may affect language development during childhood, which limits their everyday life. To support deaf speakers through assistive mechanisms, this work seeks to understand their acoustic characteristics by evaluating region-specific utterances. Speech signals were acquired from 32 normal-hearing and 32 deaf speakers uttering ten words in Tamil, a native language of India. Speech parameters such as pitch, formants, signal-to-noise ratio, energy, intensity, jitter and shimmer are analyzed. The results show that the acoustic characteristics of deaf speakers differ significantly from those of normal-hearing speakers, and their quantitative measures dominate those of normal-hearing speakers for the words considered. The study also reveals that the informative part of speech in normal-hearing and deaf speakers may be identified using the acoustic features. In addition, these attributes may be used for differential correction of deaf speakers' speech signals, helping listeners understand the conveyed information.
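Jitter and shimmer, two of the parameters listed above, are commonly computed as local (cycle-to-cycle) measures. Each is the mean absolute difference between consecutive pitch periods (jitter) or peak amplitudes (shimmer), divided by their mean. The sketch below assumes the periods and amplitudes have already been extracted from the signal. This local variant is an assumption; the study may use a different definition.

```python
from typing import List


def local_perturbation(values: List[float]) -> float:
    """Mean absolute difference of consecutive values divided by the mean value."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(a - b) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


def jitter(periods_s: List[float]) -> float:
    """Assumed local jitter from pitch periods in seconds, as a fraction."""
    return local_perturbation(periods_s)


def shimmer(peak_amplitudes: List[float]) -> float:
    """Assumed local shimmer from per-cycle peak amplitudes, as a fraction."""
    return local_perturbation(peak_amplitudes)


print(round(jitter([0.010, 0.012, 0.010]), 4))      # 0.1875
print(round(shimmer([1.0, 0.9, 1.0, 0.9]), 4))      # 0.1053
```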
As more and more non-native-speaker English teachers teach alongside native-speaker English teachers, in China and in other non-English-speaking countries, research on the differences between native-speaker and non-native-speaker English teachers is necessary. This paper offers an overview of the differences between the two groups of English teachers in terms of their strengths and weaknesses, teaching styles and approaches. It concludes that cooperation and communication should be emphasised, and that the two groups of teachers should communicate more and exchange ideas on how to teach the same group of students more effectively.
In an audio stream containing multiple speakers, speaker diarization helps determine "who spoke when". This is an unsupervised task, as there is no prior information about the speakers. It labels the speech signal according to speaker identity; that is, the input audio stream is partitioned into homogeneous segments. In this work, we present a novel speaker diarization system. It uses the Tangent weighted Mel frequency cepstral coefficient (TMFCC) as the feature parameter, and the Lion algorithm clusters the voice-activity-detected audio streams into speaker groups. Both main tasks of speaker indexing, speaker segmentation and speaker clustering, are thereby improved. The TMFCC makes more effective use of both low-energy and high-energy frames, which improves the performance of the proposed system. The system is evaluated on audio signals from the ELSDSR corpus containing three, four and five speakers. Tracking distance and tracking time serve as the evaluation metrics. The experimental results show that the system with TMFCC parameterization and Lion-based clustering outperforms existing diarization systems, reaching 95% tracking accuracy.
This paper reports part of the findings of a large-scale study exploring the viewpoints of Chinese ELT stakeholders (students, teachers and administrators) on native speakerism, in order to find out whether current EFL education in China is still affected by this chauvinistic ideology. Analysis of the data through a critical lens reveals two patterns. The vast majority of participants conferred upon NS products (teachers, language, culture and teaching methodology) a status superior to that granted to their NNS counterparts. They also failed to see linguacultural and epistemological inequalities between the English-speaking West and traditional NNS countries, inter alia China. These findings suggest that the three participant groups as a whole succumb to native speakerism, and by extension that ELT in China is still haunted to a great degree by this ideology. Given that this study treats each participant group separately, future studies are expected to explore inter-group interactions in ideology.
A transformation matrix linear interpolation (TMLI) approach for speaker adaptation is proposed. TMLI uses the transformation matrices produced by MLLR from selected training speakers and the test speaker. With only 3 adaptation sentences, it yields a 12.12% word error rate reduction. As the number of adaptation sentences increases, however, the performance saturates quickly. To improve the behavior of TMLI for large amounts of adaptation data, the TMLI+MAP method, which combines TMLI with the MAP technique, is proposed. Experimental results show that TMLI+MAP achieves better recognition accuracy than MAP and MLLR+MAP for both small and large amounts of adaptation data.
Key words: speech recognition; speaker adaptation; MLLR; MAP; maximum likelihood model interpolation (MLMI)
CLC number: TN 912.34
Foundation item: Supported by the Science and Technology Committee of Shanghai (01JC14033)
Biography: XU Xiang-hua (1977-), female, Ph.D. candidate; research direction: large vocabulary continuous Mandarin speech recognition and speaker adaptation
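In MLLR, a speaker transform W of size n x (n+1) adapts a Gaussian mean mu as W [1, mu]. One reading of TMLI is that the test speaker's transform is a weighted sum of the selected speakers' transforms. The sketch below assumes the interpolation weights are already known; TMLI would estimate them from the adaptation data. As a hypothetical illustration of TMLI+MAP, it then applies a standard MAP mean update that uses the interpolated mean as the prior. The function names, weights and prior weight `tau` are illustrative.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def interpolate_transforms(transforms: Sequence[Matrix], weights: Sequence[float]) -> Matrix:
    """Weighted sum of equally sized MLLR transforms (weights assumed given, not estimated)."""
    rows, cols = len(transforms[0]), len(transforms[0][0])
    return [[sum(w * t[r][c] for w, t in zip(weights, transforms)) for c in range(cols)]
            for r in range(rows)]


def mllr_adapt_mean(w: Matrix, mean: Sequence[float]) -> List[float]:
    """Apply an n x (n+1) MLLR transform via the extended mean vector [1, mean]."""
    ext = [1.0, *mean]
    return [sum(a * e for a, e in zip(row, ext)) for row in w]


def map_adapt_mean(prior: Sequence[float], frames: Sequence[Sequence[float]],
                   occupancies: Sequence[float], tau: float) -> List[float]:
    """MAP mean update: (tau * prior + sum_t gamma_t x_t) / (tau + sum_t gamma_t)."""
    total = tau + sum(occupancies)
    return [(tau * p + sum(g * x[d] for g, x in zip(occupancies, frames))) / total
            for d, p in enumerate(prior)]


# Hypothetical 2-D transforms: identity without bias, and identity with a bias of +1.
w_a = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
w_b = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
w_test = interpolate_transforms([w_a, w_b], [0.5, 0.5])
adapted = mllr_adapt_mean(w_test, [2.0, 3.0])
print(adapted)                                                                # [2.5, 3.5]
print(map_adapt_mean(adapted, [[3.0, 4.0], [3.0, 4.0]], [1.0, 1.0], tau=2.0))  # [2.75, 3.75]
```

As more adaptation frames accumulate, the MAP term outweighs the prior. This is consistent with the motivation above for adding MAP when adaptation data is plentiful.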
Automatic speaker recognition (ASR) systems belong to the field of human-machine interaction, and researchers have used feature extraction and feature matching methods to analyze and synthesize speech signals. One of the most commonly used feature extraction methods is Mel Frequency Cepstral Coefficients (MFCCs). Recent research shows that MFCCs process voice signals with high accuracy. MFCCs represent a sequence of voice-signal-specific features. This experimental analysis aims to distinguish Turkish speakers by extracting MFCCs from speech recordings. Human perception of sound is not linear. Therefore, after the filterbank step of the MFCC method, we converted the resulting log filterbanks into decibel (dB) feature-based spectrograms without applying the Discrete Cosine Transform (DCT). A new dataset was created by converting each spectrogram into a 2-D array. Several learning algorithms were applied with 10-fold cross-validation to identify the speaker. The highest accuracy, 90.2%, was achieved by a Multi-layer Perceptron (MLP) with the tanh activation function. The most important output of this study is the inclusion of the human voice as a new feature set.
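The pipeline described above stops the MFCC computation after the mel filterbank and expresses the filterbank energies in decibels instead of applying the DCT. The sketch below computes one such dB frame from a precomputed power spectrum. Stacking frames over time gives the 2-D array used as the learning input. The mel formula 2595 * log10(1 + f / 700), the triangular filter shape and the energy floor are common conventions assumed here, not details taken from the paper.

```python
import math
from typing import List, Sequence


def hz_to_mel(f: float) -> float:
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m: float) -> float:
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank_db(power_spectrum: Sequence[float], sample_rate: float,
                      n_filters: int, floor: float = 1e-10) -> List[float]:
    """Triangular mel filterbank energies of one frame in dB, with no DCT applied.

    power_spectrum holds the n_fft // 2 + 1 non-negative bins from 0 Hz to Nyquist.
    """
    n_bins = len(power_spectrum)
    nyquist = sample_rate / 2.0
    top_mel = hz_to_mel(nyquist)
    # n_filters + 2 equally spaced mel points give every filter's edges and centre
    edges = [mel_to_hz(i * top_mel / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_freqs = [k * nyquist / (n_bins - 1) for k in range(n_bins)]
    frame_db = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        energy = 0.0
        for f, p in zip(bin_freqs, power_spectrum):
            if lo < f <= mid:
                energy += p * (f - lo) / (mid - lo)      # rising edge of the triangle
            elif mid < f < hi:
                energy += p * (hi - f) / (hi - mid)      # falling edge of the triangle
        # assumed floor avoids log10(0) on silent frames
        frame_db.append(10.0 * math.log10(max(energy, floor)))
    return frame_db


frame = mel_filterbank_db([1.0] * 257, sample_rate=16000, n_filters=26)
print(len(frame))   # 26
```

Leaving out the DCT keeps one value per mel channel. Each frame therefore becomes a column of an image-like spectrogram rather than a vector of decorrelated cepstral coefficients.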
This paper reports the accuracy and run time of a text-independent automatic speaker recognition (ASR) system based on Mel-Frequency Cepstrum Coefficients (MFCC) and Gaussian Mixture Models (GMM), developed for a security access-control gate. A total of 450 speakers were randomly selected from the Voxforge.org audio database. Their utterances were enhanced using spectral subtraction, MFCCs were then extracted, and these coefficients were statistically modeled with GMMs to build each speaker's profile. Two different speech files were used for each speaker: the first to build the profile database, the second to test system performance. The proposed approach achieves an accuracy greater than 96%. A single test run, implemented in Matlab, takes about 2 seconds on a common PC.
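In a GMM-based system like the one described, each enrolled speaker is represented by a mixture of Gaussians over MFCC frames. A test utterance is assigned to the speaker whose model gives the highest average log-likelihood. The sketch below shows only this scoring step, with diagonal covariances. Model training (typically EM) and the spectral-subtraction front end are omitted, and the one-dimensional toy models are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

# A diagonal-covariance GMM: list of (weight, means, variances) components.
Gmm = List[Tuple[float, List[float], List[float]]]


def log_gaussian_diag(x: Sequence[float], means: Sequence[float],
                      variances: Sequence[float]) -> float:
    """Log density of x under a Gaussian with diagonal covariance."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, means, variances))


def gmm_log_likelihood(x: Sequence[float], gmm: Gmm) -> float:
    """log sum_k w_k N(x; mu_k, Sigma_k), computed with the log-sum-exp trick."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def identify(frames: Sequence[Sequence[float]], models: Dict[str, Gmm]) -> str:
    """Return the speaker whose GMM gives the highest average frame log-likelihood."""
    def score(gmm: Gmm) -> float:
        return sum(gmm_log_likelihood(f, gmm) for f in frames) / len(frames)
    return max(models, key=lambda name: score(models[name]))


# Hypothetical toy profiles over 1-D "features" (real systems use MFCC vectors).
models = {
    "speaker_a": [(1.0, [0.0], [1.0])],
    "speaker_b": [(0.5, [5.0], [1.0]), (0.5, [6.0], [1.0])],
}
print(identify([[5.2], [5.8], [4.9]], models))   # speaker_b
```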
In this paper, a manifold subspace learning algorithm based on locality preserving discriminant projection (LPDP) is used for speaker verification. LPDP can overcome the deficiencies of total variability factor analysis and locality preserving projection (LPP), and it makes effective use of the speaker label information of speech data. Through optimization, LPDP preserves the inherent local manifold structure of speech samples from the same speaker by reducing the distances between them. At the same time, LPDP enhances the discriminability of the embedding space by increasing the distances between speech samples from different speakers. The proposed method is compared with LPP and total variability factor analysis on the NIST SRE 2010 telephone-telephone core condition. The experimental results indicate that LPDP overcomes the deficiencies of LPP and total variability factor analysis and further improves system performance.
Although most non-native speaker English teachers teach alongside NS teachers, little research has examined the role of native speaker English teachers in China's teaching context or university students' attitudes towards them. This essay discusses the implications of cultural differences for the language classroom, and the different cultures of learning with regard to language teaching and learning in China and the West. It concludes that a good sense of cultural awareness and an open mind towards cultural interaction are of great importance, benefiting both language learners and native speaker teachers in the cross-cultural classroom.
Public speaking is a part of communication. Good public speaking can convey clear, persuasive ideas or opinions and can also become an effective bridge between the audience and the speaker. This paper discusses public speaking skills, covering several aspects that speakers should attend to.
This paper presents a speaker-adaptable, very low bit rate speech coder based on HMMs (Hidden Markov Models). The coder includes the dynamic features of speech, i.e., delta and delta-delta parameters. Its performance is improved by using dynamic features generated by an algorithm for speech parameter generation from HMMs. The generated speech parameter vectors reflect not only the means of the static and dynamic feature vectors but also their covariances. The encoder is equivalent to an HMM-based phoneme recognizer. It transmits phoneme indexes, state durations, pitch information and speaker adaptation vectors to the decoder. The decoder receives these messages and concatenates the phoneme HMM sequence according to the phoneme indexes. It then generates a sequence of mel-cepstral coefficient vectors using the HMM-based speech parameter generation technique. Finally, the decoder synthesizes speech by directly exciting the MLSA (Mel Log Spectrum Approximation) filter with the generated mel-cepstral coefficient vectors, according to the pitch information.
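The speech parameter generation step mentioned above chooses the static trajectory c that maximises the likelihood of the stacked static-plus-delta observation o = W c. With diagonal variances U, this reduces to solving (W^T U^-1 W) c = W^T U^-1 mu. The sketch below is a minimal one-dimensional version under several assumptions: a single delta window delta_t = (c[t+1] - c[t-1]) / 2 with edge frames replicated, diagonal variances, and no delta-delta term. It is not the coder's actual implementation.

```python
from typing import List


def window_matrix(num_frames: int) -> List[List[float]]:
    """Rows alternate static and delta windows: o = W c, o = [c_1, d_1, ..., c_T, d_T]."""
    w = [[0.0] * num_frames for _ in range(2 * num_frames)]
    for t in range(num_frames):
        w[2 * t][t] = 1.0
        # assumed delta window 0.5 * (c[t+1] - c[t-1]), replicating the edge frames
        w[2 * t + 1][min(t + 1, num_frames - 1)] += 0.5
        w[2 * t + 1][max(t - 1, 0)] -= 0.5
    return w


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def generate_parameters(means: List[float], variances: List[float]) -> List[float]:
    """ML static trajectory from stacked per-frame [static, delta] means and variances."""
    num_frames = len(means) // 2
    w = window_matrix(num_frames)
    precision = [1.0 / v for v in variances]
    rows = range(2 * num_frames)
    a = [[sum(w[k][i] * precision[k] * w[k][j] for k in rows) for j in range(num_frames)]
         for i in range(num_frames)]
    b = [sum(w[k][i] * precision[k] * means[k] for k in rows) for i in range(num_frames)]
    return solve(a, b)


# Static means 0, 1, 2 with consistent delta means: the ramp is recovered.
ramp = generate_parameters([0.0, 0.5, 1.0, 1.0, 2.0, 0.5], [1.0] * 6)
print(ramp)   # approximately [0.0, 1.0, 2.0]
```

When static and delta means disagree, the solution trades them off according to the variances. This is how the delta statistics smooth the generated trajectory.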
This paper proposes a new phase feature derived from formant instantaneous characteristics for speech recognition (SR) and speaker identification (SI) systems. Using the Hilbert transform (HT), formant characteristics can be represented by instantaneous frequency (IF) and instantaneous bandwidth, together called formant instantaneous characteristics (FIC). To explore the importance of FIC in both SR and SI, this paper derives different features from FIC for SR and SI systems. When these new features are combined with conventional parameters, a higher identification rate is achieved than with Mel-frequency cepstral coefficient (MFCC) parameters alone. The experimental results show that the new features are effective characteristic parameters and can complement conventional parameters for SR and SI.
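Instantaneous frequency via the Hilbert transform can be illustrated in two steps. First, form the analytic signal in the frequency domain by zeroing the negative-frequency bins and doubling the positive ones. Then measure the phase advance between consecutive samples. The sketch uses a naive O(N^2) DFT to stay self-contained and analyses a whole test tone. The paper's formant-specific processing (for example, band-pass filtering around each formant before the transform) and the instantaneous bandwidth are not reproduced.

```python
import cmath
import math
from typing import List


def dft(x: List[complex], inverse: bool = False) -> List[complex]:
    """Naive discrete Fourier transform (O(N^2)); the inverse includes the 1/N scaling."""
    n = len(x)
    direction = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(direction * 2j * math.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def analytic_signal(x: List[float]) -> List[complex]:
    """x + j*H{x}: keep DC (and Nyquist for even N), double positive bins, zero negative ones."""
    n = len(x)
    spectrum = dft([complex(v) for v in x])
    h = [0.0] * n
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        for k in range(1, n // 2):
            h[k] = 2.0
    else:
        for k in range(1, (n + 1) // 2):
            h[k] = 2.0
    return dft([s * g for s, g in zip(spectrum, h)], inverse=True)


def instantaneous_frequency(x: List[float], sample_rate: float) -> List[float]:
    """IF in Hz from the phase advance between consecutive analytic-signal samples.

    Assumption: whole-signal analysis, with no per-formant band-pass filtering.
    """
    z = analytic_signal(x)
    return [cmath.phase(b * a.conjugate()) * sample_rate / (2 * math.pi)
            for a, b in zip(z, z[1:])]


fs = 64.0
tone = [math.cos(2 * math.pi * 8.0 * n / fs) for n in range(64)]
freqs = instantaneous_frequency(tone, fs)
print(round(min(freqs), 6), round(max(freqs), 6))   # 8.0 8.0
```

Taking the phase of z[n+1] * conj(z[n]) avoids explicit phase unwrapping, provided the true frequency stays below Nyquist.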
This paper discusses the application of fractal dimensions to speech processing. Generalized dimensions of arbitrary order and associated fractal parameters are used for speaker identification. A characteristic vector based on these parameters is formed, and a recognition criterion is defined to identify individual speakers. Experimental results show the usefulness of fractal dimensions in characterizing speaker identity.
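Generalized (Renyi) dimensions can be estimated by box counting. Cover the data with boxes of size eps and compute the box probabilities p_i. Then take the slope of log(sum p_i^q) / (q - 1) against log(eps), using sum p_i log p_i for q = 1. The sketch below assumes the speech samples have already been normalised to [0, 1) and treats them as a one-dimensional point set. The abstract does not specify the paper's embedding, box sizes or characteristic vector, so `dimension_vector` is a hypothetical stand-in.

```python
import math
from collections import Counter
from typing import List, Sequence


def renyi_sum(points: Sequence[float], eps: float, q: float) -> float:
    """log(sum p_i^q) / (q - 1), or sum p_i log p_i when q == 1, for boxes of size eps."""
    counts = Counter(int(x / eps) for x in points)   # assumption: points lie in [0, 1)
    probs = [c / len(points) for c in counts.values()]
    if q == 1:
        return sum(p * math.log(p) for p in probs)
    return math.log(sum(p ** q for p in probs)) / (q - 1)


def generalized_dimension(points: Sequence[float], q: float,
                          box_sizes: Sequence[float]) -> float:
    """Least-squares slope of renyi_sum against log(eps) over the given box sizes."""
    xs = [math.log(e) for e in box_sizes]
    ys = [renyi_sum(points, e, q) for e in box_sizes]
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


def dimension_vector(points: Sequence[float], orders: Sequence[float],
                     box_sizes: Sequence[float]) -> List[float]:
    """Hypothetical characteristic vector: D_q for several orders q."""
    return [generalized_dimension(points, q, box_sizes) for q in orders]


# Evenly spread points fill the interval, so every D_q should be close to 1.
uniform = [i / 1024 for i in range(1024)]
sizes = [2.0 ** -k for k in range(2, 7)]
print([round(d, 6) for d in dimension_vector(uniform, [0, 1, 2], sizes)])   # [1.0, 1.0, 1.0]
```

For a multifractal measure, D_q decreases as q grows. The spread of D_q across orders is therefore what such a vector could use to characterise a speaker.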
Funding (DS-MI-FGSM paper): The Major Key Project of PCL, Grant/Award Number: PCL2022A03; National Natural Science Foundation of China, Grant/Award Numbers: 61976064, 62372137; Zhejiang Provincial Natural Science Foundation of China, Grant/Award Number: LZ22F020007.
Funding (emotional speaker recognition paper): The National Natural Science Foundation of China (No. 60872073, 60975017, 51075068); the Natural Science Foundation of Guangdong Province (No. 10252800001000001); the Natural Science Foundation of Jiangsu Province (No. BK2010546).
文摘A novel emotional speaker recognition system (ESRS) is proposed to compensate for emotion variability. First, the emotion recognition is adopted as a pre-processing part to classify the neutral and emotional speech. Then, the recognized emotion speech is adjusted by prosody modification. Different methods including Gaussian normalization, the Gaussian mixture model (GMM) and support vector regression (SVR) are adopted to define the mapping rules of F0s between emotional and neutral speech, and the average linear ratio is used for the duration modification. Finally, the modified emotional speech is employed for the speaker recognition. The experimental results show that the proposed ESRS can significantly improve the performance of emotional speaker recognition, and the identification rate (IR) is higher than that of the traditional recognition system. The emotional speech with F0 and duration modifications is closer to the neutral one.
文摘This paper attempts to argue that in the age of‘World Englishes', it is not necessary to differentiate native speaker teachers from non-native speaker teachers. It is concluded that non-native speaker teachers can be as effective as their native colleagues and they have equal chance to achieve professional success, even though native speaker teachers have great advantages over non-native teachers in some aspects. It is time for employers, as well as ELT professionals to shut their eyes to the glaring differences between native speaker teachers and non-native speaker teachers and optimize such unique resources.
文摘The target of much language teaching and learning is to make students approximate to native speakers.The only rightful speak ers of a language are its native speakers.Contrary to these contemporary views,however,this paper argues that the obligation of the lan guage teacher is to help students to use L2 effectively not to simply imitate native speaker.A successful L2 user who comes from the group of L2 learners can be a model for students.Therefore,non-native teachers with a high degree of language proficiency and good teaching skills can be ideal and qualified language teachers.
文摘This study examined the NNSs' ability of modifying their interlanguage utterances in modified comprehensible output to give response to other-initiation and self-initiation,which was studied in both NS-NNS and NNS-NNS interactions.It was the qualitative study by using two different tasks which were picture-dictation task and opinion-exchange task to collect the data.There were 32 participants whose age ranged of 22 to 37.The author proposed two hypotheses based on his expectation that NNS-NNS interactions would provide more opportunities for NNS participants to give comprehensible output for other-initiated clarification requests and self-initiated clarification attempts than NS-NNS interactions.The author was good at using numbers to illustrate and describe the data in his writing.
文摘An important concern with the deaf community is inability to hear partially or totally. This may affect the development of language during childhood, which limits their habitual existence. Consequently to facilitate such deaf speakers through certain assistive mechanism, an effort has been taken to understand the acoustic characteristics of deaf speakers by evaluating the territory specific utterances. Speech signals are acquired from 32 normal and 32 deaf speakers by uttering ten Indian native Tamil language words. The speech parameters like pitch, formants, signal-to-noise ratio, energy, intensity, jitter and shimmer are analyzed. From the results, it has been observed that the acoustic characteristics of deaf speakers differ significantly and their quantitative measure dominates the normal speakers for the words considered. The study also reveals that the informative part of speech in a normal and deaf speakers may be identified using the acoustic features. In addition, these attributes may be used for differential corrections of deaf speaker’s speech signal and facilitate listeners to understand the conveyed information.
文摘As much more non-native-speaker English teachers teach alongside native-speaker English teachers, either in China or any other non-English-speaking country, research on the differences between native-speaker English teacher and non-native-speaker English teacher is necessary. This paper offers an overview of such difference between the two groups of English teachers in terms of their strengths and weaknesses, teaching styles and approaches. The conclusion suggests that cooperation and communication be emphsised and that the two groups of teachers communicate more and exchange their ideas on how to teach the same group of students more effectively.
文摘In audio stream containing multiple speakers, speaker diarization aids in ascertaining "who speak when". This is an unsupervised task as there is no prior information about the speakers. It labels the speech signal conforming to the identity of the speaker, namely, input audio stream is partitioned into homogeneous segments. In this work, we present a novel speaker diarization system using the Tangent weighted Mel frequency cepstral coefficient(TMFCC) as the feature parameter and Lion algorithm for the clustering of the voice activity detected audio streams into particular speaker groups. Thus the two main tasks of the speaker indexing, i.e., speaker segmentation and speaker clustering, are improved. The TMFCC makes use of the low energy frame as well as the high energy frame with more effect, improving the performance of the proposed system. The experiments using the audio signal from the ELSDSR corpus datasets having three speakers, four speakers and five speakers are analyzed for the proposed system. The evaluation of the proposed speaker diarization system based on the tracking distance, tracking time as the evaluation metrics is done and the experimental results show that the speaker diarization system with the TMFCC parameterization and Lion based clustering is found to be superior over existing diarization systems with 95% tracking accuracy.
文摘This paper reports on part of the findings of a large-scale study exploring the viewpoints of Chinese ELT stakeholders(students,teachers and administrators)on native speakerism in order to find out whether current EFL education in China is still affected by this chauvinistic ideology.The analysis of data via a critical lens reveals that the vast majority of the participants conferred upon NS products(teacher,language,culture and teaching methodology)a status superior to that granted to the NNS counterparts and failed to see linguacultural and epistemological inequalities between the English speaking West and traditional NNS countries,inter alia,China.These findings suggest that the three participant groups as an entirety succumb to native speakerism,and by extension that ELT in China is still haunted to a great degree by this ideology.Given that this study treats each participant group separately,future studies are expected to explore inter-group interactions in ideology.
Abstract: A transformation matrix linear interpolation (TMLI) approach to speaker adaptation is proposed. TMLI uses the transformation matrices produced by MLLR for selected training speakers and for the test speaker. With only 3 adaptation sentences, it achieves a 12.12% word error rate reduction; however, as the number of adaptation sentences increases, the performance saturates quickly. To improve the behavior of TMLI with large amounts of adaptation data, TMLI+MAP, which combines TMLI with the MAP technique, is proposed. Experimental results show that TMLI+MAP achieves better recognition accuracy than MAP and MLLR+MAP for both small and large amounts of adaptation data. Key words: speech recognition; speaker adaptation; MLLR; MAP; maximum likelihood model interpolation (MLMI). CLC number: TN912.34. Foundation item: Supported by the Science and Technology Committee of Shanghai (01JC14033). Biography: XU Xiang-hua (1977-), female, Ph.D. candidate; research direction: large vocabulary continuous Mandarin speech recognition and speaker adaptation.
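A minimal NumPy sketch of the two ingredients, under stated assumptions. An MLLR mean transform is written as W = [b | A] acting on the extended mean [1, mu], so the adapted mean is A mu + b. TMLI is sketched as a weighted sum of such matrices with the interpolation weights simply given; the abstract does not say how they are estimated (a maximum-likelihood estimate, as in MLMI, is one possibility). For TMLI+MAP we assume the TMLI-adapted mean serves as the prior mean of a standard MAP mean update. The function names and the prior weight `tau` are hypothetical.

```python
import numpy as np

def apply_mllr(W, mu):
    """MLLR mean transform: W = [b | A] of shape (D, D+1), returns A @ mu + b."""
    xi = np.concatenate(([1.0], mu))  # extended mean vector [1, mu]
    return W @ xi

def interpolate_transforms(transforms, weights):
    """TMLI-style combination W = sum_i lambda_i * W_i (weights assumed given)."""
    weights = np.asarray(weights, float)
    return np.tensordot(weights, np.stack(transforms), axes=1)

def map_mean(prior_mean, frames, gammas, tau=10.0):
    """Standard MAP mean update: (tau * mu0 + sum_t gamma_t x_t) / (tau + sum_t gamma_t).

    prior_mean: (D,) prior (here assumed to be the TMLI-adapted mean)
    frames: (T, D) adaptation frames; gammas: (T,) Gaussian occupation probabilities
    """
    gammas = np.asarray(gammas, float)
    num = tau * np.asarray(prior_mean, float) + gammas @ np.asarray(frames, float)
    return num / (tau + gammas.sum())
```

With little adaptation data the MAP estimate stays close to the TMLI prior, and as the occupation counts grow it moves toward the data mean; this is the usual MAP behavior and one plausible reading of why the combination helps with larger amounts of data.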
Funding: This work was supported by the GRRC program of Gyeonggi province [GRRC-Gachon2020(B04), Development of AI-based Healthcare Devices].
Abstract: Automatic speaker recognition (ASR) systems belong to the field of human-machine interaction, and scientists have been using feature extraction and feature matching methods to analyze and synthesize these signals. One of the most commonly used feature extraction methods is Mel Frequency Cepstral Coefficients (MFCCs). Recent research shows that MFCCs are successful in processing voice signals with high accuracy. MFCCs represent a sequence of voice-signal-specific features. This experimental analysis aims to distinguish Turkish speakers by extracting MFCCs from speech recordings. Since human perception of sound is not linear, after the filterbank step of the MFCC method we converted the resulting log filterbanks into decibel (dB) feature-based spectrograms without applying the Discrete Cosine Transform (DCT). A new dataset was created by converting each spectrogram into a 2-D array. Several learning algorithms were implemented with 10-fold cross-validation to identify the speaker. The highest accuracy, 90.2%, was achieved by a Multi-layer Perceptron (MLP) with the tanh activation function. The most important outcome of this study is the inclusion of the human voice as a new feature set.
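The feature pipeline described above (framing, mel filterbank, log, dB, no DCT) can be sketched in NumPy as follows. The frame length, hop, FFT size, number of mel filters, pre-emphasis coefficient and the HTK-style mel formula are assumptions, since the abstract does not give them, and the function names are hypothetical. The result is a frames-by-filters dB matrix, i.e., a 2-D array of the kind the abstract describes feeding to classifiers such as an MLP.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters equally spaced on the mel scale, shape (n_filters, n_fft//2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):          # rising edge
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling edge
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel_spectrogram_db(signal, sr=16000, frame_len=400, hop=160,
                           n_fft=512, n_filters=40):
    """Mel filterbank energies in dB per frame; the DCT step of MFCC is deliberately skipped."""
    x = np.asarray(signal, float)
    x = np.append(x[0], x[1:] - 0.97 * x[:-1])          # pre-emphasis
    if len(x) < frame_len:
        x = np.pad(x, (0, frame_len - len(x)))
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft
    energies = power @ mel_filterbank(n_filters, n_fft, sr).T
    return 10.0 * np.log10(np.maximum(energies, 1e-10))  # dB, floored to avoid log(0)
```

Applying a DCT to each row of this matrix would give conventional MFCCs; leaving it out keeps the spectro-temporal layout that a 2-D classifier input can exploit.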
Abstract: The aim of this paper is to report the accuracy and timing results of a text-independent automatic speaker recognition (ASR) system based on Mel-Frequency Cepstrum Coefficients (MFCC) and Gaussian Mixture Models (GMM), developed for a security access-control gate. A total of 450 speakers were randomly selected from the Voxforge.org audio database; their utterances were enhanced using spectral subtraction, MFCCs were then extracted, and these coefficients were statistically modeled by a GMM to build each speaker's profile. Two different speech files were used for each speaker: the first to build the profile database and the second to test system performance. The proposed approach achieves an accuracy greater than 96%, and a single test run, implemented in Matlab, takes about 2 seconds on a common PC.
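GMM-based identification of this kind reduces to fitting one GMM per speaker on that speaker's feature frames and choosing the speaker whose model gives the highest average per-frame log-likelihood on the test file. The sketch below assumes diagonal covariances, plain EM with a variance floor, and MFCC frames supplied as a T x D array; it omits spectral subtraction and is not the authors' Matlab implementation. Function names and defaults are hypothetical.

```python
import numpy as np

def _log_weighted_densities(X, weights, means, variances):
    """T x K matrix of log(w_k) + log N(x_t | mu_k, diag(var_k))."""
    D = X.shape[1]
    diff = X[:, None, :] - means[None, :, :]
    mahal = np.sum(diff ** 2 / variances[None, :, :], axis=2)
    log_norm = D * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1)
    return np.log(weights)[None, :] - 0.5 * (log_norm[None, :] + mahal)

def _logsumexp_rows(A):
    m = A.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(A - m).sum(axis=1, keepdims=True))).ravel()

def fit_diag_gmm(X, n_components=4, n_iter=30, seed=0, var_floor=1e-3):
    """Plain EM for a diagonal-covariance GMM (one speaker profile)."""
    rng = np.random.default_rng(seed)
    T = X.shape[0]
    means = X[rng.choice(T, n_components, replace=False)].copy()
    variances = np.tile(X.var(axis=0) + var_floor, (n_components, 1))
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each frame
        L = _log_weighted_densities(X, weights, means, variances)
        gamma = np.exp(L - _logsumexp_rows(L)[:, None])
        # M-step: re-estimate weights, means and floored variances
        Nk = gamma.sum(axis=0) + 1e-10
        weights = Nk / Nk.sum()
        means = (gamma.T @ X) / Nk[:, None]
        variances = np.maximum((gamma.T @ X ** 2) / Nk[:, None] - means ** 2, var_floor)
    return weights, means, variances

def avg_log_likelihood(X, gmm):
    return _logsumexp_rows(_log_weighted_densities(X, *gmm)).mean()

def identify(X_test, profiles):
    """Return the speaker whose GMM gives the highest average per-frame log-likelihood."""
    return max(profiles, key=lambda name: avg_log_likelihood(X_test, profiles[name]))
```

Averaging the log-likelihood over frames, rather than summing, keeps scores comparable across test files of different lengths.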
Abstract: In this paper, a manifold subspace learning algorithm based on locality preserving discriminant projection (LPDP) is applied to speaker verification. LPDP overcomes the deficiencies of total variability factor analysis and locality preserving projection (LPP) by effectively using the speaker label information of the speech data. Through optimization, LPDP preserves the intrinsic local manifold structure of each speaker's speech samples by reducing the distances between samples of the same speaker, while enhancing the discriminability of the embedding space by enlarging the distances between samples of different speakers. The proposed method is compared with LPP and total variability factor analysis on the NIST SRE 2010 telephone-telephone core condition. The experimental results indicate that LPDP overcomes the deficiencies of LPP and total variability factor analysis and further improves system performance.
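The abstract does not give LPDP's objective. One common way to realize "pull same-speaker neighbours together, push different speakers apart" is a graph-based generalized eigenproblem: heat-kernel weights define a within-speaker graph and a between-speaker graph, and the projection maximizes between-graph scatter relative to within-graph scatter. The sketch below implements that assumed formulation, which may differ from the paper's exact LPDP; the function name, the median heat-kernel width and the regularizer are hypothetical choices, and rows of `X` stand for utterance-level vectors such as total variability factors.

```python
import numpy as np

def lpdp_like_projection(X, labels, n_components=1, t=None, reg=1e-6):
    """X: N x D (one utterance vector per row); labels: speaker id per row.
    Returns a D x n_components projection matrix A (embed with X @ A)."""
    X = np.asarray(X, float)
    labels = np.asarray(labels)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    if t is None:
        t = np.median(sq[sq > 0])                # heat-kernel width (heuristic)
    heat = np.exp(-sq / t)
    same = labels[:, None] == labels[None, :]
    W_w = np.where(same, heat, 0.0)              # same speaker: pull together
    W_b = np.where(same, 0.0, heat)              # different speakers: push apart
    L_w = np.diag(W_w.sum(axis=1)) - W_w         # graph Laplacians
    L_b = np.diag(W_b.sum(axis=1)) - W_b
    S_w = X.T @ L_w @ X + reg * np.eye(X.shape[1])
    S_b = X.T @ L_b @ X
    # Solve S_b a = lambda S_w a by Cholesky whitening, S_w = C C^T
    C = np.linalg.cholesky(S_w)
    M = np.linalg.solve(C, np.linalg.solve(C, S_b).T)   # C^-1 S_b C^-T (symmetric)
    evals, evecs = np.linalg.eigh(M)                    # ascending eigenvalues
    B = evecs[:, ::-1][:, :n_components]                # keep the largest ones
    return np.linalg.solve(C.T, B)                      # map back: a = C^-T b
```

The heat kernel makes nearby pairs dominate both scatter matrices, which is what distinguishes this locality-based criterion from plain LDA.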
Abstract: While the majority of nonnative speaker English teachers teach alongside NS teachers, research on the role of native speaker English teachers in China's teaching context, and on university students' attitudes towards them, has rarely been conducted. This essay discusses the implications of cultural differences for the language classroom and the different cultures of learning with regard to language teaching and learning in China and the West. It concludes that a good sense of cultural awareness and an open mind towards cultural interaction are of great importance in order to benefit both language learners and native speaker teachers in the cross-cultural classroom.
Abstract: Public speaking is a part of communication. Good public speaking conveys clear, persuasive ideas or opinions and can also become an effective bridge between the audience and the speaker. This paper discusses public speaking skills from several different aspects that speakers should pay attention to.
Abstract: This paper presents a speaker-adaptable very low bit rate speech coder based on HMMs (Hidden Markov Models) that include dynamic features, i.e., the delta and delta-delta parameters of speech. The performance of the coder is improved by using dynamic features produced by an algorithm for speech parameter generation from HMMs, because the generated speech parameter vectors reflect not only the means of the static and dynamic feature vectors but also their covariances. The encoder is equivalent to an HMM-based phoneme recognizer and transmits phoneme indexes, state durations, pitch information and speaker-characteristic adaptation vectors to the decoder. The decoder receives this information and concatenates phoneme HMMs into a sequence according to the phoneme indexes. It then generates a sequence of mel-cepstral coefficient vectors using the HMM-based speech parameter generation technique. Finally, the decoder synthesizes speech by exciting the MLSA (Mel Log Spectrum Approximation) filter directly with the generated mel-cepstral coefficient vectors, according to the pitch information.
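The parameter generation step referred to above can be sketched with the standard closed-form solution for one feature dimension under diagonal covariances: stacking the static, delta and delta-delta windows into a matrix W, the static trajectory c that maximizes the output probability solves (W^T U^-1 W) c = W^T U^-1 mu, where mu and U hold the per-frame means and variances. The window coefficients and the clamping of frames at the sequence edges are assumptions (a common choice), per-frame state statistics are taken as given rather than derived from a decoded HMM state sequence, and the function names are hypothetical.

```python
import numpy as np

# Regression windows as {frame offset: coefficient}; a common choice, assumed here
STATIC = {0: 1.0}
DELTA = {-1: -0.5, 1: 0.5}
DELTA2 = {-1: 1.0, 0: -2.0, 1: 1.0}

def window_matrix(T, windows=(STATIC, DELTA, DELTA2)):
    """Stack one T x T block per window; frames outside [0, T) are clamped to the edges."""
    blocks = []
    for win in windows:
        W = np.zeros((T, T))
        for t in range(T):
            for offset, coef in win.items():
                W[t, min(max(t + offset, 0), T - 1)] += coef
        blocks.append(W)
    return np.vstack(blocks)  # shape (3T, T)

def mlpg(means, variances):
    """Maximum-likelihood parameter generation for one feature dimension.

    means, variances: arrays of shape (T, 3) with the [static, delta, delta-delta]
    statistics for each frame. Returns the static trajectory c of length T.
    """
    means = np.asarray(means, float)
    variances = np.asarray(variances, float)
    T = means.shape[0]
    W = window_matrix(T)
    mu = means.T.ravel()               # [static..., delta..., delta-delta...]
    prec = 1.0 / variances.T.ravel()   # diagonal of U^-1
    A = W.T @ (prec[:, None] * W)
    b = W.T @ (prec * mu)
    return np.linalg.solve(A, b)
```

With very large delta variances the output collapses to the static means; smaller variances let the dynamic constraints smooth the trajectory, which is how the covariances of the dynamic features shape the generated mel-cepstral vectors.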
Funding: Project supported by the National Natural Science Foundation of China (Grant No. 60903186) and the Shanghai Leading Academic Discipline Project (Grant No. J50104).
Abstract: This paper proposes a new phase feature, derived from formant instantaneous characteristics, for speech recognition (SR) and speaker identification (SI) systems. Using the Hilbert transform (HT), formant characteristics can be represented by the instantaneous frequency (IF) and instantaneous bandwidth, together called formant instantaneous characteristics (FIC). To explore the importance of FIC in both SR and SI, this paper proposes different FIC-derived features for SR and SI systems. When these new features are combined with conventional parameters, a higher identification rate is achieved than with Mel-frequency cepstral coefficient (MFCC) parameters alone. The experimental results show that the new features are effective characteristic parameters and can serve as a complement to conventional parameters for SR and SI.
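A NumPy sketch of how IF and an instantaneous bandwidth can be obtained from the analytic signal produced by the Hilbert transform (computed here with the FFT, the same construction `scipy.signal.hilbert` uses). It assumes the input has already been band-pass filtered around a single formant, a step the abstract does not detail, and it uses one common definition of instantaneous bandwidth, |d ln a(t)/dt| / (2*pi), where a(t) is the Hilbert envelope. How the paper turns these quantities into SR and SI features is not reproduced, and the function names are hypothetical.

```python
import numpy as np

def analytic_signal(x):
    """FFT-based analytic signal x + j*H{x}: keep DC/Nyquist, double positive bins, drop negative ones."""
    x = np.asarray(x, float)
    n = len(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x) * h)

def instantaneous_characteristics(x, fs):
    """Return (instantaneous frequency in Hz, instantaneous bandwidth in Hz), each of length len(x) - 1."""
    z = analytic_signal(x)
    phase = np.unwrap(np.angle(z))
    inst_freq = np.diff(phase) * fs / (2 * np.pi)
    amp = np.maximum(np.abs(z), 1e-12)          # envelope a(t), floored before the log
    inst_bw = np.abs(np.diff(np.log(amp))) * fs / (2 * np.pi)
    return inst_freq, inst_bw
```

For a steady formant-like component the IF stays near the formant frequency and the bandwidth near zero; rapid envelope changes show up as larger instantaneous bandwidth.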
Abstract: This paper discusses the application of fractal dimensions to speech processing. Generalized dimensions of arbitrary order and associated fractal parameters are used for speaker identification. A characteristic vector based on these parameters is formed, and a recognition criterion is defined to identify individual speakers. Experimental results show that fractal dimensions are useful for characterizing speaker identity.
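Generalized (Renyi) dimensions D_q can be estimated by box counting: cover the point set with boxes of side eps, compute the fraction p_i of points in each box, and take the slope of log(sum_i p_i^q) / (q - 1) against log eps (of sum_i p_i log p_i when q = 1). The abstract does not say how points are formed from speech, so the time-delay embedding of the waveform, the dyadic box sizes and the function names below are assumptions; the points must lie in the unit hypercube [0, 1)^d.

```python
import numpy as np

def delay_embed(x, dim=2, tau=1):
    """Hypothetical time-delay embedding of a waveform, scaled into [0, 1)^dim."""
    x = np.asarray(x, float)
    x = (x - x.min()) / (x.max() - x.min() + 1e-12)
    n = len(x) - (dim - 1) * tau
    return np.stack([x[i * tau: i * tau + n] for i in range(dim)], axis=1)

def generalized_dimensions(points, qs=(0, 1, 2), levels=range(1, 7)):
    """Box-counting estimate of D_q for points in [0, 1)^d, using boxes of side 2**-k."""
    pts = np.asarray(points, float)
    log_eps, stats = [], {q: [] for q in qs}
    for k in levels:
        n_boxes = 2 ** k
        idx = np.floor(pts * n_boxes).astype(np.int64)      # box index per point
        _, counts = np.unique(idx, axis=0, return_counts=True)
        p = counts / counts.sum()                            # occupancy probabilities
        log_eps.append(np.log(1.0 / n_boxes))
        for q in qs:
            if q == 1:
                stats[q].append(np.sum(p * np.log(p)))       # information dimension
            else:
                stats[q].append(np.log(np.sum(p ** q)) / (q - 1))
    log_eps = np.array(log_eps)
    # D_q is the slope of each statistic against log(eps)
    return {q: np.polyfit(log_eps, np.array(stats[q]), 1)[0] for q in qs}
```

A vector such as [D_0, D_1, D_2] computed per utterance is one possible reading of the "characteristic vector" mentioned above, to be compared across speakers with a suitable distance.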