Funding: Supported by the German National BMBF IKT2020-Grant (16SV7213) (EmotAsS), the European Union's Horizon 2020 Research and Innovation Programme (688835) (DE-ENIGMA), and the China Scholarship Council (CSC).
Abstract: Spectrogram representations of acoustic scenes have achieved competitive performance for acoustic scene classification. Yet, the spectrogram alone does not capture a substantial amount of time-frequency information. In this study, we present an approach for exploring the benefits of deep scalogram representations, extracted in segments from an audio stream. The presented approach firstly transforms the segmented acoustic scenes into bump and Morse scalograms, as well as spectrograms; secondly, the spectrograms or scalograms are fed into pre-trained convolutional neural networks; thirdly, the features extracted from a subsequent fully connected layer are fed into (bidirectional) gated recurrent neural networks, which are followed by a single highway layer and a softmax layer; finally, the predictions from these three systems are fused by a margin sampling value strategy. We then evaluate the proposed approach using the acoustic scene classification dataset of the 2017 IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE). On the evaluation set, an accuracy of 64.0% is obtained from bidirectional gated recurrent neural networks when fusing the spectrogram and the bump scalogram, which is an improvement on the 61.0% baseline result provided by the DCASE 2017 organisers. This result shows that extracted bump scalograms are capable of improving the classification accuracy when fused with a spectrogram-based system.
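As a rough illustration of the late-fusion step described above, the sketch below implements one common reading of a margin sampling value strategy: for each test instance, the system whose top two class posteriors are furthest apart (i.e., the most confident one) supplies the final label. The function name and toy data are hypothetical, and the authors' exact fusion rule may differ.

```python
import numpy as np

def margin_sampling_fusion(prob_sets):
    """Fuse per-system class posteriors by the margin sampling value:
    for each instance, keep the prediction of the system whose top-1
    and top-2 class probabilities are furthest apart.

    prob_sets: list of arrays, each of shape (n_instances, n_classes),
               one array per system (e.g. spectrogram, bump, Morse).
    Returns: fused class indices, shape (n_instances,).
    """
    fused = []
    for probs_per_system in zip(*prob_sets):
        margins = []
        for p in probs_per_system:
            top2 = np.sort(p)[-2:]             # two largest posteriors
            margins.append(top2[1] - top2[0])  # confidence margin
        best = int(np.argmax(margins))         # most confident system
        fused.append(int(np.argmax(probs_per_system[best])))
    return np.array(fused)

# Toy usage: three hypothetical systems, four instances, 15 DCASE-style classes
rng = np.random.default_rng(0)
systems = [rng.dirichlet(np.ones(15), size=4) for _ in range(3)]
print(margin_sampling_fusion(systems))
```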
Funding: Supported by the European Community's Seventh Framework Programme (No. 338164) (ERC Starting Grant iHEARu).
Abstract: In this contribution, we present iHEARu-PLAY, an online, multi-player platform for crowdsourced database collection and labelling, including the voice analysis application (VoiLA), a free web-based speech classification tool designed to educate iHEARu-PLAY users about state-of-the-art speech analysis paradigms. In addition, via this associated speech analysis web interface, VoiLA encourages users to take an active role in improving the service by providing labelled speech data. The platform allows users to record and upload voice samples directly from their browser, which are then analysed in a state-of-the-art classification pipeline. A set of pre-trained models targeting a range of speaker states and traits, such as gender, valence, arousal, dominance, and 24 different discrete emotions, is employed. The analysis results are visualised in a way that is easily interpretable by laymen, giving users unique insights into how their voice sounds. We assess the effectiveness of iHEARu-PLAY and its integrated VoiLA feature via a series of user evaluations, which indicate that it is fun and easy to use, and that it provides accurate and informative results.
Funding: Supported by the Affective Computing & HCI Innovation Research Lab between Huawei Technologies and the University of Augsburg, and the EU H2020 Project under Grant No. 101135556 (INDUX-R).
Abstract: Neural network models for audio tasks, such as automatic speech recognition (ASR) and acoustic scene classification (ASC), are susceptible to noise contamination in real-life applications. To improve audio quality, an enhancement module, which can be developed independently, is explicitly used at the front end of the target audio applications. In this paper, we present an end-to-end learning solution to jointly optimise the models for audio enhancement (AE) and the subsequent applications. To guide the optimisation of the AE module towards a target application, and especially to overcome difficult samples, we make use of the sample-wise performance measure as an indication of sample importance. In the experiments, we consider four representative applications to evaluate our training paradigm, i.e., ASR, speech command recognition (SCR), speech emotion recognition (SER), and ASC. These applications cover speech and non-speech tasks concerning semantic and non-semantic features, as well as transient and global information. The experimental results indicate that our proposed approach can considerably boost the noise robustness of the models, especially at low signal-to-noise ratios, for a wide range of computer audition tasks in everyday-life noisy environments.
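The joint optimisation idea can be sketched as follows. This is a minimal, assumption-laden illustration in PyTorch style: a softmax over per-sample losses stands in for the paper's sample-wise performance measure, and `loss_fn` is assumed to return one loss value per sample (e.g. `reduction="none"`). It is not the authors' exact recipe.

```python
import torch

def joint_training_step(ae_model, task_model, batch, optimiser, loss_fn):
    """One hypothetical joint step: the enhancement (AE) module and the
    downstream task model are trained end to end, with each sample's task
    loss re-weighted so that currently difficult samples contribute more."""
    noisy_audio, targets = batch
    enhanced = ae_model(noisy_audio)              # front-end enhancement
    logits = task_model(enhanced)                 # downstream prediction
    per_sample_loss = loss_fn(logits, targets)    # shape: (batch,)

    # Sample-wise importance weights (assumed choice: softmax over the
    # detached per-sample losses, so hard samples dominate the gradient).
    weights = torch.softmax(per_sample_loss.detach(), dim=0)
    loss = (weights * per_sample_loss).sum()

    optimiser.zero_grad()                         # optimiser covers both models
    loss.backward()                               # gradients flow into AE and task model
    optimiser.step()
    return loss.item()
```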
Funding: Supported by the Ministry of Science and Technology of the People's Republic of China with the STI2030-Major Projects (Grant No. 2021ZD0201900), the National Natural Science Foundation of China (Grant Nos. 62227807 and 62272044), the Teli Young Fellow Program from the Beijing Institute of Technology, the Shenzhen Municipal Scheme for Basic Research (Grant Nos. JCYJ20210324100208022 and JCYJ20190808144005614), China, the JSPS KAKENHI (Grant No. 20H00569), the JST Mirai Program (Grant No. 21473074), and the JST MOONSHOT Program (Grant No. JPMJMS229B).
Abstract: Objective: Speech recognition technology is widely used as a mature technical approach in many fields. In the study of depression recognition, speech signals are commonly used due to their convenience and ease of acquisition. Although speech recognition is popular in the research field of depression recognition, it has been little studied in somatisation disorder recognition. The reason for this is the lack of a publicly accessible database of relevant speech and benchmark studies. To this end, we introduced our somatisation disorder speech database and gave benchmark results. Methods: By collecting speech samples of somatisation disorder patients, in cooperation with the Shenzhen University General Hospital, we introduced our somatisation disorder speech database, the Shenzhen Somatisation Speech Corpus (SSSC). Moreover, a benchmark for SSSC using classic acoustic features and a machine learning model was proposed in our work. Results: To obtain a more scientific benchmark, we compared and analysed the performance of different acoustic features, i.e., the full ComParE feature set, or only Mel frequency cepstral coefficients (MFCCs), fundamental frequency (F0), and the frequency and bandwidth of the formants (F1–F3). By comparison, the best result of our benchmark was the 76.0% unweighted average recall achieved by a support vector machine with the formants F1–F3. Conclusion: The proposal of SSSC may bridge a research gap in somatisation disorder, providing researchers with a publicly accessible speech database. In addition, the results of the benchmark show the scientific validity and feasibility of computer audition for speech recognition in somatisation disorders.
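To make the benchmarking setup concrete, the following sketch shows the general recipe of a support vector machine scored with unweighted average recall (macro-averaged recall in scikit-learn). The feature extraction function is a placeholder and the dummy data are purely illustrative; this is not the authors' exact pipeline.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import recall_score

def extract_formant_features(wav_paths):
    """Placeholder: return per-recording F1-F3 frequencies and bandwidths,
    e.g. functionals of Praat or openSMILE formant tracks."""
    raise NotImplementedError

# Dummy stand-in data: (n_recordings, n_features) and binary labels
# (somatisation disorder vs. control); SSSC itself is not reproduced here.
X, y = np.random.randn(40, 6), np.random.randint(0, 2, 40)

clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
clf.fit(X[:30], y[:30])                               # simple hold-out split
uar = recall_score(y[30:], clf.predict(X[30:]), average="macro")
print(f"UAR: {uar:.3f}")
```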
Funding: Partially supported by the National Natural Science Foundation of China (grant number 62272044), the Ministry of Science and Technology of the People's Republic of China with the STI2030-Major Projects (grant number 2021ZD0201900), the Teli Young Fellow Program from the Beijing Institute of Technology, China, the Grants-in-Aid for Scientific Research (grant number 20H00569) from the Ministry of Education, Culture, Sports, Science and Technology (MEXT), Japan, the JSPS KAKENHI (grant number 20H00569), Japan, the JST Mirai Program (grant number 21473074), Japan, the JST MOONSHOT Program (grant number JPMJMS229B), Japan, and the BIT Research and Innovation Promoting Project (grant number 2023YCXZ014).
Abstract: Cardiovascular diseases are a prominent cause of mortality, emphasizing the need for early prevention and diagnosis. Utilizing artificial intelligence (AI) models, heart sound analysis emerges as a noninvasive and universally applicable approach for assessing cardiovascular health conditions. However, real-world medical data are dispersed across medical institutions, forming "data islands" due to data sharing limitations for security reasons. To this end, federated learning (FL) has been extensively employed in the medical field, as it can effectively model across multiple institutions. Additionally, conventional supervised classification methods require fully labeled data classes; e.g., binary classification requires the labeling of positive and negative samples. Nevertheless, the process of labeling healthcare data is time-consuming and labor-intensive, leading to the possibility of mislabeling negative samples. In this study, we validate an FL framework with a naive positive-unlabeled (PU) learning strategy. The semi-supervised FL model can directly learn from a limited set of positive samples and an extensive pool of unlabeled samples. Our emphasis is on vertical FL to enhance collaboration across institutions with different medical record feature spaces. Additionally, our contribution extends to feature importance analysis, where we explore six methods and provide practical recommendations for detecting abnormal heart sounds. The study demonstrated an impressive accuracy of 84%, comparable to outcomes in supervised learning, thereby advancing the application of FL in abnormal heart sound detection.
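The "naive" PU strategy can be summarised in a few lines: unlabeled records are provisionally treated as negatives, and a standard classifier is fit on the resulting binary problem. The sketch below illustrates only this local training idea with dummy features; the federated, vertical setting described in the abstract is not reproduced here, and the classifier choice is an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def naive_pu_fit(X_positive, X_unlabeled):
    """Train on positives vs. unlabeled-treated-as-negative (naive PU assumption)."""
    X = np.vstack([X_positive, X_unlabeled])
    y = np.concatenate([np.ones(len(X_positive)), np.zeros(len(X_unlabeled))])
    return LogisticRegression(max_iter=1000).fit(X, y)

# Dummy heart-sound feature vectors: a few labeled abnormal (positive) samples
# and a large unlabeled pool that mixes normal and abnormal recordings.
rng = np.random.default_rng(1)
pos = rng.normal(1.0, 1.0, size=(20, 8))
unl = rng.normal(0.0, 1.0, size=(200, 8))

model = naive_pu_fit(pos, unl)
print(model.predict_proba(unl[:5])[:, 1])  # scores read as abnormality risk
```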
Funding: Partially supported by the Ministry of Science and Technology of the People's Republic of China with the STI2030-Major Projects (2021ZD0201900), the National Natural Science Foundation of China (Nos. 62227807 and 62272044), the Teli Young Fellow Program from the Beijing Institute of Technology, China, the Natural Science Foundation of Shenzhen University General Hospital (No. SUGH2018QD013), China, the Shenzhen Science and Technology Innovation Commission Project (No. JCYJ20190808120613189), China, and the Grants-in-Aid for Scientific Research (No. 20H00569) from the Ministry of Education, Culture, Sports, Science and Technology (MEXT), Japan.
Abstract: Leveraging the power of artificial intelligence to facilitate the automatic analysis and monitoring of heart sounds has attracted tremendous efforts in the past decade. Nevertheless, the lack of a standard open-access database made it difficult to maintain sustainable and comparable research before the first release of the PhysioNet CinC Challenge Dataset. However, inconsistent standards on data collection, annotation, and partition are still restraining a fair and efficient comparison between different works. To this end, we introduced and benchmarked a first version of the Heart Sounds Shenzhen (HSS) corpus. Motivated and inspired by previous works based on HSS, we redefined the tasks and made a comprehensive investigation of shallow and deep models in this study. First, we segmented the heart sound recordings into shorter recordings (10 s), which makes the setting more similar to the human auscultation case. Second, we redefined the classification tasks. Besides using the three class categories (normal, moderate, and mild/severe) adopted in HSS, we added a binary classification task in this study, i.e., normal and abnormal. In this work, we provided detailed benchmarks based on both classic machine learning and state-of-the-art deep learning technologies, which are reproducible using open-source toolkits. Last but not least, we analyzed the feature contributions of the best performance achieved by the benchmark to make the results more convincing and interpretable.
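The 10 s segmentation step can be sketched as follows: each recording is cut into fixed-length, non-overlapping chunks so that models see excerpts closer in length to what a clinician hears during auscultation. The file names and the use of the soundfile library are illustrative assumptions, not details of the HSS release.

```python
import soundfile as sf

def segment_recording(path, chunk_seconds=10):
    """Yield consecutive non-overlapping chunks of `chunk_seconds` seconds."""
    audio, sr = sf.read(path)
    chunk_len = int(chunk_seconds * sr)
    for start in range(0, len(audio) - chunk_len + 1, chunk_len):
        yield audio[start:start + chunk_len], sr

# Example: write each 10 s chunk of a (hypothetical) recording to its own file.
for i, (chunk, sr) in enumerate(segment_recording("heart_sound_001.wav")):
    sf.write(f"heart_sound_001_seg{i:02d}.wav", chunk, sr)
```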
Funding: Supported in part by the Ministry of Science and Technology of the People's Republic of China (2021ZD0201900) and the National Natural Science Foundation of China (62272044 and 62227807), and in part by the Teli Young Fellow Program from the Beijing Institute of Technology, China.
Abstract: The sound generated by the body carries important information about our physical and psychological health status. In the past decades, we have witnessed a plethora of successes achieved in the field of body sound analysis. Nevertheless, the fundamentals of this young field are still not well established. In particular, publicly accessible databases are rarely developed, which dramatically restrains sustainable research. To this end, we are launching, and continuously calling for participation from the global scientific community to contribute to, the Voice of the Body (VoB) archive. We aim to build an open-access platform to collect well-established body sound databases in a well-standardized way. Moreover, we hope to organize a series of challenges to promote the development of audio-driven methods for healthcare via the proposed VoB. We believe that VoB can help break the walls between different subjects toward an era of Medicine 4.0 enriched by audio intelligence.