Abstract: Lip-reading technologies are progressing rapidly following the breakthrough of deep learning. They play a vital role in many applications, such as human-machine communication and security. In this paper, we propose an effective lip-reading recognition model for Arabic visual speech recognition based on deep learning algorithms. The collected Arabic visual dataset contains 2400 recordings of Arabic digits and 960 recordings of Arabic phrases from 24 native speakers. The primary purpose is to provide a high-performance model by enhancing the preprocessing phase. First, we extract keyframes from our dataset. Second, we produce Concatenated Frame Images (CFIs) that represent each utterance sequence as a single image. Finally, VGG-19 is employed for visual feature extraction in our proposed model. We examined different keyframe counts (10, 15, and 20) to compare two approaches in the proposed model: (1) the VGG-19 base model and (2) the VGG-19 base model with batch normalization. The results show that the second approach achieves greater accuracy on the test dataset: 94% for digit recognition, 97% for phrase recognition, and 93% for combined digit and phrase recognition. Our proposed model is therefore superior to previous models based on CFI input.
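The keyframe-plus-CFI preprocessing described above can be sketched as follows. The uniform sampling strategy, the 64×64 frame size, and the 5-column grid are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def extract_keyframes(frames, n_keyframes):
    """Uniformly sample n_keyframes from a clip (a hypothetical
    stand-in for the paper's keyframe-selection step)."""
    idx = np.linspace(0, len(frames) - 1, n_keyframes).astype(int)
    return [frames[i] for i in idx]

def concatenate_frame_images(keyframes, cols=5):
    """Tile the keyframes row by row into one Concatenated Frame
    Image (CFI), so the whole utterance becomes a single image."""
    h, w, c = keyframes[0].shape
    rows = int(np.ceil(len(keyframes) / cols))
    cfi = np.zeros((rows * h, cols * w, c), dtype=keyframes[0].dtype)
    for i, frame in enumerate(keyframes):
        r, col = divmod(i, cols)
        cfi[r * h:(r + 1) * h, col * w:(col + 1) * w] = frame
    return cfi

# A dummy 30-frame clip of 64x64 RGB frames.
clip = [np.full((64, 64, 3), i, dtype=np.uint8) for i in range(30)]
kf = extract_keyframes(clip, 10)
cfi = concatenate_frame_images(kf, cols=5)
print(cfi.shape)  # (128, 320, 3): a 2x5 grid of 64x64 frames
```

The resulting single image can then be fed to an image classifier such as VGG-19, which is what makes the CFI representation convenient: the temporal dimension is folded into spatial layout.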
Abstract: Lip-reading technology, based on visual speech decoding and automatic speech recognition, offers a promising solution for overcoming communication barriers, particularly for individuals with temporary or permanent speech impairments. However, most Visual Speech Recognition (VSR) research has focused on the English language and general-purpose applications, limiting its practical applicability in medical and rehabilitative settings. This study introduces the first Deep Learning (DL) based lip-reading system for the Italian language designed to assist individuals with vocal cord pathologies in daily interactions, facilitating communication for patients recovering from vocal cord surgery, whether temporarily or permanently impaired. To ensure relevance and effectiveness in real-world scenarios, a carefully curated vocabulary of twenty-five Italian words was selected, encompassing critical semantic fields such as Needs, Questions, Answers, Emergencies, Greetings, Requests, and Body Parts. These words were chosen to address both essential daily communication and urgent medical assistance requests. Our approach combines a spatiotemporal Convolutional Neural Network (CNN) with a bidirectional Long Short-Term Memory (BiLSTM) recurrent network and a Connectionist Temporal Classification (CTC) loss function to recognize individual words without requiring predefined word boundaries. The experimental results demonstrate the system's robust performance, reaching an average accuracy of 96.4% in individual word recognition and suggesting that the system is particularly well suited to constrained clinical and caregiving environments, where quick and reliable communication is critical. In conclusion, the study highlights the importance of developing language-specific, application-driven VSR solutions, particularly for non-English languages with limited linguistic resources. By bridging the gap between deep learning-based lip-reading and real-world clinical needs, this research advances assistive communication technologies, paving the way for more inclusive and medically relevant applications of VSR in rehabilitation and healthcare.
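The CTC loss mentioned above is what removes the need for predefined word boundaries: the network emits a label distribution per frame, and alignments are learned implicitly. At inference, a simple greedy decode collapses repeated labels and drops blanks. A minimal sketch (the tiny vocabulary and frame probabilities below are invented for illustration):

```python
import numpy as np

def ctc_greedy_decode(frame_probs, blank=0):
    """Greedy CTC decoding: take the argmax label at each time
    step, merge consecutive repeats, then drop blank labels."""
    best = np.argmax(frame_probs, axis=1)
    decoded, prev = [], None
    for label in best:
        if label != prev and label != blank:
            decoded.append(int(label))
        prev = label
    return decoded

# 6 frames over {blank, "si", "no"}: the network lingers on "si",
# emits a blank, then lingers on "no".
probs = np.array([
    [0.80, 0.10, 0.10],  # blank
    [0.10, 0.80, 0.10],  # "si"
    [0.10, 0.70, 0.20],  # "si" (repeat, collapsed)
    [0.90, 0.05, 0.05],  # blank separates tokens
    [0.20, 0.10, 0.70],  # "no"
    [0.20, 0.10, 0.70],  # "no" (repeat, collapsed)
])
vocab = {1: "si", 2: "no"}
print([vocab[i] for i in ctc_greedy_decode(probs)])  # ['si', 'no']
```

In the actual system the per-frame distributions come from the CNN-BiLSTM stack; the decoding step is the same regardless of the front-end.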
Abstract: Lip reading is typically regarded as visually interpreting a speaker's lip movements while they speak; it is the task of decoding text from mouth movements. This paper proposes a lip-reading model that helps deaf people and persons with hearing problems understand a speaker by capturing a video of the speaker and feeding it to the model to obtain the corresponding subtitles. Deep learning technologies make it easy to extract a large number of different features, which can then be converted to per-letter probabilities to obtain accurate results. Recently proposed lip-reading methods are based on sequence-to-sequence architectures originally designed for neural machine translation and audio speech recognition. In this paper, a deep convolutional neural network model called the hybrid lip-reading (HLR-Net) model is developed for lip reading from video. The proposed model comprises three stages, namely preprocessing, encoder, and decoder, which together produce the output subtitle. Inception, gradient, and bidirectional GRU layers are used to build the encoder, and attention, fully connected, and activation-function layers are used to build the decoder, which performs connectionist temporal classification (CTC). Compared with three recent models, namely LipNet, the lip-reading model with cascaded attention (LCANet), and the attention-CTC (A-ACA) model, on the GRID corpus dataset, the proposed HLR-Net model achieves significant improvements: a CER of 4.9%, WER of 9.7%, and BLEU score of 92% for unseen speakers, and a CER of 1.4%, WER of 3.3%, and BLEU score of 99% for overlapped speakers.
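The CER and WER figures quoted above are edit-distance rates: the Levenshtein distance between reference and hypothesis, normalized by reference length, computed over characters or words respectively. A minimal reference implementation (the example transcripts are invented, in the style of GRID-corpus commands):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance via dynamic programming."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def cer(ref, hyp):
    """Character error rate: edits per reference character."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

def wer(ref, hyp):
    """Word error rate: edits per reference word."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

print(cer("place blue", "placa blue"))              # 0.1: 1 edit / 10 chars
print(wer("set blue at a now", "set blue at now"))  # 0.2: 1 deletion / 5 words
```

BLEU, the third reported metric, is instead an n-gram precision score and rewards longer matching subsequences rather than penalizing edits.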
Funding: This research was supported and funded by the KAU Scientific Endowment, King Abdulaziz University, Jeddah, Saudi Arabia.
Abstract: Continuing advances in deep learning have paved the way for several challenging ideas; one such idea is visual lip-reading, which has recently drawn considerable research interest. Lip-reading, often referred to as visual speech recognition, is the ability to understand and predict spoken speech based solely on lip movements, without using sound. Given the scarcity of research on visual speech recognition for the Arabic language in general, and its absence in Quranic research, this work aims to fill that gap. This paper introduces a new, publicly available Arabic lip-reading dataset containing 10,490 videos captured from multiple viewpoints and comprising samples at the letter level (i.e., single letters (single alphabets) and Quranic disjoined letters) and at the word level, based on the content and context of the book Al-Qaida Al-Noorania. This research uses visual speech recognition to recognize spoken Arabic letters (Arabic alphabets), Quranic disjoined letters, and Quranic words, chiefly as they are phonetically recited in the Holy Quran according to the Quranic study aid Al-Qaida Al-Noorania. The study could further validate correctness of pronunciation and, subsequently, assist people in reciting the Quran correctly. Furthermore, a detailed description of the created dataset and its construction methodology is provided. The new dataset is used to train an effective pre-trained deep learning CNN model through transfer learning for lip-reading, achieving accuracies of 83.3%, 80.5%, and 77.5% on words, disjoined letters, and single letters, respectively, and an extended analysis of the results is provided. Finally, the experimental outcomes, different research aspects, and dataset collection consistency and challenges are discussed, concluding with several promising directions for future work.
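Transfer learning of the kind described (a pre-trained CNN reused as a frozen feature extractor, with only a new classification head fitted on the target data) can be sketched in miniature. The random-projection "backbone" and nearest-class-mean head below are illustrative stand-ins, not the paper's actual architecture or data:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for a frozen pre-trained backbone: a fixed random ReLU
# projection from flattened inputs to feature vectors. In real
# transfer learning these weights come from pre-training (e.g. on
# ImageNet) and are not updated on the target task.
W_frozen = rng.normal(size=(64, 16))

def backbone(x):
    return np.maximum(x @ W_frozen, 0.0)

# Toy target-domain data: three well-separated classes, 30 samples each.
class_means = rng.normal(size=(3, 64)) * 10.0
X = np.vstack([m + rng.normal(size=(30, 64)) for m in class_means])
y = np.repeat(np.arange(3), 30)

# Fit only a lightweight head on the frozen features; here a
# nearest-class-mean classifier stands in for a trained dense layer.
F = backbone(X)
centroids = np.stack([F[y == k].mean(axis=0) for k in range(3)])
pred = ((F[:, None, :] - centroids[None, :, :]) ** 2).sum(-1).argmin(1)
print(f"accuracy on the toy set: {(pred == y).mean():.2f}")
```

The point of the sketch is the division of labor: the expensive representation is inherited and frozen, and only a small head is estimated from the (comparatively small) target dataset, which is what makes the approach viable for a newly collected corpus.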