Much like humans focus solely on object movement to understand actions, directing a deep learning model's attention to the core contexts within videos is crucial for improving video comprehension. In a recent study, the Video Masked Auto-Encoder (VideoMAE) employs a pre-training approach with a high ratio of tube masking and reconstruction, effectively mitigating the spatial bias caused by temporal redundancy in full video frames. This steers the model's focus toward detailed temporal contexts. However, because VideoMAE still relies on full video frames during the action recognition stage, it may exhibit a progressive shift of attention towards spatial contexts, deteriorating its ability to capture the main spatio-temporal contexts. To address this issue, we propose an attention-directing module named the Transformer Encoder Attention Module (TEAM). The proposed module effectively directs the model's attention to the core characteristics within each video, inherently mitigating spatial bias. TEAM first identifies the core features among the overall features extracted from each video. It then discerns the specific parts of the video where those features are located, encouraging the model to focus more on these informative parts. Consequently, during the action recognition stage, TEAM effectively shifts VideoMAE's attention from spatial contexts towards the core spatio-temporal contexts. This attention shift alleviates the spatial bias in the model and simultaneously enhances its ability to capture precise video contexts. We conduct extensive experiments to explore the optimal configuration that enables TEAM to fulfill its intended design purpose and facilitates its seamless integration with the VideoMAE framework. The integrated model, i.e., VideoMAE+TEAM, outperforms the existing VideoMAE by a significant margin on Something-Something-V2 (71.3% vs. 70.3%). Moreover, qualitative comparisons demonstrate that TEAM encourages the model to disregard insignificant features and focus more on the essential video features, capturing more detailed spatio-temporal contexts within the video.
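To make the idea concrete, below is a minimal PyTorch sketch of a two-stage attention block in the spirit of what the abstract describes: it first weighs which feature channels are "core" and then weighs where in the token sequence those features occur. The class name, layer sizes, and gating choices are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TEAMLikeBlock(nn.Module):
    """Hypothetical sketch of a two-stage attention module:
    (1) score which feature channels are 'core', then
    (2) score where in the token sequence those features occur.
    An illustration of the idea, not the authors' implementation."""

    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        # Stage 1: channel (feature) attention over the token-averaged descriptor
        self.channel_gate = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.GELU(),
            nn.Linear(dim // reduction, dim),
            nn.Sigmoid(),
        )
        # Stage 2: token (spatio-temporal location) attention
        self.token_gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) -- e.g. transformer encoder outputs
        channel_weights = self.channel_gate(x.mean(dim=1))  # (B, D)
        x = x * channel_weights.unsqueeze(1)                # emphasize core features
        token_weights = self.token_gate(x)                  # (B, T, 1)
        return x * token_weights                            # emphasize informative parts

if __name__ == "__main__":
    tokens = torch.randn(2, 8 * 14 * 14, 768)   # dummy VideoMAE-style token grid
    print(TEAMLikeBlock(768)(tokens).shape)     # torch.Size([2, 1568, 768])
```

In a VideoMAE-style pipeline, a block of this kind would sit on top of the transformer encoder outputs, before the classification head.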
Speech is a highly coordinated process that requires precise control over vocal tract morphology and motion to produce intelligible sounds while simultaneously generating unique exhaled flow patterns. The schlieren imaging technique visualizes airflows with subtle density variations. It is hypothesized that speech flows captured by schlieren, when analyzed using a hybrid of a convolutional neural network (CNN) and a long short-term memory (LSTM) network, can recognize letter pronunciations, thus facilitating automatic speech recognition and speech disorder therapy. This study evaluates the feasibility of using a CNN-based video classification network to differentiate speech flows corresponding to the first four letters of the alphabet: /A/, /B/, /C/, and /D/. A schlieren optical system was developed, and the speech flows of letter pronunciations were recorded for two participants at an acquisition rate of 60 frames per second. A total of 640 video clips, each lasting 1 s, were used to train and test a hybrid CNN-LSTM network. Acoustic analyses of the recorded sounds were conducted to understand the phonetic differences among the four letters. The hybrid CNN-LSTM network was trained separately on four datasets of varying sizes (20, 30, 40, and 50 videos per letter), all achieving over 95% accuracy in classifying videos of the same participant. However, the network's performance declined when tested on speech flows from a different participant, with accuracy dropping to around 44%, indicating significant inter-participant variability in pronunciation. Retraining the network with videos from both participants improved accuracy on the second participant to 93%. Analysis of misclassified videos indicated that factors such as low video quality and disproportionate head size affected accuracy. These results highlight the potential of CNN-assisted speech recognition and speech therapy using articulation flows, although challenges remain in expanding the letter set and participant cohort.
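A hybrid CNN-LSTM video classifier of the kind described can be sketched as follows; the ResNet-18 frame encoder, hidden size, and input resolution are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

class CNNLSTMClassifier(nn.Module):
    """Illustrative CNN-LSTM video classifier: a per-frame CNN encoder
    followed by an LSTM over the frame sequence. Backbone choice and
    layer sizes are assumptions, not the paper's exact configuration."""

    def __init__(self, num_classes: int = 4, hidden: int = 256):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()               # yields 512-dim frame features
        self.cnn = backbone
        self.lstm = nn.LSTM(512, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, frames, 3, H, W), e.g. 60 schlieren frames of a 1 s clip
        b, t, c, h, w = clips.shape
        feats = self.cnn(clips.reshape(b * t, c, h, w)).reshape(b, t, -1)
        _, (h_n, _) = self.lstm(feats)
        return self.head(h_n[-1])                 # logits over /A/, /B/, /C/, /D/

if __name__ == "__main__":
    logits = CNNLSTMClassifier()(torch.randn(2, 60, 3, 112, 112))
    print(logits.shape)  # torch.Size([2, 4])
```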
Video classification is an important task in video understanding and plays a pivotal role in the intelligent monitoring of information content. Most existing methods do not consider the multimodal nature of video, and their modality fusion approaches tend to be too simple, often neglecting modality alignment before fusion. This research introduces a novel dual-stream multimodal alignment and fusion network, named DMAFNet, for classifying short videos. The network uses two unimodal encoder modules to extract features within modalities and exploits a multimodal encoder module to learn interactions between modalities. To solve the modality alignment problem, contrastive learning is introduced between the two unimodal encoder modules. Additionally, masked language modeling (MLM) and video-text matching (VTM) auxiliary tasks are introduced to improve the interaction between video frames and text modalities through backpropagation of their loss functions. Diverse experiments prove the efficiency of DMAFNet in multimodal video classification tasks. Compared with two other mainstream baselines, DMAFNet achieves the best results on the 2022 WeChat Big Data Challenge dataset.
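The contrastive alignment between the two unimodal encoders can be illustrated with a symmetric InfoNCE-style loss, shown below as a generic stand-in for DMAFNet's actual objective; the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(video_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss as a generic stand-in for the video-text
    alignment objective described in the abstract. Matching (video_i, text_i)
    pairs are positives; all other pairs in the batch are negatives."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                      # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

if __name__ == "__main__":
    loss = contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
    print(float(loss))
```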
The fluidity of coal-water slurry (CWS) is crucial for various industrial applications such as long-distance transportation, gasification, and combustion. However, there is currently a lack of rapid and accurate methods for assessing CWS fluidity. This paper proposes a method for analyzing fluidity using videos of CWS dripping processes. By integrating the temporal and spatial features of each frame in the video, a multi-cascade classifier for CWS fluidity is established. The classifier distinguishes between four levels (A, B, C, and D) based on the quality of fluidity. The preliminary classification of A and D is achieved through feature engineering and the XGBoost algorithm. Subsequently, a convolutional neural network (CNN) and long short-term memory (LSTM) are used to further differentiate between the B and C categories, which are prone to confusion. Finally, through detailed comparative experiments, the paper demonstrates the step-by-step design process of the proposed method and the superiority of the final solution. The proposed method achieves an accuracy of over 90% in determining the fluidity of CWS, serving as a technical reference for future industrial applications.
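The cascade logic can be sketched as a simple routing function: a stage-1 model on engineered features resolves the easy grades A and D, and ambiguous samples fall through to a stage-2 spatiotemporal model for B vs. C. The function signature, feature layout, and dummy stand-in models are illustrative assumptions, not the paper's code.

```python
import numpy as np

GRADES = ("A", "B", "C", "D")

def classify_fluidity(engineered_feats: np.ndarray,
                      frame_clip: np.ndarray,
                      stage1_predict,
                      stage2_predict) -> str:
    """Illustrative cascade in the spirit of the abstract: stage 1
    (e.g. an XGBoost model on engineered per-frame features) handles the
    clearly separable grades A and D; ambiguous samples fall through to a
    stage-2 spatiotemporal model (e.g. CNN + LSTM) that resolves B vs. C."""
    coarse = GRADES[int(stage1_predict(engineered_feats.reshape(1, -1))[0])]
    if coarse in ("A", "D"):
        return coarse
    return stage2_predict(frame_clip)   # expected to return "B" or "C"

if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs end to end
    demo = classify_fluidity(np.zeros(16), np.zeros((30, 64, 64)),
                             stage1_predict=lambda x: [1],
                             stage2_predict=lambda clip: "B")
    print(demo)  # "B"
```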
Video classification typically requires large labeled datasets, which are costly and time-consuming to obtain. This paper proposes a novel Active Learning (AL) framework to improve video classification performance while minimizing human annotation effort. Unlike passive learning methods that randomly select samples for labeling, our approach actively identifies the most informative unlabeled instances to be annotated. Specifically, we develop batch-mode AL techniques that select useful videos based on uncertainty and diversity sampling. The algorithm then extracts a diverse set of representative keyframes from the queried videos. Human annotators only need to label these keyframes instead of watching the full videos. We implement this approach by leveraging recent advances in deep neural networks for visual feature extraction and sequence modeling. Our experiments on benchmark datasets demonstrate that our method achieves significant improvements in video classification accuracy with less training data. This enables more efficient video dataset construction and could make large-scale video annotation more feasible. In sum, our AL framework minimizes the human effort needed to train accurate video classifiers.
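A batch-mode query step combining uncertainty and diversity sampling might look like the sketch below, where predictive entropy ranks candidates and k-means clustering of their embeddings enforces diversity; the candidate pool size and the clustering choice are assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_batch(probs: np.ndarray, embeddings: np.ndarray, budget: int) -> np.ndarray:
    """Illustrative batch-mode active-learning query: rank unlabeled videos by
    predictive entropy (uncertainty), then cluster the most uncertain candidates
    and take the one closest to each cluster centre (diversity)."""
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    pool = np.argsort(-entropy)[: budget * 5]             # most uncertain candidates
    km = KMeans(n_clusters=budget, n_init=10).fit(embeddings[pool])
    picked = []
    for c in range(budget):
        members = pool[km.labels_ == c]
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        picked.append(members[np.argmin(dists)])
    return np.array(picked)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p = rng.dirichlet(np.ones(10), size=200)               # fake softmax outputs
    e = rng.normal(size=(200, 64))                         # fake video embeddings
    print(select_batch(p, e, budget=8))
```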
Learning comprehensive spatiotemporal features is crucial for human action recognition. Existing methods tend to model spatiotemporal feature blocks in an integrate-separate-integrate form, such as the appearance-and-relation network (ARTNet) and the spatiotemporal and motion network (STM). However, as blocks stack up, the rear part of the network has poor interpretability. To avoid this problem, we propose a novel architecture called the spatial-temporal relation network (STRNet), which can learn explicit information about appearance, motion, and especially temporal relations. Specifically, our STRNet is constructed from three branches, which separate the features into 1) an appearance pathway, to obtain spatial semantics, 2) a motion pathway, to reinforce the spatiotemporal feature representation, and 3) a relation pathway, to capture temporal relation details of successive frames and to explore long-term representation dependency. In addition, our STRNet does not simply merge the multi-branch information; rather, we apply a flexible and effective strategy to fuse the complementary information from multiple pathways. We evaluate our network on four major action recognition benchmarks: Kinetics-400, UCF-101, HMDB-51, and Something-Something v1, demonstrating that STRNet achieves state-of-the-art results on the UCF-101 and HMDB-51 datasets, as well as accuracy comparable to the state of the art on Something-Something v1 and Kinetics-400.
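A generic example of fusing the three pathway features with learned, input-dependent weights is sketched below; it illustrates multi-branch fusion in general and is not STRNet's actual fusion strategy.

```python
import torch
import torch.nn as nn

class GatedBranchFusion(nn.Module):
    """Illustrative fusion of appearance / motion / relation branch features
    with learned, input-dependent weights. A generic sketch of multi-pathway
    fusion, not STRNet's actual fusion strategy."""

    def __init__(self, dim: int, num_branches: int = 3):
        super().__init__()
        self.gate = nn.Linear(num_branches * dim, num_branches)

    def forward(self, branches: list) -> torch.Tensor:
        # each branch: (batch, dim) pooled feature from one pathway
        stacked = torch.stack(branches, dim=1)                 # (B, K, D)
        weights = torch.softmax(self.gate(stacked.flatten(1)), dim=-1)
        return (stacked * weights.unsqueeze(-1)).sum(dim=1)    # (B, D)

if __name__ == "__main__":
    fuse = GatedBranchFusion(512)
    out = fuse([torch.randn(4, 512) for _ in range(3)])
    print(out.shape)  # torch.Size([4, 512])
```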
The use of hand gestures can be the most intuitive human-machine interaction medium. Early approaches to hand gesture recognition used device-based methods. These methods use mechanical or optical sensors attached to a glove or markers, which hinder natural human-machine communication. Vision-based methods, on the other hand, are less restrictive and allow for more spontaneous communication without the need for an intermediary between human and machine. Therefore, vision-based gesture recognition has been a popular area of research for the past thirty years. Hand gesture recognition finds application in many areas, particularly the automotive industry, where advanced automotive human-machine interface (HMI) designers are using gesture recognition to improve driver and vehicle safety. However, technology advances go beyond active/passive safety and into convenience and comfort. In this context, one of America's big three automakers has partnered with the Centre of Pattern Analysis and Machine Intelligence (CPAMI) at the University of Waterloo to investigate expanding their product segment through machine learning, providing increased driver convenience and comfort with the particular application of hand gesture recognition for autonomous car parking. The present paper leverages state-of-the-art deep learning and optimization techniques to develop a vision-based multiview dynamic hand gesture recognizer for a self-parking system. We propose a 3D-CNN gesture model architecture that we train on a publicly available hand gesture database. We apply transfer learning to fine-tune the pre-trained gesture model on custom-made data, which significantly improves the proposed system's performance in a real-world environment. We adapt the architecture of the end-to-end solution to expand the state-of-the-art video classifier from a single-image input (fed by a monocular camera) to a multiview 360° feed provided by a six-camera module. Finally, we optimize the proposed solution to run on a resource-limited embedded platform (Nvidia Jetson TX2) used by automakers for vehicle-based features, without sacrificing the accuracy, robustness, or real-time functionality of the system.
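The transfer-learning step, freezing a pre-trained video backbone and retraining a new gesture head, can be illustrated as follows; torchvision's r3d_18 is a stand-in for the paper's 3D-CNN, and the gesture count and freezing policy are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

def build_finetune_model(num_gestures: int, freeze_backbone: bool = True) -> nn.Module:
    """Illustrative transfer-learning setup: start from a video CNN pre-trained
    on a large dataset, optionally freeze the backbone, and replace the
    classification head for the target gesture set. r3d_18 is a stand-in for
    the paper's 3D-CNN; the gesture count is an assumption."""
    model = r3d_18(weights="KINETICS400_V1")    # downloads Kinetics-400 weights on first use
    if freeze_backbone:
        for p in model.parameters():
            p.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, num_gestures)  # new head stays trainable
    return model

if __name__ == "__main__":
    net = build_finetune_model(num_gestures=6)
    clip = torch.randn(1, 3, 16, 112, 112)      # (batch, channels, frames, H, W)
    print(net(clip).shape)                      # torch.Size([1, 6])
```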
The Norway lobster, Nephrops norvegicus, is one of the main commercial crustacean fisheries in Europe. The abundance of Nephrops norvegicus stocks is assessed by identifying and counting the burrows where they live in underwater videos collected by camera systems mounted on sledges. The Spanish Oceanographic Institute (IEO) and the Marine Institute Ireland (MI Ireland) conduct annual underwater television (UWTV) surveys to estimate the total abundance of Nephrops within a specified area, with a coefficient of variation (CV), or relative standard error, of less than 20%. Currently, the identification and counting of Nephrops burrows are carried out manually by marine experts, which is quite time-consuming. As a solution, we propose an automated system based on deep neural networks that detects and counts the Nephrops burrows in video footage with high precision. The proposed system introduces a deep-learning-based automated way to identify and classify Nephrops burrows. This research work uses the current state-of-the-art Faster R-CNN models, Inception-v2 and MobileNet-v2, for object detection and classification. We conduct experiments on two data sets, the Smalls Nephrops survey (FU 22) and the Cadiz Nephrops survey (FU 30), collected by the Marine Institute Ireland and the Spanish Oceanographic Institute, respectively. From the results, we observe that the Inception model achieved higher precision and recall than the MobileNet model. The best mean Average Precision (mAP) recorded by the Inception model is 81.61%, compared to the best mAP of 75.12% achieved by MobileNet.
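A detection-and-count loop of the kind described can be sketched with torchvision's Faster R-CNN as a stand-in for the Inception-v2 and MobileNet-v2 detectors used in the paper; the score threshold is an assumption, and a real survey pipeline would also need cross-frame tracking to avoid double counting.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

def count_burrows(frames, score_thresh: float = 0.7) -> int:
    """Illustrative detection-and-count loop. torchvision's Faster R-CNN
    (ResNet50-FPN) stands in for the detectors used in the paper; the score
    threshold is an assumption. Counting per-frame detections ignores
    cross-frame tracking, which a survey pipeline needs to avoid double counts."""
    model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
    total = 0
    with torch.no_grad():
        for frame in frames:                       # each frame: (3, H, W) in [0, 1]
            out = model([frame])[0]
            total += int((out["scores"] > score_thresh).sum())
    return total

if __name__ == "__main__":
    print(count_burrows([torch.rand(3, 480, 640)]))
```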
While the internet has a lot of positive impact on society, there are negative components. Accessible to everyone through online platforms, pornography is inducing psychological and health-related issues among people of all ages. While a difficult task, detecting pornography is an important step in identifying pornographic and adult content in a video. In this paper, an architecture is proposed that yielded high scores for both training and testing. The dataset was produced from 190 videos, amounting to more than 19 h of footage. The main sources of the content were YouTube, movies, torrents, and websites that host both pornographic and non-pornographic content. The videos featured people of different ethnicities and skin colors, which helps ensure the models can handle any kind of video. VGG16, Inception V3, and ResNet 50 models were initially trained to detect pornographic images but failed to achieve high testing accuracy, reaching accuracies of 0.49, 0.49, and 0.78, respectively. Finally, utilizing transfer learning, a convolutional neural network was designed that yielded an accuracy of 0.98.
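The transfer-learning recipe of an ImageNet pre-trained backbone with a new binary head can be sketched as follows; the ResNet-50 backbone and frozen-feature setup are assumptions, not the authors' final architecture.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_binary_classifier(freeze_backbone: bool = True) -> nn.Module:
    """Illustrative transfer-learning classifier for a two-class
    (pornographic vs. non-pornographic frame) problem: an ImageNet pre-trained
    ResNet-50 with a new binary head. This mirrors the general transfer-learning
    recipe in the abstract, not the authors' exact final architecture."""
    model = models.resnet50(weights="IMAGENET1K_V2")   # downloads ImageNet weights on first use
    if freeze_backbone:
        for p in model.parameters():
            p.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, 2)      # trainable binary head
    return model

if __name__ == "__main__":
    net = build_binary_classifier()
    print(net(torch.randn(1, 3, 224, 224)).shape)      # torch.Size([1, 2])
```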