In response to the problem of traditional methods ignoring audio modality tampering, this study aims to explore an effective deep forgery video detection technique that improves detection precision and reliability by fusing lip images and audio signals. The main method used is lip-audio matching detection technology based on the Siamese neural network, combined with MFCC (Mel-Frequency Cepstral Coefficient) feature extraction using band-pass filters, an improved dual-branch Siamese network structure, and a two-stream network structure design. First, the video stream is preprocessed to extract lip images, and the audio stream is preprocessed to extract MFCC features. Then, these features are processed separately through the two branches of the Siamese network. Finally, the model is trained and optimized through fully connected layers and loss functions. The experimental results show that the testing accuracy of the model on the LRW (Lip Reading in the Wild) dataset reaches 92.3%, the recall rate is 94.3%, and the F1 score is 93.3%, significantly better than the results of CNN (Convolutional Neural Network) and LSTM (Long Short-Term Memory) models. In the validation of multi-resolution image streams, the highest accuracy of dual-resolution image streams reaches 94%. Band-pass filters can effectively improve the signal-to-noise ratio of deep forgery video detection when processing different types of audio signals. The real-time processing performance of the model is also excellent, and it achieves an average score of up to 5 in user research. These data demonstrate that the method proposed in this study can effectively fuse visual and audio information in deep forgery video detection, accurately identify inconsistencies between video and audio, and thus verify the effectiveness of lip-audio modality fusion technology in improving detection performance.
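The abstract gives no implementation details; the following PyTorch sketch only illustrates the dual-branch Siamese idea it describes, with a lip-image branch and an MFCC branch compared through a contrastive loss. All layer sizes, input shapes, and the choice of loss are assumptions, not the paper's actual design.

```python
# Hypothetical sketch of a dual-branch (Siamese-style) lip-audio matching model.
import torch
import torch.nn as nn

class LipAudioSiamese(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        # Visual branch: encodes a grayscale lip crop (e.g., 64x64).
        self.lip_branch = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(), nn.Linear(64, embed_dim),
        )
        # Audio branch: encodes an MFCC matrix (n_mfcc x frames) treated as a 1-channel image.
        self.audio_branch = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(), nn.Linear(32, embed_dim),
        )

    def forward(self, lip, mfcc):
        # Returns L2-normalised embeddings whose distance measures lip-audio consistency.
        v = nn.functional.normalize(self.lip_branch(lip), dim=1)
        a = nn.functional.normalize(self.audio_branch(mfcc), dim=1)
        return v, a

def contrastive_loss(v, a, label, margin=1.0):
    # label = 1 for matched (real) lip-audio pairs, 0 for mismatched (forged) pairs.
    d = torch.norm(v - a, dim=1)
    return (label * d.pow(2) + (1 - label) * torch.clamp(margin - d, min=0).pow(2)).mean()

model = LipAudioSiamese()
v, a = model(torch.randn(4, 1, 64, 64), torch.randn(4, 1, 13, 40))
loss = contrastive_loss(v, a, torch.tensor([1.0, 0.0, 1.0, 0.0]))
```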
Video camouflaged object detection (VCOD) has become a fundamental task in computer vision that has attracted significant attention in recent years. Unlike image camouflaged object detection (ICOD), VCOD requires not only spatial cues but also motion cues. Thus, effectively utilizing spatiotemporal information is crucial for generating accurate segmentation results. Current VCOD methods, which typically focus on exploring motion representation, often integrate spatial and motion features ineffectively, leading to poor performance in diverse scenarios. To address these issues, we design a novel spatiotemporal network with an encoder-decoder structure. During the encoding stage, an adjacent space-time memory module (ASTM) is employed to extract high-level temporal features (i.e., motion cues) from the current frame and its adjacent frames. In the decoding stage, a selective space-time aggregation module is introduced to efficiently integrate spatial and temporal features. Additionally, a multi-feature fusion module is developed to progressively refine the rough prediction by utilizing the information provided by multiple types of features. Furthermore, we incorporate multi-task learning into the proposed network to obtain more accurate predictions. Experimental results show that the proposed method outperforms existing cutting-edge baselines on VCOD benchmarks.
In recent years, with the rapid development of deepfake technology, a large number of deepfake videos have emerged on the Internet, posing a huge threat to national politics, social stability, and personal privacy. Although many existing deepfake detection methods exhibit excellent performance for known manipulations, their detection capabilities are not strong when faced with unknown manipulations. Therefore, in order to obtain better generalization ability, this paper analyzes global and local inter-frame dynamic inconsistencies from the perspective of the spatial and frequency domains, and proposes a Local region Frequency Guided Dynamic Inconsistency Network (LFGDIN). The network includes two parts: a Global SpatioTemporal Network (GSTN) and a Local Region Frequency Guided Module (LRFGM). The GSTN is responsible for capturing the dynamic information of the entire face, while the LRFGM focuses on extracting the frequency dynamic information of the eyes and mouth. The LRFGM guides the GSTN to concentrate on dynamic inconsistency in significant local regions through local region alignment, so as to improve the model's detection performance. Experiments on three public datasets (FF++, DFDC, and Celeb-DF) show that, compared with many recent advanced methods, the proposed method achieves better detection results when detecting deepfake videos of unknown manipulation types.
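As an illustration of the kind of local-region frequency cue the LRFGM is described as using, the sketch below computes a per-frame 2-D DCT of a mouth crop and takes frame-to-frame spectral differences as a dynamic-inconsistency curve. The region coordinates and the exact feature are assumptions, not the LFGDIN implementation.

```python
# Minimal sketch of a local-region frequency cue for deepfake detection.
import numpy as np
from scipy.fft import dctn

def mouth_frequency_dynamics(frames, box):
    # frames: (T, H, W) grayscale face frames; box: (y0, y1, x0, x1) mouth region.
    y0, y1, x0, x1 = box
    crops = frames[:, y0:y1, x0:x1].astype(np.float32)
    spectra = np.stack([dctn(c, norm="ortho") for c in crops])   # (T, h, w) DCT coefficients
    # Temporal differences of the spectra highlight frequency-domain flicker typical of forgeries.
    return np.abs(np.diff(spectra, axis=0)).mean(axis=(1, 2))    # (T-1,) inconsistency curve

frames = np.random.rand(16, 128, 128)
curve = mouth_frequency_dynamics(frames, box=(80, 120, 34, 94))
print(curve.shape)  # (15,)
```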
What causes object detection in video to be less accurate than it is in still images? Some video frames are degraded in appearance by fast movement, out-of-focus camera shots, and changes in posture. These factors have made video object detection (VID) a growing area of research in recent years. Video object detection can be used for various healthcare applications, such as detecting and tracking tumors in medical imaging, monitoring the movement of patients in hospitals and long-term care facilities, and analyzing videos of surgeries to improve technique and training. Additionally, it can be used in telemedicine to help diagnose and monitor patients remotely. Existing VID techniques are based on recurrent neural networks or optical flow for feature aggregation to produce reliable features that can be used for detection. Some of those methods aggregate features at the full-sequence level or from nearby frames. To create feature maps, existing VID techniques frequently use Convolutional Neural Networks (CNNs) as the backbone network. On the other hand, Vision Transformers have outperformed CNNs in various vision tasks, including object detection in still images and image classification. In this research, we propose to use the Swin Transformer, a state-of-the-art Vision Transformer, as an alternative to CNN-based backbone networks for object detection in videos. The proposed architecture enhances the accuracy of existing VID methods. The ImageNet VID and EPIC KITCHENS datasets are used to evaluate the suggested methodology. We have demonstrated that our proposed method is efficient by achieving 84.3% mean average precision (mAP) on ImageNet VID while using less memory than other leading VID techniques. The source code is available at https://github.com/amaharek/SwinVid.
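A minimal sketch of the backbone swap described above, using torchvision's Swin-T as a feature extractor in place of a CNN. This is not the SwinVid code (which is linked above); the model variant and the node name used for feature extraction are assumptions.

```python
# Sketch: Swin Transformer as a drop-in backbone for a video object detector.
import torch
from torchvision.models import swin_t
from torchvision.models.feature_extraction import create_feature_extractor

# swin_t(weights="DEFAULT") would load ImageNet-pretrained weights; None keeps the example offline.
backbone = swin_t(weights=None)
# "features" is the (truncated) node name of the final stage in torchvision's Swin implementation.
extractor = create_feature_extractor(backbone, return_nodes=["features"])

frames = torch.randn(2, 3, 224, 224)        # two video frames
feats = extractor(frames)["features"]       # (B, H/32, W/32, C), channel-last layout
feats = feats.permute(0, 3, 1, 2)           # (B, C, H/32, W/32) for a downstream detection head
print(feats.shape)                          # torch.Size([2, 768, 7, 7])
```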
The motivation for this study is that the quality of deep fakes is constantly improving, which leads to the need to develop new methods for their detection. The proposed Customized Convolutional Neural Network method involves extracting structured data from video frames using facial landmark detection, which is then used as input to the CNN. The Customized Convolutional Neural Network method is a data-augmentation-based CNN model used to generate 'fake data' or 'fake images'. This study was carried out using Python and its libraries. We used 242 films from the dataset gathered by the Deep Fake Detection Challenge, of which 199 were made up and the remaining 53 were real. Ten seconds were allotted for each video. There were 318 videos used in all, 199 of which were fake and 119 of which were real. Our proposed method achieved a testing accuracy of 91.47%, a loss of 0.342, and an AUC score of 0.92, outperforming two alternative approaches, CNN and MLP-CNN. Furthermore, our method achieved greater accuracy than contemporary models such as XceptionNet, Meso-4, EfficientNet-B0, MesoInception-4, VGG-16, and DST-Net. The novelty of this investigation is the development of a new Convolutional Neural Network (CNN) learning model that can accurately detect deepfake face photos.
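A hedged sketch of the landmark-to-CNN pipeline described above: per-frame facial landmarks are stacked into a structured array and fed to a small CNN. The landmark extractor is left as a placeholder (dlib or MediaPipe would be typical choices), and every layer size is an assumption rather than the paper's configuration.

```python
# Sketch: structured facial-landmark data from video frames as CNN input.
import torch
import torch.nn as nn

def extract_landmarks(frame):
    # Placeholder: in practice, return 68 (x, y) facial landmarks from dlib/MediaPipe.
    return torch.rand(68, 2)

class LandmarkCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Input: (B, 1, frames, 68*2) "image" of stacked per-frame landmark vectors.
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(), nn.Linear(32, 1), nn.Sigmoid(),   # P(fake)
        )

    def forward(self, x):
        return self.net(x)

clip = torch.stack([extract_landmarks(None).flatten() for _ in range(10)])  # (10, 136)
score = LandmarkCNN()(clip.unsqueeze(0).unsqueeze(0))   # add batch and channel dimensions
print(float(score))
```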
Background: Video anomaly detection has always been a hot topic and has attracted increasing attention. Many of the existing methods for video anomaly detection depend on processing the entire video rather than considering only the significant context. Method: This paper proposes a novel video anomaly detection method called COVAD that mainly focuses on the region of interest in the video instead of the entire video. Our proposed COVAD method is based on an autoencoded convolutional neural network and a coordinate attention mechanism, which can effectively capture meaningful objects in the video and the dependencies among different objects. Relying on an existing memory-guided video frame prediction network, our algorithm can predict the future motion and appearance of objects in a video more effectively. Result: The proposed algorithm obtained better experimental results on multiple datasets and outperformed the baseline models considered in our analysis. Simultaneously, we provide an improved visual test that can provide pixel-level anomaly explanations.
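The coordinate attention mechanism mentioned above follows a published design (Hou et al., CVPR 2021); the block below is a compact PyTorch rendering of that idea, not the COVAD source code, and the reduction ratio is an assumption.

```python
# Compact coordinate-attention block: direction-aware pooling plus position-aware reweighting.
import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Sequential(nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU())
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        # Direction-aware pooling: aggregate along width and along height separately.
        x_h = x.mean(dim=3, keepdim=True)                      # (B, C, H, 1)
        x_w = x.mean(dim=2, keepdim=True).transpose(2, 3)      # (B, C, W, 1)
        y = self.conv1(torch.cat([x_h, x_w], dim=2))           # shared transform
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                  # (B, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.transpose(2, 3)))  # (B, C, 1, W)
        return x * a_h * a_w                                   # position-aware reweighting

x = torch.randn(2, 64, 32, 32)
print(CoordAttention(64)(x).shape)
```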
Video processing is one challenge in collecting vehicle trajectories from unmanned aerial vehicles (UAVs), and road boundary estimation is one way to improve video processing algorithms. However, current methods do not work well for low-volume roads, which are not well marked and contain noise such as vehicle tracks. A fusion-based method termed Dempster-Shafer-based road detection (DSRD) is proposed to address this issue. This method detects road boundaries by combining multiple information sources using Dempster-Shafer theory (DST). In order to test the performance of the proposed method, two field experiments were conducted, one on a highway partially covered by snow and another on a dense-traffic highway. The results show that DSRD is robust and accurate, with detection rates of 100% and 99.8% compared with manual detection results. DSRD is then adopted to improve the UAV video processing algorithm, and the vehicle detection and tracking rates are improved by 2.7% and 5.5%, respectively. The computation time also decreased by 5% and 8.3% for the two experiments, respectively.
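Dempster-Shafer evidence combination, on which DSRD relies, can be stated in a few lines. The sketch below fuses two hypothetical evidence sources over the frame of discernment {road, non-road}; the mass values are illustrative only.

```python
# Minimal sketch of Dempster's rule of combination for two evidence sources.
from itertools import product

def combine(m1, m2):
    # m1, m2: dicts mapping frozensets of hypotheses to basic probability masses.
    combined, conflict = {}, 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + ma * mb
        else:
            conflict += ma * mb
    # Normalise by the non-conflicting mass (Dempster's rule).
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

ROAD, NON = frozenset({"road"}), frozenset({"non-road"})
BOTH = ROAD | NON
# Source 1: colour cue; source 2: texture cue (values are illustrative only).
m_color = {ROAD: 0.6, NON: 0.1, BOTH: 0.3}
m_texture = {ROAD: 0.5, NON: 0.2, BOTH: 0.3}
print(combine(m_color, m_texture))   # fused belief masses for each hypothesis
```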
A number of automated video shot boundary detection methods for indexing a video sequence to facilitate browsing and retrieval have been proposed in recent years. Among these methods, the dissolve shot boundary is not accurately detected because it involves camera operation and object movement. In this paper, a method based on the support vector machine (SVM) is proposed to detect the dissolve shot boundary in MPEG compressed sequences. The problem of distinguishing the dissolve shot boundary from other boundaries is treated as two-class classification in our method. Features are extracted directly from the compressed sequences without decoding them, and the optimal class boundary between the two classes is learned from training data using the SVM. Experiments comparing various classification methods show that the proposed method improves the performance of video shot boundary detection.
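The two-class SVM formulation described above can be sketched with scikit-learn; the compressed-domain features and labels below are synthetic stand-ins for the paper's actual features.

```python
# Sketch: "dissolve" vs. "other" boundary classification with an SVM on synthetic features.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 12))                     # e.g., DC-coefficient and macroblock statistics
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)      # 1 = dissolve boundary, 0 = other boundary

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))
```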
Video salient object detection (VSOD) aims at locating the most attractive objects in a video by exploring spatial and temporal features. VSOD poses a challenging task in computer vision, as it involves processing complex spatial data that is also influenced by temporal dynamics. Despite the progress made by existing VSOD models, they still struggle in scenes with great background diversity within and between frames. Additionally, they encounter difficulties related to accumulated noise and high time consumption during the extraction of temporal features over a long-term duration. We propose a multi-stream temporal enhanced network (MSTENet) to address these problems. It investigates saliency cue collaboration in the spatial domain with a multi-stream structure to deal with the challenge of great background diversity. A straightforward yet efficient approach for temporal feature extraction is developed to avoid accumulated noise and reduce time consumption. The distinction between MSTENet and other VSOD methods stems from its incorporation of both foreground supervision and background supervision, facilitating enhanced extraction of collaborative saliency cues. Another notable differentiation is the innovative integration of spatial and temporal features, wherein the temporal module is integrated into the multi-stream structure, enabling comprehensive spatial-temporal interactions within an end-to-end framework. Extensive experimental results demonstrate that the proposed method achieves state-of-the-art performance on five benchmark datasets while maintaining a real-time speed of 27 fps (Titan XP). Our code and models are available at https://github.com/RuJiaLe/MSTENet.
In this paper, a video fire detection method is proposed, which demonstrates good performance in indoor environments. Three main novel ideas are introduced. First, a flame color model in the RGB and HSI color spaces is used to extract pre-detected regions instead of the traditional motion differential method, as it is more suitable for fire detection in indoor environments. Second, according to the flicker characteristic of the flame, similarity and two main values of centroid motion are proposed. At the same time, a simple but effective method for tracking the same regions in consecutive frames is established. Third, a multi-expert system consisting of color component dispersion, similarity, and centroid motion is established to identify flames. The proposed method has been tested on a very large dataset of fire videos acquired both in real indoor environment tests and from the Internet. The experimental results show that the proposed approach achieves a balance between the false positive rate and the false negative rate, and demonstrates better performance in terms of overall accuracy and F-score with respect to other similar fire detection methods for indoor environments.
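A sketch of the first stage described above: a heuristic flame-color rule over the RGB channels and the saturation component yields candidate regions. The thresholds are common heuristic values, not the paper's exact parameters.

```python
# Sketch: flame-colour pre-detection mask from RGB channels plus saturation.
import numpy as np
import cv2

def flame_candidate_mask(bgr, r_thresh=150, s_thresh=60):
    b, g, r = [bgr[:, :, i].astype(np.int32) for i in range(3)]
    s = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)[:, :, 1].astype(np.int32)
    # Heuristic flame rule: red-dominant (R >= G >= B), bright red channel, adequate saturation.
    mask = (r > r_thresh) & (r >= g) & (g >= b) & (s > s_thresh)
    return (mask * 255).astype(np.uint8)

frame = np.zeros((240, 320, 3), np.uint8)     # stand-in for a decoded video frame
frame[100:140, 150:200] = (40, 120, 230)      # a synthetic warm-coloured patch (BGR)
print(int(flame_candidate_mask(frame).sum() / 255), "candidate pixels")
```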
A real-time pedestrian detection and tracking system using a single video camera was developed to monitor pedestrians. This system contained six modules: video flow capture, pre-processing, movement detection, shadow removal, tracking, and object classification. A Gaussian mixture model was utilized to extract the moving object from an image sequence segmented by the mean-shift technique in the pre-processing module. Shadow removal was used to alleviate the negative impact of shadows on the detected objects. A model-free method was adopted to identify pedestrians. Maximum and minimum integration methods were developed to integrate multiple cues into the mean-shift algorithm and the initial tracking iteration with a competent integrated probability distribution map for object tracking. A simple but effective algorithm was proposed to handle full occlusion cases. The system was tested using real traffic videos from different sites. The test results confirm that the system is reliable and has an overall accuracy of over 85%.
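A minimal sketch of the movement-detection stage: a Gaussian mixture background model extracts moving foreground blobs, which later modules would track and classify. The OpenCV parameters and blob-size threshold are assumptions.

```python
# Sketch: Gaussian-mixture background subtraction for movement detection.
import cv2
import numpy as np

subtractor = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=25, detectShadows=True)

def moving_objects(frame):
    fg = subtractor.apply(frame)
    _, fg = cv2.threshold(fg, 200, 255, cv2.THRESH_BINARY)           # drop shadow pixels (value 127)
    fg = cv2.morphologyEx(fg, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))
    contours, _ = cv2.findContours(fg, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 200]  # pedestrian-sized blobs

boxes = []
for _ in range(10):                                                   # stand-in for a video-capture loop
    frame = np.random.randint(0, 255, (240, 320, 3), np.uint8)
    boxes = moving_objects(frame)
print(len(boxes), "moving regions in the last frame")
```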
Agglomerate fog events pose a more serious threat to expressway traffic safety than normal foggy weather, due to their localized nature and sudden, uneven formation. However, vision-based fog detection methods typically estimate visibility for individual images and ignore the differences in the characteristics of even and uneven fog, lacking the use of temporal information to differentiate between normal foggy weather and agglomerate fog events. Meanwhile, fog detection at night faces strong interference from car lights, which is usually overlooked. This study proposes a nighttime agglomerate fog event detection (AFED) method for videos that takes car light interference into account. A depth disparity feature is constructed based on the information entropy of the depth estimation result. In order to build a metric for uneven characteristics in the field of view, we introduce Moran's index to establish an unevenness feature, generating a two-dimensional feature time series for each video. By extracting interpretable features from the two-dimensional feature time series after removing car-light-interference frames, a classification model based on extreme gradient boosting (XGBoost) is built to differentiate agglomerate fog, normal fog, and no-fog videos. Experiments are carried out using real monitoring data from roadside surveillance cameras to validate the effectiveness of the features and the model. Furthermore, a fog event detection dataset containing over 1,500 videos is established, making up for the data scarcity in vision-based agglomerate fog event detection and providing support for future research.
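The two video-level cues described above can be illustrated compactly: Moran's index measures spatial unevenness over a grid of per-patch values, and simple statistics of the resulting time series feed an XGBoost classifier. The patch grid, statistics, and data below are synthetic placeholders, not the paper's feature set.

```python
# Sketch: Moran's I as an unevenness feature, plus XGBoost on per-video statistics.
import numpy as np
import xgboost as xgb

def morans_i(grid):
    # grid: 2-D array of per-patch values (e.g., estimated visibility); rook-adjacency weights.
    x = grid - grid.mean()
    num, wsum = 0.0, 0.0
    h, w = grid.shape
    for di, dj in [(0, 1), (1, 0)]:                      # right and down neighbours
        pairs = x[:h - di, :w - dj] * x[di:, dj:]
        num += 2 * pairs.sum()                           # count each neighbour pair both ways
        wsum += 2 * pairs.size
    return (grid.size / wsum) * num / (x ** 2).sum()

def video_features(n_frames=100, uneven=False):
    # Mean/std of the per-frame Moran's I series (a depth-entropy series would be added similarly).
    gradient = np.outer(np.linspace(0, 1, 6), np.ones(8))
    series = [morans_i(np.random.rand(6, 8) + (gradient if uneven else 0)) for _ in range(n_frames)]
    return [np.mean(series), np.std(series)]

X = np.array([video_features(uneven=i % 2 == 0) for i in range(60)])
y = np.array([i % 2 == 0 for i in range(60)], dtype=int)   # 1 = agglomerate (uneven) fog
clf = xgb.XGBClassifier(n_estimators=50, max_depth=3).fit(X, y)
print(clf.predict(X[:4]))
```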
While the internet has a lot of positive impact on society, there are negative components. Pornography, accessible to everyone through online platforms, induces psychological and health-related issues among people of all ages. While it is a difficult task, detecting pornography is an important step in determining porn and adult content in a video. In this paper, an architecture is proposed which yielded high scores for both training and testing. The dataset was produced from 190 videos, yielding more than 19 h of footage. The main sources of the content were YouTube, movies, torrents, and websites that host both pornographic and non-pornographic content. The videos covered different ethnicities and skin colors, which ensures the models can handle any kind of video. VGG16, Inception V3, and ResNet-50 models were initially trained to detect these pornographic images but failed to achieve a high testing accuracy, with accuracies of 0.49, 0.49, and 0.78, respectively. Finally, utilizing transfer learning, a convolutional neural network was designed that yielded an accuracy of 0.98.
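A transfer-learning sketch along the lines described above, using a frozen pretrained VGG16 backbone with a new binary head; the framework choice (PyTorch), hyperparameters, and head design are assumptions rather than the paper's setup.

```python
# Sketch: transfer learning for binary adult-content classification of frame crops.
import torch
import torch.nn as nn
from torchvision.models import vgg16

model = vgg16(weights=None)                 # pass weights="DEFAULT" to start from ImageNet weights
for p in model.features.parameters():
    p.requires_grad = False                 # freeze the convolutional feature extractor
model.classifier[6] = nn.Linear(4096, 1)    # replace the 1000-way head with a single logit

criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.classifier[6].parameters(), lr=1e-4)

# One illustrative step on a dummy batch; real training would iterate over sampled frame crops.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 2, (8, 1)).float()   # 1 = adult content, 0 = safe content
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(float(loss))
```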
Content-based video copy detection is an active research field due to the need for copyright protection and business intellectual property protection. This paper presents a probabilistic spatiotemporal fusion approach for video copy detection. The approach directly estimates the location of the copy segment with a probabilistic graphical model. The spatial and temporal consistency of the video copy is embedded in the local probability function. An effective local descriptor and a two-level descriptor pairing method are used to build a video copy detection system to evaluate the approach. Tests show that it outperforms the popular voting algorithm and the probabilistic fusion framework based on the Hidden Markov Model, improving the F-score (F1) by 8%.
This paper tackles the problem of video concept detection using a multi-modality fusion method. Motivated by multi-view learning algorithms, the multi-modality features of videos can be represented by multiple graphs, and graph-based semi-supervised learning methods can be extended to multiple graphs to predict the semantic labels of unlabeled video data. However, traditional graphs represent only homogeneous pairwise linking relations, and therefore the high-order correlations inherent in videos, such as high-order visual similarities, are ignored. In this paper, we represent heterogeneous features by multiple hypergraphs so that high-order correlated samples can be associated with hyperedges. Furthermore, the multi-hypergraph ranking (MHR) algorithm is proposed by defining a Markov random walk on each hypergraph and then forming mixture Markov chains so as to perform transductive learning over multiple hypergraphs. In experiments on the TRECVID dataset, a triple-hypergraph consisting of visual features, textual features, and multiple labeled tags is constructed to predict concept labels for unlabeled video shots within the MHR framework. Experimental results show that our approach is effective.
Recently, a new research trend in the video salient object detection (VSOD) research community has focused on enhancing detection results via model self-fine-tuning using sparsely mined high-quality keyframes from the given sequence. Although such a learning scheme is generally effective, it has a critical limitation: a model learned on sparse frames possesses only weak generalization ability. This situation can become worse on "long" videos, since they tend to have intensive scene variations. Moreover, in such videos, keyframe information from a longer time span is less relevant to the previous frames, which can also cause learning conflict and deteriorate model performance. Thus, the learning scheme is usually incapable of handling complex pattern modeling. To solve this problem, we propose a divide-and-conquer framework, which can convert a complex problem domain into multiple simple ones. First, we devise a novel background consistency analysis (BCA) which effectively divides the mined frames into disjoint groups. Then, for each group, we assign an individual deep model to capture its key attribute during the fine-tuning phase. During the testing phase, we design a model-matching strategy, which dynamically selects the best-matched model from the fine-tuned ones to handle a given testing frame. Comprehensive experiments show that our method can adapt to severe background appearance variation coupled with object movement and obtain robust saliency detection compared with the previous scheme and state-of-the-art methods.
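A loose illustration of the divide-and-conquer routing: mined keyframes are grouped by a simple background descriptor (a colour histogram, standing in for the paper's background consistency analysis), one model is associated with each group, and a test frame is routed to the closest group. Everything below is a hypothetical stand-in, not the authors' BCA.

```python
# Sketch: group keyframes by background appearance and route a test frame to a group model.
import numpy as np
from sklearn.cluster import KMeans

def background_descriptor(frame):
    # frame: (H, W, 3) uint8; a coarse colour histogram as a hypothetical background signature.
    hist, _ = np.histogramdd(frame.reshape(-1, 3), bins=(4, 4, 4), range=[(0, 256)] * 3)
    return (hist / hist.sum()).ravel()

keyframes = [np.random.randint(0, 255, (64, 64, 3), np.uint8) for _ in range(30)]   # mined keyframes
descriptors = np.stack([background_descriptor(f) for f in keyframes])
grouper = KMeans(n_clusters=3, n_init=10, random_state=0).fit(descriptors)
# group_models[g] stands in for a saliency model fine-tuned only on the keyframes of group g.
group_models = {g: f"model_for_group_{g}" for g in range(3)}

test_frame = np.random.randint(0, 255, (64, 64, 3), np.uint8)
g = int(grouper.predict(background_descriptor(test_frame)[None])[0])
print("route test frame to", group_models[g])
```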
Video understanding and content boundary detection are vital stages in video recommendation. However, previous content boundary detection methods require collecting information including location, cast, action, and audio, and if any of these elements is missing, the results may be adversely affected. To address this issue and effectively detect transitions in video content, in this paper we introduce a video classification and boundary detection method named JudPriNet. The focus of this paper is on the objects in videos along with their labels, enabling automatic scene detection in video clips and establishing semantic connections among local objects in the images. As a significant contribution, JudPriNet presents a framework that maps labels to a "Continuous Bag of Visual Words Model" to cluster labels and generate new standardized labels as video-type tags, which facilitates automatic classification of video clips. Furthermore, JudPriNet employs a Monte Carlo sampling method to classify video clips, treating the features of video clips as elements within the framework. The proposed method seamlessly integrates video and textual components without compromising training and inference speed. Through experimentation, we have demonstrated that JudPriNet, with its semantic connections, is able to effectively classify videos alongside textual content. Our results indicate that, compared with several other detection approaches, JudPriNet excels in high-level content detection without disrupting the integrity of the video content, outperforming existing methods.
Content-based video copy detection has become an active research field due to requirements such as copyright protection, business intelligence, and video retrieval. Although existing methods assume that the reference database consists of original videos, such videos are difficult to obtain in many practical cases. In this paper, a copy detection method based on sparse representation is proposed to make use of imperfect prototypes of original videos maintained in the reference database. A query video is represented as a linear combination of all the videos in the database. We can then determine whether the query has sibling videos in the database based on the distribution of the coefficients, and find them based on the reconstruction error. The experiments show that, even with a very low-dimensional feature, this method can achieve high performance.
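The sparse-representation decision rule described above can be sketched with orthogonal matching pursuit: a query feature is coded over a dictionary whose columns are reference-video features, and a concentrated coefficient vector with low reconstruction error suggests a copy. Feature dimensionality, sparsity level, and the threshold are assumptions.

```python
# Sketch: sparse coding of a query video over a reference-video dictionary.
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(1)
D = rng.normal(size=(64, 200))                        # 200 reference videos, 64-D features each
D /= np.linalg.norm(D, axis=0)
query = 0.9 * D[:, 17] + 0.05 * rng.normal(size=64)   # a near-duplicate of reference video #17

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=5, fit_intercept=False).fit(D, query)
coef = omp.coef_
residual = np.linalg.norm(query - D @ coef)
top = int(np.argmax(np.abs(coef)))
# Energy concentrated on a few references plus a small residual suggests a sibling video exists.
if residual < 0.2 * np.linalg.norm(query):
    print("probable copy of reference video", top)
else:
    print("no sibling video found")
```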
Object detection is one of the hottest research directions in computer vision; it has already made impressive progress in academia and has many valuable applications in industry. However, mainstream detection methods still have two shortcomings: (1) even a model that is well trained using large amounts of data generally cannot be used across different kinds of scenes; (2) once a model is deployed, it cannot autonomously evolve along with the accumulated unlabeled scene data. To address these problems, and inspired by visual knowledge theory, we propose a novel scene-adaptive evolution unsupervised video object detection algorithm that can decrease the impact of scene changes through the concept of object groups. First, we extract a large number of object proposals from unlabeled data through a pre-trained detection model. Second, we build a visual knowledge dictionary of object concepts by clustering the proposals, in which each cluster center represents an object prototype. Third, we look into the relations between different clusters and the object information of different groups, and propose a graph-based group information propagation strategy to determine the category of an object concept, which can effectively distinguish positive and negative proposals. With these pseudo labels, we can easily fine-tune the pre-trained model. The effectiveness of the proposed method is verified through different experiments, and significant improvements are achieved.
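As a rough stand-in for the dictionary-building and label-propagation steps described above, the sketch below clusters proposal features into prototypes and spreads a few seeded categories over the prototype graph with scikit-learn's LabelPropagation; the paper's actual group-information propagation strategy differs, and all seeds and dimensions are illustrative.

```python
# Sketch: cluster object proposals into prototypes, then propagate sparse category seeds.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.semi_supervised import LabelPropagation

rng = np.random.default_rng(0)
proposals = rng.normal(size=(500, 32))                 # features of mined object proposals
prototypes = KMeans(n_clusters=20, n_init=10, random_state=0).fit(proposals).cluster_centers_

# Seed a few prototypes with known categories (-1 = unknown), then propagate over the
# prototype similarity graph to separate positive object concepts from negative ones.
seed_labels = np.full(20, -1)
seed_labels[:2] = 1                                    # object of interest
seed_labels[2:4] = 0                                   # background / negative
propagated = LabelPropagation(kernel="rbf", gamma=0.5).fit(prototypes, seed_labels)
print(propagated.transduction_)                        # pseudo category per prototype
```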
Previous video object segmentation approaches mainly focus on simplex solutions linking appearance and motion, limiting effective feature collaboration between these two cues. In this work, we study a novel and efficient full-duplex strategy network (FSNet) to address this issue, by considering a better mutual restraint scheme linking motion and appearance that allows the exploitation of cross-modal features in the fusion and decoding stages. Specifically, we introduce a relational cross-attention module (RCAM) to achieve bidirectional message propagation across embedding sub-spaces. To improve the model's robustness and update inconsistent features from the spatiotemporal embeddings, we adopt a bidirectional purification module after the RCAM. Extensive experiments on five popular benchmarks show that our FSNet is robust to various challenging scenarios (e.g., motion blur and occlusion) and compares well with leading methods for both video object segmentation and video salient object detection. The project is publicly available at https://github.com/GewelsJI/FSNet.
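A sketch of bidirectional cross-attention between appearance and motion streams, in the spirit of the relational cross-attention module; the actual RCAM design differs, and the token dimensions below are assumptions.

```python
# Sketch: full-duplex (bidirectional) cross-attention between appearance and motion tokens.
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.app_from_motion = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.motion_from_app = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, appearance, motion):
        # Each stream queries the other, so messages propagate in both directions.
        app2, _ = self.app_from_motion(appearance, motion, motion)
        mot2, _ = self.motion_from_app(motion, appearance, appearance)
        return appearance + app2, motion + mot2

app = torch.randn(2, 196, 256)    # appearance tokens (e.g., a 14x14 RGB feature map)
mot = torch.randn(2, 196, 256)    # motion tokens (e.g., a 14x14 optical-flow feature map)
a, m = BidirectionalCrossAttention()(app, mot)
print(a.shape, m.shape)
```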