Image captioning, the task of generating descriptive sentences for images, has advanced significantly with the integration of semantic information. However, traditional models still rely on static visual features that do not evolve with the changing linguistic context, which can hinder the ability to form meaningful connections between the image and the generated captions. This limitation often leads to captions that are less accurate or descriptive. In this paper, we propose a novel approach to enhance image captioning by introducing dynamic interactions where visual features continuously adapt to the evolving linguistic context. Our model strengthens the alignment between visual and linguistic elements, resulting in more coherent and contextually appropriate captions. Specifically, we introduce two innovative modules: the Visual Weighting Module (VWM) and the Enhanced Features Attention Module (EFAM). The VWM adjusts visual features using partial attention, enabling dynamic reweighting of the visual inputs, while the EFAM further refines these features to improve their relevance to the generated caption. By continuously adjusting visual features in response to the linguistic context, our model bridges the gap between static visual features and dynamic language generation. We demonstrate the effectiveness of our approach through experiments on the MS-COCO dataset, where our method outperforms state-of-the-art techniques in terms of caption quality and contextual relevance. Our results show that dynamic visual-linguistic alignment significantly enhances image captioning performance.
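The abstract does not give the module definitions, so the following is a minimal sketch of the dynamic-reweighting idea behind the VWM: region features are gated by the current decoder hidden state at every generation step. The class name, gating form, and dimensions are illustrative assumptions, not the authors' published code.

```python
# A minimal sketch, assuming the VWM gates region features with the decoder state.
import torch
import torch.nn as nn

class VisualWeightingModule(nn.Module):
    """Reweight region features with a gate conditioned on the decoder state."""
    def __init__(self, feat_dim: int, hid_dim: int):
        super().__init__()
        self.gate = nn.Linear(feat_dim + hid_dim, 1)

    def forward(self, feats: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # feats: (batch, regions, feat_dim); h: (batch, hid_dim) decoder hidden state
        h_exp = h.unsqueeze(1).expand(-1, feats.size(1), -1)
        g = torch.sigmoid(self.gate(torch.cat([feats, h_exp], dim=-1)))
        return feats * g  # visual features re-gated at every decoding step
```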
Image captioning has gained increasing attention in recent years. Visual characteristics found in input images play a crucial role in generating high-quality captions. Prior studies have used visual attention mechanisms to dynamically focus on localized regions of the input image, improving the effectiveness of identifying relevant image regions at each step of caption generation. However, providing image captioning models with the capability of selecting the most relevant visual features from the input image and attending to them can significantly improve the utilization of these features. Consequently, this leads to enhanced captioning network performance. In light of this, we present an image captioning framework that efficiently exploits the extracted representations of the image. Our framework comprises three key components: the Visual Feature Detector module (VFD), the Visual Feature Visual Attention module (VFVA), and the language model. The VFD module is responsible for detecting a subset of the most pertinent features from the local visual features, creating an updated visual features matrix. Subsequently, the VFVA directs its attention to the visual features matrix generated by the VFD, resulting in an updated context vector employed by the language model to generate an informative description. Integrating the VFD and VFVA modules introduces an additional layer of processing for the visual features, thereby contributing to enhancing the image captioning model's performance. Using the MS-COCO dataset, our experiments show that the proposed framework competes well with state-of-the-art methods, effectively leveraging visual representations to improve performance. The implementation code can be found here: https://github.com/althobhani/VFDICM (accessed on 30 July 2024).
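As a rough illustration of the VFD/VFVA pipeline described above, the sketch below selects the top-k most pertinent region features and then attends over them conditioned on the language model's hidden state. The scoring and attention forms are assumptions; the abstract does not specify them.

```python
# A hedged sketch of top-k feature selection (VFD-like) plus soft attention (VFVA-like).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualFeatureDetector(nn.Module):
    """Score local visual features and keep the top-k most relevant ones."""
    def __init__(self, feat_dim: int, k: int):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, 1)
        self.k = k

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, regions, feat_dim)
        scores = self.scorer(feats).squeeze(-1)             # (batch, regions)
        _, idx = scores.topk(self.k, dim=1)                 # top-k region indices
        idx = idx.unsqueeze(-1).expand(-1, -1, feats.size(-1))
        return feats.gather(1, idx)                         # (batch, k, feat_dim)

class VisualFeatureAttention(nn.Module):
    """Attend over the selected features given the language-model hidden state."""
    def __init__(self, feat_dim: int, hid_dim: int):
        super().__init__()
        self.att = nn.Linear(feat_dim + hid_dim, 1)

    def forward(self, feats: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # feats: (batch, k, feat_dim); h: (batch, hid_dim)
        h_exp = h.unsqueeze(1).expand(-1, feats.size(1), -1)
        w = F.softmax(self.att(torch.cat([feats, h_exp], -1)).squeeze(-1), dim=1)
        return (w.unsqueeze(-1) * feats).sum(1)             # updated context vector
```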
As a well-known urban landscape concept used to describe urban space quality, urban street vitality is a subjective human perception of the urban environment that is difficult to evaluate directly from the physical space. This study used a modern machine learning computer vision algorithm in the urban built environment to simulate the process that starts with visual perception of the urban street landscape and ends with the human reaction to street vitality. By analyzing the optimized trained model, we tried to identify the visual features of urban street vitality and evaluate their importance. A region around Mochou Lake in Nanjing, China, was set as our study area. Seven investigators surveyed the area and recorded their evaluation scores of each site's vitality level along with a corresponding picture taken on site. A total of 370 picture-score pairs from 231 valid survey sites were used to train a convolutional neural network. After optimization, a deep neural network model with 43 layers, including 11 convolutional ones, was created. Heat maps were then used to identify the features that lead to high vitality score outputs. The spatial distributions of different types of feature entities were also analyzed to help identify spatial effects. The study found that visual features including humans, construction sites, shop fronts, and roadside/walking pavements are vital ones that correspond to the vitality of the urban street. The consistency of these critical features with traditional urban vitality features indicates that the model learned useful knowledge from the training process. Applying the trained model in urban planning practice can help improve the city environment to better attract residents' activities and communications.
Applying machine learning to lemon defect recognition can improve the efficiency of lemon quality detection. This paper proposes a deep learning-based classification method with visual feature extraction and transfer learning to recognize defect lemons (i.e., green and mold defects). First, data enhancement and brightness compensation techniques are used for data preprocessing. Visual feature extraction is used to quantify the defects and determine the feature variables as the basis for classification. Then we construct a convolutional neural network with an embedded Visual Geometry Group 16-based (VGG16-based) network using transfer learning. The proposed model is compared with many benchmark models such as K-nearest Neighbor (KNN) and Support Vector Machine (SVM). Results show that the proposed model achieves the highest accuracy (95.44%) on the testing data set. The research provides a new solution for lemon defect recognition.
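A minimal transfer-learning sketch in the spirit of the VGG16-based classifier above: freeze the ImageNet-pretrained convolutional backbone and replace the final layer. The three-class head (good, green defect, mold defect) is an assumption based on the abstract.

```python
# A hedged sketch of VGG16 transfer learning with a frozen backbone.
import torch.nn as nn
from torchvision import models

def build_vgg16_classifier(num_classes: int = 3) -> nn.Module:
    vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
    for p in vgg.features.parameters():
        p.requires_grad = False                      # freeze convolutional backbone
    vgg.classifier[6] = nn.Linear(4096, num_classes)  # new task-specific head
    return vgg
```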
The impact of location services on people's lives has grown significantly in the era of widespread smart device usage. Global navigation satellite system (GNSS) signal rejection, weak signal strength in indoor environments, and radio signal interference caused by multi-wall environments collectively lead to significant positioning errors, so vision-based positioning has emerged as a crucial method in indoor positioning research. This paper introduces a scale hierarchical matching model to tackle the challenges associated with large visual databases and high scene similarity, both of which compromise matching accuracy and prolong positioning delays. The proposed model establishes an image feature database using GIST features and speeded-up robust features (SURF) in the offline stage. In the online stage, a positioning and navigation algorithm is constructed based on Dijkstra's path planning. Additionally, a corresponding Android application has been developed to facilitate visual positioning and navigation in indoor environments. Experimental results obtained in real indoor environments demonstrate that the proposed method significantly enhances positioning accuracy compared with similar algorithms, while effectively reducing time overhead, thereby meeting the requirements of indoor positioning and navigation users.
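For the path-planning stage, a standard Dijkstra implementation of the kind the navigation algorithm could build on is sketched below; the waypoint graph structure is a placeholder assumption.

```python
# A minimal Dijkstra sketch over a graph of indoor waypoints.
import heapq

def dijkstra(graph, start):
    # graph: {node: [(neighbour, cost), ...]} adjacency list of waypoints
    dist, heap = {start: 0.0}, [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                                  # stale queue entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist                                       # shortest cost to each waypoint
```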
The integration of remote sensing and artificial intelligence technologies into photovoltaic (PV) power generation has significantly enhanced the efficiency and precision of monitoring and evaluating PV station construction. However, most semantic segmentation models are primarily developed for natural scenes, often neglecting the distinctive visual attributes of PV panels. We introduce a visual feature constraint method designed to tailor the segmentation network to the unique aspects of PV panels, including their texture, color, and shape. The method incorporates a constraint module, comprising three adversarial autoencoders, into a conventional segmentation model. This technique represents a versatile training framework that can be seamlessly integrated with state-of-the-art models, providing clear insights into the learning process. Experimental results with UperNet, SegFormer, DeepLabV3+, TransUNet, CorrMatch, SCSM, and UKAN as baseline models show a maximum IoU improvement of 2.16%. Notably, UperNet attains the best segmentation outcomes, whereas DeepLabV3+ benefits most from the imposed constraints. Furthermore, our findings reveal that different models exhibit distinct sensitivities to different visual features, and employing multiple constraints typically yields better results than relying on single-feature constraints. Collectively, our proposed method showcases its potential to advance PV panel segmentation in remote sensing applications, presenting a scalable and effective solution.
Accessible communication based on sign language recognition (SLR) is key to emergency medical assistance for the hearing-impaired community. Balancing the capture of both local and global information in SLR for emergency medicine poses a significant challenge. To address this, we propose a novel approach based on the inter-learning of visual features between global and local information. Specifically, our method enhances the perception capabilities of the visual feature extractor by strategically leveraging the strengths of convolutional neural networks (CNNs), which are adept at capturing local features, and vision transformers, which perform well at perceiving global features. Furthermore, to mitigate the overfitting caused by the limited availability of sign language data for emergency medical applications, we introduce an enhanced short temporal module that performs data augmentation through additional subsequences. Experimental results on three publicly available sign language datasets demonstrate the efficacy of the proposed approach.
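One plausible reading of the subsequence augmentation above is sketched below: overlapping sub-clips are cut from each video to enlarge the training set. Clip length and stride are illustrative assumptions, not values from the paper.

```python
# A hedged sketch of subsequence-based temporal augmentation.
import torch

def temporal_subsequences(video: torch.Tensor, length: int = 16, stride: int = 8):
    # video: (frames, C, H, W) -> list of overlapping sub-clips for augmentation
    clips = []
    for start in range(0, max(video.size(0) - length + 1, 1), stride):
        clips.append(video[start:start + length])
    return clips
```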
Embodied visual exploration is critical for building intelligent visual agents. This paper presents neural exploration with feature-based visual odometry and a tracking-failure-reduction policy (NeOR), a framework for embodied visual exploration that possesses the efficient exploration capabilities of deep reinforcement learning (DRL)-based exploration policies and leverages feature-based visual odometry (VO) for more accurate mapping and positioning results. An improved local policy is also proposed to reduce tracking failures of feature-based VO in weakly textured scenes through a refined multi-discrete action space, keyframe fusion, and an auxiliary task. The experimental results demonstrate that NeOR has better mapping and positioning accuracy than other entirely learning-based exploration frameworks and improves the robustness of feature-based VO by significantly reducing tracking failures in weakly textured scenes.
Multi-modal Named Entity Recognition (MNER), a vision-language task, uses images as auxiliary information to detect and classify named entities in an input sentence. Recent studies find that visual information is helpful for Named Entity Recognition (NER), but the difference between the two modalities is rarely considered carefully. Approaches that rely on separate pre-trained models do not reduce the gap between textual and visual features, and models that give the two modalities equal weight often predict incorrectly because of noise in the visual information. To reduce this bias, we propose a Masked Multi-modal Attention Fusion approach for MNER, named MMAF. First, we use image captioning to generate a textual representation of the image, which is combined with the original sentence. Then, to obtain textual and visual features, we map the multi-modal inputs into a shared space and stack Multi-modal Attention Fusion layers that perform full interaction between the two modalities. We add a Multi-modal Attention Mask to highlight the importance of certain words in the sentence, enhancing entity detection performance. Finally, we obtain a multi-modal attention-based representation for each word and perform entity labeling via a CRF decoder. Experiments show that our method outperforms state-of-the-art models by 0.23% and 0.84% on the Twitter 2015 and 2017 MNER datasets, respectively, demonstrating its effectiveness.
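A hedged sketch of masked cross-modal attention fusion in the spirit of MMAF follows: each word attends over image features, and the fused update is applied only where the attention mask highlights important words. The mask semantics (1 = keep, 0 = suppress) and the fusion form are assumptions, not the authors' code.

```python
# A loose sketch of masked multi-modal attention fusion.
import torch
import torch.nn.functional as F

def masked_attention_fusion(text, image, mask):
    # text: (B, Lt, D); image: (B, Li, D); mask: (B, Lt) highlighting key words
    scores = torch.bmm(text, image.transpose(1, 2)) / text.size(-1) ** 0.5
    ctx = torch.bmm(F.softmax(scores, dim=-1), image)   # image context per word
    return text + ctx * mask.unsqueeze(-1)              # fuse only highlighted words
```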
Deception detection plays a crucial role in criminal investigation. Videos contain a wealth of information regarding apparent and physiological changes in individuals, and thus can serve as an effective means of deception detection. In this paper, we investigate video-based deception detection considering both apparent visual features, such as eye gaze, head pose, and facial action units (AUs), and non-contact heart rate detected by the remote photoplethysmography (rPPG) technique. Multiple wrapper-based feature selection methods combined with K-nearest neighbor (KNN) and support vector machine (SVM) classifiers are employed to screen the most effective features for deception detection. We evaluate the performance of the proposed method on both a self-collected physiological-assisted visual deception detection (PV3D) dataset and a public Bag-of-Lies (BOL) dataset. Experimental results demonstrate that the SVM classifier with symbiotic organisms search (SOS) feature selection yields the best overall performance, with an area under the curve (AUC) of 83.27% and accuracy (ACC) of 83.33% for PV3D, and an AUC of 71.18% and ACC of 70.33% for BOL. This demonstrates the stability and effectiveness of the proposed method in video-based deception detection tasks.
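The wrapper-based selection step can be illustrated with scikit-learn's SequentialFeatureSelector as a stand-in for the SOS search used in the paper; the feature matrix, labels, and feature count below are placeholders.

```python
# A hedged sketch of wrapper-based feature selection around an SVM classifier.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.random.rand(120, 40)          # placeholder: 40 visual/rPPG features per clip
y = np.random.randint(0, 2, 120)     # placeholder: deceptive vs truthful labels

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
selector = SequentialFeatureSelector(svm, n_features_to_select=10, cv=5)
selector.fit(X, y)                   # wrapper search: evaluate subsets via CV accuracy
print("selected feature indices:", np.flatnonzero(selector.get_support()))
```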
Aerodynamic surrogate modeling mostly relies only on integrated loads data obtained from simulation or experiment, while neglecting and wasting the valuable distributed physical information on the surface. To make full use of both integrated and distributed loads, a modeling paradigm, called heterogeneous data-driven aerodynamic modeling, is presented. The essential concept is to incorporate the physical information of distributed loads as additional constraints within end-to-end aerodynamic modeling. For such heterogeneous data, a novel and easily applicable physical feature embedding modeling framework is designed. This framework extracts low-dimensional physical features from the pressure distribution and then effectively enhances the modeling of the integrated loads via feature embedding. The proposed framework can be coupled with multiple feature extraction methods, and its generalization capability across different airfoils is verified through a transonic case. Compared with traditional direct modeling, the proposed framework can reduce testing errors by almost 50%. Given the same prediction accuracy, it can save more than half of the training samples. Furthermore, visualization analysis reveals a significant correlation between the discovered low-dimensional physical features and the heterogeneous aerodynamic loads, which shows the interpretability and credibility of the superior performance offered by the proposed deep learning framework.
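A loose sketch of the feature-extraction step is given below: surface pressure distributions are compressed into low-dimensional physical features with PCA, one of several extractors the framework could couple with. In the paper these features act as training-time constraints on the integrated-load model rather than test-time inputs; all dimensions here are illustrative.

```python
# A hedged sketch of extracting low-dimensional physical features from pressure data.
import numpy as np
from sklearn.decomposition import PCA

cp = np.random.rand(300, 120)          # placeholder pressure coefficients (samples x taps)
pca = PCA(n_components=8).fit(cp)
phys_features = pca.transform(cp)      # low-dimensional physical features
print("explained variance:", pca.explained_variance_ratio_.sum())
```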
Dynamic sign language recognition holds significant importance, particularly with the application of deep learning to address its complexity. However, existing methods face several challenges. Firstly, recognizing dynamic sign language requires identifying keyframes that best represent the signs, and missing these keyframes reduces accuracy. Secondly, some methods do not focus enough on hand regions, which are small within the overall frame, leading to information loss. To address these challenges, we propose a novel Video Transformer Attention-based Network (VTAN) for dynamic sign language recognition. Our approach prioritizes informative frames and hand regions effectively. To tackle the first issue, we designed a keyframe extraction module enhanced by a convolutional autoencoder, which focuses on selecting information-rich frames and eliminating redundant ones from the video sequences. For the second issue, we developed a soft attention-based transformer module that emphasizes extracting features from hand regions, ensuring that the network pays more attention to hand information within sequences. This dual-focus approach improves dynamic sign language recognition by addressing the key challenges of identifying critical frames and emphasizing hand regions. Experimental results on two public benchmark datasets demonstrate the effectiveness of our network, which outperforms most typical methods in sign language recognition tasks.
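One plausible reading of the keyframe-extraction idea is sketched below: frames are scored with a convolutional autoencoder and the least redundant ones kept. Using reconstruction error as the informativeness score is an assumption made for illustration only.

```python
# A hedged sketch of autoencoder-based keyframe selection.
import torch

def select_keyframes(frames, autoencoder, k=8):
    # frames: (T, C, H, W), with T >= k assumed; autoencoder returns reconstructions
    with torch.no_grad():
        errors = ((autoencoder(frames) - frames) ** 2).mean(dim=(1, 2, 3))
    # keep the k frames the autoencoder reconstructs worst, i.e. least redundant,
    # restoring temporal order before returning
    return frames[errors.topk(k).indices.sort().values]
```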
The quality of oranges is judged by their appearance and diameter. Appearance refers to the skin's smoothness and surface cleanliness; diameter refers to the transverse diameter size. Both are visual attributes that visual perception technologies can automatically identify. Nonetheless, current orange quality assessment needs to address two issues: 1) there are no image datasets for orange quality grading; 2) it is challenging to effectively learn the fine-grained and distinct visual semantics of oranges from diverse angles. This study collected 12522 images from 2087 oranges for multi-grained grading tasks. In addition, it presents a visual learning graph convolution approach for multi-grained orange quality grading, comprising a backbone network and a graph convolutional network (GCN). The backbone network's object detection, data augmentation, and feature extraction remove extraneous visual information. The GCN is utilized to learn the topological semantics of orange feature maps. Evaluation results show that the recognition accuracy of diameter size, appearance, and fine-grained orange quality reached 99.50%, 97.27%, and 97.99%, respectively, indicating that the proposed approach is superior to others.
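A minimal graph-convolution layer of the kind the GCN above could stack is sketched below; the mean-based adjacency normalization is a common choice, not necessarily the authors'.

```python
# A hedged sketch of a single graph-convolution layer over feature-map nodes.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (nodes, in_dim); adj: (nodes, nodes) adjacency with self-loops
        deg = adj.sum(-1, keepdim=True).clamp(min=1)
        return torch.relu(self.lin(adj @ x / deg))  # mean-aggregate neighbours, project
```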
The rapid growth of multimedia content necessitates powerful technologies to filter, classify, index, and retrieve video documents more efficiently. However, the essential bottleneck of image and video analysis is the semantic gap: low-level features extracted by computers often fail to coincide with the high-level concepts interpreted by humans. In this paper, we present a generic scheme for detecting video semantic concepts based on machine learning over multiple visual features. Various global and local low-level visual features are systematically investigated, and a kernel-based learning method equips the concept detection system to explore the potential of these features. We then combine the different features and sub-systems with both classifier-level and kernel-level fusion, which contributes to a more robust system. Our proposed system is tested on the TRECVID dataset. The resulting Mean Average Precision (MAP) score is much better than the benchmark performance, which proves that our concept detection engine develops a generic model and performs well on both object-type and scene-type concepts.
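Kernel-level fusion can be illustrated by combining per-feature kernels into a single precomputed kernel for an SVM, as sketched below; the equal weights and specific kernel choices are assumptions rather than the paper's configuration.

```python
# A hedged sketch of kernel-level fusion for concept detection.
from sklearn.metrics.pairwise import chi2_kernel, rbf_kernel

def fused_kernel(X_color, X_texture, Y_color=None, Y_texture=None):
    # histogram-like, non-negative features assumed for the chi-square kernel
    Y_color = X_color if Y_color is None else Y_color
    Y_texture = X_texture if Y_texture is None else Y_texture
    # one kernel per low-level feature, averaged into a single fused kernel
    return 0.5 * chi2_kernel(X_color, Y_color) + 0.5 * rbf_kernel(X_texture, Y_texture)

# usage: train sklearn.svm.SVC(kernel="precomputed") on K = fused_kernel(...)
```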
Novel view synthesis has recently attracted tremendous research attention for its applications in virtual reality and immersive telepresence. Rendering a locally immersive light field (LF) based on arbitrary large-baseline RGB references is a challenging problem that lacks efficient solutions with existing novel view synthesis techniques. In this work, we aim at truthfully rendering local immersive novel views/LF images based on large-baseline LF captures and a single RGB image in the target view. To fully explore the precious information from source LF captures, we propose a novel occlusion-aware source sampler (OSS) module which efficiently transfers the pixels of source views to the target view's frustum in an occlusion-aware manner. An attention-based deep visual fusion module is proposed to fuse the revealed occluded background content with a preliminary LF into a final refined LF. The proposed source sampling and fusion mechanism not only helps to provide information for occluded regions from varying observation angles, but also proves able to effectively enhance visual rendering quality. Experimental results show that our proposed method renders high-quality LF images/novel views with sparse RGB references and outperforms state-of-the-art LF rendering and novel view synthesis methods.
Objective image quality assessment (IQA) plays an important role in various visual communication systems, as it can automatically and efficiently predict the perceived quality of images. The human eye is the ultimate evaluator of visual experience, so modeling of the human visual system (HVS) is a core issue for objective IQA and visual experience optimization. Traditional models based on black-box fitting have low interpretability and struggle to guide experience optimization effectively, while models based on physiological simulation are hard to integrate into practical visual communication services due to their high computational complexity. To bridge the gap between signal distortion and visual experience, in this paper we propose a novel perceptual no-reference (NR) IQA algorithm based on structural computational modeling of the HVS. Following the mechanism of the human brain, we divide visual signal processing into a low-level visual layer, a middle-level visual layer, and a high-level visual layer, which conduct pixel information processing, primitive information processing, and global image information processing, respectively. Natural scene statistics (NSS) based features, deep features, and free-energy based features are extracted from these three layers. Support vector regression (SVR) is employed to aggregate the features into the final quality prediction. Extensive experimental comparisons on three widely used benchmark IQA databases (LIVE, CSIQ, and TID2013) demonstrate that our proposed metric is highly competitive with or outperforms state-of-the-art NR IQA measures.
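The final aggregation stage can be sketched as follows: features from the three visual layers are concatenated and mapped to a quality score with SVR. Feature extraction itself is abstracted away here, and all dimensions and hyperparameters are placeholders.

```python
# A hedged sketch of SVR-based aggregation of multi-layer quality features.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

nss_feats = np.random.rand(200, 18)    # placeholder low-level NSS features
deep_feats = np.random.rand(200, 128)  # placeholder middle-level deep features
fe_feats = np.random.rand(200, 4)      # placeholder high-level free-energy features
mos = np.random.rand(200) * 100        # placeholder subjective quality scores

X = np.hstack([nss_feats, deep_feats, fe_feats])
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
model.fit(X, mos)                      # regress concatenated features onto quality
print("predicted quality:", model.predict(X[:3]))
```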
With the rapid development of the Internet, the types of webpages are more abundant than in previous decades. However, people face increasingly severe network security risks and enormous losses caused by phishing webpages, which imitate the interfaces of real webpages and deceive victims. To better identify and distinguish phishing webpages, a visual feature extraction method and a visual similarity algorithm are proposed. First, the visual feature extraction method improves the Vision-based Page Segmentation (VIPS) algorithm to extract visual blocks and calculates each block's signature by perceptual hash technology. Second, the visual similarity algorithm establishes a one-to-one correspondence based on the visual blocks' coordinates and thresholds. Weights are then assigned according to the tree structure, and the similarity of the visual blocks is calculated from the Hamming distance between their visual features. Further, the visual similarity of webpages is generated by integrating the similarity and weight of the different visual blocks. Finally, multiple pairs of phishing and legitimate webpages are evaluated to verify the feasibility of the algorithm. The experimental results demonstrate that our method achieves 94% accuracy.
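The perceptual-hash comparison step can be illustrated with the imagehash library, as in the hedged sketch below; the 64-bit hash size and the similarity normalization are illustrative choices, not the paper's exact parameters.

```python
# A minimal perceptual-hash + Hamming-distance sketch for visual-block comparison.
from PIL import Image
import imagehash

def block_similarity(img_path_a: str, img_path_b: str) -> float:
    ha = imagehash.phash(Image.open(img_path_a))
    hb = imagehash.phash(Image.open(img_path_b))
    dist = ha - hb               # Hamming distance between 64-bit perceptual hashes
    return 1.0 - dist / 64.0     # normalised similarity in [0, 1]
```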
In this paper, a referral system to assist medical experts in the screening and referral of diabetic retinopathy is proposed. The system was developed by the sequential use of several existing mathematical techniques: speeded-up robust features (SURF), K-means clustering, and visual dictionaries (VD). Three databases are mixed to test how the system performs when the sources are dissimilar. In experiments, an area under the curve (AUC) of 0.9343 was attained. The results acquired from the system are promising.
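A bag-of-visual-words sketch along the lines of the SURF + K-means + visual dictionary pipeline is given below. SURF requires opencv-contrib, so the freely available ORB detector stands in here; the vocabulary size is an assumption.

```python
# A hedged bag-of-visual-words sketch (ORB standing in for SURF).
import cv2
import numpy as np
from sklearn.cluster import KMeans

def build_dictionary(image_paths, vocab_size=100):
    orb = cv2.ORB_create()
    descs = []
    for p in image_paths:
        img = cv2.imread(p, cv2.IMREAD_GRAYSCALE)
        _, d = orb.detectAndCompute(img, None)
        if d is not None:
            descs.append(d.astype(np.float32))
    # cluster all local descriptors into a visual dictionary
    return KMeans(n_clusters=vocab_size, n_init=10).fit(np.vstack(descs))

def bovw_histogram(img_path, kmeans):
    orb = cv2.ORB_create()
    img = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
    _, d = orb.detectAndCompute(img, None)
    if d is None:
        return np.zeros(kmeans.n_clusters, dtype=int)
    words = kmeans.predict(d.astype(np.float32))
    return np.bincount(words, minlength=kmeans.n_clusters)  # image as word histogram
```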
Target tracking is a typical application of visual servoing technology, yet tracking high-speed targets remains difficult for current visual servo systems, and improvements to the visual servoing scheme are strongly required. A position-based visual servo parallel system is presented for tracking targets at high speed. A local Frenet frame is assigned to each sampling point of the spatial trajectory. Position estimation is formed from the differential features of intrinsic geometry, and orientation estimation is formed by homogeneous transformation. The time spent on searching and processing can be greatly reduced by shifting the window according to predicted feature locations. Simulation results demonstrate the ability of the system to track a spatially moving object.
With the development of information technology, radio communication technology has made rapid progress. Many radio signals that appear in space are difficult to classify without manual labeling, so unsupervised radio signal clustering methods have become an urgent need. Meanwhile, the high complexity of deep learning makes it difficult to understand the decision results of clustering models, making interpretability analysis essential. This paper proposes a combined loss function for unsupervised clustering based on an autoencoder. The combined loss function includes a reconstruction loss and a deep clustering loss. The deep clustering loss is added on top of the reconstruction loss, which makes similar deep features converge more closely in feature space. In addition, a feature visualization method for signal clustering is proposed to analyze the interpretability of the autoencoder using saliency maps. Extensive experiments have been conducted on a modulated signal dataset, and the results indicate the superior performance of our proposed method over other clustering algorithms. In particular, for the simulated dataset containing six modulation modes, the clustering accuracy of the proposed method is greater than 78% at an SNR of 20 dB. Interpretability analysis of the clustering model visualizes the significant features of different modulated signals and verifies the high separability of the features extracted by the clustering model.
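A hedged sketch of the combined loss described above follows: a reconstruction term plus a deep-clustering term that pulls encoder features toward their assigned centroids. The weighting factor and the centroid-assignment scheme are assumptions, not the paper's exact formulation.

```python
# A minimal sketch of a reconstruction + deep-clustering loss for an autoencoder.
import torch
import torch.nn as nn

def combined_loss(x, x_rec, z, centroids, assignments, alpha=0.1):
    # x, x_rec: input signal and autoencoder reconstruction
    # z: (batch, dim) latent features; centroids: (k, dim); assignments: (batch,)
    rec = nn.functional.mse_loss(x_rec, x)
    clu = ((z - centroids[assignments]) ** 2).sum(-1).mean()  # pull z to its centroid
    return rec + alpha * clu   # alpha balances reconstruction vs clustering terms
```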