Funding: Supported by the National Natural Science Foundation of China (Grant No. 62171232) and the Priority Academic Program Development of Jiangsu Higher Education Institutions, China.
Abstract: The Fine-grained Image Recognition (FGIR) task is dedicated to distinguishing similar sub-categories that belong to the same super-category, such as bird species or car types. To highlight visual differences, existing FGIR works often follow two steps: discriminative sub-region localization and local feature representation. However, these works pay little attention to global context information, neglecting the fact that subtle visual differences in challenging scenarios can be highlighted by exploiting the spatial relationships among different sub-regions from a global viewpoint. Therefore, in this paper we consider both global and local information for FGIR, and propose a collaborative teacher-student strategy to reinforce and unify the two types of information. Our framework is implemented mainly with convolutional neural networks and is referred to as the Teacher-Student Based Attention Convolutional Neural Network (T-S-ACNN). For fine-grained local information, we choose the classic Multi-Attention Network (MA-Net) as our baseline and propose a type of boundary constraint to further reduce background noise in the local attention maps. In this way, the discriminative sub-regions tend to appear in the area occupied by fine-grained objects, leading to more accurate sub-region localization. For fine-grained global information, we design a graph-convolution-based Global Attention Network (GA-Net), which combines the local attention maps extracted by MA-Net with non-local techniques to explore spatial relationships among sub-regions. Finally, we develop a collaborative teacher-student strategy that adaptively determines the attended roles and optimization modes, so as to enhance the cooperative reinforcement of MA-Net and GA-Net. Extensive experiments on the CUB-200-2011, Stanford Cars and FGVC Aircraft datasets illustrate the promising performance of our framework.
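The abstract names non-local techniques as the mechanism GA-Net uses to relate sub-regions, but gives no internals. As a reference point, here is a minimal sketch of a standard embedded-Gaussian non-local block; the channel count and the halved inner dimension are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Embedded-Gaussian non-local block: each spatial position attends to all others."""
    def __init__(self, channels: int):
        super().__init__()
        self.inter = channels // 2
        self.theta = nn.Conv2d(channels, self.inter, kernel_size=1)  # query
        self.phi = nn.Conv2d(channels, self.inter, kernel_size=1)    # key
        self.g = nn.Conv2d(channels, self.inter, kernel_size=1)      # value
        self.out = nn.Conv2d(self.inter, channels, kernel_size=1)    # project back

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (b, hw, inter)
        k = self.phi(x).flatten(2)                     # (b, inter, hw)
        v = self.g(x).flatten(2).transpose(1, 2)       # (b, hw, inter)
        affinity = torch.softmax(q @ k, dim=-1)        # (b, hw, hw) pairwise relations
        y = (affinity @ v).transpose(1, 2).reshape(b, self.inter, h, w)
        return x + self.out(y)                         # residual connection

x = torch.randn(2, 64, 14, 14)
print(NonLocalBlock(64)(x).shape)  # torch.Size([2, 64, 14, 14])
```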
Funding: Supported by the National Natural Science Foundation of China (No. 62376197), the Tianjin Science and Technology Program (No. 23JCYBJC00360), and the Tianjin Health Research Project (No. TJWJ2025MS045).
Abstract: Accessible communication based on sign language recognition (SLR) is the key to emergency medical assistance for the hearing-impaired community. Balancing the capture of both local and global information in SLR for emergency medicine poses a significant challenge. To address this, we propose a novel approach based on the inter-learning of visual features between global and local information. Specifically, our method enhances the perception capabilities of the visual feature extractor by strategically leveraging the strengths of convolutional neural networks (CNNs), which are adept at capturing local features, and vision transformers, which perform well at perceiving global features. Furthermore, to mitigate the overfitting caused by the limited availability of sign language data for emergency medical applications, we introduce an enhanced short temporal module for data augmentation through additional subsequences. Experimental results on three publicly available sign language datasets demonstrate the efficacy of the proposed approach.
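The enhanced short temporal module is described only as "data augmentation through additional subsequences". One plausible reading, sketched below, samples overlapping fixed-length windows from each sign video so that every clip yields several training samples; the window length, stride, and padding rule are assumptions.

```python
import numpy as np

def subsequence_augment(video: np.ndarray, win: int = 16, stride: int = 8):
    """Yield overlapping fixed-length subsequences of a (T, H, W, C) video.
    Each window is an extra training sample sharing the original label."""
    t = video.shape[0]
    if t <= win:                       # short clip: pad by repeating the last frame
        pad = np.repeat(video[-1:], win - t, axis=0)
        yield np.concatenate([video, pad], axis=0)
        return
    for start in range(0, t - win + 1, stride):
        yield video[start:start + win]

clip = np.zeros((40, 112, 112, 3), dtype=np.float32)
print(sum(1 for _ in subsequence_augment(clip)))  # 4 subsequences from a 40-frame clip
```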
Funding: Supported by the Postgraduate Scientific Research Innovation Project of Hunan Province (CX20230915) and the National Natural Science Foundation of China under Grant 62472440.
Abstract: In the Visual Place Recognition (VPR) task, existing research has leveraged large-scale pre-trained models to improve place recognition performance. However, when there are significant environmental differences between query images and reference images, a large number of ineffective local features interfere with the extraction of key landmark features, leading to the retrieval of visually similar but geographically different images. To address this perceptual aliasing problem caused by changes in environmental conditions, we propose a novel Visual Place Recognition method with Cross-Environment Robust Feature Enhancement (CerfeVPR). This method uses a GAN to generate counterparts of the original images under different environmental conditions, thereby enhancing the learning of robust features for the original images. This enables the global descriptor to effectively ignore appearance changes caused by environmental factors such as season and lighting, yielding better place recognition accuracy than other methods. Meanwhile, we introduce a large-kernel convolution adapter to fine-tune the pre-trained model, obtaining a better image feature representation for subsequent robust feature learning. Then, we process the information of different local regions in the general features through a 3-layer pyramid scene parsing network and fuse it with a token that retains global information to construct a multi-dimensional image feature representation. Based on this, we use the fused features of similar images to drive the robust feature learning of the original images and complete the feature matching between query images and retrieved images. Experiments on multiple commonly used datasets show that our method performs excellently: on average, CerfeVPR achieves the highest results, with all Recall@N values exceeding 90%. In particular, on the highly challenging Nordland dataset, the R@1 metric is improved by 4.6%, significantly outperforming other methods, which fully verifies the superiority of CerfeVPR for visual place recognition in complex environments.
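The adapter is described only as "large kernel convolution"; a plausible sketch following the common adapter pattern (bottleneck down-projection, large-kernel depthwise convolution, up-projection, residual add) is shown below. The 7×7 kernel and the 4× bottleneck are assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

class LargeKernelAdapter(nn.Module):
    """Residual adapter with a large-kernel depthwise conv, for fine-tuning
    around a frozen pre-trained backbone without touching its weights."""
    def __init__(self, channels: int, bottleneck: int = 4, kernel: int = 7):
        super().__init__()
        hidden = channels // bottleneck
        self.down = nn.Conv2d(channels, hidden, kernel_size=1)
        self.dw = nn.Conv2d(hidden, hidden, kernel_size=kernel,
                            padding=kernel // 2, groups=hidden)  # large-kernel depthwise
        self.up = nn.Conv2d(hidden, channels, kernel_size=1)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.dw(self.down(x))))  # residual connection

feat = torch.randn(1, 256, 28, 28)
print(LargeKernelAdapter(256)(feat).shape)  # torch.Size([1, 256, 28, 28])
```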
Funding: Supported by the National Natural Science Foundation of China (Nos. U22A2034, 62177047), the High Caliber Foreign Experts Introduction Plan funded by MOST, and the Central South University Research Programme of Advanced Interdisciplinary Studies (No. 2023QYJC020).
Abstract: Improving website security to prevent malicious online activities is crucial, and CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) has emerged as a key strategy for distinguishing human users from automated bots. Text-based CAPTCHAs, designed to be easily decipherable by humans yet challenging for machines, are a common form of this verification. However, advancements in deep learning have facilitated the creation of models adept at recognizing these text-based CAPTCHAs with surprising efficiency. In our comprehensive investigation into CAPTCHA recognition, we have tailored the renowned UpDown image captioning model specifically for this purpose. Our approach combines an encoder that extracts both global and local features, significantly boosting the model's capability to identify complex details within CAPTCHA images. For the decoding phase, we have adopted a refined attention mechanism, integrating enhanced visual attention with dual layers of Long Short-Term Memory (LSTM) networks to elevate CAPTCHA recognition accuracy. Our rigorous testing across four varied datasets, including those from Weibo, BoC, Gregwar, and Captcha 0.3, demonstrates the versatility and effectiveness of our method. The results not only highlight the efficiency of our approach but also offer insights into its applicability across different CAPTCHA types, contributing to a deeper understanding of CAPTCHA recognition technology.
Funding: Supported by the National Natural Science Foundation of China (Nos. U22A2034, 62177047), the High Caliber Foreign Experts Introduction Plan funded by MOST, and the Central South University Research Programme of Advanced Interdisciplinary Studies (No. 2023QYJC020).
Abstract: Enhancing website security is crucial to combat malicious activities, and CAPTCHA (Completely Automated Public Turing tests to tell Computers and Humans Apart) has become a key method to distinguish humans from bots. While text-based CAPTCHAs are designed to challenge machines while remaining human-readable, recent advances in deep learning have enabled models to recognize them with remarkable efficiency. In this regard, we propose a novel two-layer visual attention framework for CAPTCHA recognition that builds on traditional attention mechanisms by incorporating Guided Visual Attention (GVA), which sharpens focus on relevant visual features. We have specifically adapted the well-established image captioning task to address this need. Our approach uses the first-level attention module as guidance for the second-level attention component, incorporating two LSTM (Long Short-Term Memory) layers to enhance CAPTCHA recognition. Our extensive evaluation across four diverse datasets, namely Weibo, BoC (Bank of China), Gregwar, and Captcha 0.3, shows the adaptability and efficacy of our method. Our approach demonstrated impressive performance, achieving an accuracy of 96.70% for BoC and 95.92% for Weibo. These results underscore the effectiveness of our method in accurately recognizing and processing CAPTCHA datasets, showcasing its robustness, reliability, and ability to handle varied challenges in CAPTCHA recognition.
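The abstract specifies only that first-level attention guides second-level attention across two LSTM layers; the sketch below is one plausible reading in which the level-1 attention weights bias the level-2 attention logits. All dimensions and the exact wiring are assumptions.

```python
import torch
import torch.nn as nn

class GuidedAttention(nn.Module):
    """Two-level attention: level-1 weights guide level-2 scoring (one reading of GVA)."""
    def __init__(self, feat_dim: int = 512, hid: int = 512):
        super().__init__()
        self.score1 = nn.Linear(feat_dim + hid, 1)
        self.score2 = nn.Linear(feat_dim + hid, 1)
        self.lstm1 = nn.LSTMCell(feat_dim, hid)
        self.lstm2 = nn.LSTMCell(feat_dim + hid, hid)

    def attend(self, scorer, feats, h, guide=None):
        # feats: (b, n, d) regional CAPTCHA features; h: (b, hid) decoder state
        n = feats.size(1)
        inp = torch.cat([feats, h.unsqueeze(1).expand(-1, n, -1)], dim=-1)
        logits = scorer(inp).squeeze(-1)            # (b, n)
        if guide is not None:                       # level-1 weights bias level-2
            logits = logits + torch.log(guide + 1e-6)
        w = torch.softmax(logits, dim=-1)
        return (w.unsqueeze(-1) * feats).sum(1), w  # context vector, weights

    def step(self, feats, state1, state2):
        h1, c1 = state1
        h2, c2 = state2
        ctx1, w1 = self.attend(self.score1, feats, h1)
        h1, c1 = self.lstm1(ctx1, (h1, c1))
        ctx2, _ = self.attend(self.score2, feats, h2, guide=w1)
        h2, c2 = self.lstm2(torch.cat([ctx2, h1], dim=-1), (h2, c2))
        return (h1, c1), (h2, c2)                   # h2 would feed a character classifier

feats = torch.randn(2, 49, 512)
s1 = s2 = (torch.zeros(2, 512), torch.zeros(2, 512))
s1, s2 = GuidedAttention().step(feats, s1, s2)
```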
Funding: Supported by the Natural Science Foundation of Xinjiang Uygur Autonomous Region under Grant No. 2022D01B186.
Abstract: Visual Place Recognition (VPR) technology aims to use visual information to judge the location of agents, which plays an irreplaceable role in tasks such as loop closure detection and relocalization. Previous VPR algorithms emphasize the extraction and integration of general image features while ignoring the mining of the salient features that are key to discrimination in VPR tasks. To this end, this paper proposes a Domain-invariant Information Extraction and Optimization Network (DIEONet) for VPR. The core of the algorithm is a newly designed Domain-invariant Information Mining Module (DIMM) and a Multi-sample Joint Triplet Loss (MJT Loss). Specifically, DIMM incorporates the interdependence between different spatial regions of the feature map in a cascaded convolutional unit group, which enhances the model's attention to domain-invariant static object classes. MJT Loss introduces a "joint processing of multiple samples" mechanism into the original triplet loss and adds a new distance constraint term for positive and negative samples, so that the model can avoid falling into local optima during training. We demonstrate the effectiveness of our algorithm through extensive experiments on several authoritative benchmarks. In particular, the proposed method achieves the best performance on the TokyoTM dataset, with a Recall@1 metric of 92.89%.
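The exact form of MJT Loss is not given here. The sketch below extends an ordinary triplet loss in the two directions the abstract names: several negatives are processed jointly per anchor, and an extra term constrains positive-to-negative distances; the margins and the mean reduction are assumptions.

```python
import torch
import torch.nn.functional as F

def mjt_like_loss(anchor, positive, negatives, margin=0.3, pn_margin=0.3):
    """anchor, positive: (b, d); negatives: (b, k, d) -- k negatives handled jointly."""
    d_ap = F.pairwise_distance(anchor, positive)                    # (b,)
    d_an = torch.norm(anchor.unsqueeze(1) - negatives, dim=-1)      # (b, k)
    # joint multi-sample term: anchor-positive must be closer than every negative
    trip = F.relu(d_ap.unsqueeze(1) - d_an + margin).mean(dim=1)
    # extra distance constraint between the positive and negative samples themselves
    d_pn = torch.norm(positive.unsqueeze(1) - negatives, dim=-1)    # (b, k)
    pn = F.relu(d_ap.unsqueeze(1) - d_pn + pn_margin).mean(dim=1)
    return (trip + pn).mean()

a, p = torch.randn(4, 128), torch.randn(4, 128)
n = torch.randn(4, 5, 128)
print(mjt_like_loss(a, p, n).item())
```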
Abstract: Fine-grained recognition of ships in remote sensing images is crucial to safeguarding maritime rights and interests and maintaining national security. With the emergence of massive high-resolution multi-modality images, using multi-modality images for fine-grained recognition has become a promising technology. Fine-grained recognition of multi-modality images imposes higher requirements on dataset samples, and the key problem is how to extract and fuse the complementary features of multi-modality images to obtain more discriminative fusion features. The attention mechanism helps the model pinpoint the key information in an image, yielding a significant improvement in performance. In this paper, a dataset for fine-grained recognition of ships based on visible and near-infrared multi-modality remote sensing images is first proposed, named the Dataset for Multimodal Fine-grained Recognition of Ships (DMFGRS). It includes 1,635 pairs of visible and near-infrared remote sensing images divided into 20 categories, collated from digital orthophoto models provided by commercial remote sensing satellites. DMFGRS provides annotation files in two formats, as well as segmentation mask images corresponding to the ship targets. Then, a Multimodal Information Cross-Enhancement Network (MICE-Net), which fuses the features of visible and near-infrared remote sensing images, is proposed. In the network, a dual-branch feature extraction and fusion module is designed to obtain more expressive features. The Feature Cross Enhancement Module (FCEM) achieves fusion enhancement of the two modal features by making channel attention and spatial attention work cross-functionally on the feature maps. A benchmark is established by evaluating state-of-the-art object recognition algorithms on DMFGRS. In experiments on DMFGRS, MICE-Net reached a precision, recall, mAP@0.5 and mAP@0.5:0.95 of 87%, 77.1%, 83.8% and 63.9%, respectively. Extensive experiments demonstrate that MICE-Net performs excellently on DMFGRS. Built on the lightweight YOLO network, the model generalizes well and thus has good potential for real-life applications.
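FCEM's layout is not spelled out beyond channel and spatial attention working "cross-functionally" on the two modalities. The sketch below interprets this as attention computed from one modality re-weighting the other; the simplified attention blocks (plain pooling plus sigmoid) and the additive fusion are assumptions.

```python
import torch

def channel_attn(x):                          # squeeze spatial dims, excite channels
    return torch.sigmoid(x.mean(dim=(2, 3), keepdim=True))   # (b, c, 1, 1)

def spatial_attn(x):                          # squeeze channels, excite positions
    return torch.sigmoid(x.mean(dim=1, keepdim=True))        # (b, 1, h, w)

def cross_enhance(vis, nir):
    """Attention from each modality re-weights the other (one reading of FCEM)."""
    vis_out = vis * channel_attn(nir) * spatial_attn(nir)
    nir_out = nir * channel_attn(vis) * spatial_attn(vis)
    return vis_out + nir_out                  # fused multimodal feature

vis, nir = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)
print(cross_enhance(vis, nir).shape)  # torch.Size([2, 64, 32, 32])
```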
Funding: Supported by the research team of Xi'an Traffic Engineering Institute and the Young and Middle-aged Fund Project of Xi'an Traffic Engineering Institute (2022KY-02).
Abstract: Mining more discriminative temporal features to enrich temporal context representation is considered the key to fine-grained action recognition. Previous action recognition methods use a fixed spatiotemporal window to learn local video representations, but they fail to capture complex motion patterns due to their limited receptive field. To solve these problems, this paper proposes a lightweight Temporal Pyramid Excitation (TPE) module to capture short-, medium-, and long-term temporal context. In this method, the Temporal Pyramid (TP) module effectively expands the temporal receptive field of the network by using multi-temporal kernel decomposition without significantly increasing the computational cost. In addition, the Multi-Excitation module emphasizes temporal importance to enhance temporal feature representation learning. TPE can be integrated into ResNet50 to build a compact video learning framework, TPENet. Extensive validation experiments on several challenging benchmark datasets (Something-Something V1, Something-Something V2, UCF-101, and HMDB51) demonstrate that our method achieves a preferable balance between computation and accuracy.
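As a rough sketch of multi-temporal kernel decomposition, the module below runs parallel depthwise 1D convolutions with growing kernel sizes over the time axis to cover short-, medium-, and long-term context at low cost; the kernel sizes and the residual summation are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class TemporalPyramid(nn.Module):
    """Parallel temporal convs with different kernels approximate a pyramid of
    short/medium/long-term receptive fields cheaply (depthwise over time)."""
    def __init__(self, channels: int, kernels=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernels
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) -- per-channel temporal features
        return x + sum(branch(x) for branch in self.branches)

feats = torch.randn(2, 256, 8)   # features of an 8-frame clip
print(TemporalPyramid(256)(feats).shape)  # torch.Size([2, 256, 8])
```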
Funding: Supported by the National Natural Science Foundation of China (61806013, 61876010, 62176009, and 61906005), the General Project of the Science and Technology Plan of Beijing Municipal Education Commission (KM202110005028), the Beijing Municipal Education Commission Project (KZ201910005008), the Project of the Interdisciplinary Research Institute of Beijing University of Technology (2021020101), and the International Research Cooperation Seed Fund of Beijing University of Technology (2021A01).
Abstract: The fine-grained ship image recognition task aims to identify various classes of ships. However, small inter-class differences, large intra-class differences, and a lack of training samples make the task difficult. Therefore, to improve the accuracy of fine-grained ship image recognition, we design a fine-grained ship image recognition network based on a bilinear convolutional neural network (BCNN) with Inception and additive margin Softmax (AM-Softmax). This network improves on the BCNN in two respects. First, introducing Inception branches into the BCNN helps enhance its ability to extract comprehensive ship features. Second, by adding a margin to the decision boundary, the AM-Softmax function better widens the inter-class differences and reduces the intra-class differences. In addition, as there are few publicly available datasets for fine-grained ship image recognition, we construct a Ship-43 dataset containing 47,300 ship images in 43 categories. Experimental results on the constructed Ship-43 dataset demonstrate that our method effectively improves the accuracy of ship image recognition, exceeding the BCNN model by 4.08%. Moreover, comparison results on three other public fine-grained datasets (CUB, Cars, and Aircraft) further validate the effectiveness of the proposed method.
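AM-Softmax itself has a fixed published form: features and class weights are L2-normalized, a margin m is subtracted from the target-class cosine, and the logits are scaled by s before cross-entropy. A minimal sketch (s = 30 and m = 0.35 are common defaults, not necessarily this paper's settings):

```python
import torch
import torch.nn.functional as F

def am_softmax_loss(features, weight, labels, s=30.0, m=0.35):
    """AM-Softmax: cross-entropy over s * (cos(theta) - m * onehot).
    features: (b, d), weight: (num_classes, d), labels: (b,)."""
    cos = F.normalize(features) @ F.normalize(weight).t()   # (b, num_classes) cosines
    onehot = F.one_hot(labels, weight.size(0)).float()
    logits = s * (cos - m * onehot)                          # margin only on target class
    return F.cross_entropy(logits, labels)

feats = torch.randn(8, 512)
w = torch.randn(43, 512)                                     # 43 ship classes, as in Ship-43
y = torch.randint(0, 43, (8,))
print(am_softmax_loss(feats, w, y).item())
```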
Abstract: Methods for the visual recognition, positioning, and orienting of simple 3D geometric workpieces are presented in this paper. The principle and operating process of multiple-orientation run-length coding, based on general orientation run-length coding, and the visual recognition method are described in detail. A method of positioning and orienting based on the moment of inertia of the workpiece's binary image is also described. It has been applied in research on a flexible automatic coordinate measuring system formed by integrating computer-aided design, computer vision, and computer-aided inspection planning with a coordinate measuring machine. The results show that integrating computer vision with a measurement system is a feasible and effective approach to improving its flexibility and automation.
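Orientation from the moment of inertia of a binary image is classical: with central second moments mu20, mu02, mu11 about the centroid, the principal-axis angle is theta = 0.5 * atan2(2*mu11, mu20 - mu02). A self-contained sketch:

```python
import numpy as np

def orientation_from_moments(binary: np.ndarray):
    """Centroid and principal-axis angle (radians) of a binary workpiece image."""
    ys, xs = np.nonzero(binary)
    cx, cy = xs.mean(), ys.mean()                 # centroid = first moments / area
    mu20 = ((xs - cx) ** 2).mean()
    mu02 = ((ys - cy) ** 2).mean()
    mu11 = ((xs - cx) * (ys - cy)).mean()
    theta = 0.5 * np.arctan2(2 * mu11, mu20 - mu02)
    return (cx, cy), theta

img = np.zeros((64, 64), dtype=np.uint8)
img[20:24, 10:50] = 1                             # a thin horizontal bar
(cx, cy), theta = orientation_from_moments(img)
print(round(cx, 1), round(cy, 1), round(np.degrees(theta), 1))  # 29.5 21.5 0.0
```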
Abstract: A two-stage deep learning algorithm for the detection and recognition of spray-printed codes on can bottoms is proposed to address the small character areas and fast production line speeds involved in can bottom code recognition. In the code detection stage, a Differentiable Binarization Network is used as the backbone, combined with an Attention and Dilation Convolutions Path Aggregation Network feature fusion structure to enhance detection. For text recognition, end-to-end training with the Scene Visual Text Recognition network alleviates recognition errors caused by image color distortion due to variations in lighting and background noise. In addition, model pruning and quantization are used to reduce the number of model parameters to meet deployment requirements in resource-constrained environments. A comparative experiment was conducted on a dataset of can bottom codes collected on site, and a transfer experiment was conducted on a dataset of packaging box production dates. The experimental results show that the proposed algorithm can effectively locate codes on cans at different positions on the roller conveyor and can accurately identify the code numbers at high production line speeds. The Hmean of code detection is 97.32%, and the accuracy of code recognition is 98.21%, verifying that the proposed algorithm achieves high accuracy in both code detection and recognition.
Funding: Project (LJRC013) supported by the University Innovation Team of Hebei Province Leading Talent Cultivation, China.
Abstract: Flatness pattern recognition is the key to flatness control. The accuracy of existing flatness pattern recognition is limited, and shape defects cannot be reflected intuitively. To improve on this, a novel method using a T-S cloud inference network optimized by a genetic algorithm (GA) is proposed. The T-S cloud inference network is constructed from a T-S fuzzy neural network and the cloud model, so both the speed of fuzzy logic and the cloud model's handling of uncertainty in data processing are taken into account. Moreover, GA possesses a good parallel structure and global optimization characteristics: compared with the simulation recognition results of the traditional BP algorithm, GA is more accurate and effective. In addition, virtual reality technology is introduced into the field of shape control through mixed LabVIEW/MATLAB programming, and a virtual flatness pattern recognition interface is designed. The data of engineering analysis and the actual model are thereby combined with each other, and shape defects can be seen more vividly and intuitively.
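The abstract relies on GA's global search without detailing the operators. For reference, a minimal real-coded genetic algorithm is sketched below; tournament selection, uniform crossover, and Gaussian mutation are generic choices, not the paper's exact operators.

```python
import numpy as np

def ga_minimize(fitness, dim, pop_size=40, gens=100, sigma=0.1, rng=None):
    """Minimal real-coded GA: selection, crossover, mutation over a flat genome."""
    rng = rng or np.random.default_rng(0)
    pop = rng.normal(size=(pop_size, dim))
    for _ in range(gens):
        scores = np.array([fitness(ind) for ind in pop])
        # tournament selection: the winner of each random pair becomes a parent
        i, j = rng.integers(pop_size, size=(2, pop_size))
        parents = np.where((scores[i] < scores[j])[:, None], pop[i], pop[j])
        # uniform crossover between consecutive parents, then Gaussian mutation
        mask = rng.random((pop_size, dim)) < 0.5
        children = np.where(mask, parents, np.roll(parents, 1, axis=0))
        pop = children + rng.normal(scale=sigma, size=children.shape)
    return pop[np.argmin([fitness(ind) for ind in pop])]

best = ga_minimize(lambda w: float(np.sum((w - 3.0) ** 2)), dim=5)
print(np.round(best, 2))  # close to [3, 3, 3, 3, 3]
```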
Funding: The National Natural Science Foundation of China (Nos. 51768063, 51868068) and the Shanxi Provincial Innovation Center Project for Digital Road Design Technology (No. 202104010911019).
Abstract: The influence of Tibetan characters on the visual recognition of Tibetan-Chinese bilingual guide signs was studied based on drivers' visual characteristics. Four versions of Tibetan-Chinese bilingual guide signs with different heights and aspect ratios of Tibetan characters were designed, and corresponding road simulation models were established. Ten Tibetan drivers and ten Han drivers were selected to conduct driving simulation experiments using a driving simulator and an eye tracker. The participants' pupil diameters and visual recognition durations obtained from the eye tracker were analyzed by analysis of variance. Combining the statistical analysis of the driving simulator data with questionnaire results on the visual recognition experience, it can be concluded that for Tibetan drivers, when the height of the Tibetan characters was 2/3 of the height of the Chinese characters, the visual recognition of the signs was better than at 1/3 or 1/2 of the Chinese character height, indicating that increasing the height of Tibetan characters helps improve the visual recognition of guide signs. The aspect ratio of the Tibetan characters had no significant effect on the difficulty of drivers' visual recognition, but it affects the aesthetics of the bilingual guide signs. The recommended Tibetan character height should therefore be increased to improve visual recognition for Tibetan drivers.
Funding: Supported by the National Natural Science Foundation of China (No. 62471034) and the Hebei Natural Science Foundation (No. F2023105001).
Abstract: Fine-grained aircraft target detection in remote sensing holds significant research value and has practical applications, particularly in military defense and precision strikes. Given the complexity of remote sensing images, where targets are often small and similar within categories, detecting these fine-grained targets is challenging. To address this, we constructed a fine-grained dataset of remotely sensed airplanes, and for the problems of obvious head-to-tail class distributions and large variations in target size, we proposed the DWDet fine-grained target detection and recognition algorithm. First, for the problem of unbalanced category distribution, we adopt an adaptive sampling strategy. In addition, we construct a deformable convolutional block and improve the decoupled head structure to improve the model's detection of deformed targets. Then, we design a localization loss function that improves the model's localization of targets at different scales. The experimental results show that our algorithm improves the overall accuracy of the model by 4.1% over the baseline and improves the detection accuracy on small targets by 12.2%. Ablation and comparison experiments also prove the effectiveness of our algorithm.
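The adaptive sampling strategy is not specified; a common baseline for an unbalanced category distribution is inverse-frequency weighted sampling, sketched here with PyTorch's WeightedRandomSampler (the 1/frequency weighting is an assumption about the strategy, not the paper's formula).

```python
import torch
from collections import Counter
from torch.utils.data import WeightedRandomSampler

def balanced_sampler(labels):
    """Sample rare aircraft classes more often: weight = 1 / class frequency."""
    counts = Counter(labels)
    weights = torch.tensor([1.0 / counts[y] for y in labels], dtype=torch.double)
    return WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)

labels = [0] * 900 + [1] * 90 + [2] * 10          # heavy head-to-tail imbalance
sampler = balanced_sampler(labels)                 # pass as sampler= to a DataLoader
drawn = Counter(labels[i] for i in sampler)
print(drawn)                                       # roughly equal counts per class
```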
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 62076117 and 62166026 and the Jiangxi Provincial Key Laboratory of Virtual Reality under Grant No. 2024SSY03151.
Abstract: Dynamic sign language recognition holds significant importance, particularly with the application of deep learning to address its complexity. However, existing methods face several challenges. First, recognizing dynamic sign language requires identifying the keyframes that best represent the signs, and missing these keyframes reduces accuracy. Second, some methods do not focus enough on hand regions, which are small within the overall frame, leading to information loss. To address these challenges, we propose a novel Video Transformer Attention-based Network (VTAN) for dynamic sign language recognition that effectively prioritizes informative frames and hand regions. For the first issue, we designed a keyframe extraction module enhanced by a convolutional autoencoder, which selects information-rich frames and eliminates redundant ones from the video sequences. For the second issue, we developed a soft-attention-based transformer module that emphasizes extracting features from hand regions, ensuring that the network pays more attention to hand information within sequences. This dual-focus approach improves dynamic sign language recognition by addressing the key challenges of identifying critical frames and emphasizing hand regions. Experimental results on two public benchmark datasets demonstrate the effectiveness of our network, which outperforms most typical methods on sign language recognition tasks.
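The keyframe module's internals are not given beyond "enhanced by a convolutional autoencoder". One plausible sketch scores each frame by how poorly a trained autoencoder reconstructs it (treating high error as information-rich) and keeps the top-k in time order; the scoring rule, k, and the toy reconstruction stand-in are all assumptions.

```python
import numpy as np

def select_keyframes(frames: np.ndarray, reconstruct, k: int = 16):
    """Keep the k frames a (pre-trained) autoencoder reconstructs worst --
    a proxy for information-rich, low-redundancy frames.
    frames: (T, H, W, C); reconstruct: maps one frame to its reconstruction."""
    errors = np.array([np.mean((f - reconstruct(f)) ** 2) for f in frames])
    keep = np.sort(np.argsort(errors)[-k:])       # top-k errors, restored to time order
    return frames[keep]

# toy stand-in for a trained autoencoder: blurring via local averaging
def blur(frame):
    return (frame + np.roll(frame, 1, axis=0) + np.roll(frame, 1, axis=1)) / 3.0

video = np.random.rand(40, 32, 32, 3).astype(np.float32)
print(select_keyframes(video, blur, k=16).shape)  # (16, 32, 32, 3)
```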
Abstract: Research on intelligent, robotic excavators has become a focus both in China and abroad, and this type of excavator is becoming increasingly important in applications. In this paper, we developed a control system that enables an intelligent robotic excavator to perform excavating operations autonomously: it recognizes excavating targets by itself, plans the operation automatically based on the original parameters, and completes all the tasks. Experimental results confirm the real-time performance and precision of the control system. The intelligent robotic excavator can remarkably ease labor intensity and enhance working efficiency.
Funding: Financial support from the State General Administration of the People's Republic of China for Quality Supervision and Inspection and Quarantine (No. 2016QK122), the Shanghai Institute of Quality Inspection and Technical Research, the National Natural Science Foundation of China (Nos. 21572036 and 21861132002), and the Department of Chemistry, Fudan University.
Abstract: A series of novel six-coordinated terpyridine zinc complexes, containing ammonium salts and a thymine fragment at the two terminals, have been designed and synthesized; they can function as highly sensitive visual sensors for melamine detection via selective metallo-hydrogel formation. After full characterization by various techniques, complementary triple hydrogen bonding between the thymine fragment and melamine, as well as π-π stacking interactions, are thought to be responsible for the selective metallo-hydrogel formation. In light of the possible interference from milk ingredients (proteins, peptides, and amino acids) and legal/illegal additives (urea, sugars, and vitamins), a series of control experiments was carried out. To our delight, this visual recognition is highly selective: no gelation was observed with the selected milk ingredients or additives. Remarkably, this newly developed protocol enables convenient and highly selective visual recognition of melamine at concentrations as low as 10 ppm in raw milk without any tedious pretreatment.
Funding: Supported by the National Natural Science Foundation of China (No. 81570880).
Abstract: AIM: To quantitatively evaluate the effect of a simulated smog environment on human visual function by psychophysical methods. METHODS: The smog environment was simulated in a 40×40×60 cm³ glass chamber filled with a PM2.5 aerosol, and 14 subjects with normal visual function were examined by psychophysical methods with the foggy smog box placed in front of their eyes. The transmission of light through the smog box, an indication of the percentage concentration of smog, was determined with a luminance meter. Visual function under different smog concentrations was evaluated by E-visual acuity, crowded E-visual acuity, and contrast sensitivity. RESULTS: E-visual acuity, crowded E-visual acuity, and contrast sensitivity were all impaired with a decrease in the transmission rate (TR) according to power functions, with invariant exponents of -1.41, -1.62, and -0.7, respectively, and R² values of 0.99 for E and crowded E-visual acuity and 0.96 for contrast sensitivity. Crowded E-visual acuity decreased faster than E-visual acuity. There was a good correlation between the TR, extinction coefficient, and visibility under heavy-smog conditions. CONCLUSION: Increases in smog concentration have a strong effect on visual function.
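The reported relationships are ordinary power functions, acuity = a · TR^b. Given (TR, acuity) measurements, the exponents can be recovered with a standard curve fit; the sketch below uses synthetic data generated with the paper's E-visual-acuity exponent of -1.41.

```python
import numpy as np
from scipy.optimize import curve_fit

def power(tr, a, b):
    return a * tr ** b

# synthetic measurements following acuity = 1.0 * TR^-1.41 plus 2% noise
tr = np.linspace(0.2, 1.0, 20)
acuity = 1.0 * tr ** -1.41 * (1 + 0.02 * np.random.default_rng(0).standard_normal(20))

(a, b), _ = curve_fit(power, tr, acuity, p0=(1.0, -1.0))
print(round(a, 2), round(b, 2))  # recovers roughly a = 1.0, b = -1.41
```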
Abstract: Lip-reading technologies are progressing rapidly following the breakthrough of deep learning, and they play a vital role in many applications, such as human-machine communication and security. In this paper, we propose an effective lip-reading recognition model for Arabic visual speech recognition implemented with deep learning algorithms. The Arabic visual datasets that have been collected contain 2400 records of Arabic digits and 960 records of Arabic phrases from 24 native speakers. The primary purpose is to provide a high-performance model by enhancing the preprocessing phase. First, we extract keyframes from our dataset. Second, we produce Concatenated Frame Images (CFIs) that represent each utterance sequence in one single image. Finally, VGG-19 is employed for visual feature extraction in our proposed model. We have examined different numbers of keyframes (10, 15, and 20) to compare two variants of the proposed model: (1) the VGG-19 base model and (2) the VGG-19 base model with batch normalization. The results show that the second variant achieves greater accuracy: 94% for digit recognition, 97% for phrase recognition, and 93% for combined digit and phrase recognition on the test dataset. Our proposed model is therefore superior to previous models based on CFI input.
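A Concatenated Frame Image tiles the selected keyframes into one grid so a single-image CNN such as VGG-19 can consume the whole utterance at once. A sketch (the grid layout and zero-padding of the last row are assumptions):

```python
import numpy as np

def make_cfi(frames: np.ndarray, cols: int = 5) -> np.ndarray:
    """Tile (N, H, W, C) keyframes into one (rows*H, cols*W, C) image."""
    n, h, w, c = frames.shape
    rows = -(-n // cols)                              # ceiling division
    grid = np.zeros((rows * h, cols * w, c), dtype=frames.dtype)
    for idx, frame in enumerate(frames):
        r, col = divmod(idx, cols)
        grid[r * h:(r + 1) * h, col * w:(col + 1) * w] = frame
    return grid

keyframes = np.random.rand(15, 64, 64, 3).astype(np.float32)
print(make_cfi(keyframes).shape)  # (192, 320, 3) -- 3 rows x 5 cols of 64x64 frames
```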
Funding: Supported by the National Natural Science Foundation of China (61571453, 61806218).
Abstract: Deep learning has achieved excellent results in various computer vision tasks, especially in fine-grained visual categorization, which aims to distinguish the subordinate categories within label-level categories. Due to high intra-class variance and high inter-class similarity, fine-grained visual categorization is extremely challenging. This paper first briefly introduces and analyzes the related public datasets. After that, some of the latest methods are reviewed. Based on the feature types, the feature processing methods, and the overall structure used in each model, we divide them into three types: methods based on general convolutional neural networks (CNNs) and strong part-level supervision, methods based on single feature processing, and methods based on multiple feature processing. Most methods of the first type have a relatively simple structure, the result of the initial research. The other two types include models that have special structures and training processes, which help obtain discriminative features. We conduct a specific analysis of several methods with high accuracy on public datasets. In addition, we argue that the focus of future research should be reducing existing methods' demand for large amounts of data and computing power. In terms of technology, extracting subtle feature information with the burgeoning vision transformer (ViT) network is also an important research direction.