Journal Articles
10 articles found.
1. Efficient Reconstruction of Spatial Features for Remote Sensing Image-Text Retrieval
Authors: ZHANG Weihang, CHEN Jialiang, ZHANG Wenkai, LI Xinming, GAO Xin, SUN Xian. Transactions of Nanjing University of Aeronautics and Astronautics, 2025, No. 1, pp. 101-111.
Remote sensing cross-modal image-text retrieval (RSCIR) can flexibly and subjectively retrieve remote sensing images using query text, and has recently received increasing attention from researchers. However, as the number of parameters in visual-language pre-training models grows, direct transfer learning consumes substantial computational and storage resources. Moreover, recently proposed parameter-efficient transfer learning methods mainly focus on reconstructing channel features, ignoring the spatial features that are vital for modeling relationships between key entities. To address these issues, we design an efficient transfer learning framework for RSCIR based on spatial feature efficient reconstruction (SPER). A concise and efficient spatial adapter is introduced to enhance the extraction of spatial relationships. The spatial adapter spatially reconstructs the features in the backbone with few parameters while incorporating prior information from the channel dimension. We conduct quantitative and qualitative experiments on two commonly used RSCIR datasets. Compared with traditional methods, our approach improves the sumR metric by 3%-11%. Compared with methods that fine-tune all parameters, the proposed method trains less than 1% of the parameters while maintaining about 96% of the overall performance. (A minimal illustrative sketch of such a spatial adapter follows this entry.)
Keywords: remote sensing cross-modal image-text retrieval (RSCIR); spatial features; channel features; contrastive learning; parameter-efficient transfer learning
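The abstract above does not give implementation details, but the core idea of a lightweight adapter that reconstructs backbone features spatially can be sketched in PyTorch as follows. The module name, bottleneck width, grid size, and the use of a depthwise convolution plus a channel gate are my own assumptions for illustration, not the paper's actual design.

```python
import torch
import torch.nn as nn

class SpatialAdapter(nn.Module):
    """Illustrative bottleneck adapter that mixes features spatially.

    Assumption: features arrive as (batch, tokens, dim) from a ViT-style
    backbone; the depthwise conv over the token grid stands in for the
    paper's spatial reconstruction, and the channel gate supplies a
    channel-dimension prior. All names and sizes are hypothetical.
    """
    def __init__(self, dim: int, grid: int = 14, bottleneck: int = 64):
        super().__init__()
        self.grid = grid
        self.down = nn.Linear(dim, bottleneck)              # few extra parameters
        self.spatial = nn.Conv2d(bottleneck, bottleneck, kernel_size=3,
                                 padding=1, groups=bottleneck)  # spatial mixing
        self.channel_gate = nn.Sequential(                   # channel prior
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(bottleneck, bottleneck, kernel_size=1),
            nn.Sigmoid())
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        h = self.down(x)                                     # (b, n, bottleneck)
        h = h.transpose(1, 2).reshape(b, -1, self.grid, self.grid)
        h = self.spatial(h) * self.channel_gate(h)           # reconstruct spatially
        h = h.flatten(2).transpose(1, 2)                     # back to (b, n, bottleneck)
        return x + self.up(h)                                # residual adapter

# Toy check: 196 tokens of a 14x14 grid, 768-dim features.
x = torch.randn(2, 196, 768)
print(SpatialAdapter(768)(x).shape)  # torch.Size([2, 196, 768])
```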
2. Event-Driven Attention Network: A Cross-Modal Framework for Efficient Image-Text Retrieval in Mass Gathering Events
Authors: Kamil Yasen, Heyan Jin, Sijie Yang, Li Zhan, Xuyang Zhang, Ke Qin, Ye Li. Computers, Materials & Continua, 2025, No. 5, pp. 3277-3301.
Research on mass gathering events is critical for ensuring public security and maintaining social order. However, most existing work focuses on crowd behavior analysis, such as anomaly detection and crowd counting, and there is relatively little research on mass gathering behaviors. We believe real-time detection and monitoring of mass gathering behaviors are essential for mitigating potential security risks and emergencies. Therefore, it is imperative to develop a method capable of accurately identifying and localizing mass gatherings before disasters occur, enabling prompt and effective responses. To address this problem, we propose an innovative Event-Driven Attention Network (EDAN), which for the first time achieves image-text matching in the scenario of mass gathering events with good results. Traditional image-text retrieval methods based on global alignment struggle to capture local details within complex scenes, limiting retrieval accuracy. Local alignment-based methods are more effective at extracting detailed features, but they frequently process raw textual features directly, which often contain ambiguities and redundant information that diminish retrieval efficiency and degrade model performance. To overcome these challenges, EDAN introduces an Event-Driven Attention Module that adaptively focuses attention on image regions or textual words relevant to the event type. By calculating the semantic distance between event labels and textual content, this module significantly reduces computational complexity and enhances retrieval efficiency. To validate the effectiveness of EDAN, we construct a dedicated multimodal dataset tailored to the analysis of mass gathering events, providing a reliable foundation for subsequent studies. Comparative experiments with other methods on this dataset demonstrate the effectiveness of EDAN: in the image-to-text retrieval task, EDAN achieves the best performance on the R@5 metric, while in the text-to-image retrieval task it shows superior results on both R@10 and R@5, and it also achieves the best overall Rsum. Finally, ablation studies further verify the effectiveness of the Event-Driven Attention Module. (A minimal sketch of event-label-guided attention weighting follows this entry.)
Keywords: mass gathering events; image-text retrieval; attention mechanism
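As a rough illustration of the event-driven idea described above, the sketch below re-weights textual word features by their similarity to an event-label embedding, so that words relevant to the event type receive more attention. The softmax form, the temperature, and all tensor shapes are assumptions; the paper's actual module is not specified in the abstract.

```python
import torch
import torch.nn.functional as F

def event_driven_weights(word_feats: torch.Tensor,
                         event_emb: torch.Tensor,
                         temperature: float = 0.1) -> torch.Tensor:
    """Re-weight word features by their semantic closeness to an event label.

    word_feats: (num_words, dim) textual token embeddings.
    event_emb:  (dim,) embedding of the event label (e.g., "parade").
    Returns attention-weighted word features.
    """
    sim = F.cosine_similarity(word_feats, event_emb.unsqueeze(0), dim=-1)
    attn = torch.softmax(sim / temperature, dim=0)   # words close to the event dominate
    return attn.unsqueeze(-1) * word_feats           # (num_words, dim)

# Toy example with random embeddings standing in for real features.
words = torch.randn(12, 256)
event = torch.randn(256)
print(event_driven_weights(words, event).shape)  # torch.Size([12, 256])
```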
3. Exploration of French-Chinese Translation Methods of Electrical Engineering Terminology Using Online Image-Text Retrieval Mode
Author: Tian Li. Journal of Contemporary Educational Research, 2023, No. 6, pp. 47-52.
With the continued advancement of the Open Door Policy, which has consolidated international collaborative partnerships, an increasing number of Chinese companies are entering partner countries to participate in infrastructure construction, pursuing a win-win strategy that benefits the people and governments of both countries. Among these fields of cooperation, China's electrical companies have achieved a series of remarkable results in the international Engineering, Procurement, and Construction (EPC) project market thanks to their outstanding business capabilities and technical advantages. Nevertheless, some shortcomings cannot be overlooked, the most notable being the difficulties associated with engineering translation, which have long preoccupied translators at Chinese companies. Taking the transmission line project in the Republic of Madagascar as an example, this paper analyzes French-Chinese translation methods for electrical engineering terminology in the field of transmission lines.
Keywords: Engineering translation; Translation methods; Electrical engineering terminology; Interdisciplinary communication; Online image-text retrieval mode
4. Multi-Task Visual Semantic Embedding Network for Image-Text Retrieval
Authors: Xue-Yang Qin, Li-Shuang Li, Jing-Yao Tang, Fei Hao, Mei-Ling Ge, Guang-Yao Pang. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2024, No. 4, pp. 811-826.
Image-text retrieval aims to capture the semantic correspondence between images and texts, which serves as a foundation and crucial component in multi-modal recommendation, search systems, and online shopping. Existing mainstream methods primarily focus on modeling the association of image-text pairs while neglecting the advantageous impact of multi-task learning on image-text retrieval. To this end, a multi-task visual semantic embedding network (MVSEN) is proposed for image-text retrieval. Specifically, we design two auxiliary tasks, text-text matching and multi-label classification, as semantic constraints to improve the generalization and robustness of the visual semantic embedding from a training perspective. In addition, we present an intra- and inter-modality interaction scheme to learn discriminative visual and textual feature representations by facilitating information flow within and between modalities. Subsequently, we utilize multi-layer graph convolutional networks in a cascading manner to infer the correlation of image-text pairs. Experimental results show that MVSEN outperforms state-of-the-art methods on two publicly available datasets, Flickr30K and MSCOCO, with rSum improvements of 8.2% and 3.0%, respectively. (A minimal sketch of the multi-task loss idea follows this entry.)
Keywords: image-text retrieval; cross-modal retrieval; multi-task learning; graph convolutional network
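The abstract names two auxiliary tasks (text-text matching and multi-label classification) alongside the main retrieval objective. The sketch below shows one plausible way to combine such losses; the InfoNCE retrieval term, the cosine-based text-text term, and the loss weights are assumptions rather than MVSEN's exact formulation.

```python
import torch
import torch.nn.functional as F

def multi_task_loss(img_emb, txt_emb, txt_emb_aug, label_logits, labels,
                    w_ttm: float = 0.5, w_cls: float = 0.5):
    """Combine a main retrieval objective with two auxiliary tasks.

    img_emb, txt_emb: (batch, dim) paired image/text embeddings.
    txt_emb_aug:      (batch, dim) embeddings of paraphrased captions,
                      used for the text-text matching auxiliary task.
    label_logits:     (batch, num_labels) multi-label classification head output.
    labels:           (batch, num_labels) binary ground-truth labels.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / 0.07                     # image-text similarity matrix
    targets = torch.arange(img.size(0))
    loss_itm = (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2
    loss_ttm = 1 - F.cosine_similarity(txt_emb, txt_emb_aug, dim=-1).mean()
    loss_cls = F.binary_cross_entropy_with_logits(label_logits, labels.float())
    return loss_itm + w_ttm * loss_ttm + w_cls * loss_cls

# Toy check with random tensors.
b, d, c = 8, 256, 10
print(multi_task_loss(torch.randn(b, d), torch.randn(b, d),
                      torch.randn(b, d), torch.randn(b, c),
                      torch.randint(0, 2, (b, c))).item())
```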
5. Which is more faithful, seeing or saying? Multimodal sarcasm detection exploiting contrasting sentiment knowledge
Authors: Yutao Chen, Shumin Shi, Heyan Huang. CAAI Transactions on Intelligence Technology, 2025, No. 2, pp. 375-386.
Using sarcasm on social media platforms to express negative opinions towards a person or object has become increasingly common. However, detecting sarcasm in various forms of communication can be difficult due to conflicting sentiments. In this paper, we introduce a contrasting sentiment-based model for multimodal sarcasm detection (CS4MSD), which identifies inconsistent emotions by leveraging the CLIP knowledge module to produce sentiment features for both text and image. Five external sentiments are then introduced to prompt the model to learn sentiment preferences across modalities. Furthermore, we highlight the importance of verbal descriptions embedded in illustrations and incorporate additional knowledge-sharing modules to fuse such image-like features. Experimental results demonstrate that our model achieves state-of-the-art performance on the public multimodal sarcasm dataset. (A minimal sketch of sentiment-contrast scoring follows this entry.)
Keywords: CLIP; image-text classification; knowledge fusion; multi-modal sarcasm detection
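A rough way to picture the contrasting-sentiment idea: embed the post's image and text with a CLIP-style encoder, compare each against a small set of sentiment prompts, and measure how much the two resulting sentiment distributions disagree. The five sentiment labels, the prompt-scoring scheme, and the Jensen-Shannon-style score below are illustrative assumptions, not the CS4MSD architecture.

```python
import torch
import torch.nn.functional as F

SENTIMENTS = ["very negative", "negative", "neutral", "positive", "very positive"]

def sentiment_contrast_score(img_emb: torch.Tensor,
                             txt_emb: torch.Tensor,
                             sentiment_embs: torch.Tensor) -> torch.Tensor:
    """Score how strongly image and text disagree in sentiment.

    img_emb, txt_emb: (dim,) CLIP-style embeddings of the post's image and text.
    sentiment_embs:   (5, dim) embeddings of five sentiment prompts.
    Returns a symmetric divergence between the two sentiment distributions;
    a large value hints at the sentiment conflict that often marks sarcasm.
    """
    img_p = torch.softmax(
        F.cosine_similarity(sentiment_embs, img_emb.unsqueeze(0)) / 0.07, dim=0)
    txt_p = torch.softmax(
        F.cosine_similarity(sentiment_embs, txt_emb.unsqueeze(0)) / 0.07, dim=0)
    m = 0.5 * (img_p + txt_p)
    # Jensen-Shannon divergence between the image and text sentiment profiles.
    return 0.5 * (F.kl_div(m.log(), img_p, reduction="sum") +
                  F.kl_div(m.log(), txt_p, reduction="sum"))

# Toy example with random embeddings standing in for CLIP features.
d = 512
print(sentiment_contrast_score(torch.randn(d), torch.randn(d), torch.randn(5, d)).item())
```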
6. UniTrans: Unified Parameter-Efficient Transfer Learning and Multimodal Alignment for Large Multimodal Foundation Model
Authors: Jiakang Sun, Ke Chen, Xinyang He, Xu Liu, Ke Li, Cheng Peng. Computers, Materials & Continua, 2025, No. 4, pp. 219-238.
With the advancement of parameter-efficient transfer learning techniques, it has become feasible to leverage large pre-trained language models for downstream tasks under low-cost and low-resource conditions. However, applying this technique to multimodal knowledge transfer introduces a significant challenge: ensuring alignment across modalities while minimizing the number of additional parameters required for downstream task adaptation. This paper introduces UniTrans, a framework aimed at facilitating efficient knowledge transfer across multiple modalities. UniTrans leverages Vector-based Cross-modal Random Matrix Adaptation to enable fine-tuning with minimal parameter overhead. To further enhance modality alignment, we introduce two key components: the Multimodal Consistency Alignment Module and the Query-Augmentation Side Network, specifically optimized for scenarios with extremely limited trainable parameters. Extensive evaluations on various cross-modal downstream tasks demonstrate that our approach surpasses state-of-the-art methods while using just 5% of their trainable parameters, and it achieves superior performance compared to fully fine-tuned models on certain benchmarks. (A minimal sketch of vector-based random-matrix adaptation follows this entry.)
Keywords: parameter-efficient transfer learning; multimodal alignment; image captioning; image-text retrieval; visual question answering
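The abstract mentions Vector-based Cross-modal Random Matrix Adaptation. The sketch below shows the generic vector-based random-matrix recipe (frozen random low-rank matrices with small trainable scaling vectors, as in VeRA); how UniTrans shares or conditions these vectors across modalities is not stated in the abstract, so everything beyond the general recipe is an assumption.

```python
import torch
import torch.nn as nn

class VectorRandomMatrixAdapter(nn.Module):
    """Frozen random low-rank matrices with trainable scaling vectors.

    Only the two vectors (d and b) are trained, so the per-layer overhead
    is tiny. This is a generic single-layer sketch, not UniTrans itself.
    """
    def __init__(self, in_dim: int, out_dim: int, rank: int = 8):
        super().__init__()
        # Frozen, randomly initialized projection matrices (kept as buffers).
        self.register_buffer("A", torch.randn(rank, in_dim) / in_dim ** 0.5)
        self.register_buffer("B", torch.randn(out_dim, rank) / rank ** 0.5)
        self.d = nn.Parameter(torch.ones(rank))      # trainable scaling on the rank dim
        self.b = nn.Parameter(torch.zeros(out_dim))  # trainable scaling on the output dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Delta(x) = diag(b) @ B @ diag(d) @ A @ x, added to the frozen layer's output.
        return (x @ self.A.t() * self.d) @ self.B.t() * self.b

frozen = nn.Linear(768, 768)
for p in frozen.parameters():
    p.requires_grad_(False)
adapter = VectorRandomMatrixAdapter(768, 768)
x = torch.randn(4, 768)
y = frozen(x) + adapter(x)   # adapted output; only adapter.d and adapter.b train
print(y.shape, sum(p.numel() for p in adapter.parameters()))  # torch.Size([4, 768]) 776
```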
7. Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval (Cited by: 3)
Authors: Haoyu Lu, Yuqi Huo, Mingyu Ding, Nanyi Fei, Zhiwu Lu. Machine Intelligence Research (EI, CSCD), 2023, No. 4, pp. 569-582.
Cross-modal image-text retrieval is a fundamental task in bridging vision and language. It faces two main challenges that are typically not well addressed in previous works. 1) Generalizability: existing methods often assume a strong semantic correlation between each text-image pair and are thus difficult to generalize to real-world scenarios where weak correlation dominates. 2) Efficiency: many recent works adopt a single-tower architecture with heavy detectors, which is inefficient during inference because the costly computation must be repeated for each text-image pair. In this work, to overcome these two challenges, we propose a two-tower cross-modal contrastive learning (CMCL) framework. Specifically, we first devise a two-tower architecture, which provides a unified feature space in which the text and image modalities can be compared directly with each other, alleviating the heavy computation during inference. We further introduce a simple yet effective module named multi-grid split (MGS) to learn fine-grained image features without using detectors. Last but not least, we deploy a cross-modal contrastive loss on the global image/text features to learn their weak correlation and thus achieve high generalizability. To validate that CMCL can be readily generalized to real-world scenarios, we construct a large multi-source image-text dataset called the weak semantic correlation dataset (WSCD). Extensive experiments show that CMCL outperforms the state of the art while being much more efficient. (A minimal sketch of the two-tower contrastive loss follows this entry.)
Keywords: image-text retrieval; multimodal modeling; contrastive learning; weak correlation; computer vision
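The two-tower design with a global contrastive loss can be illustrated with a standard symmetric InfoNCE objective over independently encoded image and text embeddings; the temperature and batch construction below are assumptions, and the MGS module is omitted.

```python
import torch
import torch.nn.functional as F

def two_tower_contrastive_loss(img_emb: torch.Tensor,
                               txt_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over global image/text embeddings from two towers.

    Each modality is encoded independently (no cross-attention), so image
    embeddings can be pre-computed once and reused for every query at
    inference time.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    loss_i2t = F.cross_entropy(logits, targets)   # match each image to its caption
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy check with random embeddings from hypothetical image/text towers.
print(two_tower_contrastive_loss(torch.randn(16, 512), torch.randn(16, 512)).item())
```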
8. Multimodal Social Media Fake News Detection Based on Similarity Inference and Adversarial Networks (Cited by: 2)
Authors: Fangfang Shan, Huifang Sun, Mengyi Wang. Computers, Materials & Continua (SCIE, EI), 2024, No. 4, pp. 581-605.
As social networks become increasingly complex, contemporary fake news often includes textual descriptions of events accompanied by corresponding images or videos. Fake news spanning multiple modalities is more likely to create a misleading perception among users. While early research primarily focused on text-based features for fake news detection, there has been relatively limited exploration of learning shared representations in multimodal (text and visual) contexts. To address these limitations, this paper introduces a multimodal model for detecting fake news that relies on similarity reasoning and adversarial networks. The model employs Bidirectional Encoder Representations from Transformers (BERT) and a Text Convolutional Neural Network (Text-CNN) to extract textual features, while the pre-trained 19-layer Visual Geometry Group network (VGG-19) extracts visual features. The model then establishes similarity representations between the textual features extracted by Text-CNN and the visual features through similarity learning and reasoning. Finally, these features are fused to enhance the accuracy of fake news detection, and adversarial networks are employed to investigate the relationship between fake news and events. The proposed model is validated on publicly available multimodal datasets from Weibo and Twitter. Experimental results demonstrate that the proposed approach achieves superior performance on Twitter, with an accuracy of 86%, surpassing traditional unimodal models and existing multimodal models, and it also surpasses the benchmark models across multiple metrics on the Weibo dataset. The application of similarity reasoning and adversarial networks significantly enhances multimodal fake news detection. However, the current research is limited to fusing only the text and image modalities; future work should integrate features from additional modalities to represent the multifaceted information of fake news more comprehensively. (A minimal sketch of similarity-aware feature fusion follows this entry.)
Keywords: fake news detection; attention mechanism; image-text similarity; multimodal feature fusion
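A minimal sketch of the fusion step described above: project pre-extracted text features (e.g., from BERT/Text-CNN) and image features (e.g., from VGG-19) into a shared space, append an image-text similarity signal, and classify. The layer sizes and the concatenation-plus-similarity design are assumptions, and the adversarial event branch is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityFusionHead(nn.Module):
    """Fuse text and image features together with their similarity signal.

    Inputs are assumed to be pre-extracted feature vectors; this head only
    illustrates the similarity-aware fusion and binary classification.
    """
    def __init__(self, txt_dim: int = 768, img_dim: int = 4096, dim: int = 256):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, dim)
        self.img_proj = nn.Linear(img_dim, dim)
        self.classifier = nn.Sequential(
            nn.Linear(2 * dim + 1, dim), nn.ReLU(), nn.Linear(dim, 2))  # real vs. fake

    def forward(self, txt_feat: torch.Tensor, img_feat: torch.Tensor) -> torch.Tensor:
        t = self.txt_proj(txt_feat)
        v = self.img_proj(img_feat)
        sim = F.cosine_similarity(t, v, dim=-1).unsqueeze(-1)  # image-text consistency
        return self.classifier(torch.cat([t, v, sim], dim=-1))

# Toy check with random stand-ins for BERT/Text-CNN and VGG-19 features.
head = SimilarityFusionHead()
print(head(torch.randn(4, 768), torch.randn(4, 4096)).shape)  # torch.Size([4, 2])
```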
9. Dixit Player with Open CLIP
Author: Ryan Wei. Journal of Data Analysis and Information Processing, 2023, No. 4, pp. 536-547.
A computer vision approach based on OpenAI's CLIP, a model capable of predicting text-image pairs, is used to create an AI agent for Dixit, a game that requires creative linking between images and text. This paper establishes baseline accuracies both for matching the correct image to a hint and for matching human preferences. A dataset created by previous work on Dixit is used for testing. CLIP is applied by comparing a hint to multiple images and to previous hints, achieving a final accuracy of 0.5011, which surpasses previous results. (A minimal Open CLIP ranking sketch follows this entry.)
Keywords: Computer Vision; AI; CLIP; Dixit; Open AI; Creative Gameplay; Open CLIP; Natural Language Processing; Visual Models; Game AI; Image-Text Pairing
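A hint-to-image matching step like the one described can be reproduced with the open_clip library roughly as follows. The specific model and checkpoint (ViT-B-32 / laion2b_s34b_b79k), the file names, and the ranking-by-cosine-similarity scheme are assumptions; the paper's exact setup, including how previous hints are incorporated, is not given in the abstract.

```python
# pip install open_clip_torch pillow torch
import torch
import open_clip
from PIL import Image

# Load an Open CLIP model; the checkpoint choice here is an assumption.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def rank_cards(hint: str, image_paths: list) -> list:
    """Rank candidate Dixit cards by their similarity to the hint text."""
    images = torch.stack([preprocess(Image.open(p)) for p in image_paths])
    text = tokenizer([hint])
    with torch.no_grad():
        img_feat = model.encode_image(images)
        txt_feat = model.encode_text(text)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        scores = (img_feat @ txt_feat.t()).squeeze(1)
    return sorted(zip(image_paths, scores.tolist()), key=lambda x: -x[1])

# Hypothetical card files; the guess would be the highest-scoring card.
# print(rank_cards("a journey into the unknown", ["card1.jpg", "card2.jpg"]))
```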
10. What Contributes to a Crowdfunding Campaign's Success? Evidence and Analyses from GoFundMe Data
Authors: Xupin Zhang, Hanjia Lyu, Jiebo Luo. Journal of Social Computing, 2021, No. 2, pp. 183-192.
Researchers have attempted to measure the success of crowdfunding campaigns using a variety of determinants, such as the descriptions of the campaigns, the size of the funding goals, and project characteristics. Although many determinants of success have been reported in the literature, it remains unclear whether the cover photo and the text in the title and description could be combined in a fusion classifier to better predict a campaign's success. In this work, we focus on the performance of crowdfunding campaigns on GoFundMe across a wide variety of funding categories. We analyze the attributes available at the launch of a campaign and identify the attributes that are important for each campaign category. Furthermore, we develop a fusion classifier based on a random forest that significantly improves the prediction result, thereby suggesting effective ways to make a campaign successful. (A minimal random-forest fusion sketch follows this entry.)
Keywords: crowdfunding; image-text fusion; GoFundMe
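The fusion-classifier idea can be sketched with scikit-learn by concatenating image-derived and text-derived feature vectors per campaign and fitting a random forest on the result. All features below are random stand-ins and the dimensions are assumptions; the paper's actual feature extraction for cover photos and title/description text is not described in the abstract.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_campaigns = 500
img_feats = rng.normal(size=(n_campaigns, 64))    # e.g., CNN features of cover photos (hypothetical)
txt_feats = rng.normal(size=(n_campaigns, 128))   # e.g., embeddings of title + description (hypothetical)
success = rng.integers(0, 2, size=n_campaigns)    # 1 = campaign reached its goal (synthetic labels)

X = np.hstack([img_feats, txt_feats])             # fusion by concatenation
X_train, X_test, y_train, y_test = train_test_split(X, success, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("toy accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```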