Funding: Supported by the National Key R&D Program of China (No. 2022ZD0118402).
Abstract: Remote sensing cross-modal image-text retrieval (RSCIR) can flexibly and subjectively retrieve remote sensing images using query text, and it has recently attracted growing attention from researchers. However, as the parameter counts of vision-language pre-training models continue to grow, direct transfer learning consumes substantial computational and storage resources. Moreover, recently proposed parameter-efficient transfer learning methods mainly focus on reconstructing channel features, ignoring the spatial features that are vital for modeling relationships among key entities. To address these issues, we design an efficient transfer learning framework for RSCIR based on spatial feature efficient reconstruction (SPER). A concise and efficient spatial adapter is introduced to enhance the extraction of spatial relationships. The spatial adapter spatially reconstructs the features in the backbone with few parameters while incorporating prior information from the channel dimension. We conduct quantitative and qualitative experiments on two commonly used RSCIR datasets. Compared with traditional methods, our approach improves the sumR metric by 3%-11%. Compared with methods that fine-tune all parameters, our method trains less than 1% of the parameters while retaining about 96% of the overall performance.
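The abstract does not give the adapter's exact architecture; the following PyTorch sketch shows one plausible form of a lightweight spatial adapter under stated assumptions: a depthwise convolution models local spatial relations, a small squeeze-and-excite branch supplies the channel prior, and a residual connection feeds the result back into the frozen backbone feature. Module names and dimensions are hypothetical, not the authors' implementation.

```python
# A minimal sketch of a lightweight spatial adapter; the exact SPER design is not
# described in the abstract, so layer choices and sizes here are assumptions.
import torch
import torch.nn as nn

class SpatialAdapter(nn.Module):
    """Reconstructs spatial relationships with few parameters and mixes in a
    channel-wise prior, then adds the result back to the backbone feature."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        hidden = max(channels // reduction, 4)
        # Depthwise 3x3 conv models local spatial relations with only C*9 weights.
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3,
                                 padding=1, groups=channels, bias=False)
        # Channel prior: squeeze-and-excite style gating through a small bottleneck.
        self.channel_prior = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, hidden, 1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [B, C, H, W]
        reconstructed = self.spatial(x) * self.channel_prior(x)
        return x + reconstructed  # residual keeps the frozen backbone feature

feats = torch.randn(2, 768, 14, 14)   # e.g. patch features from a frozen backbone
adapter = SpatialAdapter(768)
print(adapter(feats).shape, sum(p.numel() for p in adapter.parameters()))
```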
Funding: Sponsored by the Natural Science Foundation of Xinjiang Uygur Autonomous Region (2024D01A19).
Abstract: Research on mass gathering events is critical for ensuring public security and maintaining social order. However, most existing work focuses on crowd behavior analysis tasks such as anomaly detection and crowd counting, and research on mass gathering behaviors is relatively scarce. We believe real-time detection and monitoring of mass gathering behaviors are essential for mitigating potential security risks and emergencies. It is therefore imperative to develop a method capable of accurately identifying and localizing mass gatherings before disasters occur, enabling prompt and effective responses. To address this problem, we propose an innovative Event-Driven Attention Network (EDAN), which, for the first time, achieves image-text matching in the scenario of mass gathering events with good results. Traditional image-text retrieval methods based on global alignment struggle to capture local details within complex scenes, limiting retrieval accuracy. Local alignment-based methods are more effective at extracting detailed features, but they frequently process raw textual features directly, which often contain ambiguities and redundant information that diminish retrieval efficiency and degrade model performance. To overcome these challenges, EDAN introduces an Event-Driven Attention Module that adaptively focuses attention on image regions or textual words relevant to the event type. By calculating the semantic distance between event labels and textual content, this module significantly reduces computational complexity and enhances retrieval efficiency. To validate the effectiveness of EDAN, we construct a dedicated multimodal dataset tailored to the analysis of mass gathering events, providing a reliable foundation for subsequent studies. We conduct comparative experiments with other methods on our dataset, and the results demonstrate the effectiveness of EDAN. In the image-to-text retrieval task, EDAN achieves the best performance on the R@5 metric, while in the text-to-image retrieval task it leads on both R@10 and R@5. EDAN also achieves the best overall Rsum score. Finally, ablation studies further verify the effectiveness of the Event-Driven Attention Module.
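As a rough illustration of the event-driven idea (not the published EDAN code), the sketch below turns the semantic similarity between an event-label embedding and word embeddings into attention weights, so that event-relevant words dominate the pooled text feature. The temperature and feature sizes are assumptions.

```python
# Illustrative event-driven attention: weight words by their similarity to the
# event-label embedding, then pool. A sketch under assumed shapes, not EDAN itself.
import torch
import torch.nn.functional as F

def event_driven_attention(word_feats, event_feat, temperature=0.1):
    """word_feats: [L, D] word embeddings; event_feat: [D] event-label embedding."""
    sim = F.cosine_similarity(word_feats, event_feat.unsqueeze(0), dim=-1)  # [L]
    weights = F.softmax(sim / temperature, dim=0)                           # [L]
    pooled = (weights.unsqueeze(-1) * word_feats).sum(dim=0)                # [D]
    return pooled, weights

words = torch.randn(12, 512)   # toy word features for a 12-token caption
event = torch.randn(512)       # embedding of an event label such as "rally"
pooled, attn = event_driven_attention(words, event)
print(pooled.shape, attn.sum().item())
```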
Abstract: With the continued advancement of the Open Door Policy and the consolidation of international collaborative partnerships, an increasing number of Chinese companies are moving into partner countries to participate in infrastructure construction, pursuing a win-win strategy that benefits the people and governments of both countries. Among these cooperation domains, Chinese electrical companies have achieved a series of remarkable results in the international Engineering, Procurement, and Construction (EPC) project market thanks to their outstanding business capabilities and technical advantages. Nevertheless, some shortcomings cannot be overlooked, the most notable being the difficulties of engineering translation, which have long preoccupied translators at Chinese companies. Taking the transmission line project in the Republic of Madagascar as an example, this paper analyzes French-Chinese translation methods for electrical engineering terminology in the field of transmission lines.
Funding: Supported by the National Natural Science Foundation of China under Grant No. 62076048.
Abstract: Image-text retrieval aims to capture the semantic correspondence between images and texts, and it serves as a foundational component of multi-modal recommendation, search systems, and online shopping. Existing mainstream methods primarily focus on modeling the association of image-text pairs while neglecting the benefits that multi-task learning can bring to image-text retrieval. To this end, a multi-task visual semantic embedding network (MVSEN) is proposed for image-text retrieval. Specifically, we design two auxiliary tasks, text-text matching and multi-label classification, as semantic constraints that improve the generalization and robustness of the visual semantic embedding from a training perspective. Besides, we present an intra- and inter-modality interaction scheme to learn discriminative visual and textual feature representations by facilitating information flow within and between modalities. Subsequently, we utilize multi-layer graph convolutional networks in a cascading manner to infer the correlation of image-text pairs. Experimental results show that MVSEN outperforms state-of-the-art methods on two publicly available datasets, Flickr30K and MSCOCO, with rSum improvements of 8.2% and 3.0%, respectively.
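A minimal sketch of the multi-task objective suggested by the abstract: a bidirectional hardest-negative ranking loss for image-text matching, plus auxiliary text-text matching and multi-label classification terms. The loss weights, margin, and dimensions are illustrative assumptions, not MVSEN's actual configuration.

```python
# Sketch of a multi-task visual-semantic-embedding loss under assumed weights.
import torch
import torch.nn.functional as F

def hinge_ranking_loss(img, txt, margin=0.2):
    """Bidirectional hardest-negative triplet loss on L2-normalised embeddings."""
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    scores = img @ txt.t()                               # [B, B] cosine similarities
    pos = scores.diag().unsqueeze(1)
    cost_s = (margin + scores - pos).clamp(min=0)        # image -> hardest caption
    cost_im = (margin + scores - pos.t()).clamp(min=0)   # caption -> hardest image
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_s, cost_im = cost_s.masked_fill(mask, 0), cost_im.masked_fill(mask, 0)
    return cost_s.max(1)[0].mean() + cost_im.max(0)[0].mean()

B, D, C = 32, 1024, 80
img, txt, txt_aug = torch.randn(B, D), torch.randn(B, D), torch.randn(B, D)
label_logits, labels = torch.randn(B, C), torch.randint(0, 2, (B, C)).float()
loss = (hinge_ranking_loss(img, txt)                     # main image-text task
        + 0.5 * hinge_ranking_loss(txt, txt_aug)         # auxiliary text-text matching
        + 0.5 * F.binary_cross_entropy_with_logits(label_logits, labels))  # multi-label task
print(loss.item())
```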
Funding: National Natural Science Foundation of China, Grant/Award Numbers: 61671064, 61732005; National Key Research and Development Program of China, Grant/Award Number: 2018YFC0831700.
Abstract: Using sarcasm on social media platforms to express negative opinions toward a person or object has become increasingly common. However, detecting sarcasm across different forms of communication can be difficult because of conflicting sentiments. In this paper, we introduce a contrasting sentiment-based model for multimodal sarcasm detection (CS4MSD), which identifies inconsistent emotions by leveraging a CLIP knowledge module to produce sentiment features for both text and image. Five external sentiments are then introduced to prompt the model to learn sentiment preferences across modalities. Furthermore, we highlight the importance of verbal descriptions embedded in illustrations and incorporate additional knowledge-sharing modules to fuse such image-like features. Experimental results demonstrate that our model achieves state-of-the-art performance on the public multimodal sarcasm dataset.
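Purely as an illustration of the contrasting-sentiment idea (not the CS4MSD architecture), the sketch below projects CLIP-style image and text features onto five assumed sentiment categories and scores their inconsistency with a symmetric KL divergence; a large score flags the kind of conflicting sentiment the abstract associates with sarcasm.

```python
# Hypothetical sentiment-contrast scorer; dimensions and the five categories are
# assumptions made for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentimentContrast(nn.Module):
    def __init__(self, dim=512, n_sentiments=5):
        super().__init__()
        # A shared head maps either modality onto scores for five sentiment classes.
        self.head = nn.Linear(dim, n_sentiments)

    def forward(self, img_feat, txt_feat):
        p_img = F.softmax(self.head(img_feat), dim=-1)
        p_txt = F.softmax(self.head(txt_feat), dim=-1)
        # Symmetric KL divergence: larger value = more contrasting sentiment.
        kl_it = (p_img * (p_img.add(1e-8).log() - p_txt.add(1e-8).log())).sum(-1)
        kl_ti = (p_txt * (p_txt.add(1e-8).log() - p_img.add(1e-8).log())).sum(-1)
        return 0.5 * (kl_it + kl_ti)

model = SentimentContrast()
img, txt = torch.randn(4, 512), torch.randn(4, 512)  # e.g. CLIP image/text features
print(model(img, txt))   # per-example inconsistency scores
```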
Abstract: With advances in parameter-efficient transfer learning, it has become feasible to adapt large pre-trained language models to downstream tasks under low-cost and low-resource conditions. However, applying this technique to multimodal knowledge transfer introduces a significant challenge: ensuring alignment across modalities while minimizing the number of additional parameters required for downstream task adaptation. This paper introduces UniTrans, a framework for efficient knowledge transfer across multiple modalities. UniTrans leverages Vector-based Cross-modal Random Matrix Adaptation to enable fine-tuning with minimal parameter overhead. To further enhance modality alignment, we introduce two key components: the Multimodal Consistency Alignment Module and the Query-Augmentation Side Network, both optimized for scenarios with extremely limited trainable parameters. Extensive evaluations on various cross-modal downstream tasks show that our approach surpasses state-of-the-art methods while using just 5% of their trainable parameters, and on certain benchmarks it even outperforms fully fine-tuned models.
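"Vector-based Random Matrix Adaptation" suggests a VeRA-style layer: the low-rank matrices are frozen random projections and only two small scaling vectors per layer are trained. The sketch below follows that general recipe; the rank, initialization, and sharing scheme are assumptions rather than UniTrans's exact implementation.

```python
# VeRA-style adapter sketch: frozen random A/B matrices, trainable scaling vectors.
import torch
import torch.nn as nn

class VeRALinear(nn.Module):
    def __init__(self, base: nn.Linear, A: torch.Tensor, B: torch.Tensor):
        super().__init__()
        self.base = base                      # frozen pretrained projection
        for p in self.base.parameters():
            p.requires_grad = False
        self.register_buffer("A", A)          # frozen random [r, in_features]
        self.register_buffer("B", B)          # frozen random [out_features, r]
        self.d = nn.Parameter(torch.ones(A.size(0)))    # trainable rank-wise scale
        self.b = nn.Parameter(torch.zeros(B.size(0)))   # trainable output-wise scale

    def forward(self, x):
        delta = (x @ self.A.t()) * self.d     # scale the rank-r projection
        delta = (delta @ self.B.t()) * self.b # scale each output dimension
        return self.base(x) + delta

r, d_in, d_out = 16, 768, 768
A = torch.randn(r, d_in) / d_in ** 0.5
B = torch.randn(d_out, r) / r ** 0.5
layer = VeRALinear(nn.Linear(d_in, d_out), A, B)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(layer(torch.randn(2, d_in)).shape, trainable)   # only r + d_out trainable params
```

In practice the random matrices A and B can be shared across all adapted layers, which is what keeps the trainable-parameter budget to a few hundred values per layer.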
Abstract: Cross-modal image-text retrieval is a fundamental task in bridging vision and language. It faces two main challenges that are typically not well addressed in previous work. 1) Generalizability: existing methods often assume a strong semantic correlation between each text-image pair and are thus difficult to generalize to real-world scenarios where weak correlation dominates. 2) Efficiency: many recent works adopt a single-tower architecture with heavy detectors, which is inefficient at inference because the costly computation must be repeated for each text-image pair. To overcome these two challenges, we propose a two-tower cross-modal contrastive learning (CMCL) framework. Specifically, we first devise a two-tower architecture that places the text and image modalities in a unified feature space where they can be compared directly, alleviating the heavy computation during inference. We further introduce a simple yet effective module named multi-grid split (MGS) to learn fine-grained image features without using detectors. Last but not least, we apply a cross-modal contrastive loss to the global image and text features to learn their weak correlation and thus achieve high generalizability. To validate that CMCL readily generalizes to real-world scenarios, we construct a large multi-source image-text dataset called the weak semantic correlation dataset (WSCD). Extensive experiments show that CMCL outperforms the state of the art while being much more efficient.
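The global contrastive objective of a two-tower model can be sketched as a symmetric InfoNCE loss over normalized image and text embeddings, so inference only requires comparing precomputed features. The encoder outputs and temperature below are placeholders, not the paper's settings.

```python
# Two-tower contrastive loss sketch: matched pairs sit on the diagonal of the
# similarity matrix, and both retrieval directions are supervised.
import torch
import torch.nn.functional as F

def cmcl_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: [B, D] global features from the image and text towers."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # [B, B] similarity matrix
    targets = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, targets)        # image -> text
                  + F.cross_entropy(logits.t(), targets)) # text -> image

img, txt = torch.randn(8, 256), torch.randn(8, 256)
print(cmcl_contrastive_loss(img, txt).item())
```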
Funding: Supported by the National Natural Science Foundation of China (No. 62302540), the Open Foundation of the Henan Key Laboratory of Cyberspace Situation Awareness (No. HNTS2022020), the Natural Science Foundation of Henan Province Youth Science Fund Project (No. 232300420422), and the Natural Science Foundation of Zhongyuan University of Technology (No. K2023QN018).
Abstract: As social networks become increasingly complex, contemporary fake news often includes textual descriptions of events accompanied by corresponding images or videos. Fake news that spans multiple modalities is more likely to create a misleading perception among users. While early research primarily focused on text-based features for fake news detection, there has been relatively little exploration of learning shared representations in multimodal (text and visual) contexts. To address these limitations, this paper introduces a multimodal model for detecting fake news that relies on similarity reasoning and adversarial networks. The model employs Bidirectional Encoder Representations from Transformers (BERT) and a Text Convolutional Neural Network (Text-CNN) to extract textual features, while the pre-trained 19-layer Visual Geometry Group network (VGG-19) extracts visual features. The model then establishes similarity representations between the textual features extracted by Text-CNN and the visual features through similarity learning and reasoning. Finally, these features are fused to improve the accuracy of fake news detection, and adversarial networks are employed to investigate the relationship between fake news and events. The proposed model is validated on publicly available multimodal datasets from Weibo and Twitter. Experimental results demonstrate that our approach achieves superior performance on Twitter, with an accuracy of 86%, surpassing traditional unimodal models and existing multimodal models, while on the Weibo dataset it surpasses the benchmark models across multiple metrics. The application of similarity reasoning and adversarial networks thus significantly enhances multimodal fake news detection. However, the current work is limited to fusing only the text and image modalities; future research should integrate features from additional modalities to represent the multifaceted information of fake news more comprehensively.
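A compact sketch of the fusion step described above, with assumed feature dimensions rather than the paper's exact network: textual (BERT/Text-CNN) and visual (VGG-19) features are projected into a shared space, a similarity score between them is computed, and the concatenation is classified as fake or real. The adversarial event discriminator is omitted for brevity.

```python
# Hypothetical similarity-fusion classifier head; not the published architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityFusionClassifier(nn.Module):
    def __init__(self, text_dim=768, img_dim=4096, shared_dim=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, shared_dim)    # e.g. BERT/Text-CNN features
        self.img_proj = nn.Linear(img_dim, shared_dim)      # e.g. VGG-19 fc features
        self.classifier = nn.Linear(shared_dim * 2 + 1, 2)  # fake vs. real

    def forward(self, text_feat, img_feat):
        t = torch.tanh(self.text_proj(text_feat))
        v = torch.tanh(self.img_proj(img_feat))
        # Cross-modal similarity acts as the "similarity reasoning" signal.
        sim = F.cosine_similarity(t, v, dim=-1, eps=1e-8).unsqueeze(-1)
        return self.classifier(torch.cat([t, v, sim], dim=-1))

model = SimilarityFusionClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 4096))
print(logits.shape)   # [4, 2]
```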
Abstract: A computer vision approach based on OpenAI's CLIP, a model capable of scoring text-image pairs, is used to create an AI agent for Dixit, a game that requires creative linking between images and text. This paper establishes baseline accuracies both for matching the correct image to a hint and for matching human preferences. A dataset created by previous work on Dixit is used for testing. CLIP is applied by comparing a hint against multiple candidate images and previous hints, achieving a final accuracy of 0.5011, which surpasses previous results.
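The matching step can be reproduced with a public CLIP checkpoint from Hugging Face: score one hint against the candidate card images and pick the highest-scoring card. The checkpoint name, prompt format, and the synthetic stand-in images are assumptions; the paper does not state which CLIP variant the agent uses.

```python
# Sketch of CLIP-based hint-to-card matching using the public
# "openai/clip-vit-base-patch32" checkpoint (an assumed choice).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

hint = "a journey into the unknown"
# Solid-color stand-ins for the six card images on the table.
cards = [Image.new("RGB", (224, 224), color=(i * 40, 90, 160)) for i in range(6)]

inputs = processor(text=[hint], images=cards, return_tensors="pt", padding=True)
with torch.no_grad():
    logits_per_text = model(**inputs).logits_per_text   # [1, 6] hint-image scores
probs = logits_per_text.softmax(dim=-1)
print("guessed card:", probs.argmax(dim=-1).item())
```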
Abstract: Researchers have attempted to measure the success of crowdfunding campaigns using a variety of determinants, such as the descriptions of the campaigns, the amount of the funding goal, and project characteristics. Although many determinants of success have been reported in the literature, it remains unclear whether the cover photo and the text in the title and description can be combined in a fusion classifier to better predict a campaign's success. In this work, we focus on the performance of crowdfunding campaigns on GoFundMe across a wide variety of funding categories. We analyze the attributes available at the launch of a campaign and identify the attributes that are important for each category. Furthermore, we develop a fusion classifier based on random forests that significantly improves the prediction results, suggesting effective ways to make a campaign successful.
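A small sketch of feature-level fusion with a random forest in the spirit of the abstract: text, cover-photo, and campaign-metadata features are concatenated and fed to a single classifier. All features and labels below are synthetic placeholders, not the GoFundMe data or the authors' feature set.

```python
# Illustrative fusion classifier on synthetic data; feature blocks stand in for
# text, cover-photo, and campaign-metadata attributes.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
text_feats = rng.normal(size=(n, 50))    # e.g. TF-IDF/embedding features of title + description
photo_feats = rng.normal(size=(n, 20))   # e.g. CNN features of the cover photo
campaign_meta = rng.normal(size=(n, 5))  # e.g. funding goal, category indicators
y = (text_feats[:, 0] + photo_feats[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)

X = np.hstack([text_feats, photo_feats, campaign_meta])   # feature-level fusion
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```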