Funding: supported by the Natural Science Foundation of China (62394311, 62394310), the Beijing Natural Science Foundation (QY24034), the National Biomedical Imaging Facility Grant, and startup funds from Peking University Health Science Center.
Abstract: Background: Source-free unsupervised domain adaptation (SFUDA) methods aim to address the challenge of domain shift while preserving data privacy. Existing SFUDA approaches construct reliable, confident pseudo-labels for target-domain data through denoising methods, thereby guiding the training of the target-domain model. The effectiveness of these denoising approaches depends on the size of the domain gap between the source and target domains: a marked shift can leave the pseudo-labels unreliable even after denoising. Methods: We propose visual prompt source-free domain adaptation (VP-SFDA), a novel 2-stage framework for SFUDA. The first stage, the prompting process, introduces an input-specific visual prompt that bridges the target-domain data to the source-domain distribution. Our method uses visual prompts and a batch normalization constraint to enable the alignment model to learn domain-specific knowledge and align the target-domain data with the source-domain distribution. The second stage, the adaptation process, optimizes the segmentation model from the source domain to the target domain through denoising techniques, ultimately enhancing performance. Results: Our study presents a comparative analysis of several SFUDA techniques within the VP-SFDA framework across 4 tasks: abdominal magnetic resonance imaging (MRI) to computed tomography (CT), abdominal CT to MRI, cardiac MRI to CT, and cardiac CT to MRI. Notably, in the abdominal MRI-to-CT adaptation task, the VP-OS method achieved a remarkable improvement, increasing the average Dice score from 0.658 to 0.773 (P<0.01) and reducing the average surface distance (ASD) from 3.489 to 2.961 (P<0.01). Similarly, the VP-LD and VP-DPL methods also showed significant improvements over their base algorithms in both the abdominal and cardiac MRI-to-CT tasks. Conclusions: This paper proposes VP-SFDA, a novel 2-stage framework for SFUDA in medical imaging, which achieves superior performance through input-specific visual prompts and a batch normalization constraint for domain adaptation, coupled with denoising methods for enhanced results. Comparative experiments on 4 medical SFUDA tasks demonstrate that VP-SFDA surpasses existing methods, with ablation studies confirming the benefits of domain-specific patterns.
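The first-stage idea of aligning target inputs to stored source batch-norm statistics can be illustrated with a minimal sketch. This is a hypothetical toy, not the paper's implementation: the "image" is a short list of intensities, the "visual prompt" is a single additive offset, and the alignment loss penalizes the mismatch between input statistics and the source BN mean and standard deviation. An additive prompt only moves the mean, so that term has a closed-form solution.

```python
from statistics import mean, pstdev

# Hypothetical source BN statistics saved from the source model,
# and made-up target-domain intensities.
source_mean, source_std = 0.0, 1.0
target_image = [2.1, 2.9, 1.8, 3.2, 2.5]

def bn_alignment_loss(x, mu, sigma):
    """Squared mismatch between the input statistics and the source BN stats."""
    return (mean(x) - mu) ** 2 + (pstdev(x) - sigma) ** 2

# An additive prompt only shifts the mean, so the mean term is solved exactly.
prompt = source_mean - mean(target_image)
prompted = [v + prompt for v in target_image]

loss_before = bn_alignment_loss(target_image, source_mean, source_std)
loss_after = bn_alignment_loss(prompted, source_mean, source_std)
print(loss_before > loss_after)  # the prompt reduces the alignment loss
```

In practice such a prompt would be a learnable per-pixel tensor optimized by gradient descent against the full BN constraint; the sketch only shows why matching first-order statistics pulls target inputs toward the source distribution.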
Funding: funded by the Key Research and Development Program of Hubei Province, China (Grant No. 2023BEB024), the Young and Middle-aged Scientific and Technological Innovation Team Plan in Higher Education Institutions in Hubei Province, China (Grant No. T2023007), and the key projects of Hubei Provincial Department of Education (No. D20161403).
Abstract: With the rapid development of intelligent video surveillance technology, pedestrian re-identification has become increasingly important in multi-camera surveillance systems. This technology plays a critical role in enhancing public safety. However, traditional methods typically process images and text separately, applying upstream models directly to downstream tasks. This approach significantly increases the complexity of model training and the computational cost. Furthermore, the class imbalance common in existing training datasets limits improvements in model performance. To address these challenges, we propose an innovative framework named Person Re-ID Network Based on Visual Prompt Technology and Multi-Instance Negative Pooling (VPM-Net). First, we incorporate the Contrastive Language-Image Pre-training (CLIP) pre-trained model to accurately map visual and textual features into a unified embedding space, effectively mitigating inconsistencies in the data distribution and the training process. To enhance model adaptability and generalization, we introduce an efficient, task-specific Visual Prompt Tuning (VPT) technique, which improves the model's relevance to specific tasks. Additionally, we design two key modules: the Knowledge-Aware Network (KAN) and the Multi-Instance Negative Pooling (MINP) module. The KAN module significantly enhances the model's understanding of complex scenarios through deep contextual semantic modeling. The MINP module handles samples, effectively improving the model's ability to distinguish fine-grained features. The experimental outcomes across diverse datasets underscore the strong performance of VPM-Net. These results demonstrate the unique advantages and robust reliability of VPM-Net in fine-grained retrieval tasks.
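The CLIP-style matching used here can be sketched in a few lines. This is a hypothetical illustration, not VPM-Net itself: image and text features are assumed to live in one shared embedding space, and retrieval ranks gallery images by cosine similarity to a text query. All vectors and identifiers below are made up.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Made-up embedding of a text description, e.g. "person in red coat".
text_query = [0.9, 0.1, 0.2]

# Made-up image embeddings of candidates from other cameras.
gallery = {
    "cam1_0042": [0.8, 0.2, 0.1],
    "cam3_0007": [0.1, 0.9, 0.3],
}

# Retrieve the gallery image whose embedding best matches the text query.
best = max(gallery, key=lambda k: cosine(text_query, gallery[k]))
print(best)  # → cam1_0042
```

Because both modalities share one space, the same similarity function serves image-to-image and text-to-image retrieval, which is what makes the unified embedding useful for re-identification.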
Funding: supported in part by the National Key Research and Development Program of China (No. 2021ZD0112400); the National Natural Science Foundation of China (No. U1908214); the Program for Innovative Research Team at the University of Liaoning Province (No. LT2020015); the Support Plan for Key Field Innovation Team of Dalian (No. 2021RT06); the Support Plan for Leading Innovation Team of Dalian University (No. XLJ202010); the Program for the Liaoning Province Doctoral Research Starting Fund (No. 2022-BS-336); the Key Laboratory of Advanced Design and Intelligent Computing (Dalian University), Ministry of Education (No. ADIC2022003); and the Interdisciplinary Project of Dalian University (No. DLUXK-2023-QN-015).
Abstract: With recent advancements in robotic surgery, notable strides have been made in visual question answering (VQA). Existing VQA systems typically generate textual answers to questions but fail to indicate the location of the relevant content within the image. This limitation restricts the interpretative capacity of VQA models and their ability to explore specific image regions. To address this issue, this study proposes a grounded VQA model for robotic surgery, capable of localizing a specific region during answer prediction. Drawing inspiration from prompt learning in language models, a dual-modality prompt model was developed to enable precise multimodal information interactions. Specifically, two complementary prompters were introduced to effectively integrate visual and textual prompts into the encoding process of the model. A visual complementary prompter merges visual prompt knowledge with visual information features to guide accurate localization. The textual complementary prompter aligns visual information with textual prompt knowledge and textual information, guiding the textual features towards a more accurate inference of the answer. Additionally, a multiple iterative fusion strategy was adopted for comprehensive answer reasoning, ensuring high-quality generation of textual and grounded answers. The experimental results validate the effectiveness of the model, demonstrating its superiority over existing methods on the EndoVis-18 and EndoVis-17 datasets.
Abstract: Effective water management and flood prevention are critical challenges for both urban and rural areas, necessitating precise and prompt monitoring of waterbodies. As a fundamental step in the monitoring process, waterbody segmentation involves precisely delineating waterbody boundaries from imagery. Previous research using satellite images often lacks the resolution and contextual detail needed for local-scale analysis. This study addresses these challenges by leveraging common natural images, which are more easily accessible and provide higher resolution and richer context than satellite images. However, segmenting waterbodies from ordinary images faces several obstacles, including variations in lighting, occlusions from objects such as trees and buildings, and reflections on the water surface, all of which can mislead algorithms. Additionally, the diverse shapes and textures of waterbodies, alongside complex backgrounds, further complicate the task. While large-scale vision models pre-trained on large datasets have typically been leveraged for their generalizability across various downstream tasks, their application to waterbody segmentation from ground-level images remains underexplored. Hence, this research proposes the Visual Aquatic Generalist (VAGen), a lightweight model for waterbody segmentation inspired by visual In-Context Learning (ICL) and Visual Prompting (VP). VAGen refines large visual models by innovatively adding learnable perturbations that enhance the quality of prompts in ICL. In the experiments, VAGen achieved a 22.38% improvement in the mean Intersection over Union (mIoU) metric over the baseline model without learnable prompts, and surpassed the current state-of-the-art (SOTA) task-specific models for waterbody segmentation by 6.20%. The performance evaluation and analysis of VAGen indicate that it substantially reduces the number of trainable parameters and the computational overhead, and that it is feasible to deploy on cost-limited devices, including unmanned aerial vehicles (UAVs) and mobile computing platforms. This study thereby makes a valuable contribution to the field of computer vision, offering practical solutions for engineering applications in urban flood monitoring, agricultural water resource management, and environmental conservation.
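The mIoU metric used to evaluate segmentation above is straightforward to compute. A toy sketch on made-up binary masks (1 = water, 0 = background), averaging the per-class IoU over the two classes:

```python
# Hypothetical flattened prediction and ground-truth masks (not real data).
pred  = [1, 1, 0, 0, 1, 0]
truth = [1, 0, 0, 0, 1, 1]

def iou(p, t, cls):
    """Intersection over union for one class: |P ∩ T| / |P ∪ T|."""
    inter = sum(1 for a, b in zip(p, t) if a == cls and b == cls)
    union = sum(1 for a, b in zip(p, t) if a == cls or b == cls)
    return inter / union

# mIoU averages the IoU of the water class and the background class.
miou = (iou(pred, truth, 1) + iou(pred, truth, 0)) / 2
print(round(miou, 3))  # → 0.5
```

Reported percentage improvements such as the 22.38% mIoU gain are differences (or ratios, depending on convention) between such per-model averages over a test set.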
Funding: supported by the National Key R&D Program of China (No. 2022YFE0196100), the Innovation Capacity Enhancement Program Science and Technology Platform Project of Hebei Province (22567623H), and the Hebei University High-Level Innovative Talent Research Start-up Funding Project (No. 521000981092).
Abstract: Prompt learning has become crucial for adapting visual language models (VLMs) to downstream tasks. Although existing prompt learning models have made significant strides, they still face two major challenges: (1) too much attention is paid to learning the base classes, making it harder to generalize to novel classes; (2) most methods rely only on the context information provided by the prompt template, resulting in limited text features. In this study, we propose a new fine-tuning method for visual language models called Input-Enhanced Prompt Tuning (IEPT). IEPT improves the generalization of VLMs to downstream tasks by introducing two components: the Data Augmentation Framework (DAF) and the Category Generalization Optimizer (CGO). Specifically, the DAF employs large language models to resolve word ambiguity by obtaining more class-label context, and uses simple image augmentation to address the issue of limited features by providing more image samples. The CGO prevents overfitting by adding new class names during training. Experiments show that the performance of IEPT on various evaluation suites is better than or comparable to that of existing methods, covering base-to-novel generalization, domain generalization, and cross-dataset evaluation. Compared with the state-of-the-art method PromptSRC, IEPT achieves absolute improvements of 0.40% on base classes, 1.56% on novel classes, and 1.04% on the harmonic mean, averaged over 11 datasets. In addition, we present detailed ablation studies that validate the individual contributions of DAF and CGO to the overall performance of IEPT. Our code is available at https://github.com/ayuan0626/IEPT.
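The harmonic mean reported above is the standard summary for base-to-novel generalization: it rewards methods that do well on both base and novel classes rather than trading one for the other. A minimal sketch with made-up accuracy values (not the paper's numbers):

```python
def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of base- and novel-class accuracy: 2ab / (a + b)."""
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)

# Hypothetical accuracies in percent; the harmonic mean sits below the
# arithmetic mean whenever the two values differ.
print(round(harmonic_mean(80.0, 70.0), 2))  # → 74.67
```

Because the harmonic mean is dominated by the smaller term, a method that sacrifices novel-class accuracy for base-class gains scores worse than one that balances both.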
Funding: Project supported by the National Natural Science Foundation of China (Nos. 62306075 and 62101136), the China Postdoctoral Science Foundation (No. 2022TQ0069), the Natural Science Foundation of Shanghai, China (No. 21ZR1403600), the Shanghai Municipal Science and Technology Project, China (No. 20JC1419500), and the Shanghai Center for Brain Science and Brain-Inspired Technology, China.
Abstract: Prompt learning has attracted broad attention in computer vision since large pre-trained vision-language models (VLMs) exploded in popularity. Building on the close relationship between vision and language information established by VLMs, prompt learning has become a crucial technique in many important applications such as artificial intelligence generated content (AIGC). In this survey, we provide a progressive and comprehensive review of visual prompt learning as related to AIGC. We begin by introducing VLMs, the foundation of visual prompt learning. Then, we review visual prompt learning methods and prompt-guided generative models, and discuss how to improve the efficiency of adapting AIGC models to specific downstream tasks. Finally, we outline some promising research directions concerning prompt learning.