Abstract: Pixel-level structure segmentation has attracted considerable attention, playing a crucial role in autonomous driving within the metaverse and enhancing comprehension in light field-based machine vision. However, current light field modeling methods fail to integrate appearance and geometric structural information into a coherent semantic space, thereby limiting the capability of light field transmission for visual knowledge. In this paper, we propose a general light field modeling method for pixel-level structure segmentation, comprising a generative light field prompting encoder (LF-GPE) and a prompt-based masked light field pretraining (LF-PMP) network. Our LF-GPE, serving as a light field backbone, extracts appearance and geometric structural cues simultaneously and aligns these features into a unified visual space, facilitating semantic interaction. Meanwhile, during the pretraining phase, our LF-PMP integrates a mixed light field and a multi-view light field reconstruction. It prioritizes the geometric structural properties of the light field, enabling the light field backbone to accumulate a wealth of prior knowledge. We evaluate our pretrained LF-GPE on two downstream tasks: light field salient object detection and semantic segmentation. Experimental results demonstrate that LF-GPE can effectively learn high-quality light field features and achieve highly competitive performance in pixel-level segmentation tasks.
Funding: National Natural Science Foundation of China (No. 62176052).
Abstract: Large language models (LLMs) have demonstrated remarkable generalization abilities across multiple tasks in natural language processing (NLP). For multi-step reasoning tasks, chain-of-thought (CoT) prompting facilitates step-by-step thinking, leading to improved performance. However, despite significant advancements in LLMs, current CoT prompting performs suboptimally on smaller-scale models that have fewer parameters. Additionally, the common paradigm of few-shot CoT prompting relies on a set of manual demonstrations, with performance contingent on the quality of these annotations and varying with task-specific requirements. To address these limitations, we propose a select-and-answer prompting method (SAP) to enhance language model performance on reasoning tasks without the need for manual demonstrations. This method comprises two primary steps: (1) guiding the model to conduct a preliminary analysis and generate several candidate answers based on the prompt; (2) allowing the model to provide a final answer derived from these candidate answers. The proposed prompting strategy is evaluated across two language models of varying sizes and six datasets. On ChatGLM-6B, SAP consistently outperforms few-shot CoT across all datasets. For GPT-3.5, SAP achieves comparable performance to few-shot CoT and outperforms zero-shot CoT in most cases. These experimental results indicate that SAP can significantly improve the accuracy of language models in reasoning tasks.
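The two-step structure described in this abstract can be illustrated with a minimal sketch. The prompt wording, the function name `select_and_answer`, and the stand-in `fake_llm` are all assumptions made for illustration; the abstract does not give the authors' actual prompts, and no real model API is called here.

```python
from typing import Callable


def select_and_answer(question: str, llm: Callable[[str], str],
                      n_candidates: int = 3) -> str:
    """Hypothetical two-step prompting loop: propose candidates, then select."""
    # Step 1: ask for a preliminary analysis plus several candidate answers.
    step1 = (
        f"Question: {question}\n"
        f"Analyze the problem briefly, then list {n_candidates} possible answers."
    )
    candidates = llm(step1)

    # Step 2: feed the candidates back and ask for one final answer.
    step2 = (
        f"Question: {question}\n"
        f"Candidate answers:\n{candidates}\n"
        "Choose the final answer from the candidates above."
    )
    return llm(step2).strip()


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (assumption): returns canned replies."""
    if "Candidate answers:" in prompt:
        return " 4 "
    return "A) 3\nB) 4\nC) 5"


print(select_and_answer("What is 2 + 2?", fake_llm))
```

Any callable that maps a prompt string to a reply string can replace `fake_llm`; no manual demonstrations are needed in either prompt, which is the property the abstract emphasizes.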
Funding: Supported by the National Natural Science Foundation of China (62222212).
Abstract: Information extraction (IE) aims to automatically identify and extract information of specific interest from raw texts. Despite the abundance of solutions based on fine-tuning pretrained language models, IE in few-shot and zero-shot scenarios remains highly challenging due to the scarcity of training data. Large language models (LLMs), on the other hand, can generalize well to unseen tasks with few-shot demonstrations or even zero-shot instructions and have demonstrated impressive ability on a wide range of natural language understanding and generation tasks. Nevertheless, it is unclear whether such effectiveness can be replicated in IE, where the target tasks involve specialized schemas and quite abstract entity or relation concepts. In this paper, we first examine the validity of LLMs in executing IE tasks with an established prompting strategy and further propose multiple types of augmented prompting methods, including the structured fundamental prompt (SFP), the structured interactive reasoning prompt (SIRP), and the voting-enabled structured interactive reasoning prompt (VESIRP). The experimental results demonstrate that while direct prompting yields inferior performance, the proposed augmented prompting methods significantly improve the extraction accuracy, achieving comparable or even better performance (e.g., zero-shot FewNERD, FewNERD-INTRA) than state-of-the-art methods that require large-scale training samples. This study represents a systematic exploration of employing instruction-following LLMs for the task of IE. It not only establishes a performance benchmark for this novel paradigm but, more importantly, validates a practical technical pathway through the proposed prompt enhancement method, offering a viable solution for efficient IE in low-resource settings.
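The abstract names a voting-enabled variant (VESIRP) but does not spell out the voting rule. As a rough sketch only, one plausible reading is majority voting over the entity sets returned by several independent prompting runs; the `Entity` representation, the function name `vote_entities`, and the `min_votes` threshold are assumptions, not details from the paper.

```python
from collections import Counter
from typing import List, Set, Tuple

# Hypothetical representation of one extracted entity: (span text, type).
Entity = Tuple[str, str]


def vote_entities(runs: List[Set[Entity]], min_votes: int) -> Set[Entity]:
    """Keep the entities that appear in at least `min_votes` runs."""
    # Each run is a set, so an entity is counted at most once per run.
    counts = Counter(entity for run in runs for entity in run)
    return {entity for entity, votes in counts.items() if votes >= min_votes}


runs = [
    {("Paris", "LOC"), ("Apple", "ORG")},
    {("Paris", "LOC")},
    {("Paris", "LOC"), ("Apple", "LOC")},
]
print(vote_entities(runs, min_votes=2))
```

Raising `min_votes` trades recall for precision: inconsistent extractions such as the two conflicting labels for "Apple" are dropped once a majority is required.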
Abstract: With the further implementation of the knowledge innovation program (KIP), piloted by the Chinese Academy of Sciences (CAS), encouraging progress has been made in contributing to the development of the country's high-tech industry and in forging S&T cooperation with local governments and industrial sectors. This was revealed at the Second CAS Conference on High-tech Industrialization, held April 25-29 in Shenzhen, Guangdong Province.
Funding: Supported by the National Natural Science Foundation of China (Nos. 62176106 and U1836220), the Special Scientific Research Project of School of Emergency Management of Jiangsu University (No. KY-A-01), the Project of Faculty of Agricultural Engineering of Jiangsu University (No. NGXB20240101), the Post-graduate Research & Practice Innovation Program of Jiangsu Province (Nos. KYCX22_3668 and KYCX21_3373), and the Jiangsu Key Research and Development Plan (No. BE2020036).
Abstract: Semi-supervised sound event detection (SSED) tasks typically leverage a large amount of unlabeled and synthetic data to facilitate model generalization during training, reducing overfitting on a limited set of labeled data. However, the generalization training process often encounters challenges from noisy interference introduced by pseudo-labels or domain knowledge gaps. To alleviate noisy interference in class distribution learning, we propose an efficient semi-supervised class distribution learning method through dynamic prompt tuning, named prompting class distribution optimization (PADO). Specifically, when modeling real labeled data, PADO dynamically incorporates independent learnable prompt tokens to explore prior knowledge about the true distribution. Then, the prior knowledge serves as prompt information, dynamically interacting with the posterior noisy-class distribution information. In this way, PADO achieves class distribution optimization while maintaining model generalization, leading to a significant improvement in the efficiency of class distribution learning. Compared with state-of-the-art methods on the SSED datasets from the DCASE 2019, 2020, and 2021 challenges, PADO achieves significant performance improvements. Furthermore, it is readily extendable to other benchmark models.
Funding: Supported by the Beijing Natural Science Foundation, China (Nos. JQ23018 and L221004) and the National Natural Science Foundation of China (Nos. 62036012, 62072456 and 62106262), and sponsored by the SMP-IDATA Open Youth Fund, China.
Abstract: Automatic question tagging (AQT) represents a crucial task in community question answering (CQA) websites. Its pivotal role lies in substantially augmenting user experience through the optimization of question-answering efficiency. Existing question tagging models focus on the features of questions and tags, ignoring external knowledge of the real world. Large language models can work as knowledge engines for incorporating real-world facts for different tasks. However, it is difficult for large language models to output tags that exist in the database of CQA websites. To address this challenge, we propose a large language model enhanced question tagging method called LLMEQT to perform the question tagging task. In LLMEQT, a traditional question tagging method is first applied to pre-retrieve tags for questions. Then prompts are formulated for LLMs to comprehend the task and select more suitable tags from the candidate tags for questions. Results of our experiments on two real-world datasets demonstrate that LLMEQT significantly enhances the automatic question tagging performance for CQA, surpassing the performance of state-of-the-art methods.
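The retrieve-then-select idea in this abstract can be sketched as follows. The prompt text, the helper names `build_selection_prompt` and `select_tags`, and the canned `fake_llm` reply are hypothetical; the sketch only illustrates how restricting the reply to pre-retrieved candidates keeps every output tag inside the site's tag database.

```python
from typing import Callable, List


def build_selection_prompt(question: str, candidates: List[str], k: int) -> str:
    """Hypothetical prompt asking the model to choose among retrieved tags."""
    tag_list = ", ".join(candidates)
    return (
        f"Question: {question}\n"
        f"Candidate tags: {tag_list}\n"
        f"Select up to {k} tags from the candidates, separated by commas."
    )


def select_tags(question: str, candidates: List[str],
                llm: Callable[[str], str], k: int = 3) -> List[str]:
    """Return the model's choices, filtered to the pre-retrieved candidates."""
    reply = llm(build_selection_prompt(question, candidates, k))
    allowed = {c.lower(): c for c in candidates}
    chosen: List[str] = []
    for raw in reply.split(","):
        tag = allowed.get(raw.strip().lower())
        # Drop anything the model invented and skip repeated tags.
        if tag is not None and tag not in chosen:
            chosen.append(tag)
    return chosen[:k]


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (assumption)."""
    return "python, Pandas, made-up-tag"


print(select_tags("How do I merge two DataFrames?",
                  ["python", "pandas", "numpy"], fake_llm))
```

The candidate list would come from whatever traditional tagging model performs the pre-retrieval step; it is passed in directly here to keep the example self-contained.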
Abstract: The rise of large language models (LLMs) has shifted the community from single-task-oriented natural language processing (NLP) research to a holistic end-to-end multi-task learning paradigm. Along this line of research, LLM-based prompting methods have attracted much attention, partially due to the technological advantages brought by prompt engineering (PE) as well as the underlying NLP principles disclosed by various prompting methods. Traditional supervised learning usually requires training a model on labeled data and then making predictions. In contrast, PE methods directly use the powerful capabilities of existing LLMs (e.g., GPT-3 and GPT-4) by composing appropriate prompts, especially under few-shot or zero-shot scenarios. Facing the abundance of studies related to prompting and the ever-evolving nature of this field, this article aims to 1) illustrate a novel perspective for reviewing existing PE methods within the well-established communication theory framework, 2) facilitate a deeper understanding of the developing trends of existing PE methods used in three typical tasks, and 3) shed light on promising research directions for future PE methods.
Funding: Supported by the National Key Research and Development Program of China (2023YFF0906502) and the Postgraduate Research and Innovation Project of Hunan Province under Grant (CX20240473).
Abstract: Owing to the digital transformation of cultural institutions and the substantial influence of social media platforms, the demand for visual communication to promote traditional cultural artifacts online keeps increasing. As an effective medium, posters serve to attract public attention and facilitate broader engagement with cultural artifacts. However, existing poster generation methods mainly rely on fixed templates and manual design, which limits their scalability and adaptability to the diverse visual and semantic features of the artifacts. Therefore, we propose CAPGen, an automated aesthetic Cultural Artifacts Poster Generation framework built on a Multimodal Large Language Model (MLLM) with integrated iterative optimization. We first collaborated with designers to define principles of graphic design for cultural artifact posters, which guide the MLLM in generating layout parameters. We then rendered these parameters into posters. Finally, we refined the posters using an MLLM integrated with a multi-round iterative optimization mechanism. Qualitative results show that CAPGen consistently outperforms baseline methods in both visual quality and aesthetic performance. Furthermore, ablation studies indicate that the prompt, the iterative optimization mechanism, and the design principles significantly enhance the effectiveness of poster generation.
Abstract: Effective water management and flood prevention are critical challenges for both urban and rural areas, necessitating precise and prompt monitoring of waterbodies. As a fundamental step in the monitoring process, waterbody segmentation involves precisely delineating waterbody boundaries from imagery. Previous research using satellite images often lacks the resolution and contextual detail needed for local-scale analysis. In response, this study leverages common natural images, which are more easily accessible and provide higher resolution and more contextual information than satellite images. However, the segmentation of waterbodies from ordinary images faces several obstacles, including variations in lighting, occlusions from objects like trees and buildings, and reflections on the water surface, all of which can mislead algorithms. Additionally, the diverse shapes and textures of waterbodies, alongside complex backgrounds, further complicate this task. While large-scale vision models pre-trained on large datasets have typically been leveraged for their generalizability across various downstream tasks, their application to waterbody segmentation from ground-level images remains underexplored. Hence, this research proposes the Visual Aquatic Generalist (VAGen) as a countermeasure. A lightweight model for waterbody segmentation inspired by visual In-Context Learning (ICL) and Visual Prompting (VP), VAGen refines large visual models by adding learnable perturbations to enhance the quality of prompts in ICL. Experimental results show that VAGen achieves a significant increase in the mean Intersection over Union (mIoU) metric, a 22.38% enhancement compared to the baseline model that lacks learnable prompts. Moreover, VAGen surpasses the current state-of-the-art (SOTA) task-specific models designed for waterbody segmentation by 6.20%. The performance evaluation and analysis of VAGen indicate that it substantially reduces the number of trainable parameters and computational overhead, and demonstrate the feasibility of deploying it on cost-limited devices, including unmanned aerial vehicles (UAVs) and mobile computing platforms. This study thereby makes a valuable contribution to the field of computer vision, offering practical solutions for engineering applications related to urban flood monitoring, agricultural water resource management, and environmental conservation efforts.
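For readers unfamiliar with the mIoU metric reported above, a minimal sketch of the standard definition follows: per-class intersection over union, averaged over the classes that occur. The function name `mean_iou` and the flattened label lists are illustrative assumptions; this is not the authors' evaluation code, and it does not reproduce their reported numbers.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean Intersection over Union over flattened per-pixel class labels."""
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both masks
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy 4-pixel example: 0 = background, 1 = water.
pred = [1, 1, 0, 0]
target = [1, 0, 0, 0]
print(mean_iou(pred, target, num_classes=2))
```

In the toy example the background IoU is 2/3 and the water IoU is 1/2, so the mean is 7/12.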
Funding: This work was supported by the National Key Research and Development Program of China in the 13th Five-Year (No. 2016YFB0801301) and in the 14th Five-Year (Nos. 2021YFF0602103, 2021YFF0602102, and 20210Y1702).
Abstract: This paper explores the Vision Transformer (ViT) backbone for Unsupervised Domain Adaptive (UDA) person Re-Identification (Re-ID). While some recent studies have validated ViT for supervised Re-ID, no study has yet used ViT for UDA Re-ID. We observe that the ViT structure provides a unique advantage for UDA Re-ID, i.e., it has a prompt (the learnable class token) at its bottom layer that can be used to efficiently condition the deep model for the underlying domain. To utilize this advantage, we propose a novel two-stage UDA pipeline named Prompting And Tuning (PAT), which consists of a prompt learning stage and a subsequent fine-tuning stage. In the first stage, PAT roughly adapts the model from the source to the target domain by learning the prompts for the two domains, while in the second stage, PAT fine-tunes the entire backbone for further adaptation to increase the accuracy. Although both stages adopt pseudo labels for training, we show that they have different data preferences. With these two preferences, prompt learning and fine-tuning integrate well with each other and jointly yield a competitive PAT method for UDA Re-ID.
Abstract: We present a novel framework, CLIP-SP, and a novel adaptive prompt method to leverage pre-trained knowledge from CLIP for scene parsing. Our approach addresses the limitations of DenseCLIP, which demonstrates that CLIP pre-trained models provide better image segmentation than ImageNet pre-trained models, but struggles with rough pixel-text score maps for complex scene parsing. We argue that, because they contain all textual information in a dataset, the pixel-text score maps, i.e., dense prompts, are inevitably mixed with noise. To overcome this challenge, we propose a two-step method. First, we extract visual and language features and perform multi-label classification to identify the most likely categories in the input images. Second, based on the top-k categories and confidence scores, our method generates scene tokens, which can be treated as adaptive prompts for implicit modeling of scenes, and incorporates them into the visual features fed into the decoder for segmentation. Our method imposes a constraint on prompts and suppresses the probability of irrelevant categories appearing in the scene parsing results. Our method achieves competitive performance, limited by the available visual-language pre-trained models. Our CLIP-SP outperforms DenseCLIP by 1.14% (in terms of mIoU) on ADE20K, using a ResNet-50 backbone.
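The top-k selection step described above can be illustrated with a short sketch. It only covers picking the k most confident categories from multi-label classification scores; how the scene tokens are then built and fused with the visual features is not specified in the abstract and is not modeled here. The function name `select_top_k` and the example scores are hypothetical.

```python
def select_top_k(scores, k):
    """Return the k (category, confidence) pairs with the highest confidence.

    `scores` maps a category name to a confidence in [0, 1], as a
    multi-label classifier might produce (an assumption for this sketch).
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical multi-label classification output for one image.
scores = {"sky": 0.92, "tree": 0.75, "car": 0.10, "road": 0.64}
top2 = select_top_k(scores, k=2)
```

Here `top2` holds the two most confident categories, sky and tree, together with their scores; low-confidence categories such as car are excluded before any prompt is generated.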
Abstract: Instructional videos, which naturally contain abundant clip-narration pairs, are very useful for completing complex daily tasks. Existing works on procedure understanding focus on pretraining various video-language models with these pairs and then fine-tuning downstream classifiers and localizers in a predetermined category space. These video-language models are proficient at representing short-term actions, basic objects, and their combinations, but they are still far from understanding long-term procedures. In addition, a predetermined procedure category space suffers from combinatorial explosion and is inherently ill-suited to unseen procedures. Therefore, we propose a novel compositional prompt learning (CPL) framework to understand long-term procedures by prompting short-term video-language models and reformulating several classical procedure understanding tasks into general video-text matching problems. Specifically, the proposed CPL consists of one visual prompt and three compositional textual prompts (the action prompt, object prompt, and procedure prompt), which compositionally distill knowledge from short-term video-language models to facilitate long-term procedure understanding. In addition, the task reformulation enables CPL to perform well in zero-shot, few-shot, and fully supervised settings. Extensive experiments on two widely used datasets for procedure understanding demonstrate the effectiveness of the proposed approach.
Funding: Supported by the National Natural Science Foundation of China (NSFC) (62272342, 62020106004, and 62306212), the 2022 Tianjin Research and Innovation Project under grant number 2022BKY160, and the Tianjin University of Technology 2022 Postgraduate Research and Innovation Practice Project under grant number YJ2238.
Abstract: Pixel-level structure segmentation has attracted considerable attention, playing a crucial role in autonomous driving within the metaverse and enhancing comprehension in light field-based machine vision. However, current light field modeling methods fail to integrate appearance and geometric structural information into a coherent semantic space, thereby limiting the capability of light field transmission for visual knowledge. In this paper, we propose a general light field modeling method for pixel-level structure segmentation, comprising a generative light field prompting encoder (LF-GPE) and a prompt-based masked light field pretraining (LF-PMP) network. Our LF-GPE, serving as a light field backbone, extracts appearance and geometric structural cues simultaneously. It aligns these features into a unified visual space, facilitating semantic interaction. Meanwhile, during the pretraining phase, our LF-PMP integrates a mixed light field and a multi-view light field reconstruction. It prioritizes the geometric structural properties of the light field, enabling the light field backbone to accumulate a wealth of prior knowledge. We evaluate our pretrained LF-GPE on two downstream tasks: light field salient object detection and semantic segmentation. Experimental results demonstrate that LF-GPE can effectively learn high-quality light field features and achieve highly competitive performance in pixel-level segmentation tasks.