Abstract: With the advancements in parameter-efficient transfer learning techniques, it has become feasible to leverage large pre-trained language models for downstream tasks under low-cost and low-resource conditions. However, applying this technique to multimodal knowledge transfer introduces a significant challenge: ensuring alignment across modalities while minimizing the number of additional parameters required for downstream task adaptation. This paper introduces UniTrans, a framework aimed at facilitating efficient knowledge transfer across multiple modalities. UniTrans leverages Vector-based Cross-modal Random Matrix Adaptation to enable fine-tuning with minimal parameter overhead. To further enhance modality alignment, we introduce two key components: the Multimodal Consistency Alignment Module and the Query-Augmentation Side Network, specifically optimized for scenarios with extremely limited trainable parameters. Extensive evaluations on various cross-modal downstream tasks demonstrate that our approach surpasses state-of-the-art methods while using just 5% of their trainable parameters. Additionally, it achieves superior performance compared to fully fine-tuned models on certain benchmarks.
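The abstract names Vector-based Cross-modal Random Matrix Adaptation as the mechanism that keeps the trainable-parameter budget small. The paper's exact formulation is not reproduced here; the sketch below only illustrates the general VeRA-style idea it appears to build on, in which the low-rank matrices are frozen random projections and only two small scaling vectors per adapted layer are learned. All class and variable names are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VeRAStyleLinear(nn.Module):
    """Minimal sketch of vector-based random matrix adaptation (VeRA-style):
    the low-rank matrices A and B are frozen random projections, and only the
    scaling vectors d and b are trained, so the per-layer overhead is tiny."""
    def __init__(self, base: nn.Linear, rank: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # frozen pre-trained weight
        in_f, out_f = base.in_features, base.out_features
        # Frozen random matrices (in practice these can be shared across layers).
        self.register_buffer("A", torch.randn(rank, in_f) / in_f ** 0.5)
        self.register_buffer("B", torch.randn(out_f, rank) / rank ** 0.5)
        # Trainable scaling vectors -- the only new parameters.
        self.d = nn.Parameter(torch.ones(rank) * 0.1)
        self.b = nn.Parameter(torch.zeros(out_f))

    def forward(self, x):
        delta = (x @ self.A.t()) * self.d            # project, scale per rank dim
        delta = (delta @ self.B.t()) * self.b        # expand, scale per output dim
        return self.base(x) + delta
```

With this parameterization the trainable cost per adapted layer is only `rank + out_features` scalars, which is how parameter budgets of a few percent of competing adapter methods become plausible.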
Funding: Supported by the Zhejiang Provincial Natural Science Foundation of China (No. LQ23F030001), the National Natural Science Foundation of China (No. 62406280), the Autism Research Special Fund of the Zhejiang Foundation for Disabled Persons (No. 2023008), the Liaoning Province Higher Education Innovative Talents Program Support Project (No. LR2019058), the Liaoning Province Joint Open Fund for Key Scientific and Technological Innovation Bases (No. 2021-KF-12-05), the Central Guidance on Local Science and Technology Development Fund of Liaoning Province (No. 2023JH6/100100066), and the Key Laboratory for Biomedical Engineering of the Ministry of Education, Zhejiang University, China, and in part by the Open Research Fund of the State Key Laboratory of Cognitive Neuroscience and Learning.
Abstract: Video action recognition (VAR) aims to analyze dynamic behaviors in videos and achieve semantic understanding. VAR faces challenges such as temporal dynamics, action-scene coupling, and the complexity of human interactions. Existing methods can be categorized into motion-level, event-level, and story-level ones based on spatiotemporal granularity. However, single-modal approaches struggle to capture complex behavioral semantics and human factors. Therefore, in recent years, vision-language models (VLMs) have been introduced into this field, providing new research perspectives for VAR. In this paper, we systematically review spatiotemporal hierarchical methods in VAR and explore how the introduction of large models has advanced the field. Additionally, we propose the concept of "Factor" to identify and integrate key information from both visual and textual modalities, enhancing multimodal alignment. We also summarize various multimodal alignment methods and provide in-depth analysis and insights into future research directions.
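As a concrete illustration of the vision-language alignment paradigm this survey revolves around, the sketch below shows the common CLIP-style recipe for matching a video to textual action prompts: per-frame embeddings are pooled into a clip-level vector and compared with prompt embeddings by cosine similarity. This is a generic illustration of the paradigm, not a method proposed in the paper; the function name and tensor shapes are assumptions.

```python
import torch

def classify_actions(frame_features, class_text_features):
    """Minimal sketch of VLM-based action recognition via vision-language alignment:
    per-frame visual features are pooled into a clip-level embedding and matched
    against text embeddings of action prompts (e.g. "a video of a person <action>")
    by cosine similarity. The encoders (e.g. CLIP) are assumed to be run beforehand.

    frame_features:      (num_frames, dim) visual embeddings of sampled frames
    class_text_features: (num_classes, dim) text embeddings of action prompts
    """
    video_emb = frame_features.mean(dim=0, keepdim=True)           # temporal pooling
    video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)   # L2-normalize
    text_emb = class_text_features / class_text_features.norm(dim=-1, keepdim=True)
    logits = video_emb @ text_emb.t()                               # cosine similarities
    return logits.softmax(dim=-1)                                   # class probabilities
```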
Funding: Funded by the National Natural Science Foundation of China (No. 62076032).
Abstract: Referring expression comprehension (REC) aims to locate a specific region in an image described by a natural-language expression. Existing two-stage methods generate multiple candidate proposals in the first stage, followed by selecting one of these proposals as the grounding result in the second stage. Nevertheless, the number of candidate proposals generated in the first stage significantly exceeds the number of ground-truth objects, and the recall of critical objects is inadequate, thereby severely limiting the overall network performance. To address the above issues, the authors propose an innovative method termed Separate Non-Maximum Suppression (Sep-NMS) for two-stage REC. Particularly, Sep-NMS models information from the two stages independently and collaboratively, ultimately achieving an overall improvement in comprehension and identification of the target objects. Specifically, the authors propose a Ref-Relatedness module for rigorously filtering referent proposals, decreasing the redundancy of referent proposals. A CLIP-Relatedness module based on robust multimodal pre-trained encoders is built to precisely assess the relevance between language and proposals, improving the recall of critical objects. It is worth mentioning that the authors are the first to utilise a multimodal pre-training model for proposal filtering in the first stage. Moreover, an Information Fusion module is designed to effectively amalgamate the multimodal information across the two stages, ensuring maximum utilisation of the available information. Extensive experiments demonstrate that the approach achieves competitive performance compared with previous state-of-the-art methods. The datasets used are publicly available: RefCOCO and RefCOCO+ (https://doi.org/10.1007/978-3-319-46475-6_5) and RefCOCOg (https://doi.org/10.1109/CVPR.2016.9).
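The CLIP-Relatedness idea — using a multimodal pre-trained encoder to judge how well each first-stage proposal matches the referring expression — can be illustrated with a minimal crop-and-score sketch. It assumes an OpenAI-CLIP-style interface (`encode_image`/`encode_text`) and PIL-style image crops; the paper's actual module and its fusion with Ref-Relatedness may differ.

```python
import torch

def rank_proposals_by_expression(image, boxes, expression,
                                 clip_model, preprocess, tokenizer):
    """Minimal sketch of CLIP-based proposal scoring for referring expression
    comprehension: each candidate box is cropped, encoded with the image encoder,
    and ranked by cosine similarity to the encoded expression. This only
    illustrates the idea behind a 'CLIP-Relatedness' score; the paper's module
    may use a different scoring and fusion strategy."""
    text_tokens = tokenizer([expression])
    with torch.no_grad():
        text_emb = clip_model.encode_text(text_tokens)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
        crops = torch.stack([preprocess(image.crop(tuple(b))) for b in boxes])
        img_emb = clip_model.encode_image(crops)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    scores = (img_emb @ text_emb.t()).squeeze(-1)   # relevance of each box to the text
    return scores.argsort(descending=True)          # proposal indices, best first
```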
Funding: Supported by the National Natural Science Foundation of China (No. 61374140) and the Shanghai Pujiang Program (No. 12PJ1402200).
Abstract: A novel approach named aligned mixture probabilistic principal component analysis (AMPPCA) is proposed in this study for fault detection of multimode chemical processes. In order to exploit within-mode correlations, the AMPPCA algorithm first estimates a statistical description for each operating mode by applying mixture probabilistic principal component analysis (MPPCA). For comparison, a combined MPPCA scheme is employed in which monitoring results are softly integrated according to the posterior probability of the test sample under each local model. To exploit the cross-mode correlations, which may be useful but are neglected when each mode is monitored separately, a global monitoring model is constructed by aligning all local models together. In this way, both within-mode and cross-mode correlations are preserved in the integrated space. Finally, the utility and feasibility of AMPPCA are demonstrated on a non-isothermal continuous stirred tank reactor and the Tennessee Eastman (TE) benchmark process.
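For readers unfamiliar with multimode monitoring, the sketch below shows a deliberately simplified version of the per-mode modelling step: one model per operating mode and a squared-prediction-error (SPE) test against each model's control limit. Plain PCA is used here as a stand-in for mixture probabilistic PCA, and the cross-mode alignment that distinguishes AMPPCA is omitted; the function names and the control-limit handling are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_mode_models(mode_data, n_components=3):
    """Simplified sketch of per-mode modelling in the spirit of (A)MPPCA:
    one PCA model fitted per known operating mode (plain PCA stands in for
    probabilistic PCA, and no cross-mode alignment is performed)."""
    return {mode: PCA(n_components=n_components).fit(X)
            for mode, X in mode_data.items()}

def spe_statistic(model, x):
    """Squared prediction error (SPE/Q) of one sample under one local model."""
    scores = model.transform(x.reshape(1, -1))
    recon = model.inverse_transform(scores)
    return float(np.sum((x - recon.ravel()) ** 2))

def detect_fault(models, x, spe_limits):
    """Flag a fault if the sample exceeds the SPE control limit of every mode,
    i.e. it is not well explained by any local model. The limits are assumed to
    be estimated from normal training data, e.g. as an empirical percentile."""
    return all(spe_statistic(m, x) > spe_limits[mode] for mode, m in models.items())
```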