Fund: Supported by the National Natural Science Foundation of China (62173253, 52272374); the Research and Practice Project of New Engineering in Ordinary Undergraduate Universities in the Guangxi Zhuang Autonomous Region (XGK202310); educational reform projects (JGT202302, JGKQ202309); and the 2024 Guangxi Collegiate Innovation and Entrepreneurship Training Project "Eye-Smart Driving: Fatigue Driving Monitoring and Warning System Based on Computer Vision" (Project No. S202410595158).
Abstract: Driver behavior is a critical factor in road safety, highlighting the need for advanced methods in Distracted Driving Classification (DDC). In this study, we introduce DDC-Chat, a novel classification method based on a visual large language model (VLM). DDC-Chat is an interactive multimodal system built upon LLaVA-Plus, fine-tuned specifically for distracted driving detection. It uses logical reasoning chains to activate visual skills, including segmentation and pose detection, through end-to-end training. Furthermore, instruction tuning allows DDC-Chat to continuously incorporate new visual skills, enhancing its ability to classify distracted driving behavior. Our extensive experiments demonstrate that DDC-Chat achieves state-of-the-art performance on public DDC datasets, surpassing previous benchmarks. In evaluations on the 100-Driver dataset, the model delivers superior results in both zero-shot and few-shot settings, establishing it as a valuable tool for improving driving safety by accurately identifying driver distraction. Because inference is computationally intensive, DDC-Chat is designed for deployment on remote servers, with data streamed from in-vehicle monitoring systems for real-time analysis.
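The skill-activation loop described above can be pictured as a three-step dispatch: reason about which skills are needed, run them, then classify from the pooled evidence. The Python sketch below is purely illustrative; the vlm.generate interface, the skill backends, and the behavior labels are assumptions, not the authors' actual API.

```python
from dataclasses import dataclass

@dataclass
class SkillResult:
    name: str
    evidence: str  # textual summary fed back into the reasoning chain

def run_skill(name: str, frame) -> SkillResult:
    # Placeholder backends standing in for real segmentation / pose models.
    backends = {
        "segmentation": lambda f: "phone segmented in right hand",
        "pose": lambda f: "head turned down, right arm raised",
    }
    return SkillResult(name, backends[name](frame))

def classify_distraction(vlm, frame) -> str:
    # Step 1: a reasoning step decides which visual skills to invoke.
    plan = vlm.generate("Which skills (segmentation, pose) would help "
                        "classify this driver's behavior?", image=frame)
    skills = [s for s in ("segmentation", "pose") if s in plan]
    # Step 2: run the chosen skills and collect textual evidence.
    evidence = "; ".join(run_skill(s, frame).evidence for s in skills)
    # Step 3: classify conditioned on the image plus the gathered evidence.
    return vlm.generate(f"Evidence: {evidence}. Classify the behavior "
                        "(safe driving, texting, phone call, ...).",
                        image=frame)
```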
Fund: Supported by the National Natural Science Foundation of China under Grant No. 62176115.
Abstract: Large visual language models (LVLMs) have revolutionized the multimodal domain, demonstrating exceptional performance in tasks that require fusing visual and textual information. However, current evaluation benchmarks fail to adequately assess the knowledge alignment between images and text, focusing primarily on answer accuracy rather than the reasoning processes behind it. To address this gap and deepen the understanding of LVLMs' capabilities, we introduce KnowBench, a novel benchmark designed to assess the alignment of knowledge between images and text for LVLMs. KnowBench comprises 1081 image-question pairs, each with four options and four pieces of corresponding knowledge, spanning 11 major categories. We evaluate mainstream LVLMs on KnowBench, including proprietary models such as Gemini, Claude, and GPT, and open-source models such as LLaVA, Qwen-VL, and InternVL. Our experiments reveal a notable discrepancy between the models' ability to select correct answers and their ability to select the corresponding knowledge, whether the models are open-source or proprietary. This indicates that a significant gap remains in current LVLMs' knowledge alignment between images and text. Furthermore, our analysis shows that model performance on KnowBench improves with increased parameter counts and version iterations, indicating that scaling laws have a significant impact on multimodal knowledge alignment and that model iteration by researchers also has a positive effect. We anticipate that KnowBench will foster the development of LVLMs and motivate researchers to build more reliable models. We have made our dataset publicly available at https://doi.org/10.57760/sciencedb.29672.
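One way to see why answer accuracy alone is insufficient is to score answers and knowledge separately on the same items. The sketch below shows, under assumed field names (the published schema may differ), how a KnowBench-style discrepancy could be computed; model.pick_option is a hypothetical handle.

```python
from dataclasses import dataclass

@dataclass
class KnowBenchItem:
    image_path: str
    question: str
    options: list[str]      # four answer options
    knowledge: list[str]    # four candidate knowledge statements
    answer_idx: int         # index of the correct option
    knowledge_idx: int      # index of the supporting knowledge

def score(model, items: list[KnowBenchItem]) -> dict[str, float]:
    ans_hits = know_hits = 0
    for it in items:
        a = model.pick_option(it.image_path, it.question, it.options)
        k = model.pick_option(it.image_path, it.question, it.knowledge)
        ans_hits += (a == it.answer_idx)
        know_hits += (k == it.knowledge_idx)
    n = len(items)
    # A large gap between the two accuracies signals that the model
    # answers correctly without holding the aligned knowledge.
    return {"answer_acc": ans_hits / n,
            "knowledge_acc": know_hits / n,
            "alignment_gap": abs(ans_hits - know_hits) / n}
```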
Fund: Supported by the National Key R&D Program of China (No. 2022YFE0196100); the Innovation Capacity Enhancement Program Science and Technology Platform Project of Hebei Province (22567623H); and the Hebei University High-Level Innovative Talent Research Start-up Funding Project (No. 521000981092).
Abstract: Prompt learning has become crucial for adapting Visual Language Models (VLMs) to downstream tasks. Although existing prompt learning models have made significant strides, they still face two major challenges: (1) too much attention is paid to learning the base classes, making it harder to understand novel classes; and (2) most methods rely only on the context information provided by the prompt template, resulting in limited text features. In this study, we propose a new fine-tuning method for Visual-Language Models called Input-Enhanced Prompt Tuning (IEPT). IEPT improves the generalization of VLMs to downstream tasks by introducing two components: the Data Augmentation Framework (DAF) and the Category Generalization Optimizer (CGO). Specifically, the DAF employs Large Language Models to resolve word ambiguity by obtaining richer class-label context, and uses simple image augmentation to address the issue of limited features by providing more image samples. The CGO prevents overfitting by adding new class names during training. Experiments show that the performance of IEPT across various evaluation suites is better than or comparable to existing methods, covering base-to-novel generalization, domain generalization, and cross-dataset evaluation. Compared to the state-of-the-art method PromptSRC, IEPT achieves an absolute improvement of 0.40% on base classes, 1.56% on novel classes, and 1.04% on the harmonic mean, averaged over 11 datasets. In addition, we present detailed ablation studies that validate the individual contributions of DAF and CGO to the overall performance of IEPT. Our code is available at https://github.com/ayuan0626/IEPT.
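To make the two components concrete, here is a minimal sketch of what DAF-style input enrichment and CGO-style vocabulary mixing could look like. The llm.complete handle, the augment callback, and all names are illustrative assumptions, not the released implementation; the reported harmonic mean is the usual HM = 2BN/(B+N) over base accuracy B and novel accuracy N.

```python
import random

def daf_text(llm, class_name: str) -> str:
    # DAF, text side: disambiguate the bare label with LLM-generated context.
    return llm.complete(f"Describe the visual appearance of a '{class_name}'.")

def daf_images(image, augment, n_aug: int = 4) -> list:
    # DAF, image side: a few simple augmented views (flip / crop / jitter)
    # stand in for the "simple image augmentation" in the abstract.
    return [augment(image) for _ in range(n_aug)]

def cgo_vocabulary(base_classes: list, extra_classes: list, k: int = 8) -> list:
    # CGO: mix k extra class names into each training step so the learned
    # prompt does not overfit to the base-class vocabulary alone.
    return base_classes + random.sample(extra_classes, min(k, len(extra_classes)))
```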
Fund: Supported by CAFUC (ZHMH 2022-005); the Key Laboratory of Flight Techniques and Flight Safety (FZ2022ZZ06); and Flight Technology and Flight Safety of the Civil Aviation Administration of China (FZ2022KF10).
Abstract: Dear Editor, This letter proposes an innovative open-vocabulary 3D scene understanding model based on a visual-language model. By efficiently integrating 3D point cloud data, image data, and text data, our model effectively overcomes the segmentation problem [1], [2] that traditional models face when dealing with unknown categories [3]. By learning the deep semantic mapping between vision and language, the network significantly improves its ability to recognize unlabeled categories and exceeds current state-of-the-art methods on open-vocabulary scene understanding tasks.
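A common way to realize this kind of open-vocabulary recognition is to align per-point features with a shared vision-language embedding space and match them against text embeddings of arbitrary category names. The letter does not publish code, so the NumPy sketch below is only a plausible reading of the approach; text_encoder is a placeholder for a CLIP-style text tower.

```python
import numpy as np

def open_vocab_labels(point_feats: np.ndarray, class_names: list[str],
                      text_encoder) -> np.ndarray:
    """point_feats: (N, D) per-point embeddings aligned to the text space."""
    text_feats = np.stack([text_encoder(c) for c in class_names])   # (C, D)
    # L2-normalize both sides so dot products become cosine similarities.
    pf = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    tf = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    sims = pf @ tf.T                 # (N, C) point-to-class similarity
    return sims.argmax(axis=1)       # open-vocabulary label per point
```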
Fund: Funded by the Natural Science Foundation of Jiangsu Province (Program BK20240699) and the National Natural Science Foundation of China (Program 62402228).
Abstract: This article proposes an innovative adversarial attack method, AMA (Adaptive Multimodal Attack), which introduces an adaptive feedback mechanism that dynamically adjusts the perturbation strength. Specifically, AMA adjusts the perturbation amplitude based on task complexity and optimizes the perturbation direction in real time based on the gradient direction to enhance attack efficiency. Experimental results demonstrate that AMA raises attack success rates from approximately 78.95% to 89.56% on visual question answering and from 78.82% to 84.96% on visual reasoning tasks across representative vision-language benchmarks. These findings demonstrate AMA's superior attack efficiency and reveal the vulnerability of current visual language models to carefully crafted adversarial examples, underscoring the need to enhance their robustness.
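The adaptive feedback idea can be sketched as a PGD-style loop whose step size reacts to how the loss responds. The rule below (grow the step when the loss stalls, shrink it when it moves) is a stand-in heuristic, and model(image + delta, text) is an assumed multimodal interface; neither is the authors' exact update.

```python
import torch
import torch.nn.functional as F

def adaptive_attack(model, image, text, label,
                    steps: int = 20, eps: float = 8 / 255, alpha0: float = 2 / 255):
    delta = torch.zeros_like(image, requires_grad=True)
    alpha, prev_loss = alpha0, None
    for _ in range(steps):
        loss = F.cross_entropy(model(image + delta, text), label)
        loss.backward()
        # Feedback: if the loss stopped growing, push harder; else ease off.
        if prev_loss is not None:
            alpha *= 1.2 if loss.item() <= prev_loss + 1e-4 else 0.8
        prev_loss = loss.item()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # follow the gradient direction
            delta.clamp_(-eps, eps)              # stay inside the L-infinity budget
        delta.grad.zero_()
    return (image + delta).detach()
```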
Fund: Supported in part by the National Key Research and Development Program of China (No. 2021ZD0112400); the Support Plan for Key Field Innovation Team of Dalian, China (No. 2021RT06); the Support Plan for 111 Project (No. D23006); Dalian Major Projects of Basic Research (No. 2023JJ11CG002); China's National Foreign Experts Project (No. D20240244); and Liaoning Province Education Department Basic Scientific Research Projects (No. LJ232411258019).
Abstract: In human-robot collaborative tasks, human trust in robots can reduce resistance to them, thereby increasing the success rate of task execution. However, most existing studies have focused on improving the success rate of human-robot collaboration (HRC) rather than on enhancing collaboration efficiency. To improve overall collaboration efficiency while maintaining a high success rate, this study proposes a trust-based active interaction strategy generation method for HRC. First, a trust-based optimal robot strategy generation method is proposed to generate the robot's optimal strategy in an HRC task. This method employs a tree to model the HRC process under different robot strategies and, based on the modeling results, calculates the optimal strategy for the robot to execute. Second, the robot's performance is evaluated to calculate the human's trust in the robot; for this, a robot performance evaluation method based on a visual language model is also proposed. The evaluation results are input into the trust model to compute the human's current trust. Finally, each time an object operation is completed, the robot performance evaluation and optimal strategy generation methods work together to automatically generate the robot's optimal strategy for the next step, until the entire collaborative task is completed. The experimental results demonstrate that this method significantly improves collaboration efficiency while achieving a high success rate in HRC.
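As a toy illustration of the tree-based strategy choice and the trust update, consider the sketch below: each candidate strategy is a small tree whose leaves carry a success probability, robot-led branches are discounted by the current trust level, and trust is refreshed from the latest performance score. The exponential-smoothing update and all data structures are assumptions, not the paper's exact models.

```python
def best_strategy(strategies: dict, trust: float) -> str:
    # 'strategies' maps a strategy name to the root of its interaction tree.
    def value(node) -> float:
        children = node.get("children", [])
        if not children:
            # Robot-led steps succeed only insofar as the human defers.
            return node["p_success"] * (trust if node["robot_led"] else 1.0)
        return max(value(c) for c in children)
    return max(strategies, key=lambda s: value(strategies[s]))

def update_trust(trust: float, performance: float, beta: float = 0.3) -> float:
    # 'performance' in [0, 1] would come from the VLM-based evaluation of
    # the last object operation; smoothing is a stand-in for the trust model.
    return (1 - beta) * trust + beta * performance
```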
Abstract: XCD is a design-by-contract based architecture description language that supports modular specifications in terms of components and connectors (i.e., interaction protocols). XCD is supported by a translator that produces formal models in SPIN's ProMeLa verification language, which can then be formally analysed using SPIN's model checker. XCD is extended with a visual notation set called VXCD. VXCD extends UML's component diagram and adapts it to XCD's structure, contractual behaviour, and interaction protocol specifications. Visual VXCD specifications can be translated into textual XCD specifications for formal analysis. To illustrate VXCD, the well-known gas station system is used. The gas station system is specified contractually using VXCD's visual notation set and then formally analysed with SPIN's model checker for a number of properties, including deadlock and race conditions.
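XCD contracts are not Python, but the design-by-contract style they build on is easy to picture: every interaction is guarded by a precondition and checked against a postcondition, which is the kind of protocol obligation the ProMeLa translation lets SPIN verify exhaustively. The toy pump below only illustrates that style; it is not XCD syntax, and the names are invented for the example.

```python
def contract(pre, post):
    """Wrap a method with runtime pre-/post-condition checks."""
    def wrap(fn):
        def inner(self, *args):
            assert pre(self, *args), f"precondition of {fn.__name__} violated"
            result = fn(self, *args)
            assert post(self, result), f"postcondition of {fn.__name__} violated"
            return result
        return inner
    return wrap

class Pump:
    def __init__(self):
        self.busy = False

    @contract(pre=lambda self: not self.busy,   # protocol: the pump must be free
              post=lambda self, _: self.busy)   # afterwards it is in use
    def activate(self):
        self.busy = True
```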