Funding: supported by the Yunnan Science & Technology Project (Grant Nos. 202302AD080008 and 202305AF150152), the Guangdong Major Project of Basic and Applied Basic Research (Grant No. 2023B0303000016), the National Natural Science Foundation of China (Grant No. U21A20487), the Guangdong Technology Project (Grant No. 2023TX07Z126), the Shenzhen Technology Project (Grant Nos. JCYJ20220818101211025, GJHZ20240218112504008, and JCYJ20220818101206014), the Shenzhen High-tech Zone Development Special Plan Innovation Platform Construction Project, the Proof-of-Concept Center for High-Precision and High-Resolution 4D Imaging, and the CAS Key Technology Talent Program.
Abstract: Simultaneous localization and mapping (SLAM) is a pivotal challenge in mobile robotics. Traditional SLAM solutions primarily focus on achieving rapid and accurate localization and mapping while typically neglecting environmental object identification. This paper introduces an innovative SLAM system enhanced with YOLO-based open-vocabulary object detection. It leverages vision-language alignment to identify both known and novel objects using extensive image-text pairs. Our approach employs YOLOv8 as a teacher model, balancing speed and accuracy for object detection and bounding-box prediction. These predictions are processed by CLIP encoders to generate high-dimensional vectors, teaching a student model robust image and text embeddings. Novel loss functions align augmented embeddings with the supervisory signals, greatly enhancing detection accuracy and generalization. Additionally, the system integrates depth-map-based scale extraction, 3D mapping of target object positions, and efficient relative pose estimation for loop detection. The direct method used improves accuracy and robustness, especially in poorly textured environments. Extensive ablation studies show significant improvements in precision and recall. Our advanced SLAM system not only ensures accurate localization and mapping but also enables mobile robots to recognize and interact with a wide variety of objects, making it well suited to practical applications in complex environments.
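The abstract does not spell out its distillation losses, but the general pattern it describes (a student whose region embeddings are pulled toward CLIP-style teacher embeddings) can be sketched as a cosine-alignment loss. Everything below is an illustrative assumption, not the paper's actual formulation; names, dimensions, and data are made up.

```python
import math
import random

def l2_normalize(v, eps=1e-8):
    """Scale a vector to (approximately) unit length."""
    n = math.sqrt(sum(x * x for x in v)) + eps
    return [x / n for x in v]

def alignment_loss(student, teacher):
    """Mean (1 - cosine similarity) over matched student/teacher embedding pairs."""
    total = 0.0
    for s, t in zip(student, teacher):
        s, t = l2_normalize(s), l2_normalize(t)
        total += 1.0 - sum(a * b for a, b in zip(s, t))
    return total / len(student)

# Toy example: 4 "region" embeddings of dimension 8 (real CLIP embeddings are
# 512-d or larger; small numbers keep the sketch readable).
random.seed(0)
teacher = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
student = [[x + 0.05 * random.gauss(0, 1) for x in row] for row in teacher]
print(alignment_loss(student, teacher))  # close to 0 for a well-aligned student
```

Minimizing this quantity drives the student's embeddings toward the teacher's, which is the usual mechanism behind the kind of embedding supervision the abstract describes.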
Funding: supported by the National Natural Science Foundation of China under Grant No. 62402490 and the Guangdong Basic and Applied Basic Research Foundation of China under Grant No. 2025A1515010101.
Abstract: Vision-language models (VLMs) have shown strong open-vocabulary learning abilities in various video understanding tasks. However, when applied to open-vocabulary temporal action detection (OV-TAD), existing OV-TAD methods often struggle to generalize to unseen action categories because of their heavy reliance on visual features. In this paper, we propose a novel framework, Concept-Guided Semantic Projection (CSP), to enhance the generalization ability of OV-TAD methods. By projecting video features into a unified action concept space, CSP enables action detection to draw on abstracted action concepts rather than relying solely on visual details. To further improve feature consistency across action categories, we introduce a mutual contrastive loss (MCL), ensuring semantic coherence and better feature discrimination. Extensive experiments on the ActivityNet and THUMOS14 benchmarks demonstrate that our method outperforms state-of-the-art OV-TAD methods. Code and data are available at Concept-Guided-OV-TAD.
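The abstract does not define CSP's projection concretely; one common way to "project into a concept space" is to re-express a feature as a similarity-weighted mixture of concept embeddings. The sketch below illustrates only that generic idea under stated assumptions (a fixed concept bank, dot-product similarity, a softmax temperature of 0.07); it is not the paper's method.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def project_to_concepts(feature, concept_bank, temperature=0.07):
    """Re-express a video feature as a softmax-weighted mix of concept embeddings."""
    sims = [sum(a * b for a, b in zip(feature, c)) for c in concept_bank]
    weights = softmax([s / temperature for s in sims])
    dim = len(concept_bank[0])
    return [sum(w * c[d] for w, c in zip(weights, concept_bank)) for d in range(dim)]

# Toy concept bank with two orthogonal "concepts" in 2-D.
bank = [[1.0, 0.0], [0.0, 1.0]]
print(project_to_concepts([0.9, 0.1], bank))  # pulled almost entirely onto the first concept
```

Because the output lives in the span of the concept bank rather than raw feature space, unseen categories can still be matched through the concepts they share with seen ones, which is the generalization argument the abstract makes.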
Abstract: Semantic segmentation is a core task in computer vision that allows AI models to understand and interact with their surrounding environment. Much as humans subconsciously segment scenes, this ability is crucial for scene understanding. However, a challenge many semantic learning models face is the lack of data: existing video datasets are limited to short, low-resolution videos that are not representative of real-world examples. Thus, one of our key contributions is a customized semantic-segmentation version of the Walking Tours Dataset that features hour-long, high-resolution, real-world footage from tours of different cities. Additionally, we evaluate the performance of the open-vocabulary segmentation model OpenSeeD on our custom dataset and discuss future implications.
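The abstract does not name its evaluation metric, but semantic segmentation is conventionally scored with mean intersection-over-union (mIoU). A minimal sketch of that metric, on a toy flattened label array (the data and class count are made up):

```python
def miou(pred, gt, num_classes):
    """Mean intersection-over-union over classes present in prediction or ground truth."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union:  # skip classes absent from both, so they don't distort the mean
            ious.append(inter / union)
    return sum(ious) / len(ious)

# Toy 1-D "image" with 3 classes (real inputs are flattened H*W label maps).
gt   = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
print(round(miou(pred, gt, 3), 3))  # 0.5
```

Per-class averaging is what distinguishes mIoU from raw pixel accuracy: rare classes count as much as dominant ones, which matters on long, varied footage like city walking tours.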
Funding: supported by the National Natural Science Foundation of China (Nos. 62106150 and 62272315), the Open Fund of the National Engineering Laboratory for Big Data System Computing Technology (No. SZU-BDSC-OF2024-22), the Open Research Fund of the Anhui Province Key Laboratory of Machine Vision Inspection (No. KLMVI-2023-HIT-01), and the Director Fund of the Guangdong Laboratory of Artificial Intelligence and Digital Economy (Shenzhen) (No. 24420001).
Abstract: Zero-shot object detection (ZSD), one of the most challenging problems in the field of object detection, aims to accurately identify new categories that are not encountered during training. Recent advancements in deep learning and increased computational power have led to significant improvements in object detection systems, achieving high recognition accuracy on benchmark datasets. However, these systems remain limited in real-world applications due to the scarcity of labeled training samples, making it difficult to detect unseen classes. To address this, researchers have explored various approaches, yielding promising progress. This article provides a comprehensive review of the current state of ZSD, distinguishing four related lines of work (zero-shot, open-vocabulary, open-set, and open-world approaches) based on task objectives and data usage. We highlight representative methods, discuss the technical challenges within each framework, and summarize the commonly used evaluation metrics, benchmark datasets, and experimental results. Our review aims to offer readers a clear overview of the latest developments and performance trends in ZSD.
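Among the evaluation metrics such a review typically summarizes, ZSD benchmarks commonly report the harmonic mean of seen-class and unseen-class scores, which penalizes methods that trade unseen-class accuracy for seen-class accuracy. A one-function sketch (the example scores are invented):

```python
def harmonic_mean(seen_score, unseen_score):
    """Harmonic mean of seen- and unseen-class scores, as commonly reported in ZSD."""
    if seen_score + unseen_score == 0:
        return 0.0
    return 2 * seen_score * unseen_score / (seen_score + unseen_score)

# A method strong on seen classes but weak on unseen ones is pulled down hard:
print(harmonic_mean(0.6, 0.2))  # 0.3, well below the arithmetic mean of 0.4
```

Because the harmonic mean is dominated by the smaller operand, it rewards balanced generalization across seen and unseen categories rather than overfitting to the training vocabulary.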