Funding: Funded by the Yangtze River Delta Science and Technology Innovation Community Joint Research Project (2023CSJGG1600), the Natural Science Foundation of Anhui Province (2208085MF173), and the Wuhu "ChiZhu Light" Major Science and Technology Project (2023ZD01, 2023ZD03).
Abstract: As the number and complexity of sensors in autonomous vehicles continue to rise, multimodal fusion-based object detection algorithms are increasingly being used to detect 3D environmental information, significantly advancing the development of perception technology in autonomous driving. To further promote the development of fusion algorithms and improve detection performance, this paper discusses the advantages and recent advancements of multimodal fusion-based object detection algorithms. Starting from single-modal sensor detection, the paper provides a detailed overview of typical sensors used in autonomous driving and introduces object detection methods based on images and point clouds. Image-based detection methods are categorized into monocular and binocular detection according to input type. Point cloud-based detection methods are classified into projection-based, voxel-based, point cluster-based, pillar-based, and graph structure-based approaches according to the technical pathway used to process point cloud features. Multimodal fusion algorithms are divided into Camera-LiDAR fusion, Camera-Radar fusion, Camera-LiDAR-Radar fusion, and other sensor fusion methods according to the types of sensors involved. Finally, the paper identifies five key future research directions in this field, aiming to provide insights for researchers working on multimodal fusion-based object detection algorithms and to encourage broader attention to the research and application of multimodal fusion-based object detection.
Funding: Partially sponsored by the National Science Foundation of China (No. 61572355, 61272093, 610172063) and the Tianjin Research Program of Application Foundation and Advanced Technology under grant No. 15JCYBJC15700.
Abstract: Task scheduling in cloud computing environments is a multi-objective optimization problem that is NP-hard. Finding an appropriate trade-off among resource utilization, energy consumption, and Quality of Service (QoS) requirements is also challenging when the environment changes and the tasks are diverse. Considering both processing time and transmission time, this paper proposes a PSO-based Adaptive Multi-objective Task Scheduling (AMTS) strategy. First, the task scheduling problem is formulated. Then, a task scheduling policy is put forward to optimize resource utilization, task completion time, average cost, and average energy consumption. To maintain particle diversity, an adaptive acceleration coefficient is adopted. Experimental results show that the improved PSO algorithm can obtain quasi-optimal solutions for the cloud task scheduling problem.
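The role of an adaptive acceleration coefficient in PSO-based scheduling can be illustrated with a minimal sketch. The abstract does not give the AMTS update rule, so everything here is an assumption: the linear schedule in `adaptive_coefficients` (a common time-varying scheme that shifts weight from the cognitive term `c1` to the social term `c2`), the single makespan objective standing in for the paper's multiple objectives, and all function and parameter names.

```python
import random

def adaptive_coefficients(step, total_steps, c_start=2.5, c_end=0.5):
    """Assumed schedule: c1 falls linearly from c_start to c_end, c2 rises the other way."""
    frac = step / max(total_steps - 1, 1)
    c1 = c_start + (c_end - c_start) * frac
    c2 = c_end + (c_start - c_end) * frac
    return c1, c2

def makespan(assignment, task_lengths, vm_speeds):
    """Completion time of the busiest VM for a task -> VM assignment."""
    load = [0.0] * len(vm_speeds)
    for task, vm in enumerate(assignment):
        load[vm] += task_lengths[task] / vm_speeds[vm]
    return max(load)

def decode(position, n_vms):
    """Map each continuous coordinate to a VM index in [0, n_vms)."""
    return [min(max(int(x), 0), n_vms - 1) for x in position]

def pso_schedule(task_lengths, vm_speeds, n_particles=20, steps=50, inertia=0.7, seed=0):
    """Single-objective PSO over continuous positions decoded to VM indices."""
    rng = random.Random(seed)
    n_tasks, n_vms = len(task_lengths), len(vm_speeds)

    def cost(position):
        return makespan(decode(position, n_vms), task_lengths, vm_speeds)

    pos = [[rng.uniform(0, n_vms) for _ in range(n_tasks)] for _ in range(n_particles)]
    vel = [[0.0] * n_tasks for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]
    best_cost = [cost(p) for p in pos]
    g = min(range(n_particles), key=best_cost.__getitem__)
    g_pos, g_cost = best_pos[g][:], best_cost[g]

    for step in range(steps):
        c1, c2 = adaptive_coefficients(step, steps)
        for i in range(n_particles):
            for d in range(n_tasks):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                # Keep the particle inside the valid VM index range.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], 0.0), n_vms - 1e-9)
            c = cost(pos[i])
            if c < best_cost[i]:
                best_pos[i], best_cost[i] = pos[i][:], c
                if c < g_cost:
                    g_pos, g_cost = pos[i][:], c
    return decode(g_pos, n_vms), g_cost
```

Calling `pso_schedule([4, 2, 6, 3], [2.0, 1.0])` returns a task-to-VM assignment and its makespan. A large early `c1` keeps particles exploring around their own best positions, which is one plausible way to preserve diversity before the swarm converges.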
Abstract: An object-oriented multi-robot graphic simulation environment is described in this paper. Object-oriented programming is used to model the physical objects of the robotic workcell as software objects or classes. Virtual objects are defined to provide the user with a user-friendly interface, including realistic graphic simulation, and to clarify the software architecture. The programming method of associating the task object with the active object effectively increases software reusability, maintainability, and modifiability. Task-level programming is also demonstrated through a multi-robot welding task that allows the user to concentrate on the most important aspects of the task. A multi-thread programming technique is used to simulate the interaction of multiple tasks. Finally, a virtual test is carried out in the graphic simulation environment to observe design and program errors and fix them before the software is downloaded to the real workcell.
Abstract: Purpose: To evaluate the advantages and disadvantages of teaching methods for the information technology foundation course in health vocational colleges. Method: Two classes of nursing majors from the 2012 cohort served as the experimental subjects. The comparison class used the traditional teaching method combining lectures and practice, and the experimental class used the task-driven teaching method. At the end of the semester, a test and a questionnaire survey were conducted, and the collected data were analyzed for changes in students' performance, learning interest and attitude, and autonomous learning awareness and ability after the experimental class adopted the new teaching method. Result: The exam performance of the experimental class was clearly higher than that of the comparison class, and the experimental class showed a clear improvement in learning interest and in autonomous learning awareness and ability; the differences were statistically significant. Conclusion: The task-driven teaching method suits the current state of information technology foundation teaching in health vocational colleges. It significantly improves students' performance, benefits their learning interest and enthusiasm, achieves a good classroom effect, and greatly improves their autonomous learning awareness and ability.
Abstract: The basic theory of the YOLO series of object detection algorithms is discussed, and a dangerous driving behavior dataset is collected and produced. The YOLOv7 network is then introduced in detail. Depthwise separable convolution and the CA attention mechanism are added, the YOLOv7 bounding box loss function and clustering algorithm are optimized, and the DB-YOLOv7 network structure is constructed. In the first stage of the experiment, the PASCAL VOC public dataset was used for pre-training. A comparative analysis was conducted to assess the recognition accuracy and inference time before and after the proposed improvements. The experimental results demonstrated an increase of 1.4% in the average recognition accuracy, alongside a reduction in the inference time of 4 ms. Subsequently, a model for the recognition of dangerous driving behaviors was trained using a specialized dangerous driving behavior dataset. A series of experiments were performed to evaluate the efficacy of the DB-YOLOv7 algorithm in this context. The findings indicate a significant enhancement in detection performance, with a 4% improvement in accuracy compared to the baseline network. Furthermore, the model's inference time was reduced by 20%, from 25 ms to 20 ms. These results substantiate the effectiveness of the DB-YOLOv7 recognition algorithm for detecting dangerous driving behaviors and validate its practical applicability.
Funding: Supported by the National Natural Science Foundation of China (71690233).
Abstract: Resource allocation for an equipment development task is a complex process owing to its inherent characteristics, such as large amounts of input resources, numerous sub-tasks, complex network structures, and high degrees of uncertainty. This paper investigates the influence of resource allocation on the duration and cost of sub-tasks. Mathematical models are constructed for the relationships of the resource allocation quantity with the duration and cost of the sub-tasks. By considering uncertainties such as fluctuations in sub-task duration and cost, rework iterations, and random overlaps, the tasks are simulated for various resource allocation schemes. The shortest duration and the minimum cost of the development task are first formulated as the objective function. Based on a multi-objective particle swarm optimization (MOPSO) algorithm, a multi-objective evolutionary algorithm is constructed to optimize the resource allocation scheme for the development task. Finally, an uninhabited aerial vehicle (UAV) is considered as an example of a development task to test the algorithm, and the optimization results of this method are compared with those based on the non-dominated sorting genetic algorithm-II (NSGA-II), non-dominated sorting differential evolution (NSDE), and the strength Pareto evolutionary algorithm-II (SPEA-II). The comparison verifies the soundness and effectiveness of the proposed method. The case study shows that optimizing the resource allocation can greatly help to shorten the duration of the development task and reduce its cost.
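Because duration and cost are minimized together, the optimizers named above (MOPSO, NSGA-II, NSDE, SPEA-II) all rely on Pareto dominance to compare candidate allocation schemes. The following is a minimal sketch of that comparison only, not of any of those algorithms; the `(duration, cost)` tuples and the function names are illustrative assumptions.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in one.

    Both objectives (duration, cost) are minimized.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Return the points that no other point dominates, in input order."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# Hypothetical (duration, cost) outcomes of five allocation schemes.
schemes = [(10, 500), (12, 400), (11, 600), (15, 300), (15, 450)]
front = pareto_front(schemes)
```

Here `(11, 600)` drops out because `(10, 500)` is both faster and cheaper, and `(15, 450)` drops out because `(15, 300)` costs less for the same duration. The remaining schemes are the trade-offs a decision maker would choose among.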
Abstract: An object-model-based software architecture for a service robot system is presented. It addresses software engineering issues such as reuse, extensibility, and management of complexity, as well as system engineering issues such as scalability, reactivity, and robustness. A novel approach to the service robot system architecture is discussed. Cognitive psychology is considered in designing the software system, i.e., a human's way of vision and planning is applied. The planner can incorporate the user's request into its task selection mechanism and generate plans biased toward picking the most reliable task execution in a given situation, and it can alter task selection based on changes that occur in dynamic and uncertain environments.
Funding: The authors would like to thank the Natural Science Foundation of Anhui Province (No. 2208085MF173), the Key Research and Development Projects of Anhui (202104a05020003), the Anhui Development and Reform Commission Supports R&D and Innovation Project ([2020]479), the National Natural Science Foundation of China (51575001), and the Anhui University Scientific Research Platform Innovation Team Building Project (2016-2018) for their financial support.
Abstract: In complex traffic scenarios, it is very important for autonomous vehicles to accurately perceive, in advance, the dynamic information of the vehicles around them. The accuracy of 3D object detection is affected by problems such as illumination changes, object occlusion, and detection distance. To address these challenges, we propose a multimodal feature fusion network for 3D object detection (MFF-Net). First, a spatial transformation projection algorithm maps the image features into the feature space, so that the image features are in the same spatial dimension as the point cloud features when the two are fused. Then, feature channel weighting is performed using an adaptive expression augmentation fusion network to enhance important network features, suppress useless features, and increase the directionality of the network toward features. Finally, the probability of false detection and missed detection in the non-maximum suppression algorithm is reduced by increasing the one-dimensional threshold. Together, these steps form a complete 3D object detection network based on multimodal feature fusion. The experimental results show that the proposed MFF-Net achieves an average accuracy of 82.60% on the Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) dataset, outperforming previous state-of-the-art multimodal fusion networks. On the Easy, Moderate, and Hard evaluation levels, the accuracy reaches 90.96%, 81.46%, and 75.39%, respectively. This shows that MFF-Net performs well in 3D object detection.
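The effect of the suppression threshold mentioned above can be seen in a minimal sketch of standard greedy non-maximum suppression. This is the textbook algorithm on axis-aligned 2D boxes, not MFF-Net's modified 3D variant, whose details the abstract does not give. The box format `(x1, y1, x2, y2)` and the names `iou`, `nms`, and `iou_threshold` are assumptions.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score, keep one only if it
    overlaps every already-kept box by at most iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
strict = nms(boxes, scores, iou_threshold=0.5)  # second box suppressed
loose = nms(boxes, scores, iou_threshold=0.7)   # second box survives
```

The first two boxes overlap with an IoU of about 0.68, so the lower threshold suppresses the second box and the higher one keeps it. A higher threshold therefore retains more overlapping detections, which lowers the risk of missing closely spaced objects at the cost of possible duplicates.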
Funding: Supported by the "Regional Innovation Strategy (RIS)" through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (MOE) (2021RIS-004).
Abstract: A concurrency control mechanism for collaborative work is a key element in a mixed reality environment. However, conventional locking mechanisms restrict potential tasks or the support of non-owners, thus increasing the working time because of waiting to avoid conflicts. Herein, we propose an adaptive concurrency control approach that can reduce conflicts and work time. We classify shared object manipulation in mixed reality into detailed goals and tasks. Then, we model the relationships among goal, task, and ownership. As the collaborative work progresses, the proposed system adapts the different concurrency control mechanisms of shared object manipulation according to the modeling of goal–task–ownership. With the proposed concurrency control scheme, users can hold shared objects and move and rotate them together in a mixed reality environment similar to real industrial sites. Additionally, the system uses MS HoloLens and Myo sensors to recognize inputs from a user and provides results in a mixed reality environment. The proposed method is applied to the installation of an air conditioner as a case study. Experimental results and user studies show that, compared with the conventional approach, the proposed method reduced the number of conflicts, waiting time, and total working time.
Funding: This work was supported in part by the National Natural Science Foundation of China under grant number 62272236, in part by the Natural Science Foundation of Jiangsu Province under grant numbers BK20201136 and BK20191401, and in part by the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD) fund.
Abstract: LIDAR point cloud-based 3D object detection aims to sense the surrounding environment by anchoring objects with bounding boxes (BBoxes). However, in the three-dimensional space of autonomous driving scenes, previous object detection methods pre-process the original LIDAR point cloud into voxels or pillars and therefore lose the coordinate information of the original point cloud, detect slowly, and localize bounding boxes inaccurately. To address these issues, this study proposes a new two-stage network structure that extracts point cloud features directly with PointNet++, which effectively preserves the original point cloud coordinate information. To improve detection accuracy, a shell-based modeling method is proposed. It first roughly determines which spherical shell the coordinates belong to. The results are then refined toward the ground truth, thereby narrowing the localization range and improving detection accuracy. To improve the recall of 3D object detection with bounding boxes, this paper designs a self-attention module for 3D object detection with a skip connection structure. Some features are highlighted by weighting them along the feature dimensions. After training, the weights of features that are favorable for object detection become larger, so the extracted features are better adapted to the object detection task. Extensive comparison and ablation experiments conducted on the KITTI dataset verify the effectiveness of the proposed method in improving recall and precision.
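The idea of weighting features along the feature dimension and adding a skip connection can be sketched in a few lines. This is a simplified channel-gating stand-in, not the paper's self-attention module, whose structure the abstract does not specify. The sigmoid gate, the mean pooling over points, and the names `channel_gate`, `gate_weights`, and `gate_bias` are all assumptions.

```python
import math

def channel_gate(features, gate_weights, gate_bias):
    """Reweight each feature channel and add the input back (skip connection).

    features: list of N points, each a list of C feature values.
    gate_weights, gate_bias: per-channel parameters that would be learned in
    training (hypothetical here). Returns (output, per-channel weights).
    """
    n_points, n_channels = len(features), len(features[0])
    # Summarize each channel by its mean over all points.
    mean = [sum(row[j] for row in features) / n_points for j in range(n_channels)]
    # A sigmoid gate yields a weight in (0, 1) for each channel.
    weights = [1.0 / (1.0 + math.exp(-(gate_weights[j] * mean[j] + gate_bias[j])))
               for j in range(n_channels)]
    # Skip connection: output = input + weighted input.
    output = [[row[j] + weights[j] * row[j] for j in range(n_channels)]
              for row in features]
    return output, weights

points = [[1.0, 2.0], [3.0, 4.0]]
out, w = channel_gate(points, gate_weights=[0.0, 0.0], gate_bias=[0.0, 0.0])
```

With zero gate parameters every channel gets weight 0.5. Training would push the weights of channels that help detection toward 1 and the rest toward 0, while the skip connection keeps the original features available even when a gate is near zero.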
Abstract: Object detection remains one of the most basic and difficult areas of computer vision and image understanding. Deep neural network models and enhanced object representation have led to significant progress in object detection. This research investigates in greater detail how object detection has changed in recent years in the deep learning age. We provide an overview of the literature on a range of cutting-edge object identification algorithms and the theoretical underpinnings of these techniques. Deep learning technologies are contributing to substantial innovations in the field of object detection. While Convolutional Neural Networks (CNN) have laid a solid foundation, newer models such as You Only Look Once (YOLO) and Vision Transformers (ViTs) have expanded the possibilities even further by providing high accuracy and fast detection in a variety of settings. Even with these developments, integrating CNN, YOLO, and ViTs into a coherent framework still requires balancing computing demand, speed, and accuracy, especially in dynamic contexts. Real-time processing in applications like surveillance and autonomous driving necessitates improvements that make use of each model type's advantages. The goal of this work is to provide an object detection system that maximizes detection speed and accuracy while decreasing processing requirements by integrating YOLO, CNN, and ViTs. The goals include improving real-time detection performance under changing weather and light exposure, as well as detecting small or partially obscured objects in crowded cities. We provide a hybrid architecture that leverages CNN for robust feature extraction, YOLO for rapid detection, and ViTs for global context capture via self-attention. Using a training regimen that prioritizes flexible learning rates and data augmentation procedures, the model is trained on an extensive dataset of urban settings.
Compared to standalone YOLO, CNN, or ViT models, the suggested model exhibits an increase in detection accuracy. This improvement is especially noticeable in difficult situations, such as settings with high occlusion and low light. In addition, it attains a decrease in inference time in comparison to baseline models, allowing real-time object detection without performance loss. This work introduces a novel method of object identification that integrates CNN, YOLO, and ViTs in a synergistic way. The resultant framework extends the use of integrated deep learning models in practical applications while also setting a new standard for detection performance under a variety of conditions. Our research advances computer vision by providing a scalable and effective approach to object identification problems. Its possible uses include autonomous navigation, security, and other areas.