Funding: Guangdong Basic and Applied Basic Research Foundation (2023B1515120064); National Natural Science Foundation of China (62273097).
Abstract: Deep learning has become integral to robotics, particularly in tasks such as robotic grasping, where objects often exhibit diverse shapes, textures, and physical properties. Because grasp targets are so varied, the network architecture and parameters must be adjusted frequently to avoid a drop in model accuracy, which presents a significant challenge for non-experts. Neural Architecture Search (NAS) provides a compelling alternative: it generates network architectures automatically, using efficient search algorithms to discover models that achieve high accuracy. Compared with manually designed networks, NAS methods can significantly reduce design cost and time while improving model performance. However, such methods often produce complex topological connections, and these redundant structures can severely reduce computational efficiency. To overcome this challenge, this work puts forward a robotic grasp detection framework founded on NAS. The method automatically designs a lightweight network with high accuracy and low topological complexity that adapts effectively to the target object to generate the optimal grasp pose, thereby significantly improving the success rate of robotic grasping. Additionally, we use Class Activation Mapping (CAM) as an interpretability tool, visualizing the regions to which the model is sensitive during perception. The searched model achieved competitive, and in some cases superior, performance on the Cornell and Jacquard public datasets, with accuracies of 98.3% and 96.8%, respectively, while sustaining a detection speed of 89 frames per second with only 0.41 million parameters. To further validate its effectiveness beyond benchmark evaluations, we conducted real-world grasping experiments on a UR5 robotic arm, where the model demonstrated reliable performance across diverse objects and high grasp success rates, confirming its practical applicability in robotic manipulation tasks.
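Since CAM is used here purely as a visualization aid, a minimal Grad-CAM-style sketch (assuming a PyTorch model whose final convolutional layer is passed in; the model and layer are placeholders, not the paper's searched network) shows how such saliency maps are typically produced:

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer):
    """Compute a Grad-CAM heatmap for one input image.

    model: any torch.nn.Module (placeholder for the searched network)
    image: tensor of shape (1, C, H, W)
    target_layer: the conv layer whose activations are visualized
    """
    activations, gradients = [], []
    h1 = target_layer.register_forward_hook(
        lambda m, i, o: activations.append(o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: gradients.append(go[0]))

    score = model(image).max()        # e.g., peak grasp-quality score
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()

    acts, grads = activations[0], gradients[0]
    weights = grads.mean(dim=(2, 3), keepdim=True)  # GAP over spatial dims
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear",
                        align_corners=False)
    return cam / (cam.max() + 1e-8)   # normalize to [0, 1] for overlay
```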
Funding: National Natural Science Foundation of China (Grant No. 52475280); Shaanxi Provincial Natural Science Basic Research Program (Grant No. 2025SYSSYSZD-105).
Abstract: Robots are key to expanding the scope of space applications. End-to-end training for robot vision-based detection and precision operations is challenging owing to constraints such as extreme environments and high computational overhead. This study proposes a lightweight integrated framework for grasp detection and imitation learning, named GD-IL; it comprises a grasp detection algorithm based on manipulability and a Gaussian mixture model (manipulability-GMM), and a grasp trajectory generation algorithm based on two-stage robot imitation learning (TS-RIL). In the manipulability-GMM algorithm, we apply GMM clustering and ellipse regression to the object point cloud, propose two judgment criteria to generate multiple candidate grasp bounding boxes for the robot, and use manipulability as the metric for selecting the optimal grasp bounding box. The TS-RIL algorithm has two stages: grasp trajectory learning and robot pose optimization. In the first stage, the robot grasp trajectory is characterized using a second-order dynamic movement primitive model and Gaussian mixture regression (GMR); by adjusting the functional form of the forcing term, the robot closely reproduces the target grasping trajectory. In the second stage, a robot pose optimization model is built from the derived pose-error formula and the manipulability metric, allowing the robot to adjust its configuration in real time while grasping and thereby effectively avoid singularities. Finally, an algorithm verification platform is developed on the Robot Operating System, and a series of comparative experiments is conducted in real-world scenarios. The results demonstrate that GD-IL significantly improves the effectiveness and robustness of grasp detection and trajectory imitation learning, outperforming existing state-of-the-art methods in execution efficiency, manipulability, and success rate.
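Two standard constructs the abstract relies on but does not spell out are the Yoshikawa manipulability measure and the second-order dynamic movement primitive; a hedged numpy sketch of both (textbook forms, not the authors' implementation) follows:

```python
import numpy as np

def manipulability(J):
    """Yoshikawa manipulability w = sqrt(det(J J^T)); larger values mean
    the arm is farther from a singular configuration."""
    return np.sqrt(max(np.linalg.det(J @ J.T), 0.0))

def dmp_rollout(y0, g, w, n_steps=1000, tau=1.0,
                alpha_z=25.0, beta_z=6.25, alpha_x=3.0):
    """Integrate a 1-D second-order DMP. The forcing term f(x) is a
    normalized mixture of Gaussian basis functions with weights w, so
    changing w (or the functional form of f) reshapes the trajectory."""
    n_bf = len(w)
    c = np.exp(-alpha_x * np.linspace(0, 1, n_bf))  # basis centers in x
    h = n_bf / c                                    # basis widths
    dt = 1.0 / n_steps
    x, y, z = 1.0, y0, 0.0
    path = []
    for _ in range(n_steps):
        psi = np.exp(-h * (x - c) ** 2)
        f = (psi @ w) / (psi.sum() + 1e-10) * x * (g - y0)
        dz = alpha_z * (beta_z * (g - y) - z) + f   # spring-damper + forcing
        z += dz * dt / tau
        y += z * dt / tau
        x += -alpha_x * x * dt / tau                # canonical system decay
        path.append(y)
    return np.array(path)
```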
Funding: Jiangxi Provincial Natural Science Foundation (No. 20232BAB202027); National Natural Science Foundation of China (No. 62367006); Natural Science Foundation of Hubei Province of China (No. 2022CFB536).
Abstract: With the rapid development of robotics, grasp prediction has become fundamental to achieving intelligent physical interaction. To enhance grasp detection accuracy in unstructured environments, we propose a novel Cross-Multiscale Adaptive Collaborative and Fusion Grasp Detection Network (CMACF-Net). Addressing the limitations of conventional methods in capturing multi-scale spatial features, CMACF-Net introduces the Quantized Multi-scale Global Attention Module (QMGAM), which enables precise multi-scale spatial calibration and adaptive spatial-channel interaction, yielding a more robust and discriminative feature representation. To reduce the degradation of local features and the loss of high-frequency information, the Cross-scale Context Integration (CCI) module fuses and aligns global context with local detail. Furthermore, an Efficient Up-Convolution Block (EUCB) is integrated into a U-Net architecture to restore spatial details lost during downsampling while preserving computational efficiency. Extensive evaluations demonstrate that CMACF-Net achieves state-of-the-art detection accuracies of 98.9% and 95.9% on the Cornell and Jacquard datasets, respectively. Additionally, real-time grasping experiments on the RM65-B robotic platform validate the framework's robustness and generalization capability, underscoring its applicability to real-world robotic manipulation scenarios.
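The abstract does not define the EUCB internally; the sketch below is an assumption following common efficient-decoder designs (bilinear upsampling, a depthwise convolution, then a pointwise projection) rather than the paper's exact block:

```python
import torch.nn as nn

class EfficientUpConvBlock(nn.Module):
    """Hypothetical efficient up-convolution block: bilinear upsampling,
    a depthwise 3x3 conv for cheap spatial refinement, and a pointwise
    1x1 conv for channel projection. Not the paper's exact EUCB."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear",
                              align_corners=False)
        self.dw = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        self.bn = nn.BatchNorm2d(in_ch)
        self.act = nn.ReLU(inplace=True)
        self.pw = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        x = self.up(x)                    # recover spatial resolution
        x = self.act(self.bn(self.dw(x)))
        return self.pw(x)                 # project to decoder channel width
```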
Funding: National Natural Science Foundation of China (No. 62173230); Program of Science and Technology Commission of Shanghai Municipality (No. 22511101400).
Abstract: Robot grasp detection is a fundamental vision task for robots. Deep learning-based methods have shown excellent results in enhancing the grasp detection of model-free objects in unstructured scenes. Most popular approaches explore deep network models and exploit RGB-D images, combining colour and depth data to acquire enriched feature representations. However, current work struggles to strike a satisfactory balance between accuracy and real-time performance; the variability of RGB and depth feature distributions receives inadequate attention, and the treatment of predicted failure cases is lacking. We propose an efficient fully convolutional network to predict pixel-level antipodal grasp parameters from RGB-D images. A structure with hierarchical feature fusion is established using multiple lightweight feature extraction blocks, and a feature fusion module with 3D global attention selects the complementary information in the RGB and depth images sufficiently. Additionally, a grasp configuration optimization method based on the local grasp path is proposed to cope with possible failures predicted by the model. Extensive experiments on two public grasping datasets, Cornell and Jacquard, demonstrate that the approach improves performance when grasping unknown objects.
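Pixel-level antipodal grasp networks in this family typically regress per-pixel quality, angle, and width maps; a minimal decoding sketch (a generic GG-CNN-style head layout, assumed rather than taken from this paper) is:

```python
import numpy as np

def decode_grasp(quality, cos2t, sin2t, width):
    """Decode pixel-wise grasp maps into one (row, col, angle, width) grasp.

    quality, cos2t, sin2t, width: HxW maps. The angle is encoded as
    (cos 2θ, sin 2θ) so that θ and θ+π (equivalent antipodal grasps) map
    to the same target, avoiding a wrap-around discontinuity.
    """
    r, c = np.unravel_index(np.argmax(quality), quality.shape)
    angle = 0.5 * np.arctan2(sin2t[r, c], cos2t[r, c])
    return r, c, angle, width[r, c]
```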
Funding: National Natural Science Foundation of China (No. 92048205); China Scholarship Council (No. 202008310014).
Abstract: To balance inference speed and detection accuracy, both of which are important for robot grasping tasks, we propose an encoder-decoder structured pixel-level grasp detection neural network named the attention-based efficient robot grasp detection network (AE-GDN). Three spatial attention modules are introduced in the encoder stages to enhance detailed information, and three channel attention modules are introduced in the decoder stages to extract more semantic information. Several lightweight and efficient DenseBlocks connect the encoder and decoder paths to improve the feature modeling capability of AE-GDN. A high intersection over union (IoU) between the predicted grasp rectangle and the ground truth does not necessarily indicate a high-quality grasp configuration and may even lead to a collision, because traditional IoU loss calculations treat the center of the predicted rectangle as equally important as the area around the grippers. We design a new IoU loss calculation method based on an hourglass box matching mechanism, which creates a good correspondence between high IoU values and high-quality grasp configurations. AE-GDN achieves accuracies of 98.9% and 96.6% on the Cornell and Jacquard datasets, respectively. The inference speed reaches 43.5 frames per second with only about 1.2 million parameters. AE-GDN has also been deployed on a practical robotic arm grasping system, where it performs grasping well. Code is available at https://github.com/robvincen/robot_gradet.
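For context on how such accuracies are scored, the usual Cornell/Jacquard rectangle metric counts a prediction correct when its angle is within 30° of a ground-truth rectangle and their rotated-rectangle IoU exceeds 0.25; a sketch using shapely (the standard metric, not the paper's hourglass variant) follows:

```python
import numpy as np
from shapely.geometry import Polygon

def rect_corners(cx, cy, w, h, theta):
    """Corners of a grasp rectangle centered at (cx, cy), rotated by theta."""
    dx, dy = np.array([w / 2, 0.0]), np.array([0.0, h / 2])
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    c = np.array([cx, cy])
    return [c + R @ (sx * dx + sy * dy)
            for sx, sy in [(-1, -1), (1, -1), (1, 1), (-1, 1)]]

def grasp_correct(pred, gt, iou_thresh=0.25, angle_thresh=np.deg2rad(30)):
    """Standard rectangle metric: angle within 30 deg AND IoU > 0.25.
    pred and gt are (cx, cy, w, h, theta) tuples."""
    d = abs(pred[4] - gt[4]) % np.pi      # grasp angles are pi-periodic
    if min(d, np.pi - d) > angle_thresh:
        return False
    p, g = Polygon(rect_corners(*pred)), Polygon(rect_corners(*gt))
    inter = p.intersection(g).area
    return inter / (p.area + g.area - inter + 1e-9) > iou_thresh
```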
Funding: National Key Research and Development Program of China (Grant No. 2018AAA010-3002); National Natural Science Foundation of China (Grant Nos. 62172392, 61702482 and 61972379).
Abstract: Grasp detection is a visual recognition task in which a robot uses its sensors to detect graspable objects in its environment. Despite steady progress in robotic grasping, achieving detection that is both real-time and highly accurate remains difficult. In this paper, we propose a real-time robotic grasp detection method that accurately predicts potential grasps for parallel-plate robotic grippers from RGB images. Our work employs an end-to-end convolutional neural network consisting of a feature descriptor and a grasp detector, and for the first time we add an attention mechanism to the grasp detection task, enabling the network to focus on grasp regions rather than the background. Specifically, we present an angular label smoothing strategy to enhance the fault tolerance of the network. We evaluate the method quantitatively and qualitatively from different aspects on the public Cornell and Jacquard datasets. Extensive experiments demonstrate that it achieves superior performance to state-of-the-art methods; in particular, it ranked first on both the Cornell and Jacquard datasets, with accuracies of 98.9% and 95.6%, respectively, at real-time speed.
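Angular label smoothing is named but not defined; a plausible sketch (assuming angles are binned into classes and probability mass is spread over neighboring bins so near-miss angles incur a smaller loss) is:

```python
import numpy as np

def smooth_angle_label(theta, n_bins=18, sigma=1.0):
    """Turn a continuous grasp angle into a soft classification target.

    theta is pi-periodic (a grasp rotated by pi is the same grasp), so the
    bins cover [0, pi) and the Gaussian wraps around circularly.
    """
    bin_width = np.pi / n_bins
    target_bin = (theta % np.pi) / bin_width
    bins = np.arange(n_bins)
    d = np.abs(bins - target_bin)
    d = np.minimum(d, n_bins - d)        # circular distance between bins
    label = np.exp(-0.5 * (d / sigma) ** 2)
    return label / label.sum()           # normalized soft target
```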
Funding: National Natural Science Foundation of China (Nos. 62073322 and 61633020); CIE-Tencent Robotics X Rhino-Bird Focused Research Program (No. 2022-07); Beijing Natural Science Foundation (No. 2022MQ05).
Abstract: Grasp detection plays a critical role in robot manipulation. Mainstream pixel-wise grasp detection networks with an encoder-decoder structure have received much attention owing to their good accuracy and efficiency. However, they usually pass only the high-level encoder feature to the decoder, neglecting low-level features even though these contain abundant detail information; how to fully exploit them remains unsolved. Meanwhile, the channel information in the high-level feature is also not well mined, and grasp detection performance inevitably degrades. To solve these problems, we propose a grasp detection network with hierarchical multi-scale feature fusion and an inverted shuffle residual module. Low-level and high-level encoder features are first fused by the designed skip connections with an attention module, and the fused information is then propagated to the corresponding decoder layers for in-depth feature fusion; such hierarchical fusion safeguards the quality of grasp prediction. Furthermore, in the inverted shuffle residual module, the high-level encoder feature is split along the channel dimension and the resulting splits are processed in separate branches; this differentiated processing keeps more high-dimensional channel information and enhances the representation ability of the network. In addition, an information enhancement module is added before the encoder to reinforce the input. The proposed method attains 98.9% image-wise and 97.8% object-wise accuracy on the Cornell grasping dataset, and the experimental results verify its effectiveness.
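The exact design of the inverted shuffle residual module is not given in the abstract; as a hedged approximation, a ShuffleNet-style unit with channel split, per-branch processing, and channel shuffle captures the split-and-mix mechanism described:

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    """Interleave channels across groups so the next split mixes branches."""
    n, c, h, w = x.shape
    return (x.view(n, groups, c // groups, h, w)
             .transpose(1, 2).reshape(n, c, h, w))

class ShuffleSplitResidual(nn.Module):
    """Sketch of a split-and-shuffle residual unit (assumed design, not the
    paper's exact inverted shuffle residual): half the channels pass through
    untouched, half are transformed, then the halves are re-mixed."""
    def __init__(self, ch):
        super().__init__()
        half = ch // 2
        self.branch = nn.Sequential(
            nn.Conv2d(half, half, 1), nn.BatchNorm2d(half), nn.ReLU(True),
            nn.Conv2d(half, half, 3, padding=1, groups=half),  # depthwise
            nn.BatchNorm2d(half),
            nn.Conv2d(half, half, 1), nn.BatchNorm2d(half), nn.ReLU(True))

    def forward(self, x):
        identity, transform = x.chunk(2, dim=1)   # channel split
        out = torch.cat([identity, self.branch(transform)], dim=1)
        return channel_shuffle(out)
```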
Funding: National Natural Science Foundation of China (No. 52101346); Fundamental Research Funds for the Central Universities, China (No. 2232019D3-61); Initial Research Fund for the Young Teachers of Donghua University, China.
Abstract: Robotic grasping plays an important role in the service and industrial fields, and whether a robotic arm can grasp an object properly depends on the accuracy of the grasp detection result. To predict grasp positions for known or unknown objects with a modular robotic system, a convolutional neural network (CNN) with residual blocks is proposed that generates accurate grasp detections for input images of the scene. The proposed architecture was trained on the standard Cornell grasp dataset, evaluated on its test split, and further evaluated on different types of household objects and cluttered multi-object scenes. On the Cornell grasp dataset, the model achieved accuracies of 95.5% and 93.6% on image-wise and object-wise splits, respectively, with a real detection time of 109 ms per image. The experimental results show that the model can quickly detect the grasp positions of single or multiple objects at the pixel level in real time while maintaining good stability and robustness.
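For reference, the canonical residual block named here (the paper's exact variant may differ) is:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Canonical residual block: two 3x3 convs plus an identity shortcut,
    so the block learns a residual F(x) and outputs F(x) + x."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(ch))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)  # shortcut keeps gradients flowing
```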
Funding: Central Program of Basic Science of the National Natural Science Foundation of China (72088101); National Postdoctoral Program for Innovative Talents (BX2021285).
Abstract: Robotic grasping is an essential problem at both the household and industrial levels, and unstructured objects have always been difficult for grippers. Parallel-plate grippers and algorithms that focus on partial information about objects are among the most widely used approaches. However, most works predict single-size grasp rectangles for fixed cameras and gripper sizes. In this paper, a multi-scale grasp detector is proposed to predict grasp rectangles of different sizes on RGB-D or RGB images in real time, for hand-eye cameras and various parallel-plate grippers. The detector extracts feature maps at multiple scales and makes predictions on each scale independently. To guarantee independence between scales as well as efficiency, a fully matching model and a background classifier are applied in the network. Based on an analysis of the Cornell Grasp Dataset, the fully matching model can match all labeled grasp rectangles. Furthermore, background classification, together with angle classification and box regression, functions as hard negative mining and background prediction. The detector is trained and tested on the augmented dataset, which includes images of 320×320 pixels and grasp rectangles ranging from 20 to more than 320 pixels. It achieves up to 98.87% accuracy on the image-wise split and 97.83% on the object-wise split at a speed of more than 22 frames per second. In addition, although trained on a single-object dataset, the detector can predict grasps on multiple objects.
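To illustrate scale-independent prediction, an assumed FPN-style assignment rule (illustrative size bands, not the paper's) routes each labeled rectangle to exactly one pyramid level:

```python
def assign_scale(rect_px, level_ranges=((20, 80), (80, 160), (160, 321))):
    """Route a grasp rectangle (size of its longer side, in pixels) to the
    single pyramid level whose band covers it, so each scale is trained and
    predicts independently. Bands here are hypothetical examples."""
    for level, (lo, hi) in enumerate(level_ranges):
        if lo <= rect_px < hi:
            return level
    return None  # outside all bands -> treated as background

# Example: a 96-pixel grasp rectangle trains only the mid-scale head.
assert assign_scale(96) == 1
```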