Grasping

Grasping is an important but challenging skill for a robot. To integrate this skill into our robots, we need a tool that can determine the best grasp pose for our robot gripper to grasp an object of interest.

Grasp Generation Methods

We distinguish three strategies for determining the best grasp pose:

Logic-based

The logic-based strategy defines ‘rules’ that should lead to grasping an object successfully. This strategy can be useful when the object to grasp is always the same. For example, to grasp bricks, one might introduce rules like ‘grasp the brick as close to its center as possible’ and ‘always place the fingers on the long sides of the brick’. The main disadvantage of this method is that it doesn’t scale when the objects to grasp vary or are unknown: it is impossible to define rules for every possible object.
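
As a minimal sketch of what such rules look like in code, the brick rules could be implemented like this (the pose convention and the long_axis input are assumptions of this example, not part of any library):

    import numpy as np

    def grasp_pose_for_brick(center, long_axis):
        """Apply the two brick rules: grasp at the center, fingers on the long sides.

        center is the brick center in world coordinates, long_axis a unit
        vector along the brick's long side; the gripper approaches from above.
        """
        approach = np.array([0.0, 0.0, -1.0])        # approach the brick top-down
        closing = np.cross(approach, long_axis)      # close across the short width,
        closing /= np.linalg.norm(closing)           # so the fingers land on the long sides
        pose = np.eye(4)
        pose[:3, 0] = closing                        # gripper x: closing direction
        pose[:3, 1] = np.cross(approach, closing)    # gripper y: completes the frame
        pose[:3, 2] = approach                       # gripper z: approach direction
        pose[:3, 3] = center                         # grasp at the brick center
        return pose

Every new object shape would need its own rule set like this, which is exactly why the strategy does not scale.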

Generation-based

The generation-based strategy uses a trained AI model to sample grasp poses on a given object, followed by scoring and filtering to select the best poses. Creating the required models is complex and computationally very expensive, but fortunately some models are shared, for example alongside research papers. Two popular models published on GitHub are GraspNet (AnyGrasp) and GraspGen (paper).
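
The common shape of such a pipeline can be sketched as follows (sample_grasp_poses and score are illustrative names for this sketch, not the API of any specific model):

    def best_grasps(model, object_cloud, top_k=10):
        """Generic sample-score-filter pipeline of a generation-based model."""
        candidates = model.sample_grasp_poses(object_cloud)    # neural sampling
        scored = [(model.score(g, object_cloud), g) for g in candidates]
        scored.sort(key=lambda pair: pair[0], reverse=True)    # best scores first
        return [grasp for _, grasp in scored[:top_k]]          # filter to the top poses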

Vision-Language-Action-based

The vision-language-action (VLA) based strategy requires an AI model that integrates vision, language and actions to execute tasks. With such a model, one could theoretically give a command like ‘pull the red handle’. A model like OpenVLA or DexGraspVLA should then create a list of actions that brings the observed state to the commanded state. This is a continuous process of performing actions and checking the resulting visual feedback, trying to get closer to the commanded state.
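
That feedback loop can be sketched as follows (the model and robot interfaces are hypothetical placeholders, not the actual OpenVLA or DexGraspVLA APIs):

    def run_vla_command(model, robot, command, max_steps=100):
        """Closed VLA loop: observe, predict an action, execute, repeat."""
        for _ in range(max_steps):
            image = robot.get_camera_image()               # visual feedback
            action = model.predict_action(image, command)  # hypothetical interface
            if action is None:                             # model judges the task done
                return True
            robot.execute(action)                          # step towards the commanded state
        return False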

Integration of a Generation-Based Model

Since the logic-based strategy does not scale, while the VLA-based strategy is still very complex and takes over the robot's entire action strategy, we decided to focus on the generation-based strategy. To use an existing generation-based model, we need to integrate it into our robot system. Since there are many generation-based models and development is still very active, a modular way to integrate a model is desirable.

Architecture of Generation-Based Models

Inspection of the two popular models, GraspNet and GraspGen, shows many similarities:

        flowchart BT
    input["`depth_image
            rgb_image
            mask`"] 
    input --> get_and_process_data
    subgraph GraspNet
        get_and_process_data -- "point_cloud" --> get_grasps
        get_grasps -- "grasps" --> collision_detection
    end
    input --> depth_and_segmentation_to_point_clouds
    subgraph GraspGen
        depth_and_segmentation_to_point_clouds -- "point_cloud" --> GraspGenSampler.run_inference
        GraspGenSampler.run_inference -- "grasps" --> filter_colliding_grasps
    end
    

Both methods provide demo code. GraspNet offers only one demo, which takes an RGBD image as input. GraspGen offers several demo scripts, whose input can be a point cloud, a mesh or an RGBD image. All GraspGen demos and the GraspNet demo first convert the input to a point cloud and use a mask to isolate the point cloud of the object of interest.

We can therefore conclude that the actual input of the models is an isolated point cloud of the object to grasp. Both models generate grasp poses from this isolated point cloud. Finally, both methods offer a collision detection step that checks every generated grasp for gripper collisions with the original point cloud.

Modular Integration

To integrate a model in a modular way, we need to define the input and the output. Since both GraspNet and GraspGen eventually use a point cloud of the object of interest to sample a list of grasp poses, we define the isolated point cloud as our input and the list of grasp poses as our output.
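
This contract can be captured in a small interface; the Grasp type and the 4x4 pose convention below are assumptions of this sketch:

    from abc import ABC, abstractmethod
    from dataclasses import dataclass

    import numpy as np

    @dataclass
    class Grasp:
        pose: np.ndarray   # 4x4 gripper pose in the point cloud frame
        score: float       # model confidence, higher is better

    class GraspGenerator(ABC):
        """The defined contract: isolated object point cloud in, grasp poses out."""

        @abstractmethod
        def generate_grasps(self, point_cloud: np.ndarray) -> list[Grasp]:
            """point_cloud is an (N, 3) array of the object of interest."""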

Instead of using the model's own functions to create this point cloud from another data format, we can implement this functionality as a separate function in our system. This gives us the flexibility to create a more detailed point cloud, for example from multiple RGBD images. It also makes this step independent of the chosen model, which makes it easier to switch between models.
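
A minimal sketch of such a function using Open3D, assuming per-view depth images (in millimetres), object masks and known camera poses:

    import numpy as np
    import open3d as o3d

    def create_point_cloud(depth_images, masks, intrinsic, extrinsics):
        """Fuse masked depth images from several viewpoints into one object cloud.

        intrinsic is an o3d.camera.PinholeCameraIntrinsic, extrinsics are the
        camera-to-world transforms per view (assumptions of this sketch).
        """
        merged = o3d.geometry.PointCloud()
        for depth, mask, extrinsic in zip(depth_images, masks, extrinsics):
            masked = np.where(mask, depth, 0).astype(np.float32)   # drop background pixels
            cloud = o3d.geometry.PointCloud.create_from_depth_image(
                o3d.geometry.Image(masked), intrinsic)
            cloud.transform(extrinsic)                             # into the world frame
            merged += cloud
        merged = merged.voxel_down_sample(voxel_size=0.002)        # thin out duplicate points
        return np.asarray(merged.points)                           # (N, 3), as defined above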

To use the grasp generation of a model, some wrapper software needs to be written. This wrapper can be similar to the model's demo code, with the difference that it only uses the grasp generation function and respects our defined input and output.
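
Building on the interface above, such a wrapper for GraspGen could look roughly like this (the exact run_inference signature and return values are assumptions and may differ per GraspGen version):

    class GraspGenWrapper(GraspGenerator):
        """Wraps GraspGen's generation step behind our defined input/output."""

        def __init__(self, sampler):
            self.sampler = sampler   # a configured GraspGenSampler instance

        def generate_grasps(self, point_cloud):
            # Mirrors the demo code, but only calls the generation function;
            # the signature below is an assumption of this sketch.
            poses, scores = self.sampler.run_inference(point_cloud)
            return [Grasp(pose=p, score=s) for p, s in zip(poses, scores)]

Switching to another model would then only require an equivalent wrapper around, for example, GraspNet's get_grasps function.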

The collision detection functionality should also be a separate function in our own system, for the same reasons: it gives us more control and keeps us independent of the chosen grasp generation model.
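
As an illustration, a deliberately simplified check that only tests two finger-sized boxes against the scene cloud (the gripper dimensions are made up; GraspNet and GraspGen ship more detailed gripper models):

    import numpy as np

    FINGER_SIZE = np.array([0.01, 0.02, 0.04])   # per-finger box extents in m (assumed)
    FINGER_OFFSET = 0.05                         # finger centers at +/- x in m (assumed)

    def collision_filtering(grasps, scene_points):
        """Drop grasps whose finger volumes contain points of the scene cloud."""
        kept = []
        for grasp in grasps:
            world_to_gripper = np.linalg.inv(grasp.pose)   # scene into the gripper frame
            local = scene_points @ world_to_gripper[:3, :3].T + world_to_gripper[:3, 3]
            collides = False
            for sign in (-1.0, 1.0):                       # left and right finger
                center = np.array([sign * FINGER_OFFSET, 0.0, 0.0])
                inside = np.all(np.abs(local - center) < FINGER_SIZE / 2, axis=1)
                if inside.any():
                    collides = True
                    break
            if not collides:
                kept.append(grasp)
        return kept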

        flowchart BT
    create_point_cloud -- "point_cloud" --> generate_grasps
    generate_grasps -- "grasps" --> collision_filtering
    
    generate_grasps["generate_grasps(model)"]
    generate_grasps -- "point_cloud" --> get_grasps
    get_grasps -- "grasps" --> generate_grasps 
    generate_grasps -- "point_cloud" --> GraspGenSampler.run_inference
    GraspGenSampler.run_inference -- "grasps" --> generate_grasps 
    
    subgraph GraspModelWrapper
        direction TB
        generate_grasps
        subgraph GraspNet
            get_grasps
        end
        subgraph GraspGen
            GraspGenSampler.run_inference
        end
    end
    

Software Architecture

To achieve the described modular integration, several functionalities need to be developed:

  • create_point_cloud

  • generate_grasps_wrapper

  • collision_filtering

To share data between these functionalities, we can use the ROS architecture. We could consider a GraspingManager (similar to our MoveitManager) that manages all grasp-related functionality. The MoveitManager can use an instance of the GraspingManager, since the movement towards the generated grasp poses will be generated using MoveIt. To achieve this, the GraspingManager should be written as a C++ class/library, since our MoveitManager is also written in C++.

The generation-based models, however, are written in Python. This means that the GraspModelWrapper should also be written in Python. We can define the GraspModelWrapper as an individual ROS node that communicates with the other functionalities of the GraspingManager using ROS topics, services or actions.
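
A sketch of that node using rclpy, assuming a custom GenerateGrasps service (point cloud request, grasp poses response) in a hypothetical our_grasp_interfaces package, plus assumed message conversion helpers:

    import rclpy
    from rclpy.node import Node

    from our_grasp_interfaces.srv import GenerateGrasps   # hypothetical custom service

    class GraspModelWrapperNode(Node):
        """Exposes a model's grasp generation to the C++ GraspingManager."""

        def __init__(self, generator):
            super().__init__('grasp_model_wrapper')
            self.generator = generator   # e.g. a GraspGenWrapper instance
            self.srv = self.create_service(
                GenerateGrasps, 'generate_grasps', self.on_generate_grasps)

        def on_generate_grasps(self, request, response):
            points = pointcloud2_to_numpy(request.point_cloud)       # assumed helper
            grasps = self.generator.generate_grasps(points)
            response.grasps = [pose_to_msg(g.pose) for g in grasps]  # assumed helper
            return response

On the C++ side, the GraspingManager can then call this service like any other ROS service.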

        flowchart LR 
    subgraph C++
        subgraph MoveitManager
            subgraph GraspingManager
                create_point_cloud
                generate_grasps
                collision_filtering
            end
        end
    end

    subgraph Python
        subgraph GraspModelWrapper
            subgraph GraspGen
                GraspGenSampler.run_inference
            end
        end
    end

    generate_grasps <-- ROS Network --> GraspGenSampler.run_inference

    style C++ fill: #659AD2
    style Python fill: #FFD43B