Grasping
Grasping is an important but challenging skill for a robot. To integrate this skill into our robots, we need a tool that determines the best grasp pose for our robot gripper on an object of interest.
Grasp Generation Methods
We distinguish three strategies to determine the best grasp pose:
Logic-based
The logic-based strategy defines ‘rules’ that should lead to grasping an object successfully. This strategy can be useful when the object to grasp is always the same. For example, to grasp bricks, one might introduce rules like ‘grasp the brick as close to its center as possible’ and ‘always place the fingers on the long sides of the brick’. The big disadvantage of this method is that it does not scale when the objects to grasp vary or are unknown: it is impossible to define rules for every possible object.
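As a minimal sketch of such rules, assuming the brick’s pose and height are known from a detection step (the function name and frame conventions here are our own, not from any grasping library):

```python
import numpy as np

def brick_grasp_pose(brick_pose: np.ndarray, brick_height: float) -> np.ndarray:
    """Encode the two example rules as a top grasp on a brick.

    brick_pose: 4x4 transform of the brick center, with x along the long
    edge and z pointing up. The returned 4x4 gripper pose sits above the
    center (rule 1) and closes along the short y axis, so the fingers
    land on the long sides (rule 2).
    """
    grasp = brick_pose.copy()
    grasp[:3, 3] += brick_pose[:3, 2] * (brick_height / 2.0)  # move to the top face
    return grasp
```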
Generation-based
The generation-based strategy uses a trained AI model to sample grasp poses on a given object, followed by scoring and filtering to select the best poses. Creating the required models is complex and computationally very expensive, but fortunately some models are shared, for example alongside research papers. Two popular models published on GitHub are GraspNet (AnyGrasp) and GraspGen.
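The sample-score-filter idea can be sketched as follows; the `model.sample` interface is a hypothetical stand-in for the model-specific calls, not the actual API of GraspNet or GraspGen:

```python
import numpy as np

def generation_based_grasps(model, object_points: np.ndarray,
                            score_threshold: float = 0.5, top_k: int = 10):
    """Sample grasp candidates on a point cloud, then score and filter them.

    `model.sample` is assumed to return (N, 4, 4) grasp poses and (N,) scores.
    """
    poses, scores = model.sample(object_points)   # sample candidate grasps
    keep = scores >= score_threshold              # drop low-confidence poses
    poses, scores = poses[keep], scores[keep]
    order = np.argsort(-scores)[:top_k]           # best-scoring poses first
    return poses[order], scores[order]
```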
Vision-Language-Action-based
The vision-language-action (VLA) based strategy requires an AI model that integrates vision, language and actions to execute tasks. With these models, one could theoretically give a command like ‘pull the red handle’. A model like OpenVLA or DexGraspVLA should then produce a sequence of actions that brings the observed state to the commanded state. This is a continuous process of performing actions and checking the resulting visual feedback, getting ever closer to the commanded state.
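A rough sketch of that feedback loop, with hypothetical `model`, `camera` and `robot` interfaces that do not correspond to any specific VLA implementation:

```python
def run_vla(model, camera, robot, command: str, max_steps: int = 100):
    """Closed-loop VLA execution: observe, predict an action, execute, repeat."""
    for _ in range(max_steps):
        image = camera.capture()
        action = model.predict(image, command)  # e.g. an end-effector delta
        if action is None:                      # model signals task completion
            break
        robot.execute(action)
```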
Integration of a Generation-Based Model
Since the logic-based strategy does not scale, and a VLA model is still very complex and takes over the robot’s entire action strategy, we decided to focus on the generation-based strategy. To use an existing generation-based method, we need to integrate it into our robot system. Since there are many generation-based models and development is still very active, a modular way to integrate a model is desired.
Architecture of Generation-Based Models
Inspection of the two popular models GraspNet and GraspGen shows many similarities:
flowchart BT
input["`depth_image
rgb_image
mask`"]
input --> get_and_process_data
subgraph GraspNet
get_and_process_data -- "point_cloud" --> get_grasps
get_grasps -- "grasps" --> collision_detection
end
input --> depth_and_segmentation_to_point_clouds
subgraph GraspGen
depth_and_segmentation_to_point_clouds -- "point_cloud" --> GraspGenSampler.run_inference
GraspGenSampler.run_inference -- "grasps" --> filter_colliding_grasps
end
Both methods provide demo code. GraspNet only offers a demo that takes an RGBD image as input. GraspGen offers several demo scripts whose input can be a point cloud, a mesh or an RGBD image. All GraspGen demos and the GraspNet demo first convert the input to a point cloud and use a mask to isolate the point cloud of the object of interest.
We can therefore conclude that the actual input of the models is an isolated point cloud of the object to grasp. Both models generate grasp poses from this isolated point cloud. Finally, both methods also offer a collision detection step that checks, for every generated grasp, whether the gripper collides with the original point cloud.
Modular Integration
To integrate a model in a modular way, we need to define the input and the output. Since both GraspNet and GraspGen eventually use a point cloud of the object of interest to sample a list of grasp poses, these are respectively our defined input and output.
Instead of using the model functions to create this point cloud from another data format, we can implement this functionality as a separate function in our own system. This gives us the flexibility to create a more detailed point cloud, for example from multiple RGBD images. It also makes this step independent of the chosen model, which makes it easier to switch between models.
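A minimal sketch of such a create_point_cloud function for a single RGBD view, assuming pinhole intrinsics and a binary object mask (the signature is our own; clouds from multiple views could be merged after transforming them into a common frame):

```python
import numpy as np

def create_point_cloud(depth: np.ndarray, mask: np.ndarray,
                       fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Back-project the masked depth pixels of one RGBD view to 3D points.

    depth is in meters, mask selects the object of interest and fx, fy,
    cx, cy are the pinhole camera intrinsics.
    """
    v, u = np.nonzero((mask > 0) & (depth > 0))  # pixel coordinates of the object
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)  # (N, 3) points in the camera frame
```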
To use the grasp generation of a model, some wrapper software needs to be written. This wrapper can be similar to the demo code of a model, with the difference that it only calls the grasp generation function, respecting our defined input and output.
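A possible shape for this wrapper; only the point-cloud-in/grasps-out contract comes from our design, and the backend’s `infer` method is a hypothetical stand-in for the model-specific call (GraspNet’s get_grasps or GraspGen’s GraspGenSampler.run_inference):

```python
import numpy as np

class GraspModelWrapper:
    """Uniform point-cloud-in / grasps-out interface around a grasp model."""

    def __init__(self, backend):
        # backend: any object exposing infer(points) -> (poses, scores),
        # e.g. a thin adapter around GraspNet or GraspGen.
        self._backend = backend

    def generate_grasps(self, object_points: np.ndarray):
        """object_points: (N, 3) isolated object cloud; returns 4x4 poses + scores."""
        return self._backend.infer(object_points)
```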
The collision detection functionality should also be a separate function in our own system, for the same reasons: it gives us more control and keeps us independent of the chosen grasp generation model.
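A simple version of this filter could look like the sketch below, which approximates the gripper by a box in the grasp frame; the box dimensions and the representation of grasps as 4x4 poses are our own assumptions:

```python
import numpy as np

def filter_colliding_grasps(grasps, scene_points: np.ndarray,
                            gripper_box=(0.02, 0.08, 0.06)):
    """Reject grasps whose approximate gripper volume contains scene points.

    Each grasp is a 4x4 pose; the gripper is approximated by a box
    (depth, width, height) centered in the grasp frame. A real
    implementation would use the actual gripper geometry.
    """
    half = np.asarray(gripper_box) / 2.0
    collision_free = []
    for g in grasps:
        # Transform the scene into the grasp frame: R^T * (p - t), row-wise.
        local = (scene_points - g[:3, 3]) @ g[:3, :3]
        if not np.all(np.abs(local) <= half, axis=1).any():
            collision_free.append(g)
    return collision_free
```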
flowchart BT
create_point_cloud -- "point_cloud" --> generate_grasps
generate_grasps -- "grasps" --> collision_filtering
generate_grasps["generate_grasps(model)"]
generate_grasps -- "point_cloud" --> get_grasps
get_grasps -- "grasps" --> generate_grasps
generate_grasps -- "point_cloud" --> GraspGenSampler.run_inference
GraspGenSampler.run_inference -- "grasps" --> generate_grasps
subgraph GraspModelWrapper
direction TB
generate_grasps
subgraph GraspNet
get_grasps
end
subgraph GraspGen
GraspGenSampler.run_inference
end
end
Software Architecture
To achieve the described modular integration, several functionalities need to be developed:
create_point_cloud
generate_grasps_wrapper
collision_filtering
To share data between these functionalities, we can use the ROS architecture. We could introduce a GraspingManager (similar to our MoveitManager) that manages all grasp-related functionalities. The MoveitManager can use an instance of the GraspingManager, since motion towards the generated grasp poses will be planned using MoveIt. To achieve this, the GraspingManager should be written as a C++ class/library, since our MoveitManager is also written in C++.
The generation-based models, however, are written in Python. This means that the GraspModelWrapper must also be written in Python. We can implement the GraspModelWrapper as a separate ROS node that communicates with the other functionalities of the GraspingManager using ROS topics, services or actions.
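A minimal ROS 2 (rclpy) sketch of such a node; the topic names and the use of PoseArray for the grasp list are our own assumptions, and the actual model call is left out:

```python
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import PointCloud2
from geometry_msgs.msg import PoseArray

class GraspModelWrapperNode(Node):
    """Receives an isolated object point cloud and publishes grasp poses."""

    def __init__(self):
        super().__init__('grasp_model_wrapper')
        self._pub = self.create_publisher(PoseArray, 'grasps', 10)
        self._sub = self.create_subscription(
            PointCloud2, 'object_point_cloud', self._on_cloud, 10)

    def _on_cloud(self, msg: PointCloud2):
        # Convert the cloud and run the model here, e.g. via
        # GraspGenSampler.run_inference; omitted in this sketch.
        grasps = PoseArray()
        grasps.header = msg.header
        self._pub.publish(grasps)

def main():
    rclpy.init()
    rclpy.spin(GraspModelWrapperNode())

if __name__ == '__main__':
    main()
```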
flowchart LR
subgraph C++
subgraph MoveitManager
subgraph GraspingManager
create_point_cloud
generate_grasps
collision_filtering
end
end
end
subgraph Python
subgraph GraspModelWrapper
subgraph GraspGen
GraspGenSampler.run_inference
end
end
end
generate_grasps <-- ROS Network --> GraspGenSampler.run_inference
style C++ fill: #659AD2
style Python fill: #FFD43B