The Grasp Planning Problem
Given an RGB-D observation of a scene, predict a 6-DOF gripper pose (position + orientation) that produces a stable grasp on a target object. This deceptively simple problem statement hides significant complexity: the space of possible gripper poses is continuous and high-dimensional, most poses will fail, and the relationship between pre-grasp pose and grasp success depends on contact mechanics that are not fully visible from a single RGB-D frame.
The problem divides roughly into two regimes: top-down grasping (2.5D: fix the approach direction to vertical and predict only planar position, in-plane rotation, and gripper width), which simplifies the problem enough for classical methods; and 6-DOF grasping (full pose prediction), which requires learned methods for reliable generalization.
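To make the 6-DOF parameterization concrete, a grasp pose is usually stored as a 4x4 homogeneous transform built from the grasp center plus an approach axis and a jaw-closing axis. A minimal numpy sketch (the axis convention here is illustrative; individual planners differ):

```python
import numpy as np

def grasp_pose(center, approach, closing):
    """Build a 4x4 gripper pose from a grasp center, an approach axis,
    and a jaw-closing axis. Convention here: z = approach, x = closing
    (projected orthogonal to z), y completes a right-handed frame."""
    z = approach / np.linalg.norm(approach)
    x = closing - np.dot(closing, z) * z     # enforce orthogonality
    x = x / np.linalg.norm(x)
    y = np.cross(z, x)
    T = np.eye(4)
    T[:3, 0], T[:3, 1], T[:3, 2], T[:3, 3] = x, y, z, center
    return T

# A top-down grasp at the origin, jaws closing along the world x-axis.
T = grasp_pose(np.zeros(3),
               approach=np.array([0.0, 0.0, -1.0]),
               closing=np.array([1.0, 0.0, 0.0]))
```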
Grasp Detection Approaches: The Taxonomy
The field has evolved through three generations of approaches, each still relevant for different use cases:
Generation 1 — Geometric/analytical methods: Compute grasp quality from the object's geometric model using metrics like the epsilon quality metric (largest wrench the grasp can resist) or force closure analysis. These methods require a known object mesh and accurate pose estimation. They are deterministic, interpretable, and fast — but cannot handle novel objects. Still used in industrial bin-picking where the object catalog is known and fixed.
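The epsilon quality metric has a direct geometric computation: build the convex hull of the contact wrenches (friction cone edges plus the torques they induce) and measure the distance from the wrench-space origin to the nearest hull facet. A sketch for the planar case, assuming hard-finger contacts (the function name is illustrative):

```python
import numpy as np
from scipy.spatial import ConvexHull

def epsilon_quality(contacts, normals, mu):
    """Planar epsilon quality: radius of the largest origin-centered ball
    inside the convex hull of the contact wrenches. A positive value
    means force closure. contacts: (N,2) points, normals: (N,2) inward
    unit normals."""
    alpha = np.arctan(mu)                      # friction cone half-angle
    wrenches = []
    for p, n in zip(contacts, normals):
        for s in (alpha, -alpha):              # the two planar cone edges
            c, si = np.cos(s), np.sin(s)
            f = np.array([c * n[0] - si * n[1],
                          si * n[0] + c * n[1]])   # rotate normal by s
            tau = p[0] * f[1] - p[1] * f[0]    # torque about the origin
            wrenches.append([f[0], f[1], tau])
    hull = ConvexHull(np.array(wrenches))
    # qhull facets satisfy n.x + b <= 0 inside with |n| = 1, so the
    # distance from the origin to each facet is -b.
    return float(np.min(-hull.equations[:, -1]))

# Antipodal pinch on a block of half-width 1 with friction mu = 0.5:
eps = epsilon_quality(np.array([[-1.0, 0.0], [1.0, 0.0]]),
                      np.array([[1.0, 0.0], [-1.0, 0.0]]), mu=0.5)
# eps > 0, so this grasp is in force closure
```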
Generation 2 — Learned scoring on candidate grasps: Sample candidate grasp poses (randomly or using geometric heuristics), then score each candidate with a learned neural network. GPD, GQ-CNN, and GraspNet-1Billion belong to this category. The key difference from Generation 1 is that the scoring function is learned from data rather than computed analytically, enabling generalization to novel objects.
Generation 3 — Direct 6-DOF pose regression: Predict grasp poses directly from the point cloud without explicit candidate sampling. Contact-GraspNet and AnyGrasp use this approach, predicting a set of grasp poses per point in the cloud. Faster and more accurate than sample-and-score methods because the network learns to focus on promising regions of the point cloud rather than evaluating uniformly sampled candidates.
Classical Methods
- GPD (Grasp Pose Detection): Point cloud-based 6-DOF grasp pose detection. Samples candidate grasp poses on the point cloud surface, evaluates each with a learned binary classifier (grasp quality). Runs at approximately 5 Hz on a modern workstation. Reliable on clean, accurate point clouds — struggles with occlusion and sensor noise. Open source (MIT license).
- GQ-CNN (Berkeley AUTOLAB): Grasp quality CNN operating on depth image patches. Fast (50ms inference) and accurate for top-down grasps from direct overhead view. Limitation: limited to top-down approach direction — cannot predict side grasps or precise 6-DOF poses. Best for bin-picking scenarios with overhead camera. Open source with pre-trained models available.
- PointNetGPD: Combines PointNet-based point cloud encoding with GPD's candidate sampling. Learns grasp features directly from raw point clouds rather than voxelized representations. Improves on GPD in cluttered scenes where point cloud quality is uneven. Open source, but less actively maintained than GraspNet.
Learned 6-DOF Methods: Detailed Comparison
| Method | Input | Novel Objects | Speed | Training Data | Availability |
|---|---|---|---|---|---|
| GraspNet-1Billion | Point cloud | Good (ShapeNet) | 10 Hz | 1B+ grasps, 88 objects | Open source |
| Contact-GraspNet | Point cloud | Good (handles occlusion) | 8 Hz | Acronym dataset, 8K objects | Open source |
| AnyGrasp | Point cloud | Best (open-set) | 15 Hz | GraspNet-1B + augmented | Commercial API + SDK |
| FoundationGrasp | RGB-D + language | Best (language-guided) | 3 Hz | Multi-modal pre-training | Research (code released) |
| GraspNeRF | Multi-view RGB | Good (NeRF reconstruction) | 0.5 Hz | NeRF + grasp labels | Research only |
GraspNet-1Billion and Contact-GraspNet: The Open-Source Workhorses
GraspNet-1Billion (Fang et al., CVPR 2020) introduced the largest grasp dataset and a benchmark that has become the standard evaluation for learned 6-DOF grasping. The dataset contains over one billion grasp annotations across 88 everyday objects in 190 cluttered scenes, generated using analytical grasp quality metrics on full 3D meshes. The model uses PointNet++ to encode the point cloud and predicts per-point grasp poses with a quality score. The key innovation is the approach-level cylinder grouping that constrains the grasp prediction to feasible approach directions, significantly reducing false positives.
Contact-GraspNet (Sundermeyer et al., ICRA 2021) tackles the harder case of grasping in heavy clutter and partial occlusion. Rather than predicting grasp poses for isolated object segments, Contact-GraspNet predicts grasps for each visible contact point in the scene, regardless of which object the point belongs to. This makes it robust to segmentation errors — a major practical advantage because object segmentation in cluttered scenes is itself unreliable. The model was trained on the ACRONYM dataset (8K objects, 17.7M grasps) and achieves over 90% grasp success in real-robot experiments on structured clutter.
AnyGrasp: The Zero-Shot Standard
AnyGrasp, developed by the same group as GraspNet-1Billion, is the state-of-the-art method for grasping novel objects not seen during training. Trained on the GraspNet-1Billion dataset (over one billion grasp annotations across 88 objects) plus additional augmentation, AnyGrasp generalizes to open-set objects with 90%+ grasp success rate on standard benchmarks.
What makes AnyGrasp the practical choice for most teams is its zero-shot performance. You do not need to fine-tune it on your specific objects — the pre-trained model handles household items, industrial parts, food items, and irregular shapes with no additional training. This "just works" property is rare in robotics perception and is what makes AnyGrasp the recommended starting point for new grasp planning deployments.
The commercial API (available from Graspnet.net) provides inference as a service with a Python client library — useful for teams that want to use AnyGrasp without maintaining the inference infrastructure. For teams that need on-device inference, the SDK supports Jetson AGX Orin deployment at 8-12 Hz.
Dexterous Grasp Planning
All methods discussed so far assume a parallel-jaw gripper — the simplest and most common end-effector. Dexterous hands (3-5 fingers, 10-20+ DOF) require fundamentally different grasp planning because the contact configuration space is vastly larger.
DexGraspNet (Wang et al., 2023) generates large-scale dexterous grasp datasets using optimization-based grasp synthesis. Starting from random initial hand configurations, it uses differentiable physics simulation to optimize towards force closure. The resulting dataset (1.3M grasps across 5K objects) enables training learned dexterous grasp planners that generalize to novel objects.
UniDexGrasp (Xu et al., CVPR 2023) takes a reinforcement learning approach: train a policy in simulation to grasp arbitrary objects with a dexterous hand, then transfer to real hardware. The policy observes the point cloud and hand proprioception, and outputs per-joint position commands. UniDexGrasp achieves 85%+ success on novel objects in simulation but sim-to-real transfer remains a challenge — real success rates are 50-70% due to contact modeling errors (see our sim-to-real guide).
Practical recommendation: If you are using a parallel-jaw gripper, start with AnyGrasp. If you are using a dexterous hand (Orca Hand, Allegro Hand, LEAP Hand), start with DexGraspNet for data generation and train a policy on the generated grasps. SVRC stocks Orca Hand and compatible dexterous hands and provides integration support for dexterous grasp planning pipelines.
Point Cloud vs. Depth Image Methods
A practical question: should your grasp planner operate on raw depth images or on 3D point clouds? The tradeoffs:
| Criterion | Point Cloud Methods | Depth Image Methods |
|---|---|---|
| Multi-view fusion | Natural (register and concatenate) | Requires volumetric fusion (TSDF) |
| Inference speed | 8-15 Hz (PointNet++ backbone) | 20-50 Hz (CNN backbone) |
| 6-DOF grasp support | Native (3D coordinates) | Requires projection/back-projection |
| Clutter handling | Better (3D reasoning) | Struggles with overlapping objects |
| Pre-processing | Downsampling, normal estimation | Minimal (resize, normalize) |
| Best method | AnyGrasp, Contact-GraspNet | GQ-CNN (top-down only) |
For 6-DOF grasping in cluttered environments, point cloud methods win on every criterion except raw speed. For top-down bin-picking with a fixed overhead camera, depth image methods (GQ-CNN) are faster and sufficient. Most teams should default to point cloud methods unless speed requirements demand depth image approaches.
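For point cloud methods, the downsampling step in the pre-processing row above is typically a voxel-grid average. A numpy-only sketch (a stand-in for library routines such as Open3D's voxel downsampling):

```python
import numpy as np

def voxel_downsample(points, voxel=0.005):
    """Replace all points in each (voxel)^3 cell with their centroid.
    A numpy-only stand-in for library voxel-grid downsampling."""
    cells = np.floor(points / voxel).astype(np.int64)
    _, inverse, counts = np.unique(cells, axis=0, return_inverse=True,
                                   return_counts=True)
    inverse = inverse.ravel()                 # guard against (N,1) shape
    sums = np.zeros((len(counts), 3))
    np.add.at(sums, inverse, points)          # accumulate per-cell sums
    return sums / counts[:, None]

# 50K random points in a 10 cm cube, 1 cm voxels -> at most 1000 points.
cloud = np.random.default_rng(0).uniform(0.0, 0.1, (50_000, 3))
down = voxel_downsample(cloud, voxel=0.01)
```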
Integration Pipeline
The standard integration with a ROS2 + MoveIt2 stack follows a detect → filter → plan → execute loop: detect grasp candidates on the point cloud, filter and rank them, plan collision-free approach motions, and execute with a controller switch near contact.
Key implementation details: always filter grasp candidates by workspace bounds (exclude grasps that would collide with the table or are outside the arm's reach), set a quality score threshold (0.7 is a good starting point), and use impedance control in the final 5cm of approach to handle position errors. The switch from position control to impedance control at the approach distance is critical for reliable grasping — pure position control will jam on contact if there is any pose error.
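The workspace-bounds and score-threshold filtering described above can be sketched in a few lines of numpy (the array shapes and the 0.7 default follow the text; the function name is illustrative):

```python
import numpy as np

def filter_grasps(poses, scores, bounds, min_score=0.7):
    """Keep grasps inside an axis-aligned workspace box and above the
    quality threshold, returned best-first.
    poses: (N,4,4) gripper poses, scores: (N,), bounds: (2,3) min/max xyz.
    """
    pos = poses[:, :3, 3]                      # grasp positions
    in_box = np.all((pos >= bounds[0]) & (pos <= bounds[1]), axis=1)
    keep = in_box & (scores >= min_score)
    order = np.argsort(-scores[keep])          # best score first
    return poses[keep][order], scores[keep][order]

# Three candidates: one valid, one outside the box, one below threshold.
poses = np.tile(np.eye(4), (3, 1, 1))
poses[0, :3, 3] = [0.4, 0.0, 0.2]
poses[1, :3, 3] = [2.0, 0.0, 0.2]              # outside workspace
poses[2, :3, 3] = [0.5, 0.1, 0.3]
scores = np.array([0.9, 0.95, 0.5])            # last one below 0.7
bounds = np.array([[0.0, -0.5, 0.0], [1.0, 0.5, 1.0]])
kept, kept_scores = filter_grasps(poses, scores, bounds)  # one grasp left
```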
MoveIt2 Integration: Full Pipeline Architecture
Integrating a learned grasp planner into a production MoveIt2 stack involves more than calling the planner and executing. A robust pipeline requires collision-aware planning, grasp filtering, approach/retreat trajectory generation, and fallback logic. Here is the detailed architecture:
Grasp Pipeline Node (ROS2 Python)
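The node's core retry logic can be sketched in plain Python with the ROS2/MoveIt2 calls stubbed out as injected callables (`detect`, `ik_reachable`, and `plan_and_execute` are hypothetical names standing in for the real service and action clients):

```python
# Core retry logic of the pipeline node. The callables are stand-ins for
# ROS2 service/action clients so the control flow stays visible.

def pick(cloud, detect, ik_reachable, plan_and_execute, top_k=10):
    """Try the top-k grasps best-first; return the one that executed."""
    grasps = detect(cloud)                 # assumed sorted best-first
    for grasp in grasps[:top_k]:
        if not ik_reachable(grasp):        # cheap IK pre-check first
            continue                       # skip expensive planning
        if plan_and_execute(grasp):        # approach, close, lift
            return grasp
    return None                            # all top-k candidates failed

# Toy stand-ins: grasps 0-2 unreachable, planning succeeds only for 5.
best = pick(None, lambda cloud: list(range(10)),
            lambda g: g >= 3, lambda g: g == 5)
# best == 5: the first reachable grasp whose plan executed
```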
Key architectural decisions in this pipeline:
- Workspace cropping before detection reduces false positives from background surfaces and cuts inference time by 30-40% by reducing point count.
- IK pre-check before planning avoids expensive motion planning calls for unreachable grasps. The IK check (0.1ms per call) is 100x cheaper than a full plan (10-50ms).
- Cartesian path for final approach ensures a straight-line approach to the grasp pose, critical for avoiding collisions during the last few centimeters. Free-space motion planning may produce curved paths that graze nearby objects.
- Force-controlled close prevents crushing fragile objects. Set force limits based on your gripper: 5-10N for food/electronics, 15-30N for rigid objects, 50N+ for industrial parts.
- Top-10 fallback loop handles the common case where the highest-scored grasp is unreachable or has a collision. In practice, the first reachable grasp is usually within the top 5.
Camera Calibration for Grasp Planning
Grasp planning accuracy depends critically on the camera-to-robot calibration (hand-eye calibration). A 5mm calibration error translates directly to a 5mm grasp pose error — enough to cause failures on small objects. The calibration pipeline:
- Collect calibration data: Move the robot to 15-20 poses while observing an ArUco board or checkerboard mounted to the end-effector (eye-to-hand) or mounted in the workspace (eye-in-hand). Record the robot FK pose and the detected board pose for each configuration.
- Solve AX=XB: Use OpenCV's calibrateHandEye() with the Tsai-Lenz or Park method. This gives you the camera-to-base transform (eye-to-hand) or camera-to-end-effector transform (eye-in-hand).
- Validate: Command the robot to touch a known point in camera frame. Measure the error. Repeat for 10 points across the workspace. Target: <2mm mean error, <4mm max error.
- Re-calibrate periodically: After any camera mount adjustment, robot base movement, or significant temperature change (thermal expansion shifts camera position by up to 1mm per 10C).
SVRC provides pre-calibrated camera mounts for RealSense D435i and D405 that maintain <1.5mm calibration accuracy. For teams using custom mounts, our data services include hand-eye calibration as part of the hardware setup package.
Grasp Planner FPS by Hardware Platform
Inference speed varies dramatically across deployment hardware. These numbers were measured with batch size 1 on real point clouds (20K-40K points after workspace crop):
| Method | RTX 4090 | RTX 3060 | Jetson Orin | Jetson Xavier | CPU (i7-13700) |
|---|---|---|---|---|---|
| GQ-CNN (top-down) | 60 Hz | 45 Hz | 25 Hz | 12 Hz | 8 Hz |
| GraspNet-1Billion | 18 Hz | 10 Hz | 5 Hz | 2 Hz | 0.8 Hz |
| Contact-GraspNet | 15 Hz | 8 Hz | 4 Hz | 1.5 Hz | 0.5 Hz |
| AnyGrasp | 22 Hz | 15 Hz | 8 Hz | 3 Hz | 1.2 Hz |
| FoundationGrasp | 5 Hz | 3 Hz | 1.2 Hz | 0.4 Hz | N/A |
| GPD | 8 Hz | 5 Hz | 2.5 Hz | 1 Hz | 0.6 Hz |
Practical takeaway: On Jetson Orin (the standard edge compute for mobile robots), AnyGrasp at 8 Hz is the only 6-DOF method fast enough for reactive grasping. On a desktop RTX 3060, all methods run at acceptable speeds for pick-place tasks where you plan once and execute. For conveyor belt picking at 20+ picks/minute, you need either a desktop GPU or GQ-CNN on Jetson (top-down grasps only).
Grasp Success Rate by Object Category
Real-world grasp success depends heavily on object properties. These numbers aggregate results from multiple published evaluations and SVRC internal testing on OpenArm 101 with a Robotiq 2F-85 gripper:
| Object Category | AnyGrasp | Contact-GraspNet | GQ-CNN | Notes |
|---|---|---|---|---|
| Rigid boxes/cans | 95% | 93% | 90% | Easiest category |
| Irregular shapes (toys, tools) | 88% | 85% | 72% | GQ-CNN limited to top-down |
| Small objects (<3cm) | 78% | 72% | 65% | Depth noise dominates |
| Transparent (glass, clear plastic) | 35% | 30% | 25% | Depth sensor failure |
| Deformable (bags, cloth) | 55% | 50% | 40% | Shape changes on contact |
| Heavy clutter (>15 objects) | 75% | 80% | 58% | Contact-GraspNet excels here |
| Flat objects (books, plates) | 82% | 78% | 70% | Side grasps often needed |
Grasp Planning for Language-Conditioned Manipulation
The latest generation of grasp planners integrates language conditioning, allowing commands like "pick up the red mug" or "grasp the leftmost bottle." This bridges the gap between grasp planning (which grasp pose is stable?) and task planning (which object should I grasp?).
Architecture pattern: The typical language-conditioned grasp pipeline chains three components: (1) an open-vocabulary object detector (GroundingDINO, OWLv2) that localizes the target object from a text query, (2) a segmentation model such as SAM (Segment Anything) that produces a mask for the target object, and (3) a grasp planner that operates on the masked point cloud. This modular approach lets you swap each component independently.
This pipeline adds 80-150ms of latency for detection and segmentation (on an RTX 3060) but dramatically improves task-level success because the grasp planner only considers the target object, eliminating grasps on distractors. For teams building VLA-based manipulation systems, language-conditioned grasping is the standard perception front-end.
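The glue between stages (2) and (3), back-projecting the masked depth pixels into a point cloud for the grasp planner, is a few lines of numpy (the intrinsics fx, fy, cx, cy come from the camera driver; values below are illustrative):

```python
import numpy as np

def masked_point_cloud(depth, mask, fx, fy, cx, cy):
    """Back-project the masked depth pixels into a camera-frame point
    cloud. depth: HxW in meters, mask: HxW boolean from the segmenter."""
    v, u = np.nonzero(mask & (depth > 0))   # valid target pixels only
    z = depth[v, u]
    x = (u - cx) * z / fx                   # pinhole back-projection
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)      # (N, 3)

# Toy example: flat depth at 0.5 m, a 10x10-pixel target mask.
depth = np.full((480, 640), 0.5)
mask = np.zeros((480, 640), dtype=bool)
mask[100:110, 200:210] = True
pts = masked_point_cloud(depth, mask, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
# pts has shape (100, 3), all points at z = 0.5
```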
Open Benchmarks
GraspNet Benchmark: The standard evaluation for learned 6-DOF grasping. Uses 190 scenes with 88 objects (split into seen/similar/novel categories). Metric: AP, the precision of the top-ranked predicted grasps averaged over a range of friction coefficients. All major methods report results on this benchmark, making cross-method comparison straightforward.
YCB Grasping Benchmark: Uses the YCB object set (77 common household objects) with a standardized camera setup and evaluation protocol. Less focused on 6-DOF methods than GraspNet, but useful for comparing grasp success across different gripper types.
OCRTOC (Open Cloud Robot Table Organization Challenge): An online benchmark for tabletop manipulation that includes grasp planning as one component. Useful for evaluating full pick-place pipelines rather than grasp planning in isolation.
Failure Modes and Mitigations
- Transparent Objects (glass, clear plastic): Depth sensors fail on transparent surfaces because structured light and ToF both depend on diffuse reflection, which glass and clear plastic barely provide. Mitigation: use tactile search (move gripper to estimated contact point, probe with low force), add polarized lighting to increase surface visibility, or use RGB-based methods (FoundationGrasp) that do not require depth.
- Heavy Objects Near Payload Limit: Grasp planning does not account for payload limits. Grasp an object near the arm payload limit from the wrong angle and you may succeed at grasping but fail to lift due to torque limits. Mitigation: add payload estimation to your grasp selection filter (either from known object weights or from wrist F/T sensing after initial contact).
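The payload filter suggested above can start as a simple worst-case gravity-torque check (the function and its margin parameter are illustrative; a real filter would use the arm's full dynamics model):

```python
def liftable(mass_kg, lever_arm_m, torque_margin_nm):
    """Worst-case gravity-torque check: reject grasps whose static load
    exceeds the remaining joint torque margin. lever_arm_m is the
    horizontal distance from the shoulder axis to the grasp point."""
    g = 9.81
    return mass_kg * g * lever_arm_m <= torque_margin_nm

# 2 kg object grasped 0.6 m out with 15 Nm of margin left: ~11.8 Nm load.
ok = liftable(2.0, 0.6, 15.0)
```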
- Thin Objects (<3mm): A thin object lying flat on a surface leaves no room for standard parallel-jaw fingertips to close around it without colliding with the support surface, and grasp planners trained on standard objects produce invalid grasps for credit cards, sheets of paper, or thin plates. Mitigation: specialized gripper geometry, vacuum-based grasping, or a hybrid gripper with both jaw and suction modes.
- Highly Cluttered Scenes (>20 objects): Point cloud quality degrades in dense clutter due to inter-object occlusion. Contact-GraspNet handles this better than GraspNet-1Billion because it does not require object segmentation, but even Contact-GraspNet shows 15-20% accuracy degradation above 15 objects. Mitigation: use a decluttering strategy — grasp and remove easy objects first, then re-observe the scene for remaining difficult objects.
- Deformable Objects (bags, cloth, soft food): All current methods assume rigid objects. Deformable objects change shape under contact, invalidating the grasp pose predicted from the pre-contact observation. Mitigation: use conservative grasp force, implement reactive re-grasping based on contact force feedback, or collect task-specific demonstration data that captures deformable manipulation strategies.
SVRC stocks RealSense D435i and D405 cameras optimized for manipulation, along with pre-built ROS2 grasp planning nodes for GraspNet and AnyGrasp. Our data services include grasp demonstration collection for teams training custom grasp policies. Visit the hardware catalog and platform documentation for integration guides. Pilot data collection starts at $2,500.