Paper status: completed

SemGrasp: Semantic Grasp Generation via Language Aligned Discretization

Published: 04/05/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

SemGrasp proposes generating natural human grasps from language instructions, transcending geometry-only methods. It employs a discrete representation to align grasp and semantic spaces by fine-tuning a Multimodal Large Language Model (MLLM) on the new CapGrasp dataset.

Abstract

Generating natural human grasps necessitates consideration of not just object geometry but also semantic information. Solely depending on object shape for grasp generation confines the applications of prior methods in downstream tasks. This paper presents a novel semantic-based grasp generation method, termed SemGrasp, which generates a static human grasp pose by incorporating semantic information into the grasp representation. We introduce a discrete representation that aligns the grasp space with semantic space, enabling the generation of grasp postures in accordance with language instructions. A Multimodal Large Language Model (MLLM) is subsequently fine-tuned, integrating object, grasp, and language within a unified semantic space. To facilitate the training of SemGrasp, we have compiled a large-scale, grasp-text-aligned dataset named CapGrasp, featuring about 260k detailed captions and 50k diverse grasps. Experimental findings demonstrate that SemGrasp efficiently generates natural human grasps in alignment with linguistic intentions. Our code, models, and dataset are available publicly at: https://kailinli.github.io/SemGrasp.


In-depth Reading

English Analysis

1. Bibliographic Information

  • Title: SemGrasp: Semantic Grasp Generation via Language Aligned Discretization
  • Authors:
    • Kailin Li (Shanghai Jiao Tong University, Shanghai AI Laboratory)
    • Jingbo Wang (Shanghai AI Laboratory)
    • Lixin Yang (Shanghai Jiao Tong University)
    • Cewu Lu (Shanghai Jiao Tong University)
    • Bo Dai (Shanghai AI Laboratory)
  • Journal/Conference: The paper is available on arXiv, a preprint server for academic articles. This means it has not yet undergone formal peer review for publication in a journal or conference.
  • Publication Year: 2024
  • Abstract: The paper addresses the challenge of generating natural human grasps, arguing that semantic information is as crucial as object geometry. It introduces SemGrasp, a novel method that generates static human grasp poses based on language instructions. The core innovation is a discrete grasp representation that aligns the grasp space with the semantic space of language. This is achieved by fine-tuning a Multimodal Large Language Model (MLLM) to integrate object geometry, grasp parameters, and language. To train this model, the authors created CapGrasp, a large-scale dataset with approximately 260,000 detailed captions for 50,000 diverse grasps. Experiments confirm that SemGrasp successfully generates natural and linguistically-aligned human grasps.
  • Original Source Link:

2. Executive Summary

  • Background & Motivation (Why):

    • Core Problem: Generating realistic human grasps for applications like Augmented/Virtual Reality (AR/VR) and robotics requires understanding not just how to hold an object, but why. For instance, one grasps a mug by the handle to avoid hot liquid, or a bottle by the cap to open it. Existing methods primarily focus on object geometry and physical stability, largely ignoring the rich semantic context and human intent.
    • Gaps in Prior Work: Previous attempts to incorporate semantics used coarse affordance vectors or employed vision-language models to filter pre-generated grasps. There was no method to directly generate a grasp from a detailed, descriptive language instruction. A major bottleneck was the lack of large-scale datasets that tightly align rich text descriptions with 3D grasp data.
    • Innovation: SemGrasp introduces a new paradigm where grasp generation is framed as a language-to-action task. Its key innovation is to discretize the continuous space of hand poses into a sequence of semantic tokens, making it compatible with the token-based architecture of Large Language Models (LLMs).
  • Main Contributions / Findings (What):

    1. SemGrasp Method: A novel framework that generates human grasps directly from language instructions. It uses a Multimodal Large Language Model (MLLM) to reason about object geometry, language intent, and the corresponding hand pose.
    2. Discrete Grasp Representation: The paper proposes a hierarchical, discrete representation for grasps, breaking each grasp down into three semantic tokens: an orientation token <o>, a manner token <m>, and a refinement token <r>. This representation is learned with a Vector Quantized Variational Autoencoder (VQ-VAE), which effectively bridges the gap between continuous 3D poses and discrete language tokens.
    3. CapGrasp Dataset: The authors compiled and released a new, large-scale dataset specifically for this task. CapGrasp contains over 50,000 hand-object grasps annotated with ~260,000 detailed captions, including low-level contact details, high-level intents, and conversational exchanges, generated with the help of GPT-4.

3. Prerequisite Knowledge & Related Work

  • Foundational Concepts:

    • Grasp Generation: The process of computing a configuration for a hand or robotic gripper to hold an object securely. For human hands, this involves determining the position, orientation, and joint angles of all fingers.
    • MANO Model: A parametric 3D model of the human hand. It can generate a variety of hand meshes by controlling a small set of parameters for pose ($\theta$) and shape ($\beta$). It is a standard tool in hand-object interaction research.
    • Point Cloud: A set of 3D coordinates representing the surface of an object. It is a common way to represent 3D geometry for processing by deep learning models.
    • Vector Quantized Variational Autoencoder (VQ-VAE): A type of generative model that learns to compress data (like a grasp) into a discrete latent representation by mapping it to the closest vector in a learned "codebook." This process, called vector quantization, is crucial for converting continuous data into discrete tokens that an LLM can understand.
    • Multimodal Large Language Models (MLLMs): LLMs, like GPT-4 or Llama, that have been extended to process inputs from multiple modalities, such as text and images (e.g., LLaVA) or, in this case, text and 3D point clouds. They learn to find a common semantic space for these different data types.
  • Previous Works:

    • Geometry-Based Grasp Generation: Most prior work used object geometry to generate grasps, often employing models like Conditional Variational Autoencoders (cVAE) or Generative Adversarial Networks (GAN). While they could produce physically plausible grasps, they lacked semantic awareness. Examples include GrabNet and methods by Jiang et al.
    • Early Semantic Grasping: Some methods tried to incorporate semantics through simple "affordance vectors" (e.g., indicating which part of an object is graspable for a specific function) or by using a language model to filter a set of randomly sampled grasps. These approaches were indirect and could not handle detailed, free-form language instructions.
    • Hand-Object Interaction Datasets: Existing datasets like OakInk provided 3D hand and object poses but had limited semantic annotations. They might label an object part as "handle," but lacked detailed text descriptions like "grasp the handle firmly to lift the heavy mug."
  • Differentiation: SemGrasp fundamentally differs from prior work by treating grasp generation as a language translation problem. Instead of generating a continuous pose vector, it generates a sequence of discrete semantic tokens (<o>, <m>, <r>). This token-based representation is seamlessly integrated into an MLLM, allowing the model to leverage the powerful reasoning capabilities of LLMs to directly translate a user's textual intent into a corresponding 3D grasp. The creation of the CapGrasp dataset provides the necessary large-scale, aligned data to train such a model, which was a missing piece in the field.

4. Methodology (Core Technology & Implementation)

The SemGrasp methodology is composed of two main stages: first, learning a discrete representation for grasps, and second, training a language model to generate these discrete representations from text.

  • Principles: The core idea is inspired by how humans plan a grasp: first, determine the general orientation (e.g., approach a mug from the side), then decide on the grasp type or manner (e.g., use a full-hand grip on the handle), and finally, make fine adjustments for a stable contact. SemGrasp operationalizes this by creating a token for each step. Because language is inherently discrete (a sequence of words/tokens), discretizing the grasp action allows for a natural alignment between the two modalities within an MLLM.

4.1. Grasp Discretization

The goal is to convert a continuous grasp representation into a set of discrete tokens. A grasp $G$ is defined by the hand's global transformation relative to the object ($T$), its local pose parameters ($\theta$), and its shape parameters ($\beta$).

  • Steps & Procedures: A hierarchical VQ-VAE is used to learn three codebooks, one for each semantic component of the grasp:

    1. Orientation Token (<o>): The VQ-VAE first encodes the hand's global transformation $T$ into a discrete orientation token <o>.

    2. Manner Token (<m>): Conditioned on the object and the orientation token, a second level of the VQ-VAE encodes the hand's local pose ($\theta$) and shape ($\beta$) into a manner token <m>. This token captures the overall grasp type (e.g., pinch, power grasp).

    3. Refinement Token (<r>): Finally, conditioned on the object and the first two tokens, a third level encodes the residual parameters ($\Delta T, \Delta\theta, \Delta\beta$) needed to fine-tune the pose and ensure physical plausibility. This is the refinement token <r>.

      This process is illustrated in Figure 3, showing the cascaded encoders ($\mathcal{E}_1, \mathcal{E}_2, \mathcal{E}_3$) and decoders ($\mathcal{D}_1, \mathcal{D}_2, \mathcal{D}_3$). The object's point cloud features, extracted via PointBERT, provide conditional input at each stage.

      Figure 3: Grasp Discretization Process. The hierarchical VQ-VAE architecture: the object point cloud $O$ and hand point cloud $H$ are encoded by Point-BERT into features $f_O$ and $f_H$; a cascade of encoders ($\mathcal{E}$) then produces discrete tokens for orientation, manner, and refinement, and a corresponding cascade of decoders ($\mathcal{D}$) reconstructs the grasp pose from these tokens. Dashed arrows in the figure denote connections enabled only during training.

  • Mathematical Formulas & Key Details: The VQ-VAE is trained to minimize a combination of three losses:

    1. Reconstruction Loss ($\mathcal{L}_{\mathrm{rec}}$): This ensures that the grasp reconstructed from the discrete tokens is close to the original ground-truth grasp. The loss is the squared L2 distance between the vertices of the reconstructed hand mesh ($\hat{H}$) and those of the ground-truth hand mesh ($H$): $\mathcal{L}_{\mathrm{rec}} = \| H - \hat{H} \|_2^2 = \| H - \mathcal{M}(\hat{G}) \|_2^2$

      • $H$: Ground-truth hand mesh vertices.
      • $\hat{H}$: Reconstructed hand mesh vertices.
      • $\mathcal{M}(\cdot)$: The MANO model function that generates a hand mesh from the grasp parameters $\hat{G}$.
    2. Embedding Loss ($\mathcal{L}_{\mathrm{emb}}$) and Commitment Loss ($\mathcal{L}_{\mathrm{com}}$): These losses are standard in VQ-VAE training. The embedding loss pulls the selected codebook vector toward the encoder's output, while the commitment loss keeps the encoder's output close to its selected codebook vector; together they stabilize training. $\mathcal{L}_{\mathrm{emb}} + \mathcal{L}_{\mathrm{com}} = \| \mathrm{sg}[\mathcal{N}_{\mathcal{E}}(z)] - b_z \|_2^2 + \| \mathcal{N}_{\mathcal{E}}(z) - \mathrm{sg}[b_z] \|_2^2$ A minimal code sketch of this quantization and its losses appears at the end of this subsection.

      • $\mathcal{N}_{\mathcal{E}}(z)$: The continuous output of the encoder network for input $z$.
      • $b_z$: The codebook vector nearest to $\mathcal{N}_{\mathcal{E}}(z)$.
      • $\mathrm{sg}[\cdot]$: The stop-gradient operator, which blocks gradients from flowing through its argument during backpropagation.
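
To make the discretization concrete, below is a minimal PyTorch-style sketch of the cascaded quantization and its training losses. Module names, feature dimensions, codebook sizes, and the simple linear encoders/decoder are illustrative assumptions standing in for the paper's actual networks; the MANO layer is passed in as a differentiable stub.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Maps a continuous feature vector to its nearest codebook entry."""
    def __init__(self, num_codes=512, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                              # z: (B, dim)
        dists = torch.cdist(z, self.codebook.weight)   # (B, num_codes)
        idx = dists.argmin(dim=-1)                     # discrete token id, (B,)
        b_z = self.codebook(idx)                       # quantized vector, (B, dim)
        # Embedding loss pulls the codebook toward the (stopped) encoder output;
        # commitment loss keeps the encoder output close to its (stopped) code.
        l_emb = F.mse_loss(b_z, z.detach())
        l_com = F.mse_loss(z, b_z.detach())
        # Straight-through estimator: pass decoder gradients back to the encoder.
        b_z = z + (b_z - z).detach()
        return idx, b_z, l_emb + l_com

class GraspTokenizer(nn.Module):
    """Cascaded encoders emit <o>, <m>, <r>; a decoder maps them back to grasp params."""
    def __init__(self, d_obj=256, d_hand=256, d=256, d_grasp=61):
        super().__init__()
        self.enc_o = nn.Linear(d_obj + d_hand, d)           # orientation level
        self.enc_m = nn.Linear(d_obj + d_hand + d, d)       # manner level
        self.enc_r = nn.Linear(d_obj + d_hand + 2 * d, d)   # refinement level
        self.vq_o = VectorQuantizer(dim=d)
        self.vq_m = VectorQuantizer(dim=d)
        self.vq_r = VectorQuantizer(dim=d)
        self.dec = nn.Linear(d_obj + 3 * d, d_grasp)        # -> (T, theta, beta); size assumed

    def forward(self, f_obj, f_hand, mano_layer, gt_verts):
        z_o = self.enc_o(torch.cat([f_obj, f_hand], dim=-1))
        o, b_o, l_o = self.vq_o(z_o)
        z_m = self.enc_m(torch.cat([f_obj, f_hand, b_o], dim=-1))
        m, b_m, l_m = self.vq_m(z_m)
        z_r = self.enc_r(torch.cat([f_obj, f_hand, b_o, b_m], dim=-1))
        r, b_r, l_r = self.vq_r(z_r)
        grasp_params = self.dec(torch.cat([f_obj, b_o, b_m, b_r], dim=-1))
        verts = mano_layer(grasp_params)            # differentiable MANO-style stub
        l_rec = F.mse_loss(verts, gt_verts)         # L_rec on hand vertices (mean instead of sum)
        return (o, m, r), verts, l_rec + l_o + l_m + l_r
```

At inference time, only the object features and the three token indices are needed; the frozen decoders and the MANO layer recover the full hand mesh from them.
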

4.2. Grasp Aware Language Model

Once the VQ-VAE is trained and frozen, its encoder serves as a "grasp tokenizer." The next step is to train an MLLM to generate these grasp tokens given an object and a language instruction.

  • Steps & Procedures: The architecture, inspired by LLaVA, aligns three modalities:

    • Object Modal: An object's point cloud is fed into a pre-trained PointBERT model to extract geometric features ($f_O$). These features are then passed through a linear projection layer ($\mathcal{P}_O$) to map them into the LLM's word embedding space. An important detail is the inclusion of a dedicated object size token, since PointBERT normalizes object scale, which is critical information for grasping (a minimal projection sketch appears after Figure 2 below).

    • Language Modal: The text instruction is tokenized using the standard tokenizer of the base LLM (Vicuna-7B).

    • Grasp Modal: The target output is the sequence of discrete grasp tokens <o, m, r>, enclosed by special start-of-grasp and end-of-grasp marker tokens.

      The overall pipeline is shown in Figure 2.

      Figure 2: SemGrasp Pipeline. Figure 2 shows the end-to-end pipeline. An object point cloud ($O$) and a language instruction ($L$) are processed and fed into the Grasp-Aware Language Model. The model autoregressively outputs the three discrete grasp tokens, which are then decoded by the frozen VQ-VAE decoder into a final 3D hand pose via the MANO model.
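
As a rough illustration of how the object modality could be mapped into the LLM's token space, here is a minimal sketch; the feature dimensions, the size-token handling, and all names are assumptions for illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class ObjectProjector(nn.Module):
    """Projects point-cloud features and an object-scale scalar into the LLM embedding space."""
    def __init__(self, d_point=384, d_llm=4096):
        super().__init__()
        self.proj = nn.Linear(d_point, d_llm)       # plays the role of the projection P_O
        self.size_embed = nn.Linear(1, d_llm)       # encodes real-world object scale

    def forward(self, point_feats, obj_scale):
        # point_feats: (B, N_tokens, d_point) from a frozen PointBERT-style encoder
        # obj_scale:   (B, 1) physical size, needed because the point encoder normalizes geometry
        obj_tokens = self.proj(point_feats)                      # (B, N_tokens, d_llm)
        size_token = self.size_embed(obj_scale).unsqueeze(1)     # (B, 1, d_llm)
        return torch.cat([obj_tokens, size_token], dim=1)

# These object tokens are concatenated with the embedded instruction, and the LLM is
# trained to emit the grasp tokens between start-of-grasp / end-of-grasp markers.
```
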

  • Training: The MLLM is trained using LoRA (Low-Rank Adaptation), an efficient fine-tuning technique. The training proceeds in two stages:

    1. Multimodal Alignment: The model is trained to predict the grasp tokens based on object features and a simple language description. In this stage, the object projection layer PO\mathcal{P}_O and new special token embeddings are trained.

    2. Instruction Tuning: The model is fine-tuned on more complex, conversational data from the CapGrasp dataset to improve its instruction-following and reasoning abilities. The projection layer is frozen during this stage for stability.

      The model is trained with a standard autoregressive, next-token prediction objective, minimizing the negative log-likelihood loss; a minimal code sketch of this objective follows the symbol definitions below.

    $\mathcal{L}_{\mathrm{NLL}} = -\log p(\hat{X} \mid X) = -\sum_i \log p(\hat{x}^i \mid \hat{x}^{<i}, X)$

    • $X$: The input sequence (object and language tokens).
    • $\hat{X}$: The target output sequence (language response and grasp tokens).
    • $\hat{x}^i$: The $i$-th token in the target sequence.
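
The negative log-likelihood above is the standard causal language-modeling cross-entropy, applied after the grasp tokens have been added to the LLM's vocabulary. A minimal sketch, assuming prompt positions are masked with a label of -100:

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """logits: (B, T, V) LLM outputs; labels: (B, T) with -100 on prompt/object positions."""
    # Shift so position t predicts token t + 1, as in standard causal LM training.
    logits = logits[:, :-1, :].contiguous()
    labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,   # only the response and grasp tokens contribute to the loss
    )
```
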

4.3. CapGrasp Dataset

A key contribution is the CapGrasp dataset, created by augmenting the existing OakInk dataset with rich text annotations using GPT-4 and GPT-4V. It includes three levels of annotation (a hypothetical example record is sketched after the list):

  • Low-level: Contact states (which fingers touch which object parts), calculated from 3D geometry.
  • High-level: Manipulation intent and grasp force, inferred from low-level contacts or images using GPT-4/GPT-4V. For example, inferring the intent "to unscrew" when fingers are on a bottle cap.
  • Conversational: Human-like dialogues about grasping, generated by GPT-4 to create instruction-tuning data.
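
For concreteness, a single CapGrasp-style sample might bundle these three levels as follows; the field names and values below are a hypothetical schema invented for illustration, not the dataset's actual format.

```python
# Hypothetical CapGrasp-style record (field names and values are illustrative only).
sample = {
    "object_id": "oakink_mug_example",
    "grasp": {"trans": [...], "pose": [...], "shape": [...]},   # MANO-style parameters
    "low_level": {   # contact states computed from 3D geometry
        "contacts": {"thumb": "handle", "index": "handle", "middle": "handle"},
    },
    "high_level": {  # intent and force inferred by GPT-4 / GPT-4V
        "intent": "lift the mug to drink",
        "force": "moderate",
    },
    "conversation": [  # GPT-4-generated instruction-tuning dialogue
        {"role": "user", "content": "How should I hold this mug of hot coffee?"},
        {"role": "assistant", "content": "Grasp it by the handle so your fingers avoid the hot body."},
    ],
}
```
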

5. Experimental Setup

  • Datasets: The CapGrasp dataset was used for all training and evaluation. The authors followed the official split of the underlying OakInk dataset: 80% for training, 10% for validation, and 10% for testing.

  • Evaluation Metrics:

    • Physical Plausibility Metrics:

      1. Mean Per-Vertex Position Error (MPVPE):
        • Conceptual Definition: Measures the reconstruction accuracy of the grasp. It calculates the average Euclidean distance between each vertex of the predicted hand mesh and the corresponding vertex of the ground-truth hand mesh. Lower is better. Units are in millimeters (mm).
        • Mathematical Formula: $\text{MPVPE} = \frac{1}{N} \sum_{i=1}^{N} \| v_i - \hat{v}_i \|_2$
        • Symbol Explanation: $N$ is the number of vertices in the hand mesh (778 for MANO), $v_i$ is the position of the $i$-th vertex in the ground-truth mesh, and $\hat{v}_i$ is the position of the $i$-th vertex in the predicted mesh. A small code sketch of this metric appears at the end of this section.
      2. Penetration Depth (PD):
        • Conceptual Definition: Measures the severity of interpenetration between the hand and the object. It is the maximum distance a hand vertex has penetrated inside the object mesh. Lower is better. Units are in centimeters (cm).
      3. Solid Intersection Volume (SIV):
        • Conceptual Definition: Quantifies the total volume of the hand mesh that is inside the object mesh. It provides a more holistic measure of penetration than PD. Lower is better. Units are cubic centimeters (cm³).
      4. Simulation Displacement (SD):
        • Conceptual Definition: Evaluates the physical stability of a grasp. The grasp is placed in a physics simulator (PyBullet), and gravity is applied. SD measures how much the object's center moves. A smaller displacement indicates a more stable grasp. Both the mean and standard deviation are reported. Lower is better. Units are in centimeters (cm).
    • Semantic Consistency Metrics:

      1. GPT-4 Assisted Evaluation:
        • Conceptual Definition: An automated method to score the semantic alignment between the generated grasp and the input text. A rendered image of the grasp is shown to GPT-4V along with the text prompt, and the model is asked to score the consistency on a scale of 0-100. Higher is better.
      2. P-FID (Fréchet Inception Distance for Point Clouds):
        • Conceptual Definition: Measures the similarity between the distribution of generated hand poses and the distribution of ground-truth poses. It computes the Fréchet distance between features extracted from the point clouds of the predicted and real hand meshes. A lower score indicates the generated distribution is closer to the real one.
      3. Perceptual Score (PS):
        • Conceptual Definition: A human evaluation metric. Volunteers were asked to rate the generated grasps on a 5-point Likert scale based on naturalness and semantic consistency with the instruction. The final score is the average rating. Higher is better.
  • Baselines:

    • For Grasp Representation: The paper compares its VQ-VAE reconstruction performance against two state-of-the-art cVAE-based methods: GrabNet [68] and Jiang et al. [30].
    • For Language-Guided Generation: Since no direct prior work existed, the authors created a strong baseline by fine-tuning a BERT model to perform a classification task, predicting the three discrete grasp tokens (<o, m, r>) from the language and object features.
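
As a small worked example of the reconstruction metric, the sketch below computes MPVPE over a pair of MANO meshes (778 vertices each); the penetration, intersection-volume, and simulation metrics additionally require mesh boolean operations and a physics simulator, and are omitted here.

```python
import numpy as np

def mpvpe_mm(pred_verts: np.ndarray, gt_verts: np.ndarray) -> float:
    """Mean per-vertex position error in mm; both arrays are (778, 3) in the same frame."""
    assert pred_verts.shape == gt_verts.shape == (778, 3)
    return float(np.linalg.norm(pred_verts - gt_verts, axis=-1).mean())
```
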

6. Results & Analysis

  • Core Results:

    • Discrete VQ-VAE Grasp Representation: Table 1 shows the reconstruction performance. This is a manual transcription of the data from Table 1 in the paper.

      | Method | MPVPE (mm) ↓ | PD (cm) ↓ | SIV (cm³) ↓ | SD mean (cm) ↓ | SD std (cm) ↓ |
      |---|---|---|---|---|---|
      | CapGrasp dataset (GT) | - | 0.11 | 0.62 | 0.94 | 1.62 |
      | GrabNet [68] w/o RefineNet | 18.14 | 0.76 | 5.42 | 1.75 | 2.61 |
      | GrabNet [68] | 27.49 | 0.54 | 3.45 | 1.77 | 2.36 |
      | GrabNet [68] w/ TTA | 27.16 | 0.49 | 2.16 | 1.35 | 1.56 |
      | Jiang et al. [30] w/o TTA | 33.68 | 0.72 | 5.81 | 1.53 | 1.77 |
      | Jiang et al. [30] w/ TTA | 33.84 | 0.58 | 2.78 | 1.36 | 1.55 |
      | Ours w/o refinement token | 20.36 | 0.48 | 3.00 | 1.95 | 2.11 |
      | Ours | 14.97 | 0.46 | 2.72 | 2.14 | 2.37 |
      | Ours w/ TTA | 23.61 | 0.37 | 1.27 | 1.90 | 2.12 |
      • Analysis: The proposed discrete representation achieves the best MPVPE (14.97 mm), indicating high reconstruction fidelity. Adding the refinement token <r> significantly improves performance over the variant without it. With Test-Time Adaptation (TTA), a common optimization step, the method achieves state-of-the-art results on the physical plausibility metrics (PD and SIV), demonstrating that the discrete representation does not sacrifice physical accuracy.
    • Controllable Generation & Language Guided Generation:

      Figure 5: Qualitative Results. Figure 5a (left) shows that by fixing the orientation (<o>) and manner (<m>) tokens, SemGrasp generates grasps with consistent orientation and manner across different mug shapes, while the cVAE-based GrabNet produces varied and uninterpretable results. Figure 5b (right) showcases qualitative results of language-guided generation, with the red box indicating a failure case.

      This is a manual transcription of the data from Table 2 in the paper.

      | Method | P-FID ↓ | PD (cm) ↓ | SIV (cm³) ↓ | SD mean (cm) ↓ | SD std (cm) ↓ | GPT-4 ↑ | PS ↑ |
      |---|---|---|---|---|---|---|---|
      | CapGrasp (Ground Truth) | - | 0.11 | 0.62 | 0.94 | 1.62 | 82.3 | 4.7 |
      | BERT [8] based | 3.32 | 0.49 | 4.60 | 2.17 | 2.26 | 47.3 | 3.7 |
      | SemGrasp | 2.28 | 0.48 | 4.24 | 2.00 | 2.33 | 74.5 | 4.6 |
      • Analysis: SemGrasp significantly outperforms the BERT-based classification baseline on all metrics, especially the semantic ones (GPT-4 score of 74.5 vs. 47.3, and Perceptual Score of 4.6 vs. 3.7). This demonstrates the superiority of leveraging a pretrained MLLM's reasoning capabilities over a simple classification approach. The physical plausibility metrics are also slightly better, showing the MLLM can generate high-quality grasps.
  • Ablations / Parameter Sensitivity:

    • Representation Ablation (Table 3): This is a manual transcription of the data from Table 3 in the paper.

      | Setting | MPVPE (mm) ↓ | PD (cm) ↓ | SIV (cm³) ↓ | SD mean (cm) ↓ | SD std (cm) ↓ |
      |---|---|---|---|---|---|
      | One token | 29.95 | 0.66 | 5.14 | 1.90 | N/A |
      | <o, m> | 25.73 | 0.58 | 4.32 | 2.14 | N/A |
      | <o, m, r × 2> | 15.37 | 0.50 | 2.98 | 2.28 | N/A |
      | <o, m, r × 3> | 15.90 | 0.52 | 3.37 | 1.98 | N/A |
      | Single VQ | 28.02 | 0.68 | 5.31 | 1.81 | N/A |
      | w/o semantic | 21.94 | 0.60 | 4.59 | 1.84 | N/A |
      • Analysis: This study confirms the design choices. Using a single token or a single non-hierarchical codebook ("Single VQ") leads to poor performance. The three-token <o, m, r> structure is optimal; adding extra refinement tokens (r × 2, r × 3) does not help and can even degrade performance. Explicitly assigning semantic meaning to each token is also crucial, as the "w/o semantic" variant performs notably worse.
    • VQ-VAE & MLLM Settings (Tables 4 & 5): These studies show that specific hyperparameters (codebook size, LoRA rank) and design choices (using Vicuna-7B, adding the object size token, and two-stage training) are all critical for achieving the best results. For example, removing the object size token significantly degrades performance, confirming that object scale is vital information for the MLLM.

7. Conclusion & Reflections

  • Conclusion Summary: The paper successfully introduces SemGrasp, a novel and effective method for generating human grasps from natural language. Its main contributions—the semantically aligned discrete representation, the MLLM-based generation framework, and the CapGrasp dataset—collectively push the state-of-the-art from purely geometric grasp synthesis towards semantically-aware, intention-driven interaction. The results demonstrate that the generated grasps are not only physically plausible but also align well with linguistic instructions.

  • Limitations & Future Work:

    • The current method focuses on generating static, single-hand grasps. The authors acknowledge that real-world manipulation often involves dynamic motions and bimanual (two-handed) coordination.
    • The generation of full-body grasp motions is not end-to-end; it requires integration with separate Reinforcement Learning (RL) policies.
    • Future work aims to tackle two-hand manipulation and end-to-end semantic grasp motion synthesis, which would require more extensive and complex datasets.
  • Personal Insights & Critique:

    • Strengths: The core idea of discretizing a continuous action space to align it with language is powerful and highly innovative. It provides a blueprint for applying LLMs to other robotics and motor control tasks. The creation and release of the CapGrasp dataset is a very significant contribution that will likely enable much future research in this domain. The experimental validation is thorough, with strong baselines and comprehensive ablation studies.
    • Potential Weaknesses: The reliance on GPT-4 for dataset annotation, while pragmatic, might introduce the model's inherent biases into the dataset. The evaluation also relies on GPT-4V, which, while powerful, is not a perfectly objective or infallible judge of semantic consistency. The overall system is quite complex, integrating multiple large, pre-trained models (PointBERT, Vicuna) and a custom-trained VQ-VAE, which could pose challenges for reproducibility and deployment in resource-constrained environments.
    • Future Impact: SemGrasp represents a significant step toward creating more intuitive and human-like interactions in AR/VR and robotics. By allowing users to specify complex actions with natural language, it opens the door for more intelligent and collaborative embodied agents. The approach of "language-aligned discretization" could be highly influential and transferable to other domains like navigation, tool use, and complex manipulation planning.
