ActiveGrounder: 3D Visual Grounding
with Object-Hull-Guided Active Observation

Humanoids'25 Workshop "Bridging Humanoid Robotics and Foundation Models Workshop"
¹Urban Robotics Lab · *Equal contribution · †Corresponding author

Abstract

We present ActiveGrounder, a framework that transforms 3D visual grounding from a passive recognition task into an active exploration paradigm. Unlike existing methods that rely on static maps or single-image perception, ActiveGrounder integrates scene-graph-based mapping with object-hull-guided navigation to actively acquire informative viewpoints. Through experiments, we demonstrate that ActiveGrounder achieves more accurate and reliable grounding than passive baselines, offering a step toward embodied agents capable of active perception and grounding in the real world.

ActiveGrounder

Motivation

  • Task: 3D visual grounding interprets language queries in 3D scenes and grounds them to specific objects or regions in space.
  • Problem: Existing 3D visual grounding remains passive because it is not integrated with exploration.
    • Existing 3D visual grounding methods assume static pre-built maps → unsuitable for dynamic environments.
    • Existing image-based approaches operate without maps but are limited to 2D object representations and lack action generation → over-reliance on passive, single-image perception.
  • Goal: Enable embodied agents to actively perceive and ground visual queries in real-world 3D environments.
    • We propose ActiveGrounder, integrating active perception and 3D visual grounding to achieve robust grounding in dynamic environments.

Contribution

  • We introduce ActiveGrounder, shifting grounding from a passive recognition task to an active exploration paradigm that tightly integrates grounding and action.
  • We leverage a scene graph to maintain 3D object representations; the agent actively navigates around object hulls to observe them and selects optimal viewpoints for visual grounding.
  • We empirically validate our framework on simulation benchmarks and in real-world experiments, demonstrating robustness to perception errors and more effective visual grounding than existing passive methods.

Methodology

Fig. 1 - Framework
  • Scene Graph: Maintain an object-level 3D scene representation (see the data-structure sketch after this list).
    • Environment Level: Represents the given environment; the top-level context.
    • Keyframe Level: Stores the observed images and poses at specific time steps.
    • Object Level: Linked to each keyframe; contains the observed objects represented as 3D bounding boxes.
  • Object-Hull-Guided (OHG) Observation: When candidate objects are detected, actively adjust the viewpoint for better observation (see the hull sketch after this list).
    • Convex Hull & Traversable Points: For each object, compute its convex hull and select the nearest traversable points.
    • Path Point Generation & Navigation: Generate a path through the selected points and have the robot navigate along it to observe the object.
  • Viewpoint Selection: Evaluate which viewpoint $p \in \mathcal{P}$ maximizes grounding accuracy:

    $$\text{score}(p) = \lambda_\text{cov} \cdot \frac{\left|\mathcal{O}_p \setminus \mathcal{O}^\text{covered}\right|}{\left|\mathcal{O} \setminus \mathcal{O}^\text{covered}\right|} + \lambda_\text{area} \cdot \frac{\text{area}\left(\bigcup_{o\in\mathcal{O}_p}\text{bbox}(o)\right)}{\max_{q\in\mathcal{P}} \text{area}\left(\bigcup_{o\in\mathcal{O}_q}\text{bbox}(o)\right)},$$

    • where
      • $\mathcal{P}$ : Viewpoint set
      • $\mathcal{O}$ : Entire object set
      • $\mathcal{O}_p$ : Object set from viewpoint $p$
      • $\mathcal{O}^\text{covered}$ : Covered object set
      • $\text{bbox}(o) \subset \mathbb{R}^2$ : 2D bounding box of object $o$ within the image
      • $\text{area}(\cdot)$ : Area of the given bounding boxes
      • $\lambda_\text{cov}, \lambda_\text{area}$ : Weights for coverage and area
    • A greedy selection strategy iteratively picks the viewpoint with the highest score, balancing coverage (new objects) and area (visibility) to ensure complementary observations (see the scoring sketch after this list).
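
The three-level hierarchy above can be pictured with a few container types. The following is a minimal Python sketch, assuming 4x4 poses and 3D boxes stored as corner arrays; the class and field names (EnvironmentNode, KeyframeNode, ObjectNode) are illustrative rather than the framework's actual schema.

```python
# Minimal sketch of the environment -> keyframe -> object scene-graph hierarchy.
# Assumptions: poses are 4x4 homogeneous matrices, 3D boxes are (8, 3) corner arrays.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class ObjectNode:
    label: str
    bbox_3d: np.ndarray        # (8, 3) corners of the object's 3D bounding box

@dataclass
class KeyframeNode:
    timestamp: float
    pose: np.ndarray           # 4x4 camera pose at this time step
    image: np.ndarray          # observed RGB image
    objects: list[ObjectNode] = field(default_factory=list)   # objects linked to this keyframe

@dataclass
class EnvironmentNode:
    name: str                  # top-level context for the given environment
    keyframes: list[KeyframeNode] = field(default_factory=list)
```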
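
The hull-guided navigation step can be sketched as follows, assuming the object's detected points are projected onto the ground plane and a 2D occupancy grid (True = traversable) with known resolution and origin is available; `navigate_to` in the usage comment is a hypothetical planner interface, not part of the framework.

```python
# Sketch of Object-Hull-Guided path-point generation under the assumptions above.
import numpy as np
from scipy.spatial import ConvexHull

def ohg_path_points(object_points_xy, occ_grid, resolution, origin_xy):
    """Return traversable path points around an object's convex hull, in hull order."""
    hull = ConvexHull(object_points_xy)              # 2D hull of the object's footprint
    hull_vertices = object_points_xy[hull.vertices]  # hull corners, counterclockwise

    # World coordinates of all traversable grid cells (row = y index, col = x index).
    free_cells = np.argwhere(occ_grid)
    free_xy = free_cells[:, ::-1] * resolution + origin_xy

    # For each hull vertex, select the nearest traversable point.
    path_points = [free_xy[np.argmin(np.linalg.norm(free_xy - v, axis=1))]
                   for v in hull_vertices]
    return np.asarray(path_points)

# Usage (hypothetical planner interface):
# for p in ohg_path_points(obj_xy, grid, 0.05, np.array([0.0, 0.0])):
#     navigate_to(p)   # observe the candidate object from each path point
```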
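
The viewpoint score and the greedy loop can be sketched as below, assuming each viewpoint maps to the set of objects it sees and to their 2D image bounding boxes (x_min, y_min, x_max, y_max); the weights and the `views` dictionary layout are illustrative, and the union area is computed with Shapely for brevity.

```python
# Sketch of the coverage + area viewpoint score and greedy viewpoint selection.
# views[p] = {"objects": set of object ids seen from p,
#             "bboxes": list of (x_min, y_min, x_max, y_max) image boxes}
from shapely.geometry import box
from shapely.ops import unary_union

LAMBDA_COV, LAMBDA_AREA = 0.5, 0.5     # illustrative weights for the two terms

def union_area(bboxes):
    """Area of the union of axis-aligned 2D bounding boxes."""
    return unary_union([box(*b) for b in bboxes]).area if bboxes else 0.0

def score(p, views, all_objects, covered):
    """Coverage of newly seen objects plus normalized visible bounding-box area."""
    remaining = all_objects - covered
    cov_term = len(views[p]["objects"] - covered) / len(remaining) if remaining else 0.0

    max_area = max(union_area(v["bboxes"]) for v in views.values()) or 1.0
    area_term = union_area(views[p]["bboxes"]) / max_area
    return LAMBDA_COV * cov_term + LAMBDA_AREA * area_term

def greedy_select(views, all_objects):
    """Iteratively pick the highest-scoring viewpoint until every object is covered."""
    covered, selected = set(), []
    while covered != all_objects and len(selected) < len(views):
        best = max((p for p in views if p not in selected),
                   key=lambda p: score(p, views, all_objects, covered))
        selected.append(best)
        covered |= views[best]["objects"]
    return selected
```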

Comparative Study

| Method | Success Rate (%) ↑ | Avg. Time (s)† | Timeout Cases (#/7) |
|---|---|---|---|
| Baseline | 14.3 | 27.0 | 1 |
| ActiveGrounder (Ours) | 85.7 | 125.9 | 0 |

† Avg. time excludes timeout cases.
  • Baseline: Immediately predicts once the target object is detected, leading to a faster completion time but very poor accuracy.
  • ActiveGrounder (Ours): Actively explores the object's surroundings before answering, resulting in a much higher success rate despite a longer completion time.

Qualitative Evaluation

Simulation Environment

Fig. 2 - Qualitative evaluation

Future Work

  • Real-World Experiments: Validate framework performance in diverse physical environments.
  • Viewpoint Selection under Occlusion: Identify the most informative keyframes for LVLM queries, considering occlusions.
  • Beyond FoV Spatial Reasoning: Enhance LVLMs to capture global context beyond limited field of view.