ActiveGrounder: 3D Visual Grounding
with Object-Hull-Guided Active Observation

Humanoids'25 Workshop "Bridging Humanoid Robotics and Foundation Models Workshop"
¹Urban Robotics Lab · *Equal contribution · †Corresponding author

Abstract

We present ActiveGrounder, a framework that transforms 3D visual grounding from a passive recognition task into an active exploration paradigm. Unlike existing methods that rely on static maps or single-image perception, ActiveGrounder integrates scene-graph-based mapping with object-hull-guided navigation to actively acquire informative viewpoints. Through experiments, we demonstrate that ActiveGrounder achieves more accurate and reliable grounding than passive baselines, offering a step toward embodied agents capable of active perception and grounding in the real world.

ActiveGrounder

Motivation

  • Task: 3D visual grounding interprets language queries in 3D scenes and grounds them to specific objects or regions in space.
  • Problem: Existing 3D visual grounding remains passive because it is not integrated with exploration.
    • Existing 3D visual grounding methods assume static pre-built maps → unsuitable for dynamic environments.
    • Existing image-based approaches operate without maps but are limited to 2D object representations and lack action generation → over-reliance on passive, single-image perception.
  • Goal: Enable embodied agents to actively perceive and ground visual queries in real-world 3D environments.
    • We propose ActiveGrounder, integrating active perception and 3D visual grounding to achieve robust grounding in dynamic environments.

Contribution

  • We introduce ActiveGrounder, shifting grounding from a passive recognition task to an active exploration paradigm that tightly integrates grounding and action.
  • We leverage a scene graph to maintain 3D object representations; the agent actively navigates around object hulls to observe them and selects optimal viewpoints for visual grounding.
  • We empirically validate our framework on simulation benchmarks and in real-world experiments, demonstrating robustness to perception errors and more effective visual grounding than existing passive methods.

Methodology

Fig. 1 - Framework
  • Scene Graph: Maintain an object-level 3D scene representation (see the data-structure sketch after this list).
    • Environment Level: Represents the given environment; the top-level context.
    • Keyframe Level: Stores the observed images and poses at specific time steps.
    • Object Level: Linked to each keyframe; contains the observed objects represented as 3D bounding boxes.
  • Object-Hull-Guided (OHG) Observation: When candidate objects are detected, actively adjust the viewpoint for better observation (see the hull sketch after this list).
    • Convex Hull & Traversable Points: For each object, compute its convex hull and select the nearest traversable points.
    • Path Point Generation & Navigation: Generate a path through the selected points and have the robot navigate along it to observe the object.
  • Viewpoint Selection: Evaluate which viewpoint $p \in \mathcal{P}$ maximizes grounding accuracy:

    $$\text{score}(p) = \lambda_\text{cov} \cdot \frac{\left|\mathcal{O}_p \setminus \mathcal{O}^\text{covered}\right|}{\left|\mathcal{O} \setminus \mathcal{O}^\text{covered}\right|} + \lambda_\text{area} \cdot \frac{\text{area}\left(\bigcup_{o\in\mathcal{O}_p}\text{bbox}(o)\right)}{\max_{q\in\mathcal{P}} \text{area}\left(\bigcup_{o\in\mathcal{O}_q}\text{bbox}(o)\right)},$$

    • where
      • $\mathcal{P}$ : Viewpoint set
      • $\mathcal{O}$ : Entire object set
      • $\mathcal{O}_p$ : Object set from viewpoint $p$
      • $\mathcal{O}^\text{covered}$ : Covered object set
      • $\text{bbox}(o) \subset \mathbb{R}^2$ : 2D bounding box of object $o$ within the image
      • $\text{area}(\cdot)$ : Area of the given bounding boxes
      • $\lambda_\text{cov}, \lambda_\text{area}$ : Weights for coverage and area
    • A greedy selection strategy iteratively picks the viewpoint with the highest score, balancing coverage (new objects) and area (visibility) to ensure complementary observations (see the scoring sketch after this list).
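
The three-level hierarchy above can be pictured with a few container types. The following is a minimal Python sketch, assuming 4x4 poses and 3D boxes stored as corner arrays; the class and field names (EnvironmentNode, KeyframeNode, ObjectNode) are illustrative rather than the framework's actual schema.

```python
# Minimal sketch of the environment -> keyframe -> object scene-graph hierarchy.
# Assumptions: poses are 4x4 homogeneous matrices, 3D boxes are (8, 3) corner arrays.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class ObjectNode:
    label: str
    bbox_3d: np.ndarray        # (8, 3) corners of the object's 3D bounding box

@dataclass
class KeyframeNode:
    timestamp: float
    pose: np.ndarray           # 4x4 camera pose at this time step
    image: np.ndarray          # observed RGB image
    objects: list[ObjectNode] = field(default_factory=list)   # objects linked to this keyframe

@dataclass
class EnvironmentNode:
    name: str                  # top-level context for the given environment
    keyframes: list[KeyframeNode] = field(default_factory=list)
```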
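
The hull-guided navigation step can be sketched as follows, assuming the object's detected points are projected onto the ground plane and a 2D occupancy grid (True = traversable) with known resolution and origin is available; `navigate_to` in the usage comment is a hypothetical planner interface, not part of the framework.

```python
# Sketch of Object-Hull-Guided path-point generation under the assumptions above.
import numpy as np
from scipy.spatial import ConvexHull

def ohg_path_points(object_points_xy, occ_grid, resolution, origin_xy):
    """Return traversable path points around an object's convex hull, in hull order."""
    hull = ConvexHull(object_points_xy)              # 2D hull of the object's footprint
    hull_vertices = object_points_xy[hull.vertices]  # hull corners, counterclockwise

    # World coordinates of all traversable grid cells (row = y index, col = x index).
    free_cells = np.argwhere(occ_grid)
    free_xy = free_cells[:, ::-1] * resolution + origin_xy

    # For each hull vertex, select the nearest traversable point.
    path_points = [free_xy[np.argmin(np.linalg.norm(free_xy - v, axis=1))]
                   for v in hull_vertices]
    return np.asarray(path_points)

# Usage (hypothetical planner interface):
# for p in ohg_path_points(obj_xy, grid, 0.05, np.array([0.0, 0.0])):
#     navigate_to(p)   # observe the candidate object from each path point
```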
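
The viewpoint score and the greedy loop can be sketched as below, assuming each viewpoint maps to the set of objects it sees and to their 2D image bounding boxes (x_min, y_min, x_max, y_max); the weights and the `views` dictionary layout are illustrative, and the union area is computed with Shapely for brevity.

```python
# Sketch of the coverage + area viewpoint score and greedy viewpoint selection.
# views[p] = {"objects": set of object ids seen from p,
#             "bboxes": list of (x_min, y_min, x_max, y_max) image boxes}
from shapely.geometry import box
from shapely.ops import unary_union

LAMBDA_COV, LAMBDA_AREA = 0.5, 0.5     # illustrative weights for the two terms

def union_area(bboxes):
    """Area of the union of axis-aligned 2D bounding boxes."""
    return unary_union([box(*b) for b in bboxes]).area if bboxes else 0.0

def score(p, views, all_objects, covered):
    """Coverage of newly seen objects plus normalized visible bounding-box area."""
    remaining = all_objects - covered
    cov_term = len(views[p]["objects"] - covered) / len(remaining) if remaining else 0.0

    max_area = max(union_area(v["bboxes"]) for v in views.values()) or 1.0
    area_term = union_area(views[p]["bboxes"]) / max_area
    return LAMBDA_COV * cov_term + LAMBDA_AREA * area_term

def greedy_select(views, all_objects):
    """Iteratively pick the highest-scoring viewpoint until every object is covered."""
    covered, selected = set(), []
    while covered != all_objects and len(selected) < len(views):
        best = max((p for p in views if p not in selected),
                   key=lambda p: score(p, views, all_objects, covered))
        selected.append(best)
        covered |= views[best]["objects"]
    return selected
```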

Comparative Study

| Method | Success Rate (%) ↑ | Avg. Time (s)† | Timeout Cases (#/7) |
|---|---|---|---|
| Baseline | 14.3 | 27.0 | 1 |
| ActiveGrounder (Ours) | 85.7 | 125.9 | 0 |

† Avg. time excludes timeout cases.
  • Baseline: Immediately predicts once the target object is detected, leading to a faster completion time but very poor accuracy.
  • ActiveGrounder (Ours): Actively explores the object's surroundings before answering, resulting in a much higher success rate despite a longer completion time.

Qualitative Evaluation

Simulation Environment

Fig. 2 - Qualitative evaluation

Future Work

  • Real-World Experiments: Validate framework performance in diverse physical environments.
  • Viewpoint Selection under Occlusion: Identify the most informative keyframes for LVLM queries, considering occlusions.
  • Beyond FoV Spatial Reasoning: Enhance LVLMs to capture global context beyond limited field of view.