Skip to main content

Capstone Project

Objective

Build an autonomous humanoid robot that receives a voice command, navigates obstacles, identifies a target object, and manipulates it. This project integrates every module from the textbook into a single working pipeline.

The Integration Pipeline

Voice Command → Plan → Navigate → Perceive → Manipulate

Stage 1: Voice Input

The user speaks a command (e.g., "Bring me the red cup from the table"). Whisper converts audio to text.

Stage 2: Task Planning

The LLM decomposes the command into an ordered sequence of actions:

plan = [
{"action": "navigate_to", "target": "table"},
{"action": "detect_object", "target": "red cup"},
{"action": "grasp", "target": "red cup"},
{"action": "navigate_to", "target": "user_location"},
{"action": "release", "target": "red cup"}
]

Stage 3: Navigation

Using Nav2 and the environment map built by vSLAM/Nvblox, the robot plans a collision-free path to the table.

Stage 4: Perception

NVIDIA Isaac ROS perception pipeline:

  1. RGB-D camera captures the scene
  2. Object detection model identifies the red cup
  3. Depth data provides the 3D coordinates of the cup
  4. Grasp planner computes the optimal grasp pose

Stage 5: Manipulation

Inverse Kinematics computes the joint angles needed to reach the cup. The arm controller executes the grasp, confirms contact via force sensors, and lifts the object.

Architecture Summary

ComponentModuleTechnology
InputVoice CommandWhisper (Speech-to-Text)
PlanningCognitive PipelineLLM (GPT-4/Gemini)
CommunicationRobot MiddlewareROS 2 (Nodes, Topics, Actions)
SimulationDigital TwinGazebo + Isaac Sim
PerceptionVisual UnderstandingIsaac ROS (YOLO/DINOv2)
NavigationPath PlanningNav2 Stack
ManipulationArm ControlInverse Kinematics

Module Connections

Each module in this textbook builds toward this capstone:

  • Part 1 (Introduction): Understanding the Physical AI landscape
  • Part 2 (Humanoid Robotics): Robot kinematics, URDF, and HRI fundamentals
  • Part 3 (ROS 2): The communication backbone connecting all subsystems
  • Part 4 (Digital Twin): Simulation environment for safe testing and training
  • Part 5 (VLA): The cognitive pipeline from language to action

The capstone is not a separate skill — it is the natural result of mastering Parts 1 through 5.

Evaluation Criteria

A successful capstone demonstrates:

  1. End-to-end execution: Voice command results in completed physical task
  2. Robustness: System handles at least one failure mode gracefully (e.g., blocked path, missed grasp)
  3. Modularity: Each subsystem can be tested independently via ROS 2 interfaces
  4. Reproducibility: The same command produces consistent results in simulation