Capstone Project
Objective
Build an autonomous humanoid robot that receives a voice command, navigates obstacles, identifies a target object, and manipulates it. This project integrates every module from the textbook into a single working pipeline.
The Integration Pipeline
Voice Command → Plan → Navigate → Perceive → Manipulate
Stage 1: Voice Input
The user speaks a command (e.g., "Bring me the red cup from the table"). Whisper converts audio to text.
Stage 2: Task Planning
The LLM decomposes the command into an ordered sequence of actions:
plan = [
{"action": "navigate_to", "target": "table"},
{"action": "detect_object", "target": "red cup"},
{"action": "grasp", "target": "red cup"},
{"action": "navigate_to", "target": "user_location"},
{"action": "release", "target": "red cup"}
]
Stage 3: Navigation
Using Nav2 and the environment map built by vSLAM/Nvblox, the robot plans a collision-free path to the table.
Stage 4: Perception
NVIDIA Isaac ROS perception pipeline:
- RGB-D camera captures the scene
- Object detection model identifies the red cup
- Depth data provides the 3D coordinates of the cup
- Grasp planner computes the optimal grasp pose
Stage 5: Manipulation
Inverse Kinematics computes the joint angles needed to reach the cup. The arm controller executes the grasp, confirms contact via force sensors, and lifts the object.
Architecture Summary
| Component | Module | Technology |
|---|---|---|
| Input | Voice Command | Whisper (Speech-to-Text) |
| Planning | Cognitive Pipeline | LLM (GPT-4/Gemini) |
| Communication | Robot Middleware | ROS 2 (Nodes, Topics, Actions) |
| Simulation | Digital Twin | Gazebo + Isaac Sim |
| Perception | Visual Understanding | Isaac ROS (YOLO/DINOv2) |
| Navigation | Path Planning | Nav2 Stack |
| Manipulation | Arm Control | Inverse Kinematics |
Module Connections
Each module in this textbook builds toward this capstone:
- Part 1 (Introduction): Understanding the Physical AI landscape
- Part 2 (Humanoid Robotics): Robot kinematics, URDF, and HRI fundamentals
- Part 3 (ROS 2): The communication backbone connecting all subsystems
- Part 4 (Digital Twin): Simulation environment for safe testing and training
- Part 5 (VLA): The cognitive pipeline from language to action
The capstone is not a separate skill — it is the natural result of mastering Parts 1 through 5.
Evaluation Criteria
A successful capstone demonstrates:
- End-to-end execution: Voice command results in completed physical task
- Robustness: System handles at least one failure mode gracefully (e.g., blocked path, missed grasp)
- Modularity: Each subsystem can be tested independently via ROS 2 interfaces
- Reproducibility: The same command produces consistent results in simulation