Voice-to-Action Pipeline

End-to-End Architecture

The Voice-to-Action pipeline integrates speech recognition, LLM-based planning, perception, and robot control into a single autonomous workflow.

Voice Command → Speech-to-Text → LLM Task Planner → Motion Primitives → ROS 2 Actions → Robot Execution

Speech-to-Text Integration

import whisper

model = whisper.load_model("base")

def transcribe_command(audio_path: str) -> str:
    """Convert spoken command to text using Whisper."""
    result = model.transcribe(audio_path)
    return result["text"]

Whisper handles multiple languages and noisy environments, making it suitable for real-world robot deployments where background noise is common.

LLM-Based Task Planning

The LLM receives the transcribed command along with the robot's current state and available capabilities:

PLANNER_PROMPT = """
You are a robot task planner. Given a natural language command,
break it down into a sequence of executable actions.

Available actions:
- navigate_to(location)
- pick_up(object)
- place_at(location)
- open(object)
- close(object)
- wait(seconds)

Current state: {robot_state}
Command: {command}

Return a JSON array of actions.
"""

Action Execution via ROS 2

Each planned action maps to a ROS 2 Action Server:

from rclpy.action import ActionClient
from nav2_msgs.action import NavigateToPose

class NavigationClient(Node):
    def __init__(self):
        super().__init__('navigation_client')
        self._action_client = ActionClient(
            self, NavigateToPose, 'navigate_to_pose'
        )

    def send_goal(self, x, y, theta):
        goal_msg = NavigateToPose.Goal()
        goal_msg.pose.pose.position.x = x
        goal_msg.pose.pose.position.y = y
        self._action_client.send_goal_async(goal_msg)

Perception Pipeline

The perception system runs continuously, updating the robot's world model:

Object Detection (YOLO/DINOv2) identifies objects in the scene
Depth Estimation from RGB-D cameras provides spatial coordinates
Scene Graph maintains a structured representation of detected objects and their relationships
Grasp Planning determines optimal grasp poses for target objects

Error Recovery

Real-world execution rarely follows the plan perfectly. The system must handle:

Failed grasps: Re-attempt with adjusted grasp pose
Blocked paths: Re-plan navigation around obstacles
Unrecognized objects: Ask for human clarification
Timeout: Abort current action and report failure

End-to-End Architecture​

Speech-to-Text Integration​

LLM-Based Task Planning​

Action Execution via ROS 2​

Perception Pipeline​

Error Recovery​