# Enable Voice Mode
Upgrade your agent from text-only to real-time voice interaction.
## Overview
Voice mode allows users to talk to your agent naturally, with real-time audio streaming, interruptions, and low latency. HUMA handles the complex orchestration of Speech-to-Text (STT), LLM processing, and Text-to-Speech (TTS).
- **Listen (STT):** We use Deepgram for ultra-fast transcription. The user's audio is transcribed in real time and sent to the agent as text events.
- **Speak (TTS):** We use ElevenLabs for high-quality voice generation. The agent's response is streamed back as audio chunks for instant playback.
## How it Works
Unlike standard text chat (a request/response REST API), voice mode requires a persistent real-time connection to stream audio data in both directions.
### Voice Architecture
We use Daily.co as the real-time transport layer. Think of it as a "voice room" where the user and the agent meet.
1. **Create Room:** Your backend asks HUMA to create a voice room.
2. **Get Token:** HUMA returns a room URL and a secure token for the user.
3. **Join Room:** The client app uses the token to join the Daily room (see the sketch below).
4. **Agent Joins:** The HUMA agent automatically joins the same room and starts listening.
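Put together, the client side of this flow is only a few lines. A minimal sketch, assuming your backend exposes a `/api/voice/room` endpoint (the path and response field names are illustrative, not part of the HUMA API):

```ts
import DailyIframe from '@daily-co/daily-js';

// Steps 1-2: ask your backend for a room. It calls HUMA and relays
// the room URL and token (endpoint path and field names assumed).
const res = await fetch('/api/voice/room', { method: 'POST' });
const { roomUrl, roomToken } = await res.json();

// Step 3: join the Daily room. The HUMA agent joins on its own (step 4).
const callFrame = DailyIframe.createCallObject();
await callFrame.join({ url: roomUrl, token: roomToken });
```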
## Setup

### 1. Configure Voice Provider
In your agent metadata, specify the voice ID (ElevenLabs ID).
```ts
const metadata = {
  className: 'Finn',
  // ...
  voiceId: 'eleven_labs_voice_id_here', // e.g. 'JBFqnCBsd6RMkjVDRZzb'
  voiceProvider: 'elevenlabs', // Optional; defaults to 'elevenlabs'
};
```

### 2. Create Room Endpoint
Your backend needs to call the HUMA API to create a room.
Send a `POST` request to `https://api.humalike.ai/v1/voice/room`:

```json
{
  "agentId": "finn-123",
  "userId": "user-456",
  "metadata": { "gameId": "game-789" }
}
```
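On your backend, creating the room is a single authenticated request. A minimal Express-style sketch; the `Authorization` header scheme and the field names in HUMA's response (`url`, `token`) are assumptions to verify against the API reference:

```ts
import express from 'express';

const app = express();
app.use(express.json());

app.post('/api/voice/room', async (req, res) => {
  const humaRes = await fetch('https://api.humalike.ai/v1/voice/room', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      // Assumed auth scheme -- check your HUMA API reference.
      Authorization: `Bearer ${process.env.HUMA_API_KEY}`,
    },
    body: JSON.stringify({
      agentId: 'finn-123',
      userId: req.body.userId,
      metadata: { gameId: req.body.gameId },
    }),
  });

  // Assumed response shape: { url, token }.
  const { url, token } = await humaRes.json();
  res.json({ roomUrl: url, roomToken: token });
});
```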
### 3. Join Room (Frontend)

Use the `daily-js` library to join the room with the returned URL and token.
```ts
import DailyIframe from '@daily-co/daily-js';

const callFrame = DailyIframe.createCallObject();

await callFrame.join({
  url: roomUrl,
  token: roomToken,
});

// Handle events
callFrame.on('active-speaker-change', (e) => {
  console.log('Active speaker:', e.activeSpeaker.peerId);
});
```

### Daily.co Abstraction
You don't need to manage Deepgram or ElevenLabs API keys directly on the client. HUMA handles all that server-side. You just join the Daily room.
## Voice Lifecycle
Understanding the voice lifecycle is crucial for UI feedback.
| State | Description | UI Hint |
|---|---|---|
| Connecting | Joining the Daily room | Spinner / "Connecting..." |
| Connected | In room, waiting for agent | "Waiting for Finn..." |
| Agent Joined | Agent is in the room | Show Agent Avatar |
| Listening | User is speaking (VAD active) | Mic wave animation |
| Thinking | Processing response | Pulse animation |
| Speaking | Agent audio playing | Agent mouth movement |
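Wiring these states into your UI might look like the sketch below, continuing from the `callFrame` created earlier. `joined-meeting` and `participant-joined` are real daily-js events; how HUMA broadcasts the Listening/Thinking/Speaking transitions is covered in the Voice Lifecycle docs, so the `app-message` payload with a `state` field shown here is a hypothetical placeholder:

```ts
type VoiceState =
  | 'connecting' | 'connected' | 'agent-joined'
  | 'listening' | 'thinking' | 'speaking';

const render = (s: VoiceState) => console.log('UI state:', s); // your UI hook

let state: VoiceState = 'connecting';
const setState = (s: VoiceState) => { state = s; render(state); };

callFrame.on('joined-meeting', () => setState('connected'));

// Detect the agent entering the room ("Agent Joined").
callFrame.on('participant-joined', (e) => {
  if (!e.participant.local) setState('agent-joined');
});

// Hypothetical: assumes HUMA relays state changes as app-messages
// with a `state` field. Verify the real payload shape before relying on it.
callFrame.on('app-message', (e) => {
  const s = e.data?.state;
  if (s === 'listening' || s === 'thinking' || s === 'speaking') setState(s);
});
```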
## Common Pitfalls

### Echo / Feedback Loops
Always use headphones during development. If the mic picks up the agent's voice, it can create a loop. In production, use echo cancellation (enabled by default in daily-js).
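Note that the default echo cancellation only applies to tracks daily-js manages for you. If you supply a custom microphone track, request it explicitly via standard `getUserMedia` constraints; whether you need a custom track at all depends on your app:

```ts
// Only needed when bypassing daily-js's default mic handling;
// the defaults already enable echo cancellation.
const stream = await navigator.mediaDevices.getUserMedia({
  audio: { echoCancellation: true, noiseSuppression: true },
});

const callFrame = DailyIframe.createCallObject({
  audioSource: stream.getAudioTracks()[0],
});
```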
### Latency Perception
Even with fast models, expect roughly 500 ms to 1 s of latency before the agent starts speaking. Use "thinking" UI states (animations, status text) to keep the user engaged while waiting.
### Permissions
Browsers require explicit microphone permission. Ensure your app requests it before attempting to join the room.
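A common pattern is to prompt for the microphone up front so the permission dialog never interrupts the join:

```ts
// Surface permission problems before connecting.
try {
  await navigator.mediaDevices.getUserMedia({ audio: true });
} catch (err) {
  showMicPermissionHelp(); // hypothetical UI helper in your app
  throw err;
}

await callFrame.join({ url: roomUrl, token: roomToken });
```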
## Summary

### Key Takeaways
- Voice mode enables real-time, low-latency conversation
- Uses Daily.co for transport (WebRTC)
- The agent handles STT (Deepgram) and TTS (ElevenLabs)
- Requires specific "voice" router strategies for best results
### For Go Fish
In a voice-enabled Go Fish game:
- Users can say "Do you have any 7s?"
- Agents respond verbally: "Go Fish!"
- Game logic still runs via tool calls in the background
## Next Steps
Now that voice is enabled, let's look at the detailed Voice Lifecycle events to build a rich UI.
**Next:** Voice Lifecycle