Integration Guide · Phase 7

Enable Voice Mode

Upgrade your agent from text-only to real-time voice interaction.

Overview

Voice mode allows users to talk to your agent naturally, with real-time audio streaming, interruptions, and low latency. HUMA handles the complex orchestration of Speech-to-Text (STT), LLM processing, and Text-to-Speech (TTS).

Listen (STT)

We use Deepgram for ultra-fast transcription. The user's audio is transcribed in real time and sent to the agent as text events.

Speak (TTS)

We use ElevenLabs for high-quality voice generation. The agent's response is streamed back as audio chunks for instant playback.

How it Works

Unlike standard text chat (REST API), voice mode requires a persistent WebSocket connection to stream audio data.

[Architecture diagram] Client App (microphone / speaker) ⇄ WebSocket ⇄ HUMA Server (orchestrator) ⇄ Voice Providers (Daily / Deepgram / ElevenLabs)

Voice Architecture

We use Daily.co as the real-time transport layer. Think of it as a "voice room" where the user and the agent meet.

1. Create Room: Your backend asks HUMA to create a voice room.

2. Get Token: HUMA returns a room URL and a secure token for the user.

3. Join Room: The client app uses the token to join the Daily room.

4. Agent Joins: The HUMA agent automatically joins the same room and starts listening.

Setup

1. Configure Voice Provider

In your agent metadata, specify the voice ID (an ElevenLabs voice ID).

Agent Metadata
const metadata = {
  className: 'Finn',
  // ...
  voiceId: 'eleven_labs_voice_id_here', // e.g. 'JBFqnCBsd6RMkjVDRZzb'
  voiceProvider: 'elevenlabs', // Optional, defaults to elevenlabs
};

2. Create Room Endpoint

Your backend needs to call the HUMA API to create a room.

Backend API Call
// POST https://api.humalike.ai/v1/voice/room
{
  "agentId": "finn-123",
  "userId": "user-456",
  "metadata": { "gameId": "game-789" }
}
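
For reference, here is a minimal sketch of making that call from a Node backend. The Authorization header and the response field names (roomUrl, roomToken) are assumptions; consult the HUMA API reference for the exact contract.

Backend Code
// Hypothetical helper: create a voice room and return its credentials.
async function createVoiceRoom(agentId, userId, gameId) {
  const res = await fetch('https://api.humalike.ai/v1/voice/room', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      // Assumed auth scheme; substitute your real HUMA credentials.
      Authorization: `Bearer ${process.env.HUMA_API_KEY}`,
    },
    body: JSON.stringify({ agentId, userId, metadata: { gameId } }),
  });
  if (!res.ok) throw new Error(`Room creation failed: ${res.status}`);
  return res.json(); // assumed response shape: { roomUrl, roomToken }
}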

3. Join Room (Frontend)

Use the daily-js library to join the room using the returned URL and token.

Frontend Code
import DailyIframe from '@daily-co/daily-js';

const callFrame = DailyIframe.createCallObject();

await callFrame.join({
  url: roomUrl,
  token: roomToken,
});

// Fires when the loudest speaker in the room changes (user or agent)
callFrame.on('active-speaker-change', (e) => {
  console.log('Active speaker:', e.activeSpeaker.peerId);
});
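
When the conversation ends, leave the room and dispose of the call object so the microphone and event listeners are released:

Cleanup
// End the session: disconnect from the room and release devices.
async function endCall() {
  await callFrame.leave();
  callFrame.destroy();
}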

Daily.co Abstraction

You don't need to manage Deepgram or ElevenLabs API keys directly on the client. HUMA handles all that server-side. You just join the Daily room.

Voice Lifecycle

Understanding the voice lifecycle is crucial for UI feedback.

State          Description                      UI Hint
Connecting     Joining the Daily room           Spinner / "Connecting..."
Connected      In room, waiting for agent       "Waiting for Finn..."
Agent Joined   Agent is in the room             Show Agent Avatar
Listening      User is speaking (VAD active)    Mic wave animation
Thinking       Processing response              Pulse animation
Speaking       Agent audio playing              Agent mouth movement
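
To drive those UI hints, you can fold call events into a single state variable. A minimal sketch follows, assuming the agent reports its listening/thinking/speaking state via Daily app-messages carrying a data.state field (a hypothetical shape; check how your HUMA agent actually signals these):

Lifecycle Sketch
let voiceState = 'connecting';

function setVoiceState(state) {
  voiceState = state;
  // Update your UI here (spinner, avatar, animations, ...)
}

// Local user has joined the Daily room.
callFrame.on('joined-meeting', () => setVoiceState('connected'));

// A remote participant joining is assumed to be the agent.
callFrame.on('participant-joined', (e) => {
  if (!e.participant.local) setVoiceState('agent-joined');
});

// Assumed: the agent broadcasts its state as { state: '...' }.
callFrame.on('app-message', (e) => {
  const state = e.data?.state; // hypothetical field
  if (['listening', 'thinking', 'speaking'].includes(state)) {
    setVoiceState(state);
  }
});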

Common Pitfalls

Echo / Feedback Loops

Always use headphones during development. If the mic picks up the agent's voice, it can create a loop. In production, use echo cancellation (enabled by default in daily-js).

Latency Perception

Even with fast models, expect roughly 500 ms to 1 s of end-to-end latency. Use "thinking" UI states (animations, status text) to keep the user engaged while waiting.

Permissions

Browsers require explicit microphone permission. Ensure your app requests it before attempting to join the room.
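
One way to do that is a quick getUserMedia check before joining, so the permission prompt appears at a predictable point in your UI flow:

Permission Check
// Prompt for microphone access before joining the room.
async function ensureMicPermission() {
  try {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    // Release the test stream; daily-js opens its own on join.
    stream.getTracks().forEach((track) => track.stop());
    return true;
  } catch (err) {
    console.warn('Microphone permission denied:', err);
    return false;
  }
}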

Summary

Key Takeaways

  • Voice mode enables real-time, low-latency conversation
  • Uses Daily.co for transport (WebRTC)
  • Agent handles STT (Deepgram) and TTS (ElevenLabs)
  • Requires specific "voice" router strategies for best results

For Go Fish

In a voice-enabled Go Fish game:

  • Users can say "Do you have any 7s?"
  • Agents respond verbally "Go Fish!"
  • Game logic still runs via tool calls in the background

Next Steps

Now that voice is enabled, let's look at the detailed Voice Lifecycle events to build a rich UI.

Next: Voice Lifecycle