# Enable Voice Mode
Upgrade your agent from text-only to real-time voice interaction.
## Overview
Voice mode allows users to talk to your agent naturally, with real-time audio streaming, interruptions, and low latency. HUMA handles the complex orchestration of Speech-to-Text (STT), LLM processing, and Text-to-Speech (TTS).
- **Listen (STT):** We use Deepgram for ultra-fast transcription. The user's audio is transcribed in real time and sent to the agent as text events.
- **Speak (TTS):** We use ElevenLabs for high-quality voice generation. The agent's response is streamed back as audio chunks for instant playback.
## How it Works
Unlike standard text chat (a request/response REST API), voice mode requires a persistent real-time connection to stream audio data in both directions.
### Voice Architecture
We use Daily.co as the real-time transport layer. Think of it as a "voice room" where the user and the agent meet.
1. **Create Room:** Your backend asks HUMA to create a voice room.
2. **Get Token:** HUMA returns a room URL and a secure token for the user.
3. **Join Room:** The client app uses the token to join the Daily room (see the sketch below).
4. **Agent Joins:** The HUMA agent automatically joins the same room and starts listening.
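Put together, the client side of this flow is only a few lines. A minimal sketch, assuming your backend exposes a `/api/voice/room` endpoint (the path and response field names are illustrative, not part of the HUMA API):

```ts
import DailyIframe from '@daily-co/daily-js';

// Steps 1-2: ask your backend for a room. It calls HUMA and relays
// the room URL and token (endpoint path and field names assumed).
const res = await fetch('/api/voice/room', { method: 'POST' });
const { roomUrl, roomToken } = await res.json();

// Step 3: join the Daily room. The HUMA agent joins on its own (step 4).
const callFrame = DailyIframe.createCallObject();
await callFrame.join({ url: roomUrl, token: roomToken });
```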
## Setup

### 1. Configure Voice Provider
In your agent metadata, specify the voice ID (ElevenLabs ID).
```ts
const metadata = {
  className: 'Finn',
  // ...
  voiceId: 'eleven_labs_voice_id_here', // e.g. 'JBFqnCBsd6RMkjVDRZzb'
  voiceProvider: 'elevenlabs', // Optional; defaults to 'elevenlabs'
};
```

### 2. Create Room Endpoint
Your backend needs to call the HUMA API to create a room.
Send a `POST` request to `https://api.humalike.ai/v1/voice/room`:

```json
{
  "agentId": "finn-123",
  "userId": "user-456",
  "metadata": { "gameId": "game-789" }
}
```
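On your backend, creating the room is a single authenticated request. A minimal Express-style sketch; the `Authorization` header scheme and the field names in HUMA's response (`url`, `token`) are assumptions to verify against the API reference:

```ts
import express from 'express';

const app = express();
app.use(express.json());

app.post('/api/voice/room', async (req, res) => {
  const humaRes = await fetch('https://api.humalike.ai/v1/voice/room', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      // Assumed auth scheme -- check your HUMA API reference.
      Authorization: `Bearer ${process.env.HUMA_API_KEY}`,
    },
    body: JSON.stringify({
      agentId: 'finn-123',
      userId: req.body.userId,
      metadata: { gameId: req.body.gameId },
    }),
  });

  // Assumed response shape: { url, token }.
  const { url, token } = await humaRes.json();
  res.json({ roomUrl: url, roomToken: token });
});
```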
### 3. Join Room (Frontend)

Use the `daily-js` library to join the room with the returned URL and token.
```ts
import DailyIframe from '@daily-co/daily-js';

const callFrame = DailyIframe.createCallObject();

await callFrame.join({
  url: roomUrl,
  token: roomToken,
});

// Handle events
callFrame.on('active-speaker-change', (e) => {
  console.log('Active speaker:', e.activeSpeaker.peerId);
});
```

### Daily.co Abstraction
You don't need to manage Deepgram or ElevenLabs API keys directly on the client. HUMA handles all that server-side. You just join the Daily room.
## Voice Lifecycle
Understanding the voice lifecycle is crucial for UI feedback.
| State | Description | UI Hint |
|---|---|---|
| Connecting | Joining the Daily room | Spinner / "Connecting..." |
| Connected | In room, waiting for agent | "Waiting for Finn..." |
| Agent Joined | Agent is in the room | Show Agent Avatar |
| Listening | User is speaking (VAD active) | Mic wave animation |
| Thinking | Processing response | Pulse animation |
| Speaking | Agent audio playing | Agent mouth movement |
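Wiring these states into your UI might look like the sketch below, continuing from the `callFrame` created earlier. `joined-meeting` and `participant-joined` are real daily-js events; how HUMA broadcasts the Listening/Thinking/Speaking transitions is covered in the Voice Lifecycle docs, so the `app-message` payload with a `state` field shown here is a hypothetical placeholder:

```ts
type VoiceState =
  | 'connecting' | 'connected' | 'agent-joined'
  | 'listening' | 'thinking' | 'speaking';

const render = (s: VoiceState) => console.log('UI state:', s); // your UI hook

let state: VoiceState = 'connecting';
const setState = (s: VoiceState) => { state = s; render(state); };

callFrame.on('joined-meeting', () => setState('connected'));

// Detect the agent entering the room ("Agent Joined").
callFrame.on('participant-joined', (e) => {
  if (!e.participant.local) setState('agent-joined');
});

// Hypothetical: assumes HUMA relays state changes as app-messages
// with a `state` field. Verify the real payload shape before relying on it.
callFrame.on('app-message', (e) => {
  const s = e.data?.state;
  if (s === 'listening' || s === 'thinking' || s === 'speaking') setState(s);
});
```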
## Common Pitfalls

### Echo / Feedback Loops
Always use headphones during development. If the mic picks up the agent's voice, it can create a loop. In production, use echo cancellation (enabled by default in daily-js).
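Note that the default echo cancellation only applies to tracks daily-js manages for you. If you supply a custom microphone track, request it explicitly via standard `getUserMedia` constraints; whether you need a custom track at all depends on your app:

```ts
// Only needed when bypassing daily-js's default mic handling;
// the defaults already enable echo cancellation.
const stream = await navigator.mediaDevices.getUserMedia({
  audio: { echoCancellation: true, noiseSuppression: true },
});

const callFrame = DailyIframe.createCallObject({
  audioSource: stream.getAudioTracks()[0],
});
```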
### Latency Perception
Even with fast models, expect roughly 500 ms to 1 s of latency before the agent starts speaking. Use "thinking" UI states (animations, status text) to keep the user engaged while waiting.
### Permissions
Browsers require explicit microphone permission. Ensure your app requests it before attempting to join the room.
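A common pattern is to prompt for the microphone up front so the permission dialog never interrupts the join:

```ts
// Surface permission problems before connecting.
try {
  await navigator.mediaDevices.getUserMedia({ audio: true });
} catch (err) {
  showMicPermissionHelp(); // hypothetical UI helper in your app
  throw err;
}

await callFrame.join({ url: roomUrl, token: roomToken });
```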
## Summary

### Key Takeaways
- Voice mode enables real-time, low-latency conversation
- Uses Daily.co for transport (WebRTC)
- The agent handles STT (Deepgram) and TTS (ElevenLabs)
- Requires specific "voice" router strategies for best results
### For Go Fish
In a voice-enabled Go Fish game:
- Users can say "Do you have any 7s?"
- Agents respond verbally: "Go Fish!"
- Game logic still runs via tool calls in the background
## Next Steps
Now that voice is enabled, let's look at the detailed Voice Lifecycle events to build a rich UI.
**Next:** Voice Lifecycle