Voice Implementation

HUMA-0.1 Only

Overview

HUMA enables natural multi-party voice conversations that are a great fit for use cases like online debates, NPCs in video games, interview practice, and collaborative brainstorming.

Multi-Party

Multiple participants can speak simultaneously in the same room

Real-Time

Low-latency speech recognition and text-to-speech

Event-Driven

All voice events flow through the same WebSocket connection

Important: Currently the ONLY supported way of integrating with HUMA voice is through Daily.co voice rooms. You must have a Daily.co account to use voice features.

Critical: Voice must be enabled in the agent metadata when creating the agent. Without voice.enabled: true in your metadata, the join-daily-room command will fail silently. See the "Enable Voice in Metadata" section below.

Implementation Plan

1. Set Up Daily.co Voice Rooms

Integrate Daily.co voice rooms into your app or game. You'll need to create rooms, manage participants, and handle audio streams.

Visit Daily.co Developers for API documentation and SDKs.
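
As a sketch of what that server-side setup can look like, the following uses Daily.co's REST API to create a room and mint a meeting token (DAILY_API_KEY is your Daily.co API key; verify the exact request and response shapes against Daily's API reference). It also stands in for the createDailyRoom() helper used in the integration example later on:

// Sketch: create a Daily.co room and a meeting token server-side.
// Assumes DAILY_API_KEY is set in the environment.
const DAILY_API = 'https://api.daily.co/v1';

async function createDailyRoom() {
  const headers = {
    'Content-Type': 'application/json',
    'Authorization': `Bearer ${process.env.DAILY_API_KEY}`
  };

  // Create the room; Daily returns its url and name.
  const room = await fetch(`${DAILY_API}/rooms`, {
    method: 'POST',
    headers,
    body: JSON.stringify({ properties: { enable_chat: false } })
  }).then(r => r.json());

  // Mint a meeting token scoped to that room.
  const { token } = await fetch(`${DAILY_API}/meeting-tokens`, {
    method: 'POST',
    headers,
    body: JSON.stringify({ properties: { room_name: room.name } })
  }).then(r => r.json());

  return { url: room.url, token };
}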

2. Standard HUMA Integration

HUMA in voice mode is an extension of standard HUMA. Start by following the general implementation guide.

Read the Integration Guide

3. Orchestrate Agent Join/Leave

Control when agents join and leave voice chats based on your application logic. This could be triggered by user actions, game events, or automated rules.
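
A minimal sketch of such orchestration, driven by your own application hooks (onUserStartedCall and onCallEnded are hypothetical, not part of the HUMA API; roomUrl and roomToken come from your Daily.co setup):

// Hypothetical app hooks deciding when the agent joins and leaves.
onUserStartedCall(({ roomUrl, roomToken }) => {
  socket.emit('message', {
    type: 'join-daily-room',
    content: { roomUrl, roomToken }
  });
});

onCallEnded(() => {
  socket.emit('message', { type: 'leave-daily-room' });
});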

Enable Voice in Metadata

Required Step: Without this configuration, voice commands like join-daily-room will fail silently and the agent will not join the voice call.

When creating an agent that will use voice features, you must include the voice property in the agent metadata:

const agent = await fetch(`${API}/api/agents`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json', 'X-API-Key': API_KEY },
  body: JSON.stringify({
    name: 'Voice Agent',
    agentType: 'HUMA-0.1',
    metadata: {
      className: 'Assistant',
      personality: 'Friendly voice assistant.',
      instructions: 'Engage in natural conversation.',
      tools: [...],

      // ⚠️ REQUIRED FOR VOICE - Without this, join-daily-room will fail!
      voice: {
        enabled: true,                    // Must be true
        voiceId: 'EXAVITQu4vr4xnSDxMaL'   // ElevenLabs voice ID (optional)
      }
    }
  })
}).then(r => r.json());

voice.enabled

Required. Must be set to true to enable voice features. If missing or false, the agent cannot join voice rooms.

voice.voiceId

Optional. ElevenLabs voice ID for text-to-speech. If not provided, a default voice will be used.

Finding Voice IDs: Browse available voices at ElevenLabs Voice Library. The voice ID is the alphanumeric string in the voice URL.

Voice Lifecycle

Voice is a sub-phase within the Active state. The agent must be connected via WebSocket before joining a call.

Connected (no call)  --join-->  In Voice Call  --leave-->  Connected (no call)
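
One way to track that sub-phase on the client is to flip a flag on the room_joined / room_left events (inVoiceCall is an illustrative variable, not part of the API):

// Mirror the voice sub-phase from server events.
let inVoiceCall = false;

socket.on('event', (data) => {
  if (data.type === 'room_joined') inVoiceCall = true;
  if (data.type === 'room_left') inVoiceCall = false;
});
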
Joining a Call

Send a join-daily-room event with:

  • roomUrl - Daily.co room URL
  • roomToken - Authentication token

Leaving a Call

Send a leave-daily-room event.

Always leave the room before disconnecting the agent to ensure a clean teardown.
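
A sketch of that teardown order: send leave-daily-room, wait for the room_left confirmation, then disconnect (the 5-second timeout is an assumed fallback for the case where the event never arrives):

// Leave the voice room, wait for confirmation, then disconnect.
function shutdown(socket) {
  return new Promise((resolve) => {
    const finish = () => {
      socket.disconnect();
      resolve();
    };
    const timer = setTimeout(finish, 5000);  // fallback if room_left never arrives
    socket.on('event', (data) => {
      if (data.type === 'room_left') {
        clearTimeout(timer);
        finish();
      }
    });
    socket.emit('message', { type: 'leave-daily-room' });
  });
}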

Voice Events

Client → Server Events

Join Voice Room

socket.emit('message', {
  type: 'join-daily-room',
  content: {
    roomUrl: 'https://your-domain.daily.co/room-name',
    roomToken: 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...'
  }
});

Leave Voice Room

socket.emit('message', {
  type: 'leave-daily-room'
});

Server → Client Events

All server events are received on the event channel:

Event Type                 Description                                     Key Fields
room_joined                Agent successfully joined the call              roomUrl
room_left                  Agent left the call                             roomUrl
participant_joined         Someone joined the call                         participantId, userName
participant_left           Someone left the call                           participantId
transcript                 Speech-to-text from a participant               participantId, text, isFinal
speak                      Agent is speaking                               text
vad_start                  Voice activity started (someone is speaking)    participantId
vad_stop                   Voice activity stopped                          participantId
transcriber_fatal_error    Speech recognition failed permanently           error

Fatal Error: If you receive transcriber_fatal_error, the speech recognition has failed permanently. The agent will disconnect from the call. You should notify the user and optionally reconnect.
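
One possible recovery path, assuming your app can mint fresh Daily.co credentials (notifyUser is a hypothetical UI helper; createDailyRoom is sketched earlier):

// Sketch: surface the failure, then rejoin with fresh credentials.
socket.on('event', async (data) => {
  if (data.type !== 'transcriber_fatal_error') return;

  notifyUser('Voice recognition failed; reconnecting the agent...');

  // The agent has already left the call, so rejoin with a new room/token.
  const { url, token } = await createDailyRoom();
  socket.emit('message', {
    type: 'join-daily-room',
    content: { roomUrl: url, roomToken: token }
  });
});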

Example Integration

import { io } from 'socket.io-client';

const API = 'https://api.humalike.tech';
const API_KEY = 'ak_your_api_key';

// 1. Create agent with voice ENABLED in metadata
const agent = await fetch(`${API}/api/agents`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json', 'X-API-Key': API_KEY },
  body: JSON.stringify({
    name: 'Voice Assistant',
    agentType: 'HUMA-0.1',
    metadata: {
      className: 'Assistant',
      personality: 'Friendly and helpful voice assistant.',
      instructions: 'Engage in natural conversation. Use the speak tool to respond.',
      tools: [
        {
          name: 'speak',
          description: 'Say something out loud in the voice call',
          parameters: [
            { name: 'text', type: 'string', description: 'What to say', required: true }
          ]
        }
      ],

      // ⚠️ REQUIRED: Enable voice in metadata!
      voice: {
        enabled: true,                     // Must be true for voice to work
        voiceId: 'EXAVITQu4vr4xnSDxMaL'    // Optional: ElevenLabs voice ID
      }
    }
  })
}).then(r => r.json());

// 2. Connect WebSocket
const socket = io(API, {
  query: { agentId: agent.id, apiKey: API_KEY },
  transports: ['websocket']
});

// 3. Handle voice events
socket.on('event', (data) => {
  switch (data.type) {
    case 'room_joined':
      console.log('Agent joined voice room');
      break;

    case 'room_left':
      console.log('Agent left voice room');
      break;

    case 'participant_joined':
      console.log(`${data.userName} joined the call`);
      break;

    case 'transcript':
      console.log(`${data.participantId}: ${data.text}`);
      // Send as context update to trigger agent response
      socket.emit('message', {
        type: 'huma-0.1-event',
        content: {
          name: 'voice-transcript',
          context: {
            recentTranscripts: [...],
            participants: [...]
          },
          description: `User said: "${data.text}"`
        }
      });
      break;

    case 'speak':
      console.log(`Agent speaking: ${data.text}`);
      break;

    case 'transcriber_fatal_error':
      console.error('Voice recognition failed:', data.error);
      // Handle reconnection or notify user
      break;
  }
});

// 4. Join voice room (after WebSocket connected)
socket.on('connect', async () => {
  // Get room credentials from Daily.co (see the createDailyRoom sketch above)
  const dailyRoom = await createDailyRoom();

  socket.emit('message', {
    type: 'join-daily-room',
    content: {
      roomUrl: dailyRoom.url,
      roomToken: dailyRoom.token
    }
  });
});

// 5. Clean up on disconnect
function cleanup() {
  socket.emit('message', { type: 'leave-daily-room' });
  socket.disconnect();
}

Best Practices

Audio Quality

  • Ensure stable internet connection for low latency
  • Test with different devices and browsers
  • Handle audio permission prompts gracefully (see the sketch after this list)
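
For the permission prompt, the standard browser API is navigator.mediaDevices.getUserMedia; a sketch of graceful handling (showMicHelp is a hypothetical UI helper):

// Request mic access up front so the user sees one clear prompt.
async function requestMicrophone() {
  try {
    return await navigator.mediaDevices.getUserMedia({ audio: true });
  } catch (err) {
    if (err.name === 'NotAllowedError') {
      // The user declined: explain why voice needs the mic and offer a retry.
      showMicHelp();
    }
    return null;
  }
}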

User Experience

  • Show visual indicators when the agent is speaking (as sketched after this list)
  • Display participant list and status
  • Provide mute/unmute controls
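
The speak and vad_start / vad_stop events above can drive such indicators. In this sketch, setSpeakingIndicator is a hypothetical UI helper, and the 3-second timeout for the agent is a rough assumption, since the event table lists no explicit "speak ended" event:

// Drive speaking indicators from voice events.
let agentTimer;

socket.on('event', (data) => {
  switch (data.type) {
    case 'vad_start':
      setSpeakingIndicator(data.participantId, true);
      break;
    case 'vad_stop':
      setSpeakingIndicator(data.participantId, false);
      break;
    case 'speak':
      // No explicit "speak ended" event, so clear after a rough estimate.
      setSpeakingIndicator('agent', true);
      clearTimeout(agentTimer);
      agentTimer = setTimeout(() => setSpeakingIndicator('agent', false), 3000);
      break;
  }
});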

Room Management

  • Always send leave-daily-room before disconnecting
  • Handle room expiration and cleanup
  • Consider room capacity limits

Error Handling

  • Handle transcriber errors by notifying users
  • Implement reconnection logic for dropped calls
  • Log events for debugging