Voice Implementation

HUMA-0.1 Only

Overview

HUMA enables natural multi-party voice conversations that are a great fit for use cases like online debates, NPCs in video games, interview practice, and collaborative brainstorming.

Multi-Party

Multiple participants can speak simultaneously in the same room

Real-Time

Low-latency speech recognition and text-to-speech

Event-Driven

All voice events flow through the same WebSocket connection

Important: Currently the ONLY supported way of integrating with HUMA voice is through Daily.co voice rooms. You must have a Daily.co account to use voice features.

Critical: Voice must be enabled in the agent metadata when creating the agent. Without voice.enabled: true in your metadata, the join-daily-room command will fail silently. See the "Enable Voice in Metadata" section below.

Implementation Plan

1. Set Up Daily.co Voice Rooms

Integrate Daily.co voice rooms into your app or game. You'll need to create rooms, manage participants, and handle audio streams.

Visit Daily.co Developers for API documentation and SDKs.
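
As a sketch of what that server-side setup can look like, the following uses Daily.co's REST API to create a room and mint a meeting token (DAILY_API_KEY is your Daily.co API key; verify the exact request and response shapes against Daily's API reference). It also stands in for the createDailyRoom() helper used in the integration example later on:

// Sketch: create a Daily.co room and a meeting token server-side.
// Assumes DAILY_API_KEY is set in the environment.
const DAILY_API = 'https://api.daily.co/v1';

async function createDailyRoom() {
  const headers = {
    'Content-Type': 'application/json',
    'Authorization': `Bearer ${process.env.DAILY_API_KEY}`
  };

  // Create the room; Daily returns its url and name.
  const room = await fetch(`${DAILY_API}/rooms`, {
    method: 'POST',
    headers,
    body: JSON.stringify({ properties: { enable_chat: false } })
  }).then(r => r.json());

  // Mint a meeting token scoped to that room.
  const { token } = await fetch(`${DAILY_API}/meeting-tokens`, {
    method: 'POST',
    headers,
    body: JSON.stringify({ properties: { room_name: room.name } })
  }).then(r => r.json());

  return { url: room.url, token };
}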

2. Standard HUMA Integration

HUMA in voice mode is an extension of standard HUMA. Start by following the general implementation guide.

Read the Integration Guide

3. Orchestrate Agent Join/Leave

Control when agents join and leave voice chats based on your application logic. This could be triggered by user actions, game events, or automated rules.
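
A minimal sketch of such orchestration, driven by your own application hooks (onUserStartedCall and onCallEnded are hypothetical, not part of the HUMA API; roomUrl and roomToken come from your Daily.co setup):

// Hypothetical app hooks deciding when the agent joins and leaves.
onUserStartedCall(({ roomUrl, roomToken }) => {
  socket.emit('message', {
    type: 'join-daily-room',
    content: { roomUrl, roomToken }
  });
});

onCallEnded(() => {
  socket.emit('message', { type: 'leave-daily-room' });
});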

Enable Voice in Metadata

Required Step: Without this configuration, voice commands like join-daily-room will fail silently and the agent will not join the voice call.

When creating an agent that will use voice features, you must include the voice property in the agent metadata:

const agent = await fetch(`${API}/api/agents`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json', 'X-API-Key': API_KEY },
  body: JSON.stringify({
    name: 'Voice Agent',
    agentType: 'HUMA-0.1',
    metadata: {
      className: 'Assistant',
      personality: 'Friendly voice assistant.',
      instructions: 'Engage in natural conversation.',
      tools: [...],

      // ⚠️ REQUIRED FOR VOICE - Without this, join-daily-room will fail!
      voice: {
        enabled: true,                    // Must be true
        voiceId: 'EXAVITQu4vr4xnSDxMaL'   // ElevenLabs voice ID (optional)
      }
    }
  })
}).then(r => r.json());

voice.enabled

Required. Must be set to true to enable voice features. If missing or false, the agent cannot join voice rooms.

voice.voiceId

Optional. ElevenLabs voice ID for text-to-speech. If not provided, a default voice will be used.

Finding Voice IDs: Browse available voices at ElevenLabs Voice Library. The voice ID is the alphanumeric string in the voice URL.

Voice Lifecycle

Voice is a sub-phase within the Active state. The agent must be connected via WebSocket before joining a call.

Connected (no call)  --join-->  In Voice Call  --leave-->  Connected (no call)
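
One way to track that sub-phase on the client is to flip a flag on the room_joined / room_left events (inVoiceCall is an illustrative variable, not part of the API):

// Mirror the voice sub-phase from server events.
let inVoiceCall = false;

socket.on('event', (data) => {
  if (data.type === 'room_joined') inVoiceCall = true;
  if (data.type === 'room_left') inVoiceCall = false;
});
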
Joining a Call

Send a join-daily-room event with:

  • roomUrl - Daily.co room URL
  • roomToken - Authentication token

Leaving a Call

Send a leave-daily-room event.

Always leave the room before disconnecting the agent to ensure a clean teardown.
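
A sketch of that teardown order: send leave-daily-room, wait for the room_left confirmation, then disconnect (the 5-second timeout is an assumed fallback for the case where the event never arrives):

// Leave the voice room, wait for confirmation, then disconnect.
function shutdown(socket) {
  return new Promise((resolve) => {
    const finish = () => {
      socket.disconnect();
      resolve();
    };
    const timer = setTimeout(finish, 5000);  // fallback if room_left never arrives
    socket.on('event', (data) => {
      if (data.type === 'room_left') {
        clearTimeout(timer);
        finish();
      }
    });
    socket.emit('message', { type: 'leave-daily-room' });
  });
}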

Voice Events

Client → Server Events

Join Voice Room

socket.emit('message', {
  type: 'join-daily-room',
  content: {
    roomUrl: 'https://your-domain.daily.co/room-name',
    roomToken: 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...'
  }
});

Leave Voice Room

socket.emit('message', {
  type: 'leave-daily-room'
});

Server → Client Events

All server events are received on the event channel:

Event Type                 Description                                     Key Fields
room_joined                Agent successfully joined the call              roomUrl
room_left                  Agent left the call                             roomUrl
participant_joined         Someone joined the call                         participantId, userName
participant_left           Someone left the call                           participantId
transcript                 Speech-to-text from a participant               participantId, text, isFinal
speak                      Agent is speaking                               text
vad_start                  Voice activity started (someone is speaking)    participantId
vad_stop                   Voice activity stopped                          participantId
transcriber_fatal_error    Speech recognition failed permanently           error

Fatal Error: If you receive transcriber_fatal_error, the speech recognition has failed permanently. The agent will disconnect from the call. You should notify the user and optionally reconnect.
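
One possible recovery path, assuming your app can mint fresh Daily.co credentials (notifyUser is a hypothetical UI helper; createDailyRoom is sketched earlier):

// Sketch: surface the failure, then rejoin with fresh credentials.
socket.on('event', async (data) => {
  if (data.type !== 'transcriber_fatal_error') return;

  notifyUser('Voice recognition failed; reconnecting the agent...');

  // The agent has already left the call, so rejoin with a new room/token.
  const { url, token } = await createDailyRoom();
  socket.emit('message', {
    type: 'join-daily-room',
    content: { roomUrl: url, roomToken: token }
  });
});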

Example Integration

import { io } from 'socket.io-client';

const API = 'https://api.humalike.tech';
const API_KEY = 'ak_your_api_key';

// 1. Create agent with voice ENABLED in metadata
const agent = await fetch(`${API}/api/agents`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json', 'X-API-Key': API_KEY },
  body: JSON.stringify({
    name: 'Voice Assistant',
    agentType: 'HUMA-0.1',
    metadata: {
      className: 'Assistant',
      personality: 'Friendly and helpful voice assistant.',
      instructions: 'Engage in natural conversation. Use the speak tool to respond.',
      tools: [
        {
          name: 'speak',
          description: 'Say something out loud in the voice call',
          parameters: [
            { name: 'text', type: 'string', description: 'What to say', required: true }
          ]
        }
      ],

      // ⚠️ REQUIRED: Enable voice in metadata!
      voice: {
        enabled: true,                     // Must be true for voice to work
        voiceId: 'EXAVITQu4vr4xnSDxMaL'    // Optional: ElevenLabs voice ID
      }
    }
  })
}).then(r => r.json());

// 2. Connect WebSocket
const socket = io(API, {
  query: { agentId: agent.id, apiKey: API_KEY },
  transports: ['websocket']
});

// 3. Handle voice events
socket.on('event', (data) => {
  switch (data.type) {
    case 'room_joined':
      console.log('Agent joined voice room');
      break;

    case 'room_left':
      console.log('Agent left voice room');
      break;

    case 'participant_joined':
      console.log(`${data.userName} joined the call`);
      break;

    case 'transcript':
      console.log(`${data.participantId}: ${data.text}`);
      // Send as context update to trigger agent response
      socket.emit('message', {
        type: 'huma-0.1-event',
        content: {
          name: 'voice-transcript',
          context: {
            recentTranscripts: [...],
            participants: [...]
          },
          description: `User said: "${data.text}"`
        }
      });
      break;

    case 'speak':
      console.log(`Agent speaking: ${data.text}`);
      break;

    case 'transcriber_fatal_error':
      console.error('Voice recognition failed:', data.error);
      // Handle reconnection or notify user
      break;
  }
});

// 4. Join voice room (after WebSocket connected)
socket.on('connect', async () => {
  // Get room credentials from Daily.co (see the createDailyRoom sketch above)
  const dailyRoom = await createDailyRoom();

  socket.emit('message', {
    type: 'join-daily-room',
    content: {
      roomUrl: dailyRoom.url,
      roomToken: dailyRoom.token
    }
  });
});

// 5. Clean up on disconnect
function cleanup() {
  socket.emit('message', { type: 'leave-daily-room' });
  socket.disconnect();
}

Best Practices

Audio Quality

  • Ensure stable internet connection for low latency
  • Test with different devices and browsers
  • Handle audio permission prompts gracefully (see the sketch after this list)
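
For the permission prompt, the standard browser API is navigator.mediaDevices.getUserMedia; a sketch of graceful handling (showMicHelp is a hypothetical UI helper):

// Request mic access up front so the user sees one clear prompt.
async function requestMicrophone() {
  try {
    return await navigator.mediaDevices.getUserMedia({ audio: true });
  } catch (err) {
    if (err.name === 'NotAllowedError') {
      // The user declined: explain why voice needs the mic and offer a retry.
      showMicHelp();
    }
    return null;
  }
}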

User Experience

  • Show visual indicators when the agent is speaking (as sketched after this list)
  • Display participant list and status
  • Provide mute/unmute controls
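
The speak and vad_start / vad_stop events above can drive such indicators. In this sketch, setSpeakingIndicator is a hypothetical UI helper, and the 3-second timeout for the agent is a rough assumption, since the event table lists no explicit "speak ended" event:

// Drive speaking indicators from voice events.
let agentTimer;

socket.on('event', (data) => {
  switch (data.type) {
    case 'vad_start':
      setSpeakingIndicator(data.participantId, true);
      break;
    case 'vad_stop':
      setSpeakingIndicator(data.participantId, false);
      break;
    case 'speak':
      // No explicit "speak ended" event, so clear after a rough estimate.
      setSpeakingIndicator('agent', true);
      clearTimeout(agentTimer);
      agentTimer = setTimeout(() => setSpeakingIndicator('agent', false), 3000);
      break;
  }
});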

Room Management

  • Always send leave-daily-room before disconnecting
  • Handle room expiration and cleanup
  • Consider room capacity limits

Error Handling

  • Handle transcriber errors by notifying users
  • Implement reconnection logic for dropped calls
  • Log events for debugging