Integration Guide · Phase 8

Voice Lifecycle

Understanding the real-time states of a voice session: Listen, Think, Speak, and Interrupt.

Overview

A voice conversation isn't just audio back-and-forth. It's a state machine. Your UI needs to reflect these states to feel "alive" and responsive.

  • Listening: the user is speaking
  • Thinking: the LLM is processing
  • Speaking: agent audio is playing
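
To make the loop concrete, here is an illustrative transition map for the happy path (the state names are ours for this guide, not part of any API):

// Illustrative only: the happy-path turn-taking loop as a transition map.
const TRANSITIONS = {
  listening: 'thinking',  // user finishes a sentence (transcript-user-final)
  thinking: 'speaking',   // agent audio starts (audio-start)
  speaking: 'listening',  // agent audio ends (audio-end) or is interrupted
};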

Client States (Daily.co)

The daily-js library (used here via its React wrapper, @daily-co/daily-react) provides the fundamental connection states.

Connecting

Initial join request. Show a spinner.

Connected

Joined the room. Local mic is live.

Left / Error

Disconnected. Reset UI.

React Example
import { useMeetingState } from '@daily-co/daily-react';

// Note: current @daily-co/daily-react exposes useMeetingState(), which
// returns states such as 'joining-meeting', 'joined-meeting',
// 'left-meeting', and 'error'. Spinner, ActiveCall, and JoinScreen are
// placeholder components.
const MyComponent = () => {
  const meetingState = useMeetingState();

  if (meetingState === 'joining-meeting') return <Spinner />;
  if (meetingState === 'joined-meeting') return <ActiveCall />;

  return <JoinScreen />;
};
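
These hooks only work inside Daily's React context. A minimal wiring sketch, using the standard daily-js call object:

import Daily from '@daily-co/daily-js';
import { DailyProvider } from '@daily-co/daily-react';

const callObject = Daily.createCallObject();

const App = () => (
  <DailyProvider callObject={callObject}>
    <MyComponent />
  </DailyProvider>
);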

Server Events (App Messages)

HUMA sends custom events via Daily's app-message channel to tell you what the agent is doing.
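
The exact payload schema isn't shown in this guide; a plausible shape, assuming a type discriminator plus event-specific fields, looks like this:

// Assumed shape (illustrative): only the `type` field is confirmed by the
// listener example below; other field names may differ.
const example = { type: 'transcript-user-final', text: 'What cards do you have?' };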

Transcript Events
transcript-user-final (STT)

The user finished a sentence. "What cards do you have?"

transcript-agent-text (LLM)

The agent generated text. "I have two 7s." (Before audio plays)

Audio Events
audio-start (TTS)

Agent started speaking audio. Trigger "Speaking" animation.

audio-end (TTS)

Agent finished speaking. Return to "Listening" state.

Event Listener
import { useState } from 'react';
import { useAppMessage } from '@daily-co/daily-react';

// Inside a component rendered under <DailyProvider>:
const [agentState, setAgentState] = useState('idle');

useAppMessage({
  onAppMessage: (e) => {
    switch (e.data.type) {
      case 'transcript-user-final':
        setAgentState('thinking'); // User stopped; agent is thinking
        break;
      case 'audio-start':
        setAgentState('speaking'); // Agent started talking
        break;
      case 'audio-end':
        setAgentState('idle'); // Done talking
        break;
    }
  },
});

Handling Interruption

Real conversations involve interruptions. If the user speaks while the agent is talking, the agent should stop immediately.

How it works automatically:

  1. VAD Trigger: Deepgram detects user speech while the agent is outputting audio.

  2. Clear Queue: The HUMA server clears the remaining audio buffer.

  3. Stop Event: The client receives an audio-stop event.

Visual Feedback

When you receive audio-stop, immediately cut the "Speaking" animation. This makes the interruption feel snappy and responsive.
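
A minimal extension of the useAppMessage handler above, assuming the event type string is literally audio-stop:

// Add to the switch in the onAppMessage handler above:
case 'audio-stop':
  setAgentState('listening'); // Agent was cut off; drop the Speaking animation
  break;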

Example Flow

Here is a full lifecycle trace for a single turn in Go Fish.

  1. 0ms      User speaks: "Do you have any Kings?"

  2. ~500ms   Event: transcript-user-final. STT finalized. State: Thinking

  3. ~1200ms  Event: transcript-agent-text. "Nope, Go Fish!" (text generated)

  4. ~1500ms  Event: audio-start. Audio starts playing. State: Speaking

  5. ~3000ms  Event: audio-end. Audio finished. State: Idle/Listening

Common Pitfalls

Missing "Thinking" State

Users will think the app is broken during the ~1s silence between speaking and response. Always show a visual indicator when transcript-user-final fires.
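
For example, gate an indicator on the agentState from the listener above (ThinkingDots is a placeholder component):

// Show a visual indicator while the agent formulates its reply.
{agentState === 'thinking' && <ThinkingDots />}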

Ignoring Interruption

If the UI keeps showing "Speaking" after an interruption, it feels laggy. Listen for audio-stop or a new transcript-user-final to reset the state.

Summary

Key Events

  • transcript-user-final: user done speaking
  • transcript-agent-text: agent text ready
  • audio-start: audio playing
  • audio-end: audio finished
  • audio-stop: agent interrupted

UI States

  • Listening (Default)
  • Thinking (Wait)
  • Speaking (Active)

Congratulations!

Integration Guide Complete

You've mastered the full HUMA integration process, from designing personalities to handling real-time voice interrupts. You are now ready to build production-grade Multi-User Agents.