Voice AI: Building Applications That Talk Back

Voice interfaces are having a moment. Real-time speech APIs from OpenAI, ElevenLabs, and others have made building voice-enabled applications more accessible than ever.

Speech-to-Text: Transcription Options

// OpenAI Whisper for transcription (run server-side so the API key stays private)
async function transcribe(audioBlob: Blob): Promise<string> {
  const formData = new FormData()
  formData.append('file', audioBlob, 'audio.webm')
  formData.append('model', 'whisper-1')

  const response = await fetch('https://api.openai.com/v1/audio/transcriptions', {
    method: 'POST',
    headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` },
    body: formData,
  })
  if (!response.ok) {
    throw new Error(`Transcription failed: ${response.status} ${response.statusText}`)
  }

  const { text } = (await response.json()) as { text: string }
  return text
}

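// Browser-native Web Speech API (free, real-time; browser support varies)
function useSpeechRecognition(onResult: (text: string) => void) {
  // Some browsers only expose the constructor under the webkit prefix
  const Recognition =
    (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition
  if (!Recognition) {
    throw new Error('Web Speech API is not supported in this browser')
  }

  const recognition = new Recognition()
  recognition.continuous = true
  recognition.interimResults = true

  recognition.onresult = (event: any) => {
    // Joins every result so far, including interim (not yet final) ones
    const transcript = Array.from(event.results as ArrayLike<any>)
      .map(result => result[0].transcript)
      .join('')
    onResult(transcript)
  }

  return {
    start: () => recognition.start(),
    stop: () => recognition.stop(),
  }
}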
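
With interimResults enabled, the result list mixes finalized phrases with guesses that may still change. A minimal sketch of separating the two is below; the helper name splitTranscript and the RecognitionResultLike type are illustrative assumptions, modeled on the isFinal flag and indexed alternatives that the Web Speech API exposes on each result.

```typescript
// Minimal structural type covering the parts of a recognition result used here.
interface RecognitionResultLike {
  readonly isFinal: boolean
  readonly [index: number]: { readonly transcript: string }
}

// Hypothetical helper: splits results into committed text and still-changing text.
function splitTranscript(
  results: ArrayLike<RecognitionResultLike>
): { finalText: string; interimText: string } {
  let finalText = ''
  let interimText = ''
  for (let i = 0; i < results.length; i++) {
    const result = results[i]
    // Index 0 is the top-ranked alternative for this result
    if (result.isFinal) {
      finalText += result[0].transcript
    } else {
      interimText += result[0].transcript
    }
  }
  return { finalText, interimText }
}
```

A caller could show interimText as a live preview and only send finalText on to the LLM, which avoids acting on words the recognizer later revises.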

Text-to-Speech: Natural Voice Generation

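// ElevenLabs for high-quality TTS (run server-side so the API key stays private)
async function synthesizeSpeech(text: string, voiceId: string): Promise<ArrayBuffer> {
  const apiKey = process.env.ELEVENLABS_API_KEY
  if (!apiKey) throw new Error('ELEVENLABS_API_KEY is not set')

  const response = await fetch(
    `https://api.elevenlabs.io/v1/text-to-speech/${voiceId}`,
    {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'xi-api-key': apiKey,
      },
      body: JSON.stringify({
        text,
        model_id: 'eleven_turbo_v2',
        voice_settings: {
          stability: 0.5,
          similarity_boost: 0.75,
        },
      }),
    }
  )
  if (!response.ok) {
    throw new Error(`Speech synthesis failed: ${response.status} ${response.statusText}`)
  }

  return response.arrayBuffer()
}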

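// Streaming TTS for low latency
async function* streamSpeech(text: string): AsyncGenerator<Uint8Array> {
  const response = await fetch('/api/tts/stream', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text }),
  })
  if (!response.ok || !response.body) {
    throw new Error(`TTS stream failed: ${response.status}`)
  }

  const reader = response.body.getReader()
  try {
    while (true) {
      const { done, value } = await reader.read()
      if (done) break
      yield value
    }
  } finally {
    // Release the stream even if the consumer stops iterating early
    reader.releaseLock()
  }
}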
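
Network chunks from a stream like this are arbitrary byte ranges, and the browser's decodeAudioData only accepts complete audio file data, not fragments. When the endpoint does not frame its output into independently decodable pieces, one simple fallback is to collect the chunks and decode once at the end, trading latency for correctness. The helper below is a sketch of that merge step; concatChunks is a hypothetical name, not part of any library.

```typescript
// Hypothetical helper: merges streamed byte chunks into one contiguous array.
function concatChunks(chunks: Uint8Array[]): Uint8Array {
  const total = chunks.reduce((sum, chunk) => sum + chunk.byteLength, 0)
  const merged = new Uint8Array(total)
  let offset = 0
  for (const chunk of chunks) {
    // Copy each chunk in arrival order
    merged.set(chunk, offset)
    offset += chunk.byteLength
  }
  return merged
}
```

The merged bytes can then be handed to the decoder in a single call. Whether this is needed depends on how the /api/tts/stream endpoint packages its audio, which the article does not specify.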

Real-Time Voice: Streaming Architectures

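// Full voice conversation loop
class VoiceAssistant {
  private audioContext = new AudioContext()

  // chat is supplied by the caller (assumed here): it sends the transcript
  // to an LLM and resolves with the reply text
  constructor(private chat: (transcript: string) => Promise<string>) {}

  async handleUserSpeech(transcript: string) {
    // 1. Send to LLM and wait for the reply
    const response = await this.chat(transcript)

    // 2. Request TTS; audio chunks stream back as they are synthesized
    const audioStream = streamSpeech(response)

    // 3. Play audio chunks as they arrive
    for await (const chunk of audioStream) {
      await this.playAudioChunk(chunk)
    }
  }

  // Assumes each chunk is independently decodable;
  // decodeAudioData rejects fragments of an audio file
  private async playAudioChunk(chunk: Uint8Array) {
    // Copy exactly this chunk's bytes; chunk.buffer can be larger than the view
    const data = chunk.buffer.slice(
      chunk.byteOffset,
      chunk.byteOffset + chunk.byteLength
    ) as ArrayBuffer
    const audioBuffer = await this.audioContext.decodeAudioData(data)
    const source = this.audioContext.createBufferSource()
    source.buffer = audioBuffer
    source.connect(this.audioContext.destination)
    // Resolve when playback ends so chunks play in order instead of overlapping
    await new Promise<void>(resolve => {
      source.onended = () => resolve()
      source.start()
    })
  }
}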
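
Playing decoded buffers back to back without gaps or overlap usually means scheduling each one on the AudioContext clock rather than starting it the moment it decodes. The sketch below shows only the bookkeeping; PlaybackScheduler and its leadTime default are illustrative assumptions, and the times are in seconds to match AudioContext.currentTime.

```typescript
// Hypothetical helper: decides when each decoded buffer should start playing.
class PlaybackScheduler {
  // Time at which the previously scheduled buffer finishes
  private nextStart = 0

  // leadTime is a small safety margin (seconds) so start times are never in the past
  constructor(private readonly leadTime: number = 0.05) {}

  // Returns the start time for a buffer of the given duration.
  schedule(currentTime: number, duration: number): number {
    // Queue behind earlier audio, or start shortly from now if the queue ran dry
    const startAt = Math.max(this.nextStart, currentTime + this.leadTime)
    this.nextStart = startAt + duration
    return startAt
  }
}
```

In a player, the returned value would be passed as the `when` argument of `source.start(when)`, using `audioContext.currentTime` and `audioBuffer.duration` as inputs.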

Latency Optimization

Voice interactions need low latency to feel natural. Measured from the moment the user stops speaking to the first audio of the reply, under 500 ms is good and under 300 ms is great:

// Strategies for low latency
const optimizations = {
  // 1. Start TTS before the LLM finishes
  streamingResponse: true,
  
  // 2. Use edge functions for lowest latency
  runtime: 'edge',
  
  // 3. Preload audio context
  preloadAudio: true,
  
  // 4. Use faster models with acceptable quality
  models: {
    stt: 'whisper-1',     // Fast, accurate
    llm: 'gpt-4o-mini',   // Fast, capable
    tts: 'eleven_turbo_v2',  // Fast, natural
  },
}
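
Starting TTS before the LLM finishes generally means cutting the token stream into speakable units and synthesizing each one as soon as it is complete. Below is a minimal sketch that splits on sentence-ending punctuation; SentenceBuffer is a hypothetical helper, and its boundary rule is deliberately naive (it would also split after abbreviations such as "Dr.").

```typescript
// Hypothetical helper: accumulates streamed text and emits whole sentences.
class SentenceBuffer {
  private pending = ''

  // Adds a streamed fragment and returns any sentences it completed.
  push(fragment: string): string[] {
    this.pending += fragment
    const sentences: string[] = []
    // A sentence ends at ., ! or ? followed by whitespace
    const boundary = /[.!?]+\s+/g
    let lastEnd = 0
    let match: RegExpExecArray | null
    while ((match = boundary.exec(this.pending)) !== null) {
      const end = match.index + match[0].length
      sentences.push(this.pending.slice(lastEnd, end).trim())
      lastEnd = end
    }
    // Keep the unfinished tail for the next fragment
    this.pending = this.pending.slice(lastEnd)
    return sentences
  }

  // Returns whatever text remains once the LLM stream ends.
  flush(): string | null {
    const rest = this.pending.trim()
    this.pending = ''
    return rest.length > 0 ? rest : null
  }
}
```

Each sentence returned by push can be sent to the TTS endpoint right away, so the first audio is requested after the first sentence rather than after the whole reply. Calling flush at the end of the stream picks up a final sentence that has no trailing whitespace.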

Key Takeaways

Choose the right STT for your needs. Browser APIs are free and real-time; Whisper is more accurate, but the endpoint shown here transcribes a finished recording, so it suits non-real-time use.

Stream everything. Start TTS before the LLM finishes; play audio before it's fully generated.

Optimize for latency. Sub-500ms round trips feel conversational; longer feels robotic.
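
To check whether a turn actually meets a latency target like this, it helps to timestamp each stage and compare the gaps. The sketch below is one way to do that; TurnTimer and the stage names are illustrative assumptions, and the clock is injectable so the timer can be exercised without real delays.

```typescript
// Hypothetical stage names for one conversational turn
type Stage = 'speechEnd' | 'transcript' | 'firstToken' | 'firstAudio'

// Hypothetical helper: records when each stage of a turn happened.
class TurnTimer {
  private marks = new Map<Stage, number>()

  // now returns milliseconds; defaults to the wall clock
  constructor(private readonly now: () => number = () => Date.now()) {}

  // Records the current time for a stage.
  mark(stage: Stage): void {
    this.marks.set(stage, this.now())
  }

  // Milliseconds between two stages, or undefined if either was never marked.
  between(from: Stage, to: Stage): number | undefined {
    const start = this.marks.get(from)
    const end = this.marks.get(to)
    if (start === undefined || end === undefined) return undefined
    return end - start
  }
}
```

Calling `between('speechEnd', 'firstAudio')` gives the figure the user perceives, while the intermediate gaps show which stage (STT, LLM, or TTS) is consuming most of the budget.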