Voice interfaces are having a moment. Real-time speech APIs from OpenAI, ElevenLabs, and others have made building voice-enabled applications more accessible than ever.
Speech-to-Text: Transcription Options
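// OpenAI Whisper for transcription (server-side: reads the API key from process.env)
async function transcribe(audioBlob: Blob): Promise<string> {
  const formData = new FormData()
  formData.append('file', audioBlob, 'audio.webm')
  formData.append('model', 'whisper-1')

  const response = await fetch('https://api.openai.com/v1/audio/transcriptions', {
    method: 'POST',
    headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` },
    body: formData,
  })
  // Surface HTTP errors instead of trying to parse an error body as a transcript
  if (!response.ok) {
    throw new Error(`Transcription failed with status ${response.status}`)
  }

  const { text } = await response.json()
  return text
}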
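// Browser-native Web Speech API (free, real-time; availability varies by browser)
function useSpeechRecognition(onResult: (text: string) => void) {
  // Prefer the standard constructor and fall back to the webkit-prefixed one
  const Recognition =
    (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition
  const recognition = new Recognition()
  recognition.continuous = true
  recognition.interimResults = true

  recognition.onresult = (event: any) => {
    // Join the top alternative of every result, interim ones included
    const transcript = Array.from(event.results as ArrayLike<any>)
      .map(result => result[0].transcript)
      .join('')
    onResult(transcript)
  }

  return {
    start: () => recognition.start(),
    stop: () => recognition.stop(),
  }
}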
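The handler above joins every result into one string, so interim guesses and finalized text are mixed together. A minimal sketch of one way to keep them apart, assuming result objects shaped like the Web Speech API's (an `isFinal` flag plus indexed alternatives). `ResultLike` and `splitTranscript` are hypothetical names used for illustration, not part of any API:

```typescript
// Structural stand-in for a speech recognition result (assumption for illustration).
interface ResultLike {
  readonly isFinal: boolean
  readonly [index: number]: { transcript: string }
}

// Hypothetical helper: separate committed text from the current interim guess.
function splitTranscript(results: ResultLike[]): { finalText: string; interimText: string } {
  let finalText = ''
  let interimText = ''
  for (const result of results) {
    // Index 0 holds the top-ranked alternative for this result.
    const best = result[0]?.transcript ?? ''
    if (result.isFinal) {
      finalText += best
    } else {
      interimText += best
    }
  }
  return { finalText, interimText }
}
```

With this split, a UI could render `finalText` as settled and `interimText` as provisional, and only send `finalText` on to the LLM.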
Text-to-Speech: Natural Voice Generation
// ElevenLabs for high-quality TTS
async function synthesizeSpeech(text: string, voiceId: string): Promise<ArrayBuffer> {
  const response = await fetch(
    `https://api.elevenlabs.io/v1/text-to-speech/${voiceId}`,
    {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        // Header values must be strings, so fall back to '' if the key is unset
        'xi-api-key': process.env.ELEVENLABS_API_KEY ?? '',
      },
      body: JSON.stringify({
        text,
        model_id: 'eleven_turbo_v2',
        voice_settings: {
          stability: 0.5,
          similarity_boost: 0.75,
        },
      }),
    }
  )
  if (!response.ok) {
    throw new Error(`TTS request failed with status ${response.status}`)
  }
  return response.arrayBuffer()
}
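// Streaming TTS for low latency
async function* streamSpeech(text: string): AsyncGenerator<Uint8Array> {
  const response = await fetch('/api/tts/stream', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text }),
  })
  // response.body can be null, so check it before reading
  if (!response.ok || !response.body) {
    throw new Error(`TTS stream failed with status ${response.status}`)
  }

  const reader = response.body.getReader()
  while (true) {
    const { done, value } = await reader.read()
    if (done) break
    yield value
  }
}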
Real-Time Voice: Streaming Architectures
// Full voice conversation loop
class VoiceAssistant {
  private audioContext: AudioContext
  private recognition: SpeechRecognition

  async handleUserSpeech(transcript: string) {
    // 1. Send to LLM (this.chat is not shown here; it resolves to the reply text)
    const response = await this.chat(transcript)
    // 2. Request streamed TTS audio for the reply
    const audioStream = streamSpeech(response)
    // 3. Play audio chunks as they arrive
    for await (const chunk of audioStream) {
      await this.playAudioChunk(chunk)
    }
  }

  private async playAudioChunk(chunk: Uint8Array) {
    // Copy exactly this chunk's bytes; chunk.buffer may be larger than the view.
    // Assumes each chunk is independently decodable audio.
    const audioBuffer = await this.audioContext.decodeAudioData(chunk.slice().buffer)
    const source = this.audioContext.createBufferSource()
    source.buffer = audioBuffer
    source.connect(this.audioContext.destination)
    // Resolve once playback ends so consecutive chunks do not overlap
    await new Promise<void>((resolve) => {
      source.onended = () => resolve()
      source.start()
    })
  }
}
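In the loop above, `this.chat` resolves only once the full reply is available, so synthesis cannot begin while the LLM is still generating. One possible way to start TTS earlier, assuming the reply can instead be consumed as a stream of text fragments (an assumption; the original does not show `chat`), is to regroup the fragments into sentences and synthesize each sentence as soon as it completes. `SentenceBuffer` is a hypothetical helper sketched for illustration:

```typescript
// Hypothetical helper: accumulates streamed text and releases whole sentences.
class SentenceBuffer {
  private buffer = ''

  // Add a fragment and return any sentences that are now complete.
  push(fragment: string): string[] {
    this.buffer += fragment
    const sentences: string[] = []
    // A sentence counts as complete once its end punctuation is followed by whitespace.
    const boundary = /^([\s\S]*?[.!?])\s+/
    let match = this.buffer.match(boundary)
    while (match !== null) {
      sentences.push(match[1].trim())
      this.buffer = this.buffer.slice(match[0].length)
      match = this.buffer.match(boundary)
    }
    return sentences
  }

  // Return the trailing text once the stream has ended, then reset.
  flush(): string[] {
    const rest = this.buffer.trim()
    this.buffer = ''
    return rest ? [rest] : []
  }
}

const sentenceBuffer = new SentenceBuffer()
console.log(sentenceBuffer.push('Sure. I can'))           // [ 'Sure.' ]
console.log(sentenceBuffer.push(' help with that! What')) // [ 'I can help with that!' ]
console.log(sentenceBuffer.push(' next?'))                // []
console.log(sentenceBuffer.flush())                       // [ 'What next?' ]
```

Each sentence returned by `push` could be handed to `streamSpeech` right away. Waiting for whitespace after the punctuation is a deliberate trade-off: it avoids splitting inside text such as "3.5" at the cost of holding the last sentence until `flush`.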
Latency Optimization
Voice interactions need low latency to feel natural (under 500ms is good, under 300ms is great):
// Strategies for low latency
const optimizations = {
  // 1. Start TTS before LLM finishes
  streamingResponse: true,
  // 2. Use edge functions for lowest latency
  runtime: 'edge',
  // 3. Preload audio context
  preloadAudio: true,
  // 4. Use faster models with acceptable quality
  models: {
    stt: 'whisper-1', // Fast, accurate
    llm: 'gpt-4o-mini', // Fast, capable
    tts: 'eleven_turbo_v2', // Fast, natural (same model ID as the TTS example)
  },
}
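The 500ms and 300ms targets apply to the whole round trip, so it can help to add up per-stage estimates and see where the total lands. A rough sketch: `rateLatency` and the stage names are hypothetical, and the millisecond figures in the usage are made-up placeholders, not measurements.

```typescript
// Thresholds taken from the targets stated above.
const GOOD_MS = 500
const GREAT_MS = 300

type Rating = 'great' | 'good' | 'too slow'

// Hypothetical helper: sum per-stage latencies and rate the total.
function rateLatency(stagesMs: Record<string, number>): { totalMs: number; rating: Rating } {
  const totalMs = Object.values(stagesMs).reduce((sum, ms) => sum + ms, 0)
  const rating: Rating =
    totalMs < GREAT_MS ? 'great' : totalMs < GOOD_MS ? 'good' : 'too slow'
  return { totalMs, rating }
}

// Placeholder numbers for illustration only.
const budget = rateLatency({ stt: 120, llmFirstToken: 180, ttsFirstAudio: 150 })
console.log(budget) // { totalMs: 450, rating: 'good' }
```

Measuring time to first token and time to first audio, rather than time to completion, matches the streaming approach: what the user perceives is the delay before playback starts.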
Key Takeaways
Choose the right STT for your needs. Browser APIs are free and real-time; Whisper is generally more accurate when you do not need results in real time.
Stream everything. Start TTS before LLM finishes; play audio before it's fully generated.
Optimize for latency. Sub-500ms round trips feel conversational; longer delays feel robotic.
