Text-only AI already feels like a limitation. Modern AI models process images, audio, video, and code simultaneously, opening up application possibilities that weren't feasible even a year ago.
Understanding Multimodal Architecture
Modern multimodal models aren't just separate models stitched together—they share representations across modalities:
Input: [Image] + "What's happening in this photo?"
↓
Vision Encoder → Joint Representation ← Text Encoder
↓
Language Model
↓
Output: "A family is having a picnic in the park..."

Image Understanding: Beyond OCR
Modern vision capabilities go far beyond text extraction:
import Anthropic from '@anthropic-ai/sdk'

const client = new Anthropic()

async function analyzeImage(imageBase64: string, question: string) {
  const response = await client.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 1024,
    messages: [{
      role: 'user',
      content: [
        {
          type: 'image',
          source: {
            type: 'base64',
            media_type: 'image/jpeg',
            data: imageBase64
          }
        },
        { type: 'text', text: question }
      ]
    }]
  })
  // The first content block may not be text (e.g. tool use), so guard.
  const block = response.content[0]
  return block.type === 'text' ? block.text : ''
}
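The function above expects raw base64 and a matching media type. A small loading helper keeps call sites clean; this is a sketch using Node's built-in fs and path modules, and the extension-to-media-type map is an assumption about which formats you accept:

```typescript
import { readFileSync } from 'fs'
import { extname } from 'path'

// Map a file extension to the media_type string the API expects.
const MEDIA_TYPES: Record<string, string> = {
  '.jpg': 'image/jpeg',
  '.jpeg': 'image/jpeg',
  '.png': 'image/png',
  '.gif': 'image/gif',
  '.webp': 'image/webp'
}

export function mediaTypeFor(path: string): string {
  const type = MEDIA_TYPES[extname(path).toLowerCase()]
  if (!type) throw new Error(`Unsupported image type: ${path}`)
  return type
}

export function loadImageBase64(path: string): string {
  return readFileSync(path).toString('base64')
}
```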
// Usage examples
await analyzeImage(chartImage, 'Summarize the trends in this chart')
await analyzeImage(uiScreenshot, 'List all the accessibility issues')
await analyzeImage(receipt, 'Extract: merchant, total, date, items')

Audio Processing
Audio models handle both speech and environmental sounds:
// Transcription with context

interface TranscriptionResult {
  raw: string
  enhanced: string
  confidence: number
}

async function transcribeWithContext(
  audioFile: Buffer,
  context: string
): Promise<TranscriptionResult> {
  // Use Whisper (or a similar STT service) for the raw transcription.
  // `whisper` is a stand-in for whatever transcription client you use.
  const transcription = await whisper.transcribe(audioFile)

  // Enhance with an LLM for domain-specific terms.
  // `llm.complete` is likewise a placeholder for your text-model client.
  const enhanced = await llm.complete(`
    Context: ${context}
    Raw transcription: ${transcription.text}
    Correct any domain-specific terms and fix obvious errors.
  `)

  return {
    raw: transcription.text,
    enhanced,
    confidence: transcription.confidence
  }
}

Video Analysis
Video analysis typically involves frame sampling and temporal reasoning:
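Before any frames are sent to a model, the sampler needs timestamps to grab. For the 'uniform' strategy mentioned below, the arithmetic is simple enough to sketch without a video library (the midpoint-of-segment choice is one reasonable convention, not the only one):

```typescript
// Evenly spaced sample timestamps for a 'uniform' sampling strategy.
// Samples the midpoint of each of `maxFrames` equal segments, so a
// 10-second video sampled 5 times yields 1s, 3s, 5s, 7s, 9s.
export function uniformTimestamps(durationSec: number, maxFrames: number): number[] {
  if (durationSec <= 0 || maxFrames <= 0) return []
  const segment = durationSec / maxFrames
  return Array.from({ length: maxFrames }, (_, i) => (i + 0.5) * segment)
}
```

Scene-change detection needs actual frame differencing (e.g. via ffmpeg), but uniform sampling like this is often good enough for short clips.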
async function analyzeVideo(
  videoPath: string,
  question: string
): Promise<string> {
  // Extract key frames. `extractKeyFrames` is a helper you supply,
  // e.g. built on ffmpeg.
  const frames = await extractKeyFrames(videoPath, {
    maxFrames: 10,
    strategy: 'scene-change' // or 'uniform'
  })

  // Analyze frames with temporal context. Image blocks don't carry
  // per-image metadata, so timestamps go into the text prompt instead.
  const response = await client.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 1024,
    messages: [{
      role: 'user',
      content: [
        ...frames.map((frame) => ({
          type: 'image' as const,
          source: {
            type: 'base64' as const,
            media_type: 'image/jpeg' as const,
            data: frame.base64
          }
        })),
        {
          type: 'text' as const,
          text: `These are frames from a video, sampled at ` +
            `${frames.map((f) => `${f.timestamp}s`).join(', ')}. ${question}`
        }
      ]
    }]
  })

  const block = response.content[0]
  return block.type === 'text' ? block.text : ''
}

Practical Multimodal Patterns
Document Processing Pipeline
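The pipeline below asks the model for structured output, and model replies often come back wrapped in markdown code fences. A small extraction helper is worth having; this is a sketch, and the fence-stripping regex is an assumption about how your model formats replies:

```typescript
// Pull a JSON payload out of a model reply, tolerating ```json fences.
export function extractJson<T = unknown>(reply: string): T {
  const fenced = reply.match(/`{3}(?:json)?\s*([\s\S]*?)`{3}/)
  const body = fenced ? fenced[1] : reply
  return JSON.parse(body.trim()) as T
}
```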
async function processDocument(file: Buffer, mimeType: string) {
  // Extract text + visual elements. `extractWithVision` is a helper
  // you supply, e.g. built on the image analysis above.
  const extraction = await extractWithVision(file, mimeType)

  // Structured output. `llm.complete` and `DocumentSchema` are
  // placeholders for your model client and schema definition.
  const structured = await llm.complete({
    prompt: `Extract structured data from this document...`,
    schema: DocumentSchema,
    context: extraction
  })

  return structured
}

Accessibility Analysis
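The audit below asks the model for JSON with issues and severity. Once parsed, a small helper to order the findings makes reporting straightforward; this is a sketch that assumes each issue carries a severity of 'high' | 'medium' | 'low', which is not guaranteed by the prompt as written:

```typescript
type Severity = 'high' | 'medium' | 'low'

interface A11yIssue {
  description: string
  severity: Severity
}

const SEVERITY_RANK: Record<Severity, number> = { high: 0, medium: 1, low: 2 }

// Most severe issues first; stable for equal severities.
export function rankIssues(issues: A11yIssue[]): A11yIssue[] {
  return [...issues].sort(
    (a, b) => SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity]
  )
}
```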
async function auditAccessibility(screenshotBase64: string) {
  return analyzeImage(screenshotBase64, `
    Analyze this UI for accessibility issues:
    1. Color contrast problems
    2. Missing labels or alt text indicators
    3. Touch target sizes
    4. Text readability
    5. Focus indicators
    Return JSON with issues and severity.
  `)
}

Conclusion
Multimodal AI isn't just a feature—it's a new paradigm for building applications. The patterns in this guide provide a foundation, but the possibilities expand as you combine modalities in creative ways. Start with a clear use case, choose the right model for each modality, and build from there.
