AI Safety for Application Developers: Practical Guardrails

March 21, 2026 · 3 min read · 138 views

Your AI feature will be abused. Users will try jailbreaks, inject prompts, and attempt to extract system prompts. Building defensive AI applications is essential.

Threat Modeling for AI Applications

Common attack vectors:

  • Prompt injection: Manipulating the model to ignore instructions
  • Data extraction: Tricking the model into revealing training data or system prompts
  • Jailbreaking: Bypassing content filters and safety measures
  • Resource exhaustion: Generating expensive or infinite responses

Prompt Injection: Detection and Prevention

// Input sanitization: flag and filter known injection phrasings.
// Pattern lists like this are easy to bypass -- treat them as one
// layer of defense, not the whole defense.
function sanitizeUserInput(input: string): { safe: boolean; sanitized: string } {
  const injectionPatterns = [
    /ignore (all |previous |above )?instructions/i,
    /you are now/i,
    /pretend (you are|to be)/i,
    /system prompt/i,
    /reveal your/i,
    /\[\[.*\]\]/,  // Common injection brackets
    /<\|.*\|>/,    // Another common pattern
  ]
  
  // Apply every pattern rather than returning on the first match,
  // so one input can't smuggle a second injection past the filter
  let sanitized = input
  let safe = true
  for (const pattern of injectionPatterns) {
    if (pattern.test(sanitized)) {
      safe = false
      sanitized = sanitized.replace(pattern, '[FILTERED]')
    }
  }
  
  return { safe, sanitized }
}

// Structural defense: separate user input from instructions with
// explicit delimiters so the model can tell where untrusted text begins
function buildPrompt(systemInstructions: string, userInput: string) {
  return {
    messages: [
      { role: 'system', content: systemInstructions },
      { role: 'user', content: `<user_input>\n${userInput}\n</user_input>` },
    ]
  }
}
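Delimiter-based separation has a gap: users can type the delimiters themselves to "close" the untrusted section early. A minimal sketch of stripping delimiter tokens before wrapping (the `<user_input>` tag and `wrapUserInput` name are illustrative assumptions, not a standard):

```typescript
// Wrap user input in delimiters, stripping any copies the user typed
// so the model cannot be tricked into ending the user section early.
const OPEN = '<user_input>'
const CLOSE = '</user_input>'

function wrapUserInput(userInput: string): string {
  // Remove every occurrence of the delimiter tokens from the raw input
  const cleaned = userInput
    .split(OPEN).join('')
    .split(CLOSE).join('')
  return `${OPEN}\n${cleaned}\n${CLOSE}`
}
```

The same idea applies to any delimiter scheme: whatever token marks the boundary must never be reproducible from inside the untrusted text.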

Output Filtering: Beyond Keyword Blocks

interface OutputFilter {
  check(output: string): Promise<{ safe: boolean; issues: string[] }>
}

abstract class ContentFilter implements OutputFilter {
  // Harmful-content scoring is delegated to a moderation model;
  // the implementation depends on your provider
  protected abstract classifyHarmfulness(output: string): Promise<number>

  async check(output: string) {
    const issues: string[] = []
    
    // PII detection
    if (/\b\d{3}-\d{2}-\d{4}\b/.test(output)) {
      issues.push('Potential SSN detected')
    }
    if (/\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/.test(output)) {
      issues.push('Email address detected')
    }
    
    // System prompt leakage
    if (output.toLowerCase().includes('my instructions') ||
        output.toLowerCase().includes('i was told to')) {
      issues.push('Potential system prompt leakage')
    }
    
    // Use classifier for harmful content
    const harmfulness = await this.classifyHarmfulness(output)
    if (harmfulness > 0.7) {
      issues.push('Potentially harmful content')
    }
    
    return { safe: issues.length === 0, issues }
  }
}
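Blocking an entire response because it contains an email address is often too blunt. A gentler option is to redact the detected PII and return the rest, using the same regexes as the filter above (the `redactPII` name and marker strings are assumptions):

```typescript
// Redact detected PII instead of blocking the whole response.
// The global flag ensures every occurrence is replaced, not just the first.
const SSN_RE = /\b\d{3}-\d{2}-\d{4}\b/g
const EMAIL_RE = /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/g

function redactPII(output: string): string {
  return output
    .replace(SSN_RE, '[REDACTED-SSN]')
    .replace(EMAIL_RE, '[REDACTED-EMAIL]')
}
```

Whether to redact or refuse outright depends on the issue: leaked system-prompt text is usually worth refusing; incidental PII can often be redacted.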

Rate Limiting and Abuse Detection

class AbuseDetector {
  private userPatterns = new Map<string, UserPattern>()
  
  async checkRequest(userId: string, request: AIRequest): Promise<{ allowed: boolean; reason?: string }> {
    const pattern = this.userPatterns.get(userId) ?? new UserPattern()
    
    // Check for rapid-fire requests (probing)
    if (pattern.requestsLastMinute > 20) {
      return { allowed: false, reason: 'Rate limit exceeded' }
    }
    
    // Check for repeated similar requests (injection attempts)
    const similarity = this.calculateSimilarity(request, pattern.recentRequests)
    if (similarity > 0.9 && pattern.recentRequests.length > 5) {
      return { allowed: false, reason: 'Suspicious repeated requests' }
    }
    
    // Check for escalating token usage (more than 10x this user's average)
    if (request.estimatedTokens > pattern.averageTokens * 10) {
      return { allowed: false, reason: 'Abnormal token request' }
    }
    
    pattern.addRequest(request)
    this.userPatterns.set(userId, pattern)
    return { allowed: true }
  }
}
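The `UserPattern` type above is left abstract. One way to back a counter like `requestsLastMinute` is a sliding window of timestamps, pruned on each read (a sketch; the class and method names are assumptions):

```typescript
// Sliding-window counter: stores request timestamps and derives the
// per-minute count by discarding entries older than 60 seconds.
class SlidingWindowCounter {
  private timestamps: number[] = []

  record(now: number = Date.now()): void {
    this.timestamps.push(now)
  }

  countLastMinute(now: number = Date.now()): number {
    // Prune entries that have fallen out of the 60-second window
    const cutoff = now - 60_000
    this.timestamps = this.timestamps.filter((t) => t >= cutoff)
    return this.timestamps.length
  }
}
```

For high-traffic services a fixed-bucket approximation is cheaper than storing every timestamp, but the sliding window is exact and simple for per-user limits in the tens of requests.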

Incident Response

async function handleSafetyIncident(incident: SafetyIncident) {
  // 1. Log for analysis
  await logIncident({
    timestamp: Date.now(),
    type: incident.type,
    userId: incident.userId,
    input: incident.input,
    output: incident.attemptedOutput,
  })
  
  // 2. Return safe fallback
  const fallbackResponse = 'I\'m not able to help with that request.'
  
  // 3. Flag for review if severe
  if (incident.severity === 'high') {
    await notifySecurityTeam(incident)
  }
  
  // 4. Consider rate limiting user
  if (incident.repeated) {
    await applyRateLimit(incident.userId, { duration: '1h', maxRequests: 10 })
  }
  
  return fallbackResponse
}
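The `applyRateLimit` call above is left abstract. A minimal in-memory sketch of the store behind it, with expiry handled on read (names are assumptions; a production system would persist this in Redis or similar so limits survive restarts):

```typescript
// In-memory per-user rate-limit store with time-based expiry.
interface RateLimitEntry { expiresAt: number; maxRequests: number }

class RateLimitStore {
  private limits = new Map<string, RateLimitEntry>()

  apply(userId: string, durationMs: number, maxRequests: number, now: number = Date.now()): void {
    this.limits.set(userId, { expiresAt: now + durationMs, maxRequests })
  }

  // Returns the active limit, or undefined if none (expired entries are purged)
  limitFor(userId: string, now: number = Date.now()): RateLimitEntry | undefined {
    const entry = this.limits.get(userId)
    if (!entry || entry.expiresAt <= now) {
      this.limits.delete(userId)
      return undefined
    }
    return entry
  }
}
```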

Key Takeaways

Assume adversarial input. Every user input should be treated as potentially malicious.

Defense in depth. Input filtering, output filtering, rate limiting, and monitoring all work together.

Log everything. You can't improve defenses without understanding attacks.

Fail safely. When in doubt, refuse rather than risk harmful output.
