Your AI feature will be abused. Users will try jailbreaks, inject prompts, and attempt to extract system prompts. Building defensive AI applications is essential.
Threat Modeling for AI Applications
Common attack vectors:
- Prompt injection: Manipulating the model to ignore instructions
- Data extraction: Tricking the model into revealing training data or system prompts
- Jailbreaking: Bypassing content filters and safety measures
- Resource exhaustion: Generating expensive or infinite responses
Prompt Injection: Detection and Prevention
// Input sanitization
function sanitizeUserInput(input: string): { safe: boolean; sanitized: string } {
  const injectionPatterns = [
    /ignore (all |previous |above )?instructions/i,
    /you are now/i,
    /pretend (you are|to be)/i,
    /system prompt/i,
    /reveal your/i,
    /\[\[.*\]\]/, // Common injection brackets
    /<\|.*\|>/, // Another common pattern
  ]
  for (const pattern of injectionPatterns) {
    if (pattern.test(input)) {
      return { safe: false, sanitized: input.replace(pattern, '[FILTERED]') }
    }
  }
  return { safe: true, sanitized: input }
}
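In practice, flagged input is usually rejected outright rather than forwarded with `[FILTERED]` substitutions, since a partially filtered prompt can still carry the attack. A minimal sketch (the filter is restated here in compact form, and the `acceptOrReject` helper is hypothetical):

```typescript
// Compact restatement of the injection filter, for illustration.
function sanitizeUserInput(input: string): { safe: boolean; sanitized: string } {
  const injectionPatterns = [
    /ignore (all |previous |above )?instructions/i,
    /system prompt/i,
  ]
  for (const pattern of injectionPatterns) {
    if (pattern.test(input)) {
      return { safe: false, sanitized: input.replace(pattern, '[FILTERED]') }
    }
  }
  return { safe: true, sanitized: input }
}

// Hypothetical caller: reject flagged input instead of forwarding filtered text.
function acceptOrReject(input: string): string {
  const { safe, sanitized } = sanitizeUserInput(input)
  if (!safe) {
    throw new Error('Input rejected by injection filter')
  }
  return sanitized
}
```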
// Structural defense: separate user input from instructions with explicit
// delimiters (the <user_input> tag name is a convention, not a model feature)
function buildPrompt(systemInstructions: string, userInput: string) {
  return {
    messages: [
      { role: 'system', content: systemInstructions },
      { role: 'user', content: `<user_input>\n${userInput}\n</user_input>` },
    ],
  }
}
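Delimiters only help if the user cannot close them from inside their own input. One approach, sketched below with an illustrative tag name, is to strip any user-supplied delimiter tokens before wrapping:

```typescript
// Sketch: wrap user input in explicit delimiters, stripping any delimiter
// tokens the user supplies so they cannot "close" the block themselves.
// The tag names are illustrative, not a model requirement.
const OPEN = '<user_input>'
const CLOSE = '</user_input>'

function wrapUserInput(userInput: string): string {
  // Remove user-supplied copies of the delimiters before wrapping.
  const escaped = userInput.split(OPEN).join('').split(CLOSE).join('')
  return `${OPEN}\n${escaped}\n${CLOSE}`
}
```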
Output Filtering: Beyond Keyword Blocks
interface OutputFilter {
  check(output: string): Promise<{ safe: boolean; issues: string[] }>
}

class ContentFilter implements OutputFilter {
  async check(output: string) {
    const issues: string[] = []
    // PII detection
    if (/\b\d{3}-\d{2}-\d{4}\b/.test(output)) {
      issues.push('Potential SSN detected')
    }
    if (/\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/.test(output)) {
      issues.push('Email address detected')
    }
    // System prompt leakage
    const lowered = output.toLowerCase()
    if (lowered.includes('my instructions') || lowered.includes('i was told to')) {
      issues.push('Potential system prompt leakage')
    }
    // Use a classifier for harmful content (classifyHarmfulness is defined
    // elsewhere on the class, e.g. backed by a moderation model)
    const harmfulness = await this.classifyHarmfulness(output)
    if (harmfulness > 0.7) {
      issues.push('Potentially harmful content')
    }
    return { safe: issues.length === 0, issues }
  }
}
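The harmfulness classifier is the one dependency not shown above. A production system would call a moderation model; the placeholder below stands in for that call with a crude keyword score, purely to make the shape of the interface concrete (the term list and scoring are illustrative assumptions):

```typescript
// Placeholder for the harmfulness classifier assumed by ContentFilter.
// Real systems should call a moderation model here; this keyword heuristic
// only illustrates the async score-in-[0, 1] contract.
async function classifyHarmfulness(output: string): Promise<number> {
  const riskyTerms = ['how to make a weapon', 'disable the safety']
  const lowered = output.toLowerCase()
  const hits = riskyTerms.filter((term) => lowered.includes(term)).length
  return Math.min(1, hits * 0.5)
}
```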
Rate Limiting and Abuse Detection
class AbuseDetector {
  private userPatterns = new Map<string, UserPattern>()

  async checkRequest(
    userId: string,
    request: AIRequest
  ): Promise<{ allowed: boolean; reason?: string }> {
    const pattern = this.userPatterns.get(userId) ?? new UserPattern()
    this.userPatterns.set(userId, pattern)
    // Check for rapid-fire requests (probing)
    if (pattern.requestsLastMinute > 20) {
      return { allowed: false, reason: 'Rate limit exceeded' }
    }
    // Check for repeated similar requests (injection attempts)
    const similarity = this.calculateSimilarity(request, pattern.recentRequests)
    if (similarity > 0.9 && pattern.recentRequests.length > 5) {
      return { allowed: false, reason: 'Suspicious repeated requests' }
    }
    // Check for escalating token usage (skip users with no history yet,
    // whose average of zero would otherwise block every first request)
    if (pattern.recentRequests.length > 0 &&
        request.estimatedTokens > pattern.averageTokens * 10) {
      return { allowed: false, reason: 'Abnormal token request' }
    }
    pattern.addRequest(request)
    return { allowed: true }
  }
}
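The detector leans on a `UserPattern` record that the snippet does not define. One plausible shape, matching the fields the checks above read (the sliding-window sizes are assumptions):

```typescript
// Sketch of the per-user state AbuseDetector relies on. Field names match
// the checks above; window sizes and eviction policy are assumptions.
interface AIRequest {
  prompt: string
  estimatedTokens: number
}

class UserPattern {
  recentRequests: AIRequest[] = []
  private requestTimestamps: number[] = []

  get requestsLastMinute(): number {
    const cutoff = Date.now() - 60_000
    return this.requestTimestamps.filter((t) => t > cutoff).length
  }

  get averageTokens(): number {
    if (this.recentRequests.length === 0) return 0
    const total = this.recentRequests.reduce((sum, r) => sum + r.estimatedTokens, 0)
    return total / this.recentRequests.length
  }

  addRequest(request: AIRequest): void {
    this.requestTimestamps.push(Date.now())
    this.recentRequests.push(request)
    // Keep a bounded window of recent requests
    if (this.recentRequests.length > 20) this.recentRequests.shift()
  }
}
```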
Incident Response
async function handleSafetyIncident(incident: SafetyIncident) {
  // 1. Log for analysis
  await logIncident({
    timestamp: Date.now(),
    type: incident.type,
    userId: incident.userId,
    input: incident.input,
    output: incident.attemptedOutput,
  })
  // 2. Prepare a safe fallback to return
  const fallbackResponse = "I'm not able to help with that request."
  // 3. Flag for review if severe
  if (incident.severity === 'high') {
    await notifySecurityTeam(incident)
  }
  // 4. Consider rate limiting the user
  if (incident.repeated) {
    await applyRateLimit(incident.userId, { duration: '1h', maxRequests: 10 })
  }
  return fallbackResponse
}
Key Takeaways
Assume adversarial input. Every user input should be treated as potentially malicious.
Defense in depth. Input filtering, output filtering, rate limiting, and monitoring all work together.
Log everything. You can't improve defenses without understanding attacks.
Fail safely. When in doubt, refuse rather than risk harmful output.
