Content Moderation

Check whether text content violates safety policies with LLM.moderate. This is crucial for user-facing applications, where unchecked input can lead to abuse.

Basic Usage

The simplest check returns a result with an overall flagged boolean and per-category details.

const result = await LLM.moderate("I want to help everyone!");

if (result.flagged) {
  console.log(`❌ Flagged for: ${result.flaggedCategories.join(", ")}`);
} else {
  console.log("✅ Content appears safe");
}

Understanding Results

The moderation result object provides detailed signals:

  • flagged: (boolean) Overall safety verdict. If true, the content violates provider policies.
  • flaggedCategories: (string[]) Names of the categories that triggered the flag, as used in the examples.
  • categories: (object) Boolean flags for specific buckets (e.g., sexual: false, violence: true).
  • category_scores: (object) Confidence scores (0.0 - 1.0) for each category.
const result = await LLM.moderate("Some controversial text");

// Check specific categories
if (result.categories.hate) {
  console.log("Hate speech detected");
}

// Check confidence levels
console.log(`Violence Score: ${result.category_scores.violence}`);

Common Categories

  • Sexual: Sexual content.
  • Hate: Content promoting hate based on identity.
  • Harassment: Threatening or bullying content.
  • Self-Harm: Promoting self-harm or suicide.
  • Violence: Promoting or depicting violence.
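
As an illustration, flagged category names can be mapped to user-facing messages. The sketch below is only an assumption about your own policy copy; the exact category keys (for example self-harm versus self_harm) vary by provider, and userInput stands in for the text being checked.

// Hypothetical mapping from category keys to user-facing messages.
const CATEGORY_MESSAGES: Record<string, string> = {
  sexual: "Sexual content is not allowed.",
  hate: "Hateful content is not allowed.",
  harassment: "Harassing or threatening content is not allowed.",
  "self-harm": "If you are struggling, please consider reaching out for support.",
  violence: "Violent content is not allowed.",
};

const result = await LLM.moderate(userInput);

if (result.flagged) {
  // Show a message for each category the provider flagged.
  for (const category of result.flaggedCategories) {
    console.log(CATEGORY_MESSAGES[category] ?? "This content violates our policies.");
  }
}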

Integration Patterns

Pre-Chat Moderation

We recommend validating user input before sending it to a Chat model: it avoids paying for completions you will discard and reduces exposure to jailbreak attempts. The example below assumes a chat instance has already been created.

async function safeChat(input: string) {
  const mod = await LLM.moderate(input);

  if (mod.flagged) {
    throw new Error(`Content Unsafe: ${mod.flaggedCategories.join(', ')}`);
  }

  // Only proceed if safe
  return await chat.ask(input);
}
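
At the call site, you would typically catch the error and show a friendly refusal instead of surfacing the raw message; a minimal sketch, assuming a userMessage string from your UI:

try {
  const reply = await safeChat(userMessage);
  console.log(reply);
} catch (err) {
  // Log the rejection for auditing, but show the user a friendly refusal.
  console.warn("Input rejected by moderation:", err);
  console.log("Sorry, I can't help with that request.");
}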

Custom Risk Thresholds

Providers apply their own thresholds when deciding what counts as “flagged”. You can implement stricter (or looser) logic using the raw category_scores.

const result = await LLM.moderate(userInput);

// Custom strict policy: Flag anything with > 0.1 confidence
const isRisky = Object.entries(result.category_scores)
  .some(([category, score]) => score > 0.1);

if (isRisky) {
  console.warn("Potential risk detected (custom strict mode)");
}
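
If you need different sensitivity per category (say, stricter on self-harm than on violence), you can compare each score against its own threshold. The values and key names below are illustrative assumptions, not recommendations:

// Hypothetical per-category thresholds; tune these to your own policy.
const THRESHOLDS: Record<string, number> = {
  "self-harm": 0.05,
  harassment: 0.2,
  violence: 0.3,
};

// Fall back to 0.5 for any category without an explicit threshold.
const violations = Object.entries(result.category_scores)
  .filter(([category, score]) => score > (THRESHOLDS[category] ?? 0.5));

if (violations.length > 0) {
  console.warn(`Blocked by custom policy: ${violations.map(([category]) => category).join(", ")}`);
}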