Multi-modal Capabilities
Modern LLMs can understand more than just text. node-llm provides a unified way to pass images, audio, video, and documents to models that support them.
Smart File Handling
You can pass local paths or URLs directly to the ask method using the files option. node-llm automatically detects the file type and formats it correctly for the specific provider.
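The snippets in this section assume node-llm's chat interface is available as LLM. The exact export name and import path are an assumption here, not the library's confirmed API; check the package's README for the real entry point.
// Assumed setup for the snippets in this section; the actual export name/path may differ.
import { LLM } from "node-llm";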
Supported File Types:
- Images: .jpg, .jpeg, .png, .gif, .webp
- Videos: .mp4, .mpeg, .mov, .avi, .webm
- Audio: .wav, .mp3, .ogg, .flac
- Documents: .pdf, .csv, .json, .xml, .md, .txt
- Code: .js, .ts, .py, .rb, .go, etc.
Working with Images (Vision)
Vision-capable models (like gpt-4o, claude-3-5-sonnet, gemini-1.5-pro) can analyze images.
const chat = LLM.chat("gpt-4o");
// Analyze a local image
await chat.ask("What's in this image?", {
files: ["./screenshot.png"]
});
// Analyze an image from a URL
await chat.ask("Describe this logo", {
files: ["https://example.com/logo.png"]
});
// Compare multiple images
await chat.ask("Compare the design of these two apps", {
files: ["./v1-screenshot.png", "./v2-screenshot.png"]
});
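Under the hood, vision requests to OpenAI-style chat APIs typically embed a local image as a base64 data URL inside an image_url content part. The sketch below is a simplified illustration of what that formatting amounts to, not node-llm's exact internals.
// Illustration only: how a local image is typically embedded for an OpenAI-style vision call.
import { readFileSync } from "node:fs";

const base64 = readFileSync("./screenshot.png").toString("base64");
const message = {
  role: "user",
  content: [
    { type: "text", text: "What's in this image?" },
    { type: "image_url", image_url: { url: `data:image/png;base64,${base64}` } }
  ]
};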
Working with Audio
Audio-capable models (like gemini-1.5-flash) can listen to audio files and answer questions about them.
const chat = LLM.chat("gemini-1.5-flash");
// Summarize a meeting recording
await chat.ask("Summarize the key decisions in this meeting", {
files: ["./meeting.mp3"]
});
// Transcribe and analyze
await chat.ask("What was the tone of the speaker?", {
files: ["./voicemail.wav"]
});
Note: For pure transcription without chat, see Audio Transcription.
Working with Videos
Video analysis is currently supported primarily by Google Gemini and a limited set of OpenAI models. node-llm handles the upload and file-reference process for you.
const chat = LLM.chat("gemini-1.5-pro");
await chat.ask("What happens in this video?", {
files: ["./demo_video.mp4"]
});
Working with Documents (PDFs & Text)
You can provide full documents for analysis.
Text & Code Files
For text-based files, node-llm reads the content and passes it as text context to the model.
const chat = LLM.chat("claude-3-5-sonnet");
// Analyze code
await chat.ask("Explain potential bugs in this code", {
files: ["./app/auth.ts"]
});
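Conceptually, this is the same as reading the file yourself and inlining it into the prompt; node-llm just automates it. A rough manual equivalent (not the library's exact message shape):
// Rough manual equivalent of passing a text/code file via the files option.
import { readFileSync } from "node:fs";

const source = readFileSync("./app/auth.ts", "utf8");
await chat.ask(`Explain potential bugs in this code:\n\n${source}`);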
PDFs
For PDFs, providers handle them differently:
- Anthropic: Supports native PDF blocks (up to 10MB). node-llm handles the base64 encoding.
- Gemini: Supports PDF via the File API.
- OpenAI: Often requires text extraction first (unless using the Assistants API, which node-llm core interacts with as Chat).
Note: Ensure your provider supports the PDF modality directly, or use a text extractor.
await chat.ask("Summarize this contract", {
files: ["./contract.pdf"]
});
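For reference, Anthropic's native PDF support works through a base64 document content block in the message. The sketch below shows roughly what the Anthropic path builds; it is a simplified illustration, and node-llm's actual internals may differ.
// Illustration of an Anthropic-style document block built from a local PDF.
import { readFileSync } from "node:fs";

const pdfBase64 = readFileSync("./contract.pdf").toString("base64");
const content = [
  {
    type: "document",
    source: { type: "base64", media_type: "application/pdf", data: pdfBase64 }
  },
  { type: "text", text: "Summarize this contract" }
];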
Automatic Type Detection
You don’t need to specify the file type; node-llm infers it from the extension.
// Mix and match types
await chat.ask("Analyze these project resources", {
files: [
"diagram.png", // Image
"spec.pdf", // Document
"meeting.mp3", // Audio
"backend.ts" // Code
]
});
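Detection is extension-based. The mapping below is an illustrative sketch of the idea only; node-llm's actual table, categories, and fallbacks may differ.
// Illustrative extension-based detection; not node-llm's actual implementation.
import { extname } from "node:path";

const MIME_BY_EXT: Record<string, string> = {
  ".png": "image/png",
  ".jpg": "image/jpeg",
  ".mp3": "audio/mpeg",
  ".mp4": "video/mp4",
  ".pdf": "application/pdf",
  ".ts": "text/plain" // code files are typically passed through as plain text
};

function detectMime(file: string): string {
  return MIME_BY_EXT[extname(file).toLowerCase()] ?? "text/plain";
}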