Audio Transcription
Convert audio files to text using models like OpenAI’s Whisper or Google’s Gemini. NodeLLM supports both raw transcription and multimodal chat analysis.
Basic Transcription
Use NodeLLM.transcribe() for direct speech-to-text conversion.
const text = await NodeLLM.transcribe("meeting.mp3", {
  model: "whisper-1"
});
console.log(text.toString()); // the result stringifies to the full transcript
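Since the result stringifies to the transcript, you can persist it directly. A minimal sketch using Node's built-in fs/promises (the output filename is just an example):

import { writeFile } from "node:fs/promises";

const transcript = await NodeLLM.transcribe("meeting.mp3", { model: "whisper-1" });
await writeFile("meeting.txt", transcript.toString()); // save the plain-text transcript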
Advanced Options
Speed vs Accuracy
Choose models and parameters to match your needs: a language hint improves accuracy, and a prompt can steer the model toward domain-specific spelling.
await NodeLLM.transcribe("audio.mp3", {
model: "whisper-1",
language: "en", // ISO-639-1 code hint to improve accuracy
prompt: "ZyntriQix, API" // Guide the model with domain-specific terms
});
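Providers often expose additional tuning knobs. Assuming NodeLLM forwards provider-specific keys through params (the same passthrough used for verbose_json below, which is an assumption here), you could set OpenAI's temperature, where lower values make decoding more deterministic:

await NodeLLM.transcribe("audio.mp3", {
  model: "whisper-1",
  params: { temperature: 0 } // OpenAI-specific; 0 favors deterministic output
});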
Accessing Segments & Timestamps
The transcribe method returns a Transcription object that contains more than just text. You can access detailed timing information if supported by the provider (e.g., using response_format: 'verbose_json' with OpenAI).
const response = await NodeLLM.transcribe("interview.mp3", {
  model: "whisper-1",
  params: { response_format: "verbose_json" }
});
console.log(`Duration: ${response.duration}s`);
for (const segment of response.segments) {
console.log(`[${segment.start}s - ${segment.end}s]: ${segment.text}`);
}
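Timing data makes it straightforward to generate subtitles. A minimal sketch that converts the segments shown above into SRT format (it assumes verbose_json segments with start, end, and text, as returned by OpenAI):

// Convert fractional seconds to an SRT timestamp: HH:MM:SS,mmm
function toSrtTimestamp(seconds) {
  const ms = Math.round(seconds * 1000);
  const h = String(Math.floor(ms / 3600000)).padStart(2, "0");
  const m = String(Math.floor((ms % 3600000) / 60000)).padStart(2, "0");
  const s = String(Math.floor((ms % 60000) / 1000)).padStart(2, "0");
  return `${h}:${m}:${s},${String(ms % 1000).padStart(3, "0")}`;
}

const srt = response.segments
  .map((seg, i) =>
    `${i + 1}\n${toSrtTimestamp(seg.start)} --> ${toSrtTimestamp(seg.end)}\n${seg.text.trim()}\n`)
  .join("\n");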
Multimodal Chat vs. Transcription
There are two ways to work with audio:
- Transcription (NodeLLM.transcribe): Best when you need the verbatim text. Result: “Hello everyone today we are…”
- Multimodal Chat (chat.ask): Best when you need to analyze or summarize the audio directly, without seeing the raw text first. Supported by models like gemini-1.5-pro and gpt-4o.
// Multimodal Chat Example
const chat = NodeLLM.chat("gemini-1.5-pro");
await chat.ask("What is the main topic of this podcast?", {
files: ["podcast.mp3"]
});
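The two approaches also compose: when you want both the verbatim text and an analysis, you can transcribe first and feed the result into a chat. A sketch built from the calls shown above (it assumes chat.ask resolves to the model's reply, which this page does not confirm):

const transcript = await NodeLLM.transcribe("podcast.mp3", { model: "whisper-1" });
const summarizer = NodeLLM.chat("gpt-4o");
const summary = await summarizer.ask(
  `Summarize this transcript in three bullet points:\n\n${transcript.toString()}`
);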
Error Handling
Audio uploads can be large, and long recordings are prone to network timeouts, so wrap transcription calls in a try/catch.
try {
await NodeLLM.transcribe("large-file.mp3");
} catch (error) {
console.error("Transcription failed:", error.message);
}
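For long files, a simple retry with exponential backoff can smooth over transient failures. A minimal sketch; the retry policy is illustrative and not part of NodeLLM:

// Retry transcription up to `attempts` times, doubling the delay each time
async function transcribeWithRetry(file, options, attempts = 3) {
  for (let i = 0; i < attempts; i++) {
    try {
      return await NodeLLM.transcribe(file, options);
    } catch (error) {
      if (i === attempts - 1) throw error; // out of retries; surface the error
      const delay = 1000 * 2 ** i; // 1s, 2s, 4s…
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

const text = await transcribeWithRetry("large-file.mp3", { model: "whisper-1" });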