Building Speaker Diarization Without a Cloud API
Speaker diarization is figuring out "who spoke when" in a recording. It is the difference between a wall of text and a readable transcript where each speaker's words are clearly attributed. Most solutions require a cloud API -- AssemblyAI, Rev.ai, Google Speech-to-Text. You upload your audio, they process it on their servers, you get back timestamped speaker labels.
Falavra does it entirely locally. No upload. No cloud. No API key. Here is exactly how.
Why Diarization Matters
Transcription without diarization works fine for single-speaker audio. Podcast intros. Voice memos. Dictation. But the moment you have two or more people talking, a raw transcript becomes difficult to follow.
Consider a meeting recording:
Without diarization:
So I think we should ship by Friday. That seems aggressive, can we do Monday instead? Monday works, but we need the design assets by Wednesday then. I can have those ready by Tuesday.
With diarization:
Speaker 1: So I think we should ship by Friday. Speaker 2: That seems aggressive, can we do Monday instead? Speaker 1: Monday works, but we need the design assets by Wednesday then. Speaker 3: I can have those ready by Tuesday.
The second version is actually useful. You know who committed to what. You can search for what a specific person said. The transcript becomes a record, not just a wall of text.
For Falavra's target users -- people transcribing meetings, interviews, and podcasts -- diarization is the feature that turns "nice to have" into "I need this."
The Technical Approach
sherpa-onnx provides an OfflineSpeakerDiarization class that handles the entire pipeline. It is not a separate library or a Python dependency. It is part of the same sherpa-onnx package Falavra already uses for transcription.
The diarization system has three components:
1. Pyannote Segmentation Model (~90MB ONNX)
The segmentation model handles voice activity detection and speaker boundary detection. It takes audio as input and outputs timestamped segments indicating where speech occurs and where speaker changes happen.
This is a pyannote-based model converted to ONNX format. pyannote is the gold standard for speaker diarization in the research community, and the ONNX export means it runs through the same runtime Falavra already uses for transcription. No Python. No PyTorch. Just ONNX Runtime inference.
2. 3D Speaker Embedding Model (~40MB ONNX)
The embedding model creates a vector fingerprint for each detected speech segment. Think of it as generating a unique "voice ID" for each chunk of audio. Segments that sound like the same person will have similar vectors. Segments from different speakers will have distant vectors.
The model produces embeddings in a high-dimensional space. (Despite the name, "3D Speaker" refers to the open-source 3D-Speaker project the model comes from, not a three-dimensional embedding; the actual vectors have hundreds of dimensions.) These embeddings are what allow the system to cluster segments by speaker identity without ever needing to know who the speakers are in advance.
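The "similar vectors, same speaker" idea boils down to cosine similarity over the embedding vectors. A minimal illustration of the underlying math (not Falavra's code):

```typescript
// Cosine similarity between two embedding vectors: 1.0 means identical
// direction (very likely the same speaker), values near 0 mean unrelated
// voices. Clustering operates on exactly this kind of score.
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```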
3. Agglomerative Clustering
Once every speech segment has an embedding vector, agglomerative clustering groups them. The algorithm starts by treating each segment as its own cluster, then iteratively merges the two most similar clusters until a stopping criterion is met.
The key parameter: numClusters. Setting it to -1 tells the system to auto-detect the number of speakers. This is critical for a general-purpose tool like Falavra where users do not know (or should not need to know) how many people are in their recording.
const sd = sherpaOnnx.createOfflineSpeakerDiarization({
  segmentation: {
    pyannote: {
      model: getModelPath('sherpa-onnx-pyannote-segmentation-3-0.onnx'),
    },
  },
  embedding: {
    model: getModelPath('3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx'),
  },
  clustering: {
    numClusters: -1, // Auto-detect number of speakers
    threshold: 0.5,  // Similarity threshold for merging
  },
  minDurationOn: 0.3,  // Minimum speech duration (seconds)
  minDurationOff: 0.5, // Minimum silence between segments
});
About 130MB of additional models total. All running locally.
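sherpa-onnx runs the clustering internally, but the algorithm itself is short enough to sketch. A minimal average-linkage version over cosine distance, with the same threshold-based stopping criterion described above (hypothetical helper names, not the library's implementation):

```typescript
type Embedding = number[];

// Cosine distance: 0 for identical direction, up to 2 for opposite.
function cosineDistance(a: Embedding, b: Embedding): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Agglomerative clustering: start with one cluster per segment, repeatedly
// merge the closest pair, stop when the closest pair is farther apart than
// `threshold`. Returns a cluster (speaker) id per segment.
function agglomerate(embeddings: Embedding[], threshold: number): number[] {
  const clusters: number[][] = embeddings.map((_, i) => [i]);
  const linkage = (a: number[], b: number[]): number => {
    let sum = 0;
    for (const i of a) for (const j of b) sum += cosineDistance(embeddings[i], embeddings[j]);
    return sum / (a.length * b.length); // average linkage
  };
  while (clusters.length > 1) {
    let best = Infinity, bi = 0, bj = 1;
    for (let i = 0; i < clusters.length; i++) {
      for (let j = i + 1; j < clusters.length; j++) {
        const d = linkage(clusters[i], clusters[j]);
        if (d < best) { best = d; bi = i; bj = j; }
      }
    }
    if (best > threshold) break; // stopping criterion: no similar pair left
    clusters[bi] = clusters[bi].concat(clusters[bj]);
    clusters.splice(bj, 1);
  }
  const labels = new Array<number>(embeddings.length).fill(0);
  clusters.forEach((members, id) => members.forEach(m => (labels[m] = id)));
  return labels;
}
```

With numClusters set to -1, the number of speakers simply falls out of how many clusters survive the threshold.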
Pipeline Integration
Diarization is not a standalone feature. It is a stage in Falavra's processing pipeline. Here is how the full pipeline looks with diarization enabled:
Download (0-25%) -> Convert (25-40%) -> Diarize (40-55%) -> Transcribe (55-100%)
And without diarization:
Download (0-25%) -> Convert (25-40%) -> Transcribe (40-100%)
The diarization stage sits between audio conversion and transcription. This ordering matters. The audio needs to be in the correct format (16kHz mono WAV) before diarization can process it. And transcription needs the diarization results to assign speaker labels to each text segment.
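For completeness, here is one way stage-local progress could map onto the global percentages in the diagrams above (illustrative sketch with assumed stage names, not Falavra's actual pipeline code):

```typescript
type Stage = 'download' | 'convert' | 'diarize' | 'transcribe';

// Map a stage's local 0-100 progress into its slice of the global bar.
// Ranges mirror the pipeline diagrams: with diarization, transcription
// occupies 55-100%; without it, 40-100% (diarize collapses to nothing).
function globalPercent(stage: Stage, local: number, diarization: boolean): number {
  const ranges: Record<Stage, [number, number]> = diarization
    ? { download: [0, 25], convert: [25, 40], diarize: [40, 55], transcribe: [55, 100] }
    : { download: [0, 25], convert: [25, 40], diarize: [40, 40], transcribe: [40, 100] };
  const [start, end] = ranges[stage];
  return start + (local / 100) * (end - start);
}
```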
The Merging Step
Diarization produces timestamped speaker segments:
// Diarization output
[
  { start: 0.0, end: 3.2, speaker: 0 },
  { start: 3.5, end: 8.1, speaker: 1 },
  { start: 8.4, end: 12.7, speaker: 0 },
  { start: 13.0, end: 18.3, speaker: 2 },
]
Transcription produces timestamped text segments:
// Transcription output
[
  { start: 0.1, end: 3.1, text: "So I think we should ship by Friday." },
  { start: 3.6, end: 7.9, text: "That seems aggressive, can we do Monday?" },
  { start: 8.5, end: 12.5, text: "Monday works, but we need design assets..." },
  { start: 13.1, end: 18.0, text: "I can have those ready by Tuesday." },
]
These need to be merged. The approach is midpoint matching: for each transcript segment, calculate its temporal midpoint and find which diarization segment contains that midpoint.
function assignSpeakers(
  transcriptSegments: TranscriptSegment[],
  diarizationSegments: DiarizationSegment[]
): TranscriptSegment[] {
  return transcriptSegments.map(segment => {
    const midpoint = (segment.start + segment.end) / 2;
    const matchedSpeaker = diarizationSegments.find(
      ds => ds.start <= midpoint && midpoint <= ds.end
    );
    return {
      ...segment,
      speaker: matchedSpeaker
        ? `Speaker ${matchedSpeaker.speaker + 1}`
        : undefined,
    };
  });
}
Midpoint matching is simple and works well when diarization and transcription segment boundaries roughly align. It handles slight timing mismatches gracefully -- a transcript segment that starts 0.2 seconds before the diarization boundary will still match correctly as long as its midpoint falls in the right diarization segment.
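When boundaries misalign badly -- say a transcript segment straddling a speaker change -- a common alternative is overlap-based matching: pick the diarization segment with the largest temporal overlap rather than the one containing the midpoint. A sketch of that variant (not what Falavra ships):

```typescript
interface Seg { start: number; end: number; }
interface DiarSeg extends Seg { speaker: number; }

// For a transcript segment, find the speaker whose diarization segment
// overlaps it the most. Returns undefined if nothing overlaps (e.g. the
// segment falls entirely in a diarization gap).
function bestOverlapSpeaker(seg: Seg, diar: DiarSeg[]): number | undefined {
  let best: number | undefined;
  let bestOverlap = 0;
  for (const d of diar) {
    const overlap = Math.min(seg.end, d.end) - Math.max(seg.start, d.start);
    if (overlap > bestOverlap) { bestOverlap = overlap; best = d.speaker; }
  }
  return best;
}
```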
The Electron Gotcha
This is the kind of thing that does not show up in documentation and costs you a full day of debugging.
sherpa-onnx's readWave() function returns audio data as an external ArrayBuffer -- a backing store allocated outside of V8's heap by the native addon. Electron enables V8's memory cage (the pointer compression sandbox), a security feature that requires every ArrayBuffer backing store to live inside V8's caged heap. An externally allocated backing store violates that invariant.
The symptom: a hard crash with no useful error message. The audio file reads fine in a Node.js script. It crashes in Electron.
The fix: bypass sherpa-onnx's WAV reader entirely and write a pure JavaScript WAV parser.
function readWavFile(filePath: string): {
  samples: Float32Array;
  sampleRate: number;
} {
  const buffer = fs.readFileSync(filePath);
  const view = new DataView(buffer.buffer, buffer.byteOffset, buffer.byteLength);

  // Parse RIFF header
  const chunkId = String.fromCharCode(...buffer.slice(0, 4));
  if (chunkId !== 'RIFF') {
    throw new Error('Not a valid WAV file');
  }

  // Find 'fmt ' chunk
  let offset = 12;
  let sampleRate = 0;
  let bitsPerSample = 0;
  let numChannels = 0; // parsed for completeness; the convert stage guarantees mono
  while (offset < buffer.length) {
    const subchunkId = String.fromCharCode(...buffer.slice(offset, offset + 4));
    const subchunkSize = view.getUint32(offset + 4, true);
    if (subchunkId === 'fmt ') {
      numChannels = view.getUint16(offset + 10, true);
      sampleRate = view.getUint32(offset + 12, true);
      bitsPerSample = view.getUint16(offset + 22, true);
    }
    if (subchunkId === 'data') {
      const dataStart = offset + 8;
      const dataEnd = dataStart + subchunkSize;
      const rawData = buffer.slice(dataStart, dataEnd);
      // Convert to Float32Array in JS-managed memory
      const samples = new Float32Array(rawData.length / (bitsPerSample / 8));
      for (let i = 0; i < samples.length; i++) {
        if (bitsPerSample === 16) {
          samples[i] = view.getInt16(dataStart + i * 2, true) / 32768;
        } else if (bitsPerSample === 32) {
          samples[i] = view.getFloat32(dataStart + i * 4, true);
        }
      }
      return { samples, sampleRate };
    }
    // RIFF chunks are padded to an even byte count
    offset += 8 + subchunkSize + (subchunkSize % 2);
  }
  throw new Error('WAV file missing data chunk');
}
The key line is new Float32Array(...) -- this allocates memory inside V8's heap, which Electron's memory cage allows. The native addon's external buffer is never exposed to JavaScript.
This is about 40 lines of code replacing a single function call. The trade-off is worth it: the app does not crash.
The Synchronous Blocking Problem
Here is something I am honest about: sd.process() is a synchronous, CPU-intensive call. For a 10-minute audio file, it blocks Node.js for 30-60 seconds. During that time, the event loop is frozen. No IPC messages. No UI updates.
For Falavra v1, this is acceptable for two reasons:
- The pipeline already runs one job at a time. There is no concurrent processing that would be starved by the blocked event loop.
- Users expect to wait. The pipeline has a progress bar. During diarization, the UI shows "Identifying speakers..." as an indeterminate state. Users understand this is a processing step.
The proper fix is moving diarization to a worker thread or a child process. That is on the roadmap. But shipping with a known limitation and documenting it is better than delaying the feature to engineer a perfect solution nobody is asking for yet.
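For reference, the worker-thread version is standard Node. A minimal sketch using node:worker_threads with an inlined worker body and a stand-in computation (the real worker would load sherpa-onnx and call sd.process on the samples instead):

```typescript
import { Worker } from 'node:worker_threads';

// The worker body runs on its own thread, so a CPU-bound call inside it
// does not freeze the main event loop. `eval: true` lets us inline the
// worker source as a string; a real app would point at a worker file.
const workerSource = `
  const { parentPort, workerData } = require('node:worker_threads');
  // Stand-in for the blocking sd.process(workerData.samples) call.
  const result = workerData.samples.reduce((a, b) => a + b, 0);
  parentPort.postMessage(result);
`;

function diarizeInWorker(samples: number[]): Promise<number> {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, { eval: true, workerData: { samples } });
    worker.on('message', resolve);
    worker.on('error', reject);
  });
}
```

The same message channel would also carry progress updates back to the UI, if the underlying library ever exposes them.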
No Progress Callback
sherpa-onnx's OfflineSpeakerDiarization does not emit progress events. The sd.process() call takes audio in and returns speaker segments out. There is no "40% complete" callback.
This means the UI cannot show a real progress bar during diarization. Instead, it shows an indeterminate loading state:
// During diarization stage
updateProgress({
  stage: 'diarize',
  percent: null, // Indeterminate
  message: 'Identifying speakers...',
});

const segments = sd.process(audioSamples);

updateProgress({
  stage: 'diarize',
  percent: 100,
  message: 'Speaker identification complete',
});
This is a minor UX compromise. The diarization step typically takes 15-45 seconds depending on audio length and hardware. Users see the "Identifying speakers..." message and the progress bar animating. It is not ideal, but it is honest -- I am not faking a progress percentage.
Expected Accuracy
Speaker diarization accuracy is measured by Diarization Error Rate (DER). Lower is better.
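Concretely, DER sums three error components over the total reference speech time. A simplified expression of the formula (real scorers also compute an optimal mapping between hypothesis and reference speakers, which this ignores):

```typescript
// DER = (missed speech + false alarm + speaker confusion) / total
// reference speech. All durations in seconds. Simplified: assumes
// hypothesis speaker labels are already mapped to reference labels.
function diarizationErrorRate(
  missed: number,      // reference speech with no hypothesized speaker
  falseAlarm: number,  // hypothesized speech where the reference is silent
  confusion: number,   // speech attributed to the wrong speaker
  totalSpeech: number  // total reference speech duration
): number {
  return (missed + falseAlarm + confusion) / totalSpeech;
}
```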
For Falavra's implementation:
- Clean 2-speaker podcasts: ~14-16% DER. Good enough that the transcript is clearly useful. Speaker changes are correctly identified the vast majority of the time.
- 3-4 speaker meetings with minimal overlap: ~18-22% DER. Still useful, occasional misattribution at speaker change boundaries.
- Noisy recordings with overlapping speech: DER degrades significantly. The pyannote segmentation model handles overlap poorly when speakers talk simultaneously for extended periods.
For comparison, cloud APIs like AssemblyAI report DER around 8-12% on clean audio. Falavra's local approach is not as accurate, but it is good enough for the use case of "I need to know roughly who said what" without uploading sensitive recordings to a third party.
The accuracy improves with audio quality. A good USB microphone recording of a two-person interview will produce nearly perfect diarization. A phone recording of a noisy conference room will produce mediocre results. I set expectations accordingly in the UI.
Speaker Labels in the UI
A small UX decision that matters: speaker labels are only shown when the speaker changes.
Speaker 1: So I think we should ship by Friday.
Speaker 2: That seems aggressive. Can we do Monday instead?
I talked to the team and they need more time.
Speaker 1: Monday works, but we need design assets by Wednesday.
Notice the second line from Speaker 2 does not repeat the label. If every line said "Speaker 2:", the transcript would be cluttered and harder to scan. The label only appears at speaker transitions.
This is a simple conditional in the rendering logic:
function shouldShowSpeakerLabel(
  currentSegment: TranscriptSegment,
  previousSegment: TranscriptSegment | null
): boolean {
  if (!currentSegment.speaker) return false;
  if (!previousSegment) return true;
  return currentSegment.speaker !== previousSegment.speaker;
}
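Applied across a whole transcript, the rule collapses consecutive same-speaker lines. An illustrative rendering helper (assumed shape, not Falavra's actual renderer):

```typescript
interface TranscriptSegment { text: string; speaker?: string; }

// Render transcript lines, printing the speaker label only when it
// differs from the previous segment's speaker.
function renderTranscript(segments: TranscriptSegment[]): string[] {
  let previous: string | undefined;
  return segments.map(seg => {
    const showLabel = seg.speaker !== undefined && seg.speaker !== previous;
    previous = seg.speaker;
    return showLabel ? `${seg.speaker}: ${seg.text}` : seg.text;
  });
}
```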
Export Integration
Speaker labels need to appear correctly in every export format Falavra supports.
SRT (SubRip):
1
00:00:00,100 --> 00:00:03,100
(Speaker 1) So I think we should ship by Friday.
2
00:00:03,600 --> 00:00:07,900
(Speaker 2) That seems aggressive, can we do Monday instead?
VTT (WebVTT):
WEBVTT
00:00.100 --> 00:03.100
<v Speaker 1>So I think we should ship by Friday.
00:03.600 --> 00:07.900
<v Speaker 2>That seems aggressive, can we do Monday instead?
Markdown:
**Speaker 1:** So I think we should ship by Friday.
**Speaker 2:** That seems aggressive, can we do Monday instead?
Each format has its own convention for speaker attribution. SRT uses parenthetical prefixes. VTT has the <v> tag specifically designed for voice annotations. Markdown uses bold labels.
function formatSpeakerForExport(
  speaker: string | undefined,
  format: 'srt' | 'vtt' | 'md'
): string {
  if (!speaker) return '';
  switch (format) {
    case 'srt':
      return `(${speaker}) `;
    case 'vtt':
      return `<v ${speaker}>`;
    case 'md':
      return `**${speaker}:** `;
  }
}
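The export path also needs timestamps in each format's notation. A sketch of SRT timestamp formatting, which uses a comma before the milliseconds where WebVTT uses a period (hypothetical helper name):

```typescript
// Format seconds as an SRT timestamp: HH:MM:SS,mmm.
// (WebVTT is the same layout with a period: HH:MM:SS.mmm.)
function toSrtTimestamp(seconds: number): string {
  const ms = Math.round(seconds * 1000);
  const pad = (n: number, width: number) => String(n).padStart(width, '0');
  const h = Math.floor(ms / 3_600_000);
  const m = Math.floor(ms / 60_000) % 60;
  const s = Math.floor(ms / 1000) % 60;
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms % 1000, 3)}`;
}
```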
Pro Feature Gating
Speaker diarization is a Pro feature. The free tier of Falavra handles basic transcription -- single speaker, no diarization. Pro unlocks diarization along with other features like larger model sizes and additional export formats.
The gating is straightforward:
async function processAudio(file: AudioFile, license: License) {
  const stages: PipelineStage[] = ['download', 'convert'];
  if (license.isPro && file.options.diarization) {
    stages.push('diarize');
  }
  stages.push('transcribe');
  return runPipeline(file, stages);
}
The diarization models (~130MB) are only downloaded when a Pro user first enables the feature. Free users never see the download and never have the models on their machine. This keeps the base install size small.
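The lazy-download check can be as simple as testing for the model files on disk. An illustrative sketch (hypothetical helper name and directory layout; the filenames match the config shown earlier):

```typescript
import { existsSync } from 'node:fs';
import path from 'node:path';

// The ~130MB of diarization models are fetched only the first time a
// Pro user enables the feature; free installs never have them on disk.
const DIARIZATION_MODELS = [
  'sherpa-onnx-pyannote-segmentation-3-0.onnx',
  '3dspeaker_speech_eres2net_base_sv_zh-cn_3dspeaker_16k.onnx',
];

// Returns the filenames that still need to be downloaded.
function missingModels(modelDir: string): string[] {
  return DIARIZATION_MODELS.filter(f => !existsSync(path.join(modelDir, f)));
}
```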
What I Would Do Differently
If I were starting this implementation over:
Worker threads from day one. The synchronous blocking problem is the biggest technical debt. Moving diarization to a worker thread would keep the UI responsive and allow progress updates through message passing. I skipped this for v1 velocity and will pay for it later.
Custom clustering threshold tuning. The default clustering threshold of 0.5 works reasonably well, but different recording conditions benefit from different thresholds. A future version could let users adjust sensitivity or auto-tune based on audio characteristics.
Pre-download models during onboarding. Currently, the diarization models download on first use, which means the first diarization attempt is slower than expected. Downloading during the initial Pro upgrade flow would set better expectations.
The Bottom Line
130MB of models. Zero cloud dependencies. No Python runtime. No new npm packages. Speaker diarization that runs entirely on the user's machine and produces results good enough for meetings, interviews, and podcasts.
It is not as accurate as the best cloud APIs. It blocks the event loop for 30-60 seconds. The progress reporting is indeterminate. These are real limitations.
But for users who cannot or will not send their audio to a third party, it is the only option that exists in a desktop app. And for an indie developer, the marginal cost of adding diarization to every transcription is exactly zero.
That is the kind of math I like.
Follow the Journey
Falavra is in active development. I write about the technical decisions, the product trade-offs, and the lessons learned. If you are building local AI tools or interested in the process:
- GitHub: github.com/helrabelo
- Twitter/X: twitter.com/helrabelo
- Helsky Labs: helsky-labs.com
Building products at Helsky Labs. Ship fast, learn from metrics, double down on winners.