Falavra: From YouTube URL to Transcript in One Click

9 min read
falavra · launch · transcription · youtube · local-ai · electron

I watch a lot of YouTube. Podcasts, conference talks, tutorials, interviews. The problem is that video is a terrible format for reference. You can't search a video. You can't skim it. You can't copy-paste the one paragraph you need.

Existing transcription services either cost $10-30/month, require uploading audio to their servers, or produce garbage output from auto-generated captions. I wanted something that takes a YouTube URL, runs Whisper locally, and gives me a searchable, exportable transcript. On my machine. No cloud.

That's Falavra.

What Falavra Does

Paste a YouTube URL. Click transcribe. Get a full transcript with timestamps, speaker identification, and export options. Everything runs on your hardware. Nothing leaves your machine.

The pipeline looks like this:

YouTube URL
    |
    v
yt-dlp (extract audio stream)
    |
    v
ffmpeg (convert to 16kHz mono WAV)
    |
    v
sherpa-onnx Whisper (transcribe locally)
    |
    v
SQLite (store transcript + metadata)
    |
    v
React UI (display, search, export)

Each step is a discrete process. If one fails, you get a clear error pointing to exactly what went wrong. No black boxes.
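The stage isolation described above can be sketched in TypeScript. This is illustrative, not Falavra's actual internals: each stage is a named async step, and a failure re-throws with the stage name attached so the error points at the exact step that broke.

```typescript
// Each pipeline stage is a named async step. A failure surfaces
// the stage name, e.g. "yt-dlp failed: HTTP 403".
// Sketch only; stage/type names are assumptions.
type Stage<I, O> = { name: string; run: (input: I) => Promise<O> };

async function runStage<I, O>(stage: Stage<I, O>, input: I): Promise<O> {
  try {
    return await stage.run(input);
  } catch (err) {
    // Re-throw with the stage name attached so the UI can say
    // exactly which step of the pipeline failed.
    throw new Error(`${stage.name} failed: ${(err as Error).message}`);
  }
}
```

Chaining `runStage` calls for extraction, preprocessing, transcription, and storage gives the "no black boxes" property almost for free: every error already knows which stage produced it.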

The Tech Stack

Falavra is an Electron app built with React and TypeScript. I know Electron gets criticism for memory usage, and in the case of DropVox -- a lightweight menu bar utility -- I chose native Swift specifically to avoid that overhead. But Falavra is a different kind of app.

Falavra is a full desktop application with a rich UI: transcript panels, waveform displays, search interfaces, settings screens, export dialogs. This is the kind of app Electron is good at. The memory overhead of Chromium is negligible compared to the Whisper model already sitting in RAM.

The transcription engine uses sherpa-onnx, a C++ inference runtime that supports Whisper models in ONNX format. It runs natively on the user's hardware without requiring Python, PyTorch, or any ML framework installation. The user downloads Falavra, picks a model size, and it works.

Why sherpa-onnx Instead of WhisperKit

DropVox uses WhisperKit, which is optimized for Apple Silicon. Falavra needs to run on macOS and Windows. sherpa-onnx provides cross-platform inference with good performance on both Intel and ARM processors. It also ships with built-in speaker diarization, which Falavra uses as a Pro feature.

The Pipeline in Detail

Step 1: Audio Extraction

When you paste a YouTube URL, Falavra calls yt-dlp to extract the audio stream. Not the video -- just the audio. This is faster, uses less bandwidth, and avoids dealing with video codecs.

// Simplified audio extraction; execAsync wraps child_process.execFile
const extractAudio = async (url: string, videoId: string): Promise<string> => {
  const outputPath = path.join(tempDir, `${videoId}.wav`);

  await execAsync(ytdlpBinary, [
    '--extract-audio',
    '--audio-format', 'wav',
    '--audio-quality', '0',   // best available quality
    '--output', outputPath,
    url
  ]);

  return outputPath;
};

yt-dlp is bundled with the app. No system dependencies to install.
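Bundling the binary means the app has to resolve its path inside the packaged application. A minimal sketch, assuming a `bin/` directory under Electron's resources folder (the helper name and layout are assumptions, not Falavra's actual code):

```typescript
import path from 'node:path';

// Resolve a bundled CLI binary inside the app's resources directory.
// In a packaged Electron app, resourcesPath would come from
// process.resourcesPath; the bin/ layout is an assumption.
const resolveBinary = (
  resourcesPath: string,
  platform: string,
  name: string
): string =>
  path.join(resourcesPath, 'bin', platform === 'win32' ? `${name}.exe` : name);

// e.g. resolveBinary(process.resourcesPath, process.platform, 'yt-dlp')
```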

Step 2: Audio Preprocessing

Whisper expects 16kHz mono WAV input. YouTube audio comes in various formats and sample rates. ffmpeg normalizes everything:

const preprocessAudio = async (inputPath: string): Promise<string> => {
  const outputPath = inputPath.replace('.wav', '_16k.wav');

  await execAsync(ffmpegBinary, [
    '-y',                // overwrite output if it exists
    '-i', inputPath,
    '-ar', '16000',      // 16kHz sample rate
    '-ac', '1',          // mono
    '-c:a', 'pcm_s16le', // 16-bit PCM
    outputPath
  ]);

  return outputPath;
};

ffmpeg is also bundled. The goal is zero external dependencies for the user.

Step 3: Transcription

sherpa-onnx runs the Whisper model on the preprocessed audio. The user chooses their model size based on their hardware and accuracy needs:

Model      Size     Speed (1hr audio)   Accuracy    Best For
Tiny       75MB     ~3 min              Good        Quick drafts, familiar content
Base       142MB    ~6 min              Better      General use
Small      466MB    ~15 min             Great       Important content
Medium     1.5GB    ~30 min             Excellent   Professional use
Large-v3   3GB      ~50 min             Best        Critical accuracy
Turbo      800MB    ~10 min             Great       Best balance

I default to Turbo. At 800MB, it fits comfortably in memory on any machine made in the last 5 years. The accuracy is close to Large-v3 at a fraction of the processing time. For most YouTube content -- spoken word, clear audio, common languages -- Turbo is indistinguishable from the larger models.

Step 4: Storage

Every transcript goes into a local SQLite database with full-text search via FTS5:

CREATE VIRTUAL TABLE transcripts_fts USING fts5(
  title,
  channel,
  transcript_text,
  content='transcripts',
  content_rowid='id'
);

This means you can search across all your transcripts instantly. "Find every video where someone mentioned 'transformer architecture'" returns results in milliseconds, even with hundreds of transcripts stored.

The database also stores metadata: video title, channel name, duration, thumbnail URL, transcription date, model used, and language detected.
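The FTS5 table above is an external-content table (`content='transcripts'`), so it sits on top of a regular backing table. A sketch of what that table and a library-wide search might look like -- column names beyond those in the FTS5 definition are assumptions:

CREATE TABLE transcripts (
  id INTEGER PRIMARY KEY,
  title TEXT,
  channel TEXT,
  transcript_text TEXT,
  duration_seconds INTEGER,
  model TEXT,
  language TEXT,
  created_at TEXT
);

-- Search every stored transcript, ranked by relevance, with a
-- short snippet of matching context from the transcript text.
SELECT t.title, t.channel,
       snippet(transcripts_fts, 2, '[', ']', '...', 10) AS context
FROM transcripts_fts
JOIN transcripts t ON t.id = transcripts_fts.rowid
WHERE transcripts_fts MATCH '"transformer architecture"'
ORDER BY rank;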

Step 5: Display and Export

The React frontend renders the transcript with timestamps. Click any timestamp to jump to that position in the embedded video player. Search within a single transcript or across your entire library.

Export formats:

  • Markdown -- Headers, timestamps as links, clean formatting. Drops into Obsidian or Notion perfectly.
  • Plain text -- Just the words. Good for pasting into documents.
  • SRT -- Standard subtitle format. Import into video editors, media players, or subtitle tools.
  • VTT -- Web-native subtitle format. Use with HTML5 video players.

Each format is generated on demand from the stored transcript data. No re-processing needed.
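Generating a format on demand is a pure function over the stored segments. Here is a sketch of what the SRT export could look like; the segment shape and function names are assumptions, not Falavra's actual export code:

```typescript
// A stored transcript segment: start/end in seconds, plus the text.
interface Segment {
  start: number;
  end: number;
  text: string;
}

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
const toSrtTime = (s: number): string => {
  const ms = Math.round(s * 1000);
  const pad = (n: number, w = 2) => String(n).padStart(w, '0');
  const hh = Math.floor(ms / 3_600_000);
  const mm = Math.floor((ms % 3_600_000) / 60_000);
  const ss = Math.floor((ms % 60_000) / 1000);
  return `${pad(hh)}:${pad(mm)}:${pad(ss)},${pad(ms % 1000, 3)}`;
};

// Each SRT cue is: index, time range, text, blank line.
const toSrt = (segments: Segment[]): string =>
  segments
    .map((seg, i) => `${i + 1}\n${toSrtTime(seg.start)} --> ${toSrtTime(seg.end)}\n${seg.text}\n`)
    .join('\n');
```

The VTT, markdown, and plain-text exporters would be siblings of `toSrt` over the same `Segment[]` input, which is why no re-processing is needed.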

Speaker Diarization

This is the Pro feature I'm most excited about.

Speaker diarization answers the question "who said what?" in a multi-speaker recording. For a podcast with two hosts and a guest, diarization separates the transcript into labeled speakers: Speaker 1, Speaker 2, Speaker 3. You can then rename them to actual names in the UI.

sherpa-onnx includes a built-in OfflineSpeakerDiarization pipeline that combines pyannote-based speaker segmentation with speaker embedding models. The pipeline runs entirely locally:

  1. Segmentation -- Detect when speaker changes occur in the audio
  2. Embedding -- Generate a voice fingerprint for each segment
  3. Clustering -- Group segments by speaker similarity

The result is a transcript where each block is attributed to a specific speaker. For interview transcripts, panel discussions, and multi-host podcasts, this transforms the output from "wall of text" to "structured conversation."
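The diarizer and the transcriber produce two independent time-stamped streams, so the app has to join them. One common approach -- a sketch under assumed types, not necessarily how Falavra does it -- assigns each transcript segment to the speaker turn it overlaps most:

```typescript
interface Turn { start: number; end: number; speaker: string }
interface Seg { start: number; end: number; text: string }

// Length of the time overlap between two intervals, 0 if disjoint.
const overlap = (a: { start: number; end: number }, b: { start: number; end: number }): number =>
  Math.max(0, Math.min(a.end, b.end) - Math.max(a.start, b.start));

// Attribute each transcript segment to the diarized speaker whose
// turn overlaps it the most; 'Unknown' if no turn overlaps at all.
const assignSpeakers = (segs: Seg[], turns: Turn[]) =>
  segs.map(seg => {
    let best = 'Unknown';
    let bestOverlap = 0;
    for (const t of turns) {
      const o = overlap(seg, t);
      if (o > bestOverlap) { best = t.speaker; bestOverlap = o; }
    }
    return { ...seg, speaker: best };
  });
```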

Diarization adds processing time -- roughly 2x the base transcription time -- but the output quality for multi-speaker content justifies it. It's optional and only available in Pro because the additional model weights increase the download size.

Why Local-First Matters for This

YouTube transcription feels like it should be a cloud service. The video is already on the internet. Why download it just to process it locally?

Three reasons.

Privacy

The videos you transcribe reveal your interests, your research topics, your political views, your health concerns. A cloud transcription service builds a detailed profile of everything you watch. A local app knows nothing. The data stays in a SQLite file on your disk.

For journalists researching sensitive stories, academics studying controversial topics, or anyone who values intellectual privacy, this matters.

Cost

Cloud transcription APIs charge per minute of audio. At $0.006/minute (OpenAI's Whisper API rate), transcribing one hour of audio costs $0.36. That's cheap for a single video, but it adds up.

If you transcribe 5 hours of content per week -- not unusual for a researcher or content creator -- that's $93.60 per year in API costs alone. Plus the subscription fee for whatever service wraps the API.

Falavra is a one-time purchase. Your 500th transcription costs the same as your first: nothing. The electricity to run the model on your hardware.

Offline Capability

Once you've downloaded the model (a one-time download), Falavra works without internet. You can extract audio from a YouTube video while online, then transcribe it later on a plane, in a cabin, or anywhere with no connectivity.

This also means the app doesn't break when a third-party API changes pricing, goes down, or shuts off access. The entire pipeline runs on your machine with tools you control.

The Brand

Falavra uses a warm amber color system -- golds, ambers, and deep browns. The name is a playful Portuguese portmanteau. "Fala" means "speech" or "talk" in Portuguese. "Palavra" means "word." Falavra: turning speech into words. It's not a real Portuguese word, which is the point. It's memorable, it's pronounceable in English and Portuguese, and the .com was available.

The logo is a stylized "F" lettermark in amber on dark. Minimal, warm, recognizable at small sizes. It reflects the product's personality: serious tool, approachable design.

Use Cases

I built Falavra for myself, but the use cases extend beyond my personal workflow:

Researchers and academics. Transcribe conference talks, lectures, and interviews. Search across transcripts to find specific mentions. Export to markdown for literature reviews.

Podcasters and content creators. Transcribe your own episodes for show notes, blog posts, or social media clips. Speaker diarization means you don't have to manually label who said what.

Journalists. Transcribe source interviews and press conferences. Keep transcripts local for source protection. Full-text search across an entire beat's worth of recordings.

Language learners. Transcribe content in your target language. Read along while listening. Export transcripts as study material.

Anyone who prefers reading to watching. Some people process information better as text. Falavra converts any YouTube video into a searchable document.

Pricing

One-time purchase. No subscriptions.

Free tier: 3 transcriptions per day, Tiny model only, no diarization. Enough to evaluate whether the app fits your workflow.

Pro: Unlimited transcriptions, all model sizes, speaker diarization, priority support. Pay once.

The logic is the same as DropVox: the app uses your hardware. There are no servers processing your audio. No per-user infrastructure costs. Charging monthly for software that runs entirely on your machine doesn't make sense.

What's Next

Falavra is in active development. The core transcription pipeline is working. The UI is functional. I'm in the polish phase -- edge cases, error handling, the experience details that separate a tool from a product.

Near-term roadmap:

  • Batch processing -- Queue multiple URLs and transcribe them overnight
  • Playlist support -- Paste a YouTube playlist URL, transcribe all videos
  • Local file support -- Drag audio/video files directly, not just YouTube URLs
  • Summary generation -- Local LLM integration to summarize long transcripts
  • Translation -- Transcribe in one language, translate to another (all local)

The foundation is solid. Each of these features builds on the existing pipeline without requiring architectural changes.

Try It

Falavra will be available at falavra.com when it launches. If you want early access or want to follow development progress, the best places to find updates are below.

Follow the Journey

Building products at Helsky Labs. Ship fast, learn from metrics, double down on winners.
