On-Device AI as a Product Feature

10 min read
local-ai, privacy, whisperkit, sherpa-onnx, product-strategy, indie-hacking

Two of my products -- DropVox and Falavra -- run AI entirely on the user's machine. No cloud. No API keys. No per-request costs. No server bills. This started as a privacy decision and turned into the single most important product and business decision I have made as an indie developer.

Here is why on-device AI is not just a technical choice but a product strategy.

Two Products, Two Runtimes, One Architecture

DropVox is a native macOS app built in Swift. Falavra is a cross-platform Electron app. Both transcribe audio using OpenAI's Whisper model architecture. Both do it locally. But they use completely different runtimes to get there.

DropVox uses WhisperKit. WhisperKit is optimized specifically for Apple Silicon. It compiles Whisper models to CoreML format and runs inference on the Neural Engine -- the dedicated ML accelerator built into every M-series chip. The integration is tight: Swift Package Manager pulls it in, the API is native Swift, and the performance is remarkable. A two-minute voice message transcribes in under 10 seconds on an M1 MacBook Air. That is faster than real-time by a wide margin.
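To make "faster than real-time by a wide margin" concrete, here is the arithmetic behind the quoted numbers. This is illustrative math, not WhisperKit code, and the helper name is mine:

```typescript
// Illustrative math, not WhisperKit API code: how much faster than
// real-time the quoted M1 figures are. `realTimeSpeedup` is a made-up
// helper name for this post.
function realTimeSpeedup(audioSeconds: number, processingSeconds: number): number {
  return audioSeconds / processingSeconds;
}

// A two-minute (120 s) voice message transcribed in 10 s is 12x real-time.
console.log(realTimeSpeedup(120, 10)); // 12
```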

Falavra uses sherpa-onnx. sherpa-onnx wraps models in the ONNX Runtime, which runs on macOS, Windows, and Linux. It does not have the Neural Engine optimization that WhisperKit enjoys, but it runs everywhere. For an Electron app that needs to ship cross-platform, that flexibility is non-negotiable.

Same Whisper model architecture underneath. Two very different paths to get audio in and text out.

Why Each Was the Right Choice

WhisperKit was right for DropVox because DropVox is macOS-only. When you are building for a single platform, you should use the tools that platform provides. Apple spent billions designing the Neural Engine. WhisperKit lets me use it. Leaving that performance on the table to use a cross-platform runtime would be a bad trade-off for a native app.

sherpa-onnx was right for Falavra because Falavra cannot afford to be platform-locked. The ONNX Runtime abstracts away hardware differences. On a Mac, it uses CoreML or Metal when available. On Windows, it can use DirectML or CUDA. On Linux, it falls back to CPU. One codebase, one model format, multiple hardware backends. For Electron, where the whole point is cross-platform distribution, this is the correct trade.
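The fallback chain described above can be sketched as a platform-to-provider mapping. The provider names ("coreml", "dml", "cuda", "cpu") mirror ONNX Runtime execution providers, but the function and its ordering are my illustration, not Falavra's actual configuration:

```typescript
// Hedged sketch of an execution-provider fallback chain by platform.
// Provider names follow ONNX Runtime conventions; the ordering is
// illustrative, not Falavra's real config.
function providerPreference(platform: string): string[] {
  switch (platform) {
    case "darwin": return ["coreml", "cpu"];      // macOS: CoreML, then CPU
    case "win32":  return ["dml", "cuda", "cpu"]; // Windows: DirectML or CUDA, then CPU
    default:       return ["cpu"];                // Linux and others: CPU fallback
  }
}

console.log(providerPreference(process.platform));
```

The point is the shape of the decision, not the exact list: one codebase asks for the best available backend and degrades gracefully to CPU.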

The lesson: there is no single "best" local AI runtime. There is the best runtime for your product's constraints.

The Product Advantages

Running AI locally is not just a technical flex. It creates real product differentiation that users notice and pay for.

Privacy Users Actually Care About

Every cloud transcription service asks you to trust them with your audio. DropVox and Falavra cannot send your audio anywhere because they contain no code to do so. This is not a privacy policy -- it is an architectural guarantee.

And it matters to real users. Journalists who handle sensitive sources. Lawyers processing privileged communications. Healthcare professionals dealing with patient recordings. Anyone who has read a data breach headline and thought "that could be my audio."

Privacy-by-architecture is a stronger selling point than any checkbox in a settings panel.

Zero Marginal Cost Per User

When a user transcribes a 10-minute recording with DropVox, it costs me nothing. No API call. No compute on my server. No bandwidth. The user's own hardware does the work.

Compare this to building on OpenAI's Whisper API at $0.006 per minute. A thousand daily users averaging five minutes each costs $900 per month. Ten thousand users costs $9,000. The cost scales linearly with usage, which means you need subscription pricing to survive.
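The cost math above, made explicit. The price is OpenAI's quoted $0.006 per transcribed minute; the user counts are illustrative, and the helper is mine:

```typescript
// The post's API-cost arithmetic as code. Price per minute is OpenAI's
// quoted Whisper API rate; usage figures are illustrative.
const PRICE_PER_MINUTE = 0.006; // USD

function monthlyApiCost(
  dailyUsers: number,
  minutesPerUserPerDay: number,
  daysPerMonth = 30
): number {
  const raw = dailyUsers * minutesPerUserPerDay * daysPerMonth * PRICE_PER_MINUTE;
  return Math.round(raw * 100) / 100; // round to cents
}

console.log(monthlyApiCost(1_000, 5));  // 900
console.log(monthlyApiCost(10_000, 5)); // 9000
```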

With local AI, every new user is pure margin. I charge a one-time fee and deliver unlimited value.

Works Offline

No internet? No problem. Both DropVox and Falavra work on airplanes, in basements, in rural Brazil where my family lives and cell service is unreliable. The model is on the user's machine. It does not need permission from a server to function.

This is an underrated feature. People do not think about offline capability until they need it, and then it is the only thing that matters.

No API Deprecation Risk

Cloud APIs change. Pricing changes. Rate limits get introduced. Endpoints get deprecated. Entire services shut down. When your product depends on someone else's API, your product's future depends on their business decisions.

Local AI removes that dependency. The Whisper model I ship today will work in 10 years. No server needs to stay online. No API key needs to remain valid.

The Business Case for Indie Hackers

The economics of on-device AI fundamentally change what is viable as a solo developer.

No Infrastructure to Manage

I do not run servers. I do not manage Kubernetes. I do not have a cloud bill. I do not wake up at 3 AM because a transcription API is returning 503 errors. My infrastructure is a GitHub repository and a CI/CD pipeline. That is it.

For a solo developer who also has a full-time job at a digital agency, eliminating operational overhead is not a luxury. It is a prerequisite for the business existing at all.

One-Time Purchase Model Works

Subscriptions make sense when your costs scale with usage. When they do not, you can charge once and deliver forever. Users prefer this. I prefer this. The business model is simple: make something good, sell it for a fair price, move on to making it better.

DropVox Pro is $9.99. Falavra Pro is $14.99-29.99. No monthly anxiety about churn rates. No "are we retaining enough subscribers to cover API costs this month." The unit economics work from the first sale.

Revenue Scales, Costs Do Not

Cloud-dependent product:
  1,000 users  = ~$900/month API costs
  10,000 users = ~$9,000/month API costs
  Revenue must outpace costs at every scale

Local AI product:
  1,000 users  = $0 incremental costs
  10,000 users = $0 incremental costs
  Every sale after the first is almost pure profit

This is the math that makes indie software viable. Not every product can eliminate server costs entirely, but the ones that can have a structural advantage.

Performance Reality

Local AI is fast enough. But "fast enough" varies by hardware and model size.

WhisperKit on Apple Silicon

WhisperKit on an M1 MacBook Air transcribes faster than real-time for most model sizes. The tiny model (75MB) processes a one-minute file in about 6 seconds. The base model (150MB) takes about 12 seconds. Even the large-v3 model (3GB), which delivers near-cloud accuracy, runs at roughly real-time speed on M1 and faster on M2 and M3.

The Neural Engine makes this possible. It is a dedicated chip designed for exactly this kind of matrix math. Using it through WhisperKit means transcription is not competing with the CPU for resources -- your Mac stays responsive while transcription runs.

sherpa-onnx Cross-Platform

sherpa-onnx is slower than WhisperKit on macOS because it does not have the same level of Neural Engine integration. On a MacBook, expect roughly 1.5-2x real-time for the base model. On Windows machines with good GPUs, DirectML acceleration brings it closer to real-time. On CPU-only machines, it is slower but still practical for files under 30 minutes.

The trade-off is acceptable because Falavra is a desktop application for processing recorded audio, not a real-time transcription tool. Users drop a file and come back to it. A two-minute wait for a 10-minute recording is fine.

Model Sizes

Users choose their own trade-off between speed and accuracy:

  Model     Disk Size  Speed (M1)       Accuracy
  Tiny      ~75MB      ~10x real-time   Good for clean audio
  Base      ~150MB     ~5x real-time    Solid general use
  Small     ~500MB     ~2x real-time    Good for accented speech
  Medium    ~1.5GB     ~1x real-time    Near-professional
  Large-v3  ~3GB       ~0.8x real-time  Cloud-competitive
Both DropVox and Falavra let users pick the model that fits their hardware and patience. Defaulting to base and letting power users upgrade is the right UX choice.
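The speed column translates directly into expected wait times. A rough estimator, using the post's approximate M1 multipliers (the helper name and exact figures are illustrative, and real speed varies with hardware and audio):

```typescript
// Rough wait-time estimate per model on M1-class hardware, using the
// approximate speed multipliers from the table above. Illustrative only.
const SPEED_VS_REALTIME: Record<string, number> = {
  tiny: 10,
  base: 5,
  small: 2,
  medium: 1,
  "large-v3": 0.8,
};

function estimatedSeconds(model: string, audioSeconds: number): number {
  const speedup = SPEED_VS_REALTIME[model];
  if (speedup === undefined) throw new Error(`unknown model: ${model}`);
  return audioSeconds / speedup;
}

// One minute of audio: ~6 s on tiny, ~12 s on base, ~75 s on large-v3.
console.log(estimatedSeconds("tiny", 60));
console.log(estimatedSeconds("base", 60));
console.log(estimatedSeconds("large-v3", 60));
```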

Speaker Diarization: Local and Practical

Falavra goes beyond basic transcription with speaker diarization -- figuring out "who said what" in a recording. This also runs entirely locally.

sherpa-onnx provides OfflineSpeakerDiarization using a pyannote segmentation model (~90MB) for detecting speaker boundaries and a 3D-Speaker embedding model (~40MB) for fingerprinting individual voices. The system uses agglomerative clustering to group segments by speaker and auto-detects the number of speakers.

That is about 130MB of additional models for a feature cloud APIs price at $0.02-0.05 per minute. Local. No recurring costs. No data leaving the machine.
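The clustering step is easy to illustrate in miniature. This toy version merges the two closest clusters until no pair is within a distance threshold, which is how the speaker count falls out automatically. It is a simplified sketch over 2D points, not sherpa-onnx's implementation, which clusters high-dimensional learned embeddings:

```typescript
// Toy agglomerative clustering over "speaker embeddings": repeatedly
// merge the two closest clusters until the closest pair is farther
// apart than `threshold`. Simplified sketch, not sherpa-onnx code.
type Vec = number[];

function dist(a: Vec, b: Vec): number {
  return Math.sqrt(a.reduce((s, x, i) => s + (x - b[i]) ** 2, 0));
}

function centroid(points: Vec[]): Vec {
  return points[0].map((_, i) => points.reduce((s, p) => s + p[i], 0) / points.length);
}

function agglomerate(embeddings: Vec[], threshold: number): Vec[][] {
  let clusters: Vec[][] = embeddings.map((e) => [e]);
  while (clusters.length > 1) {
    let best = { i: 0, j: 1, d: Infinity };
    for (let i = 0; i < clusters.length; i++) {
      for (let j = i + 1; j < clusters.length; j++) {
        const d = dist(centroid(clusters[i]), centroid(clusters[j]));
        if (d < best.d) best = { i, j, d };
      }
    }
    if (best.d > threshold) break; // no pair close enough: stop merging
    clusters[best.i] = clusters[best.i].concat(clusters[best.j]);
    clusters.splice(best.j, 1);
  }
  return clusters; // each remaining cluster ~ one speaker
}

// Two tight groups of fake "embeddings" resolve into two speakers.
const segments: Vec[] = [[0, 0], [0.1, 0], [5, 5], [5, 5.1]];
console.log(agglomerate(segments, 1).length); // 2
```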

I will write a separate deep-dive on the diarization implementation -- it deserves its own post.

The 2026 Hardware Landscape

The timing for on-device AI has never been better.

Apple's Neural Engine has been shipping in every Mac, iPad, and iPhone for years. It is mature, well-documented, and getting faster with each generation. M3 and M4 chips can run large models that would have choked earlier hardware.

On the Windows side, CES 2026 confirmed the "AI PC" era is real. Intel Core Ultra 300 series ships with a dedicated NPU. AMD Ryzen AI 400 delivers 60 TOPS. Qualcomm Snapdragon X2 pushes 80 TOPS for ARM Windows laptops. By the end of 2026, more than half of new PCs will have dedicated AI acceleration.

The hardware is ready. The models are efficient. The runtimes are mature. The gap between "local" and "cloud" AI quality shrinks with every model release. For tasks like speech recognition, that gap is already negligible.

When Local AI Does Not Make Sense

I believe strongly in local-first, but I am not ideological about it. There are cases where the cloud is the right answer.

Large language models that need 70B+ parameters. Running Llama 405B locally requires hardware that costs more than a car. If your product needs cutting-edge reasoning capability, you need cloud GPUs.

Real-time translation at scale. Live translation for a video conference with 50 participants is not a local AI problem. The compute requirements and the network coordination both demand centralized infrastructure.

Anything needing internet-connected knowledge. A chatbot that needs to answer questions about today's news cannot run purely locally. Retrieval-augmented generation needs a connection to a knowledge source.

Mobile apps with strict battery constraints. Running Whisper on an iPhone works, but doing it repeatedly drains the battery fast. For mobile, the trade-off between local processing and battery life is real.

For everything else -- especially personal data processing like transcription, summarization, and analysis -- local is the better default.

My Prediction

Within two years, "cloud-based" will be a liability for personal data processing apps, not a feature.

The trajectory is clear. Hardware NPUs are shipping in every new machine. Models keep getting more efficient at smaller sizes. Privacy regulations keep tightening. Users keep getting more suspicious of cloud processing.

An app that says "we process your voice recordings in the cloud" will sound increasingly like an app that says "we read your emails on our servers." Technically possible. Legally allowed. But not what users want when an alternative exists.

DropVox and Falavra are early bets on this trajectory. The next two years will determine whether I am right.

Follow the Journey

I am building all of this in public -- the products, the decisions, the numbers. If you are building with local AI or thinking about it, I would like to hear from you.


Building products at Helsky Labs. Ship fast, learn from metrics, double down on winners.