Why We Built DropVox Twice

10 min read
dropvox · rewrite · swift · python · indie-hacking · architecture

The first version of DropVox worked. That was the problem.

It worked well enough that I used it every day for weeks. Well enough that I almost convinced myself it was shippable. And well enough that I wasted time polishing something that needed to be thrown away entirely.

This is the story of building DropVox twice -- once as a Python prototype, once as a native Swift app -- and why the rewrite was the best decision I made on this product.

The Python Version

DropVox started in January 2026 as a weekend project. The problem was simple: I receive a lot of WhatsApp voice messages. In Brazil, voice messages are the default communication medium. People send 3-minute monologues instead of typing two sentences. And I frequently find myself in situations -- meetings, public spaces, headphones-less commutes -- where I can't listen to them.

I wanted a menu bar app that would take an audio file, run it through Whisper, and give me the text. Locally. No upload. No cloud.

The first version used Python with three dependencies:

# The original DropVox stack
import rumps          # macOS menu bar framework
import whisper        # OpenAI's Whisper model
import pyperclip      # Clipboard access

That's it. A rumps app that showed a menu bar icon, accepted a file path, ran Whisper on it, and copied the transcript to the clipboard. I built it in a weekend, and it solved my problem.
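All the glue fit in a few dozen lines. Here's a sketch of the core flow, with the macOS-only calls injected as plain callables since `rumps`, `whisper`, and `pyperclip` won't import off a Mac (the function name is illustrative, not the actual DropVox source):

```python
def transcribe_to_clipboard(path, transcribe, copy):
    """The prototype's whole pipeline: model in, clipboard out.

    `transcribe` stands in for whisper's model.transcribe, which
    returns a dict with a "text" key; `copy` stands in for
    pyperclip.copy. Injecting both keeps the flow testable.
    """
    result = transcribe(path)
    text = result["text"].strip()
    copy(text)
    return text

# Exercising the flow with fakes:
clipboard = []
transcribe_to_clipboard(
    "voice-note.opus",
    transcribe=lambda p: {"text": "  oi, tudo bem?  "},
    copy=clipboard.append,
)
print(clipboard[0])   # → oi, tudo bem?
```

Everything else was menu bar plumbing around this one function.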

For about two weeks, I was happy.

Where Python Broke Down

The problems weren't bugs. They were limits.

Distribution Was a Nightmare

Sharing a Python app with non-developers means bundling a Python runtime. PyInstaller and py2app exist, but they produce bloated, fragile bundles. My DropVox.app was 800MB because it included Python itself, NumPy, PyTorch, the Whisper model weights, and all their transitive dependencies.

On a fresh Mac, the app sometimes failed to launch because of missing system libraries. On Intel Macs, it crashed because the bundled PyTorch expected Apple Silicon. Every user's machine was a different minefield.

I spent more time debugging distribution than I spent building the app.

Performance Hit a Wall

Transcription ran on the main thread, so the UI froze completely while Whisper worked. The menu bar icon became unresponsive. macOS would sometimes show the spinning beach ball and offer to force-quit the app.

The obvious fix is to move the work off the main thread, but Python's Global Interpreter Lock (GIL) means threads don't truly parallelize CPU-bound work, and multiprocessing adds complexity with inter-process communication, shared memory, and serialization overhead.
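The effect is easy to demonstrate on a stock CPython build. Nothing DropVox-specific here -- just four identical CPU-bound jobs, run sequentially and then on four threads:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def count_primes(limit):
    """CPU-bound work: naive prime count below `limit`."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

# Sequential baseline: four identical jobs, one after another.
start = time.perf_counter()
sequential = [count_primes(20_000) for _ in range(4)]
t_seq = time.perf_counter() - start

# Same four jobs on four threads. The GIL lets only one thread run
# Python bytecode at a time, so wall time barely improves.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(count_primes, [20_000] * 4))
t_thr = time.perf_counter() - start

print(f"sequential: {t_seq:.2f}s  threaded: {t_thr:.2f}s")
```

Swap the `ThreadPoolExecutor` for a `ProcessPoolExecutor` and the jobs do parallelize on free cores -- at the cost of exactly the IPC and serialization overhead described above.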

A 2-minute voice message took around 30 seconds to transcribe on my M2 MacBook Pro using the base model. Not terrible, but not the experience I wanted to ship.

UI Was Fundamentally Limited

rumps is great for simple menu bar apps. But it can't do floating windows. It can't do drag-and-drop. It can't do custom UI beyond basic menu items and alerts.

I wanted a drop zone -- a floating window where you drag an audio file and it starts transcribing. I wanted a history panel. I wanted model selection. None of this was possible without abandoning rumps for PyObjC, which is essentially writing Objective-C in Python syntax. At that point, why not just write Swift?

Startup Time

The Python version took 4-6 seconds to launch. Most of that was importing PyTorch and loading Whisper into memory. For a menu bar app that should feel instant, this was unacceptable.

Users expect menu bar apps to be invisible until needed. A 5-second delay on every login undermines that expectation.

The Rewrite Decision

I had a working prototype. The instinct was to iterate: fix distribution, optimize performance, improve the UI piece by piece. That's usually the right instinct. Rewrites are expensive, risky, and notoriously underestimated.

But I ran through the checklist honestly:

  1. Is the problem the architecture or the implementation? Architecture. Python's runtime model was the root cause of every major issue.
  2. Would fixing the issues require replacing most of the code anyway? Yes. Fixing distribution meant a different bundling approach. Fixing performance meant a different ML runtime. Fixing UI meant a different framework.
  3. Is there a clearly better platform for this specific app? Yes. Swift + SwiftUI + WhisperKit is purpose-built for exactly this kind of macOS application.
  4. Did the prototype teach me what to build? Absolutely. Two weeks of daily use gave me a clear spec.

The prototype had done its job. It validated the idea, revealed the UX priorities, and proved that local Whisper transcription was good enough for real use. The code was disposable. The knowledge wasn't.

The Swift Version

On January 22, I opened Xcode and started from scratch.

The Stack

SwiftUI          → UI framework (native macOS widgets, floating windows, drag-drop)
WhisperKit       → CoreML-optimized Whisper (runs on Apple Neural Engine)
Swift Actors     → Thread-safe concurrency without GIL problems
Swift Data       → Local persistence for transcription history
Developer ID     → Code signing and notarization for distribution

Week 1: Core Transcription

The first week was about getting WhisperKit running and building the basic transcription pipeline. WhisperKit is not just a Swift wrapper around Whisper -- it's a complete reimplementation optimized for Apple's CoreML framework. It runs on the Neural Engine, the dedicated ML hardware in every Apple Silicon chip.

The performance difference was immediate. The same 2-minute voice message that took 30 seconds in Python transcribed in under 10 seconds with WhisperKit. On the large model, which Python couldn't run at all without running out of memory, WhisperKit handled it in about 25 seconds.

I also implemented 5 model sizes (Tiny through Large) and automatic language detection for 13 languages. WhisperKit handles both natively.

Week 2: The Interface

This is where Swift earned its keep. In SwiftUI, a floating drop zone is about 40 lines of code:

import SwiftUI
import UniformTypeIdentifiers

struct DropZoneWindow: View {
    @State private var isTargeted = false

    var body: some View {
        ZStack {
            RoundedRectangle(cornerRadius: 12)
                .fill(.ultraThinMaterial)
                .overlay(
                    RoundedRectangle(cornerRadius: 12)
                        .strokeBorder(
                            isTargeted ? Color.accentColor : Color.secondary,
                            style: StrokeStyle(lineWidth: 2, dash: [8])
                        )
                )

            VStack {
                Image(systemName: "waveform")
                Text("Drop audio file here")
            }
        }
        .onDrop(of: [.audio, .fileURL], isTargeted: $isTargeted) { providers in
            handleDrop(providers)
            return true
        }
    }

    private func handleDrop(_ providers: [NSItemProvider]) {
        // Resolve the dropped item to a file URL, then hand it
        // to the transcription pipeline.
    }
}

In Python with PyObjC, this same feature would have been 200+ lines of Objective-C bridge code with manual memory management and no SwiftUI previews.

I also built:

  • Keyboard shortcut (Cmd+D) to toggle the drop zone
  • Clipboard paste (Cmd+V) for copied audio files
  • Searchable transcription history with full-text search
  • Progress indicators during transcription
  • Model selection in preferences

Week 3: Distribution

Code signing and notarization on macOS are not trivial, but they're a solved problem: Apple Developer ID, codesign, notarytool, and a GitHub Actions workflow that builds, signs, notarizes, and produces a .dmg on every tagged release.

I also built the licensing system. Free tier: 3 transcriptions per day, 60-second max duration. Pro ($9.99 USD / R$49.90 BRL): unlimited everything. One-time purchase. LemonSqueezy handles payment, and the app validates license keys locally.
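The app itself is Swift, but the gating logic is simple enough to sketch in a few lines of Python (tier limits taken from the description above; names are illustrative, not the actual DropVox source):

```python
from dataclasses import dataclass

FREE_DAILY_LIMIT = 3        # free tier: 3 transcriptions per day
FREE_MAX_SECONDS = 60.0     # free tier: 60-second max duration

@dataclass
class LicenseState:
    licensed: bool = False
    used_today: int = 0

def can_transcribe(state: LicenseState, duration_s: float) -> bool:
    """Return True if this transcription is allowed under the current tier."""
    if state.licensed:
        return True                      # Pro: unlimited everything
    return (state.used_today < FREE_DAILY_LIMIT
            and duration_s <= FREE_MAX_SECONDS)

assert can_transcribe(LicenseState(licensed=True), 3600)
assert can_transcribe(LicenseState(), 45)
assert not can_transcribe(LicenseState(), 90)              # too long for free
assert not can_transcribe(LicenseState(used_today=3), 30)  # daily cap hit
```

The real complexity lives in validating LemonSqueezy license keys locally, not in the gate itself.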

Final Days: Polish

The last stretch was edge cases. WhatsApp uses .opus files, which don't play nicely with every audio framework. Some users have .ogg files. Some have .webm. I added format conversion using AVFoundation so the app handles whatever people throw at it.
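DropVox does the conversion with AVFoundation, but the first step -- figuring out which container you were actually handed, regardless of the file extension -- comes down to a few magic bytes. An illustrative check (Python, since the Swift source isn't shown here):

```python
def sniff_container(header: bytes) -> str:
    """Identify common voice-message containers from their first bytes."""
    if header[:4] == b"OggS":
        return "ogg"    # Ogg framing: WhatsApp .opus and plain .ogg
    if header[:4] == b"\x1a\x45\xdf\xa3":
        return "webm"   # EBML header, shared by WebM and Matroska
    if header[4:8] == b"ftyp":
        return "mp4"    # MP4/M4A family (length field first, then 'ftyp')
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "wav"    # RIFF/WAVE
    return "unknown"

print(sniff_container(b"OggS\x00\x02" + b"\x00" * 6))   # → ogg
```

Trusting the header instead of the extension is what lets the app handle a `.opus` that's really Ogg, or a misnamed `.webm`, without special-casing senders.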

I also spent time on the menu bar icon states (idle, transcribing, complete), accessibility labels, and making sure the app behaved correctly when the user had multiple displays.

24 Days

From git init to a signed, notarized, commercially available macOS application: 24 days. No weekends off. Roughly 10-12 hour days.

That's fast, but it's not magic. The speed came from three things:

  1. The prototype eliminated uncertainty. I knew exactly what the app needed to do because I'd been using a version of it for two weeks.
  2. SwiftUI is productive for macOS. Native widgets, native behavior, no fighting the platform.
  3. WhisperKit removed the ML complexity. I didn't have to think about model optimization, CoreML conversion, or hardware acceleration. WhisperKit abstracts all of it.

The Cross-Platform Question

People ask why I didn't use Electron or Tauri to build once and ship everywhere. The answer is performance and experience.

DropVox transcribes audio using ML models. Electron stacks a full Chromium runtime on top of an already compute-intensive task. For a menu bar utility that should feel native and lightweight, wrapping a web browser is the wrong tradeoff.

The plan for Windows is a separate native app: C# with WinUI 3, using whisper.cpp or ONNX Runtime for transcription. Two codebases, zero compromise on either platform. The core logic (model loading, transcription pipeline, history management) is straightforward enough that reimplementation isn't expensive. The platform-specific parts (UI, system integration, distribution) are where native matters most.

This is a deliberate choice. Not every app needs cross-platform. Menu bar utilities that interact deeply with the OS are better built natively for each OS.

When to Rewrite vs. Iterate

The default answer should always be "iterate." Rewrites fail more often than they succeed because teams underestimate the accumulated knowledge embedded in existing code -- edge cases handled, bugs fixed, user feedback incorporated.

But DropVox was an exception because:

  • The codebase was 3 weeks old. There was minimal accumulated knowledge to lose.
  • The architecture was the bottleneck. No amount of iteration would fix Python's distribution story or GIL limitations.
  • The target platform was clear. This wasn't a speculative bet on a new technology. Swift + SwiftUI for macOS apps is the canonical choice.
  • The prototype had already captured the product knowledge. Every UX decision, every feature priority, every "users actually need this" insight survived the rewrite because it lived in my head, not in the code.

If DropVox had been a 6-month codebase with users, integrations, and accumulated edge case handling, I would have iterated. But it was a weekend prototype. The cost of rewriting was 24 days. The cost of not rewriting was shipping a fundamentally limited product.

The Lesson

Build the prototype in whatever gets you to validation fastest. For me, that was Python -- a language I'm comfortable in, with libraries that solve the immediate problem.

But don't mistake the prototype for the product. They serve different purposes. The prototype answers "should this exist?" The product answers "how should this exist?" Those are different questions requiring different tools.

DropVox the Python script validated that local voice transcription is useful. DropVox the Swift app proved it can be a product people pay for.

The prototype had to die for the product to live. And that's fine. The best $0 I ever spent was on code I threw away.

Follow the Journey

DropVox is available at dropvox.app. Free tier with no account required.

I share build progress, product decisions, and lessons learned in public.

Building products at Helsky Labs. Ship fast, learn from metrics, double down on winners.