| 6 min | antonio maiolo
// product, ai, privacy, macos, architecture

On-Device Transcription Without the Cloud — The Architecture Behind Murmur

Local transcription used to mean slow, inaccurate, and half your CPU gone. That’s not true anymore — not on Apple Silicon.

I built Murmur to find out if “everything local” could actually ship as a product. Transcription, summarization, speaker diarization, 25+ languages, no cloud, no account. Here’s what the architecture looks like and where the time actually went.

The No-Cloud Constraint

The rule was simple: no network calls after the initial model download. No account creation. No telemetry. No “hybrid mode” where transcription is local but summaries hit an API. No fallback.

One cloud dependency and the privacy promise collapses. You can’t be “mostly offline” — either the audio never leaves the machine, or it does. There’s no middle ground users will trust.

That constraint forces you to solve everything locally: models that download once, inference that runs on consumer hardware without melting the battery, storage and encryption without a server to manage keys. Every shortcut I considered — “just use OpenAI for summaries,” “just store transcripts in iCloud” — broke the rule.

Why Not Whisper

Whisper is the obvious choice for speech-to-text. Huge community, well-documented, multiple Apple Silicon ports. It’s what everyone reaches for first.

I did too. Murmur v1 used WhisperKit. It worked. Then I tried FluidAudio with Parakeet, and the difference wasn’t incremental — it was a different league.

|  | Whisper (via WhisperKit) | Parakeet (via FluidAudio) |
| --- | --- | --- |
| Speed on Apple Silicon | Good | Significantly faster |
| Neural Engine optimization | Via Core ML conversion | Native |
| Language auto-detection | Yes | Yes (25+) |
| Speaker diarization | Separate pipeline needed | Built in |

That last row matters. Whisper gives you a wall of text. You know what was said but not who said it. To get speaker diarization, you need a second model, a separate pipeline, and alignment logic to stitch timestamps together. FluidAudio bundles offline diarization out of the box. One framework, one pipeline, speakers identified.

The migration from WhisperKit to FluidAudio was a one-way door. Once you see the speed difference on an M-chip, you don’t go back.

The Summarization Stack

Transcription is half the problem. A one-hour meeting produces a 10,000+ word transcript. Nobody reads that. What people actually want: “what did we decide?” and “what do I need to do?”

Murmur runs MLX with Qwen3-8B — a quantized 8B-parameter model running entirely on-device. Summaries are tailored by content type:

| Type | What you get |
| --- | --- |
| Meeting | Action items, decisions, open questions |
| Lecture | Structure, concepts, key takeaways |
| Podcast | Arguments, interesting quotes |
| Voice note | Quick distillation |
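Routing like this can be as simple as one prompt template per content type. A minimal sketch of the idea — the enum and the prompt strings are my own illustration, not Murmur's actual prompts:

```swift
// Hypothetical content-type → prompt routing. The prompt wording is
// illustrative only; the real templates are the product's secret sauce.
enum RecordingType {
    case meeting, lecture, podcast, voiceNote
}

func summaryPrompt(for type: RecordingType, transcript: String) -> String {
    let instruction: String
    switch type {
    case .meeting:
        instruction = "List the action items, decisions, and open questions."
    case .lecture:
        instruction = "Outline the structure, core concepts, and key takeaways."
    case .podcast:
        instruction = "Summarize the main arguments and pull out interesting quotes."
    case .voiceNote:
        instruction = "Distill this into a few sentences."
    }
    return instruction + "\n\nTranscript:\n" + transcript
}
```

The point is that the model is generic; the per-type behavior lives in the prompt, so adding a new content type costs one case, not one model.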

Summaries are timestamped. When the summary says “decided to postpone the migration,” you can jump straight to that part of the recording and hear the actual reasoning.

All transcriptions are searchable. That conversation from three weeks ago where someone mentioned the API rate limit? Search across your entire history. Local full-text search, not a cloud index.

Same rule as everything else: the LLM runs on your machine. No API calls, no per-token billing, no question about who reads your meeting notes.

Audio Capture Is the Unglamorous Problem

Mic recording via AVAudioEngine: trivial. Every framework handles it.
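For reference, the mic path really is a few lines — a sketch of an AVAudioEngine input tap, with the buffer handling elided:

```swift
import AVFoundation

// Tap the default input (mic) and stream PCM buffers to the transcriber.
let engine = AVAudioEngine()
let input = engine.inputNode
let format = input.outputFormat(forBus: 0)

input.installTap(onBus: 0, bufferSize: 4096, format: format) { buffer, _ in
    // Hand `buffer` (an AVAudioPCMBuffer) to the transcription pipeline here.
}

try engine.start()
```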

System audio — capturing what’s playing through your speakers during a call — is where the engineering time went. macOS doesn’t just let you tap system audio. Before macOS 14, you needed a virtual audio device — a kernel extension or audio driver hack that routes output back as input. Fragile, requires user installation, breaks on OS updates.

ScreenCaptureKit (macOS 14.4+) made system audio capture a real API. But “real” is relative. Permission prompts interrupt the user mid-recording. Buffer formats differ between mic and system audio. Recording both simultaneously and keeping them in sync is the problem that looks simple on paper and cost the most debugging time in practice.
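A rough sketch of the ScreenCaptureKit side, assuming audio-only capture of the main display; error handling, permission flow, and the actual buffer conversion are elided, and the class/function names are mine:

```swift
import ScreenCaptureKit

// Receives system-audio sample buffers from the stream.
final class AudioTap: NSObject, SCStreamOutput {
    func stream(_ stream: SCStream,
                didOutputSampleBuffer sampleBuffer: CMSampleBuffer,
                of type: SCStreamOutputType) {
        guard type == .audio else { return }
        // System-audio CMSampleBuffers arrive here; convert and enqueue.
        // Note: this format generally differs from the mic's, so both paths
        // must be resampled to a common format before mixing or aligning.
    }
}

func startSystemAudioCapture(with tap: AudioTap) async throws -> SCStream {
    let content = try await SCShareableContent.excludingDesktopWindows(
        false, onScreenWindowsOnly: true)
    guard let display = content.displays.first else { throw CancellationError() }

    let filter = SCContentFilter(display: display, excludingWindows: [])
    let config = SCStreamConfiguration()
    config.capturesAudio = true                 // the key flag for system audio
    config.excludesCurrentProcessAudio = true   // don't record our own playback

    let stream = SCStream(filter: filter, configuration: config, delegate: nil)
    try stream.addStreamOutput(tap, type: .audio, sampleHandlerQueue: .global())
    try await stream.startCapture()
    return stream
}
```

Keeping this stream's timestamps aligned with the mic tap's is the synchronization problem described above — both sources report time, but not on the same clock.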

Then there are file uploads. Users drop in audio files, video files, whatever they have. That’s a separate pipeline — AVAssetExportSession extracts the audio track, then it hits the same transcription path as a live recording. Two input pipelines, same output format.
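The file-import path is a straightforward export. A sketch using the completion-handler API (the function name and output choices are mine):

```swift
import AVFoundation

// Extract the audio track from any audio/video file into an .m4a,
// then feed it to the same transcription path as a live recording.
func extractAudio(from inputURL: URL, to outputURL: URL,
                  completion: @escaping (Error?) -> Void) {
    let asset = AVURLAsset(url: inputURL)
    guard let session = AVAssetExportSession(
        asset: asset, presetName: AVAssetExportPresetAppleM4A) else {
        completion(CocoaError(.fileWriteUnknown))
        return
    }
    session.outputURL = outputURL
    session.outputFileType = .m4a
    session.exportAsynchronously {
        completion(session.error)   // nil on success
    }
}
```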

This is why Murmur is a native Swift app, not Electron. You need ScreenCaptureKit, the Neural Engine, and low-level audio buffers. There’s no cross-platform abstraction for any of this.

Encryption Without a Server

No account means no server-side key management. Recordings and transcripts are encrypted on-device with AES-GCM via CryptoKit.
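The CryptoKit side is pleasantly small. A sketch of the seal/open round trip — key storage (e.g. in the Keychain) is elided, and the helper names are mine:

```swift
import CryptoKit
import Foundation

// Encrypt with AES-GCM; the sealed box bundles nonce + ciphertext + auth tag.
func encrypt(_ plaintext: Data, with key: SymmetricKey) throws -> Data {
    let box = try AES.GCM.seal(plaintext, using: key)
    return box.combined!   // non-nil with the default 12-byte nonce
}

// Decryption throws if the blob was tampered with or the key is wrong.
func decrypt(_ blob: Data, with key: SymmetricKey) throws -> Data {
    let box = try AES.GCM.SealedBox(combined: blob)
    return try AES.GCM.open(box, using: key)
}
```

GCM is authenticated encryption, so a corrupted or modified recording fails to decrypt rather than silently producing garbage.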

The trade-off is real: lose your machine, lose your data. No cloud backup. No recovery. No “forgot my password” flow — there’s no password, no account, no server that could help you.

For conversations you want private, that’s the point. The data exists in exactly one place.

The Requirements I Can’t Avoid

Every architecture decision above has a hardware consequence:

| Requirement | Why |
| --- | --- |
| macOS 14.4+ | ScreenCaptureKit APIs for system audio capture |
| Apple Silicon | Neural Engine required — Intel Macs can’t run this |
| 16GB RAM | Models need to fit in memory alongside the app |
| ~4GB initial download | Full models, not stubs — no cloud fallback |

That excludes Windows, Linux, older Macs, and the 8GB MacBook Air. That’s a lot of people. The bet is that the people who need private local transcription — and care enough to pay for it — have a modern Mac.

€29 one-time purchase. 7-day free trial. No subscription. There’s no server to pay for, so there’s no reason to charge monthly.

When to Use a Cloud Service Instead

| Use Murmur if | Use a cloud service if |
| --- | --- |
| Your conversations are sensitive | You need live captions during the meeting |
| You want zero recurring cost | Your team needs shared, searchable archives |
| You’re on a Mac with Apple Silicon | You need cross-platform support |
| You don’t want an account or telemetry | Compliance requires centralized storage |

If you need real-time transcription while the meeting is happening, use a cloud service. If your team needs a shared searchable database of every meeting, use a cloud service. Murmur is for people who want their conversations to stay on their own machine.


Murmur is available for Mac. 7-day free trial, €29 after that.


