
Last Tuesday I needed subtitles for a 12-minute product demo. The video was in English, the audience was international, and the deadline was two hours away.
My first instinct was the usual cloud suspects. Rev quoted me $1.50 per minute with a 24-hour turnaround. Descript wanted a subscription. Happy Scribe's free tier maxed out at one minute. Even YouTube's auto-captions required uploading, waiting for processing, then manually downloading the .srt file from Studio: a workflow designed for YouTube, not for the rest of the internet.
Then I thought: OpenAI released Whisper as open source in 2022. It's been ported to ONNX, to Core ML, to WebAssembly. If someone can run Stable Diffusion in a browser tab, running a speech-to-text model shouldn't be harder.
So I built one. The Automatic Subtitle Generator on Kitmul runs Whisper entirely in your browser. No upload, no account, no subscription. Drop a video, get a .vtt or .srt file.
Here's how it works under the hood and why the privacy angle matters more than most people realize.
How Whisper works (the short version)
Whisper is an encoder-decoder transformer trained on 680,000 hours of multilingual audio. The encoder converts a log-mel spectrogram of the audio into a sequence of embeddings. The decoder autoregressively generates text tokens, predicting each word based on the audio context and all previous tokens.
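The autoregressive part can be sketched as a toy greedy loop. The `scoreNext` function below is a stand-in for Whisper's decoder, not its real API; the point is only that each token is chosen conditioned on the audio embedding plus everything generated so far.

```javascript
// Toy greedy decoder: at each step, pick the highest-scoring next token
// given the (fixed) audio embedding and all tokens generated so far.
function greedyDecode(audioEmbedding, scoreNext, maxTokens = 8) {
  const EOS = "<eos>";
  const tokens = [];
  while (tokens.length < maxTokens) {
    const scores = scoreNext(audioEmbedding, tokens); // { token: score, ... }
    const best = Object.entries(scores).sort((a, b) => b[1] - a[1])[0][0];
    if (best === EOS) break; // stop when the model predicts end-of-sequence
    tokens.push(best);
  }
  return tokens;
}

// A stand-in "model" that deterministically continues a fixed phrase.
const phrase = ["hello", "world", "<eos>"];
const scoreNext = (_emb, prev) => {
  const next = phrase[prev.length];
  return { [next]: 1.0, "<eos>": next === "<eos>" ? 1.0 : 0.1 };
};

greedyDecode(null, scoreNext); // → ["hello", "world"]
```

Whisper's actual decoder does beam search with temperature fallback, but the shape of the loop is the same: no token exists until all the tokens before it do.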
The clever part is the training data. Instead of curating a clean dataset, OpenAI scraped the internet for audio paired with existing transcripts: YouTube videos with community captions, podcast show notes, audiobooks with their text counterparts. The sheer volume of noisy, real-world data is what gives Whisper its robustness. It handles accents, background music, and cross-talk far better than older models that were trained on read speech in quiet rooms.
The model comes in sizes from tiny (39M parameters) to large-v3 (1.5B parameters). The browser version uses a quantized variant that balances accuracy with the memory constraints of running inside a Web Worker.
The browser pipeline
When you drop a video into the subtitle generator, this is what happens:
- Audio extraction. The video's audio track is separated using FFmpeg compiled to WebAssembly. No server round-trip; the demuxing happens in your browser's memory.
- Resampling. Whisper expects 16kHz mono audio. If the source is 44.1kHz stereo (most videos), an OfflineAudioContext handles the conversion via the Web Audio API.
- Chunked inference. The audio is split into 30-second chunks (Whisper's attention window) and fed through the ONNX model inside a Web Worker. This keeps the main thread responsive; you can scroll the page while transcription runs.
- Timestamp alignment. Whisper produces word-level timestamps. The tool merges these into subtitle segments of 1-3 lines, keeping each line under 42 characters (the BBC subtitle guidelines standard for readability).
- Format export. You choose WebVTT or SRT. Both are plain text. Both work everywhere.
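The timestamp-alignment step above boils down to a merge pass over word-level timestamps. The field names (`word`, `start`, `end`) are illustrative, not the tool's actual schema:

```javascript
// Merge word-level timestamps into subtitle cues, starting a new cue
// when adding the next word would push the text past maxChars.
function wordsToCues(words, maxChars = 42) {
  const cues = [];
  let current = null;
  for (const w of words) {
    if (current && (current.text + " " + w.word).length <= maxChars) {
      current.text += " " + w.word; // word still fits: extend the cue
      current.end = w.end;
    } else {
      current = { text: w.word, start: w.start, end: w.end };
      cues.push(current); // word doesn't fit: open a new cue
    }
  }
  return cues;
}

const words = [
  { word: "Drop", start: 0.0, end: 0.3 },
  { word: "a", start: 0.3, end: 0.4 },
  { word: "video,", start: 0.4, end: 0.9 },
  { word: "get", start: 0.9, end: 1.1 },
  { word: "subtitles.", start: 1.1, end: 1.8 },
];
wordsToCues(words, 20);
// → two cues: "Drop a video, get" (0.0-1.1) and "subtitles." (1.1-1.8)
```

A production version also breaks on punctuation and long pauses so cues land on natural phrase boundaries, but the length constraint is the core of it.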
The first run downloads the model weights (~40-80MB depending on language). After that, your browser caches them. Subsequent runs start almost instantly.
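For the resampling step above, OfflineAudioContext does the heavy lifting, but conceptually it amounts to interpolating the waveform onto a new time grid. A naive linear-interpolation version shows the core idea; the real Web Audio implementation adds proper low-pass filtering to avoid aliasing:

```javascript
// Naive linear-interpolation resampler: maps each output sample time
// back onto the input grid and interpolates between neighbours.
// Real resamplers (including OfflineAudioContext) low-pass filter
// first to prevent aliasing; this is only the conceptual core.
function resampleLinear(samples, fromRate, toRate) {
  const outLength = Math.round((samples.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const ratio = fromRate / toRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio; // position of output sample i on the input grid
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, samples.length - 1);
    const frac = pos - i0;
    out[i] = samples[i0] * (1 - frac) + samples[i1] * frac;
  }
  return out;
}

// Halving the sample rate halves the number of samples.
resampleLinear(new Float32Array([0, 1, 2, 3, 4, 5, 6, 7]), 44100, 22050).length; // → 4
```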

Why client-side subtitles matter
The privacy argument is straightforward, but people underestimate how many scenarios it covers.
Legal depositions. Law firms transcribe testimony recordings. Uploading those to Rev or Otter.ai means a third party now has access to privileged communications. Most cloud transcription services' terms of service include some version of "we may use your content to improve our models." Even if they don't today, you've already uploaded.
Pre-release content. Marketing teams subtitle product demos before launch. Internal training videos contain unreleased features. If a competitor is monitoring cloud transcription APIs (and some do; aggregated, nominally anonymized data is surprisingly cheap to obtain), you've just leaked your roadmap.
Medical and therapeutic contexts. Therapists recording sessions for supervision. Doctors dictating notes. HIPAA doesn't care that the transcription service promises encryption; if PHI leaves your device, you need a Business Associate Agreement (BAA) in place. Running locally sidesteps the entire compliance question.
Journalism. Source protection is sacred. Uploading an interview with a whistleblower to any cloud service, no matter how reputable, creates a copy outside your control.
The Kitmul approach to privacy applies the same principle across all its tools: if the computation can run locally, it should.
The accuracy question
Let's be honest. Whisper in a browser is not going to match a dedicated GPU running the full large-v3 model. But the gap is smaller than you'd expect.
For clear, single-speaker English audio (the most common use case for product demos, tutorials, and course videos), the browser version produces usable subtitles with roughly 93-95% word accuracy. Most errors are homophones ("their" vs "there") or uncommon proper nouns.
Where it struggles:
- Heavy accents + background noise. A speaker with a strong regional accent in a noisy cafe will produce more errors than the same speaker in a quiet room.
- Multiple overlapping speakers. Whisper was trained primarily on single-speaker audio. Crosstalk confuses the decoder.
- Domain-specific jargon. Medical terminology, legal Latin, or niche technical vocabulary that didn't appear frequently enough in the training data.
- Very long silent gaps. Extended silence can cause the model to hallucinate repeated phrases, a known Whisper behavior documented in OpenAI's research paper.
For professional broadcast work, you'll still want human review. But for social media, internal videos, educational content, and quick drafts, browser-based Whisper is genuinely good enough.
85% of social media videos play on mute
This statistic from Digiday keeps getting cited because it keeps being true. On Instagram, TikTok, LinkedIn, and Twitter/X, the default playback is muted. If your video doesn't have burned-in captions, most viewers will scroll past it.
The subtitle generator doesn't just export .srt files. It can burn subtitles directly into the video using FFmpeg.wasm, with customizable font size, color, and background opacity. The output is a new MP4 with permanent, embedded captions; no player support needed, no separate file to upload.
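FFmpeg's burn-in path goes through the `subtitles` video filter, with style overrides passed via `force_style` (which takes ASS style fields; colours are `&HAABBGGRR` hex). A small helper that assembles the filter string, sketched here with a couple of the standard fields rather than Kitmul's actual option set:

```javascript
// Build an FFmpeg -vf argument that burns subtitles into the video, e.g.
// "subtitles=subs.srt:force_style='FontSize=28,PrimaryColour=&H00FFFFFF'".
// FontSize and PrimaryColour are standard ASS style fields.
function buildSubtitleFilter(srtFile, { fontSize, primaryColour } = {}) {
  const style = [];
  if (fontSize) style.push(`FontSize=${fontSize}`);
  if (primaryColour) style.push(`PrimaryColour=${primaryColour}`);
  let filter = `subtitles=${srtFile}`;
  if (style.length) filter += `:force_style='${style.join(",")}'`;
  return filter;
}

buildSubtitleFilter("subs.srt", { fontSize: 28, primaryColour: "&H00FFFFFF" });
// → "subtitles=subs.srt:force_style='FontSize=28,PrimaryColour=&H00FFFFFF'"
```

The same argument string works whether you hand it to native ffmpeg on the command line or to FFmpeg compiled to WebAssembly in the browser.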
For content creators who post across multiple platforms, this is the fastest workflow I've found: drop video, wait 2-3 minutes, customize caption style, download captioned video. One tool, zero context switches.

Subtitle formats: VTT vs SRT
If you're not sure which format to use, here's the practical difference:
| Feature | WebVTT (.vtt) | SubRip (.srt) |
|---|---|---|
| Styling | Supports CSS-like styling, positioning | Plain text only |
| Web players | Native HTML5 <track> support | Requires parser |
| YouTube | Accepted | Accepted |
| Social media | Varies | Widely supported |
| Metadata | Headers, comments, notes | Sequence numbers only |
| Spec | W3C standard | De facto standard |
Rule of thumb: use VTT for web-based players and HTML5 video, SRT for everything else. Both are editable in any text editor, so converting between them is trivial.
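How trivial? The differences mostly come down to VTT's required WEBVTT header, a period instead of a comma before the milliseconds, and SRT's mandatory cue sequence numbers (optional in VTT). A minimal SRT-to-VTT converter:

```javascript
// Convert SRT to WebVTT: add the WEBVTT header, swap the comma
// millisecond separator for a period, and drop the cue sequence
// numbers (VTT allows them as cue identifiers, but they're optional).
function srtToVtt(srt) {
  const body = srt
    .replace(/\r/g, "")
    .replace(/(\d{2}:\d{2}:\d{2}),(\d{3})/g, "$1.$2")
    .split("\n\n")
    .map((cue) =>
      cue.split("\n").filter((line) => !/^\d+$/.test(line)).join("\n")
    )
    .join("\n\n");
  return "WEBVTT\n\n" + body;
}

const srt = "1\n00:00:00,000 --> 00:00:01,500\nDrop a video, get subtitles.";
srtToVtt(srt);
// → "WEBVTT\n\n00:00:00.000 --> 00:00:01.500\nDrop a video, get subtitles."
```

Going the other direction is slightly fussier (you have to strip VTT styling and re-add sequence numbers), which is another small argument for exporting SRT when you're unsure what the destination player wants.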
The accessibility angle nobody talks about
Subtitles aren't just a growth hack. Under the Americans with Disabilities Act and the European Accessibility Act (which took full effect in June 2025), video content published by businesses must be accessible to people who are deaf or hard of hearing. This applies to websites, apps, and social media.
Most small businesses and solo creators don't subtitle their videos because the cost and friction are too high. A free, instant, no-upload tool removes that excuse entirely.
If accessibility is part of your workflow, the Accessibility Tree Visualizer on Kitmul can help you audit your web content's ARIA structure alongside your captioning efforts.
Complementary workflow tools
Subtitles are one step in a content pipeline. Here's how other Kitmul tools fit:
- Extract Audio from Video — Pull the audio track from any video before processing. Useful if you want to run Speech to Text separately for a full transcript.
- Video Trimmer — Cut your video to the relevant segment before generating subtitles. Processing a 2-minute clip is faster than a 30-minute recording.
- Audio Stem Splitter — Isolate vocals from background music before transcription. Cleaner audio input produces more accurate subtitles.
- Text Readability Scorer — Paste your subtitle text to check if the language is appropriate for your audience's reading level.
- Keyword Extractor — Pull keywords from your transcript for SEO metadata, video tags, and content optimization.
All of these run in your browser. No uploads. No accounts. They compose well because they all operate on the same principle: your data stays on your device.
Try it
The Automatic Subtitle Generator is free. Drop an MP4, WebM, MOV, or MKV file. Choose your language (or let the AI auto-detect from 90+ supported languages). Wait a couple of minutes. Download your subtitles or your captioned video.
No account. No upload. No watermark. No daily limit.
If you're building content at scale and want the transcript as raw text instead of timed subtitles, the Speech to Text tool handles that use case with the same Whisper model.