Local Voice-to-Text: Privacy in Your Browser

Talking to your computer is normal behavior in 2026. Tarsk now lets you do it without sending your voice to third-party transcription servers. The new local voice-to-text feature runs OpenAI's Whisper model entirely inside your browser using WebAssembly, so you can dictate code and prompts while your audio stays on your own hardware.

The Privacy Cost of Voice Input

Most speech recognition software operates by sending your recorded voice to a cloud datacenter. The datacenter processes your audio, generates the text, and sends it back. While this approach works well, it requires you to trust that the provider is not archiving your voice files or using them to train other models.

For developers working on proprietary codebases, this is a clear security concern. Describing sensitive company logic in a voice prompt should not require a security review. Tarsk resolves this by keeping the entire transcription loop local.

To start using the feature, make sure your microphone is connected and click the microphone icon in the chat input area. The application will guide you through the one-time model download.

How It Works: Whisper in WebAssembly

The voice-to-text feature uses a WebAssembly port of the Whisper model. When you click the microphone icon, the application processes your speech in the following steps:

Microphone Capture: the browser requests access to your microphone using native APIs and applies echo cancellation, noise suppression, and automatic gain control to clean up the signal before transcription.
Audio Conversion: the captured audio stream is converted into a 16 kHz mono Float32Array, which is the format the Whisper engine requires.
Local Inference: the WebAssembly engine processes the audio array and produces the text segments.

The audio is processed as soon as you stop speaking. Because the processing happens in your browser's memory, the raw audio data never touches the network.
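To make the capture and conversion steps concrete, here is a minimal sketch of how processed microphone audio could be reduced to 16 kHz mono samples. The names micConstraints, toMono, and resampleLinear are hypothetical and not taken from Tarsk's source. The linear interpolation is also a simplification: a production resampler would normally apply a low-pass filter first to avoid aliasing.

```typescript
// Hypothetical constraints object. In a browser it would be passed to
// navigator.mediaDevices.getUserMedia(micConstraints).
const micConstraints = {
  audio: {
    echoCancellation: true,
    noiseSuppression: true,
    autoGainControl: true,
  },
};

// Sample rate named in the article as the Whisper input format.
const WHISPER_SAMPLE_RATE = 16000;

// Average all channels into one mono channel.
// Assumes every channel has the same number of samples.
function toMono(channels: Float32Array[]): Float32Array {
  if (channels.length === 0) return new Float32Array(0);
  const length = channels[0].length;
  const mono = new Float32Array(length);
  for (const channel of channels) {
    for (let i = 0; i < length; i++) {
      mono[i] += channel[i] / channels.length;
    }
  }
  return mono;
}

// Resample by linear interpolation between neighbouring input samples.
function resampleLinear(
  input: Float32Array,
  fromRate: number,
  toRate: number = WHISPER_SAMPLE_RATE,
): Float32Array {
  if (fromRate === toRate) return input.slice();
  const outLength = Math.floor((input.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const step = fromRate / toRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1);
    const frac = pos - i0;
    out[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return out;
}
```

With a typical 48 kHz stereo capture, toMono followed by resampleLinear(mono, 48000) yields one 16 kHz sample for every three input samples, which is the shape of array the inference step expects.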

Technical Optimizations

Running a 550B-parameter model locally in a browser is not practical, but Whisper is small enough if you apply the right optimizations. Tarsk uses three main technical strategies to keep transcription fast:

WebAssembly SIMD: the local transcription engine requires WebAssembly SIMD (Single Instruction, Multiple Data) support. SIMD allows your processor to perform the same calculation on multiple data points simultaneously. If your browser does not support WebAssembly SIMD, Tarsk disables voice input to prevent your computer from freezing.
Multi-threaded Execution: if your browser supports shared memory through SharedArrayBuffer, Tarsk runs the transcription across up to four CPU threads. Without this support, Tarsk falls back to a single thread, which is slower but still functional.
IndexedDB Caching: the first time you activate voice-to-text, Tarsk downloads the Whisper model from the server. The model binary is approximately 75 megabytes. To avoid downloading this file every time you open the app, Tarsk saves the model in your browser's IndexedDB storage. Later sessions load the model from this local cache instead of the network.
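The decisions in the list above can be sketched as follows. This is an illustration under stated assumptions, not Tarsk's actual code: detectCapabilities, planTranscription, ModelStore, and loadModel are hypothetical names. The SIMD probe is a tiny WebAssembly module that only validates on runtimes with SIMD support, and the cache is written against an injected store interface so the same cache-first logic could sit on top of IndexedDB.

```typescript
// Minimal module whose single function returns a v128 and executes
// i8x16.splat and i8x16.popcnt. It validates only where SIMD is supported.
const SIMD_PROBE = new Uint8Array([
  0, 97, 115, 109, 1, 0, 0, 0, 1, 5, 1, 96, 0, 1, 123, 3, 2, 1, 0, 10, 10, 1,
  8, 0, 65, 0, 253, 15, 253, 98, 11,
]);

interface Capabilities {
  simd: boolean;
  sharedMemory: boolean;
  cores: number;
}

function detectCapabilities(): Capabilities {
  // Read globals loosely so the sketch also loads outside a browser.
  const env = globalThis as unknown as Record<string, any>;
  const wasm = env["WebAssembly"];
  return {
    simd: typeof wasm?.validate === "function" && wasm.validate(SIMD_PROBE) === true,
    // Browsers typically expose SharedArrayBuffer only on cross-origin
    // isolated pages, so a missing constructor means no shared memory.
    sharedMemory: typeof env["SharedArrayBuffer"] === "function",
    cores: Number(env["navigator"]?.hardwareConcurrency) || 1,
  };
}

const MAX_THREADS = 4; // "up to four CPU threads"

interface Plan {
  enabled: boolean;
  threads: number;
}

function planTranscription(caps: Capabilities): Plan {
  if (!caps.simd) return { enabled: false, threads: 0 }; // voice input disabled
  if (!caps.sharedMemory) return { enabled: true, threads: 1 }; // single-thread fallback
  return { enabled: true, threads: Math.max(1, Math.min(MAX_THREADS, caps.cores)) };
}

// Hypothetical storage abstraction; an IndexedDB-backed object would implement it.
interface ModelStore {
  get(key: string): Promise<Uint8Array | undefined>;
  put(key: string, data: Uint8Array): Promise<void>;
}

// Cache-first load: download only when the store has no copy yet.
async function loadModel(
  store: ModelStore,
  key: string,
  download: () => Promise<Uint8Array>,
): Promise<Uint8Array> {
  const cached = await store.get(key);
  if (cached) return cached;
  const fresh = await download();
  await store.put(key, fresh);
  return fresh;
}
```

Keeping planTranscription separate from detection makes the fallback rules easy to check in isolation: for example, a machine reporting eight cores with SIMD and shared memory is planned at four threads, while the same machine without shared memory is planned at one.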

Handling the Occasional Glitch

Local WebAssembly engines can occasionally hit memory limits and abort. If the Whisper engine crashes during transcription, Tarsk detects the failure and automatically restarts the model. This reinitialization happens in the background, so you can try dictating again without reloading the entire application.
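One way to picture this recovery behavior is a small wrapper that replaces a failed engine with a fresh one. The sketch below is an assumption about the general pattern, not Tarsk's implementation: the Engine interface, the RestartingTranscriber class, and the createEngine factory are all hypothetical, and the real Whisper bindings may look different.

```typescript
// Hypothetical engine interface; the real WebAssembly bindings may differ.
interface Engine {
  transcribe(audio: Float32Array): Promise<string>;
}

class RestartingTranscriber {
  private readonly createEngine: () => Promise<Engine>;
  private ready: Promise<Engine>;
  restarts = 0;

  constructor(createEngine: () => Promise<Engine>) {
    this.createEngine = createEngine;
    this.ready = this.start();
  }

  // Begin initializing an engine without blocking the caller.
  private start(): Promise<Engine> {
    const pending = this.createEngine();
    // Avoid an unhandled rejection; the error resurfaces on the next transcribe call.
    pending.catch(() => undefined);
    return pending;
  }

  async transcribe(audio: Float32Array): Promise<string> {
    const engine = await this.ready;
    try {
      return await engine.transcribe(audio);
    } catch (error) {
      // Treat any failure as a crashed engine: start a replacement in the
      // background and report the error so the user can be asked to retry.
      this.restarts += 1;
      this.ready = this.start();
      throw error;
    }
  }
}
```

The failed dictation still surfaces as an error, but the next call to transcribe waits for the replacement engine instead of requiring a page reload.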

Key Takeaways

Strictly Local: transcription runs entirely in your browser, so your voice data never leaves your computer.

Smart Caching: the 75 MB model is cached in IndexedDB, so you only download it once.

Performance Conscious: WebAssembly SIMD and multi-threading keep transcription fast on your own CPU.

Try Voice Dictation in Tarsk

Download the Tarsk desktop client today and experience completely local, privacy-first voice-to-text dictation.
