Best Speech to Text Software for Linux (Free and Offline Options)
Speech to text on Linux is more fragmented than on Windows or Mac — but powerful options exist. This guide covers the best free and offline tools, from Vosk to OpenAI Whisper, with setup examples and honest advice on what actually works.
Julian Sterling
AI Content Strategist
July 3, 2026
9 min read
In This Article
Why Linux Speech to Text Is More Challenging Than Windows or Mac
Best Free Offline Tools for Linux
Tool Comparison
Setting Up Vosk on Ubuntu (Step by Step)
X11 vs Wayland: What Breaks Real-Time Dictation
Commercial and Cloud Options
Browser-Based Alternatives That Work on Linux
Developer vs Desktop User: Clear Recommendations
Final Recommendation

Why Linux Speech to Text Is More Challenging Than Windows or Mac

Speech recognition on Linux is genuinely harder than on other platforms — not because the underlying technology is weaker, but because the ecosystem is fragmented and the infrastructure assumptions differ.
On Windows, voice typing and Windows Speech Recognition are built into the OS, with a shared audio pipeline and system-wide text injection. On macOS, Apple's dictation engine hooks into every text field via accessibility APIs. Linux has neither of these unified layers.
The specific pain points:
Driver and audio subsystem complexity. PulseAudio, PipeWire, and ALSA all handle microphone input differently. Getting a clean, low-latency audio stream to a recognition engine — especially with noise suppression — requires manual configuration that most desktop users won't expect.
X11 vs Wayland split. Most dictation tools inject recognized text using X11's XTest extension (xdotool type). Under Wayland (now the default on GNOME and many distributions), XTest does not work. You need ydotool (which requires a uinput kernel module) or application-specific plugins. This is a real barrier for desktop dictation in 2026.
No GPU acceleration out of the box. The highest-accuracy models (Whisper large) benefit enormously from CUDA or ROCm. Setting up GPU inference on Linux requires driver configuration that is non-trivial, especially on AMD hardware.
Conclusion: For developers building transcription pipelines, Linux is fully capable. For desktop users who want the always-on, system-wide dictation that Windows provides, expect friction.
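To see which of the audio stacks mentioned above is present on a given machine, a quick check is to look for their command-line tools. The sketch below assumes the usual tool names (pw-cli for PipeWire, pactl for PulseAudio or its PipeWire replacement, arecord for ALSA). A tool being installed does not prove that stack is the one handling your microphone, so treat the result as a hint.

```python
import shutil

# Command-line tools that usually ship with each audio stack.
# This mapping is an assumption for illustration, not an exhaustive probe.
AUDIO_TOOLS = {
    "PipeWire": "pw-cli",
    "PulseAudio (or pipewire-pulse)": "pactl",
    "ALSA": "arecord",
}


def detect_audio_tools():
    """Return {stack name: bool} based on whether its CLI tool is on PATH."""
    return {
        name: shutil.which(tool) is not None
        for name, tool in AUDIO_TOOLS.items()
    }


if __name__ == "__main__":
    for name, present in detect_audio_tools().items():
        print(f"{name}: {'found' if present else 'not found'}")
```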

Quick Tip: If you only need to transcribe a short recording, pasting the text into a TTS app afterward is a quick way to proofread by ear — AI Listen can read it back to you on any device.


Best Free Offline Tools for Linux

Vosk — Best for Real-Time, Low-Resource Machines

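Vosk is an offline speech recognition toolkit built on Kaldi. It is designed for streaming: you feed audio chunks in and get partial transcripts back in real time. The small models are around 40–50 MB and run comfortably on a Raspberry Pi. The larger, more accurate models (such as vosk-model-en-us-0.22, used in the Ubuntu walkthrough later in this guide) are well over 1 GB and need correspondingly more RAM.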
Install:
pip install vosk
Download a model from the Vosk model repository, then point the API at it. Vosk has bindings for Python, Java, C#, Go, and other languages, plus a separate server project (vosk-server) for network access, making it easy to embed in applications.
Best for: Real-time dictation apps, embedded devices, projects where latency matters more than accuracy.
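The "feed audio chunks in" model can be sketched with a WAV file standing in for the microphone. This is a minimal sketch: iter_chunks and transcribe_wav are hypothetical helper names, and the input is assumed to be mono 16-bit PCM. The Vosk calls used (Model, KaldiRecognizer, AcceptWaveform, Result, FinalResult) are the same family as in the microphone script later in this guide.

```python
import json
import wave


def iter_chunks(wav_path, frames_per_chunk=4000):
    """Yield (sample_rate, raw_bytes) chunks from a mono 16-bit PCM WAV file."""
    with wave.open(wav_path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected mono 16-bit PCM audio")
        rate = wav.getframerate()
        while True:
            data = wav.readframes(frames_per_chunk)
            if not data:
                break
            yield rate, data


def transcribe_wav(wav_path, model_dir):
    """Feed a WAV file to Vosk chunk by chunk and return the joined text."""
    import vosk  # imported lazily so iter_chunks works without Vosk installed

    model = vosk.Model(model_dir)
    recognizer = None
    pieces = []
    for rate, data in iter_chunks(wav_path):
        if recognizer is None:
            # The recognizer needs the sample rate of the incoming audio.
            recognizer = vosk.KaldiRecognizer(model, rate)
        # AcceptWaveform returns True once an utterance is complete.
        if recognizer.AcceptWaveform(data):
            pieces.append(json.loads(recognizer.Result()).get("text", ""))
    if recognizer is not None:
        # Flush whatever audio is still buffered at end of file.
        pieces.append(json.loads(recognizer.FinalResult()).get("text", ""))
    return " ".join(p for p in pieces if p)
```

Calling transcribe_wav("meeting.wav", "vosk-model-en-us-0.22") would return the recognized text, where both arguments are placeholders for your own file and unpacked model directory.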

OpenAI Whisper — Best Accuracy, Offline, GPU Optional

Whisper is OpenAI's general-purpose speech recognition model, released as open source. It is trained on 680,000 hours of multilingual audio and handles accents, background noise, and technical vocabulary better than any other free tool on Linux.
Install:
pip install openai-whisper
Transcribe a file (Whisper decodes audio through ffmpeg, so ffmpeg must be installed):
whisper audio.mp3 --model medium
Models available include tiny, base, small, medium, and large (with versioned variants such as large-v3). The medium model is a reasonable default for most use cases: good accuracy, and workable on CPU, although long recordings will be slow without a GPU.
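For GPU acceleration (NVIDIA), install a CUDA build of PyTorch first and Whisper second. The --index-url flag replaces PyPI for that command, so openai-whisper cannot be installed in the same call. Pick the CUDA tag (cu118 here) that matches your driver:
pip install torch --index-url https://download.pytorch.org/whl/cu118
pip install openai-whisper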
Best for: Batch transcription, subtitle generation, podcast processing, high-accuracy requirements.
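For subtitle generation from Python rather than the command line, the following sketch loads a model, picks a GPU when PyTorch reports one, and turns the returned segments into SRT text. The helper names (pick_device, srt_timestamp, segments_to_srt, transcribe_to_srt) are illustrative, not part of Whisper. The Whisper calls used are whisper.load_model and model.transcribe, whose result contains a "segments" list with start, end, and text fields.

```python
def pick_device():
    """Return "cuda" if PyTorch sees a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Turn Whisper-style segments (dicts with start, end, text) into SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


def transcribe_to_srt(audio_path, model_name="medium"):
    """Transcribe a file with Whisper and return SRT-formatted subtitles."""
    import whisper  # provided by the openai-whisper package

    model = whisper.load_model(model_name, device=pick_device())
    result = model.transcribe(audio_path)
    return segments_to_srt(result["segments"])
```

If you only need the subtitle file, the whisper command also accepts --output_format srt, which avoids writing any Python at all.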

Mozilla DeepSpeech — Legacy, Still Used

Mozilla's DeepSpeech was an early pioneer in open-source speech recognition. Mozilla wound the project down, and development continued in Coqui STT, a fork started by former DeepSpeech engineers. Neither project is actively developed today, but both still turn up in existing integrations.
Install (Coqui STT fork):
pip install stt
DeepSpeech/Coqui is worth knowing because many existing Linux integrations and home automation setups still depend on it. If you are maintaining an existing project, it still works. For new projects, Vosk or Whisper are the better starting points.
Best for: Legacy projects, existing Home Assistant and Node-RED integrations, Python environments where Whisper's dependencies are too heavy.

Tool Comparison

| Tool | Accuracy | Real-Time | Offline | GPU Needed | Ease of Setup |
| --- | --- | --- | --- | --- | --- |
| Vosk | Good | Yes | Yes | No | Easy |
| Whisper (medium) | Excellent | No* | Yes | Optional | Moderate |
| Whisper (large) | Best | No | Yes | Recommended | Moderate |
| DeepSpeech / Coqui | Fair | Yes | Yes | No | Moderate |
| Google Speech API | Excellent | Yes | No | No | Easy (API key) |
| Azure Speech | Excellent | Yes | No | No | Easy (API key) |
*Whisper has a streaming variant (whisper-streaming on GitHub) but it is a community tool, not the official package.

Setting Up Vosk on Ubuntu (Step by Step)

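  1. Install dependencies. Recent Ubuntu releases block system-wide pip installs, so create a virtual environment first:
sudo apt update
sudo apt install python3-pip python3-venv ffmpeg portaudio19-dev
python3 -m venv ~/vosk-env
source ~/vosk-env/bin/activate
pip install vosk sounddevice
  2. Download a model:
wget https://alphacephei.com/vosk/models/vosk-model-en-us-0.22.zip
unzip vosk-model-en-us-0.22.zip
  3. Run a basic real-time transcription script:
import json

import sounddevice as sd
import vosk

# Load the model directory unpacked in step 2.
model = vosk.Model("vosk-model-en-us-0.22")
# The recognizer must be told the sample rate of the audio it will receive.
recognizer = vosk.KaldiRecognizer(model, 16000)

# Open the default microphone as 16 kHz, 16-bit mono.
with sd.RawInputStream(samplerate=16000, blocksize=8000,
                       dtype='int16', channels=1) as stream:
    print("Listening... Press Ctrl+C to stop.")
    while True:
        data, _ = stream.read(8000)
        # AcceptWaveform returns True when an utterance is complete.
        if recognizer.AcceptWaveform(bytes(data)):
            result = json.loads(recognizer.Result())
            print(result.get("text", ""))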
This gives you a basic working dictation loop in about 20 lines.
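The loop above only prints once an utterance is complete. For dictation that feels live, you can also show Vosk's partial results while the speaker is still talking. The sketch below is one way to do that: classify_result and run_dictation are hypothetical names, and the model directory is whatever you unpacked earlier. It relies on PartialResult() returning JSON with a "partial" key and Result() returning JSON with a "text" key.

```python
import json


def classify_result(raw_json):
    """Return ("partial", text) or ("final", text) for a Vosk JSON string."""
    payload = json.loads(raw_json)
    if "partial" in payload:
        return "partial", payload["partial"]
    return "final", payload.get("text", "")


def run_dictation(model_dir, samplerate=16000, blocksize=8000):
    """Print partial results in place and final results on their own line."""
    import sounddevice as sd  # imported lazily: both need extra packages
    import vosk

    model = vosk.Model(model_dir)
    recognizer = vosk.KaldiRecognizer(model, samplerate)
    with sd.RawInputStream(samplerate=samplerate, blocksize=blocksize,
                           dtype="int16", channels=1) as stream:
        print("Listening... Press Ctrl+C to stop.")
        try:
            while True:
                data, _overflowed = stream.read(blocksize)
                if recognizer.AcceptWaveform(bytes(data)):
                    kind, text = classify_result(recognizer.Result())
                else:
                    kind, text = classify_result(recognizer.PartialResult())
                if text:
                    # Overwrite the current line for partials, commit finals.
                    end = "\n" if kind == "final" else "\r"
                    print(text, end=end, flush=True)
        except KeyboardInterrupt:
            print("\nStopped.")
```

Catching KeyboardInterrupt lets Ctrl+C end the session cleanly instead of printing a traceback.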

X11 vs Wayland: What Breaks Real-Time Dictation

If you have moved to a Wayland session (likely the default on Ubuntu 22.04+, Fedora, and most modern GNOME desktops), standard dictation tools will transcribe audio correctly but fail to type text into your active window.
The root cause: xdotool type relies on X11's XTest extension, which is not available in Wayland compositors. Tools like Nerd Dictation, Kaldi-based pipelines, and many dictation scripts use xdotool by default.
Workarounds:
  • ydotool: A uinput-based alternative that works on Wayland. Requires loading the uinput kernel module and running as root or with appropriate udev rules.
sudo apt install ydotool
sudo modprobe uinput
  • XWayland: xdotool can still type into applications that run under XWayland. Most GTK and Qt apps can be forced onto it, but native Wayland windows stay out of reach, so this does not give you universal system-wide injection.
  • GNOME Shell extension: Some extensions expose a DBus interface for text input. Works reliably for GNOME-native apps.
  • Application plugins: VS Code, Emacs, and some IDEs have their own speech input plugins that bypass the X11/Wayland issue entirely.
Practical advice: If desktop-wide dictation is your goal and you are on Wayland, expect to spend time setting up ydotool. If you only need transcription (not live injection), X11 vs Wayland does not matter at all — just write the output to a file or clipboard.
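A dictation script can pick the right injection tool at runtime instead of hard-coding xdotool. The sketch below reads the XDG_SESSION_TYPE environment variable, which desktop sessions normally set to x11 or wayland, and builds the command to run. The function names are hypothetical, and the env and which parameters exist only to make the logic easy to test.

```python
import os
import shutil


def session_type(env=None):
    """Return the display session type ("x11", "wayland", ...) or "unknown"."""
    env = os.environ if env is None else env
    return env.get("XDG_SESSION_TYPE", "").lower() or "unknown"


def injection_command(text, env=None, which=shutil.which):
    """Return an argv list that types `text`, or None if no tool is usable."""
    if session_type(env) == "wayland":
        # xdotool cannot reach native Wayland windows, so do not offer it.
        candidates = ["ydotool"]
    else:
        # On X11 prefer xdotool; ydotool also works because it uses uinput.
        candidates = ["xdotool", "ydotool"]
    for tool in candidates:
        if which(tool):
            return [tool, "type", text]
    return None


if __name__ == "__main__":
    command = injection_command("hello from dictation")
    print(command if command else "No injection tool found; write to a file instead.")
```

Pass the returned list to subprocess.run to actually type the text. When it returns None, falling back to a file or the clipboard keeps the transcription usable.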

Commercial and Cloud Options

If offline processing is not a requirement, cloud APIs are the easiest path on Linux:
  • Google Cloud Speech-to-Text: High accuracy, pay-per-use, excellent Python SDK.
  • Azure Cognitive Services Speech: Competitive accuracy, real-time streaming support.
  • AssemblyAI / Rev AI: Good developer experience, competitive pricing, useful speaker diarization features.
All of these work on Linux through standard HTTP/REST or their Python SDKs — there is no OS-specific limitation.
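As an illustration of the REST route, here is a sketch of a synchronous request to Google Cloud Speech-to-Text using only the Python standard library. It assumes the v1 speech:recognize endpoint, API-key authentication, and raw 16-bit PCM input. Field names follow the v1 REST reference as I understand it, so check the current documentation before relying on it. build_request_body and recognize are hypothetical helper names.

```python
import base64
import json
import urllib.request

# Assumed endpoint for the v1 synchronous recognition method.
ENDPOINT = "https://speech.googleapis.com/v1/speech:recognize"


def build_request_body(pcm_bytes, sample_rate=16000, language="en-US"):
    """Build the JSON body: recognition config plus base64-encoded audio."""
    return {
        "config": {
            "encoding": "LINEAR16",
            "sampleRateHertz": sample_rate,
            "languageCode": language,
        },
        "audio": {"content": base64.b64encode(pcm_bytes).decode("ascii")},
    }


def recognize(pcm_bytes, api_key):
    """Send audio for recognition and return a list of transcripts."""
    body = json.dumps(build_request_body(pcm_bytes)).encode("utf-8")
    request = urllib.request.Request(
        f"{ENDPOINT}?key={api_key}",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    # Each result carries ranked alternatives; keep the top one.
    return [r["alternatives"][0]["transcript"] for r in payload.get("results", [])]
```

Nothing here is Linux-specific, which is the point: the same code runs unchanged on any OS with Python and network access.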

Browser-Based Alternatives That Work on Linux

If you just need occasional dictation without installing anything, the Web Speech API works in Google Chrome on Linux and in some other Chromium-based browsers such as Edge. Open any website that uses the API (Google Docs voice typing is the most accessible example) and dictation works through the browser, bypassing the X11/Wayland injection problems entirely. Note that the browser sends your audio to a cloud service for recognition, so this route needs an internet connection and is not an offline option.
This is the most practical path for users who want a "just works" solution for occasional note-taking or form filling.

Developer vs Desktop User: Clear Recommendations

If you are a developer building a transcription pipeline, processing audio files, or adding voice input to an application:
  • Start with Whisper for batch/file work (best accuracy)
  • Use Vosk for real-time streaming or low-latency requirements
  • Use cloud APIs if you need speaker diarization, punctuation recovery, or multilingual handling at scale
If you are a desktop user wanting dictation to replace typing:
  • Try Chrome/Chromium + Google Docs voice typing first — zero setup, works on Wayland
  • If you want offline system-wide dictation: install Nerd Dictation (uses Vosk), then configure ydotool if you're on Wayland
  • Expect to spend 30–60 minutes on initial setup
If you also work with text-to-speech — converting written content back to audio for review, accessibility, or publishing — AI Listen covers the reverse direction and works across all platforms without any Linux-specific configuration.

Final Recommendation

For most users arriving at this page, the practical answer is:
  • Vosk if you need real-time, offline, on modest hardware
  • Whisper if accuracy matters and you are processing recorded audio
  • Browser dictation if you just need it to work now without setup
Linux speech to text in 2026 is capable but still requires more manual effort than Windows or macOS. The tools are there — the polish is not. For developers, that is fine. For desktop users, the honest advice is: start with the browser, graduate to Vosk when you need more control.


Frequently Asked Questions
Is there a native speech-to-text tool built into Linux?
Most Linux distributions do not ship a built-in dictation tool. GNOME has experimented with a speech input feature, but it requires a network connection and is not widely available across distros. Third-party tools like Nerd Dictation or KDE's voice input are the closest to a native experience.
Which Linux speech to text tool is most accurate?
OpenAI Whisper consistently delivers the highest accuracy among free offline tools, especially its medium and large models. The tradeoff is speed and hardware: larger models are slow on CPU-only machines and benefit significantly from a GPU.
Can I use speech to text in real time on Linux?
Real-time dictation on Linux is possible but requires extra work. Vosk has a streaming API and tools like Nerd Dictation can pipe audio to it continuously. Wayland desktops make system-wide dictation harder because most injection tools rely on X11's XTest extension.
Does OpenAI Whisper work offline on Linux?
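Yes. Whisper runs entirely locally once the model files are downloaded. No internet connection is needed at inference time. Models range from about 39 million parameters (tiny, a download of roughly 75 MB) to about 1.55 billion parameters (large, roughly 3 GB), and are cached on disk after the first download.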
What is the difference between Vosk and Whisper for Linux?
Vosk is optimized for real-time, low-latency transcription and works well on modest hardware. Whisper prioritizes accuracy and handles multiple languages and accents better, but is slower without a GPU. For live dictation, Vosk; for batch transcription, Whisper.
Will Linux speech to text work on Wayland?
Vosk, Whisper, and DeepSpeech can all transcribe audio on Wayland — the limitation is text injection, not transcription. Tools that type recognized text into the active window (like Nerd Dictation using xdotool) require X11. On Wayland, you need ydotool or application-level integration as a workaround.
