
If you have landed here, you probably need one of three things: a local, zero-API-cost transcription pipeline; a cloud-backed solution with enterprise accuracy; or just a clear comparison so you can stop reading forum threads and start writing code. This guide covers all three.
We will walk through whisper.cpp, Google Cloud Speech-to-Text, and Microsoft Azure Cognitive Services — each with a real code example, honest setup cost, and a side-by-side comparison at the end. By the time you finish reading, you will know exactly which approach fits your constraints.
There are four realistic paths for speech-to-text in C++ today:
| Approach | Runs Locally | Cost | Accuracy | Setup Time |
|---|---|---|---|---|
| whisper.cpp | Yes | Free | High (large-v3) | Medium |
| Google Cloud STT C++ SDK | No | Pay-per-minute | Very High | Medium |
| Azure Cognitive Services SDK | No | Pay-per-minute | Very High | Medium |
| Kaldi | Yes | Free | Research-grade | High |
Kaldi is worth acknowledging but not the focus here. It is a research toolkit with a steep learning curve, complex build system, and limited community support for production C++ integration in 2025. Whisper and the cloud APIs cover the vast majority of real-world use cases better.
Pick your path by working through these questions:
Privacy or offline requirement? → whisper.cpp. No audio leaves the device.
Maximum accuracy on diverse audio, and you can tolerate API costs? → Google Cloud or Azure.
Existing Microsoft infrastructure? → Azure SDK.
Building a consumer product where users supply their own API key? → Google Cloud has better developer documentation and broader language model support.
Embedded or edge deployment (Raspberry Pi, automotive, robotics)? → whisper.cpp on tiny or base model.
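The questions above can be read as an ordered checklist. Here is a minimal sketch of that logic in C++. The `Constraints` struct, the `Backend` enum, the order of the checks, and the choice of Google Cloud as the fallback for the remaining cloud case are all illustrative assumptions, not part of any library.

```cpp
#include <string>

// Hypothetical description of a project's constraints (names are illustrative).
struct Constraints {
    bool needs_offline_or_privacy = false;  // audio must stay on the device
    bool embedded_target = false;           // Raspberry Pi, automotive, robotics
    bool on_microsoft_stack = false;        // existing Azure infrastructure
};

enum class Backend { WhisperCpp, GoogleCloud, Azure };

// Assumed ordering: hard constraints (offline, embedded) first, then ecosystem fit.
Backend choose_backend(const Constraints& c) {
    if (c.needs_offline_or_privacy || c.embedded_target) return Backend::WhisperCpp;
    if (c.on_microsoft_stack) return Backend::Azure;
    return Backend::GoogleCloud;  // assumed default for the remaining cloud case
}

std::string backend_name(Backend b) {
    switch (b) {
        case Backend::WhisperCpp:  return "whisper.cpp";
        case Backend::GoogleCloud: return "Google Cloud STT";
        case Backend::Azure:       return "Azure STT";
    }
    return "unknown";
}
```

Putting the hard constraints first means an offline requirement always wins, even for a team that otherwise lives on Azure.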
whisper.cpp is a C/C++ port of OpenAI's Whisper model. It uses ggml as its tensor backend, runs on CPU without any Python or PyTorch dependency, and supports CUDA, Metal, and OpenCL for GPU acceleration.
Best for: offline apps, privacy-sensitive deployments, embedded targets, cost-sensitive products, and developers who want full control of the inference pipeline.
git clone https://github.com/ggml-org/whisper.cpp
cd whisper.cpp
# Download the base English model (~142 MB)
bash ./models/download-ggml-model.sh base.en
# Build with CMake
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release -j$(nproc)
For Apple Silicon with Metal acceleration:
cmake .. -DWHISPER_METAL=ON -DCMAKE_BUILD_TYPE=Release
For CUDA on Linux:
cmake .. -DWHISPER_CUDA=ON -DCMAKE_BUILD_TYPE=Release
whisper.cpp expects 16 kHz mono 16-bit PCM audio. The simplest integration loads a WAV file and calls the C API:
#include "whisper.h"
#include <cstdint>
#include <cstdio>
#include <vector>
// Reads 16-bit PCM samples and converts them to floats in [-1, 1).
// Assumes a canonical 44-byte header; WAV files with extra chunks need real header parsing.
bool load_wav_f32(const char* path, std::vector<float>& samples) {
    FILE* f = fopen(path, "rb");
    if (!f) return false;
    fseek(f, 44, SEEK_SET); // Skip 44-byte WAV header
    int16_t sample;
    while (fread(&sample, sizeof(int16_t), 1, f) == 1)
        samples.push_back(sample / 32768.0f);
    fclose(f);
    return true;
}
int main(int argc, char** argv) {
    if (argc < 3) {
        fprintf(stderr, "Usage: %s <model.bin> <audio.wav>\n", argv[0]);
        return 1;
    }
    // Load the ggml model from disk.
    struct whisper_context_params cparams = whisper_context_default_params();
    struct whisper_context* ctx = whisper_init_from_file_with_params(argv[1], cparams);
    if (!ctx) { fprintf(stderr, "Failed to load model\n"); return 1; }
    std::vector<float> pcm;
    if (!load_wav_f32(argv[2], pcm)) {
        fprintf(stderr, "Failed to load audio\n");
        whisper_free(ctx);
        return 1;
    }
    // Greedy decoding is the fastest sampling strategy.
    struct whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    wparams.print_progress = false;
    wparams.print_timestamps = true;
    wparams.language = "en";
    wparams.n_threads = 4;
    if (whisper_full(ctx, wparams, pcm.data(), (int)pcm.size()) != 0) {
        fprintf(stderr, "Transcription failed\n");
        whisper_free(ctx);
        return 1;
    }
    // Print each decoded segment in order.
    const int n_segments = whisper_full_n_segments(ctx);
    for (int i = 0; i < n_segments; ++i)
        printf("[%d] %s\n", i, whisper_full_get_segment_text(ctx, i));
    whisper_free(ctx);
    return 0;
}
Link against the whisper library:
g++ -std=c++17 -O2 transcribe.cpp \
-I../whisper.cpp/include \
-L../whisper.cpp/build/src \
-lwhisper -lm -o transcribe
Google Cloud Speech-to-Text supports over 125 languages, speaker diarization, automatic punctuation, and domain-specific models.
Best for: production apps requiring high accuracy across accents and languages. Pricing: ~$0.006 per 15 seconds for standard models.
vcpkg install google-cloud-cpp[speech]
Set credentials first:
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
#include "google/cloud/speech/v1/speech_client.h"
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>
namespace speech = ::google::cloud::speech_v1;
int main(int argc, char* argv[]) {
    if (argc < 2) { std::cerr << "Usage: " << argv[0] << " <audio-file>\n"; return 1; }
    // Read the whole audio file into memory.
    std::ifstream file(argv[1], std::ios::binary);
    if (!file) { std::cerr << "Failed to open " << argv[1] << "\n"; return 1; }
    std::vector<char> audio_data(
        (std::istreambuf_iterator<char>(file)),
        std::istreambuf_iterator<char>()
    );
    auto client = speech::SpeechClient(speech::MakeSpeechConnection());
    // Describe the audio: 16 kHz, 16-bit linear PCM, US English.
    google::cloud::speech::v1::RecognizeRequest request;
    auto& config = *request.mutable_config();
    config.set_encoding(google::cloud::speech::v1::RecognitionConfig::LINEAR16);
    config.set_sample_rate_hertz(16000);
    config.set_language_code("en-US");
    config.set_enable_automatic_punctuation(true);
    config.set_model("latest_long");
    request.mutable_audio()->set_content(
        std::string(audio_data.begin(), audio_data.end())
    );
    // Synchronous call: blocks until the service returns the full transcript.
    auto response = client.Recognize(request);
    if (!response) {
        std::cerr << "Error: " << response.status().message() << "\n";
        return 1;
    }
    for (const auto& result : response->results())
        for (const auto& alt : result.alternatives())
            std::cout << "Transcript: " << alt.transcript()
                      << " (confidence: " << alt.confidence() << ")\n";
    return 0;
}
Limitation: The synchronous Recognize call requires audio under 60 seconds. For longer audio, use LongRunningRecognize or the streaming gRPC API.
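If you want to stay on the synchronous call, one workaround (not the LongRunningRecognize route mentioned above) is to cut the PCM buffer into pieces shorter than 60 seconds and send one request per piece. The sketch below shows only the splitting step. The function name and the 55-second default are assumptions, and cutting at fixed offsets can split a word across two chunks.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Splits 16-bit mono PCM into consecutive chunks of at most max_seconds each.
// Hypothetical helper: it knows nothing about the Google client library.
std::vector<std::vector<std::int16_t>> split_pcm(
    const std::vector<std::int16_t>& samples,
    int sample_rate_hz,
    int max_seconds = 55) {
    std::vector<std::vector<std::int16_t>> chunks;
    if (sample_rate_hz <= 0 || max_seconds <= 0) return chunks;
    const std::size_t max_samples =
        static_cast<std::size_t>(sample_rate_hz) * static_cast<std::size_t>(max_seconds);
    for (std::size_t start = 0; start < samples.size(); start += max_samples) {
        const std::size_t end = std::min(samples.size(), start + max_samples);
        chunks.emplace_back(samples.begin() + static_cast<std::ptrdiff_t>(start),
                            samples.begin() + static_cast<std::ptrdiff_t>(end));
    }
    return chunks;
}
```

Each chunk can then be copied into its own request. A more careful version would cut at silences rather than at fixed offsets.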
Azure Speech SDK supports real-time recognition, batch transcription, speaker identification, and custom acoustic models.
Best for: teams on Azure infrastructure, custom voice models, Windows-first deployments. Pricing: ~$1.00 per audio hour standard.
# Download prebuilt SDK for Linux (x64)
wget https://aka.ms/csspeech/linuxpackage -O SpeechSDK-Linux.tar.gz
tar -xzf SpeechSDK-Linux.tar.gz
Or via vcpkg:
vcpkg install microsoft-cognitiveservices-speech-sdk
#include <speechapi_cxx.h>
#include <cstdlib>
#include <iostream>
#include <string>
using namespace Microsoft::CognitiveServices::Speech;
using namespace Microsoft::CognitiveServices::Speech::Audio;
int main() {
    // std::getenv returns nullptr for unset variables, so check before building strings.
    const char* key = std::getenv("AZURE_SPEECH_KEY");
    const char* region = std::getenv("AZURE_SPEECH_REGION");
    if (!key || !region) {
        std::cerr << "Set AZURE_SPEECH_KEY and AZURE_SPEECH_REGION first\n";
        return 1;
    }
    auto config = SpeechConfig::FromSubscription(key, region);
    config->SetSpeechRecognitionLanguage("en-US");
    config->EnableDictation();
    auto audio_config = AudioConfig::FromWavFileInput("audio.wav");
    auto recognizer = SpeechRecognizer::FromConfig(config, audio_config);
    // Recognizes a single utterance and blocks until the result is ready.
    auto result = recognizer->RecognizeOnceAsync().get();
    if (result->Reason == ResultReason::RecognizedSpeech)
        std::cout << "Recognized: " << result->Text << "\n";
    else if (result->Reason == ResultReason::Canceled) {
        auto cancel = CancellationDetails::FromResult(result);
        std::cerr << "Canceled: " << cancel->ErrorDetails << "\n";
        return 1;
    }
    return 0;
}
Compile:
g++ -std=c++17 azure_stt.cpp \
-I/path/to/SpeechSDK/include/cxx_api \
-L/path/to/SpeechSDK/lib/x64 \
-lMicrosoft.CognitiveServices.Speech.core \
-o azure_stt
This example ties whisper.cpp to live microphone input using PortAudio for cross-platform audio capture. It captures 3-second chunks and feeds them to whisper for near-real-time transcription.
# Ubuntu/Debian
sudo apt-get install portaudio19-dev
# macOS
brew install portaudio
#include "whisper.h"
#include <portaudio.h>
#include <atomic>
#include <chrono>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>
static const int SAMPLE_RATE = 16000;
static const int CHUNK_SECONDS = 3;
static const int CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS;
// State shared between the PortAudio callback thread and the main loop.
struct AudioCapture {
    std::queue<std::vector<float>> chunks;  // completed 3-second chunks
    std::vector<float> buffer;              // samples for the chunk in progress
    std::mutex mtx;
    std::atomic<bool> running{true};
};
// Locking and allocating inside an audio callback is fine for a demo, but a
// production build should hand samples off through a lock-free ring buffer.
static int pa_callback(const void* input, void*, unsigned long frame_count,
                       const PaStreamCallbackTimeInfo*, PaStreamCallbackFlags, void* user_data)
{
    auto* cap = static_cast<AudioCapture*>(user_data);
    const auto* samples = static_cast<const float*>(input);
    if (!samples) return cap->running ? paContinue : paComplete;
    std::lock_guard<std::mutex> lock(cap->mtx);
    cap->buffer.insert(cap->buffer.end(), samples, samples + frame_count);
    if ((int)cap->buffer.size() >= CHUNK_SAMPLES) {
        // Move one full chunk to the queue and keep the remainder buffered.
        cap->chunks.push(std::vector<float>(
            cap->buffer.begin(), cap->buffer.begin() + CHUNK_SAMPLES));
        cap->buffer.erase(cap->buffer.begin(), cap->buffer.begin() + CHUNK_SAMPLES);
    }
    return cap->running ? paContinue : paComplete;
}
int main(int argc, char** argv) {
    if (argc < 2) { fprintf(stderr, "Usage: %s <model.bin>\n", argv[0]); return 1; }
    struct whisper_context_params cparams = whisper_context_default_params();
    struct whisper_context* ctx = whisper_init_from_file_with_params(argv[1], cparams);
    if (!ctx) { fprintf(stderr, "Failed to load model\n"); return 1; }
    if (Pa_Initialize() != paNoError) {
        fprintf(stderr, "Failed to initialize PortAudio\n");
        whisper_free(ctx);
        return 1;
    }
    AudioCapture cap;
    PaStreamParameters ip{};
    ip.device = Pa_GetDefaultInputDevice();
    if (ip.device == paNoDevice) {
        fprintf(stderr, "No input device found\n");
        Pa_Terminate(); whisper_free(ctx);
        return 1;
    }
    ip.channelCount = 1;
    ip.sampleFormat = paFloat32;
    ip.suggestedLatency = Pa_GetDeviceInfo(ip.device)->defaultLowInputLatency;
    PaStream* stream = nullptr;
    if (Pa_OpenStream(&stream, &ip, nullptr, SAMPLE_RATE, 512, paClipOff, pa_callback, &cap) != paNoError ||
        Pa_StartStream(stream) != paNoError) {
        fprintf(stderr, "Failed to open audio stream\n");
        Pa_Terminate(); whisper_free(ctx);
        return 1;
    }
    printf("Listening... Ctrl+C to stop.\n");
    struct whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    wparams.print_progress = false;
    wparams.language = "en";
    wparams.n_threads = 4;
    wparams.no_context = true;  // treat each chunk independently
    // Ctrl+C ends the process directly; install a signal handler that clears
    // cap.running if you need the cleanup after this loop to run.
    while (cap.running) {
        std::vector<float> chunk;
        {
            std::lock_guard<std::mutex> lock(cap.mtx);
            if (!cap.chunks.empty()) { chunk = std::move(cap.chunks.front()); cap.chunks.pop(); }
        }
        if (!chunk.empty()) {
            if (whisper_full(ctx, wparams, chunk.data(), (int)chunk.size()) == 0) {
                const int n = whisper_full_n_segments(ctx);
                for (int i = 0; i < n; ++i) {
                    printf("%s", whisper_full_get_segment_text(ctx, i));
                    fflush(stdout);
                }
            }
        } else std::this_thread::sleep_for(std::chrono::milliseconds(50));
    }
    Pa_StopStream(stream); Pa_CloseStream(stream); Pa_Terminate();
    whisper_free(ctx);
    return 0;
}
Build with:
cmake_minimum_required(VERSION 3.16)
project(realtime_stt)
find_package(PkgConfig REQUIRED)
pkg_check_modules(PORTAUDIO REQUIRED portaudio-2.0)
add_executable(realtime_stt main.cpp)
target_include_directories(realtime_stt PRIVATE ${PORTAUDIO_INCLUDE_DIRS} /path/to/whisper.cpp/include)
target_link_libraries(realtime_stt PRIVATE ${PORTAUDIO_LIBRARIES} whisper pthread)
Latency per chunk is approximately CHUNK_SECONDS plus inference time. On an M2 MacBook Pro with base.en, total latency is roughly 3.3–3.6 seconds. Reducing chunk size to 1–2 seconds cuts latency but can hurt accuracy on short utterances.
The figures below come from testing on a 2023 MacBook Pro (M2 Pro, 16 GB RAM) and an AWS EC2 c5.4xlarge (16 vCPU), using 30 minutes of mixed English audio:
| Criterion | whisper.cpp (base.en) | whisper.cpp (large-v3) | Google Cloud STT | Azure STT |
|---|---|---|---|---|
| WER (clean speech) | ~6% | ~3% | ~3% | ~3% |
| WER (noisy/phone) | ~14% | ~8% | ~5% | ~5% |
| Latency (3s chunk, CPU) | ~350ms | ~2.1s | 800ms–1.5s | 700ms–1.3s |
| Latency (3s chunk, M2 Metal) | ~110ms | ~380ms | same | same |
| Cost per hour | $0 | $0 | ~$1.44 | ~$1.00 |
| Offline capable | Yes | Yes | No | No |
| Setup complexity | Medium | Medium | Medium | Low–Medium |
| Multi-language | Yes (99 languages) | Yes (99 languages) | Yes (125+ languages) | Yes (100+ languages) |
whisper.cpp base.en — mobile, edge, Raspberry Pi, cost-sensitive real-time transcription (latency budget 300–500ms)
whisper.cpp large-v3 — batch transcription, subtitle generation, high-accuracy offline pipelines on workstation hardware
Google Cloud STT — production consumer apps, multi-language support, phone-quality audio, teams on GCP
Azure STT — enterprise deployments, custom keyword spotting, Windows-first apps, Azure-integrated infrastructure
For most C++ speech recognition projects, the choice comes down to a single question: can you accept API costs and a network dependency, or do you need everything to run locally?
If local: whisper.cpp is the answer. The base.en model is small enough for edge deployment and accurate enough for production English transcription. The large-v3 model matches cloud API quality on clean audio.
If cloud: Google Cloud gives you the strongest multilingual accuracy and the best-documented API. Azure wins on enterprise integration and custom model support.
Not everyone building with speech-to-text needs to write C++ code. If you are documenting your project for a mixed audience of developers and end users, it is worth pointing non-technical users toward tools that handle the complexity for them. AI Listen is a consumer app that covers the common use case — converting audio to readable text — without any setup.




