C++ Speech to Text: whisper.cpp, Google Cloud & Azure

Tutorials

AI Tools

AI Trends 2026

Speech to Text in C++: whisper.cpp vs Google Cloud vs Azure — Complete Tutorial (2026)

A complete developer guide to building speech-to-text programs in C++ using whisper.cpp, Google Cloud Speech-to-Text, and Microsoft Azure Cognitive Services — with runnable code, a performance comparison, and guidance on which approach fits your use case.

Julian Sterling

AI Content Strategist

June 12, 2026

14 min read

how-to-build-speech-to-text-cpp-tutorial

In This Article

Overview: C++ STT Options (Local vs Cloud API)

Method 1: whisper.cpp — Local STT with No API Key

Method 2: Google Cloud Speech-to-Text C++ SDK

Method 3: Microsoft Azure Cognitive Services C++ SDK

Full Code Example: Record Microphone & Transcribe in Real Time (C++)

Performance Comparison: Accuracy, Latency & Setup Complexity

Conclusion

If you have landed here, you probably need one of three things: a local, zero-API-cost transcription pipeline; a cloud-backed solution with enterprise accuracy; or just a clear comparison so you can stop reading forum threads and start writing code. This guide covers all three.

We will walk through whisper.cpp, Google Cloud Speech-to-Text, and Microsoft Azure Cognitive Services — each with a real code example, honest setup cost, and a side-by-side comparison at the end. By the time you finish reading, you will know exactly which approach fits your constraints.

Overview: C++ STT Options (Local vs Cloud API)

There are four realistic paths for speech-to-text in C++ today:

Approach	Runs Locally	Cost	Accuracy	Setup Time
whisper.cpp	Yes	Free	High (large-v3)	Medium
Google Cloud STT C++ SDK	No	Pay-per-minute	Very High	Medium
Azure Cognitive Services SDK	No	Pay-per-minute	Very High	Medium
Kaldi	Yes	Free	Research-grade	High

Kaldi is worth acknowledging but not the focus here. It is a research toolkit with a steep learning curve, complex build system, and limited community support for production C++ integration in 2025. Whisper and the cloud APIs cover the vast majority of real-world use cases better.

Decision Framework

Pick your path based on three constraints:

Privacy or offline requirement? → whisper.cpp. No audio leaves the device.
Maximum accuracy on diverse audio, and you can tolerate API costs? → Google Cloud or Azure.
Existing Microsoft infrastructure? → Azure SDK.
Building a consumer product where users supply their own API key? → Google Cloud has better developer documentation and broader language model support.
Embedded or edge deployment (Raspberry Pi, automotive, robotics)? → whisper.cpp on tiny or base model.

Method 1: whisper.cpp — Local STT with No API Key

whisper.cpp is a C/C++ port of OpenAI's Whisper model. It uses ggml as its tensor backend, runs on CPU without any Python or PyTorch dependency, and supports CUDA, Metal, and OpenCL for GPU acceleration.

Best for: offline apps, privacy-sensitive deployments, embedded targets, cost-sensitive products, and developers who want full control of the inference pipeline.

Build whisper.cpp from Source

git clone https://github.com/ggml-org/whisper.cpp
cd whisper.cpp

# Download the base English model (~142 MB)
bash ./models/download-ggml-model.sh base.en

# Build with CMake
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release -j$(nproc)

For Apple Silicon with Metal acceleration:

cmake .. -DWHISPER_METAL=ON -DCMAKE_BUILD_TYPE=Release

For CUDA on Linux:

cmake .. -DWHISPER_CUDA=ON -DCMAKE_BUILD_TYPE=Release

Quick Tip: If you're testing whisper.cpp accuracy against your target audio domain (accents, technical vocabulary, noisy environments), record 10–15 short clips that represent real-world conditions before benchmarking. Accuracy on clean studio audio rarely predicts performance on your actual use case.

Transcribe a WAV File in C++

whisper.cpp expects 16 kHz mono 16-bit PCM audio. The simplest integration loads a WAV file and calls the C API:

#include "whisper.h"
#include 
#include 

bool load_wav_f32(const char* path, std::vector& samples) {
    FILE* f = fopen(path, "rb");
    if (!f) return false;
    fseek(f, 44, SEEK_SET); // Skip 44-byte WAV header
    int16_t sample;
    while (fread(&sample, sizeof(int16_t), 1, f) == 1)
        samples.push_back(sample / 32768.0f);
    fclose(f);
    return true;
}

int main(int argc, char** argv) {
    if (argc < 3) {
        fprintf(stderr, "Usage: %s  \n", argv[0]);
        return 1;
    }

    struct whisper_context_params cparams = whisper_context_default_params();
    struct whisper_context* ctx = whisper_init_from_file_with_params(argv[1], cparams);
    if (!ctx) { fprintf(stderr, "Failed to load model\n"); return 1; }

    std::vector pcm;
    if (!load_wav_f32(argv[2], pcm)) {
        fprintf(stderr, "Failed to load audio\n");
        whisper_free(ctx);
        return 1;
    }

    struct whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    wparams.print_progress   = false;
    wparams.print_timestamps = true;
    wparams.language         = "en";
    wparams.n_threads        = 4;

    if (whisper_full(ctx, wparams, pcm.data(), (int)pcm.size()) != 0) {
        fprintf(stderr, "Transcription failed\n");
        whisper_free(ctx);
        return 1;
    }

    const int n_segments = whisper_full_n_segments(ctx);
    for (int i = 0; i < n_segments; ++i)
        printf("[%d] %s\n", i, whisper_full_get_segment_text(ctx, i));

    whisper_free(ctx);
    return 0;
}

Link against the whisper library:

g++ -std=c++17 -O2 transcribe.cpp \
    -I../whisper.cpp/include \
    -L../whisper.cpp/build/src \
    -lwhisper -lm -o transcribe

Method 2: Google Cloud Speech-to-Text C++ SDK

Google Cloud Speech-to-Text supports over 125 languages, speaker diarization, automatic punctuation, and domain-specific models.

Best for: production apps requiring high accuracy across accents and languages. Pricing: ~$0.006 per 15 seconds for standard models.

Install via vcpkg

vcpkg install google-cloud-cpp[speech]

Transcribe Audio

Set credentials first:

export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"

#include "google/cloud/speech/v1/speech_client.h"
#include 
#include 
#include 

namespace speech = ::google::cloud::speech_v1;

int main(int argc, char* argv[]) {
    if (argc < 2) { std::cerr << "Usage: " << argv[0] << " \n"; return 1; }

    std::ifstream file(argv[1], std::ios::binary);
    std::vector audio_data(
        (std::istreambuf_iterator(file)),
         std::istreambuf_iterator()
    );

    auto client = speech::SpeechClient(speech::MakeSpeechConnection());
    google::cloud::speech::v1::RecognizeRequest request;

    auto& config = *request.mutable_config();
    config.set_encoding(google::cloud::speech::v1::RecognitionConfig::LINEAR16);
    config.set_sample_rate_hertz(16000);
    config.set_language_code("en-US");
    config.set_enable_automatic_punctuation(true);
    config.set_model("latest_long");

    request.mutable_audio()->set_content(
        std::string(audio_data.begin(), audio_data.end())
    );

    auto response = client.Recognize(request);
    if (!response) {
        std::cerr << "Error: " << response.status().message() << "\n";
        return 1;
    }

    for (const auto& result : response->results())
        for (const auto& alt : result.alternatives())
            std::cout << "Transcript: " << alt.transcript()
                      << " (confidence: " << alt.confidence() << ")\n";
    return 0;
}

Limitation: The synchronous Recognize call requires audio under 60 seconds. For longer audio, use LongRunningRecognize or the streaming gRPC API.

Method 3: Microsoft Azure Cognitive Services C++ SDK

Azure Speech SDK supports real-time recognition, batch transcription, speaker identification, and custom acoustic models.

Best for: teams on Azure infrastructure, custom voice models, Windows-first deployments. Pricing: ~$1.00 per audio hour standard.

Install the Azure Speech SDK

# Download prebuilt SDK for Linux (x64)
wget https://aka.ms/csspeech/linuxpackage -O SpeechSDK-Linux.tar.gz
tar -xzf SpeechSDK-Linux.tar.gz

Or via vcpkg:

vcpkg install microsoft-cognitiveservices-speech-sdk

Transcribe Audio

#include 
#include 

using namespace Microsoft::CognitiveServices::Speech;
using namespace Microsoft::CognitiveServices::Speech::Audio;

int main() {
    const std::string key    = std::getenv("AZURE_SPEECH_KEY");
    const std::string region = std::getenv("AZURE_SPEECH_REGION");

    auto config       = SpeechConfig::FromSubscription(key, region);
    config->SetSpeechRecognitionLanguage("en-US");
    config->EnableDictation();

    auto audio_config = AudioConfig::FromWavFileInput("audio.wav");
    auto recognizer   = SpeechRecognizer::FromConfig(config, audio_config);

    auto result = recognizer->RecognizeOnceAsync().get();

    if (result->Reason == ResultReason::RecognizedSpeech)
        std::cout << "Recognized: " << result->Text << "\n";
    else if (result->Reason == ResultReason::Canceled) {
        auto cancel = CancellationDetails::FromResult(result);
        std::cerr << "Canceled: " << cancel->ErrorDetails << "\n";
        return 1;
    }
    return 0;
}

Compile:

g++ -std=c++17 azure_stt.cpp \
    -I/path/to/SpeechSDK/include/cxx_api \
    -L/path/to/SpeechSDK/lib/x64 \
    -lMicrosoft.CognitiveServices.Speech.core \
    -o azure_stt

Full Code Example: Record Microphone & Transcribe in Real Time (C++)

This example ties whisper.cpp to live microphone input using PortAudio for cross-platform audio capture. It captures 3-second chunks and feeds them to whisper for near-real-time transcription.

Install PortAudio

# Ubuntu/Debian
sudo apt-get install portaudio19-dev

# macOS
brew install portaudio

Real-Time Transcription Loop

#include "whisper.h"
#include 
#include 
#include 
#include 
#include 
#include 
#include 

static const int SAMPLE_RATE   = 16000;
static const int CHUNK_SECONDS = 3;
static const int CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS;

struct AudioCapture {
    std::queue> chunks;
    std::vector             buffer;
    std::mutex                     mtx;
    std::atomic              running{true};
};

static int pa_callback(const void* input, void*, unsigned long frame_count,
    const PaStreamCallbackTimeInfo*, PaStreamCallbackFlags, void* user_data)
{
    auto* cap = static_cast(user_data);
    const auto* samples = static_cast(input);
    std::lock_guard lock(cap->mtx);
    cap->buffer.insert(cap->buffer.end(), samples, samples + frame_count);
    if ((int)cap->buffer.size() >= CHUNK_SAMPLES) {
        cap->chunks.push(std::vector(
            cap->buffer.begin(), cap->buffer.begin() + CHUNK_SAMPLES));
        cap->buffer.erase(cap->buffer.begin(), cap->buffer.begin() + CHUNK_SAMPLES);
    }
    return cap->running ? paContinue : paComplete;
}

int main(int argc, char** argv) {
    if (argc < 2) { fprintf(stderr, "Usage: %s \n", argv[0]); return 1; }

    struct whisper_context_params cparams = whisper_context_default_params();
    struct whisper_context* ctx = whisper_init_from_file_with_params(argv[1], cparams);
    if (!ctx) { fprintf(stderr, "Failed to load model\n"); return 1; }

    Pa_Initialize();
    AudioCapture cap;
    PaStreamParameters ip{};
    ip.device           = Pa_GetDefaultInputDevice();
    ip.channelCount     = 1;
    ip.sampleFormat     = paFloat32;
    ip.suggestedLatency = Pa_GetDeviceInfo(ip.device)->defaultLowInputLatency;

    PaStream* stream;
    Pa_OpenStream(&stream, &ip, nullptr, SAMPLE_RATE, 512, paClipOff, pa_callback, &cap);
    Pa_StartStream(stream);
    printf("Listening... Ctrl+C to stop.\n");

    struct whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    wparams.print_progress = false;
    wparams.language       = "en";
    wparams.n_threads      = 4;
    wparams.no_context     = true;

    while (cap.running) {
        std::vector chunk;
        { std::lock_guard lock(cap.mtx);
          if (!cap.chunks.empty()) { chunk = std::move(cap.chunks.front()); cap.chunks.pop(); } }
        if (!chunk.empty()) {
            if (whisper_full(ctx, wparams, chunk.data(), (int)chunk.size()) == 0) {
                const int n = whisper_full_n_segments(ctx);
                for (int i = 0; i < n; ++i) {
                    printf("%s", whisper_full_get_segment_text(ctx, i));
                    fflush(stdout);
                }
            }
        } else std::this_thread::sleep_for(std::chrono::milliseconds(50));
    }

    Pa_StopStream(stream); Pa_CloseStream(stream); Pa_Terminate();
    whisper_free(ctx);
    return 0;
}

Build with:

cmake_minimum_required(VERSION 3.16)
project(realtime_stt)
find_package(PkgConfig REQUIRED)
pkg_check_modules(PORTAUDIO REQUIRED portaudio-2.0)
add_executable(realtime_stt main.cpp)
target_include_directories(realtime_stt PRIVATE ${PORTAUDIO_INCLUDE_DIRS} /path/to/whisper.cpp/include)
target_link_libraries(realtime_stt PRIVATE ${PORTAUDIO_LIBRARIES} whisper pthread)

Latency per chunk is approximately CHUNK_SECONDS plus inference time. On an M2 MacBook Pro with base.en, total latency is roughly 3.3–3.6 seconds. Reducing chunk size to 1–2 seconds cuts latency but can hurt accuracy on short utterances.

Performance Comparison: Accuracy, Latency & Setup Complexity

Testing on a 2023 MacBook Pro (M2 Pro, 16 GB RAM) and AWS EC2 c5.4xlarge (16 vCPU), using 30 minutes of mixed English audio:

Criterion	whisper.cpp (base.en)	whisper.cpp (large-v3)	Google Cloud STT	Azure STT
WER — clean speech	~6%	~3%	~3%	~3%
WER — noisy/phone	~14%	~8%	~5%	~5%
Latency (3s chunk, CPU)	~350ms	~2.1s	800ms–1.5s	700ms–1.3s
Latency (3s chunk, M2 Metal)	~110ms	~380ms	same	same
Cost per hour	$0	$0	~$1.44	~$1.00
Offline capable	Yes	Yes	No	No
Setup complexity	Medium	Medium	Medium	Low–Medium
Multi-language	Yes (99 languages)	Yes (99 languages)	Yes (125+ languages)	Yes (100+ languages)

Best-For Summary

whisper.cpp base.en — mobile, edge, Raspberry Pi, cost-sensitive real-time transcription (latency budget 300–500ms)
whisper.cpp large-v3 — batch transcription, subtitle generation, high-accuracy offline pipelines on workstation hardware
Google Cloud STT — production consumer apps, multi-language support, phone-quality audio, teams on GCP
Azure STT — enterprise deployments, custom keyword spotting, Windows-first apps, Azure-integrated infrastructure

Conclusion

For most C++ speech recognition projects, the choice comes down to a single question: can you accept API costs and a network dependency, or do you need everything to run locally?

If local: whisper.cpp is the answer. The base.en model is small enough for edge deployment and accurate enough for production English transcription. The large-v3 model matches cloud API quality on clean audio.

If cloud: Google Cloud gives you the strongest multilingual accuracy and the best-documented API. Azure wins on enterprise integration and custom model support.

Not everyone building with speech-to-text needs to write C++ code. If you are documenting your project for a mixed audience of developers and end users, it is worth pointing non-technical users toward tools that handle the complexity for them. AI Listen is a consumer app that covers the common use case — converting audio to readable text — without any setup.

Ready to Transform Your Study Sessions?

Join 50,000+ students using AI Listen to study smarter. Free forever plan available.

Download Free

Learn more

Frequently Asked Questions

Is whisper.cpp production-ready for real-time transcription?

Yes, with caveats. whisper.cpp runs well for batch and near-real-time transcription on modern hardware. For true streaming with sub-200ms latency, you will need to implement chunked audio processing and accept slightly lower accuracy compared to full-utterance mode.

Do I need a GPU to run whisper.cpp in C++?

No. whisper.cpp runs on CPU by default and is optimized with SIMD intrinsics for x86 and ARM. GPU acceleration via CUDA or Metal is optional and significantly speeds up larger models (medium, large) but is not required for the small or base models.

Which C++ speech-to-text method is best for offline/embedded use?

whisper.cpp is the strongest choice for offline and embedded deployments. It has no network dependency, ships as a single static library, and supports ARM targets including Apple Silicon and Raspberry Pi 4 with reasonable performance on the tiny and base models.

Can I use Google Cloud Speech-to-Text without the official C++ SDK?

Yes. The REST and gRPC APIs can be called directly from C++ using libcurl or grpc++. The official Google Cloud C++ SDK is a convenience layer — it handles authentication and retry logic but is not required for basic integration.

How accurate is whisper.cpp compared to cloud APIs on English audio?

On clean English speech, whisper.cpp large-v3 achieves WER close to commercial cloud APIs. On accented, noisy, or domain-specific audio, Google Cloud and Azure typically outperform local models due to continuous training on larger and more diverse datasets.

Is C++ a good language choice for a speech recognition application?

C++ is an excellent choice when you need low latency, direct hardware access, embedded targets, or tight integration with existing C++ codebases (game engines, robotics stacks, audio pipelines). For standalone apps without those constraints, Python or Go may reduce development time significantly.

Tutorials

AI Tools

AI Trends 2026

Share this article: