Real-time Speech Recognition with Vosk and Zig
This project implements a minimal real-time speech-to-text application using Vosk and Zig.
Audio Device Configuration
The application uses ALSA's default device, which is configured in alsa.conf. To use a different audio device:
- Find your audio devices: aplay -l (playback) or arecord -l (capture)
- Edit alsa.conf and update the pcm.!default section:
    pcm.!default {
        type hw
        card 3      # Change to your card number
        device 0    # Change to your device number
    }
- Rebuild and run the application
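The device lookup above can be partly scripted. The following is a minimal sketch, not part of the project: `list_cards` and `make_default_pcm` are hypothetical helper names, and the parser assumes the English-locale `card N: ..., device M: ...` line format that `arecord -l` normally prints.

```shell
#!/bin/sh
# Hypothetical helpers (not part of this project) for picking an ALSA device.

# Read `arecord -l` (or `aplay -l`) output on stdin and print "CARD DEVICE"
# pairs, one per line. Assumes the English "card N: ..., device M: ..." format.
list_cards() {
    sed -n 's/^card \([0-9][0-9]*\): .*, device \([0-9][0-9]*\): .*/\1 \2/p'
}

# Print a pcm.!default block for the given card and (optional) device number.
make_default_pcm() {
    card="$1"
    device="${2:-0}"
    # Accept only plain non-negative integers.
    for n in "$card" "$device"; do
        case "$n" in
            ''|*[!0-9]*)
                echo "usage: make_default_pcm CARD [DEVICE]" >&2
                return 1
                ;;
        esac
    done
    printf 'pcm.!default {\n    type hw\n    card %s\n    device %s\n}\n' "$card" "$device"
}
```

For example, `arecord -l | list_cards` prints the available capture card/device pairs, and `make_default_pcm 3 0` prints a block that can be pasted into alsa.conf.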
Prerequisites
- Zig 0.15.1 (configured via mise)
- Nix development environment configured for ALSA and the other audio libraries
- patchelf (for fixing the ELF NEEDED entries in release builds):
nix-env -iA nixpkgs.patchelf
Vosk Model Download
The application uses the Vosk small English model for speech recognition:
- Source: https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip
- Size: ~50MB
- Language: English only
- Accuracy: Good for simple sentences and commands
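Fetching the model can be sketched as below. This is an illustration under assumptions, not the project's own setup step: `fetch_model` and `model_dir_from_url` are hypothetical names, the `models` destination directory is a guess (check where the application actually looks for the model), and the archive is assumed to unpack into a top-level directory named after the zip file.

```shell
#!/bin/sh
# Hypothetical download helper (not part of this project).
MODEL_URL="https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip"

# Derive the directory name the archive is assumed to unpack to.
model_dir_from_url() {
    basename "$1" .zip
}

# Download and unpack the model into DEST (assumed default: ./models).
# Does nothing if the model directory already exists.
fetch_model() {
    dest="${1:-models}"
    dir="$dest/$(model_dir_from_url "$MODEL_URL")"
    if [ -d "$dir" ]; then
        echo "model already present: $dir"
        return 0
    fi
    mkdir -p "$dest" || return 1
    archive="$dest/$(basename "$MODEL_URL")"
    curl -fL -o "$archive" "$MODEL_URL" || return 1
    unzip -q "$archive" -d "$dest" || return 1
    rm -f "$archive"
    echo "model unpacked to: $dir"
}
```

Calling `fetch_model` with no argument downloads into ./models; pass a different directory as the first argument if the application expects the model elsewhere.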
Installation Steps
- Enter the Nix development environment:
    nix develop
- Build the application:
    zig build
- Run it:
    zig build run
Release Builds and Portability
When building in release mode (-Doptimize=ReleaseSafe), Zig embeds the full path to libvosk.so in the ELF NEEDED entries, making the binary non-portable. The build system automatically fixes this by running fix_needed.sh, which uses patchelf to replace the full path with just the library name.
Automatic fix: Just run zig build -Doptimize=ReleaseSafe; the NEEDED entries are fixed automatically.
Manual fix: If needed, you can run ./fix_needed.sh [binary_path] [library_name] yourself.
The script uses patchelf (via nix-shell if not installed) to replace entries like:
- Before: NEEDED: [/home/user/.cache/zig/.../libvosk.so]
- After: NEEDED: [libvosk.so]
This makes the binary portable while using the existing RPATH ($ORIGIN/../lib) to find the library at runtime.
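The rewrite described above can be approximated with patchelf's --print-needed and --replace-needed options. The sketch below is not the contents of fix_needed.sh, which may work differently; `needed_basename` and `fix_needed_sketch` are hypothetical names used only for illustration.

```shell
#!/bin/sh
# Illustrative sketch only; the project's real fix_needed.sh may differ.

# Map a NEEDED entry to its bare library name (strip any directory prefix).
needed_basename() {
    printf '%s\n' "${1##*/}"
}

# For BINARY, rewrite every NEEDED entry that is a full path ending in
# /LIBRARY so that it contains only the library name.
fix_needed_sketch() {
    binary="$1"
    library="$2"
    # Collect the entries first so the binary is not read while being patched.
    entries="$(patchelf --print-needed "$binary")" || return 1
    printf '%s\n' "$entries" | while IFS= read -r entry; do
        case "$entry" in
            */"$library")
                patchelf --replace-needed "$entry" \
                    "$(needed_basename "$entry")" "$binary" || exit 1
                ;;
        esac
    done
}
```

After such a rewrite, `patchelf --print-needed` on the binary should list libvosk.so without a directory prefix, and the loader resolves it through the existing RPATH.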
Usage
The application will:
- Initialize audio capture from the default microphone
- Load the Vosk speech recognition model
- Process audio in real-time
- Output recognized text to the terminal
- Exit on Ctrl+C
Dependencies
- Vosk C API library
- ALSA for audio capture
Notes
Vosk tends to misrecognize "light" as "lake" or "like".