Real-time Speech Recognition with Vosk and Zig

This project implements a minimal real-time speech-to-text application using Vosk and Zig.

Audio Device Configuration

The application uses ALSA's default device, which is configured in alsa.conf. To use a different audio device:

  1. Find your audio devices: aplay -l or arecord -l
  2. Edit alsa.conf and update the pcm.!default section:
    pcm.!default {
        type hw
        card 3      # Change to your card number
        device 0    # Change to your device number
    }
    
  3. Rebuild and run the application
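
The card and device numbers used in step 2 come from the arecord -l listing. As a rough sketch, the shell function below reads that listing on standard input and prints a pcm.!default block for the first capture device it finds. The function name alsa_default_stanza is hypothetical and not part of this project, and the sketch assumes the usual English-locale line format "card N: ..., device M: ...".

```shell
#!/bin/sh
# Hypothetical helper (not part of this project): turn `arecord -l` output
# into a pcm.!default stanza for the first capture device listed.
# Assumes lines shaped like "card 3: Name [Desc], device 0: Name [Desc]".
alsa_default_stanza() {
    # Pull "<card> <device>" out of each matching line, keep the first one.
    sed -n 's/^card \([0-9][0-9]*\):.*, device \([0-9][0-9]*\):.*/\1 \2/p' |
    head -n 1 |
    while read -r card device; do
        printf 'pcm.!default {\n    type hw\n    card %s\n    device %s\n}\n' \
            "$card" "$device"
    done
}
```

One possible use is LC_ALL=C arecord -l | alsa_default_stanza, then pasting the printed block into alsa.conf. If the device you want is not the first one listed, edit the numbers by hand.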

Prerequisites

  • Zig 0.15.1 (configured via mise)
  • Nix development environment configured for ALSA and audio libraries
  • patchelf (for fixing the NEEDED entries in release builds): nix-env -iA nixpkgs.patchelf

Vosk Model Download

The application uses the Vosk small English model for speech recognition.

Installation Steps

  1. Enter the Nix development environment: nix develop
  2. Build application: zig build
  3. Run: zig build run

Release Builds and Portability

When building in release mode (-Doptimize=ReleaseSafe), Zig embeds the full path to libvosk.so in the ELF NEEDED entries, which makes the binary non-portable. The build system fixes this automatically by running fix_needed.sh, which uses patchelf to replace the full path with just the library name.

Automatic fix: Run zig build -Doptimize=ReleaseSafe; the NEEDED entries are fixed as part of the build.

Manual fix: If needed, you can run ./fix_needed.sh [binary_path] [library_name] manually.

The script uses patchelf (via nix-shell if not installed) to replace entries like:

  • Before: NEEDED: [/home/user/.cache/zig/.../libvosk.so]
  • After: NEEDED: [libvosk.so]

This makes the binary portable while using the existing RPATH ($ORIGIN/../lib) to find the library at runtime.
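
For illustration, the rewrite described above could look roughly like the following. This is a sketch under stated assumptions, not the contents of the project's fix_needed.sh: the function names needed_with_path and fix_needed are hypothetical, and patchelf is assumed to be on PATH. The only patchelf options used are --print-needed and --replace-needed.

```shell
#!/bin/sh
# Hypothetical sketch; the project's actual fix_needed.sh may differ.

# Read NEEDED entries (one per line) on stdin and print only those that
# carry a directory part and end in the given library name.
needed_with_path() {
    lib=$1
    while IFS= read -r entry; do
        case $entry in
            */"$lib") printf '%s\n' "$entry" ;;
        esac
    done
}

# Replace every full-path NEEDED entry for library $2 in binary $1
# with the bare library name.
fix_needed() {
    binary=$1
    lib=$2
    patchelf --print-needed "$binary" | needed_with_path "$lib" |
    while IFS= read -r entry; do
        patchelf --replace-needed "$entry" "$lib" "$binary"
    done
}
```

Calling fix_needed with a binary path and libvosk.so, then running patchelf --print-needed on the same binary, should list libvosk.so without a directory prefix. Entries that are already bare names are left untouched.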

Usage

The application will:

  • Initialize audio capture from the default microphone
  • Load the Vosk speech recognition model
  • Process audio in real-time
  • Output recognized text to terminal
  • Exit on Ctrl+C

Dependencies

  • Vosk C API library
  • ALSA for audio capture

Notes

Vosk tends to misrecognize "light" as "lake" or "like".