• How to properly use py-webrtcvad?

    From marc nicole@3:633/280.2 to All on Thu Jan 23 08:54:12 2025
    Hi,

    I am getting audio from my mic using PyAudio as follows:

    self.stream = audio.open(format=self.FORMAT,
                             channels=self.CHANNELS,
                             rate=self.RATE,
                             input=True,
                             frames_per_buffer=self.FRAMES_PER_BUFFER,
                             input_device_index=1)


    then reading data as follows:

    for i in range(0, int(self.RATE / self.FRAMES_PER_BUFFER *
                          self.RECORD_SECONDS)):
        data = self.stream.read(4800)


    on the other hand I am using py-webrtcvad as follows:

    self.vad = webrtcvad.Vad()


    and want to call *is_speech*() on the audio data from PyAudio,
    but I am getting this error:

    return _webrtcvad.process(self._vad, sample_rate, buf, length)
    Error: Error while processing frame


    no matter how I change the input data format (wav: using
    speech_recognition's *get_wav_data*(), using numpy...)

    Any suggestions (using Python 2.x)?
    Thanks.

    --- MBSE BBS v1.0.8.4 (Linux-x86_64)
    * Origin: ---:- FTN<->UseNet Gate -:--- (3:633/280.2@fidonet)
  • From Stefan Ram@3:633/280.2 to All on Sun Jan 26 21:37:40 2025
    marc nicole <mk1853387@gmail.com> wrote or quoted:
    > return _webrtcvad.process(self._vad, sample_rate, buf, length)
    > Error: Error while processing frame

    (I was not able to check the following tips myself!
    So, please read them as a mere wild guess!)

    That error you're running into is possibly because the
    audio format webrtcvad wants isn't jibing with what you're
    feeding it. Let me break it down for you:

    WebRTC VAD is picky about its audio, like a foodie at a farmers
    market:

    - It wants 16-bit mono PCM, nothing fancy

    - The sample rate has got to be 8000, 16000, 32000, or 48000 Hz

    - Each frame must be exactly 10, 20, or 30 ms long, like clockwork
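
    Those three rules boil down to simple arithmetic on the byte
    length of each frame, so here is a rough sketch of a pre-flight
    check. The helper names (frame_bytes, frame_is_valid) are made up
    for illustration and are not part of webrtcvad; the code assumes
    mono 16-bit PCM, i.e. 2 bytes per sample:

```python
# Sketch only: these helpers are hypothetical, not part of webrtcvad.
VALID_RATES = (8000, 16000, 32000, 48000)   # sample rates in Hz
VALID_FRAME_MS = (10, 20, 30)               # frame durations in ms
BYTES_PER_SAMPLE = 2                        # assumes 16-bit mono PCM


def frame_bytes(rate, frame_ms):
    """Return the byte length of one mono 16-bit frame."""
    if rate not in VALID_RATES:
        raise ValueError("unsupported sample rate: %r" % (rate,))
    if frame_ms not in VALID_FRAME_MS:
        raise ValueError("unsupported frame duration: %r ms" % (frame_ms,))
    return rate * frame_ms // 1000 * BYTES_PER_SAMPLE


def frame_is_valid(frame, rate):
    """True if `frame` is exactly 10, 20 or 30 ms of mono 16-bit PCM."""
    if rate not in VALID_RATES:
        return False
    return len(frame) in [frame_bytes(rate, ms) for ms in VALID_FRAME_MS]


if __name__ == "__main__":
    print(frame_bytes(16000, 30))                    # 480 samples * 2 bytes
    print(frame_is_valid(b"\x00\x00" * 4800, 16000))  # a 4800-sample read
```

    By this arithmetic, a 4800-sample read like the one in your loop
    is 600, 300, 150, or 100 ms at the four supported rates, so it
    cannot be a valid frame at any of them.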

    Tweak your PyAudio setup like you're fine-tuning a classic car:

    Python

    self.FORMAT = pyaudio.paInt16
    self.CHANNELS = 1
    self.RATE = 16000
    self.FRAMES_PER_BUFFER = 480 # 30 ms at 16000 Hz, smooth as a SoCal highway

    Give your audio reading loop a makeover:

    Python

    for i in range(0, int(self.RATE / self.FRAMES_PER_BUFFER * self.RECORD_SECONDS)):
        data = self.stream.read(self.FRAMES_PER_BUFFER)
        is_speech = self.vad.is_speech(data, self.RATE)
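
    If you would rather keep reading bigger chunks from the stream,
    another option is to slice each chunk into VAD-sized frames and
    test them one at a time. A minimal sketch, not tried against real
    hardware: split_frames and count_speech_frames are names I made
    up, and is_speech stands for any callable taking (frame, rate),
    which is the shape of the Vad.is_speech call above. It assumes
    mono 16-bit PCM:

```python
def split_frames(pcm, rate, frame_ms=30, bytes_per_sample=2):
    """Yield consecutive frames of `frame_ms` ms from mono 16-bit PCM bytes.

    A trailing piece shorter than one full frame is dropped.
    """
    n = rate * frame_ms // 1000 * bytes_per_sample   # bytes per frame
    for start in range(0, len(pcm) - n + 1, n):
        yield pcm[start:start + n]


def count_speech_frames(pcm, rate, is_speech, frame_ms=30):
    """Count the frames for which the `is_speech(frame, rate)` callable is true."""
    return sum(1 for frame in split_frames(pcm, rate, frame_ms)
               if is_speech(frame, rate))
```

    In your class that would look something like
    count_speech_frames(data, self.RATE, self.vad.is_speech), with
    data coming straight from self.stream.read().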

    Make sure your audio data is on point:

    Python

    import numpy as np

    # Turn that audio data into a numpy array, like magic
    audio_array = np.frombuffer(data, dtype=np.int16)

    # If it's not mono, make it mono - no stereo allowed at this party
    if self.CHANNELS > 1:
        audio_array = audio_array[::self.CHANNELS]

    # Back to bytes it goes
    audio_bytes = audio_array.tobytes()

    is_speech = self.vad.is_speech(audio_bytes, self.RATE)
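
    Note that the slice above keeps only the first channel and throws
    the rest away. If you want both channels to count, averaging them
    is one alternative. A rough sketch using only the standard
    library, assuming interleaved 2-channel 16-bit samples in native
    byte order; stereo_to_mono is a hypothetical helper name:

```python
import struct


def stereo_to_mono(pcm):
    """Average interleaved left/right 16-bit samples into one mono channel."""
    n = len(pcm) // 2                                  # number of 16-bit samples
    samples = struct.unpack("=%dh" % n, pcm[:n * 2])
    # Pair up (left, right) samples; an unpaired trailing sample is ignored.
    mono = [(samples[i] + samples[i + 1]) // 2 for i in range(0, n - 1, 2)]
    return struct.pack("=%dh" % len(mono), *mono)
```

    The result has half as many bytes as the input, so 480 stereo
    frames read from the stream still come out as 480 mono samples.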

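    Crank up that VAD aggressiveness if you like (the mode is an
    integer from 0 to 3; it tunes how readily non-speech gets filtered
    out and will not fix the frame error by itself):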

    Python

    self.vad = webrtcvad.Vad(3) # 3 is as aggressive as LA traffic

    (Just remember to adjust your sample rate and frame duration
    to fit your needs.)



    --- MBSE BBS v1.0.8.4 (Linux-x86_64)
    * Origin: Stefan Ram (3:633/280.2@fidonet)