bagustris@/home: Acoustic Feature Extraction with Transformers

Tuesday, August 23, 2022

Acoustic Feature Extraction with Transformers

The example in Transformers' documentation here shows how to use the wav2vec 2.0 model for automatic speech recognition. However, there are two crucial issues in that example. First, we usually use our data (set) instead of their (available) dataset. Second, we need to extract acoustic features (the last hidden states instead of logits). The following is my example of adapting Transformers to extract acoustic embedding given any audio file (WAVE) using several models. It includes the pooling average from frame-based processing to utterance-based processing for given any audio file. You don't need to perform the pooling average if you want to process your audio file in frame-based processing (remove the `.mean(axis=0)` in the variable `last_hidden_states`).

Basic syntax: wav2vec2 base model

This is the example from the documentation. I replaced the use of the dataset with the defined path of the audio file ('00001.wav').

from transformers import Wav2Vec2Processor, Wav2Vec2Model
import torchaudio
import torch
# load model
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

# audio file is decoded on the fly
array, fs = torchaudio.load("/data/A-VB/audio/wav/00001.wav")
input = processor(array.squeeze(), sampling_rate=fs, return_tensors="pt")

# apply the model to the input array from wav
with torch.no_grad():
    outputs = model(**input)

# extract last hidden state, compute average, convert to numpy
last_hidden_states = outputs.last_hidden_state.squeeze().mean(axis=0).numpy()

# print shape
print(f"Hidden state shape: {last_hidden_states.shape}")
# Hidden state shape: (768,)

The syntax for the wav2vec2 large and robust model

In this second example, I replace the base model with the large and robust model without finetuning. This example is adapted from here. Note that I replaced 'Wav2Vec2ForCTC' with 'wav2vec2Model'. The former is used when we want to obtain the logits (for speech-to-text transcription) instead of obtaining the hidden states.

from transformers import Wav2Vec2Processor, Wav2Vec2Model
import torch
import torchaudio

# load model
processor = Wav2Vec2Processor.from_pretrained(
    "facebook/wav2vec2-large-robust-ft-swbd-300h")
model = Wav2Vec2Model.from_pretrained(
    "facebook/wav2vec2-large-robust-ft-swbd-300h")

# audio file is decoded on the fly
array, fs = torchaudio.load("/data/A-VB/audio/wav/00001.wav")
input = processor(array.squeeze(), sampling_rate=fs, return_tensors="pt")

with torch.no_grad():
    outputs = model(**input)

last_hidden_states = outputs.last_hidden_state.squeeze().mean(axis=0).numpy()
# printh shape
print(f"Hidden state shape: {last_hidden_states.shape}")

You can replace "facebook/wav2vec2-large-robust-ft-swbd-300h" with "facebook/wav2vec2-large-robust-ft-libri-960h" for the larger fine-tuned model.

For other models, you may need to change `Wav2Vec2Processor` with `Wav2Vec2FeatureExtractor` for processor variable. In my case, this is needed for the following models:

facebook/wav2vec2-large-robust
facebook/wav2vec2-large-xlsr-53

The syntax for the custom model (wav2vec-R-emo-vad)

The last one is the example of the custom model. The model is wav2vec 2.0 fine-tuned on the MSP-Podcast dataset for speech emotion recognition. This last example differs from the previous one since the configuration is given by the authors of the model (read the code thoroughly to inspect the details). I replaced the dummy audio file with the real audio file. It is assumed to process in batch (with batch_size=2) by replicating the same audio file.

import torch
import torch.nn as nn
from transformers import Wav2Vec2Processor
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    Wav2Vec2Model,
    Wav2Vec2PreTrainedModel,
)
import torchaudio


class RegressionHead(nn.Module):
    r"""Classification head."""

    def __init__(self, config):

        super().__init__()

        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.dropout = nn.Dropout(config.final_dropout)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, features, **kwargs):

        x = features
        x = self.dropout(x)
        x = self.dense(x)
        x = torch.tanh(x)
        x = self.dropout(x)
        x = self.out_proj(x)

        return x


class EmotionModel(Wav2Vec2PreTrainedModel):
    r"""Speech emotion classifier."""

    def __init__(self, config):

        super().__init__(config)

        self.config = config
        self.wav2vec2 = Wav2Vec2Model(config)
        self.classifier = RegressionHead(config)
        self.init_weights()

    def forward(
            self,
            input_values,
    ):

        outputs = self.wav2vec2(input_values)
        hidden_states = outputs[0]
        hidden_states = torch.mean(hidden_states, dim=1)
        logits = self.classifier(hidden_states)

        return hidden_states, logits


def process_func(
    wavs,
    sampling_rate: int
    # embeddings: bool = False,
):
    r"""Predict emotions or extract embeddings from raw audio signal."""

    # run through processor to normalize signal
    # always returns a batch, so we just get the first entry
    # then we put it on the device
    # wavs = pad_sequence(wavs, batch_first=True)
    # load model from hub
    device = 'cpu'
    model_name = 'audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim'
    processor = Wav2Vec2Processor.from_pretrained(model_name)
    model = EmotionModel.from_pretrained(model_name)

    y = processor([wav.cpu().numpy() for wav in wavs],          
                   sampling_rate=sampling_rate,
                   return_tensors="pt",
                   padding="longest"
        )
    y = y['input_values']
    y = y.to(device)


    y = model(y)

    return {
        'hidden_states': y[0],
        'logits': y[1],
    }


## test to an audiofile
sampling_rate = 16000
signal = [torchaudio.load('train_001.wav')[0].squeeze().to('cpu') for _ in range(2)]

# extract hidden states
with torch.no_grad():
    hs = process_func(signal, sampling_rate)['hidden_states']
print(f"Hidden states shape={hs.shape}")

Please note for all models, the audio file must be sampled with 16000 Hz, otherwise, you must resample it before extracting acoustic embedding using the methods above. It may not throw an error even if the sampling rate is not 16000 Hz but the results, hence, is not valid since all models were generated based on 16 kHz of sampling rate speech datasets.

You may also want to extract acoustic features using the opensmile toolkit. The tutorial for Windows users using WSL is available here: http://bagustris.blogspot.com/2021/08/extracting-emobase-feature-using-python.html.

Happy reading. Don't wait for more time to apply these methods to your own audio file.