The example in Transformers' documentation here shows how to use the wav2vec 2.0 model for automatic speech recognition. However, there are two crucial issues with that example for our purpose. First, we usually want to use our own data (set) instead of their available dataset. Second, we need to extract acoustic features (the last hidden states rather than the logits). The following are my examples of adapting Transformers to extract acoustic embeddings from any audio file (WAVE) using several models. They include average pooling from frame-based processing to utterance-based processing for any given audio file. If you prefer to keep frame-based processing, skip the average pooling (remove the `.mean(axis=0)` in the variable `last_hidden_states`).
Basic syntax: wav2vec2 base model
This is the example from the documentation; I only replaced the loading of the dataset with a fixed path to an audio file ('00001.wav').
```python
from transformers import Wav2Vec2Processor, Wav2Vec2Model
import torchaudio
import torch

# load model
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

# load the audio file from disk
array, fs = torchaudio.load("/data/A-VB/audio/wav/00001.wav")
inputs = processor(array.squeeze(), sampling_rate=fs, return_tensors="pt")

# apply the model to the input array from wav
with torch.no_grad():
    outputs = model(**inputs)

# extract last hidden state, compute average, convert to numpy
last_hidden_states = outputs.last_hidden_state.squeeze().mean(axis=0).numpy()

# print shape
print(f"Hidden state shape: {last_hidden_states.shape}")
# Hidden state shape: (768,)
```
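For comparison, here is a minimal sketch of keeping the frame-level features (no average pooling), reusing `outputs` from the block above:

```python
# frame-level features: one 768-dimensional vector per frame (roughly every 20 ms)
frame_hidden_states = outputs.last_hidden_state.squeeze().numpy()
print(f"Frame-level shape: {frame_hidden_states.shape}")  # (num_frames, 768)
```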
The syntax for the wav2vec2 large and robust model
In this second example, I replace the base model with the large and robust model (without any further fine-tuning on my side). This example is adapted from here. Note that I replaced `Wav2Vec2ForCTC` with `Wav2Vec2Model`; the former is used when we want to obtain the logits (for speech-to-text transcription) instead of the hidden states.
```python
from transformers import Wav2Vec2Processor, Wav2Vec2Model
import torch
import torchaudio

# load model
processor = Wav2Vec2Processor.from_pretrained(
    "facebook/wav2vec2-large-robust-ft-swbd-300h")
model = Wav2Vec2Model.from_pretrained(
    "facebook/wav2vec2-large-robust-ft-swbd-300h")

# load the audio file from disk
array, fs = torchaudio.load("/data/A-VB/audio/wav/00001.wav")
inputs = processor(array.squeeze(), sampling_rate=fs, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

last_hidden_states = outputs.last_hidden_state.squeeze().mean(axis=0).numpy()

# print shape
print(f"Hidden state shape: {last_hidden_states.shape}")
```

You can replace "facebook/wav2vec2-large-robust-ft-swbd-300h" with "facebook/wav2vec2-large-robust-ft-libri-960h" for the larger fine-tuned model.
For other models, you may need to replace `Wav2Vec2Processor` with `Wav2Vec2FeatureExtractor` for the processor (see the sketch after the list). In my case, this is needed for the following models:
- facebook/wav2vec2-large-robust
- facebook/wav2vec2-large-xlsr-53
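As a sketch of this swap, here is the same extraction pipeline with `Wav2Vec2FeatureExtractor`, assuming the `facebook/wav2vec2-large-robust` model and the same audio path as above:

```python
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
import torchaudio
import torch

model_name = "facebook/wav2vec2-large-robust"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name)

array, fs = torchaudio.load("/data/A-VB/audio/wav/00001.wav")
inputs = feature_extractor(array.squeeze(), sampling_rate=fs, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# average pooling over frames -> one utterance-level embedding
last_hidden_states = outputs.last_hidden_state.squeeze().mean(axis=0).numpy()
print(f"Hidden state shape: {last_hidden_states.shape}")  # (1024,) for the large model
```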
The syntax for the custom model (wav2vec-R-emo-vad)
The last one is an example with a custom model: wav2vec 2.0 fine-tuned on the MSP-Podcast dataset for speech emotion recognition. This last example differs from the previous ones since the model configuration is given by the authors of the model (read the code thoroughly to inspect the details). I replaced the dummy audio file with a real audio file, and the processing is done in a batch (batch_size=2) by replicating the same audio file.
```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Processor
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    Wav2Vec2Model,
    Wav2Vec2PreTrainedModel,
)
import torchaudio


class RegressionHead(nn.Module):
    r"""Classification head."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.dropout = nn.Dropout(config.final_dropout)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, features, **kwargs):
        x = features
        x = self.dropout(x)
        x = self.dense(x)
        x = torch.tanh(x)
        x = self.dropout(x)
        x = self.out_proj(x)
        return x


class EmotionModel(Wav2Vec2PreTrainedModel):
    r"""Speech emotion classifier."""

    def __init__(self, config):
        super().__init__(config)
        self.config = config
        self.wav2vec2 = Wav2Vec2Model(config)
        self.classifier = RegressionHead(config)
        self.init_weights()

    def forward(
        self,
        input_values,
    ):
        outputs = self.wav2vec2(input_values)
        hidden_states = outputs[0]
        hidden_states = torch.mean(hidden_states, dim=1)
        logits = self.classifier(hidden_states)
        return hidden_states, logits


def process_func(
    wavs,
    sampling_rate: int
    # embeddings: bool = False,
):
    r"""Predict emotions or extract embeddings from raw audio signal."""
    # run through processor to normalize signal
    # always returns a batch, so we just get the first entry
    # then we put it on the device
    # wavs = pad_sequence(wavs, batch_first=True)

    # load model from hub
    device = 'cpu'
    model_name = 'audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim'
    processor = Wav2Vec2Processor.from_pretrained(model_name)
    model = EmotionModel.from_pretrained(model_name)

    y = processor(
        [wav.cpu().numpy() for wav in wavs],
        sampling_rate=sampling_rate,
        return_tensors="pt",
        padding="longest",
    )
    y = y['input_values']
    y = y.to(device)
    y = model(y)

    return {
        'hidden_states': y[0],
        'logits': y[1],
    }


# test on an audio file
sampling_rate = 16000
signal = [torchaudio.load('train_001.wav')[0].squeeze().to('cpu')
          for _ in range(2)]

# extract hidden states
with torch.no_grad():
    hs = process_func(signal, sampling_rate)['hidden_states']
print(f"Hidden states shape={hs.shape}")
```
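The returned `logits` are the model's emotion predictions, i.e., the "vad" (valence, arousal, dominance) part of the model name. Here is a minimal sketch of printing them alongside the embeddings, reusing `signal`, `sampling_rate`, and `process_func` from above (check the model card for the exact output order of the three dimensions):

```python
# logits hold the three emotion dimensions; hidden_states are utterance-level embeddings
with torch.no_grad():
    out = process_func(signal, sampling_rate)

print(f"Embeddings shape: {out['hidden_states'].shape}")  # (2, 1024)
print(f"Predicted dimensions: {out['logits']}")           # shape (2, 3)
```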
Please note that for all models, the audio file must be sampled at 16000 Hz; otherwise, you must resample it before extracting acoustic embeddings with the methods above. The code may not throw an error even if the sampling rate is not 16000 Hz, but the results will not be valid, since all of these models were trained on speech datasets sampled at 16 kHz.
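A minimal sketch of resampling to 16 kHz with torchaudio before feeding the signal to the processor (the path is just an example):

```python
import torchaudio

array, fs = torchaudio.load("/data/A-VB/audio/wav/00001.wav")
if fs != 16000:
    # resample the waveform to 16 kHz so it matches the models' training data
    resampler = torchaudio.transforms.Resample(orig_freq=fs, new_freq=16000)
    array = resampler(array)
    fs = 16000
```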
You may also want to extract acoustic features using the opensmile toolkit. The tutorial for Windows users using WSL is available here: http://bagustris.blogspot.com/2021/08/extracting-emobase-feature-using-python.html.
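For reference, a minimal sketch with the opensmile Python package (`pip install opensmile`), assuming the emobase feature set described in the linked tutorial:

```python
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.emobase,
    feature_level=opensmile.FeatureLevel.Functionals,
)
# one row of utterance-level (functionals) features as a pandas DataFrame
features = smile.process_file("/data/A-VB/audio/wav/00001.wav")
print(features.shape)
```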
Happy reading. Don't wait any longer to apply these methods to your own audio files.