Thursday, January 27, 2022

Acoustic Feature Extraction with Torchaudio

The code below extracts three acoustic features -- spectrogram, mel spectrogram, and MFCC -- from an audio file "filename" (wav, mp3, ogg, flac, etc.). These three are among the most important acoustic features in speech signal processing. Brief notes are given inside the code; the resulting plots appear below the code.

import torchaudio
from matplotlib import pyplot as plt
import librosa

# show the torchaudio version if needed
# print(torchaudio.__version__)

def plot_spectrogram(spec, title=None, ylabel="freq_bin", aspect="auto", xmax=None):
    fig, axs = plt.subplots(1, 1)
    axs.set_title(title or "Spectrogram (db)")
    axs.set_ylabel(ylabel)
    im = axs.imshow(librosa.power_to_db(spec), origin="lower", aspect=aspect)
    if xmax:
        axs.set_xlim((0, xmax))
    fig.colorbar(im, ax=axs)

filename = "/home/bagus/train_001.wav"  # change with your file
waveform, sample_rate = torchaudio.load(filename)

# Configuration for the spectrogram, mel spectrogram, and MFCC
n_fft = 1024
win_length = None  # if None, defaults to n_fft
hop_length = 512   # determines the number of frames (x-axis in the plots)
n_mels = 64  # y-axis size in the mel spectrogram plot
fmin = 50
fmax = 8000
n_mfcc = 40  # must be smaller than n_mels; y-axis size in the MFCC plot

# transform for spectrogram extraction
spectrogram = torchaudio.transforms.Spectrogram(
    n_fft=n_fft,
    win_length=win_length,
    hop_length=hop_length,
    center=True,
    pad_mode="reflect",
    power=2.0,
)

# Show plot of spectrogram
spec = spectrogram(waveform)
print(spec.shape)  # torch.Size([1, 513, 426])
plot_spectrogram(spec[0], title=f"Spectrogram - {str(filename)}")

## transform for mel spectrogram extraction
melspectrogram = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=n_fft,
    win_length=win_length,
    hop_length=hop_length,
    f_min=fmin,
    f_max=fmax,
    n_mels=n_mels,
    mel_scale="htk",
)

# Calculate melspec
melspec = melspectrogram(waveform)
print(melspec.shape)  # torch.Size([1, 64, 426])
plot_spectrogram(melspec[0], title=f"Melspectrogram - {str(filename)}")

## transform for MFCC extraction
mfcc_transform = torchaudio.transforms.MFCC(
    sample_rate=sample_rate,
    n_mfcc=n_mfcc,
    melkwargs={
        "n_fft": n_fft,
        "n_mels": n_mels,
        "hop_length": hop_length,
        "mel_scale": "htk",
    },
)

# plot mfcc
mfcc = mfcc_transform(waveform)
print(mfcc.shape) # torch.Size([1, 40, 426])
plot_spectrogram(mfcc[0], title=f"MFCC - {str(filename)}")
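The shapes printed above follow directly from the STFT parameters. As a minimal sketch (assuming torchaudio's default center=True padding, and a hypothetical sample count for the example file), the number of frequency bins is n_fft // 2 + 1 and the number of frames grows with the audio length divided by hop_length:

```python
def num_stft_frames(num_samples: int, hop_length: int) -> int:
    """Number of STFT frames torchaudio produces with center=True padding."""
    return num_samples // hop_length + 1

n_fft = 1024
hop_length = 512

# 513 frequency bins, as in the printed spectrogram shape [1, 513, 426]
freq_bins = n_fft // 2 + 1

# 426 frames would correspond to roughly 217,600 samples
# (about 13.6 s at a 16 kHz sampling rate) -- an assumed value here.
frames = num_stft_frames(217_600, hop_length)
print(freq_bins, frames)  # 513 426
```

The mel spectrogram and MFCC share the same frame axis (426); only the feature axis changes, to n_mels (64) and n_mfcc (40) respectively.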


Wednesday, January 19, 2022

Choosing Journals and Conferences for Publication: Google Top 20 (and h5-index > 30)

If you want to publish an academic paper, you may be unsure which conference or journal to submit it to. This short article may help. To be categorized as a "reputable journal", my institution requires the two indicators below.

  1. It appears in the Google Top 20 (all categories, categories, and sub-categories)
  2. It has a Google h5-index > 30
The first indicator makes sense: the top twenty are the 20 journals and conferences (mixed) with the highest h5-index. I do not know why my institution chose 30 as the h5-index threshold for the second indicator, but it still makes sense, since a higher h5-index means a higher impact.
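For readers unfamiliar with the metric, the h5-index is the largest number h such that h articles published in the venue over the last five years each have at least h citations. A small sketch (with made-up citation counts) of how it is computed:

```python
def h5_index(citations_last_5_years: list[int]) -> int:
    """Largest h such that h articles each have at least h citations."""
    cited = sorted(citations_last_5_years, reverse=True)
    h = 0
    for rank, count in enumerate(cited, start=1):
        if count >= rank:
            h = rank  # this article still supports a larger h
        else:
            break
    return h

# Hypothetical venue: six recent articles with these citation counts.
print(h5_index([40, 35, 33, 31, 10, 5]))  # 5
```

A venue passing my institution's threshold would thus need more than 30 recent articles, each cited more than 30 times.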

Of those two indicators, I use the first as the main criterion for selecting a publication venue. Here are five Google Top 20 lists: all categories, two categories, and two sub-categories in my field.

Google Top-20 (all categories)

For choosing categories, click "VIEW ALL" > Metrics > VIEW ALL.

Google Top 20 Category Engineering and Computer Sciences

Google Top 20 Category Life Sciences and Earth Sciences

Google Top 20 Sub-category: Signal Processing

Google Top 20 Sub-category: Acoustic and Audio

These selection criteria are not mandatory at my institution, but researchers receive a larger bonus if their publications meet one or both of the criteria above (perhaps more for both).

Tuesday, January 04, 2022

New Paper: Effect of Different Splitting Criteria on Speech Emotion Recognition


Traditional speech emotion recognition (SER) evaluations have been performed merely on a speaker-independent (SI) condition; some of them even did not evaluate their result on this condition (speaker-dependent, SD). This paper highlights the importance of splitting training and test data for SER by script, known as sentence-open or text-independent (TI) criteria. The results show that employing sentence-open criteria degraded the performance of SER. This finding implies the difficulties of recognizing emotion from speech in different linguistic information embedded in acoustic information. Surprisingly, text-independent criteria consistently performed worse than speaker+text-independent (STI) criteria. The full order of difficulties for splitting criteria on SER performances from the most difficult to the easiest is text-independent, speaker+text-independent, speaker-independent, and speaker+text-dependent. The gap between speaker+text-independent and text-independent was smaller than other criteria, strengthening the difficulties of recognizing emotion from speech in different sentences.


Experiment #1: average of 30 trials (runs)
Experiment #2: 5-fold cross-validation
Experiment #3: Same number of training and test data
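The splitting criteria compared in the paper can be illustrated on a toy corpus. This is only a sketch of the idea, with hypothetical speaker and sentence labels; the real experiments use actual emotional speech datasets:

```python
from itertools import product

# Toy corpus: each utterance is a (speaker, sentence) pair.
speakers = ["spk1", "spk2", "spk3"]
sentences = ["s1", "s2", "s3"]
corpus = list(product(speakers, sentences))  # 9 utterances

test_speakers = {"spk3"}
test_sentences = {"s3"}

# Speaker-independent (SI): test speakers never appear in training.
si_train = [u for u in corpus if u[0] not in test_speakers]
si_test = [u for u in corpus if u[0] in test_speakers]

# Text-independent (TI): test sentences never appear in training.
ti_train = [u for u in corpus if u[1] not in test_sentences]
ti_test = [u for u in corpus if u[1] in test_sentences]

# Speaker+text-independent (STI): both speaker and sentence are unseen.
sti_train = [u for u in corpus
             if u[0] not in test_speakers and u[1] not in test_sentences]
sti_test = [u for u in corpus
            if u[0] in test_speakers and u[1] in test_sentences]

print(len(si_train), len(ti_train), len(sti_train), len(sti_test))  # 6 6 4 1
```

Note that the STI split discards utterances that mix a training speaker with a test sentence (or vice versa), which is one reason these criteria yield different amounts of training data.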


Take home message

Sentence (or linguistic) information plays a crucial role in speech emotion recognition.

Full paper + code:
