Understand MFCC Difference Between Python librosa and python_speech_features – Python Tutorial

admin

2 years ago

In order to extract audio mfcc feature, we can use python librosa and python_speech_features. However, we can find the mfcc result is different between them. In this tutorial, we will discuss it.

Extract mfcc using librosa

In librosa, we can use librosa.feature.mfcc() to extract audio mfcc feature.

Here is an example code:

import librosa
import numpy as np
import python_speech_features

# init fname
fname = "videoInvite.wav"

# read audio
audio, rate = librosa.load(fname, sr = 16000, mono=True)

# using librosa
lisbrosa_mfcc_feature = librosa.feature.mfcc(y=audio,
                                             sr=rate,
                                             n_mfcc=96,
                                             n_fft=1024,
                                             win_length=int(0.025*rate),
                                             hop_length=int(0.01*rate))
print(lisbrosa_mfcc_feature.T)
print(lisbrosa_mfcc_feature.T.shape)

Run this code, we will see:

[[-6.7564471e+02  0.0000000e+00  0.0000000e+00 ...  0.0000000e+00
   0.0000000e+00  0.0000000e+00]
 [-6.7564471e+02  0.0000000e+00  0.0000000e+00 ...  0.0000000e+00
   0.0000000e+00  0.0000000e+00]
 [-6.7564471e+02  0.0000000e+00  0.0000000e+00 ...  0.0000000e+00
   0.0000000e+00  0.0000000e+00]
 ...
 [-6.5992938e+02  9.7307272e+00 -1.0178448e+01 ...  9.6666622e-01
   6.8923593e-01 -9.6729130e-01]
 [-6.6336255e+02  6.4062834e+00 -1.0166929e+01 ...  3.0643657e-01
   3.2113945e-01 -6.7015398e-01]
 [-6.6725159e+02  2.2550015e+00 -1.0951193e+01 ...  4.1866398e-01
   4.7858238e-02 -3.9219725e-01]]
(2915, 96)

The mfcc is (2915, 96), we should notice the win_length is 25ms and hop_length is 10ms.

Extract mfcc using python_speech_features

We can use python_speech_features.mfcc() to extract audio mfcc. Here is an example code:

# using python_speech_features
psf_mfcc_feature = python_speech_features.mfcc(signal=audio,
                                               samplerate=rate,
                                               winlen=0.025,
                                               winstep=0.01,
                                               numcep=96,
                                               nfilt=96,
                                               nfft=1024,
                                               appendEnergy=False)
print(psf_mfcc_feature)
print(psf_mfcc_feature.shape)

Run this code, we will see:

[[-2.46037103e+02 -7.56018600e+01 -2.51750919e+01 ... -1.01911290e+00
   1.70075814e+00 -4.31814748e+00]
 [-2.45256949e+02 -6.94768844e+01 -2.09350841e+01 ... -2.07248365e+00
  -1.65457796e+00  1.61949854e+00]
 [-2.45637842e+02 -7.34105513e+01 -2.57776958e+01 ...  1.76839491e-01
   4.17425690e+00  8.44219400e+00]
 ...
 [-1.80706243e+02  4.40401371e+00 -4.43984965e+01 ...  1.55503580e+00
  -2.39587144e+00 -2.64740226e+00]
 [-1.84005249e+02  3.35962767e+00 -4.72043779e+01 ... -1.57962226e+00
   1.55466483e+00 -1.24262426e+00]
 [-1.86559804e+02  1.08205186e+00 -5.06794131e+01 ...  1.65799828e+00
  -1.73473103e+00 -1.61770741e+00]]
(2913, 96)

Compare two results, we can find:

librosa: (2915, 96)

python_speech_features: (2913, 96)

The shape of mfcc is different.

Because they are using different approach to computing the MFCCs, python_speech_features uses discrete fourier transform whereas librosa uses short time fourier transform.