To extract MFCC features from audio in Python, we can use librosa or python_speech_features. However, the two libraries return different MFCC results for the same audio. In this tutorial, we will discuss why.
Extract mfcc using librosa
In librosa, we can use librosa.feature.mfcc() to extract MFCC features from audio.
Here is an example:
import librosa
import numpy as np
import python_speech_features

# init fname
fname = "videoInvite.wav"

# read audio
audio, rate = librosa.load(fname, sr=16000, mono=True)

# using librosa
librosa_mfcc_feature = librosa.feature.mfcc(y=audio, sr=rate, n_mfcc=96, n_fft=1024, win_length=int(0.025*rate), hop_length=int(0.01*rate))
print(librosa_mfcc_feature.T)
print(librosa_mfcc_feature.T.shape)
Running this code prints:
[[-6.7564471e+02 0.0000000e+00 0.0000000e+00 ... 0.0000000e+00 0.0000000e+00 0.0000000e+00]
 [-6.7564471e+02 0.0000000e+00 0.0000000e+00 ... 0.0000000e+00 0.0000000e+00 0.0000000e+00]
 [-6.7564471e+02 0.0000000e+00 0.0000000e+00 ... 0.0000000e+00 0.0000000e+00 0.0000000e+00]
 ...
 [-6.5992938e+02 9.7307272e+00 -1.0178448e+01 ... 9.6666622e-01 6.8923593e-01 -9.6729130e-01]
 [-6.6336255e+02 6.4062834e+00 -1.0166929e+01 ... 3.0643657e-01 3.2113945e-01 -6.7015398e-01]
 [-6.6725159e+02 2.2550015e+00 -1.0951193e+01 ... 4.1866398e-01 4.7858238e-02 -3.9219725e-01]]
(2915, 96)
The transposed MFCC matrix has shape (2915, 96): 2915 frames with 96 coefficients each. Note that win_length is 25 ms and hop_length is 10 ms.
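The window and hop sizes are passed to librosa in samples, not seconds. The short sketch below repeats the conversion used in the example for a 16 kHz sample rate; the duration estimate is only a rough figure derived from the frame count and hop size, not a value reported by the library.

```python
rate = 16000                      # sample rate used in librosa.load()

# Convert the window and hop durations from seconds to samples
win_length = int(0.025 * rate)    # 25 ms window
hop_length = int(0.01 * rate)     # 10 ms hop

print(win_length)                 # 400
print(hop_length)                 # 160

# Rough audio duration implied by 2915 frames spaced one hop apart
n_frames = 2915
approx_seconds = n_frames * hop_length / rate
print(round(approx_seconds))      # 29
```

Each row of the printed matrix therefore describes a 25 ms slice of audio, and consecutive rows are 10 ms apart.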
Extract mfcc using python_speech_features
We can use python_speech_features.mfcc() to extract MFCC features from audio. Here is an example:
# using python_speech_features (audio and rate come from the librosa.load() call above)
psf_mfcc_feature = python_speech_features.mfcc(signal=audio, samplerate=rate, winlen=0.025, winstep=0.01, numcep=96, nfilt=96, nfft=1024, appendEnergy=False)
print(psf_mfcc_feature)
print(psf_mfcc_feature.shape)
Running this code prints:
[[-2.46037103e+02 -7.56018600e+01 -2.51750919e+01 ... -1.01911290e+00 1.70075814e+00 -4.31814748e+00]
 [-2.45256949e+02 -6.94768844e+01 -2.09350841e+01 ... -2.07248365e+00 -1.65457796e+00 1.61949854e+00]
 [-2.45637842e+02 -7.34105513e+01 -2.57776958e+01 ... 1.76839491e-01 4.17425690e+00 8.44219400e+00]
 ...
 [-1.80706243e+02 4.40401371e+00 -4.43984965e+01 ... 1.55503580e+00 -2.39587144e+00 -2.64740226e+00]
 [-1.84005249e+02 3.35962767e+00 -4.72043779e+01 ... -1.57962226e+00 1.55466483e+00 -1.24262426e+00]
 [-1.86559804e+02 1.08205186e+00 -5.06794131e+01 ... 1.65799828e+00 -1.73473103e+00 -1.61770741e+00]]
(2913, 96)
Comparing the two results, we find:
librosa: (2915, 96)
python_speech_features: (2913, 96)
The MFCC shapes differ: librosa returns two more frames than python_speech_features. The coefficient values differ as well.
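This is because the two libraries take different approaches to computing the MFCCs. Both apply a Fourier transform to short, overlapping frames, but they frame the signal differently. By default, librosa pads the signal so that each frame is centered on its hop position (center=True), which yields 1 + floor(N / hop_length) frames for N samples. python_speech_features starts the first frame at sample 0 and zero-pads only the end of the signal, which yields 1 + ceil((N - frame_length) / frame_step) frames. The values differ because the default processing is also not the same: python_speech_features applies pre-emphasis (0.97), a rectangular window, a natural logarithm, and cepstral liftering (22), whereas librosa applies a Hann window, converts power to decibels, and uses 128 mel bands unless n_mels is set.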
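The difference in frame counts can be checked with plain arithmetic. The sketch below is only an illustration of the two framing conventions (centered framing, as librosa does by default, versus frames that start at sample 0 with a zero-padded tail, as python_speech_features does). The helper names are made up for this example, and the signal length of 466,240 samples is a hypothetical value chosen because it is consistent with both shapes; the actual length of videoInvite.wav is not known here.

```python
import math


def librosa_frame_count(n_samples: int, hop_length: int) -> int:
    """Frame count with centered framing (librosa's default, center=True)."""
    return 1 + n_samples // hop_length


def psf_frame_count(n_samples: int, frame_length: int, frame_step: int) -> int:
    """Frame count when the first frame starts at sample 0 and the tail is zero-padded."""
    if n_samples <= frame_length:
        return 1
    return 1 + math.ceil((n_samples - frame_length) / frame_step)


rate = 16000
frame_length = int(0.025 * rate)   # 400 samples
hop_length = int(0.01 * rate)      # 160 samples
n_samples = 466_240                # hypothetical length, not read from the file

print(librosa_frame_count(n_samples, hop_length))             # 2915
print(psf_frame_count(n_samples, frame_length, hop_length))   # 2913
```

With these settings the centered convention produces a couple of extra frames, because the padding lets frames extend past the start and end of the signal.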