If you plan to train a deep learning model on wav files, you usually have to extract audio features from them first. In this tutorial, we will introduce how to do it.
Preliminary
We can use the librosa and python_speech_features Python libraries.

pip install librosa
pip install python_speech_features
Read audio file data
We usually extract audio data from wav files in PCM format. This kind of file contains uncompressed sound data.
To know what is the format of your wav files, you can read:
View Audio Sample Rate, Data Format PCM or ALAW Using ffprobe – Python Tutorial
Python Read WAV Data Format, PCM or ALAW – Python Tutorial
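Besides ffprobe, you can also inspect a wav file's header with Python's standard-library wave module. Here is a minimal sketch; "example.wav" is a hypothetical path used only for illustration, so the sketch writes a tiny 16-bit PCM file first and then reads its format back:

import wave
import struct

# Write a short 16-bit PCM mono file ("example.wav" is a hypothetical path).
with wave.open("example.wav", "wb") as f:
    f.setnchannels(1)      # mono
    f.setsampwidth(2)      # 2 bytes per sample = 16-bit PCM
    f.setframerate(16000)  # 16 kHz sample rate
    f.writeframes(struct.pack("<4h", 0, 1000, -1000, 0))

# Read the header fields back.
with wave.open("example.wav", "rb") as f:
    rate, width_bits, channels = f.getframerate(), f.getsampwidth() * 8, f.getnchannels()
print(rate, width_bits, channels)  # 16000 16 1

Note that the wave module only handles uncompressed PCM wav files; for ALAW or other encodings you still need ffprobe or a dedicated library.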
Then, we can read wav data using librosa. Here is an example:

import librosa
import numpy

audio, sr = librosa.load(audio_file, sr=sample_rate, mono=True)
Here audio_file is the path of the wav file, audio is the wav data (a numpy ndarray), and sr is the sample rate of the file.
You can also read wav data using scipy.io.wavfile.read(). The difference between them is explained here:
The Difference Between scipy.io.wavfile.read() and librosa.load() in Python – Python Tutorial
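The key difference is the sample representation: scipy.io.wavfile.read() returns the raw integer samples, while librosa.load() returns float32 samples scaled to [-1.0, 1.0]. A minimal sketch of this, using a hypothetical file "example.wav" written on the fly and converting scipy's output by hand the way librosa does for 16-bit PCM:

import numpy as np
from scipy.io import wavfile

# Write a short 16-bit PCM file ("example.wav" is a hypothetical path).
sr = 16000
samples = np.array([0, 16384, -16384, 0], dtype=np.int16)
wavfile.write("example.wav", sr, samples)

# scipy.io.wavfile.read() returns the raw int16 samples.
rate, raw = wavfile.read("example.wav")
print(raw.dtype)  # int16

# librosa.load() instead returns float32 in [-1.0, 1.0];
# the equivalent conversion for 16-bit PCM by hand:
scaled = raw.astype(np.float32) / 32768.0
print(scaled.max())  # 0.5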
Extract audio fbank feature
After reading the wav data, we can extract its fbank feature. We can use python_speech_features to implement it.
Here is an example:
frame_len = 0.025  # 25 ms window length, in seconds
frame_shift = 0.01  # 10 ms frame shift, in seconds
wav_feature, energy = python_speech_features.fbank(audio, sr, nfilt=256, winlen=frame_len, winstep=frame_shift)
The wav_feature is the fbank feature of this wav file.
Notice: from the tutorial Understand the Difference of MelSpec, FBank and MFCC in Audio Feature Extraction – Python Audio Processing
We can see that wav_feature is a MelSpec. In order to get the FBank, we should use the logfbank() method or take the log ourselves:
wav_feature = numpy.log(wav_feature)
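Here is a small numpy sketch of that conversion, using hypothetical mel filterbank energies; clipping before the log avoids -inf on silent (zero-energy) frames:

import numpy as np

# Hypothetical mel filterbank energies (frames x filters), for illustration.
mel_spec = np.array([[1.0, np.e], [np.e ** 2, 4.0]])

# FBank = natural log of the mel energies; clip first so zero
# energies do not produce -inf.
fbank = np.log(np.clip(mel_spec, 1e-10, None))
print(fbank[0])  # [0. 1.]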
As to python_speech_features.fbank(), it is defined as:
def fbank(signal, samplerate=16000, winlen=0.025, winstep=0.01,
          nfilt=26, nfft=512, lowfreq=0, highfreq=None, preemph=0.97,
          winfunc=lambda x: numpy.ones((x,))):
We should notice that nfilt is the dimension of the output, i.e. the number of mel filters.
For example, you may find that the shape of wav_feature is 499 x 256. Here 499 is the number of frames, which is determined by the total duration of the wav file together with winlen and winstep; 256 is nfilt.
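The frame count can be computed ahead of time. A sketch of the framing formula python_speech_features uses (one frame, then one more per full winstep after the first window; num_fbank_frames is a helper name introduced here for illustration):

import math

def num_fbank_frames(num_samples, sample_rate, winlen=0.025, winstep=0.01):
    # Convert window length and step from seconds to samples.
    frame_len = int(round(winlen * sample_rate))
    frame_step = int(round(winstep * sample_rate))
    if num_samples <= frame_len:
        return 1
    # One frame for the first window, plus one per (possibly partial) step.
    return 1 + int(math.ceil((num_samples - frame_len) / frame_step))

# A 5-second wav at 16 kHz with the defaults above yields 499 frames.
print(num_fbank_frames(5 * 16000, 16000))  # 499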
Normalize the fbank feature
After we have got the fbank feature of a wav file, we can normalize it. Here is an example:
import numpy as np

def normalize_frames(m, epsilon=1e-12):
    return np.array([(v - np.mean(v)) / max(np.std(v), epsilon) for v in m])

wav_feature = normalize_frames(wav_feature)
Then you can use this wav_feature as the input to train your model.
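To sanity-check the normalization, each frame should come out with zero mean and unit variance. A quick check using a random matrix as a stand-in for a real fbank feature:

import numpy as np

def normalize_frames(m, epsilon=1e-12):
    # Per-frame normalization: subtract the mean, divide by the std.
    return np.array([(v - np.mean(v)) / max(np.std(v), epsilon) for v in m])

# A random feature matrix stands in for a real fbank feature here.
feat = np.random.RandomState(0).rand(5, 8)
norm = normalize_frames(feat)
print(np.allclose(norm.mean(axis=1), 0.0))  # True: each frame has zero mean
print(np.allclose(norm.std(axis=1), 1.0))   # True: each frame has unit variance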
You may see this fbank feature as follows: