In this tutorial, we will introduce how to calculate F0 (fundamental frequency) from an audio file in PyTorch. F0 is widely used in voice conversion and emotion transfer learning.
How to calculate F0?
In Python, we can use librosa to compute it. Here is the tutorial:
Extract F0 (Fundamental Frequency) From an Audio in Python: A Step Guide – Python Tutorial
However, we can also use the Python package pyworld, which is a popular alternative to librosa for F0 extraction.
We can install pyworld as follows:
pip install pyworld
To load audio data, we can use librosa or torchaudio.
Understand librosa.load() is Between -1.0 and 1.0 – Librosa Tutorial
TorchAudio Load Audio with Specific Sampling Rate – TorchAudio Tutorial
In this tutorial, we will use torchaudio to get audio data.
Step 1: use torchaudio to get audio data
Here is some example code.
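import torchaudio

wav_file = "music-jamendo-0039.wav"
# read_audio() is not part of torchaudio. It is assumed here to be a
# user-defined helper (for example, one built on torchaudio.load())
# that returns a tensor of shape [channels, samples].
wav_data_2 = read_audio(wav_file)
print(wav_data_2.shape)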
Here wav_data_2 is audio data, the shape of which is torch.Size([1, 2560416])
Step 2: use pyworld to get F0
To use pyworld, we should convert wav_data_2 to a NumPy array of type double (float64).

import numpy as np

# move the tensor to the CPU and convert it to a float64 NumPy array
wav_data_2 = wav_data_2.cpu().numpy().astype(np.double)
print(wav_data_2.shape)
Now wav_data_2 is a NumPy array with shape (1, 2560416).
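pyworld works on a one-dimensional float64 NumPy array, so it can help to check the shape and dtype before calling it. The following is a minimal sketch of that preparation step using NumPy only. The name to_pyworld_input is hypothetical, the random array stands in for real audio, and averaging the channels to mono is an assumption for multi-channel files rather than part of the example above.

```python
import numpy as np

def to_pyworld_input(wav: np.ndarray) -> np.ndarray:
    """Return a 1-D, C-contiguous float64 array from [channels, samples] audio."""
    wav = np.asarray(wav)
    if wav.ndim == 2:
        # Assumption: average the channels to get mono audio.
        wav = wav.mean(axis=0)
    return np.ascontiguousarray(wav, dtype=np.float64)

# Stand-in for audio loaded as float32 with shape (1, num_samples).
rng = np.random.default_rng(0)
wav_data = rng.uniform(-1.0, 1.0, size=(1, 16000)).astype(np.float32)

x = to_pyworld_input(wav_data)
print(x.shape, x.dtype)
```

With a single channel, the result holds the same samples as wav_data[0], only stored as float64.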
Then, we can use pyworld to calculate F0
import pyworld

x = wav_data_2[0]
fs = 8000
f0min = 70
f0max = 550
hop_length = 256

# DIO gives a coarse F0 estimate; frame_period is in milliseconds
f0, timeaxis = pyworld.dio(
    x,
    fs,
    f0_floor=f0min,
    f0_ceil=f0max,
    frame_period=(1000 * hop_length / fs),
)
# StoneMask refines the DIO estimate
f0 = pyworld.stonemask(x, f0, timeaxis, fs)
print(f0.shape)
print(f0[0:300])
Here are some important parameters we should notice:
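- x = wav_data_2[0]: the audio data as a 1-dimensional array.
- fs = 8000: the sample rate of the audio in Hz. It must match the sample rate of the loaded audio; we use 8000 in this example.
- f0min = 70: the minimum F0 to search for, in Hz.
- f0max = 550: the maximum F0 to search for, in Hz.
- hop_length = 256: the number of samples between adjacent frames. pyworld expects the frame period in milliseconds, so we pass 1000 * hop_length / fs, which is 32 ms here.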
To understand hop_length, we can read:
Understand n_fft, hop_length, win_length in Audio Processing – Librosa Tutorial
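To make the relation between hop_length, fs and frame_period concrete, here is a small worked example in plain Python. The helper names are hypothetical, and the frame-count formula is an assumption about how the DIO step counts frames (one frame per hop plus the frame at time 0); it is shown only because it agrees with the F0 length reported in this tutorial.

```python
def frame_period_ms(hop_length: int, fs: int) -> float:
    """Hop size converted from samples to milliseconds."""
    return 1000 * hop_length / fs

def expected_num_frames(num_samples: int, hop_length: int, fs: int) -> int:
    """Assumed frame count: one frame per hop, plus the frame at time 0."""
    period = frame_period_ms(hop_length, fs)
    return int(1000 * num_samples / fs / period) + 1

fs = 8000
hop_length = 256
num_samples = 2560416  # length of the audio used in this tutorial

print(frame_period_ms(hop_length, fs))                   # 32.0
print(expected_num_frames(num_samples, hop_length, fs))  # 10002
```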
Running this example, we get the following F0 feature:
(10002,) [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 163.20322703 0. 197.66181392 198.4260219 192.16735841 194.73397564 0. 0. 0. 180.12702044 216.04567165 194.8315039 217.99977316 218.86887114 0. 0. 175.74471473 181.30908384 219.95440898 220.19288881 218.62106139 214.13874385 200.91268578 193.88375986 192.74658768 198.20816003 197.97677956 0. 0. 0. 0. 134.99862088 0. 0. 0. 0. 0. 0. 219.26672858 220.59821667 219.31900165 219.18864018 0. 367.12108306 243.19008656 0. 0. 0. 0. 0. 0. 0. 0. 107.95974481 0. 0. 0. 0. 293.12536583 285.91725885 293.93020309 300.09728178 294.74282636 297.25232277 304.09641548 0. 0. 266.47749054 0. 303.61773098 293.14778962 294.36112975 0. 0. 0. 0. 0. 0. 0. 0. 0. 182.4888113 180.94039341 177.97671174 0. 0. 328.84853647 331.24029652 322.40873042 334.12838717 0. 0. 0. 0. 0. 0. 0. ]
We will find there are many 0 values in the F0 feature. They mark unvoiced frames, where no pitch was detected, so they are not valid frequencies.
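Since a 0 marks a frame without a detected pitch, statistics such as the mean F0 should be computed over the non-zero frames only. The snippet below is a minimal NumPy sketch of that idea. The array holds a few consecutive values copied from the output above, and the variable names are illustrative.

```python
import numpy as np

# Consecutive F0 values (Hz) copied from the output above; 0 marks an unvoiced frame.
f0 = np.array([
    0.0, 163.20322703, 0.0, 197.66181392,
    198.4260219, 192.16735841, 194.73397564, 0.0,
])

voiced = f0 > 0                # boolean mask of voiced frames
voiced_ratio = voiced.mean()   # fraction of frames that have a pitch
mean_f0 = f0[voiced].mean() if voiced.any() else 0.0

print(voiced)
print(voiced_ratio, mean_f0)
```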
How to process 0 values in F0
From the result above, we can find there are many 0 values in F0. How should we process these invalid values? We can perform linear interpolation. For example:
import numpy as np
from scipy.interpolate import interp1d

def _convert_to_continuous_f0(f0: np.ndarray) -> np.ndarray:
    # note: the padding step below modifies f0 in place
    if (f0 == 0).all():
        return f0

    # padding start and end of f0 sequence
    start_f0 = f0[f0 != 0][0]
    end_f0 = f0[f0 != 0][-1]
    start_idx = np.where(f0 == start_f0)[0][0]
    end_idx = np.where(f0 == end_f0)[0][-1]
    f0[:start_idx] = start_f0
    f0[end_idx:] = end_f0

    # get non-zero frame index
    nonzero_idxs = np.where(f0 != 0)[0]

    # perform linear interpolation
    interp_fn = interp1d(nonzero_idxs, f0[nonzero_idxs])
    f0 = interp_fn(np.arange(0, f0.shape[0]))

    return f0

f0 = _convert_to_continuous_f0(f0)
print(f0.shape)
print(f0[0:150])
Running this code, we will see:
(10002,) [163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703 180.43252048 197.66181392 198.4260219 192.16735841 194.73397564 191.08223684 187.43049804 183.77875924 180.12702044 216.04567165 194.8315039 217.99977316 218.86887114 204.49415234 190.11943353 175.74471473 181.30908384 219.95440898 220.19288881 218.62106139 214.13874385 200.91268578 193.88375986 192.74658768 198.20816003 197.97677956 185.38114782 172.78551609 160.18988435 147.59425262 134.99862088 147.03692198 159.07522308 171.11352418 183.15182528 195.19012638 207.22842748 219.26672858 220.59821667 219.31900165 219.18864018 293.15486162 367.12108306 243.19008656 228.16449303 213.1388995 198.11330597 183.08771245 168.06211892 153.03652539 138.01093186 122.98533834 107.95974481 144.99286901 182.02599322 219.05911742 256.09224162 293.12536583 285.91725885 293.93020309 300.09728178 294.74282636 297.25232277 304.09641548 291.55677383 279.01713219 266.47749054 285.04761076 303.61773098 293.14778962 294.36112975 283.17389791 271.98666606 260.79943422 249.61220237 238.42497053 227.23773868 216.05050684 204.86327499 193.67604315 182.4888113 180.94039341 177.97671174 228.26731999 278.55792823 328.84853647 331.24029652 322.40873042 334.12838717 328.4760692 322.82375124 317.17143328 311.51911531 305.86679735 300.21447939 294.56216142]
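After linear interpolation the zeros are gone: the leading zeros are filled with the first non-zero value (163.20322703), and each gap between two non-zero frames is filled with evenly spaced values.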
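The same filling can also be sketched with np.interp, which interpolates linearly between known points and holds the first and last known values outside their range. This is an alternative shown for illustration, not the function used in this tutorial. The name interpolate_f0 is hypothetical, and unlike the function above it leaves the input array unchanged.

```python
import numpy as np

def interpolate_f0(f0: np.ndarray) -> np.ndarray:
    """Fill zero (unvoiced) frames by linear interpolation; the input is not modified."""
    f0 = np.asarray(f0, dtype=np.float64)
    voiced_idxs = np.flatnonzero(f0)
    if voiced_idxs.size == 0:
        # Nothing to interpolate from.
        return f0.copy()
    all_idxs = np.arange(f0.shape[0])
    # np.interp clamps to the first/last voiced value outside the voiced range.
    return np.interp(all_idxs, voiced_idxs, f0[voiced_idxs])

f0 = np.array([0.0, 0.0, 163.20322703, 0.0, 197.66181392, 198.4260219, 0.0])
filled = interpolate_f0(f0)
print(filled)
```

The single gap between 163.20322703 and 197.66181392 is filled with their midpoint, which agrees with the value 180.43252048 in the output above.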
Use log() for F0
The raw F0 values are large, for example 163, which is not good for model training.
We can use the log() function to convert the non-zero F0 values to a smaller range.
For example:
use_log_f0 = True
if use_log_f0:
    nonzero_idxs = np.where(f0 != 0)[0]
    f0[nonzero_idxs] = np.log(f0[nonzero_idxs])
    f0[np.isnan(f0)] = 0
print(f0.shape)
print(f0[0:150])
Running this code on the F0 from Step 2 (before interpolation, so the zeros are still present), we will see:
(10002,) [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 5.09499622 0. 5.28655756 5.29041635 5.25836665 5.2716344 0. 0. 0. 5.19366227 5.37548983 5.2721351 5.38449402 5.38847279 0. 0. 5.16903246 5.20020322 5.39342029 5.39450393 5.38733992 5.36662414 5.30287041 5.2672588 5.26137631 5.28931779 5.28814975 0. 0. 0. 0. 4.90526456 0. 0. 0. 0. 0. 0. 5.39028893 5.39634302 5.3905273 5.38993273 0. 5.90569172 5.49384339 0. 0. 0. 0. 0. 0. 0. 0. 4.68175842 0. 0. 0. 0. 5.68060039 5.65570246 5.68334233 5.70410669 5.6861032 5.69458135 5.71734481 0. 0. 5.58528978 0. 5.71576945 5.68067688 5.68480735 0. 0. 0. 0. 0. 0. 0. 0. 0. 5.20668886 5.19816766 5.18165271 0. 0. 5.79559727 5.80284408 5.77582009 5.81152531 0. 0. 0. 0. 0. 0. 0. ]
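The log of 0 is not finite, which is why only the non-zero frames are transformed. As a minimal sketch of the same step that does not overwrite the input, NumPy's where argument can restrict the log to voiced frames. The function name to_log_f0 is hypothetical, and the sample values are copied from the outputs above.

```python
import numpy as np

def to_log_f0(f0: np.ndarray) -> np.ndarray:
    """Return log(F0) for voiced frames and 0 for unvoiced frames."""
    f0 = np.asarray(f0, dtype=np.float64)
    log_f0 = np.zeros_like(f0)
    # `where` restricts the log to voiced frames; other entries keep the 0 from `out`.
    np.log(f0, out=log_f0, where=f0 > 0)
    return log_f0

f0 = np.array([0.0, 163.20322703, 0.0, 197.66181392])
log_f0 = to_log_f0(f0)
print(log_f0)

# exp() recovers the original frequencies on voiced frames.
restored = np.where(log_f0 != 0, np.exp(log_f0), 0.0)
print(restored)
```

Going back with exp() is useful when a model predicts log F0 and the result has to be turned into Hz again.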
In summary, we can create a function to calculate F0.
For example:
import logging

import numpy as np
import pyworld
import torch
from scipy.interpolate import interp1d

def calculate_f0(
    input: torch.Tensor,
    fs: int = 22050,
    hop_length: int = 256,
    f0min: int = 50,
    f0max: int = 550,
    use_continuous_f0: bool = False,
    use_log_f0: bool = True,
) -> torch.Tensor:
    # input is expected to be a 1-dimensional waveform tensor
    x = input.cpu().numpy().astype(np.double)
    f0, timeaxis = pyworld.dio(
        x,
        fs,
        f0_floor=f0min,
        f0_ceil=f0max,
        frame_period=(1000 * hop_length / fs),
    )
    f0 = pyworld.stonemask(x, f0, timeaxis, fs)
    if use_continuous_f0:
        f0 = _convert_to_continuous_f0(f0)
    if use_log_f0:
        nonzero_idxs = np.where(f0 != 0)[0]
        f0[nonzero_idxs] = np.log(f0[nonzero_idxs])
        f0[np.isnan(f0)] = 0
    return input.new_tensor(f0.reshape(-1), dtype=torch.float)

def _convert_to_continuous_f0(f0: np.ndarray) -> np.ndarray:
    if (f0 == 0).all():
        logging.warning("All frames seem to be unvoiced.")
        return f0

    # padding start and end of f0 sequence
    start_f0 = f0[f0 != 0][0]
    end_f0 = f0[f0 != 0][-1]
    start_idx = np.where(f0 == start_f0)[0][0]
    end_idx = np.where(f0 == end_f0)[0][-1]
    f0[:start_idx] = start_f0
    f0[end_idx:] = end_f0

    # get non-zero frame index
    nonzero_idxs = np.where(f0 != 0)[0]

    # perform linear interpolation
    interp_fn = interp1d(nonzero_idxs, f0[nonzero_idxs])
    f0 = interp_fn(np.arange(0, f0.shape[0]))

    return f0