Calculate F0 (Fundamental Frequency) From an Audio in PyTorch

In this tutorial, we will show how to calculate F0 (fundamental frequency) from an audio file in PyTorch. F0 is widely used in tasks such as voice conversion and emotion transfer.

How to calculate F0?

In Python, we can use librosa to compute F0. Here is the tutorial:

Extract F0 (Fundamental Frequency) From an Audio in Python: A Step Guide – Python Tutorial

However, we can also use the Python package pyworld, which is a popular choice for F0 extraction.

We can install pyworld as follows:

pip install pyworld

To load audio data, we can use librosa or torchaudio.

Understand librosa.load() is Between -1.0 and 1.0 – Librosa Tutorial

TorchAudio Load Audio with Specific Sampling Rate – TorchAudio Tutorial

In this tutorial, we will use torchaudio to get audio data.

Step 1: use torchaudio to get audio data

Here is an example:

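import torchaudio

wav_file = "music-jamendo-0039.wav"
# torchaudio.load returns (waveform, sample_rate); waveform has shape [channels, samples]
wav_data_2, sr = torchaudio.load(wav_file)
# resample to 8000 Hz, the sample rate used in the rest of this tutorial
wav_data_2 = torchaudio.functional.resample(wav_data_2, sr, 8000)
print(wav_data_2.shape)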

Here wav_data_2 is the audio data, with shape torch.Size([1, 2560416]).

Step 2: use pyworld to get F0

To use pyworld, we should convert wav_data_2 to a NumPy array of type double (float64).

import numpy as np
wav_data_2 = wav_data_2.cpu().numpy().astype(np.double)
print(wav_data_2.shape)

Here wav_data_2 is a NumPy array with shape (1, 2560416).
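pyworld works on a single 1-D channel of 64-bit floats, so a file with more than one channel has to be reduced to one channel first. The following is a minimal sketch, assuming the [channels, samples] layout that torchaudio returns. The helper name to_mono_float64 is hypothetical and is not part of pyworld or torchaudio. Averaging the channels is only one possible choice; this tutorial simply takes the first channel.

```python
import numpy as np


def to_mono_float64(wav: np.ndarray) -> np.ndarray:
    """Down-mix a [channels, samples] array to a contiguous 1-D float64 array."""
    if wav.ndim == 1:
        mono = wav                # already a single channel
    else:
        mono = wav.mean(axis=0)   # average over the channel axis
    # float64 and C-contiguous memory, which is the layout we assume pyworld needs
    return np.ascontiguousarray(mono, dtype=np.float64)


# toy stereo signal: 2 channels, 3 samples
stereo = np.array([[0.2, 0.4, -0.6], [0.0, 0.2, -0.2]], dtype=np.float32)
x = to_mono_float64(stereo)
print(x.shape, x.dtype)
```

For a mono file such as the one in this tutorial, the result is the same signal as wav_data_2[0].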

Then, we can use pyworld to calculate F0:

import pyworld

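x = wav_data_2[0]  # 1-D waveform of the first (only) channel
fs = 8000          # sample rate of x in Hz
f0min = 70         # lowest F0 to search for, in Hz
f0max = 550        # highest F0 to search for, in Hz
hop_length = 256   # hop size in samples

# DIO gives a coarse F0 estimate; frame_period is in milliseconds
f0, timeaxis = pyworld.dio(
    x,
    fs,
    f0_floor=f0min,
    f0_ceil=f0max,
    frame_period=(1000 * hop_length / fs),
)
# StoneMask refines the DIO estimate
f0 = pyworld.stonemask(x, f0, timeaxis, fs)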
print(f0.shape)
print(f0[0:300])

Here are some important parameters we should notice:

x = wav_data_2[0]: the audio data, a 1-dimensional array.
fs = 8000: the sample rate of the audio in Hz. It must match the sample rate of x.
f0min = 70: the minimum F0 in Hz.
f0max = 550: the maximum F0 in Hz.
hop_length = 256: the hop size in samples. pyworld takes frame_period in milliseconds, so we pass 1000 * hop_length / fs = 32 ms.

To understand hop_length, we can read:

Understand n_fft, hop_length, win_length in Audio Processing – Librosa Tutorial
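The conversion from hop_length to frame_period, and the number of F0 frames it leads to, can be checked with a little arithmetic. This is a sketch only: the function names are hypothetical, and the frame-count formula is an assumption about how pyworld counts frames. It is consistent with the (10002,) shape reported in this tutorial, but it is not taken from the pyworld documentation.

```python
def frame_period_ms(hop_length: int, fs: int) -> float:
    """Hop size in samples converted to milliseconds."""
    return 1000 * hop_length / fs


def expected_num_frames(num_samples: int, fs: int, hop_length: int) -> int:
    """Assumed frame count: one frame per frame period, plus the frame at t = 0."""
    duration_ms = 1000 * num_samples / fs
    return int(duration_ms / frame_period_ms(hop_length, fs)) + 1


print(frame_period_ms(256, 8000))               # 32.0
print(expected_num_frames(2560416, 8000, 256))  # 10002
```

With fs = 8000 and hop_length = 256, each F0 value therefore covers 32 ms of audio.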

Running this example, we get the following F0 feature:

(10002,)
[  0.           0.           0.           0.           0.
   0.           0.           0.           0.           0.
   0.           0.           0.           0.           0.
   0.           0.           0.           0.           0.
   0.           0.           0.           0.           0.
   0.           0.           0.           0.           0.
   0.           0.           0.           0.           0.
   0.           0.           0.           0.           0.
   0.           0.           0.           0.           0.
   0.           0.           0.           0.           0.
   0.           0.           0.         163.20322703   0.
 197.66181392 198.4260219  192.16735841 194.73397564   0.
   0.           0.         180.12702044 216.04567165 194.8315039
 217.99977316 218.86887114   0.           0.         175.74471473
 181.30908384 219.95440898 220.19288881 218.62106139 214.13874385
 200.91268578 193.88375986 192.74658768 198.20816003 197.97677956
   0.           0.           0.           0.         134.99862088
   0.           0.           0.           0.           0.
   0.         219.26672858 220.59821667 219.31900165 219.18864018
   0.         367.12108306 243.19008656   0.           0.
   0.           0.           0.           0.           0.
   0.         107.95974481   0.           0.           0.
   0.         293.12536583 285.91725885 293.93020309 300.09728178
 294.74282636 297.25232277 304.09641548   0.           0.
 266.47749054   0.         303.61773098 293.14778962 294.36112975
   0.           0.           0.           0.           0.
   0.           0.           0.           0.         182.4888113
 180.94039341 177.97671174   0.           0.         328.84853647
 331.24029652 322.40873042 334.12838717   0.           0.
   0.           0.           0.           0.           0.        ]

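There are many 0 values in the F0 feature. pyworld outputs 0 for frames in which it finds no pitch (unvoiced frames or silence), so these values are not real frequencies.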
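Because 0 marks a frame without pitch, a boolean mask is a convenient way to separate voiced frames from the rest. The sketch below uses a short hand-made array (values rounded from the output above) instead of real pyworld output; the variable names are illustrative only.

```python
import numpy as np

# hand-made F0 sequence in Hz; 0 marks a frame without pitch
f0_example = np.array([0.0, 0.0, 163.2, 0.0, 197.7, 198.4, 0.0])

voiced = f0_example > 0                 # True where a pitch was found
num_voiced = int(voiced.sum())          # number of voiced frames
mean_f0 = f0_example[voiced].mean()     # average over voiced frames only

print(num_voiced, "of", len(f0_example), "frames are voiced")
print("mean voiced F0:", mean_f0)
```

Statistics such as the mean should be computed over voiced frames only, otherwise the zeros pull the result down.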

How to process 0 values in F0

From the result above, we can see that there are many 0 values in F0. How should we process them? One option is to fill them by linear interpolation. For example:

from scipy.interpolate import interp1d

def _convert_to_continuous_f0(f0: np.ndarray) -> np.ndarray:
    if (f0 == 0).all():
        return f0

    # padding start and end of f0 sequence
    start_f0 = f0[f0 != 0][0]
    end_f0 = f0[f0 != 0][-1]
    start_idx = np.where(f0 == start_f0)[0][0]
    end_idx = np.where(f0 == end_f0)[0][-1]
    f0[:start_idx] = start_f0
    f0[end_idx:] = end_f0

    # get non-zero frame index
    nonzero_idxs = np.where(f0 != 0)[0]

    # perform linear interpolation
    interp_fn = interp1d(nonzero_idxs, f0[nonzero_idxs])
    f0 = interp_fn(np.arange(0, f0.shape[0]))

    return f0

f0 = _convert_to_continuous_f0(f0)
print(f0.shape)
print(f0[0:150])

Running this code, we will see:

(10002,)
[163.20322703 163.20322703 163.20322703 163.20322703 163.20322703
 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703
 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703
 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703
 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703
 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703
 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703
 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703
 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703
 163.20322703 163.20322703 163.20322703 163.20322703 163.20322703
 163.20322703 163.20322703 163.20322703 163.20322703 180.43252048
 197.66181392 198.4260219  192.16735841 194.73397564 191.08223684
 187.43049804 183.77875924 180.12702044 216.04567165 194.8315039
 217.99977316 218.86887114 204.49415234 190.11943353 175.74471473
 181.30908384 219.95440898 220.19288881 218.62106139 214.13874385
 200.91268578 193.88375986 192.74658768 198.20816003 197.97677956
 185.38114782 172.78551609 160.18988435 147.59425262 134.99862088
 147.03692198 159.07522308 171.11352418 183.15182528 195.19012638
 207.22842748 219.26672858 220.59821667 219.31900165 219.18864018
 293.15486162 367.12108306 243.19008656 228.16449303 213.1388995
 198.11330597 183.08771245 168.06211892 153.03652539 138.01093186
 122.98533834 107.95974481 144.99286901 182.02599322 219.05911742
 256.09224162 293.12536583 285.91725885 293.93020309 300.09728178
 294.74282636 297.25232277 304.09641548 291.55677383 279.01713219
 266.47749054 285.04761076 303.61773098 293.14778962 294.36112975
 283.17389791 271.98666606 260.79943422 249.61220237 238.42497053
 227.23773868 216.05050684 204.86327499 193.67604315 182.4888113
 180.94039341 177.97671174 228.26731999 278.55792823 328.84853647
 331.24029652 322.40873042 334.12838717 328.4760692  322.82375124
 317.17143328 311.51911531 305.86679735 300.21447939 294.56216142]

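Compared with the original output, the leading zeros are replaced by the first voiced value (163.20322703), and the zeros between voiced frames are filled by linear interpolation.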
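The same padding and interpolation can be sketched with NumPy alone. Note that this swaps in np.interp in place of scipy's interp1d: np.interp holds the first and last known values outside their range, which plays the role of the start and end padding. The function name continuous_f0_np is hypothetical, and unlike the function above it does not modify its input.

```python
import numpy as np


def continuous_f0_np(f0: np.ndarray) -> np.ndarray:
    """Fill zero frames by linear interpolation between voiced frames."""
    f0 = np.asarray(f0, dtype=np.float64)
    voiced_idxs = np.flatnonzero(f0)   # indices of non-zero frames
    if voiced_idxs.size == 0:
        return f0.copy()               # nothing to interpolate from
    # np.interp repeats the first/last voiced value before/after the voiced range
    return np.interp(np.arange(f0.size), voiced_idxs, f0[voiced_idxs])


toy = np.array([0.0, 0.0, 100.0, 0.0, 200.0, 0.0])
filled = continuous_f0_np(toy)
print(filled)
```

On this toy input, the frame between 100 and 200 becomes 150, and the zeros at both ends take the nearest voiced value.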

Use log() for F0

F0 values in Hz are large, for example 163, which is not good for model training.

We can use the log() function to convert F0 to a smaller range.

For example, applied to the F0 from step 2 (before interpolation):

use_log_f0 = True
if use_log_f0:
    nonzero_idxs = np.where(f0 != 0)[0]
    f0[nonzero_idxs] = np.log(f0[nonzero_idxs])
f0[np.isnan(f0)] = 0
print(f0.shape)
print(f0[0:150])

Running this code, we will see:

(10002,)
[0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         5.09499622
 0.         5.28655756 5.29041635 5.25836665 5.2716344  0.
 0.         0.         5.19366227 5.37548983 5.2721351  5.38449402
 5.38847279 0.         0.         5.16903246 5.20020322 5.39342029
 5.39450393 5.38733992 5.36662414 5.30287041 5.2672588  5.26137631
 5.28931779 5.28814975 0.         0.         0.         0.
 4.90526456 0.         0.         0.         0.         0.
 0.         5.39028893 5.39634302 5.3905273  5.38993273 0.
 5.90569172 5.49384339 0.         0.         0.         0.
 0.         0.         0.         0.         4.68175842 0.
 0.         0.         0.         5.68060039 5.65570246 5.68334233
 5.70410669 5.6861032  5.69458135 5.71734481 0.         0.
 5.58528978 0.         5.71576945 5.68067688 5.68480735 0.
 0.         0.         0.         0.         0.         0.
 0.         0.         5.20668886 5.19816766 5.18165271 0.
 0.         5.79559727 5.80284408 5.77582009 5.81152531 0.
 0.         0.         0.         0.         0.         0.        ]
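If F0 in Hz is needed again later, for example to inspect model output, the log can be undone with exp() on the non-zero frames. The sketch below shows the round trip; the function names are hypothetical. It assumes that no voiced frame is exactly 1 Hz (whose log would be 0), which holds here because f0min is 70.

```python
import numpy as np


def to_log_f0(f0) -> np.ndarray:
    """Take the log of voiced frames; unvoiced frames stay 0."""
    out = np.array(f0, dtype=np.float64)   # np.array makes a copy
    voiced = out > 0
    out[voiced] = np.log(out[voiced])
    return out


def from_log_f0(log_f0) -> np.ndarray:
    """Inverse of to_log_f0, assuming 0 always means unvoiced."""
    out = np.array(log_f0, dtype=np.float64)
    voiced = out != 0
    out[voiced] = np.exp(out[voiced])
    return out


hz = np.array([0.0, 163.20322703, 0.0, 197.66181392])
log_f0 = to_log_f0(hz)
restored = from_log_f0(log_f0)
print(log_f0)
print(restored)
```

Both helpers work on copies, so the original F0 array is left unchanged.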

In summary, we can create a function to calculate F0.

For example:

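import logging

import numpy as np
import pyworld
import torch
from scipy.interpolate import interp1d


def calculate_f0(input: torch.Tensor,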
                 fs: int = 22050, 
                 hop_length: int = 256, 
                 f0min: int = 50, 
                 f0max: int = 550, 
                 use_continuous_f0: bool = False,
                 use_log_f0: bool = True,) -> torch.Tensor:
    # pyworld expects a 1-D float64 array, so input should be a single channel
    x = input.cpu().numpy().astype(np.double)
    f0, timeaxis = pyworld.dio(
        x,
        fs,
        f0_floor=f0min,
        f0_ceil=f0max,
        frame_period=(1000 * hop_length / fs),
    )
    f0 = pyworld.stonemask(x, f0, timeaxis, fs)
    if use_continuous_f0:
        f0 = _convert_to_continuous_f0(f0)
    if use_log_f0:
        nonzero_idxs = np.where(f0 != 0)[0]
        f0[nonzero_idxs] = np.log(f0[nonzero_idxs])
    f0[np.isnan(f0)] = 0
    return input.new_tensor(f0.reshape(-1), dtype=torch.float)


def _convert_to_continuous_f0(f0: np.ndarray) -> np.ndarray:
    if (f0 == 0).all():
        logging.warning("All frames seem to be unvoiced.")
        return f0

    # padding start and end of f0 sequence
    start_f0 = f0[f0 != 0][0]
    end_f0 = f0[f0 != 0][-1]
    start_idx = np.where(f0 == start_f0)[0][0]
    end_idx = np.where(f0 == end_f0)[0][-1]
    f0[:start_idx] = start_f0
    f0[end_idx:] = end_f0

    # get non-zero frame index
    nonzero_idxs = np.where(f0 != 0)[0]

    # perform linear interpolation
    interp_fn = interp1d(nonzero_idxs, f0[nonzero_idxs])
    f0 = interp_fn(np.arange(0, f0.shape[0]))

    return f0