Understand torchaudio.load() normalize, frame_offset, num_frames with Examples


We can use torchaudio.load() to read audio data easily. For example, you can refer to these tutorials:

  • Understand torchaudio.load(): Read Audio with Examples
  • TorchAudio Load Audio with Specific Sampling Rate

In this tutorial, we will discuss it in more detail.

Syntax

torchaudio.load() is defined as:

torchaudio.load(uri: Union[BinaryIO, str, PathLike], frame_offset: int = 0, num_frames: int = -1, normalize: bool = True, channels_first: bool = True, format: Optional[str] = None, buffer_size: int = 4096, backend: Optional[str] = None)

There are three important parameters (a combined call using all of them is sketched after this list).

  • normalize = True: when True, the samples are converted to float32 values in the range [-1, 1]; when False, an integer WAV file keeps its native integer type (this flag has no effect on other formats).
  • num_frames = -1: the maximum number of frames to read; -1 means read the whole file.
  • frame_offset = 0: the number of frames to skip before starting to read.
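
For instance, the three parameters can be combined in one call. Below is a minimal sketch (10091.wav and the frame values are just placeholders):

import torchaudio

wav_path = r'10091.wav'  # placeholder file used throughout this tutorial
# Skip the first 1000 frames, read 8000 frames, and keep the raw integer samples
waveform, sample_rate = torchaudio.load(
    wav_path,
    frame_offset=1000,
    num_frames=8000,
    normalize=False)
print(waveform.shape, waveform.dtype)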

Then we will use some examples to discuss the effect of these parameters and help you understand them.

Normalize

If normalize = True (the default), the values of the wave file are converted to float32 in the range [-1, 1].

import torchaudio

wav_path = r'10091.wav'
# By default normalize = True, so samples are float32 in [-1, 1]
waveform, sample_rate = torchaudio.load(wav_path)
x = waveform[:, :100]  # first 100 frames
print(x)

Output:

tensor([[-3.0518e-05, -3.0518e-05,  0.0000e+00, -9.1553e-05, -3.0518e-05,
          0.0000e+00, -3.0518e-05, -9.1553e-05, -3.0518e-05,  0.0000e+00,
          0.0000e+00, -6.1035e-05, -9.1553e-05, -6.1035e-05, -6.1035e-05,
          ...
        ]])

Since this is a 16-bit PCM WAV file, you can multiply by (1 << 15) = 32768 to recover the raw integer sample values, which is the effect of normalize = False:

y = x * (1 << 15)  # scale back to the 16-bit integer range
print(y)

Output:

tensor([[-1., -1.,  0., -3., -1.,  0., -1., -3., -1.,  0.,  0., -2., -3., -2.,
         -2.,  0., -1., -2., -5., -1., -2.,  0., -1., -2., -2.,  0., -1.,  2.,
         -2., -3.,  0., -2., -2.,  0., -1.,  0., -2., -3.,  0., -2., -3., -1.,
        ...
        ]])
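
You can also pass normalize = False directly. For a 16-bit PCM WAV file, torchaudio then returns the samples as an integer tensor. A minimal sketch (the exact values depend on your file):

waveform_int, sample_rate = torchaudio.load(wav_path, normalize=False)
print(waveform_int.dtype)     # torch.int16 for a 16-bit PCM WAV file
print(waveform_int[:, :100])  # raw integer samples matching the scaled values above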

num_frames and frame_offset

You should notice: the second dimension of the audio data shape is the number of frames.

For example:

wav_path = r'10091.wav'
waveform, sample_rate = torchaudio.load(wav_path)
print(waveform.shape)

Output:

torch.Size([1, 164127])

It means 10091.wav has one channel and contains 164127 frames.
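
Because each frame is one sample per channel, the duration of the file in seconds is the frame count divided by the sample rate. For example, reusing waveform and sample_rate from the example above:

num_channels, total_frames = waveform.shape
duration_seconds = total_frames / sample_rate  # 164127 / sample_rate
print(num_channels, total_frames, duration_seconds)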

end_frame = 10000
start_frame = 1000
# Skip the first 1000 frames, then read end_frame - start_frame = 9000 frames
waveform, sample_rate = torchaudio.load(
    wav_path,
    num_frames=end_frame - start_frame,
    frame_offset=start_frame)

print(waveform.shape)

From this code, we can see that we get end_frame - start_frame = 10000 - 1000 = 9000 frames.

Output:

torch.Size([1, 9000])
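
Combining frame_offset and num_frames with the sample rate also lets us read a specific time range. Below is a sketch that reads the segment from 0.5 s to 1.5 s (assuming the file is long enough):

start_sec, end_sec = 0.5, 1.5
start_frame = int(start_sec * sample_rate)
frames_to_read = int((end_sec - start_sec) * sample_rate)
# Read only the requested one-second segment
segment, sample_rate = torchaudio.load(
    wav_path,
    frame_offset=start_frame,
    num_frames=frames_to_read)
print(segment.shape)  # torch.Size([1, frames_to_read]) for this mono file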
