Understand torchaudio.load(): Read Audio with Examples

In this tutorial, we will use some examples to show how to read an audio file with torchaudio.load().

Syntax

torchaudio.load() is defined as:

torchaudio.load(filepath: str, frame_offset: int = 0, num_frames: int = -1, normalize: bool = True, channels_first: bool = True, format: Optional[str] = None)

It returns a tuple (wav_data, sample_rate), where wav_data is a torch.Tensor and sample_rate is an int.

Here:

filepath: the path of the audio file. It can also be a URL.

normalize: default = True. When True, the native sample type is converted to float32. If the input file is an integer WAV, passing False makes the resulting tensor keep the integer type. This argument has no effect for formats other than integer WAV.

channels_first: default = True. When True, the returned tensor has the shape [channel, time]. When False, it has the shape [time, channel].
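
The signature also lists frame_offset and num_frames, which are not described above. In torchaudio, frame_offset is the number of frames to skip before reading starts, and num_frames is the maximum number of frames to read (-1 reads to the end of the file). Here is a minimal sketch that converts a time range in seconds to these two values. The helper name seconds_to_frames is hypothetical and not part of torchaudio, and the 48000 sample rate is only an example value.

```python
def seconds_to_frames(start_sec, duration_sec, sample_rate):
    """Hypothetical helper: convert a time range to (frame_offset, num_frames)."""
    frame_offset = int(round(start_sec * sample_rate))
    num_frames = int(round(duration_sec * sample_rate))
    return frame_offset, num_frames

# Example: 0.5 s of audio starting at 1.0 s in a 48 kHz file
frame_offset, num_frames = seconds_to_frames(1.0, 0.5, 48000)
print(frame_offset, num_frames)

# The two values could then be passed to torchaudio.load(), for example:
# wav, sr = torchaudio.load("008554.wav", frame_offset=frame_offset, num_frames=num_frames)
```

Reading only the needed range this way can avoid loading a whole long file into memory.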

How to use torchaudio.load()?

Here we will use some examples to show how to use this function.

When normalize = False

import torchaudio

wav_file = "008554.wav"

wav, sr = torchaudio.load(wav_file, normalize=False)
print(wav.shape, sr)
print(wav)

We will see:

torch.Size([1, 230496]) 48000
tensor([[0, 0, 1,  ..., 1, 0, 0]], dtype=torch.int16)

From the result, we can see:

wav is an int16 tensor with the shape [1, 230496]. The number of channels is 1, which means 008554.wav is a mono audio file.
sr = 48000, which means the sample rate of this audio is 48 kHz.
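
The shape and the sample rate together give the duration of the audio. A small sketch of that arithmetic, using the values printed above (the variable names are chosen for illustration):

```python
# Values taken from the output above
num_channels, num_samples = 1, 230496   # wav.shape
sample_rate = 48000                     # sr

# Duration in seconds = samples per channel / samples per second
duration = num_samples / sample_rate
print(f"{duration:.3f} seconds")        # 4.802 seconds
```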

When normalize = True

wav, sr = torchaudio.load(wav_file, normalize=True)
print(wav.shape, sr)
print(wav)

We will see:

torch.Size([1, 230496]) 48000
tensor([[0.0000e+00, 0.0000e+00, 3.0518e-05,  ..., 3.0518e-05, 0.0000e+00,
         0.0000e+00]])

We can see that wav is now a float32 tensor, while the shape and the sample rate are unchanged.
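
The float values above are consistent with dividing each int16 sample by 32768 (2 to the power of 15): the sample value 1 becomes 1/32768, which is about 3.0518e-05. Here is a small sketch of that scaling in plain Python, assuming this is the conversion applied. The function name int16_to_float is made up for illustration.

```python
def int16_to_float(sample):
    """Illustrative helper: scale a 16-bit integer sample into [-1.0, 1.0)."""
    return sample / 32768.0

# A few int16 values, including the smallest and largest possible ones
for s in (0, 1, -32768, 32767):
    print(s, f"{int16_to_float(s):.4e}")
```

With this scaling, the int16 range [-32768, 32767] maps to float values from -1.0 up to just below 1.0.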

When channels_first = False

wav, sr = torchaudio.load(wav_file, channels_first=False)
print(wav.shape, sr)
print(type(wav))

We will get:

torch.Size([230496, 1]) 48000
<class 'torch.Tensor'>
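
To make the two layouts concrete, here is a small sketch that uses plain Python lists as stand-ins for tensors. The stereo sample values are made up. With real tensors, the same conversion would typically be a transpose.

```python
# [channel, time] layout: 2 channels, 3 samples each (made-up values)
channels_first = [
    [0.1, 0.2, 0.3],   # first channel
    [0.4, 0.5, 0.6],   # second channel
]

# [time, channel] layout: one row per time step, one column per channel
time_first = [list(frame) for frame in zip(*channels_first)]

print(len(channels_first), len(channels_first[0]))  # 2 3
print(len(time_first), len(time_first[0]))          # 3 2
```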

How to slice the wav data?

We can slice the wav tensor in the same way as a NumPy array.

For example:

import torchaudio

wav_file = "008554.wav"

wav, sr = torchaudio.load(wav_file)
wav = wav[0]

sub_data = wav[0:20]
print(sub_data.shape)
sub_data = wav[-30:]
print(sub_data.shape)

We will get:

torch.Size([20])
torch.Size([30])