Samples for "Fitting New Speakers Based on a Short Untranscribed Sample"

Introduction

We present supplementary audio samples generated with the proposed method. All samples capture new voices that were unseen during training, and the same model, trained on 85 VCTK speakers (VCTK85), was used to fit every voice. For each dataset used in the paper, we present below the ground truth (unseen during both training and fitting) together with a sample generated from the same text as the ground truth.

Note that fitting was performed on a different sample of the same speaker. After the fitting step, the obtained speaker embedding, coupled with the text of the new sample, was used to generate it.
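The two-step procedure described above can be sketched as follows. This is an illustrative stand-in only, not the paper's actual model: `fit_speaker_embedding` and `synthesize` are hypothetical placeholders that merely mimic the interface (fit a fixed-size embedding from an untranscribed sample of the new speaker, then condition synthesis of new text on that embedding).

```python
import numpy as np

def fit_speaker_embedding(audio_frames: np.ndarray) -> np.ndarray:
    """Toy stand-in for the fitting step: reduce a variable-length
    sequence of acoustic frames (T, D) to a single (D,) embedding.
    The real method optimizes the embedding; here we just average."""
    return audio_frames.mean(axis=0)

def synthesize(text: str, speaker_embedding: np.ndarray) -> np.ndarray:
    """Toy stand-in for the synthesis step: emit one output frame per
    character, biased by the speaker embedding."""
    rng = np.random.default_rng(0)
    content = rng.standard_normal((len(text), speaker_embedding.shape[0]))
    return content + speaker_embedding  # shape (num_frames, D)

# Fitting uses one untranscribed sample of the new speaker...
fitting_sample = np.random.default_rng(1).standard_normal((200, 64))
embedding = fit_speaker_embedding(fitting_sample)

# ...while generation uses the text of a *different* sample of that speaker.
text = "The rainbow is a division of white light."
generated = synthesize(text, embedding)
```

The key point the sketch preserves is the separation of concerns: the audio sample only ever influences the speaker embedding, while the text to be spoken is supplied independently at generation time.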

VCTK dataset

[Table: for each length of the sample(s) used for fitting — 1.5 sec, 1 sentence, 2 sentences, 1 min, 5 min, 10 min, 15 min, 20 min — paired ground-truth and generated audio samples.]

Note that, as reported in our paper, a longer fitting sample does not necessarily lead to better quality with our method.


Libri-rest dataset


[Audio samples: ground truth and generated.]

VoxCeleb dataset


[Audio samples: ground truth and generated.]

Priming based method - VCTK dataset


[Audio samples: ground truth, priming, and generated.]