Samples for "Fitting New Speakers Based on a Short Untranscribed Sample"
Introduction
We present supplementary audio samples that were generated using the proposed method.
These samples capture new voices that were unseen during training, and the same VCTK85-trained model was used to
fit all voices. For each dataset used in the paper, we present below the ground truth
(unseen during both training and fitting) as well as the sample generated for the same text
as the ground truth.
Note that the fitting occurred on a different sample of the same speaker. The embedding obtained in the fitting step,
coupled with the text of the new sample, was then used to generate that sample.
VCTK dataset
Columns: Length of sample(s) used for fitting | Ground truth | Generated | Ground truth | Generated | Ground truth | Generated | Ground truth | Generated
Rows (length of sample(s) used for fitting): 1.5 sec, 1 sentence, 2 sentences, 1 min, 5 min, 10 min, 15 min, 20 min
Note that, as reported in our paper, a longer fitting sample does not typically
lead to better quality with our method.