I'm going to present this work about domain adaptation in speaker recognition, and in addition a labeling strategy working from scratch with unlabeled data.
We want to carry out speaker recognition on a new domain and to increase detection accuracy thanks to adaptation techniques. But we also want to take into account the difficulties of the task in real-life situations: the burden of data collection and the cost of labeling a large in-domain dataset.
So we assume that a unique and unlabeled in-domain development dataset is available, possibly reduced in size in terms of speakers and also of segments per speaker. This dataset is used to learn an adapted speaker recognition model.
First, we want to know how the performance increases depending on the amount of unlabeled in-domain data: in terms of segments, of speakers, or of segments per speaker.
Second, estimating the number of clusters usually demands an additional labeled in-domain dataset. To break this dependency, we want to carry out clustering without this requirement of a preexisting labeled in-domain dataset. This is explained later in this presentation.
This slide displays the most widespread processing pipeline for speaker recognition systems based on embeddings, and the different adaptation techniques that can be included.
Some methods aim at transforming vectors to reduce the shift between target and out-of-domain distributions (covariance alignment), while others adapt the feature distributions, mapping the out-of-domain distributions onto the target ones, leading to transforming out-of-domain data into plausible in-domain data.
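As a rough illustration of the first family, here is a minimal sketch of a CORAL-style covariance alignment, assuming centered embeddings stored in NumPy arrays; the function name and the regularization constant are illustrative choices, not the exact recipe of this work.

```python
import numpy as np
from scipy.linalg import inv, sqrtm

def coral_transform(x_out, x_in, eps=1e-6):
    """Map out-of-domain embeddings x_out (n x d) towards the
    distribution of unlabeled in-domain embeddings x_in (m x d)."""
    d = x_out.shape[1]
    c_out = np.cov(x_out, rowvar=False) + eps * np.eye(d)  # source covariance
    c_in = np.cov(x_in, rowvar=False) + eps * np.eye(d)    # target covariance
    # Whiten with the source covariance, then re-color with the target one.
    a = np.real(inv(sqrtm(c_out)) @ sqrtm(c_in))
    return x_out @ a
```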
When speaker labels of in-domain samples are available, supervised adaptation can be carried out. That is the kind of MAP approach, thanks more precisely to a linear interpolation between in-domain and out-of-domain parameters.
Also, score normalizations can be considered as unsupervised adaptation, as they use an unlabeled in-domain subset for the impostor cohort.
Note that we generalize this interpolation of the PLDA parameters to all trainable stages of the system: LDA and whitening. This consistently improves performance in all our experiments.
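Here is a minimal sketch of this interpolation idea, assuming each stage is summarized by a dictionary of NumPy parameters; the parameter names, dimensions and the value of alpha are illustrative assumptions.

```python
import numpy as np

def interpolate_params(p_out, p_in, alpha=0.5):
    """Linearly interpolate in-domain and out-of-domain parameters,
    stage by stage (e.g. whitening mean, LDA scatter, PLDA covariances)."""
    return {k: alpha * p_in[k] + (1.0 - alpha) * p_out[k] for k in p_out}

# Hypothetical two-covariance PLDA parameters for both domains.
plda_out = {"mu": np.zeros(256), "between": np.eye(256), "within": np.eye(256)}
plda_in  = {"mu": np.zeros(256), "between": np.eye(256), "within": np.eye(256)}
plda_adapted = interpolate_params(plda_out, plda_in, alpha=0.5)
```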
So, how does the performance increase depending on the amount of data? We carry out experiments focusing on the gain of adapted systems as a function of the available data, and the results are sorted by the selected parameters: number of speakers, number of segments per speaker, and adaptation technique.
Here is a description of the experimental setup for our analyses.
We use the front-end of the Kaldi recipe: 23 cepstral coefficients, with sliding mean normalization over a window of three seconds, then an energy-based VAD using the c0 component. The x-vector extractor is the one of the Kaldi toolkit, with an attentive statistics pooling layer. This extractor is trained on Switchboard and NIST SRE data.
For training, we use a five-fold data augmentation strategy following the usual recipe, with noise, music and babble drawn from MUSAN.
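Since the pooling layer matters here, below is a minimal NumPy sketch of attentive statistics pooling as commonly formulated (attention-weighted mean and standard deviation concatenated); the parameter shapes are illustrative assumptions, not the exact architecture of this system.

```python
import numpy as np

def attentive_stats_pooling(h, w_att, b_att, v_att):
    """h: frame-level features (T x d); w_att (d x a), b_att (a,),
    v_att (a,) are the learned attention parameters."""
    e = np.tanh(h @ w_att + b_att) @ v_att          # per-frame attention scores (T,)
    w = np.exp(e - e.max()); w /= w.sum()           # softmax over the T frames
    mu = (w[:, None] * h).sum(axis=0)               # attention-weighted mean
    var = (w[:, None] * (h - mu) ** 2).sum(axis=0)  # attention-weighted variance
    return np.concatenate([mu, np.sqrt(np.maximum(var, 1e-9))])
```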
The domain is that of an Arabic dialect, Tunisian Arabic, as in the NIST Speaker Recognition Evaluation 2018 CMN2 and 2019 CTS. This language is absent from the NIST speaker recognition training databases, which leads to a domain mismatch.
The in-domain corpus for development and test is described in this table. The development dataset gathers the enrollment and test segments derived from the NIST SRE 2018 development and test sets, and half of the enrollment segments derived from the NIST SRE 2019 test set; the others are set aside for making up the trial dataset of the test.
The fifty percent split takes genders into account to make the trial dataset more realistic. It contains trial pairs randomly and uniformly picked, with the constraint of being balanced by gender and of a target prior equal to one percent.
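As a rough sketch of such a trial design, assuming a mapping from segment id to speaker; the function name is hypothetical and the gender-balance bookkeeping is omitted for brevity.

```python
import itertools
import random

def draw_trials(speaker_of, n_trials, target_prior=0.01, seed=0):
    """Draw a trial list with a fixed target prior from a dict
    segment id -> speaker (gender balancing omitted for brevity)."""
    rng = random.Random(seed)
    pairs = list(itertools.combinations(sorted(speaker_of), 2))
    targets = [p for p in pairs if speaker_of[p[0]] == speaker_of[p[1]]]
    nontargets = [p for p in pairs if speaker_of[p[0]] != speaker_of[p[1]]]
    n_tgt = int(round(n_trials * target_prior))
    return rng.sample(targets, n_tgt) + rng.sample(nontargets, n_trials - n_tgt)
```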
When analysing the adaptation strategies, the number of speakers and the number of segments per speaker are varied, in order to reach different total amounts of segments, and also, given a fixed total amount, to assess the impact of speaker class variability. Each time, a subset is picked from the 310-speaker development dataset and provided to the models. The trial dataset is fixed and only intended for testing.
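A minimal sketch of this subsetting protocol, assuming the development pool is a dict mapping speaker to segment ids; the names are illustrative.

```python
import random

def sample_subset(segments_by_speaker, n_spk, n_seg=None, seed=0):
    """Pick n_spk speakers, then n_seg segments for each of them;
    n_seg=None keeps all segments of the selected speakers."""
    rng = random.Random(seed)
    speakers = rng.sample(sorted(segments_by_speaker), n_spk)
    subset = {}
    for spk in speakers:
        segs = segments_by_speaker[spk]
        subset[spk] = list(segs) if n_seg is None else rng.sample(segs, min(n_seg, len(segs)))
    return subset
```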
Several alternatives are considered and experimented: a system applying unsupervised adaptation only, a system applying supervised adaptation only, and a system applying the full pipeline, unsupervised then supervised. The goal is to assess the usefulness of unsupervised techniques when speaker labels are available.
This figure shows the results of our analyses: performance, in terms of equal error rate, of the unsupervised and supervised adapted systems, depending on the number of speakers and on the number of segments per speaker of the in-domain development dataset. The case "all segments per speaker" corresponds to keeping all the segments of the selected speakers; otherwise, each configuration is given by the number of speakers and the number of segments per speaker.
It can be observed that combining unsupervised and supervised adaptation is the best strategy, even when labeled in-domain data are available.
We also observe that even with a small in-domain dataset, here only fifty speakers, there is a significant gain of performance with adaptation, compared to the baseline at 12.12 percent.
Now let us focus on the dashed curves in the figure: they correspond to fixed total amounts of segments. For example, this last curve corresponds to the same total amount of 2,500 segments: possibly fifty speakers with fifty segments per speaker, or one hundred speakers with twenty-five segments per speaker.
By sweeping such a curve, we can observe that, given a total amount of segments, performance improves with the number of speakers: gathering data from a few speakers, even with many utterances per speaker, limits the gain of adapted systems.
Let us now talk about clustering. The goal is to obtain a reliably labeled in-domain dataset by using unsupervised clustering and then identifying the relevant classes. Let X denote the unlabeled in-domain dataset to cluster; the resulting classes are estimates of the actual speaker labels. Note that we also use Y, a preexisting labeled dataset from in-domain data.
A PLDA model is computed using the out-of-domain training dataset. Then the matrix of the scores of all trials of X is used for carrying out an agglomerative hierarchical clustering, using it as a similarity matrix.
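A minimal sketch of this step, assuming `scores` is a symmetric matrix of pairwise PLDA scores for the unlabeled in-domain segments; the average linkage here is an illustrative assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def ahc_labels(scores, n_clusters):
    """Agglomerative hierarchical clustering of segments from a
    pairwise similarity (PLDA score) matrix."""
    dist = scores.max() - scores        # turn similarities into distances
    np.fill_diagonal(dist, 0.0)
    z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(z, t=n_clusters, criterion="maxclust")
```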
Given this clustering, the remaining problem is how to determine the actual number of classes, by sweeping the number of clusters: for each number Q, a model is estimated, which includes the adapted PLDA parameters, and the preexisting in-domain labeled dataset Y is used for error rate computation. Then we select the class labels corresponding to the number of classes Q that minimizes the error rate.
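A minimal sketch of this selection loop; train_plda and score_trials are hypothetical helpers standing in for the adaptation and scoring stages, ahc_labels comes from the previous sketch, and eer below is a standard equal error rate computation.

```python
import numpy as np

def eer(tgt, non):
    """Equal error rate from target and non-target score arrays."""
    thresholds = np.sort(np.concatenate([tgt, non]))
    miss = np.array([(tgt < t).mean() for t in thresholds])
    fa = np.array([(non >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(miss - fa))
    return (miss[i] + fa[i]) / 2.0

def select_num_classes(x, scores, y_trials, q_range):
    """Pick the number of classes Q minimizing the error rate on the
    labeled in-domain set Y."""
    best = None
    for q in q_range:
        labels = ahc_labels(scores, q)            # cluster the unlabeled set X
        model = train_plda(x, labels)             # hypothetical adaptation step
        tgt, non = score_trials(model, y_trials)  # hypothetical scoring on Y
        e = eer(tgt, non)
        if best is None or e < best[0]:
            best = (e, q, labels)
    return best
```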
The drawback of this approach is highlighted here: it requires a preexisting in-domain labeled development set, which is not always available. Hence a method from scratch, without in-domain data except the dataset to cluster. So we propose a method for clustering the in-domain dataset and determining the optimal number of classes from scratch, without the requirement of a preexisting in-domain labeled set. Here is the algorithm.
First, this part of the algorithm is identical. Then, for each number of classes Q, we identify classes with speakers and build the corresponding key matrix. Then we use this matrix of artificial keys for computing the error rate.
Now we have to determine the optimal number of classes. We use the elbow criterion, well known in the field of clustering. Displayed here are the error rate and DCF curves used as criteria for determining the optimal number of clusters.
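A minimal sketch of this from-scratch criterion: the cluster labels themselves define an artificial trial key over the score matrix, and the resulting error rate curve is inspected for its elbow. eer comes from the previous sketch, and the second-difference rule below is only one simple way to locate an elbow, not necessarily the exact rule used here.

```python
import numpy as np

def pseudo_key_eer(scores, labels):
    """'Blind' EER where same-cluster pairs play the role of targets."""
    rows, cols = np.triu_indices_from(scores, k=1)
    same = labels[rows] == labels[cols]
    vals = scores[rows, cols]
    return eer(vals[same], vals[~same])

def elbow(qs, errs):
    """Pick the q where the decrease of the error curve slows down the
    most (largest second difference)."""
    d2 = np.diff(np.asarray(errs), 2)
    return qs[int(np.argmax(d2)) + 1]
```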
The reported values correspond to the loop of the algorithm from scratch. We can see that the slope of the equal error rate curve slows down in a neighborhood, by excess, of the exact number of speakers, which is 250. Moreover, the values of the DCF show stable operating points which reach local minima before converging to zero, the first one in the same neighborhood of 250. So the criterion yields around 300 classes; beyond this threshold, the DCF increases.
Now let us display the performance of the adapted system using clustering from scratch, as a function of the number of clusters, compared to unsupervised and supervised adaptation with the exact speaker labels. With the exact speaker labels and full adaptation, the equal error rate is around six percent, and with only unsupervised adaptation, performance is around seven percent. And we can see the curve of the results obtained by varying the number of classes formed by the clustering from scratch that we propose. We can see that the method overestimates the number of speakers, but manages to attain interesting performance in terms of equal error rate and DCF, close to the performance with the exact labels and supervised adaptation. This holds for various numbers of segments per speaker: five, ten, or more.
For example, on the last line, we can see that the results obtained by clustering from scratch are similar to those obtained with a preexisting labeled development set, but also close to the ones with the exact speaker labels.
Now let us conclude. The analyses that we carried out show that the improvement of performance is due to supervised but also unsupervised domain adaptation techniques, like CORAL or PLDA interpolation. These techniques combine well, one in the feature domain, the other in the model domain, to achieve the best performance. Also, it is observed that a small sample of in-domain data can significantly reduce the performance gap, above all when favoring the number of speakers rather than the number of segments per speaker.
Lastly, a new approach for speaker labeling has been introduced here, working from scratch, without preexisting labeled in-domain data for clustering, while achieving good performance. Thank you for your attention. You can read the paper for more details on this study. Bye bye.