Hello, my name is Daniel Garcia-Romero, and I will be presenting joint work with Greg Sell and Alan McCree from the Human Language Technology Center of Excellence at Johns Hopkins University. The title of our work is MagNetO: x-vector magnitude estimation network plus offset for improving speaker recognition.
The current state of the art in text-independent speaker recognition is based on DNN embeddings trained with a classification loss, for example multiclass cross-entropy. If there is no severe mismatch between the DNN training data and the deployment environment, then the cosine similarity between embeddings from a system trained with an angular-margin softmax provides very good speaker discrimination.
For example, in the most recent NIST SRE evaluation, which used audio extracted from videos, the top performing single system on the audio track was based on this approach.
Unfortunately, even though cosine similarity provides good speaker discrimination, directly using those scores does not allow us to make use of the theoretical Bayes threshold, because the scores are not calibrated.
The typical way to address this problem is to use an affine mapping that transforms the scores into log-likelihood ratios, which are calibrated. This is typically done using logistic regression, and we learn two numbers: a scale and an offset.
Looking at the top equation, the raw scores are denoted by s_ij, which is the cosine similarity between two embeddings. This can be expressed as the inner product of the unit-length embeddings, x̃_i^T x̃_j, so it is nothing more than the inner product between two unit-length embeddings. Once we learn the calibration mapping with parameters a and b, we can transform this score into a log-likelihood ratio, llr_ij = a s_ij + b, and then we can make use of the Bayes threshold to make optimal decisions.
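As a concrete illustration of this conventional recipe (a minimal sketch, not code from the paper), the scale and offset can be fit with prior-weighted logistic regression on a set of labeled trials; the function and variable names here are hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

def fit_affine_calibration(scores, labels, target_prior=0.05):
    """Fit llr = a * s + b by minimizing a prior-weighted binary cross-entropy.

    scores: raw cosine scores for the trials; labels: 1 = target, 0 = non-target.
    The target_prior value is a placeholder, not the one used in the paper.
    """
    log_prior_odds = np.log(target_prior / (1.0 - target_prior))

    def weighted_xent(params):
        a, b = params
        log_post_odds = a * scores + b + log_prior_odds   # llr + log prior odds
        xent_tar = np.logaddexp(0.0, -log_post_odds[labels == 1]).mean()
        xent_non = np.logaddexp(0.0, log_post_odds[labels == 0]).mean()
        return target_prior * xent_tar + (1.0 - target_prior) * xent_non

    a, b = minimize(weighted_xent, x0=[1.0, 0.0], method="Nelder-Mead").x
    return a, b
```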
In this work we propose a different approach. One way to look at it is to note that the affine calibration scale can be thought of as simply assigning a constant magnitude to the unit-length embeddings, so every embedding gets the same magnitude. Instead, we suggest that it is probably better for each embedding to have its own magnitude, and we want to use a neural network to estimate the optimal value of those magnitudes. We also use a global offset to complete the mapping to log-likelihood ratios. Note that this new approach may result in a non-monotonic mapping, which means that it has the potential to not only produce calibrated scores but also to improve discrimination by increasing the separation between the score classes.
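To make the contrast concrete, here is a minimal sketch of the two scoring rules (the names are hypothetical, and the magnitudes m_i and m_j are assumed to come from the proposed estimation network):

```python
import numpy as np

def global_calibration_score(x_i, x_j, a, b):
    # conventional recipe: one shared scale a for every unit-length embedding
    return a * np.dot(x_i, x_j) + b

def magnitude_score(x_i, x_j, m_i, m_j, b):
    # proposed idea: each unit-length embedding gets its own magnitude,
    # plus a single global offset b
    return np.dot(m_i * x_i, m_j * x_j) + b   # = m_i * m_j * cos(x_i, x_j) + b
```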
To train this magnitude estimation network we use a binary classification task. We draw target and non-target trials from a training set, and the loss function is a prior-weighted binary cross-entropy, where the weighting is given by the prior of a target trial, and the logit is the log posterior odds, which can be decomposed in terms of the log-likelihood ratio and the log prior odds.
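A minimal PyTorch sketch of this loss, assuming the network outputs one log-likelihood ratio per trial (the exact weighting and prior used in the paper may differ):

```python
import math
import torch
import torch.nn.functional as F

def weighted_bce_loss(llr, is_target, target_prior=0.05):
    # llr: predicted log-likelihood ratios for a batch of trials, shape [N]
    # is_target: float tensor with 1.0 for target trials and 0.0 for non-targets
    log_prior_odds = math.log(target_prior / (1.0 - target_prior))
    log_post_odds = llr + log_prior_odds              # log posterior odds
    per_trial = F.binary_cross_entropy_with_logits(
        log_post_odds, is_target, reduction="none")
    # weight the two classes by the target prior and its complement
    weights = is_target * target_prior + (1.0 - is_target) * (1.0 - target_prior)
    return (weights * per_trial).sum() / weights.sum()
```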
The overall system architecture that we are going to use is trained in three steps. On the left is a block diagram of our baseline architecture. We use 2D convolutions in a ResNet architecture, followed by temporal pooling, giving a high-dimensional vector of pooled activations. Then we use an affine layer as a bottleneck to obtain the embedding, which is going to be 256-dimensional. The star denotes the node where the embedding is extracted once the network has been trained. The network is trained using multiclass cross-entropy with a softmax classification head that uses an additive margin.
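For reference, a generic additive-margin softmax head looks roughly like the sketch below; the margin and scale values are placeholders, not necessarily the ones used in this work:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveMarginSoftmax(nn.Module):
    def __init__(self, embed_dim, num_speakers, margin=0.2, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_speakers, embed_dim))
        self.margin, self.scale = margin, scale

    def forward(self, embeddings, labels):
        # cosine similarity between unit-length embeddings and class weights
        cos = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        # subtract the margin from the true-class cosine only
        onehot = F.one_hot(labels, cos.size(1)).to(cos.dtype)
        logits = self.scale * (cos - self.margin * onehot)
        return F.cross_entropy(logits, labels)
```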
The first step of the training process is to use short segments to train the network. In the past we have seen this to be a good compromise, because the short sequences allow for a good use of GPU memory with large batches, and at the same time they make the task harder, so even though we have a very powerful classification head there is still an error signal whose gradients we can back-propagate.
As the second step, we propose to freeze the most memory-intensive layers, which are typically the early layers that operate at the frame level, and then fine-tune the post-pooling layers with whole recordings, using all the frames of the audio recording, which might be a couple of minutes of speech. By freezing the pre-pooling layers we reduce the demands on memory, and therefore we can use the long sequences, and we also avoid overfitting to the easier problem posed by the long sequences.
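A minimal PyTorch sketch of this second step, using placeholder modules for the pre-pooling (frame-level) and post-pooling parts of the network:

```python
import torch
import torch.nn as nn

# Placeholder model split the same way as in the talk; the real modules are the
# ResNet frame-level stack and the embedding/classifier layers after pooling.
model = nn.ModuleDict({
    "frame_level": nn.Sequential(nn.Conv1d(40, 512, 5), nn.ReLU()),
    "post_pooling": nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 6000)),
})

# Freeze the memory-intensive frame-level layers ...
for p in model["frame_level"].parameters():
    p.requires_grad = False
model["frame_level"].eval()   # keep batch-norm statistics fixed as well

# ... and fine-tune only the post-pooling layers on full-length recordings.
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4, momentum=0.9)
```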
Finally, the third step is where we train the magnitude estimation network. The first thing we do is discard the multiclass classification head and switch to a binary classification task. We use a Siamese structure, which is depicted here by copying the network twice, but the parameters are shared; this is just for illustration purposes. Notice that we also freeze the affine embedding layer, which is denoted by the grey colour. So at this point everything that produces the embeddings is fixed, and we add the magnitude estimation network, which takes the pooled activations, which are very high-dimensional, and tries to learn a magnitude to go along with the unit-length x-vector. It is optimized to minimize the binary cross-entropy, and we also keep the global offset as part of the optimization.
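A rough PyTorch sketch of this third stage is below; the layer sizes are placeholders rather than the topologies from the paper, and the softplus used to keep the magnitude positive is my assumption, not something stated in the talk:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MagnitudeHead(nn.Module):
    # Small feed-forward network that maps the (frozen) high-dimensional pooled
    # activations to a per-embedding magnitude, plus a trainable global offset.
    def __init__(self, pooled_dim=4096, hidden_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pooled_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1))
        self.offset = nn.Parameter(torch.zeros(1))

    def forward(self, pooled_i, emb_i, pooled_j, emb_j):
        # Siamese use: the same head processes both sides of the trial.
        m_i = F.softplus(self.net(pooled_i))           # positive magnitude (assumption)
        m_j = F.softplus(self.net(pooled_j))
        x_i = m_i * F.normalize(emb_i)                 # scaled unit-length x-vector
        x_j = m_j * F.normalize(emb_j)
        return (x_i * x_j).sum(dim=-1) + self.offset   # inner product + global offset
```

The output can be fed directly into the prior-weighted binary cross-entropy shown earlier.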
To validate our ideas we use the following setup. As our baseline system we use a modification of a previously proposed ResNet-34 x-vector architecture. The modification we make is to allocate more channels to the early layers, because we have seen that this improves performance. At the same time, to control the number of parameters, we change the expansion rates of the different layers so that we do not increase the channels so much in the deeper layers. In this way we are able to control the number of parameters without degrading performance. To train the DNN we use the VoxCeleb2 dev data, which comprises about six thousand speakers and a million utterances, and this is wideband 16 kHz audio. Note that we process the data differently when we use it with short segments versus full-length refinement, in terms of how we apply augmentations; I refer you to the paper for the details, since those are very important for good performance and also for generalization.
To make sure that we do not overfit to a single evaluation set, we benchmark against four different sets. Speakers in the Wild and VoxCeleb1 are acoustically similar to VoxCeleb2, so there is not much domain mismatch between those two evaluation sets and the training data. The SRE19 audio-from-video portion and CHiME-5 have some domain shift compared to the training data, and I will discuss that in the results later. In the case of SRE19 this is mostly because the test audio comprises multiple speakers and there is a need for diarization, and in the CHiME-5 case there are far-field microphone recordings with a lot of overlapping speech and higher levels of reverberation, so it is a very challenging setup. Also, the CHiME-5 results will be split in terms of a close-talking microphone and two far-field microphones.
Let's start by looking at the baseline system that we are proposing. We present the results in terms of equal error rate and two other operating points; we do this to facilitate the comparison with prior work. If you look at the right of the table, we list the best single-system numbers (no fusion) that we were able to find in the literature for each of the benchmarks, for all of the reported costs. Our baseline seems to do a good job compared to the prior work, improving performance at most of the operating points. Note that we are not doing any particular tuning for each evaluation set. There is one small caveat: as I said, SRE19 requires diarization, so we diarize the test segments, then for each speaker detected in a test segment we extract an x-vector, and we score the enrollment against all the test x-vectors and take the maximum as the score for the trial.
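A minimal sketch of this scoring rule for diarized test segments, assuming unit-length x-vectors and the global affine calibration (the same idea applies with the magnitude-scaled scores):

```python
import numpy as np

def score_diarized_trial(enroll_xvec, test_xvecs, a, b):
    # One x-vector per speaker detected in the diarized test segment:
    # score the enrollment against each of them and keep the maximum.
    scores = [a * np.dot(enroll_xvec, t) + b for t in test_xvecs]
    return max(scores)
```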
To check the improvements that the full-length refinement of the second step brings, we can compare this table with respect to the baseline. Overall we see positive trends across all the datasets and operating points, but the gains are larger for Speakers in the Wild, and this makes sense because that is the set where the evaluation data has a longer duration compared to the four-second segments that were used to train the DNN. This validates recent findings, for example in our Interspeech paper, in which we saw that full-length refinement is a good way to mitigate the duration mismatch between the training phase and the test phase.
Regarding the magnitude estimation network, we explored multiple topologies. All of them were feed-forward architectures, and we explored their depth and width. Here we present three representative cases; they vary in terms of the number of layers and the width of the layers, and the parameter counts go from 1.5 million to 20 million. When we compare the performance of these three architectures across all the tasks, we do not see large changes, so the performance is quite stable across networks, which is probably a strength. To get a good trade-off between the number of parameters and performance, we are going to keep the MagNetO-2 architecture for the remaining experiments.
Next we present the overall gains in discrimination due to the three stages. In the graphs, the horizontal axis lists the different benchmarks; we put the far-field microphones in a different plot just to facilitate the visualization, because they are in a different dynamic range. On the vertical axis we depict one of the cost metrics. The colour coding indicates the system: the baseline, the baseline with full-length refinement (in orange), and the magnitude estimation applied on top of the full-length refinement (in grey). Overall we can see that the two additional training stages, full-length refinement and magnitude estimation, produce gains, and we see that across all datasets. In terms of EER we are getting about a twelve percent gain, and for the other two operating points we are getting average gains of around twenty percent. Even though I am only showing one operating point here, in the paper you can see the results for the other two operating points.
Finally, let's look at the calibration results. Both the global calibration and the magnitude estimation network are trained on the VoxCeleb2 dev dataset. This is a good match for the VoxCeleb1 and Speakers in the Wild evaluation sets, but it is not such a good match for CHiME-5 and SRE19, where there are the domain shifts discussed before. For the global calibration we can see that we obtain good performance, in terms of the actual costs and the minimum costs, for both VoxCeleb1 and Speakers in the Wild, but when we move to the other datasets we struggle to obtain good calibration with the global calibration. Looking at the magnitude estimation network we see a similar trend: for VoxCeleb1 and Speakers in the Wild we obtain very good calibration, but the system also struggles on the other sets. I think a fair statement is that the magnitude estimation network does not solve the domain shift, but it outperforms the global calibration at all the operating points and for all datasets.
To gain some understanding of what the magnitude estimation is doing, we did some analysis. The bottom plot on the right shows the cosine scores: the histograms of the scores for the non-target and the target distributions. The red colour indicates the non-target scores and the blue colour indicates the target scores. The top two panels show the cosine score plotted against the product of the magnitudes of the two embeddings involved in the trial. The horizontal line indicates the global scale, or magnitude, that the global calibration assigns to every embedding. The scores used for this analysis are the ones from the Speakers in the Wild evaluation. Since the magnitude estimation network improves discrimination, we expect two trends. For the low cosine-score targets, we expect the product of the magnitudes to be bigger than the global scale. On the other hand, for the high cosine-score non-target trials, we expect the opposite, that is, the product of the magnitudes should be smaller than the global scale.
The expected trends are indeed present in these plots. If we look at the top plot, we see that there is an upward tilt, and the magnitudes for the low cosine-score targets tend to be above the constant magnitude that would be assigned by the global calibration. On the other hand, we see that a large portion of the non-targets are below the global scale, and the ones that get very high cosine scores are also quite attenuated. This is consistent with the observation that the magnitude estimation network improves discrimination.
So, to conclude, we have introduced a magnitude estimation network together with a global offset. The idea is to assign a magnitude to each one of the unit-length x-vectors that are trained with an angular-margin softmax. The resulting scaled x-vectors can be directly compared using inner products to produce calibrated scores, and we have also seen that this increases the discrimination between speakers. Although the domain shift still remains a challenge, there are significant improvements: the proposed system outperforms a very strong baseline on the four benchmarks we mentioned. We have also validated the use of full-recording refinement to help with the duration mismatch between the training and test phases. If you found this work interesting, I suggest that you also take a look at the related work that my colleagues are going to present at this workshop. If you have any questions you can reach me at my email, and I look forward to seeing you in the Q&A sessions. Thanks for your time.