So, hi everyone. I'm going to talk about a very similar approach to what Mitchell described before, at least for the speaker recognition part, so it won't be anything new.
This is the outline, more or less. I'm going to describe a little bit the use of DNNs in speech and now in speaker recognition, and how to extract Baum-Welch statistics; I'll do that a bit more analytically than Mitchell did. Then some DNN and PLDA configurations, and some experiments on Switchboard and the NIST 2012 evaluation.
So, a little bit about the limitations of UBM-based speaker recognition so far. The short-term spectral information that we have traditionally been using as front-end features in speaker recognition works fine in some senses, but not in others. To be more specific, our experience is that when the alignment is phonetically informed, say when I compare the way two speakers pronounce the same phonetic content, you will be able to discriminate between the speakers a little bit more effectively than when you compare them on arbitrary content. And the problem is that the current, traditional UBM-based speaker recognition systems don't capture this information, because they are not phonetically aware: the classes that we define by training a UBM in an unsupervised way, segmenting, let's say, the input space using the features themselves, and that we then use to extract Baum-Welch statistics, do not have the phonetic awareness that is needed.
So the challenge here is to use DNNs, which we know are now capable of drastically improving the performance of ASR systems, and capture this idiosyncratic way in which a speaker pronounces each unit. As we said, the units, which are actually the same as in ASR, are tied triphone states, the senones. With such units for ASR, the reports show something like thirty percent relative improvement in terms of word error rate compared to GMMs.
The DNNs have several hidden layers, five or six, and the tied triphone states as outputs. They are discriminative classifiers, yet we can combine them with HMMs using the trick of turning posteriors back into likelihoods by subtracting the prior in the log domain, and then we can plug them into the HMM framework.
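To make that trick concrete, here is a minimal sketch, assuming NumPy arrays for the DNN outputs and priors estimated from the training alignments; the function name is illustrative, not from the talk:

```python
import numpy as np

def posteriors_to_scaled_loglikes(log_posteriors, state_priors):
    """Convert DNN state posteriors into scaled log-likelihoods.

    log_posteriors: (T, S) array of log p(state | frame) from the DNN.
    state_priors:   (S,) array of p(state), e.g. relative frequencies
                    of the senone labels in the training alignments.

    By Bayes' rule, p(x|s) = p(s|x) * p(x) / p(s); since p(x) is
    constant per frame, subtracting log p(s) gives a likelihood
    scaled by p(x), which is all the HMM decoder needs.
    """
    return log_posteriors - np.log(state_priors)[None, :]
```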
Initially, people used to initialize these networks with stacked restricted Boltzmann machines. It has been shown that this is no longer needed, but you might imagine cases, domains or languages, where not enough labeled data is available; you might have very little labeled data but a lot of unlabeled data. In such cases, let's not exclude the possibility of using this stacked architecture of RBMs to initialize the DNN more robustly.
And I think the key difference is the capacity of handling longer segments as inputs, something like three hundred milliseconds, in order to capture the temporal information. This is the reference, by the way, a little bit old by now, from two of the pioneers.
So the UBM approach, described more or less schematically, goes like this: you start by training a UBM using the EM algorithm, and then for each new utterance you extract the so-called zero-order and first-order statistics. Then you use your UBM again in order to prewhiten your Baum-Welch statistics component-wise, because that's what you are doing, effectively.
In the DNN-based approach, what we replace is the posterior probability of each frame belonging to each component; that's the only difference. So in this gamma_t(c), where t is the frame index and c is the component, the posterior is the only thing that changes. That means we don't have to change our algorithms at all; we just have to have the DNN output these posteriors, and that's all. No need to change the code, of course.
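As a sketch of how little changes, here is the standard statistics accumulation in NumPy; the only decision is where the responsibilities gamma come from (names are illustrative):

```python
import numpy as np

def baum_welch_stats(gamma, feats):
    """Zero- and first-order Baum-Welch statistics.

    gamma: (T, C) responsibilities gamma_t(c) -- from UBM posteriors
           in the classical recipe, or from DNN senone posteriors in
           the scheme described here; nothing else changes.
    feats: (T, D) acoustic features used for i-vector extraction.

    Returns N: (C,) zero-order stats, F: (C, D) first-order stats.
    """
    N = gamma.sum(axis=0)   # N_c = sum_t gamma_t(c)
    F = gamma.T @ feats     # F_c = sum_t gamma_t(c) * x_t
    return N, F
```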
So a UBM is still needed, but practically only for the last step, to prewhiten the Baum-Welch statistics before feeding them either to an i-vector extractor or maybe to JFA. And of course EM is not required to train this UBM, because the posteriors actually come from the DNN, so there is no need for the E-step; just an M-step, a single one or a few, will be sufficient.
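A minimal sketch of that single M-step, assuming fixed DNN posteriors and diagonal covariances (again illustrative code, not the actual implementation):

```python
import numpy as np

def m_step_ubm(gamma, feats, floor=1e-6):
    """One M-step: a 'supervised' UBM from fixed DNN posteriors.

    Since the posteriors come from the DNN and never change, there is
    no E-step to iterate; a single M-step over the training data gives
    the component weights, means and (diagonal) covariances that are
    later used to prewhiten the Baum-Welch statistics.
    """
    N = gamma.sum(axis=0)               # (C,) zero-order stats
    F = gamma.T @ feats                 # (C, D) first-order stats
    S = gamma.T @ (feats ** 2)          # (C, D) second-order stats
    weights = N / N.sum()
    means = F / N[:, None]
    covs = S / N[:, None] - means ** 2  # diagonal covariances
    return weights, means, np.maximum(covs, floor)
```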
And it is interesting to note here that different features can be used for estimating the assignments of a frame to the senones, or what we used to call the components of the UBM, than those that you finally use for extracting i-vectors or whatever you are using. So you don't have to change anything there; you can have two parallel feature streams that are optimized for the two tasks, the ASR task and the speaker recognition task, as long, of course, as they have the same frame rate.
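Schematically, the decoupling looks like this; the extraction helpers are hypothetical, and baum_welch_stats is the accumulator sketched above:

```python
# Hypothetical sketch of the two parallel feature streams:
# ASR-style features decide the soft alignments, while the
# speaker-recognition features fill the statistics.
asr_feats = extract_filterbanks(wav)       # (T, 40) e.g. log mel filter banks
sid_feats = extract_mfcc(wav)              # (T, 23) e.g. MFCCs for the SID side
assert len(asr_feats) == len(sid_feats)    # the frame rates must match

gamma = dnn_senone_posteriors(asr_feats)   # (T, C) senone posteriors
N, F = baum_welch_stats(gamma, sid_feats)  # stats accumulated on SID features
```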
I'm not going to go too deep into that. This is the first DNN configuration we developed; it was inspired by the paper of Vesely et al., which was a very successful paper for ASR; we managed to reproduce the ASR results and to tune it a bit further, as you'll see next. And this is the SRI configuration. We had some results, and then SRI told us, "guys, we managed to obtain some amazing results with this", and it showed that the method was actually sound. So we tried this as well.
We tried the first configuration, but on Switchboard data, not on NIST. Then this was the configuration of Yun Lei and colleagues from SRI, and it's a little bit different: it uses TRAP-like features at the front end, which is maybe a better thing to do; it has a long span, thirty-one frames; and it uses log mel filter banks. They use forty of them, I think, while we used twenty-three, and that was, I guess, one of the reasons why the results we obtained are not that good. There are several reasons, as you would expect; there are a lot of free parameters that someone has to tune, as I'm going to show you next. So we had two configurations: the small one, practically the one for which we included results in the camera-ready paper, and the big configuration, which is closer to what SRI describe in their paper.
These are some ASR results we obtained. There you see, first of all, the comparison that is in Vesely's paper, just to show the dramatic improvement you can obtain by using DNNs instead of GMMs for the emission probabilities, and then the two configurations we developed, the green one mostly inspired by the work of Vesely and the other one by SRI.
Now let's go back to speaker recognition. A few words on PLDA, to tell you what flavour of PLDA we used. We found that for most of the cases the full-rank V, V being the speaker space, worked better; we didn't of course try all configurations, but it worked better compared to, say, rank one hundred and twenty. For example, in this system, before length normalization we apply WCCN instead of prewhitening; in most of the cases that again worked very well, much better than prewhitening. And about this dilemma of whether you should average i-vectors after or before length normalization: I think you should average before and after length normalization, because that's more consistent with the way you train the PLDA model, and in our case it made a lot of difference.
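A minimal sketch of that averaging recipe, assuming the i-vectors of one speaker are stacked in a NumPy array:

```python
import numpy as np

def length_norm(w):
    """Project i-vectors onto the unit sphere (length normalization)."""
    return w / np.linalg.norm(w, axis=-1, keepdims=True)

def average_enrollment(ivectors):
    """Average multiple enrollment i-vectors, shape (n, K).

    Normalizing before averaging matches the way the PLDA model is
    trained on length-normalized vectors; normalizing again after
    averaging returns the result to the unit sphere.
    """
    return length_norm(length_norm(ivectors).mean(axis=0))
```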
Okay. So, these are the results from Switchboard with the first configuration. They're not that good; they're not even comparable to the ones you obtain with the baseline system. So we were rather disappointed at that stage, which was somewhere around Christmas. But once you fuse them, you get something that makes you say, yes, it's good. Note that in this case we are using a single enrollment utterance, the same for males and females, more or less.
Now let's go to NIST with the configuration, or what we thought was the configuration, of SRI. This is the small configuration. Now we see that, at least in the low false alarm area, we are making progress, though not much by fusing them; the fusion was not that good. And, by the way, I'm emphasizing conditions C2 and C5, although C5 is a subset, just to make sure that we cover both clean and noisy telephone data. And this is the big configuration, same picture; now we are comparing it to a 2048-component GMM, and it's more or less the same picture: you get some improvement in the low false alarm area, in some cases, but don't think it is that much. This is what we could get with the big configuration.
Now I'm going to talk a little bit about PLDA, because there was this issue of domain adaptation on the agenda; so let's focus a little bit on PLDA, just to share with you a result which I think is interesting. We know that when you apply length normalization you may attain results that are even better than heavy-tailed PLDA in some cases. The problem is that this transformation is somewhat sensitive to the datasets, so ideally it would be great to get rid of it. A possible alternative would be to scale down the number of recordings. What that means is that you pretend that instead of having N recordings you have N over three. We define the scaling factor arbitrarily, but one over three or one over two works fine in practice. And using that trick, all the evidence criteria behave; I mean, once you train the PLDA you get a strictly increasing evidence, which is good, and you somehow lose confidence, which is a good thing; it's okay to lose confidence in some cases. And to the question of whether we can get rid of length normalization this way, the answer is no, but we get rather close. A scale factor of one means practically no change.
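To see where the count enters, here is a sketch in a simplified Gaussian PLDA, with speaker loading matrix V and within-speaker covariance Sigma; this is my reconstruction of the trick, not code from the talk:

```python
import numpy as np

def speaker_posterior(V, Sigma_inv, ivecs, alpha=1.0):
    """Posterior of the speaker factor y given n i-vectors, under
    w = V y + eps, with y ~ N(0, I) and eps ~ N(0, Sigma).

    The standard result is precision = I + n * V' Sigma^-1 V, with the
    mean built from the summed i-vectors. The scaling trick replaces
    the count n by alpha * n (e.g. alpha = 1/3), which keeps the mean
    direction but widens the posterior: the model 'loses confidence'.
    """
    n = ivecs.shape[0]
    VtSi = V.T @ Sigma_inv
    prec = np.eye(V.shape[1]) + alpha * n * (VtSi @ V)
    # the summed statistics are scaled consistently with the count
    mean = np.linalg.solve(prec, alpha * (VtSi @ ivecs.sum(axis=0)))
    return mean, np.linalg.inv(prec)
```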
Here are some results with different scaling factors. All I'm doing is simply dividing the number of recordings, both when training and when evaluating the model, that is, multiplying the count by one over two or one over three. I'm guessing that most of the gap between not doing length normalization and doing length normalization is somehow bridged by this trick. So maybe the people that are working on domain adaptation, where length normalization is a problem, can use this as an alternative to length normalization, and they can tell me if they find something interesting.
So, conclusions. The use of state-of-the-art DNNs can definitely replace the traditional GMM in a UBM-based system, and the good thing is that once the Baum-Welch statistics are extracted, exactly the same machinery can be applied; no need to change the common machinery at all. And it's not only the results provided by SRI; the work presented this morning used senone models and exactly the same idea, and those results clearly show the superiority of the approach. We probably did something suboptimal, and that's why we didn't manage to get the desired results.
As extensions, obviously convolutional neural nets might be useful. And there is also another idea that we used for ASR, where what we did was to augment the input layer of the DNN by appending a typical, regular i-vector. We did that for broadcast news, in order to perform some sort of speaker adaptation, and we presented it at ICASSP. It does help; you get about one point five to two percent improvement, and that's not relative, sorry, absolute improvement, which is very good for ASR. So you can maybe imagine an architecture where you extract a regular i-vector and feed it to the DNN in order to extract a DNN-based i-vector; you can imagine all sorts of things like that.
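The input augmentation itself is simple; a hypothetical sketch of the idea, with the same utterance-level i-vector appended to every frame:

```python
import numpy as np

def augment_input(frames, ivector):
    """Hypothetical i-vector-based speaker adaptation of a DNN input.

    frames:  (T, D) spliced acoustic input window.
    ivector: (K,) utterance-level i-vector, tiled across all frames
             before entering the first hidden layer.
    """
    tiled = np.tile(ivector, (frames.shape[0], 1))   # (T, K)
    return np.concatenate([frames, tiled], axis=1)   # (T, D + K)
```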
So that's all. Thanks a lot.
Thank you. We have time for some questions.
I didn't quite catch, when you talked about scaling down the number of counts: are you talking about scaling it down in the PLDA score? I mean, you don't score by the book, or, I don't know.
No. First of all, I'm training the PLDA model by doing this trick; that's crucial, to train the model like that. Then I'm doing averaging, but in the scoring I treat the single averaged utterance as being one over three or one over two utterances.
Okay, so you whiten the variances when you train, but then you also add uncertainty in the scoring?
Yes; if you write down the LLR score, you can clearly see where you need to multiply by the scaling factor.
Thanks. I'll just mention a few things, since it's the sort of thing the community would like to see: it would be quite useful to work out what the differences are, because it feels like there is some key ingredient somewhere, and all the teams are going to try this and may stumble into the same issues. So, some of the things that have popped up a lot at this conference: as you mentioned, the low number of filter banks, twenty-three instead of forty, I believe you said; that might be one reason. We also worked out that we were not applying VTLN before training the DNN; we do that sort of thing for ASR, but not for the DNN, so that's another factor. And also removing the silence frames during the accumulator generation; there are a number of things there. And it's good that other people have also been able to make it work as well, so we know it's something positive; it's moving in the right direction.
One of the other things I wanted to mention was... let me think, I'm drawing a blank right now... ah, that's right: we were talking about ASR performance. One of the things that people said was, you know, this configuration works really well for ASR, so why should we change it? And what we've seen so far is that the performance on the ASR side of things doesn't necessarily reflect how suitable the system is for the speaker ID task. So if you're struggling, don't try to use your ASR paradigm or whatever system you have straight up; perhaps go back to whatever was published in the configurations and just start from scratch, and see if that works better. And certainly don't be afraid to contact any of the teams, you know, working on this; we're all happy to address the issues.
That makes sense, because in ASR, once you exploit the posteriors, there is a language model downstream that can smooth some of the errors, whereas we don't have that when we are extracting posteriors for speaker recognition. So that might be an indication of why better ASR results do not necessarily translate into better results for speaker recognition.
Are you implying, Mitch, that you guys turned off VTLN specifically because you were going to use the system for speaker ID, or was that something that was already the way you did ASR?

I'd just started working on this and I actually didn't do the DNN training myself; it was being done beforehand, and in the configuration in Kaldi we had it switched off. I asked, you know, should we not be doing this, and I can't actually recall whether the answer was that it doesn't help or that it doesn't make much difference. That's just one thing where our configuration may differ from what other teams are doing, and one thing we anticipated might have an impact, since it's removing speaker discriminability by warping.
You seem to have very good results overall. Now, convolutional nets have been around for twenty years, right? I mean, Yann LeCun was working on them back then. How come they are on the rise only now? And the second question is about recurrent nets, which are also useful: what's the story? Why does this happen now, twenty years later?
Sure. To answer the question, I guess a major factor is the fact that we are now using much longer windows as input. And of course the fact that we have the processing power now: it took us a month, maybe less, to train the big system, of course using GPUs, and of course there is some optimization that needs to be done in terms of engineering. But it takes a lot of time to process all the data that is required to train robust models, and that maybe wasn't feasible during the eighties. That, I believe, is definitely the main reason why the community failed to show during that era that those discriminative models are powerful enough to compete with the GMM approaches.

Thank you.