So, hi everyone. I'm going to talk about a very similar approach to what Mitchell described before, at least for the speaker recognition part, so it won't be anything new.
This is the outline, more or less. I'm going to describe a little bit the use of DNNs in speech and now in speaker recognition, and how to extract Baum-Welch statistics; I'll do that a bit more analytically than Mitchell did. Then some DNN and PLDA configurations, and some experiments on Switchboard and the NIST 2012 evaluation.
So, a little bit about the limitations of UBM-based speaker recognition so far. The short-term spectral information that we have traditionally been using as front-end features in speaker recognition works fine in some senses, but not in others. To be more specific, our experience is that when the alignment is phonetically informed, say when I compare the way two speakers pronounce the same phonetic content, you will be able to discriminate between the speakers a little bit more effectively than when you compare them on arbitrary content. And the problem is that the current, traditional UBM-based speaker recognition systems don't capture this information, because they are not phonetically aware: the classes that we define by training a UBM in an unsupervised way, segmenting, let's say, the input space using the features themselves, and that we then use to extract Baum-Welch statistics, do not have the phonetic awareness that is needed.
So the challenge here is to use DNNs, which we know are now capable of drastically improving the performance of ASR systems, and capture this idiosyncratic way in which a speaker pronounces each unit. As we said, the units, which are actually the same as in ASR, are tied triphone states, the senones. With such units for ASR, the reports show something like thirty percent relative improvement in terms of word error rate compared to GMMs.
The DNNs have several hidden layers, five or six, and the tied triphone states as outputs. They are discriminative classifiers, yet we can combine them with HMMs using the trick of turning posteriors back into likelihoods by subtracting the prior in the log domain, and then we can plug them into the HMM framework.
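To make that trick concrete, here is a minimal sketch, assuming NumPy arrays for the DNN outputs and priors estimated from the training alignments; the function name is illustrative, not from the talk:

```python
import numpy as np

def posteriors_to_scaled_loglikes(log_posteriors, state_priors):
    """Convert DNN state posteriors into scaled log-likelihoods.

    log_posteriors: (T, S) array of log p(state | frame) from the DNN.
    state_priors:   (S,) array of p(state), e.g. relative frequencies
                    of the senone labels in the training alignments.

    By Bayes' rule, p(x|s) = p(s|x) * p(x) / p(s); since p(x) is
    constant per frame, subtracting log p(s) gives a likelihood
    scaled by p(x), which is all the HMM decoder needs.
    """
    return log_posteriors - np.log(state_priors)[None, :]
```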
Initially, people used to initialize these networks with stacked restricted Boltzmann machines. It has been shown that this is no longer needed, but you might imagine cases, domains or languages, where not enough labeled data is available; you might have very little labeled data but a lot of unlabeled data. In such cases, let's not exclude the possibility of using this stacked architecture of RBMs to initialize the DNN more robustly.
And I think the key difference is the capacity of handling longer segments as inputs, something like three hundred milliseconds, in order to capture the temporal information. This is the reference, by the way, a little bit old by now, from two of the pioneers.
So the UBM approach, described more or less schematically, goes like this: you start by training a UBM using the EM algorithm, and then for each new utterance you extract the so-called zero-order and first-order statistics. Then you use your UBM again in order to prewhiten your Baum-Welch statistics component-wise, because that's what you are doing, effectively.
In the DNN-based approach, what we replace is the posterior probability of each frame belonging to each component; that's the only difference. So in this gamma_t(c), where t is the frame index and c is the component, the posterior is the only thing that changes. That means we don't have to change our algorithms at all; we just have to have the DNN output these posteriors, and that's all. No need to change the code, of course.
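As a sketch of how little changes, here is the standard statistics accumulation in NumPy; the only decision is where the responsibilities gamma come from (names are illustrative):

```python
import numpy as np

def baum_welch_stats(gamma, feats):
    """Zero- and first-order Baum-Welch statistics.

    gamma: (T, C) responsibilities gamma_t(c) -- from UBM posteriors
           in the classical recipe, or from DNN senone posteriors in
           the scheme described here; nothing else changes.
    feats: (T, D) acoustic features used for i-vector extraction.

    Returns N: (C,) zero-order stats, F: (C, D) first-order stats.
    """
    N = gamma.sum(axis=0)   # N_c = sum_t gamma_t(c)
    F = gamma.T @ feats     # F_c = sum_t gamma_t(c) * x_t
    return N, F
```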
So a UBM is still needed, but practically only for the last step, to prewhiten the Baum-Welch statistics before feeding them either to an i-vector extractor or maybe to JFA. And of course EM is not required to train this UBM, because the posteriors actually come from the DNN, so there is no need for the E-step; just an M-step, a single one or a few, will be sufficient.
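A minimal sketch of that single M-step, assuming fixed DNN posteriors and diagonal covariances (again illustrative code, not the actual implementation):

```python
import numpy as np

def m_step_ubm(gamma, feats, floor=1e-6):
    """One M-step: a 'supervised' UBM from fixed DNN posteriors.

    Since the posteriors come from the DNN and never change, there is
    no E-step to iterate; a single M-step over the training data gives
    the component weights, means and (diagonal) covariances that are
    later used to prewhiten the Baum-Welch statistics.
    """
    N = gamma.sum(axis=0)               # (C,) zero-order stats
    F = gamma.T @ feats                 # (C, D) first-order stats
    S = gamma.T @ (feats ** 2)          # (C, D) second-order stats
    weights = N / N.sum()
    means = F / N[:, None]
    covs = S / N[:, None] - means ** 2  # diagonal covariances
    return weights, means, np.maximum(covs, floor)
```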
And it is interesting to note here that different features can be used for estimating the assignments of a frame to the senones, or what we used to call the components of the UBM, than those that you finally use for extracting i-vectors or whatever you are using. So you don't have to change anything there; you can have two parallel feature streams that are optimized for the two tasks, the ASR task and the speaker recognition task, as long, of course, as they have the same frame rate.
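Schematically, the decoupling looks like this; the extraction helpers are hypothetical, and baum_welch_stats is the accumulator sketched above:

```python
# Hypothetical sketch of the two parallel feature streams:
# ASR-style features decide the soft alignments, while the
# speaker-recognition features fill the statistics.
asr_feats = extract_filterbanks(wav)       # (T, 40) e.g. log mel filter banks
sid_feats = extract_mfcc(wav)              # (T, 23) e.g. MFCCs for the SID side
assert len(asr_feats) == len(sid_feats)    # the frame rates must match

gamma = dnn_senone_posteriors(asr_feats)   # (T, C) senone posteriors
N, F = baum_welch_stats(gamma, sid_feats)  # stats accumulated on SID features
```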
I'm not going to go too deep into that. This is the first DNN configuration we developed; it was inspired by the paper of Vesely et al., which was a very successful paper for ASR; we managed to reproduce the ASR results and to tune it a bit further, as you'll see next. And this is the SRI configuration. We had some results, and then SRI told us, "guys, we managed to obtain some amazing results with this", and it showed that the method was actually sound. So we tried this as well.
We tried the first configuration, but on Switchboard data, not on NIST. Then this was the configuration of Yun Lei and colleagues from SRI, and it's a little bit different: it uses TRAP-like features at the front end, which is maybe a better thing to do; it has a long span, thirty-one frames; and it uses log mel filter banks. They use forty of them, I think, while we used twenty-three, and that was, I guess, one of the reasons why the results we obtained are not that good. There are several reasons, as you would expect; there are a lot of free parameters that someone has to tune, as I'm going to show you next. So we had two configurations: the small one, practically the one for which we included results in the camera-ready paper, and the big configuration, which is closer to what SRI describe in their paper.
These are some ASR results we obtained. There you see, first of all, the comparison that is in Vesely's paper, just to show the dramatic improvement you can obtain by using DNNs instead of GMMs for the emission probabilities, and then the two configurations we developed, the green one mostly inspired by the work of Vesely and the other one by SRI.
Now let's go back to speaker recognition. A few words on PLDA, to tell you what flavour of PLDA we used. We found that for most of the cases the full-rank V, V being the speaker space, worked better; we didn't of course try all configurations, but it worked better compared to, say, rank one hundred and twenty. For example, in this system, before length normalization we apply WCCN instead of prewhitening; in most of the cases that again worked very well, much better than prewhitening. And about this dilemma of whether you should average i-vectors after or before length normalization: I think you should average before and after length normalization, because that's more consistent with the way you train the PLDA model, and in our case it made a lot of difference.
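A minimal sketch of that averaging recipe, assuming the i-vectors of one speaker are stacked in a NumPy array:

```python
import numpy as np

def length_norm(w):
    """Project i-vectors onto the unit sphere (length normalization)."""
    return w / np.linalg.norm(w, axis=-1, keepdims=True)

def average_enrollment(ivectors):
    """Average multiple enrollment i-vectors, shape (n, K).

    Normalizing before averaging matches the way the PLDA model is
    trained on length-normalized vectors; normalizing again after
    averaging returns the result to the unit sphere.
    """
    return length_norm(length_norm(ivectors).mean(axis=0))
```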
Okay. So, these are the results from Switchboard with the first configuration. They're not that good; they're not even comparable to the ones you obtain with the baseline system. So we were rather disappointed at that stage, which was somewhere around Christmas. But once you fuse them, you get something that makes you say, yes, it's good. Note that in this case we are using a single enrollment utterance, the same for males and females, more or less.
Now let's go to NIST with the configuration, or what we thought was the configuration, of SRI. This is the small configuration. Now we see that, at least in the low false alarm area, we are making progress, though not much by fusing them; the fusion was not that good. And, by the way, I'm emphasizing conditions C2 and C5, although C5 is a subset, just to make sure that we cover both clean and noisy telephone data. And this is the big configuration, same picture; now we are comparing it to a 2048-component GMM, and it's more or less the same picture: you get some improvement in the low false alarm area, in some cases, but don't think it is that much. This is what we could get with the big configuration.
Now I'm going to talk a little bit about PLDA, because there was this issue of domain adaptation on the agenda; so let's focus a little bit on PLDA, just to share with you a result which I think is interesting. We know that when you apply length normalization you may attain results that are even better than heavy-tailed PLDA in some cases. The problem is that this transformation is somewhat sensitive to the datasets, so ideally it would be great to get rid of it. A possible alternative would be to scale down the number of recordings. What that means is that you pretend that instead of having N recordings you have N over three. We define the scaling factor arbitrarily, but one over three or one over two works fine in practice. And using that trick, all the evidence criteria behave; I mean, once you train the PLDA you get a strictly increasing evidence, which is good, and you somehow lose confidence, which is a good thing; it's okay to lose confidence in some cases. And to the question of whether we can get rid of length normalization this way, the answer is no, but we get rather close. A scale factor of one means practically no change.
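To see where the count enters, here is a sketch in a simplified Gaussian PLDA, with speaker loading matrix V and within-speaker covariance Sigma; this is my reconstruction of the trick, not code from the talk:

```python
import numpy as np

def speaker_posterior(V, Sigma_inv, ivecs, alpha=1.0):
    """Posterior of the speaker factor y given n i-vectors, under
    w = V y + eps, with y ~ N(0, I) and eps ~ N(0, Sigma).

    The standard result is precision = I + n * V' Sigma^-1 V, with the
    mean built from the summed i-vectors. The scaling trick replaces
    the count n by alpha * n (e.g. alpha = 1/3), which keeps the mean
    direction but widens the posterior: the model 'loses confidence'.
    """
    n = ivecs.shape[0]
    VtSi = V.T @ Sigma_inv
    prec = np.eye(V.shape[1]) + alpha * n * (VtSi @ V)
    # the summed statistics are scaled consistently with the count
    mean = np.linalg.solve(prec, alpha * (VtSi @ ivecs.sum(axis=0)))
    return mean, np.linalg.inv(prec)
```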
Here are some results with different scaling factors. All I'm doing is simply dividing the number of recordings, both when training and when evaluating the model, that is, multiplying the count by one over two or one over three. I'm guessing that most of the gap between not doing length normalization and doing length normalization is somehow bridged by this trick. So maybe the people that are working on domain adaptation, where length normalization is a problem, can use this as an alternative to length normalization, and they can tell me if they find something interesting.
So, conclusions. The use of state-of-the-art DNNs can definitely replace the traditional GMM in a UBM-based system, and the good thing is that once the Baum-Welch statistics are extracted, exactly the same machinery can be applied; no need to change the common machinery at all. And it's not only the results provided by SRI; the work presented this morning used senone models and exactly the same idea, and those results clearly show the superiority of the approach. We probably did something suboptimal, and that's why we didn't manage to get the desired results.
As extensions, obviously convolutional neural nets might be useful. And there is also another idea that we used for ASR, where what we did was to augment the input layer of the DNN by appending a typical, regular i-vector. We did that for broadcast news, in order to perform some sort of speaker adaptation, and we presented it at ICASSP. It does help; you get about one point five to two percent improvement, and that's not relative, sorry, absolute improvement, which is very good for ASR. So you can maybe imagine an architecture where you extract a regular i-vector and feed it to the DNN in order to extract a DNN-based i-vector; you can imagine all sorts of things like that.
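The input augmentation itself is simple; a hypothetical sketch of the idea, with the same utterance-level i-vector appended to every frame:

```python
import numpy as np

def augment_input(frames, ivector):
    """Hypothetical i-vector-based speaker adaptation of a DNN input.

    frames:  (T, D) spliced acoustic input window.
    ivector: (K,) utterance-level i-vector, tiled across all frames
             before entering the first hidden layer.
    """
    tiled = np.tile(ivector, (frames.shape[0], 1))   # (T, K)
    return np.concatenate([frames, tiled], axis=1)   # (T, D + K)
```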
So that's all. Thanks a lot.
Thank you. We have time for some questions.
I didn't quite catch, when you talked about scaling down the number of counts: are you talking about scaling it down in the PLDA score? I mean, you don't score by the book, or, I don't know.
No. First of all, I'm training the PLDA model by doing this trick; that's crucial, to train the model like that. Then I'm doing averaging, but in the scoring I treat the single averaged utterance as being one over three or one over two utterances.
Okay, so you whiten the variances when you train, but then you also add uncertainty in the scoring?
Yes; if you write down the LLR score, you can clearly see where you need to multiply by the scaling factor.
Thanks. I'll just mention a few things, since it's the sort of thing the community would like to see: it would be quite useful to work out what the differences are, because it feels like there is some key ingredient somewhere, and all the teams are going to try this and may stumble into the same issues. So, some of the things that have popped up a lot at this conference: as you mentioned, the low number of filter banks, twenty-three instead of forty, I believe you said; that might be one reason. We also worked out that we were not applying VTLN before training the DNN; we do that sort of thing for ASR, but not for the DNN, so that's another factor. And also removing the silence frames during the accumulator generation; there are a number of things there. And it's good that other people have also been able to make it work as well, so we know it's something positive; it's moving in the right direction.
One of the other things I wanted to mention was... let me think, I'm drawing a blank right now... ah, that's right: we were talking about ASR performance. One of the things that people said was, you know, this configuration works really well for ASR, so why should we change it? And what we've seen so far is that the performance on the ASR side of things doesn't necessarily reflect how suitable the system is for the speaker ID task. So if you're struggling, don't try to use your ASR paradigm or whatever system you have straight up; perhaps go back to whatever was published in the configurations and just start from scratch, and see if that works better. And certainly don't be afraid to contact any of the teams, you know, working on this; we're all happy to address the issues.
That makes sense, because in ASR, once you exploit the posteriors, there is a language model downstream that can smooth some of the errors, whereas we don't have that when we are extracting posteriors for speaker recognition. So that might be an indication of why better ASR results do not necessarily translate into better results for speaker recognition.
Are you implying, Mitch, that you guys turned off VTLN specifically because you were going to use the system for speaker ID, or was that something that was already the way you did ASR?

I'd just started working on this and I actually didn't do the DNN training myself; it was being done beforehand, and in the configuration in Kaldi we had it switched off. I asked, you know, should we not be doing this, and I can't actually recall whether the answer was that it doesn't help or that it doesn't make much difference. That's just one thing where our configuration may differ from what other teams are doing, and one thing we anticipated might have an impact, since it's removing speaker discriminability by warping.
You seem to have very good results overall. Now, convolutional nets have been around for twenty years, right? I mean, Yann LeCun was working on them back then. How come they are on the rise only now? And the second question is about recurrent nets, which are also useful: what's the story? Why does this happen now, twenty years later?
Sure. To answer the question, I guess a major factor is the fact that we are now using much longer windows as input. And of course the fact that we have the processing power now: it took us a month, maybe less, to train the big system, of course using GPUs, and of course there is some optimization that needs to be done in terms of engineering. But it takes a lot of time to process all the data that is required to train robust models, and that maybe wasn't feasible during the eighties. That, I believe, is definitely the main reason why the community failed to show during that era that those discriminative models are powerful enough to compete with the GMM approaches.

Thank you.