this is kind of the transition from the systems in the previous session into the dnns
we all could have presented in both, but i think this is a good transition 'cause we did have some new things that i want to talk about
this is work with my colleagues greg sell and daniel from johns hopkins, both of whom unfortunately were unable to get spousal permission to attend, but they have good excuses: greg's wife had their second child two weeks ago and daniel's is due in about two
so they have a reason
so i'm going to present an overview of the dnn i-vector system that we submitted to lre fifteen
i want to give a shout out here to nist for introducing this fixed training data condition, which actually allowed us to make a very competitive system with only three people, which is not very common in lre historically
algorithmically, i'll go into more detail, but we used dnns; unlike some of the previous presentations you've seen, we were able to get good performance not just with the bottleneck features but also with the dnn state labels, and i'll talk about that
we used three different kinds of i-vectors, and i'll explain that more: everyone had acoustic systems and those are very good; we were able to do quite well with the phonotactic i-vector system as well; and here we're trying, for the first time, a joint i-vector which does both things at once
because we had a fairly powerful system that we were comfortable with, and we didn't trust that we had enough development data, we used i think the simplest and most naive fusion of anybody, and it seemed to work for us, because we actually got a gain from fusion, which i think also made us one of the few
and that was just to sum the scores together and then scale them with the duration model that i'll talk about
and lastly, as i think has been mentioned, but i want to go into it a little bit more because this was a limited data task: data augmentation turned out to be very helpful for us
so in the talk i'll go through our basic i-vector system design, talk about the two ways that we use the dnns, which have both been touched on previously today, and talk about the alternate i-vectors we experimented with
i'll talk more specifically about the lre fifteen task, how we used the data, and what we learned later about how we could have used the data
and after that i'll talk about the results that we had in the submission, and some interesting things that we've learned since, both about what other systems could have done and also about how we could have done better with the systems that we used
so here's a block diagram of our lid system
it's a little i-vector system, and it can be split into two parts: the first uses the unlabeled data to do the ubm and the t matrix learning, and then the supervised part is basically the two-covariance model (within-class and across-class covariance), which is first used in lda to reduce the dimension, and then the same matrices are used for the gaussian scoring that follows
as we've done for a while, rather than having a separate back end to do the work, we do a discriminative refinement of these gaussian parameters to produce a system that not only performs a little bit better but also produces naturally calibrated scores
and we do that in a two-step process: first we learn a scale factor on the within-class covariance, and then we go into all the class means and adjust them to better provide the discriminative power; for that we're using the mmi algorithm from gmm training in a really simplified mode
and of course that's the same criterion as the multiclass cross entropy that everybody uses every day
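the gaussian scoring step above can be sketched in a few lines; this is a minimal numpy illustration of a shared within-class covariance classifier with a learnable scale factor, with made-up function names, and it omits the discriminative mmi refinement of the means:

```python
import numpy as np

def gaussian_backend_scores(ivecs, class_means, within_cov, scale=1.0):
    """Score i-vectors against class means under a shared within-class
    covariance; `scale` stands in for the discriminatively learned scale
    factor on the within-class covariance mentioned in the talk."""
    prec = np.linalg.inv(scale * within_cov)
    # log-likelihood of each i-vector under each class gaussian, up to a
    # class-independent constant (which cancels in the softmax below)
    return np.stack([
        -0.5 * np.einsum('nd,dk,nk->n', ivecs - m, prec, ivecs - m)
        for m in class_means
    ], axis=1)

def posteriors(scores):
    # multiclass softmax over languages = class posteriors
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

the multiclass cross entropy of these posteriors against the true labels is then the criterion the mmi refinement optimizes.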
so let me lay out and talk more about how we use the dnns; other people have mentioned it, but let me show some pictures so you can see better what we're doing: splitting up the normal use of the gmm to do the alignment and then compute the stats after that
we're splitting it out in two ways using the dnns: the first is simply to replace the mfccs with bottleneck features from the dnn, and we're just using a straightforward bottleneck, not anything fancy
and then the second system is a little bit more complicated: we use the dnn to generate the frame posteriors for the senones, the clustered states; those are used to label the data and do the alignment, and then you use the ubm after that
i didn't have time to draw a dnn, so this is daniel's best rendition of a probable dnn
a couple of things that are perhaps particular about our system, or about the kaldi way of doing things (which by the way we do highly recommend): it uses this p-norm, which is kind of like max pooling, so there is an expansion and a contraction made at each layer, and that's how the nonlinearity comes in
also, i think probably nobody mentions this these days, but we're not using fmllr, which i think is common, for our purposes
you can see we basically use the same architecture either for the senone posteriors or, when we introduce the bottleneck, the one that's just going to be the bottleneck gets a little linear layer before the one in the middle there
we have about nine thousand output states, so it is a pretty big ubm that we get out of this
and of course it's trained using switchboard one, 'cause that's what we were given for the fixed data condition
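as an aside, the p-norm nonlinearity mentioned above works roughly like this; a sketch of the kaldi-style group p-norm, with an illustrative group size:

```python
import numpy as np

def pnorm(x, group_size, p=2.0):
    """Kaldi-style p-norm nonlinearity: the layer expands to many units,
    then contracts by taking the p-norm over non-overlapping groups of
    `group_size` units; the pooling is where the nonlinearity comes from
    (a generalized max pooling; p -> inf recovers max of |x|)."""
    groups = x.reshape(-1, group_size)
    return (np.abs(groups) ** p).sum(axis=1) ** (1.0 / p)
```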
so let me talk about the i-vectors a little bit
the one that we're all familiar with i'm going to call the acoustic i-vector; this is based on a gaussian probability model, and i've put in a little parenthesis that this is given that the alignments are already known, otherwise it would be much more complicated
because of that it's a big gaussian supervector problem: there's a closed-form solution for the map estimate of the i-vector, and there's an em algorithm for the t matrix estimation
the second approach is the phonotactic one; i think it was mentioned that it's been used in a number of places before
i'll talk about the details of it later, but the key is that we can still have sort of a gaussian model for an i-vector, but the output of the latent model is now the weights of the gmm instead of the means
and those things are naturally count based, so we need a multinomial probability model, not a gaussian probability model
and the way we do that is to go from log space through the softmax into the probability domain
even though it's a fairly simple formula, unfortunately there's not a closed-form solution for the optimal i-vector, so it needs newton's method iteration
and similarly there's not an em algorithm for the t matrix that we know of yet, so there's an alternating maximization algorithm
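to make the newton iteration concrete, here is a small sketch in our own illustrative notation (not the paper's exact algorithm) of the map i-vector estimate under the multinomial weight model w = softmax(m + T x), with a standard normal prior on x:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def phonotactic_ivector(counts, T, m, n_iter=25):
    """MAP i-vector x for a multinomial weight model w = softmax(m + T @ x),
    given observed state `counts` and a standard normal prior on x.
    No closed form exists, so we run Newton's method on the posterior."""
    K, D = T.shape
    x = np.zeros(D)
    n = counts.sum()
    for _ in range(n_iter):
        w = softmax(m + T @ x)
        grad = T.T @ (counts - n * w) - x        # d/dx [loglik + log prior]
        # Hessian: multinomial-logit curvature plus the prior's -I
        H = -(T.T @ (n * (np.diag(w) - np.outer(w, w))) @ T + np.eye(D))
        x = x - np.linalg.solve(H, grad)         # Newton step
    return x
```

in the joint i-vector described later, the same iteration is initialized from the acoustic closed-form solution and the hessian is simplified by pretending the dimensions are independent (i.e. keeping only its diagonal).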
so we had presented this phonotactic subspace thing for lid before
and in the meantime we thought, okay, we have two systems, an acoustic and a phonotactic; how are we going to combine them?
actually, the first thing we tried was score fusion, and yes, we did that, and yes, that works
and then we got a little more ambitious about the two i-vector systems: they're doing the same thing, so why don't i stack the i-vectors together, get one big i-vector, and then run one i-vector system; does that work?
and yes, that works too
and then we thought about it some more and said, well, why do i want two independent i-vector extractors? why can't i make one latent variable that drives both models: the means of the latent gmm that generated the cut, and the weights of the gmm that generated the cut
the fact is, the math says that you can; i'll go into a little more detail, but basically this is a permutation of the subspace gmm that dan povey was talking about in two thousand eight, two thousand nine, at the clsp workshop and since
so there are algorithms for doing this; we had to manipulate them a little bit for our purposes
so, a couple of details on how to do this; we have some references in the paper
some things in particular that we're doing differently than if you took it straight out of the subspace gmm work
the first is that there everything was done with sort of ml estimates, so there was no prior and no backoff
obviously for the acoustic we don't want to use ml i-vectors, we want to use map i-vectors
we've actually shown previously that for a phonotactic system map is also beneficial, and if we're going to do it jointly it's critical that it be the same criterion for both things, because it is a joint optimization: map of the overall likelihood plus the prior
a nice trick we can do with this joint i-vector is, since there's a closed-form solution for the acoustic part, we can initialize newton's method with the acoustic solution and then just refine it using the phonotactic part as well
that gets us to a starting point pretty easily, where we can then do a greatly simplified newton's descent, in particular by pretending everything is independent of everything else, which is a huge speed improvement, because doing the full hessian in this update, as anybody who's ever looked at it knows, is pretty tedious
so once we do that, rather than being much slower than an acoustic i-vector system, it's essentially the same order; it's very simple
so that's the method
okay, the lre fifteen task, which has been discussed; i guess this isn't news here: there is telephone and broadcast narrowband speech, with twenty languages in six confusable clusters
but the limited training condition is a very important element of what we were able to get away with
and of course that means both that you have limited data for only twenty languages, but it also means that you can only train your supervised dnn on the switchboard english, because that's the only thing that had transcripts
which is not our favourite thing to do, it was kind of limiting, but it allows nist to exercise the technology
and because the languages didn't have much data, that was also key
so, all of our systems: basically, because we had a small team, we didn't build too much complicated stuff; i've described really everything that we did
we had two different ways of using the dnn, and we had three different kinds of i-vectors that we could have built out of each of the two dnn systems
out of that we could have done six things; i'll talk about a few that were interesting and the ones that we actually submitted
but everything used the same classifier
and as i mentioned, because the systems are already calibrated by this mmi process, we didn't have to use a complicated back end
the one thing we did introduce, because we knew there was this range of durations that had to be exercised, and i think the simplest way we could get there, was to reuse some work that we had done previously on making a duration-dependent backend, where there's a continuous function which maps duration into a scale factor between the raw score and the true log likelihood estimate that you're trying to make
and there's a justification for that function, but for our purposes the important thing is that it's very simply trainable, because it's just got two free parameters
so you can use the cross entropy criterion and figure out the best parameters
and then, because we have a very simple system, we just add all the scores together, assume that they were independent estimates, and then rescale the whole thing to bring it back in
and we found that to be helpful for us
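the fuse-then-scale step described above can be sketched like this; the two-parameter form a * d / (d + b) is an illustrative assumption on my part, not necessarily the exact function from the paper:

```python
import numpy as np

def fuse_and_scale(system_scores, duration, a, b):
    """Sum per-system scores (treated as independent estimates), then
    multiply by a duration-dependent scale factor.  The two free
    parameters a, b would be trained with the cross-entropy criterion;
    the functional form here is a hypothetical example."""
    fused = np.sum(system_scores, axis=0)    # simplest possible fusion
    scale = a * duration / (duration + b)    # continuous map: duration -> scale
    return scale * fused
```

the intuition is that short cuts get their scores shrunk toward zero (less confident log likelihoods), while long cuts keep nearly the full fused score.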
another thing about lre fifteen, which was mentioned, but for people who may be less familiar with the task and could interpret it incorrectly, is very important
so nist proposed this somewhat odd task of closed-set detection within each of the clusters
what we did is generate for each cluster a lid score, which means that each cluster had lid posteriors summing to one; and since there are six clusters, that means we gave nist scores from the six separately, which means that if nist wanted to evaluate across-cluster performance it was meaningless
and we had to convert these lid posteriors to detection log likelihood ratios, which is something we all know how to do here
but one thing i want to mention about our system is that we didn't do anything cluster specific anywhere; we just trained a twenty-language lid system and then just spun out the scores for each of the clusters, because that's what nist wanted
i think we would like that in the future to become a more generic lid task
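the posterior-to-llr conversion is standard; assuming a flat prior over the classes in a cluster, a minimal version looks like this:

```python
import numpy as np

def posteriors_to_detection_llr(post):
    """Convert closed-set class posteriors (summing to 1, flat prior
    assumed) into per-class detection log-likelihood ratios:
    llr_i = log p_i - log(average posterior of the other classes)."""
    post = np.asarray(post, dtype=float)
    n = post.shape[-1]
    rest = (1.0 - post) / (n - 1)   # mean posterior over the other classes
    return np.log(post) - np.log(rest)
```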
now, the other key element that i mentioned is the limited training data, so we had to figure out what to do with that
as i mentioned, we have the unsupervised and supervised parts; we took the theory, which was later proven not quite right, that we would use everything we could for the unsupervised data, which included switchboard, which is english only, and english was not one of the languages
it turns out we could have done better than that, and i'll talk about it
and then for the classifier design we did find it helpful to do augmentation and to do duration modeling of the cuts: we used all sides, we used segments that were duration-appropriate for the lid task, and we used augmentation to vary the limited clean data and try to give us more examples of what the i-vectors would look like
to go into the augmentation a little bit more: many of these are standard things, and the big thing in dnns now is to do augmentation, so sample rate perturbation and additive noise (we made a few kinds of additive noise, some maybe more interesting than others); we did throw in reverb; and multi-band compression, which is kind of a signal processing thing you might see applied to an audio signal
but the thing i want to mention, and the thing that we actually don't have in the slides, but you can look in the paper: the most effective single augmentation for us in this task was to run a gsm encoder-decoder against the data
which kind of makes sense as a thing to do
and to a former speech coding person it's fairly attractive
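as one concrete example of the additive-noise style of augmentation mentioned above, here is an illustrative sketch (not our actual pipeline) of mixing noise into clean speech at a chosen snr:

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Mix a noise signal into clean speech at a target SNR in dB.
    The noise is looped/trimmed to the speech length, then scaled so the
    resulting speech-to-noise power ratio matches `snr_db`."""
    noise = np.resize(noise, speech.shape)       # loop/trim to length
    p_s = np.mean(speech ** 2)
    p_n = np.mean(noise ** 2)
    gain = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10.0)))
    return speech + gain * noise
```

each clean cut would be passed through several such perturbations (noise, reverb, codec, rate change) to multiply the limited training data.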
so, our submission performance
these are the four things that we submitted; our primary was in fact the one at the bottom, which looks like it was a pretty good choice out of what was available to us
so we did a joint i-vector on the bottleneck features; as i'll show later we have something more stable, but i guess we were worried about the dimensionality in this submission
our senone-based system was actually slightly better than our bottleneck system, and again, that makes it the best sort of phonotactic system i think anybody saw, because everyone else found the bottlenecks to be the only really good thing to do
and fusion provided a gain, partly because we have simple fusion and partly because we have two systems which are pretty good
so we learned a couple of things post-eval that we found somewhat educational
the first one i won't go into in much detail, it's in the paper, but within the family of gaussian scoring there's a question of whether you count trials as independent or not, which in speaker recognition you typically do, since you only have one trial per enrolment
the choice we made in what we submitted, which we usually see as slightly better, turned out for this evaluation to be slightly worse
i have no idea why
the other thing that might be a little bit more interesting is the data usage
we spent quite a bit of time, even with the metadata, trying to decide what to do with the ubm and t
but the thing that turned out to work best we didn't try, because we thought it was a dumb idea: just use only the lid data, and only the full cuts, which, i forget exactly, but i think is only three or four thousand cuts or something
that ought to be nowhere near enough to train a t matrix, we thought, but it won out
so here again there are more numbers splitting things out
the first thing, which was kind of interesting for us: we went and ran this acoustic baseline, what we would have done with previous technology, and we are definitely better with all the stuff we have; i don't know if we're astoundingly better, but we're better
next we split out, with the senone system, the three different kinds of i-vectors, and the first thing is that the phonotactic system by itself is actually better than the acoustic system, which is what we had seen before
a linguist might argue about whether it's really a phonotactic system when it looks at the counts of frame posteriors, but that aside, i think it's the best performing phonotactic system that's out there for lid right now
and then you see also that the joint i-vector does give a noticeable gain over the acoustic
so that's good, and the fusion still works; let me just move on
so then, in conclusion: we were able to get pretty good performance in this evaluation with a small team and a relatively straightforward system
we think that there is still value in the senone count system; it doesn't have to be just bottlenecks, and we were able to show that
we think that the phonotactic and the joint i-vectors, the joint i-vector especially, are a nice simple way to capture that information, and that's one of the things that enables the senone system to be competitive
we think it is helpful to use a really simple fusion if you have this discriminatively trained classifier to start with
and we find that data augmentation can be a very valuable thing for the management of limited data
thank you
we have time for some questions
thank you for the talk; you propose tools able to model the counts; do you use the same tools, the classical gaussian tools, for those too?
yes, we always use the same gaussian classifier, no matter what kind of i-vectors
but the distribution is not gaussian?
no, the intention is that the i-vector should still live in a gaussian space; that's why we like this kind of subspace
there are other count subspace algorithms, like latent dirichlet allocation or non-negative matrix factorization (i think others have compared some of those, for example), where the subspace is in the linear probability space, and i don't think that would be well modeled by a gaussian; in fact i know it wouldn't be, i'm pretty comfortable with that, 'cause it's positive
but by going into the log space i think it really does become amenable to lda and those standard tools
i very much liked the additional processing that you're doing to augment the data; you had things like sample rate perturbation, speech coders, noise versions; if you had to go back again, which ones do you think actually would help?
i think you mean which augmentations; there is a table in the paper; many of them are helpful, but the speech coder is the most helpful on its own
so with the sample rate conversion, did you use really big variations?
we did things like plus or minus ten percent and plus or minus five percent, but i wouldn't say that's big
was there maybe a big difference between the cts and the broadcast news data, which would typically be downsampled?
we didn't break them apart
did you try other nonlinearities than just the p-norm?
we have since, a little bit
it seems like for this particular task the sigmoids that some other people use are a little bit better; i'm not sure we think that's a universal statement
excuse me, the sigmoids are better for training the bottlenecks; i think for the senones maybe not
so we have looked a little bit; there is more to explore
so if there are no more questions, we'll assume everybody here knows everything about language recognition
so, the same speaker again