So in this work we basically test and evaluate our LSTM-based language recognition system in different scenarios. This is a system that we had already presented, and the objective is to test how it is affected by different scenarios.
So first, the motivation: why we started using this architecture and how we started using it. Then we will do a brief overview of the LSTM; probably you are already quite aware of this, but I guess it's nice to have a little refresher. Then we will go over all the details of the experiments: we will detail the system description, the reference i-vector system that we will compare our proposed system against, the different scenarios we are going to test it on, and the results. And finally we will conclude the work.
So, we all know what LID is already: the process of automatically identifying the language of a given spoken utterance. Typically, for many years, this has been done relying on acoustic models, so these systems basically have two stages: first some i-vector extraction, and then some classification stage.
In the last years we are seeing a really strong new line, which is deep neural networks, and it can be more or less divided into three different approaches. One is the end-to-end systems; we have seen that it is a very nice solution, but we are not achieving the best results with it so far. Then we have the bottleneck features, where after computing the bottlenecks we go to the i-vector extraction and we keep the full pipeline. And then we have the senone-based approach. Sorry for the typo there.
In this paper we want to focus on the end-to-end approach; we want to improve the end-to-end approach for this task. This would be a very standard DNN for language recognition when we try to use an end-to-end approach: basically we have some features as input, then we have one or several hidden layers with some nonlinearity, and in the last layer we try to compute the probability of each of the languages we are going to test; for this we use a softmax, which gives us probabilities.
One of the main drawbacks of this system is that we need some context: if we try to get an output frame by frame, we are not going to get a good result, so this system relies on stacking several acoustic frames in order to add the time context. And that has problems: for one, we have a fixed window length that probably will not work best for all the different conditions, and it is a rather brute-force way of modeling time.
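As a minimal sketch of what this frame stacking looks like (the window of plus and minus ten frames is just an illustrative assumption, not a value from the talk):

```python
import numpy as np

def stack_frames(feats, left=10, right=10):
    """Splice each frame with `left` previous and `right` following frames.

    feats: (T, D) array of acoustic features (e.g. MFCCs).
    Returns a (T, (left + right + 1) * D) array of stacked frames;
    edges are handled by repeating the first/last frame.
    """
    T, _ = feats.shape
    padded = np.concatenate([np.repeat(feats[:1], left, axis=0),
                             feats,
                             np.repeat(feats[-1:], right, axis=0)])
    return np.concatenate([padded[t:t + T] for t in range(left + right + 1)],
                          axis=1)
```

The fixed window, here 21 frames, is exactly the limitation mentioned above: it has to be chosen in advance and stays the same for all conditions.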
So how can we model this in a better way? The theoretical answer is recurrent neural networks: basically we have the same structure as before, but this time we have recurrent connections; all the rest is the same. What is the problem with these ones? We have the vanishing gradient problem. In theory it is a very nice model, but when we try to train these networks, because of these recurrent connections we end up having all the weights going either to zero or to something really high. There are ways to avoid this, but usually it is very tricky; it depends a lot on the task and on the data, so it is not really practical.
And here is where the LSTM comes in. Basically, for the LSTM we take first a standard DNN and we replace all the hidden nodes with this LSTM block that we have here.
So let's go through the theory of this block. It seems kind of scary when you first see it, but it is pretty simple after you look at it for a while. We have a flow of information that goes from the bottom to the top, and as in any standard neuron we have a nonlinear function, this one here. The special thing about the LSTM is that it has a memory cell, this one here. All the other stuff that we have there are three different gates; what they do is they let, or they don't let, the information go through. Here we have the input gate: if it is activated, we will let the input at a given time step go forward; if it is not, it won't. We have the forget gate: what it does is basically reset the memory, so if it is deactivated it will reset the cell to zero; otherwise it will keep the state of the previous time step. And the output gate: that gate will let the computed output here go to the rest of the network, or not.
And then what we have, of course, are recurrent connections, so the output at one time step goes into the input at the next time step. So it is basically trying to mimic the RNN model, but with these gates we avoid that problem, because the gates act not only on the current input but also on the recurrence in time. So when we are doing the backpropagation and we have some error that is going to modify the weights, the forget gate or the input gate will also block that error from propagating back many time steps, so we avoid the vanishing gradient problem.
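As a minimal numpy sketch of one forward step of such an LSTM block (the standard formulation with sigmoid gates and tanh nonlinearities; the weight names are hypothetical and the peephole terms are omitted for brevity):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One time step of a standard LSTM block.

    x: input frame; h_prev, c_prev: previous output and cell state.
    W, U, b: dicts of input weights, recurrent weights and biases for
    the gates 'i' (input), 'f' (forget), 'o' (output) and the cell
    candidate 'g'.
    """
    i = sigmoid(W['i'] @ x + U['i'] @ h_prev + b['i'])  # input gate: let the input in or not
    f = sigmoid(W['f'] @ x + U['f'] @ h_prev + b['f'])  # forget gate: keep or reset the memory
    o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])  # output gate: expose the cell or not
    g = np.tanh(W['g'] @ x + U['g'] @ h_prev + b['g'])  # candidate cell update
    c = f * c_prev + i * g  # new memory: gated mix of old state and new input
    h = o * np.tanh(c)      # output passed to the rest of the network
    return h, c
```

The additive update `c = f * c_prev + i * g` is what lets the error flow back through time without vanishing: the gates decide where the gradient is allowed to pass.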
The system that we used for language recognition, then, does not rely on stacking acoustic frames: we feed only one frame at a time. We will have one or two hidden layers, and the layers will be unidirectional LSTMs. We also add the peephole connections that we have here, which basically allow the network to make decisions depending on the state of the cell, so they are supposed to improve the performance of the memory cell. At the output we will use a softmax, just like in the DNN, with a cross-entropy error function.
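A minimal sketch of this topology in Keras (my illustrative reconstruction, not the authors' actual code; note that the stock Keras LSTM layer omits the peephole connections mentioned above, and the feature dimension, number of languages, and layer size of 512 are placeholder assumptions):

```python
import tensorflow as tf

n_feats, n_langs = 39, 8  # placeholder feature and language counts

# One frame at a time over the whole sequence, softmax output per frame.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, n_feats)),                 # variable-length utterances
    tf.keras.layers.LSTM(512, return_sequences=True),      # unidirectional LSTM layer
    tf.keras.layers.LSTM(512, return_sequences=True),      # optional second hidden layer
    tf.keras.layers.Dense(n_langs, activation="softmax"),  # per-frame language posteriors
])
model.compile(optimizer="sgd", loss="categorical_crossentropy")
```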
For training, what we do is the following. In the first scenario we will have a very balanced, nice dataset, so we don't need to do anything special. But in the more difficult scenarios we will have some unbalanced data, so what we do in order to avoid problems with the unbalanced data is just oversampling: we take random crops of two seconds so that we have six hours of each language in every iteration, and the data that each iteration sees is different.
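A sketch of how that oversampling could look (the two-second crops and the six-hour quota come from the talk; the function and variable names, and the 10 ms frame shift, are my assumptions):

```python
import random

FRAMES_PER_SEC = 100               # assuming a 10 ms frame shift
CROP = 2 * FRAMES_PER_SEC          # random crops of two seconds
QUOTA = 6 * 3600 * FRAMES_PER_SEC  # six hours per language per iteration

def sample_iteration(utts_by_lang):
    """Draw a fresh balanced training set: random 2 s crops until every
    language reaches the same quota of frames."""
    batch = []
    for lang, utts in utts_by_lang.items():
        drawn = 0
        while drawn < QUOTA:
            feats = random.choice(utts)  # pick a random utterance of this language
            if len(feats) <= CROP:
                crop = feats             # short utterance: take it whole
            else:
                start = random.randrange(len(feats) - CROP)
                crop = feats[start:start + CROP]  # random 2-second window
            batch.append((crop, lang))
            drawn += len(crop)
    return batch
```

Because the crops are redrawn every time, each iteration sees a different balanced sample even for the languages with little data.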
Then, to compute the final score of an utterance, we will average the softmax outputs, but taking into account only the last ten percent of the scores; I will explain why in more detail later.
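A minimal sketch of this scoring rule, assuming the per-frame softmax outputs come as a `(T, n_langs)` matrix:

```python
import numpy as np

def utterance_score(posteriors):
    """Average the per-frame softmax outputs over the last 10% of frames.

    posteriors: (T, n_langs) array with one softmax vector per frame.
    Returns an (n_langs,) utterance-level score vector.
    """
    T = posteriors.shape[0]
    start = min(int(0.9 * T), T - 1)  # keep at least the very last frame
    return posteriors[start:].mean(axis=0)
```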
And then, finally, we will use a multiclass linear logistic regression for calibration; we keep it simple.
We will compare the system to a reference i-vector system. It is a very straightforward one, using MFCC-SDC features, exactly the same features that we use for the LSTM. We use one thousand twenty-four Gaussian components for the UBM, and the i-vectors are of size four hundred. It is based on cosine distance scoring: we tried it with and without LDA and, depending on how many languages we have, one or the other was working better, so that is why we decided to take plain cosine distance scoring. If we had more languages it would be better to use LDA, but the difference was small enough not to matter too much here. It is a standard implementation, and it is always trained with exactly the same data as the LSTM.
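A sketch of the cosine distance scoring (assuming, as an illustration, one model i-vector per language, for example the average of that language's training i-vectors):

```python
import numpy as np

def cosine_scores(test_ivec, lang_ivecs):
    """Score a test i-vector against one model i-vector per language.

    test_ivec: (D,) array (D = 400 here); lang_ivecs: (n_langs, D) array.
    Returns the (n_langs,) vector of cosine similarities.
    """
    t = test_ivec / np.linalg.norm(test_ivec)
    m = lang_ivecs / np.linalg.norm(lang_ivecs, axis=1, keepdims=True)
    return m @ t
```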
So these are the three scenarios on which we are going to compare and test these networks. The first one is a subset of the NIST 2009 Language Recognition Evaluation; the data that we use comes from the three-second task. This is a subset prepared so that the LSTM will work at its best, so it is a rather easy subset of the 2009 evaluation. What we did first: the evaluation has an imbalanced mix of CTS and Voice of America data, so we dropped all the CTS data; with that we avoid the imbalanced mix, and we also avoid a mismatch in training, since we have only one dataset.
For the languages, we also wanted to have a high amount of data, so we took only those languages that had at least two hundred or more hours. And we didn't want to have unbalanced data either, so we cut the datasets so that all of them have two hundred hours available for training. The subset of languages we have here is not hand-picked; it is not the most difficult one, as we said before; it is just those languages that happened to have these two hundred hours of Voice of America data. And we use only the three-second task because, historically, we saw that short utterances are where the neural networks outperform the i-vector the most, so we wanted to be in that scenario.
The second scenario that we want to test is the dev set of the NIST Language Recognition Evaluation 2015. Here we don't avoid any of the difficulties, so we have a mix of CTS and broadcast narrowband speech, and we keep everything. We have seen the details of this before: it is twenty languages grouped in six clusters according to similarity, so it is supposed to be more challenging, because the languages are closer within a cluster. The amount of training data is also going to be quite heterogeneous: we have some languages with less than an hour and some languages with more than a hundred hours. The split that we did is eighty-five percent for training and fifteen percent for testing. That is something we would not do again if we ran the experiments again; this is what we did at the time, before the eval set and everything, and we thought it would be nice to have more data for training. But afterwards we ran some experiments and found that having a little less training data but more dev data would have helped. Still, we keep exactly what we used in the evaluation. And as the test, what we did with that fifteen percent is take chunks of three seconds, ten seconds, and thirty seconds, to mimic a little bit the conditions of the LRE 2015 evaluation.
And then the third scenario will be the test set of the NIST Language Recognition Evaluation 2015, which covers a broad range of speech durations (it is not fixed bins anymore) and has big mismatches between training and evaluation, as we saw before.
So, the results. The first one is kind of a side result; it is not that important, but since we are using a unidirectional LSTM, the output at a given time step depends not only on the input at that time step but also on all the previous inputs. So the last output is always more reliable than the ones before. And we thought that maybe we were hurting the performance by taking the first outputs, which are less reliable, so we just started dropping the first outputs and seeing how that affected the performance. For this analysis we don't really care about the absolute numbers we have here, only about how the performance improves; the absolute equal error rate doesn't matter, only the relative difference. And we found that taking into account only the last ten percent is a near-optimal point. We also saw that taking into account only the very last score, only one output of the softmax, was as good as taking the last ten percent, but we kept the average over the last ten percent.
So these are the results on the first scenario. Remember that this is the one with only Voice of America data, eight languages, two hundred hours per language for training. What we have here are the different architectures that we used: we have both one hidden layer and two hidden layers, and then different sizes of the hidden layer, from the smallest with two hundred fifty-six units to the biggest with one thousand twenty-four. This column is the size, in terms of number of parameters, of all the models, and these are the results that we obtained. The reference i-vector system has almost seventeen percent equal error rate and a Cavg of point sixty-nine, and we see that pretty much all the LSTM approaches clearly outperform it, many of them with a much smaller number of parameters. So those are really good results, but we are in this balanced, easy scenario. As we can see, the best system gets around fifteen percent better equal error rate with around eighty-five percent fewer parameters. We also wanted to check how complementary the information extracted by the LSTM and the i-vector was, so we fused the best LSTM system with the reference i-vector system, and the result improves even more, to twelve percent, which is about fifteen percent better than the best single system. This is the confusion matrix; it does not carry much information, but we can see, not only in terms of accuracy but also in the confusions with the other languages, how the system performed on this subset.
These are the results on the dev set of the Language Recognition Evaluation 2015. For this one we did not experiment with different architectures; we were in a bit of a hurry, so we used only the best system from the previous scenario, which was two layers of size five hundred twelve. What we can see here is that the LSTM performs much better than the i-vector on three seconds, while on thirty seconds, in this scenario where we have these mismatches between the databases and the imbalance in the datasets, this end-to-end system is not that good. So we still see the pattern from before: this end-to-end approach is able to extract more information from short utterances, but not that much from longer ones. But one thing that we see here is that even though the results for longer utterances are way worse than those of the i-vector, the fusion is pretty much always better than any of the single systems. So even when the LSTM is working worse than the i-vector, we are able to extract different information that helps in the final fused system. So we were also quite happy with these results.
This is the DET curve that we have for three seconds, where we can see that the LSTM achieves over a twenty percent relative improvement over the i-vector, and we also see that the fusion always works better than any of the single systems.
And now we go to the results on the test set of the Language Recognition Evaluation 2015, and here things get much harder. First of all: the first column is the LSTM, the second column is the i-vector, and the third one is the fusion of both, the non-cheating one, the one we used for the submission. The fourth one is exactly the same but using a cheating fusion: we use a two-fold cross-validation, so we use one half of the test set for training the fusion applied to the other half. Of course that is not allowed in the evaluation, but we wanted to know whether the systems were learning complementary information or not; maybe we were just not fusing in a good way, so this lets us extract how much complementary information there is.
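A sketch of that two-fold cheating fusion with a multiclass linear logistic regression backend (scikit-learn is used here as an illustrative stand-in for whatever fusion tool was actually used, and it assumes every language appears in each half of the test set):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def cheating_fusion(scores_lstm, scores_ivec, labels):
    """Two-fold cross-validated fusion over the test set itself.

    scores_*: (N, n_langs) score matrices from each system; labels: (N,).
    Each half of the test set trains the fusion applied to the other half.
    """
    X = np.hstack([scores_lstm, scores_ivec])  # stack the two systems' scores
    fused = np.zeros_like(scores_lstm, dtype=float)
    half = len(X) // 2
    folds = [(slice(0, half), slice(half, None)),
             (slice(half, None), slice(0, half))]
    for train, test in folds:
        clf = LogisticRegression(max_iter=1000).fit(X[train], labels[train])
        fused[test] = clf.predict_proba(X[test])
    return fused
```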
So the message to take from here is that, first of all, end-to-end learning in this very hard scenario is able to get results that are comparable with the i-vector, but it gets relatively worse as the duration increases, because the i-vector is able to extract better results from longer utterances while the LSTM stays at the same performance. The good thing is that we do not have such a big mismatch that we cannot do a good fusion: even on the unseen test set, we can still improve on the performance of the i-vector with the fusion.
So, to conclude the work, the main take-away messages are these. First of all, on a controlled, balanced scenario we have very promising results: it is a very simple system with eighty-five percent fewer parameters, and it is able to get a fifteen percent relative improvement. The problem is that once it goes to an imbalanced, more realistic scenario, the results are not as good. And finally, we saw that under strong mismatches and harder scenarios we are not able to extract the information as well, so there is a need for variability compensation. But we still think that it is a really promising approach that leads to simpler systems that can get quite good results.
Okay, time for questions.
Just a small comment: you say that you are averaging the outputs over the last ten percent of frames. Are you always using ten percent? For a three-second test or a thirty-second test, you always use ten percent? Did you try to just average over the thirty last frames, independently of the duration of the test utterance?
We did, actually; not for this analysis, but for the LRE ones we tried a lot of things, not only averaging: also taking the median, or selecting outputs based on their values, or just dropping the ones that are outliers. And we found that it did not really make a noticeable difference there. But maybe in a more challenging scenario it would be worthwhile; we have not tried that yet.
Is it possible to go back to slide twenty-four for a second? Sorry. I noticed you are almost always getting an improvement with the LSTMs versus the i-vector, but when you look at the English cluster, when you went to the fusion, the fusion actually did worse than the i-vector system, three point two, while the i-vector had one point nine. That is the only one where you did not get an improvement. Is there a reason why? Do you see why it happened, maybe why the LSTM actually had worse performance there?
So, I am not completely sure, but we have some idea of why that happened. The idea is that for training the systems, what we did is this oversampling, and in the English cluster there was one language, I think it was British English, that had only half an hour of data. Of course that hurt the LSTM, so it has worse results, but I think it also hurt the fusion. When you have one language with less data, for the DNN, for the LSTM, you can more or less fix it with oversampling. But for the fusion you usually need much less data in general, so in all the other clusters that is not a problem: even if they are imbalanced, for calibration you still have enough of all of them. But for the English one, I think what happened is that we did not have enough data for calibrating, for training the fusion. So I think that is the reason: the fusion is not well trained because we do not have enough data for one of the languages.
I have a question. I found it quite interesting that your LSTM has fewer parameters than the i-vector system, and I am wondering about the time complexity: how long does it take to train, and what is the test time, compared to the i-vector system?
The training time is much longer, because we had a lot of iterations. I think it is also because of the way we trained it: since we use a different subset per iteration, we need a lot of iterations. Actually, I think the numbers we have are not the best we could get, because this was done for an evaluation, so some of the networks were still improving when we had to stop them, and we report them as they were. So training time is longer, even though the model has much fewer parameters. But testing time is way faster, and of course, once you have the network trained you only need to do the forward pass, while with the i-vector, whenever you have new data you always have to extract the i-vector first, before doing the scoring.
Any more questions? So then there is lots of time for coffee, I guess; we will be back at five o'clock.