Alright, thank you for this introduction, and I would also like to thank the organizers for inviting me to give this presentation, to present my latest work, and for bringing us to this wonderful location. It was an amazing week: the weather was very good, and the social events kept us busy, so I got my exercise as a good part of it. It was really enjoyable to spend a week talking to people, meeting new colleagues and exchanging ideas, and it was also a pleasure to see this winter version of the Basque Country. Hopefully we will come back to visit as tourists if we get the chance.
So today I will be presenting some of my latest work on using i-vectors, or a kind of i-vector, to model the hidden layers of a DNN and to see how the DNN spreads information across those hidden layers. Usually, the way we use DNNs now is either to look at the output of the DNN and use it to make decisions, or to take one of the hidden layers and use it as bottleneck features to do some classification. Unfortunately, not a lot of work, almost none, has been proposed to look at the whole path through the DNN. I believe there is information we are not exploring and not using in the DNN: the pattern of activations, how the information propagates through the network. That is what I am going to talk about today, and I will show some results.
So this is the outline of my talk. It starts with an introduction and then moves slowly toward my latest work. Before that I will briefly review i-vectors, which I probably don't need to do, because many of you know them, sometimes better than me. As you know, i-vectors are based on the GMM, so the first part will be about GMMs and how we use them: I will present the classical GMM mean adaptation and show two case studies, speaker recognition and language recognition. I am not going to tell you how to build your language or speaker recognition system; I just want to show that with i-vectors we can do some visualization and see some very interesting behavior in the data, such as how the channels and the recording conditions can affect a speaker recognition system if you don't do any channel compensation, and, for language recognition, how close the languages are to each other in a purely data-driven visualization.
Then I will move on to how we can use what I call discrete i-vectors to model GMM weight adaptation. That work started with Hasan, a PhD student of Hugo Van hamme in Belgium, who during his thesis visited me at MIT for six months, and we started working on GMM weight adaptation for language ID. After that, as DNNs were progressing and taking over the field, I started thinking that maybe these discrete i-vectors could also be used to model the posterior distributions of DNNs.
That is the second part: I started looking at how the DNN represents information in its hidden layers, because a lot of work in the vision community showed, for example, that a neuron in a network trained on YouTube videos ends up modeling a cat's face, or something like that. So can we do something similar for speech? That is how I started thinking about using an i-vector representation to model the hidden layers. I will show, for example, how the accuracy improves as we go deeper in the DNN for a language ID task, and also how we can model the pattern of activations, the way the information propagates through the DNN. So if you feel that one hour is too much to sit through and you would rather step out, you can skip the first part, the GMM part; the second part may be more interesting for you, and I will not be offended if you do. After that I will finish by giving some conclusions of the work.
As you know, i-vectors have been widely used; they are a nice, compact representation that summarizes and describes what is happening in a given recording. They have been used for different tasks: speaker recognition, language recognition, speaker diarization, even speech recognition. The original i-vectors were related to the GMM adaptation of the means. As I said, lately I have also been interested in GMM weight adaptation using i-vectors, and after that I moved on to use this to model DNN-based i-vectors, to model the DNN activations. So let me slowly take you toward my latest work.
In speech processing, you usually have a recording, and you transform it to get some features. Then, depending on the complexity of the feature distribution, you build a GMM on top of these features and train it to maximize the likelihood of the data. GMMs are defined by Gaussian components, and each Gaussian is described by a weight, a mean and a covariance matrix. Let me put the i-vectors in the context of speaker recognition, the way we were doing it in the early two-thousands, because that is how the concept started. We took a lot of non-target speakers and trained a large Gaussian mixture model. Then, because we sometimes do not have many recordings from the same speaker, we applied MAP adaptation: we dragged the universal background model, which is a kind of prior describing how all the sounds look, in the direction of the target speaker's speech. People found that adapting only the means is enough, so the mean shift from this universal background model, the large GMM trained on a lot of data, toward the target speaker can be characterized as something that happened in the recording that caused that shift. A lot of people started modeling this shift: for example, Patrick Kenny with joint factor analysis tried to split it into speaker and channel components in the GMM supervector space, and Campbell, for example, used the GMM supervector as input to an SVM to model the separation between speakers.
The i-vectors came out in the same spirit. In the i-vector approach, you have the GMM supervector space, and the UBM is one point in that space. When we have a new recording, we shift the UBM toward that recording. If you have several recordings, they all look different in this space, and the i-vector extractor models the variability between all these recordings in a low-dimensional subspace anchored at the UBM. Every new recording can be mapped into this subspace, and we can now represent each recording by a vector of fixed length. This can be modeled by one equation: the GMM supervector of each recording can be explained by the UBM supervector plus an offset, where the offset describes what happened in that recording and is given by the i-vector in the total variability space.
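Just to make that concrete, this is the classical total variability equation (standard notation, nothing specific to my slides):

    M(s) = m + T w(s),    with a standard normal prior  w(s) ~ N(0, I),

where M(s) is the GMM mean supervector of recording s, m is the UBM mean supervector, T is the low-rank total variability matrix, and w(s) is the i-vector of the recording.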
Once the subspace is trained, when you have a new recording or utterance, you extract the features and then map them into the subspace; I am sure you are all familiar with that. So I am not going to tell you how to do speaker recognition; you have seen a lot of good talks during this wonderful conference. Instead, I will show you how we can do visualization with it.
First, for speaker recognition, i-vectors have been applied to different kinds of speaker modeling tasks: speaker identification, where you have a set of speakers and, given a recording, you want to identify who spoke in the segment; speaker verification, where you want to verify that two recordings come from the same speaker; or diarization, where you want to know who spoke and when. For the speaker recognition task, I would like to show some visualizations that explain what happens in the data if you don't do any channel compensation. I would like to acknowledge the work of a PhD student who was at MIT and working with me at the time.
We took our system from the NIST 2010 speaker recognition evaluation, which was based on i-vectors; at the time, the system we built was a single system trained to deal with telephone and microphone data in the same subspace. We took about five thousand recordings from that data and computed the cosine similarity between all the recordings, which gives a similarity matrix. From it he built a ten-nearest-neighbor graph, so each node is connected to its ten nearest neighbors, and then used a piece of software called GUESS to do the graph visualization. In such a graph, the absolute location of a node is not important; what matters is the relative distance between the nodes and the clusters, because it reflects how close they are and how structured your data is.
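Just to sketch what that step looks like in practice, here is a minimal toy illustration of the idea in Python; it is not the actual code we used, and the function name is only for illustration:

    # build a cosine-similarity matrix over i-vectors and keep, for each
    # recording, its 10 most similar recordings as graph edges
    import numpy as np

    def knn_edges(ivectors, k=10):
        # length-normalize so that the dot product equals cosine similarity
        x = ivectors / np.linalg.norm(ivectors, axis=1, keepdims=True)
        sim = x @ x.T                    # (n, n) similarity matrix
        np.fill_diagonal(sim, -np.inf)   # ignore self-similarity
        edges = []
        for i in range(sim.shape[0]):
            for j in np.argsort(sim[i])[-k:]:   # indices of the k most similar recordings
                edges.append((i, int(j), float(sim[i, j])))
        return edges

The resulting edge list can then be exported to a graph tool (GUESS, Gephi, or similar) for the layout.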
So here is the female data with the inter-session, the channel, compensation applied. The colors are by speaker: each point corresponds to a recording, and the points cluster by speaker, so you can really see the speaker structure in the data.
Then we said: OK, now let's remove the channel compensation and see what happens. Well, we lost the speaker clustering, and some new clusters appeared. We asked ourselves what was going on, so we went together to look at the labels. For example, we checked which microphone had been used for each of the recordings, and we found that the clusters actually corresponded to the microphone that had been used. That was pretty surprising; for example, if I remember correctly, the telephone data formed one cluster and the microphone data others. We also found that recordings from the same microphone sometimes split into two clusters, and that was because of the room: the LDC at the time used two rooms to collect the data, so the two rooms were also reflected in the data.
This is a very simple visualization to show that; I don't want to tell you that your equal error rate goes from two to one point five percent or whatever, I just want to tell you that if you don't do anything about the microphone or the channel compensation, it can be a big issue. That is what happens: the data can be affected by the microphone, by the channel, and also by the room in which it was recorded. And here is what we get when we do apply the channel compensation: the clustering is by speaker, and the coloring in the visualization is by channel, so you can see that the channel compensation is doing a good job of normalizing this.
Unfortunately, we can still recognize male and female as two different clusters, but at the time this was the best we could do. We see the same behavior in the other data; this one is the microphone data, which is the most interesting, and you can still see the split between room one and room two where the LDC collected the data. So this is a visualization that has been very helpful for us to understand the data, to show people that what we are doing makes sense, and to see that there is still room for improvement in microphone and channel compensation.
After that, around the 2011 language ID evaluation, I started looking at the language ID task and tried to do the same kind of visualization. In language recognition we have verification and identification flavors; I don't need to spend too much time on that. What I did is take the NIST 2009 data; the i-vector extractor was trained on the training data, and I took about two hundred recordings for each language, and I think we had twenty-three languages. I did the same thing: I built the cosine similarity, built the nearest-neighbor graph, and tried to visualize it. This is what happens for this language recognition task. For example, it is interesting that English and Indian English are close together; Indian English, Hindi and Urdu are very close together; Mandarin, Cantonese and Korean are almost in the same cluster; Russian and Ukrainian, and of course Croatian and Bosnian, fall in the same cluster; and also French and Creole. So it is a purely data-driven visualization that shows how close the languages are in terms of the acoustics, the features we use to train the i-vector representation.
So this is what i-vectors allow us to do: because you have this fixed-length representation, with a simple cosine distance, possibly with LDA on top, you can represent the data, see what is happening in it, and interpret what phenomena are going on. They are a good tool for that. Now let me move on, because I know you are all familiar with i-vectors and I don't want to spend more time on them; you probably prefer that we get to the more interesting topic of this talk. So after that I started looking at GMM weight adaptation, as I said, with Hasan, the student of Hugo Van hamme.
Several techniques have actually been applied to GMM weight adaptation: for example maximum likelihood, which is the simplest way; non-negative matrix factorization, which Hugo Van hamme has also been working on; the subspace multinomial model, which was implemented at BUT and is what the Brno people use; and what we proposed, which is called non-negative factor analysis. GMM weight adaptation is a little bit tricky, because you have the non-negativity of the weights and the fact that they should sum to one, so these are constraints you have to deal with during the optimization, when you are training your subspace.
The way the weight adaptation works, for example: you have a set of features, say one recording and its features, and you have a UBM. You compute the posterior distribution of each Gaussian component for each frame given the UBM, and then you accumulate these posteriors over time to obtain counts. To get the GMM weight adaptation, you try to maximize the objective function given here. If you want to do maximum likelihood, you simply accumulate all these posteriors over time and divide by the number of frames you have, and that gives you the maximum-likelihood weights.
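Roughly, in code, the counts and the maximum-likelihood weights look like this; it is a minimal sketch assuming a diagonal-covariance UBM, my own illustration rather than our actual implementation, and the function name is just for this example:

    import numpy as np

    def ubm_counts_and_ml_weights(frames, means, covars, weights):
        """frames: (T, D); means: (C, D); covars: (C, D) diagonal; weights: (C,)."""
        diff = frames[:, None, :] - means[None, :, :]                  # (T, C, D)
        log_gauss = -0.5 * (np.sum(diff ** 2 / covars, axis=2)
                            + np.sum(np.log(2 * np.pi * covars), axis=1))
        log_post = np.log(weights) + log_gauss                         # unnormalized log posteriors
        log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
        gamma = np.exp(log_post)            # frame-level posteriors, each row sums to 1
        counts = gamma.sum(axis=0)          # accumulated (zero-order) statistics
        ml_weights = counts / counts.sum()  # maximum-likelihood weight adaptation
        return counts, ml_weights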
Alternatively, you can do non-negative matrix factorization, which consists of splitting the weight adaptation into a product of small non-negative matrices, a basis and a recording-specific representation. It maximizes the same kind of objective given here: the input is the counts, and you try to estimate both the subspace, the basis, and the representation of the recording in that subspace, to characterize the weight adaptation.
This non-negative matrix factorization is described in the paper by Hugo Van hamme's student. What was implemented at BUT is the subspace multinomial model, where you have a multinomial distribution: the weights are described by the UBM plus a shift in a subspace, and the denominator makes sure that the weights obtained are normalized to sum to one.
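Written out, the subspace multinomial model is roughly of this form (a standard formulation, reconstructed from the description rather than copied from the slide):

    w_c(s) = exp(m_c + t_c r(s)) / sum_k exp(m_k + t_k r(s)),

so the softmax denominator guarantees that the adapted weights are non-negative and sum to one, with m the UBM log-weight origin, t_c the rows of the weight subspace, and r(s) the recording's low-dimensional vector.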
The good part of this model is that it is very good when you have nonlinear data to fit. Here is an example; I would like to thank my colleagues at BUT for giving me the slides with this picture.
Here, for example, you have a GMM of two Gaussians, and each point corresponds to the weight adaptation of one recording, obtained for example by maximum-likelihood estimation. We tried to simulate what happens when you have a large GMM, where you have some sparsity because not all the Gaussians appear in a given recording; you can see points pushed into the corners. This is just a simulation of what happens when you have a large UBM. You can see how the data looks in this case, and the subspace multinomial model is very good at fitting this data, but it has a drawback: it can overfit, which is why the BUT guys use regularization to keep it from overfitting.
Hasan's work at the time was to do something similar to the i-vector: you have the UBM weights, and the weights of a new recording are modeled as the UBM weights plus an offset in a low-dimensional subspace. The constraints here are that the weights should sum to one and should be non-negative. We developed an EM-like approach to maximize the likelihood objective: you alternate two steps, computing the recording-specific vectors given the subspace and updating the subspace given the vectors, until convergence. So we maximize the log-likelihood of the data subject to the constraints that the weights sum to one and stay non-negative, and a projected gradient ascent is used to do that. You can go to the reference to find all the details; I don't want to go too deep into it in this talk.
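For completeness, the non-negative factor analysis model has roughly this form (my notation, following the description above):

    w(s) = w_ubm + L r(s),    with  w_c(s) >= 0  and  sum_c w_c(s) = 1,

and L and r(s) are estimated by alternately maximizing the log-likelihood  sum_c n_c(s) log w_c(s)  over the training recordings, where n_c(s) are the accumulated posterior counts, using projected gradient ascent to respect the constraints.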
The difference between the non-negative factor analysis and the SMM is actually shown in this table: the NFA tends not to overfit, because its approximation of the data does not touch the corners, compared with the SMM. Sometimes that is good, sometimes bad, depending on which application you are targeting, but we compared them on several applications, and the SMM and the non-negative factor analysis behave almost the same in practice.
These discrete i-vectors have been applied for several applications and purposes: for example modeling of prosody, which is what Marcel did for his PhD; phonotactics, when you model the n-gram counts; and what we did for GMM weight adaptation for language recognition and dialect recognition in Hasan's work. In that paper we compared maximum likelihood, NMF, the SMM, as well as the non-negative factor analysis, and you can go and check it: they almost all behave the same for GMM weight adaptation.
Now, to get to the fun part: how can we use these discrete i-vectors to model the DNN activations? At the time I was actually motivated by this picture. I was watching a talk, I think by one of the deep learning people, maybe at Google or somewhere, and he was showing that if you train a deep belief network, an unsupervised auto-encoder, on millions of unlabeled YouTube frames, and you then look at one neuron at the top, you can actually reconstruct the pictures, and he was saying, OK, you can see the cat face. And I thought: can we do something like that for speech? Sure, speech is a continuous time series, but the question that stuck with me was whether we can actually see how the data is organized in the DNN hidden layers, and that is exactly what motivated me to start this work.
Remember that before, I said we have a recording and we extract a set of features from it. Earlier we gave these features to a GMM; now let's just remove the GMM and give them to a DNN. For example, we can do language recognition in a frame-by-frame way, which is what the Google paper from 2014 did: the input is just a frame (with some context) and the output is the language, and I will show the same kind of experiment. Note that when you have a new recording and want to make a decision, you make a frame-by-frame decision and then average (or take the max of) the outputs; that is largely what we compare against. You can also imagine other scenarios with DNNs where you want to see how the data is represented for this task.
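Just so that baseline is explicit, this is all the frame-averaging decision amounts to (a toy sketch, not the evaluation code; the function name is only illustrative):

    import numpy as np

    def utterance_language(frame_posteriors):
        """frame_posteriors: (T, n_languages) per-frame softmax outputs of the DNN."""
        avg = frame_posteriors.mean(axis=0)   # average score per language
        return int(np.argmax(avg)), avg       # decision and the averaged scores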
So imagine you have a DNN. The way we use it now, as I said earlier, is that we take the output to make a decision, or to get alignments, for example for DNN/UBM i-vectors, or we take one hidden layer and use it as bottleneck features. But in both cases we only see one level of the network, one hidden layer or the output; we don't see how the DNN actually propagates the information over all its layers. Here is why that matters: imagine you have some sparsity in the coding of each hidden layer, so that for each input only fifty percent of your neurons are active, for example because of dropout. Then the way the network encodes the information for, say, class one in this layer and in the next layer can be different, because there is some randomness in the way it propagates and encodes the information. So if you can model the whole pattern of activations, you capture more of the path the class takes through the DNN. This is information that is available there, but we are not using it, and that is exactly what motivated me to do this work: can we look at the whole DNN and see how the information progresses through it? What I will show is one way to do it; maybe it is not the best way, but it is one way.
The idea is the following: since we have these discrete i-vectors that are based on counts and posteriors, can I use them to model an i-vector for each hidden layer? That is what we built: for a DNN, we extract an i-vector representation for hidden layer one, and similarly for every layer up to the last hidden layer. To do that, I need counts, so that I can apply my GMM weight adaptation techniques exactly as they are used for GMM weights. Here is how you can obtain these counts. The first way: for each input frame, you compute the activation of each neuron in a given hidden layer and normalize the activations to sum to one; this is a bit artificial, because the network was not trained to do that; then you accumulate over time and that becomes counts, because each frame sums to one, and you can use the same machinery without changing anything. The second way is a post-softmax: a similar thing, but you apply a softmax over the neurons so that they sum to one and then accumulate; you can also train the network with a softmax in the hidden layers.
But the most important way, the one most faithful to the DNN and avoiding this ad hoc normalization, is to compute, for each neuron, its probability of activation and its complement, one minus it; you can consider these two numbers as a normalized little GMM of two components per neuron. In that case we are not distorting anything: we use the DNN responses as they are, and we don't normalize anything across neurons. For example, if you have one thousand neurons, you double that, and you have a thousand little GMMs of two Gaussians each; you use the subspace model on top of that, with the constraint that each neuron's activation and its complement sum to one, and in this case you don't do anything wrong, because you are modeling the same behavior as the DNN.
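In code, that third option is basically this (again a minimal sketch of the idea, assuming sigmoid activations in [0, 1]; the function name is just for illustration):

    import numpy as np

    def neuron_counts(activations):
        """activations: (T, N) hidden-layer outputs for one recording, values in [0, 1]."""
        on = activations.sum(axis=0)           # "active" count per neuron
        off = (1.0 - activations).sum(axis=0)  # complementary count per neuron
        # shape (N, 2): each row sums to T, like a tiny two-Gaussian GMM per neuron
        return np.stack([on, off], axis=1)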
We tried to compare a few of these variants, but I am not going to go into too much detail; I don't want to throw too many numbers at you and confuse you; they behave roughly the same. In the first application, dialect recognition, I used non-negative factor analysis; for the NIST LRE I used the subspace multinomial model, not for any deep reason, but because I wanted to show that both work; there is no real difference between them.
For the non-negative factor analysis, you have the weights of a new recording and the UBM. How do I compute the UBM weights here? I usually take all the training data, extract the counts for each recording, normalize them, and take the average; that is my UBM. So the UBM response of a neuron in a given hidden layer is simply the average response of that neuron over all the training recordings. Then you can use the non-negative factor analysis to do the adaptation per layer.
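So the per-layer UBM is nothing more than an average of normalized counts; roughly, in my own sketch of the recipe I just described:

    import numpy as np

    def layer_ubm(count_vectors):
        """count_vectors: one count array per training recording, all the same shape."""
        normed = [c / c.sum() for c in count_vectors]
        return np.mean(normed, axis=0)   # average normalized response = layer UBM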
Now, the interesting part is that the non-negative factor analysis, and the other subspace approaches as well, can also help you model all the hidden layers together. One way to do it is to build a hidden-layer i-vector for each layer separately and then concatenate these i-vectors. Or you can have a single model for everything, with the sum-to-one constraint applied per hidden layer; this allows you to see the correlations between the activations of all your hidden layers, and that is exactly what we did. To do that we extended the non-negative factor analysis: you have a different UBM, one corresponding to each hidden layer, and a single common i-vector that controls the weights of all the hidden layers.
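In equations, the joint model is roughly (my notation, following the description above):

    w^(l)(s) = w_ubm^(l) + L^(l) r(s),    for each hidden layer l,

with the non-negativity and sum-to-one constraints applied within each layer l, and a single vector r(s) per recording shared across all the layers, so that r(s) captures how the activations co-vary across the whole network.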
Now let me give some experiments and show some results. The first experiment I would like to show is dialect ID. We have a small corpus from broadcast television, and we were interested in doing some work on it. We have five dialects, and you can see here how many recordings we have: roughly forty hours for training, another ten or fifteen hours, and around an hour or so per dialect for the eval; you can see the number of cuts for training, development and eval. I trained a DNN for this five-class problem with five hidden layers: the first one has about two thousand units, and all the other hidden layers have five hundred units.
During training, the input is the same features, a stack of I think twenty-one frames, and the output is the five dialect classes, the same as the Google paper. Then, once I get the i-vectors, I use cosine scoring with LDA, as people described earlier today, and the best subspace dimensionality we found for this task is almost full rank for each of the layers.
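For reference, the scoring I use here is just LDA followed by cosine similarity against class means; a minimal sketch of that, using scikit-learn only as an illustration and not our evaluation scripts:

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def lda_cosine_scores(train_x, train_y, test_x):
        lda = LinearDiscriminantAnalysis(n_components=len(set(train_y)) - 1)
        tr = lda.fit_transform(train_x, train_y)
        te = lda.transform(test_x)
        classes = sorted(set(train_y))
        # class models: length-normalized mean of each class's training vectors
        models = np.stack([tr[np.array(train_y) == c].mean(axis=0) for c in classes])
        models /= np.linalg.norm(models, axis=1, keepdims=True)
        te = te / np.linalg.norm(te, axis=1, keepdims=True)
        return te @ models.T, classes   # cosine score per test segment and class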
The first result I show is the classical i-vector result, and the i-vectors are actually worse than the DNN with averaging of the output, which means that for each frame you compute the posterior for the five classes, you average them, and you take the max, which is exactly what the Google paper describes. The DNN is better here because a characteristic of this data is that the recordings are very short cuts, around thirty seconds and sometimes less, and we know that if you use the DNN and average the scores it behaves well on short segments; you have already seen talks on Wednesday afternoon showing that, even on other data. These numbers are error rates, sorry, so lower is better.
Now I will show what happens when we build the i-vectors on the hidden layers, starting from layer one up to layer five, and how the results evolve: the deeper you go, the better the results, which is something we know; the same understanding has appeared in other fields like vision, and we were able to observe the same thing here. You can see that from layer one to layer four the error goes down, and I kept all five layers because I want to show that sometimes there is no need to go too deep: layer five is already saturated, it does not add anything, but I kept it on purpose to show that we sometimes try to make networks really deep when it is not necessary. This is one example where you really don't need to do that.
The point is that we are now able to see the accuracy of each hidden layer, and we were also able to confirm that the deeper you go in the network, the better the results get; and you will probably get even more information by modeling all the hidden layers rather than a single representation. Here, this is LDA: with five classes, the LDA projects into four dimensions. The first time I presented this work, people looking at the slide said, well, you should probably also show it without LDA, and I said, that's true, I forgot to do that. This time I didn't forget. So I took the raw i-vectors, for example for the last layer, and I used t-SNE to visualize them. Here are just the raw i-vectors, with t-SNE and no LDA, and you can see that the origin is around here and the scatter fans out from it,
which is just a sign that, OK, length normalization will be useful here again. This is what you get when you do the length normalization: it behaves the same as in the speaker area, so length normalization is also useful here. I was honestly hoping to see a different behavior in this projection, but it behaves the same way. So this is t-SNE on the raw i-vectors. The reason I was asked this question is that the DNN was discriminatively trained for the task, so checking how it really represents the data at each layer was an important thing to do, and this is one thing that we checked.
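The visualization itself is nothing exotic; roughly, as a toy sketch using scikit-learn's t-SNE (not the exact settings from the slides):

    import numpy as np
    from sklearn.manifold import TSNE

    def tsne_embed(ivectors, length_norm=True, perplexity=30, seed=0):
        x = ivectors
        if length_norm:
            x = x / np.linalg.norm(x, axis=1, keepdims=True)   # optional length normalization
        return TSNE(n_components=2, perplexity=perplexity, random_state=seed).fit_transform(x)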
To recap: I have shown the classical i-vector result; the DNN with averaging of the frame scores, which is better than the i-vectors; and then the modeling of the hidden layers, which is better still. And I can say, from all my experiments, that the last layer, the output, is the worst one in terms of information, so don't take your decision there; with this data we see that the real information is actually in the hidden layers, there is no doubt about it. Here I give the last hidden layer result, and then what happens if you model everything jointly: you gain even more, around another two percent, by modeling all the hidden layers, and the same thing happens with the NIST data. So my point here is this: it is true that with the hidden layers, the deeper you go the better it gets, but if you also look at the correlations happening across all the hidden layers, it is actually better. By analogy, even people who do brain imaging, fMRI and so on, would like to track the activations; they can measure at one level, but they cannot easily see how the activity propagates (maybe she can correct me if I am wrong). In the same way, for the DNN we can either tap one hidden layer, or we can see what is happening across the whole network: you can track the activations and how they happened, or you can tap one level and make a decision. What I am saying here is that the DNN has more information than we are currently using, because we are not looking at the path of activations it took to encode its data.
So that was dialect ID, which you are probably less familiar with, so let me move on to the NIST data. But before that, I did one more experiment. Since the classical i-vector is completely unsupervised, and the DNN I used is discriminatively trained for this specific task, I asked: can I have a DNN that is just used to encode the data, an encoder? The simplest way, I said, is to try a greedy layer-wise trained RBM and see what happens; I am sure people have more sophisticated networks for that. So I trained RBMs with the same architecture as before, on the same data, with the same speech frames as input, and I used the same subspace dimensionality reduction and cosine distance; we use five RBM hidden layers. Here are the results, next to the i-vectors and the DNN output, but I am having some struggle, because I cannot do better than the first layer of the RBM encoder. The first layer gives me the best result; it is not as good as the discriminatively trained DNN subspace i-vectors, but it is not that bad. That is what I have been seeing: for the RBM hidden layers, the first one you train is actually the best one, and the deeper you go the worse it gets. My hypothesis, and I am not sure it is true, is that this happens because the layers are not jointly trained; if all the layers were jointly trained to maximize the likelihood of the data, it might be a different story, and that is what we are trying to investigate now with my students: can we train, for example, a variational auto-encoder to maximize the likelihood of the data and see whether this representation is meaningful or not? This is one thing we are trying to explore.
Now, for people who are more familiar with the NIST data: as you saw in the Wednesday afternoon session, people are modeling six languages, so I tried to do the same thing. We selected, with the help of a colleague who gave me this subset of the data, six languages, including Korean, Mandarin, Russian and Vietnamese. The difference between us and other people is that some of them tried to use only part of the data, for example only one of the sources, CTS or VOA, to avoid the mismatch; I wanted to know what is going on, so we put everything together, and it seems we did not have this issue. That is the difference between our paper, possibly, and some of the other papers in that Wednesday afternoon session. So we put everything together and trained a DNN that takes the frames as input and has the six classes as output. Actually, before that, I should say that I trained five hidden layers of about a thousand units each; the input is a stack of twenty-one frames, ten frames of context on each side, sorry; the output is the six classes; and I use the same scoring as before, LDA and cosine distance. Here are the results on this subset of the 2009 data for the six languages: the results of the i-vectors on the thirty-second, ten-second and three-second conditions, and the average of the scores, which is what everyone is doing, the direct approach.
A characteristic of these results, as has been said before, is that the DNN with score averaging only beats the i-vectors on the three-second condition; on thirty seconds and ten seconds it does not. But what happens when you use the hidden layers is a little bit of a different story. Again, the deeper you go in the DNN, the better it gets; that is the same behavior as before, no different story there. But the thing is that, for the thirty-second condition for example, nobody was able to beat the i-vectors with the direct approach, whereas if you use the hidden layers, for example hidden layer five, you obtain the best result everywhere, even for ten seconds and for thirty seconds. What is also interesting is that hidden layer five is just the one preceding the output, so this is another sign that the last layer, the output, is the one you really don't need to look at, based on my experience. Here again you can see that the last hidden layer is actually much better than the i-vectors as well as the averaged DNN output. So the hidden-layer i-vector representation in this case seems to do an interesting job of aggregating and pooling the frame data to make a representation of the recording that you can do classification with. That is an interesting finding, and it was actually surprising to see what is in the data.
Now, what happens when you model all the hidden layers together? Here I show the i-vector representation, the DNN average score, and the last hidden layer, layer five, and I also tried to see what happens if you model all the hidden layers jointly: you gain again, almost another zero point eight; sorry, I forgot to say that these are the averages right there. We can see that for thirty seconds the error is already low, so I don't take that too seriously, we win a little bit there, but for ten seconds we were able to win, and for three seconds we were also able to win. So it is the same behavior: all the hidden layers together carry better information than a single layer at a time, and the last hidden layer is also better than the first one; the hidden layers are worth looking at, but the last output layer is not that interesting in terms of making a decision. To be honest, one explanation is that these DNNs tend to overfit a little. But even when they overfit like that, if you use them to build a representation, to discretize your space, they still seem to work fine; making decisions directly from an overfitted network is a different story. That is one thing to note here. So this is what I have been finding this last year, trying to use these models to understand what is going on.
Let me try to conclude; we have five minutes, and there are a few things I want to say. The i-vector representation is an elegant way to represent speech of different durations; a lot of people already use it, and it nicely handles recordings of different lengths, long segments and short segments. The GMM weight adaptation subspaces can also be applied, as I showed you in this talk, to model the DNN activations in the hidden layers, and they do a good job. The take-home message, the statement I want to underline, is that the information in the DNN is not only in the output but is in the hidden layers; don't try to make a decision directly from the output.
Also, looking at one layer at a time and not seeing what is going on in all the hidden layers may be a mistake we are making; it may be good to look at the whole network as well, because it will tell you how the information went through the DNN and how each class is modeled; that seems to be very useful. The subspace approaches I have been trying are one way I thought of to do this work, especially in terms of modeling all the hidden layers; they seem to do a good job of pooling and aggregating all the frames and giving you a representation with the maximum information you can use for your classification task. This seems to work very well even though the DNN was trained frame-based: the network is trained at the frame level, you use it to make a sequence-level classification, and the i-vector representation seems to be doing a really good job for that. Let me take two more minutes for the future work.
For future work, the directions my students, my colleagues and I have been exploring are the following. As I said earlier, the DNNs we are using are frame-based, with a context of twenty-one frames or something like that, and that is not enough, so we are trying to shift to networks with more memory, like time-delay neural networks, or LSTMs, which are a special case of recurrent networks; that is what Ruben, my intern, is doing. We are trying to go beyond frame-by-frame processing and model more of the speech dynamics. We are also exploring this kind of vector for speaker recognition, to make it more useful for speaker; we are still working on that as well. And as I said earlier, I would be interested, if people were inspired by my talk, to come talk to me: maybe there is a better way to build auto-encoders that really encode the speech data. My hope is that at some point we will be able to get some speech modeling out of a DNN, a speech encoder: something that just encodes the speech, and after that I use it to discretize my space and use it for any task. For example, I give you a bunch of thousands of recordings, you encode your data, and after that you say, I want to do speaker, I want to do language; can I use the same model, which just encodes speech?
So if anyone has any ideas, please come talk to me. Also, to make the activations more interesting, I am interested in exploring the sparsity of the activations in the hidden layers. I am not doing anything special in the DNN training right now, but one thing I am trying, although we did not have time to compare the results yet, is dropout: for each input, only fifty percent of the neurons of a hidden layer are active, so there is some randomness between the recordings and between the hidden layers. I find that if you take two consecutive hidden layers of a DNN, they are sometimes redundant, because they are so close together, but making them different from each other, separating them, is actually better sometimes. If you force some sparsity in the activations, for example with dropout, which is obviously the simplest way to do it, you make the layers more complementary, because some randomness happens in the middle and forces the DNN to take a different path through each hidden layer. That is something I am really interested in: making the information between two consecutive hidden layers more powerful, more complementary rather than redundant. There is also the possibility of alternating activation functions, say sigmoid, rectified linear, sigmoid, and so on, so that between two consecutive sigmoid layers something in the middle makes things change a little bit and the behavior of the consecutive sigmoid layers differs. When you then model all the layers, there is hopefully a way to get more information into the subspace, and the way the DNN encodes information can become more useful for the classification.
And to conclude: I am organizing SLT 2016 in Puerto Rico, so hopefully you will submit your papers there; the deadline is around the same time. Please come, and if you come to the workshop you can also stay for the rest of the week, enjoy the beach and the cocktails, which are very good for getting your neurons to converge to the right objective function. And that's it, thank you.
[Audience] Najim, thank you for this presentation. My comment concerns just one point, which is not the main point of your talk: it is about the visualization, in particular t-SNE, the stochastic neighbor embedding. These techniques are useful and satisfying for looking at the data and for thinking about and understanding the distributions, but we have noticed that when you present high-dimensional data with these techniques, in particular the speaker classes, the classes end up spread along somewhat arbitrary directions. t-SNE does not respect the initial distribution: it separates the speaker classes, but it does not respect the original structure or the directions of the speaker classes. So it is useful because you can see the separation between classes of speakers, but it may give a misleading view of the real distribution. I think it is a very good tool, but it may be confusing if one uses it to draw conclusions about the underlying distribution.
[Najim] So you are saying that here I just want to show how the data is kind of structured, but I am not taking into account how the distribution is modeled by the t-SNE; is that what you are saying? [Audience] Yes, simply to be careful about those points in particular.
[Audience] I didn't write down all the numbers, but I saw you had results on the dialect ID task for the five Arabic dialects, and from the numbers I wrote down here, I think the fourth-layer i-vectors were at twelve point two percent and another configuration was at twelve point five percent; I apologize if I missed a slide that addressed this. My question is: as you move forward you are actually getting improvements, but it would really be nice, in dialect ID, where the differences between the dialects are a lot more subtle, to figure out what the things are that differentiate each of the dialects. So I am wondering whether you went back anywhere and looked at the test files you ran through here to see what is driving the improvement. Your assumption may be that you are getting a few more files accepted correctly, but you are just as likely to have a few more files rejected incorrectly, and it would be nice to see what the balance is: are you getting more pluses and losing a few, or are you not losing anything and gaining more? That is what I would like to see: as you move down here, is it a purely positive movement forward, or are some files falling backwards while the net gain is still positive?
[Najim] No, I agree with that, and I didn't do it. At the time I was also interested, maybe more interested, in what happens between the hidden layers: I was hoping to look at individual recordings, maybe with a linguist working with me, to try to understand why a given one is classified correctly in hidden layer five but not in layer four, three or two, and what makes it change. So I want to know which information in the fifth layer got me to make this one better than another one. But that's true, we didn't do it in the end; we were thinking about it.
[Audience] Not so much a question; I just want to thank you very much for proposing a new solution to a very hard problem. I would just like to put the difficulty of the problem into context, because we have been banging our heads against the same kind of difficulty. To summarize the problem: the problem is to get a low-dimensional representation of the information in a sequence. You have lots of speech frames, and then you want to distill the information in all the speech frames into a single smallish vector. The reason this is difficult: let's look at the classical i-vector. You can write down the generative model for the i-vectors in one equation, and it is very easy for most of us to just look at that and immediately understand it; that is the generative route. But what you are doing is the inference route, from the data back to the hidden information: now you have to share the information from all the frames and accumulate it back into the single vector. If you look at the i-vector solution, the formula for calculating the i-vector posterior is a lot more complex than just the generative formula for the i-vector; it took quite some effort to arrive at that formula, and I believe it is similarly difficult for a neural network to learn it. You mentioned the variational Bayes auto-encoders; we have been looking at those quite a lot. In the papers that have been published thus far, it is always a one-to-one relationship between the hidden variable and the observation, and everything is i.i.d., so the machine learning people have so far been solving a much easier problem. To accumulate all that information is a harder problem, and it is also computationally hard: if you think of the i-vector posterior, lots of papers have been published on how to make it computationally lighter. So that is why, as I said, your solution is quite exciting to us.
[Najim] Also, one of the guys from machine learning asked me: OK, so you have a DNN and you have your i-vector representation; can you propagate the errors from the i-vector level back into the DNN to make it more powerful for your specific task with the i-vector representation? That is something interesting for a PhD topic, no? Is there a way to combine the subspace and the DNN, the same as what people do in ASR with sequence training: can we do a similar thing when the error coming from the i-vector space is propagated back down through the DNN? That is maybe interesting as well; that is a question a machine learning colleague asked me.
[Audience] Nice presentation, Najim. I had a couple of questions; one was: when people moved from GMM-based i-vectors to DNN-based i-vectors using senones as classes, as I understood it, the improvement came from the fact that the space was quantized much better than with GMMs, right? And that used phones or senones as classes, or languages as classes. The auto-encoder that you are proposing to use has no information about any classes, so what is your intuition behind why something like that would work better than using senones or languages as classes?
[Najim] That is actually a good question. My intuition is just my feeling that in speech processing, the way we have been doing it, we start throwing information away from the signal too early. For example, here, if you work frame by frame and the language is the class, I am normalizing the speakers, the network is doing all those things for me. I am hoping not to do that and to keep as much information as I can. For example, I give you four, six or ten thousand hours of speech and I don't give you any labels; training on that speech in an unsupervised way might be helpful precisely because you have thousands, hundreds of thousands of hours of speech. Maybe in industry it is different; they have more labeled data than us. So can we do that? That is what I hope: it is the same point that has been made in other talks, that there is a lot of unlabeled data, can you use it in your training? So I am hoping to have a kind of speech coder that models speech: you put something in and you get the same thing back on the other side, so the information is there; the question is just how to use it. That is exactly my feeling, and I am not saying it would beat the discriminative training or something like that; I am just saying that if I had a speech coder, something like a vocoder-style encoder, that can reproduce the speech again, then the information is there, we just need to extract it. I don't know if that was clear.