I will be presenting what we did for LRE'15, and probably a great part of you have already seen most of this presentation at the workshop. We have changed a few things, corrected some errors, and I will give you the presentation again.
Well, as John already said, it was a collaboration between Brno, Agnitio and Politecnico di Torino. I included almost the full list of people who participated in our team. It was a lot of concentrated fun during the autumn, and we really enjoyed it.
Let's go straight to the system and the data we used. We decided to participate in both NIST conditions, the fixed data condition and the open data condition. For the fixed data condition we joined efforts with MIT, and they provided the definitions of the development set and the cuts. We split all of the data we had available for training: we kept sixty percent for training and forty percent for development. We also generated some short cuts out of the long segments, with durations uniformly distributed from three to thirty seconds, because that is what we were expecting in the eval data according to the evaluation plan.
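A minimal sketch of how such cuts can be generated; the helper and every number except the three-to-thirty-second range are illustrative, not our actual tooling:

```python
import random

def make_cuts(seg_dur, n_cuts, min_dur=3.0, max_dur=30.0, seed=0):
    """Cut (start, end) windows out of one long segment, with target
    durations drawn uniformly from [min_dur, max_dur] seconds."""
    rng = random.Random(seed)
    cuts = []
    for _ in range(n_cuts):
        dur = rng.uniform(min_dur, max_dur)
        dur = min(dur, seg_dur)                  # clip to the segment length
        start = rng.uniform(0.0, seg_dur - dur)  # random placement
        cuts.append((start, start + dur))
    return cuts

# e.g. five cuts from a 120-second recording
print(make_cuts(120.0, 5))
```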
For the open training data condition, we tried to harvest all of the data from our hard drives that we could find. We also asked our friends from here, from Bilbao, to provide some other databases, and we got something from MIT as well. These are databases that you might not be using in your systems regularly: from KALAKA we took European Spanish and British English, and from the Al Jazeera free speech corpus we took some Arabic dialects. Otherwise it was just all the data we had harvested for NIST LRE'09 from the radio broadcasts, from the Voice of America and so on. Just to let you know, we didn't use any Babel data for the classifier training; we only used the Babel data to train some bottleneck feature extractors, and I will speak about that later.
Bottleneck features: that is really the core of our system. I think most of you are already familiar with this architecture. We train a neural network to classify phoneme states; it is just a slightly special architecture, because it is a stacked bottleneck. The structure is here on the picture. Stacked means that we first train a classical network to classify the phoneme states, then we cut it at the bottleneck, then we stack these bottlenecks in time and train again. So we train another stage, and we take the bottlenecks from the second network; that is why they are called stacked bottlenecks. The effect is that, in the end, they see a longer context, and from our experience they work pretty well. But if you do some tuning, you can just use the first bottlenecks; it is enough, especially for speaker ID, I would say.
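To make the two-stage idea concrete, here is a minimal PyTorch sketch. All dimensions, the nonlinearities, and the time offsets are illustrative choices, not the exact configuration we used:

```python
import torch
import torch.nn as nn

class BottleneckNet(nn.Module):
    """One stage: classify phoneme states through a narrow bottleneck."""
    def __init__(self, in_dim, bn_dim=80, hid=1500, n_states=7000):
        super().__init__()
        self.pre = nn.Sequential(nn.Linear(in_dim, hid), nn.Sigmoid(),
                                 nn.Linear(hid, bn_dim))      # bottleneck
        self.post = nn.Sequential(nn.Sigmoid(),
                                  nn.Linear(bn_dim, hid), nn.Sigmoid(),
                                  nn.Linear(hid, n_states))   # state logits
    def forward(self, x):
        bn = self.pre(x)          # cut here after training -> BN features
        return self.post(bn), bn

# Stage 1 is trained on stacked cepstral frames (dims are illustrative).
stage1 = BottleneckNet(in_dim=39 * 11)

# Stage 2 sees stage-1 bottlenecks stacked in time, e.g. at these offsets.
OFFSETS = (-10, -5, 0, 5, 10)
stage2 = BottleneckNet(in_dim=80 * len(OFFSETS))

def stacked_bn_features(feats):
    """feats: (T, in_dim) tensor -> (T, 80) second-stage bottlenecks."""
    with torch.no_grad():
        _, bn1 = stage1(feats)                       # (T, 80)
        T = bn1.shape[0]
        idx = torch.arange(T).unsqueeze(1) + torch.tensor(OFFSETS)
        idx = idx.clamp(0, T - 1)                    # repeat edge frames
        ctx = bn1[idx].reshape(T, -1)                # stack in time
        _, bn2 = stage2(ctx)
    return bn2
```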
For the fixed training condition we obviously had to use Switchboard, and the network had approximately seven thousand triphone states in total. We were also trying a new technique with automatic acoustic unit discovery, and we trained a bottleneck network on those units; for that we used the LRE'15 data. For the open training condition we used the Babel data, and later we trained another network that has seventeen Babel languages; that is indeed the one I would like to use if you can use all kinds of data.
So, a general system overview. As I already said, the basis of our system are the bottlenecks, either based on Switchboard or on Babel data. Then, as references, we had an MFCC shifted-delta-cepstra system and a PLLR system. We also tried some phonotactic systems, modeling the expected n-gram counts with a multinomial subspace model and techniques like that; they performed reasonably, but they didn't make it to the fusion. Our favourite classifier is just a simple Gaussian linear classifier, and if you can, it is good to include the i-vector uncertainty in the computation of the scores: that helps quite a bit with the calibration and also provides a slight performance boost.
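As a rough illustration of that classifier, here is a minimal numpy/scipy sketch. The training recipe and the way the uncertainty enters are simplified; `ivec_cov` stands in for the i-vector posterior covariance that would come out of the extractor:

```python
import numpy as np
from scipy.stats import multivariate_normal

def train_glc(ivecs, labels):
    """Gaussian linear classifier: one mean per language plus a shared
    within-class covariance W, estimated on training i-vectors."""
    classes = np.unique(labels)
    means = np.stack([ivecs[labels == c].mean(axis=0) for c in classes])
    centered = ivecs - means[np.searchsorted(classes, labels)]
    W = centered.T @ centered / len(ivecs)
    return classes, means, W

def score(ivec, means, W, ivec_cov=None):
    """Per-language log-likelihoods; optionally add the i-vector
    posterior covariance (the 'uncertainty') to the shared covariance."""
    C = W if ivec_cov is None else W + ivec_cov
    return np.array([multivariate_normal.logpdf(ivec, m, C) for m in means])
```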
And we had a new thing, a sequence summarizing neural network, which I will speak about a bit later, because it was a little bit of a disaster, as you will see.
The fusion was a little bit different: we tried to reflect the NIST criteria, because the C_avg was computed over the clusters and then averaged, so we reflected this. Otherwise, we had one weight per system and one bias per language. We assigned cluster-specific priors to the data: for each cluster, all of the other data sets had their prior set to zero, and we trained over all clusters in the end. I think it improved the results on the NIST metric quite substantially.
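Schematically, the fusion and a cluster-averaged objective might look like the numpy sketch below. This is my own minimal reading of it: the flat within-cluster prior and all the names are illustrative assumptions, not our exact training code:

```python
import numpy as np
from scipy.special import logsumexp

def fuse(system_scores, alphas, bias):
    """system_scores: list of (N, L) log-likelihood matrices, one per
    system; one scalar weight per system, one bias per language."""
    return sum(a * S for a, S in zip(alphas, system_scores)) + bias

def cluster_xent(fused, labels, lang2cluster):
    """Cross-entropy evaluated within each cluster and then averaged,
    mimicking how C_avg averages over clusters: inside a cluster its
    own languages share a flat prior, everything else has prior zero."""
    losses = []
    for c in np.unique(lang2cluster):
        langs = np.where(lang2cluster == c)[0]           # languages in c
        rows = np.where(np.isin(labels, langs))[0]       # trials from c
        logp = fused[np.ix_(rows, langs)]                # restrict scores
        logp = logp - logsumexp(logp, axis=1, keepdims=True)
        truth = np.searchsorted(langs, labels[rows])
        losses.append(-logp[np.arange(len(rows)), truth].mean())
    return float(np.mean(losses))   # train alphas/bias to minimize this
```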
We also gave NIST a classical multiclass system, so that they could compute some between-cluster results on it, because if we gave them just the one that we calibrated and fused this way, they would be out of luck doing anything with it: of course, they asked for log-likelihood ratios, not log-likelihoods. I hope that next time they will rectify this.
This is all we had in the end in our submissions. Most of the systems are stacked bottleneck systems; the CD in the name means the cluster-dependent system, which I will speak about two slides later. And then there was the sequence summarizing network. As you can see, it is clearly the worst system; it would never make it to the fusion. But at the NIST workshop I was presenting it as a system that could almost perfectly classify the dev data. That is not the case: there was, of course, a bunch of dev data in the training data, so now it is the worst system. Anyway, we were so scared that it worked so well on our test data that we didn't include it in the primary system. The red arrow shows what we had as the primary system, and the alternate system would be the one with the sequence summarizing network included. What I report here is C_avg*; the star means that the calibration was performed on the dev set. I don't show the C_avg for the dev set, because during development we were doing jackknifing, which is not in these slides anymore.
So these are the results on the dev set, and they are pretty good; let's skip to the results on the eval set. There is not much to say, just that we see quite some calibration loss on the eval data, which was not the case on our dev data, especially on the fixed-condition set, because it proved to be quite an easier set than the one designed for the open data condition. So that's it; that is our system for the fixed training condition.
Now let's talk about the specialities we had there. The first one is the cluster-dependent i-vector system. Cluster-dependent means that we train the UBM and the i-vector extractor per cluster, while the rest of the system is trained on the whole data. You can see there are six independent systems which provide the scores, and then we fuse them with a simple average, to provide some robustness; we calibrate them later anyway. This proved to be quite effective during development; you just need to take care about the amount of data in each cluster. The results here indicate that there is not enough data: if you use a diagonal UBM, you get a better result in the end, which I believe is caused by not having enough data per cluster to fit all of the parameters of the full-covariance UBM.
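Schematically, the cluster-dependent scoring could look like this minimal sketch, where `extractor` and `classifier` are hypothetical callables standing in for the trained per-cluster pieces:

```python
import numpy as np

class ClusterSystem:
    """One cluster-dependent front-end: a UBM and i-vector extractor
    trained on one cluster's data only, with a classifier over all
    languages on top (names are illustrative placeholders)."""
    def __init__(self, extractor, classifier):
        self.extractor, self.classifier = extractor, classifier
    def score(self, utterance):
        ivec = self.extractor(utterance)   # cluster-specific UBM + T matrix
        return self.classifier(ivec)       # scores for all languages

def cd_scores(utterance, systems):
    """Average the language scores of the six cluster-dependent systems;
    a simple mean for robustness, since calibration happens later anyway."""
    return np.mean([s.score(utterance) for s in systems], axis=0)
```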
And then the sequence summarizing neural network, which doesn't work. I don't know if you have ever used it for language ID. Basically, you take a sequence, a whole utterance, and pass it through the network; there is a summarization layer inside, where you take the mean over all the frames, then you propagate through the rest of the network to the end, where you have the probabilities of the classes, and you do this over all the data. That's it. The trick is that you can also use the output of the sequence summarizing layer as some sort of feature extractor and model it differently later; apparently that works a little bit better than just using the network to do the final classification.
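In code, the idea is just a mean-pooling layer in the middle of the network; a minimal PyTorch sketch, with all dimensions and layer counts illustrative:

```python
import torch
import torch.nn as nn

class SequenceSummarizingNet(nn.Module):
    """Frame-level layers, a mean-pooling 'summarization' layer over the
    whole utterance, then utterance-level layers ending in class logits."""
    def __init__(self, feat_dim=80, hid=500, n_langs=20):
        super().__init__()
        self.frame = nn.Sequential(nn.Linear(feat_dim, hid), nn.ReLU(),
                                   nn.Linear(hid, hid), nn.ReLU())
        self.utt = nn.Sequential(nn.Linear(hid, hid), nn.ReLU(),
                                 nn.Linear(hid, n_langs))
    def forward(self, x):            # x: (T, feat_dim), one utterance
        h = self.frame(x)            # per-frame hidden activations
        summary = h.mean(dim=0)      # summarization layer: mean over time
        return self.utt(summary)     # language logits for the utterance

net = SequenceSummarizingNet()
logits = net(torch.randn(300, 80))   # a 3-second utterance at 100 frames/s
# The summary vector can also be taken out and modelled separately,
# like an embedding, instead of using the network's own classifier.
```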
We had some partial results with the sequence summarizing network when we tried it on LRE'09, but here the task is so much tougher, and the system was a complete disaster.
The open training data condition: it is almost the same scenario, we just had a little bit more variability in the features. Specifically, I would like to point out the multilingual bottleneck features, that is the ML7 system. You can see that if you include this whole machinery and all of the data, and the network can really cluster the space of the languages, you get clearly the best system that you can get, and that is also the case on the eval data.
Here I can even show you the difference when you use the covariance in the Gaussian linear classifier to obtain the scores: it is the last line versus the second line of the table. There is not much gain on the dev data, because it is already close to whatever we are training on, but there is a nice gain on the eval data. If we had submitted just that single system, it would probably have been the best, but of course we had not seen the results on the eval data before submitting, and we tried the whole fusion, which is slightly worse than the single best system.
Some analysis with the training data. We had some time constraints, and we thought, from our experience, that it is always sufficient to retrain the final classifier; I mean, when you have the i-vectors, to retrain the logistic regression or the Gaussian classifier to get your class posteriors. Unfortunately, that was not the case here. For the open data condition we decided: okay, we have this UBM and i-vector extractor, let's just use those and only retrain the classifier we will use for our open data condition submission. So we didn't train a new UBM and i-vector extractor; of course, we did it afterwards, and you can see that the column just below the submission is the one we would have gotten if we had taken the time and retrained the UBM, the i-vector extractor, and the classifier on top of our dataset. So we hurt ourselves quite a bit here as well.
Now, features. As I already said, the bottleneck features are the best ones that we were able to train. If you compare them with the MFCCs with shifted delta cepstra, there is a huge gap, and I think that a bottleneck system should be the basis of any serious language ID system nowadays.
The bottlenecks out of the network trained on the automatically derived units didn't perform very well, but of course that was a very new thing, and we didn't want to just run the usual bottlenecks and be done with the evaluation, so we tried it. It really depends on whether you can derive some meaningful units, and more specifically, whether the eval data match the data the units were derived on, because then the units would correspond and the bottlenecks would probably be better. So far it doesn't work that well.
Now the French cluster. Yesterday I saw many people present results here already without the French cluster; they were probably inspired by the NIST workshop, where it was excluded from the results. I think that we should not do that. I spoke with LDC: the data are completely okay, people can recognize it, there is just a problem with the channel. It is as if they gave us one channel in training and another one in the test; they basically swapped them. And because this is a cluster of just two languages, we all built a very nice channel detector. So that is something we should deal with, and not exclude the French cluster from the evaluation. Just, please, fix it.
Well, we tried, but we didn't have time to really deal with it, so all of the results I show here of course include the French cluster. And they are pretty good if you take the multilingual bottleneck features. But we have to be careful when doing analysis with the French cluster: the Creole from the French cluster is actually in Babel, so if you happen to have some Babel data, be careful with it; rather not use it, or use it carefully, or you might be surprised. It didn't solve the problem anyway.
We of course tried a bunch of classifiers on top of the i-vectors, and I can say that it is all about the same. The classifier of choice is the simplest one, just the Gaussian linear classifier that you can build right away out of the i-vectors.
Niko was experimenting with some language-dependent i-vectors, where you extract the i-vectors with the language priors involved; it was performing nicely, but not really beating the simple Gaussian linear classifier. We tried a fully Bayesian classifier, we tried a neural network and logistic regression, and you can see that all the columns here are pretty much the same.
We still have a few minutes, so I can briefly say something about the automatically derived units. It is a variational Bayes method: we train a Dirichlet process mixture of HMMs, and we try to fit an open phoneme loop on the data to estimate the units. Then we use this to transcribe the data, somehow, and use these transcriptions as the targets for training the neural network with the bottleneck, so that we end up with unsupervised bottlenecks.
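As a loose, much simplified stand-in for that pipeline, the sketch below clusters frames with a variational-Bayes mixture under a Dirichlet process prior and uses the cluster indices as pseudo-unit targets. The real system fits a DP mixture of HMMs, so the units have temporal structure; a plain GMM here is only for illustration, and the frame data is a placeholder:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

frames = np.random.randn(5000, 39)           # placeholder cepstral frames

# Variational-Bayes GMM with a Dirichlet process prior: components whose
# weights collapse toward zero are effectively pruned away.
dpgmm = BayesianGaussianMixture(
    n_components=100,                         # upper bound on #units
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="diag", max_iter=200)
units = dpgmm.fit_predict(frames)             # frame-level pseudo-labels

# `units` would then play the role of the phoneme-state targets when
# training the bottleneck network, giving 'unsupervised' bottlenecks.
```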
Well, maybe there is still some hope for this, and I hope that people at the JHU workshop will move this thing forward; we will see. The good thing is that we were able to surpass the MFCC baseline on the dev set with this system, and I think that is already impressive.
So, the conclusions. Again: use a bottleneck system in your LID system; the Gaussian linear classifier is enough, and if you can, just include the uncertainty in the score computation. We tried a bunch of phonotactic systems; they performed okay, but they didn't make it to the fusion. I would also say that it is always good to have some exercise with data engineering: look at the data that you have, try to collect something, and work with the data, not only with the systems. We tried a bunch of other things, like denoising and dereverberation; we didn't see any gains on the dev set, and there were only very slight gains on the evaluation set. For the phonotactic systems we were using Switchboard to train them, and we also tried a DNN there, which was pretty bad. That's all; thank you.
Okay, time for some questions.
My question is more related to the stacked bottlenecks you presented. You mentioned that they are good for language ID, but you didn't get such good results for speaker ID?
Well, we do get good results for speaker ID; it is just that we get equally good results with bottlenecks that are not stacked. So you can train the first network only and take the classical bottlenecks; you don't need to do this exercise with stacking the bottlenecks and training another network.
But they do perform well for speaker ID; it is just not worth it, right?

For now I wouldn't say that it is worth it, but maybe we will be using it for SRE'16; just don't take that as a promise.
And the other question: I guess that using these stacked bottleneck features and six UBMs for the language clusters, your solution was quite heavy in terms of runtime, right?
Well, that is indeed an overkill system from the design point of view, but it worked slightly better. I wouldn't be in favour of building such a system for a five or ten percent relative gain, but in an evaluation the numbers matter; usability comes second.
Any more questions?
Thank you for the presentation. I'm sorry, because my question is also related to the stacked bottlenecks. I was wondering if you have made any analysis of the alignment provided by both the first bottlenecks and the stacked ones, to see if there is really an evolution in the alignment.
You mean the performance of the system, or...?
No, I am talking rather about the alignment on your UBM, to see how the distribution of the features evolves.
I don't think we made this comparison, sorry.
Can I ask a question? It is about the same topic: the acoustic context you are looking at, plus or minus ten frames, did you tune that?
We didn't explore it that much; we kept it fixed for the set, of course, so this is not necessarily the ideal number, although we explored a bunch of numbers. If you are using just the first network, I think you can play more with the context; you should aim for something like three hundred milliseconds of context. If you are using the stacked bottlenecks, the context is larger, because you take several bottlenecks and use them in the second stage; that is why there it is something like plus or minus ten.
I was thinking whether it is maybe more sensitive to background noise, because in your other systems you said you did some denoising; I was wondering what is more sensitive to noise.

The bottleneck is actually pretty good at dealing with noise. I had a paper at Interspeech where we trained a denoising autoencoder, and it works pretty well on the MFCCs. Then we used the denoised spectra to generate the bottlenecks and basically repeated all the experiments with the bottlenecks, and the gains were much smaller.
Any other questions?
This is more of a comment on the French cluster you were speaking about, and I agree: it showed up as problematic, and as you said, ignoring it is not the answer. I would point out that we have a contradiction going on, in the sense that you labeled it a simple channel thing, right? But we know from LRE'09 and other evaluations we have done with narrowband audio pulled from broadcast, and we haven't seen this massive shift before. So we have that contradiction: in the past we used this successfully, with telephony speech pulled from broadcast and so forth. There is an interesting point here: again, LDC went out and did say that it was not a labelling problem, there were no errors in there, but there is a chance that the formality of the language changes based on whether you are on broadcast; you might be at a higher register, you know, high versus low, whereas on telephony it is different. So I just bring these up in general, because these things keep coming up. There may be something about an actual dialect shift that happens based on how the speech is produced, not so much the channel. We don't know yet.
I agree.
Okay, let's thank the speaker again.