So, hello everyone. This talk is about our submission to the 2015 NIST Language Recognition i-Vector Challenge, and the paper describes the work my colleagues and I did collectively for this challenge.
Okay, this is the outline of my presentation. First I will give a brief overview of the i-Vector Challenge, which may look a bit different from the perspective of the organiser. Then I will talk about our out-of-set (OOS) detection strategy, which constitutes the major part of our work in the i-vector challenge. After that comes the description of the subsystems, because our final submission is in fact a fusion of multiple systems. Then I will present the evaluation results, and finally the conclusions.
Okay, so the i-Vector Challenge consists of i-vectors extracted from fifty target languages, plus some unknown languages. All these i-vectors were derived from conversational telephone speech and narrowband broadcast speech.
From the perspective of the participants, there are three major challenges. The first one is that this is an open-set language identification task: in addition to the fifty target languages, we have to model an additional class to detect the out-of-set languages. On top of that, the set of out-of-set languages is unknown and has to be learned from the unlabeled development data. And the difficulty there is that the unlabeled development data consists of both target languages and out-of-set languages, so we have to select the out-of-set i-vectors from the unlabeled development set very carefully.
Okay, so these are the three datasets that were provided to the participants. The first one is the training set, which is labeled and consists of fifteen thousand i-vectors covering the fifty target languages, so we have about three hundred i-vectors per language. The next is the development set, which is unlabeled and consists of both target and non-target languages; most of our work is in fact about how to select the out-of-set i-vectors from this development set. And finally there is the test set, which has about six thousand five hundred i-vectors, split thirty and seventy percent into the progress set and the evaluation set.
Okay, NIST provided a baseline, which is an i-vector cosine scoring baseline consisting of three steps. The first one is whitening, followed by length normalization; the whitening parameters are estimated from the unlabeled development set, because this is unsupervised training, and it only needs the mean and the covariance matrix. Then we have the cosine scoring: here w-bar-k is the average, the mean of the three hundred i-vectors for a specific target language, with k running from one to fifty, and w is the i-vector of the test segment. The cosine score is given by this equation, and of course after the length normalization the norm terms are equal to one. For language identification, what we have to do is select the language that gives the highest score. As you can see from here, in this i-vector cosine scoring baseline the out-of-set class is not included.
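Just to make the baseline concrete, here is a minimal sketch of the three steps in Python; the variable names are mine, not from the NIST baseline code, and the class means are assumed to be the per-language averages of the whitened, length-normalized training i-vectors.

```python
import numpy as np

def train_whitener(dev_ivectors):
    """Estimate whitening parameters (mean and covariance) from the
    unlabeled development set; this is unsupervised, as in the baseline."""
    mu = dev_ivectors.mean(axis=0)
    cov = np.cov(dev_ivectors, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    eigval = np.maximum(eigval, 1e-10)            # guard against tiny eigenvalues
    W = eigvec @ np.diag(1.0 / np.sqrt(eigval)) @ eigvec.T
    return mu, W

def whiten_and_length_norm(x, mu, W):
    """Whiten, then project onto the unit sphere (length normalization)."""
    x = (x - mu) @ W
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def cosine_scores(test_ivectors, class_means):
    """Cosine scores; after length normalization the norms are one,
    so this reduces to a dot product with the re-normalized class means."""
    m = class_means / np.linalg.norm(class_means, axis=1, keepdims=True)
    return test_ivectors @ m.T                    # shape: (n_test, 50)

# Identification: pick the language with the highest score.
# predicted = np.argmax(scores, axis=1)
```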
Okay, so as we will see later, if we include an additional class for the out-of-set languages, we get quite an improvement in performance compared to the baseline.
Okay, now for the evaluation metric: the cost, which is defined as a weighted average of the identification error rates across the fifty target languages and the out-of-set class. If you put into this formula the value fifty for the number of target languages, together with the prior given to the out-of-set class, you can see that the weight given to out-of-set errors, that is, failing to detect that a segment is out-of-set, is much higher than the weight given to each of the target classes. So as far as the cost is concerned, out-of-set detection is a very important thing to do to reduce the cost, and that is what most of this talk is about.
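To make the weighting concrete, here is my reading of the challenge cost function; the out-of-set prior of 0.23 is from my recollection of the evaluation plan, so treat it as an assumption.

```latex
\mathrm{Cost} \;=\; \frac{1 - P_{\mathrm{oos}}}{n}\,\sum_{k=1}^{n} P_{\mathrm{error}}(k)
\;+\; P_{\mathrm{oos}}\, P_{\mathrm{error}}(\mathrm{oos}),
\qquad n = 50,\quad P_{\mathrm{oos}} = 0.23
```

With these values, each target language carries a weight of about 0.015, while the out-of-set error carries 0.23, roughly fifteen times more.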
Okay, so to investigate different strategies for out-of-set detection, we designed a so-called unlabeled set from the labeled training data we have. The labeled training data consists of fifteen thousand i-vectors for the fifty target languages. What we did was a forty-ten split: forty languages are kept as targets, and the other ten languages act as the out-of-set languages. This is a random selection, with no particular preference for any of the languages. Those i-vectors are used as the out-of-set languages in our experiments, and from each of the forty target languages we select fifty of the three hundred i-vectors as the target portion of this unlabeled set. And of course we perform LDA to reduce the dimension, so that we could investigate the different strategies quickly.
Okay, so basically we investigated two strategies; in fact we investigated many strategies, but these two we found to be pretty useful. The first one we call least-fit target, and the second one best-fit OOS. Least-fit target means that we train a classifier on the target languages, and those i-vectors that are least fit to the target classes are taken as the out-of-set i-vectors. As for the best-fit OOS, we train something like a forty-plus-one (or fifty-plus-one in the actual evaluation) class classifier, so we have one extra class for the out-of-set, and we select those i-vectors that are best fit to that out-of-set class.
"'kay" so no we this is the radius of philosophy okay so what happens that
we to the a target languages that we train a multi-class svm soul for the
case of our seamlet the unlabeled a development set we have forty classes therefore the
actual we have fifty classes
so what is that
we train a multi-class svm
and then be scores look in those i-vectors in the unlabeled development set
so
what have like what is of the i-vectors where have one probably the posterior probabilities
for each of the classes
then we take the max
a amount of k classes we have which if these
so then this okay will be those i-vectors having the posterior probability
nasa then a given
try show
right
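A rough sketch of the least-fit-target selection, assuming a scikit-learn linear SVM with probability calibration; the threshold value here is just a placeholder, not the one we actually used.

```python
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

def least_fit_target(train_x, train_y, unlabeled_x, threshold=0.5):
    """Train a multi-class SVM on the labeled target languages, then flag
    unlabeled i-vectors whose best class posterior is below a threshold
    as out-of-set candidates."""
    svm = CalibratedClassifierCV(LinearSVC())     # gives posterior estimates
    svm.fit(train_x, train_y)
    posteriors = svm.predict_proba(unlabeled_x)   # (n_unlabeled, n_classes)
    max_post = posteriors.max(axis=1)
    return unlabeled_x[max_post < threshold]      # "least fit" to any target
```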
As for the best-fit OOS, we train a multi-class SVM with K plus one classes. Now the question is how we are going to get the additional class, given that we do not have labels. What we do is this: we take the fifty target languages, and then we throw in the whole unlabeled development set, treating all those unlabeled i-vectors as one extra class, and we train the multi-class SVM on that; of course there are target i-vectors inside the unlabeled development set as well. Then, using the multi-class SVM trained in this manner, we compute the posterior probability with respect to the out-of-set class, and we select the i-vectors with the highest probability; these are what we call the best-fit OOS.
So in this way we are actually discarding the target i-vectors that are present in the unlabeled development set, hopefully keeping the selection clean, so that target languages like Spanish do not end up in the out-of-set pool. [laughter] All right, okay.
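And a corresponding sketch of the best-fit-OOS selection, where the extra class is trained on the whole unlabeled set; again scikit-learn, with names and an arbitrary selection size that are my own assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

def best_fit_oos(train_x, train_y, unlabeled_x, n_select=1000):
    """Train a (K+1)-class SVM in which the extra class is the whole
    unlabeled development set (targets included), then keep the unlabeled
    i-vectors with the highest posterior for that extra class."""
    oos_label = train_y.max() + 1                  # label for the extra class
    x = np.vstack([train_x, unlabeled_x])
    y = np.concatenate([train_y, np.full(len(unlabeled_x), oos_label)])
    svm = CalibratedClassifierCV(LinearSVC())
    svm.fit(x, y)
    p_oos = svm.predict_proba(unlabeled_x)[:, -1]  # posterior of the OOS class
    best = np.argsort(p_oos)[::-1][:n_select]      # "best fit" to OOS
    return unlabeled_x[best]
```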
Okay, so this is a comparison of the two methods, the best-fit OOS and the least-fit target, in terms of precision versus recall. We can see here that the best-fit OOS gives a better precision for all recall values compared to the least-fit target. And this diagram illustrates the same thing on a two-dimensional graph: the best-fit OOS gives a geometrically cleaner detection, so it better separates the out-of-set i-vectors from the unlabeled development set.
Okay, so on top of the best-fit OOS selection, we then do an iterative purification step to improve the out-of-set detection. What happens is that, based on the detection scores we have from the best-fit OOS, we rank the i-vectors from top to bottom: the top ones are the most likely to be out-of-set, the bottom ones the most likely to be target. Then we take these two groups of i-vectors, compute their means, and score them against all the unlabeled i-vectors; then we re-rank, and we take a larger N, an increased number of i-vectors, at each iteration, and collect the result. If you do this iteratively, the best result we can get is that the number of detected out-of-set i-vectors grows to about forty percent of the roughly six thousand five hundred unlabeled i-vectors that we have.
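As a rough sketch of the idea: the growth schedule, the re-scoring against a difference of means, and all the names below are my own reconstruction under the assumption of length-normalized i-vectors, not necessarily the exact recipe we used.

```python
import numpy as np

def iterative_purification(unlabeled_x, oos_scores, n_start=200, growth=1.5, n_iter=10):
    """Iteratively refine the out-of-set ranking: take the current top-N
    (most OOS-like) and bottom-N (most target-like) i-vectors, use the
    difference of their means as a scoring direction, re-score everything
    against it, and grow N at every iteration."""
    scores = np.asarray(oos_scores, dtype=float)
    n = n_start
    for _ in range(n_iter):
        order = np.argsort(scores)[::-1]                      # descending OOS-ness
        top_mean = unlabeled_x[order[:n]].mean(axis=0)        # OOS prototype
        bottom_mean = unlabeled_x[order[-n:]].mean(axis=0)    # target prototype
        direction = top_mean - bottom_mean
        direction /= np.linalg.norm(direction)
        scores = unlabeled_x @ direction                      # re-rank all i-vectors
        n = min(int(n * growth), len(unlabeled_x) // 2)       # grow N each round
    return scores                                             # final OOS ranking
```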
Okay, so the next thing is that our submission is a fusion of multiple classifiers, and it consists of pretty simple, standard classifiers. The first one is a Gaussian backend followed by multi-class logistic regression. Then we have two versions of SVMs: one is based on what we call polynomial expansion, and the other one on an empirical kernel map. We also investigated using a multilayer perceptron to expand the i-vectors in a non-linear way and then applied an SVM on top of that. And then we also have a DNN classifier that takes the i-vector as input, and whose output layer has fifty-one classes: the fifty target languages plus one out-of-set class. The fusion itself is something very simple, a linear fusion; the way we learn the weights is by submitting the results of the individual systems, observing the scores on the progress set, and adjusting the weights accordingly.
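The fusion itself is nothing fancy; a minimal sketch of linear score fusion, where the example weights are placeholders (in practice we tuned them against the progress-set feedback).

```python
import numpy as np

def linear_fusion(score_matrices, weights):
    """Weighted sum of the score matrices produced by the individual
    subsystems; all matrices share the shape (n_trials, n_classes)."""
    fused = np.zeros_like(score_matrices[0], dtype=float)
    for s, w in zip(score_matrices, weights):
        fused += w * s
    return fused

# e.g. fused = linear_fusion([gb_scores, svm_scores, dnn_scores], [0.4, 0.3, 0.3])
```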
Okay, so for the first classifier, the Gaussian backend, what we did was train one Gaussian distribution for each of the target languages; for the case of fifty target languages we train fifty Gaussian distributions. The means are estimated separately per class, whereas for the covariance matrix we actually compute a global covariance matrix, apply smoothing, and then adapt it to the individual target classes. Then we add a second backend, in the score space, in which we include the out-of-set cluster as one additional class; this is quite standard in language recognition. And this is followed by score calibration using multi-class logistic regression. With the multi-class logistic regression we can convert everything into log-likelihoods and then into posteriors, and this lets us control the priors of the trials, so we can, for example, put more prior on the out-of-set class, because as we have seen, out-of-set detection is very important in reducing the cost.
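A minimal sketch of the first-stage Gaussian backend with per-class means and a shared covariance; the smoothing here simply shrinks toward the identity as an illustration, and the per-class adaptation we applied is not shown.

```python
import numpy as np
from scipy.stats import multivariate_normal

def train_gaussian_backend(x, y, n_classes, smoothing=0.1):
    """One Gaussian per target language: class-dependent means and a
    shared covariance smoothed toward the identity (the smoothing value
    is a placeholder)."""
    means = np.stack([x[y == k].mean(axis=0) for k in range(n_classes)])
    cov = np.cov(x, rowvar=False)
    cov = (1.0 - smoothing) * cov + smoothing * np.eye(cov.shape[0])
    return means, cov

def gaussian_backend_scores(x, means, cov):
    """Log-likelihood of each i-vector under each class Gaussian."""
    return np.stack(
        [multivariate_normal.logpdf(x, mean=m, cov=cov) for m in means],
        axis=1)                                    # (n_trials, n_classes)
```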
Okay, now for the polynomial SVM that we used. We do a simple polynomial expansion up to the second order, so this expands the four-hundred-dimensional i-vectors to about eighty thousand dimensions, with appropriate scaling. Then, with these expanded vectors, we do centering to a global mean and normalization to unit norm, and we perform NAP, where the rank we remove is quite small compared to the dimensionality we have. Okay, and then, to include the out-of-set class we have fifty-one classes, and we used two strategies: one is one-versus-all, and the other one is a pairwise (one-versus-one) strategy, so the final score is a combination of these two strategies used to train the SVM.
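For concreteness, one way to write the second-order expansion; this maps a 400-dimensional i-vector to roughly 80k dimensions, and the exact scaling we applied is omitted.

```python
import numpy as np

def poly2_expand(w):
    """Second-order polynomial expansion of an i-vector: the vector itself
    plus the upper triangle of its outer product, with off-diagonal terms
    scaled by sqrt(2) so that dot products match the polynomial kernel."""
    outer = np.outer(w, w)
    iu = np.triu_indices(len(w))
    quad = outer[iu]
    scale = np.where(iu[0] == iu[1], 1.0, np.sqrt(2.0))
    return np.concatenate([w, quad * scale])

# A 400-dim i-vector expands to 400 + 400*401/2 = 80,600 dimensions.
```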
Okay, so the other one is what we call the empirical kernel map. What we do is take the polynomial vectors we have, and construct what you could call a basis matrix, using all the training vectors we have as well as the out-of-set i-vectors we have detected. Then, for each of the i-vectors that we are going to score, we do a mapping by simply multiplying it with this matrix. In effect we are transforming the polynomial vectors into the score space, the so-called score vectors. This is followed by centering to the global mean and normalization to unit norm, and the same SVM strategies as before. So we have two kernels that we use: one is the polynomial expansion, the second is the empirical kernel map, both with SVMs.
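And a sketch of the empirical kernel map as I have described it; the names are mine, and the basis here is assumed to be the stack of (expanded) training vectors plus the detected out-of-set vectors.

```python
import numpy as np

def empirical_kernel_map(x, basis, global_mean=None):
    """Map each (polynomially expanded) vector into the score space by
    taking its inner products with every basis vector, then center to a
    global mean and length-normalize."""
    scores = x @ basis.T                            # (n_samples, n_basis)
    if global_mean is None:
        global_mean = scores.mean(axis=0)           # estimate from this set
    scores = scores - global_mean
    return scores / np.linalg.norm(scores, axis=1, keepdims=True)

# basis = np.vstack([expanded_training_vectors, expanded_detected_oos_vectors])
```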
Now the results. First of all, we would like to compare how the polynomial vectors and score vectors perform compared to the raw i-vectors. The first line is the baseline, raw i-vectors followed by cosine scoring, at 0.3959. If we simply change the cosine scoring to an SVM, we get about a 7.8 percent improvement compared to the baseline. If we then use the second-order polynomial expansion of the i-vectors, we get 0.34, which is about a fourteen percent improvement. And if we go from the polynomial vectors to the empirical kernel map, the score vectors, we get a sixteen percent improvement.
Okay, so next we look at the out-of-set detection strategies, where we compare the least-fit target with the best-fit OOS, for both the polynomial SVM and the empirical kernel map. This is what we get: when we do not include any out-of-set detection at all, there is the fourteen percent improvement due to the classifier compared to the baseline. If we use the least-fit target, we get around a thirty-two percent improvement; the best-fit OOS gets more than that; and if, on top of the best-fit OOS, we do the iterative purification, we get a forty-five percent improvement. It is similar for the case of the empirical kernel map.
All right, so this is how the final submission performs: we get about a fifty-five percent improvement compared to the baseline on the progress set, and fifty-four percent on the evaluation set. The improvements essentially come from two places: a better classifier, since we used SVMs, multi-class logistic regression, the DNN and the MLP, but I think the larger part of the contribution is from the out-of-set detection strategy, which by itself gives us roughly forty percent of the improvement compared to the baseline.
Okay, another thing I want to mention is what we found in one of our submissions on the test set: the number of out-of-set detections is about one thousand seven hundred, and I think this is much higher than the real number of out-of-set segments or i-vectors in the test set. But given how the cost behaves, if you miss an out-of-set detection you are going to lose much more in terms of the cost, so it is better to call an i-vector out-of-set than to miss one that really is out-of-set.
Okay, so this is how our performance progressed over the evaluation period. Starting from the baseline system, we first added the better classifiers; then we found that the least-fit target is a good strategy for out-of-set detection and we got a boost in performance; then the best-fit OOS strategy gave us another boost; then the iterative purification gave a further one; and finally we have the fusion, which gets us to 0.17 in terms of the cost.
Okay, so in conclusion, we have obtained about a fifty percent improvement compared to the baseline, with the major contributions coming from the fusion of multiple classifiers and from the out-of-set detection strategy. The following out-of-set detection strategies were found to be useful: the least-fit target, the best-fit OOS, and the iterative purification. What we were not really able to find is a good strategy to extract useful target i-vectors from the unlabeled development set; we believe that if we had a better strategy for doing that, it would give us a further improvement.
Okay, we have time for some questions.

Question: Thank you. In your two strategies, did you observe that it is not very useful to model all the out-of-set languages as one single class, since by definition they may be distributed over several quite different languages? Did you try, instead of K plus one classes, maybe K plus two or more, and then choose by the posterior?
Answer: That is a fair comment. Unfortunately we didn't try it, because during the evaluation we do not know how many out-of-set languages there are; in that class there may be one, there may be two, we have no idea how many languages are in it, so we did not actually explore that option. One thing that could be done is to take the rejected i-vectors and look at which target they match most closely, Japanese for instance, so that we could say these out-of-set items are closer to, say, the Italian family or some other group; we probably should have done that, using a language tree for the grouping.
Question: And the second question: did you look at the confusion matrix, to see which languages are confused and could somehow be pooled together?
Answer: What we have is not exactly that, but maybe I can take this opportunity to talk about it. Overall, what we did for the i-vector challenge was not only out-of-set detection; of course there are a lot of other aspects that we explored. For example, the target detection is actually not very good: if you look at the final submission, even though we got around fifty percent improvement compared to the baseline in terms of the cost, the target detection was in fact worse compared to the baseline, if you look at it in detail. Thank you, thank you.
Question: This i-vector challenge preceded the NIST LRE 2015 evaluation, right? So how much of this work was carried over and helped in the LRE 2015 submission?
Answer: Unfortunately not that much, because LRE 2015 is a closed-set identification task, whereas here we have an open-set problem in which the out-of-set class becomes important; for LRE 2015 that is not really applicable. What we did in fact reuse is what we called the empirical kernel map. But for the LRE, what was actually important was the use of bottleneck features: we used to use SDC features, and once you replace SDC with bottleneck features you get around fifty percent improvement automatically, without doing anything else. So there the focus was more on the feature level and not so much on the backend, whereas for the i-vector challenge the focus was on the backend, on out-of-set detection, which is what this presentation was about.
Okay, I think we are out of time, so let's thank the speaker again.