hello, i'm from the University of Eastern Finland, and
it's my pleasure to present in this workshop. i don't know
if it's good to be among the last speakers or not, but
in the following fifteen to twenty minutes i will present
an effective and simple out-of-set detection method over the i-vector space in the
context of language identification
well
language identification can be done in two ways. one is closed-set,
where the language of a test segment corresponds to one of the in-set, or target,
languages
and the other is open-set,
where the language of a test segment may not
be any of the target languages;
the task is then to classify
the test segment
either into one of the in-set languages or into
an out-of-set model
well
one way to perform open-set language identification is to train
an out-of-set model from additional data,
but
such data is huge and unbounded, so
the practical key question is
how to select the most representative out-of-set data
to build this out-of-set model. in other words,
how to obtain
high-quality
out-of-set data, or additional data, to train
this out-of-set model
well
in the context of language identification, good candidates for out-of-set data
have certain properties. one of the main properties is that
out-of-set candidates should come from different language families.
by language families i mean languages that share a
common ancestor; for example, Russian, Ukrainian, and Polish are all from the Slavic language
family
and the second property
is that some out-of-set candidates should be close
to the in-set languages while others are farther away, because having varied candidates gives a
general out-of-set model which better represents the whole world of out-of-
set data, or out-of-set languages
well
there are some ways to do this,
some classical approaches. one is the one-class SVM, where the idea is to enclose the
data within a hypersphere
and classify new data as in-set if it falls within this hypersphere, and
as out-of-set otherwise
two other classical approaches are k-nearest neighbours, where,
given a data point, the sum of the distances between this point and its k
nearest neighbours is computed, and
the higher this score is, the more confident we are that this
point is an outlier, that is, out-of-set
and another classical approach is distance to the class means: if we assume that the
data is Gaussian,
those points that lie
two or three standard deviations below or above the class mean
are considered as out-of-set data
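the kNN outlier score just described can be sketched in a few lines. this is an illustrative pure-Python sketch with made-up toy 2-D points standing in for i-vectors, not the evaluated implementation:

```python
import math

def knn_outlier_score(x, data, k=3):
    # Sum of Euclidean distances from x to its k nearest neighbours:
    # the larger the sum, the more likely x is an outlier (out-of-set).
    dists = sorted(math.dist(x, p) for p in data)
    return sum(dists[:k])

# Toy 2-D points standing in for i-vectors.
cluster = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
inlier_score = knn_outlier_score((0.05, 0.05), cluster)
outlier_score = knn_outlier_score((5.0, 5.0), cluster)
# The far-away point receives the larger score.
```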
what we consider in this study is the use of a nonparametric statistical test known
as the Kolmogorov-Smirnov test.
it's nonparametric,
and the idea is this:
given two samples,
we test
whether these two samples have the same underlying distribution
by computing the maximum difference between their
empirical cumulative distribution functions
as you can see in this picture, this maximum difference is known as the
KS value; if it is greater than a critical value,
it indicates that these two samples are from different distributions, or in
our case, from different classes
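a minimal sketch of the two-sample KS statistic, in a brute-force pure-Python version (in practice one would use an optimized library routine such as scipy's):

```python
def ks_statistic(a, b):
    # Maximum absolute difference between the empirical CDFs of a and b.
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:
        f_a = sum(v <= x for v in a) / len(a)
        f_b = sum(v <= x for v in b) / len(b)
        d = max(d, abs(f_a - f_b))
    return d

# Overlapping samples give a small KS value ...
print(ks_statistic([1, 2, 3, 4, 5], [1.1, 2.1, 2.9, 4.2, 5.0]))
# ... while completely separated samples give the maximum value, 1.0.
print(ks_statistic([1, 2, 3, 4, 5], [11, 12, 13, 14, 15]))  # → 1.0
```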
okay, how did we adapt it to our open-set language identification task?
well, given an unlabeled i-vector w_i and all the
i-vectors in a class, say language l, we can compute
the empirical cumulative distribution functions of this w_i and of those i-vectors.
if we have n samples in
language l,
we will come up with n individual KS values, so we take the average
of these
individual KS values and obtain an average KS value
that corresponds to
the outlier score of w_i in language l
well
we repeat this for all L target languages,
and then we have L average KS values and we take the
minimum value
as the final outlier score
for w_i,
this unlabeled i-vector
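the whole scoring procedure can be sketched as follows. this is a toy illustration under the assumption, as described, that each i-vector's component values are treated as a one-dimensional sample for the KS test; the synthetic Gaussian "languages" stand in for real i-vector classes:

```python
import random

def ks_statistic(a, b):
    # Maximum absolute difference between the two empirical CDFs.
    a, b = sorted(a), sorted(b)
    return max(abs(sum(v <= x for v in a) / len(a)
                   - sum(v <= x for v in b) / len(b)) for x in a + b)

def outlier_score(w, languages):
    # For each language: average the KS value between w and every
    # training i-vector of that language, then keep the minimum
    # average as the final outlier score of w.
    avg_ks = [sum(ks_statistic(w, v) for v in vecs) / len(vecs)
              for vecs in languages.values()]
    return min(avg_ks)

random.seed(0)
dim = 50  # toy dimensionality (400 in the actual system)
languages = {
    "A": [[random.gauss(0, 1) for _ in range(dim)] for _ in range(5)],
    "B": [[random.gauss(3, 1) for _ in range(dim)] for _ in range(5)],
}
in_set = [random.gauss(0, 1) for _ in range(dim)]    # resembles language A
out_set = [random.gauss(10, 1) for _ in range(dim)]  # resembles neither
# The out-of-set vector receives the higher outlier score.
```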
well
it's interesting that these KS values
themselves have a distribution,
shown in this picture.
the red bars show the in-set KS values, meaning that, for example,
for a given class,
the red bars are computed from
the data
that belong to that class,
and the blue bars, the out-of-set case,
are computed from the data that do not belong to that class
and interestingly,
the in-set KS values
tend to values close to zero, and the out-of-set
KS values tend to
values close to one
so while we couldn't see this separation by looking directly at the data
at the beginning, now
we have a tool that shows how the in-set and out-of-set data are separated
well, let's
apply it in our open-set language identification task
well
we applied the idea on the NIST language i-vector challenge two thousand fifteen.
the training set consists of fifteen thousand
utterances
from fifty in-set languages,
and the development set has six thousand five hundred unlabeled
utterances, with the same amount of data in the test set
well, the data was balanced between the languages,
and the dimensionality of the i-vectors was four hundred
and we did some post-processing, like within-class covariance normalisation and
linear discriminant analysis,
on the i-vectors
well
to perform
evaluation of the out-of-set detection methods we need labeled data. because the development
set was not labeled, we used the training set:
we segmented the training set into three portions, training, development, and test portions,
and we assigned thirty in-set languages and twenty out-of-set
languages
and the test portion has all the in-set languages
and the twenty out-of-set ones,
and the data
didn't have any overlap between these three portions
well
here is an example of the labeling for the
out-of-set evaluation. for those data whose true language
was one of the in-set languages, for example data ID one,
we labeled it as in-set,
and for those data whose
true language was different from
the in-set languages,
we labeled them as out-of-set
here are the results of
the out-of-set detection methods and our proposed
method. well, the KS-value-based method outperforms the other classical approaches:
for example, against the one-class SVM and kNN we have fourteen and sixteen percent relative
equal error rate reductions in out-of-set detection
well
furthermore, fusing these baseline systems with KS brings improvement: we
improved all individual systems
by fusing KS with them,
and the best performance came from fusing KS with the one-class SVM:
from an individual KS
equal error rate of around twenty-eight percent, the fusion dropped
the equal error rate to twenty percent
well
let us look at the open-set language identification results
here
the different rows in the table
differ based on the data selected for out-of-set modeling:
for example, we have random selection,
using all of the training set,
all of the development set, a combination of training and development sets,
and the last row is the proposed selection method
for reference purposes we also include the challenge results;
those results are based on the SVM classifier and are reported directly from the NIST
evaluation website
well
sorry, i didn't mention that
the identification cost of the proposed selection method is around twenty-six,
which, based on the identification results,
outperforms the NIST baseline
by thirty-three percent relative
improvement; the best relative improvement was
fifty-five percent
well
looking at the first rows,
adding additional data did help to reduce the identification cost, but
it was not better than
selecting the out-of-set data in a
supervised way; so supervised selection is better
well
here we compare
KS with the other out-of-set detection methods in open-set language identification
well, all of them
outperform the challenge baseline results,
but the KS method is the winner system, with a twenty-six
identification cost
well
we had one thousand five hundred out-of-set data points
in the test set, from fifteen
out-of-set languages, and we were able to detect around one thousand of
them
with this method
and use them as out-of-set data.
the important thing in this challenge was
to detect the out-of-set data better: you gain when you correctly detect out-
of-set data
well, in conclusion,
in this study
we proposed a simple and effective method to detect out-of-set data
over the i-vector space. we showed that
the KS values of
the proposed method
have a nicely separated distribution,
and, when integrated into the open-set language identification system, we achieve a
thirty-three percent relative reduction in identification cost
over the closed-set
system
okay, thank you for your attention
so if you go back to slide fifteen:
did you
try different partitions of in-set and out-of-set languages, and did this
make much of a difference for your results?
well, no, we selected
thirty and twenty languages
so this was on the next slide, the thirty and twenty; you didn't
try different portions. do you think this would have made a difference
in your out-of-set detection?
yes...
i don't know what you mean by making a difference, but
the results may be different while the outcome
will be the same: this KS system
is the winner
among the other systems
i see, but maybe the amount by which one
method wins could have been
different had you selected differently
we also ran it with random selection, which is not supervised, on the selected target languages,
thirty in-set and twenty out-of-set
are there other questions?
for the one-class SVM, what was the kernel that you used?
was the kernel linear?
a polynomial kernel
and
between the two methods that you used,
the KS test and the one-class SVM,
which one is more efficient?
which was the faster one?
well
my method was fast,
and kNN was also fast.
i didn't look carefully at the speed, but
i think the one-class SVM, the distance to class
means,
and kNN were more or less the same in speed,
but i didn't do a careful step-by-step
speed evaluation
if there are no other questions, let's thank the speaker again, please