Okay, moving on. This is joint work between me and my co-author.
I will first describe the challenge, then explain our approach, which is based on ladder networks, show how it can be applied to this task, and then show experiments.
So we start with the challenge itself. We have some labeled data and a lot of unlabeled data that is given as a development set. The task is a standard classification task, but with two differences: first, we have unlabeled data that we want to use as part of the classification training, and second, we have out-of-set data that we need to take care of.
So in this talk I will focus on these two challenges: how to use the unlabeled data as part of training, and how to take care of the out-of-set languages.
There are fifty in-set languages; some of them are very similar to each other and some are quite different. From the challenge's cost function we can roughly assume that one quarter of the test set and of the unlabeled data is out-of-set.
First, I want to discuss how we can use the unlabeled data for training in a deep learning framework. The standard way to use unlabeled data is pre-training: instead of a random initialization of the network, we use the unlabeled data to pre-train it. There are two popular ways to do pre-training: one is based on restricted Boltzmann machines, the second on denoising autoencoders. In both of them the pre-training is only used to extract an initialization, and after that the unlabeled data is essentially forgotten.
Let me briefly recall how a denoising autoencoder works: we take a data point, add noise to it, and try to reconstruct from the noisy version something that is as similar as possible to the clean data.
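As a rough sketch of what this looks like in code (a minimal version written only for illustration; the dimensions, noise level, and MSE loss are my assumptions, not the exact setup):

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    def __init__(self, dim_in=400, dim_hidden=500, noise_std=0.3):
        super().__init__()
        self.noise_std = noise_std
        self.encoder = nn.Sequential(nn.Linear(dim_in, dim_hidden), nn.ReLU())
        self.decoder = nn.Linear(dim_hidden, dim_in)

    def forward(self, x):
        x_noisy = x + self.noise_std * torch.randn_like(x)   # corrupt the input
        return self.decoder(self.encoder(x_noisy))           # reconstruct from the corrupted version

# The training target is the *clean* input:
# loss = torch.nn.functional.mse_loss(model(x), x)
```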
In our approach we use a generalization of the denoising autoencoder, called the ladder network, that does not work only on the input data points but across the entire network: the cost function includes the reconstruction of the input, but also the reconstruction of the hidden layers. I will explain this in more detail.
This is the architecture. It is a standard feed-forward network with a softmax classification layer, and this is what we apply to the labeled data. For the unlabeled data we use the same network with the same parameters, but at each step we add noise, and this is the important point, not only to the input data but to each of the hidden layers, and then we try to reconstruct the hidden layers. The cost function requires that the reconstruction of each layer will be very close to the clean hidden layers.
So the network has the form of an encoder and a decoder: in the encoder each hidden layer is corrupted with noise at each step, and in the decoder we reconstruct, that is, denoise, the hidden layers.
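As a rough sketch of the two encoder passes, just for intuition (the layer sizes and the noise level are placeholders I chose, and I am leaving out the batch normalization that the full ladder network uses):

```python
import torch
import torch.nn as nn

def encoder_pass(x, layers, noise_std=0.0):
    """Feed-forward pass that optionally corrupts the input and every hidden layer."""
    zs = []                                    # pre-activations, later used as reconstruction targets
    h = x + noise_std * torch.randn_like(x) if noise_std > 0 else x
    for i, layer in enumerate(layers):
        z = layer(h)
        if noise_std > 0:
            z = z + noise_std * torch.randn_like(z)          # the key point: noise on every layer
        zs.append(z)
        h = torch.relu(z) if i < len(layers) - 1 else z      # last layer feeds the softmax
    return h, zs

layers = nn.ModuleList([nn.Linear(400, 500), nn.Linear(500, 500), nn.Linear(500, 51)])
x = torch.randn(32, 400)                                     # a batch of i-vectors (placeholder size)
clean_logits, clean_zs = encoder_pass(x, layers, noise_std=0.0)  # clean pass: reconstruction targets
noisy_logits, noisy_zs = encoder_pass(x, layers, noise_std=0.3)  # corrupted pass: what the decoder denoises
```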
To be more specific, the main question in this model is of course the reconstruction: how do we reconstruct a hidden layer? Since we assume that additive Gaussian noise is applied to the hidden layer, we use a linear estimation: we estimate the clean hidden layer as a linear function of its noisy version, where the coefficients of this linear function are computed from the reconstruction of the layer above, each one as a linear function plus a sigmoid function, and we do this separately for each coordinate. The concept is similar to an MMSE estimation. So the intuition here is that we reconstruct each noisy layer based on two sources: the noisy layer itself and the previously reconstructed layer above it.
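A rough sketch of this per-coordinate denoising function, using the parameterization proposed in the original ladder-network paper (Rasmus et al. 2015), which I assume here since the talk does not spell it out:

```python
import torch
import torch.nn as nn

class Combinator(nn.Module):
    """Reconstruct a clean layer from its noisy version z_tilde and the signal u
    coming from the reconstruction of the layer above (per coordinate)."""
    def __init__(self, dim):
        super().__init__()
        self.a = nn.Parameter(torch.zeros(10, dim))   # ten learned coefficient vectors

    def forward(self, z_tilde, u):
        a = self.a
        mu = a[0] * torch.sigmoid(a[1] * u + a[2]) + a[3] * u + a[4]   # sigmoid part + linear part
        v  = a[5] * torch.sigmoid(a[6] * u + a[7]) + a[8] * u + a[9]
        return (z_tilde - mu) * v + mu   # linear in the noisy layer, steered by the layer above
```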
So now we have training data that consists of both labeled and unlabeled data. The training cost function is the standard cross-entropy applied to the labeled data, plus a reconstruction error applied to all the data, labeled and unlabeled, where we reconstruct not just the input but each of the hidden layers.
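In code the combined cost looks roughly like this (the per-layer weights and the convention of marking unlabeled examples with label -1 are my assumptions for the sketch):

```python
import torch.nn.functional as F

def semi_supervised_loss(noisy_logits, labels, recon_zs, clean_zs, layer_weights):
    """Cross-entropy on the labeled part of the batch plus per-layer denoising cost on all of it."""
    labeled = labels >= 0                                            # labels == -1 marks unlabeled examples
    supervised = F.cross_entropy(noisy_logits[labeled], labels[labeled])
    denoising = sum(w * F.mse_loss(z_hat, z)                         # reconstruction of every layer
                    for w, z_hat, z in zip(layer_weights, recon_zs, clean_zs))
    return supervised + denoising
```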
So if we go back to this picture: for the unlabeled data we inject the noisy version into the network and then try to reconstruct it such that it will be very similar to the clean data. In this way we use the unlabeled data not just for pre-training, which is then forgotten, but as an explicit part of the training: the training data of the neural network is explicitly both the labeled and the unlabeled data.
This is an illustration of the power of ladder networks. It is a result on the standard MNIST dataset; the horizontal axis is the number of labels and the vertical axis is the error, and we can see that using ladder networks we can obtain performance comparable to the fully supervised case using only something like one or two hundred labeled examples, with all the other images used as unlabeled data.
Okay, so this is the idea of ladder networks, which we apply to this challenge. Now I want to discuss how we can incorporate the out-of-set handling into this framework. We use the same network architecture, but we add another class: we have fifty classes, one for each of the known languages, and one additional out-of-set class. The question is how we can train this out-of-set label.
For that we used a label distribution regularization. What do we mean? Assuming we do mini-batch training, we can compute the frequency with which the classifier predicts each language: we can count how many times we classified the language as English, how many times as Hindi, and how many times as out-of-set. We also have a rough estimate of what this histogram, this distribution, should be: we can assume that all the in-set languages appear roughly uniformly, and that out-of-set is roughly one quarter of the data. So we can add a cross-entropy score function that measures the discrepancy between the label distribution produced by the classifier and this prior. The main point is that we can do this because we have out-of-set examples in the unlabeled data; if we didn't, it would not work. In this challenge the unlabeled data does contain out-of-set examples, so we can assume that some of the predicted labels should be out-of-set.
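A rough sketch of this regularizer (the prior below encodes the rough assumption of one quarter out-of-set mass and a uniform split over the fifty in-set languages):

```python
import torch
import torch.nn.functional as F

NUM_LANGS = 50
prior = torch.full((NUM_LANGS + 1,), 0.75 / NUM_LANGS)   # in-set languages share the remaining 3/4 ...
prior[-1] = 0.25                                          # ... and out-of-set gets roughly one quarter

def label_distribution_loss(logits):
    """Cross-entropy between the assumed prior and the batch-averaged predicted distribution."""
    p_batch = F.softmax(logits, dim=1).mean(dim=0)        # empirical label histogram of the batch
    return -(prior * torch.log(p_batch + 1e-8)).sum()
```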
So this is the additional cost function. Altogether we have two cost functions: one is the ladder cost function, which is the semi-supervised cost, and the other is the discrepancy cost on the label distribution.
Okay, now I move to the experiments. The input is the i-vectors, we use a network with ReLU hidden layers, and we have a softmax output with fifty-one classes.
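As a sketch, the classifier itself looks roughly like this (the hidden-layer sizes and the i-vector dimension here are placeholders, not the exact values we used):

```python
import torch.nn as nn

IVECTOR_DIM = 400    # assumed i-vector dimensionality
NUM_CLASSES = 51     # 50 in-set languages plus the out-of-set class

classifier = nn.Sequential(
    nn.Linear(IVECTOR_DIM, 500), nn.ReLU(),
    nn.Linear(500, 500), nn.ReLU(),
    nn.Linear(500, NUM_CLASSES),   # fed to a softmax / cross-entropy at training time
)
```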
This is the experimental setup. It is a simulation: we take some of the languages as in-set and the other languages as out-of-set, so in this simulation we know all the labels.
Here is an example of what happens if we use the baseline, without the ladder, and what happens if we add the ladder cost. We can see that we gain a significant improvement by using the unlabeled data. The price of doing this is that the network is harder to train: we need more epochs, but that is not a big issue, it just takes more time.
These are the results. We either use the ladder or not, and we either use the label statistics score or not. This is the baseline for our case. If we use the ladder, we get an improvement. If we use the label statistics, we also get an improvement, but not as much. And if we combine the two strategies, the first strategy for the unlabeled data and the second strategy for the out-of-set, we gain a significant improvement.
One remaining issue is the out-of-set statistics that the system provides. For example, here we classified about thirty percent of the development set as out-of-set, so we tried to adjust the number of out-of-set decisions to be one quarter, because we know that roughly this should be the number. In the baseline this adjustment gave an improvement, but here it does not; actually the performance decreases. So the previous combination still gave the best results.
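Just to illustrate one way such an adjustment could be implemented (this is a hypothetical sketch, not necessarily the exact procedure we used): rank the utterances by their out-of-set posterior and relabel the top quarter.

```python
import torch

def force_out_of_set_quota(posteriors, quota=0.25):
    """posteriors: (N, 51) softmax outputs, last column = out-of-set class.
    Relabel exactly a `quota` fraction of the utterances as out-of-set."""
    preds = posteriors[:, :-1].argmax(dim=1)                       # best in-set language for everyone
    n_oos = int(quota * posteriors.shape[0])
    most_oos = posteriors[:, -1].argsort(descending=True)[:n_oos]  # most out-of-set-like utterances
    preds[most_oos] = posteriors.shape[1] - 1                      # force those to the out-of-set label
    return preds
```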
Okay, so to conclude: we tried to apply a deep learning strategy that takes care of both challenges of this task, the unlabeled data and the out-of-set languages. For the unlabeled data we use the ladder network, which explicitly takes the unlabeled data into account during training. For the out-of-set we use a label distribution score that is also used in the training. We showed that by combining these two methodologies we can significantly improve the results. Okay, thank you.
We have time for questions.
Can you tell us exactly how much the unsupervised data helps you in the training? For example, I imagine you also apply the reconstruction cost on the same labeled training data, which acts like a regularization of the classification. Did you compare, to separate how much of the gain is due to that regularization versus the unsupervised data itself, to measure how much you actually gain from the unlabeled data?
It's a good question.
I didn't check this exactly, but yes, the ladder cost is also used as a regularization. You could think that dropout is a similar strategy, but I am not sure it would give the same effect. We did try it, and if I remember correctly it helps. But in any case we need the unlabeled data, because it contains the out-of-set examples.
I want to know if you applied some kind of pre-processing to the i-vectors, for example some kind of normalization, something like that.
The i-vectors were provided by NIST. The results might improve with some preprocessing, I don't know, but we used the raw data.
If there are no other questions, let's thank the speaker again. So I think we're at the end of the session; I think we have a few...