Okay, moving on. This is joint work between me and my co-author.
I will first describe the challenge, then explain our approach, which is based on ladder networks, show how it can be applied to this task, and then show experiments.
So we start with the challenge itself. We have some labeled data and a lot of unlabeled data that is given as a development set. The task is a standard classification task, but with two differences: first, we have unlabeled data that we want to use as part of the classification training, and second, we have out-of-set data that we need to take care of.
So in this talk I will focus on these two challenges: how to use the unlabeled data as part of training, and how to take care of the out-of-set languages.
There are fifty in-set languages; some of them are very similar to each other and some are quite different. From the challenge's cost function we can roughly assume that one quarter of the test set and of the unlabeled data is out-of-set.
First, I want to discuss how we can use the unlabeled data for training in a deep learning framework. The standard way to use unlabeled data is pre-training: instead of a random initialization of the network, we use the unlabeled data to pre-train it. There are two popular ways to do pre-training: one is based on restricted Boltzmann machines, the second on denoising autoencoders. In both of them the pre-training is only used to extract an initialization, and after that the unlabeled data is essentially forgotten.
Let me briefly recall how a denoising autoencoder works: we take a data point, add noise to it, and try to reconstruct from the noisy version something that is as similar as possible to the clean data.
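As a rough sketch of what this looks like in code (a minimal version written only for illustration; the dimensions, noise level, and MSE loss are my assumptions, not the exact setup):

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    def __init__(self, dim_in=400, dim_hidden=500, noise_std=0.3):
        super().__init__()
        self.noise_std = noise_std
        self.encoder = nn.Sequential(nn.Linear(dim_in, dim_hidden), nn.ReLU())
        self.decoder = nn.Linear(dim_hidden, dim_in)

    def forward(self, x):
        x_noisy = x + self.noise_std * torch.randn_like(x)   # corrupt the input
        return self.decoder(self.encoder(x_noisy))           # reconstruct from the corrupted version

# The training target is the *clean* input:
# loss = torch.nn.functional.mse_loss(model(x), x)
```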
In our approach we use a generalization of the denoising autoencoder, called the ladder network, that does not work only on the input data points but across the entire network: the cost function includes the reconstruction of the input, but also the reconstruction of the hidden layers. I will explain this in more detail.
This is the architecture. It is a standard feed-forward network with a softmax classification layer, and this is what we apply to the labeled data. For the unlabeled data we use the same network with the same parameters, but at each step we add noise, and this is the important point, not only to the input data but to each of the hidden layers, and then we try to reconstruct the hidden layers. The cost function requires that the reconstruction of each layer will be very close to the clean hidden layers.
So the network has the form of an encoder and a decoder: in the encoder each hidden layer is corrupted with noise at each step, and in the decoder we reconstruct, that is, denoise, the hidden layers.
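As a rough sketch of the two encoder passes, just for intuition (the layer sizes and the noise level are placeholders I chose, and I am leaving out the batch normalization that the full ladder network uses):

```python
import torch
import torch.nn as nn

def encoder_pass(x, layers, noise_std=0.0):
    """Feed-forward pass that optionally corrupts the input and every hidden layer."""
    zs = []                                    # pre-activations, later used as reconstruction targets
    h = x + noise_std * torch.randn_like(x) if noise_std > 0 else x
    for i, layer in enumerate(layers):
        z = layer(h)
        if noise_std > 0:
            z = z + noise_std * torch.randn_like(z)          # the key point: noise on every layer
        zs.append(z)
        h = torch.relu(z) if i < len(layers) - 1 else z      # last layer feeds the softmax
    return h, zs

layers = nn.ModuleList([nn.Linear(400, 500), nn.Linear(500, 500), nn.Linear(500, 51)])
x = torch.randn(32, 400)                                     # a batch of i-vectors (placeholder size)
clean_logits, clean_zs = encoder_pass(x, layers, noise_std=0.0)  # clean pass: reconstruction targets
noisy_logits, noisy_zs = encoder_pass(x, layers, noise_std=0.3)  # corrupted pass: what the decoder denoises
```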
To be more specific, the main question in this model is of course the reconstruction: how do we reconstruct a hidden layer? Since we assume that additive Gaussian noise is applied to the hidden layer, we use a linear estimation: we estimate the clean hidden layer as a linear function of its noisy version, where the coefficients of this linear function are computed from the reconstruction of the layer above, each one as a linear function plus a sigmoid function, and we do this separately for each coordinate. The concept is similar to an MMSE estimation. So the intuition here is that we reconstruct each noisy layer based on two sources: the noisy layer itself and the previously reconstructed layer above it.
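A rough sketch of this per-coordinate denoising function, using the parameterization proposed in the original ladder-network paper (Rasmus et al. 2015), which I assume here since the talk does not spell it out:

```python
import torch
import torch.nn as nn

class Combinator(nn.Module):
    """Reconstruct a clean layer from its noisy version z_tilde and the signal u
    coming from the reconstruction of the layer above (per coordinate)."""
    def __init__(self, dim):
        super().__init__()
        self.a = nn.Parameter(torch.zeros(10, dim))   # ten learned coefficient vectors

    def forward(self, z_tilde, u):
        a = self.a
        mu = a[0] * torch.sigmoid(a[1] * u + a[2]) + a[3] * u + a[4]   # sigmoid part + linear part
        v  = a[5] * torch.sigmoid(a[6] * u + a[7]) + a[8] * u + a[9]
        return (z_tilde - mu) * v + mu   # linear in the noisy layer, steered by the layer above
```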
So now we have training data that consists of both labeled and unlabeled data. The training cost function is the standard cross-entropy applied to the labeled data, plus a reconstruction error applied to all the data, labeled and unlabeled, where we reconstruct not just the input but each of the hidden layers.
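In code the combined cost looks roughly like this (the per-layer weights and the convention of marking unlabeled examples with label -1 are my assumptions for the sketch):

```python
import torch.nn.functional as F

def semi_supervised_loss(noisy_logits, labels, recon_zs, clean_zs, layer_weights):
    """Cross-entropy on the labeled part of the batch plus per-layer denoising cost on all of it."""
    labeled = labels >= 0                                            # labels == -1 marks unlabeled examples
    supervised = F.cross_entropy(noisy_logits[labeled], labels[labeled])
    denoising = sum(w * F.mse_loss(z_hat, z)                         # reconstruction of every layer
                    for w, z_hat, z in zip(layer_weights, recon_zs, clean_zs))
    return supervised + denoising
```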
So if we go back to this picture: for the unlabeled data we inject the noisy version into the network and then try to reconstruct it such that it will be very similar to the clean data. In this way we use the unlabeled data not just for pre-training, which is then forgotten, but as an explicit part of the training: the training data of the neural network is explicitly both the labeled and the unlabeled data.
This is an illustration of the power of ladder networks. It is a result on the standard MNIST dataset; the horizontal axis is the number of labels and the vertical axis is the error, and we can see that using ladder networks we can obtain performance comparable to the fully supervised case using only something like one or two hundred labeled examples, with all the other images used as unlabeled data.
Okay, so this is the idea of ladder networks, which we apply to this challenge. Now I want to discuss how we can incorporate the out-of-set handling into this framework. We use the same network architecture, but we add another class: we have fifty classes, one for each of the known languages, and one additional out-of-set class. The question is how we can train this out-of-set label.
For that we used a label distribution regularization. What do we mean? Assuming we do mini-batch training, we can compute the frequency with which the classifier predicts each language: we can count how many times we classified the language as English, how many times as Hindi, and how many times as out-of-set. We also have a rough estimate of what this histogram, this distribution, should be: we can assume that all the in-set languages appear roughly uniformly, and that out-of-set is roughly one quarter of the data. So we can add a cross-entropy score function that measures the discrepancy between the label distribution produced by the classifier and this prior. The main point is that we can do this because we have out-of-set examples in the unlabeled data; if we didn't, it would not work. In this challenge the unlabeled data does contain out-of-set examples, so we can assume that some of the predicted labels should be out-of-set.
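A rough sketch of this regularizer (the prior below encodes the rough assumption of one quarter out-of-set mass and a uniform split over the fifty in-set languages):

```python
import torch
import torch.nn.functional as F

NUM_LANGS = 50
prior = torch.full((NUM_LANGS + 1,), 0.75 / NUM_LANGS)   # in-set languages share the remaining 3/4 ...
prior[-1] = 0.25                                          # ... and out-of-set gets roughly one quarter

def label_distribution_loss(logits):
    """Cross-entropy between the assumed prior and the batch-averaged predicted distribution."""
    p_batch = F.softmax(logits, dim=1).mean(dim=0)        # empirical label histogram of the batch
    return -(prior * torch.log(p_batch + 1e-8)).sum()
```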
So this is the additional cost function. Altogether we have two cost functions: one is the ladder cost function, which is the semi-supervised cost, and the other is the discrepancy cost on the label distribution.
Okay, now I move to the experiments. The input is the i-vectors, we use a network with ReLU hidden layers, and we have a softmax output with fifty-one classes.
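As a sketch, the classifier itself looks roughly like this (the hidden-layer sizes and the i-vector dimension here are placeholders, not the exact values we used):

```python
import torch.nn as nn

IVECTOR_DIM = 400    # assumed i-vector dimensionality
NUM_CLASSES = 51     # 50 in-set languages plus the out-of-set class

classifier = nn.Sequential(
    nn.Linear(IVECTOR_DIM, 500), nn.ReLU(),
    nn.Linear(500, 500), nn.ReLU(),
    nn.Linear(500, NUM_CLASSES),   # fed to a softmax / cross-entropy at training time
)
```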
This is the experimental setup. It is a simulation: we take some of the languages as in-set and the other languages as out-of-set, so in this simulation we know all the labels.
Here is an example of what happens if we use the baseline, without the ladder, and what happens if we add the ladder cost. We can see that we gain a significant improvement by using the unlabeled data. The price of doing this is that the network is harder to train: we need more epochs, but that is not a big issue, it just takes more time.
These are the results. We either use the ladder or not, and we either use the label statistics score or not. This is the baseline for our case. If we use the ladder, we get an improvement. If we use the label statistics, we also get an improvement, but not as much. And if we combine the two strategies, the first strategy for the unlabeled data and the second strategy for the out-of-set, we gain a significant improvement.
One remaining issue is the out-of-set statistics that the system provides. For example, here we classified about thirty percent of the development set as out-of-set, so we tried to adjust the number of out-of-set decisions to be one quarter, because we know that roughly this should be the number. In the baseline this adjustment gave an improvement, but here it does not; actually the performance decreases. So the previous combination still gave the best results.
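Just to illustrate one way such an adjustment could be implemented (this is a hypothetical sketch, not necessarily the exact procedure we used): rank the utterances by their out-of-set posterior and relabel the top quarter.

```python
import torch

def force_out_of_set_quota(posteriors, quota=0.25):
    """posteriors: (N, 51) softmax outputs, last column = out-of-set class.
    Relabel exactly a `quota` fraction of the utterances as out-of-set."""
    preds = posteriors[:, :-1].argmax(dim=1)                       # best in-set language for everyone
    n_oos = int(quota * posteriors.shape[0])
    most_oos = posteriors[:, -1].argsort(descending=True)[:n_oos]  # most out-of-set-like utterances
    preds[most_oos] = posteriors.shape[1] - 1                      # force those to the out-of-set label
    return preds
```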
Okay, so to conclude: we tried to apply a deep learning strategy that takes care of both challenges of this task, the unlabeled data and the out-of-set languages. For the unlabeled data we use the ladder network, which explicitly takes the unlabeled data into account during training. For the out-of-set we use a label distribution score that is also used in the training. We showed that by combining these two methodologies we can significantly improve the results. Okay, thank you.
We have time for questions.
Can you tell us exactly how much the unsupervised data helps you in the training? For example, I imagine you also apply the reconstruction cost on the same labeled training data, which acts like a regularization of the classification. Did you compare, to separate how much of the gain is due to that regularization versus the unsupervised data itself, to measure how much you actually gain from the unlabeled data?
It's a good question.
I didn't check this exactly, but yes, the ladder cost is also used as a regularization. You could think that dropout is a similar strategy, but I am not sure it would give the same effect. We did try it, and if I remember correctly it helps. But in any case we need the unlabeled data, because it contains the out-of-set examples.
I want to know if you applied some kind of pre-processing to the i-vectors, for example some kind of normalization, something like that.
The i-vectors were provided by NIST. The results might improve with some preprocessing, I don't know, but we used the raw data.
If there are no other questions, let's thank the speaker again. So I think we're at the end of the session; I think we have a few...