0:00:15 | okay moving on so this is joint work of mine with |
---|
0:00:20 | bob dunn |
---|
0:00:23 | first i'll describe the i-vector challenge |
---|
0:00:27 | then i'll explain our approach which is based on ladder networks |
---|
0:00:33 | i'll show how it can be applied |
---|
0:00:35 | to this task and then show experiments |
---|
0:00:39 | so we start with the challenge itself |
---|
0:00:45 | so we have |
---|
0:00:47 | some labeled data and we have a lot of |
---|
0:00:53 | unlabeled data given as a development set and we have a test set |
---|
0:00:59 | actually it's |
---|
0:01:01 | a classification task given the features it's just a standard classification task but |
---|
0:01:08 | there are two differences first we have |
---|
0:01:11 | unlabeled data that we want to use as part of the classification and we |
---|
0:01:15 | have out of set |
---|
0:01:17 | data that we want to take care of |
---|
0:01:19 | so in this talk i will focus on these |
---|
0:01:22 | two challenges how to use the unlabeled data |
---|
0:01:26 | as part of training and how to take care of out of set data |
---|
0:01:30 | so there are fifty in-set languages |
---|
0:01:34 | some of them are very similar to each other some of them are different |
---|
0:01:40 | this is how the data is constructed and from the construction we can roughly assume |
---|
0:01:45 | that |
---|
0:01:45 | one quarter of the test set and the unlabeled data is |
---|
0:01:51 | out of set |
---|
0:01:53 | we will use this fact |
---|
0:01:55 | okay |
---|
0:01:57 | so |
---|
0:01:59 | i want to discuss how we can use the unlabeled data for training in the case |
---|
0:02:04 | of deep learning in the framework of deep networks |
---|
0:02:07 | so the standard way to use unlabeled data |
---|
0:02:11 | is to use it for pre-training |
---|
0:02:13 | instead of a random initialization of the network we use the unlabeled data |
---|
0:02:19 | to pre-train the network |
---|
0:02:23 | there are two popular methods to do pre-training |
---|
0:02:26 | the first is based on |
---|
0:02:28 | a restricted boltzmann machine |
---|
0:02:31 | the second method is based on the denoising autoencoder |
---|
0:02:36 | but in both of them pre-training |
---|
0:02:38 | is used only to initialize the network |
---|
0:02:40 | and then we essentially forget about |
---|
0:02:42 | the unlabeled data |
---|
0:02:44 | so how does |
---|
0:02:46 | a denoising autoencoder work |
---|
0:02:50 | well we have |
---|
0:02:53 | a data point we add noise to it and we try to reconstruct from the noisy data a structure |
---|
0:03:02 | that is |
---|
0:03:02 | similar if not equal to the clean data |
---|
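To make the denoising-autoencoder idea just described concrete, here is a minimal sketch in PyTorch; the input dimension, hidden size, and noise level are illustrative assumptions, not settings reported in the talk.

```python
# Minimal denoising autoencoder sketch (PyTorch). Dimensions and noise level
# are illustrative assumptions, not the settings used by the authors.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenoisingAutoencoder(nn.Module):
    def __init__(self, dim=400, hidden=256, noise_std=0.3):
        super().__init__()
        self.noise_std = noise_std
        self.encoder = nn.Linear(dim, hidden)
        self.decoder = nn.Linear(hidden, dim)

    def forward(self, x):
        x_noisy = x + self.noise_std * torch.randn_like(x)  # corrupt the clean input
        h = torch.relu(self.encoder(x_noisy))
        x_hat = self.decoder(h)
        return F.mse_loss(x_hat, x)  # reconstruction should match the clean input

# usage: loss = DenoisingAutoencoder()(batch)  where batch has shape (N, 400)
```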
0:03:07 | in our approach we use a generalization |
---|
0:03:11 | of the autoencoder which is called the ladder network |
---|
0:03:15 | the idea is |
---|
0:03:18 | not to denoise just the data points but to add noise across the entire network |
---|
0:03:26 | so what we actually do is |
---|
0:03:29 | that the cost function is not only the reconstruction of |
---|
0:03:33 | the input but also the reconstruction of the hidden layers |
---|
0:03:37 | i will explain this in more detail |
---|
0:03:40 | so this is the ladder network |
---|
0:03:43 | this is a standard |
---|
0:03:45 | feed-forward network we have |
---|
0:03:47 | the input the hidden layers and a softmax classification on top of the architecture |
---|
0:03:55 | and this is what we apply to the labeled data |
---|
0:04:00 | in the case of unlabeled data we use the same network |
---|
0:04:05 | the same parameters but at each step we add noise |
---|
0:04:10 | to the |
---|
0:04:11 | data and most importantly we add noise to each of the hidden layers and we |
---|
0:04:16 | try to reconstruct there is a decoder that tries to reconstruct |
---|
0:04:21 | the hidden layers so each reconstruction is based on the noisy version of the |
---|
0:04:27 | layer |
---|
0:04:28 | and on the reconstruction |
---|
0:04:31 | of the previous layer |
---|
0:04:33 | and the cost function is that the reconstruction |
---|
0:04:37 | will be |
---|
0:04:38 | very close to the clean |
---|
0:04:41 | hidden layers |
---|
0:04:45 | this is the formulation |
---|
0:04:49 | more formally so we have an encoder and a decoder in the encoder |
---|
0:04:55 | each hidden layer is |
---|
0:04:56 | corrupted with noise |
---|
0:05:00 | at each step |
---|
0:05:02 | and in the decoder we reconstruct the denoised hidden layers |
---|
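A schematic of the clean and noisy encoder passes and a simplified decoder reconstruction, roughly in the spirit of what was just described. The layer sizes, noise level, and the per-layer denoiser are assumptions; the actual ladder decoder, discussed next, also uses the reconstruction coming from the layer above.

```python
# Schematic ladder-style passes (PyTorch): one clean encoder pass, one noisy
# pass through the same weights, and a simplified decoder that denoises each
# hidden layer. Layer sizes, noise level, and the simple per-layer decoder are
# assumptions for illustration only.
import torch
import torch.nn as nn

sizes = [400, 500, 500, 51]   # assumed: i-vector input, two hidden layers, 51 classes
enc = nn.ModuleList([nn.Linear(a, b) for a, b in zip(sizes[:-1], sizes[1:])])
dec = nn.ModuleList([nn.Linear(b, b) for b in sizes[1:]])
noise_std = 0.3

def encode(x, noisy):
    zs, h = [], x
    for layer in enc:
        z = layer(h)
        if noisy:
            z = z + noise_std * torch.randn_like(z)  # corrupt every layer
        zs.append(z)
        h = torch.relu(z)
    return zs

x = torch.randn(8, sizes[0])              # a dummy batch of i-vectors
clean_zs = encode(x, noisy=False)         # clean targets
noisy_zs = encode(x, noisy=True)          # what the decoder must denoise
recon_cost = sum(((d(zn) - zc.detach()) ** 2).mean()
                 for d, zn, zc in zip(dec, noisy_zs, clean_zs))
```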
0:05:08 | so to be more specific the main problem |
---|
0:05:13 | in this model is of course how we can reconstruct the hidden |
---|
0:05:17 | layer based on the noisy |
---|
0:05:20 | layer and of course the reconstruction of the previous layer |
---|
0:05:25 | here we assume that we apply additive gaussian noise |
---|
0:05:31 | to the hidden layers so we use a linear estimation |
---|
0:05:37 | namely we estimate |
---|
0:05:41 | the clean hidden layer as a linear function |
---|
0:05:45 | of the noisy hidden layer |
---|
0:05:49 | where the coefficients of the linear function are taken |
---|
0:05:54 | in a data-driven way from |
---|
0:05:57 | the reconstruction of the previous layer the denoising |
---|
0:06:02 | is modeled by a linear function plus a sigmoid function |
---|
0:06:08 | and we do this for each neuron |
---|
0:06:10 | separately the concept is |
---|
0:06:13 | similar in all the layers |
---|
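This per-neuron denoising function can be written roughly as below. The parameterization shown is the common ladder-network combinator, a linear term plus a sigmoid term with per-neuron coefficients driven by the reconstruction from the layer above; it matches the description here but is an assumption rather than the speaker's exact formula.

```python
# Sketch of a per-neuron denoising (combinator) function: the clean value is
# estimated as a linear function of the noisy value, with coefficients produced
# from the upper-layer reconstruction via linear-plus-sigmoid expressions.
# This particular parameterization is assumed, not taken from the talk.
import torch
import torch.nn as nn

class Combinator(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # ten per-neuron parameters, as in the usual ladder combinator
        self.a = nn.Parameter(torch.zeros(10, dim))
        with torch.no_grad():
            self.a[3].fill_(1.0)   # initialize so that mu(u) is roughly u

    def forward(self, z_noisy, u):   # u: reconstruction from the layer above
        a = self.a
        mu = a[0] * torch.sigmoid(a[1] * u + a[2]) + a[3] * u + a[4]
        v  = a[5] * torch.sigmoid(a[6] * u + a[7]) + a[8] * u + a[9]
        # linear estimate of the clean value from the noisy one, per neuron
        return (z_noisy - mu) * v + mu
```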
0:06:15 | so this is the intuition |
---|
0:06:18 | the idea here is to reconstruct the noisy layer based on the |
---|
0:06:28 | previously reconstructed layer |
---|
0:06:32 | okay so once we have this |
---|
0:06:35 | we actually have training data that is based on |
---|
0:06:39 | both labeled data |
---|
0:06:42 | and unlabeled data |
---|
0:06:44 | the training cost function will be |
---|
0:06:48 | standard cross entropy |
---|
0:06:50 | applied on the labeled data |
---|
0:06:52 | and reconstruction error applied on all the data both labeled and unlabeled and by reconstruction i mean |
---|
0:06:59 | not just the input but |
---|
0:07:01 | the reconstruction of each of the hidden layers |
---|
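Schematically, the combined objective just described (cross entropy on labeled batches plus layer-wise reconstruction on all batches) could be written as below; the model interface and the weight lam are assumptions made for the sketch.

```python
# Sketch of the combined training cost: cross entropy on labeled data plus
# layer-wise reconstruction error on all data (labeled and unlabeled). The
# interface of `model` (returning logits, clean hidden layers, and their
# reconstructions) and the weight `lam` are assumptions for illustration.
import torch.nn.functional as F

def total_cost(model, x_labeled, y, x_unlabeled, lam=1.0):
    logits, clean, recon = model(x_labeled)
    cost = F.cross_entropy(logits, y)                     # supervised term
    cost = cost + lam * sum(F.mse_loss(r, c.detach()) for r, c in zip(recon, clean))
    _, clean_u, recon_u = model(x_unlabeled)              # unlabeled data also contributes
    cost = cost + lam * sum(F.mse_loss(r, c.detach()) for r, c in zip(recon_u, clean_u))
    return cost
```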
0:07:04 | so |
---|
0:07:10 | if we go back to this picture |
---|
0:07:14 | for the unlabeled data |
---|
0:07:18 | we inject the noisy version into the network and then try to reconstruct it such |
---|
0:07:25 | that it will be very similar to the clean data |
---|
0:07:32 | so in this way we use the unlabeled data not just as pre-training |
---|
0:07:37 | and then forget about it but we use it as part |
---|
0:07:42 | of the training |
---|
0:07:43 | the training of the neural network is explicitly |
---|
0:07:47 | based on both the labeled and the unlabeled data |
---|
0:07:54 | this is an illustration of the power of ladder networks |
---|
0:08:00 | this is a result on the standard mnist |
---|
0:08:06 | dataset |
---|
0:08:08 | this is the number of labels and this is the classification |
---|
0:08:12 | error and we can see that using ladder networks |
---|
0:08:16 | we can have |
---|
0:08:19 | performance that is equivalent to one based on |
---|
0:08:23 | only something like one or two hundred labeled examples |
---|
0:08:29 | while all the other |
---|
0:08:32 | images are used as unlabeled data |
---|
0:08:35 | okay so this is the idea of ladder networks |
---|
0:08:40 | now we will apply it to this challenge |
---|
0:08:46 | now we want to discuss how we can incorporate out of set in this framework |
---|
0:08:52 | so we use the same network architecture but we add |
---|
0:08:58 | another class a fifty-first so we have fifty classes one for each of the labels |
---|
0:09:05 | for each of the known languages and in addition an out-of-set class |
---|
0:09:10 | and how can we train the out-of-set label |
---|
0:09:15 | for that we use a label distribution regularization |
---|
0:09:20 | so what do we mean |
---|
0:09:22 | assuming we do batch training we can compute the frequency |
---|
0:09:29 | of the |
---|
0:09:33 | classification decisions the frequency of |
---|
0:09:39 | each of the classified languages so we can count how many times we classified the |
---|
0:09:44 | language as english how many times we classified it as hindi |
---|
0:09:47 | and so on and how many times we classified it as out of set |
---|
0:09:52 | and we have |
---|
0:09:54 | a rough estimation of what this histogram what this distribution should be |
---|
0:10:00 | we can assume that |
---|
0:10:03 | all languages should appear roughly |
---|
0:10:05 | equally |
---|
0:10:07 | and out of set should be roughly one quarter |
---|
0:10:10 | of the data |
---|
0:10:12 | so we can use |
---|
0:10:14 | a cross entropy |
---|
0:10:16 | score function |
---|
0:10:21 | to define a score for |
---|
0:10:24 | the label distribution of the classifier and the main point is that we can |
---|
0:10:30 | do it because we have |
---|
0:10:32 | out of set |
---|
0:10:34 | in the unlabeled data otherwise we couldn't |
---|
0:10:36 | in this challenge this is the main point |
---|
0:10:39 | in this challenge we have out of set in the unlabeled data so we can |
---|
0:10:43 | assume that in the unlabeled data some of |
---|
0:10:46 | the labels should be out of set |
---|
0:10:50 | okay so this defines an additional cost function so we have |
---|
0:10:55 | two |
---|
0:10:56 | cost functions one is the ladder cost function that is the semi-supervised cost function |
---|
0:11:01 | and the other one is |
---|
0:11:06 | the discrepancy score of the label distributions |
---|
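A sketch of how such a label-distribution score could be computed on a batch. The prior follows the talk (the fifty languages roughly equally likely and out-of-set about one quarter of the data); the exact form, a cross entropy on the batch-average softmax, is an assumption about the implementation.

```python
# Sketch of the label-distribution regularizer: compare the batch-average
# predicted class distribution with a rough prior in which the fifty in-set
# languages share three quarters of the mass and out-of-set takes one quarter.
# The cross-entropy-on-mean-softmax form is an assumption.
import torch
import torch.nn.functional as F

def label_distribution_score(logits):          # logits: (batch, 51)
    prior = torch.full((51,), 0.75 / 50)
    prior[50] = 0.25                           # class 50 = out-of-set
    prior = prior.to(logits.device)
    batch_dist = F.softmax(logits, dim=1).mean(dim=0)   # empirical decision frequencies
    return -(prior * torch.log(batch_dist + 1e-8)).sum()
```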
0:11:11 | okay now i go to the experiments on the i-vector challenge data |
---|
0:11:17 | the input is the i-vectors we use |
---|
0:11:21 | a neural network with fully connected |
---|
0:11:24 | hidden layers |
---|
0:11:25 | and we have a softmax output with fifty-one |
---|
0:11:30 | classes |
---|
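Putting the pieces together, the classifier described here (i-vector input, fully connected hidden layers, a 51-way softmax over the fifty languages plus out-of-set) could look like the sketch below; the i-vector dimensionality and hidden-layer widths are assumptions.

```python
# Sketch of the classification network described above. The i-vector dimension
# (400) and hidden widths (500) are assumptions; only the 51-way output
# (fifty languages plus one out-of-set class) follows the talk.
import torch.nn as nn

classifier = nn.Sequential(
    nn.Linear(400, 500), nn.ReLU(),
    nn.Linear(500, 500), nn.ReLU(),
    nn.Linear(500, 51),          # logits; index 50 is the out-of-set class
)
# predictions: classifier(ivectors).argmax(dim=1)
```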
0:11:31 | so this is an example |
---|
0:11:34 | this is a simulation where |
---|
0:11:37 | we take some of the languages |
---|
0:11:40 | as in-set and the other languages |
---|
0:11:43 | as out of set so in this simulation we know |
---|
0:11:47 | all the labels |
---|
0:11:49 | and here is an example of what happens if we use |
---|
0:11:53 | the baseline |
---|
0:11:54 | without the ladder and if we add |
---|
0:11:59 | the ladder |
---|
0:12:01 | so we can see that |
---|
0:12:03 | we gain |
---|
0:12:06 | a significant improvement using the unlabeled data |
---|
0:12:10 | the price of doing it is that it's more difficult to learn the network |
---|
0:12:17 | the price is |
---|
0:12:18 | that we need to do more epochs but it's not a big issue it just |
---|
0:12:23 | takes more time |
---|
0:12:26 | so these are the results on the progress set |
---|
0:12:31 | so we either use the ladder or not and |
---|
0:12:37 | either use the label statistics |
---|
0:12:39 | score or not |
---|
0:12:43 | so this is the baseline for our case |
---|
0:12:48 | so if we use the ladder |
---|
0:12:51 | we get an improvement |
---|
0:12:54 | if we use the label statistics |
---|
0:12:57 | we also get an improvement but not as much and if we |
---|
0:13:02 | combine |
---|
0:13:03 | the two strategies |
---|
0:13:06 | the first strategy for the unlabeled data and the second strategy |
---|
0:13:10 | for out of set we gain a significant |
---|
0:13:14 | improvement |
---|
0:13:15 | and this is |
---|
0:13:19 | a problem with the out of set statistics |
---|
0:13:23 | that the |
---|
0:13:25 | system provides |
---|
0:13:28 | for example here |
---|
0:13:32 | we classified about thirty percent |
---|
0:13:36 | of the development set as out of set |
---|
0:13:39 | so we tried |
---|
0:13:41 | to |
---|
0:13:45 | adjust the number of out of set to be one quarter |
---|
0:13:50 | because we know that roughly this should be the number but |
---|
0:13:54 | in the baseline we got an improvement but here |
---|
0:13:58 | it doesn't help |
---|
0:14:01 | actually the performance decreases |
---|
0:14:04 | so still this was |
---|
0:14:09 | the best result |
---|
0:14:12 | okay so to conclude we tried to apply here a recent deep learning strategy that |
---|
0:14:19 | takes care of both |
---|
0:14:21 | challenges of this task |
---|
0:14:25 | unlabeled data and out of set for the unlabeled data |
---|
0:14:28 | we use the ladder network that explicitly |
---|
0:14:31 | takes the unlabeled data into account while training |
---|
0:14:35 | for out of set we use a label distribution score |
---|
0:14:40 | that is also |
---|
0:14:43 | used in the training |
---|
0:14:45 | we show that |
---|
0:14:46 | combining |
---|
0:14:47 | these two methodologies we can |
---|
0:14:50 | improve the results |
---|
0:14:54 | okay thank you |
---|
0:15:01 | we have time for questions |
---|
0:15:13 | can you tell us exactly how much this unsupervised data helps you in the training |
---|
0:15:21 | like |
---|
0:15:22 | for example i imagine you could do the autoencoder reconstruction on the same training |
---|
0:15:27 | data that you have like a regularization on the classification did you |
---|
0:15:32 | compare between adding it just as a regularization on the supervised data and using the supervised plus unsupervised |
---|
0:15:37 | setup to measure how much you gain by adding the unsupervised data it's a |
---|
0:15:42 | good question |
---|
0:15:43 | i didn't try that but |
---|
0:15:48 | the ladder is used also as a regularization |
---|
0:15:52 | i think that it acts like a dropout strategy |
---|
0:16:00 | but i'm not |
---|
0:16:02 | not sure |
---|
0:16:04 | this is just intuition though |
---|
0:16:06 | we did try it |
---|
0:16:08 | and if i remember it helps |
---|
0:16:12 | but anyhow we need |
---|
0:16:14 | the unlabeled data because it has out of set data |
---|
0:16:28 | but it's still |
---|
0:16:33 | i want to know if you were applying some kind of pre-processing for this |
---|
0:16:37 | for the i-vectors |
---|
0:16:38 | like normalization or |
---|
0:16:42 | anything like that |
---|
0:16:49 | so the i-vectors were provided by nist |
---|
0:16:52 | they might have done some preprocessing |
---|
0:16:56 | i don't know but we used the raw data |
---|
0:17:11 | if there are no other questions let's thank the speaker again |
---|
0:17:18 | so i think we're at the end of the session i think we have |
---|
0:17:22 | a few |
---|