0:00:00 | okay |
---|
0:00:01 | hi everyone |
---|
0:00:04 | and i work at the |
---|
0:00:06 | brno university of technology, and i will be giving this tutorial |
---|
0:00:11 | about |
---|
0:00:13 | end-to-end speaker verification |
---|
0:00:20 | so the topics |
---|
0:00:22 | to discuss in this tutorial: we'll start with some background and the definition of |
---|
0:00:28 | end-to-end training |
---|
0:00:30 | and then discuss some alternative training procedures which are often used |
---|
0:00:36 | and then talk about the motivation for end-to-end training |
---|
0:00:40 | and continue with some difficulties of end-to-end training |
---|
0:00:45 | and then |
---|
0:00:46 | review some of the |
---|
0:00:50 | existing work on end-to-end speaker recognition, though not in great |
---|
0:00:55 | detail |
---|
0:01:01 | and then we will wrap up with a summary and open up for |
---|
0:01:01 | questions |
---|
0:01:03 | i would also like to give some acknowledgements and thanks to my colleagues from but, |
---|
0:01:10 | with whom i have |
---|
0:01:11 | discussed these topics a lot |
---|
0:01:15 | so let's start with recognition |
---|
0:01:19 | this is |
---|
0:00:21 | a kind of typical |
---|
0:00:23 | pattern recognition scenario and, |
---|
0:00:26 | as is customary, we assume we have some features x and some labels y |
---|
0:01:30 | and we wish to find some function which is parameterized by |
---|
0:01:35 | theta, let's say, |
---|
0:01:37 | and which, |
---|
0:01:38 | given the |
---|
0:01:41 | features, predicts |
---|
0:01:42 | some label |
---|
0:01:46 | which should be close or equal to the true |
---|
0:01:50 | label |
---|
0:01:53 | to be more precise |
---|
0:01:54 | we would like the prediction to be such that some loss function which compares |
---|
0:02:00 | the |
---|
0:02:00 | predicted label with the true label is as small as possible on unseen data |
---|
0:02:06 | and the loss functions for example if we do call a classification it can be |
---|
0:02:11 | something that is |
---|
0:02:13 | zero if the predicted label is the same as the true label and one |
---|
0:02:16 | otherwise; this is a kind of error count, |
---|
0:02:18 | or zero-one loss |
---|
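The zero-one loss just described is simple to write down; here is a minimal sketch in Python (an illustration, with function names of my own, not something from the talk):

```python
# Zero-one loss: 0 when the predicted label equals the true label, 1 otherwise.
# Its average over a dataset is exactly the classification error rate.
def zero_one_loss(y_pred, y_true):
    return 0 if y_pred == y_true else 1

def error_rate(predictions, labels):
    return sum(zero_one_loss(p, y) for p, y in zip(predictions, labels)) / len(labels)

print(error_rate([0, 1, 1, 0], [0, 1, 0, 0]))  # one mistake out of four -> 0.25
```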
0:02:23 | of course ideally what we want to do is to |
---|
0:02:28 | minimize the expected loss on unseen test data, which we could calculate like this |
---|
0:02:36 | and here we use capital x and y to denote that they are unseen random |
---|
0:02:40 | variables |
---|
0:02:41 | but since we don't know the probability distribution of |
---|
0:02:44 | x and y we cannot do this |
---|
0:02:46 | exactly or explicitly |
---|
0:02:49 | so |
---|
0:02:51 | in the supervised learning problem we have access to some training data which would be |
---|
0:02:56 | many examples of features and labels |
---|
0:03:02 | and |
---|
0:03:02 | we compute the average loss on the training data and we are trying to minimize |
---|
0:03:07 | that |
---|
0:03:08 | and then we hope that |
---|
0:03:11 | this procedure here means that we will also get a low loss on |
---|
0:03:15 | unseen test data |
---|
0:03:18 | and this is called empirical risk minimisation |
---|
0:03:21 | and it is expected to work if |
---|
0:03:25 | the classifier that we use is not too powerful |
---|
0:03:29 | (to |
---|
0:03:30 | be precise, its vc dimension should be finite) and it |
---|
0:03:34 | also requires that the distribution of the loss |
---|
0:03:37 | does |
---|
0:03:38 | not have heavy tails, but for typical scenarios this |
---|
0:03:42 | empirical risk minimisation procedure is expected to work |
---|
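Empirical risk minimisation as just described can be sketched in a few lines; a toy illustration with made-up one-dimensional data and a single threshold parameter (all values here are my own, chosen for the example):

```python
# Empirical risk minimisation: choose the model parameter (here a 1-D decision
# threshold) that minimises the average zero-one loss on the training data,
# hoping this generalises to unseen data.
train_x = [0.1, 0.4, 0.5, 0.9, 1.2, 1.5]
train_y = [0,   0,   0,   1,   1,   1  ]

def empirical_risk(theta):
    preds = [1 if x > theta else 0 for x in train_x]
    return sum(p != y for p, y in zip(preds, train_y)) / len(train_y)

# grid search over candidate thresholds
best_theta = min([i / 10 for i in range(20)], key=empirical_risk)
print(best_theta, empirical_risk(best_theta))
```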
0:03:49 | so then let's talk about speaker recognition |
---|
0:03:53 | as probably most |
---|
0:03:55 | in the audience here knows, we have these three subtasks of speaker recognition |
---|
0:04:01 | it's speaker identification |
---|
0:04:04 | which basically is to classify within a closed set of speakers, so this is a very |
---|
0:04:09 | standard |
---|
0:04:11 | pattern recognition |
---|
0:04:14 | scenario, and then we have speaker verification, where we deal with |
---|
0:04:17 | open set as we say |
---|
0:04:19 | so the speakers that we may see in testing |
---|
0:04:22 | are not the same as those we have access to in training when building the model |
---|
0:04:27 | and our task is typically to say whether two segments or utterances are from the same |
---|
0:04:32 | speaker or not |
---|
0:04:34 | and then there's also speaker diarization which is |
---|
0:04:38 | to assign, in a long recording, each time region |
---|
0:04:43 | to a speaker |
---|
0:04:47 | so here i will focus on speaker verification because the speaker identification task is |
---|
0:04:53 | quite easy you know at least conceptually |
---|
0:04:57 | and speaker diarization is hard, and end-to-end approaches are still at a very early stage |
---|
0:05:03 | although some great |
---|
0:05:05 | work has been done |
---|
0:05:07 | it's maybe too early to focus on that in a tutorial |
---|
0:05:14 | so |
---|
0:05:15 | generally |
---|
0:05:17 | it's |
---|
0:05:19 | preferable |
---|
0:05:20 | if a classifier |
---|
0:05:22 | outputs |
---|
0:05:23 | not a hard prediction, like it is this class or that class, but |
---|
0:05:28 | rather probabilities of the different classes |
---|
0:05:31 | so we would like some |
---|
0:05:34 | classifier that gives an estimate of the probability of some label given the data |
---|
0:05:39 | in the case of speaker verification we would rather prefer it to output log-likelihood |
---|
0:05:45 | ratios |
---|
0:05:46 | because from that we can |
---|
0:05:49 | obtain |
---|
0:05:51 | the probability of a class given the data (the classes here are just target versus |
---|
0:05:55 | non-target) |
---|
0:05:57 | and we can |
---|
0:05:59 | do this based on a specified prior probability |
---|
0:06:03 | so it gives a bit more flexibility in how to use the |
---|
0:06:07 | system |
---|
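That flexibility can be sketched concretely; given an LLR and any assumed prior probability of a target trial, Bayes' rule yields the posterior (the function name is my own, for illustration):

```python
import math

# Given a log-likelihood ratio and a specified prior of a target trial,
# Bayes' rule gives the posterior: posterior log-odds = LLR + prior log-odds.
def posterior_target(llr, p_target):
    prior_log_odds = math.log(p_target / (1.0 - p_target))
    return 1.0 / (1.0 + math.exp(-(llr + prior_log_odds)))

# The same system output serves applications with different priors:
print(posterior_target(2.0, 0.5))   # equal priors
print(posterior_target(2.0, 0.01))  # rare-target application
```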
0:06:11 | so |
---|
0:06:13 | let's now talk about end-to-end training |
---|
0:06:16 | and my impression is that it's not completely well defined in the literature |
---|
0:06:23 | but it seems to entail |
---|
0:06:26 | these two |
---|
0:06:29 | aspects |
---|
0:06:30 | first all parameters of the system |
---|
0:06:34 | should be trained jointly and that could be anything from feature extraction to producing some |
---|
0:06:38 | speaker embedding |
---|
0:06:40 | to the backend, the comparison of speaker embeddings, and producing the score |
---|
0:06:46 | the second aspect is that |
---|
0:06:48 | an end-to-end system should be trained specifically for the |
---|
0:06:51 | intended task, which in our case would be verification |
---|
0:06:58 | one could be even stricter and say that it should match the exact evaluation metric |
---|
0:07:02 | that we are interested in, for example the error rate |
---|
0:07:06 | so |
---|
0:07:07 | in this tutorial i will try to |
---|
0:07:11 | discuss |
---|
0:07:17 | how |
---|
0:07:19 | important |
---|
0:07:20 | these criteria are, or how difficult it can be |
---|
0:07:24 | to impose these criteria, or what it means if we don't do it |
---|
0:07:31 | so |
---|
0:07:33 | first |
---|
0:07:33 | let's look at what would |
---|
0:07:37 | typical end-to-end speaker verification architecture |
---|
0:07:41 | look like and |
---|
0:07:43 | as far as i know, this was first attempted for speaker verification in two |
---|
0:07:47 | thousand sixteen |
---|
0:07:49 | in the paper mentioned here |
---|
0:07:53 | and |
---|
0:07:54 | so we start with some |
---|
0:07:57 | enrollment utterances, |
---|
0:07:59 | here it's three, and we have some test utterance |
---|
0:08:02 | all of these go through some embedding-extracting neural network |
---|
0:08:06 | which can be many different architectures |
---|
0:08:09 | this produces embeddings, which are fixed-size |
---|
0:08:12 | so |
---|
0:08:14 | utterance representations |
---|
0:08:16 | one for each utterance, so three enrollment embeddings and one test embedding |
---|
0:08:22 | and then we will create one enrollment model by some kind of pooling, for |
---|
0:08:26 | example taking the mean |
---|
0:08:28 | of the enrollment embeddings |
---|
0:08:31 | and then we have some similarity measure, and in the end |
---|
0:08:35 | a score comes out that gives |
---|
0:08:38 | the log-likelihood ratio for |
---|
0:08:41 | the hypothesis that this |
---|
0:08:43 | test segment |
---|
0:08:44 | is from the same speaker as these enrollment segments |
---|
0:08:49 | and |
---|
0:08:50 | and all these parts of the model should be |
---|
0:08:55 | trained |
---|
0:08:56 | jointly |
---|
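The pipeline just described can be sketched roughly like this, with a stand-in embedding extractor in place of a trained neural network (all names, dimensions, and the cosine similarity here are assumptions for illustration, not the architecture of any particular paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_embedding(utterance_features):
    # placeholder: average the frame features; a real extractor is a deep network
    return utterance_features.mean(axis=0)

def cosine_score(enroll_model, test_emb):
    # one simple similarity measure; an end-to-end system would learn this part too
    return float(enroll_model @ test_emb /
                 (np.linalg.norm(enroll_model) * np.linalg.norm(test_emb)))

# three enrollment utterances and one test utterance (frames x feature dim)
enroll_utts = [rng.normal(size=(100, 40)) for _ in range(3)]
test_utt = rng.normal(size=(80, 40))

# pool the enrollment embeddings into one enrollment model, then score
enroll_model = np.mean([extract_embedding(u) for u in enroll_utts], axis=0)
score = cosine_score(enroll_model, extract_embedding(test_utt))
print(score)
```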
0:09:02 | to be a bit fair, and maybe for historical interest, we should say that |
---|
0:09:08 | this is |
---|
0:09:10 | not a |
---|
0:09:12 | new idea |
---|
0:09:15 | it existed already in nineteen ninety three; that's the earliest i'm aware of |
---|
0:09:21 | at least |
---|
0:09:22 | and one paper at the time was about |
---|
0:09:26 | handwritten signature recognition and another paper was about fingerprint recognition |
---|
0:09:33 | but they used exactly this idea |
---|
0:09:39 | and |
---|
0:09:40 | okay, so we talked about end-to-end |
---|
0:09:43 | training and modeling |
---|
0:09:46 | so what would be the alternative |
---|
0:09:49 | one thing would be |
---|
0:09:51 | generative modeling so we train a generative model |
---|
0:09:54 | that |
---|
0:09:55 | means a model that can generate the data, both the observations x and |
---|
0:10:02 | labels y |
---|
0:10:08 | it can also give us |
---|
0:10:10 | the probability, or probability density, of such observations |
---|
0:10:16 | we typically train with maximum likelihood, and if the model is correctly specified, for example |
---|
0:10:22 | if the data really comes from a normal distribution and we have assumed that |
---|
0:10:26 | in our model are then |
---|
0:10:29 | with enough training data we will find the correct parameters, but |
---|
0:10:33 | that is rarely the case |
---|
0:10:35 | and it's may be worth pointing out that |
---|
0:10:37 | the llrs from such a model are the best |
---|
0:10:40 | we can have |
---|
0:10:43 | if we have access to the log-likelihood ratios |
---|
0:10:47 | from the model that really generated the data |
---|
0:10:51 | then we can make the optimal decision for classification or verification; |
---|
0:10:56 | no |
---|
0:10:57 | other |
---|
0:10:58 | classifier would perform better |
---|
0:11:04 | the problem with this is that when the |
---|
0:11:07 | model |
---|
0:11:09 | assumptions are not correct then the parameters we find with maximum likelihood may not be |
---|
0:11:14 | optimal for classification |
---|
0:11:17 | and sometimes maximum likelihood training is also difficult |
---|
0:11:25 | other approaches would be some type of discriminative training; end-to-end training can be |
---|
0:11:30 | seen as one type of discriminative training, but in other discriminative |
---|
0:11:36 | approaches we can try to train the neural network, i.e. the embedding extractor, for speaker |
---|
0:11:41 | identification, which seems to be the most |
---|
0:11:45 | popular approach right now |
---|
0:11:48 | and then we will use the output of some intermediate layer as the embedding and train |
---|
0:11:54 | another |
---|
0:11:55 | backend on top of that |
---|
0:11:58 | then there is, of course, metric learning, which |
---|
0:12:05 | will |
---|
0:12:07 | kind of train the embedding extractor together with a distance metric, which sometimes can |
---|
0:12:12 | be simple |
---|
0:12:14 | so in principle the embedding and the kind of distance metric or backend are |
---|
0:12:19 | trained jointly |
---|
0:12:21 | but typically not for the speaker verification task |
---|
0:12:24 | so this is kind of end-to-end training according to the first criterion but not |
---|
0:12:28 | according to the second |
---|
0:12:32 | so |
---|
0:12:34 | now |
---|
0:12:36 | we will |
---|
0:12:38 | discuss |
---|
0:12:40 | why end-to-end training would be preferable |
---|
0:12:44 | so |
---|
0:12:45 | we had two things: one is that we should train modules jointly, and the other |
---|
0:12:48 | thing is that we should train for the |
---|
0:12:50 | intended task |
---|
0:12:52 | so |
---|
0:12:54 | mm |
---|
0:12:56 | the case for joint training is actually quite obvious; let's consider a |
---|
0:13:01 | system consisting of two modules a and b, and we have theta a, which |
---|
0:13:05 | are the parameters of module a, and theta b, which are the |
---|
0:13:08 | parameters of module b; if we just first train module a and then |
---|
0:13:14 | module b |
---|
0:13:15 | it is essentially like doing |
---|
0:13:18 | one iteration of |
---|
0:13:20 | coordinate descent or block coordinate descent |
---|
0:13:22 | so we train model |
---|
0:13:24 | a |
---|
0:13:25 | and we get here, then we train module b and we get here |
---|
0:13:29 | but we will not get further than that, not to the optimum, which would be here |
---|
0:13:34 | so of course we could continue |
---|
0:13:38 | to |
---|
0:13:39 | do a few more iterations |
---|
0:13:40 | and we might end up in the |
---|
0:13:43 | optimum, and this is actually, in principle, equivalent to joint optimization |
---|
0:13:51 | when we have a non-convex model, as we often do, we may not actually |
---|
0:13:55 | reach the same |
---|
0:13:57 | optimum as if we trained |
---|
0:14:00 | all the parameters in one go; what happens also depends on which optimizer |
---|
0:14:05 | we use, so |
---|
0:14:06 | in principle |
---|
0:14:08 | this is |
---|
0:14:12 | why joint training would |
---|
0:14:16 | really make sure that you find the optimum |
---|
0:14:19 | of both |
---|
0:14:20 | modules, and that's clearly better than just training |
---|
0:14:25 | first one and then the other |
---|
0:14:28 | so i think there is no real argument here: |
---|
0:14:31 | this part of end-to-end training, |
---|
0:14:36 | the joint training of modules, is justified |
---|
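The block-coordinate-descent picture can be illustrated numerically on a small coupled quadratic objective (my own toy example, not from the talk): one pass of train-module-a-then-module-b falls short of the joint optimum, while repeated alternation approaches it.

```python
# A coupled quadratic loss over the parameters of two "modules" a and b.
def loss(a, b):
    return a * a + b * b + 1.8 * a * b  # the cross term couples the modules

a, b = 2.0, 2.0
a = -0.9 * b          # exact minimiser of loss over a with b fixed
b = -0.9 * a          # exact minimiser of loss over b with a fixed
one_pass_loss = loss(a, b)
for _ in range(49):   # keep alternating (block coordinate descent)
    a = -0.9 * b
    b = -0.9 * a
print(one_pass_loss, loss(a, b))  # joint optimum is 0 at (0, 0)
```

One pass leaves a clearly non-zero loss; many alternations drive it essentially to the joint optimum, which matches the convex case sketched on the slide.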
0:14:42 | then the task-specific training: the idea that we should train for |
---|
0:14:48 | the |
---|
0:14:51 | intended task; so if in |
---|
0:14:55 | our application we want to do speaker verification, why should we train for verification |
---|
0:15:00 | and not for identification, for example |
---|
0:15:04 | well |
---|
0:15:05 | first we should say that |
---|
0:15:10 | we have some guarantee that this idea of minimizing loss on training data |
---|
0:15:14 | leads to good performance on test data, the empirical risk minimisation idea |
---|
0:15:20 | and the only guarantee we have there |
---|
0:15:26 | holds only if we are training for |
---|
0:15:30 | the metric that we are interested in, for the task that we are interested in |
---|
0:15:35 | if we |
---|
0:15:36 | train for one task and |
---|
0:15:39 | evaluate |
---|
0:15:40 | on another, then we don't really have any guarantee that |
---|
0:15:45 | we find the optimal model parameters for that task |
---|
0:15:49 | but one can of course ask: shouldn't it really work anyway to train for |
---|
0:15:54 | identification |
---|
0:15:55 | and use the model for verification, "'cause" it's kind of a similar task? |
---|
0:16:00 | it does, as we know |
---|
0:16:02 | but let's just discuss a little bit what could |
---|
0:16:05 | go wrong |
---|
0:16:08 | or why it wouldn't be optimal |
---|
0:16:16 | so here is a kind of toy example |
---|
0:16:20 | we are looking at one-dimensional embeddings, so we imagine that these are |
---|
0:16:25 | the distributions of one-dimensional embeddings |
---|
0:16:31 | so the embedding space is here, and each of these colours represents the |
---|
0:16:38 | distribution of embeddings for some speaker: blue is one speaker, green is |
---|
0:16:42 | another speaker, and so on |
---|
0:16:46 | of course, the particular |
---|
0:16:49 | shape of the distributions i show here is chosen kind of for simplicity |
---|
0:16:54 | so in this example we assume that the means of the |
---|
0:16:59 | speakers are equidistant, with equal distance like this |
---|
0:17:06 | so |
---|
0:17:09 | what would be the identification error in this case |
---|
0:17:13 | so whenever we observe an embedding we will assign it to the closest speaker |
---|
0:17:19 | so |
---|
0:17:20 | if we |
---|
0:17:22 | observe an embedding in this region, we will assign it to that speaker |
---|
0:17:26 | if we observe it here |
---|
0:17:28 | we will assign it to |
---|
0:17:31 | this, |
---|
0:17:32 | the green, |
---|
0:17:34 | speaker |
---|
0:17:36 | and of course it means that sometimes an embedding |
---|
0:17:43 | sampled from the blue speaker will be here, but we will assign it |
---|
0:17:47 | to the |
---|
0:17:48 | green |
---|
0:17:48 | speaker's area |
---|
0:17:50 | so we will have some error in this situation |
---|
0:17:54 | and |
---|
0:17:55 | if we consider only the neighboring speakers the error rate will be |
---|
0:17:59 | twelve point two percent in this example |
---|
0:18:10 | what would be the verification error rate |
---|
0:18:14 | so |
---|
0:18:15 | if we consider |
---|
0:18:16 | for this type of data |
---|
0:18:18 | so |
---|
0:18:19 | we will assume that we |
---|
0:18:21 | have speakers |
---|
0:18:23 | whose means are equidistantly distributed |
---|
0:18:26 | like |
---|
0:18:27 | these stars |
---|
0:18:29 | and |
---|
0:18:30 | now for a target trial we will sample two |
---|
0:18:35 | embeddings from one speaker |
---|
0:18:37 | and see if they are closer to each other than some threshold |
---|
0:18:41 | which happens to be the optimal threshold for this distribution |
---|
0:18:46 | and if |
---|
0:18:48 | they are further apart than the threshold, we will count that as an error |
---|
0:19:02 | for the target trials, |
---|
0:19:05 | and correspondingly for non-target trials |
---|
0:19:08 | so |
---|
0:19:11 | here, in this example, we can see |
---|
0:19:14 | we would have an error rate of fourteen percent |
---|
0:19:17 | again i'm only actually considering that the non-target trials are from neighboring speakers |
---|
0:19:26 | that's why the error rate is high |
---|
0:19:33 | so |
---|
0:19:34 | now |
---|
0:19:35 | i'm only changing the distributions a little bit, |
---|
0:19:39 | the within-speaker distributions, so |
---|
0:19:43 | as before |
---|
0:19:45 | the speaker means are at the same distance |
---|
0:19:48 | like this |
---|
0:19:50 | and |
---|
0:19:50 | we have made the within-speaker distribution a little bit more narrow here and a little |
---|
0:19:55 | bit more broad here |
---|
0:19:57 | the overall within-speaker variance is the same, but it has a little bit different |
---|
0:20:01 | shape |
---|
0:20:02 | and we will see that identification error has increased to thirteen point seven percent |
---|
0:20:09 | whereas the verification error is better |
---|
0:20:15 | well |
---|
0:20:16 | in a more extreme situation, we have made |
---|
0:20:19 | the distributions equally peaked, or broad, |
---|
0:20:23 | like these two-mode mixtures |
---|
0:20:26 | now the speaker means are still all at the same distance |
---|
0:20:31 | like this |
---|
0:20:32 | and the within-speaker variance is |
---|
0:20:35 | well, the within-speaker variance is also the same as before |
---|
0:20:40 | and here it would actually get |
---|
0:20:42 | zero |
---|
0:20:44 | identification error |
---|
0:20:46 | but we will have worse |
---|
0:20:48 | verification error than in any of the other examples, and that's because |
---|
0:20:53 | if we sample a target trial we will very often have |
---|
0:20:57 | embeddings that are far from each other, and similarly |
---|
0:21:01 | for non-target trials we will very often have embeddings that are close to each other |
---|
0:21:07 | so this |
---|
0:21:08 | example |
---|
0:21:10 | should illustrate that |
---|
0:21:14 | the within-speaker distribution that is optimal for identification is not |
---|
0:21:20 | necessarily the distribution that is optimal for verification |
---|
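A small Monte-Carlo sketch of these toy examples (with my own assumed means, variances, and threshold, not the exact numbers from the slides) shows the same effect: a bimodal within-speaker distribution can have lower identification error but higher verification error than a Gaussian one of the same variance.

```python
import random

random.seed(0)
SPACING = 4.0  # assumed distance between neighbouring speaker means

def sample(speaker_mean, bimodal):
    if bimodal:
        # two narrow modes at mean +/- 1: overall within-speaker variance ~1
        return speaker_mean + random.choice([-1.0, 1.0]) + random.gauss(0.0, 0.05)
    return speaker_mean + random.gauss(0.0, 1.0)  # Gaussian, variance 1

def identification_error(bimodal, n=20000):
    # nearest-mean decision between speakers at 0 and SPACING: an embedding
    # from the speaker at 0 is misclassified beyond the midpoint
    return sum(sample(0.0, bimodal) > SPACING / 2 for _ in range(n)) / n

def verification_error(bimodal, threshold, n=20000):
    # target trials: two embeddings of one speaker, rejected if far apart
    tgt = sum(abs(sample(0.0, bimodal) - sample(0.0, bimodal)) > threshold
              for _ in range(n)) / n
    # non-target trials: embeddings of neighbouring speakers, accepted if close
    non = sum(abs(sample(0.0, bimodal) - sample(SPACING, bimodal)) < threshold
              for _ in range(n)) / n
    return (tgt + non) / 2

for bimodal in (False, True):
    print(bimodal, identification_error(bimodal), verification_error(bimodal, 2.0))
```

With these assumed parameters the bimodal case gets near-zero identification error yet clearly worse verification error, mirroring the argument above.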
0:21:27 | okay so |
---|
0:21:29 | as another example |
---|
0:21:31 | let us consider triplet loss, which is another popular |
---|
0:21:38 | loss |
---|
0:21:38 | function |
---|
0:21:42 | so it works like this: in |
---|
0:21:44 | each training example you have |
---|
0:21:48 | an embedding for some speaker, which we call the anchor embedding |
---|
0:21:52 | and then you have an embedding from the same speaker, which we call the positive |
---|
0:21:55 | example, and an embedding from another speaker, which we call the |
---|
0:21:59 | negative example |
---|
0:22:00 | and basically we want the distance between the anchor and the positive example to be |
---|
0:22:06 | small |
---|
0:22:07 | and the distance between the anchor and the negative example |
---|
0:22:11 | to be big |
---|
0:22:14 | so |
---|
0:22:15 | if this distance is bigger than |
---|
0:22:18 | this one plus some margin |
---|
0:22:20 | then the loss is going to be zero |
---|
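A common formulation of the triplet loss just described (the exact variant differs between papers, and the margin value here is an arbitrary choice):

```python
import numpy as np

# Triplet loss: the anchor-positive distance plus a margin should be smaller
# than the anchor-negative distance; the loss is zero once that holds.
def triplet_loss(anchor, positive, negative, margin=0.2):
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(0.0, d_ap - d_an + margin)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])     # same speaker, close to the anchor
n = np.array([1.0, 0.0])     # different speaker, far away
print(triplet_loss(a, p, n))  # 0.1 - 1.0 + 0.2 < 0 -> loss is 0.0
```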
0:22:26 | however |
---|
0:22:27 | this is not |
---|
0:22:29 | an ideal criterion for speaker verification, and to show this i have a |
---|
0:22:34 | rather complicated figure here that illustrates |
---|
0:22:40 | three speakers |
---|
0:22:41 | and the embeddings of three speakers in a |
---|
0:22:45 | two dimensional space |
---|
0:22:47 | so we have |
---|
0:22:48 | speaker a |
---|
0:22:50 | with embeddings |
---|
0:22:52 | distributed in this area |
---|
0:22:55 | speaker b with embeddings in this area and speaker c with embeddings in |
---|
0:22:59 | this area |
---|
0:23:01 | and |
---|
0:23:03 | if |
---|
0:23:04 | we are using some anchor from speaker a, the worst case would be |
---|
0:23:08 | to have it here on the border |
---|
0:23:10 | and then the biggest distance to a positive example would be to have it |
---|
0:23:15 | here on the other side |
---|
0:23:17 | and the smallest distance to a negative example would be to take |
---|
0:23:21 | something here |
---|
0:23:23 | so simply we want this |
---|
0:23:27 | distance, to the positive example |
---|
0:23:30 | here, plus some margin, to be smaller than the distance to the |
---|
0:23:35 | negative example from the anchor |
---|
0:23:37 | so it's okay |
---|
0:23:38 | in this situation |
---|
0:23:41 | consider then speaker c, which has a |
---|
0:23:46 | wider |
---|
0:23:46 | distribution of data; now if we have an anchor here |
---|
0:23:51 | we need |
---|
0:23:52 | the |
---|
0:23:53 | distance to the next speaker, the closest speaker, to be |
---|
0:23:58 | bigger than the internal distance |
---|
0:24:00 | plus some margin |
---|
0:24:02 | so |
---|
0:24:03 | and that's the case in this figure, so the triplet loss is completely fine with |
---|
0:24:08 | this situation |
---|
0:24:10 | but if we want to |
---|
0:24:12 | do |
---|
0:24:13 | verification on data that is distributed in this way, then, |
---|
0:24:19 | well, |
---|
0:24:21 | if we want to have good |
---|
0:24:24 | performance on target trials from speaker c |
---|
0:24:27 | we need to accept |
---|
0:24:30 | trials as target trials whenever we have a smaller distance than this, otherwise we will |
---|
0:24:34 | have some errors for target trials of speaker c |
---|
0:24:38 | but this means that if we have a threshold like this here, we will have |
---|
0:24:42 | confusion between |
---|
0:24:45 | speakers a and b |
---|
0:24:47 | so |
---|
0:24:48 | this |
---|
0:24:49 | again, of course, there could be ways to compensate for this in one way or another, but |
---|
0:24:53 | it's just to show that optimizing |
---|
0:24:55 | this |
---|
0:24:57 | metric is not |
---|
0:24:58 | going to lead to optimal |
---|
0:25:00 | performance for |
---|
0:25:03 | verification |
---|
0:25:06 | so if we try to summarise a little bit about the idea of task specific |
---|
0:25:10 | training |
---|
0:25:12 | minimizing identification error won't necessarily minimize verification error |
---|
0:25:18 | but of course i was showing these on kind of toy examples and the reality |
---|
0:25:22 | is much more complicated |
---|
0:25:24 | we |
---|
0:25:25 | usually don't optimize classification error directly but rather the cross-entropy |
---|
0:25:29 | or something like that |
---|
0:25:31 | and we may use some loss to encourage margin |
---|
0:25:36 | between the speaker embeddings |
---|
0:25:39 | and maybe the assumptions that i made about the |
---|
0:25:42 | distributions here are |
---|
0:25:44 | not completely realistic at all |
---|
0:25:48 | and |
---|
0:25:50 | mm |
---|
0:25:53 | so it's maybe not completely clear |
---|
0:25:56 | what would happen with new test speakers that were not in the training set as |
---|
0:26:00 | well |
---|
0:26:01 | so what i want to say is that this should not be interpreted as |
---|
0:26:05 | some kind of proof that other objectives would fail; maybe they would even be |
---|
0:26:09 | really good |
---|
0:26:11 | but |
---|
0:26:12 | just that it's not really |
---|
0:26:17 | completely justified to use them |
---|
0:26:20 | and this is of course something that ideally should be studied much more |
---|
0:26:24 | in future |
---|
0:26:27 | but |
---|
0:26:31 | so we discussed that end-to-end training has some good motivation |
---|
0:26:39 | but still it's not really the most popular strategy for building speaker recognition systems today |
---|
0:26:46 | at least, my impression is that multiclass training is |
---|
0:26:50 | still the most popular |
---|
0:26:52 | and |
---|
0:26:54 | so |
---|
0:26:55 | why is that? well, there are many difficulties with end-to-end training |
---|
0:26:59 | it seems |
---|
0:27:01 | it |
---|
0:27:02 | is more prone to overfitting |
---|
0:27:05 | we have issues with statistical dependence of training |
---|
0:27:08 | trials, which we will go into in more detail in |
---|
0:27:12 | a few slides |
---|
0:27:15 | and |
---|
0:27:16 | it is also maybe questionable how the system should be trained |
---|
0:27:21 | when we want to |
---|
0:27:23 | use many enrollment utterances; this will also be mentioned |
---|
0:27:28 | later on |
---|
0:27:30 | so |
---|
0:27:35 | the issue |
---|
0:27:36 | one of the issues with using a kind of verification objective, let's call it that, |
---|
0:27:41 | where we are comparing |
---|
0:27:43 | two utterances and want to say whether it's the same speaker or not |
---|
0:27:48 | is that |
---|
0:27:52 | the data |
---|
0:27:54 | violate |
---|
0:27:57 | statistical independence, as i will |
---|
0:27:59 | explain in a minute |
---|
0:27:59 | so this is |
---|
0:28:01 | generally, this idea of minimizing some training loss assumes that |
---|
0:28:07 | the training data |
---|
0:28:09 | are independent samples from whatever distribution the data comes from |
---|
0:28:14 | and this is often the case i mean we have data that has been independently |
---|
0:28:19 | selected |
---|
0:28:21 | but |
---|
0:28:21 | in speaker verification |
---|
0:28:23 | the data, |
---|
0:28:25 | x, |
---|
0:28:26 | the observation, |
---|
0:28:27 | is |
---|
0:28:28 | a pair of utterances, the enrollment utterance and the test utterance, and the label |
---|
0:28:34 | indicates whether it's a target trial or a non-target trial |
---|
0:28:38 | so as notation i will use |
---|
0:28:41 | y equal one for target trials and y equal minus one for non-target trials |
---|
0:28:46 | the issue here is that |
---|
0:28:49 | typically, at least if we have a limited amount of training data |
---|
0:28:53 | we create |
---|
0:28:54 | many trials |
---|
0:28:56 | from the same speakers and the same utterances, so the speakers and utterances |
---|
0:29:01 | are used in many different trials, and then |
---|
0:29:05 | the |
---|
0:29:06 | trials, which are the training data, |
---|
0:29:10 | are not |
---|
0:29:12 | statistically independent |
---|
0:29:14 | which is something that the training procedure assumes they are |
---|
0:29:19 | so |
---|
0:29:22 | this can be a problem; exactly how big the problem is, |
---|
0:29:25 | i think, is still something that needs to be investigated more, but let's elaborate a little |
---|
0:29:30 | bit on what happens |
---|
0:29:35 | so |
---|
0:29:38 | here i wrote down the training objective that we would use for a |
---|
0:29:43 | kind of verification loss when we train the system end-to-end for verification |
---|
0:29:48 | so it may look |
---|
0:29:49 | complicated, but it's not really anything special; it's just the average training loss |
---|
0:29:55 | of |
---|
0:29:56 | target trials here and the average training loss of |
---|
0:30:00 | non-target trials here, and they are weighted with factors, the |
---|
0:30:05 | probability of target trials and probability of non-target trials which are |
---|
0:30:10 | some parameters that we use to |
---|
0:30:14 | steer the system to fit |
---|
0:30:15 | better for the application that we are interested in |
---|
0:30:19 | and again |
---|
0:30:22 | what we hope is that this would minimize the expected loss |
---|
0:30:26 | of |
---|
0:30:28 | target trials and non-target trials |
---|
0:30:32 | weighted with these |
---|
0:30:33 | same |
---|
0:30:34 | probabilities of target trials and non-target trials |
---|
0:30:38 | on some unseen data |
---|
0:30:40 | this loss function here is often the cross entropy but could be other things |
---|
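The weighted objective just described can be sketched as follows, using binary cross-entropy on LLR scores as one possible choice of the loss function (the prior enters both the trial weighting and the posterior; names and values here are my own illustration):

```python
import math

def cross_entropy(llr, is_target, p_target):
    # cross-entropy of the correct hypothesis given the LLR and the prior
    prior_log_odds = math.log(p_target / (1 - p_target))
    log_post_target = -math.log1p(math.exp(-(llr + prior_log_odds)))
    if is_target:
        return -log_post_target
    return -math.log1p(-math.exp(log_post_target))

def verification_objective(target_scores, nontarget_scores, p_target=0.5):
    # average loss over target trials and over non-target trials,
    # weighted by the chosen target / non-target probabilities
    avg_tgt = sum(cross_entropy(s, True, p_target)
                  for s in target_scores) / len(target_scores)
    avg_non = sum(cross_entropy(s, False, p_target)
                  for s in nontarget_scores) / len(nontarget_scores)
    return p_target * avg_tgt + (1 - p_target) * avg_non

print(verification_objective([2.0, 3.0], [-2.5, -1.0], p_target=0.1))
```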
0:30:49 | so what are the desirable properties of a training objective |
---|
0:30:56 | so |
---|
0:30:57 | here |
---|
0:30:59 | we have |
---|
0:31:00 | r-hat, which is the |
---|
0:31:03 | objective function, the average training loss |
---|
0:31:06 | and |
---|
0:31:07 | since the training data |
---|
0:31:09 | can |
---|
0:31:09 | be assumed to be generated from some probability distribution, this r-hat is also |
---|
0:31:14 | a random variable |
---|
0:31:18 | and we want it |
---|
0:31:20 | to be close |
---|
0:31:21 | to the |
---|
0:31:23 | expected |
---|
0:31:24 | loss |
---|
0:31:29 | where the expectation is calculated according for the true probability distribution of the data |
---|
0:31:35 | and for every value of |
---|
0:31:37 | theta, because |
---|
0:31:39 | in that case |
---|
0:31:43 | and |
---|
0:31:46 | if |
---|
0:31:47 | the expected loss is this black line here |
---|
0:31:54 | then |
---|
0:31:56 | well, |
---|
0:31:57 | let's say |
---|
0:31:59 | we have some training set, the blue one, |
---|
0:32:02 | and we check the average loss as a function of theta |
---|
0:32:06 | it may look like this |
---|
0:32:09 | another training set, it may look like this, the red line, and a third one |
---|
0:32:13 | would be |
---|
0:32:14 | the purple one; so the point is that it is a little bit random and |
---|
0:32:17 | it's not going to be exactly like the expected loss |
---|
0:32:22 | but ideally it should be close to this one because if we find a filter |
---|
0:32:26 | that minimize the training loss for example here for the in the case of the |
---|
0:32:29 | red training set |
---|
0:32:31 | then |
---|
0:32:32 | we know that it will also be a good value for the
---|
0:32:38 | expected loss, which means the loss on unseen test data
---|
0:32:43 | so we want |
---|
0:32:46 | the |
---|
0:32:47 | training loss |
---|
0:32:48 | as a function of the model parameters
---|
0:32:53 | to be close to the expected loss for all values of the
---|
0:32:57 | parameters
---|
0:33:02 | so |
---|
0:33:03 | in order to study the effect of |
---|
0:33:08 | statistical dependencies in the training data in this context
---|
0:33:11 | we |
---|
0:33:12 | write the
---|
0:33:14 | training objective in a slightly more general form than before
---|
0:33:19 | so |
---|
0:33:20 | it's the same as before, except that for each trial
---|
0:33:23 | we have a weight beta
---|
0:33:25 | and if we set beta to one over N then it would be the
---|
0:33:30 | same as before, but now we consider that we can choose some other values of
---|
0:33:35 | these
---|
0:33:37 | trial weights
---|
0:33:38 | for the
---|
0:33:39 | training trials
---|
0:33:44 | we want
---|
0:33:45 | the training objective, so the average training loss, to have an expected value which is the
---|
0:33:52 | same as the expected value
---|
0:33:56 | of the loss on test data, so it should be an unbiased estimator of
---|
0:34:03 | the test loss, or the expected loss
---|
0:34:07 | and we also want it to be good in the sense that it has
---|
0:34:10 | a small variance |
---|
0:34:18 | well the expected value of the training loss is just calculated like this so we |
---|
0:34:23 | end up with the expected value of a loss |
---|
0:34:26 | and this is exactly
---|
0:34:28 | what we usually denote R
---|
0:34:30 | so in order for this to be
---|
0:34:32 | unbiased, we simply want the sum of the weights to be one
---|
0:34:39 | and of course this would be the case when we use the standard choice of |
---|
0:34:45 | beta, which is one over N, the number of
---|
0:34:48 | trials |
---|
0:34:49 | in the training data |
---|
0:34:53 | the variance |
---|
0:34:55 | of this empirical loss |
---|
0:34:58 | is gonna look like this |
---|
0:34:59 | it's the
---|
0:35:00 | weight vector for all the trials
---|
0:35:03 | times a matrix
---|
0:35:06 | times the weight vector
---|
0:35:09 | and this matrix is the covariance matrix for the losses of all trials with
---|
0:35:14 | this little t, where t is one for the target trials or
---|
0:35:18 | minus one for the non-target trials
---|
0:35:21 | and one could derive that |
---|
0:35:23 | the optimal |
---|
0:35:24 | choice of |
---|
0:35:26 | betas that would minimize this variance
---|
0:35:29 | is |
---|
0:35:29 | gonna look like this
---|
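To make the slide's result concrete, here is a small NumPy sketch (my own illustration, not code from the talk) of the variance-minimizing weights: minimizing beta' C beta subject to the weights summing to one gives beta proportional to C^-1 times the all-ones vector.

```python
import numpy as np

def blue_weights(C):
    """Weights minimizing beta' C beta subject to sum(beta) = 1.

    C is the covariance (or correlation) matrix of the per-trial losses;
    the Lagrangian solution is beta = C^{-1} 1 / (1' C^{-1} 1).
    """
    ones = np.ones(C.shape[0])
    w = np.linalg.solve(C, ones)
    return w / w.sum()

# Three trials: the first two share an utterance (loss correlation 0.8),
# the third is independent of both.
C = np.array([[1.0, 0.8, 0.0],
              [0.8, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
beta = blue_weights(C)
# The two correlated trials get down-weighted relative to the independent one.
```

Note that with equal correlations everywhere the weights stay uniform, so this reweighting only matters when the dependency structure is uneven.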
0:35:36 | so this is what we can call the BLUE training objective
---|
0:35:40 | a best linear |
---|
0:35:42 | unbiased estimate |
---|
0:35:44 | that's the meaning of BLUE; so this is the best linear unbiased estimate of
---|
0:35:48 | the |
---|
0:35:50 | test loss |
---|
0:35:51 | using the training data to estimate what |
---|
0:35:53 | well the test loss would be |
---|
0:35:59 | some
---|
0:36:00 | details about this: we don't really need the covariances between the losses,
---|
0:36:05 | only the correlations
---|
0:36:07 | because
---|
0:36:08 | if we assume the diagonal elements of this matrix are
---|
0:36:12 | equal
---|
0:36:14 | then it turns out like this
---|
0:36:18 | and in practice we would assume that |
---|
0:36:21 | such
---|
0:36:22 | elements in this covariance matrix do not depend on theta, which
---|
0:36:26 | could be questioned |
---|
0:36:31 | so |
---|
0:36:32 | the objective that we discussed is not really specific to speaker verification, in the sense
---|
0:36:37 | that whenever you have
---|
0:36:39 | dependencies in the training data you could
---|
0:36:42 | use this idea
---|
0:36:43 | but
---|
0:36:45 | the structure of this covariance matrix
---|
0:36:49 | which describes the covariances of the losses of the training data
---|
0:36:54 | depends on the specific problem that you're studying
---|
0:36:58 | so now we will look into how to |
---|
0:37:01 | create such a matrix for speaker verification
---|
0:37:06 | so here |
---|
0:37:07 | we will use |
---|
0:37:08 | x
---|
0:37:09 | i to denote the
---|
0:37:12 | i-th utterance of speaker x
---|
0:37:16 | so we will assume that |
---|
0:37:19 | correlation coefficients
---|
0:37:21 | depend on what the trials have in common, so for example
---|
0:37:24 | here we have a
---|
0:37:26 | trial of speaker a utterance one versus speaker a utterance two, and some loss of
---|
0:37:31 | that, and we also have speaker a utterance one versus speaker a
---|
0:37:36 | utterance three, and some loss of that
---|
0:37:38 | and they have some correlation
---|
0:37:40 | because
---|
0:37:42 | they involve the same speaker
---|
0:37:45 | so we assume there is a correlation
---|
0:37:48 | coefficient, denoted c,
---|
0:37:50 | listed here
---|
0:37:52 | so in total we have these kinds of situations in verification if we consider target
---|
0:37:57 | trials |
---|
0:38:00 | there you could have the situation that
---|
0:38:02 | well, okay, let's look here
---|
0:38:05 | at
---|
0:38:05 | two target trials which have one utterance in common: this is a target trial
---|
0:38:10 | of speaker a
---|
0:38:11 | and here we have utterances one and two, and here you have utterance one and
---|
0:38:15 | utterance three, so utterance one is used in both
---|
0:38:17 | trials; there is some correlation between these trials
---|
0:38:21 | here |
---|
0:38:22 | there is no common utterance but the speaker is still the same, and this is as
---|
0:38:26 | opposed to this situation where
---|
0:38:28 | you have
---|
0:38:30 | a
---|
0:38:30 | trial of speaker a and a trial of speaker b that have nothing in common
---|
0:38:34 | so we assume here the correlation is zero
---|
0:38:37 | for such trials |
---|
0:38:39 | for the non-target trials you have a more complicated situation, but all possible situations are listed
---|
0:38:46 | here |
---|
0:38:47 | for example |
---|
0:38:48 | you may have that |
---|
0:38:50 | okay
---|
0:38:50 | the trials have one
---|
0:38:54 | utterance in common
---|
0:38:58 | so we have this utterance in common and in addition to that
---|
0:39:02 | this speaker is in common; that's what we mean with this notation here
---|
0:39:08 | and so on
---|
0:39:14 | and if we have such correlations, one can derive
---|
0:39:18 | that
---|
0:39:18 | in other words, given such correlation coefficients, the optimal weights for
---|
0:39:24 | a speaker with this many utterances
---|
0:39:27 | are gonna look like this
---|
0:39:32 | the exact form is maybe not so important but just |
---|
0:39:34 | we should note that one can
---|
0:39:37 | derive
---|
0:39:38 | how to
---|
0:39:39 | give weight to each speaker, and it depends on how many utterances
---|
0:39:44 | the speaker has
---|
0:39:47 | for the non-target trials the formulas are more complex
---|
0:39:51 | if the trial involves speakers a and b, it
---|
0:39:55 | depends on how many
---|
0:39:56 | utterances each of the two speakers has
---|
0:40:02 | so |
---|
0:40:03 | then comes the issue of how to estimate the correlation coefficients: one could look at score correlations
---|
0:40:09 | of some trained model
---|
0:40:12 | or we could
---|
0:40:14 | learn them somehow
---|
0:40:16 | which we will mention briefly later, or we can just make some assumption and
---|
0:40:21 | tune it, so for example one simple assumption is that
---|
0:40:25 | this correlation coefficient of target trials is alpha, and this one, which we assume
---|
0:40:30 | should be smaller, is alpha squared
---|
0:40:32 | and then
---|
0:40:35 | tune alpha in this range, and similarly for the non-target trials
---|
0:40:44 | just to get some idea of how we would change the weight for the target |
---|
0:40:47 | trials |
---|
0:40:48 | well |
---|
0:40:49 | for target trials |
---|
0:40:51 | we see here that this is the number of utterances for the speaker |
---|
0:40:56 | on the y-axis here we have their corresponding weights |
---|
0:41:01 | so |
---|
0:41:02 | and for different values of these correlations so if the correlation is |
---|
0:41:07 | small
---|
0:41:09 | then |
---|
0:41:11 | even when we have many utterances up to twenty here we will still give reasonable |
---|
0:41:16 | weight to each utterance
---|
0:41:19 | but if the correlation is large
---|
0:41:22 | then we will not give so much weight to
---|
0:41:25 | each utterance when a speaker has many utterances
---|
0:41:29 | which means that the total |
---|
0:41:31 | and |
---|
0:41:32 | weight for this speaker is not gonna increase much even if it has a
---|
0:41:35 | lot of |
---|
0:41:36 | utterances |
---|
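The saturation effect described here is easy to reproduce under a deliberately simplified assumption (my own sketch, cruder than the formulas on the slide): take all pairwise correlations between one speaker's target-trial losses to be a single constant c, so the variance-optimal relative per-trial weight is 1/(1 + (m-1)c) for m trials.

```python
def speaker_trial_weight(n_utts, c):
    """Relative per-trial weight for one speaker's target trials, assuming
    all pairwise correlations between that speaker's trial losses equal c
    (a simplification; the talk's actual formulas distinguish more cases).
    """
    m = n_utts * (n_utts - 1) // 2   # number of target trials for the speaker
    return 1.0 / (1.0 + (m - 1) * c)

# Total weight a speaker contributes is m times the per-trial weight.
# For c near 0 it grows like m; for larger c it saturates near 1/c,
# so speakers with many utterances stop gaining influence.
for n in (2, 5, 20):
    m = n * (n - 1) // 2
    print(n, m * speaker_trial_weight(n, 0.3))
```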
0:41:45 | and |
---|
0:41:48 | in the past i was exploring a little bit how big
---|
0:41:52 | these kinds of correlations really are
---|
0:41:55 | this was on an i-vector system with PLDA, and the scores
---|
0:42:01 | here in the first |
---|
0:42:05 | in this
---|
0:42:07 | column here
---|
0:42:08 | it's a
---|
0:42:09 | PLDA model trained with the EM algorithm, and then the scores of this generatively trained system
---|
0:42:14 | after calibration
---|
0:42:18 | and the other column here is for discriminatively trained PLDA
---|
0:42:22 | so the main thing to observe here is that we
---|
0:42:25 | do have
---|
0:42:26 | correlations between trials that have, for example, an utterance in common and so on
---|
0:42:32 | and the correlations can be quite large in some situations
---|
0:42:38 | so these |
---|
0:42:40 | problems seem to exist |
---|
0:42:44 | and doing this kind of correlation compensation, again on the
---|
0:42:49 | kind of discriminative
---|
0:42:50 | PLDA
---|
0:42:55 | and
---|
0:42:57 | it does help a bit
---|
0:43:05 | so it's something
---|
0:43:08 | to
---|
0:43:11 | possibly take into account
---|
0:43:13 | of course this was for discriminative PLDA, where we train a
---|
0:43:17 | PLDA model
---|
0:43:18 | using all the trials
---|
0:43:21 | that can be constructed from the training set, but of course the same
---|
0:43:25 | problem with the dependencies exists also in end-to-end systems
---|
0:43:37 | so |
---|
0:43:40 | now, some problems that we could encounter if we try to do this
---|
0:43:45 | well, first, the
---|
0:43:47 | results or the
---|
0:43:50 | compensation formulas that we derived
---|
0:43:52 | were assuming that
---|
0:43:54 | all trials
---|
0:43:55 | that can be created from the training set are used equally often, which is the
---|
0:43:58 | case if you train a backend like PLDA
---|
0:44:02 | discriminatively and you use all the trials
---|
0:44:05 | but when
---|
0:44:07 | well
---|
0:44:08 | we train a kind of end-to-end system involving neural networks
---|
0:44:14 | we use mini-batches, so one could achieve this situation by
---|
0:44:20 | making a |
---|
0:44:21 | list of trials |
---|
0:44:24 | and |
---|
0:44:25 | then we just sample trials from it: okay, here is a trial, this speaker
---|
0:44:29 | compared to this one; the next trial is this speaker compared to this one, and so on
---|
0:44:33 | and this is a
---|
0:44:34 | long list of all trials that can be formed, and then we just
---|
0:44:41 | select some of them into the mini-batch
---|
0:44:44 | the point is of course that if we have these speakers like this |
---|
0:44:47 | in the mini batch and we compare this one with this one |
---|
0:44:50 | this one with this one, and so on
---|
0:44:53 | we are not using all the trials that we have |
---|
0:44:56 | we have for example not comparing this one with this one in the mini batch |
---|
0:45:01 | and that's maybe a bit of a waste, because we are anyway using this deep
---|
0:45:06 | neural network to produce the embeddings, and so we can just as well
---|
0:45:12 | produce the embeddings and use all of them in the scoring part
---|
0:45:15 | as well
---|
0:45:17 | well then |
---|
0:45:17 | we will have a little bit different |
---|
0:45:20 | balance
---|
0:45:22 | of the trials |
---|
0:45:24 | globally compared to what we had before |
---|
0:45:27 | so the formulas that we derived wouldn't be exactly valid in this situation
---|
0:45:33 | so |
---|
0:45:34 | the |
---|
0:45:36 | question then is: if we do decide that for all the segments
---|
0:45:40 | that
---|
0:45:42 | we have in the mini-batch
---|
0:45:44 | we extract embeddings, and we want to use all of them
---|
0:45:48 | in the scoring part, how are we gonna select
---|
0:45:52 | the data for the mini-batch
---|
0:45:54 | there can be different strategies here
---|
0:45:57 | we could consider for example |
---|
0:45:59 | with
---|
0:46:00 | strategy a
---|
0:46:02 | we
---|
0:46:03 | select some speakers
---|
0:46:05 | and then for each speaker we take all the segments that they have
---|
0:46:08 | let's say that this red speaker has
---|
0:46:11 | three segments and this yellow speaker has
---|
0:46:14 | four segments
---|
0:46:17 | and then
---|
0:46:21 | we can consider all pairs, so we can have
---|
0:46:26 | segment one of the red speaker scored against segment two, segment one scored against segment
---|
0:46:30 | three, and so on
---|
0:46:33 | we don't use the diagonal because we don't consider
---|
0:46:39 | segments scored against themselves
---|
0:46:42 | and the score here is just the same as here
---|
0:46:46 | scoring segment two
---|
0:46:48 | against segment one
---|
0:46:50 | so |
---|
0:46:52 | this would be one way; another way would be strategy b:
---|
0:46:57 | to
---|
0:47:00 | select speakers but then just select two utterances for each speaker in the mini-batch
---|
0:47:08 | so
---|
0:47:10 | you will have just one target trial for each speaker
---|
0:47:14 | the difference here is that
---|
0:47:16 | we have
---|
0:47:17 | we are gonna have
---|
0:47:19 | fewer target trials
---|
0:47:21 | overall in the mini-batch, but each of them will be from a different speaker, so
---|
0:47:24 | we will have target trials from more speakers
---|
0:47:28 | typically |
---|
0:47:29 | so |
---|
0:47:30 | it's
---|
0:47:31 | not exactly clear what would be the right thing, but some informal experiments
---|
0:47:36 | we have done
---|
0:47:37 | suggest that this strategy b is better
---|
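Strategy B can be sketched like this (my own illustration with a hypothetical `spk2utts` data layout; the talk does not prescribe an implementation): sample speakers, take two utterances each, embed all of them once, and score every pair within the batch.

```python
import random

def sample_minibatch(spk2utts, n_speakers):
    """Strategy B: pick speakers, two utterances each, then form all
    pairwise trials within the batch (one target trial per speaker).
    `spk2utts` maps speaker id -> list of utterance ids.
    """
    speakers = random.sample(list(spk2utts), n_speakers)
    batch = [(s, u) for s in speakers for u in random.sample(spk2utts[s], 2)]
    trials = []
    for i in range(len(batch)):
        for j in range(i + 1, len(batch)):
            (s1, u1), (s2, u2) = batch[i], batch[j]
            trials.append((u1, u2, s1 == s2))  # True marks a target trial
    return batch, trials

spk2utts = {f"spk{k}": [f"spk{k}-utt{i}" for i in range(5)] for k in range(8)}
batch, trials = sample_minibatch(spk2utts, 4)
n_target = sum(t for _, _, t in trials)
# 4 speakers x 2 utterances -> 8 embeddings, C(8,2) = 28 trials,
# of which exactly 4 are target trials (one per speaker).
```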
0:47:46 | then again, the formulas that we derived before for how to weight trials are not completely
---|
0:47:51 | right; they were not derived under the assumption that we are doing it like this
---|
0:47:55 | so they are not
---|
0:47:56 | valid
---|
0:47:58 | and
---|
0:48:00 | they need to be modified a bit, and i will come to that
---|
0:48:03 | in a minute
---|
0:48:07 | the second problem that can occur in end-to-end training,
---|
0:48:12 | irrespective of these issues, is that
---|
0:48:17 | we do want
---|
0:48:19 | to use
---|
0:48:20 | well, we do want to have a system that can deal with multi-session enrollment
---|
0:48:24 | and
---|
0:48:26 | of course multi-session trials can be incorporated
---|
0:48:30 | they can be handled with an end-to-end system, as we discussed in the initial
---|
0:48:34 | slides
---|
0:48:36 | by having some pooling over the enrollment utterances
---|
0:48:40 | but how to create the training data is again a little bit
---|
0:48:46 | complicated
---|
0:48:47 | because |
---|
0:48:48 | already in the case of single-session trials we had a complicated situation with many
---|
0:48:54 | different kinds of dependencies that can occur, and in the multi-session case
---|
0:48:59 | it's gonna be even more |
---|
0:49:01 | complicated because you can have situations like |
---|
0:49:04 | these |
---|
0:49:06 | trial |
---|
0:49:08 | for example these two could be the enrollment and this is the test and another |
---|
0:49:12 | trial where |
---|
0:49:13 | these two are the enrollment |
---|
0:49:15 | and |
---|
0:49:15 | this is the test; then you have one utterance in common here
---|
0:49:19 | or we're gonna have a more extreme situation where both enrollment utterances
---|
0:49:24 | in the two trials are the same but the test utterance is different
---|
0:49:27 | so the number of possible dependencies that can occur is way larger
---|
0:49:32 | and i think it's
---|
0:49:33 | very difficult to derive some kind of formula for how the trials should be weighted
---|
0:49:41 | so to deal both with the fact that we're using mini-batches
---|
0:49:46 | and with multi-session trials, and to estimate proper trial weights
---|
0:49:52 | maybe one strategy can be to learn them, and this is not something
---|
0:49:56 | i tried; i just think it's
---|
0:49:57 | something that maybe should be tried |
---|
0:49:59 | well |
---|
0:50:01 | so we can define |
---|
0:50:02 | a training loss
---|
0:50:04 | again as an average of losses over the training data with some weights
---|
0:50:09 | and we also define a development loss
---|
0:50:14 | which is an average over
---|
0:50:16 | another set, an average of losses over the development set
---|
0:50:22 | and these weights here should depend only on the number of utterances of the speaker
---|
0:50:31 | or speakers involved in that trial
---|
0:50:35 | then one can imagine some scheme like this
---|
0:50:38 | mm
---|
0:50:39 | we send both training and development data through the neural
---|
0:50:44 | network and we get some
---|
0:50:47 | training loss and some
---|
0:50:49 | development loss
---|
0:50:53 | as usual we estimate the
---|
0:50:56 | gradient; here we take the gradient with respect to the model parameters
---|
0:51:03 | of the training loss
---|
0:51:05 | and
---|
0:51:06 | this
---|
0:51:06 | gradient is now a function of the trial weights
---|
0:51:11 | and we can update
---|
0:51:13 | the model parameters, keeping in mind that the updated value is a function
---|
0:51:18 | of the
---|
0:51:21 | trial weights
---|
0:51:23 | the training trial weights
---|
0:51:25 | and then |
---|
0:51:27 | we can |
---|
0:51:28 | on the development sets |
---|
0:51:30 | calculate |
---|
0:51:31 | the gradient |
---|
0:51:33 | with respect to these training weights |
---|
0:51:36 | and then |
---|
0:51:37 | use this to update |
---|
0:51:40 | the training trial weights
---|
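The learning-the-weights idea can be sketched on a toy problem. Everything below is my own illustration (not from the talk) with quadratic per-trial losses, where the inner update of theta and the outer gradient of the development loss with respect to the trial weights are available in closed form.

```python
import numpy as np

# Inner step:  theta' = theta - lr * sum_i w_i * grad_i(theta)
# Outer step:  w <- w - lr_w * d l_dev(theta') / d w
# Per-trial train losses l_i(t) = 0.5*(t - a_i)^2, dev loss 0.5*(t - a_dev)^2.
a = np.array([0.0, 0.0, 2.0])   # two redundant trials plus one distinct trial
a_dev = 1.0
theta, w = 0.5, np.ones(3) / 3
lr, lr_w = 0.5, 0.5

for _ in range(100):
    g = theta - a                          # per-trial gradients at current theta
    theta_new = theta - lr * (w @ g)       # weighted inner update
    # Chain rule: d l_dev/d w_i = (theta_new - a_dev) * d theta_new/d w_i
    grad_w = (theta_new - a_dev) * (-lr * g)
    w = np.clip(w - lr_w * grad_w, 0.0, None)
    w /= w.sum()                           # keep weights normalized (unbiasedness)
    theta = theta_new

# The redundant trials end up down-weighted and theta approaches the dev optimum.
```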
0:51:46 | a second |
---|
0:51:47 | thing |
---|
0:51:49 | to explore |
---|
0:51:51 | or like a final note on these |
---|
0:51:56 | this
---|
0:51:57 | statistical dependency issue, is that
---|
0:52:00 | we just |
---|
0:52:02 | discussed some ideas for balancing the training data the training trials for better optimization |
---|
0:52:08 | but for example in the case when all speakers have the same |
---|
0:52:12 | number of utterances |
---|
0:52:14 | this rebalancing has no effect |
---|
0:52:17 | still, of course, the dependencies are there, so one would think: shouldn't we
---|
0:52:20 | do something more than just rebalance the training data
---|
0:52:24 | and one possibility that i think would be worth
---|
0:52:28 | trying
---|
0:52:29 | is to
---|
0:52:32 | well
---|
0:52:34 | we assume the following
---|
0:52:35 | that
---|
0:52:37 | the covariance of |
---|
0:52:39 | let's say the losses of two trials
---|
0:52:42 | of a speaker
---|
0:52:45 | which have
---|
0:52:45 | one utterance
---|
0:52:47 | in common should be bigger than
---|
0:52:49 | the covariance between two trials
---|
0:52:51 | of this
---|
0:52:52 | speaker which have
---|
0:52:54 | no utterances in common
---|
0:52:56 | which should be bigger than the covariance between
---|
0:53:01 | two
---|
0:53:02 | target trials of different speakers; this should be zero actually
---|
0:53:06 | so one could consider regularizing the model to behave in that way
---|
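One way to encode this ordering as a regularizer is a hinge-style penalty on estimated loss covariances; this is my own sketch, the talk only states the desired ordering.

```python
def ordering_penalty(cov_shared_utt, cov_same_spk, cov_diff_spk):
    """Hinge penalty encouraging
    cov(shared utterance) >= cov(same speaker) >= cov(different speakers) ~ 0.
    The three covariances would be estimated from per-trial losses in a batch;
    this helper only scores a given triple.
    """
    return (max(0.0, cov_same_spk - cov_shared_utt)
            + max(0.0, cov_diff_spk - cov_same_spk)
            + cov_diff_spk ** 2)   # push different-speaker covariance to zero

# A well-ordered triple incurs only the small last term:
p = ordering_penalty(0.6, 0.3, 0.01)
```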
0:53:14 | so now |
---|
0:53:17 | after discussing the issues with
---|
0:53:21 | end-to-end training
---|
0:53:23 | then i will briefly mention some of the
---|
0:53:27 | papers
---|
0:53:29 | or some papers
---|
0:53:32 | on end-to-end
---|
0:53:33 | training, and this should not be considered as a kind of literature review or
---|
0:53:38 | describing the best architectures or anything like that
---|
0:53:42 | it is
---|
0:53:43 | more
---|
0:53:45 | just a few selected papers that illustrate some points and so on
---|
0:53:53 | some of which give some good take-away messages about end-to-end training
---|
0:53:59 | so this paper, called end-to-end text-dependent speaker verification, as far as i know was
---|
0:54:04 | the first paper on end-to-end training in speaker verification
---|
0:54:09 | and it uses a network like this, or some architecture like this: features go in
---|
0:54:14 | through
---|
0:54:15 | a neural network, and in the end
---|
0:54:21 | this network is gonna say
---|
0:54:24 | is it the same
---|
0:54:26 | speaker or not |
---|
0:54:28 | the important thing here is that |
---|
0:54:30 | the |
---|
0:54:32 | input size is fixed
---|
0:54:36 | so the input to the neural network is the feature dimension times the number of
---|
0:54:41 | frames
---|
0:54:45 | the duration, that is
---|
0:54:48 | and there was no temporal pooling, which is
---|
0:54:52 | done in many other situations
---|
0:54:55 | and this is suitable |
---|
0:54:56 | when |
---|
0:54:58 | when you do text dependent speaker verification as they did in this paper |
---|
0:55:02 | so because this means that |
---|
0:55:05 | the network is kind of aware of the word and phoneme order |
---|
0:55:10 | and |
---|
0:55:11 | i would say that the main conclusion from this paper is that |
---|
0:55:15 | the verification loss was better than the identification loss
---|
0:55:19 | especially when you have big amounts of training data; for small amounts of training
---|
0:55:24 | data there was
---|
0:55:25 | not as big a difference
---|
0:55:28 | and one can also say that t-norm could
---|
0:55:32 | to a large extent make
---|
0:55:35 | the models trained with these two losses more similar
---|
0:55:42 | but i would still say that this kind of suggests the verification loss is beneficial
---|
0:55:48 | if you have large amounts of training data |
---|
0:55:55 | so this is another paper |
---|
0:55:59 | there wasn't doing in |
---|
0:56:01 | text-independent speaker verification and here |
---|
0:56:05 | different from the other is that they do have a temporal pooling layer |
---|
0:56:11 | so |
---|
0:56:12 | that would kind of remove the dependence on the order of the input
---|
0:56:17 | to some extent at least, and is maybe a more suitable architecture for text
---|
0:56:22 | independent speaker verification |
---|
0:56:25 | and this was compared to an i-vector PLDA baseline, down here, and it was found
---|
0:56:30 | that a really large amount of training data is needed even to beat something like an
---|
0:56:34 | i-vector
---|
0:56:36 | PLDA system
---|
0:56:44 | and this is |
---|
0:56:46 | some study that we did and |
---|
0:56:51 | it was |
---|
0:56:52 | also again on text-independent speaker recognition or verification
---|
0:56:58 | but trained on a smaller amount of data, and to make it work we instead constrained
---|
0:57:04 | this neural network here, this big end-to-end system, to behave
---|
0:57:08 | something like
---|
0:57:10 | an i-vector and PLDA baseline, so we kind of constrained it not to be too
---|
0:57:16 | different from the
---|
0:57:18 | i-vector PLDA baseline
---|
0:57:21 | and |
---|
0:57:23 | we found there that training all blocks jointly with the verification loss was improving
---|
0:57:33 | as can be seen here
---|
0:57:36 | but
---|
0:57:36 | a little bit regrettably, we didn't separate out
---|
0:57:40 | clearly whether that improvement came from the fact that we were doing joint training
---|
0:57:45 | or the fact that we were |
---|
0:57:50 | using the verification loss |
---|
0:57:55 | another interesting thing here is that |
---|
0:57:59 | we found that |
---|
0:58:00 | training with the verification loss requires very large batches
---|
0:58:05 | and this was an experiment done only on the |
---|
0:58:09 | scoring part, on discriminatively trained PLDA
---|
0:58:12 | so if we train discriminative PLDA with
---|
0:58:16 | BFGS using full batches
---|
0:58:19 | that is
---|
0:58:21 | not a mini-batch
---|
0:58:24 | training scheme |
---|
0:58:26 | you achieve some |
---|
0:58:27 | loss |
---|
0:58:28 | like this on the development set |
---|
0:58:31 | and this dashed
---|
0:58:33 | blue line
---|
0:58:34 | whereas if we trained with adam with mini-batches of different sizes
---|
0:58:39 | up to five thousand
---|
0:58:41 | we see that we need really big batches to actually
---|
0:58:45 | get close to the BFGS
---|
0:58:48 | trained model, which was trained on full batches
---|
0:58:50 | so that kind of suggests that you really need to have many trials
---|
0:58:55 | within the mini-batch in order for
---|
0:58:59 | training these kinds of
---|
0:59:02 | systems with a verification loss to work, which is a bit of a problem and maybe a
---|
0:59:06 | challenge to deal with |
---|
0:59:07 | in future |
---|
0:59:12 | this is some more recent paper, and the interesting point of this paper was that
---|
0:59:17 | they did train the whole system
---|
0:59:20 | all the way from the waveform instead of from features as the others
---|
0:59:27 | did
---|
0:59:29 | but
---|
0:59:31 | i couldn't
---|
0:59:33 | understand completely whether the improvement came from the fact that they were
---|
0:59:37 | training from the waveform or if it was because of
---|
0:59:41 | the choice of architecture and so on
---|
0:59:45 | but it's interesting that
---|
0:59:48 | systems going
---|
0:59:49 | all the way from waveform to the end
---|
0:59:53 | can work well |
---|
0:59:58 | and this is a paper
---|
1:00:00 | from this year's
---|
1:00:02 | interspeech; it's interesting because
---|
1:00:08 | it's one of the more recent studies that really proposed or showed some good
---|
1:00:13 | performance of using the verification loss
---|
1:00:17 | here it was a joint
---|
1:00:19 | or
---|
1:00:20 | i guess more precisely multitask training, so they were training using both the identification loss
---|
1:00:24 | and the verification loss
---|
1:00:28 | and that's actually something i have tried too and never got any
---|
1:00:32 | benefit from, but one thing they did here was to
---|
1:00:36 | start with a large weight for the identification loss and gradually
---|
1:00:40 | increase the weight for the verification loss, and this is interesting and maybe
---|
1:00:47 | actually the right way to go
---|
1:00:49 | i'm curious about it |
---|
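The gradual shift between the two losses can be sketched with a simple linear interpolation; the exact schedule shape in the paper may differ, so treat this as an assumption.

```python
def combined_loss(loss_id, loss_ver, step, total_steps):
    """Interpolate identification and verification losses, starting with a
    large weight on identification and gradually shifting to verification
    (a linear schedule of my own choosing for illustration).
    """
    alpha = min(1.0, step / total_steps)   # goes from 0 to 1 over training
    return (1.0 - alpha) * loss_id + alpha * loss_ver

# Early training is dominated by the identification loss, late training
# by the verification loss:
early = combined_loss(2.0, 1.0, step=0, total_steps=1000)
late = combined_loss(2.0, 1.0, step=1000, total_steps=1000)
```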
1:00:54 | so |
---|
1:00:55 | now comes just little bits summary of this talk |
---|
1:00:59 | we discussed the motivation for end-to-end
---|
1:01:04 | training |
---|
1:01:05 | and |
---|
1:01:06 | we said that it has some good motivation |
---|
1:01:09 | and |
---|
1:01:10 | we showed, or
---|
1:01:13 | we referred to some
---|
1:01:16 | experimental results, also of other authors
---|
1:01:19 | which show that it seems to work quite well for text-dependent tasks with large amounts
---|
1:01:24 | of training data |
---|
1:01:27 | in such a case it's probably preferable to preserve the temporal structure and avoid
---|
1:01:33 | the temporal pooling |
---|
1:01:35 | in text-independent benchmarks one would need strong regularization or a mixed
---|
1:01:42 | training objective in order to benefit from
---|
1:01:45 | end-to-end training, and typically we would want to do some temporal pooling there
---|
1:01:54 | one could guess that end-to-end training would be the preferable choice in scenarios where we
---|
1:02:00 | have many training speakers with few utterances, where we have less of the statistical dependency
---|
1:02:05 | problem
---|
1:02:09 | something that to me seems to be an open question, and which would be
---|
1:02:14 | great if someone explored
---|
1:02:17 | is |
---|
1:02:18 | okay |
---|
1:02:19 | it is difficult actually to train end-to-end systems, especially for the text-independent
---|
1:02:24 | task
---|
1:02:25 | is this because of overfitting, or training convergence, or this dependency issue we discussed?
---|
1:02:32 | it's not really clear, i would say
---|
1:02:34 | and |
---|
1:02:36 | a practical question is how to adapt such systems, because with the more blockwise systems we
---|
1:02:43 | would often adapt the backend
---|
1:02:45 | or could we train the system in a way that we don't need adaptation
---|
1:02:53 | and also, how could we input some human knowledge about speech into this training, if
---|
1:02:58 | we need it
---|
1:03:00 | something we know about the data distribution or number of phonemes or |
---|
1:03:04 | whatever |
---|
1:03:07 | and we discussed that maybe
---|
1:03:12 | training a model for speaker identification is not ideal for speaker verification but is there |
---|
1:03:18 | some way to |
---|
1:03:21 | to find embeddings that are good for all these tasks
---|
1:03:27 | another interesting quick question is |
---|
1:03:32 | how well |
---|
1:03:34 | the llrs that come from
---|
1:03:36 | end-to-end
---|
1:03:38 | architectures
---|
1:03:39 | actually could approximate the true llr
---|
1:03:44 | so in other words what kind of |
---|
1:03:47 | and |
---|
1:03:49 | distributions could be |
---|
1:03:51 | arbitrarily accurately simulated or modeled by these architectures
---|
1:03:57 | it's not completely clear either
---|
1:04:00 | okay so |
---|
1:04:02 | thank you for your attention |
---|
1:04:05 | bye bye
---|
1:04:10 | hello, this is johan, and now i will present the hands-on session for the
---|
1:04:19 | end-to-end speaker verification tutorial
---|
1:04:26 | if this format does not work well, i don't know
---|
1:04:31 | well, i'm not really gonna run code, we will just read it, let's see
---|
1:04:37 | i will
---|
1:04:40 | talk about
---|
1:04:43 | two things: first
---|
1:04:44 | i will go through the code that i am using in
---|
1:04:48 | most of my experiments
---|
1:04:51 | and
---|
1:04:53 | after that i will show a few tricks to solve the various
---|
1:04:58 | implementation issues
---|
1:05:02 | that i have used
---|
1:05:06 | okay so |
---|
1:05:10 | first |
---|
1:05:11 | the code for the end-to-end system: so this is code that i started working
---|
1:05:17 | on during my postdoc, from around two thousand sixteen
---|
1:05:23 | initially it was in theano, but that is now considered old
---|
1:05:30 | and i guess it is
---|
1:05:32 | time to switch to, let's say, tensorflow 2 or pytorch or something
---|
1:05:38 | else
---|
1:05:41 | the links of the repository is here |
---|
1:05:44 | and most of the stuff in this repository — most scripts there — are |
---|
1:05:52 | for multiclass training, mostly used for embedding training, maybe in |
---|
1:05:59 | combination with other stuff |
---|
1:06:01 | but the |
---|
1:06:03 | demo we will look at |
---|
1:06:05 | instead uses |
---|
1:06:07 | end-to-end training with the verification loss |
---|
1:06:11 | the papers that we have published were actually based on code close to |
---|
1:06:15 | this, though an older version — I think there is not so much point in |
---|
1:06:19 | maintaining that anymore |
---|
1:06:23 | but I do have one script here that uses the verification loss in |
---|
1:06:29 | combination with the identification loss, and that is the script we will look at |
---|
1:06:37 | and generally |
---|
1:06:41 | well, let's say this first: I will try to point out things in this code that |
---|
1:06:46 | I think are well known and that worked well, and also mention |
---|
1:06:51 | what I would |
---|
1:06:52 | do differently today |
---|
1:06:54 | to maybe give some |
---|
1:06:57 | advice — at least what I can say from my own experience of spending a lot of time building |
---|
1:07:02 | some |
---|
1:07:03 | small toolkit for speaker verification |
---|
1:07:09 | note that I didn't see any benefit here from adding the verification |
---|
1:07:15 | loss to the identification loss |
---|
1:07:18 | and, contrary to the paper I mentioned in the tutorial |
---|
1:07:23 | it could be that the quite complicated scheme there for changing the balance between the losses |
---|
1:07:30 | throughout training is really needed; this is maybe something I will look at at some |
---|
1:07:37 | point |
---|
1:07:41 | this script you |
---|
1:07:45 | would normally run in the normal way |
---|
1:07:49 | locally, but I won't run it here — I modified it |
---|
1:07:54 | a little bit for the presentation, because |
---|
1:07:58 | it was |
---|
1:07:59 | written |
---|
1:08:00 | originally in such a way that it |
---|
1:08:02 | assumes our own environment |
---|
1:08:07 | so some small adjustments might be needed if you actually want to run it yourself |
---|
1:08:16 | so |
---|
1:08:17 | next |
---|
1:08:18 | I tried, when organising my experiments, to arrange things in such a way that |
---|
1:08:25 | there is one script where everything that is specific to the experiment is set, so that |
---|
1:08:32 | includes which data to use, the configuration of the model, and so on |
---|
1:08:38 | I never really liked |
---|
1:08:42 | and never found it efficient to have |
---|
1:08:44 | input arguments to these scripts, such as which data to use, because anyway you |
---|
1:08:50 | will almost always have to change something in the script |
---|
1:08:56 | for a new experiment — then you can just keep one script per experiment, and |
---|
1:09:02 | so on |
---|
1:09:06 | but other things that are a little bit more stable from experiment to experiment are |
---|
1:09:12 | just loaded from this script |
---|
1:09:15 | such as models of different architectures, and so on |
---|
1:09:25 | so usually I use an underscore suffix to denote symbolic (tensor) variables and `_v` for placeholder values |
---|
1:09:34 | and so on |
---|
1:09:36 | the kind of |
---|
1:09:40 | models here are |
---|
1:09:43 | similar to Keras models, just maybe a little bit less |
---|
1:09:49 | fancy in their |
---|
1:09:54 | features |
---|
1:09:58 | I didn't use Keras initially, because when I started with this years ago |
---|
1:10:03 | Keras was not flexible enough — or at least I could not get it to work |
---|
1:10:11 | neatly with this — but now it is definitely flexible enough |
---|
1:10:25 | so |
---|
1:10:31 | for example, here is the settings file, where we define things that |
---|
1:10:37 | maybe some would think should rather be input arguments, as I mentioned |
---|
1:10:43 | before |
---|
1:10:45 | but since it is anyway necessary to change things in this file for every experiment, I prefer to |
---|
1:10:51 | define things here |
---|
1:10:53 | so here, somewhere, is the training data |
---|
1:10:56 | how long the shortest and the longest segments we train on are |
---|
1:11:03 | some other parameters related to training: batch size |
---|
1:11:08 | maximum number of epochs |
---|
1:11:10 | and |
---|
1:11:12 | number of batches in an epoch — so I don't really define |
---|
1:11:18 | an epoch as one pass over the data, but rather as a fixed number of batches, |
---|
1:11:23 | as I will explain in a minute |
---|
1:11:29 | also patience — probably most of you are familiar with it, but it is worth mentioning: |
---|
1:11:34 | it controls for how long we keep training when |
---|
1:11:35 | the validation score does not improve |
---|
1:11:37 | so the next part of the script is the part for defining how to load |
---|
1:11:46 | and prepare data |
---|
1:11:48 | and here one important point is that |
---|
1:11:53 | the batches we will |
---|
1:11:56 | build consist of chunks of features from different utterances — so randomly selected segments |
---|
1:12:05 | if you load them from a normal hard disk, then randomly selecting different segments from |
---|
1:12:13 | different utterances |
---|
1:12:15 | will be too slow, I would say |
---|
1:12:21 | so often |
---|
1:12:23 | you cannot build them at training time; in that case, as the Kaldi recipe does, you |
---|
1:12:28 | prepare |
---|
1:12:29 | many batches in advance |
---|
1:12:32 | well |
---|
1:12:33 | that is one way; in my setup, I instead store the |
---|
1:12:40 | data on SSDs, and then it can be loaded as you wish — feature chunks can |
---|
1:12:47 | be loaded randomly, fast enough, that way |
---|
1:12:50 | so this is |
---|
1:12:53 | good, because it allows for much more flexibility in experiments; for example, sometimes |
---|
1:12:59 | you may want to load two segments from the same utterance that overlap by some |
---|
1:13:04 | proportion |
---|
1:13:06 | for some experiments |
---|
1:13:10 | or sometimes you just want to change the duration of the segments |
---|
1:13:16 | if you |
---|
1:13:18 | use Kaldi egs, then you have to prepare new egs for this |
---|
1:13:22 | so I would say that |
---|
1:13:24 | using SSDs |
---|
1:13:28 | and then just loading features as training goes is a |
---|
1:13:33 | very good thing — SSDs are really worth the investment if |
---|
1:13:38 | you want to |
---|
1:13:39 | do this kind of experiments |
---|
1:13:44 | I define some functions, for example to load the training features, given some |
---|
1:13:52 | list of files — this one will load the data, and so on |
---|
1:13:59 | so if you want to load batches in some specific way, again |
---|
1:14:05 | it is defined here; but if you want to do, for example, the thing |
---|
1:14:08 | I mentioned — load two overlapping segments from the same utterance — then you would |
---|
1:14:13 | have to change the function here |
---|
1:14:15 | so this was quite a |
---|
1:14:18 | useful way of organising things, for me at least, in my experiments |
---|
1:14:26 | another important thing this script does is create dictionaries with |
---|
1:14:32 | mappings — for example from each class (speaker) to its utterances |
---|
1:14:38 | and from utterances to files, and so on |
---|
1:14:43 | and that's |
---|
1:14:46 | created here |
---|
1:14:51 | and these |
---|
1:14:55 | mappings are used to create the minibatches |
---|
1:14:59 | a little bit later, down here, I create a generator for minibatches, and |
---|
1:15:04 | it takes this dictionary of |
---|
1:15:09 | mappings — such as the utterance-to-speaker mapping — and I have different generators depending |
---|
1:15:15 | on what kind of minibatches I want; for example, do you want |
---|
1:15:19 | randomly selected speakers with all their data, or do you want randomly selected speakers |
---|
1:15:24 | and, for example, two utterances each, or something like that |
---|
1:15:30 | so that is changed by changing the generator |
---|
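As an aside, the kind of minibatch generator described here can be sketched in a few lines of plain Python. The mapping and parameter names below are made up for illustration; the actual repository code is organised differently:

```python
import random

# Hypothetical speaker-to-utterance mapping of the kind the script builds;
# the real code also maps utterances to feature files on the SSD.
spk2utts = {"spk1": ["u1", "u2", "u3"], "spk2": ["u4", "u5"], "spk3": ["u6", "u7"]}

def batch_generator(spk2utts, n_spk=2, n_utt_per_spk=2, seed=0):
    """Endlessly yield minibatches: n_spk randomly chosen speakers,
    n_utt_per_spk randomly chosen utterances for each of them."""
    rng = random.Random(seed)
    speakers = sorted(spk2utts)
    while True:
        batch = []
        for s in rng.sample(speakers, n_spk):
            for _ in range(n_utt_per_spk):
                batch.append((s, rng.choice(spk2utts[s])))
        yield batch

first = next(batch_generator(spk2utts))   # e.g. [("spk2", "u5"), ("spk2", "u4"), ...]
```

Switching to "all data of the selected speakers" or some other batch composition then only means swapping in a different generator function, as described above.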
1:15:39 | then the next step is to |
---|
1:15:42 | set up the model |
---|
1:15:44 | and here I'm using a |
---|
1:15:47 | TDNN — essentially the usual x-vector architecture |
---|
1:15:52 | and I also add a PLDA model |
---|
1:15:57 | on top of the embeddings from this |
---|
1:16:03 | x-vector extractor, so to speak |
---|
1:16:09 | which |
---|
1:16:11 | does the actual verification |
---|
1:16:19 | one minor difference from the Kaldi architecture is that I found it necessary |
---|
1:16:24 | to have some kind of normalization layer after the temporal pooling — batch norm, or just |
---|
1:16:32 | standardization estimated on the data once at the beginning, works fine |
---|
1:16:36 | as well |
---|
1:16:38 | I guess the reason it is needed here could be that we use a simpler optimizer — |
---|
1:16:44 | we use just stochastic gradient descent, as compared to Kaldi, which uses something |
---|
1:16:48 | much more elaborate |
---|
1:16:54 | so in this part of the code comes the |
---|
1:17:00 | definition of the architecture, like the number of layers, their sizes |
---|
1:17:07 | activation functions |
---|
1:17:09 | and so on |
---|
1:17:11 | whether we should have normalization of the features, normalization after the pooling layer |
---|
1:17:18 | and whether these |
---|
1:17:22 | normalizations |
---|
1:17:25 | are |
---|
1:17:27 | updated during training, or fixed after being initialized |
---|
1:17:37 | then options for regularization, and so on |
---|
1:17:41 | we initialize the model here, and we provide, |
---|
1:17:47 | when we do this, an iterator — the generator for |
---|
1:17:52 | the training data — and this is used to initialize the model's normalization layers |
---|
1:17:58 | this is something that creates a bit of a mess, and I would probably do it |
---|
1:18:04 | differently if I rewrote it now — |
---|
1:18:09 | maybe just some normal initialization, and then run a few |
---|
1:18:16 | iterations |
---|
1:18:18 | before starting the training, just to |
---|
1:18:21 | initialize the normalization layers |
---|
1:18:30 | then we apply the model to the data, which is in these placeholders |
---|
1:18:35 | here |
---|
1:18:36 | and |
---|
1:18:39 | then |
---|
1:18:41 | what comes out will be the embeddings and the classifications |
---|
1:18:46 | and |
---|
1:18:47 | the |
---|
1:18:49 | embeddings — in this particular script — we will send them to |
---|
1:18:55 | the PLDA model |
---|
1:18:58 | basically, here |
---|
1:18:59 | we make some settings for it |
---|
1:19:03 | and |
---|
1:19:05 | from the probabilistic LDA model we can get the |
---|
1:19:09 | scores |
---|
1:19:10 | for all pairwise comparisons |
---|
1:19:14 | in the minibatch, and also a loss for that, if we provide |
---|
1:19:18 | labels for it |
---|
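The "score all pairs, then attach a loss" step can be sketched like this in numpy. The dot-product scoring below is only a stand-in for the PLDA layer, and all names are illustrative rather than taken from the repository:

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(6, 4))              # minibatch of embeddings
spk = np.array([0, 0, 1, 1, 2, 2])       # speaker label of each embedding

S = E @ E.T                              # trial scores for every pair (PLDA stand-in)
same = (spk[:, None] == spk[None, :]).astype(float)

# Binary cross-entropy over all distinct trials (upper triangle, no self-trials)
iu = np.triu_indices(len(spk), k=1)
p = 1.0 / (1.0 + np.exp(-S[iu]))         # sigmoid turns each score into P(same speaker)
ver_loss = -np.mean(same[iu] * np.log(p) + (1 - same[iu]) * np.log(1 - p))
```

With 6 embeddings this gives 15 trials per minibatch, which is why a single batch already provides a usable verification training signal.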
1:19:22 | so the next part is to define the losses and the train functions, and so on |
---|
1:19:31 | we have the loss as a weighted combination of the classification loss and, as the second part, |
---|
1:19:36 | the verification loss |
---|
1:19:38 | which is binary; their respective weights in this example are 0.25 |
---|
1:19:44 | and 0.75 |
---|
1:19:47 | and maybe one important thing here is that these losses are normalized; in the |
---|
1:19:54 | multiclass case we divide by |
---|
1:19:56 | minus |
---|
1:19:58 | the log probability of a random guess — so log of the number of speakers |
---|
1:20:05 | I mean, the loss of random classification, random guessing |
---|
1:20:09 | and the reason to do this is that |
---|
1:20:13 | if the model is just initialized with random parameters, the loss |
---|
1:20:18 | will be one, or approximately so |
---|
1:20:21 | and we do the same thing for the verification loss |
---|
1:20:27 | so |
---|
1:20:28 | this means that both losses start out at a similar value, and it |
---|
1:20:33 | becomes easier to choose how to interpolate between them |
---|
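The normalization described above can be sketched as follows. The function names are mine, and the verification part assumes a plain binary cross-entropy; the repository may weight target and non-target trials differently:

```python
import numpy as np

n_spk = 1000

def norm_xent(logprob_true_class, n_spk=n_spk):
    """Multiclass cross-entropy divided by the loss of uniform random
    guessing, -log(1/n_spk) = log(n_spk)."""
    return -logprob_true_class / np.log(n_spk)

def norm_bce(p_target, is_target):
    """Binary verification loss divided by the loss of always answering
    p = 0.5, which costs log(2) per trial."""
    bce = -(is_target * np.log(p_target) + (1 - is_target) * np.log(1 - p_target))
    return bce / np.log(2)

# At random initialization both normalized losses come out near 1,
# which makes interpolation weights between them easier to choose.
xe0 = norm_xent(np.log(1.0 / n_spk))
bce0 = norm_bce(0.5, 1)
```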
1:20:40 | and at the end of this script we define a training function, which takes |
---|
1:20:45 | the data for a batch, computes the loss, and updates |
---|
1:20:50 | the model |
---|
1:20:53 | the next part is for |
---|
1:20:59 | defining functions for setting parameters and getting parameters of the model |
---|
1:21:04 | and |
---|
1:21:06 | defining |
---|
1:21:07 | a function to |
---|
1:21:10 | check some kind of validation loss after each epoch |
---|
1:21:21 | so this part is just for setting parameters and getting parameters |
---|
1:21:26 | and |
---|
1:21:28 | is maybe not so important |
---|
1:21:31 | you can find the |
---|
1:21:34 | function for checking the validation loss here |
---|
1:21:39 | finally, the training is done by this |
---|
1:21:41 | function here, which takes the |
---|
1:21:45 | function for |
---|
1:21:47 | checking the validation loss; it takes many other parameters too |
---|
1:21:51 | and things that we defined earlier |
---|
1:21:55 | okay |
---|
1:22:03 | for example the function for training, and so on |
---|
1:22:07 | so the way we train here is basically to |
---|
1:22:12 | iterate over epochs, where an epoch was defined as |
---|
1:22:16 | a certain number of batches |
---|
1:22:20 | and this is because we don't really have epochs in the usual sense — we just continuously |
---|
1:22:24 | sample random segments |
---|
1:22:26 | so there is |
---|
1:22:29 | no clear notion of what one pass over the data |
---|
1:22:33 | would |
---|
1:22:35 | be |
---|
1:22:35 | but anyway |
---|
1:22:38 | we do the training, and if an epoch doesn't improve the validation loss, we |
---|
1:22:45 | retry — |
---|
1:22:46 | we try a few more times, up to "patience" number of times, and if it still doesn't improve, we |
---|
1:22:51 | go back — |
---|
1:22:54 | reset the parameters to the best ones so far and halve the learning rate, and so on; okay, |
---|
1:23:00 | I don't know if this is the best |
---|
1:23:02 | scheme, but it has worked well enough for me |
---|
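A minimal sketch of that scheme, with illustrative names (the repository version takes many more arguments):

```python
# If the validation loss does not improve for `patience` epochs, revert to
# the best parameters seen so far and halve the learning rate.
def train(run_epoch, val_loss, get_params, set_params,
          lr=0.1, patience=3, max_epochs=100, min_lr=1e-5):
    best = val_loss()
    best_params = get_params()
    bad = 0
    for _ in range(max_epochs):
        run_epoch(lr)                    # one "epoch" = a fixed number of batches
        cur = val_loss()
        if cur < best:
            best, best_params, bad = cur, get_params(), 0
        else:
            bad += 1
            if bad > patience:
                set_params(best_params)  # rewind to the best model so far
                lr /= 2                  # and halve the learning rate
                bad = 0
                if lr < min_lr:
                    break
    set_params(best_params)
    return best
```

The `run_epoch`, `val_loss`, `get_params`, and `set_params` callables stand in for the compiled Theano functions described earlier.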
1:23:13 | so |
---|
1:23:14 | that's it for the code part |
---|
1:23:16 | and |
---|
1:23:18 | moving on, I would like to mention a few tricks |
---|
1:23:23 | not very complicated things |
---|
1:23:25 | but they were maybe slightly difficult for me to figure out |
---|
1:23:30 | and |
---|
1:23:31 | they are related to backpropagation, and the things I wanted to modify there |
---|
1:23:41 | so let's just first briefly review the backpropagation algorithm |
---|
1:23:48 | basically |
---|
1:23:52 | you know that a neural network is just |
---|
1:23:55 | a series of affine transformations, each followed by a nonlinearity, then again an affine transformation, and again a |
---|
1:24:01 | nonlinearity, and so on |
---|
1:24:03 | so at each layer we apply an affine transformation — |
---|
1:24:09 | let's call its output z — and then we apply some nonlinearity f and |
---|
1:24:14 | obtain the activation a; and we do that over and over |
---|
1:24:18 | until we get the final |
---|
1:24:22 | output, and then we have some cost function |
---|
1:24:26 | C on that, for example cross-entropy |
---|
1:24:29 | and if we denote function composition with the ring here — which basically means |
---|
1:24:34 | that the composition of g and h is just |
---|
1:24:39 | applying h on the data and then g on the result — then we know that we can |
---|
1:24:43 | write the whole neural network as |
---|
1:24:46 | applying the first affine transformation to the input |
---|
1:24:50 | then the first nonlinearity |
---|
1:24:54 | and so on, all the way |
---|
1:24:55 | to the output |
---|
1:24:57 | it can be written like this |
---|
1:24:59 | and it is also easy to write the |
---|
1:25:04 | gradient of the |
---|
1:25:06 | loss with respect to the input, using the chain rule |
---|
1:25:11 | so it's just |
---|
1:25:13 | basically, the derivative of C with respect to the input is just |
---|
1:25:19 | a chain like this: the derivative of C with respect to the last activation, times |
---|
1:25:23 | the derivative of that with respect to its pre-activation, and so on |
---|
1:25:25 | and I have these |
---|
1:25:27 | square brackets here just to denote that these are |
---|
1:25:32 | Jacobians — the multivariate chain rule looks the |
---|
1:25:37 | same as the scalar one, except that we need to use Jacobians and matrix products instead of |
---|
1:25:43 | ordinary products |
---|
1:25:48 | so, first of all |
---|
1:25:55 | the |
---|
1:25:56 | derivative of a with respect to z — |
---|
1:26:01 | this is a Jacobian, as I said, because a is a vector, so |
---|
1:26:07 | it contains all the elements like this here |
---|
1:26:12 | the derivative of a with respect to z |
---|
1:26:15 | is just going to be a diagonal matrix, because f is applied |
---|
1:26:20 | elementwise |
---|
1:26:22 | and the other entries are zero |
---|
1:26:26 | and, more interestingly — |
---|
1:26:28 | if we look at this term here, we will see, maybe a little surprisingly, that |
---|
1:26:33 | it is just the weight matrix |
---|
1:26:36 | so then backpropagation goes like this: |
---|
1:26:39 | okay, we start by calculating the |
---|
1:26:42 | derivative of C |
---|
1:26:45 | with respect to the last activation |
---|
1:26:48 | and then the corresponding delta, which is just these two factors |
---|
1:26:51 | and then |
---|
1:26:52 | we can |
---|
1:26:54 | continue with |
---|
1:26:56 | the derivative of C with respect to some earlier z, by just taking what |
---|
1:27:02 | we already have and multiplying by, for example, these two factors; then we get one |
---|
1:27:06 | layer further down, and so on |
---|
1:27:08 | so it's |
---|
1:27:10 | a recursive process like that. Of course, we don't want the derivative of the loss with |
---|
1:27:16 | respect to the input — what we want is, of course, the derivative with respect to the |
---|
1:27:19 | model parameters, which are the |
---|
1:27:22 | biases and the weights |
---|
1:27:24 | which we have |
---|
1:27:26 | here and here |
---|
1:27:29 | those are given by these expressions here |
---|
1:27:33 | so |
---|
1:27:34 | for the biases it is just this |
---|
1:27:38 | delta, computed down here |
---|
1:27:40 | for an element of the weight matrix, we multiply the corresponding element of delta |
---|
1:27:49 | by the corresponding element of the |
---|
1:27:55 | previous activation; and if we are interested in the derivative with respect to a whole part, |
---|
1:28:00 | we multiply the corresponding parts of these |
---|
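The whole derivation fits in a few lines of numpy. This toy version (two layers, made-up sizes, and a stand-in cost C = ½‖a₂‖²) just instantiates the formulas above:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=3)                         # network input
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

# Forward pass: z = W a + b, then a = f(z), layer by layer (f = ReLU here).
z1 = W1 @ x + b1;  a1 = np.maximum(z1, 0)
z2 = W2 @ a1 + b2; a2 = np.maximum(z2, 0)
C = 0.5 * np.sum(a2 ** 2)                      # toy cost, so dC/da2 = a2

# Backward pass: delta_l = dC/dz_l, pushed down through the two Jacobians
# diag(f'(z_l)) and W_l, exactly as in the derivation.
d2 = a2 * (z2 > 0)                             # dC/dz2
dW2, db2 = np.outer(d2, a1), d2                # parameter gradients, layer 2
d1 = (W2.T @ d2) * (z1 > 0)                    # dC/dz1
dW1, db1 = np.outer(d1, x), d1                 # parameter gradients, layer 1
```

Note how `db` is delta itself, and `dW` is delta times the previous activation — the two facts the last trick below relies on.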
1:28:07 | okay, so that was the refresher |
---|
1:28:13 | here are also two really good references on this, if you want to go |
---|
1:28:19 | further into it |
---|
1:28:24 | now that we have |
---|
1:28:27 | reviewed this |
---|
1:28:28 | I will describe |
---|
1:28:30 | a couple of different issues that I ran into that required some |
---|
1:28:36 | little bit of thinking in relation to this |
---|
1:28:40 | the first thing is that, as you see here, in order to calculate the derivatives with respect to the |
---|
1:28:47 | weights, you need the outputs of each layer — the a's here |
---|
1:28:53 | and that means that we need to keep all of those in memory from the |
---|
1:28:58 | forward pass — we need them in memory when we do the backward pass — and if you |
---|
1:29:03 | have big batches, many utterances, or long utterances, this can become too much |
---|
1:29:10 | it can go up to many gigabytes — several gigabytes, for example |
---|
1:29:17 | for larger batches |
---|
1:29:20 | so |
---|
1:29:23 | both |
---|
1:29:24 | Theano and TensorFlow have, as far as I know, a way of getting around this |
---|
1:29:31 | and that is that you |
---|
1:29:37 | loop over the data |
---|
1:29:39 | they have the option — in the case of TensorFlow, and in the case of |
---|
1:29:43 | Theano — to discard the |
---|
1:29:48 | intermediate outputs from the forward pass; then, in the backward pass, they |
---|
1:29:52 | recalculate them when needed, so you basically just keep the |
---|
1:29:59 | memory for one utterance, or one chunk, at a time |
---|
1:30:04 | TensorFlow, I think, handles the same thing a little bit better, because it has the option |
---|
1:30:09 | to discard the outputs to the CPU memory, which is generally bigger |
---|
1:30:15 | than the GPU memory |
---|
1:30:17 | so |
---|
1:30:19 | in that case |
---|
1:30:21 | in order to use this, we can |
---|
1:30:25 | loop over the inputs up until the pooling layer, and after the pooling layer we |
---|
1:30:31 | put all the |
---|
1:30:33 | outputs together, so that we now have a kind of |
---|
1:30:40 | tensor with all the embeddings stored |
---|
1:30:44 | and then that can be processed normally |
---|
1:30:50 | and then you would just calculate the loss and do the backward pass — or at |
---|
1:30:55 | least that is roughly the idea; you have to think it through carefully |
---|
1:31:02 | this of course also has the advantage that we can have segments of different durations |
---|
1:31:08 | within a batch, which may otherwise be |
---|
1:31:11 | a bit complicated, or maybe not even possible |
---|
1:31:18 | I'm not showing the code for this |
---|
1:31:21 | because it sits among so many other things in the scripts that it would be very difficult |
---|
1:31:27 | to see what's going on |
---|
1:31:30 | it is there, though, in the |
---|
1:31:32 | repository |
---|
1:31:34 | scripts |
---|
1:31:36 | I was hoping to write some small example of it, but I didn't manage |
---|
1:31:42 | to do it in time |
---|
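In lieu of that missing example, here is a numpy sketch of the idea. A one-layer tanh "encoder" and a toy loss stand in for the real network and scoring; the point is only that the forward pass keeps nothing but the pooled embeddings, and the backward pass recomputes each utterance's activations one at a time, so peak memory is one utterance rather than the whole batch:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))                    # one-layer "encoder" (no bias, for brevity)
utts = [rng.normal(size=(50, 3)) for _ in range(8)]

def encode(X, W):                              # frames (T, 3) -> pooled embedding (4,)
    return np.tanh(X @ W.T).mean(axis=0)

# Forward: keep ONLY the embeddings; per-frame activations are discarded.
E = np.stack([encode(X, W) for X in utts])     # (8, 4)
loss = 0.5 * (E ** 2).sum()                    # toy stand-in for the scoring loss
dE = E                                         # dloss/dE for this toy loss

# Backward: redo each utterance's forward pass on demand, one at a time.
dW = np.zeros_like(W)
for X, g in zip(utts, dE):
    A = np.tanh(X @ W.T)                       # recomputed activations (T, 4)
    dZ = (g / len(X)) * (1 - A ** 2)           # grad through mean-pooling and tanh
    dW += dZ.T @ X
```

In Theano or TensorFlow the loop and the recomputation are handled by the framework's scan/recompute option, as described above; this sketch only shows why the result is the same gradient.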
1:31:47 | okay, so that's one trick |
---|
1:31:50 | the second |
---|
1:31:53 | trick is related to parallelization |
---|
1:32:03 | so |
---|
1:32:06 | suppose that we have some architecture like this: first the feature processing, and then the |
---|
1:32:11 | pooling |
---|
1:32:13 | and then some processing of the embeddings, and finally the scoring |
---|
1:32:19 | now, if we want to — |
---|
1:32:22 | well, normally, if we want to do parallelization when training with some multiclass objective, |
---|
1:32:27 | it isn't really a problem, because we just distribute the data over different |
---|
1:32:31 | workers, each of them calculates some gradients, and we can average the gradients |
---|
1:32:35 | or we can average the updated models |
---|
1:32:39 | but in this case, in the scoring part — when we use the verification loss |
---|
1:32:45 | in the scoring — we would like to have a comparison of all trials, all |
---|
1:32:50 | possible trials |
---|
1:32:51 | so we need to do |
---|
1:32:53 | the embedding extraction on the individual workers, then send all the embeddings |
---|
1:32:59 | to the master, which does the scoring |
---|
1:33:03 | then we do backpropagation down to the embeddings, and then we send those |
---|
1:33:10 | gradients back to each worker |
---|
1:33:13 | and they can continue the |
---|
1:33:17 | backpropagation |
---|
1:33:20 | the thing is, this is not exactly the normal situation: normally, when you |
---|
1:33:24 | have |
---|
1:33:26 | calculated the loss, you |
---|
1:33:30 | backpropagate it all the way to the input; but here, what the worker receives is not a loss — |
---|
1:33:37 | it is basically the derivative of the loss with respect to its embeddings |
---|
1:33:42 | so how do you use that to |
---|
1:33:47 | continue the backpropagation on |
---|
1:33:50 | the individual nodes? |
---|
1:33:56 | one simple trick to do this is to define, on each worker, a surrogate loss |
---|
1:34:01 | like this here: I define a new loss, which is just |
---|
1:34:05 | the derivative of |
---|
1:34:08 | the cost |
---|
1:34:11 | with respect to the embedding elements — which is what we received |
---|
1:34:16 | from the master node — |
---|
1:34:18 | times the embedding elements, summed over all elements, like this |
---|
1:34:23 | and if we now |
---|
1:34:26 | optimize this loss, we will get |
---|
1:34:28 | what we want, because — |
---|
1:34:31 | let's consider the derivative of this loss with respect to some |
---|
1:34:36 | model parameter of the neural network |
---|
1:34:39 | okay, we apply the chain rule here — |
---|
1:34:41 | we just take the derivative here |
---|
1:34:45 | here is the term that depends on the |
---|
1:34:48 | parameter, so we differentiate it, and this is then exactly the |
---|
1:34:53 | derivative |
---|
1:34:54 | that we are |
---|
1:34:57 | looking for; so |
---|
1:35:01 | the derivative |
---|
1:35:03 | of this loss with respect to a model parameter will be exactly the same as that |
---|
1:35:08 | of the loss we are interested in |
---|
1:35:10 | it is possible that some newer toolkits have |
---|
1:35:15 | ways to just do this directly, without such a trick — I'm not sure — but |
---|
1:35:21 | this was a simple way to achieve it |
---|
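In code, the trick is just this (a one-layer "network" and a toy master-side loss stand in for the real architecture; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 3))        # worker-side "network": one linear layer for brevity
x = rng.normal(size=3)

def embed(Wm):                     # worker forward pass -> embedding
    return Wm @ x

# Master side: scores the trials and backpropagates down to the embeddings.
# The scoring loss is a stand-in here: C = 0.5 * ||e||^2, so dC/de = e.
e = embed(W)
g = e.copy()                       # the gradient each worker receives from the master

# Worker side: surrogate loss = <g, e>, with g treated as a constant.
def surrogate(Wm):
    return g @ embed(Wm)

# Its gradient w.r.t. any network parameter equals the true dC/dparameter,
# checked here for W[1, 2] by finite differences:
eps = 1e-6
P = np.zeros_like(W); P[1, 2] = eps
surr_grad = (surrogate(W + P) - surrogate(W - P)) / (2 * eps)
true_grad = np.outer(e, x)[1, 2]   # analytic dC/dW for this toy C
```

In a real toolkit, each worker would simply call its autodiff backward pass on `surrogate` and feed the result to the optimizer.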
1:35:25 | the |
---|
1:35:29 | final trick |
---|
1:35:30 | is |
---|
1:35:34 | related to |
---|
1:35:36 | something that Kaldi calls repairing saturated ReLU units |
---|
1:35:42 | so |
---|
1:35:45 | the ReLU is the max operation; so let us remember, we have an affine transformation |
---|
1:35:51 | followed by an |
---|
1:35:54 | activation function, and if it's the ReLU, one potential |
---|
1:36:01 | problem is then: |
---|
1:36:03 | whenever the input is below zero, the ReLU will output zero — so when everything |
---|
1:36:09 | that goes into the ReLU is below zero, so all its inputs are |
---|
1:36:14 | below zero, then this ReLU is basically never outputting anything, and it's |
---|
1:36:20 | effectively |
---|
1:36:22 | useless |
---|
1:36:23 | and there is also the opposite problem: if the input is always above zero |
---|
1:36:27 | then the ReLU is just a linear unit, so it adds no modeling power |
---|
1:36:34 | only when the input tends to be |
---|
1:36:37 | you know |
---|
1:36:40 | sometimes |
---|
1:36:42 | positive and sometimes negative is the ReLU really acting as a |
---|
1:36:48 | nonlinearity, and |
---|
1:36:51 | the network is doing something interesting |
---|
1:36:53 | so what Kaldi does is regularly check if a ReLU unit has a problem |
---|
1:37:00 | like this, and in that case |
---|
1:37:03 | it will add |
---|
1:37:04 | some small offset |
---|
1:37:07 | to the delta — |
---|
1:37:09 | that is, to the derivative of C with respect to z |
---|
1:37:14 | the problem with doing this in some of the standard neural network toolkits is that we |
---|
1:37:19 | don't really — we can't really — we don't have an easy way to manipulate this delta |
---|
1:37:24 | term |
---|
1:37:26 | which is used inside the backpropagation |
---|
1:37:28 | so we will instead have to manipulate the derivatives with respect to the model parameters directly |
---|
1:37:35 | and |
---|
1:37:38 | seeing |
---|
1:37:41 | how |
---|
1:37:43 | these derivatives look |
---|
1:37:46 | — we know this from the derivation — |
---|
1:37:49 | the derivative with respect to the bias is just |
---|
1:37:52 | the delta itself; so whatever we want to add to delta, we can just |
---|
1:37:56 | add it directly to this |
---|
1:37:58 | gradient here, which we get from the toolkit |
---|
1:38:02 | and similarly for the weights — it is just that we also need to multiply by the |
---|
1:38:08 | activations of the previous layer, because that's how that gradient is calculated |
---|
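A numpy sketch of this repair follows. The detection thresholds and the 0.01 nudge magnitude are my own illustrative choices, not Kaldi's actual self-repair constants:

```python
import numpy as np

rng = np.random.default_rng(0)
A_prev = rng.normal(size=(256, 8))            # activations feeding the layer (frames, in)
W = rng.normal(size=(5, 8))
b = np.array([-10.0, 10.0, 0.0, 0.0, 0.0])    # unit 0 is dead, unit 1 is always-on
Z = A_prev @ W.T + b                          # pre-activations (frames, units)

# Fraction of frames on which each ReLU fires; healthy units sit in between.
active = (Z > 0).mean(axis=0)
dead = active < 0.05
always_on = active > 0.95

# Nudge delta = dC/dZ so that gradient DESCENT (b -= lr * grad) raises the
# bias of dead units and lowers it for always-on ones. Toolkits rarely
# expose delta, but since dC/db = sum_t delta_t and dC/dW = delta^T A_prev,
# the same nudge can be applied to the parameter gradients directly.
repair = 0.01 * (always_on.astype(float) - dead.astype(float))
delta_nudge = np.broadcast_to(repair, Z.shape)

grad_b_extra = delta_nudge.sum(axis=0)        # add this to the toolkit's bias gradient
grad_W_extra = delta_nudge.T @ A_prev         # and this to the weight gradient
```

In the training loop, `grad_b_extra` and `grad_W_extra` would simply be added to the gradients the toolkit returns, before the optimizer update.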
1:38:13 | so those were some small tricks, and as a summary maybe I can say |
---|
1:38:18 | that it is quite helpful, when you work with neural networks, to |
---|
1:38:23 | understand the backpropagation properly, so that you know what's going on |
---|
1:38:29 | and then you can easily do small fixes like these |
---|
1:38:34 | so that's |
---|
1:38:35 | all from the hands-on session — thank you for your attention, and bye |
---|