0:00:35 | well, this is work on source normalization |
0:00:39 | and |
0:00:40 | you can observe |
0:00:42 | there are three authors here, and the first author did the actual work |
0:00:47 | yeah, he developed the general idea of source normalization |
0:00:52 | that was, i think, right after |
0:00:55 | the previous evaluation |
0:00:57 | now |
0:01:01 | if after this talk you think "this is fantastic, i'm going to implement this |
0:01:06 | tomorrow or next week" |
0:01:08 | then all you need to know is: he has done it already |
0:01:13 | and he made the slides as well |
0:01:17 | yeah |
0:01:18 | i'm very happy with that. if, on the other hand, you afterwards think "i didn't |
0:01:23 | get it" or "why didn't i think of this before", then this is probably due to |
0:01:27 | me not being able to convey the message |
0:01:31 | and if you beforehand thought the same thing |
0:01:36 | then you're sort of even with what we have |
0:01:40 | right |
0:01:41 | anyway |
0:01:42 | so |
0:01:45 | this is a sort of automatically generated summary of my presentation today |
0:01:52 | which i think is kind of pointless in this particular case because |
0:01:55 | it contains lots of ironies that can only be explained later, so let's skip it |
0:02:00 | moving to |
0:02:01 | the motivation for this work. the idea is that in speaker recognition |
0:02:07 | we all know that in the evaluations, where things change from year to year, we |
0:02:12 | often |
0:02:13 | get into the situation where we get new data we haven't seen before. sitting here, |
0:02:18 | well, yeah |
0:02:20 | noisy data, who knows what kind of noise; maybe some people know |
0:02:24 | but most of us don't |
0:02:27 | and |
0:02:29 | well |
0:02:30 | how are we going to deal with that? |
0:02:33 | i don't know |
0:02:35 | i sometimes have to make |
0:02:37 | speculations and say what i'm going to talk about |
0:02:40 | every once in a while that actually turns out to be rubbish, i guess, because |
0:02:44 | i haven't seen the data it's going to come from |
0:02:46 | but anyway, the basic idea is that you get conditions, definitely, where train |
0:02:52 | and test |
0:02:53 | are of a different kind. you would have liked to see... |
0:02:57 | you would like to have seen this before. but what do you do |
0:03:01 | i don't know |
0:03:02 | if you're... |
0:03:03 | if you know that you won't have seen it? |
0:03:06 | and one way of dealing with that is this idea of source normalization |
0:03:12 | i'll try to explain |
0:03:13 | the basic idea of source normalization |
0:03:17 | oh, here are some slides about i-vectors; i think i'll skip |
0:03:21 | these two, since you probably understand them much better than i do |
0:03:25 | the basic idea, the way we view the i-vector in this particular presentation, is that it's a |
0:03:31 | very low-dimensional representation |
0:03:33 | of the entire utterance |
0:03:35 | containing |
0:03:36 | apart from speaker information, other information as well |
0:03:41 | essential to the idea of source normalization is that one of the steps we do |
0:03:46 | in the standard |
0:03:48 | approach is this whitening by the within-class covariance: |
0:03:52 | within-class covariance normalization |
0:03:56 | for the PLDA |
0:04:00 | and that's what needs to be changed |
0:04:02 | with the data in the training |
0:04:06 | the within-class and between-class |
0:04:08 | scatter matrices |
0:04:11 | are computed |
0:04:12 | and that's where the source normalization takes place |
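For reference, the standard WCCN step mentioned here can be sketched as follows. This is an illustrative numpy version, not the authors' code; the function and variable names are mine.

```python
import numpy as np

def wccn(ivectors, speaker_ids):
    """Within-class covariance normalization (WCCN).

    Estimates the within-speaker covariance W from labelled training
    i-vectors and returns a transform B (from the Cholesky factor of
    W^-1) such that the transformed vectors x @ B have within-class
    covariance approximately equal to the identity.
    """
    ivectors = np.asarray(ivectors, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    dim = ivectors.shape[1]
    W = np.zeros((dim, dim))
    n = 0
    for spk in np.unique(speaker_ids):
        x = ivectors[speaker_ids == spk]
        xc = x - x.mean(axis=0)            # centre per speaker
        W += xc.T @ xc
        n += len(x)
    W /= n                                 # average within-speaker covariance
    B = np.linalg.cholesky(np.linalg.inv(W))
    return B                               # apply as: x_norm = x @ B
```

The transform is estimated once on labelled development data and then applied to every i-vector before scoring, which is exactly where the source labels enter in this talk.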
0:04:18 | so here i note that we actually need to estimate those scatter matrices |
0:04:25 | so this is the mathematics, just to stay in line with the previous talks, so as |
0:04:29 | to have at least some mathematics up on the screen |
0:04:34 | this is the expression for the within- |
0:04:37 | speaker scatter matrix |
0:04:40 | and this is what the source normalization is going to |
0:04:44 | try to estimate in a better way |
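The expression on the slide is presumably the usual within-speaker scatter estimate (notation mine: $\mathbf{w}_i^{s}$ is the $i$-th i-vector of speaker $s$, $\bar{\mathbf{w}}_s$ the speaker mean):

```latex
S_w = \sum_{s=1}^{S} \sum_{i=1}^{n_s}
      \left(\mathbf{w}_i^{s} - \bar{\mathbf{w}}_s\right)
      \left(\mathbf{w}_i^{s} - \bar{\mathbf{w}}_s\right)^{\mathsf{T}}
```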
0:04:48 | because what is the... what is the |
0:04:50 | problem with WCCN in this particular |
0:04:54 | matter? the issue is that not all |
0:04:58 | relevant kinds of variation are observed in the training data |
0:05:05 | and this is more often true if you don't have enough |
0:05:09 | data |
0:05:11 | so here is another graphical representation of what typically happens. here we look at a |
0:05:17 | specific |
0:05:19 | kind of data, where the label of the data in mind, say, is |
0:05:25 | the language. so you have lots of english-language data, and every once in a while |
0:05:30 | we get some tests |
0:05:33 | where the language is not english |
0:05:35 | we had that, i think, in two thousand |
0:05:37 | six |
0:05:38 | and before |
0:05:40 | and also two thousand eight contained some |
0:05:43 | so who knows what you'll get in two thousand twelve |
0:05:46 | so maybe language itself is not so relevant for the current evaluation, but it |
0:05:51 | is a good example of where things |
0:05:53 | change |
0:05:55 | an important point |
0:05:56 | here is that even if we have some training data for this |
0:06:00 | we will not have, for all speakers, |
0:06:05 | the different languages. so typically |
0:06:09 | the speakers are decoupled from the language: for some language you have some speakers |
0:06:13 | and for another language you have other speakers |
0:06:15 | so you have the problem that, in the end, in your |
0:06:19 | recognition you have to compare one segment in one language |
0:06:23 | with a segment in the other language, where the case might be that it's actually the same speaker |
0:06:30 | so what i've shown here is why this kind of |
0:06:36 | difference in language labels is going to |
0:06:39 | influence this |
0:06:41 | within- |
0:06:43 | speaker, within-class scatter matrix |
0:06:46 | so this is one way of viewing how the |
0:06:49 | i-vectors might be distributed in this... |
0:06:52 | in this way |
0:06:55 | and |
0:06:56 | how it's used |
0:06:59 | so |
0:07:00 | these |
0:07:01 | three big circles denote the different sources; in this case a source |
0:07:06 | might be a language |
0:07:08 | each with its own mean, and there's a global mean, which would be the mean i-vector, |
0:07:14 | i guess |
0:07:15 | and then we have some speakers. so for one speaker you have a little bit of |
0:07:17 | variability, and he comes from one source |
0:07:20 | and another speaker, she, comes from another source, and we have |
0:07:25 | also a few speakers in the last source |
0:07:28 | you can imagine that if you're going to compute the between-speaker variation, you actually |
0:07:35 | add in a lot of between-source variation, and that's probably not a good thing |
0:07:40 | because what you want to |
0:07:41 | know is the difference between speakers, not between sources |
0:07:46 | so |
0:07:47 | the WCCN is going to |
0:07:51 | do this normalization |
0:07:52 | based on this information |
0:07:57 | and related to this |
0:07:59 | is, i'd say, that the source variance |
0:08:03 | is not correctly |
0:08:06 | observed. the variance between sources |
0:08:09 | is not explicitly |
0:08:12 | modelled |
0:08:14 | so that's another problem for WCCN |
0:08:20 | so |
0:08:21 | this slide summarises again |
0:08:25 | what the problems are |
0:08:27 | now let's move to the solution. i think this is much more interesting: to see |
0:08:32 | how we tackle this problem. these sources that hang around |
0:08:37 | have |
0:08:39 | globally different means in this i-vector space. the solution is very simple: compute |
0:08:45 | these means |
0:08:46 | for every source |
0:08:50 | so here you look at the |
0:08:52 | scatter matrix |
0:08:56 | conditioned on the source |
0:08:58 | we simply compute the mean for every source |
0:09:02 | and before computing the scatter matrix |
0:09:04 | we subtract these means |
0:09:06 | so the effect basically is that you shift |
0:09:09 | all these three |
0:09:11 | sources... this idea originally comes from, like, the difference between microphone |
0:09:16 | and telephone data |
0:09:18 | but we apply it to languages |
0:09:20 | and more |
0:09:21 | you subtract the mean per |
0:09:24 | label, per language |
0:09:26 | and then this scatter matrix will be estimated better. the mathematics then will say, |
0:09:32 | okay |
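A sketch of this per-source mean removal, in illustrative numpy (names are mine, and the exact estimator in the published work may differ in normalisation details):

```python
import numpy as np

def source_normalized_scatter(ivectors, speaker_ids, source_ids):
    """Source-normalized scatter estimation (sketch).

    Per-source means (e.g. per-language means) are removed before
    accumulating the between-speaker scatter, so that differences
    between sources do not inflate it.  The remaining class scatter
    is then taken as the difference from the total variability.
    """
    X = np.asarray(ivectors, dtype=float)
    spk = np.asarray(speaker_ids)
    src = np.asarray(source_ids)
    dim = X.shape[1]

    # Centre each source (language) on its own mean.
    Xc = X.copy()
    for s in np.unique(src):
        Xc[src == s] -= X[src == s].mean(axis=0)

    # Between-speaker scatter computed on source-centred data.
    S_b = np.zeros((dim, dim))
    for p in np.unique(spk):
        m = Xc[spk == p].mean(axis=0)
        S_b += (spk == p).sum() * np.outer(m, m)

    # Total scatter about the global mean; the within-class scatter
    # then follows as the difference from total variability.
    Xg = X - X.mean(axis=0)
    S_tot = Xg.T @ Xg
    S_w = S_tot - S_b
    return S_w, S_b
```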
0:09:34 | that's very nice for the within-... |
0:09:36 | the within-class variation |
0:09:40 | but we still have the between-class variation |
0:09:46 | well, we'll just estimate that as the difference from the total variability |
0:09:52 | or maybe it was |
0:09:52 | the other way around |
0:09:54 | but it doesn't matter: the idea is that you can compensate one |
0:09:57 | scatter matrix, and because you have the total variability |
0:10:01 | you can compute the other as the difference from |
0:10:03 | the total variability |
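In symbols (notation mine): since the total scatter is fixed, the two class scatters trade off, so whichever one is estimated with source-normalized means, the other follows as the difference:

```latex
S_{\mathrm{tot}} = S_w + S_b
\qquad\Longrightarrow\qquad
S_w = S_{\mathrm{tot}} - S_b^{\mathrm{sn}}
```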
0:10:08 | one thing i'd like to stress |
0:10:10 | is that you only need the language labels (the source labels; here applied to language) |
0:10:16 | for the development set |
0:10:19 | so you need the labels for your development data |
0:10:22 | when you're training your system you have all kinds of labels in your data; in |
0:10:25 | this case we consider the |
0:10:26 | language label |
0:10:28 | but in applying the system you do not need the language labels |
0:10:33 | because they are only used to make a better |
0:10:37 | transform for this WCCN |
0:10:42 | how can you actually see that it works? well, one way of doing that is |
0:10:48 | to look at the distribution of i-vectors |
0:10:51 | after WCCN |
0:10:54 | when you |
0:10:54 | do not apply this source normalization technique; that's shown on the left |
0:10:59 | and here, in different colours, you see encoded the label that we want to |
0:11:04 | normalize away, in this case language; you see one colour per language |
0:11:11 | these |
0:11:13 | languages might be familiar to some of the people here |
0:11:18 | and what you see is that |
0:11:19 | the languages seem to occupy different places |
0:11:23 | this is, by the way, after a dimension reduction |
0:11:27 | to two dimensions |
0:11:28 | just for viewing purposes |
0:11:32 | and you see that with this language normalization, this |
0:11:35 | source normalization by language |
0:11:39 | all these different labels get to be much more similar |
0:11:43 | so the basic assumptions that |
0:11:47 | i-vector systems are based on |
0:11:51 | should hold a little better |
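To reproduce that kind of plot, one can project i-vectors to two dimensions with a supervised reduction such as Fisher LDA; the talk does not say which method was used, so this is only a minimal numpy sketch of one plausible choice:

```python
import numpy as np

def lda_2d(X, labels):
    """Project vectors to 2-D with Fisher LDA, for visualising how
    label groups (e.g. languages) separate.  Illustrative only."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    d = X.shape[1]
    mu = X.mean(axis=0)
    Sw = np.zeros((d, d))   # within-class scatter
    Sb = np.zeros((d, d))   # between-class scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)
    # Leading eigenvectors of Sw^-1 Sb give the most discriminant axes.
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(evals.real)[::-1]
    W = evecs[:, order[:2]].real
    return X @ W
```

Colouring the two returned coordinates by language label would give a picture like the one described here, before and after source normalization.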
0:11:53 | okay, now the system and results, because |
0:11:56 | we need to have tables |
0:11:57 | in the presentation |
0:12:00 | first, what kind of experiment can we do? |
0:12:06 | we use |
0:12:07 | the usual NIST databases for |
0:12:11 | the training |
0:12:13 | that the i-vector machinery makes use of |
0:12:15 | but we added one specific database, CallFriend |
0:12:19 | a very old database, used |
0:12:21 | at the start of |
0:12:23 | the first language recognition evaluations. so it contains |
0:12:26 | a variation of languages, twelve languages certainly |
0:12:32 | right |
0:12:36 | as for the evaluation data, we chose two data sets, from the NIST two thousand |
0:12:42 | ten |
0:12:42 | dataset and two thousand |
0:12:44 | eight. for two thousand ten you might think: why would you do that? there wasn't |
0:12:49 | actually much different-language |
0:12:52 | data apart from english in that one. but we use that for two purposes: one, for training |
0:12:58 | the calibration |
0:12:59 | ... as calibration data |
0:13:01 | and another reason is to see that what we do doesn't hurt |
0:13:06 | the basic english performance too much |
0:13:09 | the oh-eight data, of course, is going to be used as the test data |
0:13:14 | where there are trials from different languages |
0:13:19 | and there is also the standard |
0:13:21 | condition, english only, so that |
0:13:23 | we can compare |
0:13:25 | whether we actually hurt ourselves |
0:13:27 | this is... |
0:13:28 | the durations are all simple and standard |
0:13:30 | you |
0:13:31 | have seen these kinds of numbers before, i'd say, so there's nothing new here |
0:13:38 | these, then, are the breakdown numbers |
0:13:44 | per language |
0:13:46 | for the training data |
0:13:49 | then, finally, the results. now here |
0:13:52 | i'll try to explain: |
0:13:55 | if a number is |
0:13:56 | red |
0:13:57 | it means it is new |
0:13:59 | it doesn't mean it is better |
0:14:02 | bold figures mean better. and the first condition |
0:14:08 | shows |
0:14:13 | the performance on all trials |
0:14:15 | for sre oh-eight |
0:14:18 | measured in error rate |
0:14:22 | there's no calibration here yet |
0:14:26 | and you see these numbers go down, so for oh-eight it works when we |
0:14:31 | see some languages, i believe |
0:14:33 | okay |
0:14:34 | of course this also includes english |
0:14:38 | if we |
0:14:40 | look at english only, then the numbers do go up a little bit, so it does |
0:14:44 | hurt our system, but it doesn't hurt it |
0:14:47 | much |
0:14:50 | and the same for |
0:14:51 | ... as we can see |
0:14:54 | the system gets hurt a bit |
0:14:56 | but not much |
0:14:58 | that's the basic conclusion there |
0:15:01 | here we have a breakdown where we look at the non-english languages |
0:15:06 | from sre oh-eight |
0:15:09 | where we look at different conditions: are the two sides in the trials |
0:15:14 | the same language or a different language |
0:15:17 | and when is english involved |
0:15:20 | so the top row, which has to show the best performance because |
0:15:24 | it still contains |
0:15:26 | many english trials |
0:15:27 | is where the system works best |
0:15:31 | so that's the baseline |
0:15:34 | but this includes |
0:15:36 | both english and non-english. so if you break it down |
0:15:40 | for instance where you say, okay, i want a different language in the trial, suppose |
0:15:44 | that the target trials have a language |
0:15:46 | difference |
0:15:49 | then we see that the new figures, on the right |
0:15:53 | are slightly better than |
0:15:55 | the red ones |
0:15:56 | on the left |
0:15:59 | and the same holds for |
0:16:01 | other conditions. so you can specifically look at the non-english trials |
0:16:05 | where there's no other restriction |
0:16:09 | there it helps |
0:16:10 | and for the different-language |
0:16:12 | trials, where you actually restrict trials to be the same language, but non-english |
0:16:19 | it still helps. but there's one condition where, for whatever reason, it does not help |
0:16:26 | and that's a big difference |
0:16:28 | this is something we don't |
0:16:30 | understand, i |
0:16:31 | suppose |
0:16:32 | and that's for the non-english trials |
0:16:34 | where you specify that the trials are |
0:16:39 | different-language trials |
0:16:41 | so usually |
0:16:42 | it seems to work |
0:16:43 | except for one particular |
0:16:46 | place |
0:16:47 | where it doesn't |
0:16:50 | but i must say there are actually not too many trials there |
0:16:53 | it does not show in the graph very nicely |
0:16:58 | so i don't know how |
0:17:00 | accurate this measure is |
0:17:04 | now i'll move on to another aspect: calibration |
0:17:10 | our goal, also, with this kind of experiment, is |
0:17:15 | looking at |
0:17:16 | making the system |
0:17:16 | more robust |
0:17:18 | for languages |
0:17:19 | and we use a different measure |
0:17:21 | the measure used by the keynote speaker today |
0:17:26 | the cllr. and one way of looking at how |
0:17:32 | good, or how poor, your calibration is, is to look at the difference between the cllr and |
0:17:37 | the minimum attainable cllr that you can get |
0:17:42 | the minimum cllr, so to say |
0:17:44 | obtained by |
0:17:48 | a sort of |
0:17:50 | optimal recalibration |
0:17:56 | of the scores |
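Cllr itself is easy to state; a small sketch follows. The calibration loss discussed here is then Cllr minus the minimum Cllr, the latter obtained after an optimal monotonic rescaling of the scores (e.g. via the PAV algorithm, not shown here):

```python
import numpy as np

def cllr(target_llrs, nontarget_llrs):
    """Cllr: average proper-scoring cost of log-likelihood-ratio
    scores, in bits.  A perfectly calibrated, perfectly
    discriminating system gives 0; llr = 0 everywhere gives 1."""
    t = np.asarray(target_llrs, dtype=float)
    n = np.asarray(nontarget_llrs, dtype=float)
    c_t = np.mean(np.log2(1.0 + np.exp(-t)))   # cost on target trials
    c_n = np.mean(np.log2(1.0 + np.exp(n)))    # cost on non-target trials
    return 0.5 * (c_t + c_n)
```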
0:17:59 | alright. so you have these two columns; the different names mean |
0:18:04 | mismatched and matched |
0:18:08 | i was actually thinking |
0:18:11 | we might have called the set "mismatched" |
0:18:16 | and "matched" differently |
0:18:18 | but that might be too hard for you guys |
0:18:22 | anyway |
0:18:23 | red is the |
0:18:26 | new thing that we try to promote here |
0:18:30 | and black is the old approach |
0:18:33 | and bold means better figures |
0:18:37 | we show these separately for female and |
0:18:42 | also |
0:18:43 | male |
0:18:49 | generally |
0:18:51 | for this mismatched condition (the big mismatch being that we calibrate on english only, to be |
0:18:56 | straight, on |
0:18:58 | sre ten for calibration, and we apply that to |
0:19:00 | sre eight) |
0:19:03 | it should perhaps be the other way around, but we did it that way in order to be |
0:19:06 | able to calibrate on english and test on other |
0:19:11 | languages as well |
0:19:13 | so in this particular |
0:19:15 | condition it works |
0:19:17 | always |
0:19:19 | in the matched condition, that is, only looking at english scores, where you have really |
0:19:24 | well-calibrated english scores |
0:19:26 | then |
0:19:28 | you see that it doesn't always help. there is one condition where it helps, though |
0:19:35 | the miscalibration itself |
0:19:37 | so the amount of |
0:19:39 | miscalibration |
0:19:40 | becomes less |
0:19:42 | you see that for calibration there is still some help |
0:19:46 | for english only |
0:19:48 | but for the discrimination figures it doesn't |
0:19:51 | help, however |
0:19:51 | however |
---|
0:19:53 | alright i hope that |
---|
0:19:55 | i |
---|
0:19:56 | explains the numbers well enough |
---|
0:19:59 | your first for the managers amongst |
---|
0:20:02 | yeah |
---|
0:20:03 | i just easier to draw at this |
---|
0:20:06 | the same time |
---|
0:20:07 | dataset |
---|
0:20:09 | calibration this is just miscalibration so this is just the amount of information by |
---|
0:20:14 | by not be able to |
---|
0:20:16 | produce proper likelihood ratios |
---|
0:20:20 | increases |
---|
0:20:22 | for |
---|
0:20:23 | the conditions where we applied is the language normalization |
---|
0:20:27 | oh |
---|
0:20:28 | but for english only trials you don't notice the difference |
---|
0:20:35 | so, my last slide |
0:20:38 | the conclusions are here |
0:20:40 | we used source normalization, which is a general framework, and i have to say it has been |
0:20:45 | applied before |
0:20:47 | you should be able to find |
0:20:50 | three or four |
0:20:54 | conference proceedings |
0:20:56 | papers |
0:20:57 | about this technique, applied to the |
0:21:02 | definition of source being microphone or interview or telephone |
0:21:07 | and we even applied it |
0:21:12 | i should say, to be |
0:21:13 | fair |
0:21:13 | to source being the sex of the speaker. so even though speakers generally |
0:21:19 | don't change sex |
0:21:21 | at least not within these evaluations |
0:21:24 | you can use this approach |
0:21:27 | to compensate for situations where you might not have enough data |
0:21:32 | so for telephone conditions this |
0:21:35 | didn't make much difference, but for conditions |
0:21:39 | where there wasn't really much data it did help to pool the male and female |
0:21:45 | i-vectors and make a single gender-independent |
0:21:48 | recognition system |
0:21:51 | and apply source normalization |
0:21:53 | where we said speaker sex is the label of the i-vector, and we normalized that way |
0:21:58 | and then in your recognition |
0:22:01 | you don't need the gender label anymore, so you can basically ignore the second column of your trial list |
0:22:10 | okay |
0:22:10 | but applied to languages it seems to work |
0:22:14 | decently |
0:22:17 | and it doesn't hurt the english trials too much, first of all |
0:22:23 | which |
0:22:23 | is also basically |
0:22:27 | the goal |
0:22:54 | yes, we stuck to the speaker case |
0:22:56 | and we did not try to use language as a means of discriminating speakers |
0:23:02 | in this |
0:23:02 | research. of course you could do that very well |
0:23:05 | but we think you |
0:23:07 | should take it as a challenge that you should be able to recognize speakers even if |
0:23:11 | the speaker speaks a different language than seen before |
0:23:15 | in the training |
0:23:19 | of course |
0:23:21 | you could make it easier by saying they are different speakers |
0:23:24 | if one is a he |
0:23:26 | and one is a she |
0:24:36 | yeah, and i remember calibration was one of the major problems |
0:24:41 | in two thousand six |
0:24:43 | where, you know, if you have non-english |
0:24:46 | data, the discrimination performance could actually be reasonable, but the calibration |
0:24:52 | was |
0:24:53 | poor |
0:24:54 | i'm not so sure that |
0:24:55 | that even holds for the systems |
0:24:58 | nowadays, though; the systems nowadays are |
0:25:01 | generally behaving better |
0:25:34 | no, i don't think that |
0:25:37 | that is what we want |
0:25:38 | to say. i think |
0:25:40 | what we want to say is that |
0:25:43 | with the channel... |
0:25:45 | the between-channel |
0:25:47 | variation |
0:25:49 | as estimated, part of the |
0:25:52 | variance, of the total variance |
0:25:55 | is due to the fact that |
0:25:57 | things have a different language |
0:26:01 | and you don't observe that in the within-speaker |
0:26:04 | variability |
0:26:06 | so it attributes only the within-language variability |
0:26:09 | to the |
0:26:11 | channel variability |
0:26:13 | and that is not adequate for |
0:26:16 | this case |
0:26:20 | of different languages for the same speaker |