0:00:14 | i'm going to present this work about domain adaptation in speaker recognition |
---|
0:00:18 | and a labeling strategy performed from scratch, a new element in speaker recognition |
---|
0:00:26 | we want to carry out speaker recognition on a new domain without a decrease |
---|
0:00:30 | in detection performance |
---|
0:00:32 | thanks to adaptation techniques |
---|
0:00:35 | but we want |
---|
0:00:36 | to take into account the difficulties of the task in real-life situations |
---|
0:00:42 | the cost of data collection and also the cost of labeling the large |
---|
0:00:48 | available in-domain dataset |
---|
0:00:52 | so we assume that a unique and unlabeled in-domain development dataset is available |
---|
0:00:58 | possibly reduced in size in terms of speakers and also of segments per speaker |
---|
0:01:05 | this dataset is used to learn an adapted speaker recognition model |
---|
0:01:10 | first we want to know how the performance increases depending on the amount |
---|
0:01:15 | of unlabeled in-domain data |
---|
0:01:18 | in terms of segments |
---|
0:01:19 | and also of speakers or |
---|
0:01:23 | of sample size of segments per speaker |
---|
0:01:31 | second, estimating the optimal number of clusters is usually done thanks to another labeled in-domain |
---|
0:01:36 | dataset |
---|
0:01:37 | so, as a second point |
---|
0:01:42 | we want to |
---|
0:01:43 | carry out clustering without this requirement for preexisting |
---|
0:01:48 | in-domain |
---|
0:01:49 | labeled |
---|
0:01:51 | data |
---|
0:01:53 | this is explained later in this presentation |
---|
0:02:01 | this slide shows the back-end process for speaker recognition systems based on embeddings |
---|
0:02:08 | and the different adaptation techniques that can be included |
---|
0:02:13 | methods such as coral aim at |
---|
0:02:15 | transforming vectors to reduce the shift between target and out-of-domain distributions |
---|
0:02:21 | covariance alignment |
---|
0:02:23 | maps the feature distribution of the out-of-domain |
---|
0:02:29 | data to the target one |
---|
0:02:32 | leading to transformed out-of-domain data usable as pseudo in-domain data |
---|
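[Editor's note: the CORAL-style covariance alignment described above can be sketched in a few lines of NumPy. This is a minimal illustration under the usual CORAL definition (whiten with the out-of-domain covariance, re-color with the target one), not the speakers' exact implementation; the function name and arguments are assumptions.]

```python
import numpy as np

def coral_adapt(X_out, X_in, eps=1e-6):
    """Map out-of-domain embeddings X_out (n x d) so that their covariance
    matches that of in-domain embeddings X_in (m x d): whiten with the
    out-of-domain covariance, then re-color with the in-domain one."""
    d = X_out.shape[1]
    C_out = np.cov(X_out, rowvar=False) + eps * np.eye(d)
    C_in = np.cov(X_in, rowvar=False) + eps * np.eye(d)

    def mat_pow(C, p):
        # symmetric matrix power via eigendecomposition (C is SPD)
        w, V = np.linalg.eigh(C)
        return (V * np.maximum(w, eps) ** p) @ V.T

    A = mat_pow(C_out, -0.5) @ mat_pow(C_in, 0.5)  # whiten, then re-color
    return X_out @ A
```

After this transform, the empirical covariance of the adapted vectors matches the in-domain one, so they can serve as pseudo in-domain training data.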
0:02:40 | when speaker labels of in-domain samples are available |
---|
0:02:44 | supervised adaptation can be carried out |
---|
0:02:47 | that's the kind of |
---|
0:02:49 | approach |
---|
0:02:51 | best known as the linear interpolation between in-domain and out-of-domain plda parameters |
---|
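[Editor's note: the linear interpolation of PLDA parameters mentioned here can be sketched as below. A minimal illustration assuming Gaussian PLDA parameters stored as arrays; the keys `mu`, `B`, `W` (mean, between- and within-class covariances) are hypothetical names, not the speakers' code.]

```python
import numpy as np

def interpolate_plda(in_dom, out_dom, alpha=0.5):
    """Linearly interpolate Gaussian PLDA parameters between an in-domain
    and an out-of-domain model; alpha weights the in-domain side."""
    return {k: alpha * in_dom[k] + (1.0 - alpha) * out_dom[k]
            for k in ("mu", "B", "W")}
```

As the talk notes later, the same interpolation scheme can be generalized to earlier stages of the pipeline (LDA, whitening) by interpolating their means and covariances in the same way.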
0:02:58 | also score normalizations can be considered as an unsupervised adaptation |
---|
0:03:03 | as they use an unlabeled in-domain subset for the impostor cohort |
---|
0:03:09 | note that we generalize this interpolation of the plda parameters |
---|
0:03:14 | to all preprocessing stages of the system: lda and whitening |
---|
0:03:18 | this tactic improves performance by about one percent |
---|
0:03:22 | in all our experiments |
---|
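[Editor's note: the cohort-based score normalization mentioned above can be sketched as a symmetric S-norm. This is a minimal illustration under the standard definition, not necessarily the exact variant used by the speakers.]

```python
import numpy as np

def s_norm(raw_score, enroll_cohort, test_cohort):
    """Symmetric score normalization: the raw trial score is z-normalized
    twice, against the scores of the enrollment utterance and of the test
    utterance versus an unlabeled in-domain impostor cohort, then averaged."""
    e, t = np.asarray(enroll_cohort), np.asarray(test_cohort)
    return 0.5 * ((raw_score - e.mean()) / e.std()
                  + (raw_score - t.mean()) / t.std())
```

Because the cohort is just an unlabeled in-domain subset, this counts as unsupervised adaptation: no speaker labels are needed.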
0:03:29 | so how does the performance increase depending on the |
---|
0:03:33 | amount of data |
---|
0:03:36 | we carried out |
---|
0:03:37 | experiments |
---|
0:03:41 | focusing on the gain of adapted systems as a function of the available data |
---|
0:03:47 | three parameters are varied in the course of these experiments |
---|
0:03:54 | number of speakers |
---|
0:03:56 | segments per speaker |
---|
0:03:57 | and |
---|
0:03:58 | adaptation technique |
---|
0:04:02 | here is a description of the experimental setup for our |
---|
0:04:07 | analysis |
---|
0:04:09 | we use acoustic features consisting of twenty three cepstral coefficients |
---|
0:04:14 | cepstral mean normalization with a window size |
---|
0:04:16 | of three seconds |
---|
0:04:18 | then vad based on the c zero component |
---|
0:04:23 | the extractor of x-vectors is the one of the kaldi toolkit |
---|
0:04:28 | with an attentive statistics pooling layer |
---|
0:04:32 | this extractor is trained on switchboard and nist sre |
---|
0:04:36 | datasets |
---|
0:04:39 | using a five-fold data augmentation strategy with reverberation |
---|
0:04:46 | noise, music |
---|
0:04:48 | and babble from musan |
---|
0:04:52 | so the domain of interest is an arabic language, tunisian arabic |
---|
0:04:56 | as in the nist speaker recognition evaluations |
---|
0:05:00 | two thousand eighteen cmn2 |
---|
0:05:05 | and two thousand nineteen |
---|
0:05:06 | cts |
---|
0:05:10 | this language is absent from the nist speaker recognition training databases |
---|
0:05:15 | which leads to a domain mismatch |
---|
0:05:22 | the in-domain corpus for development and test is described in this table |
---|
0:05:28 | the development dataset is made of the enrollment and test segments derived from nist |
---|
0:05:32 | sre eighteen development and test sets |
---|
0:05:35 | and half of the enrollment segments derived from nist sre nineteen test |
---|
0:05:42 | the other half is set aside for making up the trial dataset of the test |
---|
0:05:47 | the fifty percent split takes genders into account so that both subsets are |
---|
0:05:52 | gender-balanced |
---|
0:05:54 | the test set contains trial pairs |
---|
0:05:57 | randomly and uniformly picked up with the constraint of being balanced by gender |
---|
0:06:03 | and of a target prior |
---|
0:06:04 | equal to one percent |
---|
0:06:07 | when analysing the adaptation strategy |
---|
0:06:10 | the number of speakers and the number of segments per speaker are varied |
---|
0:06:16 | in order to vary the total amount of segments and also |
---|
0:06:21 | given a fixed amount, to assess the impact of speaker class variability |
---|
0:06:26 | each time a subset is picked up from the three hundred and ten speaker |
---|
0:06:31 | development dataset and used for training the models |
---|
0:06:36 | the test dataset |
---|
0:06:38 | is fixed and only intended for testing |
---|
0:06:42 | three alternatives are considered and experimented |
---|
0:06:45 | a system applying unsupervised adaptation only |
---|
0:06:49 | a system applying supervised adaptation only |
---|
0:06:52 | and a system applying the full pipeline |
---|
0:06:55 | unsupervised then supervised |
---|
0:06:57 | the goal is to assess the usefulness |
---|
0:07:00 | of unsupervised techniques when speaker labels are available |
---|
0:07:07 | this figure shows the results of our analyses |
---|
0:07:12 | performance in terms of equal error rate of unsupervised and supervised |
---|
0:07:17 | adapted systems depending on the number of speakers |
---|
0:07:22 | and segments per speaker |
---|
0:07:25 | of the in-domain development dataset |
---|
0:07:28 | the case |
---|
0:07:30 | of all segments per speaker corresponds to keeping all segments of the selected speakers |
---|
0:07:36 | so the reported number is the mean |
---|
0:07:39 | the x axis is the number of speakers |
---|
0:07:42 | and each curve corresponds to a number of segments per speaker |
---|
0:07:47 | it can be observed that |
---|
0:07:49 | combining unsupervised and supervised adaptation is the best way; having labeled data does not |
---|
0:07:55 | make unsupervised techniques questionable |
---|
0:08:01 | also we observe that |
---|
0:08:03 | even with a small in-domain dataset, here fifty speakers, there is |
---|
0:08:08 | a significant gain of performance with adaptation compared to the baseline of |
---|
0:08:14 | twelve point one two percent |
---|
0:08:16 | now let's look at the dashed curves in the figure |
---|
0:08:21 | they correspond to a fixed total amount of segments |
---|
0:08:28 | for example |
---|
0:08:29 | this last row corresponds to the same amount of two thousand five hundred segments |
---|
0:08:37 | possibly |
---|
0:08:39 | fifty speakers and fifty segments |
---|
0:08:42 | per speaker, or one hundred |
---|
0:08:45 | speakers |
---|
0:08:48 | by sweeping the curve |
---|
0:08:51 | we can observe that |
---|
0:08:53 | given a total amount of segments, performance improves with the number of speakers |
---|
0:08:58 | gathering data from a few speakers, even with many utterances per speaker |
---|
0:09:03 | limits the gain of adapted systems |
---|
0:09:07 | now let's talk about clustering |
---|
0:09:10 | the goal is to automatically label the in-domain dataset by using |
---|
0:09:15 | unsupervised clustering, then identifying the provided classes |
---|
0:09:20 | with speaker labels |
---|
0:09:23 | the dataset x |
---|
0:09:26 | is clustered |
---|
0:09:27 | the result |
---|
0:09:29 | is an estimate of the actual speaker labels |
---|
0:09:34 | note that we use |
---|
0:09:36 | a preexisting labeled dataset y from in-domain data |
---|
0:09:40 | a plda model is computed |
---|
0:09:42 | using the out-of-domain training dataset |
---|
0:09:45 | then the score matrix of dataset x is used for carrying out |
---|
0:09:51 | an agglomerative hierarchical clustering, using it as |
---|
0:09:56 | a similarity matrix |
---|
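[Editor's note: the clustering step described here, agglomerative hierarchical clustering over a PLDA score matrix used as similarity, can be sketched with SciPy. A minimal illustration assuming higher scores mean more similar; not the speakers' exact implementation.]

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def ahc_labels(scores, n_clusters):
    """Cluster segments from their pairwise similarity matrix `scores`
    (e.g. PLDA scores) with agglomerative hierarchical clustering."""
    dist = scores.max() - scores            # similarity -> distance
    dist = 0.5 * (dist + dist.T)            # enforce symmetry
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```

Sweeping `n_clusters` then yields one candidate labeling per hypothesized number of speakers.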
0:09:59 | a given issue of this clustering problem is how to determine the actual number |
---|
0:10:05 | of classes |
---|
0:10:08 | by sweeping the number of clusters: for each number, a model is estimated which |
---|
0:10:12 | includes the adapted plda parameters |
---|
0:10:16 | and the preexisting in-domain labeled dataset y is used for error rate |
---|
0:10:21 | computation |
---|
0:10:27 | then we select the class labels corresponding to the number of classes q that minimizes |
---|
0:10:32 | the error rate |
---|
0:10:37 | the drawback of this approach is clear |
---|
0:10:42 | it actually requires a preexisting labeled development set that is not always available |
---|
0:10:46 | so a method from scratch, without labeled in-domain data, is expected |
---|
0:10:56 | so we propose a method for clustering the in-domain dataset and determining the |
---|
0:11:01 | optimal number of classes from scratch, without requirement of a preexisting in-domain labeled set |
---|
0:11:10 | here is the algorithm |
---|
0:11:11 | first |
---|
0:11:12 | this algorithm is identical to the previous one |
---|
0:11:15 | then |
---|
0:11:16 | for each number of classes q |
---|
0:11:18 | we assimilate each class to a speaker |
---|
0:11:21 | and build a trial key accordingly |
---|
0:11:27 | then we use |
---|
0:11:28 | this set of artificial keys |
---|
0:11:31 | for computing the error rate |
---|
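[Editor's note: building the artificial trial key from cluster labels can be sketched as below: a trial between two segments is labeled target iff both fall in the same cluster. Minimal illustration with a hypothetical function name.]

```python
import numpy as np

def trial_key(labels):
    """Artificial trial key from cluster labels: entry (i, j) is True
    (target trial) iff segments i and j share a cluster."""
    labels = np.asarray(labels)
    return labels[:, None] == labels[None, :]
```

Scoring all pairs of segments against this key gives the pseudo error rate used in the sweep over the number of classes.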
0:11:37 | now we have to determine the optimal number of classes |
---|
0:11:42 | we use the elbow criterion, a well-known one in the field of clustering |
---|
0:11:48 | on display are the error rates and the criteria for determining the optimal number of |
---|
0:11:53 | clusters |
---|
0:11:55 | the reported values correspond to the loop of the algorithm from scratch |
---|
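[Editor's note: one common way to implement the elbow criterion mentioned here is to take the point of the error-rate curve farthest from the chord joining its endpoints. This is a generic sketch, not necessarily the exact criterion used in the talk.]

```python
import numpy as np

def knee_index(x, y):
    """Index of the elbow of a decreasing curve y(x): the point with the
    largest perpendicular distance to the chord joining the endpoints."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    p0 = np.array([x[0], y[0]])
    chord = np.array([x[-1], y[-1]]) - p0
    chord /= np.linalg.norm(chord)
    pts = np.stack([x, y], axis=1) - p0
    proj = np.outer(pts @ chord, chord)   # projections onto the chord
    return int(np.argmax(np.linalg.norm(pts - proj, axis=1)))
```

Applied to the pseudo-EER as a function of the number of clusters, the returned index points at the candidate number of classes.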
0:12:01 | we can see that the slope of equal error rate goes down, then it slows |
---|
0:12:05 | down in the neighbourhood, by excess, of the exact number of speakers |
---|
0:12:11 | which is |
---|
0:12:12 | two hundred and fifty |
---|
0:12:15 | moreover the values of the min dcf over the operating points |
---|
0:12:20 | reach local minima before converging to zero |
---|
0:12:25 | the first one in the same neighbourhood |
---|
0:12:31 | of two hundred and fifty |
---|
0:12:38 | so the elbow method finally gives around |
---|
0:12:42 | three hundred |
---|
0:12:44 | classes |
---|
0:12:45 | as the corner point, beyond which the dcf increases again |
---|
0:12:55 | now we display the performance of the adapted system using clustering from scratch as a function |
---|
0:13:01 | of the number of clusters |
---|
0:13:04 | compared to unsupervised adaptation and to supervised adaptation with the exact speaker labels |
---|
0:13:09 | with |
---|
0:13:12 | exact speaker labels and supervised adaptation, the equal error rate is around six percent, and |
---|
0:13:19 | with only unsupervised adaptation, performance is around seven percent |
---|
0:13:25 | and we can see the curve of the results obtained by varying the number of classes |
---|
0:13:33 | from the clustering |
---|
0:13:35 | from scratch that we propose |
---|
0:13:42 | we can see that the method overestimates the number of speakers but manages to |
---|
0:13:47 | attain interesting performance in terms of equal error rate and min dcf |
---|
0:13:53 | close to the performance |
---|
0:13:56 | with exact labels and supervised adaptation |
---|
0:14:03 | here are the results |
---|
0:14:05 | with various numbers of segments per speaker |
---|
0:14:09 | five, ten or more |
---|
0:14:11 | for example |
---|
0:14:13 | in the last line we can see the results obtained by clustering from scratch |
---|
0:14:17 | they are |
---|
0:14:18 | similar to those produced with a labeled development set |
---|
0:14:24 | but also close to the ones with the exact speaker labels |
---|
0:14:31 | now we conclude |
---|
0:14:35 | the analyses that we carried out |
---|
0:14:38 | show that improvement of performance is due to supervised but also unsupervised domain adaptation techniques |
---|
0:14:46 | like coral or plda interpolation |
---|
0:14:49 | these techniques combine well, one in the model domain |
---|
0:14:53 | the other in the feature domain, to achieve the best performance |
---|
0:14:59 | also |
---|
0:15:01 | it is observed that a small sample of in-domain data can significantly reduce the |
---|
0:15:05 | gap of performance |
---|
0:15:08 | above all when favoring the number of speakers |
---|
0:15:11 | rather than the number of segments per speaker |
---|
0:15:18 | lastly, a new approach for speaker labeling has been introduced here |
---|
0:15:23 | done from scratch |
---|
0:15:25 | without preexisting in-domain labeled data |
---|
0:15:29 | for clustering |
---|
0:15:31 | while actually achieving good performance |
---|
0:15:36 | thank you for your attention |
---|
0:15:38 | do not hesitate to contact us for more details on this study |
---|
0:15:41 | bye bye |
---|