0:00:13 | okay |
---|
0:00:22 | so this morning i will explain what we mean by classifier fusion |
---|
0:00:31 | classifier fusion is applicable whenever we have some ensemble of experts and we need to come to some final decision |
---|
0:00:45 | furthermore in this example we assume that those experts are able to give us soft decisions in the form of some confidence value |
---|
0:00:57 | so perhaps the simplest and also mostly working method to fuse those scores would be just to average those confidence values |
---|
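A minimal sketch of this simplest rule, averaging the confidence values over the ensemble; the array contents are made-up example data, not from the talk:

```python
import numpy as np

# scores[i, j]: confidence given by expert i for trial j (made-up example values)
scores = np.array([
    [0.9, 0.2, 0.6],   # expert 1
    [0.8, 0.4, 0.5],   # expert 2
    [0.7, 0.1, 0.9],   # expert 3
])

# average fusion: the fused confidence is the unweighted mean over the experts
fused = scores.mean(axis=0)
print(fused)  # one fused confidence value per trial
```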
0:01:09 | but sometimes we have some prior information about the experts and about how well they performed in the past |
---|
0:01:20 | so we would like to exploit this information to make a better fusion |
---|
0:01:32 | so the task of classifier fusion is to take the outputs of the n base classifiers and produce one output score which ideally gives better performance than any single base classifier |
---|
0:01:58 | here we assume so called linear fusion which is a very simple method but one that is also used in state of the art tools like the focal toolkit |
---|
0:02:22 | so linear fusion is just a weighted sum of the input scores where the weights are trained on previous trials with known ground truth |
---|
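A sketch of linear fusion as just described: the fused score is a weighted sum of the input scores, with the weights trained on past trials with known labels. Training the weights by logistic regression is one common choice and an assumption of this example, not something the talk specifies:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# past trials with known ground truth: rows are trials, columns are the
# base classifiers' scores (made-up example data)
train_scores = np.random.randn(1000, 3)
train_labels = np.random.randint(0, 2, size=1000)  # 1 = target, 0 = non-target

# fit the fusion weights; the fused score is w . s + b
fusion = LogisticRegression().fit(train_scores, train_labels)
w, b = fusion.coef_[0], fusion.intercept_[0]

# apply the trained linear fusion to new trials
test_scores = np.random.randn(10, 3)
fused = test_scores @ w + b
```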
0:02:41 | what we mean by subset fusion is that we first select only certain classifiers from the full set and only those are then fed to the fusion training and the fusion itself |
---|
0:03:06 | what could be the motivation for such a setup |
---|
0:03:10 | first the traditional approach with the full set is the most commonly used method it is straightforward and computationally efficient since you don't have to do the subset selection |
---|
0:03:30 | but when we have a large number of classifiers we could possibly simply be overtraining the fusion |
---|
0:03:42 | whereas in the subset case we might possibly do better |
---|
0:03:51 | of course this method relies on a good subset selection |
---|
0:03:59 | so the question is can subset fusion give better performance than the full set |
---|
0:04:10 | now for the system overview on the input we have speech typically two utterances |
---|
0:04:24 | those are classified by several classifiers that we selected from the full set of classifiers |
---|
0:04:37 | and the scores of the classifiers that were selected are then fused |
---|
0:04:47 | more in detail how we do it is that we first train the s-cal mapping for each of the base classifiers' scores |
---|
0:05:01 | the s-cal mapping maps the scores into well calibrated log likelihood ratios |
---|
0:05:11 | in the first formula you can see the s-cal mapping and the second one is the cost function cllr which we minimize for the mapped scores |
---|
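For reference, the cllr cost mentioned here has a standard closed form; a sketch of it, assuming the scores are already log-likelihood ratios (the function and variable names are mine):

```python
import numpy as np

def cllr(tar_llrs, non_llrs):
    """Average cost of log-likelihood-ratio scores, in bits:
    0 for perfect scores, 1 for a useless but calibrated system."""
    cost_miss = np.mean(np.log2(1 + np.exp(-tar_llrs)))  # target trials
    cost_fa = np.mean(np.log2(1 + np.exp(non_llrs)))     # non-target trials
    return 0.5 * (cost_miss + cost_fa)
```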
0:05:31 | okay then for each of the subsets in the power set two to the power of n minus one of them we train a linear fusion with the weighted cllr objective function the same as in the focal toolkit |
---|
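A sketch of the loop over the power set: all 2^12 − 1 = 4095 non-empty subsets of the twelve classifiers, with one linear fusion trained per subset. The two helper functions only stand in for the cllr-based fusion training and scoring described in the talk; they are hypothetical placeholders:

```python
from itertools import combinations

def train_fusion(subset):
    """Stand-in for training the linear fusion on this subset's scores."""
    return subset  # hypothetical placeholder

def evaluate_min_dcf(fusion):
    """Stand-in for evaluating the minimum decision cost function."""
    return len(fusion)  # hypothetical placeholder

classifiers = range(12)  # the talk uses twelve base classifiers

results = {}
for size in range(1, 13):
    for subset in combinations(classifiers, size):
        results[subset] = evaluate_min_dcf(train_fusion(subset))

# 2**12 - 1 = 4095 candidate subsets; keep the one with the smallest min dcf
best_subset = min(results, key=results.get)
```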
0:05:52 | as you can see in the first formula the prior with which the weighted cllr function is weighted comes from the cost function |
---|
0:06:07 | for the cost function we use the nist function with a cost of miss of one a cost of false alarm of one and a probability of a target trial of zero point zero zero one |
---|
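With these numbers the effective prior that weights the objective works out to 0.001; a quick check of the standard effective-prior formula, which I am assuming is the one meant here:

```python
c_miss, c_fa, p_tar = 1.0, 1.0, 0.001  # the nist cost parameters quoted above

# effective prior: folds the two costs into one equivalent target prior
p_eff = (p_tar * c_miss) / (p_tar * c_miss + (1 - p_tar) * c_fa)
print(p_eff)  # 0.001: with equal costs the effective prior equals p_tar
```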
0:06:26 | okay then after we fuse all the possible subsets we select the subset with the smallest minimum decision cost function |
---|
0:06:38 | the decision cost function is a function of the threshold and of the cost function parameters |
---|
0:06:49 | so we pick the subset with the minimum decision cost function over all possible thresholds |
---|
0:07:04 | and finally we evaluate the actual decision cost function which is the cost function at the log likelihood ratio threshold trained on the training set and which therefore also includes the calibration error |
---|
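A sketch of the two quantities being compared: the minimum dcf sweeps the decision threshold, while the actual dcf fixes it at the theoretical log-likelihood-ratio threshold implied by the cost parameters, so the gap between the two is the calibration error. Function and variable names are illustrative:

```python
import numpy as np

def dcf(tar_llrs, non_llrs, threshold, p_tar=0.001, c_miss=1.0, c_fa=1.0):
    """Normalized decision cost at one threshold."""
    p_miss = np.mean(tar_llrs < threshold)
    p_fa = np.mean(non_llrs >= threshold)
    cost = p_tar * c_miss * p_miss + (1 - p_tar) * c_fa * p_fa
    return cost / min(p_tar * c_miss, (1 - p_tar) * c_fa)

def min_dcf(tar_llrs, non_llrs):
    """Best cost over all thresholds; blind to calibration."""
    thresholds = np.concatenate([tar_llrs, non_llrs, [np.inf]])
    return min(dcf(tar_llrs, non_llrs, t) for t in thresholds)

def actual_dcf(tar_llrs, non_llrs, p_tar=0.001):
    """Cost at the theoretical llr threshold; includes calibration error."""
    bayes_threshold = np.log((1 - p_tar) / p_tar)  # about 6.9 for p_tar = 0.001
    return dcf(tar_llrs, non_llrs, bayes_threshold)
```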
0:07:27 | as our base classifiers we had twelve different classifiers which were used in the i4u submission for the nist two thousand ten evaluation |
---|
0:07:42 | we used three different sets of scores |
---|
0:07:46 | the so called train set and devel set one which are from the extended nist sre two thousand eight trial set and have very similar score distributions |
---|
0:08:03 | and then for something different we also have devel set two which is the official nist two thousand ten evaluation trial set |
---|
0:08:20 | so for the results we divided all the possible subsets by size from one to twelve since we had twelve classifiers and measured the improvement we can get by selecting a good subset |
---|
0:08:44 | the three most important points in this plot are the worst individual subsystem and the best individual subsystem those are the subsets of size one a single system with no fusion |
---|
0:09:03 | and the baseline which is the full ensemble fusion where all twelve classifiers are fused |
---|
0:09:16 | first the blue line shows the non cheating realistic use case where we predict the best subset from the training set and then evaluate on devel set one |
---|
0:09:34 | for this one unfortunately we cannot get a better result than the full set fusion but sometimes for instance around subset size seven we can get a very similar result |
---|
0:09:53 | the best subset selection oracle shows the performance of the best subset if we knew how to select it |
---|
0:10:07 | the worst subset selection oracle shows the case where we select the worst possible subset from the power set |
---|
0:10:19 | so those two are the upper and lower bounds |
---|
0:10:27 | okay this is the same case only not for the actual dcf but for the minimum dcf and the equal error rate |
---|
0:10:38 | so you can see we can still get a better minimum dcf or equal error rate by not doing the full set fusion but selecting a subset |
---|
0:10:55 | and finally this is the performance on devel set two the nist two thousand ten evaluation set |
---|
0:11:06 | we can also see that for most of the conditions interview interview interview telephone and telephone telephone the best subset gives better performance than the full ensemble |
---|
0:11:21 | only in the mic mic condition is there something wrong here even the full ensemble gives worse results than the best individual system |
---|
0:11:49 | the conclusion of this research is that subset fusion has the potential to outperform the full set fusion of course only if we knew how to select the best subset |
---|
0:12:04 | therefore further study should focus on subset selection methods |
---|
0:12:14 | i think that's it |
---|
0:12:23 | okay we have time for questions |
---|
0:12:28 | yes at the back please |
---|
0:12:31 | i'd like to ask if you used the same subset for all the trials or different subsets for the trials |
---|
0:12:37 | you mean in one of the plots or |
---|
0:12:43 | in general so this is the system and you put a lot of trials into it |
---|
0:12:49 | yeah |
---|
0:12:50 | do you select a different subset for each trial no no no |
---|
0:12:55 | okay |
---|
0:12:56 | so just one subset for all the trials |
---|
0:12:59 | okay |
---|
0:13:09 | did you compare your solution with a random selection of the subset of classifiers |
---|
0:13:15 | what do you mean by random |
---|
0:13:20 | can you show the plot so in this plot you have the two bounds |
---|
0:13:29 | okay |
---|
0:13:30 | a random selection would be somewhere in between |
---|
0:13:38 | when you pick randomly you end up with a performance between those two bounds |
---|
0:13:45 | and it could be interesting to know where it lies maybe |
---|
0:13:50 | we did not try the random selection but you would probably like to see a distribution |
---|
0:13:57 | oh okay |
---|
0:14:10 | okay let's thank the speaker |
---|