Speech transcript - Intra-speaker variability effects on Speaker Verification performance

0:00:06	my name is uh Juliette Kahn
0:00:08	and uh i will present you
0:00:10	the work we uh
0:00:12	we do uh
0:00:14	at the LIA
0:00:16	which is entitled intra-speaker variability effects
0:00:19	on speaker verification performance
0:00:22	over the last decade
0:00:24	uh the speaker verification systems uh the performance of these systems
0:00:28	uh
0:00:29	is uh
0:00:30	very very
0:00:31	at uh
0:00:32	the performance
0:00:34	has reached uh a good uh level
0:00:36	and uh
0:00:37	so
0:00:38	this permits
0:00:39	uh a large set of practical applications
0:00:42	uh like
0:00:43	in industry or in forensic applications
0:00:46	and uh all this uh performance
0:00:49	performance is always given by an average error rate
0:00:53	and uh
0:00:54	uh
0:00:55	we don't have a lot uh
0:00:58	a lot of studies
0:00:59	on uh the
0:01:01	explanation
0:01:03	of
0:01:03	the performance variability
0:01:05	hmmm
0:01:05	and uh on the errors
0:01:07	uh
0:01:08	we have one important work which is that of Doddington
0:01:13	who uh explained the performance variability according to the speaker profile
0:01:18	it is well known that
0:01:20	according to the length
0:01:21	of the training and testing data uh the the performance variability is very important
0:01:29	and uh it was proposed
0:01:30	to
0:01:31	uh to use different phonemic content
0:01:34	in the
0:01:35	to
0:01:36	to use the interview
0:01:38	showing that
0:01:39	uh
0:01:40	there is a difference in performance
0:01:42	uh according to
0:01:43	this one i mean
0:01:46	to do a um
0:01:48	our question
0:01:49	uh we only work
0:01:50	on
0:01:52	the training data
0:01:53	for
0:01:54	one speaker
0:01:55	uh the question is uh is that
0:01:57	we have several excerpts
0:01:59	for the same speaker
0:02:01	so
0:02:02	uh what is the variability
0:02:04	due to the signal sample used to train
0:02:08	the speaker model
0:02:10	and uh the other question is
0:02:12	what kind of information may explain
0:02:15	this difference of performance
0:02:17	and uh we propose to study the number of selected frames, the phonemic distribution
0:02:23	and the intra-phonemic acoustic differences
0:02:28	okay
0:02:29	uh we use the we use the ALIZE/SpkDet system which is a UBM-GMM
0:02:36	approach
0:02:37	uh with uh latent factor analysis
0:02:40	and uh we use the
0:02:42	the the system uh used for
0:02:46	the NIST-SRE campaigns and uh
0:02:49	but we don't do a score normalisation
0:02:52	the global idea is uh to uh to do
0:02:56	a lot
0:02:56	of um
0:02:58	of uh trials
0:02:59	for the different training samples we have
0:03:02	for
0:03:03	a speaker
0:03:04	and uh we select
0:03:05	the best
0:03:06	training
0:03:07	excerpt
0:03:08	and the worst training excerpt
0:03:10	for each
0:03:10	speaker
0:03:11	um
0:03:12	the the best
0:03:13	training excerpt
0:03:14	is uh the one that is um calculated
0:03:17	by the um
0:03:19	by
0:03:20	minimising the
0:03:21	the percentage of
0:03:23	false
0:03:23	acceptance
0:03:24	and uh
0:03:25	uh false rejection
0:03:27	and for the worst it's the same thing we maximise
0:03:29	the
0:03:29	percentage of
0:03:30	false acceptance
0:03:31	and
0:03:33	false
0:03:33	rejection
0:03:35	so we have
0:03:36	uh two sets
0:03:38	if uh we use selection
0:03:40	one
0:03:40	uh
0:03:41	named min and the other named
0:03:45	uh i mean
0:03:46	max
0:03:47	and random
0:03:48	yeah
0:03:48	we do different
0:03:50	uh experiments
0:03:51	uh we
0:03:52	there are exactly the same speakers
0:03:54	exactly the same testing excerpts
0:03:57	but we change the training excerpt
0:03:59	for
0:04:00	each set
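
The selection just described can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `select_excerpts`, the dictionary layout, and the toy error values are all hypothetical, and the error score is assumed to be the summed percentage of false acceptances and false rejections obtained when a given excerpt is used for training.

```python
import random


def select_excerpts(error_by_excerpt, seed=0):
    """Pick the min, max and a random training excerpt for each speaker.

    error_by_excerpt maps speaker -> {excerpt_id: error score}, where the
    score is assumed to be %false acceptance + %false rejection (hypothetical).
    """
    rng = random.Random(seed)  # fixed seed so the random set is reproducible
    selection = {}
    for speaker, errors in error_by_excerpt.items():
        excerpts = sorted(errors)
        selection[speaker] = {
            "min": min(excerpts, key=errors.get),   # best training excerpt
            "max": max(excerpts, key=errors.get),   # worst training excerpt
            "random": rng.choice(excerpts),         # baseline choice
        }
    return selection


# Toy, invented error scores for two speakers.
toy_errors = {
    "spk1": {"a": 2.0, "b": 11.5, "c": 6.0},
    "spk2": {"x": 9.0, "y": 1.0, "z": 4.0},
}
selection = select_excerpts(toy_errors)
```

Because only the training excerpt changes between the three sets, the same speakers and the same test excerpts can be scored against each selection.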
0:04:03	uh we do these uh experiments on two corpora
0:04:07	the first is the
0:04:09	based on the NIST-SRE uh
0:04:11	two thousand eight
0:04:12	with telephone conversational speech
0:04:15	and uh with uh a length of uh
0:04:18	two two minutes
0:04:20	uh
0:04:21	for for each uh uh
0:04:22	sample
0:04:23	and to maximise
0:04:24	the number of training excerpts for each speaker
0:04:27	we do uh leave one out
0:04:29	uh and uh
0:04:31	with this process uh we uh we have a
0:04:35	doing this
0:04:36	uh one hundred and seventy one speakers for whom we have
0:04:40	three to uh twenty models
0:04:42	but
0:04:44	and uh
0:04:45	the other corpus we used is BREF one hundred
0:04:49	twenty
0:04:50	which is a
0:04:51	studio recording uh
0:04:54	corpus, a database
0:04:56	recorded with exactly the same microphone
0:04:58	and uh it is read speech
0:05:01	from uh newspapers and
0:05:02	uh
0:05:03	all the speakers are native French
0:05:06	and uh we have uh
0:05:08	more
0:05:08	uh females than men
0:05:10	and uh
0:05:11	for each
0:05:12	uh speaker
0:05:13	we have
0:05:14	training and
0:05:15	testing excerpts
0:05:17	and uh it is uh
0:05:19	the the the the
0:05:21	we we concatenate
0:05:22	uh some sentences
0:05:24	to have more than
0:05:26	uh twenty seconds
0:05:27	of selected frames
0:05:29	by itself
0:05:33	yeah and then if uh we we we analyse the the variability due to the training
0:05:40	excerpt
0:05:41	we see that
0:05:42	the uh
0:05:44	the equal error rate
0:05:45	uh ranges
0:05:46	uh from
0:05:47	four point one percent to
0:05:50	twenty one
0:05:51	only nine percent
0:05:52	for NIST and for BREF
0:05:54	uh it ranges
0:05:55	from
0:05:56	uh one percent to
0:05:58	thirty three percent
0:06:00	we uh have done a random
0:06:03	uh
0:06:04	set
0:06:04	and the
0:06:05	the mean
0:06:06	here
0:06:06	it's uh
0:06:08	with the um the BREF
0:06:10	is the is
0:06:11	the mean of uh
0:06:14	different uh random sets
0:06:18	there is a very
0:06:20	very important gap
0:06:21	according to the
0:06:23	so
0:06:24	training excerpt
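
For reference, an equal error rate of the kind quoted here can be estimated from target and impostor scores roughly as below. This is a simple threshold sweep under assumed inputs (two plain lists of scores); the function name and the toy scores are hypothetical and are not taken from the talk or from the system that was used.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Estimate the EER by sweeping a threshold over every observed score.

    A trial is accepted when its score is >= threshold. The EER is taken at
    the threshold where false acceptance and false rejection rates are closest.
    """
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: genuine (target) trials scoring below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: impostor trials scoring at or above the threshold.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Invented scores, only to exercise the function.
toy_targets = [2.0, 3.0, 4.0, 5.0]
toy_impostors = [0.0, 1.0, 2.5, 3.5]
toy_eer = equal_error_rate(toy_targets, toy_impostors)
```

Running the same computation once per training set (min, max, random) with identical test trials is what makes the resulting rates directly comparable.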
0:06:27	and the
0:06:28	now the important thing is
0:06:30	to explain the variability
0:06:32	and the the question is what kind of
0:06:34	information
0:06:35	so for
0:06:37	for the number of selected frames
0:06:39	it's possible to do that
0:06:40	with uh NIST
0:06:41	and BREF but for
0:06:43	phonemic distribution and for intra-phonemic acoustic differences
0:06:46	uh we use
0:06:47	only BREF
0:06:48	because it is uh
0:06:50	mm easier
0:06:51	to
0:06:52	work with this type of information
0:06:56	and uh
0:06:57	for NIST uh we have a significant effect
0:07:00	of the number of frames
0:07:02	but it is something that is controlled in uh BREF one hundred
0:07:06	twenty so it is an irrelevant factor for
0:07:09	an explanation of uh the difference of uh performance
0:07:12	but
0:07:13	the other factors are important
0:07:15	because we have
0:07:16	uh a more important gap
0:07:17	in this
0:07:18	in BREF uh one hundred twenty
0:07:21	and uh it cannot be explained by
0:07:23	the number of
0:07:24	uh
0:07:24	frames
0:07:27	so for the phonemic uh
0:07:30	content
0:07:31	uh we do a forced alignment
0:07:33	also i mean and i
0:07:36	five
0:07:36	where the spirits about
0:07:37	and uh we correct
0:07:39	this alignment
0:07:41	manually
0:07:42	and to analyse the phonemic content
0:07:45	uh we
0:07:46	just
0:07:47	uh for the first time
0:07:48	uh count the number of selected frames for each phoneme
0:07:52	we do a MANOVA
0:07:54	with a between-subjects factor which is the set and the dependent variables
0:07:58	are the number of selected frames
0:08:01	and
0:08:02	we see that
0:08:04	there is
0:08:04	nearly no
0:08:06	difference
0:08:06	in phonemic
0:08:07	content
0:08:09	between
0:08:09	here
0:08:10	as for female speakers
0:08:12	uh between the min max
0:08:14	and uh the random
0:08:16	and there is only
0:08:18	one
0:08:18	phoneme
0:08:19	which is uh
0:08:21	which is relevant
0:08:22	and for males it
0:08:24	was the same thing
0:08:25	so it's not uh sufficient to explain the gap
0:08:29	of performance
0:08:32	so for the intra-phonemic information
0:08:34	uh we uh we use the acoustic features
0:08:38	uh for each phoneme
0:08:39	and uh
0:08:40	it's uh exactly the same setting with a MANOVA
0:08:44	we have we have uh a between-subjects factor which is the set
0:08:47	and the dependent variables
0:08:49	are the LFCC, the delta and the delta-delta
0:08:52	yeah
0:08:53	uh we have a
0:08:54	uh important significant difference
0:08:56	for LFCC and for all the phonemes
0:08:59	and for delta
0:09:01	there is an
0:09:02	important
0:09:02	uh yeah
0:09:04	difference
0:09:05	for
0:09:06	um
0:09:06	a large majority of uh phonemes
0:09:08	and mainly
0:09:10	stops
0:09:10	and several vowels
0:09:12	but we don't find a difference
0:09:14	for delta-delta
0:09:16	and uh this is uh
0:09:18	this type of uh analysis
0:09:20	um
0:09:22	it is challenging and proves that uh the intra-phonemic
0:09:25	acoustic differences uh have
0:09:27	to be accounted for
0:09:29	from
0:09:31	and uh so when the training excerpt changes
0:09:35	uh we have large performance differences
0:09:39	which may not be explained by the number
0:09:41	of selected frames
0:09:42	or it is a possible factor
0:09:44	but not a sufficient factor
0:09:46	and the the phonemic distribution does not
0:09:49	uh explain exactly
0:09:51	this uh gap
0:09:52	further investigation is needed
0:09:54	to determine the influence
0:09:56	of uh intra-phonemic
0:09:58	acoustic differences
0:10:00	and uh
0:10:01	so the the question is to do the linking
0:10:04	between
0:10:06	acoustic
0:10:07	uh intra-phonemic acoustic differences
0:10:09	and uh uh higher
0:10:11	level
0:10:12	uh
0:10:13	uh phonemic
0:10:14	information
0:10:15	and uh
0:10:16	work
0:10:16	there are uh new results
0:10:19	since uh the
0:10:20	the the submission of the paper
0:10:22	and uh we see that
0:10:24	uh the intensity is higher
0:10:26	for min
0:10:27	than max
0:10:28	but
0:10:28	it is the
0:10:32	it's significant but if you take the mean
0:10:35	of
0:10:36	the intensity it is uh
0:10:37	a very small
0:10:38	difference
0:10:40	there is no difference
0:10:41	for uh
0:10:43	fundamental
0:10:44	frequency, the pitch
0:10:45	and you you can see the formants here, we don't have a difference
0:10:50	and uh
0:10:50	we we used the dispersion of the vowel triangle and no difference
0:10:55	for uh
0:10:56	this type of
0:10:57	information
0:10:58	and uh
0:10:59	it is the same thing for the spectrum
0:11:01	um so uh
0:11:02	right
0:11:02	of the
0:11:03	fig
0:11:05	so for the future work
0:11:07	uh
0:11:08	the the question is
0:11:11	that the variability may not be only
0:11:14	the result
0:11:15	of the signal samples
0:11:16	and uh
0:11:17	maybe the system itself
0:11:19	is a a problem
0:11:21	and uh
0:11:22	now we are working on the link between the LLR
0:11:27	by frame
0:11:28	and
0:11:28	the phonemic
0:11:29	description
0:11:30	to understand
0:11:31	what exactly are the
0:11:33	good or bad frames and
0:11:34	if
0:11:35	there is or not a link
0:11:36	uh with uh phonemic information
0:11:40	thank you
0:11:50	question
0:12:06	uh
0:12:07	i entered
0:12:08	and there's two you said that
0:12:09	there was no
0:12:11	significant difference between the snr
0:12:15	yeah
0:12:15	oh do
0:12:16	by
0:12:17	training try out some good three trials
0:12:20	yeah that is another difference for
0:12:22	there is a difference on uh the acoustics for the LFCC uh
0:12:27	we have
0:12:28	a significant difference for all the phonemes
0:12:30	but
0:12:31	uh if uh we we want to find uh the link
0:12:35	with uh higher level
0:12:36	uh features
0:12:38	and we don't find
0:12:39	something so
0:12:40	the question is uh
0:12:41	oh
0:12:42	that
0:12:42	we don't
0:12:43	have found
0:12:44	uh with the description
0:12:46	the the the description the
0:12:50	the the features we
0:12:51	usually use
0:12:52	uh in phonetic science
0:12:54	to describe
0:12:55	the speech
0:12:56	actually we have not found
0:12:58	the link between
0:12:59	the LFCC
0:13:00	and uh
0:13:02	and the the the recognition
0:13:03	and uh
0:13:04	the
0:13:06	phonetic
0:13:07	uh information and uh we don't
0:13:09	we don't know
0:13:11	uh
0:13:12	uh well
0:13:13	why
0:13:13	yeah we have this type of gap
0:13:16	and uh
0:13:17	and uh we don't have an explanation
0:13:19	actually
0:13:20	uh by by the acoustic and the phonetic
0:13:24	uh analysis
0:13:26	so if you just take your means
0:13:28	trials we don't we we selection
0:13:31	train
0:13:32	turned out
0:13:33	and i mean high SNR and low SNR
0:13:37	you don't see a difference in performance
0:13:40	sorry
0:13:41	you take only your NIST trials
0:13:42	no no no we we still i mean
0:13:45	but eventually you could do yeah yeah
0:13:48	yeah we we did something like that in there is to be difference in performance
0:13:53	i mean is what you would expect
0:13:54	but yes in our training data should be yeah
0:13:57	worse performance
0:13:58	buttons
0:13:59	you
0:14:00	not
0:14:01	not a break
0:14:02	you rattle basically for exactly the the same
0:14:06	but
0:14:07	maybe there is not so much but you be the the
0:14:11	NIST
0:14:13	'cause
0:14:13	maybe in BREF there is not so much
0:14:16	but maybe it's SNR
0:14:19	no
0:14:20	very
0:14:21	that um
0:14:22	the variability
0:14:23	about the the uh
0:14:25	uh for position for example there is no variability right
0:14:28	okay
0:14:29	no it is exactly the same microphone exactly
0:14:32	the people are are recorded
0:14:35	uh
0:14:36	on the same day and uh it's
0:14:38	there is no variability of the session
0:14:40	the unique the only uh the unique variability
0:14:45	is uh is on the speaker
0:14:47	so and uh when we have only the information about the speaker
0:14:51	we can have
0:14:52	uh a variation like
0:14:54	this
0:14:55	between one
0:14:56	and
0:14:56	thirty three percent
0:14:58	i think what everybody
0:15:00	so
0:15:01	it's
0:15:01	very
0:15:03	and then the the question is
0:15:05	how to explain that because
0:15:07	if we can
0:15:08	if we can have an explanation
0:15:11	we can add
0:15:11	uh a confidence score
0:15:13	or something like this
0:15:15	that
0:15:15	can say that
0:15:16	uh okay
0:15:17	uh
0:15:18	i i know
0:15:19	the
0:15:20	the the training and i know the the testing
0:15:24	the testing sample
0:15:26	and uh i can say i can say
0:15:28	oh okay for this
0:15:30	i i can
0:15:31	i have a a good score
0:15:32	but i don't have confidence
0:15:34	with
0:15:35	this data
0:15:36	but with another data i can have
0:15:38	uh
0:15:39	a good
0:15:39	uh a score uh well computed
0:15:41	and it is
0:15:43	it is the objective
0:15:44	of
0:15:45	this kind of study
0:15:46	it's a good
0:15:55	but
0:16:05	what
0:16:07	sure
0:16:08	hmmm
0:16:08	what
0:16:10	uh_huh
0:16:11	oh
0:16:12	some
0:16:14	from
0:16:16	hmmm
0:16:16	hmmm
0:16:17	uh_huh
0:16:19	um
0:16:22	yeah and it's uh yeah
0:16:24	the
0:16:25	actually boring problem anyway
0:16:27	any information we
0:16:28	just
0:16:29	use
0:16:29	the LFCC, the delta and the delta-delta
0:16:32	and that it was
0:16:33	to to check that
0:16:34	there is
0:16:36	a difference
0:16:37	because uh at the beginning we didn't understand; the question now is the link
0:16:41	between
0:16:42	uh all the phonemic information
0:16:44	and
0:16:45	this
0:16:46	uh LFCC uh
0:16:47	which are used because
0:16:49	we know that
0:16:50	in LFCC and delta we have information
0:16:53	but
0:16:53	we don't
0:16:54	yeah
0:16:55	find
0:16:55	a link between
0:16:57	the LFCC
0:16:58	and the delta
0:16:59	and
0:17:00	this
0:17:00	the
0:17:01	the high level
0:17:02	uh phonemic information
0:17:04	actually i am working on
0:17:07	the coarticulation information
0:17:09	and uh
0:17:10	the
0:17:11	uh
0:17:11	the first uh experiments i did
0:17:14	were only with the
0:17:16	triphones
0:17:17	and analysing
0:17:18	the distribution of the triphones
0:17:20	and uh i don't
0:17:21	find
0:17:21	a difference
0:17:22	so uh actually i am measuring uh the locus
0:17:26	to see if our with a lexus whether we have here
0:17:30	in high school
0:17:31	that with raucous we have
0:17:34	yeah you use the you know
0:17:36	uh not use
0:17:37	is um
0:17:38	uh you take uh the value of the formants
0:17:41	of the second uh formant
0:17:43	at uh
0:17:44	the burst
0:17:44	and
0:17:45	or the beginning of the vowel
0:17:47	and uh at fifty percent of the vowel and you
0:17:51	you
0:17:52	you analyse the variation
0:17:54	between uh
0:17:55	uh the two the two values
0:17:57	and uh
0:17:58	normally if uh there is a a lot of coarticulation
0:18:01	and so the the people
0:18:03	uh and you have a
0:18:06	a linear regression
0:18:07	of the value according to the
0:18:09	other value but if
0:18:11	there is no coarticulation
0:18:13	uh you have something that is very
0:18:16	and uh
0:18:17	two
0:18:18	yeah
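
The locus measurement described in this answer can be sketched as a least-squares line relating the second formant at the start of the vowel to the second formant at its midpoint. The function name and the formant values below are invented for illustration and do not come from the talk; in the usual reading of locus equations, a slope near one indicates strong coarticulation and a slope near zero indicates little or none.

```python
def locus_equation(f2_mid, f2_onset):
    """Fit f2_onset = slope * f2_mid + intercept by ordinary least squares.

    f2_mid:   second-formant values (Hz) at 50 percent of each vowel.
    f2_onset: second-formant values (Hz) at the beginning of the same vowels.
    """
    n = len(f2_mid)
    mean_x = sum(f2_mid) / n
    mean_y = sum(f2_onset) / n
    sxx = sum((x - mean_x) ** 2 for x in f2_mid)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(f2_mid, f2_onset))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented midpoint values for four consonant-vowel tokens.
toy_f2_mid = [1000.0, 1500.0, 2000.0, 2500.0]
# Onset follows the vowel closely: strong coarticulation (hypothetical data).
toy_onset_strong = [0.9 * x + 150.0 for x in toy_f2_mid]
# Onset stays at a fixed locus: no coarticulation (hypothetical data).
toy_onset_flat = [1800.0, 1800.0, 1800.0, 1800.0]

slope_strong, intercept_strong = locus_equation(toy_f2_mid, toy_onset_strong)
slope_flat, intercept_flat = locus_equation(toy_f2_mid, toy_onset_flat)
```

Comparing such slopes between the min and max training excerpts of a speaker would be one way to test whether coarticulation differs between them, which is the question raised here.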
0:18:23	uh_huh
0:18:25	first
0:18:25	fig
0:18:28	oh
0:18:28	well
0:18:30	you yeah
0:18:34	oh good
0:18:36	or or
0:18:37	uh
0:18:39	oh
0:18:39	for those
0:18:41	yeah our
0:18:42	uh
0:18:43	okay
0:18:44	the more you
0:18:46	or or
0:18:49	the
0:18:52	yes yeah
0:18:56	it's a it's a good question
0:18:57	um
0:18:58	yeah you have uh the score
0:19:00	the last call
0:19:01	four
0:19:03	um i is the speaker that on the twenty eight
0:19:07	the
0:19:07	there is
0:19:09	a difference
0:19:09	uh according to the normalisation
0:19:12	but it is
0:19:13	not comparable
0:19:14	with the difference
0:19:15	we have
0:19:16	without normalisation
0:19:18	between the
0:19:19	the
0:19:19	when we select
0:19:20	you said to yeah
0:19:24	yeah
0:19:25	that no we we are trying we are training the
0:19:29	the normalisation
0:19:30	it is something that uh we have to do
0:19:33	but the problem is uh we have uh
0:19:35	with a database like BREF
0:19:37	uh it's very difficult because
0:19:39	we don't have
0:19:40	a lot of uh
0:19:43	a lot of data and uh to be able to to have a a good uh a good world model
0:19:47	and also to have uh
0:19:49	uh
0:19:50	all
0:19:51	different sets of training and testing
0:19:53	uh we don't have a lot of
0:19:55	uh data so it's very difficult to to do
0:19:58	the normalisation
0:19:59	if we want
0:20:00	to to have a lot of
0:20:02	different
0:20:03	uh training
0:20:05	excerpts
0:20:09	oh
0:20:10	or or what
0:20:11	two
0:20:12	maybe more to each source model one quarter sometimes you can point to
0:20:18	oh
0:20:19	um
0:20:20	for the the concatenation it is uh a randomised
0:20:25	concatenation
0:20:26	we are sure that there are
0:20:28	never
0:20:28	the same
0:20:29	uh samples
0:20:30	for testing and training
0:20:32	but
0:20:33	uh
0:20:34	uh it
0:20:34	so we we don't
0:20:36	combine that actually
0:20:37	um
0:20:38	for example if if your question is that
0:20:40	uh have we tried to train
0:20:43	right
0:20:43	to um
0:20:45	to use the the the best
0:20:47	uh and uh concatenate the best
0:20:50	to to to have a better
0:20:52	model we have not
0:20:53	uh
0:20:54	tried
0:20:55	this uh
0:20:56	type of combination
0:20:57	a small country
0:20:59	you have some recordings of each speaker
0:21:02	point
0:21:02	time
0:21:05	between three and twenty
0:21:08	recording yeah
0:21:09	and each recording
0:21:10	some
0:21:10	some some some
0:21:11	point in time
0:21:13	and
0:21:15	according to teach
0:21:16	yeah
0:21:17	okay
0:21:18	strong
0:21:19	combining multiple recordings to a more
0:21:22	no yeah
0:21:23	we we have done um
0:21:25	with um
0:21:26	to to to have a
0:21:28	um
0:21:29	samples
0:21:29	with
0:21:30	for
0:21:30	two minutes
0:21:31	and a half
0:21:32	uh
0:21:34	um
0:21:35	of selected frames
0:21:36	and uh we
0:21:38	we
0:21:39	we do the same thing that uh
0:21:42	select the best and the worst with um
0:21:45	a longer
0:21:46	uh
0:21:47	signal
0:21:47	and the
0:21:48	the the results
0:21:50	are
0:21:51	this one is that
0:21:52	uh the there is
0:21:53	let's uh that's also that's why the the curve
0:21:56	is that not
0:21:57	so
0:21:58	so good
0:21:59	but uh we have
0:22:00	the
0:22:01	the set not
0:22:02	uh the same yeah
0:22:03	that's
0:22:04	a gap which is important
0:22:06	and uh
0:22:07	here the the equal error rate is less than
0:22:10	one percent
0:22:11	and uh here it is um five percent and uh we have
0:22:15	a lot of frames selected
0:22:16	yeah
0:22:17	which shows more
0:22:19	combination of so yeah things from yeah point sometimes or
0:22:25	between no no no
0:22:28	no
0:22:29	no
0:22:29	it's uh
0:22:31	now because ah it is uh it is
0:22:34	yes there there is a it is exactly the same testing for
0:22:38	for this curve
0:22:39	and this curve
0:22:41	so it is uh compare it is possible to compare
0:22:43	the
0:22:44	that's why the
0:22:45	posted to
0:22:47	i don't know
0:22:51	from
0:22:52	sessions which
0:22:53	you just
0:22:54	no i have no information about it
0:22:57	because
0:22:58	because the
0:22:59	all the samples
0:23:00	are
0:23:01	uh recorded in the same
0:23:03	with the same microphone and exactly
0:23:06	the same day so
0:23:07	there is no uh inter-session variability
0:23:12	there is only
0:23:13	uh intra-speaker variability
0:23:16	it is controlled that
0:23:17	the speaker hon that's a
0:23:19	the
0:23:20	the one i want to find an optional
0:23:25	for example for half an hour or two
0:23:30	open or something
0:23:32	yes
0:23:33	yeah
0:23:46	oh
0:23:46	oh
0:23:47	right
0:23:48	hmmm
0:23:54	right
0:23:54	hmmm

Intra-speaker variability effects on Speaker Verification performance

SESSION 5: Speaker recognition – Inter-session variability

Added: 14. 7. 2010 11:08, Authors: Juliette Kahn, Nicolas Audibert, Solange Rossato, Jean-François Bonastre (Laboratoire Informatique d'Avignon, University of Avignon), Duration: 0:23:55