0:00:06 | The title of my talk is "Bayesian speaker verification with heavy-tailed priors". |
---|
0:00:30 | In a nutshell, it is about applying joint factor analysis with i-vectors as features. |
---|
0:00:41 | So I'll be assuming that you have some familiarity with joint factor analysis, i-vectors, and cosine distance scoring. |
---|
0:00:54 | The key fact about i-vectors is that they provide a representation of speech segments of arbitrary durations by vectors of fixed dimension. |
---|
0:01:08 | These vectors seem to contain most of the information needed to distinguish between speakers, and as a bonus they are of relatively low dimension: typically four hundred, rather than a hundred thousand as in the case of GMM supervectors. |
---|
0:01:29 | This means that it's possible to apply modern Bayesian methods of pattern recognition to the speaker recognition problem. |
---|
0:01:41 | We've banished the time dimension altogether, and we're in a situation which is quite analogous to other pattern recognition problems. |
---|
0:01:57 | I think I should explain at the outset what I mean by Bayesian, because it's open to several interpretations. |
---|
0:02:06 | What I intend is that, in my mind, the terms Bayesian and probabilistic are synonymous with each other. |
---|
0:02:16 | The idea is, as far as possible, to do everything within the framework of the calculus of probability. |
---|
0:02:27 | It doesn't really matter whether you prefer to interpret probabilities in frequentist terms or in subjective terms. |
---|
0:02:38 | The rules of probability are the same: there are only two, the sum rule and the product rule, and they give you the same results in both cases. |
---|
0:02:51 | The advantage of this is that you have a logically coherent way of reasoning in the face of uncertainty. |
---|
0:03:01 | The disadvantage is that in practice you usually run into a computational brick wall in pretty short order if you try to follow these rules consistently. |
---|
0:03:16 | So in fact it's really only been in the past ten years that this field of Bayesian pattern recognition has really taken off, and that's thanks to the introduction of fast approximate methods of Bayesian inference, in particular variational Bayes. |
---|
0:03:43 | These make it possible to treat probabilistic models which are far more sophisticated than was possible in the case of traditional statistics. |
---|
0:03:55 | So the unifying theme in my talk will be the application of variational Bayes methods to the speaker recognition problem. |
---|
0:04:07 | I start out with the traditional assumptions of joint factor analysis, that speaker and channel effects are statistically independent and Gaussian distributed. |
---|
0:04:23 | In the first part of my talk I will simply aim to show how joint factor analysis can be done under these assumptions, using i-vectors as features, in a Bayesian way. |
---|
0:04:42 | This already works very well; in my experience it gives better results than joint factor analysis. |
---|
0:04:49 | The second part of my talk will be concerned with how variational Bayes can be used to model non-Gaussian behaviour in the data. |
---|
0:05:03 | I found that this leads to a substantial improvement in performance, and as an added bonus it seems to be possible to do away with the need for score normalisation altogether. |
---|
0:05:22 | The final part of my talk is speculative; it's concerned with the problem of how to integrate the assumptions of joint factor analysis and cosine distance scoring in a coherent framework. |
---|
0:05:40 | On the face of it this looks like a hopeless exercise, because the assumptions appear to be completely different. |
---|
0:05:47 | However, it is possible to do something about this, thanks to the flexibility provided by variational Bayes. So even though this is speculative, I think it is worth talking about, because it's a real object lesson in how powerful these Bayesian methods are, at least potentially. |
---|
0:06:10 | Before getting down to business, let me just say something about the way I've organised this presentation. |
---|
0:06:16 | In preparing the slides I tried to ensure that they were reasonably complete and self-contained; the idea I have in mind is that if anyone is interested in reading through the slides afterwards, they should tell a fairly complete story. |
---|
0:06:31 | But because of time constraints I'm going to have to gloss over some points in the oral presentation, and for the same reason there are going to be some places in the slides where I have to do some hand waving. |
---|
0:06:49 | I found that by focusing on the Gaussian and statistical independence assumptions I could explain the variational Bayes ideas with a minimal amount of technicalities, so I will spend almost half my time on the first part of the talk. |
---|
0:07:11 | On the other hand, the last part of the talk is technical; it is addressed primarily to members of the audience who will have read, say, the chapter on variational Bayes in Bishop's book. |
---|
0:07:30 | okay |
---|
0:07:35 | Okay, so here are the basic assumptions of factor analysis with i-vectors as features. |
---|
0:07:45 | We use D for data, s for speaker, and c for channel or recording; we have a collection of recordings per speaker. |
---|
0:07:56 | We assume that the data can be decomposed into two statistically independent parts, a speaker part and a channel part. These assumptions are questionable, but I'm going to stick with them for the first part of the talk. |
---|
0:08:16 | This model, in which we have replaced the hidden supervector by an observable i-vector, already has a name: it's known in face recognition as probabilistic linear discriminant analysis. |
---|
0:08:36 | To my mind this is the true covariance model, but the other formulation is the one that you will find in the literature. |
---|
0:08:49 | It's not perhaps quite as straightforward as it appears, because if you're dealing with high dimensional features, for example MLLR features, you can't treat these covariance matrices as being of full rank. |
---|
0:09:04 | So you need a hidden variable representation of the model, which is exactly analogous to the hidden variable description of joint factor analysis. |
---|
0:09:19 | So here on the left-hand side, D is an observable i-vector, not a hidden supervector. |
---|
0:09:26 | It turns out to be convenient for the heavy-tailed stuff to refer to the eigenvoice matrix and the eigenchannel matrix using subscripts, U1 and U2, rather than the traditional names. |
---|
0:09:41 | Same thing for the hidden variables: the speaker factors are labelled x1, and the channel factors are labelled x2r, where r indicates the dependence on the recording, or the channel. |
---|
0:09:55 | There's one difference here from the conventional formulation of joint factor analysis: in PLDA the residual term, the epsilon, which in general is modelled by a diagonal covariance or precision matrix, is associated traditionally with the channel rather than with the speaker. |
---|
0:10:20 | In JFA I formulated it slightly differently, but I'm just going to follow this model in this presentation. |
---|
0:10:30 | So because the residual epsilon is associated with the channel, there are two noise terms: the contribution of the eigenchannels to the channel variance, and the contribution of the residual. |
---|
0:10:48 | Lambda is a precision matrix, that is to say the inverse of a covariance matrix, and the two contributions add because you have statistical independence. |
---|
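As an aside for readers of the transcript: the model just described can be written D = m + U1 x1 + U2 x2 + epsilon. Here is a minimal generative sketch of it, assuming NumPy; the dimensions, variable names and values are illustrative assumptions, not numbers from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, k1, k2 = 400, 120, 50                         # illustrative sizes
m = np.zeros(dim)                                  # global mean
U1 = 0.1 * rng.standard_normal((dim, k1))          # eigenvoice matrix (speaker)
U2 = 0.1 * rng.standard_normal((dim, k2))          # eigenchannel matrix (channel)
lam = np.ones(dim)                                 # diagonal precision of the residual

def sample_ivectors(n_recordings):
    """Sample i-vectors for one speaker under the Gaussian PLDA assumptions."""
    x1 = rng.standard_normal(k1)                   # speaker factors: one draw per speaker
    ivecs = []
    for _ in range(n_recordings):
        x2 = rng.standard_normal(k2)               # channel factors: one draw per recording
        eps = rng.normal(0.0, 1.0 / np.sqrt(lam))  # residual with precision lam
        ivecs.append(m + U1 @ x1 + U2 @ x2 + eps)  # D = m + U1 x1 + U2 x2 + eps
    return np.array(ivecs)

ivectors = sample_ivectors(3)                      # three recordings of the same speaker
```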
0:11:04 | This is the graphical model that goes with that equation. If you're not familiar with these, let me just take a minute to explain how to read these diagrams. |
---|
0:11:19 | A shaded node like that indicates an observable variable, the blank nodes indicate hidden variables, the dots indicate model parameters, and the arrows indicate conditional dependencies. |
---|
0:11:40 | So the i-vector is assumed to depend on the speaker factors, the channel factors, and the residual. |
---|
0:11:50 | The plate notation indicates that something is replicated several times: there are several sets of channel factors, one for each recording, but there's only one set of speaker factors, so that's outside of the plate. |
---|
0:12:07 | Here I've specified, say, the parameter lambda, but I didn't bother specifying the distribution of the speaker factors because it's understood to be standard normal. |
---|
0:12:24 | So, as I mentioned, including the channel factors enables this decomposition here, but it's not always necessary: if you have i-vectors of dimension four hundred it's actually possible to model full rank, rather than diagonal, precision matrices, and in that case this term doesn't actually contribute anything. |
---|
0:12:51 | um |
---|
0:12:52 | i have found it useful well |
---|
0:12:54 | in experimental work to use this term |
---|
0:12:56 | to estimate |
---|
0:12:57 | eigenchannels on microphone data |
---|
0:12:59 | so it's useful to people |
---|
0:13:02 | and in fact it turns out that so these channel factors can always be eliminated at recognition time that's a |
---|
0:13:07 | technical point i come back to it later |
---|
0:13:09 | if i |
---|
0:13:15 | Okay, so how do you do speaker recognition with the PLDA model? I'm going to make some provisional assumptions here. One is that you've already succeeded in estimating the model parameters, the eigenvoices, the eigenchannels, et cetera. |
---|
0:13:30 | The other is that you know how to evaluate this thing known as the evidence integral: you have a collection of i-vectors associated with each speaker, you also have a collection of hidden variables, and to evaluate the marginal likelihood you have to integrate over the hidden variables. |
---|
0:13:48 | So assume that we've tackled these two problems. It turns out that the key to solving both problems in general is to evaluate the posterior distribution of the hidden variables. |
---|
0:14:02 | I'll return to that in a minute, but first I just want to show you how to do speaker recognition. |
---|
0:14:10 | Okay, take the simplest case, the core condition in the NIST evaluation: you have one recording which is usually designated as test, another designated as train, and you're interested in the question whether the two speakers are the same or different. |
---|
0:14:30 | If the two speakers are the same (I think it's natural to call that the alternative hypothesis, but there doesn't seem to be universal agreement about that), then the likelihood of the data is calculated on the assumption that there is a common set of speaker factors but different channel factors for the two recordings. |
---|
0:14:58 | On the other hand, if the two speakers are different, then the calculation of the two likelihoods can be done independently, because the speaker factors and the channel factors are untied across the two recordings. |
---|
0:15:11 | So the point is that everything here is an evidence integral: if you can evaluate the evidence integral, you're in business. |
---|
0:15:22 | A few things to note. Unlike traditional likelihood ratios, this is symmetric in D1 and D2. It also has an unusual denominator here; you don't see anything like this in joint factor analysis. |
---|
0:15:42 | This is something that comes out of following the Bayesian party line, and, as we'll see later, it's actually potentially an effective method of score normalisation. |
---|
0:16:01 | The other point I would like to stress is that you can write down the likelihood ratio for any type of speaker recognition problem in the same way. |
---|
0:16:10 | For instance, you might have eight conversations in training and one conversation in test, or three conversations in train and two conversations in test. In all cases it's just a matter of following the rules of probability consistently, and you can write down the likelihood ratio, or Bayes factor as it is usually called in this field. |
---|
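Written out, the Bayes factor for the simplest case takes the following standard form (D1 and D2 are the two collections of observations, h collects the hidden variables, and each term is an evidence integral; the notation here is an editorial addition):

$$ \mathrm{LR}(\mathcal{D}_1,\mathcal{D}_2)=\frac{P(\mathcal{D}_1,\mathcal{D}_2\mid \text{same speaker})}{P(\mathcal{D}_1)\,P(\mathcal{D}_2)},\qquad P(\mathcal{D})=\int P(\mathcal{D}\mid h)\,P(h)\,dh. $$

In the numerator the speaker factors are tied across the two recordings; in the denominator each recording gets its own hidden variables, which is why the ratio is symmetric in D1 and D2.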
0:16:36 | The evidence integral can be evaluated exactly under Gaussian assumptions, but the calculation is rather involved, and if you relax the Gaussian assumptions you can't do it at all. |
---|
0:16:50 | I believe that even in the Gaussian case you're better off using variational Bayes; not everyone agrees with me on this, but I decided to let it stand, and we can go into it later if there's time. |
---|
0:17:07 | The key insight here is this inequality: you can always find a lower bound on the evidence using any distribution over the hidden factors. |
---|
0:17:22 | I grant you it's not obvious just by looking at it, but the derivation turns out to be just a consequence of the fact that Kullback-Leibler divergences are non-negative. |
---|
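For reference, the inequality is the standard evidence lower bound: for any distribution q over the hidden variables h,

$$ \ln P(\mathcal{D}) \;=\; \mathbb{E}_{q}\!\left[\ln\frac{P(\mathcal{D},h)}{q(h)}\right] \;+\; \mathrm{KL}\!\left(q(h)\,\big\|\,P(h\mid\mathcal{D})\right) \;\geq\; \mathbb{E}_{q}\!\left[\ln\frac{P(\mathcal{D},h)}{q(h)}\right], $$

with equality exactly when q is the true posterior, since the Kullback-Leibler divergence is non-negative.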
0:17:37 | What I'll be focusing on is the use of the variational Bayes method to find a principled approximation to the true posterior. |
---|
0:17:56 | Let me just digress a minute to explain why posteriors are the bottleneck. There's nothing mysterious about this posterior distribution: you just apply Bayes' rule and this is what you get. You can read off this term here from the graphical model; this is the prior; this is the evidence. |
---|
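In symbols (a standard statement of Bayes' rule for the hidden variables, added here for reference):

$$ P(h\mid\mathcal{D}) \;=\; \frac{P(\mathcal{D}\mid h)\,P(h)}{P(\mathcal{D})}, \qquad P(\mathcal{D}) \;=\; \int P(\mathcal{D}\mid h)\,P(h)\,dh. $$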
0:18:16 | It's perfectly straightforward; the only problem in practice is that you can't evaluate it exactly. Evaluating the evidence and evaluating the posterior are two sides of the same problem, and you can't do it just by numerical integration because these integrals are in hundreds of dimensions. |
---|
0:18:39 | Another way of stating the difficulty, which I think is a useful way of thinking about it, is that whatever factorisations you have in the prior get destroyed when you multiply by the likelihood. Factorisations in the prior are statistical independence assumptions, and statistical independence assumptions get destroyed in the posterior. |
---|
0:19:01 | It's easy to see why this is the case in terms of the graphical model, but as I said, I'm going to gloss over a few things. |
---|
0:19:14 | To return to variational Bayes: the idea in the variational Bayes approximation is that you acknowledge that independence has been destroyed in the posterior, but you go ahead and impose it on the posterior anyway. |
---|
0:19:33 | You look for what's called a variational approximation of the posterior; it's called variational because it's actually free-form, as in the calculus of variations: you don't impose any restriction on the functional form of q. |
---|
0:19:49 | And there's a standard set of coupled update formulas that you can apply here. They are coupled because this expectation is calculated with the posterior on x2, and this expectation is calculated with the posterior on x1, so you have to iterate between the two. |
---|
0:20:10 | The nice thing is that this iteration comes with EM-like convergence guarantees, and it avoids altogether the need to invert large sparse block matrices, which is the only way you can evaluate the evidence exactly, and then only in the Gaussian case. |
---|
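For readers following along in Bishop, the coupled updates referred to here are the standard mean-field formulas: under the factorisation q(x1, x2) = q(x1) q(x2),

$$ \ln q^{\ast}(x_1) = \mathbb{E}_{q(x_2)}\!\left[\ln P(\mathcal{D},x_1,x_2)\right] + \text{const}, \qquad \ln q^{\ast}(x_2) = \mathbb{E}_{q(x_1)}\!\left[\ln P(\mathcal{D},x_1,x_2)\right] + \text{const}, $$

and iterating between them can only increase the lower bound.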
0:20:35 | This posterior distribution, or rather the variational approximation of the posterior distribution, is also the key to estimating the model parameters: you use the lower bound as a proxy for the likelihood of the evidence, and you seek to optimise the lower bound calculated over a collection of training speakers. |
---|
0:21:01 | Here I've just taken the definition and rewritten it this way. It's convenient to do this because this term here doesn't involve the model parameters at all, so the first approach to the problem would be just to optimise this term here, the contribution of a given speaker to the evidence criterion, by summing it over all speakers. |
---|
0:21:32 | When you work it out, this turns out to be formally identical to probabilistic principal components analysis; it's just a least squares problem. |
---|
0:21:51 | In fact it's the EM auxiliary function for probabilistic principal components analysis; the only difference is that you have to use the variational posterior rather than the exact posterior. |
---|
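To make the "formally identical to probabilistic PCA" remark concrete, here is a minimal sketch of the least-squares update for a factor loading matrix, assuming the variational posterior first and second moments have already been computed; the function and array names are assumptions for the sketch, not notation from the talk.

```python
import numpy as np

def update_loading_matrix(data, post_mean, post_second_moment):
    """PPCA-style M-step: least-squares update of a factor loading matrix.

    data:               list of centred observations d_i, each of shape (dim,)
    post_mean:          list of posterior means E[x_i], each of shape (k,)
    post_second_moment: list of posterior second moments E[x_i x_i^T], each (k, k)
    """
    dim, k = data[0].shape[0], post_mean[0].shape[0]
    A = np.zeros((dim, k))              # accumulates sum_i d_i E[x_i]^T
    B = np.zeros((k, k))                # accumulates sum_i E[x_i x_i^T]
    for d, ex, exx in zip(data, post_mean, post_second_moment):
        A += np.outer(d, ex)
        B += exx
    return A @ np.linalg.inv(B)         # loading matrix maximising the auxiliary function
```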
0:22:07 | There is another way of improving the estimation, which I call minimum divergence estimation. This is liable to give rise to a bit of confusion, so I'll try to explain it briefly. |
---|
0:22:23 | Concentrate on this term here: it's independent of the model parameters. But you can make changes of variable here which minimise the divergence and are constrained in such a way as to preserve the value of the EM auxiliary function. |
---|
0:22:46 | If you minimise these divergences while keeping this thing fixed, you will then increase the value of the evidence criterion. |
---|
0:23:00 | The way this works, say in the case of the speaker factors: to minimise the divergence, you look for an affine transformation of the speaker factors such that the first and second order moments of the speaker factors agree on average, over the speakers in the training set, with the first and second order moments of the prior. |
---|
0:23:27 | That's just a matter of finding an affine transformation that satisfies this condition; you then apply the inverse transformation to update the model parameters in such a way as to keep the value of the EM auxiliary function fixed. |
---|
0:23:46 | And it turns out that if you interleave these two steps you will be able to accelerate the convergence. |
---|
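A minimal sketch of the minimum divergence step for the speaker factors, under the assumptions that the prior is standard normal and that the posterior mean and covariance of x1 are available for every training speaker; the names and shapes are illustrative.

```python
import numpy as np

def minimum_divergence_step(V, post_means, post_covs):
    """Re-standardise the speaker factor posteriors and absorb the change into V.

    post_means: (n_speakers, k) posterior means of the speaker factors
    post_covs:  (n_speakers, k, k) posterior covariances of the speaker factors
    """
    mu = post_means.mean(axis=0)                           # average first moment
    second = (post_covs + np.einsum('ni,nj->nij', post_means, post_means)).mean(axis=0)
    cov = second - np.outer(mu, mu)                        # average central second moment
    T = np.linalg.cholesky(cov)                            # x = T z + mu makes z standard normal on average
    return V @ T                                           # inverse change of variable applied to V
```

In a full implementation the shift V mu would also be absorbed into the global mean; the sketch keeps only the scaling step.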
0:23:59 | Just one comment about this: what I've set out to do here is to produce point estimates of the eigenvoice matrix and the eigenchannel matrix. If you are a really hardcore Bayesian, you don't allow point estimates into your model; you have to do everything in terms of prior probabilities and posterior probabilities. |
---|
0:24:29 | So a true blue Bayesian approach would put a prior on the eigenvoices and calculate the posterior, again by variational Bayes. Even the number of speaker factors could be treated as a hidden random variable, and its posterior distribution could be calculated, again by Bayesian methods. |
---|
0:24:49 | There is an extensive literature on this subject. I'd say that if there's one problem with variational Bayes, it's that it provides too much flexibility: you have to exercise good judgement as to which things you should try and which things are probably not going to help. In other words, don't lose sight of your engineering objective. |
---|
0:25:15 | The particular thing I chose to focus on was the Gaussian assumption. As far as I can see, the Gaussian assumption is just not realistic for the i-vector data that we're dealing with. |
---|
0:25:34 | What I set out to do, using variational Bayes, was to replace the Gaussian assumption, with its exponentially decreasing tails, by a power law distribution, which allows for outliers: exceptional speaker effects or severe channel distortions in the data. |
---|
0:25:57 | This term 'black swan' is amusing. The Romans had a phrase, 'a rare bird, much like a black swan', intended to convey the notion of something impossible or inconceivable; they were in no position to know that black swans actually do exist, in Australia. |
---|
0:26:21 | A financial forecaster by the name of Taleb wrote a polemic a few years ago against the Gaussian distribution, called The Black Swan. It actually appeared just before the crash in two thousand and eight, which of course is the mother of all black swans, and as a result it made quite a big media splash. |
---|
0:26:50 | Okay, it turns out that the textbook definition of the Student's t distribution, the one which I'm going to use in place of the Gaussian distribution, is not the one that is workable with variational Bayes. |
---|
0:27:06 | There is another construction that represents the Student's t distribution as a continuous mixture of normal random variables. It's based on the gamma distribution, a unimodal distribution on the positive reals which has two parameters that enable you to adjust the mean and the variance independently of each other. |
---|
0:27:31 | The way it works is this: in order to sample from a Student's t distribution, you start with a Gaussian distribution with precision matrix lambda; you then scale the covariance matrix by a random scale factor drawn from the gamma distribution, and you sample from the normal distribution with the modified covariance matrix. |
---|
0:28:00 | It's that random scale factor that introduces the heavy-tailed behaviour. |
---|
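In symbols, this is the standard scale-mixture-of-Gaussians construction of the Student's t: with n degrees of freedom, mean mu and precision matrix Lambda,

$$ u \sim \mathrm{Gamma}\!\left(\tfrac{n}{2},\tfrac{n}{2}\right), \qquad x \mid u \sim \mathcal{N}\!\left(\mu,\,(u\Lambda)^{-1}\right) \;\Longrightarrow\; x \sim \mathrm{St}\!\left(\mu,\Lambda,n\right). $$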
0:28:09 | The parameters of the gamma distribution determine the extent to which this thing is heavy-tailed: you have the Gaussian at one extreme, and at the other extreme you have something called the Cauchy distribution, which is so heavy-tailed that the variance is infinite. |
---|
0:28:29 | This term 'degrees of freedom' comes from classical statistics, but it doesn't have any particular meaning in this context. |
---|
0:28:40 | So, for example, suppose you want to make the channel factors heavy-tailed in order to model outlying channel distortions. |
---|
0:28:53 | What you do is this: remember there's one set of channel factors for each recording, so this is inside the plate. You associate a random scale factor with that hidden random variable, and that random scale factor is sampled from a gamma distribution, with a number of degrees of freedom I'll call n2. |
---|
0:29:19 | Heavy-tailed PLDA does this for all of the hidden variables in the Gaussian PLDA model: the speaker factors have an associated random scale factor, the channel factors have an associated random scale factor, and the residual has an associated random scale factor. |
---|
0:29:44 | So in fact all I've added here are just three extra parameters, three extra degrees of freedom, in order to model the heavy-tailed behaviour. |
---|
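A minimal sketch of the heavy-tailed generative process just described: one random scale factor for the speaker factors, and one per recording for the channel factors and for the residual. The degrees-of-freedom values, shapes and the use of NumPy are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, k1, k2 = 400, 120, 50
m, lam = np.zeros(dim), np.ones(dim)                # global mean, diagonal residual precision
U1 = 0.1 * rng.standard_normal((dim, k1))           # eigenvoices
U2 = 0.1 * rng.standard_normal((dim, k2))           # eigenchannels
n1, n2, n_eps = 5.0, 5.0, 10.0                      # the three extra degrees of freedom

def heavy_tailed_normal(size, dof, precision=1.0):
    """Student's t draw via the gamma scale-mixture construction."""
    u = rng.gamma(dof / 2.0, 2.0 / dof)             # random scale factor with mean 1
    return rng.normal(0.0, 1.0 / np.sqrt(u * precision), size)

def sample_speaker(n_recordings):
    x1 = heavy_tailed_normal(k1, n1)                # heavy-tailed speaker factors (per speaker)
    ivecs = []
    for _ in range(n_recordings):
        x2 = heavy_tailed_normal(k2, n2)            # heavy-tailed channel factors (per recording)
        eps = heavy_tailed_normal(dim, n_eps, lam)  # heavy-tailed residual (per recording)
        ivecs.append(m + U1 @ x1 + U2 @ x2 + eps)
    return np.array(ivecs)
```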
0:29:59 | These are some technical points about how you can carry over variational Bayes from the Gaussian case to the heavy-tailed case and do so in a computationally efficient way; I refer you to the paper for these. |
---|
0:30:18 | The key point that I would like to draw your attention to is that these numbers of degrees of freedom can actually be estimated using the same evidence criterion as the eigenvoices and the eigenchannels. |
---|
0:30:38 | Okay, here are some results. This is a comparison of Gaussian PLDA and heavy-tailed PLDA on several conditions of the NIST two thousand and eight evaluation. |
---|
0:30:55 | This is the equal error rate and the two thousand and eight detection cost function. It's clear that in all three conditions there's a very dramatic reduction in errors, both at the DCF operating point and at the equal error rate. |
---|
0:31:15 | This was done without score normalisation. If you do add score normalisation, what happens is this: you get a uniform improvement in all cases with Gaussian PLDA, and a uniform degradation with the Student's t distribution. So not only does score normalisation not help you, it's a nuisance in the Student's t case. |
---|
0:31:46 | Let me just say a word about score normalisation. It's usually needed in order to set the decision threshold in speaker verification in a trial-dependent way. |
---|
0:32:01 | It's typically very computationally expensive, and it complicates life if you ever have to do cross-gender trials. |
---|
0:32:11 | On the other hand, if you have a good generative model for speech, in other words if you insist on the probabilistic way of thinking, there's no room for score normalisation, and there should be no need for calibration either, but we're not there yet. |
---|
0:32:31 | In practice it's needed because of outlying recordings, which tend to produce exceptionally low scores for all of the trials in which they are involved. |
---|
0:32:43 | What the Student's t distribution appears to be doing is that the extra hidden variables, these scale factors that I introduced, appear to be capable of modelling this outlier behaviour adequately, thus doing away with the need for score normalisation. |
---|
0:33:08 | I should say a word about microphone speech. The situation with telephone speech seems to be quite clear: Gaussian PLDA with score normalisation gives results which are comparable to cosine distance scoring, and you get better results with heavy-tailed PLDA, at least on the two thousand and eight data; in general they're about twenty-five percent better than traditional joint factor analysis. |
---|
0:33:36 | But it turns out to break down, in an interesting way, on microphone speech. |
---|
0:33:47 | Najim yesterday described an i-vector extractor of dimension six hundred which could be used for recognition on both microphone and telephone speech. |
---|
0:33:59 | So we started out by training a model using only telephone speech for the speaker factors, with the residual modelled by a full precision matrix; we then augmented that with eigenchannels, and everything was treated in the heavy-tailed way. |
---|
0:34:17 | What turned out, unfortunately, is that we ran straight into the Cauchy distribution for the microphone transducer effects. What that means is that the variance of the channel effects, of the microphone effects, is infinite. |
---|
0:34:39 | It's a short step to realise that if you have infinite variance for channel effects, you're not going to be able to do speaker recognition. |
---|
0:34:46 | I haven't been able to fix this. At present the best strategy would seem to be to project away the troublesome dimensions using some type of LDA, and that's the kind of strategy which I believe we'll be hearing about in the next presentation. |
---|
0:35:10 | Okay, now I come to the third part of my talk, which concerns the question of how it would be possible to integrate joint factor analysis, or PLDA, and cosine distance scoring, or something resembling it, in a coherent probabilistic framework. |
---|
0:35:36 | If you haven't seen these types of scatter plots, they're very interesting. Each colour here represents a speaker, and each point represents an utterance by the speaker. |
---|
0:35:56 | This is a plot of supervectors projected onto what are essentially the first two i-vector components. |
---|
0:36:07 | You can see what's going on here; this is the real motivation for cosine distance scoring. Cosine distance scoring ignores the magnitude of the vectors and uses only the angle between them as the similarity measure. |
---|
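For completeness, the score in question is just the cosine of the angle between two i-vectors; a one-function sketch (NumPy assumed):

```python
import numpy as np

def cosine_score(w1, w2):
    """Cosine distance score: uses only the angle between the i-vectors, not their magnitudes."""
    return float(w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2)))
```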
0:36:27 | And this is completely inconsistent with the assumptions of joint factor analysis, because there seems to be, for each speaker, a principal axis of variability that passes through the speaker's mean: the session variability for a speaker is augmented in a particular direction, the direction of the speaker's mean vector. |
---|
0:36:48 | Whereas JFA, or PLDA, assumes that you can model session variability for all speakers in the same way; that's the statistical independence assumption in JFA. |
---|
0:37:11 | A note of caution is necessary here: in interpreting these plots you have to be careful that they're not an artifact of the way you estimate supervectors and so on. We do find these plots with i-vectors, but we had to cherry-pick the results in order to get nice pictures like the one I showed you. |
---|
0:37:34 | But the principal evidence for this type of behaviour, which I call directional scattering, is the effectiveness of the cosine distance measure in speaker recognition. |
---|
0:37:51 | I don't know how to account for it, and I'm not concerned with that question; the only question I would like to answer is how to model this type of behaviour probabilistically. |
---|
0:38:05 | As I said, this part is going to get a bit technical; it's addressed to people who have read the chapter on variational Bayes in Bishop's book. |
---|
0:38:18 | In order to get a handle on this problem there seems to be a natural strategy: instead of representing each speaker by a single point, x1, in the speaker factor space, represent each speaker by a distribution, specified by a mean vector mu and a precision matrix lambda. |
---|
0:38:42 | The i-vectors are then generated by sampling speaker factors from this distribution. I put 'speaker factors' in inverted commas because the speaker factors vary from one recording to another, just as the channel factors do, but the mechanism by which they are generated is quite different, as we'll see in a moment. |
---|
0:39:04 | The trick is to choose the prior on the mean and precision matrix of each speaker in such a way that mu and lambda are not statistically independent, because what you want is a precision matrix for each speaker which varies with the location of the speaker's mean vector. |
---|
0:39:28 | And of course, once you set this up, you're immediately going to run into problems: you do not want to be doing point estimation of the precision matrix if you only have one or two observations of the speaker. You have to follow the rules of probability consistently and integrate over the prior, and the way to do that, of course, is with variational Bayes. |
---|
0:39:56 | Okay, so here is how it goes. There seems to be only one natural prior on precision matrices, namely the Wishart prior. |
---|
0:40:08 | I won't talk about this; I've just put it down there so that if you're interested you'll be able to recognise that it's a generalisation of the gamma distribution: if you take the dimension equal to one, it reduces to the gamma distribution, and in higher dimensions it's concentrated on positive definite matrices. |
---|
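For reference, the Wishart density over a d-by-d positive definite precision matrix Lambda, with scale matrix W and n degrees of freedom, is, up to its normalising constant,

$$ \mathcal{W}(\Lambda \mid W, n) \;\propto\; |\Lambda|^{(n-d-1)/2}\exp\!\left(-\tfrac{1}{2}\,\mathrm{tr}\!\left(W^{-1}\Lambda\right)\right), $$

and for d = 1 it reduces to a gamma density.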
0:40:30 | um |
---|
0:40:32 | there is a parameter call the the number of degrees of freedom again |
---|
0:40:35 | okay that |
---|
0:40:36 | so determines how P |
---|
0:40:38 | uh this uh distribution is |
---|
0:40:41 | uh also |
---|
0:40:42 | this point i think is worth mentioning there's no loss of generality in assuming that W |
---|
0:40:47 | which would matrix here |
---|
0:40:48 | is it good to be identity |
---|
0:40:51 | The reason this is worth mentioning is that this turns out to correspond exactly to something that Najim does in his processing. If you're familiar with his work, you know that he estimates a WCCN matrix in the speaker space and then whitens the data with that matrix before evaluating the cosine distance. |
---|
0:41:24 | Okay, the first step, then: we have generated the precision matrix for the speaker; the next step is to generate the mean vector for the speaker, and you do that using a Student's t distribution. |
---|
0:41:39 | Once you have the precision matrix, that's all you need: if you just add in the gamma distribution, you can sample the mean vector according to a Student's t distribution. I explain in the paper why you need to use the Student's t distribution. |
---|
0:41:59 | The point I would just like to draw your attention to at this stage is that because the distribution of mu depends on lambda, the conditional distribution of lambda depends on mu. |
---|
0:42:14 | So that means that the precision matrix for a speaker depends on the location of the speaker in the speaker factor space, which means that you have some hope of modelling this directional scatter. |
---|
0:42:35 | I'll skip that and go to the graphical model. |
---|
0:42:42 | I think it's clear from this; remember, when you're confronted with something like this, that everything inside the plate is replicated for each of the recordings of the speaker, and everything outside of the plate is done once per speaker. |
---|
0:42:58 | So the first step is to generate the precision matrix. You then generate the mean for the speaker by sampling from a Student's t distribution; I call the hidden scale factor w, and the parameters of the gamma distribution alpha and beta. |
---|
0:43:16 | Once you have the mean and the precision matrix, you generate the speaker factors for each recording of the speaker (remember we're making the speaker factors depend on the recording) by sampling from another Student's t distribution. |
---|
0:43:34 | The interesting thing is that these three parameters, alpha, beta and the number of degrees of freedom, determine whether or not this distribution is going to exhibit directional scatter. |
---|
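A minimal sketch of the generative process described by this graphical model, using the same scale-mixture constructions as before. The dimension, the hyperparameter values and the use of SciPy's Wishart sampler are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(0)
k = 120                                              # speaker factor dimension (illustrative)
n_w, alpha, beta, n_x = k + 2, 3.0, 3.0, 5.0         # illustrative hyperparameters

def sample_speaker_factors(n_recordings):
    # 1. precision matrix for the speaker, drawn once per speaker (Wishart, scale matrix = identity)
    Lam = wishart.rvs(df=n_w, scale=np.eye(k) / n_w)
    cov = np.linalg.inv(Lam)
    # 2. speaker mean: Student's t via a hidden scale factor w ~ Gamma(alpha, beta)
    w = rng.gamma(alpha, 1.0 / beta)
    mu = np.linalg.cholesky(cov / w) @ rng.standard_normal(k)
    # 3. speaker factors, one set per recording, again Student's t around mu
    factors = []
    for _ in range(n_recordings):
        u = rng.gamma(n_x / 2.0, 2.0 / n_x)
        factors.append(mu + np.linalg.cholesky(cov / u) @ rng.standard_normal(k))
    return np.array(factors)
```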
0:43:51 | Okay, sorry, this can't be explained without a little calculation. Remember lambda is the precision matrix, so lambda inverse is the covariance matrix, and what I'm comparing here is the distribution of the covariance matrix given the speaker dependent parameters and the prior distribution of the covariance. |
---|
0:44:16 | You see that what you have is a weighted average of the prior expectation and another term. Now, this second term here depends on the speaker's mean; it's a rank-one covariance matrix, so the only variability that it allows is in the direction of the mean vector, which is exactly what the doctor ordered for directional scatter. |
---|
0:44:48 | I'd draw your attention to the fact that this term here is multiplied by this, so it depends on the number of degrees of freedom and on this random scale factor. So the extent of the directional scattering is going to depend on the behaviour of this scale factor. |
---|
0:45:17 | It depends, in fact, on the parameters which govern the distribution of the random scale factor w. If w has a large mean and a small variance, you can say that this term boosts the variability in the direction of the mean vector, so in that case directional scatter would be present to a large extent for most speakers in the data. |
---|
0:45:50 | On the other hand, there's another limiting case where you can show that the model reduces to heavy-tailed PLDA and there's no directional scattering at all. |
---|
0:46:00 | So the key question would be to see how this model trains; to be frank, that is going to take a couple of months, so I don't have any results to report yet. |
---|
0:46:13 | Okay, so in conclusion. Gaussian PLDA is an effective model for speaker recognition, and it's just joint factor analysis with i-vectors as features. My experience has been that it works better than traditional joint factor analysis, even though the basic assumptions are open to question. |
---|
0:46:36 | okay |
---|
0:46:37 | variational bayes |
---|
0:46:39 | allows you to go a long way |
---|
0:46:41 | in relaxing these assumptions you can model outliers by adding these |
---|
0:46:45 | hidden |
---|
0:46:46 | variables |
---|
0:46:47 | you can model directional scattering by having |
---|
0:46:50 | these variables |
---|
0:46:54 | The derivation of the variational Bayes update formulas is mechanical. I'm not saying it's always easy, but it is mechanical, and it comes with EM-like convergence guarantees, so that you have some hope of debugging your implementation. |
---|
0:47:15 | One caveat is that in practice you have to stay inside the exponential family in order to make this work; I can come back to that later. |
---|
0:47:23 | I'm also personally of the opinion that in order to get the full benefit of these methods we need what are called informative priors, that is to say, prior distributions on the hidden variables whose parameters can be learned (I use that word because 'estimated' isn't really appropriate here) from large training sets. |
---|
0:47:48 | The example is that all of the hidden variables that I've just described are controlled by a handful of scalar degrees of freedom, and these can all be estimated, using the evidence criterion, from training data. |
---|
0:48:09 | Now, to sum up: the advantage of probabilistic methods is that you have a logically coherent way of reasoning in the face of uncertainty; the disadvantage is that it takes time and effort to master the techniques and to program them. |
---|
0:48:29 | if your principal concern is to get a good system up and running quickly, i would recommend something like cosine distance scoring
---|
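A minimal sketch of cosine distance scoring between two i-vectors; in practice channel compensation (for example LDA or WCCN) is normally applied before the length normalisation, but that step is omitted here, and the 400-dimensional toy vectors are purely illustrative:

```python
import numpy as np

def cosine_score(enroll_ivector, test_ivector):
    """Cosine distance score: the inner product of the two length-normalised i-vectors."""
    e = enroll_ivector / np.linalg.norm(enroll_ivector)
    t = test_ivector / np.linalg.norm(test_ivector)
    return float(e @ t)

# toy usage: accept the trial if the score exceeds a threshold set on development data
rng = np.random.default_rng(1)
enroll, test = rng.standard_normal(400), rng.standard_normal(400)
print(cosine_score(enroll, test) > 0.1)
```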
0:48:42 | on the other hand, if you're interested in mastering this family of methods, i think there are really only three things you need to look at
---|
0:48:51 | okay, there's the original paper by prince and elder on probabilistic linear discriminant analysis in face recognition; that's the gaussian case
---|
0:49:04 | everything you need to know about variational bayes is in bishop's book, which i highly recommend; it's very well written and it starts from first principles
---|
0:49:15 | and the third is this paper; i don't believe the paper has actually found its way into the proceedings, but it is available online
---|
0:49:28 | okay, thank you very much
---|
0:49:43 | right, this is the question period
---|
0:49:54 | no |
---|
0:49:56 | yeah |
---|
0:49:57 | but it |
---|
0:50:01 | no |
---|
0:50:02 | and of course, thanks for the presentation, which was very illuminating
---|
0:50:07 | you encouraged us, as you said: if you want a quick solution you can do it that way
---|
0:50:15 | if you want a more principled solution |
---|
0:50:18 | but the point i just want to note is that your algorithm is based on a point estimate
---|
0:50:28 | so you have a speech utterance, you use your factor analysis to summarise it as an i-vector, you completely ignore the uncertainty of that estimation process, and then from that point on you say we should keep track of the uncertainty
---|
0:50:44 | so how do you justify that? it's an entirely empirical decision
---|
0:50:48 | based on the effectiveness of cosine distance scoring
---|
0:50:53 | it just works really well; attempts, so far at least, to incorporate the uncertainty in the i-vector estimation procedure don't seem to help, they complicate life
---|
0:51:08 | it's really empirical rather than dictated by the model
---|
0:51:26 | one question regarding the results you presented: one category was the conversation sides and one was the ten-second data
---|
0:51:37 | when you were training your i-vector setup, and when you did the scoring, did you use the ten-second data?
---|
0:51:56 | well the best results were obtained without score normalisation |
---|
0:52:00 | okay, so there was no question of introducing a cohort; perhaps your question is, in the gaussian case, should we have used that?
---|
0:52:10 | oh no |
---|
0:52:10 | a what you need |
---|
0:52:11 | yeah |
---|
0:52:12 | to me |
---|
0:52:13 | distribution |
---|
0:52:14 | i |
---|
0:52:15 | right so |
---|
0:52:16 | you see i |
---|
0:52:17 | yeah |
---|
0:52:19 | when you open |
---|
0:52:20 | we estimate |
---|
0:52:22 | you do |
---|
0:52:23 | these |
---|
0:52:23 | particular i picked |
---|
0:52:24 | right |
---|
0:52:25 | maybe |
---|
0:52:25 | oh |
---|
0:52:27 | and second |
---|
0:52:30 | but my experience has been, and this is not entirely black and white, that it's better not to use the ten-second data
---|
0:52:37 | right |
---|
0:52:39 | uh |
---|
0:52:40 | in that case, an interesting aspect of i-vectors is that they perform very well on the ten-second test segments
---|
0:52:52 | okay |
---|
0:52:53 | in other words the estimation procedure for i-vectors is much less sensitive to short durations than relevance map is
---|
0:53:11 | i have one question about the impact of the assumptions: you assume that some of the latent variables somehow exhibit gaussian behaviour
---|
0:53:25 | is there a way, i mean a nonparametric way, to relax these assumptions?
---|
0:53:32 | so i think i was careful to use student's t distributions everywhere rather than gaussians, and it's that which gives me the flexibility to model outliers and directional scattering
---|
0:53:44 | does that answer your question?
---|
0:53:46 | yeah, you essentially used it to model some heavy tails, but is a parametric form required at all?
---|
0:53:55 | variational bayes does require that, and in fact there's an extra restriction: you have to stay inside the exponential family, unfortunately
---|
0:54:07 | the art consists in achieving what you want to do subject to those constraints
---|
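To make the exponential-family remark concrete: with conjugate choices the variational posterior of a hidden variable keeps the same functional form as its prior, so the updates stay in closed form. A minimal sketch under that assumption, for a hidden precision scale factor with a Gamma prior under a Gaussian likelihood; all symbols and values are illustrative:

```python
import numpy as np

def gamma_scale_update(x, mean, precision, a0, b0):
    """Closed-form posterior for a hidden precision scale u with Gamma(a0, b0) prior.

    Assuming x | u ~ N(mean, (u * precision)^-1) for a d-dimensional x, conjugacy
    gives u | x ~ Gamma(a0 + d/2, b0 + q/2), where q is the quadratic form
    (x - mean)' precision (x - mean); because the posterior stays in the same
    exponential family as the prior, the expectations needed by the next
    variational update are available in closed form.
    """
    diff = np.asarray(x) - np.asarray(mean)
    q = float(diff @ precision @ diff)
    a = a0 + 0.5 * len(diff)
    b = b0 + 0.5 * q
    return a, b, a / b   # posterior shape, rate, and posterior mean of u

# toy usage with a 3-dimensional observation
print(gamma_scale_update([2.0, -1.0, 0.5], np.zeros(3), np.eye(3), a0=2.0, b0=2.0))
```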
0:54:15 | is that an adequate response?
---|
0:54:18 | yeah |
---|
0:54:34 | about the degrees of freedom: how did you set them, was any manual tuning involved?
---|
0:54:48 | well, in fact we used the evidence criterion, which is exactly the same criterion for estimating the numbers of degrees of freedom as we did for estimating the eigenvoices and the eigenchannels
---|
0:55:02 | so it's completely consistent; there was no manual tuning
---|
0:55:07 | thank you |
---|
0:55:21 | so |
---|
0:55:22 | there was a question |
---|
0:55:23 | let me think |
---|
0:55:24 | but okay |
---|
0:55:32 | because |
---|