Speech transcript - AN ACOUSTICALLY-MOTIVATED SPATIAL PRIOR FOR UNDER-DETERMINED REVERBERANT SOURCE SEPARATION

0:00:15	a
0:00:17	hello everyone, I'm Ngoc Duong
0:00:19	I would like to present
0:00:21	our paper, co-authored with Emmanuel Vincent and Rémi Gribonval, from INRIA Rennes,
0:00:27	France,
0:00:29	titled "An acoustically-motivated spatial prior for under-determined reverberant source separation"
0:00:36	in this session we have listened to the first two presentations, which mainly focused on
0:00:43	modeling the source spectra, like NMF or GMM, as we saw before
0:00:48	so I'd like to emphasize that our work here focuses on
0:00:54	the spatial model,
0:00:56	that is, on modeling the source spatial position
0:01:01	so here is the outline of the presentation: first I would like to briefly address the source separation
0:01:06	problem, followed by the general local Gaussian modeling framework for source separation
0:01:12	then we move to the main contribution of the work, that is, designing a new acoustically motivated spatial
0:01:19	prior
0:01:20	and deriving a maximum a posteriori (MAP) estimation algorithm to enhance the source
0:01:26	separation performance
0:01:28	and finally I show some experimental results and the conclusion
0:01:33	okay uh
0:01:35	here we are considering the source separation problem, where we use a multichannel signal,
0:01:42	denoted by x(t),
0:01:44	to separate the sources
0:01:46	s_j(t),
0:01:47	and where the number of sensors is smaller than the number of sources, so it is the
0:01:52	under-determined case
0:01:54	okay, and in this picture
0:01:57	we
0:01:57	denote
0:01:58	by c_j(t) the contribution of source j alone to the microphone array; c_j(t) is
0:02:05	called the source image,
0:02:07	which is related to the original source by a mixing process
0:02:11	characterized by the mixing filters,
0:02:15	which model the acoustic propagation from the source to the microphones
0:02:21	and in the cocktail-party situation, in which there are several
0:02:25	sources,
0:02:26	we have x(t)
0:02:27	is the sum
0:02:28	of the c_j(t)
0:02:30	okay, that's the mixing model
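Editor's sketch of the mixing model just described, assuming NumPy. The signal lengths, the number of channels and sources, and the random decaying filters are all made up for illustration; the talk only states that each source image is the source passed through its mixing filters and that the mixture is the sum of the source images.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, T, L = 2, 3, 1000, 64  # channels, sources, samples, filter length (illustrative values)

s = rng.standard_normal((J, T))  # toy source signals s_j(t)
# Toy mixing filters with an exponential decay (a stand-in for real room responses)
h = rng.standard_normal((J, I, L)) * np.exp(-np.arange(L) / 10.0)

# Source image c_j(t): source j convolved with its mixing filter on every channel
c = np.stack([
    np.stack([np.convolve(s[j], h[j, i])[:T] for i in range(I)])
    for j in range(J)
])  # shape (J, I, T)

# Mixture x(t): sum of all source images, shape (I, T)
x = c.sum(axis=0)
```

With I = 2 microphones and J = 3 sources this is an under-determined mixture in the sense used in the talk: fewer sensors than sources.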
0:02:33	so most state-of-the-art approaches for under-determined source separation operate in
0:02:39	the frequency domain,
0:02:40	where the convolution in the time domain is approximated by a complex-valued multiplication
0:02:47	in this domain, which is a simpler form
0:02:51	and they also rely on the sparsity assumption, where only a few sources
0:02:56	are assumed to be active at
0:02:58	each time-frequency point
0:03:00	for instance, a very popular approach consists of two steps:
0:03:06	the estimation of the mixing vectors
0:03:11	and then
0:03:12	the estimation of the source
0:03:14	coefficients s_j
0:03:16	or we have, for instance, binary masking,
0:03:20	where only one source is assumed to be active at each time-frequency point
0:03:25	but these techniques remain limited in realistic reverberant
0:03:30	environments,
0:03:31	since the narrowband approximation does not hold
0:03:36	so in our work
0:03:37	we rely on a different framework,
0:03:41	where the STFT coefficient vector of the source image c_j(n,f)
0:03:46	is modeled as a zero-mean Gaussian random variable
0:03:49	so c_j(n,f) is modeled as a Gaussian with zero mean
0:03:54	and covariance matrix Sigma_j(n,f)
0:03:57	and we further factorize this matrix Sigma_j
0:04:00	into two parameters, v_j and R_j,
0:04:03	where v_j is the scalar source variance, which encodes the spectral power of the source,
0:04:10	so that models the source spectral content,
0:04:14	and R_j
0:04:16	is the spatial covariance matrix, which
0:04:18	encodes
0:04:19	the spatial position of the source
0:04:22	okay, and we are focusing more on the modeling of R_j
0:04:30	so state-of-the-art approaches relying on the narrowband approximation result
0:04:37	in a rank-1 R_j, since R_j
0:04:39	is the product of two mixing vectors
0:04:43	but in our work
0:04:44	we proposed full-rank matrices R_j, where the coefficients of R_j
0:04:51	are not deterministically related
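Editor's note: the model described above can be written out as follows. The symbols are assumed notation (n for the time frame, f for the frequency bin, I for the number of channels, h_j(f) for the narrowband mixing vector); the talk names these quantities, but the slide notation is not preserved in the transcript.

```latex
% Local Gaussian model of the source image of source j
\mathbf{c}_j(n,f) \sim \mathcal{N}_c\big(\mathbf{0},\, \boldsymbol{\Sigma}_j(n,f)\big),
\qquad
\boldsymbol{\Sigma}_j(n,f) = v_j(n,f)\,\mathbf{R}_j(f)

% Narrowband approximation: the spatial covariance has rank 1
\mathbf{R}_j(f) = \mathbf{h}_j(f)\,\mathbf{h}_j(f)^{H}

% Full-rank alternative: R_j(f) is a general Hermitian positive definite
% I x I matrix whose entries are not tied together by a single vector h_j(f)
```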
0:04:54	okay so is no such fall rises
0:04:58	so given the local Gaussian modeling framework and the parameterization, the source separation architecture consists of four
0:05:05	steps: first we need the STFT
0:05:07	to transform the mixture signal into the time-frequency domain
0:05:11	then the model parameters, here the source variances and the spatial covariance matrices, are estimated
0:05:17	and then the source coefficients are recovered by Wiener filtering, a kind
0:05:23	of soft masking, and then we reconstruct the time-domain signal
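Editor's sketch of the Wiener filtering step at a single time-frequency point, assuming NumPy. The function name `multichannel_wiener`, the array shapes, and the random test data are hypothetical; the sketch only illustrates the standard multichannel Wiener estimate under the Gaussian model with covariance v_j R_j, not the authors' implementation.

```python
import numpy as np

def multichannel_wiener(x, v, R):
    """Estimate source images at one time-frequency point.

    x: (I,) complex mixture coefficients
    v: (J,) source variances
    R: (J, I, I) spatial covariance matrices
    Returns (J, I) estimated source images.
    """
    Sigma = v[:, None, None] * R        # covariance of each source image
    Sigma_x = Sigma.sum(axis=0)         # covariance of the mixture
    W = Sigma @ np.linalg.inv(Sigma_x)  # one I x I Wiener gain per source
    return W @ x

# Illustrative data: 2 channels, 3 sources, random Hermitian positive definite R_j
rng = np.random.default_rng(0)
I, J = 2, 3
A = rng.standard_normal((J, I, I)) + 1j * rng.standard_normal((J, I, I))
R = A @ A.conj().transpose(0, 2, 1) + 1e-3 * np.eye(I)
v = rng.uniform(0.5, 2.0, size=J)
x = rng.standard_normal(I) + 1j * rng.standard_normal(I)

c_hat = multichannel_wiener(x, v, R)
```

Because the gains of all sources sum to the identity matrix, the estimated source images add up to the observed mixture, which is why the talk calls this a kind of soft masking.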
0:05:28	so from now on we focus on the estimation of the model parameters,
0:05:33	the v_j and the R_j,
0:05:35	defined here
0:05:38	okay and uh
0:05:40	here is the main contribution of the paper: the
0:05:44	acoustically motivated spatial prior
0:05:46	so the reason is that there are some situations
0:05:50	where the geometric setting can be known
0:05:53	this scenario can occur, for instance, in a car,
0:05:57	where the position of the driver is fixed,
0:06:00	or in a formal meeting, where the position of the
0:06:03	delegates is fixed,
0:06:05	for instance, or in broadcasting, where we know exactly
0:06:09	the positions of the sources and the room acoustics
0:06:13	so given this known geometric setting, we can exploit this knowledge about the source
0:06:19	positions and the room characteristics
0:06:22	to enhance the source separation performance
0:06:25	that's the motivation for the work
0:06:28	and here we use some knowledge from the theory of statistical
0:06:31	room acoustics
0:06:33	so
0:06:34	it assumes that the direct path and the reverberation are un-
0:06:38	correlated
0:06:39	and the reverberation is diffuse
0:06:42	this means that the echoes can come from all possible directions
0:06:46	so
0:06:48	with that theory, in order to know the mean of the spatial covariance matrix
0:06:53	we need to know the contribution
0:06:56	of the direct part,
0:06:57	which is defined here, and the covariance of the reverberant part
0:07:01	and all these parameters
0:07:03	can be computed directly
0:07:06	given the geometric setting
0:07:08	so for lack of time I will not present here how they can be computed, but
0:07:13	you can refer to the paper
0:07:15	so uh okay uh
0:07:17	so again, given the
0:07:22	geometric setting, we can compute the mean of the spatial covariance matrices
0:07:27	and given this mean, we define the inverse-Wishart prior over the spatial
0:07:34	covariance matrices
0:07:35	so
0:07:35	R_j follows the inverse-Wishart distribution
0:07:39	with the mean
0:07:41	given here, which is computed from the theory of statistical room acoustics,
0:07:46	and a variance which is governed by the parameter m
0:07:51	it's called the degree of freedom
0:07:53	it can be learned from training data in the maximum likelihood sense
0:07:57	okay, I will not present the learning process here
0:08:01	the reason we choose the inverse-Wishart here is that it is the conjugate prior for the
0:08:06	Gaussian likelihood
0:08:08	so we get the update in closed form later on
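Editor's note: one common way to write a complex inverse-Wishart prior on an I x I spatial covariance matrix is given below, up to a normalizing constant. This parameterization, the scale matrix symbol Psi_j(f), and the symbol m for the degree of freedom are assumptions, since the transcript does not preserve the slide formulas; the paper should be consulted for the exact form used by the authors.

```latex
% Inverse-Wishart prior on the I x I spatial covariance (assumed parameterization)
p\big(\mathbf{R}_j(f)\big) \;\propto\;
  \big|\mathbf{R}_j(f)\big|^{-(m+I)}
  \exp\!\Big(-\operatorname{tr}\big(\boldsymbol{\Psi}_j(f)\,\mathbf{R}_j(f)^{-1}\big)\Big)

% Mean of the prior, finite for m > I; choosing Psi_j(f) as (m - I) times the
% mean predicted by statistical room acoustics centers the prior on that mean
\mathbb{E}\big[\mathbf{R}_j(f)\big] = \frac{\boldsymbol{\Psi}_j(f)}{m - I}
```

A larger m concentrates the prior more tightly around its mean, which matches the remark in the talk that m governs the variance.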
0:08:14	okay so uh
0:08:16	now our goal is to estimate the parameters
0:08:21	and we use the expectation-maximization (EM) algorithm for this purpose,
0:08:28	where
0:08:28	in the E-step
0:08:29	we estimate the empirical covariance matrices
0:08:33	of the source images here
0:08:36	by this equation, where W_j is simply the multichannel
0:08:41	Wiener filter
0:08:43	and in the M-step, for the MAP algorithm,
0:08:48	we update the parameters
0:08:50	so you can see that v_j and R_j are iteratively updated
0:08:55	in the M-step
0:08:58	and if you look at this equation for the R_j update, you can see that
0:09:03	here is the contribution of the likelihood
0:09:05	and this part comes from the contribution of the prior
0:09:09	that we have defined
0:09:10	and gamma is the
0:09:13	trade-off parameter, which determines the contribution of the prior
0:09:17	and if you want to estimate the parameters in the maximum likelihood sense, simply set
0:09:22	gamma to zero
0:09:24	so we come
0:09:26	back to the maximum likelihood estimate
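Editor's sketch of an M-step of the kind described above, assuming NumPy. The update below is what one obtains by maximizing the Gaussian log-likelihood plus gamma times the log of an inverse-Wishart prior proportional to |R|^-(m+I) exp(-tr(Psi R^-1)); that prior form, the function names `map_update_R` and `update_v`, and the test data are assumptions, not formulas taken from the transcript.

```python
import numpy as np

def map_update_R(R_hat, v, Psi, m, gamma):
    """MAP update of one spatial covariance matrix at one frequency.

    R_hat: (N, I, I) empirical source-image covariances over N frames
    v:     (N,) current source variances
    Psi:   (I, I) scale matrix of the assumed inverse-Wishart prior
    m:     degree of freedom of the prior
    gamma: trade-off weight of the prior (0 gives the ML update)
    """
    N, I, _ = R_hat.shape
    likelihood_term = (R_hat / v[:, None, None]).sum(axis=0)
    return (gamma * Psi + likelihood_term) / (gamma * (m + I) + N)

def update_v(R_hat, R):
    """Variance update: normalized trace of R^{-1} R_hat for every frame."""
    I = R.shape[0]
    return np.real(np.trace(np.linalg.inv(R) @ R_hat, axis1=1, axis2=2)) / I

# Illustrative data: N frames of random Hermitian positive semi-definite matrices
rng = np.random.default_rng(0)
N, I, m = 50, 2, 5
A = rng.standard_normal((N, I, I)) + 1j * rng.standard_normal((N, I, I))
R_hat = A @ A.conj().transpose(0, 2, 1)
v = rng.uniform(0.5, 2.0, size=N)
Psi = 3.0 * np.eye(I)

R_ml = map_update_R(R_hat, v, Psi, m, gamma=0.0)    # prior switched off
R_map = map_update_R(R_hat, v, Psi, m, gamma=10.0)  # prior contributes
v_new = update_v(R_hat, R_map)
```

Setting gamma to zero removes the prior term and leaves the average of the variance-normalized empirical covariances, which is the maximum likelihood behaviour mentioned in the talk; a very large gamma pulls the estimate toward Psi / (m + I).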
0:09:29	okay, and now we have everything in hand,
0:09:33	so let's see some experiments
0:09:37	so we compare the source separation performance of the approach proposed
0:09:43	in the paper, using
0:09:45	the MAP parameter estimation, with
0:09:49	the maximum likelihood algorithm with two initializations:
0:09:54	the first one is that we don't know anything about the geometry,
0:09:58	so R_j is blindly initialized,
0:10:02	and the second one is that R_j is initialized from the same geometric
0:10:07	setting,
0:10:08	so we have a fair comparison
0:10:11	we show that if we know something about the geometric setting beforehand,
0:10:15	we can improve the source separation
0:10:18	and we also compare the source separation with the baseline, binary masking,
0:10:22	and with the case where R_j is fixed
0:10:24	to the value
0:10:27	that
0:10:28	is computed from the geometric setting
0:10:32	as in the formula before
0:10:35	and here a some up how a need to die
0:10:38	speech and sampling rate number of yeah the works and
0:10:41	yeah
0:10:43	and here are the final results, as the average
0:10:47	in terms of signal-to-distortion ratio (SDR), which measures the overall distortion
0:10:53	and uh and uh
0:10:55	we compare this separation results the over at feast or on which are a four sources
0:11:00	uh with you where you here
0:11:02	and a microphone spacing of five centimeters
0:11:05	and we
0:11:07	compute separation results with different reverberation times, ranging from
0:11:12	a very low one, fifty milliseconds, up to highly reverberant, five hundred
0:11:20	and i
0:11:20	use that
0:11:21	rule i
0:11:22	is the result given by our proposed algorithm, with the prior information
0:11:28	you
0:11:29	and you can see that
0:11:31	the proposed algorithm outperforms the maximum likelihood algorithm and the baseline
0:11:37	approach
0:11:38	in all reverberant
0:11:40	settings
0:11:43	okay for instance uh
0:11:44	guess that will uh
0:11:46	sam
0:11:58	all
0:11:59	okay
0:12:07	okay or maybe this
0:12:08	is that in
0:12:21	okay alright right gig can uh
0:12:23	V
0:12:23	so uh
0:12:24	you can see that for the example at a reverberation time of about two hundred and fifty,
0:12:29	which is a moderate reverberation time,
0:12:30	our proposed algorithm, where we have some prior knowledge about
0:12:35	the setting, can enhance the source separation performance
0:12:39	that's yeah
0:12:40	go back to uh an ad at which and
0:12:44	okay he's
0:12:45	whose and
0:12:46	in our work we proposed an acoustically motivated spatial prior
0:12:52	which is
0:12:53	derived from the theory
0:12:55	of statistical room acoustics
0:12:57	and we derived a maximum a posteriori algorithm which
0:13:01	addresses both the estimation of the model parameters
0:13:06	and the permutation problem
0:13:09	I'd like to emphasize this one, because given the known geometric setting,
0:13:13	with the MAP algorithm we do not suffer from the well-known permutation problem
0:13:18	in frequency-domain source separation
0:13:21	and importantly we so with that to prove but was
0:13:24	with the help of a
0:13:26	yes
0:13:27	but at this point we still need to know many parameters, like the
0:13:33	source positions and the reverberation time, to compute the mean of
0:13:38	the spatial covariance matrices
0:13:40	so future work can be devoted
0:13:42	to fully blind source separation, by estimating all the acoustic parameters
0:13:47	yeah
0:13:50	okay, that's the end of my talk, thank you
0:14:00	we have time for
0:14:01	some questions
0:14:13	Q for the presentation my name's is of some of the in T D
0:14:17	on the a ha how do cut it does
0:14:19	speech are right are in the I
0:14:21	right
0:14:23	in in the one yeah
0:14:24	speech of right yeah how do you
0:14:26	cricket
0:14:28	uh
0:14:29	okay
0:14:31	so this is the spatial prior, and so
0:14:34	the distribution is given, so
0:14:37	what we need
0:14:38	to know is the mean
0:14:40	and the variance
0:14:42	he's the if i did i i
0:14:44	so for the mean,
0:14:47	we can compute it directly from the geometric setting
0:14:51	for example, if we know the distance from the source to the microphone,
0:14:55	we can compute the direct part from the source to the microphone
0:15:00	and we also use the assumption that the reverberation
0:15:03	is diffuse
0:15:04	so uh
0:15:05	that the yep that's
0:15:06	school i see all politically state the main are a
0:15:09	i i i i and that's um
0:15:11	so my question needs
0:15:12	if the the the uh
0:15:14	right yeah yeah
0:15:15	different from there
0:15:17	really you like
0:15:18	the the you
0:15:19	uh
0:15:20	different like so you could you oh well
0:15:23	a distance is before and
0:15:25	a how low a lot of things P
0:15:27	to you are loaded
0:15:29	yeah to be and is uh i i have been in to get
0:15:32	and so got C but that yeah it's a very good so
0:15:36	for future investigation
0:15:38	yeah
0:15:39	and actually, at this stage, in this study we tried to prove
0:15:44	that even some known geometric setting can improve the source separation performance
0:15:51	but we us
0:15:52	that's sky of was source that's said
0:15:54	or, as the next step, if you would like to estimate these parameters
0:15:59	blindly from the mixture
0:16:01	so at this time we do not have
0:16:03	such an evaluation
0:16:06	you
0:16:24	okay, yes. firstly, it is a very well-known algorithm, the binary
0:16:31	masking; we can see that it is used as the baseline approach
0:16:35	and actually in our previous work, with the same
0:16:39	a
0:16:40	with the same modeling framework, like this EM algorithm with maximum likelihood
0:16:45	presented in our previous paper, we also compared the performance
0:16:50	with some state-of-the-art algorithms
0:16:52	some be that the size
0:16:53	and
0:16:55	yeah
0:16:55	a using a would be nice more and it's is both but was uh approach outperformed performed sees at with
0:17:01	which is already compared
0:17:03	some as a of
0:17:05	i i i would not say one at a state of the but
0:17:07	this
0:17:08	some of and
0:17:10	as the baseline
0:17:17	that's the questions
0:17:25	then let's thank the speaker

AN ACOUSTICALLY-MOTIVATED SPATIAL PRIOR FOR UNDER-DETERMINED REVERBERANT SOURCE SEPARATION

Acoustic Source Separation

Speaker: Ngoc Duong, Authors: Ngoc Q. K. Duong, Emmanuel Vincent, Rémi Gribonval, INRIA / Centre de Rennes - Bretagne Atlantique, France