Speech transcript - POINT PROCESS MCMC FOR SEQUENTIAL MUSIC TRANSCRIPTION

0:00:15	mean
0:00:15	and
0:00:16	i'm started to just one right but it's it's not simon
0:00:19	i
0:00:20	um
0:00:21	So I'm going to talk about the music transcription work from my master's project last year.
0:00:28	Just to go through what we mean by transcription: a musical signal might look
0:00:33	a bit like this.
0:00:34	This is just a time-domain signal. It's going to be roughly periodic, and it's
0:00:38	going to have a whole set of
0:00:40	sinusoidal
0:00:42	components, each with a different time-varying amplitude.
0:00:46	But that's not how we perceive music.
0:00:48	What we
0:00:51	think of is a set of notes,
0:00:53	and then some higher-level properties
0:00:57	such as the expression and the timbre of the instrument.
0:01:01	So what we'd like is a system that can take something like this and turn it into something like
0:01:06	this.
0:01:07	Now that's quite an ambitious thing to do in one step,
0:01:09	so
0:01:10	we're going to aim for an intermediate result,
0:01:12	something like this.
0:01:14	This is a piano roll.
0:01:15	We've got
0:01:19	the pitch of the notes up the side here and time along the bottom, and the lines indicating which
0:01:23	notes are present.
0:01:24	and this is from them and you work silence
0:01:26	and just on a single byte modeling
0:01:30	um
0:01:30	So what I'm going to do is talk about a
0:01:34	sequential
0:01:35	framework for
0:01:37	doing this note estimation,
0:01:39	and then talk about the models
0:01:41	that we're using. We've got a likelihood model using some point processes, and
0:01:46	then some simple dynamic models for the note evolution.
0:01:49	And then I'll talk about an MCMC scheme and show some results.
0:01:52	So first of all, music is
0:01:55	a continuous
0:01:56	signal, and
0:01:58	we want to use
0:02:01	a frequency-domain model, so the first thing we're going to do is chop it up into frames, and I
0:02:05	will reference the frames with this subscript t here.
0:02:09	Then for each frame we'd like to estimate the set of notes present, which we'll
0:02:14	call theta_t, given the data that we've got for that frame, which we'll call y_t.
0:02:18	The way we can do this is by looking at this joint posterior of the
0:02:23	notes in the current frame and the previous frame. You'll recognise this from the previous talk; it's
0:02:28	the same.
0:02:30	So
0:02:31	we can expand this into three terms: a likelihood, a transition density,
0:02:36	and then this
0:02:40	posterior term from the previous
0:02:42	processing step.
0:02:44	And integrating this
0:02:46	theta_{t-1} out then
0:02:48	just gives a marginal of that.
0:02:50	So if we've got
0:02:51	a particle approximation for the previous frame, we can substitute that in.
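Written out, the factorisation described here could look as follows. The symbols $\theta_t$ (notes in frame $t$) and $y_t$ (data for frame $t$) follow the talk; the particle count $N$ and the particle notation $\theta_{t-1}^{(i)}$ are assumed for illustration, since the slide itself is not reproduced in the transcript.

```latex
\begin{align}
p(\theta_t, \theta_{t-1} \mid y_{1:t})
  &\propto p(y_t \mid \theta_t)\, p(\theta_t \mid \theta_{t-1})\, p(\theta_{t-1} \mid y_{1:t-1}), \\
p(\theta_{t-1} \mid y_{1:t-1})
  &\approx \frac{1}{N} \sum_{i=1}^{N} \delta_{\theta_{t-1}^{(i)}}(\theta_{t-1}).
\end{align}
```

Marginalising $\theta_{t-1}$ out of the first line gives the filtering posterior $p(\theta_t \mid y_{1:t})$ for the current frame.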
0:02:56	So let's look at the models
0:02:59	that we're using for the likelihood.
0:03:03	I mentioned we're going to use a frequency-domain model, so this is just the Fourier
0:03:08	transform of one of the frames.
0:03:10	You can see that what we're interested in is
0:03:12	this set of peaks here,
0:03:14	and there's a lot of redundant information down here in the noise level.
0:03:17	So the first thing we're going to do straight away
0:03:19	is run a peak detection algorithm.
0:03:22	This is very simple: we're just looking at the first-order difference
0:03:26	and then applying a median threshold to it.
0:03:29	And so we reduce
0:03:30	the spectrum down to just this set of red-circled peaks.
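A minimal sketch of this kind of peak picking, assuming a magnitude spectrum held in a NumPy array. The function name `detect_peaks`, the `threshold_factor` parameter and the use of a single global median are assumptions for illustration; the talk only says that a first-order difference and a median threshold are used.

```python
import numpy as np

def detect_peaks(magnitude, threshold_factor=5.0):
    """Return indices of local maxima that rise above a median-based threshold.

    threshold_factor is a hypothetical tuning parameter, not taken from the talk.
    """
    magnitude = np.asarray(magnitude, dtype=float)
    diff = np.diff(magnitude)  # first-order difference
    # Bin i+1 is a local maximum if the slope into it is positive
    # and the slope out of it is zero or negative.
    is_local_max = (diff[:-1] > 0) & (diff[1:] <= 0)
    candidates = np.nonzero(is_local_max)[0] + 1
    # Keep only candidates well above the noise floor (global median here).
    threshold = threshold_factor * np.median(magnitude)
    return candidates[magnitude[candidates] > threshold]
```

The returned bin indices play the role of the red-circled peaks; everything below the threshold is discarded as noise floor.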
0:03:35	um
0:03:36	Now,
0:03:37	what we'd like to model is both the frequency and the amplitude.
0:03:40	It turns out that the amplitude of these peaks
0:03:43	is dependent on an awful lot of factors,
0:03:44	including the instrument that's playing
0:03:46	and the
0:03:48	recording environment,
0:03:49	and most of them vary over time.
0:03:51	So
0:03:53	putting together simple and robust models
0:03:55	is difficult, so what we're going to start off with is just looking at a model for the
0:03:59	frequencies of the set of peaks.
0:04:04	and
0:04:06	say
0:04:08	If we have one note playing in a
0:04:12	frame, then what we
0:04:14	characteristically see is a peak at some fundamental frequency. That's
0:04:18	the lowest peak, and we'll call it the fundamental.
0:04:21	Then we see a set of peaks
0:04:23	at the overtones, or partial frequencies,
0:04:25	which is the set up here, and they're approximately integer multiples of the fundamental.
0:04:31	But we don't always get
0:04:32	a peak in all these locations; there's one missing here.
0:04:35	And we don't know how many of them
0:04:37	there'll be,
0:04:38	how high we have to go up.
0:04:40	and
0:04:40	In addition we're going to get some clutter peaks, and these are going to be due to
0:04:46	noise or transient effects, which we're not really modelling,
0:04:49	or other non-musical
0:04:51	sounds in the recording.
0:04:54	So if we have a lot of
0:04:56	notes
0:04:57	present in the frame,
0:04:58	we end up with a horrible data association issue, where we'd like to link every peak we've identified
0:05:03	to one of the notes present or
0:05:05	a clutter process.
0:05:08	That gives us some horrible scaling in complexity as we increase the number of notes or the number
0:05:12	of overtones.
0:05:14	So we can get around this by
0:05:16	using a Poisson
0:05:19	process assumption about the generation of peaks in the spectrum.
0:05:25	We assume that,
0:05:27	for each overtone, the peaks are generated in the spectrum according to a Poisson process,
0:05:33	and we can construct an intensity function for this
0:05:37	Poisson process which has a maximum at the expected frequency of the
0:05:42	overtone.
0:05:44	Now this is quite a significant assumption,
0:05:48	given that really we only
0:05:49	expect to see no peak
0:05:50	or one peak, or maybe in some rare cases some spurious extra peak.
0:05:55	With this assumption we're going to have a Poisson distribution on the number of peaks at
0:05:59	each overtone.
0:06:00	and so
0:06:02	That's the bad thing. The good thing is that,
0:06:05	because of the union property of Poisson processes, we can just add the intensity functions
0:06:09	for each overtone
0:06:11	together to give us
0:06:12	an intensity
0:06:13	function like this for the whole note, as a Poisson process.
0:06:17	You can see we've constructed this
0:06:20	with a
0:06:21	very
0:06:22	narrow, large component at the fundamental,
0:06:25	showing that we're pretty certain there's going to be a peak there and we're quite sure
0:06:29	what frequency it will be at.
0:06:31	And we've got some smaller components at higher frequencies, where we're less sure exactly what frequency the
0:06:36	peak will occur at.
0:06:39	um
0:06:39	Then if we have more than one note present, again we can just add these intensity functions together
0:06:45	for all the different notes,
0:06:46	and that gives us
0:06:47	a Poisson process for all the peaks in the whole spectrum.
0:06:51	So here's
0:06:53	the maths.
0:06:56	We've been using a Gaussian mixture model to construct these note
0:07:00	intensity
0:07:01	functions,
0:07:03	and then we're just
0:07:04	adding them together to give the entire frame
0:07:06	intensity function,
0:07:08	and then adding on a little bit extra, uniformly,
0:07:11	to account for the clutter peaks, so
0:07:13	the
0:07:14	clutter Poisson process.
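The construction described here can be sketched roughly as below. All names (`note_intensity`, `frame_intensity`, `weights`, `widths`, `clutter_rate`) and the numerical values in the usage note are assumptions for illustration, and partials are placed at exact integer multiples of the fundamental for simplicity.

```python
import numpy as np

def note_intensity(freqs, f0, n_partials, weights, widths):
    """Gaussian-mixture intensity for one note, evaluated at freqs (Hz)."""
    freqs = np.asarray(freqs, dtype=float)
    intensity = np.zeros_like(freqs)
    for k in range(1, n_partials + 1):
        centre = k * f0  # expected frequency of the k-th partial
        sigma = widths[k - 1]
        gauss = np.exp(-0.5 * ((freqs - centre) / sigma) ** 2)
        gauss /= sigma * np.sqrt(2.0 * np.pi)
        # The weight is the expected number of peaks from this partial.
        intensity += weights[k - 1] * gauss
    return intensity

def frame_intensity(freqs, notes, clutter_rate):
    """Add the note intensities together, plus a uniform clutter intensity."""
    total = np.full(np.shape(freqs), clutter_rate, dtype=float)
    for note in notes:
        total += note_intensity(freqs, **note)
    return total
```

For example, `frame_intensity(np.arange(0.0, 2001.0), [dict(f0=220.0, n_partials=3, weights=[0.9, 0.6, 0.4], widths=[2.0, 4.0, 6.0])], clutter_rate=1e-4)` gives a narrow, tall component at the fundamental and broader, smaller ones at the higher partials, sitting on a flat clutter floor.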
0:07:16	Once we've got this, we
0:07:18	can write down a likelihood
0:07:20	expression for the
0:07:23	frame.
0:07:27	Integrating the intensity function over each bin of the fast Fourier transform gives us the
0:07:32	expectation
0:07:34	for the presence of a peak in that bin.
0:07:37	Then we can just
0:07:39	take a likelihood like this
0:07:42	for each bin and multiply them all together to give the whole frame likelihood.
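One way the bin-wise likelihood could be sketched is shown below. The per-bin form is an assumption, since the slide formula is not in the transcript: each bin carries a binary peak indicator, and under a Poisson process with expected count Lambda in the bin, the probability of no peak is exp(-Lambda) and of at least one peak is 1 - exp(-Lambda). The names `frame_log_likelihood`, `bin_edges` and `intensity_fn` are hypothetical.

```python
import numpy as np

def frame_log_likelihood(peak_in_bin, bin_edges, intensity_fn, n_grid=21):
    """Log-likelihood of binary peak indicators under a Poisson process."""
    peak_in_bin = np.asarray(peak_in_bin, dtype=bool)
    log_lik = 0.0
    for k, has_peak in enumerate(peak_in_bin):
        # Expected number of peaks in bin k: integrate the intensity
        # over the bin with the trapezoidal rule.
        grid = np.linspace(bin_edges[k], bin_edges[k + 1], n_grid)
        values = intensity_fn(grid)
        expected = 0.5 * np.sum((values[1:] + values[:-1]) * np.diff(grid))
        if has_peak:
            log_lik += np.log(-np.expm1(-expected))  # log(1 - exp(-expected))
        else:
            log_lik -= expected                      # log(exp(-expected))
    return log_lik
```

Working in logs turns the product over bins into a sum, which avoids numerical underflow when a frame has many bins.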
0:07:48	and
0:07:48	Now, I said that the
0:07:50	overtones occur approximately at integer multiples of the fundamental.
0:07:56	It turns out that,
0:07:58	especially for stringed instruments,
0:08:00	they tend to spread out at high frequencies.
0:08:03	So we've been using a model of this inharmonicity, according to this formula,
0:08:09	and this introduces another parameter that we have to estimate, which is this B here, which is
0:08:14	an inharmonicity parameter for each note.
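The formula on the slide is not captured in the transcript. As an illustration only, a commonly used stiff-string model places the k-th partial at k * f0 * sqrt(1 + B * k^2), with B the inharmonicity parameter; whether this is exactly the formula shown in the talk is an assumption, and the function name is hypothetical.

```python
import numpy as np

def partial_frequencies(f0, n_partials, inharmonicity):
    """Partial frequencies under an assumed stiff-string inharmonicity model."""
    k = np.arange(1, n_partials + 1)
    # inharmonicity = 0 recovers exact integer multiples of the fundamental;
    # a positive value stretches the higher partials upwards.
    return k * f0 * np.sqrt(1.0 + inharmonicity * k ** 2)
```

With a small positive B the low partials stay close to their harmonic positions and the deviation grows with the partial number, which matches the spreading at high frequencies described above.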
0:08:18	um
0:08:19	so
0:08:20	The things we have to estimate are now adding up.
0:08:22	That's the theta that we had earlier.
0:08:25	If we
0:08:26	take this theta, the set of parameters we need to estimate, we've got the number of
0:08:30	notes and then, for each note, a fundamental frequency,
0:08:33	the number of partials and the inharmonicity.
0:08:39	Moving on to the
0:08:40	transition density, we've been using some very simple models so far, and they're based on
0:08:46	two
0:08:47	quite basic observations. Firstly, if a note is present in one frame then it's likely that it
0:08:53	is also
0:08:54	present in the next frame.
0:08:56	Secondly, whatever the number of notes present in one frame, it's likely you
0:09:00	will have the same number of notes in the next frame.
0:09:02	Obviously there are far higher levels of modelling that we could do here, looking at
0:09:07	how the number of
0:09:09	partial frequencies changes between frames,
0:09:11	as you'd expect that to decay,
0:09:14	and also
0:09:15	modelling note onsets and offsets.
0:09:21	um
0:09:22	But we've now got everything we need to do some inference.
0:09:27	This is the thing we're trying to estimate, remember, and we've defined a model for the likelihood,
0:09:30	that's the Poisson model, and we've got some simple models for the
0:09:33	transition
0:09:34	density.
0:09:35	So now we can use the MCMC particle filter algorithm, which
0:09:40	was covered in the last talk,
0:09:43	to
0:09:44	estimate this joint density.
0:09:46	um
0:09:48	now
0:09:50	The problem is that
0:09:51	if we've got a large number of notes then we've got
0:09:53	a lot of parameters,
0:09:55	about three for each note, remember,
0:09:58	which means if we try and change all of them at once we end up with very low acceptance
0:10:01	rates in our
0:10:02	Markov chain.
0:10:06	um
0:10:08	The way to get around this: we're going to have two sorts of move. We can have
0:10:12	moves where we only change the
0:10:14	current frame parameters,
0:10:15	and then others where we try to change both the previous frame and the current frame parameters.
0:10:20	For the current frame
0:10:21	it's nice and easy: we can just use Metropolis-within-Gibbs and choose to change some
0:10:26	subset of the parameters at once. We'll just change
0:10:28	the three parameters associated with one note
0:10:31	in each step.
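A toy sketch of a Metropolis-within-Gibbs sweep of this kind, updating the parameter block of one randomly chosen note per step. The Gaussian random-walk proposal, the `step` size and the function names are assumptions for illustration; `log_target` stands in for the log of the likelihood times the transition density, which is not specified in enough detail here to reproduce.

```python
import numpy as np

def metropolis_within_gibbs(log_target, theta0, n_iter, step=0.5, seed=0):
    """Random-walk Metropolis that changes one note's parameters per step.

    theta0 has shape (n_notes, n_params); returns the chain of all states.
    """
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    current = log_target(theta)
    chain = np.empty((n_iter,) + theta.shape)
    for it in range(n_iter):
        j = rng.integers(theta.shape[0])  # pick one note at random
        proposal = theta.copy()
        proposal[j] += step * rng.standard_normal(theta.shape[1])
        candidate = log_target(proposal)
        # Symmetric proposal, so the acceptance ratio is the target ratio.
        if np.log(rng.uniform()) < candidate - current:
            theta, current = proposal, candidate
        chain[it] = theta
    return chain
```

Because only a few parameters change in each step, the proposed state stays close to the current one, which is what keeps the acceptance rate reasonable when many notes are present.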
0:10:33	um
0:10:34	The joint moves get a little more complex.
0:10:36	What we'd like to do is
0:10:38	sample theta_{t-1} from the
0:10:42	posterior distribution from the previous frame
0:10:45	and then propose the current frame parameters from some proposal.
0:10:49	The problem here is that
0:10:51	when we do the sampling we will be changing all of the theta_{t-1} parameters in one go,
0:10:55	and again that gives us very low acceptance rates.
0:10:59	uh say
0:11:01	A solution to this is to take the particle distribution and sort of collapse it onto
0:11:06	a single
0:11:06	univariate histogram for all the different possible notes that we have in the previous frame.
0:11:12	Then we use this
0:11:13	as an approximation for the
0:11:14	marginal
0:11:16	distribution of
0:11:18	each note, and then assume
0:11:21	independence within
0:11:22	theta_{t-1}. This means that we can sample
0:11:25	the theta_{t-1} parameters
0:11:28	one note at a time,
0:11:31	and that again gives
0:11:33	acceptable
0:11:34	acceptance rates.
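A rough sketch of the collapsing step, under the simplifying assumption that each particle for the previous frame is just a set of discrete note labels (MIDI numbers are used as a hypothetical labelling; the talk's particles also carry continuous parameters). The function names are illustrative.

```python
from collections import Counter
import random

def note_presence_histogram(particles):
    """Fraction of previous-frame particles in which each note appears."""
    counts = Counter(note for particle in particles for note in set(particle))
    n = len(particles)
    return {note: count / n for note, count in counts.items()}

def propose_previous_note(current_notes, note, presence, rng=random):
    """Resample the presence of a single note, leaving the others unchanged."""
    proposal = set(current_notes)
    # Treat notes as independent: draw this note from its marginal alone.
    if rng.random() < presence.get(note, 0.0):
        proposal.add(note)
    else:
        proposal.discard(note)
    return proposal
```

Since only one note of the previous frame changes per proposal, the move is small. In a full sampler the proposal probability would also have to enter the acceptance ratio, which this sketch leaves out.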
0:11:36	Finally, we want to estimate the number of notes present in each frame,
0:11:39	and that can
0:11:40	be done very nicely just by putting the whole thing into a reversible jump
0:11:44	formulation.
0:11:47	So let's look at some results.
0:11:49	This is the output from a couple of Markov chains. This is a simple
0:11:53	case where we've just got one note,
0:11:54	and we're not using reversible jump; we're fixing the number of notes at one.
0:12:00	You can see that it
0:12:02	picks up the correct note
0:12:04	on the first iteration, in fact.
0:12:06	um
0:12:08	and the other a from just on the tree can you a green i think but it
0:12:11	so takes about twenty frames segments
0:12:14	and
0:12:15	Then here on the right we've got a three-note case, and we're doing reversible jump MCMC now,
0:12:19	so estimating the number of notes there,
0:12:22	and that's
0:12:25	pretty much correct.
0:12:28	With the frequencies, we see it fixes
0:12:30	two of the notes very quickly and then it struggles to choose between three possibilities here.
0:12:35	These three cases are in fact
0:12:37	spaced
0:12:38	an octave apart, and the reason there's some confusion there is because the three notes have
0:12:42	much the same sets of overtones,
0:12:44	of partial frequencies.
0:12:46	um
0:12:48	Finally, just
0:12:49	a few results.
0:12:49	This is
0:12:51	a simple sort of
0:12:53	test piece.
0:12:54	It's just three chords, each of three notes.
0:12:57	We've got time along the bottom here and then the frequency of the notes present up the
0:13:02	side.
0:13:03	The blue dots
0:13:06	are the estimates, and we can see it's picked up
0:13:08	all the notes quite nicely here.
0:13:10	There's
0:13:12	just one dropping out here, as it decays at the end of the note.
0:13:19	There are errors here at the beginning of each note,
0:13:21	and these are
0:13:22	caused by transient effects at the beginning of the note, which we're not modelling at the moment.
0:13:28	Then finally we tried it on some real music.
0:13:31	So this is the piece,
0:13:34	and you can see it picks up these bass notes
0:13:37	quite nicely.
0:13:40	On the treble notes it's
0:13:42	doing a bad job up here; there's a lot of false alarms going on, and
0:13:46	again that's due to transient effects at the beginning of each note, which we're not modelling
0:13:52	well.
0:13:54	and sorry the i've a late or just the the ground
0:13:59	So just to recap:
0:14:01	there's the Poisson
0:14:02	point process model, which we're using as a likelihood for each
0:14:07	frame given the notes, and some simple dynamic models
0:14:12	for the evolution of the notes over time.
0:14:14	These allow us to do sequential inference using the MCMC particle filter algorithm
0:14:19	to find the number of notes in each frame
0:14:22	and estimates of their
0:14:23	frequency parameters.
0:14:26	and
0:14:28	There's lots of ways we could extend this. I mentioned earlier that we'd like
0:14:31	to look at peak amplitudes, although that's hard,
0:14:33	and we might get better performance if we looked at them,
0:14:37	and also at the phase, which we haven't looked at at all,
0:14:41	and also by looking at more complex
0:14:44	dynamical
0:14:45	models.
0:14:46	But given the simplicity of the models, it seems to be working quite well for now.
0:15:00	um
0:15:12	It's quite a long way off real time.
0:15:15	it's a
0:15:24	We haven't been aiming to get it real-time.
0:15:27	oh
0:15:39	uh yes or something
0:15:42	and
0:15:43	so
0:15:47	and i was able to look through what looks simple peak detection we is possible to find a the of
0:15:51	features would like to more spectrum to simple peter good score but
0:15:55	your term record more as you are used to but spike to hear more just can be limited to from
0:16:00	do
0:16:00	spurt maybe or
0:16:02	you you such teams still
0:16:04	so a i i i a by the your we use to do are go to be to measurements just
0:16:08	peak detection like and the peach that you're detecting great
0:16:12	and number uh uh is better to four two peaks for example of after some some doing to get to
0:16:16	do to use like more still
0:16:18	sparks some stuff you that was a some smoothing moving that's
0:16:23	but it's doing a
0:16:24	sure that for
0:16:25	all source or they're them
0:16:27	i was to the this to the the errors and the of real hard we can what because of to
0:16:33	use from minimum or maybe
0:16:35	yeah that's that a trade between if you give a pretty much averaging in that it news the sum of
0:16:39	the different now or site it's got just a very sure on the which seems
0:17:21	and i think another silence do is that something along
0:17:49	in in

POINT PROCESS MCMC FOR SEQUENTIAL MUSIC TRANSCRIPTION

Particle Filtering for High Dimensional Problems

Speaker: Pete Bunch, Authors: Pete Bunch, Simon J. Godsill, University of Cambridge, United Kingdom