0:00:13 | Hi, and today I'll be talking about our work on the search of music pitch contours. |
0:00:21 | First, I'm going to start with a short overview of content-based music search, especially melody-based music search. |
0:00:29 | First, we need to define what we mean by "main melody". The exact musicological definition is probably a lot more complicated, but we're just going to use one that's accepted in the music information retrieval community: |
0:00:42 | the single monophonic pitch sequence that a listener might reproduce if asked to whistle or hum a piece of polyphonic music, and that a listener would recognize as being the essence of that music when heard in comparison. |
0:00:56 | Based on this definition of main melody, we're going to loosen it even further and just assume that melody means the dominant pitch contour, which in turn we're going to assume means the dominant fundamental frequency contour. |
0:01:10 | Of course, all three of these concepts are actually subtly different, not only melody versus pitch but also pitch versus fundamental frequency, but here we're just going to loosely assume that the fundamental frequency, under some notion of dominance, defines the main melody. |
0:01:31 | We know that searching music based on melody is a very interesting problem that people have worked on for a long time; query-by-humming is a very well-known application. In doing this task, there are different methods of representing melodies for music search. |
0:01:48 | The most well-known is note transcriptions, or note interval transcriptions. If you have some kind of melody, as in this example, you can represent it as a sequence of notes, such as C4, D4, E4, E3, G3, or just model the transitions between notes, such as up, down, repeat. |
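The interval representation described here can be sketched in a few lines. The note names and the up/down/repeat alphabet follow the example above; the helper names are my own, not from the talk.

```python
# Hypothetical sketch of a note-interval transcription: reduce a note
# sequence to its transitions (up / down / repeat), as described above.

NOTE_OFFSETS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

def note_to_midi(name):
    """Convert a note name like 'C4' to a MIDI number (C4 = 60)."""
    pitch, octave = name[:-1], int(name[-1])
    return 12 * (octave + 1) + NOTE_OFFSETS[pitch]

def contour(notes):
    """Reduce successive note transitions to 'up', 'down', or 'repeat'."""
    midi = [note_to_midi(n) for n in notes]
    steps = []
    for prev, cur in zip(midi, midi[1:]):
        steps.append("up" if cur > prev else "down" if cur < prev else "repeat")
    return steps

print(contour(["C4", "D4", "E4", "E4", "G3"]))  # ['up', 'up', 'repeat', 'down']
```

Note that the interval form discards absolute pitch entirely, which is one reason it is robust to queries sung in a different key.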
0:02:11 | Using these transcriptions, people have been quite successful at query-by-humming. Another method is to just use the "continuous", or frame-based, pitch contour. Note that I put "continuous" in quotes because it's not really continuous: it's still a sampled digital signal. Nevertheless, we call it continuous because it pretty much resembles a smooth curve. |
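As a toy illustration of what a frame-based contour is (this is not the talk's extraction method; the melody, hop size, and function names are assumptions): the F0 is simply sampled at a fixed hop, and comparisons are usually done on its log-frequency (semitone) version.

```python
import math

# Illustrative sketch only: a frame-based F0 contour is the fundamental
# frequency sampled at a fixed hop, here every 10 ms over a toy melody.

def f0_contour(note_freqs, note_dur=0.25, hop=0.010):
    """Sample a stepwise toy melody into a frame-based F0 contour (Hz)."""
    n_frames = round(len(note_freqs) * note_dur / hop)
    return [note_freqs[min(int(i * hop // note_dur), len(note_freqs) - 1)]
            for i in range(n_frames)]

def to_semitones(contour, ref=440.0):
    """Log-frequency view: semitones relative to A4, where a key change is additive."""
    return [12 * math.log2(f / ref) for f in contour]

c = f0_contour([261.63, 293.66, 329.63])  # C4, D4, E4, each held for 250 ms
print(len(c))  # 75 frames for 0.75 s of audio at a 10 ms hop
```

This also makes the data-volume point below concrete: three notes become 75 frame values at this hop.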
0:02:35 | To give an example: at the top you can see a log-frequency spectrum, which is actually a small segment of a Beatles song, and from it you can extract some kind of dominant F0 contour, which you can see at the bottom. Again you have a time axis and a frequency axis, and you basically just plot the evolution of the dominant F0. This is what we're going to call the continuous F0 contour. |
0:03:08 | I'll get to the reason why this is interesting in a bit, but first let's see how we would actually do a melody-based music search using it. |
0:03:16 | This is just an example: you have a query melody, which you can see here, and then a much longer target music piece, sitting in some kind of music database. You can see that there's one small segment of that target that very closely resembles the query. The objective of music search is, given this query, to match it up with that part of the target. |
0:03:41 | Already you can see a number of problems you have to solve here. For example, the length of the query here is about seventeen seconds, but the corresponding part of the target is maybe a little over ten seconds, so the target has a higher tempo: it's being sung faster than the query, and you have to be able to adjust for that difference in tempo. There can also be differences in musical key; maybe the query is sung at a higher or lower key compared to the target. |
0:04:11 | So when you do a music search, first of all you have to be able to search all possible starting locations in each target, because people may not sing starting from the beginning of a song; they may start in the middle. You want to be able to adjust for differences in speed or tempo, and for differences in key. People may also have inconsistencies in rhythm and pitch within their singing. |
0:04:41 | Dynamic time warping is a very popular technique for comparing two different pitch sequences in music search, and there have also been state-of-the-art fingerprinting or hashing techniques that allow very efficient and effective search using note transcription data like the example I showed you, where you have a sequence of notes or a sequence of transitions. |
0:05:07 | It has, however, also been suggested that instead of using these note transcriptions, using the continuous pitch contours we just saw can work better. The primary argument for this is that note transcriptions can only be reliably obtained from monophonic music. With polyphonic music, the note transcription very easily breaks down: if there are errors in the pitch transcription of polyphonic music, the note transcription can compound those errors. So people have suggested just working on the original continuous pitch contours. |
0:05:39 | Of course, that's a problem, because the amount of data is so much larger compared to just using the notes, so it's very computationally expensive. In our previous example, if you had maybe five notes, then with the note transcription your input data is just five values; with the continuous pitch contour it can be hundreds and hundreds of values, if you're sampling every few milliseconds. |
0:06:05 | So, in order to use these continuous pitch contours while still doing music search efficiently, we previously proposed a method of indexing melody fragments using a set of what we call time-normalized, key-normalized wavelet coefficients, and storing these coefficients in a tree so they can be searched efficiently. There was a lot of mathematical machinery that went into this that I won't get into here. |
0:06:32 | The problem with this method is that it compares whole melody fragments too rigidly: it just uses a simple Euclidean distance between melody fragments, so it cannot really deal with changes in the query's rhythm, although it does allow for differences in tempo. |
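Stripped of the wavelet indexing, the idea of that earlier method can be sketched as follows. Time normalization resamples a fragment to a fixed length, key normalization removes the mean log-frequency, and fragments are then compared by a plain Euclidean distance. The function names and resampling length are my own assumptions, and the tree-structured coefficient index is omitted.

```python
# Hedged sketch of rigid fragment comparison: resample (time-normalize),
# remove mean log-frequency (key-normalize), then Euclidean distance.
# This tolerates tempo and key changes but not rhythmic warping.

def resample(fragment, n=32):
    """Time normalization: linear interpolation onto n evenly spaced points."""
    out = []
    for i in range(n):
        x = i * (len(fragment) - 1) / (n - 1)
        lo = int(x)
        hi = min(lo + 1, len(fragment) - 1)
        out.append(fragment[lo] + (x - lo) * (fragment[hi] - fragment[lo]))
    return out

def key_normalize(fragment):
    """Key normalization: a key shift is an additive offset in log frequency."""
    mean = sum(fragment) / len(fragment)
    return [v - mean for v in fragment]

def fragment_distance(a, b, n=32):
    a, b = key_normalize(resample(a, n)), key_normalize(resample(b, n))
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

q = [60, 60, 62, 64, 64, 62, 60]          # fragment in semitones
t = resample([v + 5 for v in q], 50)      # transposed copy at a slower tempo
print(fragment_distance(q, t) < 0.5)      # True: key and tempo are normalized away
```

What this cannot absorb is a local rhythmic change inside the fragment, which is exactly why some form of dynamic time warping is needed.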
0:06:48 | To really do this problem properly, we have to do some kind of dynamic time warping. |
0:06:55 | So let's take a look at exactly how that dynamic time warping can be formulated. Assume some query sequence Q, which is just a sequence of pitch samples, and a target sequence P. |
0:07:11 | The classic DTW equation is what you see here: you have a number of warping operations, and you set up your total cost as a summation of costs along the warping path, where the path is defined by the warping functions phi_Q and phi_P. This is all very traditional, from speech recognition and other pattern recognition texts. |
0:07:44 | Here we just define the distance between a query value and a target pitch value as the Euclidean distance. Note that we also have an extra parameter called b, which represents the difference in musical key between the query and the target. |
0:08:04 | In calculating the distance you have to take into account that the query can be sung at a different key from the target, so we can't just subtract the two values; you have to add some kind of offset. The offset is additive because the frequencies are log frequencies, so a key shift in the log-frequency domain just becomes a constant bias. So b is a bias factor modeling the difference in musical key. |
0:08:32 | We can of course constrain this bias to remain roughly constant, which does make sense: if you just let b change for every single pitch value, it wouldn't make any sense at all. You have to assume that the query and the target may be in different keys relative to each other, but that each maintains its own key internally. If I sing higher than John does, I still maintain my own musical key, even though I chose a different one, and I'm free to choose whatever key I want. So, as you can see, we replace the bias with a single constant value b. |
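The brute-force formulation just described can be sketched as below: a standard DTW recursion on log-frequency (semitone) values, repeated for every candidate constant bias b. The step pattern, the semitone units, and the candidate grid of biases are my simplifying assumptions; on scalar pitch values the absolute difference stands in for the Euclidean distance. The point is that the work grows with |Q| times |P| times |B|.

```python
# Hedged sketch of DTW with a constant key bias b added to the query,
# tried over every candidate b (the three-dimensional brute-force search).

def dtw_cost(q, p, b):
    """DTW between query q and target p, with key offset b applied to q."""
    INF = float("inf")
    n, m = len(q), len(p)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(q[i - 1] + b - p[j - 1])  # local cost with key bias
            D[i][j] = d + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def best_match(q, p, biases):
    """Try every candidate bias: this is the O(|q| * |p| * |B|) search."""
    return min((dtw_cost(q, p, b), b) for b in biases)

q = [0, 2, 4, 4, 2, 0]        # query, in semitones
p = [5, 7, 7, 9, 9, 9, 7, 5]  # same shape sung 5 semitones higher and slower
cost, b = best_match(q, p, range(-12, 13))
print(b)  # 5
```

Here DTW absorbs the tempo difference and the bias search recovers the transposition, at the cost of rerunning the whole recursion 25 times.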
0:09:10 | But even in this case, when you actually try to do the dynamic programming to minimize this whole equation, you basically have to compute costs over a three-dimensional space: you have your query axis and your target axis, and then you also have to try every single possible difference in musical key. So you have to compute the costs in a Q times P times B space, and that cost is just too high. The original study that tried to use musical pitch contours actually did exactly that. |
0:09:44 | To address this concern, people have proposed different approaches. What we did to speed up the DTW was to partition the query into segments, treating each segment as a rhythmically consistent unit. This idea of partitioning queries or melodies into segments is not really new; people have done it in one form or another. |
0:10:09 | What we did here, though, involves a number of mathematical issues that you have to handle, in terms of how the bias can change and things like that. These are algorithmic details that are in the paper, so I won't really get into them here. What really matters is that we are dividing up the query into segments, and this drastically reduces the amount of computation; we propose a framework for doing that. |
0:10:39 | This can actually be solved quite elegantly using a very classical level-building approach, with just some subtle differences. It's similar to connected word recognition, the classical technique that you may have seen in the Rabiner and Juang speech recognition book, where you build up levels and each level represents one word segment. |
0:11:04 | Here we have a query that's divided up into different segments, but we also introduce what we call buffer zones between the query segments. These are employed to allow some target segments to overlap, and that provides much greater flexibility in the DTW: if you don't allow this overlap, you introduce too much rigidity into the system. All of this can be done quite elegantly with a level-building scheme. |
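A much-simplified sketch of the level-building idea: one level per query segment, where each level runs a DTW whose entry costs come from the previous level's end costs, and a small buffer lets the next segment begin a few frames before the previous one ended (the overlap mentioned above). The step pattern, the uniform segmentation, and the omission of the key bias are all simplifying assumptions, not the paper's algorithm.

```python
# Hedged sketch of segmented DTW via level building. Level 0 may start at
# any target frame (all starting locations); each later level starts where
# the previous one ended, minus a small overlap buffer.

def segmented_dtw(query, target, n_segments=2, buffer=1):
    INF = float("inf")
    m = len(target)
    seg_len = len(query) // n_segments
    segments = [query[k * seg_len:(k + 1) * seg_len] for k in range(n_segments - 1)]
    segments.append(query[(n_segments - 1) * seg_len:])  # last takes the remainder

    # level_cost[j]: best cost of the levels so far, ending at target frame j.
    level_cost = [0.0] * (m + 1)
    first = True
    for seg in segments:
        if first:
            entry = [0.0] * (m + 1)  # first segment: free start anywhere
            first = False
        else:
            # buffer zone: next segment may start up to `buffer` frames early
            entry = [min(level_cost[j:j + buffer + 1]) for j in range(m + 1)]
        D = [entry] + [[INF] * (m + 1) for _ in seg]
        for i in range(1, len(seg) + 1):
            for j in range(1, m + 1):
                d = abs(seg[i - 1] - target[j - 1])
                D[i][j] = d + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
        level_cost = D[len(seg)]
    return min(level_cost)  # best match may also end anywhere in the target

# The query matches a stretch in the middle of a longer target.
query = [0, 2, 4, 5, 4, 2, 0, -1]
target = [9, 9, 9] + query + [9, 9, 9]
print(segmented_dtw(query, target))  # 0.0
```

Each level costs on the order of one segment length times the target length, which is where the O(nP) behavior reported below comes from.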
0:11:34 | We actually implemented this and ran it on the MIREX 2006 test set. These are the means and standard deviations of the average search time per query that you could achieve using these techniques. |
0:11:53 | What you see at the bottom are the asymptotic search costs associated with each type of search technique. The first is our original work, where we just rigidly compared melody fragments using a simple Euclidean distance; there, each comparison can be done in constant time, because it's just a Euclidean distance. |
0:12:21 | Multi-scale query windowing is just a variation of that, and its search cost is asymptotically O(n), where n is the number of query segments. The segmented DTW is our proposed method, for which the search cost is O(nP), where again n is the number of query segments and P is the length of your target signal. |
0:12:44 | And the DTW is the brute-force DTW that directly minimizes the original optimal equation; for that, the search cost is O(QPB), since you have to compute over all three dimensions, so its asymptotic cost is the greatest. |
0:12:58 | In the actual experiments you can see that, as expected, the multi-scale query windowing is the fastest. Our proposed segmented DTW is much slower, but at 0.64 seconds per query that isn't a very bad figure. We then estimated the amount of time it would take if we used the brute-force optimal DTW, which, as you can see, takes much, much longer than any of these methods. |
0:13:34 | In terms of search performance, we found that the segmented DTW obviously gives the best results, because it does actual dynamic time warping between queries and targets, and can handle rhythmic inconsistencies and all the other variations that the multi-scale query windowing and the previous methods can't. The MRR we got was 0.838. |
0:14:03 | This is actually much lower than the state of the art, which is higher than 0.9. However, in that particular work there was a constraint that all queries started at the beginning of melodies, or of melodic phrases. In our case we searched every possible location in all the songs in the database, so our search space is much larger and there's much more room for error. |
0:14:32 | So, just to be on an equal footing with the state of the art, we also tried constraining all queries to start at the beginning of melodies, and that gave us an MRR of 0.921, which is almost the same as the state of the art. |
0:14:46 | We also tried an experiment on a polyphonic music set, which we created using commercial pop song recordings, so these are actual acoustic polyphonic music recordings from which we did some kind of automatic F0 extraction, and then we ran the test on that. Obviously the results are going to be much lower, because now we're using polyphonic music and there will be lots of errors in the pitch transcriptions, but again, our proposed method was still able to give the best performance. |
0:15:19 | Finally, this is just showing the relation between the number of query segments used versus the MRR and the search time. The more segments you have, the closer you get to the optimal DTW, so the MRR is going to go up, but at the same time your search time also grows. That's pretty much it for the overview of this work. Are there any questions? |
0:15:52 | [Audience question, partly inaudible:] Is there a more efficient way of finding the bias than trying every possible b? Trying every possible b makes the computation large; could you perhaps use a smaller resolution? |
0:16:20 | Right, so if you're going to try every possible b, then by definition you're trying every possible b. You could try using some kind of heuristic to reduce the search space of the b's you try, so that instead of trying all possible values you try a subset. In our work we just did a simplification: we used a single optimal b estimated from only the first query segment, and that's how we were able to eliminate that whole search dimension. |
0:16:52 | But there are probably other heuristics that you could apply there. |
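The simplification mentioned in the answer, estimating one bias from the first query segment instead of searching all of them, could be sketched as below. The median estimator and the function name are my own illustrative assumptions; the talk does not specify how the optimal b was computed.

```python
# Hedged sketch: estimate a single key bias b from the first query segment
# (here, the median pitch difference against the candidate target region),
# then reuse that b for the rest of the match instead of searching over b.

def estimate_bias(query_seg, target_seg):
    """Median log-frequency offset between the first query segment and the target."""
    diffs = sorted(t - q for q, t in zip(query_seg, target_seg))
    mid = len(diffs) // 2
    return diffs[mid] if len(diffs) % 2 else 0.5 * (diffs[mid - 1] + diffs[mid])

q_seg = [0, 2, 4, 4, 2]  # first query segment, in semitones
t_seg = [5, 7, 9, 9, 7]  # the same phrase sung 5 semitones higher
print(estimate_bias(q_seg, t_seg))  # 5
```

A median is one plausible choice here because it is robust to a few wrongly transcribed frames, which matters for contours extracted from polyphonic audio.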