0:00:13 | Hi, and today I'll be talking about our work on the search of music pitch contours. |
0:00:21 | First, I'm going to start with a short overview of content-based music search, especially melody-based music search. |
0:00:29 | First, we need to define what we mean by "main melody". The exact musicological definition is probably a lot more complicated, but we're just going to use one that's accepted in the music information retrieval community: |
0:00:42 | the single monophonic pitch sequence that a listener might reproduce if asked to whistle or hum a piece of polyphonic music, and that a listener would recognize as being the essence of that music when heard in comparison. |
0:00:56 | Based on this definition of main melody, we're going to loosen it even further and just assume that melody means the dominant pitch contour, which in turn we're going to assume means the dominant fundamental frequency contour. |
0:01:10 | Of course, all three of these concepts are actually subtly different, not only melody versus pitch but also pitch versus fundamental frequency, but here we're just going to loosely assume that the fundamental frequency, under some notion of dominance, defines the main melody. |
0:01:31 | We know that searching music based on melody is a very interesting problem that people have worked on for a long time; query-by-humming is a very well-known application. In doing this task, there are different methods of representing melodies for music search. |
0:01:48 | The most well-known is note transcriptions, or note interval transcriptions. If you have some kind of melody, as in this example, you can represent it as a sequence of notes, such as C4, D4, E4, E3, G3, or just model the transitions between notes, such as up, down, repeat. |
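The interval representation described here can be sketched in a few lines. The note names and the up/down/repeat alphabet follow the example above; the helper names are my own, not from the talk.

```python
# Hypothetical sketch of a note-interval transcription: reduce a note
# sequence to its transitions (up / down / repeat), as described above.

NOTE_OFFSETS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

def note_to_midi(name):
    """Convert a note name like 'C4' to a MIDI number (C4 = 60)."""
    pitch, octave = name[:-1], int(name[-1])
    return 12 * (octave + 1) + NOTE_OFFSETS[pitch]

def contour(notes):
    """Reduce successive note transitions to 'up', 'down', or 'repeat'."""
    midi = [note_to_midi(n) for n in notes]
    steps = []
    for prev, cur in zip(midi, midi[1:]):
        steps.append("up" if cur > prev else "down" if cur < prev else "repeat")
    return steps

print(contour(["C4", "D4", "E4", "E4", "G3"]))  # ['up', 'up', 'repeat', 'down']
```

Note that the interval form discards absolute pitch entirely, which is one reason it is robust to queries sung in a different key.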
0:02:11 | Using these transcriptions, people have been quite successful at query-by-humming. Another method is to just use the "continuous", or frame-based, pitch contour. Note that I put "continuous" in quotes because it's not really continuous: it's still a sampled digital signal. Nevertheless, we call it continuous because it pretty much resembles a smooth curve. |
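As a toy illustration of what a frame-based contour is (this is not the talk's extraction method; the melody, hop size, and function names are assumptions): the F0 is simply sampled at a fixed hop, and comparisons are usually done on its log-frequency (semitone) version.

```python
import math

# Illustrative sketch only: a frame-based F0 contour is the fundamental
# frequency sampled at a fixed hop, here every 10 ms over a toy melody.

def f0_contour(note_freqs, note_dur=0.25, hop=0.010):
    """Sample a stepwise toy melody into a frame-based F0 contour (Hz)."""
    n_frames = round(len(note_freqs) * note_dur / hop)
    return [note_freqs[min(int(i * hop // note_dur), len(note_freqs) - 1)]
            for i in range(n_frames)]

def to_semitones(contour, ref=440.0):
    """Log-frequency view: semitones relative to A4, where a key change is additive."""
    return [12 * math.log2(f / ref) for f in contour]

c = f0_contour([261.63, 293.66, 329.63])  # C4, D4, E4, each held for 250 ms
print(len(c))  # 75 frames for 0.75 s of audio at a 10 ms hop
```

This also makes the data-volume point below concrete: three notes become 75 frame values at this hop.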
0:02:35 | To give an example: at the top you can see a log-frequency spectrum, which is actually a small segment of a Beatles song, and from it you can extract some kind of dominant F0 contour, which you can see at the bottom. Again you have a time axis and a frequency axis, and you basically just plot the evolution of the dominant F0. This is what we're going to call the continuous F0 contour. |
0:03:08 | I'll get to the reason why this is interesting in a bit, but first let's see how we would actually do a melody-based music search using it. |
0:03:16 | This is just an example: you have a query melody, which you can see here, and then a much longer target music piece, sitting in some kind of music database. You can see that there's one small segment of that target that very closely resembles the query. The objective of music search is, given this query, to match it up with that part of the target. |
0:03:41 | Already you can see a number of problems you have to solve here. For example, the length of the query here is about seventeen seconds, but the corresponding part of the target is maybe a little over ten seconds, so the target has a higher tempo: it's being sung faster than the query, and you have to be able to adjust for that difference in tempo. There can also be differences in musical key; maybe the query is sung at a higher or lower key compared to the target. |
0:04:11 | So when you do a music search, first of all you have to be able to search all possible starting locations in each target, because people may not sing starting from the beginning of a song; they may start in the middle. You want to be able to adjust for differences in speed or tempo, and for differences in key. People may also have inconsistencies in rhythm and pitch within their singing. |
0:04:41 | Dynamic time warping is a very popular technique for comparing two different pitch sequences in music search, and there have also been state-of-the-art fingerprinting or hashing techniques that allow very efficient and effective search using note transcription data like the example I showed you, where you have a sequence of notes or a sequence of transitions. |
0:05:07 | It has, however, also been suggested that instead of using these note transcriptions, using the continuous pitch contours we just saw can work better. The primary argument for this is that note transcriptions can only be reliably obtained from monophonic music. With polyphonic music, the note transcription very easily breaks down: if there are errors in the pitch transcription of polyphonic music, the note transcription can compound those errors. So people have suggested just working on the original continuous pitch contours. |
0:05:39 | Of course, that's a problem, because the amount of data is so much larger compared to just using the notes, so it's very computationally expensive. In our previous example, if you had maybe five notes, then with the note transcription your input data is just five values; with the continuous pitch contour it can be hundreds and hundreds of values, if you're sampling every few milliseconds. |
0:06:05 | So, in order to use these continuous pitch contours while still doing music search efficiently, we previously proposed a method of indexing melody fragments using a set of what we call time-normalized, key-normalized wavelet coefficients, and storing these coefficients in a tree so they can be searched efficiently. There was a lot of mathematical machinery that went into this that I won't get into here. |
0:06:32 | The problem with this method is that it compares whole melody fragments too rigidly: it just uses a simple Euclidean distance between melody fragments, so it cannot really deal with changes in the query's rhythm, although it does allow for differences in tempo. |
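Stripped of the wavelet indexing, the idea of that earlier method can be sketched as follows. Time normalization resamples a fragment to a fixed length, key normalization removes the mean log-frequency, and fragments are then compared by a plain Euclidean distance. The function names and resampling length are my own assumptions, and the tree-structured coefficient index is omitted.

```python
# Hedged sketch of rigid fragment comparison: resample (time-normalize),
# remove mean log-frequency (key-normalize), then Euclidean distance.
# This tolerates tempo and key changes but not rhythmic warping.

def resample(fragment, n=32):
    """Time normalization: linear interpolation onto n evenly spaced points."""
    out = []
    for i in range(n):
        x = i * (len(fragment) - 1) / (n - 1)
        lo = int(x)
        hi = min(lo + 1, len(fragment) - 1)
        out.append(fragment[lo] + (x - lo) * (fragment[hi] - fragment[lo]))
    return out

def key_normalize(fragment):
    """Key normalization: a key shift is an additive offset in log frequency."""
    mean = sum(fragment) / len(fragment)
    return [v - mean for v in fragment]

def fragment_distance(a, b, n=32):
    a, b = key_normalize(resample(a, n)), key_normalize(resample(b, n))
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

q = [60, 60, 62, 64, 64, 62, 60]          # fragment in semitones
t = resample([v + 5 for v in q], 50)      # transposed copy at a slower tempo
print(fragment_distance(q, t) < 0.5)      # True: key and tempo are normalized away
```

What this cannot absorb is a local rhythmic change inside the fragment, which is exactly why some form of dynamic time warping is needed.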
0:06:48 | To really do this problem properly, we have to do some kind of dynamic time warping. |
0:06:55 | So let's take a look at exactly how that dynamic time warping can be formulated. Assume some query sequence Q, which is just a sequence of pitch samples, and a target sequence P. |
0:07:11 | The classic DTW equation is what you see here: you have a number of warping operations, and you set up your total cost as a summation of costs along the warping path, where the path is defined by the warping functions phi_Q and phi_P. This is all very traditional, from speech recognition and other pattern recognition texts. |
0:07:44 | Here we just define the distance between a query value and a target pitch value as the Euclidean distance. Note that we also have an extra parameter called b, which represents the difference in musical key between the query and the target. |
0:08:04 | In calculating the distance you have to take into account that the query can be sung at a different key from the target, so we can't just subtract the two values; you have to add some kind of offset. The offset is additive because the frequencies are log frequencies, so a key shift in the log-frequency domain just becomes a constant bias. So b is a bias factor modeling the difference in musical key. |
0:08:32 | We can of course constrain this bias to remain roughly constant, which does make sense: if you just let b change for every single pitch value, it wouldn't make any sense at all. You have to assume that the query and the target may be in different keys relative to each other, but that each maintains its own key internally. If I sing higher than John does, I still maintain my own musical key, even though I chose a different one, and I'm free to choose whatever key I want. So, as you can see, we replace the bias with a single constant value b. |
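The brute-force formulation just described can be sketched as below: a standard DTW recursion on log-frequency (semitone) values, repeated for every candidate constant bias b. The step pattern, the semitone units, and the candidate grid of biases are my simplifying assumptions; on scalar pitch values the absolute difference stands in for the Euclidean distance. The point is that the work grows with |Q| times |P| times |B|.

```python
# Hedged sketch of DTW with a constant key bias b added to the query,
# tried over every candidate b (the three-dimensional brute-force search).

def dtw_cost(q, p, b):
    """DTW between query q and target p, with key offset b applied to q."""
    INF = float("inf")
    n, m = len(q), len(p)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(q[i - 1] + b - p[j - 1])  # local cost with key bias
            D[i][j] = d + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def best_match(q, p, biases):
    """Try every candidate bias: this is the O(|q| * |p| * |B|) search."""
    return min((dtw_cost(q, p, b), b) for b in biases)

q = [0, 2, 4, 4, 2, 0]        # query, in semitones
p = [5, 7, 7, 9, 9, 9, 7, 5]  # same shape sung 5 semitones higher and slower
cost, b = best_match(q, p, range(-12, 13))
print(b)  # 5
```

Here DTW absorbs the tempo difference and the bias search recovers the transposition, at the cost of rerunning the whole recursion 25 times.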
0:09:10 | But even in this case, when you actually try to do the dynamic programming to minimize this whole equation, you basically have to compute costs over a three-dimensional space: you have your query axis and your target axis, and then you also have to try every single possible difference in musical key. So you have to compute the costs in a Q times P times B space, and that cost is just too high. The original study that tried to use musical pitch contours actually did exactly that. |
0:09:44 | To address this concern, people have proposed different approaches. What we did to speed up the DTW was to partition the query into segments, treating each segment as a rhythmically consistent unit. This idea of partitioning queries or melodies into segments is not really new; people have done it in one form or another. |
0:10:09 | What we did here, though, involves a number of mathematical issues that you have to handle, in terms of how the bias can change and things like that. These are algorithmic details that are in the paper, so I won't really get into them here. What really matters is that we are dividing up the query into segments, and this drastically reduces the amount of computation; we propose a framework for doing that. |
0:10:39 | This can actually be solved quite elegantly using a very classical level-building approach, with just some subtle differences. It's similar to connected word recognition, the classical technique that you may have seen in the Rabiner and Juang speech recognition book, where you build up levels and each level represents one word segment. |
0:11:04 | Here we have a query that's divided up into different segments, but we also introduce what we call buffer zones between the query segments. These are employed to allow some target segments to overlap, and that provides much greater flexibility in the DTW: if you don't allow this overlap, you introduce too much rigidity into the system. All of this can be done quite elegantly with a level-building scheme. |
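A much-simplified sketch of the level-building idea: one level per query segment, where each level runs a DTW whose entry costs come from the previous level's end costs, and a small buffer lets the next segment begin a few frames before the previous one ended (the overlap mentioned above). The step pattern, the uniform segmentation, and the omission of the key bias are all simplifying assumptions, not the paper's algorithm.

```python
# Hedged sketch of segmented DTW via level building. Level 0 may start at
# any target frame (all starting locations); each later level starts where
# the previous one ended, minus a small overlap buffer.

def segmented_dtw(query, target, n_segments=2, buffer=1):
    INF = float("inf")
    m = len(target)
    seg_len = len(query) // n_segments
    segments = [query[k * seg_len:(k + 1) * seg_len] for k in range(n_segments - 1)]
    segments.append(query[(n_segments - 1) * seg_len:])  # last takes the remainder

    # level_cost[j]: best cost of the levels so far, ending at target frame j.
    level_cost = [0.0] * (m + 1)
    first = True
    for seg in segments:
        if first:
            entry = [0.0] * (m + 1)  # first segment: free start anywhere
            first = False
        else:
            # buffer zone: next segment may start up to `buffer` frames early
            entry = [min(level_cost[j:j + buffer + 1]) for j in range(m + 1)]
        D = [entry] + [[INF] * (m + 1) for _ in seg]
        for i in range(1, len(seg) + 1):
            for j in range(1, m + 1):
                d = abs(seg[i - 1] - target[j - 1])
                D[i][j] = d + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
        level_cost = D[len(seg)]
    return min(level_cost)  # best match may also end anywhere in the target

# The query matches a stretch in the middle of a longer target.
query = [0, 2, 4, 5, 4, 2, 0, -1]
target = [9, 9, 9] + query + [9, 9, 9]
print(segmented_dtw(query, target))  # 0.0
```

Each level costs on the order of one segment length times the target length, which is where the O(nP) behavior reported below comes from.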
0:11:34 | We actually implemented this and ran it on the MIREX 2006 test set. These are the means and standard deviations of the average search time per query that you could achieve using these techniques. |
0:11:53 | What you see at the bottom are the asymptotic search costs associated with each type of search technique. The first is our original work, where we just rigidly compared melody fragments using a simple Euclidean distance; there, each comparison can be done in constant time, because it's just a Euclidean distance. |
0:12:21 | Multi-scale query windowing is just a variation of that, and its search cost is asymptotically O(n), where n is the number of query segments. The segmented DTW is our proposed method, for which the search cost is O(nP), where again n is the number of query segments and P is the length of your target signal. |
0:12:44 | And the DTW is the brute-force DTW that directly minimizes the original optimal equation; for that, the search cost is O(QPB), since you have to compute over all three dimensions, so its asymptotic cost is the greatest. |
0:12:58 | In the actual experiments you can see that, as expected, the multi-scale query windowing is the fastest. Our proposed segmented DTW is much slower, but at 0.64 seconds per query that isn't a very bad figure. We then estimated the amount of time it would take if we used the brute-force optimal DTW, which, as you can see, takes much, much longer than any of these methods. |
0:13:34 | In terms of search performance, we found that the segmented DTW obviously gives the best results, because it does actual dynamic time warping between queries and targets, and can handle rhythmic inconsistencies and all the other variations that the multi-scale query windowing and the previous methods can't. The MRR we got was 0.838. |
0:14:03 | This is actually much lower than the state of the art, which is higher than 0.9. However, in that particular work there was a constraint that all queries started at the beginning of melodies, or of melodic phrases. In our case we searched every possible location in all the songs in the database, so our search space is much larger and there's much more room for error. |
0:14:32 | So, just to be on an equal footing with the state of the art, we also tried constraining all queries to start at the beginning of melodies, and that gave us an MRR of 0.921, which is almost the same as the state of the art. |
0:14:46 | We also tried an experiment on a polyphonic music set, which we created using commercial pop song recordings, so these are actual acoustic polyphonic music recordings from which we did some kind of automatic F0 extraction, and then we ran the test on that. Obviously the results are going to be much lower, because now we're using polyphonic music and there will be lots of errors in the pitch transcriptions, but again, our proposed method was still able to give the best performance. |
0:15:19 | Finally, this is just showing the relation between the number of query segments used versus the MRR and the search time. The more segments you have, the closer you get to the optimal DTW, so the MRR is going to go up, but at the same time your search time also grows. That's pretty much it for the overview of this work. Are there any questions? |
0:15:52 | [Audience question, partly inaudible:] Is there a more efficient way of finding the bias than trying every possible b? Trying every possible b makes the computation large; could you perhaps use a smaller resolution? |
0:16:20 | Right, so if you're going to try every possible b, then by definition you're trying every possible b. You could try using some kind of heuristic to reduce the search space of the b's you try, so that instead of trying all possible values you try a subset. In our work we just did a simplification: we used a single optimal b estimated from only the first query segment, and that's how we were able to eliminate that whole search dimension. |
0:16:52 | But there are probably other heuristics that you could apply there. |
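The simplification mentioned in the answer, estimating one bias from the first query segment instead of searching all of them, could be sketched as below. The median estimator and the function name are my own illustrative assumptions; the talk does not specify how the optimal b was computed.

```python
# Hedged sketch: estimate a single key bias b from the first query segment
# (here, the median pitch difference against the candidate target region),
# then reuse that b for the rest of the match instead of searching over b.

def estimate_bias(query_seg, target_seg):
    """Median log-frequency offset between the first query segment and the target."""
    diffs = sorted(t - q for q, t in zip(query_seg, target_seg))
    mid = len(diffs) // 2
    return diffs[mid] if len(diffs) % 2 else 0.5 * (diffs[mid - 1] + diffs[mid])

q_seg = [0, 2, 4, 4, 2]  # first query segment, in semitones
t_seg = [5, 7, 9, 9, 7]  # the same phrase sung 5 semitones higher
print(estimate_bias(q_seg, t_seg))  # 5
```

A median is one plausible choice here because it is robust to a few wrongly transcribed frames, which matters for contours extracted from polyphonic audio.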