0:00:13 | The goal of this work was to improve upon state-of-the-art transcription
---|
0:00:19 | by explicitly incorporating information about note onsets and offsets.
---|
0:00:25 | Some general information about transcription, which you might have heard before:
---|
0:00:29 | it is the process of
---|
0:00:31 | converting an audio recording into some form of music notation.
---|
0:00:34 | It has
---|
0:00:35 | numerous applications,
---|
0:00:37 | mainly in interactive music systems such as automated score following, and in computational musicology.
---|
0:00:45 | It can be divided into several subtasks,
---|
0:00:48 | such as multi-pitch estimation, detection of note onsets and offsets,
---|
0:00:53 | instrument identification,
---|
0:00:55 | and extraction of rhythmic information such as the tempo. And
---|
0:00:58 | in the multi-pitch, multiple-instrument
---|
0:01:00 | case it still remains an open problem.
---|
0:01:04 | Some related work which is linked to this
---|
0:01:08 | work
---|
0:01:09 | is the iterative
---|
0:01:11 | spectral-subtraction-based system by Klapuri, which
---|
0:01:14 | proposed the spectral smoothness principle;
---|
0:01:18 | the rule-based system by Zhou, who also proposed as a time-frequency representation the resonator time-frequency image, which
---|
0:01:25 | is also used
---|
0:01:26 | in this work;
---|
0:01:28 | then,
---|
0:01:30 | Yeh's joint multiple-F0 estimation method, which
---|
0:01:32 | continues to rank first in the MIREX
---|
0:01:36 | public evaluations for multiple-F0 estimation and note tracking;
---|
0:01:41 | and an iterative estimation
---|
0:01:43 | system for multi-pitch estimation which exploits temporal evolution, which was previously proposed by the authors.
---|
0:01:50 | Also, some related work on onset detection
---|
0:01:53 | is the well-known
---|
0:01:56 | fused onset detection functions by Bello et al., which combine energy- and phase-
---|
0:02:02 | based
---|
0:02:02 | measures,
---|
0:02:04 | and
---|
0:02:05 | a more recent development, which was late fusion
---|
0:02:09 | by Holzapfel et al., which
---|
0:02:11 | fused the onset descriptors at the decision level.
---|
0:02:16 | In this work
---|
0:02:17 | we propose a system for joint multi-pitch estimation
---|
0:02:21 | which
---|
0:02:22 | also exploits onset and offset detection
---|
0:02:24 | in an effort to have improved multiple-pitch estimation results.
---|
0:02:31 | Note onset detection features were developed
---|
0:02:33 | and proposed, which were derived from preprocessing steps of the transcription system,
---|
0:02:39 | and offsets
---|
0:02:40 | are, we believe, for the first time explicitly exploited, by using a hidden Markov model.
---|
0:02:48 | This is the basic outline of the system.
---|
0:02:50 | Basically, there is a preprocessing step where the time-frequency representation is extracted,
---|
0:02:56 | spectral whitening is performed,
---|
0:02:58 | noise is suppressed, and a pitch
---|
0:03:00 | salience, or pitch strength, function is extracted.
---|
0:03:04 | Afterwards, the core of the system is the
---|
0:03:07 | onset detection, using late fusion and the proposed descriptors, and
---|
0:03:11 | joint
---|
0:03:11 | multi-pitch estimation.
---|
0:03:14 | Afterwards comes
---|
0:03:14 | pitch-wise offset detection, and the result is the final transcription in MIDI form.
---|
0:03:21 | This is an example of the
---|
0:03:24 | time-frequency representation we used,
---|
0:03:26 | which was the resonator time-frequency image: a
---|
0:03:29 | resonator filter bank.
---|
0:03:31 | We used that,
---|
0:03:34 | in contrast with the more
---|
0:03:36 | common constant-Q transform for example, because
---|
0:03:38 | of its exponential decay factor it has
---|
0:03:41 | better temporal resolution in the low frequencies, as you might
---|
0:03:44 | see here.
---|
0:03:46 | This is a
---|
0:03:49 | recording from the MIREX 2007 competition which is
---|
0:03:52 | usually
---|
0:03:53 | employed for evaluation.
---|
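The exponential-decay behaviour the speaker mentions can be sketched with a single one-pole complex resonator channel (a minimal illustration of the RTFI idea, not the system's actual filter bank; the pole radius and frequencies are illustrative):

```python
import cmath

def resonator_channel(x, freq_hz, fs, r=0.99):
    # One-pole complex resonator tuned to freq_hz. The pole radius r sets the
    # exponential decay: r closer to 1 means slower decay (finer frequency
    # resolution), smaller r means better temporal resolution.
    pole = r * cmath.exp(2j * cmath.pi * freq_hz / fs)
    y = 0j
    out = []
    for sample in x:
        y = (1.0 - r) * sample + pole * y
        out.append(abs(y))
    return out
```

An impulse fed into such a channel decays geometrically by the factor r per sample, which is what gives the RTFI its time-resolution advantage at low frequencies over a fixed-window transform.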
0:03:56 | After the extraction of the time-frequency representation, spectral whitening is performed in order to
---|
0:04:01 | suppress timbral information and
---|
0:04:04 | make the system more robust to different sound sources.
---|
0:04:07 | The
---|
0:04:08 | method by Klapuri was used to that end,
---|
0:04:11 | and it was followed by an
---|
0:04:14 | octave-span filtering procedure for noise suppression.
---|
0:04:18 | Based on that
---|
0:04:20 | whitened and noise-suppressed representation,
---|
0:04:22 | a pitch salience,
---|
0:04:24 | or pitch
---|
0:04:24 | strength, function is extracted,
---|
0:04:27 | along with tuning and inharmonicity coefficients.
---|
0:04:30 | In the lower figure, at the bottom, you can see
---|
0:04:33 | the RTFI
---|
0:04:35 | spectrum of a C4 piano note,
---|
0:04:37 | and in the lower left and right figures you can see the corresponding pitch salience function, where you see a
---|
0:04:42 | prominent peak
---|
0:04:43 | at the C4 note; but you can also see several peaks in subharmonic or superharmonic
---|
0:04:48 | positions of that
---|
0:04:50 | fundamental.
---|
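A simplified harmonic-summation salience (a stand-in for the talk's pitch-strength function; the harmonic count and 1/h weighting are illustrative) also reproduces the sub/superharmonic ghost peaks the speaker points out:

```python
def salience(mag, f0_bin, n_harm=6):
    # Sum weighted magnitudes at integer multiples of the candidate F0 bin;
    # higher harmonics are weighted less. Subharmonic candidates (f0/2, ...)
    # also collect energy from the true harmonics, hence the spurious peaks.
    total = 0.0
    for h in range(1, n_harm + 1):
        k = h * f0_bin
        if k >= len(mag):
            break
        total += mag[k] / h
    return total
```

For a note whose partials sit at bins 10, 20, 30, ..., the true F0 scores highest, but the subharmonic at bin 5 and the superharmonic at bin 20 still respond.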
0:04:53 | As for how onset detection is
---|
0:04:56 | performed:
---|
0:04:57 | two onset descriptors were extracted
---|
0:04:59 | and proposed, utilising information from the preprocessing steps
---|
0:05:03 | of the multi-pitch estimation stage.
---|
0:05:06 | The first proposed descriptor was a spectral-flux-based one, which also incorporated tuning information,
---|
0:05:11 | and it was essentially
---|
0:05:14 | motivated because in
---|
0:05:17 | many cases you have
---|
0:05:19 | false alarms caused by vibrato or by tuning changes, and these might
---|
0:05:25 | give many false alarms in a normal energy-
---|
0:05:29 | based onset detection measure.
---|
0:05:32 | The proposed measure is basically a half-wave rectified, semitone-resolution filter bank,
---|
0:05:37 | which also incorporates information from the extracted pitch salience function.
---|
0:05:42 | Thus onsets can be easily detected by peak-picking
---|
0:05:46 | that function.
---|
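The half-wave rectification and peak-picking steps can be sketched as follows (a minimal illustration, not the speaker's implementation: the semitone filter bank is abstracted to a plain per-band magnitude vector, and the threshold is a placeholder):

```python
def hwr_flux(prev_frame, cur_frame):
    # Half-wave-rectified flux over per-band magnitudes: only *increases*
    # in energy count, so the decays caused by vibrato do not fire.
    return sum(max(c - p, 0.0) for p, c in zip(prev_frame, cur_frame))

def pick_peaks(odf, threshold):
    # Local maxima of the onset detection function above a fixed threshold
    # become onset frame indices.
    return [i for i in range(1, len(odf) - 1)
            if odf[i] > threshold and odf[i] >= odf[i - 1] and odf[i] > odf[i + 1]]
```

Rectification is what distinguishes this from a plain frame difference: a band whose energy drops contributes nothing, so only genuine energy arrivals raise the function.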
0:05:48 | A second function
---|
0:05:50 | for onset detection was also proposed,
---|
0:05:53 | in order to detect soft onsets, that is, onsets
---|
0:05:56 | produced without any noticeable energy change, which
---|
0:05:59 | might be produced by bowed strings, for example.
---|
0:06:02 | The proposed function was based on a chroma-wrapped version of the
---|
0:06:07 | extracted pitch salience function,
---|
0:06:10 | which was also half-wave rectified.
---|
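Why a chroma-wrapped salience difference catches soft onsets can be shown with a toy example (names and the 24-semitone vector are illustrative, not the system's actual representation): when a bowed instrument changes note legato, total energy barely moves, but the pitch-class distribution does.

```python
def chroma_wrap(semitone_salience):
    # Fold a semitone-resolution salience vector into 12 pitch classes.
    pc = [0.0] * 12
    for i, v in enumerate(semitone_salience):
        pc[i % 12] += v
    return pc

def soft_onset_strength(prev_sal, cur_sal):
    # Half-wave-rectified chroma difference: fires on a pitch change even
    # when the overall energy is unchanged.
    p, c = chroma_wrap(prev_sal), chroma_wrap(cur_sal)
    return sum(max(cv - pv, 0.0) for pv, cv in zip(p, c))
```

An energy-based flux would return zero on such a transition; the salience-based descriptor does not, which is the point the speaker makes about bowed strings.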
0:06:14 | In order to combine these two onset descriptors, late fusion was applied,
---|
0:06:19 | and in order to
---|
0:06:21 | train the late fusion parameters, a development set from
---|
0:06:25 | Ghent University was used; this consists of ten thirty-second classical music excerpts.
---|
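Decision-level (late) fusion can be sketched as follows; this is only one simple AND-style rule with an assumed tolerance, not the trained fusion scheme the speaker describes:

```python
def late_fuse(onsets_energy, onsets_pitch, tol=0.05):
    # Keep an onset time from the energy-based detector only when the
    # pitch-based detector reports an onset within +/- tol seconds.
    # tol and the agreement rule are illustrative placeholders.
    return [t for t in onsets_energy
            if any(abs(t - u) <= tol for u in onsets_pitch)]
```

The key property of late fusion is that each detector first makes its own binary decisions, and only those decisions (not the raw detection functions) are combined.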
0:06:32 | For multiple-F0 estimation,
---|
0:06:35 | for each
---|
0:06:36 | frame,
---|
0:06:39 | pitch candidates are extracted,
---|
0:06:40 | and for each possible combination
---|
0:06:43 | the overlapped partials are estimated
---|
0:06:46 | and overlapping partial treatment is applied.
---|
0:06:49 | Basically, for each combination, the partial collisions are computed,
---|
0:06:53 | and afterwards the amplitudes of the overlapped partials are estimated by a
---|
0:06:57 | discrete-cepstrum-based spectral envelope estimation procedure in the log-frequency domain.
---|
0:07:01 | In the figure here, on the right, you can see the
---|
0:07:05 | harmonic partial sequence of a
---|
0:07:08 | G5 piano note and the corresponding spectral envelope.
---|
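The intent of overlapping partial treatment can be shown with a much simpler stand-in for the discrete-cepstrum envelope: estimate what amplitude a colliding partial *should* have by interpolating the envelope between its non-overlapped neighbours (log-amplitude against log-frequency; function names and the interpolation scheme are illustrative assumptions):

```python
import math

def overlapped_partial_amp(freqs, amps, f_overlap):
    # freqs/amps: frequencies and amplitudes of reliable (non-overlapped)
    # partials of one note. Interpolate log-amplitude over log-frequency
    # to infer the amplitude of the partial at f_overlap.
    below = max(f for f in freqs if f < f_overlap)
    above = min(f for f in freqs if f > f_overlap)
    a_lo = amps[freqs.index(below)]
    a_hi = amps[freqs.index(above)]
    t = (math.log(f_overlap) - math.log(below)) / (math.log(above) - math.log(below))
    return math.exp((1 - t) * math.log(a_lo) + t * math.log(a_hi))
```

The inferred amplitude can then be subtracted from the shared spectral peak, so that the colliding note pair does not double-count the same energy.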
0:07:14 | Afterwards, for each possible pitch combination for the current frame,
---|
0:07:18 | a score function is computed
---|
0:07:20 | which exploits several spectral features
---|
0:07:25 | and also aims to minimize the residual spectrum.
---|
0:07:29 | So, the features that were used were the spectral flatness
---|
0:07:33 | of each harmonic partial sequence;
---|
0:07:36 | a smoothness measure based on the spectral smoothness principle;
---|
0:07:39 | the spectral centroid, which is the centre of gravity of the
---|
0:07:43 | harmonic partial sequence, aiming for a low
---|
0:07:45 | spectral centroid, which
---|
0:07:47 | is usually an indication of a
---|
0:07:50 | musical, i.e. harmonic, sound.
---|
0:07:52 | A novel feature was also proposed, the harmonically related pitch ratio, which was used in order
---|
0:07:58 | to
---|
0:07:59 | suppress
---|
0:08:00 | any harmonic or subharmonic errors.
---|
0:08:03 | And finally, we try to minimize the flatness of the residual spectrum, in order to minimize the residual
---|
0:08:10 | spectrum itself.
---|
0:08:13 | So the optimal pitch candidate set is the one that actually maximises that
---|
0:08:18 | score function.
---|
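A toy version of such a weighted score (only two of the talk's features, with placeholder weights; the real function also includes smoothness, centroid and the harmonically related pitch ratio, with trained weights):

```python
import math

def flatness(seq):
    # Geometric mean over arithmetic mean: near 1 for noise-like sequences,
    # near 0 for peaky (harmonic) ones.
    g = math.exp(sum(math.log(v + 1e-12) for v in seq) / len(seq))
    return g / (sum(seq) / len(seq))

def combination_score(hps, residual, weights=(1.0, 1.0)):
    # Reward a peaky (low-flatness) harmonic partial sequence and penalise
    # harmonic structure left behind in the residual spectrum.
    w1, w2 = weights
    return -w1 * flatness(hps) - w2 * (1.0 - flatness(residual))
```

A candidate set that explains the harmonic peaks (peaky partial sequence, flat residual) then outscores one that leaves harmonic energy unexplained, which is the selection rule the speaker describes.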
0:08:20 | The weight parameters for that score function were trained using nonlinear optimisation,
---|
0:08:25 | using a development set of one hundred piano samples from the MAPS
---|
0:08:29 | piano sounds
---|
0:08:30 | database, which was developed by Emiya et al.
---|
0:08:38 | After the multi-pitch estimation stage, offset detection is proposed,
---|
0:08:42 | and it is
---|
0:08:43 | applied
---|
0:08:44 | using two-state, on/off
---|
0:08:48 | hidden Markov models for each
---|
0:08:50 | single pitch.
---|
0:08:52 | In this system, an offset is defined as the
---|
0:08:55 | time frame, between two consecutive onsets,
---|
0:08:58 | where the
---|
0:09:00 | salience of a pitch
---|
0:09:01 | first turns into an inactive state.
---|
0:09:04 | In order to
---|
0:09:07 | compute the state priors and state transitions for the HMMs,
---|
0:09:11 | MIDI files from the RWC database were used, from the classical and jazz
---|
0:09:16 | genres,
---|
0:09:16 | and for the observation probability we used
---|
0:09:20 | the information from the
---|
0:09:21 | previously extracted pitch salience function.
---|
0:09:24 | Basically, the observation function for an active pitch is essentially a sigmoid function of the
---|
0:09:29 | extracted salience function, and here you can
---|
0:09:31 | see the basic structure of the pitch-wise HMM for offset detection.
---|
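A minimal sketch of such a pitch-wise two-state HMM, decoded with Viterbi (the sigmoid observation model follows the talk; the priors, transition probabilities, and the assumption that the salience is centred at zero are illustrative, whereas the talk trains these from RWC MIDI data):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def track_note(salience, p_stay=0.9):
    # Viterbi decoding of a 2-state HMM (0 = inactive, 1 = active) for one
    # pitch. Observation model: P(active | s) = sigmoid(s).
    trans = ((p_stay, 1 - p_stay), (1 - p_stay, p_stay))

    def log_obs(state, s):
        p = sigmoid(s)
        return math.log(p if state == 1 else 1.0 - p)

    delta = [math.log(0.5) + log_obs(j, salience[0]) for j in (0, 1)]
    back = []
    for s in salience[1:]:
        step, new_delta = [], []
        for j in (0, 1):
            i = max((0, 1), key=lambda i: delta[i] + math.log(trans[i][j]))
            step.append(i)
            new_delta.append(delta[i] + math.log(trans[i][j]) + log_obs(j, s))
        back.append(step)
        delta = new_delta
    path = [max((0, 1), key=lambda j: delta[j])]
    for step in reversed(back):
        path.append(step[path[-1]])
    return path[::-1]  # per-frame 0/1 activity; the 1 -> 0 transition is the offset
```

The frame where the decoded state first drops from 1 to 0 (between two consecutive onsets) is the detected offset.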
0:09:38 | For evaluation we used, as is typically the case, a set of twelve twenty-three-
---|
0:09:43 | second excerpts from the RWC database,
---|
0:09:46 | which consists of classical and jazz music excerpts.
---|
0:09:49 | As you can see, most of these pieces, but not all of them, are piano; there are
---|
0:09:54 | some guitars, and there is a string quartet as well.
---|
0:09:58 | Here is a basic example of a
---|
0:10:01 | transcription.
---|
0:10:03 | In the upper
---|
0:10:04 | figure you can see
---|
0:10:06 | the pitch ground truth for a guitar excerpt; in the lower half you can see the transcription.
---|
0:10:11 | This is what the original recording sounds like. [audio example]
---|
0:10:22 | And this is the synthesized transcription of the same recording. [audio example]
---|
0:10:34 | Generally, you can see that
---|
0:10:36 | the algorithm
---|
0:10:37 | doesn't have many false alarms, but in some cases it tends to underestimate the
---|
0:10:43 | number of concurrent notes, the polyphony level, so it has some
---|
0:10:46 | missed detections;
---|
0:10:47 | but overall the output is
---|
0:10:49 | quite good.
---|
0:10:50 | These are the results for the system.
---|
0:10:55 | The result in terms of accuracy, using a ten-millisecond
---|
0:10:58 | frame-based evaluation,
---|
0:11:00 | is sixty
---|
0:11:01 | point five percent for a frame-based evaluation without onset or offset detection.
---|
0:11:06 | It is fifty-nine point seven percent utilising information from onsets only, because that case has many more false alarms,
---|
0:11:13 | since it doesn't have any deactivations of pitches;
---|
0:11:16 | and it rises up to sixty-one point two percent
---|
0:11:19 | for the joint onset and offset case.
---|
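The frame-based accuracy figures quoted here are conventionally computed as TP / (TP + FP + FN) over 10 ms frames; a minimal sketch, with each frame represented as a set of active MIDI pitches:

```python
def frame_accuracy(ref_frames, est_frames):
    # Frame-level multi-pitch accuracy: true positives over the sum of
    # true positives, false positives and false negatives.
    tp = fp = fn = 0
    for ref, est in zip(ref_frames, est_frames):
        tp += len(ref & est)
        fp += len(est - ref)
        fn += len(ref - est)
    return tp / (tp + fp + fn)
```

Note that, unlike plain precision or recall, this measure penalises false alarms and misses symmetrically, which is why the onset-only variant (more false alarms from undetected offsets) scores below the joint onset-and-offset variant.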
0:11:22 | And when compared to various
---|
0:11:24 | state-of-the-art methods, such as a
---|
0:11:28 | method which used
---|
0:11:30 | Gaussian-based
---|
0:11:32 | spectral models,
---|
0:11:36 | or the HTC method that was also presented before,
---|
0:11:41 | our results are
---|
0:11:42 | about two percent improved in terms of accuracy.
---|
0:11:46 | More detail on these results
---|
0:11:48 | might be given with some additional metrics, where it can be seen that
---|
0:11:51 | most of the errors are false negatives, that is, missed detections,
---|
0:11:56 | whereas the number of false positives
---|
0:11:59 | is
---|
0:11:59 | relatively smaller.
---|
0:12:02 | And finally, some results on the onset detection
---|
0:12:05 | procedure.
---|
0:12:07 | It should be noted that we were aiming for a high recall, not a high F-measure, because we were not
---|
0:12:12 | interested in
---|
0:12:15 | segmenting the signal, but rather in capturing as many onsets as possible.
---|
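The recall-versus-F-measure trade-off the speaker mentions comes from the standard onset evaluation: match estimated onsets to references within a small time tolerance, then compute precision, recall and F-measure. A sketch with a greedy one-to-one matcher (the 50 ms tolerance is a common convention, assumed here rather than taken from the talk):

```python
def onset_prf(ref, est, tol=0.05):
    # Greedy one-to-one matching of estimated onsets to reference onsets
    # within +/- tol seconds.
    used, hits = set(), 0
    for t in est:
        for i, r in enumerate(ref):
            if i not in used and abs(t - r) <= tol:
                used.add(i)
                hits += 1
                break
    prec = hits / len(est) if est else 0.0
    rec = hits / len(ref) if ref else 0.0
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f
```

Over-detecting (lowering the peak-picking threshold) hurts precision but raises recall, which is acceptable here because spurious onset candidates can still be rejected by the later multi-pitch stage.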
0:12:21 | Finally, the contributions of this work were: the onset detection features, which were derived from the pitch
---|
0:12:27 | estimation
---|
0:12:28 | preprocessing;
---|
0:12:30 | a score function that combines several features for multi-pitch estimation, including a
---|
0:12:36 | novel feature for suppressing harmonically related pitches;
---|
0:12:40 | offset detection
---|
0:12:41 | using pitch-wise
---|
0:12:43 | HMMs;
---|
0:12:44 | and transcription results using the RWC database which
---|
0:12:47 | outperform the state of the art.
---|
0:12:49 | In the future, we would like to explicitly model the
---|
0:12:54 | note states of a produced sound, such as the attack, transient, sustain and decay parts of
---|
0:12:59 | the produced sound;
---|
0:13:01 | to perform joint multi-pitch estimation and note tracking, not separately;
---|
0:13:06 | and finally, to publicly release our methods into the MARSYAS framework, as was done with a previous method of ours.
---|
0:13:12 | Thank you.
---|
0:13:21 | [Session chair] Right, we have time for some questions.
---|
0:13:34 | [Question] Hi, so,
---|
0:13:36 | I noticed that you said you trained your onset detection on piano.
---|
0:13:41 | [Answer] I think the onset detection was trained on general classical music excerpts; most of them
---|
0:13:46 | were strings, actually.
---|
0:13:47 | [Question] Okay, I guess that was going to be my next question, because
---|
0:13:51 | a lot of times, on
---|
0:13:52 | plucked strings and
---|
0:13:54 | struck instruments, it's much, much easier to do the onset detection,
---|
0:13:58 | and I was just wondering
---|
0:14:00 | if you feel like your onset detection has anything new to say about detecting onsets in things with bows,
---|
0:14:06 | or
---|
0:14:06 | with singing, or things like that. [Answer] That's why the second, salience-based measure was
---|
0:14:11 | proposed: to detect soft onsets. It was a pitch-based measure, which is actually, I think,
---|
0:14:16 | the only
---|
0:14:17 | reliable way to detect
---|
0:14:19 | onsets without any energy change.
---|
0:14:21 | In fact, we put some transcription examples, like the one you heard before, on the web,
---|
0:14:27 | where I have another example, from a string quartet transcription, which is actually pretty accurate.
---|
0:14:36 | [Question] One question: can you say more about
---|
0:14:38 | why you think onsets
---|
0:14:40 | are also somehow perceptually important,
---|
0:14:42 | and why they were so important to your performance?
---|
0:14:46 | [Answer] Well, the thing is that most
---|
0:14:48 | multi-pitch estimation methods
---|
0:14:50 | do not explicitly exploit information about
---|
0:14:54 | the activation time of
---|
0:14:56 | the produced sound, and also
---|
0:15:00 | the deactivation of that sound; and
---|
0:15:03 | by incorporating that information,
---|
0:15:06 | in fact,
---|
0:15:07 | we demonstrated that we can also
---|
0:15:11 | improve a bit
---|
0:15:12 | on the frame-based multi-pitch estimation,
---|
0:15:15 | as you can see.
---|
0:15:15 | C G and um |
---|
0:15:17 | yeah i mean uh generally |
---|
0:15:19 | on so |
---|
0:15:20 | should |
---|
0:15:21 | if fact be used um |
---|
0:15:23 | more widely and |
---|
0:15:25 | not |
---|
0:15:26 | be left outside |
---|
0:15:27 | which just is them as is user down |
---|
0:15:30 | [Question] It looked like a lot of errors were caused by missing notes.
---|
0:15:34 | [Answer] Yeah, you mean whether the onset detection was helping or was impairing?
---|
0:15:36 | The missing notes were actually produced in the case of
---|
0:15:39 | dense chords, when you might also have some octave errors;
---|
0:15:43 | sometimes the upper pitch might not be detected, in the case where you have an octave.
---|
0:15:48 | And that doesn't have anything to do with onsets, because the onsets are
---|
0:15:53 | there for the lower note. But it has something to say about the features we might use for
---|
0:16:00 | multi-pitch estimation: that we need features that might be more
---|
0:16:04 | robust, let's say, in the case of overlapping notes.
---|
0:16:12 | [Session chair] Any more questions?
---|
0:16:15 | [Question] One more, about polyphony support:
---|
0:16:18 | when you have a chord,
---|
0:16:19 | I mean, can we really hope to get all the notes
---|
0:16:22 | by automatic means?
---|
0:16:23 | [Answer] Well, it depends on the instrument models that you have;
---|
0:16:28 | if, for example, you have
---|
0:16:30 | trained the parameters of your system on a
---|
0:16:32 | specific instrument, then that might be
---|
0:16:34 | generally easier,
---|
0:16:36 | compared to, let's say, having trained your parameters on a piano and testing them on a
---|
0:16:40 | chord
---|
0:16:41 | from bowed strings.
---|
0:16:44 | So, if you are given the
---|
0:16:45 | music,
---|
0:16:45 | instrument-dependent models can be used, and I think
---|
0:16:49 | generally the trend in the future would be to
---|
0:16:51 | perform instrument-specific transcription,
---|
0:16:54 | so that it will also include joint
---|
0:16:57 | instrument identification as well.
---|
0:17:01 | [Session chair] Anything else?
---|
0:17:04 | OK, thank you.
---|