0:00:31 I am the chair of this session. 0:00:36 So, let's start with the first presentation.
0:00:44 The first paper is "Combining HMM-based melody extraction and NMF-based soft masking for separating voice and accompaniment from monaural audio", 0:01:00 by Yun Wang. 0:01:04 Okay, please.
0:01:07 Okay, good morning everyone. 0:01:09 I am presenting my paper, "Combining HMM-based melody extraction and NMF-based soft masking for separating voice and accompaniment from monaural audio".
0:01:22 First, here you see the block diagram of a typical system for separating voice and accompaniment. 0:01:29 It is made up of two main modules: melody extraction, which outputs a pitch contour from the audio signal, 0:01:37 and time-frequency masking, which works on the spectrogram to give estimates of the spectrograms of the voice and the accompaniment. 0:01:45 Different systems differ in the techniques they use for these individual modules.
0:01:51 For melody extraction, popular methods are hidden Markov models and non-negative matrix factorization; 0:01:59 for time-frequency masking, there are hard masking and soft masking.
0:02:04 Our work is largely based on the work of Durrieu, which relies entirely on non-negative matrix factorization. 0:02:12 But we find that melody extraction with NMF does not work very well, so we took inspiration from the work of Hsu, which does melody extraction with hidden Markov models, 0:02:28 and built our own system from that.
0:02:30 First, I will give a brief review of NMF-based melody extraction and of time-frequency masking.
0:02:39 In non-negative matrix factorization, the observed spectrogram X of the given audio signal is regarded as a stochastic process, 0:02:50 where each element x_ft is a complex number obeying a Gaussian distribution with a variance parameter d_ft; putting all the d's together gives the power spectrogram D. 0:03:05 The problem of non-negative matrix factorization is to estimate this power spectrogram D so as to maximize the likelihood of the observed spectrogram X.
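The estimation problem just described can be sketched in code. This is an illustrative snippet of mine, not the authors' implementation: under the Gaussian model above, maximizing the likelihood of X is equivalent to minimizing the Itakura-Saito divergence between the observed power spectrogram |X|^2 and the model D = W H, which the standard multiplicative updates below minimize.

```python
import numpy as np

def is_nmf(V, rank, n_iter=100, eps=1e-12, seed=0):
    """Itakura-Saito NMF: approximate the power spectrogram V ~= W @ H.

    Under the model x_ft ~ CN(0, d_ft) with D = W @ H, maximum-likelihood
    estimation of W and H is equivalent to minimizing the IS divergence
    sum_ft (V/D - log(V/D) - 1).  These are the standard multiplicative
    updates for that objective.
    """
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, rank)) + eps
    H = rng.random((rank, T)) + eps
    for _ in range(n_iter):
        D = W @ H + eps
        H *= (W.T @ (V / D**2)) / (W.T @ (1.0 / D) + eps)
        D = W @ H + eps
        W *= ((V / D**2) @ H.T) / ((1.0 / D) @ H.T + eps)
    return W, H

# Toy usage: factor the power spectrogram of random noise into 2 components.
V = np.abs(np.random.default_rng(1).normal(size=(64, 40)))**2
W, H = is_nmf(V, rank=2)
```

In the talk's decomposition, some factors (such as P^F0) are held fixed, which simply means skipping their update inside the loop.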
0:03:17 The power spectrogram D of the total signal can be decomposed into two parts: the spectrogram of the voice plus the spectrogram of the music, 0:03:28 that is, the accompaniment. 0:03:30 Furthermore, the spectrogram of the voice can be decomposed into the product of the spectrograms of the glottal excitation and the vocal tract.
0:03:41 Inside these parentheses, the matrices P can be regarded as codebooks, and the matrices A can be regarded as the linear combination coefficients of these basis vectors. 0:03:54 Let me show you how this works.
0:03:57 Let's take the glottal excitation matrix P^F0 as an example. 0:04:02 The P^F0 matrix looks like this: each column is the spectrum of the glottal excitation at a certain fundamental frequency. 0:04:12 Fundamental frequencies are expressed as MIDI numbers, which is a logarithmic scale of frequency. 0:04:19 Here you can see two columns of the matrix: one for MIDI number 55 and the other for 70. 0:04:26 You can see that 55 has a lower fundamental frequency, so its harmonics are placed closer together, 0:04:33 while for 70 they are placed further apart.
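The layout of this codebook can be sketched with a toy construction (my own illustration; the real glottal-excitation spectra are certainly more refined than the flat, Gaussian-blurred harmonic combs assumed here):

```python
import numpy as np

def midi_to_hz(n):
    """MIDI note number -> frequency in Hz (A4 = MIDI 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((n - 69) / 12.0)

def make_pf0(midi_numbers, n_bins=513, sr=16000, width=2.0):
    """Build a toy P^F0 codebook: one column per candidate fundamental.

    Each column holds Gaussian-blurred harmonic peaks at k*f0.  A low
    MIDI number (low f0) packs its harmonics close together; a high
    MIDI number spreads them apart -- the property described in the talk.
    """
    freqs = np.arange(n_bins) * (sr / 2) / (n_bins - 1)   # bin centre freqs
    bin_hz = freqs[1] - freqs[0]
    P = np.zeros((n_bins, len(midi_numbers)))
    for j, n in enumerate(midi_numbers):
        f0 = midi_to_hz(n)
        # all harmonics below the Nyquist frequency
        for k in range(1, int((sr / 2) // f0) + 1):
            P[:, j] += np.exp(-0.5 * ((freqs - k * f0) / (width * bin_hz)) ** 2)
    return P

P = make_pf0(midi_numbers=[55, 70])
# MIDI 55 (~196 Hz) fits more harmonics below Nyquist than MIDI 70 (~466 Hz),
# so its column carries more total energy -- the normalization issue raised later.
```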
0:04:38 The A^F0 matrix contains the combination coefficients for these basis vectors. 0:04:45 For example, if we look at this column, there is a large coefficient for the basis vector at MIDI number 60 and a smaller one for a basis vector at a higher MIDI number. 0:05:06 If you plot the whole A^F0 matrix, you can actually see the pitch contour on it, which is the dark line here. 0:05:16 The line above it is probably the second harmonic, and the small lines may be the accompaniment.
0:05:27 The procedure for melody extraction and soft masking using NMF is as follows. 0:05:33 First, we fix the P^F0 matrix as shown in the previous slide, 0:05:38 and then we solve for the other four matrices using an iterative procedure; we are especially interested in A^F0. 0:05:49 Next, we find the strongest continuous pitch track on this A^F0 matrix using dynamic programming, 0:05:57 and then we clear the entries that are far from this continuous pitch track. 0:06:05 With this new A^F0 we solve for the other four matrices again, which gives a more accurate estimate. 0:06:12 After solving for all the matrices in the decomposition of the power spectrogram, we can use Wiener filtering to estimate the complex spectrograms of the voice and the accompaniment, and then convert these into the time domain with the overlap-add method. 0:06:29 This gives us estimates of the voice and accompaniment signals, respectively.
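The Wiener filtering step at the end of this procedure can be sketched as follows (an illustrative snippet using the standard soft-mask formulation; the function name and toy data are mine):

```python
import numpy as np

def wiener_separate(X, D_voice, D_accomp, eps=1e-12):
    """Soft-mask separation by Wiener filtering.

    X         : complex STFT of the mixture
    D_voice   : estimated power spectrogram of the voice
    D_accomp  : estimated power spectrogram of the accompaniment
    Returns the complex STFT estimates of the two sources; each would
    then go through an inverse STFT with overlap-add to get a waveform.
    """
    mask = D_voice / (D_voice + D_accomp + eps)   # soft mask in [0, 1]
    return mask * X, (1.0 - mask) * X

# Toy check on random data: the two estimates always sum back to the mixture.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4)) + 1j * rng.normal(size=(5, 4))
Dv = rng.random((5, 4))
Da = rng.random((5, 4))
V, A = wiener_separate(X, Dv, Da)
```

A hard mask would instead set each time-frequency bin entirely to one source; the soft mask splits each bin in proportion to the estimated source powers.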
0:06:34 Now comes the most important part of my lecture: 0:06:38 we find that non-negative matrix factorization does not work well enough for melody extraction. 0:06:47 The A^F0 matrix I showed on the previous slides was an idealized one; the actual A^F0 we get looks like this. 0:06:55 You can see that there is a great imbalance across frequencies: 0:07:00 for high frequencies the A^F0 values are large, and for the lower ones they are small.
0:07:09 We have identified two causes for this imbalance. The first is the nonlinearity of the MIDI number scale we are using. 0:07:17 The MIDI number scale is a logarithmic scale of frequency, 0:07:38 so for the same amount of energy in the low-frequency range, there are more basis vectors to divide it among, and the coefficients of the individual basis vectors come out smaller than in the high-frequency range. 0:07:53 This is one of the reasons why the A^F0 matrix has smaller values in the lower frequency range.
0:08:03 To compensate for this imbalance, we multiply a correction term into the A^F0 matrix. 0:08:11 Here f is the frequency in hertz and n is the MIDI number. 0:08:17 The first derivative of f with respect to n is inversely proportional to the density of the basis vectors at a certain frequency, 0:08:27 and by dividing by df/dn, that is, multiplying by the density of the basis vectors, we can make the values at the lower frequencies a bit larger.
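A minimal sketch of this first compensation, assuming the standard MIDI-to-frequency mapping f(n) = 440 * 2^((n-69)/12), so that df/dn = (ln 2 / 12) * f(n); the exact scaling used in the paper is not stated here, so the weight below is an assumption:

```python
import numpy as np

def midi_to_hz(n):
    """MIDI note number -> frequency in Hz (A4 = MIDI 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((np.asarray(n, dtype=float) - 69) / 12.0)

def density_compensate(A, midi_numbers):
    """Multiplicative compensation for the nonlinearity of the MIDI scale.

    df/dn = (ln 2 / 12) * f(n) grows with frequency, so the density of
    basis vectors per Hz, dn/df, is larger at low frequencies.  Dividing
    row n of A^F0 by df/dn (i.e. multiplying by the density) lifts the
    low-frequency coefficients.  The absolute scale is arbitrary here.
    """
    f = midi_to_hz(midi_numbers)
    dfdn = (np.log(2.0) / 12.0) * f        # derivative of f w.r.t. MIDI number
    return A / dfdn[:, None]

A = np.ones((3, 4))                         # flat coefficients, 3 F0 candidates
Ac = density_compensate(A, midi_numbers=[39, 55, 74])
# Low-MIDI rows are boosted more than high-MIDI rows.
```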
0:08:43 The second cause we identified is that the columns of the P^F0 matrix are not normalized. 0:08:50 As you can see, for a lower MIDI number like 55 there are more harmonics, 0:08:57 and since the amplitudes of these higher harmonics are similar, 0:09:00 a low-frequency basis vector with more harmonics also has a higher total energy. 0:09:07 This again contributes to the imbalance in A^F0. 0:09:11 To compensate for it, we multiply each element of the A^F0 matrix by the total energy of the basis vector at the corresponding fundamental frequency. 0:09:27 These are the two compensations that we came up with.
0:09:33 In Durrieu's original paper he also came up with a compensation, which is not multiplicative like ours but additive. 0:09:43 Basically, what it means is that for each element of the A^F0 matrix, half of the value of the element one octave higher is added to the original element.
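As described, this additive rule can be sketched like this (the handling of the top octave, which has no higher-octave partner, is my assumption):

```python
import numpy as np

def additive_octave_compensation(A, bins_per_octave=12):
    """Additive compensation as described for Durrieu's method: to each
    element of A^F0, add half of the element one octave higher in the
    same frame.  Rows within one octave of the top have no higher-octave
    partner and are left unchanged -- that boundary choice is my assumption.
    """
    Ac = A.copy()
    n_rows = A.shape[0]
    Ac[:n_rows - bins_per_octave] += 0.5 * A[bins_per_octave:]
    return Ac

# Toy usage: a single frame with energy only at the octave row (row 12)
# leaks half of it down to the fundamental row (row 0).
A = np.zeros((24, 1))
A[12, 0] = 1.0
Ac = additive_octave_compensation(A)
```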
0:09:57 But the effect of these compensations is not so good. 0:10:01 As you can see, the leftmost figure is the original A^F0 matrix, the middle one is the A^F0 matrix compensated using Durrieu's additive compensation, and the rightmost one uses our multiplicative compensation. 0:10:16 After applying these compensations, the values at the lower frequencies of the A^F0 matrix do get larger, but if you look at the pitch contours extracted with dynamic programming, 0:10:33 you see that they lie above the true pitch contour, which is just the result of the remaining imbalance. 0:10:41 So our conclusion here is: even if you compensate the A^F0 matrix, you cannot totally eliminate the imbalance, and it can have a bad effect on the pitch contour that you extract with dynamic programming.
0:10:57 Therefore, we propose our own HMM-based melody extraction. 0:11:02 The feature we use is called the energy at semitones of interest (ESI), which is the integral of a salience function within each semitone; we use 36 dimensions, covering the MIDI numbers from 39 to 74. 0:11:19 The salience function is a weighted sum over the spectrum of the given audio signal. 0:11:29 In the map shown here, the red parts are large values and the blue parts are small values, and you can actually see the melody on this salience function map. 0:11:47 For the signal, we calculate the salience function at a step of 0.1 MIDI numbers, 0:11:55 which gives a feature of more than 300 dimensions, too many for the HMM; 0:12:02 therefore we integrate it into the ESI features at the 36 semitones.
0:12:09 We also use these semitones as the states of the HMM. They are fully connected, and the output probability of each state is modeled with an eight-component GMM. 0:12:20 The parameters of this HMM are trained on the MIR-1K database, which is annotated at the frame level with the fundamental frequency.
0:12:30 If we do Viterbi decoding on the ESI features with this HMM, we get a pitch contour with a granularity of one semitone. 0:12:43 In order to get a fine pitch track, with a granularity of 0.1 semitones, 0:12:49 we then take the maximum of the salience function map within a 0.5-semitone range around the coarse pitch.
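The coarse decoding step can be sketched as a standard Viterbi pass over the semitone states (illustrative only; in the actual system the emission scores would come from the per-state GMMs evaluated on the ESI features):

```python
import numpy as np

def viterbi(log_emit, log_trans, log_init):
    """Viterbi decoding: most likely state sequence.

    log_emit  : (T, S) log output probabilities per frame and state
                (e.g. the per-semitone GMMs evaluated on ESI features)
    log_trans : (S, S) log transition matrix (fully connected here)
    log_init  : (S,)   log initial state probabilities
    Returns the best state index for each of the T frames.
    """
    T, S = log_emit.shape
    delta = log_init + log_emit[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans          # scores[i, j]: i -> j
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(S)] + log_emit[t]
    path = np.zeros(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 1, 0, -1):                    # backtrack
        path[t - 1] = back[t, path[t]]
    return path

# Toy usage: 3 states, 4 frames, emissions strongly favouring state 1.
log_emit = np.log(np.array([[0.1, 0.8, 0.1]] * 4))
log_trans = np.log(np.full((3, 3), 1 / 3))
path = viterbi(log_emit, log_trans, np.log(np.full(3, 1 / 3)))
```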
0:13:00 Now I will show you how our HMM-based melody tracking contrasts with the NMF-based pitch tracking, and also the effect of NMF-based soft masking in contrast with hard masking. 0:13:17 The evaluation corpora we use are the MIR-1K database and some publicly available clips. 0:13:26 The items we evaluate include the separate modules and also the overall performance.
0:13:34 First, for melody extraction, we compare our system with Hsu's, which is also based on an HMM and ESI features, 0:13:46 but their ESI features are defined differently from ours, and they use two streams of features while we use only one stream. 0:13:54 The performance of the two systems is comparable.
0:14:00 The main result of our paper is here: this is the comparison of the pitch tracking of our proposed HMM-based method and Durrieu's NMF-based method. 0:14:11 If you look at the accuracy, ours is much higher than the raw NMF, and also higher than the compensated NMF. 0:14:21 These plots show the distribution of the errors. You can see that for our HMM-based method there are not very many errors, and they sit mostly one octave high, at +12 semitones, or one octave low, at -12 semitones. 0:14:39 For the NMF, you can see that the errors are distributed across a large range. 0:14:51 This is due to the imbalance in the A^F0 matrix: if you use dynamic programming, it will always pick something above the true pitch contour, 0:15:01 and even if you do the compensation, the errors are not completely cleared.
0:15:13 Also worth mentioning is that, because our HMM-based pitch tracking method is trained offline and the online part does not involve any iterations, it runs six to seven times faster than the iterative NMF.
0:15:32 For the time-frequency masking, we compare our system with the hard masking system of Hsu, 0:15:39 and we evaluate them at three mixing SNRs: -5, 0 and +5 dB. 0:15:48 First, look at the blue squares, where we use the annotated pitch track, so that we isolate the time-frequency masking part. 0:16:00 You can see that at all the SNRs our system performs better. 0:16:07 It is also worth mentioning that our performance with the annotated pitch track, which uses soft masking, gets close to, or even exceeds, hard masking with the ideal masks, which is kind of an upper bound for hard masking.
0:16:26 For the overall evaluation we use the extracted pitch tracks for both systems, and we see that ours also performs better than the hard masking system.
0:16:39 Now we compare our overall system with Durrieu's, which is completely based on NMF. 0:16:48 I would like to play some examples to you. [audio plays] 0:16:58 That was the mixture, and this is the separation result using Durrieu's NMF-based method. [audio plays]
0:17:14 You can hear that for the last note, the extracted pitch is twice the true pitch. [audio plays] 0:17:29 You can also hear that some of the voice remains in the accompaniment. [audio plays] 0:17:42 With our system, the pitch extracted for the last note is correct. [audio plays] 0:17:54 And the accompaniment here is cleaner than with Durrieu's system.
0:17:58 And if you look at the full set of results, for some of the clips our system performs better and for some of them it is worse; 0:18:06 the difference is mainly determined by the performance of the melody extraction.
0:18:15 Okay, so for the conclusion: 0:18:18 our main conclusions are that NMF-based melody extraction suffers from an imbalance in the A^F0 matrix, and our HMM-based method does this job better and also runs faster; 0:18:29 and for the TF masking, NMF-based soft masking is much better than hard masking. 0:18:36 So we propose the combination of HMM-based melody extraction and NMF-based soft masking. 0:18:40 Thank you.
0:18:47 Any questions? We have time for a few. 0:18:51 Yes, please.
0:18:55 Thank you for your talk. I have one question... two questions, actually. 0:19:01 The first: your method is supervised, it involves some learning, 0:19:06 while you compare it to Durrieu's method, which is completely unsupervised. 0:19:11 So my question is: in which way is the learning you do generic, so that it can be applied to completely different signals? 0:19:22 And my second question: do you have sound samples where your method performs slightly worse than Durrieu's method? 0:19:29 If you could play some of them, that would be nice.
0:19:32 Oh, okay. 0:19:35 The audio of all the separated examples will be available on a demo web page; the URL is included in our paper.
0:19:45 As for the first question: we use this supervised method because we find that the imbalance hurts the results very much. 0:19:59 And actually, Durrieu's compensation is a kind of ad hoc, rule-based compensation, 0:20:07 like this one, so his method is not completely unsupervised either: 0:20:11 he also looked at what the imbalance looks like and designed a rule to compensate for it. 0:20:21 Our HMM training learns what the imbalance looks like by an automatic learning method.
0:20:32 Okay, let's move on. 0:20:34 Okay, thank you.