0:00:21 Good afternoon everyone. My name is Jinyu Han, and I'm from the Interactive Audio Lab at Northwestern University. I'm going to present "Melody Extraction Using Probabilistic Latent Component Analysis." This is joint work with Ching-Wei Chen from the Media Technology Lab at Gracenote.
0:00:44 Here is how the talk will go: first I'm going to give a brief introduction to our system, then introduce the model we use in the system, then describe our system and present the experimental results, and at the end I'll conclude the talk.
0:01:03 In this paper we treat the singing voice as the melody of the song. That is not always true, but it is true in a lot of cases.
0:01:15 This figure is a brief overview of the melody extraction system. We have an audio signal, and the audio is first segmented into non-vocal segments and vocal segments by a singing voice detection module. The non-vocal segments of the audio signal are the segments that contain only the accompaniment of the song, and the vocal segments are the segments which contain both the singing voice and the accompaniment.
0:01:46 In the next step, we train an accompaniment model from the non-vocal segments using PLCA, which I'm going to discuss later in this talk. After we train the accompaniment model, it is applied to the vocal segments to extract the singing voice of the song. In the end, a pitch tracking algorithm is applied to the extracted singing voice to extract the melody of the song.
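The four stages just described can be summarized in a short sketch. Every function name here is a hypothetical placeholder rather than the authors' code; the PLCA and pitch-tracking stages are fleshed out in later sketches below.

```python
# Minimal sketch of the pipeline described above. All helper names are
# hypothetical placeholders, not the authors' implementation.

def extract_melody(audio, sr):
    # 1. Singing voice detection: split into vocal / non-vocal segments.
    vocal_segments, nonvocal_segments = detect_vocal_segments(audio, sr)
    # 2. Train an accompaniment model on the non-vocal audio with PLCA.
    accompaniment_dict = train_accompaniment_plca(nonvocal_segments)
    # 3. Apply the accompaniment model to the vocal segments to extract
    #    the singing voice.
    voice = extract_voice(vocal_segments, accompaniment_dict)
    # 4. Apply a pitch tracker to the extracted voice to get the melody.
    return track_pitch(voice, sr)
```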
0:02:16 Now I'm going to introduce the main model used in our system. We start from a single slice of the spectrogram; that single slice is the spectrum shown on the right of the figure. We treat this spectrum as a histogram, and we assume this histogram was generated by some kind of probability distribution. It's just a multinomial distribution.
0:02:47 If we look at the spectrogram at different time frames of the signal, we can see that different frames have different spectral vectors. We could possibly use a separate multinomial distribution for every single frame of the signal, but in that case we would end up with one component per frame, which would be a huge dictionary. Instead, we use a dictionary of, say, one hundred spectral vectors, and each spectrum of the song is modeled as a linear combination of those spectral vectors.
0:03:34 This gives us our model, which is probabilistic latent component analysis. We have a dictionary of spectral vectors which we need to learn from the spectrogram of the song. It is called latent component analysis because these parameters, the components, are latent variables. Here, these are the mixture weights and these are the spectral vectors, and we model each spectrum as a linear combination of the spectral vectors. The parameters of our model can be estimated by the expectation-maximization algorithm. I'm not going to go into the details of the estimation; you can refer to our paper for the details.
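For reference, here is one standard formulation of PLCA and its EM updates, in the common form from the PLCA literature; the paper's exact notation may differ.

```latex
% One common (asymmetric) PLCA formulation; notation is assumed, not
% necessarily identical to the paper's.
P_t(f) = \sum_{z} P(f \mid z)\, P_t(z)
% P(f|z): the dictionary of spectral vectors (a multinomial over
%         frequency for each latent component z)
% P_t(z): the per-frame mixture weights

% EM updates, with V_{ft} the magnitude spectrogram treated as counts:
\text{E-step:}\quad
P_t(z \mid f) = \frac{P(f \mid z)\, P_t(z)}{\sum_{z'} P(f \mid z')\, P_t(z')}

\text{M-step:}\quad
P(f \mid z) \propto \sum_{t} V_{ft}\, P_t(z \mid f), \qquad
P_t(z) \propto \sum_{f} V_{ft}\, P_t(z \mid f)
```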
0:04:22 So now we have the model that we are going to use for the singing voice extraction. Due to the time constraint, I'm going to focus on just this part of our paper. This is the audio signal, and our singing voice detection algorithm segments the signal into different parts: this part is the non-vocal segment and this part is the vocal segment. This is just an illustrative example; in a real case the segmentation is going to be much more complicated. It is not the case that the non-vocal segment is all at the beginning and the vocal part is all at the end; I just want to give a simplified illustration here.
0:05:09 After we identify the non-vocal segments, we use the PLCA model to train a dictionary of spectral vectors for the accompaniment. Now we have a dictionary which can explain only the accompaniment. In the next step, for the vocal segments, we run the PLCA training as usual, but we fix some of the components to the spectral vectors already pre-trained on the non-vocal segments, and we leave some free components to explain the remaining part, the singing voice. So in the end we have two different groups in the dictionary: one group is the pre-trained, fixed components for the non-vocal part, and the other group is newly trained from the vocal segments and will most likely explain the singing voice of the song.
0:06:04 At the end, we reconstruct the signal separately: we use the newly learned components to reconstruct the singing voice, and we use the fixed non-vocal components to reconstruct the accompaniment within the vocal segments. Finally, a simple pitch estimation algorithm is applied to the extracted singing voice to extract the pitch of the song.
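A minimal sketch of this fixed-plus-free PLCA step in NumPy. The updates below are the standard PLCA EM updates; the component counts, iteration budget, random initialization, and the Wiener-style resynthesis at the end are assumptions, not details confirmed in the talk.

```python
import numpy as np

def plca_separate(V_vocal, W_acc, n_free=30, n_iter=100, eps=1e-12):
    """Semi-supervised PLCA sketch.

    V_vocal : magnitude spectrogram of the vocal segments, shape (F, T).
    W_acc   : pre-trained accompaniment dictionary P(f|z), shape (F, Za),
              columns summing to 1; these columns stay FIXED.
    Returns voice and accompaniment magnitude estimates.
    """
    F, T = V_vocal.shape
    Za = W_acc.shape[1]

    # Free components that are allowed to learn the singing voice.
    W_free = np.random.rand(F, n_free)
    W_free /= W_free.sum(axis=0, keepdims=True)

    W = np.hstack([W_acc, W_free])       # P(f|z), shape (F, Za + n_free)
    H = np.random.rand(Za + n_free, T)   # P_t(z), per-frame weights
    H /= H.sum(axis=0, keepdims=True)

    for _ in range(n_iter):
        # Update the mixture weights (always free to adapt).
        R = V_vocal / (W @ H + eps)      # ratio of data to reconstruction
        H *= W.T @ R
        H /= H.sum(axis=0, keepdims=True) + eps
        # Update the dictionary, but ONLY the free (voice) columns;
        # the accompaniment columns are kept fixed.
        R = V_vocal / (W @ H + eps)
        W_new = W * (R @ H.T)
        W[:, Za:] = W_new[:, Za:] / (W_new[:, Za:].sum(axis=0,
                                                       keepdims=True) + eps)

    # Wiener-style reconstruction: split V by each group's share.
    WH_acc = W[:, :Za] @ H[:Za, :]
    WH_voc = W[:, Za:] @ H[Za:, :]
    total = WH_acc + WH_voc + eps
    return V_vocal * (WH_voc / total), V_vocal * (WH_acc / total)
```

The voice estimate would then be inverted back to audio and passed to the pitch tracker described next.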
0:06:40 I want to mention that a very similar system was proposed in this paper. The difference between our system and theirs is that our system has a pre-trained singing voice detection module, so it is fully automatic, while in their paper they detected the singing voice manually and labeled the training data by hand.
0:07:17 We also evaluated our system with experiments; this is just one example from the paper. Here is an illustrative example of our system. On the top of the graph is the original song, the polyphonic mixture. The second figure is the singing voice extracted when our algorithm is applied to the polyphonic music. The third figure is the original, separately recorded singing voice. And the last figure is the melody estimated from the extracted singing voice. Now let's listen to the example.
0:08:23 [audio examples play]
0:08:44 As you can easily hear, the system is not state of the art; it just aims to suppress the accompaniment while keeping the singing voice. [audio example plays] 0:08:59 The original, voice-only recording is of course much better.
0:09:10 After we extract the singing voice, we apply a very simple autocorrelation-based pitch estimation algorithm to it. You can still hear some accompaniment in the extracted singing voice, but the autocorrelation still works: in this example we extracted eighty percent of the pitches of the singing voice correctly.
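As an illustration, here is a bare-bones autocorrelation pitch estimator of the kind the talk alludes to. The frame length, search range, and the lack of a voicing decision are simplifying assumptions on the editor's part.

```python
import numpy as np

def autocorr_pitch(frame, sr, fmin=80.0, fmax=1000.0):
    """Estimate the pitch of one (assumed voiced) frame via the peak of
    its autocorrelation. frame: 1-D array; sr: sample rate in Hz."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = max(1, int(sr / fmax))    # shortest lag = highest pitch
    lag_max = min(int(sr / fmin), len(ac) - 1)
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
    return sr / lag                     # fundamental frequency in Hz
```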
0:09:35 We also did a comparison to two other systems. The first system is a multi-pitch estimation system developed in our lab; since it is a multi-pitch estimation system, we treat the first pitch it estimates in each frame as the melody of the song. The second system is a singing voice extraction system. The third row is the result of our system. As you can see, our system has better recall, F-measure, and accuracy compared to the other systems, and comparable precision compared to the best system on the precision evaluation.
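For reference, the frame-level metrics in this comparison are presumably the standard definitions below; the paper may follow the exact MIREX melody-extraction protocol.

```latex
% Standard frame-level definitions (assumed; see the paper for the
% exact evaluation protocol):
\mathrm{Precision} = \frac{\#\{\text{frames with correct pitch}\}}
                          {\#\{\text{frames estimated as melody}\}},\qquad
\mathrm{Recall} = \frac{\#\{\text{frames with correct pitch}\}}
                       {\#\{\text{reference melody frames}\}}

F\text{-measure} = \frac{2\,\mathrm{Precision}\cdot\mathrm{Recall}}
                        {\mathrm{Precision} + \mathrm{Recall}}
```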
0:10:16 The second system has relatively low performance. We believe that is because its singing voice extraction algorithm only uses the predominant pitch estimation result, and sometimes the extracted pitch is not the singing voice pitch but the pitch of an accompaniment instrument; the system may treat the accompaniment as the predominant pitch. So we believe that tuning the parameters of the second system would increase its performance.
0:11:00 To conclude the paper: a probabilistic latent variable model was introduced to separate the accompaniment and the lead singing voice in our system, and the experimental results show that the pitch of the singing voice can be effectively extracted by the method described in the paper.
0:11:24 There are also some future directions for us. Our current singing voice detection algorithm is based on a Gaussian mixture model, so it is data-dependent and there is still room for improvement; we want to do further research on that. We also want a better pitch estimation algorithm for the singing voice, and a better singing voice detection module.
0:11:57 That concludes our paper. As I said, I did this work while I was doing an internship at the Media Technology Lab at Gracenote, so I want to thank my colleagues at Gracenote for the useful discussions. We also want to thank the reviewers of our paper for helping improve the paper, and I want to thank my advisor and my colleagues in the lab who helped me improve this presentation. Thank you.
0:12:39 Q: Yes, hi, my name is Gaël, from Télécom ParisTech. I have one question concerning the segmentation. You have not talked much about the segmentation, but I guess that if you have errors in the segmentation, you would probably get a worse separation between the two. Can you give us some hints on how well you perform the segmentation?
0:13:01 A: Sure. The segmentation is based on a Gaussian mixture model. We trained the Gaussian mixture model on around fifty manually labeled songs: commercial music, including pop music and rock music. We pre-train this model once, and then apply it to new tracks as they come in. The accuracy of the singing voice detection module is around seventy percent.
0:13:36 And as you mentioned, our system depends on the performance of the singing voice detection module. If the singing voice detection module does not work, the system will fail, because it will treat some vocal segments as non-vocal segments, and then the dictionary we train from that part will also explain some of the singing voice. In that case the system will not work well. That is something I want to research in the future.
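A minimal sketch of a GMM-based vocal/non-vocal frame classifier like the one described in this answer. The features, component count, and post-smoothing are assumptions; the talk only states that the model was trained on about fifty labeled songs.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_vocal_detector(vocal_feats, nonvocal_feats, n_components=16):
    """Fit one GMM per class on frame-level features (e.g. MFCCs).
    Each input is an array of shape (n_frames, n_dims) pooled from the
    manually labeled training songs."""
    gmm_vocal = GaussianMixture(n_components=n_components).fit(vocal_feats)
    gmm_nonvocal = GaussianMixture(n_components=n_components).fit(nonvocal_feats)
    return gmm_vocal, gmm_nonvocal

def classify_frames(feats, gmm_vocal, gmm_nonvocal):
    """Label each frame 1 (vocal) or 0 (non-vocal) by comparing per-frame
    log-likelihoods; in practice the labels would then be smoothed (for
    example with a median filter) before cutting the track into segments."""
    return (gmm_vocal.score_samples(feats)
            > gmm_nonvocal.score_samples(feats)).astype(int)
```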
0:14:07 Q: I have a question for you. Your audio example was wonderful, but it seemed like the major drawback was the fact that I could hear the cymbals coming through in the vocal; more than anything else, that was what was popping out.
0:14:19 A: On one hand, the system automatically trains the latent dictionary to explain the accompaniment, so one possible reason is that the detected non-vocal segments did not contain the cymbal, or the cymbal was not predominant in them in this experiment. That is my best guess at an explanation. Also, when we do the segmentation, we try to use the non-vocal segments near a vocal segment to explain it; we do not want to use a non-vocal segment near the end of the song to explain a vocal segment at the beginning. The assumption is that the accompaniment sound is locally consistent, but it can change over the course of the song.
0:15:15 Are there any more questions?