0:00:18 | Hello everyone. |
0:00:20 | My name is Berrak Sisman, from the Singapore University of Technology and Design. |
0:00:25 | Today I will be talking about generative adversarial networks for singing |
0:00:31 | voice conversion. |
0:00:32 | We have conducted this research together |
0:00:36 | with our co-authors. |
0:00:42 | So, the basic definition of singing voice conversion is that we convert |
0:00:47 | one singer's voice to sound like that of another singer, without changing the lyrical content. |
0:00:53 | You can also see an illustration of it here. |
0:00:56 | We have a source singer |
0:00:58 | who is singing "Mamma Mia, here I go again", |
0:01:01 | and we apply the singing voice conversion here, |
0:01:04 | and it changes the vocal identity, |
0:01:07 | so that it sounds as if this lady is singing the same song. |
0:01:13 | And I would like to highlight that singing carries lexical and emotional information, |
0:01:18 | and all |
0:01:20 | of this is being transferred from the source to the target speaker. |
0:01:26 | So, in this paper we propose novel solutions to singing voice conversion based |
0:01:31 | on generative adversarial networks, with and without parallel training data. |
0:01:37 | And let's |
0:01:39 | briefly talk about singing voice conversion. |
0:01:43 | Singing voice conversion is not an easy task, |
0:01:47 | because singing in itself is not easy, |
0:01:50 | and to mimic someone's singing is even more difficult. |
0:01:54 | Professional singers are trained to control and vary their vocal timbre, |
0:01:58 | but they are bound by the physical limits of their own voice production system. |
0:02:03 | Singing voice conversion provides an extension to one's own voice: the ability to control |
0:02:08 | the voice |
0:02:09 | beyond its physical limits and to be expressive |
0:02:12 | in new ways. |
0:02:16 | So, singing voice conversion has lots of applications, and some of them are listed here, |
0:02:21 | such as singing synthesis, dubbing of soundtracks, |
0:02:24 | and personalized singing. |
0:02:26 | And there is also a challenge here that I would like to highlight: |
0:02:30 | singing is a fine art, and any distortion of the singing voice cannot |
0:02:35 | be tolerated. |
0:02:38 | So, if you know singing voice conversion, you may be thinking: there is also speech voice conversion, so what |
0:02:43 | is the difference between singing voice conversion and traditional voice conversion? Well, they share |
0:02:49 | a similar motivation: |
0:02:50 | conventional speech voice conversion is what we also call voice identity conversion. |
0:02:55 | Singing voice conversion differs from speech voice conversion in many ways that are listed |
0:03:00 | here. |
0:03:01 | To start with, in traditional speech voice conversion, speech prosody, which includes speech dynamics and durational variations, |
0:03:09 | is part of speaker individuality; |
0:03:11 | therefore, we need to transform it from the source to the target speaker. |
0:03:17 | In singing voice conversion, the melody of the singing is governed by the |
0:03:22 | sheet music itself, so it is considered to be speaker-independent. |
0:03:27 | Therefore, in singing voice conversion, only the characteristics of voice identity, |
0:03:32 | such as the spectrum, |
0:03:34 | are considered speaker-dependent and are transformed to the target singer. |
0:03:38 | So, in this paper we will only focus on the spectrum conversion |
0:03:42 | aspect of singing voice conversion. |
0:03:46 | Before starting to talk about our proposed singing voice conversion models, I would like |
0:03:51 | to talk briefly about generative adversarial networks and why we use them in this |
0:03:55 | paper. |
0:03:56 | The traditional generative adversarial network performs the generative and discriminative training that |
0:04:01 | you may already know. |
0:04:03 | Generative adversarial networks have recently been shown to be effective |
0:04:08 | in many fields, |
0:04:09 | listed below: image generation, image translation, speech enhancement, language identification, |
0:04:16 | text-to-speech synthesis, and even speech voice conversion. |
0:04:20 | And in this paper, we propose generative adversarial |
0:04:26 | network-based frameworks for singing voice conversion, with and without parallel |
0:04:31 | training data. |
0:04:34 | So, |
0:04:35 | I would like to list our contributions here. To start with, we propose a singing |
0:04:39 | voice conversion framework |
0:04:41 | that is based on generative adversarial networks. |
0:04:44 | It achieves high-quality converted singing voice without an external module, such as a speech recognizer, which |
0:04:51 | is not very easy to train. |
0:04:53 | High-quality singing can also be achieved by parallel-data-free singing voice conversion, beating |
0:04:59 | the baseline. |
0:05:00 | And last but not least, we reduce the reliance on large amounts of data, |
0:05:05 | in both the parallel and non-parallel training scenarios. |
0:05:09 | We would like to note that this paper reports the first successful attempt |
0:05:13 | to use generative adversarial networks, |
0:05:16 | CycleGAN in particular, for singing voice conversion. |
0:05:22 | Conventional singing voice conversion frameworks assume parallel training data, |
0:05:27 | and statistical methods such as Gaussian mixture models have been applied with success to |
0:05:32 | singing voice conversion. |
0:05:34 | We have listed |
0:05:35 | some of these works here; they present great ideas, |
0:05:39 | but they do not use deep learning most of the time, and we know that deep |
0:05:43 | learning has had a positive impact in many fields, with no exception for singing voice conversion. |
0:05:49 | In this paper, we propose to use GANs to learn the essential differences between |
0:05:54 | the source singing and the original target singing through a discriminative process, namely adversarial training. |
0:06:00 | In this paper, we further study DNN-based processing as part of the possible |
0:06:05 | solutions to singing voice conversion, |
0:06:07 | in a comparative study. |
0:06:13 | So let's start with the training phase of the parallel framework; there are three main steps, provided here. |
0:06:20 | The first step |
0:06:22 | is to perform |
0:06:23 | WORLD analysis |
0:06:26 | to obtain the spectral and prosodic features, as shown here with the WORLD vocoder. |
0:06:32 | The second step is to use the dynamic time warping algorithm for temporal alignment of |
0:06:37 | source and target singing spectral features; it is also shown here in blue. |
0:06:42 | The aligned features are then |
0:06:44 | used for the GAN training. |
0:06:48 | And the last step is to train the generative adversarial network by using the |
0:06:52 | aligned source and target singing features. |
0:06:54 | I would like to highlight one more time that we have data from the |
0:06:58 | source and target singers, and they are singing the same songs. |
0:07:03 | This is what we call parallel training data for singing voice conversion. |
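The temporal-alignment step just described can be sketched as follows. This is a minimal NumPy implementation of classic dynamic time warping over frame-level spectral features; the feature dimensionality, the Euclidean frame distance, and the toy data are assumptions for illustration, not the authors' actual code.

```python
import numpy as np

def dtw_align(src, tgt):
    """Align two feature sequences (frames x dims) with dynamic time warping.

    Returns index pairs (i, j) so that src[i] and tgt[j] form aligned
    frame pairs for training the conversion model.
    """
    n, m = len(src), len(tgt)
    # Frame-wise Euclidean distance matrix.
    dist = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=-1)
    # Accumulated cost with the standard (match / insert / delete) recursion.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    # Backtrack from (n, m) to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Toy example: the "target" is the source sung at half speed.
src = np.array([[0.0], [1.0], [2.0], [3.0]])
tgt = np.repeat(src, 2, axis=0)
path = dtw_align(src, tgt)
```

Each source frame ends up paired with the matching target frames despite the tempo difference, which is exactly what the GAN training needs from this step.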
0:07:08 | I would also like to highlight that previous studies have shown |
0:07:13 | that, in singing voice conversion, it is not always necessary to transform the pitch (F0) values |
0:07:19 | from the source to the target singer, assuming both singers sing in a similar key, |
0:07:24 | because an unrealistic F0 usually has a negative |
0:07:28 | impact on the converted singing voice. |
0:07:31 | So therefore, in this paper, |
0:07:34 | apart from converting the spectral features, we keep the F0 of the singing voice where it is. |
0:07:42 | At run-time conversion, we again have three main steps, |
0:07:46 | provided here. The first step is to extract the source singing features using WORLD analysis. |
0:07:52 | The second step is to generate the converted singing spectral features by using the |
0:07:57 | generator, which was already trained during the training phase. |
0:08:01 | And last but not least, we generate the converted singing voice by using |
0:08:05 | WORLD synthesis. |
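The three run-time steps can be sketched as a small pipeline. Here the WORLD analysis and synthesis stages are stand-in placeholder functions (a real system would use a vocoder library such as pyworld), and the "generator" is a toy linear map; only the structure of the pipeline is the point.

```python
import numpy as np

def world_analysis(waveform):
    """Placeholder for WORLD analysis: returns (spectral, f0, aperiodicity).
    A real system would extract these with the WORLD vocoder."""
    frames = waveform.reshape(-1, 4)          # fake "frames" for the sketch
    return frames, np.ones(len(frames)), np.zeros(len(frames))

def world_synthesis(spectral, f0, ap):
    """Placeholder for WORLD synthesis: flattens frames back to a signal."""
    return spectral.reshape(-1)

def convert(waveform, generator):
    """Run-time conversion: analyse, map the spectra with the trained
    generator, keep F0 and aperiodicity, then re-synthesise."""
    sp, f0, ap = world_analysis(waveform)
    sp_converted = generator(sp)              # the trained GAN generator
    return world_synthesis(sp_converted, f0, ap)

# Toy "generator": a fixed linear map standing in for the trained network.
generator = lambda sp: sp * 0.5 + 1.0
out = convert(np.arange(16.0), generator)
```

Note that only the spectral features pass through the generator; the prosodic features are carried over unchanged, matching the F0 discussion above.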
0:08:07 | I would like to highlight that, in this paper, |
0:08:10 | inspired by the previous studies, we adopt the F0 from the original source |
0:08:14 | for the intra-gender singing voice conversion experiments. |
0:08:18 | For the inter-gender singing voice conversion experiments, we performed F0 conversion, |
0:08:24 | in all the experiments that we report in this paper, in order to match the source |
0:08:31 | singing to the target singer's pitch range. |
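The talk does not spell out how the inter-gender F0 conversion is done. A common choice in voice conversion, shown here purely as an assumption and not necessarily the authors' exact method, is a linear transform in the log-F0 domain using the training-set log-F0 statistics of both singers:

```python
import numpy as np

def convert_f0(f0_src, mu_src, sigma_src, mu_tgt, sigma_tgt):
    """Linear log-F0 transform: match the source's log-F0 mean and
    variance to the target's. Unvoiced frames (f0 == 0) stay at 0."""
    f0_out = np.zeros_like(f0_src)
    voiced = f0_src > 0
    log_f0 = np.log(f0_src[voiced])
    f0_out[voiced] = np.exp(
        (log_f0 - mu_src) / sigma_src * sigma_tgt + mu_tgt)
    return f0_out

# Example: a male source (~120 Hz) converted towards a female target (~220 Hz).
f0 = np.array([0.0, 110.0, 120.0, 130.0, 0.0])
out = convert_f0(f0, mu_src=np.log(120.0), sigma_src=0.1,
                 mu_tgt=np.log(220.0), sigma_tgt=0.1)
```

With equal variances, this reduces to shifting every voiced frame by the ratio of the two singers' mean pitches, which is why it preserves the melody while moving the key.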
0:08:33 | So this is all |
0:08:36 | for the parallel-data case. |
0:08:37 | What about singing voice conversion without parallel training data? |
0:08:42 | Before we discuss singing voice conversion itself, I would like to highlight that CycleGAN |
0:08:49 | can learn from non-parallel training data. |
0:08:51 | As also cited here, it has worked well for speech voice |
0:08:56 | conversion, and it also provides a solution to model |
0:09:00 | image-to-image translation. |
0:09:02 | To the best of our knowledge, |
0:09:04 | CycleGAN has not been studied for singing voice conversion. |
0:09:08 | In this paper, we train a CycleGAN to find an optimal mapping |
0:09:13 | between the singing data of two speakers |
0:09:18 | for singing voice conversion purposes. |
0:09:20 | The losses are as follows: |
0:09:21 | the adversarial loss and the cycle-consistency loss; |
0:09:25 | and we decided to equip the CycleGAN with an additional identity-mapping loss, as demonstrated here. |
0:09:31 | This allows us to preserve the lyrical content of the source speaker, |
0:09:35 | sorry, source singer. |
0:09:37 | In the next slides, we will discuss very briefly why we need these |
0:09:41 | loss functions. |
0:09:43 | Let's start with the adversarial loss. |
0:09:47 | In singing voice conversion, the adversarial loss aims to optimize the distribution of the converted singing features |
0:09:53 | to be as close as possible to the distribution of the target singer. |
0:09:56 | When the distribution of the converted singing data comes close to that of the target singer, |
0:10:01 | we learn the target speaker well, |
0:10:03 | and we can achieve high speaker similarity in singing voice conversion. |
0:10:08 | So why do we need the cycle-consistency loss? |
0:10:12 | The reason is that the adversarial loss only tells us whether the converted features follow |
0:10:16 | the target singing data distribution, |
0:10:20 | and it does not help to preserve the singing's contextual information. |
0:10:24 | With the cycle-consistency loss, we can maintain the contextual information between the source and |
0:10:30 | target pair. |
0:10:32 | As for the identity-mapping loss: it was shown that the cycle-consistency loss preserves the overall |
0:10:37 | structure; however, it will not suffice to guarantee that the mapping always preserves the lyrical |
0:10:43 | content of the source singer. |
0:10:45 | So, to explicitly preserve the lyrical content, |
0:10:49 | we incorporate an identity-mapping loss here. |
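The three losses just motivated can be written out explicitly. This is the standard CycleGAN formulation; the $\lambda$ weights and the L1 form of the cycle and identity terms follow common CycleGAN practice and are assumptions here, not values taken from the talk.

```latex
\begin{align}
\mathcal{L}_{adv}(G_{X \to Y}, D_Y)
  &= \mathbb{E}_{y}\left[\log D_Y(y)\right]
   + \mathbb{E}_{x}\left[\log\big(1 - D_Y(G_{X \to Y}(x))\big)\right] \\
\mathcal{L}_{cyc}
  &= \mathbb{E}_{x}\left[\lVert G_{Y \to X}(G_{X \to Y}(x)) - x \rVert_1\right]
   + \mathbb{E}_{y}\left[\lVert G_{X \to Y}(G_{Y \to X}(y)) - y \rVert_1\right] \\
\mathcal{L}_{id}
  &= \mathbb{E}_{y}\left[\lVert G_{X \to Y}(y) - y \rVert_1\right]
   + \mathbb{E}_{x}\left[\lVert G_{Y \to X}(x) - x \rVert_1\right] \\
\mathcal{L}_{full}
  &= \mathcal{L}_{adv}(G_{X \to Y}, D_Y) + \mathcal{L}_{adv}(G_{Y \to X}, D_X)
   + \lambda_{cyc}\,\mathcal{L}_{cyc} + \lambda_{id}\,\mathcal{L}_{id}
\end{align}
```

The cycle term forces the round trip $X \to Y \to X$ to reconstruct the input, preserving the contextual information, while the identity term penalises the generator for altering features that are already in its target domain, which matches the lyrics-preservation motivation given above.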
0:10:54 | Let's look at the experiments. |
0:10:57 | In this paper, we perform objective and subjective evaluations with the NUS singing database, |
0:11:03 | which consists of audio recordings |
0:11:07 | of English songs sung by professional singers. |
0:11:11 | For both the parallel and the non-parallel training data settings, |
0:11:15 | our experiments use three and five minutes of singing data, respectively. |
0:11:23 | We extract twenty-four Mel-cepstral coefficients, the logarithmic fundamental frequency, and aperiodicities, |
0:11:29 | and we normalize the source and target Mel-cepstra to zero mean and unit variance by |
0:11:33 | using the statistics of the training data. |
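The normalization step is a standard per-dimension z-score over the Mel-cepstra, computed from training-set statistics as stated above; the toy data below is an assumption for illustration.

```python
import numpy as np

def zscore_stats(mcep):
    """Per-dimension mean and std over all training frames (frames x dims)."""
    return mcep.mean(axis=0), mcep.std(axis=0)

def normalize(mcep, mean, std):
    """Zero-mean, unit-variance normalization with training statistics."""
    return (mcep - mean) / std

# Toy training data: 100 frames of 24 Mel-cepstral coefficients.
rng = np.random.default_rng(0)
train = rng.normal(loc=3.0, scale=2.0, size=(100, 24))
mean, std = zscore_stats(train)
norm = normalize(train, mean, std)
```

At conversion time, the generator output would be de-normalized with the target singer's statistics before WORLD synthesis.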
0:11:36 | So now, |
0:11:37 | let's look at the objective evaluation here. |
0:11:41 | The Mel-cepstral distortion between the target singer's natural singing and the converted singing is reported, and |
0:11:47 | as you may know, a lower Mel-cepstral distortion value indicates smaller spectral distortion. |
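Mel-cepstral distortion is commonly computed per aligned frame pair as MCD = (10 / ln 10) * sqrt(2 * sum_d (c_d - c_hat_d)^2), averaged over frames. This standard formula, with the 0th (energy) coefficient excluded, is the usual convention and an assumption here; the paper defines the exact variant used.

```python
import numpy as np

def mel_cepstral_distortion(ref, conv):
    """Frame-averaged MCD in dB between aligned Mel-cepstra (frames x dims).
    The energy coefficient (dim 0) is excluded, as is common practice."""
    diff = ref[:, 1:] - conv[:, 1:]
    per_frame = (10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

# Identical sequences give zero distortion; larger errors give larger MCD.
a = np.ones((5, 25))
b = a.copy()
mcd_same = mel_cepstral_distortion(a, b)
b[:, 1:] += 0.1
mcd_off = mel_cepstral_distortion(a, b)
```

Because the measure is a distance in cepstral space, it directly reflects the "smaller spectral distortion" interpretation used in the table.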
0:11:53 | In this Table 1, we report four frameworks. |
0:11:57 | If you are interested in how we trained these networks, |
0:12:00 | please note that |
0:12:02 | all the models and experimental conditions are provided in the paper, |
0:12:07 | so you can just go and check; |
0:12:09 | for each of them, we provide a one-paragraph explanation of how we trained them. |
0:12:13 | We report intra-gender (male-to-male) and inter-gender (female-to-male) conversion. |
0:12:18 | For the GAN and the DNN, we have the same amount of parallel training data |
0:12:22 | from each speaker. |
0:12:26 | And if you look at the GAN and the DNN, you will see that the GAN |
0:12:29 | always outperforms the DNN. |
0:12:32 | This shows that, if we have parallel training data, the GAN |
0:12:37 | is a much better solution than the DNN for singing voice conversion. |
0:12:41 | The CycleGAN, on the other hand, solves a problem that is more challenging, because we are doing |
0:12:47 | non-parallel singing voice conversion, |
0:12:50 | which means the lyrical content is different during the training |
0:12:54 | and the data is not parallel. |
0:12:56 | Still, the CycleGAN achieves comparable results to the DNN |
0:13:01 | and the GMM baselines, |
0:13:02 | even though the DNN and GMM baselines use parallel data. All these |
0:13:06 | results show that the CycleGAN performs very well: |
0:13:11 | even though we do not rely on parallel data, the CycleGAN |
0:13:16 | can achieve comparable or even better results than those of the DNN. |
0:13:23 | So, |
0:13:24 | on the next slide, we report the subjective evaluations. We have more experiments in the |
0:13:29 | paper, but in the interest of time, I will only report |
0:13:32 | some of them |
0:13:33 | here in the presentation. |
0:13:35 | We report the mean opinion score. |
0:13:37 | Fifteen subjects participated in the listening tests, and each subject listened to |
0:13:42 | the converted |
0:13:43 | singing voices. |
0:13:44 | The DNN and the GAN are trained with parallel data, while the CycleGAN is trained with non-parallel training data. |
0:13:51 | And if you look at the DNN and the GAN, you will observe that |
0:13:57 | the GAN outperforms the DNN, |
0:13:59 | even though they use the same amount of training data. |
0:14:03 | The results show that the GAN |
0:14:06 | outperforms the DNN, and it should be preferred for singing voice conversion over the DNN. |
0:14:10 | And if you look at the CycleGAN, it is trained with the same amount of training data, |
0:14:16 | but the data is not parallel, which means it solves a more |
0:14:19 | challenging task. |
0:14:21 | Even for this more challenging task, |
0:14:22 | the CycleGAN |
0:14:23 | achieves a very similar performance to that of the DNN, |
0:14:27 | while the DNN uses parallel training data. |
0:14:29 | So we believe the performance of the CycleGAN is remarkable, |
0:14:34 | considering that it uses non-parallel training data. |
0:14:38 | In another experiment, we compare the CycleGAN against the GAN |
0:14:43 | for speaker similarity. |
0:14:45 | In this experiment, reported here as a preference test of speaker similarity, |
0:14:50 | the CycleGAN uses five minutes of non-parallel data for training, |
0:14:54 | whereas the GAN uses parallel data for training. |
0:14:59 | This experiment shows that the CycleGAN, trained with non-parallel singing |
0:15:04 | data, achieves comparable results to |
0:15:08 | the GAN trained with parallel singing data: |
0:15:10 | its samples are judged to be the better ones 48.1 percent of |
0:15:14 | the time. |
0:15:15 | We believe this is remarkable, because, as you know, |
0:15:18 | learning from non-parallel training data is a much more challenging task than learning from a parallel training dataset. |
0:15:24 | So we believe that the CycleGAN achieves |
0:15:26 | a really good performance in terms of singing voice conversion, even when we have |
0:15:30 | no parallel training data. |
0:15:33 | To summarize: in this paper, we propose novel solutions based on generative adversarial |
0:15:38 | networks for singing voice conversion, |
0:15:40 | with and without parallel training data. |
0:15:43 | The proposed GAN framework, which works very well with parallel training data, |
0:15:48 | uses adversarial training to learn the mapping between the source and target singers. |
0:15:54 | And even when we do not have parallel training data, |
0:15:57 | we show that it works really well. |
0:16:00 | Furthermore, we also show that the proposed framework performs better |
0:16:04 | with less training data than the DNN, which we find really remarkable. |
0:16:09 | This shows that, with or without parallel training data, generative adversarial networks |
0:16:15 | achieve high-quality singing voice conversion. |
0:16:19 | Thank you for listening. |