0:00:13 | i well i guess how these is you don't |
---|
0:00:17 | in the residual also |
---|
0:00:20 | and today i'm going to present you a |
---|
0:00:24 | residual networks for resisting noise |
---|
0:00:27 | analysis of an |
---|
0:00:28 | embeddings-based spoofing countermeasure |
---|
0:00:31 | with the increasing quality of text-to-speech |
---|
0:00:34 | and voice conversion methods |
---|
0:00:37 | there is it will we need for solving |
---|
0:00:40 | the ASVspoof challenge series has resulted in great progress |
---|
0:00:45 | the what is a |
---|
0:00:47 | there are still open challenges, such as how |
---|
0:00:51 | spoofing countermeasures perform |
---|
0:00:53 | in realistic noise scenarios, where there is very little research |
---|
0:01:00 | and i a lot of the problem |
---|
0:01:02 | is that i work so phenomena |
---|
0:01:05 | the acoustic information |
---|
0:01:07 | exploited by actually just |
---|
0:01:09 | exactly |
---|
0:01:11 | it is challenging to look inside the black box |
---|
0:01:15 | in this study |
---|
0:01:17 | we propose a new |
---|
0:01:19 | died resonant gmms for sure |
---|
0:01:22 | and we compare systematically |
---|
0:01:25 | its performance |
---|
0:01:26 | to the ideas is able to times and i think |
---|
0:01:31 | this includes |
---|
0:01:33 | two hundred uses |
---|
0:01:35 | or performance |
---|
0:01:36 | in various types of noise scenarios |
---|
0:01:40 | and we also |
---|
0:01:42 | to look inside this |
---|
0:01:43 | seemingly |
---|
0:01:45 | in data |
---|
0:01:46 | black box |
---|
0:01:48 | model |
---|
0:01:49 | so this will be encountered a problem |
---|
0:01:52 | is a mixture of and i read that |
---|
0:01:55 | and the gmm |
---|
0:01:58 | we train basically a gmm |
---|
0:02:00 | on the embeddings |
---|
0:02:02 | the ones |
---|
0:02:04 | produced by a ResNet |
---|
0:02:07 | in i able to that vectors |
---|
0:02:11 | well i data base cu these background |
---|
0:02:14 | as input features |
---|
0:02:17 | and |
---|
0:02:19 | but it is easy to |
---|
0:02:20 | then i did convolutional layer is |
---|
0:02:24 | in each that we can see that there is a max pooling which is essential |
---|
0:02:29 | to result in a downsampling factor of two |
---|
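The max pooling step with a factor of two mentioned here can be illustrated with a minimal sketch. This is a one-dimensional toy version in plain Python, not the speaker's implementation; the function name `max_pool_1d` and the sample values are assumptions made for illustration.

```python
def max_pool_1d(x, size=2):
    """Non-overlapping max pooling.

    Keeps the largest value in each window of `size` samples, so a window
    of two halves the length of the sequence. Trailing samples that do not
    fill a complete window are dropped.
    """
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]


# Hypothetical feature values along one axis of a feature map.
frames = [0.1, 0.7, 0.3, 0.2, 0.9, 0.4]
print(max_pool_1d(frames))  # [0.7, 0.3, 0.9]
```

In a two-dimensional network the same idea is applied along the time and frequency axes of the feature map.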
0:02:34 | well i think so is there is that so you can actually includes |
---|
0:02:39 | the skip connections |
---|
0:02:41 | connecting the convolutional layers, aiding |
---|
0:02:44 | the training of a very deep neural network architecture |
---|
0:02:49 | finally |
---|
0:02:50 | in the gmm is |
---|
0:02:53 | we have a fully connected layer |
---|
0:02:56 | and we use the embeddings |
---|
0:02:58 | to train |
---|
0:03:00 | the gmm or vector |
---|
0:03:02 | using a gmm has the advantage |
---|
0:03:06 | of outputting |
---|
0:03:09 | a likelihood ratio |
---|
0:03:11 | or worse still mask |
---|
0:03:14 | this enables |
---|
0:03:15 | to include a human in the loop |
---|
0:03:18 | for the automatic speaker verification |
---|
0:03:20 | or just implement the rejection based |
---|
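The likelihood-ratio output of a GMM back end can be sketched as follows. This is a minimal illustration, assuming one diagonal-covariance GMM for bona fide speech and one for spoofed speech, scored on a single embedding vector; the talk does not spell out this setup, and all parameter values below are made up.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at point x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of x under a GMM, using log-sum-exp for stability."""
    comps = [math.log(w) + log_gauss_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    top = max(comps)
    return top + math.log(sum(math.exp(c - top) for c in comps))


def llr(x, bonafide_gmm, spoof_gmm):
    """Log-likelihood ratio: positive values favour bona fide speech."""
    return gmm_loglik(x, *bonafide_gmm) - gmm_loglik(x, *spoof_gmm)


# Hypothetical two-dimensional models: (weights, means, variances).
bonafide = ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]])
spoof = ([1.0], [[4.0, 4.0]], [[1.0, 1.0]])

score = llr([0.5, 0.5], bonafide, spoof)
print("accept" if score > 0.0 else "reject")
```

Because the score is a continuous ratio rather than a hard label, trials whose score lies near the threshold could be passed to a human or simply rejected, which is the use the talk describes.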
0:03:27 | in this slide i present |
---|
0:03:29 | the overall performance |
---|
0:03:31 | of the two baselines |
---|
0:03:33 | the two challenge baselines |
---|
0:03:35 | and assisi gmm |
---|
0:03:38 | c use this is gmm |
---|
0:03:40 | and the proposed data |
---|
0:03:43 | see you did the gmm the an |
---|
0:03:45 | and the usenet oneida |
---|
0:03:48 | security the |
---|
0:03:49 | but all |
---|
0:03:50 | with the so-called sum fusion system, which is the fusion of the mfcc gmm |
---|
0:03:56 | c is easy gmm |
---|
0:03:59 | and the cuda gmm system |
---|
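Sum fusion of several systems can be sketched as a per-trial addition of their scores. This is a minimal sketch; whether the scores were normalised or calibrated before being summed is not stated in the talk, so none is applied here, and the score values are invented.

```python
def sum_fusion(*system_scores):
    """Fuse systems by adding their scores trial by trial.

    Each argument is a list of scores from one system, ordered by trial.
    """
    n_trials = len(system_scores[0])
    if any(len(scores) != n_trials for scores in system_scores):
        raise ValueError("all systems must score the same trials")
    return [sum(trial) for trial in zip(*system_scores)]


# Hypothetical scores from three systems on two trials.
system_a = [1.0, -2.0]
system_b = [0.5, 0.5]
system_c = [-0.5, 1.0]
print(sum_fusion(system_a, system_b, system_c))  # [1.0, -0.5]
```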
0:04:04 | we can see that role |
---|
0:04:06 | the sum fusion |
---|
0:04:08 | that's cool was |
---|
0:04:11 | but also that s |
---|
0:04:13 | and a very straight north |
---|
0:04:16 | using the different architectures |
---|
0:04:18 | in the different kind of smoothing types and thus |
---|
0:04:23 | i would like to emphasise you |
---|
0:04:25 | the table we apply |
---|
0:04:26 | one minute or |
---|
0:04:28 | on the logical access portion |
---|
0:04:30 | of ASVspoof 2019 |
---|
0:04:35 | because the ASVspoof 2019 dataset is noiseless, it is not very suitable |
---|
0:04:43 | to test |
---|
0:04:44 | a noisy scenario |
---|
0:04:47 | really i'm noise original this data |
---|
0:04:50 | so we have to create |
---|
0:04:52 | but noise is the |
---|
0:04:55 | it is computationally very expensive |
---|
0:04:59 | to create |
---|
0:05:01 | noisy scenarios |
---|
0:05:02 | for the speech samples |
---|
0:05:05 | so instead of this i took |
---|
0:05:08 | a less computationally intensive approach |
---|
0:05:12 | by sampling |
---|
0:05:13 | a subset of the ASVspoof 2019 dataset |
---|
0:05:17 | in a balanced manner, and by balanced we mean |
---|
0:05:23 | the bonds respect to the data used to be s |
---|
0:05:27 | the there exists |
---|
0:05:28 | in the dataset |
---|
0:05:30 | then |
---|
0:05:31 | we add noise samples from the MUSAN dataset |
---|
0:05:35 | these are all three |
---|
0:05:37 | the signal-to-noise ratio |
---|
0:05:39 | of five decibels |
---|
0:05:43 | we have a selection of c six |
---|
0:05:46 | speakers from the speech part of the MUSAN dataset |
---|
0:05:50 | a random music file |
---|
0:05:52 | and a random noise file |
---|
0:05:55 | from the MUSAN dataset |
---|
0:05:57 | by noise |
---|
0:05:58 | we refer to the noise category of the MUSAN dataset |
---|
0:06:04 | big noise is also where i |
---|
0:06:07 | by since the functional generation |
---|
0:06:10 | at a signal-to-noise ratio of five decibels |
---|
0:06:14 | and also reverberation was applied |
---|
0:06:16 | using simulated room impulse responses from the y alright |
---|
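Mixing a noise recording into speech at a fixed signal-to-noise ratio can be sketched as below. This is a minimal sketch under the assumption that the noise is scaled by a single gain so that the ratio of average powers matches the target; the synthetic signals only stand in for real recordings, and the function names are illustrative.

```python
import math


def power(x):
    """Average power of a signal given as a list of samples."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so the result has the requested SNR in decibels."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    # SNR_dB = 10 * log10(P_speech / P_noise), solved for the noise power.
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]


# Synthetic stand-ins for a speech file and a noise file.
speech = [math.sin(0.1 * i) for i in range(1000)]
noise = [((i * 37) % 11 - 5) / 5 for i in range(1000)]
noisy = mix_at_snr(speech, noise, snr_db=5.0)
```

Reverberation would be a separate step, typically a convolution of the speech with a room impulse response, and is not covered by this sketch.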
0:06:22 | we can see the overall performance results |
---|
0:06:25 | all the all vectors |
---|
0:06:27 | in the presence of |
---|
0:06:28 | also i |
---|
0:06:30 | we see |
---|
0:06:32 | the results of noise |
---|
0:06:34 | this but architecture for best |
---|
0:06:38 | and without noise |
---|
0:06:40 | this is usually gmm vienna |
---|
0:06:43 | and the sum fusion on a circle |
---|
0:06:49 | we have also |
---|
0:06:52 | that is this sort of a tradeoff |
---|
0:06:54 | big in the security in the n f c but the gmm |
---|
0:07:00 | the c d v d n |
---|
0:07:02 | performs |
---|
0:07:03 | better in noisy cases |
---|
0:07:06 | but slightly worse |
---|
0:07:08 | in the noiseless case |
---|
0:07:10 | compresses but gmm |
---|
0:07:13 | finally |
---|
0:07:14 | we have to also the |
---|
0:07:17 | that old s e c g and the c use is e g m all |
---|
0:07:21 | characters |
---|
0:07:23 | a the performing compared to the that the proposed architecture and compared to the cu |
---|
0:07:29 | maybe a |
---|
0:07:32 | in these noisy and with this scenarios |
---|
0:07:35 | you we see the same feature but in |
---|
0:07:39 | therefore that occurs |
---|
0:07:40 | rather than |
---|
0:07:41 | you know all of that they |
---|
0:07:46 | we can see the sum fusion |
---|
0:07:48 | performs best |
---|
0:07:50 | in the noiseless scenario |
---|
0:07:52 | the noiseless setup |
---|
0:07:54 | is denoted by this |
---|
0:07:56 | though not installed s |
---|
0:07:59 | while the noisy scenario is denoted by |
---|
0:08:03 | the continuous line |
---|
0:08:06 | overall we can also that the cu due to the nn off factor is the |
---|
0:08:12 | most robust to noise in this whole audio |
---|
0:08:16 | and we can also the |
---|
0:08:18 | this kind of trade off there |
---|
0:08:20 | with this but gmm |
---|
0:08:22 | and the cu and |
---|
0:08:25 | three shows that we have also seen previously |
---|
0:08:29 | in a you know the |
---|
0:08:33 | we then proceeded to do |
---|
0:08:35 | visualisations |
---|
0:08:37 | of the ResNet embeddings |
---|
0:08:39 | first |
---|
0:08:40 | with pca |
---|
0:08:45 | really the visualisation |
---|
0:08:47 | and so to solve the class is |
---|
0:08:51 | it became apparent |
---|
0:08:53 | that most of the spoof classes |
---|
0:08:55 | separate very well |
---|
0:08:58 | from this green |
---|
0:09:00 | point cloud |
---|
0:09:02 | which corresponds to the bona fide speech |
---|
0:09:06 | except |
---|
0:09:07 | the v c classes |
---|
0:09:09 | the classes corresponding to voice conversion |
---|
0:09:14 | which sort of overlap |
---|
0:09:16 | with the bona fide class |
---|
0:09:19 | this explains |
---|
0:09:21 | the fusion detection performance |
---|
0:09:23 | with some p c s is |
---|
0:09:26 | because these can be separated |
---|
0:09:30 | linearly |
---|
0:09:30 | in the to these days |
---|
0:09:33 | we see |
---|
0:09:34 | a similar |
---|
0:09:35 | consistent picture |
---|
0:09:37 | with another dimensionality reduction technique |
---|
0:09:40 | t-SNE |
---|
0:09:42 | which stands for t-distributed stochastic neighbor embedding |
---|
0:09:46 | and what we see |
---|
0:09:48 | is this same feature |
---|
0:09:50 | of the vc clusters |
---|
0:09:52 | overlapping |
---|
0:09:54 | with the bona fide class |
---|
0:09:57 | and we then proceeded to do an additional experiment |
---|
0:10:03 | the goal of this experiment was |
---|
0:10:05 | to see |
---|
0:10:07 | how the embeddings are moving |
---|
0:10:10 | then there so gently |
---|
0:10:12 | to these different kind of noise and i was |
---|
0:10:16 | in the figure what you can see |
---|
0:10:19 | is what happens |
---|
0:10:21 | in the case |
---|
0:10:23 | of reverberation |
---|
0:10:26 | those also that this figure |
---|
0:10:31 | the blue point cloud |
---|
0:10:33 | the red point cloud |
---|
0:10:35 | and the green point cloud |
---|
0:10:37 | are actually the same |
---|
0:10:39 | as in the pca slide |
---|
0:10:42 | now |
---|
0:10:43 | we proceed to solve a whole |
---|
0:10:46 | some samples |
---|
0:10:49 | these ones |
---|
0:10:50 | from the one of the |
---|
0:10:52 | and these ones |
---|
0:10:54 | but the ones |
---|
0:10:55 | from this tool |
---|
0:10:58 | and the be |
---|
0:10:59 | following the lee |
---|
0:11:01 | noise |
---|
0:11:03 | with this reverberation |
---|
0:11:06 | and what we see |
---|
0:11:09 | is that being the ones |
---|
0:11:10 | corresponding to the one thing |
---|
0:11:13 | become these green dots |
---|
0:11:16 | moving closer to the actual decision boundary |
---|
0:11:22 | and we can also see |
---|
0:11:24 | that a little |
---|
0:11:27 | become these orange dots |
---|
0:11:30 | which move closer to the decision boundary |
---|
0:11:34 | but still on the right side of this each |
---|
0:11:38 | then well |
---|
0:11:40 | this gives us an explanation of why |
---|
0:11:42 | the architecture is robust to reverberation |
---|
0:11:47 | because you know |
---|
0:11:48 | no one's |
---|
0:11:51 | matrix |
---|
0:11:52 | this is all |
---|
0:11:54 | close as a decision boundary which is exactly |
---|
0:12:00 | we can see that a mass |
---|
0:12:03 | the right classification decision |
---|
0:12:06 | is retained |
---|
0:12:07 | now i'm going to talk about |
---|
0:12:09 | our explainable audio based techniques |
---|
0:12:14 | the first thing i'm going to talk about |
---|
0:12:17 | is the graph based technique |
---|
0:12:19 | which is a basis |
---|
0:12:23 | first |
---|
0:12:24 | we compute the CQT spectrum |
---|
0:12:27 | based |
---|
0:12:28 | on the all we also |
---|
0:12:31 | down with the reckon |
---|
0:12:34 | we obtain a sensitivity map |
---|
0:12:37 | this sensitivity map tells us |
---|
0:12:40 | which parts of the spectrum were |
---|
0:12:42 | the most important to make |
---|
0:12:45 | the classification decision, whether the speech is spoofed or whether it is natural |
---|
0:12:51 | what we can do |
---|
0:12:53 | is threshold the sensitivity map |
---|
0:12:57 | to obtain this binary mask |
---|
0:13:00 | in c |
---|
0:13:02 | which is basically segments this for four hours |
---|
0:13:06 | does not reach five important on |
---|
0:13:09 | and you can be should be |
---|
0:13:13 | see that |
---|
0:13:14 | if we multiply |
---|
0:13:15 | the original CQT spectrum with the binary mask |
---|
0:13:20 | really all the in this |
---|
0:13:22 | picture |
---|
0:13:23 | which we again |
---|
0:13:24 | i don't normalization |
---|
0:13:27 | sensitive refuelling waller |
---|
0:13:30 | to thing |
---|
0:13:32 | reconstructed way |
---|
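The masking idea described in this part of the talk can be sketched in a few lines: threshold a sensitivity map into a binary mask, then apply the mask to the spectrum so that only the regions judged important remain. This is a minimal sketch; the threshold value, the toy 2x2 arrays and the function names are assumptions, and the inversion of the masked spectrum back to a waveform is left out.

```python
def binary_mask(sensitivity, threshold):
    """1.0 where the sensitivity reaches the threshold, 0.0 elsewhere."""
    return [[1.0 if v >= threshold else 0.0 for v in row]
            for row in sensitivity]


def apply_mask(spectrum, mask):
    """Element-wise product of a magnitude spectrum and a mask."""
    return [[s * m for s, m in zip(s_row, m_row)]
            for s_row, m_row in zip(spectrum, mask)]


# Toy frequency-by-time arrays standing in for real spectra.
spectrum = [[1.0, 2.0], [3.0, 4.0]]
sensitivity = [[0.9, 0.1], [0.2, 0.8]]

mask = binary_mask(sensitivity, threshold=0.5)
print(apply_mask(spectrum, mask))  # [[1.0, 0.0], [0.0, 4.0]]
```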
0:13:35 | and now what we will play to you is a series of explainable audios |
---|
0:13:41 | right after each other |
---|
0:13:43 | first you are going to hear |
---|
0:13:45 | the original audio |
---|
0:13:47 | then |
---|
0:13:48 | you are going to hear a reconstruction of the original using all the features |
---|
0:13:54 | and finally going to you possible the audio that the no one extra innings sports |
---|
0:14:17 | so you can do something about the real the speech |
---|
0:14:20 | and the again here on a particular type of |
---|
0:14:24 | viewing these examples might indicate what parts of the speech signal |
---|
0:14:29 | might be important |
---|
0:14:33 | that i think that we have five |
---|
0:14:36 | this is that both |
---|
0:14:38 | mean you know we'll technique |
---|
0:14:42 | we order all audio files based on how challenging they are |
---|
0:14:48 | the more challenging audio files |
---|
0:14:51 | are usually the ones |
---|
0:14:53 | that are closer to the cm threshold |
---|
0:14:58 | and the easier ones are those |
---|
0:15:00 | which are far from the cm threshold |
---|
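Ordering files by how challenging they are can be sketched by sorting on the distance between each countermeasure (cm) score and the decision threshold. This is a minimal sketch; the file names, scores and threshold are hypothetical.

```python
def rank_by_difficulty(scores, threshold):
    """Return file names sorted from most to least challenging.

    A file counts as more challenging the closer its cm score lies to the
    decision threshold.
    """
    return sorted(scores, key=lambda name: abs(scores[name] - threshold))


# Hypothetical cm scores per audio file.
cm_scores = {"a.wav": 2.5, "b.wav": 0.1, "c.wav": -1.0}
print(rank_by_difficulty(cm_scores, threshold=0.0))
# ['b.wav', 'c.wav', 'a.wav']
```

Keeping the sign of the difference instead of its absolute value would additionally show on which side of the threshold each file falls.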
0:15:05 | and what we can do |
---|
0:15:07 | is we can exploit this phenomenon |
---|
0:15:10 | this closeness to the cm threshold |
---|
0:15:12 | and use these |
---|
0:15:14 | two or two was |
---|
0:15:16 | based on this yes of course |
---|
0:15:18 | and i think he's grew out was the main noticeable o as we can obtain |
---|
0:15:23 | and you all recently collected by consent |
---|
0:15:27 | where we don't understand the needle individual |
---|
0:15:31 | but three |
---|
0:15:33 | a fourth the voters on the acoustics |
---|
0:15:38 | so |
---|
0:15:39 | i'm going to show you what okay given the case of a eighteen |
---|
0:15:44 | and |
---|
0:15:45 | i'm going to and you are going to |
---|
0:15:48 | he of progressively so was that |
---|
0:15:53 | i first variability so |
---|
0:15:56 | five from the c and search for in the direction of what you |
---|
0:16:00 | and then finally ones that are there is to someone's the batteries two |
---|
0:16:28 | so what you were supposed to hear is that there is a noise more aggressively present |
---|
0:16:35 | when you use a listening to a morse two |
---|
0:16:38 | all videos |
---|
0:16:39 | in general we also that there is a more |
---|
0:16:44 | no one set of speech in the school speech can be also |
---|
0:16:49 | in general |
---|
0:16:50 | in this actually involve your examples |
---|
0:16:53 | you can hear more these extended while you're was |
---|
0:16:56 | by scan disk you a whole or just picking the mean |
---|
0:17:01 | we also did some experiments using the MOSNet architecture which can be used |
---|
0:17:08 | to cooperate |
---|
0:17:09 | objective measure on estimation |
---|
0:17:12 | we find a as for the zero point three more five |
---|
0:17:16 | these being the mean opinion score and of the screen |
---|
0:17:20 | the s is that |
---|
0:17:22 | is the first principal axes the first nine dimensional well i principal component |
---|
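A relation between predicted mean opinion scores and the first principal component of the embeddings could be quantified with a correlation coefficient. The sketch below uses the Pearson coefficient, which is an assumption, since the transcript does not make clear which measure was reported; the numbers are toy values, not results from the talk.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


# Toy values: predicted opinion scores and first principal component
# coordinates for the same five utterances.
mos = [3.1, 3.8, 2.5, 4.2, 3.0]
pc1 = [0.2, 0.9, -0.4, 0.7, 0.1]
r = pearson(mos, pc1)
```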
0:17:28 | and this year |
---|
0:17:29 | that was |
---|
0:17:31 | actually a single |
---|
0:17:33 | then a bonus aspects of the speech |
---|
0:17:38 | and interestingly we the exact show these voice cooking categories |
---|
0:17:43 | also all was i think more natural than the actual one of the signals |
---|
0:17:50 | according to MOSNet |
---|
0:17:51 | and point out directions for future |
---|
0:17:56 | recognizer redeemable water |
---|
0:17:59 | as an image reconstruction what you |
---|
0:18:01 | in a minimal audio case |
---|
0:18:04 | so in the future we want to use an l one based solution |
---|
0:18:09 | trained on CQT |
---|
0:18:10 | spectrograms |
---|
0:18:12 | because these have been previously shall tools lingual speech coding i bit conventional fft spectrum |
---|
0:18:20 | finally we also recognise |
---|
0:18:22 | that's the data bases clicks voice activity detection |
---|
0:18:27 | would be essential |
---|
0:18:30 | but is always this each region both this can be important for cm investigation |
---|
0:18:37 | in the case of political access data |
---|
0:18:41 | but it would be thing important |
---|
0:18:43 | to design a good calibration stuff i |
---|
0:18:45 | we investigate to what extent |
---|
0:18:48 | this is thus |
---|
0:18:49 | really i |
---|
0:18:50 | but non speech |
---|
0:18:51 | versus their use |
---|
0:18:54 | to summarize |
---|
0:18:55 | we have found |
---|
0:18:57 | that are known to have a second the measures |
---|
0:19:00 | a robust to noise and you know have a better understanding that even though |
---|
0:19:06 | then i don't exactly know |
---|
0:19:08 | well for doing |
---|
0:19:10 | well the |
---|
0:19:13 | we know that they are robust to noise, more robust to noise than the gmm |
---|
0:19:17 | can |
---|
0:19:20 | nevertheless we have managed to gain more insight into these |
---|
0:19:27 | by generating explainable audios |
---|
0:19:31 | finally |
---|
0:19:33 | we have also |
---|
0:19:34 | investigated an important concept |
---|
0:19:38 | which is that the embeddings correlate with subjective naturalness, as shown in the diagram |
---|
0:19:46 | meaning that a texture |
---|
0:19:48 | no in a |
---|
0:19:50 | considers the naturalness |
---|
0:19:52 | s i si |
---|
0:19:54 | i hope this presentation encourages you |
---|
0:19:57 | to not be afraid |
---|
0:19:59 | of using the minutes of this |
---|
0:20:02 | in your work |
---|
0:20:04 | due to |
---|
0:20:05 | to just the sheer ease an unexplained i |
---|
0:20:08 | and i would like to thank you |
---|
0:20:10 | for your attention |
---|