0:00:13 Hi everyone. Today I am going to present our analysis of a residual, embeddings-based spoofing countermeasure for speech signals.
0:00:31 With the increasing quality of text-to-speech and voice conversion methods, there is a growing need for spoofing countermeasures, and the ASVspoof challenge series has resulted in rapid progress. However, there are still open challenges, such as how these countermeasures behave in realistic noise scenarios, on which there has been very little research.
0:01:00 Another part of the problem is that the acoustic information exploited by these architectures is poorly understood: it is challenging to look inside these black boxes.
0:01:15 In this study, we propose a new CQT-based deep residual network with a GMM back-end for spoofing detection, and we compare its performance systematically to the ASVspoof baselines. This includes comparing their performance in various types of noise scenarios. We also try to look inside this seemingly impenetrable, data-driven black-box model.
0:01:49 The proposed countermeasure is a combination of a DNN and a GMM: we train a GMM on the embeddings produced by a residual network, which takes CQT spectrograms as input features.
0:02:17 The network consists of a series of convolutional layers; after each of them there is a max-pooling layer, which results in a downsampling factor of two. Another important aspect is that the architecture includes skip connections between the convolutional layers, which aid the training of very deep network architectures.
0:02:49 Finally, the network has a fully connected layer, and we use the embeddings it produces to train the GMM back-end.
0:03:02 The GMM back-end has the additional advantage that it can output a likelihood ratio rather than just a raw score. This makes it possible to combine the countermeasure with automatic speaker verification in a likelihood-ratio framework, or simply to implement threshold-based rejection.
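To make the described pipeline concrete, here is a minimal sketch of this kind of model in PyTorch. It is not the exact network from the talk; the channel counts, number of blocks, embedding size and GMM settings are illustrative assumptions.

```python
import torch
import torch.nn as nn
from sklearn.mixture import GaussianMixture


class ResBlock(nn.Module):
    """Two 3x3 convolutions with a skip connection, followed by 2x2 max pooling."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.pool = nn.MaxPool2d(2)                  # downsampling factor of two

    def forward(self, x):
        h = torch.relu(self.conv1(x))
        h = self.conv2(h)
        return self.pool(torch.relu(h + x))          # skip connection


class CQTEmbedder(nn.Module):
    """CQT spectrogram in, fixed-size embedding out."""
    def __init__(self, emb_dim=64):
        super().__init__()
        self.stem = nn.Conv2d(1, 16, 3, padding=1)
        self.blocks = nn.Sequential(ResBlock(16), ResBlock(16), ResBlock(16))
        self.fc = nn.LazyLinear(emb_dim)             # fully connected embedding layer

    def forward(self, cqt):                          # cqt: (batch, 1, freq, time)
        h = self.blocks(torch.relu(self.stem(cqt)))
        return self.fc(h.flatten(1))


def llr_score(embeddings, gmm_bonafide, gmm_spoof):
    """Log-likelihood ratio from two GMMs fitted on training embeddings."""
    return gmm_bonafide.score_samples(embeddings) - gmm_spoof.score_samples(embeddings)

# Hypothetical back-end usage, with `bona_emb`, `spoof_emb`, `test_emb` as numpy arrays
# of embeddings produced by the trained network:
#   gmm_bona  = GaussianMixture(n_components=8).fit(bona_emb)
#   gmm_spoof = GaussianMixture(n_components=8).fit(spoof_emb)
#   scores    = llr_score(test_emb, gmm_bona, gmm_spoof)   # threshold for accept/reject
```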
0:03:27 In this slide I present the overall performance of the two challenge baselines, an LFCC-GMM and a CQCC-GMM, of the proposed CQT-DNN with the GMM back-end, and of the CQT-DNN on its own, together with a score-sum fusion system, which is the fusion of the LFCC-GMM, the CQCC-GMM and the CQT-DNN-GMM systems.
0:04:04 We can see that, overall, the score-sum fusion performs best, which also demonstrates the complementarity of the different architectures across the different spoofing types.
0:04:23 I would like to emphasise that in this table we only evaluate on the logical access portion of ASVspoof 2019. Because this dataset is essentially clean, it is not very suitable for testing noisy scenarios: there is very little noise originally in this data.
0:04:50 So we had to create a noisy version ourselves. It is computationally very expensive to create noisy versions of all of the speech samples, so instead we adopted a less computationally intensive approach, by sampling a subset of the ASVspoof 2019 dataset in a balanced manner; by balanced, we mean balanced with respect to the different attack types that exist in the dataset.
0:05:30 Then we added noise samples from the MUSAN dataset at a signal-to-noise ratio of 5 dB. We used a random selection of speakers' speech from the MUSAN dataset, a random music file, and random noise data from the MUSAN dataset; by noise, we refer to the noise category of the MUSAN data. White noise generated with a signal generator was also added, again at a signal-to-noise ratio of 5 dB, and reverberation was applied using simulated room impulse responses.
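As an illustration of the additive-noise step (my own simplified sketch, not the actual augmentation script used in the experiments), mixing a noise recording into a speech signal at a chosen SNR can be done as follows:

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db=5.0):
    """Return speech with `noise` added so that the mixture has the requested SNR (dB)."""
    if len(noise) < len(speech):                    # loop the noise if it is too short
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12       # avoid division by zero
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```

Reverberation would then be applied on top of this by convolving the signal with a simulated room impulse response.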
0:06:22 Here we can see the overall performance results of all the architectures in the presence of noise. We see that with noise the proposed architecture performs best, while without noise it is the GMM-based systems and the score-sum fusion that come out on top.
0:06:49 We also observe that there is a sort of trade-off between the CQT-DNN and the baseline GMMs: the CQT-DNN performs better in the noisy cases, but slightly worse in the noiseless case, compared to the GMMs.
0:07:13 Finally, we can also see that both the LFCC-GMM and the CQCC-GMM baselines underperform compared to the proposed architecture in these noisy and reverberant scenarios.
0:07:35 We see the same pattern in the t-DCF results as in the EER.
0:07:46 We can see that the score-sum fusion performs best in the noiseless scenario; the noiseless setup is denoted by the dotted lines, while the noisy scenarios are denoted by the continuous lines.
0:08:06 Overall, we can also see that the CQT-DNN architecture is the most robust to noise across this whole set of conditions, and we can again see the kind of trade-off between the baseline GMMs and the CQT-DNN that we saw previously in the EER results.
0:08:33 We then proceeded to visualise the CQT-DNN embeddings, first with PCA.
0:08:45 Looking at the visualisation with the classes coloured, it became apparent that most of the spoof classes separate very well from the green point cloud, which corresponds to the bonafide class. The exception is the VC classes, the classes corresponding to voice conversion, which largely overlap with the bonafide class. This explains the poorer detection performance for some VC systems, because these cannot be separated linearly in this two-dimensional projection.
0:09:33 We see a similar, consistent picture with another dimensionality reduction technique, t-SNE, which stands for t-distributed stochastic neighbour embedding: what we see is the same pattern of the VC classes overlapping with the bonafide utterances.
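For reference, a small sketch of how such visualisations can be produced with scikit-learn, assuming the embeddings are in a NumPy array `embeddings` with one attack or bonafide label per row in `labels` (both names are my own placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def plot_embeddings(embeddings, labels, method="pca"):
    """Project embeddings to 2-D with PCA or t-SNE and scatter-plot them per class."""
    reducer = PCA(n_components=2) if method == "pca" else TSNE(n_components=2)
    points = reducer.fit_transform(embeddings)
    for label in sorted(set(labels)):
        idx = np.array([i for i, l in enumerate(labels) if l == label])
        plt.scatter(points[idx, 0], points[idx, 1], s=4, label=str(label))
    plt.legend(markerscale=3)
    plt.title(method.upper())
    plt.show()
```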
0:09:57 We then proceeded to an additional experiment. The goal of this experiment was to see how the embeddings move when they are subjected to these different kinds of noise and reverberation. In this figure you can see what happens in the case of reverberation; note that the blue point cloud, the red point cloud and the green point cloud are the same as on the PCA slide.
0:10:42 Now let us look at how some individual samples behave, these ones from the bonafide class and these ones from the spoof class, after the noise and reverberation are applied. What we see is that the points corresponding to the bonafide class, the green dots, move closer to the actual decision boundary, and that their degraded versions, shown as these orange dots, lie closer to the decision boundary but still on the correct side of it.
0:11:38 This gives us a clue as to why the architecture is robust to reverberation: although the embeddings move closer to the decision boundary, they do not cross it, which is exactly why the right classification decision is retained.
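A sketch of this kind of analysis, under the assumption that we have matched clean and degraded embeddings plus their log-likelihood-ratio scores for the same utterances (all argument names here are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA


def embedding_shift(emb_clean, emb_degraded, llr_clean, llr_degraded, threshold=0.0):
    """Project clean and degraded embeddings with one PCA fitted on the clean data,
    measure how far each point moves, and flag utterances whose decision flips."""
    pca = PCA(n_components=2).fit(emb_clean)
    p_clean, p_degraded = pca.transform(emb_clean), pca.transform(emb_degraded)
    shift = np.linalg.norm(p_degraded - p_clean, axis=1)   # movement in the 2-D plot
    flipped = (llr_clean > threshold) != (llr_degraded > threshold)
    return shift, flipped
```

This matches the observation above: most points shift towards the boundary under reverberation, while few decisions actually flip.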
0:12:07 Now I am going to talk about our explainable-AI-based techniques. The first one is a gradient-based technique, which works as follows.
0:12:23 First, we compute the CQT spectrogram. Then, by backpropagating the output of the network down to the input, we obtain a sensitivity map. This sensitivity map tells us which parts of the spectrogram were most important to the classification decision, that is, to deciding whether the speech is spoofed or whether it is natural. What we can then do is threshold this sensitivity map to obtain a binary mask, which basically segments the spectrogram into regions that are and are not important for the decision. If we multiply the original CQT spectrogram with this binary mask, the picture which we obtained after normalisation and thresholding of the sensitivity map, we can invert the result into a reconstructed waveform. What we are going to hear now is a series of three audio examples, right after each other.
0:13:43 First, you are going to hear the original audio. Then, you are going to hear a reconstruction of the original using all of the features. And finally, you are going to hear the audio reconstructed only from the parts that the network finds important.
0:14:17 So you can get an idea of how the network "hears" the speech, and listening to these examples can indicate which parts of the speech signal might be important for the decision.
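A minimal sketch of this gradient-based sensitivity analysis (my own simplification, assuming a differentiable `model` that maps a CQT tensor to a score, and a quantile-based threshold):

```python
import torch


def sensitivity_mask(model, cqt, keep_quantile=0.8):
    """Backpropagate the score to the CQT input, normalise the gradient magnitude,
    threshold it into a binary mask, and return the masked spectrogram plus the mask."""
    cqt = cqt.clone().requires_grad_(True)          # cqt: (1, 1, freq, time)
    score = model(cqt).sum()                        # scalar output for this utterance
    score.backward()
    sensitivity = cqt.grad.abs().squeeze()
    sensitivity = (sensitivity - sensitivity.min()) / (sensitivity.max() - sensitivity.min() + 1e-12)
    mask = (sensitivity >= torch.quantile(sensitivity, keep_quantile)).float()
    return cqt.detach().squeeze() * mask, mask

# The masked CQT magnitude can then be inverted back to a waveform (for example with
# librosa's Griffin-Lim CQT inversion) to listen to the regions the model relies on.
```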
0:14:33 The next technique that we have tried is a more intuitive, listening-based technique: we order the audio files based on how challenging they are. The more challenging audio files are usually the ones that are closer to the CM threshold, and the easy ones are the ones that are far from the CM threshold.
0:15:05 What we can do is exploit this phenomenon, the closeness to the CM threshold, and use it to order the audio files based on their CM scores. Listening to these ordered audio files gives us a new auditory perspective on the countermeasure, where we do not judge each individual file but rather focus on the overall trends in the acoustics.
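The ordering itself is straightforward. As a sketch, given a dictionary of CM scores per utterance and the decision threshold (assuming higher scores mean more bonafide-like):

```python
def order_for_listening(scores, threshold):
    """Sort utterance IDs from clearly bonafide-sounding (far above the threshold)
    down to the ones closest to, or below, the threshold."""
    return sorted(scores, key=lambda utt: scores[utt] - threshold, reverse=True)


def hardest_first(scores, threshold):
    """Alternative ordering: trials closest to the decision threshold first."""
    return sorted(scores, key=lambda utt: abs(scores[utt] - threshold))
```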
0:15:38 So, I am going to show you how this works for the case of attack A18. You are going to hear samples played in order: first ones that are far from the CM threshold in the direction of bonafide, and then finally ones that are closest to the threshold, which are the hardest ones.
0:16:28 What we were able to hear is that noise is more prominently present when you are listening to the more spoofed-sounding audios. In general, we also hear differences at the onset of speech in the spoofed speech. In these more challenging audio examples you can also hear more of these extended vowels, which can distort a whole word or just part of its meaning.
0:17:01 We also performed some additional experiments using the MOSNet architecture, which can be used to carry out objective mean opinion score (MOS) estimation. We found a correlation of 0.35 between the estimated mean opinion scores and the first principal axis, that is, the first dimension of the PCA of the embeddings. This shows that the embeddings indeed encode naturalness-related aspects of the speech. Interestingly, exactly those voice conversion categories that overlapped with the bonafide class were also rated as more natural than the actual bonafide signals.
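A small sketch of that correlation analysis, assuming `embeddings` and `mos_estimates` are aligned arrays for the same utterances (the names, the number of retained components, and the choice of a rank correlation are my own assumptions):

```python
from scipy.stats import spearmanr
from sklearn.decomposition import PCA


def naturalness_correlation(embeddings, mos_estimates):
    """Correlate objective MOS estimates with the first principal axis of the embeddings."""
    first_pc = PCA(n_components=1).fit_transform(embeddings).ravel()
    rho, p_value = spearmanr(first_pc, mos_estimates)
    return rho, p_value
```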
0:17:50 Coming to the end of the talk, I would like to point out directions for future work. We recognise that treating the audio reconstruction as an image-reconstruction problem results in limited audio quality, so in the future we would like to use a neural-vocoder-based solution trained on CQT spectrograms, because these have previously been shown to be suitable for speech coding compared with conventional FFT spectra.
0:18:20 Finally, we also recognise that voice activity detection would be essential, because the non-speech regions can also be important for the CM decision in the case of logical access data. It would therefore be important to design a good calibration study, in which we investigate to what extent the decisions really rely on the non-speech versus the speech regions.
0:18:54 To summarise, we have found that our DNN-based countermeasures are robust to noise, and we now have a better understanding of them: even though we do not know exactly what they are doing, we know that they are more robust to noise than the GMM countermeasures. Nevertheless, we have managed to gain more insight into these models by generating explainable-AI analyses.
0:19:31 Finally, we have also investigated an important aspect, which is that the embeddings correlate with estimated subjective naturalness, as I showed earlier, meaning that the architecture in a way takes the naturalness of the speech into account.
0:19:54 I hope this presentation encourages you not to be afraid of using methods like this in your work just because of their seeming unexplainability, and I would like to thank you for your attention.