Hello everyone. Today I am going to present our work on deep residual countermeasures for spoofed speech and on the embeddings they learn. With the increasing quality of text-to-speech and voice conversion methods, there is a growing need for spoofing detection, and the ASVspoof challenge series has resulted in rapid progress.
However, there are still open challenges, such as how these countermeasures perform in realistic noisy scenarios, on which there is very little research. Another part of the problem is that we do not know exactly what acoustic information is exploited by the detectors; it is challenging to look inside the black box.
In this study, we propose a new countermeasure based on deep residual embeddings with a GMM backend, and we systematically compare its performance to the ASVspoof 2019 baselines. This includes studying its performance in various types of noisy scenarios, and we also try to look inside this seemingly impenetrable black-box model.
The proposed countermeasure is a combination of a deep neural network and a GMM: we train a GMM backend on the embeddings produced by a deep residual network. We use CQT spectrograms as input features. These are fed into a stack of convolutional layers, and after each of them there is a max pooling layer, which results in a downsampling factor of two. An important element of the architecture is the skip connections connecting the convolutional layers, which enable the training of very deep architectures. Finally, we have a fully connected layer, and we use the resulting embeddings to train GMMs for the bona fide and spoofed classes. The output of the system is a log-likelihood ratio, which serves as the countermeasure score. This makes it possible to combine the countermeasure with an automatic speaker verification system, or to implement threshold-based rejection.
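As a rough illustration of the pipeline just described (not the authors' exact code), here is a minimal PyTorch and scikit-learn sketch: CQT spectrograms pass through residual convolutional blocks with two-fold max pooling and a fully connected embedding layer, and two GMMs trained on the embeddings give a log-likelihood-ratio score. The layer sizes, number of blocks, and GMM settings are assumptions.

```python
# Sketch of: CQT spectrogram -> residual conv blocks with 2x max pooling ->
# embedding -> two GMMs (bona fide / spoof) -> log-likelihood-ratio CM score.
import torch
import torch.nn as nn
from sklearn.mixture import GaussianMixture


class ResBlock(nn.Module):
    """Two conv layers with a skip connection, followed by 2x max pooling."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, c_out, 3, padding=1)
        self.conv2 = nn.Conv2d(c_out, c_out, 3, padding=1)
        self.skip = nn.Conv2d(c_in, c_out, 1)     # match channel count for the skip path
        self.pool = nn.MaxPool2d(2)               # downsampling factor of two
        self.act = nn.ReLU()

    def forward(self, x):
        h = self.act(self.conv1(x))
        h = self.conv2(h) + self.skip(x)          # skip connection
        return self.pool(self.act(h))


class CQTEmbeddingNet(nn.Module):
    """Maps a CQT spectrogram (1 x freq x time) to a fixed-size embedding."""
    def __init__(self, emb_dim=64):
        super().__init__()
        self.blocks = nn.Sequential(ResBlock(1, 16), ResBlock(16, 32), ResBlock(32, 64))
        self.gap = nn.AdaptiveAvgPool2d(1)        # collapse frequency and time
        self.fc = nn.Linear(64, emb_dim)          # fully connected embedding layer

    def forward(self, x):
        h = self.gap(self.blocks(x)).flatten(1)
        return self.fc(h)


def cm_scores(emb_bona, emb_spoof, emb_test, n_components=8):
    """Fit one GMM per class on training embeddings; the countermeasure
    score is the log-likelihood ratio bona fide vs. spoof."""
    gmm_bona = GaussianMixture(n_components, covariance_type="diag").fit(emb_bona)
    gmm_spoof = GaussianMixture(n_components, covariance_type="diag").fit(emb_spoof)
    return gmm_bona.score_samples(emb_test) - gmm_spoof.score_samples(emb_test)
```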
On this slide I present the overall performance of the two challenge baselines, the LFCC-GMM and the CQCC-GMM, of the proposed CQT-DNN-GMM, and of the CQT-DNN on its own, together with a score-level fusion system, which is the fusion of the LFCC-GMM, the CQCC-GMM, and the CQT-DNN-GMM systems. We can see that, overall, the score fusion performs best, but also that there is a clear trade-off between the different architectures under the different spoofing types and noise conditions.
I would like to emphasise that in these tables we only use the logical access portion of ASVspoof 2019. Because the dataset itself is clean, it is not very suitable for testing noisy scenarios without adding noise to the original data.
So we had to create noisy versions of the data ourselves. It is computationally very expensive to create noisy scenarios for all of the speech samples, so instead we took a less computationally intensive approach by sampling a subset of the ASVspoof 2019 dataset in a balanced way. By balanced we mean balanced with respect to the different spoofing attacks that exist in the dataset. We then added noise samples from the MUSAN dataset at a signal-to-noise ratio of five decibels: a babble-style selection of six speakers from the speech portion of MUSAN, a random music file, and a random noise recording, where by noise we refer to the noise category of the MUSAN data. Pink noise was also added, generated with a noise-generation function at a signal-to-noise ratio of five decibels, and reverberation was applied using simulated room impulse responses.
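A small sketch of the kind of degradation described above, assuming the speech, noise, and room impulse response are already loaded as NumPy arrays at the same sampling rate (the exact augmentation code used in the study may differ):

```python
import numpy as np
from scipy.signal import fftconvolve


def add_noise(speech, noise, snr_db=5.0):
    """Mix a noise sample into speech at a target SNR (5 dB in this study)."""
    reps = int(np.ceil(len(speech) / len(noise)))          # loop the noise if too short
    noise = np.tile(noise, reps)[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + gain * noise


def add_reverb(speech, rir):
    """Apply simulated room reverberation by convolving with an impulse response."""
    wet = fftconvolve(speech, rir)[: len(speech)]
    return wet / (np.max(np.abs(wet)) + 1e-12)              # normalise the peak level
```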
Here we can see the overall performance results of all the architectures in the presence of noise. In the noisy results, the proposed architecture performs best, while without noise it is the CQT-DNN-GMM and the score fusion that come out on top. We also observe a sort of trade-off between the CQT-DNN and the LFCC-GMM: the CQT-DNN performs better in the noisy cases, but slightly worse in the noiseless case compared to the GMM.
Finally, we also observe that the LFCC-GMM and CQCC-GMM architectures underperform compared to the proposed architecture and compared to the CQT-DNN in these noisy and reverberant scenarios. Here we see the same picture, but in terms of the t-DCF rather than the EER.
We can see that the score fusion performs best in the noiseless scenario; the noiseless setup and the noisy scenario are denoted by different line styles in the figure. Overall, we can also see that the CQT-DNN is the most robust to noise, and we can again observe the kind of trade-off between the LFCC-GMM and the CQT-DNN in the t-DCF that we previously saw in the EER.
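For reference, here is a minimal sketch of how the EER discussed above can be computed from countermeasure scores; the t-DCF additionally weighs ASV errors and costs, which is omitted here. The toy labels and scores are placeholders.

```python
import numpy as np
from sklearn.metrics import roc_curve


def compute_eer(labels, scores):
    """Equal error rate from binary labels (1 = bona fide, 0 = spoof) and CM scores."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))   # operating point where FNR is closest to FPR
    return (fpr[idx] + fnr[idx]) / 2


labels = np.array([1, 1, 0, 0, 1, 0])
scores = np.array([2.3, 1.1, -0.5, 0.2, 0.9, -1.4])
print(f"EER = {compute_eer(labels, scores):.3f}")
```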
We then proceeded to visualise the CQT-DNN embeddings, first with PCA. Looking at the visualisation and the different classes, it became apparent that most of the spoof classes separate very well from the green point cloud, which corresponds to the bona fide class. The VC classes, that is, the classes corresponding to voice conversion, were observed to overlap with the bona fide class. This explains the poorer detection performance on some VC classes, since these cannot be separated linearly in the two dimensions.
in the to these days
we see
a similar
consistent picture
another dimensionality reduction the
but these three
which stands for sixty still fifty mean and that
and what we the
is this same feature
of the v c cost use
all of that
with the one activities
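The visualisations just described can be sketched as follows; the random arrays stand in for the real embeddings and class labels, which are not available here and are therefore assumptions.

```python
# Project the countermeasure embeddings to 2-D with PCA and t-SNE and
# colour the points by class (bona fide vs. attack type).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 64))     # placeholder for the DNN embeddings
labels = rng.integers(0, 3, size=500)       # placeholder class labels

for name, reducer in [("PCA", PCA(n_components=2)),
                      ("t-SNE", TSNE(n_components=2, perplexity=30))]:
    z = reducer.fit_transform(embeddings)
    plt.figure()
    plt.scatter(z[:, 0], z[:, 1], c=labels, s=5, cmap="tab10")
    plt.title(f"{name} of countermeasure embeddings")
plt.show()
```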
We then proceeded to an additional experiment. The goal of this experiment was to see how the embeddings move when they are subjected to the different kinds of noise and reverberation. In the figure you can see what happens in the case of reverberation. Note that in this figure the blue point cloud, the red point cloud, and the green point cloud are the same as on the PCA slide.
Now let us look at how some samples behave: these ones from the bona fide class, and these ones from two of the spoofing attacks, after noise and reverberation are applied. What we see is that the ones corresponding to the bona fide class, which are these green dots, move closer to the actual decision boundary. We can also see that the spoofed samples, which become these orange dots, move closer to the decision boundary, but still stay on the correct side of it. This gives us a clue as to why the architecture is robust to reverberation: even though the embeddings move closer to the decision boundary, the correct classification decision is retained.
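A sketch of this embedding-movement experiment: fit PCA on the clean embeddings only and project the degraded versions with the same fitted projection, so the two point clouds are directly comparable. The arrays are placeholders for the real clean and degraded embeddings.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
emb_clean = rng.normal(size=(200, 64))                                   # clean utterances
emb_degraded = emb_clean + rng.normal(scale=0.3, size=emb_clean.shape)   # after noise/reverb

pca = PCA(n_components=2).fit(emb_clean)        # projection defined by the clean data
z_clean = pca.transform(emb_clean)
z_degraded = pca.transform(emb_degraded)
shift = np.linalg.norm(z_degraded - z_clean, axis=1)   # how far each utterance moved
print("mean shift in the PCA plane:", shift.mean())
```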
Now I am going to talk about our explainable-AI-based techniques. The first one is a gradient-based technique. First, we take the CQT spectrogram and, by backpropagating the gradients of the output through the network, we obtain a sensitivity map. This sensitivity map tells us which parts of the spectrogram were most important for the classification decision, that is, for deciding whether the speech is spoofed or whether it is natural. What we can then do is threshold the sensitivity map to obtain a binary mask, which basically segments the spectrogram into regions that did or did not reach a given importance. If we multiply the original CQT spectrogram with this binary mask, we obtain, after normalization and inversion back to the time domain, a reconstructed waveform.
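A compact sketch of this gradient-based procedure, assuming a differentiable `model` that maps a log-magnitude CQT to a scalar countermeasure score; the exact network, transform settings, and threshold used in the study may differ.

```python
import numpy as np
import torch
import librosa


def masked_reconstruction(model, wav, sr, percentile=80):
    """Backpropagate the CM score to the input CQT, threshold the sensitivity map
    into a binary mask, and return audio reconstructed from the masked CQT."""
    C = librosa.cqt(wav, sr=sr)                                  # complex CQT
    mag = np.log1p(np.abs(C))[None, None]                        # 1 x 1 x freq x time
    x = torch.tensor(mag, dtype=torch.float32, requires_grad=True)
    score = model(x).sum()                                       # scalar CM score
    score.backward()
    sensitivity = x.grad.abs()[0, 0].numpy()                     # sensitivity map
    mask = sensitivity >= np.percentile(sensitivity, percentile) # keep the most important bins
    return librosa.icqt(C * mask, sr=sr)                         # audio of the retained regions
```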
What I am going to play for you now is a series of audio examples, right after each other. First, you are going to hear the original waveform. Then, you are going to hear a reconstruction of the original using all of the features. And finally, you are going to hear only the audio that the network's sensitivity map retains as important, so you can get an impression of which parts of the speech the model is focusing on. Listening to these examples indicates which parts of the speech signal might be important for the decision.
That brings us to the second, listening-based technique. We order all audio files based on how challenging they are. The more challenging audio files are usually the ones whose scores are closer to the CM threshold, and the easy ones are those far from the CM threshold. What we can do is exploit this closeness to the CM threshold and use it to order the audios by their CM scores. By listening to these ordered audios, we can get an overall picture of the main noticeable cues without examining each individual file, and form an idea of the acoustics the model relies on.
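A minimal sketch of this ranking idea, assuming a hypothetical dictionary of per-file CM scores and a chosen operating threshold:

```python
def rank_by_difficulty(scores, threshold):
    """Return file names sorted so that trials closest to the CM threshold come first."""
    return sorted(scores, key=lambda f: abs(scores[f] - threshold))


# Example with placeholder scores: the most "challenging" trials are listed first.
scores = {"trial_a.flac": 1.9, "trial_b.flac": -4.2, "trial_c.flac": 2.4}
print(rank_by_difficulty(scores, threshold=2.0))
```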
I am going to show you what this looks like in the case of attack A18. You are going to hear progressively sorted audios: first ones that are far from the CM threshold in the direction of bona fide, and then finally ones that are closest to the spoofed side of the threshold. What you should be able to hear is that noise is more aggressively present when you are listening to the more spoofed-sounding audios. In general, we also hear that the spoofed speech sounds more unnatural; in these audio examples you can hear more of these extended, unnatural vowels, which gives a hint of what the model is picking up on.
We also performed some additional experiments using the MOSNet architecture, which can be used to compute objective naturalness estimates. We found a correlation of about 0.35 between the predicted mean opinion score and the first principal axis, that is, the first dimension of the PCA-projected embeddings. This suggests that the embeddings actually encode naturalness aspects of the speech. Interestingly, some of these voice conversion categories were even rated as more natural than the actual bona fide signals.
and point out directions for future
recognizer redeemable water
as an image reconstruction what you
in a minimal audio case
so in the future we want to use an l one based solution
trained on c you the
spectrograms
because these have been previously shall tools lingual speech coding i bit conventional fft spectrum
Finally, we also recognise that voice activity detection would be essential, since both the speech and the non-speech regions can be important for the countermeasure's decision. In the case of logical access data in particular, it would be important to design a good calibration study in which we investigate to what extent the decisions really rely on non-speech versus speech regions.
To summarise, we have found that our deep countermeasures are robust to noise, and we now have a better understanding of them. Even though we do not know exactly what they are doing internally, we know that they are more robust to noise than the GMM countermeasures, and we have managed to gain more insight into them by generating explainable visualisations and audio examples. Finally, we have also investigated an important aspect, which is that the embeddings correlate with estimated subjective naturalness, meaning that the architecture, to some degree, captures the naturalness of the speech.
I hope this presentation encourages you not to be afraid of using methods like these in your work just because of their seeming unexplainability. I would like to thank you for your attention.