Hello everyone. Today I am going to present our work on spoofing countermeasure methods for speech signals in noisy conditions, and on the embeddings these methods learn.

With the increasing quality of text-to-speech and voice conversion methods, there is a growing need for spoofing countermeasures, and the ASVspoof challenge series has resulted in rapid progress.

However, there are still open challenges. One is how these countermeasures behave in realistic noisy scenarios; there has been very little research on this. Another part of the problem is that we do not know exactly what acoustic information is exploited by these systems; it is challenging to look inside the black box.

In this study, we propose a new CQT-based DNN-GMM countermeasure, and we systematically compare its performance to the ASVspoof 2019 baseline systems. This includes comparing their performance in various types of noisy scenarios.

We also attempt to look inside this seemingly impenetrable black-box model.

The proposed countermeasure is a combination of a deep neural network and a GMM: we train a GMM on the embeddings produced by a ResNet-style network, similar to how x-vectors are used. The network takes CQT spectrograms as input features.

These are passed through a series of convolutional layers. After each one there is a max-pooling layer, which downsamples by a factor of two. The architecture also includes skip connections between the convolutional layers, which makes it possible to train a very deep architecture.

Finally, before the GMM back-end, there is a fully connected layer, and we use its embeddings to train the GMM. The GMM then scores each utterance, outputting a log-likelihood ratio as the countermeasure score. This makes it possible to combine the countermeasure with automatic speaker verification, or simply to implement threshold-based rejection.
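To make this pipeline more concrete, here is a minimal sketch of the idea in PyTorch and scikit-learn: a small convolutional embedding extractor over a CQT spectrogram with max pooling and a skip connection, followed by one GMM per class whose log-likelihood ratio acts as the countermeasure score. The layer sizes, embedding dimension, and number of GMM components are illustrative assumptions, not our exact configuration.

```python
# Minimal sketch of a CQT -> CNN embedding -> GMM back-end countermeasure.
# Layer sizes, embedding dimension and GMM order are illustrative only.
import torch
import torch.nn as nn
from sklearn.mixture import GaussianMixture

class CQTEmbedder(nn.Module):
    def __init__(self, emb_dim=64):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                                   nn.MaxPool2d(2))          # downsample by 2
        self.conv2 = nn.Sequential(nn.Conv2d(16, 16, 3, padding=1), nn.ReLU())
        self.pool = nn.AdaptiveAvgPool2d(1)                   # global pooling
        self.fc = nn.Linear(16, emb_dim)                      # embedding layer

    def forward(self, cqt):                                   # cqt: (B, 1, F, T)
        h = self.conv1(cqt)
        h = h + self.conv2(h)                                 # skip connection
        return self.fc(self.pool(h).flatten(1))               # (B, emb_dim)

def fit_gmm_backend(emb_bonafide, emb_spoof, n_comp=8):
    """Fit one GMM per class on the extracted embeddings (numpy arrays)."""
    gmm_bf = GaussianMixture(n_comp, covariance_type="diag").fit(emb_bonafide)
    gmm_sp = GaussianMixture(n_comp, covariance_type="diag").fit(emb_spoof)
    return gmm_bf, gmm_sp

def cm_score(gmm_bf, gmm_sp, emb):
    """Log-likelihood ratio: higher means 'more bona fide'."""
    return gmm_bf.score_samples(emb) - gmm_sp.score_samples(emb)
```

Accepting or rejecting an utterance then simply reduces to comparing this score against a threshold.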

On this slide I present the overall performance of the two challenge baselines, the LFCC-GMM and the CQCC-GMM, the proposed CQT DNN-GMM, and the CQT-DNN on its own, together with a score fusion system, which is the fusion of the LFCC-GMM, the CQCC-GMM, and the CQT DNN-GMM systems.

We can see that, overall, the score fusion performs best, which also demonstrates the strength of combining the different architectures across the different kinds of spoofing attacks.

I would like to emphasize that in this table we only use the logical access portion of ASVspoof 2019. Because the original dataset is rather clean, it is not very suitable on its own for testing a noisy scenario; there is hardly any noise in the original data.

So we had to create a noisy version ourselves. It is computationally very expensive to create noisy versions of all the speech samples, so instead we adopted a less computationally intensive approach by sampling a subset of the ASVspoof 2019 dataset in a balanced way; by balanced we mean balanced with respect to the different attack types that exist in the dataset.

Then we added noise samples from the MUSAN dataset, all at a signal-to-noise ratio of 5 dB. We selected speech from six speakers of the MUSAN dataset, a random music file, and a random noise sample, where by noise we refer to the noise category of the MUSAN data. A further noise type was generated synthetically and added at a signal-to-noise ratio of 5 dB, and reverberation was applied as well, using simulated room impulse responses.
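As a rough illustration of this augmentation step, the sketch below mixes a noise recording into a speech signal at a target SNR and applies reverberation by convolving with a room impulse response. The 5 dB value matches the setup described above, while the helper functions themselves are an assumed implementation, not our exact scripts.

```python
# Minimal sketch: add noise at a target SNR and apply reverberation.
import numpy as np
from scipy.signal import fftconvolve

def mix_at_snr(speech, noise, snr_db=5.0):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`."""
    if len(noise) < len(speech):                      # loop noise if too short
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise

def add_reverb(speech, rir):
    """Convolve with a (simulated) room impulse response and renormalize."""
    wet = fftconvolve(speech, rir)[:len(speech)]
    return wet * (np.max(np.abs(speech)) / (np.max(np.abs(wet)) + 1e-12))
```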

Here we can see the overall performance results of all the architectures in the presence of noise and reverberation. With noise, the proposed architecture performs best, while without noise it is the CQCC-GMM and the score fusion that come out on top.

We can also observe a sort of tradeoff between the CQT-DNN and the LFCC-GMM: the CQT-DNN performs better in the noisy cases but slightly worse in the noiseless case compared with the GMM.

Finally, we can also see that both the LFCC-GMM and the CQCC-GMM architectures underperform compared with the proposed architecture and with the CQT-DNN in these noisy and reverberant scenarios.

On this slide we see a similar picture, but in terms of the t-DCF metric rather than the EER.

We can see that the score fusion performs best in the noiseless scenario. The noiseless setup is denoted by the dashed lines, while the noisy scenario is denoted by the solid, continuous lines.

Overall, we can also see that the CQT-DNN architecture is the most robust to noise and reverberation, and we can observe the same kind of tradeoff between the LFCC-GMM and the CQT-DNN that we have already seen in the EER.

We then produced visualizations of the CQT-DNN embeddings, first with PCA. When we colored the visualization according to the classes, it became apparent that most of the spoof classes separate very well from the green point cloud, which corresponds to the bona fide class. The exceptions are the VC classes, the classes corresponding to voice conversion, which we observed overlapping with the bona fide class. This explains the poorer detection performance on some VC systems, because these cannot be separated linearly in this two-dimensional space.

We see a similar, consistent picture with another dimensionality reduction technique, t-SNE, which stands for t-distributed stochastic neighbor embedding. What we observe is the same behavior: the VC classes overlap with the bona fide utterances.
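A minimal sketch of how such a two-dimensional visualization can be produced with scikit-learn, assuming the embeddings are available as a numpy array with one integer label per utterance (the function and its defaults are illustrative assumptions):

```python
# Minimal sketch: 2-D visualization of countermeasure embeddings.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def plot_embeddings(emb, labels, method="pca"):
    """emb: (N, D) embeddings, labels: (N,) integer class ids."""
    if method == "pca":
        points = PCA(n_components=2).fit_transform(emb)
    else:
        points = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(emb)
    for cls in np.unique(labels):
        sel = labels == cls
        plt.scatter(points[sel, 0], points[sel, 1], s=4, label=str(cls))
    plt.legend()
    plt.title(method.upper())
    plt.show()
```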

We then proceeded to an additional experiment. The goal of this experiment was to see how the embeddings move when they are subjected to the different kinds of noise and reverberation. In the figure you can see what happens in the case of reverberation. Note that in this figure the blue point cloud, the red point cloud, and the green point cloud are the same as on the PCA slide.

Now we show how some samples, these ones from the bona fide class and these ones from one of the spoofing attacks, move following the addition of noise with reverberation. What we see is that the ones corresponding to the bona fide class, the green dots, move closer to the actual decision boundary. We can also see that the spoofed utterances, the orange dots, move closer to the decision boundary as well, but still remain on the correct side of it.

This gives us a clue as to why the architecture is robust to reverberation: even though the embeddings move closer to the decision boundary, which is exactly what we can see, most of them remain on the correct side, so the right classification decision is retained.

Now I am going to talk about our explainable-AI-based techniques. The first one is a gradient-based technique, which works as follows. First, we compute the CQT spectrogram; then, based on the gradients obtained by backpropagating through the network, we obtain a sensitivity map. This sensitivity map tells us which parts of the spectrogram were the most important for the classification decision, that is, for deciding whether the speech is spoofed or whether it is natural.

What we can do is threshold this sensitivity map, which gives us a binary mask that basically segments the spectrogram into regions that are and are not important for the decision. We can then multiply the original CQT spectrogram with this binary mask, which yields a masked spectrogram, and after a normalization step and an inverse transform we obtain a reconstructed waveform.
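As a sketch of this gradient-based procedure, assume a differentiable model that maps a CQT spectrogram to a scalar countermeasure score; the model interface, the quantile used for thresholding, and the normalization are assumptions, and the masked spectrogram still has to be inverted back to a waveform afterwards.

```python
# Minimal sketch: gradient sensitivity map and binary mask on a CQT spectrogram.
import torch

def sensitivity_mask(model, cqt, quantile=0.8):
    """cqt: (1, 1, F, T) tensor; model returns a scalar score per utterance."""
    cqt = cqt.clone().requires_grad_(True)
    score = model(cqt).sum()                  # scalar countermeasure score
    score.backward()                          # gradients w.r.t. the input
    sens = cqt.grad.abs().squeeze()           # sensitivity map (F, T)
    sens = sens / (sens.max() + 1e-12)        # normalize to [0, 1]
    mask = (sens >= torch.quantile(sens, quantile)).float()  # binary mask
    masked_cqt = cqt.detach().squeeze() * mask  # keep only "important" regions
    return sens, mask, masked_cqt
```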

What I am going to play for you now is a series of audio examples, one right after the other. First you are going to hear the original waveform. Then you are going to hear a reconstruction of the original using all of the features. And finally you are going to hear the audio reconstructed only from the regions that the network considers important, so you can get an impression of what remains of the speech, here for one particular type of attack. Listening to these examples may indicate which parts of the speech signal might be important for the decision.

The second technique is a more manual one: we order all audio files based on how challenging they are. The more challenging audio files are usually the ones that are closer to the CM threshold, and the easy ones are the ones that are far from the CM threshold.

What we can do is exploit this phenomenon, the closeness to the CM threshold, and use it to order the audio files based on their CM scores. From these ordered audio files, the main thing we can obtain is an overall auditory impression gathered by listening, where we do not attend to the individual utterances in detail but rather focus on broad aspects of the acoustics.
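The ordering itself is straightforward; a minimal sketch, assuming we have the file names, their CM scores, and the decision threshold:

```python
# Minimal sketch: rank audio files by how close their CM score is to the threshold.
def rank_by_difficulty(files, scores, threshold):
    """Most challenging files (scores closest to the threshold) come first."""
    return sorted(zip(files, scores), key=lambda fs: abs(fs[1] - threshold))
```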

So, I am going to show you what this sounds like for one of the attack conditions. You are going to hear progressively more spoof-like samples: at first clearly bona fide ones, far from the CM threshold in the direction of bona fide, and then finally ones that are closer to the spoof side of the threshold. What we noticed when listening is that noise is more aggressively present in the more spoof-like audio files. In general, we also observed a more unnatural quality to the spoofed speech. In these audio examples you can also hear more of these extended vowels, which stand out whether you listen to the whole utterance or just pick out individual sounds.

We also performed some additional experiments using the MOSNet architecture, which can be used to compute an objective estimate of the mean opinion score. We found a correlation of 0.35 between the estimated mean opinion score and the first principal axis of the embeddings, the first principal component. This axis therefore appears to carry a signal about the naturalness aspects of the speech. Interestingly, some of the voice conversion categories were even rated as more natural than the actual bona fide signals, according to MOSNet.
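The reported correlation can be computed as in the following sketch, assuming arrays holding the MOSNet-predicted MOS values and the embeddings (variable names are placeholders):

```python
# Minimal sketch: correlate predicted MOS with the first principal component.
from scipy.stats import pearsonr
from sklearn.decomposition import PCA

def mos_vs_first_pc(embeddings, predicted_mos):
    """Pearson correlation between PC1 of the embeddings and predicted MOS."""
    pc1 = PCA(n_components=1).fit_transform(embeddings).ravel()
    r, p = pearsonr(pc1, predicted_mos)
    return r, p
```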

Let me now point out some directions for future work. We recognize that the waveform reconstruction we use is a rather crude method in the masked-audio case.

So, in the future, we want to use an L1-based reconstruction trained on CQT spectrograms, because these have previously been shown to be well suited to speech coding compared with conventional FFT spectrograms.

Finally, we also recognize that applying voice activity detection to these datasets would be essential, because both the speech and the non-speech regions can be important for the CM decision. In the case of the logical access data, it would be especially important to design a good calibration setup so that we can investigate to what extent the decisions really rely on non-speech versus speech regions.

To summarize, we have found that DNN-based countermeasures are robust to noise, and we now have a better understanding of them, even though we still do not know exactly what they are doing. We do know that they are more robust to noise than the GMM-based countermeasures. Nevertheless, we have managed to gain more insight into these models by generating explainable examples. Finally, we have also investigated an important aspect, which is that the embeddings correlate with subjective naturalness, meaning that the architecture, in a way, takes the naturalness of the speech into account.

I hope this presentation encourages you not to be afraid of using methods like these in your own work just because of their seeming unexplainability. I would like to thank you for your attention.