
good thing under need for a nice introduction

actually anthony what's in my life for several years a four years

during the

twenty nine ten leading not introduce him

i can also be very well

okay so first of all i would like to confirm that indeed

the speaker and language characterization sig, splc, is the most vibrant special interest

group we know in isca

and note that we have the isca

president, myself, and then the past vice

president of isca and then

jean-francois, a past president of isca

all come to show the support

and i'd also like to thank luis and

eduardo

for having

brought odyssey to spain i believe that many of us

have wanted to come to bilbao and visit the basque country for a long

time and this gave an excuse for all of us

to come to this beautiful and

harry oppression

a year ago

i do not all relevant to me doing but i mean yes that the extend

the invitation

to ask me

to talk about anti-spoofing

i thought this would be a topic that

is very close to what we are discussing in speaker recognition

and it also made me work very hard the past few days to put

together the slides for this presentation on this topic

it is actually a topic of my

phd student

zhizheng wu who

graduated two years ago now he told me that he's working at apple

he's not here, is he here, no

and i'd like to start with, i'd like to thank a group of four people


at all ni

nick you've and same opportunity

for sharing with me a set of presentation slides that saved me a

lot of time they did a tutorial at apsipa asc, the asia

pacific signal and information processing association annual summit and conference

held in hong kong last december i attended

the talk and then they gave me the set of slides and i extracted quite a

number of them from

their presentations i just want to say thanks to them

and also

a also thing my student on another student show higher to

prepare some experiments just to make my talk more complete

wonderful so

so my topic will be on anti-spoofing i understand that anti-spoofing is

actually not a scientific discipline it is a kind of application that goes with a

speaker recognition system

and also because it's not yet an established discipline that's why i

don't think there is a formal definition of what anti-spoofing is

it is anything that protects the security of a speaker recognition system

that's what we

think about so today i will only share with you some of the

experience that we had and the topics we touched upon perhaps that experience can

spark further discussion


voice biometrics is actually just another name for speaker

recognition outside our community in twenty twelve there was a report saying that eighteen of

the top banks in the world had adopted speaker recognition systems actually now

the number has increased tremendously

i just a month ago many of my tingles announcement density per in addition to

launch a voice authentication system for

call center services and

fortunately we are also part of this project and just

to tell people that

for the first time we are paid to become hackers of the system

so we just

evaluate the security features of the speaker recognition system

the point

this is a projection by

try to come the market size one or

what kind of biometrics are

used in

banking, finance and of course maybe other areas

and you can see that voice biometrics is actually one of the growth areas

no i both the colour with them for my laptop screen but it just the

the last

however it shows that so it is

we see a tremendous growth which is as good as fingerprint even though finger

prints are already a mature technology by this time

when we talk to customers, i work in an institute and we face a lot of

industry partners who need to deploy speaker recognition systems

the question they ask is not so much how accurate the system is because they

see that as kind of a given because the system

must work well the question they usually ask is

how secure the system is in the face of attacks and

spoofing and other things like that


recently, actually two or three years ago, we deployed a technology to

the lenovo smartphone if you get a lenovo smartphone the screen unlocking

is likely to

include voice authentication it is our technology and of course they

also ask for anti-spoofing

mechanisms

to guard against replay attacks

later i will talk about it

so in my talk today i will talk about four

main items one is spoofing attacks then i will talk about

voice conversion and the artifacts that we may discover in the voice

and also lastly

asvspoof, the automatic speaker verification

anti-spoofing campaign of last year

i don't want to go through the details of the evaluation campaign but i will talk about

some of the observations so let me start

okay so typically a speaker verification system takes a voice as input and then

makes a decision to accept the identity claim or to reject it

most of the time we assume that the voice input is actually from a

live person, live speech

in reality it may not be true

we can categorise all the possible attacks into these four types impersonation

is just like getting a person to mimic

to impersonate your voice

replay means you manage to record somebody's voice and you play it back to the system

and speech synthesis and voice conversion these are the scientific

and technological means of creating speech

well there could be some other new methods that will be invented but for now i

suppose the ways of attack that we know can be categorized into

these four areas

this table summarizes the

accessibility, the effectiveness and the risk to the system and

the availability of countermeasures accessibility means how easily

you have access to this technology to spoof a system

so there are studies on impersonation basically you get a person to

act as another person

this is actually one of the very old performing arts usually you try

to learn to mimic

somebody's voice

studies show that people like this may be able to

mimic another person very well to the human ear but actually the voice may not be

a very

strong attack because the computer listens so differently

from human ears

and it is also difficult to train a person to mimic

somebody's voice so basically it has low accessibility and

it does not present a strong risk to a speaker verification system

a replay attack is basically to have somebody's

voice recorded while talking and then you play it back to the system

which is a low-tech attack

usually in the context of text-dependent verification

if it is text-independent you just have somebody's voice, a

recording, and

you edit the voice and put it back so basically it falls into the

speech synthesis and voice conversion categories so

for replay attacks

we evaluate the risk

mostly in the context of text-dependent speaker verification

when we talked about the voice unlocking screen of the lenovo phone

we developed a system that is kind of

taking a unique feature of the voice

to stop replay attacks

we know that

a human

vocal system cannot repeat exactly the same

voice twice so if you are able to record all

the voices and then when a new one comes in

you compare the incoming voice with the data in storage if they are exactly

the same that means

this is a replay attack

so we have a mechanism to do this but there could also be other ways to

do this for example

there are studies

she by idea those group on the some years ago on the on the

detecting replay attacks

the idea is when you replay

actually it is a replay of a recording and the recording usually is

taken from a far-field microphone

it has a different level of

noise and reverberation

and acoustic effects of the room

if you're able to characterize it

you are able to detect the replay there is an example here so this is the original speech

for the works

no i

well i

so you hear the reverberation and the noise level and this is

a

unique characteristic of far-field microphone recordings if we detect this of course

you can accept or reject a recorded voice but this is

very difficult because room acoustics change from place to place it's very difficult to

build just one model

that can identify all the room acoustics

another technique that i just mentioned is called audio fingerprinting

the idea is that if we can keep

the voice

in storage

for all the utterances that are presented to the system

of course to do that we don't have to keep the recording as a whole

away from this a whole

think of how we do fingerprint recognition actually the system doesn't keep

the picture

of the fingerprint

you keep only the key points, what we call the key points of

the fingerprints

the same for audio

there's this software called

shazam you know you can

record a piece of music and then you retrieve

the matching audio from the

system it is the same technology you have a voice recording

converted into a spectrogram

and then you turn the spectrogram into pixels and only remember the

key points of the data

of high energy or high contrast

and actually you only need like

forty bytes

to keep a recording of

five seconds

practically you can store an almost unlimited number of entries in the system

so when the test speech comes you just compare

one by one and if there is a match you just reject it because

no one can produce an identical voice

this to time signal to noise
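The fingerprint idea above can be sketched in a few lines. This is only a toy illustration, not the deployed system: the function names `frame_peaks` and `is_replay`, the frame sizes, and the choice of "strongest spectral bin per frame" as the key point are all assumptions made for the example.

```python
import cmath
import math

def frame_peaks(signal, frame_len=64, hop=32):
    """Toy fingerprint: index of the strongest DFT bin in each frame."""
    peaks = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        mags = []
        for k in range(1, frame_len // 2):
            s = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n, x in enumerate(frame))
            mags.append(abs(s))
        peaks.append(1 + mags.index(max(mags)))  # +1 because bin 0 is skipped
    return tuple(peaks)

def is_replay(test_signal, stored_fingerprints, min_match=0.95):
    """Flag the test utterance if it matches a stored fingerprint almost exactly."""
    fp = frame_peaks(test_signal)
    for ref in stored_fingerprints:
        if fp and len(ref) == len(fp):
            same = sum(a == b for a, b in zip(fp, ref)) / len(fp)
            if same >= min_match:
                return True
    return False

def two_tone(bin_a, bin_b, n=256, frame_len=64):
    """Synthetic test signal: one tone in the first half, another in the second."""
    return [math.sin(2 * math.pi * (bin_a if i < n // 2 else bin_b) * i / frame_len)
            for i in range(n)]

seen_before = two_tone(5, 9)
database = [frame_peaks(seen_before)]
print(is_replay(seen_before, database))     # identical audio is flagged
print(is_replay(two_tone(7, 12), database)) # different audio is not
```

Only the short tuple of peak indices is stored per utterance, which is the point made in the talk: the database keeps key points, not recordings.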

then it comes to speech synthesis and voice conversion these two share

many common

properties for example both need a vocoder to

generate the voice they rely on acoustic statistical models to generate the features

et cetera so

that's why i put them together so today my focus will be on voice conversion

but of course

much of the techniques also apply to

speech synthesis detection

when we do speaker verification

we work on robust features

features have to be, of course, reliable, have to be robust

and so we see this is chip in good

well we start a fine okay

features of both

properties but

most of us use short-term spectral features because they are easy to extract

and are actually reliable

and robust against noise, against

ageing, health state,

channel

variations so that's where all the focus is

there are typically two types of features one is based on the

voice production system like lpc features you consider the vocal system as

an excitation followed by a resonance filter right so you model

the excitation, the source, and model the filter this way you kind

of simulate the

production system there's another type of thinking that is to follow the

peripheral auditory system we report use a cell we don't we don't hear part of

it so things like you can see that it

the cochlea has a basilar membrane right

with bandpass filters to get the signals

and we try to

derive features that

kind of follow

bandpass filters at different scales like the mel scale

and this set of parameters we call auditory

features things like mfcc and

many other little talk about the tree transform

et cetera

unfortunately most of them work on robustness we try to extract the

unique speaker characteristics

and we consider the rest as noise that we try to accommodate

so as a result the more robust the speaker recognition system is the

more vulnerable it is to attacks because when we synthesise a voice you have

all kinds of variations and if the features are very good at

overcoming

all kinds of noise actually your system becomes very vulnerable to attacks so we

have like contradicting

requirements for the system on one hand we want to detect the synthetic voice which is

unwanted and on the other hand we want to be

robust and these two things

are not in the same direction therefore we cannot have one system that does both

usually we have one system for synthetic speech detection in the

front as a filter so when

we detect that yes this is

not a synthetic voice then the signal is passed to the speaker

verification system
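The front-end filter arrangement described here can be written down as a two-stage decision. This is a minimal sketch: the function name `verify_with_countermeasure`, the score convention (higher means more genuine) and the zero thresholds are assumptions for illustration, not details from the talk.

```python
def verify_with_countermeasure(cm_score, asv_score,
                               cm_threshold=0.0, asv_threshold=0.0):
    """Two-stage decision: spoofing detector first, speaker verifier second.

    cm_score:  countermeasure score, higher means more likely natural speech.
    asv_score: speaker verification score, higher means more likely the claimed speaker.
    """
    if cm_score < cm_threshold:
        return "reject: flagged as synthetic"
    if asv_score < asv_threshold:
        return "reject: speaker mismatch"
    return "accept"

# A converted voice can score well on the verifier and still be stopped up front.
print(verify_with_countermeasure(cm_score=-2.0, asv_score=3.5))
print(verify_with_countermeasure(cm_score=1.0, asv_score=3.5))
```

Keeping the two stages separate lets the verifier use features tuned for robustness while the detector uses features tuned for sensitivity to artifacts.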

next i'm going to talk about voice conversion so voice conversion actually now

is very accessible you can even go to amazon dot com and you can buy

a box for ninety nine point nine five dollars

ready for

we those k

and it actually allows you to change your voice, to masquerade your voice

i mean to change your identity from one to another or to

kind of a you can use that

step here

so basically you can modify the formants, the pitch you can

try to use this

to kind of spoof

a speaker verification system

clearly in your

in your room so


it will cost a

if we understand well how voice conversion is done maybe we can build a system to detect

synthetic voice a voice conversion system has basically three parts

at all of this like formant judge in the slides i believe that

distance voice with must you with my student

voices very different from a from his voice at one time this analysis and you

can be a system that combine the voice one another this must be very strong


voice comparison system

so basically there are three modules

to analyze, to convert the features and

to synthesise

why analyze because

it's very

hard to deal with the time-domain signal so you convert it to

the two men that you can

many project

the frequency domain

and then you decompose it into features

which you manipulate the way you want then you put them back

synthesising to generate the voice of another person

to do that this is called vocoding actually it is

a concept that was very well studied in the

early days in communication the telecom people want to transmit signals they do coding they

want to compress the signal

they want to

multiplex the signal they want to encrypt the signals

codes et cetera

so they analyze the signal into features, into parameters, then they do what

they want, transmit them over a narrowband channel and at the end

make sure that we can put the signal back

using the parameters so this is a traditional framework

in communications and

today actually we replace the transmission channel with

feature conversion that

allows us to do voice conversion

there are all kinds of vocoders out there that do this

we just group them broadly into two categories

people in speech synthesis know this very well one is called

sinusoidal vocoders basically the idea is

to generate a signal that pleases our ears

we don't care how humans produce the voice as long as what is generated

sounds very natural to human ears it is good

so the idea is

to decompose the

periodic sounds into a collection of

harmonics and then of course to include

what we call the modulated noise components so you have the noise which represents the fricatives

and the harmonic components that represent the vowels put these two together and you can

regenerate the sound

this kind of vocoder

in many studies has been

evaluated and found to be actually very natural but it has some issues

some of the issues are like

because you decompose into these harmonic components the number of parameters that you

need to describe the signal varies

with the signal itself

like the fundamental frequency, the sampling rate et cetera they affect the number so for every frame

we have a different number of parameters

this presents a problem when you want to model it

in a

statistical model we need the same number of parameters to model

of course there are also ways to overcome this so the studies

on this focus on how to manage the number of features

and on the other hand how to manage the noise because

harmonics are

good at describing periodic signals but not very good at describing noise

another type of vocoder is called the source-filter model which is

i think i mentioned earlier you can think of the vocal production system

you have the

source excitation and then you have the resonance filter and you try to model

both

and the good thing about this is

the parameters for example if you use linear predictive coding

actually you can fix the number of parameters

and that helps the statistical modelling

of course this also has a problem compared to sinusoidal

coding you don't control the phase

in some studies like music synthesis the phase vocoder

allows you to scale in both time and frequency domain so you can

control the phase of the signal but for the source-filter model

this filter has to be

stable and causal so all the roots have to

be within the

unit circle

because of this it follows a minimum phase

strategy when we reconstruct the signal and that actually causes artifacts

which is good for

artifact detection, synthetic speech detection
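The point that linear predictive coding gives a fixed number of parameters per frame can be seen in a small sketch. This uses the textbook autocorrelation method with the Levinson-Durbin recursion; the function name `lpc` and the sign convention (A(z) = 1 + a1 z^-1 + ...) are choices made for this example, not something stated in the talk.

```python
def lpc(signal, order):
    """Autocorrelation-method LPC via Levinson-Durbin.

    Returns (a, err) where a = [1, a1, ..., a_order] are the coefficients of
    A(z) = 1 + a1 z^-1 + ... and err is the final prediction error energy.
    """
    n = len(signal)
    r = [sum(signal[i] * signal[i + lag] for i in range(n - lag))
         for lag in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err

# Impulse response of a one-pole resonator x[n] = 0.9 * x[n-1]
decaying = [0.9 ** i for i in range(200)]
coeffs, _ = lpc(decaying, order=2)
print(len(coeffs), coeffs)   # always order + 1 values, whatever the input frame
```

Whatever the pitch or content of the frame, the output has `order + 1` values, which is what makes this representation convenient for a statistical model.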

so this is a

very simple study which was done by zhizheng a few years ago we

did a very simple test you have a number of vocoders and you do

copy synthesis you do nothing you just

simply analyse into features and recompose the signal

and see whether we can detect that this is a synthetic voice

and the result shows that with the modified group delay cepstral coefficients you can

do very well in detecting the synthetic voice so there are artifacts in the

data and the artifacts cannot be easily visualised but

with proper features

you can actually detect them

so no

after talking about the vocoder we talk about voice conversion

in voice conversion basically you want to convert one's

spectrum from one person to another

the things that characterize people's

voice are quite a number of things the main items are the formants

the formants firstly tell you which

vowel it is, which one of these

but they also have the personal

they also represent the vocal tract structure in a different way people have different

formant structures and

formant tracks

of course you also have the fundamental frequency which is the pitch and also the

intensity

the energy envelope all these are very difficult to manipulate individually what

we usually do is

spectral conversion, convert the

spectral envelope of one person's voice to another's by a kind of transform

a typical example is

what we usually do is that you have

a so-called parallel corpus

you have samples of the same content and you do alignment you can just

do a dtw alignment and then you come up with pairs of

features right

and then you

derive a conversion function from

these pairs

you have all the pairs, suppose they are enough to cover all the mappings

and then you do the conversion at run time

at run time you have the

source features you apply the conversion function and then you come up with the output
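The alignment step on a parallel corpus can be sketched with plain dynamic time warping. This is a minimal version under assumptions: Euclidean frame distance, the three standard step directions, and a hypothetical function name `dtw_align`; real systems usually add path constraints.

```python
def dtw_align(source, target):
    """Align two sequences of feature vectors; returns a list of (i, j) frame pairs."""
    n, m = len(source), len(target)
    inf = float("inf")

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    # cost[i][j] = best accumulated distance aligning source[:i] with target[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(source[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end to recover the frame pairs.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return path

src = [[0.0], [1.0], [2.0]]
tgt = [[0.0], [0.0], [1.0], [2.0]]   # same content, spoken more slowly at the start
pairs = dtw_align(src, tgt)
print(pairs)
```

Each `(i, j)` pair gives one source frame and one target frame with the same content, which is the training material for a conversion function.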

so there are many techniques, this slide is not for reading it

was prepared by zhizheng as a snapshot of the progress of

this research just to say that there are

many

conversion techniques

using samples

and linear regression

it's a linear function to convert source to target or a nonlinear method to

do it and then this one is a kind of transfer learning so you know

how to transform from one person to another's voice you learn the transform

matrix from many pairs of people and now you only have very few samples

and you language or

the history of a number of using the dependence that they rely on from other

people's in this way you height of all to

convert so that allows you to use less data and to estimate a smaller

number of parameters to achieve the same goal

so

i will just touch upon a few

basic approaches one is codebook mapping so basically the same thing you do alignment

you get the pairs and you do vector quantization for the pairs

and they are kept in pairs so at run time you only have one of them

for example the source is the right column

and the green is the target and the green you

don't have so you quantize the source into these vectors you get all those

code words and then you look up

the green ones to generate the target voice

of course this is a very elementary technique

to do this

imagine you do this

you focus very much on the pairing

you know the source and target match but it doesn't care too much about the

continuity in the target

therefore there is a lot of discontinuity in the generated voice
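Codebook mapping at run time amounts to a nearest-neighbour lookup followed by a table read. The sketch below assumes the paired codebooks already exist (entry k of the source codebook corresponds to entry k of the target codebook); the names `nearest` and `codebook_convert` and the two-dimensional toy vectors are hypothetical.

```python
def nearest(codebook, vec):
    """Index of the code word closest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(codebook[k], vec)))

def codebook_convert(frames, src_codebook, tgt_codebook):
    """Quantize each source frame, then emit the paired target code word."""
    return [tgt_codebook[nearest(src_codebook, f)] for f in frames]

src_codebook = [[0.0, 0.0], [10.0, 10.0]]
tgt_codebook = [[1.0, 1.0], [20.0, 20.0]]   # entry k is paired with source entry k
frames = [[0.5, 0.2], [9.0, 11.0], [1.0, 0.0]]
converted = codebook_convert(frames, src_codebook, tgt_codebook)
print(converted)
```

In the output the converted track jumps between code words from one frame to the next, which is the discontinuity the talk points out: each frame is chosen on its own, with no regard for its neighbours.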

another technique is to convert this to a continuous space

and if you do this then you can have a formula like

this you have x the input as the source and y the output and this

is a linear transformation

think of it as kind of

the previous one is a codebook and this

is a continuous version of it

right, a continuous version of it

and then of course this one generates a slightly smoother

voice
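One common way to make the codebook "continuous" is to replace the hard nearest-code-word decision with soft weights and give each class its own linear transform. The talk does not show the exact formula, so the sketch below is an assumption: scalar features, Gaussian classes with equal priors, and a hypothetical function name `soft_linear_convert`.

```python
import math

def soft_linear_convert(x, components):
    """Soft, piecewise-linear mapping of a scalar source feature x.

    components: list of (mean, var, a, b). Each class k proposes a*x + b and the
    proposals are blended with the posterior p(k | x) of a Gaussian with equal priors.
    """
    likes = [math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
             for mean, var, _, _ in components]
    total = sum(likes)
    return sum(like / total * (a * x + b)
               for like, (_, _, a, b) in zip(likes, components))

# Two classes: near 0 the mapping is y = x + 1, near 10 it is y = 2x.
components = [(0.0, 1.0, 1.0, 1.0), (10.0, 1.0, 2.0, 0.0)]
for x in (0.0, 4.0, 5.0, 6.0, 10.0):
    print(x, round(soft_linear_convert(x, components), 3))
```

Between the two class centres the output moves gradually from one transform to the other instead of switching abruptly, which is why this family of methods sounds smoother than a hard codebook.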

another technique, the previous two are kind of remembering the samples

in this one we do away with remembering the samples we remember the

warping functions you know between the

source speaker and the target speaker if you have enough samples you can derive

a warping function

between them and we just remember these warping functions

and at run time when the test data comes we apply the right

warping function to generate the target

another technique is frame selection

frame selection

the codebook approach i talked about

basically doesn't care too much about the continuity in the target

this one takes that into consideration

so you have certain frames in the training data that go

with each other with similar pitch, similar

phonetic context, they tend to go together so we have a kind of

selection process not just by

source-target distance but also by target-to-target frame distance to ensure the

continuity this one

gives us a slightly smoother

output
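Frame selection with both a source-target distance and a target-to-target distance is a shortest-path problem, so it can be sketched with a small Viterbi search. Everything concrete here is an assumption for illustration: scalar features, absolute-difference costs, and the hypothetical name `select_frames`.

```python
def select_frames(source, candidates, concat_weight=1.0):
    """Pick one training pair per source frame by Viterbi search.

    candidates: list of (source_feature, target_feature) pairs from training data.
    Cost = |source frame - candidate source| (source-target match)
         + concat_weight * |previous target - this target| (target continuity).
    Returns the chosen target features.
    """
    m = len(candidates)
    best = [abs(source[0] - cs) for cs, _ in candidates]
    back = []
    for t in range(1, len(source)):
        new_best, ptr = [], []
        for cs, ct in candidates:
            k = min(range(m),
                    key=lambda i: best[i] + concat_weight * abs(candidates[i][1] - ct))
            join = concat_weight * abs(candidates[k][1] - ct)
            new_best.append(best[k] + join + abs(source[t] - cs))
            ptr.append(k)
        best = new_best
        back.append(ptr)
    j = min(range(m), key=lambda i: best[i])
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return [candidates[j][1] for j in path]

training_pairs = [(0.0, 0.0), (1.0, 5.0), (1.1, 0.5)]
print(select_frames([0.0, 1.0], training_pairs, concat_weight=0.0))  # frame-by-frame choice
print(select_frames([0.0, 1.0], training_pairs, concat_weight=1.0))  # continuity-aware choice
```

With the continuity term switched off the second frame takes the closest source match even though its target jumps far from the previous one; with it switched on, a slightly worse source match is preferred because the target track stays smooth.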

and of course there is the unit selection technique this is very well known in

speech synthesis

where you have

sufficient samples maybe you have twenty utterances or fifty utterances of the target voice

you can break them down into elementary components

and then

at run time when you want to compose something you just pull the samples together and concatenate them

into one piece and play back

this is actually one of the common ways of building

a speech synthesis system but think of this if you do this there is a

discontinuity between the

units

both in magnitude and phase and this could be an artifact that we can detect

so to summarise

actually in voice conversion and in speech synthesis

studies we have

subjective evaluation and objective evaluation

and actually none of them address the spoofing quality of a synthetic voice

looks at unvoice lisa in one of the example you hear that

you a see that


a sample that you cannot even understand but it is

a very strong

attacking voice for a speaker verification system so these two analyses are

good for

kind of

perceptual quality evaluation

but when it comes to spoofing attacks i believe there is still effort needed to

define what the best ways are to analyze the converted voice in terms of

the strength of the attack to the system last year's asvspoof evaluation campaign

in my view provides an objective way that allows us to evaluate the strength of

synthetic voices or converted voices

okay so next let me talk about

the

artifacts

in the synthetic voice that possibly we can detect

we know that we cannot visualise the artifacts it is very difficult to see them

actually i

got my student to try to

show spectrograms to see the differences

there is no direct way to measure them but there are indirect ways of

modelling them for example

if you know that the signal is discontinuous of course you can use features that

represent this kind of discontinuity of speech both

in magnitude and in phase to model

the data

the two things that we should look into one is the magnitude

and the other is the phase i mean this is like standard signal

processing textbook material

what is important is

in most speech recognition and speech synthesis research

we pay much more attention to the magnitude than to the phase

for the simple reason that

magnitude is easier to manage

and the phase is much more difficult

to describe, to associate the parameters with a physical meaning

but actually there is a lot of research in the literature on phase features for speech

recognition and that provides

kind of a basis for us

to start this research

so in terms of magnitude

we know that to analyze the speech signal we need to do the short

time fourier transform

whether you use

sinusoidal coding or you use the source-filter vocoder you have to do this short time

time-frequency analysis

and this presents artifacts you know when we do the fft

you use a fixed window length

and then you have

windowing effects all these are artifacts

produced by the system in the process
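The fixed-window analysis mentioned here can be sketched with a direct short-time DFT. This is an illustrative toy rather than any particular vocoder's front end: the name `stft_mag`, the Hann window and the tiny frame sizes are assumptions chosen so the example runs quickly without external libraries.

```python
import cmath
import math

def stft_mag(signal, win_len=32, hop=16):
    """Magnitude short-time Fourier transform with a fixed-length Hann window."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (win_len - 1))
              for n in range(win_len)]
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(win_len)]
        frames.append([
            abs(sum(seg[n] * cmath.exp(-2j * math.pi * k * n / win_len)
                    for n in range(win_len)))
            for k in range(win_len // 2 + 1)
        ])
    return frames

# A pure tone sitting exactly on bin 4 of a 32-point analysis
tone = [math.sin(2 * math.pi * 4 * n / 32) for n in range(64)]
spec = stft_mag(tone)
first = spec[0]
print(len(spec), "frames,", len(first), "bins")
print("bin 3:", round(first[3], 2), "bin 4:", round(first[4], 2), "bin 5:", round(first[5], 2))
```

Even for a single pure tone, the neighbouring bins carry a substantial share of the energy. That spreading comes from the window, not from the signal, which is one concrete sense in which the analysis step itself leaves a trace.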

and then you have this smoothing effect you know when we do

this estimation almost all models use

maximum likelihood estimation right what is maximum likelihood

estimation trying to do

it tries to give you the average of everything

because the average always has the higher

probability right

and that causes a problem

the limited dynamic range of the

signals that are generated and this could be an artifact as well


the same for phase

phase is a bigger problem

often what we do as i said when we do synthesis or we do

recognition we use magnitude features and we actually

tend to ignore the phase i mean we still think that phase continuity

is important but we don't think that modelling the phase is as important

as the magnitude and that also presents an opportunity for us to

detect artifacts if we can model the phase patterns, the

distribution seen in

natural speech then we are able to detect synthetic speech

next just some examples this is

just to say that in short time frequency

analysis you use a fixed-length window to analyze the

signal

and you have

what we call the interference between

frequency bins

all the energy leaks across the frequency bins

and at the same time because you shift window by window with overlap

actually you also have this smearing effect along the time axis so

you have the interference along the frequency axis and you also have

this smearing effect

if we are able to

detect this then this could be

something that is

a signature of the synthetic voice

vocoders

most of the vocoders actually

try to remove

to smooth the waveform as a result of this

short time

effect and

some people say actually you are using

one artifact to correct another artifact so you have the short-time

windowing that causes you problems and then you use another smoothing

method to try to smooth everything out so you have one artifact correcting another

out if you have to a different significant

and you can

kind of extract the signal but interestingly with this smoothing effect because you use human

ears to measure the quality actually after the smoothing the evaluation

says that

the sound quality suppressed

but i believe that there are artifacts inside that you can describe and i also mentioned

that we use statistical models

either in voice conversion

or in

hidden markov model synthesis

and then we try to estimate

to generate the parameters using the maximum likelihood

criterion it always gives you the average, well not always, there are other ways

some of you might disagree with me, there are other ways to model

the dynamics but

in general the systems give you

a signal

that has a limited dynamic range for converted speech

in this example i just plot the spectrogram of the natural speech and the copy

synthesis speech and here you see the

absolute differences in the spectral

domain

and these are the pitch patterns of the

natural and the hmm-based synthetic

voice we know that in human speech

actually the pitch pattern is not as stable as in the synthetic voice this is from the

paper by what you have

twenty two thousand five for the height of a trot to chart one shows the

synthetic voice which has a very straight pitch pattern

it is in a p h this is the autocorrelation of this at the time

domain signals

and you see that a natural speech actually has about has something like you know

when you believing loosing you have this but broughton

the periodic modulation of the pitch contour

and also some pitch level variation

and synthetic voices are rather straight

because of this if we believe that

there is a lack of dynamic range in the synthetic voice and converted voice then

the dynamic range of the spectrogram can be used as a feature as well

one paper by tomi's group that we talked about only uses the

delta dynamic features of

spectrograms as the features, ignoring the static features, to detect synthetic voice
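Delta (dynamic) features are just a local slope of each feature track over time. The sketch below uses the common regression formula over a few neighbouring frames; the function name `delta`, the window half-width of 2 and the edge handling (repeating the first and last frame) are assumptions for illustration, not the configuration of the paper mentioned.

```python
def delta(track, width=2):
    """Regression-based delta of a scalar feature track over +/- width frames."""
    n = len(track)
    denom = 2 * sum(k * k for k in range(1, width + 1))
    out = []
    for t in range(n):
        num = 0.0
        for k in range(1, width + 1):
            nxt = track[min(t + k, n - 1)]   # repeat the last frame at the edge
            prv = track[max(t - k, 0)]       # repeat the first frame at the edge
            num += k * (nxt - prv)
        out.append(num / denom)
    return out

lively = [0.0, 2.0, -1.0, 3.0, 0.0, 2.5, -0.5, 1.0]   # wide dynamic range
flat = [1.0] * 8                                       # over-smoothed track
print([round(d, 2) for d in delta(lively)])
print(delta(flat))
```

A track that has been averaged towards its mean produces deltas close to zero, so the spread of the delta values is one simple way to expose a limited dynamic range.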

there are also techniques to

model the temporal modulation features you know we have feature frames which

are usually taken frame by frame with say a ten millisecond

shift and you

cut a piece of the signal, say fifty frames, and you extract it into

a temporal feature, use the temporal feature to model it and then use

this to

form a supervector like this to model

the dynamics of the magnitude

features and it works

well as a complementary feature

in synthetic voice detection

phase is something that we would like

to pay attention to

why people don't use phase features is mostly because it is difficult to

to describe and because it has

many unique properties for example we have this wrapping effect when you want to see it

you have to unwrap it this is a real

recorded signal you cannot see any patterns

but actually

if you think about it you have a real

signal and then you do the fourier transform you have

the real part and you have the imaginary part right and then

the magnitude comes from these two parts and the phase also comes

from these two parts and


by right they should present similar patterns like this many people have

shown that if you do the

unwrapping properly with proper normalization you see similar patterns

the phase feature and the magnitude feature look about the same

and this gives us an opportunity i mean another new feature to look into
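A minimal sketch of the wrapping effect and of unwrapping, assuming a toy phase ramp that grows by 0.5 radians per step. The `wrap` and `unwrap` helpers are written out here for illustration; they only handle jumps of one multiple of two pi between neighbours.

```python
import math

def wrap(phase):
    """Map a phase onto the principal interval, as a Fourier transform reports it."""
    return math.atan2(math.sin(phase), math.cos(phase))

def unwrap(wrapped):
    """Undo the two pi jumps so that neighbouring values differ by less than pi."""
    out = [wrapped[0]]
    offset = 0.0
    for prev, cur in zip(wrapped, wrapped[1:]):
        step = cur - prev
        if step > math.pi:
            offset -= 2 * math.pi
        elif step < -math.pi:
            offset += 2 * math.pi
        out.append(cur + offset)
    return out

true_phase = [0.5 * k for k in range(40)]      # smooth underlying phase
wrapped = [wrap(p) for p in true_phase]        # looks like a sawtooth
recovered = unwrap(wrapped)                    # smooth again
```

The wrapped sequence hides the underlying trend, which is the reason a raw phase plot of a recorded signal shows no obvious pattern.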

in synthetic voice and converted voice people do not pay enough

attention

to phase and this feature becomes very useful for synthetic voice detection

a to have to be too

there are many papers on all these other techniques such as instantaneous

frequency which is the time derivative of the phase signal so basically you have

two frames

and then the magnitudes look very similar but their

phase features

could be very the phase sorry the phase

if a square and could be very different

could be very different



by taking their


as a features

you're able to extend it to remember something

remember that for every sample in the time domain there is actually this two pi shift

of the signal to maintain the continuity so when you want to do this you

have to kind of unwrap it

because usually we shift the window by ten milliseconds or twenty milliseconds not by every sample


so when you take this thing you want to make sure that the

features are

kind of the phase is continuous you have to do

some kind of normalization
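A sketch of instantaneous frequency from two frames, assuming the usual phase vocoder style normalization: the phase advance expected from the bin centre over one hop is removed, and the remainder is wrapped before it is converted to Hz. The function names, the Hann window and the 440 Hz test tone are assumptions for illustration.

```python
import cmath
import math

def bin_phase(frame, k):
    """Phase of DFT bin k of a Hann-windowed frame."""
    n = len(frame)
    acc = 0j
    for i, x in enumerate(frame):
        w = 0.5 - 0.5 * math.cos(2 * math.pi * i / n)   # Hann window
        acc += w * x * cmath.exp(-2j * math.pi * k * i / n)
    return cmath.phase(acc)

def princarg(phase):
    """Wrap a phase value into [-pi, pi)."""
    return (phase + math.pi) % (2 * math.pi) - math.pi

def instantaneous_frequency(signal, sample_rate, k, frame_len, hop, start=0):
    """Frequency (Hz) seen in bin k, from the phase change between two frames."""
    first = bin_phase(signal[start:start + frame_len], k)
    second = bin_phase(signal[start + hop:start + hop + frame_len], k)
    expected = 2 * math.pi * k * hop / frame_len        # advance of the bin centre
    deviation = princarg(second - first - expected)     # normalized phase change
    return (k / frame_len + deviation / (2 * math.pi * hop)) * sample_rate

SR = 8000
sine = [math.sin(2 * math.pi * 440.0 * i / SR) for i in range(512)]
estimate = instantaneous_frequency(sine, SR, k=14, frame_len=256, hop=64)
```

Bin 14 is centred at 437.5 Hz in this setup, so the estimate shows how the phase change between frames refines the frequency beyond the bin spacing.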

ross a little bit

and then there is the group delay feature which is the frequency derivative of phase

you know we have a signal like this

and you have the power spectrum which shows the two resonance peaks here

you see and then the

group delay also shows something like this

and these features have a rather complex

mechanism but at least this shows you the initial step
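A short sketch of group delay computed without explicit unwrapping, assuming the standard identity that uses the transform of the signal and the transform of the signal multiplied by its sample index. The naive `dft` helper and the delayed impulse are illustrative; the modified group delay used in the literature adds further smoothing that is not shown here.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform, fine for short illustrative frames."""
    n = len(x)
    return [sum(v * cmath.exp(-2j * math.pi * k * i / n) for i, v in enumerate(x))
            for k in range(n)]

def group_delay(x, floor=1e-12):
    """Group delay in samples per DFT bin (negative frequency derivative of phase)."""
    spec = dft(x)
    ramp = dft([i * v for i, v in enumerate(x)])   # transform of n * x[n]
    out = []
    for a, b in zip(spec, ramp):
        power = max(abs(a) ** 2, floor)            # guard against spectral zeros
        out.append((a.real * b.real + a.imag * b.imag) / power)
    return out

# An impulse delayed by five samples has a group delay of five at every frequency.
delayed_impulse = [0.0] * 16
delayed_impulse[5] = 1.0
delays = group_delay(delayed_impulse)
```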

a similar utterance has many to a feature

this is a novel different plots or spectrograms face were so that a development my

student groups in last year's this

anti spoofing and if you compare them you see the log magnitude spectrum

magnitude

and you can have the group delay and if you unwrap it properly actually you see

similar patterns

to the magnitude

and you have many other things like modified group delay and instantaneous frequency and

the other features you see the paper but specimen to print

in this

allpass also but features to do that the detection

finally we come to last year's anti spoofing evaluation

each shows that

this is

a performance on the data a speaker recognition you've just use the gmm the standard

gmm system and then once you a lda with the spoofing voice us anything voice

the performance at twelve o a missile

okay in the evaluation there were five kinds of

synthetic voice

which are used as training and development data these are called known attacks

you have access to the training data of the synthesizers

and there are another five that you don't have access to and they don't tell you how

they are generated

and then you are only given the evaluation data and are supposed to detect the synthetic voice

for all of them so you typically use the five

voices to train your system and

use the system to test

across the ten evaluation

voices and this is a brief summary of the results you see that for

the known attacks the average performance is kind of a

for unknown attacks it gives like four times higher error rates
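For reference, a minimal sketch of how an equal error rate can be computed from detector scores, assuming a higher score means more natural. The threshold sweep and the toy score lists are illustrative assumptions and are not the evaluation's official scoring tool.

```python
def equal_error_rate(natural_scores, spoof_scores):
    """EER for a detector where a higher score means 'more natural'."""
    thresholds = sorted(set(natural_scores) | set(spoof_scores))
    thresholds.append(thresholds[-1] + 1.0)        # reject-everything point
    best_gap, best_eer = None, None
    for thr in thresholds:
        # False rejection: natural speech scored below the threshold.
        frr = sum(s < thr for s in natural_scores) / len(natural_scores)
        # False acceptance: spoofed speech scored at or above the threshold.
        far = sum(s >= thr for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer

natural = [0.9, 0.8, 0.7, 0.4]     # toy scores for natural speech
spoofed = [0.6, 0.3, 0.2, 0.1]     # toy scores for spoofed speech
eer = equal_error_rate(natural, spoofed)
```

In this toy case one natural and one spoofed trial are on the wrong side of the best threshold, so the two error rates meet at one in four.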

so of course this is kind of

known beforehand you know

if we know the signals of course you can do something you can train

the detector using the samples and then you detect them

like to actually we do one particular i think use a synthesiser which is a

kind of outline of the system

you know

all the systems did pretty these are the sixteen submissions of the systems

ranked by the performance

most of them did very well if you take one as an example it did very

well for

for all of them

without s ten without the unit selection synthesizer right

and did pretty reasonably well

and then when it comes to s ten the equal error rate is very high

so basically all the features kind of failed for

for s ten

so what is s ten

s ten is the tts


unit selection and i will play a

sound clip to show you how it is this is s ten

if i should

if i should


so we say it's night so here okay thank you can really hear this a

so in this case s ten presents the strongest

attack to the speaker recognition system i believe that is because it is unit selection

most of the frames because we do frame by frame

the frames are natural voice except

the connection points which represent a minority the yep in the back or


nowadays everything must have a little bit of a deep neural network so i also

include a neural network in my presentation

so this is a

very simple deep neural a simple neural network this is not deep

there is one layer anyway a neural network that takes the speech as the input

takes the features as input for type of features

and then generates an output


the closer to

the something closer to the right is more natural speech and the left side is

more synthetic voice and you can see that s ten

overlaps with natural speech very much s ten and natural speech give a very

similar score

that makes the features that we have kind of difficult to differentiate them
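A toy sketch of a one layer network that maps a feature vector to a score between zero (synthetic) and one (natural), assuming a single sigmoid unit trained by gradient descent. The two dimensional features, the learning rate and the function names are made up for illustration and do not reproduce the network on the slide.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(features, labels, epochs=500, rate=0.5):
    """Fit one sigmoid unit with batch gradient descent on the cross entropy."""
    dim = len(features[0])
    weights, bias = [0.0] * dim, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            grad_b += err
            for d in range(dim):
                grad_w[d] += err * x[d]
        weights = [w - rate * g / n for w, g in zip(weights, grad_w)]
        bias -= rate * grad_b / n
    return weights, bias

def score(x, weights, bias):
    """Closer to 1 means more natural, closer to 0 means more synthetic."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)

# Hypothetical two-dimensional features; label 1 = natural, 0 = synthetic.
natural = [[2.0, 1.5], [1.8, 2.2], [2.5, 1.9]]
synthetic = [[0.2, 0.1], [0.4, 0.5], [0.1, 0.3]]
weights, bias = train(natural + synthetic, [1, 1, 1, 0, 0, 0])
natural_scores = [score(x, weights, bias) for x in natural]
synthetic_scores = [score(x, weights, bias) for x in synthetic]
```

When the features of an attack look like those of natural speech, the two score lists overlap, which is the situation described for the unit selection voice.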

so this is another recent research this is a very recent piece of work

we take one hundred frames as the input to a

convolutional neural network so you have convolution and pooling and

all this allows you to get a wider range of samples

one hundred frames actually can cover kind of one

one second to make sure that in the one second there are

some transitions of


acoustic units between

the concatenation junctions

in it

we can see that s ten and natural speech have kind of a

good separation
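A toy sketch of why convolution plus pooling over one hundred frames can expose a junction, assuming a hand-set difference kernel instead of learned filters. The kernel, the pool size and the two trajectories are illustrative assumptions; a real convolutional network learns its kernels from data.

```python
def conv1d(sequence, kernel):
    """Valid 1-D convolution (cross-correlation form) along the time axis."""
    k = len(kernel)
    return [sum(kernel[j] * sequence[t + j] for j in range(k))
            for t in range(len(sequence) - k + 1)]

def max_pool(sequence, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(sequence[i:i + size])
            for i in range(0, len(sequence) - size + 1, size)]

def segment_score(trajectory, kernel=(-1.0, 1.0), pool=9):
    """Largest pooled response of a frame-to-frame difference filter."""
    response = [abs(v) for v in conv1d(trajectory, kernel)]   # rectified response
    return max(max_pool(response, pool))

NUM_FRAMES = 100   # one hundred frames, about one second at a 10 ms shift
smooth = [0.01 * t for t in range(NUM_FRAMES)]
spliced = [0.01 * t + (0.5 if t >= 60 else 0.0) for t in range(NUM_FRAMES)]
smooth_score = segment_score(smooth)
spliced_score = segment_score(spliced)
```

Pooling keeps the strongest local response, so a single discontinuity anywhere inside the segment still dominates the score.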


as a

positive and studies are i read quite a number of literatures one of them is

a multiple of the things is given to

features that d

the best in the evaluation which is the so-called auditory transform basically the idea

is in canada here you have this

you have this

you have this

possible to member different few

filters with different bandwidths and different center frequencies so you are kind of

trying to build filters of different center frequencies

with different bandwidths to get the coefficients

this is actually not new the mel scale cepstral coefficient already

is doing this but what is different

in this

the set of filters is similar to a kind of wavelet


transform for low frequency you have longer windows

but for high frequency you have

shorter

response functions in this way you get

different resolutions for

different frequency bands

this is a paper this is a

slight so there is given to the need via by any device and a big


just got a very impressive result that he's going to present in all this is

why don't

one to jump from one

so i try to share with us so

is the effect of see that this is

spectral where that is shown that using

constant q cepstral coefficients a similar concept to the auditory transform

at the

low frequencies there is better frequency resolution but poor

time resolution which allows

us to

have bigger windows in terms of time it has a bigger range

to cover

you know the

the discontinuity of the features

at the higher frequencies there is better time resolution
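A small sketch of the time and frequency trade-off in a constant Q analysis, assuming the textbook relation in which centre frequencies are geometrically spaced and the window length is Q times the sample rate divided by the centre frequency. The function name and the parameter values are assumptions and are not the settings of the constant q cepstral coefficient work.

```python
import math

def constant_q_plan(sample_rate, f_min, bins_per_octave, num_bins):
    """List (centre frequency in Hz, window in samples, window in seconds) per bin."""
    # Q is the ratio of centre frequency to bandwidth, the same for every bin.
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    plan = []
    for k in range(num_bins):
        centre = f_min * 2.0 ** (k / bins_per_octave)
        window = math.ceil(q * sample_rate / centre)   # longer window at low frequency
        plan.append((centre, window, window / sample_rate))
    return plan

plan = constant_q_plan(sample_rate=16000, f_min=100.0,
                       bins_per_octave=12, num_bins=61)
lowest, highest = plan[0], plan[-1]
```

With these assumed settings the lowest bin uses a window of well over a hundred milliseconds while the highest bin uses only a few milliseconds, which is the bigger range at low frequencies and the better time resolution at high frequencies.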

it is equal costs them to collect

need to ship would you littering his one presentation

so it

with these techniques he gets very impressive results in the evaluation the best result

was an equal error rate of eight point

five percent


then with this

he achieved like a one percent equal error rate this is really impressive

okay to sum up


spoofing attack spoofing detection there are many challenges and opportunities

most systems assume

that the input speech is actually natural speech i don't know whether this is an opportunity or

a challenge it depends on whether you are the attacker or you want to

as the system developer

build a more robust speaker verification system

many systems are very vulnerable to attacks and we need to take special

care to address this issue and


but speech perceptual quality doesn't equal

fewer artifacts actually in speech synthesis

my impression is that we try to adjust

the output signal just to please the human ears but actually in the spectrogram

in the magnitude or in the phase domain it has a lot of artifacts

that are yet to be discovered

machines and humans listen differently


machines now mostly listen to frame by frame features i remember the days when i was in the

company we wanted to keep those ten will be a single ip what demonstration for

tts system to have a dialog speech recognition system

every time i gave this demo

people were very happy


and people thought that this was

a magical demonstration but to me it was a safe demonstration because it was lpc features on top of

lpc features every time they get the echoes get the was correct if i talked

to the system some something kimmy role model without leaves acoustically ninety five percent

so machines and humans listen to different things and we need to discover this

more and

the study also shows from the last two years

publications show that features are more important than the classifier

or maybe we have not reached the level of having good features so there is a lot

more study to be done on the features before we move on to the classifier

this way

thank you

so thank you for this presentation so we have time for a couple of


then when you want your judgements to start


we get videos from terrorists

who obviously have changed the voice pitch using a pitch stretch or pitch bending algorithm so

it sounds like they speak with a very low voice either to disguise the voice or

to sound more threatening

the question is is there a way of

of inferring the degree of change that has been made to the pitch

it can be either just fundamental frequency or formants and fundamental frequency or just formants

but we are at a loss for any way of knowing whether and


to what extent it is possible to infer

the degree of change that has been made in order to

to change it back


we don't have

i think for forensics you need kind of visual tools that allow you to

do analysis i believe that

the features that we just talked about things such as instantaneous frequency the group

delay the modified group delay

cepstral coefficients the constant q cepstral coefficients those are wonderful tools for you to


make comparisons let me just show you just hold on a second

so actually a

so we did in the left when we

analyze the features

we did

observe some features for example this is called

relative phase shift this is natural speech this is synthetic speech and you

can see you cannot hear

the difference because they all sound very natural you cannot really

hear any differences but

the phasegram actually tells you something so i believe that

maybe it can be used as a tool

i don't think anybody has really built it into a system for practical use yet

you is just

so very nice talk i was very happy to see the breadth of the work in the field

and even our work has actually been covered by some of you folks here i wanted

to make one comment i think one of the fundamental challenges when you look at

voice conversion most of that research is really focused on humans being able to

assess the quality should usually for human consumption not necessarily for speaker recognition systems

so if you look at voice conversion technologies most end up focusing on making sure

that the prosody is correct because that's something it's pretty easy to kind of assess

it was different like fundamental frequency and so forth so i think in the mid

nineties we did some work where what we did was we took

the output of natural speech and segment based speech synthesis and fed it into auditory hair

cell models and looked at the hair cell firing characteristics on the output and what we

saw was that in regular normal speech

there's an actual production evolution that takes place in the articulators

the corresponding hair cell firing characteristics also have a natural variation

but on the synthetic side in segment based synthesis

when you look at the hair cell firing characteristics they don't necessarily behave the same way

so we found that was actually very interesting way to kind of bring kind of

the signal processing side of the hearing into the speaker assessment side

you could actually have really high quality speech synthesis

but the hair cell firing characteristics would be able to pick up the differences there

certainly i think just now example asked an the unit selection feature is a quite

example you can hear anything by actually it is stronger is exposed in voiced

yes one last question and then after ending you can break

so part of what you went through you were talking about the different aspects to try

to detect whether the voice was modified

right and so the things in there were

looking at the pitch the phase and so forth but the reality is that

when speaker

verification is used because of the ubiquity of the handsets being delivered

around most speech coming into the systems is already going to go through some form of vocoder

so is that by its nature going to start to

you know you're gonna get you're gonna start detecting a lot of these artifacts are

really gonna be natural artifacts of the communication system itself and

i think is that thing looked at is look like most of this is based

on what inputs are happening right at the from lips into the system

so i think that's a very good question so i think the challenge now is to

model different types of artifacts the artifacts that the system is susceptible to for example

you set a two

the could be if the system that's going through communication channel days editable coding that


but most of them do not really manipulative

parameter stay just try to recover the signal as much as possible



at the moment the research is focused very much on

the features they are able to surf a store scientific

exactly the if we have good features so we can tell we can model them


and this is mostly also about the telephone channel the different channels you mentioned

and in the telephone channel there are also

you know

analog channels digital channels and all kinds of things

so i

they could be issue but they also asked me about this when we do not

this when we kind of doing this analysis the or digital

the by actually this process the complete to analogue an income packaging

what the effects of the data

we have not really studied

okay thank you i think we have to stick to the schedule so let's thank

again purpose only how do