Thank you for the very nice introduction. Actually, Anthony has been part of my life for several years now, about four years, so he hardly needs to introduce me; he knows me very well.
Okay, so first of all I would like to confirm that speech and language processing is indeed the most vibrant special interest group we've got. And not only that: not only am I, a past vice president, present, but Jean-Francois, a past president of ISCA, has also come to show his support.
I would also like to thank the organizers for having brought the Odyssey to Spain. I believe many of us have wanted to come to Bilbao and visit the Basque country for a long time, and this made a very good excuse for all of us to come to such a beautiful and impressive place.
About a year ago they extended the invitation and asked me to talk about anti-spoofing.
I thought this would be a topic very close to what we will be discussing here, speaker recognition. It also kept me very busy over the past few days putting together the slides for this presentation. This is actually a topic of my PhD student, who graduated about two years ago; he told me that he is now working at Apple. He is not here — are you here? No.
I would like to start by thanking the group of people — Nick Evans and his colleagues — for sharing with me the set of slides they presented as a tutorial at the Asia-Pacific Signal and Information Processing Association (APSIPA) Annual Summit and Conference held in Hong Kong. I attended the talk there last December, and I have extracted quite a number of slides from their presentation. I just want to say thanks to them.
I would also like to thank my students — one of them, Xiaohai, prepared some experiments just to make my story complete.
So my topic will be anti-spoofing. I understand that anti-spoofing is actually not a scientific discipline in itself; it is a kind of application that goes with a speaker recognition system. And because it is not yet an established discipline, I don't think there is a solid definition for it. To us, anti-spoofing is anything that protects the security of a speaker recognition system; that is how we think about it. So today I will only share with you some of the experience we have had and touch upon a few points; perhaps those experiences can spark further discussion.
At the Voice Biometrics workshop — "voice biometrics" is the term used there; actually I just like the name "speaker recognition" that our community uses — there was a report in 2012 saying that eighteen of the top banks in the world had adopted speaker recognition systems. By now the numbers have increased tremendously. Just a month ago a bank back home announced the launch of a voice authentication system for call-center services. We were part of that project, and I can tell people that for the first time we were paid to become hackers of a system: our job was to evaluate the security features of a deployed speaker recognition system.
This is a projection of the market size of the different kinds of biometrics used in banking and financial services, and of course other areas as well. You can see that voice biometrics is actually one of the growth areas. The colors are a bit washed out compared with my laptop screen, but the chart shows a tremendous growth — growth we do not see in fingerprint, because fingerprint is already a mature technology.
When we talk to customers — I work in an institute where we deal with a lot of industry partners who want to deploy speaker recognition systems — the question they ask is not so much how accurate the system is. They take that as a given: the system must work well. The question they usually ask is how secure the system is in the face of attacks.
Recently — actually two or three years ago — we deployed a technology in the Lenovo smartphone. If you get the Lenovo smartphone, the screen unlocking includes a voice authentication option, and that is our technology. Of course, they also asked for anti-spoofing for the voice, countermeasures to go against replay attacks in particular, and I will talk about that.
So my talk will cover four main items: first, spoofing attacks; then voice conversion; then the artifacts we may discover in the converted voice; and lastly, last year's Automatic Speaker Verification Spoofing and Countermeasures (ASVspoof) evaluation campaign. I don't want to go through the details of the evaluation campaign, but I will talk about some of the observations. I suppose that is a fair place to start.
Okay. Typically, a speaker verification system takes voice as input and makes a decision: accept the identity claim or reject it. Most of the time we assume that the voice input is actually live speech from a person. In reality that may not be true.
We can categorize all the possible attacks into four types. Impersonation is getting a person to mimic, to impersonate, your voice. Replay is when you manage to record somebody's voice and play it back to the system. Speech synthesis and voice conversion are the technological means of creating speech. There could be some other new methods yet to be invented, but for now the attacks we know of can be categorized into these four types.
This table summarizes their accessibility, their effectiveness — the risk they present to the system — and the availability of countermeasures. Accessibility means how easily you can get access to the technology to spoof a system.
There are studies on impersonation, where basically you get a person to act as another person. This is actually part of a very old performing art: you try to learn to mimic somebody's voice. Studies show that skilled people may be able to mimic another person very well to human ears, yet the mimicked voice may not be a very strong attack, because the computer listens very differently from human ears. It is also difficult to train a person to mimic somebody's voice. So impersonation has low accessibility, and it does not present a strong risk to a speaker verification system.
A replay attack is basically to record somebody's voice while they are talking and then play it back to the system, which is low-tech. It is usually considered in the context of text-dependent verification; if the system is text-independent, you would have to cut and splice segments of the recorded voice to produce the required input, which basically falls into the speech synthesis and voice conversion categories. So for replay attacks we evaluate the risk mostly in the context of text-dependent speaker verification.
For the voice unlocking screen of the Lenovo phone, we developed a system that detects playback directly, exploiting a unique property of voice production: we know that the human vocal system cannot repeat exactly the same voice twice. So if you are able to record all the voices ever presented to the system, then when an attack comes in, you compare the incoming voice with the data in storage; if they are exactly the same, this is a replay attack.
We have a mechanism to do this, but there could also be other ways. For example, there were studies some years ago on protection against replay attacks built on the following idea: a replay is a replay of a recording, and the recording is usually taken with a far-field microphone. The noise level, the reverberation, and the acoustic effect of the room all leave their mark, and if you are able to characterize them, you are able to detect the replay. Here is an example: this is the original speech, and this is the far-field recording. [audio examples played] You can hear the reverberation and the raised noise level; these are the telltale characteristics of a far-field microphone recording. If we detect them, of course, we can reject the recorded voice. But this is very difficult, because room acoustics change from place to place; it is very hard to build just one model that can identify the room acoustics.
Another technique, the one I just mentioned, is called audio fingerprinting. The idea is that we keep the voices in storage as they are presented to the system. Of course, to do that we do not keep the recordings as a whole. Think of how we do fingerprint recognition: the system doesn't actually keep the picture of the fingerprint; it keeps only the key points of the fingerprint. The same goes for audio. There is well-known software — Shazam, for example — where you can record a piece of music and then retrieve it from a collection of audio; this is the same technology. You have a voice recording, you compute the spectrogram, you quantize the spectrogram into pixels, and you remember only the key points in those data — the high-energy, high-contrast points. You actually need only something like forty bytes to keep a recording of five seconds, so practically speaking you can store an unlimited number of entries in the system. When the test speech comes in, you just compare it one by one against the storage, and if there is an exact match, you reject it, because no one can produce two identical voice signals at two different times.
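To make the idea concrete, here is a minimal sketch of spectrogram-peak fingerprinting in Python (numpy and scipy assumed). The window sizes, the number of peaks, and the match threshold are illustrative choices, not the parameters of any deployed system.

```python
import numpy as np
from scipy.signal import stft
from scipy.ndimage import maximum_filter

def fingerprint(x, fs, n_peaks=60):
    """Reduce a waveform to a small set of (time, frequency) spectrogram peaks."""
    _, _, Z = stft(x, fs, nperseg=512, noverlap=256)
    S = np.log1p(np.abs(Z))
    # a point is a landmark if it is the maximum of its local neighborhood
    peaks = (S == maximum_filter(S, size=(9, 9)))
    f_idx, t_idx = np.nonzero(peaks)
    strongest = np.argsort(S[f_idx, t_idx])[-n_peaks:]   # keep strongest peaks
    return set(zip(t_idx[strongest], f_idx[strongest]))  # a few bytes per peak

def is_replay(x, fs, stored_prints, thresh=0.9):
    """Flag the input if it matches a stored fingerprint almost exactly."""
    fp = fingerprint(x, fs)
    # near-identical peak sets mean the 'new' voice is a replayed recording
    return any(len(fp & ref) / max(len(fp), 1) > thresh for ref in stored_prints)
```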
Then come speech synthesis and voice conversion. These two share many common properties: for example, both rely on a vocoder to generate the voice, and on statistical models to generate the features, et cetera. So today our focus will be on voice conversion, although of course much of the technique also applies to speech synthesis detection.
When we do speaker verification, we work on robust features: the features have to be reliable and they have to be robust. It is hard to find features with both properties, but most of us use short-term spectral features, because they are easy to extract and are actually both reliable and robust against noise, aging, health state, and channel variation. That is where all the focus is.
There are typically two types of features. One is based on the voice production system, like LPC features: you consider the vocal system as an excitation followed by a resonance filter, so you model the excitation source together with the filter, and in this way you simulate the production system. The other type of thinking is to model the peripheral auditory system, the way we perceive sound: in the cochlea we have the basilar membrane, which works like a bank of bandpass filters, and we try to derive features that follow such bandpass filters at different scales, such as the mel scale. This set of parameters is called auditory features — things like MFCCs and many others; later I will also talk about the auditory transform, et cetera.
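As a concrete illustration of the auditory family, here is a minimal sketch using the librosa library; the file path and parameter values are placeholders.

```python
import librosa

# any speech file will do; the path is a placeholder
y, sr = librosa.load("speech.wav", sr=16000)

# perception view: a bank of bandpass filters on the mel scale, like the cochlea
fbank = librosa.filters.mel(sr=sr, n_fft=512, n_mels=26)

# MFCCs summarize the log energies of those filters with a DCT
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, n_fft=512, hop_length=160)
print(fbank.shape, mfcc.shape)   # (26, 257) and (20, n_frames)
```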
Unfortunately, most of the work is on robustness: we try to extract the speaker's unique characteristics, and we treat all the rest as noise to be accommodated. As a result, the more robust a speaker recognition system is, the more vulnerable it is to attack: a synthesized voice carries all kinds of variations, and if the features are very good at overcoming that kind of "noise", the system becomes very vulnerable. So we have contradicting requirements. On one hand we want to detect the synthetic voice, which is unwanted; on the other hand we want to be robust. These two do not point in the same direction, and therefore we cannot have one system that does both. Typically we put one system for synthetic speech detection in the front, as a filter: only when it decides that the input is not a synthetic voice is the signal passed on to the speaker verification system.
Next I am going to talk about voice conversion. Voice conversion is now very accessible: you can even go to amazon.com and buy a box for $99.95, ready for use, that allows you to change your voice — to masquerade, to change your identity from one person to another. Basically it shifts the formants and the pitch, and you can try to use it to fool a speaker verification system, conveniently in your own room.
If we understand well how voice conversion is done, maybe we can build a system to detect the converted voice. My voice, as the formant charts in the slides show, is very different from my student's voice; if you can build a system that converts one of these voices into the other, that must be a very strong voice conversion system. A voice conversion system consists of basically three modules: one to analyze the voice, one to convert the features, and one to synthesize.
We analyze first because it is very hard to manipulate the time-domain signal directly: you convert it into a domain where you can do the manipulation, usually the frequency domain. Then you convert the features, manipulating them the way you want, and finally you put them back, synthesizing — generating — the voice of another person.
We do that with a vocoder. Vocoding is actually a concept that was very well studied in the early days of communications. People wanted to transmit signals, so they did coding: they wanted to compress the signal, to multiplex the signal, to encrypt the signal with codes, et cetera. They analyzed the signal into features, into parameters, estimated and transmitted those over the narrowband channel, and at the receiving end made sure the signal could be reconstructed from the parameters. That was the traditional framework in communications. Today we simply replace the transmission channel with a feature conversion function, and that allows us to do voice conversion.
There are all kinds of vocoders that do this; we can group them broadly into two categories, both used very widely in speech synthesis. One is the so-called sinusoidal vocoders. The idea is to generate a signal that pleases our ears: we care how human the voice sounds, so the aim is to generate something that sounds very natural to human ears. The approach is to decompose the periodic sounds into a collection of harmonics and then, of course, to add the modulated noise components. You have the noise, which represents the fricatives, and the harmonic components, which represent the vowels; put the two together and you can regenerate the sound.
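Here is a toy harmonic-plus-noise resynthesis in Python (numpy only) to make the decomposition concrete; the fixed pitch, the decaying amplitudes, and the noise modulation are simplifications for illustration.

```python
import numpy as np

fs, dur, f0 = 16000, 0.5, 120.0        # sample rate, seconds, pitch (toy values)
t = np.arange(int(fs * dur)) / fs

# harmonic part: sinusoids at integer multiples of F0 (the voiced component)
n_harm = int((fs / 2) // f0)
harmonic = sum((1.0 / k) * np.sin(2 * np.pi * k * f0 * t)
               for k in range(1, n_harm + 1))

# noise part: modulated noise standing in for fricatives and aspiration
noise = 0.05 * np.random.randn(len(t)) * (1.0 + 0.5 * np.sin(2 * np.pi * 3 * t))

x = harmonic + noise                   # put the two together: resynthesis
```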
This kind of vocoder has been evaluated in studies and found to be actually very natural, but it has some issues. Because you decompose the signal into harmonic and noise components, the number of parameters needed to describe the signal varies with the signal itself — with the fundamental frequency, the sampling rate, et cetera. Every frame may have a different number of parameters, and that presents a problem when you want to model it with a statistical model, where we need the same number of parameters for every frame. Of course there are ways to overcome this, so the studies in this area focus on how to manage the number of features, and, on the other hand, how to manage the noise — because harmonics are good at describing the periodic signal but not very good at describing the noise.
Another type of vocoder, which overcomes this, is the so-called source-filter model that I mentioned earlier: you think of the vocal production system as a source excitation followed by a resonance filter, and you try to model both. The good thing about this is the parameterization: for example, if you use linear predictive coding, you can fix the number of parameters, and that helps the statistical modeling.
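Here is a minimal LPC analysis/resynthesis sketch (librosa and scipy assumed) showing the fixed-size parameterization; a real vocoder works frame by frame and codes the excitation, so treat this as an illustration of the principle only.

```python
import numpy as np
import librosa
from scipy.signal import lfilter

y, sr = librosa.load("speech.wav", sr=16000)   # placeholder path

order = 16                           # fixed number of parameters, always the same
a = librosa.lpc(y, order=order)      # all-pole filter: [1, a1, ..., a16]

# inverse filtering gives the excitation (source); refiltering reconstructs
residual = lfilter(a, [1.0], y)      # source estimate
y_hat = lfilter([1.0], a, residual)  # excitation -> resonance filter -> speech

print(np.max(np.abs(y - y_hat)))     # near zero: this is copy synthesis
```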
Of course, this one also has a problem when you compare it to sinusoidal coding. Sinusoidal coding — some of the studies, music synthesis for example, handle the phase very well — allows you to scale in both the time and the frequency domain, so you actually control the phase of the signal. The source-filter model, in contrast, requires the filter to be stable and causal, so all the poles have to sit within the unit circle, and because of that it follows a minimum-phase strategy. When we reconstruct the signal, that actually causes artifacts — which is good for synthetic speech detection.
Here is a very simple study from a few years ago. The test is this: you take a number of vocoders and do copy synthesis — you do not change anything; you simply analyze the speech into features and then reconstruct the signal — and you see whether a detector can tell that the result is a synthetic voice. The results show that with the modified group delay cepstral coefficients you can do very well in detecting the synthetic voice. So there are artifacts in the data: the artifacts may be hard to visualize analytically, but with the proper features you can actually detect them.
So now, after talking about the vocoder, let's talk about voice conversion. In voice conversion you basically want to convert the spectrum from one person to another. Quite a number of things characterize a person's voice. The main items are the formants: the first and second formants tell which vowel it is, but they also carry personal characteristics — they represent the vocal tract structure, so different people have different formant structures and maybe different formant tracks. Then you have the fundamental frequency, which is the pitch, and also the intensity, the energy envelope. All of these are very difficult to manipulate individually, so what we usually do is convert at the spectral level: map one person's spectrum onto another's.
A typical recipe is the following. You usually have a so-called parallel corpus: samples of the same content from the two speakers. You do an alignment — a DTW alignment will do — and you come up with pairs of features. From these pairs you derive a conversion function; provided you have enough pairs to cover all the mappings, you then do the conversion at run time: you take the source features, apply the conversion function, and out comes the output.
There are many techniques — and this slide is not for reading; it was presented in the tutorial as a summary of the progress of this research. The point is that there are many conversion techniques: some use samples directly; some use linear regression, a linear function to convert source to target; and more recently there are methods that do a kind of transfer learning. People have learned how one person's voice transforms into another's — the transformation matrices — from many pairs of speakers; now, when you have only very few samples, you leverage the regularities learned from other people. That allows you to use less data and estimate fewer parameters while achieving the same goal.
I will just touch upon a few basic approaches. One is called codebook mapping. You do the same alignment to get the pairs, and you do vector quantization on the pairs. At run time you only have the source samples: say the source is the red column and the green is the target, which you don't have. You quantize the source into those vectors, look up the corresponding codewords, and then string the green ones together to generate the target voice. Of course, this is a very elementary technique. If you do this, you focus very much on the pairing — how well source and target match — but you don't care much about the continuity of the target; therefore there is a lot of discontinuity in the generated voice.
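A minimal sketch of codebook mapping, assuming scikit-learn and stand-in data for the DTW-aligned pairs: cluster the source frames, remember the paired target mean for each codeword, and at run time emit the paired codeword for every source frame — with exactly the discontinuity problem just described.

```python
import numpy as np
from sklearn.cluster import KMeans

# stand-in for DTW-aligned source/target pairs, shape (n_frames, n_dims)
rng = np.random.default_rng(0)
X_src = rng.standard_normal((2000, 24))
X_tgt = 0.8 * X_src + 0.5 + 0.05 * rng.standard_normal((2000, 24))

K = 64
vq = KMeans(n_clusters=K, n_init=4, random_state=0).fit(X_src)
# paired codebook: the mean target frame for each source codeword
tgt_codebook = np.stack([X_tgt[vq.labels_ == k].mean(axis=0) for k in range(K)])

def convert(frames):
    """Emit the paired target codeword per source frame (no continuity model)."""
    return tgt_codebook[vq.predict(frames)]

converted = convert(X_src[:100])
```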
Another technique is to do the conversion in a continuous space instead. Then you have a formula like this: x is the source input, y is the output, and the mapping is a locally linear transformation. You can think of it as a continuous version of the previous codebook mapping, a soft version of it, and of course it generates a slightly smoother voice.
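Here is a compact sketch in the spirit of that continuous mapping (a posterior-weighted mixture of linear transforms), assuming scikit-learn; the per-component transforms are fit by simple weighted least squares rather than the full joint-density solution from the literature.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# X_src, X_tgt: DTW-aligned pairs as before (stand-in data)
rng = np.random.default_rng(0)
X_src = rng.standard_normal((2000, 24))
X_tgt = 0.8 * X_src + 0.5 + 0.05 * rng.standard_normal((2000, 24))

M = 8
gmm = GaussianMixture(n_components=M, covariance_type='diag',
                      random_state=0).fit(X_src)
post = gmm.predict_proba(X_src)           # soft assignments P(m | x)

Xa = np.c_[X_src, np.ones(len(X_src))]    # append 1 so each map is affine
maps = []
for m in range(M):
    sw = np.sqrt(post[:, m])[:, None]     # weighted least squares per component
    A, *_ = np.linalg.lstsq(Xa * sw, X_tgt * sw, rcond=None)
    maps.append(A)                        # A_m and b_m packed together

def convert(frames):
    """y = sum_m P(m|x) (A_m x + b_m): a soft version of codebook mapping."""
    p = gmm.predict_proba(frames)
    F = np.c_[frames, np.ones(len(frames))]
    return sum(p[:, m:m + 1] * (F @ maps[m]) for m in range(M))

y_hat = convert(X_src[:100])
```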
The previous two techniques are both, in a way, remembering the samples. The next one, instead of remembering the samples, remembers the transformation — the warping function. Between a source speaker and a target speaker, if you have enough samples, we can derive a frequency warping function, and we remember only this warping function. At run time, when the test data comes, we apply the right warping function to generate the target.
There is another technique called frame selection. Unlike the global approaches, which do not care much about the continuity of the target, this one takes continuity into consideration. Certain frames in the training data co-occur with each other — similar pitch, similar phonetic context — so they tend to go together. The selection process is therefore driven not just by the source-to-target distance but also by the target's frame-to-frame distance, to ensure continuity. This gives us a somewhat smoother output.
And of course there is the unit selection technique, which is very well known in speech synthesis. If you have sufficient samples — maybe ten, twenty, fifty utterances of the target speaker — you can break them down into elementary units, and then at run time, when you want to compose something, you just pull the samples together, concatenate them into one piece, and play it back. This is actually one of the common ways of doing it — essentially a unit selection speech synthesis system. But think about it: if you do this, there is discontinuity between the units, both in magnitude and in phase, and that could be an artifact we can detect.
To summarize what I have said so far: in voice conversion and speech synthesis studies we have subjective evaluation and objective evaluation, but neither of them addresses the spoofing capability of a synthetic voice. Look at unit selection: in one of the examples you will hear, you cannot even understand the sample, yet it constitutes a very strong attacking voice against a speaker verification system. So the existing analyses work well for perceptual quality evaluation, but when it comes to spoofing attacks, I believe the effort to define the best way to assess a converted voice — in terms of the threat it poses to the attacked system — is last year's ASVspoof evaluation campaign. In my view it provides an objective benchmark that allows us to evaluate the strength of synthetic and converted voices.
Okay, so next let me talk about the artifacts left by synthesis in the synthetic voice that we can possibly detect. We cannot really visualize the artifacts — they are very difficult to see; I actually had my student group try to plot spectrograms to look for the differences. There is no direct way of measuring them, but there are indirect ways of modeling them. For example, if you know that the signal is discontinuous, you can use features that represent this kind of discontinuity of speech, both in magnitude and in phase, to model the data.
There are two things we should look into: one is the magnitude and the other is the phase — this is the standard signal processing textbook view. What is important is that in most speech recognition and speech synthesis research, we pay much more attention to the magnitude than to the phase, for the simple reason that the magnitude is easier to manage: it is easier to visualize, whereas the phase is much more difficult to describe, to associate its parameters with a physical meaning. But there is actually a lot of research in the literature on phase features for speech recognition, and that provides a kind of seed for us to start this research.
In terms of magnitude: we know that to analyze the speech signal we have to do a short-time Fourier transform. Whether you use sinusoidal coding or a source-filter vocoder, you want to do this short-time time-frequency analysis, and it introduces artifacts. When we do an FFT we use a fixed window length, and then we get spectral leakage and windowing effects — all of these are artifacts produced by the system in the process. Then there is the smoothing effect. When we do statistical parametric synthesis or conversion, almost all models rely on maximum likelihood estimation, and maximum likelihood estimation tries to give you the average of everything, because the average always gives the higher probability. That causes a problem: the limited dynamic range of the generated signals, which could itself be an artifact that we can detect.
The same goes for phase; in fact, phase is a bigger problem. Often, as I said, when we do synthesis or recognition we use magnitude features and we deliberately ignore the phase. We still think that phase continuity is important, but we do not treat modeling the phase as being as important as the magnitude. That presents an opportunity for us to detect artifacts: if we can model the patterns of the phase distribution seen in natural speech, then we are able to detect synthetic speech.
Next, some examples. The point is just that in short-time frequency analysis you use a fixed-length window to analyze the signal, and as a result you get interference between frequency bins — the energies leak across frequencies — and at the same time, because you shift the window frame by frame with overlap, you also get a smearing effect along the time axis. So you have the interference on the frequency axis and the smearing factor on the time axis. If we were able to detect these, they could be a signature of the synthetic voice.
Vocoders — most of them, anyway — actually try to remove, to smooth away, the waveform artifacts that result from this short-time effect. You could say that we are using one artifact to correct another artifact: the short-time spectral leakage causes problems, and then smoothing methods are used to try to smooth everything out, so you have one artifact corrected by another — two different signatures. And interestingly, because we use human ears to judge the quality, the evaluations after this smoothing say that the sound quality has improved.
But I believe the artifacts are still inside, even if you cannot perceive them. I should also mention the statistical models again — whether in voice conversion or in hidden Markov model based synthesis, we generate the parameters using the maximum likelihood criterion, which always gives you the average. Well, not always — you may disagree with me, and there are other ways to model the dynamics — but in general such systems give you a kind of average signal, and that means a limited dynamic range in the converted speech.
In this example I simply plot the spectrograms of natural speech and of copy-synthesis speech, and you can see clear differences in the spectral domain. And here are the pitch patterns of natural speech and of HMM-based synthetic speech. We know that in human speech the pitch pattern is not as stable as in a synthetic voice; in a paper from around 2005 the authors chart the pitch contours, and the synthetic voice has a very straight pitch pattern. This is the pitch, and this is the autocorrelation of the time-domain signal. Natural speech has a natural jitter — a periodic modulation on top of the pitch, plus some pitch-level variation — while the synthetic voice is rather rigid.
Because of this, if we believe that the synthetic or converted voice lacks dynamic range, then the dynamic range of the spectrogram itself can be used as a feature. There is one paper by Tomi's group about using only the delta dynamic features of the spectrogram, ignoring the static features, to detect synthetic voice. There are also techniques to model temporal modulation features: normally we extract features frame by frame with a ten-millisecond shift; instead, you cut a piece of the signal — say fifty frames — apply a temporal filter across it, and form a supervector that models the trajectory rather than the frame-wise magnitude. This works well as a complementary feature in synthetic voice detection.
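A minimal sketch of such temporal modulation features, assuming numpy and librosa; the fifty-frame window and the mel front end are illustrative choices.

```python
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=16000)      # placeholder path
logmel = librosa.power_to_db(
    librosa.feature.melspectrogram(y=y, sr=sr, n_mels=26, hop_length=160))

def modulation_supervector(feats, win=50):
    """FFT along time within each band over win-frame chunks, then stack."""
    vecs = []
    for start in range(0, feats.shape[1] - win + 1, win):
        chunk = feats[:, start:start + win]
        mod = np.abs(np.fft.rfft(chunk, axis=1))  # temporal modulation spectrum
        vecs.append(mod.flatten())
    return np.stack(vecs)

sv = modulation_supervector(logmel)               # one supervector per chunk
```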
Phase is something that deserves our attention. People don't use phase features mostly because they are difficult to describe, and because phase has some peculiar properties — for example the wrapping effect: if you plot the raw, wrapped phase, you cannot see any patterns. But think about it: if you have a real signal and you take its Fourier transform, you get the real part and the imaginary part. The magnitude comes from these two parts, and the phase also comes from the same two parts, so by right they should present similar patterns. Many people have shown that if you do the unwrapping properly, with proper normalization, the phase features look about the same as the magnitude features. That gives us an opportunity — a new set of features to look into. The generation of synthetic and converted voices does not pay enough attention to phase, so phase-derived features become very useful for synthetic and converted voice detection.
There are many papers on these techniques. One family is the instantaneous frequency, which is the time derivative of the phase signal. Basically you take two frames: their magnitudes may look very similar, but their phase differences can be very different. By taking that difference as a feature, you are able to capture something the magnitude misses. Strictly speaking, if we remembered every sample in the time domain, the phase would evolve continuously up to two-pi wraps; but since we shift the window by ten or twenty milliseconds rather than sample by sample, when you take the phase difference you have to make sure the phase stays continuous — you have to unwrap it and apply a kind of normalization.
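Here is a minimal instantaneous-frequency computation, assuming numpy and librosa: take the phase difference between successive frames, subtract the expected phase advance of each bin, and wrap back to the principal value. The analysis parameters are illustrative.

```python
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=16000)      # placeholder path
n_fft, hop = 512, 160
Z = librosa.stft(y, n_fft=n_fft, hop_length=hop)
phase = np.angle(Z)

# each bin's phase is expected to advance by 2*pi*k*hop/n_fft per frame
k = np.arange(Z.shape[0])
expected = (2 * np.pi * hop * k / n_fft)[:, None]

dphi = np.diff(phase, axis=1) - expected          # deviation from expectation
dphi = np.mod(dphi + np.pi, 2 * np.pi) - np.pi    # wrap back to (-pi, pi]
inst_freq = (k * sr / n_fft)[:, None] + dphi * sr / (2 * np.pi * hop)  # in Hz
```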
Then there are the group delay features, the frequency derivative of the phase. Say you have a signal like this: its power spectrum shows the two resonance peaks here, and the group delay shows something similar. These features have a rather complex mechanism, but at least they show a structure similar to the magnitude features. Here are several plots of phase-based spectrograms that my student's group developed for last year's ASVspoof: comparing natural and spoofed speech, if you look at the log magnitude spectrum and at the group delay, properly unwrapped, you actually see patterns similar to the magnitude. And there are many variants — the modified group delay, the instantaneous frequency, and other features; you can find the specific implementations in the papers. All of these are phase-derived features used for the detection.
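For reference, here is a sketch of the standard modified group delay computation for a single frame (numpy assumed); the cepstral-smoothing length and the alpha and gamma exponents follow the common recipe in the literature, but the exact values here are illustrative.

```python
import numpy as np

def modified_group_delay(frame, n_fft=512, alpha=0.4, gamma=0.9, lifter=30):
    """Modified group delay function of one windowed frame (len <= n_fft)."""
    n = np.arange(len(frame))
    X = np.fft.rfft(frame, n_fft)
    Y = np.fft.rfft(n * frame, n_fft)      # spectrum of n*x[n]

    # cepstrally smoothed magnitude spectrum tames the zeros of X
    cep = np.fft.irfft(np.log(np.abs(X) + 1e-10))
    cep[lifter:-lifter] = 0.0              # keep the low-quefrency part only
    S = np.exp(np.fft.rfft(cep).real)

    tau = (X.real * Y.real + X.imag * Y.imag) / (S ** (2 * gamma) + 1e-10)
    return np.sign(tau) * np.abs(tau) ** alpha

# usage on a dummy frame (a real system would window and step through frames)
mgd = modified_group_delay(np.hanning(400) * np.random.randn(400))
```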
Finally we come to last year's ASVspoof evaluation. This slide shows what spoofing does to a speaker recognition system — just a standard GMM system: once you attack it with spoofed, synthetic voices, the error rate shoots up.
In the evaluation there were ten kinds of synthetic voices. Five of them were provided as training and development data — these are called the known attacks, for which you have access to the training data from the synthesizers. Then there are another five to which you have no access, and you are not told how they are generated. You are only given the evaluation data, and you are supposed to detect the synthetic voice in all cases. So typically you used the five known attacks to train your system and then tested it across the ten evaluation attacks.
Here is a brief summary of the results. For the known attacks the average performance is quite good, while for the unknown attacks the error rates are something like four times higher. Of course, when an attack is known beforehand — when you know the signals — you can train the detector on samples and then detect it. One particular attack, S10, used a unit selection synthesizer and was an outlier for all the systems. These are the sixteen submitted systems, ranked by performance: most of them did very well for almost all of the attacks, but when it comes to S10, the unit selection synthesizer, the equal error rates are very high. Basically, all the features failed for that one.
So this is the TTS using unit selection; let me play a sound clip to show you what it is like. [plays samples: "if i should… if i should…"] You can really hear it: this S10 presents the strongest attack to the speaker recognition system. I believe that is because unit selection concatenates frames of natural voice: since we do the detection frame by frame, the frames are natural voice, except at the concatenation points, and those represent a minority of the frames.
Nowadays everything must have a little bit of deep neural network, so I also include neural networks in my presentation. Here is a very simple neural network — not deep, just one hidden layer — that takes the speech features as input, four types of features, and generates an output score. Sounds scoring towards one side are more like natural speech, and sounds towards the other side are more like synthetic voice. You can see that S10 overlaps with natural speech very heavily: S10 and natural speech are given very similar scores, which means the features we have cannot differentiate them. Then there is more recent work — very recent work — where we take one hundred frames as the input to a convolutional neural network. With a hundred frames you do pooling, and all of this lets you cover a wider range of samples: a hundred frames can cover the span where the transitions between acoustic units — the concatenation junctions — occur. With this, you can see that S10 and natural speech show a reasonably good separation.
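Here is a minimal sketch of such a convolutional detector, assuming PyTorch; the layer sizes are illustrative and this is not the actual system from the study.

```python
import torch
import torch.nn as nn

class SpoofCNN(nn.Module):
    """Binary natural-vs-spoofed classifier over 100-frame feature patches."""
    def __init__(self, n_bins=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),               # pooling widens the temporal context
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(32 * (n_bins // 4) * (100 // 4), 2),  # natural vs spoofed
        )

    def forward(self, x):                  # x: (batch, 1, n_bins, 100 frames)
        return self.net(x)

model = SpoofCNN()
scores = model(torch.randn(8, 1, 64, 100))   # dummy batch of feature patches
```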
On a positive note, among the post-evaluation studies — I read quite a number of them — one is devoted to the feature that did the best after the evaluation, based on the so-called auditory transform, the constant-Q transform. The idea is that you have a bank of filters with different bandwidths and different center frequencies. This is actually not new — the mel-scale cepstral coefficients already do this — but the difference is that this set of filters follows a wavelet-like design: for the low frequencies you have longer windows, and for the high frequencies you have shorter impulse responses. In this way you get different resolutions for different frequency bands.
This comes from a paper whose slides were kindly given to me by the authors. They obtained a very impressive result that they are going to present here, so I don't want to steal their thunder; I will just share the effect. It shows that with the constant-Q cepstral coefficients — a concept similar to the auditory transform — at the low frequencies you get better frequency resolution but poorer time resolution, and the poorer time resolution effectively means a bigger window in time, a bigger range to cover the discontinuities of the features; at the higher frequencies you get better time resolution, which is equally important. I will leave the details to their own presentation.
With these techniques they got a very impressive result: in the evaluation the best result had been an equal error rate of eight point five percent, and with this they achieved something like a one percent equal error rate, which is really impressive.
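A rough sketch of the constant-Q cepstral coefficient pipeline, assuming librosa and scipy: a constant-Q transform with geometrically spaced bins, log power, uniform resampling of the frequency axis, then a DCT. The bin counts and resampling size here are illustrative choices, not the authors' configuration.

```python
import numpy as np
import librosa
from scipy.fftpack import dct
from scipy.signal import resample

y, sr = librosa.load("speech.wav", sr=16000)      # placeholder path

# constant-Q transform: geometrically spaced bins, longer windows at low freq
C = np.abs(librosa.cqt(y, sr=sr, hop_length=256, fmin=15.0,
                       n_bins=96, bins_per_octave=12))
logC = np.log(C ** 2 + 1e-10)

# resample the geometric frequency axis to a uniform one, then DCT -> cepstrum
uniform = resample(logC, 128, axis=0)
cqcc = dct(uniform, type=2, axis=0, norm='ortho')[:20]   # keep 20 coefficients
```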
Okay, to summarize. Spoofing attacks pose many challenges and opportunities. Most systems assume that the input speech is actually natural speech — I don't know whether this is an opportunity or a challenge; it depends on whether you are the hacker or the system developer. A more robust speaker verification system is typically more vulnerable to attacks, so we need to take special measures to address this issue. And better perceptual quality does not equal fewer artifacts: in speech synthesis, my impression is that we adjust the output signal just to please human ears, but in the spectrogram, in the generated magnitude and phase, there are a lot of artifacts that have yet to be discovered. Machines listen differently from humans.
Machines mostly listen to frame-by-frame features. I remember the days when I was in a company and we gave demonstrations of a TTS system talking to a dialog speech recognition system. Every time I gave the demo, people were very happy; they thought it was a magical demonstration. But to me it was a safe demonstration, because the synthesizer speaks from LPC features, and the recognizer, hearing its own LPC-coded speech echoed back, always got the words correct — whereas if I talked to the system myself, it might recognize only ninety-five percent.
The studies from the last two years of publications also show that features are more important than the classifier — or maybe we have just not reached the level of having good features. So there is a lot more work to be done on the features before the classifier. That's it.
thank you
So let's thank Haizhou for this presentation. We have time for a couple of questions. Whoever wants to, please start. Anyone?
We get audio from terrorists, and often they have the voice pitched using a pitch-stretching or pitch-bending algorithm, so it sounds like somebody with a very low voice, either to disguise the voice or to sound more threatening. The question is: is there a way of inferring the degree of change that has been made to the pitch — and to the formants, whether it is pitch and formants together or just the formants? Would we have any way of knowing whether, and to what extent, it is possible to infer the degree of change that has been made, in order to change it back?
Well, we don't have, I think, forensic tools of that kind — visual tools that allow you to do such analysis. But I believe that the features we just talked about — the instantaneous frequency, the group delay and modified group delay cepstral coefficients, the constant-Q cepstral coefficients — are wonderful tools for doing the comparison. Let me show you — just a second. In the lab, when we analyzed the features, we did observe some patterns. For example, this is the so-called relative phase shift: this is natural speech and this is synthetic speech. You cannot hear the difference — they all sound very natural; you cannot really hear any differences — but the phase-based spectrogram actually tells you something. So I believe maybe it can be used as a tool, although I don't think anybody has really built it into a system for practical use yet.
Very nice talk — I was very happy to see the breadth of work in the field, and some of our work has actually been covered by some of you folks here. I wanted to make one comment. I think one of the fundamental challenges when you look at voice conversion is that most of that research is really focused on humans being able to assess the quality — usually for human consumption, not necessarily for speaker recognition systems. So if you look at voice conversion technologies, most end up focusing on making sure that the prosody is correct, because that's something that's pretty easy to assess — things like fundamental frequency and so forth. In the mid-nineties we did some work where we took the output of natural speech and of segment-based speech synthesis, fed them into auditory hair cell models, and looked at the hair cell firing characteristics on the output. What we saw was that in regular, natural speech there is a natural production evolution that takes place in the articulators, and the corresponding hair cell firing characteristics also have a natural variation. But on the synthetic side, in segment-based synthesis, the hair cell firing characteristics don't behave the same way. We found that was actually a very interesting way to bring the signal processing side of hearing into the speaker assessment side: you could have really high-quality speech synthesis, and the hair cell firing characteristics would still be able to pick up the differences.
Certainly — I think the unit selection example just now makes the same point: you cannot hear anything wrong, but it is actually the strongest attack among the spoofed voices.
Yes, one last question, and then after that you can take a break.
Part of what you went through was about the different artifacts you try to detect to tell whether the voice has been modified — looking at the pitch, the phase, and so forth. But where speaker verification is really being used — the big market is the handset — most speech coming into these systems has already gone through some form of vocoder. So isn't that, by its nature, going to mean you start detecting a lot of these artifacts that are really just natural artifacts of the communication system itself? It looks like most of this work is based on inputs captured right at the lips into the system.
I think that's a very good question. The challenge now is that we have to model the different types of artifacts — the artifacts the systems are susceptible to. For example, as you said, the signal may go through a communication channel that has done vocoder coding already; but most of those coders do not really manipulate the parameters — they just try to recover the signal as faithfully as possible. At the moment the research focuses very much on finding features that are able to serve as good descriptors: if we have good features, we can model the artifacts effectively. And as you mentioned, besides the telephone channel there are different channels nowadays — analog telephone channels, digital channels, all kinds of things — and they could be an issue. People also ask me about this: when we do this analysis, the processing chain may go from digital to analog and back again, and what the effects of that are, we have not really studied.
Okay, thank you. I think we have to stick to the schedule, so let's thank Haizhou — thanks again.