Hello everyone. I am going to present our paper on sub-band modelling for spoofing detection in automatic speaker verification. In the next few slides I would like to give a brief overview of speaker verification, spoofing attacks and countermeasures, and then describe our proposed approach and the experimental results.
An automatic speaker verification (ASV) system is tasked with verifying the claimed identity of a speaker. Given a test utterance and a claimed identity, the system decides whether the test speech comes from the claimed speaker. If that is the case, the system accepts the test speech as coming from the target speaker; otherwise the system rejects it as an impostor. A common application of speaker verification is user authentication, for example in voice-based biometric systems.
Recent studies have shown that ASV systems are not robust against spoofing attacks, which can be mounted in various ways. The first uses speech synthesis to generate synthetic speech in the voice of the target speaker. The second involves voice conversion, which transforms the speech of an attacker so that it sounds like the target speaker. Replay attacks involve playing back a recording of the target speaker's voice through a loudspeaker. And finally, in impersonation attacks, a human mimics the voice of the target speaker. Any of these attacks can then be used to fool an ASV system.
In this paper our focus is on replay attacks, for two main reasons. First, their simplicity: this form of attack does not require any expertise in signal processing or any training speech data. Second, this form of attack is difficult to detect, because the attacker is playing back real speech; no algorithm is used to generate the speech. This makes replay a threat to every target speaker.
Spoofing countermeasures are therefore developed to protect ASV systems against such attacks. The anti-spoofing problem can be viewed as a binary classification task: the goal is to discriminate whether the input speech is genuine speech or spoofed speech.
So in our case, when we talk about countermeasures against replay attacks, what exactly should the model capture? What are the factors that distinguish genuine speech from playback speech? Essentially, playback speech carries different channel characteristics, introduced during playback and re-recording, and additional background noise picked up during that process. These artefacts are essentially the cues that we expect a model to exploit for playback speech detection.
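To make these cues concrete, here is a minimal sketch, not from the paper, of how a playback chain can be simulated: genuine speech is convolved with hypothetical loudspeaker and re-recording impulse responses and mixed with background noise at a chosen signal-to-noise ratio. The impulse responses, noise and SNR value are assumptions for illustration only.

```python
# Minimal sketch (illustrative only): pass genuine speech through a simulated
# playback / re-recording chain so the extra channel and noise cues are visible.
import numpy as np
from scipy.signal import fftconvolve

def simulate_playback(speech, loudspeaker_ir, recording_ir, noise, snr_db=20.0):
    """speech, noise: 1-D float arrays (noise at least as long as speech);
    loudspeaker_ir, recording_ir: hypothetical impulse responses."""
    x = fftconvolve(speech, loudspeaker_ir, mode="full")[: len(speech)]  # playback device
    x = fftconvolve(x, recording_ir, mode="full")[: len(speech)]         # room + recorder
    noise = noise[: len(x)]
    # Scale the noise so the mixture has the requested signal-to-noise ratio.
    gain = np.sqrt(np.sum(x ** 2) / (np.sum(noise ** 2) * 10 ** (snr_db / 10) + 1e-12))
    return x + gain * noise
```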
Before I describe our sub-band modelling framework, let me briefly outline some commonly used approaches to designing spoofing countermeasures, which exploit information from the whole frequency band. The first, shown here, is a Gaussian mixture model (GMM), which is a generative model. As can be seen, from the given input speech a spectral representation is computed, and from it acoustic features are extracted, for example CQCCs, a widely used baseline feature. On these features two GMMs are trained, one on genuine utterances and one on spoofed utterances, to model the distribution of each class. This kind of GMM countermeasure has shown promising results in the ASVspoof evaluations.
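As a rough sketch of how such a two-class GMM countermeasure works (an assumed setup, not the official baseline implementation), the two models are fit on pooled frame-level features and a test utterance is scored by the log-likelihood ratio between them:

```python
# Minimal sketch (assumed setup): two-class GMM countermeasure with
# log-likelihood-ratio scoring; higher score = more genuine-like.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm_cm(genuine_feats, spoof_feats, n_components=512):
    """genuine_feats / spoof_feats: (n_frames, n_coeffs) pooled acoustic features."""
    gmm_gen = GaussianMixture(n_components, covariance_type="diag").fit(genuine_feats)
    gmm_spf = GaussianMixture(n_components, covariance_type="diag").fit(spoof_feats)
    return gmm_gen, gmm_spf

def llr_score(utt_feats, gmm_gen, gmm_spf):
    # Average frame log-likelihood under each model.
    return gmm_gen.score(utt_feats) - gmm_spf.score(utt_feats)
```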
The second commonly used full-band countermeasure I am going to talk about is a discriminatively trained neural network, which has shown very promising performance in the ASVspoof evaluations. As can be seen, this model takes the full-band input speech representation and directly learns to discriminate between the two classes.
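For illustration, a minimal discriminative CNN of this kind could look like the sketch below; the layer sizes and depth are placeholders and not the architecture used in the paper.

```python
# Minimal sketch (placeholder architecture): a CNN that maps a full-band
# log-power spectrogram to genuine/spoof logits.
import torch
import torch.nn as nn

class FullBandCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),          # pool over frequency and time
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, spec):                  # spec: (batch, 1, freq_bins, time_frames)
        h = self.features(spec).flatten(1)
        return self.classifier(h)
```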
This is what motivates our current work. The main question behind our work is: do we really need all of the speech, that is, the information across all the frequency bands, for replay spoofing detection? Our intuition is that the cues that give away playback speech might lie in the high-frequency regions, or perhaps in the very low-frequency regions. So in this framework we try to understand the importance of different frequency bands for spoofing detection, and also to come up with a sub-band combination that improves performance. We test this idea of sub-band modelling on two benchmarks, the ASVspoof 2017 and the ASVspoof 2019 physical access datasets.
This figure illustrates our proposed methodology, which can be described in two steps. In the first step, we split the input spectrogram along the frequency axis into several sub-bands and train an independent CNN discriminatively on each of them. What we obtain is a set of band-dependent CNNs, each trained on one sub-band of the spectrogram. This allows us to look at each individual sub-band and understand which frequency regions are more discriminative for spoofing detection on these datasets.
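The splitting itself is simple; a minimal sketch of cutting the frequency axis of a spectrogram into N uniform, non-overlapping sub-bands (array layout assumed as frequency by time) is:

```python
# Minimal sketch: uniform, non-overlapping sub-band split along the frequency axis.
import numpy as np

def split_subbands(spec, n_bands):
    """spec: (freq_bins, time_frames) -> list of n_bands arrays, low to high frequency."""
    freq_bins = spec.shape[0]
    edges = np.linspace(0, freq_bins, n_bands + 1, dtype=int)
    return [spec[edges[i]:edges[i + 1], :] for i in range(n_bands)]

# e.g. with 8 kHz bandwidth: n_bands=2 -> 4 kHz bands, 4 -> 2 kHz, 8 -> 1 kHz per band.
```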
This can also be viewed as employing several independent workers, each focusing on a small amount of sub-band information and trying to exploit it to correctly classify genuine and spoofed speech, rather than having a single CNN, as in the baseline, that has to look at all the information at once and work out which regions to attend to while retaining the most discriminative cues. So in the first step we let each independent CNN focus on one sub-band, and then investigate whether we can improve performance by doing so.
Then, in the second step, we take these pre-trained sub-band models, combine their learned features, train another classifier on top of all of these band-dependent features, and jointly update the weights of the entire framework. In this step we therefore also make use of the full-band information, but instead of using a single CNN we use several band-dependent CNNs, feed their features into a concatenation layer, and train the classifier on top of that. This is our proposed method, which we test on the ASVspoof 2017 and 2019 physical access datasets.
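A minimal sketch of such a joint model, assuming each pre-trained sub-band CNN returns a fixed-size embedding (names and dimensions are placeholders), could be:

```python
# Minimal sketch (assumed interfaces): concatenate per-band embeddings and train a
# classifier on top; during fine-tuning the whole framework is updated end to end.
import torch
import torch.nn as nn

class JointSubBandModel(nn.Module):
    def __init__(self, subband_cnns, embed_dim, n_classes=2):
        super().__init__()
        self.subband_cnns = nn.ModuleList(subband_cnns)   # one pre-trained CNN per band
        self.classifier = nn.Linear(embed_dim * len(subband_cnns), n_classes)

    def forward(self, subband_specs):                      # list of per-band spectrogram batches
        embeddings = [cnn(spec) for cnn, spec in zip(self.subband_cnns, subband_specs)]
        return self.classifier(torch.cat(embeddings, dim=1))
```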
Let me now turn to the experimental results, starting with the baselines: a GMM and a CNN. The GMM baseline is trained on CQCC features, and the CNN is trained on the full-band spectrogram. From the experimental results we find that the discriminatively trained CNN is superior in performance to the handcrafted-feature GMM baseline on both datasets. One thing to note is the comparatively high equal error rate of the GMM on the 2019 dataset. This is due to the fact that we apply a preprocessing step to the audio signals, which involves discarding the leading and trailing zero-valued silence, motivated by findings in prior work presented at Interspeech, and that is why our baseline CQCC-GMM results are quite different from the official baseline results for this dataset.
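The preprocessing step mentioned above can be sketched as follows; the amplitude threshold here is an assumption, not a value from the paper.

```python
# Minimal sketch: drop leading and trailing (near-)zero-valued silence before
# feature extraction (threshold is an assumption).
import numpy as np

def trim_zeros(signal, eps=1e-4):
    nonzero = np.flatnonzero(np.abs(signal) > eps)
    if nonzero.size == 0:
        return signal
    return signal[nonzero[0]: nonzero[-1] + 1]
```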
Now let me talk about our first experimental setup, where we split the spectrogram input into two uniform sub-bands. One thing to note here is that in this paper we adopt a very simple strategy: we uniformly segment the input into non-overlapping sub-bands. So here we split the input into two sub-bands, train an independent CNN on each of them, and then, having trained the band-dependent models independently, we combine them to train a concatenated, or joint, model. I will refer to this as the joint framework in the experimental results, meaning the framework where the sub-band features are combined and a classifier is trained on top.
Let's look at the experimental results. As can be seen from the EER distribution, on the ASVspoof 2017 dataset the information in the high-frequency region seems to be more discriminative than that in the low-frequency region. On the contrary, on the 2019 dataset the same does not hold: there the low-frequency region appears more discriminative than the high-frequency region. Nonetheless, on both datasets our proposed joint framework, which combines the sub-band models, further improves performance compared with the individual band-dependent models.
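For reference, a minimal sketch of how the equal error rate reported in these comparisons can be computed from countermeasure scores (assuming higher scores mean more genuine-like; names are placeholders) is:

```python
# Minimal sketch: equal error rate from genuine and spoof scores.
import numpy as np

def compute_eer(genuine_scores, spoof_scores):
    scores = np.concatenate([genuine_scores, spoof_scores])
    labels = np.concatenate([np.ones(len(genuine_scores)), np.zeros(len(spoof_scores))])
    labels = labels[np.argsort(scores)]
    # Sweep the decision threshold over the sorted scores.
    fnr = np.cumsum(labels) / labels.sum()                       # genuine wrongly rejected
    fpr = 1.0 - np.cumsum(1.0 - labels) / (1.0 - labels).sum()   # spoof wrongly accepted
    idx = np.argmin(np.abs(fnr - fpr))
    return 0.5 * (fnr[idx] + fpr[idx])
```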
In the second experimental setup, we split the input further into four uniform segments, each a 2 kHz sub-band. With this we can look at the discriminative information in more detail, rather than with just two coarse bands. This is the corresponding setup: we have four independent CNNs, each trained on one 2 kHz sub-band; the band-dependent models are then combined, their features are fed into the classifier, and the whole framework is trained jointly over all the sub-band representations.
Let's take a look at the experimental results. As can be seen, on the 2017 dataset the information between 2 kHz and 6 kHz seems to be not very discriminative, in contrast to the information present in the last 2 kHz sub-band and the first one. On the 2019 dataset, in contrast, we find that the first 2 kHz sub-band is the most informative. As in the previous case, our joint model gives some of the best results on both datasets.
In the next experimental setup, we split the input further into eight sub-bands of 1 kHz each. So essentially we train an independent CNN on every 1 kHz of frequency information, and these band-dependent models are then combined as in the previous framework. We test this on both datasets to see whether we can further improve the spoofing detection performance.
These are the experimental results for the 1 kHz sub-band CNNs. The EER distribution across the different sub-bands allows us to understand the impact of each band. On the 2017 dataset, the last 1 kHz of information seems to be the most informative: it achieves an EER of about 2.1, the lowest among all the frequency bands. An interesting finding on this dataset is that the bands between 1 kHz and 7 kHz seem to be the least informative, as can be seen from their high EERs, and the second most informative band is the first 1 kHz. Our joint model operating on all eight sub-bands gives competitive results, but we also trained a joint model on just the last 7 to 8 kHz band and the first 1 kHz band, which gives us the best performance on this dataset. On the 2019 dataset, by contrast, we find that the first 1 kHz band is the most informative. As I mentioned earlier, this is because the 2017 and 2019 datasets are quite different: the 2019 physical access data is simulated, while the 2017 dataset is real data that was recorded and played back against a speaker verification system. That largely explains the difference, or mismatch, in behaviour.
The final set of experiments in this study is a cross-dataset evaluation, where we take some of the best models trained on each dataset and test them on the other dataset and on the ASVspoof 2019 real replay data. The latter is a very small set of about a thousand utterances recorded under realistic replay conditions by the challenge organisers, and we wanted to see how our models perform under such realistic conditions. As can be seen from the high equal error rates for all our models, models trained on the current spoofing datasets do not generalise well to realistic replay conditions. This suggests that we may have to rethink how we design the training and validation sets for anti-spoofing.
To summarise, in this paper we proposed a sub-band modelling framework in which independent CNNs are discriminatively trained on different sub-band segments; their learned features are later combined, and an additional classifier is trained on top. Using the proposed methodology we found improved performance on both the ASVspoof 2017 and 2019 datasets. An interesting observation, which partly aligns with some of the prior work, is that on the 2017 dataset the 7 to 8 kHz frequency information seems to be the most informative, which, however, does not hold true on the 2019 dataset, where the first 1 kHz of information is the most informative. We also found that these models do not generalise well under realistic replay conditions, which suggests that there is still room for improvement in designing and validating the datasets used for training effective replay spoofing detection models. With that, I would like to conclude my talk. Thank you very much.