0:00:16hello everyone
0:00:18my name is bhusan chettri and i'm going to present our paper
0:00:22titled
0:00:23subband modeling for spoofing detection
0:00:25in automatic speaker verification
0:00:29so in the next few slides
0:00:32i'd like to give a brief background on
0:00:35speaker verification
0:00:37spoofing
0:00:38and countermeasures
0:00:41so an automatic speaker verification system
0:00:45aims at verifying the claimed identity of a speaker
0:00:48so in this paper
0:00:50options
0:00:52a more distant from various it's a
0:00:57and this speaker
0:01:00it has to find whether the two speech samples come from the same speaker if that is the
0:01:04case
0:01:04the system accepts the test speech as the claimed speaker
0:01:08otherwise the system
0:01:11rejects it as an impostor
0:01:12so a common application of asv is
0:01:15for user authentication for example in banking applications
0:01:23recent studies have shown
0:01:25that asv systems are not secure
0:01:28they are vulnerable to spoofing attacks
0:01:31which can be
0:01:33mounted in various ways
0:01:37one is using speech synthesis
0:01:40to generate synthetic speech of a target
0:01:42speaker
0:01:44the second involves voice conversion
0:01:50to generate the speech of a target speaker and the third type is
0:01:58replay attacks which involve playing back
0:02:01pre-recorded speech by means of a loudspeaker
0:02:04and for on the back in one
0:02:07if question at the voice of target or not
0:02:11a given x is then used system
0:02:14so in this paper
0:02:16our focus is on
0:02:18replay attacks
0:02:20for two main reasons
0:02:21one
0:02:22because of its simplicity
0:02:25this form of attack doesn't require any expertise in
0:02:30signal processing or speech technology
0:02:33second
0:02:34this form of attack is
0:02:36difficult to detect because we are actually replaying real speech
0:02:43so it doesn't use any
0:02:44computer algorithm to generate speech
0:02:47so this and ones for all target speaker
0:02:54so
0:02:55spoofing countermeasures are primarily developed
0:02:58to
0:02:59protect asv systems
0:03:01so it really very fine control
0:03:04video
0:03:07so an anti-spoofing problem can be viewed as a binary classification task which
0:03:12aims to discriminate whether the input speech
0:03:16is bona fide speech
0:03:18or spoofed speech
0:03:21so in our case when we talk about this spoofing countermeasure
0:03:25so what exactly so there's a them in the
0:03:29so but not in five
0:03:31what are the factors of interest
0:03:33in detecting playback speech
0:03:36so essentially in our case these would be
0:03:40different channel characteristics
0:03:41introduced during playback and re-recording of speech
0:03:45and also
0:03:46background noise
0:03:49involved during this process
0:03:51these are essentially the cues that we expect the model to exploit for playback speech detection
0:03:59before i start
0:04:01with our sub band modeling framework
0:04:04first
0:04:07i'd like to go over some of the commonly used approaches towards designing
0:04:13spoofing countermeasures
0:04:14which essentially
0:04:16exploit information across all frequency bands
0:04:20so the first one of them
0:04:23shown here
0:04:25is an example of a
0:04:27optimal or
0:04:29gaussian mixture model which is a generative model
0:04:32as can be seen
0:04:34for a given speech signal first the power spectrum is extracted
0:04:38and from it we extract
0:04:39acoustic features
0:04:40for example cqccs which is the asvspoof baseline feature
0:04:45and on these acoustic features
0:04:48basically
0:04:49two gmm models one for spoofed and one for bona fide
0:04:52utterances are trained
0:04:54in order to model the distribution of these
0:04:57two classes
0:04:58so this is an example of a commonly used gmm model it has shown promising results
0:05:03in the asvspoof evaluations
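The two-GMM countermeasure described above is usually scored with a log-likelihood ratio between the bona fide and spoofed models. The following is a minimal sketch of that scoring step only, assuming diagonal covariances and already-trained models; the function names and the toy parameters are hypothetical and are not taken from the talk.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, gmm):
    """Log-likelihood of x under a GMM given as (weight, mean, var) tuples."""
    terms = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    peak = max(terms)  # log-sum-exp trick for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def llr_score(frames, bonafide_gmm, spoof_gmm):
    """Average per-frame log-likelihood ratio; higher means more bona fide."""
    total = sum(
        gmm_log_likelihood(f, bonafide_gmm) - gmm_log_likelihood(f, spoof_gmm)
        for f in frames
    )
    return total / len(frames)

# Toy 2-dimensional models with hypothetical parameters (not trained on real data).
bonafide_gmm = [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])]
spoof_gmm = [(0.5, [4.0, 4.0], [1.0, 1.0]), (0.5, [5.0, 5.0], [1.0, 1.0])]

score = llr_score([[0.2, 0.1], [0.9, 1.1]], bonafide_gmm, spoof_gmm)
```

In practice the frames would be acoustic feature vectors extracted from the utterance, and the score would be compared against a threshold to accept or reject the speech.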
0:05:07the second category of countermeasures that we are going to study
0:05:11is
0:05:12discriminatively trained cnns
0:05:15which have shown very
0:05:17promising results in the twenty seventeen and
0:05:21twenty nineteen asvspoof evaluations
0:05:24so as can be seen this model takes the full band spectrogram of the input speech
0:05:29and basically learns to discriminate
0:05:33between the two classes
0:05:35so this actually motivates our current work
0:05:38and in fact the main research question that motivates our work is
0:05:44do we really need
0:05:45all the speech
0:05:47all the information across all the frequency bands
0:05:51for replay spoofing detection
0:05:52so our intuition is that maybe
0:05:55maybe the discriminative cues might be somewhere in high frequency regions or maybe
0:06:00in very low frequency regions
0:06:02so in this framework what we try to do is understand the importance of
0:06:07different frequency bands for spoofing detection
0:06:11and also try to come up with a subband combination that improves
0:06:16the performance
0:06:18so we test this hypothesis this idea of ours on two benchmarks
0:06:26the asvspoof twenty seventeen and
0:06:28asvspoof twenty nineteen datasets
0:06:34so this depicts our proposed methodology
0:06:37which can be explained in two steps
0:06:39in the first one we basically
0:06:42split our input into
0:06:44n
0:06:46spectrograms of different subbands
0:06:49and train n independent cnns on top of them discriminatively
0:06:56and likewise what we obtain is n independent cnns that are trained
0:07:00on n subband spectrograms
0:07:02so
0:07:03this allows us to actually look at
0:07:05each individual subband
0:07:07and understand which frequency regions are more discriminative
0:07:10for spoofing detection on these data sets
0:07:14so
0:07:15so this can also be viewed as employing n independent workers
0:07:19just focusing on
0:07:21small subband information and trying to exploit discriminative information for detecting
0:07:26spoofed speech
0:07:28rather than having one cnn which
0:07:30as in the baseline is one cnn that has to look at
0:07:36all the information and also
0:07:38has to
0:07:39reduce the dimension of the input while retaining the
0:07:44most discriminative information so what we do in the first step is basically we
0:07:48kind of
0:07:51allow independent cnns to focus on one band each
0:07:54and then
0:07:54we intend to
0:07:56improve performance by doing so
0:07:58so then in the second step what
0:08:01we do is
0:08:02basically take these pretrained models and combine their learned features
0:08:06and train another classifier on top of all these
0:08:11different features and then jointly update the weights of the entire framework
0:08:16so in a way what we do is
0:08:18we are also making use of the whole information but
0:08:22not using only one cnn
0:08:23we are
0:08:25instead
0:08:28using n cnns and then we combine these n independent features by
0:08:32concatenation and then
0:08:35the classifier is trained on top
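The fusion step in the second stage can be illustrated with a small sketch: per-subband feature vectors are concatenated and passed to a single classifier. This is only an illustration under stated assumptions. The classifier here is one logistic unit with hand-set hypothetical weights, and the joint weight update of the whole framework (which would need a deep learning library and backpropagation) is not shown.

```python
import math

def fuse_and_score(band_features, weights, bias):
    """Concatenate per-subband feature vectors and apply one logistic unit."""
    fused = [x for features in band_features for x in features]
    if len(fused) != len(weights):
        raise ValueError("weights must match the concatenated feature length")
    z = sum(w * x for w, x in zip(weights, fused)) + bias
    return fused, 1.0 / (1.0 + math.exp(-z))

# Hypothetical embeddings from two subband networks (2 values each).
band_features = [[0.5, -1.0], [2.0, 0.0]]
fused, prob = fuse_and_score(band_features, weights=[1.0, 1.0, 1.0, 1.0], bias=-1.5)
```

The point of concatenation is that the classifier sees every subband at once, while each subband network only ever looks at its own frequency range.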
0:08:38so this is our proposed method which we test on the asvspoof twenty seventeen
0:08:43and twenty nineteen asvspoof datasets
0:08:48so
0:08:49let me talk about the experimental results we'll start with the baselines
0:08:55the baseline gmm and the cnn
0:08:57so the gmm baseline is trained on cqcc features
0:09:02and
0:09:03the cnn is trained on the spectrogram
0:09:07as can be seen from the experimental results
0:09:09we find that the discriminatively trained cnn
0:09:14is better in performance
0:09:16than
0:09:17the handcrafted feature based gmm baseline
0:09:21on the data sets
0:09:23and one thing the of the on the training dataset to increase equal error rate
0:09:28of the gmm is due to the fact that we apply
0:09:32a preprocessing step
0:09:34on the audio signals
0:09:36which involves discarding the zero valued silence
0:09:40from the utterances as noted in our prior work at interspeech
0:09:45and that's the reason why our baseline cqcc gmm is quite
0:09:50different from the
0:09:52official baseline
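The results in this talk are reported as equal error rate (EER), the operating point where the false acceptance rate and the false rejection rate are equal. Below is a minimal sketch of an approximate EER computation, assuming that higher scores mean more bona fide; the function name and the toy scores are hypothetical, and official evaluation toolkits may interpolate between thresholds differently.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate EER: sweep thresholds over all observed scores and return
    the error rate where false acceptance and false rejection are closest."""
    best_gap, eer = None, None
    for t in sorted(set(bonafide_scores) | set(spoof_scores)):
        # Spoofed trials scoring at or above the threshold are falsely accepted.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # Bona fide trials scoring below the threshold are falsely rejected.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer

# Toy scores: one perfectly separated system and one with overlap.
eer_separated = equal_error_rate([2.0, 3.0], [0.0, 1.0])
eer_overlap = equal_error_rate([1.0, 3.0], [0.0, 2.0])
```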
0:09:56so now let me talk about our first experimental setup
0:10:00where we
0:10:01split our input into uniform subbands
0:10:04and one thing to note here is that
0:10:06in this paper we have adopted a very simple strategy which is uniformly
0:10:11segmenting the input into non-overlapping subbands
0:10:16so we just split the input into two
0:10:19segments
0:10:20and
0:10:21we
0:10:22train independent cnns one on each of the two
0:10:26and then having trained these models independently we combine them to train a
0:10:30concatenated model or a joint model
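The uniform, non-overlapping segmentation described here can be sketched in a few lines. This is a minimal illustration, assuming the spectrogram is stored as a list of frequency-bin rows (lowest frequency first), each row holding one value per frame; the function name and the toy spectrogram are hypothetical.

```python
def split_subbands(spectrogram, n_bands):
    """Split a spectrogram (rows = frequency bins, columns = frames) into
    n_bands uniform, non-overlapping subbands along the frequency axis."""
    n_bins = len(spectrogram)
    if n_bins % n_bands != 0:
        raise ValueError("number of frequency bins must be divisible by n_bands")
    width = n_bins // n_bands
    return [spectrogram[i * width:(i + 1) * width] for i in range(n_bands)]

# Toy spectrogram: 8 frequency bins (rows) by 3 frames (columns).
spectrogram = [[float(b)] * 3 for b in range(8)]
two_bands = split_subbands(spectrogram, 2)
four_bands = split_subbands(spectrogram, 4)
```

Each returned subband keeps the full time axis, so every subband network sees the whole utterance but only its own slice of the frequency range.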
0:10:34so i will be calling this joint framework concat in the experimental results
0:10:39which basically is that
0:10:41the framework of this except that we trained
0:10:44pictures and classifier is trained one
0:10:48so let's look at the experimental results
0:10:52as can be seen from this
0:10:55error rate distribution
0:10:57on the twenty seventeen data set we find that the information in high frequency regions seems
0:11:03to be more discriminative in contrast to low frequency regions
0:11:07on the contrary
0:11:09on the twenty nineteen dataset
0:11:13we see low
0:11:14frequency regions to be more discriminative in contrast to the high frequency region
0:11:20nonetheless on both the datasets our proposed
0:11:24joint framework performs better compared to
0:11:26the individual subband models
0:11:29and offers improved performance on both the data sets
0:11:33so in the second experimental setup what we do now is
0:11:37further split our input
0:11:39into four
0:11:41uniform segments each of two khz
0:11:44now with this
0:11:45we can look at
0:11:47more detailed discriminative information rather than just having
0:11:52four khz subbands
0:11:55so this is our second experimental setup so we have four independent cnns each trained
0:12:02on two khz of information
0:12:05and then these models are later combined
0:12:09the combined features are fed into this classifier and the whole framework is
0:12:14trained
0:12:15well over the so
0:12:18you know if all the with this
0:12:22so
0:12:23let's take a look at the experimental results so as can be seen on
0:12:29the twenty seventeen dataset we find that
0:12:32the
0:12:33information in between
0:12:36two khz to six khz seems to be not that informative
0:12:40in contrast to
0:12:42information present in the last two khz subband and the first
0:12:48so in contrast on the twenty nineteen dataset we find that the first two khz sub
0:12:55band is more informative
0:12:58how valuable of the safety
0:13:00as in previous case
0:13:01our joint model
0:13:03shows the best results
0:13:05on both the data sets
0:13:06so
0:13:07in the next experimental setup what we do
0:13:10is we now
0:13:12split this
0:13:13input further into
0:13:16eight subbands each of one khz
0:13:20so essentially what we do is we train eight independent cnns each on one khz of
0:13:26information
0:13:28and these are then combined similar to the
0:13:31previous
0:13:33framework to see if combining all eight cnns
0:13:39can
0:13:41improve the
0:13:42performance
0:13:46so these are the experimental results for the one khz subband cnns
0:13:51so this error distribution across different subbands allows us to actually
0:13:57understand the impact of different bands
0:14:01so on the twenty seventeen dataset what we see
0:14:04is that the last
0:14:06one khz of information seems to be the most informative
0:14:10as we obtain an
0:14:12eer of two point one which is
0:14:15a lower eer in contrast to the other frequency bands and an interesting observation we find is that
0:14:20the information in the bands between
0:14:22one khz
0:14:26and seven khz is not that informative as can be seen from the high
0:14:31eer
0:14:34and the second most informative frequency band seems to be the first one khz
0:14:39and of course we have our concat model operating on
0:14:44all eight bands
0:14:45which seems to give good performance but then we also trained a model on just the last
0:14:52seven to eight khz band and the first one khz
0:14:55which seems to give us the best performance as shown here
0:14:59on the twenty nineteen dataset what we found is that the first one khz
0:15:03is most informative
0:15:05in contrast to the other frequency bands
0:15:08so as i mentioned earlier this is due to the fact that the twenty
0:15:12seventeen and twenty nineteen datasets are completely different so the twenty nineteen is simulated data
0:15:18while the twenty seventeen dataset is
0:15:21real data that was
0:15:24recorded and played back
0:15:26using a
0:15:28speaker verification it has it all right
0:15:31so this kind of explains the
0:15:34difference or mismatch in the behavior
0:15:39the final set of experiments we performed in this study is a cross dataset evaluation
0:15:44a simple one
0:15:45where we took some of the best
0:15:49models
0:15:50trained on the twenty seventeen dataset and the twenty nineteen dataset and tested them
0:15:56on the asvspoof twenty nineteen real pa test set
0:16:00this is a very small set
0:16:03of
0:16:04thousand utterances that was
0:16:07recorded in realistic conditions by the asvspoof organisers
0:16:11and we want to see how these models
0:16:15perform in realistic conditions as can be seen from the high
0:16:20error rate distributions for all our models
0:16:25this suggests that models trained on the current asvspoof datasets don't actually
0:16:32generalise much to the realistic replay conditions
0:16:35so this suggests that we might have to think about redesigning our
0:16:39training and validation sets for replay spoofing detection
0:16:46so to summarise
0:16:50in this work we basically
0:16:54looked at subband modeling
0:16:58by discriminatively training independent cnns on
0:17:03uniformly
0:17:04segmented
0:17:07subbands
0:17:09and then these are later combined
0:17:13and
0:17:14an independent classifier is trained on top of that
0:17:17using the proposed methodology we found improved performance on both the twenty seventeen and twenty nineteen datasets
0:17:23and an interesting observation to note which is in line with some
0:17:29of the previous work is that
0:17:31on the twenty seventeen dataset
0:17:33the
0:17:34seven to eight khz frequency information seems to be more informative
0:17:38which however doesn't hold true on the twenty nineteen dataset
0:17:42on the twenty nineteen dataset the first one khz information seems to be more informative
0:17:47and we also found that
0:17:51these models do not generalize well to the realistic replay conditions
0:17:57which suggests
0:17:59that there is still room for improvement in
0:18:03designing and validating
0:18:05datasets for training effective
0:18:08replay spoofing detection models
0:18:11so with that i would like to conclude my talk
0:18:14thank you very much