0:00:16 | Hello everyone. I'm going to present our paper on sub-band modeling for spoofing detection in automatic speaker verification. |
0:00:29 | In the next few slides I'd like to give a brief overview of speaker verification, spoofing attacks, and countermeasures. |
0:00:41 | An automatic speaker verification system is tasked with verifying the claimed identity of a speaker. |
0:00:48 | The test speech is compared against the enrolled model of the claimed speaker. If the system decides that the test speech comes from the same speaker as the claimed identity, it accepts the test speech as coming from the target speaker; otherwise the system rejects it as an impostor. |
0:01:12 | A common application of speaker verification is user authentication, for example in banking applications. |
0:01:23 | Recent studies have shown that these systems are vulnerable to spoofing attacks, which can be mounted in various ways. |
0:01:37 | One of them uses speech synthesis to generate synthetic speech in the voice of a target speaker. |
0:01:44 | The second one involves voice conversion, which transforms one speaker's speech so that it sounds like the target speaker. |
0:01:58 | Replay attacks involve playing back a recording of the target speaker's voice by means of a loudspeaker; the played-back recording of the target voice is then used to attack the system. |
0:02:14 | In this paper our focus is on replay attacks, for two main reasons. |
0:02:21 | One: their simplicity. This form of attack doesn't require any expertise in signal processing or machine learning, nor much speech data. |
0:02:33 | Second: this form of attack is difficult to detect, because we are actually replaying real speech. It doesn't use any algorithm to generate the speech, so the playback still sounds like the target speaker. |
0:02:54 | Spoofing countermeasures are therefore developed to defend speaker verification systems against such attacks. |
0:03:07 | The anti-spoofing problem can be viewed as a binary classification task: the goal is to discriminate whether the input speech is genuine speech or spoofed speech. |
0:03:21 | So in our case, when we talk about spoofing countermeasures for replay attacks, what exactly discriminates genuine speech from played-back speech? |
0:03:33 | Essentially, a played-back recording carries different channel characteristics, introduced during playback and re-recording, and also background noise picked up during this process. |
0:03:51 | These artefacts are the cues that we expect a model to exploit for playback speech detection. |
0:03:59 | Before I present our sub-band modeling framework, let me give a brief overview of some commonly used approaches to designing spoofing countermeasures, which essentially exploit information from the full frequency band. |
0:04:23 | Shown here is an example of a commonly used Gaussian mixture model, which is a generative model. |
0:04:34 | As can be seen, given the input speech the power spectrum is computed, and from it acoustic features are extracted, for example CQCC, which is a very popular baseline feature. |
0:04:45 | On these features two GMMs, one for genuine and one for spoofed utterances, are trained to model the distributions of the two classes. |
0:04:58 | This commonly used GMM baseline has shown promising results in the ASVspoof evaluations. |
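The two-GMM scoring just described can be sketched as follows. This is a minimal illustration with random stand-in features rather than real CQCC vectors; the `llr_score` helper and the toy data are assumptions for the example, not the paper's pipeline.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy stand-ins for pooled per-frame feature matrices (frames x coefficients).
rng = np.random.default_rng(0)
genuine_feats = rng.normal(0.0, 1.0, size=(500, 20))
spoofed_feats = rng.normal(0.5, 1.2, size=(500, 20))

# One GMM per class, trained on frames pooled from that class's utterances.
gmm_genuine = GaussianMixture(n_components=4, covariance_type="diag",
                              random_state=0).fit(genuine_feats)
gmm_spoofed = GaussianMixture(n_components=4, covariance_type="diag",
                              random_state=0).fit(spoofed_feats)

def llr_score(utterance_feats):
    """Average per-frame log-likelihood ratio: >0 leans genuine, <0 leans spoofed."""
    return gmm_genuine.score(utterance_feats) - gmm_spoofed.score(utterance_feats)

test_utt = rng.normal(0.0, 1.0, size=(100, 20))
print(llr_score(test_utt))
```

At test time each utterance gets one scalar score, which is then thresholded or fed to EER computation.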
0:05:07 | The other category of countermeasures, and the one that we study, is discriminatively trained neural networks, which have shown very promising performance in the ASVspoof evaluations. |
0:05:24 | As can be seen, such a model takes the full-band spectrogram of the input speech and is trained to discriminate between the two classes. |
0:05:35 | This actually motivates our current work. In fact, the main research question that motivates our work is: do we really need all of the information, across all of the frequency bands, for replay spoofing detection? |
0:05:52 | Our intuition is that the cues that distinguish played-back speech might be concentrated somewhere in the high-frequency regions, or maybe in the very low-frequency regions. |
0:06:02 | So in this framework, what we aim to do is to understand the importance of different frequency bands for spoofing detection, and also to come up with a sub-band combination that improves performance. |
0:06:18 | We test this hypothesis on two benchmark datasets: ASVspoof 2017 and the ASVspoof 2019 physical access dataset. |
0:06:34 | Our proposed methodology can be split into two steps. |
0:06:39 | In the first step, we split the input spectrogram into different sub-bands and train an independent CNN on each sub-band, discriminatively. |
0:06:56 | What we obtain is N independent CNNs, each trained on one sub-band of the spectrogram. |
0:07:02 | This allows us to actually look at each individual sub-band and understand which frequency regions are more discriminative for spoofing detection on these datasets. |
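The uniform, non-overlapping sub-band split can be sketched as below. The function name and the 256-bin example are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

def split_subbands(spectrogram, n_bands):
    # Split a (freq_bins x frames) spectrogram into uniform,
    # non-overlapping sub-bands along the frequency axis.
    n_bins = spectrogram.shape[0]
    assert n_bins % n_bands == 0, "assumes the bin count divides evenly"
    width = n_bins // n_bands
    return [spectrogram[i * width:(i + 1) * width, :] for i in range(n_bands)]

# e.g. 256 bins covering 0-8 kHz split into 4 bands of 2 kHz each
spec = np.random.rand(256, 100)
bands = split_subbands(spec, 4)
print([b.shape for b in bands])  # [(64, 100), (64, 100), (64, 100), (64, 100)]
```

Each element of the returned list would then be the input to one independent sub-band CNN.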
0:07:14 | This can also be viewed as employing N independent workers, each just focusing on a small band of information and trying to exploit its discriminative cues for correctly classifying genuine and spoofed speech, rather than having one single CNN, as in the baseline, that has to look at all the information at once. |
0:07:39 | So, to improve the performance of our models while retaining the most discriminative information, in the first step we basically allow each independent CNN to focus on one part of the spectrum, and we aim to improve performance by doing so. |
0:07:58 | Then, in the second step, we freeze these pre-trained models, combine their learned features, train another classifier on top of all these different features, and then jointly update the weights of the entire framework. |
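The second step, combining per-band features under one classifier, can be sketched as follows. Here the per-band CNN embeddings are replaced by simple summary statistics and the joint fine-tuning is omitted; `band_embedding`, `joint_features`, and the toy "spoofed" data are all assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def band_embedding(band):
    # Stand-in for a pre-trained per-band CNN embedding:
    # per-frequency-bin mean and standard deviation over time.
    return np.concatenate([band.mean(axis=1), band.std(axis=1)])

def joint_features(spectrogram, n_bands=4):
    # Concatenate the embeddings of all uniform sub-bands.
    width = spectrogram.shape[0] // n_bands
    return np.concatenate(
        [band_embedding(spectrogram[i * width:(i + 1) * width]) for i in range(n_bands)]
    )

def make_spec(spoofed):
    # Toy spectrogram: "spoofed" examples carry extra energy in the top band.
    spec = rng.random((64, 50))
    if spoofed:
        spec[48:, :] += 0.5
    return spec

labels = np.array([0, 1] * 50)
X = np.stack([joint_features(make_spec(lab)) for lab in labels])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.score(X, labels))  # training accuracy on this toy data
```

In the actual framework the embeddings come from the frozen sub-band CNNs and the whole stack is then updated jointly, which a static classifier like this does not capture.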
0:08:16 | So in the end we are still making use of the full-band information, but not through one single CNN: the sub-bands are fed to the independent CNNs, their features are concatenated, and the classifier is trained on top. |
0:08:35 | This is our proposed method, which we test on the ASVspoof 2017 and 2019 physical access datasets. |
0:08:48 | Let me now talk about the experimental results, starting with the baselines: a GMM baseline and a CNN baseline. |
0:08:57 | The GMM baseline is trained on CQCC features, and the CNN is trained on the full-band spectrogram. |
0:09:07 | From the experimental results we find that the discriminatively trained CNN is superior in performance to the CQCC-based GMM baseline on both datasets. |
0:09:23 | One thing to note: the high equal error rate of the GMM on the 2019 dataset is due to the fact that we apply a preprocessing step to the audio signals, which involves discarding the zero-valued silence samples. |
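That preprocessing step amounts to something like the sketch below; the helper name is an assumption, and the paper's exact trimming criterion may differ.

```python
import numpy as np

def drop_zero_silence(signal):
    # Keep only non-zero samples, discarding exactly-zero (digital silence) regions.
    signal = np.asarray(signal)
    return signal[signal != 0]

x = np.array([0.0, 0.0, 0.1, -0.2, 0.0, 0.3, 0.0])
print(drop_zero_silence(x))  # [ 0.1 -0.2  0.3]
```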
0:09:40 | This follows evidence from our prior work presented at Interspeech, and that's the reason why our CQCC-GMM baseline is quite different from the official baseline reported for this dataset. |
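All results in the talk are reported as equal error rates. A minimal sketch of EER computation over a set of scores is given below; this is an illustrative approximation, not the official challenge scoring tool.

```python
import numpy as np

def equal_error_rate(scores, labels):
    # Approximate EER: the operating point where the false acceptance rate
    # (spoofed scored as genuine) equals the false rejection rate (genuine
    # scored as spoofed). labels: 1 = genuine, 0 = spoofed;
    # higher score = more genuine.
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    eer = 1.0
    for t in np.sort(np.unique(scores)):
        far = np.mean(scores[labels == 0] >= t)  # false acceptance rate
        frr = np.mean(scores[labels == 1] < t)   # false rejection rate
        eer = min(eer, max(far, frr))
    return eer

# Perfectly separable toy scores give an EER of zero.
print(equal_error_rate([0.9, 0.8, 0.7, 0.2, 0.3, 0.1], [1, 1, 1, 0, 0, 0]))  # 0.0
```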
0:09:56 | Now let me talk about our first experimental setup, where we split the spectrogram input into two uniform sub-bands. |
0:10:04 | One thing to note here is that in this paper we have adopted a very simple strategy: we uniformly segment the input into non-overlapping sub-bands. |
0:10:16 | So we split the input into two sub-bands, train an independent CNN on each of the two, and then, having these two independently trained models, we combine them to train a concatenated model, or joint model. |
0:10:34 | I will be calling this joint framework "cat" in the experimental results, which basically means that the framework uses the concatenated features and the classifier is trained jointly. |
0:10:48 | Let's look at the experimental results. |
0:10:52 | As can be seen from the equal error rate distribution, on the 2017 dataset we find that the information in the high-frequency regions seems to be more discriminative, in contrast to the low-frequency regions. |
0:11:07 | On the contrary, on the 2019 dataset the same reasoning cannot be applied: we see the low-frequency regions to be more discriminative, in contrast to the high-frequency regions. |
0:11:20 | Nonetheless, on both datasets our proposed joint framework, compared to the individual sub-band models, improves performance. |
0:11:33 | In the second experimental setup, what we do is split the input further, into four uniform segments, each a two-kHz sub-band. |
0:11:45 | With this, we can look in more detail at where the discriminative information lies, rather than just having two four-kHz bands. |
0:11:55 | So here is our experimental setup: we have four independent CNNs, each trained on a two-kHz band of information. These models are then frozen, their learned features are fed into the classifier, and the whole framework is trained jointly, as before. |
0:12:23 | Let's take a look at the experimental results. As can be seen, on the 2017 dataset we find that the information between two kHz and six kHz seems to be not that informative, in contrast to the information present in the last two-kHz sub-band and the first. |
0:12:48 | In contrast, on the 2019 dataset we find that the first two-kHz sub-band is the most informative. |
0:13:00 | As in the previous case, our joint model shows some of the best results on both datasets. |
0:13:07 | In the next experimental setup, what we do is split the input further, into eight sub-bands, each of one kHz. |
0:13:20 | So essentially we train an independent CNN on every one-kHz band of information, and this finer frequency analysis, analogous to the previous frameworks, lets us see which one-kHz sub-bands, and which combinations of them, improve spoofing detection performance. |
0:13:46 | These are the experimental results for the one-kHz sub-band CNNs. |
0:13:51 | The equal error rate distribution across the different sub-bands allows us to actually understand the impact of each band. |
0:14:01 | On the 2017 dataset, the last one-kHz band seems to be the most informative, as it gives an EER of about 2.1, which is the lowest in contrast to the other frequency bands. An interesting thing we find is that the bands in between, from one kHz up to seven kHz, are the least informative, as can be seen from their high EERs. |
0:14:34 | The second most informative frequency band seems to be the first one kHz. |
0:14:39 | Of course, our joint model operating on all eight bands seems to give good performance, but we also trained on just the last, seven-to-eight-kHz, band together with the first one kHz, which seems to give us the best performance on the 2017 data. |
0:14:59 | On the 2019 dataset, in contrast, we found that the first one kHz is the most informative of the eight. |
0:15:08 | As I mentioned earlier, this is due to the fact that the 2017 and 2019 datasets are completely different: the 2019 physical access data is simulated, while the 2017 dataset is real data that was recorded and played back against a speaker verification system using actual devices. |
0:15:31 | That kind of explains the difference, the mismatch, in behavior across the two datasets. |
0:15:39 | The final set of experiments we performed in this study is a cross-dataset evaluation, where we took some of the best models trained on the 2017 dataset and on the 2019 dataset and tested them on the ASVspoof 2019 real replay data. |
0:16:00 | This is a very small set of about a thousand utterances that was recorded under realistic replay conditions by the challenge organizers, and we wanted to see how these models perform under such conditions. |
0:16:15 | As can be seen from the high equal error rate distributions for all our models, this suggests that the current models, trained on the existing spoofing datasets, don't actually generalize much to realistic replay conditions. |
0:16:35 | This means that we might have to rethink how we design our training and validation sets for spoofing countermeasures. |
0:16:46 | So, to summarize. In this paper we studied sub-band modeling for replay spoofing detection, by discriminatively training independent CNNs on uniformly segmented sub-bands of the spectrogram. Their learned features are later combined, and a classifier is trained on top of that, with the whole framework then updated jointly. |
0:17:17 | Using the proposed methodology, we found improved performance on both the ASVspoof 2017 and 2019 datasets. |
0:17:23 | An interesting observation, which is in line with some of our prior work, is that on the 2017 dataset the seven-to-eight-kHz frequency information seems to be the most informative; this, however, doesn't hold true on the 2019 dataset, where the first one kHz of information seems to be the most informative. |
0:17:47 | We also found that these models do not generalize well under realistic replay conditions, which suggests that there is still room for improvement in designing and validating datasets for training effective replay spoofing detection models. |
0:18:11 | With that I would like to conclude my talk. Thank you very much. |