0:00:16 | Hi everyone, I'm John Allen from CRIM, Montreal. |
0:00:20 | Today I am going to talk about a multi-condition training strategy for countermeasures against |
0:00:27 | spoofing attacks |
0:00:29 | on speaker recognizers. |
0:00:30 | This is joint work with my colleagues at CRIM. |
0:00:35 | In this presentation I'm going to provide an overview of our work |
0:00:40 | on an end-to-end spoofing detection system |
0:00:44 | employing a TDNN architecture that utilizes data augmentation to increase |
0:00:50 | the amount of |
0:00:51 | training data for improving |
0:00:54 | performance on the unseen test data. |
0:01:04 | The outline of my talk: first I will start with the |
0:01:09 | introduction and background, then I'm going to talk about |
0:01:13 | spoofing detection and data augmentation, |
0:01:16 | the baselines used for this task, |
0:01:20 | and the end-to-end approach to spoofing detection using the TDNN architecture. |
0:01:26 | Finally, I am going to provide some results for performance evaluation, |
0:01:32 | and I'm going to conclude my talk. |
0:01:39 | Here is the introduction and background. |
0:01:43 | Given a pair of recordings, |
0:01:45 | the goal of a |
0:01:47 | speaker verification system is to determine |
0:01:50 | whether the recordings are from the same speaker or |
0:01:55 | from two different speakers. |
0:01:57 | In order to do so, |
0:01:59 | a speaker verification system utilizes a set of recognizable |
0:02:04 | and verifiable voice characteristics, |
0:02:09 | which are normally considered unique and specific to a person. |
0:02:15 | These characteristics are normally extracted |
0:02:19 | in the feature extraction module of a speaker verification system. |
0:02:23 | In a controlled setting, |
0:02:25 | a speaker verification system performs very well, |
0:02:29 | but its |
0:02:30 | performance degrades in real-world settings, |
0:02:34 | where an impostor can pretend to be a genuine speaker by forging |
0:02:39 | a genuine speaker's voice recording, |
0:02:43 | or when there is a mismatch between training and test environments. |
0:02:48 | In this work we are mainly concerned with the |
0:02:51 | forging of a genuine speaker's voice by an impostor. |
0:02:59 | In a speaker verification system, |
0:03:02 | the claimed identity |
0:03:05 | can be genuine or forged by an impostor |
0:03:09 | whose goal is to gain illegitimate access to the system. |
0:03:15 | The manipulation of an authentication system by impostors is normally known as spoofing. |
0:03:22 | Speaker verification systems are vulnerable to spoofing attacks generated by |
0:03:29 | replay, |
0:03:31 | speech synthesis, voice conversion, |
0:03:34 | and impostor impersonation. |
0:03:37 | Except for impersonation, the other three attacks are normally considered major threats |
0:03:44 | to a speaker verification system. |
0:03:47 | Among the three major attack types, |
0:03:50 | replay is known as the physical access attack, whereas |
0:03:55 | speech synthesis |
0:03:57 | and voice conversion attacks are known as the logical access attacks. |
0:04:05 | Next I am going to talk about spoofing detection. |
0:04:10 | Fortunately, all attack styles |
0:04:14 | discussed in the previous slide, that means replay, |
0:04:19 | speech synthesis and voice conversion, leave some traces in the converted speech in the |
0:04:25 | form of audible artifacts. |
0:04:28 | Spoofing detection techniques normally use these audible artifacts |
0:04:33 | in order to distinguish |
0:04:35 | spoofed speech from genuine speech. |
0:04:40 | To make speaker verification systems robust against spoofing attacks, |
0:04:46 | speaker verification and spoofing detection systems can be |
0:04:51 | connected in cascade or in parallel. |
0:04:53 | In the left side of the figure presented here, the speaker verification system is followed |
0:04:58 | by the spoofing detection system. |
0:05:01 | Here, the recording of the claimed identity is |
0:05:05 | initially passed through the speaker verification system to make the verification decision. |
0:05:10 | If the identity |
0:05:12 | is accepted by the verification system, |
0:05:15 | it is then passed through the spoofing detection system |
0:05:18 | to find out whether the claimed identity is actually |
0:05:22 | genuine or spoofed. |
0:05:26 | In the right side of the figure, the order is reversed: the spoofing detection system is followed |
0:05:31 | by the speaker verification system. In this case, |
0:05:35 | the |
0:05:37 | claimed identity is first checked by the spoofing detection system; if it is found genuine, only then is it passed |
0:05:42 | to the verification system to make the verification decision. |
0:05:48 | Finally, |
0:05:51 | the speaker verification and spoofing detection systems can be connected in parallel. |
0:05:57 | In this case, |
0:05:59 | the fused score of the |
0:06:02 | speaker verification and spoofing detection |
0:06:06 | systems is used to make the accept or reject decision. |
0:06:11 | The advantage of this approach is that only one operation is required |
0:06:17 | to make the verification decision. |
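The parallel configuration's fused-score decision can be sketched as below; the equal weighting and zero threshold are illustrative assumptions, not values from the talk.

```python
# Sketch of parallel score fusion for a speaker-verification (SV) and
# spoofing-countermeasure (CM) system. Both scores are assumed to be
# higher-is-more-genuine, log-likelihood-ratio-like scores; alpha and
# the threshold are hypothetical values chosen for illustration.

def fuse_scores(sv_score, cm_score, alpha=0.5):
    """Weighted sum of the SV and CM scores."""
    return alpha * sv_score + (1.0 - alpha) * cm_score

def decide(sv_score, cm_score, threshold=0.0, alpha=0.5):
    # A single comparison on the fused score makes the accept/reject
    # decision -- the "only one operation" advantage mentioned above.
    fused = fuse_scores(sv_score, cm_score, alpha)
    return "accept" if fused > threshold else "reject"
```

With a strong verification score and a genuine-looking countermeasure score the trial is accepted; a clearly impostor-like verification score pulls the fused score below the threshold.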
0:06:20 | Like the 2015 and 2017 editions of the ASVspoof |
0:06:24 | challenge, |
0:06:26 | in the 2019 edition of the ASVspoof challenge |
0:06:33 | the participants were asked to build a standalone spoofing detection system, irrespective of |
0:06:38 | a speaker verification system. |
0:06:42 | But in the 2019 edition of the ASVspoof challenge, the organisers provided the verification |
0:06:48 | scores to the participants, |
0:06:51 | so the participants can evaluate their spoofing detection scores |
0:06:55 | in terms of the tandem detection cost function when used alongside the verification system. |
0:07:08 | Next I'm going to talk about data augmentation. |
0:07:13 | Modern machine learning models such as deep learning architectures |
0:07:19 | may have billions of parameters and normally require a large amount of data for training. |
0:07:26 | But in |
0:07:28 | most application cases, having a large amount of data is normally not possible. |
0:07:35 | For example, consider the case of the ASVspoof challenges, where the training data provided |
0:07:41 | to the participants are not sufficient |
0:07:46 | to expect generalized performance using |
0:07:50 | deep learning approaches. |
0:07:53 | So, |
0:07:56 | to use deep learning architectures we need to increase the training data. |
0:08:00 | The process of increasing the amount and the diversity of training data is |
0:08:06 | normally known as |
0:08:09 | data augmentation. |
0:08:10 | Data augmentation normally serves |
0:08:12 | two purposes. |
0:08:13 | One purpose is domain adaptation or domain generalization. |
0:08:18 | In this case the main goal is to compensate for the environmental mismatch |
0:08:22 | between training and test data, |
0:08:25 | and this approach is widely used in speech-based applications, |
0:08:30 | for example speaker recognition and speech recognition. |
0:08:34 | Another purpose of data augmentation is regularization. |
0:08:38 | Here the main goal is to improve performance on unseen test data |
0:08:43 | by |
0:08:44 | increasing the training data. |
0:08:46 | In this work our purpose was to |
0:08:49 | do regularization. |
0:08:53 | For this work we tried to adopt a data augmentation strategy that preserves the |
0:08:59 | artifacts of the spoofing attacks |
0:09:02 | and at the same time does not |
0:09:05 | use any external data such as |
0:09:08 | noise, reverberation, et cetera. |
0:09:11 | The data augmentation strategy adopted in this work is presented |
0:09:17 | in the figure |
0:09:19 | on the slide. |
0:09:21 | Here, additional training data were created by applying speed perturbation, with the |
0:09:27 | speed perturbation factors of |
0:09:29 | 0.9 and 1.1, |
0:09:32 | and low-pass and high-pass filtering on the training data. |
0:09:38 | By doing the data augmentation in this case, we were able to increase the training data |
0:09:44 | to five times the original training data. |
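The five-fold augmentation just described (original waveform, speed factors 0.9 and 1.1, plus low-pass and high-pass copies) might look like this in outline. The interpolation-based resampling and the moving-average filters below are my own simplifications for illustration, not the talk's actual signal-processing chain.

```python
import numpy as np

def speed_perturb(x, factor):
    # Linear-interpolation resampling: reading the signal on a grid
    # scaled by `factor` changes its duration by 1/factor, which is a
    # crude approximation of speed perturbation.
    n_out = int(round(len(x) / factor))
    t = np.linspace(0, len(x) - 1, n_out)
    return np.interp(t, np.arange(len(x)), x)

def moving_average(x, width=8):
    # Crude low-pass filter; its residual acts as a crude high-pass.
    kernel = np.ones(width) / width
    return np.convolve(x, kernel, mode="same")

def augment(x):
    """Return the five-fold augmented set for one training waveform."""
    low = moving_average(x)
    return [
        x,                      # original
        speed_perturb(x, 0.9),  # slower copy (longer signal)
        speed_perturb(x, 1.1),  # faster copy (shorter signal)
        low,                    # low-pass filtered copy
        x - low,                # high-pass filtered copy
    ]
```

Applying `augment` to every training utterance yields five times the original amount of training data, as stated above.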
0:09:51 | Next I am going to talk about the |
0:09:53 | speech representations we used |
0:09:56 | for this work. |
0:09:59 | In the course of |
0:10:00 | the 2015 and 2017 editions of |
0:10:04 | the ASVspoof challenges, |
0:10:06 | and after the evaluations, it became almost clear that the most effective countermeasures |
0:10:13 | for spoofing detection use local speech representations. |
0:10:18 | By local we mean frame-level features, |
0:10:21 | which are typically extracted over ten-millisecond intervals. |
0:10:25 | For the ASVspoof 2019 challenge tasks, |
0:10:30 | we |
0:10:31 | used three widely used local speech representations. |
0:10:35 | One of them is |
0:10:37 | the linear frequency cepstral coefficient (LFCC) feature, |
0:10:40 | and the various steps to compute this feature are presented in the left-hand |
0:10:46 | side of the figure. |
0:10:48 | Another |
0:10:50 | feature is the |
0:10:53 | constant Q cepstral coefficient (CQCC) feature, |
0:10:57 | which was found very effective for the |
0:11:00 | 2015 edition of the spoofing challenge task, |
0:11:04 | and |
0:11:05 | we used |
0:11:06 | this feature also in this task, |
0:11:10 | as this feature was provided |
0:11:13 | with the baseline. |
0:11:15 | The various steps to compute the CQCC feature are presented in the right-hand |
0:11:21 | side of the figure. |
0:11:27 | Another local speech representation we used for this work is the |
0:11:33 | product spectrum, which is the product of the power spectrum and the group delay function. |
0:11:39 | This feature incorporates both the amplitude and phase spectral components, |
0:11:44 | and the various steps for computing this feature are presented |
0:11:50 | in the figure on this slide. |
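As a rough illustration of the LFCC pipeline mentioned above (framing, windowing, power spectrum, a linearly spaced triangular filterbank, log compression, DCT), here is a minimal numpy sketch. All parameter values (frame length, hop, filter and coefficient counts) are common defaults, not necessarily those used in the work.

```python
import numpy as np

def dct2(x, n_out):
    # Type-II DCT along the last axis, keeping the first n_out terms.
    N = x.shape[1]
    k = np.arange(n_out)[:, None]
    n = np.arange(N)[None, :]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * N))
    return x @ basis.T

def lfcc(x, n_fft=512, frame_len=400, hop=160, n_filt=20, n_ceps=13):
    """Minimal LFCC sketch; frame-level ("local") features as described."""
    frames = [x[s:s + frame_len] * np.hamming(frame_len)
              for s in range(0, len(x) - frame_len + 1, hop)]
    spec = np.abs(np.fft.rfft(frames, n_fft)) ** 2      # power spectrum
    # Linearly spaced triangular filters (this is what makes it "L"FCC,
    # in contrast to the mel-spaced filters of MFCC).
    edges = np.linspace(0, n_fft // 2, n_filt + 2).astype(int)
    fbank = np.zeros((n_filt, n_fft // 2 + 1))
    for i in range(n_filt):
        l, c, r = edges[i], edges[i + 1], edges[i + 2]
        fbank[i, l:c] = np.linspace(0, 1, c - l, endpoint=False)
        fbank[i, c:r] = np.linspace(1, 0, r - c, endpoint=False)
    logenergy = np.log(spec @ fbank.T + 1e-10)
    return dct2(logenergy, n_ceps)
```

With a 16 kHz signal, `frame_len=400` and `hop=160` correspond to 25 ms frames every 10 ms, matching the ten-millisecond intervals mentioned in the talk.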
0:11:55 | Next I would like to talk about the baselines used for spoofing detection in this task. |
0:12:02 | In order to make a performance comparison, we used the two baselines provided by the |
0:12:07 | organisers: one of the baselines is the CQCC feature with a GMM classifier, |
0:12:12 | and the other baseline is the |
0:12:14 | LFCC feature with the same GMM classifier. |
0:12:19 | Besides, we also created our own baselines: one of our baselines is MFCC with a |
0:12:25 | GMM, and another is an i-vector PLDA system. |
0:12:28 | Both of our baselines were |
0:12:31 | built using the Kaldi toolkit. |
0:12:37 | The figure on this slide presents the GMM-based framework for spoofing |
0:12:42 | detection. |
0:12:44 | In this framework, |
0:12:45 | genuine and spoofed GMM models are trained |
0:12:49 | using genuine and |
0:12:51 | spoofed speech training data. |
0:12:54 | Then, given a test recording, the genuine-or-spoofed decision is made based on the likelihood ratio |
0:13:02 | computed using the trained GMM models. |
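The likelihood-ratio scoring just described can be sketched with toy diagonal-covariance GMMs; this is an illustrative scorer under assumed model parameters, not the organizers' baseline code.

```python
import numpy as np

def diag_gmm_loglik(X, weights, means, variances):
    """Average per-frame log-likelihood of frames X (T x D) under a
    diagonal-covariance GMM given as (weights, means, variances)."""
    # log N(x; mu, diag(var)) for every frame/component pair
    diff = X[:, None, :] - means[None, :, :]                    # (T, K, D)
    logp = -0.5 * np.sum(diff ** 2 / variances
                         + np.log(2 * np.pi * variances), axis=2)  # (T, K)
    logp = logp + np.log(weights)[None, :]
    # log-sum-exp over components, then average over frames
    m = logp.max(axis=1, keepdims=True)
    return float(np.mean(m[:, 0] + np.log(np.exp(logp - m).sum(axis=1))))

def llr_score(X, gmm_genuine, gmm_spoof):
    # Genuine-vs-spoofed decision score: higher means "more genuine".
    return diag_gmm_loglik(X, *gmm_genuine) - diag_gmm_loglik(X, *gmm_spoof)
```

In the real framework the two GMMs are trained on genuine and spoofed feature frames respectively; here they would simply be plugged in as `(weights, means, variances)` tuples.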
0:13:09 | Next I'm going to talk about the end-to-end approach that we used for |
0:13:13 | spoofing detection in this task. |
0:13:19 | In an end-to-end approach, local speech representations are typically mapped to a |
0:13:25 | spoofing detection score. |
0:13:27 | In this approach, for modeling we adopted a TDNN; |
0:13:31 | the detailed architecture is presented in |
0:13:34 | Table 1 on the slide. |
0:13:37 | In this architecture, several |
0:13:40 | one-dimensional convolutions are performed to encode |
0:13:44 | the |
0:13:47 | input local speech representations into local countermeasures. |
0:13:51 | A statistics pooling layer is then used to summarize the sequence of |
0:13:56 | local countermeasures into a global countermeasure. |
0:14:00 | Finally, the global countermeasure is projected into a final output score |
0:14:07 | through an affine transformation, making the complete model trainable end to end. |
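The encode, statistics-pool, affine-output pipeline might be sketched as follows in plain numpy; the layer sizes, the ReLU nonlinearity, and the mean-plus-standard-deviation pooling are assumptions for illustration, not the architecture from Table 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def tdnn_layer(X, W, b, context=2):
    """One toy TDNN (1-D convolution) layer: each output frame sees
    2*context+1 spliced input frames. Weight shapes are illustrative."""
    T, D = X.shape
    out = []
    for t in range(context, T - context):
        window = X[t - context:t + context + 1].ravel()   # splice context
        out.append(np.maximum(W @ window + b, 0.0))       # ReLU
    return np.array(out)

def stats_pooling(H):
    # Summarize the variable-length frame sequence ("local
    # countermeasures") into one fixed vector ("global countermeasure").
    return np.concatenate([H.mean(axis=0), H.std(axis=0)])

def score(X, W, b, w_out, b_out):
    # Frame-level TDNN -> statistics pooling -> affine output score.
    return float(w_out @ stats_pooling(tdnn_layer(X, W, b)) + b_out)
```

A real system would stack several such layers and learn all weights jointly from the binary cross-entropy loss described next; this sketch only shows the data flow.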
0:14:16 | For training, |
0:14:18 | the binary cross-entropy loss is used, as in a standard binary classification setting. |
0:14:24 | As you can see, |
0:14:27 | and as we have seen previously, the training data is quite unbalanced for the ASVspoof |
0:14:32 | data: |
0:14:33 | the spoofed data is almost nine times the bona fide training data. |
0:14:39 | So |
0:14:41 | mini-batches are created in such a way that genuine examples are sampled |
0:14:46 | several times per epoch, which ensures |
0:14:50 | the mini-batches are balanced. |
0:14:53 | Training is carried out using the standard stochastic gradient descent algorithm with a mini-batch size |
0:14:59 | of sixteen. |
0:15:01 | Model selection is also employed for this task. |
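The balanced mini-batch scheme could be sketched like this; the batch size of 16 follows the talk, while the half-genuine/half-spoofed split and the sampling-with-replacement details are illustrative assumptions.

```python
import random

def balanced_batches(genuine, spoofed, batch_size=16, seed=0):
    """Yield mini-batches with equal numbers of genuine and spoofed
    items. The minority (genuine) class is re-sampled with replacement,
    so each genuine example is seen several times per epoch."""
    rng = random.Random(seed)
    half = batch_size // 2
    spoofed = spoofed[:]          # don't shuffle the caller's list
    rng.shuffle(spoofed)
    batches = []
    for i in range(0, len(spoofed) - half + 1, half):
        batch = spoofed[i:i + half] + rng.choices(genuine, k=half)
        rng.shuffle(batch)        # mix the two classes inside the batch
        batches.append(batch)
    return batches
```

With roughly nine times as many spoofed as genuine examples, this sampler walks once through the spoofed data per epoch while revisiting the genuine pool repeatedly.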
0:15:09 | Next I'm going to present some results on the ASVspoof 2019 |
0:15:13 | challenge development and evaluation data. |
0:15:17 | The metrics used for our performance evaluation are the normalized minimum tandem |
0:15:23 | detection cost function (t-DCF) and the equal error rate. |
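Of the two metrics, the equal error rate is easy to sketch: sweep candidate thresholds over the scores and find the point where the false-acceptance and false-rejection rates cross. The implementation below is a simple illustration, not the official evaluation script.

```python
import numpy as np

def eer(genuine_scores, spoof_scores):
    """Equal error rate for higher-is-more-genuine detection scores."""
    thresholds = np.sort(np.concatenate([genuine_scores, spoof_scores]))
    best, best_gap = 1.0, np.inf
    for t in thresholds:
        far = np.mean(spoof_scores >= t)     # spoofed accepted as genuine
        frr = np.mean(genuine_scores < t)    # genuine rejected
        if abs(far - frr) < best_gap:        # closest FAR/FRR crossing
            best_gap, best = abs(far - frr), (far + frr) / 2
    return float(best)
```

Perfectly separated score distributions give an EER of zero; heavily overlapping ones approach 0.5.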
0:15:28 | For the experiments on the logical and physical access tasks, |
0:15:32 | the |
0:15:33 | ASVspoof 2019 challenge corpus was used. |
0:15:44 | For the physical access task, |
0:15:46 | the spoofed data were generated using simulated replay attacks, whereas for the |
0:15:52 | logical access task they were generated using various |
0:15:56 | speech synthesis and voice conversion algorithms. |
0:16:00 | Table 2 presents |
0:16:03 | the number of genuine and spoofed recordings and the number of speakers |
0:16:07 | contained in the train, development and evaluation partitions of the |
0:16:12 | logical access and physical access tasks. |
0:16:17 | We can see from this table that the training data is quite unbalanced in this corpus. |
0:16:25 | The physical access spoofing detection results, in terms of tandem detection cost function and equal error |
0:16:31 | rate, |
0:16:32 | on the development as well as the evaluation sets, are reported in |
0:16:38 | Tables 3 and 5. |
0:16:41 | We can observe from the presented results that |
0:16:45 | data augmentation helped |
0:16:46 | to improve performance on both test sets. |
0:16:52 | Next, the |
0:16:54 | logical access spoofing detection results: |
0:16:57 | this slide presents the logical access spoofing detection results in terms of the |
0:17:02 | tandem detection cost function and the equal error rate |
0:17:06 | on the development as well as the evaluation sets. |
0:17:11 | For the logical access task, |
0:17:14 | data augmentation is found effective only on the development set. |
0:17:18 | Overall, we can see that the end-to-end approach employing the TDNN architecture provided |
0:17:24 | better performance |
0:17:26 | than the baselines |
0:17:29 | on both the logical and physical access tasks. |
0:17:36 | Finally, in conclusion we can say that |
0:17:40 | data augmentation is found helpful, specifically for the physical access task, for spoofing detection |
0:17:47 | employing deep learning architectures. |
0:17:52 | In order to do data augmentation for spoofing detection, we have to make sure that the |
0:17:57 | signal transformations employed in the |
0:18:00 | data augmentation technique preserve the artifacts introduced by the |
0:18:06 | spoofing algorithms. |
0:18:09 | The end-to-end approach employing a TDNN and data augmentation outperforms the |
0:18:15 | baselines. |
0:18:21 | The end-to-end approach |
0:18:22 | with the TDNN architecture, with |
0:18:26 | and without data augmentation, |
0:18:28 | outperformed all the baselines on both the logical access and physical access tasks. |
0:18:35 | Time-domain data augmentation by speed perturbation and |
0:18:41 | filtering, |
0:18:42 | specifically low-pass and high-pass filtering, is |
0:18:46 | found useful for the physical access task, but for the |
0:18:49 | logical access task |
0:18:51 | speed perturbation is found harmful. |
0:18:56 | Feature normalization, the use of voice activity detection, and a number of |
0:19:02 | filters less than |
0:19:04 | sixty-four |
0:19:06 | are not recommended for the spoofing detection task. |
0:19:15 | Thank you very much for your attention. |