0:00:13 | hi everyone, thank you for joining my presentation |
---|
0:00:17 | i am from NEC Corporation |
---|
0:00:20 | today i would like to present my paper, using multi-resolution feature maps with convolutional neural networks for anti-spoofing in ASV |
---|
0:00:33 | here is the outline. first i would like to give an introduction and a review of multiple feature maps popular in spoofing detection |
---|
0:00:44 | next i will introduce our proposed multi-resolution feature maps and how they are used as a three-dimensional input to neural networks |
---|
0:00:55 | we will give three popular CNN variants as examples: ResNet-18, SENet-50 and LCNN |
---|
0:01:05 | and we show the effectiveness of the proposed method in experiments, and also give an analysis in terms of computational cost |
---|
0:01:14 | finally i would like to summarise this presentation |
---|
0:01:21 | automatic speaker verification, ASV, offers flexible biometric authentication and has been increasingly employed in telephone-based services such as telephone banking, information access at call centers and so on |
---|
0:01:39 | ASV's reliability depends on its resilience to spoofing, as is true of any biometric technology |
---|
0:01:48 | therefore, with the increasing use of ASV, spoofing detection in speech is also getting more attention |
---|
0:01:58 | there are two scenarios of spoofing attacks: logical access and physical access |
---|
0:02:04 | logical access includes text-to-speech synthesis and voice conversion |
---|
0:02:11 | physical access is mainly replay, where the target identity's voice is recorded and replayed |
---|
0:02:18 | replay is very easy to implement and, as you may know, rather hard to detect |
---|
0:02:25 | the ASVspoof challenges, 2015 to 2019, have been driving efforts on anti-spoofing countermeasures and have resulted in significant findings |
---|
0:02:38 | ASVspoof 2015 focused on spoofing attacks generated by speech synthesis and voice conversion |
---|
0:02:47 | ASVspoof 2017 focused on replay attacks |
---|
0:02:52 | then ASVspoof 2019 addressed all the types of spoofing in the previous two challenges, and further extended the datasets in terms of spoofing technology, number of conditions and volume of data |
---|
0:03:13 | with a lot of the research done within the challenges, the trend has been shifting from GMMs with features like MFCCs and CQCCs at the beginning to deep neural networks with high time-frequency resolution features, which have been proved to achieve higher accuracy |
---|
0:03:39 | following this, conventional methods choose carefully which type of acoustic feature to use, as it is essential in speech processing tasks including spoofing detection |
---|
0:03:53 | however, relying on only one type of acoustic feature may not be sufficient to detect spoofing attacks robustly when facing unseen spoofed speech |
---|
0:04:10 | as we know, from a given audio segment multiple acoustic feature maps can be extracted, such as MFCC, CQCC, LFCC, FFT spectrogram, CQT and so on |
---|
0:04:28 | it may be difficult to determine which type of acoustic feature map will be the best for spoofing detection |
---|
0:04:37 | even for one type of acoustic feature, different settings used in the extraction will result in different information being obtained |
---|
0:04:49 | for example, FFT spectrograms extracted with different window lengths contain spectral information at different resolutions in higher and lower frequency bands |
---|
0:05:04 | a shorter window will lead to higher resolution in terms of time and lower resolution in terms of frequency |
---|
0:05:14 | on the contrary, a longer window will extract an FFT spectrogram which has higher resolution in frequency and lower resolution in time |
---|
0:05:30 | the trade-off between time and frequency resolution makes it difficult to extract sufficient information with one FFT spectrogram alone |
---|
0:05:41 | therefore the use of multiple acoustic feature maps is needed to alleviate the problem |
---|
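To make the trade-off concrete, here is a minimal sketch in Python (my own illustration, not part of the talk) that computes spectrograms of a dummy signal with the three window lengths mentioned later in the presentation; the 16 kHz sampling rate and the random signal are assumptions.

```python
# Minimal sketch of the time-frequency trade-off (assumed 16 kHz audio):
# shorter windows give more frames (finer time resolution) but fewer
# frequency bins (coarser frequency resolution), and vice versa.
import numpy as np
from scipy.signal import stft

fs = 16000                                # assumed sampling rate
x = np.random.randn(fs)                   # 1 s of dummy audio

for win_ms in (18, 25, 30):               # window lengths used in the talk
    nperseg = int(fs * win_ms / 1000)
    f, t, _ = stft(x, fs=fs, nperseg=nperseg, noverlap=nperseg // 2)
    print(f"{win_ms} ms window: {len(f)} frequency bins x {len(t)} frames")
```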
0:05:50 | the question is how to use multiple acoustic feature maps together |
---|
0:06:01 | there are feature fusion and score fusion. score fusion is a kind of late fusion which can be used to fuse the scores produced from systems using individual feature maps |
---|
0:06:17 | however, score fusion can be computationally costly since it needs to train neural networks multiple times. in addition, fusion weights need to be determined in advance |
---|
0:06:31 | as for feature fusion, there is feature map concatenation along a single dimension such as the time or frequency dimension, and there is also linear interpolation of two feature maps |
---|
0:06:45 | in addition, feature fusion followed by neural networks has the advantage of automatic feature selection, so we chose feature fusion over score fusion |
---|
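As a rough illustration of the two options (my own sketch, with hypothetical scores, weights and feature shapes, not the authors' code):

```python
import numpy as np

# Score fusion (late fusion): each system is trained on one feature map and
# the per-utterance scores are combined with weights fixed in advance.
score_a, score_b = 1.3, -0.4              # hypothetical detection scores
w = 0.5                                   # fusion weight, tuned beforehand
fused_score = w * score_a + (1 - w) * score_b

# Feature fusion (early fusion): the feature maps are combined before a
# single network is trained, e.g. concatenation along one dimension, or
# stacking along a new channel dimension as in the proposed method.
feat_a = np.zeros((257, 400))             # 257 frequency bins x 400 frames
feat_b = np.zeros((257, 400))
concat_time = np.concatenate([feat_a, feat_b], axis=1)   # 257 x 800
stacked = np.stack([feat_a, feat_b], axis=0)             # 2 x 257 x 400
```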
0:07:08 | we propose multi-resolution feature maps, which stack multiple feature maps of the same dimensionality into a three-dimensional input for a deep neural network |
---|
0:07:25 | it is suitable for 2-D CNNs |
---|
0:07:29 | the modification of the neural network is also very simple: we only need to modify the first layer of the neural network from a convolution of 1 times C1, where C1 is the output channel dimension of the first layer, to N times C1, where N is the number of channels, which means the number of feature maps |
---|
0:08:05 | so the proposed method makes it possible to extract more information from the input signals with relatively little additional cost |
---|
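A minimal PyTorch sketch of this idea, under my own assumptions (torchvision's ResNet-18 as the backbone and 257 x 400 spectrograms), not the authors' implementation:

```python
import torch
import torch.nn as nn
from torchvision import models

n_maps = 3                                  # e.g. 18 / 25 / 30 ms FFT spectrograms
maps = [torch.randn(257, 400) for _ in range(n_maps)]
x = torch.stack(maps, dim=0).unsqueeze(0)   # (batch, N, 257, 400)

model = models.resnet18(num_classes=10)     # ten output nodes as in the talk
# Only the first convolution changes: its input channel count becomes the
# number of stacked feature maps (it would be 1 in a single-map baseline).
old = model.conv1
model.conv1 = nn.Conv2d(n_maps, old.out_channels, kernel_size=old.kernel_size,
                        stride=old.stride, padding=old.padding, bias=False)

logits = model(x)                           # (1, 10); the rest of the network is untouched
```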
0:08:25 | the experimental data used in this study was the physical access subset of the ASVspoof 2019 challenge |
---|
0:08:35 | it consisted of fifty thousand spoofed and five thousand bona fide utterances in the training partition, as well as twenty-four thousand spoofed and five thousand bona fide utterances in the development partition |
---|
0:08:52 | the development dataset has the same conditions as seen in the training data: it contains twenty-seven acoustic configurations and nine replay configurations |
---|
0:09:07 | the evaluation set contains a much larger amount of data and includes spoofed utterances of unseen conditions as well |
---|
0:09:19 | this is the replay scenario in ASVspoof 2019 |
---|
0:09:25 | replay attacks are simulated within an acoustic environment |
---|
0:09:32 | the talker here speaks to the ASV system, and an attacker records the speech at defined distances and then replays the speech back to the ASV system |
---|
0:09:59 | in our experiments we used FFT spectrograms, with window lengths of eighteen, twenty-five and thirty milliseconds |
---|
0:10:08 | each spectrogram's dimension was set to two hundred fifty-seven times four hundred, so that we used unified FFT feature maps |
---|
0:10:21 | since the lengths of the evaluation utterances are usually not known beforehand, we first extended the utterances to their minimum multiple of four hundred frames, and then cut them into multiple four-hundred-frame segments with a two-hundred-frame overlap |
---|
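A minimal sketch of this padding-and-segmentation step (my own reading of the description; whether the utterance is extended by repetition or by zero-padding is an assumption here):

```python
import numpy as np

def segment_utterance(spec, seg_len=400, hop=200):
    """spec: (n_freq, n_frames) feature map, e.g. 257 x T."""
    n_freq, n_frames = spec.shape
    # extend to the minimum multiple of seg_len (here by repeating the utterance)
    target = int(np.ceil(n_frames / seg_len)) * seg_len
    reps = int(np.ceil(target / n_frames))
    spec = np.tile(spec, (1, reps))[:, :target]
    # cut into seg_len-frame segments with (seg_len - hop)-frame overlap
    return [spec[:, s:s + seg_len] for s in range(0, target - seg_len + 1, hop)]

segments = segment_utterance(np.random.randn(257, 950))  # -> list of 257 x 400 segments
```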
0:10:52 | experiments were carried out using the following three CNN variants: ResNet-18, SENet-50 and light CNN (LCNN) |
---|
0:11:02 | all the networks have ten output nodes: one stands for the bona fide condition, also called genuine, and the other nine represent the nine replay configurations |
---|
0:11:20 | the log probability of the bona fide class is used as the spoofing detection score to make the final decision |
---|
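In code, the scoring step could look like the following sketch (my own; which output index corresponds to the bona fide class is an assumption):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 10)              # (batch, 10) network output
log_probs = F.log_softmax(logits, dim=-1)
bonafide_score = log_probs[:, 0]         # assumed index 0 = bona fide; higher = more genuine
```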
0:11:37 | the model parameters and architectures of ResNet-18 and SENet-50 are shown in this table |
---|
0:11:44 | the basic and the bottleneck residual blocks are as described in the original ResNet paper |
---|
0:11:52 | ResNet-18 and SENet-50 have been shown in this paper to be effective for replay spoofing detection; that is why we chose these two networks |
---|
0:12:06 | LCNN is a kind of CNN with max-feature-map activation; the use of MFM allows the number of channels to be reduced by half, which is why it is called light CNN |
---|
0:12:22 | LCNN worked the best in ASVspoof 2017 and also ranked highly in the ASVspoof 2019 challenge; that is why we also include LCNN in our study |
---|
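For reference, the max-feature-map operation that gives LCNN its name can be sketched as follows (a generic MFM, not the exact LCNN implementation used here):

```python
import torch

def max_feature_map(x):
    """x: (batch, 2*C, H, W) -> (batch, C, H, W); element-wise max of the two halves."""
    a, b = torch.chunk(x, 2, dim=1)
    return torch.max(a, b)

y = max_feature_map(torch.randn(4, 64, 257, 400))   # -> shape (4, 32, 257, 400)
```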
0:12:40 | we first compare spoofing detection equal error rates when using single feature maps of different resolutions in the three networks |
---|
0:12:52 | for different neural network architectures, the respective best performances were obtained with different feature maps |
---|
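For readers unfamiliar with the metric, an equal error rate computation looks roughly like this (my own sketch, not the challenge's official scoring tool):

```python
import numpy as np

def compute_eer(scores, labels):
    """scores: higher = more bona fide; labels: 1 = bona fide, 0 = spoof."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    thresholds = np.sort(np.unique(scores))
    far = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])  # false accepts
    frr = np.array([(scores[labels == 1] < t).mean() for t in thresholds])   # false rejects
    i = int(np.argmin(np.abs(far - frr)))
    return (far[i] + frr[i]) / 2

print(compute_eer([0.9, 0.4, -0.3, -0.8], [1, 1, 0, 0]))   # -> 0.0 for perfectly separated scores
```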
0:13:08 | so here in the figure, FFT-18, FFT-25 and FFT-30 represent FFT spectrograms extracted with these window lengths |
---|
0:13:22 | the FFT spectrograms extracted with a twenty-five-millisecond window and those extracted with thirty milliseconds give similar results on ResNet-18, and they appear significantly better than those with eighteen milliseconds |
---|
0:13:47 | for SENet-50, however, FFT-18 and FFT-25 gave similar results, while those with thirty milliseconds were the best |
---|
0:14:03 | for LCNN, the FFT spectrograms of twenty-five milliseconds gave the best performance |
---|
0:14:11 | so there may not be one single optimal FFT configuration for different neural network structures |
---|
0:14:23 | next we applied the proposed multi-resolution feature maps as input to the CNNs |
---|
0:14:31 | we also show at the same time the results of score fusion, which is a straightforward method when multiple feature maps are available |
---|
0:14:46 | here are the results on the development set, where the replay conditions are seen |
---|
0:14:55 | the grey bars here represent the performance of single feature maps, the blue bars represent score fusion, and the yellow bars represent the proposed method |
---|
0:15:09 | we can see from the figure that the use of two-resolution feature maps here improved the EERs by around forty-four and thirty percent respectively, where the results are compared to the better one of the two single-feature systems |
---|
0:15:36 | then we also have three-resolution results here, which show the best performance for ResNet-18 and SENet-50, with fifty-two percent and fifty-seven percent error reduction |
---|
0:15:56 | for LCNN, the three-resolution input shows less improvement; we think the reason may be that LCNN has a much smaller number of parameters, which we will show on a later page |
---|
0:16:11 | we also can see that score fusion achieved better results compared to the single feature map systems |
---|
0:16:20 | clearly, in all the cases the proposed method, the yellow bars, is consistently better than score fusion, shown in blue |
---|
0:16:36 | the same trend has been seen in the evaluation set, where there are unseen replay conditions |
---|
0:16:48 | the improvement compared to the seen conditions is smaller, but it is still consistently better than score fusion and also, of course, than the original single feature map systems |
---|
0:17:07 | then we investigated the computational cost of the proposed method and of score fusion |
---|
0:17:16 | using the proposed two-resolution feature maps only resulted in a parameter number increase of less than zero point two percent, while the increase when using the three-resolution feature maps was roughly zero point two percent |
---|
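The small size of the increase can be checked with a quick parameter count (my own sketch using torchvision's ResNet-18 with 1 versus 3 input channels; the exact percentages quoted in the talk are for the authors' models):

```python
import torch.nn as nn
from torchvision import models

def n_params(model):
    return sum(p.numel() for p in model.parameters())

def resnet18_in_channels(c):
    m = models.resnet18(num_classes=10)
    m.conv1 = nn.Conv2d(c, 64, kernel_size=7, stride=2, padding=3, bias=False)
    return m

base = n_params(resnet18_in_channels(1))
multi = n_params(resnet18_in_channels(3))
print(f"{(multi - base) / base * 100:.3f} % more parameters")   # a small fraction of a percent
```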
0:17:40 | score fusion, as is well known, trains two or more systems and then fuses the scores at the score level |
---|
0:17:51 | this did not improve the performance much in our experiments, but it doubled or even tripled the number of parameters |
---|
0:18:03 | so in conclusion, our proposed method will be more helpful in practical use |
---|
0:18:12 | now i would like to summarise this presentation |
---|
0:18:16 | we proposed multi-resolution feature maps, which stack multiple feature maps into a three-dimensional input followed by CNNs, so that the optimal resolutions will be automatically selected |
---|
0:18:31 | it is proposed to alleviate the problem that feature maps commonly used in anti-spoofing networks are unlikely to be sufficient for learning discriminative representations of audio segments, as they are often extracted with fixed-length windows |
---|
0:18:50 | the effectiveness of the proposed method was confirmed on the ASVspoof 2019 challenge physical access data, with three CNN variants: ResNet-18, SENet-50 and LCNN |
---|
0:19:05 | experiments showed that the two- and three-resolution feature maps achieved at largest fifty-seven and forty-five percent equal error rate reduction |
---|
0:19:15 | it was significantly better than score fusion, and it also costs only one-third to a half in terms of computational cost |
---|
0:19:27 | for future work, we would like to introduce an attention mechanism to make better use of multi-resolution feature maps |
---|
0:19:35 | we also would like to extend the proposed method with other feature extractors |
---|
0:19:42 | that is all for my presentation. thank you for watching, and please let me know if you have any questions |
---|