0:00:13hi everyone thank you for joining my presentation
0:00:17i am qiongqiong wang from nec corporation
0:00:20today i would like to present my paper
0:00:23using multiresolution feature maps
0:00:26with
0:00:27convolutional neural networks for anti-spoofing in asv
0:00:33here are the contents first i would like to give an introduction and a review
0:00:38of multiple
0:00:39feature maps popular in asv spoofing detection
0:00:44next
0:00:44i will introduce our proposed
0:00:47multiresolution feature map
0:00:49and how it is used with feature extraction
0:00:52and with neural networks
0:00:55we will give three popular cnn variants as examples
0:01:00resnet eighteen senet fifty and lcnn
0:01:05and we show the effectiveness of the proposed method
0:01:08in experiments
0:01:10and also give an analysis in terms of computational cost
0:01:14finally i'd like to summarise this presentation
0:01:21automatic speaker verification asv
0:01:25offers flexible biometric authentication and has been increasingly employed in such telephone based services as
0:01:34telephone banking
0:01:35in four and six at call center and so
0:01:39asv's reliability depends on its resilience to spoofing
0:01:44as is true of any biometric technology
0:01:48therefore with the increase in
0:01:50use
0:01:52of asv spoofing detection in speech is also getting more attention
0:01:58there are two scenarios of spoofing attacks
0:02:01logical access and physical access
0:02:04logical access includes text-to-speech synthesis and voice conversion
0:02:11physical access is mainly replay where the target identity's voice is recorded and
0:02:17replayed
0:02:18replay is very easy
0:02:20to implement and as you may know very hard to detect
0:02:25the asvspoof challenges
0:02:28from two thousand fifteen to two thousand nineteen have been driving efforts on
0:02:33anti-spoofing countermeasures
0:02:35and have resulted in significant findings
0:02:38asvspoof two thousand fifteen
0:02:40focused on spoofing attacks generated with speech synthesis
0:02:44and voice conversion
0:02:47asvspoof two thousand seventeen focused on
0:02:50replay attacks
0:02:52then
0:02:53asvspoof
0:02:54two thousand nineteen
0:02:57addressed all types
0:02:58of
0:03:00spoofing
0:03:00in the previous two challenges
0:03:02and further extended the data sets
0:03:04in terms of spoofing technology
0:03:08number of
0:03:09conditions and volume of data
0:03:13with a lot of research done along with the challenges
0:03:17the trend has been
0:03:21shifting from gmms with features like mfccs and cqccs at the
0:03:27beginning
0:03:28to
0:03:29deep neural networks
0:03:31with
0:03:31high time-frequency resolution features
0:03:35which have been shown to achieve higher accuracy
0:03:39following this
0:03:41conventional methods
0:03:43choose carefully which type of acoustic feature to use
0:03:47as it is essential in speech processing tasks
0:03:51including spoofing detection
0:03:53however
0:03:54utilizing only one type of acoustic feature may not be sufficient to
0:03:59detect
0:04:00spoofing artifacts when facing unseen
0:04:04spoofing speech
0:04:10as we know
0:04:12from a given
0:04:14audio segment
0:04:16multiple acoustic feature maps can be
0:04:20extracted
0:04:21such as mfcc cqcc
0:04:23lfcc fft
0:04:26cqt and so on
0:04:28it may be difficult to determine
0:04:31which type of acoustic feature map
0:04:34will be the best for asv spoofing detection
0:04:37even for one type
0:04:39of acoustic feature
0:04:42different settings
0:04:44used in the extraction will result in obtaining different information
0:04:49for example
0:04:50fft spectrograms extracted
0:04:53with different window lengths contain spectral information at different
0:04:58resolutions
0:04:59in
0:04:59the higher and lower frequency bands
0:05:04a shorter window
0:05:05will lead to high resolution in terms of time
0:05:09and
0:05:10lower resolution in terms of frequency
0:05:14on the contrary a longer window
0:05:20will extract an fft spectrogram
0:05:23which has
0:05:24a higher resolution in frequency and a lower resolution in time
0:05:30the trade-off between time and frequency resolution makes it difficult to extract
0:05:37sufficient information with one fft
0:05:39spectrogram alone
0:05:41therefore
0:05:42the use of multiple acoustic feature maps
0:05:46is needed to alleviate the problem
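The trade-off described here can be made concrete with a little arithmetic. The sketch below is only illustrative: the 16 kHz sampling rate and the 512-point FFT are assumptions that the talk does not state (512 points would be consistent with the 257 frequency bins mentioned later), and `window_tradeoff` and `n_frequency_bins` are hypothetical helper names.

```python
# Assumptions (not stated in the talk): 16 kHz audio and a 512-point FFT.
SAMPLE_RATE_HZ = 16_000
FFT_SIZE = 512


def window_tradeoff(window_ms, sample_rate_hz=SAMPLE_RATE_HZ):
    """Return (samples per window, approximate frequency resolution in Hz)."""
    n_samples = round(sample_rate_hz * window_ms / 1000)
    # A window lasting T seconds cannot separate tones closer than about 1/T Hz.
    freq_resolution_hz = 1000.0 / window_ms
    return n_samples, freq_resolution_hz


def n_frequency_bins(fft_size=FFT_SIZE):
    """Number of non-negative frequency bins of a real-input FFT."""
    return fft_size // 2 + 1


for window_ms in (18, 25, 30):
    samples, resolution = window_tradeoff(window_ms)
    print(f"{window_ms} ms window: {samples} samples, "
          f"about {resolution:.1f} Hz resolution, {n_frequency_bins()} bins")
```

Under these assumptions every window is shorter than the FFT size, so zero-padding each frame to the same FFT length would give all three spectrograms the same number of frequency bins, which is what allows them to be stacked.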
0:05:50the question is
0:05:52how to use multiple acoustic feature maps together
0:06:01there are feature fusion and score fusion
0:06:03score fusion is a kind of late fusion which can be used
0:06:07to fuse
0:06:09scores produced by systems
0:06:12using
0:06:14individual feature maps
0:06:17however score fusion can be computationally costly since it needs to train neural
0:06:23networks multiple times
0:06:26in addition fusion weights need to be determined in advance
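As a rough illustration of late fusion, the sketch below combines per-system scores with fixed weights. A linear weighted sum is an assumption here, since the talk does not say which fusion rule was used, and `fuse_scores` is a hypothetical helper name.

```python
def fuse_scores(scores, weights):
    """Linear late fusion: weighted sum of one score per system.

    Assumption: the weights were chosen in advance, for example on a
    development set. The talk does not specify the fusion rule.
    """
    if len(scores) != len(weights):
        raise ValueError("need exactly one weight per system score")
    return sum(w * s for w, s in zip(weights, scores))


# Two systems, each trained on a different feature map, weighted equally.
print(fuse_scores([1.0, 3.0], [0.5, 0.5]))
```

Each entry in `scores` comes from a separately trained network, which is where the extra training and inference cost of score fusion comes from.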
0:06:31as for feature fusion
0:06:33there is feature map concatenation
0:06:37along a single dimension such as the time or frequency dimension
0:06:42there is also linear interpolation too
0:06:45in addition feature fusion followed by
0:06:51neural networks has the advantage of automatic
0:06:55feature selection so we chose feature fusion over the score fusion
0:07:08we propose
0:07:10multiresolution feature maps
0:07:12which stack
0:07:14multiple feature maps
0:07:15of the same dimensionality
0:07:18into
0:07:20a three dimensional input for deep neural networks
0:07:25it is suitable for
0:07:27two d cnns
0:07:29the modification of the neural network is also very simple
0:07:33one only needs to
0:07:35modify the first layer of the neural network from dimension
0:07:42one times
0:07:44c one where c one here is the output dimension of the first layer of the neural network
0:07:49from this dimension to
0:07:54the number of channels
0:07:57which means the number of feature maps
0:08:01times
0:08:01the output dimension
0:08:05so the proposed method makes
0:08:07it possible to extract more information
0:08:10from input signals
0:08:13with relatively little
0:08:15additional cost
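A minimal sketch of the stacking step and of why only the first layer grows. The 7 x 7 kernel and 64 output channels in the usage lines are illustrative assumptions, not values given in the talk, and both function names are hypothetical.

```python
def stack_feature_maps(maps):
    """Stack equally sized 2-D maps into a [channel][frequency][time] input."""
    shape = (len(maps[0]), len(maps[0][0]))
    for feature_map in maps:
        if (len(feature_map), len(feature_map[0])) != shape:
            raise ValueError("all feature maps must have the same dimensionality")
    return [[row[:] for row in feature_map] for feature_map in maps]


def first_conv_weights(in_channels, out_channels, kernel_h, kernel_w):
    """Weight count of a 2-D convolution layer (bias terms ignored)."""
    return in_channels * out_channels * kernel_h * kernel_w


# Assumed first layer: 7 x 7 kernels and C1 = 64 output channels.
single = first_conv_weights(1, 64, 7, 7)
triple = first_conv_weights(3, 64, 7, 7)
print(single, triple, triple - single)
```

Only the first convolution sees the extra input channels, so every later layer keeps its original size. This is the reason the added cost stays small relative to the whole network.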
0:08:25the experimental data used in this
0:08:28study was the physical access subset
0:08:31of the asvspoof two thousand nineteen challenge
0:08:35it contains about fifty thousand spoofed
0:08:37and about
0:08:39five thousand
0:08:40bona fide
0:08:41utterances in the training partition
0:08:43as well as twenty four thousand spoofed
0:08:46and
0:08:47five thousand bona fide
0:08:49utterances in the development
0:08:51partition
0:08:52the development data were in the conditions seen
0:08:56in the training data
0:08:59it contains
0:09:01twenty seven recording acoustic configurations
0:09:04and nine replay configurations
0:09:07the evaluation partition
0:09:09contains
0:09:09a much larger amount of data
0:09:11and includes
0:09:13spoofed utterances of unseen conditions as well
0:09:19this is
0:09:20the replay scenario in asvspoof two thousand nineteen
0:09:25replay attacks
0:09:27are simulated
0:09:29within an acoustic environment
0:09:32the talker here
0:09:34speaks to the asv
0:09:37system
0:09:38and an attacker
0:09:41records the speech
0:09:44at defined distances
0:09:46and then
0:09:47he or she will replay
0:09:49the speech back
0:09:51to the asv system
0:09:59in our experiments we used fft spectrograms with window lengths of eighteen
0:10:05twenty five and thirty
0:10:07milliseconds
0:10:08the spectrogram dimension was trimmed to two hundred fifty seven times four hundred
0:10:15we used unified
0:10:17fft feature maps
0:10:21since the lengths of evaluation utterances are usually not known beforehand
0:10:28we first extended utterances
0:10:31to the smallest multiple of four hundred
0:10:36frames
0:10:37and then cut them into
0:10:40multiple four hundred frame
0:10:41segments with
0:10:43two hundred
0:10:45frame overlap
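The extend-then-cut procedure can be sketched as follows. How the utterances were extended is not stated in the talk. Repeating the utterance from its beginning is an assumption made here, and `segment_frames` is a hypothetical helper name.

```python
def segment_frames(frames, seg_len=400, overlap=200):
    """Extend a frame sequence to a multiple of seg_len, then cut it up.

    Assumption: the extension repeats the utterance from its start.
    Returns overlapping segments of exactly seg_len frames each.
    """
    if not frames:
        raise ValueError("utterance has no frames")
    hop = seg_len - overlap
    # Smallest multiple of seg_len that is at least len(frames).
    target = -(-len(frames) // seg_len) * seg_len
    padded = [frames[i % len(frames)] for i in range(target)]
    return [padded[start:start + seg_len]
            for start in range(0, target - seg_len + 1, hop)]


segments = segment_frames(list(range(500)))
print(len(segments), [len(s) for s in segments])
```

With a 500-frame utterance the sequence is extended to 800 frames and cut at offsets 0, 200 and 400, so every segment has the fixed 400-frame width that the network expects.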
0:10:52experiments were carried out
0:10:54using the following three cnn variants
0:10:57resnet eighteen senet fifty
0:11:00and the light cnn
0:11:02all the networks
0:11:03have
0:11:04ten output nodes
0:11:08one stands for
0:11:10the bona fide condition
0:11:12also called genuine
0:11:13and the other nine
0:11:15represent the nine replay
0:11:17configurations
0:11:20the log probability of the bona fide
0:11:24class
0:11:25is used as the spoofing detection score to make the final decision
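The scoring rule can be sketched with a numerically stable log-softmax over the ten outputs. Placing the bona fide class at index 0 is an assumption for illustration, since the talk does not give the node ordering, and `bona_fide_log_prob` is a hypothetical helper name.

```python
import math

BONA_FIDE_INDEX = 0  # assumption: the talk does not give the node ordering


def bona_fide_log_prob(logits, bona_fide_index=BONA_FIDE_INDEX):
    """Log-softmax value of the bona fide class, used as the detection score."""
    peak = max(logits)
    # Subtract the maximum before exponentiating to avoid overflow.
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return logits[bona_fide_index] - log_norm


# Ten output nodes: one bona fide class and nine replay configurations.
logits = [4.0] + [0.0] * 9
print(bona_fide_log_prob(logits))
```

A higher score means the segment looks more like bona fide speech. The final accept or reject decision would compare this score against a threshold chosen separately.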
0:11:37the model parameters and architectures of resnet eighteen and senet fifty are
0:11:42shown in this table
0:11:44the basic and the bottleneck residual blocks
0:11:48are described in the original resnet paper
0:11:52resnet eighteen and senet
0:11:54fifty have been shown in
0:11:56this paper
0:11:58to be effective for replay
0:12:01spoofing detection
0:12:03that is why we chose these two networks
0:12:06light cnn is a kind of cnn with max feature map
0:12:11activation
0:12:12the use of mfm
0:12:14allows us to reduce the number of channels
0:12:18by half
0:12:19that is why it is called light cnn
0:12:22light cnn worked the best in asvspoof two thousand seventeen
0:12:28and also ranked highly
0:12:30in the asvspoof two thousand
0:12:32nineteen challenge
0:12:34that's why we also include
0:12:36light cnn in our study
0:12:40we first compare spoofing detection equal error rates
0:12:44when using single feature maps of different resolutions
0:12:49in these three
0:12:50networks
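For reference, the equal error rate is the operating point at which the false acceptance rate and the false rejection rate coincide. The sketch below is a simple threshold sweep, not the official challenge scoring tool. The example scores are made up and `equal_error_rate` is a hypothetical helper name.

```python
def equal_error_rate(bona_fide_scores, spoof_scores):
    """Approximate EER in [0, 1]; higher scores mean more likely bona fide."""
    if not bona_fide_scores or not spoof_scores:
        raise ValueError("both score lists must be non-empty")
    best_gap, eer = None, None
    for threshold in sorted(set(bona_fide_scores) | set(spoof_scores)):
        # False acceptance: spoofed trials scoring at or above the threshold.
        far = sum(s >= threshold for s in spoof_scores) / len(spoof_scores)
        # False rejection: bona fide trials scoring below the threshold.
        frr = sum(s < threshold for s in bona_fide_scores) / len(bona_fide_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Made-up scores, for illustration only.
print(equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.1, 0.2, 0.4, 0.75]))
```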
0:12:52for different neural network architectures
0:12:54the respective
0:12:56best performances were obtained with
0:13:00different feature maps
0:13:08so here fft eighteen twenty five and thirty represent fft spectrograms extracted with
0:13:17these window lengths
0:13:22the fft spectrograms
0:13:25extracted with twenty five
0:13:27millisecond windows
0:13:29fft twenty five
0:13:31and those extracted with
0:13:33thirty millisecond windows
0:13:35give similar results in resnet eighteen
0:13:39and they are significantly better
0:13:42than those with eighteen milliseconds
0:13:47and for senet fifty however
0:13:50fft eighteen and twenty five gave
0:13:54similar results
0:13:56while
0:13:57those with thirty milliseconds were the best
0:14:03for lcnn the fft spectrograms of twenty five milliseconds
0:14:08gave the best performance
0:14:11so there may not be one single optimal fft configuration for
0:14:17different neural network structures
0:14:23next we applied the proposed multi resolution feature maps
0:14:27as input to the cnns
0:14:31we also show at the same time the results of score fusion
0:14:37which is a straightforward method when multiple feature maps
0:14:41are available
0:14:46here are the results on the development set
0:14:50where the replay conditions are seen
0:14:55the grey bars here represent
0:14:58the performance of single feature maps
0:15:01and the blue bars represent score fusion and the yellow bars represent the proposed method
0:15:09we can see from the figure
0:15:10the use of two resolution feature maps
0:15:16here
0:15:17improved
0:15:18by thirty seven forty four and thirty percent for these three cnns
0:15:25respectively
0:15:27and
0:15:28these results are compared to the better
0:15:32one of the two single feature systems
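The relative improvement quoted here can be computed as below, taking the better (lower) single-feature equal error rate as the baseline, as the talk describes. The numbers in the usage line are made up and `relative_eer_reduction` is a hypothetical helper name.

```python
def relative_eer_reduction(single_map_eers, multi_res_eer):
    """Percent EER reduction relative to the best single-feature system."""
    baseline = min(single_map_eers)  # the better of the single-map systems
    return 100.0 * (baseline - multi_res_eer) / baseline


# Made-up EERs in percent, for illustration only.
print(relative_eer_reduction([2.0, 1.0], 0.5))
```

Using the stronger single-map system as the baseline makes the comparison conservative, because the proposed input has to beat the best of its own components.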
0:15:36then we also have three resolution
0:15:39results here
0:15:42which show the best performance for
0:15:45resnet eighteen and senet fifty
0:15:49which
0:15:50are fifty two percent and fifty seven percent in
0:15:54error reduction
0:15:56for lcnn
0:15:58three resolution input shows less improvement
0:16:01we think the reason may be that lcnn has a much smaller number of parameters
0:16:07which we will show you on a later page
0:16:11we can also see that score fusion
0:16:13achieved
0:16:14better
0:16:16results compared to the single feature map systems
0:16:20clearly
0:16:21in all the cases
0:16:22the proposed method the yellow
0:16:26bars
0:16:26is significantly better than score fusion which is blue
0:16:36the same trend has been seen in
0:16:40the evaluation set
0:16:41where there are unseen replay
0:16:44conditions
0:16:48the improvement compared to the seen condition is less but it is still consistently better
0:16:55than score fusion and also of course the original single feature map systems
0:17:07then we investigated
0:17:09the computational cost of the proposed method
0:17:12and the score fusion
0:17:16using the proposed two resolution feature maps
0:17:19only resulted in a parameter number increase
0:17:23the s and zero point two
0:17:25the of one two percent
0:17:27while the increase from the use of the best
0:17:30three resolution feature maps
0:17:32was roughly zero point two two percent
0:17:40and score fusion
0:17:42as is well known requires training two or more systems
0:17:46and then fusing the scores at the score level
0:17:51this did not improve the performance
0:17:53much in our experiments
0:17:55but it doubled
0:17:57or even tripled the number of parameters
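The cost comparison can be written out as a small calculation. The model size and the first-layer size in the usage lines are placeholder values rather than figures from the talk, and both function names are hypothetical.

```python
def multi_resolution_params(single_model_params, first_layer_params_per_channel, n_maps):
    """One network whose first layer takes n_maps input channels."""
    return single_model_params + (n_maps - 1) * first_layer_params_per_channel


def score_fusion_params(single_model_params, n_maps):
    """One full network per feature map, with scores fused afterwards."""
    return n_maps * single_model_params


# Placeholder sizes: a one-million-parameter network whose first layer
# holds 3136 weights per input channel.
for n_maps in (2, 3):
    proposed = multi_resolution_params(1_000_000, 3136, n_maps)
    fusion = score_fusion_params(1_000_000, n_maps)
    print(n_maps, proposed, fusion, round(proposed / fusion, 3))
```

Under these placeholder sizes the stacked-input network needs roughly a half or a third of the parameters of the fused systems, depending on how many feature maps are combined.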
0:18:03so in conclusion
0:18:05our proposed method
0:18:06will be more helpful in practical use
0:18:12now i would like to summarise this presentation
0:18:16we proposed multi-resolution feature maps
0:18:19which stack multiple feature maps into a three dimensional input
0:18:24followed by cnns
0:18:25thus optimal resolutions will be automatically selected
0:18:31it is proposed to alleviate the problem
0:18:34that
0:18:34feature maps commonly used in anti-spoofing networks are likely to be insufficient
0:18:40for building
0:18:41discriminative representations of audio segments
0:18:45as they are often extracted
0:18:47by fixed length windows
0:18:50the effectiveness of the proposed method was confirmed on the asvspoof
0:18:55two thousand nineteen challenge
0:18:57physical access
0:18:58with three cnn variants resnet eighteen senet fifty and lcnn
0:19:05experiments showed that two and three resolution feature maps
0:19:09achieved on average thirty seven and forty five percent equal error rate reduction
0:19:15it was significantly better
0:19:17than score fusion
0:19:19and also it cost only one third
0:19:22to a half
0:19:22in terms of
0:19:24computational cost
0:19:27for future work
0:19:28we would like to introduce attention mechanism to make
0:19:32better use of multi-resolution feature maps
0:19:35we also would like to extend the proposed method with other feature extractors
0:19:42that's all for my presentation thank you for watching please let me know if you
0:19:47have any questions