hi everyone thank you for joining my presentation
i am qiongqiong wang from nec corporation
today i would like to present my paper
using multiresolution feature maps
with
convolutional neural networks for anti-spoofing in asv
here are the contents first i would like to give the introduction and a review
of multiple
feature maps popular in asv spoofing detection
next
i will introduce our proposed
multiresolution feature map
and how it is used for feature extraction
with neural networks
we will give three popular cnn variants as examples
resnet eighteen senet fifty and lc-nn
and we show the effectiveness of the proposed method
in experiments
and also give an analysis in terms of computational cost
finally i'd like to summarise this presentation
automatic speaker verification asv
offers flexible biometric authentication and has been increasingly employed in such telephone based services as
telephone banking
call centers and so on
asv's reliability depends on its resilience to spoofing
as is true of any biometric technology
therefore with the increase of
use
of asv spoofing detection in speech is also getting more attention
there are two scenarios of spoofing attacks
logical access and physical access
logical access includes text-to-speech synthesis and voice conversion
and physical access is mainly replay where the target identity's voice is recorded and
replayed
replay is very easy
to implement and as you may know it is hard to detect
the asvspoof challenges
from two thousand fifteen to two thousand nineteen have been driving efforts on anti
spoofing countermeasures
and have resulted in significant findings
asvspoof two thousand fifteen
focused on spoofing attacks generated with different speech synthesis
and voice conversion
asvspoof two thousand seventeen focused on
replay attacks
then
asvspoof
two thousand nineteen
addressed all types
of
spoofing
in the previous two challenges
and further extended the data sets
in terms of spoofing technology
number of
conditions and volume of data
with a lot of research done with the challenges
the trend has been
shifting from gmms with features like mfccs and cqccs at the
beginning
to
deep neural networks
with
high time frequency resolution features
which have been proved to achieve higher accuracy
following this
conventional methods
choose carefully which type of acoustic feature to use
as it is essential in speech processing tasks
including spoofing detection
however
utilizing only one type of acoustic feature may not be sufficient to
detect
spoofing robustly when facing unseen
spoofing speech
as we know
from a given
audio segment
multiple acoustic feature maps can be
extracted
such as mfcc cqcc
lfcc fft
spectrogram and so on
it may be difficult to determine
which type of acoustic feature map
will be the best for spoofing detection
even for one type
of acoustic feature
different settings
used in the extraction will result in obtaining different information
for example
fft spectrograms extracted
with different window lengths contain spectral information at different
resolutions
in
higher and lower frequency bands
a shorter window
will lead to higher resolution in terms of time
and
lower resolution in terms of frequency
on the contrary a longer window
will extract an fft spectrogram
which has
higher resolution in frequency and lower resolution in time
the trade-off between time and frequency resolution makes it difficult to extract
sufficient information with one fft
spectrogram alone
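The trade-off described here can be made concrete with a little arithmetic. The sketch below is only illustrative: it assumes a 16 kHz sampling rate (the talk does not state one) and uses the rule of thumb that frequency resolution is roughly the reciprocal of the window duration.

```python
# Assumed sampling rate; the talk does not state one.
SAMPLE_RATE_HZ = 16_000


def resolutions(window_ms):
    """Return (window length in samples, approximate frequency resolution in Hz)."""
    window_s = window_ms / 1000.0
    n_samples = round(window_s * SAMPLE_RATE_HZ)
    # Rule of thumb: two tones closer than about 1 / window duration
    # cannot be separated in the spectrum.
    freq_res_hz = 1.0 / window_s
    return n_samples, freq_res_hz


for ms in (18, 25, 30):
    n, df = resolutions(ms)
    print(f"{ms} ms window: {n} samples, about {df:.1f} Hz frequency resolution")
```

A shorter window localizes events better in time but smears them over a wider frequency range, and a longer window does the opposite.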
therefore
the use of multiple acoustic feature maps
is needed to alleviate the problem
the question is
how to use multiple acoustic feature maps together
by feature fusion or score fusion
score fusion is a kind of late fusion which can be used
to fuse
scores produced from systems
using
individual feature maps
however score fusion can be computationally costly since it needs to train neural
networks multiple times
in addition fusion weights need to be determined in advance
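As a rough illustration of late fusion, the sketch below combines per-trial scores from separately trained systems with a weighted sum. The function name, the example scores, and the equal weights are hypothetical. The talk only says that the weights must be fixed in advance, not how they were chosen.

```python
def fuse_scores(system_scores, weights):
    """Weighted sum of per-trial scores from several systems.

    system_scores: one list of scores per system, all of the same length.
    weights: one weight per system, chosen before evaluation.
    """
    if len(system_scores) != len(weights):
        raise ValueError("need exactly one weight per system")
    n_trials = len(system_scores[0])
    if any(len(scores) != n_trials for scores in system_scores):
        raise ValueError("all systems must score the same trials")
    return [
        sum(w * scores[i] for w, scores in zip(weights, system_scores))
        for i in range(n_trials)
    ]


# Hypothetical scores from two systems trained on different spectrograms.
scores_a = [-0.2, -3.1, -0.9]
scores_b = [-0.4, -2.5, -1.5]
fused = fuse_scores([scores_a, scores_b], [0.5, 0.5])
```

Each entry in `system_scores` requires its own trained network, which is where the extra cost of score fusion comes from.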
as for feature fusion
there is feature map concatenation
along a single dimension such as the time or frequency dimension
there is also linear interpolation
or feature addition
feature fusion followed with
neural networks has the advantage of automatic
feature selection so we chose feature fusion over score fusion
we proposed
multiresolution feature maps
which stack
multiple feature maps
of the same dimensionality
into
a three dimensional input for deep neural networks
it is suitable for
two d cnns
the modification of the neural network is also very simple
it only needs to
modify the first layer of the neural network from dimension
one times
c one where c one is the output dimension of the first layer of the neural network
from this dimension to
the number of channels
which means the number of feature maps
times
the output dimension
so the proposed method makes
it possible to extract more information
from input signals
with relatively little
additional cost
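A minimal sketch of the stacking idea and of why only the first layer grows, using NumPy. The 3x3 kernel and the value of 64 for c one are assumptions for illustration only. The talk gives neither.

```python
import numpy as np


def stack_feature_maps(maps):
    """Stack same-sized 2-D feature maps into one (channels, freq, time) array."""
    shapes = {m.shape for m in maps}
    if len(shapes) != 1:
        raise ValueError("all feature maps must have the same dimensionality")
    return np.stack(maps, axis=0)


def first_conv_weights(in_channels, out_channels, kernel=(3, 3)):
    """Number of weights in a 2-D convolution layer, ignoring biases."""
    return in_channels * out_channels * kernel[0] * kernel[1]


# Three placeholder maps standing in for spectrograms of different resolutions.
rng = np.random.default_rng(0)
maps = [rng.standard_normal((257, 400)) for _ in range(3)]
stacked = stack_feature_maps(maps)  # shape (3, 257, 400)

c1 = 64  # assumed output dimension of the first layer
extra = first_conv_weights(3, c1) - first_conv_weights(1, c1)
print(stacked.shape, extra)
```

Every later layer sees the same c one channels as before, so the extra weights are confined to the first convolution.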
the experimental data used in this
study was the physical access subset
of the asvspoof two thousand nineteen challenge
it contains about fifty thousand spoofed
and
five thousand
bona fide
utterances in the training partition
as well as twenty four thousand spoofed
and
five thousand bona fide
utterances in the development
partition
the development data set was in the same conditions as
the training data
it contains
twenty seven acoustic configurations
and nine replay configurations
the evaluation set
contains
a much larger size of data
and includes
spoofed utterances of unseen conditions as well
this is
the replay scenario of asvspoof two thousand nineteen
replay attacks
are simulated
within an acoustic environment
the talker here
speaks to the asv
system
and the attacker
records the speech
at defined distances
and then
he or she will replay
the speech back
to the asv system
in our experiments we used fft spectrograms and we used window lengths of eighteen
twenty five and thirty
milliseconds
the spectrogram's dimension was trimmed to two hundred fifty seven times four hundred
we used unified
fft feature maps
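One way to obtain spectrograms of identical size from different window lengths is to share the FFT size and the hop between them. The sketch below assumes a 16 kHz sampling rate, a 512-point FFT (which yields 257 frequency bins), a 10 ms hop, a Hamming window, and log-magnitude scaling. None of these settings is stated in the talk.

```python
import numpy as np

SAMPLE_RATE = 16_000  # assumed
N_FFT = 512           # assumed; 512 // 2 + 1 = 257 frequency bins
HOP = 160             # assumed 10 ms hop shared by all window lengths


def log_spectrogram(signal, window_ms):
    """Log-magnitude spectrogram of shape (257, n_frames)."""
    win_len = int(round(window_ms * SAMPLE_RATE / 1000))
    window = np.hamming(win_len)
    n_frames = 1 + (len(signal) - win_len) // HOP
    frames = np.stack(
        [signal[i * HOP : i * HOP + win_len] * window for i in range(n_frames)]
    )
    # rfft zero-pads every frame to N_FFT samples, so all window lengths
    # produce the same number of frequency bins.
    spectrum = np.fft.rfft(frames, n=N_FFT, axis=1)
    return np.log(np.abs(spectrum) + 1e-8).T


def multi_resolution(signal, window_ms_list=(18, 25, 30)):
    """Stack spectrograms of several window lengths as channels."""
    maps = [log_spectrogram(signal, ms) for ms in window_ms_list]
    # Longer windows fit slightly fewer frames; trim to the common length.
    n_frames = min(m.shape[1] for m in maps)
    return np.stack([m[:, :n_frames] for m in maps], axis=0)


one_second = np.random.default_rng(0).standard_normal(SAMPLE_RATE)
features = multi_resolution(one_second)  # (3, 257, n_frames)
```

Because the hop is shared, frame i in every channel starts at the same sample, so the channels stay aligned in time.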
since the lengths of evaluation utterances are usually not known beforehand
we first extended utterances
to the minimum multiple of four hundred
frames
and then cut them into
multiple four hundred frame
segments with
two hundred
frame overlap
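The segmentation step could look roughly like the following sketch. The talk does not say how utterances are extended. Repeating the utterance from its start is an assumption made here for illustration.

```python
import numpy as np


def segment(feature_map, seg_len=400, hop=200):
    """Cut a (freq, time) map into overlapping (freq, seg_len) segments."""
    n_frames = feature_map.shape[1]
    if n_frames == 0:
        raise ValueError("empty feature map")
    # Smallest multiple of seg_len that is >= n_frames.
    target = -(-n_frames // seg_len) * seg_len
    # Assumption: extend by repeating the utterance from its start.
    repeats = -(-target // n_frames)
    extended = np.tile(feature_map, (1, repeats))[:, :target]
    starts = range(0, target - seg_len + 1, hop)
    return np.stack([extended[:, s : s + seg_len] for s in starts])


# A 500-frame map is extended to 800 frames and cut at frames 0, 200 and 400.
frame_index = np.tile(np.arange(500), (257, 1))
segments = segment(frame_index)
print(segments.shape)
```

With a 200-frame hop, consecutive segments share half of their frames.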
experiments were carried out
using the following three cnn variants
resnet eighteen senet fifty
and the light cnn
all the networks
had
ten output nodes
one stands for
the bona fide condition
also called genuine
and the other nine
represent nine replay
configurations
the log probability of the bona fide
class
is used as the spoofing detection score to make the final decision
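The scoring rule can be sketched as a log-softmax over the ten outputs, keeping only the bona fide entry. Placing the bona fide class at index 0 is an assumption. The talk does not give the node ordering.

```python
import math

BONA_FIDE_INDEX = 0  # assumed position of the bona fide output node


def bona_fide_score(logits):
    """Log probability of the bona fide class from ten raw network outputs."""
    if len(logits) != 10:
        raise ValueError("expected ten outputs: one bona fide, nine replay")
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return logits[BONA_FIDE_INDEX] - log_norm


uniform_score = bona_fide_score([0.0] * 10)
confident_score = bona_fide_score([8.0] + [0.0] * 9)
```

The score is always at most zero, and a value closer to zero means the network considers the utterance more likely to be bona fide.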
the model parameters and architectures of resnet eighteen and senet fifty are
shown in this table
the basic and the bottleneck residual blocks
are described in the original resnet paper
resnet eighteen and senet
fifty have been shown in
this paper
to be effective for replay
spoofing detection
that is why we chose these two networks
lc-nn is a kind of cnn with max feature map
activation
the use of mfm
allowed us to reduce the number of channels
by half
that is why it is called light cnn
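The max feature map activation can be sketched in a few lines of NumPy: the channels are split into two halves and the element-wise maximum is kept, which halves the channel count. The channels-first layout is an assumption of this sketch.

```python
import numpy as np


def max_feature_map(x):
    """MFM activation for x of shape (channels, height, width), channels even."""
    if x.shape[0] % 2 != 0:
        raise ValueError("MFM needs an even number of channels")
    first_half, second_half = np.split(x, 2, axis=0)
    # Element-wise competition between paired channels.
    return np.maximum(first_half, second_half)


activations = np.arange(16, dtype=float).reshape(4, 2, 2)
halved = max_feature_map(activations)  # shape (2, 2, 2)
```

Unlike a rectifier, the output is always one of the two inputs, so no value is clipped to zero.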
lc-nn worked the best in asvspoof two thousand seventeen
and also ranked highly
in the asvspoof two thousand
nineteen challenge
that's why we also include
lc-nn in our study
we first compare spoofing detection equal error rates
when using single feature maps of different resolutions
in the three
networks
for different neural network architectures
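For reference, the equal error rate is the operating point at which the false rejection rate of bona fide speech equals the false acceptance rate of spoofed speech. The sketch below is a simple threshold sweep, not the official challenge scoring tool, and it assumes that a higher score means more bona fide.

```python
import numpy as np


def equal_error_rate(bona_fide_scores, spoof_scores):
    """Approximate EER by sweeping the threshold over all observed scores."""
    bona = np.asarray(bona_fide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    thresholds = np.sort(np.concatenate([bona, spoof]))
    # A trial is accepted as bona fide when its score is >= the threshold.
    frr = np.array([(bona < t).mean() for t in thresholds])
    far = np.array([(spoof >= t).mean() for t in thresholds])
    closest = np.argmin(np.abs(frr - far))
    return float((frr[closest] + far[closest]) / 2)


separable = equal_error_rate([2.0, 3.0, 4.0], [-1.0, 0.0, 1.0])
overlapping = equal_error_rate([0.0, 1.0], [0.0, 1.0])
```

On small score sets the two error rates may never meet exactly, so the sketch averages them at the closest threshold.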
the respective
best performances were obtained using
different feature maps
here fft eighteen twenty five and thirty represent fft spectrograms extracted with
these window lengths
the fft spectrograms
extracted with twenty five
milliseconds
fft twenty five
and those extracted with
thirty milliseconds
give similar results in resnet eighteen
and they are significantly better
than those with eighteen milliseconds
and for senet fifty however
fft eighteen and twenty five gave
similar results
while
those with thirty milliseconds were the best
for lc-nn the fft spectrograms of twenty five milliseconds
gave the best performance
so there may not be one single optimal fft configuration for
different neural network structures
next we applied the proposed multi resolution feature maps
as input to the cnns
we also show at the same time the results of score fusion
which is a straightforward method when multiple feature maps
are available
here are the results on the development set
where the replay conditions are seen
the grey bars here represent
performance for single feature maps
and blue bars represent score fusion and the yellow bars represent the proposed method
we can see from the figure
the use of two resolution feature maps
here
improves
the performance by [unclear] forty four and thirty percent for the three cnns
respectively
and these results are compared to the better
one of the two single feature systems
then we also have three resolution
results here
which show the best performance for
resnet eighteen and senet fifty
which
are fifty two percent and fifty seven percent in
error reduction
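The error reductions quoted here are relative to the better of the single feature map systems. A small sketch of that calculation follows. The equal error rates in it are hypothetical placeholders, not results from the talk.

```python
def relative_reduction(baseline_eer, new_eer):
    """Relative error reduction in percent with respect to a baseline EER."""
    if baseline_eer <= 0:
        raise ValueError("baseline EER must be positive")
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


# Hypothetical EERs in percent, chosen only to illustrate the arithmetic.
single_feature_eers = [2.0, 1.6]
multi_resolution_eer = 0.8
reduction = relative_reduction(min(single_feature_eers), multi_resolution_eer)
```

Using the best single system as the baseline makes the comparison conservative, since any weaker baseline would show a larger reduction.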
for lc-nn
three resolution input shows less improvement
we think the reason may be that lc-nn has a much smaller number of parameters
which we will show on a later page
we can also see score fusion
achieved
better
results compared to the single feature map systems
clearly
in all the cases
the proposed method yellow
bars
is significantly better than that of the score fusion which is blue
the same trend has been seen in
the evaluation set
where there are unseen replay
conditions
the improvement compared to the seen condition is less but is still consistently better
than score fusion and also of course the original single feature map systems
then we investigated
the computational cost of the proposed method
and the score fusion
using the proposed two resolution feature maps
only resulted in a parameter number increase
of less than zero point two
or zero point one two percent
while the increase with the use of the best
three resolution feature maps
was roughly zero point two two percent
and score fusion
as is well known trains two or more systems
and then fuses the scores at score level
this did not improve the performance
much in our experiments
but it doubled
or even tripled the number of parameters
so in conclusion
our proposed method
will be more helpful in practical use
now i would like to summarise this presentation
we proposed multi-resolution feature maps
which stack multiple feature maps into a three dimensional input
followed with cnns
so optimal resolutions will be automatically selected
it is proposed to alleviate the problem
that
feature maps commonly used in anti-spoofing networks are likely to be insufficient
for building
discriminative representations of audio segments
as they are often extracted
by fixed length windows
the effectiveness of the proposed method was confirmed on the asvspoof
two thousand nineteen challenge
physical access
with three cnn variants resnet eighteen senet fifty and lc-nn
experiments showed two and three resolution feature maps
achieved on average thirty seven and forty five percent equal error rate reduction
it was significantly better
than score fusion
and also it cost only one third
to a half
in terms of
computational cost
for future work
we would like to introduce attention mechanisms to make
better use of multi-resolution feature maps
we also would like to extend the proposed method with other feature extractors
that's all for my presentation thank you for watching please let me know if you
have any questions