hi everyone thank you for joining my presentation

i am qiongqiong wang from nec corporation

today i would like to present my paper

using multiresolution feature maps

with

convolutional neural networks for anti-spoofing in asv

here are the contents first i would like to give the introduction and the review

of multiple

feature maps popular in asv spoofing detection

next

i will introduce our proposed

multiresolution feature map

and how it is used with feature extraction

and with neural networks

we'll give three popular cnn variants as examples

resnet eighteen senet fifty and lc-nn

and we show the effectiveness of the proposed method

in experiments

and also give an analysis in terms of computational cost

finally i'd like to summarise this presentation

automatic speaker verification asv

offers flexible biometric authentication and has been increasingly employed in such telephone based services as

telephone banking

[unclear] call centers and so on

asv's reliability depends on its resilience to spoofing

as is true of any biometric technology

therefore with the increase of

use

of asv spoofing detection in speech is also getting more attention

there are two scenarios of spoofing attacks

logical access and physical access

logical access includes text-to-speech synthesis and voice conversion

physical access is mainly replay where the target identity's voice is recorded and

replayed

replay is very easy

to implement and as you may know hard to detect

asvspoof challenges

two thousand fifteen to two thousand nineteen have been driving efforts on anti-spoofing

countermeasures

and have resulted in significant findings

asvspoof fifteen

focused on spoofing attacks generated with different speech synthesis

and voice conversion

asvspoof seventeen focused on

replay attacks

the latest

asvspoof

two thousand nineteen

addressed all types

of

spoofing

in the previous two challenges

and further extended the data sets

in terms of spoofing technology

number of

conditions and volume of data

with a lot of research done along with the challenges

the trend has been

shifting from gmms with features like mfccs and cqccs in the

beginning

to

deep neural networks

with

high time-frequency resolution features

which have been proved to achieve higher accuracy

following this

conventional methods

try to choose carefully which type of acoustic feature to use

as it is essential in speech processing tasks

including spoofing detection

however

utilising only one type of acoustic feature may not be sufficient to

detect

spoofing attacks robustly when facing unseen

spoofing speech

as we know

from a given

audio segment

multiple acoustic feature maps can be

extracted

such as mfcc cqcc

lfcc fft

cqt and so on

it may be difficult to determine

which type of acoustic feature map

will be the best for asv spoofing detection

even for one type

of acoustic feature

different settings

used in the extraction will result in obtaining different information

for example

fft spectrograms extracted

with different window lengths contain spectral information with different

resolutions

at

higher and lower frequency bands

a shorter window

will lead to high resolution in terms of time

and

lower resolution in terms of frequency

on the contrary a longer window

will extract an fft spectrogram

which has

a higher resolution in frequency and a lower resolution in time

the trade-off between time and frequency resolution makes it difficult to extract

sufficient information with one fft

spectrogram alone

therefore

the use of multiple acoustic feature maps

is needed to alleviate the problem
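
The trade-off just described can be made concrete with a little arithmetic. The sketch below is only an illustration: the 16 kHz sample rate is an assumption, the function name `window_resolution` is hypothetical, and the three window lengths are the ones used later in the talk.

```python
def window_resolution(window_ms, sample_rate=16000):
    """Return (samples per window, approximate frequency resolution in Hz)."""
    # sample_rate=16000 is an assumption, not stated in the talk.
    n_samples = round(sample_rate * window_ms / 1000)
    # Frequency resolution is roughly the inverse of the window duration.
    freq_resolution_hz = sample_rate / n_samples
    return n_samples, freq_resolution_hz


for window_ms in (18, 25, 30):
    n_samples, df = window_resolution(window_ms)
    print(f"{window_ms} ms window: {n_samples} samples, ~{df:.1f} Hz per resolvable band")
```

A longer window covers more samples, so it separates nearby frequencies better, but each frame then averages over a longer stretch of time.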

the question is

how to use multiple acoustic feature maps together

there are feature fusion and score fusion

score fusion is a kind of late fusion which can be used

to fuse

scores produced from systems

using

individual feature maps

however score fusion can be computationally costly since it needs to train neural

networks multiple times

in addition fusion weights need to be determined in advance

as for feature fusion

there is feature map concatenation

along a single dimension such as the time or frequency dimension

there is also linear interpolation

feature fusion followed with

neural networks has the advantage of automatic

feature selection so we chose feature fusion over score fusion
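
For comparison, score-level fusion can be sketched as a weighted sum of per-system scores. This is a minimal illustration, not the fusion used in the paper: the function name, the scores and the weights are all hypothetical, and the point is only that the weights have to be fixed before testing.

```python
def fuse_scores(system_scores, weights):
    """Weighted sum of the scores that separate systems give one utterance."""
    if len(system_scores) != len(weights):
        raise ValueError("one weight per system is required")
    return sum(w * s for w, s in zip(weights, system_scores))


scores = [-0.8, -0.3, -0.5]    # hypothetical scores from three separately trained systems
weights = [0.25, 0.5, 0.25]    # hypothetical weights, chosen in advance
fused = fuse_scores(scores, weights)
```

Each entry in `scores` comes from its own trained network, which is where the extra training cost of score fusion comes from.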

we propose

multiresolution feature maps

which stack

multiple feature maps

of the same dimensionality

into

a three dimensional input for a deep neural network

it is suitable for

two-d cnns

the modification of the neural network is also very simple

it only needs to

modify the first layer of the neural network from dimension

one times

c one where c one is the output dimension of the first layer of the neural network

from this dimension to

the number of channels

which means the number of feature maps

times

the output dimension

so the proposed method makes

it possible to extract more information

from input signals

with relatively little

additional cost
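
The stacking step and the first-layer change can be sketched with NumPy. The random arrays stand in for real feature maps, and the kernel size of 3 and 64 output channels are illustrative assumptions, not values from the talk; `first_conv_weight_shape` is a hypothetical helper.

```python
import numpy as np

n_bins, n_frames = 257, 400            # map size quoted later in the talk
rng = np.random.default_rng(0)

# Three stand-in feature maps of identical shape, one per resolution.
feature_maps = [rng.standard_normal((n_bins, n_frames)) for _ in range(3)]

# Stack along a new leading channel axis to get a 3-D network input.
stacked = np.stack(feature_maps, axis=0)


def first_conv_weight_shape(n_maps, c1, kernel_size):
    """Weight shape of a 2-D convolution: (out channels, in channels, k, k)."""
    return (c1, n_maps, kernel_size, kernel_size)


single = first_conv_weight_shape(1, 64, 3)   # one feature map as input
multi = first_conv_weight_shape(3, 64, 3)    # three stacked feature maps
```

Only the input-channel dimension of the first layer grows with the number of maps; every later layer keeps its shape.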

the experimental data used in this

study was the physical access subset

of the asvspoof two thousand nineteen challenge

it contains fifty thousand spoofed

and

five thousand

bona fide

utterances in the training partition

as well as twenty four thousand spoofed

and

five thousand bona fide

utterances in the development

partition

the development dataset has the same conditions as

the training data

it contains

twenty seven recording acoustic configurations

and nine replay configurations

the evaluation set

contains

a much larger size of data

and includes

spoofed utterances of unseen conditions as well

this is

the scenario of replay in asvspoof two thousand nineteen

replay attacks

are simulated

within an acoustic environment

the talker here

speaks to the asv

system

and the attacker

records the speech

at defined distances

and then

he or she will replay

the speech back

to the asv system

in our experiments we used fft spectrograms and we used window lengths of eighteen

twenty five and thirty

milliseconds

the spectrogram's dimension was set to two hundred fifty seven times four hundred

we used unified

fft feature maps
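
One way to obtain FFT spectrograms with the same number of frequency bins from different window lengths is to zero-pad every frame to a common FFT size. The sketch below assumes a 16 kHz sample rate, a 10 ms hop, a Hann window and a 512-point FFT; none of these is stated in the talk, and the noise signal is only a stand-in for audio.

```python
import numpy as np


def log_spectrogram(signal, window_ms, sample_rate=16000, hop_ms=10, n_fft=512):
    """Log-magnitude spectrogram of shape (n_fft // 2 + 1, n_frames)."""
    win_len = int(round(sample_rate * window_ms / 1000))
    hop = int(round(sample_rate * hop_ms / 1000))
    window = np.hanning(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop
    frames = np.stack([signal[i * hop:i * hop + win_len] * window
                       for i in range(n_frames)])
    # rfft zero-pads each frame to n_fft, so all window lengths give 257 bins.
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)
    return np.log(np.abs(spectrum) + 1e-8).T


signal = np.random.default_rng(0).standard_normal(16000 * 5)   # 5 s stand-in signal
maps = [log_spectrogram(signal, w) for w in (18, 25, 30)]

# Crop to the shortest map so all three share the same time dimension.
n_common = min(m.shape[1] for m in maps)
maps = [m[:, :n_common] for m in maps]
```

With a shared FFT size and hop, the maps differ only in how much signal each frame sees, which is what makes them stackable.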

since the lengths of evaluation utterances are usually not known beforehand

we first extended utterances

to the minimum multiple of four hundred

frames

and then cut them into

multiple four hundred frame

segments with

two hundred

frame overlap
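
The segmentation just described can be sketched as index arithmetic. The talk does not say how utterances are extended (padding, repetition, or something else), so this hypothetical helper only computes the extended length and the segment boundaries.

```python
def segment_bounds(n_frames, seg_len=400, overlap=200):
    """Return (extended length, list of (start, end) frame indices)."""
    if n_frames < 1:
        raise ValueError("utterance must contain at least one frame")
    # Round the length up to the nearest multiple of seg_len.
    extended = -(-n_frames // seg_len) * seg_len
    hop = seg_len - overlap
    starts = range(0, extended - seg_len + 1, hop)
    return extended, [(s, s + seg_len) for s in starts]


extended, bounds = segment_bounds(950)
```

For a 950-frame utterance this gives an extended length of 1200 frames and five overlapping 400-frame segments.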

experiments were carried out

using the following three cnn variants

resnet eighteen senet fifty

and the light cnn

all the networks

have

ten output nodes

one stands for

the bona fide condition

also called genuine

and the other nine

represent nine replay

configurations

the log probability of the bona fide

class

is used as the spoofing detection score to make the final decision
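
Turning ten network outputs into a single detection score can be sketched with a log-softmax. The position of the bona fide node (index 0 here) and the example logits are assumptions made for illustration.

```python
import math


def bonafide_log_prob(logits, bonafide_index=0):
    """Log-probability of the bona fide class under a softmax over the logits."""
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[bonafide_index] - log_sum


# Hypothetical logits: one bona fide node followed by nine replay nodes.
logits = [2.0, 0.1, -0.3, 0.0, 0.4, -1.2, 0.2, -0.5, 0.3, -0.1]
score = bonafide_log_prob(logits)
```

A higher score means the segment looks more like bona fide speech; thresholding this score gives the final decision.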

the model parameters and architectures of resnet eighteen and senet fifty are

shown in this table

the basic and the bottleneck residual blocks

are described in the original resnet paper

resnet eighteen and senet

fifty have been shown in

this paper

to be effective for replay

spoofing detection

that is why we chose these two networks

lc-nn is a kind of cnn with max feature map

activation

the use of mfm

allowed us to reduce the number of channels

by half

that is why it's called light cnn

lc-nns worked the best in asvspoof two thousand seventeen

and also ranked highly

in the asvspoof two thousand

nineteen challenge

that's why we also include

lc-nn in our study

we first compare spoofing detection equal error rates

when using single feature maps of different resolutions

in the three

networks

for different neural network architectures

the respective

best performances were obtained with

different feature maps
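
The equal error rate used for these comparisons can be sketched as a threshold sweep. This is a simple illustrative version, not the official challenge scoring tool, and the scores in the example are made up.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate EER: the point where false acceptance and false rejection meet."""
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # Bona fide trials scoring below the threshold are falsely rejected.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # Spoofed trials scoring at or above the threshold are falsely accepted.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


eer = equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.1, 0.2, 0.4, 0.6])
```

A lower value means the bona fide and spoofed score distributions overlap less.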

so here fft eighteen twenty five thirty represent fft spectrograms extracted with

these window lengths

the fft spectrograms

extracted with twenty five

milliseconds fft

twenty five

and those extracted with

thirty milliseconds

give similar results in resnet eighteen

and they are significantly better

than those with eighteen milliseconds

and for senet fifty however

fft eighteen and twenty five gave

similar results

while

those with thirty milliseconds were the best

for lc-nn the fft spectrograms of twenty five milliseconds

gave the best performance

so there may not be one single optimal fft configuration for

different neural network structures

next we applied the proposed multi resolution feature maps

as input to the cnns

we also show at the same time the results of score fusion

which is a straightforward method when multiple feature maps

are available

here are the results on the development set

where the replay conditions are seen

the grey bars here represent

performance for single feature maps

and blue bars represent score fusion and the yellow bars represent the proposed method

we can see from the figure

the use of two resolution feature maps

here

improved

by [unclear] forty four and thirty percent for the three cnns

respectively

and these results are compared to the better

one of the two single feature systems

then we also have three resolution

results here

which show the best performance for

resnet eighteen and senet fifty

which

are fifty two percent and fifty seven percent in

error reduction

for lc-nn

three resolution input shows less improvement

we think the reason may be that lc-nn has a much smaller number of parameters

which we will show you on a later page

we also can see score fusion

achieved

better

results compared to the single feature map systems

clearly

in all the cases

the proposed method yellow

bars

is significantly better than that of the score fusion which is blue

the same trend has been seen in

the evaluation set

where there are unseen replay

conditions

the improvement compared to the seen condition is less but is still consistently better

than score fusion and also of course the original single feature map systems

then we investigated

the computational cost of the proposed method

and the score fusion

using the proposed two resolution feature maps

only resulted in a parameter number increase

of less than

zero point one two percent

while the increase with the use of the best

three resolution feature maps

was roughly zero point two two percent

and score fusion

as is well-known trains two or more systems

and then fuses the scores at score level

this did not improve the performance

much in our experiments

but it doubled

or even tripled the number of parameters
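
The difference in cost can be sketched by counting parameters. All numbers below (a 10 million parameter network, 64 first-layer channels, a 3 by 3 kernel) are hypothetical and are not the figures from the paper; the helper names are also made up.

```python
def first_conv_params(n_channels, c1, kernel):
    """Weights in the first convolution layer (bias ignored)."""
    return n_channels * c1 * kernel * kernel


def relative_increase(total_params, n_maps, c1, kernel):
    """Fractional growth in parameters when the input grows from 1 to n_maps maps."""
    extra = first_conv_params(n_maps, c1, kernel) - first_conv_params(1, c1, kernel)
    return extra / total_params


total, c1, kernel = 10_000_000, 64, 3   # hypothetical network size and first layer
for n_maps in (2, 3):
    multi = 100 * relative_increase(total, n_maps, c1, kernel)
    fusion = 100 * (n_maps - 1)          # score fusion trains n_maps full networks
    print(f"{n_maps} maps: multi-resolution +{multi:.4f}%, score fusion +{fusion}%")
```

Only the first layer grows with the number of maps, so the increase stays a tiny fraction of the whole network, while score fusion multiplies the whole network.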

so in conclusion

our proposed method

will be more helpful in practical use

now i would like to summarise this presentation

we proposed multi-resolution feature maps

which stack multiple feature maps into a three dimensional input

followed with cnns

thus optimal resolutions will be automatically selected

it is proposed to alleviate the problem

that

feature maps commonly used in anti-spoofing networks are unlikely to be sufficient

for building

discriminative representations of audio segments

as they are often extracted

by fixed length windows

the effectiveness of the proposed method was confirmed on the asvspoof

two thousand nineteen challenge

physical access

with three cnn variants resnet eighteen senet fifty and lc-nn

experiments showed two and three resolution feature maps

achieved on average thirty seven and forty five percent equal error rate reduction

it was significantly better

than score fusion

and also it cost only one third

to a half

in terms of

computational cost

for future work

we would like to introduce an attention mechanism to make

better use of multi-resolution feature maps

we also would like to extend the proposed method with other feature extractors

that's all for my presentation thank you for watching please let me know if you

have any questions