hello everyone and welcome to my presentation today i'm going to present our work
on the generalization of audio deepfake detection
this is joint work with my colleagues
so first
i'm going to give you some background on what audio deepfakes are, on audio deepfake detection
and on the motivation of our work then i will introduce the details of our proposed system
which is a deep neural network based audio deepfake detection system
it is trained using a large margin cosine loss and a frequency masking augmentation layer
towards the end i will present the results and conclusion
so what is an audio deepfake? audio deepfakes, technically known as logical
access voice spoofing attacks,
are manipulated or synthetic audio recordings that are generated using text-to-speech and voice conversion
techniques
so due to the recent breakthroughs in speech synthesis and voice conversion technologies
these deep-learning based techniques can now produce very high quality synthesized speech
far beyond what traditional systems could achieve just a few years ago
why do we need audio deepfake detection? automatic speaker verification systems
are widely adopted in voice-based human-computer interfaces
which are very popular nowadays
several studies have shown that audio deepfakes pose a great threat to
modern speaker verification systems
and the tools used to generate audio deepfakes are easily accessible to the public
anyone can download a pretrained model and run it
thus effectively detecting these attacks
is critical to many speech applications including automatic speaker verification systems
so essentially the research community has done a lot of work on studying audio deepfake
detection
the ASVspoof challenge series started in 2013 and it aims to foster
research on countermeasures to detect spoofed speech
in 2015
the speech synthesis and voice conversion attacks were mostly based on hidden
markov models and gaussian mixture models
however
the quality of speech synthesis and voice conversion systems has drastically improved with the
use of deep learning
thus the ASVspoof 2019
challenge was introduced
it includes the most recent state-of-the-art text-to-speech and voice conversion techniques
and
during the challenge most researchers focused on investigating different types of
low-level features
such as constant-Q cepstral coefficients
MFCC, LFCC, and also phase information like modified group delay features
system ensembling was also commonly used during the challenge
however these methods don't perform well on the ASVspoof 2019 dataset
based on the results most systems achieved close to zero equal error
rate
on the development set but on the evaluation set
there is a large gap
so what is the cause?
this dataset focuses on evaluating systems against unknown
spoofing techniques
it contains seventeen different TTS and VC techniques but only
six of them are in the training set
the other eleven techniques
are unknown and absent from the training and development sets
so the evaluation set contains thirteen techniques in total eleven of them are unknown
and only two of them are seen in the training set
so that makes
this dataset rather challenging and
therefore
strong robustness is required from a spoofing detection system on this dataset
so now here's the problem how do we build a robust audio
deepfake detection system that can detect unknown or unseen audio deepfakes
if feature engineering doesn't work well
can we instead focus on increasing the generalization ability of the model itself
so here's our proposed solution our proposed audio deepfake detection system consists
of a deep-learning based feature embedding extractor
and a backend classifier that classifies whether a given audio sample is genuine
or spoofed
our system simply uses linear log filter banks as the low-level feature
the feature maps first pass through the frequency masking augmentation layer
which i will introduce later
and are then fed into the deep residual network
instead of using the softmax loss we use the large margin cosine loss to train
the residual network
we use the output of the final fully connected layer as the feature embedding
once we have the feature embedding it is fed into the backend classifier
in this case the backend classifier is just a shallow neural
network with only one hidden layer, as sketched below
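To make this part concrete, here is a minimal PyTorch sketch of such a shallow backend classifier; the embedding size and hidden-layer width are assumptions for illustration, not the exact configuration.

```python
import torch.nn as nn

# Hypothetical sizes: a 256-dimensional embedding and a 128-unit hidden layer.
backend_classifier = nn.Sequential(
    nn.Linear(256, 128),   # the single hidden layer on top of the embedding
    nn.ReLU(),
    nn.Linear(128, 2),     # two outputs: genuine vs. spoofed
)
```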
so
now let's talk about the details of the deep ResNet-based embedding extractor
we use a standard ResNet-18 architecture as shown in the table the differences are that we
replace the global max pooling layer with mean and standard deviation pooling after the residual blocks
we use the large margin cosine loss instead of the softmax loss
and the feature embedding is extracted from the second fully connected layer
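As an illustration of the pooling change, a mean-and-standard-deviation (statistics) pooling layer could look like the following minimal PyTorch sketch; the tensor layout is an assumption.

```python
import torch
import torch.nn as nn

class StatsPooling(nn.Module):
    """Replace global max pooling with mean + standard deviation pooling over time."""
    def forward(self, x):
        # x: (batch, channels, freq, time) feature maps from the residual blocks.
        x = x.flatten(1, 2)                   # (batch, channels * freq, time)
        mean = x.mean(dim=-1)
        std = x.std(dim=-1)
        return torch.cat([mean, std], dim=1)  # (batch, 2 * channels * freq)
```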
so
why do we want to use the large margin cosine loss? as mentioned before
we want to increase the
generalization ability of the model itself
the large margin cosine loss
was originally used for face recognition
the goal of the cosine loss is to maximize
the variance between the genuine and spoofed classes
and at the same time minimize the intra-class variance
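Formally, following the original CosFace formulation (with a scale s and a margin m as hyperparameters), the loss can be written as:

```latex
L_{\mathrm{lmc}}
  = \frac{1}{N}\sum_{i=1}^{N}
    -\log \frac{e^{\,s\,(\cos\theta_{y_i,i}-m)}}
               {e^{\,s\,(\cos\theta_{y_i,i}-m)} + \sum_{j\neq y_i} e^{\,s\,\cos\theta_{j,i}}},
\qquad
\cos\theta_{j,i} = \frac{W_j^{\top} x_i}{\lVert W_j\rVert \, \lVert x_i\rVert}
```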
so here's the visualization of the feature embeddings learned using the cosine loss
as presented in the original paper
we can see that
compared to softmax
the cosine loss not only
separates the different classes but also clusters the features within each class more tightly together
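For illustration, a CosFace-style large margin cosine loss head might be implemented roughly as follows in PyTorch; the scale and margin values here are placeholders, not the hyperparameters of our system.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LargeMarginCosineLoss(nn.Module):
    """CosFace-style loss: cosine logits with a margin subtracted from the target class."""
    def __init__(self, embed_dim, num_classes, s=30.0, m=0.35):
        super().__init__()
        self.s, self.m = s, m
        self.weight = nn.Parameter(torch.empty(num_classes, embed_dim))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, embeddings, labels):
        # Cosine similarity between L2-normalized embeddings and class weight vectors.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        # Subtract the margin only from the true-class logit, then scale.
        one_hot = F.one_hot(labels, num_classes=cosine.size(1)).float()
        logits = self.s * (cosine - self.m * one_hot)
        return F.cross_entropy(logits, labels)
```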
so finally we add a random frequency masking augmentation layer after the input layer
this is an online augmentation method during training for each mini-batch a
random band of consecutive frequency bins is masked
by setting its values to zero
by adding this frequency augmentation layer we hope to add more noise into the
training which should increase the generalization ability of
the model
during testing this step is skipped
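A minimal sketch of such an online frequency-masking step, assuming (batch, channel, frequency, time) log-filterbank inputs and a hypothetical max_width parameter:

```python
import torch

def frequency_mask(batch, max_width=16, training=True):
    """Zero out one random band of consecutive frequency bins for the mini-batch."""
    if not training:
        return batch                      # the mask is skipped at test time
    _, _, num_freq, _ = batch.shape
    width = int(torch.randint(1, max_width + 1, (1,)))
    start = int(torch.randint(0, num_freq - width + 1, (1,)))
    masked = batch.clone()
    masked[:, :, start:start + width, :] = 0.0
    return masked
```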
so in total we construct
three training protocols and three evaluation protocols
for training protocol T1 we use the original ASVspoof 2019 dataset
and for T2 we
create a noisy version of the data by using traditional audio augmentation techniques
two types of distortion were used for augmentation reverberation and background noise the room impulse responses for
the reverberation were chosen from publicly available room impulse response datasets
and we chose several types of background noise for augmentation
such as music, television, and babble noise
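As an illustration of this kind of offline augmentation (a sketch under assumed inputs, not the exact pipeline), the reverberation and additive-noise steps could look like this:

```python
import numpy as np
from scipy.signal import fftconvolve

def augment_utterance(speech, rir, noise, snr_db):
    """Add reverberation (RIR convolution) and background noise at a target SNR.

    speech, rir, noise: 1-D float arrays at the same sample rate; snr_db: target SNR.
    """
    # Reverberation: convolve the clean waveform with the room impulse response.
    reverbed = fftconvolve(speech, rir, mode="full")[: len(speech)]
    # Tile or trim the noise so it covers the whole utterance.
    noise = np.resize(noise, len(reverbed))
    # Scale the noise to reach the requested signal-to-noise ratio.
    speech_power = np.mean(reverbed ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return reverbed + gain * noise
```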
so
we also want to investigate
the system performance under call center scenarios
thus we replayed the original ASVspoof data through telephony services
to create phone channel effects
and we use that for our T3
training protocol
similarly for the
evaluation sets E1 is the original ASVspoof 2019 evaluation set
E2 is its noisy version and E3 is the version
logically replayed through telephony services
so let me now present the results this is the result on the original ASVspoof
2019
evaluation set
our baseline system is a standard ResNet-18
model
and it achieves
around four percent equal error rate on the
evaluation set
and you can see that by applying the large margin cosine loss the equal error rate is
reduced to 3.49 percent
and finally when we add
both the cosine loss and the frequency masking layer
the equal error rate is reduced to 1.81 percent
and here we
compare the systems trained using the three different protocols and evaluated against
the different benchmarks
E1 is the original benchmark the original ASVspoof 2019 evaluation set E2 is the
noisy more general benchmark and E3 is
the benchmark
logically replayed through telephony services
so as you can see by training with the augmented data i.e. the T2 protocol
we can achieve significant improvements over training on the original dataset
especially on the noisy and degraded evaluation sets
so
this is the detailed equal error rate for the different spoofing techniques
our proposed system outperforms the baseline system on almost all types of spoofing techniques
so
in conclusion we achieved state-of-the-art performance on the ASVspoof 2019 evaluation
dataset
without using any system ensembling
we also showed that traditional data augmentation techniques
can be helpful
we were able to increase the generalization ability
by using frequency augmentation and the large margin cosine loss and we showed that
increasing the generalization ability of the model itself
is very useful
finally we evaluated the system performance on noisy versions of the dataset and
in call center scenarios as well
that's all for my presentation
thank you