0:00:15 | Hello everyone, welcome to my presentation. Today I'm going to present our work |
---|
0:00:22 | on audio deepfake detection. |
---|
0:00:25 | This work was done by me and my co-authors [names inaudible]. |
---|
0:00:32 | So first, |
---|
0:00:35 | I'm going to present some background on what audio deepfakes are and on audio deepfake detection, |
---|
0:00:42 | and the motivation of our work. Then I will introduce the details of our proposed system, |
---|
0:00:49 | which is a deep neural network based audio deepfake detection system. |
---|
0:00:54 | It is trained using a large margin cosine loss and a frequency masking augmentation layer. |
---|
0:01:00 | Afterwards, we will present the results and conclusion. |
---|
0:01:07 | So, what is an audio deepfake? Audio deepfakes, technically known as logical |
---|
0:01:13 | access voice spoofing techniques, |
---|
0:01:16 | are manipulated audio content generated using text-to-speech and voice conversion |
---|
0:01:24 | techniques, |
---|
0:01:27 | such as |
---|
0:01:29 | [system names inaudible]. |
---|
0:01:34 | Due to the recent breakthroughs in speech synthesis and voice conversion technologies, |
---|
0:01:40 | these deep-learning based technologies can produce very high quality synthesized speech and audio, |
---|
0:01:49 | far beyond the traditional systems of just a few years ago. |
---|
0:01:55 | Why do we need audio deepfake detection? Automatic speaker verification systems |
---|
0:02:01 | are widely adopted in many voice-based human-computer interfaces, |
---|
0:02:07 | which are very popular nowadays. |
---|
0:02:10 | Several studies have shown that audio deepfakes pose a great threat to |
---|
0:02:16 | modern speaker verification systems, |
---|
0:02:18 | and the tools used to generate audio deepfakes are easily accessible to the public: |
---|
0:02:24 | anyone can download a pretrained model and the source code. |
---|
0:02:29 | Thus, effectively detecting these attacks |
---|
0:02:35 | is critical to many speech applications, including automatic speaker verification systems. |
---|
0:02:44 | Recently, the research community has done a lot of work on studying audio deepfake |
---|
0:02:51 | detection. |
---|
0:02:52 | The ASVspoof challenge series started in 2013. It aims to foster |
---|
0:02:58 | research on countermeasures to detect voice spoofing. |
---|
0:03:03 | In 2015, |
---|
0:03:06 | speech synthesis and voice conversion attacks, mostly based on hidden |
---|
0:03:12 | Markov models and Gaussian mixture models, were introduced. |
---|
0:03:17 | Over the years, |
---|
0:03:18 | the quality of speech synthesis and voice conversion systems has drastically improved with the |
---|
0:03:25 | use of deep learning. |
---|
0:03:27 | Thus, the ASVspoof 2019 |
---|
0:03:32 | challenge was introduced. |
---|
0:03:34 | It includes the most recent state-of-the-art text-to-speech and voice conversion techniques. |
---|
0:03:41 | And |
---|
0:03:42 | during the challenge, most researchers focused on investigating different types of |
---|
0:03:49 | low-level features, |
---|
0:03:51 | such as constant-Q cepstral coefficients (CQCC), |
---|
0:03:56 | MFCC, LFCC, and also phase information like modified group delay features. |
---|
0:04:04 | Ensemble methods were also commonly used during the challenge. |
---|
0:04:13 | However, |
---|
0:04:14 | these methods don't perform well on the ASVspoof 2019 dataset. |
---|
0:04:21 | Based on the results, both the t-DCF and the equal error |
---|
0:04:27 | rate |
---|
0:04:29 | on the development dataset versus the evaluation dataset |
---|
0:04:32 | show large gaps. |
---|
0:04:36 | So what is the cause? |
---|
0:04:40 | This dataset focuses on evaluating systems against unknown |
---|
0:04:46 | spoofing techniques. |
---|
0:04:48 | It contains seventeen different TTS and VC techniques, but only |
---|
0:04:55 | six of them are in the training dataset. |
---|
0:04:58 | It includes eleven techniques |
---|
0:05:01 | that are absent, i.e., unknown, from the training and development datasets. |
---|
0:05:05 | The evaluation dataset contains thirteen techniques in total; eleven of them are unknown, |
---|
0:05:11 | and only two of them are seen in the training set. |
---|
0:05:15 | That makes |
---|
0:05:18 | the dataset rather challenging, and |
---|
0:05:21 | therefore |
---|
0:05:23 | strong robustness is required for a spoofing detection system on this dataset. |
---|
0:05:32 | So now here's the problem: how to build a robust audio |
---|
0:05:37 | deepfake detection system that can detect unknown audio deepfakes? |
---|
0:05:43 | If feature engineering doesn't work well, |
---|
0:05:48 | can we focus on increasing the generalization ability of the model itself? |
---|
0:05:58 | Here's our proposed solution. Our proposed audio deepfake detection system contains |
---|
0:06:05 | a deep-learning based feature embedding extractor |
---|
0:06:10 | and a backend classifier to classify whether a given audio sample is spoofed |
---|
0:06:16 | or genuine. |
---|
0:06:18 | Our system simply uses linear filter banks as the low-level feature. |
---|
0:06:24 | The feature maps first pass through the frequency masking augmentation layer, |
---|
0:06:29 | and the output |
---|
0:06:32 | is then fed into the deep residual network. |
---|
0:06:37 | Instead of using a softmax loss, we use a large margin cosine loss to train |
---|
0:06:43 | the residual network. |
---|
0:06:45 | We use the output of the final fully connected layer as the feature |
---|
0:06:50 | embedding. Once we get the feature embedding, it is then fed into the backend classifier. |
---|
0:06:58 | In this case, the backend classifier is just a shallow neural |
---|
0:07:04 | network with only one hidden layer. |
---|
0:07:11 | So, |
---|
0:07:14 | now let's talk about the details of the ResNet-based feature embedding extractor. |
---|
0:07:21 | We use the standard ResNet architecture, as shown in the table. The difference is that we |
---|
0:07:31 | replace the |
---|
0:07:33 | global max pooling layer with mean and standard deviation pooling after the residual blocks, |
---|
0:07:41 | and we use the large margin cosine loss instead of the softmax loss; |
---|
0:07:47 | the feature embedding is extracted from the second fully connected layer. |
---|
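The mean-and-standard-deviation pooling mentioned above can be sketched as follows. This is an illustrative NumPy sketch, not the authors' code; the array shape and function name are assumptions. It collapses a variable-length (channels × time) feature map into a fixed-size vector by concatenating per-channel statistics, which is what lets the network accept inputs of any duration.

```python
import numpy as np

def mean_std_pooling(feature_map, eps=1e-9):
    """Pool a (channels, time) feature map into a fixed-size vector by
    concatenating the per-channel mean and standard deviation."""
    mean = feature_map.mean(axis=1)
    std = np.sqrt(feature_map.var(axis=1) + eps)  # eps keeps the sqrt stable
    return np.concatenate([mean, std])

# A (4, T) map pools to an 8-dimensional vector regardless of length T.
pooled = mean_std_pooling(np.random.randn(4, 100))
```

Unlike max pooling, this summary keeps second-order statistics of each channel, which is the usual motivation for statistics pooling in speaker embedding networks.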
0:07:56 | So, |
---|
0:07:57 | well, |
---|
0:07:58 | why do we want to use the large margin cosine loss? As mentioned above, |
---|
0:08:03 | we want to increase the |
---|
0:08:06 | generalization ability of the model itself. |
---|
0:08:11 | The large margin cosine loss |
---|
0:08:15 | is usually used for face recognition. |
---|
0:08:19 | The goal of |
---|
0:08:22 | the cosine loss is to maximize |
---|
0:08:24 | the variance between the genuine and spoofed classes, |
---|
0:08:28 | and at the same time minimize the intra-class variance. |
---|
0:08:33 | Here's the visualization of the feature embeddings learned using the cosine loss, |
---|
0:08:40 | as presented in the original paper. |
---|
0:08:43 | We can see that, |
---|
0:08:45 | compared to softmax, |
---|
0:08:50 | the cosine loss can not only |
---|
0:08:55 | separate the different classes, but within each class the features are clustered together. |
---|
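As a rough illustration of the loss itself, here is a minimal NumPy sketch of a large margin cosine loss in the CosFace style: cosine similarities between L2-normalized embeddings and class weight vectors, with a margin subtracted from the target-class cosine before a scaled softmax cross-entropy. The scale s and margin m values below are assumed hyperparameters, not necessarily the ones used in this work.

```python
import numpy as np

def large_margin_cosine_loss(embeddings, class_weights, labels, s=30.0, m=0.35):
    """LMCL sketch: the target class must beat the others by margin m
    in cosine similarity, which pulls same-class embeddings together."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    cos = e @ w.T                                       # (batch, n_classes)
    rows = np.arange(len(labels))
    logits = s * cos
    logits[rows, labels] = s * (cos[rows, labels] - m)  # enforce the margin
    logits -= logits.max(axis=1, keepdims=True)         # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[rows, labels].mean()              # cross-entropy
```

Because the margin is applied only to the true class, an embedding must be closer to its own class center than to the other class by at least m in cosine space, which is what shrinks intra-class variance while widening the inter-class gap.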
0:09:07 | Finally, we added a random frequency masking augmentation layer after the input layer. |
---|
0:09:15 | This is an online augmentation method: during training, for each mini-batch, a |
---|
0:09:21 | random consecutive frequency band is masked |
---|
0:09:25 | by setting its values to zero. |
---|
0:09:29 | So, |
---|
0:09:33 | by adding this frequency augmentation layer, we hope to add more noise into the |
---|
0:09:40 | training, and it will increase the generalization ability of |
---|
0:09:44 | the model. |
---|
0:09:48 | During testing, this |
---|
0:09:49 | step is skipped. |
---|
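The masking step described above can be sketched in NumPy as follows. This is a hypothetical version for illustration; the maximum band width is an assumption, and the paper's actual hyperparameters may differ.

```python
import numpy as np

def freq_mask(batch, max_width=8, rng=None):
    """Zero out one random consecutive frequency band for a mini-batch
    of shape (batch, freq_bins, time). Applied only during training;
    at test time this step is skipped and inputs pass through unchanged."""
    if rng is None:
        rng = np.random.default_rng()
    n_freq = batch.shape[1]
    width = rng.integers(1, max_width + 1)        # band width in bins
    start = rng.integers(0, n_freq - width + 1)   # band start position
    out = batch.copy()
    out[:, start:start + width, :] = 0.0
    return out
```

Sampling a fresh band per mini-batch means the network never sees the same spectrum twice, forcing it to avoid relying on any single narrow frequency region.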
0:09:57 | In total, we constructed six protocols: |
---|
0:10:01 | three training protocols and three evaluation protocols. |
---|
0:10:05 | For protocol T1, we use the original ASVspoof 2019 dataset, |
---|
0:10:11 | and for T2, we |
---|
0:10:15 | create a noisy version of the data by using traditional audio augmentation techniques. |
---|
0:10:22 | Two types of distortion were used for augmentation: reverberation and background noise. Room impulse responses for |
---|
0:10:30 | reverberation were chosen from publicly available room impulse response datasets, |
---|
0:10:38 | and we chose four types of background noise for augmentation: music, television, |
---|
0:10:43 | babble, and free sound. |
---|
0:10:47 | So, |
---|
0:10:49 | we also want to investigate |
---|
0:10:52 | the system performance under call center scenarios. |
---|
0:10:56 | Thus, we replayed the original datasets through telephony services |
---|
0:11:02 | to create phone channel effects, |
---|
0:11:07 | and we use that for our T3 |
---|
0:11:11 | protocol. |
---|
0:11:13 | Similarly, |
---|
0:11:14 | evaluation dataset E1 is the original ASVspoof 2019 evaluation set, |
---|
0:11:21 | E2 is a noisy version of it, and E3 is the version that is |
---|
0:11:26 | logically replayed through telephony services. |
---|
0:11:33 | Now let me present the results. This is the result on the original ASVspoof |
---|
0:11:40 | 2019 |
---|
0:11:41 | evaluation set. |
---|
0:11:44 | Our baseline system is the standard ResNet-18 |
---|
0:11:47 | model, |
---|
0:11:49 | and |
---|
0:11:52 | it achieves |
---|
0:11:54 | four percent equal error rate on the |
---|
0:11:58 | evaluation set. |
---|
0:12:01 | You can see that by applying the large margin cosine loss, the equal error rate is |
---|
0:12:07 | reduced to three point [inaudible] nine percent, |
---|
0:12:09 | and finally, by adding |
---|
0:12:13 | both the cosine loss and the frequency masking layer, |
---|
0:12:18 | the equal error rate can be reduced to 1.81 percent. |
---|
0:12:27 | And here is the |
---|
0:12:30 | proposed system trained using the three different protocols and evaluated against the |
---|
0:12:36 | different benchmarks. |
---|
0:12:38 | E1 is the original benchmark, the original ASVspoof 2019 dataset; E2 is the |
---|
0:12:44 | noisier, more general version; and E3 is |
---|
0:12:49 | the protocol that is |
---|
0:12:51 | logically replayed through telephony services. |
---|
0:12:55 | As you can see, by training on all the augmented data, as in protocol T2, |
---|
0:13:02 | we can achieve significant improvements over the original dataset and |
---|
0:13:11 | over the logically replayed evaluation set. |
---|
0:13:18 | So, |
---|
0:13:20 | this is the detailed equal error rate for the different spoofing techniques. |
---|
0:13:25 | Our proposed system outperforms the baseline system on almost all types of spoofing techniques. |
---|
0:13:38 | In |
---|
0:13:38 | conclusion, we have achieved state-of-the-art performance on the ASVspoof 2019 evaluation |
---|
0:13:46 | dataset |
---|
0:13:48 | without using any ensembling, |
---|
0:13:51 | that is, without using any ensemble systems. |
---|
0:13:55 | We also showed that traditional data augmentation techniques |
---|
0:14:01 | may be helpful. |
---|
0:14:06 | We were able to increase the generalization ability |
---|
0:14:10 | by using frequency augmentation and the large margin cosine loss, and we showed that |
---|
0:14:17 | increasing the generalization ability of the model itself |
---|
0:14:23 | is very useful. |
---|
0:14:28 | Finally, we evaluated the system performance on a noisy version of the dataset and |
---|
0:14:34 | in call center scenarios too. |
---|
0:14:38 | That's all for my presentation. |
---|
0:14:41 | Thanks for listening. |
---|