0:00:16 | hello everyone, we are from Johns Hopkins University, |
---|
0:00:20 | and |
---|
0:00:22 | our presentation and framework are on speaker verification and speech enhancement |
---|
0:00:27 | let's start with the slides |
---|
0:00:36 | the title of this presentation is "Analysis of Deep Feature Loss based Enhancement for Speaker Verification" |
---|
0:00:43 | and I will be reusing some slides from my previous work, which was called "Feature |
---|
0:00:49 | Enhancement with |
---|
0:00:50 | Deep Feature Losses for Speaker Verification" |
---|
0:00:56 | our downstream task is speaker verification |
---|
0:00:59 | and the problem refers to the |
---|
0:01:03 | task of determining if the speaker in utterance one, |
---|
0:01:06 | the enrollment utterance, is the same as |
---|
0:01:09 | the speaker in utterance two, which is the test utterance |
---|
0:01:13 | the state-of-the-art way to implement this is to use a so-called x-vector extractor network and |
---|
0:01:19 | a probabilistic linear discriminant analysis (PLDA) classifier |
---|
0:01:23 | and also to do data augmentation |
---|
0:01:27 | in conjunction |
---|
0:01:30 | speech enhancement |
---|
0:01:31 | in this problem means that you aid speaker verification |
---|
0:01:35 | by preprocessing the enrollment and test utterances at test time |
---|
0:01:42 | it has been noted that speech enhancement may only help when trained in |
---|
0:01:48 | tandem with the speaker recognition objective |
---|
0:01:52 | and we pursue a paradigm called deep feature loss training |
---|
0:01:56 | which |
---|
0:01:56 | connects the two problems, as we will see now |
---|
0:02:02 | this is the schematic of deep feature loss training; as you can see there are |
---|
0:02:08 | two networks: one, denoted by E, is the |
---|
0:02:10 | enhancement network, and the other, denoted by A, is the auxiliary |
---|
0:02:15 | network |
---|
0:02:18 | the enhancement network takes noisy features and produces enhanced features |
---|
0:02:23 | these enhanced features are not directly compared with the clean features; instead, both are forwarded |
---|
0:02:30 | through the auxiliary network, and the differences in the intermediate activations |
---|
0:02:37 | are penalized; these differences are known as the deep feature loss |
---|
0:02:43 | when we don't use the auxiliary network on the clean and enhanced features, and simply choose to |
---|
0:02:46 | compare the enhanced features with the clean features, |
---|
0:02:49 | we get the ordinary |
---|
0:02:51 | feature loss |
---|
0:02:54 | as you can imagine, |
---|
0:02:55 | this type of training does enhancement while also preserving speaker information, |
---|
0:03:02 | since that is what the auxiliary network encodes |
---|
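To make the loss concrete, here is a minimal sketch of deep feature loss training, assuming PyTorch; the names `enh_net` and `aux_net`, and the `aux_net.layers` attribute, are hypothetical stand-ins for the two networks above, and the L1 distance over activations is an assumption, not necessarily the exact distance used in the paper.

```python
# Minimal sketch of deep feature loss (DFL) training, assuming PyTorch.
# `enh_net` and `aux_net` are hypothetical stand-ins; `aux_net` is a
# pre-trained speaker network whose parameters stay frozen.
import torch

def deep_feature_loss(aux_net, enhanced, clean, num_layers=5):
    # Forward both feature sequences through the auxiliary network and
    # accumulate L1 differences between the intermediate activations.
    loss = 0.0
    x, y = enhanced, clean
    for layer in aux_net.layers[:num_layers]:  # frame-level layers only
        x, y = layer(x), layer(y)
        loss = loss + torch.mean(torch.abs(x - y))
    return loss

def train_step(enh_net, aux_net, optimizer, noisy, clean):
    enhanced = enh_net(noisy)  # enhance the noisy features
    loss = deep_feature_loss(aux_net, enhanced, clean)
    optimizer.zero_grad()
    loss.backward()    # gradients flow through the frozen aux_net into enh_net
    optimizer.step()   # only enh_net's parameters are in the optimizer
    return loss.item()
```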
0:03:08 | this is how our speaker verification pipeline looks: the enrollment and test utterances go through |
---|
0:03:15 | feature extraction independently and are also enhanced independently, then |
---|
0:03:22 | each of them |
---|
0:03:23 | goes through our embedding extractor, which in our case is an x-vector network, |
---|
0:03:30 | and |
---|
0:03:31 | then the PLDA classifier |
---|
0:03:33 | computes a log-likelihood ratio to say |
---|
0:03:38 | whether it is the |
---|
0:03:40 | same speaker or not |
---|
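As a reference for the scoring step, below is a hedged sketch of log-likelihood-ratio scoring under a two-covariance Gaussian PLDA model, assuming numpy/scipy and centered embeddings; the parameterization with between-speaker covariance `B` and within-speaker covariance `W` is a standard textbook formulation, not necessarily the exact recipe used here.

```python
# Sketch of two-covariance PLDA scoring: embedding = speaker factor (cov B)
# + residual (cov W). Same-speaker trials share the speaker factor.
import numpy as np
from scipy.stats import multivariate_normal

def plda_llr(x_enroll, x_test, B, W):
    d = len(x_enroll)
    z = np.concatenate([x_enroll, x_test])
    # Same-speaker hypothesis: the stacked pair has cross-covariance B.
    cov_same = np.block([[B + W, B], [B, B + W]])
    # Different-speaker hypothesis: the two embeddings are independent.
    zero = np.zeros((d, d))
    cov_diff = np.block([[B + W, zero], [zero, B + W]])
    mean = np.zeros(2 * d)
    return (multivariate_normal.logpdf(z, mean=mean, cov=cov_same)
            - multivariate_normal.logpdf(z, mean=mean, cov=cov_diff))
```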
0:03:44 | now, these are the details of how the data preparation is done |
---|
0:03:49 | we use |
---|
0:03:50 | the MUSAN corpus, which consists of |
---|
0:03:53 | three noise classes: babble, |
---|
0:03:57 | general noises, |
---|
0:03:59 | and music |
---|
0:04:02 | these |
---|
0:04:03 | noise classes are used to |
---|
0:04:06 | combine |
---|
0:04:07 | with |
---|
0:04:08 | VoxCeleb2, which contains 16 kHz conversational speech |
---|
0:04:13 | we create the noisy version of VoxCeleb2 by combining it with these noises, and |
---|
0:04:19 | for the x-vector network |
---|
0:04:22 | the SNRs are sampled randomly |
---|
0:04:25 | and, at a fifty percent rate, |
---|
0:04:28 | we randomly degrade the utterances |
---|
0:04:33 | we also use |
---|
0:04:35 | an SNR filtering algorithm called WADA SNR to |
---|
0:04:39 | create a fifty percent subset of VoxCeleb2, |
---|
0:04:42 | which is supposed to preserve the highest-SNR utterances from the corpus |
---|
0:04:48 | this clean version of VoxCeleb2 is then combined with |
---|
0:04:53 | the MUSAN |
---|
0:04:54 | noises, and that serves as the noisy counterpart for our supervised enhancement training |
---|
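A minimal sketch of this pair construction, assuming numpy; `estimate_snr` stands in for a WADA SNR implementation (a hypothetical helper, not a real API), and the SNR range is an illustrative assumption.

```python
# Sketch: filter a corpus by estimated SNR, then mix clean speech with noise
# at a random SNR to form (noisy, clean) pairs for enhancement training.
import numpy as np

def snr_filter(utterances, estimate_snr, keep_fraction=0.5):
    # Keep the highest-SNR half, as in the WADA SNR filtering step.
    ranked = sorted(utterances, key=estimate_snr, reverse=True)
    return ranked[: int(len(ranked) * keep_fraction)]

def mix_at_snr(speech, noise, snr_db):
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]   # match lengths
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise                  # mixture at the target SNR

def make_pair(clean, noise, rng):
    snr_db = rng.uniform(0, 20)  # assumed SNR range, not from the talk
    return mix_at_snr(clean, noise, snr_db), clean
```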
0:05:04 | the PLDA is trained with the original-plus-augmented combined data, the same |
---|
0:05:09 | data that the x-vector network |
---|
0:05:11 | uses |
---|
0:05:14 | to give more details: the features we use are 40-dimensional log mel filterbanks, |
---|
0:05:20 | and this is the case unless stated otherwise |
---|
0:05:22 | the evaluation is done on BabyTrain, which is a corpus containing |
---|
0:05:27 | young children's speech recorded in uncontrolled environments |
---|
0:05:32 | the complete data is 250 hours of speech, and it is divided into a detection and |
---|
0:05:37 | a diarization task |
---|
0:05:40 | we have not included |
---|
0:05:42 | the diarization component in our pipeline |
---|
0:05:46 | for the evaluation data, the numbers of speakers in enroll and test are 595 |
---|
0:05:51 | and 150, respectively |
---|
0:05:54 | and results are presented in the form of equal error rate and minimum decision cost function |
---|
0:06:00 | with a target prior probability of five percent |
---|
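For reference, here is a small sketch of computing the EER and an (unnormalized) minimum DCF with a target prior of 0.05 from trial scores, assuming numpy; this is a generic implementation, not the evaluation tooling actually used.

```python
# Sketch: sweep a threshold over sorted scores to get miss/false-alarm rates,
# then read off the EER and the minimum detection cost (c_miss = c_fa = 1).
import numpy as np

def eer_and_min_dcf(target_scores, nontarget_scores, p_target=0.05):
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    labels = labels[np.argsort(scores)]
    n_tar, n_non = labels.sum(), len(labels) - labels.sum()
    p_miss = np.cumsum(labels) / n_tar            # targets at or below threshold
    p_fa = 1.0 - np.cumsum(1 - labels) / n_non    # nontargets above threshold
    eer_idx = np.argmin(np.abs(p_miss - p_fa))
    eer = (p_miss[eer_idx] + p_fa[eer_idx]) / 2
    dcf = p_target * p_miss + (1 - p_target) * p_fa
    return eer, dcf.min()
```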
0:06:05 | the table that you see here is from our previous work, which we want to |
---|
0:06:10 | analyze in this work |
---|
0:06:13 | if you focus on the second |
---|
0:06:17 | dataset column, which is for BabyTrain, |
---|
0:06:21 | you can see the first row |
---|
0:06:24 | is actually without enhancement, and it refers to the original version of the x- |
---|
0:06:29 | vector network |
---|
0:06:30 | the suffix is just |
---|
0:06:32 | a notation to denote |
---|
0:06:34 | the type of enhancement data used |
---|
0:06:38 | so this first row gives results |
---|
0:06:42 | without enhancement, and it is 7.6 percent EER; then we used the |
---|
0:06:47 | feature loss, the deep feature loss, and also their combination, |
---|
0:06:51 | and |
---|
0:06:52 | we can see the deep feature loss usually gave the best performance previously |
---|
0:06:57 | the final row |
---|
0:06:58 | is the comparison of how much performance we gain, and you can see |
---|
0:07:04 | that with our deep feature loss enhancement the |
---|
0:07:07 | performance improves clearly over the baseline |
---|
0:07:11 | having said that, we want to address seven questions |
---|
0:07:16 | the first is: |
---|
0:07:18 | are only the initial layers of the auxiliary network useful for deep feature loss training, |
---|
0:07:23 | and can the feature loss be additively combined with the deep feature loss? |
---|
0:07:27 | the second is: |
---|
0:07:29 | for supervised enhancement training, how clean does the data need to be? |
---|
0:07:33 | can we just use clean read speech, or |
---|
0:07:35 | does an unrelated corpus create database |
---|
0:07:38 | mismatch issues? |
---|
0:07:40 | third: x-vector extractors and auxiliary networks |
---|
0:07:44 | are often available pre-trained on low-dimensional features; can we train the enhancement |
---|
0:07:49 | network with higher-dimensional features and still get some benefit? |
---|
0:07:57 | fourth: does enhancing the PLDA and x-vector training data bring further |
---|
0:08:01 | improvements? |
---|
0:08:05 | fifth: can enhanced features be used to bootstrap the training data, doubling the amount of |
---|
0:08:10 | data and making our extractor more robust? |
---|
0:08:16 | sixth: are the noise classes that we're working with really useful |
---|
0:08:22 | during the data augmentation process, |
---|
0:08:25 | or are some of the noise classes |
---|
0:08:27 | even harmful? |
---|
0:08:30 | the final question is whether the proposed scheme works for the tasks of dereverberation and joint |
---|
0:08:35 | denoising and dereverberation |
---|
0:08:40 | first, we reproduce the baseline and see which layers are good for deep feature |
---|
0:08:45 | loss extraction |
---|
0:08:48 | this is a |
---|
0:08:49 | results table with a lot of numbers, but for this presentation it's enough |
---|
0:08:55 | to focus on the first column, which gives you the labels |
---|
0:08:59 | for the type of loss or data being used, |
---|
0:09:02 | and the final |
---|
0:09:04 | column is the main result on the |
---|
0:09:08 | BabyTrain test set |
---|
0:09:11 | the first row shows the equal error rate without enhancement |
---|
0:09:15 | and then we have DFL-5, meaning the deep feature losses are extracted from |
---|
0:09:20 | five layers |
---|
0:09:21 | note that |
---|
0:09:24 | the auxiliary network has six layers: |
---|
0:09:26 | the first five are used in this one, and the sixth is |
---|
0:09:30 | the |
---|
0:09:31 | classification layer producing the embedding, which we are not using for this particular loss |
---|
0:09:36 | it gives the best performance among the combinations, as we will |
---|
0:09:40 | see |
---|
0:09:41 | FL is the feature loss, and it gives worse |
---|
0:09:46 | performance than even the baseline |
---|
0:09:48 | this reproduces observations from previous work |
---|
0:09:52 | combining the two losses |
---|
0:09:54 | is also not good |
---|
0:09:57 | when you include the embedding |
---|
0:10:00 | layer, the last layer of the network, in the deep feature loss, |
---|
0:10:04 | it is also not helpful |
---|
0:10:07 | and then we use |
---|
0:10:09 | the deep feature loss with four layers, three layers, two layers, and |
---|
0:10:14 | finally one layer, and they are not as good as using all the layers |
---|
0:10:18 | the bottom half of the table is the minimum decision cost function; |
---|
0:10:23 | the |
---|
0:10:24 | observations are mostly the same as for the equal error rate |
---|
0:10:27 | so here we have seen that the feature loss degrades our system, |
---|
0:10:34 | and combining the losses |
---|
0:10:36 | is also not useful |
---|
0:10:38 | using more layers is the best, though it increases the computational complexity, |
---|
0:10:44 | but that's okay |
---|
0:10:48 | the main takeaway is that |
---|
0:10:50 | you need to |
---|
0:10:51 | use all frame-level layers from the auxiliary network |
---|
0:10:58 | next, if we look at the choice of training dataset for the enhancement and auxiliary |
---|
0:11:01 | networks: |
---|
0:11:03 | where we see the filtered label, it means |
---|
0:11:07 | VoxCeleb2 |
---|
0:11:08 | with the WADA SNR filtering was used for the |
---|
0:11:11 | enhancement network, and |
---|
0:11:13 | consequently |
---|
0:11:15 | also for the |
---|
0:11:18 | auxiliary network; this gives the best performance, marked by boldface |
---|
0:11:24 | one row |
---|
0:11:26 | uses VC2, which is VoxCeleb2 itself, |
---|
0:11:29 | and another uses VC2 combined, |
---|
0:11:32 | that is, VoxCeleb2 combined with |
---|
0:11:34 | the noise augmentations |
---|
0:11:37 | we also see what happens if we feed random data into the enhancement network, |
---|
0:11:43 | which is, if you recall, |
---|
0:11:44 | a random fifty percent subset of VoxCeleb2, |
---|
0:11:48 | and it is not as |
---|
0:11:51 | good as the WADA SNR filtering, so |
---|
0:11:55 | this shows that filtering or screening |
---|
0:11:59 | VoxCeleb2 based on SNR seems to be important |
---|
0:12:03 | we also use LibriSpeech, and |
---|
0:12:05 | you can see, of course, that the EER is greater than with the baseline, |
---|
0:12:11 | and so read speech, |
---|
0:12:13 | which is |
---|
0:12:14 | non-conversational and mismatched data, is bad for training, |
---|
0:12:20 | even when used only as the |
---|
0:12:22 | clean counterpart for the |
---|
0:12:24 | enhancement network |
---|
0:12:28 | we also think the auxiliary network is powerful because it is |
---|
0:12:33 | trained on the whole augmented VoxCeleb2, |
---|
0:12:38 | which means that more data is used and |
---|
0:12:40 | the data augmentation is there as well |
---|
0:12:46 | next, we see if we can mismatch the features of the enhancement network: can we use higher- |
---|
0:12:50 | dimensional features in the enhancement network? |
---|
0:12:53 | in the first row the label is |
---|
0:12:56 | LMFB- |
---|
0:12:57 | 40, which means 40-dimensional log mel filterbank features |
---|
0:13:01 | for the enhancement network |
---|
0:13:04 | recall that 40-dimensional features are used in the auxiliary and x-vector networks |
---|
0:13:10 | as well, |
---|
0:13:11 | so this is the condition where the features are matched |
---|
0:13:15 | and we don't need to learn any bridge between the networks; this case, of course, |
---|
0:13:21 | works fine |
---|
0:13:22 | if we use a higher dimension, such as a higher-dimensional mel spectrogram, |
---|
0:13:28 | then the features are mismatched and you need to learn a bridge between the networks as |
---|
0:13:33 | well, |
---|
0:13:33 | and |
---|
0:13:34 | the results are not as good as in the matched condition |
---|
0:13:38 | it seems like we cannot take advantage of high-dimensional features |
---|
0:13:43 | lastly, |
---|
0:13:44 | we also try the spectrogram, which somehow seems useful to some extent, |
---|
0:13:50 | but it is also |
---|
0:13:51 | worse than the baseline |
---|
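A minimal sketch of the "bridge" idea for mismatched feature dimensions, assuming PyTorch; the dimensions and the single linear projection are illustrative assumptions, not the exact architecture used.

```python
# Sketch: a learned projection mapping high-dimensional enhanced features
# (e.g., a spectrogram) down to the 40-dimensional space the frozen
# auxiliary/x-vector networks expect. Trained jointly with the enhancer.
import torch.nn as nn

class Bridge(nn.Module):
    def __init__(self, enh_dim=257, aux_dim=40):  # dims are assumptions
        super().__init__()
        self.proj = nn.Linear(enh_dim, aux_dim)

    def forward(self, enhanced_high_dim):
        return self.proj(enhanced_high_dim)
```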
0:13:58 | next, we see the effect of enhancement on the PLDA and the x-vector extractor training data |
---|
0:14:06 | the first row is without enhancement; it is the control we tested |
---|
0:14:11 | and saw earlier |
---|
0:14:13 | in the next row we can see |
---|
0:14:17 | the PLDA and test conditions written together |
---|
0:14:20 | as the label, which means that the PLDA |
---|
0:14:23 | training data is also enhanced, |
---|
0:14:25 | and it doesn't change much: the EER stays at around seven percent |
---|
0:14:31 | for the minDCF we have |
---|
0:14:38 | not much change either, so it doesn't feel like the PLDA is |
---|
0:14:43 | benefiting; it seems rather susceptible to the enhancement processing |
---|
0:14:49 | if we enhance the x-vector training set, |
---|
0:14:52 | there is improvement over the standard baseline, |
---|
0:14:56 | which is the unenhanced system; |
---|
0:14:58 | however, it's not as good as just enhancing the test data |
---|
0:15:02 | one explanation seems to be that |
---|
0:15:05 | the robustness of the whole system is lost, so it's not working, at least |
---|
0:15:10 | for |
---|
0:15:12 | this corpus |
---|
0:15:16 | next, we combine the enhanced features with the originals to see if we can take advantage of them |
---|
0:15:22 | as complementary to the original features |
---|
0:15:25 | note that "orig" just means the original, unenhanced condition, |
---|
0:15:30 | and "enh" means the enhanced version of all the data |
---|
0:15:37 | in the column |
---|
0:15:39 | where you see "orig + enh" for the PLDA, it means |
---|
0:15:43 | that both versions |
---|
0:15:45 | are included in the |
---|
0:15:48 | PLDA training, |
---|
0:15:51 | and including the original features there, alongside the enhanced data, |
---|
0:15:57 | seems to be making things worse |
---|
0:16:01 | but |
---|
0:16:02 | when we combine these features in the x-vector training set, |
---|
0:16:06 | it actually gives much better performance; it seems like the network now sees double the data, and |
---|
0:16:12 | there is also complementary |
---|
0:16:15 | information in the |
---|
0:16:17 | enhanced features, so they |
---|
0:16:19 | can be used to bootstrap the extractor |
---|
0:16:22 | if, on top of that, we add the enhanced features to the PLDA training as well |
---|
0:16:26 | as to the x-vector training, it doesn't help |
---|
0:16:32 | so this confirms that the PLDA is not suited to enhancement processing; |
---|
0:16:37 | it is better to just put the enhanced features in the x-vector training |
---|
0:16:41 | set, not in the PLDA |
---|
0:16:46 | now we see what happens if we remove one type of noise class from the x-vector network |
---|
0:16:51 | or the |
---|
0:16:54 | enhancement data |
---|
0:16:56 | so let's focus on the part of this table which is about the |
---|
0:17:01 | music noise class |
---|
0:17:05 | looking at the last column, we see that if we |
---|
0:17:10 | skip |
---|
0:17:12 | using the music files in x-vector training, and we also don't use enhancement, we actually |
---|
0:17:18 | do better than the baseline, which means |
---|
0:17:20 | that |
---|
0:17:21 | removing music is good: this noise class actually hurts performance |
---|
0:17:29 | next, "unseen" means we use enhancement, but the |
---|
0:17:34 | enhancement network has not seen music; |
---|
0:17:36 | it's still able to improve on the previous row, which is somewhat surprising |
---|
0:17:41 | most interestingly, |
---|
0:17:43 | when we use the |
---|
0:17:44 | "seen" condition, which is |
---|
0:17:46 | when the enhancement network has seen music, it is the best |
---|
0:17:51 | so it seems like some noise classes are |
---|
0:17:55 | harmful, and rather |
---|
0:17:57 | than |
---|
0:17:59 | including them in the x-vector training, |
---|
0:18:02 | it is better to include them in the |
---|
0:18:05 | enhancement |
---|
0:18:07 | training data |
---|
0:18:11 | next, to see if we can do dereverberation with the deep feature loss, we try seven |
---|
0:18:16 | schemes: |
---|
0:18:17 | cascade schemes, where a dereverberation stage is followed by a denoising stage; schemes trying to do |
---|
0:18:25 | dereverberation and denoising in a |
---|
0:18:27 | joint fashion; |
---|
0:18:29 | and also a single-stage fashion, which is denoted by "joint one-stage" |
---|
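To clarify the difference between the schemes, a small sketch with hypothetical network callables (`dereverb_net`, `denoise_net`, `joint_net`):

```python
# Sketch contrasting cascade and joint one-stage processing.
def cascade(dereverb_net, denoise_net, noisy_reverberant):
    # Two stages: dereverberate first, then denoise the intermediate output.
    return denoise_net(dereverb_net(noisy_reverberant))

def joint_one_stage(joint_net, noisy_reverberant):
    # One network trained on (noisy + reverberant) -> clean pairs,
    # handling both distortions at once.
    return joint_net(noisy_reverberant)
```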
0:18:33 | if you view all these numbers, |
---|
0:18:37 | you can see that |
---|
0:18:39 | the dereverberation is not actually working |
---|
0:18:42 | we also suspect it's possible that we have not found the right configuration; nevertheless, it |
---|
0:18:50 | seems that |
---|
0:18:53 | using |
---|
0:18:56 | a dedicated pre-processing step for dereverberation may be a better strategy |
---|
0:19:02 | finally, to conclude: you need to use all auxiliary network |
---|
0:19:08 | layers for this type of training |
---|
0:19:12 | use WADA SNR based filtering to keep the highest-SNR utterances from |
---|
0:19:16 | the corpus |
---|
0:19:17 | to construct the clean data for enhancement network training |
---|
0:19:21 | mismatching the features between the enhancement and auxiliary networks |
---|
0:19:25 | is slightly worse; it is better to use the same features |
---|
0:19:29 | we see that the PLDA does not really |
---|
0:19:33 | benefit; it's very susceptible to enhanced data, but we can put this data in the x-vector |
---|
0:19:38 | training instead |
---|
0:19:39 | some noise types, like music, are harmful in the x-vector training data |
---|
0:19:45 | and finally, dereverberation is not working for us |
---|
0:19:50 | using this |
---|
0:19:52 | type of training scheme |
---|
0:19:54 | so that is the end of the presentation; please feel free to send questions our |
---|
0:19:58 | way — thank you |
---|