0:00:13 | Hello, my name is [inaudible], |
0:00:16 | and I would like to present our contribution regarding the utilization of the VOiCES corpus |
0:00:21 | for multi-channel speaker verification. |
0:00:27 | According to research papers, there is growing interest |
0:00:32 | in multi-channel speaker verification, |
0:00:34 | but the number of datasets is still limited. |
0:00:40 | Therefore, we wanted to use |
0:00:43 | the VOiCES data |
0:00:45 | for the evaluation of multi-channel speaker verification systems. The objectives of our work |
0:00:52 | are as follows. |
0:00:55 | We analyzed the original trial lists defined for the VOiCES challenge. |
0:01:01 | We redefined them |
0:01:03 | so that multi-channel speaker verification systems can make use of them. |
0:01:10 | Since we created new trial lists, |
0:01:13 | multi-channel trial lists, |
0:01:15 | we verified that they are robust, |
0:01:18 | and we also assessed the use of the VOiCES data for training our subsystems. |
0:01:26 | So, because we wanted to create a multi-channel trial set, |
0:01:31 | we first needed to analyze the original trial set defined for the VOiCES challenge. |
0:01:39 | We can see, first of all, that every set of recordings |
0:01:45 | was recorded |
0:01:47 | in a different room. |
0:01:49 | As regards noise conditions, |
0:01:51 | we can see that test recordings were recorded with background noise: |
0:01:57 | it was either babble noise, |
0:02:00 | television noise, |
0:02:02 | or music, and also without any distractor. |
0:02:06 | Enrollment recordings |
0:02:07 | were recorded without any distractor noise, |
0:02:10 | so there is just room reverberation and background noise. |
0:02:15 | And we can see that part of the |
0:02:19 | enrollment data for evaluation |
0:02:22 | was taken from the original source data. |
0:02:28 | As regards microphones, |
0:02:30 | enrollment recordings were recorded with two microphones, |
0:02:34 | and test recordings with eight or eleven microphones. |
0:02:39 | These numbers |
0:02:40 | will be quite important for us. |
0:02:44 | In terms of speakers, we can see that there are some unique speakers in the |
0:02:50 | enrollment and test portions. |
0:02:53 | Overall, we have about one hundred speakers in enrollment, both for evaluation and development. |
0:03:02 | For development, |
0:03:04 | we have many more speakers in the test set than in the enrollment set. |
0:03:13 | Regarding utterances, |
0:03:15 | utterances are disjoint between enrollment and test. |
0:03:21 | Also, speakers in the development set are different from those in the evaluation set. |
0:03:30 | So, we wanted to create multi-channel trials, |
0:03:35 | so we analyzed the original ones and realized |
0:03:40 | that for every enrollment recording |
0:03:43 | there are always multiple test recordings |
0:03:46 | containing the same utterance, the same noise, speaker, |
0:03:51 | and room, |
0:03:52 | but recorded with a different microphone. |
0:03:56 | And this is what we made use of. |
0:04:00 | So, while creating our multi-channel trials, we used single-channel enrollment, |
0:04:06 | and in terms of test recordings, |
0:04:10 | we grouped several recordings to create a microphone array. |
0:04:16 | So now we will look into the creation of the test portions of the development and evaluation sets. |
0:04:24 | For the development set, |
0:04:27 | we can see that for every enrollment utterance |
0:04:31 | there are always eight |
0:04:34 | test recordings containing |
0:04:37 | basically the same utterance, |
0:04:39 | recorded over different microphones. |
0:04:43 | The numbers shown here |
0:04:46 | represent random selections of recordings. |
0:04:51 | We decided to always group four |
0:04:54 | recordings |
0:04:55 | into one microphone array. |
0:04:58 | That means |
0:05:01 | that instead of eight trials, we obtained two trials, |
0:05:06 | meaning that we reduced the number of trials from four million to one million. |
0:05:14 | For the evaluation set, |
0:05:16 | we have eleven |
0:05:18 | recordings for every enrollment utterance. |
0:05:22 | We again grouped four recordings together, and we were left with three more utterances. |
0:05:32 | Therefore, we randomly added one utterance from those already grouped to complete the third array. |
0:05:40 | This reduced the number of trials from 3.15 million to nine hundred |
0:05:46 | eighty thousand. |
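To make the grouping concrete, here is a minimal Python sketch of how test recordings sharing an utterance, speaker, noise type, and room could be bundled into four-microphone pseudo-arrays, including the completion of a leftover group as described above. The dictionary field names are illustrative assumptions, not the actual VOiCES metadata schema.

```python
import random
from collections import defaultdict

def build_arrays(recordings, group_size=4, seed=0):
    """Group test recordings that share utterance, speaker, noise, and room
    (but differ in microphone) into pseudo-arrays of `group_size` channels."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for rec in recordings:
        key = (rec["utterance"], rec["speaker"], rec["noise"], rec["room"])
        buckets[key].append(rec)

    arrays = []
    for recs in buckets.values():
        rng.shuffle(recs)
        groups = [recs[i:i + group_size] for i in range(0, len(recs), group_size)]
        if len(groups[-1]) < group_size:
            if len(groups) == 1:
                continue  # not enough channels to form even one array
            # e.g. 3 leftovers out of 11 mics: borrow already-used recordings
            used = [r for g in groups[:-1] for r in g]
            groups[-1] += rng.sample(used, group_size - len(groups[-1]))
        arrays.extend(groups)
    return arrays
```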
0:05:50 | We tried not only creating |
0:05:53 | the development |
0:05:55 | and evaluation sets, but we also tried creating |
0:05:58 | training data. |
0:06:01 | Our multi-channel training dataset is based on the full list of recordings from rooms one and |
0:06:08 | two. |
0:06:10 | We completely excluded recordings from rooms three and four, |
0:06:15 | because, as we have seen, all original evaluation utterances were recorded in rooms three and four. |
0:06:23 | We also held out the development data, because they were recorded in rooms one and two. |
0:06:31 | Then we again grouped the recordings based on their content, and we obtained microphone |
0:06:39 | arrays containing four microphones. |
0:06:43 | So, |
0:06:45 | the result was a training dataset comprising 57.8 thousand examples with speech |
0:06:52 | of two hundred speakers. |
0:06:55 | So it is clear that there is a caveat here, |
0:06:58 | because this dataset is similar to the development dataset in terms of speakers and also |
0:07:06 | acoustic conditions. |
0:07:08 | But this was already the case with |
0:07:11 | the original dataset. |
0:07:13 | So now we have all three: |
0:07:16 | the development and evaluation sets, |
0:07:18 | and also the training set. |
0:07:21 | Now let's move to the explanation of our multi-channel approach to speaker verification. |
0:07:26 | So, we used a standard system. |
0:07:30 | It contains a front end, which is a beamformer performing enhancement, |
0:07:37 | and then the single-channel output goes to an x-vector extractor, |
0:07:42 | and the embeddings are scored using PLDA. |
0:07:47 | So this is a very standard pipeline, |
0:07:50 | but our goal was not to propose a novel system, |
0:07:55 | but rather to assess the use of the VOiCES data. |
0:08:02 | For beamforming, we were able to make use of the original VOiCES training data. |
0:08:09 | We also tried using simulated data, and I will explain why and when later in the presentation. |
0:08:16 | The VOiCES training dataset is quite small, and therefore we couldn't use it for training |
0:08:23 | the x-vector extractor. |
0:08:26 | It means that we used VoxCeleb for training of the x-vector extractor and also |
0:08:33 | for PLDA. |
0:08:35 | For front-end processing, |
0:08:37 | we used the GEV, |
0:08:39 | generalized eigenvalue, beamformer. |
0:08:42 | This beamformer utilizes |
0:08:47 | second-order statistics |
0:08:49 | and produces a single-channel output. |
0:08:54 | So first, we need to compute, or estimate, the speech cross-power spectral density matrix |
0:09:01 | and the noise one. |
0:09:04 | These two matrices |
0:09:05 | go to the GEV solver, |
0:09:08 | which is a generalized eigenvalue decomposition. |
0:09:12 | The principal eigenvector is then used to construct the beamformer weights. |
0:09:18 | The beamformer is applied to the multi-channel input, |
0:09:21 | and we obtain a single-channel output. |
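A minimal sketch of the GEV step described here, assuming STFT input of shape [channels, frames, freq] and precomputed speech/noise PSD matrices (the function names and diagonal loading are illustrative, not the authors' exact implementation):

```python
import numpy as np
from scipy.linalg import eigh

def gev_weights(psd_speech, psd_noise):
    """Per-frequency GEV weights: the principal generalized eigenvector of
    (Phi_speech, Phi_noise). PSDs have shape [freq, channels, channels]."""
    n_freq, n_ch, _ = psd_speech.shape
    w = np.zeros((n_freq, n_ch), dtype=complex)
    for f in range(n_freq):
        # small diagonal loading keeps the noise PSD positive definite
        phi_n = psd_noise[f] + 1e-10 * np.eye(n_ch)
        # eigh solves Phi_s v = lambda Phi_n v; eigenvalues come in ascending order
        _, vecs = eigh(psd_speech[f], phi_n)
        w[f] = vecs[:, -1]  # eigenvector of the largest eigenvalue
    return w

def apply_beamformer(w, stft):
    """Project a multi-channel STFT [channels, frames, freq] to one channel:
    y(t, f) = w(f)^H x(t, f)."""
    return np.einsum("fc,ctf->tf", w.conj(), stft)
```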
0:09:25 | In order |
0:09:26 | to estimate the speech and noise PSD matrices, |
0:09:30 | we use a neural network. |
0:09:34 | We have |
0:09:35 | a single network, |
0:09:39 | and it is applied to all of the channels. |
0:09:45 | Given an input, |
0:09:47 | this neural network is supposed to output a mask for speech and a mask for noise. |
0:09:54 | The resulting masks |
0:09:57 | are applied to the input spectra, |
0:10:01 | and the noise and speech PSD matrices are estimated. |
0:10:06 | This pipeline is differentiable, as we showed in our previous work. |
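A minimal sketch of mask-based PSD estimation as in this kind of pipeline (shapes and the averaging scheme are standard for mask-driven GEV beamforming, but the exact implementation here is an assumption):

```python
import numpy as np

def estimate_psd(stft, mask):
    """stft: [channels, frames, freq], complex; mask: [frames, freq] in [0, 1].
    Returns PSD matrices of shape [freq, channels, channels]."""
    # weight each time-frequency outer product by the mask, then normalize
    weighted = np.einsum("tf,ctf,dtf->fcd", mask, stft, stft.conj())
    norm = np.maximum(mask.sum(axis=0), 1e-10)  # per-frequency mask mass
    return weighted / norm[:, None, None]

# the speech and noise PSDs come from the two network outputs, e.g.:
# psd_s = estimate_psd(stft, mask_speech)
# psd_n = estimate_psd(stft, mask_noise)
```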
0:10:12 | The architecture of this model is pretty simple: it contains a couple of linear |
0:10:20 | layers, |
0:10:21 | and then there are two output layers, |
0:10:24 | one encoding a mask for speech, and the other one a mask for noise. |
0:10:32 | In our experiments, we will refer to two models, |
0:10:37 | but essentially they are the same; |
0:10:40 | what is different is the way of training. |
0:10:44 | So for the BCE model, |
0:10:46 | we train |
0:10:48 | the weights of the mask-estimation system |
0:10:51 | directly, by optimizing the output masks. |
0:10:56 | Therefore, |
0:10:58 | we first compute |
0:11:01 | ideal binary masks, |
0:11:03 | and then we minimize binary cross-entropy between the outputs and these masks. |
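A minimal sketch of the ideal-binary-mask target and the BCE objective, assuming separately available speech and noise magnitude spectrograms (the threshold and epsilon values are illustrative assumptions):

```python
import numpy as np

def ideal_binary_mask(speech_mag, noise_mag, threshold_db=0.0):
    """1 where speech dominates noise by `threshold_db`, else 0."""
    snr_db = 20.0 * np.log10(speech_mag / np.maximum(noise_mag, 1e-10) + 1e-10)
    return (snr_db > threshold_db).astype(np.float32)

def bce(predicted_mask, target_mask, eps=1e-7):
    """Binary cross-entropy between the predicted and ideal masks."""
    p = np.clip(predicted_mask, eps, 1.0 - eps)
    return -np.mean(target_mask * np.log(p)
                    + (1.0 - target_mask) * np.log(1.0 - p))
```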
0:11:10 | So, in order to compute ideal binary masks, |
0:11:14 | we need to know speech and noise separately. |
0:11:17 | That means that we cannot use the VOiCES dataset, and we need simulated data for training. |
0:11:27 | To create such a simulated dataset, we used the same utterances as in the multi-channel VOiCES |
0:11:34 | dataset, |
0:11:36 | and we performed the simulation using the image source method, |
0:11:42 | and we added |
0:11:44 | noises |
0:11:45 | which were also used in |
0:11:49 | the VOiCES dataset. |
0:11:53 | For the MSE model, |
0:11:55 | we optimize the output of the beamformer. |
0:12:00 | Therefore, we minimize |
0:12:02 | the MSE between the output |
0:12:05 | and clean speech. |
0:12:08 | In this case, we can use the multi-channel VOiCES training data, |
0:12:13 | because we have |
0:12:14 | the distorted audio, |
0:12:16 | and the clean speech, which is taken from LibriSpeech. |
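A minimal sketch of the MSE objective on the beamformer output, using the clean source signal as the reference (complex STFTs assumed; the exact domain of the loss in the authors' system is not specified here):

```python
import numpy as np

def mse_loss(beamformed, clean):
    """Mean squared error between the (complex) beamformed STFT and the
    clean-speech STFT, e.g. the LibriSpeech source signal."""
    return np.mean(np.abs(beamformed - clean) ** 2)
```

Because the mask-to-PSD-to-GEV pipeline is differentiable, this loss can be backpropagated all the way to the mask-estimation network.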
0:12:22 | So much for the explanation of our architecture; now we will turn to the experiments. |
0:12:31 | For reference, |
0:12:32 | we show results for the so-called single channel. |
0:12:37 | In this case, we used the original trial lists |
0:12:42 | defined for the VOiCES challenge, |
0:12:45 | and we evaluated our x-vector extractor. |
0:12:50 | Our baseline is BeamformIt, which is a well-established tool for beamforming. |
0:12:59 | The results are shown here. |
0:13:03 | Then we tried assessing the BCE and MSE models |
0:13:08 | using the same trial lists |
0:13:11 | as for BeamformIt. |
0:13:14 | It is worth mentioning that |
0:13:17 | the single channel cannot be readily compared with BeamformIt, because the number of trials |
0:13:23 | is different. |
0:13:26 | Then we tried assessing the performance of the BCE and MSE models. |
0:13:33 | We can see that the BCE model attains |
0:13:36 | better results than the BeamformIt baseline. |
0:13:40 | However, the performance of the MSE model is quite poor. |
0:13:46 | We hypothesize that it is much more difficult |
0:13:50 | to train the neural network to output correct masks for speech and |
0:13:56 | noise just by minimizing the MSE of the beamformer output. |
0:14:03 | Moreover, |
0:14:04 | there is more variability in the training data for the BCE model |
0:14:09 | than in the training data for the MSE model; |
0:14:14 | all of the training data for the MSE model come |
0:14:16 | from the VOiCES dataset. |
0:14:20 | Further, |
0:14:21 | we can see that |
0:14:22 | the BCE model generalizes better |
0:14:25 | than the MSE model, |
0:14:27 | and this is, |
0:14:29 | again, because of the variability in the data. |
0:14:34 | Then we tried to improve the MSE model, |
0:14:38 | but still using the VOiCES dataset and no external data. |
0:14:43 | So what |
0:14:44 | we made use of is augmentation, |
0:14:47 | specifically a proposed variant of SpecAugment, where we apply the masks directly to the spectra. |
0:14:56 | More specifically, we have five frequency masks |
0:15:00 | and two time masks. |
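A minimal sketch of SpecAugment-style masking applied to an input spectrogram; the mask widths and the exact counts here are illustrative assumptions, not the authors' settings:

```python
import numpy as np

def spec_augment(spec, n_freq_masks=5, n_time_masks=2,
                 max_f=8, max_t=20, seed=None):
    """spec: [frames, freq]. Zero out random frequency and time stripes."""
    rng = np.random.default_rng(seed)
    out = spec.copy()
    n_t, n_f = out.shape
    for _ in range(n_freq_masks):
        f = int(rng.integers(0, max_f + 1))       # stripe width (can be 0)
        f0 = int(rng.integers(0, max(n_f - f, 1)))
        out[:, f0:f0 + f] = 0.0
    for _ in range(n_time_masks):
        t = int(rng.integers(0, max_t + 1))
        t0 = int(rng.integers(0, max(n_t - t, 1)))
        out[t0:t0 + t, :] = 0.0
    return out
```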
0:15:03 | We can see that we were able to improve |
0:15:08 | the performance of the MSE model quite substantially. |
0:15:12 | We can also observe that the performance is now better than the baseline performance. |
0:15:19 | We also tried using SpecAugment for the BCE model, |
0:15:23 | and again see some improvement, |
0:15:26 | but the improvement is not as large |
0:15:30 | as for the MSE model, which is good news for us. |
0:15:35 | So much for the |
0:15:37 | first experiment, |
0:15:38 | and let's turn to the second |
0:15:43 | experiment, |
0:15:44 | aimed at assessing the performance of individual microphones. |
0:15:49 | We hypothesized that some of the microphones can perform poorly |
0:15:55 | when they are used in multiple microphone arrays. |
0:16:00 | In this case, |
0:16:01 | the microphones |
0:16:02 | can be far from each other, as opposed to conventional small microphone arrays. |
0:16:09 | And we thought that maybe poorly performing microphones could degrade the overall performance greatly, |
0:16:18 | and that it might be useful to exclude them from the trials. |
0:16:24 | So, to assess this, |
0:16:27 | we first needed to assess single microphones. |
0:16:31 | So we returned to |
0:16:35 | the original trial lists, |
0:16:37 | and then we created microphone-specific trial lists, |
0:16:42 | where, as you can see, |
0:16:45 | enrollment recordings are always the same, |
0:16:48 | and test recordings correspond to the microphone |
0:16:53 | that recorded |
0:16:55 | the specific utterance. |
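A minimal sketch of deriving a microphone-specific trial list from the original one; the trial field names are illustrative assumptions:

```python
def mic_specific_trials(trials, mic_id):
    """Keep enrollment as-is; keep only trials whose test recording
    was captured by microphone `mic_id`."""
    return [t for t in trials if t["test_mic"] == mic_id]

# per-microphone evaluation then simply loops over all microphones:
# results = {m: evaluate(mic_specific_trials(trials, m)) for m in mics}
```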
0:16:59 | These are the results that we obtained, |
0:17:02 | and we can see that the best microphone, |
0:17:06 | our best performing microphone, lies in front of the loudspeaker. |
0:17:11 | The worst microphone is the microphone with number twelve, which was fully obstructed, |
0:17:18 | and another poor microphone is the one that is far |
0:17:23 | from the loudspeaker, number six. |
0:17:26 | We can see that there is quite some difference between the best and worst microphones. |
0:17:32 | This is even more pronounced |
0:17:35 | for the evaluation set, |
0:17:38 | where we can see that the best performing microphone |
0:17:41 | attains a performance of 2.28 |
0:17:45 | percent EER. |
0:17:48 | The worst performing microphone is almost seven times worse than the best one. |
0:17:54 | Again, microphone number twelve was fully obstructed, |
0:17:59 | and microphone number six is far from the loudspeaker. |
0:18:05 | Then we tried excluding those microphones |
0:18:08 | from the trials. |
0:18:11 | As expected, the numbers that we got are better, |
0:18:17 | but, what is more important, |
0:18:18 | the difference is not large. |
0:18:22 | So |
0:18:23 | we decided not to exclude any |
0:18:26 | microphones from the trials. |
0:18:30 | This concludes the experimental part of our presentation, |
0:18:33 | and now let's move |
0:18:34 | to the outcomes |
0:18:36 | of our work. |
0:18:39 | We adopted the VOiCES definition of trials |
0:18:43 | and created trial lists for development and evaluation of multi-channel speaker verification. |
0:18:49 | We are aware of the fact that we reduced the number of trials quite substantially, |
0:18:56 | but we verified that the results obtained with the trial lists are reliable. |
0:19:02 | Details on that can be found in our paper. |
0:19:07 | We have identified several |
0:19:09 | issues, |
0:19:10 | such as the small number of speakers and the small variability in the acoustic environments |
0:19:17 | and channels, |
0:19:18 | and we tackled these problems via data augmentation. |
0:19:24 | In our set of experiments, we have confirmed that even with a dataset of |
0:19:30 | this size, |
0:19:31 | and without external data, |
0:19:33 | we can achieve interesting results |
0:19:36 | and carry out research in the field of multi-channel speaker verification. |
0:19:42 | Thank you for your attention. |