0:00:14 | hello everyone, i'm jahangir alam from crim (computer research institute of montreal) |
---|
0:00:20 | today i want to talk about our work on |
---|
0:00:24 | the analysis of the abc submission |
---|
0:00:26 | to the nist sre 2019 cmn2 and vast challenges |
---|
0:00:31 | in this work i'm going to provide an overview of the abc submission |
---|
0:00:36 | for |
---|
0:00:38 | nist sre 2019 by |
---|
0:00:41 | brno university of technology, crim montreal, |
---|
0:00:45 | phonexia, and omilia |
---|
0:00:55 | this is the outline of my talk |
---|
0:01:00 | i will |
---|
0:01:04 | start with an introduction to the tasks, and |
---|
0:01:07 | then talk about speaker verification on conversational telephone speech |
---|
0:01:14 | which is the cmn2 task |
---|
0:01:16 | then i'll talk about multimedia speaker verification on vast |
---|
0:01:21 | that is, employing |
---|
0:01:22 | audio and face biometric traits |
---|
0:01:25 | finally i'm going to draw my conclusions |
---|
0:01:31 | introduction |
---|
0:01:34 | in |
---|
0:01:35 | the 2019 edition of nist sre |
---|
0:01:38 | there are two tasks |
---|
0:01:41 | one task is |
---|
0:01:43 | the |
---|
0:01:45 | speaker verification on conversational telephone speech (cmn2), where there is a domain mismatch between |
---|
0:01:51 | train and test settings, mainly due to |
---|
0:01:55 | the difference in languages, i mean |
---|
0:01:59 | in |
---|
0:02:00 | the training data speakers mostly speak english, whereas the test data is in arabic |
---|
0:02:06 | the second task is the multimedia speaker recognition, or vast |
---|
0:02:11 | (video annotation for speech technology), and the main challenge here is the multi-speaker |
---|
0:02:15 | test recordings |
---|
0:02:17 | there are two sub-tasks in the vast task |
---|
0:02:22 | one is the verification of a speaker on audio, i.e. using |
---|
0:02:27 | the audio trait only, and the other |
---|
0:02:29 | sub-task is the verification of a speaker employing both audio |
---|
0:02:34 | and face biometric traits |
---|
0:02:36 | in this work we present the systems developed by the abc team |
---|
0:02:43 | to tackle the challenges introduced in both |
---|
0:02:46 | the cmn2 and the vast tasks of nist sre 2019 |
---|
0:02:51 | and we provide some analyses of the results |
---|
0:02:59 | data preparation: the original data used for training the speaker discriminant neural networks are nist sre |
---|
0:03:06 | (i.e. |
---|
0:03:07 | 2004 to 2010), fisher english, |
---|
0:03:11 | all switchboard, and voxceleb 1 and 2 |
---|
0:03:15 | augmented data is created using musan |
---|
0:03:20 | and room impulse responses from openslr, and also using compression |
---|
0:03:27 | from the generated augmented data |
---|
0:03:31 | only |
---|
0:03:32 | 500k recordings were selected as augmented data |
---|
0:03:37 | and added to the original data |
---|
0:03:39 | in order to increase the amount and diversity of the training data |
---|
0:03:44 | after filtering based on a minimum speech duration |
---|
0:03:49 | in this case 5 seconds after vad |
---|
0:03:52 | and a minimum number of utterances per speaker, in this case 5 utterances per |
---|
0:03:58 | speaker |
---|
0:04:00 | there are approximately |
---|
0:04:02 | 7 thousand speakers in the training data |
---|
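the filtering step described above can be sketched as follows. this is a minimal sketch, assuming a hypothetical `(utt_id, speaker_id, speech_duration_sec)` list format; the 5-second and 5-utterances-per-speaker thresholds are the ones stated in the talk.

```python
# Hypothetical sketch of training-list filtering: keep utterances with
# at least 5 s of speech after VAD, then keep only speakers that still
# have at least 5 surviving utterances.

MIN_DUR_SEC = 5.0
MIN_UTTS_PER_SPK = 5

def filter_training_list(utts):
    """utts: list of (utt_id, speaker_id, speech_duration_sec)."""
    long_enough = [u for u in utts if u[2] >= MIN_DUR_SEC]
    counts = {}
    for _, spk, _ in long_enough:
        counts[spk] = counts.get(spk, 0) + 1
    return [u for u in long_enough if counts[u[1]] >= MIN_UTTS_PER_SPK]

# Toy example:
utts = ([(f"a{i}", "spk1", 6.0) for i in range(5)]    # kept: 5 long utterances
        + [(f"b{i}", "spk2", 6.0) for i in range(3)]  # dropped: too few utterances
        + [("c0", "spk1", 2.0)])                      # dropped: too short
kept = filter_training_list(utts)
```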
0:04:06 | the data used for backend training is nist sre 2004 |
---|
0:04:10 | to 2010, having approximately |
---|
0:04:14 | 66 thousand recordings |
---|
0:04:17 | the adaptation set is based on |
---|
0:04:22 | the sre 2018 development set together with 60 percent |
---|
0:04:26 | of the sre 18 |
---|
0:04:28 | eval set |
---|
0:04:29 | there are in total |
---|
0:04:31 | about a thousand recordings from 137 speakers |
---|
0:04:36 | part of the eval set is thus part of the adaptation set, and the sre |
---|
0:04:41 | 18 |
---|
0:04:43 | unlabeled data were used for score normalization |
---|
0:04:47 | and as the development |
---|
0:04:49 | test set we used 40 percent of the eval data, i mean the remaining |
---|
0:04:53 | 40 percent of the eval set |
---|
0:04:59 | feature extraction: |
---|
0:05:02 | as local features we use 40-dimensional filterbank or 22-dimensional mfcc features |
---|
0:05:08 | extracted |
---|
0:05:09 | over 25-millisecond windows with a frame shift of 10 milliseconds |
---|
0:05:15 | for feature normalization, short-term cepstral mean normalization was used with a sliding window of 3 |
---|
0:05:21 | seconds |
---|
0:05:22 | and non-speech frames are removed with an energy-based vad |
---|
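the short-term cepstral mean normalization mentioned above can be sketched as follows; a minimal sketch, assuming a 10 ms frame shift so a 3 s window corresponds to roughly 300 frames.

```python
import numpy as np

# Sketch of short-term cepstral mean normalization (CMN) over a sliding
# window: subtract the local mean computed on a window centered on each
# frame (3 s ~ 300 frames at a 10 ms frame shift).

def sliding_cmn(feats, window=300):
    """feats: (num_frames, feat_dim) array of local features."""
    n = len(feats)
    out = np.empty_like(feats, dtype=float)
    half = window // 2
    for t in range(n):
        lo, hi = max(0, t - half), min(n, t + half + 1)
        out[t] = feats[t] - feats[lo:hi].mean(axis=0)
    return out

feats = np.random.randn(500, 40) + 3.0   # fake 40-dim filterbank features
normed = sliding_cmn(feats)
```

a constant channel offset is removed exactly, which is the point of the normalization.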
0:05:28 | the general pipeline that has been adopted for |
---|
0:05:31 | speaker verification on the cmn2 task |
---|
0:05:34 | is |
---|
0:05:35 | as |
---|
0:05:36 | follows |
---|
0:05:38 | the current trend in speaker verification is to use |
---|
0:05:41 | deep speaker embeddings |
---|
0:05:43 | with a plda backend |
---|
0:05:46 | where the speaker embeddings are extracted using a speaker discriminant neural network |
---|
0:05:52 | which is normally trained to discriminate among a set of training speakers |
---|
0:05:57 | and the network is normally supervised by some variant of a classification loss |
---|
0:06:03 | such as softmax, or a metric learning loss function |
---|
0:06:08 | in this |
---|
0:06:09 | case, for the cmn2 task, we use four speaker discriminant neural networks trained with four |
---|
0:06:14 | different architectures |
---|
0:06:16 | as a backend we use either gaussian plda or heavy-tailed plda |
---|
0:06:20 | models |
---|
0:06:22 | evaluation embeddings are centered using the mean of the adaptation set |
---|
0:06:28 | whereas backend training set embeddings are centered using the mean of |
---|
0:06:34 | the same set |
---|
0:06:38 | training embeddings are adapted to the target domain using |
---|
0:06:41 | feature distribution adaptation, and finally we use unsupervised plda adaptation |
---|
0:06:49 | over the plda model, which is |
---|
0:06:52 | trained on unadapted and undistorted speaker embeddings |
---|
0:06:56 | for score normalization |
---|
0:06:58 | adaptive s-norm is used instead of conventional s-norm |
---|
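the adaptive symmetric score normalization step mentioned above can be sketched as follows. this is a minimal sketch, assuming precomputed raw scores of the enrollment and test sides against a cohort of recordings; the top-k cohort selection is what makes it "adaptive".

```python
import numpy as np

# Sketch of adaptive symmetric score normalization (as-norm): normalize
# a raw trial score by the mean/std of each side's top-k cohort scores,
# and average the two normalized scores.

def as_norm(raw, enroll_cohort, test_cohort, top_k=200):
    """raw: raw trial score; *_cohort: 1-D arrays of cohort scores."""
    def stats(scores):
        top = np.sort(scores)[-top_k:]      # keep the top-k closest cohort scores
        return top.mean(), top.std()
    me, se = stats(enroll_cohort)
    mt, st = stats(test_cohort)
    return 0.5 * ((raw - me) / se + (raw - mt) / st)
```

when both sides see the same cohort, the result reduces to a plain z-normalized score.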
0:07:05 | now we will have a look at each individual system for the cmn2 task |
---|
0:07:14 | system 1 uses a standard 50-layer resnet architecture for training the |
---|
0:07:19 | speaker discriminant neural network over filterbank features |
---|
0:07:23 | and |
---|
0:07:25 | gaussian plda is used for the scoring |
---|
0:07:29 | two-dimensional convolution is used, as we are using filterbank features |
---|
0:07:35 | and to obtain a global representation from the local |
---|
0:07:39 | features |
---|
0:07:40 | statistics pooling is used |
---|
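the statistics pooling step can be sketched as follows; a minimal sketch, with the frame dimension and feature dimension chosen arbitrarily for illustration.

```python
import numpy as np

# Sketch of statistics pooling: frame-level activations are summarized
# by their per-dimension mean and standard deviation, producing a single
# fixed-size utterance-level vector regardless of utterance length.

def statistics_pooling(frames, eps=1e-10):
    """frames: (num_frames, dim) -> (2 * dim,) pooled vector."""
    mean = frames.mean(axis=0)
    std = np.sqrt(frames.var(axis=0) + eps)  # eps avoids sqrt(0) gradients
    return np.concatenate([mean, std])

pooled = statistics_pooling(np.random.randn(300, 1500))  # -> 3000-dim vector
```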
0:07:44 | in this system |
---|
0:07:45 | for training the plda model |
---|
0:07:47 | additional training data are used from the sre 2016 evaluation data |
---|
0:07:52 | which contain ten thousand recordings from |
---|
0:07:57 | 201 speakers |
---|
0:08:01 | post-processing is applied on the extracted embeddings as defined in the |
---|
0:08:06 | general pipeline |
---|
0:08:10 | next |
---|
0:08:11 | system 2 employs a factorized tdnn architecture for training the speaker discriminant |
---|
0:08:17 | neural network |
---|
0:08:20 | the kaldi sre16 recipe was used in this case, and the network was trained |
---|
0:08:25 | for six epochs |
---|
0:08:28 | as backend, heavy-tailed plda is used, following the general pipeline |
---|
0:08:32 | that has been mentioned before |
---|
0:08:41 | for system 3, the dnn architecture selected to train the speaker discriminant neural network is the |
---|
0:08:48 | extended tdnn architecture, with a few residual connections added to its layers |
---|
0:08:53 | and the network |
---|
0:08:55 | was trained for two epochs |
---|
0:09:00 | in this case the extracted embeddings are |
---|
0:09:02 | 768-dimensional instead of 512 |
---|
0:09:07 | and the embeddings are denoised using a denoising |
---|
0:09:10 | autoencoder |
---|
0:09:12 | one-dimensional convolution is used over the mfcc features, and again statistics pooling is used |
---|
0:09:18 | for |
---|
0:09:19 | generating a |
---|
0:09:21 | global utterance-level |
---|
0:09:23 | representation |
---|
0:09:26 | similarly, as backend, heavy-tailed plda is used, following the general |
---|
0:09:30 | pipeline that has been mentioned before |
---|
0:09:37 | finally, in system 4, a similar tdnn architecture as in system 2 was used for |
---|
0:09:44 | training the speaker discriminant neural network |
---|
0:09:47 | but this network was trained only on the sre |
---|
0:09:50 | 2004 to 2010 english data |
---|
0:09:54 | and in this system mfcc features are used as front-end features |
---|
0:09:58 | a standard gradient reversal domain adversarial |
---|
0:10:02 | network is used on top of this model |
---|
0:10:06 | mainly to discriminate between the source and target |
---|
0:10:09 | domains |
---|
0:10:10 | the source domain here is english and the target domain is arabic |
---|
0:10:15 | the extracted embeddings in this case are 768-dimensional |
---|
0:10:20 | and as backend heavy-tailed plda is used, following the general pipeline mentioned |
---|
0:10:25 | before |
---|
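the gradient reversal idea behind the domain adversarial training can be sketched as follows. this is a conceptual sketch only (no real autograd framework): the layer is the identity in the forward pass and multiplies the gradient by -lambda in the backward pass, so the embedding network learns to confuse the domain classifier.

```python
import numpy as np

# Conceptual sketch of a gradient reversal layer (GRL): identity on the
# forward pass, negated (scaled) gradient on the backward pass.

class GradReverse:
    def __init__(self, lam=1.0):
        self.lam = lam
    def forward(self, x):
        return x                      # identity on the forward pass
    def backward(self, grad_out):
        return -self.lam * grad_out   # reversed gradient flows to the encoder

grl = GradReverse(lam=0.5)
x = np.array([1.0, 2.0])
y = grl.forward(x)                    # unchanged activations
g = grl.backward(np.array([0.2, -0.4]))
```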
0:10:31 | calibration and fusion: for the cmn2 task, calibration and fusion are trained with |
---|
0:10:37 | logistic regression on a development set |
---|
0:10:40 | and |
---|
0:10:41 | consistent performance is observed across the progress and eval sets |
---|
0:10:47 | which indicates that |
---|
0:10:50 | we achieved almost perfect |
---|
0:10:53 | calibration |
---|
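the logistic-regression calibration and fusion can be sketched as follows. a minimal sketch with synthetic scores: a weighted sum of the per-system scores plus an offset, trained with the cross-entropy loss on labeled target/non-target trials (the learning rate, iteration count, and toy data are illustrative assumptions).

```python
import numpy as np

# Sketch of score calibration/fusion with logistic regression: learn one
# weight per system plus a bias; the fused score is the weighted sum.

def train_fusion(scores, labels, lr=0.1, iters=2000):
    """scores: (num_trials, num_systems); labels in {0, 1}."""
    X = np.hstack([scores, np.ones((len(scores), 1))])  # append bias column
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))                # sigmoid
        w -= lr * X.T @ (p - labels) / len(labels)      # gradient step
    return w  # per-system weights + offset

# Toy example: two systems whose scores separate the two classes.
rng = np.random.default_rng(0)
tgt = rng.normal(2.0, 1.0, size=(200, 2))
non = rng.normal(-2.0, 1.0, size=(200, 2))
scores = np.vstack([tgt, non])
labels = np.concatenate([np.ones(200), np.zeros(200)])
w = train_fusion(scores, labels)
```

with well-separated toy scores the learned weights are positive and the fused scores rank targets above non-targets.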
0:10:56 | table 1 presents the results of the individual and fused systems on the dev and |
---|
0:11:02 | eval sets for the cmn2 |
---|
0:11:05 | task |
---|
0:11:06 | the single best system we found here is the f-tdnn |
---|
0:11:12 | with heavy-tailed plda combination |
---|
0:11:14 | the denoising did not help, but when fused with the other systems |
---|
0:11:21 | it resulted |
---|
0:11:22 | in a nice improvement in performance |
---|
0:11:26 | our fused system provided the best performance in this case |
---|
0:11:36 | in table 2 we present and compare system performance using different backends with the |
---|
0:11:42 | resnet |
---|
0:11:43 | and factorized |
---|
0:11:44 | tdnn architectures |
---|
0:11:47 | for the cmn2 task, plda backends are clearly the winners |
---|
0:11:52 | this is perhaps due to the domain mismatch |
---|
0:11:56 | between train and test settings |
---|
0:11:59 | in table 3 |
---|
0:12:01 | we show the performance when different post-processing was adopted |
---|
0:12:07 | over the extracted speaker embeddings |
---|
0:12:10 | from this table we can see that when the extracted embeddings undergo mean |
---|
0:12:15 | centering |
---|
0:12:17 | feature distribution adaptation, plda adaptation, and as-norm post-processing in combination |
---|
0:12:24 | this leads to the |
---|
0:12:26 | best performance |
---|
0:12:34 | finally, the vast task |
---|
0:12:38 | data preparation: |
---|
0:12:40 | the original data used in this case for training the speaker |
---|
0:12:45 | discriminant neural networks is mainly the voxceleb 2 development data |
---|
0:12:49 | which contains around six |
---|
0:12:52 | thousand speakers |
---|
0:12:54 | but for the tdnn system, voxceleb 1 and 2 |
---|
0:12:59 | librispeech |
---|
0:13:01 | and other data sets |
---|
0:13:04 | combined, which consist of around |
---|
0:13:07 | eleven thousand speakers, are used for training |
---|
0:13:13 | augmented data is created |
---|
0:13:17 | by using musan and room impulse responses from openslr |
---|
0:13:23 | and only part of the recordings from this augmented data |
---|
0:13:27 | was selected to add to the original data |
---|
0:13:33 | in order to increase the amount and diversity of the training data |
---|
0:13:37 | after filtering based on a minimum speech duration, in this case four seconds of |
---|
0:13:44 | voice activity detection |
---|
0:13:46 | and a minimum number of utterances per speaker, in this case eight utterances per speaker |
---|
0:13:52 | there are approximately six thousand speakers in the training data |
---|
0:13:56 | the data used for backend training is one hundred |
---|
0:14:00 | forty-five thousand utterances from the original training data |
---|
0:14:04 | the adaptation set is based on thirty-seven utterances from the sre 18 vast dev data |
---|
0:14:11 | a subset of the plda training data is used for |
---|
0:14:16 | score normalization using s-norm |
---|
0:14:19 | the development |
---|
0:14:20 | test |
---|
0:14:21 | set |
---|
0:14:22 | chosen for the audio-only sub-task is the sre 18 vast eval set |
---|
0:14:29 | whereas |
---|
0:14:31 | for the audiovisual task the development test set is the sre 19 |
---|
0:14:37 | audiovisual development set |
---|
0:14:44 | feature extraction: |
---|
0:14:47 | for the vast task, as local features, we use 40-dimensional filterbank |
---|
0:14:53 | or 23-dimensional plp features, extracted with |
---|
0:14:57 | a 25-millisecond window and a frame shift of 10 milliseconds |
---|
0:15:03 | for feature normalization we use short-term cepstral mean normalization with a sliding window of |
---|
0:15:09 | two seconds |
---|
0:15:11 | and |
---|
0:15:12 | non-speech frames are removed using an energy-based voice activity detector |
---|
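the energy-based voice activity detection can be sketched as follows. a minimal sketch, assuming frames of raw samples and a hypothetical 30 dB threshold below the utterance's maximum frame energy.

```python
import numpy as np

# Sketch of an energy-based VAD: frames whose log-energy falls more than
# threshold_db below the utterance maximum are dropped as non-speech.

def energy_vad(frames, threshold_db=30.0):
    """frames: (num_frames, samples_per_frame) -> boolean speech mask."""
    log_e = 10.0 * np.log10(np.maximum((frames ** 2).mean(axis=1), 1e-12))
    return log_e > log_e.max() - threshold_db

sil = np.zeros((5, 160))              # 5 silent frames
speech = 0.5 * np.ones((5, 160))      # 5 high-energy frames
mask = energy_vad(np.vstack([sil, speech]))
```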
0:15:17 | and for the vast |
---|
0:15:20 | audio-only task, the general pipeline is as follows |
---|
0:15:25 | we used three speaker discriminant neural networks trained with three different architectures in order to |
---|
0:15:32 | extract the |
---|
0:15:33 | speaker embeddings |
---|
0:15:36 | as backend |
---|
0:15:38 | we use gaussian plda or cosine scoring |
---|
0:15:43 | enrollment embeddings are centered using the mean of the backend training set |
---|
0:15:48 | and |
---|
0:15:49 | training embeddings are adapted to the |
---|
0:15:52 | target domain using feature distribution adaptation |
---|
0:15:56 | diarization is applied on the test set, and the final score is the maximum over |
---|
0:16:02 | the diarization clusters |
---|
0:16:05 | and |
---|
0:16:06 | the score is then normalized using s-norm |
---|
0:16:15 | individual systems and fusion: for the vast audio-only task we have three single systems in this |
---|
0:16:22 | case |
---|
0:16:23 | system 1 uses the standard |
---|
0:16:28 | resnet architecture, which is first pre-trained using the softmax loss |
---|
0:16:34 | and then it is fine-tuned using the |
---|
0:16:39 | additive angular margin loss function |
---|
0:16:42 | in this case filterbank is used as local features, and as backend gaussian |
---|
0:16:48 | plda and cosine scoring are used |
---|
0:16:51 | and for post-processing we follow the general pipeline that |
---|
0:16:56 | was mentioned before |
---|
0:16:59 | in system 2, a |
---|
0:17:02 | factorized tdnn architecture is used for training the speaker discriminant neural network |
---|
0:17:07 | and this network is trained using the |
---|
0:17:11 | kaldi sre16 recipe over voxceleb 1 and 2 |
---|
0:17:16 | librispeech |
---|
0:17:17 | and the other data, for six epochs |
---|
0:17:22 | as backend a gaussian plda model is used, following the general pipeline that has |
---|
0:17:27 | been mentioned before |
---|
0:17:33 | system 3 |
---|
0:17:36 | is trained following the kaldi |
---|
0:17:39 | x-vector system recipe on the sre |
---|
0:17:42 | 2004 to 2010 and all switchboard data, for two epochs |
---|
0:17:48 | as front-end |
---|
0:17:50 | features, plp is used |
---|
0:17:54 | augmented sre 2004 to 2010 |
---|
0:17:57 | data was used for training the backend model |
---|
0:18:03 | correlation alignment (coral) based domain adaptation is used for adapting the source domain to the target domain in |
---|
0:18:08 | this case |
---|
0:18:09 | as backend gaussian plda is used, and for system 3 |
---|
0:18:15 | no score normalization was used |
---|
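the correlation alignment (coral) adaptation can be sketched as follows. a minimal sketch on synthetic embeddings: whiten the source-domain data with its own covariance, then re-color it with the target-domain covariance, so second-order statistics match across domains.

```python
import numpy as np

# Sketch of CORAL domain adaptation: transform source embeddings so that
# their covariance matches the target-domain covariance.

def coral(source, target, eps=1e-6):
    """source, target: (n, dim) embedding matrices."""
    def cov_pow(x, power):
        c = np.cov(x, rowvar=False) + eps * np.eye(x.shape[1])
        vals, vecs = np.linalg.eigh(c)
        return vecs @ np.diag(vals ** power) @ vecs.T
    centered = source - source.mean(axis=0)
    # whiten with source covariance, re-color with target covariance
    return centered @ cov_pow(source, -0.5) @ cov_pow(target, 0.5)

rng = np.random.default_rng(0)
src = rng.normal(size=(500, 4)) @ np.diag([3.0, 1.0, 0.5, 0.2])
tgt = rng.normal(size=(500, 4))
adapted = coral(src, tgt)
```

after the transform, the sample covariance of the adapted source data matches that of the target data (up to the small regularizer).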
0:18:20 | since the test data contain multi-speaker recordings, we adopt speaker diarization to |
---|
0:18:25 | obtain the number of speakers and to group the speaker segments according |
---|
0:18:30 | to speaker identity |
---|
0:18:31 | for each |
---|
0:18:32 | test utterance we extract an x-vector |
---|
0:18:36 | every 250 milliseconds |
---|
0:18:39 | then agglomerative hierarchical clustering is used to cluster the embeddings |
---|
0:18:43 | into |
---|
0:18:45 | one, two, |
---|
0:18:46 | three, or four speaker clusters |
---|
0:18:49 | and one embedding per test speaker is then extracted for each speaker cluster |
---|
0:18:54 | each enrollment embedding is scored against all test embeddings, and finally the score is |
---|
0:19:00 | the maximum obtained |
---|
0:19:02 | score |
---|
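the multi-speaker test handling can be sketched as follows. a minimal sketch: a simple average-linkage agglomerative clustering down to a given number of clusters, one mean embedding per cluster, and the maximum enrollment-vs-cluster cosine score as the trial score (cosine is used here for simplicity in place of the plda scoring).

```python
import numpy as np

# Sketch of diarization-based scoring for multi-speaker test recordings.

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ahc(embs, num_clusters):
    """Greedy agglomerative clustering of segment embeddings."""
    clusters = [[i] for i in range(len(embs))]
    while len(clusters) > num_clusters:
        best, pair = -2.0, None
        for i in range(len(clusters)):          # find the closest pair
            for j in range(i + 1, len(clusters)):
                s = cosine(embs[clusters[i]].mean(axis=0),
                           embs[clusters[j]].mean(axis=0))
                if s > best:
                    best, pair = s, (i, j)
        i, j = pair
        clusters[i] += clusters.pop(j)          # merge the closest pair
    return [embs[c].mean(axis=0) for c in clusters]

def trial_score(enroll_emb, test_embs, num_clusters=2):
    # score the enrollment against every cluster, keep the maximum
    return max(cosine(enroll_emb, c) for c in ahc(test_embs, num_clusters))
```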
0:19:09 | now |
---|
0:19:10 | let us move on to the visual-only systems of the vast task, starting with |
---|
0:19:16 | the first visual system |
---|
0:19:21 | visual system 1 uses a pre-trained |
---|
0:19:26 | squeeze-and-excitation version of |
---|
0:19:29 | resnet-50, which is trained on the vggface2 dataset |
---|
0:19:33 | and this |
---|
0:19:34 | pre-trained network is used to extract face embeddings |
---|
0:19:39 | for the enrollment data, based on the provided frame indices and face |
---|
0:19:45 | bounding boxes |
---|
0:19:49 | the face image regions are cropped |
---|
0:19:51 | and normalized before passing them to the pre-trained model for embedding extraction |
---|
0:19:58 | a speaker is represented by averaging the enrollment embeddings |
---|
0:20:02 | on the test side, the single-shot retinaface detector tool is used to detect |
---|
0:20:07 | one face |
---|
0:20:08 | per second |
---|
0:20:10 | in the test data |
---|
0:20:12 | for scoring, cosine similarity is computed between each test and enrollment |
---|
0:20:15 | embedding |
---|
0:20:17 | and the maximum score is selected |
---|
0:20:20 | no score normalization is applied for any of the visual systems |
---|
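the face-trial scoring just described can be sketched as follows; a minimal sketch, assuming embeddings are already extracted as row vectors.

```python
import numpy as np

# Sketch of face-trial scoring: average the enrollment face embeddings
# into one model vector, cosine-score it against every test-frame
# embedding, and keep the maximum.

def face_trial_score(enroll_embs, test_embs):
    """enroll_embs, test_embs: (n, dim) arrays of face embeddings."""
    model = enroll_embs.mean(axis=0)
    model /= np.linalg.norm(model)
    test = test_embs / np.linalg.norm(test_embs, axis=1, keepdims=True)
    return float((test @ model).max())   # max cosine over test faces
```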
0:20:30 | visual system 2: similar to visual system 1, system 2 also uses |
---|
0:20:37 | a pre-trained |
---|
0:20:38 | squeeze-and-excitation |
---|
0:20:39 | resnet-50, trained on the |
---|
0:20:43 | vggface2 dataset, to extract face embeddings |
---|
0:20:46 | but for this system, at each frame multiple bounding boxes are extracted using |
---|
0:20:52 | mtcnn |
---|
0:20:54 | kalman filtering is applied to track the extracted bounding boxes from frame to frame |
---|
0:21:00 | the chinese whispers algorithm is applied for clustering, and this algorithm does not |
---|
0:21:06 | use any prior information about the number of clusters |
---|
0:21:10 | for enrollment, a speaker is represented by averaging the embeddings |
---|
0:21:14 | for scoring, this system is also similar to system 1: cosine similarity is computed with each test embedding and |
---|
0:21:20 | the maximum score is selected |
---|
0:21:29 | now, calibration and fusion for the vast task: |
---|
0:21:33 | calibration and fusion are trained via logistic regression on the development test sets |
---|
0:21:38 | the sre 18 vast eval set was used for calibration and fusion for the audio-only |
---|
0:21:43 | task |
---|
0:21:45 | and the sre 19 |
---|
0:21:47 | audiovisual development set was used for calibration and fusion for the audiovisual systems |
---|
0:21:56 | performance evaluation: |
---|
0:21:58 | in table 4 we compare different backends on top of the resnet with |
---|
0:22:03 | additive angular margin |
---|
0:22:05 | softmax architecture |
---|
0:22:07 | we can see from here that |
---|
0:22:10 | adaptation as well as score normalization are found helpful |
---|
0:22:15 | cosine scoring outperformed the plda backend in the vast audio-only task |
---|
0:22:20 | perhaps this is due to the fact that there's not much |
---|
0:22:25 | domain shift between train and test settings in this case |
---|
0:22:33 | in table 5 we show the influence of using diarization on multi-speaker test |
---|
0:22:38 | recordings for the vast audio-only task |
---|
0:22:41 | we can see from here that diarization helps to boost performance |
---|
0:22:49 | in |
---|
0:22:50 | this table we present the audio-only |
---|
0:22:53 | and visual-only |
---|
0:22:55 | single and fused systems', and the audiovisual fused systems' |
---|
0:23:00 | performance on the dev and eval test sets |
---|
0:23:04 | we can see from here that fusion helps to improve performance |
---|
0:23:08 | the performance of the audio-only and visual-only systems is not that |
---|
0:23:12 | good, but |
---|
0:23:14 | when the visual modality is fused with the audio modality |
---|
0:23:18 | a huge improvement in performance |
---|
0:23:22 | is achieved over the unimodal systems |
---|
0:23:29 | finally, the conclusions: |
---|
0:23:32 | adaptation of the source domain to the target domain played a vital role for both |
---|
0:23:38 | the cmn2 and vast tasks, using either |
---|
0:23:42 | fine-tuning of the speaker discriminant neural network toward the target domain |
---|
0:23:47 | or adaptation techniques such as correlation alignment or feature distribution adaptation, or |
---|
0:23:53 | domain adaptation using a standard |
---|
0:23:57 | domain adversarial network |
---|
0:23:57 | diarization helped to boost performance in |
---|
0:24:02 | multi-speaker |
---|
0:24:03 | test recording scenarios |
---|
0:24:06 | simple score-level fusion of audio and face biometrics |
---|
0:24:10 | provided a significant |
---|
0:24:12 | performance improvement over the unimodal systems |
---|
0:24:15 | which indicates that there exists complementarity between the audio and visual modalities |
---|
0:24:25 | thank you very much for your attention |
---|