0:00:56 | Hello. |
---|
0:00:57 | My name is [unintelligible], and it is my great pleasure to present our work |
---|
0:01:01 | on speaker detection in the wild: |
---|
0:01:04 | lessons learned from JSALT 2019. |
---|
0:01:07 | I would like to first thank, of course, everyone |
---|
0:01:09 | who made this work possible. |
---|
0:01:12 | So let's start. |
---|
0:01:16 | What data do we have? |
---|
0:01:19 | We have plenty of devices, like |
---|
0:01:22 | smartphones |
---|
0:01:24 | and recorders, |
---|
0:01:25 | and we can even get information from social media. |
---|
0:01:28 | From these |
---|
0:01:30 | we gather data |
---|
0:01:32 | and use it |
---|
0:01:33 | for downstream tasks. |
---|
0:01:35 | However, these data need to be labeled |
---|
0:01:39 | to be useful, |
---|
0:01:40 | and with this labeling we can perform speaker detection. |
---|
0:01:44 | One of our very first experiments was to use brute force, |
---|
0:01:51 | and it was the motivation to use diarization afterwards. |
---|
0:01:57 | So we have the speech recording, |
---|
0:01:59 | and we obtain homogeneous segments from it. |
---|
0:02:04 | From those segments we computed the embeddings, and we compared those embeddings |
---|
0:02:09 | with the target |
---|
0:02:11 | speakers involved, |
---|
0:02:13 | and this gave an initial result. |
---|
0:02:16 | But then |
---|
0:02:18 | we used diarization, |
---|
0:02:21 | and we extracted the segments that belong to the same speaker, |
---|
0:02:25 | and we obtained better results. |
---|
0:02:28 | So it was worthwhile |
---|
0:02:32 | to do it this way. |
---|
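As a rough sketch of the two strategies just described, assuming fixed-length speaker embeddings (for example, x-vectors) have already been extracted for each homogeneous segment; the function and variable names are illustrative, not the actual JSALT code:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def brute_force_score(segment_embs, target_emb):
    """Brute force: score every homogeneous segment against the target
    embedding and keep the best match."""
    return max(cosine(e, target_emb) for e in segment_embs)

def diarized_score(segment_embs, labels, target_emb):
    """With diarization: pool the embeddings that belong to the same
    speaker, then score one averaged embedding per speaker."""
    return max(cosine(segment_embs[labels == spk].mean(axis=0), target_emb)
               for spk in np.unique(labels))
```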
0:02:34 | So this is the big picture, |
---|
0:02:37 | the whole pipeline. |
---|
0:02:39 | We have |
---|
0:02:40 | a recording, and we're looking for John. |
---|
0:02:43 | The first stage |
---|
0:02:44 | is to apply voice activity detection; |
---|
0:02:48 | that means getting rid of all the silence. The second stage |
---|
0:02:52 | is to perform speaker type classification; |
---|
0:02:57 | that means to tag |
---|
0:02:59 | all the segments according to the gender, or whether it's a key child or an adult, |
---|
0:03:05 | or even whether it is TV. |
---|
0:03:07 | Then comes speaker diarization, |
---|
0:03:08 | which answers the question "who spoke when?" |
---|
0:03:11 | It gathers together |
---|
0:03:13 | the segments that belong to the same speaker. Speaker detection answers the question |
---|
0:03:19 | of whether we have John in |
---|
0:03:22 | any segment, so it is a binary decision. |
---|
0:03:25 | And then we can look for John along |
---|
0:03:28 | the recording with speaker tracking. |
---|
0:03:33 | Does it work fine |
---|
0:03:36 | to follow this type of pipeline if we have challenges in our data, such as a |
---|
0:03:41 | cocktail party? The answer is no. |
---|
0:03:44 | And if we have a poor SNR, the answer again is no. |
---|
0:03:49 | So let's take a look at some of the numbers on the diarization side. |
---|
0:03:55 | On the right |
---|
0:03:57 | we can observe the results obtained |
---|
0:04:01 | on DIHARD track 2, |
---|
0:04:02 | based on the x-vectors |
---|
0:04:05 | provided by BUT. |
---|
0:04:09 | So we can observe that SEEDLingS, |
---|
0:04:12 | where we have daylong recordings, |
---|
0:04:15 | and similar wild domains |
---|
0:04:18 | got very bad results. |
---|
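The metric in these tables is the diarization error rate (DER); as a reminder, a one-line sketch of how it aggregates its three error types over the scored speech time:

```python
def der(missed, false_alarm, confusion, total_speech):
    """All durations in seconds; returns DER as a fraction of scored speech."""
    return (missed + false_alarm + confusion) / total_speech
```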
0:04:24 | We conclude that these bad results are because we're dealing with far-field microphones, |
---|
0:04:30 | noisy speech, |
---|
0:04:31 | overlapping speech, |
---|
0:04:33 | condition mismatch, non-cooperative speakers, |
---|
0:04:36 | and a bias towards English speech. |
---|
0:04:39 | So we wanted to study these conditions. |
---|
0:04:43 | Now let's see some numbers on speaker recognition. |
---|
0:04:47 | For speaker recognition we compare two systems |
---|
0:04:52 | on two datasets: |
---|
0:04:53 | the first one is SRI and the second one is VOiCES. |
---|
0:04:57 | And we are comparing |
---|
0:05:00 | a close-talking microphone against far-field. |
---|
0:05:04 | We can observe that for the far-field microphone |
---|
0:05:07 | the equal error rate doubles, or worse. |
---|
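The speaker recognition numbers here are reported as equal error rate (EER), the operating point where the false-accept and false-reject rates cross. A small reference sketch, illustrative rather than the official scoring tool:

```python
import numpy as np

def eer(target_scores, nontarget_scores):
    """Approximate EER by scanning all candidate thresholds."""
    ths = np.sort(np.concatenate([target_scores, nontarget_scores]))
    far = np.array([(nontarget_scores >= t).mean() for t in ths])  # false accepts
    frr = np.array([(target_scores < t).mean() for t in ths])      # false rejects
    i = int(np.argmin(np.abs(far - frr)))
    return (far[i] + frr[i]) / 2
```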
0:05:13 | Our main goal was to research, develop, and benchmark speaker diarization and speaker recognition |
---|
0:05:19 | systems |
---|
0:05:20 | for real speech, |
---|
0:05:22 | by using single microphones in realistic scenarios that included background noises |
---|
0:05:28 | such as television audio, music, |
---|
0:05:31 | or other people talking. |
---|
0:05:35 | The data: one of the characteristics of the data is, |
---|
0:05:38 | is it |
---|
0:05:39 | like this one, where you're having a meeting? |
---|
0:05:42 | Or is it |
---|
0:05:43 | completely wild, |
---|
0:05:46 | as the one in CHiME-5, where people gathered together to have a party? |
---|
0:05:51 | Or is it a daylong recording, |
---|
0:05:55 | say a five-hour recording or even longer? |
---|
0:06:00 | Or is it |
---|
0:06:01 | that we have a far-field microphone |
---|
0:06:05 | in the other room |
---|
0:06:06 | that is capturing |
---|
0:06:08 | the voice of the speaker? |
---|
0:06:15 | To cover |
---|
0:06:16 | all these types of data, we included these four corpora: |
---|
0:06:22 | AMI, SRI, CHiME-5, and BabyTrain, |
---|
0:06:25 | going from the easiest one |
---|
0:06:27 | to the most difficult one. |
---|
0:06:30 | So for AMI we have the meeting domain, |
---|
0:06:33 | and we use it both for diarization and detection. |
---|
0:06:37 | For SRI we have a semi-controlled domain. |
---|
0:06:40 | We just use it for detection; we didn't use it for diarization because |
---|
0:06:46 | we don't have |
---|
0:06:47 | the complete |
---|
0:06:49 | labels for all the speakers. |
---|
0:06:51 | CHiME-5 we use for diarization only, |
---|
0:06:55 | and it's a dinner domain. |
---|
0:06:57 | We didn't use it for detection because we usually have four speakers, |
---|
0:07:02 | which is |
---|
0:07:03 | quite a small number of persons. |
---|
0:07:06 | And BabyTrain |
---|
0:07:09 | we use both for diarization and detection, and it is completely wild and |
---|
0:07:15 | uncontrolled. |
---|
0:07:17 | The models that we explored, as I said before, are diarization and speaker |
---|
0:07:23 | detection. |
---|
0:07:24 | So |
---|
0:07:25 | from the diarization we get the labels for all the speakers, and with the speaker |
---|
0:07:30 | detection we can |
---|
0:07:32 | track the speaker we are looking for. |
---|
0:07:36 | This is the picture for the diarization. We have |
---|
0:07:41 | a traditional modularized system that is composed of enhancement, VAD, |
---|
0:07:47 | the embedding, the scoring, the clustering, |
---|
0:07:50 | the re-segmentation, and the overlap assignment. |
---|
0:07:53 | We have two types of enhancement: |
---|
0:07:56 | one at the signal level, |
---|
0:07:58 | and another one at the embedding level. |
---|
0:08:01 | The |
---|
0:08:03 | boxes that are in orange |
---|
0:08:05 | are the ones that we explored. |
---|
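The scoring and clustering modules are not detailed in the talk; as a generic illustration of that stage, a sketch of agglomerative clustering of segment embeddings by cosine distance, where the names and the stopping threshold are assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_segments(embs, threshold=0.5):
    """embs: (n_segments, dim) array. Returns an integer speaker label
    per segment; the threshold controls how many speakers emerge."""
    tree = linkage(pdist(embs, metric="cosine"), method="average")
    return fcluster(tree, t=threshold, criterion="distance")
```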
0:08:09 | Let's start with the enhancement |
---|
0:08:12 | at the signal level. |
---|
0:08:14 | We use |
---|
0:08:15 | an SNR-progressive, multi-target, LSTM-based speech enhancement model. |
---|
0:08:20 | The progressive multi-target |
---|
0:08:23 | network |
---|
0:08:24 | is divided into sequentially stacked blocks, |
---|
0:08:28 | with one LSTM layer |
---|
0:08:30 | and one fully connected layer carrying the multi-target learning per block. |
---|
0:08:35 | The fully connected layer in every block |
---|
0:08:38 | is designed to learn an intermediate speech target with a higher |
---|
0:08:42 | SNR than the previous target. |
---|
0:08:47 | A series of progressive ratio masks |
---|
0:08:50 | are concatenated with the progressively enhanced log-power spectral features |
---|
0:08:56 | as the targets. |
---|
0:08:58 | At test time we directly feed |
---|
0:09:00 | the enhanced audio, |
---|
0:09:02 | processed by the trained enhancement model, to the backend systems. |
---|
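A rough PyTorch sketch of the progressive multi-target idea just described: stacked blocks, each with one LSTM layer and one fully connected layer that regresses an intermediate target at a higher SNR than the previous block. Feature dimension, hidden size, and block count are illustrative assumptions, and the mask/feature concatenation of the real model is omitted:

```python
import torch.nn as nn

class ProgressiveEnhancer(nn.Module):
    def __init__(self, feat_dim=257, hidden=512, n_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList()
        in_dim = feat_dim
        for _ in range(n_blocks):
            self.blocks.append(nn.ModuleDict({
                "lstm": nn.LSTM(in_dim, hidden, batch_first=True),
                "fc": nn.Linear(hidden, feat_dim),  # intermediate speech target
            }))
            in_dim = hidden

    def forward(self, x):
        """x: (batch, time, feat_dim) noisy log-power spectra. Returns one
        progressively cleaner estimate per block; each is trained against
        its own higher-SNR intermediate target."""
        outputs, h = [], x
        for blk in self.blocks:
            h, _ = blk["lstm"](h)
            outputs.append(blk["fc"](h))
        return outputs
```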
0:09:07 | Now that we have a cleaner signal, |
---|
0:09:09 | we can |
---|
0:09:10 | explore the VAD. |
---|
0:09:12 | In this case |
---|
0:09:13 | we have two directions: |
---|
0:09:15 | the one on the top, that is based |
---|
0:09:17 | on MFCCs, and the one on the bottom, that is based on |
---|
0:09:22 | SincNet features. |
---|
0:09:23 | In both architectures there follows a set of LSTM and fully connected layers. |
---|
0:09:30 | The output is speech |
---|
0:09:31 | and non-speech. |
---|
0:09:33 | It is important to note |
---|
0:09:35 | that the lower branch is the one that we chose |
---|
0:09:38 | for our experiments. |
---|
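A minimal sketch of the upper, MFCC-based branch: recurrent layers followed by fully connected layers emitting a per-frame speech / non-speech decision. Layer sizes are assumptions; the lower branch would swap the MFCC input for features learned from the raw waveform by SincNet:

```python
import torch.nn as nn

class FrameVAD(nn.Module):
    def __init__(self, n_mfcc=19, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(n_mfcc, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.head = nn.Sequential(nn.Linear(2 * hidden, 64),
                                  nn.ReLU(),
                                  nn.Linear(64, 2))  # speech vs. non-speech

    def forward(self, mfcc):       # mfcc: (batch, time, n_mfcc)
        h, _ = self.rnn(mfcc)
        return self.head(h)        # per-frame logits
```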
0:09:42 | Although this is not part of the final stages, it is also true that the chosen |
---|
0:09:47 | embedding network |
---|
0:09:48 | is related to the performance, |
---|
0:09:50 | as shown in the table. |
---|
0:09:52 | So we explored the extended TDNN, |
---|
0:09:55 | with VoxCeleb and with VoxCeleb |
---|
0:09:59 | plus augmentation, |
---|
0:10:01 | and we also explored a |
---|
0:10:03 | factored TDNN, |
---|
0:10:04 | also with augmentation. |
---|
0:10:07 | So we can see that the factored TDNN gives |
---|
0:10:11 | the best results on BabyTrain |
---|
0:10:13 | and AMI, and it was comparable on CHiME-5. |
---|
0:10:17 | So we chose the factored TDNN |
---|
0:10:20 | for our experiments. |
---|
0:10:22 | Now let's focus |
---|
0:10:24 | on the speech enhancement. |
---|
0:10:27 | The idea is how to train an unsupervised speech enhancement system |
---|
0:10:31 | which can be used as a front-end |
---|
0:10:34 | preprocessing model |
---|
0:10:36 | to improve the quality of the features |
---|
0:10:38 | before they are passed |
---|
0:10:39 | to the embedding extractor. |
---|
0:10:41 | The main idea here is to use an unsupervised |
---|
0:10:45 | adaptation system |
---|
0:10:47 | based on CycleGANs. |
---|
0:10:49 | We train a CycleGAN network using log-Mel filterbank features |
---|
0:10:54 | as input |
---|
0:10:55 | to each of the generator networks. |
---|
0:10:58 | So we have a clean source signal on the left and the real target-domain |
---|
0:11:03 | data on the right. |
---|
0:11:05 | During testing, |
---|
0:11:06 | we map the test data to the target signal. |
---|
0:11:12 | These enhanced |
---|
0:11:13 | acoustic features |
---|
0:11:15 | are then used |
---|
0:11:16 | by the x-vector extractors. |
---|
0:11:18 | Even though the CycleGAN network was trained for dereverberation, |
---|
0:11:23 | we also tested it on noisy datasets, |
---|
0:11:26 | showing improvements. |
---|
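A compact sketch of the CycleGAN objective behind this idea: two generators map clean-domain features to real-domain features and back, tied by a cycle-consistency loss, plus least-squares adversarial terms. Module internals, names, and the lambda weight are assumptions:

```python
import torch
import torch.nn.functional as F

def cyclegan_loss(G_c2r, G_r2c, D_r, D_c, clean, real, lam=10.0):
    """clean, real: (batch, time, n_mels) feature tensors from each domain."""
    fake_real = G_c2r(clean)     # clean -> real-domain features
    fake_clean = G_r2c(real)    # real  -> clean-domain features
    # adversarial terms (least-squares GAN): fool each discriminator
    adv = F.mse_loss(D_r(fake_real), torch.ones_like(D_r(fake_real))) \
        + F.mse_loss(D_c(fake_clean), torch.ones_like(D_c(fake_clean)))
    # cycle consistency: mapping there and back should recover the input
    cyc = F.l1_loss(G_r2c(fake_real), clean) + F.l1_loss(G_c2r(fake_clean), real)
    return adv + lam * cyc
```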
0:11:28 | Now let's continue with the overlap detection. |
---|
0:11:32 | The architecture might look familiar here: |
---|
0:11:35 | it is exactly the same as the one used for the VAD approach, |
---|
0:11:40 | but now trained in a way that decides |
---|
0:11:43 | overlapped or non-overlapped |
---|
0:11:45 | speech. |
---|
0:11:47 | It can also be used together with the speaker diarization directly, |
---|
0:11:50 | or as a mask on the VAD, |
---|
0:11:53 | but the first approach showed better results. |
---|
0:11:57 | Let's continue with the overlap assignment. |
---|
0:12:02 | From the VB re-segmentation |
---|
0:12:04 | we get a posterior matrix |
---|
0:12:07 | for each of the speakers, |
---|
0:12:10 | so the most probable speakers here will be rows one and two. |
---|
0:12:17 | We can combine this with the overlap detector |
---|
0:12:21 | and also with the VAD. |
---|
0:12:24 | Merging these results, |
---|
0:12:26 | we get what we call the overlap assignment, where we have regions where the overlap |
---|
0:12:32 | detector tells us that we have two speakers, and we put there the most probable |
---|
0:12:38 | speakers. |
---|
0:12:39 | At this point |
---|
0:12:40 | we end our diarization system. |
---|
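A minimal numpy sketch of this assignment step: given per-frame speaker posteriors, a VAD mask, and an overlap mask, emit the single most probable speaker per speech frame, and the two most probable where the overlap detector fires. All names are illustrative:

```python
import numpy as np

def assign_speakers(posteriors, vad, overlap):
    """posteriors: (frames, n_speakers); vad, overlap: boolean (frames,).
    Returns a list with the active speaker indices for every frame."""
    order = np.argsort(-posteriors, axis=1)      # speakers by decreasing posterior
    labels = []
    for t in range(posteriors.shape[0]):
        if not vad[t]:
            labels.append([])                    # silence
        elif overlap[t]:
            labels.append(list(order[t, :2]))    # two most probable speakers
        else:
            labels.append([order[t, 0]])         # single most probable speaker
    return labels
```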
0:12:45 | But now the question is: what combination of all these things gives |
---|
0:12:49 | good results? |
---|
0:12:51 | So in our case |
---|
0:12:53 | we put together the TDNN VAD, the embedding-level enhancement, |
---|
0:12:57 | the VB re-segmentation, and the overlap assignment. |
---|
0:13:01 | For all the corpora we got nice improvements. |
---|
0:13:06 | For example, for AMI |
---|
0:13:08 | we went from forty-nine percent |
---|
0:13:11 | diarization error rate to thirty percent |
---|
0:13:13 | diarization error rate. |
---|
0:13:16 | For the CHiME-5 |
---|
0:13:18 | corpus |
---|
0:13:20 | we also put together |
---|
0:13:23 | the same combination; we went |
---|
0:13:26 | from sixty-nine percent diarization error rate |
---|
0:13:28 | to sixty-three |
---|
0:13:30 | percent diarization error rate. |
---|
0:13:33 | And finally, for BabyTrain, |
---|
0:13:37 | we got a nice improvement, from eighty-five percent diarization error rate to forty- |
---|
0:13:42 | seven percent |
---|
0:13:44 | diarization error rate. |
---|
0:13:45 | It is important to note here that these |
---|
0:13:49 | additions |
---|
0:13:50 | really improved the system. |
---|
0:13:54 | This is the speaker detection pipeline. |
---|
0:13:57 | We have the enhancement, |
---|
0:14:00 | at the signal level and also at the embedding level; we have the diarization segmentation; |
---|
0:14:06 | we have the embedding extractor, the backend, the calibration; and finally |
---|
0:14:10 | we get the speaker detection. |
---|
0:14:14 | The boxes in orange |
---|
0:14:17 | use the same techniques |
---|
0:14:19 | as in diarization. |
---|
0:14:20 | So we use the enhancement at two levels: |
---|
0:14:24 | at the signal level and also |
---|
0:14:26 | at the embedding level. |
---|
0:14:29 | The diarization |
---|
0:14:31 | segmentation |
---|
0:14:33 | is fed into the embedding extractor and the PLDA backend. |
---|
0:14:40 | The embedding extractor, as we already emphasized before, |
---|
0:14:44 | is a factored TDNN, |
---|
0:14:46 | which was getting the best results for speaker ID. |
---|
0:14:50 | We also used an enhancement |
---|
0:14:52 | module |
---|
0:14:53 | for this embedding extractor. |
---|
0:14:57 | And finally we have the backend and the calibration. |
---|
0:15:00 | The backend |
---|
0:15:01 | uses the PLDA on top of the diarization, with augmentation, and the calibration stage |
---|
0:15:08 | leads directly to the speaker detection. |
---|
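A small sketch of what a calibration stage like this typically does: fit an affine transform of the raw backend scores with logistic regression so they behave like calibrated log-likelihood ratios. Purely illustrative; the talk does not specify the exact calibration recipe:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_calibration(scores, labels):
    """scores: (n,) raw backend scores; labels: (n,) 1=target, 0=non-target.
    Returns a function mapping a raw score to a calibrated one."""
    lr = LogisticRegression().fit(scores.reshape(-1, 1), labels)
    a, b = lr.coef_[0, 0], lr.intercept_[0]
    return lambda s: a * s + b
```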
0:15:11 | The combination that yields the best results for all of our corpora |
---|
0:15:16 | includes the speech enhancement, the spectral augmentation, and the PLDA with augmentation. It is |
---|
0:15:22 | important to note that all of these |
---|
0:15:26 | systems include the diarization as their first stage. |
---|
0:15:30 | So for AMI |
---|
0:15:31 | we got an improvement, going from |
---|
0:15:34 | seventeen percent equal error rate |
---|
0:15:36 | to two percent equal error rate. |
---|
0:15:40 | In terms of minDCF and actual DCF, shown at the bottom, we can also see some |
---|
0:15:46 | improvement. |
---|
0:15:49 | For BabyTrain we can observe the same trend, |
---|
0:15:52 | going from fourteen percent equal error rate |
---|
0:15:56 | to nine percent equal error rate. |
---|
0:16:01 | On the bottom we can observe the minDCF |
---|
0:16:04 | and the actual DCF; the minDCF got an improvement, |
---|
0:16:09 | but not the actual DCF. |
---|
0:16:12 | For the SRI data, our system also improved the results, going from twenty-one |
---|
0:16:19 | percent equal error rate to sixteen percent |
---|
0:16:21 | equal error rate. |
---|
0:16:23 | The minDCF and the actual DCF |
---|
0:16:26 | follow the same trend, |
---|
0:16:28 | getting improvements as well. |
---|
0:16:34 | Finally, some takeaways I'd like to mention. |
---|
0:16:39 | Diarization is a fundamental stage to perform speaker detection. |
---|
0:16:46 | There are some modules that are really needed to have a competitive system, |
---|
0:16:51 | of course: a good enhancement, a good VAD, |
---|
0:16:55 | good embeddings, |
---|
0:16:56 | and overlap detection and assignment. |
---|
0:17:01 | The speaker detection depends not only |
---|
0:17:04 | on the diarization module, |
---|
0:17:06 | but also on the embedding extractor and the augmentation. |
---|
0:17:12 | The future directions of this work are as follows. |
---|
0:17:17 | For the signal-to-signal enhancement and speaker separation we need some customization; |
---|
0:17:22 | it could be by dataset, by speaker, or by task. |
---|
0:17:26 | For the speech enhancement |
---|
0:17:28 | we have to explore other architectures, such as transformers, and large-scale training. |
---|
0:17:35 | For the VAD we need ways to handle domain mismatch; |
---|
0:17:39 | it can be done, for example, using domain adversarial training. |
---|
0:17:43 | For the clustering we need unsupervised adaptation, |
---|
0:17:47 | to take the overlap into account |
---|
0:17:50 | during the clustering, |
---|
0:17:52 | and also to include the transcription |
---|
0:17:54 | in parallel with the speaker embeddings. |
---|
0:17:57 | For the speaker detection: |
---|
0:17:58 | some enhancement for the multi-speaker scenario; |
---|
0:18:02 | that means |
---|
0:18:04 | highlighting |
---|
0:18:05 | the speaker of interest, |
---|
0:18:08 | and also performing better clustering |
---|
0:18:10 | for short segments. |
---|
0:18:12 | This is our amazing team. |
---|
0:18:15 | I would like to thank |
---|
0:18:16 | all of them very much. Thank you. Questions? |
---|