0:00:17 | Hi everyone. |
0:00:18 | This is Quan Wang from Google, and today I am going to talk about personal VAD, |
0:00:23 | also known as speaker-conditioned voice activity detection. |
0:00:27 | A big part of this work was done by Shaojin Ding, who was my intern last summer. |
0:00:34 | First of all, here is a summary of this work. |
0:00:37 | Personal VAD is a system to detect the voice activity of the target speaker. |
0:00:42 | The reason we need personal VAD is that |
0:00:45 | it reduces the CPU, memory, and battery consumption for on-device speech recognition. |
0:00:50 | We implement personal VAD as a frame-level voice activity detection system, |
0:00:55 | which uses the target speaker embedding as a side input. |
0:00:59 | I will start by giving some background. |
0:01:02 | Most of the speech recognition systems |
0:01:04 | are deployed on the cloud, |
0:01:06 | but moving ASR to the device is an emerging trend. |
0:01:10 | This is because |
0:01:11 | on-device ASR does not require an internet connection, and it greatly reduces the latency, |
0:01:16 | because it does not need to communicate with servers. |
0:01:20 | It also preserves the user's privacy better, because the audio never leaves the device. |
0:01:26 | On-device ASR is usually used for smartphones or smart home speakers. |
0:01:30 | For example, |
0:01:31 | if you simply want to turn on the flashlight on your phone, |
0:01:35 | you should be able to do it in airplane mode. |
0:01:38 | If you want to turn on your lights, |
0:01:40 | you should only need access to your local network. |
0:01:44 | While on-device ASR is great, |
0:01:47 | there are lots of challenges. |
0:01:49 | Unlike on servers, |
0:01:50 | we only have a very limited budget of CPU, memory, |
0:01:54 | and battery |
0:01:55 | for ASR. |
0:01:56 | Also, |
0:01:56 | ASR is not the only program running on the device. |
0:02:00 | For example, on smartphones there are also many other apps running in the background. |
0:02:05 | So an important question is: |
0:02:07 | when do we run ASR on the device? Apparently, |
0:02:10 | it shouldn't always be running. |
0:02:12 | A typical solution is to use keyword detection, |
0:02:15 | also known as wake word detection, |
0:02:17 | or hotword detection. |
0:02:19 | For example, |
0:02:20 | "OK Google" |
0:02:21 | is the keyword for Google devices. |
0:02:24 | Because the keyword detection model is usually very small, |
0:02:27 | it's very cheap, |
0:02:28 | and it can be always running. |
0:02:30 | ASR, on the other hand, is a gigantic model. |
0:02:32 | Since ASR is very expensive, |
0:02:34 | we only run it |
0:02:35 | when the keyword is detected. |
0:02:38 | However, not everyone likes the idea of always having to say a keyword |
0:02:43 | before they interact with the device. |
0:02:45 | Many people wish to be able to directly talk to the device, |
0:02:48 | without having to say a keyword that we defined for them. |
0:02:52 | So an alternative solution is to use voice activity detection instead of keyword detection. |
0:02:57 | Like keyword detection models, |
0:02:59 | VAD models are also very small, |
0:03:02 | and very cheap to run. |
0:03:03 | So we can have the VAD model always running, |
0:03:06 | and only run ASR when VAD has been triggered. |
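To make this gating pattern concrete, here is a minimal sketch. The `tiny_vad` energy rule and the threshold are placeholder assumptions, not the models discussed in the talk:

```python
import numpy as np

VAD_THRESHOLD = 0.5  # speech-probability threshold (assumed value)

def tiny_vad(frame: np.ndarray) -> float:
    """Placeholder for a small always-running detector: returns the
    probability that this frame contains speech. A real system would
    use a trained keyword or VAD model here."""
    energy = float(np.mean(frame ** 2))
    return 1.0 if energy > 1e-3 else 0.0

def big_asr(frame: np.ndarray) -> None:
    """Stand-in for the expensive speech recognizer."""
    pass

def process_stream(frames) -> None:
    """Run the cheap detector on every frame; invoke the expensive
    recognizer only when the detector triggers."""
    for frame in frames:
        if tiny_vad(frame) > VAD_THRESHOLD:
            big_asr(frame)
```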
0:03:11 | So how does VAD work? |
0:03:13 | The VAD model is typically a frame-level binary classifier. |
0:03:17 | For every frame of the speech signal, |
0:03:20 | VAD classifies it into two categories: |
0:03:22 | speech and non-speech. After VAD, |
0:03:26 | we throw away all the non-speech frames, |
0:03:28 | and only keep the speech frames. |
0:03:30 | Then we feed the speech frames to downstream components, |
0:03:34 | like ASR or speaker recognition. |
0:03:37 | The recognition results will be used for natural language processing, |
0:03:40 | and then trigger different actions. |
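As a sketch of that filtering step, assuming per-frame speech probabilities from some binary classifier (the model itself is omitted, and the dimensions are illustrative):

```python
import numpy as np

def filter_speech_frames(features: np.ndarray, speech_probs: np.ndarray,
                         threshold: float = 0.5) -> np.ndarray:
    """features: [num_frames, feat_dim]; speech_probs: [num_frames].
    Drops frames classified as non-speech and keeps the rest for
    downstream components like ASR or speaker recognition."""
    return features[speech_probs > threshold]

# Example with 1000 frames of 40-dim features (assumed feature type).
feats = np.random.randn(1000, 40).astype(np.float32)
probs = np.random.rand(1000)
speech_only = filter_speech_frames(feats, probs)
```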
0:03:43 | The VAD model will help us reject all the non-speech frames, |
0:03:47 | which will save lots of computational resources. |
0:03:49 | But is it good enough? |
0:03:51 | In a realistic scenario, |
0:03:53 | you can talk to the device, |
0:03:54 | but your kids can also talk to it. And if we have a TV in the living room, |
0:03:58 | there will be someone talking in the TV ads. |
0:04:01 | These are all valid speech signals, |
0:04:03 | so VAD will simply accept all these frames, |
0:04:06 | but some of them are not what we want. |
0:04:09 | For example, |
0:04:10 | if you keep the TV playing, |
0:04:12 | the ASR keeps running on this speech, which will cause your phone to run out of battery. |
0:04:18 | So that's why we are introducing personal VAD. |
0:04:22 | Personal VAD is similar to standard VAD: |
0:04:24 | it is a frame-level classifier. |
0:04:27 | But the difference is that it has three categories instead of two. |
0:04:31 | We still have the non-speech class, |
0:04:33 | but the other two are target speaker speech and non-target speaker speech. |
0:04:38 | Any speech that is not spoken by the target speaker, |
0:04:41 | like other family members |
0:04:43 | or TV, |
0:04:44 | will be considered non-target speaker speech. |
0:04:47 | The benefit of using personal VAD is that |
0:04:51 | we only run ASR on target speaker speech. |
0:04:54 | This means we will save lots of computational resources |
0:04:57 | when the TV is playing, when there are many members in the user's household, or when the user is at a party. |
0:05:05 | And to make this work, the key is that |
0:05:08 | the personal VAD model has to be tiny and fast, |
0:05:10 | just like a keyword detection or standard VAD model. |
0:05:14 | Also, the false rejects must be low, |
0:05:17 | because we want to be responsive to the target user's requests. |
0:05:21 | The false accepts should also be low, |
0:05:23 | to really save the computational resources. |
0:05:26 | When we first released this paper, |
0:05:28 | there were some comments like, "oh, this is not new, this is just |
0:05:31 | speaker recognition or speaker diarization." |
0:05:34 | Here we want to clarify that: |
0:05:36 | no, it is not. |
0:05:37 | Personal VAD is very different from speaker recognition or speaker diarization. |
0:05:42 | Speaker recognition models usually produce recognition results at the utterance level |
0:05:46 | or window level. |
0:05:48 | But personal VAD produces output scores at the frame level. |
0:05:51 | It is a streaming model, and very sensitive to latency. |
0:05:55 | Speaker recognition models can be big, and usually use more than five million parameters. |
0:06:01 | Personal VAD is an always-running model; it must be very small, |
0:06:05 | typically less than two hundred thousand parameters. |
0:06:08 | Speaker diarization needs to cluster all the speakers, |
0:06:11 | and the number of speakers is very important. |
0:06:14 | But personal VAD only cares about the target speaker; |
0:06:17 | everyone else will simply be represented as |
0:06:19 | non-target speaker. |
0:06:22 | Next, I will talk about the implementation of personal VAD. |
0:06:26 | To implement personal VAD, |
0:06:28 | the first question is: |
0:06:29 | how do we know whom to listen to? |
0:06:32 | Well, these systems usually ask the users to enroll their voice, |
0:06:36 | and this enrollment is a one-off experience, |
0:06:38 | so the cost can be ignored at runtime. |
0:06:41 | After enrollment, |
0:06:42 | we will have a speaker embedding, |
0:06:44 | or as shown on the slide, a d-vector, |
0:06:47 | stored on the device. |
0:06:48 | This embedding can be used for speaker recognition, |
0:06:50 | for example Voice Match. |
0:06:52 | So naturally, it can also be used as the side input for personal VAD. |
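For illustration, a common way to turn the stored embedding into a verification score is cosine similarity; this is a sketch under that assumption, with an assumed 256-dim d-vector and illustrative function names:

```python
import numpy as np

def cosine_score(runtime_embedding: np.ndarray,
                 enrolled_dvector: np.ndarray) -> float:
    """Cosine similarity between an embedding computed from recent audio
    and the target speaker's enrolled d-vector."""
    a = runtime_embedding / np.linalg.norm(runtime_embedding)
    b = enrolled_dvector / np.linalg.norm(enrolled_dvector)
    return float(np.dot(a, b))

enrolled = np.random.randn(256)  # stored on device at enrollment time
runtime = np.random.randn(256)   # computed from the incoming audio
print(cosine_score(runtime, enrolled))
```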
0:06:58 | There are different ways of implementing personal VAD. |
0:07:01 | The simplest way is to directly combine a standard VAD model and a speaker |
0:07:06 | verification system. |
0:07:07 | We use this as a baseline. |
0:07:09 | But in this paper, we propose to instead train a new personal VAD model, |
0:07:13 | which takes the speaker verification score |
0:07:16 | or the speaker embedding as input. |
0:07:19 | So we actually implemented four different architectures for personal VAD, and I am going to talk |
0:07:24 | about them one by one. |
0:07:26 | First: |
0:07:27 | score combination (SC). |
0:07:28 | This is the baseline model that I mentioned earlier. |
0:07:31 | We don't train any new model, |
0:07:33 | but just use the existing VAD model and the speaker verification model. |
0:07:38 | If the VAD output is speech, |
0:07:40 | we verify whether this frame |
0:07:42 | came from the target speaker using the speaker verification model, |
0:07:45 | such that we have the three different output classes |
0:07:48 | of personal VAD. |
0:07:50 | Note that |
0:07:51 | this implementation requires running the big speaker verification model at runtime, |
0:07:56 | so it is an expensive solution. |
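A hard-decision sketch of this baseline logic is below; the actual system combines the underlying scores, so the thresholding here is a simplification, and the threshold value is an assumption:

```python
def score_combination(vad_is_speech: bool, verification_score: float,
                      threshold: float = 0.5) -> str:
    """Maps one frame to the three personal VAD classes by cascading a
    standard VAD decision with a speaker verification score."""
    if not vad_is_speech:
        return "ns"    # non-speech
    if verification_score >= threshold:
        return "tss"   # target speaker speech
    return "ntss"      # non-target speaker speech
```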
0:07:58 | Second: |
0:07:59 | score conditioned training (ST). Here we don't use the standard VAD model, |
0:08:04 | but still use the speaker verification model. |
0:08:07 | We concatenate the speaker verification score |
0:08:09 | with the acoustic features, and train a new personal VAD model |
0:08:13 | on top of the concatenated features. |
0:08:16 | This is still very expensive, because we need to run the speaker verification model at |
0:08:20 | runtime. |
0:08:23 | Third: embedding conditioned training (ET). |
0:08:25 | This is really the implementation that we want to use for on-device ASR. |
0:08:29 | It directly concatenates the target speaker embedding with the acoustic features, |
0:08:34 | and we train a new personal VAD model on the concatenated features. |
0:08:38 | So the personal VAD model |
0:08:40 | is the only model that we need at runtime. |
0:08:44 | And finally: score and embedding conditioned training (SET). It concatenates |
0:08:49 | both the speaker verification score |
0:08:50 | and the embedding |
0:08:51 | with the acoustic features. |
0:08:53 | So it uses the most information from the speaker verification system, and is supposed |
0:08:58 | to be the most powerful. |
0:09:00 | But since it still requires running speaker verification at runtime, |
0:09:04 | it is |
0:09:05 | not ideal for on-device ASR. |
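The three trained variants differ only in what gets concatenated with the acoustic features. A sketch, with assumed dimensions (40-dim acoustic frames, 256-dim d-vector) and the LSTM classifier itself omitted:

```python
import numpy as np

def build_inputs(acoustic: np.ndarray, dvector: np.ndarray,
                 verif_scores: np.ndarray, variant: str) -> np.ndarray:
    """acoustic: [T, 40]; dvector: [256]; verif_scores: [T]."""
    T = acoustic.shape[0]
    emb = np.tile(dvector, (T, 1))   # repeat the embedding for every frame
    s = verif_scores[:, None]        # per-frame verification score column
    if variant == "ST":              # score conditioned training
        return np.concatenate([acoustic, s], axis=1)
    if variant == "ET":              # embedding conditioned training
        return np.concatenate([acoustic, emb], axis=1)
    if variant == "SET":             # score and embedding conditioned
        return np.concatenate([acoustic, s, emb], axis=1)
    raise ValueError(f"unknown variant: {variant}")
```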
0:09:08 | Okay, we have talked about architectures; let's talk about the loss function. |
0:09:13 | VAD is a classification problem, |
0:09:16 | so standard VAD uses binary cross entropy. Personal VAD has three classes, so naturally |
0:09:22 | we can use ternary cross entropy. |
0:09:25 | But can we do better than cross entropy? If you think about the actual use |
0:09:30 | case, |
0:09:31 | both non-speech and non-target speaker speech |
0:09:34 | will be discarded for ASR. |
0:09:36 | So if we make a prediction error |
0:09:38 | between non-speech |
0:09:40 | and non-target speaker speech, it is actually not a big deal. |
0:09:43 | We encode this knowledge into our loss function, |
0:09:47 | and propose the weighted pairwise loss. |
0:09:51 | It is similar to cross entropy, |
0:09:53 | but we use different weights for different pairs of classes. |
0:09:57 | For example, |
0:09:58 | we use a small weight of 0.1 between the classes non-speech |
0:10:02 | and non-target speaker speech, |
0:10:04 | and use a larger weight of 1.0 for the other pairs. |
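A sketch of a loss with this behavior is below. The pairwise-logistic form on logits is an assumption based on the description in the talk; see the paper for the exact definition:

```python
import numpy as np

CLASSES = ["ns", "tss", "ntss"]
# Pair weights: 0.1 between <ns, ntss>, 1.0 for every other pair.
PAIR_WEIGHTS = {("ns", "ntss"): 0.1, ("ntss", "ns"): 0.1}

def weighted_pairwise_loss(logits: np.ndarray, target: str) -> float:
    """logits: unnormalized scores for [ns, tss, ntss] on one frame.
    Confusions between down-weighted pairs are penalized less."""
    y = CLASSES.index(target)
    loss = 0.0
    for k, name in enumerate(CLASSES):
        if k == y:
            continue
        w = PAIR_WEIGHTS.get((target, name), 1.0)
        # log(1 + exp(z_k - z_y)): large when a wrong class outscores the truth
        loss += w * float(np.log1p(np.exp(logits[k] - logits[y])))
    return loss

# Confusing ns with ntss costs much less than confusing ns with tss.
print(weighted_pairwise_loss(np.array([0.5, 2.0, 1.5]), "ns"))
```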
0:10:11 | Next, I will talk about the experiments. |
0:10:15 | An ideal dataset for training and evaluating personal VAD |
0:10:19 | would have these features: |
0:10:20 | it should include realistic and natural speaker turns; |
0:10:24 | it should cover diverse noise conditions; |
0:10:27 | it should have frame-level speaker labels; and it should have enrollment utterances |
0:10:31 | for each target speaker. |
0:10:33 | Unfortunately, |
0:10:34 | we couldn't find a dataset that satisfies all these requirements, |
0:10:39 | so we actually made an artificial dataset based on the well-known LibriSpeech dataset. |
0:10:45 | Remember that we need the frame-level speaker labels. |
0:10:48 | For each LibriSpeech utterance, we have the speaker label; |
0:10:52 | we also have the ground truth ASR transcript. |
0:10:55 | So we used a pretrained ASR model to force-align the ground truth transcript |
0:11:00 | with the audio, to get the timing of each word. With this timing information, |
0:11:05 | we can get the frame-level speaker labels. |
0:11:08 | And to have conversational speech, we concatenate utterances from different speakers. |
0:11:14 | We also use a room simulator to add reverberation and noise |
0:11:18 | to the concatenated utterances. |
0:11:20 | This will avoid domain overfitting, and also mitigate the concatenation artifacts. |
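A sketch of the labeling and concatenation steps, with illustrative function and field names (the actual tooling used in the paper is not shown, and the 10 ms frame resolution is an assumption):

```python
import numpy as np

FRAME_MS = 10  # assumed label resolution

def frame_labels(word_timings, utt_ms, speaker, target_speaker):
    """word_timings: list of (start_ms, end_ms) from forced alignment.
    Returns one label per frame: 'ns', 'tss', or 'ntss'."""
    labels = np.full(utt_ms // FRAME_MS, "ns", dtype=object)
    spoken = "tss" if speaker == target_speaker else "ntss"
    for start, end in word_timings:
        labels[start // FRAME_MS : end // FRAME_MS] = spoken
    return labels

# Concatenating utterances from different speakers gives conversation-like
# audio whose labels are simply the concatenated per-utterance labels.
utt_a = frame_labels([(100, 600)], 1000, "spk1", target_speaker="spk1")
utt_b = frame_labels([(50, 400)], 500, "spk2", target_speaker="spk1")
conversation = np.concatenate([utt_a, utt_b])
```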
0:11:27 | Here is the model configuration. |
0:11:29 | Both the standard VAD and the personal VAD consist of two LSTM |
0:11:33 | layers |
0:11:34 | and one fully connected layer. |
0:11:36 | The model has around 130 thousand parameters in total. |
0:11:40 | The speaker verification model has three LSTM layers |
0:11:43 | with projection, |
0:11:44 | and one fully connected layer. |
0:11:46 | This model is pretty big, |
0:11:48 | with about five million parameters. |
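As a rough sanity check on these sizes, here is the standard LSTM parameter count applied to one plausible configuration (64-cell layers on a 296-dim conditioned input, as in the ET variant; the exact layer sizes are assumptions):

```python
def lstm_params(input_dim: int, cells: int) -> int:
    # 4 gates, each with input weights, recurrent weights, and a bias.
    return 4 * cells * (input_dim + cells + 1)

total = (lstm_params(296, 64)   # first LSTM: 40-dim features + 256-dim d-vector
         + lstm_params(64, 64)  # second LSTM
         + 64 * 3 + 3)          # fully connected layer to 3 classes
print(total)  # ~126K, in the ballpark of the quoted model size
```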
0:11:51 | For evaluation, |
0:11:52 | because this is a classification problem, |
0:11:55 | we use average precision. |
0:11:57 | We look at the average precision for each class, and also the mean average precision. |
0:12:02 | We also look at the metrics both with and without the added reverberation and noise. |
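A sketch of the metric computation using scikit-learn; the arrays here are random placeholders for per-frame ground truth and model scores:

```python
import numpy as np
from sklearn.metrics import average_precision_score

num_frames, num_classes = 1000, 3  # classes: ns, tss, ntss
y_true = np.eye(num_classes)[np.random.randint(0, num_classes, num_frames)]
y_score = np.random.rand(num_frames, num_classes)

class_ap = [average_precision_score(y_true[:, c], y_score[:, c])
            for c in range(num_classes)]
mean_ap = float(np.mean(class_ap))
print(class_ap, mean_ap)
```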
0:12:08 | Next are the results and conclusions. |
0:12:12 | First, |
0:12:12 | we compare the different architectures. |
0:12:15 | Remember that SC is the baseline, directly combining standard VAD |
0:12:21 | and speaker verification. |
0:12:23 | And we find that all the other personal |
0:12:25 | VAD models are better than the baseline. |
0:12:28 | Among the proposed models, SET, the one that uses both the speaker |
0:12:33 | verification score |
0:12:34 | and the speaker embedding, is the best. |
0:12:37 | This is kind of expected, because it uses the most speaker information. |
0:12:42 | ET is the personal VAD model |
0:12:44 | that only uses the speaker embedding, and it is the ideal one for on-device ASR. We note that |
0:12:49 | ET is slightly worse than SET, but the difference is small: it |
0:12:53 | is near optimal, but has only 2.6 percent of the parameters at runtime. |
0:12:59 | We also compare the conventional cross entropy loss |
0:13:02 | and the proposed weighted pairwise loss. |
0:13:05 | We found that the weighted pairwise loss is consistently better than cross entropy, and |
0:13:11 | the optimal weight between non-speech |
0:13:13 | and non-target speaker speech is 0.1. |
0:13:17 | Finally, since the ultimate goal of personal VAD |
0:13:21 | is to replace the standard VAD, |
0:13:23 | we compare the two on the standard VAD task. |
0:13:26 | In some cases personal VAD is slightly worse, |
0:13:30 | but the differences are very small. |
0:13:33 | So, the conclusions of this paper: |
0:13:35 | the proposed personal VAD architectures outperform the baseline of directly combining VAD and |
0:13:42 | speaker verification. |
0:13:43 | Among the proposed architectures, |
0:13:45 | SET has the best performance, but ET is the ideal one for |
0:13:50 | on-device ASR, |
0:13:51 | with near optimal performance. |
0:13:54 | We also proposed the weighted pairwise loss, |
0:13:57 | which outperforms the cross entropy loss. |
0:13:59 | Finally, personal VAD performs almost equally well on standard VAD |
0:14:05 | tasks. |
0:14:07 | Let me also briefly talk about future work directions. |
0:14:11 | Currently the personal VAD model is trained and evaluated on artificial conversations. |
0:14:17 | We should really use realistic conversational speech; |
0:14:20 | this will require lots of data collection and labeling efforts. |
0:14:24 | Besides, personal VAD can be used for speaker diarization, |
0:14:28 | especially when there is overlapping speech in the conversation. |
0:14:32 | And the good news is that people are already doing it. |
0:14:35 | Researchers from Russia proposed a system known as target-speaker VAD, |
0:14:41 | which is similar to personal VAD, |
0:14:43 | and successfully used it for speaker diarization. |
0:14:46 | If you like our paper, |
0:14:47 | I would recommend you read their paper as well. |
0:14:51 | If you have questions, |
0:14:52 | please leave a comment. |
0:14:54 | Links to these resources are on the website and in our paper. |
0:14:58 | Thank you. |