0:00:17 | Hi everyone,
0:00:18 | this is Quan Wang from Google.
0:00:20 | Today I'm going to talk about personal VAD, which is also known as
0:00:24 | speaker-conditioned voice activity detection.
0:00:27 | A big part of this work was done by Shaojin
0:00:30 | Ding, who was my intern last summer.
0:00:34 | First of all, here is a summary of this work.
0:00:37 | Personal VAD is a system to detect the voice activity of the target speaker.
0:00:42 | The reason we need personal VAD is that
0:00:45 | it reduces CPU, memory, and battery consumption for on-device speech recognition.
0:00:50 | We implement personal VAD
0:00:52 | as a frame-level detection system,
0:00:55 | which uses the target speaker embedding as a side input.
0:00:59 | I will start by giving some background.
0:01:02 | Most of the speech recognition systems
0:01:04 | are deployed on the cloud,
0:01:06 | but moving ASR to the device is an emerging trend.
0:01:10 | This is because
0:01:11 | on-device ASR does not require an internet connection, and it reduces the latency,
0:01:16 | because it does not need to communicate with servers.
0:01:20 | It also preserves the user's privacy better, because the audio never leaves the device.
0:01:26 | On-device ASR is usually used for smartphones or smart home speakers. For example,
0:01:31 | if you simply want to turn on the flashlight of your phone,
0:01:35 | you should be able to do it in airplane mode.
0:01:38 | If you want to turn on your lights,
0:01:40 | you should only need access to your local network.
0:01:44 | Although on-device ASR is great,
0:01:47 | there are lots of challenges.
0:01:49 | Unlike on servers,
0:01:50 | we only have a very limited budget of CPU, memory,
0:01:54 | and battery for ASR.
0:01:56 | Also,
0:01:56 | ASR is not the only program running on the device.
0:02:00 | For example, on smartphones, there are also many apps running in the background.
0:02:05 | So an important question is:
0:02:07 | when do we run ASR on the device? Apparently,
0:02:10 | it shouldn't be always running.
0:02:12 | A typical solution is to use keyword detection,
0:02:15 | also known as wake word detection,
0:02:17 | or hotword detection.
0:02:19 | For example, "OK Google"
0:02:21 | is the keyword for Google devices.
0:02:24 | Because the keyword detection model is usually very small,
0:02:27 | it's very cheap, and it can be always running.
0:02:30 | ASR, on the other hand, is a much bigger model,
0:02:32 | and is very expensive,
0:02:34 | so we only run it
0:02:35 | when the keyword is detected.
0:02:38 | However, not everyone likes the idea of always having to say a keyword
0:02:42 | before interacting with the device. Many people wish to be able to directly
0:02:47 | talk to the device, without having to say a keyword that we defined for them.
0:02:52 | So an alternative solution is to use voice activity detection instead of keyword detection.
0:02:57 | Like keyword detection models,
0:02:59 | VAD models are also very small,
0:03:02 | and very cheap to run.
0:03:03 | So you can have the VAD model always running,
0:03:06 | and only run ASR when VAD has been triggered.
0:03:11 | So how does VAD work?
0:03:13 | The VAD model is typically a frame-level binary classifier.
0:03:17 | For every frame of the speech signal,
0:03:20 | VAD classifies it into two categories:
0:03:22 | speech, and non-speech. After VAD,
0:03:26 | we throw away all the non-speech frames,
0:03:28 | and only keep the speech frames.
0:03:30 | Then we feed the speech frames to downstream components, like ASR or speaker recognition.
0:03:37 | The recognition results will be used for natural language processing,
0:03:40 | and then for downstream actions.
0:03:43 | The VAD model helps us reject all the non-speech frames,
0:03:47 | which saves lots of computational resources.
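As an illustration of this gating step, here is a minimal sketch in Python (not from the talk: the `vad_model` callable and the 0.5 threshold are illustrative assumptions):

```python
import numpy as np

def keep_speech_frames(features: np.ndarray, vad_model, threshold: float = 0.5):
    """Drop the frames that a frame-level VAD classifies as non-speech.

    features: (num_frames, feature_dim) acoustic features.
    vad_model: any callable returning per-frame speech probabilities.
    """
    speech_prob = vad_model(features)          # shape: (num_frames,)
    return features[speech_prob >= threshold]  # only speech frames go to ASR
```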
0:03:49 | But is this good enough?
0:03:51 | In a realistic scenario, you can talk to the device,
0:03:54 | but your kids can also talk to it, and if there is a TV in the
0:03:58 | living room, there will be someone talking on the TV as well.
0:04:01 | These are all valid speech signals, so VAD will simply accept all these frames, and
0:04:07 | ASR will run on all of them.
0:04:09 | For example,
0:04:10 | if you keep the TV playing,
0:04:12 | and ASR keeps running on your smartphone, it will quickly run out of battery.
0:04:18 | So that's why we are introducing personal VAD.
0:04:22 | Personal VAD is similar to the standard VAD:
0:04:24 | it is a frame-level classifier.
0:04:27 | But the difference is that it has three categories instead of two.
0:04:31 | We still have the non-speech class,
0:04:33 | but the other two are target speaker speech,
0:04:36 | and non-target speaker speech.
0:04:38 | Any speech that is not spoken by the target speaker,
0:04:41 | like other family members,
0:04:43 | or the TV,
0:04:44 | will be considered non-target speaker speech.
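For reference, the three classes could be represented as follows (a sketch; the integer values and short names are an assumed convention, not taken from the talk):

```python
from enum import IntEnum

class PersonalVadClass(IntEnum):
    NS = 0    # non-speech
    TSS = 1   # target speaker speech
    NTSS = 2  # non-target speaker speech (family members, TV, etc.)
```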
0:04:47 | The benefit of using personal VAD is that
0:04:51 | we only run ASR on target speaker speech.
0:04:54 | This means
0:04:55 | we will save lots of computational resources
0:04:57 | when the TV is on, when there are other
0:05:00 | family members in the user's household,
0:05:02 | or when the user is away.
0:05:05 | And to make this work, the key is that
0:05:08 | the personal VAD model must be tiny and fast,
0:05:10 | just like a keyword detection
0:05:12 | or standard VAD model.
0:05:14 | Also,
0:05:15 | the false rejects must be low,
0:05:17 | because
0:05:17 | we want to be responsive to the target user's requests.
0:05:21 | The false accepts should also be low,
0:05:23 | to really save the computational resources.
0:05:26 | When we first released this paper,
0:05:28 | there were some comments saying: this is not new, this is just
0:05:31 | speaker recognition, or speaker diarization.
0:05:34 | Here we want to clarify that,
0:05:36 | no, it is not.
0:05:37 | Personal VAD is very different from speaker recognition or speaker diarization.
0:05:42 | Speaker recognition models usually produce recognition results at utterance level,
0:05:46 | or window level.
0:05:48 | But personal VAD produces scores at frame level.
0:05:51 | It is a streaming model, and very sensitive to latency.
0:05:55 | Speaker recognition models are typically big,
0:05:58 | usually with more than five million parameters.
0:06:01 | Personal VAD is an always-running model; it must be very small, typically less than
0:06:06 | two hundred thousand parameters.
0:06:08 | Speaker diarization needs to cluster all the speakers,
0:06:11 | and the number of speakers is very important.
0:06:14 | But personal VAD only cares about the target speaker;
0:06:17 | everyone else will be simply represented as
0:06:19 | the non-target speaker.
0:06:22 | Next, I will talk about the implementation of personal VAD.
0:06:26 | To implement personal VAD,
0:06:28 | the first question is:
0:06:29 | how do we know whom to listen to?
0:06:32 | Well, these systems usually ask the user to enroll their voice,
0:06:36 | and this enrollment is a one-off experience,
0:06:38 | so its cost can be ignored at runtime.
0:06:41 | After enrollment,
0:06:42 | we will have a speaker embedding,
0:06:44 | also known as a d-vector,
0:06:47 | stored on the device.
0:06:48 | This embedding can be used for speaker recognition,
0:06:50 | also known as Voice Match. Naturally, it can also be used as the side input of
0:06:55 | personal VAD.
0:06:58 | There are different ways of implementing personal VAD.
0:07:01 | The simplest way is to directly combine a standard VAD model and a speaker
0:07:06 | verification system.
0:07:07 | We use this as a baseline.
0:07:09 | But in this paper,
0:07:10 | we propose to train a new personal VAD model,
0:07:13 | which takes the speaker verification score,
0:07:16 | or the speaker embedding, as input.
0:07:19 | So altogether, we implemented four different architectures for personal VAD.
0:07:23 | I'm going to talk about them one by one.
0:07:26 | First,
0:07:27 | score combination (SC). This is the baseline model that I mentioned earlier.
0:07:31 | We don't train any new model, but just use the existing VAD model and the
0:07:36 | speaker verification model.
0:07:38 | If the VAD output is speech,
0:07:40 | we verify whether this frame
0:07:42 | comes from the target speaker, using the speaker verification model, such that we have three
0:07:47 | different output classes,
0:07:48 | like personal VAD.
0:07:50 | Note that
0:07:51 | this implementation requires running the big speaker verification model at runtime,
0:07:56 | so it is an expensive solution.
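A minimal sketch of this score combination baseline, assuming cosine scoring against the enrolled d-vector and illustrative thresholds (the paper's exact combination rule may differ):

```python
import numpy as np

def score_combination(frame_feats, vad_model, sv_model, target_dvector,
                      vad_threshold=0.5, sv_threshold=0.6):
    """Gate a frame with standard VAD, then verify the speaker.

    vad_model and sv_model are assumed callables: the first returns a speech
    probability, the second a speaker embedding for the current window.
    Returns one of "ns", "tss", "ntss".
    """
    if vad_model(frame_feats) < vad_threshold:
        return "ns"                      # non-speech
    emb = sv_model(frame_feats)          # speaker embedding for this window
    cos = np.dot(emb, target_dvector) / (
        np.linalg.norm(emb) * np.linalg.norm(target_dvector))
    return "tss" if cos >= sv_threshold else "ntss"
```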
0:07:58 | Second,
0:07:59 | score conditioned training (SCT).
0:08:01 | Here we don't use the standard VAD model,
0:08:04 | but we still use the speaker verification model.
0:08:07 | We concatenate the speaker verification score
0:08:09 | with the acoustic features,
0:08:11 | and train a new personal VAD model
0:08:13 | on top of the concatenated features.
0:08:16 | This is still very expensive, because we need to run the speaker verification model at
0:08:20 | runtime.
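The SCT conditioning itself is just a feature concatenation; a sketch (the shapes are assumptions):

```python
import numpy as np

def sct_inputs(acoustic_feats: np.ndarray, sv_scores: np.ndarray) -> np.ndarray:
    """Append the frame-wise speaker verification score to each feature vector.

    acoustic_feats: (num_frames, feat_dim); sv_scores: (num_frames,).
    The personal VAD model is then trained on the returned features.
    """
    return np.concatenate([acoustic_feats, sv_scores[:, None]], axis=1)
```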
0:08:23 | Third, embedding conditioned training (ET).
0:08:25 | This is really the implementation that we want to use for on-device ASR.
0:08:29 | It directly concatenates the target speaker embedding with the acoustic features,
0:08:34 | and we train a new personal VAD model on the concatenated features.
0:08:38 | So the personal VAD model is the only model that we need at runtime.
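A sketch of what an ET-style model could look like, assuming 40-dimensional acoustic features, a 256-dimensional d-vector, and 64 LSTM cells; these dimensions are assumptions that roughly match the ~0.13M parameter budget mentioned later in the talk:

```python
import torch
import torch.nn as nn

class PersonalVadET(nn.Module):
    """Embedding-conditioned personal VAD sketch: the enrolled d-vector is
    concatenated to every frame's features, and a small LSTM emits per-frame
    logits over the three classes."""

    def __init__(self, feat_dim=40, dvector_dim=256, hidden=64, num_classes=3):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim + dvector_dim, hidden,
                            num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, feats: torch.Tensor, dvector: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_frames, feat_dim); dvector: (batch, dvector_dim)
        side = dvector.unsqueeze(1).expand(-1, feats.size(1), -1)
        x = torch.cat([feats, side], dim=-1)  # embedding conditioning
        out, _ = self.lstm(x)
        return self.fc(out)                   # (batch, num_frames, num_classes)
```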
0:08:44 | And finally,
0:08:45 | score and embedding conditioned training (SET). It concatenates
0:08:49 | both the speaker verification score
0:08:50 | and the embedding
0:08:51 | with the acoustic features,
0:08:53 | so it uses the most information from the speaker verification system, and is supposed
0:08:58 | to be the most powerful.
0:09:00 | But since it also requires running speaker verification at runtime,
0:09:04 | it is still not ideal for on-device ASR.
0:09:08 | OK, we have talked about the architectures.
0:09:11 | Let's talk about the loss functions.
0:09:13 | VAD is a classification problem,
0:09:16 | so standard VAD uses the binary cross entropy loss.
0:09:19 | Personal VAD has three classes, so naturally,
0:09:22 | we can use ternary cross entropy.
0:09:25 | But
0:09:26 | can we do better than cross entropy?
0:09:28 | If you think about the actual use case,
0:09:31 | both non-speech
0:09:32 | and non-target speaker speech
0:09:34 | will be discarded for ASR.
0:09:36 | So if we make a prediction error
0:09:38 | between non-speech
0:09:40 | and non-target speaker speech, it is actually not a big deal.
0:09:43 | To encode this knowledge into our loss function,
0:09:47 | we propose the weighted pairwise loss.
0:09:51 | It is similar to cross entropy,
0:09:53 | but we use a different weight for each pair of classes.
0:09:57 | For example, we use a smaller weight of 0.1 between the classes
0:10:01 | non-speech
0:10:02 | and non-target speaker speech,
0:10:04 | and use a larger weight of 1.0 for other pairs.
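A minimal sketch of this loss, assuming the pairwise form L(z, y) = sum over k != y of w(y, k) * log(1 + exp(z_k - z_y)), where z are a frame's class logits and y is its ground truth class; the class order, tensor shapes, and this exact formulation are assumptions:

```python
import torch
import torch.nn.functional as F

# Assumed class order: 0 = non-speech, 1 = target speaker speech,
# 2 = non-target speaker speech. Confusions between classes 0 and 2 get
# weight 0.1; all other pairs get weight 1.0; the diagonal is unused.
PAIR_WEIGHTS = torch.tensor([
    [0.0, 1.0, 0.1],
    [1.0, 0.0, 1.0],
    [0.1, 1.0, 0.0],
])

def weighted_pairwise_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """logits: (num_frames, 3); labels: (num_frames,) with values in {0, 1, 2}."""
    z_true = logits.gather(1, labels.unsqueeze(1))    # logit of the true class
    pair_terms = F.softplus(logits - z_true)          # log(1 + exp(z_k - z_true))
    weights = PAIR_WEIGHTS.to(logits.device)[labels]  # pair weights per frame
    return (weights * pair_terms).sum(dim=1).mean()
```

Setting all off-diagonal weights to 1.0 recovers a uniform pairwise penalty; the 0.1 entries are exactly where the "not a big deal" confusions are discounted.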
0:10:11 | Next,
0:10:11 | I will talk about the experiments.
0:10:15 | An ideal dataset for training and evaluating personal VAD
0:10:19 | should have these features:
0:10:20 | it should include real and natural speaker turns;
0:10:24 | it should cover diverse acoustic conditions;
0:10:27 | it should have frame-level speaker labels;
0:10:29 | finally, it should have enrollment utterances
0:10:31 | for each target speaker.
0:10:33 | Unfortunately,
0:10:34 | we cannot find a dataset that satisfies all these requirements.
0:10:39 | So we actually made an artificial dataset based on the well-known LibriSpeech dataset.
0:10:45 | Remember that we need frame-level speaker labels.
0:10:48 | For each and every speech utterance,
0:10:50 | we have its speaker label.
0:10:52 | We also have the ground truth ASR transcript.
0:10:55 | So we use a production ASR model
0:10:58 | to force-align the ground truth transcript
0:11:00 | with the audio,
0:11:01 | to get the timing of each word.
0:11:03 | With this timing information,
0:11:05 | we get the frame-level speaker labels.
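A sketch of how word timings from forced alignment can be turned into frame-level labels (the 100 frames-per-second rate and the label names are illustrative assumptions):

```python
def frame_labels(word_timings, is_target_speaker, num_frames, frames_per_sec=100):
    """word_timings: list of (start_sec, end_sec) for each aligned word.

    Frames inside a word get 'tss' or 'ntss' depending on whether the
    utterance comes from the target speaker; all other frames stay 'ns'.
    """
    labels = ["ns"] * num_frames
    tag = "tss" if is_target_speaker else "ntss"
    for start_sec, end_sec in word_timings:
        for t in range(int(start_sec * frames_per_sec),
                       min(int(end_sec * frames_per_sec), num_frames)):
            labels[t] = tag
    return labels
```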
0:11:08 | And to have conversational speech,
0:11:11 | we concatenate utterances from different speakers.
0:11:14 | We also use a room simulator
0:11:16 | to add reverberation and noise to the concatenated utterances.
0:11:20 | This will avoid domain overfitting, and also mitigate the concatenation artifacts.
0:11:27 | Here is the model configuration.
0:11:29 | Both the standard VAD and the personal VAD consist of two LSTM
0:11:33 | layers,
0:11:34 | and one fully connected layer.
0:11:36 | The model has 0.13 million parameters in total.
0:11:40 | The speaker verification model has three LSTM layers
0:11:43 | with projection, and one fully connected layer.
0:11:46 | This model is pretrained,
0:11:48 | and we freeze its parameters without fine-tuning.
0:11:51 | For evaluation,
0:11:52 | because this is a classification problem, we use average precision.
0:11:57 | We look at the average precision for each class, and also the mean average precision.
0:12:02 | We also look at the metrics both with and without added noise.
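A sketch of the evaluation metric, computing per-class average precision and their mean (mAP) with scikit-learn; this is a standard recipe, not necessarily the authors' exact evaluation code:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def per_class_ap_and_map(y_true: np.ndarray, scores: np.ndarray, num_classes=3):
    """y_true: (num_frames,) integer labels; scores: (num_frames, num_classes)."""
    aps = [average_precision_score(y_true == k, scores[:, k])
           for k in range(num_classes)]
    return aps, float(np.mean(aps))  # per-class APs and the mean AP
```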
0:12:08 | Next, results and conclusions.
0:12:12 | First,
0:12:12 | we compare the different architectures.
0:12:15 | Remember that
0:12:17 | SC is the baseline, made by directly combining standard VAD
0:12:21 | and speaker verification.
0:12:23 | And we find that all the other personal VAD models are better than the baseline.
0:12:28 | Among the proposed models,
0:12:30 | SET,
0:12:31 | the one that uses both the speaker verification score and the speaker
0:12:35 | embedding, is the best.
0:12:37 | This is kind of expected, because it uses the most speaker information.
0:12:42 | ET is the personal VAD model
0:12:44 | that only uses the speaker embedding, and it is the ideal one for on-device ASR.
0:12:48 | We note that ET is slightly worse than SET,
0:12:52 | but the difference is very small: it is near-optimal, yet has only 2.6
0:12:56 | percent of the parameters at runtime.
0:12:59 | We also compare the conventional cross entropy loss
0:13:02 | and the proposed weighted pairwise loss.
0:13:05 | We found that
0:13:06 | the weighted pairwise loss is consistently better
0:13:09 | than cross entropy, and the optimal weight between non-speech
0:13:13 | and non-target speaker speech is 0.1.
0:13:17 | Finally, since the ultimate goal of personal VAD is to replace the standard VAD, we
0:13:23 | compare the two on the standard VAD task. In some cases,
0:13:28 | personal VAD is slightly worse,
0:13:30 | but the differences are very small.
0:13:33 | So, the conclusions of this paper:
0:13:35 | the proposed personal VAD architectures
0:13:38 | outperform the baseline of directly combining VAD and speaker verification.
0:13:43 | Among the proposed architectures, SET has the best performance,
0:13:48 | but ET is the ideal one for on-device ASR,
0:13:51 | as it has near-optimal performance.
0:13:54 | We also propose the weighted pairwise loss,
0:13:57 | which outperforms the cross entropy loss.
0:13:59 | Finally, personal VAD performs almost equally well on standard VAD
0:14:05 | tasks.
0:14:07 | I will also briefly talk about future work directions.
0:14:11 | Currently, the personal VAD model is trained and evaluated on artificial conversations.
0:14:17 | In the future, we would like to use
0:14:18 | realistic conversational speech.
0:14:20 | This will require lots of data collection and labeling efforts.
0:14:24 | Besides,
0:14:25 | personal VAD can be used for speaker diarization,
0:14:28 | especially when there is overlapping speech in the conversation.
0:14:32 | And the good news is that
0:14:34 | people are already doing this.
0:14:35 | Researchers from Russia proposed a system known as target-speaker VAD,
0:14:41 | which is similar to personal VAD,
0:14:43 | and successfully used it for speaker diarization.
0:14:46 | If you like our paper,
0:14:47 | I would recommend you read their paper as well.
0:14:51 | If you have any questions,
0:14:52 | please leave us a comment on the Speaker Odyssey website, or reach out via the
0:14:56 | contact information in our paper.
0:14:58 | Thank you.