0:00:17 | Hi everyone. |
0:00:18 | This is Quan Wang from Google, and today I am going to talk about personal VAD, |
0:00:23 | also known as speaker-conditioned voice activity detection. |
0:00:27 | A big part of this work was done by Shaojin Ding, who was my intern last summer. |
0:00:34 | First of all, here is a summary of this work. |
0:00:37 | Personal VAD is a system to detect the voice activity of the target speaker. |
0:00:42 | The reason we need personal VAD is that |
0:00:45 | it reduces the CPU, memory, and battery consumption for on-device speech recognition. |
0:00:50 | We implement personal VAD as a frame-level voice activity detection system, |
0:00:55 | which uses the target speaker embedding as a side input. |
0:00:59 | I will start by giving some background. |
0:01:02 | Most of the speech recognition systems |
0:01:04 | are deployed on the cloud, |
0:01:06 | but moving ASR to the device is an emerging trend. |
0:01:10 | This is because |
0:01:11 | on-device ASR does not require an internet connection, and it greatly reduces the latency, |
0:01:16 | because it does not need to communicate with servers. |
0:01:20 | It also preserves the user's privacy better, because the audio never leaves the device. |
0:01:26 | On-device ASR is usually used for smartphones or smart home speakers. |
0:01:30 | For example, |
0:01:31 | if you simply want to turn on the flashlight on your phone, |
0:01:35 | you should be able to do it in airplane mode. |
0:01:38 | If you want to turn on your lights, |
0:01:40 | you should only need access to your local network. |
0:01:44 | While on-device ASR is great, |
0:01:47 | there are lots of challenges. |
0:01:49 | Unlike on servers, |
0:01:50 | we only have a very limited budget of CPU, memory, |
0:01:54 | and battery |
0:01:55 | for ASR. |
0:01:56 | Also, |
0:01:56 | ASR is not the only program running on the device. |
0:02:00 | For example, on smartphones there are also many other apps running in the background. |
0:02:05 | So an important question is: |
0:02:07 | when do we run ASR on the device? Apparently, |
0:02:10 | it shouldn't always be running. |
0:02:12 | A typical solution is to use keyword detection, |
0:02:15 | also known as wake word detection, |
0:02:17 | or hotword detection. |
0:02:19 | For example, |
0:02:20 | "OK Google" |
0:02:21 | is the keyword for Google devices. |
0:02:24 | Because the keyword detection model is usually very small, |
0:02:27 | it's very cheap, |
0:02:28 | and it can be always running. |
0:02:30 | ASR, on the other hand, is a gigantic model. |
0:02:32 | Since ASR is very expensive, |
0:02:34 | we only run it |
0:02:35 | when the keyword is detected. |
0:02:38 | However, not everyone likes the idea of always having to say a keyword |
0:02:43 | before they interact with the device. |
0:02:45 | Many people wish to be able to directly talk to the device, |
0:02:48 | without having to say a keyword that we defined for them. |
0:02:52 | So an alternative solution is to use voice activity detection instead of keyword detection. |
0:02:57 | Like keyword detection models, |
0:02:59 | VAD models are also very small, |
0:03:02 | and very cheap to run. |
0:03:03 | So we can have the VAD model always running, |
0:03:06 | and only run ASR when VAD has been triggered. |
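To make this gating pattern concrete, here is a minimal sketch. The `tiny_vad` energy rule and the threshold are placeholder assumptions, not the models discussed in the talk:

```python
import numpy as np

VAD_THRESHOLD = 0.5  # speech-probability threshold (assumed value)

def tiny_vad(frame: np.ndarray) -> float:
    """Placeholder for a small always-running detector: returns the
    probability that this frame contains speech. A real system would
    use a trained keyword or VAD model here."""
    energy = float(np.mean(frame ** 2))
    return 1.0 if energy > 1e-3 else 0.0

def big_asr(frame: np.ndarray) -> None:
    """Stand-in for the expensive speech recognizer."""
    pass

def process_stream(frames) -> None:
    """Run the cheap detector on every frame; invoke the expensive
    recognizer only when the detector triggers."""
    for frame in frames:
        if tiny_vad(frame) > VAD_THRESHOLD:
            big_asr(frame)
```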
0:03:11 | So how does VAD work? |
0:03:13 | The VAD model is typically a frame-level binary classifier. |
0:03:17 | For every frame of the speech signal, |
0:03:20 | VAD classifies it into two categories: |
0:03:22 | speech and non-speech. After VAD, |
0:03:26 | we throw away all the non-speech frames, |
0:03:28 | and only keep the speech frames. |
0:03:30 | Then we feed the speech frames to downstream components, |
0:03:34 | like ASR or speaker recognition. |
0:03:37 | The recognition results will be used for natural language processing, |
0:03:40 | and then trigger different actions. |
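As a sketch of that filtering step, assuming per-frame speech probabilities from some binary classifier (the model itself is omitted, and the dimensions are illustrative):

```python
import numpy as np

def filter_speech_frames(features: np.ndarray, speech_probs: np.ndarray,
                         threshold: float = 0.5) -> np.ndarray:
    """features: [num_frames, feat_dim]; speech_probs: [num_frames].
    Drops frames classified as non-speech and keeps the rest for
    downstream components like ASR or speaker recognition."""
    return features[speech_probs > threshold]

# Example with 1000 frames of 40-dim features (assumed feature type).
feats = np.random.randn(1000, 40).astype(np.float32)
probs = np.random.rand(1000)
speech_only = filter_speech_frames(feats, probs)
```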
0:03:43 | The VAD model will help us reject all the non-speech frames, |
0:03:47 | which will save lots of computational resources. |
0:03:49 | But is it good enough? |
0:03:51 | In a realistic scenario, |
0:03:53 | you can talk to the device, |
0:03:54 | but your kids can also talk to it. And if we have a TV in the living room, |
0:03:58 | there will be someone talking in the TV ads. |
0:04:01 | These are all valid speech signals, |
0:04:03 | so VAD will simply accept all these frames, |
0:04:06 | but some of them are not what we want. |
0:04:09 | For example, |
0:04:10 | if you keep the TV playing, |
0:04:12 | the ASR keeps running on this speech, which will cause your phone to run out of battery. |
0:04:18 | So that's why we are introducing personal VAD. |
0:04:22 | Personal VAD is similar to standard VAD: |
0:04:24 | it is a frame-level classifier. |
0:04:27 | But the difference is that it has three categories instead of two. |
0:04:31 | We still have the non-speech class, |
0:04:33 | but the other two are target speaker speech and non-target speaker speech. |
0:04:38 | Any speech that is not spoken by the target speaker, |
0:04:41 | like other family members |
0:04:43 | or TV, |
0:04:44 | will be considered non-target speaker speech. |
0:04:47 | The benefit of using personal VAD is that |
0:04:51 | we only run ASR on target speaker speech. |
0:04:54 | This means we will save lots of computational resources |
0:04:57 | when the TV is playing, when there are many members in the user's household, or when the user is at a party. |
0:05:05 | And to make this work, the key is that |
0:05:08 | the personal VAD model has to be tiny and fast, |
0:05:10 | just like a keyword detection or standard VAD model. |
0:05:14 | Also, the false rejects must be low, |
0:05:17 | because we want to be responsive to the target user's requests. |
0:05:21 | The false accepts should also be low, |
0:05:23 | to really save the computational resources. |
0:05:26 | When we first released this paper, |
0:05:28 | there were some comments like, "oh, this is not new, this is just |
0:05:31 | speaker recognition or speaker diarization." |
0:05:34 | Here we want to clarify that: |
0:05:36 | no, it is not. |
0:05:37 | Personal VAD is very different from speaker recognition or speaker diarization. |
0:05:42 | Speaker recognition models usually produce recognition results at the utterance level |
0:05:46 | or window level. |
0:05:48 | But personal VAD produces output scores at the frame level. |
0:05:51 | It is a streaming model, and very sensitive to latency. |
0:05:55 | Speaker recognition models can be big, and usually use more than five million parameters. |
0:06:01 | Personal VAD is an always-running model; it must be very small, |
0:06:05 | typically less than two hundred thousand parameters. |
0:06:08 | Speaker diarization needs to cluster all the speakers, |
0:06:11 | and the number of speakers is very important. |
0:06:14 | But personal VAD only cares about the target speaker; |
0:06:17 | everyone else will simply be represented as |
0:06:19 | non-target speaker. |
0:06:22 | Next, I will talk about the implementation of personal VAD. |
0:06:26 | To implement personal VAD, |
0:06:28 | the first question is: |
0:06:29 | how do we know whom to listen to? |
0:06:32 | Well, these systems usually ask the users to enroll their voice, |
0:06:36 | and this enrollment is a one-off experience, |
0:06:38 | so the cost can be ignored at runtime. |
0:06:41 | After enrollment, |
0:06:42 | we will have a speaker embedding, |
0:06:44 | or as shown on the slide, a d-vector, |
0:06:47 | stored on the device. |
0:06:48 | This embedding can be used for speaker recognition, |
0:06:50 | for example Voice Match. |
0:06:52 | So naturally, it can also be used as the side input for personal VAD. |
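For illustration, a common way to turn the stored embedding into a verification score is cosine similarity; this is a sketch under that assumption, with an assumed 256-dim d-vector and illustrative function names:

```python
import numpy as np

def cosine_score(runtime_embedding: np.ndarray,
                 enrolled_dvector: np.ndarray) -> float:
    """Cosine similarity between an embedding computed from recent audio
    and the target speaker's enrolled d-vector."""
    a = runtime_embedding / np.linalg.norm(runtime_embedding)
    b = enrolled_dvector / np.linalg.norm(enrolled_dvector)
    return float(np.dot(a, b))

enrolled = np.random.randn(256)  # stored on device at enrollment time
runtime = np.random.randn(256)   # computed from the incoming audio
print(cosine_score(runtime, enrolled))
```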
0:06:58 | There are different ways of implementing personal VAD. |
0:07:01 | The simplest way is to directly combine a standard VAD model and a speaker |
0:07:06 | verification system. |
0:07:07 | We use this as a baseline. |
0:07:09 | But in this paper, we propose to instead train a new personal VAD model, |
0:07:13 | which takes the speaker verification score |
0:07:16 | or the speaker embedding as input. |
0:07:19 | So we actually implemented four different architectures for personal VAD, and I am going to talk |
0:07:24 | about them one by one. |
0:07:26 | First: |
0:07:27 | score combination (SC). |
0:07:28 | This is the baseline model that I mentioned earlier. |
0:07:31 | We don't train any new model, |
0:07:33 | but just use the existing VAD model and the speaker verification model. |
0:07:38 | If the VAD output is speech, |
0:07:40 | we verify whether this frame |
0:07:42 | came from the target speaker using the speaker verification model, |
0:07:45 | such that we have the three different output classes |
0:07:48 | of personal VAD. |
0:07:50 | Note that |
0:07:51 | this implementation requires running the big speaker verification model at runtime, |
0:07:56 | so it is an expensive solution. |
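A hard-decision sketch of this baseline logic is below; the actual system combines the underlying scores, so the thresholding here is a simplification, and the threshold value is an assumption:

```python
def score_combination(vad_is_speech: bool, verification_score: float,
                      threshold: float = 0.5) -> str:
    """Maps one frame to the three personal VAD classes by cascading a
    standard VAD decision with a speaker verification score."""
    if not vad_is_speech:
        return "ns"    # non-speech
    if verification_score >= threshold:
        return "tss"   # target speaker speech
    return "ntss"      # non-target speaker speech
```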
0:07:58 | Second: |
0:07:59 | score conditioned training (ST). Here we don't use the standard VAD model, |
0:08:04 | but still use the speaker verification model. |
0:08:07 | We concatenate the speaker verification score |
0:08:09 | with the acoustic features, and train a new personal VAD model |
0:08:13 | on top of the concatenated features. |
0:08:16 | This is still very expensive, because we need to run the speaker verification model at |
0:08:20 | runtime. |
0:08:23 | Third: embedding conditioned training (ET). |
0:08:25 | This is really the implementation that we want to use for on-device ASR. |
0:08:29 | It directly concatenates the target speaker embedding with the acoustic features, |
0:08:34 | and we train a new personal VAD model on the concatenated features. |
0:08:38 | So the personal VAD model |
0:08:40 | is the only model that we need at runtime. |
0:08:44 | And finally: score and embedding conditioned training (SET). It concatenates |
0:08:49 | both the speaker verification score |
0:08:50 | and the embedding |
0:08:51 | with the acoustic features. |
0:08:53 | So it uses the most information from the speaker verification system, and is supposed |
0:08:58 | to be the most powerful. |
0:09:00 | But since it still requires running speaker verification at runtime, |
0:09:04 | it is |
0:09:05 | not ideal for on-device ASR. |
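The three trained variants differ only in what gets concatenated with the acoustic features. A sketch, with assumed dimensions (40-dim acoustic frames, 256-dim d-vector) and the LSTM classifier itself omitted:

```python
import numpy as np

def build_inputs(acoustic: np.ndarray, dvector: np.ndarray,
                 verif_scores: np.ndarray, variant: str) -> np.ndarray:
    """acoustic: [T, 40]; dvector: [256]; verif_scores: [T]."""
    T = acoustic.shape[0]
    emb = np.tile(dvector, (T, 1))   # repeat the embedding for every frame
    s = verif_scores[:, None]        # per-frame verification score column
    if variant == "ST":              # score conditioned training
        return np.concatenate([acoustic, s], axis=1)
    if variant == "ET":              # embedding conditioned training
        return np.concatenate([acoustic, emb], axis=1)
    if variant == "SET":             # score and embedding conditioned
        return np.concatenate([acoustic, s, emb], axis=1)
    raise ValueError(f"unknown variant: {variant}")
```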
0:09:08 | Okay, we have talked about architectures; let's talk about the loss function. |
0:09:13 | VAD is a classification problem, |
0:09:16 | so standard VAD uses binary cross entropy. Personal VAD has three classes, so naturally |
0:09:22 | we can use ternary cross entropy. |
0:09:25 | But can we do better than cross entropy? If you think about the actual use |
0:09:30 | case, |
0:09:31 | both non-speech and non-target speaker speech |
0:09:34 | will be discarded for ASR. |
0:09:36 | So if we make a prediction error |
0:09:38 | between non-speech |
0:09:40 | and non-target speaker speech, it is actually not a big deal. |
0:09:43 | We encode this knowledge into our loss function, |
0:09:47 | and propose the weighted pairwise loss. |
0:09:51 | It is similar to cross entropy, |
0:09:53 | but we use different weights for different pairs of classes. |
0:09:57 | For example, |
0:09:58 | we use a small weight of 0.1 between the classes non-speech |
0:10:02 | and non-target speaker speech, |
0:10:04 | and use a larger weight of 1.0 for the other pairs. |
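A sketch of a loss with this behavior is below. The pairwise-logistic form on logits is an assumption based on the description in the talk; see the paper for the exact definition:

```python
import numpy as np

CLASSES = ["ns", "tss", "ntss"]
# Pair weights: 0.1 between <ns, ntss>, 1.0 for every other pair.
PAIR_WEIGHTS = {("ns", "ntss"): 0.1, ("ntss", "ns"): 0.1}

def weighted_pairwise_loss(logits: np.ndarray, target: str) -> float:
    """logits: unnormalized scores for [ns, tss, ntss] on one frame.
    Confusions between down-weighted pairs are penalized less."""
    y = CLASSES.index(target)
    loss = 0.0
    for k, name in enumerate(CLASSES):
        if k == y:
            continue
        w = PAIR_WEIGHTS.get((target, name), 1.0)
        # log(1 + exp(z_k - z_y)): large when a wrong class outscores the truth
        loss += w * float(np.log1p(np.exp(logits[k] - logits[y])))
    return loss

# Confusing ns with ntss costs much less than confusing ns with tss.
print(weighted_pairwise_loss(np.array([0.5, 2.0, 1.5]), "ns"))
```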
0:10:11 | Next, I will talk about the experiments. |
0:10:15 | An ideal dataset for training and evaluating personal VAD |
0:10:19 | would have these features: |
0:10:20 | it should include realistic and natural speaker turns; |
0:10:24 | it should cover diverse noise conditions; |
0:10:27 | it should have frame-level speaker labels; and it should have enrollment utterances |
0:10:31 | for each target speaker. |
0:10:33 | Unfortunately, |
0:10:34 | we couldn't find a dataset that satisfies all these requirements, |
0:10:39 | so we actually made an artificial dataset based on the well-known LibriSpeech dataset. |
0:10:45 | Remember that we need the frame-level speaker labels. |
0:10:48 | For each LibriSpeech utterance, we have the speaker label; |
0:10:52 | we also have the ground truth ASR transcript. |
0:10:55 | So we used a pretrained ASR model to force-align the ground truth transcript |
0:11:00 | with the audio, to get the timing of each word. With this timing information, |
0:11:05 | we can get the frame-level speaker labels. |
0:11:08 | And to have conversational speech, we concatenate utterances from different speakers. |
0:11:14 | We also use a room simulator to add reverberation and noise |
0:11:18 | to the concatenated utterances. |
0:11:20 | This will avoid domain overfitting, and also mitigate the concatenation artifacts. |
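A sketch of the labeling and concatenation steps, with illustrative function and field names (the actual tooling used in the paper is not shown, and the 10 ms frame resolution is an assumption):

```python
import numpy as np

FRAME_MS = 10  # assumed label resolution

def frame_labels(word_timings, utt_ms, speaker, target_speaker):
    """word_timings: list of (start_ms, end_ms) from forced alignment.
    Returns one label per frame: 'ns', 'tss', or 'ntss'."""
    labels = np.full(utt_ms // FRAME_MS, "ns", dtype=object)
    spoken = "tss" if speaker == target_speaker else "ntss"
    for start, end in word_timings:
        labels[start // FRAME_MS : end // FRAME_MS] = spoken
    return labels

# Concatenating utterances from different speakers gives conversation-like
# audio whose labels are simply the concatenated per-utterance labels.
utt_a = frame_labels([(100, 600)], 1000, "spk1", target_speaker="spk1")
utt_b = frame_labels([(50, 400)], 500, "spk2", target_speaker="spk1")
conversation = np.concatenate([utt_a, utt_b])
```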
0:11:27 | Here is the model configuration. |
0:11:29 | Both the standard VAD and the personal VAD consist of two LSTM |
0:11:33 | layers |
0:11:34 | and one fully connected layer. |
0:11:36 | The model has around 130 thousand parameters in total. |
0:11:40 | The speaker verification model has three LSTM layers |
0:11:43 | with projection, |
0:11:44 | and one fully connected layer. |
0:11:46 | This model is pretty big, |
0:11:48 | with about five million parameters. |
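As a rough sanity check on these sizes, here is the standard LSTM parameter count applied to one plausible configuration (64-cell layers on a 296-dim conditioned input, as in the ET variant; the exact layer sizes are assumptions):

```python
def lstm_params(input_dim: int, cells: int) -> int:
    # 4 gates, each with input weights, recurrent weights, and a bias.
    return 4 * cells * (input_dim + cells + 1)

total = (lstm_params(296, 64)   # first LSTM: 40-dim features + 256-dim d-vector
         + lstm_params(64, 64)  # second LSTM
         + 64 * 3 + 3)          # fully connected layer to 3 classes
print(total)  # ~126K, in the ballpark of the quoted model size
```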
0:11:51 | For evaluation, |
0:11:52 | because this is a classification problem, |
0:11:55 | we use average precision. |
0:11:57 | We look at the average precision for each class, and also the mean average precision. |
0:12:02 | We also look at the metrics both with and without the added reverberation and noise. |
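A sketch of the metric computation using scikit-learn; the arrays here are random placeholders for per-frame ground truth and model scores:

```python
import numpy as np
from sklearn.metrics import average_precision_score

num_frames, num_classes = 1000, 3  # classes: ns, tss, ntss
y_true = np.eye(num_classes)[np.random.randint(0, num_classes, num_frames)]
y_score = np.random.rand(num_frames, num_classes)

class_ap = [average_precision_score(y_true[:, c], y_score[:, c])
            for c in range(num_classes)]
mean_ap = float(np.mean(class_ap))
print(class_ap, mean_ap)
```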
0:12:08 | Next are the results and conclusions. |
0:12:12 | First, |
0:12:12 | we compare the different architectures. |
0:12:15 | Remember that SC is the baseline, directly combining standard VAD |
0:12:21 | and speaker verification. |
0:12:23 | And we find that all the other personal |
0:12:25 | VAD models are better than the baseline. |
0:12:28 | Among the proposed models, SET, the one that uses both the speaker |
0:12:33 | verification score |
0:12:34 | and the speaker embedding, is the best. |
0:12:37 | This is kind of expected, because it uses the most speaker information. |
0:12:42 | ET is the personal VAD model |
0:12:44 | that only uses the speaker embedding, and it is the ideal one for on-device ASR. We note that |
0:12:49 | ET is slightly worse than SET, but the difference is small: it |
0:12:53 | is near optimal, but has only 2.6 percent of the parameters at runtime. |
0:12:59 | We also compare the conventional cross entropy loss |
0:13:02 | and the proposed weighted pairwise loss. |
0:13:05 | We found that the weighted pairwise loss is consistently better than cross entropy, and |
0:13:11 | the optimal weight between non-speech |
0:13:13 | and non-target speaker speech is 0.1. |
0:13:17 | Finally, since the ultimate goal of personal VAD |
0:13:21 | is to replace the standard VAD, |
0:13:23 | we compare the two on the standard VAD task. |
0:13:26 | In some cases personal VAD is slightly worse, |
0:13:30 | but the differences are very small. |
0:13:33 | So, the conclusions of this paper: |
0:13:35 | the proposed personal VAD architectures outperform the baseline of directly combining VAD and |
0:13:42 | speaker verification. |
0:13:43 | Among the proposed architectures, |
0:13:45 | SET has the best performance, but ET is the ideal one for |
0:13:50 | on-device ASR, |
0:13:51 | with near optimal performance. |
0:13:54 | We also proposed the weighted pairwise loss, |
0:13:57 | which outperforms the cross entropy loss. |
0:13:59 | Finally, personal VAD performs almost equally well on standard VAD |
0:14:05 | tasks. |
0:14:07 | Let me also briefly talk about future work directions. |
0:14:11 | Currently the personal VAD model is trained and evaluated on artificial conversations. |
0:14:17 | We should really use realistic conversational speech; |
0:14:20 | this will require lots of data collection and labeling efforts. |
0:14:24 | Besides, personal VAD can be used for speaker diarization, |
0:14:28 | especially when there is overlapping speech in the conversation. |
0:14:32 | And the good news is that people are already doing it. |
0:14:35 | Researchers from Russia proposed a system known as target-speaker VAD, |
0:14:41 | which is similar to personal VAD, |
0:14:43 | and successfully used it for speaker diarization. |
0:14:46 | If you like our paper, |
0:14:47 | I would recommend you read their paper as well. |
0:14:51 | If you have questions, |
0:14:52 | please leave a comment. |
0:14:54 | Links to these resources are on the website and in our paper. |
0:14:58 | Thank you. |