0:00:17 | Hi everyone,
0:00:18 | this is Quan Wang from Google.
0:00:20 | Today I'm going to talk about personal VAD, which is also known as
0:00:24 | speaker-conditioned voice activity detection.
0:00:27 | A big part of this work was done by Shaojin
0:00:30 | Ding, who was my intern last summer.
0:00:34 | First of all, here is a summary of this work.
0:00:37 | Personal VAD is a system to detect the voice activity of the target speaker.
0:00:42 | The reason we need personal VAD is that
0:00:45 | it reduces CPU, memory, and battery consumption for on-device speech recognition.
0:00:50 | We implement personal VAD
0:00:52 | as a frame-level detection system,
0:00:55 | which uses the target speaker embedding as a side input.
0:00:59 | I will start by giving some background.
0:01:02 | Most of the speech recognition systems
0:01:04 | are deployed on the cloud,
0:01:06 | but moving ASR to the device is an emerging trend.
0:01:10 | This is because
0:01:11 | on-device ASR does not require an internet connection, and it reduces the latency,
0:01:16 | because it does not need to communicate with servers.
0:01:20 | It also preserves the user's privacy better, because the audio never leaves the device.
0:01:26 | On-device ASR is usually used for smartphones or smart home speakers. For example,
0:01:31 | if you simply want to turn on the flashlight of your phone,
0:01:35 | you should be able to do it in airplane mode.
0:01:38 | If you want to turn on your lights,
0:01:40 | you should only need access to your local network.
0:01:44 | Although on-device ASR is great,
0:01:47 | there are lots of challenges.
0:01:49 | Unlike on servers,
0:01:50 | we only have a very limited budget of CPU, memory,
0:01:54 | and battery for ASR.
0:01:56 | Also,
0:01:56 | ASR is not the only program running on the device.
0:02:00 | For example, on smartphones, there are also many apps running in the background.
0:02:05 | So an important question is:
0:02:07 | when do we run ASR on the device? Apparently,
0:02:10 | it shouldn't be always running.
0:02:12 | A typical solution is to use keyword detection,
0:02:15 | also known as wake word detection,
0:02:17 | or hotword detection.
0:02:19 | For example, "OK Google"
0:02:21 | is the keyword for Google devices.
0:02:24 | Because the keyword detection model is usually very small,
0:02:27 | it's very cheap, and it can be always running.
0:02:30 | ASR, on the other hand, is a much bigger model,
0:02:32 | and is very expensive,
0:02:34 | so we only run it
0:02:35 | when the keyword is detected.
0:02:38 | However, not everyone likes the idea of always having to say a keyword
0:02:42 | before interacting with the device. Many people wish to be able to directly
0:02:47 | talk to the device, without having to say a keyword that we defined for them.
0:02:52 | So an alternative solution is to use voice activity detection instead of keyword detection.
0:02:57 | Like keyword detection models,
0:02:59 | VAD models are also very small,
0:03:02 | and very cheap to run.
0:03:03 | So you can have the VAD model always running,
0:03:06 | and only run ASR when VAD has been triggered.
0:03:11 | So how does VAD work?
0:03:13 | The VAD model is typically a frame-level binary classifier.
0:03:17 | For every frame of the speech signal,
0:03:20 | VAD classifies it into two categories:
0:03:22 | speech, and non-speech. After VAD,
0:03:26 | we throw away all the non-speech frames,
0:03:28 | and only keep the speech frames.
0:03:30 | Then we feed the speech frames to downstream components, like ASR or speaker recognition.
0:03:37 | The recognition results will be used for natural language processing,
0:03:40 | and then for downstream actions.
0:03:43 | The VAD model helps us reject all the non-speech frames,
0:03:47 | which saves lots of computational resources.
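As an illustration of this gating step, here is a minimal sketch in Python (not from the talk: the `vad_model` callable and the 0.5 threshold are illustrative assumptions):

```python
import numpy as np

def keep_speech_frames(features: np.ndarray, vad_model, threshold: float = 0.5):
    """Drop the frames that a frame-level VAD classifies as non-speech.

    features: (num_frames, feature_dim) acoustic features.
    vad_model: any callable returning per-frame speech probabilities.
    """
    speech_prob = vad_model(features)          # shape: (num_frames,)
    return features[speech_prob >= threshold]  # only speech frames go to ASR
```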
0:03:49 | But is this good enough?
0:03:51 | In a realistic scenario, you can talk to the device,
0:03:54 | but your kids can also talk to it, and if there is a TV in the
0:03:58 | living room, there will be someone talking on the TV as well.
0:04:01 | These are all valid speech signals, so VAD will simply accept all these frames, and
0:04:07 | ASR will run on all of them.
0:04:09 | For example,
0:04:10 | if you keep the TV playing,
0:04:12 | and ASR keeps running on your smartphone, it will quickly run out of battery.
0:04:18 | So that's why we are introducing personal VAD.
0:04:22 | Personal VAD is similar to the standard VAD:
0:04:24 | it is a frame-level classifier.
0:04:27 | But the difference is that it has three categories instead of two.
0:04:31 | We still have the non-speech class,
0:04:33 | but the other two are target speaker speech,
0:04:36 | and non-target speaker speech.
0:04:38 | Any speech that is not spoken by the target speaker,
0:04:41 | like other family members,
0:04:43 | or the TV,
0:04:44 | will be considered non-target speaker speech.
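For reference, the three classes could be represented as follows (a sketch; the integer values and short names are an assumed convention, not taken from the talk):

```python
from enum import IntEnum

class PersonalVadClass(IntEnum):
    NS = 0    # non-speech
    TSS = 1   # target speaker speech
    NTSS = 2  # non-target speaker speech (family members, TV, etc.)
```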
0:04:47 | The benefit of using personal VAD is that
0:04:51 | we only run ASR on target speaker speech.
0:04:54 | This means
0:04:55 | we will save lots of computational resources
0:04:57 | when the TV is on, when there are other
0:05:00 | family members in the user's household,
0:05:02 | or when the user is away.
0:05:05 | And to make this work, the key is that
0:05:08 | the personal VAD model must be tiny and fast,
0:05:10 | just like a keyword detection
0:05:12 | or standard VAD model.
0:05:14 | Also,
0:05:15 | the false rejects must be low,
0:05:17 | because
0:05:17 | we want to be responsive to the target user's requests.
0:05:21 | The false accepts should also be low,
0:05:23 | to really save the computational resources.
0:05:26 | When we first released this paper,
0:05:28 | there were some comments saying: this is not new, this is just
0:05:31 | speaker recognition, or speaker diarization.
0:05:34 | Here we want to clarify that,
0:05:36 | no, it is not.
0:05:37 | Personal VAD is very different from speaker recognition or speaker diarization.
0:05:42 | Speaker recognition models usually produce recognition results at utterance level,
0:05:46 | or window level.
0:05:48 | But personal VAD produces scores at frame level.
0:05:51 | It is a streaming model, and very sensitive to latency.
0:05:55 | Speaker recognition models are typically big,
0:05:58 | usually with more than five million parameters.
0:06:01 | Personal VAD is an always-running model; it must be very small, typically less than
0:06:06 | two hundred thousand parameters.
0:06:08 | Speaker diarization needs to cluster all the speakers,
0:06:11 | and the number of speakers is very important.
0:06:14 | But personal VAD only cares about the target speaker;
0:06:17 | everyone else will be simply represented as
0:06:19 | the non-target speaker.
0:06:22 | Next, I will talk about the implementation of personal VAD.
0:06:26 | To implement personal VAD,
0:06:28 | the first question is:
0:06:29 | how do we know whom to listen to?
0:06:32 | Well, these systems usually ask the user to enroll their voice,
0:06:36 | and this enrollment is a one-off experience,
0:06:38 | so its cost can be ignored at runtime.
0:06:41 | After enrollment,
0:06:42 | we will have a speaker embedding,
0:06:44 | also known as a d-vector,
0:06:47 | stored on the device.
0:06:48 | This embedding can be used for speaker recognition,
0:06:50 | also known as Voice Match. Naturally, it can also be used as the side input of
0:06:55 | personal VAD.
0:06:58 | There are different ways of implementing personal VAD.
0:07:01 | The simplest way is to directly combine a standard VAD model and a speaker
0:07:06 | verification system.
0:07:07 | We use this as a baseline.
0:07:09 | But in this paper,
0:07:10 | we propose to train a new personal VAD model,
0:07:13 | which takes the speaker verification score,
0:07:16 | or the speaker embedding, as input.
0:07:19 | So altogether, we implemented four different architectures for personal VAD.
0:07:23 | I'm going to talk about them one by one.
0:07:26 | First,
0:07:27 | score combination (SC). This is the baseline model that I mentioned earlier.
0:07:31 | We don't train any new model, but just use the existing VAD model and the
0:07:36 | speaker verification model.
0:07:38 | If the VAD output is speech,
0:07:40 | we verify whether this frame
0:07:42 | comes from the target speaker, using the speaker verification model, such that we have three
0:07:47 | different output classes,
0:07:48 | like personal VAD.
0:07:50 | Note that
0:07:51 | this implementation requires running the big speaker verification model at runtime,
0:07:56 | so it is an expensive solution.
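A minimal sketch of this score combination baseline, assuming cosine scoring against the enrolled d-vector and illustrative thresholds (the paper's exact combination rule may differ):

```python
import numpy as np

def score_combination(frame_feats, vad_model, sv_model, target_dvector,
                      vad_threshold=0.5, sv_threshold=0.6):
    """Gate a frame with standard VAD, then verify the speaker.

    vad_model and sv_model are assumed callables: the first returns a speech
    probability, the second a speaker embedding for the current window.
    Returns one of "ns", "tss", "ntss".
    """
    if vad_model(frame_feats) < vad_threshold:
        return "ns"                      # non-speech
    emb = sv_model(frame_feats)          # speaker embedding for this window
    cos = np.dot(emb, target_dvector) / (
        np.linalg.norm(emb) * np.linalg.norm(target_dvector))
    return "tss" if cos >= sv_threshold else "ntss"
```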
0:07:58 | Second,
0:07:59 | score conditioned training (SCT).
0:08:01 | Here we don't use the standard VAD model,
0:08:04 | but we still use the speaker verification model.
0:08:07 | We concatenate the speaker verification score
0:08:09 | with the acoustic features,
0:08:11 | and train a new personal VAD model
0:08:13 | on top of the concatenated features.
0:08:16 | This is still very expensive, because we need to run the speaker verification model at
0:08:20 | runtime.
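The SCT conditioning itself is just a feature concatenation; a sketch (the shapes are assumptions):

```python
import numpy as np

def sct_inputs(acoustic_feats: np.ndarray, sv_scores: np.ndarray) -> np.ndarray:
    """Append the frame-wise speaker verification score to each feature vector.

    acoustic_feats: (num_frames, feat_dim); sv_scores: (num_frames,).
    The personal VAD model is then trained on the returned features.
    """
    return np.concatenate([acoustic_feats, sv_scores[:, None]], axis=1)
```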
0:08:23 | Third, embedding conditioned training (ET).
0:08:25 | This is really the implementation that we want to use for on-device ASR.
0:08:29 | It directly concatenates the target speaker embedding with the acoustic features,
0:08:34 | and we train a new personal VAD model on the concatenated features.
0:08:38 | So the personal VAD model is the only model that we need at runtime.
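A sketch of what an ET-style model could look like, assuming 40-dimensional acoustic features, a 256-dimensional d-vector, and 64 LSTM cells; these dimensions are assumptions that roughly match the ~0.13M parameter budget mentioned later in the talk:

```python
import torch
import torch.nn as nn

class PersonalVadET(nn.Module):
    """Embedding-conditioned personal VAD sketch: the enrolled d-vector is
    concatenated to every frame's features, and a small LSTM emits per-frame
    logits over the three classes."""

    def __init__(self, feat_dim=40, dvector_dim=256, hidden=64, num_classes=3):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim + dvector_dim, hidden,
                            num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, feats: torch.Tensor, dvector: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_frames, feat_dim); dvector: (batch, dvector_dim)
        side = dvector.unsqueeze(1).expand(-1, feats.size(1), -1)
        x = torch.cat([feats, side], dim=-1)  # embedding conditioning
        out, _ = self.lstm(x)
        return self.fc(out)                   # (batch, num_frames, num_classes)
```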
0:08:44 | And finally,
0:08:45 | score and embedding conditioned training (SET). It concatenates
0:08:49 | both the speaker verification score
0:08:50 | and the embedding
0:08:51 | with the acoustic features,
0:08:53 | so it uses the most information from the speaker verification system, and is supposed
0:08:58 | to be the most powerful.
0:09:00 | But since it also requires running speaker verification at runtime,
0:09:04 | it is still not ideal for on-device ASR.
0:09:08 | OK, we have talked about the architectures.
0:09:11 | Let's talk about the loss functions.
0:09:13 | VAD is a classification problem,
0:09:16 | so standard VAD uses the binary cross entropy loss.
0:09:19 | Personal VAD has three classes, so naturally,
0:09:22 | we can use ternary cross entropy.
0:09:25 | But
0:09:26 | can we do better than cross entropy?
0:09:28 | If you think about the actual use case,
0:09:31 | both non-speech
0:09:32 | and non-target speaker speech
0:09:34 | will be discarded for ASR.
0:09:36 | So if we make a prediction error
0:09:38 | between non-speech
0:09:40 | and non-target speaker speech, it is actually not a big deal.
0:09:43 | To encode this knowledge into our loss function,
0:09:47 | we propose the weighted pairwise loss.
0:09:51 | It is similar to cross entropy,
0:09:53 | but we use a different weight for each pair of classes.
0:09:57 | For example, we use a smaller weight of 0.1 between the classes
0:10:01 | non-speech
0:10:02 | and non-target speaker speech,
0:10:04 | and use a larger weight of 1.0 for other pairs.
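A minimal sketch of this loss, assuming the pairwise form L(z, y) = sum over k != y of w(y, k) * log(1 + exp(z_k - z_y)), where z are a frame's class logits and y is its ground truth class; the class order, tensor shapes, and this exact formulation are assumptions:

```python
import torch
import torch.nn.functional as F

# Assumed class order: 0 = non-speech, 1 = target speaker speech,
# 2 = non-target speaker speech. Confusions between classes 0 and 2 get
# weight 0.1; all other pairs get weight 1.0; the diagonal is unused.
PAIR_WEIGHTS = torch.tensor([
    [0.0, 1.0, 0.1],
    [1.0, 0.0, 1.0],
    [0.1, 1.0, 0.0],
])

def weighted_pairwise_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """logits: (num_frames, 3); labels: (num_frames,) with values in {0, 1, 2}."""
    z_true = logits.gather(1, labels.unsqueeze(1))    # logit of the true class
    pair_terms = F.softplus(logits - z_true)          # log(1 + exp(z_k - z_true))
    weights = PAIR_WEIGHTS.to(logits.device)[labels]  # pair weights per frame
    return (weights * pair_terms).sum(dim=1).mean()
```

Setting all off-diagonal weights to 1.0 recovers a uniform pairwise penalty; the 0.1 entries are exactly where the "not a big deal" confusions are discounted.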
0:10:11 | Next,
0:10:11 | I will talk about the experiments.
0:10:15 | An ideal dataset for training and evaluating personal VAD
0:10:19 | should have these features:
0:10:20 | it should include real and natural speaker turns;
0:10:24 | it should cover diverse acoustic conditions;
0:10:27 | it should have frame-level speaker labels;
0:10:29 | finally, it should have enrollment utterances
0:10:31 | for each target speaker.
0:10:33 | Unfortunately,
0:10:34 | we cannot find a dataset that satisfies all these requirements.
0:10:39 | So we actually made an artificial dataset based on the well-known LibriSpeech dataset.
0:10:45 | Remember that we need frame-level speaker labels.
0:10:48 | For each and every speech utterance,
0:10:50 | we have its speaker label.
0:10:52 | We also have the ground truth ASR transcript.
0:10:55 | So we use a production ASR model
0:10:58 | to force-align the ground truth transcript
0:11:00 | with the audio,
0:11:01 | to get the timing of each word.
0:11:03 | With this timing information,
0:11:05 | we get the frame-level speaker labels.
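A sketch of how word timings from forced alignment can be turned into frame-level labels (the 100 frames-per-second rate and the label names are illustrative assumptions):

```python
def frame_labels(word_timings, is_target_speaker, num_frames, frames_per_sec=100):
    """word_timings: list of (start_sec, end_sec) for each aligned word.

    Frames inside a word get 'tss' or 'ntss' depending on whether the
    utterance comes from the target speaker; all other frames stay 'ns'.
    """
    labels = ["ns"] * num_frames
    tag = "tss" if is_target_speaker else "ntss"
    for start_sec, end_sec in word_timings:
        for t in range(int(start_sec * frames_per_sec),
                       min(int(end_sec * frames_per_sec), num_frames)):
            labels[t] = tag
    return labels
```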
0:11:08 | And to have conversational speech,
0:11:11 | we concatenate utterances from different speakers.
0:11:14 | We also use a room simulator
0:11:16 | to add reverberation and noise to the concatenated utterances.
0:11:20 | This will avoid domain overfitting, and also mitigate the concatenation artifacts.
0:11:27 | Here is the model configuration.
0:11:29 | Both the standard VAD and the personal VAD consist of two LSTM
0:11:33 | layers,
0:11:34 | and one fully connected layer.
0:11:36 | The model has 0.13 million parameters in total.
0:11:40 | The speaker verification model has three LSTM layers
0:11:43 | with projection, and one fully connected layer.
0:11:46 | This model is pretrained,
0:11:48 | and we freeze its parameters without fine-tuning.
0:11:51 | For evaluation,
0:11:52 | because this is a classification problem, we use average precision.
0:11:57 | We look at the average precision for each class, and also the mean average precision.
0:12:02 | We also look at the metrics both with and without added noise.
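A sketch of the evaluation metric, computing per-class average precision and their mean (mAP) with scikit-learn; this is a standard recipe, not necessarily the authors' exact evaluation code:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def per_class_ap_and_map(y_true: np.ndarray, scores: np.ndarray, num_classes=3):
    """y_true: (num_frames,) integer labels; scores: (num_frames, num_classes)."""
    aps = [average_precision_score(y_true == k, scores[:, k])
           for k in range(num_classes)]
    return aps, float(np.mean(aps))  # per-class APs and the mean AP
```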
0:12:08 | Next, results and conclusions.
0:12:12 | First,
0:12:12 | we compare the different architectures.
0:12:15 | Remember that
0:12:17 | SC is the baseline, made by directly combining standard VAD
0:12:21 | and speaker verification.
0:12:23 | And we find that all the other personal VAD models are better than the baseline.
0:12:28 | Among the proposed models,
0:12:30 | SET,
0:12:31 | the one that uses both the speaker verification score and the speaker
0:12:35 | embedding, is the best.
0:12:37 | This is kind of expected, because it uses the most speaker information.
0:12:42 | ET is the personal VAD model
0:12:44 | that only uses the speaker embedding, and it is the ideal one for on-device ASR.
0:12:48 | We note that ET is slightly worse than SET,
0:12:52 | but the difference is very small: it is near-optimal, yet has only 2.6
0:12:56 | percent of the parameters at runtime.
0:12:59 | We also compare the conventional cross entropy loss
0:13:02 | and the proposed weighted pairwise loss.
0:13:05 | We found that
0:13:06 | the weighted pairwise loss is consistently better
0:13:09 | than cross entropy, and the optimal weight between non-speech
0:13:13 | and non-target speaker speech is 0.1.
0:13:17 | Finally, since the ultimate goal of personal VAD is to replace the standard VAD, we
0:13:23 | compare the two on the standard VAD task. In some cases,
0:13:28 | personal VAD is slightly worse,
0:13:30 | but the differences are very small.
0:13:33 | So, the conclusions of this paper:
0:13:35 | the proposed personal VAD architectures
0:13:38 | outperform the baseline of directly combining VAD and speaker verification.
0:13:43 | Among the proposed architectures, SET has the best performance,
0:13:48 | but ET is the ideal one for on-device ASR,
0:13:51 | as it has near-optimal performance.
0:13:54 | We also propose the weighted pairwise loss,
0:13:57 | which outperforms the cross entropy loss.
0:13:59 | Finally, personal VAD performs almost equally well on standard VAD
0:14:05 | tasks.
0:14:07 | I will also briefly talk about future work directions.
0:14:11 | Currently, the personal VAD model is trained and evaluated on artificial conversations.
0:14:17 | In the future, we would like to use
0:14:18 | realistic conversational speech.
0:14:20 | This will require lots of data collection and labeling efforts.
0:14:24 | Besides,
0:14:25 | personal VAD can be used for speaker diarization,
0:14:28 | especially when there is overlapping speech in the conversation.
0:14:32 | And the good news is that
0:14:34 | people are already doing this.
0:14:35 | Researchers from Russia proposed a system known as target-speaker VAD,
0:14:41 | which is similar to personal VAD,
0:14:43 | and successfully used it for speaker diarization.
0:14:46 | If you like our paper,
0:14:47 | I would recommend you read their paper as well.
0:14:51 | If you have any questions,
0:14:52 | please leave us a comment on the Speaker Odyssey website, or reach out via the
0:14:56 | contact information in our paper.
0:14:58 | Thank you.