Hi everyone, this is Quan Wang from Google. Today I'm going to talk about personal VAD, which stands for speaker-conditioned voice activity detection. A big part of this work was done by Shaojin, who was my intern last summer.
First of all, here is a summary of this work. Personal VAD is a system to detect the voice activity of the target speaker. The reason we need personal VAD is that it reduces CPU, memory, and battery consumption for on-device speech recognition. We implement personal VAD as a frame-level voice activity detection system, which uses the target speaker embedding as a side input.
I will start by giving some background. Most speech recognition systems are deployed on the cloud, but moving ASR to the device is an emerging trend. This is because on-device ASR does not require an internet connection, and it greatly reduces the latency, because it does not need to communicate with servers. It also preserves the user's privacy better, because the audio never leaves the device.
On-device ASR is usually used for smartphones or smart home speakers. For example, if you simply want to turn on the flashlight of your phone, you should be able to do it even in airplane mode. If you want to turn on the lights, you only need access to your local network.
Although on-device ASR is great, there are lots of challenges. Unlike on servers, we only have a very limited budget of CPU, memory, and battery for ASR. Also, ASR is not the only program running on the device; for example, on smartphones there are also many apps running in the background. So an important question is: when do we run ASR on the device? Apparently, it shouldn't always be running.
A typical solution is to use keyword detection, also known as wake word detection or hotword detection. For example, "OK Google" is the keyword for Google devices. The keyword detection model is usually very small, so it's very cheap and it can be always running. ASR, on the other hand, is a much bigger model, and running ASR is very expensive, so we only run it when the keyword is detected.
However, not everyone likes the idea of always having to speak a keyword before interacting with the device. Many people wish to be able to directly talk to the device, without having to say a keyword that we defined for them.
So an alternative solution is to use voice activity detection instead of keyword detection. Like keyword detection models, VAD models are also very small and very cheap to run, so you can have the VAD model always running, and only run ASR when VAD has been triggered.
So how does VAD work? The VAD model is typically a frame-level binary classifier: for every frame of the speech signal, VAD classifies it into two categories, speech and non-speech. After VAD, we throw away all the non-speech frames and only keep the speech frames. Then we feed the speech frames to downstream components like ASR or speaker recognition, and the recognition results will be used for natural language processing and to take the corresponding actions. The VAD model helps us reject all the non-speech frames, which saves lots of computational resources.
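To make this filtering step concrete, here is a minimal sketch in Python; the per-frame scores, the 40-dimensional features, and the 0.5 decision threshold are all illustrative assumptions, not values from the talk:

```python
import numpy as np

# Hypothetical VAD scores: one speech probability per frame,
# as produced by a small frame-level binary classifier.
vad_scores = np.array([0.02, 0.10, 0.85, 0.93, 0.97, 0.88, 0.15, 0.04])
frames = np.random.randn(len(vad_scores), 40)   # 40-dim acoustic features per frame

SPEECH_THRESHOLD = 0.5                          # assumed operating point
speech_mask = vad_scores > SPEECH_THRESHOLD     # True = speech, False = non-speech

# Discard non-speech frames; only speech frames go to downstream
# components such as ASR or speaker recognition.
speech_frames = frames[speech_mask]
print(f"kept {speech_frames.shape[0]} of {len(frames)} frames")
```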
But is that good enough? In a realistic scenario, you can talk to the device, but your coworker can also talk to you, and in the living room there may be someone talking on the TV. These are all valid speech signals, so VAD will simply accept all these frames, and ASR will be triggered far more often than necessary. For example, if you have the TV playing and ASR keeps running on the smart device, it will quickly run out of battery.
So that's why we are introducing personal VAD. Personal VAD is similar to the standard VAD: it is a frame-level classifier. But the difference is that it has three categories instead of two. We still have the non-speech class, but the other two are target speaker speech and non-target speaker speech. Any speech that is not spoken by the target speaker, like other family members or the TV, will be considered non-target speaker speech.
The benefit of using personal VAD is that we only run ASR on the target speaker's speech. This means we will save lots of computational resources when the TV is on, when other family members in the user's household are talking, or when the user is at work. To make this work, the key is that the personal VAD model must be tiny and fast, just like a keyword detection or standard VAD model. Also, the false reject rate must be low, because we want to be responsive to the target user's requests. The false accept rate should also be low, to really save the computational resources.
When we first released this paper, there were some comments saying this is not new, it is just speaker recognition or speaker diarization. Here we want to clarify that no, it is not: personal VAD is very different from speaker recognition or speaker diarization. Speaker recognition models usually produce recognition results at the utterance level or window level, but personal VAD produces scores at the frame level; it is a streaming model and very sensitive to latency. Speaker recognition models are typically big, usually with more than five million parameters. Personal VAD is an always-running model, so it must be very small, typically less than two hundred thousand parameters. Speaker diarization needs to cluster and label all speakers, and the number of speakers is very important; personal VAD only cares about the target speaker, and everyone else is simply represented as non-target speaker.
Next, I will talk about the implementation of personal VAD. To implement personal VAD, the first question is: how do we know whom to listen to? Speech systems usually ask the user to enroll their voice, and this enrollment is a one-off experience, so its cost can be ignored at runtime. After enrollment, we will have a speaker embedding, also known as a d-vector, stored on the device. This embedding can be used for speaker recognition and Voice Match, and it can also be used as the side input of personal VAD.
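As an illustration, the enrollment step might look like the following sketch, where embed_utterance is a hypothetical stand-in for the real speaker encoder network, and the averaging-plus-normalization recipe is a common convention rather than a detail from the talk:

```python
import numpy as np

def embed_utterance(audio):
    """Placeholder for a speaker encoder that maps one utterance to a
    fixed-dimensional embedding; the real encoder is a neural network."""
    rng = np.random.default_rng(abs(hash(audio)) % (2**32))
    return rng.standard_normal(256)

# One-off enrollment: embed a few utterances from the target speaker,
# average them, and L2-normalize to get the d-vector stored on device.
enrollment_utterances = ["enroll_utt_1.wav", "enroll_utt_2.wav", "enroll_utt_3.wav"]
embeddings = np.stack([embed_utterance(u) for u in enrollment_utterances])
d_vector = embeddings.mean(axis=0)
d_vector /= np.linalg.norm(d_vector)
```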
There are different ways of implementing personal VAD. The simplest way is to directly combine a standard VAD model and a speaker verification system; we use this as a baseline. But in this paper, we propose to train a new personal VAD model, which takes the speaker verification score or the speaker embedding as input. Overall, we implemented four different architectures for personal VAD, and I'm going to talk about them one by one.
First, score combination (SC). This is the baseline model that I mentioned earlier. We don't train any new model, but just use the existing VAD model and the speaker verification model. If the VAD output is speech, we verify whether this frame comes from the target speaker using the speaker verification model, such that we get the three output classes of personal VAD. Note that this implementation requires running the big speaker verification model at runtime, so it's an expensive solution.
The second one is score conditioned training (ST). Here we don't use the standard VAD model, but we still use the speaker verification model. We concatenate the speaker verification score with the acoustic features, and train a new personal VAD model on top of the concatenated features. This is still very expensive, because we need to run the speaker verification model at runtime.
The third one is embedding conditioned training (ET). This is really the implementation that we want to use for on-device ASR. It directly concatenates the target speaker embedding with the acoustic features, and we train a new personal VAD model on the concatenated features, so the personal VAD model is the only model that we need at runtime.
And finally, score and embedding conditioned training (SET). It concatenates both the speaker verification score and the speaker embedding with the acoustic features, so it uses the most information from the speaker verification system and is supposed to be the most powerful. But since it requires running speaker verification at runtime, it's still not ideal for on-device ASR.
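As a rough illustration of the embedding conditioned (ET) variant, here is a sketch in PyTorch. The two LSTM layers plus one fully connected layer follow the model configuration described later in this talk, but the hidden size and feature dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class PersonalVAD(nn.Module):
    """Sketch of the embedding-conditioned (ET) variant: the target
    speaker's d-vector is concatenated with the acoustic features of
    every frame, and a small LSTM predicts one of three classes per
    frame (non-speech, non-target speaker, target speaker)."""

    def __init__(self, feat_dim=40, dvector_dim=256, hidden=64, num_classes=3):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim + dvector_dim, hidden,
                            num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, features, d_vector):
        # features: (batch, time, feat_dim); d_vector: (batch, dvector_dim)
        tiled = d_vector.unsqueeze(1).expand(-1, features.size(1), -1)
        x = torch.cat([features, tiled], dim=-1)   # condition every frame
        out, _ = self.lstm(x)
        return self.fc(out)                        # per-frame class logits

model = PersonalVAD()
logits = model(torch.randn(2, 100, 40), torch.randn(2, 256))
print(logits.shape)  # torch.Size([2, 100, 3])
```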
OK, we have talked about the architectures; now let's talk about the loss functions. VAD is a classification problem, so standard VAD uses the binary cross entropy loss. Personal VAD has three classes, so naturally we can use ternary cross entropy. But can we do better than cross entropy? If you think about the actual use case, both non-speech and non-target speaker speech will be discarded before ASR, so making a prediction error between non-speech and non-target speaker speech is actually not a big deal. To incorporate this knowledge into our loss function, we propose the weighted pairwise loss. It is similar to cross entropy, but we use a different weight for different pairs of classes. For example, we use a smaller weight of 0.1 between the classes non-speech and non-target speaker speech, and use a larger weight of 1 for the other pairs.
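One plausible way to realize this idea is sketched below: each competing class logit is penalized against the true class logit, scaled by a pair-specific weight. This is an illustration of the concept, not necessarily the paper's exact equation:

```python
import numpy as np

# Class indices: 0 = non-speech, 1 = non-target speaker, 2 = target speaker.
# The weight matrix down-weights confusions between non-speech and
# non-target speaker speech (0.1), since both are discarded before ASR,
# and keeps weight 1.0 for every pair involving the target speaker.
W = np.array([[0.0, 0.1, 1.0],
              [0.1, 0.0, 1.0],
              [1.0, 1.0, 0.0]])

def weighted_pairwise_loss(logits, label):
    """Loss for one frame: a softplus penalty for each wrong class
    whose logit competes with the true class, scaled by the pair weight."""
    loss = 0.0
    for k in range(len(logits)):
        if k == label:
            continue
        loss += W[label, k] * np.log1p(np.exp(logits[k] - logits[label]))
    return loss

print(weighted_pairwise_loss(np.array([2.0, 1.5, -0.5]), label=0))
```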
Next, I will talk about the experiments. An ideal dataset for training and evaluating personal VAD should have these features: it should include realistic and natural speaker turns; it should cover diverse voice conditions; it should have frame-level speaker labels; and finally, it should have enrollment utterances for each target speaker. Unfortunately, we couldn't find a dataset that satisfies all these requirements, so we made an artificial dataset based on the well-known LibriSpeech dataset.
Remember that we need frame-level speaker labels. For each LibriSpeech utterance, we have the speaker label, and we also have the ground truth ASR transcript. So we use a pretrained ASR model to force-align the ground truth transcript with the audio, to get the timing of each word. With this timing information, we get the frame-level speaker labels. Then, to have conversational speech, we concatenate utterances from different speakers. We also use a room simulator to add reverberant noise to the concatenated utterances; this avoids domain overfitting and also mitigates the concatenation artifacts.
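The labeling and concatenation steps can be sketched as follows, assuming a hypothetical 10 ms frame step and word timings in milliseconds from the forced alignment:

```python
import numpy as np

FRAME_MS = 10  # assumed frame step

def frame_labels(word_timings, speaker_is_target, total_ms):
    """Turn forced-alignment word timings into per-frame labels:
    0 = non-speech, 1 = non-target speaker, 2 = target speaker."""
    labels = np.zeros(total_ms // FRAME_MS, dtype=int)
    speech_class = 2 if speaker_is_target else 1
    for start_ms, end_ms in word_timings:
        labels[start_ms // FRAME_MS : end_ms // FRAME_MS] = speech_class
    return labels

# Concatenate two single-speaker utterances into a fake conversation:
# first the target speaker's words, then another speaker's words.
utt_a = frame_labels([(50, 400), (450, 900)], speaker_is_target=True,  total_ms=1000)
utt_b = frame_labels([(100, 600)],            speaker_is_target=False, total_ms=800)
conversation_labels = np.concatenate([utt_a, utt_b])
```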
Here is the model configuration. Both the standard VAD and the personal VAD consist of two LSTM layers and one fully connected layer; the model has 0.13 million parameters in total. The speaker verification model has three LSTM layers with projection and one fully connected layer; this model is pretrained, and its parameters are frozen without fine-tuning. For evaluation, because this is a classification problem, we use average precision. We look at the average precision for each class, and also the mean average precision. We also look at the metrics both with and without reverberant noise.
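For reference, here is a minimal sketch of this evaluation using scikit-learn's one-vs-rest average precision, on toy per-frame data:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Toy per-frame ground truth (3 classes) and predicted class scores.
y_true = np.array([0, 2, 2, 1, 0, 2, 1, 2])
scores = np.random.rand(len(y_true), 3)
scores /= scores.sum(axis=1, keepdims=True)

# One-vs-rest average precision per class, then the mean (mAP).
aps = [average_precision_score((y_true == c).astype(int), scores[:, c])
       for c in range(3)]
print("per-class AP:", aps, " mAP:", np.mean(aps))
```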
Next, the results and conclusions. First, we compare the different architectures. Remember that SC is the baseline that directly combines standard VAD and speaker verification. We find that all the other personal VAD models are better than this baseline. Among the proposed models, we see that SET, the one that uses both the speaker verification score and the speaker embedding, is the best. This is kind of expected, because it uses the most speaker information. ET is the personal VAD model that only uses the speaker embedding, and it is the ideal one for on-device ASR. We note that ET is slightly worse than SET, but the difference is small: it is near-optimal, while using only 2.6 percent of the parameters at runtime.
We also compare the conventional cross entropy loss and the proposed weighted pairwise loss. We found that the weighted pairwise loss is consistently better than cross entropy, and the optimal weight between non-speech and non-target speaker speech is 0.1. Finally, since the ultimate goal of personal VAD is to replace the standard VAD, we compare the two on standard VAD tasks. In some cases, personal VAD is slightly worse, but the differences are very small.
So, the conclusions of this paper. The proposed personal VAD architectures outperform the baseline of directly combining VAD and speaker verification. Among the proposed architectures, SET has the best performance, but ET is the ideal one for on-device ASR, as it has near-optimal performance. We also propose the weighted pairwise loss, which outperforms the cross entropy loss. Finally, personal VAD performs almost equally well as standard VAD on standard VAD tasks.
I will also briefly talk about future work directions. Currently, the personal VAD model is trained and evaluated on artificial conversations; we hope to eventually use realistic conversational speech, which will require lots of data collection and labeling efforts. Besides, personal VAD can be used for speaker diarization, especially when there is overlapping speech in the conversation. The good news is that people are already doing this: researchers from Russia proposed a system known as target-speaker VAD, which is similar to personal VAD, and successfully used it for speaker diarization. If you like our paper, I would recommend you read their paper as well.
If you have any questions, please leave a comment; the contact information of the speakers can be found on the conference website and in our paper. Thank you.