Hi everyone. This is Quan Wang from Google, and today I'm going to talk about personal VAD, also known as speaker-conditioned voice activity detection. A big part of this work was done by Shaojin Ding, who was my intern last summer.
First of all, here is a summary of this work. Personal VAD is a system to detect the voice activity of the target speaker. The reason we need personal VAD is that it reduces the CPU, memory, and battery consumption for on-device speech recognition. We implement personal VAD as a frame-level voice activity detection system which uses the speaker embedding as a side input.
I will start by giving some background. Most speech recognition systems are deployed on the cloud, but moving ASR to the device has been a trend in the industry. This is because on-device ASR does not require an internet connection, and it greatly reduces the latency, because it does not need to communicate with servers. It also preserves the user's privacy better, because the audio never leaves the device. On-device ASR is usually used for smartphones or smart home speakers.
For example, if you simply want to turn on the flashlight on your phone, you should be able to do it in airplane mode. If you want to turn on your lights, it should only need access to your local network.
While on-device ASR is great, there are lots of challenges. Unlike on servers, we only have a very limited budget of CPU, memory, and battery for ASR. Also, ASR is not the only program running on the device; for example, on smartphones there are many other apps running in the background. So an important question is: when do we run ASR on the device? Apparently, it shouldn't always be running.
A typical solution is to use keyword detection, also known as wake word detection or hotword detection. For example, "Hey Google" is the keyword for Google devices. Because the keyword detection model is usually very small, it's very cheap, and it can be always running. ASR, on the other hand, is usually a big model, and it is very expensive, so we only run it when the keyword is detected.
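(For illustration, here is a minimal Python sketch of this keyword-gated cascade; detect_keyword and run_asr are hypothetical placeholders, not real APIs:)

    # Hypothetical sketch of a keyword-gated ASR pipeline.
    # detect_keyword() stands for a tiny always-on model;
    # run_asr() stands for the expensive model invoked only on a trigger.
    def process_stream(audio_frames, detect_keyword, run_asr):
        triggered = False
        buffered = []
        for frame in audio_frames:
            if not triggered:
                # The cheap model runs on every frame.
                triggered = detect_keyword(frame)
            else:
                # After the trigger, collect audio for recognition.
                buffered.append(frame)
        # The expensive model runs only after the keyword fired.
        return run_asr(buffered) if buffered else None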
However, not everyone likes the idea of always having to say a keyword before interacting with the device. Many people wish to be able to directly talk to the device, without having to say the keyword first. So an alternative solution is to use voice activity detection instead of keyword detection. Like keyword detection models, VAD models are also very small and very cheap to run. So you can have the VAD model always running, and only run ASR when VAD has been triggered.
So how does VAD work? The VAD model is typically a frame-level binary classifier: for every frame of the speech signal, VAD classifies it into two categories, speech and non-speech. After VAD, we throw away all the non-speech frames and only keep the speech frames. Then we feed the speech frames to downstream components, like ASR or speaker recognition. The recognition results will be used for natural language processing, and then trigger different actions. The VAD model helps us reject all the non-speech frames, which saves lots of computational resources.
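(As a sketch, assuming per-frame speech probabilities from some binary VAD model, the filtering step could look like this in Python:)

    import numpy as np

    def vad_filter(frames, speech_probs, threshold=0.5):
        """Keep only the frames classified as speech.

        frames: (T, D) array of acoustic features.
        speech_probs: (T,) per-frame speech probabilities,
        assumed to come from a binary VAD model.
        """
        keep = np.asarray(speech_probs) > threshold
        return np.asarray(frames)[keep]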
But is this good enough? In a realistic scenario, you can talk to the device, but other people can also be talking around you. And if there is a TV in your living room, there will be someone talking in the TV ads. These are all valid speech signals, so standard VAD will simply accept all of these frames, which causes trouble. For example, if you keep the TV playing and ASR keeps running on your smartwatch, the watch is going to run out of battery very quickly.
So that's why we are introducing personal VAD. Personal VAD is similar to standard VAD: it is a frame-level classifier. But the difference is that it has three categories instead of two. We still have the non-speech class, but the other two are target speaker speech and non-target speaker speech. Any speech that is not spoken by the target speaker, like other family members or the TV, will be considered non-target speaker speech.
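(A minimal sketch of this label space; the names below are ours, for illustration only:)

    # Personal VAD label space (illustrative names).
    NS = 0    # non-speech
    TSS = 1   # target speaker speech
    NTSS = 2  # non-target speaker speech

    def to_personal_vad_label(is_speech, is_target_speaker):
        if not is_speech:
            return NS
        return TSS if is_target_speaker else NTSS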
The benefit of using personal VAD is that we only run ASR on target speaker speech. This means we will save lots of computational resources when the TV is playing, when there are multiple members in the user's household, or when there are other people talking around the user. To make this work, the key is that the personal VAD model needs to be highly compact and fast, just like a keyword detection or standard VAD model. Also, the false rejections must be low, because we want to be responsive to the target user's requests. The false accepts should also be low, to really save the computational resources.
When we first released this paper, there were some comments like: "Oh, this is nothing new; this is just speaker recognition or speaker diarization." Here we want to clarify that, no, it is not. Personal VAD is very different from speaker recognition and speaker diarization. Speaker recognition models usually produce recognition results at the utterance level or window level, but personal VAD produces scores at the frame level; it is a streaming model and is very sensitive to latency. Speaker recognition models can be big, usually more than five million parameters, while personal VAD is an always-running model, so it must be very small, typically less than two hundred thousand parameters. Speaker diarization needs to cluster all the speakers, and the number of speakers is very important; personal VAD only cares about the target speaker, and everyone else is simply represented as non-target speaker.
Next, I will talk about the implementation of personal VAD. To implement personal VAD, the first question is: how do we know whom to listen to? Well, such systems usually ask the users to enroll their voice, and this enrollment is a one-off experience, so its cost can be ignored at runtime. After enrollment, we have a speaker embedding, which in our case is the d-vector, stored on the device. This embedding can be used for speaker recognition, or voice match. So naturally, it can also be used as the side input of personal VAD.
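(For illustration, a common way to obtain such an enrollment embedding is to average per-utterance d-vectors and normalize; this sketch assumes an embed_utterance function from some speaker encoder:)

    import numpy as np

    def enroll(utterances, embed_utterance):
        """One-off enrollment: average and L2-normalize d-vectors.

        embed_utterance(u) -> (D,) embedding; assumed given.
        The result is stored on the device and reused at runtime.
        """
        vecs = np.stack([embed_utterance(u) for u in utterances])
        centroid = vecs.mean(axis=0)
        return centroid / np.linalg.norm(centroid)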
There are different ways of implementing personal VAD. The simplest way is to directly combine a standard VAD model and a speaker verification system; we use this as a baseline. But in this paper, we propose to train a new personal VAD model, which takes the speaker verification score or the speaker embedding as input. In total, we implemented four different architectures for personal VAD, and I am going to talk about them one by one.
First, score combination (SC). This is the baseline model that I mentioned earlier. We don't train any new model, but just use the existing VAD model and the speaker verification model. If the VAD output is speech, we verify whether this frame comes from the target speaker using the speaker verification model, such that we get the three different output classes of personal VAD. Note that this implementation requires running the big speaker verification model at runtime, so it is an expensive solution.
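(A sketch of one plausible way to combine the two scores per frame; the multiplicative rule here is an illustrative assumption, not necessarily the exact rule in the paper:)

    import numpy as np

    def score_combination(vad_speech_prob, sv_score):
        """Baseline SC: derive three personal VAD scores for one frame
        from an existing VAD posterior and a speaker verification score.
        """
        p_ns = 1.0 - vad_speech_prob                 # non-speech
        p_tss = vad_speech_prob * sv_score           # target speaker speech
        p_ntss = vad_speech_prob * (1.0 - sv_score)  # non-target speaker speech
        return np.array([p_ns, p_tss, p_ntss])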
Second, score-conditioned training (ST). Here we don't use the standard VAD model, but we still use the speaker verification model. We concatenate the speaker verification score with the acoustic features, and train a new personal VAD model on top of the concatenated features. This is still very expensive, because we need to run the speaker verification model at runtime.
Third, embedding-conditioned training (ET). This is really the implementation that we want to use for on-device ASR. It directly concatenates the target speaker embedding with the acoustic features, and we train a new personal VAD model on the concatenated features. So the personal VAD model is the only model that we need at runtime.
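(A minimal PyTorch sketch of this embedding-conditioned variant; the layer sizes below are placeholders, not the paper's exact configuration:)

    import torch
    import torch.nn as nn

    class PersonalVAD(nn.Module):
        """Embedding-conditioned personal VAD (illustrative sizes)."""
        def __init__(self, feat_dim=40, emb_dim=256, hidden=64):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim + emb_dim, hidden,
                                num_layers=2, batch_first=True)
            self.fc = nn.Linear(hidden, 3)  # ns / tss / ntss logits

        def forward(self, feats, emb):
            # feats: (B, T, feat_dim); emb: (B, emb_dim) enrollment d-vector.
            emb = emb.unsqueeze(1).expand(-1, feats.size(1), -1)
            x = torch.cat([feats, emb], dim=-1)
            out, _ = self.lstm(x)
            return self.fc(out)  # (B, T, 3) frame-level logits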
And finally, score and embedding conditioned training (SET). Here we concatenate both the speaker verification score and the embedding with the acoustic features, so it uses the most information from the speaker verification system and is supposed to be the most powerful. But since it also requires running speaker verification at runtime, it is still not ideal for on-device ASR.
OK, we have talked about the architectures; now let's talk about the loss function. VAD is a classification problem, so standard VAD uses the binary cross entropy loss. Personal VAD has three classes, so naturally we can use the ternary cross entropy. But can we do better than cross entropy? If you think about the actual use case, both non-speech and non-target speaker speech will be discarded before ASR. So if you make a prediction error between non-speech and non-target speaker speech, it is actually not a big deal.
We encode this knowledge into our loss function, and propose the weighted pairwise loss. It is similar to cross entropy, but we use a different weight for different pairs of classes. For example, we use a smaller weight of 0.1 between the classes non-speech and non-target speaker speech, and use a larger weight of 1 for the other pairs.
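(A numpy sketch of this idea; the pairwise logistic form below is one plausible way to write a pairwise-weighted loss, with w(ns, ntss) = 0.1 and 1 for all other pairs:)

    import numpy as np

    NS, TSS, NTSS = 0, 1, 2
    # Confusing non-speech with non-target speech is cheap (0.1);
    # every other confusion gets the full weight (1.0).
    W = np.ones((3, 3))
    W[NS, NTSS] = W[NTSS, NS] = 0.1

    def weighted_pairwise_loss(logits, label):
        """logits: (3,) logits for one frame; label: ground-truth class."""
        loss = 0.0
        for k in range(3):
            if k != label:
                loss += W[label, k] * np.log1p(np.exp(logits[k] - logits[label]))
        return loss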
Next, I will talk about the experiments. An ideal dataset for training and evaluating personal VAD should have these features: it should include realistic and natural speaker turns; it should cover diverse acoustic conditions; it should have frame-level speaker labels; and it should have enrollment utterances for each target speaker. Unfortunately, we could not find a dataset that satisfies all these requirements, so we actually made an artificial dataset based on the well-known LibriSpeech dataset.
Remember that we need the frame-level speaker labels. For each LibriSpeech utterance, we have the speaker label, and we also have the ground truth transcript. So we used a pretrained ASR model to force-align the ground truth transcript with the audio to get the timing of each word. With this timing information, we can get the frame-level speaker labels.
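(For illustration, once forced alignment gives per-word start and end times, frame labels can be painted in like this; the 10 ms frame shift is an assumption:)

    import numpy as np

    def frame_labels(word_times, speaker_id, target_id,
                     num_frames, frame_shift=0.01):
        """word_times: list of (start_sec, end_sec) for each word.
        Frames inside a word get tss or ntss depending on whether the
        utterance's speaker is the target; all other frames are ns.
        """
        NS, TSS, NTSS = 0, 1, 2
        labels = np.full(num_frames, NS)
        spk = TSS if speaker_id == target_id else NTSS
        for start, end in word_times:
            i, j = int(start / frame_shift), int(end / frame_shift)
            labels[i:min(j, num_frames)] = spk
        return labels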
And to have conversational speech, we concatenate utterances from different speakers. We also use a room simulator to add noise and reverberation to the concatenated utterances. This helps avoid domain overfitting and also mitigates the concatenation artifacts.
Here is the model configuration. Both the standard VAD and the personal VAD models consist of two LSTM layers and one fully connected layer; the model has about 0.13 million parameters in total. The speaker verification model has three LSTM layers with projection, and one fully connected layer; this model is pretty big, with about five million parameters.
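(As a sanity check on model size, parameters can be counted like this, reusing the PersonalVAD sketch from earlier; with our placeholder layer sizes the count will differ from the paper's figure:)

    # Count the trainable parameters of the sketch model above.
    model = PersonalVAD()
    n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"personal VAD parameters: {n_params / 1e6:.2f}M")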
For evaluation, because this is a classification problem, we use average precision. We look at the average precision for each class, and also the mean average precision. We also look at the metrics both with and without the added room-simulator noise.
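(For illustration, per-class average precision and the mean average precision can be computed with scikit-learn in a one-vs-rest fashion:)

    import numpy as np
    from sklearn.metrics import average_precision_score

    def evaluate(probs, labels, num_classes=3):
        """probs: (N, 3) per-frame class probabilities;
        labels: (N,) ground-truth classes.
        Returns per-class AP and their mean (mAP)."""
        labels = np.asarray(labels)
        aps = [average_precision_score(labels == c, probs[:, c])
               for c in range(num_classes)]
        return aps, float(np.mean(aps))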
Next are the results and conclusions. First, we compare the different architectures. Remember that SC is the baseline that directly combines standard VAD and speaker verification. We find that all the other personal VAD models are better than this baseline. Among the proposed models, SET, the one that uses both the speaker verification score and the speaker embedding, is the best. This is kind of expected, because it uses the most speaker information. ET is the personal VAD model that only uses the speaker embedding, and it is the ideal one for on-device ASR. We note that ET is only slightly worse than SET, and the difference is very small: it is near-optimal, but has only 2.6 percent of the parameters at runtime.
We also compare the conventional cross entropy loss and the proposed weighted pairwise loss. We found that the weighted pairwise loss is consistently better than cross entropy, and that the optimal weight between non-speech and non-target speaker speech is 0.1. Finally, since the ultimate goal of personal VAD is to replace the standard VAD, we also compare the two on standard VAD tasks. In some cases personal VAD is slightly worse, but the differences are very small.
So, the conclusions of this paper. The proposed personal VAD architectures outperform the baseline of directly combining VAD and speaker verification. Among the proposed architectures, SET has the best performance, but ET is the ideal one for on-device ASR, with near-optimal performance. We also propose the weighted pairwise loss, which outperforms the cross entropy loss. Finally, personal VAD performs almost equally well as standard VAD on standard VAD tasks.
Let me also briefly talk about future work directions. Currently, the personal VAD model is trained and evaluated on artificial conversations; we should really use realistic conversational speech, which will require lots of data collection and annotation efforts. Besides, personal VAD can be used for speaker diarization, especially when there is overlapping speech in the conversation. And the good news is that people are already doing it: researchers from Russia proposed a system known as target-speaker VAD, which is similar to personal VAD, and successfully used it for speaker diarization. If you like our paper, I would recommend you read their paper as well.
If you have questions, please leave a comment.
On this slide, you can find the links to these resources and our paper. Thank you.