Hi everyone. This is Quan Wang from Google, and today I'm going to talk about personal VAD, also known as speaker-conditioned voice activity detection. A big part of this work was done by Shaojin Ding, who was my intern last summer.
First of all, here is a summary of this work. Personal VAD is a system to detect the voice activity of the target speaker. The reason we need personal VAD is that it reduces the CPU, memory, and battery consumption for on-device speech recognition. We implement personal VAD as a frame-level voice activity detection system which uses the speaker embedding as a side input.
I will start by giving some background. Most speech recognition systems are deployed on the cloud, but moving ASR to the device has been a trend in the industry. This is because on-device ASR does not require an internet connection, and it greatly reduces the latency, because it does not need to communicate with servers. It also preserves the user's privacy better, because the audio never leaves the device. On-device ASR is usually used for smartphones or smart home speakers.
For example, if you simply want to turn on the flashlight on your phone, you should be able to do it in airplane mode. If you want to turn on your lights, it should only need access to your local network.
While on-device ASR is great, there are lots of challenges. Unlike on servers, we only have a very limited budget of CPU, memory, and battery for ASR. Also, ASR is not the only program running on the device; for example, on smartphones there are many other apps running in the background. So an important question is: when do we run ASR on the device? Apparently, it shouldn't always be running.
A typical solution is to use keyword detection, also known as wake word detection or hotword detection. For example, "Hey Google" is the keyword for Google devices. Because the keyword detection model is usually very small, it's very cheap, and it can be always running. ASR, on the other hand, is usually a big model, and it is very expensive, so we only run it when the keyword is detected.
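(For illustration, here is a minimal Python sketch of this keyword-gated cascade; detect_keyword and run_asr are hypothetical placeholders, not real APIs:)

    # Hypothetical sketch of a keyword-gated ASR pipeline.
    # detect_keyword() stands for a tiny always-on model;
    # run_asr() stands for the expensive model invoked only on a trigger.
    def process_stream(audio_frames, detect_keyword, run_asr):
        triggered = False
        buffered = []
        for frame in audio_frames:
            if not triggered:
                # The cheap model runs on every frame.
                triggered = detect_keyword(frame)
            else:
                # After the trigger, collect audio for recognition.
                buffered.append(frame)
        # The expensive model runs only after the keyword fired.
        return run_asr(buffered) if buffered else None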
However, not everyone likes the idea of always having to say a keyword before interacting with the device. Many people wish to be able to directly talk to the device, without having to say the keyword first. So an alternative solution is to use voice activity detection instead of keyword detection. Like keyword detection models, VAD models are also very small and very cheap to run. So you can have the VAD model always running, and only run ASR when VAD has been triggered.
So how does VAD work? The VAD model is typically a frame-level binary classifier: for every frame of the speech signal, VAD classifies it into two categories, speech and non-speech. After VAD, we throw away all the non-speech frames and only keep the speech frames. Then we feed the speech frames to downstream components, like ASR or speaker recognition. The recognition results will be used for natural language processing, and then trigger different actions. The VAD model helps us reject all the non-speech frames, which saves lots of computational resources.
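(As a sketch, assuming per-frame speech probabilities from some binary VAD model, the filtering step could look like this in Python:)

    import numpy as np

    def vad_filter(frames, speech_probs, threshold=0.5):
        """Keep only the frames classified as speech.

        frames: (T, D) array of acoustic features.
        speech_probs: (T,) per-frame speech probabilities,
        assumed to come from a binary VAD model.
        """
        keep = np.asarray(speech_probs) > threshold
        return np.asarray(frames)[keep]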
But is this good enough? In a realistic scenario, you can talk to the device, but other people can also be talking around you. And if there is a TV in your living room, there will be someone talking in the TV ads. These are all valid speech signals, so standard VAD will simply accept all of these frames, which causes trouble. For example, if you keep the TV playing and ASR keeps running on your smartwatch, the watch is going to run out of battery very quickly.
So that's why we are introducing personal VAD. Personal VAD is similar to standard VAD: it is a frame-level classifier. But the difference is that it has three categories instead of two. We still have the non-speech class, but the other two are target speaker speech and non-target speaker speech. Any speech that is not spoken by the target speaker, like other family members or the TV, will be considered non-target speaker speech.
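(A minimal sketch of this label space; the names below are ours, for illustration only:)

    # Personal VAD label space (illustrative names).
    NS = 0    # non-speech
    TSS = 1   # target speaker speech
    NTSS = 2  # non-target speaker speech

    def to_personal_vad_label(is_speech, is_target_speaker):
        if not is_speech:
            return NS
        return TSS if is_target_speaker else NTSS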
The benefit of using personal VAD is that we only run ASR on target speaker speech. This means we will save lots of computational resources when the TV is playing, when there are multiple members in the user's household, or when there are other people talking around the user. To make this work, the key is that the personal VAD model needs to be highly compact and fast, just like a keyword detection or standard VAD model. Also, the false rejections must be low, because we want to be responsive to the target user's requests. The false accepts should also be low, to really save the computational resources.
When we first released this paper, there were some comments like: "Oh, this is nothing new; this is just speaker recognition or speaker diarization." Here we want to clarify that, no, it is not. Personal VAD is very different from speaker recognition and speaker diarization. Speaker recognition models usually produce recognition results at the utterance level or window level, but personal VAD produces scores at the frame level; it is a streaming model and is very sensitive to latency. Speaker recognition models can be big, usually more than five million parameters, while personal VAD is an always-running model, so it must be very small, typically less than two hundred thousand parameters. Speaker diarization needs to cluster all the speakers, and the number of speakers is very important; personal VAD only cares about the target speaker, and everyone else is simply represented as non-target speaker.
Next, I will talk about the implementation of personal VAD. To implement personal VAD, the first question is: how do we know whom to listen to? Well, such systems usually ask the users to enroll their voice, and this enrollment is a one-off experience, so its cost can be ignored at runtime. After enrollment, we have a speaker embedding, which in our case is the d-vector, stored on the device. This embedding can be used for speaker recognition, or voice match. So naturally, it can also be used as the side input of personal VAD.
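(For illustration, a common way to obtain such an enrollment embedding is to average per-utterance d-vectors and normalize; this sketch assumes an embed_utterance function from some speaker encoder:)

    import numpy as np

    def enroll(utterances, embed_utterance):
        """One-off enrollment: average and L2-normalize d-vectors.

        embed_utterance(u) -> (D,) embedding; assumed given.
        The result is stored on the device and reused at runtime.
        """
        vecs = np.stack([embed_utterance(u) for u in utterances])
        centroid = vecs.mean(axis=0)
        return centroid / np.linalg.norm(centroid)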
There are different ways of implementing personal VAD. The simplest way is to directly combine a standard VAD model and a speaker verification system; we use this as a baseline. But in this paper, we propose to train a new personal VAD model, which takes the speaker verification score or the speaker embedding as input. In total, we implemented four different architectures for personal VAD, and I am going to talk about them one by one.
First, score combination (SC). This is the baseline model that I mentioned earlier. We don't train any new model, but just use the existing VAD model and the speaker verification model. If the VAD output is speech, we verify whether this frame comes from the target speaker using the speaker verification model, such that we get the three different output classes of personal VAD. Note that this implementation requires running the big speaker verification model at runtime, so it is an expensive solution.
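(A sketch of one plausible way to combine the two scores per frame; the multiplicative rule here is an illustrative assumption, not necessarily the exact rule in the paper:)

    import numpy as np

    def score_combination(vad_speech_prob, sv_score):
        """Baseline SC: derive three personal VAD scores for one frame
        from an existing VAD posterior and a speaker verification score.
        """
        p_ns = 1.0 - vad_speech_prob                 # non-speech
        p_tss = vad_speech_prob * sv_score           # target speaker speech
        p_ntss = vad_speech_prob * (1.0 - sv_score)  # non-target speaker speech
        return np.array([p_ns, p_tss, p_ntss])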
Second, score-conditioned training (ST). Here we don't use the standard VAD model, but we still use the speaker verification model. We concatenate the speaker verification score with the acoustic features, and train a new personal VAD model on top of the concatenated features. This is still very expensive, because we need to run the speaker verification model at runtime.
Third, embedding-conditioned training (ET). This is really the implementation that we want to use for on-device ASR. It directly concatenates the target speaker embedding with the acoustic features, and we train a new personal VAD model on the concatenated features. So the personal VAD model is the only model that we need at runtime.
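(A minimal PyTorch sketch of this embedding-conditioned variant; the layer sizes below are placeholders, not the paper's exact configuration:)

    import torch
    import torch.nn as nn

    class PersonalVAD(nn.Module):
        """Embedding-conditioned personal VAD (illustrative sizes)."""
        def __init__(self, feat_dim=40, emb_dim=256, hidden=64):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim + emb_dim, hidden,
                                num_layers=2, batch_first=True)
            self.fc = nn.Linear(hidden, 3)  # ns / tss / ntss logits

        def forward(self, feats, emb):
            # feats: (B, T, feat_dim); emb: (B, emb_dim) enrollment d-vector.
            emb = emb.unsqueeze(1).expand(-1, feats.size(1), -1)
            x = torch.cat([feats, emb], dim=-1)
            out, _ = self.lstm(x)
            return self.fc(out)  # (B, T, 3) frame-level logits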
And finally, score and embedding conditioned training (SET). Here we concatenate both the speaker verification score and the embedding with the acoustic features, so it uses the most information from the speaker verification system and is supposed to be the most powerful. But since it also requires running speaker verification at runtime, it is still not ideal for on-device ASR.
OK, we have talked about the architectures; now let's talk about the loss function. VAD is a classification problem, so standard VAD uses the binary cross entropy loss. Personal VAD has three classes, so naturally we can use the ternary cross entropy. But can we do better than cross entropy? If you think about the actual use case, both non-speech and non-target speaker speech will be discarded before ASR. So if you make a prediction error between non-speech and non-target speaker speech, it is actually not a big deal.
We encode this knowledge into our loss function, and propose the weighted pairwise loss. It is similar to cross entropy, but we use a different weight for different pairs of classes. For example, we use a smaller weight of 0.1 between the classes non-speech and non-target speaker speech, and use a larger weight of 1 for the other pairs.
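(A numpy sketch of this idea; the pairwise logistic form below is one plausible way to write a pairwise-weighted loss, with w(ns, ntss) = 0.1 and 1 for all other pairs:)

    import numpy as np

    NS, TSS, NTSS = 0, 1, 2
    # Confusing non-speech with non-target speech is cheap (0.1);
    # every other confusion gets the full weight (1.0).
    W = np.ones((3, 3))
    W[NS, NTSS] = W[NTSS, NS] = 0.1

    def weighted_pairwise_loss(logits, label):
        """logits: (3,) logits for one frame; label: ground-truth class."""
        loss = 0.0
        for k in range(3):
            if k != label:
                loss += W[label, k] * np.log1p(np.exp(logits[k] - logits[label]))
        return loss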
Next, I will talk about the experiments. An ideal dataset for training and evaluating personal VAD should have these features: it should include realistic and natural speaker turns; it should cover diverse acoustic conditions; it should have frame-level speaker labels; and it should have enrollment utterances for each target speaker. Unfortunately, we could not find a dataset that satisfies all these requirements, so we actually made an artificial dataset based on the well-known LibriSpeech dataset.
Remember that we need the frame-level speaker labels. For each LibriSpeech utterance, we have the speaker label, and we also have the ground truth transcript. So we used a pretrained ASR model to force-align the ground truth transcript with the audio to get the timing of each word. With this timing information, we can get the frame-level speaker labels.
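(For illustration, once forced alignment gives per-word start and end times, frame labels can be painted in like this; the 10 ms frame shift is an assumption:)

    import numpy as np

    def frame_labels(word_times, speaker_id, target_id,
                     num_frames, frame_shift=0.01):
        """word_times: list of (start_sec, end_sec) for each word.
        Frames inside a word get tss or ntss depending on whether the
        utterance's speaker is the target; all other frames are ns.
        """
        NS, TSS, NTSS = 0, 1, 2
        labels = np.full(num_frames, NS)
        spk = TSS if speaker_id == target_id else NTSS
        for start, end in word_times:
            i, j = int(start / frame_shift), int(end / frame_shift)
            labels[i:min(j, num_frames)] = spk
        return labels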
And to have conversational speech, we concatenate utterances from different speakers. We also use a room simulator to add noise and reverberation to the concatenated utterances. This helps avoid domain overfitting and also mitigates the concatenation artifacts.
Here is the model configuration. Both the standard VAD and the personal VAD models consist of two LSTM layers and one fully connected layer; the model has about 0.13 million parameters in total. The speaker verification model has three LSTM layers with projection, and one fully connected layer; this model is pretty big, with about five million parameters.
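(As a sanity check on model size, parameters can be counted like this, reusing the PersonalVAD sketch from earlier; with our placeholder layer sizes the count will differ from the paper's figure:)

    # Count the trainable parameters of the sketch model above.
    model = PersonalVAD()
    n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"personal VAD parameters: {n_params / 1e6:.2f}M")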
For evaluation, because this is a classification problem, we use average precision. We look at the average precision for each class, and also the mean average precision. We also look at the metrics both with and without the added room-simulator noise.
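(For illustration, per-class average precision and the mean average precision can be computed with scikit-learn in a one-vs-rest fashion:)

    import numpy as np
    from sklearn.metrics import average_precision_score

    def evaluate(probs, labels, num_classes=3):
        """probs: (N, 3) per-frame class probabilities;
        labels: (N,) ground-truth classes.
        Returns per-class AP and their mean (mAP)."""
        labels = np.asarray(labels)
        aps = [average_precision_score(labels == c, probs[:, c])
               for c in range(num_classes)]
        return aps, float(np.mean(aps))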
Next are the results and conclusions. First, we compare the different architectures. Remember that SC is the baseline that directly combines standard VAD and speaker verification. We find that all the other personal VAD models are better than this baseline. Among the proposed models, SET, the one that uses both the speaker verification score and the speaker embedding, is the best. This is kind of expected, because it uses the most speaker information. ET is the personal VAD model that only uses the speaker embedding, and it is the ideal one for on-device ASR. We note that ET is only slightly worse than SET, and the difference is very small: it is near-optimal, but has only 2.6 percent of the parameters at runtime.
We also compare the conventional cross entropy loss and the proposed weighted pairwise loss. We found that the weighted pairwise loss is consistently better than cross entropy, and that the optimal weight between non-speech and non-target speaker speech is 0.1. Finally, since the ultimate goal of personal VAD is to replace the standard VAD, we also compare the two on standard VAD tasks. In some cases personal VAD is slightly worse, but the differences are very small.
So, the conclusions of this paper. The proposed personal VAD architectures outperform the baseline of directly combining VAD and speaker verification. Among the proposed architectures, SET has the best performance, but ET is the ideal one for on-device ASR, with near-optimal performance. We also propose the weighted pairwise loss, which outperforms the cross entropy loss. Finally, personal VAD performs almost equally well as standard VAD on standard VAD tasks.
Let me also briefly talk about future work directions. Currently, the personal VAD model is trained and evaluated on artificial conversations; we should really use realistic conversational speech, which will require lots of data collection and annotation efforts. Besides, personal VAD can be used for speaker diarization, especially when there is overlapping speech in the conversation. And the good news is that people are already doing it: researchers from Russia proposed a system known as target-speaker VAD, which is similar to personal VAD, and successfully used it for speaker diarization. If you like our paper, I would recommend you read their paper as well.
If you have questions, please leave a comment.
On this slide, you can find the links to these resources and our paper. Thank you.