Hi everyone. Today I'll be talking about my work on a low-power text-dependent speaker verification system with narrowband feature pre-selection and weighted dynamic time warping. This is joint work with Texas Instruments.
So first, to motivate the work: with the increasing popularity of mobile devices such as smartphones and smartwatches, there is rising interest in enabling voice-authenticated speaker verification. That is, the device continuously listens to its surroundings and wakes up when a designated pass phrase is pronounced by the target speaker.
Existing wakeup systems are implemented on the host device using digital solutions. Because these host devices are usually designed for general-purpose applications, signals are usually acquired at rates much higher than the Nyquist rate, in order to minimize information loss at the early stage and to enable flexible downstream processing. These systems therefore involve many stages of processing on high-dimensional data, resulting in power consumption in the range of hundreds of milliwatts.
So the opportunity lies in the fact that if you design an application-specific system, the desired output may be obtained in a more direct manner, with less processing, by performing early-stage signal dimension reduction with analog components and adaptive data processing. Our goal is to design a low-power voice-authenticated wakeup system whose power consumption is limited to the range of a couple hundred microwatts.
In order to achieve this, we propose a new architecture. In conventional systems, the processing pipeline usually involves high-rate sampling and processing, and because of the high-dimensional features, the downstream processing usually requires high-complexity computation, and therefore high power consumption. In contrast, the proposed system first pre-selects a set of low-dimensional spectral features, which can be efficiently extracted using analog components. Because these pre-selected features are sparse in the frequency domain, they enable low-rate sampling and low-rate processing, and therefore low power consumption.
In the remainder of the talk I'll describe the two components of the system: the spectral feature pre-selection component and the speaker verification back end. So first I'll describe our feature pre-selection process. In this part, we'll show that with a few carefully selected narrow bands, we are capable of capturing the most essential speech information.
First, a quick review. The speech signal s(t) can be represented as the convolution between the excitation signal u(t) and the vocal tract modulation signal h(t), and it is h(t) that contains the essential speech information. This convolution relationship becomes separable in the cepstral domain.
So let's say we plot the power spectral density of speech. In the frequency domain, convolution becomes multiplication, and when you take the log of the power spectral density, multiplication becomes addition. Notice that the peaks of the power spectral density correspond to the harmonics of the fundamental frequency, and the overall envelope corresponds to the vocal tract modulation. To transform the signal to the cepstral domain, we take the inverse Fourier transform.
It turns out that speech is sparse in the cepstral domain: it consists of two main components, the vocal tract modulation component h(τ), which is represented by this narrow rectangle here with cutoff quefrency θ_h, and the excitation component u(τ), which is represented by this delta component at θ_e.
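As a concrete toy illustration of this decomposition (my own sketch, not from the talk): if we build a log power spectrum directly as the sum of a smooth vocal-tract envelope and a harmonic ripple, the inverse FFT separates the two parts by quefrency, with the excitation showing up as a spike at 1/f0.

```python
import numpy as np

# Toy illustration: construct a log power spectrum as the SUM of a smooth
# vocal-tract envelope and a harmonic ripple, then move to the cepstral
# domain, where the two parts land at different quefrencies.
fs, n = 8000, 1024
f = np.fft.rfftfreq(n, 1 / fs)               # frequency axis, 0 .. 4000 Hz
f0 = 125.0                                   # assumed fundamental frequency

envelope = -2.0 * (f / 1000.0)               # smooth spectral tilt ~ log|H|^2
ripple = 1.5 * np.cos(2 * np.pi * f / f0)    # harmonic structure ~ log|U|^2
log_psd = envelope + ripple                  # log turns convolution into addition

c = np.fft.irfft(log_psd, n)                 # real cepstrum
q = np.arange(n) / fs                        # quefrency axis (seconds)

# The ripple with period f0 in frequency becomes a spike at quefrency
# 1/f0 = 8 ms; the smooth envelope stays at very low quefrencies.
peak_idx = int(np.argmax(np.abs(c[32 : n // 2]))) + 32
print(f"excitation spike at {q[peak_idx] * 1e3:.1f} ms")   # -> 8.0 ms
```

Here the separability is exact by construction; for real speech the same picture holds approximately once you take the log of the measured power spectrum.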
So our goal is to extract h(τ), and this can be done by performing a transformation to the cepstral domain. However, this is computationally expensive. So the question is: how do we extract h(τ) without actually performing the transformation to the cepstral domain?
Let's begin by trying something simple. Suppose we measure the speech at N points evenly spaced across the frequency spectrum, on top of the harmonics, and let the spacing between the sampling points be denoted by Δ_p. What this corresponds to is multiplication between the power spectral density and an impulse train, and when transformed to the cepstral domain, this corresponds to convolution between the speech cepstrum and a delta train. So what we get is that the cepstrum of the pointwise samples is an aliased version of the cepstrum of the original speech, and the aliased copies lie exactly on the multiples of one over Δ_p.
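Written out, this pointwise-sampling step is the standard sampling/aliasing duality applied to the log power spectral density; with C(τ) the cepstrum of the log-PSD L(f) and Δ_p the sampling spacing, a sketch of the relation is:

```latex
L_s(f) \;=\; L(f)\sum_{k} \delta(f - k\,\Delta_p)
\quad\xrightarrow{\;\mathcal{F}^{-1}\;}\quad
C_s(\tau) \;\propto\; \sum_{k} C\!\left(\tau - \tfrac{k}{\Delta_p}\right)
```

so the low-quefrency component h(τ) survives as long as the copy spacing 1/Δ_p exceeds its cutoff quefrency θ_h.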
So the takeaway message here is that even though it seems we have thrown away the majority of the frequency spectrum and only kept a few points, the most essential speech information, h(τ), is still retained. The next question is: what if we do not have an exact estimate of the fundamental frequency, or no information about the fundamental frequency at all? In this case, instead of pointwise sampling, we use a set of bandpass filters.
This can be represented as multiplication between the power spectral density and a rectangular train. Because the rectangular train is a sinc-attenuated delta train in the cepstral domain, the bandpass filtering becomes convolution between the speech cepstrum and the sinc-attenuated delta train. So what we get is an aliased version of the original speech cepstrum, attenuated by the sinc-weighted delta train. Note that because this time we did not choose to place the rectangular train on top of the harmonics, there is aliasing between the vocal tract component h(τ) and the excitation component u(τ). However, because this aliasing is attenuated by the sinc function, it can be negligible for our application.
For example, with a narrowband bandwidth of two hundred Hz, a spacing of eight hundred Hz, and a fundamental frequency of one hundred Hz, the aliased excitation component is attenuated significantly.
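To put rough numbers on this, here is a small sketch under one common reading of the derivation: the Fourier-series coefficients of a rectangular train of width B and period Δ scale as sinc(kB/Δ), so the k-th aliased cepstral copy is attenuated accordingly (the exact constants in the talk may differ).

```python
import numpy as np

# Numbers quoted in the talk: band width B = 200 Hz,
# band spacing Delta = 800 Hz, fundamental f0 = 100 Hz.
B, Delta, f0 = 200.0, 800.0, 100.0

# In the cepstral domain the rectangular band train behaves like a
# sinc-attenuated delta train: the k-th aliased copy of the cepstrum
# is scaled by sinc(k * B / Delta)  (np.sinc is the normalized sinc).
k = np.arange(9)
weights = np.sinc(k * B / Delta)

# The excitation component sits at quefrency 1/f0 = 10 ms.  The aliased
# copy that falls back onto quefrency 0 (on top of h) is the one shifted
# by k/Delta = 1/f0, i.e. k = Delta / f0 = 8.
k_overlap = int(Delta / f0)
print(f"attenuation of the overlapping copy: {abs(weights[k_overlap]):.2e}")
```

With these particular numbers the overlapping copy is essentially wiped out, which is one way to read the "attenuated significantly" claim on the slide.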
So far we have concluded that with a small number of narrowband filters, we can capture the most essential speech information, and this enables low-rate downstream sampling and processing. We then integrate the narrowband spectral extraction front end into our proposed block diagram. Here the front-end features are fed into the speaker verification back end. One distinguishing characteristic of this system is that the individual bands can be turned on and off as needed, depending on the background noise band and the system requirements.
Next we talk about the back-end speaker verification algorithm. Recall that our goal is to design a voice-authenticated wakeup system that identifies the user and the pass phrase in one shot, and we would like to allow the user to provide customized pass phrases with only a very small number of enrollment samples. So our application falls into the category of text-dependent speaker verification, and there are many existing methods for this application. There are model-based methods that leverage a cohort of speakers who pronounce the pass phrase to train a background model, whose parameters are then fine-tuned with enrollment data from the target speaker. And there are template-based methods that do not require prior model training; decisions are made by comparing the input features with the enrollment features. Because we would like to allow the users to provide customized pass phrases, we used a template-based method.
The classical dynamic time warping algorithm is used to overcome speaker variation and pauses in speech. However, for our application it turns out that the classical dynamic time warping algorithm either provides too much warping, which mutates the signal envelope and leads to a large number of false positives, or does not provide enough warping to compensate for the long pauses between words in a pass phrase. We therefore propose a modified version of the algorithm, which we call weighted dynamic time warping, to provide sufficient warping to compensate for speaker variation and pauses between words without causing too much signal envelope mutation.
To do this, we simply add a penalty term to the distance measurement. The penalty term scales up linearly with the number of consecutive warping steps, which prevents too much warping of the signal. The penalty also scales with the signal magnitude: it is small when the signal amplitude is small, because that likely corresponds to pauses in the signal, and it is high when the signal amplitude is high, in order to prevent signal envelope mutation. This can be illustrated in the distance matrix computation: the weighted dynamic time warping algorithm is the same as the classical dynamic time warping algorithm, with only one difference, namely that we add a cost to the distance measurement, and the cost is a function of the signal magnitude and the number of consecutive warping steps. I won't go into the details here; you can find the full implementation in the paper.
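As a rough illustration of the idea (this is my own minimal sketch of such a penalty, not the paper's exact cost function), classical DTW can be augmented with a per-step cost that grows with the length of the current run of non-diagonal steps and with the local signal magnitude:

```python
def weighted_dtw(x, y, alpha=0.1):
    """DTW with a warping penalty (illustrative sketch, not the paper's
    exact cost).  Each consecutive horizontal/vertical step k adds a
    penalty alpha * k * |signal|, so long warps are discouraged where the
    signal is loud and tolerated where it is quiet (pauses)."""
    n, m = len(x), len(y)
    INF = float("inf")
    # cost[i][j][d], run[i][j][d]: best cost / warp-run length at (i, j)
    # having arrived by direction d (0 = diagonal, 1 = horiz, 2 = vert).
    cost = [[[INF] * 3 for _ in range(m + 1)] for _ in range(n + 1)]
    run = [[[0] * 3 for _ in range(m + 1)] for _ in range(n + 1)]
    cost[0][0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d_local = abs(x[i - 1] - y[j - 1])
            # diagonal move: resets the consecutive-warp counter
            cost[i][j][0] = d_local + min(cost[i - 1][j - 1])
            # horizontal move: penalty grows with the run length and with
            # the local signal magnitude
            prev = cost[i][j - 1]
            best = min(range(3), key=lambda d: prev[d])
            if prev[best] < INF:
                k = run[i][j - 1][1] + 1 if best == 1 else 1
                cost[i][j][1] = d_local + alpha * k * abs(x[i - 1]) + prev[best]
                run[i][j][1] = k
            # vertical move, symmetric
            prev = cost[i - 1][j]
            best = min(range(3), key=lambda d: prev[d])
            if prev[best] < INF:
                k = run[i - 1][j][2] + 1 if best == 2 else 1
                cost[i][j][2] = d_local + alpha * k * abs(x[i - 1]) + prev[best]
                run[i][j][2] = k
    return min(cost[n][m])
```

With alpha = 0 this reduces to classical DTW; increasing alpha discourages long warps on loud segments while still tolerating them across quiet pauses.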
To illustrate the benefits of weighted dynamic time warping, I'll show a simulation result. Here our goal is to align the envelopes of the two signals. You can see that with a window length of one hundred milliseconds, the classical dynamic time warping algorithm fails to align the signal envelopes here, while with a window length of two hundred milliseconds, and also with the weighted dynamic time warping algorithm, the signal envelopes are properly aligned. However, you can notice that the shape of the input signal is heavily mutated by the classical dynamic time warping algorithm, whereas the shape is retained by the weighted dynamic time warping algorithm.
so next we performed experiments on the entire system design
so you know our experiments we used three pass phrase this
i galaxy okay glass and okay power weight pronounced by thirty to forty speakers with
twenty to forty repetitions
We also added wind and car noise to the clean samples to generate the noisy examples. The reason we chose wind and car noise is that they are common in our application, and they also have the distinguishing characteristic of being narrowly concentrated at low frequencies, so with them we can illustrate the benefits of adaptive band selection by discarding the noisy bands.
In the experiments we used three enrollment samples, with a narrowband bandwidth of two hundred Hz chosen around the estimated fundamental frequency. We compared against two baseline systems: forty-dimensional MFCCs with the classical dynamic time warping algorithm, and forty-dimensional MFCCs with a GMM-UBM.
This table summarizes the experimental results. When there is no background noise, we use the top-N narrowband spectral features; when there is background noise, all the features below two kHz, which would likely be contaminated by noise, are dropped, so we only use the remaining bands. It can be seen that without background noise, the narrowband spectral coefficients yield overall accuracy comparable to the MFCC features, while at three dB SNR the narrowband spectral coefficients yield much better accuracy than the MFCC features. Overall, the weighted dynamic time warping algorithm yields improved accuracy over the classical dynamic time warping algorithm for all features. And the proposed system, that is, the narrowband spectral coefficients combined with the weighted dynamic time warping algorithm, also yields improved accuracy over the GMM-UBM, with only three enrollment samples as prior and without prior background model training.
We also investigated how the performance is affected by different parameters. As you can see here, when we increase the number of bands, the accuracy of the system improves. Also, the accuracy of the system improves significantly with band selection, that is, this row, compared with no band selection over here.
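A minimal sketch of what such a band-selection rule could look like (the band centers and threshold here are illustrative placeholders, not the system's actual configuration):

```python
# Hypothetical noise-adaptive band selection: with low-frequency noise
# (wind, car), drop every band whose center frequency is below 2 kHz.
def select_bands(center_freqs_hz, low_freq_noise, cutoff_hz=2000.0):
    if not low_freq_noise:
        return list(center_freqs_hz)
    return [f for f in center_freqs_hz if f >= cutoff_hz]

bands = [400.0, 1200.0, 2000.0, 2800.0, 3600.0]   # illustrative centers
print(select_bands(bands, low_freq_noise=True))    # -> [2000.0, 2800.0, 3600.0]
print(select_bands(bands, low_freq_noise=False))   # -> all five bands
```

Dropping a band also lets the corresponding analog filter be powered down, which is what ties the accuracy knob to the power budget below.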
The total power consumption of the system can be estimated as the sum of the power consumption of the front end and the back end. For the front-end power estimation, we use a front end designed by Texas Instruments that consists of a filter bank of about thirty bands. This front end has a fixed power consumption of a hundred fifty microwatts, with an additional power consumption of ten microwatts per band. The back-end algorithm is implemented on a Cortex-M microcontroller, under the assumption that a decision is made every sixty milliseconds with continuous triggering, so it is a worst-case trigger assumption. The power consumption of the back end is nine microwatts per band.
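A back-of-the-envelope budget with these figures might look as follows (the per-band back-end number is my reading of the audio and may be off; all constants here are the talk's quoted estimates, not measurements):

```python
# Assumed figures from the talk:
#   front end: 150 uW fixed + 10 uW per active band
#   back end (Cortex-M, decision every 60 ms, worst case): ~9 uW per band
FRONT_END_FIXED_UW = 150.0
FRONT_END_PER_BAND_UW = 10.0
BACK_END_PER_BAND_UW = 9.0

def total_power_uw(n_bands: int) -> float:
    """Estimated total system power in microwatts for n active bands."""
    front = FRONT_END_FIXED_UW + FRONT_END_PER_BAND_UW * n_bands
    back = BACK_END_PER_BAND_UW * n_bands
    return front + back

for n in (5, 10, 20):
    print(f"{n:2d} bands -> {total_power_uw(n):6.1f} uW")
```

Even at twenty active bands this stays in the few-hundred-microwatt range, consistent with the design target stated at the start of the talk.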
Here we're just trying to illustrate that overall our system is very power efficient, and the total power consumption can be kept under a few hundred microwatts.
To summarize: I presented a new architecture for low-power text-dependent speaker verification. The system has a front end consisting of a set of narrowband analog filters, which performs early-stage signal dimension reduction; performing this front-end filtering allows low-rate sampling and low-rate downstream processing. We also proposed a back-end algorithm for speaker verification. Overall, the system is designed to support adaptive band selection, which leads to improved robustness to noise, and the overall system has demonstrated accuracy comparable to existing systems with much lower power consumption.
That's it for today. Thank you. Okay, we have time for questions.
Testing, one, two, three. So the applications you're looking at are probably in consumer electronics or home environments and so forth, so you don't have an infinite number of competing speakers, or speakers to be confused with. Have you thought about which phrases might be more effective? If you're looking at, say, a family using consumer electronics at home, you could distinguish which speakers are really more confusable with others. And second, if you're trying to keep the power low, it would seem like you'd want the wakeup-type device to do the verification on something else, some type of sound, say clapping your hands, that would actually cause the system to wake up. That way the power is not being drawn continuously.
Well, thanks. So for the first question: certainly there are pass phrases that work better than others, but I'm not sure what a good rule is for choosing which pass phrase is better. Because we want to sell the system to different customers all around the world, we just let them choose their favorite pass phrase.
And for the second question: actually, in our system we do have a VAD at the front end that detects energy. So you have this front end to detect energy; it wakes up our system, and our system then wakes up the host device. The reason we still consider the worst-case scenario in our power estimation is that if you are in a very noisy environment, the system is continuously running at this power consumption. With a setup like Apple's, in order to activate Siri you need to actually wake up the device first, and then you say the wake word. So we want to make something that will not drain the battery in a noisy environment.
Sorry, I was also thinking you're going to add this to an Internet-of-Things-type device, because that's where you would probably have the biggest financial gains.

I think they are also considering selling it to home appliances and the like.

So one of the things people do with all these devices is that they don't hold the devices close anymore; you can have a phone on the table and talk to it at a distance, saying whatever you want to say. So I saw that you added noise to the examples; have you looked at the effect of distance?
Actually, that's a very good question. What we have encountered in real applications is that, if you look at how real systems are implemented, there is an AGC at the front end. If you're speaking from far away versus close up, the sound is different, so there is dedicated circuitry, a multi-stage AGC, that will amplify the sound more if you are far away and amplify it less if you are close by. I'm not considering this part in my work, but it certainly needs to be considered for real applications.
Okay, I have a question. I noticed on slide number nineteen that... thanks... you said that for noise you remove all the information below two kHz. Did you try to do the same for the MFCCs?
I didn't do it, because... I don't know if people do it, but... you see, with MFCCs, when you take the information from the frequency domain and transform it into the cepstral domain, if you just remove some bands and then take the MFCCs, it will actually alter the cepstral signal significantly; it's not just an addition. So I didn't do it for MFCCs.

I think it could be done if it's trained on it from scratch this way.
Okay, so the reason it can be done in my system is that the decisions are added together at the end in a linear manner: you can consider each subband as giving a decision, and the final decision is a summation over all of them. But when you have MFCC coefficients, the decision is no longer a linear summation of per-band decisions, so I'm not sure how it could be done using MFCCs, and I don't think it has been done before.

Thanks. Thank you. Okay, let's thank the speaker again.