This talk presents our work on small-footprint multichannel keyword spotting. I am presenting on behalf of my co-authors.
Here is an overview of our presentation.
We will first talk about the motivation for this work. Then we will go over the 3D-SVDF layer introduced in this paper, followed by how it is used in the model architecture of the proposed keyword spotter. Next we will discuss the experimental setup. After that we will show some results and cover the final conclusions from this paper. Finally, we will show some future work.
First, the motivation. Voice assistants are increasingly popular.
Keywords such as "Okay Google" are commonly used to initiate a conversation with the voice system, and keyword spotting with low latency becomes the core technological challenge in achieving this task. Since keyword spotters typically run on embedded devices such as phones and smart speakers, only limited battery, RAM, and compute are available. We want to detect the keyword with high accuracy, but for those reasons we also want a small model size and low compute cost.
Multi-microphone setups are widely used to increase noise robustness, so it would be interesting to see if we can integrate that noise robustness as part of an end-to-end neural network architecture.
We first recall the SVDF layer. The SVDF idea originated from applying a singular value decomposition to a fully connected weight matrix: a rank-1 SVDF decomposes the weight matrix into two dimensions, as shown in the figure. These two dimensions can be interpreted as filters in the feature and time domains. We refer to the filter in the feature domain as alpha and the filter in the time domain as beta.
The feature vector of the current frame is first convolved with the 1-D feature filter alpha, and the output of this first stage is pushed into a memory buffer. Given a memory size m, the outputs of the first stage for the past m frames are kept in the memory buffer, which is then convolved with the time-domain filter beta to produce the final output. This describes a single SVDF node; in practice there are of course several nodes per layer to achieve n-dimensional outputs.
More concretely, when the SVDF is used as the input layer of a speech model, the feature filter corresponds to a transformation of the filterbank frame, the memory buffer contains the past m filtered frames, and the time-filter output corresponds to a summary of those past frames.
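To make this concrete, here is a minimal NumPy sketch of a single rank-1 SVDF node, assuming a feature dimension and memory size of our own choosing; the class and variable names are illustrative, not taken from the paper.

```python
import numpy as np

# A minimal sketch of a single rank-1 SVDF node, assuming a feature dimension F and a
# memory (time-filter) size M; names and initialization are illustrative placeholders.
class SVDFNode:
    def __init__(self, feature_dim, memory_size, seed=0):
        rng = np.random.default_rng(seed)
        self.alpha = rng.standard_normal(feature_dim)   # 1-D feature-domain filter
        self.beta = rng.standard_normal(memory_size)    # 1-D time-domain filter
        self.buffer = np.zeros(memory_size)             # stage-1 outputs of the past M frames

    def step(self, frame):
        # Stage 1: filter the current frame in the feature domain.
        stage1 = float(self.alpha @ frame)
        # Push the stage-1 output into the memory buffer, dropping the oldest entry.
        self.buffer = np.roll(self.buffer, -1)
        self.buffer[-1] = stage1
        # Stage 2: filter the buffered stage-1 outputs in the time domain.
        return float(self.beta @ self.buffer)
```

A full layer would stack several such nodes to produce the n-dimensional output mentioned above.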
So far, the SVDF has only been used for single-channel models in the literature.
The 3D-SVDF introduced in this paper extends the SVDF with an extra dimension: besides feature and time, it adds channel. The "3D" refers to these three dimensions: feature, time, and channel. As an example, the filterbank energies from each channel are fed into the 3D-SVDF layer. Each channel learns its own weights, that is, its own time-domain and frequency-domain filters. The outputs of all channels are concatenated after the input layer, so the output dimension is 2n, where n is the number of filter nodes per channel.
Exploiting the time-delay redundancy between the two channels, the 3D-SVDF can also be thought of as applying an SVDF to each channel independently and simply fusing the results. This approach enables the network to take advantage of the redundancy in the frequency features from each channel, but also of the temporal variation across channels, and hence improves noise robustness. It also allows the following learnable signal-filtering module to leverage the multi-channel input.
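As a rough illustration, here is a hedged sketch of that per-channel-SVDF-plus-concatenation view, assuming two channels and n nodes per channel; the shapes and names are our own placeholders rather than the paper's implementation.

```python
import numpy as np

# Illustrative sketch of a 3D-SVDF layer as per-channel SVDF filtering followed by
# concatenation; shapes and names are placeholder assumptions, not the paper's code.
class SVDF3D:
    def __init__(self, num_channels, num_nodes, feature_dim, memory_size, seed=0):
        rng = np.random.default_rng(seed)
        # Each channel learns its own feature-domain and time-domain filters.
        self.alpha = rng.standard_normal((num_channels, num_nodes, feature_dim))
        self.beta = rng.standard_normal((num_channels, num_nodes, memory_size))
        self.buffer = np.zeros((num_channels, num_nodes, memory_size))

    def step(self, frames):
        # frames: (num_channels, feature_dim) filterbank energies for one time step.
        outputs = []
        for c in range(self.alpha.shape[0]):
            stage1 = self.alpha[c] @ frames[c]                   # (num_nodes,)
            self.buffer[c] = np.roll(self.buffer[c], -1, axis=-1)
            self.buffer[c, :, -1] = stage1
            outputs.append((self.beta[c] * self.buffer[c]).sum(axis=-1))
        # Concatenate the per-channel outputs: length num_channels * num_nodes.
        return np.concatenate(outputs)
```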
Now for the model architecture built around the 3D-SVDF. To enable noise-robust learning, the 3D-SVDF sits at the input layer: it takes the original features of each channel as input and emits the concatenated filtered features of each channel as output. The layer that immediately follows the 3D-SVDF then sums the features from the channels together, acting as a filter.
Following the first 3D-SVDF layer, there are two modules: the encoder and the decoder. The encoder takes the filtered output of the 3D-SVDF as input and emits softmax probabilities for the phonemes that make up the keyword. The encoder consists of a stack of SVDF layers, with some fully connected bottleneck layers in between the SVDF layers to further reduce the total number of parameters. The decoder then takes the encoder results as input and emits a decision as to whether the utterance, and hence the keyword, is present or not. The decoder consists of three SVDF layers stacked directly, with no bottlenecks, and uses a softmax as the final activation.
The training uses frame-level cross-entropy losses to train the encoder and the decoder, and the final model is a single-stage unified model in which the encoder and decoder are trained jointly.
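Putting the pieces together, the overall per-frame flow might look like the hedged sketch below; encoder_layers and decoder_layers are hypothetical callables standing in for the stacked SVDF (and bottleneck) layers, and none of the sizes reflect the paper's exact configuration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hedged structural sketch of the unified model: a 3D-SVDF front end, an encoder that
# emits phoneme posteriors, and a decoder that emits a keyword / no-keyword decision.
def keyword_model_step(frames, svdf3d, encoder_layers, decoder_layers):
    x = svdf3d.step(frames)          # per-channel filtering + concatenation
    for layer in encoder_layers:     # SVDF stack with bottleneck projections in between
        x = layer(x)
    phoneme_posteriors = softmax(x)  # encoder output: phoneme-level probabilities
    y = phoneme_posteriors
    for layer in decoder_layers:     # three SVDF layers, no bottleneck
        y = layer(y)
    return softmax(y)                # probability that the keyword is present
```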
For the experimental setup, I will talk about the training and testing data sets.
For training data, we use 2.1 million single-channel anonymized audio clips containing "Okay Google" and "Hey Google". To generate dual-microphone data from this mono data, we use multi-style room simulations with a dual-microphone setup; the simulations use rooms of different dimensions and different microphone spacings.
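As a much-simplified stand-in for such room simulation, the sketch below turns a mono clip into a two-channel clip by applying a small inter-microphone delay and adding per-channel noise; a real multi-style room simulation would also model room dimensions, source positions, and reverberation, which this toy version does not.

```python
import numpy as np

# Toy two-channel augmentation of a mono clip: a small inter-microphone delay plus
# per-channel noise. This is a simplified stand-in, not the paper's simulation pipeline.
def simulate_two_channels(mono, sample_rate, mic_spacing_m=0.07, snr_db=20.0, seed=0):
    rng = np.random.default_rng(seed)
    speed_of_sound = 343.0
    # Worst-case delay between the two microphones for an endfire source (in samples).
    delay = int(round(mic_spacing_m / speed_of_sound * sample_rate))
    ch0 = mono
    ch1 = np.concatenate([np.zeros(delay), mono])[: len(mono)]
    # Add independent noise to each channel at the requested SNR.
    signal_power = np.mean(mono ** 2) + 1e-12
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    stereo = np.stack([ch0, ch1])
    stereo = stereo + rng.normal(scale=np.sqrt(noise_power), size=stereo.shape)
    return stereo  # shape (2, num_samples)
```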
For testing data, we generated the keyword-containing utterances in the following way. First, we generated prompts that randomly contain, or do not contain, the keywords. These prompts were then spoken by volunteers, and we re-recorded them with the dual-microphone setup to convert them into multi-channel data. We then augmented that set with multi-channel noise recorded with as similar a dual-microphone setup as possible. In the table, you can see the sizes of the different testing data sets used in this experiment.
Furthermore, to evaluate the single-channel baseline model on two-channel audio, we evaluated two different strategies. First, we run the keyword detector on either channel one or channel zero, ignoring the other channel entirely. Second, we run the single-channel keyword spotter on each channel independently; given a fixed threshold, we then use a logical OR of the binary outcome of each channel to produce the final result. These strategies serve to evaluate the single-channel baseline. Since the 3D-SVDF accepts multi-channel input directly, we use the output of the 3D-SVDF model directly as well.
In addition to these strategies, we also evaluate the single-channel model with a simple broadside beamformer, to capture any improvement from a traditional signal-enhancement front end.
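The two-channel logical-OR combination baseline can be summarized with the small sketch below; score_fn is a hypothetical stand-in for the single-channel spotter's detection score.

```python
# Sketch of the two-channel logical-OR baseline: run the single-channel spotter on each
# channel with a fixed threshold and accept if either channel fires. score_fn is a
# hypothetical stand-in for the single-channel keyword spotter's detection score.
def logical_or_detect(channel0, channel1, score_fn, threshold):
    fired0 = score_fn(channel0) >= threshold
    fired1 = score_fn(channel1) >= threshold
    return fired0 or fired1
```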
We now present the results. Performance was compared keeping the false-accept rate fixed at 0.1 FA, and we report results in terms of the false-reject rate.
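For illustration, a threshold for such a comparison could be chosen as in the hedged sketch below, which treats the false-accept rate as a fraction of negative utterances for simplicity; the paper's exact operating-point definition may differ.

```python
import numpy as np

# Sketch of reporting false-reject rate at a fixed false-accept operating point, given
# per-utterance detection scores for the positive and negative test sets (hypothetical).
def frr_at_fixed_fa(pos_scores, neg_scores, target_fa_rate=0.1):
    # Threshold chosen so that roughly target_fa_rate of the negatives score above it.
    threshold = np.quantile(neg_scores, 1.0 - target_fa_rate)
    # False rejects: positive utterances whose score falls below the threshold.
    return float(np.mean(np.asarray(pos_scores) < threshold))
```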
As we can see, given the same model size of around 30K parameters, our proposed two-channel model outperforms the single-channel baselines on both the clean test set and the noisy TV test set. The relative improvement over the two-channel logical-OR combination baseline is 27% and 31% on clean and noisy, respectively.
To further understand this result, we created ROC curves to confirm the model quality improvement at various operating points. As we can see, our proposed model has the best performance compared with the single-channel model under all of the strategies discussed, and it also outperforms the beamformer baseline.
On the clean side, the gains from the 3D-SVDF are small but still non-negligible, and the filtering of the 3D-SVDF does not seem to interfere with performance on the clean positive and negative sets. We hypothesize that some of the gains on the clean set come from the 3D-SVDF not altering the signal in ways that produce confusable audio on the negative set, thereby suppressing false accepts; we have seen such false accepts in the past when experimenting with other signal-enhancement techniques.
On the noisy test sets, the gains from the 3D-SVDF are much larger. We find that the 3D-SVDF model outperforms a baseline single-channel model of comparable size, even when that baseline also includes a basic signal-enhancement technique such as the broadside beamformer; in practice, the broadside beamformer does not seem to add much performance on this difficult noisy set. We therefore hypothesize that the larger gains in noise result from the 3D-SVDF's ability to learn better filtering from the multichannel data, and that, being optimized specifically for this task, it handles the difficult noisy set better than more general classical techniques such as beamforming.
In conclusion, this paper has proposed a new architecture for keyword spotting that utilizes multichannel microphone input signals to improve detection accuracy, while still being able to maintain a small model size. Compared with a single-channel baseline that runs in parallel on each channel, the proposed architecture reduces the false-reject rate by 27% and 31% relative on the two-microphone clean and noisy test sets, respectively, at a fixed false-accept rate.
As for future work, there are many ideas on how to increase noise robustness and accuracy further, for example techniques such as adaptive noise cancellation. It would be interesting to see if we could further integrate such techniques as part of a learnable, end-to-end neural network architecture.
And with this, we conclude our presentation on the small-footprint multichannel keyword spotter.