0:00:15 | this paper presents a small-footprint multichannel keyword spotter |
---|
0:00:19 | and I am presenting on behalf of my co-authors |
---|
0:00:26 | this is an overview of our presentation |
---|
0:00:29 | we will first talk about the motivation of this work |
---|
0:00:32 | then we will go over the 3D-SVDF layer |
---|
0:00:36 | introduced in this paper |
---|
0:00:38 | followed by how it is used |
---|
0:00:40 | in the model architecture |
---|
0:00:43 | of the proposed keyword spotter |
---|
0:00:45 | next we will discuss the experimental setup |
---|
0:00:48 | after that we will show some results |
---|
0:00:50 | and cover the final conclusions from this paper |
---|
0:00:54 | finally |
---|
0:00:55 | we will show some future work |
---|
0:00:57 | that we want to |
---|
0:00:58 | explore |
---|
0:01:03 | voice assistants |
---|
0:01:04 | are increasingly popular |
---|
0:01:07 | keywords such as "Ok Google" are commonly used to initiate conversations with the voice assistant system |
---|
0:01:13 | keyword spotting with low latency becomes the core technological challenge to achieve this task |
---|
0:01:20 | since keyword spotters typically run on embedded devices such as phones and speakers, limited battery, RAM, and compute are |
---|
0:01:26 | available |
---|
0:01:28 | we want to detect the keyword |
---|
0:01:30 | with high accuracy |
---|
0:01:32 | we also want to have a small model size and low compute cost |
---|
0:01:36 | due to those reasons |
---|
0:01:39 | dual-microphone setups are widely used to increase noise robustness |
---|
0:01:42 | so it would be interesting to see if we can integrate |
---|
0:01:45 | noise robustness |
---|
0:01:48 | as part of an end-to-end neural network architecture |
---|
0:01:55 | we first recall the SVDF |
---|
0:01:59 | the SVDF idea originated in doing singular value decomposition |
---|
0:02:04 | of a fully connected weight matrix |
---|
0:02:08 | for a rank-one SVDF |
---|
0:02:10 | it decomposes the weight matrix into vectors along two dimensions |
---|
0:02:14 | as shown in the figure |
---|
0:02:16 | these two dimensions are interpreted as filters in the feature and time domains |
---|
0:02:22 | we refer to the filter in the feature domain as the feature filter, and the one |
---|
0:02:26 | in the time domain as the time filter |
---|
0:02:31 | first |
---|
0:02:31 | the input features from each frame |
---|
0:02:34 | are convolved with the one-dimensional feature filter |
---|
0:02:41 | then the output of the current stage |
---|
0:02:44 | is pushed into a memory buffer |
---|
0:02:48 | given a memory size |
---|
0:02:50 | the buffer stores the outputs for the past stages |
---|
0:02:55 | the stored states in the memory buffer are then convolved with the time-domain filter |
---|
0:03:00 | beta |
---|
0:03:01 | to produce the final output |
---|
0:03:06 | this pair of feature and time filters constitutes a single SVDF node |
---|
0:03:10 | in practice |
---|
0:03:11 | there are of course several nodes, which together achieve |
---|
0:03:14 | an N-dimensional output |
---|
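The per-node computation just described can be sketched in numpy. This is an illustrative sketch only, not the paper's implementation: the names `alpha` and `beta` for the feature and time filters, and all sizes, are assumptions made for the example.

```python
import numpy as np

# Hypothetical rank-one SVDF node: a 1-D feature filter, a memory
# buffer, and a 1-D time filter. Names and sizes are illustrative.
class SVDFNode:
    def __init__(self, feature_dim, memory_size, rng):
        self.alpha = rng.standard_normal(feature_dim)  # feature-domain filter
        self.beta = rng.standard_normal(memory_size)   # time-domain filter
        self.buffer = np.zeros(memory_size)            # past stage-1 outputs

    def step(self, frame):
        # Stage 1: filter the current frame in the feature domain.
        s = float(self.alpha @ frame)
        # Push into the memory buffer; the oldest output drops out.
        self.buffer = np.append(self.buffer[1:], s)
        # Stage 2: filter the buffered history in the time domain.
        return float(self.beta @ self.buffer)

rng = np.random.default_rng(0)
node = SVDFNode(feature_dim=40, memory_size=8, rng=rng)
outputs = [node.step(rng.standard_normal(40)) for _ in range(10)]
```

Running several such nodes in parallel and stacking their scalar outputs yields the layer's multi-dimensional output.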
0:03:18 | more concretely |
---|
0:03:19 | if the SVDF is used as the input layer of a speech model |
---|
0:03:23 | the feature filter |
---|
0:03:24 | would correspond to a transformation of the filterbank energies of each frame |
---|
0:03:29 | the memory buffer |
---|
0:03:31 | would contain the past filtered |
---|
0:03:36 | speech frames |
---|
0:03:38 | and the time filter output would correspond to a summary of those past frames |
---|
0:03:44 | the SVDF has been shown to perform well for single-channel models |
---|
0:03:48 | in the literature |
---|
0:03:52 | the 3D-SVDF |
---|
0:03:54 | extends the SVDF's existing two dimensions |
---|
0:03:57 | of feature and time to a new dimension |
---|
0:04:00 | channel |
---|
0:04:02 | 3D thus refers to the three dimensions: feature, time |
---|
0:04:06 | and channel |
---|
0:04:09 | in our example |
---|
0:04:10 | the filterbank energies from each channel are fed into their own SVDF layer |
---|
0:04:15 | each channel learns its own weights |
---|
0:04:18 | for its own time- and frequency-domain filters |
---|
0:04:22 | the outputs of all channels are concatenated |
---|
0:04:25 | after the input layer, so that the subsequent layers can further filter them |
---|
0:04:29 | exploiting the time-delay redundancy between the two channels |
---|
0:04:35 | the 3D-SVDF can also be considered as applying the SVDF on each channel and simply |
---|
0:04:40 | fusing the results |
---|
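A minimal sketch of this per-channel-then-concatenate idea follows. All layer sizes here are made up for illustration, and the per-node math is the same two-stage SVDF filtering as before, written in vectorized form; nothing below is the paper's actual configuration.

```python
import numpy as np

# Hypothetical 3D-SVDF front end: an independent bank of SVDF nodes per
# microphone channel; the per-channel outputs are concatenated.
def make_channel_layer(feature_dim, memory_size, num_nodes, rng):
    alphas = rng.standard_normal((num_nodes, feature_dim))  # feature filters
    betas = rng.standard_normal((num_nodes, memory_size))   # time filters
    state = {"buf": np.zeros((num_nodes, memory_size))}

    def step(frame):
        s = alphas @ frame                                  # stage 1, all nodes
        state["buf"] = np.concatenate([state["buf"][:, 1:], s[:, None]], axis=1)
        return np.sum(betas * state["buf"], axis=1)         # stage 2 per node

    return step

rng = np.random.default_rng(0)
channel_layers = [make_channel_layer(40, 8, 16, rng) for _ in range(2)]

def front_end(frames):
    # frames: one feature vector per channel; each channel keeps its own
    # weights, and the results are concatenated for the next layer.
    return np.concatenate([layer(f) for layer, f in zip(channel_layers, frames)])

out = front_end([rng.standard_normal(40), rng.standard_normal(40)])
# two channels of 16 nodes each -> a 32-dimensional concatenated vector
```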
0:04:44 | this approach enables the network to not only take advantage of the redundancy in the frequency |
---|
0:04:49 | features from each channel |
---|
0:04:51 | but also exploit the temporal variation across channels, and hence improve noise robustness |
---|
0:04:57 | this end-to-end approach also allows the following learnable signal filtering modules |
---|
0:05:03 | to leverage the multi-channel input |
---|
0:05:06 | in general |
---|
0:05:11 | this figure |
---|
0:05:13 | shows an architecture built with the 3D-SVDF |
---|
0:05:18 | at the input layer |
---|
0:05:19 | to enable noise-robust learning |
---|
0:05:22 | the 3D-SVDF |
---|
0:05:25 | takes the original features of each channel as input |
---|
0:05:29 | and emits the concatenated features from each channel |
---|
0:05:33 | as output |
---|
0:05:34 | a dense layer immediately follows the 3D-SVDF |
---|
0:05:38 | and sums the features from the channels together |
---|
0:05:41 | acting as a filter |
---|
0:05:44 | following the first 3D-SVDF layer |
---|
0:05:47 | there are two modules: the encoder |
---|
0:05:50 | and the decoder |
---|
0:05:53 | the encoder takes the filtered output of the 3D-SVDF layer as input |
---|
0:05:57 | and emits softmax probabilities for the phonemes making up the keyword |
---|
0:06:02 | internally |
---|
0:06:04 | the encoder consists of a stack of SVDF layers |
---|
0:06:08 | with some fully connected bottleneck layers in between the SVDF layers |
---|
0:06:13 | to further reduce the total number of |
---|
0:06:15 | parameters |
---|
0:06:17 | the decoder then takes |
---|
0:06:19 | the encoder's results as input |
---|
0:06:22 | and emits a decision |
---|
0:06:24 | as to whether the utterance contains the keyword or not |
---|
0:06:28 | the decoder consists of a stack of SVDF layers directly, with no bottlenecks |
---|
0:06:36 | and a softmax as the final activation |
---|
0:06:41 | the training uses frame-level cross-entropy |
---|
0:06:44 | losses to train the encoder and decoder |
---|
0:06:48 | the final model is a one-stage unified model where the encoder and decoder are jointly trained |
---|
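A frame-level cross-entropy loss is simply the mean negative log-softmax-probability of each frame's target label. This numpy version is a generic sketch of that loss, not the paper's training code:

```python
import numpy as np

# Generic frame-level cross-entropy: average negative log-likelihood of
# the target label at every frame, computed from raw logits.
def frame_cross_entropy(logits, labels):
    # logits: (num_frames, num_classes); labels: (num_frames,) int targets
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())
```

A handy sanity check: with uniform logits over C classes the loss is log(C).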
0:06:59 | for the experimental setup, I will talk about the training and testing data sets |
---|
0:07:03 | for training data |
---|
0:07:05 | we use two point one million |
---|
0:07:07 | single-channel anonymized audio clips containing "Ok Google" |
---|
0:07:10 | and "Hey Google" |
---|
0:07:13 | to generate an additional channel |
---|
0:07:14 | from this mono data |
---|
0:07:17 | we use multistyle room simulation with a dual-microphone setup |
---|
0:07:22 | the simulation uses rooms of different dimensions |
---|
0:07:25 | and different microphone spacings |
---|
0:07:28 | for testing data |
---|
0:07:30 | we generated the keyword-containing utterances in the following way |
---|
0:07:35 | first, prompts containing the keywords were randomly generated |
---|
0:07:39 | then these prompts were spoken by volunteers |
---|
0:07:43 | and we recorded them with the dual-microphone setup to convert them into multi-channel audio |
---|
0:07:49 | we then augmented the set with multi-channel noise recorded with as similar a dual-microphone setup |
---|
0:07:55 | as possible |
---|
0:07:59 | in the table you can see the sizes of the different testing data sets used in this |
---|
0:08:03 | experiment |
---|
0:08:07 | further, to evaluate the single-channel baseline model on two-channel audio |
---|
0:08:12 | we evaluated two different strategies |
---|
0:08:15 | first |
---|
0:08:17 | we run the keyword detector |
---|
0:08:20 | on either channel one or channel zero |
---|
0:08:22 | ignoring the other channel entirely |
---|
0:08:26 | second, we run the single-channel keyword spotter |
---|
0:08:31 | on each channel independently |
---|
0:08:33 | given a fixed threshold |
---|
0:08:36 | we then use a boolean logical OR |
---|
0:08:39 | to produce the final result |
---|
0:08:40 | using the binary outcome of each channel |
---|
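That second strategy amounts to a one-line boolean combination. In this sketch, `score0` and `score1` stand in for the single-channel spotter's confidence score on each channel; the names are hypothetical:

```python
# Logical-OR combination of two independently thresholded channel scores:
# the detector fires if either channel's score clears the threshold.
def or_fusion(score0: float, score1: float, threshold: float) -> bool:
    return score0 >= threshold or score1 >= threshold

# e.g. if only channel 0 clears the threshold, the overall detector still fires
```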
0:08:45 | these strategies are used to evaluate the single-channel baseline |
---|
0:08:50 | since the 3D-SVDF accepts multi-channel input directly |
---|
0:08:54 | we use the output from the 3D-SVDF model directly as well |
---|
0:08:58 | with no need for such extraction strategies |
---|
0:09:02 | we also evaluate the single-channel model |
---|
0:09:05 | with a simple broadside beamformer |
---|
0:09:08 | to capture any improvement from a traditional signal enhancement approach |
---|
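For reference, a broadside beamformer in its simplest form is a delay-and-sum beamformer with zero delay, i.e. an average across time-aligned microphones. This is a generic sketch of that idea, not necessarily the exact enhancement front end used in the experiments:

```python
import numpy as np

# Zero-delay delay-and-sum: average the microphone signals, which
# reinforces sound arriving broadside (perpendicular to the mic axis)
# and attenuates uncorrelated noise across microphones.
def broadside_beamform(channels: np.ndarray) -> np.ndarray:
    # channels: (num_mics, num_samples), assumed already time-aligned
    return channels.mean(axis=0)
```

A signal identical on both microphones passes through unchanged, while uncorrelated noise is attenuated by the averaging.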
0:09:13 | we now present the results of the experiments performed |
---|
0:09:19 | keeping the false accept rate fixed at zero point one FA per hour |
---|
0:09:24 | we now present |
---|
0:09:25 | results in terms of the false reject rate |
---|
0:09:28 | as we can see |
---|
0:09:30 | given the same model size |
---|
0:09:33 | of thirty K |
---|
0:09:35 | the proposed two-channel model outperforms the single-channel baselines |
---|
0:09:39 | on both |
---|
0:09:40 | the clean testing set |
---|
0:09:42 | and the noisy TV set |
---|
0:09:48 | the relative improvement over the two-channel logical-OR combination baseline |
---|
0:09:53 | is twenty seven percent |
---|
0:09:55 | and thirty one percent |
---|
0:09:57 | on the clean and noisy sets respectively |
---|
0:10:02 | to further understand this result |
---|
0:10:05 | we |
---|
0:10:06 | create ROC curves |
---|
0:10:09 | to confirm the model quality improvement at various thresholds |
---|
0:10:14 | as we can see, our proposed model has the best performance |
---|
0:10:17 | compared with the single-channel model for all strategies discussed |
---|
0:10:21 | and compared with the beamformer baseline as well |
---|
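An ROC-style comparison of this kind comes from sweeping the decision threshold and measuring false-accept and false-reject rates at each point. The scores and labels in this sketch are purely illustrative toy data, not the paper's evaluation code:

```python
# Sweep the detection threshold over (score, is_keyword) pairs and
# report the (false-accept rate, false-reject rate) at each threshold.
def roc_points(scores, labels, thresholds):
    negatives = max(1, sum(1 for l in labels if not l))
    positives = max(1, sum(1 for l in labels if l))
    points = []
    for t in thresholds:
        fires = [s >= t for s in scores]
        fa = sum(f and not l for f, l in zip(fires, labels)) / negatives
        fr = sum(not f and l for f, l in zip(fires, labels)) / positives
        points.append((fa, fr))
    return points
```

Plotting these points for each model traces the curves compared in the talk.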
0:10:26 | on the clean set |
---|
0:10:27 | the gains from the 3D-SVDF |
---|
0:10:30 | are small |
---|
0:10:32 | but still nonnegligible |
---|
0:10:34 | and the filtering of the 3D-SVDF |
---|
0:10:37 | does not seem to interfere with performance |
---|
0:10:39 | in clean conditions |
---|
0:10:44 | our negative set is also multi-channel |
---|
0:10:47 | we hypothesize that some of the gains on the clean set |
---|
0:10:50 | stem from the 3D-SVDF serving as a learnable filter |
---|
0:10:54 | reducing confusable speech in the negative set |
---|
0:10:58 | and thus suppressing some false accepts |
---|
0:11:03 | we have seen some such false accepts in the past when experimenting with other signal |
---|
0:11:07 | enhancement techniques |
---|
0:11:10 | on the noisy test sets |
---|
0:11:12 | the gains for the 3D-SVDF are much larger |
---|
0:11:16 | we find the 3D-SVDF |
---|
0:11:19 | outperforming the baseline single-channel model of comparable size |
---|
0:11:24 | even when the baseline also includes a basic signal enhancement technique such as the broadside |
---|
0:11:28 | beamformer |
---|
0:11:31 | which in practice does not seem |
---|
0:11:36 | to add much performance |
---|
0:11:41 | on this difficult noisy set |
---|
0:11:45 | we therefore hypothesize that the larger gains in noise |
---|
0:11:49 | are a result of the 3D-SVDF's ability |
---|
0:11:52 | to learn better filtering of the multi-channel data |
---|
0:11:55 | and therefore to adapt to the specific conditions |
---|
0:11:59 | of the difficult noisy set |
---|
0:12:01 | better |
---|
0:12:02 | than the more general classical techniques |
---|
0:12:05 | such as beamforming |
---|
0:12:10 | in conclusion, this paper has proposed a new architecture for keyword spotting that utilizes multichannel microphone input |
---|
0:12:17 | signals to improve detection accuracy |
---|
0:12:20 | while still being able to maintain a small model size |
---|
0:12:26 | compared with a single-channel baseline which runs in parallel on each channel |
---|
0:12:30 | the proposed architecture |
---|
0:12:32 | reduces the false reject rate |
---|
0:12:35 | by twenty seven percent and thirty one percent relative |
---|
0:12:38 | on two-microphone clean and noisy test sets |
---|
0:12:41 | respectively |
---|
0:12:42 | at a fixed |
---|
0:12:43 | false accept rate |
---|
0:12:48 | as for future work |
---|
0:12:50 | there are many ideas in the literature on how to increase noise robustness with classical signal processing |
---|
0:12:56 | for example, adaptive beamforming |
---|
0:13:00 | and adaptive noise cancellation |
---|
0:13:04 | it would be interesting to see if we could further integrate |
---|
0:13:08 | these techniques |
---|
0:13:09 | as part of a learnable, end-to-end neural network architecture |
---|
0:13:16 | with this, we conclude our presentation |
---|
0:13:19 | of the small-footprint multichannel keyword spotter |
---|